Machine learning: Disease Prediction Part 1.

Fafemi Adeola
5 min read · Dec 18, 2020

This article is an end-to-end look at my approach to problem solving with machine learning. It follows my process of identifying a problem, designing a solution, finding relevant data, analyzing the data and building a machine learning model, building an API using FastAPI, linking the API to the frontend using Axios, and deploying it.

Problem Identification and Solution Thinking.

Self-medication is one part of self-care and is known to contribute to primary healthcare. If practiced appropriately, it has major benefits for the consumer, such as self-reliance and decreased expense. However, inappropriate practice can lead to potential dangers like incorrect self-diagnosis, dangerous drug-drug interactions, incorrect manner of administration, incorrect dosage, incorrect choice of therapy, and masking of a severe disease, among others. Machine learning can help reduce some of these dangers: with the progress of big data in biomedical science and the careful study of medical data, it can support early disease recognition, better patient care, and fewer incorrect self-diagnoses.

The solution implemented here focuses on the incorrect self-diagnosis aspect. I tackle it by building a machine learning model that predicts an individual's disease from the symptoms the individual inputs. This helps prevent incorrect self-diagnosis and allows for early detection of disease. I will now walk through my process of solving the problem identified above.

Data

The data used for this problem came from Kaggle. It contains 134 columns and 4920 rows: 133 of the columns are independent variables, each a disease symptom, and the last is the dependent variable (the disease the symptoms point to).

Machine learning model

Before building the model, the data was analyzed for better understanding and preprocessed to improve model accuracy.

a. The first step is to import all the libraries needed for data analysis, preprocessing, and model training, and to load the data.
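The import block for this workflow might look like the sketch below. The CSV file name is an assumption, so substitute the path of your own Kaggle download:

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# The file name "Training.csv" is an assumption; use the path of your
# Kaggle download here.
# df = pd.read_csv("Training.csv")
```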

b. The shape of the data was checked, which confirmed that it contained 134 columns and 4920 rows.

c. The first 5 rows of the data were also printed for a glimpse of the kind of data we are dealing with. This showed that the variables are categorical, with the independent variables binary (1 and 0) and the dependent variable spanning 41 different categories (diseases).

d. The distribution of the dependent variable was checked and showed that the classes are evenly distributed.

e. The data was checked for null values, which revealed that a column named "Unnamed" was completely empty, containing 4920 null rows. This column was removed entirely. The other columns had no null values.
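Steps b through e can be sketched with pandas as below. The symptom and label column names here are made-up stand-ins for illustration; the real frame has 133 symptom columns, and in the Kaggle file the empty column appears as "Unnamed: 133":

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the Kaggle symptom data (column names assumed)
df = pd.DataFrame({
    "itching":      [1, 0, 1, 0],
    "skin_rash":    [0, 1, 1, 0],
    "prognosis":    ["Fungal infection", "Allergy"] * 2,
    "Unnamed: 133": [np.nan] * 4,   # mirrors the fully empty column
})

print(df.shape)                        # step b: (rows, columns)
print(df.head())                       # step c: glimpse of the data
print(df["prognosis"].value_counts())  # step d: class distribution
print(df.isnull().sum())               # step e: per-column null counts

# Remove any column that is entirely null
df = df.dropna(axis=1, how="all")
```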

f. Since the independent variables are all categorical and already in binary format, there was no need for standardization, encoding of categorical variables, or checking for and normalizing skewed data. The data was simply separated into independent and dependent variables by dropping the dependent variable from the feature set and storing it in a new data frame.

g. Then model training began. The data was split into train and test sets using train_test_split to allow testing of model accuracy, in a ratio of 70:30 (70% train data, 30% test data).
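Steps f and g together look roughly like this. The label column name "prognosis" and the symptom names are assumptions standing in for the real 133-column frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in; the real frame has 133 binary symptom columns
df = pd.DataFrame({
    "itching":   [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "skin_rash": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "prognosis": ["Fungal infection", "Allergy"] * 5,
})

# Step f: separate independent (symptoms) and dependent (disease) variables
x = df.drop("prognosis", axis=1)
y = df["prognosis"]

# Step g: 70:30 train/test split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42
)
```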

h. The machine learning algorithm used in this case is multinomial Naïve Bayes, since it is suitable for classification with discrete features. It had already been imported above. The training features (x_train) and labels (y_train) are then fit to the multinomial Naïve Bayes model.

i. The accuracy of the model on the test data is obtained by comparing the predictions on x_test against the actual values (y_test).
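Steps h and i reduce to a fit and an accuracy check. The sketch below uses a tiny synthetic binary symptom matrix in place of the real 133-column data:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Synthetic binary symptom data (disease names are made up for illustration)
x_train = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]])
y_train = np.array(["Flu", "Allergy", "Flu", "Allergy"])
x_test  = np.array([[1, 0, 1], [0, 1, 0]])
y_test  = np.array(["Flu", "Allergy"])

# Step h: fit the multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(x_train, y_train)

# Step i: score predictions on the held-out test set
accuracy = accuracy_score(y_test, clf.predict(x_test))
```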

j. Then, I went a step further to get the top disease predictions with the highest probabilities.
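One way to rank the top predictions (the article does not show the exact code, so this is a sketch) is to take the class probabilities from predict_proba and sort them. Again the diseases and symptoms here are synthetic stand-ins for the real 41 classes and 133 symptoms:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Synthetic data with three diseases (names made up for illustration)
x_train = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]])
y_train = np.array(["Flu", "Allergy", "Migraine", "Flu"])
clf = MultinomialNB().fit(x_train, y_train)

# Probability of each disease for one symptom vector
probs = clf.predict_proba([[1, 0, 0]])[0]

# Indices of the three highest-probability classes, highest first
top3_idx = np.argsort(probs)[::-1][:3]
top3 = [(clf.classes_[i], probs[i]) for i in top3_idx]
```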

k. The trained model can be stored in a file to avoid rerunning the whole training process; it can also be stored for other uses, such as creating an API. In this case it was saved to enable creating an API, and Pickle was used for this purpose. The model was saved to a classifier.pkl file.
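Saving and reloading with Pickle looks like this (the toy model here stands in for the real classifier):

```python
import pickle

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Train a toy model standing in for the real classifier
clf = MultinomialNB().fit(np.array([[1, 0], [0, 1]]),
                          np.array(["Flu", "Allergy"]))

# Save the trained model so the API can load it without retraining
with open("classifier.pkl", "wb") as f:
    pickle.dump(clf, f)

# Later (e.g. in the FastAPI app), load it back and predict
with open("classifier.pkl", "rb") as f:
    loaded = pickle.load(f)
```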

With the steps above, a model is trained for disease prediction, returning the three diagnoses with the highest probabilities. In my next article I will move into the process of creating an API with the trained and stored model. The link to my notebook can be found here; the link to my API can be found here.
