Machine learning: Disease Prediction Part 1.

This article is an end-to-end look at how I approach problem solving with machine learning. It follows my process: finding a problem, designing a solution, looking for relevant data, analyzing the data and building a machine learning model, building an API using FastAPI, linking the API to the frontend using axios, and deployment.

Problem Identification and Solution Thinking.

Self-medication is one part of self-care and is known to contribute to primary healthcare. Practiced appropriately, it has major benefits for the consumer, such as self-reliance and decreased expense. However, inappropriate practice can lead to potential dangers such as incorrect self-diagnosis, dangerous drug-drug interactions, incorrect manner of administration, incorrect dosage, incorrect choice of therapy, and masking of a severe disease, among others. Machine learning could help reduce some of these dangers. With the progress of big data in biomedical science and the accurate study of medical data, it can support early disease recognition, better patient care, and fewer incorrect self-diagnoses.

The solution implemented here focuses on incorrect self-diagnosis. I try to tackle it by building a machine learning model that predicts an individual's disease from the symptoms the individual enters. This helps prevent incorrect self-diagnosis and allows for early detection of diseases. I will now move on to my process for solving the problem identified above.


The data used for this problem came from Kaggle. It contained 134 columns and 4920 rows. 133 of the columns are independent variables, each one a disease symptom; the last is the dependent variable (the disease the symptoms point to).
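A table with the layout described above can be sanity-checked with a few pandas calls. The sketch below is only an illustration: it builds a small made-up table instead of reading the Kaggle file, and the symptom names and the target column name "prognosis" are assumptions, not taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle table: 4 symptom columns instead of 133.
# Column names, including "prognosis", are illustrative assumptions.
rng = np.random.default_rng(0)
symptom_cols = ["itching", "skin_rash", "chills", "fatigue"]
df = pd.DataFrame(rng.integers(0, 2, size=(12, 4)), columns=symptom_cols)
df["prognosis"] = ["Disease A", "Disease B", "Disease C"] * 4

# With the real file, df would come from pd.read_csv(...) instead.
n_rows, n_cols = df.shape

# Every symptom column should hold only 0 (absent) or 1 (present).
symptoms_are_binary = bool(df[symptom_cols].isin([0, 1]).all().all())

# Number of distinct diseases in the target column.
n_diseases = df["prognosis"].nunique()

print(n_rows, n_cols, symptoms_are_binary, n_diseases)
```

On the real data, the same checks would be expected to report 4920 rows and 41 diseases, as described in this article.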

Machine learning model

Before building the model, the data was analyzed for better understanding and preprocessed for better model accuracy.

a. The first step is to import all the libraries needed for data analysis, preprocessing, and model training, and then to load the data.

b. The shape of the data was checked, which showed that it contained 134 columns and 4920 rows.

c. The first 5 rows of the data were printed to get a glimpse of the type of data we are dealing with. This showed that the variables were categorical: the independent variables are binary (1 and 0), and the dependent variable has 41 different categories (diseases).

d. The distribution of the dependent variable was checked, and it showed that the diseases were evenly distributed.

e. The data was checked for null values, which showed that one column, “Unnamed”, was completely empty, with 4920 null rows. This column was removed entirely. The other columns had no null values.

f. Since the independent variables are all categorical and already in binary format, there was no need for standardization, encoding of categorical variables, checking for skewness, or normalization of skewed data. The data was therefore separated into independent and dependent variables by dropping the dependent variable and storing it in a new data frame.

g. Then the model training began. The data was further split into train and test data using train_test_split so that the model's accuracy could be tested. It was split in the ratio 70:30 (70% train data and 30% test data).

h. The machine learning algorithm used in this case is multinomial Naïve Bayes, since it is suitable for classification with discrete features. It had already been imported above. The training features (x_train) and labels (y_train) were then fit to the multinomial Naïve Bayes model.

i. The accuracy of the model on the test data was obtained by comparing the predictions on x_test against the actual values (y_test).

j. Then, I went a step further to get the top disease predictions with the highest ratings.

k. The trained model can be stored in a file so that the whole training process does not have to be rerun; it can also be stored for other uses, such as creating an API. In this case it was saved so that an API could be built on it. Pickle was used for this purpose, and the model was saved in a classifier.pkl file.
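The inspection, cleaning, splitting, training, and evaluation steps in the list above can be sketched roughly as follows. This is a minimal illustration, not the original notebook: it uses a small made-up table in place of the Kaggle file, and the column names (including "prognosis"), random_state=42, and the stratify argument are assumptions added for the example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the Kaggle data: each hypothetical disease has one
# signature symptom. The article's table has 133 symptom columns and 41 diseases.
symptom_cols = ["itching", "chills", "cough", "nausea"]
diseases = ["Disease A", "Disease B", "Disease C", "Disease D"]
rows = []
for i, disease in enumerate(diseases):
    pattern = [1 if j == i else 0 for j in range(len(symptom_cols))]
    rows += [pattern + [disease]] * 30  # 30 identical rows per disease
df = pd.DataFrame(rows, columns=symptom_cols + ["prognosis"])
df["Unnamed"] = np.nan  # mimic the completely empty column

# Inspect: shape, first rows, class balance, null counts.
print(df.shape)
print(df.head())
print(df["prognosis"].value_counts())
print(df.isnull().sum())

# Drop columns that are entirely null.
df = df.dropna(axis=1, how="all")

# Separate the features (X) from the label (y).
X = df.drop(columns="prognosis")
y = df["prognosis"]

# 70:30 train/test split; stratify keeps every disease in both parts
# (an assumption, the article does not say whether stratification was used).
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit multinomial Naive Bayes on the training data.
model = MultinomialNB()
model.fit(x_train, y_train)

# Compare predictions on x_test with the true labels.
accuracy = accuracy_score(y_test, model.predict(x_test))
print(f"Test accuracy: {accuracy:.2f}")
```

The toy data is deliberately trivial to separate, so the accuracy it prints says nothing about how the model performs on the real dataset.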

With the steps above, a model is trained for disease prediction that gives the three diagnoses with the highest ratings. In my next article I will move on to creating an API with the trained and stored model. The link to my notebook can be found here, and the link to my API can be found here.
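As a bridge to the API work, here is a hedged sketch of how a saved classifier.pkl might be loaded again and asked for its three most likely diseases. It assumes the "ratings" are the class probabilities returned by predict_proba, which the article does not state explicitly. The helper top_n_diseases, the toy training data, and the temporary directory are hypothetical and exist only for this example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy training data: 4 hypothetical symptoms, each the signature of one
# hypothetical disease, repeated 5 times.
diseases = ["Disease A", "Disease B", "Disease C", "Disease D"]
X = np.tile(np.eye(4, dtype=int), (5, 1))
y = np.tile(diseases, 5)
model = MultinomialNB().fit(X, y)


def top_n_diseases(clf, symptom_vector, n=3):
    """Return the n most probable (disease, probability) pairs."""
    probs = clf.predict_proba([symptom_vector])[0]
    best = np.argsort(probs)[::-1][:n]  # indices sorted by descending probability
    return [(str(clf.classes_[i]), float(probs[i])) for i in best]


# Save the trained model with pickle, then load it back.
model_path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

# Query the reloaded model with only the first symptom present.
top3 = top_n_diseases(loaded_model, [1, 0, 0, 0])
for name, prob in top3:
    print(f"{name}: {prob:.2f}")
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.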




Fafemi Adeola
