Machine Learning: Some Basic Concepts of Decision Trees.

Fafemi Adeola
Dec 16, 2020 · 7 min read



This article gives a basic understanding of machine learning (focusing on decision trees) and walks through a small prediction example with a decision tree. The first part of the article is a little theoretical. I know theory is boring and we just want to jump to the practical part, but it is required to fully understand what is happening under the hood of the prediction process, and it allows for better predictions. And it begins……

Machine Learning refers to a broad range of algorithms that perform intelligent predictions based on a data set, which is usually large (consisting of millions of unique data points) (James et al., 2019). In simpler terms, it is the use and development of computer systems that are able to learn and adapt without following explicit instructions. Machine learning models/algorithms are files that have been trained to recognize certain types of patterns.

Fundamental Segmentation of machine learning models.

a. Supervised Learning: It is the machine learning task of learning a function that maps an input to an output. The training data is usually labeled in this case. For example, if I had a dataset that contained two columns, house location and house price, the house location column could be used to predict the house price. There are two subcategories of supervised learning, based on whether the predicted variable is continuous or discrete:

  1. Regression: In this case the predicted variable is continuous. Examples of models used for regression are Linear Regression, Decision Trees (regressor) and Random Forest.
  2. Classification: In this case the predicted variable is discrete. Examples of models used for classification are Logistic Regression, Support Vector Machine, Naive Bayes and Decision Trees (classifier). A short sketch of both cases follows this list.
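To make the distinction concrete, here is a minimal sketch (my own made-up toy data, not from the article) of the two kinds of decision tree models scikit-learn provides for these tasks:

```python
# Regression vs classification with decision trees - a toy sketch, not real data.
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

# Regression: predict a continuous value (e.g. a house price) from a numeric feature.
X_reg = [[50], [80], [120], [200]]          # e.g. house size in square metres (made up)
y_reg = [150_000, 220_000, 310_000, 500_000]
reg = DecisionTreeRegressor().fit(X_reg, y_reg)
print(reg.predict([[100]]))                 # a continuous prediction

# Classification: predict a discrete label (e.g. plays golf or not).
X_clf = [[0, 1], [1, 0], [1, 1], [0, 0]]    # two already-encoded features (made up)
y_clf = ["yes", "no", "no", "yes"]
clf = DecisionTreeClassifier().fit(X_clf, y_clf)
print(clf.predict([[0, 1]]))                # a class label
```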

b. Unsupervised Learning: It is used to draw inferences and find patterns from the input data without reference to labeled outcomes. In this case, the training data is not labeled. Examples of unsupervised learning are clustering and dimensionality reduction.

Decision Trees

Image source: https://ascelibrary.org/doi/abs/10.1061/%28ASCE%29CF.1943-5509.0000349

Decision tree methodology is a commonly used data mining method for establishing classification systems based on multiple covariates or for developing prediction algorithms for a target variable. Prediction is one of the most important uses of decision tree models. Using the tree model derived from historical data, it is easy to predict the result for future records (Song et al., 2015).

Every time we split data into subsets based on some criterion, that is how we get a decision tree, i.e. a decision tree asks a question and classifies based on the answer. The aim of a decision tree is to eventually arrive at pure nodes.
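As a toy illustration (a hand-written example of mine, not a model trained on any data), a decision tree boils down to a chain of questions that ends in a class label:

```python
# A decision tree is just nested questions: each branch asks about a feature
# and the leaves return a class label. Toy example following the golf scenario below.
def play_golf(outlook: str, windy: bool, humidity: str) -> str:
    if outlook == "overcast":                       # pure node: always play
        return "yes"
    elif outlook == "sunny":
        return "no" if windy else "yes"             # sunny branch asks about wind
    else:                                           # rainy branch asks about humidity
        return "no" if humidity == "high" else "yes"

print(play_golf("sunny", windy=False, humidity="high"))  # -> "yes"
```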

Terminology related to decision trees

  1. Root node: the entire population of the data. The root node gets split into two or more sets.
  2. Splitting: the process of dividing a node into two or more sub-nodes.
  3. Decision node: a sub-node that is further split into sub-nodes.
  4. Leaf/terminal node: a node that does not split further.
  5. Branch/sub-tree: a subsection of the entire tree.
  6. Parent node: a node that is divided into sub-nodes.
  7. Child node: a sub-node of a parent node.
  8. Depth of a tree: the length of the longest path from the root to a leaf.

From the image above, the table contains four features (outlook, temp, humidity and windy) which are used to predict whether or not golf is played. The decision tree is created from the table, starting with the outlook feature as the root node. This is divided into three sub-nodes according to the different outlooks. The overcast category is pure, so it is not split any further, but the sunny and rainy categories still have impurity, so their splitting continues: the sunny branch is split on the windy feature, while the rainy branch is split on the humidity feature. Splitting continues until we have pure nodes.

The decision tree splits the nodes on all available variables and selects the split which results in the most homogeneous sub-nodes (more purity leads to more homogeneity). To determine the most homogeneous sub-nodes we need a way to measure and compare impurity. There are different ways to measure impurity:

  1. Gini impurity
  2. Chi-square
  3. Information gain
  4. Reduction in variance

The first three work only with categorical targets, while reduction in variance is used with continuous targets. For this article, I will focus only on Gini impurity and information gain.

The image above shows the formulas for the Gini index (Gini impurity) and entropy.

  1. Gini impurity: it measures the impurity of a node. It is one minus the probability that two randomly picked points from the node belong to the same class. The lower the Gini impurity, the higher the homogeneity of the node.

Gini impurity = 1 − Gini

Gini is the sum of the squared probabilities of each class/category in the node.

The image above shows a simple calculation of Gini impurity. The class probabilities are calculated for each category (sub-node), the sub-node impurities are combined into a weighted average for the split, and this is repeated for each feature in the data set. The feature with the lowest Gini impurity is chosen as the basis for splitting.
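Below is a minimal sketch of that calculation in plain Python (my own helper functions; each node is assumed to simply be a list of class labels):

```python
# Gini impurity of a node and the weighted impurity of a split - toy sketch.
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class probabilities in a node."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(*child_nodes):
    """Weighted average Gini impurity of the sub-nodes produced by a split."""
    total = sum(len(node) for node in child_nodes)
    return sum(len(node) / total * gini_impurity(node) for node in child_nodes)

# Example: splitting on a hypothetical feature gives these two sub-nodes.
left = ["yes", "yes", "yes", "no"]
right = ["no", "no", "yes"]
print(gini_impurity(left))         # 0.375
print(weighted_gini(left, right))  # compare this value across candidate features;
                                   # the lowest weighted impurity wins the split
```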

2. Information gain: It is the entropy of the parent node minus the weighted average entropy of the child nodes. If the weighted entropy of the child nodes is higher than the entropy of the parent, that feature would not be considered for splitting.

The diagram above shows the calculation of entropy for all the different categories of the parent node (the feature).

Steps to calculate information gain:

  1. Calculate the entropy of the parent node.
  2. Calculate the entropy of each child node.
  3. Calculate the weighted average entropy of the split (over all child node entropies).
  4. Subtract the weighted average entropy of the child nodes from the entropy of the parent node; the higher the result, the better the split (a small sketch of these steps follows below).
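Here is a small sketch of these steps in plain Python (again my own helpers, with nodes represented as lists of labels and made-up example counts):

```python
# Entropy of a node and information gain of a split - toy sketch.
from collections import Counter
from math import log2

def entropy(labels):
    """-sum(p * log2(p)) over the classes present in a node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the child nodes."""
    n = len(parent)
    weighted_child_entropy = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted_child_entropy

parent = ["yes"] * 9 + ["no"] * 5                                # 9 yes, 5 no
children = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]  # one candidate split
print(information_gain(parent, children))  # the split with the highest gain wins
```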

Advantages of Decision trees

  1. It requires less effort in data preparation compared to other algorithms.
  2. It does not require normalization or scaling of the data.
  3. Missing data does not affect the process of building a decision tree to any considerable extent.

Disadvantages of Decision trees

  1. A small change in data can cause a large change in the structure of a decision tree.
  2. The decision tree algorithm is inadequate for regression, i.e. for predicting continuous values.

Finally, we have come to the end of the theoretical aspect. I know we are all excited to do actual predictions with our knowledge, so I will move straight into prediction using a decision tree model with the scikit-learn module.

Prediction using a decision tree model

The above explanation might look like a lot of work when trying to use decision trees to make predictions on a real-world data set, since we may have hundreds to thousands of features. This process has been made much easier by the scikit-learn module: with only a few lines of Python code we have our model, and prediction can be done.

I did a simple prediction with a salaries data set. This dataset contains four columns: company, job, degree and salary more than 100k (the target variable).

Steps taken:

  1. Basic data preprocessing is done on the dataset. This is a very important step in machine learning, as the quality of the data and the useful information that can be derived from it directly affect the ability of our models to learn, and therefore the model accuracy; hence the emphasis on this step for any data set or model.
  2. The preprocessed data was split using train_test_split from the sklearn.model_selection module to allow for testing our model's accuracy; the function performs a random selection of the test and train data. Our data was split into 70% training data and 30% test data.

  3. Then our model was imported from the scikit-learn module and an object was created from the model. This is a decision tree classifier, since the target variable is categorical and not continuous.
  4. Then the training features (train_features) and labels (train_target) are fit to the decision tree model.
  5. The accuracy of the model on the test data is obtained by comparing the prediction results against the actual test labels. A sketch of these five steps is shown below.
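Putting the five steps together, a minimal sketch of the whole pipeline could look like the following. Note that the file name and the exact column names are assumptions based on the description above, not taken from the original notebook:

```python
# Decision tree prediction on the salaries dataset - a sketch of the five steps.
# The file name "salaries.csv" and the column names below are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("salaries.csv")  # hypothetical file name

# Step 1: basic preprocessing - encode the text columns as numbers.
features = df[["company", "job", "degree"]].apply(LabelEncoder().fit_transform)
target = df["salary_more_then_100k"]  # assumed target column name

# Step 2: 70/30 random train/test split.
train_features, test_features, train_target, test_target = train_test_split(
    features, target, test_size=0.3, random_state=42
)

# Steps 3 and 4: create the classifier and fit it on the training data.
model = DecisionTreeClassifier()
model.fit(train_features, train_target)

# Step 5: accuracy of the model on the held-out test data.
print(model.score(test_features, test_target))
```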

With the few steps above, a prediction on the data set using a decision tree is done. I hope you enjoyed my take on the basics of machine learning, decision trees and their application. The notebook for the above prediction can be found here, if you want to take a closer look at the steps taken.
