Introduction
Decision trees are a popular machine learning model used for classification and regression tasks. They are simple, easy-to-understand models that provide a visual representation of the decision-making process. A decision tree works by breaking a dataset down into smaller and smaller subsets while, at the same time, the associated tree is incrementally developed. In this article, we will explore some of the important concepts related to decision trees, such as Gini, Entropy, and Information Gain, and why there are infinite possibilities when creating trees.
The final result is a tree with decision nodes and leaf nodes, where the leaf nodes represent the classification or regression result. The leaf nodes, or terminal nodes, are the last nodes of a branch, much like a real tree where the leaves are the outermost part. The first node, where the data is split for the first time, is called the root node. Every decision node is a parent node, and every node produced by splitting a decision node is its child node. The figure on the left gives a good visual representation of a decision tree and its components.
A decision tree works by posing a condition: a question about the value of a specific column. In Python (scikit-learn) the feature values must be quantitative, whereas in R they can be quantitative as well as qualitative. A node can split the data into any number of subsets, but it is usually two, and the number of subsets per split can also be specified. Decision trees are especially popular because they are easy to understand when visualized. The primary goal of the tree is to have pure leaves, meaning that each leaf node should ideally contain observations of a single class.
In more technical terms, a decision tree chooses an attribute to split on by applying Hunt's algorithm. Decision trees are grown recursively:
Step 1: If all the records in a set of training records belong to the same class, then the node is a leaf node.
Step 2: If the records in a set of training records belong to multiple classes, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition, and the records are distributed to the children based on those outcomes. The algorithm is then applied recursively to each child node, as sketched below.
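To make the two steps concrete, here is a minimal Python sketch of the recursion. The `split_fn` callback is a hypothetical stand-in for the attribute-test selection step; real implementations choose that test using an impurity measure such as Gini or entropy, which are discussed below.

```python
# Minimal sketch of Hunt's algorithm (illustrative only, not a production implementation).
# `split_fn` is a hypothetical callback that picks an attribute test and returns
# the test plus the resulting partitions of the records.

def grow_tree(records, labels, split_fn):
    # Step 1: all records belong to one class -> this node is a leaf
    if len(set(labels)) == 1:
        return {"type": "leaf", "class": labels[0]}

    # Step 2: choose an attribute test condition and partition the records
    test, partitions = split_fn(records, labels)

    node = {"type": "decision", "test": test, "children": {}}
    for outcome, (child_records, child_labels) in partitions.items():
        # apply the algorithm recursively to each child node
        node["children"][outcome] = grow_tree(child_records, child_labels, split_fn)
    return node
```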
At this stage, the reader might ask: how are different data types split, and what criterion is used to split the dataset?
Splitting Nominal Data
Nominal data can be split either multiway, into its individual categories, or as a binary split. This varies across decision tree algorithms; for example, the CART algorithm only allows binary splits. An example of a nominal split can be seen in Fig 3.
Splitting Ordinal Data
Ordinal data can also undergo either a binary or a multiway split. Ordinal values can be grouped for splitting as long as the grouping does not break their order. In Fig 4, (a) and (b) make sense since the groupings preserve the order, whereas in (c) Small is grouped with Large and Medium with Extra Large, which violates the ordering.
Splitting Continuous Data
For continuous data the test conditions are usually comparisons such as A < x or A >= x, with binary outcomes. For a binary split, the tree algorithm must consider all candidate values of x and select the one that results in the best split. For multiway splits, the algorithm must choose ranges of values instead.
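To show what this threshold search looks like, here is a small sketch (the `best_threshold` helper is hypothetical, not the exact routine used by any particular library). It scans midpoints between consecutive sorted values as candidate cut points and keeps the one with the lowest weighted Gini impurity, a measure defined in the next section.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(values, labels):
    """Scan candidate thresholds for a binary test `A < x` and return the one
    giving the lowest weighted impurity of the two child nodes."""
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    best_x, best_impurity = None, np.inf
    # candidate cut points: midpoints between consecutive distinct values
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        x = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_x, best_impurity = x, weighted
    return best_x, best_impurity
```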
Measures of splitting data
The measures used to split the data at each node are based on the degree of impurity in each child node: the smaller the degree of impurity, the more skewed the class distribution. The equations might look complex, so an example will be used to show how the splitting measures are calculated. A small dataset can be created for this purpose; note that this example does not include a continuous variable, because it would become too complex for the purpose of understanding Entropy, Gini, and Information Gain.
Before splitting the data and calculating Entropy, Gini and Information gain at each split, it’s important to know how they are calculated.
Entropy
For a binary classification problem, entropy varies from 0 to 1. For a multi-class classification problem, entropy varies from 0 to values greater than 1. Entropy is defined using the following equation:
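In standard notation, where p(i | t) is the fraction of records at node t that belong to class i and c is the number of classes:

$$\text{Entropy}(t) = -\sum_{i=0}^{c-1} p(i \mid t)\,\log_2 p(i \mid t)$$

By convention, 0 log2 0 = 0. Entropy is 0 when all records at a node belong to a single class and is largest when the classes are evenly mixed.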
Gini
The Gini index is another measure of node impurity. For a binary classification problem it varies from 0 to 0.5, and it is 0 when the node is pure, i.e. when all records at the node belong to a single class. The Gini index is defined using the following equation:
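Using the same notation as for entropy:

$$\text{Gini}(t) = 1 - \sum_{i=0}^{c-1} \bigl[p(i \mid t)\bigr]^2$$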
Information Gain
Information Gain is used to compare the degree of impurity before and after a split. It is calculated from the impurity of the parent node and that of its child nodes to measure the improvement (or deterioration) in impurity. The formula for information gain is as follows.
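Written in the usual form, where I(·) is the chosen impurity measure (entropy or Gini), N is the number of records at the parent node, k is the number of child nodes, and N(v_j) is the number of records assigned to child v_j:

$$\Delta_{\text{info}} = I(\text{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N}\, I(v_j)$$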
Information gain can be calculated using any impurity metric in decision trees. The goal is to have the highest information gain from the parent node to the child nodes; hence, the algorithm tries to maximize this metric at each split.
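As a quick numeric illustration (a made-up split, not the article's dataset), the snippet below computes entropy, Gini, and the information gain of a candidate binary split:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children, impurity=entropy):
    """Impurity of the parent minus the weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(c) / n * impurity(c) for c in children)
    return impurity(parent) - weighted

# Toy example: a parent node with 6 'yes' and 4 'no' records,
# split into two child nodes by some attribute test.
parent = ["yes"] * 6 + ["no"] * 4
left = ["yes"] * 5 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 3

print(round(entropy(parent), 3))                            # ~0.971
print(round(information_gain(parent, [left, right]), 3))    # entropy-based gain
print(round(information_gain(parent, [left, right], gini), 3))  # Gini-based gain
```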
Data Prep
For the decision tree in Python, EV_clean_data is used to build the classification model. The data is first split into training and testing sets using the train_test_split() function. The code for this analysis can be found here. The dataset used for this analysis can be found here.
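A minimal sketch of the split, assuming the cleaned data is loaded as a dataframe named EV_clean_data, the label column is called 'EV_level', and an 80/20 split is used; the file path, column name, and split ratio are assumptions, not the exact values from the original analysis:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the cleaned dataset (path and label column name are illustrative)
EV_clean_data = pd.read_csv("EV_clean_data.csv")

X = EV_clean_data.drop(columns=["EV_level"])
y = EV_clean_data["EV_level"]

# 80/20 train/test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```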
Case 1
First, all the features were used to create a decision tree model. The max depth of the tree was set to 4 and 'gini' was used as the criterion, resulting in the following tree:
The accuracy for the above model was 80%, which is quite bad in this scenario.
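For reference, a sketch of how the Case 1 model can be fit and its tree plotted with scikit-learn, continuing from the data-prep sketch above (variable names are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score

# Case 1: all features, max depth 4, Gini as the splitting criterion
dt_gini = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=42)
dt_gini.fit(X_train, y_train)

print(accuracy_score(y_test, dt_gini.predict(X_test)))

# Visualize the fitted tree
plot_tree(dt_gini, feature_names=list(X_train.columns), filled=True)
plt.show()
```

Case 2 below changes only one argument: criterion="gini" becomes criterion="entropy".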
Case 2
Next, the complete data was used with a max depth of 4, but the criterion was changed from Gini to Entropy, resulting in the following tree:
The accuracy of the above model was 80% as well, still not good enough.
Case 3
Later, the max depth was increased to 6; note that increasing the depth of the tree can sometimes lead to overfitting. The decision tree for the complete data with entropy as the criterion is:
The accuracy of the above model is 91%, definitely better than that of the previous models, but still not good enough.
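Reusing the training split from the data-prep sketch, the only change from the previous case is the max_depth parameter (again, variable names are assumptions):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Case 3: all features, entropy criterion, deeper tree (risk of overfitting)
dt_depth6 = DecisionTreeClassifier(criterion="entropy", max_depth=6, random_state=42)
dt_depth6.fit(X_train, y_train)
print(accuracy_score(y_test, dt_depth6.predict(X_test)))
```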
Case 4
From applying Naive Bayes, it was observed that some independent features were highly correlated. Reducing the dimensionality of the data by keeping only 'Percentage_pop_with_bachelors', 'Adults 26-34', 'Av_temperature', 'Median_income', 'GDP', and 'Elec_price', setting the max depth to 4 and the criterion to entropy, resulted in the following tree:
The accuracy of the above model is 90%, worse than the model with a max depth of 6, but better than the complete-data decision tree models in the first two cases.
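A sketch of the reduced-feature model, assuming the listed column names match the dataframe exactly and reusing the earlier train/test split:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Case 4: keep only the less correlated features identified via Naive Bayes
selected = ["Percentage_pop_with_bachelors", "Adults 26-34", "Av_temperature",
            "Median_income", "GDP", "Elec_price"]

dt_reduced = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=42)
dt_reduced.fit(X_train[selected], y_train)
print(accuracy_score(y_test, dt_reduced.predict(X_test[selected])))
```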
Conclusion
Comparing the Confusion Matrices
From the above comparison of the confusion matrices, it's clear that the data is imbalanced. Besides the issue of imbalanced data, it is also clear that the models are generally bad at predicting Class 2 and Class 3. The reasons for this could be that there are not enough observations for the models to learn the characteristics of Class 2 and Class 3, or that the features are not good indicators of the label, which is the level of EV presence in each state. In Case 4, however, the accuracy in predicting Class 1 was better than in the other cases.