A Noob’s Guide to Decision Trees

Tanya Gupta · Published in Analytics Vidhya · Jan 13, 2021

Greetings, my fellow noobs. Today I have put together a blog on decision trees, a concept which at first glance may seem like a “tangled mess” but is much simpler than it looks. Here, we will grasp the idea behind decision trees, how they work, and how they make your life a tad bit easier.


Introduction

A decision tree is a supervised learning algorithm used for both classification and regression problems. Its tree-like representation gives a visual of all the decisions that can be made for a given problem.

In very basic, noobish terms, a decision tree is a tree which represents all possible solutions to a problem. For example,

Suppose we are given a group containing birds, fish, cats and dogs. We want to classify all these animals into their respective groups. So what questions can we ask to answer this problem?

We can first ask, “does the animal have fur?”. If the answer is yes, then cats and dogs fall into that group. The animals without fur form a separate group containing birds and fish.

Now, let’s focus on the animals with fur. We ask, “does the animal bark?”. If it does, it is classified as a dog; if it doesn’t, it is a cat. Now that this group is classified, we move on to the next one.

In the case of the animals without fur, we can ask, “does the animal fly?”. If it does, it is a bird; otherwise, it is a fish. This decision making can be described in the form of the tree given below:

(Figure: the decision tree for the animal example. Source: wiki.tum.de)
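The same decisions can also be written as nested if/else checks. Here is a minimal Python sketch of the tree above (the function name and boolean features are just illustrative):

```python
def classify_animal(has_fur: bool, barks: bool, flies: bool) -> str:
    """Mirror of the hand-built tree: each question is one split."""
    if has_fur:                          # root question: "does the animal have fur?"
        return "dog" if barks else "cat"
    return "bird" if flies else "fish"   # animals without fur

print(classify_animal(has_fur=True, barks=False, flies=False))   # cat
print(classify_animal(has_fur=False, barks=False, flies=True))   # bird
```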

There are two types of decision trees:

  • Categorical Variable Decision Tree: The target value is a categorical value. For example, if we have a dataset for diagnosing diabetes, then the target value for the diagnosis is either “yes” or “no”.
  • Continuous Variable Decision Tree: The target value is a continuous value, as in regression. For example, predicting house prices based on the square feet of the basement, first floor and second floor, the number of bathrooms, the number of bedrooms, etc. A small sketch of both types follows this list.
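Both flavours are available in scikit-learn. The snippet below is only a toy illustration with made-up numbers, not a real dataset:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Categorical target: diabetes diagnosis encoded as 0 ("no") or 1 ("yes").
X_cls = [[25, 80], [50, 140], [35, 95], [60, 160]]   # [age, glucose], made-up values
y_cls = [0, 1, 0, 1]
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X_cls, y_cls)
print(clf.predict([[45, 150]]))

# Continuous target: house price from square footage and number of bedrooms.
X_reg = [[850, 2], [1200, 3], [1500, 3], [2000, 4]]  # made-up values
y_reg = [150000, 210000, 260000, 330000]
reg = DecisionTreeRegressor(max_depth=3).fit(X_reg, y_reg)
print(reg.predict([[1600, 3]]))
```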

Applications

  • It is used for data mining and for classification in multivariate systems.
  • Used for developing prediction algorithms for a target variable.
  • Used with demographic data to find potential clients.
  • It helps you to map out different alternatives and choose the best one based on different situations.

Basic Terms

  • Root node: This is where we start making decisions. It represents the entire dataset and then based on the selected question, we split it.
  • Leaf node: This is where we reach a dead end. No new information can be obtained by splitting such a node, so it cannot be divided any further.
  • Pruning: the process of removing unwanted nodes from a tree. It’s the opposite of splitting.
  • Splitting: It is the process of division of our dataset based on some question or condition.
  • Parent/child node: the node from which divisions are made is the parent node, while a child node is a sub-node originating from the parent.
  • Branch or sub-tree: the result of splitting the tree/node.

How does it work?

We will be implementing the ID3 algorithm for decision trees, which uses entropy as one of its metrics. We will be discussing the metrics for decision trees with categorical target variables.

Being a decision tree, it involves asking a lot of questions. But the real task is to find the order in which the questions need to be asked.

How do we select a question so that making decisions further down the line becomes easier for us? For this, we have to introduce certain metrics through which we can evaluate our questions and order them.

But before those metrics, we need to understand a couple of more basic terms:

  • Impurity: it is a measure of how homogeneous our node labels are. Let’s take the same group we defined previously as an example. If we were to put a label (e.g. “cat”) on one of these animals at random (say, a fish), what is the probability that it would be a wrong classification? This question defines the concept of impurity.
  • Entropy: it is a measure of impurity and quantifies the randomness in a dataset. If one part of the dataset falls under one category and another part falls under a different category, we can say our dataset is random, as it becomes inconclusive for us to make any decision based on it. For example,
(Figure: an impure group of mixed labels versus a pure group. Source: www.dataversity.net)

In the above figure, we can see that labelling the first image (impure) into groups is tough compared to the second one (pure). Entropy can be calculated as follows:

Entropy = −Σ pᵢ log₂(pᵢ), summed over i = 1 to c, where pᵢ is the probability of an item with label i and c is the no. of groups.
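As a quick sanity check, that formula is only a few lines of Python (the helper function below is my own, not from any particular library):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i)) over each label i."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

print(entropy(["up", "up", "up", "up", "down", "down"]))  # ~0.918, an impure group
print(entropy(["up", "down"]))                            # 1.0, maximally impure
print(entropy(["down", "down", "down", "down"]))          # 0.0, a perfectly pure group
```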

If you want to make it more poetic, entropy is like adding salt to distilled water and then trying to segregate the two back into groups. The salt is dissolved, so it ain’t gonna show up by itself. You gotta work to separate them!!!

Now that the thesaurus and my cringiness are over, we will explain the metrics for selecting the question based on an attribute:

  • Gini Index: It is a measure of impurity and its value lies between 0 and 1. It is calculated as Gini = 1 − Σ pᵢ²,

where pᵢ is the probability of an object being classified into group i. The lower the Gini index, the higher the homogeneity; a perfectly pure node has a Gini index of 0. The Gini index works with binary splits and is the metric used in the CART (Classification And Regression Tree) implementation of decision trees.
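Again as a rough sketch (my own helper, mirroring the entropy one above):

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2) over each label i."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini_index(["up", "up", "up", "up", "down", "down"]))  # ~0.444, impure
print(gini_index(["down", "down", "down", "down"]))          # 0.0, pure
```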

  • Information Gain: It gives a measure of how much information an attribute can give about a class/group. It specifies by how much the entropy can be decreased by splitting on an attribute: Information Gain = Entropy(parent) − weighted average of Entropy(children).

As we go down our decision tree, we want to go from high entropy (highly inconclusive) to lower entropy (where classification becomes straightforward). We can also see intuitively that if the weighted average of the entropy of the children is low, the information gain must be high, i.e., we have gained a lot of new information that lets us classify items easily.

That being said, the attribute with the highest information gain is chosen first and the same process is repeated as we go down our decision tree.
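Combining the two ideas, here is a small sketch of information gain on the animal example from earlier (the helper functions are my own):

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent_labels, children_labels):
    """Entropy of the parent minus the weighted average entropy of its children."""
    total = len(parent_labels)
    weighted_child_entropy = sum(
        len(child) / total * entropy(child) for child in children_labels
    )
    return entropy(parent_labels) - weighted_child_entropy

# The animal example: splitting on "has fur?" separates {cat, dog} from {bird, fish}.
parent = ["cat", "dog", "bird", "fish"]
children = [["cat", "dog"], ["bird", "fish"]]
print(information_gain(parent, children))  # 2.0 - 1.0 = 1.0 bit gained
```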

(Figure: going from high to low entropy as we move down the tree. Source: towardsdatascience.com)

The Gini index is often preferred over entropy and information gain because it does not involve computing logarithms and is therefore computationally cheaper.

Now we will see the steps to build a decision tree:

(Figure: the example dataset, with attributes “Past Trend”, “Open Interest” and “Trading Volume”, and target variable “Return”. Source: blog.quantinsti.com)
  1. For each attribute of the dataset we will find the information gain from it. For example, for the attribute “past trend” there are two types of values here i.e., positive and negative.
  2. For each type we find out its respective classification based on the target variable. Here, for the positive type of “past trend”, we have 4 “up” and 2 “down”. For the negative type, all values are “down”, so the negative type is homogeneous.
  3. For each type we will find the entropy and then find their weighted average. In this example, for positive value our entropy would be: (-4/6)log₂(4/6)+(-2/6)log₂(2/6) = (-2/3)log₂(2/3)+(-1/3)log₂(1/3)=0.3899+0.5283=0.9182 ; Similarly for negative we have entropy =0
  4. Now, we find its weighted average, which is: (6/10)*(0.9182) + (4/10)*0 = 0.5509
  5. After finding this weighted average, we subtract it from the entropy of the parent node. For the root node, the parent entropy is the entropy of the whole dataset. This is found by taking the probability of each value of the target variable and applying the entropy formula to those probabilities.
  6. Here, for our target variable “return” there are 4 “up”s and 6 “downs”.
  7. Applying entropy formula, (-4/10)*log₂(4/10)+(-6/10)*log₂(6/10) = 0.9709
  8. Subtracting the weighted average from the entropy of the dataset, we get 0.9709 − 0.5509 ≈ 0.42, which is the information gained from “past trend” (the sketch after this list reproduces this arithmetic).
  9. Steps (1) to (5) are performed for all the other attributes, and then we select the attribute with the maximum information gain as the root.
  10. For “Open Interest”, the entropy for the “low” value is (-4/6)log₂(4/6)+(-2/6)log₂(2/6) = 0.9182, and for “high” it is 1. The weighted average is (6/10)*0.9182 + (4/10)*1 = 0.9509, so the information gained is 0.9709 − 0.9509 = 0.02. For “Trading Volume”, the information gained is 0.9709 − 0.6896 = 0.2813.
  11. So, we choose “past trend” (the highest gain) as our root and split our dataset accordingly. We perform all these steps again for the child nodes of “past trend” and keep building the tree until we either cannot obtain any more information from the nodes or we have reached our maximum specified depth.
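The arithmetic for “past trend” can be reproduced with a short script. The counts are taken from the example above; the helper function is my own:

```python
from math import log2

def entropy_from_counts(counts):
    """Entropy from a list of class counts, e.g. [4, 6] for 4 "up" and 6 "down"."""
    total = sum(counts)
    return sum(-(n / total) * log2(n / total) for n in counts if n > 0)

# Target "Return" over the whole dataset: 4 "up" and 6 "down".
parent_entropy = entropy_from_counts([4, 6])        # ~0.9709

# Split on "Past Trend": positive -> 4 "up" / 2 "down", negative -> 4 "down".
positive_entropy = entropy_from_counts([4, 2])      # ~0.9182
negative_entropy = entropy_from_counts([4])         # 0.0 (homogeneous)
weighted = (6 / 10) * positive_entropy + (4 / 10) * negative_entropy   # ~0.5509

print(round(parent_entropy - weighted, 2))          # 0.42, the gain from "Past Trend"
```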

Advantages:

  • We need very little preprocessing of the dataset for this algorithm (no feature scaling or normalization), and outliers don’t have much effect on its results.
  • Performs well with large datasets.
  • Is able to handle both numeric and categorical data.
  • It is easy to interpret its decision rules.

Disadvantages:

  • They tend to overfit the training data and thus may not generalize well to unseen data. We use pruning to mitigate this.
  • A small change in the dataset could lead to large changes in the tree and hence the final predictions.
  • For regression, it can predict only within the minimum and maximum range of the response variable seen in the training data, as the sketch below illustrates.
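A tiny sketch of that last point, with toy numbers of my own:

```python
from sklearn.tree import DecisionTreeRegressor

# The training target only ranges from 10 to 50.
X_train = [[1], [2], [3], [4], [5]]
y_train = [10, 20, 30, 40, 50]

reg = DecisionTreeRegressor().fit(X_train, y_train)
print(reg.predict([[100]]))  # stays at 50, the training maximum: trees cannot extrapolate
```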

So, with this we wrap up our lesson. I hope you have come a little bit closer to understanding the decision tree. Thank you for reading and have a good day.
