A Noob’s Guide to Random Forest

Tanya Gupta · Published in Level Up Coding · Jan 28, 2021

As a noob, it is very easy to get lost while trying to understand random forests. But fear not, my fellow noobs, because this blog will give you a starter guide to survive this tangled mess of a topic. We will also look at an implementation of random forest (and of decision trees) on a dataset, which will help you understand the algorithm further.


Introduction

Random Forest is an ensemble learning method used mainly for classification problems, though it can be used for regression as well. It combines the decisions of multiple decision trees and takes the majority response across all the trees as the final verdict/output.

Ensembles are basically predictive models that combine the predictions of several "weak learners" to create one "strong predictor".

In order to understand and implement random forest, you should first understand the concept of decision trees. I have covered the theory for that topic in another blog, whose link you can find here.

Working

In order to understand the working of random forest we need to understand the meaning of “bagging” or “bootstrap aggregating”.

Bagging is a technique wherein we select a number of samples for each tree with replacement. Sampling with replacement means that when choosing dataset rows for a sample, the same rows can be chosen again for other decision trees as well.

After assigning samples to these trees, we train each tree on its sample and then predict for a testing instance/row using all of them.

After we get the results from all the trees, we take the majority label out of the pool of results in the case of classification, and average the results in the case of regression.
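If you like seeing things in code, here is a tiny, self-contained sketch of those two ideas (sampling with replacement and majority voting) using NumPy. The arrays and the three hard-coded "tree predictions" are just placeholders for illustration:

import numpy as np

rng = np.random.default_rng(42)
rows = np.arange(10)  # pretend these are the row indices of a dataset

# Sampling WITH replacement: the same row may show up in several samples.
samples = [rng.choice(rows, size=len(rows), replace=True) for _ in range(3)]

# Pretend each of the three trees predicted a class label for one test row.
tree_predictions = np.array([1, 0, 1])

# Classification: take the majority vote (regression would average instead).
values, counts = np.unique(tree_predictions, return_counts=True)
print(values[np.argmax(counts)])  # -> 1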


If you still don't fully understand this process, let's take the example of a fruit basket containing cherries, apples and oranges. Assume we have to find the label for a certain unknown fruit.

This unknown fruit has the following attributes: its diameter is 3, it is orange in colour, it grows in summer and its shape is circular.

Now we will use the random forest algorithm to find out where it belongs.

  • First, from our dataset we take one sample for each of our three decision trees and train each tree on its respective sample.
  • After training them, we "pass" our test instance through each of these trees and see what each one predicts for it.
  • Taking the first tree, it first asks whether the diameter is ≥ 3. For the test instance this is true, so we go down the right path. Then the tree asks whether the colour is orange. Again this is true, so we go down the right path, which ends in the prediction being "orange".
  • Similarly, for tree 2 and tree 3 we also get the predicted label "orange".
  • So, the label that appears the maximum number of times across all three predictions, and hence the final output, is "orange".

The working is pretty straightforward, and with the knowledge and implementation of decision trees under your belt, random forest becomes fairly easy.

Implementation

I have implemented this algorithm on the breast cancer dataset, which you can find here.

  • Import numpy and pandas libraries.
  • pandas.read_csv() is used to read the csv file and store the dataframe in the variable dataset.
  • I checked for null values using dataframe.isnull().sum() (isnull() is a pandas method). In this dataset there aren't any null values.
  • I converted my dataframe into a numpy array because I personally find its slicing operations easier to use (I can also shuffle the rows easily using np.random.shuffle(<array>) and I don't have to use dataframe.iloc every time I want to access a column or a row).
  • I divided my dataset in the ratio of 75:25 for the training and testing sets. I did this by first finding the index at 3/4th of my dataset. Since I am using Python 3.x, I have to use integer division ("//" instead of "/"), which gives me an integer index rather than a float.
  • train = dataset[:idx, :] and test = dataset[idx:, :]. Notice I have not split train into x_train and y_train (and the same goes for the testing set); as we go through the functions you will see why.
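Putting those steps together, the preprocessing might look roughly like this (the file name "data.csv" is a placeholder, not the actual file I used):

import numpy as np
import pandas as pd

dataframe = pd.read_csv("data.csv")    # load the csv into a dataframe
print(dataframe.isnull().sum())        # count null values per column

dataset = dataframe.to_numpy()         # switch to numpy for easier slicing
np.random.shuffle(dataset)             # shuffle the rows in place

idx = (3 * len(dataset)) // 4          # index at 3/4th of the dataset
train = dataset[:idx, :]               # 75% for training
test = dataset[idx:, :]                # 25% for testing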

Decision Tree Functions

define count_label(<section of dataset>): It creates a dictionary for the different types of values in our response variable. In our case, the values in "diagnosis" are either 0 or 1.

In the "cnt" dictionary, for each type it stores the number of times that type occurs in the response column.

define entropy(<section of dataset>): It finds the entropy or randomness in that section of the dataset (specifically in the response variable, which in our case is "diagnosis").

It first gets the number of occurrences for each type of value in response variable column, then it calculates entropy by formula:

entropy = Σ −(n/total) · log(n/total), where:

  • total = number of rows in that section and,
  • n = number of occurrences for a certain type.
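A minimal sketch of these two helpers, assuming (as elsewhere in this post) that the response label is stored in the last column of each row, and using log base 2:

import numpy as np

def count_label(rows):
    # Map each label value in the response column to its count.
    cnt = {}
    for row in rows:
        label = row[-1]                 # response ("diagnosis") is the last column
        cnt[label] = cnt.get(label, 0) + 1
    return cnt

def entropy(rows):
    # Entropy of the response column for this section of the dataset.
    cnt = count_label(rows)
    total = len(rows)
    return sum(-(n / total) * np.log2(n / total) for n in cnt.values())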

define info_gain(left_split, right_split, parent_entropy): where left_split and right_split are the left and right portions of the dataset after partitioning.

It tells us how much information is gained after the splitting by:

  • first, finding the weighted average of the children's entropies (each child's entropy weighted by its share of the rows).
  • then, subtracting this weighted entropy from the parent entropy.
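In code, building on the entropy() sketch above, it might look like this:

def info_gain(left, right, parent_entropy):
    # Information gained by splitting a node into `left` and `right`.
    total = len(left) + len(right)
    p = len(left) / total               # weight of the left child
    weighted_child_entropy = p * entropy(left) + (1 - p) * entropy(right)
    return parent_entropy - weighted_child_entropy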

class Node:

This class was made to facilitate the storage of the best splitting feature, the node’s children (i.e., left and right split), the best question for that given feature and the final predicted label.

So, for all the nodes which aren't leaf nodes, we store everything except the predicted label.

To find out whether a node is a leaf or not, we use the method is_leaf(), which returns True if the predicted label isn't None.
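A bare-bones sketch of such a Node class (the attribute names are just ones I picked to match the description above):

class Node:
    def __init__(self, feature=None, question=None,
                 left=None, right=None, predicted_label=None):
        self.feature = feature                  # best splitting feature (column index)
        self.question = question                # best question/threshold value
        self.left = left                        # left child (values < question)
        self.right = right                      # right child (values >= question)
        self.predicted_label = predicted_label  # set only for leaf nodes

    def is_leaf(self):
        return self.predicted_label is not None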

define partition(<rows of the section>, question value, feature):

So, the question value corresponds to the feature or column on which we are splitting. The function creates a left and a right partition based on this "threshold" or question value.

If the row's value in that column is ≥ the question value, the row is appended to the right partition. Else, it is appended to the left partition.

It returns partitions as numpy arrays.
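A sketch of this function, assuming feature is a column index:

import numpy as np

def partition(rows, question, feature):
    # Split rows on whether their value in `feature` is >= the question value.
    left, right = [], []
    for row in rows:
        if row[feature] >= question:
            right.append(row)
        else:
            left.append(row)
    return np.array(left), np.array(right)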

define best_split(rows, column):

here, rows is the section of the dataset to be split and column is a list of features which have already been used up in splitting.

It returns the best question value and the best attribute or feature, i.e., the split that gives the maximum information gain (and hence the minimum entropy in the resulting children).

It first checks whether a feature has already been used for splitting in earlier iterations. If it has, we skip that feature; else, we continue to find the best possible question value for it.

We find the information gain for each and every value of that feature and store, in the ques variable, the value which gives the maximum information gain.

It returns ques and attr as None if there is no split with information gain > 0.
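A rough sketch of best_split(), reusing the helpers above and again assuming the response is the last column:

import numpy as np

def best_split(rows, column):
    # Try every unused feature and every value in it; keep the split with the
    # highest information gain. Return (None, None) if no split gains anything.
    best_gain = 0.0
    ques, attr = None, None
    parent_entropy = entropy(rows)
    n_features = rows.shape[1] - 1          # last column is the response

    for feature in range(n_features):
        if feature in column:               # already used up in an earlier split
            continue
        for value in np.unique(rows[:, feature]):
            left, right = partition(rows, value, feature)
            if len(left) == 0 or len(right) == 0:
                continue
            gain = info_gain(left, right, parent_entropy)
            if gain > best_gain:
                best_gain, ques, attr = gain, value, feature
    return ques, attr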

define build_tree(rows, max_depth, column):

  • column is a list of features which are already used up in splitting.
  • max_depth gives us the limit till which we will let our tree grow.
  • rows is the dataset.

Tasks it performs:

  • First it finds the best split using best_split function.
  • If the attribute returned is not None, it then checks whether our response variable contains only one type of value. It also updates the column list.
  • It returns a leaf node if max_depth is 0, the attribute/feature is None, or the split is pure (number of types = 1).
  • Else, we partition the rows into left and right partitions and recursively call build_tree for both partitions, with max_depth decremented by 1 and with our updated column list.
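A compact sketch of this recursion, using the Node class and the helpers sketched earlier (the leaf simply stores the majority label of its rows):

def build_tree(rows, max_depth, column):
    ques, attr = best_split(rows, column)
    labels = count_label(rows)

    # Stop and return a leaf: depth exhausted, no useful split, or a pure node.
    if max_depth == 0 or attr is None or len(labels) == 1:
        return Node(predicted_label=max(labels, key=labels.get))

    column = column + [attr]                          # mark this feature as used
    left_rows, right_rows = partition(rows, ques, attr)
    left = build_tree(left_rows, max_depth - 1, column)
    right = build_tree(right_rows, max_depth - 1, column)
    return Node(feature=attr, question=ques, left=left, right=right)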

define find_common_label(column): For a given feature it returns the most common type of value.

define root(rows, max_depth): It calls the build_tree() function and stores the root of our decision tree.

define predict(test_set, root): It takes the root of the decision tree modelled on our training data and gives a prediction for each row in the test set.

define traverse_tree(test_row, node): It compares the feature value in the test row with the question value stored at the current node of our decision tree.

  • If the feature value ≥ question value for that node, traverses the right path
  • Else, traverses the left path.

If it reaches a leaf node, it returns that leaf's predicted label.
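Sketches of these remaining helpers, in the same style as before:

import numpy as np

def find_common_label(column):
    # Most common value in a column.
    values, counts = np.unique(column, return_counts=True)
    return values[np.argmax(counts)]

def root(rows, max_depth):
    # Build the decision tree and return its root node.
    return build_tree(rows, max_depth, column=[])

def traverse_tree(test_row, node):
    # Walk down the tree, comparing feature values with question values,
    # until a leaf is reached.
    if node.is_leaf():
        return node.predicted_label
    if test_row[node.feature] >= node.question:
        return traverse_tree(test_row, node.right)
    return traverse_tree(test_row, node.left)

def predict(test_set, tree_root):
    # One predicted label per test row.
    return [traverse_tree(row, tree_root) for row in test_set]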

Oof that’s a lot of functions for decision trees.

Random Forest Functions

define random_forest_classifier(): It takes the number of trees, the training set, the test rows (excluding the response column) and the maximum tree depth as arguments.

  • Each tree is assigned a sample from the training set.
  • The trees are trained using our decision tree functions.
  • We append each tree's root to the "trees" list and each tree's predictions to the "predictions_per_tree" list.

define get_samples(training set):

We choose random row indices using np.random.choice(). I have chosen the sample size to be one third of the training set.

These randomly selected rows are then appended to a sample list, which is returned.
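Here is a sketch of both functions together; the argument order is my own choice, and replace=True makes the sampling "with replacement", as in the bagging section:

import numpy as np

def get_samples(train):
    # Bootstrap sample: one third of the training rows, drawn with replacement.
    n = len(train) // 3
    indices = np.random.choice(len(train), size=n, replace=True)
    return train[indices]

def random_forest_classifier(num_trees, train, test_rows, max_depth):
    trees, predictions_per_tree = [], []
    for _ in range(num_trees):
        sample = get_samples(train)            # assign a sample to this tree
        tree_root = root(sample, max_depth)    # train a decision tree on it
        trees.append(tree_root)
        predictions_per_tree.append(predict(test_rows, tree_root))
    return trees, predictions_per_tree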

define random_forest_predict(tree_predictions, actual_test_response_labels): For each test row we count the predictions from all the trees and take the majority label, which is returned as the final output.

I took number of trees = 5 and max_depth = 100. These values gave me an accuracy of 92.3%.
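Finally, a sketch of the voting step plus a hypothetical driver that ties everything together (the actual labels are used here to compute the accuracy; your exact number will differ from mine because of the shuffle and the random samples):

import numpy as np

def random_forest_predict(tree_predictions, actual_labels):
    # Majority vote across trees for each test row, plus the overall accuracy.
    final = []
    for i in range(len(actual_labels)):
        votes = np.array([preds[i] for preds in tree_predictions])
        final.append(find_common_label(votes))
    accuracy = np.mean(np.array(final) == np.array(actual_labels))
    return final, accuracy

# Hypothetical driver code:
# x_test, y_test = test[:, :-1], test[:, -1]
# _, preds_per_tree = random_forest_classifier(5, train, x_test, 100)
# final_preds, acc = random_forest_predict(preds_per_tree, y_test)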

****************************************************************

Advantages:

  • It is robust to multicollinearity among predictors.
  • It can handle missing data relatively well.
  • It reduces overfitting problem of decision trees.
  • It is useful for both classification and regression problems.
  • It doesn't require us to normalize the values in our dataset, since the splits are rule-based (threshold comparisons).

Disadvantages:

  • A large number of trees can slow down prediction and thus limit the algorithm's usage in real-time applications.
  • It is harder to interpret each decision, whereas interpretability was a great advantage of single decision trees.
  • It requires a lot of time and computational power to train multiple trees.

************************************************************

So, that wraps up the blog. The decision tree part might seem like quite a mouthful, and I totally agree. It took me a lot of time to understand the algorithm and its implementation, and I hope this blog has helped you (even a little bit) to understand the working behind decision trees and random forests.

I am in no way saying this is a complete guide. All I can say is I tried.

So, thank you to all the readers who took time out of their day to read this blog. Have a wonderful day 😄!!!
