---
tags: machine-learning
---

# Decision Tree and Random Forest

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/1.png" height="100%" width="70%"> </div>

> This is my personal notes taken for the course [Machine learning](https://www.coursera.org/learn/machine-learning#syllabus) by Stanford. Feel free to check the [assignments](https://github.com/3outeille/Coursera-Labs).
>
> Note that Decision Tree and Random Forest were not tackled in the course but are considered a must-know concept for everyone starting in the field.
>
> I wouldn't have been able to understand Decision Tree and Random Forest without the [StatQuest](https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw) videos. Feel free to check him out.
>
> Also, if you want to read my other notes, feel free to check them at my [blog](https://ferdinandmom.engineer/machine-learning/).

# I) Decision Tree

**Decision Trees** are versatile machine learning algorithms that can perform both classification and regression tasks, and even multi-output tasks. They are very powerful algorithms, capable of fitting complex datasets.

There are 2 types of decision trees:

1. **Classification Trees**: When the decision tree has a categorical target variable.
2. **Regression Trees**: When the decision tree has a continuous target variable.

Here is an example of a classification tree:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/2.png" width="70%"> </div>
<br>

Here is an example of a regression tree:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/3.png" width="70%"> </div>
<br>

In order to grow those trees, we will use the **CART** algorithm, which produces only binary trees (non-leaf nodes always have exactly two children). Other algorithms, such as ID3, can produce decision trees whose nodes have more than two children.

## 1) Classification tree

How do we grow a classification tree? Let's first go through an example and then deduce the general formula from it.

Here is our dataset. We want to grow a classification tree from it.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/4.png"> </div>
<br>

The picture above shows the first 10 rows of the iris dataset. The first 4 columns are the 4 features that we will use to predict the target, the iris species, represented by the last column with numerical values: 0 for setosa, 1 for versicolor, 2 for virginica. In total, we have 150 observations (150 rows), 50 for each iris species: the dataset is balanced.

Here is the decision tree that we will get:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/5.png" width="70%"> </div>
<br>

How do we get such a tree? The tree is built iteratively from the root to the leaves using the **CART** algorithm.
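As a reference point, here is a minimal sketch of how such a tree can be grown and inspected in practice. I am assuming scikit-learn (the notes do not name a library), whose tree implementation is an optimized version of CART; the exact figure above may come from a different setup.

```python
# Minimal sketch (assuming scikit-learn): grow a CART classification tree on iris.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target        # 150 rows, 4 features, 3 balanced classes

# Gini impurity is the default splitting criterion; max_depth limits the tree size.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Print each split (feature, threshold) and the class predicted at the leaves.
print(export_text(tree, feature_names=iris.feature_names))
```

The rest of this section explains what happens inside `fit`: how each (feature, threshold) split is chosen.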
**The goal of a Decision Tree is to split the training set into homogeneous areas where only one iris species is present, according to the features given: here the petal and sepal widths.**

<ins>**Node 0: Root node**</ins>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/6.png"> </div>
<br>

The graph above shows the distribution of iris species according to the two features selected: **petal width on the x-axis** and **sepal width on the y-axis**. The color of the dots represents the iris species: **red for setosa, yellow for versicolor, blue for virginica**.

The root node, on the right of the picture above, gives us several pieces of information:

- There are $68$ irises ('$samples=68$'), the dots that we can count on the plot on the left.
- '$value=[23,20,25]$' describes the distribution of these irises among the three possible classes of iris species, i.e. $23$ setosa, $20$ versicolor, and $25$ virginica.
- '$class = virginica$'. This is the iris species predicted by the Decision Tree at the root node. This decision is taken because virginica is the most numerous species at the root node (25 virginica compared to 20 versicolor and 23 setosa). This is the reason why the background color of the left graph is blue, the color chosen for the virginica species.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/7.png"> </div>
<br>

The plot on the left is the same as the previous one, but with the first decision boundary of the tree: **petal width = 0.8 cm**.

<ins>How was this decision boundary decided?</ins>

**A decision boundary is decided by testing all the possible decision boundaries splitting the dataset and choosing the one that minimizes the Gini impurity of the two splits.**

---

<ins>**Gini impurity explanation:**</ins>

The **Gini impurity** measures how mixed a set is (it is $0$ when the node contains instances of a single class). It is defined as:

$$\boxed{G_i = 1 - \sum_{k=1}^{n} p_k^2}$$

with:
- $n$, the number of classes.
- $p_k$, the ratio of class $k$ among the training instances in the $i^{th}$ node.

---

In our case, the boundary is **petal width = 0.8 cm** because the 2 groups created by this boundary have the **lowest possible Gini impurity**.

The question to ask to determine a decision boundary is: **How do we split the iris species so that we create more homogeneous groups?**

Intuitively, what we can observe on the graph above is that it's possible to create a homogeneous group containing only setosa species just by splitting the dataset along the petal width axis. But the algorithm has no intuition. So how does it find the best split?

- It will try all the possible boundaries along all the features, i.e. along both the petal width and sepal width axes.
- For each split, the algorithm will compute the Gini impurity of the two groups created.
- Finally, it will choose the decision boundary that gives the lowest weighted sum of the two groups' Gini impurities.

In our case, the algorithm has found that, among all the possible splits, the split at **petal width = 0.8 cm** gives the lowest Gini impurity. A minimal sketch of this exhaustive search is given below.
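The following sketch makes the search concrete. It is a plain NumPy illustration with hypothetical helper names (`gini`, `best_split`), not the actual CART implementation, which is far more optimized.

```python
import numpy as np

def gini(labels):
    """Gini impurity G = 1 - sum_k p_k^2 of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Try every (feature, threshold) pair and keep the one with the
    lowest weighted Gini impurity over the two resulting groups."""
    n_samples, n_features = X.shape
    best = (None, None, np.inf)          # (feature index, threshold, weighted Gini)
    for k in range(n_features):
        # Candidate thresholds: midpoints between consecutive sorted feature values.
        values = np.unique(X[:, k])
        for t in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, k] <= t], y[X[:, k] > t]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if weighted < best[2]:
                best = (k, t, weighted)
    return best
```

Applied recursively to each resulting group, this kind of search is exactly what produces boundaries such as **petal width = 0.8 cm**.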
With this split, the Gini impurity for the left leaf is:

$$G_{left} = 1 - (\frac{23}{23})^2 = 0$$

The Gini impurity for the right leaf is:

$$G_{right} = 1 - (\frac{20}{45})^2 - (\frac{25}{45})^2 = 0.494$$

<ins>**Node 1:**</ins>

The process described above continues iteratively until the tree manages to separate all the data points, or until a restrictive condition is applied to the algorithm, such as a limit on the depth of the tree.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/8.png"> </div>
<br>

Splitting to get a homogeneous group is not always the best option. We will see that in node 2.

<ins>**Node 2:**</ins>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/9.png"> </div>
<br>

For this node, the algorithm chose to split at **petal width = 1.55 cm**, creating two heterogeneous groups. Intuitively, we would have split at **petal width = 1.3 cm** or **sepal width = 3.1 cm** so that we get a group with only versicolor irises. Let's verify which decision boundary is the best by computing the Gini impurity for both splits.

Gini impurity with the split at **petal width = 1.55 cm** (Left/Right/Weighted sum):

$$\begin{array}{ll} G_{left} &= 1 - (\frac{17}{18})^2 - (\frac{1}{18})^2 = 0.105 \\ \\ G_{right} &= 1 - (\frac{3}{5})^2 - (\frac{2}{5})^2 = 0.48 \\ \\ G_{w1} &= \frac{18}{23} * 0.105 + \frac{5}{23} * 0.48 = 0.187 \end{array}$$

Gini impurity with the split at **petal width = 1.3 cm** (Left/Right/Weighted sum):

$$\begin{array}{ll} G_{left} &= 1 - (\frac{8}{8})^2 = 0 \\ \\ G_{right} &= 1 - (\frac{12}{15})^2 - (\frac{3}{15})^2 = 0.32 \\ \\ G_{w2} &= \frac{8}{23} * 0 + \frac{15}{23} * 0.32 = 0.209 \end{array}$$

By comparing the weighted Gini impurity of our two splits, we can say that the algorithm is right and our intuition was wrong ($G_{w1} < G_{w2}$).

<ins>**Node 3:**</ins>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/10.png"> </div>
<br>

Applying the same principle again and again, the algorithm will try to isolate every point until it only has homogeneous groups. This can lead to overfitting if, for example, we don't limit the size of the tree.

<ins>How does the built tree make a decision?</ins>

When the **Decision Tree** has to predict a target, it travels down the tree from the root node until it reaches a leaf, deciding to go to the left or to the right child node by testing the feature value of the iris against the node's condition.

With this example, we can deduce the CART cost function:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/10_b.png"> </div>
<br>

where $k$ is a feature and $t_k$ a threshold for that feature (e.g. petal width = 0.8 cm).

## 2) Regression tree

The CART algorithm works mostly the same way as earlier, except that instead of trying to split the training set in a way that minimizes the Gini impurity, it now tries to split the training set in a way that minimizes the MSE (Mean Squared Error).
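The split search itself is unchanged; only the impurity measure is swapped. A minimal sketch mirroring the earlier `best_split` function (again my own illustrative code, with MSE in place of Gini):

```python
import numpy as np

def mse(targets):
    """MSE of a leaf that predicts the mean of its targets."""
    return float(np.mean((targets - targets.mean()) ** 2)) if len(targets) else 0.0

def best_regression_split(X, y):
    """Pick the (feature, threshold) pair minimizing the weighted MSE of the two groups."""
    best = (None, None, np.inf)          # (feature index, threshold, weighted MSE)
    for k in range(X.shape[1]):
        values = np.unique(X[:, k])
        for t in (values[:-1] + values[1:]) / 2:
            mask = X[:, k] <= t
            left, right = y[mask], y[~mask]
            weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
            if weighted < best[2]:
                best = (k, t, weighted)
    return best
```

Each leaf of the resulting tree predicts the average target value of the samples it contains.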
<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/11.png"> </div>
<br>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/11_b.png"> </div>
<br>

The split with the smallest MSE gives us the $k$ and $t_k$ that form our decision boundary.

Just like for classification tasks, Decision Trees are prone to overfitting when dealing with regression tasks. Without any regularization (i.e. using the default hyperparameters), you get the predictions on the left. The model is obviously overfitting the training set very badly. Just setting **min\_samples\_leaf=10** results in a much more reasonable model, represented on the right. This hyperparameter requires each leaf to contain at least **min\_samples\_leaf** training instances.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/12.png"> </div>
<br>

# II) Random Forest

Suppose you ask a complex question to thousands of random people, then aggregate their answers. In many cases, you will find that this aggregated answer is better than an expert's answer. This is called the wisdom of the crowd. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor.

A **group of predictors** is called an **ensemble**. Thus, this technique is called **Ensemble Learning**, and an **Ensemble Learning algorithm** is called an **Ensemble method**.

**Random Forest** is a tree-based technique that uses a large number of decision trees built out of randomly selected sets of features. It combines the simplicity of decision trees with flexibility, resulting in a vast improvement in accuracy.

How do we build a Random Forest?

1. Create a "bootstrapped" dataset. To do so, use one of the following techniques:
   - **Bagging**: Sampling performed with replacement.
   - **Pasting**: Sampling performed without replacement.
2. Create a decision tree using the bootstrapped dataset, but only using a random subset of features at each step.
3. Repeat steps 1 and 2.

<ins>**Step 1:**</ins> Using the bagging technique:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/13.png"> </div>
<br>

<ins>**Step 2:**</ins>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/14.png"> </div>
<br>

<ins>**Step 3:**</ins> After repeating, we get a variety of decision trees.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/15.png"> </div>
<br>

Now, to make a prediction (to make a decision), we just need to take a new example, run it through each decision tree of the random forest, and take the label with the most votes.

<ins>How do we measure the accuracy of our Random Forest?</ins>

When creating our "bootstrapped" dataset with bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. The instances that were never sampled for a given tree form its **Out-of-Bag** dataset.
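These ingredients (bagging, random feature subsets, Out-of-Bag evaluation) are bundled together in most libraries. A minimal sketch, assuming scikit-learn (the notes do not prescribe a library):

```python
# Minimal sketch (assuming scikit-learn): bagged trees with random feature subsets
# and Out-of-Bag evaluation.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,      # number of bootstrapped trees
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # sampling with replacement (bagging)
    oob_score=True,        # evaluate each tree on the samples it never saw
    random_state=0,
)
forest.fit(X, y)

print("Out-of-Bag accuracy:", forest.oob_score_)
```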
We can use them to evaluate our model by running each Out-of-Bag sample through the trees that never saw it and comparing the majority-vote prediction with the sample's true label.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/16.png"> </div>
<br>

Here, we used **hard voting**, where the label with the most votes wins. We could also use **soft voting**, where we average the predicted class probabilities of all the learners and pick the class with the highest average.

<ins>**Note:**</ins> **Soft voting** often achieves higher performance than hard voting because it gives more weight to highly confident votes.

Remember that we build our decision trees with a random subset of features? How do we concretely choose the size of this subset?

We start by selecting 2 random features to build the trees of our Random Forest and measure its accuracy. We then select 3 random features and measure the accuracy again. We compare the accuracy of the 2 models, keep the best one, and repeat the process. Typically, we start with the square root of the number of variables and then try a few settings above and below that value.

# III) Ensemble Methods

The main causes of error in learning are noise, bias and variance. Ensemble methods help to minimize these factors. These methods are designed to improve the stability and the accuracy of machine learning algorithms. Combining multiple classifiers decreases variance, especially in the case of unstable classifiers, and may produce a more reliable classification than a single classifier.

To use Bagging or Boosting, you must select a base learner algorithm. For example, if we choose a classification tree, Bagging and Boosting would consist of a pool of trees as big as we want.

## 1) Bagging

**Bagging** (Bootstrap Aggregation) trains N decision trees (learners) on N datasets created by random sampling with replacement. The result is then obtained by averaging the responses of the N learners (soft voting) or by majority vote (hard voting).

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/17.png"> </div>
<br>

## 2) Boosting

**Boosting** (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner. A weak learner is simply a classifier that performs poorly, but still better than random guessing.

The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are **AdaBoost** (short for Adaptive Boosting) and **Gradient Boosting**. Let's start with AdaBoost.

### <ins>a) AdaBoost </ins>

Here are the main ideas behind AdaBoost:

1. AdaBoost combines a lot of "weak learners" (predictors) to make classifications. The "weak learners" are almost always stumps (trees with a single split and two leaves).
2. Some stumps get more say in the classification than others.
3. Each stump is made by taking the previous stump's mistakes into account.

Here are the differences between a Random Forest and a Forest of Stumps (using AdaBoost):

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/18.png"> </div>
<br>
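Before walking through the procedure by hand, here is what this looks like with a library implementation. I am assuming scikit-learn (my choice, not the notes'), with a depth-1 tree as the stump:

```python
# Minimal sketch (assuming scikit-learn): AdaBoost with decision stumps as weak learners.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

stump = DecisionTreeClassifier(max_depth=1)   # a stump: one split, two leaves
ada = AdaBoostClassifier(                     # `base_estimator` in older scikit-learn versions
    estimator=stump, n_estimators=100, learning_rate=1.0, random_state=0
)

print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```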
<ins>**How to create a forest of stumps? (Using AdaBoost)**</ins>

1. Assign a $sample\_weight$ to each sample of the dataset and initialize it to $\frac{1}{\# sample\_weights}$.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/19.png" width=90%> </div>
<br>

2. Choose the 1st stump of the forest:
   - Check how well each feature classifies the samples.
   - Compute the Gini impurity of each candidate stump.
   - The stump with the lowest Gini impurity becomes the 1st stump.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/20.png" width=90%> </div>
<br>

3. Compute its amount of say:

$$amount\_of\_say = \frac{1}{2} \log \left( \frac{1 - total\_error}{total\_error} \right)$$

where $total\_error$ is the sum of the weights of the samples the stump misclassifies.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/21.png" width=90%> </div>
<br>

4. Increase/Decrease the sample weights of incorrectly/correctly classified samples using the $amount\_of\_say$ previously computed.

$$\begin{array}{ll} \text{(Increase)} \;\; new\_sample\_weight =& sample\_weight * e^{amount\_of\_say} \\ \text{(Decrease)} \;\; new\_sample\_weight =& sample\_weight * e^{- amount\_of\_say} \end{array}$$

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/22.png" width=70%> </div>
<br>

5. Normalize the $new\_sample\_weight$ values so that they sum to 1, then replace $sample\_weight$ by $new\_sample\_weight$.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/23.png" width=90%> </div>
<br>

6. Create a new dataset by repeatedly picking a random number in $[0, 1]$ and using the $new\_sample\_weight$ values as a probability distribution: pick the sample whose cumulative-weight interval contains the random number.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/24.png" width=70%> </div>
<br>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/25.png" width=90%> </div>
<br>

7. After creating the new dataset, replace all $sample\_weights$ by $\frac{1}{\# sample\_weights}$.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/26.png" width=90%> </div>
<br>

8. Repeat by going back to **step 2**. The algorithm stops when the desired number of predictors (models) is reached, or when a perfect predictor (model) is found.

<ins>**How does a forest of stumps created by AdaBoost make a classification?**</ins>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/27.png" width=90%> </div>
<br>
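In code, the final classification is just a vote weighted by each stump's amount of say. A toy sketch with made-up names and numbers (not the figure's data):

```python
from collections import defaultdict

def forest_of_stumps_predict(stumps, x):
    """Classify x with a forest of stumps.

    `stumps` is a hypothetical list of (predict_fn, amount_of_say) pairs, where
    predict_fn(x) returns the class label predicted by that stump. The class
    whose stumps accumulate the largest total amount of say wins.
    """
    total_say = defaultdict(float)
    for predict_fn, amount_of_say in stumps:
        total_say[predict_fn(x)] += amount_of_say
    return max(total_say, key=total_say.get)

# Toy example: three stumps voting on a single sample.
stumps = [
    (lambda x: "class A", 0.97),
    (lambda x: "class A", 0.32),
    (lambda x: "class B", 0.41),
]
print(forest_of_stumps_predict(stumps, x=None))   # -> "class A" (0.97 + 0.32 > 0.41)
```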
---

**How to ensure that increasing the weights of misclassified points in AdaBoost does not adversely affect the learning progress?**

The new classifier in each round might indeed classify the old points incorrectly. However, the previous 'versions' of the classifier (from previous iterations) are not thrown away. The end result is an ensemble/average of all the classifiers of each step, where the contribution of each classifier is weighted by how well that particular classifier did at that round.

For example, if we have an outlier that is hard to classify correctly, the outlier will accumulate a lot of weight. The classifier will be forced to give priority to that point and classify it correctly. This might mean that all the other points are misclassified. However, this classifier's 'opinion' will not be so important in the end because only one point was classified correctly (the outlier).

This is also a good way to detect outliers. Just find the points with very large weight.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/28.png" width=90%> </div>
<br>

<ins>**Remark:**</ins>

- In the 1st iteration, the classifier based on all data points classifies all points correctly except those with $x < 0.2$ and $y > 0.8$, and the point around $(0.4, 0.55)$ (see the circles in the second picture).
- In the 2nd iteration, exactly those points gain a higher weight so that the classifier based on that weighted sample classifies them correctly (2nd iteration, added dashed line). The combined classifiers (i.e. "the combination of the dashed lines") result in the classifier represented by the green line. The second classifier now produces other misclassifications ($x \in [0.5,0.6]$, $y \in [0.3,0.4]$), which gain more focus in the third iteration, and so on.
- At every step, the combined classifier gets closer and closer to the best shape (although not monotonically). The final classifier (i.e. the combination of all single classifiers) in the 100th iteration classifies all points correctly.

### <ins>b) Gradient Boosting </ins>

Another very popular Boosting algorithm is Gradient Boosting. Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.

Like AdaBoost, **Gradient Boosting** builds fixed-sized trees based on the previous trees' errors. Unlike AdaBoost, Gradient Boosting trees can be larger than a stump.

#### <ins>Regression:</ins>

Here is the Gradient Boosting algorithm for regression:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/29.png" width=90%> </div>
<br>

Here is our dataset:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/30.png"> </div>
<br>

For the sake of our example, we will use stumps instead of larger trees because our dataset is small.

---

<ins>**Input:**</ins>

- $x_1$ corresponds to "Height/Favorite color/Gender" of the **first row**, $y_1$ corresponds to "Weight".
- $x_2$ corresponds to "Height/Favorite color/Gender" of the **second row**, $y_2$ corresponds to "Weight".
- $x_3$ corresponds to "Height/Favorite color/Gender" of the **third row**, $y_3$ corresponds to "Weight".
- $L({\color{red}y_i}, {\color{blue}F(x)})$ corresponds to $\frac{1}{2}({\color{red}observed} - {\color{blue}predicted})^2$

<ins>**Step 1:**</ins>

$${\color{magenta}F_0(x)} = argmin_{\color{magenta}\gamma} \sum_{i=1}^n {\color{green}L}({\color{red}y_i}, {\color{magenta}\gamma})$$

We want the <span style="color:magenta">predicted value $\gamma$</span> that minimizes the sum of the <span style="color:green">loss function $L$</span>. By setting the derivative of this sum with respect to $\gamma$ to $0$, we find that the <span style="color:magenta">predicted value $\gamma$</span> is equal to the average of the <span style="color:red">observed</span> weights. This result corresponds to the initial predicted value <span style="color:magenta">$F_0(x)$</span>, which is just a leaf.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/31.png" width="70%"> </div>
<br>

<ins>**Step 2:**</ins>

We just entered the loop, so $m = 1$.

1. $$\boxed{{\color{plum}r_{i,m}} = - \left[ \frac{\partial L({\color{red}y_i}, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)}}$$

    - $- \left[ \frac{\partial L({\color{red}y_i}, F(x_i))}{\partial F(x_i)} \right ]$ corresponds to $({\color{red}observed} - {\color{blue}predicted})$.
    - Since $m = 1$, the condition ${F(x) = F_{m-1}(x)} \Leftrightarrow {F(x) = {\color{blue}F_0(x)}}$, which means the residual is $({\color{red}observed} - {\color{blue}F_0(x)})$.
    - ${\color{plum}r_{i,m}}$ is the residual of sample $i$ for tree $m$.
    - In our case, we compute the residuals for tree 1.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/32.png" width="70%"> </div>
<br>

2. We build a regression tree to predict the <span style="color:plum">residuals</span> instead of the weights. Each leaf is assigned a region $R_{j,m}$, where $j$ is the index of the leaf in tree $m$.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/33.png" width="70%"> </div>
<br>

3. $$\boxed{{\color{teal}\gamma_{j,m}} = argmin_{\color{teal}\gamma} \sum_{x_i \in R_{j,m}} L(y_i, F_{m-1}(x_i) + \gamma)}$$

    - The output value $\gamma_{j,m}$ of each leaf is the value of $\gamma$ for which the summation is minimized.
    - $x_i \in R_{j,m}$ means that only the samples belonging to that leaf's region are used in the summation.
    - By setting the derivative of the summation with respect to $\gamma$ to 0, we find that **$\gamma_{j,m}$ is equal to the average of the residuals in region $R_{j,m}$.**

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/34.png" width="70%"> </div>
<br>

4. $$\boxed{{\color{red}F_m(x)} = {\color{blue}F_{m-1}(x)} + \nu \sum_{j=1}^{J_m} {\color{darkgreen}\gamma_{j,m} I(x \in R_{j,m})}}$$

    - Since $m=1$, ${\color{red}F_m(x) = F_1(x)}$.
    - Since $m=1$, ${\color{blue}F_{m-1}(x) = F_0(x)}$.
    - $\nu$ is the learning rate. It is usually equal to $0.1$.
    - $\sum_{j=1}^{J_m} \gamma_{j,m} I(x \in R_{j,m})$ is just the tree we built in <ins>**3.**</ins>: for a sample $x$, it outputs the $\gamma_{j,m}$ of the leaf whose region $x$ falls into.

Now we set $m=2$ and we repeat.
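The whole loop can be condensed into a short sketch. This is my own minimal illustration (assuming scikit-learn's `DecisionTreeRegressor` as the base learner, squared loss, stumps), not the exact code behind the figures:

```python
# Minimal gradient-boosting-for-regression sketch: fit each new tree to the
# residuals of the current model, then add it scaled by a learning rate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_regression(X, y, M=100, nu=0.1, max_depth=1):
    F0 = y.mean()                        # Step 1: initial prediction = average target
    F = np.full(len(y), F0)              # current predictions F_{m-1}(x_i)
    trees = []
    for m in range(M):                   # Step 2: build the trees sequentially
        residuals = y - F                # 2.1: pseudo-residuals (observed - predicted)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)           # 2.2/2.3: each leaf predicts the mean residual of its region
        F = F + nu * tree.predict(X)     # 2.4: F_m = F_{m-1} + nu * (new tree)
        trees.append(tree)

    def predict(X_new):                  # Step 3: F_M(x) = F_0 + nu * sum of the trees' outputs
        out = np.full(len(X_new), F0)
        for tree in trees:
            out += nu * tree.predict(X_new)
        return out

    return predict
```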
We then get:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/35.png" width="70%"> </div>
<br>

<ins>**Step 3: Output $F_M(x)$**</ins>

Let's assume that $M=2$ (note that in practice, $M \sim 100$) and make a prediction:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/36.png" width="70%"> </div>
<br>

#### <ins>Classification:</ins>

Here is the Gradient Boosting algorithm for classification:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/37.png"> </div>
<br>

Here is our dataset:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/38.png"> </div>
<br>

For the sake of our example, we will use stumps instead of larger trees because our dataset is small.

---

<ins>**Input:**</ins>

- $L({\color{red}y_i}, {\color{blue}F(x)})$ corresponds to $\left [ -{\color{red}observed} * {\color{blue}log(\frac{p}{1-p})} + log(1 + e^{{\color{blue}log(\frac{p}{1-p})}}) \right ]$

<ins>**Step 1:**</ins>

- $F_0(x) = \gamma = log(odds) = log(\frac{p}{1-p})$, where $p$ is the overall proportion of positive samples.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/39.png" width="70%"> </div>
<br>

<ins>**Step 2:**</ins>

We just entered the loop, so $m = 1$.

1. $$\boxed{{\color{plum}r_{i,m}} = - \left[ \frac{\partial L({\color{red}y_i}, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)}}$$

    - $- \left[ \frac{\partial L({\color{red}y_i}, F(x_i))}{\partial F(x_i)} \right ]$ corresponds to $({\color{red}observed} - {\color{blue}predicted})$, where the prediction is now a probability.
    - Since $m = 1$, the condition ${F(x) = F_{m-1}(x)} \Leftrightarrow {F(x) = {\color{blue}F_0(x)}}$, which means the residual is $({\color{red}observed} - {\color{blue}p_0})$ with $p_0 = \frac{e^{F_0(x)}}{1 + e^{F_0(x)}}$ the probability predicted by $F_0$.
    - $r_{i,m}$ is the residual of sample $i$ for tree $m$.
    - In our case, we compute the residuals for tree 1.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/40.png" width="70%"> </div>
<br>

2. We build a regression tree to predict the <span style="color:plum">residuals</span> instead of the target labels. Each leaf is assigned a region $R_{j,m}$, where $j$ is the index of the leaf in tree $m$.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/41.png" width="70%"> </div>
<br>
3. Using a second-order Taylor expansion of the loss function, we find that:

$$\boxed{\gamma_{j,m} = \frac{\sum_{x_i \in R_{j,m}} r_{i,m}}{\sum_{x_i \in R_{j,m}} p_i (1 - p_i)} \;\; \text{with} \;\; p_i = \frac{e^{log(odds)}}{1 + e^{log(odds)}}}$$

where $p_i$ is the probability previously predicted for sample $x_i$.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/42.png" width="70%"> </div>
<br>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/43.png" width="70%"> </div>
<br>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/44.png" width="70%"> </div>
<br>

4. $$\boxed{{\color{red}F_m(x)} = {\color{blue}F_{m-1}(x)} + \nu \sum_{j=1}^{J_m} {\color{darkgreen}\gamma_{j,m} I(x \in R_{j,m})}}$$

    - Since $m=1$, ${\color{red}F_m(x) = F_1(x)}$.
    - Since $m=1$, ${\color{blue}F_{m-1}(x) = F_0(x)}$.
    - $\nu$ is the learning rate. It is usually equal to $0.1$.
    - $\sum_{j=1}^{J_m} \gamma_{j,m} I(x \in R_{j,m})$ is just the tree we built in <ins>**Step 2) 2.**</ins>

Now we set $m=2$ and we repeat. We then get:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/45.png" width="70%"> </div>
<br>

<ins>**Step 3: Output $F_M(x)$**</ins>

Let's assume that $M=2$ (note that in practice, $M \sim 100$) and make a prediction:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/46.png" width="70%"> </div>
<br>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/47.png" width="70%"> </div>
<br>

# IV) Hybrid Methods

Contrary to ensemble methods, a **hybrid method** takes a set of different learners and combines them using new learning techniques.

## 1) Stacking

**Stacking (or blending, or stacked generalization)** is a combining mechanism where the outputs (predictions) of several classifiers (Level 0 classifiers) are used as training data for another classifier (Level 1 classifier), **called a blender or a meta-learner**, which makes the final prediction.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/48.png" width="70%"> </div>
<br>

<ins>**Method:**</ins>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/49.png"> </div>
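To close, here is a minimal stacking sketch, assuming scikit-learn (the notes don't prescribe a library); the Level 0 learners and the logistic-regression blender are my own arbitrary choices:

```python
# Minimal stacking sketch: Level 0 classifiers feed their predictions to a
# Level 1 "blender" (meta-learner) that makes the final prediction.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

level_0 = [
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]
blender = LogisticRegression(max_iter=1000)   # Level 1 meta-learner

stack = StackingClassifier(estimators=level_0, final_estimator=blender, cv=5)
print("CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```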