---
tags: machine-learning
---

# Decision Tree and Random Forest

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/1.png" height="100%" width="70%"> </div>

> This is my personal notes taken for the course [Machine learning](https://www.coursera.org/learn/machine-learning#syllabus) by Stanford. Feel free to check the [assignments](https://github.com/3outeille/Coursera-Labs).
>
> Note that Decision Tree and Random Forest were not tackled in the course but are considered a must-know concept for everyone starting in the field.
>
> I wouldn't have been able to understand Decision Tree and Random Forest without the [StatQuest](https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw) videos. Feel free to check him out.
>
> Also, if you want to read my other notes, feel free to check them at my [blog](https://ferdinandmom.engineer/machine-learning/).

# I) Decision Tree

**Decision Trees** are versatile machine learning algorithms that can perform both classification and regression tasks, and even multi-output tasks. They are very powerful algorithms, capable of fitting complex datasets.

There are 2 types of decision trees:

1. **Classification Trees**: When the decision tree has a categorical target variable.
2. **Regression Trees**: When the decision tree has a continuous target variable.

Here is an example of a classification tree:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/2.png" width="70%"> </div>
<br>

Here is an example of a regression tree:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/3.png" width="70%"> </div>
<br>

In order to grow those trees, we will use the **CART** algorithm, which produces only binary trees (non-leaf nodes always have exactly two children). Other algorithms, such as ID3, can produce decision trees whose nodes have more than two children.

## 1) Classification tree

How do we grow a classification tree? Let's first go through an example and then deduce the general formula from it.

Here is our dataset. We want to grow a classification tree from it.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/4.png"> </div>
<br>

The picture above shows the first 10 rows of the iris dataset. The first 4 columns are the 4 features that we will use to predict the target, the iris species, represented by the last column with numerical values: 0 for setosa, 1 for versicolor, 2 for virginica. In total, we have 150 observations (150 rows), 50 for each iris species: the dataset is balanced.

Here is the decision tree that we will get:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/5.png" width="70%"> </div>
<br>

How do we get such a tree? The tree is built iteratively from the root to the leaves using the **CART** algorithm.
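As a reference point, here is a minimal sketch of how such a tree can be grown and inspected in practice. I am assuming scikit-learn (the notes do not name a library), whose tree implementation is an optimized version of CART; the exact figure above may come from a different setup.

```python
# Minimal sketch (assuming scikit-learn): grow a CART classification tree on iris.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target        # 150 rows, 4 features, 3 balanced classes

# Gini impurity is the default splitting criterion; max_depth limits the tree size.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Print each split (feature, threshold) and the class predicted at the leaves.
print(export_text(tree, feature_names=iris.feature_names))
```

The rest of this section explains what happens inside `fit`: how each (feature, threshold) split is chosen.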
**The goal of a Decision Tree is to split the training set into homogeneous areas where only one iris species is present, according to the features given: here the petal and sepal widths.**

<ins>**Node 0: Root node**</ins>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/6.png"> </div>
<br>

The graph above shows the distribution of iris species according to the two features selected: **petal width on the x-axis** and **sepal width on the y-axis**. The color of the dots represents the iris species: **red for setosa, yellow for versicolor, blue for virginica**.

The root node, on the right of the picture above, gives us several pieces of information:

- There are $68$ irises ('$samples=68$'), the dots that we can count on the plot on the left.
- '$value=[23,20,25]$' describes the distribution of these irises among the three possible classes of iris species, i.e. $23$ setosa, $20$ versicolor, and $25$ virginica.
- '$class = virginica$'. This is the iris species predicted by the Decision Tree at the root node. This decision is taken because virginica is the most numerous species at the root node (25 virginica compared to 20 versicolor and 23 setosa). This is the reason why the background color of the left graph is blue, the color chosen for the virginica species.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/7.png"> </div>
<br>

The plot on the left is the same as the previous one, but with the first decision boundary of the tree: **petal width = 0.8 cm**.

<ins>How was this decision boundary decided?</ins>

**A decision boundary is decided by testing all the possible decision boundaries splitting the dataset and choosing the one that minimizes the Gini impurity of the two splits.**

---

<ins>**Gini impurity explanation:**</ins>

The **Gini impurity** measures how mixed a set is (it is $0$ when the node contains instances of a single class). It is defined as:

$$\boxed{G_i = 1 - \sum_{k=1}^{n} p_k^2}$$

with:
- $n$, the number of classes.
- $p_k$, the ratio of class $k$ among the training instances in the $i^{th}$ node.

---

In our case, the boundary is **petal width = 0.8 cm** because the 2 groups created by this boundary have the **lowest possible Gini impurity**.

The question to ask to determine a decision boundary is: **How do we split the iris species so that we create more homogeneous groups?**

Intuitively, what we can observe on the graph above is that it's possible to create a homogeneous group containing only setosa species just by splitting the dataset along the petal width axis. But the algorithm has no intuition. So how does it find the best split?

- It will try all the possible boundaries along all the features, i.e. along both the petal width and sepal width axes.
- For each split, the algorithm will compute the Gini impurity of the two groups created.
- Finally, it will choose the decision boundary that gives the lowest weighted sum of the two groups' Gini impurities.

In our case, the algorithm has found that, among all the possible splits, the split at **petal width = 0.8 cm** gives the lowest Gini impurity. A minimal sketch of this exhaustive search is given below.
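The following sketch makes the search concrete. It is a plain NumPy illustration with hypothetical helper names (`gini`, `best_split`), not the actual CART implementation, which is far more optimized.

```python
import numpy as np

def gini(labels):
    """Gini impurity G = 1 - sum_k p_k^2 of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Try every (feature, threshold) pair and keep the one with the
    lowest weighted Gini impurity over the two resulting groups."""
    n_samples, n_features = X.shape
    best = (None, None, np.inf)          # (feature index, threshold, weighted Gini)
    for k in range(n_features):
        # Candidate thresholds: midpoints between consecutive sorted feature values.
        values = np.unique(X[:, k])
        for t in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, k] <= t], y[X[:, k] > t]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if weighted < best[2]:
                best = (k, t, weighted)
    return best
```

Applied recursively to each resulting group, this kind of search is exactly what produces boundaries such as **petal width = 0.8 cm**.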
With this split, the Gini impurity for the left leaf is:

$$G_{left} = 1 - (\frac{23}{23})^2 = 0$$

The Gini impurity for the right leaf is:

$$G_{right} = 1 - (\frac{20}{45})^2 - (\frac{25}{45})^2 = 0.494$$

<ins>**Node 1:**</ins>

The process described above continues iteratively until the tree manages to separate all the data points, or until a restrictive condition is applied to the algorithm, such as a limit on the depth of the tree.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/8.png"> </div>
<br>

Splitting to get a homogeneous group is not always the best option. We will see that in node 2.

<ins>**Node 2:**</ins>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/9.png"> </div>
<br>

For this node, the algorithm chose to split at **petal width = 1.55 cm**, creating two heterogeneous groups. Intuitively, we would have split at **petal width = 1.3 cm** or **sepal width = 3.1 cm** so that we get a group with only versicolor irises. Let's verify which decision boundary is the best by computing the Gini impurity for both splits.

Gini impurity with the split at **petal width = 1.55 cm** (Left/Right/Weighted sum):

$$\begin{array}{ll} G_{left} &= 1 - (\frac{17}{18})^2 - (\frac{1}{18})^2 = 0.105 \\ \\ G_{right} &= 1 - (\frac{3}{5})^2 - (\frac{2}{5})^2 = 0.48 \\ \\ G_{w1} &= \frac{18}{23} * 0.105 + \frac{5}{23} * 0.48 = 0.187 \end{array}$$

Gini impurity with the split at **petal width = 1.3 cm** (Left/Right/Weighted sum):

$$\begin{array}{ll} G_{left} &= 1 - (\frac{8}{8})^2 = 0 \\ \\ G_{right} &= 1 - (\frac{12}{15})^2 - (\frac{3}{15})^2 = 0.32 \\ \\ G_{w2} &= \frac{8}{23} * 0 + \frac{15}{23} * 0.32 = 0.209 \end{array}$$

By comparing the weighted Gini impurity of our two splits, we can say that the algorithm is right and our intuition was wrong ($G_{w1} < G_{w2}$).

<ins>**Node 3:**</ins>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/10.png"> </div>
<br>

Applying the same principle again and again, the algorithm will try to isolate every point until it only has homogeneous groups. This can lead to overfitting if, for example, we don't limit the size of the tree.

<ins>How does the built tree make a decision?</ins>

When the **Decision Tree** has to predict a target, it travels down the tree from the root node until it reaches a leaf, deciding to go to the left or to the right child node by testing the feature value of the iris against the node's condition.

With this example, we can deduce the CART cost function:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/10_b.png"> </div>
<br>

where $k$ is a feature and $t_k$ a threshold for that feature (e.g. petal width = 0.8 cm).

## 2) Regression tree

The CART algorithm works mostly the same way as earlier, except that instead of trying to split the training set in a way that minimizes the Gini impurity, it now tries to split the training set in a way that minimizes the MSE (Mean Squared Error).
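The split search itself is unchanged; only the impurity measure is swapped. A minimal sketch mirroring the earlier `best_split` function (again my own illustrative code, with MSE in place of Gini):

```python
import numpy as np

def mse(targets):
    """MSE of a leaf that predicts the mean of its targets."""
    return float(np.mean((targets - targets.mean()) ** 2)) if len(targets) else 0.0

def best_regression_split(X, y):
    """Pick the (feature, threshold) pair minimizing the weighted MSE of the two groups."""
    best = (None, None, np.inf)          # (feature index, threshold, weighted MSE)
    for k in range(X.shape[1]):
        values = np.unique(X[:, k])
        for t in (values[:-1] + values[1:]) / 2:
            mask = X[:, k] <= t
            left, right = y[mask], y[~mask]
            weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
            if weighted < best[2]:
                best = (k, t, weighted)
    return best
```

Each leaf of the resulting tree predicts the average target value of the samples it contains.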
<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/11.png"> </div>
<br>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/11_b.png"> </div>
<br>

The split with the smallest MSE gives us the $k$ and $t_k$ that form our decision boundary.

Just like for classification tasks, Decision Trees are prone to overfitting when dealing with regression tasks. Without any regularization (i.e. using the default hyperparameters), you get the predictions on the left. The model is obviously overfitting the training set very badly. Just setting **min\_samples\_leaf=10** results in a much more reasonable model, represented on the right. This hyperparameter requires each leaf to contain at least **min\_samples\_leaf** training instances.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/12.png"> </div>
<br>

# II) Random Forest

Suppose you ask a complex question to thousands of random people, then aggregate their answers. In many cases, you will find that this aggregated answer is better than an expert's answer. This is called the wisdom of the crowd. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor.

A **group of predictors** is called an **ensemble**. Thus, this technique is called **Ensemble Learning**, and an **Ensemble Learning algorithm** is called an **Ensemble method**.

**Random Forest** is a tree-based technique that uses a large number of decision trees built out of randomly selected sets of features. It combines the simplicity of decision trees with flexibility, resulting in a vast improvement in accuracy.

How do we build a Random Forest?

1. Create a "bootstrapped" dataset. To do so, use one of the following techniques:
   - **Bagging**: Sampling performed with replacement.
   - **Pasting**: Sampling performed without replacement.
2. Create a decision tree using the bootstrapped dataset, but only using a random subset of features at each step.
3. Repeat steps 1 and 2.

<ins>**Step 1:**</ins> Using the bagging technique:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/13.png"> </div>
<br>

<ins>**Step 2:**</ins>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/14.png"> </div>
<br>

<ins>**Step 3:**</ins> After repeating, we get a variety of decision trees.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/15.png"> </div>
<br>

Now, to make a prediction (to make a decision), we just need to take a new example, run it through each decision tree of the random forest, and take the label with the most votes.

<ins>How do we measure the accuracy of our Random Forest?</ins>

When creating our "bootstrapped" dataset with bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. The instances that were never sampled for a given tree form its **Out-of-Bag** dataset.
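These ingredients (bagging, random feature subsets, Out-of-Bag evaluation) are bundled together in most libraries. A minimal sketch, assuming scikit-learn (the notes do not prescribe a library):

```python
# Minimal sketch (assuming scikit-learn): bagged trees with random feature subsets
# and Out-of-Bag evaluation.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,      # number of bootstrapped trees
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # sampling with replacement (bagging)
    oob_score=True,        # evaluate each tree on the samples it never saw
    random_state=0,
)
forest.fit(X, y)

print("Out-of-Bag accuracy:", forest.oob_score_)
```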
We can use them to evaluate our model by running each Out-of-Bag sample through the trees that never saw it and comparing the majority-vote prediction with the sample's true label.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/16.png"> </div>
<br>

Here, we used **hard voting**, where the label with the most votes wins. We could also use **soft voting**, where we average the predicted class probabilities of all the learners and pick the class with the highest average.

<ins>**Note:**</ins> **Soft voting** often achieves higher performance than hard voting because it gives more weight to highly confident votes.

Remember that we build our decision trees with a random subset of features? How do we concretely choose the size of this subset?

We start by selecting 2 random features to build the trees of our Random Forest and measure its accuracy. We then select 3 random features and measure the accuracy again. We compare the accuracy of the 2 models, keep the best one, and repeat the process. Typically, we start with the square root of the number of variables and then try a few settings above and below that value.

# III) Ensemble Methods

The main causes of error in learning are noise, bias and variance. Ensemble methods help to minimize these factors. These methods are designed to improve the stability and the accuracy of machine learning algorithms. Combining multiple classifiers decreases variance, especially in the case of unstable classifiers, and may produce a more reliable classification than a single classifier.

To use Bagging or Boosting, you must select a base learner algorithm. For example, if we choose a classification tree, Bagging and Boosting would consist of a pool of trees as big as we want.

## 1) Bagging

**Bagging** (Bootstrap Aggregation) trains N decision trees (learners) on N datasets created by random sampling with replacement. The result is then obtained by averaging the responses of the N learners (soft voting) or by majority vote (hard voting).

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/17.png"> </div>
<br>

## 2) Boosting

**Boosting** (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner. A weak learner is simply a classifier that performs poorly, but still better than random guessing.

The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are **AdaBoost** (short for Adaptive Boosting) and **Gradient Boosting**. Let's start with AdaBoost.

### <ins>a) AdaBoost </ins>

Here are the main ideas behind AdaBoost:

1. AdaBoost combines a lot of "weak learners" (predictors) to make classifications. The "weak learners" are almost always stumps (trees with a single split and two leaves).
2. Some stumps get more say in the classification than others.
3. Each stump is made by taking the previous stump's mistakes into account.

Here are the differences between a Random Forest and a Forest of Stumps (using AdaBoost):

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/18.png"> </div>
<br>
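Before walking through the procedure by hand, here is what this looks like with a library implementation. I am assuming scikit-learn (my choice, not the notes'), with a depth-1 tree as the stump:

```python
# Minimal sketch (assuming scikit-learn): AdaBoost with decision stumps as weak learners.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

stump = DecisionTreeClassifier(max_depth=1)   # a stump: one split, two leaves
ada = AdaBoostClassifier(                     # `base_estimator` in older scikit-learn versions
    estimator=stump, n_estimators=100, learning_rate=1.0, random_state=0
)

print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```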
<ins>**How to create a forest of stumps? (Using AdaBoost)**</ins>

1. Assign a $sample\_weight$ to each sample of the dataset and initialize it to $\frac{1}{\# sample\_weights}$.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/19.png" width=90%> </div>
<br>

2. Choose the 1st stump of the forest:
   - Check how well each feature classifies the samples.
   - Compute the Gini impurity of each candidate stump.
   - The stump with the lowest Gini impurity becomes the 1st stump.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/20.png" width=90%> </div>
<br>

3. Compute its amount of say:

$$amount\_of\_say = \frac{1}{2} \log \left( \frac{1 - total\_error}{total\_error} \right)$$

where $total\_error$ is the sum of the weights of the samples the stump misclassifies.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/21.png" width=90%> </div>
<br>

4. Increase/Decrease the sample weights of incorrectly/correctly classified samples using the $amount\_of\_say$ previously computed.

$$\begin{array}{ll} \text{(Increase)} \;\; new\_sample\_weight =& sample\_weight * e^{amount\_of\_say} \\ \text{(Decrease)} \;\; new\_sample\_weight =& sample\_weight * e^{- amount\_of\_say} \end{array}$$

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/22.png" width=70%> </div>
<br>

5. Normalize the $new\_sample\_weight$ values so that they sum to 1, then replace $sample\_weight$ by $new\_sample\_weight$.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/23.png" width=90%> </div>
<br>

6. Create a new dataset by repeatedly picking a random number in $[0, 1]$ and using the $new\_sample\_weight$ values as a probability distribution: pick the sample whose cumulative-weight interval contains the random number.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/24.png" width=70%> </div>
<br>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/25.png" width=90%> </div>
<br>

7. After creating the new dataset, replace all $sample\_weights$ by $\frac{1}{\# sample\_weights}$.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/26.png" width=90%> </div>
<br>

8. Repeat by going back to **step 2**. The algorithm stops when the desired number of predictors (models) is reached, or when a perfect predictor (model) is found.

<ins>**How does a forest of stumps created by AdaBoost make a classification?**</ins>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/27.png" width=90%> </div>
<br>
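In code, the final classification is just a vote weighted by each stump's amount of say. A toy sketch with made-up names and numbers (not the figure's data):

```python
from collections import defaultdict

def forest_of_stumps_predict(stumps, x):
    """Classify x with a forest of stumps.

    `stumps` is a hypothetical list of (predict_fn, amount_of_say) pairs, where
    predict_fn(x) returns the class label predicted by that stump. The class
    whose stumps accumulate the largest total amount of say wins.
    """
    total_say = defaultdict(float)
    for predict_fn, amount_of_say in stumps:
        total_say[predict_fn(x)] += amount_of_say
    return max(total_say, key=total_say.get)

# Toy example: three stumps voting on a single sample.
stumps = [
    (lambda x: "class A", 0.97),
    (lambda x: "class A", 0.32),
    (lambda x: "class B", 0.41),
]
print(forest_of_stumps_predict(stumps, x=None))   # -> "class A" (0.97 + 0.32 > 0.41)
```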
---

**How to ensure that increasing the weights of misclassified points in AdaBoost does not adversely affect the learning progress?**

The new classifier in each round might indeed classify the old points incorrectly. However, the previous 'versions' of the classifier (from previous iterations) are not thrown away. The end result is an ensemble/average of all the classifiers of each step, where the contribution of each classifier is weighted by how well that particular classifier did at that round.

For example, if we have an outlier that is hard to classify correctly, the outlier will accumulate a lot of weight. The classifier will be forced to give priority to that point and classify it correctly. This might mean that all the other points are misclassified. However, this classifier's 'opinion' will not be so important in the end because only one point was classified correctly (the outlier).

This is also a good way to detect outliers. Just find the points with very large weight.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/28.png" width=90%> </div>
<br>

<ins>**Remark:**</ins>

- In the 1st iteration, the classifier based on all data points classifies all points correctly except those with $x < 0.2$ and $y > 0.8$, and the point around $(0.4, 0.55)$ (see the circles in the second picture).
- In the 2nd iteration, exactly those points gain a higher weight so that the classifier based on that weighted sample classifies them correctly (2nd iteration, added dashed line). The combined classifiers (i.e. "the combination of the dashed lines") result in the classifier represented by the green line. The second classifier now produces other misclassifications ($x \in [0.5,0.6]$, $y \in [0.3,0.4]$), which gain more focus in the third iteration, and so on.
- At every step, the combined classifier gets closer and closer to the best shape (although not monotonically). The final classifier (i.e. the combination of all single classifiers) in the 100th iteration classifies all points correctly.

### <ins>b) Gradient Boosting </ins>

Another very popular Boosting algorithm is Gradient Boosting. Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.

Like AdaBoost, **Gradient Boosting** builds fixed-sized trees based on the previous trees' errors. Unlike AdaBoost, Gradient Boosting trees can be larger than a stump.

#### <ins>Regression:</ins>

Here is the Gradient Boosting algorithm for regression:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/29.png" width=90%> </div>
<br>

Here is our dataset:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/30.png"> </div>
<br>

For the sake of our example, we will use stumps instead of larger trees because our dataset is small.

---

<ins>**Input:**</ins>

- $x_1$ corresponds to "Height/Favorite color/Gender" of the **first row**, $y_1$ corresponds to "Weight".
- $x_2$ corresponds to "Height/Favorite color/Gender" of the **second row**, $y_2$ corresponds to "Weight".
- $x_3$ corresponds to "Height/Favorite color/Gender" of the **third row**, $y_3$ corresponds to "Weight".
- $L({\color{red}y_i}, {\color{blue}F(x)})$ corresponds to $\frac{1}{2}({\color{red}observed} - {\color{blue}predicted})^2$

<ins>**Step 1:**</ins>

$${\color{magenta}F_0(x)} = argmin_{\color{magenta}\gamma} \sum_{i=1}^n {\color{green}L}({\color{red}y_i}, {\color{magenta}\gamma})$$

We want the <span style="color:magenta">predicted value $\gamma$</span> that minimizes the sum of the <span style="color:green">loss function $L$</span>. By setting the derivative of this sum with respect to $\gamma$ to $0$, we find that the <span style="color:magenta">predicted value $\gamma$</span> is equal to the average of the <span style="color:red">observed</span> weights. This result corresponds to the initial predicted value <span style="color:magenta">$F_0(x)$</span>, which is just a leaf.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/31.png" width="70%"> </div>
<br>

<ins>**Step 2:**</ins>

We just entered the loop, so $m = 1$.

1. $$\boxed{{\color{plum}r_{i,m}} = - \left[ \frac{\partial L({\color{red}y_i}, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)}}$$

    - $- \left[ \frac{\partial L({\color{red}y_i}, F(x_i))}{\partial F(x_i)} \right ]$ corresponds to $({\color{red}observed} - {\color{blue}predicted})$.
    - Since $m = 1$, the condition ${F(x) = F_{m-1}(x)} \Leftrightarrow {F(x) = {\color{blue}F_0(x)}}$, which means the residual is $({\color{red}observed} - {\color{blue}F_0(x)})$.
    - ${\color{plum}r_{i,m}}$ is the residual of sample $i$ for tree $m$.
    - In our case, we compute the residuals for tree 1.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/32.png" width="70%"> </div>
<br>

2. We build a regression tree to predict the <span style="color:plum">residuals</span> instead of the weights. Each leaf is assigned a region $R_{j,m}$, where $j$ is the index of the leaf in tree $m$.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/33.png" width="70%"> </div>
<br>

3. $$\boxed{{\color{teal}\gamma_{j,m}} = argmin_{\color{teal}\gamma} \sum_{x_i \in R_{j,m}} L(y_i, F_{m-1}(x_i) + \gamma)}$$

    - The output value $\gamma_{j,m}$ of each leaf is the value of $\gamma$ for which the summation is minimized.
    - $x_i \in R_{j,m}$ means that only the samples belonging to that leaf's region are used in the summation.
    - By setting the derivative of the summation with respect to $\gamma$ to 0, we find that **$\gamma_{j,m}$ is equal to the average of the residuals in region $R_{j,m}$.**

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/34.png" width="70%"> </div>
<br>

4. $$\boxed{{\color{red}F_m(x)} = {\color{blue}F_{m-1}(x)} + \nu \sum_{j=1}^{J_m} {\color{darkgreen}\gamma_{j,m} I(x \in R_{j,m})}}$$

    - Since $m=1$, ${\color{red}F_m(x) = F_1(x)}$.
    - Since $m=1$, ${\color{blue}F_{m-1}(x) = F_0(x)}$.
    - $\nu$ is the learning rate. It is usually equal to $0.1$.
    - $\sum_{j=1}^{J_m} \gamma_{j,m} I(x \in R_{j,m})$ is just the tree we built in <ins>**3.**</ins>: for a sample $x$, it outputs the $\gamma_{j,m}$ of the leaf whose region $x$ falls into.

Now we set $m=2$ and we repeat.
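The whole loop can be condensed into a short sketch. This is my own minimal illustration (assuming scikit-learn's `DecisionTreeRegressor` as the base learner, squared loss, stumps), not the exact code behind the figures:

```python
# Minimal gradient-boosting-for-regression sketch: fit each new tree to the
# residuals of the current model, then add it scaled by a learning rate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_regression(X, y, M=100, nu=0.1, max_depth=1):
    F0 = y.mean()                        # Step 1: initial prediction = average target
    F = np.full(len(y), F0)              # current predictions F_{m-1}(x_i)
    trees = []
    for m in range(M):                   # Step 2: build the trees sequentially
        residuals = y - F                # 2.1: pseudo-residuals (observed - predicted)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)           # 2.2/2.3: each leaf predicts the mean residual of its region
        F = F + nu * tree.predict(X)     # 2.4: F_m = F_{m-1} + nu * (new tree)
        trees.append(tree)

    def predict(X_new):                  # Step 3: F_M(x) = F_0 + nu * sum of the trees' outputs
        out = np.full(len(X_new), F0)
        for tree in trees:
            out += nu * tree.predict(X_new)
        return out

    return predict
```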
We then get:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/35.png" width="70%"> </div>
<br>

<ins>**Step 3: Output $F_M(x)$**</ins>

Let's assume that $M=2$ (note that in practice, $M \sim 100$) and make a prediction:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/36.png" width="70%"> </div>
<br>

#### <ins>Classification:</ins>

Here is the Gradient Boosting algorithm for classification:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/37.png"> </div>
<br>

Here is our dataset:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/38.png"> </div>
<br>

For the sake of our example, we will use stumps instead of larger trees because our dataset is small.

---

<ins>**Input:**</ins>

- $L({\color{red}y_i}, {\color{blue}F(x)})$ corresponds to $\left [ -{\color{red}observed} * {\color{blue}log(\frac{p}{1-p})} + log(1 + e^{{\color{blue}log(\frac{p}{1-p})}}) \right ]$

<ins>**Step 1:**</ins>

- $F_0(x) = \gamma = log(odds) = log(\frac{p}{1-p})$, where $p$ is the overall proportion of positive samples.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/39.png" width="70%"> </div>
<br>

<ins>**Step 2:**</ins>

We just entered the loop, so $m = 1$.

1. $$\boxed{{\color{plum}r_{i,m}} = - \left[ \frac{\partial L({\color{red}y_i}, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)}}$$

    - $- \left[ \frac{\partial L({\color{red}y_i}, F(x_i))}{\partial F(x_i)} \right ]$ corresponds to $({\color{red}observed} - {\color{blue}predicted})$, where the prediction is now a probability.
    - Since $m = 1$, the condition ${F(x) = F_{m-1}(x)} \Leftrightarrow {F(x) = {\color{blue}F_0(x)}}$, which means the residual is $({\color{red}observed} - {\color{blue}p_0})$ with $p_0 = \frac{e^{F_0(x)}}{1 + e^{F_0(x)}}$ the probability predicted by $F_0$.
    - $r_{i,m}$ is the residual of sample $i$ for tree $m$.
    - In our case, we compute the residuals for tree 1.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/40.png" width="70%"> </div>
<br>

2. We build a regression tree to predict the <span style="color:plum">residuals</span> instead of the target labels. Each leaf is assigned a region $R_{j,m}$, where $j$ is the index of the leaf in tree $m$.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/41.png" width="70%"> </div>
<br>
3. Using a second-order Taylor expansion of the loss function, we find that:

$$\boxed{\gamma_{j,m} = \frac{\sum_{x_i \in R_{j,m}} r_{i,m}}{\sum_{x_i \in R_{j,m}} p_i (1 - p_i)} \;\; \text{with} \;\; p_i = \frac{e^{log(odds)}}{1 + e^{log(odds)}}}$$

where $p_i$ is the probability previously predicted for sample $x_i$.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/42.png" width="70%"> </div>
<br>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/43.png" width="70%"> </div>
<br>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/44.png" width="70%"> </div>
<br>

4. $$\boxed{{\color{red}F_m(x)} = {\color{blue}F_{m-1}(x)} + \nu \sum_{j=1}^{J_m} {\color{darkgreen}\gamma_{j,m} I(x \in R_{j,m})}}$$

    - Since $m=1$, ${\color{red}F_m(x) = F_1(x)}$.
    - Since $m=1$, ${\color{blue}F_{m-1}(x) = F_0(x)}$.
    - $\nu$ is the learning rate. It is usually equal to $0.1$.
    - $\sum_{j=1}^{J_m} \gamma_{j,m} I(x \in R_{j,m})$ is just the tree we built in <ins>**Step 2) 2.**</ins>

Now we set $m=2$ and we repeat. We then get:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/45.png" width="70%"> </div>
<br>

<ins>**Step 3: Output $F_M(x)$**</ins>

Let's assume that $M=2$ (note that in practice, $M \sim 100$) and make a prediction:

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/46.png" width="70%"> </div>
<br>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/47.png" width="70%"> </div>
<br>

# IV) Hybrid Methods

Contrary to ensemble methods, a **hybrid method** takes a set of different learners and combines them using new learning techniques.

## 1) Stacking

**Stacking (or blending, or stacked generalization)** is a combining mechanism where the outputs (predictions) of several classifiers (Level 0 classifiers) are used as training data for another classifier (Level 1 classifier), **called a blender or a meta-learner**, which makes the final prediction.

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/48.png" width="70%"> </div>
<br>

<ins>**Method:**</ins>

<div style="text-align:center"> <img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/decision-tree-random-forest/49.png"> </div>
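To close, here is a minimal stacking sketch, assuming scikit-learn (the notes don't prescribe a library); the Level 0 learners and the logistic-regression blender are my own arbitrary choices:

```python
# Minimal stacking sketch: Level 0 classifiers feed their predictions to a
# Level 1 "blender" (meta-learner) that makes the final prediction.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

level_0 = [
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]
blender = LogisticRegression(max_iter=1000)   # Level 1 meta-learner

stack = StackingClassifier(estimators=level_0, final_estimator=blender, cv=5)
print("CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```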