# Decision Trees and Random Forests

The goal of this guide is to introduce the high-level concepts behind decision trees and random forests.

:::info
This guide is **NOT** a replacement for lecture. It is meant to be a supplement to lecture.
:::

## Motivating Decision Trees

### Nonlinear Decision Boundaries

The ++**decision boundary**++ of a classifier is the "line" at which a classifier changes its prediction from one class to another. For example, the decision boundary for a logistic regression classifier might look something like:

![](https://i.imgur.com/0zwvZqc.png)

Notice that the decision boundary here is a line (i.e. linear with respect to the features $x_1$ and $x_2$). These ++linear decision boundaries++ work well when the data is ++linearly separable++ (or mostly linearly separable).

> **Note**: A set of data is linearly separable if it is possible to find a line that completely separates all the points in one class from the points in the other class.

But not all data is linearly separable! Consider the following data:

![](https://i.imgur.com/mRtNwvz.png)

As you can see, the positives are clustered in a circle, with the negatives surrounding the circle. A logistic regression model would likely produce the following decision boundary:

![](https://i.imgur.com/dBpoeuE.png)

Notice that the decision boundary is again a line. In general, ++logistic regression classifiers always produce linear decision boundaries++. Thus, there is no hope for a logistic regression classifier to distinguish positives from negatives in this data (because there is no *line* that separates the positives and negatives). We need a classifier that can produce a decision boundary like this:

![](https://i.imgur.com/3Q5sgQC.png)

Such decision boundaries are called ++**nonlinear decision boundaries**++. Decision trees and random forests are capable of producing these decision boundaries (as well as linear decision boundaries), which is why they are extremely useful.

### Model Assumptions

Recall that in linear regression, we made the assumption that the data is generated from some process $g(x) = \theta \cdot x + \epsilon$. This model makes two strong assumptions:

1. The true relationship between the features and the target variable is linear.
2. The noise in each point within the observed data is independent and normally distributed.

Logistic regression also makes its own set of assumptions, which are slightly outside the scope of this course. But the bottom line is ++the models we have learned so far make strong assumptions about the data which may or may not be true++.

In contrast, ++decision trees make **NO** assumptions about the data++. This is because there are no parameters to learn in a decision tree (i.e. there is no $\theta$ to solve for).

## Defining Decision Trees

### Anatomy of a Decision Tree

Below is a simple decision tree that classifies between two classes A and B.

> A ++**class**++ is one category of the labels (remember in classification that labels are categories, not continuous values as in regression).

![](https://i.imgur.com/Pk3CwSw.png)

Some things to note about the figure:

1. The root node (the topmost node) contains *all* the training data points. This is always true.
2. We have not said anything yet about how the split rules are chosen. This is the magic of decision trees, and will be discussed in the next section.
3. To make a prediction for a new point $x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$, simply start at the root node and use the split rules to walk down the tree until you reach a leaf node, which will contain the prediction for the point $x$ (a short sketch of this walk is shown below).
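To make the prediction procedure concrete, here is a minimal sketch in Python of walking down a tree. The tree structure, split thresholds, and class labels below are made up for illustration; they do not correspond to the tree in the figure above.

```python
# A minimal sketch of using split rules to walk down a decision tree.
# The tree below is hand-built and purely illustrative.

def predict(node, x):
    """Walk down the tree from `node` until a leaf is reached."""
    if node["prediction"] is not None:          # leaf node: return its class
        return node["prediction"]
    feature, threshold = node["split"]          # internal node: apply the split rule
    if x[feature] < threshold:
        return predict(node["left"], x)
    return predict(node["right"], x)

# A tiny hand-built tree: split on x_1 first, then on x_2.
tree = {
    "split": (0, 2.5),                          # rule: x_1 < 2.5 ?
    "prediction": None,
    "left": {"split": None, "prediction": "A", "left": None, "right": None},
    "right": {
        "split": (1, 1.0),                      # rule: x_2 < 1.0 ?
        "prediction": None,
        "left": {"split": None, "prediction": "B", "left": None, "right": None},
        "right": {"split": None, "prediction": "A", "left": None, "right": None},
    },
}

print(predict(tree, [3.0, 0.4]))  # x_1 >= 2.5, then x_2 < 1.0, so predicts "B"
```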
## Building a Decision Tree

### Notation/Terminology

- A node is considered ++**pure**++ if all the points in the node belong to one class (likewise, an impure node contains points belonging to more than one class).

### Design Goals

In building a decision tree, we want to make sure that the tree we build can actually give a prediction for any point. As shown in the anatomy of a decision tree above, a decision tree can only make a prediction when it reaches a leaf node that is pure. Thus, our eventual goal in building a decision tree should be to construct a tree that has ++all pure leaf nodes++.

### The Algorithm

Below is the algorithm that builds a decision tree which has the property that all leaf nodes are pure, as discussed above.

```
1) put all the training data into the root node
2) while there is an impure leaf node N:
    A. find the best split rule for node N
    B. use the split rule to split N into two child nodes, L and R
```

Since step 2 keeps repeating until there are no impure leaf nodes, we are guaranteed that all leaf nodes will be pure at the end!

Let's see an example of this procedure, where the training data contains 20 points belonging to class A and 30 belonging to class B:

![](https://i.imgur.com/SXMnCXT.png)

![](https://i.imgur.com/rTmaYaN.png)

![](https://i.imgur.com/aI8b1GU.png)

### Deciding the Best Split

As was discussed in the goals section, our goal is to keep splitting nodes until all leaf nodes are pure. This means that each time we split a node, we should split it in a way that brings each child as close to pure as possible, so that we make progress towards our goal. In other words, we have a "target node" (a pure node) that we are trying to achieve, and we want to make the split that brings our actual nodes as close to the target as possible. *This sounds a lot like minimizing error/loss!*

Minimizing a loss function is exactly how we are going to accomplish the goal of finding the split that brings the children nodes as close to pure as possible. The loss function is called ++**weighted entropy**++ and is defined as:

$$\text{Weighted Entropy} = \frac{N_L S(L) + N_R S(R)}{N_L + N_R}$$

where $L$ and $R$ are the left and right children produced by the split, and $N_L$ and $N_R$ are the number of points in $L$ and $R$, respectively. The $S(N)$ term is the ++**entropy**++ of node $N$ and is defined as:

$$S(N) = - \sum \limits_{C} p_C \times \log_2 p_C$$

where $p_C$ represents the proportion of points in node $N$ belonging to class $C$.

**++Every time we split a node, we will choose the split that minimizes the weighted entropy of the split++**.

Let's look at an example of computing weighted entropy for a split. We will calculate the weighted entropy for the first split in the example decision tree above:

![](https://i.imgur.com/pJil2CM.png)
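If you want to check a weighted entropy calculation by hand, here is a short Python sketch of the two formulas above. The root node uses the 20 A / 30 B counts from the running example; the child-node counts in the hypothetical split are made up for illustration and are not necessarily the ones shown in the figure.

```python
import math

def entropy(counts):
    """Entropy S(N) of a node, given the count of points of each class in it."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def weighted_entropy(left_counts, right_counts):
    """Weighted entropy of a split into children L and R."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    return (n_left * entropy(left_counts) + n_right * entropy(right_counts)) / (n_left + n_right)

# Root node from the example: 20 points of class A, 30 points of class B.
print(entropy([20, 30]))                    # ~0.971

# A hypothetical split of the root (counts are illustrative):
# left child gets 20 A and 10 B, right child gets 0 A and 20 B (pure).
print(weighted_entropy([20, 10], [0, 20]))  # ~0.551, lower than the root's entropy
```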
## Performance Evaluation of Decision Trees

### Overfitting

We achieved our goal of building a tree that contains all pure leaf nodes, which means that our tree correctly classifies all the training data points (i.e. 100% training accuracy)! This should immediately be a red flag, based on what we know about the bias-variance tradeoff.

If even one training data point is changed, removed, or added, the resulting decision tree could look completely different. In the language of the bias-variance tradeoff, this means ++decision trees have high variance because they overfit to the training data++. Intuitively, this means decision trees are unstable (small changes in the training data can lead to large changes in the tree).

### Overcoming Overfitting

There are *many* ways to address the issue that decision trees overfit. Perhaps the simplest of these is to stop building the tree before all the leaf nodes are pure. In this class, we will not talk about these methods in detail. A more powerful way of addressing overfitting is by using random forests.

## Random Forests

The basic idea behind random forests is that instead of building 1 decision tree, we can build many (100+) decision trees. Then, when it comes time to make a prediction, we walk down each of the trees in the forest to find each tree's prediction and take the most common prediction as the prediction of the random forest.

There is one glaring issue with this approach: ++if every tree is trained using the same training data, then each tree is going to be identical.++ Thus, we need to ensure that each tree is different somehow.

### Randomness in a Random Forest

There are 2 sources of randomness in a random forest that ensure each tree is different:

1) Each tree is trained on a bootstrap resample (sample *with replacement*) of the training data.
2) Within each tree, each split only considers a random subset of size $M$ out of all possible $d$ features.

### The Algorithm

Below is the algorithm that builds a random forest:

```
1) Generate T bootstrap resamples of the training data.
   Call these resamples D_1, D_2, ..., D_T.
2) for each D_t:
    A) initialize a decision tree DecisionTree_t
    B) while there is an impure leaf node N in DecisionTree_t:
        i)   pick a random subset of M out of d features
        ii)  choose the best split rule based on these M features
        iii) use the split rule to split N into L and R
```

As in decision trees, the best split rule is chosen by minimizing weighted entropy.

### Performance Evaluation of Random Forests

The goal of a random forest is to reduce the overfitting that decision trees suffer from. Let's see if random forests accomplish this goal.

By taking the most common prediction of many trees, random forests ensure that the overall prediction is not affected too much even if one or a few of the trees within the forest overfit. Furthermore, random forests prevent individual trees from overfitting by restricting each split to using only $M$ of the possible $d$ features. So we can see that random forests accomplish the goal of reducing overfitting.
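If you want to experiment with these ideas yourself, here is a minimal sketch using scikit-learn (a library choice assumed here; this guide does not prescribe one). The circular dataset mirrors the motivating example, and the hyperparameters shown correspond to the two sources of randomness and to splitting by minimizing weighted entropy. The hyperparameter values are illustrative, not definitive.

```python
# A minimal sketch, assuming scikit-learn; dataset and hyperparameter values are illustrative.
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Circular data, like the motivating example: one class inside, the other surrounding it.
X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A single fully grown tree: keeps splitting until leaves are pure,
# so it typically reaches 100% training accuracy.
tree = DecisionTreeClassifier(criterion="entropy").fit(X_train, y_train)

# A random forest: T trees, each trained on a bootstrap resample,
# each split restricted to a random subset of the features.
forest = RandomForestClassifier(
    n_estimators=100,     # T trees
    bootstrap=True,       # each tree sees a bootstrap resample of the training data
    max_features="sqrt",  # M = sqrt(d) features considered at each split
    criterion="entropy",  # splits chosen by minimizing weighted entropy
    random_state=42,
).fit(X_train, y_train)

print("tree   train/test accuracy:", tree.score(X_train, y_train), tree.score(X_test, y_test))
print("forest train/test accuracy:", forest.score(X_train, y_train), forest.score(X_test, y_test))
```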
## Notable Things Not Discussed in this Guide

As mentioned at the beginning, this guide is not a replacement for lecture, so it does not contain everything you need to know about decision trees and random forests. Here is a list of some of the important topics not in this guide:

- What happens when a leaf node is not pure? Can you still make a prediction?
- What are the hyperparameters of a decision tree?
- Can a node that is impure always be split?
- How do decision trees and random forests compare in terms of the bias-variance tradeoff? How do the hyperparameters of both decision trees and random forests affect their bias-variance tradeoff?
- How do decision trees and random forests work for regression?

# Conclusion & Feedback

I hope this guide helped you clear up any conceptual confusion you had. If you have any feedback about this guide or suggestions on how to make it a better resource for you, please fill out my [feedback form](https://tinyurl.com/raguvirTAfeedback). If you have questions about the content of this guide, come see me in OH or email me at rkunani@berkeley.edu!