The goal of this guide is to introduce the high level concepts behind decision trees and random forests.
This guide is NOT a replacement for lecture. It is meant to be a supplement to lecture.
The decision boundary of a classifier is the "line" at which a classifier changes its prediction from one class to another.
For example, the decision boundary for a logistic regression classifier might look something like:
Notice that the decision boundary here is a line (i.e. linear with respect to the features
Note: A set of data is linearly separable if it is possible to find a line that completely separates all the points in one class from the points in the other class.
But not all data is linearly separable! Consider the following data:
As you can see, the postives are clustered in a circle, with the negatives surrounding the circle. A logistic regression model would likely produce the following decision boundary:
Notice that the decision boundary is again a line. In general, logistic regression classifiers always produce linear decision boundaries. Thus, there is no hope for a logistic regression classifier to be able to distinguish positives from negatives in this data (because there is no line that separates the positives and negatives).
We need a classifier that can produce a decision boundary like this:
Such decision boundaries are called nonlinear decision boundaries. Decision trees and random forests are capable of producing these decision boundaries (as well as linear decision boundaries), which is why they are extremely useful.
Recall that in linear regression, we made the assumption that the data is generated from some process
Logistic regression also makes its own set of assumptions, which are slightly outside the scope of this course. But the bottom line is the models we have learned so far make strong assumptions about the data which may or may not be true.
In contrast, decision trees make NO assumptions about the data. This is because there are no parameters to learn in a decision tree (i.e. there is no
Below is a simple decision tree that classifies between two classes A and B.
A class is one category of the labels (remember in classification that labels are categories, not continuous values as in regression).
Some things to note about the figure:
In building a decision tree, we want to make sure that the tree we build can actually give a prediction for any point. As shown in the anatomy of a decision tree above, a decision tree can only make a prediction when it reaches a leaf node that is pure. Thus, our eventual goal in building a decision tree should be to construct a tree that has all pure leaf nodes.
Below is the algorithm that builds a decision tree which has the property that all leaf nodes are pure, as discussed above.
1) put all the training data into the root node
2) while there is an impure leaf node N:
A. find the best split rule for node N
B. use the split rule to split N into two child nodes, L and R
Since step 2 keeps repeating until there are no impure leaf nodes, we are guaranteed that all leaf nodes will be pure at the end!
Let's see an example of this procedure, where the training data contains 20 points belonging to class A and 30 belonging to class B:
As was discussed in the goals section, our goal is to keep splitting nodes until all leaf nodes are pure. This means that each time we split a node, we should split the node in a way that brings each child as close to pure as possible so we make progress towards our goal.
In other words, we have a "target node" (a pure node) that we are trying to achieve, and we want to make the split that brings our actual nodes as close to the target as possible. This sounds a lot like minimizing error/loss!
Minimizing a loss function is exactly how we are going to accomplish the goal of finding the split that brings the children nodes as close to pure as possible.
The loss function is called weighted entropy and is defined as:
where
The
where
Every time we split a node, we will choose the split that minimizes the weighted entropy of the split.
Let's look at an example of computing weighted entropy for a split. We will calculate the weighted entropy for the first split in the example decision tree above:
We achieved our goal of building a tree that contains all pure leaf nodes, which means that our tree correctly classifies all the training data points (i.e. 100% training accuracy)!
This should immediately be a red flag, based on what we know about the bias-variance tradeoff. If even one training data point is changed/removed/added, the decision tree that results could look completely different. In the language of bias-variance tradeoff, this means decision trees have high variance because they overfit to the training data. Intuitively, this means decision trees are unstable (small changes in the training data can lead to large changes in the tree).
There are many ways to address the issue that decision trees overfit. Perhaps the simplest of these ways is to stop building the tree before all the leaf nodes are pure. In this class, we will not talk about these methods in detail.
A more powerful way of addressing overfitting is by using random forests.
The basic idea behind random forests is that instead of building 1 decision tree, we can build many (100+) decision trees. Then when it comes time to make a prediction, we walk down each of the trees in the forest to find each of the trees' predictions and take the most common prediction as the prediction of the random forest.
There is one glaring issue with this approach: if every tree is trained using the same training data, then each tree is going to be identical.
Thus, we need to ensure that each tree is different somehow.
There are 2 sources of randomness in a random forest that ensures each tree is different:
Below is the algorithm that builds a random forest:
1) Generate T bootstrap resamples of the training data.
Call these resamples D_1 , D_2, ... , D_T.
2) for each D_t:
A) initialize a decision tree DecisionTree_t
B) while there is an impure leaf node N in DecisionTree_t:
i) pick a random subset of M out of d features
ii) choose the best split rule based on these M features
iii) use the split rule to split N into L and R
As in decision trees, the best split rule is chosen by minimizing weighted entropy.
The goal of a random forest is to reduce the overfitting that decision trees suffer from. Let's see if random forests accomplish this goal.
By taking the most common prediction of many trees, random forests ensure that the overall prediction is not affected too much even if 1 or a few of the trees within the forests suffer from overfitting.
Furthermore, random forests prevent individual trees from overfitting by restricting each split to using only
So we can see that random forests accomplish the goal of reducing overfitting.
As mentioned at the beginning, this guide is not a replacement for lecture so it does not contain everything you need to know about decision trees and random forests. Here is a list of some of the important topics not in this guide:
I hope this guide helped you clear up any conceptual confusion you had.
If you have any feedback about this guide or suggestions on how to make it a better resource for you, please fill out my feedback form.
If you have questions about the content of this guide, come see me in OH or email me at rkunani@berkeley.edu!