# Decision Trees and Random Forests

The goal of this guide is to introduce the high-level concepts behind decision trees and random forests.

:::info
This guide is **NOT** a replacement for lecture. It is meant to be a supplement to lecture.
:::

## Motivating Decision Trees

### Nonlinear Decision Boundaries

The ++**decision boundary**++ of a classifier is the "line" at which a classifier changes its prediction from one class to another. For example, the decision boundary for a logistic regression classifier might look something like:

![](https://i.imgur.com/0zwvZqc.png)

Notice that the decision boundary here is a line (i.e. linear with respect to the features $x_1$ and $x_2$). These ++linear decision boundaries++ work well when the data is ++linearly separable++ (or mostly linearly separable).

> **Note**: A set of data is linearly separable if it is possible to find a line that completely separates all the points in one class from the points in the other class.

But not all data is linearly separable! Consider the following data:

![](https://i.imgur.com/mRtNwvz.png)

As you can see, the positives are clustered in a circle, with the negatives surrounding the circle. A logistic regression model would likely produce the following decision boundary:

![](https://i.imgur.com/dBpoeuE.png)

Notice that the decision boundary is again a line. In general, ++logistic regression classifiers always produce linear decision boundaries++. Thus, there is no hope for a logistic regression classifier to distinguish positives from negatives in this data (because there is no *line* that separates the positives and negatives). We need a classifier that can produce a decision boundary like this:

![](https://i.imgur.com/3Q5sgQC.png)

Such decision boundaries are called ++**nonlinear decision boundaries**++. Decision trees and random forests are capable of producing these decision boundaries (as well as linear decision boundaries), which is why they are extremely useful.

### Model Assumptions

Recall that in linear regression, we made the assumption that the data is generated from some process $g(x) = \theta \cdot x + \epsilon$. This model makes two strong assumptions:

1. The true relationship between the features and the target variable is linear.
2. The noise in each point within the observed data is independent and normally distributed.

Logistic regression also makes its own set of assumptions, which are slightly outside the scope of this course. But the bottom line is ++the models we have learned so far make strong assumptions about the data which may or may not be true++.

In contrast, ++decision trees make **NO** assumptions about the data++. This is because there are no parameters to learn in a decision tree (i.e. there is no $\theta$ to solve for).

## Defining Decision Trees

### Anatomy of a Decision Tree

Below is a simple decision tree that classifies between two classes A and B.

> A ++**class**++ is one category of the labels (remember in classification that labels are categories, not continuous values as in regression).

![](https://i.imgur.com/Pk3CwSw.png)

Some things to note about the figure:

1. The root node (the topmost node) contains *all* the training data points. This is always true.
2. We have not said anything yet about how the split rules are chosen. This is the magic of decision trees, and will be discussed in the next section.
3. To make a prediction for a new point $x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$, simply start at the root node and use the split rules to walk down the tree until you reach a leaf node, which will contain the prediction for the point $x$ (a short sketch of this walk is shown below).
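To make the prediction procedure concrete, here is a minimal sketch in Python of walking down a tree. The tree structure, split thresholds, and class labels below are made up for illustration; they do not correspond to the tree in the figure above.

```python
# A minimal sketch of using split rules to walk down a decision tree.
# The tree below is hand-built and purely illustrative.

def predict(node, x):
    """Walk down the tree from `node` until a leaf is reached."""
    if node["prediction"] is not None:          # leaf node: return its class
        return node["prediction"]
    feature, threshold = node["split"]          # internal node: apply the split rule
    if x[feature] < threshold:
        return predict(node["left"], x)
    return predict(node["right"], x)

# A tiny hand-built tree: split on x_1 first, then on x_2.
tree = {
    "split": (0, 2.5),                          # rule: x_1 < 2.5 ?
    "prediction": None,
    "left": {"split": None, "prediction": "A", "left": None, "right": None},
    "right": {
        "split": (1, 1.0),                      # rule: x_2 < 1.0 ?
        "prediction": None,
        "left": {"split": None, "prediction": "B", "left": None, "right": None},
        "right": {"split": None, "prediction": "A", "left": None, "right": None},
    },
}

print(predict(tree, [3.0, 0.4]))  # x_1 >= 2.5, then x_2 < 1.0, so predicts "B"
```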
## Building a Decision Tree

### Notation/Terminology

- A node is considered ++**pure**++ if all the points in the node belong to one class (likewise, an impure node contains points belonging to more than one class).

### Design Goals

In building a decision tree, we want to make sure that the tree we build can actually give a prediction for any point. As shown in the anatomy of a decision tree above, a decision tree can only make a prediction when it reaches a leaf node that is pure. Thus, our eventual goal in building a decision tree should be to construct a tree that has ++all pure leaf nodes++.

### The Algorithm

Below is the algorithm that builds a decision tree which has the property that all leaf nodes are pure, as discussed above.

```
1) put all the training data into the root node
2) while there is an impure leaf node N:
    A. find the best split rule for node N
    B. use the split rule to split N into two child nodes, L and R
```

Since step 2 keeps repeating until there are no impure leaf nodes, we are guaranteed that all leaf nodes will be pure at the end!

Let's see an example of this procedure, where the training data contains 20 points belonging to class A and 30 belonging to class B:

![](https://i.imgur.com/SXMnCXT.png)

![](https://i.imgur.com/rTmaYaN.png)

![](https://i.imgur.com/aI8b1GU.png)

### Deciding the Best Split

As was discussed in the goals section, our goal is to keep splitting nodes until all leaf nodes are pure. This means that each time we split a node, we should split it in a way that brings each child as close to pure as possible, so that we make progress towards our goal. In other words, we have a "target node" (a pure node) that we are trying to achieve, and we want to make the split that brings our actual nodes as close to the target as possible. *This sounds a lot like minimizing error/loss!*

Minimizing a loss function is exactly how we are going to accomplish the goal of finding the split that brings the children nodes as close to pure as possible. The loss function is called ++**weighted entropy**++ and is defined as:

$$\text{Weighted Entropy} = \frac{N_L S(L) + N_R S(R)}{N_L + N_R}$$

where $L$ and $R$ are the left and right children produced by the split, and $N_L$ and $N_R$ are the number of points in $L$ and $R$, respectively. The $S(N)$ term is the ++**entropy**++ of node $N$ and is defined as:

$$S(N) = - \sum \limits_{C} p_C \times \log_2 p_C$$

where $p_C$ represents the proportion of points in node $N$ belonging to class $C$.

**++Every time we split a node, we will choose the split that minimizes the weighted entropy of the split++**.

Let's look at an example of computing weighted entropy for a split. We will calculate the weighted entropy for the first split in the example decision tree above:

![](https://i.imgur.com/pJil2CM.png)
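If you want to check a weighted entropy calculation by hand, here is a short Python sketch of the two formulas above. The root node uses the 20 A / 30 B counts from the running example; the child-node counts in the hypothetical split are made up for illustration and are not necessarily the ones shown in the figure.

```python
import math

def entropy(counts):
    """Entropy S(N) of a node, given the count of points of each class in it."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def weighted_entropy(left_counts, right_counts):
    """Weighted entropy of a split into children L and R."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    return (n_left * entropy(left_counts) + n_right * entropy(right_counts)) / (n_left + n_right)

# Root node from the example: 20 points of class A, 30 points of class B.
print(entropy([20, 30]))                    # ~0.971

# A hypothetical split of the root (counts are illustrative):
# left child gets 20 A and 10 B, right child gets 0 A and 20 B (pure).
print(weighted_entropy([20, 10], [0, 20]))  # ~0.551, lower than the root's entropy
```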
## Performance Evaluation of Decision Trees

### Overfitting

We achieved our goal of building a tree that contains all pure leaf nodes, which means that our tree correctly classifies all the training data points (i.e. 100% training accuracy)! This should immediately be a red flag, based on what we know about the bias-variance tradeoff.

If even one training data point is changed, removed, or added, the resulting decision tree could look completely different. In the language of the bias-variance tradeoff, this means ++decision trees have high variance because they overfit to the training data++. Intuitively, this means decision trees are unstable (small changes in the training data can lead to large changes in the tree).

### Overcoming Overfitting

There are *many* ways to address the issue that decision trees overfit. Perhaps the simplest of these is to stop building the tree before all the leaf nodes are pure. In this class, we will not talk about these methods in detail. A more powerful way of addressing overfitting is by using random forests.

## Random Forests

The basic idea behind random forests is that instead of building 1 decision tree, we can build many (100+) decision trees. Then, when it comes time to make a prediction, we walk down each of the trees in the forest to find each tree's prediction and take the most common prediction as the prediction of the random forest.

There is one glaring issue with this approach: ++if every tree is trained using the same training data, then each tree is going to be identical.++ Thus, we need to ensure that each tree is different somehow.

### Randomness in a Random Forest

There are 2 sources of randomness in a random forest that ensure each tree is different:

1) Each tree is trained on a bootstrap resample (sample *with replacement*) of the training data.
2) Within each tree, each split only considers a random subset of size $M$ out of all possible $d$ features.

### The Algorithm

Below is the algorithm that builds a random forest:

```
1) Generate T bootstrap resamples of the training data.
   Call these resamples D_1, D_2, ..., D_T.
2) for each D_t:
    A) initialize a decision tree DecisionTree_t
    B) while there is an impure leaf node N in DecisionTree_t:
        i)   pick a random subset of M out of d features
        ii)  choose the best split rule based on these M features
        iii) use the split rule to split N into L and R
```

As in decision trees, the best split rule is chosen by minimizing weighted entropy.

### Performance Evaluation of Random Forests

The goal of a random forest is to reduce the overfitting that decision trees suffer from. Let's see if random forests accomplish this goal.

By taking the most common prediction of many trees, random forests ensure that the overall prediction is not affected too much even if one or a few of the trees within the forest overfit. Furthermore, random forests prevent individual trees from overfitting by restricting each split to using only $M$ of the possible $d$ features. So we can see that random forests accomplish the goal of reducing overfitting.
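If you want to experiment with these ideas yourself, here is a minimal sketch using scikit-learn (a library choice assumed here; this guide does not prescribe one). The circular dataset mirrors the motivating example, and the hyperparameters shown correspond to the two sources of randomness and to splitting by minimizing weighted entropy. The hyperparameter values are illustrative, not definitive.

```python
# A minimal sketch, assuming scikit-learn; dataset and hyperparameter values are illustrative.
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Circular data, like the motivating example: one class inside, the other surrounding it.
X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A single fully grown tree: keeps splitting until leaves are pure,
# so it typically reaches 100% training accuracy.
tree = DecisionTreeClassifier(criterion="entropy").fit(X_train, y_train)

# A random forest: T trees, each trained on a bootstrap resample,
# each split restricted to a random subset of the features.
forest = RandomForestClassifier(
    n_estimators=100,     # T trees
    bootstrap=True,       # each tree sees a bootstrap resample of the training data
    max_features="sqrt",  # M = sqrt(d) features considered at each split
    criterion="entropy",  # splits chosen by minimizing weighted entropy
    random_state=42,
).fit(X_train, y_train)

print("tree   train/test accuracy:", tree.score(X_train, y_train), tree.score(X_test, y_test))
print("forest train/test accuracy:", forest.score(X_train, y_train), forest.score(X_test, y_test))
```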
## Notable Things Not Discussed in this Guide

As mentioned at the beginning, this guide is not a replacement for lecture, so it does not contain everything you need to know about decision trees and random forests. Here is a list of some of the important topics not in this guide:

- What happens when a leaf node is not pure? Can you still make a prediction?
- What are the hyperparameters of a decision tree?
- Can a node that is impure always be split?
- How do decision trees and random forests compare in terms of the bias-variance tradeoff? How do the hyperparameters of both decision trees and random forests affect their bias-variance tradeoff?
- How do decision trees and random forests work for regression?

# Conclusion & Feedback

I hope this guide helped you clear up any conceptual confusion you had. If you have any feedback about this guide or suggestions on how to make it a better resource for you, please fill out my [feedback form](https://tinyurl.com/raguvirTAfeedback). If you have questions about the content of this guide, come see me in OH or email me at rkunani@berkeley.edu!