---
tags: machine-learning
---

# Advice for applying Machine Learning

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/advice-for-applying-ml/1.png" height="100%" width="70%">
</div>

> These are my personal notes taken for the course [Machine learning](https://www.coursera.org/learn/machine-learning#syllabus) by Stanford. Feel free to check the [assignments](https://github.com/3outeille/Coursera-Labs).
> Also, if you want to read my other notes, feel free to check them at my [blog](https://ferdinandmom.engineer/machine-learning/).

## I) Evaluating a Learning Algorithm

Suppose you have implemented regularized linear regression to predict housing prices. However, when you test your hypothesis on a new set of houses, you find that it makes unacceptably large errors in its predictions (maybe it overfits or underfits the training set?). Thus, you want to improve it.

Errors in your predictions can be addressed by:
- Getting more training examples.
- Trying smaller sets of features.
- Trying additional features.
- Trying polynomial features ($x^2_1$, $x^2_2$, $x_1x_2$, ...).
- Increasing $\lambda$.
- Decreasing $\lambda$.

Don't just pick one of these avenues at random. The question is then: which one should you pick in order to improve your model?

To do so, we are going to answer the following questions:

1. **What is Bias/Variance?**
2. **How are the degree of polynomial $d$ and Bias/Variance related?**
3. **How are the regularization parameter $\lambda$ and Bias/Variance related?**
4. **How are the number of examples $m$ and Bias/Variance related?**

### <ins>1) What is Bias/Variance?</ins>

**Bias** and **Variance** are **error measures based on average performance across possible training sets**. Understanding how different sources of error lead to bias and variance helps us improve the data fitting process, resulting in more accurate models. We define bias and variance in two ways: **conceptually** and **graphically**. Then, we will go through an illustrative example.

<ins>**Conceptual definition:**</ins>

- **Error due to Bias**: The error due to bias is taken as the **difference between the expected (or average) prediction of our model and the correct value which we are trying to predict**. Of course you only have one model, so talking about expected or average prediction values might seem a little strange. However, imagine you could repeat the whole model building process more than once: each time you gather new data and run a new analysis creating a new model. Due to randomness in the underlying data sets, the resulting models will have a range of predictions. **Bias measures how far off in general these models' predictions are from the correct value**.

- **Error due to Variance**: The error due to variance is taken as the **variability of a model prediction for a given data point**. Again, imagine you can repeat the entire model building process multiple times. The **variance** is **how much the predictions for a given point vary between different realizations of the model**.

<ins>**Graphical definition:**</ins>

We can create a graphical visualization of bias and variance using a bulls-eye diagram. Imagine that the center of the target is a model that perfectly predicts the correct values. As we move away from the bulls-eye, our predictions get worse and worse. Imagine we can repeat our entire model building process to get a number of separate hits on the target.
Each hit represents an individual realization of our model, given the chance variability in the training data we gather. Sometimes we will get a good distribution of training data so we predict very well and we are close to the bulls-eye, while sometimes our training data might be full of outliers or non-standard values resulting in poorer predictions. These different realizations result in a scatter of hits on the target.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/advice-for-applying-ml/2.png">
</div>
<br>

<ins>**Illustrative Example: Voting Intentions:**</ins>

Let's undertake a simple model building task. We wish to create a model for the percentage of people who will vote for a Republican president in the next election. As models go, this is conceptually trivial and is much simpler than what people commonly envision when they think of "modeling", but it helps us to cleanly illustrate the difference between bias and variance.

A straightforward, if flawed (as we will see below), way to build this model would be to randomly choose 50 numbers from the phone book, call each one and ask them who they planned to vote for in the next election. Imagine we got the following results:

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/advice-for-applying-ml/3.png">
</div>
<br>

From the data, we estimate that the probability of voting Republican is 13/(13+16), or 44.8%. We put out our press release that the Democrats are going to win by over 10 points; but, when the election comes around, it turns out they actually lose by 10 points. That certainly reflects poorly on us. Where did we go wrong in our model?

Clearly, there are many issues with the trivial model we built. A list would include: we only sample people from the phone book, and so only include people with listed numbers; we did not follow up with non-respondents, who might have different voting patterns from the respondents; we do not try to weight responses by likeliness to vote; and we have a very small sample size.

It is tempting to lump all these causes of error into one big box. However, they can actually be separated into sources of bias and sources of variance.

For instance, using a phone book to select participants in our survey is one of our sources of bias. By only surveying certain classes of people, it skews the results in a way that will be consistent if we repeated the entire model building exercise. Similarly, not following up with non-respondents is another source of bias, as it consistently changes the mixture of responses we get. On our bulls-eye diagram these move us away from the center of the target, but they would not result in an increased scatter of estimates.

On the other hand, the small sample size is a source of variance. If we increased our sample size, the results would be more consistent each time we repeated the survey and prediction. The results still might be highly inaccurate due to our large sources of bias, but the variance of predictions will be reduced. On the bulls-eye diagram, the low sample size results in a wide scatter of estimates. Increasing the sample size would make the estimates clump closer together, but they still might miss the center of the target.

Again, this voting model is trivial and quite removed from the modeling tasks most often faced in practice. In general the data set used to build the model is provided prior to model construction and the modeler cannot simply say, "Let's increase the sample size to reduce variance." In practice an explicit trade-off exists between bias and variance, where decreasing one increases the other. Minimizing the total error of the model requires a careful balancing of these two forms of error.
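To make these two definitions concrete before moving on, here is a minimal NumPy sketch (not part of the course material; the true function, the noise level, the query point and the polynomial degrees are arbitrary assumptions for illustration). It repeatedly draws a fresh training set from a known data-generating process, refits the same class of model each time, and measures the bias and variance of the predictions at a single query point $x_0$:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed ground-truth function (an arbitrary choice for this illustration).
    return np.sin(2 * np.pi * x)

def sample_training_set(m=20, noise=0.3):
    # One "realization" of the data gathering process.
    x = rng.uniform(0, 1, m)
    y = true_f(x) + rng.normal(0, noise, m)
    return x, y

def bias_and_variance(degree, x0, n_realizations=500):
    """Refit a polynomial of the given degree on many fresh training sets
    and look at its prediction at the single query point x0."""
    preds = []
    for _ in range(n_realizations):
        x, y = sample_training_set()
        coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
        preds.append(np.polyval(coeffs, x0))
    preds = np.array(preds)
    bias = preds.mean() - true_f(x0)             # how far off "in general"
    variance = preds.var()                       # spread across realizations
    return bias, variance

x0 = 0.25
for degree in (1, 6):
    bias, variance = bias_and_variance(degree, x0)
    print(f"degree {degree}: bias = {bias:+.3f}, variance = {variance:.3f}")
```

With these assumptions, the degree-1 model typically shows a large bias and a small variance, while the degree-6 model shows the opposite, which is exactly the trade-off pictured in the bulls-eye diagram above.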
### <ins>2) How are the degree of polynomial $d$ and Bias/Variance related?</ins>

We have already seen the problem of overfitting many times: just because a learning algorithm fits a **training set** well, that doesn't mean it's a good hypothesis. So how do we pick a good hypothesis function? By choosing the right degree of polynomial $d$.

To solve our initial problem, we can introduce a third set, the **Cross Validation Set**, to serve as an intermediate set on which we select $d$. Then our test set will give us an accurate, non-optimistic error.

Here is one way to break down our dataset into three sets (these are called the **training/validation/test sets**):
- Training set: 60%.
- Cross Validation set (or Validation set): 20%.
- Test set: 20%.

We can now calculate three separate error values for the three different sets.
- Optimize the parameters $\theta$ using the **training set** for each polynomial degree.
- Using the **cross validation set**, find the polynomial degree $d$ with the lowest cross validation error.
- Estimate the generalization error using the test set with $J_{test}(\theta^{(d)})$, where $\theta^{(d)}$ are the parameters learned for the degree $d$ that gave the lowest cross validation error.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/advice-for-applying-ml/4.png">
</div>
<br>

This way, the degree of the polynomial $d$ has not been trained using the test set.

Now that we know what bias and variance are, we need to distinguish whether bias or variance is the problem contributing to bad predictions. The training error will tend to **decrease** as we increase the degree $d$ of the polynomial. At the same time, the cross validation error will tend to **decrease** as we increase $d$ up to a point, and then it will **increase** as $d$ is increased, forming a convex curve. This is represented in the figure below:

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/advice-for-applying-ml/5.png">
</div>
<br>

- **High bias** (underfitting): both $J_{train}(\theta)$ and $J_{CV}(\theta)$ will be **high**. Also, $J_{train}(\theta) \approx J_{CV}(\theta)$.
- **High variance** (overfitting): $J_{train}(\theta)$ will be **low** and $J_{CV}(\theta)$ will be **much greater** than $J_{train}(\theta)$.
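As a concrete sketch of this selection procedure (not taken from the course; the synthetic data, the 60/20/20 split and the use of scikit-learn are assumptions for illustration), the snippet below fits polynomial regressions of degree 1 to 10 on the training set, picks the degree with the lowest cross validation error, and only then reports $J_{test}$:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic 1-D regression data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.2, 200)

# 60% training / 20% cross validation / 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

cv_errors, fitted = {}, {}
for d in range(1, 11):
    poly = PolynomialFeatures(degree=d)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)  # optimize theta on the training set
    cv_errors[d] = mean_squared_error(y_cv, model.predict(poly.transform(X_cv)))  # J_cv for this degree
    fitted[d] = (poly, model)

best_d = min(cv_errors, key=cv_errors.get)        # degree with the lowest cross validation error
poly, model = fitted[best_d]
j_test = mean_squared_error(y_test, model.predict(poly.transform(X_test)))  # generalization estimate
print(f"best degree d = {best_d}, J_test = {j_test:.4f}")
```

Note that mean squared error differs from the course's cost $J(\theta) = \frac{1}{2m}\sum_i (h_\theta(x^{(i)}) - y^{(i)})^2$ only by a constant factor, so the selected degree is the same.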
<ins>**Bonus: Cross Validation concept**</ins>

Before testing out any model, would you not like to test it with an independent dataset? Normally, in any prediction problem, your model works on a known dataset, also called the training dataset. However, in the real world, your model will have to work on an unknown dataset. Under such circumstances, will your model be able to predict the outcome correctly? You do not know unless you test your model on held-out data. This testing is what we refer to as **Cross Validation**. Once your model passes this test, it is fit to work on unseen data. Thus, **Cross Validation** is a technique for assessing how your prediction model performs on an unknown dataset.

There are two types of cross validation:
- **Exhaustive Cross Validation**: This method involves testing the model on all possible ways of dividing the original sample into training and validation sets.
- **Non-Exhaustive Cross Validation**: Here, you do not split the original sample into all the possible permutations and combinations.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/advice-for-applying-ml/6.png" width=70%>
</div>
<br>

### <ins>3) How are the regularization parameter $\lambda$ and Bias/Variance related?</ins>

Instead of looking at the degree $d$ contributing to Bias/Variance, now we will look at the **regularization parameter $\lambda$**.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/advice-for-applying-ml/7.png">
</div>
<br>

We can say the following:
- **Small $\lambda$**: High variance (overfitting).
- **Intermediate $\lambda$**: Perfect.
- **Large $\lambda$**: High bias (underfitting). (Large $\lambda$ $\rightarrow$ $\theta$s near 0 $\rightarrow$ underfitting.)

The relationship of $\lambda$ to the training set and the validation set is as follows:
- **Low $\lambda$**: $J_{train}(\theta)$ is low and $J_{CV}(\theta)$ is high (**high variance/overfitting**).
- **Intermediate $\lambda$**: $J_{train}(\theta)$ and $J_{CV}(\theta)$ are somewhat low and $J_{train}(\theta) \approx J_{CV}(\theta)$.
- **Large $\lambda$**: both $J_{train}(\theta)$ and $J_{CV}(\theta)$ will be high (**high bias/underfitting**).

<ins>**How to choose the right regularization $\lambda$?**</ins>

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/advice-for-applying-ml/8.png">
</div>
<br>

1. Create a list of $\lambda$s (i.e. $\lambda \in \left \{0,0.01,0.02,0.04,0.08, ...\right \}$).
2. Plug each $\lambda$ into the regularization term of the cost function $J$ (model).
3. Minimize the cost function for each $\lambda$.
4. Compute the cross validation error using the $\theta$ learned with that $\lambda$, but evaluate $J_{CV}(\theta)$ without regularization (i.e. with $\lambda = 0$).
5. Select the combination ($\theta$, $\lambda$) that produces the lowest error on the cross validation set.
6. Using that best combination ($\theta$, $\lambda$), evaluate $J_{test}(\theta)$ to see whether it generalizes well.

### <ins>4) How are the number of examples $m$ and Bias/Variance related?</ins>

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/advice-for-applying-ml/9.png">
</div>
<br>

- **Low training set size $m$**: causes $J_{train}(\theta)$ to be low and $J_{CV}(\theta)$ to be high.
- **Large training set size $m$**: causes both $J_{train}(\theta)$ and $J_{CV}(\theta)$ to be high, with $J_{train}(\theta) \approx J_{CV}(\theta)$.

<ins>**Remark:**</ins> If a learning algorithm is suffering from **high bias**, getting more training data will not (by itself) help much.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/advice-for-applying-ml/10.png">
</div>
<br>

- **Low training set size $m$**: causes $J_{train}(\theta)$ to be low and $J_{CV}(\theta)$ to be high.
- **Large training set size $m$**: $J_{train}(\theta)$ increases with training set size and $J_{CV}(\theta)$ continues to decrease without leveling off. Also, $J_{train}(\theta) < J_{CV}(\theta)$, but the difference between them remains significant.

<ins>**Remark:**</ins> If a learning algorithm is suffering from **high variance**, getting more training data is likely to help.
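Here is a minimal sketch of how such learning curves can be produced (the synthetic data and the deliberately too-simple linear hypothesis are assumptions for illustration, not part of the course): train on the first $m$ examples only, measure $J_{train}$ on those same $m$ examples and $J_{CV}$ on the full cross validation set, and repeat for growing $m$:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic quadratic data, purely for illustration.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, (400, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, 400)

X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=1)

sizes = list(range(5, len(X_train) + 1, 10))
j_train, j_cv = [], []
for m in sizes:
    # Fit a (deliberately too simple) linear model on the first m training examples.
    model = Ridge(alpha=1.0).fit(X_train[:m], y_train[:m])
    j_train.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))  # error on those m examples
    j_cv.append(mean_squared_error(y_cv, model.predict(X_cv)))                   # error on the full CV set

plt.plot(sizes, j_train, label="$J_{train}$")
plt.plot(sizes, j_cv, label="$J_{CV}$")
plt.xlabel("training set size m")
plt.ylabel("error")
plt.legend()
plt.show()
```

With this under-powered linear hypothesis on quadratic data, both curves plateau close together at a high error, which is the high-bias signature above; a much more flexible model on the same data would instead show the high-variance picture, where $J_{CV}$ keeps decreasing as $m$ grows while staying noticeably above $J_{train}$.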
## II) Debugging a Learning Algorithm

### <ins>1) Linear Regression / Logistic Regression</ins>

Our decision process can be broken down as follows:
- Getting more training examples $\rightarrow$ fixes **high variance**.
- Trying smaller sets of features $\rightarrow$ fixes **high variance**.
- Trying additional features $\rightarrow$ fixes **high bias**.
- Trying polynomial features $\rightarrow$ fixes **high bias**.
- Increasing $\lambda$ $\rightarrow$ fixes **high variance**.
- Decreasing $\lambda$ $\rightarrow$ fixes **high bias**.

### <ins>2) Neural Network</ins>

- A neural network with **fewer parameters** is prone to **underfitting**. It is also **computationally cheaper**.
- A large neural network with **more parameters** is prone to **overfitting**. It is also **computationally expensive**. In this case you can use regularization (increase $\lambda$) to address the overfitting.

<ins>**Remark:**</ins> Using a single hidden layer is a good starting default. You can train neural networks with different numbers of hidden layers and use your cross validation set to select the one that performs best.

**A typical rule of thumb when running diagnostics is:**

- More training examples fix high variance but not high bias.
- Fewer features fix high variance but not high bias.
- Additional features fix high bias but not high variance.
- The addition of polynomial and interaction features fixes high bias but not high variance.
- When using gradient descent, decreasing $\lambda$ can fix high bias and increasing $\lambda$ can fix high variance ($\lambda$ is the regularization parameter).
- When using neural networks, small networks are more prone to underfitting and big networks are prone to overfitting. Cross-validation of network size is a way to choose among alternatives.

<ins>**Bonus: Model Complexity Effects:**</ins>

- **Lower-order polynomials** (low model complexity) have **high bias** and **low variance**. In this case, the model consistently fits the data poorly.
- **Higher-order polynomials** (high model complexity) fit the training data extremely well and the test data extremely poorly. They have **low bias** on the training data, but very **high variance**.
- In reality, we would want to choose a model somewhere in between, one that generalizes well but also fits the data reasonably well.

## III) Machine Learning System Design

### <ins>1) Prioritizing What to Work On</ins>

Different ways we can approach a machine learning problem:
- Collect lots of data (for example the "honeypot" project, but this doesn't always work).
- Develop sophisticated features (for example: using email header data in spam emails).
- Develop algorithms to process your input in different ways (recognizing misspellings in spam).

It is difficult to tell in advance which of these options will be most helpful.

### <ins>2) Error Analysis</ins>

The recommended approach to solving machine learning problems is:
- Start with a simple algorithm, implement it quickly, and test it early.
- Plot learning curves to decide if more data, more features, etc. will help.
- Error analysis: manually examine the errors on examples in the cross validation set and try to spot a trend.

It's important to get error results as a **single, numerical value**. Otherwise it is difficult to assess your algorithm's performance.

You may need to process your input before it is useful. For example, if your input is a set of words, you may want to treat the same word with different forms (fail/failing/failed) as one word, so you must use **stemming software** to recognize them all as one.
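As a small, hypothetical illustration of this advice (the tiny corpus, the labels and the `crude_stem` helper below are all invented; a real project would use a proper stemmer and far more data), two preprocessing variants of a spam classifier are each reduced to a single number, their cross validation misclassification rate, so they can be compared directly:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny invented corpus (1 = spam, 0 = not spam), purely for illustration.
emails = [
    "winning a free prize now", "claiming your free money", "you won the lottery",
    "meeting moved to monday", "notes from the planning meeting", "lunch on friday",
    "free prizes for the winners", "project update attached", "act now claim winnings",
    "schedule for next week",
]
labels = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

def crude_stem(doc):
    # Toy stand-in for real stemming software: strip a few common suffixes.
    words = []
    for w in doc.split():
        for suffix in ("ing", "ed", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[: -len(suffix)]
                break
        words.append(w)
    return " ".join(words)

for name, preprocessor in [("raw words", None), ("crudely stemmed", crude_stem)]:
    clf = make_pipeline(CountVectorizer(preprocessor=preprocessor),
                        LogisticRegression(max_iter=1000))
    cv_error = 1 - cross_val_score(clf, emails, labels, cv=5).mean()  # one number per variant
    print(f"{name}: cross validation error = {cv_error:.3f}")
```

Whichever variant gives the lower cross validation error is the one worth keeping; the point is simply that the comparison boils down to a single number.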
## IV) Error Metrics for skewed classes (Classification problem)

It is sometimes difficult to tell whether a reduction in error is actually an improvement of the algorithm. Suppose that we train a logistic regression model ($y=1$ if cancer, $y=0$ otherwise). You find that you have 1% error on the test set. Cool, right? But you also find that only 0.5% of the patients in your training/test set have cancer. Thus, the 1% error is not impressive anymore.

This is called **skewed classes**, that is, when one class is very rare in the entire data set. Or, to say it another way, when we have a lot more examples from one class than from the other class.

To handle this, we can use two error metrics: **Precision/Recall**. To understand these notions, let's go through the following example:

Imagine that your girlfriend gave you a birthday surprise every year for the last 10 years. However, one day, your girlfriend asks you: **"Sweetie, do you remember all the birthday surprises from me?"**

This simple question puts your life in danger. To extend your life, you need to **recall all 10 surprise** events from your memory. So, *recall* is the ratio of the number of **events you can correctly recall** to the number of **all correct events**. If you can recall all **10** events correctly, then your recall ratio is **1.0 (100%)**. If you can recall **7** events correctly, your recall ratio is **0.7 (70%)**.

However, you might be wrong in some answers. For example, you answer **15** times: **10** events are correct and **5** events are wrong. This means you can recall all events, but not very *precisely*. So, *precision* is the ratio of the number of **events you can correctly recall** to the number of **all events you recall** (a mix of correct and wrong recalls). In other words, it is how precise your recall is.

From the previous example (10 real events, 15 answers: 10 correct answers, 5 wrong answers), you get **100%** recall but your precision is only **66.67%** (10 / 15).

Yes, you can guess what I'm going to say next: if a machine learning algorithm is good at *recall*, it doesn't mean that algorithm is good at *precision*.

### <ins>a) Precision</ins>

**Precision** is defined as follows:

$$\frac{\#\, relevant \> selected \> objects}{\#\, selected \> objects}$$

<ins>**Example:**</ins>

Imagine you have 3 markers and 2 pens. You have 5 objects in total and you are looking for the markers (the **relevant objects**).

1. If we take all 5 objects, then $precision \, = \, \frac{3}{5}$.
2. If we only take the 2 pens, then $precision \, = \, \frac{0}{2} = 0$.
3. If we only take 2 markers, then $precision \, = \, \frac{2}{2} = 1$.
4. If we take 2 markers and 2 pens, then $precision \, = \, \frac{2}{4} = \frac{1}{2}$.

**Precision** measures **how many of the objects selected by the model were correct**.

### <ins>b) Recall</ins>

**Recall** is defined as follows:

$$\frac{\#\, relevant \> selected \> objects}{\#\, relevant \> objects}$$

<ins>**Example:**</ins>

Imagine you have 3 markers and 2 pens. You have 5 objects in total and you are looking for the markers (the **relevant objects**).

1. If we take all 5 objects, then $recall \, = \, \frac{3}{3} = 1$.
2. If we only take the 2 pens, then $recall \, = \, \frac{0}{3} = 0$.
3. If we only take 2 markers, then $recall \, = \, \frac{2}{3}$.
4. If we take 2 markers and 2 pens, then $recall \, = \, \frac{2}{3}$.

**Recall** measures **how many of the correct objects were selected by the model**.

---

Let's come back to our classification problem and draw a 2x2 table.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/advice-for-applying-ml/11.png">
</div>
<br>

**Precision** tells us what **proportion of the patients that we classified as having cancer actually had cancer**.

$$Precision = \frac{\#\, correctly \> predicted \> with \> cancer}{\#\, predicted \> with \> cancer} = \frac{True \> positive}{True \> positive + False \> positive}$$

**Recall** tells us what **proportion of the patients that actually had cancer were classified by the algorithm as having cancer**.

$$Recall = \frac{\#\, correctly \> predicted \> with \> cancer}{\#\, cancerous \> people} = \frac{True \> positive}{True \> positive + False \> negative}$$

These two metrics give us a better sense of how our classifier is doing. We want both precision and recall to be high.

<ins>**Example:**</ins>

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/advice-for-applying-ml/12.png">
</div>
<br>

$$Precision = \frac{80}{80 + 20} = 0.8$$

$$Recall = \frac{80}{80 + 80} = 0.5$$

If we write them as percentages:
- **Precision**: **80%** of the patients that the model classified as having cancer actually had cancer.
- **Recall**: **50%** of the patients that actually had cancer were classified by the model as having cancer.

<ins>**Trading Off Precision and Recall:**</ins>

For our machine learning model, we may want to favor either **precision** or **recall**; it depends on the context of our classification problem. If we want to predict that a patient has cancer only when we are highly confident, we should focus on **precision**. But then we may miss some patients who have cancer but are difficult to detect (early stage of cancer). Thus, if we want to avoid missing cancerous patients, we should focus on **recall**.

It is clear that recall gives us information about a classifier's performance with respect to false negatives (how many we missed), while precision gives us information about its performance with respect to false positives (how many we wrongly caught).

- **Precision** is about being precise. So even if we manage to capture only one cancer case, and we capture it correctly, then we are 100% precise.
- **Recall** is not so much about capturing cases correctly as about capturing all cases that are "cancer". So if we simply label every case as "cancer", we have 100% recall.

In simple terms, high precision means that an algorithm returned substantially more relevant results than irrelevant ones, while high recall means that an algorithm returned most of the relevant results.

In a classification task, a precision score of 1.0 for a class C means that every item labeled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labeled correctly), whereas a recall of 1.0 means that every item from class C was labeled as belonging to class C (but says nothing about how many other items were incorrectly also labeled as belonging to class C).

Given the following:
- Predict 1 if: $h_{\theta}(x) \geq threshold$.
- Predict 0 if: $h_{\theta}(x) < threshold$.

We can say that:
- The **greater** the threshold, the **greater the precision** and the **lower the recall**.
- The **lower** the threshold, the **greater the recall** and the **lower the precision**.

In order to turn these two metrics into one single number that depicts how well our model is performing, we can compute the **F score** (or **F1 score**):

$$F\ Score = 2\,\frac{P \cdot R}{P + R}$$

In order for the **F score** to be large, both precision and recall must be large.

We want to tune the threshold using precision and recall on the cross validation set, in order not to bias our test set.

## V) Using Large datasets

How much data should we train on? In certain cases, an "inferior algorithm," if given enough data, can outperform a superior algorithm with less data.

We must choose features that carry enough information. A useful test is: **given input $x$, would a human expert be able to confidently predict $y$?**

For **large data**, if we have a **low bias algorithm** (many features or hidden units, making a very complex function), then the **larger the training set** we use, the **less** we will **overfit** (and the **more accurate** the algorithm will be **on the test set**).
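Coming back to the error metrics of section IV, here is a minimal NumPy sketch (the labels and the hypothesis outputs $h_\theta(x)$ below are made up for illustration) that computes precision, recall and the F score directly from their definitions and shows how they move as the threshold changes:

```python
import numpy as np

# Made-up ground truth and model outputs h_theta(x), purely for illustration.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1])
h_theta = np.array([0.92, 0.40, 0.75, 0.55, 0.30, 0.62, 0.85, 0.10, 0.48, 0.20, 0.58, 0.95])

def precision_recall_f1(y_true, scores, threshold):
    y_pred = (scores >= threshold).astype(int)          # predict 1 if h_theta(x) >= threshold
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0       # TP / (TP + FP)
    recall = tp / (tp + fn) if tp + fn else 0.0          # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

for threshold in (0.3, 0.5, 0.7, 0.9):
    p, r, f1 = precision_recall_f1(y_true, h_theta, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}  F score={f1:.2f}")
```

In practice you would pick the threshold that maximizes the F score on the cross validation set, and only then report precision and recall on the test set.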