---
tags: machine-learning
---
# Solving the problem of overfitting / underfitting
<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/solving-problem-overfitting-uderfitting/1.png" height="100%" width="70%">
</div>
> These are my personal notes taken for the course [Machine learning](https://www.coursera.org/learn/machine-learning#syllabus) by Stanford. Feel free to check the [assignments](https://github.com/3outeille/Coursera-Labs).
> Also, if you want to read my other notes, feel free to check them at my [blog](https://ferdinandmom.engineer/machine-learning/).
# I) Introduction
Consider the problem of predicting $y$ from $x \in \mathbb{R}$.
<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/solving-problem-overfitting-uderfitting/2.png">
</div>
1. The leftmost picture shows the result of $h(x)=\theta_0 + \theta_1x$. You can notice that the line doesn't fit the plotted data well. This is called **underfitting** or **high bias**.
2. The middle picture shows the result of $h(x)=\theta_0 + \theta_1x + \theta_2x^2$. The curve seems to fit the data just right.
3. The rightmost picture shows the result of $h(x)=\theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4$. It might seem that the more features we add, the better the curve will be, and indeed the curve passes through the plotted data perfectly. However, there is a danger: adding too many features can lead to a curve that fits the plotted data perfectly but gives **poor predictions** on new data. This is called **overfitting** or **high variance**.
Think of it like this: three students take the same MCQ. The first student is the one who didn't revise enough (leftmost). The second student is the one who understands the course and can adapt to any type of MCQ (middle). The last student is the one who learned all the previous MCQs by heart (rightmost): he may do well on this particular MCQ but will fail as soon as the questions change. "Overfitting" and "underfitting" apply to both linear and logistic regression.
To get out of underfitting, you have to add more features (e.g., polynomial features).
To get out of overfitting, there are 2 main options:
1. Reduce the number of features:
- Manually select which features to keep.
- Use a model selection algorithm.
2. Regularization:
- Keep all the features, but reduce the magnitude of parameters $\theta$.
- Regularization works well when we have a lot of slightly useful features.
<ins>**Modifying the cost function**</ins>
Suppose we want to make the following hypothesis function $h(x)$ more quadratic:
$$h(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4$$
We'll want to eliminate the influence of $\theta_3x^3$ and $\theta_4x^4$. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:
$$\min_\theta \frac{1}{2m} \sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})^2 + \underbrace{1000\cdot\theta_3^2 + 1000\cdot\theta_4^2}_{\text{regularization term}}$$
Why are we adding $1000\cdot\theta_3^2$ and $1000\cdot\theta_4^2$ to the initial $J(\Theta)$? Well, remember that our goal is to minimize $J(\Theta)$, that is, to find the values of the $\theta$s that bring $J(\Theta)$ as close to 0 as possible.
**Thus, adding these extra terms forces the minimization to pick small values for the corresponding $\theta$s.**
In our case, the two extra terms inflate the cost of $\theta_3$ and $\theta_4$. Now, in order for the cost function to get close to zero, we have to reduce $\theta_3$ and $\theta_4$ to values near zero, which in turn greatly reduces the contribution of $\theta_3x^3$ and $\theta_4x^4$ in our hypothesis function.
Multiplying $1000$ by $\theta_3^2$ and $\theta_4^2$ (rather than adding a constant on its own) means the penalty vanishes precisely when $\theta_3$ and $\theta_4$ are near 0. If the penalty did not depend on $\theta_3$ and $\theta_4$, it would always be there, forcing the other parameters ($\theta_1$, $\theta_2$) to take values they normally should not.
If you are wondering why we regularize with $\theta_3^2$ and $\theta_4^2$ instead of $\theta_3$ and $\theta_4$, here is the answer.
Suppose $\theta_3 = +1000000$ and $\theta_4 = -1000001$.
- $sum(\theta) = +1000000 - 1000001 = -1$, which is small.
- $sum(\theta^2) = (+1000000)^2 + (-1000001)^2$, which is very big.
Thus, as you can see, using $sum(\theta)$ lets large positive and negative parameters cancel each other out, which comes down to not regularizing at all.
The general formula is:
$$\boxed{\min_\theta \frac{1}{2m} \bigg[\sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2 + \underbrace{\lambda \sum_{j=1}^{n} \theta_j^2}_{\text{regularization term}}\bigg]}$$
Notice that we are not regularizing $\theta_0$ (the summation starts at index $j=1$).
The $\lambda$ is the **regularization parameter**: the bigger $\lambda$ is, the smaller the corresponding $\theta$s will be.
However, if $\lambda$ is too large (e.g., $\lambda = 10^{10}$), all the $\theta$s will be too small (near 0) and we will be left with $h(x) \approx \theta_0$ (underfitting problem).
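To make this concrete, here is a minimal numpy sketch of this regularized cost function (the names `regularized_cost`, `X`, `y`, `theta`, and `lam` are mine, and `X` is assumed to already contain a leading column of ones so that `theta[0]` plays the role of $\theta_0$):
```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear regression cost J(theta).

    X   : (m, n+1) design matrix whose first column is all ones.
    y   : (m,) target vector.
    lam : regularization parameter lambda (theta_0 is not regularized).
    """
    m = len(y)
    errors = X @ theta - y                            # h(x^(i)) - y^(i)
    cost = (errors @ errors) / (2 * m)                # (1/2m) * sum of squared errors
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)    # lambda/(2m) * sum of theta_j^2, skipping theta_0
    return cost + reg
```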
# II) Regularized Linear Regression
In order to build a regularized linear regression, we have to tweak the gradient descent algorithm. The original one looked like this:
\begin{align*}
& \text{repeat until convergence:} \; \lbrace \\ \;&
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \\ \; &
\cdots \\ \; &
\theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} &
\\ \rbrace
\end{align*}
Nothing new, right? The derivative at this point has already been computed and plugged into the formula. In order to upgrade this to a regularized algorithm, we have to figure out the derivative of the new regularized cost function seen above. This is how it looks after computing the derivative:
$$\frac{\partial}{\partial \theta_j} J_{reg}(\Theta) = \frac{1}{m} \sum\limits_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} + \frac{\lambda}{m}\theta_j$$
<ins>**Bonus:**</ins>
Let's compute $\frac{\partial}{\partial \theta_j}J_{reg}(\Theta)$.
$$J(\Theta) = \frac{1}{2m} \bigg[\sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2\bigg]$$
Thus,
$$\begin{aligned}
\frac{\partial}{\partial \theta_j} J &=
\underbrace{
\frac{\partial}{\partial \theta_j} \bigg(\frac{1}{2m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2\bigg)
}_{\text{computed in "Linear Regression (one variable)"}}
+ \frac{\partial}{\partial \theta_j} \bigg(\frac{1}{2m} \lambda \sum_{j=1}^{n} \theta_j^2 \bigg)
\\
&=
\frac{1}{m} \sum\limits_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} + \frac{\lambda}{m}\theta_j\end{aligned}$$
<ins>**Notice:**</ins> This is also why we squared $\theta$ in the regularization term: when we differentiate it for gradient descent, $\theta_j^2$ yields the term $\frac{\lambda}{m}\theta_j$, which grows with the size of the parameter and therefore pushes large parameters toward zero more strongly. If we had used $\theta_j$ instead, its derivative would be a constant that does not depend on $\theta_j$ at all.
And now we are ready to plug it into the gradient descent algorithm. Note how the first $\theta_0$ is left unprocessed, as we said earlier:
\begin{align*}
& \text{repeat until convergence:} \; \lbrace \\ \; &
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \\ \; &
\cdots \\ \; &
\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum\limits_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} + \frac{\lambda}{m}\theta_j \right] & \\
\rbrace
\end{align*}
We can do even better. Let's group together all the terms that depend on $\theta_j$ ($\theta_0$ is left untouched as always):
\begin{align*}
& \text{repeat until convergence:} \; \lbrace \\ \; &
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \\ \; &
\cdots \\ \; &
\theta_j := \theta_j(1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} & \\
\rbrace
\end{align*}
I have simply rearranged things in each $\theta_j$ update, except for $\theta_0$. This alternative way of writing it shows the regularization in action: as you can see, the term $(1 - \alpha\frac{\lambda}{m})$ multiplies $\theta_j$ and is responsible for its shrinkage (because it is always $<$ 1 for reasonable values of $\alpha$ and $\lambda$). The rest of the equation is the same as before regularization.
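Here is a minimal numpy sketch of this update loop (the function name `gradient_descent_reg` and the variables `alpha` for $\alpha$ and `lam` for $\lambda$ are mine; `X` is again assumed to have a leading column of ones):
```python
import numpy as np

def gradient_descent_reg(X, y, theta, alpha, lam, num_iters):
    """Regularized gradient descent for linear regression.

    theta[0] is updated without the shrinkage factor; every other
    parameter is first multiplied by (1 - alpha * lam / m).
    """
    m = len(y)
    for _ in range(num_iters):
        errors = X @ theta - y                 # h(x^(i)) - y^(i)
        grad = (X.T @ errors) / m              # (1/m) * sum over i of error_i * x_j^(i)
        theta_0 = theta[0] - alpha * grad[0]   # no regularization on theta_0
        theta_rest = theta[1:] * (1 - alpha * lam / m) - alpha * grad[1:]
        theta = np.concatenate(([theta_0], theta_rest))
    return theta
```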
<ins>**Normal equation:**</ins>
Now let's approach regularization using the alternate method of the non-iterative normal equation.
To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:
$$\begin{aligned}
& \theta = \left( X^TX + \lambda \cdot L \right)^{-1} X^Ty \ \ \text{where} \ \ L = \begin{bmatrix}
0 & & & \\
& 1 & & \\
& & \ddots & \\
& & & 1
\end{bmatrix}\end{aligned}$$
$L$ is a matrix with $0$ at the top left and $1$'s down the rest of the diagonal, with $0$'s everywhere else. It has dimension $(n+1)\times(n+1)$. Intuitively, this is the identity matrix (though we are not including $x_0$), multiplied by the single real number $\lambda$.
Recall that if $m \leq n$ ($m$: number of training examples, $n$: number of features), then $X^TX$ is non-invertible. However, when we add the term $\lambda \cdot L$ (with $\lambda > 0$), $X^TX + \lambda \cdot L$ becomes invertible.
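A minimal numpy sketch of this regularized normal equation (the function name `normal_equation_reg` is mine; `L` is built as the identity with its top-left entry zeroed so that $\theta_0$ is not regularized):
```python
import numpy as np

def normal_equation_reg(X, y, lam):
    """Closed-form solution: theta = (X^T X + lambda * L)^(-1) X^T y."""
    n_plus_1 = X.shape[1]       # number of parameters, including theta_0
    L = np.eye(n_plus_1)
    L[0, 0] = 0                 # do not regularize theta_0
    # Solving the linear system is preferable to explicitly inverting the matrix.
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```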
# III) Regularized Logistic Regression
The gradient descent algorithm for logistic regression looks identical to the linear regression one. So the good news is that we can apply the very same regularization trick as we did above in order to shrink the $\theta$ parameters.
The original, non-regularized cost function for logistic regression looks like:
$$J(\Theta) = - \dfrac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log(h(x^{(i)})) + (1 - y^{(i)}) \log(1-h(x^{(i)}))\right]$$
As we did in the regularized linear regression, we have to add the regularization term that shrinks all the parameters. It is slightly different here:
$$J_{reg}(\Theta) = - \dfrac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log(h(x^{(i)})) + (1 - y^{(i)}) \log(1-h(x^{(i)}))\right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$
Thus, its derivative (for $j \geq 1$) is:
$$\frac{\partial}{\partial \theta_j} J_{reg}(\Theta) = \frac{1}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j$$
We are now ready to plug it into the gradient descent algorithm for the logistic regression. As always the first item $\theta_0$ is left unprocessed:
\begin{align*} & \text{repeat until convergence:} \; \lbrace \\ \; &
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) x_0^{(i)} \\ \; &
\cdots \\ \; &
\theta_j := \theta_j - \alpha \left[\dfrac{1}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] & \\
\rbrace
\end{align*}
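Finally, a minimal numpy sketch of this regularized update for logistic regression (names are mine; the only change from the linear case is that $h$ is now the sigmoid of $\theta^T x$):
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_gradient_descent_reg(X, y, theta, alpha, lam, num_iters):
    """Regularized gradient descent for logistic regression (theta_0 not regularized)."""
    m = len(y)
    for _ in range(num_iters):
        errors = sigmoid(X @ theta) - y        # h(x^(i)) - y^(i)
        grad = (X.T @ errors) / m
        grad[1:] += (lam / m) * theta[1:]      # add (lambda/m) * theta_j for j >= 1
        theta = theta - alpha * grad
    return theta
```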