Try   HackMD

Solving the problem of overfitting / underfitting

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

This is my personal notes taken for the course Machine learning by Standford. Feel free to check the assignments.
Also, if you want to read my other notes, feel free to check them at my blog.

I) Introduction

Consider the problem of predicting

y from
xR
.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
  1. The leftmost picture shows the result of
    h(x)=θ0+θ1x1
    . You can notice that the line doesn't fit well the plotted data. This is called
    underfitting or high bias.
  2. The middle pictures show the result of
    h(x)=θ0+θ1x1+θ2x22
    . It seems to be a very good curve.
  3. The rightmost picture shows the result of
    h(x)=θ0+θ1x1+θ2x22+θ3x33+θ4x44
    . It might seem that the more features we add, the better the curve will be. You can notice here that the curve passes through the plotted data perfectly. However, there is a danger. Adding to much features can lead to a curve that fits perfectly the plotted data but will give poor prediction. This is called overfitting or high variance.

Think of it like this. There are 3 students doing the same MCQ. The first student is the one that didn't revise enough (leftmost). The second student is the one that understand his course and can adapt to every type of MCQ (middle one). The last student is the one that learned by hearth all the annals of previous MCQ (rightmost). He may do well for this particular MCQ but will fail if we change the question. "Overfitting" and "underfitting" apply to both linear and logistic regression.

To get out of underfitting, you have to add more features (polynomials).

To get out of overfitting, there are 2 mains options:

  1. Reduce the number of features:

    • Manually select which features to keep.
    • Use a model selection algorithm.
  2. Regularization:

    • Keep all the features, but reduce the magnitude of parameters
      θ
      .
    • Regularization works well when we have a lot of slightly useful features.

Modifying the cost function

Suppose we want to make our following hypothesis function

h(x) more
quadratic:

h(x)=θ0+θ1x1+θ2x22+θ3x33+θ4x44

We'll want to eliminate the influence of

θ3x3 and
θ4x4
. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:

minθ12mi=1m(h(x(i))y(i))2+1000θ32+1000θ42regularization term

Why are we adding up

1000θ32 and
1000θ42
to the initial
J(Θ)
? Well, remember that our goal is to minimize
J(Θ)
. That means finding best value of
θ
s so that
J(Θ)
is near as much as possible to 0.

Thus adding extra values will require

J(Θ) to find smallest values of
θ
s

In our case, the two extra term inflate the cost of

θ3 and
θ4
. Now, in order for the cost function to get close to zero, we will have to reduce the values of
θ3
and
θ4
to near zero. This will in turn greatly reduce the values of
θ3x3
and
θ4x4
in our hypothesis function.

Multiply

1000 by
θ32
and
θ42
enable us to know when
θ3
and
θ4
are near 0 since the regularization term will disappear. If we didn't multiply by
θ32
and
θ42
, it means that the regularization term will always be there, forcing
θ1
and
θ2
to have values they should normally not take.

If you are wondering why we are regularizing with

θ32 and
θ42
instead of
θ3
and
θ4
, here is the answer:

Suppose

θ3=+1000000 and
θ4=1000001
.

  • sum(θ)=+10000001000001=1
    which is small.
  • sum(θ2)=(+1000000)2+(1000001)2
    will be very big.

Thus, as you have noticed, using

sum(θ) comes down to not regularizing at all.

A general formula will be:

minθ12m[i=1m(h(x(i))y(i))2+λj=1mθj2]regularization term

Notice that we are not regularizing

θ0 (summation start at index 1).

The

λ is the regularization parameter: the bigger
λ
is, the smaller the concerning
θ
s will be.

However, if

λ is too large (
λ=1010
), all the
θ
s will be too small (near 0) and we will be left with
h(x)=θ0
(underfitting problem).

II) Regularized Linear Regression

In order to build a regularized linear regression, we have to tweak the gradient descent algorithm. The original one looked like this :

repeat until convergence:{θ0:=θ0α1mi=1m(h(x(i))y(i))x0(i)θj:=θjα1mi=1m(h(x(i))y(i))xj(i)}

Nothing new, right? The derivative at this point has already been computed and we plugged it into the formula. In order to upgrade this into a regularized algorithm, we have to figure out the derivative for the new regularized cost function seen above. This is how it looks after computing the derivative.

θjJreg(Θ)=1mi=1m(h(x(i))y(i))xj(i)+λmθj

Bonus:

Let's compute

θjJreg(Θ).

J(Θ)=12m[i=1m(h(x(i))y(i))2+λj=1mθj2]

Thus,

θjJ=θj(12mi=1m(h(x(i))y(i))2)compute in "Linear Regression (one variable)"+θj(12mλj=1mθj2)=1mi=1m(h(x(i))y(i))xj(i)+λmθj

Notice: That's why we squared

θ in the regularization term because we are going to differentiate it in the gradient descent algorithm. If we didn't square it, it will disappear while differentiating (showing that the
θ
s not raise to the square are too small when adding up).

And now we are ready to plug it into the gradient descent algorithm. Note how the first

θ0 is left unprocessed, as we said earlier:

repeat until convergence:{θ0:=θ0α1mi=1m(h(x(i))y(i))x0(i)θj:=θjα1mi=1m(h(x(i))y(i))xj(i)+λmθj}

We can do even better. Let's group all the terms together that depends on

θj (and
θ0
left untouched as always):

repeat until convergence:{θ0:=θ0α1mi=1m(hθ(x(i))y(i))x0(i)θj:=θj(1αλm)α1mi=1m(hθ(x(i))y(i))xj(i)}

I have simply rearranged things around in each

θj update, except for
θ0
. This alternate way of writing shows the regularization in action: as you may see, the term
(1αλm)
multiplies
θj
and it's responsible for its shrinkage (because it will be always
<
1). The rest of the equation is the same as before the whole regularization thing.

Normal equation:

Now let's approach regularization using the alternate method of the non-iterative normal equation.
To add in regularization, the equation is the same as our riginal, except that we add another term inside the parentheses:

θ=(XTX+λL)1XTy  where  L=[0111]

L is a matrix with
θ
at the top left and
1
's down the diagonal, with
0
's everywhere else. It should have dimension
(n+1)×(n+1)
. Intuitively, this is the identity matrix (though we are not including
x0
), multiplied with a single real number
λ
.

Recall that if

mn (
m
: number of training examples and
n
: number of features), then
XTX
is non-invertible. However, when we add the term λ⋅L, then
XTX+λL
becomes invertible.

III) Regularized Logistic Regression

The gradient descent algorithm for logistic regression looks identical to the linear regression one. So the good news is that we can apply the very same regularization trick as we did above in order to shrink the

θs parameters.

The original, non-regularized cost function for logistic regression looks like:

J(Θ)=1m[i=1my(i)log(h(x(i)))+(1y(i))log(1h(x(i)))]

As we did in the regularized linear regression, we have to add the regularization term that shrinks all the parameters. It is slightly different here:

Jreg(Θ)=1m[i=1my(i)log(h(x(i)))+(1y(i))log(1h(x(i)))]+λ2mj=1nθj2

Thus, its derivative is:

θjJreg(θ)=θjα[1mi=1m(h(x(i))y(i))xj(i)+λmθj]

We are now ready to plug it into the gradient descent algorithm for the logistic regression. As always the first item

θ0 is left unprocessed:

repeat until convergence:{θ0:=θ0α1mi=1m(h(x(i))y(i))x0(i)θj:=θjα[1mi=1m(h(x(i))y(i))xj(i)+λmθj]}