This is my personal notes taken for the course Machine learning by Standford. Feel free to check the assignments.
Also, if you want to read my other notes, feel free to check them at my blog.
Consider the problem of predicting from .
Think of it like this. There are 3 students doing the same MCQ. The first student is the one that didn't revise enough (leftmost). The second student is the one that understand his course and can adapt to every type of MCQ (middle one). The last student is the one that learned by hearth all the annals of previous MCQ (rightmost). He may do well for this particular MCQ but will fail if we change the question. "Overfitting" and "underfitting" apply to both linear and logistic regression.
To get out of underfitting, you have to add more features (polynomials).
To get out of overfitting, there are 2 mains options:
Reduce the number of features:
Regularization:
Modifying the cost function
Suppose we want to make our following hypothesis function more
quadratic:
We'll want to eliminate the influence of and . Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:
Why are we adding up and to the initial ? Well, remember that our goal is to minimize . That means finding best value of s so that is near as much as possible to 0.
Thus adding extra values will require to find smallest values of s
In our case, the two extra term inflate the cost of and . Now, in order for the cost function to get close to zero, we will have to reduce the values of and to near zero. This will in turn greatly reduce the values of and in our hypothesis function.
Multiply by and enable us to know when and are near 0 since the regularization term will disappear. If we didn't multiply by and , it means that the regularization term will always be there, forcing and to have values they should normally not take.
If you are wondering why we are regularizing with and instead of and , here is the answer:
Suppose and .
Thus, as you have noticed, using comes down to not regularizing at all.
A general formula will be:
Notice that we are not regularizing (summation start at index 1).
The is the regularization parameter: the bigger is, the smaller the concerning s will be.
However, if is too large (), all the s will be too small (near 0) and we will be left with (underfitting problem).
In order to build a regularized linear regression, we have to tweak the gradient descent algorithm. The original one looked like this :
Nothing new, right? The derivative at this point has already been computed and we plugged it into the formula. In order to upgrade this into a regularized algorithm, we have to figure out the derivative for the new regularized cost function seen above. This is how it looks after computing the derivative.
Bonus:
Let's compute .
Thus,
Notice: That's why we squared in the regularization term because we are going to differentiate it in the gradient descent algorithm. If we didn't square it, it will disappear while differentiating (showing that the s not raise to the square are too small when adding up).
And now we are ready to plug it into the gradient descent algorithm. Note how the first is left unprocessed, as we said earlier:
We can do even better. Let's group all the terms together that depends on (and left untouched as always):
I have simply rearranged things around in each update, except for . This alternate way of writing shows the regularization in action: as you may see, the term multiplies and it's responsible for its shrinkage (because it will be always 1). The rest of the equation is the same as before the whole regularization thing.
Normal equation:
Now let's approach regularization using the alternate method of the non-iterative normal equation.
To add in regularization, the equation is the same as our riginal, except that we add another term inside the parentheses:
is a matrix with at the top left and 's down the diagonal, with 's everywhere else. It should have dimension . Intuitively, this is the identity matrix (though we are not including ), multiplied with a single real number .
Recall that if ( : number of training examples and : number of features), then is non-invertible. However, when we add the term λ⋅L, then becomes invertible.
The gradient descent algorithm for logistic regression looks identical to the linear regression one. So the good news is that we can apply the very same regularization trick as we did above in order to shrink the s parameters.
The original, non-regularized cost function for logistic regression looks like:
As we did in the regularized linear regression, we have to add the regularization term that shrinks all the parameters. It is slightly different here:
Thus, its derivative is:
We are now ready to plug it into the gradient descent algorithm for the logistic regression. As always the first item is left unprocessed: