---
tags: machine-learning
---
# Solving the problem of overfitting / underfitting
<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/solving-problem-overfitting-uderfitting/1.png" height="100%" width="70%">
</div>
> These are my personal notes taken for the course [Machine learning](https://www.coursera.org/learn/machine-learning#syllabus) by Stanford. Feel free to check the [assignments](https://github.com/3outeille/Coursera-Labs).
> Also, if you want to read my other notes, feel free to check them at my [blog](https://ferdinandmom.engineer/machine-learning/).
# I) Introduction
Consider the problem of predicting $y$ from $x \in \mathbb{R}$.
<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/solving-problem-overfitting-uderfitting/2.png">
</div>
1. The leftmost picture shows the result of $h(x)=\theta_0 + \theta_1x$. You can notice that the line doesn't fit the plotted data well. This is called **underfitting** or **high bias**.
2. The middle picture shows the result of $h(x)=\theta_0 + \theta_1x + \theta_2x^2$. The curve seems to fit the data just right.
3. The rightmost picture shows the result of $h(x)=\theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4$. It might seem that the more features we add, the better the curve will be, and indeed the curve passes through the plotted data perfectly. However, there is a danger: adding too many features can lead to a curve that fits the plotted data perfectly but gives **poor predictions** on new data. This is called **overfitting** or **high variance**.
Think of it like this: three students take the same MCQ. The first student is the one who didn't revise enough (leftmost). The second student is the one who understands the course and can adapt to any type of MCQ (middle). The last student is the one who learned all the previous MCQs by heart (rightmost): he may do well on this particular MCQ but will fail as soon as the questions change. "Overfitting" and "underfitting" apply to both linear and logistic regression.
To get out of underfitting, you have to add more features (e.g., polynomial features).
To get out of overfitting, there are 2 main options:
1. Reduce the number of features:
- Manually select which features to keep.
- Use a model selection algorithm.
2. Regularization:
- Keep all the features, but reduce the magnitude of parameters $\theta$.
- Regularization works well when we have a lot of slightly useful features.
<ins>**Modifying the cost function**</ins>
Suppose we want to make the following hypothesis function $h(x)$ more quadratic:
$$h(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4$$
We'll want to eliminate the influence of $\theta_3x^3$ and $\theta_4x^4$. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:
$$\min_\theta \frac{1}{2m} \sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})^2 + \underbrace{1000\cdot\theta_3^2 + 1000\cdot\theta_4^2}_{\text{regularization term}}$$
Why are we adding $1000\cdot\theta_3^2$ and $1000\cdot\theta_4^2$ to the initial $J(\Theta)$? Well, remember that our goal is to minimize $J(\Theta)$, that is, to find the values of the $\theta$s that bring $J(\Theta)$ as close to 0 as possible.
**Thus, adding these extra terms forces the minimization to pick small values for the corresponding $\theta$s.**
In our case, the two extra terms inflate the cost of $\theta_3$ and $\theta_4$. Now, in order for the cost function to get close to zero, we have to reduce $\theta_3$ and $\theta_4$ to values near zero, which in turn greatly reduces the contribution of $\theta_3x^3$ and $\theta_4x^4$ in our hypothesis function.
Multiplying $1000$ by $\theta_3^2$ and $\theta_4^2$ (rather than adding a constant on its own) means the penalty vanishes precisely when $\theta_3$ and $\theta_4$ are near 0. If the penalty did not depend on $\theta_3$ and $\theta_4$, it would always be there, forcing the other parameters ($\theta_1$, $\theta_2$) to take values they normally should not.
If you are wondering why we regularize with $\theta_3^2$ and $\theta_4^2$ instead of $\theta_3$ and $\theta_4$, here is the answer.
Suppose $\theta_3 = +1000000$ and $\theta_4 = -1000001$.
- $sum(\theta) = +1000000 - 1000001 = -1$, which is small.
- $sum(\theta^2) = (+1000000)^2 + (-1000001)^2$, which is very big.
Thus, as you can see, using $sum(\theta)$ lets large positive and negative parameters cancel each other out, which comes down to not regularizing at all.
The general formula is:
$$\boxed{\min_\theta \frac{1}{2m} \bigg[\sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2 + \underbrace{\lambda \sum_{j=1}^{n} \theta_j^2}_{\text{regularization term}}\bigg]}$$
Notice that we are not regularizing $\theta_0$ (the summation starts at index $j=1$).
The $\lambda$ is the **regularization parameter**: the bigger $\lambda$ is, the smaller the corresponding $\theta$s will be.
However, if $\lambda$ is too large (e.g., $\lambda = 10^{10}$), all the $\theta$s will be too small (near 0) and we will be left with $h(x) \approx \theta_0$ (underfitting problem).
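To make this concrete, here is a minimal numpy sketch of this regularized cost function (the names `regularized_cost`, `X`, `y`, `theta`, and `lam` are mine, and `X` is assumed to already contain a leading column of ones so that `theta[0]` plays the role of $\theta_0$):
```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear regression cost J(theta).

    X   : (m, n+1) design matrix whose first column is all ones.
    y   : (m,) target vector.
    lam : regularization parameter lambda (theta_0 is not regularized).
    """
    m = len(y)
    errors = X @ theta - y                            # h(x^(i)) - y^(i)
    cost = (errors @ errors) / (2 * m)                # (1/2m) * sum of squared errors
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)    # lambda/(2m) * sum of theta_j^2, skipping theta_0
    return cost + reg
```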
# II) Regularized Linear Regression
In order to build a regularized linear regression, we have to tweak the gradient descent algorithm. The original one looked like this:
\begin{align*}
& \text{repeat until convergence:} \; \lbrace \\ \;&
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \\ \; &
\cdots \\ \; &
\theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} &
\\ \rbrace
\end{align*}
Nothing new, right? The derivative at this point has already been computed and plugged into the formula. In order to upgrade this to a regularized algorithm, we have to figure out the derivative of the new regularized cost function seen above. This is how it looks after computing the derivative:
$$\frac{\partial}{\partial \theta_j} J_{reg}(\Theta) = \frac{1}{m} \sum\limits_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} + \frac{\lambda}{m}\theta_j$$
<ins>**Bonus:**</ins>
Let's compute $\frac{\partial}{\partial \theta_j}J_{reg}(\Theta)$.
$$J(\Theta) = \frac{1}{2m} \bigg[\sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2\bigg]$$
Thus,
$$\begin{aligned}
\frac{\partial}{\partial \theta_j} J &=
\underbrace{
\frac{\partial}{\partial \theta_j} \bigg(\frac{1}{2m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2\bigg)
}_{\text{computed in "Linear Regression (one variable)"}}
+ \frac{\partial}{\partial \theta_j} \bigg(\frac{1}{2m} \lambda \sum_{j=1}^{n} \theta_j^2 \bigg)
\\
&=
\frac{1}{m} \sum\limits_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} + \frac{\lambda}{m}\theta_j\end{aligned}$$
<ins>**Notice:**</ins> This is also why we squared $\theta$ in the regularization term: when we differentiate it for gradient descent, $\theta_j^2$ yields the term $\frac{\lambda}{m}\theta_j$, which grows with the size of the parameter and therefore pushes large parameters toward zero more strongly. If we had used $\theta_j$ instead, its derivative would be a constant that does not depend on $\theta_j$ at all.
And now we are ready to plug it into the gradient descent algorithm. Note how the first $\theta_0$ is left unprocessed, as we said earlier:
\begin{align*}
& \text{repeat until convergence:} \; \lbrace \\ \; &
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \\ \; &
\cdots \\ \; &
\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum\limits_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} + \frac{\lambda}{m}\theta_j \right] & \\
\rbrace
\end{align*}
We can do even better. Let's group together all the terms that depend on $\theta_j$ ($\theta_0$ is left untouched as always):
\begin{align*}
& \text{repeat until convergence:} \; \lbrace \\ \; &
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \\ \; &
\cdots \\ \; &
\theta_j := \theta_j(1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} & \\
\rbrace
\end{align*}
I have simply rearranged things in each $\theta_j$ update, except for $\theta_0$. This alternative way of writing it shows the regularization in action: as you can see, the term $(1 - \alpha\frac{\lambda}{m})$ multiplies $\theta_j$ and is responsible for its shrinkage (because it is always $<$ 1 for reasonable values of $\alpha$ and $\lambda$). The rest of the equation is the same as before regularization.
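Here is a minimal numpy sketch of this update loop (the function name `gradient_descent_reg` and the variables `alpha` for $\alpha$ and `lam` for $\lambda$ are mine; `X` is again assumed to have a leading column of ones):
```python
import numpy as np

def gradient_descent_reg(X, y, theta, alpha, lam, num_iters):
    """Regularized gradient descent for linear regression.

    theta[0] is updated without the shrinkage factor; every other
    parameter is first multiplied by (1 - alpha * lam / m).
    """
    m = len(y)
    for _ in range(num_iters):
        errors = X @ theta - y                 # h(x^(i)) - y^(i)
        grad = (X.T @ errors) / m              # (1/m) * sum over i of error_i * x_j^(i)
        theta_0 = theta[0] - alpha * grad[0]   # no regularization on theta_0
        theta_rest = theta[1:] * (1 - alpha * lam / m) - alpha * grad[1:]
        theta = np.concatenate(([theta_0], theta_rest))
    return theta
```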
<ins>**Normal equation:**</ins>
Now let's approach regularization using the alternate method of the non-iterative normal equation.
To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:
$$\begin{aligned}
& \theta = \left( X^TX + \lambda \cdot L \right)^{-1} X^Ty \ \ \text{where} \ \ L = \begin{bmatrix}
0 & & & \\
& 1 & & \\
& & \ddots & \\
& & & 1
\end{bmatrix}\end{aligned}$$
$L$ is a matrix with $0$ at the top left and $1$'s down the rest of the diagonal, with $0$'s everywhere else. It has dimension $(n+1)\times(n+1)$. Intuitively, this is the identity matrix (though we are not including $x_0$), multiplied by the single real number $\lambda$.
Recall that if $m \leq n$ ($m$: number of training examples, $n$: number of features), then $X^TX$ is non-invertible. However, when we add the term $\lambda \cdot L$ (with $\lambda > 0$), $X^TX + \lambda \cdot L$ becomes invertible.
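A minimal numpy sketch of this regularized normal equation (the function name `normal_equation_reg` is mine; `L` is built as the identity with its top-left entry zeroed so that $\theta_0$ is not regularized):
```python
import numpy as np

def normal_equation_reg(X, y, lam):
    """Closed-form solution: theta = (X^T X + lambda * L)^(-1) X^T y."""
    n_plus_1 = X.shape[1]       # number of parameters, including theta_0
    L = np.eye(n_plus_1)
    L[0, 0] = 0                 # do not regularize theta_0
    # Solving the linear system is preferable to explicitly inverting the matrix.
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```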
# III) Regularized Logistic Regression
The gradient descent algorithm for logistic regression looks identical to the linear regression one. So the good news is that we can apply the very same regularization trick as we did above in order to shrink the $\theta$ parameters.
The original, non-regularized cost function for logistic regression looks like:
$$J(\Theta) = - \dfrac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log(h(x^{(i)})) + (1 - y^{(i)}) \log(1-h(x^{(i)}))\right]$$
As we did in the regularized linear regression, we have to add the regularization term that shrinks all the parameters. It is slightly different here:
$$J_{reg}(\Theta) = - \dfrac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log(h(x^{(i)})) + (1 - y^{(i)}) \log(1-h(x^{(i)}))\right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$
Thus, its derivative (for $j \geq 1$) is:
$$\frac{\partial}{\partial \theta_j} J_{reg}(\Theta) = \frac{1}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j$$
We are now ready to plug it into the gradient descent algorithm for the logistic regression. As always the first item $\theta_0$ is left unprocessed:
\begin{align*} & \text{repeat until convergence:} \; \lbrace \\ \; &
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) x_0^{(i)} \\ \; &
\cdots \\ \; &
\theta_j := \theta_j - \alpha \left[\dfrac{1}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] & \\
\rbrace
\end{align*}
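Finally, a minimal numpy sketch of this regularized update for logistic regression (names are mine; the only change from the linear case is that $h$ is now the sigmoid of $\theta^T x$):
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_gradient_descent_reg(X, y, theta, alpha, lam, num_iters):
    """Regularized gradient descent for logistic regression (theta_0 not regularized)."""
    m = len(y)
    for _ in range(num_iters):
        errors = sigmoid(X @ theta) - y        # h(x^(i)) - y^(i)
        grad = (X.T @ errors) / m
        grad[1:] += (lam / m) * theta[1:]      # add (lambda/m) * theta_j for j >= 1
        theta = theta - alpha * grad
    return theta
```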