[Archived] Introduction to Least Squares

# [Archived] Introduction to Least Squares by [Alvin Wan](http://alvinwan.com/cs189), [UGBA 198 Lecturer](http://ugba198.com) and [CS189 Head Student Instructor](http://eecs189.org). # Introduction Let us first step back to high school, where we learned to fit a line to $n$ points. However, let's take a different perspective. We consider each point to be a *sample* or "observation", and we consider $x$ to be each sample's *features*. $y$ is each sample's *label*. In least squares, our goal is to predict y using x. Let's now use this new vocabulary in another context. For example, we may observe birds in the wild. Each bird is a *sample*, and each bird's traits (wingspan, beak color, height, weight etc.) are its *features*. We may say its lifespan is its *label*. Can we use least squares to predict the lifespan of each bird? In fact, can we predict the lifespan if our measurements are noisy? What if some features are irrelevant? To answer these questions, we discuss an extension of grade school mathematics: the ordinary least squares (OLS) problem. We additionally explore its utility across a number of machine learning contexts, using more powerful, robust variants of OLS. ## Prerequisites This tutorial assumes familiarity with the following concepts: - Fundamental linear algebra -- definitions of a matrix, vector, matrix multiplication - Fundamental multivariable calculus -- taking a partial derivative - Python3.6  # Step 0 -- Installing Prerequisite Software  We will need just two packages to begin, where the former `numpy` is an optimized linear algebra library and the latter `matplotlib` is a plotting utility. pip install numpy matplotlib # Step 1 -- Ordinary Least Squares with Scalars First, consider the two-dimensional case from high school. We have a set of coordinates and wish to fit a line to these points. Let us first assume this line passes through the origin. Then, our goal is to find $w$, for the following line: $$y = wx$$ To estimate the best $w$, we collect $n$ different $(x, y)$ coordinates. Each coordinate, or *sample*, has just one x-coordinate as a *feature* and a y-coordinate as a *label*. We will denote the $i$th sample as $(x_i, y_i)$. Say that our estimate for $w$ is $\hat{w}$. Then, our predicted $y_i$ is $$\hat{y}_i = \hat{w}x_i$$ Naturally, our goal is to find the $w$ that minimizes the difference between our predicted $\hat{y_i}$ and true $y_i$, for all of our $n$ samples. Then, translating the previous sentence into mathematics, we have our *objective function*. $$\min_w \sum_{i=1}^n (\hat{y}_i - y_i)^2$$ Plugging in $\hat{y_i}$, we then have $$\min_w \sum_{i=1}^n (wx_i - y_i)^2$$ How do we minimize this? We take the derivative with respect to $w$, set to 0 and solve. Let us define the *loss* to be $L(w) = \sum_{i=1}^n (wx_i - y_i)^2$. Then, we wish to find the partial derivative: $$\frac{\partial L}{\partial w} = \sum_{i=1}^n 2(wx_i - y_i)x_i = \sum_{i=1}^n 2(wx_i^2 - y_ix_i)$$ Set to 0 and solve for $w$. $$\sum_{i=1}^n 2(wx_i^2 - y_ix_i) = 0\\ \sum_{i=1}^n (wx_i^2 - y_ix_i) = 0\\ w\sum_{i=1}^n x_i^2 = \sum_{i=1}^n y_ix_i\\ \hat{w} = \frac{\sum_{i=1}^n y_ix_i}{\sum_{i=1}^n x_i^2} $$ This expression may look slightly intimidating. However, to gain some intuition, take the case where $n=1$. Then, the expression becomes $\hat{w} = \frac{y_1x_1}{x_1^2} = \frac{y_1}{x_1}$, as we would expect. The important portion of this section is to remind you that least squares has a *closed form* solution, meaning that the optimal solution for our problem can be computed, explicitly. To understand why this is significant, we need to examine a problem without a closed-form solution. For example, we could never solve $x = \log x$, for a standard base-10 logarithm. Try graphing these two lines, and we see that they never intersect. In which case, we have no closed-form solution. On the other hand, ordinary least squares has a closed-form, which is good news. For any problem reduced to least squares, we can then compute the optimal solution, given our data and assumptions. # Step 2 - Ordinary Least Squares Vectorized This representation is slightly unwiedly. What if we considered lines that did not pass through the origin? Then, our line is of the form $$y = mx + b$$ Our expression becomes slightly more complicated. However, there is a way out. Let us *vectorize* this expression, meaning that we re-express this mathematical statement using vectors and matrices. As we will show later, our objective then remains the same, regardless of our linear model. There are a number of advantages to vectorizing: - Our expression gives more intuition and a deeper understanding. - We can invoke a multitude of matrix properties. Let us begin with the original model that we discussed in the previous part. We will take the model $y_i = wa_i$ where we denote each sample $a_i$ using the following $$a_i = \begin{bmatrix}x_i\end{bmatrix}$$ Succinctly, we denote this as $a_i \in \mathbb{R}^1$. This notation means each sample $a_i$ belongs to a real-valued, one-dimensional space; more simply said, each sample is a column vector with 1 real number. Using these samples $a_i$, we then stack all of our samples $a_i$ into a matrix $A$, in the following way  $$A = \begin{bmatrix} - & a_1^T & -\\ - & a_2^T & -\\ & \vdots& \\- & a_n^T & -\end{bmatrix} = \begin{bmatrix}x_1 \\ x_2 \\ \vdots \\ x_n\end{bmatrix} $$ Say we have $n$ samples. The above $A$ is then a $n\times 1$ matrix. Said more succinctly, we have $A \in \mathbb{R}^{n\times 1}$. We can likewise stack our labels, or y-coordinates, for each sample in a column vector, giving us $$\vec{y} = \begin{bmatrix}y_1 \\ y_2 \\ \vdots \\ y_n\end{bmatrix}$$ We now have a new expression, using $A, y$. Note that this looks very similar to the scalar form we obtained in the first step: $$\min_w \|Aw - y\|_2^2$$ Generally speaking, we denote $a_i \in \mathbb{R}^d$ to mean that $a_i$ belongs to a $d-$dimensional space. Then, $A \in \mathbb{R}^{n \times d}, w \in \mathbb{R}^d$. We will skip the mathematical details (see Appendix) for now. The solution to our vectorized objective function is $$\hat{w} = (X^TX)^{-1}X^Ty$$ As before, this expression resembles our scalar form solution. # Step 3 - Fitting a Line Demonstration One item to note is that we simplified the problem. True linear regression models an affine function of $x$, meaning we include a constant offset $b$. In other words, we fit a line $$y = mx + b$$ Note that this is equal to the following $$y = \begin{bmatrix}m & b\end{bmatrix}\begin{bmatrix}x \\ 1\end{bmatrix}$$ Instead of defining a new framework, we can simplify redefine our sample and model. We then have a new weight vector $w$ and sample $a_i$, using $$w = \begin{bmatrix}m \\ b\end{bmatrix}, a_i = \begin{bmatrix}x_i \\ 1\end{bmatrix}$$ Whereas our scalar objective and solution would change, our vectorized expressions conveniently remain the same. Rewriting our vectorized objective function, we then have $$\min_w \|Aw - y\|_2^2 = \|\begin{bmatrix}x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1\end{bmatrix}\begin{bmatrix}m \\ b\end{bmatrix} - \begin{bmatrix}y_1 \\ y_2 \\ \vdots \\ y_n\end{bmatrix}\|_2^2$$ First, navigate to your home directory. cd ~ Using `nano` or your favorite text editor, open `ols.py`. nano ols.py Then, copy and paste the following in. Take a moment to read through the comments and ensure that you understand the code. ``` import numpy as np import matplotlib matplotlib.use('Agg') # wizardry to prevent Tkinter error import matplotlib.pyplot as plt n = 100 y = lambda x: 2 * x + 3 # true line e = 1000 # 1. Create n random samples and plot xs = list(range(n)) # create xs from 0 to 99 ys = [y(x) + np.random.random() * e - (e/2.) for x in xs] # create ys with 0-mean noise plt.scatter(xs, ys, s=2) # plot samples plt.plot(xs, [y(x) for x in xs], label='original') # plot real line plt.legend() plt.savefig('points_with_real_line.png') # 2. Create A and y A = np.array([[x, 1] for x in xs]) y = np.array([[y] for y in ys]) print(A.shape) # (100, 2) print(y.shape) # (100, 1) # 3. Solve for optimal model w and plot predicted ys w = np.linalg.inv(A.T.dot(A)).dot(A.T.dot(y)) plt.plot(xs, A.dot(w), label='fitted') # plot fitted line plt.legend() plt.savefig('points_with_real_fitted_line.png') ``` Exit your editor, then run the script to generate two plots `points_with_real_line.png` and `points_with_real_fitted_line.png`. python ols.py We expect the following shell output Size of A: (100, 2) Size of y: (100, 1) The two generated plots will then resemble the following <img src="https://i.imgur.com/A1qWfox.png"> <img src="https://i.imgur.com/GcZ7QiO.png"> # Step 4 - Ridge Regression Least squares appears to do fairly well. Given that we use a linear model to predict a linear model, albeit noisy, we expect as much. However, what if the noise in our dataset increases? We examine what happens as noise increases in magnitude by 10-fold. Unfortunately, least squares on its own begins to deviate quite significantly from the true line. Below, the orange line is the original, and blue is the predicted line. <img src="https://i.imgur.com/X2tHJAG.png"> By *regularizing* our least squares objective function using the l2-norm, we see a new approach called ridge regression. Regularization is important for a number of reasons, which we will motivate in greater detail later on. For now, we can consider the l2-norm regularization term, $\|w\|_2^2$ (1) a penalty on model complexity and (2) semi-effective antidote for overfitting. See the conclusion for links to formal notions. Our new objective function is then following: $$\min_w \|Aw-y\|_2^2 + \lambda \|w\|_2^2$$ Note that when $\lambda \rightarrow \infty$, $w$ is pushed towards 0. In the opposite scenario, when $\lambda \rightarrow 0$, we see ordinary least squares. Already, we see that $\lambda$ allows us to control tradeoffs between the optimality of our solution, for the given training data, and "overfitting". Our new closed-form solution is the following: $$\hat{w}_{ridge} = (X^TX + \lambda I)^{-1}X^Ty$$ Using ridge regression, we see far superior performance. Subjectively, we can see closer fits for the line. Below, orange is again the true line, and blue is again the predicted line using ordinary least squares. Green is the predicted line using ridge regression. <img src="https://i.imgur.com/bKqYQ4G.png"> To make this clearer, we plot the error for both ordinar least squares and ridge regression as functions of noise magnitude. <img src="https://i.imgur.com/Yr7Ynxf.png"> For the curious, the script used to generate these plots can be found [here](https://gist.github.com/alvinwan/ea990e63bae78f4945b92bb8b9f8ca24). # Step 5 - Fitting Nonlinear Features Thus far, we have used least squares to find linear relationships between the label and the features. This means that for some label $y$ and feature $x$, we find a scalar coefficient $w$ where $y = wx$. This makes sense, as least squares is a linear model. However, what if we had nonlinear relationships? Specifically, consider a set of data sampled along a circle. Recall that the equation for a circle is $$x^2 + y^2 = r^2$$ We would be hard-pressed to find a linear model that represents this quadratic equation, well. However, it turns out that we can use least squares to model even labels that are nonlinear in the raw features. To accomplish this, we *featurize* our samples. Consider the following framework: Again, $x, r$ are our features and $y$ is our label. Can we learn a model quadratic in $r, x, y$ using least squares? We cannot. However, can we learn a model linear in $r^2, x^2, y^2$? We can. Note the subtlety. The model is quadratic in $r, x, y$ but linear in $r^2, x^2, y^2$. More generally, we can take any non-linear featurization to apply least squares to labels that are non-linear in the features. This is a fairly powerful tool, known as *feature-lifting*, where we take various higher-order features. For now, we will consider a particular featurization. Before, *linear regression* considered all models of the form $y = mx+b$. However, we now consider all polynomials. We refer to the following featurization as *polynomial regression* $$y = a_nx^n + a_{n-1}x^{n-1} + \cdots + a_1x + a_0$$ where $a_1, a_2 ... a_n$ are the variables that least squares will solve for. As before, we can rewrite this as $$y = \begin{bmatrix}a_0 \\ a_1 \\ \vdots \\ a_n\end{bmatrix}\begin{bmatrix}1 & x & \cdots & x^n\end{bmatrix}$$ Then, our weight vector $w$ is the former and our latter is the sample $a_i$. Again, we simply plug into the ordinary least squares or ridge regression formulations without further hassle. # Conclusion In this tutorial, we've covered ordinary least squares and one of its derivatives, ridge regression. However, this only scratches the surface. We've re-introduced the grade-school notion of fitting a line to points, but along with this re-introduction comes a number of other avenues to explore: a probabilistic interpretation using Gaussians, other variants such as LASSO, and finally, a framework to discuss fundamental concepts in machine learning.  # Appendix - Vectorized Ordinary Least Squares Solution This section is written for those familiar with vector calculus.  $$\frac{\partial L}{\partial w} = \frac{\partial}{\partial w}((Aw - y)^T(Aw-y))\\ = \frac{\partial}{\partial w}((Aw)^T(Aw) - 2(Aw)^Ty + y^Ty)\\ = 2A^TAw - 2A^Ty$$ Now, set the gradient to 0. $$A^TAw = A^Ty\\w= (A^TA)^{-1}A^Ty$$ where, we are guaranteed that $A^TA$ is invertible, as $A^TA$ is PSD. For a quick proof, note that $\forall x, x^TA^TAx = \|Ax\|_2^2 \geq 0$ by the properties of all norms. This leads to another interesting interpretation of least squares, as $A^{\dagger} = (A^TA)^{-1}A^T$ is in fact the pseudo-inverse for matrix $A$. Then, we have that our objective $y = Aw$ directly leads to $$w = A^{\dagger}Aw = A^{\dagger}y = (A^TA)^{-1}A^Ty$$ # Appendix - Vectorized Ridge Regression Solution As above, we can take the gradient. We know the gradient for the ordinary least squares objective is $2A^TAw - 2A^Ty$, so it suffices to examine the regularization term $$\frac{\partial \lambda\|w\|_2^2}{\partial w} = 2\lambda w$$ Thus, we have the following, set to 0 $$2A^TAw - 2A^Ty + 2\lambda w = 0\\ A^TAw + \lambda w = A^Ty\\ (A^TA + \lambda I)w = A^Ty\\ w = (A^TA + \lambda I)^{-1}A^Ty$$