# The Math of Linear and Polynomial Regression
#### CS181
#### Spring 2023

In this set of notes, we show you all the gruesome algebra, with concrete examples, behind simple linear regression, multiple linear regression, polynomial regression, and regression on arbitrary bases.
## Simple Linear Regression
Let's say we have a data set with $N$ observations: $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N) \}$, and we want to fit a linear model that predicts $y$ using $x$.
**Our model:** $y = w_1 x + w_0.$
**Our metric for measuring how well our model fits the data:** $\mathcal{L}(w_1, w_0) = \frac{1}{N}\sum_{n=1}^N (y_n - (w_1 x_n + w_0))^2$, *(Mean Square Error)*
*Remark:* each $y_n$ is the observed outcome and $\hat{y}_n = w_1 x_n + w_0$ is the corresponding *predicted* outcome given by our model.
>*Example:* If our data set consists of two points $\mathcal{D} = \{(1, 2), (3, 4) \}$, the Mean Square Error expression would look like
>$$\mathcal{L}(w_1, w_0) = \frac{1}{2}\left( (2 - (w_1 * 1 + w_0))^2 + (4 - (w_1 * 3 + w_0))^2\right)$$
>If I wanted to measure how well the model $M_1: y = x + 1$ fits my data, I would compute
>$$\mathcal{L}(1, 1) = \frac{1}{2}\left( (2 - (1 * 1 + 1))^2 + (4 - (1 * 3 + 1))^2\right) = 0.$$
>If I wanted to measure how well the model $M_2: y = x$ fits my data, I would compute
>$$\mathcal{L}(1, 0) = \frac{1}{2}\left( (2 - (1 * 1 + 0))^2 + (4 - (1 * 3 + 0))^2\right) = 1.$$
>Since $M_1$ has a smaller Mean Square Error than $M_2$, we conclude that $M_1$ is a better model for our data.
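
To make this concrete, here is a minimal NumPy sketch (the function and variable names are our own, not part of the notes) that computes the Mean Square Error of the two candidate models on this two-point data set.
```python
import numpy as np

# The two-point data set from the example above.
x = np.array([1.0, 3.0])
y = np.array([2.0, 4.0])

def mse(w1, w0, x, y):
    """Mean Square Error of the model y_hat = w1 * x + w0."""
    y_hat = w1 * x + w0
    return np.mean((y - y_hat) ** 2)

print(mse(1.0, 1.0, x, y))  # M1: y = x + 1  ->  0.0
print(mse(1.0, 0.0, x, y))  # M2: y = x      ->  1.0
```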
**Our goal:** We want to find the coefficients $w_0, w_1$ such that the Mean Square Error, $\mathcal{L}(w_1, w_0)$, is as small as possible -- i.e. our model predictions are as accurate as possible, on average.
**Our goal as an optimization problem:** In math formalism, our goal can be written as solving the following optimization problem:
$$
w^*_0, w^*_1 = \mathrm{argmin}_{w_1, w_0} \mathcal{L}(w_1, w_0),
$$
where $\mathcal{L}$ is called our *objective function*.
**How to solve our optimization problem:** We use calculus to minimize $\mathcal{L}(w_1, w_0)$!
*It's easier than you think!* Mathematical optimization is mostly taking a "derivative" and then "setting it equal to zero"!
*It's more complex than you think!* Finding where the "derivative" is zero produces a set of candidate places where our objective *might* be globally optimized. But in order to verify that an objective function is globally optimized at an input, we need to reason about the global shape of the function. This often requires us to engage with the deep theory of optimization.
**Optimizing $\mathcal{L}:$** Theory of convex optimization (not required for this class) tells us that the Mean Square Error objective, $\mathcal{L}$, is *convex*. That is, where the "derivative" of $\mathcal{L}$ is "zero" is where $\mathcal{L}$ is globally minimized.
Now, $\mathcal{L}$ is a function of two variables; the analog of the "derivative" of a multivariate function is called the *gradient*, denoted $\nabla \mathcal{L}$. The gradient of $\mathcal{L}$ is the vector of *partial derivatives* of $\mathcal{L}$:
$$
\nabla \mathcal{L} =
\begin{pmatrix}
\frac{\partial \mathcal{L}}{\partial w_1}\\
\frac{\partial \mathcal{L}}{\partial w_0}
\end{pmatrix}
$$
where, for example, the partial derivative $\frac{\partial \mathcal{L}}{\partial w_1}$ is the derivative of $\mathcal{L}$ with respect to the variable $w_1$ while treating the other variable $w_0$ as an (unknown) constant.
### Step 1: Compute $\nabla \mathcal{L}$
We have:
$\begin{align}
\nabla \mathcal{L} &=
\begin{pmatrix}
\frac{\partial \mathcal{L}}{\partial w_1}\\
\frac{\partial \mathcal{L}}{\partial w_0}
\end{pmatrix}\\
&=
\begin{pmatrix}
\frac{\partial}{\partial w_1} \frac{1}{N}\sum_{n=1}^N (y_n - (w_1 x_n + w_0))^2\\
\frac{\partial}{\partial w_0} \frac{1}{N}\sum_{n=1}^N (y_n - (w_1 x_n + w_0))^2
\end{pmatrix}\\
&=
\begin{pmatrix}
\frac{1}{N}\sum_{n=1}^N 2(-x_n)(y_n - (w_1 x_n + w_0))\\
\frac{1}{N}\sum_{n=1}^N -2(y_n - (w_1 x_n + w_0))
\end{pmatrix}, \hskip0.5cm \text{(Chain Rule)}\\
\end{align}$
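
As a sanity check on the Chain Rule computation above, here is a small NumPy sketch (helper functions of our own, not part of the notes) that compares this analytic gradient to a central finite-difference approximation on the toy data set from the earlier example; the two should agree to several decimal places.
```python
import numpy as np

def grad_L(w1, w0, x, y):
    """Analytic gradient of the Mean Square Error, as derived above."""
    r = y - (w1 * x + w0)                     # residuals y_n - (w1 x_n + w0)
    return np.array([np.mean(2 * (-x) * r),   # dL/dw1
                     np.mean(-2 * r)])        # dL/dw0

def grad_L_numeric(w1, w0, x, y, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    L = lambda a, b: np.mean((y - (a * x + b)) ** 2)
    return np.array([(L(w1 + eps, w0) - L(w1 - eps, w0)) / (2 * eps),
                     (L(w1, w0 + eps) - L(w1, w0 - eps)) / (2 * eps)])

x = np.array([1.0, 3.0])
y = np.array([2.0, 4.0])
print(grad_L(0.5, 0.5, x, y))
print(grad_L_numeric(0.5, 0.5, x, y))  # should match the analytic gradient closely
```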
### Step 2: Solve for stationary points -- where $\nabla \mathcal{L}$ is zero
Points $(w_1, w_0)$ where $\nabla \mathcal{L}$ is zero are called *stationary points*.
Since $\nabla \mathcal{L}$ is a vector, we set:
$\begin{align}
\nabla \mathcal{L} &= \begin{pmatrix}
\frac{1}{N}\sum_{n=1}^N 2(-x_n)(y_n - (w_1 x_n + w_0))\\
\frac{1}{N}\sum_{n=1}^N -2(y_n - (w_1 x_n + w_0))
\end{pmatrix}
= \begin{pmatrix}
0\\
0
\end{pmatrix}
\end{align}$
This gives us a system of two equations,
$\begin{align}
\frac{1}{N}\sum_{n=1}^N 2(-x_n)(y_n - (w_1 x_n + w_0)) & =0\\
\frac{1}{N}\sum_{n=1}^N -2(y_n - (w_1 x_n + w_0)) &= 0
\end{align}$
which we can solve by substitution, first for $w_0$ in equation (2),
$\begin{align}
\frac{1}{N}\sum_{n=1}^N -2(y_n - (w_1 x_n + w_0)) &= 0\\
\frac{2}{N}\sum_{n=1}^N (-y_n + w_1 x_n + w_0) &= 0\\
\sum_{n=1}^N (-y_n + w_1 x_n + w_0) &= 0\\
-\sum_{n=1}^N y_n + \sum_{n=1}^N w_1 x_n + \sum_{n=1}^N w_0 &= 0\\
-\sum_{n=1}^N y_n + \sum_{n=1}^N w_1 x_n + N w_0 &= 0\\
N w_0 &= \sum_{n=1}^N y_n - \sum_{n=1}^N w_1 x_n\\
w_0 &= \frac{\sum_{n=1}^N y_n - \sum_{n=1}^N w_1 x_n}{N}\\
&= \frac{\sum_{n=1}^N y_n - w_1\sum_{n=1}^N x_n}{N}.
\end{align}$
We then substitute $w_0$ (after renaming the index variable in the above expression, from $n$ to $i$) into equation (1) and solve for $w_1$:
$\begin{align}
\frac{1}{N}\sum_{n=1}^N 2(-x_n)(y_n - (w_1 x_n + w_0)) & =0\\
\frac{1}{N}\sum_{n=1}^N 2(-x_n)\left(y_n - \left(w_1 x_n + \frac{\sum_{i=1}^N y_i - w_1\sum_{i=1}^N x_i}{N}\right)\right) & =0\\
\frac{-2}{N}\sum_{n=1}^N x_n\left(y_n - \left(w_1 x_n + \frac{\sum_{i=1}^N y_i - w_1\sum_{i=1}^N x_i}{N}\right)\right) & =0\\
\sum_{n=1}^N x_n\left(y_n - \left(w_1 x_n + \frac{\sum_{i=1}^N y_i - w_1\sum_{i=1}^N x_i}{N}\right)\right) & =0\\
\sum_{n=1}^N \left(x_n y_n - w_1 x^2_n - \frac{x_n\sum_{i=1}^N y_i - w_1x_n \sum_{i=1}^N x_i}{N}\right) & =0\\
\sum_{n=1}^N x_n y_n - w_1 \sum_{n=1}^N x^2_n - \frac{1}{N}\sum_{n=1}^Nx_n\sum_{i=1}^N y_i + w_1\frac{1}{N}\sum_{n=1}^Nx_n \sum_{i=1}^N x_i & =0\\
-w_1 \sum_{n=1}^N x^2_n + w_1\frac{1}{N}\sum_{n=1}^Nx_n \sum_{i=1}^N x_i & = \frac{1}{N}\sum_{n=1}^Nx_n\sum_{i=1}^N y_i -\sum_{n=1}^N x_n y_n\\
w_1 &= \frac{\frac{1}{N}\sum_{n=1}^Nx_n\sum_{i=1}^N y_i -\sum_{n=1}^N x_n y_n}{\frac{1}{N}\sum_{n=1}^Nx_n \sum_{i=1}^N x_i - \sum_{n=1}^N x^2_n}
\end{align}$
### Step 3: Verify $\mathcal{L}$ is globally minimized at the stationary points
We have a unique stationary point of $\mathcal{L}$ at $(w^*_1, w^*_0)$ where
$\begin{align}
w^*_1 &= \frac{\frac{1}{N}\sum_{n=1}^Nx_n\sum_{i=1}^N y_i -\sum_{n=1}^N x_n y_n}{\frac{1}{N}\sum_{n=1}^Nx_n \sum_{i=1}^N x_i - \sum_{n=1}^N x^2_n}\\
w^*_0 &=\frac{\sum_{n=1}^N y_n - w_1\sum_{n=1}^N x_n}{N}
\end{align}$
Since we know (from theory outside the scope of the class) that $\mathcal{L}$ is convex, we conclude that $\mathcal{L}$ is minimized at $(w^*_1, w^*_0)$.
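
Here is a minimal NumPy sketch of these closed-form formulas (the function name and the synthetic data are our own, for illustration); `np.polyfit` is used only as an independent check.
```python
import numpy as np

def simple_linear_regression(x, y):
    """Closed-form (w1*, w0*) from the stationary-point formulas above."""
    N = len(x)
    w1 = (np.sum(x) * np.sum(y) / N - np.sum(x * y)) / (np.sum(x) ** 2 / N - np.sum(x ** 2))
    w0 = (np.sum(y) - w1 * np.sum(x)) / N
    return w1, w0

# Synthetic data: y is roughly 2x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=50)

print(simple_linear_regression(x, y))  # should be close to (2, 1)
print(np.polyfit(x, y, deg=1))         # NumPy's least-squares fit, for comparison
```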
---
## Multiple Linear Regression
Now we consider fitting a linear model to a data set $\mathcal{D}=\{(x_{11},\ldots, x_{1D}, y_1), \ldots, (x_{N1},\ldots, x_{ND}, y_N) \}$ -- that is, instead of one input variable we have $D$ of them. Our linear model will have the form
$$
y = w_0 + w_1x_1 + \ldots + w_Dx_D.
$$
### Introducing Matrix Notation
**The Problem:** In this case, our Mean Square Error will look like
$$
\mathcal{L}(w_0, \ldots, w_D) = \frac{1}{N}\sum_{n=1}^N (y_n - (w_0 + w_1x_{n1} + \ldots + w_Dx_{nD}))^2
$$
and its gradient will be a $D+1$-dimensional vector containing the $D+1$ partial derivatives of $\mathcal{L}$. If we optimized $\mathcal{L}$ as we did in simple linear regression, we'd have to solve a system of $D+1$ linear equations for $w_0, w_1, \ldots, w_D$ by hand. This is not tractable!
**The Solution:** Instead, we will rewrite our model and the Mean Square Error using matrix notation, then do the optimization calculus directly with matrices and vectors.
We write $\mathbf{x}$ for the $D+1$-dimensional vector
$$
\mathbf{x} =
\begin{pmatrix}
1\\
x_{1}\\
\vdots\\
x_{D}
\end{pmatrix}.
$$
Note the addition of the scalar 1 above the inputs $x_{1}, \ldots, x_{D}$; this will allow us to simplify the expression for our linear model later.
We write $\mathbf{w}$ for the $D+1$-dimensional vector
$$
\mathbf{w} =
\begin{pmatrix}
w_0\\
w_{1}\\
\vdots\\
w_{D}
\end{pmatrix},
$$
This notation allows our model to be written very simply.
**Our Model:** $y = \mathbf{w}^T\mathbf{x}.$
*Remark:* The notation $\mathbf{w}^T\mathbf{x}$ denotes the inner product of the vectors $\mathbf{w}, \mathbf{x}$, i.e. $\mathbf{w}^T\mathbf{x} = \sum_{d=0}^D w_dx_d$, where $x_0$ is just the 1 we put as the first component of $\mathbf{x}$.
Now, for our data set, $\mathcal{D}=\{(x_{11},\ldots, x_{1D}, y_1), \ldots, (x_{N1},\ldots, x_{ND}, y_N) \}$, we construct a matrix denoted by $\mathbf{X}$
$$
\mathbf{X} = \begin{pmatrix}
1 & x_{11} & \ldots & x_{1D}\\
\vdots & \vdots & & \vdots \\
1 & x_{n1} & \ldots & x_{nD}\\
\vdots & \vdots & & \vdots \\
1 & x_{N1} & \ldots & x_{ND}\\
\end{pmatrix}
$$
When we wish to refer to the $n$-th row of $\mathbf{X}$, we write $\mathbf{x}_n$ (treated, as usual, as a column vector):
$$
\mathbf{x}_n =
\begin{pmatrix}
1\\
x_{n1}\\
\vdots\\
x_{nD}
\end{pmatrix}.
$$
*Remark on notation:* on the handwritten lecture slides, for readability and to save space, I often denote vectors as row vectors and denote the $(n,d)$-th entry of the matrix $\mathbf{X}$ by $x^{(n)}_d$ (instead of $x_{nd}$); I also denote each row of the matrix $\mathbf{X}$ by $\mathbf{x}^{(n)}$ instead of $\mathbf{x}_{n}$. You may use whichever of these two notations you like, as long as you are consistent.
Now, we let $\mathbf{y}$ denote the $N$-dimensional vector consisting of the observed outcomes in our data set
$$
\mathbf{y} = \begin{pmatrix}
y_1\\
y_{2}\\
\vdots\\
y_{N}
\end{pmatrix}.
$$
This notation allows us to write our Mean Square Error very simply.
**Our Objective Function:**
$\mathcal{L}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^N(y_n - \mathbf{w}^T\mathbf{x}_n)^2 = \frac{1}{N}(\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w})$
You will see in the following that writing $\mathcal{L}$ in terms of matrices and vectors will simplify our optimization.
### Step 1: Compute $\nabla \mathcal{L}$
We compute $\nabla \mathcal{L}$ using matrix differentiation -- a set of differentiation rules that help us quickly compute gradients of scalar functions defined in terms of matrices and vectors.
$$
\begin{align}
\nabla\mathcal{L} &= \frac{\partial}{\partial \mathbf{w}} \frac{1}{N}(\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w})\\
&= \frac{\partial}{\partial \mathbf{w}} \frac{1}{N}(\mathbf{y}^\top - \mathbf{w}^\top\mathbf{X}^\top)(\mathbf{y} - \mathbf{X}\mathbf{w})\\
&= \frac{1}{N}\frac{\partial}{\partial \mathbf{w}} (\mathbf{y}^\top \mathbf{y} - \mathbf{w}^\top\mathbf{X}^\top \mathbf{y} - \mathbf{y}^\top\mathbf{X}\mathbf{w}+ \mathbf{w}^\top\mathbf{X}^\top \mathbf{X}\mathbf{w})\\
&= \frac{1}{N} (\frac{\partial}{\partial \mathbf{w}}\mathbf{y}^\top \mathbf{y} - \frac{\partial}{\partial \mathbf{w}}\mathbf{w}^\top\mathbf{X}^\top \mathbf{y} - \frac{\partial}{\partial \mathbf{w}}\mathbf{y}^\top\mathbf{X}\mathbf{w}+ \frac{\partial}{\partial \mathbf{w}}\mathbf{w}^\top\mathbf{X}^\top \mathbf{X}\mathbf{w})\\
&= \frac{1}{N} (0 - \mathbf{X}^\top \mathbf{y} - \mathbf{X}^\top \mathbf{y} + 2\mathbf{X}^\top \mathbf{X}\mathbf{w})\\
&= \frac{2}{N}(\mathbf{X}^\top \mathbf{X}\mathbf{w} - \mathbf{X}^\top \mathbf{y})
\end{align}
$$
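
As a quick check on this vectorized gradient (a sketch with variable names of our own choosing), we can compare it against a finite-difference approximation of $\mathcal{L}(\mathbf{w})$ on random data.
```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 20, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])  # design matrix with a leading column of 1s
y = rng.normal(size=N)
w = rng.normal(size=D + 1)

def L(w):
    """Mean Square Error in matrix form: (1/N)(y - Xw)^T (y - Xw)."""
    r = y - X @ w
    return (r @ r) / N

grad_analytic = (2 / N) * (X.T @ X @ w - X.T @ y)

# Central finite differences, one coordinate of w at a time.
eps = 1e-6
grad_numeric = np.array([(L(w + eps * e) - L(w - eps * e)) / (2 * eps)
                         for e in np.eye(D + 1)])

print(np.max(np.abs(grad_analytic - grad_numeric)))  # should be very small
```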
### Step 2: Solve for stationary points -- where $\nabla \mathcal{L}$ is zero
We set $\nabla\mathcal{L} = 0$ and solve for $\mathbf{w}$, assuming that $\mathbf{X}^\top \mathbf{X}$ is invertible:
$$
\begin{align}
\nabla\mathcal{L}= \frac{2}{N}(\mathbf{X}^\top \mathbf{X}\mathbf{w} - \mathbf{X}^\top \mathbf{y}) &=0\\
\mathbf{X}^\top \mathbf{X}\mathbf{w} - \mathbf{X}^\top \mathbf{y}&=0\\
\mathbf{X}^\top \mathbf{X}\mathbf{w} &= \mathbf{X}^\top \mathbf{y}\\
\mathbf{w} &= (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}\\
\end{align}
$$
### Step 3: Verify $\mathcal{L}$ is globally minimized at the stationary points
We have a unique stationary point of $\mathcal{L}$ at $\mathbf{w}^*$ where
$$\begin{align}
\mathbf{w}^* &= (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}
\end{align}$$
Since we know (from theory outside the scope of the class) that $\mathcal{L}$ is convex, we conclude that $\mathcal{L}$ is minimized at $\mathbf{w}^*$.
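
Here is a minimal NumPy sketch of this closed-form solution (the function name and synthetic data are our own). It solves the linear system $\mathbf{X}^\top \mathbf{X}\mathbf{w} = \mathbf{X}^\top \mathbf{y}$ rather than forming an explicit inverse, which is numerically preferable, and it assumes $\mathbf{X}^\top \mathbf{X}$ is invertible.
```python
import numpy as np

def fit_linear_regression(X_raw, y):
    """Solve the normal equations X^T X w = X^T y for w.

    X_raw is the N x D matrix of raw inputs; a column of 1s is prepended
    so that w[0] plays the role of the bias w_0.
    """
    X = np.column_stack([np.ones(X_raw.shape[0]), X_raw])
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic example: y = 1 + 2*x1 - 3*x2 + noise.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 2))
y = 1 + 2 * X_raw[:, 0] - 3 * X_raw[:, 1] + rng.normal(scale=0.1, size=100)

print(fit_linear_regression(X_raw, y))  # should be close to [1, 2, -3]
```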
---
## Polynomial Regression
Let's go back to our 1D data set with $N$ observations: $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N) \}$. Say now we want to fit a polynomial model of degree $D$ that predicts $y$ using $x$.
**Our model:** $y = w_0 + w_1x + w_2x^2 + \ldots + w_Dx^D.$
In order to find coefficients for our model that minimize the Mean Square Error, do we need to do another round of complex optimization?
Luckily, polynomial regression is secretly multiple linear regression in disguise!
Let's introduce the notation $\phi(x)$ for the vector
$$
\phi(x) =
\begin{pmatrix}
1\\
x\\
x^2\\
\vdots\\
x^D
\end{pmatrix}.
$$
We can now write our model as:
$$
y = \mathbf{w}^T\phi(x),
$$
where again $\mathbf{w}$ denotes the vector of model coefficients
$$
\mathbf{w} =
\begin{pmatrix}
w_0\\
w_1\\
\vdots\\
w_D
\end{pmatrix}.
$$
We think of $\phi:\mathbb{R} \to \mathbb{R}^{D+1}$ as a map turning each scalar input $x$ into a $D+1$-dimensional vector consisting of polynomial powers of $x$. We call $\phi$ the *feature map* and $\phi(x)$ the *feature vector*.
Now, let us use $\phi(\mathbf{X})$ or $\mathbf{\Phi}$ to denote the matrix
$$
\mathbf{\Phi} = \begin{pmatrix}
1 & x_{1} & \ldots & x_{1}^D\\
\vdots & \vdots & & \vdots \\
1 & x_{n} & \ldots & x_{n}^D\\
\vdots & \vdots & & \vdots \\
1 & x_{N} & \ldots & x_{N}^D\\
\end{pmatrix}
$$
where the $n$-th row of $\mathbf{\Phi}$ is the feature vector $\phi(x_n)$ of the $n$-th input in the data set:
$$
\phi(x_n) =
\begin{pmatrix}
1\\
x_n\\
x_n^2\\
\vdots\\
x_n^D
\end{pmatrix}.
$$
Just like in multiple linear regression, we can write out the Mean Square Error very simply using matrix notation.
**Our Objective Function:** $\mathcal{L}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^N(y_n - \mathbf{w}^T\phi(x_n))^2 = \frac{1}{N}(\mathbf{y} - \mathbf{\Phi}\mathbf{w})^\top(\mathbf{y} - \mathbf{\Phi}\mathbf{w})$
Using our solution for multiple linear regression, we get that the solution for polynomial regression is given by
$$
\mathbf{w}^* = (\mathbf{\Phi}^\top \mathbf{\Phi})^{-1}\mathbf{\Phi}^\top \mathbf{y}.
$$
> *Example:* The essence of polynomial regression is to first expand the data set by adding polynomial features for each input in the data set, and then fit a linear model on the augmented data.
> Suppose our data set is $\mathcal{D}=\{(1, 2), (3, 4), (5, 6)\}$, and we want to fit a degree 2 polynomial to it. We first form the *feature matrix* $\mathbf{\Phi}$
>$$
\mathbf{\Phi} = \begin{pmatrix}
1 & 1 & 1^2\\
1 & 3 & 3^2\\
1 & 5 & 5^2\\
\end{pmatrix}
$$
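
Continuing this example with a NumPy sketch (our own variable names), we can build $\mathbf{\Phi}$ and solve the same normal equations as in multiple linear regression. Since the three points lie exactly on the line $y = x + 1$, the fitted coefficients should be close to $(1, 1, 0)$.
```python
import numpy as np

x = np.array([1.0, 3.0, 5.0])
y = np.array([2.0, 4.0, 6.0])

# Feature matrix for a degree-2 polynomial: columns are 1, x, x^2.
Phi = np.column_stack([x ** d for d in range(3)])
print(Phi)
# [[ 1.  1.  1.]
#  [ 1.  3.  9.]
#  [ 1.  5. 25.]]

# Solve Phi^T Phi w = Phi^T y, exactly as in multiple linear regression.
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(w)  # approximately [1, 1, 0], i.e. y = 1 + x + 0*x^2
```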
---
## Regression on Arbitrary Bases
There is nothing special about the relationship between polynomial regression and multiple linear regression. Suppose we want to fit a general non-linear function mapping $x$ to $y$ of the form
$$
y = f(x) = w_0f_0(x) + \ldots + w_Df_D(x)
$$
where each $f_d$ is a scalar-valued function of $x$. We can re-state the problem of fitting $f$, i.e. finding values of $w_d$ that minimize the Mean Square Error of the model, as fitting a linear regression model on the data *after* applying a feature map to the inputs.
We define $\phi: \mathbb{R} \to \mathbb{R}^{D+1}$ by
$$
\phi(x) = \begin{pmatrix}
f_0(x)\\
f_1(x)\\
\vdots\\
f_D(x)
\end{pmatrix},
$$
and rewrite our model as $y = \mathbf{w}^\top\phi(x)$. The collection of functions $f_d$ is called the *feature basis*.
> *Example:* (Finite Random Fourier Basis Regression) We can define a feature map $\phi: \mathbb{R} \to \mathbb{R}^3$ by
>$$
\phi(x) = \begin{pmatrix}
\cos(a_0x+b_0)\\
\cos(a_1x+b_1)\\
\cos(a_2x+b_2)
\end{pmatrix},
$$
> where $a_d$ and $b_d$ are *randomly* sampled from certain special distributions.
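
Here is a sketch of this idea in NumPy (the sampling distributions for $a_d$ and $b_d$, and all names below, are illustrative assumptions rather than a prescription from the notes).
```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice: frequencies a_d from a standard normal, phases b_d uniform on [0, 2*pi).
D = 3
a = rng.normal(size=D)
b = rng.uniform(0, 2 * np.pi, size=D)

def phi(x):
    """Random Fourier feature map: (cos(a_0 x + b_0), cos(a_1 x + b_1), cos(a_2 x + b_2))."""
    return np.cos(a * x + b)

# Fit by linear regression on the transformed inputs, exactly as before.
x = rng.uniform(-3, 3, size=50)
y = np.sin(x) + rng.normal(scale=0.1, size=50)  # some non-linear target

Phi = np.array([phi(xn) for xn in x])           # N x 3 feature matrix
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(np.mean((y - Phi @ w) ** 2))              # training Mean Square Error
```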