# From Orthogonal Projection to Linear Regression--Part 1

In this article, we explore the relationship between linear regression and orthogonal projection. We begin with the simplest toy example: **Linear Regression with a Single Parameter**. The model prediction $\hat{y}$ can be represented by the equation:

$$
\hat{y}=wx.
$$

In this scenario, both the parameter and a given input instance are scalars, denoted as $w \in \mathbb{R}$ and $x \in \mathbb{R}$. The training data $\mathbf{X} \in \mathbb{R}^{n \times 1}$ reduces to a vector of length $n$, where $n$ is the number of instances. We write $\mathbf{x} = [x_1, \ldots, x_n]^\top$ for this vector of training instances, where $x_i$ is the $i$-th data instance, and similarly define $\mathbf{y} = [y_1, \ldots, y_n]^\top$. The parameter $w$ can be obtained by solving the following problem:

$$
w = \arg\min_{\hat{w}} \left(\sum_{i=1}^n (\hat{w}x_i - y_i)^2 + \lambda \hat{w}^2\right).
$$

We illustrate how $w$ is affected by the training data $\mathbf{x}$ and by the regularization constant $\lambda$.

## Effects on $w$ caused by the training data $\mathbf{x}$:

We consider the case where $\lambda = 0$ and solve the following problem:

$$
w = \arg\min_{\hat{w}} \left(\sum_{i=1}^n(\hat{w}x_i - y_i)^2\right).
$$

The solution is derived as follows:

$$
\begin{aligned}
& \frac{d}{dw}\left(\sum_{i=1}^n(wx_i - y_i)^2\right) = 0 \\
& \Rightarrow \sum_{i=1}^n 2x_i(wx_i - y_i) = 0 \\
& \Rightarrow \sum_{i=1}^n x_i^2w = \sum_{i=1}^n x_iy_i \\
& \Rightarrow w = \frac{\sum_{i=1}^n x_iy_i}{\sum_{i=1}^n x_i^2}.
\end{aligned}
$$

The last equation can be expressed as:

$$
w = \frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\|^2} = \left(\frac{\mathbf{x}}{\|\mathbf{x}\|^2}\right)^\top \mathbf{y},
$$

where $w\mathbf{x}$ can be viewed as the orthogonal projection of the vector $\mathbf{y}$ onto the vector $\mathbf{x}$.

Now let's explore why the solution represents an orthogonal projection. First, consider a scenario with only one training data point, i.e., $n=1$. In this case, $w$ can be represented by:

$$
w = \frac{x_1y_1}{x_1^2} = \frac{y_1}{x_1},
$$

since for a single point $(x_1, y_1)$ with $x_1 \neq 0$, there is a unique line $y = wx$ that passes through it, which determines $w$. However, with more than one data point, i.e., $n>1$, the equation:

$$
[y_1, y_2, \ldots, y_n]^\top = w[x_1, x_2, \ldots, x_n]^\top
$$

may not have a solution when the vector $\mathbf{y}$ does not lie in the span of the vector $\mathbf{x}$. In this case, there is a nonzero distance between the vector $\mathbf{y}$ and the vector $w\mathbf{x}$. Finding the solution of $\arg\min_{w} \left(\sum_{i=1}^n(wx_i - y_i)^2\right)$ can be viewed as finding the $w \in \mathbb{R}$ that minimizes the distance between the vector $w\mathbf{x}$ and $\mathbf{y}$. This distance is minimized when the residual $\mathbf{y} - w\mathbf{x}$ is orthogonal to $\mathbf{x}$; thus $w$ is the coefficient that makes $w\mathbf{x}$ the orthogonal projection of $\mathbf{y}$ onto the span of $\mathbf{x}$.

**Example**

In this example, we have $\mathbf{x} = [1,2]^\top$ and $\mathbf{y} = [2.5,3]^\top$. In the left figure, there are two blue dots representing the points $(x_1,y_1) = (1,2.5)$ and $(x_2,y_2) = (2,3)$. The purple line represents $y = wx$, where $w$ varies from $1.0$ to $2.4$, and the dashed orange line represents $y = w^\star x$, where $w^\star = \frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\|^2} = 1.7$.
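As a quick numerical check, here is a minimal NumPy sketch (not code from the original figure) that computes $w^\star$ for the data above and verifies that the residual $\mathbf{y} - w^\star\mathbf{x}$ is orthogonal to $\mathbf{x}$:

```python
import numpy as np

# Example data from the figure: x = [1, 2]^T, y = [2.5, 3]^T.
x = np.array([1.0, 2.0])
y = np.array([2.5, 3.0])

# Closed-form least-squares solution: w* = x^T y / ||x||^2.
w_star = (x @ y) / (x @ x)
print(w_star)  # 1.7

# w* x is the orthogonal projection of y onto span{x}, so the residual
# y - w* x should be orthogonal to x.
residual = y - w_star * x
print(x @ residual)  # ~0 (up to floating-point error)
```

The printed inner product is zero, which is exactly the orthogonality condition $\mathbf{x}^\top(\mathbf{y} - w^\star\mathbf{x}) = 0$.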
In the right figure, the blue vector represents $\mathbf{x} = (1,2)$, with the first coordinate being $1$ and the second coordinate being $2$. The red vector represents $\mathbf{y} = (2.5,3)$, and the orange vector represents the orthogonal projection of $\mathbf{y}$ onto the span of $\mathbf{x}$. As $w$ varies from $1.0$ to $2.4$, it can be observed that when the purple line overlaps with the orange one, the vector $w\mathbf{x}$ is exactly the orthogonal projection of the vector $\mathbf{y}$ onto the span of $\mathbf{x}$.

![figure_1(4)](https://hackmd.io/_uploads/S13LIZjpa.gif)

## Effects on $w$ caused by the regularization constant $\lambda$:

The solution for $w$ when $\lambda \neq 0$ can be derived as follows:

$$
\begin{aligned}
& \frac{d}{dw}\left(\sum_{i=1}^n(wx_i-y_i)^2+\lambda w^2\right)=0 \\
& \Rightarrow \sum_{i=1}^n 2x_i(wx_i-y_i)+2\lambda w = 0 \\
& \Rightarrow \sum_{i=1}^n x_i^2w +\lambda w = \sum_{i=1}^{n}x_iy_i \\
& \Rightarrow w\left(\sum_{i=1}^n x_i^2 +\lambda\right) = \sum_{i=1}^{n}x_iy_i \\
& \Rightarrow w =\frac{\sum_{i=1}^nx_iy_i}{\sum_{i=1}^nx_i^2+\lambda} \\
& \Rightarrow w =\left(\frac{\mathbf{x}}{\|\mathbf{x}\|^2+\lambda}\right)^\top \mathbf{y}.
\end{aligned}
$$

First, consider the scenario where $\|\mathbf{x}\|=0$. In this case, if $\lambda \neq 0$, we can still obtain the solution for $w$, which is $w = 0$. However, a valid solution for $w$ cannot be obtained if $\lambda = 0$, since the formula would divide by zero.

To illustrate the impact of $\lambda$ when $\|\mathbf{x}\| > 0$: if $\|\mathbf{x}\|^2 \ll \lambda$, then $\lambda$ dominates the denominator and shrinks $w$ toward zero. This typically occurs when $n$ is small. Therefore, $\lambda$ can also be viewed as being inversely related to the variance of the prior assumption on $w$: a larger $\lambda$ corresponds to a stronger prior that $w$ is close to zero. Conversely, when $\|\mathbf{x}\|^2 \gg \lambda$, the dominating terms in the solution for $w$ are $\|\mathbf{x}\|^2$ and $\mathbf{x}^\top \mathbf{y}$, making $\lambda$ less significant. This usually occurs when $n$ is large. Intuitively, when there are plenty of training instances, the prior assumption about $w$ becomes less important.

From the perspective of orthogonal projection, it is not possible to obtain the projection of $\mathbf{y}$ onto $\mathbf{x}$ (i.e., $\frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\|^2}$) when $\|\mathbf{x}\| = 0$. If we set $\lambda > 0$, we avoid the division by zero that would otherwise occur when $\|\mathbf{x}\| = 0$, should we wish to calculate the orthogonal projection.
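To make the effect of $\lambda$ concrete, the following minimal NumPy sketch (the helper name `ridge_w` is a hypothetical choice introduced here for illustration) evaluates the regularized solution for a few values of $\lambda$, including the degenerate case $\|\mathbf{x}\| = 0$:

```python
import numpy as np

def ridge_w(x: np.ndarray, y: np.ndarray, lam: float) -> float:
    """Single-parameter ridge solution: w = x^T y / (||x||^2 + lambda)."""
    return (x @ y) / (x @ x + lam)

x = np.array([1.0, 2.0])
y = np.array([2.5, 3.0])

print(ridge_w(x, y, 0.0))    # 1.7   (no regularization: the projection coefficient)
print(ridge_w(x, y, 0.1))    # ~1.67 (||x||^2 >> lambda: the data dominate)
print(ridge_w(x, y, 100.0))  # ~0.08 (||x||^2 << lambda: w is shrunk toward zero)

# With ||x|| = 0, any lambda > 0 still gives the well-defined solution w = 0,
# whereas lambda = 0 would lead to a division by zero.
print(ridge_w(np.zeros(2), y, 1.0))  # 0.0
```

Note how increasing $\lambda$ shrinks $w$ toward zero, matching the discussion above, while $\lambda > 0$ keeps the solution well defined even when $\|\mathbf{x}\| = 0$.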