# From Orthogonal Projection to Linear Regression--Part 2

In this article, we continue to explore the relationship between linear regression and orthogonal projection. Here, we consider another case: **linear regression with two parameters**, with the property that **$\mathbf{x}_1$ and $\mathbf{x}_2$ are orthogonal**. This setting helps us understand how to optimally project and predict outcomes using orthogonal features in linear regression.

The linear regression model can be represented as

$$
\hat{y} = \mathbf{w}^{\top}\mathbf{x},
$$

where $\mathbf{x}$ denotes the input features of a given instance, $\hat{y}$ is the predicted output, and $\mathbf{w} \in \mathbb{R}^{2}$ represents the model weights.

In this scenario, the training dataset $\mathbf{X} \in \mathbb{R}^{n \times 2}$ is a matrix with one column per feature, described as follows:

$$
\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 \end{bmatrix} = \begin{bmatrix} (x_1)_1 & (x_2)_1 \\ (x_1)_2 & (x_2)_2 \\ (x_1)_3 & (x_2)_3 \\ \vdots & \vdots \\ (x_1)_n & (x_2)_n \\ \end{bmatrix},
$$

where $\mathbf{x}_j$, $j \in \{1,2\}$, is the $j$-th feature, and $(x_j)_i$ is the $j$-th input feature of the $i$-th instance. The weight vector $\mathbf{w} \in \mathbb{R}^2$ is represented as

$$
\mathbf{w} = \begin{bmatrix} w_1 & w_2 \end{bmatrix}^{\top}.
$$

$\mathbf{w}$ can be determined by solving the following regularized least-squares problem:

$$
\begin{aligned}
\mathbf{w} & = \arg\min\limits_{\hat{\mathbf{w}}}\left(\sum_{i=1}^n(\hat{\mathbf{w}}^{\top}\mathbf{x}_i-y_i)^2+\lambda \|\mathbf{\hat{w}}\|^2\right) \\
& = \arg\min\limits_{\hat{w}_1,\hat{w}_2}\left(\sum_{i=1}^n\left(\hat{w}_1(x_1)_i + \hat{w}_2(x_2)_i-y_i\right)^2+\lambda (\hat{w}_1^2+\hat{w}_2^2)\right) \\
\end{aligned}
$$

To solve for $\mathbf{w}$, we set the gradient of the loss function to zero (the common factor of $2$ from differentiation is canceled in the intermediate steps):

$$
\begin{aligned}
& \nabla_{\mathbf{w}}\left(\sum_{i=1}^n(\mathbf{w}^{\top}\mathbf{x}_i-y_i)^2+\lambda \|\mathbf{w}\|^2\right) = 0 \\
& \Rightarrow \begin{cases} \frac{\partial}{\partial w_1}\left(\sum_{i=1}^n\left(w_1(x_1)_i + w_2(x_2)_i-y_i\right)^2+\lambda (w_1^2+w_2^2)\right) = 0 \\ \frac{\partial}{\partial w_2}\left(\sum_{i=1}^n\left(w_1(x_1)_i + w_2(x_2)_i-y_i\right)^2+\lambda (w_1^2+w_2^2)\right) = 0 \end{cases} \\
& \Rightarrow \begin{cases} \sum_{i=1}^n(x_1)_i\left(w_1(x_1)_i + w_2(x_2)_i-y_i\right)+\lambda w_1 = 0 \\ \sum_{i=1}^n(x_2)_i\left(w_1(x_1)_i + w_2(x_2)_i-y_i\right)+\lambda w_2 = 0 \end{cases} \\
& \Rightarrow \begin{cases} w_1\sum_{i=1}^n(x_1)_i^2 + w_2\sum_{i=1}^n(x_1)_i(x_2)_i + \lambda w_1 = \sum_{i=1}^n(x_1)_iy_i \\ w_1\sum_{i=1}^n(x_2)_i(x_1)_i + w_2\sum_{i=1}^n(x_2)_i^2 + \lambda w_2 = \sum_{i=1}^n(x_2)_iy_i \end{cases} \\
& \Rightarrow \begin{cases} w_1\|\mathbf{x}_1\|^2 + w_2\mathbf{x}_1^{\top}\mathbf{x}_2 + \lambda w_1 = \mathbf{x}_1^{\top}\mathbf{y} \\ w_1\mathbf{x}_1^{\top}\mathbf{x}_2 + w_2\|\mathbf{x}_2\|^2 + \lambda w_2 = \mathbf{x}_2^{\top}\mathbf{y} \end{cases} \\
\end{aligned}
$$

Assuming $\lambda = 0$ for simplicity and considering the case where $\mathbf{x}_1$ and $\mathbf{x}_2$ are orthogonal (so that $\mathbf{x}_1^{\top}\mathbf{x}_2 = 0$), we obtain:

$$
\begin{aligned}
& \begin{cases} w_1\|\mathbf{x}_1\|^2 + w_2\mathbf{x}_1^{\top}\mathbf{x}_2 = \mathbf{x}_1^{\top}\mathbf{y} \\ w_1\mathbf{x}_1^{\top}\mathbf{x}_2 + w_2\|\mathbf{x}_2\|^2 = \mathbf{x}_2^{\top}\mathbf{y} \end{cases}
& \Rightarrow \begin{cases} w_1 = \frac{\mathbf{x}_1^{\top}\mathbf{y}}{\|\mathbf{x}_1\|^2} \\ w_2 = \frac{\mathbf{x}_2^{\top}\mathbf{y}}{\|\mathbf{x}_2\|^2} \end{cases} \\
\end{aligned}
$$
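The closed form above is easy to check numerically. Below is a minimal NumPy sketch (the variable names and the cross-check against `np.linalg.lstsq` are my own additions, not part of the derivation) that computes $w_1$ and $w_2$ from the projection formulas, reusing the vectors from the example further down:

```python
import numpy as np

# Two orthogonal feature columns and a target vector (values from the example below).
x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([-1.0, -1.0, 1.0])   # x1 @ x2 == 0, so the cross terms vanish
y = np.array([2.0, 3.0, 6.0])

# Closed-form weights from the derivation above (lambda = 0, orthogonal features).
w1 = (x1 @ y) / (x1 @ x1)
w2 = (x2 @ y) / (x2 @ x2)

# Cross-check against the generic least-squares solution on X = [x1 x2].
X = np.column_stack([x1, x2])
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w1, w2)      # per-feature projection coefficients
print(w_lstsq)     # should match [w1, w2] up to floating-point error
```

Because the features are orthogonal, the two coefficients decouple: each one can be computed independently of the other, which is exactly what the closed form expresses.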
In this case, $\mathbf{x}_1$ and $\mathbf{x}_2$ form an orthogonal basis for the span of $\{\mathbf{x}_1, \mathbf{x}_2\}$. Thus, the residual of $\mathbf{y}$ with respect to the span, whose norm is the minimum (orthogonal) distance from $\mathbf{y}$ to the span, is:

$$
\mathbf{y} - w_1\mathbf{x}_1 - w_2\mathbf{x}_2 = \mathbf{y} - \frac{\mathbf{x}_1^{\top}\mathbf{y}}{\|\mathbf{x}_1\|^2}\mathbf{x}_1 - \frac{\mathbf{x}_2^{\top}\mathbf{y}}{\|\mathbf{x}_2\|^2}\mathbf{x}_2
$$

This formulation reveals that each weight is proportional to the length of the orthogonal projection of $\mathbf{y}$ onto the corresponding input feature $\mathbf{x}_i$, $i \in \{1,2\}$; indeed, $w_i\mathbf{x}_i$ is exactly that projection. The alignment of $\mathbf{y}$ with $\mathbf{x}_i$ influences the magnitude of the projection and, consequently, the weight $w_i$.

**Example**

In this example, we consider $n=3$ with $\mathbf{x}_1 = [1,2,3]^{\top}$, $\mathbf{x}_2 = [-1,-1,1]^{\top}$ (note that $\mathbf{x}_1^{\top}\mathbf{x}_2 = 0$), and $\mathbf{y}$ varying from $[2,3,6]^{\top}$ to $[-2,-3,3]^{\top}$. The following diagrams illustrate three points plotted against $\mathbf{x}_1$ and $\mathbf{x}_2$. The regression model for a given input $(x_1)_i, (x_2)_i$ is represented by the shaded area in the plot:

$$
\hat{y}_i = w_1(x_1)_i + w_2(x_2)_i = \frac{\mathbf{x}_1^{\top}\mathbf{y}}{\|\mathbf{x}_1\|^2}(x_1)_i + \frac{\mathbf{x}_2^{\top}\mathbf{y}}{\|\mathbf{x}_2\|^2}(x_2)_i
$$

In the lower diagram, the vectors colored in light blue and green represent $\mathbf{x}_1$ and $\mathbf{x}_2$, respectively, while the red vector denotes $\mathbf{y}$. The dark green and dark blue vectors indicate the orthogonal projections of $\mathbf{y}$ onto the spans of $\mathbf{x}_1$ and $\mathbf{x}_2$, respectively, highlighting how the projection lengths and their alignment with the input features change as $\mathbf{y}$ shifts.

![figure_2](https://hackmd.io/_uploads/r1LqdubBC.gif)

![figure_3](https://hackmd.io/_uploads/r1Zj_OZSR.gif)
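To tie the example back to the projection picture, here is a small sketch (again assuming NumPy; `project_onto` is a helper name introduced here for illustration) that builds $\hat{\mathbf{y}} = w_1\mathbf{x}_1 + w_2\mathbf{x}_2$ as the sum of two one-dimensional projections for the two endpoint targets $[2,3,6]^{\top}$ and $[-2,-3,3]^{\top}$, and checks that the residual is orthogonal to both features:

```python
import numpy as np

def project_onto(v, u):
    """Orthogonal projection of v onto the line spanned by u."""
    return (u @ v) / (u @ u) * u

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([-1.0, -1.0, 1.0])

for y in (np.array([2.0, 3.0, 6.0]), np.array([-2.0, -3.0, 3.0])):
    # Because x1 and x2 are orthogonal, projecting onto the plane span{x1, x2}
    # decomposes into two independent one-dimensional projections.
    y_hat = project_onto(y, x1) + project_onto(y, x2)
    residual = y - y_hat

    # The residual is the orthogonal-distance vector: it is perpendicular to
    # both basis vectors, so its norm is the minimum distance to the span.
    print(y_hat, np.linalg.norm(residual))
    assert abs(residual @ x1) < 1e-9 and abs(residual @ x2) < 1e-9
```

As $\mathbf{y}$ moves between the two endpoints, the two projection terms shrink or flip sign independently, which is exactly the behavior animated in the diagrams above.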