# From Orthogonal Projection to Linear Regression--Part 2
In this article, we continue to explore the relationship between linear regression and orthogonal projection. Here, we consider another case: **Linear Regression with Two Parameters**, under the additional assumption that **$\mathbf{x}_1$ and $\mathbf{x}_2$ are orthogonal**. This setting shows how, when the features are orthogonal, each weight reduces to an independent orthogonal projection of $\mathbf{y}$ onto the corresponding feature.
The linear regression model can be represented as:
$$
\hat{y} = \mathbf{w}^{\top}\mathbf{x},
$$
where $\mathbf{x}$ denotes the input features of a given instance, $\hat{y}$ is the predicted output, and $\mathbf{w} \in \mathbb{R}^{2}$ represents the model weights.
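As a quick illustration of the notation, here is a minimal sketch (NumPy, with made-up numbers) of how a single prediction $\hat{y} = \mathbf{w}^{\top}\mathbf{x}$ is computed:

```python
import numpy as np

# Hypothetical weights and input, just to illustrate y_hat = w^T x.
w = np.array([0.5, -1.2])   # w = [w_1, w_2]^T
x = np.array([3.0, 2.0])    # features (x_1, x_2) of one instance
y_hat = w @ x               # scalar prediction w^T x
print(y_hat)                # 0.5*3.0 + (-1.2)*2.0 = -0.9
```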
In this scenario, the training dataset $\mathbf{X} \in \mathbb{R}^{n \times 2}$ is a matrix with two feature columns, described as follows:
$$
\mathbf{X} =
\begin{bmatrix}
\mathbf{x}_1 & \mathbf{x}_2
\end{bmatrix} =
\begin{bmatrix}
(x_1)_1 & (x_2)_1 \\
(x_1)_2 & (x_2)_2 \\
(x_1)_3 & (x_2)_3 \\
\vdots & \vdots \\
(x_1)_n & (x_2)_n \\
\end{bmatrix},
$$
where $\mathbf{x}_j$, $j \in \{1,2\}$, is the $j$-th feature column, and $(x_j)_i$ is the $j$-th input feature of the $i$-th instance. The weight vector $\mathbf{w} \in \mathbb{R}^2$ is represented as:
$$
\mathbf{w} = \begin{bmatrix}
w_1 & w_2
\end{bmatrix}^{\top}.
$$
$\mathbf{w}$ can be determined by solving the following regularized least-squares problem, where $\mathbf{x}_i \in \mathbb{R}^2$ denotes the feature vector of the $i$-th instance (the $i$-th row of $\mathbf{X}$):
$$
\begin{aligned}
\mathbf{w} & = \arg\min\limits_{\hat{\mathbf{w}}}\left(\sum_{i=1}^n(\hat{\mathbf{w}}^{\top}\mathbf{x}_i-y_i)^2+\lambda \|\mathbf{\hat{w}}\|^2\right) \\
& = \arg\min\limits_{\hat{w}_1,\hat{w}_2}\left(\sum_{i=1}^n\left(\hat{w}_1(x_1)_i + \hat{w}_2(x_2)_i-y_i\right)^2+\lambda (\hat{w}_1^2+\hat{w}_2^2)\right) \\
\end{aligned}
$$
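This objective can also be written down directly in code. Below is a small sketch (NumPy, with hypothetical data) of the regularized loss that the $\arg\min$ above refers to; `X` holds the two feature columns and `lam` plays the role of $\lambda$:

```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """Sum of squared residuals plus lambda * ||w||^2."""
    residuals = X @ w - y                # (X w - y), one entry per instance
    return residuals @ residuals + lam * (w @ w)

# Hypothetical data: n = 4 instances, 2 features.
X = np.array([[1.0, -1.0],
              [2.0, -1.0],
              [3.0,  1.0],
              [4.0,  2.0]])
y = np.array([2.0, 3.0, 6.0, 8.0])
print(ridge_loss(np.array([1.0, 0.5]), X, y, lam=0.1))
```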
To solve for $\mathbf{w}$, we set the gradient of the loss function to zero:
$$
\begin{aligned}
& \nabla_{\mathbf{w}}\left(\sum_{i=1}^n(\mathbf{w}^{\top}\mathbf{x}_i-y_i)^2+\lambda \|\mathbf{w}\|^2\right) = 0 \\
& \Rightarrow \begin{cases}
\frac{\partial}{\partial w_1}\left(\sum_{i=1}^n\left(w_1(x_1)_i + w_2(x_2)_i-y_i\right)^2+\lambda (w_1^2+w_2^2)\right) = 0 \\
\frac{\partial}{\partial w_2}\left(\sum_{i=1}^n\left(w_1(x_1)_i + w_2(x_2)_i-y_i\right)^2+\lambda (w_1^2+w_2^2)\right) = 0
\end{cases} \\
& \Rightarrow \begin{cases}
\sum_{i=1}^n(x_1)_i\left(w_1(x_1)_i + w_2(x_2)_i-y_i\right)+\lambda w_1 = 0 \\
\sum_{i=1}^n(x_2)_i\left(w_1(x_1)_i + w_2(x_2)_i-y_i\right)+\lambda w_2 = 0
\end{cases} \\
& \Rightarrow \begin{cases}
w_1\sum_{i=1}^n(x_1)_i^2 + w_2\sum_{i=1}^n(x_1)_i(x_2)_i + \lambda w_1 = \sum_{i=1}^n(x_1)_iy_i \\
w_1\sum_{i=1}^n(x_2)_i(x_1)_i + w_2\sum_{i=1}^n(x_2)_i^2 + \lambda w_2 = \sum_{i=1}^n(x_2)_iy_i
\end{cases} \\
& \Rightarrow \begin{cases}
w_1\|\mathbf{x}_1\|^2 + w_2\mathbf{x}_1^{\top}\mathbf{x}_2 + \lambda w_1 = \mathbf{x}_1^{\top}\mathbf{y} \\
w_1\mathbf{x}_1^{\top}\mathbf{x}_2 + w_2\|\mathbf{x}_2\|^2 + \lambda w_2 = \mathbf{x}_2^{\top}\mathbf{y}
\end{cases} \\
\end{aligned}
$$
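In matrix form, the final pair of equations is $(\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I})\mathbf{w} = \mathbf{X}^{\top}\mathbf{y}$. A short sketch (NumPy, reusing the hypothetical data from the previous snippet) that solves this $2 \times 2$ system and verifies the two scalar equations could look like this:

```python
import numpy as np

# Same hypothetical data as in the previous sketch.
X = np.array([[1.0, -1.0],
              [2.0, -1.0],
              [3.0,  1.0],
              [4.0,  2.0]])
y = np.array([2.0, 3.0, 6.0, 8.0])
lam = 0.1

# Solve (X^T X + lam * I) w = X^T y, the matrix form of the two equations above.
w = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# Check the two scalar equations: x_j^T (X w - y) + lam * w_j = 0 for j = 1, 2.
print(X.T @ (X @ w - y) + lam * w)   # ~ [0, 0] up to floating-point error
```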
Assuming $\lambda = 0$ for simplicity and considering the case where $\mathbf{x}_1$ and $\mathbf{x}_2$ are orthogonal, i.e., $\mathbf{x}_1^{\top}\mathbf{x}_2 = 0$, the system simplifies to:
$$
\begin{aligned}
& \begin{cases}
w_1\|\mathbf{x}_1\|^2 + w_2\mathbf{x}_1^{\top}\mathbf{x}_2 = \mathbf{x}_1^{\top}\mathbf{y} \\
w_1\mathbf{x}_1^{\top}\mathbf{x}_2 + w_2\|\mathbf{x}_2\|^2 = \mathbf{x}_2^{\top}\mathbf{y}
\end{cases}
& \Rightarrow \begin{cases}
w_1 = \frac{\mathbf{x}_1^{\top}\mathbf{y}}{\|\mathbf{x}_1\|^2} \\
w_2 = \frac{\mathbf{x}_2^{\top}\mathbf{y}}{\|\mathbf{x}_2\|^2}
\end{cases} \\
\end{aligned}
$$
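With $\lambda = 0$ and orthogonal feature columns, each weight can therefore be computed independently of the other. A minimal sketch (NumPy, using the orthogonal pair $\mathbf{x}_1 = [1,2,3]^{\top}$, $\mathbf{x}_2 = [-1,-1,1]^{\top}$ from the example below) that compares the per-feature formula with a joint least-squares solve:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([-1.0, -1.0, 1.0])   # x1 . x2 = 0, so the columns are orthogonal
y  = np.array([2.0, 3.0, 6.0])

# Per-feature closed form: w_j = (x_j^T y) / ||x_j||^2
w1 = (x1 @ y) / (x1 @ x1)
w2 = (x2 @ y) / (x2 @ x2)

# Joint least-squares solve for comparison.
X = np.column_stack([x1, x2])
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w1, w2)        # 13/7 ≈ 1.857..., 1/3 ≈ 0.333...
print(w_lstsq)       # same values, because the columns are orthogonal
```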
In this case, $\mathbf{x}_1$ and $\mathbf{x}_2$ form an orthogonal basis for $\operatorname{span}\{\mathbf{x}_1, \mathbf{x}_2\}$. The residual of $\mathbf{y}$ after projecting onto this span, whose norm is the minimum (orthogonal) distance from $\mathbf{y}$ to the span, is:
$$
\mathbf{y} - w_1\mathbf{x}_1 - w_2\mathbf{x}_2 = \mathbf{y} - \frac{\mathbf{x}_1^{\top}\mathbf{y}}{\|\mathbf{x}_1\|^2}\mathbf{x}_1 - \frac{\mathbf{x}_2^{\top}\mathbf{y}}{\|\mathbf{x}_2\|^2}\mathbf{x}_2.
$$
This formulation reveals that each weight $w_i$, $i \in \{1,2\}$, is exactly the coefficient of the orthogonal projection of $\mathbf{y}$ onto the corresponding feature $\mathbf{x}_i$; that is, the projection is $w_i\mathbf{x}_i$. The better $\mathbf{y}$ aligns with $\mathbf{x}_i$, the longer this projection and the larger the magnitude of $w_i$.
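A quick way to confirm that the residual above is the orthogonal component is to check that it is perpendicular to both features. Continuing the previous sketch (same assumed vectors):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([-1.0, -1.0, 1.0])
y  = np.array([2.0, 3.0, 6.0])

w1 = (x1 @ y) / (x1 @ x1)
w2 = (x2 @ y) / (x2 @ x2)

# Residual of y after subtracting its projections onto x1 and x2.
r = y - w1 * x1 - w2 * x2

# r is orthogonal to both basis vectors, so ||r|| is the distance
# from y to span{x1, x2}.
print(r @ x1, r @ x2)        # ~ 0, ~ 0
print(np.linalg.norm(r))     # minimum (orthogonal) distance
```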
**Example**
In this example, we consider $n=3$ with $\mathbf{x}_1 = [1,2,3]^{\top}$, $\mathbf{x}_2 = [-1,-1,1]^{\top}$, and $\mathbf{y}$ varying from $[2,3,6]^{\top}$ to $[-2,-3,3]^{\top}$. The following diagrams illustrate the three points plotted against $\mathbf{x}_1$ and $\mathbf{x}_2$. The regression model's prediction for a given input $(x_1)_i, (x_2)_i$, represented by the shaded area in the plot, is:
$$
\hat{y}_i = w_1(x_1)_i + w_2(x_2)_i = \frac{\mathbf{x}_1^{\top}\mathbf{y}}{\|\mathbf{x}_1\|^2}(x_1)_i + \frac{\mathbf{x}_2^{\top}\mathbf{y}}{\|\mathbf{x}_2\|^2}(x_2)_i
$$
In the lower diagram, vectors colored in light blue and green represent $\mathbf{x}_1$ and $\mathbf{x}_2$, respectively, while the red vector denotes $\mathbf{y}$. The dark green and dark blue vectors indicate the orthogonal projection of $\mathbf{y}$ onto the spans of $\mathbf{x}_1$ and $\mathbf{x}_2$, respectively, highlighting changes in projection lengths and alignment with the input points as $\mathbf{y}$ shifts.
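The same behaviour can be checked numerically. Below is a small sketch (NumPy; the vectors are the ones given above) that computes the weights for the two endpoint choices of $\mathbf{y}$: as $\mathbf{y}$ moves from $[2,3,6]^{\top}$ to $[-2,-3,3]^{\top}$, its alignment shifts from $\mathbf{x}_1$ toward $\mathbf{x}_2$, so $w_1$ shrinks while $w_2$ grows.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([-1.0, -1.0, 1.0])

def weights(y):
    """Per-feature projection coefficients for the orthogonal pair x1, x2."""
    return (x1 @ y) / (x1 @ x1), (x2 @ y) / (x2 @ x2)

print(weights(np.array([2.0, 3.0, 6.0])))    # (13/7, 1/3): y mostly aligned with x1
print(weights(np.array([-2.0, -3.0, 3.0])))  # (1/14, 8/3): y mostly aligned with x2
```

This matches the change in projection lengths described above: the projection of $\mathbf{y}$ onto $\mathbf{x}_1$ collapses while the projection onto $\mathbf{x}_2$ grows as $\mathbf{y}$ shifts.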

