# 2-2 Linear Regression
:::info
**Example (Linear Least Square Regression)**
Consider $\mathcal{X}=\mathbb{R}^d$ and $\mathcal{Y}=\mathbb{R}$ with the squared loss $l(\hat{y},y)=(\hat{y}-y)^2$, and take the hypothesis space of all linear functions $\mathcal{H}=\{h_{\textbf{w}}: \textbf{x} \mapsto \textbf{w}\cdot\textbf{x},\ \textbf{w}\in\mathbb{R}^d \}$. The goal is to find the parameter $\textbf{w}$ such that the empirical risk
$$\hat{R_n}(h_{\textbf{w}})=\frac{1}{n}\sum_{i=1}^n(y_i-\textbf{w}\cdot\textbf{x}_i)^2$$
is minimized.
:::
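As a quick sanity check, here is a minimal NumPy sketch (with made-up toy data, purely for illustration) that evaluates this empirical risk for a candidate $\textbf{w}$:

```python
import numpy as np

def empirical_risk(w, X, y):
    """R_n(h_w) = (1/n) * sum_i (y_i - w . x_i)^2, written as the explicit sum."""
    n = len(y)
    return sum((y[i] - w @ X[i]) ** 2 for i in range(n)) / n

# Toy data, chosen only for illustration: n = 4 examples in d = 2 dimensions.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

print(empirical_risk(np.array([1.0, 2.0]), X, y))  # risk of a candidate w
```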
This objective can be rewritten in matrix-vector form. Let $\textbf{y}=[y_1, ..., y_n]^T$ be the output vector, and
$$\textbf{X}=
\begin{bmatrix}
- & \textbf{x}_1^T& - \\
&\vdots & \\
- & \textbf{x}_n^T& -
\end{bmatrix}
\in\mathbb{R}^{n\times d}$$
be the input (design) matrix, whose rows are the $\textbf{x}_i^T$. The empirical risk is then
$$\hat{R_n}(h_{\textbf{w}})=\frac{1}{n}\|\textbf{y}-\textbf{X}\textbf{w}\|^2$$
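A small sketch (again with synthetic data, just for illustration) confirming that this matrix-vector form equals the per-example sum above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))            # rows are the x_i^T, as in the matrix above
y = rng.normal(size=n)
w = rng.normal(size=d)

sum_form = np.mean([(y[i] - w @ X[i]) ** 2 for i in range(n)])
matrix_form = np.linalg.norm(y - X @ w) ** 2 / n

print(np.isclose(sum_form, matrix_form))   # True: the two expressions coincide
```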
It can be shown (from linear algebra) that the minimizer corresponds to the **orthogonal projection** of $\textbf{y}$ onto the column space of $\textbf{X}$: setting the gradient of $\hat{R_n}$ to zero yields the normal equations $\textbf{X}^T\textbf{X}\textbf{w}=\textbf{X}^T\textbf{y}$, so (assuming $\textbf{X}^T\textbf{X}$ is invertible; the full proof is omitted here)
$$\textbf{w}^*=(\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\textbf{y}$$
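A minimal sketch of this closed-form solution on synthetic data, assuming $\textbf{X}^T\textbf{X}$ is invertible; the result is cross-checked against NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)   # noisy linear targets (synthetic)

# Closed-form solution of the normal equations: w* = (X^T X)^{-1} X^T y.
# Solving the linear system is preferred over forming the explicit inverse.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_star, w_lstsq))         # True
```

In practice one solves the normal equations (or calls a least-squares routine) rather than explicitly inverting $\textbf{X}^T\textbf{X}$, which is both faster and numerically more stable.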
###### tags: `machine learning`