# 2-2 Linear Regression

###### tags: `machine learning`

:::info
**Example (Linear Least Squares Regression)**
Consider $\mathcal{X}=\mathbb{R}^d$ and $\mathcal{Y}=\mathbb{R}$ with the squared loss $l(\hat{y},y)=(\hat{y}-y)^2$, and consider the hypothesis space of all linear functions $\mathcal{H}=\{h_{\textbf{w}} : \textbf{x} \mapsto \textbf{w}\cdot\textbf{x} \mid \textbf{w}\in\mathbb{R}^d\}$. The goal is to find the parameter $\textbf{w}$ that minimizes the empirical risk
$$\hat{R}_n(h_{\textbf{w}})=\frac{1}{n}\sum_{i=1}^n(y_i-\textbf{w}\cdot\textbf{x}_i)^2.$$
:::

This objective can be rewritten in matrix-vector form. Let $\textbf{y}=[y_1, \dots, y_n]^T$ be the output vector, and let
$$\textbf{X}=
\begin{bmatrix}
- & \textbf{x}_1^T & - \\
& \vdots & \\
- & \textbf{x}_n^T & -
\end{bmatrix}
\in\mathbb{R}^{n\times d}$$
be the input (design) matrix. The empirical risk is then
$$\hat{R}_n(h_{\textbf{w}})=\frac{1}{n}\|\textbf{y}-\textbf{X}\textbf{w}\|^2.$$

It can be shown (from linear algebra; proof omitted here) that the minimum empirical risk is achieved when $\textbf{X}\textbf{w}$ is the **orthogonal projection** of $\textbf{y}$ onto the subspace spanned by the columns of $\textbf{X}$. Assuming $\textbf{X}^T\textbf{X}$ is invertible, this gives the closed-form solution
$$\textbf{w}^*=(\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\textbf{y}.$$
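As a minimal NumPy sketch of this closed-form solution: the synthetic data, the variable `w_true`, and the comparison against `np.linalg.lstsq` below are illustrative choices, not part of the original example.

```python
import numpy as np

# Synthetic data (illustrative): n samples in d dimensions with a known weight vector.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed-form solution via the normal equations: w* = (X^T X)^{-1} X^T y.
# Solving the linear system is preferred over forming the explicit inverse.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Empirical risk (mean squared error) of the fitted weights.
risk = np.mean((y - X @ w_star) ** 2)
print("w* (normal equations):", w_star)
print("w* (lstsq):           ", w_lstsq)
print("empirical risk:", risk)
```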