$$
\newcommand{\Vec}{\operatorname{vec}}
\newcommand{\T}{^\top}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\R}{\mathbb{R}}
\newcommand{\bmat}[1]{\begin{bmatrix} #1 \end{bmatrix}}
\newcommand{\dif}[1]{\partial_{ #1 }}
\newcommand{\normal}{\mathcal{N}}
$$
# Linear Regression
## Problem
Given vector spaces $V$ and $W$, a set $S \subseteq V$, and a map $y:S \to W$, find an affine map $f:V \to W$, $x \mapsto Ax+b$, whose restriction to $S$ best fits $y$.
### Remark
- The notion of best-fitting varies by context.
- This setting is more general than necessary. In practice, one almost exclusively works with multivariate linear regression, and in particular with planar linear regression.
## Multivariate Linear Regression
This is the case in which $V = \R^n$, $W = \R^m$, and $S$ is finite. The different senses of best-fitting are illustrated below.
### Least Mean Squared Error
Minimize $\norm{f|_S - y}_{\ell^2} = \left( \sum_{x \in S} \norm{f(x) - y(x)}_2^2 \right)^{1/2}$ subject to $A \in \R^{m \times n}$ and $b \in \R^m$.
#### Solution 1
Order $S$ as $\{x_i\}_{i=1}^N$ and let $y_i = y(x_i)$. Since the square root is increasing, it may be dropped, and the problem is equivalent to minimizing
$$
L = \sum_{i=1}^N \sum_{j=1}^m (Ax_i + b - y_i)_j^2
= \sum_{i=1}^N \sum_{j=1}^m \left(
\bmat{A & b} \bmat{x_i \\ 1} - y_i
\right)_j^2
= \sum_{i=1}^N \sum_{j=1}^m (BX - Y)_{ji}^2
$$
with respect to $B = \bmat{A & b} \in \R^{m \times (n+1)}$, where $X = \bmat{x_1 & \cdots & x_N \\ 1 & \cdots & 1}$ and $Y = \bmat{y_1 & \cdots & y_N}$. $L$ is a smooth function of the entries of $B$, so any minimizer must be a stationary point.
$$
\begin{align*}
0 &= \dif{B_{pq}} L
= \dif{B_{pq}} \sum_{i=1}^N \sum_{j=1}^m \left(
\sum_{k=1}^{n+1} B_{jk}X_{ki} - Y_{ji}
\right)^2
\\ &= \sum_{i=1}^N \dif{B_{pq}} \left(
\sum_{j \neq p} \left(\sum_{k=1}^{n+1} B_{jk}X_{ki} - Y_{ji}\right)^2
+ \left(\sum_{k=1}^{n+1} B_{pk}X_{ki} - Y_{pi}\right)^2
\right)
\\ &= \sum_{i=1}^N \dif{B_{pq}} \left(
\sum_{k \neq q} B_{pk}X_{ki} + B_{pq}X_{qi} - Y_{pi}
\right)^2
= \sum_{i=1}^N 2 \left(\sum_{k=1}^{n+1} B_{pk}X_{ki} - Y_{pi}\right) X_{qi}
\\ &= 2 \sum_{i=1}^N (BX-Y)_{pi} X_{qi}
= 2 \bigg( (BX-Y)X\T \bigg)_{pq}
\\ &\implies (BX-Y)X\T = 0
\implies BXX\T = YX\T
\implies B = YX\T (XX\T)^{-1}
\end{align*}
$$
The last implication holds precisely when $XX\T$ is invertible, which is the case iff $X$ has full row rank $n+1$ (in particular, $N \geq n+1$).
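As a sanity check, the following is a minimal NumPy sketch of the closed form $B = YX\T(XX\T)^{-1}$; the sizes, random seed, and noise level are illustrative assumptions, and `numpy.linalg.lstsq` serves only as an independent reference.
```python
import numpy as np

# Illustrative sizes and noise level (assumptions): n inputs, m outputs, N samples.
rng = np.random.default_rng(0)
n, m, N = 3, 2, 100
A_true = rng.normal(size=(m, n))
b_true = rng.normal(size=(m, 1))
x = rng.normal(size=(n, N))
y = A_true @ x + b_true + 0.01 * rng.normal(size=(m, N))

# Augmented matrices from the derivation: X is (n+1) x N with a row of ones, Y is m x N.
X = np.vstack([x, np.ones((1, N))])
Y = y

# Closed form B = Y X^T (X X^T)^{-1}; solving the normal equations avoids an explicit inverse.
B = np.linalg.solve(X @ X.T, X @ Y.T).T
A_hat, b_hat = B[:, :n], B[:, n]

# Independent reference: least-squares solve of X^T B^T ~ Y^T.
B_ref = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T
print(np.allclose(B, B_ref))
```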
#### Solution 2
To avoid tedious index manipulation, one can work with total derivatives directly, treating $\dif{\Vec(B)}$ as a row vector (the transpose of the gradient). Finding the stationary points amounts to the following.
$$
\begin{align*}
0 &= \dif{\Vec(B)} L
= \dif{\Vec(B)} \Vec(BX-Y)\T \Vec(BX-Y)
\\ &= 2 \Vec(BX-Y)\T \dif{\Vec(B)} \Vec(BX-Y)
\\ &= 2 \Vec(BX-Y)\T \dif{\Vec(B)}
\bigg( (X\T \otimes I_m) \Vec(B) - \Vec(Y) \bigg)
\\ &= 2 \Vec(BX-Y)\T (X\T \otimes I_m)
= 2 \bigg( (X \otimes I_m) \Vec(BX-Y) \bigg)\T
\\ &= 2 \Vec((BX-Y)X\T)\T
\\ &\implies (BX-Y)X\T = 0
\implies BXX\T = YX\T
\implies B = YX\T (XX\T)^{-1}
\end{align*}
$$
The last implication holds as in the previous solution.
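The two vec/Kronecker identities used above, $\Vec(BX) = (X\T \otimes I_m)\Vec(B)$ and $(X \otimes I_m)\Vec(M) = \Vec(MX\T)$, can be checked numerically. The sketch below assumes the column-major vec convention; the matrix sizes are illustrative.
```python
import numpy as np

# Illustrative sizes (assumptions).
rng = np.random.default_rng(1)
m, n, N = 2, 3, 5
B = rng.normal(size=(m, n + 1))
X = rng.normal(size=(n + 1, N))
M = rng.normal(size=(m, N))

def vec(A):
    # Column-major (Fortran-order) stacking, matching vec(AMB) = (B^T kron A) vec(M).
    return A.reshape(-1, order="F")

I_m = np.eye(m)
# vec(BX) = (X^T kron I_m) vec(B)
print(np.allclose(vec(B @ X), np.kron(X.T, I_m) @ vec(B)))
# (X kron I_m) vec(M) = vec(M X^T), for a generic m x N matrix M
print(np.allclose(np.kron(X, I_m) @ vec(M), vec(M @ X.T)))
```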
### Gaussian Maximum Likelihood
Maximize the likelihood $\Pr(y \mid f|_S)$ over $A$, $\mu$, and $\sigma^2$, where the intercept is modeled as $b \sim \normal(\mu, \sigma^2 I_m)$.
#### Solution
Clearly, $f(x) \sim \normal(Ax+\mu, \sigma^2 I_m)$. Define $\{x_i\}_{i=1}^N$ and $\{y_i\}_{i=1}^N$ as before. Maximizing
$$
L = \Pr(y \mid f|_S)
= \prod_{i=1}^N \Pr(y_i \mid f(x_i))
= \prod_{i=1}^N (2\pi\sigma^2)^{-m/2} \exp\left(
\frac{-1}{2\sigma^2} \norm{y_i - Ax_i - \mu}_2^2
\right)
$$
is equivalent to minimizing
$$
-\log{L} = \sum_{i=1}^N \frac{m}{2} \log(2\pi\sigma^2)
+ \sum_{i=1}^N \frac{1}{2\sigma^2} \norm{Ax_i + \mu - y_i}_2^2
= \frac{Nm}{2} \log(2\pi\sigma^2) + \frac{a}{2\sigma^2}
$$
where $a = \sum_{i=1}^N \norm{Ax_i + \mu - y_i}_2^2$.
$-\log{L}$ is a smooth function of $\sigma^2$ and $a$. Thus, the minimizers are stationary points.
$$
0 = \dif{\sigma^2} (-\log{L})
= \frac{Nm}{2} \frac{2\pi}{2\pi\sigma^2} - \frac{a}{2(\sigma^2)^2}
\implies \sigma^2 = \frac{a}{Nm}
\\
-\log{L} = \frac{Nm}{2}\log\frac{2\pi a}{Nm} + \frac{Nm}{2}
$$
The problem is thus reduced to minimizing $a$, which is precisely the least mean squared error problem above with $\mu$ in place of $b$.
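A minimal sketch of this reduction: fit $B = \bmat{A & \mu}$ by least squares exactly as before and set $\sigma^2 = a/(Nm)$, the mean squared residual. The synthetic noise parameters below are illustrative assumptions.
```python
import numpy as np

# Illustrative synthetic data (assumptions): noise with mean mu = 0.5 and std 0.3.
rng = np.random.default_rng(2)
n, m, N = 2, 3, 2000
x = rng.normal(size=(n, N))
A_true = rng.normal(size=(m, n))
y = A_true @ x + rng.normal(loc=0.5, scale=0.3, size=(m, N))

# Least-squares fit of B = [A  mu], as in the previous section.
X = np.vstack([x, np.ones((1, N))])
B = np.linalg.solve(X @ X.T, X @ y.T).T

a = np.sum((B @ X - y) ** 2)   # residual sum of squares
sigma2_mle = a / (N * m)       # stationary point sigma^2 = a / (Nm)
print(sigma2_mle)              # approaches 0.3**2 = 0.09 as N grows
```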
## Planar Linear Regression
This is a special case of multivariate linear regression in which $n = m = 1$. It admits an intuitive visualization and can be computed very efficiently.
### Computation
Let $x = \bmat{x_1 & \cdots & x_N}\T$ and $u = \bmat{1 & \cdots & 1}\T$, so that $X\T = \bmat{x & u}$.
$$
\bmat{A & b} = Y \bmat{x & u}
\left( \bmat{x\T \\ u\T} \bmat{x & u} \right)^{-1}
= \bmat{Yx & Yu} \bmat{x\T x & x\T u \\ u\T x & u\T u}^{-1}
\\ = \frac{1}{Nx\T x - (u\T x)^2} \bmat{Yx & Yu}
\bmat{N & -u\T x \\ -u\T x & x\T x}
= \frac{1}{N\sum x_i^2 - (\sum x_i)^2} \bmat{\sum x_iy_i & \sum y_i}
\bmat{N & -\sum x_i \\ -\sum x_i & \sum x_i^2}
$$
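A minimal sketch of these closed-form sums, with `numpy.polyfit` as an independent reference; the synthetic data are an illustrative assumption.
```python
import numpy as np

# Illustrative synthetic data (assumptions): true slope 2, intercept -1, small noise.
rng = np.random.default_rng(3)
N = 50
x = rng.normal(size=N)
y = 2.0 * x - 1.0 + 0.1 * rng.normal(size=N)

Sx, Sy = x.sum(), y.sum()
Sxx, Sxy = (x * x).sum(), (x * y).sum()
den = N * Sxx - Sx**2
A = (N * Sxy - Sx * Sy) / den      # slope
b = (Sxx * Sy - Sx * Sxy) / den    # intercept

print(A, b)
print(np.polyfit(x, y, 1))         # reference: [slope, intercept]
```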