$$
\newcommand{\Vec}{\operatorname{vec}}
\newcommand{\T}{^\top}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\R}{\mathbb{R}}
\newcommand{\bmat}[1]{\begin{bmatrix} #1 \end{bmatrix}}
\newcommand{\dif}[1]{\partial_{ #1 }}
\newcommand{\normal}{\mathcal{N}}
$$
# Linear Regression
## Problem
Given vector spaces $V$ and $W$, a set $S \subseteq V$, and a map $y:S \to W$, find an affine map $f:V \to W$, $x \mapsto Ax+b$, whose restriction to $S$ best fits $y$.
### Remark
- The notion of best-fitting varies by context.
- This setting is more general than necessary. In practice, one almost exclusively works with multivariate linear regression, and in particular with planar linear regression.
## Multivariate Linear Regression
This is the case in which $V = \R^n$, $W = \R^m$, and $S$ is finite. The different senses of best-fitting are illustrated below.
### Least Mean Squared Error
Minimize $\norm{f|_S - y}_{\ell^2} = \left( \sum_{x \in S} \norm{f(x) - y(x)}_2^2 \right)^{1/2}$ subject to $A \in \R^{m \times n}$ and $b \in \R^m$.
#### Solution 1
Order $S$ as $\{x_i\}_{i=1}^N$ and let $y_i = y(x_i)$. Since the square root is increasing, it may be dropped, and the problem is equivalent to minimizing
$$
L = \sum_{i=1}^N \sum_{j=1}^m (Ax_i + b - y_i)_j^2
= \sum_{i=1}^N \sum_{j=1}^m \left(
\bmat{A & b} \bmat{x_i \\ 1} - y_i
\right)_j^2
= \sum_{i=1}^N \sum_{j=1}^m (BX - Y)_{ji}^2
$$
with respect to $B = \bmat{A & b} \in \R^{m \times (n+1)}$, where $X = \bmat{x_1 & \cdots & x_N \\ 1 & \cdots & 1}$ and $Y = \bmat{y_1 & \cdots & y_N}$. $L$ is a smooth function of the entries of $B$, so any minimizer must be a stationary point.
$$
\begin{align*}
0 &= \dif{B_{pq}} L
= \dif{B_{pq}} \sum_{i=1}^N \sum_{j=1}^m \left(
\sum_{k=1}^{n+1} B_{jk}X_{ki} - Y_{ji}
\right)^2
\\ &= \sum_{i=1}^N \dif{B_{pq}} \left(
\sum_{j \neq p} \left(\sum_{k=1}^{n+1} B_{jk}X_{ki} - Y_{ji}\right)^2
+ \left(\sum_{k=1}^{n+1} B_{pk}X_{ki} - Y_{pi}\right)^2
\right)
\\ &= \sum_{i=1}^N \dif{B_{pq}} \left(
\sum_{k \neq q} B_{pk}X_{ki} + B_{pq}X_{qi} - Y_{pi}
\right)^2
= \sum_{i=1}^N 2 \left(\sum_{k=1}^{n+1} B_{pk}X_{ki} - Y_{pi}\right) X_{qi}
\\ &= 2 \sum_{i=1}^N (BX-Y)_{pi} X_{qi}
= 2 \bigg( (BX-Y)X\T \bigg)_{pq}
\\ &\implies (BX-Y)X\T = 0
\implies BXX\T = YX\T
\implies B = YX\T (XX\T)^{-1}
\end{align*}
$$
The last implication holds precisely when $XX\T$ is invertible, which is the case iff $X$ has full row rank $n+1$ (in particular, $N \geq n+1$).
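As a sanity check, the following is a minimal NumPy sketch of the closed form $B = YX\T(XX\T)^{-1}$; the sizes, random seed, and noise level are illustrative assumptions, and `numpy.linalg.lstsq` serves only as an independent reference.
```python
import numpy as np

# Illustrative sizes and noise level (assumptions): n inputs, m outputs, N samples.
rng = np.random.default_rng(0)
n, m, N = 3, 2, 100
A_true = rng.normal(size=(m, n))
b_true = rng.normal(size=(m, 1))
x = rng.normal(size=(n, N))
y = A_true @ x + b_true + 0.01 * rng.normal(size=(m, N))

# Augmented matrices from the derivation: X is (n+1) x N with a row of ones, Y is m x N.
X = np.vstack([x, np.ones((1, N))])
Y = y

# Closed form B = Y X^T (X X^T)^{-1}; solving the normal equations avoids an explicit inverse.
B = np.linalg.solve(X @ X.T, X @ Y.T).T
A_hat, b_hat = B[:, :n], B[:, n]

# Independent reference: least-squares solve of X^T B^T ~ Y^T.
B_ref = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T
print(np.allclose(B, B_ref))
```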
#### Solution 2
To avoid tedious index manipulation, one can work with total derivatives directly, treating $\dif{\Vec(B)}$ as a row vector (the transpose of the gradient). Finding the stationary points amounts to the following.
$$
\begin{align*}
0 &= \dif{\Vec(B)} L
= \dif{\Vec(B)} \Vec(BX-Y)\T \Vec(BX-Y)
\\ &= 2 \Vec(BX-Y)\T \dif{\Vec(B)} \Vec(BX-Y)
\\ &= 2 \Vec(BX-Y)\T \dif{\Vec(B)}
\bigg( (X\T \otimes I_m) \Vec(B) - \Vec(Y) \bigg)
\\ &= 2 \Vec(BX-Y)\T (X\T \otimes I_m)
= 2 \bigg( (X \otimes I_m) \Vec(BX-Y) \bigg)\T
\\ &= 2 \Vec((BX-Y)X\T)\T
\\ &\implies (BX-Y)X\T = 0
\implies BXX\T = YX\T
\implies B = YX\T (XX\T)^{-1}
\end{align*}
$$
The last implication holds as in the previous solution.
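The two vec/Kronecker identities used above, $\Vec(BX) = (X\T \otimes I_m)\Vec(B)$ and $(X \otimes I_m)\Vec(M) = \Vec(MX\T)$, can be checked numerically. The sketch below assumes the column-major vec convention; the matrix sizes are illustrative.
```python
import numpy as np

# Illustrative sizes (assumptions).
rng = np.random.default_rng(1)
m, n, N = 2, 3, 5
B = rng.normal(size=(m, n + 1))
X = rng.normal(size=(n + 1, N))
M = rng.normal(size=(m, N))

def vec(A):
    # Column-major (Fortran-order) stacking, matching vec(AMB) = (B^T kron A) vec(M).
    return A.reshape(-1, order="F")

I_m = np.eye(m)
# vec(BX) = (X^T kron I_m) vec(B)
print(np.allclose(vec(B @ X), np.kron(X.T, I_m) @ vec(B)))
# (X kron I_m) vec(M) = vec(M X^T), for a generic m x N matrix M
print(np.allclose(np.kron(X, I_m) @ vec(M), vec(M @ X.T)))
```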
### Gaussian Maximum Likelihood
Maximize the likelihood $\Pr(y \mid f|_S)$ over $A$, $\mu$, and $\sigma^2$, where the intercept is modeled as $b \sim \normal(\mu, \sigma^2 I_m)$.
#### Solution
Clearly, $f(x) \sim \normal(Ax+\mu, \sigma^2 I_m)$. Define $\{x_i\}_{i=1}^N$ and $\{y_i\}_{i=1}^N$ as before. Maximizing
$$
L = \Pr(y \mid f|_S)
= \prod_{i=1}^N \Pr(y_i \mid f(x_i))
= \prod_{i=1}^N (2\pi\sigma^2)^{-m/2} \exp\left(
\frac{-1}{2\sigma^2} \norm{y_i - Ax_i - \mu}_2^2
\right)
$$
is equivalent to minimizing
$$
-\log{L} = \sum_{i=1}^N \frac{m}{2} \log(2\pi\sigma^2)
+ \sum_{i=1}^N \frac{1}{2\sigma^2} \norm{Ax_i + \mu - y_i}_2^2
= \frac{Nm}{2} \log(2\pi\sigma^2) + \frac{a}{2\sigma^2}
$$
where $a = \sum_{i=1}^N \norm{Ax_i + \mu - y_i}_2^2$.
$-\log{L}$ is a smooth function of $\sigma^2$ and $a$. Thus, the minimizers are stationary points.
$$
0 = \dif{\sigma^2} (-\log{L})
= \frac{Nm}{2} \frac{2\pi}{2\pi\sigma^2} - \frac{a}{2(\sigma^2)^2}
\implies \sigma^2 = \frac{a}{Nm}
\\
-\log{L} = \frac{Nm}{2}\log\frac{2\pi a}{Nm} + \frac{Nm}{2}
$$
The problem is thus reduced to minimizing $a$, which is precisely the least mean squared error problem above with $\mu$ in place of $b$.
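A minimal sketch of this reduction: fit $B = \bmat{A & \mu}$ by least squares exactly as before and set $\sigma^2 = a/(Nm)$, the mean squared residual. The synthetic noise parameters below are illustrative assumptions.
```python
import numpy as np

# Illustrative synthetic data (assumptions): noise with mean mu = 0.5 and std 0.3.
rng = np.random.default_rng(2)
n, m, N = 2, 3, 2000
x = rng.normal(size=(n, N))
A_true = rng.normal(size=(m, n))
y = A_true @ x + rng.normal(loc=0.5, scale=0.3, size=(m, N))

# Least-squares fit of B = [A  mu], as in the previous section.
X = np.vstack([x, np.ones((1, N))])
B = np.linalg.solve(X @ X.T, X @ y.T).T

a = np.sum((B @ X - y) ** 2)   # residual sum of squares
sigma2_mle = a / (N * m)       # stationary point sigma^2 = a / (Nm)
print(sigma2_mle)              # approaches 0.3**2 = 0.09 as N grows
```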
## Planar Linear Regression
This is a special case of multivariate linear regression in which $n = m = 1$. It admits an intuitive visualization and can be computed very efficiently.
### Computation
Let $x = \bmat{x_1 & \cdots & x_N}\T$ and $u = \bmat{1 & \cdots & 1}\T$, so that $X\T = \bmat{x & u}$.
$$
\bmat{A & b} = Y \bmat{x & u}
\left( \bmat{x\T \\ u\T} \bmat{x & u} \right)^{-1}
= \bmat{Yx & Yu} \bmat{x\T x & x\T u \\ u\T x & u\T u}^{-1}
\\ = \frac{1}{Nx\T x - (u\T x)^2} \bmat{Yx & Yu}
\bmat{N & -u\T x \\ -u\T x & x\T x}
= \frac{1}{N\sum x_i^2 - (\sum x_i)^2} \bmat{\sum x_iy_i & \sum y_i}
\bmat{N & -\sum x_i \\ -\sum x_i & \sum x_i^2}
$$
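A minimal sketch of these closed-form sums, with `numpy.polyfit` as an independent reference; the synthetic data are an illustrative assumption.
```python
import numpy as np

# Illustrative synthetic data (assumptions): true slope 2, intercept -1, small noise.
rng = np.random.default_rng(3)
N = 50
x = rng.normal(size=N)
y = 2.0 * x - 1.0 + 0.1 * rng.normal(size=N)

Sx, Sy = x.sum(), y.sum()
Sxx, Sxy = (x * x).sum(), (x * y).sum()
den = N * Sxx - Sx**2
A = (N * Sxy - Sx * Sy) / den      # slope
b = (Sxx * Sy - Sx * Sxy) / den    # intercept

print(A, b)
print(np.polyfit(x, y, 1))         # reference: [slope, intercept]
```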