# Over-parametrization helps gradient descent find a global minimum

written by [@marc_lelarge](https://twitter.com/marc_lelarge)

We look at the simplest possible example showing that over-parametrization helps gradient descent. We start with a definition of gradient descent (to set the notation), then study gradient descent for ordinary least squares, and then turn to the particular setting of over-parametrized models.

## Gradient descent

**setting:** minimize $F:\mathbb{R}^d\to \mathbb{R}$ with gradient descent (GD)

**algo (GD):** pick $\theta_0\in \mathbb{R}^d$ and for $t\geq 1$, let
$$
\theta_{t} = \theta_{t-1} - \gamma F'(\theta_{t-1}),
$$
where $\gamma$ is called the step size or the learning rate.

## Gradient descent for ordinary least squares

Consider a design matrix $\phi\in \mathbb{R}^{n\times d}$ and targets $y\in \mathbb{R}^n$, so that the function to minimize is $F(\theta) =\frac{1}{2n}\|\phi\theta-y\|^2$. The gradient of $F$ is
$$
F'(\theta) = \frac{1}{n}\phi^\top(\phi\theta-y) = \frac{1}{n}\phi^\top\phi \theta - \frac{1}{n}\phi^\top y.
$$
Note that $H=\frac{1}{n}\phi^\top\phi\in \mathbb{R}^{d\times d}$ is the Hessian matrix and a minimizer $\theta^*$ should satisfy
$$
F'(\theta^*)=0 \Leftrightarrow H\theta^* = \frac{1}{n}\phi^\top y.
$$
Moreover, by a simple exact Taylor expansion and the fact that $F'(\theta^*)=0$, we have
$$
F(\theta) - F(\theta^*) = \frac{1}{2}(\theta-\theta^*)^\top H (\theta -\theta^*).
$$
As a consequence, if $\theta^1$ and $\theta^2$ are minimizers, then since $H\theta^1 =H\theta^2 (= \frac{1}{n}\phi^\top y)$, we have $F(\theta^1) = F(\theta^2)$.

Now for any minimizer $\theta^*$, we have:
\begin{align*}
\theta_{t} &= \theta_{t-1} -\gamma H(\theta_{t-1}-\theta^*)\\
\theta_{t}-\theta^* &= (I-\gamma H)(\theta_{t-1}-\theta^*)\\
\theta_{t}-\theta^* &= (I-\gamma H)^{t}(\theta_{0}-\theta^*),
\end{align*}
so that by the exact Taylor expansion above, we get
\begin{align*}
F(\theta_t) - F(\theta^*) &= \frac{1}{2}(\theta_{0}-\theta^*)^\top (I-\gamma H)^t H (I-\gamma H)^t(\theta_{0}-\theta^*)\\
F(\theta_t) - F(\theta^*) &= \frac{1}{2}(\theta_{0}-\theta^*)^\top(I-\gamma H)^{2t}H(\theta_{0}-\theta^*),
\end{align*}
since $H$ is symmetric. We see that the eigenvalues of $(I-\gamma H)^{2t}H$ will play a crucial role. They are equal to $\lambda(1-\gamma \lambda)^{2t}$, where $\lambda$ ranges over the eigenvalues of $H$. Note that $H$ is positive semi-definite, as $x^\top H x= \frac{1}{n}\|\phi x \|^2 \geq 0$, so all its eigenvalues are non-negative: $0\leq \lambda_1\leq \dots \leq \lambda_d$.

If $H$ is invertible, we have $\lambda_1 >0$, so that for $\gamma\leq \lambda_d^{-1}$ we get $\max_{i}\lambda_i(1-\gamma \lambda_i)^{2t}\leq \lambda_d(1-\gamma \lambda_1)^{2t}$ and hence
\begin{align*}
F(\theta_t) - F(\theta^*) &\leq \frac{1}{2}\lambda_d(1-\gamma \lambda_1)^{2t}\|\theta_{0}-\theta^*\|^2.
\end{align*}
Note that since $H$ is invertible, we have a unique minimizer $\theta^* = \frac{1}{n}H^{-1}\phi^\top y$. Finally, taking $\gamma =\frac{1}{\lambda_d}$, we get
$$
F(\theta_t) - F(\theta^*) = O\left((1-\kappa^{-1})^{2t}\right) = O\left(\exp(-2t/\kappa)\right),
$$
where $\kappa=\frac{\lambda_d}{\lambda_1}$ is the condition number of $H$.

Consider now the case where $H$ is not invertible so that $\lambda_1=0$.
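Before turning to that case, here is a minimal numerical sketch of the invertible setting, assuming a random Gaussian design (so that $H$ is invertible almost surely); the dimensions, seed, and iteration count are arbitrary illustrative choices. With $\gamma=1/\lambda_d$, the excess loss $F(\theta_t)-F(\theta^*)$ stays below the bound $\frac{1}{2}\lambda_d(1-\kappa^{-1})^{2t}\|\theta_0-\theta^*\|^2$ derived above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10                      # n > d: H is invertible almost surely
phi = rng.standard_normal((n, d))   # design matrix
y = rng.standard_normal(n)          # targets

F = lambda theta: 0.5 * np.mean((phi @ theta - y) ** 2)   # F(theta) = ||phi theta - y||^2 / (2n)

H = phi.T @ phi / n                              # Hessian of F
lam = np.linalg.eigvalsh(H)                      # eigenvalues of H, ascending
lam_1, lam_d = lam[0], lam[-1]
gamma = 1.0 / lam_d                              # step size gamma = 1 / lambda_d
kappa = lam_d / lam_1                            # condition number of H

theta_star = np.linalg.solve(H, phi.T @ y / n)   # unique minimizer

theta = np.zeros(d)                              # theta_0 = 0
for t in range(1, 101):
    theta -= gamma * phi.T @ (phi @ theta - y) / n          # GD step: theta <- theta - gamma F'(theta)
    if t % 25 == 0:
        excess = F(theta) - F(theta_star)
        bound = 0.5 * lam_d * (1 - 1 / kappa) ** (2 * t) * np.sum(theta_star ** 2)
        print(f"t={t:3d}  excess loss={excess:.3e}  bound={bound:.3e}")
```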
Note that since $1-x\leq e^{-x}$, we have for $\gamma\leq \lambda_d^{-1}$ that
$$
\max_{i}\lambda_i|1-\gamma \lambda_i|^{2t}\leq \max_{i}\lambda_ie^{-2t\gamma\lambda_i} = \frac{1}{2t\gamma}\max_{\alpha\geq 0} \alpha e^{-\alpha} = \frac{1}{2t\gamma e}\leq \frac{1}{4t \gamma},
$$
since $\max_{\alpha\geq 0} \alpha e^{-\alpha}=e^{-1}$. Finally, we get
\begin{align*}
F(\theta_t) - F(\theta^*) &\leq \frac{1}{8t \gamma}\|\theta_{0}-\theta^*\|^2,
\end{align*}
where $\theta^*$ is any minimizer (recall that $F(\theta^*)$ is the same for all minimizers).

## Gradient descent for ordinary least squares in the over-parametrized regime

We now consider the case $d>n$. Since $H \in \mathbb{R}^{d\times d}$ is of rank at most $n$, we see that $H$ is not invertible. Under the assumption that $\phi\phi^\top\in \mathbb{R}^{n\times n}$ is invertible, we can get a much better upper bound than above.

First note that for any $y\in \mathbb{R}^n$, we have $\phi\left(\phi^\top(\phi\phi^\top)^{-1}y\right) = y$, so that $F(\phi^\top(\phi\phi^\top)^{-1}y)=0$. In fact, there are many minimizers with zero loss. We just showed that $\text{Im} \phi = \mathbb{R}^n$, so that $\dim\text{Ker}\phi = d-n>0$, where $\text{Ker}\phi =\{ v, \phi v=0\}$ is the kernel of $\phi$. The solutions of $\phi\theta =y$ are exactly the $\theta = \phi^\top(\phi\phi^\top)^{-1}y +v$ with $v\in \text{Ker}\phi$.

By the previous analysis, we know that $F(\theta_t) = O(t^{-1})$, where $\theta_t$ are the iterates of GD, but we can prove a much better upper bound. Writing $\hat{y}_t = \phi \theta_t$, we have
\begin{align*}
\hat{y}_t &= \hat{y}_{t-1}-\frac{\gamma}{n}\phi\phi^\top (\hat{y}_{t-1}-y)\\
\hat{y}_t-y &= \left(I-\frac{\gamma}{n}\phi\phi^\top \right)(\hat{y}_{t-1}-y)\\
\hat{y}_t-y &= \left(I-\frac{\gamma}{n}\phi\phi^\top \right)^t(\hat{y}_{0}-y).
\end{align*}
Since $\phi\phi^\top$ is invertible, all eigenvalues of $\frac{1}{n}\phi\phi^\top$ are positive, so for $\gamma$ small enough we have exponential convergence of $\hat{y}_t = \phi \theta_t$ towards $y$, and hence $F(\theta_t)=\frac{1}{2n}\|\hat{y}_t-y\|^2$ goes to zero exponentially fast.

Now if we start GD with $\theta_0=0$, we have, for $\theta^*=\phi^\top(\phi\phi^\top)^{-1}y$, that $\theta_{t} = \theta^*-(I-\gamma H)^{t}\theta^*$; expanding $(I-\gamma H)^{t}$ and noting that every remaining term contains a factor $H=\frac{1}{n}\phi^\top\phi$, we can write $\theta_t = \phi^\top\alpha_t$ for some $\alpha_t\in\mathbb{R}^n$. Since $\phi \theta_t = \phi\phi^\top \alpha_t \to y$, we have $\alpha_t \to (\phi\phi^\top)^{-1}y$ and GD converges with $\theta_t \to \theta^*=\phi^\top(\phi\phi^\top)^{-1}y$.

What is special about $\theta^*$? Consider $v\in \text{Ker}\phi$; then we have
$$
v^\top \theta^* = v^\top\phi^\top(\phi\phi^\top)^{-1}y= (\phi v)^\top(\phi\phi^\top)^{-1}y = 0.
$$
Hence $\theta^*\perp \text{Ker}\phi$, so that $\theta^*$ is the minimizer with zero loss of minimal norm: any minimizer can be written as $\theta^*+v$ with $v\in \text{Ker}\phi$, $v \perp \theta^*$, and thus has squared norm $\|\theta^*\|^2+\|v\|^2$.

**Summary: in the over-parametrized regime (for OLS), gradient descent converges exponentially fast towards the minimizer of zero loss with minimal $L_2$ norm.** (A short numerical sketch illustrating this is given at the end of the note.)

To go further: [Gradient Descent Finds Global Minima of Deep Neural Networks](https://arxiv.org/abs/1811.03804) by Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, Xiyu Zhai

###### tags: `public`, `optimization`, `over-parametrization`
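To illustrate the summary above, here is a minimal numerical sketch, assuming a random Gaussian design with $d>n$ (so that $\phi\phi^\top$ is invertible almost surely); the dimensions, seed, and iteration count are arbitrary illustrative choices. Starting GD at $\theta_0=0$, the loss decays geometrically and the iterates approach the minimum-norm interpolator $\phi^\top(\phi\phi^\top)^{-1}y$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 100                      # over-parametrized regime: d > n
phi = rng.standard_normal((n, d))   # design matrix, phi phi^T invertible a.s.
y = rng.standard_normal(n)          # targets

F = lambda theta: 0.5 * np.mean((phi @ theta - y) ** 2)

# Minimum-norm interpolator theta^* = phi^T (phi phi^T)^{-1} y
theta_star = phi.T @ np.linalg.solve(phi @ phi.T, y)

lam_d = np.linalg.eigvalsh(phi.T @ phi / n)[-1]   # largest eigenvalue of H
gamma = 1.0 / lam_d                               # step size

theta = np.zeros(d)                               # start GD at theta_0 = 0
for t in range(1, 401):
    theta -= gamma * phi.T @ (phi @ theta - y) / n
    if t % 100 == 0:
        dist = np.linalg.norm(theta - theta_star)
        print(f"t={t:4d}  loss={F(theta):.3e}  ||theta_t - theta^*||={dist:.3e}")

print("||theta_t|| =", np.linalg.norm(theta), " ||theta^*|| =", np.linalg.norm(theta_star))
```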