$$
\newcommand{\R}{\mathbb{R}}
\newcommand{\der}{\partial}
\newcommand{\dldt}{\frac{\der l}{\der t}}
\newcommand{\len}{\text{length}}
\newcommand{\en}{\text{energy}}
\newcommand{\dim}{\text{dim}}
\newcommand{\tr}{\text{trace}}
\newcommand{\lin}{\text{Lin}}
\newcommand{\rnk}{\text{rank}}
\newcommand{\ht}{\widehat}
\newcommand{\dwdt}{\frac{\der w}{\der t}}
\newcommand{\l}{\mathscr{l}}
\newcommand{\E}{\mathbb E}
$$
# Why Gradient Descent Minimizes the Loss
I've often heard that neural networks are difficult to understand because their loss functions are not convex, making it impossible to prove that gradient descent will always minimize the loss.
For the past few months, I've been examining a two-layer MLP, aiming to prove that it converges to zero loss. By approaching the problem from a geometric perspective, I've come across some straightforward techniques that circumvent the need for convexity.
By "geometric ideas," I refer to viewing the parameters as a vector and the process of learning as a differentiable curve in that vector space. Concepts like gradient flows and the energy of a curve are key to the argument.
Currently, this article consists of several semi-rigorous proofs demonstrating that a simple 2-layer MLP learning via gradient descent can bring the loss arbitrarily close to the global minimum if the hidden layer is wide enough.
There are several crucial points that I am completely ignoring for now:
* Do the paths of stochastic gradient descent and gradient descent approximate one another? This depends on the learning rate. How small does it need to be for the approximation to be good?
* Large datasets. All current proofs just deal with datasets containing a single data point.
* More general NN architectures: general MLPs, convnets, transformers, etc.
* The learning algorithm we implement in practice performs discrete steps. I need to discuss the connection between discrete gradient descent and gradient flow, specifically addressing how small the learning rate needs to be for the two algorithms to behave similarly. This is important if you care about how many FLOPs we need to minimize the loss.
This is a work in progress. I will keep working on and updating the article. Apologies for any writing or math errors. While I'm uncertain about the novelty of these ideas, I found them interesting and wanted to share them. Please let me know if you are aware of related work.
## 0) Notation and Concepts
Much of the theory in this article relies on an inner product on the space of model parameters. The parameters tend to be either vectors in $\R^{n}$ or matrices in $\R^{n\times m}$, so we need to define suitable inner products for these two types of objects.
**Inner Product and L2 norm** For two vectors $x,y \in \R^n$ we take the inner product to be $<x, y> = x^T y$. For two matrices $M,N\in \R^{n\times m}$ we use $<M, N> = \tr(M^T N)$. Remember that any inner product induces a norm $|\cdot|$ via $|v| = \sqrt{<v,v>}$, and you can verify that, in the $\R^n$ and $\R^{n\times m}$ cases, this induced norm takes the following forms:
\begin{align}
|x| = \sqrt{\sum_i x_i^2} \quad\quad |M| = \sqrt{\sum_{ij} M_{ij}^2}
\end{align}
These are also known as the L2 norm of a vector and the Frobenius norm of a matrix.
**Operator Norm** If $M: V \to W$ is a linear map between two vector spaces equipped with inner products, then the operator norm $|| \cdot ||$ is defined as
$$
||M|| = \max_{v\in V,\, v \neq 0} \frac{|Mv|}{|v|}
$$
where the numerator uses the norm of $W$ and the denominator the one of $V$. When the two norms on $V$ and $W$ arise from inner products, $||M||$ is the largest singular value of $M$.
**Derivative** Let $f: V \to W$ be a function between two vector spaces. The derivative at a point $v\in V$ is a linear function $\der f(v): V\to W$. It tells us how changing the input $v \to v + u$ will affect the output (if the change $u\in V$ is small). Concretely,
$$
\der f(v)(u) \simeq f(v + u) - f(v)
$$
Some people would call $\der f(v)(u)$ the directional derivative of $f$ along $u$ at a point $v$.
**Path Derivative** An especially important case is when we are taking the derivative of a function $h:\R \to V$ (a path through $V$). Here, the notation $\der h(t): \R \to V$ is a little heavy handed. Any linear map $M: \R \to V$ can instead be represented by the vector $v = M(1)$ together with scalar-vector multiplication: just note that $M(r) = r \; v$ for all $r\in \R$. The familiar prime notation $h'$ applies this idea to path derivatives by defining $h'(t) = \der h(t)(1)$.
**Gradient** The gradient of a differentiable function $f: V \to \R$, denoted $\nabla f$, is a map $\nabla f: V\to V$ defined so that $\nabla f(v)\in V$ is the unique vector satisfying
$$
\der f(v)(u) = <\nabla f(v), u>
$$
for all $u\in V$. Sometimes, when it's clear from the context what the function $f$ is, we can just use the shorthand $\hat v = \nabla f(v)$. $\blacksquare$
*Note: You might be used to a different definition of the gradient. The one you know turns out to be equivalent to the one I'm using. If you find this confusing you should take a look at this great [article](https://charlesfrye.github.io/math/2018/03/06/frechet-derivative-introduction.html).*
**Gradient Flow:** Let $V$ be a vector space with an inner product and let $f: V\to \R$ be a differentiable, lower bounded function. The gradient flow starting at $v_0\in V$ is defined as the path $h: [0, \infty) \to V$ satisfying
$$
h(0) = v_0 \quad\text{and}\quad h'(t) = - \nabla f\circ h(t)
$$
The existence of such a path follows from the existence and uniqueness theorems for ODEs. TODO: Explain why the flow $h$ doesn't "diverge to infinity" in finite time, i.e. that the flow isn't just a curve $h:[0, \epsilon) \to V$ but a map with domain $[0, \infty)$. This uses the assumption that $f$ is lower bounded. $\blacksquare$
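To make the definition concrete, here is a minimal numpy sketch (an illustration only, not part of any argument) that approximates a gradient flow by taking many small explicit Euler steps. The choice $f(v) = \frac 1 2 |v|^2$ is arbitrary; its flow has the closed form $h(t) = v_0 e^{-t}$, which lets us compare.

```python
import numpy as np

# Minimal sketch: approximate the gradient flow h'(t) = -grad_f(h(t))
# with small explicit Euler steps. Here f(v) = 0.5*|v|^2, so grad_f(v) = v
# and the exact flow is h(t) = v0 * exp(-t).
def gradient_flow(grad_f, v0, t_max, dt=1e-3):
    v = np.array(v0, dtype=float)
    for _ in range(round(t_max / dt)):
        v = v - dt * grad_f(v)     # one Euler step of h'(t) = -grad f(h(t))
    return v

v_approx = gradient_flow(lambda v: v, v0=[1.0, -2.0], t_max=3.0)
v_exact = np.exp(-3.0) * np.array([1.0, -2.0])
print(v_approx, v_exact)           # the two should nearly agree
```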
## 1) Learning with Gradient Flows
**Definition 1.1:** The learning problem setup is composed of a vector space $W$ equipped with an inner product $<\cdot, \cdot>$, an initial point $w_0 \in W$, a function $f: W \to \R^+$ and a learning path $\gamma$, where:
* $W$ is the parameter space that controls the behavior of our model. $w_0$ is the initial point from which the learning will proceed.
* $f$ is the loss function that tells us how good the parameters are (presumably at fitting some dataset). A loss function must be lower bounded so it's nice to assume wlog that $\inf_{w} f(w) = 0$.
* A path $\gamma: [0, \infty) \to W$ given by some learning algorithm that must be specified.
From these objects we usually define some useful quantities and variables:
* A loss curve $l:[0, \infty) \to \R$ defined as $l(t) = f\circ \gamma(t)$.
* To refer to the weights at some moment in time $t$ we write $w=\gamma(t)$. If we want to talk about how $w$ changes wrt time we write $\frac{\der w}{\der t}$, which is equivalent to $\gamma'(t)$.
* $L = l(0) = f(w_0)$. $L$ tells us the amount of "work" that gradient descent has to do to fully solve the problem.
$\blacksquare$
The basic idea of training a model is that we will start with some initial parameters $w_0\in W$ and slowly move them to minimize the loss. The most important algorithm in this space is gradient descent.
The gradient descent algorithm updates the parameters at any given point during training with the rule $w_{t+1} = w_t - \delta \nabla f(w_t)$ for some learning rate hyperparameter $\delta \in \R^+$. Even though this is the algorithm we use in practice, it's very useful to think about what happens in the limit $\delta \to 0$. Then, the weights end up following a continuous curve $\gamma:[0, \infty) \to W$ which is the gradient flow of $f$.
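Here is a tiny sketch of that limit (again with the arbitrary choice $f(w) = \frac 1 2 w^2$, whose exact flow is $w_0 e^{-t}$): as $\delta$ shrinks, the discrete iterates land closer and closer to the gradient flow.

```python
import numpy as np

# Sketch: discrete gradient descent on f(w) = 0.5*w^2 with shrinking
# learning rate delta approaches the exact gradient flow w(t) = w0*exp(-t).
w0, t = 2.0, 1.0
exact = w0 * np.exp(-t)
for delta in (0.5, 0.1, 0.01, 0.001):
    w = w0
    for _ in range(round(t / delta)):
        w = w - delta * w          # w_{s+1} = w_s - delta * grad f(w_s)
    print(delta, abs(w - exact))   # the error shrinks with delta
```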
**Result 1.1** The derivative of the loss curve is minus the squared magnitude of the gradient:
$$l'(t) = - | \nabla f\circ \gamma(t)| ^2$$
**Proof**
\begin{align}
l'(t) &= \der f(w)( \small\dwdt)
\quad &\text{(chain rule)} \\
&= < \nabla f(w), \small\dwdt>
\quad &\text{(gradient def.)} \\
&= -< \nabla f(w), \nabla f(w)>
\quad &\text{(grad flow def.)} \\
&= - | \nabla f\circ \gamma(t)| ^2
\end{align}
$\blacksquare$
So the squared magnitude of the gradient is the rate at which the loss decreases! A large gradient means fast learning.
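As a sanity check of result 1.1 (not a proof, and with an arbitrarily chosen non-quadratic loss $f(w) = \sum_i w_i^4$), we can compare a finite difference of the loss curve against $-|\nabla f(w)|^2$ along an Euler approximation of the flow:

```python
import numpy as np

# Sketch: along an Euler approximation of the gradient flow of
# f(w) = sum(w^4), the finite difference of the loss should match
# -|grad f(w)|^2 up to O(dt) error.
f      = lambda w: np.sum(w ** 4)
grad_f = lambda w: 4.0 * w ** 3

dt, w = 1e-5, np.array([1.0, -0.5, 0.8])
for _ in range(3):
    l_before, g = f(w), grad_f(w)
    w = w - dt * g                       # one Euler step of the flow
    print((f(w) - l_before) / dt, -np.dot(g, g))   # should nearly agree
```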
## 2) Length and Energy of a Path
Let's introduce some important definitions and results about differentiable paths.
**Definition: length and energy of a path** Given a normed vector space $V$ and a differentiable path $h: [0, t] \to V$, the length and energy of the path are defined as
\begin{align}
\len(h) &= \int_0^t |h'(s)| ds \\
\en(h) &= \int_0^t |h'(s)|^2 ds
\end{align}
$\blacksquare$
When the path $h$ is a gradient flow of $f$, the energy has an important interpretation: it measures how much the gradient flow has reduced the loss.
**Observation 2.1: energy of a gradient flow** If $\gamma:[0, t] \to W$ is a gradient flow of $f:W \to \R$, then
\begin{align}
\en(\gamma) &= \int_0^t |\gamma'(s)|^2 ds \\
&= \int_0^t |\nabla f \circ \gamma(s)|^2 ds
= -\int_0^t l'(s) ds \\
&= l(0) - l(t)
\end{align}
Since $l(t)\ge 0$, this also means that the energy of any gradient flow can be bounded. Letting $L=l(0)=f(w_0)$, then
$$\en(\gamma) \le L$$
$\blacksquare$
An important result is that the length and energy of a path are related by the following bound:
**Result 2.1:** If $h: [0, t] \to W$ is a differentiable path, then
$$
\len(h)^2 \le t\; \en(h)
$$
**Proof** Just note that
\begin{align}
\len(h) &= \int_0^t |h'(s)| ds \\
&\le \sqrt{\int_0^t 1^2 ds } \sqrt{\int_0^t | h'(s)|^2 ds } \;\;\;\;\;\; \text{(by Cauchy-Schwarz)} \\
&= \sqrt{ t \; \en(h)}
\end{align}
Since $\len(h)$ is nonnegative and squaring is monotone on nonnegative numbers, the result follows. $\blacksquare$
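A quick numerical illustration of result 2.1 (my own example, using the arbitrary path $h(s) = (s^2, 0)$, whose length over $[0,t]$ is $t^2$ and whose energy is $4t^3/3$):

```python
import numpy as np

# Sketch: check length(h)^2 <= t * energy(h) for h(s) = (s^2, 0) on [0, t],
# approximating the integrals with a left Riemann sum.
t, steps = 2.0, 100000
s = np.linspace(0.0, t, steps + 1)[:-1]   # left endpoints
ds = t / steps
speed = 2.0 * s                           # |h'(s)| = 2s
length = np.sum(speed * ds)               # ~ t^2 = 4
energy = np.sum(speed ** 2 * ds)          # ~ 4*t^3/3 = 32/3
print(length ** 2, t * energy)            # 16 <= ~21.3
```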
**Result 2.2: triangle inequality for integrals** If $V$ is a normed vector space, $a,b\in \R$ and $h: [a, b] \to V$ is a differentiable path, then the following inequality holds:
$$
\left|\int_a^b h(s) ds \right| \le \int_a^b |h(s)| \: ds
$$
**Proof** The proof is a simple application of the definition of the integral (as the limit of a sum) combined with the triangle inequality of the norm. $\blacksquare$
The following result is the culmination of this section. It tells us that, no matter what the architecture and loss landscape look like, learning via a gradient flow for a time $t$ cannot move the weights very far away from the initialization.
**Result 2.3** If $f$ is our loss function, $w_0$ the initial parameters, $L=f(w_0)$, and $\gamma$ is a gradient flow of $f$ starting at $w_0$, then
$$
|\gamma(t) - w_0| \le \sqrt{t\;L}
$$
**Proof**
\begin{align}
|\gamma(t) - \gamma(0)|
&= \left |\int_0^t \gamma'(s) ds \right| \\
&\le \int_0^t \left | \gamma'(s) \right| ds \quad &\text{(result 2.2)} \\
&= \len(\gamma) \\
&\le \sqrt{t \; \en(\gamma)} \quad &\text{(result 2.1)}
\end{align}
The last step is using observation 2.1, which tells us that $\en(\gamma) \le L$. $\blacksquare$
## 3) A Replacement for Convexity
We are trying to show that neural networks trained via gradient descent will (approximately) reach $0$ loss. One way to get this type of guarantee is to assume the loss function $f$ is convex. But that is simply not the case for NNs. We will need some replacement for convexity that is applicable to NNs. I believe property 3.1 captures the core idea.
**Property 3.1** For some fixed $\alpha \in \R^+$ and all $w\in W$,
$$
\frac{|\nabla f(w)|^2}{f(w)} \ge \alpha
$$
$\blacksquare$
This assumption is another way to guarantee that no bad local minima exist. Think about it: a suboptimal local minimum would be a point $w$ with zero gradient but nonzero loss, and property 3.1 clearly guarantees such points don't exist. But the following result tells us something much stronger: if the property holds, the loss is guaranteed to decay to $0$ exponentially fast with time.
**Result 3.1** If $f$ satisfies property 3.1, then
$$
l(t) \le L \; e^{-\alpha t}
$$
**Proof**
\begin{align}
\frac{\der \ln l(t)}{\der t} &= \frac{l'(t)}{l(t)}
= -\frac{|\nabla f \circ \gamma(t)|^2}{f\circ \gamma(t)}
\le -\alpha \\
\end{align}
so
$$
\ln l(t) - \ln l(0) = \int_0^t \frac{\der \ln l(s)}{\der s} ds \le - \int_0^t \alpha ds = - \alpha t
$$
And then $\ln l(t) \le \ln L - \alpha t$. To conclude, just use the fact that the exponential is a monotone function and
$$
l(t) = e^{\ln l(t)} \le e^{\ln L - \alpha t} = L e^{ - \alpha t}
$$
as desired. $\blacksquare$
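To see the result in action, here is a small sketch with the arbitrary choice $f(w) = \frac 1 2 |w|^2$, which satisfies property 3.1 with $\alpha = 2$ since $|\nabla f(w)|^2 / f(w) = |w|^2 / (\frac 1 2 |w|^2) = 2$. Along an Euler approximation of the flow, the loss stays below the bound $L e^{-2t}$:

```python
import numpy as np

# Sketch: f(w) = 0.5*|w|^2 satisfies property 3.1 with alpha = 2.
# Integrate its gradient flow with small Euler steps and compare the loss
# l(t) against the guaranteed bound L * exp(-alpha * t).
dt, w = 1e-4, np.array([1.0, 2.0, -1.5])
L = 0.5 * np.dot(w, w)
for step in range(1, 30001):
    w = w - dt * w                        # gradient flow of 0.5*|w|^2
    if step % 10000 == 0:
        t = step * dt
        print(t, 0.5 * np.dot(w, w), L * np.exp(-2.0 * t))
```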
## 4) A More Realistic Case
The next question we should ask is obvious: do neural networks satisfy property 3.1? The short answer is **no!** To see that, just consider an MLP with "degenerate" parameters where all the weight matrices are $0$. The MLP has zero gradient even when it has high loss.
But the long answer is more interesting. It turns out that the standard initialization techniques for NNs guarantee that, at initialization, $|\nabla f(w_0)|^2 \ge \alpha f(w_0)$ for some large $\alpha$. And even better, a relaxed version of this inequality keeps holding as we move away from $w_0$. Consider this generalization of property 3.1:
**Property 4.1** Given a loss function $f$ and an initialization $w_0\in W$, there are constants $\alpha, \beta \in \R^+$ such that
$$
\frac{|\nabla f(w)|^2}{f(w)} \ge \alpha - \beta \; |w-w_0|
$$
for all $w\in W$. $\blacksquare$
Section 5 is dedicated to showing that property 4.1 applies to a 2 layer MLP. But for now, let's investigate the implications of this modified property.
**Result 4.2** If $f$ and $w_0$ satisfy property 4.1, then
$$
l(\infty) \le L \; \exp({-\frac {\alpha^3}{3 \beta^2 L} })
$$
**Proof** The proof follows an argument very similar to result 3.1.
\begin{align}
\frac{\der \ln l(t)}{\der t} &= -\frac{|\nabla f(w)|^2}{f(w)} \quad &\text{(result 1.1)} \\
&\le - \alpha + \beta \; |w - w_0| \quad &\text{(property 4.1)}\\
&\le - \alpha + \beta \sqrt {L t} \quad &\text{(result 2.3)} \\
\end{align}
Integrating both sides from $0$ to $t$ we get
$$
\ln l(t) - \ln L \le - \alpha t + \frac{2}{3} \beta \sqrt L \; t^{3/2}
$$
which implies
$$
l(t) \le L \exp(- \alpha t + \frac{2}{3} \beta \sqrt L \; t^{3/2})
$$
Now solve for the value of $t$ that minimizes the term in the exponential. By setting the derivative to $0$ you can easily verify that it is $t^* = \frac {\alpha^2} {\beta^2 L}$. At this particular time the bound is tightest, and it tells us that:
$$
l(t^*) \le L \exp({-\frac{\alpha^3}{3\beta^2 L}})
$$
Since the loss is non-increasing along a gradient flow (result 1.1), $l(\infty) \le l(t^*) \le L\exp({-\frac{\alpha^3}{3\beta^2 L}})$, proving the result. $\blacksquare$
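As a quick numerical sanity check of the minimization step in the proof (with arbitrary values of $\alpha$, $\beta$ and $L$):

```python
import numpy as np

# Sketch: verify numerically that t* = alpha^2/(beta^2*L) minimizes the
# exponent g(t) = -alpha*t + (2/3)*beta*sqrt(L)*t^(3/2) and that
# g(t*) = -alpha^3/(3*beta^2*L). The constants are arbitrary.
alpha, beta, L = 3.0, 0.7, 2.0
g = lambda t: -alpha * t + (2.0 / 3.0) * beta * np.sqrt(L) * t ** 1.5

t = np.linspace(0.0, 50.0, 200001)
t_star = alpha ** 2 / (beta ** 2 * L)
print(t[np.argmin(g(t))], t_star)                      # grid minimum vs t*
print(g(t_star), -alpha ** 3 / (3 * beta ** 2 * L))    # should agree
```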
## 5) The Simplest Neural Network
The simplest network I will study is a 2-layer MLP with a ReLU nonlinearity. It has two parameter matrices $M\in \R^{k\times m}$ and $N\in \R^{n\times k}$. The ReLU is denoted by $\sigma$. Given an input $x \in \R^m$, the MLP produces the output
$$
y = N \sigma(M x)
$$
It's convenient to break down the computation into steps and define intermediate variables. First we apply a matrix to the input, $a = Mx$, then the nonlinearity to get $b=\sigma(a)$, and finally we produce the output $y=Nb$.
The variables/activations $a,b,y$ depend on the parameters $M,N$, so, when we consider changing the parameters with time, these activations will change too.
**Parameter space** The weight vectors are pairs of matrices $w = (N, M)$ and so the parameter space is
$$W= \R^{n\times k} \oplus \R^{k\times m}$$
**The loss function** In keeping with the philosophy of studying the simplest example, let's consider the loss on a single input-target pair $(x, q)\in \R^m \oplus \R^n$, and take the loss to be half the squared L2 error:
$$
f(w) = \frac 1 2 |y - q|^2
$$
It's also useful to define the loss variable $\l = f(w)$. In the initialization subsection we will need $x$ to be normalized, so we just assume that $|x|^2 = m$.
**Initialization** The standard way to initialize neural networks is to independently sample all the entries of the weight matrices from some distribution. In this case we use
$$
M_{ij} \sim \mathcal N (0, {\small \frac 2 m} ) \quad\quad N_{ij} \sim \mathcal N (0, {\small \frac 2 k})
$$
This is the commonly used He init [], which works well for networks with ReLU nonlinearities.
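Here is a small numpy sketch of this setup (the sizes $m, k, n$, the random seed and the target $q$ are arbitrary choices of mine): the forward pass, the He initialization and the loss on a single pair $(x, q)$.

```python
import numpy as np

# Sketch of the model in this section: y = N * relu(M x), He initialization,
# half squared error on a single (x, q) pair. The sizes are arbitrary.
rng = np.random.default_rng(0)
m, k, n = 20, 100, 5

x = rng.normal(size=m)
x *= np.sqrt(m) / np.linalg.norm(x)                  # normalize so |x|^2 = m
q = rng.normal(size=n)                               # target

M = rng.normal(0.0, np.sqrt(2.0 / m), size=(k, m))   # He init
N = rng.normal(0.0, np.sqrt(2.0 / k), size=(n, k))

a = M @ x                  # pre-activations
b = np.maximum(a, 0.0)     # ReLU
y = N @ b                  # output
loss = 0.5 * np.sum((y - q) ** 2)
print(loss, np.sum(b ** 2))   # |b|^2 should land near k = 100 (see below)
```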
### Gradients
The gradients of the loss $\l$ wrt all the intermediate variables and weights are:
$$
\ht y = y -q, \quad
\ht b = N^T \ht y, \quad
\ht a =\der \sigma(a)^T \ht b, \quad
\ht M = \ht a x^T, \quad
\ht N = \ht y b^T
$$
Note that $|\hat y |^2 = 2\l$, so when the loss is large, the gradient of $\l$ wrt $y$ is large too. Ultimately we are trying to show something similar, but about the gradients of $\l$ wrt $M$ and $N$, so that we can apply result 4.2.
Below I've written a short derivation of all these gradient formulas, but it uses some unconventional notation and techniques. If you find them confusing, just work out the gradients in your own way and confirm you get the same answers. We start with $\ht y$, the gradient of $\l$ wrt $y$. Let $\dot y \in \R^n$ denote an arbitrary change to $y$. Then
\begin{align}
<\ht y, \dot y> =\frac {\der \l} {\der y} (\dot y)
&\simeq \frac 1 2 |y +\dot y - q|^2 - \frac 1 2 |y - q|^2 \\
&= \frac 1 2 ( <\dot y, y-q> + <\dot y, \dot y> + <y-q, \dot y> ) \\
&\simeq <y-q, \dot y> \quad \text{(dropping the higher order term)} \\
\end{align}
Now let's see how a change in $b$ affects $y$. Like before, let $\dot b$ denote a change to $b$. The derivative $\frac {\der y}{\der b}(\dot b) = N(b+ \dot b) - N b = N \dot b$. So, the gradient of $\l$ wrt $b$ satisfies
$$
<\ht b, \dot b>
= {\frac {\der \l}{\der b}(\dot b)}
= {\frac {\der \l}{\der y}} \left( {\frac {\der y}{\der b}}(\dot b) \right)
= <\ht y, {\frac {\der y}{\der b}} (\dot b)>
= <\hat y, N \dot b> = <N^T \ht y, \dot b>
$$
Now let's look at $N$. The derivative $\frac {\der y}{\der N}(\dot N) = (N+\dot N) b - N b = \dot N b$. And the gradient
$$
<\ht N, \dot N> = <\ht y, \dot N b> = <\ht y b^T, \dot N>
$$
Finally, the gradients wrt $a$ and $M$:
\begin{gather}
<\hat a, \dot a>
= <\hat b, \der \sigma(a)(\dot a)>
= <\der \sigma(a)^T \ht b, \dot a> \\
<\ht M, \dot M>
= <\ht a, \dot M x> = <\ht a x^T, \dot M>
\end{gather}
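If you'd rather trust a numerical check than the derivation above, here is a small sketch that compares the closed-form gradients with finite differences of the loss (random sizes and a fixed seed, both arbitrary):

```python
import numpy as np

# Sketch: compare the closed-form gradients of l = 0.5*|N*relu(Mx) - q|^2
# with finite differences. Shapes follow the text: x in R^m, M in R^{k x m},
# N in R^{n x k}, q in R^n.
rng = np.random.default_rng(0)
m, k, n = 5, 7, 3
x, q = rng.normal(size=m), rng.normal(size=n)
M, N = rng.normal(size=(k, m)), rng.normal(size=(n, k))

def loss(M, N):
    b = np.maximum(M @ x, 0.0)
    return 0.5 * np.sum((N @ b - q) ** 2)

# closed-form gradients from the formulas above
a = M @ x; b = np.maximum(a, 0.0); y = N @ b
y_hat = y - q                            # grad wrt y
b_hat = N.T @ y_hat                      # grad wrt b
a_hat = (a > 0).astype(float) * b_hat    # dsigma(a)^T applied to b_hat
M_hat = np.outer(a_hat, x)               # grad wrt M
N_hat = np.outer(y_hat, b)               # grad wrt N

# finite-difference check on one entry of each weight matrix
eps = 1e-6
dM = np.zeros_like(M); dM[2, 1] = eps
dN = np.zeros_like(N); dN[1, 3] = eps
print((loss(M + dM, N) - loss(M, N)) / eps, M_hat[2, 1])
print((loss(M, N + dN) - loss(M, N)) / eps, N_hat[1, 3])
```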
### Activations at Initialization
Remember that we are sampling the initial weight matrices via
$$
M_{ij} \sim \mathcal N (0, {\small \frac 2 m} ) \quad\quad N_{ij} \sim \mathcal N (0, {\small \frac 2 k})
$$
Since $M,N$ are random variables, so are $a$ and $b$. This initialization uses carefully chosen variances to ensure that the entries of the initial activations $a = M x$ and $b = \sigma(a)$ are within some reasonable range. In particular, we want the entries of $b$ to satisfy $\E[b_i^2] = 1$. And since all the entries are independent, that will mean that $|b|^2 \simeq k$, so $b$ lives close to the sphere of radius $\sqrt k$. But this is only the case if the input vector is also normalized; that is why we needed the assumption that $|x|^2 = m$. This section is dedicated to proving these statements.
First we want to understand $\E[a_i^2]$
\begin{align}
\E[a_i^2]
&= \E[ ({\small \sum_j M_{ij} x_j})^2] \\
&= \E[ {\small \sum_j M_{ij}^2 x_j^2 + \sum_{j'\neq j} M_{ij} x_j M_{ij'} x_{j'} } ] \\
&= \sum_j \E[M_{ij}^2 x_j^2] \\
&= \frac 2 m |x|^2= 2
\end{align}
where we used the independence of different entries of $M$ and the fact that $\E[M_{ij}]=0$. Then
$$
\E[|a|^2] = \sum_i \E[a_i^2] = 2k
$$
Let $p_M, p_N$ denote the pdfs of the entries $M_{ij}$ and $N_{ij}$ (all entries of each matrix have the same pdf because they are iid). The pdf of $a_i$ is then the nested integral of $\delta(z - M_i^T x)$, where $M_i$ denotes the $i$-th row of $M$, as $M_{i1},\cdots,M_{im}$ range from $-\infty$ to $\infty$, and $\delta$ denotes the Dirac delta function.
$$
p_{a}(z) = \int_{-\infty}^\infty \cdots \int_{-\infty}^\infty p_M(M_{i1}) \cdots p_M(M_{im}) \; \delta (z - M_i^T x)\; d M_{i1} \cdots d M_{im}
$$
Recall that a distribution is symmetric if $p(z) = p(-z)$. Since $p_M, p_N$ are both zero-mean Gaussians, they are symmetric. The next thing we need to prove is that the pdf of $a_i$ is symmetric too.
\begin{align}
p_{a}(z) &= \int_{-\infty}^\infty \cdots \int_{-\infty}^\infty p_M(M_{i1}) \cdots p_M(M_{im}) \; \delta (z - M_i^T x)\; d M_{i1} \cdots d M_{im} \\
&= \int_{-\infty}^\infty \cdots \int_{-\infty}^\infty p_M(-M_{i1}) \cdots p_M(-M_{im}) \; \delta (z + M_i^T x)\; d M_{i1} \cdots d M_{im} \\
&= \int_{-\infty}^\infty \cdots \int_{-\infty}^\infty p_M(M_{i1}) \cdots p_M(M_{im}) \; \delta (z + M_i^T x)\; d M_{i1} \cdots d M_{im} \\
&= p_a(-z)
\end{align}
The first step uses the change of variable $M_{ij}\to -M_{ij}$ and the second one exploits the symmetry of $p_M$. Finally, this allows us to compute the term we care about
\begin{align}
\E[b_i^2] &= \int_{-\infty}^\infty p_a(a_i) \sigma(a_i)^2 da_i \\
&= \int_{-\infty}^0 p_a(a_i) \:0\: da_i + \int_0^\infty p_a(a_i) a_i^2 da_i \\
&= \frac 1 2 \int_{-\infty}^\infty p_a(a_i) a_i^2 da_i \quad \text{(using symmetry)}\\
&= \frac 1 2 \E[a_i^2] \\
&= 1
\end{align}
and therefore $\E[|b|^2] = \sum_i \E[b_i^2] = k$.
So far we have only shown that $\E[|b|^2] = k$, but we wanted to claim that the particular samples themselves satisfy $|b|^2\simeq k$. Of course, the rigorous way to go about it is to prove concentration statements like $\Pr\big( \big| |b|^2 - k \big| > \epsilon \big) \le \delta$. But this type of argument is very tedious and adds very little insight into the ideas this document is exploring. So, to conclude in a clean way, I'll just assume that $|b|^2 = k$ at initialization.
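For the skeptical reader, here is a quick Monte Carlo check of the claim $\E[|b|^2] = k$ (the sizes and the number of trials are arbitrary choices):

```python
import numpy as np

# Sketch: Monte Carlo check that E[|b|^2] = k under He initialization,
# with the input normalized so that |x|^2 = m.
rng = np.random.default_rng(0)
m, k, trials = 50, 200, 2000

x = rng.normal(size=m)
x *= np.sqrt(m) / np.linalg.norm(x)      # enforce |x|^2 = m

squared_norms = []
for _ in range(trials):
    M = rng.normal(0.0, np.sqrt(2.0 / m), size=(k, m))
    b = np.maximum(M @ x, 0.0)
    squared_norms.append(np.sum(b ** 2))
print(np.mean(squared_norms), k)         # the average should be close to k
```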
### Activations During Learning
It is not only important that $a_0 = M_0 x$ and $b_0 = \sigma(a_0)$ are within some reasonable range at init; we need to know they will remain so during learning. Concretely, if we let $(N_0, M_0)=w_0$ and let $w\in W$ be an arbitrary vector, we want to derive upper bounds on $|a - a_0|$ and $|b - b_0|$ in terms of $|w-w_0|$. First
\begin{align*}
|a_0 - a| &= |M_0 x - Mx| \\
& \le ||M_0 - M|| \; |x|\quad \text{(operator norm def.)} \\
& \le \sqrt m \; |M_0 - M| \quad \text{(operator norm $\le$ Frobenius norm, $|x| = \sqrt m$)} \\
& \le \sqrt m \; |w-w_0|
\end{align*}
To bound $|b_0 - b |$ we use the fact that the ReLU $\sigma$ is 1-Lipschitz. So
$$
|b_0 - b| = |\sigma(a_0) - \sigma(a)| \le |a_0 - a | \le \sqrt m \; |w-w_0|
$$
### Learning Guarantees
To apply result 4.2 we need to find $\alpha, \beta \in \R^+$ such that $|\nabla f(w)|^2 \ge f(w)( \alpha - \beta \; |w-w_0|)$. The first step is to lower bound the magnitude of the gradient wrt $N$.
\begin{align}
| \ht{N} |^2 &= |\ht y b^T|^2 = \text{trace}(b \, \ht y^T \ht y \, b^T)
= |\ht{y}|^2 |b|^2
= 2\l \, |b|^2 \\
&\ge 2\l \, (|b_0| - |b_0 - b|)^2
= 2\l \, ( \sqrt k - |b_0 - b| )^2 \\
&= 2\l \, ( k - 2 \sqrt k |b_0 - b| + |b_0 - b| ^2 ) \\
&\ge 2\l \, ( k - 2 \sqrt k |b_0 - b| ) \\
&\ge 2\l \, ( k - 2 \sqrt {km} |w_0 - w| )
\end{align}
We could also attempt to derive a lower bound for $|\hat M|^2$, but it's not really necessary; we already have enough to apply result 4.2.
\begin{align}
\frac{|\nabla f(w)|^2}{f(w)} &= \frac{|\hat M|^2 + |\hat N|^2}{f(w)} \\
&\ge \frac{|\hat N|^2}{f(w)} \\
&\ge \frac{2\l ( k - 2 \sqrt {km}|w_0 - w|)}{f(w)} \\
&= 2k - 4 \sqrt {km} |w_0 - w| \quad \text{(using $f(w) = \l$)}\\
\end{align}
Setting $\alpha = 2k$ and $\beta = 4 \sqrt {km}$, the application of result 4.2 gives
$$
l(\infty) \le L \; \exp\left({-\frac {\alpha^3}{3 \beta^2 L}}\right) = L \; \exp\left({-\frac {k^2}{6 m L} }\right)
$$
Looking at the above equation, it is apparent that the scale of the MLP helps it learn: by growing the hidden width $k$, we can make the bound on the final loss arbitrarily small.
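To close the section, here is a small end-to-end sketch: plain gradient descent (with a small, width-dependent learning rate chosen by hand to roughly mimic the flow) on a single $(x, q)$ pair, using the gradient formulas derived above. It is an illustration of the claim that the loss gets driven to (approximately) zero, not a verification of the exact constant in the bound.

```python
import numpy as np

# Sketch: train the 2-layer ReLU MLP on a single (x, q) pair with plain
# gradient descent and watch the loss for a few hidden widths k.
# The learning rate and step count are arbitrary, hand-picked choices.
def train(k, m=20, n=5, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    lr = 0.05 / k                                        # shrink lr with width
    x = rng.normal(size=m); x *= np.sqrt(m) / np.linalg.norm(x)  # |x|^2 = m
    q = rng.normal(size=n)
    M = rng.normal(0.0, np.sqrt(2.0 / m), size=(k, m))   # He init
    N = rng.normal(0.0, np.sqrt(2.0 / k), size=(n, k))
    def loss():
        return 0.5 * np.sum((N @ np.maximum(M @ x, 0.0) - q) ** 2)
    initial = loss()
    for _ in range(steps):
        a = M @ x; b = np.maximum(a, 0.0); y = N @ b
        y_hat = y - q
        a_hat = (a > 0).astype(float) * (N.T @ y_hat)
        M -= lr * np.outer(a_hat, x)                     # gradient step on M
        N -= lr * np.outer(y_hat, b)                     # gradient step on N
    return initial, loss()

for k in (10, 100, 1000):
    print(k, train(k))   # initial loss vs (near zero) final loss
```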
## Appendix:
**Result A.1** Let $A\in \R^{n\times m}$ and recall that $|\cdot|$ and $||\cdot||$ denote the Frobenius and operator norms respectively. Then
$$|| A || \le { \sqrt {\text{rank}(A) } } \; |A|$$
**Proof** TODO $\blacksquare$
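A quick numerical spot check of the claimed inequality (random low-rank matrices with arbitrary sizes):

```python
import numpy as np

# Sketch: spot-check ||A|| <= sqrt(rank(A)) * |A| on random low-rank matrices.
rng = np.random.default_rng(0)
for r in (1, 2, 5):
    A = rng.normal(size=(8, r)) @ rng.normal(size=(r, 6))   # rank <= r
    op   = np.linalg.norm(A, 2)    # operator norm (largest singular value)
    frob = np.linalg.norm(A)       # Frobenius norm
    print(op <= np.sqrt(r) * frob, round(op, 3), round(np.sqrt(r) * frob, 3))
```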
**Result A.2** If $\gamma: [0, t] \to \R^{n\times m}$ is a differentiable path with $M_0 = \gamma(0)$ and $M = \gamma(t)$, and the derivative satisfies $\text{rank}( \gamma'(s) ) \le r$ for all $s$, then
$$
||M - M_0|| \le \sqrt r \; \len(\gamma)
$$
**Proof** First we use the triangle inequality for integrals (result 2.2) with the operator norm
$$||M - M_0|| = || \int_0^t \gamma'(s) ds || \le \int_0^t || \gamma'(s) || ds
$$
Then, result A.1 tells us that $|| A || \le \sqrt{\rnk(A)} \; |A|$ and so
$$
||M - M_0||
\le \int_0^t \sqrt r \;| \gamma'(s) | ds
= \sqrt r \;\len(\gamma)
$$
$\blacksquare$