---
title: DeepNN Notes on Automatic Differentiation
tags: DeepNN, Teaching, Lecture Notes
description: Lecture notes on general algorithms for calculating Jacobians and gradients, to motivate the use of backpropagation in deep learning.
---

# Automatic differentiation

## Background: gradient-based optimisation

To train a neural network, we would like to minimise the empirical risk:

$$
\hat{R}(\mathbf{w}, \mathcal{D}) = \sum_{n=1}^N \ell(y_n, \hat{y}(x_n, \mathbf{w}))
$$

We will do this with (a variant of) gradient descent. The most basic version of gradient descent looks like this:

$$
\mathbf{w}_{t+1} = \mathbf{w}_{t} - \eta \nabla_\mathbf{w} \hat{R}(\mathbf{w}_t),
$$

where I denote by $\nabla_\mathbf{w} \hat{R}(\mathbf{w}_t)$ the gradient of the empirical loss function with respect to $\mathbf{w}$, evaluated at $\mathbf{w}_t$. So, in order to run gradient descent, we have to be able to calculate gradients.

Recall the functional form of a multilayer perceptron:

$$
\small
\hat{y}(x) = \phi\left(b_L + W_L \phi\left(b_{L-1} + W_{L-1} \phi\left( \cdots \phi\left(b_1 + W_1 x\right) \cdots \right)\right)\right)
$$

The function's output is a function of a function of a $\ldots$ function of $W_1$. For the purposes of this note, we are going to ignore the specifics and just look at how to calculate Jacobians of functions with this kind of compositional structure. So consider instead, generally, trying to compute the Jacobian of a function of the following form:

$$
f_L(f_{L-1}(\cdots f_1(\mathbf{w}))),
$$

where each $f_l: \mathbb{R}^{d_{l-1}} \rightarrow \mathbb{R}^{d_{l}}$ is a differentiable function that acts on $d_{l-1}$-dimensional vectors and outputs a $d_{l}$-dimensional vector. How do I calculate the derivative, or more generally the Jacobian, of such a nested function with respect to $\mathbf{w}$?

## The chain rule and automatic differentiation

Recall the chain rule of differentiation. The Jacobian can be written as a product of matrices (or tensors, in the general case) like so:

$$
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \cdots \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}.
$$

This notation hides a lot of detail. Here is what I mean by each term:

$$
\frac{\partial \mathbf{f}_l}{\partial \mathbf{f}_{l-1}} = \left. \frac{\partial \mathbf{f}_l(\mathbf{a})}{\partial \mathbf{a}}\right\vert_{\mathbf{a} = f_{l-1}(\cdots(f_1(\mathbf{w})))},
$$

which is the Jacobian of the function $f_l$ with respect to its input, evaluated at $f_{l-1}(\cdots(f_1(\mathbf{w})))$.

Let's assume for a second that we have calculated each of these Jacobians. What's left to do is to multiply them together. Due to the associative property of matrix multiplication, I can go about doing this in various orders.
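
To see this associativity concretely, here is a minimal JAX sketch. The three layer functions and their dimensions ($d_0 = 4$, $d_1 = 3$, $d_2 = 5$, $d_3 = 2$) are made up purely for illustration; the point is only that each per-layer Jacobian is evaluated at the right intermediate activation, and that two different multiplication orders recover the same Jacobian of the composite function.

```python
import jax
import jax.numpy as jnp

# Made-up layer functions with dimensions d0=4, d1=3, d2=5, d3=2
def f1(w): return jnp.tanh(jnp.arange(12.).reshape(3, 4) @ w)
def f2(a): return jnp.sin(jnp.arange(15.).reshape(5, 3) @ a)
def f3(a): return jnp.cos(jnp.arange(10.).reshape(2, 5) @ a)

def f(w): return f3(f2(f1(w)))  # the composite function

w = jnp.ones(4)

# Per-layer Jacobians, each evaluated at the corresponding intermediate activation
J1 = jax.jacobian(f1)(w)          # d1 x d0
J2 = jax.jacobian(f2)(f1(w))      # d2 x d1
J3 = jax.jacobian(f3)(f2(f1(w)))  # d3 x d2

# By associativity, both multiplication orders give the Jacobian of f
right_to_left = J3 @ (J2 @ J1)
left_to_right = (J3 @ J2) @ J1

print(jnp.allclose(right_to_left, jax.jacobian(f)(w)))  # True
print(jnp.allclose(left_to_right, jax.jacobian(f)(w)))  # True
```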
I can start at the input $\mathbf{w}$ and work my way through like this:

$$
\small
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \left( \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \left( \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \cdots \left( \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \left( \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}} \right) \right) \cdots \right) \right)
$$

Or, I could start at the output, and work my way backwards like so:

$$
\small
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \left( \left( \cdots \left( \left( \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \right) \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \right) \cdots \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \right) \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \right) \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}
$$

Or, I can choose any arbitrary, funky order like this:

$$
\small
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \left( \left( \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \right) \left( \left( \cdots \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \right) \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \right) \right)\frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}
$$

These three options are called, in the order they were presented,

* forward-mode
* reverse-mode
* mixed-mode

automatic differentiation, autodiff for short.

## Computational cost of autodiff

### Forward-mode

Let's look at how many operations it takes to evaluate the chain rule in forward mode:

$$
\small
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \left( \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \left( \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \cdots \left( \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \left( \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}} \right) \right) \cdots \right) \right)
$$

I start by multiplying $\frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}$ by $\frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}}$. The former is $d_1\times d_0$, where $d_0$ is the dimensionality of $\mathbf{w}$ itself, while $d_1$ is the dimensionality of the output vector after applying $f_1$. Similarly, the other matrix is $d_2\times d_1$ in size. Multiplying these together takes $d_2d_1d_0$ operations. We then end up with a $d_2\times d_0$ matrix, which we have to multiply from the left by a $d_3\times d_2$ one. This takes $d_3d_2d_0$ operations.
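
To make this bookkeeping concrete, here is a small Python sketch that tallies the per-step multiplication costs for both orderings, counting the product of an $m\times k$ matrix with a $k\times n$ matrix as $mkn$ multiply-accumulate operations. The layer widths in `dims` are made up for illustration.

```python
dims = [4, 3, 5, 5, 2]  # hypothetical d_0, d_1, ..., d_L for a 4-layer composite

def forward_mode_flops(d):
    # start with J_1 (d_1 x d_0), then left-multiply by J_2, J_3, ... in turn
    flops, rows, cols = 0, d[1], d[0]
    for l in range(2, len(d)):
        flops += d[l] * rows * cols   # (d_l x d_{l-1}) times (d_{l-1} x d_0)
        rows = d[l]                   # running product is now d_l x d_0
    return flops

def reverse_mode_flops(d):
    # start with J_L (d_L x d_{L-1}), then right-multiply by J_{L-1}, J_{L-2}, ...
    flops, rows, cols = 0, d[-1], d[-2]
    for l in range(len(d) - 3, -1, -1):
        flops += rows * cols * d[l]   # (d_L x d_{l+1}) times (d_{l+1} x d_l)
        cols = d[l]                   # running product is now d_L x d_l
    return flops

print(forward_mode_flops(dims))  # 200 for these made-up widths
print(reverse_mode_flops(dims))  # 104
```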
Following this logic, we can see that the number of floating point operations required for forward-mode autodiff is:

$$
d_0d_1d_2 + d_0d_2d_3 + \ldots + d_0d_{L-1}d_L = d_0 \sum_{l=2}^{L}d_ld_{l-1}
$$

### Reverse-mode

Reverse-mode autodiff multiplies the Jacobians in this order:

$$
\small
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \left( \left( \cdots \left( \left( \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \right) \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \right) \cdots \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \right) \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \right) \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}
$$

The cost of this, following an argument similar to before, is:

$$
\small
d_Ld_{L-1}d_{L-2} + d_{L}d_{L-2}d_{L-3} + \ldots + d_Ld_{1}d_0 = d_L \sum_{l=0}^{L-2}d_ld_{l+1}
$$

## Memory cost of autodiff

The different automatic differentiation methods also differ in their memory cost. Recall what each factor in the product means:

$$
\frac{\partial \mathbf{f}_l}{\partial \mathbf{f}_{l-1}} = \left. \frac{\partial \mathbf{f}_l(\mathbf{a})}{\partial \mathbf{a}}\right\vert_{\mathbf{a} = f_{l-1}(\cdots(f_1(\mathbf{w})))}.
$$

These are Jacobians which have to be evaluated at $\mathbf{a} = f_{l-1}(\cdots(f_1(\mathbf{w})))$.

### Forward-mode

When running forward-mode autodiff, we compute these in the same order as we would normally compute the output of the function, so all we have to keep in memory is

* the activation of the current layer, $f_{l-1}(\cdots(f_1(\mathbf{w})))$, which is $d_{l-1}$-dimensional,
* the product of the Jacobians so far, $\frac{\partial \mathbf{f}_{l-1}}{\partial \mathbf{w}}$, which is $d_{l-1} \times d_0$ in size,
* and the new Jacobian term, which is $d_{l} \times d_{l-1}$ in size.

The overall memory budget we need is therefore:

$$
(d_0 + 1)\cdot \max(d_l) + \max(d_ld_{l-1})
$$

### Reverse-mode

In reverse mode, we start by computing the last Jacobian first. To calculate $\frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}}$, we first have to calculate $f_{L-1}(\mathbf{w})$. However, to compute this, we also had to compute $f_{L-2}(\mathbf{w})$, $f_{L-3}(\mathbf{w})$, and so on already. We will need these later, so it would be wasteful to throw them away. In reverse-mode autodiff, we therefore typically store all activations $f_{l}(\mathbf{w})$ in memory, which has an additional cost. The total memory cost works out to be:

$$
(d_L + 1)\cdot \max(d_l) + \max(d_ld_{l-1}) + \sum_{l=0}^{L-1} d_l
$$

Which one of these terms dominates depends on whether the function is deeper (i.e. $L$ is large) or wider (i.e. the $d_l$ are large).

## Backprop in deep networks

In conclusion, if we have to calculate a Jacobian, we have several options for how to go about it. The computational and memory costs of forward-mode, reverse-mode (and mixed-mode) autodiff differ substantially depending on the dimensionality of the various layers in the composite function.

In deep learning, we generally want to calculate the gradient of a scalar function (the empirical risk), thus $d_L = 1$. This means that for calculating gradients, reverse-mode automatic differentiation is almost always the optimal choice from a computational cost perspective. In neural networks, reverse-mode automatic differentiation is called backpropagation.

On rare occasions we want to calculate things other than the gradient of a scalar function. We can use automatic differentiation creatively to do so.
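
Before turning to those examples, here is a quick JAX check of the $d_L = 1$ case; the particular loss function and its input dimension are made up, and only the shapes matter. Reverse mode (`jax.grad`, i.e. backpropagation) produces the whole gradient from a single backward pass, whereas forward mode (`jax.jacfwd`) has to propagate one tangent vector per input dimension.

```python
import jax
import jax.numpy as jnp

# A made-up scalar-valued "empirical risk" of a d_0-dimensional parameter vector (d_L = 1)
def loss(w):
    return jnp.sum(jnp.tanh(w) ** 2)

w = jnp.ones(1000)  # d_0 = 1000

# Reverse mode (backpropagation): a single backward pass yields the whole gradient
grad_reverse = jax.grad(loss)(w)

# Forward mode propagates one tangent vector per input dimension (1000 of them here),
# which is why it is the wrong tool when the output is a scalar and d_0 is large
grad_forward = jax.jacfwd(loss)(w)

print(jnp.allclose(grad_reverse, grad_forward))  # True, but reverse mode is far cheaper
```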
### Example: Calculating a Hessian

The Hessian of a function is the matrix of second partial derivatives:

$$
H(\mathbf{w}) = \frac{\partial^2}{\partial\mathbf{w}\partial\mathbf{w}^T} L(\mathbf{w}) = \left[ \begin{matrix} \frac{\partial^2 L(\mathbf{w})}{\partial w_i \partial w_j} \end{matrix} \right]_{i,j = 1\ldots \dim(\mathbf{w})}
$$

If we have access to the gradient function, we can of course calculate this by running automatic differentiation twice. The first time, we calculate the gradient of a scalar function, so backpropagation (reverse-mode) is optimal. The gradient, however, is a function whose output has the same dimensionality as its input. So to calculate the Jacobian of the gradient, forward-mode tends to be faster. So, if we want to compute a Hessian, running forward-mode automatic differentiation on top of reverse-mode automatic differentiation is a good algorithm.

### Example: Hessian-vector product

It's rarely the case that we want to calculate the full Hessian in deep learning, partly because the Hessian is enormous, and we may not even be able to store it. When we want to do something with the Hessian, it is often sufficient to calculate the product of the Hessian with a vector $\mathbf{v}$. To do this cleverly, we can notice that:

$$
\mathbf{v}^TH(\mathbf{w}) = \frac{\partial}{\partial\mathbf{w}} \left( \mathbf{v}^T \frac{\partial}{\partial\mathbf{w}} L(\mathbf{w}) \right)
$$

This suggests an algorithm where we run automatic differentiation twice: first, to calculate the gradient of $L(\mathbf{w})$; then, to calculate the gradient of $\mathbf{v}^T \frac{\partial}{\partial\mathbf{w}} L(\mathbf{w})$, which is a scalar function. Since in both steps we run autodiff on scalar functions, using backprop twice is the ideal algorithm here.

## Further reading

I recommend playing with the [JAX framework](https://github.com/google/jax#jax-autograd-and-xla-), which implements automatic differentiation in a very clear and transparent way. In particular, I find the [Autodiff Cookbook](https://jax.readthedocs.io/en/latest/notebooks/autodiff_cookbook.html) very good at illustrating some of the things I covered here, while also adding a little more colour to the discussion of autodiff and of Jacobian-vector products (JVPs) and vector-Jacobian products (VJPs). Of course, other frameworks, like [pytorch autograd](https://pytorch.org/docs/stable/autograd.html), use largely the same algorithms to calculate gradients, so it's also a good idea to spend some time understanding how it's done under the hood there. It's worth knowing how the computational graph gets built, and how you can change the way gradients are passed around if you want to implement something more complex.
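
As a small companion to the JAX reading above, here is a minimal sketch of the Hessian-vector product trick from the example above, running reverse mode twice. The loss function is made up, and the explicit `jax.hessian` call is only there to verify the result on a problem tiny enough to store the full Hessian.

```python
import jax
import jax.numpy as jnp

# A made-up scalar loss; only its smoothness matters for this illustration
def loss(w):
    return jnp.sum(jnp.tanh(w) ** 2) + 0.5 * jnp.dot(w, w)

def hvp(f, w, v):
    # inner grad: the gradient of f; outer grad: gradient of the scalar v . grad f(w)
    return jax.grad(lambda x: jnp.vdot(jax.grad(f)(x), v))(w)

w = jnp.linspace(-1.0, 1.0, 5)
v = jnp.ones(5)

# Check against the explicit Hessian (only feasible because w is tiny here)
H = jax.hessian(loss)(w)
print(jnp.allclose(hvp(loss, w, v), H @ v))  # True
```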