To train a neural network, we would like to minimise the empirical risk:

$$\hat{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \ell\left(f_\theta(x_n), y_n\right),$$

where $\theta$ denotes the parameters, $(x_n, y_n)$ the training data, and $\ell$ the per-example loss. We will do this with (a variant of) gradient descent. The most basic version of gradient descent looks like this:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \hat{L}(\theta_t),$$

where I denote by $\nabla_\theta \hat{L}(\theta_t)$ the gradient of the empirical loss function with respect to $\theta$, evaluated at $\theta_t$. So, in order to run gradient descent, we have to be able to calculate gradients.
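To make this concrete, here is a minimal sketch of the update rule in JAX; the quadratic `loss_fn`, the initial parameters and the learning rate are all made up for illustration, standing in for the empirical risk and its minimisation:

```python
import jax
import jax.numpy as jnp

def loss_fn(theta):
    # Stand-in for the empirical risk: any differentiable scalar function works.
    return jnp.sum((theta - 1.0) ** 2)

theta = jnp.zeros(3)   # initial parameters
eta = 0.1              # learning rate

for t in range(100):
    g = jax.grad(loss_fn)(theta)   # gradient evaluated at the current parameters
    theta = theta - eta * g        # gradient descent update
```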
Recall the functional form of a multilayer perceptron:

$$f_\theta(x) = \sigma_L\!\left(W_L\, \sigma_{L-1}\!\left(\cdots \sigma_1\!\left(W_1 x + b_1\right) \cdots \right) + b_L\right).$$

The function's output is a function of a function of a function of the input $x$. For the purposes of this note, we are going to ignore the specifics and just look at how to calculate Jacobians of functions which have a compositional nature like this. So consider instead, generally, trying to compute the Jacobian of a function of the following form:

$$f(x) = f_L\!\left(f_{L-1}\!\left(\cdots f_2\!\left(f_1(x)\right) \cdots\right)\right),$$

where each $f_l$ is a differentiable function that takes a $d_{l-1}$-dimensional vector and outputs a $d_l$-dimensional vector.

How do I calculate the derivative, or in general, the Jacobian, of such a nested function with respect to $x$?
Recall the chain rule of differentiation. The Jacobian can be written as a product of matrices (or tensors, in the general case) like so:

$$\frac{\partial f(x)}{\partial x} = J_{f_L} \cdot J_{f_{L-1}} \cdots J_{f_2} \cdot J_{f_1}.$$

This notation hides a lot of detail. Here is what I mean by each term:

$$J_{f_l} := \frac{\partial f_l(z)}{\partial z}\bigg\rvert_{z = z_{l-1}}, \quad \text{where } z_{l-1} = f_{l-1}\!\left(f_{l-2}\!\left(\cdots f_1(x)\right)\right),$$

which is the Jacobian of the function $f_l$ with respect to its input, evaluated at $z_{l-1}$, the intermediate value obtained by applying the first $l-1$ functions to $x$.
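As a sanity check on the chain rule, here is a sketch with three made-up layers (the dimensions $d_0=3$, $d_1=4$, $d_2=5$, $d_3=2$ and the `tanh` nonlinearities are arbitrary choices): the product of the per-layer Jacobians, each evaluated at its own intermediate value, matches the Jacobian of the full composition computed by `jax.jacobian`.

```python
import jax
import jax.numpy as jnp

# Three made-up differentiable layers with dimensions d0=3, d1=4, d2=5, d3=2.
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
W1 = jax.random.normal(k1, (4, 3))
W2 = jax.random.normal(k2, (5, 4))
W3 = jax.random.normal(k3, (2, 5))

f1 = lambda x: jnp.tanh(W1 @ x)
f2 = lambda z: jnp.tanh(W2 @ z)
f3 = lambda z: W3 @ z
f = lambda x: f3(f2(f1(x)))          # the full composition

x = jnp.ones(3)
z1, z2 = f1(x), f2(f1(x))            # intermediate values

# Per-layer Jacobians, each evaluated at its own input.
J1 = jax.jacobian(f1)(x)             # d1 x d0
J2 = jax.jacobian(f2)(z1)            # d2 x d1
J3 = jax.jacobian(f3)(z2)            # d3 x d2

print(jnp.allclose(J3 @ J2 @ J1, jax.jacobian(f)(x)))  # True, up to float error
```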
Let's assume for a second we have calculated each of these Jacobians. What's left to do is to multiply them together. Due to the associative property of matrix multiplication, I can go about doing this in various orders.
I can start at the input and work my way through like this:

$$J_{f_L} \cdot \Big(J_{f_{L-1}} \cdots \big(J_{f_3} \cdot \left(J_{f_2} \cdot J_{f_1}\right)\big)\Big)$$

Or, I could start at the output, and work my way backwards like so:

$$\Big(\big(\left(J_{f_L} \cdot J_{f_{L-1}}\right) \cdot J_{f_{L-2}}\big) \cdots \Big) \cdot J_{f_1}$$

Or, I can choose any arbitrary, funky way like this:

$$\left(J_{f_L} \cdot J_{f_{L-1}}\right) \cdot \Big(\left(J_{f_{L-2}} \cdot J_{f_{L-3}}\right) \cdots J_{f_1}\Big)$$
These three options are called, in the order they were presented: forward-mode, reverse-mode, and mixed-mode automatic differentiation (autodiff for short).
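JAX exposes the first two orderings directly as `jax.jacfwd` and `jax.jacrev`; whichever order you pick, the resulting Jacobian is the same up to floating-point error. A quick sketch on a made-up test function:

```python
import jax
import jax.numpy as jnp

def f(x):
    # An arbitrary vector-valued test function.
    return jnp.tanh(jnp.cumsum(x)) * x[0]

x = jnp.arange(1.0, 6.0)

J_fwd = jax.jacfwd(f)(x)   # forward-mode: multiply from the input side
J_rev = jax.jacrev(f)(x)   # reverse-mode: multiply from the output side
print(jnp.allclose(J_fwd, J_rev))  # True
```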
Let's look at how many operations it takes to evaluate the chain rule in forward-mode:

$$J_{f_L} \cdot \Big(J_{f_{L-1}} \cdots \big(J_{f_3} \cdot \left(J_{f_2} \cdot J_{f_1}\right)\big)\Big)$$

I start by multiplying $J_{f_1}$ by $J_{f_2}$. The former is $d_1 \times d_0$, where $d_0$ is the dimensionality of $x$ itself, while $d_1$ is the dimensionality of the output vector after applying $f_1$. Similarly, the other matrix, $J_{f_2}$, is $d_2 \times d_1$ in size. Multiplying these together takes $\mathcal{O}(d_2 d_1 d_0)$ operations. Then we end up with a $d_2 \times d_0$ matrix, which we have to multiply from the left by a $d_3 \times d_2$ one. This takes $\mathcal{O}(d_3 d_2 d_0)$ operations.

Following this logic we can see that the number of floating point operations required for forward-mode autodiff is:

$$\mathcal{O}\!\left(d_0 \sum_{l=2}^{L} d_{l-1} d_l\right)$$
Reverse-mode autodiff multiplies the Jacobians in this order:

$$\Big(\big(\left(J_{f_L} \cdot J_{f_{L-1}}\right) \cdot J_{f_{L-2}}\big) \cdots \Big) \cdot J_{f_1}$$

The cost of this, following an argument similar to before, is:

$$\mathcal{O}\!\left(d_L \sum_{l=1}^{L-1} d_{l-1} d_l\right)$$
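To get a feel for these two counts, here is a small helper (the layer widths are made up, and the formulas are the asymptotic counts derived above) that evaluates both for a given list of dimensions $d_0, \dots, d_L$:

```python
def forward_mode_flops(dims):
    # dims = [d0, d1, ..., dL]; the running product is a (d_l x d0) matrix at step l.
    d0 = dims[0]
    return sum(d0 * dims[l - 1] * dims[l] for l in range(2, len(dims)))

def reverse_mode_flops(dims):
    # The running product is a (dL x d_l) matrix, built from the output side inwards.
    dL = dims[-1]
    return sum(dL * dims[l - 1] * dims[l] for l in range(1, len(dims) - 1))

# A made-up architecture: d0 = 1000, five hidden layers of width 1000, scalar output.
dims = [1000, 1000, 1000, 1000, 1000, 1000, 1]
print(forward_mode_flops(dims))  # 4_001_000_000: forward-mode pays the d0 factor
print(reverse_mode_flops(dims))  # 5_000_000: reverse-mode pays only the dL = 1 factor
```

For a scalar-valued output ($d_L = 1$), the reverse-mode count is smaller by roughly a factor of $d_0$.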
The different automatic differentiation methods also differ in their memory cost. Recall what each factor in the product means:

$$J_{f_l} := \frac{\partial f_l(z)}{\partial z}\bigg\rvert_{z = z_{l-1}}$$

These are Jacobians which have to be evaluated at the intermediate values $z_{l-1} = f_{l-1}\!\left(f_{l-2}\!\left(\cdots f_1(x)\right)\right)$.

When running forward-mode autodiff, we are computing these in the same order as we would normally compute the output of the function, so all we have to keep in memory is the current intermediate value $z_l$ and the running product of Jacobians $J_{f_l} \cdots J_{f_1}$, which is a $d_l \times d_0$ matrix. The overall memory budget we need is therefore:

$$\mathcal{O}\!\left(d_0 \max_l d_l\right)$$
In reverse-mode, we start by computing the last Jacobian, $J_{f_L}$, first. To calculate $J_{f_L}$, we first have to calculate the intermediate value $z_{L-1}$ at which it is evaluated. However, to compute this, we also had to compute $z_1$, $z_2$, and so on already. We will need these later, so it would be wasteful to throw them away. In reverse-mode autodiff, we therefore typically store all activations in memory, which has an additional cost. The total memory cost works out to be:

$$\mathcal{O}\!\left(d_L \max_l d_l + \sum_{l=1}^{L-1} d_l\right)$$

Which one of these terms dominates depends on whether the function is deeper (i.e. $L$ is large) or wider (i.e. the $d_l$ are large).
In conclusion, if we have to calculate a Jacobian, we have several options of how to go about it. The computational and memory cost of forward-mode, reverse-mode (and mixed-mode) will differ substantially depending on the dimensionality of the various layers in the composite function.
In deep learning, we generally want to calculate the gradient of a scalar function (the empirical risk), thus $d_L = 1$. This means that for calculating gradients, reverse-mode automatic differentiation is almost always the optimal choice from both a computational and a memory cost perspective.
In neural networks, reverse-mode automatic differentiation is called backpropagation.
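In JAX, `jax.grad` is the reverse-mode path: for a scalar-valued function of many parameters, the full Jacobian is a single row, so one reverse pass recovers the entire gradient. A sketch with a made-up loss:

```python
import jax
import jax.numpy as jnp

def scalar_loss(theta):
    # Made-up scalar-valued function of many parameters (dL = 1).
    return jnp.sum(jnp.tanh(theta) ** 2)

theta = jnp.linspace(-1.0, 1.0, 1000)

g = jax.grad(scalar_loss)(theta)     # gradient via reverse mode (backpropagation)
J = jax.jacrev(scalar_loss)(theta)   # the "Jacobian" of a scalar function is just the gradient
print(g.shape, jnp.allclose(g, J))   # (1000,) True
```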
On rare occasions we want to calculate things other than the gradient of a scalar function. We can use automatic differentiation creatively to do so.
The Hessian of a function is the matrix of second partial derivatives:

$$H_{ij}(\theta) = \frac{\partial^2 f(\theta)}{\partial \theta_i \, \partial \theta_j}$$
If we have access to the gradient function, we can of course calculate this by running automatic differentiation twice. The first time, we calculate the gradient of a scalar function, so backpropagation (reverse-mode) is optimal. The gradient, however, is a function that has the same output dimensionality as its input. So to calculate the Jacobian of the gradient, forward-mode tends to be faster.
So, if we want to compute a Hessian, running forward-mode automatic differentiation on reverse-mode automatic differentiation is a good algorithm.
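In JAX this composition is literally `jax.jacfwd(jax.grad(f))`. A sketch on a made-up function, checked against the built-in `jax.hessian`:

```python
import jax
import jax.numpy as jnp

def f(theta):
    # Made-up scalar function whose Hessian we want.
    return jnp.sum(jnp.sin(theta) * theta ** 2)

theta = jnp.arange(1.0, 5.0)

H = jax.jacfwd(jax.grad(f))(theta)   # forward-mode on top of reverse-mode
print(H.shape)                                 # (4, 4)
print(jnp.allclose(H, jax.hessian(f)(theta)))  # True
```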
It's rarely the case that we want to calculate the Hessian in deep learning, partly because the Hessian is enormous, and we may not even be able to store it.
What's often sufficient when we want to do something with the Hessian is to calculate the product of the Hessian with a vector $v$. To do this cleverly, we can notice that:

$$H v = \nabla_\theta \left( \nabla_\theta f(\theta)^\top v \right)$$

This suggests an algorithm where we run automatic differentiation twice. First, to calculate the gradient of $f$, then, to calculate the gradient of $\nabla_\theta f(\theta)^\top v$, which is a scalar function. Since in both steps we run autodiff on scalar functions, using backprop twice is the ideal algorithm here.
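Here is a sketch of this double-backprop Hessian-vector product in JAX, on a made-up function and vector; the full Hessian is never materialised by `hvp`, it is only formed at the end to check the result:

```python
import jax
import jax.numpy as jnp

def f(theta):
    # Made-up scalar function.
    return jnp.sum(jnp.sin(theta) * theta ** 2)

def hvp(f, theta, v):
    # Gradient of the scalar function theta -> <grad f(theta), v>:
    # two reverse-mode passes, no Hessian ever stored.
    return jax.grad(lambda t: jnp.vdot(jax.grad(f)(t), v))(theta)

theta = jnp.arange(1.0, 5.0)
v = jnp.ones(4)

print(jnp.allclose(hvp(f, theta, v), jax.hessian(f)(theta) @ v))  # True
```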
I recommend playing with the JAX framework which implements automatic differentiation in a very clear and transparent way. In particular, I find this Autodiff Cookbook very good to illustrate some of the things I covered here while also adding a little more colour to the discussion of autodiff and Jacobian-vector products (JVPs) and vector-Jacobian products (VJPs).
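For reference, a minimal sketch of those two primitives on a made-up function: `jax.jvp` pushes a tangent vector forward through the Jacobian (forward-mode), while `jax.vjp` pulls a cotangent vector back through it (reverse-mode):

```python
import jax
import jax.numpy as jnp

def f(x):
    # Made-up vector-valued function.
    return jnp.tanh(jnp.cumsum(x))

x = jnp.arange(1.0, 5.0)
v = jnp.ones(4)

# Forward mode: Jacobian-vector product J(x) @ v, without building J.
y, jvp_out = jax.jvp(f, (x,), (v,))

# Reverse mode: vector-Jacobian product u^T J(x), without building J.
y2, vjp_fn = jax.vjp(f, x)
u = jnp.ones(4)
(vjp_out,) = vjp_fn(u)

J = jax.jacobian(f)(x)
print(jnp.allclose(jvp_out, J @ v), jnp.allclose(vjp_out, u @ J))  # True True
```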
Of course, other frameworks, like pytorch autograd use largely the same algorithms to calculate gradients, so it's also a good idea to spend some time understanding how it's done under the hood there. It's worth knowing about how the computational graph gets built, and how you can change the way gradients are passed around if you want to implement something more complex.