To train a neural network, we would like to minimise the empirical risk:

$$\hat{R}(\theta) = \frac{1}{N}\sum_{n=1}^{N} \ell\big(f_\theta(x_n), y_n\big),$$

the average of the loss $\ell$ of the model $f_\theta$ over the $N$ training examples $(x_n, y_n)$.
We will do this with (a variant of) gradient descent. The most basic version of gradient descent looks like this:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \hat{R}(\theta_t),$$

where I denote by $\theta_t$ the parameters after $t$ update steps and by $\eta$ the learning rate.
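As a minimal sketch, here is what one such update step might look like in JAX. The linear model, squared-error loss and learning rate below are placeholder choices for illustration, not anything fixed by the discussion above:

```python
import jax
import jax.numpy as jnp

def empirical_risk(theta, inputs, targets):
    # placeholder model (linear) and loss (squared error), standing in for f_theta and l
    predictions = inputs @ theta
    return jnp.mean((predictions - targets) ** 2)

def gradient_descent_step(theta, inputs, targets, learning_rate=0.1):
    # theta_{t+1} = theta_t - eta * grad R(theta_t)
    grads = jax.grad(empirical_risk)(theta, inputs, targets)
    return theta - learning_rate * grads
```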
Recall the functional form of a multilayer perceptron:

$$f_\theta(x) = f_L\big(f_{L-1}(\cdots f_2(f_1(x))\cdots)\big).$$

The function's output is a function of a function of a function of ... the input: a composition of $L$ simpler functions, where each $f_l\colon \mathbb{R}^{d_{l-1}} \to \mathbb{R}^{d_l}$ is a single layer, typically an affine map followed by an elementwise nonlinearity.
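For concreteness, such a composition might be written as follows; the tanh nonlinearity and the list-of-(W, b)-pairs parameterisation are illustrative assumptions, not part of the derivation:

```python
import jax.numpy as jnp

def mlp(params, x):
    # params is a list of (W, b) pairs, one per layer;
    # the output is f_L(f_{L-1}(... f_1(x) ...))
    *hidden, (W_last, b_last) = params
    for W, b in hidden:
        x = jnp.tanh(W @ x + b)   # each f_l: an affine map followed by a nonlinearity
    return W_last @ x + b_last    # final layer left linear
```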
How do I calculate the derivative, or in general, the Jacobian, of such a nested function with respect to its input $x$?
Recall the chain rule of differentiation. The Jacobian can be written as a product of matrices (or tensors, in the general case) like so:

$$\frac{\partial f_\theta(x)}{\partial x} = J_L \, J_{L-1} \cdots J_2 \, J_1.$$
This notation hides a lot of detail. Here is what I mean by each term:

$$J_l = \frac{\partial f_l(x_{l-1})}{\partial x_{l-1}} \bigg\rvert_{x_{l-1} = f_{l-1}(\cdots f_1(x)\cdots)},$$

which is the Jacobian of the function $f_l$ with respect to its input: a $d_l \times d_{l-1}$ matrix evaluated at $x_{l-1}$, the activation produced by the previous layer.
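To make this concrete, here is a small sketch of one such per-layer Jacobian computed with jax.jacobian; the single tanh layer and its dimensions are made up for illustration:

```python
import jax
import jax.numpy as jnp

W = 0.1 * jnp.ones((3, 2))
layer = lambda x: jnp.tanh(W @ x)     # one layer f_l mapping R^2 -> R^3

x_prev = jnp.array([1.0, -1.0])       # activation coming out of the previous layer
J_l = jax.jacobian(layer)(x_prev)     # Jacobian of f_l, evaluated at x_prev
print(J_l.shape)                      # (3, 2)
```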
Let's assume for a second we have calculated each of these Jacobians. What's left to do is to multiply them together. Due to the associative property of matrix multiplication, I can go about doing this in various orders.
I can start at the input end and work my way forward, like so:

$$J_L \Big( J_{L-1} \big( \cdots (J_2 \, J_1) \cdots \big) \Big)$$
Or, I could start at the output, and work my way backwards like so:

$$\Big( \big( \cdots (J_L \, J_{L-1}) \cdots \big) J_2 \Big) J_1$$
Or, I can choose any arbitrary, funky way like this:

$$\big( J_L \, J_{L-1} \big) \Big( \big( J_{L-2} \cdots J_2 \big) J_1 \Big)$$
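A quick numerical sketch of the associativity claim, with four random matrices standing in for the Jacobians of a hypothetical 4-layer function (the dimensions are chosen arbitrarily):

```python
import jax
import jax.numpy as jnp

# Jacobians of a made-up function with dimensions 4 -> 8 -> 8 -> 8 -> 2
keys = jax.random.split(jax.random.PRNGKey(0), 4)
J1 = jax.random.normal(keys[0], (8, 4))
J2 = jax.random.normal(keys[1], (8, 8))
J3 = jax.random.normal(keys[2], (8, 8))
J4 = jax.random.normal(keys[3], (2, 8))

forward = J4 @ (J3 @ (J2 @ J1))   # start at the input end
reverse = ((J4 @ J3) @ J2) @ J1   # start at the output end
mixed   = (J4 @ J3) @ (J2 @ J1)   # an arbitrary, funky grouping

# same matrix either way, only the cost of computing it differs
print(jnp.allclose(forward, reverse, atol=1e-4), jnp.allclose(forward, mixed, atol=1e-4))
```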
These three options are called, in the order they were presented, forward-mode, reverse-mode, and mixed-mode automatic differentiation, autodiff for short.
Let's look at how many operations it takes to evaluate the chain rule in forward-mode. I start by multiplying $J_2$ by $J_1$, which costs $d_2 \, d_1 \, d_0$ floating point multiplications and produces a $d_2 \times d_0$ matrix. Multiplying $J_3$ by this result costs $d_3 \, d_2 \, d_0$ operations, and so on: every intermediate product keeps the input dimension $d_0$ as its number of columns.
Following this logic, we can see that the number of floating point operations required for forward-mode autodiff is:

$$d_0 \sum_{l=2}^{L} d_l \, d_{l-1}.$$
Reverse-mode autodiff multiplies the Jacobians in this order:

$$\Big( \big( \cdots (J_L \, J_{L-1}) \cdots \big) J_2 \Big) J_1.$$

The cost of this, following an argument similar to before, is:

$$d_L \sum_{l=1}^{L-1} d_l \, d_{l-1}.$$
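A tiny sketch that evaluates the two operation counts above for a made-up set of layer dimensions (wide input, scalar output), just to show how lopsided the comparison becomes:

```python
def forward_mode_flops(dims):
    # dims = [d_0, d_1, ..., d_L]; cost of J_L(...(J_2 J_1)) = d_0 * sum_{l=2}^{L} d_l d_{l-1}
    return dims[0] * sum(dims[l] * dims[l - 1] for l in range(2, len(dims)))

def reverse_mode_flops(dims):
    # cost of ((J_L J_{L-1})...) J_1 = d_L * sum_{l=1}^{L-1} d_l d_{l-1}
    return dims[-1] * sum(dims[l] * dims[l - 1] for l in range(1, len(dims) - 1))

dims = [1000, 500, 500, 500, 1]   # d_0 = 1000 inputs, three hidden layers, scalar output
print(forward_mode_flops(dims))   # 500_500_000
print(reverse_mode_flops(dims))   # 1_000_000
```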
The different automatic differentiation methods also differ in their memory cost. Recall what each factor in the product means: these are Jacobians which have to be evaluated at the intermediate activations $x_{l-1} = f_{l-1}(\cdots f_1(x)\cdots)$, so before we can form $J_l$ we first need the output of the preceding $l-1$ layers.
When running forward-mode autodiff, we are computing these in the same order as we would normally compute the output of the function, so all we have to keep in memory is the current activation $x_l$ and the running product $J_l \, J_{l-1} \cdots J_1$, a $d_l \times d_0$ matrix. The overall memory budget we need is therefore:

$$\mathcal{O}\Big( d_0 \max_l d_l \Big).$$
In reverse-mode, we start by computing the last Jacobian first. To calculate $J_L$ we already need $x_{L-1}$, the activation of the penultimate layer, so we have to run a full forward pass and store all of the intermediate activations before we can multiply a single pair of Jacobians. The memory cost is therefore:

$$\mathcal{O}\Big( \sum_l d_l + d_L \max_l d_l \Big),$$

with the first term accounting for the stored activations and the second for the running product of Jacobians.
Which one of these terms dominates depends on whether the function is deeper (i.e. $L$ is large, so there are many activations to store) or wider (i.e. the individual $d_l$ are large, so the running Jacobian product is the bigger object).
In conclusion, if we have to calculate a Jacobian, we have several options for how to go about it. The computational and memory cost of forward-mode, reverse-mode (and mixed-mode) will differ substantially depending on the dimensionality of the various layers in the composite function.
In deep learning, we generally want to calculate the gradient of a scalar function (the empirical risk), thus the output dimension is $d_L = 1$ and reverse-mode is by far the cheaper option in terms of computation, at the price of having to store the intermediate activations.
In neural networks, reverse-mode automatic differentiation is called backpropagation.
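In JAX terms, jax.grad is exactly this: reverse-mode differentiation of a scalar-valued function. A minimal check, with a toy scalar function standing in for the empirical risk:

```python
import jax
import jax.numpy as jnp

def loss(theta):
    # toy scalar objective standing in for the empirical risk
    return jnp.sum(jnp.tanh(theta) ** 2)

theta = jnp.arange(5.0)
# jax.grad is reverse-mode; for a scalar output it agrees with the one-row Jacobian from jax.jacrev
print(jnp.allclose(jax.grad(loss)(theta), jax.jacrev(loss)(theta)))  # True
```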
On rare occasions we want to calculate things other than the gradient of a scalar function. We can use automatic differentiation creatively to do so.
The Hessian of a function is the matrix of its second partial derivatives:

$$H_{ij}(\theta) = \frac{\partial^2 f(\theta)}{\partial \theta_i \, \partial \theta_j}.$$
If we have access to the gradient function, we can of course calculate this by running automatic differentiation twice. The first time, we calculate the gradient of a scalar function, so backpropagation (reverse-mode) is optimal. The gradient, however, is a function whose output has the same dimensionality as its input. So to calculate the Jacobian of the gradient, forward-mode tends to be faster.
So, if we want to compute a Hessian, running forward-mode automatic differentiation on reverse-mode automatic differentiation is a good algorithm.
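This forward-over-reverse composition is easy to spell out in JAX; the toy loss below is just a stand-in for whatever scalar function you care about:

```python
import jax
import jax.numpy as jnp

def hessian(f):
    # differentiate the reverse-mode gradient once more, this time in forward mode
    return jax.jacfwd(jax.jacrev(f))

def loss(theta):
    return jnp.sum(jnp.tanh(theta) ** 2)   # toy scalar function

theta = jnp.arange(3.0)
print(hessian(loss)(theta).shape)          # (3, 3) matrix of second derivatives
```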
It's rarely the case that we want to calculate the full Hessian in deep learning, partly because the Hessian is enormous: its size is quadratic in the number of parameters, so we may not even be able to store it.
What's often sufficient when we want to do something with the Hessian is to calculate the product of the Hessian with a vector $v$:

$$H v = \frac{\partial \big( \nabla_\theta f(\theta) \big)}{\partial \theta} \, v,$$

which is nothing more than the Jacobian of the gradient function multiplied by $v$: exactly the kind of Jacobian-vector product that forward-mode autodiff computes cheaply. This suggests an algorithm where we run automatic differentiation twice. First, reverse-mode to calculate the gradient of $f$. Then, a single forward-mode pass through the gradient function, pushing $v$ through it, which yields the Hessian-vector product without ever materialising the Hessian.
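Sketched in JAX, under the same assumption of a toy scalar loss, the whole algorithm is essentially one line: a forward-mode JVP through the reverse-mode gradient function:

```python
import jax
import jax.numpy as jnp

def hvp(f, theta, v):
    # forward-mode JVP of the reverse-mode gradient: returns H(theta) @ v
    # without ever materialising the Hessian
    return jax.jvp(jax.grad(f), (theta,), (v,))[1]

def loss(theta):
    return jnp.sum(jnp.tanh(theta) ** 2)   # toy scalar function

theta = jnp.arange(3.0)
v = jnp.ones(3)
print(hvp(loss, theta, v))                 # a vector the size of theta
```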
I recommend playing with the JAX framework, which implements automatic differentiation in a very clear and transparent way. In particular, I find this Autodiff Cookbook very good at illustrating some of the things I covered here, while adding a little more colour to the discussion of autodiff, Jacobian-vector products (JVPs) and vector-Jacobian products (VJPs).
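If you want to poke at those two primitives directly, here is a minimal sketch; the toy function is arbitrary:

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sin(x) * x     # arbitrary vector-to-vector toy function
x = jnp.arange(3.0)
v = jnp.ones(3)

# JVP: push a tangent vector forward through the Jacobian (the forward-mode primitive)
y, jvp_out = jax.jvp(f, (x,), (v,))

# VJP: pull a cotangent vector backward through the Jacobian (the reverse-mode primitive)
y, vjp_fun = jax.vjp(f, x)
(vjp_out,) = vjp_fun(v)

print(jvp_out, vjp_out)
```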
Of course, other frameworks, like PyTorch's autograd, use largely the same algorithms to calculate gradients, so it's also a good idea to spend some time understanding how it's done under the hood there. It's worth knowing how the computational graph gets built, and how you can change the way gradients are passed around if you want to implement something more complex.