
Automatic differentiation

Background: gradient-based optimisation

To train a neural network, we would like to minimise the empirical risk:

$$\hat{R}(w, \mathcal{D}) = \sum_{n=1}^{N} \ell\left(y_n, \hat{y}(x_n, w)\right)$$

We will do this with (a variant of) gradient descent. The most basic version of gradient descent looks like this:

$$w_{t+1} = w_t - \eta\, \nabla_w \hat{R}(w_t),$$

where I denote by $\nabla_w \hat{R}(w_t)$ the gradient of the empirical risk with respect to $w$, evaluated at $w_t$. So, in order to run gradient descent, we have to be able to calculate gradients.
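To make this concrete, here is a minimal JAX sketch of the update rule above, assuming a linear model and a squared-error loss averaged over the data; the names `risk`, `xs` and `ys`, and the hyperparameters, are illustrative choices of mine rather than anything prescribed by the discussion.

```python
import jax
import jax.numpy as jnp

# A minimal sketch of gradient descent, assuming a linear model w^T x and a
# squared-error loss averaged over the data; names and hyperparameters are illustrative.
def risk(w, xs, ys):
    preds = xs @ w                        # \hat{y}(x_n, w) for every data point
    return jnp.mean((ys - preds) ** 2)    # empirical risk \hat{R}(w, D)

key = jax.random.PRNGKey(0)
xs = jax.random.normal(key, (100, 5))
ys = xs @ jnp.arange(5.0)                 # synthetic targets from a known weight vector

w = jnp.zeros(5)                          # w_0
eta = 0.1                                 # learning rate \eta
for t in range(500):
    w = w - eta * jax.grad(risk)(w, xs, ys)   # w_{t+1} = w_t - \eta \nabla_w R(w_t)
```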

Recall the functional form of a multilayer perceptron:

$$\hat{y}(x) = \phi\left(b_L + W_L\, \phi\left(b_{L-1} + W_{L-1}\, \phi\left(\cdots \phi(b_1 + W_1 x)\cdots\right)\right)\right)$$

The function's output is a function of a function of a ... function of $W_1$. For the purposes of this note, we are going to ignore the specifics and just look at how to calculate Jacobians of functions with this kind of compositional structure. So consider instead, generally, trying to compute the Jacobian of a function of the following form:

$$f_L(f_{L-1}(\cdots f_1(w) \cdots)),$$

where each $f_l: \mathbb{R}^{d_{l-1}} \rightarrow \mathbb{R}^{d_l}$ is a differentiable function that acts on $d_{l-1}$-dimensional vectors and outputs a $d_l$-dimensional vector.
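As a concrete running example, here is what such a composite function might look like in JAX; the layer dimensions and the choice of $\tanh$ layers are made up purely for illustration.

```python
import jax
import jax.numpy as jnp

# A toy composite function f_L(f_{L-1}(... f_1(w) ...)) with made-up dimensions
# d_0, ..., d_L; each f_l maps a d_{l-1}-dimensional vector to a d_l-dimensional one.
dims = [4, 8, 8, 3]                    # d_0 = 4, d_1 = 8, d_2 = 8, d_3 = 3 (L = 3)
keys = jax.random.split(jax.random.PRNGKey(0), len(dims) - 1)
weights = [jax.random.normal(k, (d_out, d_in))
           for k, d_in, d_out in zip(keys, dims[:-1], dims[1:])]

def f(w):
    a = w
    for W in weights:
        a = jnp.tanh(W @ a)            # one differentiable layer f_l
    return a

print(f(jnp.ones(dims[0])).shape)      # (3,), i.e. a d_L-dimensional output
```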

How do I calculate the derivative, or, more generally, the Jacobian, of such a nested function with respect to $w$?

The chain rule and automatic differentiation

Recall the chain rule of differentiation. The Jacobian can be written as a product of matrices (or tensors, in the general case) like so:

$$\frac{\partial f_L}{\partial w} = \frac{\partial f_L}{\partial f_{L-1}} \frac{\partial f_{L-1}}{\partial f_{L-2}} \frac{\partial f_{L-2}}{\partial f_{L-3}} \cdots \frac{\partial f_3}{\partial f_2} \frac{\partial f_2}{\partial f_1} \frac{\partial f_1}{\partial w}.$$

This notation hides a lot of detail. Here is what I mean by each term:

$$\frac{\partial f_l}{\partial f_{l-1}} = \left.\frac{\partial f_l(a)}{\partial a}\right|_{a = f_{l-1}(\cdots(f_1(w)))},$$

which is the Jacobian of the function $f_l$ with respect to its input, evaluated at $f_{l-1}(\cdots(f_1(w)))$.

Let's assume for a second we have calculated each of these Jacobians. What's left to do is to multiply them together. Due to the associative property of matrix multiplication, I can go about doing this in various orders.

I can start at the input $w$ and work my way through like this:

$$\frac{\partial f_L}{\partial w} = \frac{\partial f_L}{\partial f_{L-1}} \left( \frac{\partial f_{L-1}}{\partial f_{L-2}} \left( \frac{\partial f_{L-2}}{\partial f_{L-3}} \cdots \left( \frac{\partial f_3}{\partial f_2} \left( \frac{\partial f_2}{\partial f_1} \frac{\partial f_1}{\partial w} \right) \right) \cdots \right) \right)$$

Or, I could start at the output, and work my way backwards like so:

$$\frac{\partial f_L}{\partial w} = \left( \left( \cdots \left( \left( \frac{\partial f_L}{\partial f_{L-1}} \frac{\partial f_{L-1}}{\partial f_{L-2}} \right) \frac{\partial f_{L-2}}{\partial f_{L-3}} \right) \cdots \frac{\partial f_3}{\partial f_2} \right) \frac{\partial f_2}{\partial f_1} \right) \frac{\partial f_1}{\partial w}$$

Or, I can choose any arbitrary, funky way like this:

$$\frac{\partial f_L}{\partial w} = \frac{\partial f_L}{\partial f_{L-1}} \left( \left( \frac{\partial f_{L-1}}{\partial f_{L-2}} \frac{\partial f_{L-2}}{\partial f_{L-3}} \right) \cdots \left( \left( \frac{\partial f_3}{\partial f_2} \right) \frac{\partial f_2}{\partial f_1} \right) \right) \frac{\partial f_1}{\partial w}$$

These three options are called, in the order they were presented

  • forward-mode
  • reverse-mode
  • mixed-mode

automatic differentiation, autodiff for short.
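In JAX, the first two orderings are exposed directly as `jax.jacfwd` and `jax.jacrev`. Here is a minimal sketch on a small composite function (the weights and dimensions are made up); both return the same Jacobian, they just build the matrix product in a different order.

```python
import jax
import jax.numpy as jnp

# Forward- and reverse-mode Jacobians of the same composite function agree;
# the weights and dimensions below are illustrative.
W1 = jnp.array([[1.0, 2.0], [0.5, -1.0], [0.0, 1.0]])   # d_0 = 2 -> d_1 = 3
W2 = jnp.array([[1.0, -1.0, 2.0]])                      # d_1 = 3 -> d_2 = 1

def f(w):
    return jnp.tanh(W2 @ jnp.tanh(W1 @ w))

w = jnp.array([0.3, -0.7])
J_fwd = jax.jacfwd(f)(w)   # forward-mode: builds the product from the input side
J_rev = jax.jacrev(f)(w)   # reverse-mode: builds the product from the output side
print(jnp.allclose(J_fwd, J_rev))   # True
```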

Computational cost of autodiff

Forward-mode

Let's look at how many operations it takes to evaluate the chain rule in forward-mode:

$$\frac{\partial f_L}{\partial w} = \frac{\partial f_L}{\partial f_{L-1}} \left( \frac{\partial f_{L-1}}{\partial f_{L-2}} \left( \frac{\partial f_{L-2}}{\partial f_{L-3}} \cdots \left( \frac{\partial f_3}{\partial f_2} \left( \frac{\partial f_2}{\partial f_1} \frac{\partial f_1}{\partial w} \right) \right) \cdots \right) \right)$$

I start by multiplying $\frac{\partial f_1}{\partial w}$ by $\frac{\partial f_2}{\partial f_1}$. The former is $d_1 \times d_0$, where $d_0$ is the dimensionality of $w$ itself, while $d_1$ is the dimensionality of the output vector after applying $f_1$. Similarly, the other matrix is $d_2 \times d_1$ in size. Multiplying these together takes $d_2 d_1 d_0$ operations. We then end up with a $d_2 \times d_0$ matrix, which we have to multiply from the left by a $d_3 \times d_2$ one. This takes $d_3 d_2 d_0$ operations.

Following this logic we can see that the number of floating point operations required for forward-mode autodiff is:

$$d_0 d_1 d_2 + d_0 d_2 d_3 + \cdots + d_0 d_{L-1} d_L = d_0 \sum_{l=2}^{L} d_l d_{l-1}$$

Reverse-mode

Reverse-mode autodiff multiplies the Jacobians in this order:

$$\frac{\partial f_L}{\partial w} = \left( \left( \cdots \left( \left( \frac{\partial f_L}{\partial f_{L-1}} \frac{\partial f_{L-1}}{\partial f_{L-2}} \right) \frac{\partial f_{L-2}}{\partial f_{L-3}} \right) \cdots \frac{\partial f_3}{\partial f_2} \right) \frac{\partial f_2}{\partial f_1} \right) \frac{\partial f_1}{\partial w}$$

The cost of this, following an argument similar to before, is:

$$d_L d_{L-1} d_{L-2} + d_L d_{L-2} d_{L-3} + \cdots + d_L d_1 d_0 = d_L \sum_{l=0}^{L-2} d_l d_{l+1}$$
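A quick sanity check of the two operation counts, with made-up layer widths, shows how large the gap can get when $d_L \ll d_0$:

```python
# Made-up layer widths d_0, ..., d_L with a scalar output (d_L = 1).
dims = [1000, 500, 500, 500, 1]
L = len(dims) - 1

# Forward-mode: d_0 * sum_{l=2}^{L} d_l d_{l-1}
forward_cost = dims[0] * sum(dims[l] * dims[l - 1] for l in range(2, L + 1))
# Reverse-mode: d_L * sum_{l=0}^{L-2} d_l d_{l+1}
reverse_cost = dims[L] * sum(dims[l] * dims[l + 1] for l in range(0, L - 1))

print(forward_cost, reverse_cost)   # 500500000 vs 1000000 for these widths
```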

Memory cost of autodiff

The different automatic differentiation methods also differ in their memory cost. Recall what each factor in the product means:

$$\frac{\partial f_l}{\partial f_{l-1}} = \left.\frac{\partial f_l(a)}{\partial a}\right|_{a = f_{l-1}(\cdots(f_1(w)))}.$$

These are Jacobians which have to be evaluated at $a = f_{l-1}(\cdots(f_1(w)))$.

Forward-mode

When running forward-mode autodiff, we are computing these in the same order as we would normally compute the output of the function, so all we have to keep in memory is

  • the activation of the current layer, $f_{l-1}(\cdots(f_1(w)))$, which is $d_{l-1}$-dimensional
  • the product of the Jacobians so far, $\frac{\partial f_{l-1}}{\partial w}$, which is $d_{l-1} \times d_0$ in size
  • and the new Jacobian term, which is $d_l \times d_{l-1}$ in size

The overall memory budget we need is therefore:

$$(d_0 + 1)\max_l(d_l) + \max_l(d_l d_{l-1})$$
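Here is a hand-rolled sketch of this accumulation order (this is not how JAX implements forward-mode internally, which propagates Jacobian-vector products instead, but it makes the memory argument visible): the loop only ever holds the current activation and the accumulated Jacobian. The two-layer stack is made up for illustration.

```python
import jax
import jax.numpy as jnp

# Forward-mode accumulation order: keep only the current activation `a` and
# the accumulated Jacobian `J` in memory. The layers are illustrative.
layers = [lambda a: jnp.tanh(jnp.ones((8, 4)) @ a),   # f_1: R^4 -> R^8
          lambda a: jnp.tanh(jnp.ones((3, 8)) @ a)]   # f_2: R^8 -> R^3

def forward_mode_jacobian(w):
    a = w
    J = jnp.eye(w.shape[0])            # dw/dw, a d_0 x d_0 identity
    for f_l in layers:
        J = jax.jacobian(f_l)(a) @ J   # new Jacobian term times the product so far
        a = f_l(a)                     # activation of the current layer
    return J

print(forward_mode_jacobian(jnp.ones(4)).shape)   # (3, 4), i.e. d_L x d_0
```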

Reverse-mode

In reverse-mode, we start by computing the last Jacobian first. To calculate $\frac{\partial f_L}{\partial f_{L-1}}$, we first have to calculate $f_{L-1}(\cdots(f_1(w)))$. However, to compute this, we also had to compute $f_{L-2}(\cdots(f_1(w)))$, $f_{L-3}(\cdots(f_1(w)))$, etc. already. We will need these later, so it would be wasteful to throw them away. In reverse-mode autodiff, we therefore typically store all activations $f_l(\cdots(f_1(w)))$ in memory, which has an additional cost. The total memory cost works out to be:

$$(d_L + 1)\max_l(d_l) + \max_l(d_l d_{l-1}) + \sum_{l=0}^{L-1} d_l$$
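And here is the corresponding hand-rolled reverse-mode sketch: a forward pass that stores every activation, followed by a backward sweep over the stored values. The layer stack is the same illustrative one as before.

```python
import jax
import jax.numpy as jnp

# Reverse-mode accumulation order: store all activations on the way forward,
# then multiply the Jacobians together from the output side. Layers are illustrative.
layers = [lambda a: jnp.tanh(jnp.ones((8, 4)) @ a),   # f_1: R^4 -> R^8
          lambda a: jnp.tanh(jnp.ones((3, 8)) @ a)]   # f_2: R^8 -> R^3

def reverse_mode_jacobian(w):
    activations = [w]
    for f_l in layers:                          # forward pass, storing f_l(...f_1(w))
        activations.append(f_l(activations[-1]))
    J = jnp.eye(activations[-1].shape[0])       # a d_L x d_L identity
    for f_l, a in zip(reversed(layers), reversed(activations[:-1])):
        J = J @ jax.jacobian(f_l)(a)            # product so far times new Jacobian term
    return J

print(reverse_mode_jacobian(jnp.ones(4)).shape)   # (3, 4), i.e. d_L x d_0
```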

Which one of these terms dominates depends on whether the function is deeper (i.e. $L$ is large) or wider (i.e. the $d_l$ are large).

Backprop in deep networks

In conclusion, if we have to calculate a Jacobian, we have several options of how to go about it. The computational and memory cost of forward-mode, reverse-mode (and mixed-mode) will differ substantially depending on the dimensionality of the various layers in the composite function.

In deep learning, we generally want to calculate the gradient of a scalar function (the empirical risk), thus $d_L = 1$. This means that, for calculating gradients, reverse-mode automatic differentiation is almost always the optimal choice from both a computational and a memory cost perspective.

In neural networks, reverse-mode automatic differentiation is called backpropagation.
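In JAX, this is exactly what `jax.grad` does: it computes the gradient of a scalar-valued function using reverse-mode autodiff. A minimal sketch with a made-up two-layer loss:

```python
import jax
import jax.numpy as jnp

# jax.grad differentiates a scalar-valued loss with reverse-mode autodiff
# (backpropagation); the toy loss and parameters are illustrative.
def loss(params, x, y):
    hidden = jnp.tanh(params["W1"] @ x + params["b1"])
    pred = params["W2"] @ hidden + params["b2"]
    return jnp.sum((pred - y) ** 2)

params = {"W1": jnp.ones((8, 4)), "b1": jnp.zeros(8),
          "W2": jnp.ones((1, 8)), "b2": jnp.zeros(1)}
grads = jax.grad(loss)(params, jnp.ones(4), jnp.zeros(1))
print(jax.tree_util.tree_map(jnp.shape, grads))   # gradients mirror the parameter shapes
```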

On rare occasions we want to calculate things other than the gradient of a scalar function. We can use automatic differentiation creatively to do so.

Example: Calculating a Hessian

The Hessian of a function is the matrix of second partial derivatives

$$H(w) = \frac{\partial^2}{\partial w\, \partial w^T} \mathcal{L}(w) = \left[ \frac{\partial^2 \mathcal{L}(w)}{\partial w_i\, \partial w_j} \right]_{i,j=1}^{\dim(w)}$$

If we have access to the gradient function, we can of course calculate this by running automatic differentiation twice. The first time, we calculate the gradient of a scalar function, so backpropagation (reverse-mode) is optimal. The gradient, however, is a function that has the same output dimensionality as its input. So to calculate the gradient of the gradient, forward-mode tends to be faster.

So, if we want to compute a Hessian, running forward-mode automatic differentiation on reverse-mode automatic differentiation is a good algorithm.
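In JAX, this forward-over-reverse composition is just a nesting of the two transformations; the quadratic test function below is only there as an illustrative check.

```python
import jax
import jax.numpy as jnp

# Forward-over-reverse Hessian: forward-mode (jacfwd) applied to the
# reverse-mode gradient (grad). The quadratic test function is illustrative.
def hessian(f):
    return jax.jacfwd(jax.grad(f))

A = jnp.array([[2.0, 1.0], [1.0, 3.0]])

def quad(w):
    return 0.5 * w @ A @ w     # L(w) = 0.5 w^T A w, whose Hessian is A

print(hessian(quad)(jnp.array([1.0, -1.0])))   # recovers A
```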

Example: Hessian-vector product

It's rarely the case that we want to calculate the Hessian in deep learning, partly because the Hessian is enormous, and we may not even be able to store it.

What's often sufficient when we want to do something with the Hessian is to calculate the product of the Hessian with a vector $v$. To do this cleverly, we can notice that:

$$v^T H(w) = \nabla_w\left(v^T \nabla_w \mathcal{L}(w)\right)$$

This suggests an algorithm where we run automatic differentiation twice: first, to calculate the gradient of $\mathcal{L}(w)$; then, to calculate the gradient of $v^T \nabla_w \mathcal{L}(w)$, which is a scalar function. Since in both steps we run autodiff on scalar functions, using backprop twice is the ideal algorithm here.
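In JAX, this double-backprop trick is a couple of lines; the quadratic test function below is again only there to check the result against an explicit matrix-vector product.

```python
import jax
import jax.numpy as jnp

# Hessian-vector product with two reverse-mode passes: the inner grad gives
# \nabla_w L(w); its dot product with v is a scalar, so the outer grad is cheap.
def hvp(f, w, v):
    return jax.grad(lambda w: jnp.vdot(jax.grad(f)(w), v))(w)

A = jnp.array([[2.0, 1.0], [1.0, 3.0]])

def quad(w):
    return 0.5 * w @ A @ w     # L(w) = 0.5 w^T A w, so H(w) = A

w = jnp.array([1.0, -1.0])
v = jnp.array([0.5, 2.0])
print(hvp(quad, w, v))         # equals A @ v for this quadratic
print(A @ v)
```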

Further reading

I recommend playing with the JAX framework, which implements automatic differentiation in a very clear and transparent way. In particular, I find this Autodiff Cookbook very good for illustrating some of the things I covered here, while also adding a little more colour to the discussion of autodiff, Jacobian-vector products (JVPs), and vector-Jacobian products (VJPs).

Of course, other frameworks, like pytorch autograd, use largely the same algorithms to calculate gradients, so it's also a good idea to spend some time understanding how it's done under the hood there. It's worth knowing how the computational graph gets built, and how you can change the way gradients are passed around if you want to implement something more complex.