---
title: DeepNN Notes on Automatic Differentiation
tags: DeepNN, Teaching, Lecture Notes
description: Lecture notes on general algorithms for calculating Jacobians and gradients, to motivate the use of backpropagation in deep learning.
---

# Automatic differentiation

## Background: gradient-based optimisation

To train a neural network, we would like to minimise the empirical risk:

$$
\hat{R}(\mathbf{w}, \mathcal{D}) = \sum_{n=1}^N \ell(y_n, \hat{y}(x_n, \mathbf{w}))
$$

We will do this with (a variant of) gradient descent. The most basic version of gradient descent looks like this:

$$
\mathbf{w}_{t+1} = \mathbf{w}_{t} - \eta \nabla_\mathbf{w} \hat{R}(\mathbf{w}_t),
$$

where I denote by $\nabla_\mathbf{w} \hat{R}(\mathbf{w}_t)$ the gradient of the empirical loss function with respect to $\mathbf{w}$, evaluated at $\mathbf{w}_t$. So, in order to run gradient descent, we have to be able to calculate gradients.

Recall the functional form of a multilayer perceptron:

$$
\small
\hat{y}(x) = \phi\left(b_L + W_L \phi\left(b_{L-1} + W_{L-1} \phi\left( \cdots \phi\left(b_1 + W_1 x\right) \cdots \right)\right)\right)
$$

The function's output is a function of a function of a $\ldots$ function of $W_1$. For the purposes of this note, we are going to ignore the specifics and just look at how to calculate Jacobians of functions with this kind of compositional structure. So consider instead, generally, trying to compute the Jacobian of a function of the following form:

$$
f_L(f_{L-1}(\cdots f_1(\mathbf{w}))),
$$

where each $f_l: \mathbb{R}^{d_{l-1}} \rightarrow \mathbb{R}^{d_{l}}$ is a differentiable function that acts on $d_{l-1}$-dimensional vectors and outputs a $d_{l}$-dimensional vector. How do I calculate the derivative, or more generally the Jacobian, of such a nested function with respect to $\mathbf{w}$?

## The chain rule and automatic differentiation

Recall the chain rule of differentiation. The Jacobian can be written as a product of matrices (or tensors, in the general case) like so:

$$
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \cdots \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}.
$$

This notation hides a lot of detail. Here is what I mean by each term:

$$
\frac{\partial \mathbf{f}_l}{\partial \mathbf{f}_{l-1}} = \left. \frac{\partial \mathbf{f}_l(\mathbf{a})}{\partial \mathbf{a}}\right\vert_{\mathbf{a} = f_{l-1}(\cdots(f_1(\mathbf{w})))},
$$

which is the Jacobian of the function $f_l$ with respect to its input, evaluated at $f_{l-1}(\cdots(f_1(\mathbf{w})))$.

Let's assume for a second that we have calculated each of these Jacobians. What's left to do is to multiply them together. Due to the associative property of matrix multiplication, I can go about doing this in various orders.
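
To see this associativity concretely, here is a minimal JAX sketch. The three layer functions and their dimensions ($d_0 = 4$, $d_1 = 3$, $d_2 = 5$, $d_3 = 2$) are made up purely for illustration; the point is only that each per-layer Jacobian is evaluated at the right intermediate activation, and that two different multiplication orders recover the same Jacobian of the composite function.

```python
import jax
import jax.numpy as jnp

# Made-up layer functions with dimensions d0=4, d1=3, d2=5, d3=2
def f1(w): return jnp.tanh(jnp.arange(12.).reshape(3, 4) @ w)
def f2(a): return jnp.sin(jnp.arange(15.).reshape(5, 3) @ a)
def f3(a): return jnp.cos(jnp.arange(10.).reshape(2, 5) @ a)

def f(w): return f3(f2(f1(w)))  # the composite function

w = jnp.ones(4)

# Per-layer Jacobians, each evaluated at the corresponding intermediate activation
J1 = jax.jacobian(f1)(w)          # d1 x d0
J2 = jax.jacobian(f2)(f1(w))      # d2 x d1
J3 = jax.jacobian(f3)(f2(f1(w)))  # d3 x d2

# By associativity, both multiplication orders give the Jacobian of f
right_to_left = J3 @ (J2 @ J1)
left_to_right = (J3 @ J2) @ J1

print(jnp.allclose(right_to_left, jax.jacobian(f)(w)))  # True
print(jnp.allclose(left_to_right, jax.jacobian(f)(w)))  # True
```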
I can start at the input $\mathbf{w}$ and work my way through like this:

$$
\small
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \left( \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \left( \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \cdots \left( \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \left( \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}} \right) \right) \cdots \right) \right)
$$

Or, I could start at the output, and work my way backwards like so:

$$
\small
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \left( \left( \cdots \left( \left( \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \right) \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \right) \cdots \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \right) \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \right) \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}
$$

Or, I can choose any arbitrary, funky order like this:

$$
\small
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \left( \left( \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \right) \left( \left( \cdots \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \right) \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \right) \right)\frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}
$$

These three options are called, in the order they were presented,

* forward-mode
* reverse-mode
* mixed-mode

automatic differentiation, autodiff for short.

## Computational cost of autodiff

### Forward-mode

Let's look at how many operations it takes to evaluate the chain rule in forward mode:

$$
\small
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \left( \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \left( \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \cdots \left( \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \left( \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}} \right) \right) \cdots \right) \right)
$$

I start by multiplying $\frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}$ by $\frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}}$. The former is $d_1\times d_0$, where $d_0$ is the dimensionality of $\mathbf{w}$ itself, while $d_1$ is the dimensionality of the output vector after applying $f_1$. Similarly, the other matrix is $d_2\times d_1$ in size. Multiplying these together takes $d_2d_1d_0$ operations. We then end up with a $d_2\times d_0$ matrix, which we have to multiply from the left by a $d_3\times d_2$ one. This takes $d_3d_2d_0$ operations.
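
To make this bookkeeping concrete, here is a small Python sketch that tallies the per-step multiplication costs for both orderings, counting the product of an $m\times k$ matrix with a $k\times n$ matrix as $mkn$ multiply-accumulate operations. The layer widths in `dims` are made up for illustration.

```python
dims = [4, 3, 5, 5, 2]  # hypothetical d_0, d_1, ..., d_L for a 4-layer composite

def forward_mode_flops(d):
    # start with J_1 (d_1 x d_0), then left-multiply by J_2, J_3, ... in turn
    flops, rows, cols = 0, d[1], d[0]
    for l in range(2, len(d)):
        flops += d[l] * rows * cols   # (d_l x d_{l-1}) times (d_{l-1} x d_0)
        rows = d[l]                   # running product is now d_l x d_0
    return flops

def reverse_mode_flops(d):
    # start with J_L (d_L x d_{L-1}), then right-multiply by J_{L-1}, J_{L-2}, ...
    flops, rows, cols = 0, d[-1], d[-2]
    for l in range(len(d) - 3, -1, -1):
        flops += rows * cols * d[l]   # (d_L x d_{l+1}) times (d_{l+1} x d_l)
        cols = d[l]                   # running product is now d_L x d_l
    return flops

print(forward_mode_flops(dims))  # 200 for these made-up widths
print(reverse_mode_flops(dims))  # 104
```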
Following this logic, we can see that the number of floating point operations required for forward-mode autodiff is:

$$
d_0d_1d_2 + d_0d_2d_3 + \ldots + d_0d_{L-1}d_L = d_0 \sum_{l=2}^{L}d_ld_{l-1}
$$

### Reverse-mode

Reverse-mode autodiff multiplies the Jacobians in this order:

$$
\small
\frac{\partial \mathbf{f}_L}{\partial \mathbf{w}} = \left( \left( \cdots \left( \left( \frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}} \frac{\partial \mathbf{f}_{L-1}}{\partial \mathbf{f}_{L-2}} \right) \frac{\partial \mathbf{f}_{L-2}}{\partial \mathbf{f}_{L-3}} \right) \cdots \frac{\partial \mathbf{f}_3}{\partial \mathbf{f}_{2}} \right) \frac{\partial \mathbf{f}_2}{\partial \mathbf{f}_{1}} \right) \frac{\partial \mathbf{f}_1}{\partial \mathbf{w}}
$$

The cost of this, following an argument similar to before, is:

$$
\small
d_Ld_{L-1}d_{L-2} + d_{L}d_{L-2}d_{L-3} + \ldots + d_Ld_{1}d_0 = d_L \sum_{l=0}^{L-2}d_ld_{l+1}
$$

## Memory cost of autodiff

The different automatic differentiation methods also differ in their memory cost. Recall what each factor in the product means:

$$
\frac{\partial \mathbf{f}_l}{\partial \mathbf{f}_{l-1}} = \left. \frac{\partial \mathbf{f}_l(\mathbf{a})}{\partial \mathbf{a}}\right\vert_{\mathbf{a} = f_{l-1}(\cdots(f_1(\mathbf{w})))}.
$$

These are Jacobians which have to be evaluated at $\mathbf{a} = f_{l-1}(\cdots(f_1(\mathbf{w})))$.

### Forward-mode

When running forward-mode autodiff, we compute these in the same order as we would normally compute the output of the function, so all we have to keep in memory is

* the activation of the current layer, $f_{l-1}(\cdots(f_1(\mathbf{w})))$, which is $d_{l-1}$-dimensional,
* the product of the Jacobians so far, $\frac{\partial \mathbf{f}_{l-1}}{\partial \mathbf{w}}$, which is $d_{l-1} \times d_0$ in size,
* and the new Jacobian term, which is $d_{l} \times d_{l-1}$ in size.

The overall memory budget we need is therefore:

$$
(d_0 + 1)\cdot \max(d_l) + \max(d_ld_{l-1})
$$

### Reverse-mode

In reverse mode, we start by computing the last Jacobian first. To calculate $\frac{\partial \mathbf{f}_L}{\partial \mathbf{f}_{L-1}}$, we first have to calculate $f_{L-1}(\mathbf{w})$. However, to compute this, we also had to compute $f_{L-2}(\mathbf{w})$, $f_{L-3}(\mathbf{w})$, and so on already. We will need these later, so it would be wasteful to throw them away. In reverse-mode autodiff, we therefore typically store all activations $f_{l}(\mathbf{w})$ in memory, which has an additional cost. The total memory cost works out to be:

$$
(d_L + 1)\cdot \max(d_l) + \max(d_ld_{l-1}) + \sum_{l=0}^{L-1} d_l
$$

Which one of these terms dominates depends on whether the function is deeper (i.e. $L$ is large) or wider (i.e. the $d_l$ are large).

## Backprop in deep networks

In conclusion, if we have to calculate a Jacobian, we have several options for how to go about it. The computational and memory costs of forward-mode, reverse-mode (and mixed-mode) autodiff differ substantially depending on the dimensionality of the various layers in the composite function.

In deep learning, we generally want to calculate the gradient of a scalar function (the empirical risk), thus $d_L = 1$. This means that for calculating gradients, reverse-mode automatic differentiation is almost always the optimal choice from a computational cost perspective. In neural networks, reverse-mode automatic differentiation is called backpropagation.

On rare occasions we want to calculate things other than the gradient of a scalar function. We can use automatic differentiation creatively to do so.
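
Before turning to those examples, here is a quick JAX check of the $d_L = 1$ case; the particular loss function and its input dimension are made up, and only the shapes matter. Reverse mode (`jax.grad`, i.e. backpropagation) produces the whole gradient from a single backward pass, whereas forward mode (`jax.jacfwd`) has to propagate one tangent vector per input dimension.

```python
import jax
import jax.numpy as jnp

# A made-up scalar-valued "empirical risk" of a d_0-dimensional parameter vector (d_L = 1)
def loss(w):
    return jnp.sum(jnp.tanh(w) ** 2)

w = jnp.ones(1000)  # d_0 = 1000

# Reverse mode (backpropagation): a single backward pass yields the whole gradient
grad_reverse = jax.grad(loss)(w)

# Forward mode propagates one tangent vector per input dimension (1000 of them here),
# which is why it is the wrong tool when the output is a scalar and d_0 is large
grad_forward = jax.jacfwd(loss)(w)

print(jnp.allclose(grad_reverse, grad_forward))  # True, but reverse mode is far cheaper
```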
### Example: Calculating a Hessian

The Hessian of a function is the matrix of second partial derivatives:

$$
H(\mathbf{w}) = \frac{\partial^2}{\partial\mathbf{w}\partial\mathbf{w}^T} L(\mathbf{w}) = \left[ \begin{matrix} \frac{\partial^2 L(\mathbf{w})}{\partial w_i \partial w_j} \end{matrix} \right]_{i,j = 1\ldots \dim(\mathbf{w})}
$$

If we have access to the gradient function, we can of course calculate this by running automatic differentiation twice. The first time, we calculate the gradient of a scalar function, so backpropagation (reverse-mode) is optimal. The gradient, however, is a function whose output has the same dimensionality as its input. So to calculate the Jacobian of the gradient, forward-mode tends to be faster. So, if we want to compute a Hessian, running forward-mode automatic differentiation on top of reverse-mode automatic differentiation is a good algorithm.

### Example: Hessian-vector product

It's rarely the case that we want to calculate the full Hessian in deep learning, partly because the Hessian is enormous, and we may not even be able to store it. When we want to do something with the Hessian, it is often sufficient to calculate the product of the Hessian with a vector $\mathbf{v}$. To do this cleverly, we can notice that:

$$
\mathbf{v}^TH(\mathbf{w}) = \frac{\partial}{\partial\mathbf{w}} \left( \mathbf{v}^T \frac{\partial}{\partial\mathbf{w}} L(\mathbf{w}) \right)
$$

This suggests an algorithm where we run automatic differentiation twice: first, to calculate the gradient of $L(\mathbf{w})$; then, to calculate the gradient of $\mathbf{v}^T \frac{\partial}{\partial\mathbf{w}} L(\mathbf{w})$, which is a scalar function. Since in both steps we run autodiff on scalar functions, using backprop twice is the ideal algorithm here.

## Further reading

I recommend playing with the [JAX framework](https://github.com/google/jax#jax-autograd-and-xla-), which implements automatic differentiation in a very clear and transparent way. In particular, I find the [Autodiff Cookbook](https://jax.readthedocs.io/en/latest/notebooks/autodiff_cookbook.html) very good at illustrating some of the things I covered here, while also adding a little more colour to the discussion of autodiff and of Jacobian-vector products (JVPs) and vector-Jacobian products (VJPs). Of course, other frameworks, like [pytorch autograd](https://pytorch.org/docs/stable/autograd.html), use largely the same algorithms to calculate gradients, so it's also a good idea to spend some time understanding how it's done under the hood there. It's worth knowing how the computational graph gets built, and how you can change the way gradients are passed around if you want to implement something more complex.
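
As a small companion to the JAX reading above, here is a minimal sketch of the Hessian-vector product trick from the example above, running reverse mode twice. The loss function is made up, and the explicit `jax.hessian` call is only there to verify the result on a problem tiny enough to store the full Hessian.

```python
import jax
import jax.numpy as jnp

# A made-up scalar loss; only its smoothness matters for this illustration
def loss(w):
    return jnp.sum(jnp.tanh(w) ** 2) + 0.5 * jnp.dot(w, w)

def hvp(f, w, v):
    # inner grad: the gradient of f; outer grad: gradient of the scalar v . grad f(w)
    return jax.grad(lambda x: jnp.vdot(jax.grad(f)(x), v))(w)

w = jnp.linspace(-1.0, 1.0, 5)
v = jnp.ones(5)

# Check against the explicit Hessian (only feasible because w is tiny here)
H = jax.hessian(loss)(w)
print(jnp.allclose(hvp(loss, w, v), H @ v))  # True
```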