# Autodiff and Backpropagation

written by @marc_lelarge (part of the deep learning course)

## Jacobian

Let $f: \mathbb{R}^n \to \mathbb{R}^m$, we define its Jacobian as:

$$
\frac{\partial f}{\partial x} = J_f(x) = \begin{pmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n}\\
\vdots & \ddots & \vdots\\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{pmatrix}
= \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right)
= \begin{pmatrix} \nabla f_1(x)^T \\ \vdots \\ \nabla f_m(x)^T \end{pmatrix}
$$

Hence the Jacobian $J_f(x) \in \mathbb{R}^{m\times n}$ is a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$ such that for $x, v \in \mathbb{R}^n$ and $h \in \mathbb{R}$:

$$
f(x+hv) = f(x) + h J_f(x) v + o(h).
$$

The term $J_f(x) v \in \mathbb{R}^m$ is a Jacobian Vector Product (JVP), corresponding to the interpretation where the Jacobian is the linear map $J_f(x): \mathbb{R}^n \to \mathbb{R}^m$, where $J_f(x)(v) = J_f(x) v$.
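
As a quick sanity check (a sketch not in the original notes; the toy function and its hand-coded Jacobian are my own), the first-order expansion above can be verified numerically:

```python
import numpy as np

def f(x):
    # toy function f: R^2 -> R^2
    return np.array([x[0] * x[1], np.sin(x[0])])

def J_f(x):
    # hand-coded Jacobian of f, shape (2, 2)
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 0.0]])

x = np.array([0.3, -1.2])
v = np.array([1.0, 0.5])
h = 1e-6

jvp = J_f(x) @ v                        # Jacobian Vector Product J_f(x) v
fd = (f(x + h * v) - f(x)) / h          # finite-difference approximation
print(np.allclose(jvp, fd, atol=1e-4))  # True: f(x+hv) ≈ f(x) + h J_f(x) v
```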

## Chain composition

In machine learning, we are computing the gradient of the loss function with respect to the parameters. In particular, even if the parameters are high-dimensional, the loss itself is a real number. Hence, consider a real-valued function

$$
f: \mathbb{R}^n \stackrel{g_1}{\longrightarrow} \mathbb{R}^m \stackrel{g_2}{\longrightarrow} \mathbb{R}^d \stackrel{h}{\longrightarrow} \mathbb{R},
$$

so that $f(x) = h(g_2(g_1(x))) \in \mathbb{R}$. We have

$$
\underbrace{\nabla f(x)}_{n\times 1} = \underbrace{J_{g_1}(x)^T}_{n\times m}\,\underbrace{J_{g_2}(g_1(x))^T}_{m\times d}\,\underbrace{\nabla h(g_2(g_1(x)))}_{d\times 1}.
$$

To do this computation, if we start from the right, we first compute a matrix-vector product to obtain a vector (of size $m$) and then need another matrix-vector product, resulting in $O(nm + md)$ operations. If we start from the left with the matrix-matrix multiplication, we get $O(nmd + nd)$ operations. Hence we see that as soon as the dimensions $m$ and $d$ are large, starting from the right is much more efficient. Note however that doing the computation from the right to the left requires keeping in memory the values of $g_1(x) \in \mathbb{R}^m$ and $x \in \mathbb{R}^n$.
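
To illustrate the two evaluation orders (a sketch, not from the original notes; the dimensions and random Jacobians are arbitrary), the snippet below checks that both orders give the same gradient, while the right-to-left order only ever forms vectors:

```python
import numpy as np

n, m, d = 1000, 500, 800
rng = np.random.default_rng(0)

J_g1 = rng.standard_normal((m, n))   # Jacobian of g1 at x, shape (m, n)
J_g2 = rng.standard_normal((d, m))   # Jacobian of g2 at g1(x), shape (d, m)
grad_h = rng.standard_normal(d)      # gradient of h, shape (d,)

# right to left: two matrix-vector products, O(md + nm)
grad_right = J_g1.T @ (J_g2.T @ grad_h)

# left to right: a matrix-matrix product first, O(nmd + nd)
grad_left = (J_g1.T @ J_g2.T) @ grad_h

print(np.allclose(grad_right, grad_left))  # True, but the right-to-left order is much cheaper
```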

Backpropagation is an efficient algorithm computing the gradient "from the right to the left", i.e. backward. In particular, we will need to compute quantities of the form:

$$
J_f(x)^T u \in \mathbb{R}^n \text{ with } u \in \mathbb{R}^m,
$$

which can be rewritten as $u^T J_f(x)$, which is a Vector Jacobian Product (VJP), corresponding to the interpretation where the Jacobian is the linear map $J_f(x): \mathbb{R}^n \to \mathbb{R}^m$ composed with the linear map $u^T: \mathbb{R}^m \to \mathbb{R}$, so that $u^T J_f(x) = u^T \circ J_f(x)$.
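
Spelling out the composition from the previous section, the gradient $\nabla f(x)$ is obtained by propagating a single vector through successive VJPs:

$$
u_0 = \nabla h(g_2(g_1(x))) \in \mathbb{R}^d,\qquad
u_1 = J_{g_2}(g_1(x))^T u_0 \in \mathbb{R}^m,\qquad
u_2 = J_{g_1}(x)^T u_1 = \nabla f(x) \in \mathbb{R}^n,
$$

which is exactly the right-to-left evaluation order discussed above.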

Example: let $f(x, W) = xW \in \mathbb{R}^b$ where $W \in \mathbb{R}^{a\times b}$ and $x \in \mathbb{R}^a$. We clearly have

$$
J_f(x) = W^T.
$$

Note that here, we are slightly abusing notation and considering the partial function $x \mapsto f(x, W)$. To see this, we can write $f_j = \sum_i x_i W_{ij}$ so that

$$
\frac{\partial f}{\partial x_i} = (W_{i1}, \dots, W_{ib})^T.
$$

Then recall from the definitions that

$$
J_f(x) = \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_a}\right) = W^T.
$$

Now we clearly have $J_f(W) = x$ since $f(x, W + \Delta W) = f(x, W) + x \Delta W$.

Note that multiplying $x$ by $W$ on the right (i.e. writing $xW$ rather than $Wx$) is actually convenient when using broadcasting: we can take a batch of input vectors of shape $bs \times a$ without modifying the math above.
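
A minimal NumPy check of this example (a sketch, not from the notes; the VJP formulas $J_f(x)^T u = W u$ and $J_f(W)^T u = x u^T$, the outer product, follow from the Jacobians above):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 4, 3
x = rng.standard_normal(a)
W = rng.standard_normal((a, b))
u = rng.standard_normal(b)          # incoming gradient, same shape as f(x, W)

def f(x, W):
    return x @ W                    # f(x, W) = xW, shape (b,)

# VJP with respect to x: J_f(x)^T u = W u, shape (a,)
vjp_x = W @ u
# VJP with respect to W: J_f(W)^T u = outer(x, u), shape (a, b)
vjp_W = np.outer(x, u)

# check both against directional derivatives: <u, J v> = <J^T u, v>
eps = 1e-6
dx = rng.standard_normal(a)
dW = rng.standard_normal((a, b))
dir_x = (f(x + eps * dx, W) - f(x, W)) / eps       # ≈ J_f(x) dx
dir_W = (f(x, W + eps * dW) - f(x, W)) / eps       # ≈ J_f(W) dW = x dW
print(np.allclose(u @ dir_x, vjp_x @ dx, atol=1e-4))           # True
print(np.allclose(u @ dir_W, np.sum(vjp_W * dW), atol=1e-4))   # True
```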

## Implementation

In PyTorch, `torch.autograd` provides classes and functions implementing automatic differentiation of arbitrary scalar-valued functions. To create a custom `autograd.Function`, subclass this class and implement the `forward()` and `backward()` static methods. Here is an example:

```python
from torch.autograd import Function

class Exp(Function):

    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result

# Use it by calling the apply method:
output = Exp.apply(input)
```
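
As a quick usage check (a sketch, not from the notes), the `backward` of `Exp` can be compared against the analytic derivative, since the gradient of `exp` is `exp` itself:

```python
import torch

x = torch.randn(3, dtype=torch.double, requires_grad=True)
y = Exp.apply(x)
y.sum().backward()
print(torch.allclose(x.grad, x.exp()))  # True: d/dx exp(x) = exp(x)
```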

You can have a look at Module 2b to learn more about this approach, as well as the MLP from scratch.

## Backprop the functional way

Here we will implement in NumPy a different approach, mimicking the functional approach of JAX; see The Autodiff Cookbook.

Each function will take 2 arguments: one being the input $x$ and the other being the parameters $w$. For each function, we build 2 vjp functions taking as argument a gradient $u$ and corresponding to $J_f(x)$ and $J_f(w)$, so that these functions return $J_f(x)^T u$ and $J_f(w)^T u$ respectively. To summarize, for $x \in \mathbb{R}^n$, $w \in \mathbb{R}^d$, and $f(x, w) \in \mathbb{R}^m$:

$$
\begin{aligned}
\mathrm{vjp}_x(u) &= J_f(x)^T u, &\text{with } J_f(x) \in \mathbb{R}^{m\times n},\ u \in \mathbb{R}^m,\\
\mathrm{vjp}_w(u) &= J_f(w)^T u, &\text{with } J_f(w) \in \mathbb{R}^{m\times d},\ u \in \mathbb{R}^m.
\end{aligned}
$$

Then backpropagation is simply done by first computing the gradient of the loss and then composing the vjp functions in the right order.
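
As a minimal sketch of this approach (the function names and the tiny model below are my own illustration, not the notes' code), here is a linear layer followed by a squared loss, differentiated by composing vjp functions:

```python
import numpy as np

# linear layer f(x, w) = x @ w, with w of shape (a, b)
def linear(x, w):
    return x @ w

def linear_vjp_x(u, x, w):
    # J_f(x)^T u = w @ u, see the example above
    return w @ u

def linear_vjp_w(u, x, w):
    # J_f(w)^T u = outer(x, u)
    return np.outer(x, u)

# squared loss h(z) = 0.5 * ||z - y||^2 and its gradient
def loss(z, y):
    return 0.5 * np.sum((z - y) ** 2)

def loss_grad(z, y):
    return z - y

rng = np.random.default_rng(0)
a, b = 4, 3
x = rng.standard_normal(a)
w = rng.standard_normal((a, b))
y = rng.standard_normal(b)

# forward pass: keep the intermediate values in memory
z = linear(x, w)
L = loss(z, y)

# backward pass: start from the gradient of the loss,
# then compose the vjp functions from right to left
u = loss_grad(z, y)              # dL/dz, shape (b,)
grad_x = linear_vjp_x(u, x, w)   # dL/dx, shape (a,)
grad_w = linear_vjp_w(u, x, w)   # dL/dw, shape (a, b)

# finite-difference check of grad_w
eps = 1e-6
dw = rng.standard_normal((a, b))
fd = (loss(linear(x, w + eps * dw), y) - L) / eps
print(np.allclose(fd, np.sum(grad_w * dw), atol=1e-4))  # True
```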

## Code

tags: public dataflowr jax autodiff