# Autodiff and Backpropagation

written by @marc_lelarge (part of the deep learning course)

## Jacobian

Let $f: \mathbb{R}^n \to \mathbb{R}^m$, we define its Jacobian as:

$$
\frac{\partial f}{\partial x} = J_f(x) = \begin{pmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n}\\
\vdots & \ddots & \vdots\\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{pmatrix}
= \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right)
= \begin{pmatrix} \nabla f_1(x)^T \\ \vdots \\ \nabla f_m(x)^T \end{pmatrix}
$$

Hence the Jacobian $J_f(x) \in \mathbb{R}^{m\times n}$ is a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$ such that for $x, v \in \mathbb{R}^n$ and $h \in \mathbb{R}$:

$$
f(x+hv) = f(x) + h J_f(x) v + o(h).
$$

The term $J_f(x) v \in \mathbb{R}^m$ is a Jacobian Vector Product (JVP), corresponding to the interpretation where the Jacobian is the linear map $J_f(x): \mathbb{R}^n \to \mathbb{R}^m$, where $J_f(x)(v) = J_f(x) v$.
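
As a quick sanity check (a sketch not in the original notes; the toy function and its hand-coded Jacobian are my own), the first-order expansion above can be verified numerically:

```python
import numpy as np

def f(x):
    # toy function f: R^2 -> R^2
    return np.array([x[0] * x[1], np.sin(x[0])])

def J_f(x):
    # hand-coded Jacobian of f, shape (2, 2)
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 0.0]])

x = np.array([0.3, -1.2])
v = np.array([1.0, 0.5])
h = 1e-6

jvp = J_f(x) @ v                        # Jacobian Vector Product J_f(x) v
fd = (f(x + h * v) - f(x)) / h          # finite-difference approximation
print(np.allclose(jvp, fd, atol=1e-4))  # True: f(x+hv) ≈ f(x) + h J_f(x) v
```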

## Chain composition

In machine learning, we are computing the gradient of the loss function with respect to the parameters. In particular, even if the parameters are high-dimensional, the loss itself is a real number. Hence, consider a real-valued function

$$
f: \mathbb{R}^n \stackrel{g_1}{\longrightarrow} \mathbb{R}^m \stackrel{g_2}{\longrightarrow} \mathbb{R}^d \stackrel{h}{\longrightarrow} \mathbb{R},
$$

so that $f(x) = h(g_2(g_1(x))) \in \mathbb{R}$. We have

$$
\underbrace{\nabla f(x)}_{n\times 1} = \underbrace{J_{g_1}(x)^T}_{n\times m}\,\underbrace{J_{g_2}(g_1(x))^T}_{m\times d}\,\underbrace{\nabla h(g_2(g_1(x)))}_{d\times 1}.
$$

To do this computation, if we start from the right, we first compute a matrix-vector product to obtain a vector (of size $m$) and then need another matrix-vector product, resulting in $O(nm + md)$ operations. If we start from the left with the matrix-matrix multiplication, we get $O(nmd + nd)$ operations. Hence we see that as soon as the dimensions $m$ and $d$ are large, starting from the right is much more efficient. Note however that doing the computation from the right to the left requires keeping in memory the values of $g_1(x) \in \mathbb{R}^m$ and $x \in \mathbb{R}^n$.
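
To illustrate the two evaluation orders (a sketch, not from the original notes; the dimensions and random Jacobians are arbitrary), the snippet below checks that both orders give the same gradient, while the right-to-left order only ever forms vectors:

```python
import numpy as np

n, m, d = 1000, 500, 800
rng = np.random.default_rng(0)

J_g1 = rng.standard_normal((m, n))   # Jacobian of g1 at x, shape (m, n)
J_g2 = rng.standard_normal((d, m))   # Jacobian of g2 at g1(x), shape (d, m)
grad_h = rng.standard_normal(d)      # gradient of h, shape (d,)

# right to left: two matrix-vector products, O(md + nm)
grad_right = J_g1.T @ (J_g2.T @ grad_h)

# left to right: a matrix-matrix product first, O(nmd + nd)
grad_left = (J_g1.T @ J_g2.T) @ grad_h

print(np.allclose(grad_right, grad_left))  # True, but the right-to-left order is much cheaper
```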

Backpropagation is an efficient algorithm computing the gradient "from the right to the left", i.e. backward. In particular, we will need to compute quantities of the form:

$$
J_f(x)^T u \in \mathbb{R}^n \text{ with } u \in \mathbb{R}^m,
$$

which can be rewritten as $u^T J_f(x)$, which is a Vector Jacobian Product (VJP), corresponding to the interpretation where the Jacobian is the linear map $J_f(x): \mathbb{R}^n \to \mathbb{R}^m$ composed with the linear map $u^T: \mathbb{R}^m \to \mathbb{R}$, so that $u^T J_f(x) = u^T \circ J_f(x)$.
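
Spelling out the composition from the previous section, the gradient $\nabla f(x)$ is obtained by propagating a single vector through successive VJPs:

$$
u_0 = \nabla h(g_2(g_1(x))) \in \mathbb{R}^d,\qquad
u_1 = J_{g_2}(g_1(x))^T u_0 \in \mathbb{R}^m,\qquad
u_2 = J_{g_1}(x)^T u_1 = \nabla f(x) \in \mathbb{R}^n,
$$

which is exactly the right-to-left evaluation order discussed above.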

Example: let $f(x, W) = xW \in \mathbb{R}^b$ where $W \in \mathbb{R}^{a\times b}$ and $x \in \mathbb{R}^a$. We clearly have

$$
J_f(x) = W^T.
$$

Note that here, we are slightly abusing notation and considering the partial function $x \mapsto f(x, W)$. To see this, we can write $f_j = \sum_i x_i W_{ij}$ so that

$$
\frac{\partial f}{\partial x_i} = (W_{i1}, \dots, W_{ib})^T.
$$

Then recall from the definitions that

$$
J_f(x) = \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_a}\right) = W^T.
$$

Now we clearly have $J_f(W) = x$ since $f(x, W + \Delta W) = f(x, W) + x \Delta W$.

Note that multiplying $x$ by $W$ on the right (i.e. writing $xW$ rather than $Wx$) is actually convenient when using broadcasting: we can take a batch of input vectors of shape $bs \times a$ without modifying the math above.
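
A minimal NumPy check of this example (a sketch, not from the notes; the VJP formulas $J_f(x)^T u = W u$ and $J_f(W)^T u = x u^T$, the outer product, follow from the Jacobians above):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 4, 3
x = rng.standard_normal(a)
W = rng.standard_normal((a, b))
u = rng.standard_normal(b)          # incoming gradient, same shape as f(x, W)

def f(x, W):
    return x @ W                    # f(x, W) = xW, shape (b,)

# VJP with respect to x: J_f(x)^T u = W u, shape (a,)
vjp_x = W @ u
# VJP with respect to W: J_f(W)^T u = outer(x, u), shape (a, b)
vjp_W = np.outer(x, u)

# check both against directional derivatives: <u, J v> = <J^T u, v>
eps = 1e-6
dx = rng.standard_normal(a)
dW = rng.standard_normal((a, b))
dir_x = (f(x + eps * dx, W) - f(x, W)) / eps       # ≈ J_f(x) dx
dir_W = (f(x, W + eps * dW) - f(x, W)) / eps       # ≈ J_f(W) dW = x dW
print(np.allclose(u @ dir_x, vjp_x @ dx, atol=1e-4))           # True
print(np.allclose(u @ dir_W, np.sum(vjp_W * dW), atol=1e-4))   # True
```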

## Implementation

In PyTorch, `torch.autograd` provides classes and functions implementing automatic differentiation of arbitrary scalar-valued functions. To create a custom `autograd.Function`, subclass this class and implement the `forward()` and `backward()` static methods. Here is an example:

```python
from torch.autograd import Function

class Exp(Function):

    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result

# Use it by calling the apply method:
output = Exp.apply(input)
```
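
As a quick usage check (a sketch, not from the notes), the `backward` of `Exp` can be compared against the analytic derivative, since the gradient of `exp` is `exp` itself:

```python
import torch

x = torch.randn(3, dtype=torch.double, requires_grad=True)
y = Exp.apply(x)
y.sum().backward()
print(torch.allclose(x.grad, x.exp()))  # True: d/dx exp(x) = exp(x)
```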

You can have a look at Module 2b to learn more about this approach, as well as the MLP from scratch.

## Backprop the functional way

Here we will implement in NumPy a different approach, mimicking the functional approach of JAX; see The Autodiff Cookbook.

Each function will take 2 arguments: one being the input $x$ and the other being the parameters $w$. For each function, we build 2 vjp functions taking as argument a gradient $u$ and corresponding to $J_f(x)$ and $J_f(w)$, so that these functions return $J_f(x)^T u$ and $J_f(w)^T u$ respectively. To summarize, for $x \in \mathbb{R}^n$, $w \in \mathbb{R}^d$, and $f(x, w) \in \mathbb{R}^m$:

$$
\begin{aligned}
\mathrm{vjp}_x(u) &= J_f(x)^T u, &\text{with } J_f(x) \in \mathbb{R}^{m\times n},\ u \in \mathbb{R}^m,\\
\mathrm{vjp}_w(u) &= J_f(w)^T u, &\text{with } J_f(w) \in \mathbb{R}^{m\times d},\ u \in \mathbb{R}^m.
\end{aligned}
$$

Then backpropagation is simply done by first computing the gradient of the loss and then composing the vjp functions in the right order.
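
As a minimal sketch of this approach (the function names and the tiny model below are my own illustration, not the notes' code), here is a linear layer followed by a squared loss, differentiated by composing vjp functions:

```python
import numpy as np

# linear layer f(x, w) = x @ w, with w of shape (a, b)
def linear(x, w):
    return x @ w

def linear_vjp_x(u, x, w):
    # J_f(x)^T u = w @ u, see the example above
    return w @ u

def linear_vjp_w(u, x, w):
    # J_f(w)^T u = outer(x, u)
    return np.outer(x, u)

# squared loss h(z) = 0.5 * ||z - y||^2 and its gradient
def loss(z, y):
    return 0.5 * np.sum((z - y) ** 2)

def loss_grad(z, y):
    return z - y

rng = np.random.default_rng(0)
a, b = 4, 3
x = rng.standard_normal(a)
w = rng.standard_normal((a, b))
y = rng.standard_normal(b)

# forward pass: keep the intermediate values in memory
z = linear(x, w)
L = loss(z, y)

# backward pass: start from the gradient of the loss,
# then compose the vjp functions from right to left
u = loss_grad(z, y)              # dL/dz, shape (b,)
grad_x = linear_vjp_x(u, x, w)   # dL/dx, shape (a,)
grad_w = linear_vjp_w(u, x, w)   # dL/dw, shape (a, b)

# finite-difference check of grad_w
eps = 1e-6
dw = rng.standard_normal((a, b))
fd = (loss(linear(x, w + eps * dw), y) - L) / eps
print(np.allclose(fd, np.sum(grad_w * dw), atol=1e-4))  # True
```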

## Code

tags: public dataflowr jax autodiff