Yaae: Yet another autodiff engine (written in Numpy)
- In this blog post, we will have an in-depth review of what automatic differentiation is and how to write one in pure NumPy.
- Please refer to the Yaae repository for the full code implementation.
- I mostly relied on Wikipedia to write this blog post, as it provides, in my opinion, an excellent explanation of what automatic differentiation is.
- Also, if you want to read other posts, feel free to check them out on my blog.
I) Introduction
- Automatic differentiation (AD), also called algorithmic/computational differentiation, is a set of techniques to numerically differentiate a function through the use of exact formulas and floating-point values. The resulting numerical value involves no approximation error.
- AD is different from:
    - numerical differentiation:
        - can introduce round-off errors (see the short sketch right after this list).
    - symbolic differentiation:
        - can lead to inefficient code.
        - can have difficulty converting a computer program into a single expression.
- The goal of AD is not a formula, but a procedure to compute derivatives.
- Both classical methods:
- have problems with calculating higher derivatives, where complexity and errors increase.
- are slow at computing the partial derivatives of a function with respect to many inputs, as is needed for gradient-based optimization algorithms.
- Automatic differentiation solves all of these problems.
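To make the round-off issue concrete, here is a small sketch (not part of Yaae) comparing a forward finite-difference estimate of the derivative of $\sin$ at $x = 1$ with the exact value $\cos(1)$; as the step $h$ shrinks, the error first decreases and then grows again because of floating-point cancellation:

```python
import numpy as np

x = 1.0
exact = np.cos(x)  # exact derivative of sin at x
for h in [1e-1, 1e-5, 1e-9, 1e-13]:
    fd = (np.sin(x + h) - np.sin(x)) / h  # forward finite difference
    print(f"h={h:.0e}  error={abs(fd - exact):.2e}")
```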
For your information, you will often hear "Autograd" used instead of automatic differentiation; Autograd is actually the name of a particular autodiff package.
- Backpropagation is a special case of autodiff applied to neural networks. But in machine learning, we often use backpropagation synonymously with autodiff.
II) Automatic differentiation modes
- There are two main automatic differentiation modes:
- Forward automatic differentiation.
- Reverse automatic differentiation.
- For each mode, we will have an in-depth explanation and a walkthrough example, but let's first define what the chain rule is.
- The chain rule is a formula to compute the derivative of a composite function.
- For a simple composition:
$$y = f(g(h(x))), \qquad w_1 = h(x), \quad w_2 = g(w_1), \quad y = f(w_2)$$
- Thus, using the chain rule, we have:
$$\frac{dy}{dx} = \frac{dy}{dw_2}\,\frac{dw_2}{dw_1}\,\frac{dw_1}{dx} = \frac{df(w_2)}{dw_2}\,\frac{dg(w_1)}{dw_1}\,\frac{dh(x)}{dx}$$
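For instance, taking $h(x) = x^2$, $g(w_1) = \sin(w_1)$ and $f(w_2) = e^{w_2}$, i.e. $y = e^{\sin(x^2)}$, the chain rule gives:
$$\frac{dy}{dx} = e^{w_2} \cdot \cos(w_1) \cdot 2x = e^{\sin(x^2)}\,\cos(x^2)\,2x$$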
1) Forward automatic differentiation
- Forward automatic differentiation divides the expression into a sequence of differentiable elementary operations. The chain rule and well-known differentiation rules are then applied to each elementary operation.
- In forward AD mode, we traverse the chain rule from right to left. That is, according to our simple composition function, we first compute $\frac{dw_1}{dx}$, then $\frac{dw_2}{dx}$, and finally $\frac{dy}{dx}$.
- Here is the recursive relation (with $w_0 = x$):
$$\frac{dw_i}{dx} = \frac{dw_i}{dw_{i-1}}\,\frac{dw_{i-1}}{dx}$$
- More precisely, we compute the derivative starting from the inside to the outside:
$$\frac{dy}{dx} = \frac{dy}{dw_2}\left(\frac{dw_2}{dw_1}\left(\frac{dw_1}{dx}\right)\right)$$
- Let's consider the following example:
- It can be represented as a computational graph.
- Choosing the input with respect to which we differentiate, and setting its derivative to $1$ while setting the derivatives of the other inputs to $0$, is called seeding.
- Forward AD is more efficient than Reverse AD for functions $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ with $m \gg n$ (more outputs than inputs), as only $n$ sweeps are necessary, compared to $m$ sweeps for Reverse AD.
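To make the forward sweep concrete, here is a minimal dual-number sketch in NumPy (this is not part of Yaae; the function $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ is just an illustrative choice):

```python
import numpy as np

class Dual:
    """Forward-mode AD value: carries (value, derivative) together."""
    def __init__(self, val, dot=0.0):
        self.val = val   # primal value
        self.dot = dot   # derivative w.r.t. the seeded input

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(x):
    # chain rule applied to the elementary operation sin
    return Dual(np.sin(x.val), np.cos(x.val) * x.dot)

# f(x1, x2) = x1 * x2 + sin(x1), differentiated w.r.t. x1 in a single sweep.
x1 = Dual(2.0, 1.0)   # seed: dx1/dx1 = 1
x2 = Dual(3.0, 0.0)   # seed: dx2/dx1 = 0
y = x1 * x2 + sin(x1)
print(y.val, y.dot)   # y.dot == x2 + cos(x1) = 3 + cos(2)
```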
2) Reverse automatic differentiation
- Reverse automatic differentiation is pretty much like Forward AD, but this time we traverse the chain rule from left to right. That is, according to our simple composition function, we first compute $\frac{dy}{dw_2}$, then $\frac{dy}{dw_1}$, and finally $\frac{dy}{dx}$.
- Here is the recursive relation (with $w_3 = y$):
$$\frac{dy}{dw_i} = \frac{dy}{dw_{i+1}}\,\frac{dw_{i+1}}{dw_i}$$
- More precisely, we compute the derivative starting from the outside to the inside:
$$\frac{dy}{dx} = \left(\left(\frac{dy}{dw_2}\right)\frac{dw_2}{dw_1}\right)\frac{dw_1}{dx}$$
- Let's consider the following example:
- It can be represented as a computational graph.
- The reverse automatic differentiation is divided into 2 parts:
- A forward pass to store intermediate values, needed during the backward pass.
- A backward pass to compute the derivatives.
Forward pass
- We simply compute the value of each node. These values will be used during the backward pass to compute the derivatives.
Backward pass
- Using the previous recursive formula, we start from the output and work back towards the inputs.
- Reverse AD is more efficient than Forward AD for functions $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ with $m \ll n$ (fewer outputs than inputs), as only $m$ sweeps are necessary, compared to $n$ sweeps for Forward AD.
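To see the two passes in action, here is the same illustrative function differentiated by hand in reverse mode; the forward pass stores the intermediate values and the backward pass reuses them (again a sketch, not Yaae code):

```python
import numpy as np

# f(x1, x2) = x1 * x2 + sin(x1)
x1, x2 = 2.0, 3.0

# Forward pass: compute and store every intermediate value.
w1 = x1 * x2
w2 = np.sin(x1)
y  = w1 + w2

# Backward pass: start from dy/dy = 1 and apply the recursive relation.
dy_dy  = 1.0                                  # seed: derivative of the output w.r.t. itself
dy_dw1 = dy_dy * 1.0                          # d(w1 + w2)/dw1 = 1
dy_dw2 = dy_dy * 1.0                          # d(w1 + w2)/dw2 = 1
dy_dx1 = dy_dw1 * x2 + dy_dw2 * np.cos(x1)    # both paths reaching x1 are accumulated
dy_dx2 = dy_dw1 * x1
print(dy_dx1, dy_dx2)   # x2 + cos(x1) and x1, obtained in a single backward sweep
```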
III) Implementation
- There are two ways to implement an autodiff engine:
    - source-to-source transformation (hard):
        - A compiler converts source code containing mathematical expressions into source code containing the corresponding automatic differentiation expressions.
    - operator overloading (easy):
        - User-defined data types and the operator overloading features of a language are used to implement automatic differentiation.
- Moreover, in a neural network scenario we usually have far fewer outputs than inputs.
- Therefore, it is more efficient to implement the Reverse AD mode rather than the Forward AD mode.
- Our implementation revolves around the `Node` class, which is defined below with its main components:
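The full class lives in the Yaae repository; below is a minimal sketch consistent with the components described in this section (the attribute and method names follow the post, while the bodies are assumptions rather than the actual Yaae source; the operator classes and `backward()` are sketched in the next two subsections):

```python
import numpy as np

class Node:
    """Sketch of the Node class (illustrative, not the exact Yaae implementation)."""

    def __init__(self, data, requires_grad=True, children=None):
        self.data = np.array(data, dtype=float)       # scalar/tensor value
        self.requires_grad = requires_grad            # do we keep track of the gradient?
        self.children = children if children is not None else []
        # grad holds a zero scalar/tensor of the same shape as data (only if requires_grad).
        self.grad = Node(np.zeros_like(self.data), requires_grad=False) if requires_grad else None
        # Lambda filled in by each operator/function; only executed later, during backward().
        self._compute_derivatives = lambda: None

    # Operators are overloaded and all follow the same structure,
    # e.g. __mul__ could return Mul(self, other).forward_pass() (see next subsection).
    def __mul__(self, other):
        ...

    def backward(self):
        # Seed the output gradient, topologically sort the DAG, then propagate
        # (sketched in subsection 3 below).
        ...
```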
- Let's go through each method.
1) __init__() method
- The `__init__()` method has the following arguments:
    - `data`: a scalar/tensor value.
    - `requires_grad`: a boolean that tells us whether we want to keep track of the gradient.
    - `children`: a list to keep track of the node's children.
- It also has other variables such as:
    - `grad`: a zero scalar/tensor (only if `requires_grad` is True), matching the type of `data`.
    - `_compute_derivatives`: a lambda function, useful during the `backward()` method (we will dive into it later on).
2) Operators & Functions
- As you can see, the main strength of my implementation is that both Operators and Functions follow the same structure.
- Indeed, the code structure was designed in such a way that it explicitly reflects the Reverse AD mode logic, that is, first a forward pass and then a backward pass (to compute the derivatives).
- For each operator/function, the `forward_pass()` method creates a new `Node` from the operator/function output, with its children (the operand nodes) stored in `children`.
- Each `grad` will be a `Node` whose value is a zero scalar/tensor, matching the type of `data`.
- We then compute the gradients using the `compute_derivatives()` method of each operator/function class.
- This method is then stored in the `_compute_derivatives()` lambda function of each resulting node.
- Notice that this method is not executed yet. We will see the reason later.
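As an illustration of this structure, here is what a multiplication operator could look like on top of the `Node` sketch above (the method names `forward_pass()` and `compute_derivatives()` follow the post; their bodies are assumptions):

```python
class Mul:
    """Sketch of an operator class following the forward_pass / compute_derivatives structure."""

    def __init__(self, node1, node2):
        self.node1, self.node2 = node1, node2

    def forward_pass(self):
        # Create a new Node from the operator output, keeping both operands as its children.
        self.out = Node(self.node1.data * self.node2.data,
                        requires_grad=True,
                        children=[self.node1, self.node2])
        # Store the derivative computation; it is NOT executed here:
        # backward() will call it later, in topological order.
        self.out._compute_derivatives = self.compute_derivatives
        return self.out

    def compute_derivatives(self):
        # d(a*b)/da = b and d(a*b)/db = a, each scaled by the upstream gradient.
        if self.node1.requires_grad:
            self.node1.grad.data += self.node2.data * self.out.grad.data
        if self.node2.requires_grad:
            self.node2.grad.data += self.node1.data * self.out.grad.data

# Node.__mul__ can then simply return Mul(self, other).forward_pass();
# other operators/functions (Add, Sin, ...) follow exactly the same pattern.
```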
3) backward() method
- In the `backward()` method, we need to propagate the gradient from the output back to the inputs, as done in the previous example.
- To do so, we have to:
    - Set the `grad` of the output to 1.
    - Perform a topological sort.
    - Use the `_compute_derivatives()` lambda functions, which hold the derivative computations of each node.
- A topological sort is used for problems where objects must be processed in a particular order.
- A typical example would be a set of tasks where certain tasks must be executed after other ones. Such a problem can be represented by a Directed Acyclic Graph (DAG), which is exactly the graph we designed.
Figure: A computational graph (DAG)
- On top of that, order matters here (since we are working with chain rules). Hence, we want a node to propagate its gradient to its children only once it has received the gradient contributions from all of its parents.
- Now that we know a topological sort is needed, let's perform one. For the above DAG, the topological sort outputs `topo = [node_w1, node_w2, node_w3, node_w4, node_w5]`.
- We then reverse `topo` and, for each node in `topo`, call the `_compute_derivatives()` method, a lambda function containing that node's derivative computations, which is in charge of propagating the gradient through the DAG.
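Putting these three steps together, the `backward()` method of the `Node` sketch could look like this (again an assumed body, not the exact Yaae code):

```python
import numpy as np

def backward(self):  # method of the Node sketch above
    # 1) Seed: the gradient of the output with respect to itself is 1.
    self.grad.data = np.ones_like(self.data)

    # 2) Topological sort of the DAG rooted at the output node.
    topo, visited = [], set()
    def build_topo(node):
        if node not in visited:
            visited.add(node)
            for child in node.children:
                build_topo(child)
            topo.append(node)
    build_topo(self)

    # 3) Reverse the order and run each stored derivative computation,
    #    so the gradient flows from the output back to the inputs.
    for node in reversed(topo):
        node._compute_derivatives()
```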
IV) Application
- Let's write down our previous example using Yaae and analyze it step by step.
- Let's first define the following:
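Using the `Node` sketch from the previous section, the definitions could look roughly like this (the values and the expression below are purely illustrative, not necessarily the example from the figures):

```python
w1 = Node(2.0, requires_grad=True)
w2 = Node(3.0, requires_grad=True)
z = w1 * w2   # forward pass: Mul.forward_pass() creates a new Node with children [w1, w2]
```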
- In the following, the lambda function `l` will contain a check symbol to express that `l` is no longer empty.
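Calling `backward()` on the output then performs the topological sort and executes each stored `_compute_derivatives()` lambda, filling in the gradients:

```python
z.backward()
print(z.data)         # 2 * 3 = 6
print(w1.grad.data)   # dz/dw1 = w2 = 3
print(w2.grad.data)   # dz/dw2 = w1 = 2
```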
- Still skeptical? We can even go further and write our own GAN with it.