In this blog post, we will take an in-depth look at what automatic differentiation is and how to write one in pure NumPy.
Please refer to the Yaae repository for the full code implementation.
I mostly relied on Wikipedia to write this blog post, which, in my opinion, provides an excellent explanation of what automatic differentiation is.
Also, if you want to read other posts, feel free to check them out on my blog.
I) Introduction
Automatic differentiation (AD), also called algorithmic or computational differentiation, is a set of techniques to numerically evaluate the derivative of a function by applying exact differentiation formulas to floating-point values. The resulting numerical value therefore involves no approximation error (unlike finite-difference methods).
For your information, you will often hear "Autograd" used instead of automatic differentiation; Autograd is actually the name of a particular autodiff package.
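To make the "no approximation error" point concrete, here is a small illustrative comparison (not part of Yaae, and the sample point x = 1.5 is arbitrary): a finite-difference estimate of the derivative of sin carries a truncation error, while the exact rule cos(x), which is what AD applies, does not.

import numpy as np

x, h = 1.5, 1e-5

# Numerical differentiation: finite differences introduce an approximation error.
numerical = (np.sin(x + h) - np.sin(x)) / h

# AD applies the exact rule d(sin(x))/dx = cos(x); only floating-point round-off remains.
exact = np.cos(x)

print(abs(numerical - exact))  # Small but non-zero error of the finite-difference estimate.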
Backpropagation is a special case of autodiff applied to neural networks. But in machine learning, we often use backpropagation synonymously with autodiff.
II) Automatic differentiation modes
There are usually 2 automatic differentiation modes:
Forward automatic differentiation.
Reverse automatic differentiation.
For each mode, we will have an in-depth explanation and a walkthrough example but let's first define what the chain rule is.
The chain rule is a formula to compute the derivative of a composite function.
Consider a simple composition:

$$y = f(g(h(x))) = f(g(h(w_0))) = f(g(w_1)) = f(w_2) = w_3$$

where $w_0 = x$, $w_1 = h(w_0)$, $w_2 = g(w_1)$ and $w_3 = f(w_2) = y$.
Thus, using the chain rule, we have:

$$\frac{dy}{dx} = \frac{dy}{dw_2} \cdot \frac{dw_2}{dw_1} \cdot \frac{dw_1}{dx}$$
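As a concrete instance (the functions are arbitrary, chosen purely for illustration), take $h(x) = x^2$, $g(u) = \sin(u)$ and $f(v) = e^v$. Then $w_1 = x^2$, $w_2 = \sin(w_1)$, $y = w_3 = e^{w_2}$, and the chain rule gives

$$\frac{dy}{dx} = e^{w_2} \cdot \cos(w_1) \cdot 2x.$$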
1) Forward automatic differentiation
Forward automatic differentiation divides the expression into a sequence of differentiable elementary operations. The chain rule and well-known differentiation rules are then applied to each elementary operation.
In forward AD mode, we traverse the chain rule from right to left. That is, according to our simple composition function, we first compute $\frac{dw_1}{dx}$, then $\frac{dw_2}{dx}$ and finally $\frac{dy}{dx}$.
Here is the recursive relation:

$$\frac{dw_i}{dx} = \frac{dw_i}{dw_{i-1}} \cdot \frac{dw_{i-1}}{dx}$$

More precisely, we compute the derivative starting from the inside to the outside:

$$\frac{dy}{dx} = \frac{dy}{dw_2} \cdot \left( \frac{dw_2}{dw_1} \cdot \frac{dw_1}{dx} \right)$$

Forward AD is more efficient than Reverse AD for functions $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ with $m \gg n$ (more outputs than inputs), as only $n$ sweeps are necessary, compared to $m$ sweeps for Reverse AD.
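To make this concrete, here is a minimal forward accumulation sketch in NumPy for the illustrative composition above ($h(x) = x^2$, $g(u) = \sin(u)$, $f(v) = e^v$); it is independent of Yaae. Each elementary step carries both its value $w_i$ and its derivative $\frac{dw_i}{dx}$:

import numpy as np

def forward_ad(x):
    # Each step propagates the pair (w_i, dw_i/dx).
    w0, dw0 = x, 1.0                         # w0 = x, dw0/dx = 1.
    w1, dw1 = w0 ** 2, 2 * w0 * dw0          # h(x) = x^2.
    w2, dw2 = np.sin(w1), np.cos(w1) * dw1   # g(u) = sin(u).
    w3, dw3 = np.exp(w2), np.exp(w2) * dw2   # f(v) = exp(v).
    return w3, dw3                           # y and dy/dx.

y, dydx = forward_ad(1.5)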
2) Reverse automatic differentiation
Reverse automatic differentiation is pretty much like Forward AD but this time, we traverse the chain rule from left to right. That is, according to our simple composition function, we first compute $\frac{dy}{dw_2}$, then $\frac{dy}{dw_1}$ and finally $\frac{dy}{dx}$.
Here is the recursive relation:

$$\frac{dy}{dw_i} = \frac{dy}{dw_{i+1}} \cdot \frac{dw_{i+1}}{dw_i}$$

More precisely, we compute the derivative starting from the outside to the inside:

$$\frac{dy}{dx} = \left( \frac{dy}{dw_2} \cdot \frac{dw_2}{dw_1} \right) \cdot \frac{dw_1}{dx}$$

Reverse AD is more efficient than Forward AD for functions $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ with $m \ll n$ (fewer outputs than inputs), as only $m$ sweeps are necessary, compared to $n$ sweeps for Forward AD.
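For comparison, here is the same computation done with reverse accumulation (again a standalone illustration, not Yaae code): a forward pass stores the intermediate values, then the adjoints $\frac{dy}{dw_i}$ are propagated from the output back to the input.

import numpy as np

def reverse_ad(x):
    # Forward pass: compute and store the intermediate values.
    w1 = x ** 2          # h(x) = x^2.
    w2 = np.sin(w1)      # g(u) = sin(u).
    w3 = np.exp(w2)      # f(v) = exp(v), i.e. y.
    # Backward pass: propagate the adjoints dy/dw_i from the output to the input.
    w3_bar = 1.0                   # dy/dy = 1.
    w2_bar = w3_bar * np.exp(w2)   # dy/dw2.
    w1_bar = w2_bar * np.cos(w1)   # dy/dw1.
    x_bar = w1_bar * 2 * x         # dy/dx.
    return w3, x_bar

y, dydx = reverse_ad(1.5)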
III) Yaae implementation
1) Node class
Yaae is built around a Node class, whose __init__() method has the following arguments:
data: a scalar/tensor value.
requires_grad: a boolean that tells us if we want to keep track of the gradient.
children: a list to keep track of children.
It also has other attributes such as:
grad: a Node initialized to a 0-scalar/tensor (only if requires_grad is True), depending on the data type.
_compute_derivatives: a lambda function used during the backward() method (we will dive into it later on).
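To make this description concrete, here is a minimal sketch of what such a Node constructor could look like (the attribute names follow the list above, but this is only an approximation of the actual Yaae code; the shape property is an assumption used by the backward() method shown later):

import numpy as np

class Node:
    def __init__(self, data, requires_grad=True, children=None):
        # data: a scalar/tensor value, stored as a NumPy array.
        self.data = data if isinstance(data, np.ndarray) else np.array(data, dtype=float)
        # requires_grad: do we keep track of the gradient for this node?
        self.requires_grad = requires_grad
        # children: the nodes used to produce this one.
        self.children = children if children is not None else []
        # grad: a Node holding a 0-scalar/tensor matching data's shape.
        self.grad = Node(np.zeros_like(self.data), requires_grad=False) if requires_grad else None
        # _compute_derivatives: overwritten by each operator/function during the forward pass.
        self._compute_derivatives = lambda: None

    @property
    def shape(self):
        return self.data.shape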
2) Operators & Functions
# Operators.
def __add__(self, other):
    # Operator overloading.
    op = Add(self, other)
    out = op.forward_pass()
    out._compute_derivatives = op.compute_derivatives
    return out

def matmul(self, other):
    # Custom matrix multiplication function.
    op = Matmul(self, other)
    out = op.forward_pass()
    out._compute_derivatives = op.compute_derivatives
    return out

# Functions.
def sin(self):
    op = Sin(self)
    out = op.forward_pass()
    out._compute_derivatives = op.compute_derivatives
    return out
# ------
# To make things clearer, I removed some lines of code.
# Check the Yaae repository for the full implementation.

class Add():
    def __init__(self, node1, node2):
        pass
    def forward_pass(self):
        # Compute value and store children.
        pass
    def compute_derivatives(self):
        pass

class Matmul():
    def __init__(self, node1, node2):
        pass
    def forward_pass(self):
        # Compute value and store children.
        pass
    def compute_derivatives(self):
        pass

class Sin():
    def __init__(self, node1):
        pass
    def forward_pass(self):
        # Compute value and store children.
        pass
    def compute_derivatives(self):
        pass
As you can see, the main strength of my implementation is that both Operators and Functions follow the same structure.
Indeed, the code structure was designed in such a way that it explicitly reflects the Reverse AD logic: first a forward pass, then a backward pass (to compute derivatives).
For each operator/function, the forward_pass() method creates a new Node holding the operator/function output, with its children stored in children.
Each grad will be a Node whose value is a 0-scalar/tensor, depending on the data type.
We then compute the gradient using the compute_derivatives() method of each operator/function class.
This method is then stored in the lambda function _compute_derivatives() of each resulting node.
Notice that this method is not executed yet. We will see the reason later.
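As an example, here is what the Sin class could look like under this structure (a simplified sketch reusing the Node sketch from earlier, not the exact Yaae code): forward_pass() builds the new Node and registers its child, while compute_derivatives() applies the chain rule, d(sin(x))/dx = cos(x), and accumulates the gradient flowing into out back into the child's grad.

import numpy as np

class Sin():
    def __init__(self, node1):
        self.node1 = node1

    def forward_pass(self):
        # Compute the value and store the child node in the resulting Node.
        self.out = Node(np.sin(self.node1.data),
                        requires_grad=self.node1.requires_grad,
                        children=[self.node1])
        return self.out

    def compute_derivatives(self):
        # Chain rule: d(sin(x))/dx = cos(x), scaled by the gradient flowing into out.
        if self.node1.requires_grad:
            self.node1.grad.data += self.out.grad.data * np.cos(self.node1.data)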
3) backward() method
def backward(self, grad=None):
    L, visited = [], set()
    topo = topo_sort(self, L, visited)
    if grad is None:
        if self.shape == ():
            self.grad = Node(1, requires_grad=False)
        else:
            raise RuntimeError('backward: grad must be specified for non-0 tensor')
    else:
        self.grad = Node(grad, requires_grad=False) if isinstance(grad, (np.ndarray, list)) else grad
    for v in reversed(topo):
        v._compute_derivatives()

def topo_sort(v, L, visited):
    """
    Returns list of all of the children in the graph in topological order.

    Parameters:
    - v: last node in the graph.
    - L: list of all of the children in the graph in topological order (empty at the beginning).
    - visited: set of visited children.
    """
    if v not in visited:
        visited.add(v)
        for child in v.children:
            topo_sort(child, L, visited)
        L.append(v)
    return L
In the backward() method, we need to propagate the gradient from the output to the input, as done in the previous example.
To do so, we have to:
Set the grad of the output node to 1.
Perform a topological sort.
Call the _compute_derivatives() lambda function, which holds the derivative computations of each node.
A topological sort is used for problems where some objects must be processed in a particular order.
A typical example would be a set of tasks where certain tasks must be executed after other ones. Such an example can be represented by a Directed Acyclic Graph (DAG), which is exactly the graph we designed.
On top of that, order matters here (since we are working with chain rules). Hence, we want the output node to propagate its gradient to its children before those children propagate their own gradients further down the graph.
Now that we know a topological sort is needed, let's perform one. For the above DAG, the topological sort will give us topo = [node_w1, node_w2, node_w3, node_w4, node_w5].
We then reverse topo and call, for each node, the _compute_derivatives() method, a lambda function containing the node's derivative computations and in charge of propagating the gradient through the DAG.
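Putting everything together, here is a hypothetical end-to-end sketch (the node names and values are made up, and it relies on the full Add/Sin implementations from the Yaae repository): we build a tiny graph with the operators defined earlier, call backward() on the output, and read the gradients accumulated in the leaf nodes.

w1 = Node(2.0, requires_grad=True)
w2 = Node(3.0, requires_grad=True)
w3 = w1 + w2          # __add__ -> Add.forward_pass().
w4 = w3.sin()         # sin()  -> Sin.forward_pass().
w4.backward()         # Topological sort, then reversed _compute_derivatives() calls.
print(w1.grad.data)   # dw4/dw1 = cos(w1 + w2).
print(w2.grad.data)   # dw4/dw2 = cos(w1 + w2).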
IV) Application
Let's write down our previous example using Yaae and analyze it step by step.
It seems to work properly! Since Yaae also supports tensor operations, we can easily build a small neural network library on top of it and work out some simple regression/classification problems!