---
tags: machine-learning
---
# Yaae: Yet another autodiff engine (written in Numpy)
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/0.png" width="40%">
</div>
<br>
:::info
- In this blog post, we will take an in-depth look at what automatic differentiation is and how to write an autodiff engine in pure Numpy.
- Please refer to <a href="https://github.com/3outeille/Yaae" style="color:red">Yaae</a> repository for the full code implementation.
- I mostly relied on Wikipedia to write this blog post since, in my opinion, it provides an excellent explanation of what automatic differentiation is.
- Also, if you want to read other posts, feel free to check them at my <a href="https://ferdinandmom.tech/blog/" style="color:green">blog</a>.
:::
## I) Introduction
- **Automatic differentiation (AD)**, also called **algorithmic/computational differentiation**, is ==a set of techniques to numerically differentiate a function through the use of exact formulas and floating-point values==. The resulting numerical value involves **no approximation error**.
- AD is different from:
- [numerical differentiation:](https://en.wikipedia.org/wiki/Numerical_differentiation)
- $f'(x) = \lim\limits_{h \to 0} \frac{f(x+h) - f(x)}{h}$
- can introduce round-off errors (illustrated in the sketch after this list).
- [symbolic differentiation:](https://en.wikipedia.org/wiki/Computer_algebra)
- $\left(\frac{x^{2} \cos (x-7)}{\sin (x)}\right)'= x \csc (x)(\cos (x-7)(2-x \cot (x))-x \sin (x-7))$
- can lead to inefficient code.
- can have difficulty converting a computer program into a single expression.
- The goal of AD is **not a formula**, but a **procedure to compute derivatives**.
- Both classical methods:
- **have problems with calculating higher derivatives**, where complexity and errors increase.
- **are slow at computing partial derivatives of a function with respect to many inputs**, as is needed for gradient-based optimization algorithms.
- ==Automatic differentiation solves all of these problems==.
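- To see the round-off issue of numerical differentiation concretely, here is a small NumPy sketch (purely illustrative, not part of Yaae) comparing a forward finite-difference estimate of $\sin'(1) = \cos(1)$ with the exact value:
```python
import numpy as np

x = 1.0
exact = np.cos(x)  # exact derivative of sin at x

# Forward finite difference with decreasing step sizes.
for h in (1e-2, 1e-6, 1e-10, 1e-14):
    approx = (np.sin(x + h) - np.sin(x)) / h
    print(f"h = {h:.0e}   error = {abs(approx - exact):.2e}")

# The error first shrinks with h (truncation error) and then grows again
# (floating-point round-off error), so there is no universally safe step size.
```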
:::info
- :bulb: For your information, you will often hear **"Autograd"** used instead of automatic differentiation, even though Autograd is actually the name of a particular autodiff package.
- **Backpropagation** is a special case of autodiff applied to neural networks. But in machine learning, we often use backpropagation synonymously with autodiff.
:::
## II) Automatic differentiation modes
- There are usually 2 automatic differentiation modes:
- **Forward automatic differentiation**.
- **Reverse automatic differentiation**.
- For each mode, we will have an in-depth explanation and a walkthrough example, but let's first define what the **chain rule** is.
- The **chain rule** is a ==formula to compute the derivative of a composite function.==
- For a simple composition:
$$\boxed{y = f(g(h(x)))}$$
1. $w_0 = x$
2. $w_1 = h(x)$
3. $w_2 = g(w_1)$
4. $w_3 = f(w_2) = y$
- Thus, using **chain rule**, we have:
$$\boxed{\frac{\mathrm{d} y}{\mathrm{d} x} = \frac{\mathrm{d} y}{\mathrm{d} w_2} \frac{\mathrm{d} w_2}{\mathrm{d} w_1}\frac{\mathrm{d} w_1}{\mathrm{d} x} = \frac{\mathrm{d} f(w_2)}{\mathrm{d} w_2} \frac{\mathrm{d} g(w_1)}{\mathrm{d} w_1}\frac{\mathrm{d} h(w_0)}{\mathrm{d} w_0}}$$
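- As a concrete instance of this composition, take $h(x) = x^2$, $g = \sin$ and $f = \exp$:
$$\boxed{y = e^{\sin(x^2)} \implies \frac{\mathrm{d} y}{\mathrm{d} x} = \underbrace{e^{\sin(x^2)}}_{\frac{\mathrm{d} f(w_2)}{\mathrm{d} w_2}} \cdot \underbrace{\cos(x^2)}_{\frac{\mathrm{d} g(w_1)}{\mathrm{d} w_1}} \cdot \underbrace{2x}_{\frac{\mathrm{d} h(w_0)}{\mathrm{d} w_0}}}$$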
### <ins>1) Forward automatic differentiation</ins>
- Forward automatic differentiation ==divides the expression into a sequence of differentiable elementary operations==. The chain rule and well-known differentiation rules are then applied to each elementary operation.
- In forward AD mode, we traverse the chain rule from **right to left**. That is, according to our simple composition function, we first compute $\frac{\mathrm{d} w_1}{\mathrm{d} x}$ and then $\frac{\mathrm{d} w_2}{\mathrm{d} w_1}$ and finally $\frac{\mathrm{d} y}{\mathrm{d} w_2}$.
<br>
- Here is the recursive relation:
$$\boxed{\frac{\mathrm{d} w_i}{\mathrm{d} x} = \frac{\mathrm{d} w_i}{\mathrm{d} w_{i-1}} \frac{\mathrm{d} w_{i-1}}{\mathrm{d} x}}$$
- More precisely, we compute the derivative starting from the **inside to the outside**:
$$\boxed{\frac{\mathrm{d} y}{\mathrm{d} x} = \frac{\mathrm{d} y}{\mathrm{d} w_{i-1}} \frac{\mathrm{d} w_{i-1}}{\mathrm{d} x} = \frac{\mathrm{d} y}{\mathrm{d} w_{i-1}} \left(\frac{\mathrm{d} w_{i-1}}{\mathrm{d} w_{i-2}} \frac{\mathrm{d} w_{i-2}}{\mathrm{d} x}\right) = \frac{\mathrm{d} y}{\mathrm{d} w_{i-1}} \left( \frac{\mathrm{d} w_{i-1}}{\mathrm{d} w_{i-2}}\left(\frac{\mathrm{d} w_{i-2}}{\mathrm{d} w_{i-3}} \frac{\mathrm{d} w_{i-3}}{\mathrm{d} x}\right)\right) = ...}$$
- Let's consider the following example:
$$
\boxed{
\begin{align*}
w_1 &= x_1 = 2
\\
w_2 &= x_2 = 3
\\
w_3 &= w_1w_2
\\
w_4 &= \sin(w_1)
\\
w_5 &= w_3 + w_4 = y
\end{align*}
}
$$
- It can be represented as a **computational graph**.
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/1.png" width="40%">
</div>
<br>
- Since $y$ depends on $w_1$ and $w_2$, its derivative will depend on $w_1$ and $w_2$ too. Thus, we have to perform 2 chain rules: $\frac{\mathrm{d} y}{\mathrm{d} w_1}$ and $\frac{\mathrm{d} y}{\mathrm{d} w_2}$.
- $\boxed{\text{For } \frac{\mathrm{d} y}{\mathrm{d} w_1}:}$
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/2.png" width="80%">
</div>
<br>
- Thus,
$$
\boxed{
\begin{align*}
\frac{\mathrm{d} y}{\mathrm{d} w_1} &= \dot{w_4} + \dot{w_3}
\\
&= \cos(2) \cdot 1 + (1 \cdot 3 + 2 \cdot 0)
\\
&\approx 2.58
\end{align*}
}
$$
---
- $\boxed{\text{For } \frac{\mathrm{d} y}{\mathrm{d} w_2}:}$
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/3.png" width="80%">
</div>
<br>
- Thus,
$$
\boxed{
\begin{align*}
\frac{\mathrm{d} y}{\mathrm{d} w_2} &= \dot{w_4} + \dot{w_3}
\\
&= \cos(2) \cdot 0 + (0 \cdot 3 + 2 \cdot 1)
\\
&= 2
\end{align*}
}
$$
---
- Choosing which input derivative ($\dot{w_1}$ or $\dot{w_2}$) is set to $1$ while the others are set to $0$ is called **seeding**.
- ==Forward AD is more efficient than Reverse AD for functions $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ with $m \gg n$ **(more outputs than inputs)** as only $n$ sweeps are necessary compared to $m$ sweeps for Reverse AD.==
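- To make the two sweeps above concrete, here is a minimal forward-mode sketch based on dual numbers, written in plain Python/NumPy (purely illustrative; Yaae itself implements reverse mode):
```python
import numpy as np

class Dual:
    """Minimal forward-mode AD value: carries (value, derivative w.r.t. the seeded input)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        # Product rule.
        return Dual(self.val * other.val, self.dot * other.val + self.val * other.dot)
    def sin(self):
        return Dual(np.sin(self.val), np.cos(self.val) * self.dot)

def f(w1, w2):
    return w1 * w2 + w1.sin()   # y = w1*w2 + sin(w1)

# One sweep per input: seed the input of interest with dot = 1.
print(f(Dual(2, 1), Dual(3, 0)).dot)  # dy/dw1 = cos(2) + 3 ≈ 2.58
print(f(Dual(2, 0), Dual(3, 1)).dot)  # dy/dw2 = 2.0
```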
### <ins>2) Reverse automatic differentiation</ins>
- Reverse automatic differentiation is much like Forward AD, but this time we traverse the chain rule from **left to right**. That is, according to our simple composition function, we first compute $\frac{\mathrm{d} y}{\mathrm{d} w_2}$, then $\frac{\mathrm{d} w_2}{\mathrm{d} w_1}$ and finally $\frac{\mathrm{d} w_1}{\mathrm{d} x}$.
- Here is the recursive relation:
$$\boxed{\frac{\mathrm{d} y}{\mathrm{d} w_i} = \frac{\mathrm{d} y}{\mathrm{d} w_{i+1}} \frac{\mathrm{d} w_{i+1}}{\mathrm{d} w_{i}}}$$
- More precisely, we compute the derivative starting from the **outside to the inside**:
$$\boxed{\frac{\mathrm{d} y}{\mathrm{d} x} = \frac{\mathrm{d} y}{\mathrm{d} w_{i+1}} \frac{\mathrm{d} w_{i+1}}{\mathrm{d} x} = \left(\frac{\mathrm{d} y}{\mathrm{d} w_{i+2}} \frac{\mathrm{d} w_{i+2}}{\mathrm{d} w_{i+1}}\right)\frac{\mathrm{d} w_{i+1}}{\mathrm{d} x} = \left( \left(\frac{\mathrm{d} y}{\mathrm{d} w_{i+3}} \frac{\mathrm{d} w_{i+3}}{\mathrm{d} w_{i+2}}\right) \frac{\mathrm{d} w_{i+2}}{\mathrm{d} w_{i+1}}\right) \frac{\mathrm{d} w_{i+1}}{\mathrm{d} x} = ...}$$
- Let's consider the following example:
$$
\boxed{
\begin{align*}
w_1 &= x_1 = 2
\\
w_2 &= x_2 = 3
\\
w_3 &= w_1w_2
\\
w_4 &= \sin(w_1)
\\
w_5 &= w_3 + w_4 = y
\end{align*}
}
$$
- It can be represented as a **computational graph**.
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/4.png" width="40%">
</div>
<br>
- **Reverse automatic differentiation** is divided into 2 parts:
- A **forward pass** to store intermediate values, needed during the backward pass.
- A **backward pass** to compute the derivatives.
---
### ☰ Forward pass
- We simply compute the value of each node. These values will be used during the backward pass to compute the derivatives.
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/5.png" width="70%">
</div>
### ☰ Backward pass
- Using the previous recursive formula, we propagate the derivatives from the output back to the inputs.
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/6.png" width="90%">
</div>
<br>
- Thus,
$$\boxed{
\begin{align*}
\frac{\mathrm{d} y}{\mathrm{d} w_1} &= \bar{w_1} = \cos(2) + 3 \approx 2.58
\\
\frac{\mathrm{d} y}{\mathrm{d} w_2} &= \bar{w_2} = 2
\end{align*}
}$$
---
- ==Reverse AD is more efficient than Forward AD for functions $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ with $m \ll n$ **(fewer outputs than inputs)** as only $m$ sweeps are necessary, compared to $n$ sweeps for Forward AD.==
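- The same two numbers can be checked with a short manual reverse sweep in plain NumPy (a hand-written trace of the graph above, not Yaae code):
```python
import numpy as np

# Forward pass: compute and store every intermediate value.
w1, w2 = 2.0, 3.0
w3 = w1 * w2
w4 = np.sin(w1)
w5 = w3 + w4                                 # y

# Backward pass: propagate the adjoints (the bar values above) from the output to the inputs.
w5_bar = 1.0                                 # seed: dy/dy = 1
w3_bar = w5_bar * 1.0                        # d(w3 + w4)/dw3 = 1
w4_bar = w5_bar * 1.0                        # d(w3 + w4)/dw4 = 1
w2_bar = w3_bar * w1                         # d(w1 * w2)/dw2 = w1
w1_bar = w3_bar * w2 + w4_bar * np.cos(w1)   # w1 contributes through both w3 and w4

print(w1_bar)   # cos(2) + 3 ≈ 2.58
print(w2_bar)   # 2.0
```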
## III) Implementation
- There are 2 ways to implement an autodiff engine:
- **source-to-source transformation** <a style="color:red">(hard)</a>:
- Uses a compiler to convert source code with mathematical expressions to source code with automatic differentiation expressions.
- **operator overloading** <a style="color:green">(easy)</a>:
- User-defined data types and the operator-overloading features of a language are used to implement automatic differentiation.
- Moreover, in a neural network scenario we usually have **far fewer outputs than inputs** (typically a single scalar loss).
- Therefore, it is more efficient to implement **Reverse AD** rather than **Forward AD**.
- Our implementation revolves around the class `Node` which is defined below with its main components:
```python
class Node:

    def __init__(self, data, requires_grad=True, children=[]):
        pass

    def zero_grad(self):
        pass

    def backward(self, grad=None):
        pass

    # Operators.
    def __add__(self, other):
        # Built-in python operators.
        pass

    def matmul(self, other):
        # Custom operators.
        pass

    ...  # More operators.

    # Functions.
    def sin(self):
        pass

    ...  # More functions.
```
- Let's go through each method.
---
### 1) \_\_init\_\_() method
```python=
def __init__(self, data, requires_grad=True, children=[]):
    self.data = data if isinstance(data, np.ndarray) else np.array(data)
    self.requires_grad = requires_grad
    self.children = children
    self.grad = None
    if self.requires_grad:
        self.zero_grad()
    # Stores the derivative computation of the operator/function that produced this node.
    self._compute_derivatives = lambda: None

def zero_grad(self):
    self.grad = Node(np.zeros_like(self.data, dtype=np.float64), requires_grad=False)
```
- The **\_\_init\_\_()** method has the following arguments:
- `data`: a scalar/tensor value.
- `requires_grad`: a boolean that tells us if we want to keep track of the gradient.
- `children`: a list to keep track of children.
- It also has other attributes:
- `grad`: a `Node` holding a zero scalar/tensor of the same shape as `data` (initialized only if **requires_grad** is True).
- `_compute_derivatives`: a lambda function used by the **backward()** method (we will dive into it later on).
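- For instance, constructing a `Node` from a nested list gives the following (a minimal usage sketch based only on the `__init__()` code above):
```python
x = Node([[1., 2.], [3., 4.]], requires_grad=True)

print(x.data)        # the value, stored as an np.ndarray of shape (2, 2)
print(x.grad.data)   # a zero tensor of the same shape, created by zero_grad()
print(x.children)    # []: a leaf node has no children
```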
### 2) Operators & Functions
```python=
# Operators.
def __add__(self, other):
    # Operator overloading.
    op = Add(self, other)
    out = op.forward_pass()
    out._compute_derivatives = op.compute_derivatives
    return out

def matmul(self, other):
    # Custom matrix multiplication function.
    op = Matmul(self, other)
    out = op.forward_pass()
    out._compute_derivatives = op.compute_derivatives
    return out

# Functions.
def sin(self):
    op = Sin(self)
    out = op.forward_pass()
    out._compute_derivatives = op.compute_derivatives
    return out

# ------
# To keep things clear, I removed some lines of code.
# Check the Yaae repository for the full implementation.

class Add():
    def __init__(self, node1, node2):
        pass
    def forward_pass(self):
        # Compute value and store children.
        pass
    def compute_derivatives(self):
        pass

class Matmul():
    def __init__(self, node1, node2):
        pass
    def forward_pass(self):
        # Compute value and store children.
        pass
    def compute_derivatives(self):
        pass

class Sin():
    def __init__(self, node1):
        pass
    def forward_pass(self):
        # Compute value and store children.
        pass
    def compute_derivatives(self):
        pass
```
- As you can see, the main strength of my implementation is that both **Operators and Functions follow the same structure**.
- Indeed, the code structure was designed in such a way that it **explicitly reflects the Reverse AD mode logic**, that is, ==first a forward pass and then a backward pass (to compute the derivatives).==
- For each operator/function, `forward_pass()` creates a new `Node` holding the operator/function output, with the operand nodes stored in `children`.
- Each `grad` is a `Node` whose value is a zero scalar/tensor matching the shape of `data`.
- The gradients are then computed by the `compute_derivatives()` method of each operator/function class.
- This method is **stored in the lambda function `_compute_derivatives()` of each resulting node.**
- Notice that this method is not executed yet. We will see why later.
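- To make this structure concrete, here is a rough sketch of what the `Add` operator class could look like (no broadcasting handling; the actual Yaae implementation may differ):
```python
class Add():

    def __init__(self, node1, node2):
        self.node1 = node1
        self.node2 = node2

    def forward_pass(self):
        # Create the output node and register the operands as its children.
        self.out = Node(self.node1.data + self.node2.data,
                        requires_grad=self.node1.requires_grad or self.node2.requires_grad,
                        children=[self.node1, self.node2])
        return self.out

    def compute_derivatives(self):
        # d(a + b)/da = 1 and d(a + b)/db = 1, so the upstream gradient stored
        # in self.out.grad flows through unchanged (and is accumulated).
        if self.node1.requires_grad:
            self.node1.grad.data += self.out.grad.data
        if self.node2.requires_grad:
            self.node2.grad.data += self.out.grad.data
```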
### 3) backward() method
```python=
def backward(self, grad=None):
    L, visited = [], set()
    topo = topo_sort(self, L, visited)
    if grad is None:
        if self.shape == ():
            self.grad = Node(1, requires_grad=False)
        else:
            raise RuntimeError('backward: grad must be specified for non-0 tensor')
    else:
        self.grad = Node(grad, requires_grad=False) if isinstance(grad, (np.ndarray, list)) else grad
    for v in reversed(topo):
        v._compute_derivatives()

def topo_sort(v, L, visited):
    """
    Returns list of all of the children in the graph in topological order.

    Parameters:
    - v: last node in the graph.
    - L: list of all of the children in the graph in topological order (empty at the beginning).
    - visited: set of visited children.
    """
    if v not in visited:
        visited.add(v)
        for child in v.children:
            topo_sort(child, L, visited)
        L.append(v)
    return L
```
- In the **backward()** method, we need to propagate the gradient from the output to the input as done in the previous example.
- To do so, we have to:
- Set `grad` of the output to 1.
- Perform a **topological sort**.
- Use the `_compute_derivatives()` lambda function of each node, which computes and stores the gradients.
- A **topological sort** is used for problems where some objects must be processed in a particular order.
- A typical example is a set of tasks where certain tasks must be executed after others. Such an example can be represented by a **Directed Acyclic Graph (DAG)**, which is exactly the kind of graph we designed.
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/7.png" width="70%">
<figcaption>Figure: A computational graph (DAG) </figcaption>
</div>
<br>
- On top of that, order matters here (since we are chaining derivatives). Hence, we want node $w_5$ to propagate its gradient to nodes $w_3$ and $w_4$ first.
- Now that we know a **topological sort** is needed, let's perform one. For the above DAG, the topological sort gives `topo = [node_w1, node_w2, node_w3, node_w4, node_w5]`.
- We then reverse `topo` and, for each node in it, call the `_compute_derivatives()` method, which is a ==**lambda function** containing the node's derivative computations, in charge of propagating the gradient through the DAG.==
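- Note that, as the `backward()` code above shows, a non-scalar output needs an explicit upstream gradient as a seed. A small hypothetical usage sketch (assuming `numpy` is imported as `np` and that `sin()` differentiates elementwise):
```python
import numpy as np

w = Node([1., 2., 3.], requires_grad=True)
y = w.sin()

# y is not a scalar, so we must provide the seed dL/dy ourselves.
y.backward(np.ones(3))

print(w.grad.data)   # elementwise cos([1, 2, 3]) ≈ [0.54, -0.42, -0.99]
```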
## IV) Application
- Let's write down our previous example using Yaae and analyze it step by step.
```python
# Yaae.
w1 = Node(2, requires_grad=True)
w2 = Node(3, requires_grad=True)
w3 = w2 * w1
w4 = w1.sin()
w5 = w3 + w4
z = w5
z.backward()
print(w1.grad.data) # 2.5838531634528574 ≈ 2.58
print(w2.grad.data) # 2.0
```
- Let's first define the following:
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-1.png" width="60%">
</div>
<br>
- In the following figures, the lambda function `l` will contain a check mark to indicate that `l` is no longer empty.
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-2.png" width="90%">
</div>
<br>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-3.png" width="90%">
</div>
<br>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-4.png" width="90%">
</div>
<br>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-5.png" width="90%">
</div>
<br>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-6.png" width="90%">
</div>
<br>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-7.png" width="90%">
</div>
<br>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-8.png" width="90%">
</div>
<br>
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-9.png" width="90%">
</div>
<br>
- It seems to work properly! Since [Yaae](https://github.com/3outeille/Yaae) also supports tensor operations, we can easily build a small neural network library on top of it and work out some simple regression/classification problems (a rough training-loop sketch follows the figures below)!
- Here are the results from the [demo_regression.ipynb](https://github.com/3outeille/Yaae/blob/master/src/example/demo_regression.ipynb) and [demo_classification.ipynb](https://github.com/3outeille/Yaae/blob/master/src/example/demo_classification.ipynb) files, which can be found in the [repository example folder](https://github.com/3outeille/Yaae).
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/8.gif" width=70%">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/9.gif" width=70%">
</div>
<br>
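- To give a flavour of how such a library is used, here is a rough linear-regression training loop built only on the `Node` methods shown earlier (`matmul`, `zero_grad`, `backward`). This is a hypothetical sketch, not code taken from the repository:
```python
import numpy as np

np.random.seed(0)
X_np = np.random.randn(100, 3)
true_w = np.array([[1.5], [-2.0], [0.7]])
y_true = X_np @ true_w

X = Node(X_np, requires_grad=False)
W = Node(np.zeros((3, 1)), requires_grad=True)

lr = 0.1
for step in range(200):
    W.zero_grad()
    y_pred = X.matmul(W)                                  # forward pass through the graph
    upstream = 2 * (y_pred.data - y_true) / len(y_true)   # dMSE/dy_pred, computed in plain numpy
    y_pred.backward(upstream)                             # reverse pass: fills W.grad
    W.data -= lr * W.grad.data                            # vanilla gradient descent step

print(W.data.ravel())   # should approach [1.5, -2.0, 0.7]
```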
- Still skeptical? We can even go further and write our own [GAN](https://github.com/3outeille/GANumpy).
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/10.gif" width=60%">
</div>