---
tags: machine-learning
---

# Yaae: Yet another autodiff engine (written in Numpy)

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/0.png" width="40%">
</div>
<br>

:::info
- In this blog post, we will have an in-depth review of what automatic differentiation is and how to write an autodiff engine in pure Numpy.
- Please refer to the <a href="https://github.com/3outeille/Yaae" style="color:red">Yaae</a> repository for the full code implementation.
- I mostly relied on Wikipedia to write this blog post, as it provides, in my opinion, an excellent explanation of automatic differentiation.
- Also, if you want to read other posts, feel free to check them out on my <a href="https://ferdinandmom.tech/blog/" style="color:green">blog</a>.
:::

## I) Introduction

- **Automatic differentiation (AD)**, also called **algorithmic/computational differentiation**, is ==a set of techniques to numerically differentiate a function through the use of exact formulas and floating-point values==. The resulting numerical value involves **no approximation error** (it is exact up to floating-point round-off).
- AD is different from:
    - [numerical differentiation:](https://en.wikipedia.org/wiki/Numerical_differentiation)
        - $f'(x) = \lim\limits_{h \to 0} \frac{f(x+h) - f(x)}{h}$
        - can introduce round-off errors.
    - [symbolic differentiation:](https://en.wikipedia.org/wiki/Computer_algebra)
        - $\left(\frac{x^{2} \cos (x-7)}{\sin (x)}\right)'= x \csc (x)(\cos (x-7)(2-x \cot (x))-x \sin (x-7))$
        - can lead to inefficient code.
        - can have difficulty converting a computer program into a single expression.
- The goal of AD is **not a formula**, but a **procedure to compute derivatives**.
- Both classical methods:
    - **have problems with calculating higher derivatives**, where complexity and errors increase.
    - **are slow at computing partial derivatives of a function with respect to many inputs**, as is needed for gradient-based optimization algorithms.
- ==Automatic differentiation solves all of these problems==.

:::info
- :bulb: For your information, you will often hear **"Autograd"** (the name of a particular autodiff package) used instead of automatic differentiation.
- **Backpropagation** is a special case of reverse-mode autodiff applied to neural networks, but in machine learning, backpropagation and autodiff are often used synonymously.
:::

## II) Automatic differentiation modes

- There are usually 2 automatic differentiation modes:
    - **Forward automatic differentiation**.
    - **Reverse automatic differentiation**.
- For each mode, we will have an in-depth explanation and a walkthrough example, but let's first define what the **chain rule** is.
- The **chain rule** is a ==formula to compute the derivative of a composite function.==
- For a simple composition:

$$\boxed{y = f(g(h(x)))}$$

1. $w_0 = x$
2. $w_1 = h(w_0)$
3. $w_2 = g(w_1)$
4. $w_3 = f(w_2) = y$

- Thus, using the **chain rule**, we have:

$$\boxed{\frac{\mathrm{d} y}{\mathrm{d} x} = \frac{\mathrm{d} y}{\mathrm{d} w_2} \frac{\mathrm{d} w_2}{\mathrm{d} w_1}\frac{\mathrm{d} w_1}{\mathrm{d} x} = \frac{\mathrm{d} f(w_2)}{\mathrm{d} w_2} \frac{\mathrm{d} g(w_1)}{\mathrm{d} w_1}\frac{\mathrm{d} h(w_0)}{\mathrm{d} w_0}}$$

### <ins>1) Forward automatic differentiation</ins>

- Forward automatic differentiation ==divides the expression into a sequence of differentiable elementary operations==. The chain rule and well-known differentiation rules are then applied to each elementary operation.
- In forward AD mode, we traverse the chain rule from **right to left**.
That is, according to our simple composition function, we first compute $\frac{\mathrm{d} w_1}{\mathrm{d} x}$, then $\frac{\mathrm{d} w_2}{\mathrm{d} w_1}$ and finally $\frac{\mathrm{d} y}{\mathrm{d} w_2}$.
<br>

- Here is the recursive relation:

$$\boxed{\frac{\mathrm{d} w_i}{\mathrm{d} x} = \frac{\mathrm{d} w_i}{\mathrm{d} w_{i-1}} \frac{\mathrm{d} w_{i-1}}{\mathrm{d} x}}$$

- More precisely, we compute the derivative starting from the **inside to the outside**:

$$\boxed{\frac{\mathrm{d} y}{\mathrm{d} x} = \frac{\mathrm{d} y}{\mathrm{d} w_{i-1}} \frac{\mathrm{d} w_{i-1}}{\mathrm{d} x} = \frac{\mathrm{d} y}{\mathrm{d} w_{i-1}} \left(\frac{\mathrm{d} w_{i-1}}{\mathrm{d} w_{i-2}} \frac{\mathrm{d} w_{i-2}}{\mathrm{d} x}\right) = \frac{\mathrm{d} y}{\mathrm{d} w_{i-1}} \left( \frac{\mathrm{d} w_{i-1}}{\mathrm{d} w_{i-2}}\left(\frac{\mathrm{d} w_{i-2}}{\mathrm{d} w_{i-3}} \frac{\mathrm{d} w_{i-3}}{\mathrm{d} x}\right)\right) = ...}$$

- Let's consider the following example:

$$
\boxed{
\begin{align*}
w_1 &= x_1 = 2 \\
w_2 &= x_2 = 3 \\
w_3 &= w_1w_2 \\
w_4 &= \sin(w_1) \\
w_5 &= w_3 + w_4 = y
\end{align*}
}
$$

- It can be represented as a **computational graph**.

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/1.png" width="40%">
</div>
<br>

- Since $y$ depends on $w_1$ and $w_2$, its derivative will depend on $w_1$ and $w_2$ too. Thus, we have to perform 2 sweeps (one chain rule per input): one for $\frac{\mathrm{d} y}{\mathrm{d} w_1}$ and one for $\frac{\mathrm{d} y}{\mathrm{d} w_2}$.

- $\boxed{\text{For } \frac{\mathrm{d} y}{\mathrm{d} w_1}:}$

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/2.png" width="80%">
</div>
<br>

- Thus,

$$
\boxed{
\begin{align*}
\frac{\mathrm{d} y}{\mathrm{d} w_1} &= \dot{w_4} + \dot{w_3} \\
&= \cos(2)*1 + (1*3 + 2*0) \\
&\approx 2.58
\end{align*}
}
$$

---

- $\boxed{\text{For } \frac{\mathrm{d} y}{\mathrm{d} w_2}:}$

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/3.png" width="80%">
</div>
<br>

- Thus,

$$
\boxed{
\begin{align*}
\frac{\mathrm{d} y}{\mathrm{d} w_2} &= \dot{w_4} + \dot{w_3} \\
&= \cos(2)*0 + (0*3 + 2*1) \\
&= 2
\end{align*}
}
$$

---

- Choosing which input derivative is set to 1 and which to 0 (here $\dot{w_1} = 1, \dot{w_2} = 0$ for the first sweep, and $\dot{w_1} = 0, \dot{w_2} = 1$ for the second) is called **seeding**.
- ==Forward AD is more efficient than Reverse AD for functions $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ with $m \gg n$ **(more outputs than inputs)** as only $n$ sweeps are necessary, compared to $m$ sweeps for Reverse AD.==

### <ins>2) Reverse automatic differentiation</ins>

- Reverse automatic differentiation is pretty much like Forward AD, but this time we traverse the chain rule from **left to right**. That is, according to our simple composition function, we first compute $\frac{\mathrm{d} y}{\mathrm{d} w_2}$, then multiply by $\frac{\mathrm{d} w_2}{\mathrm{d} w_1}$ and finally by $\frac{\mathrm{d} w_1}{\mathrm{d} x}$.
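- Before working through reverse mode in detail, here is a minimal dual-number sketch of the two forward-mode sweeps above. The `Dual` class and its operators are illustrative only; they are not part of Yaae:

```python
import numpy as np

class Dual:
    """A (value, derivative) pair; `dot` carries the derivative w.r.t. the seeded input."""
    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # Product rule.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    def sin(self):
        return Dual(np.sin(self.val), np.cos(self.val) * self.dot)

def f(w1, w2):
    w3 = w1 * w2
    w4 = w1.sin()
    return w3 + w4  # w5 = y

# Seed w1 (w1_dot = 1, w2_dot = 0): one sweep gives dy/dw1.
print(f(Dual(2.0, 1.0), Dual(3.0, 0.0)).dot)  # cos(2) + 3 ≈ 2.58
# Seed w2 (w1_dot = 0, w2_dot = 1): a second sweep gives dy/dw2.
print(f(Dual(2.0, 0.0), Dual(3.0, 1.0)).dot)  # 2.0
```

- Now, back to reverse mode.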
- Here is the recursive relation:

$$\boxed{\frac{\mathrm{d} y}{\mathrm{d} w_i} = \frac{\mathrm{d} y}{\mathrm{d} w_{i+1}} \frac{\mathrm{d} w_{i+1}}{\mathrm{d} w_{i}}}$$

- More precisely, we compute the derivative starting from the **outside to the inside**:

$$\boxed{\frac{\mathrm{d} y}{\mathrm{d} x} = \frac{\mathrm{d} y}{\mathrm{d} w_{i+1}} \frac{\mathrm{d} w_{i+1}}{\mathrm{d} x} = \left(\frac{\mathrm{d} y}{\mathrm{d} w_{i+2}} \frac{\mathrm{d} w_{i+2}}{\mathrm{d} w_{i+1}}\right)\frac{\mathrm{d} w_{i+1}}{\mathrm{d} x} = \left( \left(\frac{\mathrm{d} y}{\mathrm{d} w_{i+3}} \frac{\mathrm{d} w_{i+3}}{\mathrm{d} w_{i+2}}\right) \frac{\mathrm{d} w_{i+2}}{\mathrm{d} w_{i+1}}\right) \frac{\mathrm{d} w_{i+1}}{\mathrm{d} x} = ...}$$

- Let's consider the following example:

$$
\boxed{
\begin{align*}
w_1 &= x_1 = 2 \\
w_2 &= x_2 = 3 \\
w_3 &= w_1w_2 \\
w_4 &= \sin(w_1) \\
w_5 &= w_3 + w_4 = y
\end{align*}
}
$$

- It can be represented as a **computational graph**.

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/4.png" width="40%">
</div>
<br>

- Reverse automatic differentiation is divided into 2 parts:
    - A **forward pass** to store the intermediate values needed during the backward pass.
    - A **backward pass** to compute the derivatives.

---

### ☰ Forward pass

- We simply compute the value of each node. These values will be used during the backward pass to compute the derivatives.

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/5.png" width="70%">
</div>

### ☰ Backward pass

- Using the previous recursive formula, we go from the output back to the inputs.

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/6.png" width="90%">
</div>
<br>

- Thus,

$$\boxed{
\begin{align*}
\frac{\mathrm{d} y}{\mathrm{d} w_1} &= \bar{w_1} = \cos(2) + 3 \approx 2.58 \\
\frac{\mathrm{d} y}{\mathrm{d} w_2} &= \bar{w_2} = 2
\end{align*}
}$$

---

- ==Reverse AD is more efficient than Forward AD for functions $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$ with $m \ll n$ **(fewer outputs than inputs)** as only $m$ sweeps are necessary, compared to $n$ sweeps for Forward AD.==

## III) Implementation

- There are 2 ways to implement autodiff:
    - **source-to-source transformation** <a style="color:red">(hard)</a>:
        - Uses a compiler to convert source code with mathematical expressions into source code with automatic differentiation expressions.
    - **operator overloading** <a style="color:green">(easy)</a>:
        - User-defined data types and the operator overloading features of a language are used to implement automatic differentiation.
- Moreover, we usually have **fewer outputs than inputs in a neural network scenario**.
- Therefore, it is more efficient to implement **Reverse AD** rather than **Forward AD**.
- Our implementation revolves around the class `Node`, which is defined below with its main components:

```python
class Node:

    def __init__(self, data, requires_grad=True, children=[]):
        pass

    def zero_grad(self):
        pass

    def backward(self, grad=None):
        pass

    # Operators.
    def __add__(self, other): # built-in python operators.
        pass

    def matmul(self, other): # custom operators.
        pass

    ... # More operators.

    # Functions.
    def sin(self):
        pass

    ... # More functions.
```

- Let's go through each method.
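- Before doing so, here is a condensed, standalone sketch of the pattern with scalar data, plain-float gradients, and a single `sin` operation. It is a simplified stand-in for illustration only, not Yaae's actual code:

```python
import numpy as np

class TinyNode:
    """Scalar-only stand-in for Yaae's Node (gradients are plain floats)."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self.children = list(children)
        self._compute_derivatives = lambda: None  # No-op until an operation overwrites it.

    def sin(self):
        op = TinySin(self)
        out = op.forward_pass()
        out._compute_derivatives = op.compute_derivatives
        return out

    def backward(self):
        # Topological sort, then propagate gradients from the output back to the inputs.
        topo, visited = [], set()
        def visit(v):
            if v not in visited:
                visited.add(v)
                for child in v.children:
                    visit(child)
                topo.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._compute_derivatives()

class TinySin:
    """One operation = forward_pass (build the output node) + compute_derivatives (chain rule)."""
    def __init__(self, node1):
        self.node1 = node1

    def forward_pass(self):
        self.out = TinyNode(np.sin(self.node1.data), children=[self.node1])
        return self.out

    def compute_derivatives(self):
        # d(sin x)/dx = cos(x), scaled by the gradient flowing into the output node.
        self.node1.grad += np.cos(self.node1.data) * self.out.grad

x = TinyNode(2.0)
y = x.sin()
y.backward()
print(x.grad)  # cos(2) ≈ -0.42
```

- The real `Node` follows the same wiring: an operation class performs the forward pass, and its `compute_derivatives` method is stored on the output node so that it can be called later during `backward()`.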
---

### 1) \_\_init\_\_() method

```python=
def __init__(self, data, requires_grad=True, children=[]):
    self.data = data if isinstance(data, np.ndarray) else np.array(data)
    self.requires_grad = requires_grad
    self.children = children
    self.grad = None
    if self.requires_grad:
        self.zero_grad()
    # Stores function.
    self._compute_derivatives = lambda: None

def zero_grad(self):
    self.grad = Node(np.zeros_like(self.data, dtype=np.float64), requires_grad=False)
```

- The **\_\_init\_\_()** method has the following arguments:
    - `data`: a scalar/tensor value.
    - `requires_grad`: a boolean that tells us if we want to keep track of the gradient.
    - `children`: a list to keep track of children.
- It also defines other attributes:
    - `grad`: a zero scalar/tensor with the same shape as `data` (created only if `requires_grad` is True).
    - `_compute_derivatives`: a no-op lambda function for now, used during the **backward()** method (we will dive into it later on).

### 2) Operators & Functions

```python=
# Operators.
def __add__(self, other): # Operator overloading.
    op = Add(self, other)
    out = op.forward_pass()
    out._compute_derivatives = op.compute_derivatives
    return out

def matmul(self, other): # Custom matrix multiplication function.
    op = Matmul(self, other)
    out = op.forward_pass()
    out._compute_derivatives = op.compute_derivatives
    return out

# Functions.
def sin(self):
    op = Sin(self)
    out = op.forward_pass()
    out._compute_derivatives = op.compute_derivatives
    return out

# ------
# To make things clearer, I removed some lines of code.
# Check the Yaae repository for the full implementation.

class Add():
    def __init__(self, node1, node2):
        pass
    def forward_pass(self):
        # Compute value and store children.
        pass
    def compute_derivatives(self):
        pass

class Matmul():
    def __init__(self, node1, node2):
        pass
    def forward_pass(self):
        # Compute value and store children.
        pass
    def compute_derivatives(self):
        pass

class Sin():
    def __init__(self, node1):
        pass
    def forward_pass(self):
        # Compute value and store children.
        pass
    def compute_derivatives(self):
        pass
```

- As you can see, the main strength of my implementation is that both **Operators and Functions follow the same structure**.
- Indeed, the code structure was designed so that it **explicitly reflects the Reverse AD logic**, that is, ==first a forward pass and then a backward pass (to compute the derivatives).==
- For each operator/function, `forward_pass()` creates a new `Node` holding the operator/function output, with its inputs stored in `children`.
- Each `grad` is itself a `Node` whose value is a zero scalar/tensor matching `data`.
- The gradients are computed by the `compute_derivatives()` method of each operator/function class.
- A reference to this method is then **stored in `_compute_derivatives` of the resulting node**, replacing the initial no-op lambda.
- Notice that this method is not executed yet. We will see why shortly.

### 3) backward() method

```python=
def backward(self, grad=None):

    L, visited = [], set()
    topo = topo_sort(self, L, visited)

    if grad is None:
        if self.shape == ():
            self.grad = Node(1, requires_grad=False)
        else:
            raise RuntimeError('backward: grad must be specified for non-0 tensor')
    else:
        self.grad = Node(grad, requires_grad=False) if isinstance(grad, (np.ndarray, list)) else grad

    for v in reversed(topo):
        v._compute_derivatives()

def topo_sort(v, L, visited):
    """
    Returns list of all of the children in the graph in topological order.

    Parameters:
    - v: last node in the graph.
    - L: list of all of the children in the graph in topological order (empty at the beginning).
    - visited: set of visited children.
    """
    if v not in visited:
        visited.add(v)
        for child in v.children:
            topo_sort(child, L, visited)
        L.append(v)
    return L
```

- In the **backward()** method, we need to propagate the gradient from the output to the inputs, as done in the previous example.
- To do so, we have to:
    - Set `grad` of the output to 1.
    - Perform a **topological sort**.
    - Call the `_compute_derivatives()` lambda function of each node, which computes and stores its gradients.
- A **topological sort** is used for problems where objects must be processed in a particular order: every node has to appear after the nodes it depends on.
- A typical example would be a set of tasks where certain tasks must be executed after other ones. Such an example can be represented by a **Directed Acyclic Graph (DAG)**, which is exactly the graph we designed.

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/7.png" width="70%">
<figcaption>Figure: A computational graph (DAG) </figcaption>
</div>
<br>

- On top of that, order matters here (since we are working with chain rules). Hence, we want the node $w_5$ to propagate the gradient to nodes $w_3$ and $w_4$ first.
- Now that we know a **topological sort** is needed, let's perform one. For the above DAG, the topological sort gives `topo = [node_w1, node_w2, node_w3, node_w4, node_w5]`.
- We then reverse `topo` and call, for each node in `topo`, the `_compute_derivatives()` method, which is the ==**lambda function** containing the node derivative computations, in charge of propagating the gradient through the DAG.==

## IV) Application

- Let's write down our previous example using Yaae and analyze it step by step.

```python
# Yaae.
w1 = Node(2, requires_grad=True)
w2 = Node(3, requires_grad=True)
w3 = w2 * w1
w4 = w1.sin()
w5 = w3 + w4
z = w5
z.backward()
print(w1.grad.data) # 2.5838531634528574 ≈ 2.58
print(w2.grad.data) # 2.0
```

- Let's first define the following:

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-1.png" width="60%">
</div>
<br>

- In the following figures, the lambda function `l` is marked with a check symbol to indicate that it is no longer empty.
<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-2.png" width="90%">
</div>
<br>

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-3.png" width="90%">
</div>
<br>

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-4.png" width="90%">
</div>
<br>

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-5.png" width="90%">
</div>
<br>

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-6.png" width="90%">
</div>
<br>

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-7.png" width="90%">
</div>
<br>

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-8.png" width="90%">
</div>
<br>

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/application-9.png" width="90%">
</div>
<br>

- It seems to work properly! Since [Yaae](https://github.com/3outeille/Yaae) also supports tensor operations, we can easily build a small neural network library on top of it and work out some simple regression/classification problems!
- Here are the results from the [demo_regression.ipynb](https://github.com/3outeille/Yaae/blob/master/src/example/demo_regression.ipynb) and [demo_classification.ipynb](https://github.com/3outeille/Yaae/blob/master/src/example/demo_classification.ipynb) files, which can be found in the [repository example folder](https://github.com/3outeille/Yaae).

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/8.gif" width="70%">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/9.gif" width="70%">
</div>
<br>

- Still skeptical? We can even go further by writing our own [GAN](https://github.com/3outeille/GANumpy).

<div style="text-align: center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-deep-learning/yaae/10.gif" width="60%">
</div>
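- To give a concrete flavour of such a training loop, here is a minimal scalar linear-regression sketch written against the `Node` API used above. The data, learning rate, import path, and the assumption that gradients accumulate into `.grad` (and are reset by `zero_grad()`) are illustrative; the real demos in the repository are more complete:

```python
import numpy as np
from yaae.engine import Node  # Illustrative import path; adapt it to the repository layout.

# Toy dataset: y = 3x + 2 plus a little noise.
xs = np.linspace(-1, 1, 20)
ys = 3 * xs + 2 + 0.05 * np.random.randn(20)

w = Node(0.0, requires_grad=True)
b = Node(0.0, requires_grad=True)
lr = 0.1

for epoch in range(200):
    w.zero_grad()
    b.zero_grad()
    # Sum of squared errors, built only from the + and * operators shown above.
    loss = Node(0.0)
    for x, y in zip(xs, ys):
        pred = w * Node(x) + b
        diff = pred + Node(-y)
        loss = loss + diff * diff
    loss.backward()
    # Vanilla gradient descent on the raw numpy values.
    w.data = w.data - lr * w.grad.data / len(xs)
    b.data = b.data - lr * b.grad.data / len(xs)

print(w.data, b.data)  # Should approach 3 and 2.
```

- Because `backward()` seeds the output gradient with 1 only for a 0-d tensor, the loss must be a scalar node (or an explicit `grad` must be passed), which is why the squared errors are summed before calling it.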