# PYNQ Tutorial 4: Neural Network Processor

## Objective

After you complete this tutorial, you should be able to:
- Understand the **methodology** of designing an ANN accelerator in an SoC.

## Source Code

This repository contains all of the code required to follow this tutorial: https://github.com/yohanes-erwin/pemrograman_zynq/tree/main/pynq_part_4

## 1. Simple Neural Network

In this tutorial, we use a simple NN model from this reference: http://www.lsi-contest.com/2018/shiyou_3-2e.html. Let's consider the following example:
- There are four types of fruits: **orange**, **lemon**, **pineapple**, and **Japanese persimmon**.
- A man eats these four types of fruits and **rates the level of sweetness and sourness** of each fruit on a scale from 0 to 9.
- After rating the sweetness $k_1$ and sourness $k_2$, he decides which fruits he likes and which he dislikes.
- A fruit he likes is labeled $[t_1,t_2]=[1,0]$, and a fruit he dislikes is labeled $[t_1,t_2]=[0,1]$.

| Fruit $n$ | Sweetness $k_1[n]$ | Sourness $k_2[n]$ | Taste | Supervisor $[t_1[n],t_2[n]]$ |
|-------------|---------------------|--------------------|-------|------------------------------|
| Orange | 8 | 8 | Like | $[1,0]$ |
| Lemon | 8 | 5 | Dislike | $[0,1]$ |
| Pineapple | 5 | 8 | Dislike | $[0,1]$ |
| Persimmon | 5 | 5 | Dislike | $[0,1]$ |

So, this is a classification problem: the goal is to classify whether the man will like a given fruit or not.

## 2. NN Architecture

The NN architecture is from the LSI contest: http://www.lsi-contest.com/2018/shiyou_3-2e.html. It consists of an input layer, a hidden layer, and an output layer. In this tutorial, we only consider the design of forward propagation in hardware, implemented as matrix multiplications.

![ann_arch_1](https://hackmd.io/_uploads/r1qhVGXMyl.jpg =400x)

The parameters are defined as follows:
- $k_i$ is the input
- $w_{ij}^2$ is the weight for layer 2
- $b_{i}^2$ is the bias for layer 2
- $z_{i}^2$ is the sum of products for layer 2
- $a_{i}^2$ is the output of layer 2
- $w_{ij}^3$ is the weight for layer 3
- $b_{i}^3$ is the bias for layer 3
- $z_{i}^3$ is the sum of products for layer 3
- $a_{i}^3$ is the output of layer 3

These are the weight and bias values obtained from the training process.

![ann_arch_2](https://hackmd.io/_uploads/Hy2ZazQGJl.jpg =400x)

The values $z_{i}^2$ and $z_{i}^3$ are the sums of the products of the weights and the neuron inputs, plus the bias. For example, for the first hidden neuron, we calculate:

$$z_{1}^2=1.37\times k_1+1.37\times k_2+(-19.88)$$

The calculations for the rest of the neurons are the same. Then, to obtain the outputs $a_{i}^2$ and $a_{i}^3$, we apply the sigmoid function. For the same neuron, we calculate:

$$a_{1}^2=\frac{1}{1+e^{-z_{1}^2}}$$

Again, the calculations for the rest of the neurons are the same. To process four samples, the NN processes them one by one, calculating the values of $z$ and $a$ for every neuron, sample after sample. The first sample then produces the first output classification.

![ann_arch_3](https://hackmd.io/_uploads/ry9UNmmf1g.jpg)

From this process, we can see that the calculation of each neuron is independent of the others. Therefore, in hardware, we can calculate all of the neurons in parallel. The input samples are also independent of each other, so we can process all samples in parallel as well. It is as if we had four copies of the same network, each operating on a different input sample, and all of the outputs were produced at the same time.

![ann_arch_4](https://hackmd.io/_uploads/SJ9ktXmG1x.jpg)

This parallelism across neurons and networks can be described in matrix form, and the calculation can then be done as matrix multiplication. These are the steps:

1. Padding the input: $$\bf{K_{p}}= \begin{bmatrix}8 & 8 & 5 & 5\\ 8 & 5 & 8 & 5\\ 1 & 1 & 1 & 1\end{bmatrix}$$
2. Matrix multiplication for layer 2: $$\bf{Z_2=WB_2*K_p}$$ $$\bf{Z_2=\begin{bmatrix}1.37 & 1.37 & -19.88\\ 0.77 & 0.97 & -0.90\\ 1.05 & 0.64 & -0.89\end{bmatrix} * \begin{bmatrix}8 & 8 & 5 & 5\\ 8 & 5 & 8 & 5\\ 1 & 1 & 1 & 1\end{bmatrix}}=\begin{bmatrix}2.04 & -2.07 & -2.07 & -6.18\\ 13.02 & 10.11 & 10.71 & 7.80\\ 12.63 & 10.71 & 9.48 & 7.56\end{bmatrix}$$
3. Activation for layer 2: $$\bf{\sigma(Z_2)=A_2}$$ $$\sigma(\begin{bmatrix}2.04 & -2.07 & -2.07 & -6.18\\ 13.02 & 10.11 & 10.71 & 7.80\\ 12.63 & 10.71 & 9.48 & 7.56\end{bmatrix})=\begin{bmatrix}0.884 & 0.112 & 0.112 & 0.002\\ 1.000 & 0.999 & 0.999 & 0.999\\ 1.000 & 0.999 & 0.999 & 0.999\end{bmatrix}$$
4. Padding the layer 2 output: $$\bf{A_{2p}=\begin{bmatrix}0.884 & 0.112 & 0.112 & 0.002\\ 1.000 & 0.999 & 0.999 & 0.999\\ 1.000 & 0.999 & 0.999 & 0.999\\ 1 & 1 & 1 & 1\end{bmatrix}}$$
5. Matrix multiplication for layer 3: $$\bf{Z_3 = WB_3 * A_{2p}}$$ $$\bf{Z_3=\begin{bmatrix}7.11 & -1.31 & 0.08 & -2.59\\ -7.10 & 1.63 & 1.97 & 0.20\end{bmatrix} * \begin{bmatrix}0.884 & 0.112 & 0.112 & 0.002\\ 1.000 & 0.999 & 0.999 & 0.999\\ 1.000 & 0.999 & 0.999 & 0.999\\ 1 & 1 & 1 & 1\end{bmatrix}}=\begin{bmatrix}2.47 & -3.02 & -3.02 & -3.80\\ -2.48 & 3.00 & 3.00 & 3.78\end{bmatrix}$$
6. Activation for layer 3: $$\bf{\sigma(Z_3)=A_3}$$ $$\sigma(\begin{bmatrix}2.47 & -3.02 & -3.02 & -3.80\\ -2.48 & 3.00 & 3.00 & 3.78\end{bmatrix})=\begin{bmatrix}0.922 & 0.046 & 0.046 & 0.021\\ 0.077 & 0.952 & 0.952 & 0.977\end{bmatrix}\approx\begin{bmatrix}1 & 0 & 0 & 0\\ 0 & 1 & 1 & 1\end{bmatrix}$$

Thresholding the outputs at 0.5 gives the final classification, which matches the supervisor labels in the table above. This calculation can be verified in MATLAB; a NumPy sketch of the same check is shown below.
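A minimal NumPy version of the six steps above (the matrix names follow the text; this sketch is for illustration and is not the tutorial's MATLAB script):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: input samples (one column per fruit), padded with a row of
# ones so that the bias can be folded into the weight matrix.
K_p = np.array([[8, 8, 5, 5],
                [8, 5, 8, 5],
                [1, 1, 1, 1]], dtype=float)

# Trained layer-2 weights with the bias as the last column.
WB_2 = np.array([[1.37, 1.37, -19.88],
                 [0.77, 0.97,  -0.90],
                 [1.05, 0.64,  -0.89]])

# Trained layer-3 weights with the bias as the last column.
WB_3 = np.array([[ 7.11, -1.31, 0.08, -2.59],
                 [-7.10,  1.63, 1.97,  0.20]])

# Steps 2-3: layer-2 matrix multiplication and activation.
Z_2 = WB_2 @ K_p
A_2 = sigmoid(Z_2)

# Step 4: pad the layer-2 output with ones for the layer-3 bias.
A_2p = np.vstack([A_2, np.ones(A_2.shape[1])])

# Steps 5-6: layer-3 matrix multiplication and activation.
Z_3 = WB_3 @ A_2p
A_3 = sigmoid(Z_3)

print(np.round(A_3, 3))          # output probabilities
print((A_3 > 0.5).astype(int))   # classifications: [[1 0 0 0], [0 1 1 1]]
```

Note how padding with a row of ones folds the bias into the weight matrix, so each layer reduces to a single matrix multiplication, which is exactly the operation the hardware accelerates.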
## 3. RTL Design

### 3.1. Fixed-Point Representation

In this design, we use a 16-bit fixed-point representation (Q5.10 format):
- 1 sign bit
- 5 integer bits
- 10 fraction bits

Example:
- 16'b0 00000 1000000000 = 0.5

Use these MATLAB functions to convert decimal to fixed-point and vice versa:
- Decimal to fixed point: https://www.mathworks.com/matlabcentral/fileexchange/61669-decimal-to-fixed-point-q-format-converter
- Fixed point to decimal: https://www.mathworks.com/matlabcentral/fileexchange/61670-fixed-point-q-format-to-decimal-converter
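If you prefer Python, the same Q5.10 conversion can be sketched as follows (the helper names `to_q510` and `from_q510` are mine, chosen for this example):

```python
FRAC_BITS = 10   # Q5.10: 1 sign bit, 5 integer bits, 10 fraction bits
WIDTH = 16       # total word width

def to_q510(x):
    """Convert a decimal value to a 16-bit Q5.10 two's-complement word."""
    q = int(round(x * (1 << FRAC_BITS)))   # scale by 2^10 and round
    return q & ((1 << WIDTH) - 1)          # wrap into 16 bits

def from_q510(word):
    """Convert a 16-bit Q5.10 word back to a decimal value."""
    if word & (1 << (WIDTH - 1)):          # sign bit set -> negative value
        word -= 1 << WIDTH
    return word / (1 << FRAC_BITS)

print(format(to_q510(0.5), '016b'))   # 0000001000000000 = 16'b0_00000_1000000000
print(from_q510(to_q510(0.5)))        # 0.5
print(from_q510(to_q510(-19.88)))     # -19.8798828125 (quantized to 1/1024 steps)
```

The mask with `(1 << WIDTH) - 1` performs the two's-complement wrap that a 16-bit hardware word does implicitly.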
### 3.2. Systolic Matrix

Now we know that the core computation needed to process the NN is matrix multiplication. Therefore, we can design an RTL circuit for matrix multiplication. There are many architectures that can be used to implement matrix multiplication in RTL. In this case, we use a systolic array architecture, the same architecture used in the Google TPU accelerator.

![](https://weenslab.gitbook.io/~gitbook/image?url=https%3A%2F%2F4146991827-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FIsb2SAYKLkGlVOGOY0EE%252Fuploads%252F2sAzMjBa8S5UI6gIOx63%252Fsystolic_4x4.jpg%3Falt%3Dmedia%26token%3D16a19175-d9ea-4713-8773-47c35506758c&width=768&dpr=1&quality=100&sign=f823d7e3&sv=1)

A systolic array is a data-flow architecture constructed from several identical cells. Each cell is called a processing element (PE).

![](https://weenslab.gitbook.io/~gitbook/image?url=https%3A%2F%2F4146991827-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FIsb2SAYKLkGlVOGOY0EE%252Fuploads%252FLcZgWAp1rk5HxoRQDUWC%252Fpe.jpg%3Falt%3Dmedia%26token%3D22cbdfa0-10e3-4ff9-824e-6da5d151fd8d&width=768&dpr=1&quality=100&sign=6e1df3ab&sv=1 =300x)

Input A is called the moving input, and input B is called the stationary input. Every clock cycle, input A enters the systolic array diagonally, and output Y comes out of the systolic array diagonally. The systolic array architecture is designed to address the communication requirements of parallel computer systems in order to achieve both balance and scaling.

This is an example of how the systolic array architecture multiplies two 4x4 matrices $A$ and $B$.

![Screenshot 2024-11-14 152509](https://hackmd.io/_uploads/ByZ-44Qz1g.png)

![](https://weenslab.gitbook.io/~gitbook/image?url=https%3A%2F%2F4146991827-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FIsb2SAYKLkGlVOGOY0EE%252Fuploads%252FtXOGF3RqvTccTGPOtauk%252Fsys_anim.gif%3Falt%3Dmedia%26token%3D207d4d75-a957-4ce4-9d03-cecc96527f17&width=768&dpr=1&quality=100&sign=c9fff21f&sv=1)

**systolic.v**: https://github.com/yohanes-erwin/pemrograman_zynq/blob/main/pynq_part_4/systolic.v

### 3.3. Sigmoid

A lookup table (LUT) circuit is often used to implement complex mathematical operations that are not easy to implement in an FPGA. For example, the sigmoid function, which is commonly used as an activation function in neural networks, is defined by the following formula:

$$\sigma(x)=\frac{1}{1+e^{-x}}$$

Since implementing the exponential in Verilog is not easy, we can instead create a table that maps the input x to the sigmoid output, as follows:

| x | σ(x) |
|-------|------|
| ... | ... |
| -0.1250 | 0.4687500000 |
| -0.0625 | 0.4843750000 |
| +0.0000 | 0.5000000000 |
| +0.0625 | 0.5146484375 |
| ... | ... |

Then, we can create a circuit that implements this table, as shown in the following figure.

![](https://weenslab.gitbook.io/~gitbook/image?url=https%3A%2F%2F4146991827-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FIsb2SAYKLkGlVOGOY0EE%252Fuploads%252FRU3nHSHkWa2oRCApEMvt%252Flut_sigmoid.jpg%3Falt%3Dmedia%26token%3Db031b3bc-f8c6-4b61-ada5-a066365ab027&width=768&dpr=1&quality=100&sign=36b37d49&sv=1 =500x)

**sigmoid.v**: https://github.com/yohanes-erwin/pemrograman_zynq/blob/main/pynq_part_4/sigmoid.v
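To see how such a LUT behaves, here is a small Python model of table generation and lookup. The input range of $[-8, 8]$, the step of 0.0625, and truncation to 10 fraction bits are assumptions for illustration; the actual sigmoid.v may use a different range or rounding mode, so individual entries can differ in the last bit from the table above.

```python
import numpy as np

FRAC_BITS = 10            # quantize outputs to 10 fraction bits (Q5.10)
STEP = 0.0625             # input resolution of the table (1/16)
X_MIN, X_MAX = -8.0, 8.0  # assumed input range covered by the table

# Precompute one quantized sigmoid value per input step (truncation).
xs = np.arange(X_MIN, X_MAX + STEP, STEP)
lut = np.floor(1.0 / (1.0 + np.exp(-xs)) * (1 << FRAC_BITS)) / (1 << FRAC_BITS)

def sigmoid_lut(x):
    """Approximate sigmoid(x): clamp to the table range, then index by step."""
    x = min(max(x, X_MIN), X_MAX)
    idx = int(round((x - X_MIN) / STEP))
    return lut[idx]

print(sigmoid_lut(0.0))     # 0.5
print(sigmoid_lut(-0.125))  # 0.46875
```

In hardware, the clamp-and-index step reduces to using a slice of the fixed-point input word as the ROM address, which is why the LUT approach is cheap on an FPGA.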
## 4. NN Top Module

So, we have two matrix multiplication processes, one for layer 2 and one for layer 3. Do we need two matrix multiplication modules? The layer 3 multiplication depends on the result from layer 2; in other words, layer 3 cannot be processed before we obtain the result from layer 2. Therefore, we can use a single matrix multiplication module in a time-shared hardware architecture.

![](https://weenslab.gitbook.io/~gitbook/image?url=https%3A%2F%2F4146991827-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FIsb2SAYKLkGlVOGOY0EE%252Fuploads%252F5aDlXXyJpASZMomO4A1o%252Fnn_module.jpg%3Falt%3Dmedia%26token%3D6f4b5597-6a7d-44dd-95b7-b92b497ea3f3&width=768&dpr=1&quality=100&sign=85b05681&sv=1)

This is how the data flows:
1. The input data are read from the input BRAM, then sent to the stationary input of the systolic array.
2. The weights and biases for the hidden layer are read from the weight BRAM, then sent to the moving input of the systolic array.
3. The output of the systolic array is processed by the sigmoid module, and the result is sent back to the stationary input of the systolic array.
4. The weights and biases for the output layer are read from the weight BRAM, then sent to the moving input of the systolic array.
5. The output of the systolic array is processed by the sigmoid module, and the result is sent to the output BRAM.

The following figures show how the data is stored inside the BRAMs. The BRAM data width is 64 bits. For the weights and biases, the data depth is 8, while for the input and output, it is 4.

![Screenshot 2024-11-19 093548](https://hackmd.io/_uploads/H1SnKdtM1x.png)

![Screenshot 2024-11-19 093754](https://hackmd.io/_uploads/rknG5utGke.png)

The following timing diagram shows the operation of the module.

![nn_timing](https://hackmd.io/_uploads/SJdrNYKfkl.jpg)

**nn.v**: https://github.com/yohanes-erwin/pemrograman_zynq/blob/main/pynq_part_4/nn.v

## 5. AXI-Stream Module

Now that we have the NN module, the next step is to connect it to a standard interface that the CPU can understand. In this case, we use the AXI-Stream interface.

![](https://weenslab.gitbook.io/~gitbook/image?url=https%3A%2F%2F4146991827-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FIsb2SAYKLkGlVOGOY0EE%252Fuploads%252FhVUviD7Hp1C5WWWL7vBy%252Faxis_nn_module.jpg%3Falt%3Dmedia%26token%3D5bb13a2d-376e-44bb-8cc8-fa2e7cc8e442&width=768&dpr=1&quality=100&sign=256c4e43&sv=1)

This is how the data flows:
1. Both the weights and the input data are streamed in through the `S_AXIS` port.
2. The demultiplexer routes each piece of data to either the weight port or the input port of the NN module.
3. The control unit starts the NN module and waits until it is finished.
4. The output data is streamed out through the `M_AXIS` port.

![nn_fsm](https://hackmd.io/_uploads/HJJ5wOYzkx.jpg)

**axis_nn.v**: https://github.com/yohanes-erwin/pemrograman_zynq/blob/main/pynq_part_4/axis_nn.v

## 6. SoC Design

The following figure shows the overall SoC design for the NN accelerator. We use the AXI DMA IP, which converts memory-mapped data to stream data and vice versa.

![block_diagram_nn](https://hackmd.io/_uploads/HyvXqQVG1e.jpg)

The following figure shows the block design in Vivado.

![Screenshot 2024-11-15 091050](https://hackmd.io/_uploads/Sk90pmNMyl.png)

## 7. Software Design

First, program the FPGA and initialize the DMA.

![Screenshot 2024-11-15 101711](https://hackmd.io/_uploads/BJ9HANVG1e.png)

Next, write the weights, biases, and inputs to a DDR memory location. The data is stored in 32-bit words in DDR memory, so each 64-bit BRAM word is split across two 32-bit words.

![nn_mmap_64_32](https://hackmd.io/_uploads/SkE5yYYGJx.jpg =450x)

![Screenshot 2024-11-15 101811](https://hackmd.io/_uploads/BJJL0VEzkx.png)

Then, transfer the weights, biases, and inputs to the AXIS NN module, and read the result back to a DDR location.

![Screenshot 2024-11-15 101841](https://hackmd.io/_uploads/Syyw04Nz1l.png)

Finally, check the output of the NN and compare it to the model.

![Screenshot 2024-11-15 101904](https://hackmd.io/_uploads/rkwP0VVGJx.png)

**nn.ipynb**: https://github.com/yohanes-erwin/pemrograman_zynq/blob/main/pynq_part_4/nn.ipynb
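The core of such a notebook flow typically looks like the sketch below. The bitstream name, the DMA instance name, and the buffer sizes here are assumptions derived from the block design and BRAM depths described earlier, not copied verbatim from nn.ipynb.

```python
import numpy as np
from pynq import Overlay, allocate

# Program the FPGA with the accelerator bitstream (file name assumed).
ol = Overlay("nn.bit")
dma = ol.axi_dma_0   # DMA instance name as it appears in the block design

# Each 64-bit BRAM word is split into two 32-bit DDR words: the weight/bias
# stream (depth 8) plus the input stream (depth 4) gives (8 + 4) * 2 = 24
# words, and the output (depth 4) gives 4 * 2 = 8 words.
in_buf = allocate(shape=(24,), dtype=np.uint32)
out_buf = allocate(shape=(8,), dtype=np.uint32)

in_buf[:] = 0   # pack the Q5.10 weights, biases, and inputs here

# Stream the data to the AXIS NN module and read the result back.
dma.sendchannel.transfer(in_buf)
dma.recvchannel.transfer(out_buf)
dma.sendchannel.wait()
dma.recvchannel.wait()

print(out_buf)   # 32-bit words holding the Q5.10 outputs A_3
```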
## 8. Performance

Run 1 million NN inference operations on the FPGA.

![Screenshot 2024-11-15 102056](https://hackmd.io/_uploads/SJnPC4Ef1e.png)

Compare with the software calculation.

![Screenshot 2024-11-15 102109](https://hackmd.io/_uploads/SkgORNEfJl.png)

**nn_compare.ipynb**: https://github.com/yohanes-erwin/pemrograman_zynq/blob/main/pynq_part_4/nn_compare.ipynb

The greater the number of inferences, the faster the HW computation becomes relative to the SW computation.

| Number of Inferences | HW Computation Time [s] | SW Computation Time [s] |
|---------------------|-------------------------|-------------------------|
| 1 | 0.013 | 0.005 |
| 10 | 0.018 | 0.010 |
| 100 | 0.059 | 0.058 |
| 1000 | 0.445 | 0.507 |
| 10000 | 4.2 | 4.9 |
| 100000 | 42.6 | 49.3 |
| 1000000 | 416.6 | 468.482 |

Several factors explain this result:
- **ARM Cortex-A9 clock = 650 MHz vs. FPGA clock = 100 MHz**
- **There is a lot of idle time in the AXIS packet transactions for large data.**

The ANN accelerator can be optimized further for large data by reducing the idle time between input packets in the AXI stream, for example by using a multi-core systolic design.

![axis_ann_module](https://hackmd.io/_uploads/r1xGSYtM1l.png)

More real-world examples of systolic arrays in accelerators:
- https://ieeexplore.ieee.org/document/9576092
- https://ieeexplore.ieee.org/document/10235901
- https://cloud.google.com/blog/products/ai-machine-learning/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu