# AIC2021 Project1 - TPU
###### tags: `aic2021`
> Student ID: P76091226, Name: 鄭惟
[TOC]
## Goals
- [x] Pass testbenches 1 to 3
- [x] Support `(M*K)*(K*N)`
- [x] Pass both pre-sim and post-sim
## Matrix multiplication algorithm

- This TPU design uses [block matrix multiplication](https://en.wikipedia.org/wiki/Block_matrix), illustrated in the figure above. To compute block $C_{11}$, the partial products are accumulated: $C_{11} = A_{11} B_{11} + A_{12} B_{21} + A_{13} B_{31} + A_{14} B_{41}$.
- Matrix A is of size MxK, matrix B is of size KxN, and matrix C is of size MxN. The high-level idea of the algorithm is as follows:
```cpp=1
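// A is M x K, B is K x N, C is M x N and starts zero-initialized.
for (int i = 0; i < M; ++i)             // rows of C
    for (int j = 0; j < N; ++j)         // columns of C
        for (int k = 0; k < K; ++k)     // shared dimension
            C[i][j] += A[i][k] * B[k][j];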
```
- Note that for matrix B this algorithm imposes a **column-stride** access pattern on the memory system, which is a problem because most memory systems are row-major. The simplest fix is to store matrix B **transposed**, which converts every column-stride access into a row-stride access. This is exactly how matrix B is organized in SRAM in this project.
- Another important parameter of this algorithm is the block size. It is crucial on general-purpose systems such as CPUs and GPUs, where the block size is normally tuned to match the size of on-chip memory (i.e. the cache on a CPU or local memory on a GPU). In this project we are limited to 4x4 MACs; therefore, the **matrix block size is set to 4x4**.
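
The three points above (block-wise accumulation, transposed storage of B, and a fixed 4x4 block) can be sketched in software. This is a minimal illustration, not the hardware implementation: the function names are hypothetical, and zero-padding the operands to a multiple of 4 is an assumption about how sizes that are not multiples of the block size could be handled.

```python
BLOCK = 4  # block size fixed by the 4x4 MAC array


def pad_to_block(mat, rows, cols):
    """Zero-pad a rows x cols matrix so both dimensions are multiples of BLOCK."""
    pr = -(-rows // BLOCK) * BLOCK  # ceiling to the next multiple of BLOCK
    pc = -(-cols // BLOCK) * BLOCK
    return [[mat[r][c] if r < rows and c < cols else 0 for c in range(pc)]
            for r in range(pr)]


def blocked_matmul(a, b, m, k, n):
    """Compute C = A x B block by block, with B held transposed."""
    a_p = pad_to_block(a, m, k)
    # Store B transposed so that bt[j][x] == b[x][j]; walking a column of B
    # then becomes a row-stride walk over bt.
    bt = [[b[x][j] for x in range(k)] for j in range(n)]
    bt_p = pad_to_block(bt, n, k)
    pm, pk, pn = len(a_p), len(a_p[0]), len(bt_p)
    c = [[0] * pn for _ in range(pm)]
    for bi in range(0, pm, BLOCK):          # block row of C
        for bj in range(0, pn, BLOCK):      # block column of C
            for bk in range(0, pk, BLOCK):  # accumulate partial block products
                for i in range(bi, bi + BLOCK):
                    for j in range(bj, bj + BLOCK):
                        c[i][j] += sum(a_p[i][x] * bt_p[j][x]
                                       for x in range(bk, bk + BLOCK))
    # Drop the padding before returning the M x N result.
    return [row[:n] for row in c[:m]]
```

The innermost two loops correspond to what one pass through the 4x4 array computes; the `bk` loop is the accumulation of partial products for a single block of C.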
## MAC design
- A MAC unit forwards its two 8-bit inputs to neighboring MAC units while accumulating the product of the two. The MACs are organized in the TPU as follows:

- The block diagram of a MAC unit is as follows:

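
The behavior described for a MAC unit can be modeled in a few lines. This is a behavioral sketch only: the class and attribute names are hypothetical, and details such as signedness and accumulator width are not modeled because they are not stated here.

```python
class Mac:
    """Behavioral model of one MAC cell: forward both inputs, accumulate their product."""

    def __init__(self):
        self.a_out = 0  # registered copy of the A operand, forwarded to a neighbor
        self.b_out = 0  # registered copy of the B operand, forwarded to a neighbor
        self.acc = 0    # running sum of products

    def step(self, a_in, b_in):
        """Model one clock edge."""
        self.acc += a_in * b_in
        self.a_out = a_in
        self.b_out = b_in

    def clear(self):
        """Reset the accumulator before starting a new output block."""
        self.acc = 0


mac = Mac()
mac.step(2, 3)
mac.step(4, 5)
print(mac.acc, mac.a_out, mac.b_out)
```

After the two steps the accumulator holds 2*3 + 4*5, and the outputs hold the most recent operands.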
## TPU architecture
This TPU design consists of the following units:
- Control unit
- Datapath unit
- Systolic array
- Two 14-byte buffers

### Dataflow for systolic array
```
                                          -------------------------
                                          |  0  |  0  |  0  | b33 |
                                          |  0  |  0  | b32 | b23 |
4X4 Systolic array                        |  0  | b31 | b22 | b13 |
for matrix multiplication                 | b30 | b21 | b12 | b03 |
A X B                                     | b20 | b11 | b02 |  0  |
                                          | b10 | b01 |  0  |  0  |
                                          | b00 |  0  |  0  |  0  |
-------------------------------------------------------------------
|  0  |  0  |  0  | a03 | a02 | a01 | a00 | c00 | c01 | c02 | c03 |
-------------------------------------------------------------------
|  0  |  0  | a13 | a12 | a11 | a10 |  0  | c10 | c11 | c12 | c13 |
-------------------------------------------------------------------
|  0  | a23 | a22 | a21 | a20 |  0  |  0  | c20 | c21 | c22 | c23 |
-------------------------------------------------------------------
| a33 | a32 | a31 | a30 |  0  |  0  |  0  | c30 | c31 | c32 | c33 |
-------------------------------------------------------------------
- The figure above shows how matrices A and B are skewed so that the two dataflows, when fed into the 4x4 systolic array, produce matrix C.
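
To see why the skewed streams produce the right result, the dataflow in the figure can be simulated cycle by cycle. The sketch below is a software model under stated assumptions, not the RTL: it assumes each cell registers its operands for one cycle before forwarding them (A to the right, B downward) and keeps its own accumulator; all function names are hypothetical.

```python
N = 4  # the array is 4x4


def skew(streams):
    """Delay stream i by i cycles and zero-pad, matching the figure."""
    length = len(streams[0]) + N - 1
    return [[0] * i + list(s) + [0] * (length - i - len(s))
            for i, s in enumerate(streams)]


def systolic_matmul(a, b):
    """Multiply two 4x4 matrices by simulating the skewed dataflow."""
    a_streams = skew(a)  # row i of A enters array row i from the left
    # Column j of B enters array column j from the top.
    b_streams = skew([[b[k][j] for k in range(N)] for j in range(N)])
    a_reg = [[0] * N for _ in range(N)]  # operand registers inside each cell
    b_reg = [[0] * N for _ in range(N)]
    acc = [[0] * N for _ in range(N)]
    stream_len = len(a_streams[0])
    cycles = stream_len + N - 1  # enough for the last operand to reach cell (3, 3)
    for t in range(cycles):
        new_a = [[0] * N for _ in range(N)]
        new_b = [[0] * N for _ in range(N)]
        for i in range(N):
            for j in range(N):
                edge_a = a_streams[i][t] if t < stream_len else 0
                edge_b = b_streams[j][t] if t < stream_len else 0
                a_in = edge_a if j == 0 else a_reg[i][j - 1]
                b_in = edge_b if i == 0 else b_reg[i - 1][j]
                acc[i][j] += a_in * b_in
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b  # all registers update on the same clock edge
    return acc
```

In this model, cell (i, j) sees `a[i][k]` and `b[k][j]` in the same cycle `t = i + j + k`, which is exactly what the staggered zeros in the figure arrange.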
### Overall FSM
- To implement the block matrix multiplication algorithm in hardware, an FSM with 9 states is designed for this project. These states implement the three-level nested loop structure shown in the previous section.
- The state transition diagram for all 9 states of the FSM is as follows (generated by the JasperGold Superlint App):
- State descriptions:
    - S1: Wait for the system reset signal
    - S2: Read a 4x4 block from buffer A
    - S4: Read a 4x4 block from buffer B
    - S8: Feed the two 4x4 blocks into the systolic array
    - S16: Check whether the multiplication result has been accumulated
    - S32: Write to matrix C
    - S64: Check whether all blocks in matrix B have been read
    - S128: Check whether all blocks in matrix A have been read
    - S256: Final state
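
The state numbers are powers of two, which suggests a one-hot encoding. The sketch below illustrates that encoding and how nine such states could walk the three nested loops. It is heavily assumption-based: the names in `State` are hypothetical labels for S1 to S256, and the transition order in `trace` is only one plausible reading of the state descriptions, not the transition diagram of the actual design.

```python
from enum import IntEnum


class State(IntEnum):
    # Hypothetical names; the values follow the S1 ... S256 numbering.
    RESET = 1        # S1
    READ_A = 2       # S2
    READ_B = 4       # S4
    FEED = 8         # S8
    CHECK_ACC = 16   # S16
    WRITE_C = 32     # S32
    CHECK_B = 64     # S64
    CHECK_A = 128    # S128
    DONE = 256       # S256


def trace(a_blocks, b_blocks, k_blocks):
    """Yield an assumed state sequence for the given block counts."""
    yield State.RESET
    for _ in range(a_blocks):            # block rows of A
        for _ in range(b_blocks):        # block columns of B
            for _ in range(k_blocks):    # blocks along the shared dimension
                yield State.READ_A
                yield State.READ_B
                yield State.FEED
                yield State.CHECK_ACC
            yield State.WRITE_C
            yield State.CHECK_B
        yield State.CHECK_A
    yield State.DONE


# Every state value has exactly one bit set, i.e. the encoding is one-hot.
print(all(bin(s).count("1") == 1 for s in State))
print([s.name for s in trace(1, 1, 1)])
```

With a one-hot encoding, each state maps to a single flip-flop, which keeps next-state decoding simple at the cost of more state bits.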
## Project directory hierarchy
:::spoiler click me
```
|____README.md
|____img
| |____matrix_b.png
| |____matrix_a.png
| |____full_system.png
| |____testbench.png
| |____top.png
|____tree
|____Makefile
|____.gitignore
|____build
| |____matrix_a.bin
| |____matrix_b.bin
| |____golden.bin
| |____matrix_define.v
|____conf
| |____simvision_conf
| | |____rtl.sv
| | |____post.sv
|____src
| |____ctrl.v
| |____tpu.v
| |____gen_def.sh
| |____mac.v
| |____define.v
| |____top.v
| |____dp.v
| |____matrix_define.v
| |____global_buffer.v
| |____def.v
|____script
| |____synthesize.tcl
| |____synopsys_dc.setup
| |____tpu.sdc
| |____superlint.tcl
|____sim
| |____inputs3
| | |____matrix_b.txt
| | |____matrix_a.txt
| |____inputs2
| | |____matrix_b.txt
| | |____matrix_a.txt
| |____define.v
| |____matmul.py
| |____matrix_define.v
| |____inputs1
| | |____matrix_b.txt
| | |____matrix_a.txt
| |____tsmc13_neg.v
| |____top_tb.v
```
:::
## Prerequisites
* Python 3 with NumPy
* ncverilog
* Design Compiler
* bash
## Synthesis report
### Process
- IC Contest_v2.1 CBDK (based on T18 process)
### Timing report
:::spoiler click me
```
****************************************
Report : timing
-path full
-delay max
-max_paths 1
Design : tpu
Version: Q-2019.12
Date : Sun May 30 15:23:50 2021
****************************************
Operating Conditions: slow Library: slow
Wire Load Model Mode: top
Startpoint: k[1] (input port clocked by clk)
Endpoint: ul_ctrl/curr_state_reg_1_
(rising edge-triggered flip-flop clocked by clk)
Path Group: clk
Path Type: max
Des/Clust/Port Wire Load Model Library
------------------------------------------------
tpu tsmc13_wl10 slow
Point                                         Incr       Path
--------------------------------------------------------------------------
clock clk (fall edge)                         7.50       7.50
clock network delay (ideal)                   0.50       8.00
input external delay                          0.00       8.00 f
k[1] (in)                                     0.00       8.00 f
ul_ctrl/k[1] (ctrl)                           0.00       8.00 f
ul_ctrl/U6/Y (OA21XL)                         0.54       8.54 f
ul_ctrl/U90/Y (NOR2X1)                        0.57       9.11 r
ul_ctrl/U12/Y (INVXL)                         0.46       9.56 f
ul_ctrl/U93/Y (OAI21XL)                       0.50      10.06 r
ul_ctrl/U10/Y (INVXL)                         0.33      10.39 f
ul_ctrl/U94/Y (NAND2BX1)                      0.31      10.70 f
ul_ctrl/U95/Y (INVXL)                         0.26      10.97 r
ul_ctrl/U7/Y (OA21XL)                         0.52      11.49 r
ul_ctrl/U96/Y (AOI21X1)                       0.35      11.83 f
ul_ctrl/U98/Y (OAI21XL)                       0.50      12.34 r
ul_ctrl/U99/Y (OAI21XL)                       0.30      12.64 f
ul_ctrl/U108/Y (NOR4X1)                       0.50      13.14 r
ul_ctrl/U22/Y (OAI22XL)                       0.30      13.44 f
ul_ctrl/U25/Y (AOI211XL)                      0.53      13.97 r
ul_ctrl/U33/Y (OAI21XL)                       0.29      14.27 f
ul_ctrl/curr_state_reg_1_/D (DFFRX1)          0.00      14.27 f
data arrival time                                       14.27
clock clk (rise edge)                        15.00      15.00
clock network delay (ideal)                   0.50      15.50
clock uncertainty                            -0.10      15.40
ul_ctrl/curr_state_reg_1_/CK (DFFRX1)         0.00      15.40 r
library setup time                           -0.26      15.14
data required time                                      15.14
--------------------------------------------------------------------------
data required time                                      15.14
data arrival time                                      -14.27
--------------------------------------------------------------------------
slack (MET)                                              0.87
```
:::
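
As a reading aid, the slack in the timing report can be recomputed from the report's own numbers. The report does not state its time unit; the arithmetic below is unit-agnostic, and the 6.27 term is simply the arrival time minus the 8.00 at the input port `k[1]`.

$$
\begin{aligned}
t_{\text{required}} &= 15.00 + 0.50 - 0.10 - 0.26 = 15.14 \\
t_{\text{arrival}} &= 7.50 + 0.50 + 0.00 + 6.27 = 14.27 \\
\text{slack} &= t_{\text{required}} - t_{\text{arrival}} = 15.14 - 14.27 = 0.87
\end{aligned}
$$

The path is launched on the falling clock edge at 7.50 and captured on the rising edge at 15.00, so the combinational logic in `ul_ctrl` has roughly half a clock period to settle.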
### Area report
:::spoiler click me
```
****************************************
Report : area
Design : tpu
Version: Q-2019.12
Date : Sun May 30 15:20:36 2021
****************************************
Library(s) Used:
slow (File: /home/nfs_cad/lib/CBDK_IC_Contest_v2.1/SynopsysDC/db/slow.db)
Number of ports: 1105
Number of nets: 5059
Number of cells: 3753
Number of combinational cells: 2765
Number of sequential cells: 849
Number of macros/black boxes: 0
Number of buf/inv: 423
Number of references: 4
Combinational area: 30154.311150
Buf/Inv area: 2982.331756
Noncombinational area: 27345.113110
Macro/Black Box area: 0.000000
Net Interconnect area: 465782.353027
Total cell area: 57499.424259
Total area: 523281.777287
Hierarchical area distribution
------------------------------
Global cell area Local cell area
------------------- ------------------------------
Hierarchical cell Absolute Percent Combi- Noncombi- Black-
Total Total national national boxes Design
-------------------------------- ---------- ------- ---------- ---------- ------ ---------------------------
tpu 57499.4243 100.0 943.7544 0.0000 0.0000 tpu
ul_ctrl 1432.6056 2.5 751.9482 609.3666 0.0000 ctrl
ul_ctrl/clk_gate_a_blk_idx_reg 23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_ctrl_2
ul_ctrl/clk_gate_b_blk_idx_reg 23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_ctrl_1
ul_ctrl/clk_gate_blk_local_idx_reg
23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_ctrl_0
ul_dp 55123.0643 95.9 9704.0359 15970.8361 0.0000 dp
ul_dp/clk_gate_K_reg 23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_dp_0
ul_dp/clk_gate_addr_a_reg 23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_dp_12
ul_dp/clk_gate_addr_b_reg 23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_dp_20
ul_dp/clk_gate_bufA_reg_0_ 23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_dp_19
ul_dp/clk_gate_bufA_reg_1_ 23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_dp_18
ul_dp/clk_gate_bufA_reg_2_ 23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_dp_17
ul_dp/clk_gate_bufA_reg_3_ 23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_dp_16
ul_dp/clk_gate_bufB_reg_0_ 23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_dp_11
ul_dp/clk_gate_bufB_reg_1_ 23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_dp_10
ul_dp/clk_gate_bufB_reg_2_ 23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_dp_9
ul_dp/clk_gate_bufB_reg_3_ 23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_dp_8
ul_dp/clk_gate_bufB_reg_4_ 23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_dp_7
ul_dp/clk_gate_in_c_reg 23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_dp_2
ul_dp/clk_gate_mat_b_reg 23.7636 0.0 0.0000 23.7636 0.0000 SNPS_CLOCK_GATE_HIGH_dp_4
ul_dp/sys_row_0__sys_col_0__ul_mac
1952.0100 3.4 1172.9034 779.1066 0.0000 mac_15
ul_dp/sys_row_0__sys_col_1__ul_mac
1952.0100 3.4 1172.9034 779.1066 0.0000 mac_14
ul_dp/sys_row_0__sys_col_2__ul_mac
1946.9178 3.4 1172.9034 774.0144 0.0000 mac_13
ul_dp/sys_row_0__sys_col_3__ul_mac
1687.2156 2.9 1171.2060 516.0096 0.0000 mac_12
ul_dp/sys_row_1__sys_col_0__ul_mac
1952.0100 3.4 1172.9034 779.1066 0.0000 mac_11
ul_dp/sys_row_1__sys_col_1__ul_mac
1952.0100 3.4 1172.9034 779.1066 0.0000 mac_10
ul_dp/sys_row_1__sys_col_2__ul_mac
1946.9178 3.4 1172.9034 774.0144 0.0000 mac_9
ul_dp/sys_row_1__sys_col_3__ul_mac
1687.2156 2.9 1171.2060 516.0096 0.0000 mac_8
ul_dp/sys_row_2__sys_col_0__ul_mac
1952.0100 3.4 1172.9034 779.1066 0.0000 mac_7
ul_dp/sys_row_2__sys_col_1__ul_mac
1952.0100 3.4 1172.9034 779.1066 0.0000 mac_6
ul_dp/sys_row_2__sys_col_2__ul_mac
1946.9178 3.4 1172.9034 774.0144 0.0000 mac_5
ul_dp/sys_row_2__sys_col_3__ul_mac
1687.2156 2.9 1171.2060 516.0096 0.0000 mac_4
ul_dp/sys_row_3__sys_col_0__ul_mac
1692.3078 2.9 1171.2060 521.1018 0.0000 mac_3
ul_dp/sys_row_3__sys_col_1__ul_mac
1692.3078 2.9 1171.2060 521.1018 0.0000 mac_2
ul_dp/sys_row_3__sys_col_2__ul_mac
1687.2156 2.9 1171.2060 516.0096 0.0000 mac_1
ul_dp/sys_row_3__sys_col_3__ul_mac
1429.2108 2.5 1171.2060 258.0048 0.0000 mac_0
-------------------------------- ---------- ------- ---------- ---------- ------ ---------------------------
Total 30154.3111 27345.1131 0.0000
```
:::
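
The share of the cell area taken by the 16 MAC units can be read off the hierarchical table. The short script below only sums the numbers printed in the report; the variable names are illustrative.

```python
# Total area of each MAC instance, copied from the hierarchical area table
# (mac_15 down to mac_0).
mac_areas = [
    1952.0100, 1952.0100, 1946.9178, 1687.2156,  # sys_row_0, columns 0..3
    1952.0100, 1952.0100, 1946.9178, 1687.2156,  # sys_row_1
    1952.0100, 1952.0100, 1946.9178, 1687.2156,  # sys_row_2
    1692.3078, 1692.3078, 1687.2156, 1429.2108,  # sys_row_3
]
total_cell_area = 57499.4243  # absolute total of the "tpu" row

mac_total = sum(mac_areas)
mac_share = mac_total / total_cell_area
print(f"MAC array: {mac_total:.4f} ({mac_share:.1%} of total cell area)")
```

The MAC array comes to about 29115.5, roughly half of the total cell area. The table also shows that the MACs in the last row and last column have a smaller noncombinational area than the interior ones.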
### Power report
:::spoiler
:::
## Makefile targets
* ```make init```
    * Initialize the project directory (always run `make init` after `make clean`)
* ```make test1```
    * Generate the ```A(2*2)*B(2*2)``` test case
* ```make test2```
    * Generate the ```A(4*4)*B(4*4)``` test case
* ```make test3```
    * Generate the ```A(4*K)*B(K*4)``` test case, where ```K=9```
* ```make monster```
    * Generate an ```A(M*K)*B(K*N)``` test case, where ```M```, ```K```, and ```N``` are randomly generated and each less than 10
* ```make syn```
    * Synthesize the design with Design Compiler
* ```make rtl```
    * Run the pre-sim (RTL simulation) on the existing test case
* ```make clean```
    * Remove the ```build/``` folder
## Results
- `make test1`
- 
- `make test2`
- 
- `make test3`
- 
- `make monster`
- 