AIC2021 Project1 - TPU

# AIC2021 Project1 - TPU
###### tags: `aic2021`

> Student ID: P76091226, Name: 鄭惟

[TOC]

## Goals
- [x] Pass testbench 1 ~ 3
- [x] Support `(M*K)*(K*N)`
- [x] Pass both pre-sim and post-sim

## Matrix multiplication algorithm
![](https://i.imgur.com/sJ1mbMu.png)
- [Block matrix multiplication](https://en.wikipedia.org/wiki/Block_matrix) is used in my TPU design. The algorithm is illustrated in the figure above. To compute block $C_{11}$, the partial products $A_{11} * B_{11} + A_{12} * B_{21} + A_{13} * B_{31} + A_{14} * B_{41}$ are accumulated.
- Matrix A is of size MxK, matrix B is of size KxN, and matrix C is of size MxN. The high-level idea of the algorithm is as follows:
  ```cpp
  // A: M x K, B: K x N, C: M x N (C starts at zero)
  for (int i = 0; i < M; ++i)
      for (int j = 0; j < N; ++j)
          for (int k = 0; k < K; ++k)
              C[i][j] += A[i][k] * B[k][j];
  ```
- Notice that for matrix B, this algorithm imposes a **column stride** on the memory system. This is a problem because most memory systems are row-major. The simplest way to tackle this is to store matrix B **transposed**, which converts every column-stride access into a row-stride access. This is exactly how matrix B is organized in SRAM in this project.
- Another important parameter in this algorithm is the block size. This parameter is crucial in general-purpose systems such as CPUs and GPUs, where the block size is normally tuned to match the size of on-chip memory (e.g. the cache size on a CPU or the local memory size on a GPU). In our case, we are limited to 4x4 MACs in this project; therefore, the **matrix block size is set to 4x4**.

## MAC design
- A MAC unit passes its two 8-bit inputs on to other MAC units and, at the same time, accumulates the product of the two.
  The MACs are organized in the TPU as follows:
  ![](https://i.imgur.com/PkeEK0I.png)
- The block diagram of a MAC unit is as follows:
  ![](https://i.imgur.com/AkAFaEy.png)

## TPU architecture
This TPU design consists of the following units:
- Control unit
- Datapath unit
- Systolic array
- Two 14-byte buffers

![](https://i.imgur.com/tnDBnjI.jpg)

### Dataflow for systolic array
```
                                          -------------------------
                                          |  0  |  0  |  0  | b33 |
                                          |  0  |  0  | b32 | b23 |   4X4 Systolic array
                                          |  0  | b31 | b22 | b13 |   for matrix multiplication
                                          | b30 | b21 | b12 | b03 |   A X B
                                          | b20 | b11 | b02 |  0  |
                                          | b10 | b01 |  0  |  0  |
                                          | b00 |  0  |  0  |  0  |
-------------------------------------------------------------------
|  0  |  0  |  0  | a03 | a02 | a01 | a00 | c00 | c01 | c02 | c03 |
-------------------------------------------------------------------
|  0  |  0  | a13 | a12 | a11 | a10 |  0  | c10 | c11 | c12 | c13 |
-------------------------------------------------------------------
|  0  | a23 | a22 | a21 | a20 |  0  |  0  | c20 | c21 | c22 | c23 |
-------------------------------------------------------------------
| a33 | a32 | a31 | a30 |  0  |  0  |  0  | c30 | c31 | c32 | c33 |
-------------------------------------------------------------------
```
- The figure above shows how matrix A and matrix B are skewed so that feeding these two dataflows into a 4x4 systolic array produces matrix C.

### Overall FSM
- To implement the block matrix multiplication algorithm in hardware, an FSM with 9 states is designed for this project. These states implement the 3-level for-loop structure shown in the matrix multiplication algorithm section.
- The state transition diagram for all 9 states of the FSM is as follows:
  ![](https://i.imgur.com/eEvpEiq.png)
  (Generated with the JasperGold Superlint App)
- State descriptions:
    - S1: Wait for system reset signal
    - S2: Read a 4x4 block from buffer A
    - S4: Read a 4x4 block from buffer B
    - S8: Feed two 4x4 blocks into systolic array
    - S16: Check if the multiplication result is accumulated
    - S32: Write to matrix C
    - S64: Check if all blocks in matrix B are read
    - S128: Check if all blocks in matrix A are read
    - S256: Final state

## Project directory hierarchy
:::spoiler click me
```
|____README.md
|____img
| |____matrix_b.png
| |____matrix_a.png
| |____full_system.png
| |____testbench.png
| |____top.png
|____tree
|____Makefile
|____.gitignore
|____build
| |____matrix_a.bin
| |____matrix_b.bin
| |____golden.bin
| |____matrix_define.v
|____conf
| |____simvision_conf
| | |____rtl.sv
| | |____post.sv
|____src
| |____ctrl.v
| |____tpu.v
| |____gen_def.sh
| |____mac.v
| |____define.v
| |____top.v
| |____dp.v
| |____matrix_define.v
| |____global_buffer.v
| |____def.v
|____script
| |____synthesize.tcl
| |____synopsys_dc.setup
| |____tpu.sdc
| |____superlint.tcl
|____sim
| |____inputs3
| | |____matrix_b.txt
| | |____matrix_a.txt
| |____inputs2
| | |____matrix_b.txt
| | |____matrix_a.txt
| |____define.v
| |____matmul.py
| |____matrix_define.v
| |____inputs1
| | |____matrix_b.txt
| | |____matrix_a.txt
| |____tsmc13_neg.v
| |____top_tb.v
```
:::

## Prerequisite
* python3 with numpy
* ncverilog
* design compiler
* bash

## Synthesize report
### Process
- IC Contest_v2.1 CBDK (based on T18 process)

### Timing report
:::spoiler click me
```
****************************************
Report : timing
        -path full
        -delay max
        -max_paths 1
Design : tpu
Version: Q-2019.12
Date   : Sun May 30 15:23:50 2021
****************************************

Operating Conditions: slow   Library: slow
Wire Load Model Mode: top

  Startpoint: k[1] (input port clocked by clk)
  Endpoint: ul_ctrl/curr_state_reg_1_
            (rising edge-triggered flip-flop clocked by clk)
  Path Group: clk
  Path Type: max

  Des/Clust/Port     Wire Load Model       Library
  ------------------------------------------------
  tpu                tsmc13_wl10           slow

  Point                                       Incr       Path
  --------------------------------------------------------------------------
  clock clk (fall edge)                       7.50       7.50
  clock network delay (ideal)                 0.50       8.00
  input external delay                        0.00       8.00 f
  k[1] (in)                                   0.00       8.00 f
  ul_ctrl/k[1] (ctrl)                         0.00       8.00 f
  ul_ctrl/U6/Y (OA21XL)                       0.54       8.54 f
  ul_ctrl/U90/Y (NOR2X1)                      0.57       9.11 r
  ul_ctrl/U12/Y (INVXL)                       0.46       9.56 f
  ul_ctrl/U93/Y (OAI21XL)                     0.50      10.06 r
  ul_ctrl/U10/Y (INVXL)                       0.33      10.39 f
  ul_ctrl/U94/Y (NAND2BX1)                    0.31      10.70 f
  ul_ctrl/U95/Y (INVXL)                       0.26      10.97 r
  ul_ctrl/U7/Y (OA21XL)                       0.52      11.49 r
  ul_ctrl/U96/Y (AOI21X1)                     0.35      11.83 f
  ul_ctrl/U98/Y (OAI21XL)                     0.50      12.34 r
  ul_ctrl/U99/Y (OAI21XL)                     0.30      12.64 f
  ul_ctrl/U108/Y (NOR4X1)                     0.50      13.14 r
  ul_ctrl/U22/Y (OAI22XL)                     0.30      13.44 f
  ul_ctrl/U25/Y (AOI211XL)                    0.53      13.97 r
  ul_ctrl/U33/Y (OAI21XL)                     0.29      14.27 f
  ul_ctrl/curr_state_reg_1_/D (DFFRX1)        0.00      14.27 f
  data arrival time                                     14.27

  clock clk (rise edge)                      15.00      15.00
  clock network delay (ideal)                 0.50      15.50
  clock uncertainty                          -0.10      15.40
  ul_ctrl/curr_state_reg_1_/CK (DFFRX1)       0.00      15.40 r
  library setup time                         -0.26      15.14
  data required time                                    15.14
  --------------------------------------------------------------------------
  data required time                                    15.14
  data arrival time                                    -14.27
  --------------------------------------------------------------------------
  slack (MET)                                            0.87
```
:::

### Area report
:::spoiler click me
```
****************************************
Report : area
Design : tpu
Version: Q-2019.12
Date   : Sun May 30 15:20:36 2021
****************************************

Library(s) Used:

    slow (File: /home/nfs_cad/lib/CBDK_IC_Contest_v2.1/SynopsysDC/db/slow.db)

Number of ports:                         1105
Number of nets:                          5059
Number of cells:                         3753
Number of combinational cells:           2765
Number of sequential cells:               849
Number of macros/black boxes:               0
Number of buf/inv:                        423
Number of references:                       4

Combinational area:              30154.311150
Buf/Inv area:                     2982.331756
Noncombinational area:           27345.113110
Macro/Black Box area:                0.000000
Net Interconnect area:          465782.353027

Total cell area:                 57499.424259
Total area:                     523281.777287

Hierarchical area distribution
------------------------------

                                    Global cell area     Local cell area
                                    -------------------  ------------------------------
Hierarchical cell                   Absolute    Percent  Combi-      Noncombi-   Black-
                                    Total       Total    national    national    boxes   Design
----------------------------------  ----------  -------  ----------  ----------  ------  ---------------------------
tpu                                 57499.4243    100.0    943.7544      0.0000  0.0000  tpu
ul_ctrl                              1432.6056      2.5    751.9482    609.3666  0.0000  ctrl
ul_ctrl/clk_gate_a_blk_idx_reg         23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_ctrl_2
ul_ctrl/clk_gate_b_blk_idx_reg         23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_ctrl_1
ul_ctrl/clk_gate_blk_local_idx_reg     23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_ctrl_0
ul_dp                               55123.0643     95.9   9704.0359  15970.8361  0.0000  dp
ul_dp/clk_gate_K_reg                   23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_dp_0
ul_dp/clk_gate_addr_a_reg              23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_dp_12
ul_dp/clk_gate_addr_b_reg              23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_dp_20
ul_dp/clk_gate_bufA_reg_0_             23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_dp_19
ul_dp/clk_gate_bufA_reg_1_             23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_dp_18
ul_dp/clk_gate_bufA_reg_2_             23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_dp_17
ul_dp/clk_gate_bufA_reg_3_             23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_dp_16
ul_dp/clk_gate_bufB_reg_0_             23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_dp_11
ul_dp/clk_gate_bufB_reg_1_             23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_dp_10
ul_dp/clk_gate_bufB_reg_2_             23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_dp_9
ul_dp/clk_gate_bufB_reg_3_             23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_dp_8
ul_dp/clk_gate_bufB_reg_4_             23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_dp_7
ul_dp/clk_gate_in_c_reg                23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_dp_2
ul_dp/clk_gate_mat_b_reg               23.7636      0.0      0.0000     23.7636  0.0000  SNPS_CLOCK_GATE_HIGH_dp_4
ul_dp/sys_row_0__sys_col_0__ul_mac   1952.0100      3.4   1172.9034    779.1066  0.0000  mac_15
ul_dp/sys_row_0__sys_col_1__ul_mac   1952.0100      3.4   1172.9034    779.1066  0.0000  mac_14
ul_dp/sys_row_0__sys_col_2__ul_mac   1946.9178      3.4   1172.9034    774.0144  0.0000  mac_13
ul_dp/sys_row_0__sys_col_3__ul_mac   1687.2156      2.9   1171.2060    516.0096  0.0000  mac_12
ul_dp/sys_row_1__sys_col_0__ul_mac   1952.0100      3.4   1172.9034    779.1066  0.0000  mac_11
ul_dp/sys_row_1__sys_col_1__ul_mac   1952.0100      3.4   1172.9034    779.1066  0.0000  mac_10
ul_dp/sys_row_1__sys_col_2__ul_mac   1946.9178      3.4   1172.9034    774.0144  0.0000  mac_9
ul_dp/sys_row_1__sys_col_3__ul_mac   1687.2156      2.9   1171.2060    516.0096  0.0000  mac_8
ul_dp/sys_row_2__sys_col_0__ul_mac   1952.0100      3.4   1172.9034    779.1066  0.0000  mac_7
ul_dp/sys_row_2__sys_col_1__ul_mac   1952.0100      3.4   1172.9034    779.1066  0.0000  mac_6
ul_dp/sys_row_2__sys_col_2__ul_mac   1946.9178      3.4   1172.9034    774.0144  0.0000  mac_5
ul_dp/sys_row_2__sys_col_3__ul_mac   1687.2156      2.9   1171.2060    516.0096  0.0000  mac_4
ul_dp/sys_row_3__sys_col_0__ul_mac   1692.3078      2.9   1171.2060    521.1018  0.0000  mac_3
ul_dp/sys_row_3__sys_col_1__ul_mac   1692.3078      2.9   1171.2060    521.1018  0.0000  mac_2
ul_dp/sys_row_3__sys_col_2__ul_mac   1687.2156      2.9   1171.2060    516.0096  0.0000  mac_1
ul_dp/sys_row_3__sys_col_3__ul_mac   1429.2108      2.5   1171.2060    258.0048  0.0000  mac_0
----------------------------------  ----------  -------  ----------  ----------  ------  ---------------------------
Total                                                    30154.3111  27345.1131  0.0000
```
:::

### Power report
:::spoiler
:::

## Makefile target
* ```make init```
    * Initialize project directory (Always run `make init` after `make clean`)
* ```make test1```
    * Generate ```A(2*2)*B(2*2)``` test case
* ```make test2```
    * Generate ```A(4*4)*B(4*4)``` test case
* ```make test3```
    * Generate ```A(4*K)*B(K*4)``` test case,
      where ```K=9```
* ```make monster```
    * Generate ```A(M*K)*B(K*N)``` test case, where ```K<10```, ```M<10```, ```N<10``` are randomly generated
* ```make syn```
    * Synthesize design with design compiler
* ```make rtl```
    * Pre-sim with existing test case
* ```make clean```
    * This will remove the ```build/``` folder

## Results
- `make test1`
    - ![](https://i.imgur.com/2oICqJP.png)
- `make test2`
    - ![](https://i.imgur.com/kSXVG50.png)
- `make test3`
    - ![](https://i.imgur.com/hNBdEiL.png)
- `make monster`
    - ![](https://i.imgur.com/vsV4x3m.png)
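
The block algorithm and the skewed systolic dataflow described in the earlier sections can be cross-checked in software. The sketch below is a minimal NumPy model only; it is not the project's `sim/matmul.py` and it does not mirror the RTL. The names `systolic_tile` and `tpu_matmul`, the zero-padding of partial blocks, and the loop order over blocks are all assumptions made for illustration.

```python
import numpy as np

BLOCK = 4  # the design uses 4x4 MACs, so blocks are 4x4


def systolic_tile(a_tile, b_tile, acc):
    """Cycle-level model of an output-stationary 4x4 systolic array.

    Row i of A is delayed by i cycles and column j of B by j cycles,
    so PE (i, j) sees A[i, k] and B[k, j] together at cycle t = i + j + k.
    """
    n = BLOCK
    for t in range(3 * n - 2):  # the last operand pair meets at t = 3n - 3
        for i in range(n):
            for j in range(n):
                k = t - i - j
                if 0 <= k < n:  # outside this window the PE is fed zeros
                    acc[i, j] += int(a_tile[i, k]) * int(b_tile[k, j])
    return acc


def _pad_len(x):
    """Rows/columns needed to round x up to a multiple of BLOCK."""
    return -x % BLOCK


def tpu_matmul(a, b):
    """Blocked C = A x B for A (M x K) and B (K x N).

    Assumption: partial blocks are zero-padded up to 4x4.
    """
    m, k = a.shape
    k_b, n = b.shape
    assert k == k_b, "inner dimensions must match"

    a_p = np.pad(a, ((0, _pad_len(m)), (0, _pad_len(k)))).astype(np.int64)
    # B is held transposed, so a column of B becomes a contiguous row.
    bt_p = np.pad(b.T, ((0, _pad_len(n)), (0, _pad_len(k)))).astype(np.int64)

    c_p = np.zeros((a_p.shape[0], bt_p.shape[0]), dtype=np.int64)
    for bi in range(0, a_p.shape[0], BLOCK):          # block row of C
        for bj in range(0, bt_p.shape[0], BLOCK):     # block column of C
            acc = np.zeros((BLOCK, BLOCK), dtype=np.int64)
            for bk in range(0, a_p.shape[1], BLOCK):  # accumulate along K
                a_tile = a_p[bi:bi + BLOCK, bk:bk + BLOCK]
                b_tile = bt_p[bj:bj + BLOCK, bk:bk + BLOCK].T
                systolic_tile(a_tile, b_tile, acc)
            c_p[bi:bi + BLOCK, bj:bj + BLOCK] = acc
    return c_p[:m, :n]
```

Calling `tpu_matmul(a, b)` on two integer arrays and comparing the result with `a @ b` gives a quick sanity check for shapes such as `(4*9)*(9*4)` or a random `(M*K)*(K*N)` case.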