# Project 3 Report: Tiled Matrix Multiplication
# Task 1: INT8 extension to INT16
This task exploits the BRAM word width prepared in Project 2, where we used only 1 byte per operand even though each word holds 4 bytes (addresses 0 to 4). In Project 2, when we read 32-bit data from BRAM, only 8 bits were used for the operand; in this task, since the operands are extended to INT16, 16 bits of each word would be used instead, as mentioned in Q[127](https://piazza.com/class/lk4la7genxl4ew/post/127). Since we did not have a Verilog implementation of Project 2 to modify, we could not finish this task, even though the change itself would be straightforward in Verilog.
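To illustrate the intended change, a minimal Python sketch (not the Verilog itself) of interpreting the low 16 bits of a 32-bit BRAM word instead of only the low byte is shown below; the example word value is hypothetical:

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of a BRAM word as a two's-complement integer."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

word = 0x0000FF85                 # hypothetical 32-bit BRAM word
print(to_signed(word, 8))         # -123: Project 2 used only the low byte (INT8)
print(to_signed(word, 16))        # -123: Task 1 would use the low two bytes (INT16)
```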
# Task 2: Quantization-Dequantization results for different values of `m`
Because different numbers of quantization bits lead to overflow or underflow, the resulting predictions differ (bounding boxes found vs. not found) for different values of `m`. Based on the discussion with the TAs in Piazza questions [148](https://piazza.com/class/lk4la7genxl4ew/post/148) and [152](https://piazza.com/class/lk4la7genxl4ew/post/152), values of `m` below 12 do not yield any results.
| Value of `m` | `prediction.jpg` (`dog.jpg`) | `prediction.jpg` (`test.jpeg`) |
| --- | --- | --- |
| 10 | | |
| 11 | | |
| 12 | | |
| 13 | | |
| 14 | | |
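As a rough illustration of the overflow/underflow trade-off, the sketch below assumes a simple fixed-point scheme in which `m` is the number of fractional bits stored in a 16-bit signed integer; the exact quantization scheme used in the project may differ:

```python
import numpy as np

def quantize(x, m, bits=16):
    # Hypothetical fixed-point model: m fractional bits in a signed `bits`-wide integer
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return np.clip(np.round(x * (1 << m)), lo, hi).astype(np.int64)

def dequantize(q, m):
    return q.astype(np.float64) / (1 << m)

x = np.array([0.0004, 3.0])            # a tiny and a large value
for m in (10, 12, 14):
    print(m, dequantize(quantize(x, m), m))
# m=10: 0.0004 rounds to 0 (underflow, too few fractional bits)
# m=14: 3.0 clips at the int16 limit (overflow, too few integer bits)
```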
# Task 3: Tiling
## Algorithm description
**Figure 1:** Tiled matrix multiplication algorithm diagram
Tiled matrix multiplication can be tricky at first, since vanilla matrix multiplication produces each element of the output by multiplying a row of the first matrix with a column of the second. The main difference is that a single element of the output matrix is not computed at once; it is gradually formed by accumulating partial results. The key optimization is data reuse: working on tiles lets us share data within each tile and minimize memory fetches. As the diagram above shows, depending on the tile size we populate many partial output results and then sequentially add new partial sums to them. Therefore, when the horizontal traversal of the first matrix and the vertical traversal of the second matrix are done, a whole **tile** of the output has been computed.
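Written as a block-matrix sum with tile size $T$ (the block indices $I$, $J$, $P$ are introduced here only for illustration), each output tile accumulates one partial product per tile along the shared dimension $K$:

$$
\text{out}_{I,J} \;=\; \sum_{P=0}^{\lceil K/T \rceil - 1} \text{mat1}_{I,P}\,\text{mat2}_{P,J}
$$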
## Software tiling implementation (Project Step 3)
To implement software tiling, we define the function `matmul_CPU_tiling(mat1,mat2)`
which takes full-sized matrices (e.g. weight and activation) as inputs and returns `out`, their matmul result using tiling.
The provided skeleton code allocates `out` with the appropriate shape (M×N), initialized to zeros; this mirrors the *Output Stationary* setup shown in the project document figure.
The tile size is *hard-coded* to 8 for Step 3 of the project, while in Step 5 it is provided via the `global_config['tile_size']` variable (tested in `test_step5` with `tile_size=40`):
```python
SW_TILE_SIZE = 8
M, K = mat1.shape
K, N = mat2.shape
out = np.zeros((M, N))
```
The implemented part of our code can be seen below:
```python
for i in range(0, M, SW_TILE_SIZE):
    for j in range(0, N, SW_TILE_SIZE):
        for k in range(0, K, SW_TILE_SIZE):
            # Compute the end indices for the tiles, handling edge cases
            i_end = min(i + SW_TILE_SIZE, M)
            j_end = min(j + SW_TILE_SIZE, N)
            k_end = min(k + SW_TILE_SIZE, K)
            # Extract the tiles
            tile_mat1 = mat1[i:i_end, k:k_end]
            tile_mat2 = mat2[k:k_end, j:j_end]
            # Perform matrix multiplication on the tiles and accumulate the result
            out[i:i_end, j:j_end] += self.matmul(tile_mat1, tile_mat2)
```
### Handling edge cases of `matrix_dimension/tile_size` using end indices
In our implementation, we handle the edge case of matrix dimensions that are **not evenly divisible** by the tile size by computing **end indices** for the final tiles in each dimension.
As seen in the code above, the end index of each dimension is $\min(\text{current iterator value} + \text{tile size},\ \text{dimension})$.
In this way, if the final tile in a dimension would extend past the matrix boundary, the end index is clamped so that the slice still *fits* inside the original matrix.
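For example (a quick illustration with made-up sizes), with $M = 20$ and a tile size of 8 the row iterator and its clamped end index evolve as follows:

```python
M, SW_TILE_SIZE = 20, 8
for i in range(0, M, SW_TILE_SIZE):
    print(i, min(i + SW_TILE_SIZE, M))   # prints (0, 8), (8, 16), (16, 20): the last tile has only 4 rows
```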
### Extracting tiles
We use three for-loop iterators: the first two, `i` and `j`, correspond to the *outer* dimensions $(M, N)$ used for the *output*, while the inner iterator `k` corresponds to the *shared inner dimension* of `mat1` and `mat2` (i.e. $K$) used for the tiling.
Then, following the [tiling algorithm](#Algorithm-description), `tile_mat1` is taken as the **slice** of `mat1` corresponding to the *row* of the *current tile* (i.e. `[iterator : end index of iterator]`), while `tile_mat2` is similarly the **slice** of `mat2` corresponding to the *column* of the *current tile*.
Finally, we **accumulate** the partial sum into the slice of `out` at the $(i, j)$ position (i.e. `[i:i_end, j:j_end]`) inside the innermost loop.
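For reference, a self-contained sketch of the same routine (with `np.matmul` standing in for the class's `matmul` helper) can be checked against NumPy on dimensions that are not multiples of the tile size:

```python
import numpy as np

def matmul_tiled(mat1, mat2, tile_size=8):
    """Standalone sketch of matmul_CPU_tiling; np.matmul stands in for self.matmul."""
    M, K = mat1.shape
    _, N = mat2.shape
    out = np.zeros((M, N))
    for i in range(0, M, tile_size):
        for j in range(0, N, tile_size):
            for k in range(0, K, tile_size):
                i_end = min(i + tile_size, M)
                j_end = min(j + tile_size, N)
                k_end = min(k + tile_size, K)
                out[i:i_end, j:j_end] += np.matmul(mat1[i:i_end, k:k_end],
                                                   mat2[k:k_end, j:j_end])
    return out

# Dimensions deliberately not divisible by the tile size
a, b = np.random.rand(30, 45), np.random.rand(45, 22)
assert np.allclose(matmul_tiled(a, b), a @ b)
```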
# Task 4: Design of FPGA modules for tiling (Verilog)
An important consideration in the hardware implementation of tiling is the addition of FSM signals. Since the two matrices to be multiplied are already passed in terms of their tiles, the FSM has to dictate when to move to the next tile from `w_bram` and multiply it with the next tile of `a_bram`.
A module hierarchically on top of `matmul_system`, called `tiling_controller`, would need to be introduced to manage the movement of tiles. It uses `num_iter_K` to determine the number of tiles and *when to finish the full matmul* (i.e. change the FSM state to end and set SP_BRAM address 100 = 1), `num_iter_row` to fetch the corresponding row from the activation BRAM (A_BRAM), and `num_iter_col` to fetch the corresponding column from the weight BRAM (W_BRAM). Note that the Python code already stores the tiles *flattened* in the BRAMs, which is why these iterators are needed as inputs to the system.
At each tile matrix multiplication, data from A_BRAM and W_BRAM is loaded into the systolic array. The `step_a` and `step_w` signals index the BRAM addresses with the correct step size when buffering into `a_buffer` and `w_buffer`, respectively, for streaming data to the systolic array. The `num_iter_K` tiles are **looped** over, and at each iteration the matrix multiplication is computed normally by the *systolic array* (in the appropriate OS/WS mode).
Note that, *similarly to Project 2*, the *input activations* are **sheared** and **pre-loaded** for Weight Stationary mode (WS), while *both* input activations *and* input weights are sheared for Output Stationary mode (OS). These are controlled by the FSM through the SP_BRAM `mode` signal, as in Project 2.
The iterators are also used to compute the O_BRAM addresses where the output is stored. In Output Stationary (OS) mode, the output is accumulated *within the PEs* (without resetting across multiple tiles), with a one-to-one mapping between PEs and the elements of the output tile, and the results are stored *directly* to the output buffer `o_buffer`. Data from the buffer is then stored flattened in O_BRAM so that the Python code can access the result when the multiplication ends.
In Weight Stationary (WS) mode, a new **accumulator** module is needed. This module *accumulates* the results (after *unshearing* the output, similarly to Project 2) until *all* outputs for the tiles have been calculated by the WS systolic array. Once accumulation is complete, the outputs are passed to `o_buffer` and in turn stored flattened in O_BRAM for access from Python.
Since the activation matrix is the first operand, its tiles are traversed horizontally, whereas the tiles of the weight matrix are traversed vertically, as in the [algorithm description](#Algorithm-description).
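A minimal Python sketch of this traversal order (not the Verilog itself) is shown below; the A_BRAM offset formula follows the Task 5 indexing, while the W_BRAM offset formula and the example iterator counts are assumptions made only for illustration:

```python
dim = 8                 # systolic array / hardware tile size
num_iter_row = 4        # tile-rows of the activation matrix (example value)
num_iter_col = 4        # tile-columns of the weight matrix (example value)
num_iter_K = 4          # tiles along the shared K dimension (example value)

for i in range(num_iter_row):        # traverse A_BRAM tiles horizontally
    for j in range(num_iter_col):    # traverse W_BRAM tiles vertically
        o_base = dim * dim * (i * num_iter_col + j)      # flattened O_BRAM offset of output tile (i, j)
        for k in range(num_iter_K):  # accumulate partial products over the shared dimension
            a_base = dim * dim * (i + k * num_iter_row)  # A tile (i, k), as in the Task 5 indexing
            w_base = dim * dim * (k + j * num_iter_K)    # W tile (k, j); layout assumed
            print(f"out tile ({i},{j}), k={k}: A@{a_base}, W@{w_base} -> O@{o_base}")
```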
A final note is that the bit alignment between the way the Python code writes the tiles to the BRAMs and the way the Verilog modules expect them must be ensured, both in size and in format (MSB:0 vs 0:MSB).
# Task 5: Integrating all steps
Performing tiled matrix multiplication with FPGA acceleration is inherently bound by the BRAM size, in our case less than 2048. Limited BRAM size is a known disadvantage of FPGAs. Therefore, calling the hardware tiling implementation directly on YOLO's matrix multiplications would break; in particular, an assignment like the following would index beyond the available BRAM size:

```python
bram_a_arr[dim*dim*(i+k*num_iter_row):dim*dim*(i+k*num_iter_row+1)] = serial_tile_mat1
```
Therefore we need to combine Step 3 with Step 4 and leverage both software and hardware tiling. Hardware tiling is now treated as a normal matrix multiplication and is abstracted away. An important optimization before jumping to the software-tiling abstraction level is to exploit the maximum hardware tile size, which is a hard boundary since the hardware is not dynamically configurable. From `TASK4_python_code` it can be inferred that the underlying systolic array, and hence the maximum hardware tile size, is 8. Leveraging this insight, in the hardware tiling part of Step 5 we pad the matrices if they are smaller than 8x8 before passing them to the FPGA accelerator. At the higher level of abstraction, i.e. software tiling, the tile size is set by `global_config`, and every call to the hardware tiling implementation is capped at 40 in all matrix dimensions. This means that calls to `partial_fpga = self.matmul_**(tile_mat1,tile_mat2)` always have `tile_mat1` and `tile_mat2` shapes of at most 40 in every dimension, respecting the hard boundary of the Verilog implementation.
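A sketch of the padding step described above is shown below; `fpga_matmul` stands in for the actual hardware call (whose exact name is elided in the report), and it assumes, as described above, that only operands smaller than the 8x8 systolic array need padding:

```python
import numpy as np

HW_TILE = 8          # systolic array dimension (hard limit of the Verilog design)
MAX_FPGA_DIM = 40    # largest per-dimension size passed to the FPGA call

def matmul_fpga_padded(mat1, mat2, fpga_matmul=np.matmul):
    """Hypothetical wrapper: pad sub-8x8 operands, run the hardware matmul, crop the result."""
    m, k = mat1.shape
    _, n = mat2.shape
    assert max(m, k, n) <= MAX_FPGA_DIM, "operand exceeds the hardware limit"
    pm, pk, pn = max(m, HW_TILE), max(k, HW_TILE), max(n, HW_TILE)
    a = np.zeros((pm, pk), dtype=mat1.dtype)
    a[:m, :k] = mat1
    b = np.zeros((pk, pn), dtype=mat2.dtype)
    b[:k, :n] = mat2
    return fpga_matmul(a, b)[:m, :n]

# Example: a 3x5 by 5x2 product is padded to 8x8 operands before the "hardware" call
x, y = np.random.rand(3, 5), np.random.rand(5, 2)
assert np.allclose(matmul_fpga_padded(x, y), x @ y)
```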
The results are correctly predicted, as can be seen in the `test_step5` Jupyter notebook.