# Extend tiny-gpu
> 章元豪
Planned improvements include:
* Explore the behavior of tiny-gpu. ✔️
* Optimize control flow and use of registers to improve cycle time. ✔️
* Add basic pipelining.
* Add a simple cache for instructions.
## Source Code
[Original](https://github.com/adam-maj/tiny-gpu)
## Detailed Architecture
### GPU

### Compute Core

## Original Test
### Matrix Addition
This matrix addition kernel adds two 1 x 8 matrices by performing 8 element-wise additions in separate threads.
This demonstration makes use of the `%blockIdx`, `%blockDim`, and `%threadIdx` registers to show SIMD programming on this GPU. It also uses the `LDR` and `STR` instructions which require async memory management.
`matadd.asm`
```asm
.threads 8
.data 0 1 2 3 4 5 6 7 ; matrix A (1 x 8)
.data 0 1 2 3 4 5 6 7 ; matrix B (1 x 8)
MUL R0, %blockIdx, %blockDim
ADD R0, R0, %threadIdx ; i = blockIdx * blockDim + threadIdx
CONST R1, #0 ; baseA (matrix A base address)
CONST R2, #8 ; baseB (matrix B base address)
CONST R3, #16 ; baseC (matrix C base address)
ADD R4, R1, R0 ; addr(A[i]) = baseA + i
LDR R4, R4 ; load A[i] from global memory
ADD R5, R2, R0 ; addr(B[i]) = baseB + i
LDR R5, R5 ; load B[i] from global memory
ADD R6, R4, R5 ; C[i] = A[i] + B[i]
ADD R7, R3, R0 ; addr(C[i]) = baseC + i
STR R7, R6 ; store C[i] in global memory
RET ; end of kernel
```
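As a cross-check of what the kernel should leave in data memory, here is a minimal Python reference model. It is a sketch and not part of tiny-gpu: the function name `run_matadd`, the flat `memory` list, and the block size of 4 threads are assumptions made for illustration (the result does not depend on the block size).

```python
def run_matadd(memory, num_threads=8, block_dim=4):
    """Emulate matadd.asm one thread at a time on a flat data memory."""
    base_a, base_b, base_c = 0, 8, 16            # CONST R1, R2, R3
    for global_id in range(num_threads):
        block_idx, thread_idx = divmod(global_id, block_dim)
        i = block_idx * block_dim + thread_idx   # MUL + ADD into R0
        a = memory[base_a + i]                   # LDR R4, R4
        b = memory[base_b + i]                   # LDR R5, R5
        memory[base_c + i] = (a + b) & 0xFF      # ADD R6, then STR (8-bit values)
    return memory


# Data memory as laid out by the two .data lines, plus room for matrix C.
memory = list(range(8)) + list(range(8)) + [0] * 8
result = run_matadd(memory)
print(result[16:24])  # matrix C
```

Each loop iteration stands in for one hardware thread, so matrix C should hold the element-wise sums at addresses 16 to 23.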
- [Trace of the program](https://drive.google.com/file/d/1hGevcicc9LXOm4zXPQcNDLjo4fh_OPwl/view?usp=drive_link)
- Terminal Output
```
iverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v
MODULE=test.test_matadd vvp -M $(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp
-.--ns INFO gpi ..mbed/gpi_embed.cpp:79 in set_program_name_in_venv Did not detect Python virtual environment. Using system-wide Python interpreter
-.--ns INFO gpi ../gpi/GpiCommon.cpp:101 in gpi_print_registered_impl VPI registered
0.00ns INFO cocotb Running on Icarus Verilog version 11.0 (stable)
0.00ns INFO cocotb Running tests with cocotb v1.9.2 from /usr/local/lib/python3.10/dist-packages/cocotb
0.00ns INFO cocotb Seeding Python random module with 1734430799
0.00ns INFO cocotb.regression pytest not found, install it to enable better AssertionError messages
0.00ns INFO cocotb.regression Found test test.test_matadd.test_matadd
0.00ns INFO cocotb.regression running test_matadd (1/1)
4475001.00ns INFO cocotb.regression test_matadd passed
4475001.00ns INFO cocotb.regression **************************************************************************************
** TEST STATUS SIM TIME (ns) REAL TIME (s) RATIO (ns/s) **
**************************************************************************************
** test.test_matadd.test_matadd PASS 4475001.00 1.23 3623572.44 **
**************************************************************************************
** TESTS=1 PASS=1 FAIL=0 SKIP=0 4475001.00 1.25 3577372.47 **
**************************************************************************************
```
- Result
| Test | STATUS | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) | CYCLES |
|-------------------------------------|--------|---------------|---------------|--------------|--------------|
| test.test_matadd.test_matadd | PASS | 4475001.00 | 1.23 | 3623572.44 | 178 |
| TESTS=1 PASS=1 FAIL=0 SKIP=0 | | 4475001.00 | 1.25 | 3577372.47 | |
### Matrix Multiplication
The matrix multiplication kernel multiplies two 2x2 matrices. Each thread computes one element of the result as the dot product of the relevant row and column, and the kernel uses the ``CMP`` and ``BRnzp`` instructions to demonstrate branching within the threads (notably, all branches converge, so this kernel works on the current tiny-gpu implementation).
``matmul.asm``
```asm
.threads 4
.data 1 2 3 4 ; matrix A (2 x 2)
.data 1 2 3 4 ; matrix B (2 x 2)
MUL R0, %blockIdx, %blockDim
ADD R0, R0, %threadIdx ; i = blockIdx * blockDim + threadIdx
CONST R1, #1 ; increment
CONST R2, #2 ; N (matrix inner dimension)
CONST R3, #0 ; baseA (matrix A base address)
CONST R4, #4 ; baseB (matrix B base address)
CONST R5, #8 ; baseC (matrix C base address)
DIV R6, R0, R2 ; row = i // N
MUL R7, R6, R2
SUB R7, R0, R7 ; col = i % N
CONST R8, #0 ; acc = 0
CONST R9, #0 ; k = 0
LOOP:
MUL R10, R6, R2
ADD R10, R10, R9
ADD R10, R10, R3 ; addr(A[i]) = row * N + k + baseA
LDR R10, R10 ; load A[i] from global memory
MUL R11, R9, R2
ADD R11, R11, R7
ADD R11, R11, R4 ; addr(B[i]) = k * N + col + baseB
LDR R11, R11 ; load B[i] from global memory
MUL R12, R10, R11
ADD R8, R8, R12 ; acc = acc + A[i] * B[i]
ADD R9, R9, R1 ; increment k
CMP R9, R2
BRn LOOP ; loop while k < N
ADD R9, R5, R0 ; addr(C[i]) = baseC + i
STR R9, R8 ; store C[i] in global memory
RET ; end of kernel
```
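The same kind of reference model works for this kernel. The sketch below mirrors the register-level steps of `matmul.asm` in plain Python; `run_matmul` and the flat `memory` list are hypothetical names used only for illustration.

```python
def run_matmul(memory, num_threads=4, n=2):
    """Emulate matmul.asm one thread at a time on a flat data memory."""
    base_a, base_b, base_c = 0, 4, 8       # CONST R3, R4, R5
    for i in range(num_threads):           # i = blockIdx * blockDim + threadIdx
        row, col = i // n, i % n           # DIV, then MUL + SUB
        acc, k = 0, 0                      # CONST R8, CONST R9
        while True:                        # LOOP:
            a = memory[base_a + row * n + k]   # A[row][k]
            b = memory[base_b + k * n + col]   # B[k][col]
            acc = (acc + a * b) & 0xFF         # 8-bit accumulate
            k += 1                             # increment k
            if not k < n:                      # CMP R9, R2 / BRn LOOP
                break
        memory[base_c + i] = acc           # STR R9, R8
    return memory


# Data memory as laid out by the two .data lines, plus room for matrix C.
memory = [1, 2, 3, 4] + [1, 2, 3, 4] + [0] * 4
result = run_matmul(memory)
print(result[8:12])  # matrix C in row-major order
```

The loop body always runs at least once and repeats while `k < N`, matching the position of the `CMP`/`BRn` pair at the end of the assembly loop.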
- [Trace of the program](https://drive.google.com/file/d/1Y5-AhNvuZ8JbgUuL-7RdQYksdoMAlju4/view?usp=drive_link)
- Terminal Output
```
iverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v
MODULE=test.test_matmul vvp -M $(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp
-.--ns INFO gpi ..mbed/gpi_embed.cpp:79 in set_program_name_in_venv Did not detect Python virtual environment. Using system-wide Python interpreter
-.--ns INFO gpi ../gpi/GpiCommon.cpp:101 in gpi_print_registered_impl VPI registered
0.00ns INFO cocotb Running on Icarus Verilog version 11.0 (stable)
0.00ns INFO cocotb Running tests with cocotb v1.9.2 from /usr/local/lib/python3.10/dist-packages/cocotb
0.00ns INFO cocotb Seeding Python random module with 1734430820
0.00ns INFO cocotb.regression pytest not found, install it to enable better AssertionError messages
0.00ns INFO cocotb.regression Found test test.test_matmul.test_matadd
0.00ns INFO cocotb.regression running test_matadd (1/1)
12300001.00ns INFO cocotb.regression test_matadd passed
12300001.00ns INFO cocotb.regression **************************************************************************************
** TEST STATUS SIM TIME (ns) REAL TIME (s) RATIO (ns/s) **
**************************************************************************************
** test.test_matmul.test_matadd PASS 12300001.00 1.13 10873995.65 **
**************************************************************************************
** TESTS=1 PASS=1 FAIL=0 SKIP=0 12300001.00 1.15 10702781.50 **
**************************************************************************************
```
- Result
| Test | STATUS | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) | CYCLES |
|-------------------------------------|--------|---------------|---------------|--------------|--------------|
| test.test_matmul.test_matadd | PASS | 12300001.00 | 1.13 | 10873995.65 | 491 |
| TESTS=1 PASS=1 FAIL=0 SKIP=0 | | 12300001.00 | 1.15 | 10702781.50 | |
## Optimize Control Flow
### Original Scheduler State
1. ``IDLE``: Waiting to start.
2. ``FETCH``: Fetch instructions from program memory.
3. ``DECODE``: Decode instructions into control signals.
4. ``REQUEST``: Request data from registers or memory.
5. ``WAIT``: Wait for response from memory if necessary.
6. ``EXECUTE``: Execute ALU and PC calculations.
7. ``UPDATE``: Update registers, NZP, and PC.
8. ``DONE``: Done executing this block.
### Thoughts
Based on the [architecture diagram](https://hackmd.io/eUCQWc3OQPmLBkLVUg0MGg?view#Compute-Core8) above, I found that the ALU and PC calculations do not need a dedicated ``EXECUTE`` stage. They can run during the ``WAIT`` stage instead: instructions that do not use the LSU have nothing to wait for, so that stage is otherwise wasted for them.
### Optimized States
1. ``IDLE``: Waiting to start.
2. ``FETCH``: Fetch instructions from program memory.
3. ``DECODE``: Decode instructions into control signals.
4. ``REQUEST``: Request data from registers or memory.
5. ``WAIT``: Execute ALU and PC calculations, and wait for response from memory if necessary.
6. ``UPDATE``: Update registers, NZP, and PC.
7. ``DONE``: Done executing this block.
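To make the saving concrete, the two state lists above can be written down as per-instruction sequences and compared. This is only a counting sketch: it assumes every state takes exactly one cycle, which understates real runs because ``FETCH`` and ``WAIT`` can stall on memory. The names `ORIGINAL`, `OPTIMIZED`, and `min_cycles` are made up for this example.

```python
# States each instruction passes through (IDLE and DONE occur once per block).
ORIGINAL = ["FETCH", "DECODE", "REQUEST", "WAIT", "EXECUTE", "UPDATE"]
OPTIMIZED = ["FETCH", "DECODE", "REQUEST", "WAIT", "UPDATE"]


def min_cycles(states, num_instructions):
    """Lower bound on cycles if every state takes exactly one cycle."""
    return len(states) * num_instructions


saved_per_instruction = len(ORIGINAL) - len(OPTIMIZED)
print(saved_per_instruction)

# matadd.asm has 13 instructions and no loop, so each thread executes 13.
print(min_cycles(ORIGINAL, 13), min_cycles(OPTIMIZED, 13))
```

Under this simplified model, the optimization saves one cycle for every instruction a core executes.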
### Improved Parts
#### [scheduler.sv](https://github.com/Beethovenjoker/tiny-gpu/blob/main/src/scheduler.sv)
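Remove the ``EXECUTE`` state from the scheduler state machine, so that ``WAIT`` transitions directly to ``UPDATE``. The following case is deleted:
```systemverilog
EXECUTE: begin
    // Execute is synchronous so we move on after one cycle
    core_state <= UPDATE;
end
```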
#### [alu.sv](https://github.com/Beethovenjoker/tiny-gpu/blob/main/src/alu.sv)
Change the ``core_state`` check so that the ALU computes its result in the ``WAIT`` state (``3'b100``) instead of ``EXECUTE``.
```systemverilog
always @(posedge clk) begin
    if (reset) begin
        alu_out_reg <= 8'b0;
    end else if (enable) begin
        // Calculate alu_out when core_state = WAIT
        if (core_state == 3'b100) begin
            if (decoded_alu_output_mux == 1) begin
                // Set values to compare with NZP register in alu_out[2:0]
                alu_out_reg <= {5'b0, (rs - rt > 0), (rs - rt == 0), (rs - rt < 0)};
            end else begin
                // Execute the specified arithmetic instruction
                case (decoded_alu_arithmetic_mux)
                    ADD: begin
                        alu_out_reg <= rs + rt;
                    end
                    SUB: begin
                        alu_out_reg <= rs - rt;
                    end
                    MUL: begin
                        alu_out_reg <= rs * rt;
                    end
                    DIV: begin
                        alu_out_reg <= rs / rt;
                    end
                endcase
            end
        end
    end
end
endmodule
```
#### [pc.sv](https://github.com/Beethovenjoker/tiny-gpu/blob/main/src/pc.sv)
The PC also computes ``next_pc`` during the ``WAIT`` state; the NZP register is still written in the ``UPDATE`` state.
```systemverilog
always @(posedge clk) begin
    if (reset) begin
        nzp <= 3'b0;
        next_pc <= 0;
    end else if (enable) begin
        // Update PC when core_state = WAIT
        if (core_state == 3'b100) begin
            if (decoded_pc_mux == 1) begin
                if (((nzp & decoded_nzp) != 3'b0)) begin
                    // On BRnzp instruction, branch to immediate if NZP case matches previous CMP
                    next_pc <= decoded_immediate;
                end else begin
                    // Otherwise, just update to PC + 1 (next line)
                    next_pc <= current_pc + 1;
                end
            end else begin
                // By default update to PC + 1 (next line)
                next_pc <= current_pc + 1;
            end
        end
        // Store NZP when core_state = UPDATE
        if (core_state == 3'b110) begin
            // Write to NZP register on CMP instruction
            if (decoded_nzp_write_enable) begin
                nzp[2] <= alu_out[2];
                nzp[1] <= alu_out[1];
                nzp[0] <= alu_out[0];
            end
        end
    end
end
```
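The next-PC selection in the block above is small enough to restate as a pure function, which can help when reading traces. The sketch below is a Python paraphrase of that logic only; the function name `next_pc` and the example values are hypothetical, and the register width and clocking are not modelled.

```python
def next_pc(current_pc, decoded_pc_mux, nzp, decoded_nzp, decoded_immediate):
    """Mirror the next_pc selection performed in the WAIT state."""
    if decoded_pc_mux == 1 and (nzp & decoded_nzp) != 0:
        # BRnzp taken: the stored NZP bits overlap the instruction's mask.
        return decoded_immediate
    # Branch not taken, or not a branch instruction at all.
    return current_pc + 1


# Illustrative values only (not taken from a real trace).
print(next_pc(25, 1, 0b100, 0b100, 12))  # mask matches: jump to the immediate
print(next_pc(25, 1, 0b001, 0b100, 12))  # mask does not match: fall through
print(next_pc(5, 0, 0b111, 0b111, 12))   # not a branch: fall through
```

Because the branch decision reads `nzp` in ``WAIT`` while `nzp` is only written in ``UPDATE``, a branch sees the flags stored by an earlier `CMP`, as the comment in the SystemVerilog states.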
#### [format.py](https://github.com/Beethovenjoker/tiny-gpu/blob/main/test/helpers/format.py)
Remove the ``EXECUTE`` entry (``"101"``) from the core state map so that trace formatting matches the new state machine.
```python
def format_core_state(core_state: str) -> str:
    core_state_map = {
        "000": "IDLE",
        "001": "FETCH",
        "010": "DECODE",
        "011": "REQUEST",
        "100": "WAIT",
        "110": "UPDATE",
        "111": "DONE"
    }
    return core_state_map[core_state]
```
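With ``"101"`` gone from the map, looking up that code raises `KeyError`. That should not happen if the scheduler never enters the removed encoding, but a trace formatter that fails soft can be convenient while debugging. The variant below is a hypothetical helper, not part of the repository; `format_core_state_safe` and `CORE_STATE_MAP` are names invented for this sketch.

```python
CORE_STATE_MAP = {
    "000": "IDLE",
    "001": "FETCH",
    "010": "DECODE",
    "011": "REQUEST",
    "100": "WAIT",
    "110": "UPDATE",
    "111": "DONE",
}


def format_core_state_safe(core_state: str) -> str:
    """Return the state name, or a visible placeholder for unmapped codes."""
    return CORE_STATE_MAP.get(core_state, f"UNKNOWN({core_state})")


print(format_core_state_safe("100"))
print(format_core_state_safe("101"))
```

Seeing `UNKNOWN(101)` in a trace would indicate that some module still drives the old ``EXECUTE`` encoding.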
### Evaluation
#### Matrix Addition
- [Trace of the program](https://drive.google.com/file/d/1G8SafA9r-DydlhgShl3udHwC3ULSnNLv/view?usp=drive_link)
- Terminal Output
```
iverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v
MODULE=test.test_matadd vvp -M $(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp
-.--ns INFO gpi ..mbed/gpi_embed.cpp:79 in set_program_name_in_venv Did not detect Python virtual environment. Using system-wide Python interpreter
-.--ns INFO gpi ../gpi/GpiCommon.cpp:101 in gpi_print_registered_impl VPI registered
0.00ns INFO cocotb Running on Icarus Verilog version 11.0 (stable)
0.00ns INFO cocotb Running tests with cocotb v1.9.2 from /usr/local/lib/python3.10/dist-packages/cocotb
0.00ns INFO cocotb Seeding Python random module with 1735826852
0.00ns INFO cocotb.regression pytest not found, install it to enable better AssertionError messages
0.00ns INFO cocotb.regression Found test test.test_matadd.test_matadd
0.00ns INFO cocotb.regression running test_matadd (1/1)
4200001.00ns INFO cocotb.regression test_matadd passed
4200001.00ns INFO cocotb.regression **************************************************************************************
** TEST STATUS SIM TIME (ns) REAL TIME (s) RATIO (ns/s) **
**************************************************************************************
** test.test_matadd.test_matadd PASS 4200001.00 1.36 3084251.90 **
**************************************************************************************
** TESTS=1 PASS=1 FAIL=0 SKIP=0 4200001.00 1.38 3045972.60 **
**************************************************************************************
```
- Result
| Test | STATUS | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) | CYCLES |
|-------------------------------------|--------|---------------|---------------|--------------|--------------|
| Original Matrix Addition | PASS | 4475001.00 | 1.23 | 3623572.44 | 178 |
| Optimized Matrix Addition | PASS | 4200001.00 | 1.36 | 3084251.90 | 167 |
:::success
Improved by 11 cycles (178 to 167, about 6.2%).
:::
#### Matrix Multiplication
- [Trace of the program](https://drive.google.com/file/d/1YxAIHQJ5Jl7sUEVotKAChRRjEbdxYNhG/view?usp=drive_link)
- Terminal Output
```
iverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v
MODULE=test.test_matmul vvp -M $(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp
-.--ns INFO gpi ..mbed/gpi_embed.cpp:79 in set_program_name_in_venv Did not detect Python virtual environment. Using system-wide Python interpreter
-.--ns INFO gpi ../gpi/GpiCommon.cpp:101 in gpi_print_registered_impl VPI registered
0.00ns INFO cocotb Running on Icarus Verilog version 11.0 (stable)
0.00ns INFO cocotb Running tests with cocotb v1.9.2 from /usr/local/lib/python3.10/dist-packages/cocotb
0.00ns INFO cocotb Seeding Python random module with 1735826930
0.00ns INFO cocotb.regression pytest not found, install it to enable better AssertionError messages
0.00ns INFO cocotb.regression Found test test.test_matmul.test_matadd
0.00ns INFO cocotb.regression running test_matadd (1/1)
11275001.00ns INFO cocotb.regression test_matadd passed
11275001.00ns INFO cocotb.regression **************************************************************************************
** TEST STATUS SIM TIME (ns) REAL TIME (s) RATIO (ns/s) **
**************************************************************************************
** test.test_matmul.test_matadd PASS 11275001.00 1.08 10486023.66 **
**************************************************************************************
** TESTS=1 PASS=1 FAIL=0 SKIP=0 11275001.00 1.09 10305411.66 **
**************************************************************************************
```
- Result
| Test | STATUS | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) | CYCLES |
|-------------------------------------|--------|---------------|---------------|--------------|--------------|
| Original Matrix Multiplication | PASS | 12300001.00 | 1.13 | 10873995.65 | 491 |
| Optimized Matrix Multiplication | PASS | 11275001.00 | 1.08 | 10486023.66 | 450 |
:::success
Improved by 41 cycles (491 to 450, about 8.4%).
> TODO: Consider submitting pull requests back to tiny-gpu.
:::
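The figures in the two result tables can be cross-checked with a few lines of arithmetic. The sketch below recomputes the savings and tests two observations. First, the reported simulation times fit `(cycles + 1) * 25000 + 1` ns, which suggests a 25 µs clock period; this is inferred from the numbers and not stated in this note. Second, it compares the saved cycles with the number of instructions each thread executes: 13 for `matadd.asm`, and 41 for `matmul.asm` with its loop body running twice, both counted by hand from the listings above.

```python
CLOCK_PERIOD_NS = 25_000  # assumption inferred from the reported sim times

RUNS = {
    # name: (original cycles, optimized cycles, original ns, optimized ns,
    #        instructions executed per thread, counted by hand)
    "matadd": (178, 167, 4_475_001, 4_200_001, 13),
    "matmul": (491, 450, 12_300_001, 11_275_001, 41),
}


def summarize(orig_cycles, opt_cycles, orig_ns, opt_ns, instructions):
    saved = orig_cycles - opt_cycles
    return {
        "saved": saved,
        "percent": round(100 * saved / orig_cycles, 1),
        # Does (cycles + 1) * period + 1 ns reproduce the reported sim time?
        "time_fits": all(
            (cycles + 1) * CLOCK_PERIOD_NS + 1 == ns
            for cycles, ns in ((orig_cycles, orig_ns), (opt_cycles, opt_ns))
        ),
        "saved_per_instruction": saved / instructions,
    }


summary = {name: summarize(*values) for name, values in RUNS.items()}
for name, stats in summary.items():
    print(name, stats)
```

For matrix multiplication the saving equals the per-thread instruction count, which is what removing one state per instruction would predict. For matrix addition the saving is 11 rather than 13; this note does not establish the cause, and memory timing is one possible factor.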