# Extend tiny-gpu
> 章元豪

Planned improvements include:
* Explore the behavior of tiny-gpu. ✔️
* Optimize control flow and use of registers to improve cycle time. ✔️
* Add basic pipelining.
* Add a simple cache for instructions.

## Source Code
[Original](https://github.com/adam-maj/tiny-gpu)

## Detailed Architecture
### GPU
![GPU](https://hackmd.io/_uploads/BkdaOq-Syl.png)

### Compute Core
![Compute_Core](https://hackmd.io/_uploads/rJI99wGBkg.png)

## Original Test
### Matrix Addition
This matrix addition kernel adds two 1 x 8 matrices by performing 8 element-wise additions in separate threads.

This demonstration makes use of the `%blockIdx`, `%blockDim`, and `%threadIdx` registers to show SIMD programming on this GPU. It also uses the `LDR` and `STR` instructions, which require async memory management.

`matadd.asm`
```asm
.threads 8
.data 0 1 2 3 4 5 6 7          ; matrix A (1 x 8)
.data 0 1 2 3 4 5 6 7          ; matrix B (1 x 8)

MUL R0, %blockIdx, %blockDim
ADD R0, R0, %threadIdx         ; i = blockIdx * blockDim + threadIdx

CONST R1, #0                   ; baseA (matrix A base address)
CONST R2, #8                   ; baseB (matrix B base address)
CONST R3, #16                  ; baseC (matrix C base address)

ADD R4, R1, R0                 ; addr(A[i]) = baseA + i
LDR R4, R4                     ; load A[i] from global memory

ADD R5, R2, R0                 ; addr(B[i]) = baseB + i
LDR R5, R5                     ; load B[i] from global memory

ADD R6, R4, R5                 ; C[i] = A[i] + B[i]

ADD R7, R3, R0                 ; addr(C[i]) = baseC + i
STR R7, R6                     ; store C[i] in global memory

RET                            ; end of kernel
```

- [Trace of the program](https://drive.google.com/file/d/1hGevcicc9LXOm4zXPQcNDLjo4fh_OPwl/view?usp=drive_link)
- Terminal Output
```
iverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v
MODULE=test.test_matadd vvp -M $(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp
      -.--ns INFO     gpi               ..mbed/gpi_embed.cpp:79   in set_program_name_in_venv        Did not detect Python virtual environment. Using system-wide Python interpreter
      -.--ns INFO     gpi               ../gpi/GpiCommon.cpp:101  in gpi_print_registered_impl       VPI registered
      0.00ns INFO     cocotb            Running on Icarus Verilog version 11.0 (stable)
      0.00ns INFO     cocotb            Running tests with cocotb v1.9.2 from /usr/local/lib/python3.10/dist-packages/cocotb
      0.00ns INFO     cocotb            Seeding Python random module with 1734430799
      0.00ns INFO     cocotb.regression pytest not found, install it to enable better AssertionError messages
      0.00ns INFO     cocotb.regression Found test test.test_matadd.test_matadd
      0.00ns INFO     cocotb.regression running test_matadd (1/1)
4475001.00ns INFO     cocotb.regression test_matadd passed
4475001.00ns INFO     cocotb.regression **************************************************************************************
                                        ** TEST                          STATUS  SIM TIME (ns)  REAL TIME (s)  RATIO (ns/s) **
                                        **************************************************************************************
                                        ** test.test_matadd.test_matadd   PASS      4475001.00           1.23    3623572.44 **
                                        **************************************************************************************
                                        ** TESTS=1 PASS=1 FAIL=0 SKIP=0             4475001.00           1.25    3577372.47 **
                                        **************************************************************************************
```
- Result

| Test                                | STATUS | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) | CYCLES       |
|-------------------------------------|--------|---------------|---------------|--------------|--------------|
| test.test_matadd.test_matadd        | PASS   | 4475001.00    | 1.23          | 3623572.44   | 178          |
| TESTS=1 PASS=1 FAIL=0 SKIP=0        |        | 4475001.00    | 1.25          | 3577372.47   |              |

### Matrix Multiplication
The matrix multiplication kernel multiplies two 2x2 matrices. Each thread computes one output element as the dot product of the relevant row and column, and the kernel uses the ``CMP`` and ``BRnzp`` instructions to demonstrate branching within the threads (notably, all branches converge, so this kernel works on the current tiny-gpu implementation).
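
As a cross-check on what each thread is expected to produce, the kernel described above can be modelled on the host in a few lines of Python. This is only a sketch, not part of tiny-gpu: the function name `matmul_thread` is hypothetical, the input values are taken from the `.data` lines of `matmul.asm`, and the model ignores registers, memory latency, and scheduling.

```python
# Host-side reference model of the 2x2 matmul kernel (illustrative only).
N = 2                  # matrix inner dimension, as in CONST R2, #2
A = [1, 2, 3, 4]       # matrix A (2 x 2), row-major
B = [1, 2, 3, 4]       # matrix B (2 x 2), row-major


def matmul_thread(i, a, b, n):
    """Compute output element i the way one thread does in the kernel."""
    row = i // n       # DIV R6, R0, R2
    col = i % n        # MUL/SUB pair that emulates the modulo
    acc = 0
    k = 0
    while True:        # loop body runs at least once, like the LOOP label
        acc += a[row * n + k] * b[k * n + col]
        k += 1
        if not k < n:  # CMP R9, R2 followed by BRn LOOP
            break
    return acc


# One "thread" per output element, matching .threads 4
C = [matmul_thread(i, A, B, N) for i in range(N * N)]
print(C)  # [7, 10, 15, 22]
```

The loop is written as a do-while on purpose, so that the branch condition sits at the end of the body just as `BRn LOOP` does in the assembly.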

``matmul.asm``
```asm
.threads 4
.data 1 2 3 4                  ; matrix A (2 x 2)
.data 1 2 3 4                  ; matrix B (2 x 2)

MUL R0, %blockIdx, %blockDim
ADD R0, R0, %threadIdx         ; i = blockIdx * blockDim + threadIdx

CONST R1, #1                   ; increment
CONST R2, #2                   ; N (matrix inner dimension)
CONST R3, #0                   ; baseA (matrix A base address)
CONST R4, #4                   ; baseB (matrix B base address)
CONST R5, #8                   ; baseC (matrix C base address)

DIV R6, R0, R2                 ; row = i // N
MUL R7, R6, R2
SUB R7, R0, R7                 ; col = i % N

CONST R8, #0                   ; acc = 0
CONST R9, #0                   ; k = 0

LOOP:
  MUL R10, R6, R2
  ADD R10, R10, R9
  ADD R10, R10, R3             ; addr(A[i]) = row * N + k + baseA
  LDR R10, R10                 ; load A[i] from global memory

  MUL R11, R9, R2
  ADD R11, R11, R7
  ADD R11, R11, R4             ; addr(B[i]) = k * N + col + baseB
  LDR R11, R11                 ; load B[i] from global memory

  MUL R12, R10, R11
  ADD R8, R8, R12              ; acc = acc + A[i] * B[i]

  ADD R9, R9, R1               ; increment k

  CMP R9, R2
  BRn LOOP                     ; loop while k < N

ADD R9, R5, R0                 ; addr(C[i]) = baseC + i
STR R9, R8                     ; store C[i] in global memory

RET                            ; end of kernel
```

- [Trace of the program](https://drive.google.com/file/d/1Y5-AhNvuZ8JbgUuL-7RdQYksdoMAlju4/view?usp=drive_link)
- Terminal Output
```
iverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v
MODULE=test.test_matmul vvp -M $(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp
       -.--ns INFO     gpi               ..mbed/gpi_embed.cpp:79   in set_program_name_in_venv        Did not detect Python virtual environment. Using system-wide Python interpreter
       -.--ns INFO     gpi               ../gpi/GpiCommon.cpp:101  in gpi_print_registered_impl       VPI registered
       0.00ns INFO     cocotb            Running on Icarus Verilog version 11.0 (stable)
       0.00ns INFO     cocotb            Running tests with cocotb v1.9.2 from /usr/local/lib/python3.10/dist-packages/cocotb
       0.00ns INFO     cocotb            Seeding Python random module with 1734430820
       0.00ns INFO     cocotb.regression pytest not found, install it to enable better AssertionError messages
       0.00ns INFO     cocotb.regression Found test test.test_matmul.test_matadd
       0.00ns INFO     cocotb.regression running test_matadd (1/1)
12300001.00ns INFO     cocotb.regression test_matadd passed
12300001.00ns INFO     cocotb.regression **************************************************************************************
                                         ** TEST                          STATUS  SIM TIME (ns)  REAL TIME (s)  RATIO (ns/s) **
                                         **************************************************************************************
                                         ** test.test_matmul.test_matadd   PASS     12300001.00           1.13   10873995.65 **
                                         **************************************************************************************
                                         ** TESTS=1 PASS=1 FAIL=0 SKIP=0            12300001.00           1.15   10702781.50 **
                                         **************************************************************************************
```
- Result

| Test                                | STATUS | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) | CYCLES       |
|-------------------------------------|--------|---------------|---------------|--------------|--------------|
| test.test_matmul.test_matadd        | PASS   | 12300001.00   | 1.13          | 10873995.65  | 491          |
| TESTS=1 PASS=1 FAIL=0 SKIP=0        |        | 12300001.00   | 1.15          | 10702781.50  |              |

## Optimize Control Flow
### Original Scheduler State
1. ``IDLE``: Waiting to start.
2. ``FETCH``: Fetch instructions from program memory.
3. ``DECODE``: Decode instructions into control signals.
4. ``REQUEST``: Request data from registers or memory.
5. ``WAIT``: Wait for response from memory if necessary.
6. ``EXECUTE``: Execute ALU and PC calculations.
7. ``UPDATE``: Update registers, NZP, and PC.
8. ``DONE``: Done executing this block.

### Thoughts
Based on the [architecture diagram](https://hackmd.io/eUCQWc3OQPmLBkLVUg0MGg?view#Compute-Core8) above, I found that the ALU and PC calculations do not need a dedicated ``EXECUTE`` stage. They can run during the ``WAIT`` stage instead: instructions that don't use the LSU have nothing to wait for, so that stage would otherwise be wasted.

### Optimized States
1. ``IDLE``: Waiting to start.
2. ``FETCH``: Fetch instructions from program memory.
3. ``DECODE``: Decode instructions into control signals.
4. ``REQUEST``: Request data from registers or memory.
5. ``WAIT``: Execute ALU and PC calculations, and wait for response from memory if necessary.
6. ``UPDATE``: Update registers, NZP, and PC.
7. ``DONE``: Done executing this block.

### Improved Parts
#### [scheduler.sv](https://github.com/Beethovenjoker/tiny-gpu/blob/main/src/scheduler.sv)
Remove the ``EXECUTE`` state:
```systemverilog
EXECUTE: begin
    // Execute is synchronous so we move on after one cycle
    core_state <= UPDATE;
end
```

#### [alu.sv](https://github.com/Beethovenjoker/tiny-gpu/blob/main/src/alu.sv)
Change the ``core_state`` check so that the ALU computes during the ``WAIT`` state (``3'b100``).
```systemverilog
    always @(posedge clk) begin
        if (reset) begin
            alu_out_reg <= 8'b0;
        end else if (enable) begin
            // Calculate alu_out when core_state = WAIT
            if (core_state == 3'b100) begin
                if (decoded_alu_output_mux == 1) begin
                    // Set values to compare with NZP register in alu_out[2:0]
                    alu_out_reg <= {5'b0, (rs - rt > 0), (rs - rt == 0), (rs - rt < 0)};
                end else begin
                    // Execute the specified arithmetic instruction
                    case (decoded_alu_arithmetic_mux)
                        ADD: begin
                            alu_out_reg <= rs + rt;
                        end
                        SUB: begin
                            alu_out_reg <= rs - rt;
                        end
                        MUL: begin
                            alu_out_reg <= rs * rt;
                        end
                        DIV: begin
                            alu_out_reg <= rs / rt;
                        end
                    endcase
                end
            end
        end
    end
endmodule
```

#### [pc.sv](https://github.com/Beethovenjoker/tiny-gpu/blob/main/src/pc.sv)
The PC can also perform its calculation during the ``WAIT`` state; the NZP register is still written in ``UPDATE`` (``3'b110``).
```systemverilog
    always @(posedge clk) begin
        if (reset) begin
            nzp <= 3'b0;
            next_pc <= 0;
        end else if (enable) begin
            // Update PC when core_state = WAIT
            if (core_state == 3'b100) begin
                if (decoded_pc_mux == 1) begin
                    if (((nzp & decoded_nzp) != 3'b0)) begin
                        // On BRnzp instruction, branch to immediate if NZP case matches previous CMP
                        next_pc <= decoded_immediate;
                    end else begin
                        // Otherwise, just update to PC + 1 (next line)
                        next_pc <= current_pc + 1;
                    end
                end else begin
                    // By default update to PC + 1 (next line)
                    next_pc <= current_pc + 1;
                end
            end

            // Store NZP when core_state = UPDATE
            if (core_state == 3'b110) begin
                // Write to NZP register on CMP instruction
                if (decoded_nzp_write_enable) begin
                    nzp[2] <= alu_out[2];
                    nzp[1] <= alu_out[1];
                    nzp[0] <= alu_out[0];
                end
            end
        end
    end
```

#### [format.py](https://github.com/Beethovenjoker/tiny-gpu/blob/main/test/helpers/format.py)
Remove the ``EXECUTE`` entry (``101``) from the state map.
```python
def format_core_state(core_state: str) -> str:
    # Map the 3-bit core_state encoding to its name; "101" (EXECUTE) no longer exists
    core_state_map = {
        "000": "IDLE",
        "001": "FETCH",
        "010": "DECODE",
        "011": "REQUEST",
        "100": "WAIT",
        "110": "UPDATE",
        "111": "DONE"
    }
    return core_state_map[core_state]
```

### Evaluation
#### Matrix Addition
- [Trace of the program](https://drive.google.com/file/d/1G8SafA9r-DydlhgShl3udHwC3ULSnNLv/view?usp=drive_link)
- Terminal Output
```
iverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v
MODULE=test.test_matadd vvp -M $(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp
      -.--ns INFO     gpi               ..mbed/gpi_embed.cpp:79   in set_program_name_in_venv        Did not detect Python virtual environment. Using system-wide Python interpreter
      -.--ns INFO     gpi               ../gpi/GpiCommon.cpp:101  in gpi_print_registered_impl       VPI registered
      0.00ns INFO     cocotb            Running on Icarus Verilog version 11.0 (stable)
      0.00ns INFO     cocotb            Running tests with cocotb v1.9.2 from /usr/local/lib/python3.10/dist-packages/cocotb
      0.00ns INFO     cocotb            Seeding Python random module with 1735826852
      0.00ns INFO     cocotb.regression pytest not found, install it to enable better AssertionError messages
      0.00ns INFO     cocotb.regression Found test test.test_matadd.test_matadd
      0.00ns INFO     cocotb.regression running test_matadd (1/1)
4200001.00ns INFO     cocotb.regression test_matadd passed
4200001.00ns INFO     cocotb.regression **************************************************************************************
                                        ** TEST                          STATUS  SIM TIME (ns)  REAL TIME (s)  RATIO (ns/s) **
                                        **************************************************************************************
                                        ** test.test_matadd.test_matadd   PASS      4200001.00           1.36    3084251.90 **
                                        **************************************************************************************
                                        ** TESTS=1 PASS=1 FAIL=0 SKIP=0             4200001.00           1.38    3045972.60 **
                                        **************************************************************************************
```
- Result

| Test                                | STATUS | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) | CYCLES       |
|-------------------------------------|--------|---------------|---------------|--------------|--------------|
| Original Matrix Addition            | PASS   | 4475001.00    | 1.23          | 3623572.44   | 178          |
| Optimized Matrix Addition           | PASS   | 4200001.00    | 1.36          | 3084251.90   | 167          |

:::success
Improved by 11 cycles (178 → 167).
:::

#### Matrix Multiplication
- [Trace of the program](https://drive.google.com/file/d/1YxAIHQJ5Jl7sUEVotKAChRRjEbdxYNhG/view?usp=drive_link)
- Terminal Output
```
iverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v
MODULE=test.test_matmul vvp -M $(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp
       -.--ns INFO     gpi               ..mbed/gpi_embed.cpp:79   in set_program_name_in_venv        Did not detect Python virtual environment. Using system-wide Python interpreter
       -.--ns INFO     gpi               ../gpi/GpiCommon.cpp:101  in gpi_print_registered_impl       VPI registered
       0.00ns INFO     cocotb            Running on Icarus Verilog version 11.0 (stable)
       0.00ns INFO     cocotb            Running tests with cocotb v1.9.2 from /usr/local/lib/python3.10/dist-packages/cocotb
       0.00ns INFO     cocotb            Seeding Python random module with 1735826930
       0.00ns INFO     cocotb.regression pytest not found, install it to enable better AssertionError messages
       0.00ns INFO     cocotb.regression Found test test.test_matmul.test_matadd
       0.00ns INFO     cocotb.regression running test_matadd (1/1)
11275001.00ns INFO     cocotb.regression test_matadd passed
11275001.00ns INFO     cocotb.regression **************************************************************************************
                                         ** TEST                          STATUS  SIM TIME (ns)  REAL TIME (s)  RATIO (ns/s) **
                                         **************************************************************************************
                                         ** test.test_matmul.test_matadd   PASS     11275001.00           1.08   10486023.66 **
                                         **************************************************************************************
                                         ** TESTS=1 PASS=1 FAIL=0 SKIP=0            11275001.00           1.09   10305411.66 **
                                         **************************************************************************************
```
- Result

| Test                                | STATUS | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) | CYCLES       |
|-------------------------------------|--------|---------------|---------------|--------------|--------------|
| Original Matrix Multiplication      | PASS   | 12300001.00   | 1.13          | 10873995.65  | 491          |
| Optimized Matrix Multiplication     | PASS   | 11275001.00   | 1.08          | 10486023.66  | 450          |

:::success
Improved by 41 cycles (491 → 450).
> TODO: Consider submitting pull requests back to tiny-gpu.
:::
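
To put the two results side by side, the cycle counts from the tables above can be turned into absolute and relative savings. The snippet below is a small sketch for that arithmetic only; the names `RESULTS` and `cycle_savings` are made up for this note and do not exist in the tiny-gpu test helpers.

```python
# Cycle counts copied from the result tables in this note.
RESULTS = {
    "matadd": {"original": 178, "optimized": 167},
    "matmul": {"original": 491, "optimized": 450},
}


def cycle_savings(original, optimized):
    """Return (cycles saved, saving as a percentage of the original count)."""
    saved = original - optimized
    return saved, 100.0 * saved / original


for name, counts in RESULTS.items():
    saved, percent = cycle_savings(counts["original"], counts["optimized"])
    print(f"{name}: {saved} cycles saved ({percent:.1f}%)")
```

With these numbers the saving is roughly 6.2% for matrix addition and 8.4% for matrix multiplication.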