Extend tiny-gpu

章元豪

Planned improvements include:

Explore the behavior of tiny-gpu. ✔️
Optimize control flow and register use to improve cycle time. ✔️
Add basic pipelining.
Add a simple cache for instructions.

Source Code

Original

Detailed Architecture

GPU

[Figure: GPU architecture diagram (image not available)]

Compute Core

[Figure: compute core architecture diagram (image not available)]

Original Test

Matrix Addition

This matrix addition kernel adds two 1 x 8 matrices by performing 8 element-wise additions in separate threads.

This demonstration makes use of the %blockIdx, %blockDim, and %threadIdx registers to show SIMD programming on this GPU. It also uses the LDR and STR instructions which require async memory management.

matadd.asm

.threads 8
.data 0 1 2 3 4 5 6 7          ; matrix A (1 x 8)
.data 0 1 2 3 4 5 6 7          ; matrix B (1 x 8)

MUL R0, %blockIdx, %blockDim
ADD R0, R0, %threadIdx         ; i = blockIdx * blockDim + threadIdx

CONST R1, #0                   ; baseA (matrix A base address)
CONST R2, #8                   ; baseB (matrix B base address)
CONST R3, #16                  ; baseC (matrix C base address)

ADD R4, R1, R0                 ; addr(A[i]) = baseA + i
LDR R4, R4                     ; load A[i] from global memory

ADD R5, R2, R0                 ; addr(B[i]) = baseB + i
LDR R5, R5                     ; load B[i] from global memory

ADD R6, R4, R5                 ; C[i] = A[i] + B[i]

ADD R7, R3, R0                 ; addr(C[i]) = baseC + i
STR R7, R6                     ; store C[i] in global memory

RET                            ; end of kernel
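
To check the kernel's logic without running the simulator, here is a minimal Python sketch of what each thread computes. It assumes a single block of 8 threads and a flat global memory list laid out as in the .data directives. The names `matadd_thread` and `memory` are illustrative and are not part of tiny-gpu.

```python
# Hypothetical reference model of matadd.asm; not part of tiny-gpu.
BASE_A, BASE_B, BASE_C = 0, 8, 16   # same base addresses as the CONST lines


def matadd_thread(memory, block_idx, block_dim, thread_idx):
    """Run the kernel body for one thread against a flat memory list."""
    i = block_idx * block_dim + thread_idx   # MUL + ADD into R0
    a = memory[BASE_A + i]                   # LDR R4
    b = memory[BASE_B + i]                   # LDR R5
    memory[BASE_C + i] = a + b               # ADD R6, then STR


# Memory image: matrix A, matrix B, then 8 zeroed slots for matrix C.
memory = list(range(8)) + list(range(8)) + [0] * 8

# Assumption: one block of 8 threads, matching `.threads 8`.
for thread_idx in range(8):
    matadd_thread(memory, block_idx=0, block_dim=8, thread_idx=thread_idx)

result = memory[BASE_C:BASE_C + 8]
print(result)
```

Splitting the same 8 threads across several blocks gives the same result, because each thread only touches the element at its own global index.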

Trace of the program
Terminal Output

iverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v
MODULE=test.test_matadd vvp -M $(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp
     -.--ns INFO     gpi                                ..mbed/gpi_embed.cpp:79   in set_program_name_in_venv        Did not detect Python virtual environment. Using system-wide Python interpreter
     -.--ns INFO     gpi                                ../gpi/GpiCommon.cpp:101  in gpi_print_registered_impl       VPI registered
     0.00ns INFO     cocotb                             Running on Icarus Verilog version 11.0 (stable)
     0.00ns INFO     cocotb                             Running tests with cocotb v1.9.2 from /usr/local/lib/python3.10/dist-packages/cocotb
     0.00ns INFO     cocotb                             Seeding Python random module with 1734430799
     0.00ns INFO     cocotb.regression                  pytest not found, install it to enable better AssertionError messages
     0.00ns INFO     cocotb.regression                  Found test test.test_matadd.test_matadd
     0.00ns INFO     cocotb.regression                  running test_matadd (1/1)
4475001.00ns INFO     cocotb.regression                  test_matadd passed
4475001.00ns INFO     cocotb.regression                  **************************************************************************************
                                                         ** TEST                          STATUS  SIM TIME (ns)  REAL TIME (s)  RATIO (ns/s) **
                                                         **************************************************************************************
                                                         ** test.test_matadd.test_matadd   PASS     4475001.00           1.23    3623572.44  **
                                                         **************************************************************************************
                                                         ** TESTS=1 PASS=1 FAIL=0 SKIP=0            4475001.00           1.25    3577372.47  **
                                                         **************************************************************************************

Result

Test	STATUS	SIM TIME (ns)	REAL TIME (s)	RATIO (ns/s)	CYCLES
test.test_matadd.test_matadd	PASS	4475001.00	1.23	3623572.44	178
TESTS=1 PASS=1 FAIL=0 SKIP=0		4475001.00	1.25	3577372.47

Matrix Multiplication

The matrix multiplication kernel multiplies two 2x2 matrices. Each thread computes one element of the result as the dot product of the relevant row and column, and uses the CMP and BRnzp instructions to demonstrate branching within threads (notably, all branches converge, so this kernel works on the current tiny-gpu implementation).

matmul.asm

.threads 4
.data 1 2 3 4                  ; matrix A (2 x 2)
.data 1 2 3 4                  ; matrix B (2 x 2)

MUL R0, %blockIdx, %blockDim
ADD R0, R0, %threadIdx         ; i = blockIdx * blockDim + threadIdx

CONST R1, #1                   ; increment
CONST R2, #2                   ; N (matrix inner dimension)
CONST R3, #0                   ; baseA (matrix A base address)
CONST R4, #4                   ; baseB (matrix B base address)
CONST R5, #8                   ; baseC (matrix C base address)

DIV R6, R0, R2                 ; row = i // N
MUL R7, R6, R2
SUB R7, R0, R7                 ; col = i % N

CONST R8, #0                   ; acc = 0
CONST R9, #0                   ; k = 0

LOOP:
  MUL R10, R6, R2
  ADD R10, R10, R9
  ADD R10, R10, R3             ; addr(A[i]) = row * N + k + baseA
  LDR R10, R10                 ; load A[i] from global memory

  MUL R11, R9, R2
  ADD R11, R11, R7
  ADD R11, R11, R4             ; addr(B[i]) = k * N + col + baseB
  LDR R11, R11                 ; load B[i] from global memory

  MUL R12, R10, R11
  ADD R8, R8, R12              ; acc = acc + A[i] * B[i]

  ADD R9, R9, R1               ; increment k

  CMP R9, R2
  BRn LOOP                    ; loop while k < N

ADD R9, R5, R0                 ; addr(C[i]) = baseC + i
STR R9, R8                     ; store C[i] in global memory

RET                            ; end of kernel
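
The loop structure is easier to follow in a high-level model. The sketch below mirrors the kernel one thread at a time, assuming a flat memory list laid out as in the .data directives. The names `matmul_thread` and `memory` are illustrative and are not part of tiny-gpu.

```python
# Hypothetical reference model of matmul.asm; not part of tiny-gpu.
N = 2                                  # matrix inner dimension (R2)
BASE_A, BASE_B, BASE_C = 0, 4, 8       # base addresses (R3, R4, R5)


def matmul_thread(memory, i):
    """Compute one element of C, where i is the global thread index."""
    row, col = i // N, i % N           # DIV, then MUL + SUB for the modulo
    acc = 0
    k = 0
    while True:                        # the body runs at least once, like LOOP:
        a = memory[BASE_A + row * N + k]
        b = memory[BASE_B + k * N + col]
        acc += a * b
        k += 1
        if not k < N:                  # CMP + BRn: branch back while k < N
            break
    memory[BASE_C + i] = acc


memory = [1, 2, 3, 4] + [1, 2, 3, 4] + [0] * 4
for i in range(4):                     # `.threads 4`
    matmul_thread(memory, i)

result = memory[BASE_C:BASE_C + 4]
print(result)
```

Every thread runs the loop the same number of times (N iterations), which is why the branches converge.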

Trace of the program
Terminal Output

iverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v
MODULE=test.test_matmul vvp -M $(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp
     -.--ns INFO     gpi                                ..mbed/gpi_embed.cpp:79   in set_program_name_in_venv        Did not detect Python virtual environment. Using system-wide Python interpreter
     -.--ns INFO     gpi                                ../gpi/GpiCommon.cpp:101  in gpi_print_registered_impl       VPI registered
     0.00ns INFO     cocotb                             Running on Icarus Verilog version 11.0 (stable)
     0.00ns INFO     cocotb                             Running tests with cocotb v1.9.2 from /usr/local/lib/python3.10/dist-packages/cocotb
     0.00ns INFO     cocotb                             Seeding Python random module with 1734430820
     0.00ns INFO     cocotb.regression                  pytest not found, install it to enable better AssertionError messages
     0.00ns INFO     cocotb.regression                  Found test test.test_matmul.test_matadd
     0.00ns INFO     cocotb.regression                  running test_matadd (1/1)
12300001.00ns INFO     cocotb.regression                  test_matadd passed
12300001.00ns INFO     cocotb.regression                  **************************************************************************************
                                                          ** TEST                          STATUS  SIM TIME (ns)  REAL TIME (s)  RATIO (ns/s) **
                                                          **************************************************************************************
                                                          ** test.test_matmul.test_matadd   PASS    12300001.00           1.13   10873995.65  **
                                                          **************************************************************************************
                                                          ** TESTS=1 PASS=1 FAIL=0 SKIP=0           12300001.00           1.15   10702781.50  **
                                                          **************************************************************************************

Result

Test	STATUS	SIM TIME (ns)	REAL TIME (s)	RATIO (ns/s)	CYCLES
test.test_matmul.test_matadd	PASS	12300001.00	1.13	10873995.65	491
TESTS=1 PASS=1 FAIL=0 SKIP=0		12300001.00	1.15	10702781.50

Optimize Control Flow

Original Scheduler State

1. IDLE: Waiting to start.
2. FETCH: Fetch instructions from program memory.
3. DECODE: Decode instructions into control signals.
4. REQUEST: Request data from registers or memory.
5. WAIT: Wait for response from memory if necessary.
6. EXECUTE: Execute ALU and PC calculations.
7. UPDATE: Update registers, NZP, and PC.
8. DONE: Done executing this block.

Thoughts

Based on the architecture diagram above, the ALU and PC calculations do not need a dedicated EXECUTE stage. They can run during the WAIT stage instead: instructions that do not use the LSU have nothing to wait for, so the WAIT stage is otherwise wasted on them. Merging the two removes one state from the per-instruction sequence.

Optimized States

1. IDLE: Waiting to start.
2. FETCH: Fetch instructions from program memory.
3. DECODE: Decode instructions into control signals.
4. REQUEST: Request data from registers or memory.
5. WAIT: Execute ALU and PC calculations, and wait for response from memory if necessary.
6. UPDATE: Update registers, NZP, and PC.
7. DONE: Done executing this block.
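
The two state lists can be compared with a small Python model of the scheduler. This is only a sketch: the transition conditions and the signal names (`start`, `fetched`, `lsu_busy`, `is_ret`, `has_execute`) are assumptions inferred from the state descriptions, not taken from scheduler.sv.

```python
def next_state(state, *, start=False, fetched=True, lsu_busy=False,
               is_ret=False, has_execute=False):
    """One scheduler step. Signal names are illustrative assumptions."""
    if state == "IDLE":
        return "FETCH" if start else "IDLE"
    if state == "FETCH":
        return "DECODE" if fetched else "FETCH"
    if state == "DECODE":
        return "REQUEST"
    if state == "REQUEST":
        return "WAIT"
    if state == "WAIT":
        if lsu_busy:
            return "WAIT"
        # The original design visits EXECUTE; the optimized one skips it.
        return "EXECUTE" if has_execute else "UPDATE"
    if state == "EXECUTE":
        return "UPDATE"
    if state == "UPDATE":
        return "DONE" if is_ret else "FETCH"
    return "DONE"


def steps_for_one_instruction(has_execute):
    """Count steps from FETCH until the next FETCH, with no memory stalls."""
    state, steps = "FETCH", 0
    while True:
        state = next_state(state, has_execute=has_execute)
        steps += 1
        if state == "FETCH":
            return steps


original_steps = steps_for_one_instruction(has_execute=True)
optimized_steps = steps_for_one_instruction(has_execute=False)
print(original_steps, optimized_steps)
```

The model ignores fetch and memory latency, so it only shows the difference in states visited per instruction, not the measured cycle counts.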

Improved Parts

scheduler.sv

Remove the EXECUTE state. The following case branch is deleted, so WAIT hands off directly to UPDATE:

EXECUTE: begin
    // Execute is synchronous so we move on after one cycle
    core_state <= UPDATE;
end

alu.sv

Change the core_state check so that the ALU computes during the WAIT state (3'b100).

    always @(posedge clk) begin 
        if (reset) begin 
            alu_out_reg <= 8'b0;
        end else if (enable) begin
            // Calculate alu_out when core_state = WAIT
            if (core_state == 3'b100) begin 
                if (decoded_alu_output_mux == 1) begin 
                    // Set values to compare with NZP register in alu_out[2:0]
                    alu_out_reg <= {5'b0, (rs - rt > 0), (rs - rt == 0), (rs - rt < 0)};
                end else begin 
                    // Execute the specified arithmetic instruction
                    case (decoded_alu_arithmetic_mux)
                        ADD: begin 
                            alu_out_reg <= rs + rt;
                        end
                        SUB: begin 
                            alu_out_reg <= rs - rt;
                        end
                        MUL: begin 
                            alu_out_reg <= rs * rt;
                        end
                        DIV: begin 
                            alu_out_reg <= rs / rt;
                        end
                    endcase
                end
            end
        end
    end
endmodule
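
The gating on core_state can be illustrated with a small Python model of the arithmetic path. This is a sketch only: `alu_step` is a hypothetical name, the comparison path is left out, and the 8-bit result width is an assumption based on the `8'b0` reset value above.

```python
WAIT = 0b100  # core_state encoding for WAIT, as in the snippet above


def alu_step(alu_out, core_state, op, rs, rt):
    """Return the next alu_out value for one clock edge (arithmetic ops only)."""
    if core_state != WAIT:
        return alu_out                 # the register holds its value outside WAIT
    if op == "ADD":
        value = rs + rt
    elif op == "SUB":
        value = rs - rt
    elif op == "MUL":
        value = rs * rt
    elif op == "DIV":
        if rt == 0:
            # Division by zero yields an unknown value in Verilog simulation.
            raise ZeroDivisionError("division by zero is not modeled")
        value = rs // rt
    else:
        raise ValueError(f"unknown op: {op}")
    return value & 0xFF                # assumption: alu_out is 8 bits wide


out = alu_step(0, WAIT, "ADD", 200, 100)     # wraps around modulo 256
held = alu_step(out, 0b110, "MUL", 3, 4)     # UPDATE state: value is held
print(out, held)
```

The second call shows why the result is still valid in UPDATE: once WAIT has passed, the register keeps its value until the next instruction reaches WAIT.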

pc.sv

The PC can likewise compute next_pc during the WAIT state.

    always @(posedge clk) begin
        if (reset) begin
            nzp <= 3'b0;
            next_pc <= 0;
        end else if (enable) begin
            // Update PC when core_state = WAIT
            if (core_state == 3'b100) begin 
                if (decoded_pc_mux == 1) begin 
                    if (((nzp & decoded_nzp) != 3'b0)) begin 
                        // On BRnzp instruction, branch to immediate if NZP case matches previous CMP
                        next_pc <= decoded_immediate;
                    end else begin 
                        // Otherwise, just update to PC + 1 (next line)
                        next_pc <= current_pc + 1;
                    end
                end else begin 
                    // By default update to PC + 1 (next line)
                    next_pc <= current_pc + 1;
                end
            end   

            // Store NZP when core_state = UPDATE   
            if (core_state == 3'b110) begin 
                // Write to NZP register on CMP instruction
                if (decoded_nzp_write_enable) begin
                    nzp[2] <= alu_out[2];
                    nzp[1] <= alu_out[1];
                    nzp[0] <= alu_out[0];
                end
            end      
        end
    end
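
The branch decision made in WAIT can be summarized as a pure function. The sketch below mirrors the next_pc logic from the snippet above. `compute_next_pc` is a hypothetical helper name, and the model ignores the register width of next_pc.

```python
def compute_next_pc(current_pc, nzp, decoded_pc_mux, decoded_nzp,
                    decoded_immediate):
    """Mirror the WAIT-state next_pc selection from pc.sv."""
    if decoded_pc_mux == 1 and (nzp & decoded_nzp) != 0:
        return decoded_immediate       # branch taken
    return current_pc + 1              # fall through to the next instruction


taken = compute_next_pc(current_pc=20, nzp=0b100, decoded_pc_mux=1,
                        decoded_nzp=0b100, decoded_immediate=9)
not_taken = compute_next_pc(current_pc=20, nzp=0b010, decoded_pc_mux=1,
                            decoded_nzp=0b100, decoded_immediate=9)
print(taken, not_taken)
```

Because the nzp register is only written in UPDATE, the value read during WAIT is the one left by an earlier CMP, so evaluating the branch one state earlier does not change its outcome.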

format.py

Remove the EXECUTE entry from core_state_map.

def format_core_state(core_state: str) -> str:
    core_state_map = {
        "000": "IDLE",
        "001": "FETCH",
        "010": "DECODE",
        "011": "REQUEST",
        "100": "WAIT",
        "110": "UPDATE",
        "111": "DONE"
    }
    return core_state_map[core_state]
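
One side effect of the smaller map is that a direct dictionary lookup raises KeyError for any encoding that is no longer listed, such as "101". If traces from an older build might still be formatted, a tolerant variant could look like the sketch below. The name `format_core_state_safe` is hypothetical and is not part of the project.

```python
CORE_STATE_MAP = {
    "000": "IDLE",
    "001": "FETCH",
    "010": "DECODE",
    "011": "REQUEST",
    "100": "WAIT",
    "110": "UPDATE",
    "111": "DONE",
}


def format_core_state_safe(core_state: str) -> str:
    """Like format_core_state, but does not raise on unmapped encodings."""
    return CORE_STATE_MAP.get(core_state, f"UNKNOWN({core_state})")


known = format_core_state_safe("100")
unknown = format_core_state_safe("101")
print(known, unknown)
```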

Evaluation

Matrix Addition

Trace of the program
Terminal Output

iverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v
MODULE=test.test_matadd vvp -M $(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp
     -.--ns INFO     gpi                                ..mbed/gpi_embed.cpp:79   in set_program_name_in_venv        Did not detect Python virtual environment. Using system-wide Python interpreter
     -.--ns INFO     gpi                                ../gpi/GpiCommon.cpp:101  in gpi_print_registered_impl       VPI registered
     0.00ns INFO     cocotb                             Running on Icarus Verilog version 11.0 (stable)
     0.00ns INFO     cocotb                             Running tests with cocotb v1.9.2 from /usr/local/lib/python3.10/dist-packages/cocotb
     0.00ns INFO     cocotb                             Seeding Python random module with 1735826852
     0.00ns INFO     cocotb.regression                  pytest not found, install it to enable better AssertionError messages
     0.00ns INFO     cocotb.regression                  Found test test.test_matadd.test_matadd
     0.00ns INFO     cocotb.regression                  running test_matadd (1/1)
4200001.00ns INFO     cocotb.regression                  test_matadd passed
4200001.00ns INFO     cocotb.regression                  **************************************************************************************
                                                         ** TEST                          STATUS  SIM TIME (ns)  REAL TIME (s)  RATIO (ns/s) **
                                                         **************************************************************************************
                                                         ** test.test_matadd.test_matadd   PASS     4200001.00           1.36    3084251.90  **
                                                         **************************************************************************************
                                                         ** TESTS=1 PASS=1 FAIL=0 SKIP=0            4200001.00           1.38    3045972.60  **
                                                         **************************************************************************************

Result

Test	STATUS	SIM TIME (ns)	REAL TIME (s)	RATIO (ns/s)	CYCLES
Original Matrix Addition	PASS	4475001.00	1.23	3623572.44	178
Optimized Matrix Addition	PASS	4200001.00	1.36	3084251.90	167

Improved by 11 cycles (from 178 to 167).

Matrix Multiplication

Trace of the program
Terminal Output

iverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v
MODULE=test.test_matmul vvp -M $(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp
     -.--ns INFO     gpi                                ..mbed/gpi_embed.cpp:79   in set_program_name_in_venv        Did not detect Python virtual environment. Using system-wide Python interpreter
     -.--ns INFO     gpi                                ../gpi/GpiCommon.cpp:101  in gpi_print_registered_impl       VPI registered
     0.00ns INFO     cocotb                             Running on Icarus Verilog version 11.0 (stable)
     0.00ns INFO     cocotb                             Running tests with cocotb v1.9.2 from /usr/local/lib/python3.10/dist-packages/cocotb
     0.00ns INFO     cocotb                             Seeding Python random module with 1735826930
     0.00ns INFO     cocotb.regression                  pytest not found, install it to enable better AssertionError messages
     0.00ns INFO     cocotb.regression                  Found test test.test_matmul.test_matadd
     0.00ns INFO     cocotb.regression                  running test_matadd (1/1)
11275001.00ns INFO     cocotb.regression                  test_matadd passed
11275001.00ns INFO     cocotb.regression                  **************************************************************************************
                                                          ** TEST                          STATUS  SIM TIME (ns)  REAL TIME (s)  RATIO (ns/s) **
                                                          **************************************************************************************
                                                          ** test.test_matmul.test_matadd   PASS    11275001.00           1.08   10486023.66  **
                                                          **************************************************************************************
                                                          ** TESTS=1 PASS=1 FAIL=0 SKIP=0           11275001.00           1.09   10305411.66  **
                                                          **************************************************************************************

Result

Test	STATUS	SIM TIME (ns)	REAL TIME (s)	RATIO (ns/s)	CYCLES
Original Matrix Multiplication	PASS	12300001.00	1.13	10873995.65	491
Optimized Matrix Multiplication	PASS	11275001.00	1.08	10486023.66	450

Improved by 41 cycles (from 491 to 450).
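
To put both results on the same scale, the relative improvement can be computed from the cycle counts in the two result tables. This is a small arithmetic sketch, and the variable names are illustrative.

```python
# Cycle counts copied from the two result tables: (original, optimized).
results = {
    "matadd": (178, 167),
    "matmul": (491, 450),
}

summary = {}
for name, (original, optimized) in results.items():
    saved = original - optimized
    percent = round(100 * saved / original, 1)
    summary[name] = (saved, percent)
    print(f"{name}: {saved} cycles saved ({percent}%)")
```

With these numbers, the saving is about 6.2% for matrix addition and about 8.4% for matrix multiplication.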

TODO: Consider submitting pull requests back to tiny-gpu.