章元豪
Planned improvements include:
This matrix addition kernel adds two 1 x 8 matrices by performing 8 element-wise additions in separate threads.
This demonstration uses the %blockIdx, %blockDim, and %threadIdx registers to show SIMD programming on this GPU. It also uses the LDR and STR instructions, which require async memory management.
matadd.asm
.threads 8
.data 0 1 2 3 4 5 6 7 ; matrix A (1 x 8)
.data 0 1 2 3 4 5 6 7 ; matrix B (1 x 8)
MUL R0, %blockIdx, %blockDim
ADD R0, R0, %threadIdx ; i = blockIdx * blockDim + threadIdx
CONST R1, #0 ; baseA (matrix A base address)
CONST R2, #8 ; baseB (matrix B base address)
CONST R3, #16 ; baseC (matrix C base address)
ADD R4, R1, R0 ; addr(A[i]) = baseA + i
LDR R4, R4 ; load A[i] from global memory
ADD R5, R2, R0 ; addr(B[i]) = baseB + i
LDR R5, R5 ; load B[i] from global memory
ADD R6, R4, R5 ; C[i] = A[i] + B[i]
ADD R7, R3, R0 ; addr(C[i]) = baseC + i
STR R7, R6 ; store C[i] in global memory
RET ; end of kernel
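As a cross-check of what the kernel above should leave in memory, here is a minimal Python reference model. It is a sketch, not part of the test suite: the function names, the 256-entry memory, and the block size of 4 threads are assumptions made for illustration, and the threads run sequentially here rather than in parallel.

```python
def matadd_thread(memory, block_idx, block_dim, thread_idx):
    """Model one thread of matadd.asm (8-bit data path assumed)."""
    i = block_idx * block_dim + thread_idx    # MUL / ADD on R0
    base_a, base_b, base_c = 0, 8, 16         # CONST R1, R2, R3
    a = memory[base_a + i]                    # LDR R4
    b = memory[base_b + i]                    # LDR R5
    memory[base_c + i] = (a + b) & 0xFF       # ADD R6, then STR


def run_matadd(threads=8, block_dim=4):
    """Run every thread in turn; block_dim=4 is an assumed block size."""
    memory = [0] * 256                        # assumed memory size
    memory[0:8] = range(8)                    # matrix A (.data line 1)
    memory[8:16] = range(8)                   # matrix B (.data line 2)
    for block_idx in range(threads // block_dim):
        for thread_idx in range(block_dim):
            matadd_thread(memory, block_idx, block_dim, thread_idx)
    return memory


if __name__ == "__main__":
    print(run_matadd()[16:24])
```

Because each thread only touches its own index `i`, the result does not depend on the block size chosen, as long as it divides the thread count.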
iverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v
MODULE=test.test_matadd vvp -M $(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp
-.--ns INFO gpi ..mbed/gpi_embed.cpp:79 in set_program_name_in_venv Did not detect Python virtual environment. Using system-wide Python interpreter
-.--ns INFO gpi ../gpi/GpiCommon.cpp:101 in gpi_print_registered_impl VPI registered
0.00ns INFO cocotb Running on Icarus Verilog version 11.0 (stable)
0.00ns INFO cocotb Running tests with cocotb v1.9.2 from /usr/local/lib/python3.10/dist-packages/cocotb
0.00ns INFO cocotb Seeding Python random module with 1734430799
0.00ns INFO cocotb.regression pytest not found, install it to enable better AssertionError messages
0.00ns INFO cocotb.regression Found test test.test_matadd.test_matadd
0.00ns INFO cocotb.regression running test_matadd (1/1)
4475001.00ns INFO cocotb.regression test_matadd passed
4475001.00ns INFO cocotb.regression **************************************************************************************
** TEST STATUS SIM TIME (ns) REAL TIME (s) RATIO (ns/s) **
**************************************************************************************
** test.test_matadd.test_matadd PASS 4475001.00 1.23 3623572.44 **
**************************************************************************************
** TESTS=1 PASS=1 FAIL=0 SKIP=0 4475001.00 1.25 3577372.47 **
**************************************************************************************
Test | STATUS | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) | CYCLES |
---|---|---|---|---|---|
test.test_matadd.test_matadd | PASS | 4475001.00 | 1.23 | 3623572.44 | 178 |
TESTS=1 PASS=1 FAIL=0 SKIP=0 | | 4475001.00 | 1.25 | 3577372.47 | |
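The CYCLES column can be related to SIM TIME with a small helper. The constants below are inferred from the reported numbers (4475001 ns for 178 cycles, 12300001 ns for 491 cycles), not read from the testbench, so treat the 25000 ns clock period and the one-cycle offset as assumptions.

```python
CLOCK_PERIOD_NS = 25_000   # assumed: inferred from the reported results
STARTUP_CYCLES = 1         # assumed: one extra period before counting starts


def cycles_from_sim_time(sim_time_ns: float) -> int:
    """Estimate the cycle count from the cocotb SIM TIME value."""
    # The reported times end in ...001 ns, so drop that trailing 1 ns first.
    whole_periods = (int(sim_time_ns) - 1) // CLOCK_PERIOD_NS
    return whole_periods - STARTUP_CYCLES


if __name__ == "__main__":
    print(cycles_from_sim_time(4475001.00))
    print(cycles_from_sim_time(12300001.00))
```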
The matrix multiplication kernel multiplies two 2x2 matrices. Each thread computes one output element as the dot product of the relevant row and column, and uses the CMP and BRnzp instructions to demonstrate branching within the threads (notably, all branches converge, so this kernel works on the current tiny-gpu implementation).
matmul.asm
.threads 4
.data 1 2 3 4 ; matrix A (2 x 2)
.data 1 2 3 4 ; matrix B (2 x 2)
MUL R0, %blockIdx, %blockDim
ADD R0, R0, %threadIdx ; i = blockIdx * blockDim + threadIdx
CONST R1, #1 ; increment
CONST R2, #2 ; N (matrix inner dimension)
CONST R3, #0 ; baseA (matrix A base address)
CONST R4, #4 ; baseB (matrix B base address)
CONST R5, #8 ; baseC (matrix C base address)
DIV R6, R0, R2 ; row = i // N
MUL R7, R6, R2
SUB R7, R0, R7 ; col = i % N
CONST R8, #0 ; acc = 0
CONST R9, #0 ; k = 0
LOOP:
MUL R10, R6, R2
ADD R10, R10, R9
ADD R10, R10, R3 ; addr(A[i]) = row * N + k + baseA
LDR R10, R10 ; load A[i] from global memory
MUL R11, R9, R2
ADD R11, R11, R7
ADD R11, R11, R4 ; addr(B[i]) = k * N + col + baseB
LDR R11, R11 ; load B[i] from global memory
MUL R12, R10, R11
ADD R8, R8, R12 ; acc = acc + A[i] * B[i]
ADD R9, R9, R1 ; increment k
CMP R9, R2
BRn LOOP ; loop while k < N
ADD R9, R5, R0 ; addr(C[i]) = baseC + i
STR R9, R8 ; store C[i] in global memory
RET ; end of kernel
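The per-thread logic of the kernel above can be mirrored in a short Python reference model. This is a sketch under stated assumptions: the function names are hypothetical, results are masked to 8 bits, and the threads run one after another instead of in parallel.

```python
def matmul_thread(memory, i, n=2, base_a=0, base_b=4, base_c=8):
    """Model one thread of matmul.asm computing C[i] for flat index i."""
    row = i // n                      # DIV R6, R0, R2
    col = i - row * n                 # MUL / SUB: col = i % N
    acc = 0                           # CONST R8, #0
    k = 0                             # CONST R9, #0
    while True:                       # LOOP: body always runs at least once
        a = memory[base_a + row * n + k]
        b = memory[base_b + k * n + col]
        acc = (acc + a * b) & 0xFF    # 8-bit registers assumed
        k += 1
        if not k < n:                 # CMP R9, R2 / BRn LOOP
            break
    memory[base_c + i] = acc          # STR R9, R8


def run_matmul():
    """Run the four threads in turn on the .data values from matmul.asm."""
    memory = [1, 2, 3, 4, 1, 2, 3, 4] + [0] * 8
    for i in range(4):
        matmul_thread(memory, i)
    return memory


if __name__ == "__main__":
    print(run_matmul()[8:12])
```

Every thread takes the branch the same number of times (once, for N = 2), which is why the branches converge.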
iverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v
MODULE=test.test_matmul vvp -M $(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp
-.--ns INFO gpi ..mbed/gpi_embed.cpp:79 in set_program_name_in_venv Did not detect Python virtual environment. Using system-wide Python interpreter
-.--ns INFO gpi ../gpi/GpiCommon.cpp:101 in gpi_print_registered_impl VPI registered
0.00ns INFO cocotb Running on Icarus Verilog version 11.0 (stable)
0.00ns INFO cocotb Running tests with cocotb v1.9.2 from /usr/local/lib/python3.10/dist-packages/cocotb
0.00ns INFO cocotb Seeding Python random module with 1734430820
0.00ns INFO cocotb.regression pytest not found, install it to enable better AssertionError messages
0.00ns INFO cocotb.regression Found test test.test_matmul.test_matadd
0.00ns INFO cocotb.regression running test_matadd (1/1)
12300001.00ns INFO cocotb.regression test_matadd passed
12300001.00ns INFO cocotb.regression **************************************************************************************
** TEST STATUS SIM TIME (ns) REAL TIME (s) RATIO (ns/s) **
**************************************************************************************
** test.test_matmul.test_matadd PASS 12300001.00 1.13 10873995.65 **
**************************************************************************************
** TESTS=1 PASS=1 FAIL=0 SKIP=0 12300001.00 1.15 10702781.50 **
**************************************************************************************
Test | STATUS | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) | CYCLES |
---|---|---|---|---|---|
test.test_matmul.test_matadd | PASS | 12300001.00 | 1.13 | 10873995.65 | 491 |
TESTS=1 PASS=1 FAIL=0 SKIP=0 | | 12300001.00 | 1.15 | 10702781.50 | |
1. IDLE: Waiting to start.
2. FETCH: Fetch instructions from program memory.
3. DECODE: Decode instructions into control signals.
4. REQUEST: Request data from registers or memory.
5. WAIT: Wait for response from memory if necessary.
6. EXECUTE: Execute ALU and PC calculations.
7. UPDATE: Update registers, NZP, and PC.
8. DONE: Done executing this block.
Based on the architecture diagram above, I found that the ALU and PC calculations do not need a dedicated EXECUTE stage. They can run during the WAIT stage instead: instructions that do not use the LSU have nothing to wait for, so they would otherwise spend that stage idle.
1. IDLE: Waiting to start.
2. FETCH: Fetch instructions from program memory.
3. DECODE: Decode instructions into control signals.
4. REQUEST: Request data from registers or memory.
5. WAIT: Execute ALU and PC calculations, and wait for response from memory if necessary.
6. UPDATE: Update registers, NZP, and PC.
7. DONE: Done executing this block.
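The effect of dropping EXECUTE can be sketched by counting the states each instruction passes through. This is a lower-bound model only: it assumes every state takes at least one cycle, while FETCH and WAIT may take longer when memory is slow. The names below are illustrative.

```python
# Per-instruction states, excluding IDLE (before start) and DONE (after RET).
ORIGINAL_STATES = ["FETCH", "DECODE", "REQUEST", "WAIT", "EXECUTE", "UPDATE"]
OPTIMIZED_STATES = [s for s in ORIGINAL_STATES if s != "EXECUTE"]


def min_cycles(num_instructions: int, states: list) -> int:
    """Lower bound on cycles if each state takes exactly one cycle."""
    return num_instructions * len(states)


def cycles_saved(num_instructions: int) -> int:
    """Cycles saved by merging EXECUTE into WAIT, one per instruction."""
    return (min_cycles(num_instructions, ORIGINAL_STATES)
            - min_cycles(num_instructions, OPTIMIZED_STATES))


if __name__ == "__main__":
    print(OPTIMIZED_STATES)
    print(cycles_saved(13))
```

Under this model the saving is one cycle per executed instruction, regardless of how long the memory accesses take.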
Remove the EXECUTE case from the core state machine, so that WAIT is followed directly by UPDATE:
EXECUTE: begin
// Execute is synchronous so we move on after one cycle
core_state <= UPDATE;
end
Change the ALU so that it computes its result when core_state is WAIT (3'b100):
always @(posedge clk) begin
if (reset) begin
alu_out_reg <= 8'b0;
end else if (enable) begin
// Calculate alu_out when core_state = WAIT
if (core_state == 3'b100) begin
if (decoded_alu_output_mux == 1) begin
// Set values to compare with NZP register in alu_out[2:0]
alu_out_reg <= {5'b0, (rs - rt > 0), (rs - rt == 0), (rs - rt < 0)};
end else begin
// Execute the specified arithmetic instruction
case (decoded_alu_arithmetic_mux)
ADD: begin
alu_out_reg <= rs + rt;
end
SUB: begin
alu_out_reg <= rs - rt;
end
MUL: begin
alu_out_reg <= rs * rt;
end
DIV: begin
alu_out_reg <= rs / rt;
end
endcase
end
end
end
end
endmodule
The PC can likewise compute next_pc during the WAIT state:
always @(posedge clk) begin
if (reset) begin
nzp <= 3'b0;
next_pc <= 0;
end else if (enable) begin
// Update PC when core_state = WAIT
if (core_state == 3'b100) begin
if (decoded_pc_mux == 1) begin
if (((nzp & decoded_nzp) != 3'b0)) begin
// On BRnzp instruction, branch to immediate if NZP case matches previous CMP
next_pc <= decoded_immediate;
end else begin
// Otherwise, just update to PC + 1 (next line)
next_pc <= current_pc + 1;
end
end else begin
// By default update to PC + 1 (next line)
next_pc <= current_pc + 1;
end
end
// Store NZP when core_state = UPDATE
if (core_state == 3'b110) begin
// Write to NZP register on CMP instruction
if (decoded_nzp_write_enable) begin
nzp[2] <= alu_out[2];
nzp[1] <= alu_out[1];
nzp[0] <= alu_out[0];
end
end
end
end
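The next-PC selection in the block above can be expressed as a small Python function for quick experiments. It is a behavioral sketch only: it ignores clocking, reset, and the enable signal, and it does not model the PC register width, which is not shown here.

```python
def next_pc(current_pc: int, nzp: int, decoded_nzp: int,
            decoded_pc_mux: int, decoded_immediate: int) -> int:
    """Return the PC value the WAIT state would latch into next_pc."""
    # Branch only on a BRnzp instruction whose condition bits overlap
    # the NZP register written by the previous CMP.
    if decoded_pc_mux == 1 and (nzp & decoded_nzp) != 0:
        return decoded_immediate
    # Every other case falls through to the next instruction.
    return current_pc + 1


if __name__ == "__main__":
    print(next_pc(5, 0b100, 0b100, 1, 2))   # branch taken
    print(next_pc(5, 0b010, 0b100, 1, 2))   # condition not met
```

Note that NZP is still written in UPDATE, so a branch evaluated in WAIT sees the flags from the preceding CMP, which has already completed its own UPDATE.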
Remove the EXECUTE entry from format_core_state in the test helper:
def format_core_state(core_state: str) -> str:
core_state_map = {
"000": "IDLE",
"001": "FETCH",
"010": "DECODE",
"011": "REQUEST",
"100": "WAIT",
"110": "UPDATE",
"111": "DONE"
}
return core_state_map[core_state]
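With the EXECUTE entry gone, any encoding that is not in the map raises a KeyError. If that is undesirable while debugging, a more forgiving variant could look like the sketch below. The name format_core_state_safe and the UNKNOWN label are hypothetical, not part of the existing test helper.

```python
CORE_STATE_MAP = {
    "000": "IDLE",
    "001": "FETCH",
    "010": "DECODE",
    "011": "REQUEST",
    "100": "WAIT",
    "110": "UPDATE",
    "111": "DONE",
}


def format_core_state_safe(core_state: str) -> str:
    """Like format_core_state, but labels unmapped encodings instead of raising."""
    return CORE_STATE_MAP.get(core_state, f"UNKNOWN({core_state})")


if __name__ == "__main__":
    print(format_core_state_safe("100"))
    print(format_core_state_safe("101"))
```

Keeping the strict version in the tests is also reasonable, since an unexpected encoding then fails loudly.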
iverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v
MODULE=test.test_matadd vvp -M $(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp
-.--ns INFO gpi ..mbed/gpi_embed.cpp:79 in set_program_name_in_venv Did not detect Python virtual environment. Using system-wide Python interpreter
-.--ns INFO gpi ../gpi/GpiCommon.cpp:101 in gpi_print_registered_impl VPI registered
0.00ns INFO cocotb Running on Icarus Verilog version 11.0 (stable)
0.00ns INFO cocotb Running tests with cocotb v1.9.2 from /usr/local/lib/python3.10/dist-packages/cocotb
0.00ns INFO cocotb Seeding Python random module with 1735826852
0.00ns INFO cocotb.regression pytest not found, install it to enable better AssertionError messages
0.00ns INFO cocotb.regression Found test test.test_matadd.test_matadd
0.00ns INFO cocotb.regression running test_matadd (1/1)
4200001.00ns INFO cocotb.regression test_matadd passed
4200001.00ns INFO cocotb.regression **************************************************************************************
** TEST STATUS SIM TIME (ns) REAL TIME (s) RATIO (ns/s) **
**************************************************************************************
** test.test_matadd.test_matadd PASS 4200001.00 1.36 3084251.90 **
**************************************************************************************
** TESTS=1 PASS=1 FAIL=0 SKIP=0 4200001.00 1.38 3045972.60 **
**************************************************************************************
Test | STATUS | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) | CYCLES |
---|---|---|---|---|---|
Original Matrix Addition | PASS | 4475001.00 | 1.23 | 3623572.44 | 178 |
Optimized Matrix Addition | PASS | 4200001.00 | 1.36 | 3084251.90 | 167 |
Improved by 11 cycles (178 to 167, about 6.2% fewer).
iverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v
MODULE=test.test_matmul vvp -M $(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp
-.--ns INFO gpi ..mbed/gpi_embed.cpp:79 in set_program_name_in_venv Did not detect Python virtual environment. Using system-wide Python interpreter
-.--ns INFO gpi ../gpi/GpiCommon.cpp:101 in gpi_print_registered_impl VPI registered
0.00ns INFO cocotb Running on Icarus Verilog version 11.0 (stable)
0.00ns INFO cocotb Running tests with cocotb v1.9.2 from /usr/local/lib/python3.10/dist-packages/cocotb
0.00ns INFO cocotb Seeding Python random module with 1735826930
0.00ns INFO cocotb.regression pytest not found, install it to enable better AssertionError messages
0.00ns INFO cocotb.regression Found test test.test_matmul.test_matadd
0.00ns INFO cocotb.regression running test_matadd (1/1)
11275001.00ns INFO cocotb.regression test_matadd passed
11275001.00ns INFO cocotb.regression **************************************************************************************
** TEST STATUS SIM TIME (ns) REAL TIME (s) RATIO (ns/s) **
**************************************************************************************
** test.test_matmul.test_matadd PASS 11275001.00 1.08 10486023.66 **
**************************************************************************************
** TESTS=1 PASS=1 FAIL=0 SKIP=0 11275001.00 1.09 10305411.66 **
**************************************************************************************
Test | STATUS | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) | CYCLES |
---|---|---|---|---|---|
Original Matrix Multiplication | PASS | 12300001.00 | 1.13 | 10873995.65 | 491 |
Optimized Matrix Multiplication | PASS | 11275001.00 | 1.08 | 10486023.66 | 450 |
Improved by 41 cycles (491 to 450, about 8.4% fewer).
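One way to sanity-check these savings is to count the instructions each thread executes, on the assumption that removing EXECUTE saves one cycle per executed instruction. The instruction counts below are read off matadd.asm and matmul.asm; the function names are illustrative.

```python
def matadd_dynamic_instructions() -> int:
    """matadd.asm is straight-line code: 13 instructions from MUL to RET."""
    return 13


def matmul_dynamic_instructions(n: int = 2) -> int:
    """Instructions executed by one matmul.asm thread for inner dimension n."""
    setup = 12        # MUL R0 ... CONST R9 before LOOP
    loop_body = 13    # MUL R10 ... BRn LOOP
    epilogue = 3      # ADD R9, STR, RET
    return setup + loop_body * n + epilogue


if __name__ == "__main__":
    print(matadd_dynamic_instructions())
    print(matmul_dynamic_instructions())
```

For matrix multiplication the count is 41, which matches the measured saving of 41 cycles. For matrix addition the count is 13 while the measured saving is 11 cycles; this sketch does not explain the difference. One possible factor, not verified here, is that the 8 threads are split across more than one block, so memory contention between blocks shifts the timing.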
TODO: Consider submitting pull requests back to tiny-gpu.