章元豪
Planned improvements include:
This matrix addition kernel adds two 1 x 8 matrices by performing 8 element-wise additions in separate threads. The demonstration uses the `%blockIdx`, `%blockDim`, and `%threadIdx` registers to show SIMD programming on this GPU. It also uses the `LDR` and `STR` instructions, which require asynchronous memory management.
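The per-thread indexing the kernel relies on can be sketched in Python (an illustrative model, not the kernel itself; the `blocks` and `threads_per_block` parameters are hypothetical stand-ins for the values held in the block/thread registers):

```python
# Each of the 8 threads computes one element of the 1 x 8 result.
# The global index mirrors %blockIdx * %blockDim + %threadIdx.

def matadd(a, b, blocks, threads_per_block):
    c = [0] * len(a)
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            # Same index calculation each thread performs in the kernel.
            i = block_idx * threads_per_block + thread_idx
            c[i] = a[i] + b[i]  # one LDR/LDR/ADD/STR sequence per thread
    return c

a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [0, 1, 2, 3, 4, 5, 6, 7]
print(matadd(a, b, blocks=2, threads_per_block=4))
# [0, 2, 4, 6, 8, 10, 12, 14]
```

Because every thread runs the same instruction stream on a different index, no thread ever diverges from its neighbors.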
matadd.asm
Test | STATUS | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) | CYCLES |
---|---|---|---|---|---|
test.test_matadd.test_matadd | PASS | 4475001.00 | 1.23 | 3623572.44 | 178 |
TESTS=1 PASS=1 FAIL=0 SKIP=0 | | 4475001.00 | 1.25 | 3577372.47 | |
The matrix multiplication kernel multiplies two 2 x 2 matrices. It performs an element-wise calculation of the dot product of the relevant row and column, and uses the `CMP` and `BRnzp` instructions to demonstrate branching within the threads (notably, all branches converge, so this kernel works on the current tiny-gpu implementation).
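The work of a single thread can be sketched in Python (illustrative only; `matmul_thread` is a hypothetical helper, and the `while` loop stands in for the kernel's `CMP`/`BRnzp`-guarded loop):

```python
# Each thread computes one element of C = A x B for 2 x 2 matrices
# stored row-major. Every thread iterates exactly N times, so all
# threads take the same branches and the branches converge.

N = 2

def matmul_thread(a, b, i):
    row, col = i // N, i % N  # derived from the thread's global index
    acc = 0
    k = 0
    while True:
        acc += a[row * N + k] * b[k * N + col]
        k += 1
        if k == N:  # kernel: CMP k, N then BRnzp out of the loop
            break
    return acc

a = [1, 2, 3, 4]  # [[1, 2], [3, 4]]
b = [1, 2, 3, 4]
c = [matmul_thread(a, b, i) for i in range(N * N)]
print(c)  # [7, 10, 15, 22]
```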
matmul.asm
Test | STATUS | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) | CYCLES |
---|---|---|---|---|---|
test.test_matmul.test_matadd | PASS | 12300001.00 | 1.13 | 10873995.65 | 491 |
TESTS=1 PASS=1 FAIL=0 SKIP=0 | | 12300001.00 | 1.15 | 10702781.50 | |
1. `IDLE`: Waiting to start.
2. `FETCH`: Fetch instructions from program memory.
3. `DECODE`: Decode instructions into control signals.
4. `REQUEST`: Request data from registers or memory.
5. `WAIT`: Wait for a response from memory if necessary.
6. `EXECUTE`: Execute ALU and PC calculations.
7. `UPDATE`: Update registers, NZP, and PC.
8. `DONE`: Done executing this block.
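A toy per-instruction cycle count for this state machine (purely illustrative Python, not the SystemVerilog FSM; the `mem_wait=2` latency is an assumed value, not measured from the design) shows how the `WAIT` stage is wasted for instructions that don't touch memory:

```python
# One cycle per pipeline state; WAIT stretches while an LSU request
# is outstanding, but still costs a cycle even when no request exists.

def cycles_per_instruction(uses_lsu, mem_wait=2):
    per_state = {
        "FETCH": 1,
        "DECODE": 1,
        "REQUEST": 1,
        "WAIT": mem_wait if uses_lsu else 1,  # wasted cycle for ALU-only ops
        "EXECUTE": 1,
        "UPDATE": 1,
    }
    return sum(per_state.values())

print(cycles_per_instruction(uses_lsu=True))   # 7
print(cycles_per_instruction(uses_lsu=False))  # 6, one of them idle in WAIT
```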
Based on the architecture diagram above, I found that the ALU and PC calculations in the `EXECUTE` stage don't necessarily need a stage of their own. They can be performed in the `WAIT` stage instead, since instructions that don't use the LSU would otherwise waste that stage.
1. `IDLE`: Waiting to start.
2. `FETCH`: Fetch instructions from program memory.
3. `DECODE`: Decode instructions into control signals.
4. `REQUEST`: Request data from registers or memory.
5. `WAIT`: Execute ALU and PC calculations, and wait for a response from memory if necessary.
6. `UPDATE`: Update registers, NZP, and PC.
7. `DONE`: Done executing this block.
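Using the same style of toy cycle count (illustrative Python only; `mem_wait=2` is an assumed memory latency, not taken from the RTL), folding the ALU and PC work into `WAIT` saves one cycle on every instruction:

```python
# Per-instruction cost of each pipeline, counting one cycle per state.

def cycles_original(uses_lsu, mem_wait=2):
    # FETCH, DECODE, REQUEST, WAIT, EXECUTE, UPDATE
    return 5 + (mem_wait if uses_lsu else 1)

def cycles_optimized(uses_lsu, mem_wait=2):
    # FETCH, DECODE, REQUEST, WAIT (now also does ALU/PC), UPDATE
    return 4 + (mem_wait if uses_lsu else 1)

for uses_lsu in (False, True):
    saved = cycles_original(uses_lsu) - cycles_optimized(uses_lsu)
    print(uses_lsu, saved)  # every instruction saves exactly 1 cycle
```

The measured savings below (11 and 41 cycles) are smaller than instruction count times thread count because cores interleave their per-thread work.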
1. Remove the `EXECUTE` state.
2. Change `core_state` checks to trigger on the `WAIT` state instead.
3. Perform PC calculations during the `WAIT` phase as well.
Test | STATUS | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) | CYCLES |
---|---|---|---|---|---|
Original Matrix Addition | PASS | 4475001.00 | 1.23 | 3623572.44 | 178 |
Optimized Matrix Addition | PASS | 4200001.00 | 1.36 | 3084251.90 | 167 |
Improved by 11 cycles (178 → 167).
Test | STATUS | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) | CYCLES |
---|---|---|---|---|---|
Original Matrix Multiplication | PASS | 12300001.00 | 1.13 | 10873995.65 | 491 |
Optimized Matrix Multiplication | PASS | 11275001.00 | 1.08 | 10486023.66 | 450 |
Improved by 41 cycles (491 → 450).
TODO: Consider submitting a pull request back to tiny-gpu.