# Evaluate tiny-gpu
> 侯廷錡
## Summary
1. GEMM on Tiny-GPU:
- Study [tiny-gpu](https://github.com/adam-maj/tiny-gpu).
- Implement and test a basic GEMM (matrix multiplication) kernel.
2. Cache-Enhanced Tiny-GPU:
- Evaluate [dsandall/tiny-gpu](https://github.com/dsandall/tiny-gpu/tree/master) with its multicore cache.
- Understand cache impact on performance.
- Implement a GEMM kernel on this cache-enabled GPU and quantify performance gains.
3. gemmlowp Strategies:
- Review the [gemmlowp design](https://github.com/google/gemmlowp/blob/master/doc/design.md) for packing and cache blocking techniques.
- Apply these strategies (data packing, cache-friendly blocking) to improve the GEMM kernel on Tiny-GPU for better efficiency.
## Architecture
---
## ISA

## 1. Memory in tiny-gpu (without cache)
**tiny-gpu** has two main external memories:
1. **Program Memory** - stores instructions (16 bits each).
2. **Data Memory** - stores data (8 bits each).

Because all threads across all cores need to read and write these memories, **memory controllers** arbitrate requests to avoid collisions when too many requests arrive at once.
- The **LSU** in each thread issues `mem_read_valid`/`mem_write_valid` requests.
- The **fetcher** in each core issues `program_mem_read_valid` requests.
- The **memory controllers** (`controller.sv`) accept these requests, dispatch them to memory as bandwidth permits, and route responses back to the correct LSU or fetcher (a simplified behavioral sketch follows).
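To make the arbitration idea concrete, here is a minimal software-level sketch of what a controller does each cycle. This is not the `controller.sv` source; the channel count and the simple first-come policy are assumptions for illustration.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// One outstanding request from an LSU or fetcher.
struct MemRequest {
    int      requester;   // which LSU/fetcher issued it
    bool     is_write;
    uint32_t address;
    uint8_t  write_data;
};

// Behavioral model: each cycle, accept up to `channels` pending requests and
// dispatch them to external memory; everyone else keeps waiting (their
// *_valid signal stays asserted until accepted).
std::vector<MemRequest> arbitrate(std::vector<std::optional<MemRequest>>& pending,
                                  int channels) {
    std::vector<MemRequest> dispatched;
    for (auto& slot : pending) {
        if (static_cast<int>(dispatched.size()) == channels) break;  // bandwidth limit
        if (slot) {                     // request is valid this cycle
            dispatched.push_back(*slot);
            slot.reset();               // request accepted
        }
    }
    return dispatched;  // responses are later routed back by `requester` id
}
```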
### Program Memory Flow

1. Each core has a Fetcher that issues `program_mem_read_valid` + `program_mem_read_address` when it needs the next instruction.
2. The `program_memory_controller.sv` module collects these requests from multiple cores (up to `NUM_CORES`), decides which to send out to external memory, waits for `mem_read_ready` from the external memory, and finally returns the instruction data to the appropriate fetcher.
---
### Data Memory Flow

1. Each thread in each core has an LSU, which issues `mem_read_valid` or `mem_write_valid` for loads and stores (plus addresses and data).
2. The `data_memory_controller.sv` collects these requests from all LSUs (`NUM_CORES * THREADS_PER_BLOCK` total) and funnels them to external data memory.
3. Once external memory responds (`mem_read_ready` / `mem_read_data`), the controller passes the data back to the correct LSU.
---
## 2. Memory in tiny-gpu (with cache)
- **LSUs and Data Cache**:
  Each thread's LSU generates read/write requests. The `dmem_cache` module collects these requests, arbitrates among them, and interfaces with an external memory bus.
  - Cache Hit: If an address matches a valid cache line (tag match), the data is returned immediately.
  - Cache Miss: If not, the cache fetches the corresponding line from memory, stores it, and then returns the data. A similar process applies to write requests (see the toy lookup model after this list).
- **Fetchers and Program Cache**:
Each core's fetcher issues instruction-fetch requests. The combination of `pmem_controller` and `pmem_cache` handles these requests for program memory using a read-only cache mechanism, simplifying the flow compared to data caching.
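As a rough mental model of the hit/miss path (not the actual `dmem_cache` implementation; line size, line count, and direct mapping are assumptions), a toy direct-mapped cache lookup looks like this:

```cpp
#include <array>
#include <cstdint>

// Toy direct-mapped cache: 16 lines of 4 bytes each (placeholder sizes).
struct ToyCache {
    struct Line { bool valid = false; uint32_t tag = 0; std::array<uint8_t, 4> data{}; };
    std::array<Line, 16> lines;

    // Returns true on a hit; on a miss the caller must fetch the whole line
    // from external memory, install it with fill(), then retry the read.
    bool read(uint32_t addr, uint8_t& out) {
        uint32_t offset = addr & 0x3;         // byte within the line
        uint32_t index  = (addr >> 2) & 0xF;  // which line
        uint32_t tag    = addr >> 6;          // remaining address bits
        const Line& l = lines[index];
        if (l.valid && l.tag == tag) { out = l.data[offset]; return true; }  // hit
        return false;                                                        // miss
    }

    void fill(uint32_t addr, const std::array<uint8_t, 4>& line_data) {
        Line& l = lines[(addr >> 2) & 0xF];
        l.valid = true;
        l.tag   = addr >> 6;
        l.data  = line_data;
    }
};
```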
### Program Memory Flow

- Fetch Requests: Each core's fetcher sets `fetcher_read_valid` and provides the current PC as `fetcher_read_address`.
- Hit or Miss: If `pmem_cache.sv` has the instruction (hit), it returns it immediately; otherwise `pmem_controller.sv` reads it from external memory and updates the cache.
- Instruction Delivery: Once fetched, the instruction is sent back to the fetcher, which advances to DECODE.
### Data Memory Flow

- LSU Requests: Each thread's LSU asserts read/write signals with an address (and data, if writing).
- Cache Lookup: The `dmem_cache` checks for a valid/tag match:
  - Cache Hit: Data is returned immediately.
  - Cache Miss: A full line is fetched from external DMEM, then sent to the LSU, with the cache updated accordingly.
- Write Handling: Writes update the cache and may propagate to external memory, subject to the chosen write policy.
---
## 3. GEMM kernel
```asm=
0b0101000011011110, # MUL R0, %blockIdx, %blockDim
0b0011000000001111, # ADD R0, R0, %threadIdx ; i = blockIdx * blockDim + threadIdx
0b1001000100000001, # CONST R1, #1 ; increment
0b1001001000000010, # CONST R2, #2 ; N (matrix inner dimension)
0b1001001100000000, # CONST R3, #0 ; baseA (matrix A base address)
0b1001010000000100, # CONST R4, #4 ; baseB (matrix B base address)
0b1001010100001000, # CONST R5, #8 ; baseC (matrix C base address)
0b0110011000000010, # DIV R6, R0, R2 ; row = i // N
0b0101011101100010, # MUL R7, R6, R2
0b0100011100000111, # SUB R7, R0, R7 ; col = i % N
0b1001100000000000, # CONST R8, #0 ; acc = 0
0b1001100100000000, # CONST R9, #0 ; k = 0
# LOOP:
0b0101101001100010, # MUL R10, R6, R2
0b0011101010101001, # ADD R10, R10, R9
0b0011101010100011, # ADD R10, R10, R3 ; addr(A[i]) = row * N + k + baseA
0b0111101010100000, # LDR R10, R10 ; load A[i] from global memory
0b0101101110010010, # MUL R11, R9, R2
0b0011101110110111, # ADD R11, R11, R7
0b0011101110110100, # ADD R11, R11, R4 ; addr(B[i]) = k * N + col + baseB
0b0111101110110000, # LDR R11, R11 ; load B[i] from global memory
0b0101110010101011, # MUL R12, R10, R11
0b0011100010001100, # ADD R8, R8, R12 ; acc = acc + A[i] * B[i]
0b0011100110010001, # ADD R9, R9, R1 ; increment k
0b0010000010010010, # CMP R9, R2
0b0001100000001100, # BRn LOOP ; loop while k < N
0b0011100101010000, # ADD R9, R5, R0 ; addr(C[i]) = baseC + i
0b1000000010011000, # STR R9, R8 ; store C[i] in global memory
0b1111000000000000 # RET ; end of kernel
```
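For reference, each thread in the kernel above computes exactly one element of the 2x2 result. The per-thread work corresponds to this plain scalar loop (the constants N = 2, baseA = 0, baseB = 4, baseC = 8 are taken from the `CONST` instructions; data memory is modeled as a byte array, matching tiny-gpu's 8-bit data width):

```cpp
#include <cstdint>

// Scalar restatement of the per-thread GEMM kernel above.
// mem[] models data memory; i is the global thread index
// (blockIdx * blockDim + threadIdx).
void matmul_thread(uint8_t* mem, int i) {
    const int N = 2, baseA = 0, baseB = 4, baseC = 8;
    int row = i / N;          // row = i // N
    int col = i % N;          // col = i % N
    uint8_t acc = 0;          // 8-bit accumulator, like the R8 register
    for (int k = 0; k < N; ++k) {
        acc += mem[baseA + row * N + k] * mem[baseB + k * N + col];  // acc += A[row][k] * B[k][col]
    }
    mem[baseC + i] = acc;     // store C[row][col]
}
```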
### Matmul Results
| Architecture | Cycles | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) |
|-------------------|--------|----------------|---------------|-----------------|
| **With Cache** | 706 | 17,675,001.00 | 54.25 | 325,784.34 |
| **Without Cache** | 491 | 12,300,001.00 | 11.64 | 1,056,952.17 |
### Matadd Results
| Architecture | Cycles | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) |
|-------------------|--------|----------------|---------------|-----------------|
| **With Cache** | 270 | 6,775,001.00 | 45.76 | 148,046.60 |
| **Without Cache** | 178 | 4,475,001.00 | 27.11 | 165,061.32 |
## 4. Core Concepts of gemmlowp
### GEMM and Cache Friendliness
- General matrix multiplication (GEMM) of \(n \times n\) matrices requires \(n^3\) multiply-add operations, while the data it touches is only \(n^2\) elements.
- The same data will be accessed multiple times, so "efficient use of CPU registers and cache" becomes crucial.
#### Approach:
- Divide large matrices into smaller blocks and sub-blocks.
- Perform computations entirely within registers or L1/L2 cache as much as possible to reduce trips to main memory (see the blocked-loop sketch below).
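A minimal sketch of this blocking idea in plain C++ (not gemmlowp code; the block size of 32 is an arbitrary placeholder, chosen so that three tiles fit comfortably in cache):

```cpp
#include <algorithm>
#include <vector>

// Cache-blocked C = A * B for n x n row-major matrices.
// C must be pre-sized to n*n and zero-initialized by the caller.
void blocked_gemm(const std::vector<float>& A, const std::vector<float>& Bm,
                  std::vector<float>& C, int n, int B = 32) {
    for (int i0 = 0; i0 < n; i0 += B)
        for (int k0 = 0; k0 < n; k0 += B)
            for (int j0 = 0; j0 < n; j0 += B)
                // Multiply one tile of A by one tile of B; all three tiles
                // stay hot in cache while this inner triple loop runs.
                for (int i = i0; i < std::min(i0 + B, n); ++i)
                    for (int k = k0; k < std::min(k0 + B, n); ++k) {
                        float a = A[i * n + k];
                        for (int j = j0; j < std::min(j0 + B, n); ++j)
                            C[i * n + j] += a * Bm[k * n + j];
                    }
}
```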
---
### GEMM Kernel
- **Kernel** refers to the "inner loop code" written specifically for a given processor architecture (registers, SIMD width, etc.)
- **Core Idea**:
- Load as much data into registers as possible.
- Perform as many multiply-add operations as possible.
- Load new data.
- This minimizes expensive memory accesses, particularly to DRAM (see the register-tiling sketch below).
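A sketch of the register-reuse idea, assuming a 2x2 register tile (illustrative C++ only, not an actual tuned kernel): the four accumulators stay in registers across the whole k loop, so every four loads feed four multiply-adds.

```cpp
// Register-tiling sketch: compute a 2x2 tile of C for n x n row-major
// matrices, keeping the four accumulators in registers for the entire k loop.
void micro_kernel_2x2(const float* A, const float* B, float* C, int n, int i, int j) {
    float c00 = 0, c01 = 0, c10 = 0, c11 = 0;                    // live in registers
    for (int k = 0; k < n; ++k) {
        float a0 = A[(i + 0) * n + k], a1 = A[(i + 1) * n + k];  // 2 loads from A
        float b0 = B[k * n + (j + 0)], b1 = B[k * n + (j + 1)];  // 2 loads from B
        c00 += a0 * b0;  c01 += a0 * b1;                         // 4 loads feed
        c10 += a1 * b0;  c11 += a1 * b1;                         // 4 multiply-adds
    }
    C[(i + 0) * n + (j + 0)] = c00;  C[(i + 0) * n + (j + 1)] = c01;
    C[(i + 1) * n + (j + 0)] = c10;  C[(i + 1) * n + (j + 1)] = c11;
}
```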
---
### Block and Pack/Unpack
1. **Block Concept**:
- Due to the limited capacity of registers, large matrices must be processed through multiple "block → load → compute → write-back" cycles.
2. **Purpose of Pack**:
- To improve cache hit rates and speed up read/write operations, data is "reorganized (packed)" into a continuous or cache-aligned format.
- This allows the kernel to read the data simply and efficiently (a packing sketch follows this list).
3. **Purpose of Unpack**:
- After computation, the "partial results (accumulated in int32)" are written back to the output matrix in one step, which is called "Unpack."
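A minimal packing sketch (illustrative only, not gemmlowp's `PackLhs`/`PackRhs`): copy one tile of a row-major uint8 matrix into a contiguous buffer so the kernel can stream it sequentially and every fetched cache line is fully used.

```cpp
#include <cstdint>
#include <vector>

// Copy a rows x cols tile of a row-major uint8 matrix (leading dimension
// `stride`) starting at (row0, col0) into a contiguous packed buffer.
std::vector<uint8_t> pack_tile(const uint8_t* M, int stride,
                               int row0, int col0, int rows, int cols) {
    std::vector<uint8_t> packed;
    packed.reserve(rows * cols);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            packed.push_back(M[(row0 + r) * stride + (col0 + c)]);  // now contiguous
    return packed;
}
```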
---
### Low-Precision Computation (uint8 → int32 → uint8)
- The input and output matrices of gemmlowp are **uint8**, but intermediate accumulation is done in **int32**.
- Computation flow:
1. **Pack**: Reorganize uint8 input data.
2. **Kernel**: Perform computation and accumulate results in int32.
3. **Unpack**: Quantize/convert the int32 results back to uint8 and write them to the output matrix (sketched below).
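A compact sketch of that flow, assuming a naive kernel and a placeholder requantization step (the real gemmlowp pipeline packs both operands, uses tuned kernels, and applies proper offset/scale parameters):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// uint8 inputs, int32 accumulation, then requantization back to uint8.
void gemm_u8(const std::vector<uint8_t>& A, const std::vector<uint8_t>& B,
             std::vector<uint8_t>& C, int n) {
    std::vector<int32_t> acc(n * n, 0);               // int32 accumulator buffer
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                acc[i * n + j] += int32_t(A[i * n + k]) * int32_t(B[k * n + j]);

    // "Unpack": requantize int32 results back to uint8.
    // The right shift is a placeholder for the real scale/offset handling.
    C.assign(n * n, 0);
    for (int i = 0; i < n * n; ++i) {
        int32_t v = acc[i] >> 8;                       // placeholder rescale
        C[i] = uint8_t(std::clamp(v, 0, 255));         // saturate to uint8 range
    }
}
```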
---
### Program Structure
- The main file structure of gemmlowp:
- `single_thread_gemm.h` and `multi_thread_gemm.h`: Handle single-threaded and multi-threaded GEMM workflows.
- `pack.h`, `compute.h`, `unpack.h`: Implement the three main stages: "Pack / Kernel / Unpack."
- NEON-optimized files (e.g., `pack_neon.h`, `kernel_neon.h`): Optimizations for ARM platforms.
#### Typical Workflow (Single-threaded Example):
1. **PackLhs / PackRhs**:
- First, pack the data blocks of the left and right matrices.
2. **Compute**:
- Call the kernel to perform multiply-add operations on the packed data.
- The results are temporarily stored in an int32 buffer.
3. **UnpackResult**:
- Convert the int32 results back to the required type (uint8) and write them to the final output matrix.
## Incorporating Packing and Unpacking into the tiny-gpu Kernel
1. **Define Memory Regions for Packing**: Allocate specific memory addresses for the packed matrices.
2. **Insert Packing Instructions**: Add instructions to copy and reorganize the input matrices into the packed regions.
3. **Modify Base Addresses**: Update the base addresses for matrices A and B to point to their packed locations.
4. **Insert Unpacking Instructions (Optional)**: After computation, unpack the results back to their original or desired memory locations.
---
## Assumptions
- **Matrix Size**: 2x2 matrices for simplicity.
- **Memory Allocation**:
- Matrix A: Original at addresses 0-3, Packed at 16-19.
- Matrix B: Original at addresses 4-7, Packed at 20-23.
- Matrix C: Original at addresses 8-11, Unpacked at 24-27 (optional).
---
## A. Define Packing Instructions
We'll start by packing Matrix A and Matrix B from their original locations to their packed locations.
#### Packing Matrix A (Addresses 0-3 to 16-19)
```asm
; PACK A: Copy from 0-3 to 16-19
; 1. Load A[0] from address 0 and store to 16
CONST R3, #0 ; R3 = 0
LDR R0, R3 ; R0 = MEM[R3] => MEM[0]
CONST R4, #16 ; R4 = 16
STR R4, R0 ; MEM[16] = R0
; 2. Load A[1] from address 1 and store to 17
CONST R3, #1 ; R3 = 1
LDR R0, R3 ; R0 = MEM[R3] => MEM[1]
CONST R4, #17 ; R4 = 17
STR R4, R0 ; MEM[17] = R0
; 3. Load A[2] from address 2 and store to 18
CONST R3, #2 ; R3 = 2
LDR R0, R3 ; R0 = MEM[R3] => MEM[2]
CONST R4, #18 ; R4 = 18
STR R4, R0 ; MEM[18] = R0
; 4. Load A[3] from address 3 and store to 19
CONST R3, #3 ; R3 = 3
LDR R0, R3 ; R0 = MEM[R3] => MEM[3]
CONST R4, #19 ; R4 = 19
STR R4, R0 ; MEM[19] = R0
```
#### PACK B: Copy from 4-7 to 20-23
```asm
; 1. Load B[0] from address 4 and store to 20
CONST R3, #4 ; R3 = 4
LDR R0, R3 ; R0 = MEM[R3] => MEM[4]
CONST R4, #20 ; R4 = 20
STR R4, R0 ; MEM[20] = R0
; 2. Load B[1] from address 5 and store to 21
CONST R3, #5 ; R3 = 5
LDR R0, R3 ; R0 = MEM[R3] => MEM[5]
CONST R4, #21 ; R4 = 21
STR R4, R0 ; MEM[21] = R0
; 3. Load B[2] from address 6 and store to 22
CONST R3, #6 ; R3 = 6
LDR R0, R3 ; R0 = MEM[R3] => MEM[6]
CONST R4, #22 ; R4 = 22
STR R4, R0 ; MEM[22] = R0
; 4. Load B[3] from address 7 and store to 23
CONST R3, #7 ; R3 = 7
LDR R0, R3 ; R0 = MEM[R3] => MEM[7]
CONST R4, #23 ; R4 = 23
STR R4, R0 ; MEM[23] = R0
```
#### Update baseA to 16 and baseB to 20
```asm=
CONST R3, #16 ; R3 = 16 (baseA)
CONST R4, #20 ; R4 = 20 (baseB)
CONST R5, #8 ; R5 = 8 (baseC remains the same)
```
#### Original MatMul Instructions
```asm
MUL R0, %blockIdx, %blockDim ; R0 = blockIdx * blockDim
ADD R0, R0, %threadIdx ; R0 = R0 + threadIdx
CONST R1, #1 ; R1 = 1 (increment)
CONST R2, #2 ; R2 = 2 (N = matrix inner dimension)
DIV R6, R0, R2 ; R6 = R0 / R2 (row)
MUL R7, R6, R2 ; R7 = R6 * R2
SUB R7, R0, R7 ; R7 = R0 - R7 (col)
CONST R8, #0 ; R8 = 0 (accumulator)
CONST R9, #0 ; R9 = 0 (k)
LOOP:
MUL R10, R6, R2 ; R10 = R6 * R2
ADD R10, R10, R9 ; R10 = R10 + R9
ADD R10, R10, R3 ; R10 = R10 + baseA (addr(A[i]))
LDR R10, R10 ; R10 = MEM[addr(A[i])]
MUL R11, R9, R2 ; R11 = R9 * R2
ADD R11, R11, R7 ; R11 = R11 + col
ADD R11, R11, R4 ; R11 = R11 + baseB (addr(B[i]))
LDR R11, R11 ; R11 = MEM[addr(B[i])]
MUL R12, R10, R11 ; R12 = R10 * R11
ADD R8, R8, R12 ; acc += R12
ADD R9, R9, R1 ; k += 1
CMP R9, R2 ; Compare k with N
BRn LOOP ; If k < N, branch to LOOP
ADD R9, R5, R0 ; addr(C[i]) = baseC + i
STR R9, R8 ; MEM[addr(C[i])] = acc
RET ;
```
#### UNPACK C: Copy from 8-11 to 24-27
```asm
; 1. Load C[0] from address 8 and store to 24
CONST R3, #8 ; R3 = 8
LDR R0, R3 ; R0 = MEM[R3] => MEM[8]
CONST R4, #24 ; R4 = 24
STR R4, R0 ; MEM[24] = R0
; 2. Load C[1] from address 9 and store to 25
CONST R3, #9 ; R3 = 9
LDR R0, R3 ; R0 = MEM[R3] => MEM[9]
CONST R4, #25 ; R4 = 25
STR R4, R0 ; MEM[25] = R0
; 3. Load C[2] from address 10 and store to 26
CONST R3, #10 ; R3 = 10
LDR R0, R3 ; R0 = MEM[R3] => MEM[10]
CONST R4, #26 ; R4 = 26
STR R4, R0 ; MEM[26] = R0
; 4. Load C[3] from address 11 and store to 27
CONST R3, #11 ; R3 = 11
LDR R0, R3 ; R0 = MEM[R3] => MEM[11]
CONST R4, #27 ; R4 = 27
STR R4, R0 ; MEM[27] = R0
```