# Evaluate tiny-gpu
> 侯廷錡
## Summary
1. GEMM on Tiny-GPU:
- Study [tiny-gpu](https://github.com/adam-maj/tiny-gpu).
- Implement and test a basic GEMM (matrix multiplication) kernel.
2. Cache-Enhanced Tiny-GPU:
- Evaluate [dsandall/tiny-gpu](https://github.com/dsandall/tiny-gpu/tree/master) with its multicore cache.
- Understand cache impact on performance.
- Implement a GEMM kernel on this cache-enabled GPU and quantify performance gains.
3. gemmlowp Strategies:
- Review the [gemmlowp design](https://github.com/google/gemmlowp/blob/master/doc/design.md) for packing and cache blocking techniques.
- Apply these strategies (data packing, cache-friendly blocking) to improve the GEMM kernel on Tiny-GPU for better efficiency.
## Architecture
---
## ISA

## 1. Memory in tiny-gpu (without cache)
**tiny-gpu** has two main external memories:
1. **Program Memory** - stores instructions (16 bits each).
2. **Data Memory** - stores data (8 bits each).

Because all threads across all cores need to read and write these memories, **memory controllers** arbitrate requests to avoid collisions when too many requests arrive at once.
- The **LSU** in each thread issues `mem_read_valid`/`mem_write_valid` requests.
- The **fetcher** in each core issues `program_mem_read_valid` requests.
- The **memory controllers** (`controller.sv`) accept these requests, dispatch them to memory as bandwidth permits, and route responses back to the correct LSU or fetcher (a simplified behavioral sketch follows).
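To make the arbitration idea concrete, here is a minimal software-level sketch of what a controller does each cycle. This is not the `controller.sv` source; the channel count and the simple first-come policy are assumptions for illustration.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// One outstanding request from an LSU or fetcher.
struct MemRequest {
    int      requester;   // which LSU/fetcher issued it
    bool     is_write;
    uint32_t address;
    uint8_t  write_data;
};

// Behavioral model: each cycle, accept up to `channels` pending requests and
// dispatch them to external memory; everyone else keeps waiting (their
// *_valid signal stays asserted until accepted).
std::vector<MemRequest> arbitrate(std::vector<std::optional<MemRequest>>& pending,
                                  int channels) {
    std::vector<MemRequest> dispatched;
    for (auto& slot : pending) {
        if (static_cast<int>(dispatched.size()) == channels) break;  // bandwidth limit
        if (slot) {                     // request is valid this cycle
            dispatched.push_back(*slot);
            slot.reset();               // request accepted
        }
    }
    return dispatched;  // responses are later routed back by `requester` id
}
```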
### Program Memory Flow

1. Each core has a Fetcher that issues `program_mem_read_valid` + `program_mem_read_address` when it needs the next instruction.
2. The `program_memory_controller.sv` module collects these requests from multiple cores (up to `NUM_CORES`), decides which to send out to external memory, waits for `mem_read_ready` from the external memory, and finally returns the instruction data to the appropriate fetcher.
---
### Data Memory Flow

1. Each thread in each core has an LSU, which issues `mem_read_valid` or `mem_write_valid` for loads and stores (plus addresses and data).
2. The `data_memory_controller.sv` collects these requests from all LSUs (`NUM_CORES * THREADS_PER_BLOCK` total) and funnels them to external data memory.
3. Once external memory responds (`mem_read_ready` / `mem_read_data`), the controller passes the data back to the correct LSU.
---
## 2. Memory in tiny-gpu (with cache)
- **LSUs and Data Cache**:
  Each thread's LSU generates read/write requests. The `dmem_cache` module collects these requests, arbitrates among them, and interfaces with an external memory bus.
  - Cache Hit: If an address matches a valid cache line (tag match), the data is returned immediately.
  - Cache Miss: If not, the cache fetches the corresponding line from memory, stores it, and then returns the data. A similar process applies to write requests (see the toy lookup model after this list).
- **Fetchers and Program Cache**:
Each core's fetcher issues instruction-fetch requests. The combination of `pmem_controller` and `pmem_cache` handles these requests for program memory using a read-only cache mechanism, simplifying the flow compared to data caching.
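As a rough mental model of the hit/miss path (not the actual `dmem_cache` implementation; line size, line count, and direct mapping are assumptions), a toy direct-mapped cache lookup looks like this:

```cpp
#include <array>
#include <cstdint>

// Toy direct-mapped cache: 16 lines of 4 bytes each (placeholder sizes).
struct ToyCache {
    struct Line { bool valid = false; uint32_t tag = 0; std::array<uint8_t, 4> data{}; };
    std::array<Line, 16> lines;

    // Returns true on a hit; on a miss the caller must fetch the whole line
    // from external memory, install it with fill(), then retry the read.
    bool read(uint32_t addr, uint8_t& out) {
        uint32_t offset = addr & 0x3;         // byte within the line
        uint32_t index  = (addr >> 2) & 0xF;  // which line
        uint32_t tag    = addr >> 6;          // remaining address bits
        const Line& l = lines[index];
        if (l.valid && l.tag == tag) { out = l.data[offset]; return true; }  // hit
        return false;                                                        // miss
    }

    void fill(uint32_t addr, const std::array<uint8_t, 4>& line_data) {
        Line& l = lines[(addr >> 2) & 0xF];
        l.valid = true;
        l.tag   = addr >> 6;
        l.data  = line_data;
    }
};
```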
### Program Memory Flow

- Fetch Requests: Each core's fetcher sets `fetcher_read_valid` and provides the current PC as `fetcher_read_address`.
- Hit or Miss: If `pmem_cache.sv` has the instruction (hit), it returns it immediately; otherwise `pmem_controller.sv` reads it from external memory and updates the cache.
- Instruction Delivery: Once fetched, the instruction is sent back to the fetcher, which advances to DECODE.
### Data Memory Flow

- LSU Requests: Each thread's LSU asserts read/write signals with an address (and data, if writing).
- Cache Lookup: The `dmem_cache` checks for a valid/tag match:
  - Cache Hit: Data is returned immediately.
  - Cache Miss: A full line is fetched from external DMEM, then sent to the LSU, with the cache updated accordingly.
- Write Handling: Writes update the cache and may propagate to external memory, subject to the chosen write policy.
---
## 3. GEMM kernel
```asm=
0b0101000011011110, # MUL R0, %blockIdx, %blockDim
0b0011000000001111, # ADD R0, R0, %threadIdx ; i = blockIdx * blockDim + threadIdx
0b1001000100000001, # CONST R1, #1 ; increment
0b1001001000000010, # CONST R2, #2 ; N (matrix inner dimension)
0b1001001100000000, # CONST R3, #0 ; baseA (matrix A base address)
0b1001010000000100, # CONST R4, #4 ; baseB (matrix B base address)
0b1001010100001000, # CONST R5, #8 ; baseC (matrix C base address)
0b0110011000000010, # DIV R6, R0, R2 ; row = i // N
0b0101011101100010, # MUL R7, R6, R2
0b0100011100000111, # SUB R7, R0, R7 ; col = i % N
0b1001100000000000, # CONST R8, #0 ; acc = 0
0b1001100100000000, # CONST R9, #0 ; k = 0
# LOOP:
0b0101101001100010, # MUL R10, R6, R2
0b0011101010101001, # ADD R10, R10, R9
0b0011101010100011, # ADD R10, R10, R3 ; addr(A[i]) = row * N + k + baseA
0b0111101010100000, # LDR R10, R10 ; load A[i] from global memory
0b0101101110010010, # MUL R11, R9, R2
0b0011101110110111, # ADD R11, R11, R7
0b0011101110110100, # ADD R11, R11, R4 ; addr(B[i]) = k * N + col + baseB
0b0111101110110000, # LDR R11, R11 ; load B[i] from global memory
0b0101110010101011, # MUL R12, R10, R11
0b0011100010001100, # ADD R8, R8, R12 ; acc = acc + A[i] * B[i]
0b0011100110010001, # ADD R9, R9, R1 ; increment k
0b0010000010010010, # CMP R9, R2
0b0001100000001100, # BRn LOOP ; loop while k < N
0b0011100101010000, # ADD R9, R5, R0 ; addr(C[i]) = baseC + i
0b1000000010011000, # STR R9, R8 ; store C[i] in global memory
0b1111000000000000 # RET ; end of kernel
```
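For reference, each thread in the kernel above computes exactly one element of the 2x2 result. The per-thread work corresponds to this plain scalar loop (the constants N = 2, baseA = 0, baseB = 4, baseC = 8 are taken from the `CONST` instructions; data memory is modeled as a byte array, matching tiny-gpu's 8-bit data width):

```cpp
#include <cstdint>

// Scalar restatement of the per-thread GEMM kernel above.
// mem[] models data memory; i is the global thread index
// (blockIdx * blockDim + threadIdx).
void matmul_thread(uint8_t* mem, int i) {
    const int N = 2, baseA = 0, baseB = 4, baseC = 8;
    int row = i / N;          // row = i // N
    int col = i % N;          // col = i % N
    uint8_t acc = 0;          // 8-bit accumulator, like the R8 register
    for (int k = 0; k < N; ++k) {
        acc += mem[baseA + row * N + k] * mem[baseB + k * N + col];  // acc += A[row][k] * B[k][col]
    }
    mem[baseC + i] = acc;     // store C[row][col]
}
```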
### Matmul Results
| Architecture | Cycles | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) |
|-------------------|--------|----------------|---------------|-----------------|
| **With Cache** | 706 | 17,675,001.00 | 54.25 | 325,784.34 |
| **Without Cache** | 491 | 12,300,001.00 | 11.64 | 1,056,952.17 |
### Matadd Results
| Architecture | Cycles | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) |
|-------------------|--------|----------------|---------------|-----------------|
| **With Cache** | 270 | 6,775,001.00 | 45.76 | 148,046.60 |
| **Without Cache** | 178 | 4,475,001.00 | 27.11 | 165,061.32 |
## 4. Core Concepts of gemmlowp
### GEMM and Cache Friendliness
- General matrix multiplication (GEMM) of \(n \times n\) matrices requires \(n^3\) multiply-add operations, while the data it touches is only \(n^2\) elements.
- The same data will be accessed multiple times, so "efficient use of CPU registers and cache" becomes crucial.
#### Approach:
- Divide large matrices into smaller blocks and sub-blocks.
- Perform computations entirely within registers or L1/L2 cache as much as possible to reduce trips to main memory (see the blocked-loop sketch below).
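A minimal sketch of this blocking idea in plain C++ (not gemmlowp code; the block size of 32 is an arbitrary placeholder, chosen so that three tiles fit comfortably in cache):

```cpp
#include <algorithm>
#include <vector>

// Cache-blocked C = A * B for n x n row-major matrices.
// C must be pre-sized to n*n and zero-initialized by the caller.
void blocked_gemm(const std::vector<float>& A, const std::vector<float>& Bm,
                  std::vector<float>& C, int n, int B = 32) {
    for (int i0 = 0; i0 < n; i0 += B)
        for (int k0 = 0; k0 < n; k0 += B)
            for (int j0 = 0; j0 < n; j0 += B)
                // Multiply one tile of A by one tile of B; all three tiles
                // stay hot in cache while this inner triple loop runs.
                for (int i = i0; i < std::min(i0 + B, n); ++i)
                    for (int k = k0; k < std::min(k0 + B, n); ++k) {
                        float a = A[i * n + k];
                        for (int j = j0; j < std::min(j0 + B, n); ++j)
                            C[i * n + j] += a * Bm[k * n + j];
                    }
}
```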
---
### GEMM Kernel
- **Kernel** refers to the "inner loop code" written specifically for a given processor architecture (registers, SIMD width, etc.)
- **Core Idea**:
- Load as much data into registers as possible.
- Perform as many multiply-add operations as possible.
- Load new data.
- This minimizes expensive memory accesses, particularly to DRAM (see the register-tiling sketch below).
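A sketch of the register-reuse idea, assuming a 2x2 register tile (illustrative C++ only, not an actual tuned kernel): the four accumulators stay in registers across the whole k loop, so every four loads feed four multiply-adds.

```cpp
// Register-tiling sketch: compute a 2x2 tile of C for n x n row-major
// matrices, keeping the four accumulators in registers for the entire k loop.
void micro_kernel_2x2(const float* A, const float* B, float* C, int n, int i, int j) {
    float c00 = 0, c01 = 0, c10 = 0, c11 = 0;                    // live in registers
    for (int k = 0; k < n; ++k) {
        float a0 = A[(i + 0) * n + k], a1 = A[(i + 1) * n + k];  // 2 loads from A
        float b0 = B[k * n + (j + 0)], b1 = B[k * n + (j + 1)];  // 2 loads from B
        c00 += a0 * b0;  c01 += a0 * b1;                         // 4 loads feed
        c10 += a1 * b0;  c11 += a1 * b1;                         // 4 multiply-adds
    }
    C[(i + 0) * n + (j + 0)] = c00;  C[(i + 0) * n + (j + 1)] = c01;
    C[(i + 1) * n + (j + 0)] = c10;  C[(i + 1) * n + (j + 1)] = c11;
}
```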
---
### Block and Pack/Unpack
1. **Block Concept**:
- Due to the limited capacity of registers, large matrices must be processed through multiple "block → load → compute → write-back" cycles.
2. **Purpose of Pack**:
- To improve cache hit rates and speed up read/write operations, data is "reorganized (packed)" into a continuous or cache-aligned format.
- This allows the kernel to read the data simply and efficiently (a packing sketch follows this list).
3. **Purpose of Unpack**:
- After computation, the "partial results (accumulated in int32)" are written back to the output matrix in one step, which is called "Unpack."
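A minimal packing sketch (illustrative only, not gemmlowp's `PackLhs`/`PackRhs`): copy one tile of a row-major uint8 matrix into a contiguous buffer so the kernel can stream it sequentially and every fetched cache line is fully used.

```cpp
#include <cstdint>
#include <vector>

// Copy a rows x cols tile of a row-major uint8 matrix (leading dimension
// `stride`) starting at (row0, col0) into a contiguous packed buffer.
std::vector<uint8_t> pack_tile(const uint8_t* M, int stride,
                               int row0, int col0, int rows, int cols) {
    std::vector<uint8_t> packed;
    packed.reserve(rows * cols);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            packed.push_back(M[(row0 + r) * stride + (col0 + c)]);  // now contiguous
    return packed;
}
```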
---
### Low-Precision Computation (uint8 → int32 → uint8)
- The input and output matrices of gemmlowp are **uint8**, but intermediate accumulation is done in **int32**.
- Computation flow:
1. **Pack**: Reorganize uint8 input data.
2. **Kernel**: Perform computation and accumulate results in int32.
3. **Unpack**: Quantize/convert the int32 results back to uint8 and write them to the output matrix (sketched below).
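A compact sketch of that flow, assuming a naive kernel and a placeholder requantization step (the real gemmlowp pipeline packs both operands, uses tuned kernels, and applies proper offset/scale parameters):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// uint8 inputs, int32 accumulation, then requantization back to uint8.
void gemm_u8(const std::vector<uint8_t>& A, const std::vector<uint8_t>& B,
             std::vector<uint8_t>& C, int n) {
    std::vector<int32_t> acc(n * n, 0);               // int32 accumulator buffer
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                acc[i * n + j] += int32_t(A[i * n + k]) * int32_t(B[k * n + j]);

    // "Unpack": requantize int32 results back to uint8.
    // The right shift is a placeholder for the real scale/offset handling.
    C.assign(n * n, 0);
    for (int i = 0; i < n * n; ++i) {
        int32_t v = acc[i] >> 8;                       // placeholder rescale
        C[i] = uint8_t(std::clamp(v, 0, 255));         // saturate to uint8 range
    }
}
```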
---
### Program Structure
- The main file structure of gemmlowp:
- `single_thread_gemm.h` and `multi_thread_gemm.h`: Handle single-threaded and multi-threaded GEMM workflows.
- `pack.h`, `compute.h`, `unpack.h`: Implement the three main stages: "Pack / Kernel / Unpack."
- NEON-optimized files (e.g., `pack_neon.h`, `kernel_neon.h`): Optimizations for ARM platforms.
#### Typical Workflow (Single-threaded Example):
1. **PackLhs / PackRhs**:
- First, pack the data blocks of the left and right matrices.
2. **Compute**:
- Call the kernel to perform multiply-add operations on the packed data.
- The results are temporarily stored in an int32 buffer.
3. **UnpackResult**:
- Convert the int32 results back to the required type (uint8) and write them to the final output matrix.
## Incorporating Packing and Unpacking into the tiny-gpu Kernel
1. **Define Memory Regions for Packing**: Allocate specific memory addresses for the packed matrices.
2. **Insert Packing Instructions**: Add instructions to copy and reorganize the input matrices into the packed regions.
3. **Modify Base Addresses**: Update the base addresses for matrices A and B to point to their packed locations.
4. **Insert Unpacking Instructions (Optional)**: After computation, unpack the results back to their original or desired memory locations.
---
## Assumptions
- **Matrix Size**: 2x2 matrices for simplicity.
- **Memory Allocation**:
- Matrix A: Original at addresses 0-3, Packed at 16-19.
- Matrix B: Original at addresses 4-7, Packed at 20-23.
- Matrix C: Original at addresses 8-11, Unpacked at 24-27 (optional).
---
## A. Define Packing Instructions
We'll start by packing Matrix A and Matrix B from their original locations to their packed locations.
#### Packing Matrix A (Addresses 0-3 to 16-19)
```asm
; PACK A: Copy from 0-3 to 16-19
; 1. Load A[0] from address 0 and store to 16
CONST R3, #0 ; R3 = 0
LDR R0, R3 ; R0 = MEM[R3] => MEM[0]
CONST R4, #16 ; R4 = 16
STR R4, R0 ; MEM[16] = R0
; 2. Load A[1] from address 1 and store to 17
CONST R3, #1 ; R3 = 1
LDR R0, R3 ; R0 = MEM[R3] => MEM[1]
CONST R4, #17 ; R4 = 17
STR R4, R0 ; MEM[17] = R0
; 3. Load A[2] from address 2 and store to 18
CONST R3, #2 ; R3 = 2
LDR R0, R3 ; R0 = MEM[R3] => MEM[2]
CONST R4, #18 ; R4 = 18
STR R4, R0 ; MEM[18] = R0
; 4. Load A[3] from address 3 and store to 19
CONST R3, #3 ; R3 = 3
LDR R0, R3 ; R0 = MEM[R3] => MEM[3]
CONST R4, #19 ; R4 = 19
STR R4, R0 ; MEM[19] = R0
```
#### PACK B: Copy from 4-7 to 20-23
```asm
; 1. Load B[0] from address 4 and store to 20
CONST R3, #4 ; R3 = 4
LDR R0, R3 ; R0 = MEM[R3] => MEM[4]
CONST R4, #20 ; R4 = 20
STR R4, R0 ; MEM[20] = R0
; 2. Load B[1] from address 5 and store to 21
CONST R3, #5 ; R3 = 5
LDR R0, R3 ; R0 = MEM[R3] => MEM[5]
CONST R4, #21 ; R4 = 21
STR R4, R0 ; MEM[21] = R0
; 3. Load B[2] from address 6 and store to 22
CONST R3, #6 ; R3 = 6
LDR R0, R3 ; R0 = MEM[R3] => MEM[6]
CONST R4, #22 ; R4 = 22
STR R4, R0 ; MEM[22] = R0
; 4. Load B[3] from address 7 and store to 23
CONST R3, #7 ; R3 = 7
LDR R0, R3 ; R0 = MEM[R3] => MEM[7]
CONST R4, #23 ; R4 = 23
STR R4, R0 ; MEM[23] = R0
```
#### Update baseA to 16 and baseB to 20
```asm=
CONST R3, #16 ; R3 = 16 (baseA)
CONST R4, #20 ; R4 = 20 (baseB)
CONST R5, #8 ; R5 = 8 (baseC remains the same)
```
#### Original MatMul Instructions
```asm
MUL R0, %blockIdx, %blockDim ; R0 = blockIdx * blockDim
ADD R0, R0, %threadIdx ; R0 = R0 + threadIdx
CONST R1, #1 ; R1 = 1 (increment)
CONST R2, #2 ; R2 = 2 (N = matrix inner dimension)
DIV R6, R0, R2 ; R6 = R0 / R2 (row)
MUL R7, R6, R2 ; R7 = R6 * R2
SUB R7, R0, R7 ; R7 = R0 - R7 (col)
CONST R8, #0 ; R8 = 0 (accumulator)
CONST R9, #0 ; R9 = 0 (k)
LOOP:
MUL R10, R6, R2 ; R10 = R6 * R2
ADD R10, R10, R9 ; R10 = R10 + R9
ADD R10, R10, R3 ; R10 = R10 + baseA (addr(A[i]))
LDR R10, R10 ; R10 = MEM[addr(A[i])]
MUL R11, R9, R2 ; R11 = R9 * R2
ADD R11, R11, R7 ; R11 = R11 + col
ADD R11, R11, R4 ; R11 = R11 + baseB (addr(B[i]))
LDR R11, R11 ; R11 = MEM[addr(B[i])]
MUL R12, R10, R11 ; R12 = R10 * R11
ADD R8, R8, R12 ; acc += R12
ADD R9, R9, R1 ; k += 1
CMP R9, R2 ; Compare k with N
BRn LOOP ; If k < N, branch to LOOP
ADD R9, R5, R0 ; addr(C[i]) = baseC + i
STR R9, R8 ; MEM[addr(C[i])] = acc
RET ;
```
#### UNPACK C: Copy from 8-11 to 24-27
```asm
; 1. Load C[0] from address 8 and store to 24
CONST R3, #8 ; R3 = 8
LDR R0, R3 ; R0 = MEM[R3] => MEM[8]
CONST R4, #24 ; R4 = 24
STR R4, R0 ; MEM[24] = R0
; 2. Load C[1] from address 9 and store to 25
CONST R3, #9 ; R3 = 9
LDR R0, R3 ; R0 = MEM[R3] => MEM[9]
CONST R4, #25 ; R4 = 25
STR R4, R0 ; MEM[25] = R0
; 3. Load C[2] from address 10 and store to 26
CONST R3, #10 ; R3 = 10
LDR R0, R3 ; R0 = MEM[R3] => MEM[10]
CONST R4, #26 ; R4 = 26
STR R4, R0 ; MEM[26] = R0
; 4. Load C[3] from address 11 and store to 27
CONST R3, #11 ; R3 = 11
LDR R0, R3 ; R0 = MEM[R3] => MEM[11]
CONST R4, #27 ; R4 = 27
STR R4, R0 ; MEM[27] = R0
```