Evaluate tiny-gpu
侯廷錡
Summary
- GEMM on Tiny-GPU:
  - Study tiny-gpu.
  - Implement and test a basic GEMM kernel (e.g., matrix multiplication).
- Cache-Enhanced Tiny-GPU:
  - Evaluate dsandall/tiny-gpu with its multicore cache.
  - Understand the cache's impact on performance.
  - Implement a GEMM kernel on this cache-enabled GPU and quantify the performance gains.
- gemmlowp Strategies:
  - Review gemmlowp's design for packing and cache-blocking techniques.
  - Apply these strategies (data packing, cache-friendly blocking) to improve the GEMM kernel on Tiny-GPU for better efficiency.
Architecture
ISA

1. Memory in tiny-gpu (without cache)
tiny-gpu has two main external memories:
- Program Memory - stores instructions (16 bits each).
- Data Memory - stores data (8 bits each).
Because all threads across all cores need to read and write these memories, memory controllers arbitrate the requests to avoid collisions when too many arrive at once.
- The LSU in each thread issues mem_read_valid / mem_write_valid requests.
- The fetcher in each core issues program_mem_read_valid requests.
- The memory controllers (controller.sv) accept these requests, dispatch them to memory as bandwidth permits, and route responses back to the correct LSU or fetcher (see the sketch below).
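The arbitration itself is conceptually simple: each cycle the controller forwards as many pending requests as there are free memory channels and stalls the rest. A minimal host-side C++ model of that behaviour is sketched below; the channel count, data structures, and function name are illustrative assumptions, not the actual controller.sv interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One outstanding request from an LSU or a fetcher.
struct MemRequest {
    int      requester;  // which LSU/fetcher issued it (used to route the response back)
    uint32_t address;    // requested address
};

// Toy model of the controller's arbitration: NUM_CHANNELS memory channels are
// shared by many requesters; requests that cannot be granted wait for a later cycle.
// (The channel count is an illustrative assumption.)
constexpr std::size_t NUM_CHANNELS = 4;

std::vector<MemRequest> arbitrate(std::vector<MemRequest>& pending) {
    std::vector<MemRequest> issued;
    while (!pending.empty() && issued.size() < NUM_CHANNELS) {
        issued.push_back(pending.front());   // grant in arrival order
        pending.erase(pending.begin());      // the rest stall this cycle
    }
    return issued;  // responses are later routed back via `requester`
}
```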
Program Memory Flow

- Each core has a fetcher that issues program_mem_read_valid + program_mem_read_address when it needs the next instruction.
- The program_memory_controller.sv module collects these requests from multiple cores (up to NUM_CORES), decides which to send out to external memory, waits for mem_read_ready from the external memory, and finally returns the instruction data to the appropriate fetcher.
Data Memory Flow

- Each thread in each core has an LSU, which issues mem_read_valid or mem_write_valid for loads and stores (plus addresses and data).
- The data_memory_controller.sv collects these requests from all LSUs (NUM_CORES * THREADS_PER_BLOCK in total) and funnels them to external data memory.
- Once external memory responds (mem_read_ready / mem_read_data), the controller passes the data back to the correct LSU.
2. Memory in tiny-gpu (with cache)
- LSUs and Data Cache: Each thread's LSU generates read/write requests. The dmem_cache module collects these requests, arbitrates among them, and interfaces with an external memory bus.
  - Cache Hit: If an address matches a valid cache line (tag match), the data is returned immediately.
  - Cache Miss: If not, the cache fetches the corresponding line from memory, stores it, and then returns the data. A similar process applies to write requests.
- Fetchers and Program Cache: Each core's fetcher issues instruction-fetch requests. The combination of pmem_controller and pmem_cache handles these requests for program memory using a read-only cache, which simplifies the flow compared to data caching.
Program Memory Flow

- Fetch Requests: Each core’s fetcher sets fetcher_read_valid and provides the current PC as fetcher_read_address.
- Hit or Miss: If pmem_cache.sv already holds the instruction (hit), it is returned immediately; otherwise pmem_controller.sv reads it from external memory and updates the cache.
- Instruction Delivery: Once fetched, the instruction is sent back to the fetcher, which advances to DECODE.
Data Memory Flow

- LSU Requests: Each thread’s LSU asserts read/write signals with an address (and data, if writing).
- Cache Lookup: The dmem_cache checks for a valid/tag match:
- Cache Hit: Data is returned immediately.
- Cache Miss: A full line is fetched from external DMEM, then sent to the LSU, with the cache updated accordingly.
- Write Handling: Writes update the cache and may propagate to external memory, depending on the chosen write policy (see the lookup sketch below).
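To make the hit/miss path concrete, here is a minimal direct-mapped cache lookup in C++. The line size, line count, and write-through policy are assumptions chosen for illustration; they are not read from dmem_cache.

```cpp
#include <array>
#include <cstdint>

// Toy direct-mapped data cache: 4 lines of 4 bytes each.
// (Geometry and write-through policy are illustrative assumptions.)
constexpr int LINE_BYTES = 4;
constexpr int NUM_LINES  = 4;

struct CacheLine {
    bool     valid = false;
    uint32_t tag   = 0;
    std::array<uint8_t, LINE_BYTES> data{};
};

struct Cache {
    std::array<CacheLine, NUM_LINES> lines{};
    std::array<uint8_t, 256> dmem{};   // stand-in for external data memory

    uint8_t read(uint32_t addr) {
        uint32_t offset = addr % LINE_BYTES;
        uint32_t index  = (addr / LINE_BYTES) % NUM_LINES;
        uint32_t tag    = addr / (LINE_BYTES * NUM_LINES);
        CacheLine& line = lines[index];

        if (!(line.valid && line.tag == tag)) {      // miss: fill the whole line
            uint32_t base = addr - offset;
            for (int i = 0; i < LINE_BYTES; ++i) line.data[i] = dmem[base + i];
            line.valid = true;
            line.tag   = tag;
        }
        return line.data[offset];                    // hit path: return immediately
    }

    void write(uint32_t addr, uint8_t value) {
        read(addr);                                  // ensure the line is resident
        uint32_t offset = addr % LINE_BYTES;
        uint32_t index  = (addr / LINE_BYTES) % NUM_LINES;
        lines[index].data[offset] = value;
        dmem[addr] = value;                          // write-through (assumed policy)
    }
};
```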
3. GEMM kernel
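The matmul benchmark assigns one thread to each output element of a small matrix product (2x2 is assumed here, matching the memory layout used in the packing example at the end of this report). As a host-side C++ reference of what each thread computes:

```cpp
#include <array>
#include <cstdint>

// Reference for the per-thread GEMM work, assuming 2x2 row-major matrices
// and one thread per output element (thread i computes C[i]).
constexpr int N = 2;

uint8_t matmul_thread(int i,
                      const std::array<uint8_t, N * N>& A,
                      const std::array<uint8_t, N * N>& B) {
    int row = i / N;
    int col = i % N;
    int acc = 0;
    for (int k = 0; k < N; ++k)          // inner product over the shared dimension
        acc += A[row * N + k] * B[k * N + col];
    return static_cast<uint8_t>(acc);    // data memory is 8 bits per word
}
```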
Matmul Results
| Architecture | Cycles | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) |
| --- | --- | --- | --- | --- |
| With Cache | 706 | 17,675,001.00 | 54.25 | 325,784.34 |
| Without Cache | 491 | 12,300,001.00 | 11.64 | 1,056,952.17 |
Matadd Results
| Architecture | Cycles | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) |
| --- | --- | --- | --- | --- |
| With Cache | 270 | 6,775,001.00 | 45.76 | 148,046.60 |
| Without Cache | 178 | 4,475,001.00 | 27.11 | 165,061.32 |
4. Core Concepts of gemmlowp
GEMM and Cache Friendliness
- General matrix multiplication (GEMM) requires on the order of n^3 multiply-add operations, while the matrices occupy only on the order of n^2 memory, so each element is reused roughly n times.
- Because the same data is accessed many times, "efficient use of CPU registers and cache" becomes crucial.
Approach:
- Divide large matrices into smaller blocks and sub-blocks.
- Perform computations entirely within registers or the L1/L2 cache as much as possible, reducing trips to main memory (see the blocked loop sketch below).
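A minimal sketch of this blocking idea in C++ (the block size and loop order are chosen for illustration, not taken from gemmlowp):

```cpp
#include <algorithm>
#include <vector>

// Blocked GEMM: C += A * B for square n x n row-major matrices.
// BLOCK is chosen so that three BLOCK x BLOCK tiles fit comfortably in cache;
// the value here is only an illustrative assumption.
constexpr int BLOCK = 32;

void gemm_blocked(int n, const std::vector<float>& A,
                  const std::vector<float>& B, std::vector<float>& C) {
    for (int i0 = 0; i0 < n; i0 += BLOCK)
        for (int k0 = 0; k0 < n; k0 += BLOCK)
            for (int j0 = 0; j0 < n; j0 += BLOCK)
                // Work on one tile of C while the A/B tiles stay cache-resident.
                for (int i = i0; i < std::min(i0 + BLOCK, n); ++i)
                    for (int k = k0; k < std::min(k0 + BLOCK, n); ++k) {
                        float a = A[i * n + k];          // reused across the j loop
                        for (int j = j0; j < std::min(j0 + BLOCK, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```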
GEMM Kernel
- A "kernel" refers to the inner-loop code written specifically for a given processor architecture (registers, SIMD width, etc.).
- Core Idea:
- Load as much data into registers as possible.
- Perform as many multiply-add operations as possible.
- Load new data.
- This minimizes expensive memory accesses, particularly to DRAM (see the register-tile sketch below).
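A toy register-tile kernel in C++ (the 4x4 tile shape is chosen only for illustration; real gemmlowp kernels are written per architecture and usually use SIMD):

```cpp
// Micro-kernel: accumulate a 4x4 tile of C in local variables so the compiler
// can keep the accumulators in registers. A is 4 x K (row-major), B is K x 4
// (row-major), C has leading dimension ldc. Tile shape is an assumption.
void micro_kernel_4x4(int K, const float* A, const float* B, float* C, int ldc) {
    float acc[4][4] = {};                        // 16 register accumulators
    for (int k = 0; k < K; ++k) {
        float a0 = A[0 * K + k], a1 = A[1 * K + k];
        float a2 = A[2 * K + k], a3 = A[3 * K + k];
        for (int j = 0; j < 4; ++j) {
            float b = B[k * 4 + j];              // one load feeds four multiply-adds
            acc[0][j] += a0 * b;
            acc[1][j] += a1 * b;
            acc[2][j] += a2 * b;
            acc[3][j] += a3 * b;
        }
    }
    for (int i = 0; i < 4; ++i)                  // single write-back at the end
        for (int j = 0; j < 4; ++j)
            C[i * ldc + j] += acc[i][j];
}
```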
Block and Pack/Unpack
- Block Concept:
  - Due to the limited capacity of registers, large matrices must be processed through multiple "block → load → compute → write-back" cycles.
- Purpose of Pack:
  - To improve cache hit rates and speed up reads and writes, data is reorganized ("packed") into a contiguous, cache-aligned format.
  - This allows the kernel to read the data simply and efficiently (see the packing sketch below).
- Purpose of Unpack:
  - After computation, the partial results (accumulated in int32) are written back to the output matrix in one step; this is called "unpack."
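A small sketch of what packing does, in C++ (the simple row-major destination layout is an assumption; gemmlowp's actual packed formats are kernel-specific):

```cpp
#include <cstdint>
#include <vector>

// Pack a rows x cols block of a large row-major matrix (leading dimension ld)
// into a small contiguous buffer, so the kernel can stream it sequentially.
std::vector<uint8_t> pack_block(const uint8_t* src, int ld,
                                int row0, int col0, int rows, int cols) {
    std::vector<uint8_t> packed(rows * cols);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            packed[r * cols + c] = src[(row0 + r) * ld + (col0 + c)];
    return packed;   // contiguous: unit-stride reads inside the kernel
}
```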
Low-Precision Computation (uint8 → int32 → uint8)
- The input and output matrices of gemmlowp are uint8, but intermediate accumulation is done in int32.
- Computation flow:
- Pack: Reorganize uint8 input data.
- Kernel: Perform computation and accumulate results in int32.
- Unpack: Quantize/convert the int32 results back to uint8 and write them to the output matrix (a requantization sketch follows below).
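A hedged sketch of the int32 → uint8 step; the offset, multiplier, and shift here are placeholder parameters, whereas gemmlowp derives them from the quantization scheme of the actual matrices (and applies proper rounding):

```cpp
#include <algorithm>
#include <cstdint>

// Requantize one int32 accumulator back to uint8.
// offset, multiplier, and shift are hypothetical quantization parameters.
uint8_t requantize(int32_t acc, int32_t offset, int32_t multiplier, int shift) {
    int64_t scaled = (static_cast<int64_t>(acc) + offset) * multiplier;
    int32_t result = static_cast<int32_t>(scaled >> shift);        // fixed-point rescale
    return static_cast<uint8_t>(std::clamp<int32_t>(result, 0, 255));  // saturate to uint8
}
```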
Program Structure
- The main file structure of gemmlowp:
  - single_thread_gemm.h and multi_thread_gemm.h: handle the single-threaded and multi-threaded GEMM workflows.
  - pack.h, compute.h, unpack.h: implement the three main stages, "Pack / Kernel / Unpack."
  - NEON-optimized files (e.g., pack_neon.h, kernel_neon.h): optimizations for ARM platforms.
Typical Workflow (Single-threaded Example):
- PackLhs / PackRhs:
  - First, pack the data blocks of the left-hand and right-hand matrices.
- Compute:
  - Call the kernel to perform multiply-add operations on the packed data.
  - The results are temporarily stored in an int32 buffer.
- UnpackResult:
  - Convert the int32 results back to the required type (uint8) and write them to the final output matrix.
Incorporating Packing and Unpacking
- Define Memory Regions for Packing: Allocate specific memory addresses for the packed matrices.
- Insert Packing Instructions: Add instructions to copy and reorganize the input matrices into the packed regions.
- Modify Base Addresses: Update the base addresses for matrices A and B to point to their packed locations.
- Insert Unpacking Instructions (Optional): After computation, unpack the results back to their original or desired memory locations.
Assumptions
- Matrix Size: 2x2 matrices for simplicity.
- Memory Allocation:
  - Matrix A: original at addresses 0-3, packed at 16-19.
  - Matrix B: original at addresses 4-7, packed at 20-23.
  - Matrix C: original at addresses 8-11, unpacked at 24-27 (optional).
A. Define Packing Instructions
We'll start by packing Matrix A and Matrix B from their original locations to their packed locations.
The overall sequence of the modified kernel is (modeled in the sketch below):
- PACK A: copy from addresses 0-3 to 16-19.
- PACK B: copy from addresses 4-7 to 20-23.
- Update baseA to 16 and baseB to 20.
- Run the original MatMul instructions against the packed bases; results land at baseC = 8 (addresses 8-11).
- UNPACK C: copy from addresses 8-11 to 24-27.
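Since the instruction listings themselves are not reproduced here, the following C++ sketch models the intended data movement on a flat 8-bit data memory; the example input values are arbitrary, and the addresses follow the assumptions above.

```cpp
#include <array>
#include <cstdint>

// Host-side model of the packed 2x2 GEMM flow on a flat 8-bit data memory.
// A at 0-3, B at 4-7, C at 8-11, packed A at 16-19, packed B at 20-23,
// unpacked C at 24-27 (matching the assumptions listed above).
int main() {
    std::array<uint8_t, 32> dmem{};
    for (int i = 0; i < 4; ++i) { dmem[0 + i] = i + 1; dmem[4 + i] = i + 1; }  // example inputs

    // PACK A: copy 0-3 -> 16-19; PACK B: copy 4-7 -> 20-23.
    for (int i = 0; i < 4; ++i) { dmem[16 + i] = dmem[0 + i]; dmem[20 + i] = dmem[4 + i]; }

    // MatMul with updated bases: baseA = 16, baseB = 20, baseC = 8.
    const int N = 2, baseA = 16, baseB = 20, baseC = 8;
    for (int i = 0; i < N * N; ++i) {            // one "thread" per output element
        int row = i / N, col = i % N, acc = 0;
        for (int k = 0; k < N; ++k)
            acc += dmem[baseA + row * N + k] * dmem[baseB + k * N + col];
        dmem[baseC + i] = static_cast<uint8_t>(acc);
    }

    // UNPACK C: copy 8-11 -> 24-27.
    for (int i = 0; i < 4; ++i) dmem[24 + i] = dmem[8 + i];
    return 0;
}
```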