# Evaluate tiny-gpu
> 侯廷錡

## Summary

1. GEMM on Tiny-GPU:
    - Study [tiny-gpu](https://github.com/adam-maj/tiny-gpu).
    - Implement and test a basic GEMM kernel (e.g., matrix multiplication).
2. Cache-Enhanced Tiny-GPU:
    - Evaluate [dsandall/tiny-gpu](https://github.com/dsandall/tiny-gpu/tree/master) with its multicore cache.
    - Understand the cache's impact on performance.
    - Implement a GEMM kernel on this cache-enabled GPU and quantify the performance gains.
3. gemmlowp Strategies:
    - Review the [gemmlowp design](https://github.com/google/gemmlowp/blob/master/doc/design.md) for packing and cache-blocking techniques.
    - Apply these strategies (data packing, cache-friendly blocking) to improve the GEMM kernel on Tiny-GPU for better efficiency.

## Architecture

| ![](https://hackmd.io/_uploads/SJOuKGKvkg.png) | ![](https://hackmd.io/_uploads/Hk6uYMKPyx.png) |
|------------------------------------------------|------------------------------------------------|

---

## ISA

![image](https://hackmd.io/_uploads/BJKHFGtwJx.png)

## 1. Memory in tiny-gpu (without cache)

**tiny-gpu** has two main external memories:

1. **Program Memory** - stores instructions (16 bits each).
2. **Data Memory** - stores data (8 bits each).

Because all threads across all cores need to read and write these memories, **memory controllers** arbitrate requests to avoid collisions when too many requests arrive at once.

- The **LSU** in each thread issues `mem_read_valid` / `mem_write_valid` requests.
- The **fetcher** in each core issues `program_mem_read_valid` requests.
- The **memory controllers** (`controller.sv`) accept these requests, dispatch them to memory as bandwidth permits, and route responses back to the correct LSU or fetcher.

### Program Memory Flow

![image](https://hackmd.io/_uploads/SkEZrCdDye.png)

1. Each core has a fetcher that asserts `program_mem_read_valid` together with `program_mem_read_address` when it needs the next instruction.
2. The `program_memory_controller.sv` module collects these requests from multiple cores (up to `NUM_CORES`), decides which to send out to external memory, waits for `mem_read_ready` from the external memory, and finally returns the instruction data to the appropriate fetcher.

---

### Data Memory Flow

![image](https://hackmd.io/_uploads/r1kYCA_P1x.png)

1. Each thread in each core has an LSU, which asserts `mem_read_valid` or `mem_write_valid` for loads and stores (plus addresses and data).
2. The `data_memory_controller.sv` collects these requests from all LSUs (`NUM_CORES * THREADS_PER_BLOCK` in total) and funnels them to external data memory.
3. Once external memory responds (`mem_read_ready` / `mem_read_data`), the controller passes the data back to the correct LSU.

---

## 2. Memory in tiny-gpu (cache)

- **LSUs and Data Cache**: Each thread's LSU generates read/write requests. The `dmem_cache` module collects these requests, arbitrates among them, and interfaces with an external memory bus (see the lookup sketch after this list).
    - Cache Hit: If an address matches a valid cache line (tag match), the data is returned immediately.
    - Cache Miss: Otherwise, the cache fetches the corresponding line from memory, stores it, and then returns the data. A similar process applies to write requests.
- **Fetchers and Program Cache**: Each core's fetcher issues instruction-fetch requests. The combination of `pmem_controller` and `pmem_cache` handles these requests for program memory using a read-only cache mechanism, which simplifies the flow compared to data caching.
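To make the hit/miss behavior concrete, here is a minimal behavioral sketch of a direct-mapped cache lookup in Python. The line size, line count, and fetch-on-miss policy are illustrative assumptions for the sketch, not the actual parameters of `dmem_cache`.

```python
# Minimal behavioral sketch of a direct-mapped cache lookup (hit vs. miss).
# LINE_WORDS and NUM_LINES are assumed values, not dsandall/tiny-gpu's parameters.
LINE_WORDS = 4   # words per cache line (assumed)
NUM_LINES = 8    # number of cache lines (assumed)

class DirectMappedCache:
    def __init__(self, memory):
        self.memory = memory                           # backing data memory (list of words)
        self.valid = [False] * NUM_LINES
        self.tags = [0] * NUM_LINES
        self.lines = [[0] * LINE_WORDS for _ in range(NUM_LINES)]

    def read(self, addr):
        index = (addr // LINE_WORDS) % NUM_LINES       # cache line this address maps to
        tag = addr // (LINE_WORDS * NUM_LINES)
        offset = addr % LINE_WORDS
        if self.valid[index] and self.tags[index] == tag:
            return self.lines[index][offset], "hit"    # data returned immediately
        # Miss: fetch the whole line from external memory, then return the word.
        base = (addr // LINE_WORDS) * LINE_WORDS
        self.lines[index] = self.memory[base:base + LINE_WORDS]
        self.tags[index] = tag
        self.valid[index] = True
        return self.lines[index][offset], "miss"

mem = list(range(64))
cache = DirectMappedCache(mem)
print(cache.read(10))  # (10, 'miss') -> first access fetches the line
print(cache.read(11))  # (11, 'hit')  -> same line, served from the cache
```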
### Program Memory Flow

![image](https://hackmd.io/_uploads/HJThxY9v1l.png)

- Fetch Requests: Each core's fetcher asserts `fetcher_read_valid` and provides the current PC as `fetcher_read_address`.
- Hit or Miss: If `pmem_cache.sv` already holds the instruction (hit), it is returned immediately; otherwise `pmem_controller.sv` reads it from external memory and updates the cache.
- Instruction Delivery: Once fetched, the instruction is sent back to the fetcher, which advances to DECODE.

### Data Memory Flow

![image](https://hackmd.io/_uploads/HyYy-YcDJg.png)

- LSU Requests: Each thread's LSU asserts read/write signals with an address (and data, if writing).
- Cache Lookup: The `dmem_cache` checks for a valid/tag match:
    - Cache Hit: Data is returned immediately.
    - Cache Miss: A full line is fetched from external DMEM, then sent to the LSU, with the cache updated accordingly.
- Write Handling: Writes update the cache and may propagate to external memory, subject to the chosen policy.

---

## 3. GEMM kernel

```asm=
0b0101000011011110, # MUL R0, %blockIdx, %blockDim
0b0011000000001111, # ADD R0, R0, %threadIdx   ; i = blockIdx * blockDim + threadIdx
0b1001000100000001, # CONST R1, #1             ; increment
0b1001001000000010, # CONST R2, #2             ; N (matrix inner dimension)
0b1001001100000000, # CONST R3, #0             ; baseA (matrix A base address)
0b1001010000000100, # CONST R4, #4             ; baseB (matrix B base address)
0b1001010100001000, # CONST R5, #8             ; baseC (matrix C base address)
0b0110011000000010, # DIV R6, R0, R2           ; row = i // N
0b0101011101100010, # MUL R7, R6, R2
0b0100011100000111, # SUB R7, R0, R7           ; col = i % N
0b1001100000000000, # CONST R8, #0             ; acc = 0
0b1001100100000000, # CONST R9, #0             ; k = 0
# LOOP:
0b0101101001100010, # MUL R10, R6, R2
0b0011101010101001, # ADD R10, R10, R9
0b0011101010100011, # ADD R10, R10, R3         ; addr(A[i]) = row * N + k + baseA
0b0111101010100000, # LDR R10, R10             ; load A[i] from global memory
0b0101101110010010, # MUL R11, R9, R2
0b0011101110110111, # ADD R11, R11, R7
0b0011101110110100, # ADD R11, R11, R4         ; addr(B[i]) = k * N + col + baseB
0b0111101110110000, # LDR R11, R11             ; load B[i] from global memory
0b0101110010101011, # MUL R12, R10, R11
0b0011100010001100, # ADD R8, R8, R12          ; acc = acc + A[i] * B[i]
0b0011100110010001, # ADD R9, R9, R1           ; increment k
0b0010000010010010, # CMP R9, R2
0b0001100000001100, # BRn LOOP                 ; loop while k < N
0b0011100101010000, # ADD R9, R5, R0           ; addr(C[i]) = baseC + i
0b1000000010011000, # STR R9, R8               ; store C[i] in global memory
0b1111000000000000  # RET                      ; end of kernel
```

### Matmul Results

| Architecture | Cycles | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) |
|-------------------|--------|----------------|---------------|-----------------|
| **With Cache** | 706 | 17,675,001.00 | 54.25 | 325,784.34 |
| **Without Cache** | 491 | 12,300,001.00 | 11.64 | 1,056,952.17 |

### Matadd Results

| Architecture | Cycles | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) |
|-------------------|--------|----------------|---------------|-----------------|
| **With Cache** | 270 | 6,775,001.00 | 45.76 | 148,046.60 |
| **Without Cache** | 178 | 4,475,001.00 | 27.11 | 165,061.32 |

## 4. Core Concepts of gemmlowp

### GEMM and Cache Friendliness

- General matrix multiplication (GEMM) requires \(n^3\) multiply-add operations, while the data only occupies \(n^2\) memory.
- The same data is therefore accessed many times, so efficient use of CPU registers and cache becomes crucial.

#### Approach:

- Divide large matrices into smaller blocks and sub-blocks.
- Perform computations entirely within registers or the L1/L2 cache as much as possible to reduce frequent memory exchanges (see the sketch below).
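A minimal Python sketch of this blocking idea, assuming square matrices whose size is a multiple of the block size; gemmlowp chooses its actual block sizes per architecture, so the numbers here are purely illustrative.

```python
import numpy as np

def blocked_gemm(A, B, block=2):
    """Cache-blocked C = A @ B: each (block x block) tile of C is accumulated
    from small tiles of A and B that are meant to stay resident in cache/registers."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                # Multiply one tile of A by one tile of B and accumulate into C's tile.
                C[i0:i0+block, j0:j0+block] += (
                    A[i0:i0+block, k0:k0+block] @ B[k0:k0+block, j0:j0+block]
                )
    return C

A = np.arange(16).reshape(4, 4)
B = np.eye(4, dtype=int)
assert (blocked_gemm(A, B, block=2) == A @ B).all()
```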
---

### GEMM Kernel

- **Kernel** refers to the inner-loop code written specifically for a given processor architecture (registers, SIMD width, etc.).
- **Core Idea**:
    - Load as much data into registers as possible.
    - Perform as many multiply-add operations as possible.
    - Load new data.
    - This minimizes expensive memory accesses, particularly to DRAM.

---

### Block and Pack/Unpack

1. **Block Concept**:
    - Because register capacity is limited, large matrices must be processed through multiple "block → load → compute → write-back" cycles.
2. **Purpose of Pack**:
    - To improve cache hit rates and speed up reads and writes, data is reorganized ("packed") into a contiguous, cache-aligned format.
    - This allows the kernel to read the data simply and efficiently.
3. **Purpose of Unpack**:
    - After computation, the partial results (accumulated in int32) are written back to the output matrix in one step; this is called "Unpack."

---

### Low-Precision Computation (uint8 → int32 → uint8)

- The input and output matrices of gemmlowp are **uint8**, but intermediate accumulation is done in **int32**.
- Computation flow:
    1. **Pack**: Reorganize the uint8 input data.
    2. **Kernel**: Perform the computation and accumulate results in int32.
    3. **Unpack**: Quantize/convert the int32 results back to uint8 and write them to the output matrix.

---

### Program Structure

- The main file structure of gemmlowp:
    - `single_thread_gemm.h` and `multi_thread_gemm.h`: Handle the single-threaded and multi-threaded GEMM workflows.
    - `pack.h`, `compute.h`, `unpack.h`: Implement the three main stages: Pack / Kernel / Unpack.
    - NEON-optimized files (e.g., `pack_neon.h`, `kernel_neon.h`): Optimizations for ARM platforms.

#### Typical Workflow (Single-threaded Example):

1. **PackLhs / PackRhs**:
    - First, pack the data blocks of the left and right matrices.
2. **Compute**:
    - Call the kernel to perform multiply-add operations on the packed data.
    - The results are temporarily stored in an int32 buffer.
3. **UnpackResult**:
    - Convert the int32 results back to the required type (uint8) and write them to the final output matrix.

### Incorporating Packing and Unpacking

1. **Define Memory Regions for Packing**: Allocate specific memory addresses for the packed matrices.
2. **Insert Packing Instructions**: Add instructions that copy and reorganize the input matrices into the packed regions.
3. **Modify Base Addresses**: Update the base addresses of matrices A and B to point to their packed locations.
4. **Insert Unpacking Instructions (Optional)**: After computation, unpack the results back to their original or desired memory locations.

---

## Assumptions

- **Matrix Size**: 2x2 matrices for simplicity.
- **Memory Allocation** (see the sketch below for the intended data movement):
    - Matrix A: original at addresses 0-3, packed at 16-19.
    - Matrix B: original at addresses 4-7, packed at 20-23.
    - Matrix C: original at addresses 8-11, unpacked at 24-27 (optional).
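Before writing the assembly, the following Python sketch models the intended data movement through the flat data memory under the address assumptions above (the values of A and B are arbitrary examples); it is only a reference for the instruction sequences that follow, not part of the kernel.

```python
# Reference model of the assumed data-memory layout (one word per address):
# A at 0-3, B at 4-7, C at 8-11, packed A at 16-19, packed B at 20-23, unpacked C at 24-27.
N = 2
BASE_A, BASE_B, BASE_C = 0, 4, 8
PACKED_A, PACKED_B, UNPACKED_C = 16, 20, 24

mem = [0] * 28
mem[BASE_A:BASE_A + 4] = [1, 2, 3, 4]   # A, row-major (example values)
mem[BASE_B:BASE_B + 4] = [5, 6, 7, 8]   # B, row-major (example values)

# Pack: copy A and B into their packed regions (for a 2x2 row-major matrix the
# packed order is the same, so packing is a straight copy here).
for i in range(N * N):
    mem[PACKED_A + i] = mem[BASE_A + i]
    mem[PACKED_B + i] = mem[BASE_B + i]

# Compute: each "thread" i produces one element of C from the packed regions.
for i in range(N * N):
    row, col = i // N, i % N
    acc = 0
    for k in range(N):
        acc += mem[PACKED_A + row * N + k] * mem[PACKED_B + k * N + col]
    mem[BASE_C + i] = acc

# Unpack (optional): copy C to its unpacked region.
for i in range(N * N):
    mem[UNPACKED_C + i] = mem[BASE_C + i]

print(mem[BASE_C:BASE_C + 4])  # [19, 22, 43, 50] = [1 2; 3 4] @ [5 6; 7 8]
```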
---

## A. Define Packing Instructions

We start by packing Matrix A and Matrix B from their original locations to their packed locations.

#### Packing Matrix A (Addresses 0-3 to 16-19)

```asm
; PACK A: copy from 0-3 to 16-19

; 1. Load A[0] from address 0 and store to 16
CONST R3, #0    ; R3 = 0
LDR R0, R3      ; R0 = MEM[R3] => MEM[0]
CONST R4, #16   ; R4 = 16
STR R4, R0      ; MEM[16] = R0

; 2. Load A[1] from address 1 and store to 17
CONST R3, #1    ; R3 = 1
LDR R0, R3      ; R0 = MEM[R3] => MEM[1]
CONST R4, #17   ; R4 = 17
STR R4, R0      ; MEM[17] = R0

; 3. Load A[2] from address 2 and store to 18
CONST R3, #2    ; R3 = 2
LDR R0, R3      ; R0 = MEM[R3] => MEM[2]
CONST R4, #18   ; R4 = 18
STR R4, R0      ; MEM[18] = R0

; 4. Load A[3] from address 3 and store to 19
CONST R3, #3    ; R3 = 3
LDR R0, R3      ; R0 = MEM[R3] => MEM[3]
CONST R4, #19   ; R4 = 19
STR R4, R0      ; MEM[19] = R0
```

#### PACK B: Copy from 4-7 to 20-23

```asm
; 1. Load B[0] from address 4 and store to 20
CONST R3, #4    ; R3 = 4
LDR R0, R3      ; R0 = MEM[R3] => MEM[4]
CONST R4, #20   ; R4 = 20
STR R4, R0      ; MEM[20] = R0

; 2. Load B[1] from address 5 and store to 21
CONST R3, #5    ; R3 = 5
LDR R0, R3      ; R0 = MEM[R3] => MEM[5]
CONST R4, #21   ; R4 = 21
STR R4, R0      ; MEM[21] = R0

; 3. Load B[2] from address 6 and store to 22
CONST R3, #6    ; R3 = 6
LDR R0, R3      ; R0 = MEM[R3] => MEM[6]
CONST R4, #22   ; R4 = 22
STR R4, R0      ; MEM[22] = R0

; 4. Load B[3] from address 7 and store to 23
CONST R3, #7    ; R3 = 7
LDR R0, R3      ; R0 = MEM[R3] => MEM[7]
CONST R4, #23   ; R4 = 23
STR R4, R0      ; MEM[23] = R0
```

#### Update baseA to 16 and baseB to 20

```asm=
CONST R3, #16   ; R3 = 16 (baseA)
CONST R4, #20   ; R4 = 20 (baseB)
CONST R5, #8    ; R5 = 8 (baseC remains the same)
```

#### Original MatMul Instructions

```asm
MUL R0, %blockIdx, %blockDim  ; R0 = blockIdx * blockDim
ADD R0, R0, %threadIdx        ; R0 = R0 + threadIdx
CONST R1, #1                  ; R1 = 1 (increment)
CONST R2, #2                  ; R2 = 2 (N = matrix inner dimension)
DIV R6, R0, R2                ; R6 = R0 / R2 (row)
MUL R7, R6, R2                ; R7 = R6 * R2
SUB R7, R0, R7                ; R7 = R0 - R7 (col)
CONST R8, #0                  ; R8 = 0 (accumulator)
CONST R9, #0                  ; R9 = 0 (k)
LOOP:
MUL R10, R6, R2               ; R10 = R6 * R2
ADD R10, R10, R9              ; R10 = R10 + R9
ADD R10, R10, R3              ; R10 = R10 + baseA (addr(A[i]))
LDR R10, R10                  ; R10 = MEM[addr(A[i])]
MUL R11, R9, R2               ; R11 = R9 * R2
ADD R11, R11, R7              ; R11 = R11 + col
ADD R11, R11, R4              ; R11 = R11 + baseB (addr(B[i]))
LDR R11, R11                  ; R11 = MEM[addr(B[i])]
MUL R12, R10, R11             ; R12 = R10 * R11
ADD R8, R8, R12               ; acc += R12
ADD R9, R9, R1                ; k += 1
CMP R9, R2                    ; compare k with N
BRn LOOP                      ; if k < N, branch to LOOP
ADD R9, R5, R0                ; addr(C[i]) = baseC + i
STR R9, R8                    ; MEM[addr(C[i])] = acc
RET
```

#### UNPACK C: Copy from 8-11 to 24-27

```asm
; 1. Load C[0] from address 8 and store to 24
CONST R3, #8    ; R3 = 8
LDR R0, R3      ; R0 = MEM[R3] => MEM[8]
CONST R4, #24   ; R4 = 24
STR R4, R0      ; MEM[24] = R0

; 2. Load C[1] from address 9 and store to 25
CONST R3, #9    ; R3 = 9
LDR R0, R3      ; R0 = MEM[R3] => MEM[9]
CONST R4, #25   ; R4 = 25
STR R4, R0      ; MEM[25] = R0

; 3. Load C[2] from address 10 and store to 26
CONST R3, #10   ; R3 = 10
LDR R0, R3      ; R0 = MEM[R3] => MEM[10]
CONST R4, #26   ; R4 = 26
STR R4, R0      ; MEM[26] = R0

; 4. Load C[3] from address 11 and store to 27
CONST R3, #11   ; R3 = 11
LDR R0, R3      ; R0 = MEM[R3] => MEM[11]
CONST R4, #27   ; R4 = 27
STR R4, R0      ; MEM[27] = R0
```
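The symbolic pack/unpack snippets above can be emitted as kernel machine code with a small helper. The sketch below is a hypothetical Python encoder; the 16-bit formats are inferred from the GEMM kernel listing earlier (CONST = `1001 rd imm8`, LDR = `0111 rd rs 0000`, STR = `1000 0000 rs rt`), and the helper names are made up for illustration.

```python
# Hypothetical encoder for the copy-based PACK/UNPACK sequences above.
# Encodings are inferred from the machine-code GEMM listing, not taken from an
# official assembler.
def const(rd, imm):
    return 0b1001 << 12 | rd << 8 | (imm & 0xFF)

def ldr(rd, rs):
    return 0b0111 << 12 | rd << 8 | rs << 4

def strr(rs, rt):          # STR Rs, Rt : MEM[Rs] = Rt
    return 0b1000 << 12 | rs << 4 | rt

def pack_copy(src_base, dst_base, count):
    """Emit machine code that copies `count` words from src_base to dst_base,
    mirroring the PACK/UNPACK snippets (R3 = source address, R4 = destination)."""
    words = []
    for i in range(count):
        words += [const(3, src_base + i),  # CONST R3, #src
                  ldr(0, 3),               # LDR   R0, R3
                  const(4, dst_base + i),  # CONST R4, #dst
                  strr(4, 0)]              # STR   R4, R0
    return words

# PACK A (0-3 -> 16-19); the first word matches CONST R3, #0 = 0b1001001100000000.
for w in pack_copy(0, 16, 4):
    print(f"0b{w:016b}")
```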