侯廷錡
tiny-gpu has two main external memories: program memory and data memory. Because all threads across all cores need to read (or write) these memories, memory controllers arbitrate requests to avoid collisions when too many requests arrive at once. LSUs raise `mem_read_valid` / `mem_write_valid` requests, and fetchers raise `program_mem_read_valid` requests. The memory controllers (`controller.sv`) accept these requests, dispatch them to memory as bandwidth permits, and route responses back to the correct LSU or fetcher.

Program Memory:
Each core's fetcher asserts `program_mem_read_valid` + `program_mem_read_address` when it needs the next instruction. The program memory controller collects these requests from all fetchers (`NUM_CORES` of them), decides which to send out to external memory, waits for `mem_read_ready` from the external memory, and finally returns the instruction data to the appropriate fetcher.

Data Memory:
Each thread's LSU asserts `mem_read_valid` or `mem_write_valid` for loads and stores (plus addresses and data). The data memory controller collects these LSU requests (`NUM_CORES * THREADS_PER_BLOCK` total) and funnels them to external data memory. When a response arrives (`mem_read_ready` / `mem_read_data`), the controller passes the data back to the correct LSU.

LSUs and Data Cache:
Each thread's LSU generates read/write requests. The `dmem_cache` module collects these requests, arbitrates among them, and interfaces with an external memory bus.

Fetchers and Program Cache:
Each core's fetcher issues instruction-fetch requests. The combination of `pmem_controller` and `pmem_cache` handles these requests for program memory using a read-only cache mechanism, simplifying the flow compared to data caching.
0b0101000011011110, # MUL R0, %blockIdx, %blockDim
0b0011000000001111, # ADD R0, R0, %threadIdx ; i = blockIdx * blockDim + threadIdx
0b1001000100000001, # CONST R1, #1 ; increment
0b1001001000000010, # CONST R2, #2 ; N (matrix inner dimension)
0b1001001100000000, # CONST R3, #0 ; baseA (matrix A base address)
0b1001010000000100, # CONST R4, #4 ; baseB (matrix B base address)
0b1001010100001000, # CONST R5, #8 ; baseC (matrix C base address)
0b0110011000000010, # DIV R6, R0, R2 ; row = i // N
0b0101011101100010, # MUL R7, R6, R2
0b0100011100000111, # SUB R7, R0, R7 ; col = i % N
0b1001100000000000, # CONST R8, #0 ; acc = 0
0b1001100100000000, # CONST R9, #0 ; k = 0
# LOOP:
0b0101101001100010, # MUL R10, R6, R2
0b0011101010101001, # ADD R10, R10, R9
0b0011101010100011, # ADD R10, R10, R3 ; addr(A[i]) = row * N + k + baseA
0b0111101010100000, # LDR R10, R10 ; load A[i] from global memory
0b0101101110010010, # MUL R11, R9, R2
0b0011101110110111, # ADD R11, R11, R7
0b0011101110110100, # ADD R11, R11, R4 ; addr(B[i]) = k * N + col + baseB
0b0111101110110000, # LDR R11, R11 ; load B[i] from global memory
0b0101110010101011, # MUL R12, R10, R11
0b0011100010001100, # ADD R8, R8, R12 ; acc = acc + A[i] * B[i]
0b0011100110010001, # ADD R9, R9, R1 ; increment k
0b0010000010010010, # CMP R9, R2
0b0001100000001100, # BRn LOOP ; loop while k < N
0b0011100101010000, # ADD R9, R5, R0 ; addr(C[i]) = baseC + i
0b1000000010011000, # STR R9, R8 ; store C[i] in global memory
0b1111000000000000 # RET ; end of kernel
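The kernel above can be sanity-checked against a plain Python model of what each thread computes (the A and B values here are made-up example data; N = 2, A at base 0, B at base 4, C at base 8 as in the listing):

```python
# Software model of the matmul kernel: one loop iteration per GPU thread i.
N = 2
A = [1, 2, 3, 4]   # 2x2 matrix at baseA = 0 (row-major)
B = [5, 6, 7, 8]   # 2x2 matrix at baseB = 4
C = [0] * 4        # results land at baseC = 8

for i in range(N * N):          # i = blockIdx * blockDim + threadIdx
    row, col = i // N, i % N    # the DIV / MUL / SUB sequence
    acc = 0
    for k in range(N):          # LOOP body, repeated while k < N (BRn)
        acc += A[row * N + k] * B[k * N + col]
    C[i] = acc                  # STR at baseC + i

# C == [1*5+2*7, 1*6+2*8, 3*5+4*7, 3*6+4*8] == [19, 22, 43, 50]
```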
Architecture | Cycles | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) |
---|---|---|---|---|
With Cache | 706 | 17,675,001.00 | 54.25 | 325,784.34 |
Without Cache | 491 | 12,300,001.00 | 11.64 | 1,056,952.17 |
Architecture | Cycles | SIM TIME (ns) | REAL TIME (s) | RATIO (ns/s) |
---|---|---|---|---|
With Cache | 270 | 6,775,001.00 | 45.76 | 148,046.60 |
Without Cache | 178 | 4,475,001.00 | 27.11 | 165,061.32 |
Block Concept:
GEMM is processed in blocks sized to fit the CPU caches, so each packed block can be reused many times by the compute kernel.

Purpose of Pack:
Copy a block of the source matrix into a contiguous, cache- and SIMD-friendly layout before the compute kernel runs.

Purpose of Unpack:
Copy the accumulated block results back out to the destination matrix in its original storage layout.
- `single_thread_gemm.h` and `multi_thread_gemm.h`: Handle single-threaded and multi-threaded GEMM workflows.
- `pack.h`, `compute.h`, `unpack.h`: Implement the three main stages: "Pack / Kernel / Unpack."
- NEON-optimized versions (`pack_neon.h`, `kernel_neon.h`): Optimizations for ARM platforms.

We'll start by packing Matrix A and Matrix B from their original locations to their packed locations.
; PACK A: Copy from 0-3 to 16-19
; 1. Load A[0] from address 0 and store to 16
CONST R3, #0 ; R3 = 0
LDR R0, R3 ; R0 = MEM[R3] => MEM[0]
CONST R4, #16 ; R4 = 16
STR R4, R0 ; MEM[16] = R0
; 2. Load A[1] from address 1 and store to 17
CONST R3, #1 ; R3 = 1
LDR R0, R3 ; R0 = MEM[R3] => MEM[1]
CONST R4, #17 ; R4 = 17
STR R4, R0 ; MEM[17] = R0
; 3. Load A[2] from address 2 and store to 18
CONST R3, #2 ; R3 = 2
LDR R0, R3 ; R0 = MEM[R3] => MEM[2]
CONST R4, #18 ; R4 = 18
STR R4, R0 ; MEM[18] = R0
; 4. Load A[3] from address 3 and store to 19
CONST R3, #3 ; R3 = 3
LDR R0, R3 ; R0 = MEM[R3] => MEM[3]
CONST R4, #19 ; R4 = 19
STR R4, R0 ; MEM[19] = R0
; PACK B: Copy from 4-7 to 20-23
; 1. Load B[0] from address 4 and store to 20
CONST R3, #4 ; R3 = 4
LDR R0, R3 ; R0 = MEM[R3] => MEM[4]
CONST R4, #20 ; R4 = 20
STR R4, R0 ; MEM[20] = R0
; 2. Load B[1] from address 5 and store to 21
CONST R3, #5 ; R3 = 5
LDR R0, R3 ; R0 = MEM[R3] => MEM[5]
CONST R4, #21 ; R4 = 21
STR R4, R0 ; MEM[21] = R0
; 3. Load B[2] from address 6 and store to 22
CONST R3, #6 ; R3 = 6
LDR R0, R3 ; R0 = MEM[R3] => MEM[6]
CONST R4, #22 ; R4 = 22
STR R4, R0 ; MEM[22] = R0
; 4. Load B[3] from address 7 and store to 23
CONST R3, #7 ; R3 = 7
LDR R0, R3 ; R0 = MEM[R3] => MEM[7]
CONST R4, #23 ; R4 = 23
STR R4, R0 ; MEM[23] = R0
CONST R3, #16 ; R3 = 16 (baseA)
CONST R4, #20 ; R4 = 20 (baseB)
CONST R5, #8 ; R5 = 8 (baseC remains the same)
MUL R0, %blockIdx, %blockDim ; R0 = blockIdx * blockDim
ADD R0, R0, %threadIdx ; R0 = R0 + threadIdx
CONST R1, #1 ; R1 = 1 (increment)
CONST R2, #2 ; R2 = 2 (N = matrix inner dimension)
DIV R6, R0, R2 ; R6 = R0 / R2 (row)
MUL R7, R6, R2 ; R7 = R6 * R2
SUB R7, R0, R7 ; R7 = R0 - R7 (col)
CONST R8, #0 ; R8 = 0 (accumulator)
CONST R9, #0 ; R9 = 0 (k)
LOOP:
MUL R10, R6, R2 ; R10 = R6 * R2
ADD R10, R10, R9 ; R10 = R10 + R9
ADD R10, R10, R3 ; R10 = R10 + baseA (addr(A[i]))
LDR R10, R10 ; R10 = MEM[addr(A[i])]
MUL R11, R9, R2 ; R11 = R9 * R2
ADD R11, R11, R7 ; R11 = R11 + col
ADD R11, R11, R4 ; R11 = R11 + baseB (addr(B[i]))
LDR R11, R11 ; R11 = MEM[addr(B[i])]
MUL R12, R10, R11 ; R12 = R10 * R11
ADD R8, R8, R12 ; acc += R12
ADD R9, R9, R1 ; k += 1
CMP R9, R2 ; Compare k with N
BRn LOOP ; If k < N, branch to LOOP
ADD R9, R5, R0 ; addr(C[i]) = baseC + i
STR R9, R8 ; MEM[addr(C[i])] = acc
RET ; end of kernel
; UNPACK C: Copy from 8-11 to 24-27
; 1. Load C[0] from address 8 and store to 24
CONST R3, #8 ; R3 = 8
LDR R0, R3 ; R0 = MEM[R3] => MEM[8]
CONST R4, #24 ; R4 = 24
STR R4, R0 ; MEM[24] = R0
; 2. Load C[1] from address 9 and store to 25
CONST R3, #9 ; R3 = 9
LDR R0, R3 ; R0 = MEM[R3] => MEM[9]
CONST R4, #25 ; R4 = 25
STR R4, R0 ; MEM[25] = R0
; 3. Load C[2] from address 10 and store to 26
CONST R3, #10 ; R3 = 10
LDR R0, R3 ; R0 = MEM[R3] => MEM[10]
CONST R4, #26 ; R4 = 26
STR R4, R0 ; MEM[26] = R0
; 4. Load C[3] from address 11 and store to 27
CONST R3, #11 ; R3 = 11
LDR R0, R3 ; R0 = MEM[R3] => MEM[11]
CONST R4, #27 ; R4 = 27
STR R4, R0 ; MEM[27] = R0
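Putting the three stages together, here is a Python model of the whole Pack / Kernel / Unpack flow over one flat memory (the A and B element values are made up for illustration; the addresses and bases match the listings above):

```python
# End-to-end model: pack A/B, run the matmul kernel on the packed
# copies, then unpack C to its final location.
N = 2
MEM = [0] * 28
MEM[0:4] = [1, 2, 3, 4]   # A at 0-3
MEM[4:8] = [5, 6, 7, 8]   # B at 4-7

# Pack: A -> 16-19, B -> 20-23
for off in range(4):
    MEM[16 + off] = MEM[0 + off]
    MEM[20 + off] = MEM[4 + off]

baseA, baseB, baseC = 16, 20, 8   # kernel reads the packed bases
for i in range(N * N):            # one GPU thread per output element
    row, col = i // N, i % N
    acc = 0
    for k in range(N):
        acc += MEM[baseA + row * N + k] * MEM[baseB + k * N + col]
    MEM[baseC + i] = acc          # C at 8-11

# Unpack: C -> 24-27
for off in range(4):
    MEM[24 + off] = MEM[8 + off]

# MEM[24:28] == [19, 22, 43, 50], the product of the two 2x2 matrices
```

Note that only the base addresses change between the original kernel and the packed version; the loop body is identical, which is exactly why the pack stage is worth its copy cost when blocks are reused.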