Cuda 學習 - Shared Memory

Cuda 學習 - Shared Memory === # GPU 中的記憶體類別 * 在以下圖中敘述了 GPU 內各個記憶體之間的運作關係, 其中主要使用的記憶體有以下 * Register * Shared Memory * Global/Local Memory * Constant Memory * 純讀可以被 cached, 用來保存不會被更改的數據使用 * 讀寫速度與 global 相當 * Global variable 使用 ![image](https://hackmd.io/_uploads/ByxcEjCR0.png) # Register * 速度最快的記憶體 * 讀寫由 thread 使用, Low latency & high bandwidth 並由 Local variable 使用 * 使用上空間較小且要注意 bank conflicts * 沒有辦法用 index 記錄每一個位置, 如果是個 array 本身就會儲存在 local memory 下 * 當 thread 越多可使用的 local variable 越少就會造成記憶體的擠壓變成使用到 local memory, 這種情形稱為 register spilling 所以選取適當的 thread 數量是關鍵而不是越多越好 # Shared Memory * 讀寫由 block 使用, 速度與 register 相似並由 Shared variable 使用 * 在同一個 block 裡面的所有 threads 共同使用 * 如果 threads 需要跨 blocks 時則需要透過 global memory * 在Pascal , Volta 之後的 GPU 中, Shared memory 不會在與 register做競爭 * 資料使用情形 : global memory -> shared memory -> block 內的 thread 執行任務 -> update shared memory -> global memory ![image](https://hackmd.io/_uploads/Hk26znRAR.png) # Global/Local Memory * Global memory 是由 grid 使用, Local memory 是由 thread 使用, 當 register 不夠是會呼叫使用 local memoty 使用 * Global 及 Local 是同一塊記憶體膽但根據使用情境不同而有所差異 * 沒有 cached * Global Variable 使用 Global Memory * Local Variable 使用 Local Memory * 放在 DRAM 上但會有一個 private area, 可以被 cached 在 L1下且被 thread 給使用 * 在編譯時會決定是否被使用, 其中 L1 為 on-chip 有較快的運算速度(在shared memory 同一塊), 而 L2 為 off-chip 相對運算較慢(在global memory同一塊) ![image](https://hackmd.io/_uploads/S17Vkh0A0.png) # Constant memory * 純讀可以被 cached, 用來保存不會被更改的數據使用 * 讀寫速度與 global 相當 * Global variable 使用 # Memory 的讀取方式 * Aligned memory access * Coalesced memory access # 怎麼查看 register * 在以下的程式中, gid, x1. x2, x3, x4 是屬於 block 內的 local variable 並在編譯時自動分配合適的 register 數量 ```cpp= __global__ void test(int* results, int size) { int gid = blockDim.x * blockIdx.x + threadIdx.x; int x1 = 3465; int x2 = 1768; int x3 = 453; int x4 = x1 + x2 + x3; if(gid < size) { results[gid] = x4; } } ``` * 可以看到下圖中的 register 數量為 4 ![Screenshot from 2024-10-14 21-04-32](https://hackmd.io/_uploads/Hkz-vqqkkg.png) * 需要在 cmake 中加入-Xptxas=-v ```cmake= target_compile_options(main PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-Xptxas=-v>) ``` # 怎麼使用 shared memory * 有分 static 及 dynamic 的方式 * static 的分配是在 kernel function 內宣告 shared memory, 並在需要使用的 shared memory 前面加上 ```__shared__``` ```cpp= #define SHARED_SIZE 128 __global__ void test_shared_static(int* input, int* output, int size) { int tid = threadIdx.x; int gid = blockDim.x * blockIdx.x + threadIdx.x; __shared__ int shared_mem[SHARED_SIZE]; if(gid < size) { shared_mem[tid] = input[gid]; output[gid] = shared_mem[tid]+1; } } ``` * dynamic 的分配與 static 的類似但原本的 ```__shared__``` 要改成 ```extern __shared__```, 並在啟用 kernel function 時要帶入使用多少 shared memory 的大小 ```cpp= __global__ void test_shared_dynamic(int* input, int* output, int size) { int tid = threadIdx.x; int gid = blockDim.x * blockIdx.x + threadIdx.x; extern __shared__ int shared_mem[]; if(gid < size) { shared_mem[tid] = input[gid]; output[gid] = shared_mem[tid]+1; } } test_shared_dynamic<<<grid, block, sizeof(int)* SHARED_SIZE>>>(d_input, d_output, size); ``` * 在這兩段程式中可以看到 ```shared_mem[tid] = input[gid];```這行程式已經相當於在從 global memory 複製到 shared memory 使用, 而 global memory 的更新則靠著每個 shared memory 上元素加一獲得更新```output[gid] = shared_mem[tid]+1;``` * 此寫法的有點是善用 shared memory 在 block 中運算速度快的特性使用 * 編譯結果中可看到分配了 340 的 shared memory ![Screenshot from 2024-10-14 21-49-03](https://hackmd.io/_uploads/Byt9-i5Jye.png) # Shared memory 的 Bank conflict * Shared memory 內是被分成 bank 的方式讓 warp 內的 thread 讀取資料的方式, 而 bank conflicit 是只說 warp 內的 thread 同時讀取到同一個 bank, 造成讀寫速度變成序列化的形式進而使得效能變慢 ## 以下方式沒有 bank conflicit * 每一個 thread 都讀取到唯一一個 bank ![Screenshot from 2024-10-17 22-01-24](https://hackmd.io/_uploads/SyQT_qC1Jx.png) * 在 half warp 內即使讀取到同一個 bank 也不算, 這歸功於內部的 brodcast 技術 ![Screenshot from 2024-10-17 22-00-34](https://hackmd.io/_uploads/ByY5u9C1kx.png) ## 以下形式有 bank conflicit * 顯而易見的看到中間的讀取方式存在 bank conflicit 這會導致程式效率變慢 ![Screenshot from 2024-10-17 22-02-44](https://hackmd.io/_uploads/HJEGF5CJJg.png) # 矩陣轉置 * 先來個一般版本的轉置 ```cpp= __global__ void transpose_Naive(int* input, int* output, int width, int height) { int x = blockIdx.x * blockDim.x + threadIdx.x; int y = blockIdx.y * blockDim.y + threadIdx.y; // Check boundary if (x < width && y < height) { output[x * height + y] = input[y * width + x]; } } ``` ![Screenshot from 2024-10-17 21-44-13](https://hackmd.io/_uploads/HJA3Nc0kyl.png) * 加入 Shared Memory 後 ```cpp= __global__ void transpose_SHEM(int* input, int* output, int width, int height) { __shared__ int shared_mem[SHARED_SIZE][SHARED_SIZE]; int x = blockIdx.x * SHARED_SIZE + threadIdx.x; int y = blockIdx.y * SHARED_SIZE + threadIdx.y; // Copy data from global memory to shared memory if (x < width && y < height) { shared_mem[threadIdx.y][threadIdx.x] = input[y * width + x]; } __syncthreads(); // global index int transposed_x = blockIdx.y * SHARED_SIZE + threadIdx.x; int transposed_y = blockIdx.x * SHARED_SIZE + threadIdx.y; // Copy data to global memory if (transposed_x < height && transposed_y < width) { output[transposed_y * height + transposed_x] = shared_mem[threadIdx.x][threadIdx.y]; } } ``` ![image](https://hackmd.io/_uploads/S1MJH9C1kg.png) * 在加入 Padding 必面 bank conflicit ```cpp= __global__ void transpose_SHEM_padding(int* input, int* output, int width, int height) { __shared__ int shared_mem[SHARED_SIZE][SHARED_SIZE + 1]; // Avoid bank conflict int x = blockIdx.x * SHARED_SIZE + threadIdx.x; int y = blockIdx.y * SHARED_SIZE + threadIdx.y; // Copy data from global memory to shared memory if (x < width && y < height) { shared_mem[threadIdx.y][threadIdx.x] = input[y * width + x]; } __syncthreads(); // global index int transposed_x = blockIdx.y * SHARED_SIZE + threadIdx.x; int transposed_y = blockIdx.x * SHARED_SIZE + threadIdx.y; // Copy data to global memory if (transposed_x < height && transposed_y < width) { output[transposed_y * height + transposed_x] = shared_mem[threadIdx.x][threadIdx.y]; } } ``` ![Screenshot from 2024-10-17 21-45-37](https://hackmd.io/_uploads/ryWMS5Ckyg.png) * Test case * Random Matrix : 4096x4096 * Block : (32,32) * Grid : ((4096+32-1)/32,(4096+32-1)/32) * Environment * 筆電 : Dell G15 * GPU : RTX3050 Laptop 4G * CPU : 12th Gen Intel® Core™ i5-12500H, 16 cores * Memory : 16GB | Method | Time cost(sec) | BandWidth(GB/sec) | | ------------------------------ | -------------- | ----------------- | | Naive | 0.00327 | 38.1913 | | Shared Memory | 0.0018 | 69.3969 | | Shared Memory with padding | 0.00122 | 102.6 | source : https://safari.ethz.ch/architecture/fall2019/lib/exe/fetch.php?media=2013_programming_massively_parallel_processors_a_hands-on_approach_2nd.pdf https://leimao.github.io/blog/CUDA-Shared-Memory-Bank/ https://www.cs.nthu.edu.tw/~cherung/teaching/2010gpucell/CUDA04.pdf github : https://github.com/GU-Lin/Cuda_Practice/tree/main/6_Memory