Cuda 學習 - Parallel Reduction（上）

Cuda 學習 - Parallel Reduction（上） === # 介紹 * 本文參照 [Nvidia](https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf) 的文件並搭配自身電腦做測試整理出此篇筆記 * 可以理解為所有數字相加求和的平行化, 例如 norm 或是單純 vector 內的所有元素和 * 為一個Tree base 的結構進行求和 ![image](https://hackmd.io/_uploads/SJyL03c00.png) * 為了能達到最佳效能 * 通常處理的數據量級較大 * GPU 總是保持在 busy 且盡可能不浪費運算資源 * 每一次 thread block 迭代會減少使用Array 中的數據資料 # Cuda 在 block 的限制 * 由於 Cuda 沒有 block 級別的同步主要原因為 * 擁有大量處理器的 GPU 來說，在硬件中實現這一點非常昂貴 * 迫使工程師 coding 時要使用較少的 block 完成以避免 deadlock 現象 * 於是利用以下方式進行 global 的同步 * cuda kernel lanuch 為 global 的機制 ![image](https://hackmd.io/_uploads/HJ5AETcAC.png) * 由上圖可看到 Level 0 可視為第一個 kernel 啟動比使用了 8 個 block 最為計算資源，等結束後即為一次global 的同步並在 Level 1 中開啟 1 個 block 來啟動第二次的 kernel 進而執行平行化運算。 * 接下來我們試著求出一個 Array 內的所有元素和 # 選取程式的衡量標準 * 速度(ms) ：看程式執行速度 * GFlops ：用于分析计算kernel的计算性能． * BandWidth : 看資料的吞吐量 * 可用以下方式查詢並計算自己 GPU 的Bandwidth ```cpp= cudaDeviceProp prop; cudaGetDeviceProperties(&prop, 0); std::cout << "Memory Clock Rate (MHz): " << prop.memoryClockRate / 1000 << " MHz" << std::endl; std::cout << "Memory Bus Width (bits): " << prop.memoryBusWidth << " bits" << std::endl; ``` * 以我自己 RTX 3050 laptop 4GB 為例 * (Interface 128 bits)*(DDR 6GB/MHZ)*2/8 = 192 GB/s * 用于分析kernel的記憶體使用情况, 在此我們使用速度及 BandWidth 作為衡量標準 # Interleaved Addressing * 此方法的目的是將 A[i] += A[i+stride] 上其中 i 為偶數的 index，且每經過一個 Step 其 Stride也會增加一倍。 * Step 1 : A[i] += A[i+stride] * Step 2 : A[i] += A[i+2*stride] * Step 3 : A[i] += A[i+4*stride] * Step 4 : A[i] += A[i+8*stride] ![image](https://hackmd.io/_uploads/HJtyLpcCR.png) ```cpp= __global__ void reduce0(int *g_idata, int *g_odata) { extern __shared__ int sdata[]; // each thread loads one element from global to shared mem unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*blockDim.x + threadIdx.x; sdata[tid] = g_idata[i]; __syncthreads(); // do reduction in shared mem for(unsigned int s=1; s < blockDim.x; s *= 2) { if (tid % (2*s) == 0) { sdata[tid] += sdata[tid + s]; } __syncthreads(); } // write result for this block to global mem if (tid == 0) g_odata[blockIdx.x] = sdata[0]; } ``` * 在這段程式內的 for 迴圈中有一個 (tid%2s)==0 的判斷這會有 Warp divergence 的問題導致程式效率不高。為此我們可以做些更改 # Interleaved Addressing Improved * 將原本的 tid % (2*s) == 0 改成 ```cpp= int index = 2 * s * tid; if (index < blockDim.x) { sdata[index] += sdata[index + s]; } ``` * 消除了 Warp divergence，但這邊造成了新的問題 bank conflict, 多個 Warp 讀取同一塊 Shared memory 導致該平行化的程序變成需要序列化執行，這也會使程式效率不佳 ```cpp= __global__ void reduce1(int *g_idata, int *g_odata) { extern __shared__ int sdata[]; // each thread loads one element from global to shared mem unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*blockDim.x + threadIdx.x; sdata[tid] = g_idata[i]; __syncthreads(); // do reduction in shared mem for(unsigned int s=1; s < blockDim.x; s *= 2) { int index = 2 * s * tid; if (index < blockDim.x) { sdata[index] += sdata[index + s]; } __syncthreads(); } // write result for this block to global mem if (tid == 0) g_odata[blockIdx.x] = sdata[0]; } ``` # Sequential Addressing * 為了避免 bank conflict 的發生於改將原本交錯儲存的方式都改到前半的 threads上 * 然而還存在一個缺點, 那就是會有一半以上的 thread 是沒有作用 ![Screenshot from 2024-10-03 11-30-11](https://hackmd.io/_uploads/SJ4ygqo0R.png) ```cpp= __global__ void reduce2(int *g_idata, int *g_odata) { extern __shared__ int sdata[]; // each thread loads one element from global to shared mem unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*blockDim.x + threadIdx.x; sdata[tid] = g_idata[i]; __syncthreads(); // // do reduction in shared mem for (unsigned int s=blockDim.x/2; s>0; s>>=1) { if (tid < s) { sdata[tid] += sdata[tid + s]; } __syncthreads(); } // write result for this block to global mem if (tid == 0) g_odata[blockIdx.x] = sdata[0]; } ``` # 計算結果 * 硬體環境 * 筆電 : Dell G15 * GPU : RTX3050 Laptop 4G * CPU : 12th Gen Intel(R) Core(TM) i5-12500H, 16 cores * Memory : 16GB * 計算任務 * 2^22 的 Array 內的元素和 * block : 128 * grid : 2^22/128 | Method | Time (ms)| Bandwidth (GB/s) | | ------------------------------------ | -------- | -----------------| | Interleaved Addressing | 0.5519 | 56.6171 | | Interleaved Addressing (Improved) | 0.493769 | 63.2887 | | Sequential Addressing | 0.473715 | 65.9679 | * 加速結果可看到目前以加速了約 1.165X, 尚未確定其他原因但在官方文件可看到 Sequential Addressing 的加速照理來說應該也很明顯才對 source : https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf https://blog.csdn.net/langb2014/article/details/51348523 github : https://github.com/GU-Lin/Cuda_Practice/tree/main/5_Parallel_Reduction * ###### tags : `Cuda`