---
# System prepended metadata

title: Cuda 學習 - Parallel Reduction（上）
tags: [Cuda]

---

Cuda 學習 - Parallel Reduction（上）
===
# 介紹
* 本文參照 [Nvidia](https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf) 的文件並搭配自身電腦做測試整理出此篇筆記
* 可以理解為所有數字相加求和的平行化, 例如 norm 或是單純 vector 內的所有元素和
* 為一個Tree base 的結構進行求和
 ![image](https://hackmd.io/_uploads/SJyL03c00.png)
* 為了能達到最佳效能
    * 通常處理的數據量級較大
    * GPU 總是保持在 busy 且盡可能不浪費運算資源
    * 每一次 thread block 迭代會減少使用Array 中的數據資料

# Cuda 在 block 的限制
* 由於 Cuda 沒有 block 級別的同步主要原因為
    * 擁有大量處理器的 GPU 來說，在硬件中實現這一點非常昂貴 
    * 迫使工程師 coding 時要使用較少的 block 完成以避免 deadlock 現象
* 於是利用以下方式進行 global 的同步
    * cuda kernel lanuch 為 global 的機制

![image](https://hackmd.io/_uploads/HJ5AETcAC.png)
* 由上圖可看到 Level 0 可視為第一個 kernel 啟動比使用了 8 個 block 最為計算資源，等結束後即為一次global 的同步並在 Level 1 中開啟 1 個 block 來啟動第二次的 kernel 進而執行平行化運算。
* 接下來我們試著求出一個 Array 內的所有元素和

# 選取程式的衡量標準
* 速度(ms) ： 看程式執行速度
* GFlops ： 用于分析计算kernel的计算性能．
* BandWidth : 看資料的吞吐量
    * 可用以下方式查詢並計算自己 GPU 的Bandwidth
    ```cpp=
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::cout << "Memory Clock Rate (MHz): " << prop.memoryClockRate / 1000 << " MHz" << std::endl;
    std::cout << "Memory Bus Width (bits): " << prop.memoryBusWidth << " bits" << std::endl;
    ```
    * 以我自己 RTX 3050 laptop 4GB 為例
        * (Interface 128 bits)*(DDR 6GB/MHZ)*2/8 = 192 GB/s
    * 用于分析kernel的記憶體使用情况, 在此我們使用速度及 BandWidth 作為衡量標準
# Interleaved Addressing
* 此方法的目的是將 A[i] += A[i+stride] 上其中 i 為偶數的 index，且每經過一個 Step 其 Stride也會增加一倍。
* Step 1 : A[i] += A[i+stride]
* Step 2 : A[i] += A[i+2*stride]
* Step 3 : A[i] += A[i+4*stride]
* Step 4 : A[i] += A[i+8*stride]
![image](https://hackmd.io/_uploads/HJtyLpcCR.png)

```cpp=
__global__ void reduce0(int *g_idata, int *g_odata) 
{
    extern __shared__ int sdata[];
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();
    // do reduction in shared mem
    for(unsigned int s=1; s < blockDim.x; s *= 2) 
    {
        if (tid % (2*s) == 0) {
        sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```
* 在這段程式內的 for 迴圈中有一個 (tid%2s)==0 的判斷這會有 Warp divergence 的問題導致程式效率不高。為此我們可以做些更改
# Interleaved Addressing Improved
* 將原本的 tid % (2*s) == 0 改成 
```cpp=
int index = 2 * s * tid;
if (index < blockDim.x) 
{
    sdata[index] += sdata[index + s];
}
```
* 消除了 Warp divergence，但這邊造成了新的問題 bank conflict, 多個 Warp 讀取同一塊 Shared memory 導致該平行化的程序變成需要序列化執行，這也會使程式效率不佳
```cpp=
__global__ void reduce1(int *g_idata, int *g_odata) 
{
    extern __shared__ int sdata[];
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();
    // do reduction in shared mem
    for(unsigned int s=1; s < blockDim.x; s *= 2) 
    {
        int index = 2 * s * tid;
        if (index < blockDim.x) 
        {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```

# Sequential Addressing
* 為了避免 bank conflict 的發生於改將原本交錯儲存的方式都改到前半的 threads上
* 然而還存在一個缺點, 那就是會有一半以上的 thread 是沒有作用
![Screenshot from 2024-10-03 11-30-11](https://hackmd.io/_uploads/SJ4ygqo0R.png)

```cpp=
__global__ void reduce2(int *g_idata, int *g_odata) 
{
    extern __shared__ int sdata[];
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();
    // // do reduction in shared mem
    for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
        if (tid < s) 
        {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```

# 計算結果
* 硬體環境
    * 筆電 : Dell G15
    * GPU : RTX3050 Laptop 4G 
    * CPU : 12th Gen Intel(R) Core(TM) i5-12500H, 16 cores
    * Memory : 16GB
* 計算任務
    * 2^22 的 Array 內的元素和
    * block : 128
    * grid  : 2^22/128

| Method                               | Time (ms)| Bandwidth (GB/s) |
| ------------------------------------ | -------- | -----------------|
| Interleaved Addressing               | 0.5519   | 56.6171          |
| Interleaved Addressing (Improved)    | 0.493769 | 63.2887          |
| Sequential Addressing                | 0.473715 | 65.9679          |
* 加速結果可看到目前以加速了約 1.165X, 尚未確定其他原因但在官方文件可看到 Sequential Addressing 的加速照理來說應該也很明顯才對

source : https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
https://blog.csdn.net/langb2014/article/details/51348523
github : https://github.com/GU-Lin/Cuda_Practice/tree/main/5_Parallel_Reduction
* ###### tags : `Cuda`