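# Parallel Programming @ NYCU - HW5

#### **`0716221 余忠旻`**

### Q1: What are the pros and cons of the three methods? Give an assumption about their performances.

* **Method 1:**
    * **pros:**
        1. Each thread handles only one pixel at a time, which makes good use of the GPU's parallelism.
        2. No page-locked memory is used, so the impact on overall system performance is small.
    * **cons:**
        1. Load imbalance is not fully solved: the amount of computation differs from thread to thread, so threads that finish early must wait for those that have not.
        2. Because of the rule `not allowed to use the image input as the host memory directly`, the host buffer is pageable; it may not be resident in memory when needed, and the transfer has to wait for it.
* **Method 2:**
    * **pros:**
        1. For 2D data stored with aligned rows, fetching is more efficient because CUDA reads aligned data faster (256 bits at a time).
        2. The host memory is page-locked, so it is guaranteed to be resident in memory when needed, which avoids waiting on paging before a transfer.
    * **cons:**
        1. Load imbalance remains: the amount of computation differs from thread to thread, so threads that finish early must wait for those that have not.
        2. Using too much page-locked memory hurts system performance.
        3. `cudaMallocPitch` pads each row so that it is aligned to a multiple of 256, which may allocate memory that is never used, and a `cudaMallocPitch` call takes longer than `cudaMalloc`.
        4. This method does not exploit the zero-copy access that page-locked memory allows, so it still pays for a data copy.
* **Method 3:**
    * **pros:**
        1. Each thread handles several pixels, which reduces the number of times blocks have to be scheduled onto the SMs.
        2. For 2D data stored with aligned rows, fetching is more efficient because CUDA reads aligned data faster (256 bits at a time).
        3. The host memory is page-locked, so it is guaranteed to be resident in memory when needed, which avoids waiting on paging before a transfer.
    * **cons:**
        1. Each thread processes a group of pixels, so more of the work runs serially inside a parallel architecture, which lowers GPU efficiency.
        2. Using too much page-locked memory hurts system performance.
        3. `cudaMallocPitch` pads each row so that it is aligned to a multiple of 256, which may allocate memory that is never used, and a `cudaMallocPitch` call takes longer than `cudaMalloc`.
        4. This method does not exploit the zero-copy access that page-locked memory allows, so it still pays for a data copy.

:::info
**My Assumption:**
Compared with Method 1 and Method 2, Method 3 suffers more from load imbalance and also has an extra serial part to process, so I expect it to be the slowest of the three. Method 2 does not take advantage of zero-copy access, and `cudaMallocPitch` may allocate a fair amount of extra GPU memory, which increases the amount of data handled by the memory copy; the `cudaMallocPitch` call is also slower than `cudaMalloc`. In most cases I therefore expect Method 2 to be slower than Method 1.

**Expected speed, fastest to slowest: Method 1 > Method 2 > Method 3**
:::

___

### Q2: How are the performances of the three methods?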
Plot a chart to show the differences among the three methods
* for VIEW 1 and VIEW 2, and
* for different maxIteration (1000, 10000, and 100000).

* **VIEW 1**

| maxIteration / method | Method 1 | Method 2 | Method 3 |
| -------- | -------- | -------- | -------- |
| 1000 | 7.028 ms | 9.104 ms | 10.773 ms |
| 10000 | 34.853 ms | 36.732 ms | 48.659 ms |
| 100000 | 321.344 ms | 324.041 ms | 434.161 ms |

![](https://i.imgur.com/N6Wfuah.png)

* **VIEW 2**

| maxIteration / method | Method 1 | Method 2 | Method 3 |
| -------- | -------- | -------- | -------- |
| 1000 | 3.968 ms | 5.804 ms | 6.701 ms |
| 10000 | 6.627 ms | 8.497 ms | 18.139 ms |
| 100000 | 32.402 ms | 35.311 ms | 116.718 ms |

![](https://i.imgur.com/KwPQt78.png)

___

### Q3: Explain the performance differences thoroughly based on your experimental results. Do the results match your assumption? Why or why not.

The Q2 tables show that Method 3 is the slowest for every iteration count and for both VIEW 1 and VIEW 2. Each thread in Method 3 processes a group of pixels, so it has more serial computation than Method 1 and Method 2, and its load imbalance is also more severe. In every configuration the measured order is Method 1 fastest, then Method 2, then Method 3, which matches my assumption in Q1.

Next, comparing Method 1 and Method 2: for both of them, the execution time differs greatly between VIEW 1 and VIEW 2.

:::success
This can be explained with a conclusion from HW2:
**In an 8-bit grayscale image, 0 is black and the maximum value 255 is white; the larger the value, the lighter the color.**
**So when count=256, i can be any value from 0 to 255, and its magnitude determines how light the pixel is.**

Comparing VIEW 1 and VIEW 2:
![](https://i.imgur.com/jzjx6af.jpg)
VIEW 1 clearly has far more light pixels than VIEW 2, and they are concentrated in the middle of the image.
:::

A lighter pixel means more iterations were run for it, so VIEW 1 takes longer than VIEW 2. In VIEW 1, both Method 1 and Method 2 are held back by the threads that process the light pixels in the middle, and everything ends up waiting for the longest-running thread. In VIEW 2 the light pixels are spread more evenly, so execution is faster. In both views Method 2 is slightly slower than Method 1. One possible reason is that `cudaMallocPitch` pads rows to a multiple of 256 and may allocate memory that is never used, which increases the amount of data handled by the memory copy. Another possible reason is that load imbalance is slightly worse in Method 2 than in Method 1.

___

### Q4: Can we do even better? Think of a better approach and explain it. Implement your method in `kernel4.cu`.
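The comparison above shows that Method 1 performs best, so I used it as the starting point. The assignment requires Method 1 to allocate a separate host buffer for the data transfer, but that buffer is not actually needed: **the values on the device can be copied straight into `img` with `cudaMemcpy`**, which saves the time spent allocating, copying, and freeing the extra buffer. In addition, because the input image is 1600 * 1200, I changed each block dimension to **8** threads `=> dim3 threadsPerBlock(8, 8);`, which was the setting that gave the largest speedup.

___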
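The Q4 idea can be sketched roughly as follows. This is a minimal illustration and not the submitted `kernel4.cu`: the `hostFE` signature, the `mandel` and `mandelKernel` names, and the escape-time loop are assumptions modeled on a typical Mandelbrot assignment. Only the `dim3 threadsPerBlock(8, 8)` setting and the direct `cudaMemcpy` into `img` come from the text above. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Assumed per-pixel routine: escape-time iteration count for the point c.
__device__ int mandel(float c_re, float c_im, int maxIterations) {
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < maxIterations; ++i) {
        if (z_re * z_re + z_im * z_im > 4.f) break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}

// One thread per pixel, as in Method 1 (kernel name is hypothetical).
__global__ void mandelKernel(float lowerX, float lowerY, float stepX, float stepY,
                             int* d_img, int resX, int resY, int maxIterations) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= resX || y >= resY) return;  // guard for sizes not divisible by 8
    d_img[y * resX + x] =
        mandel(lowerX + x * stepX, lowerY + y * stepY, maxIterations);
}

// Assumed host entry point; img is the caller's output buffer.
void hostFE(float upperX, float upperY, float lowerX, float lowerY,
            int* img, int resX, int resY, int maxIterations) {
    float stepX = (upperX - lowerX) / resX;
    float stepY = (upperY - lowerY) / resY;
    size_t bytes = (size_t)resX * resY * sizeof(int);

    int* d_img = nullptr;
    cudaMalloc((void**)&d_img, bytes);

    dim3 threadsPerBlock(8, 8);  // block shape from the text above
    dim3 numBlocks((resX + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (resY + threadsPerBlock.y - 1) / threadsPerBlock.y);
    mandelKernel<<<numBlocks, threadsPerBlock>>>(lowerX, lowerY, stepX, stepY,
                                                 d_img, resX, resY, maxIterations);

    // Copy device results directly into img: no intermediate host buffer
    // to allocate, fill, and free. cudaMemcpy also waits for the kernel.
    cudaMemcpy(img, d_img, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_img);
}
```

Compared with the Method 1 flow described in Q1, the only host-side difference in this sketch is that the destination of the device-to-host copy is `img` itself, so one `malloc`, one host-side copy, and one `free` disappear.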