# Parallel Programming @ NYCU - HW6
#### **`0716221 余忠旻`**
### <font color="#3CB371"> Q1: Explain your implementation. How do you optimize the performance of convolution?
</font>
:::success
In `hostFE.c`,
I set the local work size to `8 x 8` and the global work size to `imageWidth x imageHeight`,
which guarantees that every pixel of the image has a work item assigned to process it in parallel.
In `kernel.cl`,
I use `get_global_id(0)` / `get_global_id(1)`
to determine which pixel of the image each work item is responsible for,
and then compute that pixel's convolution following `serialConv()`
from the provided `serialConv.c`.
:::
**<font color="#5E86C8" size = 4>Optimize the performance of convolution</font>**
* **<font color="#24367D">Rewriting the for loops</font>**
In the provided `serialConv()` from `serialConv.c`,
the convolution loops run an `if` boundary check on every iteration
to make sure the access stays inside the image, which adds considerable overhead.
Original version:
```cpp
for (k = -halffilterSize; k <= halffilterSize; k++) {
    for (l = -halffilterSize; l <= halffilterSize; l++) {
        if (i + k >= 0 && i + k < imageHeight &&
            j + l >= 0 && j + l < imageWidth) {
            sum += inputImage[(i + k) * imageWidth + j + l] *
                   filter[(k + halffilterSize) * filterWidth + l + halffilterSize];
        }
    }
}
outputImage[i * imageWidth + j] = sum;
```
    So I changed it to compute the start and end positions of k and l first, and fold them into the loop conditions.
    Modified version:
```cpp
int k_start = iy >= halffilterSize ? -halffilterSize : -iy;
int k_end = iy + halffilterSize < imageHeight ? halffilterSize : imageHeight - 1 - iy;
int l_start = ix >= halffilterSize ? -halffilterSize : -ix;
int l_end = ix + halffilterSize < imageWidth ? halffilterSize : imageWidth - 1 - ix;
for (int k = k_start; k <= k_end; k++) {
    int idxI = (iy + k) * imageWidth + ix;
    int idxF = (k + halffilterSize) * filterWidth + halffilterSize;
    for (int l = l_start; l <= l_end; l++) {
        sum += inputImage[idxI + l] * filter[idxF + l];
    }
}
outputImage[iy * imageWidth + ix] = sum;
```
    As the code above also shows, during the convolution
    the inputImage and filter indices advance by exactly 1 on each iteration
    of the inner loop, so we can store each starting index in a variable
    and simply add 1 as l increases,
    saving the repeated index arithmetic.
<br>
* **<font color="#24367D">Rewriting the filter handling</font>**
If a filter entry is 0, its multiplication and addition can be skipped entirely.
So in `hostFE.c` I inspect the filter,
trim away the all-zero outer rings, shrink filterWidth accordingly,
and thereby reduce the amount of computation.
```cpp
char *charFilter = (char *)malloc(filterWidth * filterWidth * sizeof(char));
int new_filterWidth = filterWidth;
int checkStart = 0;
int checkEnd = filterWidth - 1;
int check = 1;
while (check == 1 && checkStart < checkEnd) {
    for (int i = 0; i < filterWidth && check == 1; i++)
        if (filter[checkStart * filterWidth + i] != 0) check = 0; // top row
    for (int i = 0; i < filterWidth && check == 1; i++)
        if (filter[checkEnd * filterWidth + i] != 0) check = 0;   // bottom row
    for (int i = 0; i < filterWidth && check == 1; i++)
        if (filter[i * filterWidth + checkStart] != 0) check = 0; // left column
    for (int i = 0; i < filterWidth && check == 1; i++)
        if (filter[i * filterWidth + checkEnd] != 0) check = 0;   // right column
    if (check == 1) new_filterWidth -= 2; // the whole ring is zero: shrink the filter
    checkStart++;
    checkEnd--;
}
int charFilter_start = (filterWidth - new_filterWidth) / 2; // rings are removed in pairs, so the offset is symmetric
for (int i = 0; i < new_filterWidth; i++)
    for (int j = 0; j < new_filterWidth; j++)
        charFilter[i * new_filterWidth + j] =
            filter[(charFilter_start + i) * filterWidth + charFilter_start + j];
```
___
### <font color="#3CB371"> Q2: Rewrite the program using CUDA.
(1) Explain your CUDA implementation
(2) plot a chart to show the performance difference between using OpenCL and CUDA
(3) explain the result
</font>
**(1) Explain your CUDA implementation**
The CUDA implementation uses the same algorithm and optimizations as the OpenCL version.
I set `threadsPerBlock = 8 x 8` and choose
`numBlocks` so that the grid covers all `imageWidth x imageHeight` pixels,
which, as in OpenCL, guarantees that every pixel of the image has a thread assigned to process it in parallel.
CUDA differs little from OpenCL apart from the programming environment,
so I reused the HW5 setup, made minor adjustments to `main.cpp`, and implemented the convolution in `kernel.cu`:
```cpp=
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

#define BLOCK_SIZE 8

__global__ static void convolution(char *filter, int filterWidth, int imageHeight, int imageWidth,
                                   unsigned char *inputImage, unsigned short *outputImage) {
    int halffilterSize = filterWidth / 2;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    if (iy >= imageHeight || ix >= imageWidth) return; // guard threads outside the image
    int sum = 0;
    // Clamp the loop bounds up front so no boundary check is needed inside the loops
    int k_start = iy >= halffilterSize ? -halffilterSize : -iy;
    int k_end = iy + halffilterSize < imageHeight ? halffilterSize : imageHeight - 1 - iy;
    int l_start = ix >= halffilterSize ? -halffilterSize : -ix;
    int l_end = ix + halffilterSize < imageWidth ? halffilterSize : imageWidth - 1 - ix;
    for (int k = k_start; k <= k_end; k++) {
        int idxI = (iy + k) * imageWidth + ix;
        int idxF = (k + halffilterSize) * filterWidth + halffilterSize;
        for (int l = l_start; l <= l_end; l++) {
            sum += inputImage[idxI + l] * filter[idxF + l];
        }
    }
    outputImage[iy * imageWidth + ix] = sum;
}

void hostFE(int filterWidth, float *filter, int imageHeight, int imageWidth,
            float *inputImage, float *outputImage) {
    // Trim all-zero outer rings off the filter, as in the OpenCL version
    char *charFilter = (char *)malloc(filterWidth * filterWidth * sizeof(char));
    int new_filterWidth = filterWidth;
    int checkStart = 0;
    int checkEnd = filterWidth - 1;
    int check = 1;
    while (check == 1 && checkStart < checkEnd) {
        for (int i = 0; i < filterWidth && check == 1; i++)
            if (filter[checkStart * filterWidth + i] != 0) check = 0; // top row
        for (int i = 0; i < filterWidth && check == 1; i++)
            if (filter[checkEnd * filterWidth + i] != 0) check = 0;   // bottom row
        for (int i = 0; i < filterWidth && check == 1; i++)
            if (filter[i * filterWidth + checkStart] != 0) check = 0; // left column
        for (int i = 0; i < filterWidth && check == 1; i++)
            if (filter[i * filterWidth + checkEnd] != 0) check = 0;   // right column
        if (check == 1) new_filterWidth -= 2; // the whole ring is zero: shrink the filter
        checkStart++;
        checkEnd--;
    }
    int charFilter_start = (filterWidth - new_filterWidth) / 2; // rings are removed in pairs
    for (int i = 0; i < new_filterWidth; i++)
        for (int j = 0; j < new_filterWidth; j++)
            charFilter[i * new_filterWidth + j] =
                filter[(charFilter_start + i) * filterWidth + charFilter_start + j];

    int filterSize = new_filterWidth * new_filterWidth * sizeof(char);
    int imageSize = imageHeight * imageWidth;
    unsigned char *bufInput = (unsigned char *)malloc(imageSize * sizeof(unsigned char));
    unsigned short *bufOutput = (unsigned short *)malloc(imageSize * sizeof(unsigned short));
    for (int i = 0; i < imageSize; i++) bufInput[i] = inputImage[i];

    char *deviceFilter;
    unsigned char *deviceInput;
    unsigned short *deviceOutput;
    cudaMalloc(&deviceFilter, filterSize);
    cudaMalloc(&deviceInput, imageSize * sizeof(unsigned char));
    cudaMalloc(&deviceOutput, imageSize * sizeof(unsigned short));
    cudaMemcpy(deviceFilter, charFilter, filterSize, cudaMemcpyHostToDevice);
    cudaMemcpy(deviceInput, bufInput, imageSize * sizeof(unsigned char), cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(BLOCK_SIZE, BLOCK_SIZE);
    // Round up so image sides that are not multiples of BLOCK_SIZE are still fully covered
    dim3 numBlocks((imageWidth + BLOCK_SIZE - 1) / BLOCK_SIZE,
                   (imageHeight + BLOCK_SIZE - 1) / BLOCK_SIZE);
    convolution<<<numBlocks, threadsPerBlock>>>(deviceFilter, new_filterWidth, imageHeight,
                                                imageWidth, deviceInput, deviceOutput);
    cudaMemcpy(bufOutput, deviceOutput, imageSize * sizeof(unsigned short), cudaMemcpyDeviceToHost);
    for (int i = 0; i < imageSize; i++) outputImage[i] = (float)bufOutput[i];

    cudaFree(deviceFilter);
    cudaFree(deviceInput);
    cudaFree(deviceOutput);
    free(charFilter);
    free(bufInput);
    free(bufOutput);
}
```
<br>
<br>
**(2) plot a chart to show the performance difference between using OpenCL and CUDA**

**(3) explain the result**
With the same algorithm and optimizations,
CUDA is slightly faster than OpenCL in all three filter cases shown above.
I believe this is because OpenCL is a cross-platform, cross-vendor programming framework,
while CUDA targets NVIDIA GPUs exclusively,
so OpenCL is harder to tune specifically for NVIDIA devices;
on an NVIDIA GPU, CUDA therefore tends to outperform OpenCL.
The paper [A Performance Comparison of CUDA and OpenCL](https://arxiv.org/ftp/arxiv/papers/1005/1005.2581.pdf)
likewise reports CUDA ahead of OpenCL in both data transfer and computation,
which supports our experimental result.
___