Parallel Programming Assignment VI
===

## Q1: Explain your implementation. How do you optimize the performance of convolution?

**Each work-item (kernel instance) computes the convolution for one pixel**, so the kernel is launched over a 2D NDRange with `Global Work Size = imageWidth * imageHeight`. Inside the kernel (below), the local variables `imageLTIdx`, `imageIdx`, `filterLTIdx`, and `filterIdx` are updated incrementally to walk through the arrays, instead of recomputing every index with a multiplication on each iteration.

```c
__kernel void convolution(__global float *outputImage, __global float *inputImage,
                          __constant float *filter,
                          int imageWidth, int imageHeight, int filterWidth)
{
    int halfFilterWidth = filterWidth / 2;
    float sum = 0;
    int x = get_global_id(0);
    int y = get_global_id(1);

    // Index of the top-left corner of the filter window in the image,
    // and the matching starting index in the filter.
    int imageLTIdx = (y - halfFilterWidth) * imageWidth + x - halfFilterWidth;
    int filterLTIdx = 0;

    for (int r = -halfFilterWidth; r <= halfFilterWidth; r++)
    {
        if (y + r >= 0 && y + r < imageHeight)          // skip rows outside the image
        {
            int imageIdx = imageLTIdx;
            int filterIdx = filterLTIdx;
            for (int c = -halfFilterWidth; c <= halfFilterWidth; c++, imageIdx++, filterIdx++)
            {
                if (x + c >= 0 && x + c < imageWidth)   // skip columns outside the image
                {
                    float factor = filter[filterIdx];
                    if (factor != 0.0f)                 // skip zero filter taps
                    {
                        sum += inputImage[imageIdx] * factor;
                    }
                }
            }
        }
        // Advance both indices to the next row without any multiplication.
        imageLTIdx += imageWidth;
        filterLTIdx += filterWidth;
    }
    outputImage[y * imageWidth + x] = sum;
}
```

## Q2: Rewrite the program using CUDA.

### (1) Explain your CUDA implementation

The CUDA version follows the same structure as the OpenCL one, including the data-transfer pattern (host-to-device copy -> compute -> device-to-host copy). **Each thread computes the convolution for one pixel**: the grid contains as many blocks as the image height, and each block contains as many threads as the image width. A sketch of such a kernel and launch is given at the end of this writeup.

### (2) Plot a chart to show the performance difference between using OpenCL and CUDA

![](https://i.imgur.com/nTK5Zy8.png)

### (3) Explain the result.

As the chart above shows, **CUDA is slightly slower than OpenCL**, and the gap grows as the filter size increases. One plausible factor is the launch configuration: the block size is fixed to the image width rather than tuned to the hardware, while the OpenCL side leaves the work-group size for the runtime to choose.
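For completeness, here is a minimal sketch of the CUDA version described in Q2 (1): one thread per pixel, one block per image row, with the same copy-in -> compute -> copy-out flow. The names `convolutionKernel` and `runConvolution` are illustrative assumptions rather than the assignment's exact code, and error checking is omitted for brevity.

```cpp
// A minimal sketch, assuming illustrative names (convolutionKernel, runConvolution);
// real code would check the cudaError_t returned by each runtime call.
#include <cuda_runtime.h>

__global__ void convolutionKernel(float *outputImage, const float *inputImage,
                                  const float *filter,
                                  int imageWidth, int imageHeight, int filterWidth)
{
    int x = threadIdx.x;   // one thread per column
    int y = blockIdx.x;    // one block per row
    if (x >= imageWidth || y >= imageHeight)
        return;

    int halfFilterWidth = filterWidth / 2;
    float sum = 0.0f;
    for (int r = -halfFilterWidth; r <= halfFilterWidth; r++) {
        for (int c = -halfFilterWidth; c <= halfFilterWidth; c++) {
            int iy = y + r, ix = x + c;
            if (iy >= 0 && iy < imageHeight && ix >= 0 && ix < imageWidth) {
                float factor =
                    filter[(r + halfFilterWidth) * filterWidth + (c + halfFilterWidth)];
                if (factor != 0.0f)   // skip zero filter taps, as in the OpenCL kernel
                    sum += inputImage[iy * imageWidth + ix] * factor;
            }
        }
    }
    outputImage[y * imageWidth + x] = sum;
}

void runConvolution(float *outputImage, const float *inputImage, const float *filter,
                    int imageWidth, int imageHeight, int filterWidth)
{
    size_t imageBytes  = (size_t)imageWidth * imageHeight * sizeof(float);
    size_t filterBytes = (size_t)filterWidth * filterWidth * sizeof(float);

    float *dInput, *dOutput, *dFilter;
    cudaMalloc(&dInput, imageBytes);
    cudaMalloc(&dOutput, imageBytes);
    cudaMalloc(&dFilter, filterBytes);

    // Data transfer -> compute -> data transfer, mirroring the OpenCL flow.
    cudaMemcpy(dInput, inputImage, imageBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dFilter, filter, filterBytes, cudaMemcpyHostToDevice);

    // imageHeight blocks of imageWidth threads each
    // (assumes imageWidth does not exceed the 1024-threads-per-block limit).
    convolutionKernel<<<imageHeight, imageWidth>>>(
        dOutput, dInput, dFilter, imageWidth, imageHeight, filterWidth);

    cudaMemcpy(outputImage, dOutput, imageBytes, cudaMemcpyDeviceToHost);

    cudaFree(dInput);
    cudaFree(dOutput);
    cudaFree(dFilter);
}
```

Note that tying the grid shape to the image dimensions is simple but fragile: images wider than 1024 pixels would exceed the per-block thread limit, so a production version would typically use a 2D grid of fixed-size blocks instead.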