Parallel Programming Assignment VI
===
## Q1: Explain your implementation. How do you optimize the performance of convolution?
**Each work-item computes the convolution of one pixel**, so the kernel is launched with a 2D global work size of `imageWidth * imageHeight` work-items. Inside the kernel (below), the local variables `imageLTIdx`, `imageIdx`, `filterLTIdx`, and `filterIdx` are updated incrementally to walk through the image and filter arrays, instead of recomputing each index with multiplications on every iteration.
```c
// Each work-item computes the convolution for one output pixel.
__kernel void convolution(__global float *outputImage, __global float *inputImage,
                          __constant float *filter, int imageWidth,
                          int imageHeight, int filterWidth)
{
    int halfFilterWidth = filterWidth / 2;
    float sum = 0;
    int x = get_global_id(0); // pixel column
    int y = get_global_id(1); // pixel row

    // Indices of the top-left corner of the filter window in the image and
    // of the filter's first element; both are advanced incrementally below
    // instead of being recomputed with multiplications every iteration.
    int imageLTIdx = (y - halfFilterWidth) * imageWidth + x - halfFilterWidth;
    int filterLTIdx = 0;

    for (int r = -halfFilterWidth; r <= halfFilterWidth; r++)
    {
        if (y + r >= 0 && y + r < imageHeight) // skip rows outside the image
        {
            int imageIdx = imageLTIdx;
            int filterIdx = filterLTIdx;
            for (int c = -halfFilterWidth; c <= halfFilterWidth; c++, imageIdx++, filterIdx++)
            {
                if (x + c >= 0 && x + c < imageWidth) // skip columns outside the image
                {
                    float factor = filter[filterIdx];
                    if (factor != 0.0f) // skip multiply-adds for zero filter taps
                    {
                        sum += (inputImage[imageIdx] * factor);
                    }
                }
            }
        }
        // Advance both indices by one row without recomputing them.
        imageLTIdx += imageWidth;
        filterLTIdx += filterWidth;
    }
    outputImage[y * imageWidth + x] = sum;
}
```
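For the launch itself, a minimal host-side sketch is shown below. The names `queue` and `kernel` are assumptions for objects created elsewhere in the host program, and error handling is omitted:
```c
#include <CL/cl.h>

/* Hypothetical host-side launch (a sketch, not the original host code):
 * a 2D NDRange with one work-item per pixel, so the total global work
 * size is imageWidth * imageHeight. */
size_t globalWorkSize[2] = { imageWidth, imageHeight };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalWorkSize,
                       NULL, /* let the runtime choose the local size */
                       0, NULL, NULL);
```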
## Q2: Rewrite the program using CUDA.
### (1) Explain your CUDA implementation
The CUDA version follows the same structure as the OpenCL one (the data flow is also the same: copy data in -> compute -> copy data out), and **each thread computes the convolution of one pixel**. The grid contains as many blocks as the image height, and each block contains as many threads as the image width.
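A minimal sketch of what such a CUDA kernel and launch could look like under the layout described above (one block per image row, one thread per pixel) is shown below. The identifier names are illustrative assumptions, and this layout assumes `imageWidth` does not exceed the 1024-threads-per-block limit:
```c
// Hypothetical sketch of the CUDA version, mirroring the OpenCL kernel.
__global__ void convolution(float *outputImage, const float *inputImage,
                            const float *filter, int imageWidth,
                            int imageHeight, int filterWidth)
{
    int x = threadIdx.x; // pixel column: one thread per pixel in the row
    int y = blockIdx.x;  // pixel row: one block per image row
    int halfFilterWidth = filterWidth / 2;
    float sum = 0.0f;

    for (int r = -halfFilterWidth; r <= halfFilterWidth; r++)
        for (int c = -halfFilterWidth; c <= halfFilterWidth; c++)
            if (y + r >= 0 && y + r < imageHeight &&
                x + c >= 0 && x + c < imageWidth)
                sum += inputImage[(y + r) * imageWidth + (x + c)] *
                       filter[(r + halfFilterWidth) * filterWidth +
                              (c + halfFilterWidth)];

    outputImage[y * imageWidth + x] = sum;
}

// Host-side launch (with cudaMemcpy before and after, following the
// copy-in -> compute -> copy-out flow described above):
// convolution<<<imageHeight, imageWidth>>>(dOut, dIn, dFilter,
//                                          imageWidth, imageHeight, filterWidth);
```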
### (2) plot a chart to show the performance difference between using OpenCL and CUDA

### (3) explain the result.
As the chart above shows, **CUDA is slightly slower than OpenCL**, and the gap widens as the kernel (filter) size increases.