# Parallel Programming: Assignment 6
## Q1 (5 points): Explain your implementation. How do you optimize the performance of convolution?
### The parallelization idea is to treat each output pixel as one work item and process every work item independently. The implementation follows the standard OpenCL flow: create a command queue > create buffers > create the kernel > set the kernel arguments > enqueue the kernel. These steps are carried out concretely in the hostFE() function.
### 1. Create the input image buffer with the CL_MEM_USE_HOST_PTR flag so the existing host array is used directly, passing the data to the device more efficiently.
### 2. Create the filter buffer with CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, which copies the filter data from the host to the device at creation time and marks it read-only, speeding up access.
### 3. In the kernel code, keep filterWidth, imageHeight, and imageWidth in local (private) variables rather than reading them from global memory, reducing the cost of repeated global accesses (see the references below).
#### References
https://www.cnblogs.com/zenny-chen/p/3640870.html, http://www.kimicat.com/opencl-1/opencl-jiao-xue-yi
### The filter is stored in __constant memory to speed up access, and computations use private (local) variables instead of global ones to improve performance.
```
void hostFE(int filterWidth, float *filter, int imageHeight, int imageWidth,
            float *inputImage, float *outputImage, cl_device_id *device,
            cl_context *context, cl_program *program)
{
    int filterSize = filterWidth * filterWidth;
    int imageSize = imageHeight * imageWidth;

    cl_command_queue commandQueue = clCreateCommandQueue(*context, *device, 0, NULL);

    // Input image: reuse the host array directly (CL_MEM_USE_HOST_PTR) to avoid an extra copy.
    cl_mem inputImgMem = clCreateBuffer(*context, CL_MEM_USE_HOST_PTR,
                                        imageSize * sizeof(float), inputImage, NULL);
    // Filter: read-only on the device, copied from the host at creation time.
    cl_mem filterMem = clCreateBuffer(*context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                      filterSize * sizeof(float), filter, NULL);
    // Output image: written by the kernel, read back afterwards.
    cl_mem outputImgMem = clCreateBuffer(*context, CL_MEM_WRITE_ONLY,
                                         imageSize * sizeof(float), NULL, NULL);

    cl_kernel kernel = clCreateKernel(*program, "convolution", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputImgMem);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&outputImgMem);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&filterMem);
    clSetKernelArg(kernel, 3, sizeof(cl_int), (void *)&imageHeight);
    clSetKernelArg(kernel, 4, sizeof(cl_int), (void *)&imageWidth);
    clSetKernelArg(kernel, 5, sizeof(cl_int), (void *)&filterWidth);

    // One work item per pixel; 1D NDRange with work groups of 64
    // (imageSize is assumed to be divisible by 64).
    size_t global_work_size = imageSize;
    size_t local_work_size = 64;
    clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL,
                           &global_work_size, &local_work_size, 0, NULL, NULL);

    // Blocking read: copy the result back to the host before returning.
    clEnqueueReadBuffer(commandQueue, outputImgMem, CL_TRUE, 0,
                        imageSize * sizeof(float), outputImage, 0, NULL, NULL);

    // Release the OpenCL objects created in this call.
    clReleaseKernel(kernel);
    clReleaseMemObject(inputImgMem);
    clReleaseMemObject(filterMem);
    clReleaseMemObject(outputImgMem);
    clReleaseCommandQueue(commandQueue);
}
```
```
__kernel void convolution(const __global float *inputImage, __global float *outputImage,
                          __constant float *filter, const int imageHeight,
                          const int imageWidth, const int filterWidth)
{
    // Map the 1D global id back to a pixel coordinate: one work item per output pixel.
    int idx = get_global_id(0);
    int row = idx / imageWidth;
    int col = idx % imageWidth;
    int halfFilterSize = filterWidth / 2;
    float sum = 0.0f;

    for (int k = -halfFilterSize; k <= halfFilterSize; k++) {
        for (int l = -halfFilterSize; l <= halfFilterSize; l++) {
            int inputRow = row + k;
            int inputCol = col + l;
            int filterIndex = (k + halfFilterSize) * filterWidth + (l + halfFilterSize);
            // Skip zero filter weights and out-of-bounds neighbours (zero padding at the border).
            if (filter[filterIndex] != 0 && inputRow >= 0 && inputRow < imageHeight &&
                inputCol >= 0 && inputCol < imageWidth) {
                sum += inputImage[inputRow * imageWidth + inputCol] * filter[filterIndex];
            }
        }
    }
    outputImage[row * imageWidth + col] = sum;
}
```
## [Bonus] Q2 (10 points): Rewrite the program using CUDA. (1) Explain your CUDA implementation, (2) plot a chart to show the performance difference between using OpenCL and CUDA, and (3) explain the result.
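### For part (1), a minimal CUDA sketch of the same convolution is given below, assuming the same image and filter layout as the OpenCL version; the names hostFE_cuda and convKernel and the explicit cudaMalloc/cudaMemcpy staging are illustrative choices, not part of the provided starter code. The decomposition mirrors the OpenCL version (one thread per output pixel, 64 threads per block), so any measured difference for parts (2) and (3) would largely come from runtime and data-transfer overhead rather than from the algorithm itself.
```
#include <cuda_runtime.h>

__global__ void convKernel(const float *inputImage, float *outputImage, const float *filter,
                           int imageHeight, int imageWidth, int filterWidth)
{
    // One thread per output pixel, same mapping as the OpenCL kernel.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= imageHeight * imageWidth) return;

    int row = idx / imageWidth;
    int col = idx % imageWidth;
    int half = filterWidth / 2;
    float sum = 0.0f;

    for (int k = -half; k <= half; k++) {
        for (int l = -half; l <= half; l++) {
            int r = row + k, c = col + l;
            int fIdx = (k + half) * filterWidth + (l + half);
            // Skip zero weights and out-of-bounds neighbours (zero padding at the border).
            if (filter[fIdx] != 0 && r >= 0 && r < imageHeight && c >= 0 && c < imageWidth)
                sum += inputImage[r * imageWidth + c] * filter[fIdx];
        }
    }
    outputImage[idx] = sum;
}

void hostFE_cuda(int filterWidth, float *filter, int imageHeight, int imageWidth,
                 float *inputImage, float *outputImage)
{
    int imageSize = imageHeight * imageWidth;
    int filterSize = filterWidth * filterWidth;
    float *dIn, *dOut, *dFilter;

    // Allocate device memory and copy the inputs explicitly
    // (unlike CL_MEM_USE_HOST_PTR, CUDA here stages the data with cudaMemcpy;
    //  the filter could also be placed in __constant__ memory for parity with OpenCL).
    cudaMalloc(&dIn, imageSize * sizeof(float));
    cudaMalloc(&dOut, imageSize * sizeof(float));
    cudaMalloc(&dFilter, filterSize * sizeof(float));
    cudaMemcpy(dIn, inputImage, imageSize * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dFilter, filter, filterSize * sizeof(float), cudaMemcpyHostToDevice);

    // Same decomposition as the OpenCL version: one thread per pixel, 64 threads per block.
    int threads = 64;
    int blocks = (imageSize + threads - 1) / threads;
    convKernel<<<blocks, threads>>>(dIn, dOut, dFilter, imageHeight, imageWidth, filterWidth);

    // The device-to-host copy synchronizes with the kernel before returning.
    cudaMemcpy(outputImage, dOut, imageSize * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dFilter);
}
```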