# Parallel Programming Assignment 6

## Q1 (5 points): Explain your implementation. How do you optimize the performance of convolution?

The idea of the parallelization is to treat each pixel as one work item and let every work item compute one output value. The implementation follows the standard OpenCL workflow: create a command queue → create buffers → create the kernel → set the kernel arguments → launch the kernel. These steps are carried out concretely in the `hostFE()` function.

The main optimizations are:

1. Create the input-image buffer with the `CL_MEM_USE_HOST_PTR` flag, so the existing host memory is used directly and the data is passed to the device more efficiently.
2. Create the filter buffer with `CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR`, so the filter is copied from the host to the device once at buffer-creation time, which speeds up later accesses.
3. In the kernel code, values derived from `filterWidth`, `imageHeight`, and `imageWidth` are kept in local (private) variables, reducing the cost of repeatedly accessing global data.

References: https://www.cnblogs.com/zenny-chen/p/3640870.html, http://www.kimicat.com/opencl-1/opencl-jiao-xue-yi

In short, the filter is stored in `__constant` memory to speed up its accesses, and repeated computations on global values are replaced with local variables to improve performance.

```
void hostFE(int filterWidth, float *filter, int imageHeight, int imageWidth,
            float *inputImage, float *outputImage, cl_device_id *device,
            cl_context *context, cl_program *program)
{
    int filterSize = filterWidth * filterWidth;
    int imageSize = imageHeight * imageWidth;

    // Standard OpenCL flow: command queue -> buffers -> kernel -> arguments -> launch.
    cl_command_queue commandQueue = clCreateCommandQueue(*context, *device, 0, NULL);

    // Input image: reuse the host pointer to avoid an extra copy.
    cl_mem inputImgMem = clCreateBuffer(*context, CL_MEM_USE_HOST_PTR,
                                        imageSize * sizeof(float), inputImage, NULL);
    // Filter: read-only, copied to the device once at creation time.
    cl_mem filterMem = clCreateBuffer(*context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                      filterSize * sizeof(float), filter, NULL);
    cl_mem outputImgMem = clCreateBuffer(*context, CL_MEM_WRITE_ONLY,
                                         imageSize * sizeof(float), NULL, NULL);

    cl_kernel kernel = clCreateKernel(*program, "convolution", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputImgMem);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&outputImgMem);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&filterMem);
    clSetKernelArg(kernel, 3, sizeof(cl_int), (void *)&imageHeight);
    clSetKernelArg(kernel, 4, sizeof(cl_int), (void *)&imageWidth);
    clSetKernelArg(kernel, 5, sizeof(cl_int), (void *)&filterWidth);

    // One work item per pixel; imageSize is assumed to be a multiple of 64.
    size_t global_work_size = imageSize;
    size_t local_work_size = 64;
    clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &global_work_size,
                           &local_work_size, 0, NULL, NULL);

    // Blocking read: returns only after the kernel has finished.
    clEnqueueReadBuffer(commandQueue, outputImgMem, CL_TRUE, 0,
                        imageSize * sizeof(float), outputImage, 0, NULL, NULL);

    // Release the OpenCL objects created in this function.
    clReleaseMemObject(inputImgMem);
    clReleaseMemObject(filterMem);
    clReleaseMemObject(outputImgMem);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(commandQueue);
}
```

```
__kernel void convolution(const __global float *inputImage, __global float *outputImage,
                          __constant float *filter, const int imageHeight,
                          const int imageWidth, const int filterWidth)
{
    // Each work item computes one output pixel.
    int idx = get_global_id(0);
    int row = idx / imageWidth;
    int col = idx % imageWidth;
    int halfFilterSize = filterWidth / 2;

    float sum = 0.0f;
    for (int k = -halfFilterSize; k <= halfFilterSize; k++)
    {
        for (int l = -halfFilterSize; l <= halfFilterSize; l++)
        {
            int inputRow = row + k;
            int inputCol = col + l;
            int filterIndex = (k + halfFilterSize) * filterWidth + (l + halfFilterSize);
            // Skip zero filter taps and out-of-bounds neighbours.
            if (filter[filterIndex] != 0 &&
                inputRow >= 0 && inputRow < imageHeight &&
                inputCol >= 0 && inputCol < imageWidth)
            {
                sum += inputImage[inputRow * imageWidth + inputCol] * filter[filterIndex];
            }
        }
    }
    outputImage[row * imageWidth + col] = sum;
}
```

## [Bonus] Q2 (10 points): Rewrite the program using CUDA. (1) Explain your CUDA implementation, (2) plot a chart to show the performance difference between using OpenCL and CUDA, and (3) explain the result.
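Below is a minimal CUDA sketch of the same convolution, intended as a starting point rather than a measured implementation. The 16×16 thread-block size, the simplified `hostFE()` signature (without the OpenCL-specific parameters), and the device-pointer names are assumptions made for illustration; the performance chart for (2) and its discussion for (3) still have to come from actual measurements.

```
#include <cuda_runtime.h>

// One thread per output pixel, mirroring the OpenCL kernel above.
__global__ void convolution(const float *inputImage, float *outputImage,
                            const float *filter, int imageHeight,
                            int imageWidth, int filterWidth)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= imageHeight || col >= imageWidth)
        return;

    int halfFilterSize = filterWidth / 2;
    float sum = 0.0f;
    for (int k = -halfFilterSize; k <= halfFilterSize; k++)
    {
        for (int l = -halfFilterSize; l <= halfFilterSize; l++)
        {
            int inputRow = row + k;
            int inputCol = col + l;
            int filterIndex = (k + halfFilterSize) * filterWidth + (l + halfFilterSize);
            // Skip zero filter taps and out-of-bounds neighbours.
            if (filter[filterIndex] != 0 &&
                inputRow >= 0 && inputRow < imageHeight &&
                inputCol >= 0 && inputCol < imageWidth)
            {
                sum += inputImage[inputRow * imageWidth + inputCol] * filter[filterIndex];
            }
        }
    }
    outputImage[row * imageWidth + col] = sum;
}

// Host wrapper (hypothetical signature): allocate device buffers,
// copy inputs, launch the kernel, and copy the result back.
void hostFE(int filterWidth, float *filter, int imageHeight, int imageWidth,
            float *inputImage, float *outputImage)
{
    int filterSize = filterWidth * filterWidth;
    int imageSize = imageHeight * imageWidth;

    float *dInput, *dOutput, *dFilter;
    cudaMalloc((void **)&dInput, imageSize * sizeof(float));
    cudaMalloc((void **)&dOutput, imageSize * sizeof(float));
    cudaMalloc((void **)&dFilter, filterSize * sizeof(float));

    cudaMemcpy(dInput, inputImage, imageSize * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dFilter, filter, filterSize * sizeof(float), cudaMemcpyHostToDevice);

    // 2-D grid covering the image; 16x16 block size is an assumed choice.
    dim3 block(16, 16);
    dim3 grid((imageWidth + block.x - 1) / block.x,
              (imageHeight + block.y - 1) / block.y);
    convolution<<<grid, block>>>(dInput, dOutput, dFilter,
                                 imageHeight, imageWidth, filterWidth);

    cudaMemcpy(outputImage, dOutput, imageSize * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dInput);
    cudaFree(dOutput);
    cudaFree(dFilter);
}
```

As in the OpenCL version, the filter could additionally be placed in `__constant__` memory (copied with `cudaMemcpyToSymbol`) so that all threads read it through the constant cache.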