# [NYCU PP-f23] Assignment VI: OpenCL Programming

`311551174 李元亨`

## Q1: Explain your implementation. How do you optimize the performance of convolution?

There are a few things we did in `hostFE` to make the OpenCL computation work:

### OpenCL Initialization:

```cpp!
// Create a command queue on the given device and context
cl_command_queue queue = clCreateCommandQueue(*context, *device, 0, &status);

// Create OpenCL memory buffers
cl_mem d_filter = clCreateBuffer(*context, CL_MEM_USE_HOST_PTR,
                                 filterSize * sizeof(float), filter, &status);
cl_mem d_inputImage = clCreateBuffer(*context, CL_MEM_USE_HOST_PTR,
                                     imageHeight * imageWidth * sizeof(float), inputImage, &status);
cl_mem d_outputImage = clCreateBuffer(*context, CL_MEM_WRITE_ONLY,
                                      imageHeight * imageWidth * sizeof(float), NULL, &status);
```

`hostFE` creates a command queue on the given device and context, and allocates memory buffers for the filter, the input image, and the output image. This sets up parallel execution on compatible hardware. `CL_MEM_USE_HOST_PTR` lets the runtime use the existing host allocations as backing store for the read-only buffers, potentially avoiding an extra copy.

### Kernel Creation and Argument Setup:

```cpp!
// Create the OpenCL kernel
cl_kernel kernel = clCreateKernel(*program, "convolution", &status);

// Set the kernel arguments
status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_filter);
status |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_inputImage);
status |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_outputImage);
status |= clSetKernelArg(kernel, 3, sizeof(int), &filterWidth);
status |= clSetKernelArg(kernel, 4, sizeof(int), &imageWidth);
status |= clSetKernelArg(kernel, 5, sizeof(int), &imageHeight);
```

It creates an OpenCL kernel named `convolution` with `clCreateKernel`, then binds the kernel arguments (`clSetKernelArg`) for the filter, input image, output image, filter width, image width, and image height.

### Enqueuing Kernel Execution:

```cpp!
// Execute the OpenCL kernel
size_t globalWorkSize = imageHeight * imageWidth;
size_t localWorkSize = 100;
status = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalWorkSize,
                                &localWorkSize, 0, NULL, NULL);

// Copy the results back from the OpenCL buffer (CL_TRUE = blocking read)
status = clEnqueueReadBuffer(queue, d_outputImage, CL_TRUE, 0,
                             imageHeight * imageWidth * sizeof(float),
                             outputImage, 0, NULL, NULL);
```

The code enqueues the kernel (`clEnqueueNDRangeKernel`) to perform the convolution in parallel, defining global and local work sizes to distribute the computation across the OpenCL device. We set the local work size to 100 because it evenly divides the global work size for the (400*600) input images. `clEnqueueReadBuffer` then transfers the computed results from the OpenCL buffer back to the `outputImage` array; since it is a blocking read, the host waits until the kernel has finished and the data is available.

As for `kernel.cl`, which is in charge of computing the convolution, several optimizations have been applied compared to the serial version.

### Passing Data:

```cpp!
__kernel void convolution(
    __constant float* filter,
    // __constant char* checkFilterRow,
    const __global float* input,
    __global float* output,
    const int filterWidth,
    const int imageWidth,
    const int imageHeight
)
...
```

In OpenCL, the `__constant` qualifier places data in the constant address space, which is read-only during kernel execution and can be cached by the device. It is intended for data that does not change while the kernel runs, such as `filter`.

### Parallel Execution:

```cpp!
int idx = get_global_id(0);
int x = idx % imageWidth;
int y = idx / imageWidth;
```

OpenCL processes pixels concurrently on the GPU, in contrast to the sequential nature of the serial version. The `convolution` kernel is executed in parallel for each pixel of the output image, using `get_global_id(0)` to determine the pixel index.

### Elimination of Zero-weight Calculations:

```cpp!
// Inside the filter loops: filterIdx indexes the current filter tap
if (filter[filterIdx] == 0.0f) {
    continue;
}
```

The kernel skips any tap whose filter weight is zero, avoiding unnecessary multiplications and additions.

Combining all the optimizations above, the improvement we achieved is shown below.

| Filter width: 3 | Time (ms) | Speed Up |
| --------------- | --------- | -------- |
| conv serial     | 4.826     | 1x       |
| conv opencl     | 0.426     | 11.35x   |

| Filter width: 5 | Time (ms) | Speed Up |
| --------------- | --------- | -------- |
| conv serial     | 13.124    | 1x       |
| conv opencl     | 0.482     | 27.23x   |

| Filter width: 7 | Time (ms) | Speed Up |
| --------------- | --------- | -------- |
| conv serial     | 26.513    | 1x       |
| conv opencl     | 0.494     | 53.64x   |

## Q2 - (1) Explain your CUDA implementation

Since the CUDA implementation follows most of the optimizations we used in OpenCL, we will focus on the differences between the two.

### Thread Block:

```cpp!
// Set up the kernel launch configuration
dim3 dimBlock(16, 16);
// Round the grid size up so that images whose sides are not multiples
// of 16 (e.g., a 600-pixel width) are fully covered
dim3 numBlocks((imageWidth + dimBlock.x - 1) / dimBlock.x,
               (imageHeight + dimBlock.y - 1) / dimBlock.y);
```

After trying several different thread-block settings, we settled on a (16*16) thread block for the best performance.

### Thread Indexing and Pixel Assignment:

```cpp!
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
// Threads from the rounded-up grid may fall outside the image
if (x >= imageWidth || y >= imageHeight)
    return;
```

Inside the `convolution` kernel, each thread computes the output pixel value for its corresponding position in the output image. The thread indices (`x` and `y`) are derived from `blockIdx`, `blockDim`, and `threadIdx`.
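Each CUDA thread (and each OpenCL work-item) performs the same per-pixel accumulation that the serial baseline runs inside nested loops. As a reference point for the reported speedups, here is a plain-C sketch of that baseline computation (names are illustrative; the filter is centered on each pixel, out-of-range neighbors are treated as zero, and the same zero-weight skip from `kernel.cl` is applied):

```c
/* Sketch of the serial convolution baseline (illustrative names). */
void serialConv(int filterWidth, const float *filter,
                int imageHeight, int imageWidth,
                const float *inputImage, float *outputImage)
{
    int half = filterWidth / 2;
    for (int y = 0; y < imageHeight; y++) {
        for (int x = 0; x < imageWidth; x++) {
            float sum = 0.0f;
            for (int fy = -half; fy <= half; fy++) {
                for (int fx = -half; fx <= half; fx++) {
                    float w = filter[(fy + half) * filterWidth + (fx + half)];
                    if (w == 0.0f)
                        continue; /* the zero-weight skip */
                    int iy = y + fy, ix = x + fx;
                    /* neighbors outside the image contribute zero */
                    if (iy >= 0 && iy < imageHeight && ix >= 0 && ix < imageWidth)
                        sum += inputImage[iy * imageWidth + ix] * w;
                }
            }
            outputImage[y * imageWidth + x] = sum;
        }
    }
}
```

On the GPU, the two outer loops disappear: each work-item or thread handles one (x, y) position and runs only the inner filter loops.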
## Q2 - (2) Plot a chart to show the performance difference between using OpenCL and CUDA

![Speed up compared to serial](https://hackmd.io/_uploads/HJ6Pn0z_6.png)

| Filter width: 3 | Time (ms) | Speed Up |
| --------------- | --------- | -------- |
| conv opencl     | 0.426     | 11.35x   |
| conv CUDA       | 0.538     | 8.97x    |

| Filter width: 5 | Time (ms) | Speed Up |
| --------------- | --------- | -------- |
| conv opencl     | 0.482     | 27.23x   |
| conv CUDA       | 0.62      | 21.17x   |

| Filter width: 7 | Time (ms) | Speed Up |
| --------------- | --------- | -------- |
| conv opencl     | 0.494     | 53.64x   |
| conv CUDA       | 0.751     | 35.30x   |

## Q2 - (3) Explain the result

The result shows that the CUDA version runs slower than the OpenCL version, which differs from our expectations. We suspect the gap comes from differences in thread management (grid/block configuration versus work-group scheduling) and in how data is transferred between the host and the device.