# Parallel Programming HW6
#### **`311551185 劉力勳`**
### <font color="#EA0000"> Q1: Explain your implementation. How do you optimize the performance of convolution?</font>
First, create the command queue, the kernel, and the device memory buffers that the OpenCL program needs:
```
cl_command_queue q = clCreateCommandQueue(*context, *device, 0, NULL);
cl_kernel kernel = clCreateKernel(*program, "convolution", NULL);
cl_mem filter_buffer = clCreateBuffer(*context, CL_MEM_READ_ONLY, filterSize, NULL, NULL);
cl_mem input_buffer = clCreateBuffer(*context, CL_MEM_READ_ONLY, mem_size, NULL, NULL);
cl_mem output_buffer = clCreateBuffer(*context, CL_MEM_WRITE_ONLY, mem_size, NULL, NULL);
```
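The snippet above uses `filterSize` and `mem_size` without showing where they come from. A minimal sketch of how they could be computed, assuming float pixels stored in row-major order and a square `filterWidth` x `filterWidth` filter; `filter_bytes` and `image_bytes` are hypothetical helper names:

```c
#include <stddef.h>

/* Hypothetical helper: bytes needed for a square filterWidth x filterWidth
 * filter of floats (the value used as filterSize). */
static size_t filter_bytes(int filterWidth)
{
    return (size_t)filterWidth * (size_t)filterWidth * sizeof(float);
}

/* Hypothetical helper: bytes needed for a row-major float image
 * (the value used as mem_size for both input and output buffers). */
static size_t image_bytes(int imageWidth, int imageHeight)
{
    return (size_t)imageWidth * (size_t)imageHeight * sizeof(float);
}
```

Casting to `size_t` before multiplying avoids `int` overflow for large images.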
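Next, copy the filter and the input image from host memory into the read-only device buffers. Both writes are blocking (`CL_TRUE`), so the calls return only after the data has been copied: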
```
clEnqueueWriteBuffer(q, filter_buffer, CL_TRUE, 0, filterSize, (void *)filter, 0, NULL, NULL);
clEnqueueWriteBuffer(q, input_buffer, CL_TRUE, 0, mem_size, (void *)inputImage, 0, NULL, NULL);
```
Then, set the kernel arguments. The argument indices must match the parameter order of the `convolution` kernel:
```
clSetKernelArg(kernel, 0, sizeof(cl_int), (void *)&filterWidth);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&filter_buffer);
clSetKernelArg(kernel, 2, sizeof(cl_int), (void *)&imageHeight);
clSetKernelArg(kernel, 3, sizeof(cl_int), (void *)&imageWidth);
clSetKernelArg(kernel, 4, sizeof(cl_mem), (void *)&input_buffer);
clSetKernelArg(kernel, 5, sizeof(cl_mem), (void *)&output_buffer);
```
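Because the argument indices above must line up with the kernel's parameter list, it may help to see what a single work item computes. The sketch below is plain C, not the actual OpenCL kernel, with parameters in the same order as indices 0 to 5. `convolution_item` is a hypothetical name, `x` and `y` stand in for `get_global_id(0)` and `get_global_id(1)`, and zero padding outside the image is an assumption:

```c
/* Sketch of the work done by one work item, written as host C.
 * Parameter order mirrors clSetKernelArg indices 0..5 after (x, y).
 * Assumes row-major float images and zero padding at the borders. */
static void convolution_item(int x, int y,
                             int filterWidth, const float *filter,
                             int imageHeight, int imageWidth,
                             const float *input, float *output)
{
    int half = filterWidth / 2;
    float sum = 0.0f;
    for (int k = -half; k <= half; k++) {          /* filter row offset */
        for (int l = -half; l <= half; l++) {      /* filter column offset */
            int iy = y + k;
            int ix = x + l;
            /* skip neighbours outside the image (implicit zero padding) */
            if (iy >= 0 && iy < imageHeight && ix >= 0 && ix < imageWidth)
                sum += input[iy * imageWidth + ix] *
                       filter[(k + half) * filterWidth + (l + half)];
        }
    }
    output[y * imageWidth + x] = sum;              /* one output pixel */
}
```

In an OpenCL kernel the pointers would carry `__global` qualifiers and there would be no loop over all pixels, since the NDRange supplies one `(x, y)` pair per work item.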
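Next, specify the global and local work sizes and launch the kernel. The kernel code in kernel.cl is essentially a slightly modified version of the serialConv.c provided by the teaching assistant: the loops over pixels are replaced by each work item's global ID, and the rest of the code is reused as is. Setting the global work size to imageWidth x imageHeight assigns one work item to each pixel, so all pixels are processed in parallel. The local work size of 25 x 25 groups these work items into work-groups of 625 items, which requires imageWidth and imageHeight to be multiples of 25 (in OpenCL 1.x) and the device to support work-groups of that size.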
```
size_t global_size[2] = {imageWidth, imageHeight};
size_t local_size[2] = {25, 25};
clEnqueueNDRangeKernel(q, kernel, 2, NULL, global_size, local_size, 0, NULL, NULL); // NULL: no global work offset
```
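Since `clEnqueueNDRangeKernel` rejects a launch whose global size is not a multiple of the local size (OpenCL 1.x) or whose work-group exceeds the device limit, a quick host-side check can catch a bad 25 x 25 configuration early. A sketch with a hypothetical helper `work_group_count`; the limit is assumed to come from querying `CL_DEVICE_MAX_WORK_GROUP_SIZE` with `clGetDeviceInfo`:

```c
#include <stddef.h>

/* Hypothetical helper: returns the number of work-groups a 2-D launch would
 * use, or 0 if OpenCL 1.x would reject it with CL_INVALID_WORK_GROUP_SIZE. */
static size_t work_group_count(const size_t global[2], const size_t local[2],
                               size_t max_wg_size)
{
    if (local[0] == 0 || local[1] == 0)
        return 0;
    /* each global size must be a multiple of the matching local size */
    if (global[0] % local[0] != 0 || global[1] % local[1] != 0)
        return 0;
    /* total work items per group must fit the device limit */
    if (local[0] * local[1] > max_wg_size)
        return 0;
    return (global[0] / local[0]) * (global[1] / local[1]);
}
```

For a hypothetical 600 x 400 image and a limit of 1024, this gives 24 x 16 = 384 work-groups; with a limit of 256, the 625-item groups would be rejected.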
Finally, read the output buffer from the device back into host memory. The read is blocking (`CL_TRUE`), so `outputImage` holds the result when the call returns:
```
clEnqueueReadBuffer(q, output_buffer, CL_TRUE, 0, mem_size, (void *)outputImage, 0, NULL, NULL);
```