# Programming Assignment VI: OpenCL Programming
## Q1: Explain your implementation. How do you optimize the performance of convolution?
++***Ans:***++
* `hostFE.c`
The preliminary setup (`device`, `context`, `program`) is already done by the `initCL` function in `helper.c`.
```cpp=
#include <stdio.h>
#include <stdlib.h>
#include "hostFE.h"
#include "helper.h"
void hostFE(int filterWidth, float *filter, int imageHeight, int imageWidth,
            float *inputImage, float *outputImage, cl_device_id *device,
            cl_context *context, cl_program *program)
{
    cl_int status;
    int filterSize = filterWidth * filterWidth * sizeof(float); // size in bytes
    int imgSize = imageHeight * imageWidth * sizeof(float);     // size in bytes
    size_t localws[2] = {8, 8};                     // assumes image dimensions are multiples of 8
    size_t globalws[2] = {imageWidth, imageHeight}; // width first, one work-item per pixel
    cl_command_queue queue;
    queue = clCreateCommandQueue(*context, *device, 0, NULL);
    cl_mem d_filterw = clCreateBuffer(*context, CL_MEM_USE_HOST_PTR, sizeof(int), &filterWidth, NULL);
    cl_mem d_filter = clCreateBuffer(*context, CL_MEM_USE_HOST_PTR, filterSize, filter, NULL);
    cl_mem d_in = clCreateBuffer(*context, CL_MEM_USE_HOST_PTR, imgSize, inputImage, NULL);
    cl_mem d_out = clCreateBuffer(*context, CL_MEM_WRITE_ONLY, imgSize, NULL, NULL);
    cl_kernel kernel = clCreateKernel(*program, "convolution", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*) &d_filterw);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*) &d_filter);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*) &d_in);
    clSetKernelArg(kernel, 3, sizeof(cl_mem), (void*) &d_out);
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalws, localws, 0, NULL, NULL);
    // Blocking read: returns only after the kernel has finished and the result is copied back
    clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, imgSize, (void*)outputImage, 0, NULL, NULL);
    // Release the OpenCL objects created in this call
    clReleaseKernel(kernel);
    clReleaseMemObject(d_filterw);
    clReleaseMemObject(d_filter);
    clReleaseMemObject(d_in);
    clReleaseMemObject(d_out);
    clReleaseCommandQueue(queue);
}
```
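Some notes on the code above:
> * `size_t globalws[2] = {imageWidth, imageHeight}`:
> The image is stored in **row-major order** (row by row, with the pixels of each row contiguous along the width). To take advantage of coalesced memory accesses, `imageWidth` is used as the first dimension so that adjacent work-items access consecutive addresses. The accesses can then be merged, which helps performance.
> * `clCreateBuffer(..., CL_MEM_USE_HOST_PTR, ...)` for `d_filterw`, `d_filter`, and `d_in`:
> Setting the `cl_mem_flags` argument to **CL_MEM_USE_HOST_PTR** makes OpenCL use the memory that `host_ptr` points to as the storage for the buffer. This is more efficient than creating a read-only buffer and then writing the data into it.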
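
To make the coalescing argument concrete, the following minimal C sketch computes how far apart in memory the pixels read by two neighbouring work-items are. The helper names `pixel_offset` and `neighbour_stride` are hypothetical and not part of the assignment code, and the sketch models only the index arithmetic, not the hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Linear offset of pixel (row, col) in a row-major image of the given width. */
static size_t pixel_offset(size_t row, size_t col, size_t image_width)
{
    assert(col < image_width);
    return row * image_width + col;
}

/* Distance in elements between the pixels read by work-items g and g+1
 * along dimension 0. If width_first is non-zero, dimension 0 is mapped to
 * columns (as in hostFE); otherwise it is mapped to rows. */
static size_t neighbour_stride(int width_first, size_t image_width)
{
    assert(image_width >= 2);
    if (width_first)
        return pixel_offset(0, 1, image_width) - pixel_offset(0, 0, image_width);
    return pixel_offset(1, 0, image_width) - pixel_offset(0, 0, image_width);
}
```

For an illustrative width of 600 (an assumed value), `neighbour_stride(1, 600)` is 1 element, while `neighbour_stride(0, 600)` is 600 elements. Only the width-first mapping makes adjacent work-items touch adjacent addresses.
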
* `kernel.cl`
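The kernel is modeled on the `serialConv` function in `serialConv.c`, with the following changes:
* Use `get_global_size` to obtain the image dimensions
* Use `get_global_id` to obtain the position of the current pixel
* Place arguments in `__constant` memory wherever possible to speed up reads. The exceptions are `inputImage`, which exceeds the **constant buffer size**, and `outputImage`, which must be writable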
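
The constant-memory decision can be checked with simple arithmetic. The sketch below is an illustration under stated assumptions: the 64 KB limit is the minimum `CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE` that the OpenCL full profile requires (an actual device may report more, and the real value should be queried with `clGetDeviceInfo`), and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed limit: 64 KB is the minimum constant buffer size required by the
 * OpenCL full profile; the real limit depends on the device. */
#define ASSUMED_CONSTANT_BUFFER_BYTES ((size_t)64 * 1024)

/* Size in bytes of a rows x cols array of float. */
static size_t float_buffer_bytes(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    return rows * cols * sizeof(float);
}

/* Returns 1 if a buffer of `bytes` bytes fits within `limit` bytes. */
static int fits_in_constant(size_t bytes, size_t limit)
{
    return bytes <= limit;
}
```

Under these assumptions a 7x7 filter (49 floats) fits easily, whereas an image of, say, 600x400 floats (an assumed size) is far larger than 64 KB and has to stay in `__global` memory.
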
```cpp=
__kernel void convolution(__constant int *filterWidth,
                          __constant float *filter,
                          __global float *inputImage,
                          __global float *outputImage)
{
    int halffilterSize = *filterWidth / 2;
    // The global work size equals the image size (set in hostFE)
    int imageHeight = get_global_size(1);
    int imageWidth = get_global_size(0);
    // One work-item per output pixel: i is the row, j is the column
    int i = get_global_id(1);
    int j = get_global_id(0);
    int k, l;
    float sum = 0;
    // Apply the filter to the neighborhood
    for (k = -halffilterSize; k <= halffilterSize; k++)
    {
        for (l = -halffilterSize; l <= halffilterSize; l++)
        {
            // Skip neighbors that fall outside the image
            if (i + k >= 0 && i + k < imageHeight &&
                j + l >= 0 && j + l < imageWidth)
            {
                sum += inputImage[(i + k) * imageWidth + j + l] *
                       filter[(k + halffilterSize) * (*filterWidth) +
                              l + halffilterSize];
            }
        }
    }
    outputImage[imageWidth * i + j] = sum;
}
```
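
One way to sanity-check the kernel is to compare its output against a host-side loop that performs the same computation. The sketch below is a hypothetical reference written to mirror the kernel above; it is not the course's `serialConv.c`, and the names `reference_conv` and `outputs_match` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference: same neighborhood loop as the kernel,
 * with an explicit loop over every pixel instead of one work-item each. */
static void reference_conv(int filter_width, const float *filter,
                           int image_height, int image_width,
                           const float *input, float *output)
{
    assert(filter_width > 0 && filter_width % 2 == 1);
    int half = filter_width / 2;
    for (int i = 0; i < image_height; i++) {
        for (int j = 0; j < image_width; j++) {
            float sum = 0.0f;
            for (int k = -half; k <= half; k++) {
                for (int l = -half; l <= half; l++) {
                    int row = i + k, col = j + l;
                    /* Neighbors outside the image contribute nothing. */
                    if (row >= 0 && row < image_height &&
                        col >= 0 && col < image_width) {
                        sum += input[row * image_width + col] *
                               filter[(k + half) * filter_width + (l + half)];
                    }
                }
            }
            output[i * image_width + j] = sum;
        }
    }
}

/* Returns 1 if every pair of elements differs by at most tol, else 0. */
static int outputs_match(const float *a, const float *b, size_t n, float tol)
{
    for (size_t idx = 0; idx < n; idx++) {
        float d = a[idx] - b[idx];
        if (d < 0.0f)
            d = -d;
        if (d > tol)
            return 0;
    }
    return 1;
}
```

A typical use would be to run `reference_conv` on the same `filter` and `inputImage` that were passed to `hostFE`, then call `outputs_match` on the two result arrays with a small tolerance, since the device and the host may round floating-point sums slightly differently.
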