# Programming Assignment VI: OpenCL Programming
###### tags: `NYCU-PP-s23`
## Q1
:::info
> You will modify only ```hostFE.c``` and ```kernel.cl```.
:question: **Q1 (5 points)**: Explain your implementation. How do you optimize the performance of convolution?
:::
:a:
When creating the buffers for the input image and the filter with ```clCreateBuffer```, the ```CL_MEM_USE_HOST_PTR``` flag is used and the corresponding host pointers are passed in, so the device can access the host data directly without an explicit copy.
```cpp
cl_mem inBuffer = clCreateBuffer(*context, CL_MEM_USE_HOST_PTR, srcImgSize, inputImage, NULL);
cl_mem filterBuffer = clCreateBuffer(*context, CL_MEM_USE_HOST_PTR, filterSize, filter, NULL);
```
- Bit-shift operations replace multiplication and division by powers of two.
- The global work size is set to ```{(imageWidth × imageHeight) >> 2, 1}```, so each work-item performs the convolution for four consecutive output positions.
- Since each work-item computes four outputs, a ```float4``` is used to accumulate the sums, and the multiply-accumulate steps of the convolution become vector operations.
- In the outer loop, the y position is bounds-checked before entering the inner loop over x. This is faster than checking both x and y inside the innermost loop.
- Because the filter index advances by exactly 1 on every inner-loop iteration, a running index variable is incremented instead of recomputing the index each time, reducing the number of additions and multiplications.
- If the filter value at a position is 0, the multiply-add for that pixel is skipped.
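The loop-structure optimizations above (hoisted row check, running filter index, zero-tap skipping) can be sketched as a scalar reference in plain C. The function name ```convolve_at``` and the sizes are illustrative, not taken from the actual kernel; note that when a whole filter row is skipped, the running index must still advance to stay aligned:

```cpp
/* Scalar sketch of the loop structure described above (illustrative names). */
float convolve_at(const float *img, int w, int h,
                  const float *filter, int hf,  /* hf = half filter width */
                  int x, int y) {
    float sum = 0.0f;
    int fidx = 0;  /* running filter index, incremented instead of recomputed */
    for (int i = -hf; i <= hf; i++) {
        int yy = y + i;
        if (yy >= 0 && yy < h) {      /* row bound checked once per filter row */
            int wy = yy * w;
            for (int j = -hf; j <= hf; j++) {
                if (filter[fidx] != 0.0f) {   /* skip zero filter taps */
                    int xx = x + j;
                    if (xx >= 0 && xx < w)
                        sum += img[wy + xx] * filter[fidx];
                }
                fidx++;
            }
        } else {
            fidx += 2 * hf + 1;  /* skip the whole filter row, keep index aligned */
        }
    }
    return sum;
}
```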
## Q2
:::info
:question: **<font color="#0f0">[Bonus]</font> Q2 (10 points)**: Rewrite the program using CUDA.
> (1) Explain your CUDA implementation,
> (2) plot a chart to show the performance difference between using OpenCL and CUDA, and
> (3) explain the result.
:::
:a:
- (1)
```cpp
__global__ void convolution(const float *srcImg, float4 *outImg, const float *filter,
                            const int w, const int h, const int hf) {
    // Each thread handles four consecutive output pixels.
    const int grid = (blockIdx.x * blockDim.x + threadIdx.x) << 2;
    const int current_x = grid % w;
    const int current_y = grid / w;
    const int fw = 2 * hf + 1;  // filter width
    float4 sum = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    int fidx = 0;  // running filter index
    for (int i = -hf; i <= hf; i++) {
        int yy = current_y + i;
        if (yy >= 0 && yy < h) {
            int wy = yy * w;
            for (int j = -hf; j <= hf; j++) {
                float f = filter[fidx];
                if (f != 0.0f) {  // skip zero filter taps
                    int xx = current_x + j;
                    if (xx >= 0 && xx < w) {
                        int position = xx + wy;
                        // The four neighboring output pixels share the same
                        // filter tap, applied to inputs shifted by one pixel.
                        sum.x += srcImg[position]     * f;
                        sum.y += srcImg[position + 1] * f;
                        sum.z += srcImg[position + 2] * f;
                        sum.w += srcImg[position + 3] * f;
                    }
                }
                fidx++;
            }
        } else {
            fidx += fw;  // keep the filter index aligned when a row is skipped
        }
    }
    outImg[grid >> 2] = sum;
}
```
- The same algorithm and optimizations as in the OpenCL implementation are used.
- For each row of the image, every 4 pixels are grouped together.
- Each thread is responsible for 4 convolution outputs.
- The ```float4``` type is again used for the calculations.
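A hypothetical host-side setup for the CUDA kernel might look as follows. Variable names such as ```d_src``` and the block size of 256 are assumptions, not taken from the actual code; the point is the explicit allocations and copies that OpenCL's ```CL_MEM_USE_HOST_PTR``` avoids:

```cpp
// Hypothetical launch code; error checking omitted for brevity.
float *d_src, *d_filter;
float4 *d_out;
cudaMalloc(&d_src, srcImgSize);
cudaMalloc(&d_filter, filterSize);
cudaMalloc(&d_out, srcImgSize);
// Extra copies with no OpenCL counterpart when CL_MEM_USE_HOST_PTR is used.
cudaMemcpy(d_src, inputImage, srcImgSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_filter, filter, filterSize, cudaMemcpyHostToDevice);

const int nThreads  = (imageWidth * imageHeight) >> 2;  // 4 pixels per thread
const int blockSize = 256;                              // assumed block size
const int gridSize  = (nThreads + blockSize - 1) / blockSize;
convolution<<<gridSize, blockSize>>>(d_src, d_out, d_filter,
                                     imageWidth, imageHeight, filterWidth / 2);
cudaMemcpy(outputImage, d_out, srcImgSize, cudaMemcpyDeviceToHost);
```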
- (2)

- (3)
> For all three filter sizes, **OpenCL** is faster than **CUDA**.
> This is likely because the **OpenCL** implementation creates its buffers with the ```CL_MEM_USE_HOST_PTR``` flag in ```clCreateBuffer```,
> which lets the device access the host's data directly.
>
> **CUDA**, on the other hand, requires allocating memory on the device and explicitly copying the host data to it.
> These extra function calls and transfers increase the total processing time.
:::warning
Answer the questions marked with **Q1** (and **Q2**) in a REPORT using HackMD. Notice that in this assignment a higher standard will be applied when grading the quality of your report.
:::
:::success
- **Note**: You cannot print any message in your program.
:::