# PP HW6
## Q1: Explain your implementation. How do you optimize the performance of convolution?
In OpenCL, the host-side workflow can be broken down into the following steps:
1. Discover platform
2. Discover device
3. Create context
4. Create a command queue
5. Create device buffers
6. Write host data to device buffers
7. Create a program object with source and build it
8. Create the kernel
9. Set the kernel arguments
10. Configure the work-item structure
11. Enqueue the kernel for execution
12. Read the output buffer back to the host
13. Release OpenCL resources
In main.c we can see that steps 1, 2, 3, and 7 have already been done for us, all inside helper.c, as shown below:
```c=
// in the main.c
cl_program program;
cl_device_id device;
cl_context context;
initCL(&device, &context, &program);
```
```c=
// in the helper.c
void initCL(cl_device_id *device, cl_context *context, cl_program *program)
{
    // Set up the OpenCL environment
    cl_int status;

    // Discover platform
    cl_platform_id platform;
    status = clGetPlatformIDs(1, &platform, NULL);
    CHECK(status, "clGetPlatformIDs");

    // Discover device
    status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, device, NULL);
    CHECK(status, "clGetDeviceIDs");

    // Create context
    cl_context_properties props[3] = {CL_CONTEXT_PLATFORM,
                                      (cl_context_properties)(platform), 0};
    *context = clCreateContext(props, 1, device, NULL, NULL, &status);
    CHECK(status, "clCreateContext");

    // Create a program object with source and build it
    const char *source = readSource("kernel.cl");
    *program = clCreateProgramWithSource(*context, 1, &source, NULL, &status);
    CHECK(status, "clCreateProgramWithSource");
    status = clBuildProgram(*program, 1, device, NULL, NULL, NULL);
    CHECK(status, "clBuildProgram");
    return;
}
```
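For completeness: the CHECK macro used above is defined in helper.h, which is not shown in the handout. A minimal sketch of what such a macro presumably looks like, assuming it simply reports the failing call and aborts on a non-success status:

```c
// Hypothetical sketch only -- the real CHECK lives in helper.h.
// Assumed behavior: print which call failed and exit.
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

#define CHECK(status, op)                                     \
    do {                                                      \
        if ((status) != CL_SUCCESS) {                         \
            fprintf(stderr, "%s failed with error %d\n",      \
                    (op), (int)(status));                     \
            exit(EXIT_FAILURE);                               \
        }                                                     \
    } while (0)
```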
### kernel.cl
From the readSource / clCreateProgramWithSource / clBuildProgram calls in helper.c above, we can see that the program object is already built from kernel.cl for us. The kernel itself mainly uses get_global_id to obtain each work-item's global ID so that the pixels are processed in parallel; its body is largely the same as the serial version (see the comments).
```c=
__kernel void convolution(int filterWidth, __constant float *filter,
                          int imageHeight, int imageWidth,
                          __global float *inputImage, __global float *outputImage)
{
    int halffilterSize = filterWidth / 2;
    // Thread gets its index within index space
    int i = get_global_id(1);
    int j = get_global_id(0);
    int k, l;
    float sum;

    sum = 0; // Reset sum for new source pixel
    // Apply the filter to the neighborhood
    for (k = -halffilterSize; k <= halffilterSize; k++)
    {
        for (l = -halffilterSize; l <= halffilterSize; l++)
        {
            if (i + k >= 0 && i + k < imageHeight &&
                j + l >= 0 && j + l < imageWidth)
            {
                sum += inputImage[(i + k) * imageWidth + j + l] *
                       filter[(k + halffilterSize) * filterWidth +
                              l + halffilterSize];
            }
        }
    }
    // Write the pixel to outputImage
    outputImage[i * imageWidth + j] = sum;
}
```
One thing worth noting is the parameter qualifiers: the filter is read-only memory, so we declare it __constant, while inputImage and outputImage come from the global address space and are therefore declared __global.
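__constant memory is limited in size (at least 64 KB per the OpenCL spec, device-dependent), which is plenty for a small filter. If one wanted to verify the limit, a quick sketch using the standard clGetDeviceInfo query (the function and variable names here are ours, not from the handout):

```c
#include <CL/cl.h>
#include <stdio.h>

// Query how much __constant memory the device offers; a small
// convolution filter fits comfortably within this limit.
void printConstantBufferSize(cl_device_id device)
{
    cl_ulong maxConstSize = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                    sizeof(maxConstSize), &maxConstSize, NULL);
    printf("max __constant buffer: %lu bytes\n",
           (unsigned long)maxConstSize);
}
```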
### hostFE.c
The remaining OpenCL steps are exactly what hostFE.c takes care of.
In hostFE.c we first compute the filter and image sizes, which are used later when allocating buffers:
```
cl_int status;
int filterSize = filterWidth * filterWidth;
int imageSize = imageHeight * imageWidth;
```
#### 4. Create a command queue
One device corresponds to one command queue. The context dispatches commands into the command queue associated with the device, and the device then executes the commands in that queue.
```
cl_command_queue commandQueue = clCreateCommandQueue(*context, *device, 0, &status);
```
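Since Q1 also asks about performance: passing CL_QUEUE_PROFILING_ENABLE here (instead of 0) lets us time the kernel on the device via events. A sketch of how that could look with the standard profiling API (this is not part of the submitted code):

```c
#include <CL/cl.h>
#include <stdio.h>

// Launch a 2D kernel with an event attached and report device-side time.
// Assumes the queue was created with the CL_QUEUE_PROFILING_ENABLE property.
void launchAndTime(cl_command_queue queue, cl_kernel kernel,
                   const size_t globalws[2], const size_t localws[2])
{
    cl_event ev;
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalws, localws,
                           0, NULL, &ev);
    clWaitForEvents(1, &ev);

    cl_ulong t0, t1; // timestamps in nanoseconds
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
    printf("kernel time: %.3f ms\n", (t1 - t0) * 1e-6);
    clReleaseEvent(ev);
}
```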
#### 5. Create device buffers
A buffer holds data objects, i.e. the data the device needs while executing the program. Buffers are created from the context, so multiple devices managed by the same context can share the data in a buffer.
```
cl_mem cl_filter = clCreateBuffer(*context, CL_MEM_READ_ONLY, filterSize * sizeof(float), NULL, &status);
cl_mem cl_input_image = clCreateBuffer(*context, CL_MEM_READ_ONLY, imageSize * sizeof(float), NULL, &status);
cl_mem cl_output_image = clCreateBuffer(*context, CL_MEM_WRITE_ONLY, imageSize * sizeof(float), NULL, &status);
```
#### 6. Write host data to device buffers
```
clEnqueueWriteBuffer(commandQueue, cl_filter, CL_TRUE, 0, filterSize * sizeof(float), (void *)filter, 0, NULL, NULL);
clEnqueueWriteBuffer(commandQueue, cl_input_image, CL_TRUE, 0, imageSize * sizeof(float), (void *)inputImage, 0, NULL, NULL);
```
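As an aside, steps 5 and 6 can be fused: clCreateBuffer accepts CL_MEM_COPY_HOST_PTR, which copies the host data into the buffer at creation time and saves the separate clEnqueueWriteBuffer call. A sketch of the equivalent filter buffer, using the same variables as above:

```c
// Create the buffer and copy the host-side filter into it in one call,
// replacing the clCreateBuffer + clEnqueueWriteBuffer pair above.
cl_mem cl_filter = clCreateBuffer(*context,
                                  CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  filterSize * sizeof(float),
                                  (void *)filter, &status);
```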
#### 8. Create the kernel
Create the kernel from the program; this designates the entry point into the program.
```
cl_kernel mykernel = clCreateKernel(*program, "convolution", &status);
```
#### 9. Set the kernel arguments
Note that the argument indices must match the parameter order declared in kernel.cl:
```
clSetKernelArg(mykernel, 0, sizeof(cl_int), (void *)&filterWidth);
clSetKernelArg(mykernel, 1, sizeof(cl_mem), (void *)&cl_filter);
clSetKernelArg(mykernel, 2, sizeof(cl_int), (void *)&imageHeight);
clSetKernelArg(mykernel, 3, sizeof(cl_int), (void *)&imageWidth);
clSetKernelArg(mykernel, 4, sizeof(cl_mem), (void *)&cl_input_image);
clSetKernelArg(mykernel, 5, sizeof(cl_mem), (void *)&cl_output_image);
```
#### 10. Configure the work-item structure
Set up the work-item layout. Each work-group here is 25 × 25 work-items (matching the complete listing below), and the global size covers the whole image:
```
size_t localws[2] = {25, 25};
size_t globalws[2] = {imageWidth, imageHeight};
```
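Note that in OpenCL 1.x the global size must be evenly divisible by the local size, which holds for the test images here (600 × 400 and 3000 × 3000 are both multiples of 25). For arbitrary image sizes one would round the global size up and bounds-check in the kernel; a sketch, using the same variables as above:

```c
// Round the global size up to the next multiple of the work-group size.
// The kernel must then guard with: if (i < imageHeight && j < imageWidth)
size_t localws[2] = {25, 25};
size_t globalws[2] = {
    (imageWidth  + localws[0] - 1) / localws[0] * localws[0],
    (imageHeight + localws[1] - 1) / localws[1] * localws[1],
};
```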
#### 11. Enqueue the kernel for execution
Put the kernel object, together with the work-item configuration, into the command queue for execution. The call is asynchronous; the blocking read in step 12 is what waits for the result.
```
clEnqueueNDRangeKernel(commandQueue, mykernel, 2, NULL, globalws, localws, 0, NULL, NULL);
```
#### 12. Read the output buffer back to the host
Read cl_output_image on the device back into outputImage on the host:
```
clEnqueueReadBuffer(commandQueue, cl_output_image, CL_TRUE, 0, imageSize * sizeof(float), (void *)outputImage, 0, NULL, NULL);
```
#### 13. Release OpenCL resources
Release the resources used above, then return:
```
clReleaseKernel(mykernel);
clReleaseMemObject(cl_filter);
clReleaseMemObject(cl_input_image);
clReleaseMemObject(cl_output_image);
clReleaseCommandQueue(commandQueue);
return;
```
The complete code is shown below:
```c=
#include <stdio.h>
#include <stdlib.h>
#include "hostFE.h"
#include "helper.h"

void hostFE(int filterWidth, float *filter, int imageHeight, int imageWidth,
            float *inputImage, float *outputImage, cl_device_id *device,
            cl_context *context, cl_program *program)
{
    cl_int status;
    int filterSize = filterWidth * filterWidth;
    int imageSize = imageHeight * imageWidth;

    cl_command_queue commandQueue = clCreateCommandQueue(*context, *device, 0, &status);

    cl_mem cl_filter = clCreateBuffer(*context, CL_MEM_READ_ONLY, filterSize * sizeof(float), NULL, &status);
    cl_mem cl_input_image = clCreateBuffer(*context, CL_MEM_READ_ONLY, imageSize * sizeof(float), NULL, &status);
    cl_mem cl_output_image = clCreateBuffer(*context, CL_MEM_WRITE_ONLY, imageSize * sizeof(float), NULL, &status);

    clEnqueueWriteBuffer(commandQueue, cl_filter, CL_TRUE, 0, filterSize * sizeof(float), (void *)filter, 0, NULL, NULL);
    clEnqueueWriteBuffer(commandQueue, cl_input_image, CL_TRUE, 0, imageSize * sizeof(float), (void *)inputImage, 0, NULL, NULL);

    cl_kernel mykernel = clCreateKernel(*program, "convolution", &status);

    clSetKernelArg(mykernel, 0, sizeof(cl_int), (void *)&filterWidth);
    clSetKernelArg(mykernel, 1, sizeof(cl_mem), (void *)&cl_filter);
    clSetKernelArg(mykernel, 2, sizeof(cl_int), (void *)&imageHeight);
    clSetKernelArg(mykernel, 3, sizeof(cl_int), (void *)&imageWidth);
    clSetKernelArg(mykernel, 4, sizeof(cl_mem), (void *)&cl_input_image);
    clSetKernelArg(mykernel, 5, sizeof(cl_mem), (void *)&cl_output_image);

    size_t localws[2] = {25, 25};
    size_t globalws[2] = {imageWidth, imageHeight};
    clEnqueueNDRangeKernel(commandQueue, mykernel, 2, NULL, globalws, localws, 0, NULL, NULL);

    clEnqueueReadBuffer(commandQueue, cl_output_image, CL_TRUE, 0, imageSize * sizeof(float), (void *)outputImage, 0, NULL, NULL);

    clReleaseKernel(mykernel);
    clReleaseMemObject(cl_filter);
    clReleaseMemObject(cl_input_image);
    clReleaseMemObject(cl_output_image);
    clReleaseCommandQueue(commandQueue);
    // clReleaseContext(context);
    return;
}
```
## Q2: Rewrite the program using CUDA
### (1) Explain your CUDA implementation
The main change needed here is turning hostFE.c into a CUDA version, hostFE.cu. Inside the hostFE function we allocate CUDA device memory to use the GPU's compute resources, and then call the kernel function convolutionKernel to do the work. The code is as follows:
```cpp=
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "hostFE.h"

__global__ void convolutionKernel(int filterWidth, float *filter,
                                  int imageHeight, int imageWidth,
                                  float *inputImage, float *outputImage) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < imageHeight && j < imageWidth) {
        float sum = 0.0f;
        int halfFilterSize = filterWidth / 2;
        for (int k = -halfFilterSize; k <= halfFilterSize; ++k) {
            for (int l = -halfFilterSize; l <= halfFilterSize; ++l) {
                int ni = i + k;
                int nj = j + l;
                if (ni >= 0 && ni < imageHeight && nj >= 0 && nj < imageWidth) {
                    sum += inputImage[ni * imageWidth + nj] *
                           filter[(k + halfFilterSize) * filterWidth + l + halfFilterSize];
                }
            }
        }
        outputImage[i * imageWidth + j] = sum;
    }
}

void hostFE(int filterWidth, float *filter, int imageHeight, int imageWidth,
            float *inputImage, float *outputImage, cl_device_id *device,
            cl_context *context, cl_program *program) {
    int filterSize = filterWidth * filterWidth;
    int imageSize = imageHeight * imageWidth;

    float *d_filter, *d_input_image, *d_output_image;
    cudaMalloc((void **)&d_filter, filterSize * sizeof(float));
    cudaMalloc((void **)&d_input_image, imageSize * sizeof(float));
    cudaMalloc((void **)&d_output_image, imageSize * sizeof(float));

    cudaMemcpy(d_filter, filter, filterSize * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_input_image, inputImage, imageSize * sizeof(float), cudaMemcpyHostToDevice);

    dim3 blockSize(25, 25);
    dim3 gridSize((imageWidth + blockSize.x - 1) / blockSize.x,
                  (imageHeight + blockSize.y - 1) / blockSize.y);
    convolutionKernel<<<gridSize, blockSize>>>(filterWidth, d_filter, imageHeight, imageWidth,
                                               d_input_image, d_output_image);

    cudaMemcpy(outputImage, d_output_image, imageSize * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_filter);
    cudaFree(d_input_image);
    cudaFree(d_output_image);
}
```
The content is actually very similar to the OpenCL version; the differences are that CUDA device memory has to be set up in the middle, and the thread index in the kernel function is written differently.
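For reference, the two index computations correspond one-to-one; a short comparison sketch (comments only, using the variable names from the kernels above):

```c
// OpenCL: the runtime hands each work-item its global ID directly.
//   int j = get_global_id(0);                        // column (x)
//   int i = get_global_id(1);                        // row (y)
//
// CUDA: the same global index is assembled from block and thread IDs.
//   int j = blockIdx.x * blockDim.x + threadIdx.x;   // column (x)
//   int i = blockIdx.y * blockDim.y + threadIdx.y;   // row (y)
```

A further CUDA optimization that would mirror the __constant qualifier in the OpenCL kernel is placing the filter in constant memory (a __constant__ array filled via cudaMemcpyToSymbol); we did not do that here.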
### (2) plot a chart to show the performance difference between using OpenCL and CUDA
:::success
Image size: 600 × 400
:::

:::success
Image size: 3000 × 3000
:::

### (3) Explain the result.
I compared BMP files of different sizes, and in both cases the results show that the two implementations have very similar parallel performance. The filter sizes are ordered filter1 > filter3 > filter2, and under both frameworks the speedup grows with filter size. The reason larger filters parallelize better is fairly intuitive: a larger filter gives each thread more work, so every work-item performs more computation per pixel (for example, a 7 × 7 filter does 49 multiply-adds per output pixel versus 9 for a 3 × 3 filter), which amortizes the fixed launch and memory-transfer overhead and makes the parallelization pay off more. As for why CUDA and OpenCL perform about the same: the OpenCL device in this assignment is also the GPU, and although CUDA is designed specifically for NVIDIA GPUs, the GPU's OpenCL implementation is evidently well optimized too, so the two technologies end up with comparable results.