# PP HW6

## Q1: Explain your implementation. How do you optimize the performance of convolution?

In OpenCL, the host-side workflow can be broken down into the following steps:

1. Discover the platform
2. Discover the device
3. Create a context
4. Create a command queue
5. Create device buffers
6. Write host data to device buffers
7. Create a program object with source and build it
8. Create the kernel
9. Set the kernel arguments
10. Configure the work-item structure
11. Enqueue the kernel for execution
12. Read the output buffer back to the host
13. Release OpenCL resources

In main.c we can see that steps 1, 2, 3, and 7 are already done for us; the actual work happens in helper.c, as shown below:

```c=
// in main.c
cl_program program;
cl_device_id device;
cl_context context;
initCL(&device, &context, &program);
```

```c=
// in helper.c
void initCL(cl_device_id *device, cl_context *context, cl_program *program)
{
    // Set up the OpenCL environment
    cl_int status;

    // Discover the platform
    cl_platform_id platform;
    status = clGetPlatformIDs(1, &platform, NULL);
    CHECK(status, "clGetPlatformIDs");

    // Discover the device
    status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, device, NULL);
    CHECK(status, "clGetDeviceIDs");

    // Create the context
    cl_context_properties props[3] = {CL_CONTEXT_PLATFORM,
                                      (cl_context_properties)(platform), 0};
    *context = clCreateContext(props, 1, device, NULL, NULL, &status);
    CHECK(status, "clCreateContext");

    const char *source = readSource("kernel.cl");

    // Create a program object with source and build it
    *program = clCreateProgramWithSource(*context, 1, &source, NULL, &status);
    CHECK(status, "clCreateProgramWithSource");

    status = clBuildProgram(*program, 1, device, NULL, NULL, NULL);
    CHECK(status, "clBuildProgram");
    return;
}
```
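If `clBuildProgram` fails, `CHECK` only reports the numeric error code; the actual compile errors from kernel.cl live in the build log. A minimal sketch of dumping it, meant to slot in right after the `clBuildProgram` call above (it assumes `<stdio.h>` and `<stdlib.h>` are available in helper.c):

```c
// Sketch: dump the compiler log when the build fails.
if (status != CL_SUCCESS)
{
    size_t logSize;
    clGetProgramBuildInfo(*program, *device, CL_PROGRAM_BUILD_LOG,
                          0, NULL, &logSize);      // query log length
    char *log = (char *)malloc(logSize);
    clGetProgramBuildInfo(*program, *device, CL_PROGRAM_BUILD_LOG,
                          logSize, log, NULL);     // fetch the log text
    fprintf(stderr, "clBuildProgram failed:\n%s\n", log);
    free(log);
}
```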
### kernel.cl

From helper.c above (the `clCreateProgramWithSource` / `clBuildProgram` calls near the end), we can see that the program is built from kernel.cl. The kernel mainly uses get_global_id to obtain each work-item's global ID and parallelize the work; the body is largely the same as the serial version (see the comments).

```c=
__kernel void convolution(int filterWidth, __constant float *filter,
                          int imageHeight, int imageWidth,
                          __global float *inputImage,
                          __global float *outputImage)
{
    int halffilterSize = filterWidth / 2;

    // Each work-item gets its index within the index space
    int i = get_global_id(1);
    int j = get_global_id(0);
    int k, l;
    float sum;

    sum = 0; // Reset sum for the new source pixel

    // Apply the filter to the neighborhood
    for (k = -halffilterSize; k <= halffilterSize; k++)
    {
        for (l = -halffilterSize; l <= halffilterSize; l++)
        {
            if (i + k >= 0 && i + k < imageHeight &&
                j + l >= 0 && j + l < imageWidth)
            {
                sum += inputImage[(i + k) * imageWidth + j + l] *
                       filter[(k + halffilterSize) * filterWidth + l + halffilterSize];
            }
        }
    }

    // Write the pixel to outputImage
    outputImage[i * imageWidth + j] = sum;
}
```

One thing worth noting is the parameter qualifiers: since the filter is read-only memory, we declare it `__constant`, while `inputImage` and `outputImage` live in the global address space, so they are declared `__global`.

### hostFE.c

The remaining OpenCL steps are exactly what hostFE.c does. First we compute the filter and image sizes, which the buffer allocations below rely on:

```
cl_int status;
int filterSize = filterWidth * filterWidth;
int imageSize = imageHeight * imageWidth;
```

#### 4. Create a command queue

Each device has its own command queue. The context submits commands to the queue associated with a device, and the device then executes the commands in that queue.

```
cl_command_queue commandQueue = clCreateCommandQueue(*context, *device, 0, &status);
```

#### 5. Create device buffers

Buffers hold the data objects the device needs while executing the program. Buffers are created from the context, so multiple devices managed by the same context can share the data in them.

```
cl_mem cl_filter = clCreateBuffer(*context, CL_MEM_READ_ONLY,
                                  filterSize * sizeof(float), NULL, &status);
cl_mem cl_input_image = clCreateBuffer(*context, CL_MEM_READ_ONLY,
                                       imageSize * sizeof(float), NULL, &status);
cl_mem cl_output_image = clCreateBuffer(*context, CL_MEM_WRITE_ONLY,
                                        imageSize * sizeof(float), NULL, &status);
```

#### 6. Write host data to device buffers

```
clEnqueueWriteBuffer(commandQueue, cl_filter, CL_TRUE, 0,
                     filterSize * sizeof(float), (void *)filter, 0, NULL, NULL);
clEnqueueWriteBuffer(commandQueue, cl_input_image, CL_TRUE, 0,
                     imageSize * sizeof(float), (void *)inputImage, 0, NULL, NULL);
```

#### 8. Create the kernel

Create the kernel from the program; this is the program's entry point.

```
cl_kernel mykernel = clCreateKernel(*program, "convolution", &status);
```

#### 9. Set the kernel arguments

Note that the argument indices must match the parameter order declared in kernel.cl.

```
clSetKernelArg(mykernel, 0, sizeof(cl_int), (void *)&filterWidth);
clSetKernelArg(mykernel, 1, sizeof(cl_mem), (void *)&cl_filter);
clSetKernelArg(mykernel, 2, sizeof(cl_int), (void *)&imageHeight);
clSetKernelArg(mykernel, 3, sizeof(cl_int), (void *)&imageWidth);
clSetKernelArg(mykernel, 4, sizeof(cl_mem), (void *)&cl_input_image);
clSetKernelArg(mykernel, 5, sizeof(cl_mem), (void *)&cl_output_image);
```

#### 10. Configure the work-item structure

Set up the work-item layout. Each global work size must be a multiple of the corresponding local work size; both test images (600 × 400 and 3000 × 3000) are divisible by 25 in each dimension.

```
size_t localws[2] = {25, 25};
size_t globalws[2] = {imageWidth, imageHeight};
```

#### 11. Enqueue the kernel for execution

Put the kernel object, together with the work-item configuration, into the command queue for execution.

```
clEnqueueNDRangeKernel(commandQueue, mykernel, 2, NULL, globalws, localws,
                       0, NULL, NULL);
```

#### 12. Read the output buffer back to the host

Read cl_output_image from the command queue back into outputImage.

```
clEnqueueReadBuffer(commandQueue, cl_output_image, CL_TRUE, 0,
                    imageSize * sizeof(float), (void *)outputImage, 0, NULL, NULL);
```
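The steps above do not measure how long the kernel itself runs, which is useful for the OpenCL-vs-CUDA comparison in Q2. A minimal sketch using OpenCL event profiling; it assumes the queue in step 4 is created with `CL_QUEUE_PROFILING_ENABLE` instead of `0`, and the `kernelEvent` name is made up for illustration:

```c
// Sketch: the queue must be created with profiling enabled.
cl_command_queue commandQueue =
    clCreateCommandQueue(*context, *device, CL_QUEUE_PROFILING_ENABLE, &status);

// Attach an event to the kernel launch, wait for it, then query timestamps.
cl_event kernelEvent;
clEnqueueNDRangeKernel(commandQueue, mykernel, 2, NULL, globalws, localws,
                       0, NULL, &kernelEvent);
clWaitForEvents(1, &kernelEvent);

cl_ulong start, end;
clGetEventProfilingInfo(kernelEvent, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(kernelEvent, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &end, NULL);
printf("kernel time: %.3f ms\n", (end - start) * 1e-6); // timestamps are in ns
clReleaseEvent(kernelEvent);
```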
#### 13. Release OpenCL resources

Release the resources we used, then return.

```
clReleaseKernel(mykernel);
clReleaseMemObject(cl_filter);
clReleaseMemObject(cl_input_image);
clReleaseMemObject(cl_output_image);
clReleaseCommandQueue(commandQueue);
return;
```

The full program is shown below:

```c=
#include <stdio.h>
#include <stdlib.h>
#include "hostFE.h"
#include "helper.h"

void hostFE(int filterWidth, float *filter, int imageHeight, int imageWidth,
            float *inputImage, float *outputImage, cl_device_id *device,
            cl_context *context, cl_program *program)
{
    cl_int status;
    int filterSize = filterWidth * filterWidth;
    int imageSize = imageHeight * imageWidth;

    cl_command_queue commandQueue = clCreateCommandQueue(*context, *device, 0, &status);

    cl_mem cl_filter = clCreateBuffer(*context, CL_MEM_READ_ONLY,
                                      filterSize * sizeof(float), NULL, &status);
    cl_mem cl_input_image = clCreateBuffer(*context, CL_MEM_READ_ONLY,
                                           imageSize * sizeof(float), NULL, &status);
    cl_mem cl_output_image = clCreateBuffer(*context, CL_MEM_WRITE_ONLY,
                                            imageSize * sizeof(float), NULL, &status);

    clEnqueueWriteBuffer(commandQueue, cl_filter, CL_TRUE, 0,
                         filterSize * sizeof(float), (void *)filter, 0, NULL, NULL);
    clEnqueueWriteBuffer(commandQueue, cl_input_image, CL_TRUE, 0,
                         imageSize * sizeof(float), (void *)inputImage, 0, NULL, NULL);

    cl_kernel mykernel = clCreateKernel(*program, "convolution", &status);

    clSetKernelArg(mykernel, 0, sizeof(cl_int), (void *)&filterWidth);
    clSetKernelArg(mykernel, 1, sizeof(cl_mem), (void *)&cl_filter);
    clSetKernelArg(mykernel, 2, sizeof(cl_int), (void *)&imageHeight);
    clSetKernelArg(mykernel, 3, sizeof(cl_int), (void *)&imageWidth);
    clSetKernelArg(mykernel, 4, sizeof(cl_mem), (void *)&cl_input_image);
    clSetKernelArg(mykernel, 5, sizeof(cl_mem), (void *)&cl_output_image);

    size_t localws[2] = {25, 25};
    size_t globalws[2] = {imageWidth, imageHeight};

    clEnqueueNDRangeKernel(commandQueue, mykernel, 2, NULL, globalws, localws,
                           0, NULL, NULL);

    clEnqueueReadBuffer(commandQueue, cl_output_image, CL_TRUE, 0,
                        imageSize * sizeof(float), (void *)outputImage, 0, NULL, NULL);

    clReleaseKernel(mykernel);
    clReleaseMemObject(cl_filter);
    clReleaseMemObject(cl_input_image);
    clReleaseMemObject(cl_output_image);
    clReleaseCommandQueue(commandQueue);
    // clReleaseContext(context);
    return;
}
```

## Q2: Rewrite the program using CUDA

### (1) Explain your CUDA implementation

The main change here is rewriting hostFE.c as a CUDA version, hostFE.cu. Inside the function hostFE we allocate CUDA device memory so the GPU's compute resources can be used, copy the inputs over, and launch the kernel function convolutionKernel. The code is as follows:

```cpp=
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "hostFE.h"

__global__ void convolutionKernel(int filterWidth, float *filter,
                                  int imageHeight, int imageWidth,
                                  float *inputImage, float *outputImage)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < imageHeight && j < imageWidth)
    {
        float sum = 0.0f;
        int halfFilterSize = filterWidth / 2;
        for (int k = -halfFilterSize; k <= halfFilterSize; ++k)
        {
            for (int l = -halfFilterSize; l <= halfFilterSize; ++l)
            {
                int ni = i + k;
                int nj = j + l;
                if (ni >= 0 && ni < imageHeight && nj >= 0 && nj < imageWidth)
                {
                    sum += inputImage[ni * imageWidth + nj] *
                           filter[(k + halfFilterSize) * filterWidth + l + halfFilterSize];
                }
            }
        }
        outputImage[i * imageWidth + j] = sum;
    }
}

void hostFE(int filterWidth, float *filter, int imageHeight, int imageWidth,
            float *inputImage, float *outputImage, cl_device_id *device,
            cl_context *context, cl_program *program)
{
    int filterSize = filterWidth * filterWidth;
    int imageSize = imageHeight * imageWidth;

    float *d_filter, *d_input_image, *d_output_image;
    cudaMalloc((void **)&d_filter, filterSize * sizeof(float));
    cudaMalloc((void **)&d_input_image, imageSize * sizeof(float));
    cudaMalloc((void **)&d_output_image, imageSize * sizeof(float));

    cudaMemcpy(d_filter, filter, filterSize * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_input_image, inputImage, imageSize * sizeof(float), cudaMemcpyHostToDevice);

    dim3 blockSize(25, 25);
    dim3 gridSize((imageWidth + blockSize.x - 1) / blockSize.x,
                  (imageHeight + blockSize.y - 1) / blockSize.y);

    convolutionKernel<<<gridSize, blockSize>>>(filterWidth, d_filter, imageHeight,
                                               imageWidth, d_input_image, d_output_image);

    cudaMemcpy(outputImage, d_output_image, imageSize * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_filter);
    cudaFree(d_input_image);
    cudaFree(d_output_image);
}
```
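None of the CUDA calls above check their return values. A minimal error-checking sketch one could add (the `CUDA_CHECK` macro name is hypothetical, not part of the CUDA runtime):

```c
// Hypothetical helper: abort with a readable message if a CUDA call fails.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                \
                    __FILE__, __LINE__, cudaGetErrorString(err));       \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Usage: wrap the runtime calls, and check kernel launches separately,
// since the <<<...>>> launch syntax does not return a cudaError_t.
CUDA_CHECK(cudaMalloc((void **)&d_filter, filterSize * sizeof(float)));
convolutionKernel<<<gridSize, blockSize>>>(/* ... */);
CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised during execution
```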
The content is very similar to the OpenCL version; the main differences are that CUDA device memory has to be allocated in between, and the thread index is computed differently in the kernel function (blockIdx * blockDim + threadIdx instead of get_global_id).

### (2) Plot a chart to show the performance difference between using OpenCL and CUDA

:::success
Image size: 600 × 400
:::

![image](https://hackmd.io/_uploads/rJFjpQEOa.png)

:::success
Image size: 3000 × 3000
:::

![image](https://hackmd.io/_uploads/SJD5pm4_a.png)

### (3) Explain the result

I compared BMP files of two different sizes, and in both cases the two approaches show similar parallel performance. The filter sizes are ordered filter1 > filter3 > filter2, and under both frameworks the speedup grows with the filter size. The intuitive reason a larger filter parallelizes better is that it gives each thread more to do: every work-item performs more computation relative to the fixed overheads, so the benefit of parallelization is larger. As for why CUDA and OpenCL perform similarly, I believe it is because OpenCL is also running on the GPU in this assignment; although CUDA is designed specifically for NVIDIA GPUs, the GPU's OpenCL stack appears to be well optimized too, so the two technologies end up at a comparable level.
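A rough model of this effect (all symbols here are assumptions for illustration, not measured quantities): let $W \times H$ be the image size, $F$ the filter width, $c$ the cost of one multiply-add, $P$ the effective parallelism, and $t_{mem}$ the fixed data-transfer overhead. Then

$$
\text{speedup} \approx \frac{c\,WHF^2}{t_{mem} + \dfrac{c\,WHF^2}{P}},
$$

which increases toward $P$ as $F$ grows, since the $F^2$ compute term comes to dominate the fixed $t_{mem}$. This matches the observed trend that larger filters yield better speedup.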