# PP HW6

# Q1
> Explain your implementation. How do you optimize the performance of convolution?

In `hostFE`, I first set the local work size to 25×25, then set up all the required arguments and pass them to the convolution kernel.

Looking at `serialConv`, I noticed that every iteration of the inner loop uses an `if` to test whether the current tap falls outside the image, which costs extra time. I therefore rewrote it as below: the valid loop ranges are computed once up front, so the repeated per-iteration boundary test is avoided.

```c
// Precompute the valid i range in the y direction
int i_start = (y - halfFilterWidth >= 0) ? -halfFilterWidth : -y;
int i_end   = (y + halfFilterWidth < imageHeight) ? halfFilterWidth : (imageHeight - 1 - y);
// Precompute the valid j range in the x direction
int j_start = (x - halfFilterWidth >= 0) ? -halfFilterWidth : -x;
int j_end   = (x + halfFilterWidth < imageWidth) ? halfFilterWidth : (imageWidth - 1 - x);

for (int i = i_start; i <= i_end; i++) {
    for (int j = j_start; j <= j_end; j++) {
        int inputX = x + j;
        int inputY = y + i;
        float inputValue = inputImage[inputY * imageWidth + inputX];
        int filterIndex = (i + halfFilterWidth) * filterWidth + (j + halfFilterWidth);
        sum += inputValue * filter[filterIndex];
    }
}
```

# Q2
> Rewrite the program using CUDA. (1) Explain your CUDA implementation, (2) plot a chart to show the performance difference between using OpenCL and CUDA, and (3) explain the result.

(1) Explain your CUDA implementation

The logic is basically the same as the OpenCL implementation, and the block size is likewise 25×25. The main changes are that `hostFE.c` becomes a `.cu` file, and the Makefile is updated to compile the CUDA code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cuda.h>
extern "C" {
#include "hostFE.h"
}

__global__ void convolutionKernel(float *inputImage, float *filter, float *outputImage,
                                  int imageHeight, int imageWidth, int filterWidth) {
    int x = blockIdx.x * blockDim.x + threadIdx.x; // global X coordinate
    int y = blockIdx.y * blockDim.y + threadIdx.y; // global Y coordinate
    if (x >= imageWidth || y >= imageHeight)       // guard: the grid is rounded up,
        return;                                    // so edge blocks have extra threads

    int halfFilterWidth = filterWidth / 2;
    float sum = 0.0f;

    // Precompute the valid filter ranges (same trick as the OpenCL version)
    int i_start = (y - halfFilterWidth >= 0) ? -halfFilterWidth : -y;
    int i_end   = (y + halfFilterWidth < imageHeight) ? halfFilterWidth : (imageHeight - 1 - y);
    int j_start = (x - halfFilterWidth >= 0) ? -halfFilterWidth : -x;
    int j_end   = (x + halfFilterWidth < imageWidth) ? halfFilterWidth : (imageWidth - 1 - x);

    // Convolution
    for (int i = i_start; i <= i_end; i++) {
        for (int j = j_start; j <= j_end; j++) {
            int inputX = x + j;
            int inputY = y + i;
            float inputValue = inputImage[inputY * imageWidth + inputX];
            int filterIndex = (i + halfFilterWidth) * filterWidth + (j + halfFilterWidth);
            sum += inputValue * filter[filterIndex];
        }
    }
    outputImage[y * imageWidth + x] = sum; // store the result
}

void hostFE(int filterWidth, float *filter, int imageHeight, int imageWidth,
            float *inputImage, float *outputImage, cl_device_id *device,
            cl_context *context, cl_program *program) {
    size_t imageSize = imageHeight * imageWidth * sizeof(float);
    size_t filterSize = filterWidth * filterWidth * sizeof(float);

    // Allocate device memory
    float *d_inputImage, *d_filter, *d_outputImage;
    cudaMalloc(&d_inputImage, imageSize);
    cudaMalloc(&d_filter, filterSize);
    cudaMalloc(&d_outputImage, imageSize);

    // Copy data from host to device
    cudaMemcpy(d_inputImage, inputImage, imageSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_filter, filter, filterSize, cudaMemcpyHostToDevice);

    // Define grid and block dimensions
    dim3 blockSize(25, 25); // 25x25 threads per block, matching the OpenCL local size
    dim3 gridSize((imageWidth + blockSize.x - 1) / blockSize.x,
                  (imageHeight + blockSize.y - 1) / blockSize.y);

    // Launch kernel
    convolutionKernel<<<gridSize, blockSize>>>(d_inputImage, d_filter, d_outputImage,
                                               imageHeight, imageWidth, filterWidth);
    cudaDeviceSynchronize();

    // Copy result back to host
    cudaMemcpy(outputImage, d_outputImage, imageSize, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_inputImage);
    cudaFree(d_filter);
    cudaFree(d_outputImage);
}
```

(2) plot a chart to show the performance difference between using OpenCL and CUDA

![image](https://hackmd.io/_uploads/B1AM8wgryg.png)

(3) explain the result

I used the filters provided with the homework; by size they rank filter 1 > filter 3 > filter 2. The chart shows that the larger the filter, the better both OpenCL and CUDA perform, presumably because the filter work is distributed across more threads. Comparing the two, OpenCL pulls further ahead as the filter size grows. That said, while checking `main` and the other files of hw6, I found some OpenCL-specific code paths that I did not change when porting to CUDA (this does not affect correctness), so the actual performance of the two should be closer than the chart suggests.