# PP HW6
# Q1
>Explain your implementation. How do you optimize the performance of convolution?
In hostFE, I first define a local work size of 25×25, set up all the required kernel arguments, and launch `convolution`.
Looking at `serialConv`, I noticed that every inner-loop iteration uses an `if` to test whether the index is out of bounds, which costs extra time.
So I rewrote it as shown below, precomputing the valid range up front to avoid the repeated checks:
```c
// Compute the valid i range in the y direction
int i_start = (y - halfFilterWidth >= 0) ? -halfFilterWidth : -y;
int i_end = (y + halfFilterWidth < imageHeight) ? halfFilterWidth : (imageHeight - 1 - y);
// Compute the valid j range in the x direction
int j_start = (x - halfFilterWidth >= 0) ? -halfFilterWidth : -x;
int j_end = (x + halfFilterWidth < imageWidth) ? halfFilterWidth : (imageWidth - 1 - x);
for (int i = i_start; i <= i_end; i++) {
    for (int j = j_start; j <= j_end; j++) {
        int inputX = x + j;
        int inputY = y + i;
        float inputValue = inputImage[inputY * imageWidth + inputX];
        int filterIndex = (i + halfFilterWidth) * filterWidth + (j + halfFilterWidth);
        sum += inputValue * filter[filterIndex];
    }
}
```
# Q2
> Rewrite the program using CUDA. (1) Explain your CUDA implementation, (2) plot a chart to show the performance difference between using OpenCL and CUDA, and (3) explain the result.
(1) Explain your CUDA implementation
The logic is basically the same as the OpenCL implementation, with a block size of 25×25. The main changes are that hostFE.c becomes a .cu file, and the Makefile has to be extended to compile the CUDA code.
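As a sketch of the Makefile change (the actual target names and flags depend on the assignment's original Makefile, so treat these as illustrative), the added rule is roughly:

```makefile
# Hypothetical fragment: compile the renamed hostFE.cu with nvcc and link it in.
NVCC      = nvcc
NVCCFLAGS = -O2

hostFE.o: hostFE.cu hostFE.h
	$(NVCC) $(NVCCFLAGS) -c $< -o $@
```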
```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cuda.h>
extern "C" {
#include "hostFE.h"
}
__global__ void convolutionKernel(float* inputImage, float* filter, float* outputImage,
int imageHeight, int imageWidth, int filterWidth) {
int x = blockIdx.x * blockDim.x + threadIdx.x; // Global X coordinate
int y = blockIdx.y * blockDim.y + threadIdx.y; // Global Y coordinate
if (x >= imageWidth || y >= imageHeight) return; // Mask surplus threads in edge blocks
int halfFilterWidth = filterWidth / 2;
float sum = 0.0f;
// Convolution calculation
int i_start = (y - halfFilterWidth >= 0) ? -halfFilterWidth : -y;
int i_end = (y + halfFilterWidth < imageHeight) ? halfFilterWidth : (imageHeight - 1 - y);
int j_start = (x - halfFilterWidth >= 0) ? -halfFilterWidth : -x;
int j_end = (x + halfFilterWidth < imageWidth) ? halfFilterWidth : (imageWidth - 1 - x);
for (int i = i_start; i <= i_end; i++) {
for (int j = j_start; j <= j_end; j++) {
int inputX = x + j;
int inputY = y + i;
float inputValue = inputImage[inputY * imageWidth + inputX];
int filterIndex = (i + halfFilterWidth) * filterWidth + (j + halfFilterWidth);
sum += inputValue * filter[filterIndex];
}
}
outputImage[y * imageWidth + x] = sum; // Store the result
}
void hostFE(int filterWidth, float *filter, int imageHeight, int imageWidth,
float *inputImage, float *outputImage, cl_device_id *device,
cl_context *context, cl_program *program) {
size_t imageSize = imageHeight * imageWidth * sizeof(float);
size_t filterSize = filterWidth * filterWidth * sizeof(float);
// Allocate device memory
float *d_inputImage, *d_filter, *d_outputImage;
cudaMalloc(&d_inputImage, imageSize);
cudaMalloc(&d_filter, filterSize);
cudaMalloc(&d_outputImage, imageSize);
// Copy data from host to device
cudaMemcpy(d_inputImage, inputImage, imageSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_filter, filter, filterSize, cudaMemcpyHostToDevice);
// Define grid and block dimensions
dim3 blockSize(25, 25); // 25x25 threads per block
dim3 gridSize((imageWidth + blockSize.x - 1) / blockSize.x,
(imageHeight + blockSize.y - 1) / blockSize.y);
// Launch kernel
convolutionKernel<<<gridSize, blockSize>>>(d_inputImage, d_filter, d_outputImage,
imageHeight, imageWidth, filterWidth);
cudaDeviceSynchronize();
// Copy result back to host
cudaMemcpy(outputImage, d_outputImage, imageSize, cudaMemcpyDeviceToHost);
// Free device memory
cudaFree(d_inputImage);
cudaFree(d_filter);
cudaFree(d_outputImage);
}
```
(2) plot a chart to show the performance difference between using OpenCL and CUDA

(3) explain the result
I used the filters provided with the homework; ranked by size they are 1 > 3 > 2. The chart shows that both OpenCL and CUDA perform better as the filter size grows, presumably because the work is spread across more threads. Comparing the two, OpenCL pulls further ahead as the filter size increases. That said, while checking hw6's main and the other files, I found some OpenCL-specific code that I did not convert when porting to CUDA (it does not affect correctness), so the actual performance of the two should be closer than the chart suggests.