# Parallel Programming @ NYCU - HW6
#### **`0716221 余忠旻`**
### <font color="#3CB371"> Q1: Explain your implementation. How do you optimize the performance of convolution?
</font>
:::success
In `hostFE.c`,
I set the local work size to `8 x 8` and the global work size to `imageWidth x imageHeight`,
which guarantees that every pixel of the image has a work item assigned to process it in parallel.
In `kernel.cl`,
I use `get_global_id(0)` / `get_global_id(1)`
to determine which pixel of the image each work item is responsible for,
and then compute that pixel's convolution following `serialConv()`
from the provided `serialConv.c`.
:::
**<font color="#5E86C8" size = 4>Optimize the performance of convolution</font>**
* **<font color="#24367D">Rewriting the for loops</font>**
In the provided `serialConv()` from `serialConv.c`,
the convolution loops run an `if` boundary check on every iteration
to make sure the access stays inside the image, which adds considerable overhead.
Original version:
```cpp
for (k = -halffilterSize; k <= halffilterSize; k++) {
    for (l = -halffilterSize; l <= halffilterSize; l++) {
        if (i + k >= 0 && i + k < imageHeight &&
            j + l >= 0 && j + l < imageWidth) {
            sum += inputImage[(i + k) * imageWidth + j + l] *
                   filter[(k + halffilterSize) * filterWidth + l + halffilterSize];
        }
    }
}
outputImage[i * imageWidth + j] = sum;
```
    So I changed it to compute the start and end positions of k and l first, and fold them into the loop conditions.
    Modified version:
```cpp
int k_start = iy >= halffilterSize ? -halffilterSize : -iy;
int k_end = iy + halffilterSize < imageHeight ? halffilterSize : imageHeight - 1 - iy;
int l_start = ix >= halffilterSize ? -halffilterSize : -ix;
int l_end = ix + halffilterSize < imageWidth ? halffilterSize : imageWidth - 1 - ix;
for (int k = k_start; k <= k_end; k++) {
    int idxI = (iy + k) * imageWidth + ix;
    int idxF = (k + halffilterSize) * filterWidth + halffilterSize;
    for (int l = l_start; l <= l_end; l++) {
        sum += inputImage[idxI + l] * filter[idxF + l];
    }
}
outputImage[iy * imageWidth + ix] = sum;
```
    As the code above also shows, during the convolution
    the inputImage and filter indices advance by exactly 1 on each iteration
    of the inner loop, so we can store each starting index in a variable
    and simply add 1 as l increases,
    saving the repeated index arithmetic.
<br>
* **<font color="#24367D">Rewriting the filter handling</font>**
If a filter entry is 0, its multiplication and addition can be skipped entirely.
So in `hostFE.c` I inspect the filter,
trim away the all-zero outer rings, shrink filterWidth accordingly,
and thereby reduce the amount of computation.
```cpp
char *charFilter = (char *)malloc(filterWidth * filterWidth * sizeof(char));
int new_filterWidth = filterWidth;
int checkStart = 0;
int checkEnd = filterWidth - 1;
int check = 1;
while (check == 1 && checkStart < checkEnd) {
    for (int i = 0; i < filterWidth && check == 1; i++)
        if (filter[checkStart * filterWidth + i] != 0) check = 0; // top row
    for (int i = 0; i < filterWidth && check == 1; i++)
        if (filter[checkEnd * filterWidth + i] != 0) check = 0;   // bottom row
    for (int i = 0; i < filterWidth && check == 1; i++)
        if (filter[i * filterWidth + checkStart] != 0) check = 0; // left column
    for (int i = 0; i < filterWidth && check == 1; i++)
        if (filter[i * filterWidth + checkEnd] != 0) check = 0;   // right column
    if (check == 1) new_filterWidth -= 2; // the whole ring is zero: shrink the filter
    checkStart++;
    checkEnd--;
}
int charFilter_start = (filterWidth - new_filterWidth) / 2; // rings are removed in pairs, so the offset is symmetric
for (int i = 0; i < new_filterWidth; i++)
    for (int j = 0; j < new_filterWidth; j++)
        charFilter[i * new_filterWidth + j] =
            filter[(charFilter_start + i) * filterWidth + charFilter_start + j];
```
___
### <font color="#3CB371"> Q2: Rewrite the program using CUDA.
(1) Explain your CUDA implementation
(2) plot a chart to show the performance difference between using OpenCL and CUDA
(3) explain the result
</font>
**(1) Explain your CUDA implementation**
The CUDA implementation uses the same algorithm and optimizations as the OpenCL version.
I set `threadsPerBlock = 8 x 8` and choose
`numBlocks` so that the grid covers all `imageWidth x imageHeight` pixels,
which, as in OpenCL, guarantees that every pixel of the image has a thread assigned to process it in parallel.
CUDA differs little from OpenCL apart from the programming environment,
so I reused the HW5 setup, made minor adjustments to `main.cpp`, and implemented the convolution in `kernel.cu`:
```cpp=
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

#define BLOCK_SIZE 8

__global__ static void convolution(char *filter, int filterWidth, int imageHeight, int imageWidth,
                                   unsigned char *inputImage, unsigned short *outputImage) {
    int halffilterSize = filterWidth / 2;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    if (iy >= imageHeight || ix >= imageWidth) return; // guard threads outside the image
    int sum = 0;
    // Clamp the loop bounds up front so no boundary check is needed inside the loops
    int k_start = iy >= halffilterSize ? -halffilterSize : -iy;
    int k_end = iy + halffilterSize < imageHeight ? halffilterSize : imageHeight - 1 - iy;
    int l_start = ix >= halffilterSize ? -halffilterSize : -ix;
    int l_end = ix + halffilterSize < imageWidth ? halffilterSize : imageWidth - 1 - ix;
    for (int k = k_start; k <= k_end; k++) {
        int idxI = (iy + k) * imageWidth + ix;
        int idxF = (k + halffilterSize) * filterWidth + halffilterSize;
        for (int l = l_start; l <= l_end; l++) {
            sum += inputImage[idxI + l] * filter[idxF + l];
        }
    }
    outputImage[iy * imageWidth + ix] = sum;
}

void hostFE(int filterWidth, float *filter, int imageHeight, int imageWidth,
            float *inputImage, float *outputImage) {
    // Trim all-zero outer rings off the filter, as in the OpenCL version
    char *charFilter = (char *)malloc(filterWidth * filterWidth * sizeof(char));
    int new_filterWidth = filterWidth;
    int checkStart = 0;
    int checkEnd = filterWidth - 1;
    int check = 1;
    while (check == 1 && checkStart < checkEnd) {
        for (int i = 0; i < filterWidth && check == 1; i++)
            if (filter[checkStart * filterWidth + i] != 0) check = 0; // top row
        for (int i = 0; i < filterWidth && check == 1; i++)
            if (filter[checkEnd * filterWidth + i] != 0) check = 0;   // bottom row
        for (int i = 0; i < filterWidth && check == 1; i++)
            if (filter[i * filterWidth + checkStart] != 0) check = 0; // left column
        for (int i = 0; i < filterWidth && check == 1; i++)
            if (filter[i * filterWidth + checkEnd] != 0) check = 0;   // right column
        if (check == 1) new_filterWidth -= 2; // the whole ring is zero: shrink the filter
        checkStart++;
        checkEnd--;
    }
    int charFilter_start = (filterWidth - new_filterWidth) / 2; // rings are removed in pairs
    for (int i = 0; i < new_filterWidth; i++)
        for (int j = 0; j < new_filterWidth; j++)
            charFilter[i * new_filterWidth + j] =
                filter[(charFilter_start + i) * filterWidth + charFilter_start + j];

    int filterSize = new_filterWidth * new_filterWidth * sizeof(char);
    int imageSize = imageHeight * imageWidth;
    unsigned char *bufInput = (unsigned char *)malloc(imageSize * sizeof(unsigned char));
    unsigned short *bufOutput = (unsigned short *)malloc(imageSize * sizeof(unsigned short));
    for (int i = 0; i < imageSize; i++) bufInput[i] = inputImage[i];

    char *deviceFilter;
    unsigned char *deviceInput;
    unsigned short *deviceOutput;
    cudaMalloc(&deviceFilter, filterSize);
    cudaMalloc(&deviceInput, imageSize * sizeof(unsigned char));
    cudaMalloc(&deviceOutput, imageSize * sizeof(unsigned short));
    cudaMemcpy(deviceFilter, charFilter, filterSize, cudaMemcpyHostToDevice);
    cudaMemcpy(deviceInput, bufInput, imageSize * sizeof(unsigned char), cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(BLOCK_SIZE, BLOCK_SIZE);
    // Round up so image sides that are not multiples of BLOCK_SIZE are still fully covered
    dim3 numBlocks((imageWidth + BLOCK_SIZE - 1) / BLOCK_SIZE,
                   (imageHeight + BLOCK_SIZE - 1) / BLOCK_SIZE);
    convolution<<<numBlocks, threadsPerBlock>>>(deviceFilter, new_filterWidth, imageHeight,
                                                imageWidth, deviceInput, deviceOutput);
    cudaMemcpy(bufOutput, deviceOutput, imageSize * sizeof(unsigned short), cudaMemcpyDeviceToHost);
    for (int i = 0; i < imageSize; i++) outputImage[i] = (float)bufOutput[i];

    cudaFree(deviceFilter);
    cudaFree(deviceInput);
    cudaFree(deviceOutput);
    free(charFilter);
    free(bufInput);
    free(bufOutput);
}
```
<br>
<br>
**(2) plot a chart to show the performance difference between using OpenCL and CUDA**

**(3) explain the result**
With the same algorithm and optimizations,
CUDA is slightly faster than OpenCL in all three filter cases shown above.
I believe this is because OpenCL is a cross-platform, cross-vendor programming framework,
while CUDA targets NVIDIA GPUs exclusively,
so OpenCL is harder to tune specifically for NVIDIA devices;
on an NVIDIA GPU, CUDA therefore tends to outperform OpenCL.
The paper [A Performance Comparison of CUDA and OpenCL](https://arxiv.org/ftp/arxiv/papers/1005/1005.2581.pdf)
likewise reports CUDA ahead of OpenCL in both data transfer and computation,
which supports our experimental result.
___