Parallel Programming HW6 @NYCU, 2022 Fall

tags: `2022_PP_NYCU`

Q1

Explain your implementation. How do you optimize the performance of convolution?

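In hostFE.c and kernel.c, every *2 and /2 is replaced with a bit operation (a shift) for speed.
In hostFE.c, I shrink the filter and convert inputImage from float to char, which reduces the amount of data transferred.
In kernel.c, I rewrite the convolution loop to remove redundant conditionals.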
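As a small illustration of the shift replacement, here is a sketch in C. The helper names `half` and `twice` are made up for this note and do not appear in hostFE.c or kernel.c. For non-negative values the shift gives the same result as the arithmetic; for negative values `>> 1` and `/ 2` can round differently, so the replacement is only safe for sizes and indices that are never negative.

```c
#include <assert.h>

/* Hypothetical helper: halve a non-negative int with a shift. */
static int half(int n)
{
    assert(n >= 0);   /* for negative n, n >> 1 and n / 2 can differ */
    return n >> 1;    /* same as n / 2 when n >= 0 */
}

/* Hypothetical helper: double a non-negative int with a shift. */
static int twice(int n)
{
    assert(n >= 0);
    return n << 1;    /* same as n * 2 as long as the result fits in int */
}
```

Optimizing compilers usually perform this rewrite on their own for constant powers of two, so the measured gain from doing it by hand is likely to be small.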

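Shrinking the `filter`

I strip the surplus outer rings of zeros from the filter to produce a smaller filter. Less data is then transferred to the GPU, and the convolution skips multiplications whose result would be zero anyway.

For example: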
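One way to sketch this trimming in C is shown below. It is not the code from hostFE.c: the helpers `ring_is_zero` and `trim_filter` are hypothetical, and the sketch assumes a square filter of odd width stored in row-major order.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 1 if ring number r (0 = outermost) of a w x w filter is all zeros. */
static int ring_is_zero(const float *filter, int w, int r)
{
    int lo = r, hi = w - 1 - r;
    for (int t = lo; t <= hi; t++) {
        if (filter[lo * w + t] != 0.0f || filter[hi * w + t] != 0.0f ||
            filter[t * w + lo] != 0.0f || filter[t * w + hi] != 0.0f)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: strips all-zero outer rings from a square filter of
 * odd width. Stores the new width in *newWidth and returns a malloc'ed copy
 * of the remaining centre (the caller frees it), or NULL if malloc fails. */
static float *trim_filter(const float *filter, int filterWidth, int *newWidth)
{
    assert(filterWidth > 0 && (filterWidth & 1));

    /* Count zero rings from the outside in, always keeping the 1x1 centre. */
    int rings = 0;
    while (filterWidth - 2 * rings > 1 &&
           ring_is_zero(filter, filterWidth, rings))
        rings++;

    int w = filterWidth - 2 * rings;
    float *out = (float *)malloc((size_t)w * w * sizeof(float));
    if (out == NULL)
        return NULL;

    /* Copy the centre w x w block into the smaller buffer. */
    for (int r = 0; r < w; r++)
        for (int c = 0; c < w; c++)
            out[r * w + c] = filter[(r + rings) * filterWidth + (c + rings)];

    *newWidth = w;
    return out;
}
```

With a 5x5 filter whose outer ring is all zeros, this returns the inner 3x3 block and sets the new width to 3. The kernel has to receive the new width as well; the convolution result is unchanged because the removed weights only ever contributed zero.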

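Converting inputImage from float to char

I convert every element of inputImage from float (4 bytes) to char (1 byte), which reduces the amount of data sent to the GPU. The cast is only lossless when every pixel value is an integer that fits in the 1-byte type.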

// make input image from float to char
char *newInputImage = (char *)malloc(imageHeight * imageWidth * sizeof(char));
for (int i = 0; i < imageHeight * imageWidth; i++)
{
    newInputImage[i] = (char)inputImage[i];
}
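The cast above drops any fractional part and cannot represent values outside the range of a single byte. A minimal sketch of a check that could run before taking the 1-byte path is given below. It assumes pixel values are expected in 0..255 and therefore tests against `unsigned char`, which is an assumption of this sketch (the code above uses `char`); `fits_in_uchar` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical check: returns 1 if every pixel survives the round trip
 * float -> unsigned char -> float unchanged, and 0 otherwise. */
static int fits_in_uchar(const float *image, size_t count)
{
    assert(image != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        float v = image[i];
        /* Written this way so that NaN is rejected too. */
        if (!(v >= 0.0f && v <= 255.0f))
            return 0;
        /* A fractional part would be lost by the cast. */
        if ((float)(unsigned char)v != v)
            return 0;
    }
    return 1;
}
```

If the check fails, the host can fall back to sending the original float buffer.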

Rewriting the convolution

My implementation is based on the void serialConv() provided with the assignment.
I modified its two inner for loops.

sum = 0; // Reset sum for new source pixel
// Apply the filter to the neighborhood
for (k = -halffilterSize; k <= halffilterSize; k++)
{
    for (l = -halffilterSize; l <= halffilterSize; l++)
    {
        if (i + k >= 0 && i + k < imageHeight &&
            j + l >= 0 && j + l < imageWidth)
        {
            sum += inputImage[(i + k) * imageWidth + j + l] *
                   filter[(k + halffilterSize) * filterWidth + l + halffilterSize];
        }
    }
}

The original version performs a boundary check with an if on every iteration of the inner loop, which is inefficient.

// image x, y
int x = get_global_id(0); 
int y = get_global_id(1);

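// filter x,y start, end (inclusive filter indices whose image pixel is in bounds)
// filter row i maps to image row y - halfFilterSize + i, so the last valid i
// at the bottom edge is halfFilterSize + imageHeight - 1 - y (same for columns)
int row_start = y - halfFilterSize >= 0 ? 0 : halfFilterSize - y;
int row_end = y + halfFilterSize < imageHeight ? filterWidth - 1 : halfFilterSize + imageHeight - 1 - y;
int col_start = x - halfFilterSize >= 0 ? 0 : halfFilterSize - x;
int col_end = x + halfFilterSize < imageWidth ? filterWidth - 1 : halfFilterSize + imageWidth - 1 - x;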

// filter based
for (int i = row_start; i <= row_end; i++) {
    int row = y - halfFilterSize + i;
    int col = x - halfFilterSize;
    for (int j = col_start; j <= col_end; j++) {
        sum += inputImage[row * imageWidth + col + j] * filter[i * filterWidth + j];
    }
}

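The modified version computes in advance which rows and columns of inputImage the loops start and end at. This removes the per-element checks and speeds up the computation.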
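To make the equivalence of the two loop styles concrete, here is a host-side sketch in plain C that computes one output pixel both ways. The function names `conv_checked` and `conv_ranged` are hypothetical. The upper bounds are written as `half + H - 1 - y` and `half + W - 1 - x`, which follow from requiring `y - half + i <= H - 1` and `x - half + j <= W - 1`, so they hold for any odd filter width.

```c
#include <assert.h>

/* Reference: bounds check on every filter tap, as in serialConv(). */
static float conv_checked(const float *img, int H, int W,
                          const float *filter, int fw, int y, int x)
{
    int half = fw / 2;
    float sum = 0.0f;
    for (int k = -half; k <= half; k++)
        for (int l = -half; l <= half; l++)
            if (y + k >= 0 && y + k < H && x + l >= 0 && x + l < W)
                sum += img[(y + k) * W + x + l] *
                       filter[(k + half) * fw + l + half];
    return sum;
}

/* Variant: clamp the filter index range once, then loop without checks. */
static float conv_ranged(const float *img, int H, int W,
                         const float *filter, int fw, int y, int x)
{
    assert(fw > 0 && (fw & 1));
    int half = fw / 2;
    /* Inclusive filter index ranges whose image pixel is inside the image. */
    int row_start = y - half >= 0 ? 0 : half - y;
    int row_end   = y + half < H ? fw - 1 : half + H - 1 - y;
    int col_start = x - half >= 0 ? 0 : half - x;
    int col_end   = x + half < W ? fw - 1 : half + W - 1 - x;

    float sum = 0.0f;
    for (int i = row_start; i <= row_end; i++) {
        int row = y - half + i;
        for (int j = col_start; j <= col_end; j++)
            sum += img[row * W + (x - half + j)] * filter[i * fw + j];
    }
    return sum;
}
```

Both functions visit the same taps in the same order, so they should return identical sums for every pixel, including pixels near the border and images smaller than the filter.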

Q2

Rewrite the program using CUDA.

(1) Explain your CUDA implementation

__global__ void convolution(int filterWidth, float *filter, int imageHeight, int imageWidth, float *inputImage, float *outputImage)
{
    // One thread per output pixel: i is the column, j is the row
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    // Threads in a partial block at the right or bottom edge fall outside the image
    if (i >= imageWidth || j >= imageHeight)
        return;

    int halffilterSize = filterWidth >> 1;
    float sum = 0.0f;
    int k, l;

    // Apply the filter to the neighborhood
    for (k = -halffilterSize; k <= halffilterSize; k++)
    {
        for (l = -halffilterSize; l <= halffilterSize; l++)
        {
            if (j + k >= 0 && j + k < imageHeight &&
                i + l >= 0 && i + l < imageWidth)
            {
                sum += inputImage[(j + k) * imageWidth + i + l] *
                       filter[(k + halffilterSize) * filterWidth + l + halffilterSize];
            }
        }
    }
    outputImage[j * imageWidth + i] = sum;
}


extern "C" void hostFE(int filterWidth, float *filter, int imageHeight, int imageWidth,
                       float *inputImage, float *outputImage)
{
    float *d_filter, *d_inputImage, *d_outputImage;
    size_t filterSize = (size_t)filterWidth * filterWidth * sizeof(float);
    size_t inputImageSize = (size_t)imageHeight * imageWidth * sizeof(float);
    size_t outputImageSize = inputImageSize;

    cudaMalloc(&d_filter, filterSize);
    cudaMalloc(&d_inputImage, inputImageSize);
    cudaMalloc(&d_outputImage, outputImageSize);

    // cp mem to device
    cudaMemcpy(d_filter, filter, filterSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_inputImage, inputImage, inputImageSize, cudaMemcpyHostToDevice);

    int block_size = 16;

    dim3 threadsPerBlock(block_size, block_size);
    // Round up so that images whose size is not a multiple of block_size are fully covered
    dim3 numBlocks((imageWidth + block_size - 1) / block_size, (imageHeight + block_size - 1) / block_size);
    // kernel: grid dimensions come first, then block dimensions
    convolution<<<numBlocks, threadsPerBlock>>>(filterWidth, d_filter, imageHeight, imageWidth, d_inputImage, d_outputImage);

    // cp mem to host
    cudaMemcpy(outputImage, d_outputImage, outputImageSize, cudaMemcpyDeviceToHost);

    // free mem
    cudaFree(d_outputImage);
    cudaFree(d_inputImage);
    cudaFree(d_filter);
}
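Dividing an image dimension by the block size with integer division rounds down, so a dimension that is not a multiple of the block size leaves its last partial tile without threads. The usual remedy is to round the block count up and have out-of-range threads return early in the kernel. A minimal sketch of the round-up in plain C, with `blocks_for` as a hypothetical helper name:

```c
#include <assert.h>

/* Number of blocks of `block` threads needed to cover `n` elements. */
static int blocks_for(int n, int block)
{
    assert(n >= 0 && block > 0);
    return (n + block - 1) / block;   /* ceiling of n / block */
}
```

For an illustrative width of 600 and a block size of 16, plain division gives 37 blocks (592 columns), while the rounded-up count is 38 blocks (608 columns), of which the last 8 columns of threads must do nothing.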

(2) plot a chart to show the performance difference between using OpenCL and CUDA

| Implementation | filter 1 | filter 2 | filter 3 |
| --- | --- | --- | --- |
| OpenCL (optimized) | 0.416 | 0.404 | 0.412 |
| OpenCL | 0.671 | 0.482 | 0.674 |
| CUDA | 0.850 | 0.565 | 0.687 |
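To compare the rows, it can help to turn the timings into ratios. The sketch below copies the numbers from the table and divides one row by another; the array and function names are made up for this note, and the ratio is unitless, so it does not depend on the unit of the measurements.

```c
#include <assert.h>

/* Timings copied from the table above, one entry per filter. */
static const double opencl_opt[3] = {0.416, 0.404, 0.412};
static const double opencl[3]     = {0.671, 0.482, 0.674};
static const double cuda[3]       = {0.850, 0.565, 0.687};

/* How many times longer `slow` takes than `fast` for filter index i (0..2). */
static double time_ratio(const double *slow, const double *fast, int i)
{
    assert(i >= 0 && i < 3 && fast[i] > 0.0);
    return slow[i] / fast[i];
}
```

For instance, `time_ratio(cuda, opencl, 0)` compares CUDA with the unoptimized OpenCL version on filter 1, and `time_ratio(opencl, opencl_opt, 0)` shows how much the OpenCL optimizations gained on the same filter.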

(3) explain the result.

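I implemented a plain, unoptimized version of the convolution in CUDA.
My CUDA results are slower than the OpenCL ones, especially for filter 1.

The literature I have read so far says that CUDA is faster than OpenCL.
My CUDA results are slower nonetheless. One possible reason is that my CUDA implementation is not good enough. Another is that my OpenCL program distinguishes the memory types of the transferred buffers more finely (global, constant), which lets it beat CUDA.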
