# <center> Parallel Programming NYCU Fall 2022 HW5 </center>

<p class="text-right"> 310552035 張竣傑 </p>

## Q1
### **What are the pros and cons of the three methods? Give an assumption about their performances.**

* Method 1:
    - pros:
        1. Each CUDA thread processes one pixel, so unlike the serial CPU version, the computation is parallel.
    - cons:
        1. As earlier assignments showed, the workload is uneven across threads, so the parallelization is incomplete: overall performance is dragged down by the threads whose pixels take the longest to compute.
* Method 2:
    - pros:
        1. Page-locked (pinned) host memory guarantees the data stays resident in physical memory, so no page replacement occurs and data-transfer time is reduced. (A sketch of this allocation pattern appears at the end of Q3.)
        2. Allocating GPU memory with cudaMallocPitch guarantees row alignment, which speeds up thread memory accesses.
    - cons:
        1. The kernel does not read GPU memory heavily, so pitched memory brings no particular improvement here.
        2. The row alignment from the second pro above makes the allocation larger, wasting a lot of extra space.
* Method 3:
    - pros:
        1. Same as Method 2's first point: page-locked memory reduces data-transfer time.
        2. Same as Method 2's second point: cudaMallocPitch allocates aligned GPU memory, which speeds up thread memory accesses.
    - cons:
        1. Each thread processes a group of pixels, so the load is unbalanced and the parallel speedup suffers (see the kernel sketch at the end of this report).
        2. cudaMallocPitch makes the allocation larger, wasting a lot of extra space.

## Q2
### **How are the performances of the three methods? Plot a chart to show the differences among the three methods**

Below are the measured execution times for the two views at different iteration counts.

View 1 execution times:

| view1   | 1000   | 10000  | 100000  |
| ------- | ------ | ------ | ------- |
| method1 | 7.754  | 33.040 | 294.666 |
| method2 | 7.932  | 33.233 | 294.102 |
| method3 | 12.506 | 66.702 | 324.267 |

![](https://i.imgur.com/nH7eubo.png)

View 2 execution times:

| view2   | 1000  | 10000  | 100000 |
| ------- | ----- | ------ | ------ |
| method1 | 5.043 | 7.748  | 28.589 |
| method2 | 4.960 | 7.797  | 28.147 |
| method3 | 9.968 | 14.388 | 48.083 |

![](https://i.imgur.com/uYSnXia.png)

## Q3
### **Explain the performance differences thoroughly based on your experimental results. Do the results match your assumptions? Why or why not?**

* Method 1: According to the experimental results, Method 1 and Method 2 run at almost the same speed.
![](https://i.imgur.com/Vs46Z8Z.png)
* Method 2:
![](https://i.imgur.com/HsOLiFp.png)
Comparing the two with nvprof shows that although Method 2 uses pitched memory, it gains no significant advantage on the GPU.
* Method 3:
![](https://i.imgur.com/icBtoY2.png)
The experimental results show that Method 3 is the slowest of all the methods. It does not fully exploit the GPU's parallelism: a single thread has to process multiple pixels while other threads sit idle, which lowers the speed.
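For reference, the snippet below is a minimal sketch of the pinned-plus-pitched allocation pattern that Methods 2 and 3 use, as described in Q1. It is an illustration under assumptions, not the submitted kernel code: `hostFE2` and `pitchedKernel` are made-up names, and the kernel body is a placeholder rather than the Mandelbrot computation.

```c++
#include <cuda.h>
#include <string.h>

// Placeholder kernel: each thread writes one pixel into a pitched row.
// (Illustrative only; the real kernels compute Mandelbrot iterations.)
__global__ void pitchedKernel(int *d_out, size_t pitch, int resX, int resY)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= resX || y >= resY)
        return;

    // Rows are `pitch` bytes apart, so they must be addressed by byte offset.
    int *row = (int *)((char *)d_out + y * pitch);
    row[x] = x + y; // placeholder work
}

void hostFE2(int *img, int resX, int resY)
{
    size_t rowBytes = resX * sizeof(int);

    // Pinned (page-locked) host buffer: it cannot be paged out, so
    // device-to-host transfers avoid an extra staging copy.
    int *h_out;
    cudaHostAlloc((void **)&h_out, rowBytes * resY, cudaHostAllocDefault);

    // Pitched device buffer: each row is padded up to an alignment
    // boundary, which is the extra space overhead noted in Q1.
    int *d_out;
    size_t pitch;
    cudaMallocPitch((void **)&d_out, &pitch, rowBytes, resY);

    dim3 block(16, 16);
    dim3 grid((resX + block.x - 1) / block.x, (resY + block.y - 1) / block.y);
    pitchedKernel<<<grid, block>>>(d_out, pitch, resX, resY);

    // cudaMemcpy2D strips the per-row padding while copying back.
    cudaMemcpy2D(h_out, rowBytes, d_out, pitch, rowBytes, resY,
                 cudaMemcpyDeviceToHost);
    memcpy(img, h_out, rowBytes * resY);

    cudaFree(d_out);
    cudaFreeHost(h_out);
}
```

The pitch returned by cudaMallocPitch is generally larger than the logical row width, which is why the kernel indexes rows by byte offset and why cudaMemcpy2D is needed on the way back; that padding is exactly the wasted space mentioned in Q1.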
## Q4
### **Can we do even better? Think of a better approach and explain it. Implement your method in kernel4.cu.**

Because the three methods above are meant to compare memory-allocation strategies, they all allocate an extra buffer on the host first. Instead, we can copy from device memory directly into the answer pointer provided by the caller, which removes that extra host-side buffer and saves a large amount of data-transfer time.

```c++
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE 16

__global__ void mandelKernel(
    int *device_out, float lowerX, float lowerY,
    float stepX, float stepY,
    int resX, int resY, int maxIterations)
{
    // To avoid error caused by the floating number, use the following pseudo code
    //
    // float x = lowerX + thisX * stepX;
    // float y = lowerY + thisY * stepY;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= resX || y >= resY) // guard against out-of-range threads
        return;

    float c_re = lowerX + x * stepX;
    float c_im = lowerY + y * stepY;

    // Standard Mandelbrot iteration: z = z^2 + c until |z| > 2 or maxIterations.
    float z_re = c_re;
    float z_im = c_im;
    int iter;
    for (iter = 0; iter < maxIterations; iter++) {
        if (z_re * z_re + z_im * z_im > 4.f)
            break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }

    device_out[x + y * resX] = iter;
}

// Host front-end function that allocates the memory and launches the GPU kernel
void hostFE(float upperX, float upperY, float lowerX, float lowerY,
            int *img, int resX, int resY, int maxIterations)
{
    float stepX = (upperX - lowerX) / resX;
    float stepY = (upperY - lowerY) / resY;

    int *device_out;
    int size = resX * resY * sizeof(int);
    cudaMalloc(&device_out, size);

    dim3 threadsPerBlock(BLOCK_SIZE, BLOCK_SIZE);
    // Round up so the grid covers resolutions not divisible by BLOCK_SIZE.
    dim3 numBlocks((resX + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (resY + threadsPerBlock.y - 1) / threadsPerBlock.y);
    mandelKernel<<<numBlocks, threadsPerBlock>>>(
        device_out, lowerX, lowerY, stepX, stepY, resX, resY, maxIterations);

    // Copy the result straight into the caller's buffer: no intermediate
    // host-side buffer, so one copy is saved compared to Methods 1-3.
    cudaMemcpy(img, device_out, size, cudaMemcpyDeviceToHost);
    cudaFree(device_out);
}
```
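For concreteness, below is a minimal sketch of the per-thread pixel-group scheme that Method 3 uses, referenced from Q1 and Q3. The GROUP_SIZE value, the horizontal grouping, and the kernel name are assumptions for illustration, not the actual kernel3 code; the host-side launcher is omitted.

```c++
#define GROUP_SIZE 4 // pixels per thread along x (assumed value)

// Method-3-style kernel sketch: each thread owns a horizontal group of
// GROUP_SIZE pixels, so the grid needs only about resX / GROUP_SIZE
// thread columns. Launch example:
//   grid x = ceil(resX / (blockDim.x * GROUP_SIZE)), grid y = ceil(resY / blockDim.y)
__global__ void mandelGroupKernel(int *d_out, float lowerX, float lowerY,
                                  float stepX, float stepY,
                                  int resX, int resY, int maxIterations)
{
    int gx = (blockIdx.x * blockDim.x + threadIdx.x) * GROUP_SIZE;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (y >= resY)
        return;

    // A thread next to a high-iteration region does up to GROUP_SIZE
    // times the work of its neighbors: the load imbalance noted in Q3.
    for (int k = 0; k < GROUP_SIZE; k++) {
        int x = gx + k;
        if (x >= resX)
            break;

        float c_re = lowerX + x * stepX;
        float c_im = lowerY + y * stepY;
        float z_re = c_re;
        float z_im = c_im;
        int iter;
        for (iter = 0; iter < maxIterations; iter++) {
            if (z_re * z_re + z_im * z_im > 4.f)
                break;
            float new_re = z_re * z_re - z_im * z_im;
            float new_im = 2.f * z_re * z_im;
            z_re = c_re + new_re;
            z_im = c_im + new_im;
        }
        d_out[x + y * resX] = iter;
    }
}
```

Because the group loop always runs to the slowest pixel in the group, this scheme amplifies the per-thread imbalance that already limited Method 1, which is consistent with Method 3 being the slowest configuration in the Q2 measurements.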