# <center> Parallel Programming NYCU Fall 2022 HW5 </center>
<p class="text-right"> 310552035 張竣傑 </p>
## Q1
### **What are the pros and cons of the three methods? Give an assumption about their performances.**
* Method 1:
    - pros:
        1. Each CUDA thread processes one pixel, so unlike the CPU version the pixels are computed in parallel.
    - cons:
        1. As the earlier assignments showed, the workload of each thread is far from uniform, so the parallelization is incomplete: the overall runtime is dragged down by the threads that take the longest to finish.
* Method 2:
    - pros:
        1. Uses page-locked (pinned) memory, which keeps the data resident in physical memory; no page replacement occurs and host-device transfer time is reduced.
        2. Uses cudaMallocPitch to allocate GPU memory, which aligns each row so that threads can access it faster (see the sketch after this list).
    - cons:
        1. The kernel does not read GPU memory heavily, so pitched memory gives no particular speedup here.
        2. The alignment from the second advantage requires extra padding for each row, so more memory must be allocated and some space is wasted.
* Method 3:
    - pros:
        1. Same as the first point of Method 2: page-locked memory reduces data-transfer time.
        2. Same as the second point of Method 2: cudaMallocPitch is used to allocate GPU memory, speeding up thread accesses.
    - cons:
        1. Each thread processes a group of pixels, so the load imbalance is even worse and the degree of parallelism suffers.
        2. cudaMallocPitch needs extra space for row padding, wasting memory.

Given these pros and cons, my assumption is that Method 1 and Method 2 will perform about the same (Method 2 perhaps slightly faster thanks to pinned memory), while Method 3 will be the slowest because of its load imbalance and the smaller number of threads.
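
To make the Method 2 memory setup concrete, here is a minimal sketch, not the actual kernel2.cu: the function and kernel names, the 16x16 block size, and the placeholder kernel body are my own illustration, and only the combination of cudaHostAlloc, cudaMallocPitch, and cudaMemcpy2D is taken from the description above.

``` c++
#include <cuda_runtime.h>
#include <string.h>

// Illustrative kernel: shows only how a pitched buffer is indexed.
__global__ void pitchedWriteSketch(int *device_out, size_t pitch, int resX, int resY)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= resX || y >= resY) return;
    // Rows are `pitch` bytes apart, so step through a char* to reach row y.
    int *row = (int *)((char *)device_out + y * pitch);
    row[x] = 0; // placeholder for the Mandelbrot iteration count
}

// Sketch of the Method 2 style of allocation (names are assumptions).
void method2MemorySketch(int *img, int resX, int resY)
{
    int *host_out;   // page-locked host staging buffer: no page replacement, faster DMA
    int *device_out; // pitched device buffer: every row starts on an aligned boundary
    size_t pitch;    // row stride in bytes, chosen by the CUDA runtime

    cudaHostAlloc((void **)&host_out, resX * resY * sizeof(int), cudaHostAllocDefault);
    cudaMallocPitch((void **)&device_out, &pitch, resX * sizeof(int), resY);

    dim3 block(16, 16);
    dim3 grid((resX + block.x - 1) / block.x, (resY + block.y - 1) / block.y);
    pitchedWriteSketch<<<grid, block>>>(device_out, pitch, resX, resY);

    // Copy the pitched 2D region into the tightly packed pinned buffer,
    // then hand the result to the caller.
    cudaMemcpy2D(host_out, resX * sizeof(int), device_out, pitch,
                 resX * sizeof(int), resY, cudaMemcpyDeviceToHost);
    memcpy(img, host_out, resX * resY * sizeof(int));

    cudaFreeHost(host_out);
    cudaFree(device_out);
}
```

Since the Mandelbrot kernel itself stays the same, this only changes where the buffers live, which is why the Method 2 numbers in Q2 end up so close to Method 1's.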
## Q2
### **How are the performances of the three methods? Plot a chart to show the differences among the three methods**
The tables below list the measured execution times of the three methods for the two views at 1000, 10000, and 100000 iterations.

View 1:
| view1   | 1000 iterations | 10000 iterations | 100000 iterations |
| ------- | --------------- | ---------------- | ----------------- |
| method1 | 7.754           | 33.040           | 294.666           |
| method2 | 7.932           | 33.233           | 294.102           |
| method3 | 12.506          | 66.702           | 324.267           |

View 2:
| view2   | 1000 iterations | 10000 iterations | 100000 iterations |
| ------- | --------------- | ---------------- | ----------------- |
| method1 | 5.043           | 7.748            | 28.589            |
| method2 | 4.960           | 7.797            | 28.147            |
| method3 | 9.968           | 14.388           | 48.083            |

## Q3
### Explain the performance differences thoroughly based on your experimental results. Do the results match your assumption? Why or why not?
* Method 1:

    According to the experimental results, Method 1 runs at almost the same speed as Method 2 for both views, which matches the assumption above.

* Method 2:

    Profiling both versions with nvprof shows that although Method 2 uses pitched memory, it gains no real advantage on the GPU. This is consistent with the point made in Q1 that the kernel does not access GPU memory heavily, so row alignment hardly matters.
* Method 3:

    The experimental results show that Method 3 is the slowest of the three, which also matches the assumption. It does not fully exploit the GPU's massive parallelism: each thread has to process several pixels while some threads are left idle, so the overall speed drops (see the sketch below).
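
For reference, the following is a rough sketch of the Method 3 style of mapping, assuming each thread walks over a small GROUP x GROUP tile of pixels; the tile size and kernel name are my own illustration, not the actual kernel3.cu. It shows why one slow tile stalls an entire thread and why far fewer threads are launched overall.

``` c++
#include <cuda_runtime.h>

#define GROUP 4   // illustrative tile edge; launched with about (resX/GROUP) x (resY/GROUP) threads

__global__ void mandelGroupKernelSketch(int *out, float lowerX, float lowerY,
                                        float stepX, float stepY,
                                        int resX, int resY, int maxIterations)
{
    // Each thread owns a GROUP x GROUP tile of pixels.
    int baseX = (blockIdx.x * blockDim.x + threadIdx.x) * GROUP;
    int baseY = (blockIdx.y * blockDim.y + threadIdx.y) * GROUP;

    for (int dy = 0; dy < GROUP; ++dy) {
        for (int dx = 0; dx < GROUP; ++dx) {
            int x = baseX + dx, y = baseY + dy;
            if (x >= resX || y >= resY) continue;

            float c_re = lowerX + x * stepX;
            float c_im = lowerY + y * stepY;
            float z_re = c_re, z_im = c_im;
            int iter;
            for (iter = 0; iter < maxIterations; ++iter) {
                if (z_re * z_re + z_im * z_im > 4.f) break;
                float new_re = z_re * z_re - z_im * z_im;
                float new_im = 2.f * z_re * z_im;
                z_re = c_re + new_re;
                z_im = c_im + new_im;
            }
            // One tile containing many high-iteration pixels keeps this thread
            // busy long after its neighbours have finished.
            out[y * resX + x] = iter;
        }
    }
}
```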
## Q4
### Can we do even better? Think a better approach and explain it. Implement your method in kernel4.cu.
Because the three methods are meant to compare memory-allocation strategies, they all allocate an extra staging buffer on the host first. Instead, kernel4.cu copies the result directly from device memory into the pointer provided for the answer (img), which removes the extra host buffer and saves a large amount of data-transfer time.
``` c++
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
#define BLOCK_SIZE 16
__global__ void mandelKernel(
    int *device_out, float lowerX, float lowerY,
    float stepX, float stepY,
    int resX, int resY, int maxIterations)
{
    // To avoid error caused by the floating number, use the following pseudo code
    //
    // float x = lowerX + thisX * stepX;
    // float y = lowerY + thisY * stepY;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= resX || y >= resY)   // guard threads outside the image
        return;

    float c_re = lowerX + x * stepX;
    float c_im = lowerY + y * stepY;
    float z_re = c_re;
    float z_im = c_im;
    int iter;
    for (iter = 0; iter < maxIterations; iter++) {
        if (z_re * z_re + z_im * z_im > 4.f)
            break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    device_out[x + y * resX] = iter;
}

// Host front-end function that allocates the memory and launches the GPU kernel
void hostFE(float upperX, float upperY, float lowerX, float lowerY,
            int *img, int resX, int resY, int maxIterations)
{
    float stepX = (upperX - lowerX) / resX;
    float stepY = (upperY - lowerY) / resY;

    int *device_out;
    size_t size = resX * resY * sizeof(int);
    cudaMalloc(&device_out, size);

    dim3 threadsPerBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 numBlocks((resX + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (resY + threadsPerBlock.y - 1) / threadsPerBlock.y);
    mandelKernel<<<numBlocks, threadsPerBlock>>>(
        device_out, lowerX, lowerY,
        stepX, stepY, resX, resY, maxIterations);

    // Copy the result straight into the caller-provided buffer:
    // no host-side staging buffer and no extra memcpy.
    cudaMemcpy(img, device_out, size, cudaMemcpyDeviceToHost);
    cudaFree(device_out);
}
```