Parallel Programming HW5

--- tags: 111上 --- # Parallel Programming HW5 > 310552027 郭家良 ### Run code using shell script :::spoiler - runnning shell script ```shell # bin/sh ITERATIONS=(1000 10000 100000) VIEWS=(1 2) IDS=(1 2 3) for id in "${IDS[@]}" do echo "processing kernel$id.cu" cp "kernel$id.cu" "kernel.cu" make for iter in "${ITERATIONS[@]}" do for view in "${VIEWS[@]}" do echo "====== running view $view with iteration $iter ======" ./mandelbrot -v $view -i $iter done done done ``` ::: #### Q1 What are the pros and cons of the three methods? Give an assumption about their performances. #### Method 1: - pros: - 每個 thread 都處理一個 pixel ，最大化了程式的平行度 - 直接使用 `cudaMemcpy` 做 data transfer，雖然 CPU 端使用 pagable memory，但不會像 paged-locked memory 需要額外的 system performance。 - cons: - 每個 thread 計算量是不一樣的，有些 image pixel 要 iteration 的次數很多，而有些可能早早結束，會發生完成的 thread 等待未完成的 thread。 - 因為所使用的 memory 是 pagable，因此需要時可能不在 memory 裡，需等待傳輸。 #### Method 2: - pros: - 同 **Method 1**，每個 thread 都處理一個 pixel ，最大化了程式的平行度 - 使用 allocate CPU memory 時利用 `cudaHostAlloc` 來 paged-locked memory，可以保證需要時一定在 memory 裡，避免傳輸時間。 - 使用 `cudaMallocPitch` 分配 GPU memory，可以保證分配到對齊後的記憶體空間，提高記憶體訪問效率。 - cons: - 同 **Method 1**，每個 thread 計算量是不一樣的，有些 image pixel 要 iteration 的次數很多，而有些可能早早結束，會發生完成的 thread 等待未完成的 thread。 - 使用 paged-locked memory，會需要額外的 system performance 來管理記憶體，可能會拖慢執行時間。 - 使用 `cudaMallocPitch` 分配 GPU memory 為了保證記憶體對齊會多分配不須使用到的記憶體空間，造成空間上浪費。 #### Method 3: - pros: - 每個 thread 處理一個 group 的 pixel， thread 的計算量會較上述兩種方法平均，較不會發生等待問題。 - 同 **Method 2**，使用 allocate CPU memory 時利用 `cudaHostAlloc` 來 paged-locked memory，可以保證需要時一定在 memory 裡，避免傳輸時間。 - 同 **Method 2**，使用 `cudaMallocPitch` 分配 GPU memory，可以保證分配到對齊後的記憶體空間，提高記憶體訪問效率。 - cons: - 處理一個 group 的 pixel 等同於有 serial 執行的部份，其實並沒有充分利用到平行度 - 同 **Method 2**，使用 paged-locked memory，會需要額外的 system performance 來管理記憶體，可能會拖慢執行時間。 - 同 **Method 2**，使用 `cudaMallocPitch` 分配 GPU memory 為了保證記憶體對齊會多分配不須使用到的記憶體空間，造成空間上浪費。 #### Assumption: - Method 1 的 data transfer 方式雖然因為 CPU 端 memory 為 pagable 的記憶體可能會有需要時資料不在記憶體需要額外傳輸的問題，但整體效能上因為充分利用了平行度、且也不像 Method 2 需要額外的 system performance 來管理 paged-lock memory，所以整體效能在實驗上是最好的。 - Method 2 也充分利用平行度，但我猜想 CPU 端使用 paged-lock memory 的額外 system overhead 是會把使用 paged-lock memory 的好處給抵消掉的，因此會叫 Method 1 差一些些 - Method 3 有較好的 load balance，但畢竟需要靠 serial 的執行來平均化執行時間，所以時間是需要最久的 #### Q2 How are the performances of the three methods? Plot a chart to show the differences among the three methods 下圖為 View 2 以及 View 2 搭配不同 iteration number 以及 method 的結果，Y 軸為 speedup，相對於 serial 的版本 ![](https://i.imgur.com/nuiyuA6.png) ![](https://i.imgur.com/F4h7jWf.png) #### Q3 Explain the performance differences thoroughly based on your experimental results. Does the results match your assumption? Why or why not. - 根據實驗結果可以看到，每個 view, 每個 iteration 來說普遍是是 Method 1 的 speedup 最大、 Method 2 次之，最慢的是 Method 3，大致是是符合剛剛說的假設。 - 比較特別的是 View 1 的 Method 2 在 iteration 為 100000 時表現是最差的，這有點不符合預期，其實對於這點我並沒有想到一個很好的解釋 #### Q4 Can we do even better? Think a better approach and explain it. Implement your method in `kernel4.cu` 根據前面的回答，因為 Method 1 是最快的版本，所以我決定修改其程式碼來實做 Method 4。為了優化程式，我做了以下改進 1. 拿掉不必要的 `malloc` ```diff - int *h_img; int *d_img; - h_img = (int*)malloc(resX * resY * sizeof(int)); cudaMalloc((void **)&d_img, resX * resY * sizeof(int)); ... - cudaMemcpy(h_img, d_img, resX * resY * sizeof(int), cudaMemcpyDeviceToHost); - memcpy(img, h_img, resX * resY * sizeof(int)); + memcpy(img, d_img, resX * resY * sizeof(int)); ``` 2. 根據散列的數學定理一（參考自 [wiki](https://zh.wikipedia.org/wiki/%E6%9B%BC%E5%BE%B7%E5%8D%9A%E9%9B%86%E5%90%88) ): ![](https://i.imgur.com/z2NnDSX.png) 在 kernel function 加上一個判斷，可以提早結束一些 thread，避免它們需要跑完整個 maxIterations 的計算 ```diff __global__ void mandelKernel(int* d_img, int width, float lowerX, float lowerY, float stepX, float stepY, int maxIterations) { // To avoid error caused by the floating number, use the following pseudo code // int thisX = blockIdx.x * blockDim.x + threadIdx.x; int thisY = blockIdx.y * blockDim.y + threadIdx.y; float x = lowerX + thisX * stepX; float y = lowerY + thisY * stepY; float x_tmp = x; float y_tmp = y; + if (x_tmp * x_tmp + y_tmp * y_tmp <= 0.0625) { + d_img[thisY * width + thisX] = maxIterations; + return; + } int i; for (i = 0; i < maxIterations && x_tmp * x_tmp + y_tmp * y_tmp <= 4.f; ++i) { x_tmp = x + (x_tmp * x_tmp - y_tmp * y_tmp); y_tmp = y + (2.f * x_tmp * y_tmp); } d_img[thisY * width + thisX] = i; } ``` 3. 修改 Thread Block 的 Size 如下，經過我的實驗我發現這樣是最快的 ```diff - #difine BLOCK_SIZE 32 + #define BLOCK_SIZE_X 16 + #define BLOCK_SIZE_Y 12 ... - dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE); - dim3 numBlock(resX / BLOCK_SIZE, resY / BLOCK_SIZE); + dim3 blockSize(BLOCK_SIZE_X, BLOCK_SIZE_Y); + dim3 numBlock(resX / BLOCK_SIZE_X, resY / BLOCK_SIZE_Y); ```