# Parallel Programming HW5

* Name: 許永平
* Student ID: 0716208
* Department: Computer Science

## Q1
> What are the pros and cons of the three methods? Give an assumption about their performances.

* method 1
    * Pros: each thread only has to compute one pixel.
    * Cons: some pixels need little computation while others need much more, so thread execution times differ and overall utilization is low.
* method 2
    * Pros: the GPU reads memory most efficiently when accesses start at a suitable multiple (for example, 16 bytes). cudaMallocPitch automatically pads the allocation to a suitable multiple, so data access becomes more efficient.
    * Cons: if the data size is not already a multiple of 16 bytes, extra aligned memory has to be allocated.
* method 3
    * Pros: each thread computes several pixels at a time, which can make fuller use of each thread.
    * Cons: the workload of each thread becomes heavier, and because the workload assigned to each thread may be uneven, performance may actually drop.

## Q2
> How are the performances of the three methods? Plot a chart to show the differences among the three methods
> * for VIEW 1 and VIEW 2, and
> * for different maxIteration (1000, 10000, and 100000).

* View 1

| maxIteration | method 1   | method 2   | method 3   |
| ------------ | ---------- | ---------- | ---------- |
| 1000         | 5.838 ms   | 5.872 ms   | 6.525 ms   |
| 10000        | 31.408 ms  | 31.476 ms  | 34.931 ms  |
| 100000       | 284.521 ms | 286.359 ms | 321.193 ms |

![](https://i.imgur.com/KbSsTEJ.png)

* View 2

| maxIteration | method 1   | method 2   | method 3   |
| ------------ | ---------- | ---------- | ---------- |
| 1000         | 4.032 ms   | 4.863 ms   | 5.536 ms   |
| 10000        | 6.033 ms   | 7.013 ms   | 8.204 ms   |
| 100000       | 28.415 ms  | 28.636 ms  | 39.097 ms  |

![](https://i.imgur.com/TwbqksZ.png)

Performance (fastest to slowest): method 1 > method 2 > method 3

## Q3
> Explain the performance differences thoroughly based on your experimental results. Does the results match your assumption? Why or why not.

In methods 1 and 2 each thread computes a single pixel, so compared with method 3 they suffer less from unevenly distributed workload and therefore run faster. In method 3 each thread computes several pixels, so the per-thread load is already higher, and the workload imbalance can be quite severe: some threads are assigned pixels that need little computation while others are assigned pixels that need a lot. The lightly loaded threads then have to wait for the heavily loaded ones, which makes method 3 slower than methods 1 and 2.

View 2 does not have the severe workload imbalance that View 1 has, so it runs faster than View 1.

## Q4
> Can we do even better? Think a better approach and explain it. Implement your method in `kernel4.cu`

Instead of allocating a separate buffer on the host, copy the device memory directly into img with cudaMemcpy. This saves the time spent on the extra allocation, the extra copy, and the free. In other words, change method 1's

```cpp
int *Mh = (int *)malloc(size);                     // temporary host buffer
cudaMemcpy(Mh, Md, size, cudaMemcpyDeviceToHost);  // device -> temporary buffer
cudaFree(Md);
memcpy(img, Mh, size);                             // temporary buffer -> img
free(Mh);
```

to

```cpp
cudaMemcpy(img, Md, size, cudaMemcpyDeviceToHost); // device -> img directly
cudaFree(Md);
```

In addition, use **#pragma unroll(n)** to unroll the for loop in mandel, which cuts loop-control overhead and gives the compiler more room for instruction-level parallelism.

This method (method 4) compares with the other methods as follows:

* View 1

| maxIteration | method 1   | method 2   | method 3   | method 4   |
| ------------ | ---------- | ---------- | ---------- | ---------- |
| 1000         | 7.212 ms   | 7.237 ms   | 6.905 ms   | 4.598 ms   |
| 10000        | 36.559 ms  | 36.835 ms  | 33.204 ms  | 28.708 ms  |
| 100000       | 340.264 ms | 341.807 ms | 304.886 ms | 268.775 ms |

![](https://i.imgur.com/GgW7XBj.png)

* View 2

| maxIteration | method 1   | method 2   | method 3   | method 4   |
| ------------ | ---------- | ---------- | ---------- | ---------- |
| 1000         | 3.989 ms   | 5.475 ms   | 5.484 ms   | 2.868 ms   |
| 10000        | 7.445 ms   | 7.563 ms   | 7.982 ms   | 5.916 ms   |
| 100000       | 32.417 ms  | 32.716 ms  | 38.251 ms  | 25.762 ms  |

![](https://i.imgur.com/Zq6KH3l.png)

The results show that the execution time did indeed drop somewhat.
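To make the unrolling step in Q4 more concrete, here is a minimal host-side C++ sketch of what a 4-way unroll of the mandel loop could look like when written by hand. It is only an illustration under assumptions: the body of `mandelStep` follows the usual Mandelbrot escape-time iteration, which this report does not show, and the names `mandelStep`, `mandelRolled`, and `mandelUnrolled4` are hypothetical. The code that nvcc actually generates for `#pragma unroll` may differ.

```cpp
#include <cassert>

// Hypothetical helper: one escape-time step. Returns false once |z|^2 > 4,
// otherwise advances z = z^2 + c and returns true.
static inline bool mandelStep(float c_re, float c_im, float &z_re, float &z_im) {
    if (z_re * z_re + z_im * z_im > 4.f) return false;
    float new_re = z_re * z_re - z_im * z_im;
    float new_im = 2.f * z_re * z_im;
    z_re = c_re + new_re;
    z_im = c_im + new_im;
    return true;
}

// Rolled loop: the bound check i < maxIteration runs before every step.
int mandelRolled(float c_re, float c_im, int maxIteration) {
    assert(maxIteration >= 0);
    float z_re = c_re, z_im = c_im;
    int i = 0;
    while (i < maxIteration && mandelStep(c_re, c_im, z_re, z_im)) ++i;
    return i;
}

// Hand-unrolled by 4: the bound check runs once per four steps.
// Each step still depends on the previous one, so the steps stay sequential.
int mandelUnrolled4(float c_re, float c_im, int maxIteration) {
    assert(maxIteration >= 0);
    float z_re = c_re, z_im = c_im;
    int i = 0;
    for (; maxIteration - i >= 4; i += 4) {
        if (!mandelStep(c_re, c_im, z_re, z_im)) return i;
        if (!mandelStep(c_re, c_im, z_re, z_im)) return i + 1;
        if (!mandelStep(c_re, c_im, z_re, z_im)) return i + 2;
        if (!mandelStep(c_re, c_im, z_re, z_im)) return i + 3;
    }
    // Remainder loop for the last maxIteration % 4 steps.
    while (i < maxIteration && mandelStep(c_re, c_im, z_re, z_im)) ++i;
    return i;
}
```

Both functions return the same iteration count for any input. The unrolled one simply evaluates the loop bound less often, which is where the saving would come from.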
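The workload-imbalance argument from Q1 and Q3 can also be illustrated with a small host-side C++ sketch. It assumes that the threads waiting on each other finish only when the slowest one finishes (as the lanes of one warp would), and it uses the usual escape-time form of `mandel`, which this report does not show. `warpUtilization` is a hypothetical helper, not part of the assignment code.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Assumed escape-time iteration: returns how many steps were taken before
// |z|^2 exceeded 4, capped at maxIteration.
int mandel(float c_re, float c_im, int maxIteration) {
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < maxIteration; ++i) {
        if (z_re * z_re + z_im * z_im > 4.f) break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}

// Hypothetical model: every thread in the group is occupied until the slowest
// one finishes, so utilization = useful iterations / (slowest * thread count).
double warpUtilization(const std::vector<int> &iters) {
    assert(!iters.empty());
    long long useful = std::accumulate(iters.begin(), iters.end(), 0LL);
    int slowest = *std::max_element(iters.begin(), iters.end());
    if (slowest == 0) return 1.0;  // no work at all, nothing is wasted
    return static_cast<double>(useful) /
           (static_cast<double>(slowest) * iters.size());
}
```

For example, a group in which one thread handles a point inside the set (`mandel(0.f, 0.f, 1000)`) and another handles a point that escapes immediately (`mandel(2.f, 2.f, 1000)`) has a utilization of 0.5 under this model, while a group whose threads all need the same number of iterations reaches 1.0.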