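# Programming Assignment HW5

## Q1:
> What are the pros and cons of the three methods? Give an assumption about their performances.

**Method1:**
* pros : Simple to implement, and the allocation overhead is small.
* cons : The host and device buffers are not mapped to each other (the host buffer is ordinary pageable memory), so `cudaMemcpy()` and `memcpy()` take longer and the data-transfer overhead is larger.

**Method2:**
* pros : `cudaHostAlloc` allocates pinned (page-locked) memory, which the GPU can access directly, saving some of the copies between different memories and the time they take. Allocating GPU memory with `cudaMallocPitch` guarantees that each row is aligned to a multiple of 256 or 512, which improves access efficiency.
* cons : Pinned memory allocated with `cudaHostAlloc` must not be over-allocated; too much of it slows down the whole system. `cudaMallocPitch` allocates extra padding for alignment, and that unused space is wasted. Creating pinned memory and aligned memory also takes extra time, so without a large number of memory accesses the benefit is usually small.

**Method3:**
* pros : Each thread processes more pixels at a time, which saves compute resources. Fewer blocks are launched, which also reduces the number of memory accesses.
* cons : Although compute resources are saved, each thread carries a heavier load, and an uneven distribution of work across threads can lengthen the computation time and lower performance.

**Assumption:**
The overhead that `cudaMallocPitch` incurs to align memory should make Method2 slower than Method1, so Method1 should outperform Method2. Method3 computes a group of pixels per thread, which causes a load-balancing problem, so it should perform the worst. The expected ranking is therefore Method1 > Method2 > Method3.

## Q2:
> How are the performances of the three methods? Plot a chart to show the differences among the three methods

![](https://i.imgur.com/GYvz7Ry.png)
![](https://i.imgur.com/Ph7F5ae.png)
![](https://i.imgur.com/bZGGUFR.png)

## Q3:
> Explain the performance differences thoroughly based on your experimental results. Do the results match your assumption? Why or why not?

**Method1 vs Method2**
The nvprof output shows almost no difference in kernel computation time between the two. However, as MaxIteration grows, Method2 spends more time in `cudaDeviceSynchronize`, which makes its performance worse than Method1's.

**Method2 vs Method3**
The nvprof output for View2 shows that Method3 spends more time on computation than Method2. Compared with View1, View2 has fewer high-workload pixels clustered together, so after the pixels are distributed every group still has a fairly large amount of work, and the total computation time ends up greater than Method2's.

## Q4:
> Can we do even better? Think a better approach and explain it. Implement your method in `kernel4.cu`

The host side does not need its own `malloc`. The result is copied directly into the `img` argument, which saves the time spent on `malloc` and `free`.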
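
The idea above can be sketched roughly as follows. This is a minimal illustration rather than the actual contents of `kernel4.cu`: the `hostFE` signature, the `mandelKernel` name, the Mandelbrot loop, and the 16x16 block size are all assumptions made for this example, and error checking is omitted for brevity. The point it shows is that `cudaMemcpy` writes straight into the caller's `img` buffer, so no intermediate host buffer is allocated, copied, or freed.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel (name and parameters are assumptions):
// one thread per pixel, stores the Mandelbrot iteration count.
__global__ void mandelKernel(float lowerX, float lowerY, float stepX, float stepY,
                             int *out, int resX, int resY, int maxIterations)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= resX || y >= resY) return;  // guard threads outside the image

    float cRe = lowerX + x * stepX;
    float cIm = lowerY + y * stepY;
    float zRe = cRe, zIm = cIm;
    int i;
    for (i = 0; i < maxIterations; ++i) {
        if (zRe * zRe + zIm * zIm > 4.f) break;
        float newRe = zRe * zRe - zIm * zIm;
        float newIm = 2.f * zRe * zIm;
        zRe = cRe + newRe;
        zIm = cIm + newIm;
    }
    out[y * resX + x] = i;
}

// Assumed host entry point: img is the caller-provided output buffer
// of resX * resY ints.
void hostFE(float upperX, float upperY, float lowerX, float lowerY,
            int *img, int resX, int resY, int maxIterations)
{
    float stepX = (upperX - lowerX) / resX;
    float stepY = (upperY - lowerY) / resY;
    size_t bytes = (size_t)resX * resY * sizeof(int);

    int *devImg = nullptr;
    cudaMalloc(&devImg, bytes);

    dim3 block(16, 16);  // block size chosen for illustration only
    dim3 grid((resX + block.x - 1) / block.x, (resY + block.y - 1) / block.y);
    mandelKernel<<<grid, block>>>(lowerX, lowerY, stepX, stepY,
                                  devImg, resX, resY, maxIterations);

    // Copy directly into the caller's buffer: no host-side malloc,
    // no extra memcpy, no free. cudaMemcpy on the default stream
    // waits for the kernel to finish before copying.
    cudaMemcpy(img, devImg, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devImg);
}
```

Because the blocking `cudaMemcpy` already waits for the kernel, this sketch also needs no explicit `cudaDeviceSynchronize` call.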