# Parallel Programming HW-5

:::info
***Q1***: (5 points) What are the pros and cons of the three methods? Give an assumption about their performances.
:::

* **Method 1**
    * Pros: Each GPU thread handles exactly one pixel. When the workload is balanced, this performs well. In addition, cudaHostAlloc() takes less time than cudaMallocPitch().
    * Cons: In the Mandelbrot image the white regions need far more computation, so the amount of work differs from thread to thread and the overall workload is unbalanced.
* **Method 2**
    * Pros: cudaMallocPitch() automatically pads the allocation so that it is aligned to a multiple of 256. Since CUDA fetches memory in units of 256 per access, this alignment reduces the number of unnecessary accesses.
    * Cons: When the requested size is not already a multiple of 256, extra unused space has to be allocated.
* **Method 3**
    * Pros: Each thread processes a group of pixels. Think of it as cutting the original view into equally sized blocks, with each thread handling one block. The benefit is that fewer threads are needed.
    * Cons: As noted above, the white regions need far more computation. In view 1, many blocks fall entirely on white regions, which worsens the workload imbalance already present in Method 1.

:::info
***Q2***: (5 points) How are the performances of the three methods? Plot a chart to show the differences among the three methods.
* for VIEW 1 and VIEW 2, and
* for different maxIteration (1000, 10000, and 100000).
:::

**view 1**
![Bar chart 1](https://i.imgur.com/Omd5Aga.png)

**view 2**
![Bar chart 2](https://i.imgur.com/BfJfMRH.png)

Performance ranking (fastest first): Method 1 > Method 2 > Method 3

:::info
***Q3***: (10 points) Explain the performance differences thoroughly based on your experimental results. Does the results match your assumption? Why or why not.
:::

**view 1**
![View 1](https://imgur.com/3zYNCtQ.png)

**view 2**
![View 2](https://imgur.com/0QBvn4o.png)

1. First, comparing view 1 and view 2, view 1 contains more white area, that is, more pixels that are expensive to compute. This makes its workload less balanced than that of view 2. The charts confirm this: every measured time for view 1 is much larger than the corresponding time for view 2.
2. I suspect that because enough threads are available, Method 1 generally outperforms the other two methods, and the gap becomes more pronounced as maxIteration grows.
3. Methods 1 and 2 both use one thread per pixel, whereas Method 3 has each thread process a group of pixels. Grouping worsens the workload imbalance, and the measurements show this as well. One possible improvement is to split each group into finer blocks.
4. Between Methods 1 and 2, Method 1 is generally faster. I suspect our data is small enough that the alignment brings no benefit, and cudaMallocPitch() takes longer than cudaHostAlloc(), so Method 2 ends up slower.

:::info
***Q4***: (6 points) Can we do even better?
Think a better approach and explain it. Implement your method in kernel4.cu.
:::

* Do not allocate the host_img buffer. Instead, copy the result directly into the pointer that is passed in. This saves the time spent on the extra copy(), alloc(), and free().

**view 1**
![Bar chart 1](https://i.imgur.com/dCV6FaF.png)

**view 2**
![Bar chart 2](https://i.imgur.com/NhKov0M.png)
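
The Q4 idea above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the submitted kernel4.cu: the names `mandel`, `mandelKernel`, and `renderDirect`, the parameter list, and the 16x16 block size are all hypothetical choices made for this example. The only point it tries to show is that the device result is copied straight into the caller's `img` buffer, so no intermediate host_img allocation, copy, or free is needed.

```cuda
#include <cuda_runtime.h>

// Escape-time iteration count for the point c = (c_re, c_im).
// Marked __host__ __device__ so it can also be checked on the CPU.
__host__ __device__ inline int mandel(float c_re, float c_im, int maxIterations) {
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < maxIterations; ++i) {
        if (z_re * z_re + z_im * z_im > 4.f) break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}

// One thread per pixel, as in Method 1.
__global__ void mandelKernel(float lowerX, float lowerY, float stepX, float stepY,
                             int *d_img, int resX, int resY, int maxIterations) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= resX || y >= resY) return;  // guard threads outside the image
    d_img[y * resX + x] = mandel(lowerX + x * stepX, lowerY + y * stepY, maxIterations);
}

// Hypothetical host entry point (assumed signature): renders into the
// caller-provided buffer `img`, which must hold resX * resY ints.
cudaError_t renderDirect(float upperX, float upperY, float lowerX, float lowerY,
                         int *img, int resX, int resY, int maxIterations) {
    float stepX = (upperX - lowerX) / resX;
    float stepY = (upperY - lowerY) / resY;
    size_t bytes = (size_t)resX * resY * sizeof(int);

    int *d_img = nullptr;
    cudaError_t err = cudaMalloc(&d_img, bytes);
    if (err != cudaSuccess) return err;

    dim3 block(16, 16);  // assumed block size, not tuned
    dim3 grid((resX + block.x - 1) / block.x, (resY + block.y - 1) / block.y);
    mandelKernel<<<grid, block>>>(lowerX, lowerY, stepX, stepY,
                                  d_img, resX, resY, maxIterations);
    err = cudaGetLastError();

    // Copy straight into the caller's buffer: no host_img in between.
    // cudaMemcpy waits for the kernel on the default stream to finish.
    if (err == cudaSuccess)
        err = cudaMemcpy(img, d_img, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_img);
    return err;
}
```

A caller would pass its own output array, for example `renderDirect(x1, y1, x0, y0, output, width, height, maxIterations)`, and check that the return value is `cudaSuccess`. Compared with a version that stages the result in a separate host buffer, this removes one host allocation, one host-to-host copy, and one free per call.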