PP Assignment 5

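# PP Assignment 5
###### tags: `PP`

:::info
**Webpage** [Description](https://nycu-sslab.github.io/PP-f22/assignments/HW5/)
:::

## Q1
:::success
What are the pros and cons of the three methods? Give an assumption about their performances.
:::

**Answer**

|Method #|pros|cons|
|-|-|-|
|1|1. Each thread's work is simple: it only handles a single pixel. 2. Pageable memory is backed by virtual memory and can be swapped out to disk (SSD), so host memory is less likely to run out.|The host buffer is allocated with ```malloc``` and is therefore pageable, so each transfer between the host and the GPU goes through an API call that first stages the data in a temporary buffer.|
|2|1. ```cudaHostAlloc``` allocates the host buffer as pinned (page-locked) memory. Because it cannot be paged out, it is guaranteed to stay in physical memory and no staging copy is needed. 2. ```cudaMallocPitch``` handles the 2D data and guarantees that the address of the first element of every row is aligned to a multiple of 256 or 512, which improves access efficiency.|1. All of the data has to stay resident in pinned memory, so physical memory can run short, which may hurt the performance of other components. 2. If the row size is not a multiple of 256, there is a remainder, and extra space has to be allocated to pad out the tail.|
|3|Each thread processes more data, which saves compute resources because fewer threads have to be launched.|Each thread processes more data, so the per-pixel differences in processing time add up and the gap between the threads' completion times grows.|

## Q2
:::success
How are the performances of the three methods? Plot a chart to show the differences among the three methods
- for VIEW 1 and VIEW 2, and
- for different ```maxIteration``` (1000, 10000, and 100000).
:::

**Answer**

#### view 1
|iterations|serial|method #1|method #2|method #3|
|-|-|-|-|-|
|1000|1725.594|8.863|8.978|9.755|
|10000|16960.565|33.785|33.714|40.013|
|100000|169476.876|289.438|289.394|353.403|

![](https://i.imgur.com/tNA9cMc.png)

#### view 2
|iterations|serial|method #1|method #2|method #3|
|-|-|-|-|-|
|1000|348.810|6.274|6.274|6.252|
|10000|1213.638|9.059|9.107|9.137|
|100000|9826.172|30.132|30.562|30.726|

![](https://i.imgur.com/Jzpyvqb.png)

## Q3
:::success
Explain the performance differences thoroughly based on your experimental results. Does the results match your assumption? Why or why not.
:::

**Answer**

- First, compare the two views. View 1 always takes longer to run than view 2, which comes down to how dense the two images are: view 1 has more white area than view 2, and Assignment 2 states that ```the brightness of each pixel is proportional to the computational cost of determining whether the value is contained in the Mandelbrot set```. View 1 is therefore more expensive to compute, and its expensive pixels are more concentrated.
![](https://i.imgur.com/uWhFKCK.png)
- In addition, the CUDA implementation divides the work into 256 blocks, so each image can be examined as 256 equal parts. In view 1 some parts are entirely black, whereas in view 2 even the background is not fully black. View 1 also has several entirely white parts, while every part of view 2 contains at least some area close to the background color. The workload in view 2 is therefore more evenly distributed, which favors parallel speedup.
- Next, compare the three methods. Method 3 assigns multiple pixels to each thread, so it is hit harder by the uneven distribution described above and performs worse overall than the other two. Because view 1 is less uniform, the penalty is larger there: in view 1 the speedup of method 3 is much lower than that of the other methods (for example, 353.403 versus about 289.4 at 100000 iterations), while in view 2 the three methods are nearly identical.
- Methods 1 and 2 differ mainly in how memory is allocated. Since both compute one pixel per thread, their timings are very close.

## Q4
:::success
Can we do even better? Think a better approach and explain it. Implement your method in ```kernel4.cu```
:::

**Answer**

- Method 1 is the simplest and the fastest, so the improvement builds on it. The most direct change is to skip allocating a new buffer and place the result straight into the memory that holds the answer.
- The two images below show that skipping the re-allocation does give a speedup.
![](https://i.imgur.com/y8MOIGR.png)
![](https://i.imgur.com/mhYS61H.png)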
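
#### Recomputing the speedups from the Q2 tables

To make the speedup comparison in Q3 easier to check, the ratios can be recomputed from the Q2 timings. The following is a minimal host-side C++ sketch, not part of the submitted kernels: the names `Row`, `speedup`, `method3Penalty`, and `printSpeedups` are hypothetical, and only the numbers come from the Q2 tables. The tables do not state a time unit, but the ratios are unitless.

```cpp
#include <cassert>
#include <cstdio>

// One row of a Q2 timing table (times in the tables' own unit).
struct Row {
    int maxIterations;
    double serial;     // serial reference time
    double method[3];  // times of methods 1, 2 and 3
};

// Timings copied from the Q2 tables.
const Row kView1[3] = {
    {1000,   1725.594,   {8.863,   8.978,   9.755}},
    {10000,  16960.565,  {33.785,  33.714,  40.013}},
    {100000, 169476.876, {289.438, 289.394, 353.403}},
};
const Row kView2[3] = {
    {1000,   348.810,  {6.274,  6.274,  6.252}},
    {10000,  1213.638, {9.059,  9.107,  9.137}},
    {100000, 9826.172, {30.132, 30.562, 30.726}},
};

// Speedup of a GPU method over the serial run.
double speedup(double serialTime, double gpuTime) {
    assert(serialTime > 0.0 && gpuTime > 0.0);
    return serialTime / gpuTime;
}

// Time of method 3 relative to method 1 for one row (1.0 means no gap).
double method3Penalty(const Row& r) {
    return r.method[2] / r.method[0];
}

// Print the speedup of each method for every row of one view.
void printSpeedups(const char* name, const Row* rows, int n) {
    for (int i = 0; i < n; ++i) {
        std::printf("%s maxIteration=%d: m1=%.1fx m2=%.1fx m3=%.1fx\n",
                    name, rows[i].maxIterations,
                    speedup(rows[i].serial, rows[i].method[0]),
                    speedup(rows[i].serial, rows[i].method[1]),
                    speedup(rows[i].serial, rows[i].method[2]));
    }
}
```

Calling `printSpeedups("view 1", kView1, 3)` and `printSpeedups("view 2", kView2, 3)` from a `main` function prints the per-method speedups. With the table values, method 3 takes roughly 10% to 22% longer than method 1 in view 1, while in view 2 the two stay within about 2% of each other, which is consistent with the load-imbalance explanation in Q3.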