# PP-f22 assignment 5

## Q1: What are the pros and cons of the three methods? Give an assumption about their performances.

From the CUDA documentation:

> Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices as mentioned in Asynchronous Concurrent Execution.

Memory allocated with `cudaHostAlloc` is page-locked, so `cudaMemcpy` to and from it is faster. I therefore expect method 2 to be faster than method 1.

For method 3, if the block size and the number of pixels each thread processes are not chosen well, many threads may spend a lot of time idle.

My guess for speed: 2 > 1 > 3.

## Q2: How are the performances of the three methods? Plot a chart to show the differences among the three methods.

All values are speedups relative to the reference (ref = 1.0).

<!--
Method 1
1000    0.99  1.01
10000   0.95  0.95
100000  0.94  0.92

Method 2
1000    1.09  1.07
10000   1.13  0.99
100000  1.12  0.88

Method 3
1000    0.99  0.85
10000   0.94  0.68
100000  0.71  0.58
-->

![](https://i.imgur.com/65EstHu.png)

![](https://i.imgur.com/NV9FzMR.png)

![](https://i.imgur.com/mhq1rfp.png)

## Q3: Explain the performance differences thoroughly based on your experimental results. Do the results match your assumption? Why or why not?

Overall, the results roughly match my assumption. With nvprof I can see that `cudaDeviceSynchronize` accounts for 86% of the time in method 3. When many threads finish almost immediately (0 iterations) while others run the full 100000 iterations, the finished threads sit idle waiting for the slow ones, which wastes a lot of time.

My method 3 originally reached only about 30% of the reference speed. It improved only after I changed the block size from 32\*32 to 16\*16.

## Q4: Can we do even better? Think of a better approach and explain it. Implement your method in `kernel4.cu`.

I unrolled the loop in the `mandel` device function, which gave a large speedup. I also copy the result from device memory directly into the `img` buffer.
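
As a rough illustration of the two ideas in Q4, here is a minimal sketch. It is not the actual `kernel4.cu`. The names `mandelUnrolled` and `mandelKernel`, the unroll factor of 4, the pixel-to-coordinate mapping, and the `hostFE` signature are all assumptions made for this example. The 16\*16 block size is the one mentioned in Q3.

```cuda
#include <cuda_runtime.h>

// Hypothetical unroll factor; the report does not say which factor was used.
#define UNROLL 4

// Escape-time iteration count for the point c = (c_re, c_im).
// Returns the first iteration index at which |z|^2 > 4,
// or `count` if z never escapes within `count` iterations.
__host__ __device__ inline int mandelUnrolled(float c_re, float c_im, int count) {
    float z_re = c_re, z_im = c_im;
    int i = 0;
    // Main loop: UNROLL steps per trip, so the loop-bound check
    // runs once per UNROLL steps instead of once per step.
    for (; i + UNROLL <= count; i += UNROLL) {
#pragma unroll
        for (int k = 0; k < UNROLL; ++k) {
            float re2 = z_re * z_re, im2 = z_im * z_im;
            if (re2 + im2 > 4.f) return i + k;
            z_im = 2.f * z_re * z_im + c_im;  // uses the old z_re
            z_re = re2 - im2 + c_re;
        }
    }
    // Tail loop for the remaining count % UNROLL iterations.
    for (; i < count; ++i) {
        float re2 = z_re * z_re, im2 = z_im * z_im;
        if (re2 + im2 > 4.f) return i;
        z_im = 2.f * z_re * z_im + c_im;
        z_re = re2 - im2 + c_re;
    }
    return i;
}

// One thread per pixel; threads outside the image do nothing.
__global__ void mandelKernel(float lowerX, float lowerY, float stepX, float stepY,
                             int* d_img, int resX, int resY, int maxIterations) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= resX || y >= resY) return;
    d_img[y * resX + x] =
        mandelUnrolled(lowerX + x * stepX, lowerY + y * stepY, maxIterations);
}

// Assumed host entry point. Error checking is omitted for brevity.
void hostFE(float upperX, float upperY, float lowerX, float lowerY,
            int* img, int resX, int resY, int maxIterations) {
    float stepX = (upperX - lowerX) / resX;
    float stepY = (upperY - lowerY) / resY;
    size_t bytes = (size_t)resX * resY * sizeof(int);

    int* d_img = nullptr;
    cudaMalloc(&d_img, bytes);

    dim3 block(16, 16);
    dim3 grid((resX + block.x - 1) / block.x, (resY + block.y - 1) / block.y);
    mandelKernel<<<grid, block>>>(lowerX, lowerY, stepX, stepY,
                                  d_img, resX, resY, maxIterations);

    // Copy straight into the caller's img buffer, with no intermediate
    // host allocation. cudaMemcpy waits for the kernel to finish first.
    cudaMemcpy(img, d_img, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_img);
}
```

Because `mandelUnrolled` is marked `__host__ __device__`, it can also be called on the CPU to spot-check a few points against a plain, non-unrolled loop before timing the kernel.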