
# PP-f22 assignment 5

## Q1: What are the pros and cons of the three methods? Give an assumption about their performances.

From the CUDA documentation: "Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices as mentioned in Asynchronous Concurrent Execution."

Memory allocated with `cudaHostAlloc` is page-locked, so `cudaMemcpy` transfers to and from it are faster. I therefore expect method 2 to be faster than method 1.

For method 3, if the block size and the number of pixels each thread processes are not chosen well, many threads may spend a lot of time idle.

I think speed: 2 > 1 > 3.

## Q2 How are the performances of the three methods? Plot a chart to show the differences among the three methods

All values are speedup multiples relative to the reference (ref = 1.0).

![](https://i.imgur.com/65EstHu.png)

![](https://i.imgur.com/NV9FzMR.png)

![](https://i.imgur.com/mhq1rfp.png)

## Q3 Explain the performance differences thoroughly based on your experimental results. Do the results match your assumption? Why or why not.

Overall, the results roughly match my assumption. With nvprof, `cudaDeviceSynchronize` accounts for 86% of the time in method 3. When many threads need 0 iterations while others need 100000, the threads that finish early sit idle waiting for the slow ones, and a lot of work capacity is wasted. My method 3 originally reached only about 30% of the reference speed; it improved only after I changed the block size from 32\*32 to 16\*16.

## Q4 Can we do even better? Think of a better approach and explain it. Implement your method in `kernel4.cu`.

I unrolled the `mandel` device function, which gave a large speedup. I also copy the device memory directly into the `img` buffer.
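
To make the unrolling in Q4 concrete, here is a minimal sketch of one way an escape-time loop can be unrolled by hand. It is not the code from `kernel4.cu`, which is not shown in this report. The names `mandel_step`, `mandel_ref`, `mandel_unrolled`, `self_check`, and the `HD` macro are hypothetical, the loop shape is an assumption about what the `mandel` function looks like, and the unroll factor of 4 is arbitrary. The escape test is kept after every step so that the iteration count stays identical to the baseline.

```cpp
#include <cassert>

// Assumption: under nvcc these become device-callable; with a plain C++
// compiler the macro expands to nothing so the sketch can be tested on the host.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// One escape-time step: returns true if z has already escaped (|z|^2 > 4),
// otherwise advances z <- z^2 + c and returns false.
HD inline bool mandel_step(float c_re, float c_im, float &z_re, float &z_im) {
    float re2 = z_re * z_re;
    float im2 = z_im * z_im;
    if (re2 + im2 > 4.0f) return true;
    float new_im = 2.0f * z_re * z_im;
    z_re = c_re + (re2 - im2);
    z_im = c_im + new_im;
    return false;
}

// Baseline (assumed shape of the original loop): one step per loop trip.
HD inline int mandel_ref(float c_re, float c_im, int max_iter) {
    float z_re = c_re, z_im = c_im;
    for (int i = 0; i < max_iter; ++i) {
        if (mandel_step(c_re, c_im, z_re, z_im)) return i;
    }
    return max_iter;
}

// Hand-unrolled by 4: four steps per loop trip, so the loop counter and
// loop condition are evaluated a quarter as often. A tail loop handles
// the remaining max_iter % 4 steps.
HD inline int mandel_unrolled(float c_re, float c_im, int max_iter) {
    float z_re = c_re, z_im = c_im;
    int i = 0;
    for (; i + 4 <= max_iter; i += 4) {
        if (mandel_step(c_re, c_im, z_re, z_im)) return i;
        if (mandel_step(c_re, c_im, z_re, z_im)) return i + 1;
        if (mandel_step(c_re, c_im, z_re, z_im)) return i + 2;
        if (mandel_step(c_re, c_im, z_re, z_im)) return i + 3;
    }
    for (; i < max_iter; ++i) {
        if (mandel_step(c_re, c_im, z_re, z_im)) return i;
    }
    return max_iter;
}

// Host-side sanity check on points whose arithmetic is exact in float.
inline void self_check() {
    assert(mandel_unrolled(0.0f, 0.0f, 256) == 256);  // c = 0 never escapes
    assert(mandel_unrolled(2.0f, 2.0f, 256) == 0);    // already outside radius 2
    assert(mandel_unrolled(1.0f, 0.0f, 256) == mandel_ref(1.0f, 0.0f, 256));
}
```

In a kernel, each thread would call `mandel_unrolled` in place of the baseline function for its pixel. Whether this helps, and by how much, depends on the compiler and the GPU, so it is worth re-measuring against the reference after any change to the unroll factor.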