# PP-f22 assignment 5
## Q1: What are the pros and cons of the three methods? Give an assumption about their performances.
From the official CUDA documentation:
"Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices as mentioned in Asynchronous Concurrent Execution."
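Memory allocated with cudaHostAlloc is page-locked, so cudaMemcpy can transfer it faster than ordinary pageable memory. I therefore expect 2 > 1.
For method 3, if the block size and the number of pixels each thread processes are not well matched, many threads may spend a lot of time idle.
My expected speed ranking is 2 > 1 > 3.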
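
The pinned-versus-pageable assumption can be checked in isolation with a small timing sketch. It assumes method 1 copies into ordinary `malloc` memory and method 2 into `cudaHostAlloc` memory; the buffer size and the `timeCopy` helper are illustrative and not taken from the assignment code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time one device-to-host cudaMemcpy into `host`, in milliseconds.
// (Hypothetical helper, not part of the assignment code.)
static float timeCopy(void *host, const void *dev, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 1600 * 1200 * sizeof(int);  // illustrative image size
    int *dev = nullptr;
    cudaMalloc((void **)&dev, bytes);
    cudaMemset(dev, 0, bytes);

    int *pageable = (int *)malloc(bytes);                          // ordinary host memory
    int *pinned = nullptr;
    cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault);  // page-locked host memory

    // Warm-up copies so first-touch page faults are not part of the timing.
    cudaMemcpy(pageable, dev, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(pinned, dev, bytes, cudaMemcpyDeviceToHost);

    printf("pageable: %.3f ms\n", timeCopy(pageable, dev, bytes));
    printf("pinned:   %.3f ms\n", timeCopy(pinned, dev, bytes));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}
```

The absolute numbers depend on the GPU and the PCIe link, so this only indicates whether the gap is large enough to matter at this image size.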
## Q2: How are the performances of the three methods? Plot a chart to show the differences among the three methods.
All values are multiples of the reference speed (ref = 1.0).
<!-- Method 1
1000 0.99 1.01
10000 0.95 0.95
100000 0.94 0.92
Method 2
1000 1.09 1.07
10000 1.13 0.99
100000 1.12 0.88
Method 3
1000 0.99 0.85
10000 0.94 0.68
100000 0.71 0.58 -->



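## Q3: Explain the performance differences thoroughly based on your experimental results. Do the results match your assumption? Why or why not?
Overall, the results roughly match my assumption. nvprof shows that in method 3, cudaDeviceSynchronize accounts for 86% of the time: when many threads finish after 0 iterations while others need 100000, the early finishers sit idle and a lot of capacity is wasted. My method 3 originally reached only about 30% of the reference speed, and it improved only after I changed the block size from 32\*32 to 16\*16.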
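
The imbalance described above can be made concrete with a sketch of a grouped kernel. It assumes method 3 gives each thread a run of consecutive pixels in one row; the names `mandelGrouped` and `group`, the image size, and the view bounds are illustrative, and the output uses a plain linear layout to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Escape-time iteration count for one point c = (cRe, cIm).
__device__ int mandel(float cRe, float cIm, int maxIter) {
    float zRe = cRe, zIm = cIm;
    int i;
    for (i = 0; i < maxIter; ++i) {
        if (zRe * zRe + zIm * zIm > 4.0f) break;
        float newRe = zRe * zRe - zIm * zIm;
        float newIm = 2.0f * zRe * zIm;
        zRe = cRe + newRe;
        zIm = cIm + newIm;
    }
    return i;
}

// Each thread computes `group` consecutive pixels of one row (assumed layout).
// A thread whose pixels all escape immediately finishes long before a
// neighbour whose pixels all run to maxIter.
__global__ void mandelGrouped(int *out, int width, int height,
                              float x0, float y0, float dx, float dy,
                              int maxIter, int group) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int firstCol = (blockIdx.x * blockDim.x + threadIdx.x) * group;
    if (row >= height) return;
    for (int k = 0; k < group; ++k) {
        int col = firstCol + k;
        if (col >= width) break;
        out[row * width + col] = mandel(x0 + col * dx, y0 + row * dy, maxIter);
    }
}

int main() {
    const int width = 1600, height = 1200, maxIter = 1000, group = 4;  // illustrative
    int *dev = nullptr;
    cudaMalloc((void **)&dev, sizeof(int) * width * height);

    dim3 block(16, 16);  // the smaller block size that helped in the report
    dim3 grid((width + block.x * group - 1) / (block.x * group),
              (height + block.y - 1) / block.y);
    mandelGrouped<<<grid, block>>>(dev, width, height, -2.0f, -1.0f,
                                   3.0f / width, 2.0f / height, maxIter, group);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(dev);
    return 0;
}
```

One plausible reading of the block-size result is that a 32\*32 block holds 1024 threads whose slot is released only when the slowest of them finishes, whereas 16\*16 blocks are smaller units that can be retired and replaced sooner. This is an interpretation, not something the nvprof output shows directly.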
## Q4: Can we do even better? Think of a better approach and explain it. Implement your method in `kernel4.cu`.
I unrolled the loop in the mandel device function, which sped it up considerably. I also copy the device memory directly into the img buffer instead of going through an intermediate host buffer.
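
A minimal sketch of both ideas follows. It is not the actual `kernel4.cu`: it assumes the unrolling is expressed with `#pragma unroll` (the original may have been unrolled by hand), and the names `mandelUnrolled`, `mandelKernel`, and `renderInto` as well as the image size and view bounds are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Same escape-time loop as the reference, with a partial-unroll hint.
__device__ int mandelUnrolled(float cRe, float cIm, int maxIter) {
    float zRe = cRe, zIm = cIm;
    int i;
    #pragma unroll 4
    for (i = 0; i < maxIter; ++i) {
        if (zRe * zRe + zIm * zIm > 4.0f) break;
        float newRe = zRe * zRe - zIm * zIm;
        float newIm = 2.0f * zRe * zIm;
        zRe = cRe + newRe;
        zIm = cIm + newIm;
    }
    return i;
}

// One pixel per thread.
__global__ void mandelKernel(int *out, int width, int height,
                             float x0, float y0, float dx, float dy, int maxIter) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;
    out[row * width + col] = mandelUnrolled(x0 + col * dx, y0 + row * dy, maxIter);
}

// Render straight into the caller's buffer `img`: no staging host buffer
// and no extra host-side memcpy afterwards.
void renderInto(int *img, int width, int height, int maxIter) {
    size_t bytes = sizeof(int) * width * height;
    int *dev = nullptr;
    cudaMalloc((void **)&dev, bytes);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    mandelKernel<<<grid, block>>>(dev, width, height, -2.0f, -1.0f,
                                  3.0f / width, 2.0f / height, maxIter);

    // cudaMemcpy on the default stream waits for the kernel to finish first.
    cudaMemcpy(img, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
}

int main() {
    const int width = 1600, height = 1200;  // illustrative image size
    int *img = (int *)malloc(sizeof(int) * width * height);
    renderInto(img, width, height, 1000);
    printf("centre pixel iterations: %d\n", img[(height / 2) * width + width / 2]);
    free(img);
    return 0;
}
```

Because `img` here is pageable memory, the copy gives up the pinned-memory transfer path in exchange for skipping one host-side copy; which side wins depends on the image size and the hardware.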