# Parallel Programming Homework 5
## Q1
Method one is the most straightforward: each pixel is processed by one thread, and the result is copied back to the host after the computation. Method two uses a different memory-allocation method that should be beneficial to the hardware, so better performance is expected. In method three, each thread processes multiple pixels with a loop. This computation pattern does not match the hardware's expected usage, so poor performance is expected.
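The homework kernels themselves are not reproduced here; a minimal sketch of the three strategies, assuming a grayscale invert as the per-pixel operation (the actual operation may differ), could look like:

```cuda
// Method 1 (sketch): one thread per pixel, linear layout.
__global__ void processPerPixel(unsigned char *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];  // example operation
}

// Method 2 (sketch): pitched allocation; each row starts on an
// alignment boundary chosen by the driver, which favors coalescing.
__global__ void processPitched(unsigned char *img, size_t pitch,
                               int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        unsigned char *row = img + y * pitch;  // step rows by pitch, not width
        row[x] = 255 - row[x];
    }
}

// Method 3 (sketch): each thread loops over several pixels
// (a grid-stride loop), so fewer threads cover the whole image.
__global__ void processMultiPixel(unsigned char *img, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        img[i] = 255 - img[i];
}
```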
## Q2


## Q3
The profiling tool shows that method one spends most of its time in **cudaMemcpy**, followed by **cudaDeviceSynchronize** and then **cudaMalloc**. Method two spends most of its time in **cudaMemcpy2D**, then **cudaDeviceSynchronize**, then **cudaMalloc**.
The most time-consuming steps for method three are **cudaMemcpy2D**, **cudaDeviceSynchronize**, and then **cudaMalloc**. Method three also spends more time in memory copies than the other two. This result is surprising: method three performs worse than method one, and its bottleneck is the memory copy, even though the method was supposed to improve that very operation.
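For reference, the host-side calls that the profiler attributes this time to would follow a pattern like the sketch below (assuming a `width` x `height` single-channel image; variable names are illustrative):

```cuda
// Pitched allocation and 2D copies, as used by the cudaMemcpy2D-based
// methods. The profiler charges transfer time to the cudaMemcpy2D calls
// and kernel wait time to cudaDeviceSynchronize.
unsigned char *d_img;
size_t pitch;
cudaMallocPitch(&d_img, &pitch, width * sizeof(unsigned char), height);

// Host -> device: source pitch is the plain row width on the host side.
cudaMemcpy2D(d_img, pitch, h_img, width,
             width, height, cudaMemcpyHostToDevice);

kernel<<<grid, block>>>(d_img, pitch, width, height);
cudaDeviceSynchronize();

// Device -> host after the computation.
cudaMemcpy2D(h_img, width, d_img, pitch,
             width, height, cudaMemcpyDeviceToHost);
```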
## Q4
Since the bottleneck of all three methods is the memory-copy operation, the optimization should target this step. It is done by creating mapped memory, passing the _cudaHostAllocMapped_ flag to **cudaHostAlloc()**. This allows the GPU to access part of the host memory directly, eliminating the need to copy data from the GPU back to the CPU after the computation. The test results show slightly better performance than method 1. The profiling tool shows an increase in synchronization time, but since no memory copy is needed after the kernel finishes, the change is a net positive.
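The mapped-memory setup described above can be sketched as follows (a minimal outline; `kernel`, `grid`, and `block` stand in for the homework's actual kernel and launch configuration):

```cuda
// Zero-copy (mapped) host memory: the device reads and writes host RAM
// directly over PCIe, so no explicit device-to-host copy is needed
// after the kernel.
unsigned char *h_img, *d_img;
cudaHostAlloc(&h_img, width * height, cudaHostAllocMapped);

// ... fill h_img with the input image on the host ...

// Obtain the device-side pointer aliasing the mapped host buffer.
cudaHostGetDevicePointer((void **)&d_img, h_img, 0);

kernel<<<grid, block>>>(d_img, width, height);
cudaDeviceSynchronize();  // after this, results are visible in h_img
```

Note that kernel accesses to mapped memory travel over the bus on demand, which is why the profile shows the cost shifting from an explicit copy into the synchronization/kernel time.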