# Programming Assignment V: CUDA Programming
[toc]
### Q1 What are the pros and cons of the three methods? Give an assumption about their performances.
* Method 1:
* pros:
1. Each thread is responsible for processing one pixel, which is parallel processing in a practical sense.
2. ```cudaMalloc()``` and ```malloc()``` simply reserve a block of memory of the requested size on the device and the host heap, respectively. The host buffer is pageable (non-pinned) memory, so the memory-management overhead is relatively small.
* cons:
1. Because the memory requested by ```malloc()``` is pageable and has no pinned mapping, ```cudaMemcpy()``` (and the extra staging ```memcpy()``` it implies) takes longer to execute. The memory-transfer overhead is therefore relatively large.
* Method 2:
* pros:
1. Each thread is responsible for processing one pixel, which is parallel processing in a practical sense.
2. The driver tracks the virtual memory ranges allocated with ```cudaHostAlloc()``` and automatically accelerates calls to functions such as ```cudaMemcpy()```. Since pinned memory can be accessed directly by the device, it can be read and written with much higher bandwidth than pageable memory obtained with functions such as ```malloc()```.
* cons:
1. Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging.
* Method 3:
* pros:
1. The workload is more evenly balanced than in Method 1 and Method 2.
2. Because each thread processes a group of pixels, fewer threads are launched and fewer GPU resources are consumed.
* cons:
1. Each thread processes its group of pixels one after another, so serial execution appears inside the computation.
* Assumption:
* Method 2 > Method 1 > Method 3
* Method 3 involves serial execution, so we assume the performance of **Method 3 is the worst**.
* As for Method 1 versus Method 2, we assume **Method 2 delivers the best performance**, thanks to the advantages of pinned memory.
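The host-side difference between Method 1 and Method 2 can be sketched as follows. This is a minimal sketch, not the assignment's actual code: the image size, buffer names, and the trivial stand-in kernel are all illustrative.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Trivial stand-in kernel: one thread per pixel, as in Methods 1 and 2.
__global__ void fillKernel(int *d_img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        d_img[y * width + x] = x + y;  // placeholder for the Mandelbrot loop
}

int main(void) {
    const int width = 1600, height = 1200;        // illustrative image size
    const size_t size = (size_t)width * height * sizeof(int);
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);

    int *d_img;
    cudaMalloc((void **)&d_img, size);            // device buffer, same in both methods

    // Method 1: pageable host memory. cudaMemcpy must stage the data through
    // an internal pinned buffer, so the transfer is slower.
    int *h_img1 = (int *)malloc(size);
    fillKernel<<<grid, block>>>(d_img, width, height);
    cudaMemcpy(h_img1, d_img, size, cudaMemcpyDeviceToHost);

    // Method 2: pinned host memory. The DMA engine copies directly, but
    // cudaHostAlloc() itself costs more than malloc().
    int *h_img2;
    cudaHostAlloc((void **)&h_img2, size, cudaHostAllocDefault);
    fillKernel<<<grid, block>>>(d_img, width, height);
    cudaMemcpy(h_img2, d_img, size, cudaMemcpyDeviceToHost);

    free(h_img1);
    cudaFreeHost(h_img2);
    cudaFree(d_img);
    return 0;
}
```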
### Q2 How are the performances of the three methods? Plot a chart to show the differences among the three methods for VIEW 1 and VIEW 2, and for different maxIteration (1000, 10000, and 100000).
* View 1

* View 2

### Q3 Explain the performance differences thoroughly based on your experimental results. Do the results match your assumption? Why or why not?
* Observation:
1. View 2 takes less time than View 1
    * No matter what the iteration count is, View 2 always takes less time than View 1. To explain this, we must look at the distribution of the white area in View 1 and View 2. Because white pixels require far more computation, View 1 has a more imbalanced workload than View 2. An imbalanced workload means the finished threads must wait for the unfinished ones, which explains why View 1 takes more time than View 2.

2. Iteration count amplifies the difference
    * In the case of iteration=1000, there is only a small difference between the methods, while in the case of iteration=100000, Method 3 takes much longer to complete the calculation. This is because pixels in the same group are processed serially, so the loss of parallelism grows as the iteration count grows.
* Assumption: Method 2 > Method 1 > Method 3
* Experiment: Method 1 > Method 2 > Method 3
* Analysis:
* Method 1 vs Method 2
* As the nvprof results below show, calling ```cudaHostAlloc()``` involves a trade-off between the allocation cost and the cost of ```cudaMemcpy()```.
* When calling ```cudaHostAlloc()```, the pro is that the overhead of ```cudaMemcpy()``` is smaller; the con is that the host allocation itself is more expensive.
* When calling ```malloc()```, the pro is that the host allocation is cheap; the con is that the overhead of ```cudaMemcpy()``` is larger.
* Because allocation accounts for a larger share of the execution time than ```cudaMemcpy()``` in our profiles, the allocation cost is the determining factor of performance. Therefore Method 1, which calls ```malloc()```, achieves the better performance.
* nvprof of Method 1

* nvprof of Method 2

* Method 2 vs Method 3
* Here the experiment fully matches the assumption. Moreover, the gap becomes much larger in the case of iteration=100000: the effect of serial execution is amplified as the amount of computation grows.
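The serialization in Method 3 can be sketched as a per-thread loop over a pixel group. This is an illustrative sketch, not the assignment's actual kernel; the group size, names, and the stand-in pixel computation are assumptions.

```cuda
#include <cuda_runtime.h>

#define GROUP 4  // illustrative: number of pixels handled by one thread

// Method 3 style: each thread walks a horizontal group of pixels serially,
// so the work inside this loop is not parallelized. With a large
// maxIteration, the serial per-pixel loop dominates the runtime.
__global__ void groupKernel(int *d_img, int width, int height) {
    int x = (blockIdx.x * blockDim.x + threadIdx.x) * GROUP;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (y >= height) return;
    for (int i = 0; i < GROUP && x + i < width; ++i) {
        // placeholder for the per-pixel Mandelbrot iteration loop
        d_img[y * width + (x + i)] = x + i + y;
    }
}
```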
### Q4 Can we do even better? Think a better approach and explain it. Implement your method in kernel4.cu.
* There is no need to allocate an additional staging buffer on the host, since we can copy the device array directly into the output buffer ```img``` through ```cudaMemcpy()```. In this way, we save the time spent on the extra allocation, the extra copy, and the final deallocation.
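A minimal sketch of this idea for ```kernel4.cu``` follows. The ```hostFE``` signature and names are assumed to mirror the assignment's interface, and the per-pixel computation is a placeholder, not the real Mandelbrot loop.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: one thread per pixel, writing into the device buffer.
__global__ void mandelKernel(int *d_img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        d_img[y * width + x] = x + y;  // stands in for the Mandelbrot loop
}

// Hypothetical host wrapper: `img` is the caller's output buffer.
void hostFE(int *img, int width, int height) {
    size_t size = (size_t)width * height * sizeof(int);
    int *d_img;
    cudaMalloc((void **)&d_img, size);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    mandelKernel<<<grid, block>>>(d_img, width, height);

    // Copy the result straight into `img`: no intermediate host buffer
    // to allocate, copy out of, and free afterwards.
    cudaMemcpy(img, d_img, size, cudaMemcpyDeviceToHost);
    cudaFree(d_img);
}
```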