# Programming Assignment V: CUDA Programming

[toc]

### Q1 What are the pros and cons of the three methods? Give an assumption about their performances.

* Method 1:
    * pros:
        1. Each thread is responsible for processing one pixel, which is parallel processing in a practical sense.
        2. ```cudaMalloc()``` and ```malloc()``` simply find a space on the heap that satisfies the requested size. The host buffer is non-pinned (pageable) memory, so the memory management overhead is relatively small.
    * cons:
        1. Because the pageable memory requested by ```malloc()``` has no fixed physical mapping, ```cudaMemcpy()``` and ```memcpy()``` take longer to execute. Therefore, the memory transfer overhead is relatively large.
* Method 2:
    * pros:
        1. Each thread is responsible for processing one pixel, which is parallel processing in a practical sense.
        2. The driver tracks the virtual memory ranges allocated with ```cudaHostAlloc()``` and automatically accelerates calls to functions such as ```cudaMemcpy()```. Since pinned memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as ```malloc()```.
    * cons:
        1. Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging.
* Method 3:
    * pros:
        1. The workload is more balanced than in Method 1 and Method 2.
        2. Because each thread processes a group of pixels, fewer GPU resources are used.
    * cons:
        1. Each thread processes a group of pixels, so serial execution appears in the calculation.
* Assumption:
    * Method 2 > Method 1 > Method 3
    * Method 3 includes serial execution, so the assumption is that the performance of **Method 3 is the worst**.
    * As for the comparison of Method 1 and Method 2, **Method 2 should generate the best performance** thanks to the advantage of pinned memory.

### Q2 How are the performances of the three methods?
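The difference between Method 1 and Method 2 can be sketched as follows. This is a minimal illustration, not the assignment's actual code: the kernel name `mandelKernel`, the function `hostFrontEnd`, and the placeholder pixel computation are assumptions; only the ```malloc()``` / ```cudaHostAlloc()``` split reflects the methods described above.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

// One thread per pixel (the kernel shape shared by Methods 1 and 2).
__global__ void mandelKernel(int *d_img, int width, int height, int maxIterations) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    // ... the real per-pixel Mandelbrot iteration would go here ...
    d_img[y * width + x] = maxIterations;  // placeholder result
}

void hostFrontEnd(int *img, int width, int height, int maxIterations, bool usePinned) {
    size_t bytes = (size_t)width * height * sizeof(int);
    int *h_buf, *d_buf;

    if (usePinned)
        cudaHostAlloc(&h_buf, bytes, cudaHostAllocDefault); // Method 2: pinned host memory
    else
        h_buf = (int *)malloc(bytes);                       // Method 1: pageable host memory

    cudaMalloc(&d_buf, bytes);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    mandelKernel<<<grid, block>>>(d_buf, width, height, maxIterations);

    // This copy is faster from/to pinned memory, since the driver can DMA directly.
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    memcpy(img, h_buf, bytes);

    cudaFree(d_buf);
    if (usePinned) cudaFreeHost(h_buf); else free(h_buf);
}
```

Note that pinned memory must be released with ```cudaFreeHost()```, not ```free()```.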
Plot a chart to show the differences among the three methods for VIEW 1 and VIEW 2, and for different maxIteration values (1000, 10000, and 100000).

* View 1
![](https://i.imgur.com/yBp1HLz.png)
* View 2
![](https://i.imgur.com/58ZNkjZ.png)

### Q3 Explain the performance differences thoroughly based on your experimental results. Do the results match your assumption? Why or why not?

* Observation:
    1. View 2 takes less time than View 1
        * No matter what the iteration count is, View 2 always takes less time than View 1. To explain this, we must look at the distribution of the white area in View 1 and View 2. Because the white area requires a larger amount of calculation, View 1 has an imbalanced workload compared to View 2. An imbalanced workload means the completed threads wait for the unfinished ones, which explains why View 1 takes more time than View 2.
        ![](https://i.imgur.com/HPHWY13.png)
    2. Iteration amplifies the difference
        * In the case of iteration=1000, there is only a small difference between the methods, while in the case of iteration=100000, Method 3 takes much longer to complete the calculation. This is because pixels in the same group are processed serially, so parallelism shrinks as the iteration count grows.
* Assumption: Method 2 > Method 1 > Method 3
* Experiment: Method 1 > Method 2 > Method 3
* Analysis:
    * Method 1 vs Method 2
        * As we can see, calling ```cudaHostAlloc()``` involves a trade-off between the performance of ```cudaMalloc()``` and that of ```cudaMemcpy()```.
        * When calling ```cudaHostAlloc()```, the pro is that the overhead of ```cudaMemcpy()``` is smaller; the con is that the overhead of ```cudaMalloc()``` is larger.
        * When calling ```malloc()```, the pro is that the overhead of ```cudaMalloc()``` is smaller; the con is that the overhead of ```cudaMemcpy()``` is larger.
        * Because ```cudaMalloc()``` accounts for a larger share of the execution time than ```cudaMemcpy()```, ```cudaMalloc()``` is the determining factor of performance.
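The serial execution inside Method 3 that iteration counts amplify can be sketched as a loop inside each thread. This is an illustrative sketch under assumptions: the kernel name `mandelKernelGroup`, the `pixelsPerThread` parameter, the row-wise grouping, and the placeholder pixel computation are not taken from the assignment's code.

```cuda
#include <cuda_runtime.h>

// Method 3 sketch: each thread handles pixelsPerThread consecutive pixels of one row.
__global__ void mandelKernelGroup(int *d_img, int width, int height,
                                  int maxIterations, int pixelsPerThread) {
    int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * pixelsPerThread;
    int y  = blockIdx.y * blockDim.y + threadIdx.y;
    if (y >= height) return;

    // Serial loop inside the thread: each pixel costs up to maxIterations
    // iterations, so the serial portion grows with the iteration count.
    for (int i = 0; i < pixelsPerThread; ++i) {
        int x = x0 + i;
        if (x >= width) break;
        // ... the real per-pixel Mandelbrot iteration would go here ...
        d_img[y * width + x] = maxIterations;  // placeholder result
    }
}
```

At iteration=1000 the loop body is cheap and the loss of parallelism is barely visible; at iteration=100000 every trip through the loop is expensive, which matches the observation that the gap between Method 3 and the others widens.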
Therefore, Method 1, which calls ```malloc()```, can generate better performance.
        * nvprof of Method 1
        ![](https://i.imgur.com/QenXJG5.png)
        * nvprof of Method 2
        ![](https://i.imgur.com/AJ9mrEE.png)
    * Method 2 vs Method 3
        * The experiment fully matches the assumption. Moreover, the difference becomes much larger in the case of iteration=100000: the effect of serial execution is amplified as the amount of calculation grows.

### Q4 Can we do even better? Think of a better approach and explain it. Implement your method in kernel4.cu.

* There is no need to allocate an additional buffer on the host, since we can copy the device array directly into ```img``` through ```cudaMemcpy()```. In this way, we save the time of the extra allocation, copy, and deletion.
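The Q4 idea can be sketched as below. This is a hedged sketch, not the contents of kernel4.cu: the function name `hostFrontEndDirect`, the kernel name `mandelKernel`, and the placeholder pixel computation are assumptions; the point it illustrates is copying from device memory straight into ```img```.

```cuda
#include <cuda_runtime.h>

// One thread per pixel, as in Methods 1 and 2 (placeholder body).
__global__ void mandelKernel(int *d_img, int width, int height, int maxIterations) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    d_img[y * width + x] = maxIterations;  // placeholder result
}

void hostFrontEndDirect(int *img, int width, int height, int maxIterations) {
    size_t bytes = (size_t)width * height * sizeof(int);
    int *d_buf;
    cudaMalloc(&d_buf, bytes);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    mandelKernel<<<grid, block>>>(d_buf, width, height, maxIterations);

    // Copy straight into the caller's img buffer: no intermediate host
    // allocation, no extra memcpy(), and no extra free().
    cudaMemcpy(img, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
}
```

Compared with Method 1, this removes one host allocation, one host-side ```memcpy()```, and one ```free()``` per frame, while the device-side work is unchanged.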