# Parallel Programming HW5
###### tags: `Parallel Programming`

### Q1
What are the pros and cons of the three methods? Give an assumption about their performances.

Ans:

Method 1
* pros:
    * The whole image can be processed simultaneously with high parallelism by using many threads, each processing one pixel.
* cons:
    * `malloc` allocates pageable memory, which means the data could be swapped out to disk. When transferring to the GPU, additional time is spent staging the data back in before the copy.

Method 2
* pros:
    * The whole image can be processed simultaneously with high parallelism by using many threads, each processing one pixel.
    * Copying data from the host to the device has only a small overhead, because `cudaHostAlloc` allocates pinned memory that is never swapped out.
    * `cudaMallocPitch` pads each row of the 2D array so that the next row starts on an aligned (typically 256-byte) boundary, which makes memory accesses more efficient.
* cons:
    * Since `cudaHostAlloc` reserves pinned physical memory that stays resident until it is freed (here, at the end of the program), a large allocation can occupy a lot of physical memory and force other data to be swapped out continually, causing additional overhead.

Method 3
* pros:
    * It uses fewer threads, since each thread processes a group of pixels, which saves GPU resources.
* cons:
    * Because there are fewer threads, the pixels within each group are processed sequentially, resulting in poor parallelism.

### Q2
How do the performances of the three methods compare? Plot a chart to show the differences among the three methods
* for VIEW 1 and VIEW 2, and
* for different maxIteration (1000, 10000, and 100000).

Ans:

![](https://i.imgur.com/JyxgGNK.png =500x)
![](https://i.imgur.com/9ZdbTdX.png =500x)

### Q3
Explain the performance differences thoroughly based on your experimental results. Do the results match your assumption? Why or why not?

Ans:
* Method 3 has the poorest performance regardless of the view or the number of iterations. Since it uses the fewest threads, the pixels cannot all be processed in parallel: within each group they are executed sequentially, which creates a large performance gap.
* As for methods 1 and 2, the results do not match my assumption. I initially expected `malloc()` to take longer than `cudaHostAlloc()`; however, method 2 actually takes longer than method 1.
* Profiling with `nvprof` shows that `cudaHostAlloc()` does take less time than `malloc()`, but `cudaMemcpy2D()` is time-consuming: with maxIteration = 100000, it consumes nearly 65% of the run time, presumably because it has to handle the pitch of every row (see the method 2 sketch at the end of this report). In the end, the time saved by one API is spent by the other, so method 2 performs no better than method 1.

### Q4
Can we do even better? Think of a better approach and explain it. Implement your method in `kernel4.cu`.

Ans:
* By copying the final result directly into `img`, we save the time of calling `malloc()` or `cudaHostAlloc()` for an extra host buffer.
* The repeated operations (the iteration loop) are moved into a *device function*, since device functions in CUDA are typically inlined and the compiler may be able to optimize them. A sketch follows the charts below.

![](https://i.imgur.com/wZWpr1i.png =800x)
![](https://i.imgur.com/mqaBT7p.png =800x)
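
Below is a minimal sketch of the `kernel4.cu` idea described above. It is a reconstruction under assumptions, not the exact homework code: the `mandel`/`mandelKernel`/`hostFE` names, the `hostFE` parameter list, and the 16×16 block size are all illustrative. The iteration loop lives in a `__device__` function (which nvcc typically inlines), and the result is copied straight into the caller's `img`, with no extra host buffer allocated inside `hostFE`.

```cpp
#include <cuda_runtime.h>

// Hypothetical Mandelbrot iteration as a __device__ function; nvcc
// typically inlines these, which is the optimization mentioned above.
__device__ int mandel(float c_re, float c_im, int maxIterations) {
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < maxIterations; ++i) {
        if (z_re * z_re + z_im * z_im > 4.f) break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}

// One thread per pixel, each writing its result at the pixel's final
// position in the output buffer.
__global__ void mandelKernel(int *d_img, float lowerX, float lowerY,
                             float stepX, float stepY,
                             int width, int height, int maxIterations) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    d_img[y * width + x] =
        mandel(lowerX + x * stepX, lowerY + y * stepY, maxIterations);
}

// Host side: the device result is copied directly into the caller's
// img, so no intermediate malloc()/cudaHostAlloc() buffer is needed.
void hostFE(float upperX, float upperY, float lowerX, float lowerY,
            int *img, int resX, int resY, int maxIterations) {
    float stepX = (upperX - lowerX) / resX;
    float stepY = (upperY - lowerY) / resY;

    int *d_img;
    cudaMalloc((void **)&d_img, resX * resY * sizeof(int));

    dim3 block(16, 16);
    dim3 grid((resX + block.x - 1) / block.x,
              (resY + block.y - 1) / block.y);
    mandelKernel<<<grid, block>>>(d_img, lowerX, lowerY, stepX, stepY,
                                  resX, resY, maxIterations);

    cudaMemcpy(img, d_img, resX * resY * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_img);
}
```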
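
For reference, the method 2 data path discussed in Q1 and Q3 looks roughly like the following. This is a sketch under assumptions (an `int` output buffer and illustrative variable names), meant only to show where `cudaHostAlloc`, `cudaMallocPitch`, and the expensive `cudaMemcpy2D` fit together.

```cpp
#include <cuda_runtime.h>

// Sketch of the method 2 data path: pinned host buffer, pitched
// device buffer, and a 2D copy back. All names are illustrative.
void method2_sketch(int resX, int resY) {
    int *h_img;   // pinned host memory: cannot be swapped out
    cudaHostAlloc((void **)&h_img, resX * resY * sizeof(int),
                  cudaHostAllocDefault);

    int *d_img;
    size_t pitch; // bytes per padded row, chosen by the driver
    cudaMallocPitch((void **)&d_img, &pitch, resX * sizeof(int), resY);

    // ... launch a kernel that indexes rows as
    //     (int *)((char *)d_img + y * pitch) ...

    // The 2D copy must respect the pitch of every row; this is the
    // call that nvprof showed to be expensive in Q3.
    cudaMemcpy2D(h_img, resX * sizeof(int), d_img, pitch,
                 resX * sizeof(int), resY, cudaMemcpyDeviceToHost);

    cudaFree(d_img);
    cudaFreeHost(h_img);
}
```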