###### tags: `Parallel Programming`
# Assignment 5 Report
`0811510 許承壹`
> Q1: What are the pros and cons of the three methods? Give an assumption about their performances.
* Method 1:
* pros:
1. Each thread only needs to handle a single pixel, so the work maps directly onto the GPU's massive parallelism (a minimal kernel sketch is given after this list).
2. No page-locked memory is used, so it won't affect overall system performance.
* cons:
1. There is a load-balancing problem: different pixels require different amounts of computation, so threads that have already finished must wait for those that have not.
* Method 2:
* pros:
1. Since the data is 2D, `cudaMallocPitch` keeps every row properly aligned, so CUDA can read 256 bits at once, which speeds up memory access (see the allocation sketch after this list).
2. Page-locked (pinned) host memory is used, so the data we need is guaranteed to stay in physical memory, avoiding page faults.
* cons:
1. Too much page-locked memory will degrade overall system performance.
2. Using `cudaMallocPitch` may waste some memory, since each row is padded to force alignment.
3. This method does not take advantage of the "zero-copy access" that page-locked memory allows, so we still need time to copy data.
* Method 3:
* pros:
1. Each thread handles multiple pixels, which reduces the number of times data has to be moved onto the SMs.
2. Since the data is 2D, `cudaMallocPitch` keeps every row properly aligned, so CUDA can read 256 bits at once, which speeds up memory access.
* cons:
1. Each thread has to process its multiple pixels serially, which hurts GPU performance.
2. Too much page-locked memory will degrade overall system performance.
3. Using `cudaMallocPitch` may waste some memory, since each row is padded to force alignment.
4. This method does not take advantage of the "zero-copy access" that page-locked memory allows, so we still need time to copy data.
* Assumption:
I think the execution speed would be: Method 1 > Method 2 > Method 3.
* Reason:
In Methods 2 and 3, calling `cudaMallocPitch` costs more time than calling `cudaMalloc`, and neither method takes advantage of "zero-copy access". Method 3 additionally contains a serial loop inside each thread. That is why I think Method 1 runs the fastest and Method 3 the slowest.
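* Sketch:
The two patterns compared above can be illustrated with the following minimal sketches. They are my assumptions about how such kernels are typically written, not the exact assignment skeleton; in particular the names `mandelKernel`, `hostFE_pitched`, `d_img`, and `h_img` are mine.
```cuda
// Method 1 (sketch): one thread computes exactly one pixel.
__global__ void mandelKernel(int *d_img, float lowerX, float lowerY,
                             float stepX, float stepY,
                             int resX, int resY, int maxIterations)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= resX || y >= resY) return;

    // The complex point c that this pixel maps to.
    float c_re = lowerX + x * stepX;
    float c_im = lowerY + y * stepY;

    // Iterate z = z^2 + c. The iteration count differs from pixel to pixel,
    // which is exactly the load imbalance mentioned in the cons above.
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < maxIterations; ++i) {
        if (z_re * z_re + z_im * z_im > 4.f) break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2.f * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    d_img[y * resX + x] = i;
}
```
Methods 2 and 3 differ mainly on the host side: the result is staged in a pinned buffer and the device image is allocated with `cudaMallocPitch`, so the copy back goes through `cudaMemcpy2D`:
```cuda
#include <cuda_runtime.h>
#include <string.h>

// Methods 2/3 (sketch, host side): pinned host buffer + pitched device rows.
void hostFE_pitched(int *img, int resX, int resY)
{
    int *h_img, *d_img;
    size_t pitch;
    cudaHostAlloc((void **)&h_img, resX * resY * sizeof(int), cudaHostAllocDefault);
    cudaMallocPitch((void **)&d_img, &pitch, resX * sizeof(int), resY);

    // ... launch a kernel that indexes each row by `pitch` instead of `resX` ...

    // Copy back row by row, skipping the padding cudaMallocPitch inserted,
    // then hand the result to the caller's output buffer.
    cudaMemcpy2D(h_img, resX * sizeof(int), d_img, pitch,
                 resX * sizeof(int), resY, cudaMemcpyDeviceToHost);
    memcpy(img, h_img, resX * resY * sizeof(int));

    cudaFreeHost(h_img);
    cudaFree(d_img);
}
```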
> Q2: How are the performances of the three methods? Plot a chart to show the differences among the three methods.
* View 1
| Iteration/Method | Method 1 | Method 2 | Method 3 |
| ---------------- | -------- | -------- | -------- |
| Iteration 1000 | 6.515 | 9.015 | 9.551 |
| Iteration 10000 | 33.200 | 36.711 | 45.359 |
| Iteration 100000 | 320.344 | 325.519 | 430.121 |

* View 2
| Iteration/Method | Method 1 | Method 2 | Method 3 |
| ---------------- | -------- | -------- | -------- |
| Iteration 1000 | 4.015 | 5.604 | 6.761 |
| Iteration 10000 | 6.636 | 8.712 | 18.250 |
| Iteration 100000 | 33.506 | 34.112 | 113.798 |

> Q3: Explain the performance differences thoroughly based on your experimental results. Do the results match your assumption? Why or why not?
* Ans:
From the comparison above, we can see that Method 3 has the poorest performance in both view 1 and view 2, because each thread in Method 3 performs a lot of serial computation.
Comparing Method 1 and Method 2, Method 1 is faster, and in view 1 the gap between them is even larger. View 1 has its grey and white pixels concentrated in the center of the image, which means the computational load is unbalanced. Method 2 cannot handle this imbalance any better than Method 1, and it also wastes time allocating padded memory that is never used, so Method 2 is slower than Method 1. Overall, the results match my assumption that Method 1 is the fastest and Method 3 is the slowest.
> Q4: Can we do even better? Think of a better approach and explain it. Implement your method in `kernel4.cu`.
* Based on Method 1, I realized that I actually don't need to allocate an extra buffer on the host. I can use `cudaMemcpy` to copy the result from device memory directly into the output image buffer. This saves the time spent on allocating that extra host memory, copying data between host buffers, and freeing the memory. A sketch of this idea is shown below.
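A minimal sketch of what `kernel4.cu`'s host side could look like under this idea; the `hostFE` signature and the `mandelKernel` name are my assumptions (reusing the Method 1 kernel sketched under Q1), not necessarily the exact assignment skeleton:
```cuda
#include <cuda_runtime.h>

// One-pixel-per-thread kernel from the Method 1 sketch (declaration only here).
__global__ void mandelKernel(int *d_img, float lowerX, float lowerY,
                             float stepX, float stepY,
                             int resX, int resY, int maxIterations);

// kernel4.cu (sketch): reuse Method 1's kernel, but skip the extra host
// buffer and copy the result straight into the caller's img array.
void hostFE(float upperX, float upperY, float lowerX, float lowerY,
            int *img, int resX, int resY, int maxIterations)
{
    float stepX = (upperX - lowerX) / resX;
    float stepY = (upperY - lowerY) / resY;

    int *d_img;
    size_t size = (size_t)resX * resY * sizeof(int);
    cudaMalloc((void **)&d_img, size);

    dim3 block(16, 16);
    dim3 grid((resX + block.x - 1) / block.x, (resY + block.y - 1) / block.y);
    mandelKernel<<<grid, block>>>(d_img, lowerX, lowerY, stepX, stepY,
                                  resX, resY, maxIterations);

    // No intermediate host allocation, host-to-host copy, or free:
    // the device result lands directly in the output image.
    cudaMemcpy(img, d_img, size, cudaMemcpyDeviceToHost);
    cudaFree(d_img);
}
```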