<!-- Call the assignment `CS450_aXXX_nhn33.pdf` where `XXX` is the assignment number -->

:::info
**CS 450**
Noah Nuebling
:::

# Assignment 6

## Question 1

### 1.

*All times are in seconds.*

#### N = 100, e = 5

Total: 0.131939
Kernel: 0.000085
CPU: 0.000271
Cores: 2

#### N = 100, e = 10

Total: 0.129066
Kernel: 0.000086
CPU: 0.000487
Cores: 2

#### N = 1000, e = 5

Total: 0.127911
Kernel: 0.000610
CPU: 0.000840
Cores: 2

#### N = 1000, e = 10

Total: 0.129453
Kernel: 0.000609
CPU: 0.001341
Cores: 2

#### N = 10000, e = 5

Total: 0.131776
Kernel: 0.005767
CPU: 0.252819
Cores: 2

#### N = 10000, e = 10

Total: 0.134667
Kernel: 0.005775
CPU: 0.179304
Cores: 2

#### N = 100000, e = 5

Total: 0.486064
Kernel: 0.354338
CPU: 19.517811
Cores: 2

#### N = 100000, e = 10

Total: 0.483756
Kernel: 0.354482
CPU: 18.305090
Cores: 2

#### Which is faster, the GPU or CPU implementation?

The GPU implementation is much faster than the CPU implementation, especially when looking at the kernel times and at larger N. Only for very small N is the CPU faster, because it doesn't incur the transfer overhead.

#### Is the GPU good for small or large N? What is interesting about the response times across values of N on the GPU?

The GPU is good for large N. Even though the workload grows with N^2, the processing time increases by a much smaller factor.

#### Reason about the performance across N on the GPU.

Since the GPU has thousands of cores and my CPU only has two, the GPU is much better at highly parallelizable workloads. Most of the time overhead from adding another point is negated by the fact that it can just be processed on another thread, in parallel to all the others. I suspect that most of the slowdown we observed in the kernel time when increasing N is caused by atomic operations and `__syncthreads()` calls, which make it possible for each thread to slow down many others.

### 2.

#### How does your new algorithm work?

I took the brute force algorithm and optimized it in a bunch of basic ways, very similar to assignment 5 (a sketch of the resulting kernel follows after this list).

- **Opt 1** We reduced floating point operations by working with the squared distance and squared epsilon.
- **Opt 2** We cut off redundant comparisons by returning out of the kernel in the inner loop `if (i >= tid)`. Then we add the missing values back in our main function by applying `result = 2 * result + N;` (each remaining pair stands in for its mirrored counterpart, and the N self-comparisons are added back).
- **Opt 3** We reduced more operations by streamlining the code in small ways, e.g. preventing a memory read and write by inlining `dx * dx + dy * dy` instead of storing it in a variable, or declaring `struct Point a = data[tid];` outside of the for loop.
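For concreteness, here is a minimal sketch of what a kernel with Opt 1–3 applied can look like. The kernel, struct, and parameter names are placeholders rather than the exact code in the archive, and the sketch uses `break` plus a per-thread counter where the submitted code returns out of the kernel directly.

```
// Sketch of the optimized kernel with Opt 1-3 applied (placeholder names,
// not the exact submitted code).
struct Point { float x; float y; };

__global__ void countClosePairs(const struct Point *data, int N,
                                float epsilonSquared, // Opt 1: e * e, precomputed on the host
                                unsigned long long *result) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= N) return;

    struct Point a = data[tid]; // Opt 3: load our own point once, outside the loop
    unsigned long long count = 0;

    for (int i = 0; i < N; i++) {
        // Opt 2: only compare pairs with i < tid, so each unordered pair is
        // processed exactly once across all threads. (A plain `return` would
        // skip the atomicAdd below, so this sketch uses `break` instead.)
        if (i >= tid) break;
        float dx = a.x - data[i].x;
        float dy = a.y - data[i].y;
        // Opt 1 + Opt 3: compare the squared distance against the squared
        // epsilon inline -- no sqrt, no temporary variable.
        if (dx * dx + dy * dy <= epsilonSquared)
            count++;
    }
    atomicAdd(result, count);
}

// Hypothetical launch, with d_result zeroed beforehand:
// countClosePairs<<<(N + 255) / 256, 256>>>(d_data, N, e * e, d_result);
```

After copying `result` back to the host, the missing values are restored with `result = 2 * result + N;` as described in **Opt 2**.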
#### When developing your algorithm, what worked well and what didn't work well?

All of the above optimizations worked well. Here are some optimizations that I tried and that didn't work.

For **Opt 2**, I first tried to let the inner for loop do the work of skipping redundant iterations, like I did in assignment 5: `for (int i = tid + 1; i < N; i++)`. But for some reason this is really slow, while returning out of the kernel like we ended up doing in **Opt 2** is very fast. This is surprising, since I would've expected both to do pretty much the same thing.

I also tried to develop a function to get the non-redundant point pairs laid out linearly. That would have made it possible to distribute the workload of processing the non-redundant point pairs evenly across the GPU threads, which, from my understanding, might have brought a significant speedup. Unfortunately, I could only come up with a recursive function to get these point pairs laid out linearly. Here is a comment I made in code which explains the idea and its shortcomings in more detail:

---

```
I think the non_redundant_point_combinations approach would work, but the
array of these combinations is roughly N^2 which is hundreds of gigabytes
for N = 100k.

I found a recursive function that, if solved, would allow us to get these
values on the fly on the GPU instead of having to precalculate them all and
put them into an absurdly huge array. I can't manage to solve it though.
Here it is anyways:

- J_0 = 0
- I_0 = 1
- J_{n+1} = (J_n + 1) % I_n
- I_{n+1} = I_n + (1 if J_{n+1} = 0, else 0)

This gives us a series of point combinations (I_n, J_n) that only compares
each pair of points once, and doesn't compare points with themselves - so it
strips away all of the redundant point comparisons on the GPU. And since it's
a one-dimensional series, it would also allow us to split these point
combinations up onto the GPU cores evenly to give each core an even workload,
and therefore maximize performance. The threads would jump around the points
array a lot in their accesses though, which would probably slow things down
somewhat.
```

---

I did the bonus first and could not find a way to make shared memory speed things up significantly, so I didn't bother using shared memory for question 1.2.

#### Compare performance with the global memory implementation above (Part 1). Is your implementation faster? Note that your algorithm may be more efficient when N > 100000. Feel free to use a larger N, if you believe your approach will outperform the brute force algorithm at larger input sizes. Furthermore, you are not restricted to using global memory, you may also use shared memory.

*The optimized algorithm will be abbreviated `OP` and the non-optimized one `NOP`.*

*In the following we'll be talking about kernel times in seconds. We ignore the transfer times because we assume them to be the same between OP and NOP.*

##### OP, N = 100000, e = 5

0.096455

##### NOP, N = 100000, e = 5

0.354338

For these values, the optimized version is 3.67 times faster.

##### OP, N = 100000, e = 0.000000001

0.096344

##### NOP, N = 100000, e = 0.000000001

0.352148

For these values, the optimized version is 3.66 times faster.

##### OP, N = 1000000, e = 0.000000001

8.694127

##### NOP, N = 1000000, e = 0.000000001

36.812289

For these values, the optimized version is 4.23 times faster.

We can see that the epsilon doesn't make as much of a difference for the optimized version as it did in assignment 5. As we increase N, the advantage of the optimized version slowly grows. If we had examined even greater N, we might see even bigger speedups.

### Report how you compiled your programs and the GPU model you used

I started the programs with two bash scripts, `start_p1_specific_gpu.sh` and `start_p2_specific_gpu.sh`, both of which are included in the archive. I used the compiler flag `-gencode=arch=compute_37,code=sm_37` for the GPU type, which should correspond to the Tesla K80 cards.

## Bonus

### Compare the shared memory implementation to the global memory implementation above (Question 1, Part 1). Which is faster? Why? You may compare to the GPU kernel only response times, since data transfer times will be the same in both implementations.

*We'll be referring to the shared memory algorithm as `SHP`.*

#### SHP, N = 100000, e = 10

0.384529

#### NOP, N = 100000, e = 10

0.354482

The shared memory implementation is slightly slower than the global memory one. Each thread compares its point with all N points, but only `blockDim.x` of those points are stored in the block's shared memory, so only this fraction of the points being processed actually benefits from the faster shared memory. The overhead of writing the points into shared memory, and of then figuring out for each point whether to retrieve it from shared or global memory, apparently outweighs the benefits.
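To make that trade-off concrete, here is a minimal sketch of the kind of shared-memory kernel described above, again with placeholder names and an assumed fixed block size of 256 (not the exact submitted code). Each block stages only its own `blockDim.x` points in shared memory, and every iteration of the inner loop has to decide whether to read from shared or global memory.

```
// Sketch of the shared-memory variant discussed above (placeholder names,
// fixed block size assumed; not the exact submitted code).
#define BLOCK_SIZE 256

struct Point { float x; float y; };

__global__ void countClosePairsShared(const struct Point *data, int N,
                                      float epsilonSquared,
                                      unsigned long long *result) {
    __shared__ struct Point tile[BLOCK_SIZE]; // only this block's own points

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int blockStart = blockIdx.x * blockDim.x;

    // Stage this block's blockDim.x points in shared memory.
    if (tid < N)
        tile[threadIdx.x] = data[tid];
    __syncthreads();

    if (tid >= N) return;

    struct Point a = tile[threadIdx.x];
    unsigned long long count = 0;

    for (int i = 0; i < N; i++) {
        // Per-iteration branch: read from shared memory when the point
        // belongs to this block, from global memory otherwise. This check
        // is part of the overhead discussed above.
        struct Point b = (i >= blockStart && i < blockStart + BLOCK_SIZE)
                             ? tile[i - blockStart]
                             : data[i];
        float dx = a.x - b.x;
        float dy = a.y - b.y;
        if (dx * dx + dy * dy <= epsilonSquared)
            count++;
    }
    atomicAdd(result, count);
}
```

With N = 100000 and 256 threads per block, each thread reads only 256 of its 100000 comparison points from shared memory, which is consistent with the net benefit being small (and here negative).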
### Report how you compiled your programs and the GPU model you used (make sure it is the same as Question 1)

I compiled the bonus with the bash script `start_bonus_specific_gpu.sh`, which is included in the archive. I used the K80 cards, just like for question 1.
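As a sanity check that both runs used the same GPU model, a small snippet like the following (hypothetical, not part of the submitted archive) could be used to list the visible devices; a Tesla K80 should report compute capability 3.7, matching the `compute_37`/`sm_37` flags above.

```
// Hypothetical helper, not part of the submission: list the visible GPUs
// so we can confirm the jobs ran on the same model (a Tesla K80 should
// report compute capability 3.7).
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; d++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s (compute capability %d.%d)\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```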