:::info
**CS 450**
Noah Nuebling
:::
# Assignment 6
## Question 1
### 1.
All times are in seconds.

| N | e | Total (s) | Kernel (s) | CPU (s) | CPU cores |
|---:|---:|---:|---:|---:|---:|
| 100 | 5 | 0.131939 | 0.000085 | 0.000271 | 2 |
| 100 | 10 | 0.129066 | 0.000086 | 0.000487 | 2 |
| 1000 | 5 | 0.127911 | 0.000610 | 0.000840 | 2 |
| 1000 | 10 | 0.129453 | 0.000609 | 0.001341 | 2 |
| 10000 | 5 | 0.131776 | 0.005767 | 0.252819 | 2 |
| 10000 | 10 | 0.134667 | 0.005775 | 0.179304 | 2 |
| 100000 | 5 | 0.486064 | 0.354338 | 19.517811 | 2 |
| 100000 | 10 | 0.483756 | 0.354482 | 18.305090 | 2 |
#### Which is faster, the GPU or CPU implementation?
The GPU implementation is much faster than the CPU implementation, especially at larger N and when looking at the kernel times.
For very small N the CPU is faster, because it doesn't pay the host-to-device transfer overhead.
#### Is the GPU good for small or large N? What is interesting about the response times across values of N on the GPU?
The GPU is good for large N. Even though the workload grows as N^2, the processing time increases by a much smaller factor. For example, going from N = 1000 to N = 10000 multiplies the number of pairwise comparisons by 100, but the kernel time only grows from 0.000610 s to 0.005767 s, roughly a factor of 9.5.
#### Reason about the performance across N on the GPU.
Since the GPU has thousands of cores and my CPU only has two, the GPU is much, much better at highly parallelizable workloads. Most of the overhead from adding another point is negated by the fact that it can be processed on another thread, in parallel with all the others.
I suspect that most of the kernel-time slowdown we observed when increasing N comes from atomic operations and `__syncthreads()` calls, which make it possible for each thread to slow down many others.
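To make this concrete, here is a minimal sketch of the brute-force kernel. The names are mine and the `__syncthreads()` usage mentioned above is omitted for brevity, so the submitted kernel differs in its details.

```
// Sketch of the Part 1 brute-force kernel: one thread per point, each
// thread compares its point against all N points and atomically adds its
// neighbor count to a single global counter.
struct Point { double x; double y; };

__global__ void bruteForceKernel(struct Point *data, int N, double epsilon,
                                 unsigned long long *result) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= N) return;

    unsigned long long count = 0;
    for (int i = 0; i < N; i++) {
        double dx = data[tid].x - data[i].x;
        double dy = data[tid].y - data[i].y;
        if (sqrt(dx * dx + dy * dy) <= epsilon) count++;
    }
    // These atomics serialize threads against each other, which is one
    // suspected source of the kernel-time growth across N.
    atomicAdd(result, count);
}
```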
### 2.
#### How does your new algorithm work?
I took the brute-force algorithm and optimized it in a bunch of basic ways, very similar to assignment 5. A sketch combining all three optimizations follows the list.
- **Opt 1:** We reduced floating-point operations by working with the squared distance and squared epsilon.
- **Opt 2:** We cut off redundant comparisons by returning out of the kernel in the inner loop when `i >= tid`, so each pair is only compared once. We then add the missing values back in the main function via `result = 2 * result + N;` (each counted pair stands for two symmetric comparisons, and the `+ N` restores the N self-comparisons).
- **Opt 3:** We reduced more operations by streamlining the code in small ways, e.g. avoiding a memory read and write by inlining `dx * dx + dy * dy` instead of storing it in a variable, and declaring `struct Point a = data[tid];` outside of the for loop.
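Here is roughly what that looks like as a sketch; the kernel name and details are my reconstruction and may differ from the submitted code. It reuses the `Point` struct from the sketch above.

```
// Sketch of the optimized kernel. epsilon2 is epsilon * epsilon,
// precomputed on the host (Opt 1).
__global__ void optimizedKernel(struct Point *data, int N, double epsilon2,
                                unsigned long long *result) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= N) return;

    struct Point a = data[tid];        // Opt 3: read our own point once
    unsigned long long count = 0;
    for (int i = 0; i < N; i++) {
        if (i >= tid) {                // Opt 2: only pairs with i < tid
            atomicAdd(result, count);  // flush this thread's count
            return;
        }
        double dx = a.x - data[i].x;
        double dy = a.y - data[i].y;
        if (dx * dx + dy * dy <= epsilon2) count++; // Opt 1 + Opt 3: squared, inlined
    }
}
```

On the host, `result = 2 * result + N;` then restores the skipped symmetric comparisons and self-comparisons.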
#### When developing your algorithm, what worked well and what didn’t work well?
All of the above optimizations worked well.
Here are some optimizations I tried that didn't work.
For **Opt 2**, I first tried to let the inner for loop do the work of skipping the redundant iterations, like I did in assignment 5: `for (int i = tid + 1; i < N; i++)`. But for some reason this is really slow, while returning early, as we ended up doing in **Opt 2**, is very fast. This is surprising, since I would've expected both to do pretty much the same thing. For comparison, the slow variant looked roughly like the sketch below.
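This is a sketch under the same assumptions as the ones above; here each unordered pair is counted via `i > tid` instead of `i < tid`, which is equivalent in terms of the work done.

```
// Sketch of the variant that was unexpectedly slow: the redundant pairs
// are skipped via the loop's start index instead of an early return.
__global__ void loopStartKernel(struct Point *data, int N, double epsilon2,
                                unsigned long long *result) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= N) return;

    struct Point a = data[tid];
    unsigned long long count = 0;
    for (int i = tid + 1; i < N; i++) { // start past tid instead of returning at it
        double dx = a.x - data[i].x;
        double dy = a.y - data[i].y;
        if (dx * dx + dy * dy <= epsilon2) count++;
    }
    atomicAdd(result, count);
}
```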
I also tried to develop a function that lays the non-redundant point pairs out linearly. That would have made it possible to distribute the workload of processing the non-redundant point pairs evenly across the GPU threads, which, from my understanding, might have brought a significant speedup.
Unfortunately, I could only come up with a recursive function to get these point pairs laid out linearly.
Here is a comment I made in the code which explains the idea and its shortcomings in more detail:
---
```
I think the non_redundant_point_combinations
approach would work, but the array of these
combinations has roughly N^2 entries, which
is hundreds of gigabytes for N = 100k.

I found a recursive function that, if solved,
would allow us to get these values on the fly
on the GPU instead of having to precalculate
them all and put them into an absurdly huge
array. I can't manage to solve it though.
Here it is anyway:

- J_0 = 0
- I_0 = 1
- J_{n+1} = (J_n + 1) % I_n
- I_{n+1} = I_n + (1 if J_{n+1} = 0, else 0)

This gives us a series of point combinations
(I_n, J_n) that compares each pair of points
only once and never compares a point with
itself - so it strips away all of the
redundant point comparisons on the GPU.
Since it's a one-dimensional series, it would
also allow us to split these point
combinations up evenly onto the GPU cores,
giving each core an even workload and
therefore maximizing performance.

The threads would jump around the points
array a lot in their accesses though, which
would probably slow things down somewhat.
```
---
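For reference: the recurrence does have a closed-form solution, obtained by inverting the triangular numbers T(I) = I * (I - 1) / 2 <= n. Here is an untested sketch of computing the n-th pair on the fly; the integer guards compensate for possible rounding in the double-precision `sqrt`.

```
// Closed-form mapping from a linear index n in [0, N*(N-1)/2) to the
// n-th non-redundant pair (I_n, J_n) with J_n < I_n from the recurrence
// above. Inverts n = I*(I-1)/2 + J via I = floor((1 + sqrt(1 + 8n)) / 2).
__device__ void pairFromIndex(unsigned long long n,
                              unsigned long long *I, unsigned long long *J) {
    unsigned long long i =
        (unsigned long long)((1.0 + sqrt(1.0 + 8.0 * (double)n)) / 2.0);
    // Correct any off-by-one caused by floating-point rounding.
    while (i * (i - 1) / 2 > n) i--;
    while ((i + 1) * i / 2 <= n) i++;
    *I = i;
    *J = n - i * (i - 1) / 2;
}
```

Each thread could then process a contiguous range of n values, which would give the even workload split described in the comment.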
I did the bonus first and could not find a way to make shared memory speed things up significantly, so I didn't bother using shared memory for question 1.2.
#### Compare performance with the global memory implementation above (Part 1). Is your implementation faster? Note that your algorithm may be more efficient when N > 100000. Feel free to use a larger N, if you believe your approach will outperform the brute force algorithm at larger input sizes. Furthermore, you are not restricted to using global memory, you may also use shared memory.
*The optimized algorithm will be abbreviated `OP` and the non-optimized one `NOP`.*
*In the following we'll be talking about kernel times in seconds. We ignore the transfer times, because we assume them to be the same between OP and NOP.*
##### OP, N = 100000, e = 5
0.096455
##### NOP, N = 100000, e = 5
0.354338
For these values, the optimized version is roughly 3.7 times faster.
##### OP, N = 100000, e = 0.000000001
0.096344
##### NOP, N = 100000, e = 0.000000001
0.352148
For these values, the optimized version is roughly 3.7 times faster.
##### OP, N = 1000000, e = 0.000000001
8.694127
##### NOP, N = 1000000, e = 0.000000001
36.812289
For these values, the optimized version is 4.23 times faster.
We can see that epsilon doesn't make as much of a difference for the optimized version as it did in assignment 5.
As we increase N, the advantage of the optimized version slowly grows, so at even greater N we might see even bigger speedups.
### Report how you compiled your programs and the GPU model you used
I started the programs with two bash scripts, `start_p1_specific_gpu.sh` and `start_p2_specific_gpu.sh`, both of which are included in the archive. I compiled with the GPU-type flag `-gencode=arch=compute_37,code=sm_37`, which targets the Tesla K80 cards.
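The scripts essentially invoke the compiler like this (the file names here are placeholders, not the actual names from the archive):

```
nvcc -gencode=arch=compute_37,code=sm_37 p1.cu -o p1
```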
## Bonus
### Compare the shared memory implementation to the global memory implementation above (Question 1, Part 1). Which is faster? Why? You may compare to the GPU kernel only response times, since data transfer times will be the same in both implementations.
*We'll be referring to the shared memory algorithm as `SHP`.*
#### SHP, N = 100000, e = 10
0.384529
#### NOP, N = 100000, e = 10
0.354482
The shared memory implementation is slightly slower than the global memory one.
Each thread compares its point against all `N` points, while only `blockDim.x` points are stored in the block's shared memory. So only that fraction of the points being processed actually benefits from the faster shared memory, and the overhead of writing the points into shared memory, plus deciding for each point whether to fetch it from shared or global memory, apparently outweighs the benefit. A sketch of the scheme follows.
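This sketch reuses the `Point` struct and conventions from the question 1 sketches; the block size and names are assumptions.

```
// Sketch of the shared-memory variant: each block stages only its own
// blockDim.x points in shared memory, and every access needs a check for
// whether the point is in the staged window or must come from global memory.
#define BLOCK_SIZE 256 // assumed threads per block at launch

__global__ void sharedMemKernel(struct Point *data, int N, double epsilon,
                                unsigned long long *result) {
    __shared__ struct Point staged[BLOCK_SIZE];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int blockStart = blockIdx.x * blockDim.x;

    if (tid < N) staged[threadIdx.x] = data[tid];
    __syncthreads(); // every thread in the block must reach this
    if (tid >= N) return;

    struct Point a = staged[threadIdx.x];
    unsigned long long count = 0;
    for (int i = 0; i < N; i++) {
        // Only blockDim.x of the N points each thread reads come from
        // shared memory; the branch below runs on every iteration.
        struct Point b = (i >= blockStart && i < blockStart + blockDim.x)
                             ? staged[i - blockStart]
                             : data[i];
        double dx = a.x - b.x;
        double dy = a.y - b.y;
        if (sqrt(dx * dx + dy * dy) <= epsilon) count++;
    }
    atomicAdd(result, count);
}
```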
### Report how you compiled your programs and the GPU model you used (make sure it is the same as Question 1)
I compiled the bonus with the bash script `start_bonus_specific_gpu.sh`, which is included in the archive. I used the K80 cards, just like for question 1.