<!-- Call the assignment `CS450_aXXX_nhn33.pdf` where `XXX` is the assignment number -->

:::info
**CS 450** Noah Nuebling
:::

# Assignment 5

## Question 1

### a) & b)

**Subquestion (𝜀 = 10)**|**#Cores**|**Time (s) No Opt.**|**Time (s) w/ -O3**|**Speedup**|**Parallel Efficiency**
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Sequential|1|91.8|-|1|1
(a)|2|49.17|-|1.87|0.93
Sequential|1|-|25.9|1|1
(b)|2|-|13.6|1.9|0.95

### a)

"Without compiler optimization, report the number of cores, speedup, and parallel efficiency achieved."

Number of cores: 2
Speedup: 1.87
Parallel efficiency: 0.93

### b)

"With compiler optimization (-O3), report the number of cores, speedup, and parallel efficiency achieved."

Number of cores: 2
Speedup: 1.9
Parallel efficiency: 0.95

### c)

The performance gains are very good. The efficiency is very close to 1, which would be the optimal value. That tells us that the overhead is very small and that the iterations of the outer for-loop are efficiently and equally distributed onto the 2 cores. The speedup values are therefore also very close to their optimal value of 2 (2 since we're running on 2 cores).

## Question 2

### a)

I made several versions of the algorithm with small differences to make them faster in different scenarios. I'll be discussing `mainAlgorithmParallel2()` here, as it is the fastest parallel version of the algorithm when using compiler optimization (`-O3`). We changed the brute-force algorithm in the following ways to make it faster:

**Optimizations**

#### 1.

We are working with the squared distance and epsilon instead of the actual distance and epsilon. This allows us to omit many (relatively) expensive `sqrt()` function calls.

This works because `d(p1, p2) <= epsilon` if and only if `d(p1, p2)^2 <= epsilon^2`. We can see that `d(p1, p2) = sqrt(dx^2 + dy^2)`, while `d(p1, p2)^2 = sqrt(dx^2 + dy^2)^2 = dx^2 + dy^2` (notice the omission of the `sqrt()` call).

Since this comparison is used in up to N^2 iterations of the nested loop, using `d(p1, p2)^2` spares a lot of `sqrt()` operations. And since epsilon stays the same throughout all iterations, we only have to calculate `epsilon^2` once. Therefore, the overall number of operations is greatly reduced if we work with `epsilon^2` and `d(p1, p2)^2` instead of the non-squared values.

#### 2.

We're starting the iteration variable of our inner loop at `j = i+1` (where `i` is the iteration variable of the outer loop) instead of at `j = 1`.

The brute-force algorithm checks if `d(a, b) <= epsilon` **as well as** `d(b, a) <= epsilon` for every pair of points a and b. However, as `d(a, b)` equals `d(b, a)`, this is redundant. The brute-force algorithm also checks `d(a, a) < epsilon` for every point a. This is also redundant, because it will always be true (as long as `epsilon >= 0`, which is very reasonable, because epsilon represents a distance, and distances can't be negative).

With this optimization, we skip both of these types of redundant iteration steps. To have the result match the brute-force algorithm, we need to add the results for the skipped (redundant) iterations back in after we're done iterating over the points. To achieve this, we use the formula

`r = 2 * r' + N`

where `r'` is the result of our algorithm that skips the redundant iterations. We multiply `r'` by 2 to reflect that, for every matching pair of points a, b with `d(a, b) < epsilon`, there is the mirrored pair with `d(b, a) < epsilon`, which we've skipped in our iteration but which should also be counted. We add `N` (the number of points) to reflect the skipped iterations of points being compared to themselves. We can simply add the number of points, since `d(a, a) < epsilon` is always true. `r` is then the correct result that corresponds to the brute-force algorithm.
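To illustrate optimizations 1 and 2 in isolation, here is a minimal sketch of the resulting triangular loop. The names `point_t` and `countPairsSketch` are illustrative placeholders, not the identifiers from my actual code:

```c
typedef struct { double x, y; } point_t;     // illustrative struct, not my actual type

// Count all pairs (a, b) with d(a, b) <= epsilon, matching the brute-force result.
unsigned long long countPairsSketch(const point_t *P, int N, double epsilon) {
    double epsilonSq = epsilon * epsilon;    // epsilon^2, computed once (optimization 1)
    unsigned long long r = 0;                // this is r' from the formula above
    for (int i = 0; i < N; i++) {
        for (int j = i + 1; j < N; j++) {    // skip mirrored and self pairs (optimization 2)
            double dx = P[i].x - P[j].x;
            double dy = P[i].y - P[j].y;
            if (dx * dx + dy * dy <= epsilonSq)  // squared comparison, no sqrt()
                r++;
        }
    }
    return 2 * r + (unsigned long long)N;    // r = 2 * r' + N
}
```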
#### 3.

The last big optimization we're using is that we first sort all points by x value, and then use that to cut off the iteration once we know that all following points will have an x distance to the current point that is greater than epsilon.

Here's how/why this works:

Let's assume 2 points pi and pj, with i < j. We know that `dx(pi, pj) <= d(pi, pj)`. Therefore, if `d(pi, pj) <= epsilon`, then `dx(pi, pj) <= epsilon`.

With this knowledge, we only need to check whether `dx(pi, pj) <= epsilon` when `d(pi, pj) <= epsilon` is false, because otherwise we already know it to be true. But if `d(pi, pj) <= epsilon` is false, then `dx(pi, pj) <= epsilon` is potentially also false, and we want to check for that. So we only need to check `dx(pi, pj) <= epsilon` in select iterations (those where `d(pi, pj) <= epsilon` is false), making the overhead pretty low.

Here is the actual reason why we want to check for `dx(pi, pj) <= epsilon`: If, after checking, we find that `dx(pi, pj) <= epsilon` is in fact false, then we know that `epsilon < dx(pi, pj) <= d(pi, pj)`. And since the points are sorted by x coordinate (ascending), we also know that `epsilon < dx(pi, pk) <= d(pi, pk)` for all k > j. Therefore, `d(pi, pk) <= epsilon` will be false for all k > j. And since that is the case, we now know that we've found all matches for pi and can move on to pi+1. We don't need to keep iterating j up to `j < N`.

This optimization is very effective, especially for smaller epsilon.

#### 4.

There are other small optimizations I took in `mainAlgorithm4()` and others (like not storing `P[i]` and `P[j]` in local variables) that I won't discuss in more detail here. These optimizations don't make a big difference, especially when using compiler optimization (`-O3`).

**There are the following overheads:**

(a) To make the program utilize threads:

#### 1.

As we've covered in previous assignments, there is a certain overhead that comes with using `#pragma omp parallel for`, especially when using `schedule(dynamic)`.

#### 2.

In the sequential version of the algorithm I used a `while` loop instead of a `for` loop, which brought a decent speedup in this scenario. To be able to use OpenMP for parallelization, and especially its dynamic scheduling, I reverted to using a slower `for` loop in the parallel implementation.

(b) To reduce the amount of computation:

#### 1.

I use `qsort()` to sort the points by their x coordinate. This brings a certain overhead, which is O(n log n) and really fast in practice, though.

#### 2.

We have to precalculate `epsilon^2`. This only has to be done once, though, and is a super small overhead.

#### 3.

Before returning, we have to apply the formula `r = 2 * r' + N` (see above). This is also a negligible overhead.
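Putting the optimizations and the parallelization together, here is a minimal sketch of how the `qsort()` call, the x cutoff, and the `#pragma omp parallel for schedule(dynamic)` interact. It is an illustration under assumed names (`point_t`, `compareX`, `countPairsParallelSketch`), not my exact `mainAlgorithmParallel2()` implementation:

```c
#include <stdlib.h>

typedef struct { double x, y; } point_t;   // illustrative struct, not my actual type

// qsort() comparator: sort points by x coordinate, ascending (optimization 3).
static int compareX(const void *a, const void *b) {
    double xa = ((const point_t *)a)->x;
    double xb = ((const point_t *)b)->x;
    return (xa > xb) - (xa < xb);
}

// Illustrative combination of the sort, the x cutoff, and the OpenMP pragma.
// Compile with -fopenmp.
unsigned long long countPairsParallelSketch(point_t *P, int N, double epsilon) {
    double epsilonSq = epsilon * epsilon;    // precalculated once (overhead (b) 2.)

    qsort(P, N, sizeof(point_t), compareX);  // O(n log n) sort (overhead (b) 1.)

    unsigned long long r = 0;
    #pragma omp parallel for schedule(dynamic) reduction(+: r)  // overhead (a) 1.
    for (int i = 0; i < N; i++) {
        for (int j = i + 1; j < N; j++) {
            double dx = P[j].x - P[i].x;     // >= 0, because P is sorted by x
            double dy = P[j].y - P[i].y;
            if (dx * dx + dy * dy <= epsilonSq) {
                r++;                         // match found
            } else if (dx > epsilon) {
                break;  // all points after j are even further away in x (optimization 3)
            }
        }
    }
    return 2 * r + (unsigned long long)N;    // correction formula (overhead (b) 3.)
}
```

The `reduction(+: r)` clause lets each thread count matches privately and sum the partial counts at the end, which avoids synchronizing on a shared counter inside the loop.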
### b) & c) & d)

**Subquestion (𝜀 = 10)**|**#Cores**|**Time (s) (No Opt.)**|**Time (s) (w/ -O3)**|**Speedup**|**Parallel Efficiency**|**T1/T2 (No Opt.)**|**T1/T2 (w/ -O3)**
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Sequential|1|0.51|-|1|1|-|-
(b)|2|0.30|-|1.68|0.84|-|-
Sequential|1|-|0.19|1|1|-|-
(c)|2|-|0.13|1.48|0.74|-|-
(d)|-|-|-|-|-|181.78|133.29

### b)

"Without compiler optimization, report the number of seconds it took to run the program (averaged over 3 trials), the speedup you achieved, the parallel efficiency, and the number of cores."

(Used epsilon=10 for all of this)

**#Cores** 2
**Efficiency** 0.84
**Speedup** 1.68

**Average runtimes** (N=100,000) (epsilon=10)

* (No Opt)
    * (Sequential)
        * (mainAlgorithm1())
            * 0.604984, 0.631076, 0.611815
            * avg: 0.6159583333
        * (mainAlgorithm4())
            * 0.497482, 0.497796, 0.519741
            * avg: **0.5050063333**
    * (2 cores)
        * (mainAlgorithmParallel2())
            * 0.361994, 0.377123, 0.383221
            * avg: 0.3741126667
        * (mainAlgorithmParallel3())
            * 0.299328, 0.299799, 0.304603
            * avg: **0.3012433333**

### c)

"With compiler optimization (-O3), report the number of seconds it took to run the program (averaged over 3 trials), the speedup you achieved, the parallel efficiency, and the number of cores."

(Used epsilon=10 for all of this)

**#Cores** 2
**Efficiency** 0.74
**Speedup** 1.48

**Average runtimes**

* (-O3)
    * (Sequential)
        * (mainAlgorithm1())
            * 0.192654, 0.195166, 0.195743
            * avg: **0.194521**
        * (mainAlgorithm4())
            * 0.227169, 0.186097, 0.180701
            * avg: 0.197989
    * (2 cores)
        * (mainAlgorithmParallel2())
            * 0.127256, 0.136946, 0.130846
            * avg: **0.1316826667**
        * (mainAlgorithmParallel3())
            * 0.135264, 0.176381, 0.131425
            * avg: 0.14769

### d)

Speedup over brute-force *(the speedup over brute-force would be much greater with a smaller epsilon)*

(epsilon=10)

(2 cores)

* (No Opt.)
    * 49.170986 / 0.3012433333 = **163.2268022709**
* (-O3)
    * 13.5641783333 / 0.1316826667 = **103.0065586703**

(sequential)

* (No Opt.)
    * 91.8 / 0.5050063333 = **181.7798984819**
* (-O3)
    * 25.9279506667 / 0.194521 = **133.2912676097**

### e)

The new algorithm is much faster. The optimizations listed above all worked really well.

Here is an optimization I tried that didn't work very well:

- I tried to check whether the points lay within a square bounding box (side length `2 * epsilon`, centered on the current point) of each other, before checking the actual distance.
- I did that with the code `if (abs(dx) > epsilon || abs(dy) > epsilon)`.
- The idea was that this is less expensive than the actual distance check, and that if this check fails, we know that the actual distance check **must** fail as well. However, the bounding-box check seems to not be sufficiently cheaper than the distance check to pay off: whenever the bounding-box check succeeds, we have to perform both the bounding-box check **and** the distance check. (A sketch of this check follows below.)
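For reference, here is a minimal sketch of that pre-check as a helper function. One detail worth noting: for `double` coordinates the comparison needs `fabs()` from `math.h`, since C's `abs()` operates on `int`s. The name `withinEpsilonSketch` is illustrative:

```c
#include <math.h>
#include <stdbool.h>

// Bounding-box pre-check followed by the squared-distance check.
// If either coordinate difference already exceeds epsilon, the full
// distance check must fail too, so the pair can be rejected cheaply.
static bool withinEpsilonSketch(double dx, double dy,
                                double epsilon, double epsilonSq) {
    if (fabs(dx) > epsilon || fabs(dy) > epsilon)
        return false;                         // cheap rejection
    return dx * dx + dy * dy <= epsilonSq;    // full squared-distance check
}
```

As described above, the extra branch only pays off if the cheap rejection happens often enough to offset doing both checks for the pairs that pass it.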
### f)

**Cache misses**

![](https://i.imgur.com/Gxg3Bj3.png)

Miss type|Q1|Q2
:-----:|:-----:|:-----:
I1 misses|1494|1770
LLi misses|1454|1724
D1 misses|110146028|770743
LLd misses|28505|59308
LL misses|29959|61032

I also wanted to measure floating-point operations and the number of points compared, but I couldn't manage to install Xcode Instruments due to space constraints on my laptop. After deleting many files and caches, and letting Xcode download for several hours, the install still failed and complained about missing space again, even though I have around 50 GB free. I couldn't find any other way to profile my program. Using `perf` and other profilers in my Ubuntu VM didn't work either, unfortunately, because they don't work properly in a virtual machine.

In order to free up space I apparently deleted my `stdlib.h`, and now I can't compile the program anymore to measure the number of points compared.

![](https://i.imgur.com/SlI8fwn.png)

### g)

I1 and LLi misses have stayed largely the same. This makes me think that these misses stem from outside of the main logic that is executed for each point.

D1 misses have decreased by a factor of 142.9, which is in line with the actual speedup. This makes me think that the vast majority of these misses comes from the main logic that is executed for each point, which is also the primary source of the speedup.

LLd misses and LL misses have roughly doubled, which I don't have an explanation for. Possibly they stem from the increase in if-statements in the main logic that is executed for each point, or maybe the source of these misses is the sorting of the points by x value, which is not performed in the unoptimized version.

## Bonus 1

Speedup over brute-force solution

-|T1/T2 (-O3) ε = 5.0|T1/T2 (-O3) ε = 10.0
:-----:|:-----:|:-----:
Sequential|272.8761484134|131.1468468378

For the sequential versions of the brute-force and the optimized algorithm, compiled with -O3, I measured a speedup of 272.88 for epsilon=5, and a speedup of 131.15 for epsilon=10.

## Bonus 2

I would like my code to be considered for the competition.

Using an extremely small epsilon (-O3, sequential, epsilon=0.000000000001, using mainAlgorithm3()), I measured a speedup of `T1/T2 = 1948.16`.

Under the same circumstances, but without compiler optimization (No Opt., sequential, epsilon=0.000000000001, using mainAlgorithm3()), I measured a speedup of `T1/T2 = 4234.8`.