# PHAS0100 - Week 8 (10th March 2023)
:::info
## :tada: Welcome to the 8th live-class!
### Today
Today we will cover:
- Why parallel programming?
- What is OpenMP?
- Parallelising loops
- Reductions
- Data sharing
- Thread control
- Strong and weak scaling
:::
### Timetable for today
<details>

| Start | End | Topic |
| -------- | -------- | -------- |
|2pm|2:55pm | Quiz & Pi exercise |
|2:55pm|3:05pm| -- 10 minute break -- |
|3:05pm|3:55pm| Pi exercise & Documentation challenge: Thread control |
|3:55pm|4:05pm| -- 10 minute break -- |
|4:05pm |4:50pm| Manual reduction (breakout) |

</details>
## Session 1
### Getting set up
- Clone the examples repository:
`git clone https://github.com/UCL-PHAS0100-22-23/week8_openmp_exercises.git`
- Open the folder in VS Code
- Make sure the container is working
- If you have any issues, let us know!
### Quiz 1: Parallel programming
See Menti code on screen.
### Exercise 1: Pi
In the exercise folder `01_pi` you will find code that calculates pi using a numerical approximation to the integral:
$$\pi = 4\int_{0}^{1} \frac{{\rm d}x}{1+x^2}$$
It's not important that you understand the mathematics here; just focus on parallelising the code.
#### Timing compiler optimisations
1. Ensure you can compile the code with `g++ main.cpp timer.cpp`
2. Run the code with `./a.out` and record your time below:
- 2.81919 s (JJ)
- 2.25203 s (BF)
- 3.65214 s
- 3.14159 s (SN)
- 6.72951 s (BJ)
- 5.32554 s (SL)
- 2.83936 s (JQ)
- 2.41085 s (AZ)
- 2.07852 s (AG)
- 2.0779 s (MM)
- 13.8568 s
- 4.41941 s (SK)
- 3.5713 s (RL)
3. Compile again with `g++ -O2 main.cpp timer.cpp`
4. Run the code and record your (hopefully improved) time below:
- 1.45177 s (JJ)
- 1.1674 s (BF)
- 1.19816 s
- 2.36319 s (SL)
- 2.60605 s (BJ)
- 1.17783 s (RL)
- 1.28908 s (SN)
- 1.07239 s (JQ)
- 1.25716 s (AZ)
- 1.31103 s (FG)
- 0.944704 s (AG)
- 0.941102 s (MM)
#### Parallelising the code
You've already seen many examples of how you can parallelise these kinds of loops with OpenMP so this one is up to you! If you get stuck at any point, do ask people beside you or the instructors.
#### Testing the **strong** scaling of your code
Once you're convinced your code is parallelised correctly, change the number of OpenMP threads using `OMP_NUM_THREADS` (see [this week's notes][notes] for a reminder on how to do this). Run with 1, 2, 4 and 8 threads, calculate your *speedup* (single-thread time divided by n-thread time) in each case, and record your results below, including the maximum speedup you achieved:
[notes]:https://github-pages.ucl.ac.uk/research-computing-with-cpp/08openmp/02_intro_openmp.html
| `OMP_NUM_THREADS=1` | `OMP_NUM_THREADS=2` | `OMP_NUM_THREADS=4` | `OMP_NUM_THREADS=8` | **Maximum** speedup |
|---|---|---|---|---|
| <time> | <time> | <time> | <time> | <speedup> |
|1.18046 | 0.600333 | 0.307029 | 0.28677 | 4.11 (BF)|
|2.84206 |1.59017 |1.03446 |0.791932 |3.58877× (JJ) |
|1.30629 | 0.689021|0.373141 |0.37796 |3.5 (FG)|
|2.59369 |1.49369 |0.997566 |0.533909 |4.86 (BJ) |
| | | | | |
|0.931707 | 0.489801 | 0.337287 | 0.279723 | 3.33 (MM) |
|4.94798| 2.52631 | 2.50114 | 2.4354 | 2.03 (SL) |
#### Testing the **weak** scaling of your code
Now use `omp_get_max_threads()` in your code to multiply `N` by the number of OpenMP threads, just like example 4 in [this week's notes][notes]. Run with the same thread counts as in the previous exercise and record your results below. This time, record your worst efficiency (your single-thread time divided by your 8-thread time).
(Make sure you don't overflow your integers!)
| `OMP_NUM_THREADS=1` | `OMP_NUM_THREADS=2` | `OMP_NUM_THREADS=4` | `OMP_NUM_THREADS=8` | **worst** efficiency |
|---|---|---|---|---|
| <time> | <time> | <time> | <time> | <efficiency> |
| 2.83228 | 2.95285 | 3.67399| 5.60816| 0.505× (JJ)|
| 0.00215277| 0.0017999| 0.00667108 | 0.00352606 | BF |
|2.65099 |2.89241 |3.30965 |3.87581 |0.68 (BJ)|
| | | | | |
### Documentation challenge: Thread control
Find each of the following constructs in the **current** OpenMP specification and answer these questions for each one:
1. Does the construct have an implicit barrier?
2. Can the barrier be disabled? If so, how?
3. Why does the construct have (or not have) an implicit barrier?
#### `single`
1. Yes
2. Yes, use `nowait`
3. Presumably because you want the single thread's work (e.g. initialisation) to finish before the other threads continue
#### `master`
1. No
2. N/A
3. `master` does no worksharing; it only picks which thread runs the block. If you want an implicit barrier, just use `single`. (Note: the current specification deprecates `master` in favour of `masked`.)
#### `sections`
1. Yes
2. Yes, use `nowait`
3. Presumably because the sections are separate pieces of work that should all finish before the threads continue
#### `critical`
1. No
2. N/A
3. No barrier is needed: `critical` only enforces mutual exclusion, so threads pass through the region one at a time and carry on without waiting for each other
---
### Coding Challenge: A manual reduction
Implement a manual reduction to parallelise the sum starting from the serial `main.cpp` code in `02_pi_manual_reduction`.
**Hints**:
<details>

- In a sum reduction:
  1. Each thread sums into a *private* variable.
  2. The private variables are then summed into the final result *one at a time*.
- Where should the parallel region start and end?
- Should we use `parallel for` or just `parallel`?
- How do we make sure each thread has its own private `sum_tmp`?
- When should all threads sum into the final sum?
- How can we ensure each thread accesses `sum` only one at a time?

</details>
**Extra optional exercises**:
- Time your reduction. Is it as fast as OpenMP's?
- Test the strong and/or weak scaling of your reduction.