# PHAS0100 - Week 8 (11th March 2022)
:::info
## :tada: Welcome to the 8th live-class!
### Today
Today we will cover:
- Why parallel programming?
- What is OpenMP?
- Parallelising loops
- Reductions
- Data sharing
- Thread control
- Strong and weak scaling
:::
### Timetable for today
<details>
| Start | End | Topic | Room |
| -------- | -------- | -------- | ----- |
|2pm|2:55pm | Parallel loops and scaling |main|
|2:55pm|3:05pm| -- 10 minute break -- | |
|3:05pm|3:55pm| Scaling, data sharing and thread control | main |
|3:55pm|4:05pm| -- 10 minute break -- | |
|4:05pm |4:20pm| Manual reduction| breakout |
|4:20pm |4:35pm| Schedules and tips & tricks| main |
|4:35pm |4:55pm| Manual reduction| breakout |
|4:55pm|5:00pm| -- 5 minute closeout -- | main|
</details>
## Session 1
### Getting setup
- Clone the examples repository:
`git clone https://github.com/UCL-PHAS0100-21-22/week8_examples.git`
- Open in vscode
- Make sure the container is working
- Any issues, let us know!
For CMake:
```cmake
find_package(OpenMP)
if(OpenMP_CXX_FOUND)
    target_link_libraries(MyTarget PUBLIC OpenMP::OpenMP_CXX)
endif()
```
### Problematic Pi
Why does this code produce the wrong result?
```cpp
double sum = 0.0;
#pragma omp parallel for
for(int i = 0; i < N; ++i) {
    double x = i*dx;
    sum += 4.0/(1.0 + x*x)*dx;
}
```
Answers:
- The loop cannot simply be parallelised because the sum is inherently sequential: every iteration needs read/write access to the same shared `sum` in memory.
- The approximation accumulates into a running total, but parallelisation causes several iterations to run at the same time, so the updates to `sum` no longer happen one after another.
- The same value of `sum` can be read (copied) by multiple threads at once. Each thread performs its addition on that copy and writes the result back to memory, so one thread's write can overwrite another's, losing an addition. This is a data race.
### Scaling exercise
Start from `main_openmp.cpp` code in `week8_examples/pi` (or if you followed along in the live coding exercise, feel free to use that code). You may have to edit the CMakeLists.txt file to compile the correct file.
The number of threads can be varied via an environment variable, e.g. `OMP_NUM_THREADS=4 ./pi`
Remember speedup is defined as (time using 1 thread) / (time using $n$ threads)
#### Group 1 (Name starting A-M)
- Vary number of cores, *don't* vary $N$
- Calculate speedup over serial code
- Add data below
n_cores, speedup
2, ~1.9
3, ~2.9
4, ~3.9
5, ~4.7
6, ~5.5
7, ~6.4
8, ~6.8
2, ~1.95
3, ~2.45
4, ~3.09
2, 1.7
3, 2.5
4, 3.4
5, 3.8
6, 4.4
7, 4.6
8, 5.1
2, 1.9
3, 2.4
4, 2.6
5, 3.3
6, 2.7
7, 3.3
8, 2.7
2, ~1.9
3, ~2.7
4, ~3.5
#### Group 2 (Name starting N-Z)
- Vary number of cores and vary $N$ by same factor
- e.g. x2 cores then x2 N
- Calculate speedup over serial code
- Remember to recompile after changing $N$!
- Add data below
n_cores, speedup
??, ??
6, ~ 5.6
8, ~ 7.3
10, ~ 8.9
1, 1
2, 0.5
3, 0.25
4, 0.125
As $N$ scales up, the number of iterations to reach the solution gets bigger, so each thread keeps roughly the same amount of work.
## Session 2
### Documentation challenge: Thread control
Look up whether this construct has an implicit barrier:
- Name starting A-F - `single`
- Name starting G-L - `master`
- Name starting M-R - `sections`
- Name starting S-Z - `critical`
1. Does the construct have an implicit barrier?
2. Can the barrier be disabled? If so, how?
3. Where did you get your information from?
#### `single`
1. Yes
2. Yes, using the `nowait` clause
3. https://www.openmp.org/spec-html/5.0/openmpsu38.html
#### `master`
1. No
2. Not applicable; `master` has no barrier to disable
3. https://www.openmp.org/spec-html/5.0/openmpse24.html
1. No implied barrier either on entry or exit
2. Not applicable. There's no barrier
3. https://www.openmp.org/spec-html/5.0/openmpse24.html
1. No implied barrier exists on either entry to or exit from the master section.
2.
3. IBM docs
#### `sections`
1. Yes, there is an implicit barrier at the end of the `sections` region (but not after each individual `section`)
2. Yes, using the `nowait` clause
3. OpenMP 5.0 specification (worksharing constructs)
#### `critical`
1. No, `critical` has no implicit barrier; it only enforces that one thread at a time executes the enclosed block
2. Not applicable; there is no barrier to disable
3. https://www.openmp.org/spec-html/5.0/openmpsu89.html
---
### Coding Challenge: A manual reduction
Implement a manual reduction to parallelise the sum starting from the serial `pi.cpp` code in `week8_examples/pi`.
**Hints**:
<details>
- Recall the steps involved in a manual sum reduction:
1. Each thread sums into a *private* variable.
2. Private variables summed into final result *one at a time*.
- Where should the parallel region start and end?
- Should we use `parallel for` or just `parallel`?
- How do we make sure each thread has its own private `sum_tmp`?
- When should all threads sum into the final sum?
- How can we ensure each thread accesses `sum` only one at a time?
</details>
**Extra optional exercises**:
- Time your reduction. Is it as fast as OpenMP's built-in `reduction` clause?
- Test the strong and/or weak scaling of your reduction.
## Session 3
### Documentation challenge: Schedules
Summarise in pairs or individually what the following schedules do and write your summary below:
- Name starting A-G - `static`
- Name starting H-P - `dynamic`
- Name starting Q-Z - `guided`
#### `static`
- Iterations are divided into chunks that are assigned to threads in round-robin order before the loop starts. Lowest scheduling overhead; works best when every iteration costs about the same.
#### `dynamic`
- Chunks of iterations are handed out to threads on request, as each thread finishes its previous chunk. Higher overhead than `static`, but balances uneven workloads.
#### `guided`
- Like `dynamic`, but the chunk size starts large and shrinks towards the specified minimum as the loop progresses, reducing scheduling overhead early on.
---
### Coding Challenge: Parallelising Mandelbrot
Starting with `main.cpp` in `week8_examples/mandelbrot`
1. Run serially to check the area is approximately $1.5$ and the error is less than $0.01$.
2. Note the timing of the serial code.
3. Parallelise the main for loops:
- Ensure the area remains identical to the serial result
- Use the data sharing attributes to ensure correct privacy of variables
- Try parallelising the outermost loop before using `collapse` to also parallelise the inner loop
- Note the speedup you've achieved over the serial code
4. Try to improve performance using:
- `schedule`
- `collapse`
5. Note your best speedup below:
**Speedups:**
-
-
-
-
## Homework: Mandelbrot scaling
1. Test the weak and strong scaling of your code
Specifically for the weak scaling, try changing the number of threads *and* `NPOINTS` to:
| `OMP_NUM_THREADS` | `NPOINTS` |
|---|---|
|1|256|
|2|512|
|4|1024|
|8|2048|
Try also changing the value of `MAXITER`. You should find a different scaling. Why might that be?
2. Using your scaling results, try to answer the questions:
- How fast should my code run on a computer with twice the number of cores as mine?
- How fast should my code run if I double `NPOINTS` *and* run on a computer with twice as many cores as mine?
# Questions
Here you can post any question you have while we are going through this document. A summary of the questions and the answers will be posted in moodle after each session. Please, use a new bullet point for each question and sub-bullet points for their answers.
For example writing like this:
```
- Example question
- [name=student_a] Example answer
- [name=TA_1] Example answer
```
produces the following result:
- Example question
- [name=student_a] Example answer
- [name=TA_1] Example answer
Write new questions below :point_down:
###### tags: `phas0100` `teaching` `class`