# 3 Introduction to OpenMP 5.0

###### tags: `SS2021-IN2147-PP`

## OpenMP and its Execution Model

### Terminology: HW Threads

* A single system has many hardware threads
* A **`system/node`** can have multiple **`physical processors`** (aka **`sockets`**)
    * Independent chips with their own interconnect between them
* Each **`socket`** can have multiple **`cores`**
    * Replicated processors on a single die with an on-die interconnect
* Each **`core`** can run multiple **`hardware threads`**
    * `Simultaneous Multithreading / Hyperthreading`
    * Separate program counter and stack pointer, but shared execution resources

### OpenMP = Open Multi-Processing

* Standard for writing parallel programs, mainly **`on-node`**
    * `Shared memory parallelism`
    * **Portability**
* 3 primary API components:
    * Compiler Directives (for C/C++ and Fortran)
    * Runtime Library Routines
    * Environment Variables
* **OpenMP is not ...**
    * intended for distributed memory systems
    * necessarily implemented identically by all vendors
    * guaranteed to make the most efficient use of shared memory automatically
    * required to check for data dependencies, race conditions, deadlocks, etc.
    * designed to handle parallel I/O

### Fork/Join Execution Model

![](https://i.imgur.com/xbaO98o.png)

### Nested Parallelism

![](https://i.imgur.com/Xh8xqMf.png)

### OpenMP System Stack

![](https://i.imgur.com/ycnIQ3I.png)

## OpenMP Syntax

```c
// Most of the constructs in OpenMP are compiler directives
#pragma omp construct [clause [clause]...]

// Directives can have continuation lines
#pragma omp parallel private(i) \
                     private(j)
```

### Parallel Regions

```c
// Degree of parallelism controlled by an internal control variable (ICV):
// $ export OMP_NUM_THREADS=8
// $ ./a.out
#pragma omp parallel [parameters]
{
    block
}
```

### Sections in a Parallel Region

```c
// Parallel regions can have separate sections
#pragma omp sections [parameters]
{
    #pragma omp section
    { ... block ... }
    #pragma omp section
    { ... block ... }
    ...
    // implicit barrier at the end of the sections region
}
```

### Parallel Loop

```c
// Iterations must be independent!
// OpenMP does not check for dependencies!
#pragma omp for [parameters]
for ...
// implicit barrier at the end,
// unless the clause 'nowait' is specified
```

### Available Loop Schedules

* static
    * `Fixed-size chunks` (default: n/t, the n iterations divided evenly among the t threads)
    * Assigned in `round-robin` fashion
* dynamic
    * `Fixed-size chunks` (default: 1)
    * `Distributed one by one` at runtime as chunks finish
* guided
    * Start with `large` chunks, then exponentially `decreasing size`
    * `Distributed one by one` at runtime as chunks finish
* runtime
    * Controlled at runtime via a control variable (e.g. the `OMP_SCHEDULE` environment variable)
* auto
    * Compiler/runtime can choose

## Data Sharing in OpenMP

### Shared vs. Private Data

* **Global variables**:
    * Shared
* **Variables declared within the "dynamic extent" of parallel regions**:
    * Private
    * Includes routines called from parallel regions

### Private Data Options

![](https://i.imgur.com/25CgWWF.png)

### First/Last Private Data Options

![](https://i.imgur.com/a1x6Nf4.png)

### Summary: Sharing Attributes of Variables

* **`private(var-list)`**
    * Variables in var-list are private
* **`shared(var-list)`**
    * Variables in var-list are shared
* **`default(private | shared | none)`**
    * Sets the default for all variables in this region
* **`firstprivate(var-list)`**
    * Variables are private
    * Initialized with the value of the shared copy before the region
* **`lastprivate(var-list)`**
    * Variables are private
    * The value from the thread that executes the sequentially last iteration of a parallel loop is copied to the shared variable after the region
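To make these clauses concrete, here is a minimal sketch (the variable names `x` and `last_i` are illustrative, not from the lecture) showing `firstprivate` and `lastprivate` on a parallel loop:

```c
#include <stdio.h>

int main(void) {
    int x = 10;      // shared copy, used to initialize the firstprivate copies
    int last_i = -1; // receives the value from the sequentially last iteration

    // Each thread gets a private x initialized to 10 (firstprivate).
    // After the loop, last_i holds the value written by iteration
    // i == 7, i.e. the sequentially last one (lastprivate).
    #pragma omp parallel for firstprivate(x) lastprivate(last_i)
    for (int i = 0; i < 8; i++) {
        x += i;      // updates only the thread-private copy
        last_i = i;
    }

    printf("x=%d last_i=%d\n", x, last_i); // x is still 10; last_i is 7
    return 0;
}
```

Without `firstprivate`, the private copies of `x` would start uninitialized; without `lastprivate`, the shared `last_i` would keep its old value `-1` after the loop.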
## Synchronization

### Barrier

```c
// Parallel regions and worksharing constructs have an implicit
// barrier at the end. For worksharing constructs (for, sections,
// single) it can be switched off with the 'nowait' clause:
#pragma omp for nowait

// Additional barriers can be added when needed:
#pragma omp barrier
```

### Master Region

```c
// Only the master thread executes the code block:
// * Other threads skip the region
// * No synchronization at the beginning or end of the region
// Possible uses:
// * Printing to screen
// * Keyboard entries
// * File I/O
#pragma omp master
block
```

### Single Region

```c
// Only a single (arbitrary) thread executes the code block
// Possible uses:
// * Initialization of data structures
#pragma omp single [parameter]
block
```

### Critical Section

```c
// Mutual exclusion: only one thread at a time executes the block
// All 'unnamed' critical directives map to the same name
#pragma omp critical [(Name)]
block
```

### Atomic Statements

```c
// Ensures that a specific memory location is updated atomically
#pragma omp atomic
expression-stmt
```

### Simple Runtime Locks

```c
omp_init_lock(&lockvar);     // initialize a lock
omp_destroy_lock(&lockvar);  // destroy a lock
omp_set_lock(&lockvar);      // set lock (blocks until available)
omp_unset_lock(&lockvar);    // free lock
logicalvar = omp_test_lock(&lockvar); // try to set the lock without blocking
```

### Nestable Locks

* Nestable locks can be set multiple times by a single thread.
* If the lock counter is 0 after an unset operation, the lock can be set by another thread.
* For example:
    * Function `a` needs a mutex, and function `b` needs the same mutex, while `a` also calls `b`.
    * Using `std::mutex` would inevitably cause a deadlock.
    * Using `std::recursive_mutex` (a nestable lock) solves the problem.
* A C sketch using OpenMP's nestable-lock API is given at the end of these notes.

### Ordered Construct

```c
#pragma omp for ordered
for (...) {
    S1
    #pragma omp ordered
    {
        S2   // executed in sequential iteration order
    }
    S3
}
```

![](https://i.imgur.com/2ijhsKs.png =400x)

* [Simple Program Parallelization with OpenMP (4): Examples for `for` – Heresy's Space](https://kheresy.wordpress.com/2006/09/15/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%EF%BC%8Dopenmp%EF%BC%88%E5%9B%9B%EF%BC%89%E7%AF%84%E4%BE%8B-for/)

## Reductions

* Aggregate results across threads
* A potentially costly and time-sensitive operation

### OpenMP Reductions

```c
int a = 0;
#pragma omp parallel for reduction(+: a)
for (int j = 0; j < 4; j++) {
    a = j; // overwrites the private copy instead of accumulating!
}
printf("Final Value of a=%d\n", a);
// OMP_NUM_THREADS=4: each thread's private copy ends as its last j
// 0+1+2+3 = 6
// OMP_NUM_THREADS=2: the private copies end as 1 and 3
// 1+3 = 4
```

```c
int a = 0;
#pragma omp parallel for reduction(+: a)
for (int j = 0; j < 100; j++) {
    a = j; // same bug: only the last j of each thread's chunk survives
}
printf("Final Value of a=%d\n", a);
// OMP_NUM_THREADS=4 (static chunks of 25):
// 24+49+74+99 = 246
```

```c
int a = 0;
#pragma omp parallel for reduction(+: a)
for (int j = 0; j < 100; j++) {
    a += j; // correct: accumulate into the private copy
}
printf("Final Value of a=%d\n", a);
// independent of OMP_NUM_THREADS:
// 0+1+2+...+99 = (99*100)/2 = 4950
```
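For reference, a self-contained version of the correct reduction example above; the compile command assumes gcc (clang takes the same `-fopenmp` flag):

```c
// reduction.c -- compile with: gcc -fopenmp reduction.c -o reduction
#include <stdio.h>
#include <omp.h>

int main(void) {
    int a = 0;

    // Each thread accumulates into a private copy of 'a' (initialized
    // to 0, the identity of '+'); the copies are summed at the end.
    #pragma omp parallel for reduction(+: a)
    for (int j = 0; j < 100; j++) {
        a += j;
    }

    // Prints 4950 regardless of OMP_NUM_THREADS
    printf("Final value of a=%d (max threads: %d)\n", a, omp_get_max_threads());
    return 0;
}
```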
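Finally, the sketch promised in the nestable-locks section: the same recursive-locking pattern in C, using OpenMP's nest-lock API instead of `std::recursive_mutex` (the functions `update` and `update_pair` are illustrative, not from the lecture):

```c
#include <stdio.h>
#include <omp.h>

omp_nest_lock_t lock;
int counter = 0;

// Both functions need the lock; update_pair calls update while
// already holding it. With a plain omp_lock_t this would deadlock;
// the nestable lock just increments its nesting count instead.
void update(int v) {
    omp_set_nest_lock(&lock);
    counter += v;
    omp_unset_nest_lock(&lock);
}

void update_pair(int v) {
    omp_set_nest_lock(&lock);
    update(v);      // second set by the same thread: nesting count 2
    update(v + 1);
    omp_unset_nest_lock(&lock); // count back to 0: others may set it
}

int main(void) {
    omp_init_nest_lock(&lock);

    #pragma omp parallel for
    for (int i = 0; i < 8; i++) {
        update_pair(i);
    }

    omp_destroy_nest_lock(&lock);
    printf("counter=%d\n", counter); // sum of (2i+1) for i=0..7 = 64
    return 0;
}
```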