# 4 Advanced OpenMP and Changes since OpenMP 5.0

###### tags: `SS2021-IN2147-PP`

### Important Runtime Routines

```c
// Set the number of threads for subsequent parallel regions
omp_set_num_threads(count)

// Query the maximum number of threads available for team creation
maxthreads = omp_get_max_threads()

// Query the number of threads in the current team
numthreads = omp_get_num_threads()

// Query the own thread number (0..n-1)
iam = omp_get_thread_num()

// Query the number of processors
numprocs = omp_get_num_procs()
```

### Relevant Environment Variables (ICVs)

```shell
## ICVs = Internal Control Variables

## Number of threads in a team of a parallel region
OMP_NUM_THREADS=4

## Selects the scheduling strategy to be applied at runtime
## A schedule clause in the code takes precedence
OMP_SCHEDULE="dynamic"
OMP_SCHEDULE="GUIDED,4"

## Allow the runtime system to determine the number of threads
OMP_DYNAMIC=TRUE

## Allow nesting of parallel regions
## If supported by the runtime
OMP_NESTED=TRUE
```

## Data Races

### Types of Races

#### Read after Write (RAW)

```c
// T1: x = ...
// T2: ... = x
```

#### Write after Read (WAR)

```c
// T1: ... = x
// T2: x = ...
```

#### Write after Write (WAW)

```c
// T1: x = ...
// T2: x = ...
```

### Race Detection Tools

* **`Helgrind`** (open source)
  * Dynamic analysis, based on the `valgrind` tool suite
* **`Intel Inspector`** (commercial)
  * Part of Intel’s development suite
* **`Thread Sanitizer`** (open source)
  * Static (compile-time) instrumentation via LLVM
* **`Archer`** (open source)
  * Combines `Thread Sanitizer`’s approach with OpenMP semantics
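As a concrete illustration of such a race and one way to remove it, here is a minimal sketch (not from the slides; array contents and sizes are arbitrary): summing an array with an unsynchronized shared accumulator races, while OpenMP's `reduction` clause gives each thread a private partial sum and combines them safely.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int a[1000], sum = 0;
    for (int i = 0; i < 1000; i++) a[i] = 1;

    // Data race (WAW/RAW on 'sum'): all threads read and write the shared
    // variable without synchronization, so the result is nondeterministic.
    //
    // #pragma omp parallel for
    // for (int i = 0; i < 1000; i++) sum += a[i];

    // Fix: reduction keeps a private partial sum per thread and combines them
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000; i++)
        sum += a[i];

    printf("sum = %d\n", sum);  // always 1000
    return 0;
}
```

Tools such as `Archer` or `Thread Sanitizer` would flag the commented-out variant but report the reduction version as race-free.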
## Loop Dependencies

### Dependence and Dependence Graph

#### Control Dependence

```c
S1: if (t == 5)
S2:     r = 5.0
    else
S3:     r = 3.0
```

```mermaid
graph LR;
S1-->S2;
S1-->S3;
```

#### Data Dependence

```c
S1: pi = 3.14
S2: r = 5.0
S3: area = pi * r**2
```

```mermaid
graph LR;
S1-->S3;
S2-->S3;
```

### Types of Data Dependencies

* $I(S)$
  * set of memory locations read (input)
* $O(S)$
  * set of memory locations written (output)
* True Dependence (Flow Dependence)
  * $O(S_1) \cap I(S_2) \neq \emptyset$
  * $(S_1\ \delta\ S_2)$
* Anti Dependence
  * $I(S_1) \cap O(S_2) \neq \emptyset$
  * $(S_1\ \delta^{-1}\ S_2)$
* Output Dependence
  * $O(S_1) \cap O(S_2) \neq \emptyset$
  * $(S_1\ \delta^{0}\ S_2)$

### Loop Dependencies

#### Loop Independent Dependencies

```c
for (i = 0; i < 4; i++) {
S1:   b[i] = 8;
S2:   a[i] = b[i] + 10;
}
```

#### Loop Carried Dependencies

```c
for (i = 0; i < 4; i++) {
S1:   b[i] = 8;
S2:   a[i] = b[i-1] + 10;
}
```

### Aliasing

```c
void Foo(int A[], int B[], int n) {
    for (int i = 0; i < n; i++)
        A[i] = 5 * B[i] + A[i];
}

int main() {
    int A[100], B[100];
    Foo(A, B, 100);  // A != B
    Foo(A, A, 100);  // aliasing!!!
    return 0;
}
```

## Loop Transformation

### Transformation 1: Loop Interchange

```fortran
! Assuming no aliasing

! original
do l=1,10
  do m=1,10
    do k=1,10
      A(l,m,k) = A(l,m,k) + B
    enddo
  enddo
enddo

! Loop Interchange
do l=1,10
  do k=1,10
    do m=1,10
      A(l,m,k) = A(l,m,k) + B
    enddo
  enddo
enddo
```

### Transformation 2: Loop Distribution / Loop Fission

```fortran
! Assuming no aliasing

! original
do j=2,n
  S1: a(j) = b(j) + 2
  S2: c(j) = a(j-1) * 2
enddo

! Loop Distribution / Loop Fission
! * reduces the granularity
! * increases parallelism
do j=2,n
  S1: a(j) = b(j) + 2
enddo
do j=2,n
  S2: c(j) = a(j-1) * 2
enddo
```

### Transformation 3: Loop Fusion

```fortran
! Assuming no aliasing

! original
do i=1,n
  a(i) = b(i) + 2
enddo
do i=1,n
  c(i) = d(i+1) * a(i)
enddo

! Loop Fusion:
! * increases granularity
! * reduces parallelism
do i=1,n
  a(i) = b(i) + 2
  c(i) = d(i+1) * a(i)
enddo
```

### Transformation 4: Loop Alignment

```fortran
! original
do i=2,n
  S1: a(i) = b(i) + 2
  S2: c(i) = a(i-1) * 2
enddo

! Loop Alignment
do i=1,n
  S1: if (i>1) a(i) = b(i) + 2
  S2: if (i<n) c(i+1) = a(i) * 2
enddo
```

![](https://i.imgur.com/HUkEaBV.png)

### Summary Loop Transformations

* Eliminate carried dependences
  * Loop distribution
  * Loop alignment
* Improve efficiency
  * Loop fusion
  * Loop interchange

## Advanced OpenMP / Tasks

### Drawbacks of Work Sharing

* possible ***imbalance*** caused by the **`workload`**
* possible ***imbalance*** caused by the **`machine`**
* **`limited programming flexibility`**

### Explicit Tasking (since OpenMP 3.0)

* ***Tied***: once a task starts, it remains on the same thread
  * Default behavior
  * Easy to reason about
* ***Untied***: tasks can move to a different thread
  * Execution can be interrupted and the task moved
  * Advantage: more flexibility and better resource utilization

### The OpenMP Task Construct

```c
#pragma omp parallel
{
  #pragma omp single
  {
    for (elem = l->first; elem; elem = elem->next) {
      // Tasks can be executed by any thread in the team
      #pragma omp task
      process(elem);
    }
  }
  // Barrier: all tasks are complete by this point
}
```

### Tasking Syntax in OpenMP

```c
#pragma omp task [clause list]
{ ... }

/* Selected clauses */

// If the expression evaluates to FALSE, execution starts immediately
// (undeferred) by the creating thread
if (scalar-expression)

// Task is not tied to the thread starting its execution
untied

// Data-sharing clauses; the default is firstprivate
default(shared|none), private, firstprivate, shared

// Hint to influence the order of execution
priority(value)

// Waits for completion of the child tasks of the current task
#pragma omp taskwait

// The current task can be suspended
// Explicit task scheduling point
#pragma omp taskyield
```

:::warning
**Implicit task scheduling points**
* Task creation
* End of a task
* Taskwait
* Barrier synchronization
:::

### Task Dependencies (since OpenMP 4.0)

```c
/* Influences the scheduling order */
// out:   variables produced by this task
// in:    variables consumed by this task
// inout: variables both consumed and produced by this task

// Example
#pragma omp task shared(x, ...) depend(out: x)  // T1
preprocess_some_data(...);

#pragma omp task shared(x, ...) depend(in: x)   // T2
do_something_with_data(...);

#pragma omp task shared(x, ...) depend(in: x)   // T3
do_something_independent_with_data(...);
```

* [c++ - OpenMP Task Dependency Ignored? - Stack Overflow](https://stackoverflow.com/questions/54990886/openmp-task-dependency-ignored)
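A self-contained sketch of the dependency pattern above (the task bodies and the value 42 are illustrative, not from the slides): T1 produces `x`, so T2 and T3 may only start after T1 has finished, while T2 and T3 may run concurrently with each other.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int x = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(x) depend(out: x)   // T1: produces x
        x = 42;

        #pragma omp task shared(x) depend(in: x)    // T2: runs only after T1
        printf("T2 sees x = %d\n", x);

        #pragma omp task shared(x) depend(in: x)    // T3: runs only after T1,
        printf("T3 sees x = %d\n", x);              //     possibly concurrently with T2

        #pragma omp taskwait                        // wait for T1..T3
    }
    return 0;
}
```

Both printouts show 42 regardless of how the runtime schedules the sibling tasks, because the `depend` clauses order T2 and T3 after T1.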
### Performance Considerations for Tasking

#### Advantages

* Implicit load balancing
* Simple programming model
* Many complications and much bookkeeping pushed to the runtime

#### Consideration 1: Task granularity

* **Fine-grained** tasks allow for better resource utilization
  * but more overhead
* **Coarse-grained** tasks reduce overhead
  * but risk schedule fragmentation
* **Find the right granularity!!!**

#### Consideration 2: NUMA optimization

* Modern runtimes provide optimizations
  * NUMA-aware scheduling

## OpenMP Memory Model

### Memory Models

#### Memory/Cache Coherency

* Snoop-based protocols
* Directory-based protocols

#### Memory Consistency

* Sequential Consistency
* Relaxed Consistency
* Processor Consistency (HW Threads Consistency)
  * Writes by any thread are seen by all threads in the order they were issued
  * But different threads may see a different order
* Weak Consistency
  * Synchronization operations
* Release Consistency
  * subdivides synchronization operations into “acquire” and “release”
* [Memory-Consistency](https://hackmd.io/Bz0_tLu0S8STNPiAiY6tjw?view#Memory-Consistency)

### The OpenMP Memory Model

* **`Weak consistency`**
* Memory synchronization points (or flush points)
  * Entry and exit of **`parallel regions`**
  * **`Barriers`** (implicit and explicit)
  * Entry and exit of **`critical regions`**
  * Use of **`OpenMP runtime locks`**
  * Every **`task scheduling point`**

### OpenMP’s Flush Directive

```c
/* Synchronizes the data of the executing thread with main memory */
// Loads/stores issued before the flush have to be finished
// Loads/stores following the flush are not allowed to be executed early
#pragma omp flush [(list)]
```

## OpenMP 5.1

* Full support for C11, C++11, C++14, C++17, C++20 and Fortran 2008
* The `OMP_PLACES` syntax was extended
* `omp_display_env` runtime routine to provide information about ICVs

### Masked Region (~~Master Region~~)

```c
// Only the primary thread executes the code block
#pragma omp masked
block

// Only the thread whose thread number matches the integer expression
// executes the code block
#pragma omp masked [filter(integer-expression)]
block
```
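A minimal sketch of the two forms above (requires a compiler with OpenMP 5.1 `masked` support; the team size of 4 and the filtered thread number 2 are arbitrary choices for illustration):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        // Without filter: only the primary thread (thread 0) runs this block
        #pragma omp masked
        printf("primary thread %d\n", omp_get_thread_num());

        #pragma omp barrier  // masked has no implied barrier; order the output

        // With filter: only the thread with thread number 2 runs this block
        // (thread number chosen arbitrarily for this example)
        #pragma omp masked filter(2)
        printf("filtered thread %d\n", omp_get_thread_num());
    }
    return 0;
}
```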