# 4 Advanced OpenMP and Changes since OpenMP 5.0
###### tags: `SS2021-IN2147-PP`
### Important Runtime Routines
```c
// Set the number of threads to use for subsequent parallel regions
omp_set_num_threads(count)
// Query the maximum number of threads for team creation
maxthreads = omp_get_max_threads()
// Query number of threads in the current team
numthreads = omp_get_num_threads()
// Query own thread number (0..n-1)
iam = omp_get_thread_num()
// Query the number of processors
numprocs = omp_get_num_procs()
```
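A minimal, compilable sketch (not from the slides) of how these routines are typically combined; compile with `-fopenmp`:
```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_num_threads(4);                             // request 4 threads for later regions
    printf("max threads: %d\n", omp_get_max_threads()); // upper bound for team creation
    printf("processors:  %d\n", omp_get_num_procs());   // available processors

    #pragma omp parallel
    {
        // inside the region, each thread can query the team size and its own number
        printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
```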
### Relevant Environment Variables (ICVs)
```shell
## ICVs = Internal Control Variables
## Number of threads in a team of a parallel region
OMP_NUM_THREADS=4
## Selects scheduling strategy to be applied at runtime
## Schedule clause in the code takes precedence
OMP_SCHEDULE="dynamic"
OMP_SCHEDULE="GUIDED,4"
## Allow runtime system to determine the number of threads
OMP_DYNAMIC=TRUE
## Allow nesting of parallel regions
## If supported by the runtime
OMP_NESTED=TRUE
```
## Data Races
### Types of Races
#### Read after Write (RAW)
```c
// T1:
x = ...
// T2:
... = x
```
#### Write after Read (WAR)
```c
// T1:
... = x
// T2:
x = ...
```
#### Write after Write (WAW)
```c
// T1:
x = ...
// T2:
x = ...
```
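A small illustrative sketch (not from the slides): the first loop races on `racy`, the second avoids the race with `atomic` (a `reduction(+:...)` clause would be the more idiomatic fix):
```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    int racy = 0, safe = 0;

    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        racy += i;                 // RAW/WAW race: unsynchronized read-modify-write

    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        #pragma omp atomic         // the update becomes an atomic read-modify-write
        safe += i;
    }

    printf("racy=%d safe=%d (expected %d)\n", racy, safe, 999 * 1000 / 2);
    return 0;
}
```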
### Race Detection Tools
* **`Helgrind`** (open source)
  * Dynamic analysis, based on the `Valgrind` tool suite
* **`Intel Inspector`** (commercial)
  * Part of Intel’s development suite
* **`Thread Sanitizer`** (open source)
  * Compile-time instrumentation via LLVM/Clang
* **`Archer`** (open source)
  * Combines `Thread Sanitizer`’s approach with OpenMP semantics
## Loop Dependencies
### Dependence and Dependence Graph
#### Control Dependence
```c
S1: if (t == 5)
S2: r=5.0
else
S3: r=3.0
```
```mermaid
graph LR;
S1-->S2;
S1-->S3;
```
#### Data Dependence
```c
S1: pi=3.14
S2: r=5.0
S3: area=pi*r**2
```
```mermaid
graph LR;
S1-->S3;
S2-->S3;
```
### Types of Data Dependencies
* $I(S)$
  * set of memory locations read (input)
* $O(S)$
  * set of memory locations written (output)
* True Dependence (Flow Dependence)
  * $O(S_1) \cap I(S_2) \neq \emptyset$
  * $(S_1\ \delta\ S_2)$
* Anti Dependence
  * $I(S_1) \cap O(S_2) \neq \emptyset$
  * $(S_1\ \delta^{-1}\ S_2)$
* Output Dependence
  * $O(S_1) \cap O(S_2) \neq \emptyset$
  * $(S_1\ \delta^{0}\ S_2)$
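A short C illustration of the three cases (the scalars `a`, `b`, `x`, `y` are assumed, not from the slides):
```c
// True (flow) dependence: S1 writes x, S2 reads it    -> O(S1) ∩ I(S2) ≠ ∅
x = a + b;   // S1
y = x * 2;   // S2

// Anti dependence: S1 reads x, S2 overwrites it       -> I(S1) ∩ O(S2) ≠ ∅
y = x * 2;   // S1
x = a + b;   // S2

// Output dependence: S1 and S2 both write x           -> O(S1) ∩ O(S2) ≠ ∅
x = a + b;   // S1
x = b * 2;   // S2
```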
### Loop Dependencies
#### Loop Independent Dependencies
```c
for (i = 0; i < 4; i++) {
S1:   b[i] = 8;
S2:   a[i] = b[i] + 10;
}
```
#### Loop Carried Dependencies
```c
for (i = 1; i < 4; i++) {
S1:   b[i] = 8;
S2:   a[i] = b[i-1] + 10;
}
```
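Why the distinction matters for OpenMP (a sketch; arrays `a` and `b` are assumed): the loop-independent case can be parallelized directly because the dependence stays within one iteration, whereas the carried dependence crosses iterations and makes a plain `parallel for` incorrect.
```c
// safe: the S1 -> S2 dependence is within the same iteration
#pragma omp parallel for
for (int i = 0; i < 4; i++) {
    b[i] = 8;              // S1
    a[i] = b[i] + 10;      // S2
}

// NOT safe as a plain parallel for: iteration i reads b[i-1],
// which is written by iteration i-1 (loop-carried dependence)
for (int i = 1; i < 4; i++) {
    b[i] = 8;              // S1
    a[i] = b[i-1] + 10;    // S2
}
```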
### Aliasing
```c
void Foo(int A[], int B[], int n) {
  for (int i = 0; i < n; i++)
    A[i] = 5 * B[i] + A[i];
}
int main() {
int A[100], B[100];
// A != B
Foo(A, B, 100);
// aliasing!!!
Foo(A, A, 100);
return 0;
}
```
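One common way to rule out aliasing in C (an addition, not from the slides) is the C99 `restrict` qualifier: the programmer promises that the pointers do not overlap, so the compiler may vectorize or parallelize without an aliasing check. Calling such a function with overlapping arrays (e.g. `Foo(100, A, A)`) would then be undefined behavior.
```c
void Foo(int n, int * restrict A, const int * restrict B) {
    // A and B are promised not to alias, so iterations are independent
    for (int i = 0; i < n; i++)
        A[i] = 5 * B[i] + A[i];
}
```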
## Loop Transformation
### Transformation 1: Loop Interchange
```c
/* Assuming no aliasing */
// original
do l=1,10
do m=1,10
do k=1,10
A(l,m,k)=A(l,m,k)+B
enddo
enddo
enddo
// Loop Interchange
do l=1,10
do k=1,10
do m=1,10
A(l,m,k)=A(l,m,k)+B
enddo
enddo
enddo
```
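The same idea in C (a sketch; `A` is a hypothetical `double A[10][10][10]`, `B` a scalar). C arrays are row-major, so the rightmost index should vary fastest in the innermost loop; interchanging the two inner loops improves spatial locality:
```c
// original: the innermost loop strides over the middle index (poor locality)
for (int l = 0; l < 10; l++)
    for (int k = 0; k < 10; k++)
        for (int m = 0; m < 10; m++)
            A[l][m][k] += B;

// after loop interchange: the innermost loop walks contiguous memory
for (int l = 0; l < 10; l++)
    for (int m = 0; m < 10; m++)
        for (int k = 0; k < 10; k++)
            A[l][m][k] += B;
```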
### Transformation 2: Loop Distribution / Loop Fission
```c
/* Assuming no aliasing */
// original
do j=2,n
S1: a(j)= b(j)+2
S2: c(j)= a(j-1) * 2
enddo
// Loop Distribution / Loop Fission
// * reduces the granularity
// * increases parallelism
do j=2,n
S1: a(j)= b(j)+2
enddo
do j=2,n
S2: c(j)= a(j-1) * 2
enddo
```
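In C with OpenMP the benefit becomes visible (a sketch; arrays `a`, `b`, `c` and the bound `n` are assumed): after fission neither loop carries a dependence, so both can be parallelized; the value `a[j-1]` read in the second loop is already complete before that loop starts.
```c
#pragma omp parallel for
for (int j = 2; j <= n; j++)
    a[j] = b[j] + 2;          // S1: independent iterations

#pragma omp parallel for
for (int j = 2; j <= n; j++)
    c[j] = a[j-1] * 2;        // S2: only reads values produced by the loop above
```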
### Transformation 3: Loop Fusion
```c
/* Assuming no aliasing */
// original
do i=1,n
a(i)= b(i)+2
enddo
do i=1,n
c(i)= d(i+1) * a(i)
enddo
// Loop Fusion:
// * increases granularity
// * reduces parallelism
do i=1,n
a(i)= b(i)+2
c(i)= d(i+1) * a(i)
enddo
```
### Transformation 4: Loop Alignment
```c
// original
do i=2,n
S1: a(i)= b(i)+2
S2: c(i)= a(i-1) * 2
enddo
// Loop Alignment
do i=1,n
S1: if (i>1) a(i)= b(i)+2
S2: if (i<n) c(i+1)= a(i) * 2
enddo
```
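In C with OpenMP (a sketch; arrays `a`, `b`, `c` and `n` are assumed): after alignment the producer `a[i]` and its consumer sit in the same iteration, so the loop carries no dependence and can be parallelized.
```c
#pragma omp parallel for
for (int i = 1; i <= n; i++) {
    if (i > 1) a[i]   = b[i] + 2;   // S1 (skipped for i == 1)
    if (i < n) c[i+1] = a[i] * 2;   // S2 reads the a[i] written just above
}
```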

### Summary Loop Transformations
* Eliminate carried dependences
  * Loop distribution
  * Loop alignment
* Improve efficiency
  * Loop fusion
  * Loop interchange
## Advanced OpenMP / Tasks
### Drawbacks of Work Sharing
* possible ***imbalance*** caused by **`workload`**
* possible ***imbalance*** caused by **`machine`**
* **`limited programming flexibility`**
### Explicit Tasking (since OpenMP 3.0)
* ***Tied***: once a task starts it will remain on the same thread
* Default behavior
* Easy to reason about
* ***Untied***: tasks can move to a different thread
* Execution can be interrupted and the task moved
* Advantage: more flexibility and better resource utilization
### The OpenMP Task Construct
```c
#pragma omp parallel
{
  #pragma omp single
  {
    for (elem = l->first; elem; elem = elem->next) {
      // Tasks can be executed by any thread in the team
      #pragma omp task
      process(elem);
    }
  }
  // Implicit barrier at the end of the parallel region:
  // all tasks are complete by this point
}
```
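A complete, compilable version of this pattern (a sketch; the list type, `process()`, and the list contents are placeholders, not from the slides):
```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct node { int value; struct node *next; } node_t;

static void process(node_t *e) {
    printf("thread %d processes element %d\n", omp_get_thread_num(), e->value);
}

int main(void) {
    node_t *first = NULL;
    for (int i = 0; i < 8; i++) {            // build a small linked list
        node_t *n = malloc(sizeof *n);
        n->value = i;
        n->next = first;
        first = n;
    }

    #pragma omp parallel
    {
        #pragma omp single                   // one thread walks the list and creates tasks
        for (node_t *e = first; e; e = e->next) {
            #pragma omp task firstprivate(e) // each element becomes a task for any thread
            process(e);
        }
    }   // implicit barrier: all tasks have completed here
    return 0;
}
```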
### Tasking Syntax in OpenMP
```c
#pragma omp task [clause list]
{ ... }
/* Select clauses */
// If the expression evaluates to FALSE, the task is undeferred:
// the encountering thread executes it immediately
if (scalar-expression)
// Task is not tied to the thread starting its execution
untied
// Data-sharing clauses; in tasks the default is firstprivate
// (unless the variable is shared in the enclosing context)
default(shared|none), private, firstprivate, shared
// Hint to influence the order of execution
priority(value)

// Waits for completion of the child tasks of the current task
// (standalone directive, no associated code block)
#pragma omp taskwait
// The current task can be suspended here
// Explicit task scheduling point (standalone directive)
#pragma omp taskyield
```
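The classic recursive example combining `task` and `taskwait` (a sketch, not from the slides): each call spawns two child tasks and waits for both before combining the results. Compile with `-fopenmp`.
```c
#include <omp.h>
#include <stdio.h>

long fib(int n) {
    if (n < 2) return n;
    long x, y;
    #pragma omp task shared(x)    // child task 1
    x = fib(n - 1);
    #pragma omp task shared(y)    // child task 2
    y = fib(n - 2);
    #pragma omp taskwait          // wait for both children before using x and y
    return x + y;
}

int main(void) {
    long result = 0;
    #pragma omp parallel
    #pragma omp single            // a single thread starts the recursion
    result = fib(10);
    printf("fib(10) = %ld\n", result);
    return 0;
}
```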
:::warning
**Implicit task scheduling points**
* Task creation
* End of a task
* Taskwait
* Barrier synchronization
:::
### Task Dependencies (since OpenMP 4.0)
```c
/* Influences scheduling order */
// out:   variables produced by this task
// in:    variables consumed by this task
// inout: variables that are both consumed and produced by this task
// Example
#pragma omp task shared(x, ...) depend(out: x) // T1
preprocess_some_data(...);
#pragma omp task shared(x, ...) depend(in: x) // T2
do_something_with_data(...);
#pragma omp task shared(x, ...) depend(in: x) // T3
do_something_independent_with_data(...);
```
* [c++ - OpenMP Task Dependency Ignored? - Stack Overflow](https://stackoverflow.com/questions/54990886/openmp-task-dependency-ignored)
### Performance Considerations for Tasking
#### Advantages
* Implicit load balancing
* Simple programming model
* Many complications and bookkeeping pushed to runtime
#### Consideration 1: Task granularity
* **Fine-grained** tasks allow for better resource utilization
  * but cause more overhead
* **Coarse-grained** tasks reduce overhead
  * but can lead to schedule fragmentation
* **Finding the right granularity is key** (see the sketch below)
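One common way to steer granularity in recursive tasking (an assumption, not from the slides) is a cutoff via the `final` clause: below the threshold no further tasks are generated and the remaining recursion runs sequentially inside the parent task. The function below is a drop-in replacement for `fib()` from the tasking sketch above.
```c
long fib(int n) {
    if (n < 2) return n;
    long x, y;
    // final(): once n drops below the (hypothetical) cutoff of 20,
    // the generated tasks and all their descendants execute immediately
    #pragma omp task shared(x) final(n < 20)
    x = fib(n - 1);
    #pragma omp task shared(y) final(n < 20)
    y = fib(n - 2);
    #pragma omp taskwait
    return x + y;
}
```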
#### Consideration 2: NUMA optimization
* Modern runtimes provide optimizations
* NUMA aware scheduling
## OpenMP Memory Model
### Memory Models
#### Memory/Cache Coherency
* Snoop-based protocols
* Directory-based protocols
#### Memory Consistency
* Sequential Consistency
* Relaxed Consistency
  * Processor Consistency (HW Threads Consistency)
    * Writes by a single thread are seen by all threads in the order they were issued
    * But writes from different threads may be seen in different orders
  * Weak Consistency
    * Ordering is only enforced at synchronization operations
  * Release Consistency
    * Subdivides synchronization operations into “acquire” and “release”
* [Memory-Consistency](https://hackmd.io/Bz0_tLu0S8STNPiAiY6tjw?view#Memory-Consistency)
### The OpenMP Memory Model
* **`Weak consistency`**
* Memory synchronization points (or flush points)
  * Entry and exit of **`parallel regions`**
  * **`Barriers`** (implicit and explicit)
  * Entry and exit of **`critical regions`**
  * Use of **`OpenMP runtime locks`**
  * Every **`task scheduling point`**
### OpenMP’s Flush Directive
```c
/* Synchronizes data of the executing thread with main memory */
// Loads/stores executed before the flush have to be finished
// Loads/stores following the flush must not be executed early
#pragma omp flush [(list)]
```
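A sketch of the classic producer/consumer handshake built from flushes (not from the slides; a production version would additionally make the accesses to `flag` atomic):
```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    int data = 0, flag = 0;
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {        // producer
            data = 42;
            #pragma omp flush(data)             // publish data before the flag
            flag = 1;
            #pragma omp flush(flag)
        } else {                                // consumer
            int seen = 0;
            while (!seen) {
                #pragma omp flush(flag)         // re-read the flag from memory
                seen = flag;
            }
            #pragma omp flush(data)             // make sure data is up to date
            printf("received %d\n", data);
        }
    }
    return 0;
}
```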
## OpenMP 5.1
* Full support for C11, C++11, C++14, C++17, C++20 and Fortran 2008
* The `OMP_PLACES` syntax was extended
* `omp_display_env` runtime routine to provide information about ICVs
### Masked Region (~~Master Region~~)
```c
// Only the primary thread executes the code block
#pragma omp masked
block
// Only the thread whose thread number matches the integer expression executes the block
#pragma omp masked filter(integer-expression)
block
```
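A small usage sketch (not from the slides); note that, unlike `single`, a `masked` region has no implied barrier at its end:
```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        #pragma omp masked              // only the primary thread (number 0)
        printf("primary thread: %d\n", omp_get_thread_num());

        #pragma omp masked filter(2)    // only the thread with number 2
        printf("filtered thread: %d\n", omp_get_thread_num());
    }
    return 0;
}
```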