# 3 Introduction to OpenMP 5.0
###### tags: `SS2021-IN2147-PP`
## OpenMP and its Execution Model
### Terminology HW Threads
* A single system has many hardware threads
* A **`system/node`** can have multiple **`physical processors`** (aka. **`sockets`**)
    * Independent chips with their own interconnect between them
* Each **`socket`** can have multiple **`cores`**
    * Replicated processors on a single die with on-die interconnect
* Each **`core`** can run multiple **`hardware threads`**
    * `Simultaneous Multithreading / Hyperthreading`
    * Separate PC and Stack Pointer, but shared resources
### OpenMP = Open Multi-Processing
* Standard for writing parallel programs, mainly **`on-node`**
    * `Shared memory parallelism`
* **Portability**
* 3 primary API components:
    * Compiler Directives (for C/C++ and Fortran)
    * Runtime Library Routines
    * Environment Variables
* **OpenMP is not ...**
    * intended for distributed memory systems
    * necessarily implemented identically by all vendors
    * guaranteed to automatically make the most efficient use of shared memory
    * required to check for data dependencies, race conditions, deadlocks, etc.
    * designed to handle parallel I/O
### Fork/Join Execution Model

### Nested Parallelism

### OpenMP System Stack

## OpenMP Syntax
```c
// Most of the constructs in OpenMP are compiler directives
#pragma omp construct [clause [clause] ...]
// Directives can have continuation lines
#pragma omp parallel private(i) \
                     private(j)
```
### Parallel Regions
```c
// Degree of parallelism controlled by ICV
// $ export OMP_NUM_THREADS=8
// $ ./a.out
#pragma omp parallel [parameters]
{
    block
}
```
### Sections in a Parallel Region
```c
// Parallel regions can have separate sections
#pragma omp sections [parameters]
{
    #pragma omp section
    { ... block ... }
    #pragma omp section
    { ... block ... }
    ...
    // implicit barrier at the end of the sections region
}
```
### Parallel Loop
```c
// Iterations must be independent!
// OpenMP does not check for dependencies!
#pragma omp for [parameters]
for ...
// Implicit barrier at the end,
// unless the 'nowait' clause is specified
```
### Available Loop Schedules
* static
    * `Fixed-size chunks` (default ≈ n/t, i.e. iterations per thread)
    * Assigned in `round-robin` fashion
* dynamic
    * `Fixed-size chunks` (default = 1)
    * `Distributed one by one` at runtime as chunks finish
* guided
    * Starts with `large` chunks, then `exponentially decreasing` sizes
    * `Distributed one by one` at runtime as chunks finish
* runtime
    * Controlled at runtime via the `OMP_SCHEDULE` control variable
* auto
    * Compiler/runtime can choose
## Data Sharing in OpenMP
### Shared vs. Private Data
* **Global variables**:
    * Shared
* **Variables declared within the “dynamic extent” of parallel regions**:
    * Private
    * Includes routines called from parallel regions
### Private Data Options

### First/Last Private Data Options

### Summary: Sharing Attributes of Variables
* **`private(var-list)`**
    * Variables in var-list are private
* **`shared(var-list)`**
    * Variables in var-list are shared
* **`default(private | shared | none)`**
    * Sets the default for all variables in this region
* **`firstprivate(var-list)`**
    * Variables are private
    * Initialized with the value of the shared copy before the region
* **`lastprivate(var-list)`**
    * Variables are private
    * The value from the thread that executes the sequentially last iteration of the loop is copied to the variable outside the region
## Synchronization
### Barrier
```c
// Each parallel region has an implicit barrier at the end
// On worksharing constructs (for, sections, single) the implicit
// barrier can be switched off by adding the 'nowait' clause
#pragma omp for nowait
// Additional barriers can be added when needed
#pragma omp barrier
```
### Master Region
```c
// Only the master thread executes the code block:
// * Other threads skip the region
// * No synchronization at the beginning or end of the region
// Possible uses:
// * Printing to screen
// * Keyboard input
// * File I/O
#pragma omp master
block
```
### Single Region
```c
// Only a single (arbitrary) thread executes the code block
// Possible uses:
// * Initialization of data structures
#pragma omp single [parameter]
block
```
### Critical Section
```c
// Mutual exclusion
// All 'unnamed' critical directives map to the same name
#pragma omp critical [(Name)]
block
```
### Atomic Statements
```c
// Ensures that a specific memory location is updated atomically
#pragma omp atomic
expression-stmt
```
### Simple Runtime Locks
```c
omp_init_lock(&lockvar) // initialize a lock
omp_destroy_lock(&lockvar) // destroy a lock
omp_set_lock(&lockvar) // set lock
omp_unset_lock(&lockvar) // free lock
logicalvar = omp_test_lock(&lockvar)
```
### Nestable Locks
* Nestable locks can be set multiple times by a single thread.
* If the lock counter is 0 after an unset operation, the lock can be set by another thread
* Example:
    * Function `a` needs a mutex,
    * function `b` also needs the mutex, and `a` additionally calls `b`.
    * With a plain `std::mutex` (a simple lock) this necessarily deadlocks,
    * but `std::recursive_mutex` (a nestable lock) solves the problem.
### Ordered Construct
```c
#pragma omp for ordered
for (...)
{
    S1
    #pragma omp ordered
    { S2 }
    S3
}
```

* [Simple Program Parallelization with OpenMP (4): `for` Examples – Heresy's Space](https://kheresy.wordpress.com/2006/09/15/%E7%B0%A1%E6%98%93%E7%9A%84%E7%A8%8B%E5%BC%8F%E5%B9%B3%E8%A1%8C%E5%8C%96%EF%BC%8Dopenmp%EF%BC%88%E5%9B%9B%EF%BC%89%E7%AF%84%E4%BE%8B-for/)
## Reductions
* Aggregate results
* Potentially costly and time sensitive operation
### OpenMP Reductions
```c
int a = 0;
#pragma omp parallel for reduction(+: a)
for (int j = 0; j < 4; j++) {
    a = j;  // overwrites instead of accumulating: result depends on the schedule!
}
printf("Final Value of a=%d\n", a);
// With the default static schedule:
// OMP_NUM_THREADS=4 -> 0+1+2+3 = 6
// OMP_NUM_THREADS=2 -> 1+3 = 4
```
```c
int a = 0;
#pragma omp parallel for reduction(+: a)
for (int j = 0; j < 100; j++) {
    a = j;  // each thread's private copy keeps only its last j
}
printf("Final Value of a=%d\n", a);
// OMP_NUM_THREADS=4, default static schedule:
// 24+49+74+99 = 246
```
```c
int a = 0;
#pragma omp parallel for reduction(+: a)
for (int j = 0; j < 100; j++) {
    a += j;  // accumulate: result is independent of the thread count
}
printf("Final Value of a=%d\n", a);
// Any OMP_NUM_THREADS:
// 0+1+2+...+99 = (99*100)/2 = 4950
```