# 10 Advanced MPI Communication Techniques Part 2
###### tags: `SS2021-IN2147-PP`
## Datatypes
### Central MPI Types and Objects
```c
// Communicators
MPI_Comm
// Handle for non-blocking communication
MPI_Request
// Key/value object passed to many routines to supply additional information
MPI_Info
// Structured datatypes for communication
MPI_Datatype
```
### Derived MPI Datatypes and Management
```c
// Datatype construction
int MPI_Type_contiguous(int count, MPI_Datatype oldtype,
                        MPI_Datatype *newtype)
// Datatype commit
int MPI_Type_commit(MPI_Datatype *datatype)
// Datatype free
int MPI_Type_free(MPI_Datatype *datatype)
```
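As a quick illustration, here is a minimal sketch of the create/commit/use/free life cycle. The helper name `send_block` and the block size of four ints are made up for this example:
```c
#include <mpi.h>

/* Hypothetical sketch: build a contiguous type of 4 ints,
 * commit it, use it in a send, and free it again. */
void send_block(const int *buf, int dest, MPI_Comm comm) {
    MPI_Datatype block;                      /* handle for the new type */
    MPI_Type_contiguous(4, MPI_INT, &block); /* 4 consecutive MPI_INTs  */
    MPI_Type_commit(&block);                 /* required before use     */
    MPI_Send(buf, 1, block, dest, /*tag=*/0, comm);
    MPI_Type_free(&block);                   /* handle no longer needed */
}
```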

### Constructing a Vector Datatype
```c
int MPI_Type_vector(int count, int blocklength, int stride,
                    MPI_Datatype oldtype, MPI_Datatype *newtype)
```
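A classic use case is sending one column of a row-major matrix. A minimal sketch, assuming an 8×8 `double` matrix (the dimension `N` and the helper name `send_column` are illustrative):
```c
#include <mpi.h>

#define N 8  /* assumed matrix dimension, for illustration only */

/* Describe one column of a row-major NxN matrix as a strided
 * vector: N blocks of 1 double, stride N elements apart. */
void send_column(const double mat[N][N], int col, int dest, MPI_Comm comm) {
    MPI_Datatype column;
    MPI_Type_vector(/*count=*/N, /*blocklength=*/1, /*stride=*/N,
                    MPI_DOUBLE, &column);
    MPI_Type_commit(&column);
    /* The send starts at the first element of the chosen column. */
    MPI_Send(&mat[0][col], 1, column, dest, 0, comm);
    MPI_Type_free(&column);
}
```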

### Constructing an Indexed Datatype
```c
int MPI_Type_indexed(int count,
                     const int array_of_blocklengths[],  // AoB
                     const int array_of_displacements[], // AoD
                     MPI_Datatype oldtype,
                     MPI_Datatype *newtype)
```
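A hedged sketch of picking irregular regions out of a buffer; the block lengths and displacements below are made up for illustration:
```c
#include <mpi.h>

/* Hypothetical sketch: send three irregular regions of a buffer
 * (6 doubles total) as a single indexed datatype. */
void send_irregular(const double *buf, int dest, MPI_Comm comm) {
    int blocklens[3] = {2, 1, 3};   /* elements per block (AoB) */
    int displs[3]    = {0, 5, 9};   /* element offsets (AoD)    */
    MPI_Datatype scattered;
    MPI_Type_indexed(3, blocklens, displs, MPI_DOUBLE, &scattered);
    MPI_Type_commit(&scattered);
    MPI_Send(buf, 1, scattered, dest, 0, comm);
    MPI_Type_free(&scattered);
}
```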

### Creating a Struct Datatype
```c
int MPI_Type_create_struct(int count,
                           const int array_of_blocklengths[],       // AoB
                           const MPI_Aint array_of_displacements[], // AoD
                           const MPI_Datatype array_of_types[],     // AoT
                           MPI_Datatype *newtype)
```
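A common pattern combines `MPI_Type_create_struct` with `offsetof` and a resize to the true `sizeof`, so that arrays of the struct also transfer correctly. The `particle_t` record below is a made-up example, not from the slides:
```c
#include <mpi.h>
#include <stddef.h>  /* offsetof */

/* Hypothetical particle record used to illustrate a struct type. */
typedef struct { int id; double pos[3]; } particle_t;

void build_particle_type(MPI_Datatype *newtype) {
    int          blocklens[2] = {1, 3};
    MPI_Aint     displs[2]    = {offsetof(particle_t, id),
                                 offsetof(particle_t, pos)};
    MPI_Datatype types[2]     = {MPI_INT, MPI_DOUBLE};
    MPI_Type_create_struct(2, blocklens, displs, types, newtype);

    /* Resize to the compiler's true extent (padding included),
     * so arrays of particle_t map correctly onto this type. */
    MPI_Datatype tmp = *newtype;
    MPI_Type_create_resized(tmp, 0, sizeof(particle_t), newtype);
    MPI_Type_free(&tmp);
    MPI_Type_commit(newtype);
}
```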

### Constructing a Subarray Datatype
```c
int MPI_Type_create_subarray(int ndims,
                             const int array_of_sizes[],
                             const int array_of_subsizes[],
                             const int array_of_starts[],
                             int order,
                             MPI_Datatype oldtype, MPI_Datatype *newtype)
```
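A minimal sketch, assuming a 10×10 grid with a one-cell ghost layer (the sizes are illustrative): the datatype selects only the 8×8 interior.
```c
#include <mpi.h>

/* Hypothetical sketch: describe the interior of a 10x10 array,
 * i.e. the 8x8 region starting at (1,1), excluding a ghost layer. */
void build_interior(MPI_Datatype *interior) {
    int sizes[2]    = {10, 10};  /* full array dimensions */
    int subsizes[2] = {8, 8};    /* region to select      */
    int starts[2]   = {1, 1};    /* offset of the region  */
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, interior);
    MPI_Type_commit(interior);
}
```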

### Some Other Datatype Construction Functions
```c
MPI_Type_dup
// Duplicate a datatype
MPI_Type_create_hvector
// Like MPI_Type_vector, but the stride is given in bytes
MPI_Type_create_hindexed
// Like MPI_Type_indexed, but displacements are given in bytes
MPI_Type_create_indexed_block
// Like MPI_Type_indexed, but with a single fixed block length
MPI_Type_create_hindexed_block
// Like MPI_Type_create_indexed_block, but displacements are given in bytes
MPI_Type_create_darray
// Create a distributed-array datatype
```
### Inspecting Datatypes
```c
// Query metadata information on a type
int MPI_Type_get_envelope(MPI_Datatype datatype,
                          int *num_integers, int *num_addresses,
                          int *num_datatypes, int *combiner)
// Query actual type information
int MPI_Type_get_contents(MPI_Datatype datatype,
                          int max_integers, int max_addresses,
                          int max_datatypes, int array_of_integers[],
                          MPI_Aint array_of_addresses[],
                          MPI_Datatype array_of_datatypes[])
```
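A hedged sketch of inspecting the envelope; only the combiner is checked here, since `MPI_Type_get_contents` must not be called on predefined (named) types:
```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical sketch: query how a datatype was constructed. */
void inspect_type(MPI_Datatype dt) {
    int ni, na, nd, combiner;
    MPI_Type_get_envelope(dt, &ni, &na, &nd, &combiner);
    if (combiner == MPI_COMBINER_NAMED)
        printf("predefined type, no contents to query\n");
    else if (combiner == MPI_COMBINER_VECTOR)
        printf("built by MPI_Type_vector (%d stored ints)\n", ni);
    else
        printf("combiner code %d\n", combiner);
}
```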
### Datatype Discussions
* Warning
    * Datatype creation/destruction adds overhead, so reuse committed types where possible
    * Many MPI implementations are not well optimized for derived datatypes
## One-Sided Communication
### Communication Modes
#### Two-sided communication

* Explicit sender and receiver for `P2P` communication
* All processes call `collectives`
* Advantages
    * `Simple`
    * Clear communication locations
    * `Implicit synchronization`
* Disadvantages
    * `Implicit synchronization`
    * Requires active involvement on both sides
    * Receiver can delay sender
#### One-sided communication

* Decouples data movement from process synchronization
* A process exposes part of its memory to other processes
    * Other processes can directly read from and write to this memory
* Advantages
    * `Direct memory access`
    * Receiver does not get involved
    * But requires `extra synchronization`
* Remote Memory Access (`RMA`) in MPI
    * Shared memory models
    * Explicit get and put calls
    * Needs hardware support
### Two-sided vs. One-sided Communication


### Creating Public Memory
* MPI terminology for `remotely accessible memory` is a “**`Window`**”
* Several Window Creation Models
```c
// Make existing/allocated memory regions remotely accessible
MPI_WIN_CREATE
// Allocate memory region and make it accessible
MPI_WIN_ALLOCATE
// No buffer yet, but will have one in the future
MPI_WIN_CREATE_DYNAMIC
// Add memory to a dynamic window
MPI_WIN_ATTACH
// Remove memory from a dynamic window
MPI_WIN_DETACH
```
### Making Local Memory Accessible
```c
// Expose a region of existing memory in an RMA window
int MPI_Win_create(void *base, MPI_Aint size, int disp_unit,
                   MPI_Info info, MPI_Comm comm, MPI_Win *win)
// Allocate a remotely accessible memory region in an RMA window
int MPI_Win_allocate(MPI_Aint size, int disp_unit,
                     MPI_Info info, MPI_Comm comm,
                     void *base, MPI_Win *win)
// Create an RMA window to which memory can later be attached
int MPI_Win_create_dynamic(MPI_Info info, MPI_Comm comm,
                           MPI_Win *win)
// Attach/detach memory to/from a dynamic window
int MPI_Win_attach(MPI_Win win, void *base, MPI_Aint size)
int MPI_Win_detach(MPI_Win win, void *base)
```
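A minimal sketch of window allocation; the helper name `make_window` is illustrative. Choosing `disp_unit = sizeof(double)` lets target displacements count elements rather than bytes:
```c
#include <mpi.h>

/* Hypothetical sketch: allocate a window of `count` doubles
 * on every process in `comm`. */
MPI_Win make_window(int count, double **base, MPI_Comm comm) {
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)count * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, comm, base, &win);
    return win;  /* release later with MPI_Win_free */
}
```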
### Data Movement
```c
// Move data to the origin, from the target
int MPI_Get(void *origin_addr, int origin_count,
            MPI_Datatype origin_datatype, int target_rank,
            MPI_Aint target_disp, int target_count,
            MPI_Datatype target_datatype, MPI_Win win)
// Move data from the origin, to the target
int MPI_Put(const void *origin_addr, int origin_count,
            MPI_Datatype origin_datatype, int target_rank,
            MPI_Aint target_disp, int target_count,
            MPI_Datatype target_datatype, MPI_Win win)
// Combine data from the origin into the target with a reduction op
// Special case: op = MPI_REPLACE yields an atomic put
int MPI_Accumulate(const void *origin_addr, int origin_count,
                   MPI_Datatype origin_datatype, int target_rank,
                   MPI_Aint target_disp, int target_count,
                   MPI_Datatype target_datatype,
                   MPI_Op op, MPI_Win win)
```
### Additional RMA Operations
* Get-accumulate
    * Atomically fetches the old value and applies the reduction op
* Compare-and-swap
    * Useful for building linked lists and other lock-free structures
* Fetch-and-op
    * Restricted to single elements, so faster for hardware to implement
### Ordering of Operations in MPI RMA
* `Put`/`Get` operations are not ordered
    * E.g., write-after-write to the same location has an undefined result
    * Additional consistency issues arise with overlapping accesses
* `Accumulate` operations
    * All accumulate operations are ordered by default
    * Use `MPI_REPLACE` for atomic `Put`s
### RMA Synchronization Models
* MPI RMA model allows data to be accessed only within an “epoch”
* Three types of epochs
    * Fence (active target)
    * Post-start-complete-wait (active target)
    * Lock/unlock (passive target)
#### Fence Synchronization
```c
// Simplest model: collective over all processes attached to the
// window; opens and closes access and exposure epochs
int MPI_Win_fence(int assert, MPI_Win win)
```
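A minimal fence sketch, assuming rank 0's window holds at least one int per rank and was created with `disp_unit = sizeof(int)` (both are assumptions of this example, not requirements from the slides):
```c
#include <mpi.h>

/* Hypothetical sketch: every rank puts its rank number into
 * rank 0's window, bracketed by two fences. */
void fence_example(MPI_Win win, int myrank) {
    MPI_Win_fence(0, win);           /* open the epoch            */
    MPI_Put(&myrank, 1, MPI_INT,     /* origin buffer             */
            /*target_rank=*/0,
            /*target_disp=*/myrank,  /* slot indexed by rank      */
            1, MPI_INT, win);
    MPI_Win_fence(0, win);           /* close epoch; data visible */
}
```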

#### Post/Start/Complete/Wait Synchronization
```c
// Target: Exposure epoch
// Target may allow a smaller group of processes to access its data
MPI_Win_post
MPI_Win_wait
// Origin: Access epoch
// Origin can indicate a smaller group of processes to retrieve data from
MPI_Win_start
MPI_Win_complete
```
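A hedged PSCW sketch: the `MPI_Group` arguments and the target rank 0 are assumed inputs for illustration, not prescribed by the slides.
```c
#include <mpi.h>

/* Target side: expose the window to the origin group. */
void target_side(MPI_Win win, MPI_Group origin_grp) {
    MPI_Win_post(origin_grp, 0, win);  /* open exposure epoch          */
    /* ... origins may access this window now ... */
    MPI_Win_wait(win);                 /* block until accesses finish */
}

/* Origin side: access the target group's windows. */
void origin_side(MPI_Win win, MPI_Group target_grp, double *buf) {
    MPI_Win_start(target_grp, 0, win); /* open access epoch            */
    MPI_Get(buf, 1, MPI_DOUBLE, /*target_rank=*/0,
            /*target_disp=*/0, 1, MPI_DOUBLE, win);
    MPI_Win_complete(win);             /* finish all RMA to targets    */
}
```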

#### Lock/Unlock Synchronization

* Passive mode: one-sided, asynchronous communication; the target does not participate
* `MPI_Win_lock`/`MPI_Win_unlock` are not a mutex!
    * Non-blocking; they only delimit the access epoch
* Similar to release consistency (see the sketch below)
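A hedged sketch of a passive-target update, combining lock/unlock with the fetch-and-op operation from above. It assumes the first int of the target's window serves as a shared counter with `disp_unit = sizeof(int)`:
```c
#include <mpi.h>

/* Hypothetical sketch: atomically increment a counter in rank
 * `target`'s window without the target calling anything. */
void passive_increment(MPI_Win win, int target) {
    int one = 1, old;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    /* Atomic fetch-and-add on the first int of the target window. */
    MPI_Fetch_and_op(&one, &old, MPI_INT, target, /*disp=*/0,
                     MPI_SUM, win);
    MPI_Win_unlock(target, win);  /* completes the operation */
}
```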
### Which Communication to Use?
#### Two-sided communication
* `Easier` to handle
* `Explicit send/recv` points
* Easier to reason about data and memory model
* Easy to use if `send/recv points are known` (e.g., SPMD codes)
#### RMA
* Good on architectures with `low protocol overheads`
* Especially if supported by the `underlying interconnect` (e.g., `InfiniBand`)
* **If not, RMA can be really slow**
#### Passive mode: `asynchronous` one-sided communication
* Data characteristics
    * Big data analysis
    * Requiring memory aggregation
    * Asynchronous data exchange
    * Data-dependent access patterns
* Computation characteristics
    * Adaptive methods (e.g., AMR, MADNESS)
    * Asynchronous dynamic load balancing
### Shared Memory in MPI
```c
// Create an MPI window that is also accessible with local load/store operations
// * Function returns the base pointer
// * Needs manual synchronization
// * Only part of the address space is shared
int MPI_Win_allocate_shared(MPI_Aint size, int disp_unit,
                            MPI_Info info, MPI_Comm comm,
                            void *baseptr, MPI_Win *win)
```
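A hedged sketch of node-local shared memory. `nodecomm` is assumed to be a communicator whose ranks share a node, e.g. created with `MPI_Comm_split_type(..., MPI_COMM_TYPE_SHARED, ...)`; the fence is just one simple way to synchronize here:
```c
#include <mpi.h>
#include <stdio.h>

void shared_demo(MPI_Comm nodecomm) {
    int rank, nprocs;
    MPI_Comm_rank(nodecomm, &rank);
    MPI_Comm_size(nodecomm, &nprocs);

    double *mybase;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(double), sizeof(double),
                            MPI_INFO_NULL, nodecomm, &mybase, &win);

    mybase[0] = (double)rank;  /* plain store into my own segment */
    MPI_Win_fence(0, win);     /* order the stores and loads      */

    /* Query a direct load/store pointer to the neighbour's segment. */
    MPI_Aint size; int disp_unit; double *peer;
    MPI_Win_shared_query(win, (rank + 1) % nprocs, &size, &disp_unit, &peer);
    printf("rank %d sees neighbour value %f\n", rank, peer[0]);

    MPI_Win_free(&win);
}
```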
## Hybrid Programming
* Use a shared memory model within an MPI process
* MPI + OpenMP as a typical option
### Threading and MPI
* Accesses from several threads can cause problems during execution
    * Shared internal data structures
    * Coordinated access to the NIC
    * Callback coordination
### MPI’s Four Levels of Thread Safety
```c
// New MPI initialization routine
int MPI_Init_thread(int *argc, char ***argv,
                    int required, int *provided)
MPI_THREAD_SINGLE
// Only one thread exists in the application
MPI_THREAD_FUNNELED
// Multithreaded, but only the main thread makes MPI calls
MPI_THREAD_SERIALIZED
// Multithreaded, but only one thread at a time makes MPI calls
MPI_THREAD_MULTIPLE
// Multithreaded and any thread can make MPI calls at any time
```
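A minimal usage sketch: request the highest level and verify what the library actually provides, since `provided` may be lower than `required`:
```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    /* The thread-level constants are monotonically ordered,
     * so a simple comparison suffices. */
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "thread support too low: %d\n", provided);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }
    /* ... any thread may now call MPI ... */
    MPI_Finalize();
    return 0;
}
```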
### Consequences of using MPI_THREAD_MULTIPLE
#### Each call to MPI completes on its own

#### User is responsible for making sure `racing calls are avoided`

* Collective operations must be ordered correctly among threads
### Thread Safe Probing and Matching Receive
```c
// MPI_Probe/MPI_Iprobe inspect the unexpected message queue (UMQ)
// Problem: if two threads call MPI_Probe and both match the same
// message, their subsequent MPI_Recv calls can race
// Solution: matching probes, which dequeue the matched message
// and return a handle to it
int MPI_Mprobe(int source, int tag, MPI_Comm comm,
               MPI_Message *message, MPI_Status *status)
int MPI_Improbe(int source, int tag, MPI_Comm comm,
                int *flag, MPI_Message *message, MPI_Status *status)
// Matching receive: receives exactly the message identified by the
// handle returned from MPI_Mprobe/MPI_Improbe
int MPI_Mrecv(void *buf, int count, MPI_Datatype datatype,
              MPI_Message *message, MPI_Status *status)
int MPI_Imrecv(void *buf, int count, MPI_Datatype datatype,
               MPI_Message *message, MPI_Request *request)
```
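A hedged usage sketch of probe-then-receive for an int message of unknown size; `buf` is assumed large enough for the matched message:
```c
#include <mpi.h>

void matched_receive(MPI_Comm comm, int *buf) {
    MPI_Message msg;
    MPI_Status  status;
    int count;
    /* Mprobe removes the message from the matching queue, so another
     * thread cannot steal it between the probe and the receive. */
    MPI_Mprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &msg, &status);
    MPI_Get_count(&status, MPI_INT, &count);  /* actual message size */
    MPI_Mrecv(buf, count, MPI_INT, &msg, MPI_STATUS_IGNORE);
}
```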
### Hybrid Programming MPI+OpenMP

* Danger: Amdahl’s Law applies to the `non-OpenMP regions`
* Need to split data structures (for MPI), then share the parts among OpenMP threads
    * Hierarchical data decomposition (a minimal sketch follows below)
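A minimal hybrid sketch, assuming `MPI_THREAD_FUNNELED`: OpenMP threads do the compute phase inside each MPI process, and only the main thread calls MPI. The loop and values are illustrative:
```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel for reduction(+ : local)
    for (int i = 0; i < 1000; i++)
        local += (double)i;             /* threaded compute phase */

    double global;                      /* main thread does MPI   */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", global);
    MPI_Finalize();
    return 0;
}
```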
### When to Use Hybrid?
* Depends on the application
* Options
    * MPI only, with one MPI process per hardware thread
    * One MPI process per node, one OpenMP thread per core on the node
    * Several MPI processes per node, rest of the concurrency in OpenMP
* **`One MPI process per socket`** often fits well
    * Large degree of parallelism left for OpenMP
    * Naturally **`avoids NUMA problems`**
    * Threads must be pinned to that socket
### What's Left in MPI: Other Topics
* MPI I/O
    * Parallel reading and writing of files
    * Often used together with a parallel file system
* Dynamic process control
    * Spawning of additional processes
    * Intra- vs. inter-communicators
    * Unfortunately rarely implemented or fully supported
* Additional concepts
    * Communicator attributes
    * Generalized requests
    * Topologies
* MPI tool interfaces
    * Transparent interception of MPI calls
    * Query interface for internal state
## Advanced Message Passing Interface (MPI)
...