# 10 Advanced MPI Communication Techniques Part 2

###### tags: `SS2021-IN2147-PP`

## Datatypes

### Central MPI Types and Objects

```c
// Communicators
MPI_Comm
// Request object: handle for non-blocking communication
MPI_Request
// Object type for a key/value store,
// passed to many routines to provide additional information
MPI_Info
// Structured datatypes for communication
MPI_Datatype
```

### Derived MPI Datatypes and Management

```c
// Datatype construction
int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)

// Datatype commit
int MPI_Type_commit(MPI_Datatype *datatype)

// Datatype free
int MPI_Type_free(MPI_Datatype *datatype)
```

![](https://i.imgur.com/ukAzlge.png)

### Constructing a Vector Datatype

```c
int MPI_Type_vector(int count, int blocklength, int stride,
                    MPI_Datatype oldtype, MPI_Datatype *newtype)
```

![](https://i.imgur.com/JpUp1Ve.png)

### Constructing an Indexed Datatype

```c
int MPI_Type_indexed(int count,
                     const int array_of_blocklengths[],   // AoB
                     const int array_of_displacements[],  // AoD
                     MPI_Datatype oldtype, MPI_Datatype *newtype)
```

![](https://i.imgur.com/ETCp7t8.png)

### Creating a Struct Datatype

```c
int MPI_Type_create_struct(int count,
                           const int array_of_blocklengths[],        // AoB
                           const MPI_Aint array_of_displacements[],  // AoD
                           const MPI_Datatype array_of_types[],      // AoT
                           MPI_Datatype *newtype)
```

![](https://i.imgur.com/gE4jaf9.png)

### Constructing a Subarray Datatype

```c
int MPI_Type_create_subarray(int ndims,
                             const int array_of_sizes[],
                             const int array_of_subsizes[],
                             const int array_of_starts[],
                             int order, MPI_Datatype oldtype, MPI_Datatype *newtype)
```

![](https://i.imgur.com/yOenIDP.png)

### Some Other Datatype Construction Functions

```c
MPI_Type_dup                    // Duplicate a datatype
MPI_Type_create_hvector         // Like MPI_Type_vector, but with a byte-sized stride
MPI_Type_create_hindexed        // Like MPI_Type_indexed, but with byte displacements
MPI_Type_create_indexed_block   // Like MPI_Type_indexed, but with a fixed block size
MPI_Type_create_hindexed_block  // Like MPI_Type_create_indexed_block, but with byte displacements
MPI_Type_create_darray          // Create a distributed array datatype
```

### Inspecting Datatypes

```c
// Query metadata information on a type
int MPI_Type_get_envelope(MPI_Datatype datatype, int *num_integers,
                          int *num_addresses, int *num_datatypes, int *combiner)

// Query actual type information
int MPI_Type_get_contents(MPI_Datatype datatype, int max_integers,
                          int max_addresses, int max_datatypes,
                          int array_of_integers[], MPI_Aint array_of_addresses[],
                          MPI_Datatype array_of_datatypes[])
```

### Datatype Discussions

* Warning
    * Datatype creation/destruction can lead to overhead
    * Not many MPI implementations are optimized for datatypes
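As a quick illustration of the construct/commit/use/free cycle above, here is a minimal sketch (not from the slides; names such as `NROWS`/`NCOLS` are illustrative) that sends one column of a row-major matrix with a vector datatype. It assumes at least two ranks.

```c
#include <mpi.h>
#include <stdio.h>

#define NROWS 4
#define NCOLS 5

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // One column of a row-major NROWS x NCOLS matrix:
    // NROWS blocks of 1 double each, separated by a stride of NCOLS elements
    MPI_Datatype column;
    MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);   // required before use in communication

    if (rank == 0) {
        double matrix[NROWS][NCOLS];
        for (int i = 0; i < NROWS; i++)
            for (int j = 0; j < NCOLS; j++)
                matrix[i][j] = i * NCOLS + j;
        // One element of the derived type describes the whole column
        MPI_Send(&matrix[0][0], 1, column, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double col[NROWS];
        // The receiver may describe the same data as plain contiguous doubles
        MPI_Recv(col, NROWS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < NROWS; i++) printf("col[%d] = %.0f\n", i, col[i]);
    }

    MPI_Type_free(&column);     // release the datatype when no longer needed
    MPI_Finalize();
    return 0;
}
```

Note that only the type signatures (here: four doubles on both sides) must match; the layouts on sender and receiver may differ.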
## One-Sided Communication

### Communication Modes

#### Two-sided communication

![](https://i.imgur.com/b64kWgX.png =400x)

* Explicit sender and receiver for `P2P`
* All processes call `collectives`
* Advantages
    * `Simple`
    * Clear communication locations
    * `Implicit synchronization`
* Disadvantages
    * `Implicit synchronization`
    * Requires active involvement on both sides
    * Receiver can delay the sender

#### One-sided communication

![](https://i.imgur.com/jPvWd2T.png =400x)

* Decouples data movement from process synchronization
* A process exposes a part of its memory
* Other processes can directly read/write it
* ![](https://i.imgur.com/O3TPWCb.png =400x)
* Advantages
    * `Direct memory access`
    * Receiver does not get involved
* Requires `extra synchronization`
* Remote Memory Access (`RMA`) in MPI
    * Shared memory models
    * Explicit get and put calls
    * Needs hardware support

### Two-sided vs. One-sided Communication

![](https://i.imgur.com/nmyBhP3.png =500x)

![](https://i.imgur.com/S8vS1Fk.png =500x)

### Creating Public Memory

* MPI terminology for `remotely accessible memory` is a “**`Window`**”
* Several window creation models

```c
// Make existing/allocated memory regions remotely accessible
MPI_WIN_CREATE
// Allocate a memory region and make it accessible
MPI_WIN_ALLOCATE
// No buffer yet, but will have one in the future
MPI_WIN_CREATE_DYNAMIC
// Add memory to a dynamic window
MPI_WIN_ATTACH
// Remove memory from a dynamic window
MPI_WIN_DETACH
```

### Making Local Memory Accessible

```c
// Expose a region of memory in an RMA window
int MPI_Win_create(void *base, MPI_Aint size, int disp_unit,
                   MPI_Info info, MPI_Comm comm, MPI_Win *win)

// Create a remotely accessible memory region in an RMA window
int MPI_Win_allocate(MPI_Aint size, int disp_unit, MPI_Info info,
                     MPI_Comm comm, void *base, MPI_Win *win)

// Create an RMA window, to which data can later be attached
int MPI_Win_create_dynamic(MPI_Info info, MPI_Comm comm, MPI_Win *win)

// Attaching and detaching memory
int MPI_Win_attach(MPI_Win win, void *base, MPI_Aint size)
int MPI_Win_detach(MPI_Win win, void *base)
```

### Data Movement

```c
// Move data to the origin, from the target
int MPI_Get(void *origin_addr, int origin_count, MPI_Datatype origin_datatype,
            int target_rank, MPI_Aint target_disp, int target_count,
            MPI_Datatype target_datatype, MPI_Win win)

// Move data from the origin, to the target
int MPI_Put(void *origin_addr, int origin_count, MPI_Datatype origin_datatype,
            int target_rank, MPI_Aint target_disp, int target_count,
            MPI_Datatype target_datatype, MPI_Win win)

// Combine data from the origin into the target using a reduction operation
// Special case: MPI_REPLACE - leads to an atomic update
int MPI_Accumulate(void *origin_addr, int origin_count, MPI_Datatype origin_datatype,
                   int target_rank, MPI_Aint target_disp, int target_count,
                   MPI_Datatype target_datatype, MPI_Op op, MPI_Win win)
```

### Additional RMA Operations

* Get-accumulate
* Compare-and-swap
    * Useful for linked lists
* Fetch-and-op
    * Faster for the hardware to implement

### Ordering of Operations in MPI RMA

* `Put`/`Get` operations
    * No ordering is guaranteed
    * Write-after-write causes additional consistency issues
* `Accumulate` operations
    * All accumulate operations are ordered by default
    * Use `MPI_REPLACE` for atomic `PUT`s

### RMA Synchronization Models

* The MPI RMA model allows data to be accessed only within an “epoch”
* Three types of epochs
    * Fence (active target)
    * Post-start-complete-wait (active target)
    * Lock/unlock (passive target)

#### Fence Synchronization

```c
// Simplest model
MPI_Win_fence(assert, win)
```

![](https://i.imgur.com/cgYszrr.png =300x)

#### Post/Start/Complete/Wait Synchronization

```c
// Target: exposure epoch
// The target may allow a smaller group of processes to access its data
MPI_Win_post
MPI_Win_wait

// Origin: access epoch
// The origin can indicate a smaller group of processes to retrieve data from
MPI_Win_start
MPI_Win_complete
```

![](https://i.imgur.com/EmtpJL6.png =300x)

#### Lock/Unlock Synchronization

![](https://i.imgur.com/5MierXU.png =600x)

* Passive mode: one-sided, asynchronous communication
* Not a mutex!
* Non-blocking
* Similar to release consistency
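The following minimal sketch (not from the slides) ties window creation, fence synchronization, and `MPI_Put` together: every rank writes its rank number into the window of its right neighbor.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Allocate one int per rank and make it remotely accessible
    int *buf;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    *buf = -1;

    int target = (rank + 1) % size;
    MPI_Win_fence(0, win);                        // open the access/exposure epoch
    MPI_Put(&rank, 1, MPI_INT, target, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                        // complete all pending RMA operations

    // Each rank now holds the rank number of its left neighbor
    printf("rank %d received %d\n", rank, *buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Fence is collective, so it fits regular, bulk-synchronous access patterns; lock/unlock would be used instead when only the origin should drive the communication.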
### Which Communication to Use?

#### Two-sided communication

* `Easier` to handle
* `Explicit send/recv` points
    * Easier to reason about the data and memory model
* Easy to use if `send/recv points are known` (e.g., SPMD codes)

#### RMA

* Good for architectures with `low protocol overheads`
* Especially if supported by the `underlying interconnect` (e.g., `InfiniBand`)
* **If not, RMA can be really slow**

#### Passive mode: `asynchronous` one-sided communication

* Data characteristics
    * Big data analysis
    * Requires memory aggregation
    * Asynchronous data exchange
    * Data-dependent access pattern
* Computation characteristics
    * Adaptive methods
        * E.g., AMR, MADNESS
    * Asynchronous dynamic load balancing

### Shared Memory in MPI

```c
// Create an MPI window that is also accessible with **local load/store operations**
// * Function returns the base pointer
// * Needs manual synchronization
// * Only part of the address space is shared
int MPI_Win_allocate_shared(MPI_Aint size, int disp_unit, MPI_Info info,
                            MPI_Comm comm, void *baseptr, MPI_Win *win)
```

## Hybrid Programming

* Use a shared memory model within an MPI process
* MPI + OpenMP is a typical option

### Threading and MPI

* Accesses from several threads can cause problems during execution
    * Shared data structures
    * Coordinating access to the NIC
    * Callback coordination

### MPI’s Four Levels of Thread Safety

```c
// New MPI initialization routine
int MPI_Init_thread(int *argc, char ***argv, int required, int *provided)

MPI_THREAD_SINGLE      // Only one thread exists in the application
MPI_THREAD_FUNNELED    // Multithreaded, but only the main thread makes MPI calls
MPI_THREAD_SERIALIZED  // Multithreaded, but only one thread at a time makes MPI calls
MPI_THREAD_MULTIPLE    // Multithreaded, and any thread can make MPI calls at any time
```

### Consequences of using MPI_THREAD_MULTIPLE

#### Each call to MPI completes on its own

![](https://i.imgur.com/C7QYr5f.png)

#### User is responsible for making sure `racing calls are avoided`

![](https://i.imgur.com/5qmyDhH.png)

* Collective operations must be ordered consistently among threads

### Thread-Safe Probing and Matching Receive

```c
// MPI_Probe/MPI_Iprobe inspect the unexpected message queue (UMQ),
// but if two threads call MPI_Probe and both return, a plain receive may
// match the wrong message - we need matching probes and matching receives

// Matching probes: wait for / check for a matching message
// * Return a handle to the matched message
int MPI_Mprobe(int source, int tag, MPI_Comm comm,
               MPI_Message *message, MPI_Status *status)
int MPI_Improbe(int source, int tag, MPI_Comm comm, int *flag,
                MPI_Message *message, MPI_Status *status)

// Matching receive: receive the message identified by Mprobe
// * The message handle is passed into the receive
int MPI_Mrecv(void *buf, int count, MPI_Datatype datatype,
              MPI_Message *message, MPI_Status *status)
int MPI_Imrecv(void *buf, int count, MPI_Datatype datatype,
               MPI_Message *message, MPI_Request *request)
```

### Hybrid Programming MPI+OpenMP

![](https://i.imgur.com/HsvJ0Jh.png)

* Danger: Amdahl’s law in the `non-OpenMP regions`
* Need to split data structures (for MPI), then share data parts (for OpenMP threads)
    * Hierarchical data decomposition
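A minimal hybrid sketch (illustrative, not from the slides) that requests `MPI_THREAD_FUNNELED`: OpenMP threads do the computation, and only the main thread, outside the parallel region, makes MPI calls.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    // Request funneled threading support; check what the library actually provides
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI_THREAD_FUNNELED not supported\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Threaded computation inside the MPI process (no MPI calls in here)
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000; i++)
        local_sum += (rank + 1) * 0.001 * i;

    // Only the main thread communicates
    double global_sum;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```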
### When to Use Hybrid?

* Depends on the application
* Options
    * MPI only, with one MPI process per HW thread
    * One MPI process per node, one OpenMP thread per core on the node
    * Several MPI processes per node, rest of the concurrency in OpenMP
* **`One MPI process per socket`** often fits well
    * Large degree of parallelism in OpenMP
    * Naturally **`avoids NUMA problems`**
    * Need to pin threads to that socket

### What's Left in MPI: Other Topics

* MPI I/O
    * Parallel reading and writing of files
    * Often used in connection with a parallel file system
* Dynamic process control
    * Spawning of additional processes
    * Intra- vs. inter-communicators
    * Unfortunately, rarely implemented or fully supported
* Additional concepts
    * Communicator attributes
    * Generalized requests
    * Topologies
* MPI tool interfaces
    * Transparent interception of MPI calls
    * Query interface for internal state

## Advanced Message Passing Interface (MPI)

...