# 9 Advanced MPI Communication Techniques Part 1

###### tags: `SS2021-IN2147-PP`

## Communication Protocols

### Create a “Logical” Topology

* Establish overlap, e.g., halos
* Define neighborhood information
* Establish communication direction

#### Arrange data items as needed

* Define base elements
* Create a mapping of elements to MPI processes

#### Understand the frequency of needed communication

### Common Operation: Bidirectional Exchange

![](https://i.imgur.com/wZWyNJy.png =400x)

### How Messages are Transferred

![](https://i.imgur.com/qwPOb44.png)
![](https://i.imgur.com/RXyEizT.png)

### Send Variants

```c
// Standard send operation
MPI_Send

// Buffered send
// Forces the use of a send buffer
// Returns immediately, but costs resources
MPI_Bsend

// Synchronous send
// Only returns once the matching receive has started
// Adds extra synchronization, which can be costly
MPI_Ssend

// Ready send
// User must ensure that the receive has already been posted
// Enables faster communication, but the synchronization must be
// guaranteed implicitly by the program
MPI_Rsend
```

### Checking the UMQ with MPI_Probe/Iprobe

```c
/* Receive side: check the unexpected message queue (UMQ) */

// Blocks until a matching message is found
// Output: status object
MPI_Probe

// Returns after checking whether a matching message is available
// Output: yes/no flag, plus a status object (if yes)
// Can be used to overlap wait time with useful work
MPI_Iprobe
```

### Issue: Copy Overhead

#### Eager Protocol

> Message data is copied too many times

![](https://i.imgur.com/vLId5At.png)

#### Rendezvous Protocol

> Send only the envelope (metadata) first

![](https://i.imgur.com/GBNI5ZY.png)

#### Eager vs. Rendezvous Protocol

* Eager protocol
    * Suitable for short messages
* Rendezvous protocol
    * Suitable for long messages
* Cross-over point
    * MPI **switches** protocols based on the ***message size***
    * Typically defined as the **`eager limit`**
    * Depends on the system setup
    * Can often be modified

#### Deadlock with Rendezvous Protocol

> Both senders wait for the matching receive to pick up the envelope and answer the handshake

![](https://i.imgur.com/9NKfFZU.png)

### Bidirectional Send/Recv

```c
// Combined blocking send and receive
int MPI_Sendrecv (void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  int dest, int sendtag,
                  void *recvbuf, int recvcount, MPI_Datatype recvtype,
                  int source, int recvtag,
                  MPI_Comm comm, MPI_Status *status)

// Combined send and receive using a single buffer
int MPI_Sendrecv_replace (void *buf, int count, MPI_Datatype datatype,
                          int dest, int sendtag, int source, int recvtag,
                          MPI_Comm comm, MPI_Status *status)
```

## Nonblocking Operations

![](https://i.imgur.com/93WhFSW.png =300x)

* Useful for `long operations`
    * Waiting for events
    * Long I/O operations due to data size
* Completion options
    * Separate wait operations
    * Polling
    * Interrupts

### Nonblocking P2P Operations in MPI

```c
// Request object
// User allocates the object, but MPI maintains its state
// Initiated operations must be completed
MPI_Request

int MPI_Isend (void *buf, int count, MPI_Datatype dtype, int dest,
               int tag, MPI_Comm comm, MPI_Request *request)

int MPI_Irecv (void *buf, int count, MPI_Datatype dtype, int source,
               int tag, MPI_Comm comm, MPI_Request *request)
```

#### Nonblocking Send Variants

```c
MPI_Isend
MPI_Ibsend
MPI_Issend
MPI_Irsend
```

#### Completion Operations

```c
// Option 1: Blocking completion
int MPI_Wait (MPI_Request *request, MPI_Status *status)

// Option 2: Nonblocking/polling completion
int MPI_Test (MPI_Request *request, int *flag, MPI_Status *status)
```
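To make these semantics concrete, here is a minimal sketch (not from the slides) of a deadlock-free bidirectional exchange between two ranks: both operations are initiated first, independent work can happen in between, and `MPI_Waitall` completes them. The buffer sizes, tag, and the two-rank setup are illustrative assumptions.

```c
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, other;
    int sendbuf[10], recvbuf[10];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                  // assumes exactly two ranks

    for (int i = 0; i < 10; i++) sendbuf[i] = rank;

    // Initiate both transfers; neither call blocks, so there is no
    // rendezvous deadlock regardless of the eager limit
    MPI_Irecv(recvbuf, 10, MPI_INT, other, 42, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, 10, MPI_INT, other, 42, MPI_COMM_WORLD, &reqs[1]);

    // ... useful work that does not touch sendbuf/recvbuf ...

    // Complete both operations; only now may the buffers be reused
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}
```

The same pattern also works with `MPI_Sendrecv` in a single blocking call; the nonblocking form additionally allows the overlap of communication and computation.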
### How Messages are Transferred (Blocking)

![](https://i.imgur.com/NfSIK2B.png)

### How Messages are Transferred (Nonblocking)

![](https://i.imgur.com/sJGtuvk.png)

### MPI Terminology May Not Be What it Seems

![](https://i.imgur.com/jTHwztU.png)

* Blocking routines
    * Buffers can be reused after the call returns
* Nonblocking routines
    * Buffers remain under MPI control until completion

![](https://i.imgur.com/sTv2rhB.png)

* Non-local routines
    * Require some (specific) execution in another process
    * E.g., for MPI_Recv to complete, another process must call MPI_(X)send
* Local routines
    * Not non-local (warning: the MPI Standard is currently being updated here)

### Persistent Communication

* Establish ***repeating communication***
* An **`"Init" call`** sets up the communication
* **`MPI_Start`** kicks off the communication
* Example

```c
MPI_Request req;
MPI_Status status;
int msg[10];

// Same signature as the MPI_Isend call
MPI_Send_init(msg, 10, MPI_INT, 1, 42, MPI_COMM_WORLD, &req);
// Information is stored in the request, which is inactive

// Start the communication
MPI_Start(&req);
// The request is now active, basically a nonblocking operation

// Complete the communication with Wait/Test routines
MPI_Wait(&req, &status);
// The request is now inactive again and can be reused

MPI_Request_free(&req);
```

## Collective Operations

* Executed by `all processes` of a communicator
* Executed in the `same sequence` on all processes
* Can be `blocking` or `nonblocking`

### MPI_Barrier

```c
int MPI_Barrier (MPI_Comm comm)
```

### Broadcast Communication: MPI_Bcast

```c
// Does not necessarily synchronize:
// MPI_Bcast may return on a process before all others have the data
int MPI_Bcast (void *buf, int count, MPI_Datatype dtype, int root,
               MPI_Comm comm)
```

### MPI_Gather

```c
int MPI_Gather (void *sendbuf, int sendcount, MPI_Datatype sendtype,
                void *recvbuf, int recvcount, MPI_Datatype recvtype,
                int root, MPI_Comm comm)
```

![](https://i.imgur.com/dmCwwqh.png)

### MPI_Scatter

```c
int MPI_Scatter (void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 int root, MPI_Comm comm)
```

![](https://i.imgur.com/gtwmTl3.png)

### Other Collective Communication Routines

```c
MPI_Gatherv
MPI_Allgather
MPI_Allgatherv
MPI_Scatterv
MPI_Alltoall
MPI_Alltoallv
MPI_Alltoallw
```

#### MPI_Gatherv

```c
int MPI_Gatherv (void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 void *recvbuf, int *recvcounts, int *displs,
                 MPI_Datatype recvtype, int root, MPI_Comm comm)
```

![](https://i.imgur.com/nisWrfr.png)

#### MPI_Alltoall

```c
int MPI_Alltoall (void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  void *recvbuf, int recvcount, MPI_Datatype recvtype,
                  MPI_Comm comm)
```

![](https://i.imgur.com/krFWc4S.png)

### MPI_Reduce

```c
int MPI_Reduce (void *sbuf, void *rbuf, int count, MPI_Datatype dtype,
                MPI_Op op, int root, MPI_Comm comm)
```

### Nonblocking Collectives (since MPI 3.0)

```c
// Includes MPI_Ibarrier, the nonblocking barrier
// Returns a request object, completed with Wait/Test routines
int MPI_Ibcast (void *buffer, int count, MPI_Datatype datatype, int root,
                MPI_Comm comm, MPI_Request *request)
```
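A minimal sketch (not from the slides) of the overlap pattern nonblocking collectives enable: the broadcast is initiated, independent computation proceeds, and `MPI_Wait` completes the collective before the buffer is used. The buffer size and root rank are illustrative assumptions.

```c
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, data[1000];
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        for (int i = 0; i < 1000; i++) data[i] = i;

    // All ranks initiate the broadcast and return immediately
    MPI_Ibcast(data, 1000, MPI_INT, 0, MPI_COMM_WORLD, &req);

    // ... computation that does not read or write `data` ...

    // Complete the collective before using `data`
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```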
## Communicators

### Communicators and MPI_COMM_WORLD

* A process can have different ranks in different communicators
* One MPI process can be in multiple communicators

### Creating New Communicators

```c
int MPI_Comm_dup (MPI_Comm comm, MPI_Comm *newcomm)
// Also exists as the nonblocking MPI_Comm_idup
```

### Creating Subcommunicators

```c
// Creates new communicator(s)
// All MPI processes that pass the same color end up in the same new communicator
// The key argument determines the rank order in the new communicator
int MPI_Comm_split (MPI_Comm comm, int color, int key, MPI_Comm *newcomm)

// Communicators should be freed when no longer in use
int MPI_Comm_free (MPI_Comm *comm)
```

### Example: Row and Column Communicators

![](https://i.imgur.com/BU9qEfo.png)
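A minimal sketch (not from the slides) of how the row and column communicators in the figure could be built with `MPI_Comm_split`; the 4×4 process grid and the rank-to-grid mapping are illustrative assumptions.

```c
#include <mpi.h>

int main(int argc, char **argv) {
    int world_rank, row_rank, col_rank;
    MPI_Comm row_comm, col_comm;
    const int ncols = 4;               // assumed 4x4 process grid

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    int row = world_rank / ncols;      // grid coordinates of this rank
    int col = world_rank % ncols;

    // Same color = same row; key orders ranks by column within the row
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);
    // Same color = same column; key orders ranks by row within the column
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

    MPI_Comm_rank(row_comm, &row_rank);   // equals col
    MPI_Comm_rank(col_comm, &col_rank);   // equals row

    // Collectives can now run per row or per column, e.g. a row-wise
    // MPI_Reduce, before the communicators are freed
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```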