# 9 Advanced MPI Communication Techniques Part 1
###### tags: `SS2021-IN2147-PP`
## Communication Protocols
### Create a “Logical” Topology
* Establish overlap, e.g., halos
* Define neighborhood information
* Establish communication direction
#### Arrange data items as needed
* Define base elements
* Create mapping of element to MPI process
#### Understand frequency of needed communication
### Common Operation: Bidirectional Exchange

### How Messages are Transferred


### Send Variants
```c
// Standard send
// MPI decides whether to buffer or to synchronize
MPI_Send
// Buffered send
// Forces the use of a send buffer
// Returns immediately, but costs resources (buffer space and an extra copy)
MPI_Bsend
// Synchronous send
// Returns only once the matching receive has started
// Adds extra synchronization, which can be costly
MPI_Ssend
// Ready send
// The user must ensure that the receive has already been posted
// Enables faster communication, but requires explicit synchronization by the user
MPI_Rsend
```
### Checking the UMQ with MPI_Probe/Iprobe
```c
/* Receive Side */
// Blocks until a matching message has been found
// Output: a status object describing the message
MPI_Probe
// Returns immediately after checking whether a matching message is available
// Output: a flag (yes/no), plus a status object if yes
// Can be used to overlap wait time with useful work
MPI_Iprobe
```
### Issue: Copy Overhead
#### Eager Protocol
> The message data is pushed to the receiver immediately; if no matching receive has been posted yet, the data must be buffered and copied again later

#### Rendezvous Protocol
> Only the envelope (metadata) is sent first; the data follows once the receiver acknowledges via a handshake

#### Eager vs. Rendezvous Protocol
* Eager Protocol
    * Suitable for short messages
* Rendezvous Protocol
    * Suitable for long messages
* Cross-over point
    * MPI switches protocols based on the ***message size***
    * The threshold is typically called the **`eager limit`**
    * Depends on the system setup
    * Can often be modified
#### Deadlock with Rendezvous Protocol
> If both processes send first, each blocks waiting for the other side's receive to pick up the envelope and answer the handshake, so neither send can complete

### Bidirectional Send/Recv
```c
// Combined blocking send and receive
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 int dest, int sendtag,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)
// Same, but send and receive share one buffer
int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype,
                         int dest, int sendtag,
                         int source, int recvtag,
                         MPI_Comm comm, MPI_Status *status)
```
## Nonblocking operations

* Useful for `long operations`
    * Waiting for events
    * Long I/O operations due to data size
* Completion options
    * Separate wait operations
    * Polling
    * Interrupts
### Nonblocking P2P Operations in MPI
```c
// Request object
// User allocates the object, but MPI maintains its state
// Every initiated operation must eventually be completed
MPI_Request
int MPI_Isend(void *buf, int count, MPI_Datatype dtype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irecv(void *buf, int count, MPI_Datatype dtype, int source,
              int tag, MPI_Comm comm, MPI_Request *request)
```
#### Nonblocking Send Variants
```c
MPI_Isend   // nonblocking standard send
MPI_Ibsend  // nonblocking buffered send
MPI_Issend  // nonblocking synchronous send
MPI_Irsend  // nonblocking ready send
```
#### Completion Operations
```c
// Option 1: Blocking completion
int MPI_Wait(MPI_Request *request, MPI_Status *status)
// Option 2: Nonblocking/polling completion, sets flag once done
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
```
### How Messages are Transferred (Blocking)

### How Messages are Transferred (Nonblocking)

### MPI Terminology May Not Be What it Seems

* Blocking routines
    * Return only once their buffers can safely be reused; they need not wait for the matching operation on the other side
* Nonblocking routines
    * Return immediately; buffers remain under MPI control until the operation is completed

* Non-local routines
    * Require some (specific) execution in another process in order to complete
    * E.g., for MPI_Recv to complete, another process must call an MPI_(X)send
* Local routines
    * Complete without requiring action by another process (warning: the MPI Standard is currently being updated)
### Persistent Communication
* Establish ***repeating communication***
    * An **`*_init`** call sets up the communication
    * **`MPI_Start`** kicks off each repetition
* Example
```c
MPI_Request req;
MPI_Status status;
int msg[10];
// Same signature as the MPI_Isend call
MPI_Send_init(msg, 10, MPI_INT, 1, 42, MPI_COMM_WORLD, &req);
// Information is stored in the request, which is inactive
// Start the communication
MPI_Start(&req);
// Request now active; behaves like a nonblocking operation
// Complete the communication with Wait/Test routines
MPI_Wait(&req, &status);
// Request now inactive again; it can be restarted with MPI_Start or freed
MPI_Request_free(&req);
```
## Collective Operations
* Executed by `all processes` in the communicator
* Must be called in the `same sequence` on all processes
* Can be `blocking` or `nonblocking`
### MPI_Barrier
```c
int MPI_Barrier (MPI_Comm comm)
```
### Broadcast Communication: MPI_Bcast
```c
// Collectives do not necessarily synchronize:
// MPI_Bcast may return on the root before any receiver has started
int MPI_Bcast(void *buf, int count, MPI_Datatype dtype,
              int root, MPI_Comm comm)
```
### MPI_Gather
```c
int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype,
               void *recvbuf, int recvcount, MPI_Datatype recvtype,
               int root, MPI_Comm comm)
```

### MPI_Scatter
```c
int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                void *recvbuf, int recvcount, MPI_Datatype recvtype,
                int root, MPI_Comm comm)
```

### Other Collective Communication Routines
```c
MPI_Gatherv
MPI_Allgather
MPI_Allgatherv
MPI_Scatterv
MPI_Alltoall
MPI_Alltoallv
MPI_Alltoallw
```
#### MPI_Gatherv
```c
int MPI_Gatherv(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                void *recvbuf, int *recvcounts, int *displs,
                MPI_Datatype recvtype, int root, MPI_Comm comm)
```

#### MPI_Alltoall
```c
int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 MPI_Comm comm)
```

### MPI_Reduce
```c
int MPI_Reduce(void *sbuf, void *rbuf, int count, MPI_Datatype dtype,
               MPI_Op op, int root, MPI_Comm comm)
```
### Nonblocking Collectives (since MPI 3.0)
```c
// Nonblocking collectives include MPI_Ibarrier, the nonblocking barrier
// Each returns a request object, completed with Wait/Test routines
int MPI_Ibcast(void *buffer, int count, MPI_Datatype datatype,
               int root, MPI_Comm comm, MPI_Request *request)
```
## Communicators
### Communicators and MPI_COMM_WORLD
* A process can have different ranks in different communicators
* one MPI process can be in multiple communicators
### Creating New Communicators
```c
int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
// Also exists as MPI_Comm_idup (nonblocking)
```
### Creating Subcommunicators
```c
// Creates new communicator(s)
// All MPI processes that pass the same color will be in the same new communicator
// The key argument determines the rank order in the new communicator
int MPI_Comm_split(MPI_Comm comm, int color,
                   int key, MPI_Comm *newcomm)
// Communicators should be freed when no longer in use
int MPI_Comm_free(MPI_Comm *comm)
```
### Example: Row and Column Communicators
