# Paper Note
## MPI Tag Matching Performance on ConnectX and ARM
> `performance`, `2019`
[link](https://dl.acm.org/doi/pdf/10.1145/3343211.3343224)
- Approaches to reduce matching overhead (the search for a match):
1. hash table
2. hashed-binning
3. 4-D lookup tables
4. parallelism
5. offloading to the local HCA (ConnectX)
------
### Section 2 - background
Traditionally, `MPICH` and `MPICH`-derived MPI implementations used a pair of linked lists to implement matching (sketched below):
- posted receive queue (PRQ)
- unexpected message queue (UMQ)
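A minimal sketch of how these two lists interact (illustrative C, not actual MPICH source; the entry layout and field names are assumptions):

```c
#include <stddef.h>

/* Illustrative two-list matching flow (not actual MPICH code).
 * Matching is done on (source, tag, communicator). */
typedef struct entry {
    int src, tag, comm_id;
    struct entry *next;
} entry_t;

entry_t *prq = NULL;  /* posted receive queue (PRQ)     */
entry_t *umq = NULL;  /* unexpected message queue (UMQ) */

/* Linear scan for the first entry matching (src, tag, comm_id);
 * a real implementation also has to honor wildcards (MPI_ANY_SOURCE /
 * MPI_ANY_TAG), which is why a linear walk is used.
 * Returns the unlinked entry, or NULL if nothing matched. */
static entry_t *find_and_unlink(entry_t **head, int src, int tag, int comm_id)
{
    for (entry_t **p = head; *p != NULL; p = &(*p)->next) {
        entry_t *e = *p;
        if (e->src == src && e->tag == tag && e->comm_id == comm_id) {
            *p = e->next;   /* unlink the matched entry */
            return e;
        }
    }
    return NULL;
}

/* On message arrival: search the PRQ; if nothing matches, append the
 * message to the UMQ.  On posting a receive: search the UMQ first; if
 * nothing matches, append the receive to the PRQ.  Long queues make
 * these linear scans the dominant matching cost. */
```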
Recently, optimizations have been made in some major vendors’ drivers to use a hashed-binning solution where ==**the tag is hashed** to select one of many linked lists.==
> - Intel’s PSM2 driver : **randomized hash** that selects from 64 bins
> - Mellanox’s ConnectX-5 driver : **ordered hash** (the modulo of a 64 bit field that includes the MPI message tag) to select from 1021 bins
> Additionally, network vendors have also moved to address matching overheads by offering offloaded hardware support in the **local HCA**
> Host Channel Adapter (HCA) is the term commonly used for an InfiniBand network adapter.
- Mellanox’s ConnectX-5
- matches on a single 64-bit tag (a UCX tag) that packs the MPI tag, MPI rank, and MPI context ID
- two layers
- software layer
- **open source** and can be viewed as part of Open UCX
- a hash binning system with 1021 bins
- a hash function that takes the mod 1021 of
- the upper and lower 32-bits of the UCX tag
- combines the two results using a bitwise `xor` (sketched in the code below)
- hardware layer
- proprietary and the details are not publicly released
- **hardware matching** only happens for messages above a ==certain threshold== which defaults to `1KiB`
- these two layers can both be active at the same time
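A small C sketch of the bin-selection step described above (reconstructed from the description; the final reduction to keep the index under 1021 is an assumption, and the real Open UCX code may differ):

```c
#include <stdint.h>

#define NUM_BINS 1021   /* bin count used by the UCX software layer */

/* Map a 64-bit UCX tag (MPI tag, rank, and context ID packed together)
 * to a bin index: take mod 1021 of the upper and lower 32 bits, then
 * combine the two results with a bitwise xor. */
static unsigned tag_to_bin(uint64_t ucx_tag)
{
    unsigned hi = (unsigned)((ucx_tag >> 32) % NUM_BINS);
    unsigned lo = (unsigned)( ucx_tag        % NUM_BINS);
    /* xor of two values below 1021 can reach 1023, so a final reduction
     * is added here to stay in range -- an assumption, not confirmed
     * against the actual UCX source. */
    return (hi ^ lo) % NUM_BINS;
}
```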
------
### Section 3 - experimental design and setup
#### Hardware
- Astra supercomputer at Sandia National Laboratories
- Processors are 28-core Cavium ThunderX2 CN9975 chips operating at 2 GHz.
- The two processors within a node communicate over a 600 Gb/s cache-coherent interconnect
- 36 racks × 18 chassis per rack × 4 nodes per chassis = **2592 nodes** (each node has two processors, 5184 processors in total)
- The network is 4x EDR (100Gb/s) InfiniBand Mellanox ConnectX-5 hardware.
- Nodes communicate with the network interface via 8x PCIe 3 (16GT/s)
- The ConnectX-5 NIC provides hardware assistance for MPI message matching.
- Nodes are arranged in a three-level fat tree topology
Network performance on the cluster was evaluated by collecting data with a **modified version** of the OSU Micro-Benchmarks suite (OSU), run in two configurations (a code sketch of the prepost idea follows below):
> (i) testing different list lengths
> ==additional unmatched receives are posted== in advance of the performance testing
> (ii) matching the environment of a well-written application
> ==all receives are preposted== and the cache is cleared before each iteration.
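A rough illustration of what "posting additional unmatched receives" means in MPI terms (a sketch, not the actual OSU modification; the peer rank, queue depth, and dummy-tag choices are assumptions):

```c
#include <mpi.h>
#include <stdlib.h>

/* Prepost `depth` receives with tags the sender never uses, so every real
 * benchmark message must be searched past these entries in the posted
 * receive queue.  The modified OSU benchmark additionally controls whether
 * these dummy tags collide with the benchmark tag in the hash bins. */
static MPI_Request *prepost_unmatched(int depth, int peer, MPI_Comm comm)
{
    MPI_Request *reqs = malloc(depth * sizeof(MPI_Request));
    static char dummy;                 /* zero-byte receives, never completed */
    for (int i = 0; i < depth; i++) {
        int never_sent_tag = 100 + i;  /* assumed to be outside the benchmark's tag range */
        MPI_Irecv(&dummy, 0, MPI_BYTE, peer, never_sent_tag, comm, &reqs[i]);
    }
    return reqs;                       /* caller cancels/frees these after the timed loop */
}
```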
Env : `OpenMPI 3.1.3`, `GCC 7.2.0`
Message sizes : varied between `1B` and `1MiB`
Queue lengths : varied between `1` and `32k` tags
Each trial was performed with both hardware matching **enabled** and **disabled**.
Collision rates : `0%`, `1%`, `10%` and `100%`
Each trial was run 20 times
Run with 2 applications and 3 different matching techniques
- Applications : `LULESH 2.0` and `FDS`
1. software only
2. software with hardware
3. software with one bucket
------
### Section 4 - results
> Different numbers of unmatched receives are posted, increasing what we call the `receive queue depth`
> Collision rate : `0%`, `1%`, `10%` and `100%`
- Software Only Matching (baseline)
`Figure 1a` shows results from the modified OSU microbenchmarks bandwidth test with **no preposted unmatched receives and no hardware matching**.
This serves as a ==baseline== to test the software matching implementation.
There is a clear transition from eager to rendezvous at `16KiB` messages.
> ---
`Figure 1b` shows the same benchmark after **adding 1024 unmatched messages to the queue**.
Interestingly, after the transition to the `rendezvous message protocol` (like P2P), ==bandwidth does not vary significantly with collision rate==.
The latency inherent in the larger message sizes and the resulting lower message rate can ==hide the time spent message matching== even in more extreme cases of 100% collisions.
> ---
`Figure 1c` shows the bandwidth of a `4KiB` message across receive queue depths.
This data shows what we expect from binning systems; the performance is directly dependent on **the combination of queue depth and collision rate**.
Because the `effective search depth` is the number of entries in the selected bin, it can be estimated as *`queue_depth` ∗ `collision_rate`*.
This is reflected in the data by the ==decreased performance as collision rates increase== for the same queue depth.
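As a quick check on that formula: a queue depth of `1024` with a `10%` collision rate gives an expected effective search depth of roughly `1024 × 0.10 ≈ 102` entries per match attempt, while the same depth at `0%` collisions leaves the selected bin nearly empty.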
> ---
- Software + Hardware Matching
The performance gains of the new hardware are highly dependent on `message size`, with messages of approximately `1KiB` to `16 KiB` seeing improved performance from offloading tag matching to the NIC.
`Figure 2a` shows that messages of size `1KiB` to `16 KiB` had ==**significant increase** in bandwidth when **enabling hardware**==.
Outside of this range, however, there is actually a reduction in bandwidth; `32B` messages are ==**27.1% slower** with hardware enabled==.
For smaller messages, this reduction is likely due to overheads of hardware matching not being masked by overlap;
the additional latency from hardware message matching becomes the **performance bottleneck**.
> ---
`Figure 2b` shows that ==increased collision rate leads to a higher effective search depth.==
Comparing to the baseline above, `2KiB` to `16KiB` message sizes show a **performance increase ranging from 0.48% to 31.4%**.
As the collision rate increases, the performance in this range trends **downward**, with `16KiB` messages at 100% collision rate seeing an ==11.3% improvement==.
Outside of this range, the performance becomes worse with hardware matching enabled.
> future work : The performance dip that occurs between `16KiB` and `256KiB`
> ---
`Figure 2c` shows how hardware matching handles `4KiB` message with an increasing number of matching entries.
There is a ==significant performance drop== when the *expected value* for collisions becomes `1` (when queue depth ∗ collision rate = 1). This indicates that the benefit of hardware matching is **limited to** entries that are **first in their bucket**.
Once the queue depth becomes large enough that even a single collision is expected, ==hardware matching is quickly surpassed by software matching== performance and the variance between runs increases dramatically.
> ---
- Wild card Support
`MPI_ANY_TAG` : receive a message with any tag (minimal example below)
`Figure 7` shows that the hardware offload **does not appear to be impacted** by wild cards.
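For reference, a wildcard receive in standard MPI just passes `MPI_ANY_TAG` where a concrete tag would go (minimal illustrative example; the buffer and peer arguments are placeholders):

```c
#include <mpi.h>

/* Receive a message with any tag from `peer`; the tag that actually
 * matched is reported back in status.MPI_TAG. */
static void recv_any_tag(void *buf, int count, int peer, MPI_Comm comm)
{
    MPI_Status status;
    MPI_Recv(buf, count, MPI_BYTE, peer, MPI_ANY_TAG, comm, &status);
    /* status.MPI_TAG now holds the matched tag */
}
```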
> ---
#### Application
`LULESH` was selected as representative of **highly optimized HPC applications**. This class of applications reduces communication and message-processing overheads.
These applications ==do not see a runtime impact== from improved message matching.
`FDS` represents typical HPC applications, ones that use MPI correctly but may have **complex communication requirements**, suboptimal implementations, and/or scale sensitive communication patterns.
`Figure 9` shows that ==**hardware matching** greatly improves the application runtime== over UCX’s software implementation.
------
### Section 5 - discussion of the implications
The `software layer` is available through the Open UCX repository and utilizes a hash binning data structure containing 1021 bins.
The hash function takes the mod 1021 of the upper and lower 32 bits of the UCX tag and combines them using bitwise `xor`.
The `hardware layer` resides on the NIC, and performs UCX tag matching if a message is larger than a configurable threshold.
Small messages actually incur a penalty from hardware matching, while ==larger messages== (those over `1KiB`) see a ==benefit==.
This indicates that the `on-NIC` matching **takes more time** than matching on the `CPU`, but allows larger message sizes to see improved performance from overlapping matching with another task.
Additionally, we discovered that the `hardware matching layer` is ==**highly dependent** on the hash binning data structure==, and will only accelerate requests that are the first element in a bin.
> 1021 is a large number of bins, why worry about collision rate?
>
> Even though the individual collision rate is less than 0.1%, adding 100 random tags to the queue would result in an expected value of 9.4 collisions.
>
> More distinct messages -> the probability of collisions will ==greatly increase== (simulation sketch below).
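To see why the small per-pair probability still adds up, here is a back-of-the-envelope simulation (a sketch only: it uses uniformly random bins as a stand-in for the real UCX hash, and the exact figure depends on how a "collision" is counted, so it only lands in the same ballpark as the 9.4 quoted above):

```c
#include <stdio.h>
#include <stdlib.h>

#define NUM_BINS 1021

/* Throw `n` uniformly random tags into 1021 bins and count how many end up
 * sharing a bin with at least one other tag, averaged over `trials` runs. */
static double avg_colliding_tags(int n, int trials)
{
    long total = 0;
    for (int t = 0; t < trials; t++) {
        int count[NUM_BINS] = {0};
        int *bin = malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) {
            bin[i] = rand() % NUM_BINS;   /* uniform stand-in for the real hash */
            count[bin[i]]++;
        }
        for (int i = 0; i < n; i++)
            if (count[bin[i]] > 1)
                total++;                  /* this tag shares a bin with another */
        free(bin);
    }
    return (double)total / trials;
}

int main(void)
{
    printf("average colliding tags out of 100: %.2f\n", avg_colliding_tags(100, 20000));
    return 0;
}
```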
-------
### Section 6 - related work
#### Efficient MPI Message Matching
> CPU architectures
Manycore architectures can have an order of magnitude ==worse performance== than a traditional big-core out-of-order processor core.
> The length of an MPI match list
For a subset of scientific applications, processing unexpected messages can be a ==significant bottleneck== to performance
> ----
#### Matching Techniques
> Multithreaded hash table approach
Relies on a concurrent hash table design that is highly scalable. The concurrent nature of the hash table requires that **no wildcards** can be used in MPI messages at all.
-> Allows multiple threads to interact with MPI efficiently
> **hash-map keyed** to use the entire set of matching criteria
Allowed them to include wildcards in a hashing-based matching scheme.
Their design requires setting a user configurable number of bins to which message match requests are posted.
This approach seems to have solved major long list match performance issues, but the approach has a small overhead in the hash mapping for lists of any size, and therefore has ==higher overhead== than a traditional list when the match would be near the front of a traditional linked list.
> there has been work in allowing an MPI implementation to ==dynamically swap== between a hash-table and a traditional list
> Other solutions
1. using GPU
- there are no MPI implementations that run on GPU today
- It is also unknown what overheads exist for **short lists** or for the case of the **first match** element being the correct match
2. Decomposing the list into a 4-D lookup
- allows skipping portions of the list where the match cannot occur
3. Creating `hardware` designed to offload the matching process itself
- Portals communication API
- have been done in `FPGAs` and with `TCAMs`
- SeaStar interconnect
- offloaded Portals MPI matching on the NIC
- did so with a general-purpose CPU
------
### Section 7 - conclusions and future work
The OSU Benchmark results show that the ConnectX-5 is sensitive to both `message queue depth` and `tag collision rate`.
Demonstrated that hardware message matching increases performance for applications that send messages between `1KiB` and `16KiB`
Because the hardware and software layers for the ConnectX-5 matching scheme are network specific, ==their performance impacts are not portable across systems==.
For regular applications (not heavily optimized), a ==significant improvement== can be observed when leveraging software and hardware matching.
> Future work: how wildcard placement affects performance