# Paper Note
## MPI Tag Matching Performance on ConnectX and ARM
> `performance`, `2019`
[link](https://dl.acm.org/doi/pdf/10.1145/3343211.3343224)
- Approaches to reduce matching overhead (the search for a match):
1. hash table
2. hashed-binning
3. 4-D lookup tables
4. parallelism
5. offloading to the local HCA (ConnectX)
------
### Section 2 - background
Traditionally, `MPICH` and `MPICH`-derived MPI implementations used a pair of linked lists to implement matching (sketched below):
- posted receive queue (PRQ)
- unexpected message queue (UMQ)
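A minimal sketch of how these two lists interact (illustrative C, not actual MPICH source; the entry layout and field names are assumptions):

```c
#include <stddef.h>

/* Illustrative two-list matching flow (not actual MPICH code).
 * Matching is done on (source, tag, communicator). */
typedef struct entry {
    int src, tag, comm_id;
    struct entry *next;
} entry_t;

entry_t *prq = NULL;  /* posted receive queue (PRQ)     */
entry_t *umq = NULL;  /* unexpected message queue (UMQ) */

/* Linear scan for the first entry matching (src, tag, comm_id);
 * a real implementation also has to honor wildcards (MPI_ANY_SOURCE /
 * MPI_ANY_TAG), which is why a linear walk is used.
 * Returns the unlinked entry, or NULL if nothing matched. */
static entry_t *find_and_unlink(entry_t **head, int src, int tag, int comm_id)
{
    for (entry_t **p = head; *p != NULL; p = &(*p)->next) {
        entry_t *e = *p;
        if (e->src == src && e->tag == tag && e->comm_id == comm_id) {
            *p = e->next;   /* unlink the matched entry */
            return e;
        }
    }
    return NULL;
}

/* On message arrival: search the PRQ; if nothing matches, append the
 * message to the UMQ.  On posting a receive: search the UMQ first; if
 * nothing matches, append the receive to the PRQ.  Long queues make
 * these linear scans the dominant matching cost. */
```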
Recently, optimizations have been made in some major vendors’ drivers to use a hashed-binning solution where ==**the tag is hashed** to select one of many linked lists.==
> - Intel’s PSM2 driver : **randomized hash** that selects from 64 bins
> - Mellanox’s ConnectX-5 driver : **ordered hash** (the modulo of a 64 bit field that includes the MPI message tag) to select from 1021 bins
> Additionally, network vendors have also moved to address matching overheads by offering offloaded hardware support in the **local HCA**
> Host Channel Adapter (HCA) is the term commonly used for an InfiniBand network adapter.
- Mellanox’s ConnectX-5
- matches on a single 64-bit tag (a UCX tag) that packs the MPI tag, MPI rank, and MPI context ID
- two layers
- software layer
- **open source** and can be viewed as part of Open UCX
- a hash binning system with 1021 bins
- a hash function that takes the mod 1021 of
- the upper and lower 32-bits of the UCX tag
- combines the two results using a bitwise `xor` (sketched in the code below)
- hardware layer
- proprietary and the details are not publicly released
- **hardware matching** only happens for messages above a ==certain threshold== which defaults to `1KiB`
- these two layers can both be active at the same time
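A small C sketch of the bin-selection step described above (reconstructed from the description; the final reduction to keep the index under 1021 is an assumption, and the real Open UCX code may differ):

```c
#include <stdint.h>

#define NUM_BINS 1021   /* bin count used by the UCX software layer */

/* Map a 64-bit UCX tag (MPI tag, rank, and context ID packed together)
 * to a bin index: take mod 1021 of the upper and lower 32 bits, then
 * combine the two results with a bitwise xor. */
static unsigned tag_to_bin(uint64_t ucx_tag)
{
    unsigned hi = (unsigned)((ucx_tag >> 32) % NUM_BINS);
    unsigned lo = (unsigned)( ucx_tag        % NUM_BINS);
    /* xor of two values below 1021 can reach 1023, so a final reduction
     * is added here to stay in range -- an assumption, not confirmed
     * against the actual UCX source. */
    return (hi ^ lo) % NUM_BINS;
}
```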
------
### Section 3 - experimental design and setup
#### Hardware
- Astra supercomputer at Sandia National Laboratories
- Processors are 28-core Cavium ThunderX2 CN9975 chips operating at 2 GHz.
- The two processors within a node communicate over a 600 Gb/s cache-coherent interconnect
- 36 racks × 18 chassis per rack × 4 nodes per chassis = **2592 nodes** (each node has two processors, 5184 processors in total)
- The network is 4x EDR (100Gb/s) InfiniBand Mellanox ConnectX-5 hardware.
- Nodes communicate with the network interface via 8x PCIe 3 (16GT/s)
- The ConnectX-5 NIC provides hardware assistance for MPI message matching.
- Nodes are arranged in a three-level fat tree topology
Network performance on the cluster was evaluated by collecting data with a **modified version** of the OSU Micro-Benchmarks suite (OSU), run in two configurations (a code sketch of the prepost idea follows below):
> (i) testing different list lengths
> ==additional unmatched receives are posted== in advance of the performance testing
> (ii) matching the environment of a well-written application
> ==all receives are preposted== and the cache is cleared before each iteration.
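A rough illustration of what "posting additional unmatched receives" means in MPI terms (a sketch, not the actual OSU modification; the peer rank, queue depth, and dummy-tag choices are assumptions):

```c
#include <mpi.h>
#include <stdlib.h>

/* Prepost `depth` receives with tags the sender never uses, so every real
 * benchmark message must be searched past these entries in the posted
 * receive queue.  The modified OSU benchmark additionally controls whether
 * these dummy tags collide with the benchmark tag in the hash bins. */
static MPI_Request *prepost_unmatched(int depth, int peer, MPI_Comm comm)
{
    MPI_Request *reqs = malloc(depth * sizeof(MPI_Request));
    static char dummy;                 /* zero-byte receives, never completed */
    for (int i = 0; i < depth; i++) {
        int never_sent_tag = 100 + i;  /* assumed to be outside the benchmark's tag range */
        MPI_Irecv(&dummy, 0, MPI_BYTE, peer, never_sent_tag, comm, &reqs[i]);
    }
    return reqs;                       /* caller cancels/frees these after the timed loop */
}
```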
Env : `OpenMPI 3.1.3`, `GCC 7.2.0`
Message sizes : varied between `1B` and `1MiB`
Queue lengths : varied between `1` and `32k` tags
Each trial was performed with both hardware matching **enabled** and **disabled**.
Collision rates : `0%`, `1%`, `10%` and `100%`
Each trial was run 20 times
Run with 2 applications and 3 different matching techniques
- Applications : `LULESH 2.0` and `FDS`
1. software only
2. software with hardware
3. software with one bucket
------
### Section 4 - results
> Different numbers of unmatched receives are posted, increasing what we call the `receive queue depth`
> Collision rate : `0%`, `1%`, `10%` and `100%`
- Software Only Matching (baseline)
`Figure 1a` shows results from the modified OSU microbenchmarks bandwidth test with **no preposted unmatched receives and no hardware matching**.
This serves as a ==baseline== to test the software matching implementation.
There is a clear transition from eager to rendezvous at `16KiB` messages.
> ---
`Figure 1b` shows the same benchmark after **adding 1024 unmatched messages to the queue**.
Interestingly, after the transition to the `rendezvous message protocol` (like P2P), ==bandwidth does not vary significantly with collision rate==.
The latency inherent in the larger message sizes and the resulting lower message rate can ==hide the time spent message matching== even in more extreme cases of 100% collisions.
> ---
`Figure 1c` shows the bandwidth of a `4KiB` message across receive queue depths.
This data shows what we expect from binning systems; the performance is directly dependent on **the combination of queue depth and collision rate**.
Because the `effective search depth` is the number of entries in the selected bin, it can be estimated as *`queue_depth` ∗ `collision_rate`*.
This is reflected in the data by the ==decreased performance as collision rates increase== for the same queue depth.
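As a quick check on that formula: a queue depth of `1024` with a `10%` collision rate gives an expected effective search depth of roughly `1024 × 0.10 ≈ 102` entries per match attempt, while the same depth at `0%` collisions leaves the selected bin nearly empty.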
> ---
- Software + Hardware Matching
The performance gains of the new hardware are highly dependent on `message size`, with messages of approximately `1KiB` to `16 KiB` seeing improved performance from offloading tag matching to the NIC.
`Figure 2a` shows that messages of size `1KiB` to `16 KiB` had ==**significant increase** in bandwidth when **enabling hardware**==.
Outside of this range, however, there is actually a reduction in bandwidth; `32B` messages are ==**27.1% slower** with hardware enabled==.
For smaller messages, this reduction is likely due to overheads of hardware matching not being masked by overlap;
the additional latency from hardware message matching becomes the **performance bottleneck**.
> ---
`Figure 2b` shows that ==increased collision rate leads to a higher effective search depth.==
Comparing to the baseline above, `2KiB` to `16KiB` message sizes show a **performance increase ranging from 0.48% to 31.4%**.
As the collision rate increases, the performance in this range trends **downward**, with `16KiB` messages at 100% collision rate seeing an ==11.3% improvement==.
Outside of this range, the performance becomes worse with hardware matching enabled.
> future work : The performance dip that occurs between `16KiB` and `256KiB`
> ---
`Figure 2c` shows how hardware matching handles `4KiB` message with an increasing number of matching entries.
There is a ==significant performance drop== when the *expected value* for collisions becomes `1` (when queue depth ∗ collision rate = 1). This indicates that the benefit of hardware matching is **limited to** entries that are **first in their bucket**.
Once the queue depth becomes large enough that even a single collision is expected, ==hardware matching is quickly surpassed by software matching== performance and the variance between runs increases dramatically.
> ---
- Wild card Support
`MPI_ANY_TAG` : receive a message with any tag (minimal example below)
`Figure 7` shows that the hardware offload **does not appear to be impacted** by wild cards.
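For reference, a wildcard receive in standard MPI just passes `MPI_ANY_TAG` where a concrete tag would go (minimal illustrative example; the buffer and peer arguments are placeholders):

```c
#include <mpi.h>

/* Receive a message with any tag from `peer`; the tag that actually
 * matched is reported back in status.MPI_TAG. */
static void recv_any_tag(void *buf, int count, int peer, MPI_Comm comm)
{
    MPI_Status status;
    MPI_Recv(buf, count, MPI_BYTE, peer, MPI_ANY_TAG, comm, &status);
    /* status.MPI_TAG now holds the matched tag */
}
```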
> ---
#### Application
`LULESH` was selected as representative of **highly optimized HPC applications**. This class of applications reduces communication and message-processing overheads.
These applications ==do not see a runtime impact== from improved message matching.
`FDS` represents typical HPC applications, ones that use MPI correctly but may have **complex communication requirements**, suboptimal implementations, and/or scale sensitive communication patterns.
`Figure 9` shows that ==**hardware matching** greatly improves the application runtime== over UCX’s software implementation.
------
### Section 5 - discussion of the implications
The `software layer` is available through the Open UCX repository and utilizes a hash binning data structure containing 1021 bins.
The hash function takes the mod 1021 of the upper and lower 32 bits of the UCX tag and combines them using bitwise `xor`.
The `hardware layer` resides on the NIC, and performs UCX tag matching if a message is larger than a configurable threshold.
Small messages actually incur a penalty from hardware matching, while ==larger messages== (those over `1KiB`) see a ==benefit==.
This indicates that the `on-NIC` matching **takes more time** than matching on the `CPU`, but allows larger message sizes to see improved performance from overlapping matching with another task.
Additionally, we discovered that the `hardware matching layer` is ==**highly dependent** on the hash binning data structure==, and will only accelerate requests that are the first element in a bin.
> 1021 is a large number of bins, why worry about collision rate?
>
> Even though the individual collision rate is less than 0.1%, adding 100 random tags to the queue would result in an expected value of 9.4 collisions.
>
> More distinct messages -> the probability of collisions will ==greatly increase== (simulation sketch below).
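To see why the small per-pair probability still adds up, here is a back-of-the-envelope simulation (a sketch only: it uses uniformly random bins as a stand-in for the real UCX hash, and the exact figure depends on how a "collision" is counted, so it only lands in the same ballpark as the 9.4 quoted above):

```c
#include <stdio.h>
#include <stdlib.h>

#define NUM_BINS 1021

/* Throw `n` uniformly random tags into 1021 bins and count how many end up
 * sharing a bin with at least one other tag, averaged over `trials` runs. */
static double avg_colliding_tags(int n, int trials)
{
    long total = 0;
    for (int t = 0; t < trials; t++) {
        int count[NUM_BINS] = {0};
        int *bin = malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) {
            bin[i] = rand() % NUM_BINS;   /* uniform stand-in for the real hash */
            count[bin[i]]++;
        }
        for (int i = 0; i < n; i++)
            if (count[bin[i]] > 1)
                total++;                  /* this tag shares a bin with another */
        free(bin);
    }
    return (double)total / trials;
}

int main(void)
{
    printf("average colliding tags out of 100: %.2f\n", avg_colliding_tags(100, 20000));
    return 0;
}
```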
-------
### Section 6 - related work
#### Efficient MPI Message Matching
> CPU architectures
Manycore architectures can have an order of magnitude ==worse performance== than a traditional big-core out-of-order processor core.
> The length of an MPI match list
For a subset of scientific applications, processing unexpected messages can be a ==significant bottleneck== to performance
> ----
#### Matching Techniques
> Multithreaded hash table approach
Relies on a concurrent hash table design that is highly scalable. The concurrent nature of the hash table requires that **no wildcards** can be used in MPI messages at all.
-> Allows multiple threads to interact with MPI efficiently
> **hash-map keyed** to use the entire set of matching criteria
Allowed them to include wildcards in a hashing-based matching scheme.
Their design requires setting a user configurable number of bins to which message match requests are posted.
This approach seems to have solved major long list match performance issues, but the approach has a small overhead in the hash mapping for lists of any size, and therefore has ==higher overhead== than a traditional list when the match would be near the front of a traditional linked list.
> there has been work in allowing an MPI implementation to ==dynamically swap== between a hash-table and a traditional list
> Other solutions
1. using GPU
- there are no MPI implementations that run on GPU today
- It is also unknown what overheads exist for **short lists** or for the case of the **first match** element being the correct match
2. Decomposing the list into a 4-D lookup
- allows skipping portions of the list where the match cannot occur
3. Creating `hardware` designed to offload the matching process itself
- Portals communication API
- have been done in `FPGAs` and with `TCAMs`
- SeaStar interconnect
- offloaded Portals MPI matching on the NIC
- did so with a general-purpose CPU
------
### Section 7 - conclusions and future work
The OSU Benchmark results show that the ConnectX-5 is sensitive to both `message queue depth` and `tag collision rate`.
Demonstrated that hardware message matching increases performance for applications that send messages between `1KiB` and `16KiB`
Because the hardware and software layers for the ConnectX-5 matching scheme are network specific, ==their performance impacts are not portable across systems==.
For regular applications (not heavily optimized), a ==significant improvement== can be observed when leveraging software and hardware matching.
> Future work: how wildcard placement affects performance