# 11 Scaling - Challenges and Techniques

###### tags: `SS2021-IN2147-PP`

## Latest Trends

### Scaling and its Impact

* The **`level of concurrency`** (number of HW threads) is rising in HPC
    * Single thread performance no longer growing
    * Trend towards **`multicores`**
    * Trend towards **`highly parallel accelerators`**
    * Still **`rising number of nodes`**
* Concurrency needs to be **exploited in software**
    * `Problems` must allow being `split` into smaller pieces
        * `No dependencies`
        * `Fine-grained work distribution` with load balance
* New application requirements
    * Heterogeneous workloads
    * More complex workflows
* **New programming approaches** (next 2 lectures)
    * `Accelerators` require special treatment
    * `Task-based models` emerging as one option

### Top500 List as of June 2020

![](https://i.imgur.com/JlUXfAU.png =500x)

### Number of Cores Across the Top500 Machines

![](https://i.imgur.com/BWXn2et.png =500x)

## Scaling Types

![](https://i.imgur.com/2LR7Rq5.png =500x)

* Scaling in this case means `changes` in the `number of HW threads` used
* Need to **`adjust application and runtime parameters`** as we do this
    * Problem size
    * Problem decomposition
    * Useful to keep one characteristic constant
* Also, **`other tuning options`** may be necessary
    * Algorithmic knobs (due to changed conditions)
    * Runtime knobs (e.g., due to changed message sizes)
    * Thread vs. process balance
    * Job mapping to machine
* Basically running **`a different set of programs`**
    * Can expose new bugs/correctness issues
    * Can expose new performance issues

### Weak Scaling

* Keeping the **`problem size`** per HW thread/core/node **`constant`**
    * Larger machine -> `larger problem`
    * Expanding the problem domain
    * Increasing refinement
* Traditionally the **`most common`** way to deal with scaling in HPC
* Assumptions
    * Machines are never big enough
    * Fully loading a node/core/HW thread is the right thing to do
    * Exploiting the resources on one unit and keeping that constant
* Advantages
    * Execution properties (like `cache behavior`) often `stay fixed`
    * Easier to scale, as overheads stay roughly constant as well
* **Challenges**
    * `Keeping the load constant` can be `tricky` for `complex applications` (2D/3D problems)
    * Multiple ways to `repartition a workload`
* Example: **`Material Solidification Process`**
    * Molecular dynamics at LLNL with ddcMD: 2 million atom run (2005)
    * Used all of Blue Gene/L (128K cores)
    * Close to perfect scaling
    * New scientific observations
    * ![](https://i.imgur.com/0fnlL6x.png)

### Strong Scaling

* **`Keeping the total problem size constant`** (see the sizing sketch at the end of this section)
    * Larger machine -> (hopefully) **`faster`** execution
    * Need to `adjust the problem distribution`
* Traditionally not the most common type of scaling
    * But rapidly becoming more and more relevant
* Assumptions:
    * The machine is big enough for the problem
    * Goal: reducing time to solution and increasing throughput
* Needed for **`time critical tasks`**
    * Emergency response to natural disasters
    * Real-time simulations
    * Large-scale ensemble calculations
* Challenges
    * `Harder to scale`, as overheads grow relative to the shrinking per-HW-thread workload
    * `Changing execution characteristics` (cache miss rates, etc.)
* Example: Cardiac Simulation (BG/Q at LLNL)
    * Electrophysiology of the human heart
    * Close to real time heart simulation
    * Near cellular resolution
    * Parallel programming aspects
        * Strong scaling problem
        * Simulation on 1,600,000 cores
        * “Bare metal” programming
        * Achieved close to 12 Pflop/s
    * ![](https://i.imgur.com/SQxIKnp.png =300x)
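As a toy contrast between the two scaling modes, the per-rank problem size might be set up as sketched below. This is only an illustration: `LOCAL_N` and `GLOBAL_N` are made-up parameters, not taken from any of the applications above.

```c
// Illustrative sketch only: how the per-rank problem size differs between
// weak and strong scaling for a 1D domain. LOCAL_N and GLOBAL_N are made up.
#include <mpi.h>
#include <stdio.h>

#define LOCAL_N  (1L << 20)   /* weak scaling: fixed number of cells per rank */
#define GLOBAL_N (1L << 26)   /* strong scaling: fixed total number of cells  */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Weak scaling: work per rank stays constant, the global problem grows. */
    long weak_local  = LOCAL_N;
    long weak_global = LOCAL_N * size;

    /* Strong scaling: the global problem stays fixed, the per-rank share
       shrinks as ranks are added (and overheads grow in relative terms).   */
    long strong_local = GLOBAL_N / size + (rank < GLOBAL_N % size ? 1 : 0);

    if (rank == 0)
        printf("weak: %ld cells/rank (%ld total), strong: ~%ld cells/rank\n",
               weak_local, weak_global, strong_local);

    MPI_Finalize();
    return 0;
}
```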
## Programmability

### Impact on Programmability

* Things get harder at scale
    * Access to batch queues
    * Startup and finalization times
    * I/O access to files and startup information
* This includes **debugging**
    * `printf` becomes problematic
        * 1 printf on 1,000,000 HW threads = 13,600 pages of text at 8pt font
    * Interactive debugging no longer feasible
* Aim at reproducing bugs at small scale
    * Isolating kernels
    * Similar conditions by applying the right scaling
* But: in some cases we **need to debug at scale**
    * `Bugs` sometimes `don’t manifest` themselves `at small scale`
    * Size reduction and/or debugging additions change conditions

### Debugging at Scale

* **Traditional debugging** paradigm **does not scale well**
    * Huge quantities of symbol table and process state information
    * **`Too much information`** to present to the user
* First need to **`reduce the search space`** to a manageable **`subset`**
* Apply traditional **`debugging`** techniques **`on the subset`**

### How do we debug applications at O(1M)?!

#### Observation 1

> Parallel applications have a lot of processes **executing the same code**.

* Lightweight tools to quickly identify **`process equivalence classes`**
* Feed a **`representative of each class`** into a full-featured debugger

#### Observation 2

> **Stack traces** are a good indicator of process behavior

* Use **`stack traces`** to identify equivalence classes

#### Observation 3

> **Time varying behavior** provides additional insight

#### Stack Traces: the basis for STAT

![](https://i.imgur.com/Cb3z9tY.png =500x)

#### STAT complements traditional parallel debuggers

![](https://i.imgur.com/fVPhgFk.png =500x)

#### Need to Gather Data From All MPI Processes

![](https://i.imgur.com/Fo6Dp77.png =500x)

#### Scalable Communication is Necessary

![](https://i.imgur.com/8pItJ4Q.png =500x)
![](https://i.imgur.com/12HWdoC.png =500x)

#### Scalable Aggregation is Necessary

![](https://i.imgur.com/PlfhCY6.png =500x)

#### Real Use Case at > 1,000,000 Processes

![](https://i.imgur.com/2rw36zd.png =500x)

#### Other Real “War” Stories

![](https://i.imgur.com/uSC5ouy.png)

### Performance Analysis to Scale / at Scale

* **Typical performance issues** in a Distributed Memory System
    * Load imbalance; processes waiting for data
    * Large fraction of time spent in collective operations
    * Network and I/O contention
    * Non-optimal process placement & binding
    * Also sequential / threaded performance!!!
* Performance tool options are similar to those for shared memory
    * Profiling vs. Tracing
    * Sampling vs. Direct Instrumentation (a minimal interception sketch follows below)
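MPI profilers of the kind shown in the next subsection commonly rely on the MPI standard's PMPI profiling interface for their linker-based instrumentation. The following is only a rough sketch of that mechanism, not the actual implementation of any tool; the counters and the report format are invented for illustration.

```c
// Rough sketch of PMPI-based interception (the mechanism behind linker
// instrumentation); the statistics kept and the report are invented here.
#include <mpi.h>
#include <stdio.h>

static double send_time  = 0.0;  /* accumulated time spent in MPI_Send   */
static long   send_calls = 0;    /* number of intercepted MPI_Send calls */

/* Wrapper seen by the application: time the call, then forward it to the
   real implementation via the standard PMPI entry point.                 */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t0  = MPI_Wtime();
    int    err = PMPI_Send(buf, count, datatype, dest, tag, comm);
    send_time += MPI_Wtime() - t0;
    send_calls++;
    return err;
}

/* Report per-rank statistics when the application shuts MPI down. */
int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld MPI_Send calls, %.3f s total\n",
           rank, send_calls, send_time);
    return PMPI_Finalize();
}
```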
#### From Generic Messaging `Profiling`...

* **`mpiP`**: **Open source MPI profiling library**
    * Categories: linker instrumentation, profiling
    * Available from GitHub (recently migrated from SourceForge)
    * Portable across MPI libraries and system architectures
    * Available on most platforms, incl. IBM BG L/P/Q, Cray XT/XK/XE, clusters

```shell
-------------------------------------
@--- MPI Time (seconds) -------------
-------------------------------------
Task    AppTime    MPITime     MPI%
   0       9.78       1.97    20.12
   1       9.8        1.95    19.93
   2       9.8        1.87    19.12
   3       9.77       2.15    21.99
   *      39.1        7.94    20.29
-------------------------------------
```

* **Scaling Challenges** of **`Profiling`** itself
    * What to summarize and when?
    * Mapping back to the right code pieces?
* So we need **`Tracing`**...

#### From Generic Messaging `Tracing`...

* **`Vampir`**

![](https://i.imgur.com/eYdCNmo.png)

* **Scaling Challenges** of **`Tracing`** itself
    * Potentially **huge traces**
    * **Overhead** and **performance disturbance** (also, e.g., due to I/O)

### Variability and Consequences

#### Performance analysis is becoming statistics

* **Single runs are no longer sufficient** to understand performance
    * Already true for sequential execution
    * Especially true for large scale, highly parallel, shared environments
* Need to combine data from **several runs**
* Need to **understand `variability`** in the system and the results
* **Record and document** as much metadata as possible (static and dynamic)

#### Reading

* Scientific Benchmarking of Parallel Computing Systems
    * Torsten Hoefler and Roberto Belli, SC15

#### Some lessons

* Avoid summarizing ratios
    * Summarize the **`costs`** or rates the ratios are based on
* Report whether the measurement values are **`deterministic`**
    * For `non-deterministic` data, report `confidence intervals` of the measurement
* Do not assume normality of collected data without **`diagnostic checking`**
* If possible, show **`upper performance bounds`** to facilitate interpretability

## Impact on Algorithms/Data Structures

> Scaling can put **new pressures/requirements** on **`algorithms`** & **`data structures`**

* **Avoid anything** **`O(N)`** if at all possible
    * Data structures
        * Keep `O(N)` data **distributed**, **never replicated**
    * Loops
    * Data transfers
        * ~~MPI_Alltoall~~: not scalable
    * MPI routines with `O(N)` sized arguments
* **Create smaller subcommunicators** (see the `MPI_Comm_split` sketch at the end of this section)
    * ~~MPI_COMM_WORLD~~: avoid using all of MPI_COMM_WORLD
    * Localized communication
* Sometimes **new algorithms** are needed
    * Some algorithms are fine/ideal at small scale, but get worse at large scale
    * Larger setup costs and higher (constant) overheads are only amortized at scale
    * **`Dynamic adjustments/switching`** may help

### Example: Optimizing Load Balancing in AMR

* Adaptive Mesh Refinement (SAMRAI library)
    * Different levels of patches to refine in areas of interest
    * Requires active load balancing
* Load balancing shows bad scaling behavior
    * Dominates at large scale

![](https://i.imgur.com/A6auUY8.png =300x)

#### Timings in MPI rank space

* Per node timings for each phase
* ==Bottleneck is in **phase 1** and not phase 3==
* Limited correlation based on rank space
* ![](https://i.imgur.com/CeSkmtD.png =500x)

#### Map Performance Metrics onto Underlying Communication Graph

![](https://i.imgur.com/zj4VNpV.png =500x)

#### Visualizing Large Communication Graphs

* Display of individual nodes is not scalable
* Need to group nodes
    * Outer layers are best targets for this
    * Keep metric information / coloring
* ![](https://i.imgur.com/ucSmiwO.png)

### Performance Improvements

* Need to address the **flow problem**
    * Reduce traffic through the root
    * Box size / granularity is one of the knobs
* Ultimately need a new communication/balancing algorithm
* ![](https://i.imgur.com/7mYR64D.png)
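As referenced above, a minimal sketch of replacing communication over `MPI_COMM_WORLD` with smaller row/column subcommunicators via `MPI_Comm_split` is shown below. The grid width `NCOLS` is a made-up parameter; a real code (e.g., the FPMD example in the next section) would derive it from its matrix decomposition.

```c
// Illustrative sketch: replacing communication over MPI_COMM_WORLD with
// smaller row/column subcommunicators for a 2D process grid.
#include <mpi.h>

#define NCOLS 128   /* hypothetical number of process columns */

int main(int argc, char **argv)
{
    int rank;
    MPI_Comm row_comm, col_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int row = rank / NCOLS;   /* which process row this rank is in    */
    int col = rank % NCOLS;   /* which process column this rank is in */

    /* Ranks with the same "color" end up in the same subcommunicator;
       the "key" (here: position within the row/column) orders them.   */
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

    /* Collectives such as MPI_Bcast now involve only one row or column
       instead of all processes in MPI_COMM_WORLD.                      */
    double coeff = 0.0;
    MPI_Bcast(&coeff, 1, MPI_DOUBLE, 0, row_comm);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```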
## Mapping to Systems

### Impact on Mapping Codes to Machines

#### Example: FPMD Code on Blue Gene/L

* Material simulation using first principles
    * No empirical parameters
    * Iterative process
* **Communication Structure**
    * **`Dense matrix`** of wave function coefficients
    * Regular domain decomposition
    * **`Row/Column communication`**
    * Dominant communication: **`MPI_Bcast`**
* Mapping the matrix decomposition onto BG/L’s 3D torus
    * 65,536 nodes in a 64x32x32 torus
    * Problem split into 512 rows and 128 columns
    * Mapping of rows onto the target machine

#### Impact of Node Mappings on Performance

![](https://i.imgur.com/GTrl62s.png =500x)

### Why Are We Seeing this Performance Behavior?

#### Observation 1

* Need to optimize for **`both rows and columns`**
    * Take interference into account

#### Observation 2

* Need to optimize for **`bandwidth`**
    * Drive as many links as possible

#### Optimization process was manual

* Detecting communication patterns
* Trial and error mappings
* Explaining performance post mortem
* Iterative refinement

#### All runs had to be at scale

### Other Network Topologies

![](https://i.imgur.com/vCu2w4J.png)

### Dragonfly and its Properties

![](https://i.imgur.com/J9vpDHO.png)

### MPI Process Topologies

* Mapping MPI processes to the machine is a hard problem
    * Depends on the algorithm and the system
    * Often hard for programmers to do manually
* **`Topologies`** define an **`intuitive`** name space
    * Simplifies algorithm design
    * Provides MPI with topology information
    * Enables more efficient mappings within MPI
* **MPI supports**
    * Multidimensional grids and tori
    * Arbitrary graphs
    * Information attached to a communicator
    * Special creation functions
    * Ability to query topologies later on
* Note/Warning: these are often not the most optimized routines, but they are getting better

#### Cartesian Grid Topologies

```c
// Creates a new communicator with a Cartesian topology attached
int MPI_Cart_create(MPI_Comm comm_old, int ndims, int dims[], int periods[],
                    int reorder, MPI_Comm *comm_cart)

// Get a rank from a tuple of coordinates
int MPI_Cart_rank(MPI_Comm comm, int coords[], int *rank)

// Get a tuple of coordinates from a rank
int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int coords[])

// Get a send/receive pair for use with MPI_Sendrecv
int MPI_Cart_shift(MPI_Comm comm, int direction, int disp,
                   int *rank_source, int *rank_dest)
```

#### Example: Cartesian Topologies

```c
double buf1 = 0.0, buf2 = 0.0;
MPI_Comm comm_2d;
MPI_Status status;
int rank, source, dest;
int dims[2], periods[2], coords[2], reorder;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

dims[0] = 3;    dims[1] = 4;
periods[0] = 1; periods[1] = 1;   /* Torus, not Grid */
reorder = 0;

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &comm_2d);
MPI_Cart_coords(comm_2d, rank, 2, coords);
MPI_Cart_shift(comm_2d, 1, 1, &source, &dest);
MPI_Sendrecv(&buf1, 1, MPI_DOUBLE, dest,   42,
             &buf2, 1, MPI_DOUBLE, source, 42,
             comm_2d, &status);
```

![](https://i.imgur.com/E3zUu6h.png =500x)

### MPI Communicator From Graphs

```c
// Ability to specify localized communication using arbitrary graphs
int MPI_Graph_create(MPI_Comm comm_old, int nnodes, int index[], int edges[],
                     int reorder, MPI_Comm *comm_graph)
```

![](https://i.imgur.com/y6fuvtt.png =500x)

### MPI Communicator From Distributed Graphs

```c
// Ability to specify localized communication using arbitrary distributed graphs
int MPI_Dist_graph_create_adjacent(MPI_Comm comm_old,
                                   int indegree, int sources[], int sourceweights[],
                                   int outdegree, int destinations[], int destweights[],
                                   MPI_Info info, int reorder, MPI_Comm *comm_dist_graph)

// Note: graphs across all processes must be consistent
```

![](https://i.imgur.com/KTkE0vP.png)

### MPI Neighborhood Collectives

```c
// Enables communication localized to topology neighbors
MPI_Neighbor_allgather
MPI_Neighbor_allgatherv
MPI_Neighbor_alltoall
MPI_Neighbor_alltoallv
MPI_Neighbor_alltoallw
```
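A minimal sketch combining the two previous ideas, a distributed graph communicator plus a neighborhood collective, for a periodic 1D chain. The ring pattern and the exchanged payload are illustrative choices, not taken from the lecture.

```c
// Illustrative sketch: a periodic 1D chain expressed as a distributed graph,
// followed by a neighborhood collective on that communicator.
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Comm ring_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank receives from its left neighbor and sends to its right one. */
    int sources[1]      = { (rank - 1 + size) % size };
    int destinations[1] = { (rank + 1) % size };

    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   1, sources,      MPI_UNWEIGHTED,
                                   1, destinations, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0 /* no reorder */,
                                   &ring_comm);

    /* Exchange one value with the topology neighbors only
       (indegree of 1 here, so one value is received).       */
    int sendval = rank, recvval = -1;
    MPI_Neighbor_allgather(&sendval, 1, MPI_INT, &recvval, 1, MPI_INT, ring_comm);

    MPI_Comm_free(&ring_comm);
    MPI_Finalize();
    return 0;
}
```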
## I/O for Parallel Applications

### Impact on I/O

* I/O can take a **`majority of the execution time`** at scale
    * Reading of configuration files
    * Reading of input decks
    * Visualization dumps
    * Checkpointing
* Options
    * `POSIX I/O` directly to the file system
    * `MPI I/O` routines
    * Specialized libraries with special data formats like `HDF5`
* Things to look for
    * Using the right file system
    * Exploiting parallelism
        * Writing from multiple processes to one file (see the sketch at the end of this section)
        * Writing from multiple processes to many files
    * Metadata management performance

### Exploiting Parallelism in I/O

![](https://i.imgur.com/JaO8R8k.png)

* The best scenario depends on
    * Data size
    * Internal data distribution
    * Type of parallel file system
    * System installation (sizing of metadata servers)
* Requires **`benchmarking`** and **`experience`**, but be aware of **`variability`** (esp. for I/O)
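A minimal sketch of the "writing from multiple processes to one file" option using collective MPI I/O; the file name, block size, and payload below are made up for illustration.

```c
// Illustrative sketch: all ranks write their block of a shared file
// collectively, each at a disjoint offset.
#include <mpi.h>

#define BLOCK 1024   /* hypothetical number of doubles written per rank */

int main(int argc, char **argv)
{
    int rank;
    double buf[BLOCK];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < BLOCK; i++)
        buf[i] = rank;               /* dummy payload */

    /* All ranks open the same file and write their block collectively. */
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, BLOCK, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```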