# 11 Scaling - Challenges and Techniques
###### tags: `SS2021-IN2147-PP`
## Latest Trends
### Scaling and its Impact
* The **`level of concurrency`** (number of HW threads) is rising in HPC
* Single thread performance no longer growing
* Trend towards **`multicores`**
* Trend towards **`highly parallel accelerators`**
* Still **`rising number of nodes`**
* Concurrency needs to be **exploited in software**
* `Problems` need to be `splittable` into smaller pieces
* `No dependencies`
* `Fine-grained work distribution` with load balance
* New application requirements
* Heterogeneous workloads
* More complex workflows
* **New programming approaches** (next 2 lectures)
* `Accelerators` require special treatment
* `Task-based models` emerging as one option
### Top500 List as of June 2020

### Number of Cores Across the Top500 Machines

## Scaling Types

* Scaling in this case means `changes` in the `number of HW threads` used
* Need to **`adjust application and runtime parameters`** as we do this
* Problem size
* Problem decomposition
* Useful to keep one characteristic constant
* Also **`other tuning options`** may be necessary
* Algorithmic knobs (due to changed conditions)
* Runtime knobs (e.g., due to changed message sizes)
* Thread vs. process balance
* Job mapping to machine
* Basically running **`a different set of programs`**
* Can expose new bugs/correctness issues
* Can expose new performance issues
### Weak Scaling
* Keeping the **`problem size`** per HW thread/core/node **`constant`**
* Larger machine -> `larger problem`
* Expanding the problem domain
* Increasing refinement
* Traditionally **`most common`** way to deal with scaling in HPC
* Assumptions
* Machines are never big enough
* Fully loading a node/core/HW thread is the right thing to do
* Exploiting the resources on one unit and keeping that constant
* Advantages
* Execution properties (like `cache behavior`) often `stay fixed`
* Easier to scale, as overheads stay roughly constant as well
* **Challenges**
* `Keeping load constant` can be `tricky` for `complex applications` (2D/3D problems)
* Multiple ways to `repartition a workload`
* Example: **`Material Solidification Process`**
* Molecular dynamics at LLNL with ddcMD: 2 Million atom run (2005)
* Used all of Blue Gene/L (128K cores)
* Close to perfect scaling
* New scientific observations
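As a rough illustration (not from the slides; the 1D decomposition and sizes are arbitrary), a minimal MPI sketch of a weak-scaling setup, where the per-rank workload stays fixed and the global problem grows with the number of ranks:
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Weak scaling: the per-rank problem size is constant,
       so the global problem grows with the machine */
    const long local_n  = 1000000;            /* fixed work per rank (arbitrary) */
    const long global_n = local_n * nprocs;   /* total problem size */
    long first = (long)rank * local_n;        /* this rank owns [first, first+local_n) */

    if (rank == 0)
        printf("weak scaling: %d ranks, global problem size %ld\n", nprocs, global_n);

    /* ... allocate local_n elements and compute on them here ... */
    (void)first;

    MPI_Finalize();
    return 0;
}
```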
### Strong Scaling
* **`Keeping the total problem constant`**
* Larger machine -> (hopefully) **`faster`** execution
* Need to `adjust problem distribution`
* Traditionally not the most common type of scaling
* But becoming rapidly more and more relevant
* Assumptions:
* The machine is big enough for the problem
* Goal: reducing time to solution and increasing throughput
* Needed for **`time critical tasks`**
* Emergency response to natural disasters
* Real-time simulations
* Large-scale ensemble calculations
* Challenges
* `Harder to scale`, as overheads grow with smaller per HW thread workloads
* `Changing execution characteristics` (cache miss rates, etc.)
* Example: Cardiac Simulation (BG/Q at LLNL)
* Electrophysiology of the human heart
* Close to real time heart simulation
* Near cellular resolution
* Parallel programming aspects
* Strong scaling problem
* Simulation on 1,600,000 cores
* "Bare metal" programming
* Achieved close to 12 Pflop/s
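For contrast, a minimal sketch (again illustrative; names and sizes are arbitrary) of a strong-scaling decomposition, where the total problem size stays fixed and each rank's share shrinks as ranks are added:
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Strong scaling: the total problem size is fixed */
    const long global_n = 100000000;          /* fixed total problem size (arbitrary) */
    long local_n = global_n / nprocs;         /* per-rank share shrinks as ranks are added */
    long first   = rank * local_n;
    if (rank == nprocs - 1)                   /* last rank takes the remainder */
        local_n += global_n % nprocs;

    printf("rank %d: elements [%ld, %ld)\n", rank, first, first + local_n);

    MPI_Finalize();
    return 0;
}
```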
## Programmability
### Impact on Programmability
* Things get harder at scale
* Access to batch queues
* Startup and finalization times
* I/O access to files and startup information
* This includes **debugging**
* Printf becomes problematic
* 1 printf on 1,000,000 HW threads = 13,600 pages of text at 8pt font
* Interactive debugging no longer feasible
* Aim at reproducing bugs at small scale
* Isolating kernels
* Similar conditions by applying the right scaling
* But: in some cases we **need to debug at scale**
* `Bugs` sometimes `don’t manifest` themselves `at small scale`
* Size reduction and/or debugging additions change conditions
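One pragmatic mitigation for the printf problem mentioned above is to gate debug output to a single representative rank. A minimal sketch; the macro name `DBG_PRINTF` and the choice of rank 0 are illustrative conventions, not a standard facility:
```c
#include <mpi.h>
#include <stdio.h>

/* Only one representative rank emits debug output; everyone else is silent */
#define DEBUG_RANK 0

#define DBG_PRINTF(rank, ...)                         \
    do {                                              \
        if ((rank) == DEBUG_RANK) {                   \
            fprintf(stderr, "[rank %d] ", (rank));    \
            fprintf(stderr, __VA_ARGS__);             \
        }                                             \
    } while (0)

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    DBG_PRINTF(rank, "reached iteration setup\n");   /* printed once, not a million times */

    MPI_Finalize();
    return 0;
}
```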
### Debugging at Scale
* **Traditional debugging** paradigm **does not scale well**.
* Huge quantities of symbol table and process state information
* **`Too much information`** to present to the user
* First need to **`reduce search space`** to manageable **`subset`**
* Apply traditional **`debugging`** techniques **`on subset`**
### How do we debug applications at O(1M)?!
#### Observation 1
> Parallel applications have a lot of processes **executing the same code**.
* Lightweight tools to quickly identify **`process equivalence classes`**
* Feed a **`representative of each class`** into a full-featured debugger
#### Observation 2
> **Stack traces** are a good indicator of process behavior
* Use **`stack traces`** to identify equivalence classes
#### Observation 3
> **Time varying behavior** provides additional insight
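To make Observations 1 and 2 concrete, a toy sketch of grouping processes into equivalence classes by a stack-trace signature. The `get_signature` function is a stand-in (STAT derives real signatures from the actual call stacks and merges them over a tree network), and the flat `MPI_Gather` at rank 0 is only for illustration:
```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIG_LEN 64

/* Stand-in for a real stack-trace signature; here we just fake two behavior classes */
static void get_signature(int rank, char *sig) {
    if (rank % 2 == 0)
        snprintf(sig, SIG_LEN, "main>solver>MPI_Allreduce");
    else
        snprintf(sig, SIG_LEN, "main>io_phase>MPI_File_write");
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char sig[SIG_LEN];
    get_signature(rank, sig);

    /* Gather all signatures at rank 0 (a real tool uses a scalable tree network) */
    char *all = NULL;
    if (rank == 0) all = malloc((size_t)nprocs * SIG_LEN);
    MPI_Gather(sig, SIG_LEN, MPI_CHAR, all, SIG_LEN, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* Count equivalence classes; one representative per class would then
           be handed to a full-featured debugger */
        int classes = 0;
        for (int i = 0; i < nprocs; i++) {
            int seen = 0;
            for (int j = 0; j < i && !seen; j++)
                if (strcmp(&all[i * SIG_LEN], &all[j * SIG_LEN]) == 0) seen = 1;
            if (!seen) classes++;
        }
        printf("%d processes fall into %d equivalence classes\n", nprocs, classes);
        free(all);
    }

    MPI_Finalize();
    return 0;
}
```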
#### Stack Traces: the basis for STAT

#### STAT complements traditional parallel debuggers

#### Need to Gather Data From All MPI Processes

#### Scalable Communication is Necessary


#### Scalable Aggregation is Necessary

#### Real Use Case at > 1,000,000 Processes

#### Other Real “War” Stories

### Performance Analysis to Scale / at Scale
* **Typical performance issues** in a Distributed Memory System
* Load imbalance; Processes waiting for data
* Large fraction of time on collective operations
* Network and I/O contention
* Non-optimal process placement & binding
* Also sequential / threaded performance !!!
* Performance tool options are similar to those for shared memory
* Profiling vs. Tracing
* Sampling vs. Direct Instrumentation
#### From Generic Messaging `Profiling`...
* **`mpiP`**: **Open source MPI profiling library**
* Categories: linker instrumentation, profiling
* Available from github (recently migrated from sourceforge)
* Portable across MPI libraries and system architectures
* Available on most platforms, incl. IBM BG L/P/Q, Cray XT/XK/XE, Clusters
```shell
-------------------------------------
@--- MPI Time (seconds) -------------
-------------------------------------
Task    AppTime    MPITime     MPI%
   0       9.78       1.97    20.12
   1       9.8        1.95    19.93
   2       9.8        1.87    19.12
   3       9.77       2.15    21.99
   *      39.1        7.94    20.29
-------------------------------------
```
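Tools like mpiP hook into the application through the standard PMPI profiling interface: the tool defines the MPI entry points, records its measurements, and forwards to the real implementation via the `PMPI_` names. A simplified sketch of that mechanism (assuming an MPI-3 `MPI_Send` prototype; mpiP itself wraps many more routines and attributes time to call sites):
```c
#include <mpi.h>
#include <stdio.h>

/* Accumulated per-process statistics for one wrapped routine */
static double send_time  = 0.0;
static long   send_calls = 0;

/* The wrapper replaces MPI_Send at link time and forwards to PMPI_Send */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    double t0 = PMPI_Wtime();
    int err = PMPI_Send(buf, count, datatype, dest, tag, comm);
    send_time  += PMPI_Wtime() - t0;
    send_calls += 1;
    return err;
}

/* Wrapping MPI_Finalize gives a hook to report the collected profile */
int MPI_Finalize(void) {
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    fprintf(stderr, "rank %d: %ld MPI_Send calls, %.3f s\n",
            rank, send_calls, send_time);
    return PMPI_Finalize();
}
```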
* **Scaling Challenges** of **`Profiling`** itself
* What to summarize and when?
* Mapping back to the right code pieces?
* So we need **`Tracing`**...
#### From Generic Messaging `Tracing`...
* **`Vampir`**

* **Scaling Challenges** of **`Tracing`** itself
* Potentially **huge traces**
* **Overhead** and **performance disturbance** (also, e.g., due to I/O)
### Variability and Consequences
#### Performance analysis is becoming statistics
* **Single runs are no longer sufficient** to understand performance
* Already true for sequential execution
* Especially true for large scale, highly parallel, shared environments
* Need to combine data from **several runs**
* Need to **understand `variability`** in the system and the results
* **Record and document** as much metadata as possible (static and dynamic)
#### Reading
* Scientific Benchmarking of Parallel Computing Systems
* Torsten Hoefler and Roberto Belli, SC15
#### Some lessons
* Avoid summarizing ratios
* Summarize the **`costs`** or rates the ratios are based on
* Report if the measurement values are **`deterministic`**
* For `non-deterministic` data, report `confidence intervals` of the measurement.
* Do not assume normality of collected data without **`diagnostic checking`**.
* If possible, show **`upper performance bounds`** to facilitate interpretability
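As a reminder of what reporting confidence intervals means in practice (the formula below is standard statistics, not taken from the lecture): for $n$ repeated measurements with sample mean $\bar{x}$ and sample standard deviation $s$, a $(1-\alpha)$ confidence interval for the mean is

$$\bar{x} \;\pm\; t_{1-\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$$

This assumes approximately normal data, hence the lesson about diagnostic checking; quantile-based (nonparametric) intervals around the median avoid that assumption.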
## Impact on Algorithms/Data Structures
> Scaling can put **new pressures/requirements** on **`algorithms`** & **`data structures`**
* **Avoid anything** **`O(N)`** if at all possible
* Data structures
* Keep `O(N)` data **distributed**, **never replicated**
* Loops
* Data transfers
* ~~MPI_Alltoall~~: not scalable
* MPI routines with `O(N)` sized arguments
* **Create smaller subcommunicators**
* ~~MPI_COMM_WORLD~~
* Avoid using all of MPI_COMM_WORLD
* Localized communication
* Sometimes **new algorithms** are needed
* Some algorithms are fine/ideal at small scale, but get worse at large scale
* Larger setup and higher (constant) overhead amortized at scale
* **`Dynamic adjustments/switching`** may help
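A minimal sketch of the "smaller subcommunicators" advice using `MPI_Comm_split` (the grouping of ranks into rows of 16 is an arbitrary illustration), so that collectives stay localized instead of involving all of MPI_COMM_WORLD:
```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Group ranks into "rows" of 16; each row gets its own communicator */
    const int row_size = 16;
    int color = rank / row_size;              /* ranks with the same color share a comm */
    MPI_Comm row_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &row_comm);

    /* This collective now involves only the ranks of one row */
    double local = (double)rank, row_sum;
    MPI_Allreduce(&local, &row_sum, 1, MPI_DOUBLE, MPI_SUM, row_comm);

    MPI_Comm_free(&row_comm);
    MPI_Finalize();
    return 0;
}
```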
### Example: Optimizing Load Balancing in AMR
* Adaptive Mesh Refinement (SAMRAI library)
* Different levels of patches to refine in areas of interest
* Requires active load balancing
* Load balancing shows bad scaling behavior
* Dominates at large scale

#### Timings in MPI rank space
* Per node timings for each phase
* ==Bottleneck is in **phase 1** and not phase 3==
* Limited correlation based on rank space
#### Map Performance Metrics onto Underlying Communication Graph

#### Visualizing Large Communication Graphs
* Display of individual nodes is not scalable
* Need to group nodes
* Outer layers are best targets for this
* Keep metric information / coloring
### Performance Improvements
* Need to address **flow problem**
* Reduce traffic through root
* Box size / granularity is one of the knobs
* Ultimately need new communication/balancing algorithm
## Mapping to Systems
### Impact on Mapping Codes to Machines
#### Example: FPMD Code on Blue Gene/L
* Material simulation using first principles
* No empirical parameters
* Iterative process
* **Communication Structure**
* **`Dense matrix`** of wave function coefficients
* Regular domain decomposition
* **`Row/Column communication`**
* Dominant communication: **`MPI_Bcast`**
* Mapping matrix decomposition onto BG/L’s 3D torus
* 65,536 nodes in a 64x32x32 torus
* Problem split into 512 rows and 128 columns
* Mapping of rows onto target machine
#### Impact of Node Mappings on Performance

### Why Are We Seeing this Performance Behavior?
#### Observation 1
* Need to optimize for **`both row and columns`**
* Take interference into account
#### Observation 2
* Need to optimize for **`bandwidth`**
* Drive as many links as possible
#### Optimization process was manual
* Detecting communication patterns
* Trial and error mappings
* Explain performance post mortem
* Iterative refinement
#### All runs had to be at scale
### Other Network Topologies

### Dragonfly and its Properties

### MPI Process Topologies
* Mapping MPI processes to the machine is a hard problem
* Depends on algorithm and system
* Often hard for programmers to do manually
* **`Topologies`** define an **`intuitive`** name space
* Simplifies algorithm design
* Provides MPI with topology information
* Enables more efficient mappings within MPI
* **MPI supports**
* Multidimensional grids and tori
* Arbitrary graphs
* Information attached to a communicator
* Special creation functions
* Ability to query topologies later on
* Note/Warning: these are often not the most optimized routines, but getting better
#### Cartesian Grid Topologies
```c
// Creates a new communicator with Cartesian topology attached
int MPI_Cart_create(MPI_Comm comm_old, int ndims,
                    int dims[], int periods[], int reorder,
                    MPI_Comm *comm_cart);

// Get a rank from a tuple of coordinates
int MPI_Cart_rank(MPI_Comm comm, int coords[], int *rank);

// Get a tuple of coordinates from a rank
int MPI_Cart_coords(MPI_Comm comm, int rank,
                    int maxdims, int coords[]);

// Get a send/receive pair for use with MPI_Sendrecv
int MPI_Cart_shift(MPI_Comm comm, int direction,
                   int disp, int *rank_source, int *rank_dest);
```
#### Example: Cartesian Topologies
```c
double buf1 = 0.0, buf2;
MPI_Comm comm_2d;
MPI_Status status;
int rank, source, dest, reorder;
int dims[2], periods[2], coords[2];

dims[0] = 3; dims[1] = 4;
periods[0] = 1; periods[1] = 1;   /* Torus, not Grid */
reorder = 0;

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &comm_2d);
MPI_Comm_rank(comm_2d, &rank);
MPI_Cart_coords(comm_2d, rank, 2, coords);
MPI_Cart_shift(comm_2d, 1, 1, &source, &dest);
MPI_Sendrecv(&buf1, 1, MPI_DOUBLE, dest, 42,
             &buf2, 1, MPI_DOUBLE, source, 42, comm_2d, &status);
```

### MPI Communicator From Graphs
```c
// Ability to specify localized communication using arbitrary graphs
int MPI_Graph_create(MPI_Comm comm_old,
                     int nnodes, int index[], int edges[],
                     int reorder, MPI_Comm *comm_graph);
```

### MPI Communicator From Distributed Graphs
```c
// Ability to specify localized communication using arbitrary distributed graphs
int MPI_Dist_graph_create_adjacent(MPI_Comm comm_old,
                                   int indegree, int sources[],
                                   int sourceweights[],
                                   int outdegree, int destinations[],
                                   int destweights[],
                                   MPI_Info info, int reorder,
                                   MPI_Comm *comm_dist_graph);
// Note: graphs across all processes must be consistent
```
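A small illustrative sketch (the ring pattern is an assumption, not from the slides) of describing a 1D ring as a distributed graph, where each process only specifies its own neighbors:
```c
/* Inside an MPI program, after MPI_Init: each process lists only its
   own two ring neighbors */
int rank, nprocs;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

int left  = (rank - 1 + nprocs) % nprocs;
int right = (rank + 1) % nprocs;
int sources[2]      = { left, right };   /* ranks that send to me */
int destinations[2] = { left, right };   /* ranks I send to */

MPI_Comm ring_comm;
MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
    2, sources,      MPI_UNWEIGHTED,
    2, destinations, MPI_UNWEIGHTED,
    MPI_INFO_NULL, 0 /* reorder */, &ring_comm);
```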

### MPI Neighborhood Collectives
```c
// Enables communication localized to topology neighbors
MPI_Neighbor_allgather
MPI_Neighbor_allgatherv
MPI_Neighbor_alltoall
MPI_Neighbor_alltoallv
MPI_Neighbor_alltoallw
```
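As an illustration of how these map onto a topology (a made-up example using a 1D periodic Cartesian communicator, not taken from the slides), each rank gathers one value from each of its two neighbors with a single neighborhood collective:
```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* 1D periodic Cartesian topology: every rank has exactly 2 neighbors */
    int dims[1] = { nprocs }, periods[1] = { 1 };
    MPI_Comm ring;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &ring);

    /* Each rank contributes its own rank id and receives one value per
       neighbor (2*ndims values for a Cartesian topology) */
    int mine = rank, neighbors[2];
    MPI_Neighbor_allgather(&mine, 1, MPI_INT, neighbors, 1, MPI_INT, ring);

    printf("rank %d: left=%d right=%d\n", rank, neighbors[0], neighbors[1]);

    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}
```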
## I/O for Parallel Applications
### Impact on I/O
* I/O can take a **`majority share of the execution time`** at scale
* Reading of configurations files
* Reading of input decks
* Visualization dumps
* Checkpointing
* Options
* `Posix I/O` directly to the file system
* `MPI I/O` routines
* Specialized libraries with special data formats like `HDF5`
* Things to look for
* Using the right file system
* Exploiting parallelism
* Writing from multiple processes to one file
* Writing from multiple processes to many files
* Metadata management performance
### Exploiting Parallelism in I/O

* Best scenario depends on
* Data size
* Internal data distribution
* Type of parallel filesystem
* System installation (sizing of metadata servers)
* Requires **`benchmarking`** and **`experience`**, but be aware of **`variability`** (esp. for I/O)
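To make the "many processes writing to one shared file" option concrete, a minimal MPI I/O sketch (file name, block size, and data are arbitrary) in which each rank writes its block to one shared file at a rank-dependent offset using a collective call:
```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank fills a local block ... */
    enum { N = 1024 };
    double block[N];
    for (int i = 0; i < N; i++) block[i] = rank + i * 1e-6;

    /* ... and writes it to one shared file at a rank-dependent offset */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at_all(fh, offset, block, N, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```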