# 11 Scaling - Challenges and Techniques

###### tags: `SS2021-IN2147-PP`

## Latest Trends

### Scaling and its Impact

* The **`level of concurrency`** (number of HW threads) is rising in HPC
    * Single-thread performance is no longer growing
    * Trend towards **`multicores`**
    * Trend towards **`highly parallel accelerators`**
    * Still a **`rising number of nodes`**
* Concurrency needs to be **exploited in software**
    * `Problems` need to be able to be `split` into smaller pieces
        * `No dependencies`
        * `Fine-grained work distribution` with load balance
* New application requirements
    * Heterogeneous workloads
    * More complex workflows
* **New programming approaches** (next 2 lectures)
    * `Accelerators` require special treatment
    * `Task-based models` emerging as one option

### Top500 List as of June 2020

![](https://i.imgur.com/JlUXfAU.png =500x)

### Number of Cores Across the Top500 Machines

![](https://i.imgur.com/BWXn2et.png =500x)

## Scaling Types

![](https://i.imgur.com/2LR7Rq5.png =500x)

* Scaling in this case means `changes` in the `number of HW threads` used
* Need to **`adjust application and runtime parameters`** as we do this
    * Problem size
    * Problem decomposition
    * Useful to keep one characteristic constant
* Also **`other tuning options`** may be necessary
    * Algorithmic knobs (due to changed conditions)
    * Runtime knobs (e.g., due to changed message sizes)
    * Thread vs. process balance
    * Job mapping to machine
* Basically running **`a different set of programs`**
    * Can expose new bugs/correctness issues
    * Can expose new performance issues

### Weak Scaling

* Keeping the **`problem size`** per HW thread/core/node **`constant`**
    * Larger machine -> `larger problem`
        * Expanding the problem domain
        * Increasing refinement
* Traditionally the **`most common`** way to deal with scaling in HPC
* Assumptions
    * Machines are never big enough
    * Fully loading a node/core/HW thread is the right thing to do
        * Exploiting the resources on one unit and keeping that constant
* Advantages
    * Execution properties (like `cache behavior`) often `stay fixed`
    * Easier to scale, as overheads stay roughly constant as well
* **Challenges**
    * `Keeping the load constant` can be `tricky` for `complex applications` (2D/3D problems)
    * Multiple ways to `repartition a workload`
* Example: **`Material Solidification Process`**
    * Molecular dynamics at LLNL with ddcMD: 2 million atom run (2005)
    * Used all of Blue Gene/L (128K cores)
    * Close to perfect scaling
    * New scientific observations
    * ![](https://i.imgur.com/0fnlL6x.png)

### Strong Scaling

* **`Keeping the total problem constant`**
    * Larger machine -> (hopefully) **`faster`** execution
    * Need to `adjust the problem distribution`
* Traditionally not the most common type of scaling
    * But rapidly becoming more and more relevant
* Assumptions:
    * The machine is big enough for the problem
    * Goal: reducing time to solution and increasing throughput
* Needed for **`time-critical tasks`**
    * Emergency response to natural disasters
    * Real-time simulations
    * Large-scale ensemble calculations
* Challenges
    * `Harder to scale`, as overheads grow with smaller per-HW-thread workloads (see the speedup bounds below)
    * `Changing execution characteristics` (cache miss rates, etc.)
* Example: Cardiac Simulation (BG/Q at LLNL)
    * Electrophysiology of the human heart
        * Close to real-time heart simulation
        * Near cellular resolution
    * Parallel programming aspects
        * Strong scaling problem
        * Simulation on 1,600,000 cores
        * "Bare metal" programming
        * Achieved close to 12 Pflop/s
    * ![](https://i.imgur.com/SQxIKnp.png =300x)
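A compact way to contrast the two regimes (the standard textbook formalization, not from the slides; $s$ denotes the serial fraction and $p$ the number of HW threads): Amdahl's law bounds strong scaling, while Gustafson's law describes the scaled speedup of weak scaling:

$$
S_{\text{strong}}(p) = \frac{1}{s + (1-s)/p} \le \frac{1}{s},
\qquad
S_{\text{weak}}(p) = s + (1-s)\,p
$$

For example, $s = 0.01$ caps strong-scaling speedup at $100\times$ no matter how many cores are added, which is why overheads that grow relative to the shrinking per-thread workload dominate at scale.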
## Programmability

### Impact on Programmability

* Things get harder at scale
    * Access to batch queues
    * Startup and finalization times
    * I/O access to files and startup information
* This includes **debugging**
    * `printf` becomes problematic
        * 1 printf on 1,000,000 HW threads = 13,600 pages of text at 8pt font
    * Interactive debugging no longer feasible
* Aim at reproducing bugs at small scale
    * Isolating kernels
    * Similar conditions by applying the right scaling
* But: in some cases we **need to debug at scale**
    * `Bugs` sometimes `don't manifest` themselves `at small scale`
    * Size reduction and/or debugging additions change conditions

### Debugging at Scale

* The **traditional debugging** paradigm **does not scale well**
    * Huge quantities of symbol table and process state information
    * **`Too much information`** to present to the user
* First need to **`reduce the search space`** to a manageable **`subset`**
* Apply traditional **`debugging`** techniques **`on the subset`**

### How do we debug applications at O(1M)?!

#### Observation 1

> Parallel applications have a lot of processes **executing the same code**.

* Lightweight tools to quickly identify **`process equivalence classes`**
* Feed a **`representative of each class`** into a full-featured debugger

#### Observation 2

> **Stack traces** are a good indicator of process behavior

* Use **`stack traces`** to identify equivalence classes (a minimal sketch follows after this subsection)

#### Observation 3

> **Time varying behavior** provides additional insight

#### Stack Traces: the basis for STAT

![](https://i.imgur.com/Cb3z9tY.png =500x)

#### STAT complements traditional parallel debuggers

![](https://i.imgur.com/fVPhgFk.png =500x)

#### Need to Gather Data From All MPI Processes

![](https://i.imgur.com/Fo6Dp77.png =500x)

#### Scalable Communication is Necessary

![](https://i.imgur.com/8pItJ4Q.png =500x)
![](https://i.imgur.com/12HWdoC.png =500x)

#### Scalable Aggregation is Necessary

![](https://i.imgur.com/PlfhCY6.png =500x)

#### Real Use Case at > 1,000,000 Processes

![](https://i.imgur.com/2rw36zd.png =500x)

#### Other Real "War" Stories

![](https://i.imgur.com/uSC5ouy.png)
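The core idea behind Observations 1 and 2 can be sketched in a few lines of MPI. This is **not** STAT itself, just a hypothetical illustration (assuming a Linux/glibc system for `backtrace`): each process reduces its current call path to a single hash, rank 0 gathers one word per process, and identical hashes form candidate equivalence classes.

```c
#include <execinfo.h>  // glibc backtrace() -- assumption: Linux/glibc
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Hash the current call path (raw return addresses) into one number. */
static unsigned long stack_signature(void) {
    void *frames[32];
    int depth = backtrace(frames, 32);       // capture return addresses
    unsigned long h = 5381;                  // djb2-style hash over addresses
    for (int i = 0; i < depth; i++)
        h = h * 33 + (unsigned long)frames[i];
    return h;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    unsigned long sig = stack_signature();
    unsigned long *all = NULL;
    if (rank == 0) all = malloc(size * sizeof *all);

    /* Gather one word per process instead of full symbol tables. */
    MPI_Gather(&sig, 1, MPI_UNSIGNED_LONG, all, 1, MPI_UNSIGNED_LONG,
               0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* Count distinct signatures = candidate equivalence classes;
           a full debugger is then attached to one representative per class. */
        int classes = 0;
        for (int i = 0; i < size; i++) {
            int seen = 0;
            for (int j = 0; j < i; j++)
                if (all[j] == all[i]) { seen = 1; break; }
            if (!seen) classes++;
        }
        printf("%d processes -> %d stack equivalence classes\n", size, classes);
        free(all);
    }
    MPI_Finalize();
    return 0;
}
```

A real tool would additionally merge the traces into a prefix tree over a tree-based overlay network instead of a flat gather, which is exactly the scalable aggregation shown in the slides above.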
### Performance Analysis to Scale / at Scale

* **Typical performance issues** in a distributed memory system
    * Load imbalance; processes waiting for data
    * Large fraction of time spent in collective operations
    * Network and I/O contention
    * Non-optimal process placement & binding
    * Also sequential / threaded performance!
* Performance tool options are similar to shared memory
    * Profiling vs. Tracing
    * Sampling vs. Direct Instrumentation

#### From Generic Messaging `Profiling`...

* **`mpiP`**: **Open source MPI profiling library**
    * Categories: linker instrumentation, profiling
    * Available from GitHub (recently migrated from SourceForge)
    * Portable across MPI libraries and system architectures
    * Available on most platforms, incl. IBM BG L/P/Q, Cray XT/XK/XE, clusters

```shell
-------------------------------------
@--- MPI Time (seconds) -------------
-------------------------------------
Task    AppTime    MPITime     MPI%
   0       9.78       1.97    20.12
   1       9.8        1.95    19.93
   2       9.8        1.87    19.12
   3       9.77       2.15    21.99
   *      39.1        7.94    20.29
-------------------------------------
```

* **Scaling challenges** of **`profiling`** itself
    * What to summarize and when?
    * Mapping back to the right code pieces?
* So we need **`Tracing`**...

#### From Generic Messaging `Tracing`...

* **`Vampir`**
  ![](https://i.imgur.com/eYdCNmo.png)
* **Scaling challenges** of **`tracing`** itself
    * Potentially **huge traces**
    * **Overhead** and **performance disturbance** (also, e.g., due to I/O)

### Variability and Consequences

#### Performance analysis is becoming statistics

* **Single runs are no longer sufficient** to understand performance
    * Already true for sequential execution
    * Especially true for large-scale, highly parallel, shared environments
* Need to combine data from **several runs**
* Need to **understand `variability`** in the system and the results
* **Record and document** as much metadata as possible (static and dynamic)

#### Reading

* Scientific Benchmarking of Parallel Computing Systems
    * Torsten Hoefler and Roberto Belli, SC15

#### Some lessons

* Avoid summarizing ratios
    * Summarize the **`costs`** or rates the ratios are based on
* Report whether the measurement values are **`deterministic`**
    * For `non-deterministic` data, report `confidence intervals` of the measurement
* Do not assume normality of collected data without **`diagnostic checking`**
* If possible, show **`upper performance bounds`** to facilitate interpretability

## Impact on Algorithms/Data Structures

> Scaling can put **new pressures/requirements** on **`algorithms`** & **`data structures`**

* **Avoid anything** **`O(N)`** if at all possible
    * Data structures
        * Keep `O(N)` data **distributed**, **never replicated**
    * Loops
    * Data transfers
        * ~~MPI_Alltoall~~: not scalable
        * MPI routines with `O(N)`-sized arguments
* **Create smaller subcommunicators** (see the sketch after this list)
    * ~~MPI_COMM_WORLD~~: avoid using all of MPI_COMM_WORLD
    * Localized communication
* Sometimes **new algorithms** are needed
    * Some algorithms are fine/ideal at small scale, but get worse at large scale
    * Larger setup and higher (constant) overhead amortized at scale
    * **`Dynamic adjustments/switching`** may help
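As a concrete illustration of the subcommunicator point above (a minimal sketch of my own, not from the slides): `MPI_Comm_split` carves `MPI_COMM_WORLD` into small row communicators, so a collective only involves the ranks that actually need the data.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Hypothetical 2D decomposition: group ranks into rows of 4.
       All ranks with the same color land in the same subcommunicator. */
    int ranks_per_row = 4;
    int color = rank / ranks_per_row;   // row index
    int key   = rank % ranks_per_row;   // ordering within the row

    MPI_Comm row_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, key, &row_comm);

    /* A broadcast now only spans ranks_per_row processes,
       independent of the total process count N. */
    double row_data = (key == 0) ? rank * 1.0 : 0.0;
    MPI_Bcast(&row_data, 1, MPI_DOUBLE, 0, row_comm);
    printf("world rank %d: row %d received %.1f\n", rank, color, row_data);

    MPI_Comm_free(&row_comm);
    MPI_Finalize();
    return 0;
}
```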
### Example: Optimizing Load Balancing in AMR

* Adaptive Mesh Refinement (SAMRAI library)
    * Different levels of patches to refine areas of interest
    * Requires active load balancing
* Load balancing shows bad scaling behavior
    * Dominates at large scale

![](https://i.imgur.com/A6auUY8.png =300x)

#### Timings in MPI rank space

* Per-node timings for each phase
* ==Bottleneck is in **phase 1** and not phase 3==
* Limited correlation based on rank space
* ![](https://i.imgur.com/CeSkmtD.png =500x)

#### Map Performance Metrics onto the Underlying Communication Graph

![](https://i.imgur.com/zj4VNpV.png =500x)

#### Visualizing Large Communication Graphs

* Display of individual nodes is not scalable
* Need to group nodes
    * Outer layers are the best targets for this
    * Keep metric information / coloring
* ![](https://i.imgur.com/ucSmiwO.png)

### Performance Improvements

* Need to address the **flow problem**
    * Reduce traffic through the root
    * Box size / granularity is one of the knobs
* Ultimately need a new communication/balancing algorithm
* ![](https://i.imgur.com/7mYR64D.png)

## Mapping to Systems

### Impact on Mapping Codes to Machines

#### Example: FPMD Code on Blue Gene/L

* Material simulation using first principles
    * No empirical parameters
    * Iterative process
* **Communication structure**
    * **`Dense matrix`** of wave function coefficients
    * Regular domain decomposition
    * **`Row/column communication`**
    * Dominant communication: **`MPI_Bcast`**
* Mapping the matrix decomposition onto BG/L's 3D torus
    * 65,536 nodes in a 64x32x32 torus
    * Problem split into 512 rows and 128 columns
    * Mapping of rows onto the target machine

#### Impact of Node Mappings on Performance

![](https://i.imgur.com/GTrl62s.png =500x)

### Why Are We Seeing this Performance Behavior?

#### Observation 1

* Need to optimize for **`both rows and columns`**
    * Take interference into account

#### Observation 2

* Need to optimize for **`bandwidth`**
    * Drive as many links as possible

#### The optimization process was manual

* Detecting communication patterns
* Trial and error mappings
* Explaining performance post mortem
* Iterative refinement

#### All runs had to be at scale

### Other Network Topologies

![](https://i.imgur.com/vCu2w4J.png)

### Dragonfly and its Properties

![](https://i.imgur.com/J9vpDHO.png)

### MPI Process Topologies

* Mapping MPI processes to the machine is a hard problem
    * Depends on the algorithm and the system
    * Often hard for the programmer to do manually
* **`Topologies`** define an **`intuitive`** name space
    * Simplifies algorithm design
    * Provides MPI with topology information
    * Enables more efficient mappings within MPI
* **MPI supports**
    * Multidimensional grids and tori
    * Arbitrary graphs
* Information attached to a communicator
    * Special creation functions
    * Ability to query topologies later on
* Note/warning: these are often not the most optimized routines, but they are getting better

#### Cartesian Grid Topologies

```c
// Creates a new communicator with a Cartesian topology attached
int MPI_Cart_create(MPI_Comm comm_old, int ndims, int dims[],
                    int periods[], int reorder, MPI_Comm *comm_cart)

// Get a rank from a tuple of coordinates
int MPI_Cart_rank(MPI_Comm comm, int coords[], int *rank)

// Get a tuple of coordinates from a rank
int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int coords[])

// Get a send/receive pair for use with MPI_Sendrecv
int MPI_Cart_shift(MPI_Comm comm, int direction, int disp,
                   int *rank_source, int *rank_dest)
```

#### Example: Cartesian Topologies

```c
double buf1, buf2;
MPI_Comm comm_2d;
MPI_Status status;
int dims[2], periods[2], coords[2];
int rank, reorder, source, dest;

dims[0] = 3;    dims[1] = 4;
periods[0] = 1; periods[1] = 1;   /* Torus, not Grid */
reorder = 0;
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &comm_2d);
MPI_Comm_rank(comm_2d, &rank);
MPI_Cart_coords(comm_2d, rank, 2, coords);
MPI_Cart_shift(comm_2d, 1, 1, &source, &dest);
MPI_Sendrecv(&buf1, 1, MPI_DOUBLE, dest, 42,
             &buf2, 1, MPI_DOUBLE, source, 42, comm_2d, &status);
```

![](https://i.imgur.com/E3zUu6h.png =500x)

### MPI Communicator From Graphs

```c
// Ability to specify localized communication using arbitrary graphs
int MPI_Graph_create(MPI_Comm comm_old, int nnodes, int index[],
                     int edges[], int reorder, MPI_Comm *comm_graph)
```

![](https://i.imgur.com/y6fuvtt.png =500x)

### MPI Communicator From Distributed Graphs

```c
// Ability to specify localized communication using arbitrary distributed graphs
int MPI_Dist_graph_create_adjacent(MPI_Comm comm_old,
        int indegree, int sources[], int sourceweights[],
        int outdegree, int destinations[], int destweights[],
        MPI_Info info, int reorder, MPI_Comm *comm_dist_graph)

// Note: graphs across all processes must be consistent
```

![](https://i.imgur.com/KTkE0vP.png)

### MPI Neighborhood Collectives

```c
// Enable communication localized to topology neighbors
MPI_Neighbor_allgather
MPI_Neighbor_allgatherv
MPI_Neighbor_alltoall
MPI_Neighbor_alltoallv
MPI_Neighbor_alltoallw
```
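A hedged sketch of how a neighborhood collective combines with the Cartesian topology above (my example, not from the slides): on a periodic 2D grid, `MPI_Neighbor_allgather` exchanges one value with each of the four torus neighbors in a single call.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Same 3x4 torus as in the Cartesian example above;
       assumes a run with at least 12 ranks. */
    int dims[2] = {3, 4}, periods[2] = {1, 1};
    MPI_Comm comm_2d;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm_2d);

    if (comm_2d != MPI_COMM_NULL) {   // ranks beyond 3*4 get MPI_COMM_NULL
        int rank;
        MPI_Comm_rank(comm_2d, &rank);

        /* For a 2D Cartesian communicator the neighborhood is the four
           shift neighbors, ordered dimension by dimension (-,+). */
        double mine = (double)rank;
        double halo[4];               // one value from each neighbor
        MPI_Neighbor_allgather(&mine, 1, MPI_DOUBLE,
                               halo, 1, MPI_DOUBLE, comm_2d);

        printf("rank %d got neighbors: %.0f %.0f %.0f %.0f\n",
               rank, halo[0], halo[1], halo[2], halo[3]);
        MPI_Comm_free(&comm_2d);
    }
    MPI_Finalize();
    return 0;
}
```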
## I/O for Parallel Applications

### Impact on I/O

* I/O can take a **`majority of the execution`** share at scale
    * Reading of configuration files
    * Reading of input decks
    * Visualization dumps
    * Checkpointing
* Options
    * `POSIX I/O` directly to the file system
    * `MPI I/O` routines
    * Specialized libraries with special data formats like `HDF5`
* Things to look for
    * Using the right file system
    * Exploiting parallelism
        * Writing from multiple processes to one file (sketched below)
        * Writing from multiple processes to many files
    * Metadata management performance

### Exploiting Parallelism in I/O

![](https://i.imgur.com/JaO8R8k.png)

* The best scenario depends on
    * Data size
    * Internal data distribution
    * Type of parallel filesystem
    * System installation (sizing of metadata servers)
* Requires **`benchmarking`** and **`experience`**, but be aware of **`variability`** (esp. for I/O)
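To make the "multiple processes, one file" option concrete, here is a minimal MPI I/O sketch (my example with a hypothetical file name, not from the slides): every rank writes its block of doubles to a disjoint offset of one shared file with a collective call.

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank owns N doubles of the global array. */
    enum { N = 1024 };
    double local[N];
    for (int i = 0; i < N; i++) local[i] = rank + i * 1e-6;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "snapshot.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Disjoint offsets -> all ranks write one shared file in parallel.
       The collective *_all variant lets the MPI library aggregate
       requests (e.g., two-phase I/O) instead of hitting the filesystem
       with many small independent writes. */
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at_all(fh, offset, local, N, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```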
