# 7 Introduction to HPC Systems and Applications

###### tags: `SS2021-IN2147-PP`

## Architectures

### Distributed Memory Machines (Hardware)

* Separate compute nodes
  * Separate processors
  * Separate memories
* Connected through an explicit network
* Two types, depending on the `level of integration`
  * Full systems on each node
    * `Cheap` commodity boards
  * Streamlined blade architectures
    * Tight integration
    * e.g., Blue Gene/L

### Distributed Memory Machines

![](https://i.imgur.com/7jTFV4T.png)

## Networks

* Standard: **`Ethernet`**
  * `Cheap` and ubiquitous
  * Large software `overheads`
  * Networks of Workstations (`NOW`)
* Alternative: **`User-level Communication`**
  * Only the setup goes through the OS kernel
  * Direct communication from user level
    * Bypasses the OS path
  * Today: **`standard for cluster communication`**
    * Hidden from the user
  * **`VIA`** Architecture
    * Most established example: **`Infiniband`** (a minimal verbs sketch follows the *Consequences for Programmability* list below)
  * ![](https://i.imgur.com/jNih9gK.png =300x)

### Virtual Interface Architecture (VIA)

![](https://i.imgur.com/OY2oY1J.png)

### Network Topologies

![](https://i.imgur.com/OjbJiDt.png)

## Facilities

### Double Floors for Easier Installation

![](https://i.imgur.com/wQC5NxO.jpg)

### Energy Consumption in Data Centers

![](https://i.imgur.com/MzFCKGh.png)

### Cooling Setup for SuperMUC at LRZ

![](https://i.imgur.com/WviU13e.png)
![](https://i.imgur.com/NWZ0yGF.png)

### SuperMUC-NG Node

![](https://i.imgur.com/vrCCl6I.png)

### SuperMUC-NG Node Cooling

![](https://i.imgur.com/nbwiTkc.png)

### Why Is the Infrastructure Important?

* Determines **`data center overheads`**
  * Matching the average operating power consumption to the cooling infrastructure reduces overheads
* Infrastructure `limits the possible cooling technologies` for the HPC systems
  * A site set up only for air cooling does not allow an easy switch to water-cooled systems
* Trade-offs made to reduce overheads can become a source of additional costs later on
  * Switching off power conditioning to reduce overheads can allow brown-outs to shut down or damage system parts
* Mistakes made here `can be costly in the long run`
  * HPC systems are replaced every 3-5 years
  * Infrastructure is replaced every 10-20 years

## HPC Applications

### Wide Range of HPC Application Spaces

> Predictive Simulation has become a key capability

* Climate modeling
* Weather forecasting
* Nuclear Physics
* Oil and Gas
  * Reservoir modeling
* Bioscience, Medicine
  * Genomic research
* Material Science
* Automobile/Aeronautics Industry
  * CFD
  * Virtual crash tests
* City planning
* Graph analysis
  * Security applications
* Finance

### Astrophysics: Simulation of Galaxies

![](https://i.imgur.com/eOoQ4RV.png)

### Material Solidification Process

![](https://i.imgur.com/O4SGjBe.png)

### Cardiac Simulation (BG/Q at LLNL)

![](https://i.imgur.com/MlloOph.png)

## HPC Environments

* `Loose coupling` by only managing node allocation
  * Separate OS instances
* `Tight coupling` by managing all resources globally
  * Single System Image

### HPC Ecosystem

* `Simplified compute nodes`
  * Headless with **no graphics support**
  * **No local disk**
* HPC is more than just compute nodes
  * Head/Login/Compilation nodes
  * System nodes
    * RAS (Reliability, Availability, Serviceability) components
    * Resource manager
  * Storage system
    * Parallel file systems
      * Often driven by dedicated I/O nodes
    * Tape archive
  * Visualization systems

### Consequences for Programmability

* **Problems need to be manually partitioned and distributed**
  * Data has to be managed separately in the different memories
* **Communication must be programmed explicitly**
  * Use of dedicated communication libraries (see the MPI sketch below)
* **Fault tolerance, fault handling and debugging are more complex**
  * Have to reason about distributed state
* **Large-scale MPP systems often have slightly different OS environments**
  * Reduced services to minimize noise
* **Login/Compile/Compute nodes may be different**
  * Requires "cross-compilation"
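To make the explicit partitioning and communication concrete, here is a minimal sketch using MPI, a typical choice among the dedicated communication libraries mentioned above (the notes do not prescribe a specific library). Rank 0 sends a hand-partitioned slice of data to rank 1, which receives it into its own, separate memory; the array size and message tag are arbitrary illustration values.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // needs at least 2 ranks

    // Each rank owns only its local slice: there is no shared memory,
    // so the partitioning of the problem data is done by hand.
    double local[4] = {0.0};

    if (rank == 0) {
        double work[4] = {1.0, 2.0, 3.0, 4.0};
        // Explicitly ship the slice to rank 1 (tag 0, arbitrary).
        MPI_Send(work, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // Receive into rank 1's separate local memory.
        MPI_Recv(local, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %.1f ... %.1f\n", local[0], local[3]);
    }

    MPI_Finalize();
    return 0;
}
```

Such a program is typically compiled on a login node with a wrapper like `mpicc` (cross-compiling if compute and login nodes differ) and started on the allocated nodes through the batch system, e.g. with `srun` or `mpirun`.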
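Relating back to user-level communication from the Networks section: the point of the VIA/verbs model is that memory is registered with the network adapter once, through the kernel, after which transfers can be issued directly from user space, bypassing the OS path. Below is a minimal sketch using the InfiniBand verbs API (libibverbs) that only opens a device and registers a buffer; queue-pair setup and the actual `ibv_post_send`/`ibv_post_recv` calls are omitted, and the buffer size and access flags are illustration choices.

```c
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void) {
    int num_devices = 0;
    // Enumerate the RDMA-capable devices visible to this process.
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    // Open the first device; all further verbs calls use this context.
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) return 1;

    // Allocate a protection domain and register a user buffer with it.
    // Registration pins the memory and hands its translation to the
    // adapter once, via the kernel; later transfers bypass the OS.
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);

    // Completion queues, queue pairs and the actual data transfers
    // would follow here; omitted for brevity.

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```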
### Access to HPC Systems

![](https://i.imgur.com/H98eSGo.png)

* Job/resource management system
  * Once resources are free, the job gets scheduled
* Batch Systems
  * SLURM Workload Manager
    * Simple Linux Utility for Resource Management
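As a small illustration of how a scheduled job sees its allocation, the sketch below reads a few environment variables that SLURM exports to every task. The variable names are standard SLURM ones; the program itself is only an assumed example and prints fallbacks when run outside a job.

```c
#include <stdio.h>
#include <stdlib.h>

// Print a SLURM-provided environment variable, or a fallback when the
// program is not running inside a SLURM allocation.
static void show(const char *name) {
    const char *value = getenv(name);
    printf("%-14s = %s\n", name, value ? value : "(not set)");
}

int main(void) {
    show("SLURM_JOB_ID");   // job identifier assigned by the scheduler
    show("SLURM_NNODES");   // number of nodes in the allocation
    show("SLURM_NTASKS");   // total number of tasks in the job step
    show("SLURM_PROCID");   // this task's rank within the job step
    return 0;
}
```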