# 7 Introduction to HPC Systems and Applications
###### tags: `SS2021-IN2147-PP`
## Architectures
### Distributed Memory Machines (Hardware)
* Separate compute nodes
* Separate processors
* Separate memories
* Connected through an explicit network
* Two types, depending on the `level of integration`
* Full systems on each node
* `Cheap` commodity boards
* Streamlined blade architectures
* Tight integration
* e.g., Blue Gene/L
### Distributed Memory Machines

## Networks
* Standard: **`Ethernet`**
* `Cheap` and ubiquitous
* Large software `overheads`
* Networks of Workstations (`NOW`)
* Alternative: **`User-level Communication`**
* Only the setup goes through the OS kernel
* Direct communication from user-level
* Bypass OS path
* Today: **`standard for cluster communication`**
* Hidden from the user
* **`VIA`** Architecture
* Most established example: **`InfiniBand`** (see the sketch after this list)
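
To make the split between kernel-mediated setup and user-level data transfer concrete, here is a minimal sketch using the `libibverbs` API (the user-level interface to InfiniBand hardware). Picking device 0, the buffer size, and the omission of queue-pair setup are simplifications for illustration only; error handling is abbreviated.

```c
/* Minimal user-level communication setup sketch with libibverbs.
 * Build: gcc via_setup.c -libverbs   (requires an RDMA-capable NIC) */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Setup phase: these calls go through the OS kernel. */
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no RDMA device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (!ctx || !pd) { fprintf(stderr, "device setup failed\n"); return 1; }

    /* Registering memory pins it and gives the NIC a translation,
     * so later transfers can bypass the kernel entirely. */
    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { fprintf(stderr, "registration failed\n"); return 1; }
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    /* Queue pairs would now be created and send/RDMA operations
     * posted directly from user space (omitted in this sketch). */
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

Once memory regions and queue pairs are set up, data movement no longer involves a system call, which is where the large software overheads of the kernel network stack are avoided.
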
### Virtual Interface Architecture (VIA)

### Network Topologies

## Facilities
### Double Floors for Easier Installation

### Energy Consumption in Data Centers

### Cooling Setup for SuperMUC at LRZ

### SuperMUC-NG Node

### SuperMUC-NG Node Cooling

### Why Is the Infrastructure Important?
* Determines **`data center overheads`**
* Matching average operating power consumption with cooling infrastructure will reduce overheads
* Infrastructure `limits the possible cooling technologies` for the HPC systems
* Being set up only for air cooling will not allow an easy switch to water-cooled systems
* Trade-offs made to reduce overheads can become a source of additional costs later on
* Switching off power conditioning to reduce overheads can allow brown-outs to shut down or damage system parts
* Mistakes made here `can be costly in the long run`
* HPC system replaced every 3-5 years
* Infrastructure replaced every 10-20 years
## HPC Applications
### Wide Range of HPC Application Spaces
> Predictive Simulation has become a key capability
* Climate modeling
* Weather forecasting
* Nuclear Physics
* Oil and Gas
* Reservoir modeling
* Bioscience, Medicine
* Genomic research
* Material Science
* Automobile/Aeronautics Industry
* CFD
* Virtual crash tests
* City planning
* Graph analysis
* Security applications
* Finance
### Astrophysics: Simulation of Galaxies

### Material Solidification Process

### Cardiac Simulation (BG/Q at LLNL)

## HPC Environments
* `Loose coupling` by only managing node allocation
* Separate OS instances
* `Tight coupling` by managing all resources globally
* Single System Image
### HPC Ecosystem
* `Simplified compute nodes`
* Headless with **no graphics support**
* **No local disk**
* HPC is more than just compute nodes
* Head/Login/Compilation nodes
* System nodes
* RAS (Reliability, Availability, Serviceability) components
* Resource manager
* Storage system
* Parallel file systems
* Often driven by dedicated I/O nodes
* Tape archive
* Visualization systems
### Consequences for Programmability
* **Problems need to be manually partitioned and distributed**
* Data has to be managed separately in the different memories
* **Communication must be programmed explicitly**
* Use of dedicated communication libraries (e.g., MPI; see the sketch after this list)
* **Fault tolerance, fault handling and debugging are more complex**
* Have to reason about distributed state
* **Large-scale MPP systems often have slightly different OS environments**
* Reduced services to minimize noise
* **Login/Compile/Compute nodes may be different**
* Requires “cross-compilation”
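
As a minimal illustration of explicit partitioning and communication, here is a sketch using MPI (the de-facto standard communication library on such systems); the specific decomposition below is just an example, not taken from the lecture. Each rank works on its own data in its own address space, and results are gathered by hand:

```c
/* hello_exchange.c
 * build: mpicc hello_exchange.c -o hello_exchange
 * run:   mpirun -np 4 ./hello_exchange */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Manual partitioning: each rank owns only its local value;
     * no rank can see another rank's memory directly. */
    int local = rank * rank;

    if (rank == 0) {
        /* Explicit communication: rank 0 collects from everyone else. */
        for (int src = 1; src < size; src++) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank %d\n", value, src);
        }
    } else {
        MPI_Send(&local, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

Every byte that moves between nodes does so through such explicit calls, which is also why reasoning about distributed state (for debugging and fault handling) is harder than in a shared-memory program.
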
### Access to HPC Systems

* Job/resource management system
* Once resources are free, the job gets scheduled
* Batch systems (see the job script sketch after this list)
* SLURM Workload Manager
* Simple Linux Utility for Resource Management
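
Below is a minimal sketch of a SLURM batch script; the partition name, node/task counts, time limit, and executable name are placeholder assumptions, not values from the lecture.

```bash
#!/bin/bash
#SBATCH --job-name=demo           # name shown in the queue
#SBATCH --nodes=2                 # requested compute nodes (placeholder)
#SBATCH --ntasks-per-node=4       # MPI ranks per node (placeholder)
#SBATCH --time=00:10:00           # wall-clock limit
#SBATCH --partition=test          # hypothetical partition name

# The batch system runs this script on the first allocated node;
# srun then launches the parallel tasks across the allocation.
srun ./hello_exchange
```

Submitted with `sbatch job.sh`, the job waits in the queue and is started by the scheduler once the requested resources are free, exactly as described above.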