# 0 My Preview
###### tags: `SS2021-IN2147-PP`
## Parallelism for Performance
### Processor
* **Bit-level**: up to 128 bit
* **Instruction-level**: pipelining, functional units, vectorization
* **Latency** becomes very important: branch prediction
* Toleration of latency
### Memory
* Multiple **memory banks**
### IO
* Hardware **DMA**, **RAID** arrays
### Multiple processors
## Classification

### Parallel systems
* Parallel computers
    * **SIMD** (Single Instruction Multiple Data)
        * Synchronized execution of the same instruction on a set of data
    * **MIMD** (Multiple Instruction Multiple Data)
        * Asynchronous execution of different instructions
* M. Flynn, Very High-Speed Computing Systems, Proceedings of the IEEE, 54, 1966
### MIMD Computers
* **Distributed Memory** - DM (multicomputer)
    * Building blocks are nodes with a **private physical address space**. Communication is based on **messages**.
* **Shared Memory** - SM (multiprocessor)
    * The system provides a **shared address space**. Communication is based on **read/write operations** to global addresses.
### Shared Memory
* **Uniform Memory Access** – UMA (symmetric multiprocessors - SMP)
    * Centralized shared memory; accesses to global memory from **all processors** have the **same latency**.
* **Non-Uniform Memory Access** - NUMA (Distributed Shared Memory systems - DSM)
    * Memory is distributed among the nodes; **local accesses** are much **faster** than remote accesses.
## Parallel architectures - SuperMUC
* SuperMUC @ Leibniz Supercomputing Centre (LRZ)

* SuperMUC in TOP500

* Peak performance: $3$ PetaFlops $= 3 \cdot 10^{15}$ Flops
## Distributed Memory Architecture
* **18 partitions** called islands with **512 nodes** per partition
* A **node** is a **shared memory system** with **2 processors**
    * **Sandy Bridge-EP** Intel Xeon E5-2680 8C
    * 2.7 GHz (Turbo 3.5 GHz)
    * 32 GByte memory
    * **Infiniband** network interface
* Each processor has **8 cores**
    * 2-way **hyperthreading**
    * 21.6 GFlops @ 2.7 GHz per core
    * 172.8 GFlops per processor
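
A quick sanity check of these peak numbers (assuming the 8 double-precision Flops per cycle per core delivered by the AVX units of Sandy Bridge):

$$
\begin{aligned}
\text{per core:} &\quad 2.7\ \text{GHz} \times 8\ \tfrac{\text{Flops}}{\text{cycle}} = 21.6\ \text{GFlops} \\
\text{per processor:} &\quad 8\ \text{cores} \times 21.6\ \text{GFlops} = 172.8\ \text{GFlops} \\
\text{system:} &\quad 18 \times 512 \times 2 \times 172.8\ \text{GFlops} \approx 3.2\ \text{PFlops}
\end{aligned}
$$

This is consistent with the ~3 PetaFlops peak performance quoted above.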
## Sandy Bridge Processor

* L3 cache
    * Partitioned with **cache coherence** based on **core valid bits** (via QPI)
    * **Physical addresses** distributed by a **hash** function
## NUMA Node

* 2 processors with **32 GB of memory** (8 × 4 GB)
* **Aggregate memory bandwidth** per node: 102.4 GB/s
* Latency
    * local: ~50 ns (~135 cycles @ 2.7 GHz)
    * remote: ~90 ns (~240 cycles)
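
The cycle counts follow directly from the 2.7 GHz clock:

$$
50\ \text{ns} \times 2.7\ \tfrac{\text{cycles}}{\text{ns}} = 135\ \text{cycles},
\qquad
90\ \text{ns} \times 2.7\ \tfrac{\text{cycles}}{\text{ns}} = 243 \approx 240\ \text{cycles}
$$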
## Interconnection Network

* Infiniband FDR-10
    * FDR means **Fourteen Data Rate**
    * FDR-10 has an effective data rate of 38.79 Gbit/s
    * Latency: 100 ns per switch, 1 µs for MPI
    * Vendor: Mellanox
* Intra-island topology: **non-blocking** tree
    * 256 communication pairs can talk **in parallel**
* Inter-island topology: **pruned tree** 4:1
    * 128 links per island to the next level
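
These numbers fit together: within an island, the 512 nodes can communicate as 256 disjoint pairs, and the 4:1 pruning is the ratio of an island's node links to its uplinks:

$$
\frac{512\ \text{nodes}}{2} = 256\ \text{pairs},
\qquad
\frac{512\ \text{node links}}{128\ \text{uplinks}} = 4:1
$$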
### VIA and Infiniband

* **Standardized** user-level network interface
* Specification of the **software interface**, not the NIC implementation
    * **Can be fully implemented** in the **hardware NIC**, or major parts of the protocol processing can be on-loaded onto the host processor
* Allows **bypassing** the OS on the **data path**
    * Consumers acquire one or more virtual interfaces via the kernel agent (control path)
* **Efficient sharing** of NICs
    * Getting more important with multicore processors
### MPI Performance
* Performance plots: IBM MPI over Infiniband vs. IBM MPI over Ethernet
### 9288 Compute Nodes

* Cold corridor
    * Infiniband (red) cabling
    * Ethernet (green) cabling
### Infiniband Interconnect

* 19 Orcas, 126 spine switches
* 11900 Infiniband Cables
## IO System

## SuperMUC Power Consumption

## CPU frequency and RAM setting

## SuperMUC - Costs

## Energy Capping in Contract with IBM
* **New** funding scheme: energy included
* **Contract includes energy cost for 5 years**
* Power consumption of the system varies between 1 and 3 MW, depending on application usage
* The contract is based on the energy consumed in a benchmark suite agreed between IBM and LRZ
## IBM iDataplex dx360 M4: Water cooling
* Heat flux > 90% to water; very low chilled water requirement
* Power advantage over air-cooled node
    * warm-water cooled: ~10% (cold-water cooled: ~15%)
    * due to lower $\mathrm{T_{components}}$ and no fans
* Typical operating conditions
    * $\mathrm{T_{air}} = 25 - 35^{\circ}\mathrm{C}$
    * $\mathrm{T_{water}} = 18 - 45^{\circ}\mathrm{C}$
## IBM System x iDataPlex Rack

## 9288 Compute Nodes

* Warm Corridor
* Warm Water
* Cooling Pipes
* Matthias Brehm, Herbert Huber, LRZ High Performance Systems Division
## SuperMIC

* Intel Xeon Phi cluster
* 32 nodes
    * 2 Intel Xeon Ivy Bridge processors E5-2650
        * 8 cores each
        * 2.6 GHz clock frequency
    * 2 Intel Xeon Phi 5110P coprocessors
        * 60 cores @ 1.1 GHz
* Memory
    * 64 GB host memory
    * 2 × 8 GB on the Xeon Phi coprocessors
## Intel Xeon Phi - Knights Corner

## Nodes with Coprocessors

## The Compute Cube of LRZ

## Run jobs in batch
* Advantages
    * Reproducible performance
    * Run larger jobs
    * No need to interactively poll for resources
* Test queue
    * Max 1 island, 32 nodes, 2 h, 1 job in queue
* General queue
    * Max 1 island, 512 nodes, 48 h
* Large queue
    * Max 4 islands, 2048 nodes, 48 h
* Special queue
    * Max 18 islands …
* Job script (see the sketch below)

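A minimal LoadLeveler job script along these lines (a sketch only: the job name, wall-clock limit, environment setup, and the executable `./myprog` are placeholders, and the exact keywords and launch command (`mpiexec`/`poe`) should be checked against the LRZ LoadLeveler documentation linked below):

```bash
#!/bin/bash
# LoadLeveler job command file: directives start with #@ , the rest is shell.
# Job type and queue (test queue: max 1 island, 32 nodes, 2 h).
#@ job_type = parallel
#@ class = test
# 2 nodes with 2 MPI tasks each -> 4 tasks in total.
#@ node = 2
#@ tasks_per_node = 2
# Keep the wall-clock limit as tight as possible (hh:mm:ss).
#@ wall_clock_limit = 00:30:00
# Job name and stdout/stderr files (placeholders).
#@ job_name = pp_test
#@ output = job_$(jobid).out
#@ error = job_$(jobid).err
# Disable automatic core-frequency adaptation for performance measurements.
#@ energy_policy_tag = NONE
#@ queue

# Set up the environment on the compute node(s).
. /etc/profile
. /etc/profile.d/modules.sh

# Launch the MPI program (placeholder executable).
mpiexec -n 4 ./myprog
```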
## Limited CPU Hours available
* Please
    * Specify job execution limits as tightly as possible.
    * Do not request more nodes than required. We have to "pay" for all allocated cores, not only the used ones.
    * SHORT (< 1 sec) sequential runs can be done on the login node.
    * Even **SHORT OMP** runs can be done on the login node.
## Login to SuperMUC, Documentation
* First change the standard password
    * https://idportal.lrz.de/r/entry.pl
* Login via
    * lxhalle, due to the restriction on connecting machines
    * ssh \<userid\>@supermuc.lrz.de
    * No outgoing connections allowed
* Documentation
    * http://www.lrz.de/services/compute/supermuc/
    * http://www.lrz.de/services/compute/supermuc/loadleveler/
    * Intel compiler: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/index.htm
## Batch Script Parameters
* `#@ energy_policy_tag = NONE`
    * Switches off the automatic adaptation of the core frequency, for performance measurements
* `#@ node = 2`
* `#@ total_tasks = 4`
* `#@ task_geometry = {(0,2) (1,3)}`
* `#@ tasks_per_node = 2`
* Limitations on combinations are documented on the LRZ web page
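
As a complement to the full script sketched earlier, a hedged fragment showing `task_geometry`: with the geometry above, tasks 0 and 2 share one node and tasks 1 and 3 share another. It specifies the task-to-node mapping directly and is typically used instead of the `node`/`tasks_per_node` keywords; the exact combination rules are the limitations referred to above. Class and wall-clock limit are placeholders.

```bash
# Fragment of a LoadLeveler job command file using an explicit task-to-node mapping.
#@ job_type = parallel
#@ class = test
#@ wall_clock_limit = 00:10:00
# Tasks 0 and 2 run on the first node, tasks 1 and 3 on the second node.
#@ task_geometry = {(0,2) (1,3)}
#@ energy_policy_tag = NONE
#@ queue

# Launch 4 MPI tasks (placeholder executable).
mpiexec -n 4 ./myprog
```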