# 0 My Preview

###### tags: `SS2021-IN2147-PP`

## Parallelism for Performance

### Processor
* **Bit-level**: up to 128 bit
* **Instruction-level**: pipelining, functional units, vectorization
* **Latency** becomes very important
    * branch prediction
    * tolerance of latency

### Memory
* Multiple **memory banks**

### IO
* Hardware **DMA**, **RAID** arrays

### Multiple processors

## Classification
![](https://i.imgur.com/fCYnQyY.png)

### Parallel systems
* Parallel computers
    * **SIMD** (Single Instruction Multiple Data)
        * Synchronized execution of the same instruction on a set of data
    * **MIMD** (Multiple Instruction Multiple Data)
        * Asynchronous execution of different instructions
* M. Flynn, Very High-Speed Computing Systems, Proceedings of the IEEE, 54, 1966

### MIMD Computers
* **Distributed Memory** - DM (multicomputer)
    * Building blocks are nodes with a **private physical address space**. Communication is based on **messages**.
* **Shared Memory** - SM (multiprocessor)
    * The system provides a **shared address space**. Communication is based on **read/write operations** to global addresses.

### Shared Memory
* **Uniform Memory Access** - UMA (symmetric multiprocessors - SMP)
    * Centralized shared memory; accesses to global memory from **all processors** have the **same latency**.
* **Non-uniform Memory Access** - NUMA (Distributed Shared Memory systems - DSM)
    * Memory is distributed among the nodes; **local accesses** are much **faster** than remote accesses.

## Parallel architectures - SuperMUC
* SuperMUC @ Leibniz Supercomputing Centre (LRZ)
![](https://i.imgur.com/tD3gbJS.jpg)
* SuperMUC in the TOP500
![](https://i.imgur.com/7RKrVBf.png)
* Peak performance: $3$ PetaFlops $= 3 \cdot 10^{15}$ Flops

## Distributed Memory Architecture
* **18 partitions** called islands with **512 nodes** per partition
* A **node** is a **shared memory system** with **2 processors**
    * **Sandy Bridge-EP** Intel Xeon E5-2680 8C
        * 2.7 GHz (Turbo 3.5 GHz)
    * 32 GByte memory
    * **Infiniband** network interface
* Each processor has **8 cores**
    * 2-way **hyperthreading**
    * 21.6 GFlops @ 2.7 GHz per core
    * 172.8 GFlops per processor

## Sandy Bridge Processor
![](https://i.imgur.com/wSdkM0R.png)
* L3 cache
    * Partitioned, with **cache coherence** based on **core valid bits** (via QPI)
    * **Physical addresses** distributed by a **hash** function

## NUMA Node
![](https://i.imgur.com/IdRwyta.png)
* 2 processors with **32 GB of memory** (4 GB × 8)
* **Aggregate memory bandwidth** per node: 102.4 GB/s
* Latency
    * local: ~50 ns (~135 cycles @ 2.7 GHz)
    * remote: ~90 ns (~240 cycles)

## Interconnection Network
![](https://i.imgur.com/dTWxjV0.png)
* Infiniband FDR-10
    * FDR means **Fourteen Data Rate**
    * FDR-10 has an effective data rate of 38.79 Gbit/s
    * Latency: 100 ns per switch, 1 µs for MPI
    * Vendor: Mellanox
* Intra-island topology: **non-blocking** tree
    * 256 communication pairs can talk **in parallel**
* Inter-island topology: **pruned tree** 4:1
    * 128 links per island to the next level

### VIA and Infiniband
![](https://i.imgur.com/Ofzu34W.png)
* **Standardized** user-level network interface
* Specification of the **software interface**, not of the NIC implementation
    * **Can be fully implemented** in the **hardware NIC**, or major parts of the protocol processing can be on-loaded to the host processor
* Allows **bypassing** the OS on the **data path**
    * Consumers acquire one or more virtual interfaces via the kernel agent (control path)
* **Efficient sharing** of NICs
    * Getting more important with multicore processors

### MPI Performance

| IBM MPI over Infiniband | IBM MPI over Ethernet |
|:------------------------------------:|:------------------------------------:|
| ![](https://i.imgur.com/uITlmel.png) | ![](https://i.imgur.com/TKqvTOz.png) |
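Performance comparisons like the ones above are typically measured with a ping-pong microbenchmark between two MPI ranks. Below is a minimal sketch in C; the message sizes, repetition count, and output format are illustrative choices, not the settings used for the IBM MPI plots.

```c
// Minimal MPI ping-pong sketch: rank 0 and rank 1 exchange messages of
// increasing size; rank 0 reports one-way latency and bandwidth derived
// from the round-trip time. Sizes and repetition count are illustrative.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int reps = 1000;
    for (int bytes = 1; bytes <= (1 << 20); bytes *= 2) {
        char *buf = malloc(bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = MPI_Wtime() - t0;
        if (rank == 0) {
            double one_way_s = t / reps / 2.0;               // half the round trip
            printf("%8d bytes: %8.2f us  %10.2f MB/s\n",
                   bytes, one_way_s * 1e6, bytes / one_way_s / 1e6);
        }
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```

Compiled with the MPI compiler wrapper of the installation (e.g. `mpicc`), placing the two ranks on the same node measures the shared-memory path, while placing them on different nodes measures the Infiniband (or Ethernet) path.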
### 9288 Compute Nodes
![](https://i.imgur.com/EAIM9XE.jpg)
* Cold corridor
* Infiniband (red) cabling
* Ethernet (green) cabling

### Infiniband Interconnect
![](https://i.imgur.com/csVItrQ.jpg =300x)
* 19 Orcas, 126 spine switches
* 11900 Infiniband cables

## IO System
![](https://i.imgur.com/6sNF8TM.png)

## SuperMUC Power Consumption
![](https://i.imgur.com/ISe2Pcz.png)

## CPU frequency and RAM setting
![](https://i.imgur.com/r3qPJSO.png)

## SuperMUC - Costs
![](https://i.imgur.com/H04AVBg.png)

## Energy Capping in the Contract with IBM
* **New** funding scheme: energy included
* **The contract includes the energy cost for 5 years**
* Power consumption of the system varies between 1 and 3 MW, depending on the usage by the applications
* The contract is based on the energy consumed in a benchmark suite agreed upon between IBM and LRZ

## IBM iDataPlex dx360 M4: Water Cooling
* Heat flux > 90% to water; very low chilled-water requirement
* Power advantage over an air-cooled node
    * warm-water cooled ~10% (cold-water cooled ~15%)
    * due to lower $\mathrm{T_{components}}$ and no fans
* Typical operating conditions
    * $\mathrm{T_{air}} = 25 - 35^{\circ}\mathrm{C}$
    * $\mathrm{T_{water}} = 18 - 45^{\circ}\mathrm{C}$

## IBM System x iDataPlex Rack
![](https://i.imgur.com/ykbZT96.jpg =300x)

## 9288 Compute Nodes
![](https://i.imgur.com/qAGdXej.jpg)
* Warm corridor
* Warm water
* Cooling pipes
* Matthias Brehm, Herbert Huber, LRZ High Performance Systems Division

## SuperMIC
![](https://i.imgur.com/VaMbRQr.png =300x)
* Intel Xeon Phi cluster
* 32 nodes
    * 2 Xeon Ivy Bridge processors E5-2650
        * 8 cores each
        * 2.6 GHz clock frequency
    * 2 Intel Xeon Phi coprocessors 5110P
        * 60 cores @ 1.1 GHz
* Memory
    * 64 GB host memory
    * 2 × 8 GB on the Xeon Phis

## Intel Xeon Phi - Knights Corner
![](https://i.imgur.com/2lmdBjK.png)

## Nodes with Coprocessors
![](https://i.imgur.com/jlEet04.png)

## The Compute Cube of LRZ
![](https://i.imgur.com/R7xrWal.png)

## Run jobs in batch
* Advantages
    * Reproducible performance
    * Run larger jobs
    * No need to interactively poll for resources
* Test queue
    * max 1 island, 32 nodes, 2 h, 1 job in queue
* General queue
    * max 1 island, 512 nodes, 48 h
* Large
    * max 4 islands, 2048 nodes, 48 h
* Special
    * max 18 islands ...
* Job script
![](https://i.imgur.com/jgMTg8B.png)

## Limited CPU Hours available
* Please
    * Specify the job execution as tightly as possible.
    * Do not request more nodes than required. We have to "pay" for all allocated cores, not only the used ones.
    * SHORT (< 1 sec) sequential runs can be done on the login node.
    * Even **short OMP** runs can be done on the login node.

## Login to SuperMUC, Documentation
* First change the standard password
    * https://idportal.lrz.de/r/entry.pl
* Login via
    * lxhalle, due to restrictions on connecting machines
    * `ssh <userid>@supermuc.lrz.de`
    * No outgoing connections allowed
* Documentation
    * http://www.lrz.de/services/compute/supermuc/
    * http://www.lrz.de/services/compute/supermuc/loadleveler/
    * Intel compiler: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/index.htm

## Batch Script Parameters
* `#@ energy_policy_tag = NONE`
    * Switches off the automatic adaptation of the core frequency, e.g. for performance measurements
* `#@ node = 2`
* `#@ total_tasks = 4`
* `#@ task_geometry = {(0,2) (1,3)}`
* `#@ tasks_per_node = 2`
* Limitations on the combination of these parameters are documented on the LRZ web page (see the placement check sketched below)
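A quick way to verify what `node`, `total_tasks`, `tasks_per_node`, and `task_geometry` actually produce is to run a small MPI program that reports where each rank lands. The sketch below is a generic check written for this note, not an LRZ-provided tool; with `task_geometry = {(0,2) (1,3)}` one would expect ranks 0 and 2 on the first allocated node and ranks 1 and 3 on the second.

```c
// Placement check: every MPI rank prints the node it runs on, so the effect
// of the node/total_tasks/tasks_per_node/task_geometry directives can be
// compared against the expected mapping.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("rank %d of %d runs on node %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}
```

The compiler wrapper and launch command depend on the MPI installation used in the job script (e.g. `mpicc` for compilation); the printed host names can then be matched against the `task_geometry` groups.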