> Reference \: https://www.manning.com/books/parallel-and-high-performance-computing

*In this post, I will only include some topics that interest me and some extra material that I read when studying this book.*

# Part I \: Introduction to Parallel Computing

## Amdahl's Law \& Gustafson-Barsis's Law

Reference \: https://hackernoon.com/a-deep-dive-into-amdahls-law-and-gustafsons-law

* *Amdahl's Law* maps to *Strong Scaling* \: with a fixed problem size and parallel fraction $p$, the speedup on $N$ processors is bounded by $S(N) = \frac{1}{(1-p) + p/N}$.
* *Gustafson-Barsis's Law* maps to *Weak Scaling* \: when the problem size grows with the processor count, the scaled speedup is $S(N) = (1-p) + pN$.

:::info
:information_source: **Project Made in KU**
This repo contains the project I made while studying at Kean University and taking an Independent Study.
https://github.com/Chen-KaiTsai/PerformanceEngineering_repo/tree/main/OpenMP_Algorithms
:::

## Parallel Approaches

* Process-based Parallelization \: Message passing \(MPI\)
* Thread-based Parallelization \: Share data via memory \(OpenMP\)
* Vectorization \: SIMD
* Stream Processing \: Through specialized processors

### Steps

1. Discretize \(break up\) the problem into smaller cells or elements
2. Define a computational kernel \(operation\) to conduct on each element of the mesh
3. Add the following layers of parallelization on CPUs and GPUs to perform the calculation
    * *Vectorization* \: Work on more than one unit of data at a time
    * *Using Threads* \: Threads work on one CPU via shared memory
    * *Using Processes* \: Processes can be distributed to multiple CPUs or compute nodes
    * *Off-Loading* \: Off-loading to dedicated processors, which can be a GPU, NPU, FPGA...

## Data Design and Performance Models

:::info
:information_source: **Data-Oriented Design**
Software Design Patterns — Data-Oriented Design
https://medium.com/@poga/data-oriented-design-9288178d3ea5
:::

### Cache Misses \: Compulsory, Capacity, Conflict

:::info
:information_source: **Cache Access Example**
Good tutorials for visualizing cache access.
\#1 https://www.youtube.com/watch?v=RqKeEIbcnS8
\#2 https://www.youtube.com/watch?v=quZe1ehz-EQ
:::

> Reference \:
> * Chapter 4 page 102
> https://www.manning.com/books/parallel-and-high-performance-computing

* **Compulsory** \: Cache misses that are necessary to bring in the data when it is first encountered.
* **Capacity** \: Cache misses caused by a limited cache size, which evicts data from the cache to free up space for new cache line loads.
* **Conflict** \: Cache misses that occur when data items map to the same location in the cache. If two or more data items are needed at the same time but map to the same cache line, they evict each other and must be reloaded repeatedly for each access.

# Part II \: CPU Parallel

## Vectorization

*Refer to the following dedicated post*
https://hackmd.io/@Erebustsai/HkdXPx-rh

## Multi-Core and Threading

*Refer to the following dedicated post*
https://hackmd.io/@Erebustsai/HyNZhtM-C

# Part III \: GPU Programming

*Refer to the following dedicated posts*

* CUDA Programming \: https://hackmd.io/@Erebustsai/SJdjopmEn
* OpenCL Programming \: https://hackmd.io/@Erebustsai/Sku_EMMr2
* GPGPU Algorithm Implementation \: https://hackmd.io/@Erebustsai/Byul7e-Up
* Advanced GPGPU Programming \: https://hackmd.io/@Erebustsai/HJ5p3-NFp
* SYCL Programming \: \(Pending\) https://hackmd.io/@Erebustsai/rJ3hrExRA

# Part IV \: High Performance Computing Ecosystems

## Affinity \: Truce with the Kernel

* *Affinity* \: Assigns a preference for the scheduling of a process, rank, or thread to a particular hardware component. This is also called *pinning* or *binding* \(see the sketch right after this list\).
* *Placement* \: Assigns a process or thread to a hardware location.
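Since *pinning* comes up throughout this part, here is what binding looks like at the OS level on Linux. This is a minimal sketch of my own \(not code from the book\), assuming glibc and the `sched_setaffinity(2)` interface:

```c
// Minimal sketch (not from the book): pin the calling process to core 0
// on Linux with sched_setaffinity(2). Error handling kept to the essentials.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);          // start with an empty CPU set
    CPU_SET(0, &set);        // add core 0 to the set

    // pid 0 means "the calling process"
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("Pinned to core 0; now running on CPU %d\n", sched_getcpu());
    return 0;
}
```

The same effect can be achieved from the command line with `taskset -c 0 ./a.out`.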
### Why is Affinity Important \?

A parallel program usually has synchronization points at different stages. Threads should therefore be scheduled together as a gang, so that no thread is left waiting at a synchronization point just because one of the others has not been scheduled yet.

> The best approach for getting gang scheduling is to only allocate as many processes as there are processors and bind those processes to the processors.

However, on a system with an OS, we need to make sure that the OS kernel has some hardware threads to use, so reserving a processor for the OS is usually good practice. In my opinion, on a CPU with hyper-threading, reserve at least one core, which means *two hardware threads* for the system.

:::success
:bulb: **Gang Scheduling**
https://en.wikipedia.org/wiki/Gang_scheduling
:::

**NUMA** \(Non-Uniform Memory Access\)

Refer to another post https://hackmd.io/@Erebustsai/HJ4PO0bV6

> Memory access time across NUMA domains can typically differ by a factor of two or more. It is a top priority for our processes to stay on the same memory domain.

> While tying affinity to a NUMA domain optimizes our access time to main memory, we can still have less than optimal performance due to poor cache usage. A process fills the L1 and L2 cache with the memory that it needs. But then, if it gets swapped out to another processor on the same NUMA domain with a different L1 and L2 cache, cache performance suffers.

**Hyperthreading**

Hyperthreads share a single physical core and its cache system, so the cache of one physical core is effectively split between the two hardware threads. For some applications, turning off hyper-threading can therefore bring more performance.

**Find out the Architecture**

On a Linux system, it is easy to just use `lstopo` from `hwloc`; for a Windows system, `hwloc` can be found on the Open MPI page. \(https://www-lb.open-mpi.org/software/hwloc/v2.11/\) Additionally, `Coreinfo64.exe` can be used to identify the processor groups in your Windows system hardware. \(https://learn.microsoft.com/zh-tw/sysinternals/downloads/coreinfo\)

However, to find out which NUMA node a PCIe device is connected to, I still have to check directly in Device Manager, on the Details page of the device's properties.

Refer to https://hackmd.io/EhDRrZWEQ4it2FMZZPVE4w for sample output.

### Affinity with OpenMP

* `OMP_PLACES = [sockets|cores|threads]`
* `OMP_PROC_BIND = [close|spread|primary]` or `[true|false]`

`OMP_PLACES` denotes the virtual processors that the scheduler can use; the scheduler can freely move threads within these denoted virtual processors. Notice that the number of threads invoked by the program should be less than or equal to the number of denoted virtual processors if context switches are to be avoided.

`OMP_PROC_BIND`

* `true` \: The kernel will not move a thread once it gets scheduled.
* `primary` \: Schedule threads on the same place as the primary \(main\) thread.
* `close` \: Schedule threads close together.
* `spread` \: Distribute the threads as evenly as possible across the places.

:::success
https://www.ibm.com/docs/en/xl-fortran-linux/16.1.0?topic=openmp-omp-proc-bind
:::

:::info
:information_source: **Report Threads Placement**
The following GitHub file shows how to query the binding policy and thread placement.
https://github.com/essentialsofparallelcomputing/Chapter14/blob/master/OpenMP/place_report_omp.c
:::
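In the same spirit as `place_report_omp.c` above, here is a minimal sketch of my own \(assuming OpenMP 4.5+ for the places API and glibc for `sched_getcpu`\) that reports where each thread lands:

```c
// Minimal sketch: report each OpenMP thread's place and current CPU.
// Assumes OpenMP 4.5+ (omp_get_place_num) and glibc (sched_getcpu).
// Build: gcc -fopenmp place_report.c
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    // omp_get_proc_bind() reflects the OMP_PROC_BIND policy in effect
    printf("proc_bind = %d, num_places = %d\n",
           (int)omp_get_proc_bind(), omp_get_num_places());

    #pragma omp parallel
    {
        // omp_get_place_num() returns the place of the calling thread
        printf("thread %2d of %2d -> place %2d, cpu %2d\n",
               omp_get_thread_num(), omp_get_num_threads(),
               omp_get_place_num(), sched_getcpu());
    }
    return 0;
}
```

Running it with, for example, `OMP_PLACES=cores OMP_PROC_BIND=close ./a.out` and comparing the output against `lstopo` makes it easy to verify that the binding policy does what you expect.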
## Batch Schedulers \: Bringing Order to Chaos

*TODO*

## File Operation for the Parallel World

:::success
***Correctness, Reducing Duplicated Output, Performance***
:::

### HPC Storage System Components

* **Spinning Disk** \: An electro-mechanical device where data is stored in an electro-magnetic layer through the movement of a mechanical recording head.
* **SSD** \: A solid-state drive is a solid-state memory device that can replace a mechanical disk.
* **Burst Buffer** \: An intermediate storage hardware layer composed of NVRAM and SSD components. It is positioned between the compute hardware and the main disk storage resources.
* **Tape** \: A magnetic tape with auto-loading cartridges.

:::info
:information_source: **Tape Storage**
**Tape technology may be old, but it still has strong longevity and environmental advantages for storing rarely accessed cold data**
https://technews.tw/2023/08/18/tape-storage-is-not-commonly-used-for-cold-data/
**Tape drive Wikipedia**
https://en.wikipedia.org/wiki/Tape_drive
:::

:::info
:information_source: **NVRAM**
**What is Non-Volatile Random Access Memory \(NVRAM\)?**
https://www.lenovo.com/tw/zh/glossary/nvram/?orgRef=https%253A%252F%252Fwww.google.com%252F
:::

![image](https://hackmd.io/_uploads/S1g8VwNC0.png)
> Reference \: https://www.manning.com/books/parallel-and-high-performance-computing
>
> Schematic showing the positioning of burst buffer hardware in between the compute resources and disk storage. Burst buffers can either be node-local or shared among nodes via a network.

:::info
:information_source: **External Memory File Handling Introduced in Algorithmica**
https://hackmd.io/FiPAsD0pQfempm4gPABisQ?view#External-Memory
:::

### MPI File Operation

> Reference \:
> * Swiss National Supercomputing Centre
> https://www.cscs.ch/fileadmin/user_upload/contents_publications/tutorials/fast_parallel_IO/MPI-IO_NS.pdf
> * What you should know about MPI-IO
> https://cloud.tencent.com/developer/article/1798470

\(A minimal MPI-IO sketch appears at the end of this post.\)

### HDF5 File

> Reference \:
> * HDF5 Wikipedia
> https://zh.wikipedia.org/zh-tw/HDF
> * HDF5 \& Redis \: Good Partners for Big Data
> https://chtseng.wordpress.com/2017/08/15/%E5%B7%A8%E9%87%8F%E8%B3%87%E6%96%99%E7%9A%84%E5%A5%BD%E5%A4%A5%E4%BC%B4hdf5-redis/
> * hdf5-tutorial
> https://github.com/HDFGroup/hdf5-tutorial
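As mentioned in the MPI File Operation section above, here is a minimal MPI-IO sketch of my own \(not from the book or the CSCS slides\): every rank writes its own block of a distributed array into one shared file at a rank-dependent offset.

```c
// Minimal MPI-IO sketch: each rank writes its own block of integers
// into one shared file at a rank-dependent byte offset.
// Build: mpicc mpi_io_sketch.c && mpirun -n 4 ./a.out
#include <mpi.h>

#define N 4  // elements written per rank

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data[N];
    for (int i = 0; i < N; i++)
        data[i] = rank * N + i;   // something recognizable per rank

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Each rank writes at its own offset: no overlap, no seeking races.
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(int);
    MPI_File_write_at(fh, offset, data, N, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```

Because each rank computes its own `offset`, all ranks can write concurrently without funneling output through rank 0, which is the whole point of MPI-IO.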
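And an equally minimal HDF5 sketch \(my own, following the pattern in the hdf5-tutorial linked above\): create a file, write one small 2-D integer dataset, and release the handles.

```c
// Minimal HDF5 sketch: create a file with one 2-D integer dataset.
// Build: h5cc hdf5_sketch.c   (h5cc wraps the compiler and HDF5 libs)
#include <hdf5.h>

int main(void)
{
    int data[2][3] = { {1, 2, 3}, {4, 5, 6} };

    // H5F_ACC_TRUNC overwrites the file if it already exists
    hid_t file = H5Fcreate("out.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[2] = {2, 3};
    hid_t space = H5Screate_simple(2, dims, NULL);

    hid_t dset = H5Dcreate(file, "/mydata", H5T_NATIVE_INT, space,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    // Release handles in reverse order of creation
    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```

`h5dump out.h5` prints the dataset back, which is a quick way to confirm the file layout.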