---
tags: algorithm
---

<center>

# **<font size = +4>Parallel Computing & Algorithms</font>**

<br>
<br>

<div>
<p style="text-align: right"> "xxx."<br>-- xxx (yyyy-yyyy) </p>
</div>

</center>

### Goal

Multicore design is a defining feature of modern computer systems. Commodity CPUs in desktop computers, laptops, and even smartphones now start at 4 cores (for example, the [Apple A16](https://en.wikipedia.org/wiki/Apple_A16) has a 6-core CPU), and server processors reach as many as 96 cores (for example, the [AMD EPYC 9654](https://www.amd.com/zh-hant/products/cpu/amd-epyc-9654)). However, programming beginners rarely have the opportunity to exploit the performance benefits of multicore hardware, let alone gain a deep understanding of contemporary computer architecture. In recent years, Artificial Intelligence (AI) has become a prominent field, relying heavily on graphics processing units (GPUs) or accelerators for machine learning (ML). For example, the latest NVIDIA graphics card, the RTX 4090, has 16384 CUDA cores. Utilizing such an immense number of computational units has become a crucial challenge. This course shows you how to utilize these computing units (CPUs and GPUs) to unleash their computational performance. We will cover the following topics in this course:

1. Modern computer architecture and job scheduling in operating systems (using Linux as the example)
2. Multiprocessing & multithreading programming in Java, C#, and Python
3. Parallel algorithms
4. CUDA in C++

### Instructor Information

- Instructor: [Zheng-Liang Lu (Arthur)](mailto:arthurzllu@gmail.com)

### Textbook

TBA

### Programming Languages

- You can pick up the programming languages used in this course, namely C, [C++](https://www.csie.ntu.edu.tw/~d00922011/cpp.html), [C#](https://www.csie.ntu.edu.tw/~d00922011/csharp.html), [Java](https://www.csie.ntu.edu.tw/~d00922011/java.html), and [Python](https://hackmd.io/@arthurzllu/HJNXq84SO), from the linked materials.
- You can find installation instructions [here](https://hackmd.io/@arthurzllu/ry-PIcBlO).

## **Syllabus**

## ++Background Knowledge++

### Types of Parallelism

- Types of tasks
  - CPU-bound tasks
  - I/O-bound tasks
    - For example, some database operations and web crawling.
- Parallelism in hardware<img src = "https://diveintosystems.org/book/C15-Parallel/_images/flynn.png" height = 250px style="float:right"/>
  - Parallelism in a uniprocessor
    - SIMD instructions, vector processors, GPUs
  - Multiprocessors
    - Symmetric shared-memory multiprocessors
    - Distributed-memory multiprocessors
  - Chip multiprocessors, a.k.a. multicores
  - Multicomputers, a.k.a. clusters
- Parallelism in software
  - Instruction-level parallelism
  - Task-level parallelism
  - Data-level parallelism
  - Transaction-level parallelism

### Preliminary Knowledge about Hardware

- Multitasking/multiprogramming computer
  - Unlike mainframe computers, which execute tasks in batches, personal computers (PCs) need to perform multiple tasks simultaneously, as the sketch below illustrates.
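  - As a quick taste of multitasking, here is a minimal sketch in Python (standard library only; `io_bound_task` is a made-up stand-in for a slow I/O operation). Two I/O-bound tasks overlap in time, so the elapsed time is close to that of a single task rather than the sum of both.
```python=
import threading
import time

def io_bound_task(name):
    time.sleep(1)  # stands in for a slow I/O operation (disk, network, ...)
    print(name, "done")

start = time.time()
threads = [threading.Thread(target=io_bound_task, args=(f"task-{i}",)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"elapsed: {time.time() - start:.2f}s")  # ~1s, not ~2s
```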
- Hardware acceleration
  - Instruction pipelining: fetch, decode, execute, write-back
    - https://en.wikipedia.org/wiki/Instruction_pipelining
  - Loop unrolling
    - https://en.wikipedia.org/wiki/Loop_unrolling#Simple_manual_example_in_C
  - Out-of-order execution
    - https://ocw.mit.edu/courses/6-823-computer-system-architecture-fall-2005/a984df22afeb4bd732058005861a70a4_l12_ooo_pipes.pdf
    - Register renaming
  - Branch prediction
    - https://en.wikipedia.org/wiki/Branch_predictor
  - Prefetching memory and files
    - For example, recently used data/files are **cached** in memory to avoid frequent I/O on slow devices. See https://www.linuxatemyram.com/.
- Cache design in modern processors
  - David Kuo, [CPU Cache 原理探討](https://hackmd.io/@drwQtdGASN2n-vt_4poKnw/H1U6NgK3Z)
  - Jserv, [現代處理器設計:Cache 原理和實際影響](https://hackmd.io/@sysprog/HkW3Dr1Rb)
  - Chi-En Wu and Jim Huang (Jserv), [每位程式開發者都該有的記憶體知識](https://github.com/sysprog21/cpumemory-zhtw)
  - 老狼, [Cache 是如何組織和運作的?](https://zhuanlan.zhihu.com/p/31859105) and [L1, L2, L3 cache 究竟在哪裡?](https://zhuanlan.zhihu.com/p/31422201)
  - Case study
    - Vadim Filanovsky and Harshad Sane, [Seeing through hardware counters: a journey to threefold performance increase](https://netflixtechblog.com/seeing-through-hardware-counters-a-journey-to-threefold-performance-increase-2721924a2822), Netflix Technology Blog, 2022-11-10
- Multi-core design
  - https://www.techspot.com/article/2363-multi-core-cpu/
    - In **2001** we saw the first true multi-core processor, released by IBM under its Power4 architecture; as expected, it was geared towards workstation and server applications.
    - In **2005**, however, Intel released its first consumer-focused dual-core processor, which was a multi-core design, and later that same year AMD released its version with the Athlon X2 architecture.
    - The first 1-qubit processors were announced not too long ago, and yet a 54-qubit processor was announced by Google in 2019, claiming to have achieved quantum supremacy, which is a fancy way of saying that their processor can do something a traditional CPU cannot do in a realistic amount of time.
  - A series of articles on the development of computer hardware: [personal computer](https://www.techspot.com/article/874-history-of-the-personal-computer/), [CPU](https://www.techspot.com/article/2000-anatomy-cpu/), [memory](https://www.techspot.com/article/2024-anatomy-ram/), [graphics card](https://www.techspot.com/article/1988-anatomy-graphics-card/), [motherboard](https://www.techspot.com/article/1965-anatomy-motherboard/), [hard drive](https://www.techspot.com/article/1984-anatomy-hard-drive/), [SSD](https://www.techspot.com/article/1985-anatomy-ssd/)

### Preliminary Knowledge about Software: Operating System

- Program vs. Process ![](https://hackmd.io/_uploads/rJWBEcPws.png =500x) <font size=-3>https://techdifferences.com/difference-between-program-and-process.html</font>
- Job scheduling
  - Shortest job first
  - [Round-Robin (RR) scheduling](https://en.wikipedia.org/wiki/Round-robin_scheduling)
  - Priority scheduling
  - Fair-share scheduling
- Thread<img src = "https://1.bp.blogspot.com/-TKQSNUbkqNs/W_UCljTAeSI/AAAAAAAAI3Y/FK_seKThNSghF1eNh0wkb7fwTcPyxanyQCLcBGAs/s1600/1.jpg" height = 200px style="float:right"/>
  - You may take a look at the implementation in the Linux kernel: https://github.com/torvalds/linux/blob/master/include/linux/sched.h
  - This article on Linux processes is also worth reading: https://hackmd.io/@sysprog/linux-process
- Parallelism and concurrency
  - Should we write to a database in parallel or concurrently?
    - See the **a**tomicity of the [ACID](https://en.wikipedia.org/wiki/ACID) properties of database transactions.
  - [Race condition](https://en.wikipedia.org/wiki/Race_condition)
```java=
int x = 0;
x++; // thread-safe?
```
  - For example, see https://en.wikipedia.org/wiki/Java_bytecode: an increment on a shared variable typically compiles to separate read, add, and write instructions, so two threads can interleave between them. A Python demonstration follows.
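  - To see the lost-update race concretely, here is a minimal sketch in Python (standard library only; the iteration count is arbitrary). Two threads increment a shared counter without a lock; because `x += 1` is a read-modify-write sequence rather than an atomic step, the final total is often less than expected, though the exact outcome may vary across runs and interpreter versions. Guarding the increment with a `threading.Lock` restores the expected total.
```python=
import threading

x = 0

def worker():
    global x
    for _ in range(1_000_000):
        x += 1  # read-modify-write: not atomic, so updates can be lost

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(x)  # often less than 2000000 due to lost updates
```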
- Deadlock
  - [Dining philosophers](https://en.wikipedia.org/wiki/Dining_philosophers_problem)<img src = "https://hackmd.io/_uploads/SyOqfqvwi.png" height = 250px style="float:right"/>
  - Another issue: starvation
  - How can we prevent deadlock?
    - Make at least one of the following four necessary conditions fail to hold:
      - Mutual exclusion
      - Non-preemption
      - Hold and wait
      - Circular wait
- Producer-consumer model
  - https://www.cs.cornell.edu/courses/cs3110/2010fa/lectures/lec18.html
  - For example, the [publish-subscribe pattern](https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern).
- [Amdahl's Law](https://en.wikipedia.org/wiki/Amdahl%27s_law)
  - Let $\alpha\,\%$ be the parallelizable portion of an algorithm and $p$ be the number of processors (or computing units of any type).
  - For convenience, normalize the execution time of the sequential algorithm to $T = 100.$
  - Then the execution time of the parallel version is $$T' = \dfrac{\alpha}{p} + 100 - \alpha.$$
  - Amdahl's law states that the asymptotic limit of the speedup $\frac{T}{T'}$ is $$\lim_{p\rightarrow\infty} \frac{T}{T'}= \dfrac{100}{100 - \alpha} < \infty.$$
  - This implies that adding processors cannot drive the execution time to zero unless the algorithm is 100% parallelizable. For example, if $\alpha = 95$, the speedup can never exceed $100/5 = 20\times$, no matter how many processors you add.
  - In conclusion,
    - parallelism reduces the execution time, but the marginal improvement degrades as the non-parallelizable portion gradually dominates, an instance of [the law of diminishing returns](https://en.wikipedia.org/wiki/Diminishing_returns);
    - instead of buying more machines, it may be better to hire programmers who are good at parallel algorithms.

<center>

![](https://hackmd.io/_uploads/HkOaW5PDj.png =500x)

</center>

### Parallel Modules/Packages in Programming Languages

- Linux
  - parallel
    - https://www.gnu.org/software/parallel/parallel_tutorial.html
```bash=
time parallel --ungroup --jobs 1 unzip -qq {} -d ./dest ::: $(ls -d ./src/*.zip | tr "\n" " ")
# timecost: 0m33.277s

time parallel --ungroup --jobs $(nproc) unzip -qq {} -d ./dest ::: $(ls -d ./src/*.zip | tr "\n" " ")
# timecost: 0m1.359s with $(nproc) = 48
```
    - https://ulhpc-tutorials.readthedocs.io/en/latest/sequential/gnu-parallel/
    - More examples can be found here: https://www.gushiciku.cn/dc_tw/200046748
- Python
  - [multiprocessing](https://docs.python.org/3/library/multiprocessing.html): the fragment below shows the core pattern (`task` and `parameter_list` are defined elsewhere); a complete, runnable version follows.
```python=
import multiprocessing
from multiprocessing import Pool

with Pool(multiprocessing.cpu_count()) as pool:
    results = pool.map(task, parameter_list)
```
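  - A self-contained version of the pattern above, as a minimal sketch (the `square` task and its inputs are made up for illustration):
```python=
import multiprocessing
from multiprocessing import Pool

def square(n):
    # a stand-in for a CPU-bound task
    return n * n

if __name__ == "__main__":  # required on platforms that spawn child processes
    with Pool(multiprocessing.cpu_count()) as pool:
        results = pool.map(square, range(10))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```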
  - One of my daily jobs is to compute on intraday market data. For TXO (台指選擇權, the TAIEX options), more than 20M transaction entries must be processed on a daily basis to calculate the TAIWAN VIX (台指選擇權波動率指數, the TXO-based volatility index).<br>
<img src = "https://hackmd.io/_uploads/SyeUJKmwo.png"/>
  - [threading](https://docs.python.org/3/library/threading.html)
  - [asyncio](https://docs.python.org/3.11/library/asyncio-task.html): asynchronous (nonblocking) programming
  - References
    - https://www.machinelearningplus.com/python/parallel-processing-python/
- Java <font size = -2><a href ="https://www.csie.ntu.edu.tw/~d00922011/java2/multithreading/ConcurrentFramework.zip">sample code</a></font>
  - Basic level: the [Thread](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/Thread.html) class and the [Runnable](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/Runnable.html) interface.
```java=
...
public static void main(String[] args) {
    // define the task once and run it on three new threads
    Runnable task = () -> System.out.println("Hello, concurrency.");
    new Thread(task).start();
    new Thread(task).start();
    new Thread(task).start();
}
...
```
  - Mid level: thread pool implemented by the [Executors](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/concurrent/Executors.html) class and its associated members.
    - popcornylu, [Java 多執行緒的基本知識](https://popcornylu.gitbooks.io/java_multithread/content/threadpool/basicpool.html)
```java=
import java.util.List;
import java.util.ArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

public class Test {

    public static void addAllTasks(List<Callable<String>> tasks) {
        tasks.add(() -> "Task 1");
        tasks.add(() -> "Task 2");
        tasks.add(() -> "Task 3");
    }

    public static void main(String... args) throws Exception {
        int numOfThread = 4;
        List<Callable<String>> tasks = new ArrayList<>();
        addAllTasks(tasks);
        // run all tasks on a fixed-size pool and wait for all of them to finish
        ExecutorService pool = Executors.newFixedThreadPool(numOfThread);
        List<Future<String>> results = pool.invokeAll(tasks);
        for (Future<String> result : results)
            System.out.println(result.get()); // get() returns the task's value
        pool.shutdown();
    }
}
```
  - Mid level: fork-join pool implemented by the [ForkJoinPool](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/concurrent/ForkJoinPool.html) class with the [ForkJoinTask](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/concurrent/ForkJoinTask.html) interface.
  - Asynchronous programming in Java: see [Callable](https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/util/concurrent/Callable.html), [Future](https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/util/concurrent/Future.html) and [CompletableFuture](https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/util/concurrent/CompletableFuture.html); a cross-language comparison with Python follows below.
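- As a cross-language comparison, here is the same fixed-size thread pool pattern in Python's standard [concurrent.futures](https://docs.python.org/3/library/concurrent.futures.html) module, as a minimal sketch (the task bodies are illustrative):
```python=
from concurrent.futures import ThreadPoolExecutor

# callables returning strings, analogous to Java's Callable<String>
tasks = [lambda i=i: f"Task {i}" for i in (1, 2, 3)]

with ThreadPoolExecutor(max_workers=4) as pool:   # like Executors.newFixedThreadPool(4)
    futures = [pool.submit(t) for t in tasks]     # like pool.invokeAll(tasks)
    for f in futures:
        print(f.result())                         # like Future.get()
```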
- Go: [goroutines](https://golangdocs.com/goroutines-in-golang)
  - https://gobyexample.com/goroutines
  - A Chinese-language explanation of goroutines: https://hoohoo.top/blog/golang-goroutine-above-fundamentals-and-structure/<img src = "https://hackmd.io/_uploads/SJ17VhDDs.png" height = 200px style="float:right"/>
- [NVIDIA CUDA](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html)
  - Quick start: https://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html
  - Benchmarks
    - [TensorFlow 2 - CPU vs GPU Performance Comparison](https://datamadness.github.io/TensorFlow2-CPU-vs-GPU), 2019
    - [Parallelizing across multiple CPU/GPUs to speed up deep learning inference at the edge](https://aws.amazon.com/tw/blogs/machine-learning/parallelizing-across-multiple-cpu-gpus-to-speed-up-deep-learning-inference-at-the-edge/), 2019
    - [CPU vs GPU in Machine Learning Algorithms: Which is Better?](https://thinkml.ai/cpu-vs-gpu-in-machine-learning-algorithms-which-is-better/), 2021
    - https://www.nvidia.com/en-us/data-center/resources/mlperf-benchmarks/
  - Supplementary materials
    - [How CUDA Programming Works | GTC 2022](https://www.youtube.com/watch?v=n6M8R8-PlnE&ab_channel=PERLI)
- C
  - [OpenMP](https://www.openmp.org/)
  - POSIX Threads (a.k.a. pthreads)
    - https://passlab.github.io/CSCE790/notes/lecture08_PThreads.pdf
    - https://passlab.github.io/CSE436536/notes/lecture09_pthread01.pdf
    - https://blog.gtwang.org/programming/pthread-multithreading-programming-in-c-tutorial/
- C++
  - OpenMP
  - Multithreaded programming
    - For example, https://www.geeksforgeeks.org/multithreading-in-cpp/
  - More articles about parallel programming in C++
    - Patrick Diehl, [Advanced Parallel Programming in C++](https://www.diehlpk.de/blog/modern-cpp/), 2021
    - Zhihu, [The Coroutine in C++ 20 协程初探](https://zhuanlan.zhihu.com/p/237952115)

## ++Parallel Algorithms++

- https://www.dcc.fc.up.pt/~ricroc/aulas/1516/cp/apontamentos/slides_sorting.pdf
- https://www3.cs.stonybrook.edu/~rezaul/Spring-2012/CSE613/CSE613-lecture-9.pdf
- https://arxiv.org/ftp/arxiv/papers/1511/1511.03404.pdf

### Practical Examples

#### Monte Carlo Simulation

- It is an example of embarrassingly (ideally) parallel algorithms; a runnable sketch for producing your own benchmark numbers appears at the end of this section.

<center>

![](https://hackmd.io/_uploads/ByvrD3Dvj.png =300x)

</center>

==need quantitative benchmark==

#### Compression & Decompression

- Scenarios
  - Case 1: Decompressing 266 zip files for the TX futures. The total size of all csv files is 5.6 GB.
  - Case 2: Decompressing 254 zip files for all stocks in the Taiwan equity market. The total size of all csv files is 11 GB.
  - Case 3: Compressing nnnn csv files for all stocks in the Taiwan equity market.<br><br>

| Scenario | Sequential (1 thread) | Parallel (48 threads) |
|:--------:| ---------------------:| ---------------------:|
| Case 1   | 36.915s               | 1.539s                |
| Case 2   | 1m13.223s             | 3.597s                |
| Case 3   | TBA                   | TBA                   |

#### Web Crawling

==need quantitative benchmark==

### Complexity Analysis

#### Remarks

- Can parallelism reduce the time complexity of a problem?
  - No, unless the number of processors/workers grows in proportion to the input size; with a constant number of processors $p$, the running time shrinks by at most a factor of $p$, which leaves the asymptotic complexity unchanged.
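To make the Monte Carlo example above concrete, here is a minimal sketch (Python standard library only; the sample count is arbitrary) that estimates $\pi$ both sequentially and with a process pool, so you can produce benchmark numbers on your own machine:
```python=
import multiprocessing
import random
import time
from multiprocessing import Pool

def count_hits(n):
    # count random points in the unit square that fall inside the quarter circle
    hits = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":
    total = 10_000_000
    workers = multiprocessing.cpu_count()

    start = time.time()
    pi_seq = 4 * count_hits(total) / total
    print(f"sequential: pi ~ {pi_seq:.4f} in {time.time() - start:.2f}s")

    start = time.time()
    with Pool(workers) as pool:
        hits = sum(pool.map(count_hits, [total // workers] * workers))
    pi_par = 4 * hits / (total // workers * workers)
    print(f"parallel:   pi ~ {pi_par:.4f} in {time.time() - start:.2f}s on {workers} workers")
```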
## ++Distributed Computing++ - Message Passing Interface (MPI) - Lecture slides: https://people.sc.fsu.edu/~jburkardt/presentations/mpi_2013_acs2.pdf - Tech stack - [Open MPI: Open Source High Performance Computing](https://www.open-mpi.org/) - [MPI for Python](https://mpi4py.readthedocs.io/en/stable/) - [Using IPython for parallel computing](https://ipyparallel.readthedocs.io/en/latest/) - References - https://hpc.nmsu.edu/discovery/mpi/introduction/ - https://wvuhpc.github.io/2018-Lesson_4/05-mpi/index.html - https://help.ubuntu.com/community/MpichCluster - https://medium.com/mpi-cluster-setup/mpi-clusters-within-a-lan-77168e0191b1 ## References ### Books - W. P. Petersen and P. Arbenz, [Introduction to Parallel Computing](https://www.amazon.com/Introduction-Parallel-Computing-Engineering-Mathematics/dp/0198515774), 2004 ![](https://hackmd.io/_uploads/rk_oaowdi.png =100x) - Peter Pacheco and Matthew Malensek, [An Introduction to Parallel Programming](https://www.amazon.com/Introduction-Parallel-Programming-Peter-Pacheco/dp/0128046058), 2/e, 2022 ![](https://hackmd.io/_uploads/Hk8v0ovus.png =100x) - Roman Trobec, Boštjan Slivnik, Patricio Bulić, and Borut Robič, [Introduction to Parallel Computing: From Algorithms to Programming on State-of-the-Art Platforms](https://link.springer.com/book/10.1007/978-3-319-98833-7), 2018 ![](https://hackmd.io/_uploads/HkSaoivuo.png =100x) #### Operating Systems (OS) - Abraham Silberschatz, Greg Gagne, Peter B. Galvin, [Operating System Concepts](https://www.amazon.com/-/zh_TW/Operating-System-Concepts-Abraham-Silberschatz/dp/1119800366/), 10/e, 2021 ![](https://hackmd.io/_uploads/Hy6-XLuwi.png =100x) - Andrew S. Tanenbaum and Herbert Bos, [Modern Operating Systems](https://www.amazon.com/-/zh_TW/Andrew-Tanenbaum/dp/0132126958/), 4/e, 2015 ![](https://hackmd.io/_uploads/Sk24i5DPj.png =100x) - Remzi H. Arpaci-Dusseau and Andrea C. 
Arpaci-Dusseau, [Operating Systems: Three Easy Pieces](https://pages.cs.wisc.edu/~remzi/OSTEP/), 2018 ![](https://hackmd.io/_uploads/B1bPsiPdi.png =100x)

#### Hardware Architectures

- Ulrich Drepper, [What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf), 2007
- Maurice Herlihy, Nir Shavit, Victor Luchangco, and Michael Spear, [The Art of Multiprocessor Programming](https://www.amazon.com/Art-Multiprocessor-Programming-Maurice-Herlihy/dp/0124159508/), 2/e, 2020 ![](https://i.imgur.com/XC22Jsp.png =110x)
- Randal Bryant and David O'Hallaron, [Computer Systems: A Programmer's Perspective](https://www.amazon.com/Computer-Systems-Programmers-Perspective-3rd/dp/013409266X), 3/e, 2015 ![](https://hackmd.io/_uploads/rkVOELdPj.png =120x)

#### Parallel Algorithms

- Behrooz Parhami, [Introduction to Parallel Processing: Algorithms and Architectures](https://www.amazon.com/Introduction-Parallel-Processing-Algorithms-Architectures/dp/0306459701), 1999 ![](https://hackmd.io/_uploads/Sk953jDds.png =100x)
- Frank Thomson Leighton, [Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes](https://www.amazon.com/Introduction-Parallel-Algorithms-Architectures-Hypercubes/dp/1558601171?), 1991 ![](https://hackmd.io/_uploads/BkA-aow_i.png =100x)
- Michael McCool, James Reinders, and Arch Robison, [Structured Parallel Programming: Patterns for Efficient Computation](https://www.amazon.com/Structured-Parallel-Programming-Efficient-Computation/dp/0124159931), 2012 ![](https://hackmd.io/_uploads/HyRrojvdo.png =100x)

#### CUDA

- Richard Ansorge, [Programming in Parallel with CUDA](https://www.cambridge.org/core/books/programming-in-parallel-with-cuda/C43652A69033C25AD6933368CDBE084C), 2022 ![](https://hackmd.io/_uploads/HyyNhiwuj.png =100x)
- NVIDIA, [CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf)

#### Distributed Computing

- Nicola Santoro, [Design and Analysis of Distributed Algorithms](https://www.amazon.com/Design-Analysis-Distributed-Algorithms-Santoro/dp/0471719978), 2006 ![](https://hackmd.io/_uploads/r1PaAjwdi.png =100x)

### Courses

- Aydin Buluc and Jim Demmel, [Applications of Parallel Computers](https://sites.google.com/lbl.gov/cs267-spr2022), UC Berkeley, 2022
- Suresh Jagannathan, [CS 353: Principles of Concurrency and Parallelism](https://www.cs.purdue.edu/homes/suresh/353-Spring2022/), Purdue University, 2022sp
- Brian Railing and Nathan Beckmann, [15-418/15-618: Parallel Computer Architecture and Programming](https://www.cs.cmu.edu/afs/cs/academic/class/15418-s21/www/), Carnegie Mellon University, 2021sp
- [High Performance Computing For Science And Engineering (HPCSE) II](https://www.cse-lab.ethz.ch/teaching/hpcse-ii_fs20/), Swiss Federal Institute of Technology in Zurich
- Bei Yu, [CMSC 5743 Efficient Computing of Deep Neural Networks](http://www.cse.cuhk.edu.hk/~byu/CMSC5743/2021Fall/), The Chinese University of Hong Kong, 2021
- Dan Garcia and Lisa Yan, [Great Ideas in Computer Architecture (Machine Structures)](https://cs61c.org/fa22/), UC Berkeley, 2022fa
- Yonghong Yan, [CSCE 513 Computer Architecture](https://passlab.github.io/CSCE513/), University of South Carolina, 2018fa
- Mike Giles and Wes Armour, [Course on CUDA Programming on NVIDIA GPUs](https://people.maths.ox.ac.uk/gilesm/cuda/), 2023
- Sergio Martin, High Performance Computing for Science and Engineering, 2021: [pdf](https://www.cse-lab.ethz.ch/wp-content/uploads/2021/05/Lecture-GPU-Programming-II.pdf)
- Al
Barr, [GPU Programming](http://courses.cms.caltech.edu/cs179/), Caltech, 2022
- Jserv, [Linux 核心設計: 記憶體管理](https://hackmd.io/@sysprog/linux-memory)

### YouTube

- Jserv, [作業系統概念: Concurrency (並行) 程式設計篇](https://www.youtube.com/watch?v=3mkug2ygdIs&feature=youtu.be)
- [极客湾 Geekerwan](https://www.youtube.com/@geekerwan5818), 吹牛还是真牛?苹果M1全网最硬核评测 (an in-depth Apple M1 review): [Part 1](https://youtu.be/WMyAGVmiPWE) and [Part 2](https://youtu.be/Mq5np26UkDI)

### Game

- [7 Billion Humans](https://store.steampowered.com/app/792100/7_Billion_Humans/) ![](https://i.imgur.com/6Tf8t0k.png =300x)

## TODO

- Case study
  - Producer-consumer model
    - https://www.cs.cornell.edu/courses/cs3110/2010fa/lectures/lec18.html
  - Parallelism of Fibonacci numbers
    - https://classes.engineering.wustl.edu/cse231/core/index.php/Fibonacci
    - https://catonmat.net/mit-introduction-to-algorithms-part-thirteen
- Single Instruction Multiple Data (SIMD)
  - [The Intel® Compiler Auto Vectorization](https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-auto-vectorization-tutorial/top.html)
  - https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
- Supplementary materials
  - https://www.machinelearningplus.com/python/parallel-processing-python/
  - https://towardsdatascience.com/10x-faster-parallel-python-without-python-multiprocessing-e5017c93cce1