# MIT 6.172 Performance Engineering of Software Systems

*This is a study / course note. It refers to extra materials and other posts that I wrote.*

# 1. Introduction and Matrix Multiplication

## My Other Posts

* Loop Body Optimization with Matrix Multiplication
  https://hackmd.io/JAZbc0TpTjSH-EqTowxYUQ?view#Loop-Body-Optimization
* Matrix Multiplication Optimization
  https://hackmd.io/JAZbc0TpTjSH-EqTowxYUQ?view#Loop-Body-Optimization
* GPU Matrix Multiplication ([Github](https://github.com/Chen-KaiTsai/GPGPU_OpenCL/tree/main/matrix_multiplication))
  https://hackmd.io/JAZbc0TpTjSH-EqTowxYUQ?view#Loop-Body-Optimization
* SIMD Matrix Multiplication ([Github](https://github.com/Chen-KaiTsai/PerformanceEngineering_repo))
  https://hackmd.io/JAZbc0TpTjSH-EqTowxYUQ?view#Loop-Body-Optimization
* Article Study: Beating NumPy's Matrix Multiplication in 150 Lines of C Code
  https://hackmd.io/@Erebustsai/HJ4PO0bV6

## Matrix Multiplication

> Reference:
> * YouTube timestamp for visualized matrix multiplication
>   https://youtu.be/o7h_sYMk_oc?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=1937

### Parallel `for` loop

> https://youtu.be/o7h_sYMk_oc?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=2260

The parallel loop cannot be applied to the `k` loop because a data race on `C` would definitely occur: parallel `k` iterations would update the same `C[i][j]` concurrently.

```cpp
cilk_for (int i = 0; i < n; ++i)
    for (int k = 0; k < n; ++k) // cannot be a parallel loop: data race on the updates of C
        cilk_for (int j = 0; j < n; ++j)
            C[i][j] += A[i][k] * B[k][j];
```

# 2. Bentley Rules for Optimizing Work

## New Bentley Rules

> Reference:
> https://youtu.be/H-1-X9bkop8?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf

**Data Structures**

* Packing & Encoding: convert a `string` date to a `struct` of `unsigned int`s. In the lecture, a [bit field](https://learn.microsoft.com/en-us/cpp/c-language/c-bit-fields?view=msvc-170) is used on a C structure.
```cpp
using Date_t = struct Date {
    unsigned int year  : 12;
    unsigned int month : 4;
    unsigned int day   : 5;
};
```

![image](https://hackmd.io/_uploads/SyNzLqkRyg.png)

Because of the 4-byte alignment, `Date_t` will still be 4 bytes in size. However, each field is now much easier to access.

* Augmentation: add information to a data structure to make common operations do less work.
* Precomputation: compute whatever can be computed at compile time, or generate a look-up table. (metaprogramming)
* Compile-time Initialization: as above
* Caching
* Sparsity: [Sparse Matrix Multiplication](https://github.com/Chen-KaiTsai/GPGPU_OpenCL/tree/main/sparse_matrix)

**Loops**

* Hoisting
* Sentinels
* Loop Unrolling
* Loop Fusion
* Eliminating Wasted Iterations

**Logic**

* Constant Folding & Propagation: `constexpr`
* Common-subexpression Elimination
* Algebraic Identities
* Short-circuiting: stop a loop early once the answer is acquired
* Ordering tests
* Combining tests

**Functions**

* Inlining
* Tail-recursion Elimination
* Coarsening Recursion

# 3. Bit Hacks

Basically, these tricks are good to know; however, a modern optimizing compiler can usually do a better job than you. Therefore, most of the time there is no need for this kind of optimization.

## My Other Post

* About Bit Hacks with `xor`- or `tmp`-based swap
  https://hackmd.io/FiPAsD0pQfempm4gPABisQ?view#Cache-Oblivious-Algorithms

## Binary Representation

### My Other Post

* Number Representation and Computation
  https://hackmd.io/JAZbc0TpTjSH-EqTowxYUQ#Number-Representation-and-Computation

:::info
:information_source: **two's complement**
No need to review.
:::

* `x + ~x == -1`
* `y = x | (1 << k);` Set the `k`th bit
* `y = x & ~(1 << k);` Clear the `k`th bit
* `y = x ^ (1 << k);` Toggle the `k`th bit
* `y = x & (1 << k);` Check if a bit is set (check `y != 0`)
* `(x & mask) >> shift;` Extract a bit field (`shift` moves the result down to the lower bits)
* `x = (x & ~mask) | (y << shift);` Set a bit field
* `r = y ^ ((x ^ y) & -(x < y));` Branchless minimum (remember that `x ^ x` gives 0)
* Round up to a power of 2. [How it works](https://youtu.be/ZusiKXcz_ac?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=2427)

```cpp
uint64_t n;
--n;
/*
 * Decrementing first makes sure that if the number is already a power of 2,
 * the bit to the right of the most significant set bit is set, so we can
 * propagate all the bits to the right of it.
 */
n |= n >> 1;
n |= n >> 2;
n |= n >> 4;
n |= n >> 8;
n |= n >> 16;
n |= n >> 32;
n++;
```

* Compute `lg N` where `N` is a power of 2. (Find the position of the single set bit; a number is a power of 2 exactly when only one bit is set.) [How it works.](https://youtu.be/ZusiKXcz_ac?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=3060)

```cpp
const uint64_t deBruijn = 0x022fdd63cc95386d;
const int convert[64] = {
     0,  1,  2, 53,  3,  7, 54, 27,
     4, 38, 41,  8, 34, 55, 48, 28,
    62,  5, 39, 46, 44, 42, 22,  9,
    24, 35, 59, 56, 49, 18, 29, 11,
    63, 52,  6, 26, 37, 40, 33, 47,
    61, 45, 43, 21, 23, 58, 17, 10,
    51, 25, 36, 32, 60, 20, 57, 16,
    50, 31, 19, 15, 30, 14, 13, 12
};
r = convert[(x * deBruijn) >> 58];
```

# 4. Assembly Language & Computer Architecture

*I know basic assembly.*

## My Other Posts

* AI Compiler Series
  https://hackmd.io/@Erebustsai/Hy5pPhn2n
* AI System by chenzomi12
  https://hackmd.io/@Erebustsai/ryxabwGNSA
* SIMD Programming Notes
  https://hackmd.io/@Erebustsai/HkdXPx-rh

## Opcode Suffixes

> Opcodes might be augmented with a suffix that describes the data type of the operation or a condition code.
An opcode for data movement, arithmetic, or logic uses a single-character suffix to indicate the data type.

```
movq -16(%rbp), %rax ; q stands for quad word => 64-bit
                     ; Note that a word is 16 bits (the 8086 CPU was 16-bit based)
```

![image](https://hackmd.io/_uploads/HyAnK4M0ke.png)

## Condition Codes

[Reference](https://youtu.be/L1ung0wil9Y?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf)

## Assembly Idioms

[Reference](https://youtu.be/L1ung0wil9Y?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf)

:::info
:information_source: **Tricks for ASM**
* `xor %rax, %rax`: this zeroes out the register
:::

## Floating-Point Instruction Set

### My Other Post

* Number Representation and Computation
  https://hackmd.io/JAZbc0TpTjSH-EqTowxYUQ#Number-Representation-and-Computation

## SSE & AVX Vector Opcodes

[Reference](https://youtu.be/L1ung0wil9Y?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf)

## Instruction Pipeline

### My Other Post

* GPU / NPU / CPU Hardware Design
  https://hackmd.io/@Erebustsai/rkamk1h31e

# 5. C to Assembly

## My Other Post

* AI Compiler Series
  https://hackmd.io/@Erebustsai/Hy5pPhn2n

## C to LLVM IR

### LLVM IR vs. Assembly

**Similarities**

* LLVM IR uses a simple instruction format: `<destination operand> = <opcode> <source operands>`
* LLVM IR code adopts a similar structure to assembly code.
* Control flow is implemented using conditional and unconditional branches.

**Simplifications**

* Smaller instruction set
* Infinite LLVM IR registers, similar to variables in C
* No implicit FLAGS register or condition codes
* No explicit stack pointer or frame pointer
* C-like type system (strongly typed)
* C-like functions

### LLVM IR Registers

* `%<name>`
* Infinite number of registers
* Register names are local to each LLVM IR function

### LLVM IR Instructions

* Instructions that produce a value: `%<name> = <opcode> <operand list>`
* Other instructions: `<opcode> <operand list>`
* Operands are registers, constants, or basic blocks

[Common LLVM IR Instructions](https://youtu.be/wt7a5BOztuM?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=898)

### LLVM IR Data Types

* Integers: `i<number>`, e.g. a 64-bit integer `i64`, a 1-bit integer `i1`
* Floating-point values: `double`, `float`
* Arrays: `[<number> x <type>]`, e.g. an array of 5 `int`s: `[5 x i32]`
* Structs: `{<type>, ...}`
* Vectors: `< <number> x <type> >` (SIMD operations)
* Pointers: `<type>*`, e.g. an 8-bit integer pointer: `i8*`
* Labels (i.e., basic blocks): `label`

:::success
:bulb: **Straight-line C Code to LLVM IR Examples**
[Reference](https://youtu.be/wt7a5BOztuM?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=1119)
:::

### LLVM IR Functions

```clike
int64_t fib(int64_t n) {
    ...
    return n;
}

define i64 @fib(i64) local_unnamed_addr #0 {
    ...
    ret i64 %0
}
```

LLVM IR function parameters map directly to their C counterparts.

:::success
:bulb: **Basic Blocks**
[Reference](https://youtu.be/wt7a5BOztuM?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf)
:::

### C Conditionals to LLVM IR

:::info
:information_source: `; <label>:8: ; preds = %6, %8`
The `preds` value is a comma-separated list of the basic blocks or labels that are the predecessors of this label
(the blocks this one can be branched from).
:::

[Reference](https://youtu.be/wt7a5BOztuM?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=1780)

### C Loops to LLVM IR

[Reference](https://youtu.be/wt7a5BOztuM?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=2092)

:::success
:bulb: **Static Single Assignment (SSA)**
The `phi` instruction is used to resolve the SSA constraint that each register is assigned only once: it selects a value depending on which predecessor block control came from.
[Reference](https://youtu.be/wt7a5BOztuM?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=2488)
:::

### LLVM IR Attributes

[Reference](https://youtu.be/wt7a5BOztuM?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=2661)

## LLVM IR to Assembly

Three tasks:

* Select assembly instructions to implement LLVM IR instructions
* Allocate x86-64 general-purpose registers to hold values
* Coordinate function calls

### The Linux x86-64 Calling Convention

[Reference](https://youtu.be/wt7a5BOztuM?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=2997)

# 6. Multicore Programming & 7. Races and Parallelism & 8. Analysis of Multithreaded Algorithms

## My Other Post

### CPU Parallel

* Book: Parallel and High Performance Computing
  https://hackmd.io/@Erebustsai/rJfAI5w2R
* Books: The Art of ... Series?
  https://hackmd.io/@Erebustsai/S1sguFsJkg
* Books: Optimized C++ & 21st Century C & Understanding and Using C Pointers & Effective Modern C++
  https://hackmd.io/@Erebustsai/SJ6qYG7fJe
* General Performance Engineering
  https://hackmd.io/@Erebustsai/Sk39malXa
* Book: AI Embedded Systems Algorithm Optimization and Implementation
  https://hackmd.io/@Erebustsai/Sk39malXa
* MPI Programming
  https://hackmd.io/hoXOc6nIRpG3bx0u_x57uQ
* OpenMP Programming
  https://hackmd.io/@Erebustsai/HyNZhtM-C
* SIMD Programming Notes
  https://hackmd.io/@Erebustsai/HkdXPx-rh

## Cache Coherence

[Reference](https://youtu.be/dx98pqJvZVk?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=610)

### MSI Protocol

[Reference](https://youtu.be/dx98pqJvZVk?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=791)

:::success
:bulb: **Memory Consistency & Cache Coherence**
* An introduction to memory consistency and cache coherence (article in Chinese)
  https://blog.csdn.net/iNostory/article/details/119047985
:::

# 9. What Compilers Can and Cannot Do

## My Other Posts

* AI Compiler Series
  https://hackmd.io/@Erebustsai/Hy5pPhn2n
* AI System by chenzomi12
  https://hackmd.io/@Erebustsai/ryxabwGNSA

Reasons to understand what the compiler does:

1. Debug compiler optimizations that might cause problems.
2. Avoid hand-optimizing code that the compiler will optimize for you.
3. Decide what the compiler should optimize for.
4. Write code that is easier for the compiler to understand and optimize.
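Point 4 deserves a concrete case. Pointer aliasing is a classic example: if the compiler must assume the output array might overlap an input array, it cannot keep values in registers or vectorize the loop. A small sketch of my own (not from the lecture); `__restrict__` is a GCC/Clang extension, and the function name is hypothetical:

```cpp
#include <cassert>

// Without __restrict__, the compiler must assume `out` may alias `a` or `b`
// and reload them on every iteration; with it, the loop can be vectorized.
void add_arrays(float* __restrict__ out,
                const float* __restrict__ a,
                const float* __restrict__ b, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Comparing `-Rpass=loop-vectorize` output with and without the qualifiers is an easy way to see the effect.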
## Simple Model of Compiler

[Reference](https://youtu.be/ulJm7_aTiQM?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=539)

## Compiler Optimization Report

* `clang`
  https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports
* `gcc`
  https://stackoverflow.com/a/65017597

## Overview of Compiler Optimizations

* Edited from the Bentley rules
  [Reference](https://youtu.be/ulJm7_aTiQM?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=829)
* Extended list of compiler optimizations
  [Reference](https://youtu.be/ulJm7_aTiQM?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=944)

### Study Case #1

Most compiler optimizations happen while generating optimized LLVM IR, although not all of them.

[Reference](https://youtu.be/ulJm7_aTiQM?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=1148)

Notice the tricks where the compiler uses special hardware (circuitry in the CPU) that is fast. That hardware might be designed to do something entirely different, but it can still be used for computation.

:::info
:information_source: **Magic Number**
[Reference](https://youtu.be/ulJm7_aTiQM?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=1432)
Convert a division into a multiplication by a magic number. Notice that this is generated by the compiler. It shows how the compiler can often do a better job than a trained programmer.
:::

### Example #2: N-Body Simulation

[Reference](https://youtu.be/ulJm7_aTiQM?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=1571)

**My Other Post**

* N-body in GPGPU Algorithm Implementation
  https://hackmd.io/PTBuRGwiQHKPyzfe9ebMNQ#6-N-body

**Sequential Primary Routine**

```cpp
void simulate(body_t* bodies, int nbodies, int nsteps, int time_quantum) {
    for (int i = 0; i < nsteps; ++i) {
        calculate_force(nbodies, bodies);
        update_positions(nbodies, bodies, time_quantum);
    }
}
```

**Key Routine in N-Body Simulation**

```cpp
void update_positions(int nbodies, body_t* bodies, double time_quantum) {
    for (int i = 0; i < nbodies; ++i) {
        vec_t new_velocity = vec_scale(bodies[i].force, time_quantum / bodies[i].mass);
        // Update the position of bodies[i] based on the average
        // of its old and new velocity.
        bodies[i].position = vec_add(bodies[i].position,
                                     vec_scale(vec_add(bodies[i].velocity, new_velocity),
                                               time_quantum / 2.0));
        // Set the new velocity of bodies[i].
        bodies[i].velocity = new_velocity;
    }
}
```

# 10. Measurement and Timing

## Quiescing the System

### Sources of Variability

* Daemons and background jobs
* Interrupts
* Code and data alignment
* Thread placement
  [My Other Post about Thread Affinity](https://hackmd.io/vGkriUD3S-icJaVd4ubdug#Affinity--Truce-with-the-Kernel)
* Runtime scheduler
* Hyperthreading
* Multitenancy
* Dynamic voltage and frequency scaling
* Turbo Boost
  To reduce this issue, we can simply lock the frequency or turn off Turbo Boost. [See how I did it](https://hackmd.io/@Erebustsai/SJPwQzOvh).
* Network traffic

## Tools for Measuring Software Performance

### Measure the program externally

**Command**

In Linux, `/usr/bin/time`

**Output**

```bash
real  0m2.926s
user  0m1.172s
sys   0m0.302s
```

Notice that `user` + `sys` will not necessarily equal `real`, because the system might not always be running your program. `user` time is spent in user space and `sys` time is spent in the kernel.
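The wall-clock vs. CPU-time distinction can be made concrete in code. A small sketch of my own (not from the lecture): `std::clock()` counts CPU time roughly the way `user` does, while `steady_clock` counts elapsed time the way `real` does, so a sleeping program accumulates wall time but almost no CPU time.

```cpp
#include <chrono>
#include <ctime>
#include <thread>
#include <utility>

// Returns (wall seconds, CPU seconds) spent running fn().
// While a thread sleeps, wall time advances but CPU time barely does --
// the same reason `user` + `sys` can be far less than `real`.
template <typename F>
std::pair<double, double> measure(F fn) {
    auto w0 = std::chrono::steady_clock::now();
    std::clock_t c0 = std::clock();
    fn();
    std::clock_t c1 = std::clock();
    auto w1 = std::chrono::steady_clock::now();
    double wall = std::chrono::duration<double>(w1 - w0).count();
    double cpu  = double(c1 - c0) / CLOCKS_PER_SEC;
    return {wall, cpu};
}
```

For example, measuring a 100 ms sleep should report roughly 0.1 s of wall time and near-zero CPU time.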
### Instrument the program

> Reference
> * Performance comparison of C++ STL `chrono` and `clock_gettime` (article in Chinese)
>   https://www.cnblogs.com/coding-my-life/p/16182717.html

Put timing code inside the program. In this lecture, the professor recommends the C function in `glibc` called `clock_gettime()`. Notice that `chrono` in the C++ STL uses `clock_gettime()` as well, but it also does extra work to fit the STL interface. Therefore, `chrono` will generally be slower than `clock_gettime()`, though not by a large difference.

*I generally prefer `chrono` now, since I rarely deal with pure C code.*

### Interrupt the program

`gdb`, `gprof`

### Exploit hardware and operating system support

Mirror of `libpfm4`: [Github Repo](https://github.com/wcohen/libpfm4)

### Simulate the program

`cachegrind`, which is part of `valgrind`. [See My Other Post](https://hackmd.io/@Erebustsai/S1Rk0SvE2)

## My Experience

I used to work on optimizing an OpenCL-based AI framework ([the CUDA version can be found here](https://github.com/Chen-KaiTsai/CUDA_AI_Framework)). Since it is heterogeneous computing, time on the GPU and the CPU basically needs to be measured separately; however, the only measurement result that can be submitted is the one tested within the deployed system, inside the main program routine. This means the GPU is also driving VGA output and the CPU is running the OS, OpenCV drawing, the GUI interface, etc. Those things cannot be taken out of consideration, because that is how the customer wants us to present the results.

# 11. Storage Allocation

## My Other Post

* Memory Management
  https://hackmd.io/d8us7WQSQkCiY_3g4l0GyQ#Memory-Management

## Stack Storage

* Allocating and freeing take `O(1)` time. (Basically manipulating the stack pointer.)
* Limited applicability

## Heap Storage

### Heap Allocation

**Memory Checkers**: detect memory bugs in a program, including memory leaks, *dangling pointers*, and *double freeing*.
* AddressSanitizer
  * ASAN Clang: https://blog.austint.in/2024/02/16/address-sanitizer-intro.html
  * Visual Studio AddressSanitizer: https://jackkuo.org/post/vs2022_cmake_sanitizer/
* Valgrind: https://hackmd.io/@Erebustsai/S1Rk0SvE2

AddressSanitizer knows more about the program (it instruments the code at compile time), so it can find more bugs than Valgrind.

### Fixed-Size Allocation

[Reference](https://youtu.be/nmMUUuXhk2A?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=681)

### Garbage Collection

[Reference](https://youtu.be/nmMUUuXhk2A?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=2271)

**My Other Post**

* Boehm-Demers-Weiser Garbage Collector
  https://hackmd.io/7n-VHaZtQBmLUgHEmc2uIA#Boehm-Demers-Weiser-Garbage-Collector

# 12. Parallel Storage Allocation

## Allocating Virtual Memory

### Linux System Call `mmap()`

**My Other Post**

* Memory Management
  https://hackmd.io/d8us7WQSQkCiY_3g4l0GyQ#Memory-Management

## Address Translation

[Reference](https://youtu.be/d5e_YJGXXFU?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=679)

## The Hoard Allocator

[Reference](https://youtu.be/d5e_YJGXXFU?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=4072)

**My Other Post**

* The Hoard Memory Allocator
  https://hackmd.io/FiPAsD0pQfempm4gPABisQ#The-Hoard-Memory-Allocator

# 13. Skip (Cilk is Outdated)

# 14. Caching and Cache-Efficient Algorithms

*A total review of caches.*

## My Other Post

*There are cache-related issues all over these two posts.*

* General Performance Engineering
  https://hackmd.io/@Erebustsai/HJ4PO0bV6
* Book: Parallel and High Performance Computing
  https://hackmd.io/@Erebustsai/rJfAI5w2R

:::info
:information_source: **Using `lstopo`**
https://www.open-mpi.org/projects/hwloc/tutorials/20120702-POA-hwloc-tutorial.html
Since most of the systems I have come with no GUI output, I use `lstopo out.png` and then `scp` to fetch the `.png` file.
**Intel Xeon WSL Output**
![out](https://hackmd.io/_uploads/rk7mI_j0Jx.png)
**Nvidia Jetson Nano**
![out](https://hackmd.io/_uploads/HkiOtOjCJe.png)
:::

## Cache

### Cache Architecture

:::success
:information_source: **Illustrated guide to direct-mapped, fully-associative, and set-associative caches (article in Chinese)**
https://blog.csdn.net/luolaihua2018/article/details/132647066
:::

### Taxonomy of Cache Misses

[Reference](https://youtu.be/xDKnMXtZKq8?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=887)

:::info
:information_source: **The difference between capacity misses and conflict misses (article in Chinese)**
https://blog.csdn.net/wangquan1992/article/details/107084281
:::

* **Cold Miss**: the first time the cache block is accessed. (Cannot be avoided.)
* **Capacity Miss**: the previously cached copy would have been evicted even with a fully associative cache.
* **Conflict Miss**: too many blocks map to the same set in the cache. The block would not have been evicted with a fully associative cache.
* **Sharing Miss**: multiple processors access the same cache line. Possibly [false sharing](https://hackmd.io/@Erebustsai/S1sguFsJkg); see my other post.

### Conflict Miss Example: Submatrices

[Reference](https://youtu.be/xDKnMXtZKq8?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=1164)

The solution is similar to [bank conflicts in GPGPU](https://hackmd.io/QhzlGKAvQcGKLqBcqd2GBg?view#Shared-Memory-Band-Conflict): use padding. Pad a small amount of space (bytes) to offset the addresses and prevent conflicts.

### Ideal-Cache Model

A model for measuring the impact of cache misses on the performance of an algorithm.

[Reference](https://youtu.be/xDKnMXtZKq8?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=1614)

> "LRU" Lemma [ST85]. Suppose that an algorithm incurs `Q` cache misses on an ideal cache of size `M`. Then on a fully associative cache of size `2M` that uses the least-recently used (LRU) replacement policy, it incurs at most `2Q` cache misses.
This means the difference between an ideal cache and an LRU cache is bounded by a constant factor, so we can use either of them when analyzing an algorithm.

### Tall Cache Assumption

[Reference](https://youtu.be/xDKnMXtZKq8?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=2202)

**My Other Post**

* Tall cache assumption
  https://hackmd.io/FiPAsD0pQfempm4gPABisQ?view#External-Sorting

Basically it means that the cache line size (64 bytes) should be much smaller than the number of cache lines.

* Cache line size: `B`
* Cache total size: `M`
* Number of cache lines: `M/B`

![image](https://hackmd.io/_uploads/SJfnr4T01l.png =x30)

### Cache-Efficient Matrix Multiplication

**My Other Post**

* [Article Study: Beating NumPy's Matrix Multiplication in 150 Lines of C Code](https://hackmd.io/FiPAsD0pQfempm4gPABisQ?view#Article-Study--Beating-NumPys-Matrix-Multiplication-in-150-Lines-of-C-Code)

### Cache-Oblivious Algorithms

[Reference](https://youtu.be/xDKnMXtZKq8?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=3785)

**Use Recursion**

# 15. Cache-Oblivious Algorithms

Code does not need to know the cache size of the CPU to use the cache efficiently.
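The recursive idea can be sketched with a classic example of my own (the lecture uses matrix multiplication and stencils instead): a cache-oblivious out-of-place matrix transpose that keeps splitting the larger dimension in half. Once a subproblem fits in some level of cache, all work on it stays in that level, without the code ever knowing `B` or `M`.

```cpp
#include <vector>

// Cache-oblivious transpose of the [r0, r1) x [c0, c1) block of a row-major
// n x n matrix `a` into `b` (so b[j][i] = a[i][j]). Recursively split the
// larger dimension; the base case is a small tile handled directly.
void transpose(const double* a, double* b, int n,
               int r0, int r1, int c0, int c1) {
    int rows = r1 - r0, cols = c1 - c0;
    if (rows <= 16 && cols <= 16) {        // small base case
        for (int i = r0; i < r1; ++i)
            for (int j = c0; j < c1; ++j)
                b[j * n + i] = a[i * n + j];
    } else if (rows >= cols) {             // split the larger dimension
        int rm = r0 + rows / 2;
        transpose(a, b, n, r0, rm, c0, c1);
        transpose(a, b, n, rm, r1, c0, c1);
    } else {
        int cm = c0 + cols / 2;
        transpose(a, b, n, r0, r1, c0, cm);
        transpose(a, b, n, r0, r1, cm, c1);
    }
}
```

Calling `transpose(a, b, n, 0, n, 0, n)` transposes the whole matrix; the base-case size 16 is only a recursion-overhead cutoff, not a cache parameter.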
## 2D Heat-Diffusion Simulation

[Reference](https://youtu.be/xwE568oVQ1Y?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=87)

### 3-Point Stencil

[Reference](https://youtu.be/xwE568oVQ1Y?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=529)

### Cache-Oblivious Stencil Computations

[Reference](https://youtu.be/xwE568oVQ1Y?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=1402)

### Hardware Prefetching Behavior Helping the Looping Version

[Reference](https://youtu.be/xwE568oVQ1Y?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=2673)

## Going Parallel

### Parallel Space-cut

[Reference](https://youtu.be/xwE568oVQ1Y?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=2931)

## Merge Sort

**My Github Repos**

* MergeSort with OpenMP
  https://github.com/Chen-KaiTsai/PerformanceEngineering_repo/tree/main/OpenMP_Algorithms/MergeSort
* MergeSort with MPI
  https://github.com/Chen-KaiTsai/PerformanceEngineering_repo/tree/main/MPI_Algorithms/MergeSort

### Multiway Merging

*This is not implemented in the above repos.*

[Reference](https://youtu.be/xwE568oVQ1Y?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=4206)

However, this algorithm is not cache-oblivious.

## Optimal Cache-Oblivious Sorting

**Funnelsort [FLPR99]**

* Wiki
  https://en.wikipedia.org/wiki/Funnelsort

# 16. Nondeterministic Parallel Programming

:::info
:information_source: **Difference between Deterministic and Non-deterministic Algorithms**
https://www.geeksforgeeks.org/difference-between-deterministic-and-non-deterministic-algorithms/
:::

## Golden Rule of Parallel Programming

**Never write nondeterministic parallel programs.**

## Silver Rule of Parallel Programming

**Never write nondeterministic parallel programs, but if you must, always devise a test strategy to manage the nondeterminism.**

:::success
:bulb: **Example of Nondeterminism in Memory Allocation**
The OS will not give the same memory locations for `malloc` across different runs of a program. However, this randomization can be turned off so that every run produces the same memory locations
(debug mode).
:::

### Strategies for Testing Nondeterministic Programs

* Turn off nondeterminism
* Encapsulate nondeterminism
* Substitute a deterministic alternative
* Use analysis tools

## Mutex

## Properties of Mutexes

[Reference](https://youtu.be/mXkPCaZUXhg?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=2150)

* Yielding / Spinning
* Reentrant / Nonreentrant
  > A reentrant mutex allows a thread that is already holding a lock to acquire it again. A nonreentrant mutex deadlocks if the thread attempts to reacquire a mutex it already holds.
* Fair / Unfair
  > A fair mutex puts blocked threads on a FIFO queue, and the unlock operation unblocks the thread that has been waiting the longest. An unfair mutex lets any blocked thread go next.

:::info
:information_source: **Thread Starvation**
A thread never gets a chance to run.
:::

### Competitive Mutex

:::success
:bulb: **Context Switching Rate**
In the lecture, the professor provides a number: **~10 milliseconds**
* 3 different methods to count the number of context switches for a specific workload provide 2 different answers
  https://serverfault.com/questions/991416/3-different-methods-to-count-the-number-of-context-switches-for-a-specific-workl
* Understanding Context Switching and Its Impact on System Performance
  https://www.netdata.cloud/blog/understanding-context-switching-and-its-impact-on-system-performance/
:::

[Reference](https://youtu.be/mXkPCaZUXhg?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=3199)

From the above, we know that if we yield a thread, it takes about 10 milliseconds to possibly get it back. Notice that CPU instruction rates are in GHz, so an instruction takes around `10^-9` seconds versus a `10^-2`-second context switch. Therefore the simple and good solution is to *spin a little bit, and then yield*.

**How long should we spin?**

:::success
> As long as a context switch takes. Then, you never wait longer than twice the optimal time.
:::

:::info
:information_source: **Ski Rental Problem**
[Reference](https://youtu.be/mXkPCaZUXhg?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=3302)
* Wiki
  https://en.wikipedia.org/wiki/Ski_rental_problem
:::

## My Other Post

These basically cover the rest of the lecture.

* [Chapter 7: Concurrency API](https://hackmd.io/7n-VHaZtQBmLUgHEmc2uIA?view#Chapter-7--Concurrency-API)
* [Books: The Art of ... Series?](https://hackmd.io/@Erebustsai/S1sguFsJkg)
* [Book: Parallel and High Performance Computing](https://hackmd.io/@Erebustsai/rJfAI5w2R)

# 17. Synchronization Without Locks

***The entire lecture can be seen as one whole thinking process. I will rewatch the entire thing, along with the other books listed below, if needed.***

## My Other Post

* [Books: The Art of ... Series?](https://hackmd.io/@Erebustsai/S1sguFsJkg)
* [Book: Parallel and High Performance Computing](https://hackmd.io/@Erebustsai/rJfAI5w2R)

## Memory Models

[Reference](https://youtu.be/5sZo3SrLrGA?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=65)

*If you forget the definitions, just watch the tutorial in full.*

### Sequential Memory Consistency

[Reference](https://youtu.be/5sZo3SrLrGA?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=254)

### The Reason for Relaxed Memory Consistency

[Reference](https://youtu.be/5sZo3SrLrGA?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=2538)

### x86-64 Total Store Order

[Reference](https://youtu.be/5sZo3SrLrGA?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=2819)

### Memory Fences

[Reference](https://youtu.be/5sZo3SrLrGA?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=3586)

Basically, `atomic` in both C & C++ acts like a memory fence.

# 20. Speculative Parallelism & Leiserchess

:::info
:information_source: **Definition on Wiki**
https://en.wikipedia.org/wiki/Speculative_multithreading
:::

# 22. Graph Optimization

## My Other Post

* Book: C++ Data Structure & Algorithm Design
  https://hackmd.io/@Erebustsai/S1ThGKDYa
* Course: Advanced Analysis of Algorithms
  https://hackmd.io/@Erebustsai/SyfT89rsC
* GPGPU Algorithm Implementation
  https://hackmd.io/@Erebustsai/Byul7e-Up

## Graph Representation

My Other Post: https://hackmd.io/@Erebustsai/Byul7e-Up#7-Sparse-Matrix

Notice that the above representation is for a sparse matrix.

### CSR Format

* Two arrays: `Offset` & `Edges`
* `Offset[i]` stores the offset of where vertex `i`'s edges start in `Edges`.

![image](https://hackmd.io/_uploads/H1WATe-Hxl.png)

[Reference](https://youtu.be/IT_4fw6gfJw?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=505)

## BFS (Breadth-First Search)

### Pseudocode

In the following pseudocode, `n` is the number of vertices and `m` the number of edges.

```
Breadth-First-Search(Graph, root):
    for each node v in Graph:
        v.distance = INFINITY
        v.parent = NIL
    create empty queue Q
    root.distance = 0
    Q.enqueue(root)
    while Q is not empty:
        current = Q.dequeue()
        for each node v that is adjacent to current:
            if v.distance == INFINITY
                v.distance = current.distance + 1
                v.parent = current
                Q.enqueue(v)
```

### Github Repo

TODO

[Reference](https://youtu.be/IT_4fw6gfJw?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=1546)

## Serial BFS and Optimization with a Bitvector

[Reference](https://youtu.be/IT_4fw6gfJw?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=1724)

A bitvector can be used to reduce cache misses, since the bitvector is much smaller and can probably fit into the cache.

## Parallel BFS

Each stage of the frontier can be processed in parallel.

![image](https://hackmd.io/_uploads/r1Ig0-ZSee.png =x350)

There are potential races, since different nodes in the frontier can visit the same neighbor. Load balancing is an issue as well.

### Pseudocode in C++

Notice that this is pseudocode written in C++; it will definitely **NOT** compile and run correctly.
```cpp
/**
 * @brief PSEUDOCODE, parallel version
 * @note std::partial_sum can be used as a prefix sum
 * @note should be using C11 threads, not OpenMP
 * @note vector.push_back could replace the prefix scan used to update the frontier
 * @todo a working version
 */
void BFS_parallel(const int* __restrict__ offsets, const int n,
                  const int* __restrict__ edges, const int m, const int source)
{
    std::vector<std::atomic_int> parent(n); // pseudocode: every entry initialized to -1
    std::vector<int> frontier(n);
    std::vector<int> frontierNext(m);
    std::vector<int> frontierMark(m);
    std::vector<int> frontierIdx(m);
    std::vector<int> degrees(n);
    std::vector<int> degreesIdx(n);

    frontier[0] = source;
    int frontierSize = 1;
    parent[source] = source;

    while (frontierSize > 0) {
        // Compute the degree of every frontier vertex.
        #pragma omp parallel for
        for (int i = 0; i < frontierSize; i++)
            degrees[i] = offsets[frontier[i] + 1] - offsets[frontier[i]];

        // The prefix sum of the degrees gives each vertex its output slot in frontierNext.
        std::partial_sum(degrees.begin(), degrees.begin() + frontierSize, degreesIdx.begin());

        std::atomic_int frontierSizeNext{0};
        #pragma omp parallel for
        for (int i = 0; i < frontierSize; i++) {
            int v = frontier[i];
            int index = (i == 0) ? 0 : degreesIdx[i - 1];
            for (int j = 0; j < degrees[i]; j++) {
                int ngh = edges[offsets[v] + j];
                int expected = -1;
                // Claim ngh atomically: only one thread wins the CAS from -1 to v.
                if (parent[ngh].compare_exchange_strong(expected, v)) {
                    frontierNext[index + j] = ngh;
                    frontierMark[index + j] = 1;
                    frontierSizeNext++;
                } else {
                    frontierNext[index + j] = -1;
                    frontierMark[index + j] = 0;
                }
            }
        }

        // Filter out the -1 entries: the prefix scan of the mark vector gives
        // each surviving vertex its index in the new frontier array.
        int total = degreesIdx[frontierSize - 1];
        std::partial_sum(frontierMark.begin(), frontierMark.begin() + total, frontierIdx.begin());
        for (int j = 0; j < total; j++) {
            if (frontierNext[j] != -1)
                frontier[frontierIdx[j] - 1] = frontierNext[j];
        }
        frontierSize = frontierSizeNext;
    }
}
```

[Reference](https://youtu.be/IT_4fw6gfJw?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=2540)

## Two Ways to Do BFS

[Reference](https://youtu.be/IT_4fw6gfJw?list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&t=3790)
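For reference, here is a minimal serial BFS over the CSR format that does compile and run (my own sketch; the Github repo above is still TODO). It returns per-vertex distances instead of parents, which is easy to check:

```cpp
#include <queue>
#include <vector>

// Serial BFS over a CSR graph: offsets has n+1 entries, and the edges of
// vertex v are edges[offsets[v] .. offsets[v+1]).
// Returns the distance of every vertex from source (-1 if unreachable).
std::vector<int> bfs_csr(const std::vector<int>& offsets,
                         const std::vector<int>& edges, int source) {
    int n = static_cast<int>(offsets.size()) - 1;
    std::vector<int> dist(n, -1);
    std::queue<int> q;
    dist[source] = 0;
    q.push(source);
    while (!q.empty()) {
        int v = q.front(); q.pop();
        for (int e = offsets[v]; e < offsets[v + 1]; ++e) {
            int ngh = edges[e];
            if (dist[ngh] == -1) { // first visit sets the distance
                dist[ngh] = dist[v] + 1;
                q.push(ngh);
            }
        }
    }
    return dist;
}
```

This serial version is also the natural baseline for checking the output of the parallel version once it becomes working code.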