# Books : The Art of ... Series?

> Reference \:
> The Art of Concurrency
> https://www.tenlong.com.tw/products/9789863471257
> The Art of Writing Efficient Programs
> https://www.tenlong.com.tw/products/9781800208117
> Book Github Repository
> https://github.com/PacktPublishing/The-Art-of-Writing-Efficient-Programs

*In this post, I will only include some topics that interest me and some extra material that I read while studying these books.*

*This post covers two books that I read. One of them is a new book that I got access to recently; the other is an old book that I have had since college. I decided to review the old one and study the new one side by side, using the old book as extra material.*

*Algorithms mentioned in the books will be implemented and uploaded to my github repo.*

# Introduction & Fundamentals

## What cannot be parallelized\?

* Algorithms that contain state \: seed of a random generator, file I\/O.
* Recursion
* Induction Variable \: any variable that increases or decreases over loop iterations with no one-to-one relationship to the iteration variable \(`i`\).
* Reduction
* Dependency over iterations

:::success
An easy way to identify whether a loop can be parallelized \: check if the same result is acquired when iterating the loop in reverse order.
:::

## Parallel Programming Design Considerations

* Efficiency
* Simplicity
* Portability
* Scalability

Among the above criteria, scalability and efficiency are the more important ones.

### Thread Pools

> Reference \:
> * 並行程式設計: Thread Pool 實作和改進 by sysprog
> https://hackmd.io/@sysprog/concurrency-thread-pool
> * Thread Pool in C++
> https://www.geeksforgeeks.org/cpp/thread-pool-in-cpp/
> * Thread Pool In C++ Video Tutorial by CppNuts
> https://www.youtube.com/watch?v=u7ouCuieBhI
> * Thread pool:簡單的 C 語言實作
> https://www.bigcatblog.com/thread_pool/

C++ does not have a built-in thread pool implementation. Therefore, the best options are to use one that exists online or to write our own.

:::success
Thread pools enhance overall performance by lowering the overhead of thread creation and destruction through thread reuse.
:::

**Implementation**

* Thread Array
* Task Queue

The simple implementations found online use a `mutex` and a `condition_variable` to guard the task queue: the thread that successfully acquires the lock waits on the `condition_variable`, while the others wait for the lock. A minimal sketch is shown below.
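Below is a minimal sketch of such a pool (my own illustration, not the book's code; C++11 or later assumed, and names like `ThreadPool` and `enqueue` are my own choices) \:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal sketch: a fixed-size thread array plus a mutex/condition_variable
// guarded task queue.
class ThreadPool {
public:
    explicit ThreadPool(size_t n) {
        for (size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { worker_loop(); });
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> g(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> g(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one(); // wake one waiting worker
    }
private:
    void worker_loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> l(m_);
                cv_.wait(l, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // run the task outside the lock
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};
```

Usage is simply `pool.enqueue([]{ /* work */ });`; the destructor drains the remaining tasks and joins the workers.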
### Lock-free Thread Pool
> Reference \:
> * 並行程式設計: Lock-Free Programming
> https://hackmd.io/@sysprog/concurrency/%2F%40sysprog%2Fconcurrency-lockfree
> * 高效執行緒池之無鎖化實現(Linux C)
> https://blog.csdn.net/xhjcehust/article/details/45844901
> * LFTPool Github Repository
> https://github.com/xhjcehust/LFTPool

:::success
:bulb: **What is std::jthread in c++20?**
https://stackoverflow.com/questions/62325679/what-is-stdjthread-in-c20
:::

### Lock-free Ring Buffer
> * 並行程式設計: Ring buffer
> https://hackmd.io/@sysprog/concurrency-ringbuffer
> **Notice that the above article uses different notation for `head` and `tail` than all the other tutorials.**
> * Wiki
> https://zh.wikipedia.org/zh-tw/%E7%92%B0%E5%BD%A2%E7%B7%A9%E8%A1%9D%E5%8D%80
> * 環形緩衝區(Ring Buffer) for C
> https://cms.35g.tw/coding/%E7%92%B0%E5%BD%A2%E7%B7%A9%E8%A1%9D%E5%8D%80%EF%BC%88ring-buffer-for-c/
> * linux核心之無鎖緩衝佇列kfifo原理(結合項目實踐)
> https://blog.csdn.net/weixin_44062361/article/details/108749870
> * Low-Latency Lock-Free Ring-Buffer in C - Lock Free Programming
> https://www.youtube.com/watch?v=aYwmopy6cdY
> https://github.com/sadhbh-c0d3/lock-free

* `head` points to the next ***write*** position
* `tail` points to the next ***read*** position

If `tail` points to the same location as `head`, there is nothing to read \(no data stored\); if advancing `head` would make it catch up to `tail`, the buffer is full. In practice, one slot is usually left unused so the full and empty states can be distinguished.

Notice that for [並行程式設計: Ring buffer](https://hackmd.io/@sysprog/concurrency-ringbuffer), the above notation is reversed.

### Other Topics

:::info
:information_source: **Granularity**
*The Art of Concurrency* provides an easy-to-understand diagram, shown below.
![image](https://hackmd.io/_uploads/HyfzVPs-1g.png =400x)
Notice that the granularity chosen for a parallel algorithm heavily depends on the target hardware. For example, GPUs need smaller granularity than CPUs, since GPUs can invoke many threads at once. That is why they are referred to as \"many-core\" while CPUs are only considered \"multi-core\".
:::

:::info
:information_source: **False Sharing**
* **False sharing - Wikipedia**
https://en.wikipedia.org/wiki/False_sharing
* **What is \"false sharing\"? How to reproduce \/ avoid it?**
https://stackoverflow.com/questions/22766191/what-is-false-sharing-how-to-reproduce-avoid-it
* **False Sharing 原理動態展示**
https://blog.darkthread.net/blog/false-sharing-demo-with-jquery-animation/
:::

:::info
:information_source: **Static Scheduling \& Dynamic Scheduling**
Dynamic Scheduling \: Use a `queue` to store the jobs that need to be processed. To make the most of CPU resources, use a `thread pool` to control the forked threads. A producer\/consumer pattern can be used with the `queue` and the `thread pool`.

A `Thread Pool` implementation can be found at \:
https://github.com/bshoshany/thread-pool?tab=readme-ov-file#bsthread_pool-a-fast-lightweight-and-easy-to-use-c17-thread-pool-library

Reference \:
* **並行程式設計: Thread Pool 實作和改進**
https://hackmd.io/@sysprog/concurrency-thread-pool
:::

:::info
:information_source: **GCC vs Clang**
https://blog.csdn.net/Zhouzi_heng/article/details/117231696
https://www.cnblogs.com/luoweifu/p/18348981
:::

## Measuring Performance

*The first rule of performance is: never guess about performance.*

> Reference\:
> Book Page 19

### Measure it in Code

:::info
:information_source: **`chrono` vs `clock_gettime`**
`chrono` is part of the C++ STL, while `clock_gettime` is for POSIX systems.
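A minimal side-by-side timing sketch (my own example; the `clock_gettime` part assumes a POSIX system) \:

```cpp
#include <chrono>
#include <cstdio>
#include <ctime>

int main() {
    // C++ STL: portable, monotonic clock
    auto t0 = std::chrono::steady_clock::now();
    // ... code under measurement ...
    auto t1 = std::chrono::steady_clock::now();
    std::printf("chrono: %lld ns\n",
        (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());

    // POSIX: clock_gettime with a monotonic clock
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    // ... code under measurement ...
    clock_gettime(CLOCK_MONOTONIC, &b);
    long long ns = (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
    std::printf("clock_gettime: %lld ns\n", ns);
    return 0;
}
```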
* C++ STL chrono和clock_gettime的性能對比
https://www.cnblogs.com/coding-my-life/p/16182717.html
:::

### Measure it with Profiler

Types of profilers \:

* Run the program in a virtual machine and observe its timing; however, it runs slower than in the native environment.
* Require the code to be instrumented with special instructions during compilation or linking; the code needs to be compiled specifically for profiling.
* Use the hardware counters present on all modern CPUs.

### Using `perf`

On my system, when using `perf` for the first time, a message denying access to kernel performance events is shown, as follows \:

```clike
Error:
Access to performance monitoring and observability operations is limited.
Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
access to performance monitoring and observability operations for processes
without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
More information can be found at 'Perf events and tool security' document:
https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
perf_event_paranoid setting is 4:
  -1: Allow use of (almost) all events by all users
      Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow raw and ftrace function tracepoint access
>= 1: Disallow CPU event access
>= 2: Disallow kernel profiling
To make the adjusted perf_event_paranoid setting permanent preserve it
in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
```

To enable access, run `sudo sysctl kernel.perf_event_paranoid=<parameter>` as described in
https://askubuntu.com/questions/1471162/how-to-change-kernel-perf-event-paranoid-settings

> This profiler uses hardware performance counters and time-based sampling; it does not require any instrumentation of the code.

As described in the book, we can compile the code with `-O3` or `-O2` and the profiler will still work on the binary.

**Sample Output**

```clike
 Performance counter stats for './ai_framework':

            498.93 msec task-clock                       #    0.879 CPUs utilized
               171      context-switches                 #  342.732 /sec
                 0      cpu-migrations                   #    0.000 /sec
             5,120      page-faults                      #   10.262 K/sec
       743,461,321      cycles                           #    1.490 GHz
        87,207,894      stalled-cycles-frontend          #   11.73% frontend cycles idle
       141,030,929      stalled-cycles-backend           #   18.97% backend cycles idle
       898,876,293      instructions                     #    1.21  insn per cycle
                                                         #    0.16  stalled cycles per insn
       196,502,029      branches                         #  393.845 M/sec
         8,224,268      branch-misses                    #    4.19% of all branches

       0.567760077 seconds time elapsed

       0.157319000 seconds user
       0.342874000 seconds sys
```

CPUs can count many different types of events, but only a few types at a time. Therefore, the limited number of counters has to be assigned to the event types of interest.
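For example, we can ask `perf stat` to count only the event types we care about with the `-e` option (the binary name here is just the one from the sample output above) \:

```clike
perf stat -e cycles,instructions,cache-references,cache-misses,branch-misses ./ai_framework
```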
:::success
https://en.wikipedia.org/wiki/Hardware_performance_counter
:::

We can make better use of `perf` with a GUI frontend such as the following \:
https://github.com/KDAB/hotspot

## Branchless Computing

*Basically, in my opinion, this means changing branching into a **Look-Up Table**. The same topic can be found in my other post https://hackmd.io/FiPAsD0pQfempm4gPABisQ?view#Instruction-Level-Parallelism*

> Reference \:
> The Art of Writing Efficient Programs **Page 104**
> https://www.tenlong.com.tw/products/9781800208117

### Branching Code Example

```cpp
unsigned long *data {...}; // an N-sized data array
bool *b1 {...};            // an N-sized boolean array
unsigned long a1 = 0, a2 = 0;
for (size_t i = 0; i < N; ++i) {
    if (b1[i]) {
        a1 += data[i];
    } else {
        a2 += data[i];
    }
}
```

### Branchless Version

```cpp
unsigned long *data {...}; // an N-sized data array
bool *b1 {...};            // an N-sized boolean array
unsigned long a1 = 0, a2 = 0;
unsigned long* a[2] = { &a2, &a1 }; // a[0] accumulates the 'false' case, a[1] the 'true' case
for (size_t i = 0; i < N; ++i) {
    *a[b1[i]] += data[i]; // index with the boolean instead of branching
}
```

The book points out that some compilers implement the `? :` operator with branchless code \(a lookup or conditional move\) instead of a conditional branch whenever possible. With such a compiler, we can gain the same performance benefit by rewriting our loop body as follows \:

```cpp
for (size_t i = 0; i < N; ++i) {
    (b1[i] ? a1 : a2) += data[i];
}
```

## Memory Speed and CAS \(Column Address Strobe\) Latency

https://www.crucial.tw/articles/about-memory/difference-between-speed-and-latency

**Sample Output** \(a `dmidecode` memory device entry\)

```clike
Handle 0x003B, DMI type 17, 84 bytes
Memory Device
        Array Handle: 0x0033
        Error Information Handle: 0x003A
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 64 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM 0
        Bank Locator: P0 CHANNEL A
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered) LRDIMM
        Speed: 2400 MT/s
        Manufacturer: Samsung
        Serial Number: 33C2C0A4
        Asset Tag: Not Specified
        Part Number: M386A8K40BM1-CRC
        Rank: 4
        Configured Memory Speed: 2400 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: Unknown
        Module Manufacturer ID: Bank 1, Hex 0xCE
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 64 GB
        Cache Size: None
        Logical Size: None
```

### Memory Timing

![image](https://hackmd.io/_uploads/Hy23Ikafyl.png)

> Let's start with tRAS, whose meaning is the simplest \: RAS stands for Row Address Strobe, and tRAS is the number of clock cycles needed to activate an entire row.
> The next timing parameter is tRCD. RCD stands for RAS to CAS Delay \: the delay between the memory finishing the activation of a row and the start of column selection.
> The third timing parameter is tRP. RP stands for RAS Precharge \: the number of clock cycles between closing the currently activated row and starting to activate another row.
> CL stands for CAS \(Column Address Strobe\) Latency. As mentioned above, if the data to be accessed all lie in the same row, no other row needs to be activated and another column of the same row can be accessed directly; CL is the number of clock cycles between the column address being issued and the data actually being returned.

:::success
:bulb: **ECC vs non-ECC**
https://community.fs.com/article/ecc-vs-non-ecc-memory-which-one-is-better.html
:::

### Memory Model

> * Cross Reference \:
> https://hackmd.io/d8us7WQSQkCiY_3g4l0GyQ#C-Memory-Model

When a CPU updates a value, it updates it in the cache, not in main memory directly. The key concept in a memory model is the **visibility** of an updated shared variable.

**Memory Order** is a hardware feature that can be activated with a machine instruction.

* **relaxed order** \:
> When an atomic operation is executed with relaxed order, the only guarantee we have is that the operation itself is executed atomically.
> ...
> The CPU that runs our thread executes these operations in a certain order.
It may not be the order in which they are written in the program.
> ...
> One CPU executes operations in a certain order, but their results are visible to other CPUs in a very different order.

* **acquire-release memory order** \:
> When an atomic operation is executed with this order, we have a guarantee that any operation that accesses the memory and was executed before the atomic operation becomes visible to another thread before that thread executes an atomic operation on the same atomic variable.

Additionally, both threads must use `acquire-release memory order`; otherwise, there is no guarantee. The book notes that this kind of memory order guarantee acts as a barrier, which can be thought of as analogous to barriers in OpenCL or CUDA C.

* **acquire memory order**
* **release memory order**

The above two orders should be used as a pair and are perfect for the typical producer\/consumer thread relationship.

* **sequentially consistent** \:
The default order, which makes sure that threads behave as if operations were executed in program order.

:::info
:information_source: **Spectre \& Meltdown**
> Reference\:
> https://www.cc.ntu.edu.tw/chinese/epaper/0045/20180620_4508.html
:::

## Thread, Memory and Concurrency

:::success
:bulb: **Assumption on Hyper-threading**
The book points out that the programmer should not take hyper-threading into account when designing the program. The only thing we need to do is test how many threads give the best performance. **Testing is everything.**
:::

### Threads and Caches

Each thread's writes to memory are bounded by the sizes of the L1, L2, and L3 caches. Recall that L1 and L2 are private to each core, while the L3 cache is shared by the entire CPU.

![image](https://hackmd.io/_uploads/BkMyUcfJZx.png)

> Reference\: Book Page 167
> With the memory speed normalized, so it's always 1 for the single thread, it is much easier to see that for small data sets that fit into L1 or L2 cache, the memory speed per thread remains almost the same even for 16 threads \(each thread is writing at 80% of its single-threaded speed\). **However, as soon as we cross into the L3 cache or exceed its size, the speed goes down after 4 threads.** Going from 8 to 16 threads provides only minimal improvement. There just isn't enough bandwidth in the system to write data to memory fast enough.

# Advanced Concurrency

> Reference\: Book Page 199

:::success
*Even the best design can be ruined by poor implementation.*
...
*These low-level details are often difficult to analyze, hard to express in code, and require very careful coding. This is not the kind of code you want to be scattered around in your program, so the encapsulation of the tricky code is a necessity. We will have to give some thought to the best way to encapsulate this complexity.*
:::

## Locks, alternatives and their performance

### Using Mutex \(Lock-based\)

Using a `mutex` simply means putting every access to shared data inside the critical section created by locking and unlocking the mutex. \(Notice that a `lock_guard` locks the `mutex` when created and unlocks it when destroyed.\)

```cpp
{
    std::lock_guard l(m);
    ++count;
}
```

### Using Atomic \(Wait-free\)

```cpp
std::atomic<int> count;
//...
// no memory order specified (defaults to sequentially consistent)
++count;
// with an explicit memory order
count.fetch_add(1, std::memory_order_relaxed);
```

To decide which memory order to use \: if the count is later used to index an array, we need release-acquire order; but if the count is just for logging and is reported as a number, we have no need for memory order restrictions.
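As a minimal sketch of the second case (a purely statistical counter; my own example, not the book's code) \:

```cpp
#include <atomic>
#include <cstdio>

std::atomic<unsigned long> events{0};

void on_event() {
    // Pure statistics: no other data is published through this counter,
    // so relaxed order is sufficient.
    events.fetch_add(1, std::memory_order_relaxed);
}

void report() {
    std::printf("events so far: %lu\n", events.load(std::memory_order_relaxed));
}
```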
:::success
> The Art of Writing Efficient Programs **Page 203**
> https://www.tenlong.com.tw/products/9781800208117
> on X86, the atomic increment instruction had the bidirectional memory barrier "built-in," and requesting the relaxed memory order is not going to make it any faster. Still, it is important to specify the requirement your code truly needs, both for portability and for clarity
:::

### Using CAS for Operations that `std::atomic` does not Predefine \(Lock-free\)

:::success
> The Art of Writing Efficient Programs **Page 203**
> https://www.tenlong.com.tw/products/9781800208117
> in C++, there is no atomic multiplication \(and I don't know of any hardware that has such capability; certainly, it's not found on X86 or ARM or any other common CPU architecture\)
:::

The solution is to use the compare-and-swap \(CAS\) operation. This atomic operation can be used to build any read-modify-write operation. Notice that CUDA C also provides a similar operation \(`atomicCAS`\) for CUDA kernels.

```cpp
std::atomic<size_t> count;
// ...on the threads...
size_t c = count.load(std::memory_order_relaxed);
while (!count.compare_exchange_strong(c, c + 1,
    std::memory_order_relaxed, std::memory_order_relaxed)) {}
```

:::success
> The Art of Writing Efficient Programs **Page 204**
> https://www.tenlong.com.tw/products/9781800208117
> the operation in C++ is compare_exchange_strong. There is also compare_exchange_weak; the difference is that the weak version can sometimes return false even when the current and the expected values match (on X86, it makes no difference, but on some platforms, the weak version can result in a faster overall operation). Second, the operation takes not one but two memory order arguments: the second one applies when the compare fails (so it is the memory order for just the comparison part of the operation). The first one applies when the compare succeeds and the write happens.
:::

The code works by checking whether the value changed between the initial atomic load and the atomic CAS. If the value has changed, `c` is updated with the new current value and the loop retries; if it has not changed, the value is replaced with the desired one. Basically, the loop keeps trying until it finds a time window in which it can update the value undisturbed.

### Lock-based, Lock-free \& Wait-free Programs

> The Art of Writing Efficient Programs **Page 206**
> https://www.tenlong.com.tw/products/9781800208117

* **Wait-free Program** \: Threads are always making progress; there is no waiting for access.
* **Lock-free Program** \: At least one thread is always guaranteed to commit its work and not have to redo it. The program as a whole always makes progress, although not necessarily at full speed.
* **Lock-based Program** \: The thread holding the lock is not guaranteed to be making progress. At most one thread \(the lock holder\) can make progress at a time.

:::info
:information_source: **Thread-safe**
* Top 20 C++ multithreading mistakes and how to avoid them
https://acodersjourney.com/top-20-cplusplus-multithreading-mistakes/
* 寫 C/C++多執行緒程式的血與淚: 四個要避免的錯誤
https://jackyu.medium.com/%E5%AF%AB-c-c-%E5%A4%9A%E5%9F%B7%E8%A1%8C%E7%B7%92%E7%A8%8B%E5%BC%8F%E7%9A%84%E8%A1%80%E8%88%87%E6%B7%9A-%E5%9B%9B%E5%80%8B%E8%A6%81%E9%81%BF%E5%85%8D%E7%9A%84%E9%8C%AF%E8%AA%A4-137c97204489
:::

## Data Structure for Concurrency

### Thread-Safe Stack

In the C++ standard library, all of the containers \(including `stack`\) provide only weak thread-safety, which means that it is safe for any number of threads to apply `const` member functions to the container.
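For example, a sketch of the safe, read-only case (my own illustration; it assumes no thread calls a non-`const` function while the readers run) \:

```cpp
#include <cstdio>
#include <thread>
#include <vector>

std::vector<int> v(1000, 42); // shared, but only read below

void reader() {
    // const member functions only: size() and const operator[] are safe
    // as long as no thread modifies the container concurrently.
    long sum = 0;
    for (size_t i = 0; i < v.size(); ++i) sum += v[i];
    std::printf("%ld\n", sum);
}

int main() {
    std::thread t1(reader), t2(reader);
    t1.join();
    t2.join();
}
```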
On the other hand, if any thread uses a non-`const` function, it will cause issues and requires synchronization through some kind of barrier.

> the writer must, as a minimum, do a release, while all readers must acquire. Any stronger barrier will work as well, and so will a lock, but every thread must take this step

:::success
> The Art of Writing Efficient Programs **Page 243**
> https://www.tenlong.com.tw/products/9781800208117

**In order to provide usable thread-safe functionality, the interface must be chosen with thread safety in mind.**

In the book's example, `std::stack` provides `top()`, `empty()`, and `pop()`, which all need to be combined into a single atomic transaction.
:::

### Performance Consideration

Basically, readability and simplicity trade off against performance. If a multithreaded algorithm is not going to access a container often, the performance of the container \(data structure\) is probably not that important, so keeping it simple might be the way to go.

For a lock-based data structure, the performance highly depends on the performance of the lock mechanism.

How the algorithm will use the container can be taken into consideration when creating the thread-safe data structure. Since some use cases can be ignored, better performance might be achievable by adapting to the use cases that actually occur.

> The Art of Writing Efficient Programs **Page 259**
> https://www.tenlong.com.tw/products/9781800208117
> if your application allows for any kind of limitation on the operations you need to support, such as producers and consumers are separate in time, or there is a single producer (or consumer) thread, you can almost certainly design a faster data structure for these limited operations

### Lock-based Stack

https://github.com/PacktPublishing/The-Art-of-Writing-Efficient-Programs/blob/master/Chapter07/02_stack.C

* `mt_stack<T>` \:

:::success
:bulb: `= default` \& `= delete`
https://stackoverflow.com/questions/6502828/what-does-default-mean-after-a-class-function-declaration
The TL;DR version is \:
* `= delete` \: disables a compiler-generated version of a constructor.
* `= default` \: we use the compiler-generated version of the constructor; therefore, no function body is specified.
:::

:::success
:bulb: **Use `std::optional`**
* \[C++ 筆記\] 好用的 `std::optional`
https://blog.awin.one/posts/2019/c++-%E7%AD%86%E8%A8%98-%E5%A5%BD%E7%94%A8%E7%9A%84-stdoptional/
* 潮.C++17 | `std::optional` 解析/介於有跟沒有之間的變數?
https://tjsw.medium.com/%E6%BD%AE-c-17-std-optional-%E4%BB%8B%E6%96%BC%E6%9C%89%E8%B7%9F%E6%B2%92%E6%9C%89%E4%B9%8B%E9%96%93%E7%9A%84%E8%AE%8A%E6%95%B8-f2438fbe02e8
* `C++ std::optional` 用法與範例
https://shengyu7697.github.io/std-optional/

When a function needs to return something that may *not exist*, `std::optional` can be used; the caller can then check for a missing result with `result == std::nullopt` \(or, equivalently, `!result.has_value()`\).
:::

:::success
:bulb: `mutable` \& `const` function
> Reference\:
> The Art of Writing Efficient Programs **Page 249**
> https://www.tenlong.com.tw/products/9781800208117
> Note that, to allow the lookup-only function, top(), to be declared const, we had to declare the mutex as mutable.
> ...
> As a minimum, they should not represent the logical state of the object: they are only implementation details. Then, care should be taken to avoid any race conditions when modifying them.
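A minimal sketch of that pattern (a hypothetical class of my own, not the book's `mt_stack`) \:

```cpp
#include <mutex>
#include <vector>

class mt_stack_like {
    mutable std::mutex m_;   // implementation detail, not part of the logical state
    std::vector<int> data_;
public:
    bool empty() const {
        std::lock_guard<std::mutex> g(m_); // locking mutates m_, hence 'mutable'
        return data_.empty();
    }
};
```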
* C++ 中的 `mutable` 關鍵字
https://liam.page/2017/05/25/the-mutable-keyword-in-Cxx/

In the lambda function `auto f1 = [=]() mutable { x = 42; };`, we can see that a variable captured by value can be modified. Notice that the `x` inside is still a copy of the `x` outside of the lambda function.
:::

* `wr_stack<T>` \:

Notice that `mt_stack<T>` basically locks all read and write accesses against each other. However, multiple reads could run concurrently; this only requires that no write operation happens while these threads are reading the data.

A **read-write lock** allows any number of threads to take the read lock without locking each other out. However, the write lock can be taken by only one thread, and only if no other thread holds the read lock. In C++, a `std::shared_mutex` can be used \: readers take it with `std::shared_lock`, while the writer takes it with `std::unique_lock`.

:::success
> The Art of Writing Efficient Programs **Page 251**
> https://www.tenlong.com.tw/products/9781800208117

**However, in general, the more complex locks, such as the shared mutex, will have more overhead than the simple locks.**
:::

:::success
:bulb: **Spinlock**
A spinlock might provide better performance than a normal `mutex`.
> * The Art of Writing Efficient Programs Repo \: `spinlock.h`
> https://github.com/PacktPublishing/The-Art-of-Writing-Efficient-Programs/blob/master/Chapter07/spinlock.h
> * 並行程式的潛在問題 (二) *Spinlock*
> https://ithelp.ithome.com.tw/articles/10281491

Notice that a spinlock keeps a thread busy-waiting for the lock; therefore, a scheme for deciding when to yield the thread should be considered.
* Ski-Rental Problem
https://en.wikipedia.org/wiki/Ski_rental_problem

*TODO \: Review MIT course and add the scheme and reference here.*
:::

### Lock-free Stack

> Reference \:
> Lock-free Stack Implementation provided in the book
> https://github.com/PacktPublishing/The-Art-of-Writing-Efficient-Programs/blob/master/Chapter07/02b_stack_cas.C

There are some important points that need to be taken into consideration when building the lock-free stack \: *Refer to the book for details.*

* If we only use **one atomic index** for the size of the array, a slot whose object is still under construction could be popped by another thread; on the other hand, if a pop has not yet finished and a push is requested by another thread, there will be a hole in the stack with no way to track it.
* As the above bullet point describes, threads should wait for the thread that is constructing or deleting an element to finish.
* A bounded atomic increment operation is needed \(perform the increment or decrement unless the value equals the specified bound\).

***Using Compare-and-swap \(CAS\)***

> Reference \:
> https://github.com/PacktPublishing/The-Art-of-Writing-Efficient-Programs/blob/master/Chapter07/02b_stack_cas.C
> The Art of Writing Efficient Programs **Page 260**
> https://www.tenlong.com.tw/products/9781800208117

Consider the following code snippet as a building block that is a typical example of how CAS operations are used.

```cpp
std::atomic<int> n_ = 0;
int bounded_fetch_add(int dn, int maxn) {
    int n = n_.load(std::memory_order_relaxed);
    do {
        if (n + dn >= maxn || n + dn < 0) return -1; // out of bounds, give up
    } while (!n_.compare_exchange_weak(n, n + dn,
        std::memory_order_release, std::memory_order_relaxed));
    return n;
}
```

1. Read the current value of the variable.
2. Check the necessary conditions. In our case, we verify that the increment would not give us a value outside of the specified bounds \[0, maxn\).
If the bounded increment fails, we signal it to the caller by returning -1 \(this is an arbitrary choice; usually, there is a specific action to be performed for the out-of-bounds case\).
3. Atomically replace the value with the desired result if the current value is still equal to what we read earlier.
4. If step 3 failed, the current value has been updated by another thread; check it again and repeat steps 2 and 3 until we succeed.

:::success
Busy waiting is not good on a PC or server that has many other things to work on. On an MCU, it might be acceptable if there is no other task to run, and it can be implemented as part of the super loop. There are ways to force a thread to give up the CPU \(inform the scheduler\) and let another thread take over.
* `std::this_thread::yield()` is a system-independent API provided by C++.
* On Linux, `nanosleep()` can be used. Notice that in the code, `nanosleep()` is invoked with `0`.
:::

:::info
:information_source: **linux應用睡眠之nanosleep**
https://blog.csdn.net/qq_18804879/article/details/134599361
:::

> First of all, neither push nor pop can proceed if the two indices are currently not equal: different counts imply that either a new element is being constructed or the current top element is being copied out.
> `c_` is the index of the last fully constructed element
> `p_` is the index of the first free slot in the array

Therefore, as the above quote describes, we need to check whether the two indices are equal. If they are equal, ...

> we need to atomically increment the producer index `p_` \(bounded by the current capacity of the array\). Then we can construct the new element in the slot we just reserved \(indexed by the old value of `p_`\). Then we increment the consumer index `c_` to indicate that the new element is available to the consumer threads

**We need to read both indices in a single atomic operation**, even though C++ built-in atomics operate on a single variable.

> The better solution is to pack both values into a single 64-bit word \(on a 64-bit processor\). The hardware atomic instructions such as load or compare-and-swap do not really care how you are going to interpret the data they read or write\: they just copy and compare 64-bit words

### Note for Thread-safe Data Structure

> The Art of Writing Efficient Programs **Page 266**
> https://www.tenlong.com.tw/products/9781800208117

* With a good lock implementation, a lock-guarded stack offers reasonable performance and is much simpler than the alternatives.
* Any application-specific knowledge about the limitations on the use of the data structure should be exploited to gain performance cheaply. This is not the place to develop generic solutions, quite the opposite: implement as few features as you can and try to gain performance advantages from the restrictions.
* A generic lock-free implementation is possible but, even for a data structure, that is as simple as a stack, it is quite complex. Sometimes, this complexity may even be justified

### Thread-Safe Queue

Notice that, unlike the stack, as long as the queue is not empty, the producer and the consumers do not interact at all.

### Lock-based Queue

> The Art of Writing Efficient Programs Repo
https://github.com/PacktPublishing/The-Art-of-Writing-Efficient-Programs/blob/master/Chapter07/03_queue.C

### Lock-free Queue

* Can be implemented similarly to the lock-free stack. A restricted single-producer\/single-consumer sketch is shown below.
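As a sketch of exploiting such a restriction (my own illustration, following the book's advice about limited operations; it reuses the `head`\/`tail` notation from the ring buffer section above, and the names are my own) \:

```cpp
#include <atomic>
#include <cstddef>

// Minimal single-producer/single-consumer (SPSC) ring-buffer queue sketch.
// T must be default-constructible for this simple fixed array.
template <typename T, size_t N>
class spsc_queue {
    T buf_[N];
    std::atomic<size_t> head_{0}; // next write position (written only by the producer)
    std::atomic<size_t> tail_{0}; // next read position (written only by the consumer)
public:
    bool push(const T& v) { // called by the single producer
        size_t h = head_.load(std::memory_order_relaxed);
        size_t next = (h + 1) % N;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                                 // full (one slot left unused)
        buf_[h] = v;
        head_.store(next, std::memory_order_release);     // publish the element
        return true;
    }
    bool pop(T& v) { // called by the single consumer
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire))
            return false;                                 // empty
        v = buf_[t];
        tail_.store((t + 1) % N, std::memory_order_release); // free the slot
        return true;
    }
};
```

Because only one thread ever writes `head_` and only one writes `tail_`, a release store on one side paired with an acquire load on the other is all the synchronization needed.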
:::info
:information_source: **`explicit` in C++**
* **潮.C++20 \| explicit 關鍵字大解析**
https://tjsw.medium.com/%E6%BD%AE-c-20-%E4%BD%A0%E6%89%80%E4%B8%8D%E7%9F%A5%E9%81%93%E7%9A%84-explicit-%E9%97%9C%E9%8D%B5%E5%AD%97%E5%A4%A7%E5%85%A8-bac7c7eaae3f
:::

### Non-sequentially Consistent Data Structures

A non-multi-threaded queue is sequentially consistent, which means that the order of the elements in the queue matches the order in which they were added. However, in a multi-threaded situation, when two threads pop from the queue, the thread that popped later might complete first, so the order in which the data is processed can differ from the pop order.

:::success
> The Art of Writing Efficient Programs **Page 275**
> https://www.tenlong.com.tw/products/9781800208117
> sequential consistency. A sequentially consistent program produces the output that is identical to the output of a program where operations from all threads are executed one at a time (without any concurrency), and **the order of the operations executed by any particular thread is not changed**.

In other words, the instruction order within any particular thread is not reordered.
:::

If we don't need sequential consistency, we can **slice the queue into multiple sub-queues, with each sub-queue handled by one thread** \(the sub-queues are not accessed concurrently\).

### Memory Management for Concurrent Data Structures

Generally, memory allocation for a concurrent data structure cannot itself be multi-threaded. The key point is that we do not need memory allocation to be fast, since it should happen rarely while we are using the concurrent data structure. Therefore, we can let only one thread do it, and it is okay to lock the entire data structure until the memory allocation is finished.

> The Art of Writing Efficient Programs **Page 279**
> https://www.tenlong.com.tw/products/9781800208117

There might be multiple threads that detect that capacity has run out. In that case, multiple threads would try to allocate more memory for the data structure. The book's author provides **double-checked locking**, as in the following code snippet.

```cpp
std::atomic<int> wait; // 1 if managing memory
std::mutex lock;
while (wait == 1) {}; // Memory allocation in progress
if ( /*… out of memory …*/ ) {
    std::lock_guard g(lock);
    if (/*… out of memory …*/) { // We got here first!
        wait = 1;
        //… allocate more memory …
        wait = 0;
    }
}
//… do the operation normally …
```

Basically, it uses a lock to create a critical section so that only the one thread that successfully acquires the lock goes on to allocate memory. The other threads wait for the lock. Once the lock is freed, every other thread that then acquires the lock must **check again** whether the data structure is still out of memory \(capacity\).

## Concurrency in C++

> The Art of Writing Efficient Programs **Page 293**
> https://www.tenlong.com.tw/products/9781800208117

*This chapter of the book can be used as an overview of C++ concurrency features.*

:::info
:information_source: **Determine C++ Standard supported by GCC Compiler**
> Reference \:
> * How to determine the version of the C++ standard used by the compiler?
> https://stackoverflow.com/questions/2324658/how-to-determine-the-version-of-the-c-standard-used-by-the-compiler
> * C++ Standards Support in GCC
> https://gcc.gnu.org/projects/cxx-status.html

Basically, for `g++`, we can check the C++ standard version at run time as the first reference suggests, or we can map the output of `gcc --version` against the official website.
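For example, printing the `__cplusplus` macro directly \:

```cpp
#include <iostream>
int main() { std::cout << __cplusplus << '\n'; } // e.g., prints 202002 under -std=c++20
```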
:::

### C++11

* C++ Memory Model
* `std::thread`
* `std::mutex`
* `std::atomic`
* `std::lock`
* `std::lock_guard`
* `std::condition_variable`

:::info
:information_source: **Condition Variable in C++ 一個簡單的範例**
https://tigercosmos.xyz/post/2023/06/c++/condition-variable-cpp/
:::

* `std::async` \: [Example1](https://shengyu7697.github.io/std-async/) [Example2](https://blog.gtwang.org/programming/cpp-11-async-function-parallel-computing-tutorial/)
Using `std::async` will not always create a new thread for the function to run on.
* `std::promise`
* `std::future`

Notice that `std::async` returns a `std::future<>`, so we can get a return value from the function it calls. On the other hand, `std::thread` cannot return a value.

> Most implementations will either provide very limited parallelism or execute each asynchronous function call on its own thread. Most operating systems have a rather high overhead for creating and joining threads.

:::info
:information_source: **C++ 執行緒:promise、future、packaged_task 與 async 的使用方法**
https://zh-blog.logan.tw/2021/09/26/cxx-thread-promise-future-packaged-task-async-usage/
:::

### C++17

* `std::scoped_lock`
* `std::shared_mutex` \:
> Usually, such threads perform read-only operations, while a writer thread needs exclusive access, hence the name read-write lock. It's a clever idea in theory, but most implementations offer **dismal** performance
* `std::hardware_destructive_interference_size`
* `std::hardware_constructive_interference_size`
* Parallel STL
> Reference \:
* https://github.com/DragonSpit/ParallelSTL
* https://stackoverflow.com/questions/64326234/c17-parallel-algorithm-vs-tbb-parallel-vs-openmp-performance

In my opinion, the performance of these parallel algorithms can vary and depends highly on the compiler used. Therefore, testing them against other possible solutions might be a good idea.

:::success
:bulb: **Vectorization-unsafe**
> The Art of Writing Efficient Programs **Page 297**
> https://www.tenlong.com.tw/products/9781800208117

```cpp
double much_computing(double x);
std::vector<double> v;
// … add data to v …
double res = 0;
std::mutex res_lock;
std::for_each(std::execution::par, v.begin(), v.end(),
    [&](double& x){
        double term = much_computing(x);
        std::lock_guard guard(res_lock);
        res += term;
    });
```

> However, we cannot use an unsequenced policy here: if the same thread were to process multiple vector elements at the same time \(interleaved\), it could attempt to acquire the same lock multiple times. This is a guaranteed deadlock.
:::

:::info
:information_source: **GCC Parallel STL Require Intel TBB**
* Parallel Programming: The C++ Parallel STL
https://www.youtube.com/watch?v=xyDUPkq90pQ

> Recent enough versions of GCC and Clang include parallel STL headers (in some installations, Clang requires GCC to be installed because it uses GCC-provided parallel STL). The problem appears at the lower level. The runtime threading system used by both compilers is Intel Threading Building Blocks (TBB), which comes as a library with its own set of headers.

Intel TBB needs to be installed on the system, and the library must be included either via a compiler parameter or a system parameter. In my opinion, this inconvenience is why I refrain from using it regularly for now \(2024\/12\/11\).
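For reference, a minimal sketch of such a build (my own example; it assumes GCC with TBB installed, linked via `-ltbb`) \:

```cpp
// compile (assumption): g++ -std=c++17 -O2 reduce.cpp -ltbb
#include <execution>
#include <numeric>
#include <vector>

double parallel_sum(const std::vector<double>& v) {
    // std::reduce with the parallel execution policy;
    // on GCC/Clang the underlying runtime is TBB.
    return std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
}
```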
:::

## C++20 Coroutines

> Reference \:
> * Coroutines in C/C++
> https://www.geeksforgeeks.org/cpp/coroutines-in-c-cpp/
> * Understanding C++ Coroutine Implementation
> https://medium.com/@AlexanderObregon/understanding-c-coroutine-implementation-8e6e5a2c3edd

:::success
Coroutines, in general, are functions that can be interrupted and resumed
:::

![image](https://hackmd.io/_uploads/r1Tt-yP4yg.png)

* `f()` \: a normal function
* `g()` \: a normal function invoked inside the coroutine function
* `h()` \: a new normal function

The above diagram shows that while the coroutine function is running, some of its info is stored on the stack, and when the coroutine suspends, that info is removed from the stack. As long as the handle \(a smart-pointer-like object\) is still alive, the coroutine function can be resumed.

Here is a summary of what is important to know about C++20 coroutines:
* Coroutines are functions that can suspend themselves. This is different from the OS suspending a thread: suspending a coroutine is done explicitly by the programmer (cooperative multitasking).
* Unlike regular functions, which are associated with stack frames, coroutines have handle objects. Coroutine state persists as long as the handle is alive.
* After the coroutine is suspended, the control is returned to the caller, which continues to run the same way as if the coroutine had completed.
* The coroutine can be resumed from any location; it does not have to be the caller itself. Furthermore, the coroutine can even be resumed from a different thread \(we will see an example later in this section\). The coroutine is resumed from the point of suspension and continues to run as if nothing happened \(but may be running on a different thread\)

### Coroutines Example \& Implementation Consideration

> * Understanding C++ Coroutine Implementation
> https://medium.com/@AlexanderObregon/understanding-c-coroutine-implementation-8e6e5a2c3edd

* `co_await`\: Suspends the execution of the coroutine until a particular condition is met or an asynchronous operation is completed. Execution resumes at the `co_await`.
* `co_yield`\: Suspends with a return value.
* `co_return`\: Returns a value from the coroutine and finalizes its execution.

```cpp
#include <iostream>
#include <coroutine>

struct MyCoroutine {
    struct promise_type {
        MyCoroutine get_return_object() { return {}; }
        std::suspend_never initial_suspend() { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() { std::terminate(); }
    };
};

MyCoroutine myCoroutineFunction() {
    std::cout << "Start of coroutine\n";
    // Suspends here; since the returned object does not keep the coroutine
    // handle, this minimal example cannot actually be resumed.
    co_await std::suspend_always{};
    std::cout << "Resume of coroutine\n";
}
```

### Coroutine Lifecycle

![image](https://hackmd.io/_uploads/r1kd94gYxl.png)

### `struct promise_type`

Configuration is done in the `struct promise_type` \:

* `get_return_object`\: Called when the coroutine is first created. Typically returns a handle to the coroutine itself to allow control by the caller.
* `initial_suspend` \& `final_suspend`\: The value can be `std::suspend_always` or `std::suspend_never`. Notice that `final_suspend` often returns `std::suspend_always` to make sure that the coroutine cleans up its resources properly before exiting.
* `return_void` \& `return_value`\: Depending on whether the coroutine returns anything, one of these functions will be called.
* `unhandled_exception`\: Handles an uncaught exception.
Usually calls `std::terminate()`.

### Example of Coroutine with Return Value

Notice that `final_suspend` must return `std::suspend_always` here; otherwise, the coroutine frame \(and the promise holding `value`\) would already be destroyed by the time `get()` is called. The handle is therefore destroyed explicitly in the destructor.

```cpp
#include <iostream>
#include <coroutine>

struct ReturnValue {
    struct promise_type {
        int value;
        ReturnValue get_return_object() {
            return ReturnValue{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_never initial_suspend() { return {}; }
        // Keep the coroutine frame alive so the caller can still read 'value'.
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_value(int v) { value = v; }
        void unhandled_exception() { std::terminate(); }
    };

    std::coroutine_handle<promise_type> handle;
    explicit ReturnValue(std::coroutine_handle<promise_type> h) : handle(h) {}
    ~ReturnValue() { if (handle) handle.destroy(); }
    int get() { return handle.promise().value; }
};

ReturnValue computeValue() {
    co_return 42;
}

int main() {
    ReturnValue result = computeValue();
    std::cout << "Computed Value: " << result.get() << std::endl;
}
```

### Custom Awaitable Types

We can use custom awaitable types besides the built-in `std::suspend_always` and `std::suspend_never`. See the [reference](https://medium.com/@AlexanderObregon/understanding-c-coroutine-implementation-8e6e5a2c3edd) for details.

### Use Cases with Examples

> * Understanding C++ Coroutine Implementation
> https://medium.com/@AlexanderObregon/understanding-c-coroutine-implementation-8e6e5a2c3edd

# Designing and Coding High-Performance Programs

## High-Performance in C++ \& Compiler Optimizations in C++

*TODO \: Need a separate post for lvalue, rvalue, move and perfect forwarding etc.*

Most things mentioned in these chapters can be found in my other posts.

* Book : Embedded Programming with Modern C++ Cookbook
https://hackmd.io/@Erebustsai/SJPEVRskJl
* Book : AI Embedded Systems Algorithm Optimization and Implementation
https://hackmd.io/@Erebustsai/Sk39malXa
* General Performance Engineering
https://hackmd.io/@Erebustsai/HJ4PO0bV6
* Books : Optimized C++ & 21st Century C & Understanding and Using C Pointers & Effective Modern C++
https://hackmd.io/@Erebustsai/SJ6qYG7fJe

## Compiler Optimizations in C++

> Reference \:
> MIT 6.172 Performance Engineering of Software Systems \: 9. What Compilers Can and Cannot Do
> https://www.youtube.com/watch?v=ulJm7_aTiQM&list=PLUl4u3cNGP63VIBQVWguXxZZi0566y7Wf&index=9&pp=iAQB

*TODO*

## Undefined Behavior and Performance

> Reference \:
> * What are all the common undefined behaviours that a C++ programmer should know about?
> https://stackoverflow.com/questions/367633/what-are-all-the-common-undefined-behaviours-that-a-c-programmer-should-know-a
> * Undefined Behavior in C and C++
> https://dev.to/pauljlucas/undefined-behavior-in-c-and-c-3a20

*The above references provide common examples of undefined behavior. This info can be used along with the examples provided in the book.*

### Definition

> The Art of Writing Efficient Programs **Page 373**
> https://www.tenlong.com.tw/products/9781800208117

* **Implementation-defined Behavior** \: The exact specification of implementation-defined behavior must be provided by the implementation. This is not optional.
* **Unspecified Behavior** \: The implementation is under no obligation to document the behavior: the standard usually provides a range of possible outcomes, and the implementation can specify its own possible outcomes without specifying which one is going to happen.
* **Undefined Behavior** \: The standard imposes no requirements whatsoever on the behavior of the entire program.

:::success
> The Art of Writing Efficient Programs **Page 373**
> https://www.tenlong.com.tw/products/9781800208117

The standard further says that if the behavior is undefined, the standard imposes no requirements on the results.
The bottom line is, when your program's behavior is undefined, according to the standard, the compiler can generate code you do not expect, but this code cannot do anything that you could not already have done yourself. **The standard says that the entire program is ill-formed and imposes no restrictions on its outcome.**
:::

:::warning
Unlike signed integer overflow, incrementing the unsigned type `size_t` past its maximum value is well defined: the value rolls over back to zero.
:::

### Undefined Behavior and C++ Optimization

*Refer to the book for detailed examples.*

In my opinion, comparing the performance and assembly output of different code, and using tools to detect undefined behavior, are the best practices we can follow.

***Undefined behavior validation tools***
* **GCC UB sanitizers**
https://stackoverflow.com/questions/31803705/using-gcc-undefined-behavior-sanitizer
* **Compiler Explorer**
https://godbolt.org/

## Design for Performance

*TODO*

# Parallel Algorithms \& Implementations

> The Art of Concurrency **Page 95**
> https://www.tenlong.com.tw/products/9789863471257

:::success
:bulb: **What is the \"assert\" function in C++?**
https://stackoverflow.com/questions/1571340/what-is-the-assert-function
:::

## Parallel Reduction

:::info
:information_source: **My Other Posts about these Algorithms**
* GPU Prefix Scan
https://hackmd.io/PTBuRGwiQHKPyzfe9ebMNQ?view#5-Prefix-Scan
* CUDA Programming \: Reduction Example
https://hackmd.io/QhzlGKAvQcGKLqBcqd2GBg?view#Reduction-Example
:::

### Parallel Sum

TODO \: Add Github Repo URL

This suffices for all operations that are *commutative* and *associative* \(min\/max \& multiply ...\). The algorithm basically just stores the partial sum of each thread and adds them up into the final sum in the main thread. Notice that in CPU parallel programming, slicing the data into contiguous blocks per thread is better; however, in GPU programming, assigning adjacent data to different threads in a warp is better \(for coalesced memory access\).

Referring to the above github repo, you will notice that threads storing data to their corresponding locations in the partial-sum array can cause false sharing. However, as noted in the code, false sharing in this case is not too severe, since there will not be too many threads in CPU multi-threaded programming.

### Prefix Sum

> The Art of Concurrency **Page 107**
> https://www.tenlong.com.tw/products/9789863471257

The following diagram shows that, similar to what we do in the parallel sum, we can run local prefix scans and then add the partial sums \(calculated with an exclusive prefix scan over the last element of each data block\) back to all the elements of the local prefix scan results.

![image](https://hackmd.io/_uploads/ryKuhxdBkx.png)

Notice that the exclusive prefix sum can be done sequentially, since the amount of data equals the number of threads. Additionally, we should keep the threads alive while doing it sequentially, since creating CPU threads has a big overhead.

**Implementation**
* With OpenMP \:
https://github.com/Chen-KaiTsai/PerformanceEngineering_repo/blob/main/OpenMP_Algorithms/PrefixScan/prefix_scan.cpp
* With C++ Thread Library \: *TODO*

:::info
:information_source: **C++ 多執行序開發中安全存取變數的 atomic**
Notice that `-std=c++20` is needed for the `g++` compiler.
https://viml.nchc.org.tw/cpp-atomic/
:::

## Selection

**My Other Post**
* Course Review : Advanced Analysis of Algorithms
https://hackmd.io/9tGyeFoYRVuArznahp07Iw?view#Find-the-ith-Smallest-Element-of-a-Set
* Quick Select Implementation
https://github.com/Chen-KaiTsai/GPGPU_CUDA/blob/main/ImageFilter/filter.cu
* Median of Five Implementation
https://github.com/Chen-KaiTsai/PerformanceEngineering_repo/blob/main/MedianOfFive/main.cpp

## MapReduce

![image](https://hackmd.io/_uploads/rJURU6Yrke.png =x400)

:::info
:information_source: **`std::barrier`**
* `std::barrier` in C++20
https://www.geeksforgeeks.org/std-barrier-in-cpp-20/
* C++20 多執行序間的同步點 barrier 與 latch
https://viml.nchc.org.tw/c20-barrier-and-latch/
:::

Notice that, in the old days \(before C++20\), a condition variable was needed to build the barrier mechanism. Please refer to the following stackoverflow post.
https://stackoverflow.com/questions/38999911/what-is-the-best-way-to-realize-a-synchronization-barrier-between-threads

**Implementation**
TODO : Add Github Repo URL

## Parallel Sorting

**My Other Post**
* Different Sorting Algorithm Implementations with OpenMP
https://github.com/Chen-KaiTsai/PerformanceEngineering_repo/tree/main/OpenMP_Algorithms

## Linear Search

### No Early Stopping

It is easy to make a parallel version of linear search by simply distributing the data equally across all available threads. However, there is no early stopping, since the entire data set will be checked.

:::success
For an array of `N` elements, a sequential linear search that stops when the element is found performs `N/2` comparisons on average, while the parallel version without early stopping takes `N/p` comparison time. Therefore, if `p` \(the number of processors\) is greater than 2, the parallel program performs better **on average**.
:::

### Early Stopping

Checking, on every iteration, an atomic flag that indicates whether the searched-for number has been found is not efficient. Therefore, we might want to check the flag only every few iterations, where the interval depends on how many threads are used.

**Implementation**
TODO : Add Github Repo URL

## N-ary Search

**Implementation**
https://github.com/Chen-KaiTsai/PerformanceEngineering_repo/blob/main/OpenMP_Algorithms/LinearSearch/N-Ary_Search.cpp

## Graphs

**My Other Posts**
* GPGPU Algorithm Implementation
https://hackmd.io/PTBuRGwiQHKPyzfe9ebMNQ?view#9-Graph-Travel-TODO
* Course Review : Advanced Analysis of Algorithms
https://hackmd.io/9tGyeFoYRVuArznahp07Iw?view#VI-Graph-Algorithms
* Book : C++ Data Structure & Algorithm Design
https://hackmd.io/uRQooisDStG9uYADQJESXg?view#Graph-Algorithms-I
* Book : C++ Data Structure & Algorithm Design
https://hackmd.io/uRQooisDStG9uYADQJESXg?view#Graph-Algorithms-II

*TODO \: Use lock-free structures to implement most of the algorithms.
For now, wait for C++ Data Structure & Algorithm Design to finish.*

### Sequential BFS \& DFS

:::success
:bulb: **Basic Sequential Implementation**
* **BFS 與 DFS**
https://hackmd.io/@ShanC/bfs_dfs
:::

### Parallel BFS

My Other Post
* https://hackmd.io/Qb-nmB9_TXKGgrX920f4tQ?view#Parallel-BFS

### Parallel DFS

* **Parallel Depth-First Search for Directed Acyclic Graphs by NVIDIA**
https://research.nvidia.com/publication/2017-03_parallel-depth-first-search-directed-acyclic-graphs
* **Parallel DFS for Directed Acyclic Graphs Github**
*This is a C++ implementation of a parallel algorithm of the DFS traversal.*
https://github.com/morpheusthewhite/parallel-dfs-dag

## All-Pairs Shortest Path

My Other Post
* Course : Advanced Analysis of Algorithms
https://hackmd.io/9tGyeFoYRVuArznahp07Iw?view#Shortest-Path-Algorithm
* Book : C++ Data Structure & Algorithm Design
https://hackmd.io/uRQooisDStG9uYADQJESXg#Graph-Algorithms-I

> Reference \:
> * Floyd-Warshall Algorithm in C++
> https://www.geeksforgeeks.org/cpp/floyd-warshall-algorithm-cpp/

*Problem \: Generate a matrix of node-to-node shortest paths.*

In programming terms \: given an `n`-node graph with an `n * n` weight matrix `W`, output an `n * n` distance matrix `D`, where `D[i][j]` is the smallest path weight from node `i` to node `j`.

:::success
For each pair of vertices `(i, j)`, update the distance by considering each vertex `k` as an intermediate vertex.
:::

### Sequential Implementation

```cpp
#include <algorithm> // std::min

void Floyds(float** D, int N) {
    for (int k = 0; k < N; k++) {
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                D[i][j] = std::min(D[i][j], D[i][k] + D[k][j]);
            }
        }
    }
}
```

Notice that in the above code, the outer `k` loop cannot be parallelized, since we need to make sure the entire matrix is updated before moving on to the next `k`.

### Parallel Pseudocode

```cpp
#include <algorithm> // std::min

void Floyds_parallel(float** D, int N) {
    for (int k = 0; k < N; k++) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                D[i][j] = std::min(D[i][j], D[i][k] + D[k][j]);
            }
        }
    }
}
```

Notice that this might not cause threads to be created and destroyed constantly, since OpenMP implementations typically reuse their worker threads. See [reference](https://stackoverflow.com/a/73296980).

## Minimum Spanning Tree \(MST\)

My Other Post
* Course : Advanced Analysis of Algorithms
https://hackmd.io/9tGyeFoYRVuArznahp07Iw?view#Minimum-Spanning-Trees
* Book : C++ Data Structure & Algorithm Design
https://hackmd.io/uRQooisDStG9uYADQJESXg#Graph-Algorithms-I