# Discussion of Operating System (Linux)

contributed by < [`linD026`](https://github.com/linD026) >

###### tags: `booknote`

---

## Synchronization Design and Trivial Concurrent Programs

> [name=linD026][time=Tue, Oct 26, 2021 11:00 AM]
> Reference: [Is Parallel Programming Hard, And, If So, What Can You Do About It?](https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html)

:::success
* [perfbook - mail list](https://www.spinics.net/lists/perfbook/maillist.html)
:::

### Shared Variables and Compiler Optimization

* Load tearing

```cpp
ptr = global_ptr;
if (ptr != NULL && ptr < high_address)
    do_low(ptr);
```

> If some other thread was concurrently setting `global_ptr` to `NULL`, the result might have all but one byte of the pointer set to zero, thus forming a "wild pointer". Stores using such a wild pointer could corrupt arbitrary regions of memory, resulting in rare and difficult-to-debug crashes. Worse yet, on (say) an 8-bit system with 16-bit pointers, the compiler might have no choice but to use a pair of 8-bit instructions to access a given pointer. Because the C standard must support all manner of systems, the standard cannot rule out load tearing in the general case.

* Store tearing

> [Re: [PATCH 1/1] Fix: trace sched switch start/stop racy updates](https://lore.kernel.org/lkml/20190821103200.kpufwtviqhpbuv2n@willie-the-truck/)

```cpp
void bar(u64 *x)
{
    *x = 0xabcdef10abcdef10;
}
```

```asm
bar:
    mov     w1, 61200
    movk    w1, 0xabcd, lsl 16
    stp     w1, w1, [x0]
    ret
```

:::warning
**Bug: the store is still torn even in the volatile case**

```cpp
void bar(u64 *x)
{
    *(volatile u64 *)x = 0xabcdef10abcdef10;
}
```

```asm
bar:
    mov     w1, 61200
    movk    w1, 0xabcd, lsl 16
    str     w1, [x0]
    str     w1, [x0, 4]
    ret
```
:::

* Load fusing

```cpp
while (!need_to_stop)
    do_something_quickly();
```

```cpp
if (!need_to_stop)
    for (;;) {
        do_something_quickly();
        do_something_quickly();
        do_something_quickly();
        do_something_quickly();
        do_something_quickly();
        do_something_quickly();
        do_something_quickly();
        do_something_quickly();
        do_something_quickly();
        do_something_quickly();
        do_something_quickly();
        do_something_quickly();
        do_something_quickly();
        do_something_quickly();
        do_something_quickly();
        do_something_quickly();
    }
```

> Suppose that a real-time system needs to invoke a function named `do_something_quickly()` repeatedly until the variable `need_to_stop` is set, and that the compiler can see that `do_something_quickly()` does not store to `need_to_stop`. The compiler might reasonably unroll this loop sixteen times in order to reduce the per-invocation overhead of the backwards branch at the end of the loop. Worse yet, because the compiler knows that `do_something_quickly()` does not store to `need_to_stop`, the compiler could quite reasonably decide to check this variable only once, resulting in the transformed code shown above.

```cpp=
int *gp;

void t0(void)
{
    WRITE_ONCE(gp, &myvar);
}

void t1(void)
{
    p1 = gp;
    do_something(p1);
    p2 = READ_ONCE(gp);
    if (p2) {
        do_something_else();
        p3 = *gp;
    }
}
```

> `t0()` and `t1()` run concurrently, and `do_something()` and `do_something_else()` are inline functions. Line 1 declares pointer `gp`, which C initializes to NULL by default. At some point, line 5 of `t0()` stores a non-NULL pointer to `gp`. Meanwhile, `t1()` loads from `gp` three times on lines 10, 12, and 15. Given that line 13 finds that `gp` is non-NULL, one might hope that the dereference on line 15 would be guaranteed never to fault.
> Unfortunately, the compiler is within its rights to fuse the reads on lines 10 and 15, which means that if line 10 loads `NULL` and line 12 loads `&myvar`, line 15 could load `NULL`, resulting in a fault. Note that the intervening `READ_ONCE()` does not prevent the other two loads from being fused, despite the fact that all three are loading from the same variable.
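A minimal sketch of the fix (my own illustration, not verbatim from perfbook): marking the loads from `gp` with `READ_ONCE()` prevents any pair of them from being fused, and dereferencing the already-tested value `p2` closes the window entirely.

```cpp
void t1_fixed(void)
{
    p1 = READ_ONCE(gp);   /* volatile load: cannot be fused with later loads */
    do_something(p1);
    p2 = READ_ONCE(gp);
    if (p2) {
        do_something_else();
        p3 = *p2;         /* dereference the value that was tested, so the
                           * NULL check and the dereference share one load */
    }
}
```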
* Store fusing and Code reordering

```cpp
void shut_it_down(void)
{
    status = SHUTTING_DOWN;          /* BUGGY!!! */
    start_shutdown();
    while (!other_task_ready)        /* BUGGY!!! */
        continue;
    finish_shutdown();
    status = SHUT_DOWN;              /* BUGGY!!! */
    do_something_else();
}

void work_until_shut_down(void)
{
    while (status != SHUTTING_DOWN)  /* BUGGY!!! */
        do_more_work();
    other_task_ready = 1;            /* BUGGY!!! */
}
```

* Invented loads

```cpp
// before
ptr = global_ptr;
if (ptr != NULL && ptr < high_address)
    do_low(ptr);

// after
if (global_ptr != NULL && global_ptr < high_address)
    do_low(global_ptr);
```

* Invented stores

```cpp
int a;

// before
if (condition)
    a = 1;
else
    do_a_bunch_of_stuff();

// after
a = 1;
if (!condition) {
    a = 0;
    do_a_bunch_of_stuff();
}
```

### How to fix it?

* GCC often chooses to load the value into a register, increment the register, then store the value back to memory, which is decidedly non-atomic.

```asm
# convert the "cntr++;" into the assembly code
# the env is linux 5.4.0-80-generic (multipass, x86_64, gcc-9)
# test.c:19:   cntr++;
    movq    cntr(%rip), %rax    # cntr, cntr.0_1
    addq    $1, %rax            #, _2
    movq    %rax, cntr(%rip)    # _2, cntr
```

:::info
[EAX x86 Register Meaning and History](https://keleshev.com/eax-x86-register-meaning-and-history/)
:::

Some compilers translate the `++` operation into a single add-to-memory instruction. <font size=2>(So yes, the increment can happen to be a single instruction, but without a `lock` prefix it is still not atomic with respect to other CPUs, and you cannot depend on the compiler making this choice every time.)</font>

```asm
# the env is mbp19 (x86_64, clang 12.0.0)
    .cfi_def_cfa_register %rbp
    incq    _cntr(%rip)
    popq    %rbp
```

:::success
**Clang CFI**
* [Control Flow Integrity](https://hackmd.io/@linD026/discussion_of_os#Programming-Language-and-Security-Issues)
* [x86: Add support for Clang CFI](https://lwn.net/Articles/869267/)
* [x86: Add support for Clang CFI](https://lwn.net/Articles/853248/)
:::

* [`volatile`](https://en.cppreference.com/w/c/language/volatile) / [`WRITE_ONCE` / `READ_ONCE` / barrier](https://github.com/torvalds/linux/blob/master/tools/include/linux/compiler.h)

> This means that within a single thread of execution, a volatile access cannot be optimized out or reordered relative to another visible side effect that is separated by a sequence point from the volatile access.

* Tell the compiler that the value might be located in an MMIO device register rather than in ordinary cacheable memory.

> Furthermore, note the `volatile` casts in `READ_ONCE()` and `WRITE_ONCE()`, which tell the compiler that the location might well be an MMIO device register. Because [Memory-mapped I/O](https://en.wikipedia.org/wiki/Memory-mapped_I/O) registers are not cached, it would be unwise for the compiler to assume that the increment operation is atomic.

:::info
> Reference: [memory-barriers.txt](https://www.kernel.org/doc/Documentation/memory-barriers.txt)

**CACHE COHERENCY VS MMIO**

> **Memory mapped I/O usually takes place through memory locations that are part of a window in the CPU's memory space that has different properties assigned than the usual RAM directed window.** Amongst these properties is usually the fact that such accesses bypass the caching entirely and go directly to the device buses. This means MMIO accesses may, in effect, overtake accesses to cached memory that were emitted earlier. A memory barrier isn't sufficient in such a case, but rather the cache must be flushed between the cached memory write and the MMIO access if the two are in any way dependent.
:::
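For concreteness, here is a minimal user-space sketch of the two macros, modeled on the kernel's [`tools/include/linux/compiler.h`](https://github.com/torvalds/linux/blob/master/tools/include/linux/compiler.h) (the in-kernel versions additionally handle instrumentation and tearing-prone sizes). The `volatile` casts force the compiler to emit exactly one load or store per source-level access, which is what defeats the tearing, fusing, and invention transformations shown above:

```cpp
/* Minimal sketch; the kernel versions add instrumentation hooks
 * and special-case sizes that cannot be accessed in one go. */
#define READ_ONCE(x) \
    (*(const volatile typeof(x) *)&(x))

#define WRITE_ONCE(x, val) \
    do { *(volatile typeof(x) *)&(x) = (val); } while (0)
```

The assembly below shows what gcc-9 generates for one such volatile access.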
```asm
# the env is linux 5.4.0-80-generic (multipass, x86_64, gcc-9)
# test.c:14:   WRITE_ONCE(cntr, READ_ONCE(cntr) + 1);
    movq    cntr(%rip), %rax    # cntr, tmp85
    movq    %rax, -8(%rbp)      # tmp85, __x
    movq    -8(%rbp), %rdx      # __x, _5
    leaq    cntr(%rip), %rax    #, cntr.0_1
    addq    $1, %rdx            #, _2
    movq    %rdx, (%rax)        # _2, MEM[(volatile long unsigned int *)cntr.0_1]
```

:::success
**Ans**: [perfbook p.44](https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-e2.pdf)
:::

* Another choice - **Memory Barrier**
    * Hardware design - Store Buffer and Invalidate Queue
    * [C11 standard](https://en.cppreference.com/w/c/atomic/memory_order) / [GCC](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html) / [Linux Kernel](https://www.kernel.org/doc/Documentation/memory-barriers.txt) (see the C11 sketch below)
    * Atomic operations: [Linux kernel](https://www.kernel.org/doc/Documentation/atomic_t.txt) / [GCC `__atomic` builtins, which work on plain non-`_Atomic` types](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html)
    * Compare: [Memory fences, atomics and C11 memory model](https://github.com/rmind/stdc/blob/master/c11memfences.md)

:::warning
* [lwn.net - The trouble with volatile](https://lwn.net/Articles/233479/)
* [並行程式設計: Atomics 操作 (Concurrent programming: atomic operations)](https://hackmd.io/@sysprog/concurrency-atomics)
* [GitHub - sysprog21/lkmpg](https://github.com/sysprog21/lkmpg)
    * [Avoid unexpected concurrent access (#94)](https://github.com/sysprog21/lkmpg/commit/148fb013ee96d08495e6c7994c58f19529cb1def)
    * [Fix potential concurrent access problems with VFS (#108)](https://github.com/sysprog21/lkmpg/commit/1a6fb67cf292f4a0f2191e7f8720afa678d787b2)
    * [Sparse](https://sparse.docs.kernel.org/en/latest/): [Fix warning](https://github.com/sysprog21/lkmpg/commit/9289bfe59c6a5063f6d9dfcd49a25afd3ad18efb) / [Integrate into CI pipeline](https://github.com/sysprog21/lkmpg/commit/25d82ec47f1ed1872c1f30624d69a4a52bf386ee)
:::
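As a small self-contained illustration of the C11 flavor (my own sketch, not taken from the linked documents), the classic message-passing pattern pairs a release fence on the producer side with an acquire fence on the consumer side:

```cpp
#include <stdatomic.h>
#include <stdio.h>

int payload;          /* plain data, published via the flag below */
atomic_int ready;     /* 0 = not published, 1 = published */

void producer(void)
{
    payload = 42;                               /* plain store */
    atomic_thread_fence(memory_order_release);  /* order the store above
                                                   before the flag store */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

void consumer(void)
{
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        continue;                               /* spin until published */
    atomic_thread_fence(memory_order_acquire);  /* pairs with the release
                                                   fence in producer() */
    printf("%d\n", payload);                    /* guaranteed to print 42 */
}
```

The fence pair plays the role of `smp_wmb()`/`smp_rmb()` in the kernel-style examples below, although the C11 fences are somewhat stronger (release/acquire fences also order loads against stores).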
* Memory barrier sequences - example of an ordering-hostile architecture

```graphviz
digraph main {
    node [shape = box]
    edge [arrowhead = none]
    rankdir = TB
    cpu0 [label = "CPU 0"]
    cpu1 [label = "CPU 1"]
    cpu2 [label = "CPU 2"]
    cpu3 [label = "CPU 3"]
    cache1 [label = "<0>|cache|<1>", shape = record]
    cache2 [label = "<2>|cache|<3>", shape = record]
    cpu0mq [label = "CPU 0\nMessage\nQueue"]
    cpu1mq [label = "CPU 1\nMessage\nQueue"]
    cpu2mq [label = "CPU 2\nMessage\nQueue"]
    cpu3mq [label = "CPU 3\nMessage\nQueue"]
    memory [label = "Memory"]
    cpu0 -> cache1:0
    cpu1 -> cache1:1
    cpu2 -> cache2:2
    cpu3 -> cache2:3
    cpu0mq -> cache1:0
    cpu1mq -> cache1:1
    cpu2mq -> cache2:2
    cpu3mq -> cache2:3
    memory -> cpu0mq
    memory -> cpu1mq
    memory -> cpu2mq
    memory -> cpu3mq
    {rank=max cpu0 cpu1 cpu2 cpu3}
    {rank=same cache1 cache2}
    {rank=same cpu0mq cpu1mq cpu2mq cpu3mq}
}
```

* Tripping the assertion: assume CPU 0's message queue is full and CPU 1's is empty.

| CPU 0 | CPU 1 | CPU 2 |
|:----------:|:---------------:|:---------------------------:|
| a = 1; | | |
| smp_wmb(); | while (b == 0); | |
| b = 1; | c = 1; | z = c; |
| | | smp_rmb(); |
| | | x = a; |
| | | assert(z == 0 \|\| x == 1); |

> However, all mainstream computer systems provide one mechanism or another to provide "transitivity", which provides intuitive causal ordering: if B saw the effects of A's accesses, and C saw the effects of B's accesses, then C must also see the effects of A's accesses.

* With full barriers, the assertion is guaranteed not to fire:

| CPU 0 | CPU 1 | CPU 2 |
|:---------------:|:---------------:|:---------------------------:|
| a = 1; | | |
| smp_wmb(); | | |
| b = 1; | while (b == 0); | while (b == 0); |
| | smp_mb(); | smp_mb(); |
| | c = 1; | d = 1; |
| while (c == 0); | | |
| while (d == 0); | | |
| e = 1; | | assert(e == 0 \|\| a == 1); |

:::warning
The Linux kernel's [`synchronize_rcu()`](https://elixir.bootlin.com/linux/latest/source/kernel/rcu/tree.c#L3741) primitive uses an algorithm similar to that shown in this example.
:::

### Synchronization

* Depends on the hardware
* Choose the operation
  * atomic
  * signal - function shipping
* Specialization - which side do you care about?
  * the update side or the read side
* Dependencies
  * Address dependencies
    > An address dependency occurs when the value returned by a load instruction is used to compute the address used by a later memory-reference instruction.

    ```cpp
    C C-MP+o-wmb-o+o-ad-o

    {
        y = 1;
        x1 = y;
    }

    P0(int *x0, int **x1)
    {
        WRITE_ONCE(*x0, 2);
        smp_wmb();
        WRITE_ONCE(*x1, x0);
    }

    P1(int **x1)
    {
        int *r2;
        int r3;

        r2 = READ_ONCE(*x1);
        r3 = READ_ONCE(*r2);
    }

    exists (1:r2=x0 /\ 1:r3=1)
    ```

  * Data dependencies
    > A data dependency occurs when the value returned by a load instruction is used to compute the data stored by a later store instruction.
  * Control dependencies
    > A control dependency occurs when the value returned by a load instruction is tested to determine whether or not a later store instruction is executed. Note well the "later store instruction": many platforms do not respect load-to-load control dependencies.
* Multicopy atomicity

  ![](https://imgur.com/SQimNsK.png)

  > Multicopy atomicity is a deeply intuitive notion about ordering that is not always provided by real computer systems, namely that a given store becomes visible at the same time to all CPUs, or, alternatively, that all CPUs agree on the order in which all stores become visible. However, support of full multicopy atomicity would rule out valuable hardware optimizations, so a weaker form called "other multicopy atomicity" instead guarantees only that a given store becomes visible at the same time to all *other* CPUs. The remainder of this document discusses this weaker form, but for brevity will call it simply "multicopy atomicity".

  > Threads running on a fully multicopy atomic platform are guaranteed to agree on the order of stores, even to different variables. A useful mental model of such a system is the single-bus architecture shown in Figure 15.7. If each store resulted in a message on the bus, and if the bus could accommodate only one store at a time, then any pair of CPUs would agree on the order of all stores that they observed.

**Synchronization Design**

1. Bounded wait-free synchronization: Every thread will make progress within a specific finite period of time. This level is widely considered to be unachievable, which might be why Alistarh et al. omitted it.
2. Wait-free synchronization: Every thread will make progress in finite time.
3. Lock-free synchronization: At least one thread will make progress in finite time (see the sketch after this list).
4. Obstruction-free synchronization: Every thread will make progress in finite time in the absence of contention.
5. Clash-free synchronization: At least one thread will make progress in finite time in the absence of contention.
6. Starvation-free synchronization: Every thread will make progress in finite time in the absence of failures.
7. Deadlock-free synchronization: At least one thread will make progress in finite time in the absence of failures.
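To make level 3 concrete, here is a sketch (my own illustration, using the GCC `__atomic` builtins linked above) of a lock-free stack push. When several threads race, at least one compare-and-swap succeeds per round, so the system as a whole always makes progress, yet any individual thread can retry indefinitely; that is exactly what makes it lock-free but not wait-free.

```cpp
#include <stdbool.h>

struct node {
    struct node *next;
    int value;
};

/* Lock-free push onto a Treiber-style stack. */
void push(struct node **top, struct node *n)
{
    struct node *old = __atomic_load_n(top, __ATOMIC_RELAXED);

    do {
        n->next = old;  /* a failed CAS refreshes 'old' with the current top */
    } while (!__atomic_compare_exchange_n(top, &old, n,
                                          true, /* weak: may fail spuriously */
                                          __ATOMIC_RELEASE,
                                          __ATOMIC_RELAXED));
}
```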
**Partitioning - Storage Design and Data Ownership**

* Array-based
    * needs padding to avoid false sharing
* Thread-based
    * per-thread storage (see the sketch below)

:::warning
GCC provides the [`__thread`](https://gcc.gnu.org/onlinedocs/gcc/Thread-Local.html) storage class for per-thread storage, but it does not provide a `per_thread()` primitive, analogous to the Linux kernel's [`per_cpu()`](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h#L269), for threads to access one another's per-thread variables.
:::
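A minimal sketch of working around that limitation (my own illustration; perfbook's `count_end.c` uses the same trick): each thread publishes the address of its `__thread` variable in a global array, which then serves as a hand-rolled `per_thread()`.

```cpp
#define NR_THREADS 4

static __thread unsigned long counter;       /* one instance per thread */
static unsigned long *counterp[NR_THREADS];  /* registry of every instance */

/* Each thread calls this once at startup with a unique index,
 * publishing the address of its own __thread variable. */
void register_thread(int idx)
{
    counterp[idx] = &counter;
}

/* A hand-rolled per_thread() read: sum every thread's counter.
 * These loads race with the owners' updates, hence the
 * READ_ONCE-style volatile casts. */
unsigned long read_count(void)
{
    unsigned long sum = 0;

    for (int i = 0; i < NR_THREADS; i++)
        if (counterp[i])
            sum += *(volatile unsigned long *)counterp[i];
    return sum;
}
```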
* Data locking and code locking
* Owned by a CPU or a thread

> If there is significant sharing, communication between the threads or CPUs can result in significant complexity and overhead. Furthermore, if the most-heavily used data happens to be that owned by a single CPU, that CPU will be a "hot spot".

### Implementation of Concurrent Programs

* **Synchronization**
    * Reference count
    * Counting (perfbook section 5)
    * Lock
        * RWlock
        * Seqlock
        * Spinlock
        * Hashed lock
    * [Combining Tree Barrier](https://en.wikipedia.org/wiki/Barrier_(computer_science)#Combining_Tree_Barrier)
* **Data Structure**
    * [Hazard Pointer](https://hackmd.io/@linD026/harzard-pointer)
        * Static analysis
    * RCU ([What is RCU?](https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt) / [`rcupdate.h`](https://github.com/torvalds/linux/blob/master/include/linux/rcupdate.h) / [`/kernel/rcu`](https://github.com/torvalds/linux/tree/master/kernel/rcu))
        * [urcu](https://github.com/urcu/userspace-rcu)
    * [Skip List](https://hackmd.io/@linD026/discussion_of_os#Multiple-queue-Skip-list-SchedulerMuQSS) (see the MuQSS section below)

:::warning
**Example code:** [linD026/parallel-program](https://github.com/linD026/parallel-program)
:::

:::info
**Interrupt Handler**

![](https://imgur.com/VHaWAe3.png)

---

* [Linux kernel - Top & bottom half](https://linux-kernel-labs.github.io/refs/heads/master/lectures/interrupts.html) / [LDD3](https://static.lwn.net/images/pdf/LDD3/ch10.pdf)
  > Linux (along with many other systems) resolves this problem by splitting the interrupt handler into two halves. The so-called top half is the routine that actually responds to the interrupt—the one you register with `request_irq`. The bottom half is a routine that is scheduled by the top half to be executed later, at a safer time.
* [Moving interrupts to threads](https://lwn.net/Articles/302043/)
  > Traditionally, interrupt handling has been done with top half (i.e. the "hard" irq) that actually responds to the hardware interrupt and a bottom half (or "soft" irq) that is scheduled by the top half to do additional processing. The top half executes with interrupts disabled, so it is imperative that it do as little as possible to keep the system responsive. Threaded interrupt handlers reduce that work even further, so the top half would consist of a "quick check handler" that just ensures the interrupt is from the device; if so, it simply acknowledges the interrupt to the hardware and tells the kernel to wake the interrupt handler thread.
* [GitHub - sysprog21/lkmpg](https://github.com/sysprog21/lkmpg) - [PR](https://github.com/sysprog21/lkmpg/pulls)

---

**Priority Inversion**

* [Introduction to RTOS - Solution to Part 11 (Priority Inversion)](https://www.digikey.com/en/maker/projects/introduction-to-rtos-solution-to-part-11-priority-inversion/abf4b8f7cd4a4c70bece35678d178321)
* [A realtime preemption overview](https://lwn.net/Articles/146861/)
* Example: perfbook - `count_lim_sig.c` (calls `poll()` to block for a millisecond)
:::

---

## Programming Language and Security Issues

* [Linux 核心採納 Rust 的狀況 (The status of Rust adoption in the Linux kernel)](https://hackmd.io/@linD026/rust-in-linux-organize)
* Making C Less Dangerous - Kees Cook, Google: [video](https://www.youtube.com/watch?v=XfNt6MsLj0E) / [pdf](https://outflux.net/slides/2018/lss/danger.pdf)
* Control Flow Integrity in the Linux Kernel: [video](https://www.youtube.com/watch?v=0Bj6W7qrOOI) / [pdf](https://outflux.net/slides/2020/lca/cfi.pdf) / [lwn](https://lwn.net/Articles/810077/)
* [GitHub - sysprog21/lkmpg](https://github.com/sysprog21/lkmpg) - [Fix disallowed cr0 write protection and close_fd (#80)](https://github.com/sysprog21/lkmpg/commit/cccc98ab2c914cc922d34266b88ff73231e06607)

---

## Linux Kernel Module (LKM)

* [GitHub - sysprog21/lkmpg](https://github.com/sysprog21/lkmpg)
* [kernel.org - /Documentation/kbuild](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/kbuild)
* [Linux Kernel Teaching](https://linux-kernel-labs.github.io/refs/heads/master/)

---

## Copy On Write (Memory Management)

[Linux 核心 Copy On Write 實作機制 (The Linux kernel's copy-on-write implementation)](https://hackmd.io/@linD026/Linux-kernel-COW-content)

---

## Multiple Queue Skip list Scheduler (MuQSS)

* [Lecture Notes on Skip Lists](https://courses.csail.mit.edu/6.046/spring04/handouts/skiplists.pdf)
* [A kernel skiplist implementation (Part 1)](https://lwn.net/Articles/551896/)
* [Skiplists II: API and benchmarks](https://lwn.net/Articles/553047/)
* [Linux-5.10-ck1, MuQSS CPU scheduler v0.205](https://lore.kernel.org/lkml/CABqErrE5Y1qEqxvewzTP-HvbB7o7CHHs+rStui8V1e9ukVm9aA@mail.gmail.com/)
* [Multiple Queue Skiplist Scheduler version 0.120](https://lwn.net/Articles/705126/)
* [Bachelor's Thesis in Computing Science: Linux CPU Schedulers CFS and MuQSS Comparison](https://www.diva-portal.org/smash/get/diva2:1566002/FULLTEXT01.pdf)
* [Process Scheduling: O(N), O(1), RSDS, CFS, BFS, Deadline, MuQSS, etc. Performance comparison](https://students.mimuw.edu.pl/ZSO/Wyklady/15_CPUschedulers2/ProcessScheduling2.pdf)
* [Core scheduling](https://lwn.net/Articles/780703/)

---

## Context Switch

[Evolution of the x86 context switch in Linux](https://www.maizure.org/projects/evolution_x86_context_switch_linux/)