# Week 13 — Team 27

<style> .blue { background-color:#66FFFF; color:darkblue; padding:2px 4px; border-radius:3px; } </style>
<style> .yellow { background-color:#F9F900; color:darkgreen; padding:2px 4px; border-radius:3px; } </style>

**Members:** @112062124, @111000116, @112XXXXXX (student IDs)

:::info
You can write in Chinese.
:::

## Lecture & Reading Notes

## 1. The Scalability Problem: Why Systems Fail

### Defining Scalability

- **Definition**: A system is scalable if <span class="yellow">its performance improves proportionately as you add more hardware resources (cores, servers).</span>
- **The Ideal**: Linear scalability. If we double the number of workers (threads/cores), we expect double the throughput.
- **The Reality**:
  1. **Amdahl's Law**: As parallelism increases, the speedup is mathematically limited by the sequential portion of the program (the parts that cannot be parallelized): with parallel fraction $p$ and $N$ workers, speedup $= \frac{1}{(1-p) + p/N}$. We see diminishing returns—the curve flattens out.
  2. **Universal Scalability Law (USL)**: This model is more pessimistic, because it also accounts for the cost of synchronization and communication: throughput $C(N) = \frac{N}{1 + \sigma(N-1) + \kappa N(N-1)}$, where $\sigma$ captures contention and $\kappa$ captures coherency (communication) cost.
     - As you add more cores, the overhead of coordinating them (locking, data sharing) increases.
     - At a certain point, the cost of communication outweighs the benefit of the extra core.
     - **Result**: Performance can actually degrade (the curve goes down) with more hardware.

## 2. **The Challenge of Correctness: Race Conditions**

When we parallelize work, threads must eventually synchronize their partial results. This synchronization point is the **Critical Section**.

**The Race Condition**

A race condition occurs when the timing or ordering of non-atomic operations determines the output.

* **The Shared Counter Experiment**:
  1. **Setup**: `int counter = 0`. Thread A increments 1,000,000 times. Thread B increments 1,000,000 times.
  2. **Expected**: 2,000,000.
  3. **Actual**: Significantly less (e.g., 1,350,000).

**Why? The Assembly Perspective**

In C, `counter++` looks like one step. In assembly (at the CPU level), it is **three distinct instructions**. The operation is **non-atomic**, meaning it can be interrupted.

- `LOAD`: Read the value from the memory address into a CPU register.
- `ADD`: Increment the value inside the register.
- `STORE`: Write the new value from the register back to memory.

**The Failure Scenario (Context Switch)**

1. **Thread A** executes LOAD. It reads 5 into its register.
2. A context switch happens. **Thread A** is paused before it can ADD.
3. **Thread B** executes LOAD. It also reads 5 (because A hasn't stored its update yet).
4. **Thread B** executes ADD (5 → 6) and STORE. Memory is now 6.
5. A context switch happens. **Thread A** resumes.
6. **Thread A** executes ADD. It adds 1 to its register (which still holds 5), resulting in 6.
7. **Thread A** executes STORE. It writes 6 to memory.

| Step | Thread A (Core 1) | Thread B (Core 2) | Mem Value |
|---|----------------------------------|------------------------------------|-----------|
| 1 | LOAD into register (5) | | 5 |
| 2 | **Context switch (A paused)** | | 5 |
| 3 | | LOAD into register (5) *(Stale!)* | 5 |
| 4 | | ADD to register (6) | 5 |
| 5 | | STORE [counter] (6) | **6** |
| 6 | **Context switch (A resumes)** | | 6 |
| 7 | ADD to register (5 → 6) | | 6 |
| 8 | STORE [counter] (6) | | **6** |

**Result**: Two increments occurred, but the value only increased by 1. <span class="yellow">One update was "lost."</span>
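The lost-update effect is easy to reproduce. Below is a minimal sketch of the shared-counter experiment described above (our own illustration, assuming a POSIX system; compile with something like `gcc -O0 -pthread race.c` so the compiler does not collapse the loop — see the Phase 1 experiment later in this note for why that matters). The printed value varies from run to run, but it is typically far below 2,000,000:

```c
#include <stdio.h>
#include <pthread.h>

#define N 1000000

long counter = 0;            // shared, unsynchronized

// Each thread performs N non-atomic LOAD/ADD/STORE sequences.
void *inc(void *arg) {
    for (int i = 0; i < N; i++)
        counter++;           // the racy read-modify-write
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, inc, NULL);
    pthread_create(&b, NULL, inc, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    // Expected 2000000; typically prints far less due to lost updates.
    printf("counter = %ld\n", counter);
    return 0;
}
```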
## 3. **Ensuring Correctness: Locking Solutions**

To fix race conditions, we must enforce Mutual Exclusion—only one thread enters the critical section at a time.

### A. Atomic Instructions (The Hardware Fix)

* **Concept**: The CPU provides special hardware instructions that perform the Read-Modify-Write sequence as a single, uninterruptible unit.
* **Pros**: Very fast; no context switching required.
* **Cons**: Limited to simple operations (like basic arithmetic or pointer swaps). Cannot protect complex data structures (like a linked list or a process table).

### B. Mutex (The OS Fix)

* **Concept**: Wrap the critical section in `lock()` and `unlock()`.
* **Example**: xv6 process creation (`allocproc`) — see the sketch after this list.
  * When creating a process, the kernel must scan the process table to find an UNUSED slot.
  * If two CPUs scan simultaneously without a lock, they might both grab the same slot.
  * **Solution**: `acquire(&p->lock)` before checking the state; `release(&p->lock)` after modification.
* **Performance Impact**: While a mutex ensures correctness, it serializes execution, effectively turning parallel code back into sequential code for that section.
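Below is a user-space analogue of that `allocproc` scan, written as a sketch (our own paraphrase with hypothetical names; real xv6 uses its own `acquire`/`release` spinlock API on `p->lock`, and returns from `allocproc` with the slot's lock still held). The point is the pattern: each slot's lock is held while its state is inspected and claimed, so two threads can never both observe the same slot as UNUSED:

```c
#include <pthread.h>
#include <stddef.h>

#define NPROC 64
enum procstate { UNUSED, USED };

struct proc {
    pthread_mutex_t lock;      // xv6 uses a spinlock here
    enum procstate  state;
};

// Assume each lock was initialized with pthread_mutex_init() at startup.
struct proc proc_table[NPROC];

// Scan for a free slot; hold each slot's lock while checking and claiming it.
struct proc *alloc_slot(void) {
    for (int i = 0; i < NPROC; i++) {
        struct proc *p = &proc_table[i];
        pthread_mutex_lock(&p->lock);
        if (p->state == UNUSED) {
            p->state = USED;               // claim while holding the lock
            pthread_mutex_unlock(&p->lock);
            return p;
        }
        pthread_mutex_unlock(&p->lock);
    }
    return NULL;                           // table full
}
```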
## 4. **The Hidden Performance Killer: Cache Coherency**

Even if the logic is correct, hardware behavior can destroy performance.

### Cache Line Bouncing

Modern CPUs have private L1/L2 caches. When a lock variable is modified:

- Other cores' caches invalidate their copy.
- This causes cache misses.
- Massive bus traffic → performance collapse.

### False Sharing

**Scenario**: You give Thread A and Thread B their own private counters to avoid locking: `int counterA; int counterB;`

- **The Trap**: Cache memory is managed in chunks called cache lines (typically 64 bytes). If `counterA` and `counterB` are stored right next to each other in memory, they inhabit the same cache line.
- **The Result**: When Thread A updates `counterA`, the CPU invalidates the entire 64-byte line. Thread B's cached line is now invalid, even though it only cares about `counterB`. The system suffers from cache line bouncing despite no logical data sharing.
- **The Fix**: Padding. Insert unused bytes (padding) between the variables to force them onto different cache lines.

```c
struct {
    int  counterA;
    char padding[60]; // pushes counterB into the next 64-byte chunk
    int  counterB;
} counters;
```

## 5. Scalability Optimization: The Sloppy Counter

Goal: Reduce synchronization by using local counters. (A code sketch follows below.)

### Mechanism

- **Local Counters**: Each CPU core has its own private counter (padded to prevent false sharing). Updates are fast and local (no locks needed).
- **Global Counter**: A single shared counter protected by a lock.
- **Threshold ($S$)**: A value that determines the synchronization frequency: the larger the threshold, the lower the frequency.

### Logic

- When a thread increments: it adds +1 to its **Local Counter**.
- **Check**: Is Local Counter $\geq S$?
  * **No**: Do nothing else. (Fast path.)
  * **Yes**: Acquire lock $\rightarrow$ add local value to Global Counter $\rightarrow$ reset local to 0 $\rightarrow$ release lock. (Slow path.)

### Trade-off

- **Small $S$ (e.g., 1)**: Accurate global count, but high contention (essentially a standard shared counter).
- **Large $S$ (e.g., 128)**: Extremely high performance (linear scalability), but the Global Counter lags behind reality (it is "sloppy").
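Here is a minimal sketch of that mechanism (our own illustration; names like `sloppy_increment` and the threshold value are hypothetical). Each thread owns one padded slot, so the fast path touches only its private cache line:

```c
#include <pthread.h>

#define NCPU 8
#define THRESHOLD 1024   // S: larger = faster but sloppier

// Per-thread local counter, padded to occupy a full 64-byte cache line
// so neighboring slots never falsely share a line.
struct local_counter {
    long value;
    char pad[64 - sizeof(long)];
};

static struct local_counter locals[NCPU];
static long global_count = 0;
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

// tid is the caller's index, 0..NCPU-1.
// Fast path: bump the private counter. Only when it reaches S do we take
// the global lock, flush local -> global, and reset (slow path).
void sloppy_increment(int tid) {
    locals[tid].value++;
    if (locals[tid].value >= THRESHOLD) {
        pthread_mutex_lock(&global_lock);
        global_count += locals[tid].value;   // transfer local to global
        locals[tid].value = 0;               // reset local
        pthread_mutex_unlock(&global_lock);
    }
}

// Reading is approximate: up to NCPU * (THRESHOLD - 1) behind reality.
long sloppy_read(void) {
    pthread_mutex_lock(&global_lock);
    long v = global_count;
    pthread_mutex_unlock(&global_lock);
    return v;
}
```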
## 6. Waiting Strategies: Spin vs. Sleep

| Feature | Spinlock | Mutex (Sleep Lock) |
|--------|----------|--------------------|
| Action | Busy wait | Sleep/yield |
| CPU Usage | High | Low |
| Context Switch | None | Yes |
| Best For | Tiny critical sections | Long sections |
| Common Use | OS kernels | User apps |

### Spinlock Implementation

* **Naive Approach**: `while (flag == 1); flag = 1;` $\rightarrow$ Broken. (Race condition between checking and setting.)
* **Correct Approach**: Requires hardware support such as Test-and-Set or Compare-and-Swap. These instructions atomically read the old value and write the new value. (A sketch follows at the end of this section.)
* **Issues**:
  * On a single core, spinlocks are dangerous. If the thread holding the lock is preempted, the spinning thread will waste its entire time slice checking a lock that cannot be released (because the holder isn't running).

### Sleep Lock Implementation

* Uses a **queue**.
* **Acquire**: If busy, push self onto the queue, change state to SLEEPING, call the scheduler.
* **Release**: If the queue is not empty, pop the first thread and change its state to RUNNABLE.
* **Fairness**: The queue ensures threads acquire the lock in order (FIFO), preventing starvation.
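As a concrete version of the "correct approach" above, here is a minimal test-and-set spinlock using C11 atomics (a sketch, not a production lock; real implementations add backoff and, in kernels, disable interrupts while holding the lock):

```c
#include <stdatomic.h>

typedef struct {
    atomic_flag locked;   // clear = free, set = held
} spinlock_t;

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

// test_and_set atomically reads the old value and sets the flag in one
// hardware operation; keep spinning while the old value was "already held".
static inline void spin_lock(spinlock_t *l) {
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;   // busy wait (wastes CPU if the holder is preempted -- see above)
}

static inline void spin_unlock(spinlock_t *l) {
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```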
## 7. Summary & Best Practices

* **Avoid Global Synchronization**: It is the bottleneck of scalability.
* **Use the Weakest Solution Necessary**:
  * If you can use an **Atomic** variable, do not use a lock.
  * If you can tolerate lag, use a **Sloppy Counter** (hybrid).
  * If you must use a lock, consider the duration:
    * Short duration $\rightarrow$ **Spinlock**.
    * Long duration $\rightarrow$ **Mutex**.
* **Hardware Awareness**: Be mindful of false sharing when designing data structures for parallel workers.

### Shio (Explorations & Discoveries)

### 1. A real-world race condition: the Therac-25 accidents

While reading about race conditions, we came across a post on Threads about an accident caused by a race condition: the [Therac-25 case](https://www.threads.com/%40a.chin.logs/post/C63dVYmrxeg). The [Wikipedia article](https://zh.wikipedia.org/zh-tw/Therac-25%E6%A1%88%E4%BE%8B) describes the background of the incident as follows:

:::info
The Therac-25 was a radiation therapy machine produced by Atomic Energy of Canada Limited (AECL), introduced after the Therac-6 and Therac-20. Between 1985 and 1987 there were at least six medical accidents involving the Therac-25 in the United States and Canada, in which flaws in the software design caused patients to receive radiation overdoses. The flaw was a race hazard (a timing conflict between two concurrently running routines); when triggered, it could deliver roughly one hundred times the normal dose, killing or seriously injuring patients.

The incidents highlighted the potential danger of putting safety-critical systems under software control, and became a classic case study in software engineering and medical informatics. They are also an extreme example of engineer overconfidence: known software problems were never fixed with proper due diligence, and problems reported by end users were not believed, with severe consequences. After the Therac-25 incidents, engineering-management methodologies for software development began to receive serious attention.
:::

The accident was caused by the combination of two factors:

1. **Hardware interlocks were removed, with most safety checks moved into software interlocks.**
2. **An overflow combined with a race condition.**

To understand how these two factors caused the accident, we first need the concepts below.

#### Hardware interlock vs. software interlock

A hardware interlock is easy to understand: it is the familiar "fail-safe" that uses physical constraints to make a machine **physically unable to run** when it is unsafe. The simplest example: an elevator cannot move while its doors are open, and the doors cannot open while it is moving. Even if the software fails, the hardware construction still blocks the dangerous action.

A software interlock instead relies entirely on program logic to decide whether it is safe to proceed. This is far more flexible, but badly designed software can suffer race conditions, overflows, and similar defects. Unfortunately, the Therac-25 accidents happened precisely because **the safety-critical logic was left entirely to software** and **that software was flawed**.

#### Overflow + race condition

Most of the Therac-25's safety mechanisms were built on software interlocks, so any software bug could directly endanger patient safety. Here is a simplified model of what actually happened. The system had several important shared variables:

- **`Class3`**: a 1-byte counter (0–255) controlling whether the beam position (collimator) must be re-checked.
- **`F$mal`**: a malfunction flag (0 = no fault, non-zero = fault).
- **`Tphase`**: the treatment phase (e.g., Set-Up, Ready, Treatment).

The following pseudocode models the actual logic:

```clike
// Thread A: Set-Up Test / Datent (handles operator input)
// 1. Every time the operator edits the treatment parameters, Class3 is
//    incremented (1 byte -> can overflow)
Class3 = (Class3 + 1) % 256;
Tphase = EDITING;

// When the operator finishes input, the Set-Up Test requests a safety check
Tphase = SETUP_COMPLETE;

// Thread B: Lmtchk (background safety check)
// 2. Whether the collimator is checked depends on the value of Class3
if (Class3 != 0) {
    Chkcol();                                // check collimator position
    if (position_is_wrong) F$mal = NONZERO;  // flag a fault
} else {
    // Class3 == 0 (after overflow) -> the safety check is skipped
    // F$mal is not updated (it keeps its old value, usually 0)
}

// 3. The Set-Up Test uses F$mal to decide whether treatment may start
if (F$mal == 0) {
    Tphase = TREATMENT_READY;
    allow_high_energy_beam();                // fire the 25 MeV electron beam
} else {
    block_beam();
}
```

##### Thread A (Set-Up Test)

The operator rapidly edits the settings. Each edit increments `Class3`; since it is a single byte, it eventually overflows: 255 → 0. During this time, `Tphase` toggles between EDITING and SETUP_COMPLETE.

##### Thread B (Lmtchk)

Lmtchk periodically runs safety checks in the background. At some point it happens to read `Class3` at the exact instant it has overflowed to 0. By design, when `Class3 == 0` the collimator check is skipped entirely, so `F$mal` is never flagged and stays 0.

##### Finally

Thread A runs the Set-Up Test again, reads `F$mal == 0`, and wrongly concludes that "there is currently no fault." It sets `Tphase` to TREATMENT_READY.

##### The race condition

The race occurs when the Set-Up Test (Thread A) and Lmtchk (Thread B) concurrently read and write the shared variables `Class3` and `F$mal`:

1. Thread A keeps incrementing `Class3` (which can overflow).
2. Thread B uses the value of `Class3` to decide whether to run the safety check and update `F$mal`.
3. Their interleaving is not fixed: the outcome depends on how fast the operator types and on when the background check happens to run, a classic timing-dependent bug (race condition).
4. When rapid operator input makes `Class3` overflow to 0, and Lmtchk happens to read `Class3 == 0` at exactly that moment, it wrongly concludes that no collimator check is needed. `F$mal` stays 0, the Set-Up Test wrongly judges the machine "safe," the high-energy electron beam is enabled, and the result is a catastrophic medical accident. (A toy sketch of the overflow follows below.)
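To make the overflow trigger concrete, here is a self-contained toy sketch (our own illustration with hypothetical names, not actual Therac-25 code) showing how a 1-byte counter silently wraps to 0 and makes a `!= 0` guard skip its check:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t class3 = 0;       // 1-byte counter, mirroring Class3 (0-255)
    int checks_skipped = 0;

    // Simulate 1000 rapid operator edits.
    for (int edit = 0; edit < 1000; edit++) {
        class3++;             // wraps silently: 255 -> 0

        // Background check: skipped whenever the counter reads 0.
        if (class3 != 0) {
            /* run_collimator_check(); */
        } else {
            checks_skipped++; // safety check silently skipped
        }
    }
    // Prints 3: the counter hits 0 at edits 256, 512, and 768.
    printf("checks skipped due to overflow: %d\n", checks_skipped);
    return 0;
}
```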
Reference: https://www.cs.columbia.edu/~junfeng/08fa-e6998/sched/readings/therac25.pdf

---

# Shio: The Hidden Cost of Concurrency

**Deconstructing Compiler Optimizations and Locking Overheads**

### 1. Motivation

In our Operating Systems lectures, we constantly hear that "locks are expensive" and "atomic instructions are faster than mutexes." But as engineers, qualitative statements aren't enough. We need to quantify these claims:

1. **Quantify the Gap:** Exactly how much slower is an atomic operation compared to a raw instruction? Is a mutex 2x slower or 100x slower?
2. **Solve the Mystery:** Why did our initial code, which theoretically contained a race condition, execute instantly and produce the "correct" result?
3. **Understand the Mechanism:** What actually happens at the CPU and OS level to cause these performance penalties?

-----

### 2. Phase 1: The Compiler's "Cheat" Trap

#### Initial Experiment Plan

Our first attempt was simple: create a race condition and see how much locks slow it down.

* **Setup:** 4 threads.
* **Workload:** Each thread increments a shared counter 10,000,000 times.
* **Compilation:** `gcc -O2 lock_bench.c ...` (standard optimization enabled).

#### The Code (Version 1)

We used the most intuitive implementation:

```c
// Version 1: Naive Implementation
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

#define ITERATIONS 10000000
#define NUM_THREADS 4

long raw_counter = 0;
atomic_long atomic_counter = 0;
long mutex_counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void* unsafe_inc(void* arg) {
    for (int i = 0; i < ITERATIONS; i++) {
        raw_counter++;
    }
    return NULL;
}

void* atomic_inc(void* arg) {
    for (int i = 0; i < ITERATIONS; i++) {
        atomic_fetch_add(&atomic_counter, 1);
    }
    return NULL;
}

void* mutex_inc(void* arg) {
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&lock);   // may trigger a system call (futex)
        mutex_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

double get_elapsed_time(struct timespec start, struct timespec end) {
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * 1e-9;
}

int main() {
    pthread_t threads[NUM_THREADS];
    struct timespec start, end;

    printf("BENCHMARK: Synchronization Cost (%d Threads, %d Ops each)\n", NUM_THREADS, ITERATIONS);
    printf("-------------------------------------------------------------\n");
    printf("%-8s | %-12s | %-25s\n", "Type", "Time (sec)", "Result (Target: 40000000)");
    printf("-------------------------------------------------------------\n");

    // 1. Unsafe
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NUM_THREADS; i++) pthread_create(&threads[i], NULL, unsafe_inc, NULL);
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("%-8s | %-12.4f | %ld\n", "Unsafe", get_elapsed_time(start, end), raw_counter);

    // 2. Atomic
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NUM_THREADS; i++) pthread_create(&threads[i], NULL, atomic_inc, NULL);
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("%-8s | %-12.4f | %ld\n", "Atomic", get_elapsed_time(start, end), atomic_load(&atomic_counter));

    // 3. Mutex
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NUM_THREADS; i++) pthread_create(&threads[i], NULL, mutex_inc, NULL);
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("%-8s | %-12.4f | %ld\n", "Mutex", get_elapsed_time(start, end), mutex_counter);

    printf("-------------------------------------------------------------\n");
    return 0;
}
```

#### The Anomaly (Observed Data)

The results were shocking:

![image](https://hackmd.io/_uploads/Hy4vdrsWWe.png)

#### Failure Analysis: The Compiler Is Too Smart

Theoretically, 4 threads fighting over `raw_counter` without synchronization must cause **lost updates**. The result should be far less than 40 million. However, the data showed a 0.0002-second execution time and a perfect result.

Upon inspecting the assembly code, we discovered the culprit: **loop-invariant code motion**.

1. **Compiler Logic:** GCC analyzed the loop and realized it was simply adding 1 to `raw_counter` 10 million times. Mathematically, this is identical to adding 10,000,000 once.
2. **Optimization:** The compiler collapsed the entire loop into a single instruction:

   ```asm
   add [raw_counter], 10000000
   ```

3. **Consequence:**
   * Instead of 40 million potential memory collisions, we only had **4**.
   * <span class="yellow">Since these 4 instructions executed in nanoseconds, the probability of them overlapping was near zero.</span>
   * **Conclusion:** We weren't benchmarking concurrency; we were benchmarking the compiler's math skills. The experiment failed.

-----
### 3. Phase 2: Revealing the Hardware Reality

#### The Fix (Re-planning)

To measure the true cost of hardware synchronization, we must force the compiler to be "dumb" and execute every single instruction.

1. **The `volatile` Keyword:** We declared the counters as `volatile`. This tells the compiler: "This variable may change externally. **Do not optimize it.** You must LOAD and STORE from memory every single time."
2. **Increased Contention:** We increased the thread count to **8** and disabled optimization (`-O0`) to ensure fierce contention for the CPU cache.

#### The Code (Version 2)

```c
// Version 2: Force Memory Access
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

#define NUM_THREADS 8
#define ITERATIONS 5000000

volatile long raw_counter = 0;
volatile atomic_long atomic_counter = 0;
volatile long mutex_counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void* unsafe_inc(void* arg) {
    for (int i = 0; i < ITERATIONS; i++) {
        raw_counter++;
    }
    return NULL;
}

void* atomic_inc(void* arg) {
    for (int i = 0; i < ITERATIONS; i++) {
        atomic_fetch_add(&atomic_counter, 1);
    }
    return NULL;
}

void* mutex_inc(void* arg) {
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&lock);
        mutex_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

double get_elapsed_time(struct timespec start, struct timespec end) {
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) * 1e-9;
}

int main() {
    pthread_t threads[NUM_THREADS];
    struct timespec start, end;

    printf("BENCHMARK: Real World Concurrency (%d Threads, %d Ops/Thread)\n", NUM_THREADS, ITERATIONS);
    printf("Target Result: %ld\n", (long)NUM_THREADS * ITERATIONS);
    printf("-------------------------------------------------------------\n");
    printf("%-8s | %-12s | %-25s\n", "Type", "Time (sec)", "Result");
    printf("-------------------------------------------------------------\n");

    // Unsafe
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NUM_THREADS; i++) pthread_create(&threads[i], NULL, unsafe_inc, NULL);
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("%-8s | %-12.4f | %ld\n", "Unsafe", get_elapsed_time(start, end), raw_counter);

    // Atomic
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NUM_THREADS; i++) pthread_create(&threads[i], NULL, atomic_inc, NULL);
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("%-8s | %-12.4f | %ld\n", "Atomic", get_elapsed_time(start, end), atomic_load(&atomic_counter));

    // Mutex
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NUM_THREADS; i++) pthread_create(&threads[i], NULL, mutex_inc, NULL);
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("%-8s | %-12.4f | %ld\n", "Mutex", get_elapsed_time(start, end), mutex_counter);

    printf("-------------------------------------------------------------\n");
    return 0;
}
```

#### The Real Data

![image](https://hackmd.io/_uploads/B1aO_Ssbbe.png)

-----

### 4. Deep Dive Mechanism

Why does the corrected code produce these specific numbers? Let's deconstruct the layers of the system.

#### A. Unsafe (0.0897s) — The Price of Error

* **Instruction:** `LOAD` $\rightarrow$ `ADD` $\rightarrow$ `STORE`.
* **Phenomenon:** Severe data corruption (only ~18% of updates survived).
* **Mechanism (Lost Update):** Two increments overlap, yet the value only increases by 1. While this is the fastest method (zero synchronization overhead), <span class="yellow">the data is garbage</span>.

#### B. Atomic (0.8005s) — The Hardware Tax

* **Instruction:** `LOCK ADD` (x86) or `LDXR/STXR` (ARM).
* **Why 9x Slower?** This slowdown isn't software latency; it is the **physics of coordination**.
  1. **Cache Line Locking (MESI Protocol):** When Core 1 executes an atomic add, it must broadcast a signal to all other cores: "I am claiming exclusive rights to this variable. Invalidate your cache copies!"
  2. **Bus Contention:** With 8 cores broadcasting simultaneously, the CPU's internal ring bus becomes saturated.
  3. <span class="blue">**Store Buffer Drain:** Atomic instructions imply a memory barrier, forcing the CPU to stall the pipeline until the write buffer is drained to memory.</span>

#### C. Mutex (3.4563s) — The Kernel Path (WSL Behavior)

Implementation: `pthread_mutex_lock` backed by the Linux futex.

* **Why 4x Slower Than Atomic?** The mutex does not benefit from adaptive spinning on WSL. Each lock attempt quickly degrades into a futex-based blocking path, causing extremely high kernel overhead.
  1. <span class="blue">**Immediate Futex Blocking:** On WSL, pthread mutexes rarely perform adaptive spinning. Under contention, threads almost immediately enter `futex(FUTEX_WAIT)`, causing a system call and a switch into kernel mode.</span>
  2. **Context Switch Overhead:** A thread in FUTEX_WAIT must be descheduled and later rescheduled via FUTEX_WAKE. Each wait/wake cycle incurs two context switches, orders of magnitude slower than user-space atomic operations.
  3. **Kernel-Dominated Runtime:** Because the critical section (`++counter`) takes nanoseconds while futex sleep/wake and scheduling take microseconds, the program spends nearly all its time in the kernel. As a result, mutex performance is drastically slower than hardware atomics under heavy contention.

:::info
**Store Buffer Drain:** This resembles what we saw in the memory chapters. When the program executes a `STORE`, the CPU does not write the data into the cache immediately; it first places it in a small queue, the Store Buffer. Because the operation is atomic, the implied memory barrier demands that "all memory writes before this barrier must actually complete before any later instruction may execute." Flushing the buffer to satisfy this is called a *drain*, and it slows execution.

**Immediate Futex Blocking:** On real Linux, a contended mutex takes the following steps (combining SPIN and SLEEP):

1. First try to grab the lock with an atomic operation.
2. Adaptive spinning (a very short spin window of tens to hundreds of CPU cycles).
3. If the mutex owner has already been scheduled out, the trade-off favors a sleeping wait (sleep lock).

Only after a certain number of cycles does futex perform the context switch, so even short critical sections get relatively good performance. WSL, however, skips the adaptive-spinning step, which is why it is so slow here.
:::

### 5. Supplement

One of our members also ran both parts on iOS; the results are shown below.

1. ![截圖 2025-12-01 上午11.15.56](https://hackmd.io/_uploads/ryaKX8i-bl.jpg)
2. ![截圖 2025-12-01 上午11.16.06](https://hackmd.io/_uploads/S1QcX8sZZe.png)

### 1. iOS Mutex Performance

The iOS mutex results remain relatively good.

**Reason:** iOS mutexes use a spin-first strategy, so in our simple `++` benchmark, threads acquire the lock quickly and the overhead does not become severe.

### 2.
iOS Atomic May Be Slower Than Mutex

The iOS atomic version is sometimes even slower than the mutex version. Since iOS internals are harder to inspect, we can only hypothesize that Objective-C atomic operations may involve `os_unfair_lock`, which could introduce context-switch overhead. This is only a rough guess, as we cannot confirm the exact implementation details.

**References:**
[01](https://www.vadimbulavin.com/atomic-properties/)
[02](https://stackoverflow.com/questions/21098494/atomic-properties-vs-thread-safe-in-objective-c)

-----

### What I learned (each member writes 2–4 sentences, in their own words)

### @112062124

Different operating systems make different low-level choices in their CPU concurrency strategies, and these usually reflect each system's main user base and typical workloads. Systems aimed at ordinary desktop and mobile users tend to optimize for everyday application scenarios first. In this comparison I actually prefer Linux's approach: it lets short critical sections execute efficiently while preventing longer critical sections from hogging the CPU for extended periods. Of course, how to define a "short critical section" in practice is itself a deep question.

### @111000116

I learned that synchronization overheads, like cache line bouncing, are the real bottlenecks to scalability. While locks ensure correctness, they degrade performance if overused. The "sloppy counter" concept particularly inspired me; it minimizes global synchronization by trading temporary precision for speed. Moving forward, I aim to balance correctness and efficiency by selecting the lightest possible synchronization strategies rather than blindly using locks.

## Content Sharing Agreement

**Please uncheck if you disagree!**

- [X] We agree that excerpts of this note may be shared by the TA to Discord if our team's note is selected as a *shio* highlight.
- [X] We agree that our team ID may be shown on Discord if our team's note is selected as a *shio* highlight.
- [X] We agree that the instructor can show excerpts of this note if our team's note is selected as a *shio* highlight.

:::warning
* Please don't share this HackMD note directly with non-teammates. But you are encouraged to help others (including non-teammates) on Discord. [Follow our collaboration policy](https://sys-nthu.github.io/os25-fall/admin/policies.html#collaboration-citation)
* Please don't peek at or copy another team's HackMD, or use an LLM to generate your reflection.
:::
