# Linux 核心專題: 並行程式設計教材修訂 > 執行人: weihsinyeh > [專題解說錄影](https://youtu.be/glUmT06HuAc) ### Reviewed by `chloe0919` 想請問為何 MPMC 的極限吞吐量 (throughput) 取決於 cache coherence 的成本,而非處理器的個數? > [name=weihsinyeh] > 因為在[競爭資源的問題](https://hackmd.io/dV98JlUPQm-O7y2cpTVOpg#%E7%AB%B6%E7%88%AD%E8%B3%87%E6%BA%90%E7%9A%84%E5%95%8F%E9%A1%8C)中搭配 MPMC 的圖,看到多個生產者/消費者會同時競爭共享資源,對照[Shared-Resources](https://hackmd.io/dV98JlUPQm-O7y2cpTVOpg#%C2%A76-Shared-Resources) 所寫的內容,無論是競爭共享資源還是透過用於溝通的共享資源來幫忙操作,如果競爭就要不斷更新 cache line 的資訊以維持 cache coherence ,這樣才能讓每個正在競爭共享資源的處理器能夠在最快的情況搶到資源,而跨越 chip 的溝通成本也很高。同時隨著處理器的數量越多,越來越多處理器需要不斷更新各處理器的私有 cache line ,表示 cache coherence 的成本更高。但往往卻只有一個處理器能夠獲得存取共享資源的存取機會,這樣的效益很低。所以才會說 MPMC 的極限吞吐量 (throughput) 取決於 cache coherence 的成本,而非處理器的個數 ### Reviewed by `SimonLee0316` 在探討 cache coherence 的問題中提到有 1 個 core 都要去競爭某 1 個 core 持有的 lock ,每次持有 lock 的時間為 $T = 3秒$,這個時間是如何選定的? > [name=weihsinyeh] > 是我自己設定一個值,用來幫自己在推導公式能夠較好理解,同時依據那篇論文最後的公式 $a_k = (n-k)/a$ 沒有這個變數 $T$ ,所以推導到最後會消掉。 ### Reviewed by `zack-404` 在討論「排程器需要如何幫忙中」,想請問 "optinal" 的意思是什麼呢?還是這是 "optional" 錯字呢? > [name=weihsinyeh] > 我寫錯字,謝謝幫忙 ### Reviewed by `cheezad` 在提到 MESI 的時候希望可以放個維基百科的網址([MESI](https://en.wikipedia.org/wiki/MESI_protocol)),因為我對這個沒有很有概念,剛好有一個網址可以點會很方便 > [name=weihsinyeh] > 謝謝,已更新 ## TODO: 〈[並行和多執行緒程式設計](https://hackmd.io/@sysprog/concurrency)〉系列講座並紀錄問題和提出改進 > 可略過「案例: MapReduce」和「以 Model Checking 學習並行處理」 最終成果應整合進〈[Concurrency Primer](https://github.com/sysprog21/concurrency-primer)〉。 ### 〈[概念](https://hackmd.io/@sysprog/concurrency/%2F%40sysprog%2Fconcurrency-concepts)〉 #### (一)並行的好處 當系統中若包含多個 CPU 或不對等 (heterogeneous) 的執行單元時,須將工作拆成多個獨立運行的工作,再分散至各執行單元同時執行。其好處則使其效率最大化。 #### (二)排程器需要如何幫忙 為了要讓各執行單元同時執行,會需要先了解 **排程器**。 1. 背景:多個處理器,會有處理器總是閒置,有處理器總是工作的問題,然而卻不能以 optimal 的方式,預先知道每個任務的執行時間排程。因此需要動態調整以達到負載平衡。有一個 global 的 runqueue,排程器可以簡單分配 runqueue 中的工作應該由哪個處理器執行,或是負責處理負載平衡的 Specialized kernel threads 將工作移至閒置的處理器。 2. 問題:處理負載平衡時,因為只有一個 runqueue,則操作 runqueue 需要有 lock 確保操作的互斥性,則競爭 lock 使得多個處理器有效能瓶頸。 3. 
解決方式:這是會出現每個處理器各有一個 runqueue 的原因,這相當於**拆解 lock 的粒度 (granularity)**。而排程器透過 work steal 的方式協調這些 runqueue 的負載平衡,每個處理器的 runqueue 中的 workers (multiple processes and threads) 會先完成該 runqueue 中的工作後,再從其他 runqueue 中 steal work。以下有兩種方式。
方法一:規劃如何將程式分成多個 threads,使其可以執行在其他的處理器 (**Cilk Model**)。
方法二:當某個 runqueue 的執行緒發現該 runqueue 沒有工作,則將其他還有工作的 runqueue 中的工作移過來做 (**work steal**)。
透過排程器幫忙,讓系統可以善用多核的特性,在一個時間點執行多個行程。

#### (三)工作間的溝通管道
接著為了讓這些行程同時執行時結果符合預期,行程間彼此需要溝通資訊。溝通的管道為共享的資源,因此需要規範此共享的資源。透過 **lock** 來規範,共享資源可粗略分為兩種功能。

**semaphore**
先有 semaphore 的出現,因為 semaphore 沒有 ownership 的概念,而是限制進入 critical section 中的數量上限,semaphore 本身具有通知的功能。
當所有可獨立運作的工作分有 producer 與 consumer 的相對關係後,producer/consumer 做完某件事或做某事情到一定程度,會通知 consumer/producer。

**mutex lock**
後來有 mutex lock,其可簡稱 lock,因為有 ownership 的概念,一次只能一個使用。mutex 通常會搭配 condition variable 來達到通知 (cond signal, cond broadcast) 的功能,因為其他作法會更難實作。
呼應例子 [main.c](https://github.com/sysprog21/concurrent-programs/blob/master/mutex/example/main.c) 之中每個 parent 與 child thread 的關係是 producer 與 consumer。
所以當 parent ready 時會 signal child(這就是透過 condition variable 通知的例子)。當 parent not ready 時會更新共享變數(透過 mutex lock 維護 critical section),而 child 會 wait。

lock 的目的在於達到「並行不悖」,同時搭配(非阻塞同步機制)**atomic**(一旦執行就確保所做的事情一定成功)或是(阻塞同步機制)**spinlock**。否則上述 critical section 的操作又會需要 lock(鎖中鎖)。

**atomic**
實作 atomic 會遇到的問題在於,由於記憶體階層的結構,在 cache line 上有實作的困難。首先 atomics 操作的功能是讓指令與指令可以一氣呵成,如此一來進入 critical section、修改共享記憶體,都不會與其他執行緒有競爭。但是在多核的實作上,由於共享 (public) 的記憶體之上還有多層各核私有 (private) 的 cache,即便指令一氣呵成,這件事情仍無法解決,不屬於 atomic 能解決的範疇。因此多核實作下,共享記憶體上的 race condition 問題需要其他解決方式。
因為共享的記憶體與每個 CPU 各自私有的 cache 可能不一致,分別有 **MESI 與 MOESI** 兩種協定被提出,用來達成 **cache coherence**。

---

## TODO: 改進〈[Concurrency Primer](https://github.com/sysprog21/concurrency-primer)〉關於 Atomics 的論述

### Abstract
背景:有很多實現並行程式的工具。
問題:何時使用這些工具,或是在沒有這些工具下要如何寫出能夠並行的程式。

### §1. Background 順序的重要
背景:執行緒透過共享記憶體溝通,使執行緒可以 block 另一個執行緒的進行。
問題:然而這件事情仰賴 in-order 的執行 (sequential consistency)。但是現代的
(1) compiler optimization:編譯器最佳化會在一個執行緒上 reorder,不會同時將並行的執行緒考慮進去,使得 block 的機制不能夠被預期。
(2) instruction-level parallelism:硬體也會在 pipeline 的 schedulers reorder。
(3) cache coherence:為了避免 CPU 效能受限於存取 RAM 的速度,記憶體階層出現,其中 cache 會搭配 store buffer,用來存 pending writes。因此需要達到 cache coherence,才能讓每個 core 看到共享記憶體最新的變化,以達到 sequential consistency。
解決辦法:為了讓每個 CPU 看到的都是最新的值,需要硬體、編譯器、程式語言的幫忙。

### §2. Enforcing law and order 溝通的方式
解決辦法:程式語言告訴硬體與編譯器不要 reorder,作法是在變數上加上 atomic type。
**這一章說明 atomic type 在順序上的重要性,且這裡的作用為 compiler barrier。**
在變數上加 atomic type 只能使程式碼的順序維持 in-order。

### §3. Atomicity 透過 atomicity 來避免 compiler reorder
背景: But order is only one of the vital ingredients for inter-thread communication. The other is what atomic types are named for: atomicity. Something is atomic if it cannot be divided into smaller parts. For example, some operations must be executed sequentially and cannot be interfered with by other operations. The program section that contains these operations is called a critical section.

#### 問題 1 : torn reads and writes. 用意是舉例如果有些操作沒有 atomicity 會怎樣
Consider a program with two threads. One thread processes a list of files, incrementing a counter each time it finishes working on one. The other thread handles the user interface, periodically reading the counter to update a progress bar. If that counter is a 64-bit integer, we cannot access it atomically on 32-bit machines, since we need two loads or stores to read or write the entire value.
If we are particularly unlucky, the first thread could be halfway through writing the counter when the second thread reads it, receiving garbage. These unfortunate occasions are called torn reads and writes. #### 問題 1 : torn reads and writes. 解法 If reads and writes to the counter are atomic, however, our problem disappears. We can see that, compared to the difficulties of establishing the right order, atomicity is fairly straightforward. ```graphviz digraph { subgraph{ rank = same processor1 [width = 3 shape="box" label = "processor 1\nt4 : Modify v_ready = true"] processor2 [width = 3 shape="box" label = "processor 2"] node1 [ shape="box" style =invis] node2 [ shape="box" style =invis] processor1->node1->node2->processor2 [style =invis] } subgraph{ communicate [width = 3 width = 2 shape="box" label = "Shared Resource\nto communicate"] shared_memory [width = 9 shape="box" label = "Shared Resource"] communicate->shared_memory [style =invis] } processor1 -> communicate [label = "t5 : Write v_ready"] communicate -> processor2 [label = "t6 : Read v_ready"] processor1 -> shared_memory [label = "t3 : Write v"] shared_memory->processor2 [label = "t7 : Read v"] } ``` > Summary of concepts from the first three sections, as shown in Figure 4. In §1, we observe the importance of maintaining the correct order of operations: t3 -> t4 -> t5 -> t6 -> t7, so that two concurrent programs can function as expected. In §2, we see how two concurrent programs communicate to guarantee the order of operations: t5 -> t6. In §3, we understand that certain operations must be treated as a single atomic step to ensure the order of operations: t3 -> t4 -> t5 and the order of operations: t6 -> t7. ### §5. Read-modify-Write #### 前言: So far we have introduced the importance of order and atomicity. In §2, we see how an atomic object ensures the order of single store or load operations is not reordered by the compiler within a program. Based on the correct inter-thread order, multi-threads can establish a correct cross-thread order to communicate successfully. In §3, atomicity ensures that an operation can eventually finish without being interfered with by other operations. This also establishes ordering between operations, as no two operations can occur concurrently. Atomic loads and stores are all well and good when we don’t need to consider the previous state of atomic variables, but sometimes we need to read a value, modify it, and write it back as a single atomic step. As shown in Figure . That is, the modification is based on the previous state that is visible for reading, and the result is then written back. A complete read-modify-write operation is performed atomically to ensure visibility to subsequent operations. ```graphviz digraph { subgraph{ rank = same processor1 [width = 2 shape="box" label = <processor 1<br/><FONT COLOR="Blue">t2 : Modify v</FONT>>] processor2 [width = 2 shape="box" label = "processor 2\n t5 : Modify v"] node1 [ shape="box" style =invis] processor1->node1->processor2 [style =invis] } subgraph{ shared_memory [width = 5 shape="box" label = "Shared Resource"] } shared_memory -> processor1 [label = "t1\ : Read v" fontcolor = "blue"] processor1 -> shared_memory [label = "t3 : Write v" fontcolor = "blue"] shared_memory->processor2 [label = "t4 : Read v"] processor2 -> shared_memory [label = "t6 : Write v"] } ``` > Exchange, Test and Set, Fetch and..., Compare and Swap can all be transformed into atomic RMW operations, ensuring that operations like t1 $\to$ t2 $\to$ t3 will become an atomic step. 
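To make the distinction concrete, below is a minimal C11 sketch (not taken from the Primer's sources; the names are illustrative) contrasting a plain read-modify-write with an atomic one, where the atomic version performs t1 $\to$ t2 $\to$ t3 as one indivisible step:

```c
#include <stdatomic.h>

int plain_counter = 0;          /* ordinary shared variable */
_Atomic int atomic_counter = 0; /* C11 atomic object */

void worker(void)
{
    /* Plain RMW: the load (t1), the addition (t2), and the store (t3)
     * are separate steps; another thread can interleave between them,
     * so one of two concurrent increments may be lost. */
    plain_counter = plain_counter + 1;

    /* Atomic RMW: fetch-and-add performs load, add, and store as a
     * single indivisible operation, so no increment is lost. */
    atomic_fetch_add(&atomic_counter, 1);
}
```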
Furthermore, to communicate with other threads, a shared resource is required. As previously discussed, the process of accessing shared resources responsible for communication also needs to ensure both order and non-interference. To avoid recursive protection of shared resources, atomic operations can be introduced for the shared resources responsible for communication. As shown in Figure . There are a few common read-modify-write (RMW) operations to make theses operation become a single atomic step. In C ++ , they are represented as member functions of std::atomic<T>. In C, they are freestanding functions. ```graphviz digraph { subgraph{ rank = same processor1 [width = 3 shape="box" label = <processor<br/> <FONT COLOR="Red">t1 : (Lock) Modify</FONT><br/><FONT COLOR="Blue">t4 : Modify v</FONT>> ] processor2 [width = 3 shape="box" label = <processor<br/> <FONT COLOR="BLUE">t1 : v_old = v<br/>t2 : Modify private_v</FONT><br/><FONT COLOR="Red">t4 : (Compare) v_old match v</FONT>>] processor1->processor2 [style =invis] } subgraph{ rank = "same" communicate1 [width = 1 shape="box" label = "Shared Resource\nto communicate"] } subgraph{ rank = "same" shared_memory1 [width = 1 shape="box" label = "Shared Resource"] shared_memory2 [width = 2 shape="box" label = "Shared Resource"] shared_memory1->shared_memory2 [style =invis] } communicate1->shared_memory1 [style =invis] communicate1 -> processor1 [label = "(Lock)\nt0 : Read" fontcolor="red"] processor1 -> communicate1 [label = "(Unlock)\nt6 : Write" fontcolor="red"] processor1 -> communicate1 [label = "(Lock)\nt2 : Write" fontcolor="red"] shared_memory1 -> processor1 [label = " t3 : Read v" fontcolor="blue"] processor1 -> shared_memory1 [label = "t5 : Write v" fontcolor = "blue"] shared_memory2 -> processor2 [label = " t0 : Read v" fontcolor="blue"] shared_memory2->processor2 [label = " t3 : Read v" fontcolor = "red"] processor2 -> shared_memory2 [label = <t5 : <FONT COLOR="Red">(Swap)</FONT><br/><FONT COLOR="Blue">Write private_v</FONT>>] } ``` > Test and Set (Left) and Compare and Swap (Right) leverage their functionality of checking and their atomicity to make other RMW operations perform atomically. The red color represents atomic RMW operations, while the blue color represents operations that behave atomically. 接下來以三個面向探討: 3. 重疊的問題 比較 (Exchange) (Test and Set) (Fetch and) vs CAS #### 5.1 Exchange: Transform RMW into modifying a private variable first, and then directly swapping the private variable with the shared variable. Therefore, we only need to ensure that the second step, which involves loading the global variable and then exchanging it with the local variable, is a single atomic step. This allows programmers to extensively modify the private variable beforehand and only write it to the shared variable when necessary.  #### 5.2 Test and Set : Test-and-set works on a Boolean value: we read it, set it to true, and provide the value it held beforehand. C and C ++ offer a type dedicated to this purpose, called atomic_flag. The initial value of an atomic_flag is indeterminate until initialized with ATOMIC_FLAG_INIT macro. Test-and-set operations are not limited to just RMW (Read-Modify-Write) functions; they can also be utilized for constructing simple spinlock. In this scenario, the flag acts as a shared resource for communication between threads. Thus, spinlock implemented with test-and-set operations ensures that entire RMW operations on shared resources are performed atomically. 
As shown in Figure
```c
atomic_flag af = ATOMIC_FLAG_INIT;

void lock()
{
    while (atomic_flag_test_and_set(&af)) { /* wait */ }
}

void unlock() { atomic_flag_clear(&af); }
```
If we call lock() and the previous value is false, we are the first to acquire the lock, and can proceed with exclusive access to whatever the lock protects. If the previous value is true, someone else has acquired the lock and we must wait until they release it by clearing the flag.

#### 5.3 Fetch and
Transform RMW to directly modify the global variable (such as addition, subtraction, or bitwise AND, OR, XOR) and return its previous value, all as part of a single atomic operation. The difference from Exchange is that when programmers only need to make simple modifications to the shared variable, they can use fetch-and.

#### 5.4 Compare and swap
Finally, we have compare-and-swap (CAS), sometimes called compare-and-exchange. It allows us to conditionally exchange a value if its previous value matches some expected one. In C and C++, CAS resembles the following, if it were executed atomically:
```c
/* A is an atomic type. C is the non-atomic type corresponding to A */
bool atomic_compare_exchange_strong(A *obj, C *expected, C desired)
{
    if (memcmp(obj, expected, sizeof(*obj)) == 0) {
        memcpy(obj, &desired, sizeof(*obj));
        return true;
    } else {
        memcpy(expected, obj, sizeof(*obj));
        return false;
    }
}
```
The _strong suffix may leave you wondering if there is a corresponding "weak" CAS. Indeed, there is. However, we will delve into that topic later in §8.1.

Because CAS involves an expected value comparison, it allows CAS operations to extend beyond just RMW functions. Here's how it works: first, read the shared resource and use this value as the expected value. Modify the private variable, and then CAS: compare the current shared variable with the expected value. If they match, it indicates that the modification was exclusive. If they don't match, it implies interference from another thread; in that case, update the expected value with the current shared value and retry the modification in a loop. This iterative process allows CAS to serve as a communication mechanism between threads, ensuring that entire RMW operations on shared resources are performed atomically.

As shown in , compared with Test and Set, a thread that employs CAS directly uses the shared resource to check, and uses the atomic CAS to ensure that the Modify operation is atomic, coupled with a while loop to ensure that the entire read-modify-write operation behaves atomically.

### §6. Shared Resources
From Section 5, we have understood that there are two types of shared resources that need to be considered. The first type is a shared resource that concurrent threads access in order to collaborate on a goal. The second type is a shared resource that serves as a communication channel for concurrent threads, ensuring correct access to shared resources. However, all of these considerations stem from a programming perspective, where we only distinguish between shared resources and private resources.

Given all the complexities to consider, modern hardware adds another layer to the puzzle, as depicted in Figure 2 `fig:ideal-machine`. Remember, memory moves between the main RAM and the CPU in segments known as cache lines. These lines also represent the smallest unit of data transferred between cores and their caches.
When one core writes a value and another reads it, the entire cache line containing that value must be transferred from the first core’s cache(s) to the second, ensuring a coherent “view” of memory across cores. This dynamic can significantly affect performance. **目標:把討論變成cache line的角度** This slowdown is even more insidious when it occurs between unrelated variables that happen to be placed on the same shared resource, which is the cache line, as shown in \fig{false-sharing}. When designing concurrent data structures or algorithms, this false sharing must be taken into account. One way to avoid it is to pad atomic variables with a cache line of private data, but this is obviously a large space-time trade-off. **目標:理解現在的共享資源為 cache line** ```graphviz digraph { subgraph cluster_ground_floor { label="chip 1" processor1 [width = 2 shape="box" label = "processor 1" ] shared_memory1 [width = 1 shape="record" label = "{L1 cache line|{<f2> A|<f3> B}}}"] processor1 -> shared_memory1:f2 [label = " t1 : Modify A" fontcolor = blue] } subgraph cluster_top_floor { label="chip 2" processor2 [width = 2 shape="box" label = "processor 2"] shared_memory2 [width = 1 shape="record" label = "{L1 cache line|{<f2> A|<f3> B}}}"] processor2 -> shared_memory2:f3 [label = "t1 : Modify B" fontcolor = blue] } shared_memory4 [width = 3 shape="box" label = "L2 cache line"] shared_memory4->shared_memory2:f2 [label = " t0 : Read"] shared_memory4->shared_memory1 [label = " t0 : Read"] shared_memory2->shared_memory4 [label = " t2 : Write"] shared_memory1->shared_memory4 [label = "t2 : Write"] } ~ ``` > caption of Figure : > **false-sharing** : Processor 1 and Processor 2 operate independently on variables A and B. Simultaneously, they read the cache line containing these two variables. In the next time step, each processor modifies A and B in their private L1 cache separately. Subsequently, both processors write their modified cache line to the shared L2 cache. At this moment, the expansion of the scope of shared resources to encompass cache lines highlights the importance of considering cache coherence issues. > 另外一種書寫方式 > 因此為了使得存取 shared L2 cache line 的操作正確,如第五章所言我們有兩個方法: > 方法一:可以用一個共享資源去使這個操作 shared L2 cahce line 的操作為 atomic 如同 Fig 3. > 方法二:或是將操作直接變成 atomic (這個在探討執行順序的時候再說) 。 Not only the shared resource, but also we need to consider shared resource that serves as a communication channel, e.g. spinlock. Processors communicate through cache lines when using locks. When a processor broadcast the release of a lock, multiple processors on different cores attempt to acquire the lock simultaneously. In order to maintain consistent lock's state across all L1 cache lines, which is cache coherence. The cache line containing the lock will constantly be transferred among the caches of those cores. Unless the critical sections are considerably lengthy, the time spent managing this cache line movement could exceed the time spent within the critical sections themselves,∗ despite the algorithm’s non-blocking nature. With this high communication costs, there may be only one processor typically succeeds in acquiring it again in case of mutex lock or spinlock, leading to high communication costs with little practical benefit (only one processor gains the lock). This disparity severely limits the scalability of spin lock. 
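The padding mitigation mentioned above (giving each atomic variable its own cache line) can be sketched in C11 as follows; this is only an illustration, assuming a 64-byte cache line, which is common but target-dependent, and the struct name is hypothetical:

```c
#include <stdalign.h>
#include <stdatomic.h>

/* Each counter occupies its own 64-byte cache line, so threads that
 * update different counters no longer invalidate each other's lines
 * (no false sharing). The cost is space: most of each line is padding. */
struct padded_counter {
    alignas(64) _Atomic long value;
};

struct padded_counter counters[4]; /* e.g., one per worker thread */
```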
> 另外一種書寫方式 > 先談論方法一,試圖用在 5.2 Test and Set 的 spinlock 來解決這個問題,如同圖 spinlock 。 processor 是透過 cache line 在溝通的,然而現在不同的 processor 都要存取同個 lock,並且時刻注意是否被 unlock 了。因此需要保持 lock 的狀態在每個 L1 cache line 都一樣(cache coherence) 。然而每次釋放 lock 後,在 chip 間廣播,卻只有一個 processor 會再獲得 lcok,這使得溝通的成本變得很高與實際效益很低(只有一個processor 會再獲得lock),兩者落差太大。而這件事情造成 spin lock 的 scalability 不高。 > 先不用在 Concurrency Primer 中提到 參考:〈[從 CPU cache coherence 談 Linux spinlock 可擴展能力議題](https://hackmd.io/@sysprog/linux-spinlock-scalability)〉 > 因此 spin lock 點對點溝通的方式出現,為了避免 invalidate 佔據整個匯流排。 同時將競爭的 lock 資源提高細粒度,因此 ticket lock 出現,區域的 lock,使多處理器不必為了全域的 lock 而不斷進行 cache coherence。而 ticket lock 已經知道下一個會取得全域 lock,則可直接透過資料結構佇列,來傳遞給下一個。 **目標:理解cache coherehece 更新有多困難,效能低** **目標:理解cache coherehece 造成競爭 lock (shared resource to communicate)時效能下降 $\to$ 使用 lock 不是一個很優的想法** **目標:跨越chip 的溝通成本高** ```graphviz digraph { graph [splines=ortho] subgraph cluster_ground_floor { label="chip 1" processor1 [width = 3 shape="box" label = "processor 1\nt4 : unlock" ] shared_memory1 [width = 2 shape="record" label = "{L1 cache line|{<f1>lock |<f2>Private\nResource}}"] processor1 -> shared_memory1 [label = "t2 : lock" ] processor1 -> shared_memory1 [label = " t5 : access"] } subgraph cluster_top_floor { label="chip 2" processor2 [width = 3 shape="box" label = "processor 2"] shared_memory2 [width = 2 shape="record" label = "{L1 cache line|{<f1>lock |<f2>Private\nResource}}"] processor2 -> shared_memory2 [label = " t6 : try acquire lock" fontcolor = blue] } subgraph cluster_ground_floor1 { label="chip 3" processor3 [width = 3 shape="box" label = "processor 3\n" ] shared_memory3 [width = 2 shape="record" label = "{L1 cache line|{<f1>lock |<f2>Private\nResource}}"] processor3 -> shared_memory3 [label = "t6 : try acquire lock" fontcolor = blue] } subgraph { rank=same node [shape=point width=0] edge [dir=none] inter1->inter2 inter2->inter3 [label="Interconnect"] } shared_memory4 [width = 6 shape="box" label = "L2 cache line"] shared_memory3->inter3 [dir=none weight = 3 dir=none] shared_memory2->inter2 [dir=none label=" Interconnect"] shared_memory1->inter1 [weight = 3 label = " t6 : broadcast" fontcolor = blue] inter2 -> shared_memory4 [dir=none] } ``` > caption of Figure : > **spinlock** : Three precessors use lock to communicate to insure the access operations to shared L2 cahce will be correct. Processor 2 and 3 are trying to acquire lock that hold by Processor 1. Therefore, when processor 1 unlock, the state of lock need to be updated on other processor's L1 cache. While processors can acquire the lock through an atomic RMW operation. ### §7. Concurrency tool and Synchronization 並行的工具:atomic 操作,用 atomic 操作使操作表現得像 atomic,用於溝通的共享資源, CAS 再將這些工具分為 blocking 與 lockless. > Atomic loads, stores, and RMW operations are the building blocks for every single concurrency tool. It is useful to split those tools into two camps: blocking and lockless. As mentioned in **§5. Read-modify-write**, multiple threads can use these blocking tools to communicate with others. Furthermore, these blocking tools can even assist in synchronization between threads. The blocking mechanism is quite simple, because all threads need to do is block others in order to make their own progress. However, this simplicity can also cause threads to pause for unpredictable durations and then influence the throughput of the overall system. #### 細說 blocking mechanisms 的問題。時間, dead lock, scalability. ##### Waiting time Take a mutex as an example: it requires threads to access shared data sequentially. 
If a thread locks the mutex and another attempts to lock it too, the second thread must wait, or block, until the first one unlocks it, regardless of the wait time.

##### Deadlock
If the second thread already holds another mutex, and the first thread, while holding its own mutex, then tries to lock the mutex held by the second thread (while the second thread in turn waits for the first), deadlock occurs. Therefore, we can see that deadlock happens when different threads lock mutexes in incompatible orders, which leads to the system becoming immobilized as threads perpetually wait on each other.

##### Scalability
Additionally, in **§6. Shared Resources**, we can see another problem of the lock: its scalability is limited.

After understanding the issues that blocking mechanisms are prone to, we try to achieve synchronization between threads without locks. Consider the following program, first assuming that only a single thread executes these operations:
```c
        | Thread 1        | Thread 2
        | while (x == 0)  |
        | {               | while (x == 0)
t(n)    |     x = 1 - x;  | {
t(n+1)  | }               |     x = 1 - x;
        |                 | }
```
When executed by a single thread, these operations complete within a finite time. However, with two threads executing concurrently, if one thread executes `x = 1 - x` and the other thread executes `x = 1 - x` right after it, then the value of x observed at each loop test will always be 0, which leads to livelock.

Therefore, even without any locks, concurrent threads still cannot guarantee that the overall system makes progress toward achieving the programmer's goals. Consequently, we should not focus on comparing which communication tools or synchronization mechanisms are better, but rather on exploring how to effectively use these tools in a given scenario to facilitate smooth communication between threads and achieve the programmer's goals.

---

## TODO: 新增 lock-free 章節到〈[Concurrency Primer](https://github.com/sysprog21/concurrency-primer)〉

### §8. Lock-Free

#### 前言
In the previous chapter, we explored different mechanisms based on concurrency tool characteristics. Next, we need to explore which strategies can help design a concurrent program that allows concurrent threads to collectively ensure progress of the overall system while also improving scalability, which is the initial goal of designing a concurrent program. In particular, we will explore the relationship between the progress of each thread and the progress of the entire system.

#### Type of Progress
When we consider the scenario where many concurrent threads collaborate and each thread is divided into many operations:

Wait-Free: Every operation in every thread will be completed within a limited time. This also implies that each operation contributes to the overall progress of the system.

Lock-Free: At any given moment, among all operations in every thread, at least one operation contributes to the overall progress of the system. However, it does not guarantee that starvation will not occur.

Obstruction-Free: At any given time, if there is only a single thread operating without interference from other threads, its instructions can be completed within a finite time. Therefore, when threads are working concurrently, it does not guarantee progress.
Therefore, we can classify progress types into three sets: the set Obstruction Free includes the set Lock Free, and the set Lock Free includes the set Wait Free. Achieving Wait Free is the most optimal approach, allowing each thread to progress without being blocked by others. ```graphviz digraph { tbl_1 [ shape=plaintext label=< <table border='2' cellborder='1' cellspacing='0'> <tr><td colspan='4'> wait-free system </td></tr> <tr><td> timeline </td><td>Time 1</td><td>Time 2</td><td>Time 3</td></tr> <tr><td>Overall </td><td> 3 × progress </td><td> 3 × progress </td><td> 3 × progress </td></tr> <tr><td>Thread 1 </td><td> progress </td><td> progress </td><td> progress </td></tr> <tr><td>Thread 2 </td><td> progress </td><td> progress </td><td> progress </td></tr> <tr><td>Thread 3 </td><td> progress </td><td> progress </td><td> progress </td></tr> </table> >]; tbl_2 [ shape=plaintext label=< <table border='2' cellborder='1' cellspacing='0'> <tr><td colspan='4'> lock-free system </td></tr> <tr><td> timeline </td><td>Time 1</td><td>Time 2</td><td>Time 3</td></tr> <tr><td>Overall </td><td> 1 × progress </td><td> 0 × progress </td><td> 1 × progress </td></tr> <tr><td>Thread 1 </td><td> progress </td><td><font color="red"> suspend </font></td><td><font color="red"> suspend </font></td></tr> <tr><td>Thread 2 </td><td> wait </td><td> wait </td><td><font color="blue"> progress </font></td></tr> <tr><td>Thread 3 </td><td> wait </td><td> wait </td><td> wait </td></tr> </table> >]; tbl_3 [ shape=plaintext label=< <table border='2' cellborder='1' cellspacing='0'> <tr><td colspan='4'> obstruction-free system</td></tr> <tr><td> timeline </td><td>Time 1</td><td>Time 2</td><td>Time 3</td></tr> <tr><td>Overall </td><td> 1 × progress </td><td> 0 × progress </td><td port="2"> 0 × progress </td></tr> <tr><td>Thread 1 </td><td> progress </td><td port="1"><font color="red"> suspend </font></td><td><font color="red"> suspend </font></td></tr> <tr><td>Thread 2 </td><td> wait </td><td> wait </td><td port="3"><font color="red"> wait </font></td></tr> <tr><td>Thread 3 </td><td> wait </td><td> wait </td><td port="4"><font color="red"> wait </font></td></tr> </table> >]; } ``` > 新版 > **progress-type** : 在 wait free system,由於每個 Thread 都不會 block 其他 Thread,因此每個 Thread 在每刻都會有 progress,因此整個系統都會有 progress 。lock free 在 time 1 時,Thread 1 block 其他 Threads 使其他 Threads 等待。在 time 2 時, Thread 1 被 suspend 後,不會造成其他 Thread 一直被 block ,在 time 3 時,Thread 2 依然可以有 progress,而整個系統也因此而有 progress。 在 obstruction free 時,在 time 2 時, Thread 1 被 suspend 後,造成其他 Thread 也因此被 block 。因此在 time 3 時,Thread 2 跟 Thread 3 依舊還在 wait,使得系統在之後都沒有 progress 了 。 > In a wait-free system, each thread is guaranteed to make progress at every moment because no thread can block others. This ensures that the overall system can always make progress. In a lock-free system, at Time 1, Thread 1 may cause other threads to wait while it performs its operation. However, even if Thread 1 suspends at Time 2, it does not subsequently block other threads. This allows Thread 2 to make progress at Time 3, ensuring that the overall system continues to make progress even if one thread is suspended. In an obstruction-free system, when Thread 1 is suspended at Time 2, it causes other threads to be blocked as a result. This means that by Time 3, Thread 2 and Thread 3 are still waiting, preventing the system from making progress thereafter. 
Therefore, obstruction-free systems may halt progress if one thread is suspended, leading to the potential blocking of other threads and stalling the system. ```graphviz digraph G { rankdir = "LR" subgraph cluster{ start [shape="plaintext", label = "Obstruction Free"] subgraph cluster_0 { A [shape=box label = "Wait Free" ] label = "Lock Free"; } } } ``` > **progress-type** : > We can classify progress types into three sets: the set Obstruction Free includes the set Lock Free, and the set Lock Free includes the set Wait Free. Achieving Wait Free is the most optimal approach, allowing each thread to progress without being blocked by others. The main goal is that the whole system, which contains all concurrent threads, is always making forward progress. To achieve this goal, we rely on concurrency tools, including atomic operation and the operations that perform atomically. Additionally, we carefully select synchronization mechanism, which may involve utilizing shared resources for communication (lock). Furthermore, we design our program with appropriate **data structures** and **algorithms**. Therefore, lock-free doesn't mean we cannot use any lock; we just need to ensure that the blocking mechanism will not limit the scalability and that the system can avoid the problems in **Concurrency Tool and Synchronization** (e.g., long time of waiting, deadlock). #### 消費者生產者問題 (SPMC) We take the single producer and multiple consumers problem as an example to demonstrate how to achieve fully lock-free programming by improving some implementations step by step. #### SPMC Solution 1 - Lock-based Firstly, introduce the scenario of lock-based algorithms. This will block the consumer, and consumers will race for the job. Producer ```c while(ThereAreMoreTasks()){ task = AllocateAndBuildNewTask(); { lock_guard<mutex> lock{mut}; // enter critical section queue.push(task); // exit critical section } cv.notify(); // } { lock_guard<mutex> lock{mut}; // enter critical section queue.push(done); // add sentinel; that's all folks } // exit critical section cv.notify(); ``` Consumer ```c myTask = null; while( myTask != done ){ { lock_guard<mutex> lock {mut}; // enter critical section while queue.empty() // if not ready, do not busy-waiting. cv.wait(mut); // myTask = queue.first(); // take task if( myTask != done ) // remove it if not the sentinel, queue.pop(); // which others need to see } // exit critical section. if(myTask != done) DoWork( myTask ); } ``` `state 1` 生產者邊增加工作到工作佇列,消費者邊等待工作佇列出現工作。 `state 1` The producer is adding tasks to the job queue while consumers wait for tasks to become available and is ready to take on any job that appears in the job queue. `state 2` $\to$ `state 3` 生產者放完一個工作到工作佇列,釋放工作佇列的 mutex lock。 `state 2` $\to$ `state 3` After the producer adds a task to the job queue, the producer releases the mutex lock for the job queue. `state 3` $\to$ `state 4` 消費者一號搶到工作佇列的 mutex lock,從工作佇列拿取工作後,釋放工作佇列的 mutex。 `state 3` $\to$ `state 4` Consumer 1 acquires the mutex lock for the job queue, retrieves a task from it, and then releases the mutex lock. `state 5` 下一個消費者搶到工作佇列的 mutex lock,但已經沒有工作在工作佇列中。 `state 5` Next, other consumers attempt to acquire the mutex lock for the work queue. However, after they acquire the lock, they find no tasks in the queue. `state 6` 因此消費者在 condition variable 上等待生產者換醒。這時候消費者不是 busy wait,而是等待生產者喚醒。是進階版的 mutex lock。 `state 6` Consequently, the consumers wait on a condition variable for the producer to signal. 
During this time, the consumers are not busy waiting but rather waiting for the producer to wake it up. This mechanism is an advanced form of mutex lock. ```graphviz digraph G { rankdir = "LR" subgraph { node [shape=ellipse ] inter1 [label = "state1"] inter2 [label = "state2"] inter3 [label = "state3"] inter4 [label = "state4"] inter5 [label = "state5"] inter6 [label = "state6"] edge [dir=none] rank = same inter1->inter2 inter2->inter3 [label = " un lock\n job queue"] inter3->inter4 inter4->inter5->inter6 } subgraph { Producer [shape="ellipse", label = "Producer"] subgraph cluster{ label = "mutex lock" shared [shape = "record" label = "{job||||||}"]; } Producer -> shared } subgraph G { unlock [shape = "plaintext" label = " "] Producer1 [shape="ellipse", label = "Producer"] cond_v [shape="ellipse", label = "condition variable"] Producer1 -> cond_v [label ="wake"] } subgraph cluser1{ consumer_01 [label= "consumer 1"] consumer_02 [label= "consumer 2"] consumer_03 [label= "consumer 3"] shared1 [shape = "record" label = "{job||||||}"]; consumer_01 -> shared1 [ label= "acquire"] consumer_02 -> shared1 [ label= "acquire"] consumer_03 -> shared1 [ label= "acquire"] } subgraph { Consumer_04 [shape="ellipse", label = "Consumer 1"] subgraph cluster{ label = "mutex lock" shared_04 [shape = "record" label = "{job||||||}"]; } Consumer_04 -> shared_04 } subgraph{ Consumer5_2 [shape="ellipse", label = "Consumer 2"] Consumer5_3 [shape="ellipse", label = "Consumer 3"] cond_v_5 [shape="ellipse", label = "condition variable"] Consumer5_2 -> cond_v_5 [label = wait] Consumer5_3 -> cond_v_5 [label = wait] Consumer5 [shape="ellipse", label = "Consumer 2\nConsumer 3"] subgraph cluster2{ label = "mutex lock" shared_5 [shape = "record" label = "{||||||}"]; } Consumer5 -> shared_5 } { rank = same Producer ->Producer1 -> consumer_01 -> consumer_02 -> consumer_03 -> Consumer_04->Consumer5->Consumer5_2 [style=invis] } subgraph { inter1->Producer [style=invis] inter2->Producer1 [style=invis] inter3->consumer_02 [style=invis] inter4->Consumer_04 [style=invis] inter5->Consumer5 [style=invis] inter6->Consumer5_2 [style=invis] } } ``` > Catpion **SPMC-Solution1** : The interaction between the producer and consumer in Solution 1, including their state transitions. > [An Introduction to Lock-Free Programming](https://preshing.com/20120612/an-introduction-to-lock-free-programming/) > One important consequence of lock-free programming is that if you suspend a single thread, it will never prevent other threads from making progress, as a group, through their own lock-free operations. 之所以此程式碼不為 lock free 的原因,是因為當生產者如果暫停了。 由於生產者中斷會使消費者沒有工作,進而被 block,整個系統會沒有 progress。 而當一個消費者如果拿到工作佇列的 lock 後被中斷,則其他消費者也會被 block 。但是生產者還是不斷地增加工作,使得整個系統會沒有 progress。 所以無論前者與後者的實作方式都不為 lock free。 The reason why this implementation is not lock-free is: First, if a producer suspends, it causes consumers to have no work available, leading them to block and thus halting progress in the entire system. Secondly, when consumers need to access shared resources (the job queue) concurrently and one consumer acquires the lock of the job queue but suddendly gets suspended before completing without unlocking, causing other consumers to be blocked. Meanwhile, producers continue adding tasks, yet the system fails to make progress. Therefore, neither the former nor the latter implementation approach is lock-free. 
#### SPMC Solution 2 - Lock-base and Lock Free (演算法,改變想法) 為了解決 Solution 1 消費者競爭工作佇列的 lock ,還可能因為工作佇列沒有工作,竟而要等待的問題。因此接著真對工作佇列沒有工作提出解法。Secondly, introduce the secenario of Lock-base and lock free algorithms. In Solution 1, consumers contend for the lock of the job queue to access the queue; however, after they acquire the lock, they may still need to wait when the queue is empty. To solve this issue, the introduction of lock-based and lock-free algorithms is presented in the following section. Producer ```c Producer first build the whole job ``` Consumer ```c myTask = null; while( myTask == Null){ lock_guard<mutex> lock {mut}; // enter critical section if(head != Null){ // no race for the job. myTask = head; head = head->next; } } ``` `state 0` : producer 把所有工作都先做好。 `state 0` : The producer prepares all the jobs in advance. `state 1` : 消費者一號搶到工作佇列的 lock,拿走工作後釋放 lock 。 `state 1` : Consumer 1 acquires the lock on the job queue, takes a job, and releases the lock. `state 2` : 消費者二號重新獲得 lock 後由於生產者已經是先將工作都建立在工作佇列中,因此消費者二號不會因為沒有工作做而等待。 透過這種方式消費者只要有拿到工作佇列的 lock 就一定有工作,除非是所有工作都已被消費者拿走。因此不需要再因為沒有工作而被 blocked。唯一要等待的只有競爭工作佇列的鎖。 `state 2` : After Consumer 2 acquires the lock, it can find that there are still jobs in the queue. Through this approach, once a consumer obtains the lock on the job queue, there is guaranteed work available unless all jobs have been taken by other consumers.  Thus, there is no need to wait due to a lack of jobs; the only wait is for acquiring the lock to access the job queue. ```graphviz digraph G { rankdir = "LR" subgraph { node [shape=ellipse ] inter0 [label = "state 0"] inter1 [label = "state 1"] inter2 [label = "state 2"] edge [dir=none] rank = same inter0->inter1->inter2 } subgraph{ Producer [shape="ellipse", label = "Producer"] subgraph cluster{ label = "mutex lock" shared0 [shape = "record" label = "{job|job|job|job|job|job|job}"]; } Producer -> shared0 consumer_01_02 [shape="ellipse" label= "consumer 1 & 2\n sleep"] } subgraph{ consumer_11 [shape="ellipse", label = "Consumer 1"] subgraph cluster{ label = "mutex lock" shared_1 [shape = "record" label = "{job|job|job|job|job|job|}"]; } consumer_11 -> shared_1 consumer_12 [label= "consumer 2\n wait"] } subgraph{ consumer_22 [shape="ellipse", label = "Consumer 2"] subgraph cluster{ label = "mutex lock" shared_2 [shape = "record" label = "{job|job|job|job|job||}"]; } consumer_22 -> shared_2 consumer_22 [label= "consumer 2"] } { rank= same consumer_01_02 ->Producer [style = invis] consumer_12->consumer_11 [style=invis] } subgraph { inter0->Producer [style=invis] inter1->consumer_11 [style=invis] inter2->consumer_22 [style=invis] } } ``` > Catpion **SPMC-Solution2** : The interaction between the producer and consumer in Solution 2, including their state transitions. 之所以稱這個演算法與資料結構的實作為 locked based and lock-Free 是因為演算法設計成當生產者將所有工作都放入工作佇列,消費者才去拿取工作,因此並不會發生生產者被暫停,而消費者會因此被 block ,所以這裡為 lock-free。 不是 lock-free 的原因如同 SPMC Solution1 的第二個。 The reason this implementation is referred to as both locked-based and lock-free is because the algorithm is designed such that the producer adds all jobs to the job queue before consumers begin taking them. This design ensures that if the producer suspends or adds the job slowly, consumers will not be blocked due to the lack of a job. Therefore, it qualifies as lock-free. The reason that it's locked-based, not lock-free, is the same as the second aspect of SPMC Solution 1. #### SPMC 解法三 - Fully Lock Free (Changing the data structure to reduce the granularity of locks.) 
> 在 [CppCon 2014: Herb Sutter "Lock-Free Programming"](https://youtu.be/c1gO9aB9nbs) 的演講提到: > Going Fully "Lock Free", Atomic Mail Slots. > 這就像先前在 MPMC MPSC/SPMC SPSC 討論的,要將低 lock 的粗礪度,(降低共享) 將會競爭的工作佇列分成多個 slots 。 **目標:理解競爭 lock 所保護的共享資源的粗礪度會影響效能** In Section Shared Resource, we can understand that communications between processors across a chip are through cache lines, which incurs high costs. Additionally, using locks further decreases overall performance and limits scalability. However, when locks are necessary for concurrent threads to communicate, reducing the sharing resource and the granularity of the sharing resource to communicate (lock) is crucial. Therefore, to achieve fully lock-free programming, we change the data structure to reduce the granularity of locks. **改變資料結構,降低共享,直接規避競爭問題** ```graphviz digraph { subgraph lock { subgraph cluster_1_1 { label="chip 1" processor_1_1 [width = 2.5 shape="box" label = "processor 1" ] shared_memory_1_1 [width = 0.5 shape="record" label = "{L1 cache line|{<f1>Resource to\ncommunicate |{<f2> head}}}"] processor_1_1 -> shared_memory_1_1:f1 [label = "lock jobqueue" fontcolor = blue] processor_1_1 -> shared_memory_1_1:f2 [label = "access" fontcolor = blue] } subgraph cluster_1_2 { label="chip 2" processor_1_2 [width = 2.5 shape="box" label = "processor 2"] shared_memory_1_2 [width = 0.5 shape="record" label = "{L1 cache line|{<f1>Resource to\ncommunicate |{<f2>head}}}"] processor_1_2 -> shared_memory_1_2:f1 [label = "try acquire jobqueue's lock" fontcolor = blue] } shared_memory1_4 [width = 2 shape="record" label = "{L2 cache line : jobqueue|{<f0>job|job|job|job|job|job|job}}"] shared_memory_1_2:f2->shared_memory1_4:f0 shared_memory_1_1:f2->shared_memory1_4:f0 [label = "access" fontcolor = blue] } subgraph free { subgraph cluster_2_2 { label="chip 2" processor_2_2 [width = 2.5 shape="box" label = "processor 2"] shared_memory_2_2 [width = 0.5 shape="record" label = "{L1 cache line|{head}}"] processor_2_2 -> shared_memory_2_2:f1 [label = "access" fontcolor = blue] } subgraph cluster_2_1 { label="chip 1" processor_2_1 [width = 2.5 shape="box" label = "processor 1" ] shared_memory_2_1 [width = 0.5 shape="record" label = "{L1 cache line|{ head}}}"] processor_2_1 -> shared_memory_2_1:f2 [label = "access" fontcolor = blue] } shared_memory_2_4 [width = 2 shape="record" label = "{L2 cache line : jobqueue|{{<f0>slot 1|job}|{<f1>slot 2|job}}}"] shared_memory_2_2:f2->shared_memory_2_4:f1 [label = "access" fontcolor = blue] shared_memory_2_1:f2->shared_memory_2_4:f0 [label = "access" fontcolor = blue] } } ``` > Catpion **SPMC-Solution3** : The left side shows that the lock protects the entire job queue to ensure exclusive access to its head operations for multi-thread access. The right side illustrates that each thread has its own slot for accessing jobs, not only achieving exclusivity through data structure but also eliminating the need for shared resources to communicate. Providing each consumer with their own unique slot to access jobs addresses the problem at its root, directly avoiding competition. 下面狀態圖先移掉。 Following is the interaction between the producer and consumer as shown in Solution 3, including their state transitions, is shown as **SPMC-Solution3-FSM**. 
```graphviz digraph { Start [width = 1 shape="ellipse" label = "Start" ] { rank = "same" Empty [width = 1 shape="ellipse" label = "Empty" ] Task [width = 1 shape="ellipse" label = "Task" ] Done [width = 1 shape="ellipse" label = "Done" ] } Start->Empty Empty->Task->Done [style=invis] Task->Empty [label = "Consumer"] Empty->Done [label = " Producer"] Empty->Task [label = " Producer"] Done->End } ``` > Catpion **SPMC-Solution3-FSM** : The finite state machine is as follows: When the producer finishes creating jobs, the state transitions from Empty to Task, then waits for consumers to complete their tasks. When a consumer finishes a task, transitioning from Empty to Done, and the total number of completed tasks reaches a certain number, the state transitions from Done to End, indicating completion. Meanwhile, consumers take jobs, transitioning the state from Task to Empty. #### SPMC 解法四 - Lock-base and Lock Free (Concurrency Tool 的解法) 除了將粗礪度降低外,還可以將 **第二個解法 Lock-based and Lock Free 方法** 中,消費者競爭工作佇列時,如果一個消費者取得工作佇列的 lock ,卻中途被暫停(suspend),使得其他消費者被 block 時。為了避免這個狀況,可以使用 **Section 5-4. CAS** ,由於 CAS 搭配 loop 迴圈,使得 Write 的操作達到語意上的 atomic。 In addition to reducing granularity, there is another way to avoid that if one consumer acquires the lock on the job queue but suddenly gets suspended, causing other consumers to be blocked in **SPMC-solution2**. We can use **Section 5-4. CAS** with a loop to ensure that the write operation achieves semantic atomicity, as shown in \fig{fig:atomic-type}. 不同於 **SPMC-solution2** 透過 shared resource to communicate (mutex) 來達到 blocking 的同步機制。如果有一個執行緒持有 lock,則其他執行緒執行失敗則等待。 CAS 讓執行緒執行失敗仍繼續不斷地嘗試,因此如果有一個執行緒被 block 住,代表一定有一個執行緒有進展,因此確保整個系統,只要執行足夠長的時間,至少會有一個執行緒會有進展 (progress) 達到 Lock-Free。 Unlike **SPMC-Soultion2**, which uses a shared resource (mutex lock) for blocking synchronization, the first thread holding the lock causes the other thread to wait until the first thread releases the lock. CAS allows threads that initially failed to acquire the lock to continue to execute Read and Modify. Therefore, we can conclude that if one thread is blocked, it indicates that at least one other thread is making progress.  In (**SPMC-Soultion2**), a blocking mechanism uses mutex lock; we can see that only one thread is active when it accesses the job queue. Although CAS will continue to execute Read and Modify, it doesn't result in an increase in overall throughput. This is because the operations will be regarded as trash when atomic CAS fails. Therefore, we can understand that lock-free algorithms are not faster than blocking ones. The reason for using lock-free is to ensure that if one thread is blocked, it does not cause other threads to block, thereby preventing overall progress from stalling. #### Lock-Free Conclusion In conclusion about lockfree, we can see that both blocking and lockless approaches have their place in software development. They serve different purposes with their own design philosophies. When performance is a key consideration, it is crucial to profile your application. The performance impact varies with numerous factors, such as thread count and CPU architecture specifics. Balancing complexity and performance is essential in concurrency, a domain fraught with challenges. #### 補充 ```graphviz digraph { { rank = "same" State1 [width = 1 shape="ellipse" label = "Are you programming\nwith multiple threads" ] State2 [width = 1 shape="ellipse" label = "Do the threads\naccess shared memory?" 
] State3 [width = 1 shape="ellipse" label = "Can the threads \nblock each other?" ] State4 [width = 1 shape="ellipse" label = "It's lock-free \nprogramming" ] } State1->State2 [label = "Yes"] State2->State3 [label = "Yes"] State3->State4 [label = "No"] } ``` 這裡的 block 指的就是會不會因為一個執行緒被暫停而 block 其他的執行緒進而導致整個系統沒有進展。換言之 lock-free 中的 lock : 如 [An Introduction to Lock-Free Programming](https://preshing.com/20120612/an-introduction-to-lock-free-programming/) 所說 > The lock in lock-free does not refer directly to mutexes, but rather to **the possibility of “locking up” the entire application in some way**, whether it’s deadlock, livelock – or even due to hypothetical thread scheduling decisions made by your worst enemy. --- ## TODO: 〈[並行和多執行緒程式設計](https://hackmd.io/@sysprog/concurrency)〉系列講座並紀錄問題和提出改進 ### 〈[實作輕量級的 Mutex Lock](https://hackmd.io/@sysprog/concurrency/%2F%40sysprog%2Fconcurrency-mutex#%E6%B8%AC%E8%A9%A6%E5%92%8C%E9%A9%97%E8%AD%89)〉 #### 測試和驗證 在 mutex 的程式碼 [`main.c`](https://github.com/sysprog21/concurrent-programs/blob/master/mutex/example/main.c)中 main thread 先建立多執行緒,將 clock 的值加一後,等待其他執行緒去工作,工作為增加 clock 的值。接著其他執行緒開始可以工作,直到 clock 的值超過`1 << N_NODES` 也就是 $2^{N\_NODES}$,則由 main thread 結束所有執行緒。 每個執行緒想像有三種狀態 `wait`、`ready`、`not ready` 以更好理解。起初狀態是 `wait` ,當執行緒知道 parent 執行緒變為 `ready` 狀態後,可以決定要繼續維持 `not ready`(去工作)還是 `ready` (將工作交給孩子執行續)。 thread$_1$ 在知道 parent thread$_0$ 變為 `ready` 後,thread$_1$ 有兩個可能會做的事情 `wait`$\to$ `?` ```graphviz digraph structs { rankdir = LR node[shape=plaintext] node[shape=ellipse] main [label= "thread 0 \nready"] worker1 [label = "thread 1 \n wait"] { main->worker1 } } ``` thread$_1$ : 透過 `node_signal`改變狀態 `wait` $\to$ `ready`。這樣就是把工作交給 child thread ,讓 thread$_2$ 在知道 parent thread$_1$ 變為 `ready` 後改變狀態 `wait` $\to$ `?`。thread$_2$ `wait` $\to$ `ready` 則將工作再交給傳給 child thread。 ```graphviz digraph structs { rankdir = LR node[shape=plaintext] node[shape=ellipse] main [label= "thread 0 \nready"] worker1 [label = "thread 1 \n ready"] worker2 [label = "thread 2 \n wait"] { main->worker1 worker1->worker2 } } ``` thread$_1$ : `wait` $\to$ `not ready`,則 thread$_2$ 維持狀態 `wait` 等待 thread$_1$ 為 `ready` 狀態。當 thread$_1$ 為 `not ready` ,則代表不把工作傳給 child thread ,而是 thread$_1$ 自己工作 `clock tick()` 將 clock 的值加一。 ```graphviz digraph structs { node[shape=plaintext] clock [fontcolor="red"]; node[shape=ellipse] main [label= "thread 0 \nready"] worker1 [label = "thread 1 \n not ready"] worker2 [label = "thread 2 \n wait"] { rank="same"; main->worker1 worker1->worker2 } worker1->clock [label=" clock tick()"] } ``` 執行緒 `wait` $\to$ `?` 由 `bit` 決定。迴圈每回合 `bit = !bit;` 使得 `wait` $\to$ `not ready` 與 `wait` $\to$ `ready` 會交替。初始時 `bit = false` ,所以每個執行緒第一次必為 `wait` $\to$ `not ready`。每個執行緒會工作代表 parent thread 為 ready 也就是處於休息的狀態。 而 thread$_0$ 因為沒有 parent thread 所以沒有 `wait` 的狀態,而是不斷交替 `not ready` (工作)與 `ready` (休息)。接下來則要確保 thread$_0$ `ready` (休息)的時候,其中一個 child thread 會去工作。否則所有 child thread 都為 `ready` (休息),則會導致增加執行緒,所有 `clock tick()`的工作都為 thread$_0$ 做則沒有提昇 scability 。能夠確保這件事情保證:「當任何一個執行緒拿到工作後選擇休息,則之後其中一個 child thread 一定會去做這個工作」。而「當任何一個執行緒拿到工作,也代表所有 parent thread 一定是選擇休息」。 work$_{0}$ : thread$_{0}$ 工作。 work$_{1}$ : thread$_0$ 休息,由於每個執行緒第一次必為 `wait` $\to$ `not ready` 工作,則 thread$_1$ 工作。 ... 
work$_{n-1}$ : thread$_x$ 工作,代表其 parent thread 在得知上一層 `ready` 時都是 `wait` $\to$ `ready`。 work$_{n}$ : 一定是 thread$_x$ 的 parent thread 其中之一去工作,因為他們在 work$_{n-1}$ 時是 `ready` 休息,則此次一定有其中一個去工作。 如果 thread$_x$ 是最小的執行緒,當 thread$_x$ `ready` 休息時,一定有parent thread 會工作,因此每個 work 經過所有執行緒一定會有一個執行緒去做事。這就證明「當任何一個執行緒拿到工作後選擇休息,則之後其中一個 child thread 一定會去做這個工作」。 從這裡也看得出來 parent thread 承載的工作量會是 child thread 的二倍。因為 thread$_{0}$ 工作的機率是 $0.5$ 交替工作或休息,所以 thread$_{1}$ 只會在thread$_{0}$ 休息時拿到工作機率剩 $0.5$。由於 thread$_{1}$ 也會交替工作或休息,因次 thread$_{1}$ 工作的機率變成 $0.25$。 這裡可以解釋程式碼 `clock_wait(&clock, 1 << (N_NODES));` 增加一個執行緒相當於增加一倍的工作量的原因。如果不增加一倍的話,新增加的執行緒也不會做到工作,則 scability 不會增加。 下圖說明 parent thread 的工作量為 child thread 的兩倍,只有當工作量 $\geq$ `1 << (N_NODES)` 才能至少讓最小的 child thread 的工作量 $\geq 1$。 ![image](https://hackmd.io/_uploads/H10YgOQQA.png) 在工作量$=$ `1 << (N_NODES-2)` 時,thread$_{14}$ 與 thread$_{15}$ 的工作量為 0。 ![image](https://hackmd.io/_uploads/B1UwxdQX0.png) 程式碼中有兩種資源是執行緒會存取: 1. clock :所有執行緒共享一個 clock 資源。 a. 只會有一個執行緒在一個時間點用 `clock_tick()` 更改 clock 。 b. 執行緒用 `clock_wait()` 來等待,目的是讓 Thread~0~ 不要接續工作。 i=x 時, Thread~0~ 工作,將 clock->ticks 從 t~x~ 加到 t~x+1~。 i=x+1 時,確保在 Thread~0~ 休息時, child thread 把 t~x+1~加到t~x+2~ 。 i=x+2 時,如果 t~x+1~ 則 Thread~0~ 則等待直到 child thread 把 t~x+1~加到t~x+2~ 。 clock_tick() : 用 `mutex_lock()` 與 `mutex_unlock()` 來保證一個期間,只有一個執行緒更新 clock 的操作。接下來用。 ```c thread 0 clock_wait() ======= t0: mutex_lock(&clock->mutex) t1: 等 t (clock->ticks):等待直到「共享 clock->ticks >= 自己的值」 cond_wait(&clock->cond, &clock->mutex); --------------------------------------------- critical section t2: bool ret = clock->ticks >= ticks; 確認 clock->ticks 一定等於 ticks 同步達成。 --------------------------------------------- t3: mutex_unlcok(&clock->mutex) ``` t~1~ : 用 `cond_wait()` 是為了 thread 0 不要頻繁的等待,其他執行緒只有一開始會等t。 t~1-1~ : 將在 t~0~ 的 clock unlock()。 t~1-2~ : 用 clock 的 condition variable 來確認沒有其他執行緒也在對 clock 操作。 t~1-3~ : 沒有其他執行緒。mutex_lock(&clock->mutex)。進入 t~2~ 的 critical section。 t~1-4~ : 有其他的執行緒,則開始 `spin_hint()` 多次後到 `futex_wait` 等待透過 clock 的 condition variable 來確認是否有其他執行緒也在對 clock 操作。如果沒有則到 t~1-3~。 2. node :每個執行緒都有自己的 node 資源,用來表示自己目前的狀態。只有當 parent thread 的 node 是 `ready` 的,child thread 才會做事,因此child thread 需要等待。 a. child thread 等待 parent thread 的 node 是用 `node_wait()` 。 b. 
parent thread 則是透過 `node_signal()` 改變狀態 `wait` $\to$ `ready`。 這兩種資源也協助兩種等待,第一種由 clock 來實作 `等t`,第二種由 node 來實作 `等parent`。 A B C 三個執行緒一開始被main thread 建立後會因為 `tick:0 <i:1` 而等待。直到 mian_thread `clock_tick()` 使 tick = 1。 ```graphviz digraph structs { node[shape=ellipse] A1 [label= "A \ni=1:工\n tick=2"] B1 [label = "B\ni=1:等\n node_wait()"] B2 [label = "B\ni=1:工 \ntick=3"] B3 [label = "B\ni=2:等\n node_wait()" ] B4 [label = "B\ni=2:休" ] B5 [label = "B\ni=3:等\n node_wait()" ] B6 [label = "B\ni=3:工 \ntick=7" ] A2 [label= "A \ni=2:休 "] A3 [label= "A \ni=3:\n等 tick >= i\n cond_wait()"] A4 [label= "A \ni=3:工 \ntick=4"] A5 [label= "A \ni=4:休 "] A6 [label= "A \ni=5:\n等 tick >= i\n cond_wait()"] A7 [label= "A \ni=5:工 \ntick=6"] A8 [label= "A \ni=6:休 "] A9 [label= "A \ni=7:\n等 tick >= i\n cond_wait()"] A10 [label= "A \ni=8:工 \ntick=8"] C1 [label = "C\ni=1:等\n node_wait()"] C2 [label = "C\ni=1:工 \ntick=5"] C3 [label = "C\ni=2:等\n node_wait()"] { rank="same"; A1->A2->A3->A4->A5->A6->A7->A8->A9->A10 } { rank="same"; B1->B2->B3->B4->B5->B6 } { rank="same"; C1->C2->C3 } A2->B2 [label = "A 休息\n node_signal()"] A5->B4 [label = "A 休息\n node_signal()"] B2->A4 [label = "使 A 不等\n cond_broadcast()"] B4->C2 [label = "B 休息\n node_signal()"] C2->A7 [label = "使 A 不等\n cond_broadcast()"] A8->B6 [label = "A 休息\n node_signal()"] B6->A10 [label = "使 A 不等\n cond_broadcast()"] } ``` #### 說明資源 node A 是 B 的 parent。 每個執行緒都有一個 node 資源,會等這個 node 資源的只有該執行緒的 child 執行緒。因此當 `node_signal()` 時只用 `cond_signal` 。 而每個執行緒在等一個parent 的 node 的資源時都會先lock ```graphviz digraph structs { rankdir="LR"; node[shape=ellipse] B0 [label = "B : 等 \n node_wait()"] B [label = "B \n lock A's node\n mutex : lock" ] B1 [label = "B \n A's ready = false" ] B3 [label = "B \n unlock A's node\n mutex : unlock" ] B2 [label ="B\n cond_wait()"] B4 [label = "B \n unlock A's node\n mutex : unlock"] B5 [label = "B \ntry lock A's node"] B8 [label = "B : 休\n node signal()"] B9 [label = "B : 工\n clock_tick()"] node[shape=box] nodeP [label = "A's node\n state : ?" 
] B6 [label = "A's node \ncondv"] B0->B->nodeP [label="check" ] { rankdir="LR"; nodeP->B1 [label="A's ready = true" ] B1->B3 nodeP->B2 [label="A's ready = false"] B2->B4->B5 B3->B8 B3->B9 B5->B [label="success"] B5->B6 [label="B futex wait"] B5->B5 [label="spin_hint()"] } } ``` 接下來 B `futex_wait()` 等待 `A's node condv`。因為在 node 上等待 `A's node condv` 只有 A 的小孩也就是 B。所以只要用 wake 不需用 broadcast 。 而下面的圖說明 A 呼叫函式 `node_signal()` 中呼叫函式 `cond_signal()` ,其中呼叫函式 `futex_wake()` 。 ```graphviz digraph structs { rankdir="LR"; node[shape=ellipse] A0 [label = "A :休\n node_signal()"] A1 [label ="A \n lock A's node\n mutex: lock"] A2 [label ="A \n A's ready = true"] A3 [label ="A \n unlock A's node\n mutex: unlock"] node[shape=box] queue [label = "wait queue\n A's node \ncondv"] { rankdir="LR"; A3->queue [label="A : cond_signal()\n A : futex_wake()"] A0->A1->A2->A3 } } ``` 接下來 B wake 後,就會 lock A 的mutex lock。由於剛剛 A 以已經在 `node_signal()` 將 `A's ready = true` 此時 B 再將 `A's ready = false` 後 B 就不在 `node_wait()` 而是去做事情了。 ```graphviz digraph structs { rankdir="LR"; node[shape=ellipse] B [label = "B \n lock A's node\n mutex : lock" ] B1 [label = "B \n A's ready = false" ] B3 [label = "B \n unlock A's node\n mutex : unlock" ] B4 [label = "B : 休\n node signal()"] B5 [label = "B : 工\n clock_tick()"] { rankdir="LR"; B->B1 [label="A's ready = true" ] B1->B3->B4 B3->B5 } } ``` 所以這裡也說明 A 休息會將工作給 child thread (B)。B 再去依據 bit 的值決定要休息或工作。 #### 說明資源 clock ```graphviz digraph structs { rankdir="LR"; node[shape=ellipse] A0 [label = "A : 等\nclock_wait()"] A [label = "A \n lock clock\n mutex : lock" ] A3 [label = "A \n unlock clock\n mutex : unlock" ] A2 [label ="A\n cond_wait()"] A4 [label = "A \n unlock clock\n mutex : unlock"] A5 [label = "A \ntry lock clock"] A7 [label = "A 等\nwait_node()"] A8 [label = "A"] A9 [label = "A : 休\n node signal()"] A10 [label = "A : 工\n clock_tick()"] node[shape=box] nodeP [label = "tick ? i" ] A6 [label = "clock \ncondv"] A0->A->nodeP [label="check" ] { rankdir="LR"; nodeP->A3 [label="不等,因為 tick >= i" ] nodeP->A2 [label="等,因為 tick < i"] A2->A4->A5 A5->A [label="success"] A5->A6 [label="A futex wait"] A5->A5 [label="spin_hint()"] A3->A8 [label="A 為 Thread 0"] A3->A7 [label="A 不為 Thread 0"] A7->A8 A8->A9 A8->A10 } } ``` A 等在 clock 的 `condv` 但因為有兩種情況 1. 所有執行緒一開始在等 `clock->ticks` 從 0 加為 1。這在main thread 呼叫函式`clock_tick()` 其中呼叫函式 `cond_broadcast()` ,其中呼叫函式。`futex_requeue()`,叫醒一個在等 clock 的 `condv` 的執行緒,而其他執行緒繼續去睡覺。 2. 
A 為執行緒~0~,而某個執行緒在將 `clock->ticks` 從偶數加到奇數後,原本等 clock 的 `condv` 的執行緒~0~ 會因為 tick >= i 就不再等待了。 下圖為 `clock_tick()` 的操作。 ```graphviz digraph structs { rankdir="LR"; node[shape=ellipse] C0 [label = "C :工\n clock_tick()"] C1 [label ="C \n lock clock\n mutex: lock"] C2 [label ="C \n clock->ticks++"] C3 [label ="A \n unlock clock\n mutex: unlock"] node[shape=box] queue [label = "wait queue\n clock \ncondv"] { rankdir="LR"; C3->queue [label="C : cond_broadcast()\n C : futex_requeue()"] C0->C1->C2->C3 } } ``` --- ### 〈[並行程式設計: Atomics 操作](https://hackmd.io/@sysprog/concurrency/%2F%40sysprog%2Fconcurrency-atomics#%E8%83%8C%E5%BE%8C%E4%BD%9C%E7%A5%9F%E7%9A%84-cache)〉 #### 出現原因: 並行溝通需要同步機制。實作同步機制有最簡單的方式就是阻塞的方式,但阻塞有 scability 的問題。因此非阻塞的同步機制出現。 #### 非阻塞的同步機制: 非阻塞的同步機制讓所有執行緒都可以在進入 critical section 前時間點 $t_0$ 先紀錄當前的狀態,時間點 $t_1$ 執行緒可以在其中不斷嘗試改動,但只有最後時間點 $t_2$ 再次去檢查這個狀態是否仍如同 $t_0$ 此次嘗試改動才算正確,代表沒有其他執行緒介入。這裡可以延伸到 [ABA問題](https://en.wikipedia.org/wiki/ABA_problem) 。 而 atomic 會在 $t_0$ 與 $t_2$ 發揮作用,否則在 $t_0$ 與 $t_2$ 又會有 critcal section 的問題。 因此才會說非阻塞的同步機制的基石為 atomic 。 #### C11 Atomics 接下來以範例說明 atomic 的使用方式,例子為 atomic 在阻塞的同步機制上 (spin lock) 的運作。呼應這段話 > 就算硬體有能力處理多個執行緒,但**只要涉及共享資源的存取**,不免要藉由阻塞達到同步:被阻塞的執行緒,要不原地等待 (spin),要不等待喚醒 (sleep/wake) 的訊號,無法充分運用硬體能力 `atomic.c` 可以看到透過 atomic 來輔助 Test and Set 的方式來讓 acquire lock (第 28 行) 操作是 atomic 的,如果 acquire 失敗則 spin (第 29 行)。接著 unlock (第 31 行) ```c=27 // acquire the lock while (atomic_flag_test_and_set(&atomic_float_obj.flag)) CPU_PAUSE(); // acquire fail -> spin atomic_float_obj.value += 1.0f; atomic_flag_clear(&atomic_float_obj.flag); ``` 問題:接著延伸 `atomic.c` 中,發生在因為 atomic 的特性無法應用在 cache line 上。因此不同執行緒在 cache line 上會有競爭。為了達到 cache coherence 會有效能下降,這件事被叫做 false sharing。 > **[Wiki false sharing](https://en.wikipedia.org/wiki/False_sharing)** > the caching protocol may force the first participant to reload the whole cache block despite a lack of logical necessity. The caching system is unaware of activity within this block and forces the first participant to bear the caching system overhead required by true shared access of a resource. 解決方法:使用 `alignas` #### 背後作祟的 cache 呼應前面 false sharing 在 cache line 競爭再用例子 `cacheline.c` 說明 Tore reads and writes 。 #### 多核處理器之間的 cache coherence 1. 跨越 chip ,為達一致,成本高 2. atomics 無法保證記憶體大小大過一個 cache line 的共享資源,會有 memory reorder ##### 為了達到 cache coherence 透過 [MESI protocol]([MESI](https://en.wikipedia.org/wiki/MESI_protocol)) 來規範 簡述 4 個狀態意思: 1. Modified(M) * 這份資料是被修改過的 * 且其他 CPU 上的 cache 沒有這份資料 **`(Shared bit = 0)`** 推論 : 因此為最新的,可以被寫回主記憶體,CPU 也可讀 **`(Valid bit = 1)`** * 在主記憶體中的這份資料是過時的 **`(Dirty bit = 1)`** 2. Exclusive(E) * 其他 CPU 上的 cache 沒有這份資料 **`(Shared bit = 0)`** * 在主記憶體中的這份資料和 cache 中存有的這份資料是一樣的 **`(Dirty bit = 0)`** 推論 : 因為其他 cache 無這份資料,這份資料跟主記憶體的資料也相同,因此這份資料一定為最新,因此 CPU 可讀 **`(Valid bit = 1)`** * 其他 CPU 上的 cache 發 probe read 時,會將資料提供給他 3. Shared(S) * 其他 CPU 上的 cache 必定含有同一份資料 **`(Shared bit = 1)`** * 在主記憶體中的這份資料和 cache 中存有的這份資料是一樣的 **`(Dirty bit = 0)`** 推論 : 有可能與其他 cache 都為狀態 shared ,可能其他 cache 資料較新。然而,雖然不知道自己的資料是否是最新的,但因與主記憶體資料相同,所以 CPU 可讀 **`(Valid bit = 1)`** 4. 
Invalid(I) * cache 中存有的這份資料是過時的 **`(Valid bit = 0)`** | | M | E | S | I | | - | - | - | - | - | | M | X | X | X | V | | E | X | X | X | V | | S | X | X | V | V | | I | V | V | V | V | 這個表格是在說明兩個 cache 的狀態。 **`V`** 代表這個情況會不會發生。舉例 Cache~0~ 為縱軸,Cache~1~ 為橫軸。Cache~0~ 與 Cache~1~ 不可能在同一份資料上彼此互為 (狀態 Modified, Exclusive),因為這兩個狀態`(Shared bit = 0)`,因此情況不存在 **`X`**。 在狀態 Shared 的推論中,提到 cache 不知道自己的資料是否是最新的,而 MOESI 就被提出來解決這個問題。 Shared 狀態原先為 `(Shared bit = 1)` `(Valid bit = 1)` 分成兩個情況如果資料為最新的 `(Dirty bit = 1)` 叫做 Owner,不為最新的則依舊叫 Shared。 #### cache coherence 對程式效能的衝擊 分為兩個議題,分別為競爭資源的問題以及 memory redorder 的問題。 ##### 競爭資源的問題 一個依賴全域 multi-producer/multiple-consumer (MPMC) 的程式難有好的 scalability,因為 MPMC 的極限吞吐量 (throughput) 取決於 cache coherence 的成本,而非處理器的個數! 最好是用多個 SPMC 或多個 MPSC,甚至多個 SPSC 代替,在源頭就規避掉競爭。藍色箭頭是會有競爭的操作。 ```graphviz digraph structs { rankdir = LR node[shape=plaintext] MPMC node[shape=ellipse] producer_01 [label= "producer"] producer_02 [label= "producer"] producer_03 [label= "producer"] consumer_01 [label= "consumer"] consumer_02 [label= "consumer"] consumer_03 [label= "consumer"] node[shape=box]; shared_0 [label = "shared\nmemeory"]; { producer_01->shared_0 [label = "put data" color="blue"] producer_02->shared_0 [label = "put data" color="blue"] producer_03->shared_0 [label = "put data" color="blue"] shared_0->consumer_01 [label ="get data" color="blue"] shared_0->consumer_02 [label ="get data" color="blue"] shared_0->consumer_03 [label ="get data" color="blue"] } } ``` 一個依賴全域 multi-producer/multiple-consumer (MPMC) 的程式難有好的 scalability,因為 MPMC 的極限吞吐量 (throughput) 取決於 cache coherence 的成本,而非處理器的個數! 因此解決辦法是提高細粒度,如下圖多個 SPSC,在 shared memory 上都不會有競爭了。參考〈[藉由 spinlock 的調整,改善多核處理器建立 TCP 連線效能](https://hackmd.io/@sysprog/BJv5HGD3m?type=view)〉 ```graphviz digraph structs { rankdir = LR node[shape=plaintext] SPSC node[shape=ellipse] producer_11 [label= "producer"] producer_12 [label= "producer"] producer_13 [label= "producer"] consumer_11 [label= "consumer"] consumer_12 [label= "consumer"] consumer_13 [label= "consumer"] node[shape=box]; shared1 [label = "shared\nmemeory"]; { producer_11->shared1 [label = "put data"] shared1->consumer_11 [label ="get data"] } shared2 [label = "shared\nmemeory"]; { producer_12->shared2 [label = "put data"] shared2->consumer_12 [label ="get data"] } shared3 [label = "shared\nmemeory"]; { producer_13->shared3 [label = "put data"] shared3->consumer_13 [label ="get data"] } } } ``` 但如此一來卻也沒辦法透過資料結構來知道 data 的操作以保證順序正確,因此除了傳輸 data 還需要連同順序資訊。此外要讓每個都有對應的共享資源,實作也很困難。 因此為了在 MPMC(有競爭)與 SPSC (實作難)間取得平衡,或是當 producer 與 consumer 間的存取速度有落差,可以使用 buffer 的概念,下面這個圖是結合左邊 SPMC 與右邊 MPSC 。 舉例如果 producer put data 的速度大於 consumer 的 get data 。則可以使用左邊 SPMC 的想法,雖然在 worker 以 consumer 的角度從 buffer 取資料會有競爭,但此方式避免效能瓶頸發生在 consumer 的取資料的過程。這就如同 CPU 處理資料很快,但寫入 RAM 記憶體很慢,而有 CPU 的效能瓶頸卡在寫入記憶體的問題(范紐曼瓶頸)。 ```graphviz digraph structs { rankdir = LR node[shape=ellipse] rank = same subgraph SPMC{ node[shape=plaintext] SPMC node[shape=ellipse] shared1 [label = "buffer\nshared memeory"]; producer_21 [label= "producer"] producer_21->shared1 [label = "put data"] } subgraph MPSC{ node[shape=plaintext] MPSC node[shape=ellipse] consumer shared2 [label = "buffer\nshared memeory"]; shared2->consumer [label = "get data" ] } consumer_21 [shape=record label = "worker|{consumer | producer}"] consumer_22 [shape=record label = "worker|{consumer | producer}"] consumer_23 [shape=record label = "worker|{consumer | producer}"] shared1->consumer_21 [label = "get data" color="blue"] shared1->consumer_22 [label = "get data" color="blue"] 
shared1->consumer_23 [label = "get data" color="blue"] consumer_21->shared2 [label = "put data" color="blue"] consumer_22->shared2 [label = "put data" color="blue"] consumer_23->shared2 [label = "put data" color="blue"] } ``` 綜合二個問題在於多處裡器在 cache line 發生競爭,就算問題二的 SPSC 解法,可以降低操作間的競爭,然而 一個 producer 與一個 consumer 之間也會發生競爭。 * Locked-base queue : 如果 SPSC 共享的資料結構是 Locked-base 的佇列,存取佇列都會用 lock,Producer 跟 Consumer 會競爭 lock,即便其存取的是佇列中不同的 slot。 * Locked-Free Ring Buffer : 後來改進為 locked-free queue,只要確定 Ring Buffer的 **head** 與 **tail**,就為 lock free。 然而當 producer 將資料從頭部 enqueue 時,consumer 從尾部 dequeue 時,如果兩個位址在同個 cache line 上仍然會有 cache trashing 。false sharing 。 * Fast Foword : 控制 **head 跟 tail 的距離**。然而這個距離是一個可變的 threashold 透過其他第三方執行緒去調整 head 或 tail。因此其實只是變相的將 cache trashing 的問題放到 第三方執行緒與 producer 或是 第三方執行緒與 consumer 。 * Mutli-line Upates : **引入 batch 的概念** 當 producer 製造一定數量的資料後再更新 head,避免頻繁更新。 * MCRingBuffer : 使用**區域變數 head 避免不斷的 cache trashing**,一定時間才會更新。同時也用 batch 的概念。然而使用batch 會發生 deadlock ,當 CPU~A~ 在寫一個 RingBuffer~A~ 的 batch,同時需要 RingBuffer~B~ 的 batch 的資訊,但因為 CPU~B~ 在寫 RingBuffer~B~ 的 batch 導致 CPU~A~ 等待。而 CPU~B~ 也在等待 RingBuffer~A~ 。 * [《 B-Queue: Efficient and Practical Queuing for Fast Core-to-Core Communication 》](https://staff.ustc.edu.cn/~bhua/publications/IJPP_draft.pdf) 使用**區域變數的概念**,有區域變數 head 與 tail 紀錄生產者與消費者現在的位置。同時引入區域的 batch 變數 batch head 與 batch tail 去讓生產者與消費者知道 slot 可以用(區域變數到區域batch 變數間為可用範圍)。這讓生產者與消費者在一個時間點都有一個 batch (很多個 slot )可用,避免 cache thrashing 。同時透過 backtracking 避免 deadlock 。原先有問題的是下面那張圖,由於 batch 設定過大,head 需要寫入資訊到 batch 空的slot,但由於 tail 佔據 batch 含 8 個 slots 。 head 要讀取其中一個都有困難。因此透過二分法搜法使得 batch size 變小。這樣就可以讓 head 去讀到一個 batch 含有空的 slot ,且這個 batch 並未被tail 佔據。 ```graphviz digraph structs { node[shape=ellipse] rank = same rankdir="LR" tail1[shape="plaintext" label = "tail"] head1[shape="plaintext" label = "head"] batch_1 [shape="record" label = "{<f1>fill|<f2>fill|<f3>empty|<f4>empty|<f5>empty|<f6>empty|<f7>empty|<f8>empty}"] head1->batch_1:f5 tail1->batch_1:f1 } ``` ```graphviz digraph structs { node[shape=ellipse] rank = same tail2[shape="plaintext" label = "tail"] head2[shape="plaintext" label = "head"] batch_2 [shape="record" label = "{<f0>batch|{<f1>fill|<f2>fill|<f3>empty|<f4>empty}}|{<f01>batch|{<f5>empty|<f6>empty|<f7>empty|<f8>empty}}"] head2->batch_2:f3 tail2->batch_2:f1 } ``` 因此從發展的脈落,看到原先解決 lock 所保護的範圍太大競爭的問題,後來解決 caching trashing 的問題,又解決為了 cache coherence 頻繁更新限制效能的問題。最後由於將共享變數變為區域 batch 變數,再細分成多個 slots **(降低共享)** 。而分成多個資源又會出現 deadlock 的問題,又提出動態調整 lock 所保護的範圍以因應。 > 呼應〈並行程式設計: Atomics 操作〉 > 要提高速度,就要避免讓 CPU 頻繁進行 cache coherence。這不單和 atomic 指令本身有關,還會影響到程式的整體表現。最有效的解決方法就是降低共享。 筆記:生產者消費者是 bounded buffer 。 commit 還未 submit 已經 ##### memory order 的問題 簡化為只要兩層(L1 cache 與 L2 cache)就可以說明: 接著我們考慮一個具備 2 層 cache 且有 2 個處理器核的系統,向下箭頭表示儲存 (即 write) 操作、向上箭頭表示載入 (即 read) 操作,對 CPU~0~ 和 CPU~1~ 來說,同顆 CPU 其左側操作的時序先於右側操作。如下圖: >**〈Demystifying the Linux CPU Scheduler〉§ 2.2.3 O(1) Scheduler** > Each core on the chip has its own set of registers and level 1 (L1) private cache, with the other memory components typically being shared across all the processing cores. 
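在看下圖之前,可以先把圖中的操作寫成一段假設性的 C 程式(變數與執行緒名稱皆為示意;程式刻意含有 data race,僅用來對照圖中的 load/store,並非正確的同步寫法):

```c
#include <pthread.h>
#include <stdio.h>

/* 假設性的示意程式:x、y 對應下圖中的兩個共享變數,
 * thread_cpu0 / thread_cpu1 分別對應圖中的 CPU0 與 CPU1。
 * 這裡刻意不用 atomic 或 barrier(volatile 只是避免編譯器把讀寫摺疊掉)。
 * 圖中假設的時序為:CPU1 先完成 y = 0,接著 CPU0 依序寫入 x、y,
 * 最後 CPU1 讀取 y 與 x;實際執行時這個交錯與是否觀察到重排,
 * 取決於時序、編譯器與硬體。 */
static volatile int x = 0, y = 0;

static void *thread_cpu0(void *arg)
{
    x = 1; /* 程式順序:先寫 x */
    y = 2; /* 再寫 y */
    return NULL;
}

static void *thread_cpu1(void *arg)
{
    y = 0;      /* 先寫 y,讓 y 的 cacheline 進到此核的 L1 cache */
    int r1 = y; /* 之後讀 y */
    int r2 = x; /* 再讀 x */
    /* 在下圖描述的 cache 行為(或弱記憶體模型)下,
     * 可能觀察到 r1 == 2 但 r2 仍是 0,
     * 也就是先看到 y 的新值、才看到 x 的新值 */
    printf("r1 (y) = %d, r2 (x) = %d\n", r1, r2);
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, thread_cpu0, NULL);
    pthread_create(&t1, NULL, thread_cpu1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```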
因此將 L2 cache 改為 shared memeory,原先`y=2` 與 `x=1` 操作也可以在 L1 cache 被 reorder。 ```graphviz digraph { splines=ortho node [shape=record] subgraph { rank=same node [width=2] pro0 [label=<CPU<sub>0</sub>>] pro1 [label=<CPU<sub>1</sub>>] } subgraph { rank=same node [width=2] l1_0 [label="L1 Cache"] l1_1 [label="L1 Cache"] dummy1 [width=1 style=invis] l1_0->dummy1->l1_1 [style=invis] } l2 [label="L2 Cache" width=5.5] pro0->l1_0 [xlabel="x = 1"] pro0->l1_0 [xlabel="y = 2" minlen=2] l1_0->l2 [xlabel="y = 2" weight=4] l1_0->l2 [xlabel="x = 1"] pro1->l1_1 [xlabel="y = 0"] l1_1->pro1 [xlabel="x = 1"] l1_1->pro1 [xlabel="y = 2"] l2->l1_1 [xlabel="x = 1\ndelay" weight=2 minlen=2] l2->l1_1 [xlabel="y = 2" weight=2 minlen=2] l1_1->l2 [xlabel="y = 0" minlen=2] } ``` 在上圖中,有二個被原生執行緒 (native thread,指執行緒被作業系統核心識別,且納入排程的執行單元,在本例中,分別執行在 CPU~0~ 和 CPU~1~) 共享的變數 `x` 和 `y`,同時假定 CPU~1~ 的執行緒先做 `y = 0;` 的儲存操作,使得在它的 cache 中都安排好存放變數 `y` 的 cache 內容 (entry),且假定此時沒有關於任何針對變數 `x` 的 cache 內容。 首先,CPU~1~ 的執行緒先對 `y` 進行 `y = 0` 的儲存操作,等該操作完畢後,CPU~0~ 的執行緒才有動作。CPU~0~ 的執行緒先做 `x = 1` 的儲存操作,接著做 `y = 2` 的儲存操作。完成之後,CPU~1~ 的執行緒立即對 `x` 和 `y` 進行讀取。 我們從圖中可見,`x` 和 `y` 的儲存操作經過 CPU~0~ 的 L1 cache, 再是 CPU~0~ 和 CPU~1~ 共用的 L2 cache,最後到外部記憶體。此時,L2 cache 中已有 `x` 和 `y` 這兩個變數對應的 cache 內容。因此在 CPU~1~ 讀取時,`x` 和 `y` 的寫入順序是一致的。 到了 CPU~1~ 的 L1 Cache 時,由於之前沒有 `x` 相關的 cache 內容,因此此時 L2 cache 控制器會進行判斷是直接將它分配給一個空白的 cache 條目,抑或將已有的 cache 內容進行逐出 (evict),然後將變數 `x` 所處的 cacheline 增添到到 cache 內容中。這裡就會有一些延遲與記憶體重新安排的情況。此時,由於變數 `y` 已處於 L1 cache 內容中,因此它有可能被直接寫回(write back),只要之前針對 `x` 的 cacheline 的安排過程,不將 `y` 所處的 cacheline 予以逐出 (evict)。如此一來,右側執行緒所觀察到的寫入順序,就會變為先寫 `y` 再寫 `x`。因此最右邊會看到 reorder 的情況發生。 **weak memory order** x86 和 Arm load 和 store 直接透過匯流排寫入共享記憶體。 有二個被原生執行緒 (native thread,指執行緒被作業系統核心識別,且納入排程的執行單元) --- ### 〈[從 CPU cache coherence 談 Linux spinlock 可擴展能力議題](https://hackmd.io/@sysprog/linux-spinlock-scalability)〉 最簡單的 spin lock 用於等待一個持有 lock 的行程釋放資源。然而當 CPU 數量增加時,造成多個 CPU 的行程同時等待,一旦持有 lock 的執行緒更新資訊,則需要廣播給其他等待 lock 的 CPU 知道,因為這些 CPU 曾發送 invaliate ack 給獲取資源的 CPU 。此時廣播時會競爭匯流排的資源。 解決這個方式,是需要循序執行點對點 unicast 的方式來更新 cache line。 當許多 CPU 在 spin lock 時,會反覆的廣播,即發出 cache invalidate,通知其他 CPU 自己在等待 lock,這樣的行為就是在 snooping 窺探自己所等待的資源在其他的 CPU 的 cache 上是否有更新,也就是釋放。一旦釋放即可以去搶奪資源。而這樣當有很多 CPU 同時在 spin lock 時,則 cache invalidate 的通知會佔據匯流排,導致匯流排的競爭嚴重。因此 點對點 非廣播的方式誕生了。而點對點的問題在於當我收到多個 CPU 的 cache invalidate 時,知道誰先發送又成為問題。 spin lock 的效能很差,更不用說non-scalable locks. 1. 當並行在一定數量的 core 以上時, lock 的成本隨著競爭 core 的數量增加而增加。 2. 效能原先很好,但當 core 的數量增加幾個,效能會突然急遽下降 3. 
即使進入 critical section 的機會(所佔時間比例)很小,上述效能崩潰仍會發生。

non-scalable lock 不只效能差,同時可能導致整個系統效能急遽下降。

> 單一核的請求 spinlock 的頻率 (rate) 即是 $\frac{1}{T_{arrive}}$。進而推廣到多核中央處理器平台,於是請求 spinlock 的頻率和空閒核的數量成正比。假設現有 $n$ 核中央處理器系統上已有 $k$ 個核在自旋等鎖,那麼 $n-k$ 個核上 spinlock 的到達頻率則是 $\frac{n-k}{T_{arrive}}$

----

但閱讀時我認為應該是 $\frac{1}{T_{arrive} \times k}$,因為等待的核數量越多,單一核請求 spinlock 的頻率會降低。雖然這個性質也會反映在數學式子 $\frac{n-k}{T_{arrive}}$ 中,但 $\frac{1}{T_{arrive} \times k}$ 無法呈現 spinlock 與核的總數 $n$ 的關係。

後來討論才知道之所以跟 $N$ 有關係:系統有 $N$ 個 core,有 $N$ 個行程在等待 lock,其中有 $i$ 個 core 在競爭同一個 ticket lock。

#### 〈[Non-scalable locks are dangerous](https://pdos.csail.mit.edu/papers/linux:lock.pdf)〉

背景:建構一個能夠描述 spinlock 行為的模型,要區分二種操作。第一種是沒有競爭,spinlock 可以被快速地獲得;第二種是有競爭,轉移 lock 的 ownership 的時間成本增加。

問題:難以知道 spinlock 何時從沒有競爭的狀態轉變為有競爭的狀態。其他人嘗試將兩個模型直接合併,卻因無法找出真正的 the point of collapse,同時也無法解釋其發生的原因。

![image](https://hackmd.io/_uploads/rJgfh-HvA.png)

總共有 $n+1$ 個狀態、$n$ 個 core,因此狀態 $n$ 代表所有的 core 都在等待。
狀態 $0$ 代表沒有任何 core 在等待 lock,就是上面沒有競爭的情況。
狀態 $k$ 代表有 $k$ 個 core 在等待 lock。
$a$ 代表獲得 lock($a_{k}$ 相當於請求鎖的頻率:原本有 $k$ 個 core 在競爭 lock,又有一個新的 core 加入,變成 $k+1$ 個)。
$s$ 代表釋放 lock,$s_k$ 是「獲得 lock 到釋放 lock 的時間區間」的倒數,因此相當於服務頻率 (service rate)。

```graphviz
digraph G{
    rankdir = RL
    K->2->1->lock
}
```

> 老師的
> 簡化起見,我們忽略掉 spinlock 到達的細節,假設在中央處理器單一核上平均每隔 $T_{arrive}$ (事實上應是指數分佈, [Exponential distribution](https://en.wikipedia.org/wiki/Exponential_distribution)) 會請求一次 spinlock,因此單一核的請求 spinlock 的頻率 (rate) 即是$\dfrac{1}{T_{arrive}}$ (事實上應是 [Poisson 分佈](https://en.wikipedia.org/wiki/Poisson_distribution),法語的 pois 發音類似英語的 bwa,於是 Poisson 音似漢語「玻頌」)。進而推廣到多核中央處理器平台,於是請求 spinlock 的頻率和空閒核的數量成正比。假設現有 $n$ 核中央處理器系統上已有 k 個核在自旋等鎖,那麼 $n−k$ 個核(還未在自旋等鎖) 上 spinlock 的到達頻率則是 $\dfrac{n−k}{T_{arrive}}$,因此,從上圖中的 $A[k]$ 得到:
$$
A[k]=\dfrac{n−k}{T_{arrive}}
$$

:::danger
1. 這個在中央處理器單一核上平均每隔 $T_{arrive}$ 會請求一次 spinlock 的「單一核」,是指正在自旋等鎖的 $k$ 個核,還是還未在自旋等鎖的 $n−k$ 個核?
2. 一個核的到達頻率,是否是前面 $k$ 個核中有一個核自旋等待後拿到 ticket lock,再直接將 ticket lock 轉移給新加入的核?
:::

在中央處理器單一核上平均每隔 $T_{arrive}$ 會請求一次 spinlock。
這個 $T_{arrive}$ 也就是論文中的 $a$,是用最後 "the average number of waiting (idle) cores as the expected value of that distribution, w" 算出來的。它是平均所有「等待 lock 的 core 的情況」,情況的總數有 $0$ 到 $n$ 種。
這是用最後的 Markov model 計算出來的平均 $a$ 來描述每一種情況(即不同等待 core 的數量),而不是依據每一種情況單獨設定變數去量化,比如說現在有 $x$ 個 core 在等待,在這個情況下 $a$ 的值應該會是一個 $x$ 的函數 $f(x)$。之所以不這麼做,是因為單獨計算一個 core 從獲得 lock 到釋放 lock 的持有時間,就會有 cache miss 與維持 cache coherence 的成本:「先測量真實情況再計算」的測量步驟本身已經干涉 lock 的行為了。所以在建構 Markov model 時,才會拿最後計算出的結果,也就是每個數量的 core 在模型中出現的機率 $w = \sum_{i=0}^n i\times P_i$ 來算。論文中還有一些為了測量實際行為、直接在 CPU 上測量的作法,像是增加 RAM 的大小來紀錄測量結果、使用不同的 cache,以避免因 cache miss 或為了 cache coherence 而影響真正的行為。

>**〈[Non-scalable locks are dangerous](https://pdos.csail.mit.edu/papers/linux:lock.pdf)〉(§ 3.2) Performance model for ticket locks**
> To compute the arrival rate, we define a to be the average time between consecutive lock acquisitions on a single core.
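對照這個定義,以下是一個假設性的量測草稿(單一核、無競爭),示意如何粗略估計單一核上連續兩次取得 lock 的平均間隔 $a$;lock 型態、迴圈次數等名稱與數值皆為假設:

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define N_ACQUIRE 1000000 /* 假設的取樣次數 */

static pthread_spinlock_t lock;

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);

    double start = now_sec();
    for (int i = 0; i < N_ACQUIRE; i++) {
        pthread_spin_lock(&lock);
        /* critical section:保持極短(示意) */
        pthread_spin_unlock(&lock);
        /* 兩次取鎖之間的其他工作(示意,這段時間也算進 a 之中) */
    }
    double a = (now_sec() - start) / N_ACQUIRE;

    printf("estimated a = %.9f sec per acquisition\n", a);
    pthread_spin_destroy(&lock);
    return 0;
}
```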
有 1 個 core 要去競爭某 1 個 core 持有的 lock,每次持有 lock 的時間為 $T = 3$ 秒,則 arrival time $a = 3(T)$ 秒。也就是等待 $3(T)$ 秒後,就可以持有 lock 了。

```graphviz
digraph G {
    A [shape=ellipse label = "A hold lock\n3 秒" style=filled]
    {C} -> A [label = " acquire"]
}
```

有 1 個 core 要去競爭某 2 個 core 持有的 lock,每次持有 lock 的時間為 $T = 3$ 秒。接著我定義:「$W$ 為現在在等待 lock 的 core 數量,$H$ 為現在持有 lock 的 core 數量。」依照此例子 $W = 1$、$H = 2$,則 arrival time $a = \frac{3(T)}{2(H)}$ 秒。也就是等待 $1.5$ 秒後,就可以持有 lock 了。

```graphviz
digraph G {
    A [shape=ellipse label = "A hold lock\n3 秒" style=filled]
    B [shape=ellipse label = "B hold lock\n3 秒" style=filled]
    C -> A [label = " acquire "]
    C -> B [label = " acquire "]
}
```

有 2 個 core 要去競爭某 1 個 core 持有的 lock,則 arrival time $= {2 \times 3(T)}$ 秒。依照此例子 $W = 2$、$H = 1$,再套用到數學式子 $a = 2(W) \times 3(T)$ 秒,發現平均一個 core 等待 $6$ 秒後,就可以再度持有 lock 了。

```graphviz
digraph G {
    A [shape=ellipse label = "A hold lock\n3 秒" style=filled]
    {C D} -> A [label = " acquire "]
}
```

有 2 個 core 要去競爭某 2 個 core 持有的 lock,則 arrival time $= {3(T)}$ 秒。依照此例子 $W = 2$、$H = 2$,套用到數學式子 $a = \frac{2(W) \times 3(T)}{2(H)}$ 秒,發現平均一個 core 等待 $3$ 秒後,就可以再度持有 lock 了。

```graphviz
digraph G {
    A [shape=ellipse label = "A hold lock\n3 秒" style=filled]
    B [shape=ellipse label = "B hold lock\n3 秒" style=filled]
    C -> A [label = " acquire "]
    D -> B [label = " acquire "]
}
```

數學公式 $a = \frac{W \times 3(T)}{H}$

然而這樣假設所有的 core 不是在等待 lock 就是在持有 lock,只分為兩種類型;現實中卻並非所有 core 當下都跟某個資源有關。這也是 spin lock 的概念:有一堆 core 現在在等待一些 spin lock,其中有一些 core 已經持有 lock,其餘則在 spin(旋轉)等待 lock。套用論文的說法,這兩種 core 合稱「這些 core 都**已經**在等待 lock」。因此多了一個變數 $K = H + W$,表示已經持有 lock 或正在等待 lock 的 core 數量。

>**〈[Non-scalable locks are dangerous](https://pdos.csail.mit.edu/papers/linux:lock.pdf)〉(§ 3.2) Performance model for ticket locks**
> k cores are already waiting for the lock
> The rate at which a single core will try to acquire the lock, in the absence of contention, is 1/a.
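在往下推廣到整個系統之前,先把上面四個例子代回同一條式子檢查(沿用每次持有 lock 的時間 $T = 3$ 秒的假設):

$$
a = \frac{W \times T}{H}:\quad
\frac{1 \times 3}{1} = 3,\qquad
\frac{1 \times 3}{2} = 1.5,\qquad
\frac{2 \times 3}{1} = 6,\qquad
\frac{2 \times 3}{2} = 3 \ \text{(秒)}
$$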
將數學式子表示為 $a = \frac{W \times 3(T)}{H}\dots$ 平均 K 個中,其中一個 core 需要等多久才拿到。 再將 rate (頻率)的概念用進來 $\frac{1}{a} = \frac{H}{W \times 3(T)}$ 回到現實並非所有 core 都一定當下跟某個資源有關,也就是並非所有 core 都在競爭 spin lock。$K$ 個 core 只是整體 $N$ 個 core 的集合。 > **〈[Non-scalable locks are dangerous](https://pdos.csail.mit.edu/papers/linux:lock.pdf)〉(§ 3.2) Performance model for ticket locks** > the number of remaining cores that are not already waiting for the lock (i.e., n−k) 因此為了行文便利,先將 $K$ 個 core 視為在系統中競爭的數量,這個系統中競爭的數量包含相當於所以**已經**在等待 lock 的 core。再考慮系統中競爭的 core 與系統中沒有競爭的 core 總共有 $N$ 個 core 。系統中沒有競爭的(**還沒有**在等待 lock 的 core)有 $N-K$ 個 ```graphviz digraph G { rankdir = "LR" subgraph cluster{ start [shape="plaintext", label = "系統中已經在競爭的 K 個 core = W + H 個"] subgraph cluster_0 { color=blue; A [shape="ellipse" label = "A\n3 sec" style = "filled"] A_ [shape="plaintext" label = "...H 個 hold lock"] B [shape="ellipse" label = "B\n3 sec" style = "filled"] A->A_->B [style = invis] label = "Hold "; } subgraph cluster_1 { C [shape=ellipse label = "C"] W [shape="plaintext" label = "...W 個 wait..."] D [shape=ellipse label = "D" ] C->W->D [style = invis] label = "Wait "; color=blue } start -> A [style = "invis"] start -> C [style = "invis"] W -> A_ [label = " acquire"] A -> C [style = "invis"] color=blue; } subgraph remain{ E -> F -> G -> H[style = "invis"] reamin [shape = plaintext label = "系統中還未競爭的 core 的數量 : N - K 個"] color=blue; } } ``` >**[Non-scalable locks are dangerous](https://pdos.csail.mit.edu/papers/linux:lock.pdf)〉(§ 3.2) Performance model for ticket locks** > k cores are already waiting for the lock > the number of remaining cores that are not already waiting for the lock (i.e., n−k) 將系統中還未競爭(**還沒有**在等待 lock 的 core) 視為(等待 lock 的 core),要加入競爭。 將系統中已經正在競爭(**已經**在等待 lock 的 core) 視為已經持有的 lock。 將整個範圍擴大,如此才可繼續利用先前的數學式子。 然而系統中正在競爭(**已經**在等待 lock 的 core) 的 core 其實有一部份是 $W$ 個在等待 lock , $H$ 個正在持有 lock。但他們都合稱為已經在競爭 lock 的 core。 因此正在等待 lock 的 $W$ 個 core 並非總是等待,只要等待 $a$ 秒後,就可以再度持有 lock 了 。 因此計算系統中已經正在競爭的 core 平均持有 lock 的時間為 $T'$ $T' = \frac{H \times T}{K}$ 由於 $a = \frac{W \times T}{H}$ >**[Non-scalable locks are dangerous](https://pdos.csail.mit.edu/papers/linux:lock.pdf)〉(§ 3.2) Figure 7 :** > Markov model for a ticket spin lock for n cores. State i represents i cores holding or waiting for the lock. ai is the arrival rate of new cores when there are already i cores contending for the lock. 加入 $a_k$ 變數。其倒數 $\frac{1}{a_k}$ 為已經有 $K$ 個 cores 在競爭 lock,當新 cores 要加入需要等待多久時間,才會得到 lock 。則 $a_k$ 為 arrival rate 。 系統中的還沒有競爭的新 cores 要加入想像成「等待」。 系統中的正在競爭中 cores 無論是否真實在那一刻持有或等待都視為成「持有」。 由於已經知道系統中已經正在競爭的 core 的平均持有 lock 的時間為 $T' = \frac{H \times T}{K}$ 。 且 arriaval time 的計算為 $a = \frac{W \times T}{H} \to T = a \times \frac{H}{W}$ 擴大 $a$ 當整個系統的視角:$\frac{1}{a_k} = \frac{(N-K) \times (T')}{K}\dots$ 系統還沒有競爭的新 cores 要加入等到 lock 的時間 $\frac{1}{a_k} = \frac{(N-K) \times (T')}{K}$ $\frac{1}{a_k} = \frac{N-K}{K} \times (T')$ $\frac{1}{a_k} = \frac{N-K}{K} \times (\frac{H \times T}{K})$ $\frac{1}{a_k} = \frac{N-K}{K} \times \frac{H}{K} \times (T)$ $\frac{1}{a_k} = \frac{N-K}{K} \times \frac{H}{K} \times (a \times \frac{H}{W})$ arrive rate $a_k = \frac{N-K}{a}$ arrival rate ($a_k$) 正比於多少個 core 還沒有在等待 $n-k$ 有一群人 $n-k$ 在搶演唱會的門票,有一群人已經搶到演唱會的門票 $k$ 。則當有一個搶到票的人釋放他的門票,則剩下的人搶到的門票的 >**〈[Non-scalable locks are dangerous](https://pdos.csail.mit.edu/papers/linux:lock.pdf)〉** > In particular, the arrival rate from $k$ to $k+1$ waiters, $a_k$, should be proportional to the number of remaining cores that are not already waiting for the lock. 
> if $k$ cores are already waiting for the lock, the arrival rate of new contenders is $a_k = (n-k)/a$
> Conversely, the service rate from $k+1$ to $k$, $s_k$, should be inversely proportional to $k$, reflecting the fact that transferring ownership of a ticket lock to the next core takes linear time in the number of waiters.

一個 core 從 (1) 獲得 lock 的時間點 $\to$ 釋放 lock 的時間點,再到 (2) 下一個 core 獲得 (acquire) lock 的時間點,這整段時間的倒數即為 service rate。時間區段 (1) 即是 $s$(持有 lock 執行 critical section 的時間);時間區段 (2) 則需要考慮 cache line 轉移的成本。

#### Directory-based cache coherence

> [wiki/Directory-based_coherence](https://en.wikipedia.org/wiki/Directory-based_coherence)
> Directory-based coherence uses a special directory to serve instead of the shared bus in the bus-based coherence protocols.
> In directory based cache coherence, this is done by using this directory to keep track of the status of all cache blocks, the status of each block includes in which cache coherence "state" that block is, and which nodes are sharing that block at that time, which can be used to eliminate the need to broadcast all the signals to all nodes, and only send it to the nodes that are interested in this single block.

這個 Directory-based cache coherence 就是用來讓多個 CPU 同時 spin lock 時,能以點對點 (unicast) 的方式發送 cache invalidate 通知。

>**〈[Non-scalable locks are dangerous](https://pdos.csail.mit.edu/papers/linux:lock.pdf)〉**
> c, the time taken by the home directory to respond to a cache line request.

因此當已經有 $k$ 個 core 正在競爭時,directory 以點對點的方式依序回應 cache line 請求,成本便與 $k$ 有關;這也是把 service rate 拆成兩個時間區段的原因。時間區段 (1) 為 $s$;在時間區段 (2) 中,下一個獲得鎖的核,其 cache line 得到更新所需時間的平均值為 $(k \times c) / 2$。因此整體服務時間約為 $s + k \times c / 2$,service rate 即為其倒數。

> 新的 core 加入其中用機率計算的原因?

---

### 〈[藉由 spinlock 的調整,改善多核處理器建立 TCP 連線效能](https://hackmd.io/@sysprog/BJv5HGD3m?type=view)〉

拆解 lock 的粒度 (granularity):減少 spinlock 的粒度後,每個 CPU 對應一個 slot,因此每個 socket 都可以對應到一個 slot 並對應到一個 CPU。這解決了原先多個 socket 競爭、同一時間只有一個 socket 能進入單一 slot 的問題,然而也就無法再透過單一 slot 來規範處理的順序了。

https://patents.google.com/patent/CN103309842A/zh

---

### 〈排程器〉

排程器要讓資源分配能夠公平,因為當資源合理地分配給各個行程時,才能達到 scalability。就像上面的 [main.c](https://github.com/sysprog21/concurrent-programs/blob/master/mutex/example/main.c) 是資源分配不合理的明顯例子,也就是 scalability 不好。但要注意的是,此例子 scalability 不好並非源於排程器沒有合理地分配時間給每個行程。

#### [work-steal](https://github.com/sysprog21/concurrent-programs/blob/master/work-steal/work-steal.c)

> 〈 [Cilk : An Efficient Multithreaded Runtime System](http://supertech.csail.mit.edu/papers/PPoPP95.pdf) 〉
> Two scheduling paradigms have arisen to address the problem of scheduling multithreaded computations: work sharing and work stealing.
> In **work sharing**, whenever a processor generates new threads, the scheduler attempts to migrate some of them to other processors in hopes of distributing the work to underutilized processors.

-> 排程器知道哪些 processors 是 underutilized,排程器分散工作到 underutilized processors。

> In **work stealing**, however, underutilized processors take the initiative: they attempt to "steal" threads from other processors.
underutilized processors 自己會去拿其他 processors 的 thread 來工作。

Cilk program 包含多個 Cilk procedure,其中各自包含多個執行緒。work steal 是為了解決 load balance 的問題。

##### Cilk Model :

Cilk procedure 的其中一個執行緒開始執行時,可以 spawn 一個子執行緒,它會執行子 Cilk procedure。

![image](https://hackmd.io/_uploads/SJIj4OgBR.png)

**向下線** spawn edges 代表小孩的產生 : spawn of a child。這相當於一個 subroutine call,**Parent** 執行緒會與 **Child** 執行緒一起並行執行,所以 **Parent** 執行緒不應該停下來 block 等待 **Child** 執行緒回傳值,否則就完全沒有並行的意義了;因此 **Parent** 執行緒會建立 **Successor** 執行緒,去接收 **Child** 執行緒的回傳值。

一個執行緒與它的 **Successor** 屬於同一個 Cilk procedure,而一連串來自同一個 Cilk procedure 的 **Successor** 會被 **水平線** 連在一起。

Cilk 的計算會展開成 spawn tree,由 procedures 與 spawn edges 組成,但執行的順序會遵守圖中的 precedence relation。

而 Cilk program 的執行時間由兩個參數限制:

work $T_1$ : 一個 processor 執行整個 program 的時間,相當於 $\sum_{all\ thread}$ 執行緒執行的時間。下圖紅線表示這些執行緒在單一 processor 上依序執行的路徑。

```graphviz
digraph structs {
    rankdir="LR";
    node[shape=ellipse]
    T1 [label = "thread 1"]
    T2 [label = "thread 2"]
    T3 [label = "thread 3"]
    T4 [label = "thread 4"]
    subgraph cluster_1_2 {
        label = "processor1"
        rankdir = LR
        style=filled;
        color=lightgrey;
        rank="same";
        T1->T2->T3->T4 [color="red"]
    }
}
```

critical path $T_{\infty}$ : 當有無限多個 processor 時,最長的執行時間。

A1 執行緒要開始執行整個 Cilk procedure,A4 要結束整個 Cilk procedure。有機會可以並行的部分為 A2 與 A3。而當 A2 與 A3 沒有 data dependency 時,下圖會是最好的執行時間。因此 processor 數量大於 2 以後已不再提高效能:兩個 processor 各自處理 A2 與 A3。

![image](https://hackmd.io/_uploads/SJ6LSZbrA.png)

因此得到結論,圖一 : $T = 4 = T_{total}\ /\ (P = 1)$。而執行時間必 $\geq$ 並行最大化時的執行時間 : $T \geq T_{\infty}$。Cilk scheduler 用 work steal 技巧讓 $T/P$ 越接近 $T_{\infty}$ 越好,如圖二 : $T_{total}\ /\ (P = 2) = 4 / 2$。

##### work steal 在 Linux 裡面是為了要 load balance。

現在的系統都是多處理器,最大的目的就是要 load balance,使得任務可以在不同處理器上工作,而不是讓一個處理器執行所有的任務。然而當一個閒置的處理器要幫忙碌的處理器分攤工作,將工作 resume 到新的處理器上卻很困難,因為一個行程自己在運行的時候,每個處理器都會有自己私有的 L1 cache 以及暫存器(而其他的記憶體資源通常是被所有處理器所共享),當一個工作被搬移到其他的處理器,就要保證 memory state 是相同的,這就是其 resume 的成本。

而由於排程器不能預測一個任務會執行多久、花費多少個 timeslice,所以工作量不平衡一定會發生。

接著,存放工作的資料結構是一個 runqueue,但由於這個 runqueue 並未對應多處理器實作 processor affinity,因此一個原先在 CPU 0 上跑的程式搬到 CPU 1 上會有 cache miss。

此外,specialized kernel 執行緒負責重新排程,但這件事只能發生在 runqueue 沒有被 lock 起來的時候。為了避免 deadlock,Linux 會依據 runqueue address 由小到大 lock runqueue。因此 O(n) 的排程器會有多處理器競爭 global runqueue 的問題:因為單一的 runqueue 代表操作必須互斥,當一個處理器需要 enqueue / dequeue 一個任務,其他的處理器都需要等待直到 unlock。
(這裡有問題的地方是為何變成多個 runqueues 了)。

O(1) 排程器提高了可擴展性。
方式:O(1) 有 global 排程器去協調、平衡每個處理器上 runqueue 的負載,用了 work steal 的想法,當處理器閒置時,執行緒會去拿其他處理器的任務。

##### 實作方式 :

除了 Linux 設計一個 (active theft) 執行緒去偷執行緒外,程式設計師也可以自己將程式設計成適合分成很多個 task,以便讓不同處理器可以執行,這就是 Cilk preprocessor 的用途。

> thread T (arg-decls ...)
{stmts ...} Cilk preprocessor 會將 $T$ (thread) 先轉乘 C function,有一個 *argument* 和 void return 型態。此 *argument* 是一個 *closure data* 的指標。*closure data* 有很多 slot 存放 $T$ 的參數,而其中一個 slot 存放的為 join counter 為 $T$ 要跑起來所需要的 *missing argument* 。當所有 argument 都齊全則 *closure* ready 了。 ![image](https://hackmd.io/_uploads/BJA9MbMS0.png) 接下來 Clik scheduler 會 invoke 一個 Thread 作為 procedure ,原先的 arguments 會複製到區域變數,這個 Thread 也會被分配一個 heap。 程式設計師需要自己將 parent procedure 用成二個執行緒。一個執行緒會 spawn child procedure 另一個執行緒為 successor thread。 [Programming Parallel Application in Clik](https://liacs.leidenuniv.nl/~plaata1/papers/siam-news.pdf) #### The Cilk work-stealing scheduler > TODO #### [work-steal](https://github.com/sysprog21/concurrent-programs/blob/master/work-steal/work-steal.c) 有 8 個 thread queue。 有 24 個 threads 。 ### 〈[Linux 核心設計: 淺談同步機制](https://hackmd.io/@owlfox/SyVVY3EgI/https%3A%2F%2Fhackmd.io%2Fs%2FSJpp-bN0m)〉 參考 [spinlock](https://www.slideshare.net/AdrianHuang/spinlockpdf) #### 1. spin on test and set ```c init | lock:= clear lock | while(TestAndSet(lock) = BUSY); unlock | lock:= clea ``` **機制** 一個 core 持有 lock 後,其他 core 不斷嘗試 test and set (spinning)。 **cache coherence 帶來降低效能的問題** Spinning Cores 不斷消耗 memroy bus (write 1): Cache coherence 。因為 memory write, spinning cores 不斷 invalidate cache copies of cores,即便值都沒有改變。==> cache coherence ```graphviz digraph G { subgraph cluster_0 { style=filled; color=white a0 [shape = record label="lock holder|core 0"] a1 [shape = record label="test and set|core 1"] a2 [shape = record label="test and set|core 2"] a3 [shape = record label="test and set|core 3"] a1 -> a1 a2 -> a2 a3 -> a3 a1 -> a2 [label = invalidate] a1 -> a3 [label = invalidate] } subgraph cluster_1{ label = memory node [shape = record style=filled]; b0 [label = "{|{<f0>lock}|}"] } a0 -> b0:f0 [label = "test and set"] } ``` **cache coherence 帶來降低效能的問題** Spinning Cores reload the memory due to the cache miss ==> performance impact ```graphviz digraph G { subgraph cluster_0 { style=filled; color=white; a0 [shape = record label="lock holder|core 0"] a1 [shape = record label="test and set|core 1"] a2 [shape = record label="test and set|core 2"] a3 [shape = record label="test and set|core 3"] a1 -> a1 a2 -> a2 a3 -> a3 } subgraph cluster_1{ label = memory node [shape = record style=filled]; b0 [label = "{|{<f0>lock}|}"] } a0 -> b0:f0 [label = "test and set"] a2 -> b0:f0 [label = "cache miss: reload"] a3 -> b0:f0 [label = "cache miss: reload"] } ``` #### 2. test and test and set(spin on read) ```c init | lock:= clear lock | while(lock == BUSY or TestAndSet(lock) = BUSY); unlock | lock:= clea ``` **減緩 cache coherence 帶來降低效能的問題** 透過不用一直在 spin 的過程一直讀 `lock` 的狀態改善。 **Unfairness** 但 spinning cores 會有 read miss,且要讀取 lock 改變的值回 cache。競爭存取 memory bus,就會有較快更新 cache line 的 cpu 比原先較早 try acquire lokc 的 cpu 更早取得 lock 有 Unfairness 的問題。 #### 3. ticket lock 解決 unfairness > **union type [C99](https://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf)** > A union type describes an overlapping on empty set of member objects, each of which has an optionally specified name and possibly distinct type. 
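下方 Linux 核心的 `arch_spinlock_t` 以 union 將 `owner` 與 `next` 疊在同一個 32 位元字組。在看核心實作之前,先以 C11 atomics 寫一個概念性的 ticket lock 草稿,對照「取號 (next)/叫號 (owner)」的流程(僅為假設性示意,並非核心程式碼):

```c
#include <stdatomic.h>
#include <stdint.h>

/* 假設性的 ticket lock 草稿:next 為取號機,owner 為目前服務中的號碼 */
typedef struct {
    atomic_uint_fast16_t next;
    atomic_uint_fast16_t owner;
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *lk)
{
    /* 取號:fetch_add 保證每個 CPU 拿到不重複的號碼,先到先服務 */
    uint_fast16_t ticket = atomic_fetch_add(&lk->next, 1);
    /* 對號:spin 直到服務台叫到自己的號碼 */
    while (atomic_load(&lk->owner) != ticket)
        ; /* 實務上可加入 pause/wfe 之類的 spin hint */
}

static void ticket_unlock(ticket_lock_t *lk)
{
    /* 叫下一號,讓下一個等待者離開 spin */
    atomic_fetch_add(&lk->owner, 1);
}
```

之後的核心實作則是同一個想法,只是改以 LL/SC 指令對整個 32 位元 `slock` 做原子更新來完成取號。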
```c typedef struct { union { unsigned int slock; struct __raw_tickets { unsigned short owner; unsigned short next; } tickets; }; } arch_spinlock_t; ``` 因為 union 的型態。因此 arch_spinlock_t 結構體其實變數是有重疊。 ``` overlap ------------------------------ slock (32 bit) ------------------------------ next (16 bit) | owner (16 bit) ------------------------------ ``` **LL/SC 的 spin_lock** > **§ Concurrency Primer : Implementing atomic read-modify-write operations with LL/SC instructions** > **Load-link** performs a read operation from an address, similar to any load, but it also signals the processor to watch that address. > **Store-conditional** executes a write operation only if no other writes have occurred at that address since its paired load-link. [spinlock.h](https://elixir.bootlin.com/linux/v4.20/source/arch/arm/include/asm/spinlock.h#L56) ```c=1 static inline void arch_spin_lock(arch_spinlock_t *lock) { unsigned long tmp; u32 newval; arch_spinlock_t lockval; prefetchw(&lock->slock); __asm__ __volatile__( // lockval(0) = lock->slock(3) 載入全域 (next, owner) "1: ldrex %0, [%3]\n" // newval(1) = lockval(0) + (1 << TICKET_SHIFT)(4) 更新區域 (next, owner) " add %1, %0, %4\n" // lock->slock(3) = newval(1) 交換 local 與 global " strex %2, %1, [%3]\n" // Check the SC result. " teq %2, #0\n" // Loop if the SC failed. " bne 1b" : "=&r" (lockval), "=&r" (newval), "=&r" (tmp) : "r" (&lock->slock), "I" (1 << TICKET_SHIFT) : "cc"); while (lockval.tickets.next != lockval.tickets.owner) { wfe(); lockval.tickets.owner = READ_ONCE(lock->tickets.owner); } smp_mb(); } ``` 搭配 union 在組合語言的地方(第 13 行)雖然只有改 `slock` 的話 `owner` 會被更新 `lockval(0) + (1 << TICKET_SHIFT)(4)`。 然而 `next` 的值卻不被改變,因為原先 `lockval` 就是來自 `slock` 也就是 `[next|owner]` 。而現在雖然只改變 `lockval` 但因為等待 lock 的數量不會多到這個情況,所以會變動到的只有 `owner`。 因此在(第 10 行) `ldrex %0, [%3]` 將 `next` 讀進來到私有變數 `lockval` 這就是自己的取號單則不再變動,要做的事情就是(第 24 行)不斷將變化的共享變數 `owner` 讀進來。看是否與自己的取號單相同。相同則代表輪到自己上鎖,`arch_spin_lock()` 則函式執行完,上完鎖準備進入 critical section。 之後要離開 critical section 的時候`arch_spin_unlock()` 就將共享變數 `owner` 更新(write back),此時發送 (invalidate) 給其他正在等待叫號的 CPU ,這樣其他等待的 CPU 就會 reload 更新的 cache line 。而如果是拿下一個號碼單的 CPU 將變動共享變數讀入後就可以上鎖了。 ```c static inline void arch_spin_unlock(arch_spinlock_t *lock) { smp_mb(); lock->tickets.owner++; dsb_sev(); } ``` 所以總共跟 cache coherence 有關的地方有三個地方 1. load : CPU 取號碼牌。 `ldrex %0, [%3]` 2. write back & send invalidate : CPU (Lock holder) 辦理完,更新服務台 (unlock)。 `lock->tickets.owner++;` 3. 
reload : CPU 與服務台對號不成功,spin。 `lockval.tickets.owner = READ_ONCE(lock->tickets.owner);` 第一步驟與第二步驟是不管怎樣都要做的。因為一個 try acquire lock 的 CPU 無論現在的 lock 是狀態 `locked` 還是 `unlock` ,都一定要至少知道溝通用的共享資源的狀況一次。此外要釋放 lock 的 CPU 一定要更新溝通用的共享資源。接下來的 MCS lock 就是嘗試將第三步驟變成在 spin 的時候不用一直 reload ,因為就算 CPU 們一直 reload 最後還是只會是下一號獲得 lock 。這就像第 100 號的人不需要知道第 1 號到第 98 號何時服務結束。第 100 號的人只要知道第 99 號何時服務結束即可。 ```graphviz digraph G { rankdir = LR subgraph cluster_1 { color = white b1 [shape = record, style = bold color = red label = "cpu1|owner|next"] b2 [style = invis] b4 [shape = plaintext, label = <<table BORDER="0"> <tr><td>load<br></br>shared lock</td></tr> <tr><td BORDER="3" COLOR="red">owner = 0</td></tr> <tr><td BORDER="3" COLOR="red">next = 1</td></tr> </table>>] b5 [shape = plaintext label = "spin on\ncur_ticket!=next"] b6 [style = invis] b7 [shape = plaintext, label = <<table BORDER="0"> <tr><td>reload<br></br>shared lock</td></tr> <tr><td BORDER="3" COLOR="red">owner = 1</td></tr> <tr><td BORDER="3" COLOR="red">next = 1</td></tr> </table>>] b8 [shape = plaintext label = "enter\ncritical section"] b1 ->b2->b4->b5->b6->b7->b8 [style = invis] } subgraph cluster_0 { color = white a1 [shape = record, style = bold color = blue label = "cpu0|owner|next}"] a3 [shape = plaintext, label = <<table BORDER="0"> <tr><td>load<br></br>shared lock</td></tr> <tr><td BORDER="3" COLOR="blue">owner = 0</td></tr> <tr><td BORDER="3" COLOR="blue">next = 0</td></tr> </table>>] a5 [style = invis] a4 [shape = plaintext label = "enter\ncritical section"] a6 [shape = plaintext label = "release\nlock"] a1->a3->a4->a5->a6 [style = invis] } subgraph cluster_l { color = white l1 [shape = record, style = bold color = black label = "owner = 0|next = 0"] l2 [style = invis] l3 [style = invis] l4 [shape=plaintext, label=<<table BORDER="0"> <tr><td> cpu 0 update </td></tr> <tr><td BORDER="2">owner = 0</td></tr> <tr><td BORDER="3" COLOR="blue">next = 1</td></tr> </table>>] l5 [shape = plaintext, label = <<table BORDER="0"> <tr><td> cpu 1 update </td></tr> <tr><td BORDER="2">owner = 0</td></tr> <tr><td BORDER="3" COLOR="red">next = 2</td></tr> </table>>] l6 [shape = plaintext, label = <<table BORDER="0"> <tr><td> cpu 0 update </td></tr> <tr><td BORDER="3" COLOR="blue">owner = 1</td></tr> <tr><td BORDER="2">next = 2</td></tr> </table>>] l1 ->l3 ->l4 ->l5 ->l6[style = invis] } } ``` **依舊沒解決 cache coherence 帶來降低效能的問題** 然而就算改成這樣仍然還會有很多 cache miss 要同時更新的問題。因此接下來就是改成 MCS lock。 #### 4. 
MCS lock ```graphviz digraph G { rankdir = LR subgraph cluster_1 { style=filled color=white b0 [shape=box, style=bold, color=red, label="cpu 1"] b2 [style=invis] b4 [style=invis] b5 [shape=plaintext, label="spin on\nlocked==0"] b6 [shape=plaintext, label="enter\ncritical section"] b3 [shape=record, style=bold, color=red, label="node|locked=0|next=NULL"] b0->b2->b3 [style=invis] lock1 [shape=plaintext , label="try\nacquire lock"] b3->lock1->b4->b5->b6 [style =invis] b5->b5 } subgraph cluster_0 { style=filled; color=white; a0 [shape=box, style=bold , color=blue , label = "cpu 0"] a1 [shape=record, style=bold , color=blue , label = "node|locked=0|next=NULL" ] CS [shape=plaintext, label="enter\ncritical section"] CS2 [shape=plaintext, label="leave\ncritical section"] lock0 [shape=plaintext, label="try\nacquire lock"] a4 [shape = plaintext, color=blue, label=<<table BORDER="0"> <tr><td port="port1" BORDER="2">node</td></tr> <tr><td port="port2" BORDER="2"> locked=0 </td></tr> <tr><td port="port3" BORDER="2" COLOR="Red">next</td></tr> </table>>] a5 [shape = plaintext ,color = blue , label = <<table BORDER="0"> <tr><td port="port1" BORDER="2">node</td></tr> <tr><td port="port2" BORDER="2"> locked=0 </td></tr> <tr><td port="port3" BORDER="2">next=NULL</td></tr> </table>>] a0->a1->lock0->CS->a4->CS2->a5 [style = invis] } subgraph cluster_l { color=black; label="lock" l1 [shape=record, style=bold, label="NULL"] } subgraph cluster_l_1 { color=black; label="lock" l2 [shape=record, style=bold, color=blue, label="node|locked=0|next=NULL"] } subgraph cluster_l_2 { color=black; label="lock" l4 [shape=record, style=bold, color=red, label="node|locked=0|next=NULL"] } subgraph cluster_l_3 { color=black label="lock" l5 [shape=record, style=bold, color=red, label="node|locked=0|next=NULL"] } subgraph cluster_l_4 { color=black label="lock" l6 [shape=record, style=bold, color=red, label="node|locked=0|next=NULL"] } subgraph cluster_l_5 { color=black label="lock" l7 [shape = plaintext, color=red, label=<<table BORDER="0"> <tr><td port="port1" BORDER="2">node</td></tr> <tr><td port="port2" BORDER="2" COLOR="blue"> locked=1 </td></tr> <tr><td port="port3" BORDER="2">next=NULL</td></tr> </table>>] } l0 [style=invis] l0->l1->l2->l4->l5->l6->l7 [style=invis] } ``` **MCS lock 機制** MCS 中的 `lock` 會一直轉移。如果有 cpu~2~ 又再次 try acquire lock,則 `lock` 又會裝新的 cpu~2~ 的 `node`,但仍然會 `spin on locked == 0` 。而之所以 cpu~0~ 在第一次取得時不用將 `locked` 設為 1 。因為它不用透過 `locked == 1` 成立來讓自己停止 spin 。因此其實第一個取得 `lock` 的 cpu~0~ 其 `node` 中的 `locked` 是沒有用處的,因為也不須因為 `locked == 0` 成立而 spin 。 **解決 cache coherence 帶來降低效能的問題** 此外還可以看到 `try acquire lock` 到 `enter critical section` 的過程只會檢查自己的node locked 。因此真正做 cache coherence 的時間點只有在 cpu~0~ 載入 存放 cpu~1~ 的 node 的 cacheline 改變 `locked = 1` 。而 cpu~0~ 發送 invalidate 。而 cpu~1~ 再將其 cache miss reload 。 **解決 unfairness** 而現在就是哪個 core 先 acquire 就先去 FIFO 的資料結構 linked list 中排隊。 **產生 size 的問題** ```c struct mcs_spinlock { struct mcs_spinlock *next; int locked; }; ``` > [MCS locks and qspinlocks](https://lwn.net/Articles/590243/) > In the tip tree, MCS locks are used in the implementation of mutexes, but they do not replace the existing ticket spinlock implementation. One reason for this is size: ticket spinlocks fit into a single 32-bit word, while MCS locks do not. That turns out to be important: spinlocks are embedded into many kernel structures, some of which (notably struct page) cannot tolerate an increase in size. If the MCS lock technique is to be used throughout the kernel, some other approach will be needed. #### 5. 
Qspinlocks --- ## TODO: 協作 [Issue #7](https://github.com/sysprog21/concurrency-primer/pull/7) ### compiler barrier and memory barrier > Code generation on atomic load and store is a little bit tricky. As explained in [LLVM's document](https://llvm.org/docs/Atomics.html#id16), sometimes they are just normal load/store, without atomic instructions, and their code generation can vary across different architectures. Should we explain more on this? atomic load 跟 atomic store 的用處是 atomic 編譯出來會加 compiler barrier 。就像在 2. Enforcing law and order,範例程式讓變數是 atomic 的屬性是為了使得其行程的指令順序不要被 re-order。 #### §5. Read-modify-write 四種溝通方式 atomic operation 以及 semantic atomic operation semantic atomic 用 atomic operation 來保護 critical section 中的 RMW 操作(確保指令不被干擾,不交換順序)。 ##### 5.1 Exchange Read-modify-write 直接變成 (1) 先自己修改一個私有變數,(2) 再直接將私有變數與共享變數交換。只要第二個步驟是 atomic 即可。 ```c global g_var; -------------------------------------------------------------------------------- thread 0 || thread 1 local l_var || local l_var ======= || ======= 變成一道 atomic 指令 || +--------------------+ || |R: load g_var | || |M&W: g_var = l_var | || +--------------------+ || +--------------------+ || |R: load g_var | || |M&W: g_var = l_var | || +--------------------+ ``` 而執行緒 1 也要 Modify and write 共享變數的操作也 atomic,就能實現目的:沒有其他執行緒 (M&W) 中的值會被其他執行緒 (M&W) 值覆蓋掉。 ##### 5.2 Test and set ```c global flag -------------------------------------------------------------------------------- thread 0 || thread 1 ======= || ======= test: flag : false || +critical section ---+ || |set: flag = true | || |R&M&W ... | || test: flag : true |clear: flag = false | || spin +--------------------+ || test: flag : false || +critical section ---+ || |set: flag = true | <- lock || |R&M&W ... | || |clear: flag = false | <- unlock +--------------------+ ``` 在進入critical section,(1)先檢查確定沒有其他執行緒在 critiacl section。如果有則 spin (等待)。(2)如果沒有則將 flag 設為 true,確保此執行緒在 critical section 時,不會有其他執行緒干涉。 (1) 跟 (2) test and set 為 atmoic 操作。 clear 也為 atmoic 。 ##### 5.3 Fetch and ... ```c global g_var; -------------------------------------------------------------------------------- thread 0 || thread 1 ======= || ======= 變成一道 atomic 指令 || +--------------------+ || |R: load g_var | || |M&W: g_var++ | || +--------------------+ || +--------------------+ || |R: load g_var | || |M&W: g_var-- | || +--------------------+ ``` 與 exchange 的差別插在有沒有私有變數去與共享變數交換。 ##### 5.4 Compare and Swap 如果當前的狀態符合預期的狀態,則 swap 成想要的狀態。如果不符合預期的狀態,則將預期的狀態 swap 成符合的狀態。 ``` global g_var = A; -------------------------------------------------------------------------------- thread 0 || thread 1 ======= || ======= local expected; || local expected; local l_var = B; || local l_var; -------------------------------------------------------------------------------- R: expected = g_var || do{ || M: l_var = ... || R: expected = g_var }while(!CAS(&expected)); || do{ +- CAS()---------------+ || M: l_var = ... <-----------+ | g_var == expected | || }while(!CAS(&expected)); | | W: g_var = l_var | || +- CAS()------------+ | | return true | || | g_var != expected | <- 代表有人在我執行時修改了 g_var +----------------------+ || | expected = g_var | | return false | +-------------------+ do{ M: l_var = ... 
<- Critical Section }while(!CAS(&expected)); ``` ABA lstack 如果失敗的話重新更新拿 orig 的值。 ```c static struct lstack_node *pop(_Atomic struct lstack_head *head) { struct lstack_head next, orig = atomic_load(head); do { if (!orig.node) return NULL; next.aba = orig.aba + 1; next.node = BBBB; } while (!atomic_compare_exchange_weak(head, &orig, next)); return orig.node; } ``` **總結 Read-modify-write :** * 5.1 Exchange (M & W) 透過 atomic 的特性保證交換私有與共享的變數。 讓 RMW 的操作都要為 atmoic。 * 5.2 Test and set (R & M & W) 透過 lock 與 unlock。 * 5.3 Fetch and ... (M & W) 都有 atomic 的特性保證操作共享變數。 讓 RMW 的操作都要為 atmoic * 5.4 Compare and Swap ( R ) 共享變數一開始被 atomic Read 。 (M) 私有變數被 Modify 。 透過 Compare 確保上述沒有人修改共享變數。 (W) atomic Swap 私有與共享的變數。 #### 5.5 rmw_example Following example code is a simplify implementation of thread pool to demonstrate the use of C11 atomic library. **Exchange** In function thread_pool_destroy, atomic_exchange(&thrd_pool->state, cancelled) reads current state and replaces it with ”cancelled”. A warning message is printed if the pool is destroyed when still running. If the exchange is not performed atomically, we may initially get the state as ”running”. Subsequently, a thread could set the state to ”cancelled” after finishing the last one, resulting in a false warning. **Test and set** In the rmw_example, the scenario is as follows: 1. The main thread initially creates a job and then set the future's flag, which is akin to acquiring a lock and then transfer its ownership to the worker. 2. Subsequently, the main thread will wait until the worker clear the flag. This inidcate that worker done the job and return the ownership back to the main thread. 3. This implies that the main thread will be blocked until the worker completes the job, which ensure correct cooperation. **Fetch and…** In the function thread_pool_destroy, atomic_fetch_and is utilized as a means to set the state to idle. Yet, in this case, it is not necessary, as the pool needs to be reinitialized for further use regardless. Its return value could be further utilized, for instance, to report the previous state and perform additional actions. #### 指令集的問題 > SC 最嚴格, Relaxed 最鬆。 memory order。 > Fetch and 在 x86 跟 arm 架構上都沒有 > x86 CISC 風格, arm LL/SC, IBM z/Architecture, > 但這裡使用 Fetch and … 是使用程式語言,但他還需要編譯器轉成指令集。所以真正執行的並不一定是指令集。 However, x86 architectures using CISC design lack these specific hardware instructions. Compilers achieve equivalent functionality using CAS (Compare-and-Swap). On ARM architecture, compilers implement this using LL/SC (Load Linked/Store Conditional). > [name=idoleat] However, atomic RMW operations here are merely a programming tool for programmers to achieve program logic correctness. Its actual execution as atomic operations depends on the how compiler translate it into actual atomic instructions based on differenct hardware instruction set. Exchange, fetch and add (FAA), test and set (TAS) and CAS in instruction level are different style of atomic RMW instructions. ISA could only provide some of them, leaving the rest to compilers to synthesize atomic RMW operations. For example, In IA32/64 and IBM System/360/z architectures, TAS functionality is directly supported by hardware instructions. x86 has XCHG, XADD for Exchange and FAA but has TAS implemented with XCHG. Arm, in another style, provides LL/SC (Load Linked/Store Conditional) flavor instructions for all the operations, with CAS added in Armv8/v9-A. 
> (reference : https://en.wikipedia.org/wiki/Test-and-set) ### `rmw_example.c` 將 Test and Set 的效果在 `rmw_example.c` 發揮更多的作用。在 [rmw_example.c#L68](https://github.com/idoleat/concurrency-primer/blob/ch5-examples/examples/rmw_example.c#L68) 中的程式碼能夠看出其效果的只有 Compare and Swap 。而這段程式碼 Exchange, Test and Set, Fetch and ..., Compare and Swap 都透過 `stdatomic.h` 中 atomic 的操作來執行。然而在第一章有提到: > System programmers are acquainted with tools such as mutexes, semaphores, and condition variables. However, the question remains: how do these tools work, and how do we write concurrent code in their absence? 所以可以 concurrency-primer 的內容可以先描述概念:不仰賴這些工具 像是 `stdatomic.h` 中 atomic 如何操作,在試著透過透過工具來看這些概念要用在哪個場景。 因為在 `rmw_example.c` 中 test_and_set 就是透過 set flag 與 clear flag 來達到 lock 與 unlock。 ```c static bool thread_pool_init(thread_pool_t *thrd_pool, size_t size) { if (atomic_flag_test_and_set(&thrd_pool->initialezed)) { printf("This thread pool has already been initialized.\n"); return false; } ``` 但 Test and Set 的操作並非一定要用 `atomic_flag_test_and_set`。像是 [concurrent-programs/mutex/mutex.h](https://github.com/sysprog21/concurrent-programs/blob/master/mutex/mutex.h) 也可以用 mutex 來達成這樣 lock 與 unlock 的效果。 因為如果要讓 Test and Set 發揮效果,並且為了讓別人知道 Test and Set 與 Compare and Swap 的使用場景,可以透過 test 失敗則要 spin 來區隔。compare 失敗要繼續做事情來區隔。 > Concurrency Primer (§ 5.2) Test and Set : > We could use it to build a simple spinlock。 然而如果講到這件事情搭配 `rmw_example.c` 就需要增加一個共享的資源。現在的共享資源為一個 thread_pool 。可以搭配一個像是每個工作都在共同存取一個共享變數為其做加法,而這個工享變數透過 mutex 來實作 flag 的功能。 #### [concurrent-programs/tpool/tpool.c](https://github.com/sysprog21/concurrent-programs/blob/master/tpool/tpool.c) 逼近 Pi,用 [BBP](https://en.wikipedia.org/wiki/Bailey%E2%80%93Borwein%E2%80%93Plouffe_formula) 可以計算個別位數,最後加起來就是近似結果,剛好適合這個情境 tpool_create()函式中 建立 4 個 thread 在 thread pool 。 一開始建立 4 個 workers ,每個 worker 要做的事情都是去 jobqueue_fetch()(第 4 行)。在 jobqueue_fetch 傳入 jobqueue 作為參數。 ```c= pool->count = count, pool->jobqueue = jobqueue; if ((pool->workers = malloc(count * sizeof(pthread_t)))) { for (int i = 0; i < count; i++) { if (pthread_create(&pool->workers[i], NULL, jobqueue_fetch, (void *) jobqueue)) { for (int j = 0; j < i; j++) pthread_cancel(pool->workers[j]); for (int j = 0; j < i; j++) pthread_join(pool->workers[j], NULL); free(pool->workers); jobqueue_destroy(jobqueue); free(pool); return NULL; } } return pool; } ``` ```graphviz digraph G { rankdir="LR" threadpool [shape ="record" label = "thread pool|{{worker 1|worker 2|worker 3|worker 4}|{couter : 4|<f1>job queue}}"] jobqueue [shape="record" label = "<f0>job queue|{<f1>head|<f2>tail}"] subgraph{ rankdir="TB" threadtask_1 [shape="record" label = "<f0>thread task|{<f1>next|<f2>arg|<f3>func|<f4>future}"] threadtask_2 [shape="record" label = "<f0>thread task|{<f1>next|<f2>arg|<f3>func|<f4>future}"] threadtask_1:f1 -> threadtask_2:f0 } threadpool:f1 -> jobqueue future1 [shape = "record" label = "<f0>future|{<f1>flag|<f2>result}"] future2 [shape = "record" label = "<f0>future|{<f1>flag|<f2>result}"] jobqueue:f1 -> threadtask_1:f0 jobqueue:f2 -> threadtask_2:f0 threadtask_1:f4->future1 threadtask_2:f4->future2 } ``` 變數命名為 `thread` 執行緒是做工作的,`task` 是鏈節串列的資料結構,`job` 為真正要做的事情。如此一來可以將 `tpool.c` 應用到處理多個不同類型的 `job` 上。 `threadtask` 資料結構中,有存放下一個 task 的指標為 `next`,`arg` 存放是哪個執行緒 (id) 做這個工作,有工作(job)內容 `void *(*func)(void *);` `arg` 存放是哪個執行緒 (id) 做這個工作。`future` 存這個工作的資訊。 `future` 的 `flag` 存這個 `task` 四個狀態。這個作用在當 `jobqueue_fetch()` 一直去拿 jobqueue 頭端的工作。`flag` 用來描述工作的狀態,其目的是控制所有的 jobqueue ,當工作量到一定的程度,就算還有工作在 jobqueue 中,也會透過 `__FUTURE_DESTROYED` 與 
`__FUTURE_CANCELLED` 的狀態使執行緒就算要拿工作也會失敗。 `result` 存結果,存每個 worker 計算的結果,每個 worker 會拿對應的 job 以及用其對應的 task->arg 代入 BBP formula 。在[concurrent-programs/tpool/tpool.c](https://github.com/sysprog21/concurrent-programs/blob/master/tpool/tpool.c#L208)的 `jobqueue_fetch()` 針對不同的 `task->future->flag` 的狀態會有不同的操作。因為有可能 task 沒有被做完時,就會被 destroy 。 ```c=1 if (task->func) { ... void *ret_value = task->func(task->arg); pthread_mutex_lock(&task->future->mutex); if (task->future->flag & __FUTURE_DESTROYED) { pthread_mutex_unlock(&task->future->mutex); pthread_mutex_destroy(&task->future->mutex); pthread_cond_destroy(&task->future->cond_finished); free(task->future); } else { task->future->flag |= __FUTURE_FINISHED; task->future->result = ret_value; pthread_cond_broadcast(&task->future->cond_finished); pthread_mutex_unlock(&task->future->mutex); } free(task); } else { pthread_mutex_destroy(&task->future->mutex); pthread_cond_destroy(&task->future->cond_finished); free(task->future); free(task); break; } ``` 其中最能夠直接決定 pi 的結果的為第 12 行程式碼計算 bbp。如果 task 沒有被 destroy,則會做 20 行的工作。因為要將 BBP formula 的結果存入 `task->future->result` (第 12 行),而這裡會與 main thread 要取每個 `future[i]->result` 的值相加會有 race condition。因此有 mutex lock 與 condition variable 在保護這段 critical section (第 11~12 行)。在修改完 `future` 的內容後,會透過 condition variable 廣播給正在等待這個 `future` 的主執行緒。廣播完後 unlock (第 14 行)先前在(第 4 行)lock 的 mutex 。 而主執行緒會依序去取存在 futures 陣列中的每個 future 的值。因為主執行緒跟每個其他的執行緒(worker)會競爭 `future` 。因此這裡讀 `future[i]->result` 需要有 mutex lock 保護(第二行)。 ```c=1 void *tpool_future_get(struct __tpool_future *future, unsigned int seconds) { pthread_mutex_lock(&future->mutex); /* turn off the timeout bit set previously */ future->flag &= ~__FUTURE_TIMEOUT; while ((future->flag & __FUTURE_FINISHED) == 0) { if (seconds) { struct timespec expire_time; clock_gettime(CLOCK_MONOTONIC, &expire_time); expire_time.tv_sec += seconds; int status = pthread_cond_timedwait(&future->cond_finished, &future->mutex, &expire_time); if (status == ETIMEDOUT) { future->flag |= __FUTURE_TIMEOUT; pthread_mutex_unlock(&future->mutex); return NULL; } } else pthread_cond_wait(&future->cond_finished, &future->mutex); } pthread_mutex_unlock(&future->mutex); return future->result; } ``` 然而當主執行緒要去取值時,worker 不一定已經把值算完了。 當 worker 已經過狀態 RUNNING (`__FUTURE_RUNNING = 01`) 以及 FINISHED (`__FUTURE_FINISHED = 02`) 。 1. 則現在的狀態為`future->flag = (01 | 02) = 03`。 2. 經過(第 5 行)`future->flag = ((03) & (~04)) = 03`。 3. 因此(第 6 行) while 迴圈因此 while 迴圈`(future->flag = 03 & __FUTURE_FINISHED = 02) == 02` 不為 `0` ,主執行緒不會等待。 4. (第 22 行)直接 unlock `future->mutex` 而主執行緒就可以讀到 BBP 的結果。 反之,如果 worker 只經過狀態 RUNNING (`__FUTURE_RUNNING = 01`)。 1. 則現在的狀態為`future->flag = (01)`。 2. 經過(第 5 行)`future->flag = ((01) & (~04)) = 01`。 3. 因此(第 6 行) while 迴圈`(future->flag = 01) & ( __FUTURE_FINISHED = 02) == 00` 為 `0` ,則主執行緒會等待。 4. 
等待的方式,因為目前的程式碼將 seconds 設定為 0 ,所以不會做有時間限定的 mutex ,因此主執行緒(第 18 )在 condition variable 上等待直到 worker 透過其廣播。 上述這種阻塞的機制,包含兩個重點。第一是 mutex lock 讓主程式緒跟一個對應的 worker 能夠不要同時對 future 做操作。第二個是透過 `__FUTURE_FINISHED` bitwise 的操作讓主程序等待直到 worker 做完。 接下來把 `tpool.c` 的 pthread 的用法 `pthread_mutex_t` [mtx_init](https://en.cppreference.com/w/c/thread/mtx_init)與 `pthread_cond_t` [cnd_init](https://en.cppreference.com/w/c/thread/cnd_init) 改成 c11 使用的 `mtx_t` 與 `cnd_t` 。 ```diff - pthread_mutex_init(&future->mutex, NULL); - pthread_cond_init(&future->cond_finished, &attr); + mtx_init(&future->mutex,mtx_timed); + cnd_init(&future->cond_finished); ``` 同時在 `tpool.c` 中雖然有些有時間期限的 mutex 但卻將 `seconds = 0`,所以還是用 `mtx_plain` 而非 [cnd_timedwait](https://en.cppreference.com/w/c/thread/cnd_timedwait) 。 [commit 746920c](/https://github.com/idoleat/concurrency-primer/commit/746920c0b70f699463af0bf20d02996741948d76) 中的 `worker()` 與 `add_job()` 分別對應 `tpool.c` 的 `jobqueue_fetch()` 與 `tpool_apply()`。原先 `examples/rmw_example.c` 的 job 為 `printf()` 現在改為 func pointer 去指向 `bbp()` 去計算 bbp 。這讓程式碼更有彈性可以之後可以執行其他的 job 。 接下來原先方法一(上述)的為將私有變數相加,方法二是將它變成 `bbp_sum` 變為一個共享變數,所有 worker 都會去競爭該資源,將其個別計算的 bbp 結果加入共享變數。討論過後如下。 1. 方法一為每個 `future` 都有對應一個 lock ,只有 主執行緒與對應的 worker 會去競爭。而如果改成方法二則是將競爭資源的數量提高為 worker 的數量。方法一的 lock 細粒度比 方法二的 lock 還高,因此方法一的 scability 較高。參考 [spinlock 可擴展能力議題](https://hackmd.io/@sysprog/linux-spinlock-scalability) 與 [藉由 spinlock 的調整,改善多核處理器建立 TCP 連線效能](https://hackmd.io/@sysprog/BJv5HGD3m?type=view) 。但是方法一最終仍有一個主執行緒去依序取值(線性的時間複雜度),如果要取的 future 還在被 lock 中仍然需要等待 `cond_timedwait`。 2. 為了呼應 〈[Concurrency Primer](https://github.com/sysprog21/concurrency-primer)〉 的 §5.2 Test and Set。有 lock 與 unlock 的方式,將 mutex 的操作改為 test_and_set 。 [commit f37c425](https://github.com/idoleat/concurrency-primer/commit/f37c4254b6bb84b2a381d89fd477c548a430dadc) 為了方便在書中描述使用情境 lock $\to$ critical section $\to$ unlock,因此想要用 critical section 的操作,否則直接用 [`atomic_fetch_add`](https://en.cppreference.com/w/c/atomic/atomic_fetch_add) 就可以了,但這是 〈[Concurrency Primer](https://github.com/sysprog21/concurrency-primer)〉 的 §5.3 Fetch and ... 的使用情境。因此透過 test_and_set 的 test 搭配 while 迴圈達到 spin wait 的效果。test 實現 lock 的效果。clear 實現 unlock 的效果。 3. Deadlock 的解決:原先改為 test and set 的版本時,沒有考慮到順序的問題。原先的寫法,問題發生在如果是主執行緒先取得 lock ,後則會等待 worker 將工作完成 `future->flag == __FUTURE_FINISHED`。 worker 沒將工作完成則 while 迴圈條件式一直成立,則主執行緒一直等待。然而 worker 因為還沒取得 lock (因為 lock 已經被主執行緒所持有),所以無法工作,將 `future->flag |= __FUTURE_FINISHED` 。因此兩者都在等待對方的資源。 ```graphviz digraph G { mainthread [shape ="ellipse" label = "main thread"] worker [shape="ellipse" label = "worker"] lock1 [shape = "box" label = "acquire lock"] lock2 [shape = "box" label = "wait lock"] flag1 [shape = "box" label = "wait :\nflag->finish"] flag2 [shape = "box" label = "set :\nflag->finish"] mainthread -> lock1 -> flag1 worker -> lock2 -> flag2 } ``` ```diff - while (atomic_flag_test_and_set(&future->lock)) - ; - while (future->flag != __FUTURE_FINISHED) - ; + while (future->flag != __FUTURE_FINISHED) + ; + while (atomic_flag_test_and_set(&future->lock)) + ; ``` ##### 寫法二 [對應 lock free 解法四](https://hackmd.io/dV98JlUPQm-O7y2cpTVOpg?both#SPMC-%E8%A7%A3%E6%B3%95%E5%9B%9B---Lock-base-and-Lock-Free-Concurrency-Tool-%E7%9A%84%E8%A7%A3%E6%B3%95) 透過 ThreadSanitizer ```shell ThreadSanitizer:DEADLYSIGNAL ==5850==ERROR: ThreadSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7fde3b8ac310 bp 0x7fde394bf330 sp 0x7fde394bf298 T5851) ==5850==The signal is caused by a WRITE memory access. ==5850==Hint: address points to the zero page. 
ThreadSanitizer:DEADLYSIGNAL ThreadSanitizer:DEADLYSIGNAL ThreadSanitizer:DEADLYSIGNAL ThreadSanitizer:DEADLYSIGNAL ThreadSanitizer: nested bug in the same thread, aborting. ``` 這個是不支援 C11 的導致的問題 把工作佇列 ( job queue )改成單向鏈結串列,生產者與消費者都從頭部存取工作。同時把程式的流程改成 main thread 先建立完所有的工作`add_job()`,再去建立 worker `create_worker()` 則會避免生產者與消費者競爭頭部,而有 ABA 問題發生。同時也不會因為生產者 suspend,而影響消費者沒有 job ,導致整個系統沒有 progress 。這為 lock free 的操作。 ```c int main() { struct tpool_future *futures[PRECISION + 1]; double bbp_sum = 0; tpool_t thrd_pool; if (!tpool_init(&thrd_pool, N_THREADS)) { printf("failed to init.\n"); return 0; } // Employer add job to the job queue for (int i = 0; i <= PRECISION; i++) futures[i] = add_job(&thrd_pool, bbp, i); // Employer recruit many workers create_worker(&thrd_pool); // Employer wait for the result of job for (int i = 0; i <= PRECISION; i++) { tpool_future_wait(futures[i]); bbp_sum += *(double *)(futures[i]->result); tpool_future_destroy(futures[i]); } tpool_destroy(&thrd_pool); printf("PI calculated with %d terms: %.15f\n", PRECISION + 1, bbp_sum); return 0; } ``` 而因為消費者雖然會競爭工作佇列的頭部以拿取工,但如果用 CAS 就是 lock free。 ```c static int worker(void *args) { if (!args) return EXIT_FAILURE; tpool_t *thrd_pool = (tpool_t *)args; while(1){ head_job_t newhead, head = atomic_load(thrd_pool->head); do{ if (head.job == NULL) return EXIT_SUCCESS; newhead.job = head.job->next; }while (!atomic_compare_exchange_weak(thrd_pool->head, &head, newhead)); job_t *job = head.job; job->future->result = (void *)job->func(job->future->id); atomic_flag_clear(&job->future->flag); free(job); } return EXIT_SUCCESS; } ``` > 如果要用 fetch and 的想法,可以讓每個 worker 拿到 job 也就是計算 bbp 的時候再去拿 sigma 的項數,看現在要計算下面 bbp 公式的第幾個 $k$ $\pi = \sum_{k=0}^\infty \frac{1}{16^k}(\frac{4}{8k+1}-\frac{2}{8k+4}-\frac{1}{8k+5}-\frac{1}{8k+6})$ ```shell Simplify the rmw_example code Simplify the rmw_example code to provide minimal examples. - Remove the state transitions in the thread pool originally used for controlling interaction between the main thread and other threads. - After the main thread has added all jobs to the job queue, it then creates several other threads. - The other threads simply compete for access to the job queue using Compare and Swap operations accompanied with a loop, and take the job from it. ``` ##### 寫法三 但寫法二就算是 lock free 也不是一個有彈性的 thread pool ,有彈性的 thread pool 指的是生產者能隨時都添加工作,消費者能隨時取工作。所以 main thread 的流程應該是先建立完所有的消費者然後再去增加工作。但為了要做到 lock free。消費者搶所有工作還是用 CAS。生產者放工作也一樣用 CAS 。這麼做就符合 [An Introduction to Lock-Free Programming](https://preshing.com/20120612/an-introduction-to-lock-free-programming/) multiple writer $\to$ 使用 CAS 搭配 loop $\to$ 有 ABA 問題。雖然原本的消費者也是 multiple writer 但是因為他們的操作都是拿取工作所以不是這裡的情況:消費者一號存取工作的過程中,生產者放工作,消費者二號又拿走新的工作的問題。 [Parallel Programming: The ABA problem, a bit of Concurrency Theory: Linearizability, Sequential Consistency, Consensus](https://spcl.inf.ethz.ch/Teaching/2019-pp/lectures/PP-l21-ConcurrencyTheory.pdf) 提到 ABA 問題的四種解法 1. DCAS (double compare and swap) [lstack](https://github.com/skeeto/lstack) 中多一個紀錄 aba 的值就算。 ```c /* A is an atomic type. 
C is the non-atomic type corresponding to A */ bool atomic_compare_exchange_strong(A* obj, C* expected, C desired) { if (memcmp(obj, expected, sizeof(*object)) == 0) { memcpy(obj, &desired, sizeof(*object)); return true; } else { memcpy(expected, obj, sizeof(*object)); return false; } } ``` > **§6.2.6.2 Integer types [C99](https://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf)** > It is possible for objects x and y with the same effective type T to have the same value when they are accessed as objects of type T, but to have different values in other contexts. In particular, if == is defined for type T, then x == y does not imply that memcmp(&x, &y, sizeof (T)) == 0. > Furthermore, x == y does not necessarily imply that x and y have the same value; other operations on values of type T may distinguish between them. > **§7.21.4.1 The memcmp function [C99](https://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf)** > **Synopsis** > #include <string.h> > int memcmp(const void *s1, const void *s2, size_t n); > **Description** > The memcmp function compares the first n characters of the object pointed to by s1 to the first n characters of the object pointed to by s2. > **Returns** > The memcmp function returns an integer greater than, equal to, or less than zero, accordingly as the object pointed to by s1 is greater than, equal to, or less than the object pointed to by s2. 由此可知他會逐一比對每個 byte 裡面的值。這樣用一個結構體裡面包著 job 以及一個對應的 aba 值就可以達到 double compare and swap 。因為這樣就會去比結構體裡的 aba ,不需要寫兩個while 迴圈判斷式。 :::danger > [Parallel Programming: The ABA problem, a bit of Concurrency Theory: Linearizability, Sequential Consistency, Consensus](https://spcl.inf.ethz.ch/Teaching/2019-pp/lectures/PP-l21-ConcurrencyTheory.pdf) 中說 double compare and swap >not available on most platforms 不確定是在說有 64 位元跟 32 位元的不能一次比較太大的結構體。 ::: ##### 寫法四 改為 mutex lock 與 condition variable 的實作時使用 [commit 746920c](https://github.com/idoleat/concurrency-primer/commit/746920c0b70f699463af0bf20d02996741948d76) ##### 寫法五 > TODO: Test and Set 設計讓主執行緒去 spin,但其實可以改為 spin_hint 的操作,但因為要有喚醒休息的執行緒的功能,因此在之後提到 broadcast 跟 signal 的操作時再來探討。 --- ## TODO: 改進〈[Concurrency Primer](https://github.com/sysprog21/concurrency-primer)〉關於 Atomics 的論述 ### §3. Atomicity > 移到第 A 章: just make sure that any variables used for thread synchronization are no larger than the CPU word size. > > #### 問題 2 : false sharing >然而在多核多個執行緒共同存取一個大於 CPU word size 的共享變數,需要考量到記憶體架構。則會遇到 false sharing 的問題 > > #### 問題 2 : false sharing 解決方法: > 使用 alignas > > #### 問題 3 : **Memory Barrier** >由於記憶體架構: CPU 有自己的 private L1 cache 與 shared memory。因此並行真的透過共享資源合作,這件事情就無法透過 atmoic 保護。因為 atomic 的指令並不會一次就直接寫到 shared memory,而是遵循記憶體架構一層層寫入。(或是透過 Store-buffer),因此這裡 Memory Ordering 使得指令會 out-of-order 的問題。 > >**atomic** 實作 atmoic 會遇到的問題在於: 這裡要區分是多核的並行,還是單核的並行。如果是多核的話會探討到記憶體階層 跟 cache coherence 的機制,如果是單核的話是競爭 cacheline 。 在多核:由於記憶體階層的結構。 首先 Atomics 操作的功能是讓指令與指令可以一氣呵成,如此一來進入 critical section ,修改共享記憶體,都不會與其他執行緒有競爭。但是在多核的實作上面由於在共享 public 的記憶體上具備多層 private 的記憶體的 cache ,這件事情即便指令一氣呵成,仍無法解決。不為 atomic 解決的範疇。 因此在處理因為多核的實作,在共享 public 的記憶體上而有 race condition 的問題需要其他解決方式。 問題3 : If threads do not use atomic reads and writes to share data, we are still in trouble. 解決辦法 在 §5. 
Read-modify-write --- ## Reference > [The C11 and C++11 Concurrency Model](https://docs.google.com/presentation/d/1IndzU1LDyHcm1blE0FecDyY1QpCfysUm95q_D2Cj-_U/) > [CppMem: Interactive C/C++ memory model](http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/) > [Consistency in Non-Transactional Distributed Storage Systems](https://www.semanticscholar.org/reader/f736336e6d7c5803095e84e02c9823d0a38faa37) > [CppCon 2014: Herb Sutter "Lock-Free Programming](https://youtu.be/c1gO9aB9nbs) > [An Introduction to Lock-Free Programming](https://preshing.com/20120612/an-introduction-to-lock-free-programming/)