Concurrency 學習筆記

Introduction

電腦系統當中把排程任務的基本單元稱為行程，每個行程之間的記憶體位址是互相獨立且分離的，除非特定操作不然不允許行程之間互相存取，這使得行程之間的溝通或者 concurrency 的實現方法會變得昂貴。相較之下，執行緒共享相同的記憶體位址，使得同一個 thread group 底下的執行緒可以利用 shared memory 來進行溝通，實現 concurrency 變得更容易且執行上更快速。

閱讀〈Concurrency Primer〉

既然執行緒之間利用共享記憶體空間來溝通並實現 concurrency ，同步就會是非常重要的議題，重點在執行緒對共享記憶體進行寫入或存取的順序 。我們從組合語言甚至硬體架構， CPU pipeline 的角度切入試圖了解 concurrency 的各種問題與實作方式，不僅僅是將任務切分成數個細小區塊切換執行這麼單純。
在現代電腦架構當中，記憶體架構時常有很多層，多核心處理器也是非常常見的，每個核心會有自己的 store buffer ，如此一來寫入的操作不會阻塞住其他指令的執行，這裡的難點在如何讓整個系統的記憶體保持 coherent ，也就是每個核心的每個寫入動作對於其他核心而言都是可見的。你會想，指令的順序不就是我們寫程式各項指令的順序嗎？其實不然，在現代硬體架構與編譯器優化的影響下，實際程式執行的指令順序、執行哪些指令，早就不是我們閱讀原始檔就能理解的。

Atomic

在 C 語言當中引入 <stdatomic.h> 即可使用 atomic 類型的資料型態，最主要的影響是它帶來 single total modification order ，編譯器保證不會改變 atomic variable 周圍其他變數的 loads and stores 順序。整個程式的執行順序依舊可能和你想像的不同，但它保證執行後的結果會和程式碼所寫的順序相同，稱為 sequential consistency 。
但這一切是建立在執行緒之間共享的變數是 atomic 且真的具有 atomicity 的情況，什麼情況可能違反呢？試想如果兩個跑在 32-bit machine 上的執行緒之間共享一個 64-bit integer 的 counter ，則我們不可能在滿足 atomicity 條件下存取這個 counter ，因此我們也應該確保任何共享變數的大小都不該大於 CPU word size 。但如果很不巧的你的共享變數 T 大於 CPU word size ，我們可以透過補上 _Atomic 使得編譯器在 runtime 時會自動將該變數的讀取與寫入用 lock 圍繞住。

Read-Modify-Write

實際上談到 atomic 的時候，我們應該從硬體的觀點，即微處理器的指令來看，要不全部操作都成功，要不就全部失敗，當 atomic operation 失敗時就重來直到成功為止。
單核處理器可以透過關閉 global interrupt 來確保 atomic operation ，多核架構在 lock 和 counter 的實作上則必須透過特殊硬體支援來確保 RMW 是 atomic 的。例如會提供以下幾種特殊指令

Exchange

讀取當前值並將其更換為一個新的值，但要是讀取和變更的行為被切分為不同步驟，則 race condition 可能發生。

Test and Set

透過一個 atomic boolean value 來達成，在 C 語言當中即是 atomic_flag 。讀取該布林變數並設定為 true 。

Fetch and …

讀取一個變數值，對它進行美些操作並回傳原先操作前的值，上述動作包成一個 atomic operation 。

Compare and Swap

又可以簡寫為 CAS ，讀取變數值，如果目前的值和 target 相同則改變該變數的值。
以下程式碼為例，原先從 target 讀取到的舊值為 compare ，如果舊值和記憶體當中的值相同，則把新的值 new 寫入 target 當中。

lock cmpxchg(target, compare, new):
    lock(memory_bus);
    register = load(target);
    if (register == compare)
        store(target, new);
    unlock(memory_bus);
    return register;

這裡的關鍵概念是，目前執行緒先載入共用變數當前的值，在嘗試修改的時候利用之前載入的值和當前記憶體的內含值進行比對以確認操作為 atomic 。

如果有某個指令可以打斷 CAS 流程提前對該記憶體進行寫入，但是寫入的值和原先相同，則 CAS 是否能偵測到？
atomic operation 實作到底要不要用到 lock ?

Atomic operations as building blocks

我們可以將 atomic loads, atomic stores and RMW operations 稱為 concurrency 的 building blocks ，又可以把這些 building blocks 分為 blocking 和 lockless 。

blocking : 能使執行緒被阻塞住一段時間，可能造成 deadlock 或 livelock 。
lockless : 每個執行緒永遠都繼續執行

硬體架構也可以被分為兩種

strongly-ordered : loads and stores 大多數情況會保持 system-wide 順序，例如 x64 。在這個限制底下 load/load 、 store\store 、 load\store 都禁止交換執行順序，只有 store\load 可以，稱為 Total Store Order 。
weakly-ordered : 除非使用 memory barriers ，不然不能信任程式執行時期的 loads and store 順序和程式撰寫的順序相同，例如 ARM 。

LL/SC instructions

多數 RISC 架構硬體上缺少 RMW 指令集，我們不能用一般的 loads and stores 實作 RMW 。因此我們需要以下兩種指令

load-link : 從某個記憶體位址進行讀取，並告知處理器要持續監聽該位址的值有無變化。
store-conditional : 只有當嘗試寫入的位址再對應的 load-link 指令執行後沒有被其他 store 指令更改，才進行寫入。

但要求處理器監聽位址當中的每個 bytes 是非常昂貴的行為，因此多數處理器選擇用較粗糙的方式進行監控，例如監控整個 cache line ，這代表只要對監控位址存在的 block 當中的任意位址進行寫入，對應的 store-conditional 都會失敗 (spurious SC fail) ，不只是被 load-link 監控的那個位址。這也是 CAS 還有分為 strong 或 weak 的原因。

如果利用某些 CAS loops 實現的 lockless algorithm 對某個非 atomic variable 進行操作，行為如下

讀取變數值
對該值進行一些 non-atomic operations
對新的值和舊值進行 CAS
如果 CAS 失敗，代表有其他執行緒先改變該變數值，所以我們重新來一遍

利用 CAS strong 來實作的話，編譯器會利用巢狀迴圈，使用內層的迴圈來處理 spurious SC fails ，外層迴圈則進行我們原先想做的操作直到沒有其他執行緒造成我們的 CAS 失敗為止。 CAS weak 則允許 spurious SC fail 的發生，因此不會產生內層回圈，不管是不是 spurious SC fail ，只要 CAS fail 則重來。

Memory ordering

除了 sequencial consisten ，還有數種不同的 memory ordering 如下

Sequentailly Consistent (memory_order_seq_cst)
Acquire (memory_order_acquire)
Release (memory_order_release)
Relaxed (memory_order_relaxed)
Acquire-Rlease (memory_order_acq_rel)
Consume (memory_order_consume)

Acquire and release

我們可以將 acquire 和 release 視為 one-way barriers 。

acquire operation 可以防止 load 被排到 barrier 後面，或防止 barrier 後面的 load / store 重排到 barrier 前面的 load 。換言之允許 barrier 前面的 store 排到 barrier 後面。

release operation 則禁止 barrier 後面的 store 被排到 barrier 前面，或是防止 barrier 前面的 load / store 重排到 barrier 後面的 store 。只允許 barrier 後面的 load 被排到 barrier 前面。

並用時可以做到 writer to reader synchornization ，如果 writer thread 透過 release semantics 儲存一個值，而 reader thread 透過 acquire semantics 獲取該值，則所有在該 store-release 之前進行的 write 在 reader 的 load-acquire 之後都是可見的。

Relaxed

執行緒之間的共享變數操作不需要特定順序時，我們可以使用 relaxed atomic operations 。在 CAS loops 當中大量使用 relaxed loads 。只要指令之間沒有相依關係， load/store 可以任意交換，即 weak order 。

Acquire-release

在 atomic RMW operations 當中會用到 memory_order_acq_rel ，因為我們同時需要對一個變數值進行 load-acquire 跟 store-release 。此處許多人應該會認為 acq_rel 和 seq_cst 是一樣的，其實還是有些區別， acquire-release 是保證相對於該變數的 load / store 都是 load-acquire 和 store-released ，而 sequentailly consistent 則是保證整個程式的 global order 。

Consume

如果某個變數很少被變更，但是經常被許多執行緒讀取，則我們可以對讀取的動作進行優化，若能在 weakly-ordered system 上避免對 loads 加上 memory barrier 就可以做到這件事。

考慮到每個處理器核心會有自己的快取，如果利用 lock 來實現同步機制， cache line 可能會花費大量時間在不斷改變值，如果 critical section 很大快要執行一段時間那還好，如果很小塊，使得多個執行緒不斷對該 lock 進行搶佔，則對應的每個 cache line 更新的成本可能會比執行 critical section 本身更高。

volatile

volatile 是用來通知編譯器我們程式當中的某個值可能被我們程式以外的執行緒或行為給變更，它提供兩個保證

編譯器不會省略任何 loads or stores
編譯器不會對 volatile 變數之間的順序做重排

因此 volatile 變數以外的變數的 load / store 依舊可能被省略或重排，除非你的硬體或編譯器保證絕對不會進行重排，不然 volatile 並不是一個實現同步機制的方法。

Atomic fusion

atomic 變數雖然可以避免大多數編譯器優化產生的問題，但並非 100% 的避免。考慮到以下程式

while (tmp = foo.load(memory_order_relaxed))
        doSomething(tmp);

由於 load-relaxed 並不提供任何執行順序上的保證，因此編譯器有可能把上面的程式碼變成以下的樣子

while (tmp = foo.load(memory_order_relaxed)){
        doSomething(tmp);
        doSomething(tmp);
        doSomething(tmp);
        doSomething(tmp);
    
}

當這種 fusing 的行為是不被接受的時候，我們需要把變數轉為 volatile 或使用 asm volatile("" ::: "memory") 來避免這種行為。 Linux kernel 的巨集 READ_ONCE(), WRITE_ONCE() 就是提供這樣的功能。

執行順序

Evaluation

可以分為兩件事

value computations : 一串運算式計算的結果
side effect : 修改物件狀態

C 語言沒有明確定義 evaluation 的順序，可由編譯器自行決定。

Sequenced-before

對同一個執行緒下， evaluation 順序關係的描述
A is sequenced-before B, and B isn't sequenced before A : A 的求值會先完成後才進行 B 的求值
A isn't sequenced-before B, and B is sequenced before A : B 的求值會先完成，才開始對 A 的求值
A isn't sequenced before B, and B isn't sequenced before A : 有兩種可能，一種是順序不固定，另一種則是兩個執行緒求值過程可能重疊。

我們撰寫的程式語言就是在定義一連串的 sequenced-before 關係，例如以下幾個經典的行為

第 x 行的 evaluation 會先於第 x+1 行的 evaluation
運算元的 value compuation 會先於運算子的 value compuation (f1() + f2() ，先計算 f1(), f2() 才進行 f1()+f2()) ，此規則不包含 side effect
i++ ， value computation 先於 side effect
++i ， value computation 後於 side effect
&&, || 左邊運算元的 evaluation 先於右邊運算元的 evaluation ，因此 i++ && (i + j) ， && 右邊的 i 會是左邊經過完整 evaluation 後的結果

但以下程式碼就是 C 語言的未定義行為， sequenced-before 的關係取決於編譯器與執行環境

i = i++;

特別注意 sequenced-before 具有非對稱性 (asymmetric) 和遞移性 (transitive) 。

Happens-before

這裡強調的是可見的 visible ，而非執行順序。 A happens-before B 代表 A 的效果對於 B 而言是可見的。其中我們在 C11 的 §5.1.2.4/7 可以查閱到

All modifications to a particular atomic object M occur in some particular total order, called the modification order of M. If A and B are modifications of an atomic object M, and A happens before B, then A shall precede B in the modification order of M…

什麼是 particular total order 呢？在數學當中的 total order 指的是一個集合裡的二元關係，具備以下四種性質

反身性 (reflexive)

$\forall a \in X, a R a$
反對稱 (antisymmetric)

$\forall a, b \in X, a R b a n d b R a \Rightarrow a = b$
遞移性 (transitive)

$\forall a, b, c \in X, a R b a n d b R c \Rightarrow a R c$
連通性 (connected) : 集合當中任兩元素必定滿足其中一個關係

$\forall a, b \in X, a R b o r b R a$

C11 當中的定義即是在告訴我們，對於 atomic object M 的任何改動都遵守某個 particular total order ，假設我們用

R

表示 happens-before relation

$R (A, B) a n d R (B, A) \Rightarrow A = B$ ，如果 A happens-before B 且 B happens-before A 則 A 和 B 是相同的 statement
$R (A, B) a n d R (B, C) \Rightarrow R (A, C)$ ，如果 A happens-before B 且 B happens-before C 則 A happens-before C 。
$R (A, B) o r R (B, A)$ ，任意兩 evaluation A, B 一定滿足 A happens-before B 或者 B happens-before A ，不會有兩者的 evaluation 效果對於對方來說都不可見的情況發生。

同一個執行緒當中的 happens-before 關係即是 sequenced-before 關係，若同時存在多個執行緒，則 happens-before 關係會變得至關重要，多執行緒環境中，我們必須設法確保不同執行緒之間的 happens-before 關係。

Synchornized-with

描述不同執行緒間的同步行為，若 A synchronized-with B ，則代表 A 對記憶體操作的效果對 B 是可見的。白話一點可以想像成是不同執行緒之間的 happens-before 關係。

Memory Consistency Models

在這裡重新定義一個程式執行 "正確" 的定義，並非只會發生一種執行結果，而是定義在所有可能發生的執行結果當中，哪些是允許的，這樣的約定稱為 Memory Consistency Models ，系統要在這樣的約定底下，想辦法做最佳化。

Sequential Consistency

由 Leslie Lamport 提出以下的定義

A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

在多處理器架構底下， Sequential Consistency 保證程式執行的結果會和所有處理器遵守 sequential order 執行結果相同，就像是程式當中規範的 order 那樣 (sequenced-before) 。

注意這裡強調的是執行結果，而非執行順序，換句話說只要執行結果和原本規範的程式執行順序之後的執行結果相同就可以，實際上的程式執行順序是可以改變的。 Memory model 又分為四種模式

Weak(relaxed) memory model : 所有 memory reordering 都被允許
Weak with data dependency ordering : 保證 load 的順序，可能存在 store/store 或 store/load 的 reordering 。
Usually strong : 保證 acquire and release semantic 的執行，只存在 store/load reordering 。
Sequentially consistent : 保證全部順序。

確保執行順序和前一段程式描述的部分要重讀
memory_order_relaxed 的效果是什麼？單純讓該 statement 的 evaluation 為 atomic 嗎？ load , store 允許任何形式的重排？

acquire-release ordering

atomic load 屬於 acquire 操作， atomic store 屬於 release 操作，最主要有兩種效果，其中一個是限制 CPU 指令的重排

在 store 之前的所有讀寫操作，不允許被移到該 store 的後面
在 load 之後的所有讀寫操作，不允許被移到該 load 的前面

在這樣的情況下，假設執行緒1 store 某個數值，而執行緒2 load 該數值的動作是成功的，則由於執行緒1 store happens-before 執行緒2 load ，且由於遞移性的關係，執行緒1 store 之前的所有讀寫操作都 happens-before 執行緒2 的 load 。

Dekker's Algorithms & Peterson's Algorithm

Dekker's Algorithm 程式碼如下

    Processor 1                       | Processor 2
    ===========                       | ===========
 0: while (true) {                    | while (true) {
 1:     /* non-critical code */       |     /* non-critical code */
 2:     flag[0] = true;               |     flag[1] = true;
 3:     while (flag[1]) {             |     while (flag[0]) {
 4:         if (turn != 0) {          |          if (turn != 1) {
 5:             flag[0] = false;      |              flag[1] = false;
 6:             while (turn != 0) ;   |              while (turn != 1) ;
 7:             flag[0] = true;       |              flag[1] = true;
 8:         }                         |          }
 9:     }                             |     }
10:     /* critical section */        |     /* critical section */
11:     turn = 1;                     |     turn = 0;
12:     flag[0] = false;              |     flag[1] = false; 
13: }                                 | }

若處理器把 flag[0], flag[1] 認定為不存在相依性，則會認為即使改變他們的執行順序也不會影響結果，這時候如果兩個處理器都先執行第三行才執行第二行，就會同時進入 critical section 。

接下來考慮 Peterson's Algorithm ，該演算法對硬體的假設是 load 和 store 皆為 atomic operation ，並且為共享記憶體模型，在這個假設下可以證明此演算法的正確性。
共享變數分別是

bool flag[2]
int turn

    Thread 0                         | Thread 1
    ========                         | ========
0: while (true) {                    | while (true) {
1:     /* non-critical code */       |     /* non-critical code */
2:     flag[0] = true;               |     flag[1] = true;
3:     turn = 1;                     |     turn = 0;
4:     while (flag[1] && turn == 1)  |     while (flag[0] && turn == 0)
5:         ;                         |         ;
6:     /* critical section */        |     /* critical section */
7:     flag[0] = false;              |     flag[1] = false;
8: }                                 | }

即使第四行比第二行先執行，也會因為 turn 變數一個時間點只可能是 0 或 1 所以保證只有一個執行緒進入 critical section 。但即使這樣把這個演算法實作並運行在 x86-64 的硬體架構上，由於 load/store 指令為 atomic opration 的假設在該架構上不滿足，因此會出現錯誤，由於 lock 實作當中 load/store 會因為指令重排的關係不遵守 Sequential Consistency ，所以 mutual exclusion 無法被滿足，需要額外指令來抑制重排。

硬體架構與執行順序

Store buffers with Bypassing Capability
假設每個處理器都有自己的 write buffer (store buffer) ，其中一個最佳化方式是當處理器進行 store 時，會先寫入 store buffer ，而非直接寫入 main memory ，避免等待寫入 main memory 的時間，所以 main memory 當中的資料其實是舊的，如果這時候就允許讀取 main memory 當中的資料則會讀到舊資料而導致違反 Sequential Consistency 。
Overlapping Write Operations
在一個沒有 bus 且各個記憶體模組互相連結的系統中，我們考慮以下情境，處理器按照程式要求的順序發出寫入請求，但是不等待前一個寫入執行完畢，就立刻發出下一個請求，這時候可能會出現以下場景，寫入要求 1 發出，寫入要求 2 發出，寫入要求 2 完成，寫入要求 1 完成。一開始發送要求的順序看似符合規範，但完成的結果卻違反 Sequential Consistency ，要解決這個問題我們需要等待當前要求獲得回應後再發出下一個寫入請求。
Non-Blocking Read Operations
如果最佳化時使得讀取順序可以變更，也可能違反 Sequential Consistency 。

Cache Architecture & SC

上述例子當中就可以看到一但寫入或讀取記憶體順序不同，都有可能違反 Sequential Consistency ，在具有快取的系統當中更是如此，我們需要確保同一份資料在所有處理器的快取當中保持一致，如此一來才能維持 Sequential Consistency 。
因此我們需要有足夠嚴格的 Cache Coherence Protocol ，主要有以下兩個原則

所有的寫入最終會被所有處理器看見 (visible)
對相同位址的寫入順序對於所有處理器而言都是一致的

第二點代表針對相同位址的寫入，所有處理器會一致的看到相同順序，不會發生重排順序的情況，對於不同位址的寫入，所有處理器也應該看到相同的順序，來避免任何寫入順序的重排。

為了維持程式執行的順序，代表我們要確保上一個寫入操作完成後才能進行下一個，沒有快取的系統當中只需要等待 main memory 回傳完成的訊號即可，但對於有快取的系統來說相對複雜，在有快取的系統當中，寫入完成的意義是，所有處理器都能看到新的寫入值。

尚需探討 cache coherence protocol 。

同時我們也該了解多個執行緒競爭 cache 的行為，也就是多個執行緒同時存取同一個 cacheline 。由於硬體當中 cache coherence 的基本單元是 cacheline ，倘若一個變數座落在兩個不同 cacheline 上 (假設 cacheline 大小是 32 位元而該變數佔據 64 位元)，則 cachline 競爭就缺少了 atomic 特性，此處反應了多核處理器架構下，多數問題就來自 cache 。

在多核處理器架構下被多個執行緒共享的主記憶體往往很慢，因為會頻繁進行 cache coherence ，要提高速度最好的解決辦法就是降低共享

MPMC (multi-producer/multi-consumer) 的程式很難有好的 scalibility ，因為極限吞吐量 (throughput) 取決於 cache coherence 成本，而非處理器個數，最好是用 SPMC 或 MPSC ，或者多個 SPSC 代替。
計時器的實作有兩種，一種是讓每個使用到的執行緒都不斷修改計時器的值，但如此一來會因為處理器不斷進行 cache coherence 而使性能大受影響，若這個計時器的功能只是輸出流水號，我們可用 thread-local storage 改寫程式碼，在特定時間點將所有執行緒的數值合併。

False sharing
本來兩個執行緒之間不分享的兩個變數，卻因為這兩個變數落入同個 cacheline ，使得兩個執行緒競爭相同 cacheline ，稱為 false sharing 。

memory barrier 到底是什麼？作用是什麼？怎麼實作？

Memory Barrier

要理解 Memory Barrier 首先理解在多核處理器架構上，因為多了快取的關係使得資料生效的順序更加難以掌控， out of order execution 不一定指指令重排，它們很有可能按照原先希望的順序執行，但卻因為快取、 store buffer 、 invalidate queue 的關係使得每個處理器看到指令生效時間不同。

讓所有處理器的快取資料保持一致的方法稱為 cache coherence protocol ，以 MESI protocol 為例，若有某個 CPU 要寫入資料，要先確保其他 CPU 都把同一位置的快取 invalidate 後才能寫資料到自己的 cache 當中，如此可以保證資料的一致性，但廣播 invalidate 訊號到收齊所有 acknowledgement 中間的延遲可能很長，造成 CPU 閒置。

Store buffer

我們可以引入 store buffer 來解決上述問題，也就是嘗試進行寫入操作的 CPU 不再等待 invalidate ack ，先將資料寫入 store buffer ，然後繼續做事，之後收到 invalidate ack 再將 store buffer 當中最新資料寫到 cache 上。然後收到 invalidate 訊號的 CPU 直接回覆 invalidate ack ，但實際上先把 invalidate 請求放到 invalidate queue 當中，等等再處理。但這會造成 Store buffer 處所寫的問題， write memory barrier 會確保在 barrier 之前的值如果在 store buffer 當中，則會比 barrier 之後的值先更新到 cache 上， write memory barrier 會透過 marked 的方式做到這點，如果 CPU 在對 cache 進行寫入時發現 store buffer 當中有 marked 的欄位，則會先把 marked 的資料寫到快取中，再進行後續操作。

Invalidate queue

如果再多核處理器架構下再加入 invalidate queue ，先保存 invalidate 請求而直接回覆 invalidate ack ，可以縮短 CPU stall time ，但同樣的也會造成 cache 資料當中不一致的問題，此時可以利用 read memory barrier ，執行到 read memory barrier 時會將 invalidate queue 當中的資料標記為 marked ，之後 CPU 在讀取 cache 當中的值時，若發現 invalidate queue 當中對應資料為 marked ，則會先把該資料進行 invalidate 之後再繼續讀取，執行 invalidate 後就不會讀自己 cache 當中的值，而會改從其他 CPU 的 cache 當中進行讀取。

Compiler Barrier

(TODO)

Linux kernel Memory Barrier

/Documentation/memory-barriers.txt
在 Linux kernel 當中也要解決 memory ordering 的問題，甚至因為它是一個要應付多種不同硬體架構的作業系統，更需要嚴格的 memory barrier 。對於一個 CPU 來說最少有以下幾個保證要做到

有相依性的記憶體操作必須保證依照該順序觸發，例如 Q = READ_ONCE(P); D = READ_ONCE(*Q); ， CPU 保證進行以下操作 Q = LOAD P, D = LOAD *Q ，若是在 DEC Alpha 架構上還會多加上 memory barrier Q = LOAD P, MEMORY_BARRIER, D = LOAD *Q, MEMORY_BARRIER 。
特定 CPU 上若有重疊的 load/store 操作(針對同一塊記憶體區段的操作)，例如 a = READ_ONCE(*X); WRITE_ONCE(*X, b); ，則 CPU 保證只會進行以下這個順序的操作， a = LOAD *X, STORE *X = b 。

另外對於程式設計師而言有幾件事要注意

絕不能假設編譯器在沒有使用 READ_ONCE(), WRITE_ONCE() 時一定會對記憶體執行對應的操作順序，沒有這些巨集的使用，編譯器可以進行任何可能的重排。
絕不能假設獨立的記憶體操作會依照順序進行，任何排列都是有可能的。
重疊的記憶體操作肯定有可能被合併或被捨棄。

還有一些 anti-guarantees

以上的保證事項都不適用於 bitfields ，不要使用 bitfield 成員來同步平行演算法。
如果要用 lock 來保護 bitfields 成員，注意到同一個 bitfield 當中的所有成員都要用同一個 lock 來保護。

Memory barrier 在 Linux kernel 中分為以下幾種

Write (store) memory barrier
用來保證系統當中所有在該 write memory barrier 之前的 store 操作都會發生在該 write memory barrier 之後的 store 操作之前。對於 load 指令沒有任何效用。
Address-dependency barriers
Historical ，自行查閱原資料
Read (load) memory barrier
同時也是一個 address-dependency barrier ，同時保證系統中所有在該 read memory barrier 之前的 load 操作都會發生在該 read memory barrier 之後的 load 操作之前。
General memory barrier
在該 general memory barrier 之前的所有 load/store 指令都會發生在該 general memory barrier 之後的所有 load/store 指令之前。

有數個 implicite varieties

Acquire operation

Lock types in Linux kernel

Kernel 當中提供了數種 locking primitives 而它們可以被分為三種種類

Sleeping locks
CPU local locks
Spinning locks

此處我們就不深究 PREEMPT_RT 。

Sleeping locks

Sleeping locks 只能在 preemptible task context 當中被獲取，雖然在 kernel 當中的實作允許其他 contexts 也能進行 try_lock() ，但小心評估 unlock() 和 try_lock() 的安全性是必須的。簡單來說，除非沒有別的選擇，不然不要在其他種 contexts 當中嘗試獲取 sleeping locks 。

屬於 sleeping locks 的 lock types 有

mutex
rt_mutex
支援 priority inheritance 的 mutex
semaphore
counting semaphore 的實作，通常用在 serialization 和 waiting ，但新的使用案例通常建議將 serialization 和 wait 機制分開使用，例如 mutexes 和 completions 。
而由於 counting semaphores 沒有持有者的概念，因此在 PREEMPT_RT 當中不需要改變其實作，也因此無法提供 priority inheritance 給 semaphores 。結果是如果被 semephore 阻塞則可能導致 priority inversion 。
rw_semaphore
是一個 multiple readers, single writer lock 機制，在 non-PREEMPT_RT kernel 當中實作為 fair ，所以不會造成 writer starvation 。
ww_mutex
percpu_rw_semaphore

在 PREEMPT_RT kernel 當中，以下三種 lock 也會被轉換成 sleeping locks

local_lock
spinlock_t
rwlock_t

CPU local locks

只有一種，即為 local_lock 。
在 non-PREEMPT_RT kernel 當中， local_lock 函式會作為 preemption 和 interrupt disabling primitives 的 wrapper 。和其他 locking 機制相比的話，取消 preemption 或 interrupts 是單純的 CPU local concurrency control mechanisms 而非 CPU 之間的 concurrency control 。

它透過取消 preemption 或 interrupts 來達成對 critical section 的保護。以下是在 non-PREEMPT_RT kernel 當中， local_lock 操作和 preemption, interrupt primitives 之間的對應

local_lock	preemption/interrupt primitives
`local_lock(&llock)`	`preempt_disable()`
`local_unlock(&llock)`	`preempt_enable()`
`local_lock_irq(&llock)`	`local_irq_disable()`
`local_unlock_irq(&llock)`	`local_irq_enable()`
`local_lock_irqsave(&llock)`	`local_irq_save()`
`local_unlock_irqrestore(&llock)`	`local_irq_restore()`

local_lock 提供的 named scope 和普通的 primitives 相比有兩個優勢

lock name 能夠提供靜態分析以及對於被保護範圍清晰的解說，普通的 primitives 是沒有 scope 概念以及 opaque 的。
若啟用 lockdep 則 local_lock 會取得一個 lockmap 來驗證它保護機制的正確性。可以檢查使用 interrupt 或 soft-interrupt context 是否使用 preempt_disable() 作為 protection mechanism 。

Spinning locks

有兩種

raw_spinlock_t
bit spinlocks

若在 non-PREEMPT_RT kernel 當中，以下兩種 lock 也是 spinning locks

spinlock_t
rwlock_t

spinning locks 背後隱含的操作同時也會取消 preemption 而 lock/unlock 函式可以有數種 suffix 來代表不同的保護機制

_bh() : disable/enable bottom halves (soft interrupts)
_irq() : disable/enable interrupts
_irqsave/restore() : save and disable/restore interrupt disabled state

Owner semantics

上述的 locks 除了 semaphores 以外，有一個必須嚴格遵守的規定
The context (task) that acquired the lock must release it.

Mutex

Generic Mutex Subsystem

What are mutexes ?

在 Linux kernel 當中 mutex 指的是一種可以確保共享記憶體當中的存取順序的 locking primitive ，在 Linux kernel 當中被歸納在 sleeping locks 當中，其行為和 binary semaphore 類似。

Implementation

定義在 /include/linux/mutex_types.h 當中，註解特別提到 mutex 的幾個特性（針對同一個 mutex 的情況）

一次只有一個 task 可以取得該 mutex
只有 owner 可以將該 mutex 解鎖
進行多次解鎖是不被允許的
recursive locking 是不被允許的
mutex 實體需要透過 API 建立，不可以利用 memset 或 copying
task 不能在持有 mutex 的情況下離開
持有 mutex lock 的記憶體區塊不能被釋放
被持有的 mutex 不能被重新初始化
mutex 不能在 hardware or software interrupt context 當中被使用，例如 tasklets 和 timers

我們可以透過開啟 DEBUG_MUTEXES 選項來確保上述的規定都被確保。
其中成員變數當中的 owner 會紀錄目前狀態並利用一個 struct task_struct * 指向自己的持有者，由於 struct task_struct * 至少對齊 L1_CACHE_BYTES ，較低的幾個位元會用來儲存其他資訊。最基本的形式它至少還帶有一個 wait-queue 和一個 spinlock 來確保取得該 mutex lock 的存取順序。

當我們嘗試取得某個 mutex 時，有以下幾種可能的形式，取決於該 mutex 目前的狀態

fast path
將 current task 和該 mutex 當前的 owner 進行 atomic cmpxchg() ，這只會發生在沒有競爭的情況，若有競爭則進入下一個形式。
midpath
也稱為 optimistic spinning ，此處我們預設 mutex 的 owner 正在運行所以很快就會進行釋放，則該嘗試獲取 mutex 的 task 會持續 spin 直到獲取該 lock ，並且確保當前沒有比該 task 優先權更高的 task 準備執行。而且我們會利用 MCS lock 確保 mutex spinner 被 queue up ，以確保一次只有一個 spinner 來競爭該 mutex 。
MCS lock 是一種 spinlock ，但和傳統 test-and-set spinlock 不同的是他可以避免 cacheline bouncing ，它透過使得每個嘗試獲取該 lock 的 CPU 都在自己的 local variable 上 spin 來達成這件事。它是為了 sleeping lock 的 optimistic spinning 特別製作的。當位在 MCS spinlock 的 queue 當中的任務需要被 reschedule 時，它們可以自行離開，從而避免需要被 reschedule 的任務被卡在 queue 當中持續等待某個 mutex 。
slow path
若 mutex lock 依舊無法被取得，則將該 task 放入 wait-queue 當中進入沈睡狀態，直到有人呼叫 unlock 。

另外和其他 sleeping locks 一樣， mutex 不具有 implicit reference 指向它們保護的共享記憶體空間，而利用 mutex_unlock() 來釋放 mutex 和該記憶體空間，特別注意在 mutex_unlock() 期間或之後原本持有 mutex 的 owner context 依舊可以獲取該 mutex 內容並對其進行修改，所以在此期間其他 context 嘗試獲取該 mutex 的行為都是不安全的。

Linux kernel 當中 mutex 最大的缺點在於佔用的空間太大，例如在 x86-64 架構上佔了 32 bytes 。

Spinlocks

(To be edited)

Futex

Description

futex() 身為一個系統呼叫，提供一個機制來等待特定條件成立，可以用來保護共享記憶體空間進行同步，當我們利用 futex 時，大部分的同步操作都會在 user space 當中進行。

一個 futex 是一個 32-bit 稱為 futex word 的變數，我們在呼叫 futex() 系統呼叫時需要提供 futex 的地址。 futex 的記憶體空間是透過 mmap(), shmat() 等方式建立，以此來分享在多個行程之間。當我們執行某個 futex operation 的目的是請求阻塞某個執行緒， kernel 會檢查 calling thread 提供的 futex word 和預期的 futex word 是否相同，若相同 kernel 才會阻塞。從 futex word 的載入到和預期的值進行比較一直到將執行緒阻塞，這一連串的操作都是 atomic 的並且和其他執行緒對於相同 futex word 的操作是 totally ordered ，也就是說透過 futex 進行 blocking 是一個 atomic compare-and-block operation 。

futex 也可以用來實作 lock ，而 lock 的狀態可以在共享記憶體區段當中被表示為一個 atomic flag ，在非競爭狀態，執行緒可以透過 atomic compare-and-exchange 來取得並變更 lock 狀態，若是在競爭的狀況也就是已經有其他執行緒佔用這個 lock ，則這次的 futex() 系統呼叫則會阻塞，導致自己進行 sleep state 。而原本持有該 lock 的執行緒在釋放 lock 時要先將 lock state 改為 not acquired 之後再透過該 futex word 執行一個 futex() 系統呼叫來將所有沈睡在該 lock 上的執行緒喚醒。

Lock Dependency

lockdep-design

Lock-class

在 Linux 核心當中 a class of lock 代表的是一群有相同 locking rules 的 locks 。即使該 lock 有數種實作，舉例來說 inode 當中的一個 lock 便是一個 lock class ，每個 inode 當中都有各自該 lock 的實體。

Validator 的作用是追蹤 lock-classes 的 'usage state' ，同時也會檢查 lock-classes 之間的相依性。 Lock usage 代表在對應的 IRQ contexts 當中該 lock 是如何被使用的， lock dependency 則代表 lock order ，本文當中 L1->L2 代表一個任務當前持有 L1 並嘗試獲取 L2 。以 lockdep 的觀點來說，兩個 lock 不一定相關，相依性只代表某個 lock order 曾經出現過。 Validator 的作用即是確保 lock usage 和 dependenncies 都為正確的。

Lock-class 的行為是由它的成員一起組成的，當系統開啟後某個 lock-class 有第一個實體被使用後，整個 lock-class 會被註冊，使得剩下所有的實體也跟著開始被追蹤它們的 usage state 和 dependencies 。即使 lock-class 當中的實體被移除， lock-class 也不會被移除，只有當 lock-class 的記憶體空間被系統回收才會被移除，例如核心模組被移除或者 workqueue 被清除。

Usage state

Validator 追蹤 lock class 的使用紀錄並且把 usage 區分為 (4 usages * n STATEs + 1) 個種類。

這四種 usage 分別為

ever held in STATE context
ever held as readlock in STATE context
ever held with STATE enabled
ever held as readlock with STATE enabled

而 n STATEs 則被記錄在 kernel/locking/lockdep_states.h 當中，目前包括

hardirq
softirq

最後一個加一的種類則是

ever used (== !unused)

若 locking rules 被違反時，這些 usage bits 會被寫在 locking error messages 當中。對於一個 lock 而言，位元的位置從左到右代表該 lock 的 usage 和 readlock ，分別對應到以上 n STATEs ，而顯示的字元代表

'.' : acquired while irqs disabled and not in irq context
'-' : acquired in irq context
'+' : acquired with irqs enabled
'?' : acquired in irq context with irqs enabled

對於一個 STATE 而言，該 lock 是否在該 STATE context 當中被持有，和該 STATE 是否有啟動，可以排出四種組合，列在以下 table 當中。

	IRQ enabled	IRQ disabled
ever in irq	'?'	'-'
never in irq	'+'	'.'

沒有被使用過的 lock 則不可能進入 error 的一部分。

Single-lock state rules

若一個 lock 被描述為 irq-safe 代表它曾在 irq context 當中被使用，若為 irq-unsafe 則代表它曾在 irq enabled 狀態下被持有。

一個 softirq-unsafe 的 lock-class 同時也會是 hardirq-unsafe 。以下的狀態有排他性，每個 lock-class 一次只能依據它的 usage 有一個狀態

<hardirq-safe> or <hardirq-unsafe>
<softirq-safe> or <softirq-unsafe>

若一個 lock 可以在 irq context 當中被使用 ( irq-safe ) ，則它無法在 irq enabled 時被持有 ( irq-unsafe ) 。否則 deadlock 可能發生。舉個例子，如果某個 lock 在一個 context 當中被持有，這時候該 context 被中斷，則該 lock 會被嘗試獲取兩次，創造出一個 deadlock ，也稱為 lock recursion deadlock 。

同一個 lock-class 當中的 lock 不能被獲取兩次，它會導致 lock recursion deadlocks 。

Multi-lock dependency rules

如果有兩個 lock 的話，它們不能被以相反的順序獲取

<L1> -> <L2>
<L2> -> <L1>

這會導致 lock inversion deadlock ，這樣的獲取順序會出現一個 circle 導致互相永遠的等待， validator 檢查出這樣錯誤的時間複雜度會是隨機的，由於當中還可能存在其他的 locking sequence ，即便如此 validator 還是會檢查這些 lock 是否有嘗試利用 circular fashion 來獲取 lock 。

另外以下的方式也是不允許的

<hardirq-safe>   ->  <hardirq-unsafe>
<softirq-safe>   ->  <softirq-unsafe>

第一條規則是由於 hardirq-safe lock 可能被 hardirq context 取得，並中斷一個 hardirq-unsafe lock ，最後會導致 lock inversion deadlock 。同樣的 softirq-safe lock 可能被 softirq context 取得，中斷一個 softirq-unsafe lock 。

上述的規則會被強加在每個 kernel 當中的 lock ，當嘗試獲取一個新的 lock 時， validator 都會檢查當前持有的 locks 和新的 lock 之間是否違反這些規則。

若一個 lock-class 改變了它的 state ，以下的規則也需要遵守

若有新的 hardirq-safe lock 被發現，我們檢查它是否曾經持有任何 hardirq-unsafe lock
若有新的 softirq-safe lock 被發現，我們檢查它是否曾經持有任何 softirq-unsafe lock
若有新的 hardirq-unsafe lock 被發現，我們檢查它是否曾經持有任何 hardirq-safe lock
若有新的 softirq-unsafe lock 被發現，我們檢查它是否曾經持有任何 softirq-safe lock

同樣這些規則也會在所有 lock 上被檢查，因為一個 interrupt context 可以中斷任何 irq-unsafe 或 hardirq-unsafe lock ，這會導致 lock inversion deadlock ，即便該情況還沒發生，都會立刻報錯

Exception: Nested Data dependencies leading to nested locking

以下有幾個在 Linux 核心當中嘗試獲取同一個 lock-class 當中的多個實體的案例。通常如果相同型別當中的物件有階層架構層比較容易發生。在這些案例當中兩個物體之間通常具有繼承關係，而 kernel 獲取 lock 的順序對於這些物體來說是固定的。

其中一個在階層架構下產生 nested locking 的案例是 whole disk block-dev object 和一個 partition block-dev object 。 partition 是整個裝置的一部分，其中 whole disk lock 相比起 partition lock 更為 higher lock ，這樣的 lock ordering 是完全正確的。 Validator 並不會自動檢查這類自然順序，因為其 ordering 背後的 locking rule 並非固定的。

為了讓 validator 知道正確的 usage model ，許多 locking primitives 出現了新的版本，讓使用者可以定義一個 "nesting level" ，例如 block device mutex

enum bdev_bd_mutex_lock_class
{
     BD_MUTEX_NORMAL,
     BD_MUTEX_WHOLE,
     BD_MUTEX_PARTITION
};

mutex_lock_nested(&bdev->bd_contains->bd_mutex, BD_MUTEX_PARTITION);

以上程式碼代表在 bdev object 上的 locking 為 partition 。 Validator 對待這種在 nested 風格當中的 lock 的方式就是對待一個獨立的 class 。

Annotations

有兩個函式家族是用來標記並檢查特定的 lock 是否有被持有

lockdep_assert_held*(&lock)
lockdep_*pin_lock(&lock)

其中 lockdep_assert_held*(&lock) 在 kernel 當中被大量使用，是巨集家族，用來判斷特定的 lock 在特定時間是否有被持有。（若沒被持有則會產生 WARN() ）。

另一個巨集家族 lockdep_*pin_lock(&lock) 目前只有在 rq->lock 有被用到。它們用來檢查某個 lock 是否被意外地 unlocked 。在具有 callbacks 的情景當中很好用， upper layer 通常會假設 lock 依然被持有，但 lower layer 自認為可以將該 lock 釋放再重新獲取。 lockdep_pin_lock() 會回傳一個 struct pin_cookie 然後在 lockdep_unpin_lock() 當中使用它，來確保沒人篡改過該 lock 。

Proof of 100% correctness

Validator 在每個簡單並且獨立的 single-task locking sequence 當中都可以達到完美的 mathematical 'closure' ， validator 可以證明這些 locking sequence 不可能在任何時間點以任何組合造成 lock related deadlock 。

至於更為複雜的 multi-CPU 和 multi-taskl locking 情境當中則不需要實際發生來證明 deadlock ，只有 simple component locking chains 才需要至少發生一次來讓 validator 確認它的正確性。

這可以很大程度地降低 kernel 當中 locking 相關 QA 的複雜性，在 QA 階段只需要盡量產生非常多的 simple single-task locking dependencies ，而不需要檢查不同 CPU 之間所有 locking 的組合。

Performance

以上提到的規則都是執行時期的檢查，如果每次獲取 lock 和 irqs-enable event 都要檢查的話，整個系統會慢到無法使用。這些檢查的時間複雜度是

O (n^{2})

。

為了解決這個問題，我們針對每種 locking scenario 只會檢查一次。 kernel 會維護一個 simple stack of held locks ，另外計算一個輕量的 64-bit hash value ，每種 lock chain 的 hash value 都是 unique 的。若該 chain 第一次被檢查，則對應的 hash value 會被放到 hash table 當中，而該 hash table 會被以 lockfree 的方式檢查。之後相同的 hash value 如果再出現，則 hash table 就會告訴我們不用再檢查了。

Troubleshooting

(TODO)

Recursive read locks

lockers 可以被區分為三種類型

writers (例如 exclusive lockers, spin_lock() 或 write_lock() )
non-recursive readers (例如 shared locks, down_read() )
recursive readers ( recursive shared lockers, rcu_read_lock() )

以下描述當中 W, E 代表 writers ( exclusive writers ), r 代表 non-recursive readers , R 代表 recursive readers, S 代表 all readers ( non-recursive + recursive ) ，兩種都是 shared lock 。 N 代表 writers 和 non-recursive readers, 兩種都是 non-recursive

Recursive readers 和名稱相同，代表 lockers 被允許在相同 lock instance 的另一個 reader 的 critical section 當中嘗試取得該 lock ，換句話說，允許在 lock instance 出現 nested read-side critical sections 。

如果換成 non-recursive readers ，若嘗試在相同 lock instance 的另一個 reader 的 critical section 當中獲取該 lock 則會造成 self deadlock 。

recursive readers 和 non-recursive readers 的差異在於 recursive readers 只會被 write lock holder 阻塞著，而 non-recursive readers 則會被 write lock waiter 阻塞著。用以下的例子來看

TASK A:                 TASK B:

read_lock(X);
                        write_lock(X);
read_lock_2(X);

首先 task A 透過 read_lock(X) 取得 X 的 reader (不管是否 recursive) ，而 task B 嘗試取得 X 的 writer 時它會被阻塞並且成為 writer on X 的 waiter 。此時若 read_lock_2() 是一個 recursive readers ， task A 可以持續進展，因為 writer waiters 不會阻塞 recursive readers ，也不會有 deadlock 。但若 read_lock_2() 是一個 non-recursive readers ， task A 則會被阻塞著因為 non-recursive readers 會被 writer waiter 阻塞。同時造成 self deadlock 。

Block conditions on readers/writers of the same lock instance

簡單來說有四種 block conditions

Writers block other writers
Readers block writers
Writers block both recursive readers and non-recursive readers
Readers (recursive or not) don't block other recursive readers but may block non-recursive readers (because of the potential co-existing writer waiters)

再以一個例子顯示 recursive read lock 和 non-recursive read lock 的差異

TASK A:                 TASK B:

read_lock(X);

                        write_lock(X);

read_lock(X);

若 X 為 recursive read lock ，則上述情景不會造成 deadlock ，由於當 task B 成為 X 的 write lock waiter 時，第二個 read_lock() 不會被阻塞著。但如果 X 為 non-recursive read lock ，則 task B 會阻塞 task A 第二次獲取 reader lock 的行為，造成 deadlock 。

特別注意一個 lock 可以是 write lock ( exclusive lock ) 或一個 non-recursive read lock ( non-recursive shared lock ) 或是一個 recursive read lock ( recursive shared lock ) ，端看使用什麼 lock operation 來取得它 (更精確的說，是 lock_acquire() 當中的參數 read 的值)。換句話說，一個 lock 有三種型別，端看 acquisition functions

exclusive
non-recursive read
recursive read

更精確的說，我們把 write locks 和 non-recursive read lock 稱為 non-recursive locks 而 recursive read locks 稱為 recursive locks 。

Recursive locks 不會互相阻塞，而 non-recursive locks 會。另外一個 non-recursive lock 可以阻塞對應的 recursive lock ，反之亦然。

例如以下的例子是 recursive locks 造成的 deadlock

TASK A:                 TASK B:

read_lock(X);
                        read_lock(Y);
write_lock(Y);
                        write_lock(X);

此處 task A 會等待 task B 呼叫 read_unlock(Y) 而 task B 則等待 task A 呼叫 read_unlock(X) 。

Dependency types and strong dependency paths

Lock dependencies 會紀錄一對 locks 之間獲取的順序，由於 lockers 有三種，總共會有九種 dependencies ，不過以下可以顯示只需要四種 lock dependencies 就足以進行 deadlock detection 。

以下的標示

L1->L2

代表 lockdep 在執行時期看見 L1 已經被當前 context 持有了，而該 context 嘗試獲取 L2 。而偵測 deadlock 的邏輯可以理解為我們是否有可能被 L2 阻塞著而同時持有 L1 ，換句話說，是否還存在一個 locker L3 而 L1 阻塞著 L3 時 L2 被 L3 阻塞。

因此我們只在乎兩件事

L1 阻塞了什麼
誰阻塞了 L2

我們可以將 recursive readers 和 non-recursive readers 結合給 L1 (因為它們被相同型別阻塞)，而可以結合 writers 和 non-recursive readers 給 L2 。

總共有四種 dependency edges 可以存在 lockdep graph 當中

-(ER)->
exclusive writer to recursive reader, X-(ER)->Y 代表 X->Y 且 X 為 writer 而 Y 為 recursive reader
-(EN)->
exclusive writer to non-recursive locker dependency
-(SR)->
shared reader to recursive reader dependency
-(SN)->
shared reader to non-recursive locker dependency

對於一組 locks 而言，他們之間可以有多個 dependencies

TASK A:

read_lock(X);
write_lock(Y);
...

TASK B:

write_lock(X);
write_lock(Y);

則在 dependency graph 當中我們就會有 X-(SN)->Y 和 X-(EN)->Y 。這樣我們可以表達成 -(xN)-> ，同樣也有 -(Ex)->, -(xR)-> 和 -(Sx)-> 這樣的表達方法。

許多連續的 dependency edge 則稱為 "path" ，另外還定義了 "strong" path ，代表在 path 當中的 edges 都具有 strong dependency ，代表連續的兩個 edges 不會是 -(xR)-> 和 -(Sx)-> 。
換句話說，一個 "strong" path 當中若有 X->Y->Z ，而 X 到 Y 是透過 -(SR)-> 或 -(ER)-> 則 Y 到 Z 不能是 -(SN)-> 或 -(SR)-> 。

Recursive Read Deadlock Detection

Lemma 1
If there's a closed strong path, then there's a combination of locking sequences that causes deadlock. (strong circle is sufficient for deadlock detection)

Lemma 2
If there's no closed strong path, then there's no combination of locking sequences that could cause deadlock. (strong circles are necessary for deadlock detection)

搭配以上兩個 lemma ，我們可以推論出一個 closed strong path 對於 deadlock 而言是充分且必要條件。因此一個 closed strong path 等同於出現 deadlock 的機率。

此處我們先考慮不會造成 deadlock 的 dependency circles

Proof for sufficiency (Lemma 1)
假設我們有個 strong circle

L1 -> L2 -> ... -> Ln -> L1

假設每個 Lx 在 Lx -> Lx+1 當中都被不同的 CPU/task 持有。由於 L1 -> L2 因此 L1 的持有者會嘗試在持有 L1 的情況獲取 L2 ，但 L2 已經被持有，而且 L1 -> L2 和 L2 -> L3 都不是 -(xR)-> 和 -(Sx)-> ，這代表 L1 -> L2 當中的 L2 為 non-recursive locker ，或 L2 -> L3 當中的 L2 為一個 writer 。

由上述推論我們可以得出 Lx 的持有者都必須等到 Lx+1 的持有者釋放 lock 才能獲取，因此造成 deadlock 。

Proof for neccesary (Lemma 2)
Lemma 2 的敘述等價於，若有 deadlock 情境出現，則一定有 strong circle 在 dependency graph 當中。

此處我們假設有 n 個 CPU/task ， P1 … Pn ，而 Px 等待要取得的 lock 稱為 Lx ，所以 P1 會在持有 Ln 的情況下等待索取 L1 ，因此 dependency graph 中會出現 Ln -> L1 ，同樣的 L1 -> L2 … 。
這代表它們會形成一個 circle

Ln -> L1 -> L2 -> ... -> Ln

接著要證明此 circle 為 strong ，對於 lock Lx 來說， Px 貢獻了 Lx-1 -> Lx 這個 dependency ，而 Px+1 貢獻了 Lx -> Lx+1 這個 dependency ，由於 Px 等待 Px+1 釋放 Lx ，所以 Px+1 持有的 Lx 不可能是 reader 且 Px 看見的 Lx 不可能是 recursive reader ，因為 readers 不會阻塞 recursive readers ，因此 Lx-1 -> Lx 和 Lx -> Lx+1 不可能為 -(xR)->, -(Sx)-> pair ，而這對於 circle 當中每個 lock 都成立，因此該 circle 為 strong 。

RCU

RCU 是 Read-copy update 的縮寫，是一種同步機制，相較於傳統的 locking 機制不區分 readers 、 updaters ，只提供不同執行緒之間的 critical section ，或者是允許同時進行並行讀取但在更新的過程當中只能有一個 updater 的 reader-writer locks ， RCU 提供同時數個 readers 與一個 updater 並行存在的機制。甚至在某些例子例如 non-preemptable kernel 當中， RCU 的讀取端完全沒有 overhead 。 RCU 由三種機制組成，分別用來處理 insertion 、 deletion 和使得 readers 可以安全地在同時有 insertion 和 deletion 的情況下進行讀取。

RCU 其中一個重要的特點是它可以在資料被更改的同時保證讀取安全讀取資料。它利用 publish-subscribe 機制來提供讀取時同時寫入的能力，考慮以下兩段程式，分別對應 updater 和 reader 。

updater

struct foo {
  int a;
  int b;
  int c;
};
struct foo *gp = NULL;

/* . . . */

p = kmalloc(sizeof(*p), GFP_KERNEL);
p->a = 1;
p->b = 2;
p->c = 3;
gp = p;

reader

p = gp;
if (p != NULL) {
    do_something_with(p->a, p->b, p->c);
}

以上兩段看似正常的程式會發生什麼問題呢？首先我們都知道編譯器或 CPU 都可能造成程式實際執行順序和預期不同，如果 gp = p 比起 p 成員的初始化先執行，那同時進行讀取的 readers 可能看到未被初始化的值，這個情況可以利用 memory barrier 來解決， rcu_assign_pointer() 當中就含有這樣的機制，因此 updater 程式碼可以更正如下

p->a = 1;
p->b = 2;
p->c = 3;
- gp = p;
+ rcu_assign_pointer(gp, p);

rcu_assign_pointer() 會 publish 一個新的結構體，迫使編譯器跟 CPU 在 p 的成員初始化之後才進行 gp = p 。同時 readers 也需要修正來確保正確的執行順序，原本的程式碼看起來沒什麼問題，但如果在 DEC Alpha CPU 上，編譯器可能進行的優化會使 CPU 先 fetch p->a, p->b, p->c 之後才 fetch p 的值，稱為 value-speculation compiler optimizations ，因此我們需要將程式碼改為以下

+ rcu_read_lock();
- p = gp;
+ p = rcu_dereference(gp);
if (p != NULL) {
    do_something_with(p->a, p->b, p->c);
}
+ rcu_read_unlock();

rcu_dereference() 可以想成是對某個指標的值做 subcribe ，確保接下來的 dereference 操作都能讀取到對應的 publisher ( rcu_assign_pointer() ) 所賦予的值。

在 Linux kernel 當中 rcu_assign_pointer(), rcu_dereference() 都會被包在更 high-level 的函式當中使用，而不會直接拿來使用，以 linked list 來當例子。

publisher

struct foo {
    struct list_head list;
    int a;
    int b;
    int c;
};
LIST_HEAD(head);

/* ... */

p = kmalloc(sizeof(*p), GFP_KERNEL);
p->a = 1;
p->b = 2;
p->c = 3;
list_add_rcu(&p->list, &head);

list_add_rcu() 提供某種類似鎖的機制以避免同時對 linked list 進行 list_add() 操作，但不會阻擋同時多個 RCU readers 。

subscriber

rcu_read_lock();
list_for_each_entry_rcu(p, head, list) {
    do_something_with(p->a, p->b, p->c);
}
rcu_read_unlock();

list_add_rcu() 和 list_for_each_entry_rcu() 是對應的 publiser-subscriber 關係，保證看得見對方的作用。

Wait for Pre-Existing RCU Readers to Complete

等待事件完成的方式有很多種，包括 reference count 、 reader-writer locks 、 events 等等， RCU 最大的優點是它在等待數萬個不同事件完成的情況下可以不用追蹤個別狀態，如此一來可以避免掉很多問題。

在 RCU 當中等待完成的事情叫做 "RCU read-side critical sections" 。開頭是 rcu_read_lock() 結尾是 rcu_read_unlock() ，撇除 SRCU 這個特別的情況，在 RCU read-side critical section 當中可以包含任何程式碼，但就是不能包含 block 或 sleep 。只要遵守這個條件，我們可以利用 RCU 來等待任何我們想要執行的程式。

RCU 等待 readers 的方式可以簡述為下

對物件作出變更，例如替換 linked list 當中的某個節點
等待所有存在的 RCU read-side critical section 結束，例如利用 synchronize_rcu() ，其中一個重點是後續出現的 RCU read-side critical section 無法取得被移除的節點。
清除，例如將被移除的節點釋放

synchronize_rcu() 實作上的概念很特別，由於 RCU read-side critical sections 是由 rcu_read_lock(), rcu_read_unlock() 包圍著的，但在 non-CONFIG_PREEMPT kernel 當中這兩個韓式甚至不會產生任何程式碼，那 synchronize_rcu() 是怎麼知道所有 RCU read-side critical section 都完成了呢？

由於 rcu_read_lock(), rcu_read_unlock() 之間不允許任何導致 blocking 或 sleeping 的程式碼，因此我們知道當某個 CPU 執行 context switch 時，前面所有 RCU read-side critical section 都已經完成了，所以若每個 CPU 都執行至少一次 context switch ，則代表之前所有 RCU read-side critical section 都完成了，此時 synchronize_rcu() 也就可以回傳成功。可以簡化為以下 pseudo code

for_each_online_cpu(cpu)
    run_on(cpu);

但這個方法僅限於能在 RCU read-side critical section 將 preemption 機制取消的 kernel 例如 non-CONFIG_PREEMPT kernels 和 CONFIG_PREEMPT kernels ，但在 CONFIG_PREEMPT_RT realtime kernels 當中就不能這麼做。 realtime RCU 採取的是 reference count 來實作 synchronize_rcu() 。

Maintain Multiple Versions of Recently Updated Objects

Deletion

由於 RCU 允許在物件被更新的同時被許多 readers 進行讀取，在某些時間當中同一個物件可能有兩個版本，例如以下例子。
假設一開始的鍵結串列如下圖

紅色邊筐代表 readers 可以讀取到這個節點，接著如果執行以下程式碼

p = search(head, key);
if (p != NULL) {
    list_del_rcu(&p->list);
    synchronize_rcu();
    kfree(p);
}

當 list_del_rcu() 被執行後，鍵結串列會變為以下狀態

但由於 readers 和 updaters 並沒有同步， readers 可能在進行節點移除的時候讀到要被移除的節點，看讀取時機，所以 readers 可能看見的鍵結串列有兩個版本

節點移除前
節點移除後

當 synchronize_rcu() 完成時，代表所有之前存在的 RCU read-side critical section 都完成了，也就代表沒有任何 readers 可以再讀取到被移除的節點，到此處鍵結串列也只有一個版本

最後進行釋放

Replacement

讓我們回到鍵結串列一開始的狀態

此時改為執行以下程式







q = kmalloc(sizeof(*p), GFP_KERNEL);
*q = *p;
q->b = 2;
q->c = 3;
list_replace(*p->list, *q->list);
synchronize_rcu();
kfree(p);

當第四行執行完畢後，鍵結串列狀態如下

開始進行 replacement 後，由於不同 readers 可能看到不同版本的鍵結串列，所以會有兩個版本的鍵結串列

synchronize_rcu() 結束後也代表 grace period 結束，所有在 list_replacement_rcu() 之前開始的讀取都完成了，也代表都離開他們的 RCU read-side critical section ，所以也不能繼續持有該指標的 reference ，所以整個鍵結串列又回到一個版本

Grace Periods

試想如果 RCU 機制無法提供多個版本的鍵結串列狀態，或者在 RCU read-side critical section 當中不保證不發生 block 或 sleep ，會發生什麼事呢？

以上圖為例，假設

t_{1}

時間點是 critical section 的開始，也就是有人嘗試修改某個共享變數，而這些 reader 都嘗試讀取該共享變數，在

t_{2}

時間點 critical section 結束，若無法提供多個版本的鍵結串列，也無法提供 mutual exclusion 機制，則 reader 1,2,3 都可能讀取到無效的記憶體內容甚至造成錯誤。

寬限期就是為此存在的，在這段期間系統為提供多個版本的鍵結串列，因此在寬限期開始之前就開始讀取的 reader 會讀到舊值，但不會出錯，他們看見的依然是一個完整不中斷的鍵結串列，在

t_{1}

之後開始讀取的 reader 則可以讀到新的版本，而寬限期的結束則是等這些在寬限期開始之前就進行讀取的 reader 結束後就會隨之結束。

所以 RCU 更在乎的是 reader 是否讀取到完整的鍵結串列，但不一定要是最新版的？
若非常在乎資料是否是最新版本的場景是否就不適用 RCU ？

RCU Core API

rcu_read_lock(void)
讓 reader 通知 reclaimer 它正進入 RCU read-side critical section 。雖然我們不該在 RCU read-side critical section 當中進行任何 blocking 操作，但 CONFIG_PREEMPT_RCU 依舊可以將 RCU read-side critical section 暫停。
rcu_read_unlock(void)
讓 reader 通知 reclaimer 它正離開 RCU read-side critical section 。

文件當中特別提及

Note that anything that disables bottom halves, preemption, or interrupts also enters an RCU read-side critical section. Acquiring a spinlock also enters an RCU read-side critical sections, even for spinlocks that do not disable preemption, as is the case in kernels built with CONFIG_PREEMPT_RT=y. Sleeplocks do not enter RCU read-side critical sections.

取消 bottom halves, preeption 或 interrupts 機制都算是進入 RCU read-side critical section ，或者獲取一個 spinlock 也是，即便該 spinlock 沒有取消 preemption 。但 sleeplocks 不算是進入 RCU read-side critical section 。

synchronize_rcu()
表示 updater 的結束和 reclaimer 的起始點，它會讓該 CPU 阻塞直到所有 pre-existing RCU read-side critical section 都完成才會繼續，但在 synchronize_rcu() 之後才開始的 RCU read-side critical section ，就不需要等它完成。特別注意 pre-existing RCU read-side critical section 都完成時，不代表 synchronize_rcu() 會立即回傳，例如可能存在 scheduling delay ，或者有些 RCU 實作會利用 batch 的方式處理請求，這些都可能導致 synchronize_rcu() 的延遲。
由於 synchronize_rcu() 需要判斷在此之前存在的 readers 是否都完成，因此它的實作方式對於 RCU 而言是關鍵。為了讓 RCU 在 read-intensive 場景中能最大化它的效率， synchronize_rc() 的 overhead 必須要很小。
call_rcu() 是 synchronize_rcu() 的 asynchronous callback ，它會接收一個函式以及對應的參數，而非阻塞著，之後等到進行中的 RCU read-side critical sections 結束後呼叫該函式。這個 callback 在禁止阻塞或者 update-side 的效能非常重要的場景很有用。但 call_rcu() 依舊不該被經常使用，因為通常 synchronize_rcu() 產生的程式碼會比較簡單，而且它自動限制 grace period 被延遲的 update rate 。
rcu_assign_pointer()
此函式實作的形式是利用巨集， updater 利用此巨集將新的值指派給 RCU-protected pointer ，為了目的是要安全的將變數值的改變在 updater 和 reader 間傳送。這是一個 spatial macro ，它不會算出 rvalue 但會提供 compiler directives 和 memory-barrier instructions 給對應的編譯器或 CPU 架構。
rcu_dereference()
同樣用巨集實作，同樣也是 spatial macro ， reader 使用它來存取 RCU-protected pointer ，回傳的值是可以安全進行 dereferenced 的，注意到這個巨集並沒有將該指標給 dereference ，而是保護該指標直到有人要 dereference 它。同時它也會對於不同 CPU 架構產生對應的 memory-barrier 指令。以目前來說只有 Alpha 架構需要 memory-barrier ，其他 CPU 的話都是執行 volatile load ，另外沒有主流 C 編譯器支援 address dependencies ，因此它使用 volatile casts 。

通常我們利用 rcu_dereference() 的場景是用來將一個 RCU-proteced pointer 複製到區域變數當中，之後再 dereference 該變數

p = rcu_dereference(head.next);
return p->data;

可以更化簡為以下

return rcu_dereference(head.next)->data;

如果想要存取該指標當中數個欄位，使用一個區域變數是比較好的方式。重複的呼叫 rcu_dereference() 可讀性很糟糕，另外也不保證回傳的指標是相同的，還會在 Alpha CPUs 上造成沒必要的 overhead 。

特別注意 rcu_dereference() 的回傳值只在其被包覆的 RCU read-side critical section 當中有效，例如以下例子

rcu_read_lock();
p = rcu_dereference(head.next);
rcu_read_unlock();
x = p->address; /* BUG!!! */
rcu_read_lock();
y = p->data;    /* BUG!!! */
rcu_read_unlock();

以下的架構圖呈現了 RCU 當中 reader, updater 和 reclaimer 的關係

從上圖可以觀察到 rcu_read_lock(), rcu_read_unlock(), synchronize_rcu() 和 call_rcu() 之間的時間順序關係，以此決定

synchronize_rcu() 回傳給他們的 callers 的時機
call_rcu() 何時可以被呼叫

有效並高效的 RCU 實作當中大量使用 batching 來減緩相關 API 帶來的 overhead 。 rcu_assign_pointer() 和 rcu_dereference() 透過寫入與讀取 RCU-protected pointer 來達成 spatial changes 的溝通。

在 Linux kernel 當中有至少三種 RCU 使用方式，以上的架構是最常見的一種，在 updater side ， rcu_assign_pointer(), synchronize_rcu() 和 call_rcu() 在三種風格當中使用方式都相同。但在 reader-side 的保護機制則會有所不同

這三種風格主要用在以下三種不同情境

用在普通資料結構的 RCU
用在網路資料結構並且可能受到遠端的 denial-of-service attacks 的 RCU
用在 scheduler 和 interrupt/NMI-handler tasks 的 RCU