Linux 核心設計: 淺談同步機制

充斥歷史痕跡的同步機制

〈UNIX 作業系統 fork/exec 系統呼叫的前世今生〉提及電腦科學家 Melvin Conway 在 1963 年發表論文〈A Multiprocessor System Design〉，正式提出 fork 思想，大約在這個時間點，另一個電腦科學家 Edsger Dijkstra 則以數學定義提出 Semaphore 概念，而早期分時多工作業系統 Compatible Time-Sharing System (CTSS) 則在 1961 年首次進行示範運作，整個作業系統必要的特徵和思維都還在發展，其中包含同步機制。

同步問題陳述:
critical section (以下簡稱 CS): 提供對共享變數之存取的互斥控制，從而確保資料正確。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

CS 解法:

Mutual exclusion (互斥)：任一時間點，只允許一個 process/thread 進入規範的 CS 內活動。
Progress (有進展): 需要同時滿足下面要件：
- 不想進入 CS 的 process/thread 不得阻礙其它 process/thread 進入 CS，也就是說，不得參與進入 CS 的決策過程;
- 必須在有限的時間從想進入 CS 的 process/thread 中，挑選其中一個 process/thread 進入 critical section，一有空位就讓想進 CS 者即刻進入;
Bound waiting (有限的等待): 自 process/thread 提出進入 CS 的申請到獲准進入 CS 的等待時間是有限的。即若有
$n$ 個 processes/thread 想進入，則任一 process/thread 至多等待
$n - 1$ 次即可進入。

在缺乏專門硬體指令的狀況下，存在多種同步機制演算法: (但其實各自的假設)

現代的電腦硬體都有簡單且明確的指令可達到相同的效果，不用再顧慮上述演算法的 willingness vs. turn，換言之，這些軟體的同步演算法已無實用價值。

Edsger Dijkstra 提出 semaphore 概念時，被廣泛用於 CS 跟 signaling 這兩類問題。

signaling 是 process/threads 之間要通知彼此「某些條件已經改變」的情境：例如 producer-consumer 類型的問題可用兩個 semaphore 來實作，一個代表「有資料可用」，另一個代表「有空間可放東西」。

隨著電腦硬體逐漸提供 atomic 指令後，mutex 或稱為 lock 的機制被列入作業系統的實作考量:

需要進入 CS 時，用 mutex/lock —— 上鎖/解鎖永遠是同一個 thread/process;
需要處理 signalling 時，用 semaphore —— 等待/通知通常是不同的 threads/processes;

簡言之，要搶資源時用 mutex，要互相通知時用 semaphore。

既生瑜，何生亮？

先來看 semaphore 與 mutex 的差異，參照 Michael Barr 的文章 Mutexes and Semaphores Demystified: mutex 與 Semaphore 都用於確保 critical section (以下簡稱 CS) 的正確，讓多個執行單元運作並存取資源時，執行結果不會因為執行單元的時間先後的影響而導致錯誤。

mutex 與 semaphore 的差別在於：

process 使用 mutex 時，process 的運作是持有 mutex，執行 CS 來存取資源，然後釋放 mutex
- Mutex 就像是資源的一把鎖：解鈴還須繫鈴人
process 使用 semaphore 時，process 總是發出信號 (signal)，或者總是接收信號 (wait)，同一個 process 不會先後進行 signal 與 wait
- 換言之，process 要不擔任 producer，要不充當 consumer 的角色，不能兩者都是。semaphore 是為了保護 process 的執行同步正確;

mutex 與 semaphore 兩者設計來解決不同的問題。區分 mutex 與 binary semaphore：

mutex 確保數個 process 在一個時間點上，只能有一個 process 存取單項資源;
semaphore 讓數個 producer 與數個 consumer 在計數器的基礎上進行合作;

另一個 mutex 與 binary semaphore 的差異在於，嚴格遵守 mutex 的情境中，要額外考慮 priority inversion。因此，mutex 實作中多半採用一些機制來防止 Priority Inversion。Priority Inheritance 是基於 process 持有 mutex 的概念，使得數個不同 priority 的 process，在等待資源時透過 mutex 傳遞 priority，避免 priority inversion 發生。

注意，mutex 與 semaphore 在很多材料有著不同的解釋。Barr 的觀點是回歸 Dijkstra 所提出的數學模型，後者衍生出的 semaphore，則包含 counting semaphore 與 binary semaphore，也才能夠用於保護資源，或是處理 multiple identical resources 的問題。

對照 Taiwan Linux Kernel Hackers 的解說影片淺談 Semaphore 與 Mutex

在科技公司的面試過程中，答覆 Mutex, Semaphore, the difference, and Linux kernel

在 Linux 核心中，起初僅有 semaphore 這個核心物件 (kernel object)，直到 v2.6.16 核心才將 mutex 自 semaphore 實作中抽離。儘管 Mutex 與 Semaphore 兩者都是休眠鎖，但 Linux 核心實作 mutex 時，運用加速技巧，將上鎖區分以下三個步驟：

Fast path: 嘗試使用 atomic operation，減少 counter 數值來獲得鎖
Mid path: 上一步若失敗，嘗試使用特化的 MCS spinlock 等待，再取鎖。
- 當持鎖的 thread 還在執行，且系統不存在更高優先權的任務時，即可推定，持鎖的 thread 很快就會把鎖釋放出來，因此會使用一個特化的 MCS spinlock 等待鎖被釋放。特化的 MCS spinlock 可在被重新排程時，退出 MCS spinlock queue。當走到這步時，就會到 Slow path。
Slow path: 鎖沒有辦法取得，只好把自己休眠。
- 走到這一步，mutex 才會將自己加入 wait-queue 然後休眠，等待有人 unlock 後才會被喚醒。

這樣付出的代價是，mutex 成為 Linux 核心最複雜的 lock，在 x86-64 的環境下需要使用 40 bytes，相較之下，semaphore 只用 24 bytes，這意味著對 CPU cache 的效率和 cache coherence 的衝擊。

回歸基本面

取自 Donald E. Porter 教授的 CSE 506: Operating Systems 教材

Linux kernel synchronization
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Big Kernel Lock: 鎖定整個 Linux 核心的機制，透過 lock_kernel() 與 unlock_kernel() 函式。這是為了支援多處理器，暫時引進的鎖定機制，已在 v2.6.39 徹底清除。
- commit
- Killing the Big Kernel Lock
- BKL 用於保護整個核心，而 spinlock 用於保護非常特定的某一共享資源。行程 (process) 持有 BKL 時允許發生排程。實作機制：在執行schedule 時，schedule 將檢查行程是否擁有 BKL，若有，它將被釋放，以致於其它的行程能夠獲得該 lock，而當輪到該行程執行之際，再讓它重新獲得 BKL。注意：在持有 spinlock 期間不會發生排程。
- 整個核心僅有一個 BKL
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
- coarse-grained vs fine-grained: 粗粒度和細粒度。
- coarse-grained 對於 locking 來說，持有時間可以很長並不斷延長鎖的持有時間，好處在於對 lock 資源管理的責任較小，難點在於 lock 資源維持原有狀態的成本
- fine-grained 對於 locking 來說，持有時間一般是秒級或者毫秒級，好處在於 lock 資源後不必維持原有 lock 的狀態，但這種簡單的設計方法導致資源管理的責任變重
  Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
Atomic increment/decrement (x++ or x--)
– Used for reference counting
– Some variants also return the value x was set to by this instruction (useful if another CPU immediately changes the value)

Compare and swap (CAS)

語意: if (x == y) x = z;

int compareAndSwap(int *loc, int expectedValue, int valueToStore) {
    ATOMIC {
        int val = *loc;
        if (val == expectedValue) {
            *loc = valueToStore;
        }
        return val;
    }
}

廣泛用於許多 lock-free data structures

延伸閱讀: Atomic Instructions, Lock-Free Data Structures

一個 lock-free 的程式需要確保符合這個特徵：執行該程式的所有執行緒至少有一個能夠不被阻擋 (blocked) 而繼續執行
- lock-Free 中的 "Lock" 沒有直接涉及 mutex 和 semaphore 一類 mutual exclusion，而是描述了應用程式受限於特定原因被鎖定的可能，例如可能因為 dead lock, live lock，或者 thread scheduling 致使的 priority inversion 等等
- 下方的程式碼沒有任何 mutex，但卻不符合 lock-free 的屬性：若有兩個執行緒同時執行程式碼，考慮到特定的排程方式，非常可能兩個執行緒同時陷入無窮迴圈，也就是阻擋彼此
```
while (x == 0) {
    x = 1 - x;
}
```
- Lock-Free 程式設計的挑戰不僅來自於其任務本身的複雜性，還要始終著眼於對事物本質的洞察
  - 0 -> 1; 1 - 1 -> 0 -> 1

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

延伸閱讀: Evaluating the Cost of Atomic Operations on
Modern Architectures

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Linux spinlock 的對等 C 語言實作
```
while (0 != atomic_dec(&lock->counter))	{
    do {
        // Pause the CPU until some coherence	
        // traffic : 考慮到 cache line 的效應!!
    } while	(lock->counter <= 0);	
}
```
- 特定 CPU 的cache line 更新同步到不同 CPU 的 cache line 的時間不同，意味著等待 spinlock 的 CPU 中哪個 CPU 先得知 lock->counter 的變化，哪個 CPU 就能優先獲得 spinlock
- What is cache line bouncing? How may a spinlock trigger this frequently?
- 延伸閱讀: 從 CPU cache coherence 談 Linux spinlock 擴展議題

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Test-and-Set Lock
注意: Write Back + Evict (驅逐) Cache Line

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Test-Test-and-Set Lock

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

普通的 spinlock 對待 reader 和 writer 是一視同仁，RW spinlock 給 reader 賦予更高的優先權，有沒有讓 writer 優先的鎖呢？答案就是 seqlock
- 延伸閱讀: Linux內核同步機制之（六）：Seqlock
- Chapter 5. Concurrency and Race Conditions
相關學術研究:

Memory Consistency
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

乍聽之下好似「電腦病毒會讓人類染病」一樣匪夷所思，但實際上，由於電腦由各種裝置組成，其結構強度落差極大，透過放射線影響 DRAM 一類的資料儲存裝置是可能的，而且 Microsoft Windows 核心為此還要做出應對。
在工業控制的領域，尤其是航空，就會面對 bit flips (位元反轉，原本該是 0 的資料，在某種時空狀況，會轉換為 1，反之亦然)，這甚至是多項公共安全危機的原因。
- 學軟體可不是堆積程式碼和呼叫 API，而該從每個環節都充分掌握，畢竟我們需要對廣大的使用者負責
- Cosmic Rays: what is the probability they will affect a program?

Universal Scalability Law (USL)

C (N) = \frac{N}{1 + α (N - 1) + β N (N - 1)}

對於可擴展性 (scalability) 的數學定義，透過給定的

N

（使用者數量，或者是系統實體處理器數量)，計算

C

(系統容量) 來建立模型，其中

α

和

β

參數表示資源爭奪或對於共享資源的等待 (queueing)。

Since there are no topological dependencies, C(N) can model symmetric multiprocessors, multicores, clusters, and GRID architectures. Also, because each of the three terms has a definite physical meaning, they can be employed as a heuristic to determine where to make performance improvements in hardware platforms or software applications.

If α and β are both zero, the system scales perfectly — throughput is proportional to load (or to processors in the system).
If α is slightly positive it indicates that part of the workload is not scalable. Hence the curve plateaus to the right. Another way of thinking about this is that some (shared) resource is approaching 100% utilization.
If in addition β is slightly positive, it implies that some resource is contended: for example, preliminary processing of new jobs steals time from the main task that finishes the jobs.

Neil Gunther 博士在 2008 年於How to Quantify Scalability 一文總結:

The universal scalability law is equivalent to the synchronous queueing bound on throughput in a modified Machine Repairman with state-dependent service times.

β = 0

時，對應到 Amdahl's law:

Amdahl's law for parallel speedup is equivalent to the synchronous queueing bound on throughput in a Machine Repairman model of a multiprocessor.

spinlock

spinlock 是 Linux 中使用非常頻繁的 locking 機制。當 Process A 申請 lock 成功後，Process B 申請鎖就會失敗，但不會引發排程，一如粒子原地自旋 (spin)。

出處: Discovery of Electron Spin

原地轉到 Process A 釋放 lock 為止，由於不會 sleep，也不會引發 schedule 的特性使然，在 interrupt context 中，資料的保護通常採用 spinlock。

出處: Locking in OS Kernels for SMP Systems
While CPU 1 (in process context) is holding the lock, any incoming interrupt requests on CPU 1 are stalled until the lock is released. An interrupt on CPU 2 busy waits on the lock, but it does not deadlock.
After CPU 1 releases the lock, CPU 2 (in interrupt context) obtains it, and CPU 1 (now executing the interrupt handler) waits for CPU 2 to release it.

若多個 process 申請 lock 的時候，當第一個申請 lock 成功的 process 在釋放時，其他 process 即刻處於競爭的關係，因此會造成不公平 (可能會有某個 process 頻繁持有 lock)。現在的 Linux 採用 Ticket lock 排隊機制，也就是說，誰先申請，誰就先得到 lock。

出處: 洽公不再苦等全新叫號系統打造有感服務, 2007

對應到真實世界的案例，是銀行的辦事大廳通常會有多個窗口同步進行，並採用取號排隊。當你去銀行辦理業務時，首先會去取號機器領取號碼牌，上面寫著特定的流水號，然後你就可排隊等待，而且大廳內還會有醒目的顯示器，標注「請 xxx 號到 1 號窗口辦理」字樣，可能也會搭配自動廣播。每辦理完一個顧客的業務，顯示器上面的數字都會遞增。等待的顧客都會對比自己手上寫的編號和顯示器上面是否一致，若一致，即可去前台辦理業務。

試想以下情境：倘若剛開始營業，顧客 A 是本日第一個顧客，去取號機器領取 0 號 (next 數值) 號碼牌，然後看到顯示器上顯示 0 (owner 數值)，顧客 A 就知道現在輪到自己辦理業務。顧客 A 到前台辦理業務 (acquire lock) 的過程中，顧客 B 來了。同樣，顧客 B 去取號機器拿到 1 號 (next 數值) 號碼牌。然後顧客 B 靜候通知，而顧客 A 依舊辦理業務，此時顧客 C 也來了。顧客 C 去取號機器拿到2 號 (next 數值) 號碼牌。顧客 C 也靜靜找個座位等待。終於，顧客 A 的業務辦完了 (release lock)，然後顯示器上面顯示 1（owner 數值），顧客 B 和 C 都比對顯示器上面的數字和自己手中號碼牌的數字是否相等。顧客 B 終於可辦理業務了 (acquire lock)，但顧客 C 依然等待中。直到顧客 B 的業務辦完了 (release lock)，然後顯示器上面顯示2 (owner 數值），顧客 C 終於開始辦理業務 (acquire lock)。一旦顧客 C 的業務處理後 (release lock)，這 3 個顧客都辦完了業務離開了，留下一個銀行櫃檯服務員。最終，顯示器上面顯示 3 (owner 數值)，取號機的下一個排隊號也是 3 號 (next 數值)，無人辦理業務，也就是說 lock 處於釋放狀態。

Linux 針對每個 spinlock 會有兩個計數器:

next
owner

初始值皆為 0，每個 Process 可以存取這兩個計數器。舉例來說，當 Process A 申請 lock 時，會判斷 next 和 owner 的值是否相等。如果相等就代表 lock 可申請成功，否則原地自旋，直到 owner 和 next 的值相等才會退出自旋。假設 Process A 申請 lock 成功，然後 next 加 1，此時 owner 值為 0，next 值為1。Process B 也申請鎖，讀取 next 的值寫入到區域變數 tmp (內含值為 1) 中。由於 next 和 owner 值不相等，因此原地自旋讀取 owner 的值，判斷 owner 和 tmp 是否相等，直到相等退出自旋狀態。由於Process B得到號碼牌next，next的值會往上加 1，變成 2。一旦 Process A 釋放 lock 後，會將 owner 的值加 1，那麼此時 Process B 的 owner 和 tmp 的值都是 1，因此 Process B 獲得鎖。當 Process B 釋放 lock 後，同樣會將 owner 的值加 1。最後 owner 和 next 都等於2，代表沒有 process 持有 lock。next 記錄著目前Process們申請 lock 的總次數，而 owner 是 process 們目前持有過 lock 的總計數值。

以下 Linux 核心程式碼採用 v4.20，硬體架構是 AArch32 (arm) / AArch64 (arm64)。為了行文便利，程式碼做了適度排版和編輯，請一併參照原始程式碼。

首先定義描述 spinlock 的結構體 arch_spinlock_t，檔案 arch/arm/include/asm/spinlock_types.h

typedef struct {
    union {
        unsigned int slock;
        struct __raw_tickets {
            unsigned short owner;
            unsigned short next;
        } tickets;
    };
} arch_spinlock_t;

如上所述，我們需要兩個計數器，分別是 owner 和 next。因為採用 union 宣告，無號整數 slock 佔用的記憶體空間涵蓋 owner 和 next。下面是申請 lock 操作 arch_spin_lock 的實作，檔案 arch/arm/include/asm/spinlock.h (下方第一步跟第二步也算是 Critical Section，否則有可能兩個 Process 在第一步跑完後拿到相同的號碼或者第二步跑完後號碼機應該加兩號，卻只有加一號，這裡的 Critical Section 是用組合語言來實作出來的，而在 Semaphore 的 down 裡面的 Critical Section 則是用 spin_lock 來實現。) 組語解析可以參考 Linux内核同步机制之（一）：原子操作

static inline void arch_spin_lock(arch_spinlock_t *lock) {
    arch_spinlock_t local_lock;
 
    local_lock.slock = lock->slock; /* 1 */
    lock->tickets.next++; /* 2 */
    while (local_lock.tickets.next != local_lock.tickets.owner) { /* 3 */
        wfe(); /* 4 */
        local_lock.tickets.owner = lock->tickets.owner; /* 5 */
    }
}

延伸上面的案例，顧客從取號機取得號碼牌，也就是將取號機的資訊更新到自己手上;
取號機更新下個顧客將要拿到的排隊號;
查閱顯示器，判斷是否輪到自己了;
wfe 是 Arm 架構的 WFE (wait for event) 指令，讓 ARM 核進入低功耗模式。當 process 拿不到 lock 之際，只是原地自旋不如 CPU 進入休眠，可節省能源消耗。但這樣睡去後，該在何時醒來呢？就是只在持有 lock 的 process 釋放時，才醒來判斷是否可以持有 lock。若不能獲得 lock，繼續睡眠即可。對應到真實世界的比喻，就是銀行顧客在大廳先打盹，等到廣播下一位排隊者時，醒來檢查是否輪到自己;
前台已為上一個顧客辦理完成業務，剩下排隊的顧客都要抬頭看一下顯示器是不是輪到自己;

釋放 lock 的操作相比之下就很簡單，以上方銀行辦理業務的例子來說，釋放 lock 的操作僅僅是顯示器上面的排隊號加 1，也就是說，僅需要將owner 計數值遞增 1 即可。arch_spin_unlock 實作如下，檔案 arch/arm/include/asm/spinlock.h

static inline void arch_spin_unlock(arch_spinlock_t *lock) {
    lock->tickets.owner++;
    dsb_sev();
}

dsb_sev() 對應到 Arm 架構的 SEV 指令，可發送訊號給所有處理器核。當 process 無法獲取 lock 的時候，會使用 WFE 指令讓 CPU 睡眠。現在釋放 lock，自然要喚醒所有睡眠的 CPU，醒來檢查自己是否可獲取 lock。

semaphore

semaphore 是 process 間通信處理同步互斥的機制。負責協調各個 process ，保證他們能夠正確、合理地使用共享資源。和 spinlock 最大差異在於：無法獲取 semaphore 的 process 可以 sleep，因此會導致 schedule。

為了記錄可用資源的數量，需要 count 計數器來標記目前可用資源數量，伴隨著另一個描述多少人等待的資料結構，在 Linux 核心中就是一個 doubly-linked list。實作程式碼如下，檔案 include/linux/semaphore.h:

struct semaphore {
    unsigned int count;
    struct list_head wait_list;
};

在 Linux 中，還需要一個結構體記錄目前的 process 的資訊(task_struct)，檔案 kernel/locking/semaphore.c:

struct semaphore_waiter {
    struct list_head list;
    struct task_struct *task;
};

semaphore_waiter 結構體的 list 成員是當 process 無法獲取 semaphore 的時候，加入到 semaphore 的 wait_list 成員。task成員就是記錄後續被喚醒的 process 資訊。

1965 年荷蘭籍電腦科學家 Edsger W. Dijkstra 發展出解決資料同步議題的 semaphore 機制，後者就像是個管理系統資源的計數器，於是 semaphore 有時也稱為 Dijkstra Counters。

在 semaphore 的操作中，semaphore 控制內含的 counter 數值的增加或減少。若 semaphore 內在的 counter 達成 0 之際，任何新進來的 process/thread 會試圖把 counter 減少但必須等待另一個 process/thread 來增加其 counter 使其大於0新進的 process/thread 才能減得下去。於是主要的操作如下：

P = down (記憶為「坐下去/卡位」) = wait / sleep，用來指示出目前的資源不能使用 (因為別的 process/thread 正在使用)
- P 的全名是 "proberen te verlagen"，是荷蘭語，通常翻譯成「試圖減去 1」
V = up (記憶為「還原」) = wakeup / signal，用來指示出目前的資源可供使用 (因為目前沒有別的 process/thread 在使用)
- V 的全名是 "verhogen"，荷蘭語表「加 1」的意思

注意 semaphore 的初始值可以是 0，表示沒有保存下來的 wakeup 操作，也就是只做單次 signal 用。

在 Linux 核心中，採用 down 和 up 來命名這兩項主要操作。
申請 semaphore 的 down 實作如下，檔案 kernel/locking/semaphore.c

void down(struct semaphore *sem) {
    struct semaphore_waiter waiter;
 
    if (sem->count > 0) {
        sem->count--; /* 1 */
        return;
    }
 
    waiter.task = current; /* 2 */
    list_add_tail(&waiter.list, &sem->wait_list); /* 2 */
    schedule(); /* 3 */
}

如果 semaphore 標記的資源還有剩餘，自然可成功獲取 semaphore。只需要遞減 counter;
既然無法獲取 semaphore，就需將目前的 process 掛入 semaphore 的 wait_list;
schedule() 是觸發排程器，主動讓出 CPU 使用權。在讓出之前，需要將目前 process 從 runqueue 上移除;

semaphore.c 的 critical section 可能造成 sleep，因為使用的是 raw_spin_ 而不是 spin_ ，追了一下 __down 和 __down_common 發現最後在 call schedule() 之前，會先 unlock 這個 &sem->lock，可是這樣這邊的呼叫就是不對稱的了。 raw_spin_lock_irqsave -> raw_spin_unlock_irq -> raw_spin_lock_irq -> raw_spin_unlock_irqrestore

static inline int __sched __down_common(struct semaphore *sem, long state,
								long timeout)
{
	...
	raw_spin_unlock_irq(&sem->lock);
	timeout = schedule_timeout(timeout);
	raw_spin_lock_irq(&sem->lock);
	...
}

void down(struct semaphore *sem)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&sem->lock, flags);
	if (likely(sem->count > 0))
		sem->count--;
	else
		__down(sem);
	raw_spin_unlock_irqrestore(&sem->lock, flags);
}

釋放 semaphore 的實作如下，檔案 kernel/locking/semaphore.c

void up(struct semaphore *sem) {
    struct semaphore_waiter waiter;
 
    if (list_empty(&sem->wait_list)) {
        sem->count++; /* 1 */
        return;
    }
 
    waiter = list_first_entry(&sem->wait_list, struct semaphore_waiter, list);
    list_del(&waiter->list); /* 2 */
    wake_up_process(waiter->task); /* 2 */
}

如果 wait_list 沒有 process，那麼只需要增加 counter;
從 wait_list 開頭取出第一個 process，並從 linked list 上移除。然後就是喚醒該 process;

read-write lock (rwlock)

無論是 spinlock 還是 semaphore，在同一時間僅能有一個 process 進入 critical section (CS)。對於有些情況，我們可區分讀寫操作，進而提出最佳化機制，讓 read 操作的 process 可以並行 (concurrent) 運作。對於 write 操作僅限於一個 process 進入 CS。這種同步機制就是 rwlock，一般具有以下幾種性質:

同一時間僅有一個 writer process 進入 CS;
在沒有 writer process 進入 CS 時，同時可有多個 reader process 進入 CS;
reader process 和 writer process 不可以同時進入 CS;

讀寫鎖有兩種:

semaphore 類型 (rw_semaphore);
spinlock 類型;

以下講解 spinlock 類型。

既然我們允許多個 reader 進入 CS，因此我們需要一個計數統計 reader 的方法。同時，由於 writer 永遠只會有一個進入 CS，因此只需要一個 bit 標記是否有 writer process 進入 CS。Linux 核心講究效率，將這兩個計數器合而為一，只用一個 32-bit unsigned int 表示:

最高位 (bit 31): 代表是否有 writer 進入 CS;
低 31 位（bit 30 - bit 0): 統計 reader 個數;

+----+-------------------------------------------------+
| 31 | 30                                            0 |
+----+-------------------------------------------------+
  |                    |
  |                    +----> [0:30] Read Thread Counter
  +-------------------------> [31]  Write Thread Counter

描述 rwlock 只需要一個變數，在 Linux 核心如此定義其結構，檔案 arch/arm/include/asm/spinlock_types.h

typedef struct {
    volatile unsigned int lock;
} arch_rwlock_t;

既然區分讀寫操作，因此會有各自的申請讀寫 lock 函式，檔案 arch/arm/include/asm/spinlock.h

static inline void arch_read_lock(arch_rwlock_t *rw)
{
	unsigned long tmp, tmp2;

	prefetchw(&rw->lock);
	__asm__ __volatile__(
"1:	ldrex	%0, [%2]\n"
"	adds	%0, %0, #1\n"
"	strexpl	%1, %0, [%2]\n"
	WFE("mi")
"	rsbpls	%0, %1, #0\n"
"	bmi	1b"
	: "=&r" (tmp), "=&r" (tmp2)
	: "r" (&rw->lock)
	: "cc");

	smp_mb();
}

組語解析可以參考 Linux内核同步机制之（五）：Read/Write spin lock

上面的 inline assembly 對應等價的 C 程式如下 (有 exclusive monitor):

static inline void arch_read_lock(arch_rwlock_t *rw) {
    do {
        wfe();
        tmp = rw->lock;
        tmp++; /* 1 */
    } while(tmp & (1 << 31)); /* 2 */
    rw->lock = tmp;
}

增加 reader 計數，最後會更新到 rw->lock 中;
更新 rw->lock 前提是沒有 writer，因此這裡會判斷是否有 writer 已進入 CS，也就是檢查 rw->lock 變數的 bit 31。如果已有 writer 進入 CS，就在這裡停駐，並透過 WFE 指令讓 CPU 睡眠，類似之前介紹 spinlock 的實作機制;
上面這段 C Code 應該有問題，在沒有 reader & writer 的情況下，卻會先執行 wfe()，應該要改成下面的流程

static inline void arch_read_lock(arch_rwlock_t *rw) {
    while(rw->lock & (1 << 31)) { /* 2 */
        wfe();
    }
    rw->lock++; /* 1 */
}

當 reader 離開 CS 時會呼叫 read_unlock 釋放 lock，後者的實作機制如下，檔案 arch/arm/include/asm/spinlock.h

static inline void arch_read_unlock(arch_rwlock_t *rw) {
    unsigned long tmp, tmp2;

    smp_mb();

    prefetchw(&rw->lock);
    __asm__ __volatile__(
"1:	ldrex	%0, [%2]\n"
"	sub	%0, %0, #1\n"
"	strex	%1, %0, [%2]\n"
"	teq	%1, #0\n"
"	bne	1b"
    : "=&r" (tmp), "=&r" (tmp2)
    : "r" (&rw->lock)
    : "cc");

    if (tmp == 0)
        dsb_sev();
}

組語解析可以參考 Linux内核同步机制之（五）：Read/Write spin lock

上面的 inline assembly 等價於以下 C 程式碼:

    rw->lock--;

不難發現，實作機制類似 spin_unlock。遞減 reader 計數，然後使用 SEV 指令喚醒所有的 CPU，檢查等待狀態的 process 是否可獲取 lock。

看完 read 操作，繼續看 write 操作的實作方式，檔案 arch/arm/include/asm/spinlock.h

static inline void arch_write_lock(arch_rwlock_t *rw) {
    unsigned long tmp;

    prefetchw(&rw->lock);
    __asm__ __volatile__(
"1:	ldrex	%0, [%1]\n"
"	teq	%0, #0\n"
        WFE("ne")
"	strexeq	%0, %2, [%1]\n"
"	teq	%0, #0\n"
"	bne	1b"
    : "=&r" (tmp)
    : "r" (&rw->lock), "r" (0x80000000)
    : "cc");

    smp_mb();
}

上面的 inline assembly 等價於以下 C 程式:

    while (rw->lock) { /* 1 */
        wfe();
    }
    rw->lock = 1 << 31; /* 2 */

由於 writer 具排他性 (reader 和 writer 不得同時進入 CS)，因此這裡只有 rw->lock 的值為 0 之際，目前的 writer 才可以進入 CS;
設定 rw->lock 的 bit 31，代表有 writer 進入 CS;

當 writer 離開 CS 時，會呼叫 write_unlock 釋放 lock。write_unlock 實作如下，檔案 arch/arm/include/asm/spinlock.h

static inline void arch_write_unlock(arch_rwlock_t *rw) {
    smp_mb();

    __asm__ __volatile__(
    "str	%1, [%0]\n"
    :
    : "r" (&rw->lock), "r" (0)
    : "cc");

    dsb_sev(); /* 2 */
}

上面的 inline assembly 等價於以下 C 程式:

    rw->lock = 0; /* 1 */

同樣由於 writer 具排他性，因此只需要將 rw->lock 指派為 0 即可。代表沒有任何 process 進入 CS。畢竟是因為同一時間只能有一個 writer 進入 CS，當這個 writer 離開 CS 時，必定意味著現在沒有任何 process 進入 CS;
使用 SEV 指令喚醒所有的 CPU，檢查等待狀態的 process 是否可獲取 lock;

Passive Reader-Writer Lock : Scalable Read-mostly Synchronization Using Passive Reader-Writer Locks，上海交通大學陳海波教授領導開發的 prwlock，針對 Linux v3.8 進行修改。

When a running task is going to sleep on a condition, it is removed from Run-queue and inserted into PWake-queue.
The processes check whether the tasks in PWake-queue meet their conditions. If so, they are moved to the Run-queue from the PWake-queue.
A task in Run-queue is selected to execute by scheduler.
If no runnable tasks on the core, the idle cores use monitor & mwait instructions to sleep on a global word.
When a writer finishes its work, it wakens up the idle cores by touch the word.

Queued/MCS Spinlock

取自 Davidlohr Bueso 的 Futex Scaling for Multi-core Systems

延伸閱讀: Much Ado About Blocking: Wait/Wake in the Linux Kernel

lock 對 scalability 衝擊極大

UNSW Advanced Operating System 教材 SMP, Multicore, Memory Ordering & Locking 指出一個血淋淋的教訓

實驗環境:

Linux 2.6.35-rc5 (relatively old, but problems are representative of issues in recent kernels too)
Off-the-shelf 48-core server (AMD)

Exim 是一套由劍橋大學發展的 message transfer agent (MTA)，實驗指出，在處理器核數超過 40 後，throughput 不升反降，而且耗費在核心的時間大幅增加。

透過 oprofile 指出效能瓶頸:

發現 Exim 透過 open 系統呼叫去派送電子郵件時，open 系統呼叫的 Linux 實作花了很多時間在查詢 mount table:

struct vfsmount *lookup_mnt(struct path *path) {
    struct vfsmount *mnt;
    spin_lock(&vfsmount_lock);
    mnt = hash_get(mnts, path);
    spin_unlock(&vfsmount_lock);
    return mnt;
}

但 spin_lock 和 spin_unlock 之間的區域看似很短，會是影響 scalability 的關鍵嗎？

ticket spin locks 是 non-scalable 的實作。

(對照 Page 96-109)

僅是避免 lock collapse 是不夠的!
參照 An Overview of Linux Kernel Lock Improvements, Davidlohr Bueso (2014)
延伸閱讀多核處理器和 spinlock

為了徹底改善問題，我們需要另闢蹊蹺:

Lock-free algorithms
Allow safe concurrent access without excessive serialisation
有很多技巧和途徑，在 Linux 核心中，我們探討以下:
- Sequential locks (seqlock)
- Read-Copy-Update (RCU)

seqlock

The idea is to keep a sequence number that is updated (twice) every time a set of variables is updated, once at the start, and once after the variables are consistent again. While a writer is active (and the data may be inconsistent) the sequence number is odd; while the data is consistent the sequence is even.
The reader grabs a copy of the sequence at the start of its section, spinning if the result is odd. At the end of the section, it rereads the sequence, if it is different from the first read value, the section is repeated.
This is in effect an optimistic multi-reader lock. Writers need to protect against each other, but if there is a single writer (which is often the case) then the spinlocks can be omitted. A writer can delay a reader; readers do not delay writers – there’s no need as in a standard multi-reader lock for writers to delay until all readers are finished.
This is used amongst other places in Linux for protecting the variables containing the current time-of-day.

待整理

spinlock

Linux 核心設計: 淺談同步機制

充斥歷史痕跡的同步機制

既生瑜，何生亮？

回歸基本面

Universal Scalability Law (USL)

spinlock

semaphore

read-write lock (rwlock)

Queued/MCS Spinlock

lock 對 scalability 衝擊極大

seqlock

待整理

Read more

運用 Perf 分析程式效能並改善

Linux 核心專題: 虛擬無線網路裝置驅動程式

資訊科技詞彙翻譯

單一指令處理器 (OISC)