# Lock-Free Programming Topics
:::danger
This page is no longer maintained. Please visit the new page: ==[並行程式設計: Lock-Free Programming](https://hackmd.io/@sysprog/concurrency-lockfree)==.
:::
## Explanatory Videos and Articles
* video: [Introduction to Lock-free Programming](https://www.youtube.com/watch?v=RWCadBJ6wTk)
* Self-check: rewrite the code demonstrated in the talk using C11 atomics
* What is the ABA problem? How can it be avoided?
* video: [Lock-Free Programming (or, Juggling Razor Blades), Part I](https://www.youtube.com/watch?v=c1gO9aB9nbs) / [Lock-Free Programming (or, Juggling Razor Blades), Part II](https://www.youtube.com/watch?v=CmxkPChOcvw)
* Mike Acton's [Recent sketches on concurrency, data design and performance](https://cellperformance.beyond3d.com/articles/2009/08/roundup-recent-sketches-on-concurrency-data-design-and-performance.html)
## Fork Join Model
* A typical parallel program consists of the following three phases:
1. fork: when execution is about to enter the region to be parallelized, the master thread (red in the figure below) spawns a team of threads (red plus yellow)
> Note: this "fork" is not the `fork(2)` system call
![](https://i.imgur.com/3cHajBu.png)
2. parallel region: the team of threads works together on the given task
![](https://i.imgur.com/FcOmLLn.png)
3. join: once the work is done, execution returns to the single-threaded state it was in before the fork
![](https://i.imgur.com/M3I3AFO.png)
* The fork-join model simply repeats the cycle above; a minimal pthread sketch of one cycle appears at the end of this section
![](https://i.imgur.com/UtpTNxC.png)
* From the memory point of view: each thread has its own stack, but the text, heap, and data sections are shared
[source](https://www.tutorialspoint.com/operating_system/os_multi_threading.htm)
![](https://www.tutorialspoint.com/operating_system/images/thread_processes.jpg)
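The figures above describe the model abstractly; below is a minimal pthread sketch of one fork/parallel-region/join cycle (the names `worker`, `NTHREADS`, and `data` are mine, not from the source material).
```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000

static int data[N];

/* parallel region: each thread of the team works on its own slice */
static void *worker(void *arg)
{
    long id = (long) arg;
    for (long i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
        data[i] = (int) (i * i);
    return NULL;
}

int main(void)
{
    pthread_t team[NTHREADS];

    /* fork: the master thread spawns the team of threads */
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&team[t], NULL, worker, (void *) t);

    /* join: the master waits for every worker, returning to a single thread */
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(team[t], NULL);

    printf("data[%d] = %d\n", N - 1, data[N - 1]);
    return 0;
}
```
Build with `-pthread`; the same cycle can be repeated any number of times, which is what the fork-join model describes.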
## A Closer Look at Synchronization
To achieve concurrency, two steps are needed: splitting the program into concurrent parts, and synchronizing those parts.
* Race Condition
* Let count = 0 be a variable shared by Process A and Process B.
* If Process A and B execute correctly, count should still be 0 at the end.
* But if the scheduler interleaves them in the order below (T1 -> T2 -> ...), count ends up as -1 (a C11 fix using atomics is sketched right after the listing):
```cpp
/* Process A: count++ */
T1: lw $t1, &count
T3: addi $t1, $t1, 1
T4: sw $t1, &count
/* Process B: count-- */
T2: lw $t1, &count
T5: addi $t1, $t1, -1
T6: sw $t1, &count
```
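A minimal sketch of how C11 atomics remove this race (my example, not part of the original notes): `atomic_fetch_add`/`atomic_fetch_sub` turn the load/modify/store sequence into one indivisible read-modify-write, so any interleaving of A and B leaves `count` at 0.
```c
#include <stdatomic.h>

atomic_int count = 0;   /* shared by "Process" A and B (threads here) */

void process_a(void) { atomic_fetch_add(&count, 1); }   /* count++ as one atomic RMW */
void process_b(void) { atomic_fetch_sub(&count, 1); }   /* count-- as one atomic RMW */
```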
* [Mutex vs. Semaphores – Part 1: Semaphores](https://blog.feabhas.com/2009/09/mutex-vs-semaphores-%E2%80%93-part-1-semaphores/)
* [Edsger Dijkstra](https://en.wikipedia.org/wiki/Edsger_W._Dijkstra) proposed the binary semaphore in 1965 as an attempt to solve race conditions in concurrent programs. The idea is to mark entering and leaving a critical region through two function calls that access a system resource, the semaphore.
* Risks of using a semaphore
* Accidental release
* This problem arises mainly due to a bug fix, product enhancement or cut-and-paste mistake.
* Recursive deadlock
* A task repeatedly tries to lock a semaphore it has already locked itself; this typically happens in libraries or recursive functions.
> for example, the simple locking of malloc being called twice within the framework of a library. An example of this appeared in the MySQL database bug reporting system: Bug #24745 InnoDB semaphore wait timeout/crash – deadlock waiting for itself
* Task-Death deadlock
* A task holding a semaphore is terminated or hits an error, so the semaphore is never released.
* Priority inversion
* Semaphore as a signal
* **Synchronization** between tasks is where, typically, one task waits to be notified by another task before it can continue execution (unilateral rendezvous). A variant of this is either task may wait, called the bidirectional rendezvous. **Quite different to mutual exclusion, which is a protection mechanism.**
* Using a semaphore for synchronization makes debugging harder and increases the likelihood of "accidental release"-type problems.
* [Mutex vs. Semaphores – Part 2: The Mutex](https://blog.feabhas.com/2009/09/mutex-vs-semaphores-%E2%80%93-part-2-the-mutex/)
* A mechanism that operating system implementations need to provide. "The major use of the term mutex appears to have been driven through the development of the common programming specification for UNIX based systems."
* Although the concepts of a mutex and a binary semaphore are very similar, there is one crucial difference: **the principle of ownership**.
* Ownership addresses the five risks discussed above:
* Accidental release
* An accidental release of a mutex raises an error immediately, because the releasing task is not the current owner.
* Recursive Deadlock
* Thanks to ownership, a mutex can support relocking the same mutex multiple times, as long as it is released the same number of times.
* Priority Inversion
* Priority Inheritance Protocol
* [Priority Ceiling Protocol](https://en.wikipedia.org/wiki/Priority_ceiling_protocol)
* [Slides](https://tams.informatik.uni-hamburg.de/lehre/2016ss/vorlesung/es/doc/cs.lth.se-RTP-F6b.pdf): an introduction to PCP, with two worked examples.
* Death Detection
* If a task dies for any reason, the RTOS checks whether that task owns a mutex; if so, the RTOS notifies every task waiting on that mutex. Depending on the situation of those waiters, there are two ways to proceed:
* All tasks readied with error condition
* All tasks assume the critical region is in an undefined state and must reinitialize it.
* Only one task readied
* Ownership is handed to the readied task, the integrity of the critical region is ensured, and execution continues normally.
* Semaphore as a signal
* A mutex is not used for synchronization (signaling).
* [Mutex vs. Semaphores – Part 3: Mutual Exclusion Problems](https://blog.feabhas.com/2009/10/mutex-vs-semaphores-%E2%80%93-part-3-final-part-mutual-exclusion-problems/)
* Still, some problems cannot be solved with a mutex.
* Circular deadlock
* One task is blocked waiting on a mutex owned by another task. That other task is also blocked waiting on a mutex held by the first task.
* Necessary Conditions
* Mutual Exclusion
* Hold and Wait
* Circular Waiting
* No preemption
* Solution: Priority Ceiling Protocol
* PCP raises a low-priority task to the ceiling priority, so that the necessary condition "Hold and Wait" cannot hold.
* Non-cooperation
* Most important aspect of mutual exclusion covered in these ramblings relies on one founding principle: **we have to rely on all tasks to access critical regions using mutual exclusion primitives**.
* Unfortunately this is dependent on the design of the software and cannot be detected by the run-time system.
* Solution: Monitor
* Not typically supplied by the RTOS
* A monitor simply encapsulates the shared resource and the locking mechanism into a single construct.
* Access to the shared resource is through a controlled interface which cannot be bypassed.
* [Mutexes and Semaphores Demystified](http://www.barrgroup.com/Embedded-Systems/How-To/RTOS-Mutex-Semaphore)
* Myth: Mutexes and Semaphores are Interchangeable
* Reality: Although mutexes and semaphores look very similar in their implementations, they are normally used in different situations (a POSIX sketch contrasting the two follows this list).
* The correct use of a semaphore is for **signaling** from one task to another.
* A mutex is meant to be taken and released, always in that order, by each task that uses the shared resource it protects.
* Another important difference: even when a mutex is used correctly to protect a shared resource, it can still cause a dangerous side effect.
* Any two RTOS tasks of different priorities that access the same mutex can create **priority inversion**.
* Definition: The real trouble arises at run-time, when a medium-priority task preempts a lower-priority task using a shared resource on which the higher-priority task is pending. If the higher-priority task is otherwise ready to run, but a medium-priority task is currently running instead, a priority inversion is said to occur. [[Source]](https://barrgroup.com/Embedded-Systems/How-To/RTOS-Priority-Inversion)
* Fortunately, the risk of priority inversion can be reduced by changing the mutex implementation inside the operating system, at the cost of extra overhead when acquiring and releasing mutexes.
* When semaphores are used for signaling, priority inversion cannot occur, so there is no need to modify the semaphore implementation the way mutexes are modified.
* This is a second important reason for having distinct APIs for these two very different RTOS primitives.
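To make the "signaling vs. mutual exclusion" distinction concrete, here is a hedged POSIX sketch (the names `data_ready`, `buf_lock`, and `shared_buf` are illustrative): the semaphore is used purely to signal from one task to another, while the mutex is always taken and released, in that order, by whichever thread touches the shared resource.
```c
#include <pthread.h>
#include <semaphore.h>

static sem_t data_ready;             /* signaling: unilateral rendezvous */
static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_buf;

static void *producer(void *arg)
{
    pthread_mutex_lock(&buf_lock);   /* mutex: take ... */
    shared_buf = 42;
    pthread_mutex_unlock(&buf_lock); /* ... and release, by the same thread */

    sem_post(&data_ready);           /* semaphore: notify the waiting task */
    return NULL;
}

static void *consumer(void *arg)
{
    sem_wait(&data_ready);           /* block until signaled */

    pthread_mutex_lock(&buf_lock);
    int v = shared_buf;              /* guaranteed to see 42 */
    pthread_mutex_unlock(&buf_lock);
    return (void *) (long) v;
}

int main(void)
{
    pthread_t p, c;
    sem_init(&data_ready, 0, 0);     /* count 0: the consumer must wait first */
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    sem_destroy(&data_ready);
    return 0;
}
```
Build with `-pthread`. Note that the consumer never posts back: the semaphore flows in one direction as a signal, which is exactly what a mutex is not meant to do.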
## Deeper Look: Lock-Free Programming
> A guided reading of [An Introduction to Lock-Free Programming](http://preshing.com/20120612/an-introduction-to-lock-free-programming/) and [Lock-Free 编程](http://www.cnblogs.com/gaochundong/p/lock_free_programming.html)

To avoid the problems that semaphores and mutexes can cause (e.g., deadlock, livelock, priority inversion, ...), techniques from lock-free (lockless) programming can be used to improve both performance and correctness.
Lock-free does not mean that mutex locks are never used; it means that the execution of the program as a whole can never be blocked by any single thread being stuck on a lock.
```cpp
while (x == 0) {
x = 1 - x;
}
```
* If two threads run this code at the same time, is it possible that neither of them can ever exit the loop?
|x|thread 1|thread 2|
|-|-|-|
|0|while(x == 0)||
|0||while(x == 0)|
|1|x = 1 - x||
|0||x = 1 - x|
|0|while(x == 0)||
|0||while(x == 0)|
||...|...|
----
### A. lock-free vs wait-free
Reference: [Non-blocking algorithm](https://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom)
* Imagine this situation: with a mutex, if the thread holding the lock is preempted by the system before it releases the lock, every other thread that needs the same resource cannot proceed until the original thread is resumed and finishes.
* If the system kills that thread, the whole program can never make progress again.
* **Wait-free** is the strongest non-blocking guarantee: every operation is guaranteed to complete within a bounded number of steps, so after enough time every thread is certain to have made progress.
* **Lock-free** is slightly weaker: it allows individual threads to be stalled, but still guarantees that the rest of the program makes progress.
* In the worst case, even if only one thread is running and all the others are waiting, the algorithm still qualifies as lock-free.
* Obstruction-free is the weakest of the non-blocking guarantees, but it is outside the scope of this discussion.
* By definition, every wait-free algorithm is also lock-free.
Therefore, merely avoiding mutex locks (e.g., the busy-waiting spin lock above) does not make a program lock-free; a sketch contrasting the two guarantees follows.
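A hedged sketch of the difference, using C11 atomics (the function names are illustrative): `atomic_fetch_add` finishes in a bounded number of steps for the calling thread, which is wait-free on hardware that implements it as a single instruction, while the CAS retry loop only promises that some thread makes progress whenever an attempt fails, which is the lock-free guarantee.
```c
#include <stdatomic.h>

atomic_int counter = 0;

void increment_wait_free(void)
{
    atomic_fetch_add(&counter, 1);   /* no retry loop for this thread */
}

void increment_lock_free(void)
{
    int old = atomic_load(&counter);
    /* this thread may retry indefinitely, but every failed CAS means some
     * other thread's CAS succeeded, so the program as a whole progresses */
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
        ;   /* on failure, old is reloaded with the current value */
}
```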
----
### B. What lock-free means
Making an entire program lock-free is extremely difficult, so in practice lock-free techniques are applied to specific data structures, e.g., implementing lock-free versions of `push`, `pop`, and `isEmpty` for a lock-free queue.
----
### C. Techniques used in lock-free programming
#### 1. Atomic Operation: Read-Modify-Write
Although such an instruction consists of several actions, the hardware guarantees that it behaves like a single load, store, add, etc.: no other instruction can be interleaved in the middle.
Further reading: [Read-modify-write](https://en.wikipedia.org/wiki/Read-modify-write)
Common implementations include test-and-set, fetch-and-add, and compare-and-swap.
Further reading: [Compare-and-swap](https://en.wikipedia.org/wiki/Compare-and-swap)
```cpp
bool test_and_set(bool *target) // atomic: behaves as one indivisible operation
{
    // non-interruptible sequence: read the old value, then set the flag
    bool b = *target;
    *target = true;
    return b;
}
```
```cpp
loop                              // lock is initialized to false
{
    while (test_and_set(&lock))   // spin (busy-wait) until the previous value was false
        ;
    // critical section
    lock = false;                 // release the lock
}
```
APIs provided by various platforms:
* Win32: `_InterlockedIncrement`
* iOS: `OSAtomicAdd32`
* C++11: `std::atomic<int>::fetch_add`
Note that C++11 atomic operations are not guaranteed to be lock-free; this must be checked with `std::atomic<>::is_lock_free`.
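For reference, a minimal C11 counterpart to the pseudocode above (a sketch of mine, not any particular platform's API): `atomic_flag` is the standard's test-and-set primitive, and its operations are the only atomics the C11 standard guarantees to be lock-free on every implementation.
```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;   /* starts cleared ("false") */

void lock_acquire(void)
{
    /* test-and-set: atomically sets the flag and returns its previous value */
    while (atomic_flag_test_and_set(&lock))
        ;   /* spin (busy-wait) until the previous owner clears the flag */
}

void lock_release(void)
{
    atomic_flag_clear(&lock);                 /* lock = false */
}
```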
#### 2. Compare-And-Swap Loops
The most common RMW pattern is to use CAS in a loop.
Win32 provides CAS as `_InterlockedCompareExchange`:
```cpp
void LockFreeQueue::push(Node* newHead)
{
for (;;) {
// Copy a shared variable (m_Head) to a local.
Node* oldHead = m_Head;
// Do some speculative work, not yet visible to other threads.
newHead->next = oldHead;
// Next, attempt to publish our changes to the shared variable.
// If the shared variable hasn't changed, the CAS succeeds and we return.
// Otherwise, repeat.
if (_InterlockedCompareExchange(&m_Head, newHead, oldHead) == oldHead)
return;
}
}
```
* Why can the example above be called lock-free? Because whenever one thread's loop fails to get past `_InterlockedCompareExchange`, it means another thread succeeded, so the program as a whole has made progress. (A C11 rewrite is sketched after this list.)
* When writing CAS loops, beware of the [ABA Problem](https://en.wikipedia.org/wiki/ABA_problem):
if two threads operate on the same memory location and the thread that finishes first changes the value from A to B and then back to A, the CAS of the slower thread cannot detect the change and proceeds with stale data.
![](https://i.imgur.com/SparxyO.jpg)
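In the spirit of the earlier self-check (rewriting examples with C11 atomics), here is a hedged C11 sketch of the same push loop; the `node`/`head` names are illustrative, not taken from the article.
```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *next;
    int value;
};

static _Atomic(struct node *) head = NULL;

void push(struct node *new_head)
{
    struct node *old_head = atomic_load(&head);
    do {
        /* speculative work, not yet visible to other threads */
        new_head->next = old_head;
        /* try to publish: succeeds only if head is still old_head;
         * on failure, old_head is reloaded with the current head and we retry */
    } while (!atomic_compare_exchange_weak(&head, &old_head, new_head));
}
```
Push by itself is not harmed by ABA (it simply links onto whatever `head` currently is); the classic trouble spot is `pop`, where the `head->next` read in one iteration may be stale by the time the CAS succeeds. A common mitigation is to pack a version counter alongside the pointer.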
#### 3. Sequential consistency
All threads execute the program in program order; simply put, an earlier instruction always executes before a later one.
```shell
$ gcc -S main.c
$ cat main.s
...
store x, 1
load r1, y
...
```
* Sequential consistency guarantees that the following reordering never appears:
```
//in CPU
load r1, y
store x, 1
```
i.e., the CPU never moves the load ahead of the store that precedes it in program order.
Case study: [C11 Lock-free Stack](https://nullprogram.com/blog/2014/09/02/)
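A small sketch (mine, not from the linked article) of how C11 atomics express this: plain `atomic_store()`/`atomic_load()` default to `memory_order_seq_cst`, so neither the compiler nor the CPU may move the load of `y` ahead of the store to `x`.
```c
#include <stdatomic.h>

atomic_int x = 0, y = 0;

int thread_body(void)
{
    atomic_store(&x, 1);     /* store x, 1  (sequentially consistent) */
    return atomic_load(&y);  /* load  r1, y (sequentially consistent) */
}
```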
#### 4. Memory Ordering
Reference: [浅谈Memory Reordering](http://dreamrunner.org/blog/2014/06/28/qian-tan-memory-reordering/)
* When a **lock-free** program runs on a **multi-core** system and the execution environment does **not guarantee sequential consistency**, memory reordering must be taken into account.
![](http://preshing.com/images/marked-example2.png)
![](http://preshing.com/images/reordered.png)
Memory reordering means the actual order of memory operations is not fixed, which may lead to incorrect results.
Example: [Memory Reordering Caught in the Act](http://preshing.com/20120515/memory-reordering-caught-in-the-act/)
Note the conditions under which it matters:
* If sequential consistency is guaranteed, there is nothing to worry about.
* On a **single-core** system, turning off compiler optimization is enough to preserve sequential consistency.
* If the program is **not lock-free**, the instructions in the critical section are protected by the lock that blocks other threads, which also guarantees correct execution.
Reordering can be divided into two kinds:
* compiler reordering
Example: [Memory Ordering at Compile Time](http://preshing.com/20120625/memory-ordering-at-compile-time/)
* processor reordering
In other words, instructions are reordered by optimizations at two different points in time: compile time and run time.
----
### D. Problems arising from lock-free programming
Two have been mentioned above: the ABA problem inside CAS loops, and the memory ordering issue raised while discussing sequential consistency. This part focuses on the latter.
#### 1. Different remedies depending on when the reordering happens
* Compiler Barrier
The second halves of [浅谈Memory Reordering](http://dreamrunner.org/blog/2014/06/28/qian-tan-memory-reordering/) and [Memory Ordering at Compile Time](http://preshing.com/20120625/memory-ordering-at-compile-time/) both walk through concrete exercises; if you skipped them earlier, now is the time to click through and actually read them. (A GCC-style compiler barrier is sketched after this list.)
* Memory Barrier
Illustration: [Memory Barriers Are Like Source Control Operations](http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/)
Definition: [Acquire and Release Semantics](http://preshing.com/20120913/acquire-and-release-semantics/)
There are four kinds of barriers in total, and they can be combined to form acquire and release semantics.
![](http://preshing.com/images/acq-rel-barriers.png)
Quoting:
"**Acquire semantics** prevent memory reordering of the read-acquire with any read or write operation which follows it in program order.
...
**Release semantics** prevent memory reordering of the write-release with any read or write operation which precedes it in program order."
![](http://preshing.com/images/read-acquire.png)
![](http://preshing.com/images/write-release.png)
Acquire and release semantics constrain such reordering so that memory operations must complete before or after specific statements, as shown in the C11 sketch below.
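First, a minimal sketch of a compiler-only barrier (assuming a GCC/Clang toolchain; the articles above use the same idiom): an empty inline `asm` with a `"memory"` clobber forbids the compiler from moving memory accesses across it, but emits no CPU fence instruction.
```c
int a, b;

/* Compiler barrier only: constrains compile-time reordering, not the CPU. */
#define COMPILER_BARRIER() asm volatile("" ::: "memory")

void ordered_stores(void)
{
    a = 1;
    COMPILER_BARRIER();   /* the store to a cannot sink below this point at compile time */
    b = 2;
}
```
And a hedged sketch of acquire/release semantics with C11 atomics (the variable names `payload` and `ready` are illustrative): the release store publishes `payload`, and the acquire load on the other thread is guaranteed to observe it.
```c
#include <stdatomic.h>
#include <stdbool.h>

int payload;                /* ordinary, non-atomic data */
atomic_bool ready = false;  /* flag used to publish the data */

void producer(void)
{
    payload = 42;
    /* write-release: no read/write before this store may be reordered after it */
    atomic_store_explicit(&ready, true, memory_order_release);
}

void consumer(void)
{
    /* read-acquire: no read/write after this load may be reordered before it */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                   /* spin until the flag is set */
    /* payload is guaranteed to be 42 here */
}
```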
#### 2. Memory Models
Further reading: [Weak vs. Strong Memory Models](http://preshing.com/20120930/weak-vs-strong-memory-models/)
How likely memory reordering is depends on two things: the hardware (the processor) and the software (the toolchain).
Different execution environments can be classified as in the table below:
![](https://i.imgur.com/Z0LWi07.png)
Here "weak" means there is a high chance that load/store instructions will be reordered, and "strong" means the opposite.
1. **Weak memory model**: every kind of memory reordering may occur
2. **Weak with data dependency ordering**: most reordering may still occur, but loads with a data dependency between them are guaranteed to stay in order
3. **Usually strong**: acquire and release semantics are enforced, so only store-load reordering remains
4. **Sequentially consistent**: all ordering is guaranteed