從 Linux Kernel 閱讀 Memory Barrier

contributed by < kevinshieh0225 >

4/12 一對一討論
 memory-barriers document - The Linux Kernel Archives
memory-barriers document - 整理版本

Abstract memory access model

多核的計算機示意圖如下：

        +-------+   :   +--------+   :   +-------+
        |       |   :   |        |   :   |       |
        |       |   :   |        |   :   |       |
        | CPU 1 |<----->| Memory |<----->| CPU 2 |
        |       |   :   |        |   :   |       |
        |       |   :   |        |   :   |       |
        +-------+   :   +--------+   :   +-------+
            ^       :       ^        :       ^
            |       :       |        :       |
            |       :       |        :       |
            |       :       v        :       |
            |       :   +--------+   :       |
            |       :   |        |   :       |
            |       :   |        |   :       |
            +---------->| Device |<----------+
                    :   |        |   :
                    :   |        |   :
                    :   +--------+   :
                    :                :

CPU 會和不同裝置溝通並做記憶體存取操作，然而相對 CPU 執行速度，其他裝置的速度是非常慢的，這將導致 CPU 在進行記憶體存取操作時需要停下來等待記憶體存取（stall）。於是 CPU 准許諸多手段以提昇效能、盡可能減少 pipeline stall 的發生，包括 out-of-order execution 、 load speculation, branch prediction

可參考現代處理器設計：原理和關鍵特徵有更深入探討。

以上手段雖然保證單執行序下的結果一致，但當多執行序的情況下，我們無法保證指令執行順序並可能造成對共用記憶體操作結果的不一致，以此為例：

CPU 1		CPU 2
==============================
{ A == 1; B == 2 }
A = 3;		x = B;
B = 4;		y = A;

我們無法保證 CPU 各自執行指令的順序，導致指令執行的順序有

4!

種排列可能，並且會導致四種不同的輸出結果：

x == 2, y == 1
x == 2, y == 3
x == 4, y == 1
x == 4, y == 3

Guarantees

以下列舉 CPU 的可預期行為，可見於 Dependency ordering 的簡報：

Address dependency：On any given CPU, dependent memory accesses will be issued in order.

Q = READ_ONCE(P); D = READ_ONCE(*Q);
 // the CPU will issue the following memory operations:
Q = LOAD P, D = LOAD *Q

Data dependency：Overlapping loads and stores within a particular CPU will appear to be ordered within that CPU.

a = READ_ONCE(*X); WRITE_ONCE(*X, b);
 // the CPU will only issue the following sequence of memory operations:
a = LOAD *X, STORE *X = b


WRITE_ONCE(*X, c); d = READ_ONCE(*X);
 // the CPU will only issue:
STORE *X = c, d = LOAD *X

 (Loads and stores overlap if they are targeted at overlapping pieces of
 memory).

Control depencency：READ-WRITE order for generally ordered by all CPU architectures, but not all cases. Discuss in later section.

q = READ_ONCE(a);
if (q > 41) {
	WRITE_ONCE(b, 1);
}
// CPU in this case will operate inorder.

就算使用 READ_ONCE WRITE_ONCE確保執行順序，也不保證多執行序下記憶體操作的一致性，詳細原因可參考 Memory Barriers: a Hardware View for Software Hackers

DEC alpha 架構處理器在 linux v4.15 把 memory barrior 加入到 READ_ONCE() 當中，而在原文件中稱此為 data dependency barriers。

但也有你不該期待 CPU 的行為假設：

It must not be assumed that the compiler will do what you want with memory references that are not protected by READ_ONCE() and WRITE_ONCE(). Without them, the compiler is within its rights to do all sorts of “creative” transformations
It must not be assumed that independent loads and stores will be issued in the order given.
It must be assumed that overlapping memory accesses may be merged or discarded.

X = *A; Y = *(A + 4);
 // we may get any one of the following sequences:
X = LOAD *A; Y = LOAD *(A + 4);
Y = LOAD *(A + 4); X = LOAD *A;
{X, Y} = LOAD {*A, *(A + 4) };


*A = X; *(A + 4) = Y;
 // we may get any of:
STORE *A = X; STORE *(A + 4) = Y;
STORE *(A + 4) = Y; STORE *A = X;
STORE {*A, *(A + 4) } = {X, Y};

注意，以上保證並不包括 bitfields 的操作，也盡可能不要用 bitfields 在並行同步的演算法中。

以上保證只針對那些 "properly alligned（由記憶體分配對齊的） and sized scalar（大小等價於 char, short, int, long） variables"。

Memory Barriers

由上一章我們得知 CPU 為了提高執行效能會無所不用其極的對 memory access operation 做調整，然而這將導致 CPU 之間執行上的結果一致性問題。

memory barrier 的目的就是保證在此指令前後執行的結果與變化，能保證其他執行單元在對應的 barrier 指令前能夠注意到、接收到 (perceive)。（通常 memory barrier 會成對使用，詳情可看 SMP Barrier Pairing）

memory barrier 並非代表說，在這個 mb 指令以前的記憶體操作都會被執行並同步完畢，mb 只是畫了一條線告訴大家記憶體指令間的關係、使用規則與執行限制。

mb 也並不會廣播這個指令執行的資訊給其他的 CPU ，如果要完整的保護執行一致性需要使用 barrier pair 來達到目的。

（聽起來很抽象，可以參考 Dive deeper into write/read memory barrier 筆記一探 mb 和硬體執行上的操作。)

Explicit Memory Barrier 主要可分為四種基本類型：

Write (or store) memory barriers：保證所有在 wmb 指令前的 STORE operations 會先發生於（happen-before）後面的 STORE。wmb 只規範 stores 的順序，並不影響 loads。
Data dependency barriers：類似 read barrier，詳情另尋
Read (or load) memory barriers：保證所有在 rmb 指令前的 LOAD operations 會先發生於（happen-before）後面的 LOAD 。rmb 只規範 loads 的順序，並不影響 stores。
General memory barriers：保證所有在 mb 指令前的 LOAD,STORE operations 會先發生於（happen-before）後面的 LOAD ,STORE。

另外也有 Implicit Memory Barrier 的類型：

利用 ACQUIRE+RELEASE 來達到 memory barrier 的效果：

ACQUIRE operation：確保在 ACQUIRE 指令後的記憶體操作會在這個指令順序後執行（ACQUIRE 前的記憶體操作仍可能在指令順序後完成）。包括 LOCK operations, smp_load_acquire(), smp_cond_load_acquire()。
RELEASE operation：確保在 RELEASE 指令前的記憶體操作會在這個指令順序前執行（RELEASE 前的記憶體操作仍可能在指令順序前完成）。包括 UNLOCK operations, smp_store_release()。

ACQUIRE+RELEASE 確保在這 critical section 的記憶體操作會在這個 critical section 內完成，但他並不保證其他的記憶體操作執行的時間點，故而無法達到完整 memory barrier 的保證。

使用 memory barrier 的前提是 CPU 執行的程式間彼此會互相影響，如果彼此間沒有影響的話那就沒有必要使用。

Control Dependencies

Control Dependencies 很不容易維持，現代編譯器容易在 branch 操作中做出指令重序，就算使用了 volatile 還是可能重排，而適時需要 memory barrier 的協助。這裡簡述幾種情形，詳細請參考原文。

請注意：
Control dependencies 通常也會和其他 barriers 搭配使用，control dependencies 也可以視作另類的屏障手段。

READ-READ

READ-READ 不保證順序，必須使用 rmb 來限制順序。考慮以下例子如果不使用 rmb ，編譯器可能會先做 branch prediction 認為 q>0 ，並先執行了 b 的讀取。

q = READ_ONCE(a);
if (q) {
	<read barrier>
	p = READ_ONCE(b);
}

READ-WRITE

我們先前提到 control dependencies 保證 READ-WRITE 的順序，但請注意如果少了 READ_ONCE WRITE_ONCE任一，編譯器還是可能會對程式碼進行重排。除了 READ, WRITE 指令可能會被和其他讀寫操作合併最佳化，考慮以下例子，如果編譯器認為 a 總是非零，那麼編譯器會直接除去 if-statement：

q = a;
if (q) {
    WRITE_ONCE(b, 1);
}
// compiler may optimize as
q = a;
b = 1;  /* BUG: Compiler and CPU can both reorder!!! */

所以一定要有 READ_ONCE WRITE_ONCE 缺一不可。

但也不總是保證加了就不會被最佳化，考慮以下例子：

/* If both legs of the “if” statement begin with identical stores
 * to the same variable, then those stores must be ordered
 */
q = READ_ONCE(a);
if (q) {
    barrier();
    WRITE_ONCE(b, 1);
    do_something();
} else {
    barrier();
    WRITE_ONCE(b, 1);
    do_something_else();
}
/* there is no conditional between the load from ‘a’ and the store to ‘b’
 * which means that the CPU is within its rights to reorder them :
 */
q = READ_ONCE(a);
barrier();
WRITE_ONCE(b, 1);  /* BUG: No ordering vs. load from a!!! */
if (q) {
    /* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
    do_something();
} else {
    /* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
    do_something_else();
}
/* instead we need explicit explicit memory barriers
 * for example, smp_store_release()
 */
q = READ_ONCE(a);
if (q) {
    smp_store_release(&b, 1);
    do_something();
} else {
    smp_store_release(&b, 1);
    do_something_else();
}

If both legs of the “if” statement begin with identical stores to the same variable, then those stores must be ordered, either by preceding both of them with smp_mb() or by using smp_store_release() to carry out the stores.

Please note that it is not sufficient to use barrier() at beginning of each leg of the “if” statement because, as shown by the example above, optimizing compilers can destroy the control dependency while respecting the letter of the barrier() law.

SMP Barrier Pairing

在 CPU-CPU 彼此影響的情境需要 memory barrier 確保執行正確性，但注意特定的 memory barrier 務必搭配使用才能發揮功能。

General barrier 因應情境可以搭配所有上述種類的 barrier type。（不包含 multicopy atomicity）

ACQUIRE barrier 多與 RELEASE 搭配，但也可以和其他 barrier 搭配。

Write barrier 可以和 data dependency barrier, control dependency, Read barrier, ACQUIRE, RELEASE, 或是 General barrier 做搭配。

而 data dependency barrier, control dependency, Read barrier 可以和 Write barrier, ACQUIRE, RELEASE, 或是 General barrier 做搭配。

Examples of Memory Barrier Sequences

最後我們透過範例來觀察 Memory Barrier 指令如何影響記憶體操作的順序。

write memory barrier

假設我們有一段指令操作：

CPU 1
=======================
STORE A = 1
STORE B = 2
STORE C = 3
<write barrier>
STORE D = 4
STORE E = 5

由於 write barrier 的緣故，我們要求 wmb 以前的指令（STORE A = 1, B = 2, C = 3）必須在下次 STORE D = 4, E = 5 執行以前做完。於是乎 memory operation order 可以如下圖所示：

+-------+       :      :
|       |       +------+
|       |------>| C=3  |     }     /\
|       |  :    +------+     }-----  \  -----> Events perceptible to
|       |  :    | A=1  |     }        \/       the rest of the system
|       |  :    +------+     }
| CPU 1 |  :    | B=2  |     }
|       |       +------+     }
|       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
|       |       +------+     }        requires all stores prior to the
|       |  :    | E=5  |     }        barrier to be committed before
|       |  :    +------+     }        further stores may take place
|       |------>| D=4  |     }
|       |       +------+
+-------+       :      :
                   |
                   | Sequence in which stores are committed to the
                   | memory system by CPU 1
                   V

Read memory barrier vs. Load Speculation

Load Speculation：CPU 多半會先假設（speculate）load 的值，因為 load 的執行時間較長，且會受 system bus 運輸資料的速度影響，所以 CPU 如果得知未來有 load 的指令，並發現此時沒有其他 load 佔用 system bus，那就會提早執行 load。這使的 load 幾乎可以在執行指令時即刻完成，因為 CPU 早就先把存值載入了。

在此我們放入 read barrier 確保多 CPU 下的正確性，考慮以下：

CPU 1			CPU 2
==============================================
			LOAD B
			DIVIDE		    } Divide instructions generally
			DIVIDE		    } take a long time to perform
			<read barrier>
			LOAD A

如果在過程中 A 都並未發生改變可能執行狀況如下：CPU 在預先得知會執行 LOAD A ，於是預先載入數值，等到準備執行時確定沒有收到 invalidate，再正式使用執行 LOAD A。

                                        :       :       +-------+
                                        +-------+       |       |
                                    --->| B->2  |------>|       |
                                        +-------+       | CPU 2 |
                                        :       :DIVIDE |       |
                                        +-------+       |       |
The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
division speculates on the              +-------+   ~   |       |
LOAD of A                               :       :   ~   |       |
                                        :       :DIVIDE |       |
                                        :       :   ~   |       |
                                        :       :   ~   |       |
                                    rrrrrrrrrrrrrrrr~   |       |
                                        :       :   ~   |       |
Once the divisions are complete -->     :       :   ~-->|       |
the CPU can then perform the            :       :       |       |
LOAD with immediate effect              :       :       +-------+

然而如果 A 被其他 CPU update and invalidate，如果沒有 rmb 的限制，可能因為 invalidate 傳入的比較慢，而 CPU 使用了舊的 speculate load；由於使用 rmb，讓我們能優先確認 invalidate queue 以防出錯：

                                        :       :       +-------+
                                        +-------+       |       |
                                    --->| B->2  |------>|       |
                                        +-------+       | CPU 2 |
                                        :       :DIVIDE |       |
                                        +-------+       |       |
The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
division speculates on the              +-------+   ~   |       |
LOAD of A                               :       :   ~   |       |
                                        :       :DIVIDE |       |
                                        :       :   ~   |       |
                                        :       :   ~   |       |
                                    rrrrrrrrrrrrrrrrr   |       |
                                        +-------+       |       |
The speculation is discarded --->   --->| A->1  |------>|       |
and an updated value is                 +-------+       |       |
retrieved                               :       :       +-------+

Example of SMP barrier pair

write barrier 保證 STORE 的 partial order；read barrier 保證 LOAD 的 partial order，參考以下範例：

CPU 1			CPU 2
==============================================
	{ A = 0, B = 9 }
STORE A=1
<write barrier>
STORE B=2
			LOAD B
			LOAD A

如果只用 write barrier ，CPU 2 可以根據執行的效率改變讀取的執行順序，而使的執行順序一致性失敗：

+-------+       :      :                :       :
|       |       +------+                +-------+
|       |------>| A=1  |------      --->| A->0  |
|       |       +------+      \         +-------+
| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
|       |       +------+        |       +-------+
|       |------>| B=2  |---     |       :       :
|       |       +------+   \    |       :       :       +-------+
+-------+       :      :    \   |       +-------+       |       |
                             ---------->| B->2  |------>|       |
                                |       +-------+       | CPU 2 |
                                |       | A->0  |------>|       |
                                |       +-------+       |       |
                                |       :       :       +-------+
                                 \      :       :
                                  \     +-------+
                                   ---->| A->1  |
                                        +-------+
                                        :       :

今天再增加一個 read barrier：

CPU 1			CPU 2
==============================================
	{ A = 0, B = 9 }
STORE A=1
<write barrier>
STORE B=2
			LOAD B
			<read barrier>
			LOAD A

即可讓 A 的執行結果符合預期：

CPU 1			CPU 2
==============================================
	{ A = 0, B = 9 }
STORE A=1
<write barrier>
STORE B=2
			LOAD B
			<read barrier>
			LOAD A

+-------+       :      :                :       :
|       |       +------+                +-------+
|       |------>| A=1  |------      --->| A->0  |
|       |       +------+      \         +-------+
| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
|       |       +------+        |       +-------+
|       |------>| B=2  |---     |       :       :
|       |       +------+   \    |       :       :       +-------+
+-------+       :      :    \   |       +-------+       |       |
                             ---------->| B->2  |------>|       |
                                |       +-------+       | CPU 2 |
                                |       :       :       |       |
                                |       :       :       |       |
  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
  barrier causes all effects      \     +-------+       |       |
  prior to the storage of B        ---->| A->1  |------>|       |
  to be perceptible to CPU 2            +-------+       |       |
                                        :       :       +-------+

Explicit kernel barriers

最後我們針對 Explicit kernel barriers 的類型做討論：

Compiler Barrier

compiler barrier 只有 general 版的 barrier()，限制編譯器對於指令前後的 memory access ，不會被 reorder 到指令的另一側去，確保 data dependency 的關係。READ_ONCE() 和 WRITE_ONCE() 可被視為較弱的 barrier()。

linux/tools/include/linux/compiler.h

#define barrier() __asm__ __volatile__("" : : : "memory")

#define READ_ONCE(x)					\
({							\
	union { typeof(x) __val; char __c[1]; } __u =	\
		{ .__c = { 0 } };			\
	__read_once_size(&(x), __u.__c, sizeof(x));	\
	__u.__val;					\
})

#define WRITE_ONCE(x, val)				\
({							\
	union { typeof(x) __val; char __c[1]; } __u =	\
		{ .__val = (val) }; 			\
	__write_once_size(&(x), __u.__c, sizeof(x));	\
	__u.__val;					\
})

CPU Memory Barriers

compiler barrier 雖確保 data dependency ，然而無法確保 CPU 讀寫的存值是 up-to-date 的，比如 store buffers/Invalidate queue 改變的快取一致性。這時需要透過 CPU Memory Barrier 來保障讀寫一致性。

TYPE		    MANDATORY		SMP CONDITIONAL
=================================================================
GENERAL		    mb()		smp_mb()
WRITE		    wmb()		smp_wmb()
READ		    rmb()		smp_rmb()
DATA DEPENDENCY		        READ_ONCE()

memory barrier in `kernel/workqueue.c`

Concurrency Managed Workqueue (cmwq)

讓我們從核心程式碼 kernel/workqueue.c 中一探 memory barrier 使用的範例：

workqueue introduction

在討論裡面的 mb 應用前，先來了解一下 workqueue 在做什麼。

sturct workqueue 是任務佇列的結構體，紀錄受安排執行的任務（work item）列表。
struct work_struct 是一個指向任務函式的結構體，包裝我們的任務單位 work item 。

行程透過 alloc_workqueue 來建立任務佇列，並設定這個任務佇列的屬性：

@name 佇列 wq 的名字
@flags wq 分配給 CPU 的方式，比如：
        WQ_UNBOUND: wq 不指定特定 cpu worker
        WQ_HIGHPRI: wq 指定給高效能的 worker_pool
@max_active
        每個 CPU 可以一次執行幾個佇列的 work item。
        比如 16 的話代表一個 CPU 可以一次執行 16 個 work item。

在 kernel/workqueue_internal.h 定義的 struct worker 是執行任務的執行序。每個 cpu 被分配到兩個 worker-pool，一個執行普通優先級的 work item，一個執行高優先級的 work item。

另外也有 dynamic unbound worker-pools 來執行 unbound workqueue 的任務，也就是協助分配這些未指定 cpu worker 的 wq。

`worker_thread`

進到 mb 主題之前，先理解 worker_thread 的執行脈落：

每一種 worker 都執行 worker_thread，透過 goto 來決定此刻 worker_thread 的執行狀態：

goto sleep：如果無分配 workqueue 則休眠
goto woke_up：被分配 workqueue 醒來準備做事
goto recheck：持續執行任務並確認還有沒有剩餘的 workqueue

以 process_scheduled_works 尋訪 worker->scheduled 並執行任務：

static void process_scheduled_works(struct worker *worker)
{
	while (!list_empty(&worker->scheduled)) {
		struct work_struct *work = list_first_entry(&worker->scheduled,
						struct work_struct, entry);
		process_one_work(worker, work);
	}
}

process_one_work 讓 worker_thread 執行 work item 。函式進行流程如下：

// 這是很簡略版本
static void process_one_work(struct worker *worker, struct work_struct *work)
1. 確認執行在對的 CPU 上
2. 確認一個 work item 只被一個 worker thread 執行（如果有其他人在執行就先 return）
3. worker 宣告我現在要執行 work：
        改變 work, worker_pool 的狀態
4. 執行 work
5. 釋放 worker 的狀態

在步驟 3. 時改變了 work item 內 PENDING 的屬性，代表這個 work item 準備要被執行了，不再是等待執行的狀態。

這時我們看到了 memory barrier 的使用。

`set_work_pool_and_clear_pending`

這個函式要更改 work item 的狀態成正在執行，而不再是等待執行的狀態，程式碼如下：

static void set_work_pool_and_clear_pending(struct work_struct *work,
					    int pool_id)
{
	/*
	 * The following wmb is paired with the implied mb in
	 * test_and_set_bit(PENDING) and ensures all updates to @work made
	 * here are visible to and precede any updates by the next PENDING
	 * owner.
	 */
	smp_wmb();
	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
	/*
	 * The following mb guarantees that previous clear of a PENDING bit
	 * will not be reordered with any speculative LOADS or STORES from
	 * work->current_func, which is executed afterwards.  This possible
	 * reordering can lead to a missed execution on attempt to queue
	 * the same @work.  E.g. consider this case:
	 *
	 *   CPU#0                         CPU#1
	 *   ----------------------------  --------------------------------
	 *
	 * 1  STORE event_indicated
	 * 2  queue_work_on() {
	 * 3    test_and_set_bit(PENDING)
	 * 4 }                             set_..._and_clear_pending() {
	 * 5                                 set_work_data() # clear bit
	 * 6                                 smp_mb()
	 * 7                               work->current_func() {
	 * 8				      LOAD event_indicated
	 *				   }
	 *
	 * Without an explicit full barrier speculative LOAD on line 8 can
	 * be executed before CPU#0 does STORE on line 1.  If that happens,
	 * CPU#0 observes the PENDING bit is still set and new execution of
	 * a @work is not queued in a hope, that CPU#1 will eventually
	 * finish the queued @work.  Meanwhile CPU#1 does not see
	 * event_indicated is set, because speculative LOAD was executed
	 * before actual STORE.
	 */
	smp_mb();
}

我覺的他的註解範例有問題，後面會提出我的理解。

set_work_pool_and_clear_pending 就這三行指令：

static void set_work_pool_and_clear_pending(struct work_struct *work,
					    int pool_id)
{
    smp_wmb();          // 確保 STORE 的執行順序
    set_work_data();    // 改變 work item 工作狀態
    smp_mb();           // 確保 STORE 和 LOAD 的執行順序
                        // 這裡的 LOAD 特別針對緊接的 "4.執行 work" 做屏障
}

這個過程確保了：

STORE 的順序務必遵循 before STORE -> set_work_data() -> after STORE
在 set_work_data() 做完後，預期後面執行的 LOAD 都必須確保指令過後的狀態是否有更動（不能直接使用 specultative LOAD 的結果，記得確認 invalidate queue）。

對應到上面的範例，會發現因為 smp_mb() 確保了 LOAD 的順序，不會因為 speculative LOAD 而使用了舊資訊。。所以原本的註解範例應該改成：

    /* consider this case:
     * 
     *   CPU#0                         CPU#1
     *   ----------------------------  --------------------------------
     *
     * 1  STORE event_indicated
     * 2  queue_work_on() {
     * 3    test_and_set_bit(PENDING)
     * 4  }                            set_..._and_clear_pending() {
+    * 5                                 smp_wmb()
     * 6                                 set_work_data() # clear bit
-    * 6                                 smp_mb()
     * 7                               }
     * 8                               work->current_func() {
     * 9				   LOAD event_indicated
     *                                 }
     *
     * Without an explicit full barrier speculative LOAD on line 8 can
     * be executed before CPU#0 does STORE on line 1.  If that happens,
     * CPU#0 observes the PENDING bit is still set and new execution of
     * a @work is not queued in a hope, that CPU#1 will eventually
     * finish the queued @work.  Meanwhile CPU#1 does not see
     * event_indicated is set, because speculative LOAD was executed
     * before actual STORE.
     */

後來再想想，也不是他寫錯，只是他想表達的方式和邏輯和我不同而已。