HackMD - Collaborative Markdown Knowledge Base

Arm 工程師 Will Deacon 在 Edinburgh 的演講 Uh-oh; it's I/O ordering 中提到：

初始條件：*x, *y 在記憶體裡，初始值為 0，foo, bar 是 local variable

digraph g {
    rankdir = "RL";
    
    "cpu0" [label= "CPU0 | a: WRITE_ONCE(*x, 1); | b: foo = READ_ONCE(*y);", shape="record"];
   
    "cpu1" [label= "CPU1 | c: WRITE_ONCE(*y, 1); | d: bar = READ_ONCE(*x);", shape="record"];
    cpu1->cpu0 [color=white]
}

a, b, c, d 存在著幾種可能的 交錯執行順序
{a, b, c, d}, {a, c, b, d}, {a, c, d, b}, {c, d, a, b},…

即便是這個順序 {a, b, c, d}，foo 和 bar 的結果有沒有可能會是 foo == bar == 0？

為什麼要探討 reordering？

看起來和 linux kernel 無關，也看不出和 concurrency 有關。

reorder 有兩個層級
- Compiler 對 source code
- CPU 執行 instruction
reordering 後的結果未必是 100% 正確
透過 memory barrier 控制 reordering
linux kernel 對 memory order、memory barrier 有不同程度的支援

為什麼執行順序和 program order 不一樣？

當我們看到執行結果的時候，
對照一下 program order 所預期的結果，為什麼不一樣？

In-order and Out-of-Order

in-order
out-of-order

我們從這兩種順序來看 CPU 執行順序對效能及正確性的影響

in-order

                 1      2         3         4       5        6         7         8
fetch      |  fetch1                              fetch2 
decode     |          decode1                               decode2
execute    |                    execute1                             execute2
write-back |                               write1                                write1

當 processor 是用 in-order 在執行 instruction，
執行 2 個 instruction 需要 8 個 cycle。

out-of-order

                 1      2         3         4        5
fetch      |  fetch1  fetch2 
decode     |          decode1  decode2
execute    |                   execute1  execute2
write-back |                              write1    write1

如果是 out-of-order，執行 2 個 instruction 只需要 5 個 cycle。

在上一個指令的 cycle 完整的執行完之前，下一個指令是有可能已經開始執行了。
execute2 比 write1 還早執行，這樣會有正確的結果嗎？

diy, litmus, armmem

我們用這個工具來作一些實驗

diy、litmus 是由 INRIA (法國國家資訊與自動化研究所) 開發的一套可檢驗 memory model 的工具。
litmus 是用來測試 diy 所產生的 test case。
linux kernel 使用 litmus 在檢驗 memory-ordering model。
armmem 是由 INRIA 和劍橋大學的一套檢驗 memory model 的工具。

為什麼突然講到 litmus 呢？

linux kernel 在檢驗 memory model 的時候，也是需要工具，這些工具包括了 diy, litmus, armmem。

如果從 C/C++ 檢驗 memory model，
還有一個工具 cppmem
cppmem 是由劍橋大學 The Computer Laboratory 所開發，以 web 為介面。

litmus 是在模擬什麼？

reordering 有兩種層次
- Compiler reordering
- CPU reordering
litmus 是在模擬 CPU reordering

怎麼使用 diy, litmus 作實驗？

用 diy 產生 test case

$ cat X.conf 
-arch ARM
-name arm
-nprocs 2
-size 6
-safe Pod**,Fre,Rfe,Wse
-mode critical

$ diy7 -conf X.conf -o .
Generator produced 6 tests

透過修改 X.conf，可以產生 test case 來模擬 ARM 或是 x86

產生了 6 個 test case：arm000.litmus, arm001.litmus, …, arm005.litmus

然後就可以用 litmus 作實驗

$ litmus7 arm000.litmus -o arm000

litmus7 的 output 有一個 arm000.c，我們在這個檔案裡可以看到

/* Full memory barrier */
inline static void mbar(void) {
  asm __volatile__ ("dsb sy" ::: "memory");
}

這裡看到了 memory barrier 和 dsb。
linux kernel 支援什麼 memory barrier？
不同的 CPU 架構對 memory barrier 有什麼不同程度的支援？

探討 memory barrier 之前，我們先看 store buffer

What is store buffer？

在沒有 store buffer 之前，CPU 會直接對 cache 進行 load / store，像以下這張圖這樣。

digraph G {
{ rank = same; cpu0; cpu1 }
{ rank = same; cache0; cache1 }
{ rank = same; interconnect0; interconnect1; interconnect_mid }

cpu0 [shape=square;label="CPU 0"];
cache0 [shape=square;label="Cache"];
cpu0->cache0;

cpu1 [shape=square;label="CPU 1"];
cache1 [shape=square;label="Cache"];
cpu1->cache1;

interconnect0 [shape=square;label="";width=0];

interconnect1 [shape=square;label="";width=0];

interconnect_mid [shape=square;label="";width=0];

cache0 -> interconnect0[arrowhead=none; label="  interconnect "];
cache1 -> interconnect1[arrowhead=none];

interconnect0 -> interconnect_mid[arrowhead=none];

interconnect_mid->interconnect1[arrowhead=none];



memory [shape=square;label="Memory"];

interconnect_mid-> memory;
}

Memory Barriers: a Hardware View for Software Hackers

有了 store buffer

digraph G {
{ rank = same; cpu0; cpu1 }
{ rank = same; cache0; cache1 }
{ rank = same; interconnect0; interconnect1; interconnect_mid }

cpu0 [shape=square;label="CPU 0";fontsize=12];
cache0 [shape=square;label="Cache";fontsize=10];
sb0 [shape=square;label="Store \rBuffer";fontsize=10];

cpu0 -> sb0 -> cache0;
cache0->cpu0;

cpu1 [shape=square;label="CPU 1";fontsize=12];
cache1 [shape=square;label="Cache";fontsize=10];
sb1 [shape=square;label="Store\rBuffer";fontsize=10];

cache1->cpu1;
cpu1 -> sb1 -> cache1;

interconnect0 [shape=square;label="";width=0];

interconnect1 [shape=square;label="";width=0];

interconnect_mid [shape=square;label="";width=0];

cache0 -> interconnect0[arrowhead=none; label="  interconnect "];
cache1 -> interconnect1[arrowhead=none];

interconnect0 -> interconnect_mid[arrowhead=none];

interconnect_mid->interconnect1[arrowhead=none];

memory [shape=square;label="Memory";fontsize=10];

interconnect_mid-> memory;
}

load 會先從 cache 去讀。
store 並不是直接寫入 cache，而是把 store 先放進 store buffer。
用 GraphViz 重畫後，位置和原圖有點不一樣，不過概念是一樣的。

Data Memory Barrier (DMB)

因為是以 Arm 做實驗，所以我們先說明 DMB

Data Memory Barrier (DMB)

This prevents reordering of data accesses instructions across the DMB instruction

It also ensures that any explicit preceding data or unified cache maintenance operations have completed before any subsequent data accesses are executed

以上節錄自 Arm 官方文件

避免 dmb 之後的 instruction 被 reordering
completed 的意思就是 instruction 的 pipeline 已經執行到 write back

這個範例來自於 Professor Peter Sewell
(劍橋大學 Computer Laboratory)

ARM SB+dmbs
"DMBdWR Fre DMBdWR Fre"
Cycle=Fre DMBdWR Fre DMBdWR
{
%x0=x; %y0=y;
%y1=y; %x1=x;
}
 P0            | P1            ;
 MOV R0, #1    | MOV R0, #1    ;
 STR R0, [%x0] | STR R0, [%y1] ;
 DMB           | DMB           ;
 LDR R1, [%y0] | LDR R1, [%x1] ;
exists
(0:R1=0 /\ 1:R1=0)

[SB+dmbs]

我們將分別用 litmus7、armmem 來對 SB+dmbs 作實驗。

litmus 是由 INRIA (法國國家資訊與自動化研究所) 開發的一套可檢驗 memory model 的工具。
linux kernel 使用 litmus 檢驗不同的 memory-ordering model

litmus7 實驗結果

 #START _litmus_P1
	mov x8,#1
	str x8,[x2]
	dmb sy
	ldr x4,[x0]
#START _litmus_P0
	mov x8,#1
	str x8,[x0]
	dmb sy
	ldr x4,[x2]
 
Test SB+dmbs Allowed
Histogram (4 states)
2     *>0:R1=0; 1:R1=0;
490580:>0:R1=1; 1:R1=0;
481247:>0:R1=0; 1:R1=1;
28171 :>0:R1=1; 1:R1=1;
Ok
 
Witnesses
Positive: 2, Negative: 999998
Condition exists (0:R1=0 /\ 1:R1=0) is validated

從第 14 行我們看到 0:R1=0; 1:R1=0，原因之後會說明
dmb 之後的 instruction 沒有被 reordering

ppcmem/armmem 是由 INRIA 和劍橋大學共同開發的一套檢驗 memory model 的工具，以 web 為介面。
Linux kernel 也有使用 ppcmem 在檢驗 memory barrier 和 atomic instruction Validating Memory Barriers and Atomic Instructions

armmem 實驗結果

 Test SB+dmbs Allowed
States 3
0:R1=0; 1:R1=1; via [0;0;0;1;0;0;2;0;3;0;0;0;1;0;4;0;0;0]
0:R1=1; 1:R1=0; via [0;0;0;1;0;0;1;0;3;4;0;0;0;2;0;0;0;0]
0:R1=1; 1:R1=1; via [0;0;0;1;0;0;1;0;2;0;3;0;0;0;4;0;0;0]
No (allowed not found)
Condition
exists (0:R1=0 /\ 1:R1=0)
Hash=631b5d7f1cd13c8b2af0b5ebf9c34dda
Cycle=Fre DMBdWR Fre DMBdWR
Observation SB+dmbs Never 0 3

沒有看到 0:R1=0; 1:R1=0
dmb 之後的 instruction 沒有被 reordering

為什麼 litmus7 和 armmem 實驗的結果不一樣？

因為 litmus7 和 armmem 模擬的層級不一樣。
armmem 並不是真的在一顆 CPU 上執行 instructions，所以，不會再另外模擬沒有 atomic operation 的結果。

暫停一下，目前為止的實驗是要證明什麼事？

在 litmus7 和 armmem 上的實驗，都是在證明 dmb 之後的 instruction 沒有被 reordering
litmus7 和 armmem 的測試結果都是正確的。
litmus7 是真的在 CPU 上執行，可以看到沒有 atomic operation 的結果。

這裡說明了 atomic operation 的必要性。

What is atomic operation？

以 Arm 架構的指令 (ldrex、strex) 探討 atomic

 	__asm__ __volatile__(
"1:	ldrex	%0, [%3]\n"
"	add	%1, %0, %4\n"
"	strex	%2, %1, [%3]\n"
"	teq	%2, #0\n"
"	bne	1b"
	: "=&r" (lockval), "=&r" (newval), "=&r" (tmp)
	: "r" (&lock->slock), "I" (1 << TICKET_SHIFT)
	: "cc");

以上節錄自 spinlock

從這 3 個角度來理解 atomic operation

Mutual exclusion
Dekker's algorithm
Exclusive Monitors

Mutual exclusion

這裡我們探討的 mutual exclusion 是指電腦科學的一個問題 race condition，並不是在探討 mutex，mutex 只是其中一種解法。
以 critical section 來解決 race condition 會有什麼問題？
critical section 可能造成 deadlocks 和 resource starvation。

Mutual exclusion

解法有硬體和軟體兩種解法。
硬體解法包括 atomic operation、busy-wait …。
軟體解法有 spinlock、RCU …。

Dekker's algorithm

Dekker's algorithm is the first known correct solution to the mutual exclusion problem

[wiki]

Dekker's algorithm 有效率低落的問題，現代的處理器已不採用
process 不是 processor

Dekker's algorithm

透過 busy wait 的方式，讓任何時候最多只能有一個 process 進入 critical section
如果已經有一個 process 進入了 critical section，下一個要進入 critical section 的 process 就會 busy wait，直到 P0 離開了 critical section (假設 P0 先進入 critical section)

busy wait 會有什麼問題？

busy wait 對效能是有影響的
busy wait 表示起碼會有一個最低限度的時間是花在等待其他 process 離開 critical section。

Exclusive Monitors

Exclusive monitors

Arm 的 Exclusive monitors 有兩種狀態：
open 和 exclusive。

load-exclusive 會把 exclusive monitor 的狀態改成 exclusive。
store-exclusive 只有當 exclusive monitor 的狀態是 exclusive 的時候 store 才會成功。

load 進行中的時候無法 store

如果存在著一個 exclusive monitor 還沒有 把狀態改成 exclusive，那麼 store 就不會成功。

 ARM SB+dmbs
"DMBdWR Fre DMBdWR Fre"
Cycle=Fre DMBdWR Fre DMBdWR
{
%x0=x; %y0=y;
%y1=y; %x1=x;
}
 P0            | P1            ;
 MOV R0, #1    | MOV R0, #1    ;
 STR R0, [%x0] | STR R0, [%y1] ;
 DMB           | DMB           ;
 LDR R1, [%y0] | LDR R1, [%x1] ;
exists
(0:R1=0 /\ 1:R1=0)

Reference:https://www.cl.cam.ac.uk/~pes20/arm-supplemental/arm040.html#tst@SB+dmbs

當第 10 行的 STR R0, [%x0]、STR R0, [%y1] 都尚未 complete，第 12 行的 LDR R1, [%y0]、LDR R1, [%x1] 卻同時執行，會是什麼結果？
dmb 保證 P0 的 LDR 沒有被 reordering，也保證了 P1 的 LDR 沒有被 reordering，但是少了 atomic，所以仍然有可能發生 P0: R1=0; P1: R1=0

回顧一開始的範例

digraph g {
    rankdir = "RL";
    
    "cpu0" [label= "CPU0 | a: WRITE_ONCE(*x, 1); | b: foo = READ_ONCE(*y);", shape="record"];
   
    "cpu1" [label= "CPU1 | c: WRITE_ONCE(*y, 1); | d: bar = READ_ONCE(*x);", shape="record"];
    cpu1->cpu0 [color=white]
}

WRITE_ONCE、READ_ONCE 是 atomic operation 嗎？

Processor 0	Processor 1
mov [_x], 1 // M1 mov r1, [_y] // M2	mov [_y], 1 // M3 mov r2, [_x] // M4
Initially x == y == 0 r1 == 0 and r2 == 0 is allowed

Processor 0	Processor 1
mov r1,[_x] // M1 mov [_y], 1 // M2	mov r2, [_y] // M3 mov [_x], 1 // M4
Initially x == y == 0 r1 == 1 and r2 == 1 is not allowed

Agenda

Introduction

What is reorder?

為什麼執行順序和 program order 不一樣？

In-order and Out-of-Order

in-order

out-of-order

diy, litmus, armmem

litmus 是在模擬什麼？

怎麼使用 diy, litmus 作實驗？

What is store buffer？

Data Memory Barrier (DMB)

Data Memory Barrier (DMB)

litmus7 實驗結果

armmem 實驗結果

What is atomic operation？

Mutual exclusion

Mutual exclusion

Dekker's algorithm

Dekker's algorithm

Exclusive Monitors

Exclusive monitors

What is memory model？

Memory model

x64

Arm

weakly memory model

How Linux kernel supports concurrency？

Reference

tags: `COSCUP 2019` `linux kernel` `memory order` `memory model` `atomic` `concurrency`

	#START _litmus_P1
	mov x8,#1
	str x8,[x2]
	dmb sy
	ldr x4,[x0]
	#START _litmus_P0
	mov x8,#1
	str x8,[x0]
	dmb sy
	ldr x4,[x2]

	Test SB+dmbs Allowed
	Histogram (4 states)
	2 *>0:R1=0; 1:R1=0;
	490580:>0:R1=1; 1:R1=0;
	481247:>0:R1=0; 1:R1=1;
	28171 :>0:R1=1; 1:R1=1;
	Ok

	Witnesses
	Positive: 2, Negative: 999998
	Condition exists (0:R1=0 /\ 1:R1=0) is validated

	Test SB+dmbs Allowed
	States 3
	0:R1=0; 1:R1=1; via [0;0;0;1;0;0;2;0;3;0;0;0;1;0;4;0;0;0]
	0:R1=1; 1:R1=0; via [0;0;0;1;0;0;1;0;3;4;0;0;0;2;0;0;0;0]
	0:R1=1; 1:R1=1; via [0;0;0;1;0;0;1;0;2;0;3;0;0;0;4;0;0;0]
	No (allowed not found)
	Condition
	exists (0:R1=0 /\ 1:R1=0)
	Hash=631b5d7f1cd13c8b2af0b5ebf9c34dda
	Cycle=Fre DMBdWR Fre DMBdWR
	Observation SB+dmbs Never 0 3

	__asm__ __volatile__(
	"1: ldrex %0, [%3]\n"
	" add %1, %0, %4\n"
	" strex %2, %1, [%3]\n"
	" teq %2, #0\n"
	" bne 1b"
	: "=&r" (lockval), "=&r" (newval), "=&r" (tmp)
	: "r" (&lock->slock), "I" (1 << TICKET_SHIFT)
	: "cc");

	ARM SB+dmbs
	"DMBdWR Fre DMBdWR Fre"
	Cycle=Fre DMBdWR Fre DMBdWR
	{
	%x0=x; %y0=y;
	%y1=y; %x1=x;
	}
	P0 \| P1 ;
	MOV R0, #1 \| MOV R0, #1 ;
	STR R0, [%x0] \| STR R0, [%y1] ;
	DMB \| DMB ;
	LDR R1, [%y0] \| LDR R1, [%x1] ;
	exists
	(0:R1=0 /\ 1:R1=0)