Try   HackMD

Linux 核心專題: 記憶體模型

執行人: I-Ying-Tsai
解說影片

Reviewed by eleanorLYJ

關於 "Software managed coherency" 的實作嘗試,你寫到「寫了一個 Kernel Module 搭配 dma 來試著使用它」,但程式碼片段是空白的。請問這次嘗試的具體目的是為了比較軟體管理與硬體管理一致性的效能差異,還是在實驗初期探索不同的記憶體一致性策略?

因為在閱讀 Arm 手冊的過程中我發現了軟體管理已經被遺棄的敘述,於是才想說實做出來比較與硬體管理一致性的效能差異,這與後續的實驗其實沒有關聯,而程式碼片段是空白的原因是因為我還沒將它寫好,日後寫好後才會將它補上。

I-Ying-Tsai

Reviewed by HeLunWu0317

現代處理器廣泛使用弱記憶體模型以提升效能,卻可能在多核間引發難以察覺的 race condiction。

race condiction 拼寫錯誤。

任務簡述

現代處理器廣泛使用弱記憶體模型以提升效能,卻可能在多核間引發難以察覺的 race condiction。為此,Linux Kernel Memory Model (LKMM) 提供一套模型與工具(herd7/litmus7)協助驗證記憶體順序是否違反預期。透過對 litmus 測試的分析,我得以觀察 ARMv8-A 真實硬體是否會違反這些模型,並驗證 barrier 的有效性。

TODO: 研讀 Atomics 操作教材並改進

細讀並行程式設計: Atomics 操作、紀錄認知和提問,務必涵蓋「Memory Ordering 和 Barrier」
利用 litmus7 在 Raspberry Pi 4B 硬體實驗並充分解讀
適度改進筆記內容
搭配閱讀 2024 年報告-12024 年報告-2

利用 litmus7 在 Raspberry Pi 4B 硬體實驗並充分解讀

做實驗之前我需要先了解我做實驗的硬體架構:

i-ying-tsai@raspberrypi:~/herdtools7/catalogue/aarch64/tests $ lscpu
Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Byte Order:             Little Endian
CPU(s):                   4
  On-line CPU(s) list:    0-3
Vendor ID:                ARM
  Model name:             Cortex-A72
    Model:                3
    Thread(s) per core:   1
    Core(s) per cluster:  4
    Socket(s):            -
    Cluster(s):           1
    Stepping:             r0p3
    CPU(s) scaling MHz:   33%
    CPU max MHz:          1800.0000
    CPU min MHz:          600.0000
    BogoMIPS:             108.00
    Flags:                fp asimd evtstrm crc32 cpuid
Caches (sum of all):      
  L1d:                    128 KiB (4 instances)
  L1i:                    192 KiB (4 instances)
  L2:                     1 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-3
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Vulnerable
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Vulnerable
  Srbds:                  Not affected
  Tsx async abort:        Not affected

畫出架構如下:

                      +-----------------------+
                      |        DRAM           |
                      +-----------------------+
                               |
                      +-----------------------+
                      |     L2 Cache (1 MiB)  |
                      |     (Cluster Shared)  |
                      +-----------------------+
                      /      |       |       \
             +---------+ +---------+ +---------+ +---------+
             | Core 0  | | Core 1  | | Core 2  | | Core 3  |
             +---------+ +---------+ +---------+ +---------+
             | L1i 48K | | L1i 48K | | L1i 48K | | L1i 48K |
             | L1d 32K | | L1d 32K | | L1d 32K | | L1d 32K |
             +---------+ +---------+ +---------+ +---------+

首先安裝工具:

我根據 Paul McKenney 在 演講中所使用的簡報 使用以下指令先安裝了 herd7tools :

sudo apt update
sudo apt install opam
opam init
eval $(opam env)
opam update
opam upgrade
opam install herdtools7

接著我先使用了 litmus7 工具分析了 SB.litmus :
SB.litmus :

AArch64 SB
"PodWR Fre PodWR Fre"
Cycle=Fre PodWR Fre PodWR
Generator=diycross7 (version 7.54+01(dev))
Prefetch=0:x=F,0:y=T,1:y=F,1:x=T
Com=Fr Fr
Orig=PodWR Fre PodWR Fre
{
0:X1=x; 0:X3=y;
1:X1=y; 1:X3=x;
}
 P0          | P1          ;
 MOV W0,#1   | MOV W0,#1   ;
 STR W0,[X1] | STR W0,[X1] ;
 LDR W2,[X3] | LDR W2,[X3] ;
exists
(0:X2=0 /\ 1:X2=0)

對它執行 litmus7 ./SB.litmus -carch AArch64 指令後結果如下:

Test SB Allowed
Histogram (3 states)
324   *>0:X2=0; 1:X2=0;
499761:>0:X2=1; 1:X2=0;
499915:>0:X2=0; 1:X2=1;
Ok

Witnesses
Positive: 324, Negative: 999676
Condition exists (0:X2=0 /\ 1:X2=0) is validated
Hash=dca0cdb9995839b737bab4e9b28561fa
Cycle=Fre PodWR Fre PodWR
Generator=diycross7 (version 7.54+01(dev))
Com=Fr Fr
Orig=PodWR Fre PodWR Fre
Observation SB Sometimes 324 999676
Time SB 0.18

避免過多的縮排,參閱課程教材風格予以調整。

在實驗之前,我透過閱讀老師的 Atomics 操作的教材以及借助詢問 GPT 給我的一些破碎的資訊,先得到了一些概念並試著翻閱手冊,思考問題後再進行實驗:

  • STR W0, [X1] : 此指令會將 W0 的值(例如 1)寫入 X1 所指定的記憶體位址(此處代表變數 x),屬於普通(non-exclusive)Store 指令。但在硬體層面它有可能會出錯,首先是 Cache 和主記憶體的信息延遲可能會導致在 P0 更改 X1 後的某段時刻,其他 CPU 看到的 X1 會不一樣。
    • Cache coherence : 翻閱 Arm 關於 Cache coherence 的手冊(DDI0487I.a B2.9.1 Coherent Memory)後,發現手段有三種:

      • Disable caching:這個很直觀,它會把要共享的記憶體區域標記成 non-cacheable。但這會讓DRAM 延遲大幅提高、功耗高、效能降低。也就是會讓 Cache 的優勢消失。在這裡不會被實做。

      • Software managed coherency:
        對於這個方法,Arm 官方是這麼描述的:

        Software managed coherency is the traditional solution to the data sharing problem. Here the software, usually device drivers, must clean or flush dirty data from caches, and invalidate old data to enable sharing with other processors or masters in the system. This takes processor cycles, bus bandwidth, and power.

        於是我寫了一個 Kernel Module 搭配 dma 來試著使用它:

        
        
      • Hardware managed coherency:
        這個是目前大部分的處理器採用的方式,而 ARM Cortex-A72 採用的是基於 MOESI protocol 的 Cache 行為,具體行為如下:

        • Modified (已修改)
        • Owned (擁有者,允許共享但自己負責寫回)
        • Exclusive (獨佔,未修改)
        • Shared (共享)
        • Invalid (無效)

        並基於這個協定讓每個 CPU 使用 Snoop Control Unit (SCU) 監控各個 CPU 的 Cacheline 狀態。(當某 CPU 要讀一條 Cacheline 時,SCU 會 snoop 其他 CPU,看是否有更新版本),接著各 CPU 之間透過 ARM 定義的 AMBA ACE bus (AMBA Coherent Extensions) 傳輸 snoop transaction:

        • AMBA ACE bus :
          定義了 5 個 channel:
          • Read Address Channel
          • Write Address Channel
          • Read Data Channel
          • Write Data Channel
          • Snoop Channel(就是透過這個發出 snoop 的請求)
    • Store Buffering
      ARM TRM 裡提到:

      The Load-Store Unit contains multiple queues, including a store buffer. This enables stores to retire from the pipeline before they are globally visible, improving throughput.

      Q:那 Cortex-A72 pipeline 是如何設計的呢?
      A:被分成這幾個步驟:

      • Instruction Fetch:從 L1 I-Cache 抓取指令,每 cycle 可供應最多 3 條指令。

      • Instruction Decode:參考原文

        The decode unit translates fetched instructions into internal operations that can be dispatched out of order to execution pipelines. It includes branch decode, register rename, and allocation logic.

      • Dispatch / Rename:它會實際地執行 rename ,具體行為如下

        • 讀取 free list 中可用的 physical registers

          • free list : 它會記錄「目前哪些 Physical Registers 是空閒的,可以被新的指令使用」。
            這裡可能是用 bit vector + priority encoder 來實做,因為我沒有在手冊裡翻到對 free list 的實做細節描述,但在閱讀其他開源專案時發現幾乎都是使用了這個實做方式或其變形。
            Ex:Berkeley Out-of-Order Machine

            • 其中 free list 在這裡被定義,使用了 bit vector 的方式:
              ​​​​​​​​​​​​val free_list = RegInit(UInt(numPregs.W), io.initial_allocation)
              
              io.initial_allocation:它會告訴 FreeList:「剛開機時,哪些 Physical Register 預設是空的?」,來紀錄初始能用的 Physical Register。

            Q:為何 AR 和 PR 需要分開設計?初步會直覺地認為使用 PR 就足以對應 AR,實際上需分離設計以支援 speculative execution 與指令重排序。
            A:為了讓 Pipeline 可以支援 Reordering 與 speculative execution。首先因為 PR 的數量比 AR 還多,如果 CPU 內部只有 Architectural Register, Pipeline 中同一
            一個 AR 只能有一份值。但若有兩條指令要對同一個 AR 寫入(WAW):

            ​​​​​​​​​​​​​ADD X1, X1, #1   ; X1 = X1 + 1
            ​​​​​​​​​​​​​ADD X1, X1, #2   ; X1 = X1 + 2
            

            或是 WAR:

            ​​​​​​​​​​​​​I1: R1 = R2 + R3 
            ​​​​​​​​​​​​​I2: R2 = 7 
            

            在第一個範例中,第二條指令必須等到第一條指令將值寫回 AR,才能開始執行,否則會發生 harzard。
            在第二個範例中,I2 可能因為 reordering 而導致先執行,直接把 R2 改成 7。

            • WAW (Write After Write) hazard → 需要序列化。
            • WAR (Write After Read) hazard → 需要 stall 等待。
        • 將每個 architectural register 對應到一個新的 physical register

        • 更新 rename table

        • 配給 Reorder Buffer (ROB) Entry,建立依賴鏈

          • Reorder Buffer (ROB):它允許指令在 pipeline 中亂序執行,但結果只能按程式原順序送回到 register file 或 memory。保證了 No reorder violation externally
          • ROB Entry:每條進入 pipeline 的指令,都會在 ROB 裡分配一個 entry slot。
            他是一個一個環狀隊列(Ring Buffer),保存 pipeline 內每條指令的執行狀態。
        • 將 rename 後的 µops 放進 Issue Queue。

      • Integer Execute:有兩條 ALU pipeline,支援乘累加、除法單元與 branch resolve

      • Load/Store Unit:
        裡面包含了:

        • Address Generation Logic (AGU):計算 Load/Store 的有效位址

        • TLB:處理虛擬位址轉換

          • LSU 包含一個獨立的 Data TLB (DTLB)。
            由 32–64 個 entry 構成,在 Cortex-A72 TRM 手冊裡是這麼描述的:

            The core implements a main data TLB with 32 entries, and a micro TLB with 10 entries.

            以及

            These TLBs are fully associative for fast matching.

            每個 entry 存放:

            • Virtual Page Number
            • Physical Frame Number
            • Access Permission
            • Attribute Bits (cacheable, shareable)

            當 AGU 計算出虛擬地址後,DTLB 快速比對,若 hit,直接給 LSU 繼續執行。若 miss,啟動 MMU Page Table Walker (PTW),這是 ARM 架構硬體提供的 page table,走訪 TTBR。

        • Load Queue (Load Buffer)

          • Load Queue 是一組 FIFO 或 Circular Buffer。每條 Load 進 LSU 後,如果 miss L1 D-cache,會登記在 Load Queue。

            Load Queue 記錄了:

            • Load Address
            • Target Register
            • 是否已從 Store Buffer forwarding
            • 是否完成

            允許同時有多條 Outstanding Load,典型可支持 8–16 條。
            與 Store Buffer 和 Fill Buffer 協同:

            • 如果有匹配 Store,執行 Load-Forwarding
            • 如果 cache hit,直接返回
        • Store Buffer:暫存 retire 後尚未真正寫入 cache 的 Store

        • Fill Buffer:處理 cache miss 時,從 L2 / DRAM 拉回 cache line

        • Coherence Controller:處理 ACE snoop,保證多核心 cache line 一致性

        • Alignment / Merge Logic:處理 unaligned 訪問、write combine

      • Writeback:

      • Retirement:

      ​​​​┌─────────────────────────────┐
      ​​​​│      Instruction Fetch       │
      ​​​​└────────────┬────────────────┘
      ​​​​             ↓
      ​​​​┌─────────────────────────────┐
      ​​​​│       Instruction Decode     │
      ​​​​└────────────┬────────────────┘
      ​​​​             ↓
      ​​​​┌─────────────────────────────┐
      ​​​​│      Issue / Rename Stage    │
      ​​​​└────────────┬────────────────┘
      ​​​​             ↓
      ​​​​┌─────────────────────────────┐
      ​​​​│     Execution Units (ALU)    │
      ​​​​│  +-------------------------+ │
      ​​​​│  | Load-Store Unit (LSU)   | │
      ​​​​│  | ┌─────────────────────┐ | │
      ​​​​│  | │  AGU (Address Gen)  │ | │
      ​​​​│  | ├─────────────────────┤ | │
      ​​​​│  | │  DTLB (Data TLB)    │ | │
      ​​​​│  | ├─────────────────────┤ | │
      ​​​​│  | │  Page Table Walker  │ | │
      ​​​​│  | ├─────────────────────┤ | │
      ​​​​│  | │  Load Queue         │ | │
      ​​​​│  | ├─────────────────────┤ | │
      ​​​​│  | │  Store Buffer       │ | │
      ​​​​│  | ├─────────────────────┤ | │
      ​​​​│  | │  Fill Buffer        │ | │
      ​​​​│  | ├─────────────────────┤ | │
      ​​​​│  | │  Coherence Ctrl (SCU)| │ │
      ​​​​│  | ├─────────────────────┤ | │
      ​​​​│  | │ Alignment/Merge Logic│ │ │
      ​​​​│  | └─────────────────────┘ | │
      ​​​​│  +-------------------------+ │
      ​​​​└────────────┬────────────────┘
      ​​​​             ↓
      ​​​​┌─────────────────────────────┐
      ​​​​│       L1 Data Cache (L1D)    │
      ​​​​└────────────┬────────────────┘
      ​​​​             ↓
      ​​​​┌─────────────────────────────┐
      ​​​​│   L2 Cache (Cluster Shared)  │
      ​​​​└────────────┬────────────────┘
      ​​​​             ↓
      ​​​​┌─────────────────────────────┐
      ​​​​│       AMBA ACE Bus (Snoop)   │
      ​​​​└────────────┬────────────────┘
      ​​​​             ↓
      ​​​​         Main Memory (DRAM)
      
      

TODO: 研讀 LKMM 的演講材料

至少該涵蓋以下演講: (包含錄影)

TODO: 重現 LKMM 實驗

適度調整 paulmckrcu/litmustools/memory-model

接著,我下載了 Linux kernel 的 tools/memory-model 子目錄,接著進入 tools/memory-model/litmus-tests 資料夾,以 herd 驗證搭配 litmus7 來進行實機實驗,並適度修改程式碼進行分析:

directory 是「目錄」,而非「資料夾」,務必使用課程規範的術語書寫。

分析 MP+poonceonces.litmus :

C MP+poonceonces

(*
 * Result: Sometimes
 *
 * Can the counter-intuitive message-passing outcome be prevented with
 * no ordering at all?
 *)

{}

P0(int *buf, int *flag) // Producer
{
	WRITE_ONCE(*buf, 1);
	WRITE_ONCE(*flag, 1);
}

P1(int *buf, int *flag) // Consumer
{
	int r0;
	int r1;

	r0 = READ_ONCE(*flag);
	r1 = READ_ONCE(*buf);
}

exists (1:r0=1 /\ 1:r1=0) (* Bad outcome. *)
  • 這裡使用了 WRITE_ONCEREAD_ONCE ,保證了讀和寫是一個 atomic 操作,compiler 不會將它重排。

根據這段程式碼,理論上它只會有 3 種情況:

  • 1 : r0 = 0, r1 = 0
  • 2 : r0 = 0, r1 = 1
  • 3 : r0 = 1, r1 = 1

然而當我用 herd 進行分析後的執行結果卻是:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ ~/herdtools7/_build/default/herd/herd.exe -conf linux-kernel.cfg litmus-tests/MP+poonceonces.litmus
Test MP+poonceonces Allowed
States 4
1:r0=0; 1:r1=0;
1:r0=0; 1:r1=1;
1:r0=1; 1:r1=0;
1:r0=1; 1:r1=1;
Ok
Witnesses
Positive: 1 Negative: 3
Condition exists (1:r0=1 /\ 1:r1=0)
Observation MP+poonceonces Sometimes 1 3
Time MP+poonceonces 0.03
Hash=c0ebd8ca556580d772ac73303c4c4f84

第四種可能出現了!根據這個執行結果我們幾乎可以 100% 確定 CPU 有可能重排了這段程式碼,於是我是著去試著查看 Arm 手冊 來試著了解 ARMv8-A 架構下 memory model 對 weak ordering 的容忍程度後發現:

  • load-load, load-store, store-store, store-load 都可能被 CPU pipeline 重排。

  • 那什麼情況下可能會被重排呢?查看手冊 D7-1 後發現它寫著 :

    The architecture does not prescribe a particular form for the memory systems.

    這說明了 ARM 不強制特定 microarchitecture(例如是否允許亂序、緩存設計等),而是提供一組行為定義,允許實作有所不同。

    以及第 B2.3 節寫著:

    A DMB ST will prevent two store instructions from being reordered.
    A DMB LD will prevent two load instructions from being reordered.

    也就是說:

    • 未加入 DMB/STL(store barrier),store-store reorder 是允許的。
    • 同時加入 store-release 或 dmb ish st , store-store reorder 將被禁止。

接著我試著使用了 litmus7 分析在 AArch64 架構下的實驗結果:

首先我用以下命令試著利用 herd 工具包裡的 litmus.exeMP+poonceonces.litmus 轉成 .c

mkdir mp.out
litmus7 -o mp.out -carch AArch64 litmus-tests/MP+poonceonces.litmus
cd mp.out
make

但出現了:

gcc -Wall -std=gnu99  -pthread -O2 -c outs.c
gcc -Wall -std=gnu99  -pthread -O2 -c utils.c
gcc -Wall -std=gnu99  -pthread -O2 -c litmus_rand.c
gcc -Wall -std=gnu99  -pthread -S MP+poonceonces.c
MP+poonceonces.c: In function ‘P0’:
MP+poonceonces.c:199:9: warning: implicit declaration of function ‘WRITE_ONCE’ [-Wimplicit-function-declaration]
  199 |         WRITE_ONCE(*buf, 1);
      |         ^~~~~~~~~~
MP+poonceonces.c: In function ‘P1’:
MP+poonceonces.c:227:14: warning: implicit declaration of function ‘READ_ONCE’ [-Wimplicit-function-declaration]
  227 |         r0 = READ_ONCE(*flag);
      |              ^~~~~~~~~
gcc -Wall -std=gnu99  -pthread  -o MP+poonceonces.exe outs.o utils.o litmus_rand.o MP+poonceonces.s
/usr/bin/ld: /tmp/cca8NjdC.o: in function `P0':
MP+poonceonces.c:(.text+0x46c): undefined reference to `WRITE_ONCE'
/usr/bin/ld: MP+poonceonces.c:(.text+0x47c): undefined reference to `WRITE_ONCE'
/usr/bin/ld: /tmp/cca8NjdC.o: in function `P1':
MP+poonceonces.c:(.text+0x56c): undefined reference to `READ_ONCE'
/usr/bin/ld: MP+poonceonces.c:(.text+0x57c): undefined reference to `READ_ONCE'
collect2: error: ld returned 1 exit status
make: *** [Makefile:30: MP+poonceonces.exe] Error 1
rm MP+poonceonces.s

我檢查了 util.c/h 發現是空的,於是我手動將這兩個巨集加進了 MP+poonceonces.c :

#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile typeof(x) *)&(x) = (val))

結果編譯成功:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model/mp_dir $ make
gcc -Wall -std=gnu99  -pthread -S MP+poonceonces.c
gcc -Wall -std=gnu99  -pthread  -o MP+poonceonces.exe outs.o utils.o litmus_rand.o MP+poonceonces.s
awk -f show.awk MP+poonceonces.s > MP+poonceonces.t
rm MP+poonceonces.s

第一次測試

參數是:

#define SIZE_OF_TEST 100000
#define NUMBER_OF_RUN 10

結果是:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model/mp.out $ ./run.sh 
Wed 25 Jun 13:16:54 BST 2025
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Results for litmus-tests/MP+poonceonces.litmus %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
C MP+poonceonces

{
}
P0(int* buf, int* flag) {

	WRITE_ONCE(*buf, 1);
	WRITE_ONCE(*flag, 1);

}

P1(int* buf, int* flag) {

	int r0;
	int r1;

	r0 = READ_ONCE(*flag);
	r1 = READ_ONCE(*buf);

}


exists (1:r0=1 /\ 1:r1=0)
Generated assembler
#START _litmus_P0
// 0 "" 2
	ldr	x0, [sp, 40]
	mov	w1, 1
	str	w1, [x0]
	ldr	x0, [sp, 32]
	mov	w1, 1
	str	w1, [x0]
// 204 "MP+poonceonces.c" 1
#START _litmus_P1
// 0 "" 2
	ldr	x0, [sp, 48]
	ldr	w0, [x0]
	str	w0, [sp, 44]
	ldr	x0, [sp, 56]
	ldr	w0, [x0]
	str	w0, [sp, 40]
// 232 "MP+poonceonces.c" 1
Test MP+poonceonces Allowed
Histogram (3 states)
499996:>1:r0=0; 1:r1=0;
2     :>1:r0=0; 1:r1=1;
500002:>1:r0=1; 1:r1=1;
No

Witnesses
Positive: 0, Negative: 1000000
Condition exists (1:r0=1 /\ 1:r1=0) is NOT validated
Hash=c0ebd8ca556580d772ac73303c4c4f84
Observation MP+poonceonces Never 0 1000000
Time MP+poonceonces 0.28
Revision exported, version 7.58
Command line: litmus7 -o mp.out -carch AArch64 litmus-tests/MP+poonceonces.litmus
Parameters
#define SIZE_OF_TEST 100000
#define NUMBER_OF_RUN 10
#define AVAIL 1
#define STRIDE (-1)
#define MAX_LOOP 0
/* gcc options: -Wall -std=gnu99  -pthread */
/* barrier: user */
/* launch: changing */
/* affinity: none */
/* memory: direct */
/* safer: write */
/* preload: random */
/* speedcheck: no */
/* alloc: dynamic */
GCC=gcc
LITMUSOPTS=
Wed 25 Jun 13:16:55 BST 2025

可以發現在實機測試中未觀察到預測的壞結果。
根據這個已經可以發現:

  • 在 weak memory model 下,指令重排造成的壞結果機率還是偏低。也就是說,對於

但!那依然有可能發生,於是我做了第二次測試。

第二次測試

我調整了參數為:

#define SIZE_OF_TEST 1000000
#define NUMBER_OF_RUN 100

結果為:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model/mp.out $ ./run.sh 
Wed 25 Jun 13:18:31 BST 2025
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Results for litmus-tests/MP+poonceonces.litmus %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
C MP+poonceonces

{
}
P0(int* buf, int* flag) {

	WRITE_ONCE(*buf, 1);
	WRITE_ONCE(*flag, 1);

}

P1(int* buf, int* flag) {

	int r0;
	int r1;

	r0 = READ_ONCE(*flag);
	r1 = READ_ONCE(*buf);

}


exists (1:r0=1 /\ 1:r1=0)
Generated assembler
#START _litmus_P0
// 0 "" 2
	ldr	x0, [sp, 40]
	mov	w1, 1
	str	w1, [x0]
	ldr	x0, [sp, 32]
	mov	w1, 1
	str	w1, [x0]
// 204 "MP+poonceonces.c" 1
#START _litmus_P1
// 0 "" 2
	ldr	x0, [sp, 48]
	ldr	w0, [x0]
	str	w0, [sp, 44]
	ldr	x0, [sp, 56]
	ldr	w0, [x0]
	str	w0, [sp, 40]
// 232 "MP+poonceonces.c" 1
Test MP+poonceonces Allowed
Histogram (4 states)
50000231:>1:r0=0; 1:r1=0;
22    *>1:r0=1; 1:r1=0;
2331  :>1:r0=0; 1:r1=1;
49997416:>1:r0=1; 1:r1=1;
Ok

Witnesses
Positive: 22, Negative: 99999978
Condition exists (1:r0=1 /\ 1:r1=0) is validated
Hash=c0ebd8ca556580d772ac73303c4c4f84
Observation MP+poonceonces Sometimes 22 99999978
Time MP+poonceonces 22.87
Revision exported, version 7.58
Command line: litmus7 -o mp.out -carch AArch64 litmus-tests/MP+poonceonces.litmus
Parameters
#define SIZE_OF_TEST 100000
#define NUMBER_OF_RUN 10
#define AVAIL 1
#define STRIDE (-1)
#define MAX_LOOP 0
/* gcc options: -Wall -std=gnu99  -pthread */
/* barrier: user */
/* launch: changing */
/* affinity: none */
/* memory: direct */
/* safer: write */
/* preload: random */
/* speedcheck: no */
/* alloc: dynamic */
GCC=gcc
LITMUSOPTS=
Wed 25 Jun 13:18:54 BST 2025

這時發現,(r0=1; r1=0)出現了!
這驗證了:

  • 這個架構下,允許此類 memory reordering 發生

接著查看 tools/memory-model/Documentation/explanation.txt 後,裡面是這樣描述的:

smp_store_release() forces the CPU to execute all po-earlier instructions before the store associated with the fence

以及

Any store which propagates to C before a release fence is executed is forced to propagate to C' before the store associated with the release fence does.

這說明了:

  • release fence 除了保證程式碼順序外,還保證跨 CPU 的 propagation 順序

A-cumulative fence 在 LKMM 中的定義如下:

The propagation ordering enforced by release fences and strong fences affects stores from other CPUs that propagate to CPU C before the fence is executed, as well as stores that are executed on C before the fence.
We describe this property by saying that release fences and strong fences are A-cumulative.

也就是說:

  • A-cumulative fences 不只保證「本 CPU」的 store 先於 fence 後的操作。
  • 也會「累積」來自其他 CPU、在 fence 執行前就已經 propagate 到本 CPU 的 store,也一併納入 ordering 保證中。

從硬體層面來看:

  • 在 ARMv8-A 上,smp_store_release() 會編譯為:
    ​​stlr wX, [addr]
    
    • 語義:這是一個 Release Store,保證:
      • 所有在此之前的 memory operations(load 或 store)必須先完成並對其他核心可見。
      • 可避免 store-store、load-store 重排序。
    • 限制:只保證這個 store 本身在前面操作之後發生,不影響之後的 loads。
  • smp_load_acquire() 對應硬體指令:
    ​​​​ldar wX, [addr]
    
    • 語義:這是一個 Acquire Load,保證:
      • 這個 load 在程式中完成後,後續的所有 memory operations(load 或 store)必須在它之後執行。
      • 可避免 load-load、load-store 重排序。
    • 限制:不會影響前面已完成的 store。
  • 而 smp_wmb() 則為:
    ​​dmb ishst
    
    • DMB ISHST 是一種 memory barrier 指令,僅影響 store→store 順序。

    • 語義:Store-Store Barrier,只針對 store → store 順序保證。

      它保證:

      • 所有在它之前的 store 必須完成後,才能執行它之後的 store 指令。
      • 不會阻止 load-load、store-load、load-store 等重排序
  • 而 smp_rmb() 則為:
    ​​dmb ishld
    
    • 語義:Load-Load Barrier,只針對 load → load 順序保證。
  • 而 smp_mb() 則為:
    ​​dmb ish
    
    • 語義:Full Barrier(memory barrier for all types: LD/LD, ST/ST, ST/LD, LD/ST)

於是接著我參考 LKMM 的規範,使用了 release-store semantics 嘗試阻擋這個情況。
也就是接下來進行的測試的分析。

分析 MP+fencewmbonceonce+fencermbonceonce.litmus

C MP+fencewmbonceonce+fencermbonceonce

(*
 * Result: Never
 *
 * This litmus test demonstrates that smp_wmb() and smp_rmb() provide
 * sufficient ordering for the message-passing pattern.  However, it
 * is usually better to use smp_store_release() and smp_load_acquire().
 *)

{}

P0(int *buf, int *flag) // Producer
{
	WRITE_ONCE(*buf, 1);
	smp_wmb();
	WRITE_ONCE(*flag, 1);
}

P1(int *buf, int *flag) // Consumer
{
	int r0;
	int r1;

	r0 = READ_ONCE(*flag);
	smp_rmb();
	r1 = READ_ONCE(*buf);
}

exists (1:r0=1 /\ 1:r1=0) (* Bad outcome. *)

這個 litmus 測試使用以下同步操作:

  • P0WRITE_ONCE(buf)smp_wmb()WRITE_ONCE(flag)
  • P1READ_ONCE(flag)smp_rmb()READ_ONCE(buf)

這相當於在 store 和 load 間各自插入 memory barrier:

  • smp_wmb() 確保 buf=1 不會在 flag=1 之後發生(禁止 store-store reorder)。
  • smp_rmb() 確保 flag==1 讀到之後的讀取不會提前被執行(禁止 load-load reorder)。

首先使用 herd7 工具進行分析,結果如下:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ ~/herdtools7/_build/default/herd/herd.exe -conf linux-kernel.cfg litmus-tests/MP+fencewmbonceonce+fencermbonceonce.litmus 
Test MP+fencewmbonceonce+fencermbonceonce Allowed
States 3
1:r0=0; 1:r1=0;
1:r0=0; 1:r1=1;
1:r0=1; 1:r1=1;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (1:r0=1 /\ 1:r1=0)
Observation MP+fencewmbonceonce+fencermbonceonce Never 0 3
Time MP+fencewmbonceonce+fencermbonceonce 0.04
Hash=7e451c2d514abe8532fb9eb15fcc46c6

可以發現,只出現了三種狀態。
這也驗證了之前的探討。

接著進行實機測試:

使用以下指令轉換成 C code 後:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ mkdir mp_wmb_rmb.out
i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ litmus7 -o mp_wmb_rmb.out -carch AArch64 litmus-tests/MP+fencewmbonceonce+fencermbonceonce.litmus
i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ cd mp_wmb_rmb.out/

一樣先在編譯前先在 MP+pooncerelease+poacquireonce.c 裡補上巨集定義:

#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile typeof(x) *)&(x) = (val))

static inline void smp_wmb(void)
{
    asm volatile("dmb ishst" ::: "memory");
}

static inline void smp_rmb(void)
{
    asm volatile("dmb ishld" ::: "memory");
}

接著 make 後編譯成功:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model/mp_wmb_rmb.out $ make
gcc -Wall -std=gnu99  -pthread -S MP+fencewmbonceonce+fencermbonceonce.c
gcc -Wall -std=gnu99  -pthread  -o MP+fencewmbonceonce+fencermbonceonce.exe outs.o utils.o litmus_rand.o MP+fencewmbonceonce+fencermbonceonce.s
awk -f show.awk MP+fencewmbonceonce+fencermbonceonce.s > MP+fencewmbonceonce+fencermbonceonce.t
rm MP+fencewmbonceonce+fencermbonceonce.s

更改參數如下:

#define SIZE_OF_TEST 1000000
#define NUMBER_OF_RUN 100

執行結果:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model/mp_wmb_rmb.out $ ./run.sh 
Wed 25 Jun 17:50:35 BST 2025
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Results for litmus-tests/MP+fencewmbonceonce+fencermbonceonce.litmus %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
C MP+fencewmbonceonce+fencermbonceonce

{
}
P0(int* buf, int* flag) {

	WRITE_ONCE(*buf, 1);
	smp_wmb();
	WRITE_ONCE(*flag, 1);

}

P1(int* buf, int* flag) {

	int r0;
	int r1;

	r0 = READ_ONCE(*flag);
	smp_rmb();
	r1 = READ_ONCE(*buf);

}


exists (1:r0=1 /\ 1:r1=0)
Generated assembler
#START _litmus_P0
// 0 "" 2
	ldr	x0, [sp, 40]
	mov	w1, 1
	str	w1, [x0]
	bl	smp_wmb
	ldr	x0, [sp, 32]
	mov	w1, 1
	str	w1, [x0]
// 215 "MP+fencewmbonceonce+fencermbonceonce.c" 1
#START _litmus_P1
// 0 "" 2
	ldr	x0, [sp, 48]
	ldr	w0, [x0]
	str	w0, [sp, 44]
	bl	smp_rmb
	ldr	x0, [sp, 56]
	ldr	w0, [x0]
	str	w0, [sp, 40]
// 244 "MP+fencewmbonceonce+fencermbonceonce.c" 1
Test MP+fencewmbonceonce+fencermbonceonce Allowed
Histogram (3 states)
49987086:>1:r0=0; 1:r1=0;
22347 :>1:r0=0; 1:r1=1;
49990567:>1:r0=1; 1:r1=1;
No

Witnesses
Positive: 0, Negative: 100000000
Condition exists (1:r0=1 /\ 1:r1=0) is NOT validated
Hash=7e451c2d514abe8532fb9eb15fcc46c6
Observation MP+fencewmbonceonce+fencermbonceonce Never 0 100000000
Time MP+fencewmbonceonce+fencermbonceonce 25.28
Revision exported, version 7.58
Command line: litmus7 -o mp_wmb_rmb.out -carch AArch64 litmus-tests/MP+fencewmbonceonce+fencermbonceonce.litmus
Parameters
#define SIZE_OF_TEST 100000
#define NUMBER_OF_RUN 10
#define AVAIL 1
#define STRIDE (-1)
#define MAX_LOOP 0
/* gcc options: -Wall -std=gnu99  -pthread */
/* barrier: user */
/* launch: changing */
/* affinity: none */
/* memory: direct */
/* safer: write */
/* preload: random */
/* speedcheck: no */
/* alloc: dynamic */
GCC=gcc
LITMUSOPTS=
Wed 25 Jun 17:51:01 BST 2025

結果顯示在 LKMM 規範下,使用 smp_wmb() 與 smp_rmb() 可以有效地防止 message-passing 模式中的重排序問題。

測試加入語意更強的 smp_mb() 的 full barrier 的效果

改進 MP+poonceonces.litmus 如下(加入 full barrier)

C MP+fencembonceonce+fencembonceonce

(*
 * Result: Never
 *
 * This test demonstrates that smp_mb() prevents the message-passing reordering.
 *)

{}

P0(int *buf, int *flag) // Producer
{
	WRITE_ONCE(*buf, 1);
	smp_mb();
	WRITE_ONCE(*flag, 1);
}

P1(int *buf, int *flag) // Consumer
{
	int r0;
	int r1;

	r0 = READ_ONCE(*flag);
	smp_mb();
	r1 = READ_ONCE(*buf);
}

exists (1:r0=1 /\ 1:r1=0) (* Bad outcome *)

首先使用 herd7 工具分析如下:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ ~/herdtools7/_build/default/herd/herd.exe -conf linux-kernel.cfg litmus-tests/MP+fencembonceonce+fencembonceonce.litmus
Test MP+fencembonceonce+fencembonceonce Allowed
States 3
1:r0=0; 1:r1=0;
1:r0=0; 1:r1=1;
1:r0=1; 1:r1=1;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (1:r0=1 /\ 1:r1=0)
Observation MP+fencembonceonce+fencembonceonce Never 0 3
Time MP+fencembonceonce+fencembonceonce 0.03
Hash=2d339bb94a0294fd3094cac103eec317

可以發現,加入 smp_mb() 後,成功阻止了 message-passing 重排序現象。

接著進行實機測試:

使用以下指令轉換成 C code 後:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ mkdir mp_mb.out
i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ litmus7 -o mp_mb.out -carch AArch64 litmus-tests/MP+fencembonceonce+fencembonceonce.litmus
i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ cd mp_mb.out/

一樣先在編譯前先在 MP+pooncerelease+poacquireonce.c 裡補上巨集定義:

#define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))
#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))

static inline void smp_mb(void) {
    __sync_synchronize();
}

接著 make 後編譯成功:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model/mp_mb.out $ make
gcc -Wall -std=gnu99  -pthread -S MP+fencembonceonce+fencembonceonce.c
gcc -Wall -std=gnu99  -pthread  -o MP+fencembonceonce+fencembonceonce.exe outs.o utils.o litmus_rand.o MP+fencembonceonce+fencembonceonce.s
awk -f show.awk MP+fencembonceonce+fencembonceonce.s > MP+fencembonceonce+fencembonceonce.t
rm MP+fencembonceonce+fencembonceonce.s

更改參數如下:

#define SIZE_OF_TEST 1000000
#define NUMBER_OF_RUN 100

執行結果:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model/mp_mb.out $ ./run.sh
Thu 26 Jun 07:39:03 BST 2025
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Results for litmus-tests/MP+fencembonceonce+fencembonceonce.litmus %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
C MP+fencembonceonce+fencembonceonce

{
}
P0(int* buf, int* flag) {

	WRITE_ONCE(*buf, 1);
	smp_mb();
	WRITE_ONCE(*flag, 1);

}

P1(int* buf, int* flag) {

	int r0;
	int r1;

	r0 = READ_ONCE(*flag);
	smp_mb();
	r1 = READ_ONCE(*buf);

}


exists (1:r0=1 /\ 1:r1=0)
Generated assembler
#START _litmus_P0
// 0 "" 2
	ldr	x0, [sp, 40]
	mov	w1, 1
	str	w1, [x0]
	bl	smp_mb
	ldr	x0, [sp, 32]
	mov	w1, 1
	str	w1, [x0]
// 209 "MP+fencembonceonce+fencembonceonce.c" 1
#START _litmus_P1
// 0 "" 2
	ldr	x0, [sp, 48]
	ldr	w0, [x0]
	str	w0, [sp, 44]
	bl	smp_mb
	ldr	x0, [sp, 56]
	ldr	w0, [x0]
	str	w0, [sp, 40]
// 238 "MP+fencembonceonce+fencembonceonce.c" 1
Test MP+fencembonceonce+fencembonceonce Allowed
Histogram (3 states)
47087458:>1:r0=0; 1:r1=0;
2926327:>1:r0=0; 1:r1=1;
49986215:>1:r0=1; 1:r1=1;
No

Witnesses
Positive: 0, Negative: 100000000
Condition exists (1:r0=1 /\ 1:r1=0) is NOT validated
Hash=2d339bb94a0294fd3094cac103eec317
Observation MP+fencembonceonce+fencembonceonce Never 0 100000000
Time MP+fencembonceonce+fencembonceonce 25.07
Revision exported, version 7.58
Command line: litmus7 -o mp_mb.out -carch AArch64 litmus-tests/MP+fencembonceonce+fencembonceonce.litmus
Parameters
#define SIZE_OF_TEST 100000
#define NUMBER_OF_RUN 10
#define AVAIL 1
#define STRIDE (-1)
#define MAX_LOOP 0
/* gcc options: -Wall -std=gnu99  -pthread */
/* barrier: user */
/* launch: changing */
/* affinity: none */
/* memory: direct */
/* safer: write */
/* preload: random */
/* speedcheck: no */
/* alloc: dynamic */
GCC=gcc
LITMUSOPTS=
Thu 26 Jun 07:39:28 BST 2025

分析 MP+pooncerelease+poacquireonce.litmus

C MP+pooncerelease+poacquireonce

(*
 * Result: Never
 *
 * This litmus test demonstrates that smp_store_release() and
 * smp_load_acquire() provide sufficient ordering for the message-passing
 * pattern.
 *)

{}

P0(int *buf, int *flag) // Producer
{
	WRITE_ONCE(*buf, 1);
	smp_store_release(flag, 1);
}

P1(int *buf, int *flag) // Consumer
{
	int r0;
	int r1;

	r0 = smp_load_acquire(flag);
	r1 = READ_ONCE(*buf);
}

exists (1:r0=1 /\ 1:r1=0) (* Bad outcome. *)

查看行為後發現 flag 預設是 0 的狀態,然後讓需要被阻止重排的操作前面讀取 flag 直到它被設成 1,然後負責 store 的那個 Thread 需要負責寫 flag=1。

P0

WRITE_ONCE(*buf, 1);
smp_store_release(flag, 1);
  • WRITE_ONCE(*buf, 1):寫入實際資料(payload)。
  • smp_store_release(flag, 1)
    • 對應 ARMv8 指令:stlr w1, [flag]
    • 保證前面的所有 store(如 buf=1)不會被 reorder 到它之後。

P1

r0 = smp_load_acquire(flag);
r1 = READ_ONCE(*buf);
  • smp_load_acquire(flag)
    • 對應 ARMv8 指令:ldar w0, [flag]
    • 若讀到 flag==1,則保證之後的所有 load(如 buf)不會被 reorder 到這個 acquire 之前。
  • READ_ONCE(*buf):嘗試讀取 payload。

Q:這個模式是否只適合只有一個 Producer 和一個 Consumer 的情況?
A:是的,這個模式是 point-to-point 同步。因為:

  • flag 是一個 單一整數變數(int flag),不具備識別來源或目的的能力:

    • 多個 consumer 同時讀 flag==1,彼此沒有協議機制去分辨誰該處理。
    • 多個 producer 同時寫 buf 和 flag,則 buf 的內容會有競爭風險(race condition)。

    Q : 那這樣我應該如何利用它來支援 支援 N producers/N consumers 呢?
    A:下一個實驗進行探討。

Q:會不會因為工作阻塞而導致 flag 無法釋放,大量浪費 CPU 資源?
A:後續會試著進行探討。

首先使用 herd7 工具進行分析,結果如下:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ ~/herdtools7/_build/default/herd/herd.exe -conf linux-kernel.cfg litmus-tests/MP+pooncerelease+poacquireonce.litmus
Test MP+pooncerelease+poacquireonce Allowed
States 3
1:r0=0; 1:r1=0;
1:r0=0; 1:r1=1;
1:r0=1; 1:r1=1;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (1:r0=1 /\ 1:r1=0)
Observation MP+pooncerelease+poacquireonce Never 0 3
Time MP+pooncerelease+poacquireonce 0.03
Hash=f259d8efc86e0b40c7168bf58524ab49

可以發現,只出現了三種狀態。
這也驗證了之前的探討。

接著進行實機測試:

使用以下指令轉換成 C code 後:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ mkdir mp_acqrel.out
i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ litmus7 -o mp_acqrel.out -carch AArch64 litmus-tests/MP+pooncerelease+poacquireonce.litmus
i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ cd mp_acqrel.out

一樣先在編譯前先在 MP+pooncerelease+poacquireonce.c 裡補上巨集定義:

#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile typeof(x) *)&(x) = (val))

#define smp_store_release(p, v) \
  do { __atomic_store_n((p), (v), __ATOMIC_RELEASE); } while (0)

#define smp_load_acquire(p) \
  __atomic_load_n((p), __ATOMIC_ACQUIRE)

接著 make 後編譯成功:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model/mp_acqrel.out $ make
gcc -Wall -std=gnu99  -pthread -S MP+pooncerelease+poacquireonce.c
gcc -Wall -std=gnu99  -pthread  -o MP+pooncerelease+poacquireonce.exe outs.o utils.o litmus_rand.o MP+pooncerelease+poacquireonce.s
awk -f show.awk MP+pooncerelease+poacquireonce.s > MP+pooncerelease+poacquireonce.t
rm MP+pooncerelease+poacquireonce.s

更改參數如下:

#define SIZE_OF_TEST 1000000
#define NUMBER_OF_RUN 100

執行結果:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model/mp_acqrel.out $ ./run.sh
Wed 25 Jun 15:44:25 BST 2025
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Results for litmus-tests/MP+pooncerelease+poacquireonce.litmus %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
C MP+pooncerelease+poacquireonce

{
}
P0(int* buf, int* flag) {

	WRITE_ONCE(*buf, 1);
	smp_store_release(flag, 1);

}

P1(int* buf, int* flag) {

	int r0;
	int r1;

	r0 = smp_load_acquire(flag);
	r1 = READ_ONCE(*buf);

}


exists (1:r0=1 /\ 1:r1=0)
Generated assembler
#START _litmus_P0
// 0 "" 2
	ldr	x0, [sp, 40]
	mov	w1, 1
	str	w1, [x0]
	ldr	x0, [sp, 32]
	mov	w1, 1
	stlr	w1, [x0]
// 209 "MP+pooncerelease+poacquireonce.c" 1
#START _litmus_P1
// 0 "" 2
	ldr	x0, [sp, 48]
	ldar	w0, [x0]
	str	w0, [sp, 44]
	ldr	x0, [sp, 56]
	ldr	w0, [x0]
	str	w0, [sp, 40]
// 237 "MP+pooncerelease+poacquireonce.c" 1
Test MP+pooncerelease+poacquireonce Allowed
Histogram (3 states)
49999653:>1:r0=0; 1:r1=0;
3771  :>1:r0=0; 1:r1=1;
49996576:>1:r0=1; 1:r1=1;
No

Witnesses
Positive: 0, Negative: 100000000
Condition exists (1:r0=1 /\ 1:r1=0) is NOT validated
Hash=f259d8efc86e0b40c7168bf58524ab49
Observation MP+pooncerelease+poacquireonce Never 0 100000000
Time MP+pooncerelease+poacquireonce 25.01
Revision exported, version 7.58
Command line: litmus7 -o mp_acqrel.out -carch AArch64 litmus-tests/MP+pooncerelease+poacquireonce.litmus
Parameters
#define SIZE_OF_TEST 100000
#define NUMBER_OF_RUN 10
#define AVAIL 1
#define STRIDE (-1)
#define MAX_LOOP 0
/* gcc options: -Wall -std=gnu99  -pthread */
/* barrier: user */
/* launch: changing */
/* affinity: none */
/* memory: direct */
/* safer: write */
/* preload: random */
/* speedcheck: no */
/* alloc: dynamic */
GCC=gcc
LITMUSOPTS=
Wed 25 Jun 15:44:50 BST 2025

可以發現未出現任何一次 flag = 1 且 buf = 0 的執行結果(我進行了多次實驗),這也驗證了前面所述:

  • release-acquire memory ordering 保證 P0 的 store 不會被 P1 的 load 越過。

分析 ISA2+plain.litmus

C ISA2+plain

(*
 * This version of the ISA2 test removes all barriers and release/acquire.
 * It may allow the outcome: 1:r0=1 /\ 2:r0=1 /\ 2:r1=0
 *)

{}

P0(int *x, int *y)
{
	*x = 1;
	*y = 1;
}

P1(int *y, int *z)
{
	int r0;

	r0 = *y;
	*z = 1;
}

P2(int *x, int *z)
{
	int r0;
	int r1;

	r0 = *z;
	r1 = *x;
}

exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0) (* bad outcome expected *)

先用 herd7 分析:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ herd7 -conf linux-kernel.cfg litmus-tests/ISA2+plain.litmus
Test ISA2+plain Allowed
States 8
1:r0=0; 2:r0=0; 2:r1=0;
1:r0=0; 2:r0=0; 2:r1=1;
1:r0=0; 2:r0=1; 2:r1=0;
1:r0=0; 2:r0=1; 2:r1=1;
1:r0=1; 2:r0=0; 2:r1=0;
1:r0=1; 2:r0=0; 2:r1=1;
1:r0=1; 2:r0=1; 2:r1=0;
1:r0=1; 2:r0=1; 2:r1=1;
Ok
Witnesses
Positive: 1 Negative: 7
Flag data-race
Condition exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0)
Observation ISA2+plain Sometimes 1 7
Time ISA2+plain 0.04
Hash=03bd81ecd2a78698aafba1ce0a5c54a2

發現:

Condition exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0)
Observation ISA2+plain Sometimes 1 7
Flag data-race

這表示:

  • 模型允許 P2 observe 到 z = 1(來自 P1)
  • 同時卻 看不到 x = 1(來自 P0)
  • x=1 明明在 y=1 之前發生,而 P1 已見到 y=1 才寫 z=1,但 P2 卻沒看到 x=1

接著進行實機測試:

使用以下指令轉換成 C code 後:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ mkdir mp_isa2.out
i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ litmus7 -o mp_isa2.out -carch AArch64 litmus-tests/ISA2+plain.litmus 
i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ cd mp_isa2.out/
i-ying-tsai@raspberrypi:~/memory_model/tools/memory-

接著 make 後編譯成功:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model/mp_isa2.out $ make
gcc -Wall -std=gnu99  -pthread -O2 -c outs.c
gcc -Wall -std=gnu99  -pthread -O2 -c utils.c
gcc -Wall -std=gnu99  -pthread -O2 -c litmus_rand.c
gcc -Wall -std=gnu99  -pthread -S ISA2+plain.c
gcc -Wall -std=gnu99  -pthread  -o ISA2+plain.exe outs.o utils.o litmus_rand.o ISA2+plain.s
awk -f show.awk ISA2+plain.s > ISA2+plain.t
rm ISA2+plain.s

更改參數如下:

#define SIZE_OF_TEST 1000000
#define NUMBER_OF_RUN 100

執行結果:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model/mp_isa2.out $ ./run.sh 
Thu 26 Jun 16:10:56 BST 2025
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Results for litmus-tests/ISA2+plain.litmus %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
C ISA2+plain

{
}
P0(int* x, int* y) {

	*x = 1;
	*y = 1;

}

P1(int* y, int* z) {

	int r0;

	r0 = *y;
	*z = 1;

}

P2(int* x, int* z) {

	int r0;
	int r1;

	r0 = *z;
	r1 = *x;

}


exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0)
Generated assembler
#START _litmus_P0
// 0 "" 2
	ldr	x0, [sp, 40]
	mov	w1, 1
	str	w1, [x0]
	ldr	x0, [sp, 32]
	mov	w1, 1
	str	w1, [x0]
// 212 "ISA2+plain.c" 1
#START _litmus_P1
// 0 "" 2
	ldr	x0, [sp, 56]
	ldr	w0, [x0]
	str	w0, [sp, 44]
	ldr	x0, [sp, 48]
	mov	w1, 1
	str	w1, [x0]
// 239 "ISA2+plain.c" 1
#START _litmus_P2
// 0 "" 2
	ldr	x0, [sp, 48]
	ldr	w0, [x0]
	str	w0, [sp, 44]
	ldr	x0, [sp, 56]
	ldr	w0, [x0]
	str	w0, [sp, 40]
// 268 "ISA2+plain.c" 1
Test ISA2+plain Allowed
Histogram (8 states)
29368074:>1:r0=0; 2:r0=0; 2:r1=0;
4215954:>1:r0=1; 2:r0=0; 2:r1=0;
18914273:>1:r0=0; 2:r0=1; 2:r1=0;
5     *>1:r0=1; 2:r0=1; 2:r1=0;
3601  :>1:r0=0; 2:r0=0; 2:r1=1;
32805398:>1:r0=1; 2:r0=0; 2:r1=1;
14474671:>1:r0=0; 2:r0=1; 2:r1=1;
218024:>1:r0=1; 2:r0=1; 2:r1=1;
Ok

Witnesses
Positive: 5, Negative: 99999995
Condition exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0) is validated
Hash=03bd81ecd2a78698aafba1ce0a5c54a2
Observation ISA2+plain Sometimes 5 99999995
Time ISA2+plain 27.07
Revision exported, version 7.58
Command line: litmus7 -o mp_isa2.out -carch AArch64 litmus-tests/ISA2+plain.litmus
Parameters
#define SIZE_OF_TEST 100000
#define NUMBER_OF_RUN 10
#define AVAIL 1
#define STRIDE (-1)
#define MAX_LOOP 0
/* gcc options: -Wall -std=gnu99  -pthread */
/* barrier: user */
/* launch: changing */
/* affinity: none */
/* memory: direct */
/* safer: write */
/* preload: random */
/* speedcheck: no */
/* alloc: dynamic */
GCC=gcc
LITMUSOPTS=
Thu 26 Jun 16:11:23 BST 2025

可以發現,1:r0=1 /\ 2:r0=1 /\ 2:r1=0 確實發生了 5 次(out of 100 million)

分析 ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model/litmus-tests $ cat ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus
C ISA2+pooncerelease+poacquirerelease+poacquireonce

(*
 * Result: Never
 *
 * This litmus test demonstrates that a release-acquire chain suffices
 * to order P0()'s initial write against P2()'s final read.  The reason
 * that the release-acquire chain suffices is because in all but one
 * case (P2() to P0()), each process reads from the preceding process's
 * write.  In memory-model-speak, there is only one non-reads-from
 * (AKA non-rf) link, so release-acquire is all that is needed.
 *)

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
	smp_store_release(y, 1);
}

P1(int *y, int *z)
{
	int r0;

	r0 = smp_load_acquire(y);
	smp_store_release(z, 1);
}

P2(int *x, int *z)
{
	int r0;
	int r1;

	r0 = smp_load_acquire(z);
	r1 = READ_ONCE(*x);
}

exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0)

因為在前面的實驗我們已經驗證了 release-acquire memory ordering 可以阻止 L-L, S-S, S-L, L-S 的 reordering ,並且這是對於所有 CPU 的保證。
於是我們期望的目標是:

P0: x=1 ——release→ y=1
                    ↓
P1:               acquire(y) ——release→ z=1
                                         ↓
P2:                                   acquire(z) → 讀 x

用 herd7 分析:

 i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ herd7 -conf linux-kernel.cfg litmus-tests/ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus
Test ISA2+pooncerelease+poacquirerelease+poacquireonce Allowed
States 7
1:r0=0; 2:r0=0; 2:r1=0;
1:r0=0; 2:r0=0; 2:r1=1;
1:r0=0; 2:r0=1; 2:r1=0;
1:r0=0; 2:r0=1; 2:r1=1;
1:r0=1; 2:r0=0; 2:r1=0;
1:r0=1; 2:r0=0; 2:r1=1;
1:r0=1; 2:r0=1; 2:r1=1;
No
Witnesses
Positive: 0 Negative: 7
Condition exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0)
Observation ISA2+pooncerelease+poacquirerelease+poacquireonce Never 0 7
Time ISA2+pooncerelease+poacquirerelease+poacquireonce 0.03
Hash=11d54fb9e238127827746b4317edd24c

發現 release-acquire chain 成功阻止了 memory reordering

接著進行實機測試:

使用以下指令轉換成 C code 後:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ mkdir mp_isa2_rel_acq.out
i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ litmus7 -carch AArch64 -o mp_isa2_rel_acq.out litmus-tests/ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus
i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model $ cd mp_isa2_rel_acq.out/

補上巨集定義:

#define WRITE_ONCE(p, v) (*(volatile typeof(*(p)) *)(p) = (v))
#define READ_ONCE(p)     (*(volatile typeof(*(p)) *)(p))
#define smp_store_release(p, v) __atomic_store_n((p), (v), __ATOMIC_RELEASE)
#define smp_load_acquire(p)     __atomic_load_n((p), __ATOMIC_ACQUIRE)

但出現了錯誤:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model/mp_isa2_rel_acq.out $ make
gcc -Wall -std=gnu99  -pthread -S ISA2+pooncerelease+poacquirerelease+poacquireonce.c
ISA2+pooncerelease+poacquirerelease+poacquireonce.c: In function ‘P0’:
ISA2+pooncerelease+poacquirerelease+poacquireonce.c:29:45: error: invalid type argument of unary ‘*’ (have ‘int’)
   29 | #define WRITE_ONCE(p, v) (*(volatile typeof(*(p)) *)(p) = (v))
      |                                             ^~~~
ISA2+pooncerelease+poacquirerelease+poacquireonce.c:218:9: note: in expansion of macro ‘WRITE_ONCE’
  218 |         WRITE_ONCE(*x, 1);
      |         ^~~~~~~~~~
ISA2+pooncerelease+poacquirerelease+poacquireonce.c: In function ‘P2’:
ISA2+pooncerelease+poacquirerelease+poacquireonce.c:30:45: error: invalid type argument of unary ‘*’ (have ‘int’)
   30 | #define READ_ONCE(p)     (*(volatile typeof(*(p)) *)(p))
      |                                             ^~~~
ISA2+pooncerelease+poacquirerelease+poacquireonce.c:275:14: note: in expansion of macro ‘READ_ONCE’
  275 |         r1 = READ_ONCE(*x);
      |              ^~~~~~~~~
make: *** [Makefile:33: ISA2+pooncerelease+poacquirerelease+poacquireonce.s] Error 1

在確保更改不會影響實驗的本質後,我進行了以下修改:
原本:

WRITE_ONCE(*x, 1);
r1 = READ_ONCE(*x);

修改後:

WRITE_ONCE(x, 1);
r1 = READ_ONCE(x); 

Q :那為什麼一開始設計的時候會是這樣呢?
A :因為 litmus 語法是從更接近 assembly 或 formal model 的角度出發撰寫的,而不是直接當成 C 的巨集使用。它的語法是:「寫入 *x 所指的位置」。

接著 make 後編譯成功:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model/mp_isa2_rel_acq.out $ make
gcc -Wall -std=gnu99  -pthread -S ISA2+pooncerelease+poacquirerelease+poacquireonce.c
gcc -Wall -std=gnu99  -pthread  -o ISA2+pooncerelease+poacquirerelease+poacquireonce.exe outs.o utils.o litmus_rand.o ISA2+pooncerelease+poacquirerelease+poacquireonce.s
awk -f show.awk ISA2+pooncerelease+poacquirerelease+poacquireonce.s > ISA2+pooncerelease+poacquirerelease+poacquireonce.t
rm ISA2+pooncerelease+poacquirerelease+poacquireonce.s

更改參數如下:

#define SIZE_OF_TEST 1000000
#define NUMBER_OF_RUN 100

執行結果:

i-ying-tsai@raspberrypi:~/memory_model/tools/memory-model/mp_isa2_rel_acq.out $ ./run.sh
Fri 27 Jun 13:48:22 BST 2025
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Results for litmus-tests/ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
C ISA2+pooncerelease+poacquirerelease+poacquireonce

{
}
P0(int* x, int* y) {

	WRITE_ONCE(*x, 1);
	smp_store_release(y, 1);

}

P1(int* y, int* z) {

	int r0;

	r0 = smp_load_acquire(y);
	smp_store_release(z, 1);

}

P2(int* x, int* z) {

	int r0;
	int r1;

	r0 = smp_load_acquire(z);
	r1 = READ_ONCE(*x);

}


exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0)
Generated assembler
#START _litmus_P0
// 0 "" 2
	ldr	x0, [sp, 40]
	mov	w1, 1
	str	w1, [x0]
	ldr	x0, [sp, 32]
	mov	w1, 1
	stlr	w1, [x0]
// 221 "ISA2+pooncerelease+poacquirerelease+poacquireonce.c" 1
#START _litmus_P1
// 0 "" 2
	ldr	x0, [sp, 56]
	ldar	w0, [x0]
	str	w0, [sp, 44]
	ldr	x0, [sp, 48]
	mov	w1, 1
	stlr	w1, [x0]
// 248 "ISA2+pooncerelease+poacquirerelease+poacquireonce.c" 1
#START _litmus_P2
// 0 "" 2
	ldr	x0, [sp, 48]
	ldar	w0, [x0]
	str	w0, [sp, 44]
	ldr	x0, [sp, 56]
	ldr	w0, [x0]
	str	w0, [sp, 40]
// 277 "ISA2+pooncerelease+poacquirerelease+poacquireonce.c" 1
Test ISA2+pooncerelease+poacquirerelease+poacquireonce Allowed
Histogram (7 states)
32994001:>1:r0=0; 2:r0=0; 2:r1=0;
577760:>1:r0=1; 2:r0=0; 2:r1=0;
663604:>1:r0=0; 2:r0=1; 2:r1=0;
1682256:>1:r0=0; 2:r0=0; 2:r1=1;
32218278:>1:r0=1; 2:r0=0; 2:r1=1;
30946378:>1:r0=0; 2:r0=1; 2:r1=1;
917723:>1:r0=1; 2:r0=1; 2:r1=1;
No

Witnesses
Positive: 0, Negative: 100000000
Condition exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0) is NOT validated
Hash=11d54fb9e238127827746b4317edd24c
Observation ISA2+pooncerelease+poacquirerelease+poacquireonce Never 0 100000000
Time ISA2+pooncerelease+poacquirerelease+poacquireonce 27.39
Revision exported, version 7.58
Command line: litmus7 -carch AArch64 -o mp_isa2_rel_acq.out litmus-tests/ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus
Parameters
#define SIZE_OF_TEST 100000
#define NUMBER_OF_RUN 10
#define AVAIL 1
#define STRIDE (-1)
#define MAX_LOOP 0
/* gcc options: -Wall -std=gnu99  -pthread */
/* barrier: user */
/* launch: changing */
/* affinity: none */
/* memory: direct */
/* safer: write */
/* preload: random */
/* speedcheck: no */
/* alloc: dynamic */
GCC=gcc
LITMUSOPTS=
Fri 27 Jun 13:48:50 BST 2025

發現沒有出現 1:r0=1 ∧ 2:r0=1 ∧ 2:r1=0
-> release → acquire → release → acquire chain 正確地阻止了 reordering。