〈Concurrency Primer〉校訂和範例撰寫

Contributed by < huang-me >

使用 atomic 與否差異

#include <stdatomic.h>

int v = 0;
atomic_bool v_ready = false;
int bv;

void *threadA() {
	v = 42;
	v_ready = true;
}

void *threadB() {
	while(!v_ready);
	bv = v;
}

以下使用 gcc -O3 optimization

threadA 中：若將 v_ready 定義為 amtomic_bool 則 compiler 將會插入 memory fence 使得所有在之前寫入的 register 值都被更新至 memory 中。

threadA:				threadA:
.LFB40:					.LFB40:
    .cfi_startproc				.cfi_startproc
    endbr64					endbr64
    movl	$42, v(%rip)			movl	$42, v(%rip)
    movb	$1, v_ready(%rip)		movb	$1, v_ready(%rip)
    mfence			     <
    ret					ret

threadB 中：若不是使用 atomic_bool，compiler 會對 while loop 中的讀取進行優化，造成無限迴圈而無法正常執行。(row 7,8)

















threadB:				threadB:
.LFB41:					.LFB41:
	.cfi_startproc				.cfi_startproc
	endbr64					endbr64
				     >		cmpb	$0, v_ready(%rip)
				     >		jne	.L4
				     >	.L5:
				     >		jmp	.L5
	.p2align 4,,10				.p2align 4,,10
	.p2align 3				.p2align 3
.L4:					.L4:
	movzbl	v_ready(%rip), %eax  <
	testb	%al, %al	     <
	je	.L4		     <
	movl	v(%rip), %eax			movl	v(%rip), %eax
	movl	%eax, bv(%rip)			movl	%eax, bv(%rip)
	ret					ret

Check if atomic element is lock free

C/C++ 中分別定義atomic_is_lock_free/std::atomic<T>::is_lock_free 可用以檢驗此 atomic 變數是否可以在 lock free 的情況下做到 atomic operation。

struct foo {
	int a, b, c;
	char *s;
};

atomic_bool v_ready = false;
struct foo f;

int main() {
	printf("foo is lock free? %s\n", atomic_is_lock_free(&f) ? "true" : "false");
	printf("v_ready is lock free? %s\n", atomic_is_lock_free(&v_ready) ? "true" : "false");
	return 0;
}

以上測試程式執行結果如下：

foo is lock free? false
v_ready is lock free? true

若定義的 atomic variable 大小大於一個 register 之大小且 variable aligned，則需要使用 mutex lock 確保 operation 之 atomicity。
反之，若 variable 沒有被 aligned ，則是否為 lock free 取決於硬體架構的設計方式。

Read-modify-write

當需要 atomically 完成讀取、修改、寫入的一系列指令時，可以使用以下指令

atomic_<type>:
atomic_compare_exchange & atomic_exchange
atomic_flag:
atomic_flag_test_and_set & atomic_flag_clear

利用 `atomic_flag` 實作 mutex lock

atomic_flag f = ATOMIC_FLAG_INIT;
void atomic_operation() {
    while(atomic_flag_test_and_set(&f))
        nop;
    /*
        critical section
    */
    atomic_flag_clear(&f);
    return;
}

`atomic_flag` 編譯結果

原始 c code：

void *threadA() {
    while(atomic_flag_test_and_set(&f));
    t1.a++;
    atomic_flag_clear(&f);
    return NULL;
}

對應 assembly

.L2:
    movl	%edx, %eax
    xchgb	f(%rip), %al
    testb	%al, %al
    jne	.L2
    addl	$1, t1(%rip)
    xorl	%eax, %eax
    movb	$0, f(%rip)
    mfence
    ret

atomic_flag_test_and_set 使用 x86 的 exchange 指令 xchgb 設定 flag 的值
atomic_flag_clear 則在寫入值之後還加上了 memory fence 以確保所有值的修改都可以被其他 processor 看見。

compare and exchange

CAS c code

atomic_int ai = 0;
void cas() {
    atomic_int exp = 0;
    atomic_compare_exchange_strong(&ai, &exp, 1);
    return;
}

對應 assembly code








cas:
.LFB24:
    .cfi_startproc
    endbr64
    xorl	%eax, %eax
    movl	$1, %edx
    lock cmpxchgl	%edx, ai(%rip)
    ret

第 7 行中，lock 使得在做 cmpxchgl 時同時只能有一個 cpu 獲得此變數的存取權。
並且 cmpxchgl 先比較 %eax 以及 destination register (ai(%rip)，值相等時才將 %edx 中的值與 ai(%rip) 互換。

load-link and store-conditional (LL/SC)

在某些 cpu 架構之下(MIPS,RISC-V,ARM,etc…)，有指令可以追蹤此次要 store 的目標值是否被修改過，以避免 load modfiy store 覆蓋已經被更新的值。

`atomic_compare_exchange_weak` 存在意義

void addone(int val) {
    int expected = val;
    while(!atomic_compare_exchange_?(&val, &expected, val + 1))
        nop;
    return;
}

我們可能經常使用以上方式進行 CAS 動作，但是如果硬體有支援 ll/sc 的話則有可能錯誤發生在 sc 而非一開始比較 val 以及 expected 時，但其時我們不在乎什麼時候出現錯誤，所以定義 weak 版本的 CAS 只要動作沒有正確的完成就會回傳 true。

Memory ordering

有時候我們並不需要所有的指令都在一定的順序下執行，有部分的指令可以提前或者延後執行時間。
因此，在 stdatomic.h 中有 6 種 memory ordering model 可以讓 compiler 清楚各指令可以接受的優化方向。

Sequentially consistent

在各個 thread 中，所有的指定為 memory_order_seq_cst 指令都是 sequentially 執行，所有的執行順序跟程式被撰寫的順序相同，因此沒有重排指令造成執行結果錯誤的可能性。

void *threadFunc() {
	atomic_int val = atomic_load_explicit(&a, memory_order_seq_cst);
	val += 1;
	atomic_store_explicit(&a, val, memory_order_seq_cst);
	return NULL;
}

若同時讓多個 thread 執行以上 function，指令的執行順序按照 thread 建立時間以及指令順序執行。

使用 sequentially consistent 的缺點為即使使用多個 thread 還是按照 sequential 執行，使得執行效率低落，甚至可能跟 single thread 時差不多，但卻浪費了很多資源。

Acquire

只能用在 atomic_load
所有在此 load 之後的指令都不能被重排到此指令的前面。

Release

只能用在 atomic_store
與 Acquire 相反，所有在此 store 之前的指令都不能被重排到此指令的後面。

Acquire Release

可以用在 atomic_load 以及 atomic_store
必須同時滿足 Acquire 以及 Release 條件

Relaxed

所有 atomic operation 都 atomically 執行，但是不同 thread 之間對於相同變數的執行順序沒有固定。

Consume

基本上跟 Acquire 相似，不同點在於只需要保證相同變數的順序，其他不相關的變數可以做前後順序的調換，且 memory barrier 的加入為非必要。

〈Atomic Primer〉勘誤

Enforcing law and order:

應修正為:
"They look and act just like the type they mirror …"

〈Concurrency Primer〉校訂和範例撰寫

使用 atomic 與否差異

Check if atomic element is lock free

Read-modify-write

利用 atomic_flag 實作 mutex lock

atomic_flag 編譯結果

compare and exchange

load-link and store-conditional (LL/SC)

atomic_compare_exchange_weak 存在意義

Memory ordering

Sequentially consistent

Acquire

Release

Acquire Release

Relaxed

Consume

Read more

Parallel programming HW5

Parallel programming HW4

Parallel programming HW2

2022q1 Homework6 (ktcp)

利用 `atomic_flag` 實作 mutex lock

`atomic_flag` 編譯結果

`atomic_compare_exchange_weak` 存在意義