--- tags: linux2022 --- # 2022q1 Homework4 (quiz4) contribute by < `bakudr18` > > [測驗題目](https://hackmd.io/@sysprog/linux2022-quiz4) ## 執行環境 :::spoiler lscpu ```shell $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 140 Model name: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz Stepping: 1 CPU MHz: 1700.000 CPU max MHz: 4700.0000 CPU min MHz: 400.0000 BogoMIPS: 3379.20 Virtualization: VT-x L1d cache: 192 KiB L1i cache: 128 KiB L2 cache: 5 MiB L3 cache: 12 MiB NUMA node0 CPU(s): 0-7 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl an d seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sa nitization Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca c mov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_per fmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid ap erfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cp l vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 ss e4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_ l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhance d tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase ts c_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f av x512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_ notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni av x512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear flush_l1d arch_capabilities ``` ::: :::spoiler uname -a ```shell $ uname -a Linux bakud-Inspiron-7400 5.13.0-35-generic #40~20.04.1-Ubuntu SMP Mon Mar 7 09:18:32 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux ``` ::: :::spoiler gcc --version ```shell $ gcc --version gcc (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0 ``` ::: ## 測驗 `4` ```cpp= static void (**tasks)(void *); static int ntasks; #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0])) int main(void) { void (*registered_task[])(void *) = {task0, task1}; tasks = registered_task; ntasks = ARRAY_SIZE(registered_task); schedule(); return 0; } ``` 首先看到第 6 行 ,想起以前只會使用 `&task0` 來 assign 給 function pointer,好奇運作原理,因此馬上複習[你所不知道的C語言:指標篇](https://hackmd.io/@sysprog/c-pointer#Function-Pointer),根據 C99 規格書 > A function designator is an expression that has function type. Except when it is the operand of the sizeof operator or the unary & operator, a function designator with type ‘‘function returning type’’ is converted to an expression that has type ‘‘pointer to function returning type’’. (C99 6.3.2.1) > It is an lvalue, a function designator, or avoid expression if the unparenthesized expression is, respectively, an lvalue, a function designator, or avoid expression. (C99 6.5.1) >The unary * operator denotes indirection. If the operand points to a function, the result is a function designator (C99 6.5.3.2-4) 根據以上三點可以知道,`(*registered_task[])` is an array of function designator, a function designator will be converted to a pointer points to function type. 因此,這裡的 `(*registered_task[])` 可以 assign function designator `task0` or `task1` ,甚至可以 assign function designator `*task0` or `*****task1` ,又或者,可以 assign "pointer to function returning type" `&task0` or `&task1`。 接著看到第 7 行的 `tasks = registered_task;` ,察看 C99 規格書 >Except when it is the operand of the sizeof operator or the unary & operator, or is a string literal used to initialize an array, an expression that has type ‘‘array of type’’ is converted to an expression with type ‘‘pointer to type’’ that points to the initial element of the array object and is not an lvalue. (C99 6.3.2.1) 可以知道, `registered_task` will be converted to pointer to pointer to function type. 而 `tasks` 為 pointer to pointer to function type,因此可以直接 assign 。 ```cpp void schedule(void) { static int i; srand(0xCAFEBABE ^ (uintptr_t) &schedule); /* Thanks to ASLR */ setjmp(sched); while (ntasks-- > 0) { int n = rand() % 5; tasks[i++](&n); printf("Never reached\n"); } task_join(&tasklist); } ``` `schedule` 會根據 `nstasks` 數量去執行 `tasks[i]` ,如果超過次數就執行 `task_join` 去做完剩餘的 `task` 並回收。 因為有 [ASLR (address space layout randomization)](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux),每次執行此程式得到的 function address 都會是隨機的,因此可以拿來當作 `srand` 的 seed。 ```cpp= /* A task yields control n times */ void task0(void *arg) { jmp_buf env; static int n; n = *(int *) arg; printf("Task 0: n = %d\n", n); if (setjmp(env) == 0) { task_add(&tasklist, env); longjmp(sched, 1); } for (int i = 0; i < n; i++) { if (setjmp(env) == 0) { task_add(&tasklist, env); task_switch(&tasklist); } printf("Task 0: resume\n"); } printf("Task 0: complete\n"); longjmp(sched, 1); } ``` `task0` 與 `task1` 邏輯是一樣的,在初次執行時,在第 10 行 `setjmp(env)` 設定 `longjmp(env, val)` 回復點,且回傳 `0` ,接著把 `env` 加到 `tasklist` 中 (為 linux-list),然後回到 `schedule` 的 `setjmp(sched)` 繼續執行。 若在其他函式內 `longjmp(env, 1)` 執行到屬於 `task0` 的 `env` ,則跳回第 11 行並回傳 `1` ,接著到 `for` loop 準備執行 `n` 次任務,但每次迭代都會先 `task_add` 後,透過 `task_switch` yield 給其他 task 執行,待下次 `longjmp(env, 1)` 執行到屬於 `task0` 的 `env` 才印出 resume ,最終 complete 後回到 `schedule` 函式內會執行到 `task_join` 。 因此,假設兩個 task 都是 `n == 3` ,我的理解是預期會各輸出 3 次 resume 與 1 次 complete,如下 ```cpp Task 0: n = 3 Task 1: n = 3 Task 0: resume Task 1: resume Task 0: resume Task 1: resume Task 0: resume Task 0: complete Task 1: resume Task 1: complete ``` 而並非[題目說明](https://hackmd.io/@sysprog/linux2022-quiz4#%E6%B8%AC%E9%A9%97-4)中如下的輸出 ```cpp Task 0: n = 3 Task 1: n = 3 Task 0: resume Task 1: resume Task 0: resume Task 0: complete Task 1: resume Task 1: complete ``` 其他包含 `task_join` 與 `task_switch` 的完整實作放在下方 `jmp.c` 內 :::spoiler jmp.c ```cpp #include <setjmp.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include "list.h" struct task { jmp_buf env; struct list_head list; }; static LIST_HEAD(tasklist); static void (**tasks)(void *); static int ntasks; static jmp_buf sched; static void task_add(struct list_head *tasklist, jmp_buf env) { struct task *t = malloc(sizeof(*t)); memcpy(t->env, env, sizeof(jmp_buf)); INIT_LIST_HEAD(&t->list); list_add_tail(&t->list, tasklist); } static void task_switch(struct list_head *tasklist) { jmp_buf env; if (!list_empty(tasklist)) { struct task *t = list_first_entry(tasklist, struct task, list); list_del(&t->list); memcpy(env, t->env, sizeof(jmp_buf)); free(t); longjmp(env, 1); } } static void task_join(struct list_head *tasklist) { jmp_buf env; while (!list_empty(tasklist)) { struct task *t = list_first_entry(tasklist, struct task, list); list_del(&t->list); memcpy(env, t->env, sizeof(jmp_buf)); free(t); longjmp(env, 1); } } void schedule(void) { static int i; srand(0xCAFEBABE ^ (uintptr_t) &schedule); /* Thanks to ASLR */ setjmp(sched); while (ntasks-- > 0) { int n = rand() % 5; tasks[i++](&n); printf("Never reached\n"); } task_join(&tasklist); } /* A task yields control n times */ void task0(void *arg) { jmp_buf env; static int n; n = *(int *) arg; printf("Task 0: n = %d\n", n); if (setjmp(env) == 0) { task_add(&tasklist, env); longjmp(sched, 1); } for (int i = 0; i < n; i++) { if (setjmp(env) == 0) { task_add(&tasklist, env); task_switch(&tasklist); } printf("Task 0: resume\n"); } printf("Task 0: complete\n"); longjmp(sched, 1); } void task1(void *arg) { jmp_buf env; static int n; n = *(int *) arg; printf("Task 1: n = %d\n", n); if (setjmp(env) == 0) { task_add(&tasklist, env); longjmp(sched, 1); } for (int i = 0; i < n; i++) { if (setjmp(env) == 0) { task_add(&tasklist, env); task_switch(&tasklist); } printf("Task 1: resume\n"); } printf("Task 1: complete\n"); longjmp(sched, 1); } #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0])) int main(void) { void (*registered_task[])(void *) = {task0, task1}; tasks = registered_task; ntasks = ARRAY_SIZE(registered_task); schedule(); return 0; } ``` ::: --- ### 追蹤非預期的輸出 然而,此實作的執行結果不如我預期,例如當兩個 `n == 2` 時,輸出如下,task0 resume 2 次,但 task1 resume 1 次而已。 ```cpp Task 0: n = 2 Task 1: n = 2 Task 0: resume Task 1: resume Task 1: complete Task 0: resume Task 0: complete ``` ```cpp=97 void task1(void *arg) { jmp_buf env; static int n; n = *(int *) arg; printf("Task 1: n = %d\n", n); if (setjmp(env) == 0) { task_add(&tasklist, env); longjmp(sched, 1); } for (int i = 0; i < n; i++) { if (setjmp(env) == 0) { task_add(&tasklist, env); task_switch(&tasklist); } printf("Task 1: resume\n"); } printf("Task 1: complete\n"); longjmp(sched, 1); } ``` 嘗試透過 gdb 分析,將 `n` 設為定值 2 以方便分析。 首先將 break point 設在 `task1` 第 111 行,第一次 break 時 `i == 0` ,然而 continue 後直接執行到結束,沒有再回到 `i = 1` 的狀態。 ```shell $ gdb jmp (gdb) b 111 Breakpoint 1 at 0x1761: file jmp.c, line 111. (gdb) r Starting program: /home/bakud/linux2022/jmp/jmp Task 0: n = 2 Task 1: n = 2 Breakpoint 1, task1 (arg=0x555555558010 <tasklist>) at jmp.c:111 111 if (setjmp(env) == 0) { (gdb) print i $1 = 0 (gdb) c Continuing. Task 0: resume Task 1: resume Task 1: complete Task 0: resume Task 0: complete [Inferior 1 (process 128366) exited normally] ``` 重新嘗試用單步執行 `step`,無法執行 ```shell Breakpoint 1, task1 (arg=0x555555558010 <tasklist>) at jmp.c:110 110 for (int i = 0; i < n; i++) { (gdb) s 111 if (setjmp(env) == 0) { (gdb) _setjmp () at ../sysdeps/x86_64/bsd-_setjmp.S:28 28 ../sysdeps/x86_64/bsd-_setjmp.S: No such file or directory. ``` 改嘗試用 `next` ,可以發現在 `task1` yield 前 `i == 0` ,但下次 `longjmp` 回到 `task1` 時 `i == 1` ,不確定是哪裡出問題。 ```shell (gdb) r Starting program: /home/bakud/linux2022/jmp/jmp Task 0: n = 2 Task 1: n = 2 Breakpoint 1, task1 (arg=0x555555558010 <tasklist>) at jmp.c:111 111 if (setjmp(env) == 0) { (gdb) p i $2 = 0 (gdb) n 112 task_add(&tasklist, env); (gdb) 113 task_switch(&tasklist); (gdb) 0x0000555555555662 in task0 (arg=0x555555558010 <tasklist>) at jmp.c:86 86 if (setjmp(env) == 0) { (gdb) 90 printf("Task 0: resume\n"); (gdb) Task 0: resume 85 for (int i = 0; i < n; i++) { (gdb) 86 if (setjmp(env) == 0) { (gdb) 87 task_add(&tasklist, env); (gdb) 88 task_switch(&tasklist); (gdb) 0x0000555555555770 in task1 (arg=0x555555558010 <tasklist>) at jmp.c:111 111 if (setjmp(env) == 0) { (gdb) p i $3 = 1 ``` 因此只好直接查看組合語言看能不能研究出什麼,但在這之前為了方便研究我寫了簡化的實作,看看能不能重現問題,如下 ```cpp #include <setjmp.h> #include <stdio.h> static jmp_buf t1, t2, sched; void task1(void) { if(setjmp(t1) == 0) longjmp(sched, 1); for(int i = 0; i < 2; i++){ if(setjmp(t1) == 0){ longjmp(t2, 1); } printf("task1: %d\n", i); } } void task2(void) { for(int i = 0; i < 2; i++){ if(setjmp(t2) == 0){ longjmp(t1, 1); } printf("task2: %d\n", i); } } void schedule(void) { if(setjmp(sched) == 0) task1(); else task2(); } int main() { schedule(); return 0; } ``` 預期輸出應該是 ```cpp task2: 0 task1: 0 task2: 1 ``` 但實際執行結果如下,可以看到 `task1` 直接跳過了 `i == 0` 的步驟 ```cpp task2: 0 task1: 1 ``` 以 `gcc -O0 -g -o j j.c` 編譯後,`objdump -S j` 觀察其組合語言 (~~打開 CSAPP chap3 慢慢硬啃~~),`task1` 看不出什麼異常,但在 `setjmp` 發現不常見的指令 `bnd` ,發現這是來自於 Intel [MPX](https://en.wikipedia.org/wiki/Intel_MPX) 指令集。 ```shell 0000000000001120 <_setjmp@plt>: 1120: f3 0f 1e fa endbr64 1124: f2 ff 25 8d 2e 00 00 bnd jmpq *0x2e8d(%rip) # 3fb8 <_setjmp@GLIBC_2.2.5> 112b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) ``` >Intel MPX (Memory Protection Extensions) was a set of extensions to the x86 instruction set architecture. With compiler, runtime library and operating system support, Intel MPX claimed to enhance security to software by checking pointer references whose normal compile-time intentions are maliciously exploited at runtime due to buffer overflows. 看起來是用來保護 stack 等 memory buffer 的相關指令,然而又查到了以下相關資料 * [The Linux Kernel Might Drop Memory Protection Extensions Support](https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kernel-Weighing-Intel-MPX) * [GCC 9 Looks Set To Remove Intel MPX Support](https://www.phoronix.com/scan.php?page=news_item&px=GCC-Patch-To-Drop-MPX) * [Glibc 2.35 Removes The Long-Deprecated Intel MPX Support](https://www.phoronix.com/scan.php?page=news_item&px=Glibc-2.35-Removes-Intel-MPX) 看來是~~被遺棄~~已經不支援的指令呢,因此推測問題可能出在這裡,然而我查閱了 gcc 文件卻沒有看到有編譯選項可以避免編譯出 mpx 相關指令,我也嘗試過使用 gcc-11 編譯結果也是一樣的,目前想不出其他方向可以嘗試。 透過 `ldd --version` 檢查 GLIBC 版本為 2.31 ,推測是產生 mpx 指令的原因,查閱 gcc docs 3.11 [Program Instrumentation Options](https://gcc.gnu.org/onlinedocs/gcc-9.4.0/gcc/Instrumentation-Options.html) 沒有看到可以關掉 mpx 的選項。 參考了 [kevinshieh0225](https://hackmd.io/@Masamaloka/linux2022-quiz4#%E5%8D%80%E5%9F%9F%E8%AE%8A%E6%95%B8%E8%BF%BD%E8%B9%A4) 同學的共筆,發現我疏忽了 `longjmp` 的細節,根據 [setjmp(3) — Linux manual page](https://man7.org/linux/man-pages/man3/setjmp.3.html) >The compiler may optimize variables into registers, and longjmp() may restore the values of other registers in addition to the stack pointer and program counter. Consequently, the values of automatic variables are unspecified after a call to longjmp() if they meet all the following criteria: > > • they are local to the function that made the corresponding setjmp() call; > • their values are changed between the calls to setjmp() and longjmp(); and > • they are not declared as volatile. 以及 C11 7.13.2 > All accessible objects have values, and all other components of the abstract machine have state, as of the time the longjmp function was called, except that the values of objects of automatic storage duration that are local to the function containing the invocation of the corresponding setjmp macro that do not have volatile-qualified type and have been changed between the setjmp invocation and longjmp call are indeterminate. 對於 automatic variables 且不為 volatile-qualified type ,經過 `longjmp` 後 automatic variables 不保證會被恢復。`longjmp` 只會復原除了 stack pointer 與 program counter 以外的 register 。 觀察 `task0` 中變數 `i` 的組合語言如下,可以發現 `i` 直接存在 `rbp - 0xd4` 的位址 (在函式初始化時保留了 `sub $0xf0,%rsp` 大小的 stack 中),沒有存在 register 中,也因此經過 `longjmp` 不會被復原。 ```shell <task0>: endbr64 push %rbp mov %rsp,%rbp sub $0xf0,%rsp # reserve 0xf0 bytes as stack ... movl $0x0,-0xd4(%rbp) # int i = 0 ... addl $0x1,-0xd4(%rbp) # i++ ``` 但注意到即使宣告為 `valotile int i` 編譯後的組合語言仍然沒變,仍不符合以上 man page 與 C11 standard 所規範的行為。 解決方法如,宣告為 `static` 就不屬於 automatic storage duration。另外像 [Kevin-Shih](https://hackmd.io/@Kevin-Shih/linux2022q1-quiz4#%E6%B8%AC%E9%A9%97-4) 同學使用 [local register variable](https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html#Local-Register-Variables) ,宣告 `register int i asm ("eax");` 可以強制變數 `i` 被存在 register 中也是一個解法。 另外我寫了一個保留 stack 的寫法,在 `task0` 的開頭透過 `calloc` 配置空間作為 `task0` 的 preserve stack ,與 `env` 一起存在 `struct context` 中,並透過 `task_add` 保留在 list 內等下次 `task_switch` 可以被恢復。不過下方的版本封裝沒有做的很好,在 `longjmp` 跳回時必須手動回復 local variable 的指標,暫時沒有想到比較好的封裝寫法。 ```cpp #define STACK_SIZE 8192 // bytes struct context { jmp_buf env; void *stack; int sp; }; struct task { struct context *context; struct list_head list; }; ... void task0(void *arg) { static int n; n = *(int *) arg; static struct context *uc; uc = calloc(1, sizeof(struct context)); uc->stack = calloc(1, STACK_SIZE); uc->sp = 0; int *i = (int*)((char*)uc->stack + uc->sp); uc->sp += sizeof(*i); printf("Task 0: n = %d\n", n); if (setjmp(uc->env) == 0) { task_add(&tasklist, uc); longjmp(sched, 1); } i = (int*)uc->stack; for (*i = 0; *i < n; (*i)++) { if (setjmp(uc->env) == 0) { task_add(&tasklist, uc); task_switch(&tasklist); } i = (int*)uc->stack; printf("Task 0: resume i: %d\n", *i); } printf("Task 0: complete\n"); free(uc->stack); free(uc); longjmp(sched, 1); } ``` 另外補充也可以透過 [getcontext/setcontext](https://man7.org/linux/man-pages/man3/getcontext.3.html) 與 [makecontext/swapcontext](https://man7.org/linux/man-pages/man3/makecontext.3.html) 來取代 `setjmp/longjmp` 實作 coroutine,前者有提供 `uc_stack` 可以保留 stack ,被保留的 context 在執行 `getcontext` 或 `swapcontext` 時會自動回復,其中也包含 stack,因此可以不用像上方作法手動回復 local variable。 --- ## 測驗 `5` ```cpp= #include <stdint.h> #include <string.h> #include <stdlib.h> #define __READ_ONCE_SIZE \ ({ \ switch (size) { \ case 1: \ *(uint8_t *) res = *(volatile uint8_t *) p; \ break; \ case 2: \ *(uint16_t *) res = *(volatile uint16_t *) p; \ break; \ case 4: \ *(uint32_t *) res = *(volatile uint32_t *) p; \ break; \ case 8: \ *(uint64_t *) res = *(volatile uint64_t *) p; \ break; \ default: \ memcpy((void *) res, (const void *) p, size); \ } \ }) static inline void __read_once_size(const volatile void *p, void *res, int size) { __READ_ONCE_SIZE; } #define READ_ONCE(x) \ ({ \ union { \ typeof(x) __val; \ uint8_t __c[1]; \ } __u; \ __read_once_size(&(x), __u.__c, sizeof(x)); \ __u.__val; \ }) ``` 根據 [ACCESS_ONCE() and compiler bugs](https://lwn.net/Articles/624126/),原本的 `ACCESS_ONCE` 在過去 Linux 核心的實作如下,透過 `volatile` 確實地到指定記憶體位址讀取其值,而非透過 register 或其他方式。 ```cpp #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x)) ``` 解釋之前先看一下 C99 規格書複習以下 type 定義: >* Integer and floating types are collectively called arithmetic types. (C99 6.2.5-18) >* Arithmetic types and pointer types are collectively called scalar types. Array and structure types are collectively called aggregate types. Note that aggregate type does not include union type because an object with union type can only contain one member at a time.(C99 6.2.5-21) 當 `ACCESS_ONCE` 傳入變數為 scalar type ,結果都會是正確的,然而 Christian Borntraeger 的 [message](https://lwn.net/Articles/624148/) 提到因為 GCC 4.6 and 4.7 存在 bug ,使 `ACCESS_ONCE` 在傳入 aggregate type 變數時,可能出現錯誤,考慮以下程式碼 ```cpp= union ipte_control { unsigned long val; struct { unsigned long k : 1; unsigned long kh : 31; unsigned long kg : 32; }; }; [...] static void ipte_unlock_siif(struct kvm_vcpu *vcpu) { union ipte_control old, new, *ic; ic = &vcpu->kvm->arch.sca->ipte_control; do { new = old = ACCESS_ONCE(*ic); new.kh--; if (!new.kh) new.k = 0; } while (cmpxchg(&ic->val, old.val, new.val) != old.val); if (!new.kh) wake_up(&vcpu->kvm->arch.ipte_wq); } ``` Christian 於 gcc 4.7.2 of fedora 18 的編譯結果如下 ```shell l %r4,0(%r3) <--- load first 32 bit of lock (k and kh) in r4 alfi %r4,2147483647 <--- add -1 to r4 llgtr %r4,%r4 <--- zero out the sign bit of r4 lg %r1,0(%r3) <--- load all 64 bit of lock into new lgr %r2,%r1 <--- load the same into old risbg %r1,%r4,1,31,32 <--- shift and insert r4 into the bits 1-31 of new llihf %r4,2147483647 ngrk %r4,%r1,%r4 jne aa0 <ipte_unlock+0xf8> nihh %r1,32767 lgr %r4,%r2 csg %r4,%r1,0(%r3) cgr %r2,%r4 jne a70 <ipte_unlock+0xc8> ``` 可以看到其指令實際上到 `%r3` load 兩次,第一次在 `l` 指令,第二次在 `lg` 指令,參考 [你所不知道的 C 語言:編譯器和最佳化原理篇](https://hackmd.io/@sysprog/c-compiler-optimization) ,**推測**可能的原因是編譯器切割 [basic block](https://en.wikipedia.org/wiki/Basic_block) 切錯了,誤將上述程式碼第 16 行的 `new` 與第 17 行的 `new.kh` 當成沒有相依性的 code,所以把第 16 行切為一個 basic block,而將第 17~19 行切為另一個 basic block,導致後續經過 [CFG](https://en.wikipedia.org/wiki/Control-flow_graph) 以及轉換 [3-address code](https://en.wikipedia.org/wiki/Three-address_code) 等流程之後變成上述組合語言的順序。 而 Linus Torvalds 的[意見](https://lwn.net/Articles/624148/)如下,他把這視為編譯器給我們工程師的提示,表示這段程式碼非常的 "**fragile**" >So I do agree with Heiko that we generally don't want to work around compiler bugs if we can avoid it. But sometimes the compiler bugs do end up saying "you're doing something very fragile". Maybe we should try to be less fragile here. 因此 Linux 開發者們嘗試修正這個問題避免未來再次發生。 接著回頭解釋測驗 `5` 程式碼,因為 `union` 並非 aggregate type,因此先在第一層有一個保護避免出現同樣問題。 ```cpp #define READ_ONCE(x) \ ({ \ union { \ typeof(x) __val; \ uint8_t __c[1]; \ } __u; \ __read_once_size(&(x), __u.__c, sizeof(x)); \ __u.__val; \ }) ``` `union` 會以 size 較大的 member type 作為 `union` size,因此可分成以下兩種情況 * `sizeof(__u) = sizeof(x)` if `sizeof(x) > 1` * `sizeof(__u) = 1` if `sizeof(x) == 1` 而以下 `__READ_ONCE_SIZE` 則會根據傳入的 size 做不同操作,若 size 為 1,2,4,8 ,則直接透過 `volatile` 取值,而若 size > 8 則直接使用 `memcpy` 根據 size 到記憶體位址取值,以支援長度不固定的 aggregate type 變數。 ```cpp #define __READ_ONCE_SIZE \ ({ \ switch (size) { \ case 1: \ *(uint8_t *) res = *(volatile uint8_t *) p; \ break; \ case 2: \ *(uint16_t *) res = *(volatile uint16_t *) p; \ break; \ case 4: \ *(uint32_t *) res = *(volatile uint32_t *) p; \ break; \ case 8: \ *(uint64_t *) res = *(volatile uint64_t *) p; \ break; \ default: \ memcpy((void *) res, (const void *) p, size); \ } \ }) static inline void __read_once_size(const volatile void *p, void *res, int size) { __READ_ONCE_SIZE; } ``` --- ### Linux 核心中 `READ_ONCE`/`WRITE_ONCE` 的演化 事前準備: * [Why the “volatile” type class should not be used](https://docs.kernel.org/process/volatile-considered-harmful.html) * [並行程式設計: Atomics 操作](https://hackmd.io/@sysprog/concurrency-atomics) Christian Borntraeger 針對上面提到的 bug 所提出的修改為 [commit 230fa2](https://github.com/torvalds/linux/commit/230fa253df6352af12ad0a16128760b5cb3f92df),其中 `__read_once_size` 的實作如下 ```cpp static __always_inline void __read_once_size(volatile void *p, void *res, int size) { switch (size) { case 1: *(__u8 *)res = *(volatile __u8 *)p; break; case 2: *(__u16 *)res = *(volatile __u16 *)p; break; case 4: *(__u32 *)res = *(volatile __u32 *)p; break; #ifdef CONFIG_64BIT case 8: *(__u64 *)res = *(volatile __u64 *)p; break; #endif default: barrier(); __builtin_memcpy((void *)res, (const void *)p, size); data_access_exceeds_word_size(); barrier(); } } ``` 可以發現相比於測驗題的實作多了 `barrier` 與 `data_access_exceeds_word_size` ,首先先來討論 `barrier` 。 Linux 核心中 [`barrier`](https://github.com/torvalds/linux/blob/master/include/linux/compiler.h#L86) 的實作如下,其用意為產生 compiler barrier ```cpp /* Optimization barrier */ #ifndef barrier /* The "volatile" is due to gcc bugs */ # define barrier() __asm__ __volatile__("": : :"memory") #endif ``` 參考 gcc 文件 [Extended Asm](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html) 的說明 >The "memory" clobber tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands (for example, accessing the memory pointed to by one of the input parameters). **To ensure memory contains correct values, GCC may need to flush specific register values to memory before executing the asm.** Further, the compiler does not assume that any values read from memory before an asm remain unchanged after that asm; it reloads them as needed. Using the "memory" clobber effectively forms a read/write memory barrier for the compiler. 使用 `memory` clobber 可以讓編譯器知道需要將這之前的操作都寫入到記憶體中而不是暫存器中,因此在**編譯時期**編譯器也就不能自作主張將在 `barrier` 之{前/後}的程式碼 reorder 到 `barrier` 之{後/前}。 參考[並行程式設計: Atomics 操作](https://hackmd.io/@sysprog/concurrency-atomics#%E6%98%93%E6%96%BC%E8%AA%A4%E7%94%A8-volatile-%E9%97%9C%E9%8D%B5%E5%AD%97)的舉例,考慮以下程式碼 ```cpp= // barrier.c int a, b; int i, j; int foo(){ a = i; b = j / 16; return b - a; } ``` 將 `a, b, i, j` 宣告為全域變數,因為全域變數可以被其他 object file 修改,因此編譯器不能自作主張的認為這裡都是常數計算而直接於編譯時期優化掉。以 `gcc -g -O2 -o barrier barrier.c` 編譯後 `objdump -S` 觀察反組譯結果如下,可以看到 load `j` 的操作被重排到了 load `i` 之前。 ```shell endbr64 mov 0x2e6e(%rip),%edx # load j to %edx mov 0x2e70(%rip),%ecx # load i to %ecx test %edx,%edx lea 0xf(%rdx),%eax mov %ecx,0x2e61(%rip) # store %ecx to a cmovns %edx,%eax sar $0x4,%eax mov %eax,0x2e4d(%rip) # store %eax to b sub %ecx,%eax retq nopw 0x0(%rax,%rax,1) ``` 然而若將 `barrier` 加入第 6 行與第 7 行中間,重新觀察組合語言結果如下,可以發現 `a = i` 的操作必定會在 `b = j / 16` 前完成。 ```shell endbr64 mov 0x2e76(%rip),%eax # load i to %eax mov %eax,0x2e6c(%rip) # store %eax to a mov 0x2e62(%rip),%edx # load j to %edx test %edx,%edx lea 0xf(%rdx),%eax cmovns %edx,%eax sar $0x4,%eax mov %eax,0x2e4d(%rip) # store %eax to b sub 0x2e4f(%rip),%eax retq xchg %ax,%ax ``` 因此,在 `__read_once_size` 加入 compiler barrier 可以避免當傳入值 size 大於 8 時(無法轉成 Arithmetic types)被編譯器重排指令順序,解決上述的 bug。 至於另外一個 `data_access_exceeds_word_size` 定義如下,可以知道當傳入值 size 大於 8 時,編譯時期會發出警告表示這個操作不會是 atomic 操作,這點後續會繼續提到。 ```cpp static __always_inline void data_access_exceeds_word_size(void) #ifdef __compiletime_warning __compiletime_warning("data access exceeds word size and won't be atomic") #endif ; ``` 在 [commit dd3692](https://github.com/torvalds/linux/commit/dd36929720f40f17685e841ae0d4c581c165ea60) 加入了與測驗題目相同的 `union` ,由 commit message 可以知道,如果傳入 `x` 為 const type ,因此透過 `typeof(x) __val` 會產生同樣為 const type 的 `__val` ,然而傳入 `__read_once_size` 的引數不為 const type ,因此編譯器會發出警告,透過 `union` 宣告改傳入 `__u.__c` 解決了這個問題,~~(表示我前面在舉燭)~~。 ```diff -static __always_inline void __read_once_size(volatile void *p, void *res, int size) +static __always_inline void __read_once_size(const volatile void *p, void *res, int size) { switch (size) { case 1: *(__u8 *)res = *(volatile __u8 *)p; break; #define READ_ONCE(x) \ - ({ typeof(x) __val; __read_once_size(&x, &__val, sizeof(__val)); __val; }) + ({ union { typeof(x) __val; char __c[1]; } __u; __read_once_size(&(x), __u.__c, sizeof(x)); __u.__val; }) ``` [commit d97644](https://github.com/torvalds/linux/commit/d976441f44bc5d48635d081d277aa76556ffbf8b) 的修改來解決 [Kernel Address Sanitizer (KASAN)](https://www.kernel.org/doc/html/v4.14/dev-tools/kasan.html) 所產生的警告,使用 `READ_ONCE_NOCHECK` 可以讓編譯器使用 KASAN 時會自動忽略不檢查這個函式的 memory access ,這裡不特別討論。 [commit 76ebbe](https://github.com/torvalds/linux/commit/76ebbe78f7390aee075a7f3768af197ded1bdfbb) 在 `__READ_ONCE` 回傳 `__u.__val` 之前插入一個 `smp_read_barrier_depends` 函式,如下所示 ```diff #define __READ_ONCE(x, check) \ ({ \ union { typeof(x) __val; char __c[1]; } __u; \ if (check) \ __read_once_size(&(x), __u.__c, sizeof(x)); \ else \ __read_once_size_nocheck(&(x), __u.__c, sizeof(x)); \ + smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \ __u.__val; \ }) ``` `smp_read_barrier_depends` 在大部份處理器架構下都是定義到 [`do {} while(0)`](https://elixir.bootlin.com/linux/v4.15/source/include/asm-generic/barrier.h#L54) ,代表 no operation ,編譯器自然會忽略此行程式碼。然而,針對 Alpha 架構處理器,可以在 [arch/alpha/include/asm/barrier.h](https://elixir.bootlin.com/linux/v4.15/source/arch/alpha/include/asm/barrier.h#L62) 路徑找到其定義如下 ```cpp #define read_barrier_depends() __asm__ __volatile__("mb": : :"memory") ``` `mb` 指令為 read memory barrier ,這裡需先了解什麼是 memory barrier。 參考 [並行程式設計: Atomics 操作 - Memory Ordering 和 Barrier](https://hackmd.io/@sysprog/concurrency-atomics#Memory-Ordering-%E5%92%8C-Barrier) 可知,現代處理器執行指令的順序,可能是 out of order processing,表示處理器不一定會照著編譯出來的指令依序執行 [(Sequential Consistency)](https://hackmd.io/@sysprog/concurrency-ordering#%E6%9C%80%E7%9B%B4%E8%A6%BA%E7%9A%84%E7%B4%84%E5%AE%9A-Sequential-Consistency) ,針對執行順序的規則需參考各處理器對於 [memory order](https://en.wikipedia.org/wiki/Memory_ordering) 的規定。而其中 Alpha 處理器的 memory order 又更為複雜,Linux Torvalds 曾在[信件](https://lkml.org/lkml/2012/2/1/521)中舉了以下例子說明 > The semantics is that it's a read barrier between two different reads that we want to happen in order wrt two writes on the writing side (the writing side also has to have a "smp_wmb()" to order those writes). But the reason it isn't a simple read barrier is that the reads are actually causally *dependent*, ie we have code like > > first_read = read_pointer; > smp_read_barrier_depends(); > second_read = *first_read; > > and it turns out that on pretty much all architectures (except for alpha), the ***data* *dependency*** will already guarantee that the CPU reads the thing in order. And because a read barrier can actually be quite expensive, we don't want to have a read barrier for this case. > > But alpha? Its memory consistency is so broken that even the data dependency doesn't actually guarantee cache access order. 大部份處理器再發現兩行程式碼有 data dependency 時,會保證 CPU 依照順序執行此指令,然而 Alpha 處理器卻有可能在有 data dependency 仍舊不保證其 cache access order 。因此可以推測,在 `READ_ONCE` 最後加上 read memory barrier,是為了避免對 `x` 的相關操作與最後一行的 `__u.__val` 產生 memory reordering 而導致錯誤結果。 然而,在 [commit a5460b](https://github.com/torvalds/linux/commit/a5460b5e5fb82656807840d40d3deaecad094044) 又幾乎再次把 `READ_ONCE` 調整回之前的實作 ```cpp #define __READ_ONCE(x) (*(volatile typeof(x) *)&(x)) #define READ_ONCE(x) \ ({ \ typeof(x) __x = __READ_ONCE(x); \ smp_read_barrier_depends(); \ __x; \ }) ``` 以下節錄片段 commit message >Since GCC 4.8 is fairly vintage at this point and we emit a warning if we detect it during the build, return {READ,WRITE}_ONCE() to their former glory with an implementation that is easier to understand and, crucially, more amenable to optimisation. 若使用 gcc 4.8 以前版本來編譯 Linux 核心,編譯時期就會發出警告,因此這裡回歸到原本更易讀且更容易使編譯器得以最佳化的版本。然而繞了一大圈仍舊有一點沒有被提到,那就是這個 `READ_ONCE` 實作能夠保證是 atomic 操作嗎? 更精確的問,若使用 `volatile` 就代表能保證是 atomic 操作嗎? 首先回顧 [你所不知道的 C 語言:記憶體管理、對齊及硬體特性](https://hackmd.io/@sysprog/c-memory?type=view#data-alignment) ,若資料的記憶體位址為 unaligned,在某些架構下(如ARM)需要特別去實作 [unaligned memory access](https://www.kernel.org/doc/html/latest/core-api/unaligned-memory-access.html) ,那這類的操作就不一定保證為 atomic 。另外參考 [並行程式設計: Atomics 操作 - 背後作祟的 cache](https://hackmd.io/@sysprog/concurrency-atomics#%E8%83%8C%E5%BE%8C%E4%BD%9C%E7%A5%9F%E7%9A%84-cache) 說明,即便在 64-bits CPU 架構下,若資料座落在兩個 cacheline(通常是 64 bytes) 上,則存取此資料同樣不能保證為 atomic 操作。 因此可以知道針對不同 CPU 架構所能保證的 atomic operation 是不同的,參考 [Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html) 其中 8.1.1 Guaranteed Atomic Operations > The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will > always be carried out atomically: > • Reading or writing a byte. > • Reading or writing a word aligned on a 16-bit boundary. > • Reading or writing a doubleword aligned on a 32-bit boundary. > The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically: > • Reading or writing a quadword aligned on a 64-bit boundary. > • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus. > The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically: > • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line. 可以發現 x86 架構下只要資料沒有跨越兩個 cacheline ,即使是 unaligned 16, 32, 64-bit access ,也保證是 atomic 操作。 回頭來看,`volatile` 只能保證讓編譯器產生確實的去記憶體讀取資料的指令,並沒有保證是 atomic 操作,實際上也確實不是,但是因為有太多人誤用了,因此 [Why the “volatile” type class should not be used](https://docs.kernel.org/process/volatile-considered-harmful.html) 在開頭就提到 >C programmers have often taken volatile to mean that the variable could be changed outside of the current thread of execution; as a result, they are sometimes tempted to use it in kernel code when shared data structures are being used. In other words, **they have been known to treat volatile types as a sort of easy atomic variable, which they are not.** The use of volatile in kernel code is almost never correct 補充這些知識後,接著繼續看到 `READ_ONCE` 的演化,[commit 9e343b](https://github.com/torvalds/linux/commit/9e343b467c70379e66b8b771d96f03ae23eba351) 標題即為 **READ_ONCE: Enforce atomicity for {READ,WRITE}_ONCE() memory accesses** ,首先先看到 commit message: >{READ,WRITE}_ONCE() cannot guarantee atomicity for arbitrary data sizes. This can be surprising to callers that might incorrectly be expecting atomicity for accesses to aggregate structures, although there are other callers where tearing is actually permissable (e.g. if they are using something akin to sequence locking to protect the access). 開頭就說明很多人誤以為傳入任意 data sizes ,`READ_ONCE` 都是 atomic 操作,但很顯然的並不是,只是湊巧在某些情況下,caller 都有用 lock 來保護 `READ_ONCE` 的操作,也因此沒有其他執行緒會來打斷,也就剛好不會出錯。 >Linus sayeth: > > | We could also look at being stricter for the normal READ/WRITE_ONCE(), > | and require that they are > | > | (a) regular integer types > | > | (b) fit in an atomic word > | > | We actually did (b) for a while, until we noticed that we do it on > | loff_t's etc and relaxed the rules. But maybe we could have a > | "non-atomic" version of READ/WRITE_ONCE() that is used for the > | questionable cases? 關於 (b) 的 atomic word 等等搭配程式碼解釋,最終實作方案為 (a) ,並且另外提供了 non-atomic 版本的 `READ/WRITE_ONCE` ,相關程式碼如下 ```cpp /* * Use __READ_ONCE() instead of READ_ONCE() if you do not require any * atomicity or dependency ordering guarantees. Note that this may result * in tears! */ #define __READ_ONCE(x) (*(const volatile typeof(x) *)&(x)) #define __READ_ONCE_SCALAR(x) \ ({ \ typeof(x) __x = __READ_ONCE(x); \ smp_read_barrier_depends(); \ __x; \ }) #define READ_ONCE(x) \ ({ \ compiletime_assert_rwonce_type(x); \ __READ_ONCE_SCALAR(x); \ }) ``` 可以看到 `READ_ONCE` 又再新增了一層 `__READ_ONCE_SCALAR` 包裝,保證當 `x` 為 scalar type 時,`__READ_ONCE_SCALAR` 對 `x` 為 atomic 操作,若不需要保證 atomicity 則可以直接使用 `__READ_ONCE`。 至於 `compiletime_assert_rwonce_type` 函式定義如下 ```cpp /* * Yes, this permits 64-bit accesses on 32-bit architectures. These will * actually be atomic in many cases (namely x86), but for others we rely on * the access being split into 2x32-bit accesses for a 32-bit quantity (e.g. * a virtual address) and a strong prevailing wind. */ #define compiletime_assert_rwonce_type(t) \ compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \ "Unsupported access size for {READ,WRITE}_ONCE().") ``` 當傳入的 `sizeof(t)` 不為 `__native_word` 或是 `sizeof(long long)` 時,就會在編譯時期被擋下來。而 `__native_word` 被定義在 [include/linux/compiler_types.h](https://elixir.bootlin.com/linux/v5.13.19/source/include/linux/compiler_types.h#L289) 內 ```cpp /* Is this type a native word size -- useful for atomic operations */ #define __native_word(t) \ (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \ sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long)) ``` ,顯然的包含 `char`, `short`, `int`, `long` 無論在哪一種架構下,無論 32-bit, 64-bit CPU,都保證是 atomic 操作,因此可稱作 atomic word。然而 `long long` 就不一定了,某些 32-bit CPU 環境下 `long long` 為 64 bits,因此不能保證對其操作為 atmoic。然而為什麼仍舊這樣設計呢,原因可以從 commit message 看到 >The slight snag is that we also have to support 64-bit accesses on 32-bit architectures, as these appear to be widespread and tend to work out ok if either the architecture supports atomic 64-bit accesses (x86, armv7) or if the variable being accesses represents a virtual address and therefore only requires 32-bit atomicity in practice. 因為 `long long` 在目前使用環境下大部份可以保證為 atomic,而且實在太常用到了(如前一段 commit message 內提到的 `loff_t` 就是 `long long` type),因此保留了 `sizeof(long long)` 的 data type 可以傳入 `READ/WRITE_ONCE` 使用。 最後在 [commit e506ea](https://github.com/torvalds/linux/commit/e506ea451254ab17e0bf918ca36232fec2a9b10c) 提到目前 `READ/WRITE_ONCE` 已經被移到 [include/asm-generic/rwonce.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/rwonce.h) 中,以準備後續可針對不同架構有不同的 `READ/WRITE_ONCE` 實作。