# 2021q1 Homework1 (quiz1) contributed by < `ambersun1234` > > [2021q1 第 1 週測驗題](https://hackmd.io/@sysprog/linux2021-quiz1) ## 開發環境 ```shell $ uname -a Linux station 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux $ gcc -v gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04) ``` ## 解釋程式運作原理 ```c #include <stdio.h> #include <stdlib.h> #include <stdbool.h> typedef struct __node { int value; struct __node *next; } node_t; static inline void list_add_node_t(node_t **list, node_t *node_t) { node_t->next = *list; *list = node_t; } static inline void list_concat(node_t **left, node_t *right) { while (*left) left = &((*left)->next); *left = right; } void quicksort(node_t **list) { if (!*list) return; node_t *pivot = *list; int value = pivot->value; node_t *p = pivot->next; pivot->next = NULL; node_t *left = NULL, *right = NULL; while (p) { node_t *n = p; p = p->next; list_add_node_t(n->value > value ? &right : &left, n); } quicksort(&left); quicksort(&right); node_t *result = NULL; list_concat(&result, left); list_concat(&result, pivot); list_concat(&result, right); *list = result; } ``` `quicksort` 函式取 list 中第一個元素作為 pivot value,在迴圈內部從第二個元素開始依序比較;若元素值大於 pivot 則將其加入 `right` 的 list 中,反之則加入至 `left`。 `list_add_node_t` 採用 insert head 的方式,即將原本的 list 串接在後面 ```graphviz digraph foo { rankdir=LR; node [shape=record]; a [label="{<data> 12| <ref>}", width=1.2]; b [label="right(NULL)}"]; a:ref:c -> b [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false, arrowsize=1.2]; } ``` 實際上的串接方式就如下圖所示,足以證明為何 `left`, `right` 初始值為 NULL。 ```graphviz digraph foo { rankdir=LR; node [shape=record]; a [label="{<data> 12| <ref>}", width=1.2]; b [label="{<data> 13(right)| <ref>}"]; c [label="{<data> 14| <ref>}"]; d [label="NULL", shape="box"]; a:ref:c -> b [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false, arrowsize=1.2]; b:ref:c -> c [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false]; c:ref:c -> d [arrowhead=vee, arrowtail=dot, dir=both, tailclip=false]; } ``` divide 完之後就要把 list 合併,`list_concat` 就是單純將 right 串接在 left 後面而已;總共需要合併 3 次,left, pivot, right 因為 pivot 並未連接在任何 list 後面。 為了可以進行實驗,需要補齊上述程式碼; `list_make_node_t` 以及 `list_free` 的實做程式碼如下 ```cpp static void list_free(node_t **list) { while (*list) { node_t *tmp = (*list)->next; free(list); list = &tmp; } } static node_t* list_make_node_t(node_t *list, int value) { node_t *newNode = (node_t *)malloc(sizeof(node_t)); newNode->value = value; newNode->next = list; return newNode; } ``` `list_free` 從頭走訪到 list 尾端,依序釋放記憶體 `list_make_node_t` 新增一個 node,其 value 為 `random() % 1024` 所產生。將新的 node 放置於 list head,next 則接上原本的 list。 --- ## Pseudorandom number generator + 執行 list.c 中可發現輸出結果都一樣,這是因為 [pseudorandom number generator(PRNG)](https://en.wikipedia.org/wiki/Pseudorandom_number_generator) 並不是真正的亂數,他受到初始值(PRNG's seed)影響所致。 + 一開始我採用 [multiplicative LCG](https://en.wikipedia.org/wiki/Linear_congruential_generator) 即 `c=0` 的 LCG + LCG 數學公式定義如下 $$y_{i+1}\ =\ [ay_i + c]\ mod\ m$$ + 由上述公式可得知一個基本訊息,第 i+1 結果取決於第 i 次的結果,而在 y0 的挑選上如果是固定的,那不管執行幾次都是一樣的結果 + 更改過後的程式如下 ```cpp static int MLCG(int seed) { int a = 15997, m = 65536, c = 0; return (a*seed + c)%m; } int main(int argc, char **argv) { srand(time(NULL)); size_t count = 20; int previousMLCG = random() % 1024; node_t *list = NULL; while (count--) { previousMLCG = MLCG(previousMLCG); list = list_make_node_t(list, previousMLCG); } } ``` 程式執行結果為 ``` $ ./a.out 1614245903(time(NULL) 輸出結果) NOT IN ORDER : 36439 13411 21599 18699 5159 43635 12207 58011 30967 43907 45567 28459 3271 51091 43855 65211 8087 36515 60319 41291 IN ORDER : 3271 5159 8087 12207 13411 18699 21599 28459 30967 36439 36515 41291 43635 43855 43907 45567 51091 58011 60319 65211 $ ./a.out 1614245904(time(NULL) 輸出結果) NOT IN ORDER : 36439 13411 21599 18699 5159 43635 12207 58011 30967 43907 45567 28459 3271 51091 43855 65211 8087 36515 60319 41291 IN ORDER : 3271 5159 8087 12207 13411 18699 21599 28459 30967 36439 36515 41291 43635 43855 43907 45567 51091 58011 60319 65211 ``` 結果驚訝的發現輸出亂數的部份竟是如此的相似,於是反過頭檢查 `time(NULL)` 輸出結果只相差 1。 根據 [time(2) man page](https://linux.die.net/man/2/time) > time() returns the time as the number of seconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC). > If t is non-NULL, the return value is also stored in the memory pointed to by t. 因為 time() 精度只有到 seconds 所以會導致輸出結果太接近 既然使用時間是不可靠的,於是我就嘗試去取一些常常變動的數值,受到 [@linD026](https://hackmd.io/@linD026/2021q1_quiz1) 所啟發,可以用地址去當作 srand 的參數。`malloc` 出來的地址經過實驗時常變動,所以很適合拿來當作亂數種子 ### ASLR(Address space layout randomization) + [ASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization) 是一種電腦資訊安全的技巧,避免 [memory corruption](https://en.wikipedia.org/wiki/Memory_corruption) 等各種資安問題。ASLR 機制隨機分配 process 的地址(stack, heap, libraries),使得地址變得難以預測 > 要注意 ASLR 僅是一種針對攻擊的 mitigation,並不能避免資安問題 > [name=HexRabbit][color=purple] + [return-to-libc attacks](https://en.wikipedia.org/wiki/Return-to-libc_attack) 一般應用於 buffer overflow attack 中; 攻擊手法為將 stack 中的返回地址替換為其他地址。使用 ASLR 可以讓 stack 的地址變得不可預測,使得攻擊變得更加困難 > ASLR 0: 關閉 > ASLR 1: Conservative Randomization > ASLR 2: Full Randomization + [Position-independent executable(PIE)](https://zh.wikipedia.org/wiki/%E5%9C%B0%E5%9D%80%E6%97%A0%E5%85%B3%E4%BB%A3%E7%A0%81) 又稱作 PIC,通常搭配 ASLR 使用,達到 [address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization) 考慮以下實驗結果 ```shell $ gcc -no-pie a.c $ ./a.out & $ cat /proc/PID/maps ``` ``` ASLR 1, no PIE 00400000-00401000 r--p 00000000 08:10 40308 /home/ambersun/ASLR/a.out 00401000-00402000 r-xp 00001000 08:10 40308 /home/ambersun/ASLR/a.out 00402000-00403000 r--p 00002000 08:10 40308 /home/ambersun/ASLR/a.out 00403000-00404000 r--p 00002000 08:10 40308 /home/ambersun/ASLR/a.out 00404000-00405000 rw-p 00003000 08:10 40308 /home/ambersun/ASLR/a.out 00405000-00426000 rw-p 00000000 00:00 0 [heap] 7f425fac9000-7f425faee000 r--p 00000000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f425faee000-7f425fc66000 r-xp 00025000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f425fc66000-7f425fcb0000 r--p 0019d000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f425fcb0000-7f425fcb1000 ---p 001e7000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f425fcb1000-7f425fcb4000 r--p 001e7000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f425fcb4000-7f425fcb7000 rw-p 001ea000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f425fcb7000-7f425fcbd000 rw-p 00000000 00:00 0 7f425fcc5000-7f425fcc6000 r--p 00000000 08:10 36843 /usr/lib/x86_64-linux-gnu/ld-2.31.so 7f425fcc6000-7f425fce9000 r-xp 00001000 08:10 36843 /usr/lib/x86_64-linux-gnu/ld-2.31.so 7f425fce9000-7f425fcf1000 r--p 00024000 08:10 36843 /usr/lib/x86_64-linux-gnu/ld-2.31.so 7f425fcf2000-7f425fcf3000 r--p 0002c000 08:10 36843 /usr/lib/x86_64-linux-gnu/ld-2.31.so 7f425fcf3000-7f425fcf4000 rw-p 0002d000 08:10 36843 /usr/lib/x86_64-linux-gnu/ld-2.31.so 7f425fcf4000-7f425fcf5000 rw-p 00000000 00:00 0 7ffe726da000-7ffe726fb000 rw-p 00000000 00:00 0 [stack] 7ffe7271c000-7ffe7271f000 r--p 00000000 00:00 0 [vvar] 7ffe7271f000-7ffe72721000 r-xp 00000000 00:00 0 [vdso] ``` ``` ASLR 1, no PIE 00400000-00401000 r--p 00000000 08:10 40308 /home/ambersun/ASLR/a.out 00401000-00402000 r-xp 00001000 08:10 40308 /home/ambersun/ASLR/a.out 00402000-00403000 r--p 00002000 08:10 40308 /home/ambersun/ASLR/a.out 00403000-00404000 r--p 00002000 08:10 40308 /home/ambersun/ASLR/a.out 00404000-00405000 rw-p 00003000 08:10 40308 /home/ambersun/ASLR/a.out 00405000-00426000 rw-p 00000000 00:00 0 [heap] 7f0600b54000-7f0600b79000 r--p 00000000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f0600b79000-7f0600cf1000 r-xp 00025000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f0600cf1000-7f0600d3b000 r--p 0019d000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f0600d3b000-7f0600d3c000 ---p 001e7000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f0600d3c000-7f0600d3f000 r--p 001e7000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f0600d3f000-7f0600d42000 rw-p 001ea000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f0600d42000-7f0600d48000 rw-p 00000000 00:00 0 7f0600d50000-7f0600d51000 r--p 00000000 08:10 36843 /usr/lib/x86_64-linux-gnu/ld-2.31.so 7f0600d51000-7f0600d74000 r-xp 00001000 08:10 36843 /usr/lib/x86_64-linux-gnu/ld-2.31.so 7f0600d74000-7f0600d7c000 r--p 00024000 08:10 36843 /usr/lib/x86_64-linux-gnu/ld-2.31.so 7f0600d7d000-7f0600d7e000 r--p 0002c000 08:10 36843 /usr/lib/x86_64-linux-gnu/ld-2.31.so 7f0600d7e000-7f0600d7f000 rw-p 0002d000 08:10 36843 /usr/lib/x86_64-linux-gnu/ld-2.31.so 7f0600d7f000-7f0600d80000 rw-p 00000000 00:00 0 7ffff7f2b000-7ffff7f4c000 rw-p 00000000 00:00 0 [stack] 7ffff7f93000-7ffff7f96000 r--p 00000000 00:00 0 [vvar] 7ffff7f96000-7ffff7f98000 r-xp 00000000 00:00 0 [vdso] ``` 可以看到 heap 的位置,兩者是一樣的 `00405000-00426000` > c.f 64 位元下,no PIE 起點為 `0x00400000` > c.f 32 位元下,no PIE 起點為 `0x08048000` gcc 預設 PIE 是開啟的,需使用 `-no-pie` 關閉,參見 [gcc/defaults.h](https://github.com/gcc-mirror/gcc/blob/16e2427f50c208dfe07d07f18009969502c25dc8/gcc/defaults.h#L1233) > [关于 Linux 下 ASLR 与 PIE 的一些理解](https://www.cnblogs.com/rec0rd/p/7646857.html) 參考 [Performance and Entropy of Various ASLR Implementations](http://pages.cs.wisc.edu/~riccardo/736finalpaper.pdf),分析 ASLR 的 entropy,考慮 stack pointer 的位置 考慮以下程式碼 ```cpp static volatile void* getsp(void) { volatile void *sp; __asm__ __volatile__ ("movq %%rsp,%0" : "=r" (sp) : /* No input */); return sp; } int main(int argc, const char *argv[]) { printf("sp: %p\n", getsp()); while (1); return 0; } ``` 得到以下執行結果 ``` sp: 0x7ffeb6b5bd10 ``` 對照 `/proc/PID/maps` ``` 00400000-00401000 r--p 00000000 08:10 40306 /home/ambersun/ASLR/a.out 00401000-00402000 r-xp 00001000 08:10 40306 /home/ambersun/ASLR/a.out 00402000-00403000 r--p 00002000 08:10 40306 /home/ambersun/ASLR/a.out 00403000-00404000 r--p 00002000 08:10 40306 /home/ambersun/ASLR/a.out 00404000-00405000 rw-p 00003000 08:10 40306 /home/ambersun/ASLR/a.out 00405000-00426000 rw-p 00000000 00:00 0 [heap] 7f14bdaaf000-7f14bdad4000 r--p 00000000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f14bdad4000-7f14bdc4c000 r-xp 00025000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f14bdc4c000-7f14bdc96000 r--p 0019d000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f14bdc96000-7f14bdc97000 ---p 001e7000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f14bdc97000-7f14bdc9a000 r--p 001e7000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f14bdc9a000-7f14bdc9d000 rw-p 001ea000 08:10 37132 /usr/lib/x86_64-linux-gnu/libc-2.31.so 7f14bdc9d000-7f14bdca3000 rw-p 00000000 00:00 0 7f14bdcab000-7f14bdcac000 r--p 00000000 08:10 36843 /usr/lib/x86_64-linux-gnu/ld-2.31.so 7f14bdcac000-7f14bdccf000 r-xp 00001000 08:10 36843 /usr/lib/x86_64-linux-gnu/ld-2.31.so 7f14bdccf000-7f14bdcd7000 r--p 00024000 08:10 36843 /usr/lib/x86_64-linux-gnu/ld-2.31.so 7f14bdcd8000-7f14bdcd9000 r--p 0002c000 08:10 36843 /usr/lib/x86_64-linux-gnu/ld-2.31.so 7f14bdcd9000-7f14bdcda000 rw-p 0002d000 08:10 36843 /usr/lib/x86_64-linux-gnu/ld-2.31.so 7f14bdcda000-7f14bdcdb000 rw-p 00000000 00:00 0 7ffeb6b3c000-7ffeb6b5d000 rw-p 00000000 00:00 0 [stack] 7ffeb6bd4000-7ffeb6bd7000 r--p 00000000 00:00 0 [vvar] 7ffeb6bd7000-7ffeb6bd9000 r-xp 00000000 00:00 0 [vdso] ``` 不難發現 sp 的位置有一點誤差,我想這也是可接受的範圍內(`0x7ffeb6b5d000`, `0x7ffeb6b5bd10`) 值得注意的是,上述 16 進位輸出共有 `48` 位元,而不是 64 位元 可參考 [why virtual address are 48 bits not 64 bits? [duplicate] ](https://stackoverflow.com/questions/63975447/why-virtual-address-are-48-bits-not-64-bits) 多次實驗觀察 stack pointer 可得以下結果 ``` sp: 0x7fff296fa310 sp: 0x7fffd5825950 sp: 0x7ffc725181d0 sp: 0x7ffe6da0dfc0 sp: 0x7ffcd8fc69d0 ... ``` 可以發現到這些 virtual address 的 MSB 部分都是相同,而最後 4 bits 都是 0。這也就是為甚麼論文中實驗並不採用全部位元的原因 > For Debian, we observed 30 bits of entropy in the stack. This was bits 4 to 34 (least significant to most significant) > 有興趣的話也可以觀察看看 libc 或相關 library 的地址在開啟 ASLR 下有多少 bits 的 entropy > [name=HexRabbit][color=purple] 考慮以下測試程式碼 - [ ] main.c ```cpp int main(int argc, const char *argv[]) { if (fork() == 0) { char *str[] = {"./test", NULL, NULL}; char *envp[] = {0}; if (execve("./test", str, envp) < 0) { perror("error"); exit(0); } } return 0; } ``` - [ ] test.c ```cpp int main(int argc, const char *argv[]) { register void *p asm("sp"); printf("%p\n", p); return 0; } ``` 為了驗證 ASLR 的 entropy,我參照論文設計了一個簡單的實驗,內容是取得 stack pointer 的位置去分析(搭配 execve 以及 fork) 根據 [man execve](https://man7.org/linux/man-pages/man2/execve.2.html) > execve() executes the program referred to by pathname. This causes the program that is currently being run by the calling process to be replaced with a new program, with newly initialized stack, heap, and (initialized and uninitialized) data segments. execve 會重新 new stack, heap 以及 data 區塊,藉由分析 stack pointer 我們可以知道 ASLR 的 entropy。這次實驗總共測試 1000000(一百萬) 次,分別在 32 位元(`raspberry pi zero wh(rapbian)`),以及 64 位元(`ubuntu 20.04 LTS`) 下測試 完整程式碼可參考 [linux2021q1_quiz1/ASLR](https://github.com/ambersun1234/linuxkernel_internals/tree/master/2021q1_quiz1/ASLR) ### 64 位元 ``` 地址: 重複次數 ... 0x7ffeb9c5d880: 2 0x7fff0baab970: 2 0x7ffeada07720: 2 0x7ffdcce3da80: 2 0x7ffe8c644020: 2 0x7ffec6822460: 2 0x7ffe54d41c80: 2 0x7fffe12b40a0: 2 0x7ffda39f9960: 2 0x7fff7dabd5a0: 2 0x7ffcdb205a70: 3 ``` 總共統計結果 ``` 重複次數: 有多少地址重複n次 1: 999083 2: 457 3: 1 ``` 在1百萬重複次數中,僅有 917(457\*2+1\*3) 個地址重複到 ### 32 位元 ``` 地址: 重複次數 ... 0xbea55dd8: 544 0xbec38dd8: 545 0xbeb92dd8: 545 0xbeb4fdd8: 545 0xbeb1ddd8: 545 0xbea92dd8: 547 0xbee5cdd8: 549 0xbeaf1dd8: 551 0xbe9dedd8: 553 0xbe931dd8: 556 0xbeec9dd8: 559 0xbe927dd8: 566 0xbeb1fdd8: 570 0xbe992dd8: 572 ``` 可以看到在 32 位元架構下,重複機率很高,意味著攻擊者相較於 64 位元架構中,更容易猜到地址。 更改過的程式碼如下: ```cpp int main(int argc, char **argv) { int *tmp = (int *)malloc(sizeof(int)); int a = (intptr_t)tmp; srand(a); free(tmp); size_t count = 20; node_t *list = NULL; while (count--) { list = list_make_node_t(list, random() % 1024); } ``` 首先 malloc 一個空間出來,將其地址轉為儲存起來,使用 `intptr_t` 強制轉型(將 pointer 轉型為 integer),並且使用其作為亂數種子,使用完之後當然要把他 free 掉避免 memory leak。 :::warning TODO: 回顧 [「Linux 核心設計」 課程說明 (2021年春季)](https://docs.google.com/presentation/d/1bJFwpg20GCJmOcdOt6NRYGU3HKVX-Dt0fsuohqTY8x4/edit?usp=sharing) 第 19 頁,嘗試學習早期 mimalloc 運用 ASLR 的手法。 :notes: jserv ::: 考慮 ASLR 以及 PIE 可以將程式改寫如下,參考 [mimalloc/random.c](https://github.com/microsoft/mimalloc/blob/92ead2d88061dde1264800b389b744ac1b79cf39/src/random.c#L255) ```cpp int main(int argc, char **argv) { srand((uintptr_t)&main); size_t count = 20; node_t *list = NULL; while (count--) {list = list_make_node_t(list, random() % 1024); } ``` 執行結果如下 ``` $ ./a.out 568087200 NOT IN ORDER : 537 827 166 485 66 417 108 305 462 893 98 688 814 861 422 513 477 327 237 170 IN ORDER : 66 98 108 166 170 237 305 327 417 422 462 477 485 513 537 688 814 827 861 893 $ ./a.out -1303850336 NOT IN ORDER : 111 782 625 900 491 356 273 901 803 125 278 511 612 108 51 450 278 473 436 495 IN ORDER : 51 108 111 125 273 278 278 356 436 450 473 491 495 511 612 625 782 803 900 901 ``` --- ## 參照 [Optimized QuickSort — C Implementation (Non-Recursive)](https://alienryderflex.com/quicksort/) 並改寫 quick sort 文章中所實作的程式碼如下 ```cpp bool quickSort(int *arr, int elements) { #define MAX_LEVELS 1000 int piv, beg[MAX_LEVELS], end[MAX_LEVELS], i=0, L, R ; beg[0]=0; end[0]=elements; while (i>=0) { L=beg[i]; R=end[i]-1; if (L<R) { piv=arr[L]; if (i==MAX_LEVELS-1) return NO; while (L<R) { while (arr[R]>=piv && L<R) R--; if (L<R) arr[L++]=arr[R]; while (arr[L]<=piv && L<R) L++; if (L<R) arr[R--]=arr[L]; } arr[L]=piv; beg[i+1]=L+1; end[i+1]=end[i]; end[i++]=L; } else i--; } return YES; } ``` + 基本上的思路是,由於 function call 是耗費時間的,因此該實作避免 call function(recursion) 進而減少時間成本。另外,相較於遞迴版的程式多了 `beg[], end[]` 陣列的使用,因為遞迴可以使用 private array 簡化。文中提到 private array 相比 real array 來說慢得多, > Recursive functions look neat and simple in part because they don’t have to have big arrays like my beg[] and end[]. But all they’re really doing is using the stack as their own private array. This is much slower than using a real array, and could cause stack overflow on some systems, crashing the app that called the quicksort. + 在計算機組織中我們有學到,function call 就是把目前暫存器內容處存起來,然後跳到函式位置進行執行,等到返回時,將原先暫存器內容放回,所以單純以變數的角度去看不會比較慢,唯一可能影響性能的只有在 push/pop stack 當中而已 + 為了程式撰寫的方便性,在 node 的定義上新增一個 previous ```c typedef struct __node { int value; struct __node *next; struct __node *previous; } node_t; ``` + 參照以上程式碼改寫出的非遞迴版 quick sort 如下 ```c void quicksort_iterative(node_t *list, node_t *tail, int count) { #define MAX_LEVELS 100 int i = 0; node_t *tl, *tr; node_t *beg[MAX_LEVELS]; node_t *end[MAX_LEVELS]; beg[0] = list; end[0] = tail; while(i >= 0) { tl = beg[i]; tr = end[i]; if (tl != tr && tl && tr) { int pivot = tl->value; if (i == count - 1) return; while(tl != tr && tl && tr) { while (tr->value >= pivot && tl != tr) tr = tr->previous; if (tl != tr) { tl->value = tr->value; tl = tl->next; } while (tl->value <= pivot && tl != tr) tl = tl->next; if (tl != tr) { tr->value = tl->value; tr = tr->previous; } } tl->value = pivot; beg[i + 1] = tl->next; end[i + 1] = end[i]; end[i] = tl; i++; } else i--; } return; } ``` :::warning TODO: * beg[MAX_LEVELS] 和 end[MAX_LEVELS] 能否只用其中一者呢? * 及早排除 list 只包含一個節點的狀況 * 對 `MAX_LEVELS` 的範圍進行數學分析 :notes: jserv ::: ## 探討 [linux-list](https://github.com/sysprog21/linux-list/blob/master/examples/quick-sort.c) 與 [Linux list](https://github.com/torvalds/linux/blob/master/include/linux/list.h) 的差別 + 最明顯的差別莫過於 sysprog21 的 list 實作是 doubly-linked list 而 Linux 本身的實作是 `circular linked list` + 接下來考慮 `list_add` 的實作 ```c // linux static inline void __list_add(struct list_head *new, struct list_head *prev, struct list_head *next) { if (!__list_add_valid(new, prev, next)) return; next->prev = new; new->next = next; new->prev = prev; WRITE_ONCE(prev->next, new); } --- // sysprog21 static inline void list_add(struct list_head *node, struct list_head *head) { struct list_head *next = head->next; next->prev = node; node->next = next; node->prev = head; head->next = node; } ``` 我們可以發現,對於 `head->next = new` 這行的做法不一樣,差別在於 linux 使用了 `WRITE_ONCE` macro + [WRITE_ONCE](https://github.com/torvalds/linux/blob/614cb5894306cfa2c7d9b6168182876ff5948735/tools/include/linux/compiler.h#L184) 的功用,根據註解 > Prevent the compiler from merging or refetching reads or writes. The > compiler is also forbidden from reordering successive instances of > READ_ONCE and WRITE_ONCE, but only when the compiler is aware of some > particular ordering. One way to make the compiler aware of ordering is to > put the two invocations of READ_ONCE or WRITE_ONCE in different C > statements. > > These two macros will also work on aggregate data types like structs or > unions. If the size of the accessed data type exceeds the word size of > the machine (e.g., 32 bits or 64 bits) READ_ONCE() and WRITE_ONCE() will > fall back to memcpy and print a compile-time warning. > Their two major use cases are: (1) Mediating communication between > process-level code and irq/NMI handlers, all running on the same CPU, > and (2) Ensuring that the compiler does not fold, spindle, or otherwise > mutilate accesses that either do not require ordering or that interact > with an explicit memory barrier or atomic instruction that provides the > required ordering. 就是為了避免 compiler 做優化這件事,比如說,在不影響程式正確性下互調程式碼,讓 `寫入資料` 與 `讀取資料` 互換。在多執行緒的架構下,同一時間可能有多個 process 對同一資料進行存取,可能導致 data race。可參考 [LINUX KERNEL MEMORY BARRIERS](https://www.kernel.org/doc/Documentation/memory-barriers.txt) ```c static __always_inline void __write_once_size(volatile void *p, void *res, int size) { switch (size) { case 1: *(volatile __u8_alias_t *) p = *(__u8_alias_t *) res; break; case 2: *(volatile __u16_alias_t *) p = *(__u16_alias_t *) res; break; case 4: *(volatile __u32_alias_t *) p = *(__u32_alias_t *) res; break; case 8: *(volatile __u64_alias_t *) p = *(__u64_alias_t *) res; break; default: barrier(); __builtin_memcpy((void *)p, (const void *)res, size); barrier(); } } #define WRITE_ONCE(x, val) \ ({ \ union { typeof(x) __val; char __c[1]; } __u = \ { .__val = (val) }; \ __write_once_size(&(x), __u.__c, sizeof(x)); \ __u.__val; \ }) ``` [volatile](https://zh.wikipedia.org/wiki/Volatile%E5%8F%98%E9%87%8F) 關鍵字可以避免存取變數被過度優化(使用暫存器資料),因為變數有可能在另一個執行緒中被改變,導致資料不一致的行為發生。 :::danger 上述關於 volatile 關鍵字的考量不充分,請閱讀規格書和思考 embedded 的場景。 :notes: jserv ::: <hr> 那為甚麼又只有 `WRITE_ONCE(prev->next, new);` 需要 WRITE_ONCE 呢? 根據 [include/linux/list.h 7f5f873 commit message](https://github.com/torvalds/linux/commit/7f5f873c6a0772970d5fee1f364231207051ecd8) 中的線索 > Unfortunately, the compiler is within its rights to (for example) use > byte-at-a-time writes to update the pointer, which would fatally confuse > concurrent readers 由上可知,compiler 有可能會把 64 位元的資料分成兩個 32 位元進行讀取,導致資料未更新完全的情況下被存取,為了達成 atomic 所以需要 WRITE_ONCE 的協助,可參考 [Load/Store tearing in this day and age?](https://lwn.net/Articles/793895/) 總結來說,sysprog21 的 list 版本簡化了多執行緒下的必要保護措施。 > [WRITE_ONCE in linux kernel lists](https://stackoverflow.com/questions/34988277/write-once-in-linux-kernel-lists) > [Linux内核中的READ_ONCE和WRITE_ONCE宏](https://zhuanlan.zhihu.com/p/344256943) ## [A COMPARATIVE STUDY OF LINKED LIST SORTING ALGORITHMS](https://pages.mtu.edu/~shene/PUBLICATIONS/1996/3Conline.pdf) + ###### tags: `linux2021`