Linux Linked List

--- tags: System Software --- # Linux Linked List ## Material > [2021q1 第 1 週測驗題](https://hackmd.io/@sysprog/linux2021-quiz1) > [GitHub](https://github.com/RinHizakura/linux-list) ## Introduction [linux/list.h](https://github.com/torvalds/linux/blob/master/include/linux/list.h) 是 Linux 作業系統中相當實用的 doubly linked list 封裝，只要在自定義的結構中加入 `struct list_head`，就可以搭配 Linux 中一系列的 linked list 操作(詳見[The Linux Kernel API - List Management Functions](https://www.kernel.org/doc/html/latest/core-api/kernel-api.html)) 來建立自定義結構的 linked list。在撰寫 kernel module 時會有機會跟其打交道。除了會使用到相關的 API 之外，其中的實作細節也很值得我們細細品味，因此，何嘗不是個好注意呢來看看程式碼呢? ## Start From Simple 我們可以先從 [linux-list](https://github.com/RinHizakura/linux-list)，由成大的[黃敬群(jserv)](https://zh.wikipedia.org/wiki/%E9%BB%83%E6%95%AC%E7%BE%A4) 教授所建立的專案來一探究竟。這是一個和 Linux linked list 很相似的實作，但做了適當的簡化，並只留下一部分函數。 ### Code Explanation #### `container_of()` ```cpp #if defined(__GNUC__) #define __LIST_HAVE_TYPEOF 1 #endif #ifndef container_of #ifdef __LIST_HAVE_TYPEOF #define container_of(ptr, type, member) \ __extension__({ \ const __typeof__(((type *) 0)->member) *__pmember = (ptr); \ (type *) ((char *) __pmember - offsetof(type, member)); \ }) #else #define container_of(ptr, type, member) \ ((type *) ((char *) (ptr) -offsetof(type, member))) #endif #endif ``` * 如果 GNUC extension 為開啟狀態(`#if defined(__GNUC__)`)， `__LIST_HAVE_TYPEOF` 會被設為 1 * `__extension__`: 根據 [3.4 Options Controlling C Dialect](https://gcc.gnu.org/onlinedocs/gcc/C-Dialect-Options.html)，當語法被要求和 [ANSI C](https://en.wikipedia.org/wiki/ANSI_C) 相容時(如 gcc 中加入 `-ansi` 參數)，像是 `typeof` 等 gcc extension 將不能被使用。而在 [6.48 Alternate Keywords](https://gcc.gnu.org/onlinedocs/gcc/Alternate-Keywords.html#index-_005f_005fextension_005f_005f) 則提到可以用替換關鍵字的方法實作沒有 extension 時的替代方案。但意思就是這種設計會存在用到 extension 和沒用到的兩種實作，編譯器會對於使用到 extension 的部份扔出警告，可以透過 `__extension__` 避免警告的產生。 * 使用 extension 的實作版本: * `({ ... })` 是 [statement expression](https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html) 的語法 * 透過 `((type *) 0)->member` 取得目標 member 的 instance，[typeof](https://gcc.gnu.org/onlinedocs/gcc/Typeof.html) 會取得其型別。(注意到 pointer 0 沒有被 dereference 且 instance 沒被 access，所以合法)，於是可以得到正確型別的 `ptr` 指標，存入 `__pmember` * 此步驟是為了做 type checking，確保輸入的 `ptr` 是 `type` 中的 `member` 型別之指標 * 將 `__pmember` 減去該成員在 struct 中的偏移就可以得到 struct 的開頭位址 * [`offsetof`](https://www.man7.org/linux/man-pages/man3/offsetof.3.html) 定義在 `stddef.h` 中 * 計算 offset 時要轉成 `char *`，以確保 address 的計算符合預期(可參考 [The (char *) casting in container_of() macro in linux kernel](https://stackoverflow.com/questions/20421910/the-char-casting-in-container-of-macro-in-linux-kernel) 的說明) * 不使用 extension 的實作版本: * 直接省略 type checking #### `LIST_HEAD()` / `INIT_LIST_HEAD()` ```cpp #define LIST_HEAD(head) struct list_head head = {&(head), &(head)} static inline void INIT_LIST_HEAD(struct list_head *head) { head->next = head; head->prev = head; } ``` * 初始化 `struct list_head`，先將 `next` 和 `prev` 都指向自己 * `head` 的 `next` 所指代表 linked list 結構的開頭而 `prev` 則指向結構的結尾 #### `list_add` / `list_add_tail` ```cpp static inline void list_add(struct list_head *node, struct list_head *head) { struct list_head *next = head->next; next->prev = node; node->next = next; node->prev = head; head->next = node; } static inline void list_add_tail(struct list_head *node, struct list_head *head) { struct list_head *prev = head->prev; prev->next = node; node->next = head; node->prev = prev; head->prev = node; } ``` * 將指定的 `node` 插入 linked list `head` 的開頭或者結尾 #### `list_del` ```cpp static inline void list_del(struct list_head *node) { struct list_head *next = node->next; struct list_head *prev = node->prev; next->prev = prev; prev->next = next; #ifdef LIST_POISONING node->prev = (struct list_head *) (0x00100100); node->next = (struct list_head *) (0x00200200); #endif } ``` * 將 `node` 從其所屬的 linked list 結構中移除 * 需注意 `node` 本體甚至是包含 `node` 的結構所分配的記憶體皆未被釋放，僅僅是將 `node` 從其原本的 linked list 連接移除 * `LIST_POISONING` 使得被刪除的 `node` 對於 `next` 或者 `prev` 的存取會引發 build time 的 invalid memory access(如果系統禁止 predefined memory access) #### `list_del_init` ```cpp static inline void list_del_init(struct list_head *node) { list_del(node); INIT_LIST_HEAD(node); } ``` * 基於 `list_del`，但移除的 `node` 會額外呼叫 `INIT_LIST_HEAD` 把 `prev` 和 `next` 指向自己 #### `list_empty` ```cpp static inline int list_empty(const struct list_head *head) { return (head->next == head); } ``` * 檢查 `head` 的 `next` 是否指向自己，確認 list 是否為 empty 狀態 #### `list_singular` ```cpp static inline int list_is_singular(const struct list_head *head) { return (!list_empty(head) && head->prev == head->next); } ``` * 如果 `head` 非 empty 狀態且 `prev` 和 `next` 是同一個節點，表示 linked list 只有一個節點 #### `list_splice` / `list_splice_tail` ```cpp static inline void list_splice(struct list_head *list, struct list_head *head) { struct list_head *head_first = head->next; struct list_head *list_first = list->next; struct list_head *list_last = list->prev; if (list_empty(list)) return; head->next = list_first; list_first->prev = head; list_last->next = head_first; head_first->prev = list_last; } static inline void list_splice_tail(struct list_head *list, struct list_head *head) { struct list_head *head_last = head->prev; struct list_head *list_first = list->next; struct list_head *list_last = list->prev; if (list_empty(list)) return; head->prev = list_last; list_last->next = head; list_first->prev = head_last; head_last->next = list_first; } ``` * 將 `list` 的所有 node 都插入到 `head` 的開始 / 結束位置中 * 注意 `list` 本身仍維持原貌 #### `list_spice_init` / `list_splice_tail_init` ```cpp static inline void list_splice_init(struct list_head *list, struct list_head *head) { list_splice(list, head); INIT_LIST_HEAD(list); } static inline void list_splice_tail_init(struct list_head *list, struct list_head *head) { list_splice_tail(list, head); INIT_LIST_HEAD(list); } ``` * 和 `list_splice` / `list_splice_tail` 類似，但移除的 `list` 會額外呼叫 `INIT_LIST_HEAD` 把 `prev` 和 `next` 指向自己 #### `list_cut_position` ```cpp static inline void list_cut_position(struct list_head *head_to, struct list_head *head_from, struct list_head *node) { struct list_head *head_from_first = head_from->next; if (list_empty(head_from)) return; if (head_from == node) { INIT_LIST_HEAD(head_to); return; } head_from->next = node->next; head_from->next->prev = head_from; head_to->prev = node; node->next = head_to; head_to->next = head_from_first; head_to->next->prev = head_to; } ``` * 將從 `head_from` 的第一個節點到 `node` 間的一系列節點都移動到 `head_to` 上 * `head_to` 必須是 empty 狀態(`next` 和 `prev` 都指向自己)，否則可能發生 memory leak #### `list_move` / `list_move_tail` ```cpp static inline void list_move(struct list_head *node, struct list_head *head) { list_del(node); list_add(node, head); } static inline void list_move_tail(struct list_head *node, struct list_head *head) { list_del(node); list_add_tail(node, head); } ``` * 將 `node` 從原本的 linked list 移動到另一個 linked list `head` 的開頭或尾端 #### `list_entry` ```cpp #define list_entry(node, type, member) container_of(node, type, member) ``` * `container_of` 等價的 wrapper #### `list_first_entry` / `list_last_entry` ```cpp #define list_first_entry(head, type, member) \ list_entry((head)->next, type, member) #define list_last_entry(head, type, member) \ list_entry((head)->prev, type, member) ``` * 取得 linked list 的開頭或者結尾的 entry #### `list_for_each` ```cpp #define list_for_each(node, head) \ for (node = (head)->next; node != (head); node = node->next) ``` * linear 遍歷整個 linked list * `node` 和 `head` 不能在迴圈中被更改，否則行為不可預期 #### `list_for_each_entry` ```cpp #ifdef __LIST_HAVE_TYPEOF #define list_for_each_entry(entry, head, member) \ for (entry = list_entry((head)->next, __typeof__(*entry), member); \ &entry->member != (head); \ entry = list_entry(entry->member.next, __typeof__(*entry), member)) #endif ``` * 遍歷包含 `struct list_head` 的另外一個結構之 entry * `node` 和 `head` 不能在迴圈中被更改，否則行為不可預期 * 因為 `typeof` 之限制，只能在 GNUC 下使用 #### `list_for_each_safe` / `list_for_each_entry_safe` ```cpp #define list_for_each_safe(node, safe, head) \ for (node = (head)->next, safe = node->next; node != (head); \ node = safe, safe = node->next) #define list_for_each_entry_safe(entry, safe, head, member) \ for (entry = list_entry((head)->next, __typeof__(*entry), member), \ safe = list_entry(entry->member.next, __typeof__(*entry), member); \ &entry->member != (head); entry = safe, \ safe = list_entry(safe->member.next, __typeof__(*entry), member)) ``` * 透過額外的 `safe` 紀錄每個 iteration 所操作的節點(entry)的下一個節點(entry)，因此刪除目前的節點(entry)是可被允許的，其他操作則同為不可預期行為 ### Example #### Quicksort 我們可以從一些範例來看實際的使用。例如在專案中的 [quick-sort.c](https://github.com/RinHizakura/linux-list/blob/master/examples/quick-sort.c) 程式碼。 ```cpp static void list_qsort(struct list_head *head) { struct list_head list_less, list_greater; struct listitem *pivot; struct listitem *item = NULL, *is = NULL; if (list_empty(head) || list_is_singular(head)) return; INIT_LIST_HEAD(&list_less); INIT_LIST_HEAD(&list_greater); ``` * 先確認 `head` 所承載的 linked list 有兩個以上 entry，否則就返回，不需要被排序 * 先以 `INIT_LIST_HEAD` 初始化另外兩個 list 結構，它們分別是用來插入 entry 中比 pivot 小或者其他的節點 ```cpp pivot = list_first_entry(head, struct listitem, list); list_del(&pivot->list); list_for_each_entry_safe (item, is, head, list) { if (cmpint(&item->i, &pivot->i) < 0) list_move_tail(&item->list, &list_less); else list_move(&item->list, &list_greater); } ``` * 透過 `list_first_entry` 取得第一個 entry 選為 pivot * `list_del` 將該 pivot entry 從 linked list 中移除 * 遍歷整個 linked list，`cmpint` 回傳兩個指標中的值相減的數值，因此小於零表 `item->i` 的值比 `pivot` 的值小，加入 `list_less`，反之則同理 ```cpp list_qsort(&list_less); list_qsort(&list_greater); list_add(&pivot->list, head); list_splice(&list_less, head); list_splice_tail(&list_greater, head); } ``` * 遞迴呼叫將 `list_less` 和 `list_greater` 排序 * 在 `list_for_each_entry_safe` 中，`list_move_tail` 和 `list_move` 會將所有原本在 `head` 中的節點移出，因此首先 `list_add` 加入 `pivot`，再把已經排好的 `list_less` 放在 pivot 前 `list_greater` 放在 pivot 後，完成排序 #### Quicksort Without Recursive Call 上面的程式碼是由 jserv 老師所實作，有了對其的理解後，我們也可以自己來實際使用一下相關的 API。這裡我用 [2021q1 第 1 週測驗題](https://hackmd.io/@sysprog/linux2021-quiz1) 的作業要求作為範例，參考 [Optimized QuickSort — C Implementation (Non-Recursive)](https://alienryderflex.com/quicksort/)，嘗試一下基於 Linux like 的 linked list 實作類似的非遞迴 quick sort。 ```cpp static void list_qsort_no_recursive(struct list_head *head) { struct listitem *begin[MAX_LEN], *end[MAX_LEN], *L, *R; struct listitem pivot; int i = 0; begin[0] = list_first_entry(head, struct listitem, list); end[0] = list_last_entry(head, struct listitem, list); ``` * `begin` 和 `end` 表示在 linked list 中的排序目標的開頭和結尾，因此最初是整個 linked list 的頭至尾 ```cpp while (i >= 0) { L = begin[i]; R = end[i]; ``` * `begin` 和 `end` 的效果類似 stack，會填入每一輪要處理的節點開頭至結尾，因此先取出該輪的頭尾至 `L` 和 `R` ```cpp if (L != R && &begin[i]->list != head) { // pivot is the actual address of L pivot = *begin[i]; if (i == MAX_LEN - 1) { assert(-1); return; } ``` * 接著，以最開頭的節點作為 pivot * `i == MAX_LEN - 1` 的目的是額外檢查這輪如果填入 `begin` 和 `end` 是否會超出原本給定的陣列大小，因為我們所給予的空間是有限的 ```cpp while (L != R) { while (R->i >= pivot.i && L != R) R = list_entry(R->list.prev, struct listitem, list); if (L != R) { L->i = R->i; L = list_entry(L->list.next, struct listitem, list); } while (L->i <= pivot.i && L != R) L = list_entry(L->list.next, struct listitem, list); if (L != R) { R->i = L->i; R = list_entry(R->list.prev, struct listitem, list); } } ``` * 否則的話，從結尾的點(`R`)一直往 `prev` 前進，找到比 `pivot` 值更小的節點的話就將其值移到開頭的 `L` 去 * 同理，從開頭的點(`L`)一直往 `next` 前進，找到比 `pivot` 值更大的節點的話，就將其值移到結尾的 `R` 去 * `L != R` 則負責判斷當 `L` 往 `next` 而 `R` 往 `prev` 移動碰在一起時，當 `L == R` 時，不再做上述的操作，離開迴圈 ```cpp L->i = pivot.i; begin[i + 1] = list_entry(L->list.next, struct listitem, list); end[i + 1] = end[i]; end[i++] = L; ``` * 此時 `L` 所在地方是 `pivot` 的值應在的正確位置，因此將 `pivot` 的值填入 `L` * 此時需要被處理的排序是 `pivot` 往後到結尾的一段，兩個點分別是 `L` 的 `next`，和這輪的 `end[i]` * 另一段則是 `pivot` 以前從原本的 `begin[i]` 到 `L` 一段 ```cpp } else i--; } } ``` * 如果 `L == R` 或者 `&begin[i]->list == head`，表示此段 linked list 已經不需要再做處理，`i--` 類似於 pop stack 的概念 :::warning Some issue: * 原本表示此實作在 array 下效能有可能較好，那麼在 doubly linked list 條件下，這個非遞迴的實作會有比較好的效能嗎 * 上面的實作中，用交換 list entry 中的內容的方式來實現，但考慮到整個 entry 的結構如果變得巨大，有沒有可能改成調成 link 方式來交換會比較有效率，要如何改寫? ::: ### Compare to Linux 比較 [linux-list](https://github.com/RinHizakura/linux-list) 與 [linux/list.h](https://github.com/torvalds/linux/blob/master/include/linux/list.h) 的實作，會發現最大的差異在於 `WRITE_ONCE` macro 的使用。我們可以比較 `list_add` 來探討差異的細節 * [linux-list](https://github.com/RinHizakura/linux-list#L94) ```cpp static inline void list_add(struct list_head *node, struct list_head *head) { struct list_head *next = head->next; next->prev = node; node->next = next; node->prev = head; head->next = node; } ``` * [linux/list.h](https://github.com/torvalds/linux/blob/master/include/linux/list.h#L84) ```cpp static inline void __list_add(struct list_head *new, struct list_head *prev, struct list_head *next) { if (!__list_add_valid(new, prev, next)) return; next->prev = new; new->next = next; new->prev = prev; WRITE_ONCE(prev->next, new); } static inline void list_add(struct list_head *new, struct list_head *head) { __list_add(new, head, head->next); } ``` 乍看之下，如果把 `WRITE_ONCE(prev->next, new)` 替換成 `prev->next = new`，兩者其實不是也沒甚麼差異嗎? `WRITE_ONCE` 定義在 [linux/tools/include/linux/compiler.h](https://github.com/torvalds/linux/blob/master/tools/include/linux/compiler.h#L184) 路徑下，我們可以從註解中對其作用進行理解: > Prevent the compiler from merging or refetching reads or writes. The compiler is also forbidden from reordering successive instances of READ_ONCE and WRITE_ONCE, but only when the compiler is aware of some particular ordering. One way to make the compiler aware of ordering is to put the two invocations of READ_ONCE or WRITE_ONCE in different C statements. 簡而言之，通過 `WRITE_ONCE` 和 `READ_ONCE` 的使用，可以避免編譯器合併、切割、甚至重排特定的記憶體讀寫操作。下面是 `WRITE_ONCE` 的定義: ```cpp #define WRITE_ONCE(x, val) \ ({ \ union { typeof(x) __val; char __c[1]; } __u = \ { .__val = (val) }; \ __write_once_size(&(x), __u.__c, sizeof(x)); \ __u.__val; \ }) ``` 我們可以再次看到 [statement expression](https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html) 技巧的應用。此外，可以看到 `WRITE_ONCE` 把 `val` 被寫入 union 結構中，使其與一個 `char _c[1]` 共享空間。藉由以 union 為基礎的 [`type-punning`](https://en.wikipedia.org/wiki/Type_punning#Use_of_union) 技巧，可避免違反 strict aliasing 規則，使得 `__c` 能作為這個 union 的指標來使用，從而允許編譯器最佳化。需注意 type tunning 的方式有不同類型，而在 gcc 中，使用 union 進行 type punning 的方式被作為語言的擴展並合法(詳見 [gcc: nonbugs](https://gcc.gnu.org/bugs/#nonbugs))，該方式不會違反 gcc 的 [strict aliasing](https://en.wikipedia.org/wiki/Aliasing_(computing)#Conflicts_with_optimization) 規則。但在其他編譯器中則可能因為違反規則導致 undefined behavior。 > 延伸閱讀: > * [What is the strict aliasing rule?](https://stackoverflow.com/questions/98650/what-is-the-strict-aliasing-rule) > * [Unions and type-punning](https://stackoverflow.com/questions/25664848/unions-and-type-punning) 關鍵的 `__write_once_size` 的任務則是把 `__val` 寫入至 `x` 中，但通過巧妙的設計避免了編譯器將其做錯誤的優化，細節請見下面的 `__write_once_size` 說明 ```cpp typedef __u8 __attribute__((__may_alias__)) __u8_alias_t; typedef __u16 __attribute__((__may_alias__)) __u16_alias_t; typedef __u32 __attribute__((__may_alias__)) __u32_alias_t; typedef __u64 __attribute__((__may_alias__)) __u64_alias_t; ``` 在深入 `__write_once_size` 之前，先來看看這個型別定義。首先需要了解何謂 [aliasing](https://en.wikipedia.org/wiki/Aliasing_(computing)): 其所指為同一個位置可被多個 symbol 存取時，這種情形會造成編譯器優化上的難處。根據 [Options That Control Optimization](https://gcc.gnu.org/onlinedocs/gcc-4.9.1/gcc/Optimize-Options.html)， `-fstrict-aliasing` 參數會讓編譯器預設程式撰寫者不會將相似類型(例如 int 和 unsigned int) 以外的 pointer 指向同一個記憶體位置，以允許做更激進的優化，這個參數在 `-O2`、`-O3`、`-Os` 下會預設開啟。但我們可以根據 [Specifying Attributes of Types](https://gcc.gnu.org/onlinedocs/gcc-4.0.2/gcc/Type-Attributes.html) 中所提到，使用 `__may_alias__` 告知編譯器此型別允許 aliasing 的發生，避免對此的過度優化導致非預期的程式行為，至於 Linux 這裡特別定義 `*_alias_t` 的目的可以參考註解: > Following functions are taken from kernel sources and break aliasing rules in their original form. > > While kernel is compiled with -fno-strict-aliasing, perf uses -Wstrict-aliasing=3 which makes build fail under gcc 4.4. Using extra `__may_alias__` type to allow aliasing in this case. ```cpp static __always_inline void __write_once_size(volatile void *p, void *res, int size) { switch (size) { case 1: *(volatile __u8_alias_t *) p = *(__u8_alias_t *) res; break; case 2: *(volatile __u16_alias_t *) p = *(__u16_alias_t *) res; break; case 4: *(volatile __u32_alias_t *) p = *(__u32_alias_t *) res; break; case 8: *(volatile __u64_alias_t *) p = *(__u64_alias_t *) res; break; default: barrier(); __builtin_memcpy((void *)p, (const void *)res, size); barrier(); } } ``` 對於 1 / 2 / 4 / 8 bytes 的變數，可以直接搭配 [`volatile`](https://en.wikipedia.org/wiki/Volatile_(computer_programming)) 的使用提示編譯器不要優化這個寫入操作。`volatile` 所考慮的情境在於: 變數的值即使從程式中乍看不會改變，在每一次存取時實際則有可能被更動(例如該變數可能指向某個 memory map I/O，會被硬體更動值，或者被 signal handler 和 main program 共用的 global variable)，因此可以避免寫入操作被錯誤優化。對於其他長度的變數，則可以透過 memory barriers 搭配 `memcpy` 的方法來寫入。`barrier()` 如同其名稱是個屏障，告訴編譯器不能 `barrier()` 前的 load/store 移動到 `barrier()` 後，反之亦然。值得一題的是，Linux 中採用了這種 "volatile access" 而非直接將 object 標註為 volitile 屬性，在 [doc: volatile considered evil](https://lwn.net/Articles/233482/) 中有提及後者可能隱藏的問題。 > 延伸閱讀: [Nine ways to break your systems code using volatile](https://blog.regehr.org/archives/28) ## list_sort 在 [linux/list_sort.c](https://github.com/torvalds/linux/blob/master/lib/list_sort.c) 中，可以看到 linux 特別針對系統層級的考量所設計的 linked list sort。在最頂端 GitHub 連結中也可以看到整合至使用者層級的實作。接下來，嘗試探討其中的設計原理和可能的考量。 ```cpp /** * list_sort - sort a list * @priv: private data, opaque to list_sort(), passed to @cmp * @head: the list to sort * @cmp: the elements comparison function * ``` 感謝詳細的註解，我們可以不必貿然的挑戰程式碼，先探討其中的實作內容。首先是需要的三個參數 * `priv` 是一個 pointer，可以用來傳輸 `cmp` 需要的參數 * `head` 要被排序的 list * `cmp` 是比較*自定義大小*的 function pointer ```cpp /*... * The comparison funtion @cmp must return > 0 if @a should sort after * @b ("@a > @b" if you want an ascending sort), and <= 0 if @a should * sort before @b *or* their original order should be preserved. It is * always called with the element that came first in the input in @a, * and list_sort is a stable sort, so it is not necessary to distinguish * the @a < @b and @a == @b cases. * * This is compatible with two styles of @cmp function: * - The traditional style which returns <0 / =0 / >0, or * - Returning a boolean 0/1. * The latter offers a chance to save a few cycles in the comparison * (which is used by e.g. plug_ctx_cmp() in block/blk-mq.c). * * A good way to write a multi-word comparison is:: * * if (a->high != b->high) * return a->high > b->high; * if (a->middle != b->middle) * return a->middle > b->middle; * return a->low > b->low; ``` 註解中說明了在比較 `a` `b` 時，如果 `a` 需至於 `b` 的之後，則 `cmp` 需回傳大於 0 的值。list_sort 是 [stable sort](https://en.wikipedia.org/wiki/Category:Stable_sorts)，所以不必區分小於或等於 0 的分別。最後則展示了一個簡單的 multi-word comparision。 ```cpp /* This mergesort is as eager as possible while always performing at least * 2:1 balanced merges. Given two pending sublists of size 2^k, they are * merged to a size-2^(k+1) list as soon as we have 2^k following elements. * * Thus, it will avoid cache thrashing as long as 3*2^k elements can * fit into the cache. Not quite as good as a fully-eager bottom-up * mergesort, but it does use 0.2*n fewer comparisons, so is faster in * the common case that everything fits into L1. ``` 接著，註解中提到這裡的 merge sort 實作重點。方法會保持兩個要被 merge 的 list 至少是 2 : 1 的狀態(比較大的 list 至多是小的的兩倍)，這可以有效的降低比較的次數。list_sort 與一般的 fully-eager bottom-up mergesort 方法不同，後者只要出現兩個 2^k 大小的 list，就會立刻被合併。而前者的方法是在出現兩個 $2^k$ 大小的 list 的時候，不立即合併，反之，一直等到下一個 2^k 的 list 被蒐集起來時，前者的這兩個 linked list 才會被合併起來。只要這 2 : 1 的 list 可以完全被放到 cache 裡，就可以避免 cache thrashing。在 [[RESEND PATCH v2 5/5] lib/list_sort: Optimize number of calls to comparison function ](https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1957556.html) 可以看到更完整的敘述。 ```cpp /* The merging is controlled by "count", the number of elements in the * pending lists. This is beautiully simple code, but rather subtle. * * Each time we increment "count", we set one bit (bit k) and clear * bits k-1 .. 0. Each time this happens (except the very first time * for each bit, when count increments to 2^k), we merge two lists of * size 2^k into one list of size 2^(k+1). ``` 那麼前述的規則該怎麼維持呢? 方法中會透過一個 `count` 來紀錄 pending 的節點數量。當目前的 count + 1 後，會設置第 k 個 bit 為 1，k-1 至 0 bit 為 0 時(除了 `count` 為 $2^k - 1$ 的情境)，我們就把兩個 $2^k$ 長度的 list 做合併。細節在後面探討程式碼時會再嘗試說明得更清楚。後續註解有點掌握不到意思(~~菜英文~~)，所以下面直接從程式碼的部份做理解並解釋。 ```cpp= __attribute__((nonnull(2,3))) void list_sort(void *priv, struct list_head *head, int (*cmp)(void *priv, struct list_head *a, struct list_head *b)) { struct list_head *list = head->next, *pending = NULL; size_t count = 0; /* Count of pending */ if (list == head->prev) /* Zero or one elements */ return; /* Convert to a null-terminated singly-linked list. */ head->prev->next = NULL; ``` 首先破壞掉 linked list 的 circular 結構，最後的節點(`head->prev`) 的 `next` 指向 NULL。 * [`__attribute__((nonnull))`](https://www.keil.com/support/man/docs/armcc/armcc_chr1359124976631.htm) 讓 compiler 可以對指定位置的 pointer 是 NULL 時發出警告 ```cpp=14 ... do { size_t bits; struct list_head **tail = &pending; /* Find the least-significant clear bit in count */ for (bits = count; bits & 1; bits >>= 1) tail = &(*tail)->prev; /* Do the indicated merge */ if (likely(bits)) { struct list_head *a = *tail, *b = a->prev; a = merge(priv, (cmp_func)cmp, b, a); /* Install the merged result in place of the inputs */ a->prev = b->prev; *tail = a; } /* Move one element from input list to pending */ list->prev = pending; pending = list; list = list->next; pending->next = NULL; count++; } while (list); ``` 這裡我們可以先來觀察 `count` 與 merge 的進行之規律。注意到這裡的格式中，左邊是在該 count 的 iter 開始時的狀態，右邊則是經可能的 merge 後把自己加入 pending list 後的狀態。 * $2^0$ -> count = 1 = $1_{2}$ -> $2^0$ + $2^0$ (no merge) * $2^0$ + $2^0$ -> count = 2 = $10_{2}$ -> $2^1$ + $2^0$ (將 $10_{2}$ 加 1 的話 set bit 0，merge to 2^1) * $2^1$ + $2^0$ -> count = 3 = $11_{2}$ -> $2^1$ + $2^0$ + $2^0$ (no merge) * $2^1$ + $2^0$ + $2^0$ -> count = 4 = $100_{2}$ -> 2 * $2^1$ + $2^0$ (將 $100_{2}$ 加 1 的話 set bit 0->merge to 2^1) * 2 * $2^1$ + $2^0$ -> count = 5 = $101_{2}$ -> $2^2$ + $2^0$ + $2^0$ (將 $101_{2}$ 加 1 的話 set bit 1->merge to 2^2) * $2^2$ + $2^0$ + $2^0$ -> count = 6 = $110_{2}$ -> $2^2$ + $2^1$ + $2^0$ (將 $110_{2}$ 加 1 的話 set bit 0->merge to 2^1) 可以觀察到，`count` 的最小位元的 0 假設在位置 k，根據規則就要 merge 出 $2^{k+1}$ 的 list (除了 `count` 為 $2^k - 1$)。而後面為 1 的 bit k - 1, k - 2 ... 0 可以代表 pending 裡有 $2^{k-1}, 2^{k-2} ...... 2^0$ 大小的 Llist 各一個，且因為 `*tail` 一開始就指向 $2^0$ 的那個 list，所以 `*tail` 往 prev 走 k 步就會找到 $2^k$ 大小的 list A，而 A 的 `prev` 是另一個 $2^k$ 大小的 list B，我們要把兩者 merge 在一起。再回到程式碼，for 迴圈中透過 `count` 最低位元的連續 1 找到此輪要 merge 的兩個 list，且 bits 若不為 0 剛好會等同於我們提到要 merge 的 `count`。最後。每一輪的 list 會把自己的 `next` 設為 NULL 而 `prev` 指向 pending，並更新成原本的 `next` 所指向的下一個 node 繼續流程。下面看一下 merge 具體的實作: ```cpp __attribute__((nonnull(2,3,4))) static struct list_head *merge(void *priv, cmp_func cmp, struct list_head *a, struct list_head *b) { struct list_head *head, **tail = &head; for (;;) { /* if equal, take 'a' -- important for sort stability */ if (cmp(priv, a, b) <= 0) { *tail = a; tail = &a->next; a = a->next; if (!a) { *tail = b; break; } } else { *tail = b; tail = &b->next; b = b->next; if (!b) { *tail = a; break; } } } return head; } ``` merge 接受參數的兩個 `list_head` a / b 後，`tail` 初始時指向一個 dummy pointer。然後 for 迴圈比較 a / b。如果 a 應該在 b 之前，則 `*tail` 改指向 a，並且 `tail` 被更新以指向 "指向 a 的下個節點的 pointer"(注意這裡沒多打一個指向XD)，這個步驟等同把 a 加到一個新的 linked list 中，然後下一次改成比 a 的 `next` 和 b，反之也是類似道理。直到 `a` 或 `b` 的 `next` 為 NULL 時，把另一個加入 `*tail` 則完成。繼續來看 list_sort 的實作。 ```cpp=41 /* End of input; merge together all the pending lists. */ list = pending; pending = pending->prev; for (;;) { struct list_head *next = pending->prev; if (!next) break; list = merge(priv, (cmp_func)cmp, pending, list); pending = next; } /* The final merge, rebuilding prev links */ merge_final(priv, (cmp_func)cmp, head, pending, list); } ``` 當所有的節點都被加入 pending 後，把所有的 pending 都 merge 在一起。刻意留下 pending 中的其一 `pending` 和其他的 pending merge 在一起的 `list` 去做 `merge_final`，後者的目的是因為 `merge` 只有做單向(`next`)的鏈結，需要把 `prev` 也接去正確的點上，並且回復至 circular linked list 的狀態。 ## Reference > * [WRITE_ONCE in linux kernel lists](https://stackoverflow.com/questions/34988277/write-once-in-linux-kernel-lists) > * [Linux内核中的READ_ONCE和WRITE_ONCE宏](https://blog.csdn.net/Roland_Sun/article/details/107365134) > * [编译乱序](https://zhuanlan.zhihu.com/p/102370222) ## TODO - [ ] 閱讀 [A comparative study of linked list sorting algorithms](https://pages.mtu.edu/~shene/PUBLICATIONS/1996/3Conline.pdf) 並將其落實於 [linux-list](https://github.com/RinHizakura/linux-list) 中

Read more

Linux 核心設計: Scheduler(5): EEVDF 排程器

Linux 核心設計: Scheduler(2): 概述 CFS Scheduler

Linux 核心設計: Power Management(1): System Sleep model

PCI/PCIe(2): 電源管理