Linux 核心專題: 紅黑樹實作

# Linux 核心專題: 紅黑樹實作 > 執行人: EdwardCKC > [專題解說錄影](https://youtu.be/Spw-RrBYEdw) :::success :question: 提問清單 * ? ::: ## 任務簡述重做[第 4 週測驗題的測驗三](https://hackmd.io/@sysprog/linux2023-quiz4#測驗-3)，包含延伸問題，以掌握紅黑樹實作議題。 ## TODO: 重做第四週測驗題重做[第 4 週測驗題的測驗三](https://hackmd.io/@sysprog/linux2023-quiz4#測驗-3)，包含延伸問題，彙整其他學員的成果。至少要包含: * 解釋原本程式碼的個別巨集的作用 (對照〈[你所不知道的 C 語言：前置處理器應用篇](https://hackmd.io/@sysprog/c-preprocessor)〉) * 解釋為何 `rb_node` 只包含 `left`, `right` 這二個指標，`parent` 指標被調整到哪？為何這樣的紅黑樹依舊可運作？ * 解釋給定程式碼如何新增節點 * 在給定的程式碼基礎之上，實作節點移除的操作並充分驗證 * 將[第 4 週測驗題](https://hackmd.io/@sysprog/linux2023-quiz4)的測驗一對節點的紅色和黑色標注的技巧嵌入到指標變數中 * 評估改寫後的紅黑樹效能表現，設計對應的實驗 ### 解釋程式碼 ([Incomplete rbi.c](https://gist.github.com/jserv/c2b80079ba4fc6cfe563f5c02767f455)) #### `RB_LOG2_MAX_NODES` 用於計算紅黑樹中節點的最大數量的對數值，也用作控制樹的最大深度。參考並整理 Korin777 同學的 [quiz 4 測驗三](https://hackmd.io/@Korin777/linux2023-quiz4#測驗三) 以及 Toneary 同學的 [quiz 4 測驗三](https://hackmd.io/@Toneary/2023q1-Homework4-quiz4#%E6%B8%AC%E9%A9%97%E4%B8%89): 因為 `node_t` 的大小為 `sizeof(void *) * 4` 所以 node 最多有 $$\frac{1 << (sizeof(void\ *)\ << 3)} {sizeof(rb\_node(x\_type))}$$ 取 $log_2$ 可以得到: $$ = log_2(1 << (sizeof(void\ *) << 3)\ -\ log_2(sizeof(rb\_node(x\_type))) $$ $$ = RB\_LOG2\_MAX\_MEM\_BYTES - log_2(sizeof(rb\_node(x\_type))) $$ 在 $log_2(sizeof(rb\_node(x\_type)))$ 加入 32 bit 或 64 bit 架構判定後就會是: ```cpp sizeof(void *) >= 8 ? 4 : sizeof(void *) >= 4 ? 3 : 2 - 1 ``` ```cpp #define RB_LOG2_MAX_MEM_BYTES (sizeof(void *) << 3) #define RB_LOG2_MAX_NODES \ (RB_LOG2_MAX_MEM_BYTES - \ (sizeof(void *) >= 8 ? 4 \ /***************************/ : sizeof(void *) >= 4 ? 3 \ /* sizeof(rb_node(x_type)) */ : 2) - 1) /***************************/ #define RB_MAX_DEPTH (RB_LOG2_MAX_NODES << 1) ``` 若 `sizeof(void*) == 8` 就會是 $64 - 4 - 1 = 59$ :::info 問題: 為什麼在 32-bit 或以下的系統是 -3(8) 和 -2(4)? ::: 因為不太能理解`RB_LOG2_MAX_NODES` 的程式碼，所以使用 `gcc -E` 去顯示 `printf(RB_LOG2_MAX_NODES)` 在 preprocessor 階段的結果，再配合 `gcc -O3 -S` 去輸出組合語言的結果: ```cpp // `gcc -E` 的結果 { printf(((sizeof(void *) << 3) - \ (sizeof(void *) >= 8 ? 4 : \ sizeof(void *) >= 4 ? 3 : 2) - 1)); return 0; } // `gcc -O3 -S` 的結果 main: .LFB35: .cfi_startproc endbr64 subq $8, %rsp .cfi_def_cfa_offset 16 movl $59, %edx // 這是 printf 結果 leaq .LC0(%rip), %rsi movl $1, %edi movl $0, %eax call __printf_chk@PLT movl $0, %eax addq $8, %rsp .cfi_def_cfa_offset 8 ret .cfi_endproc .LFE35: .size main, .-main .ident "GCC: (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0" .section .note.GNU-stack,"",@progbits .section .note.gnu.property,"a" .align 8 .long 1f - 0f .long 4f - 1f .long 5 ``` 發現 `RB_LOG2_MAX_NODES` 是 `59` 直接在 compile-time 轉成一個常數，同時在《Demystifying the Linux CPU Scheduler》紅黑樹的部分提到，在32 bit 和 64 bit 架構會因為 alignment，使最後的2 和 3 significant bits 沒有被使用。最後，經過老師的解釋才明白 `RB_LOG2_MAX_NODES` 也是用作分配紅黑樹的最少記憶體空間 > Kernel developers combined the parent node with the color by alignment. \_\_attribute\_\_((aligned(sizeof(long)))) would make rb_node structure aligned to the size of long, and with respect to 32-bit and 64-bit architectures, the alignment would be 4 bytes and 8 bytes. Resulting in the least 2 (for 32-bit) or 3 (for 64-bit) significant bits unused. Storing the 1-bit color information (0 for red and 1 for black) in the LSB would not cause any problem andleads to saving an additional bit. #### `sum_subtree` 利用 `rbtn_left_get` 和 `rbtn_right_get` 以迭代方式走訪整個紅黑樹，並計算所有節點的值的總和 ```c=257 static int sum_subtree(node_t *node) { ... if (!rbtn_left_get(node_t, link, node)) ... for (pre = rbtn_left_get(node_t, link, node); rbtn_right_get(node_t, link, pre) && rbtn_right_get(node_t, link, pre) != node; pre = rbtn_right_get(node_t, link, pre)) { } if (!rbtn_right_get(node_t, link, pre)) { rbtn_right_get(node_t, link, pre) = node; node = rbtn_left_get(node_t, link, node); } else { rbtn_right_get(node_t, link, pre) = NULL; ... ``` 這是左傾紅黑樹 (LLRBT)，所以第 260 行會先檢查左子節點是否為 `NULL` 在for loop (263-264 行)除了檢查 `pre` 的右子節點非空，還檢查不是目前節點，是為了確保在走訪節點時，不會進入無限循環的情況。同時在 267-271 行，檢查如果右子節點是空的，就先把 `node` 插入為 `pre` 節點的右子節點，再更新 `node` 為左子節點,使下一次循環迭代時，會繼續走訪目前節點的左子樹。如果右子節點非空，等於目前節點的右子樹已經被走訪過。在這種情況下，將 `pre` 節點的右子節點設為 NULL；這樣做的目的是防止重複存取目前節點。 ### 解釋 RBT parent ```cpp /* Root structure */ #define rb_tree(x_type) \ struct { \ x_type *root; \ } #define rb_gen_insert(x_attr, x_prefix, x_rbt_type, x_type, x_field, x_less) \ x_attr void x_prefix##insert(x_rbt_type *rbtree, x_type *node) \ { \ x_prefix##path_entry_t path[RB_MAX_DEPTH]; \ x_prefix##path_entry_t *pathp; path->node = rbtree->root; ... /* Set root, and make it black. */ \ rbtree->root = path->node; \ rbtn_black_set(x_type, x_field, rbtree->root); \ } ``` 親代節點紀錄在 `re_tree` 的 `root` 指標是因為在走訪或改變樹的結構 (旋轉) 才需要用到，把親代節點獨立出來可以節省記憶體和簡化紅黑樹的實作。需要時可以通過走訪樹獲取親代節點。以 `rb_gen_insert` 來說 `path[RB_MAX_DEPTH]` 會紀錄走訪時的路徑(包括親代節點)，當目前節點在 rotate 時要改變親代節點，會把 `path->node` 更新到 `rbtree->root` 再把親代節點設定為黑色。 `##` 是 concatenation 的意思，以 `rb_gen` 部分片段為例: ```cpp #define rb_gen(x_attr, x_prefix, x_rbt_type, x_type, x_field, x_less) \ typedef struct { \ x_type *node; \ bool less; \ } x_prefix##path_entry_t; \ x_attr void x_prefix##new (x_rbt_type * rbtree) \ { \ rb_new(x_type, x_field, rbtree); \ } ``` ```cpp rb_gen (static, tree_, tree_t, node_t, link, node_less); /* ^ ^ ^ ^ ^ ^ * | | | | | | * rb_gen(x_attr, x_prefix, x_rbt_type, x_type, x_field, x_less) */ // rb_gen 展開後: typedef struct { \ node_t *node; \ bool less; \ } tree_path_entry_t; \ static void tree_new (tree_t * rbtree) \ rb_new(node_t, link, rbtree); \ { \ } ``` ### 解釋 `rb_gen_insert` 如何新增節點完整巨集 [rb_gen_insert](https://gist.github.com/jserv/c2b80079ba4fc6cfe563f5c02767f455#file-incomplete-rbi-c-L173) 程式碼 :::spoiler 使用`gcc -E -P` 展開 `rb_gen_insert` 的函式 ```cpp static void tree_insert(tree_t *rbtree, node_t *node) { tree_path_entry_t path[(((sizeof(void *) << 3) - (sizeof(void *) >= 8 ? 4 : sizeof(void *) >= 4 ? 3 : 2) - 1) << 1)]; tree_path_entry_t *pathp; do { do { ((node))->link.left = ((void *)0); } while (0); do { ((node))->link.right = ((void *)0); } while (0); do { ((node))->link.red = 1; } while (0); } while (0); /* Wind. */ path->node = rbtree->root; for (pathp = path; pathp->node; pathp++) { _Bool less = pathp->less = node_less(node, pathp->node); if (less) { pathp[1].node = ((pathp->node)->link.left); } else { pathp[1].node = ((pathp->node)->link.right); } } pathp->node = node; /* Assertion */ ((void)sizeof((!((node)->link.left)) ? 1 : 0), __extension__({ if (!((node)->link.left)) ; else __assert_fail("!rbtn_left_get(node_t, link, node)", "rbi_raw.c", 250, __extension__ __PRETTY_FUNCTION__); })); ((void)sizeof((!((node)->link.right)) ? 1 : 0), __extension__({ if (!((node)->link.right)) ; else __assert_fail("!rbtn_right_get(node_t, link, node)", "rbi_raw.c", 250, __extension__ __PRETTY_FUNCTION__); })); /* Unwind */ while (pathp-- != path) { node_t *cnode = pathp->node; if (pathp->less) { node_t *left = pathp[1].node; do { (cnode)->link.left = left; } while (0); if (!((left)->link.red)) return; node_t *leftleft = ((left)->link.left); if (leftleft && ((leftleft)->link.red)) { /* Fix up 4-node. */ node_t *tnode; do { (leftleft)->link.red = 0; } while (0); do { (tnode) = (((cnode))->link.left); do { ((cnode))->link.left = (((tnode))->link.right); } while (0); do { ((tnode))->link.right = (cnode); } while (0); } while (0); cnode = tnode; } } else { node_t *right = pathp[1].node; do { (cnode)->link.right = right; } while (0); if (!((right)->link.red)) return; node_t *left = ((cnode)->link.left); if (left && ((left)->link.red)) { /* Split 4-node. */ do { (left)->link.red = 0; } while (0); do { (right)->link.red = 0; } while (0); do { (cnode)->link.red = 1; } while (0); } else { /* Lean left. */ node_t *tnode; _Bool tred = ((cnode)->link.red); do { (tnode) = (((cnode))->link.right); do { ((cnode))->link.right = (((tnode))->link.left); } while (0); do { ((tnode))->link.left = (cnode); } while (0); } while (0); do { (tnode)->link.red = (tred); } while (0); do { (cnode)->link.red = 1; } while (0); cnode = tnode; } } pathp->node = cnode; } rbtree->root = path->node; do { (rbtree->root)->link.red = 0; } while (0); }; ``` ::: 1. 一開始宣告一個 `path` array，用於存儲搜尋路徑。路徑中的每個節點都包含一個子節點的指標以及一個布爾值 less，表示節點應該插入到左子樹或右子樹。 ::: spoiler `path` 資料結構 ```cpp typedef struct { node_t *node; _Bool less; } tree_path_entry_t; ``` ::: :::info 為什麼要再宣告 `pathp` 的指標是因為搜尋路徑的長度是動態的。使用 `pathp` 的好處是可以在走訪過程中修改 `path` ，並且可以方便地在走訪完畢後獲取最終的路徑結果。 ::: 2. 從 `root` node 開始，根據新節點的值與每個節點的值進行比較，向下走訪樹並更新路徑。(`less` function) 在走訪過程中，如果新節點應該插入到左子樹，則將路徑中的下一個節點設置為當前節點的左子節點；否則，設置為右子節點。(`link` 結構的特性) 將路徑中的最後一個節點先記錄在 `pathp` 為將被加入的新節點 :::spoiler `link` 資料結構 struct { node_t *left, *right; _Bool red; } link; ::: 3. 以 assertion 去判定最後一個節點的 leaf node 有沒有足夠空間可以插入新節點，沒有就直接跳出程式 4. 在 `while (pathp-- != path)` loop 會在 `if (pathp->less)` 比較最後一個節點與新節點的值去判定要插入左/右子節點(左為少) 5. 從最後一個節點開始，根據紅黑樹的性質進行旋轉和顏色調整，直到達到平衡。 8. 最後，將`root` node 設置為路徑中的第一個節點，並將根節點標記為黑色，完成新增節點的操作。 ### 實作節點移除的操作 ### 把第 4 週測驗題的測驗一對節點的紅色和黑色標注的技巧嵌入到指標變數中參考並整理 wanghanchi 同學的 [quiz 4 測驗三](https://hackmd.io/@wanghanchi/linux2023-quiz4#測驗三) 以及 Hongweii 同學的 [quiz 4 測驗三](https://hackmd.io/@Hongweii/quiz4#測驗三): 因為４ byte alignment 關係，所以最後兩個 bits 用不到可以用來紀錄 node 的顏色。將節點結構顏色標注嵌入到右邊節點的指標變數當中 ```diff /* Node structure */ #define rb_node(x_type) \ struct { \ - x_type *left, *right; \ + x_type *left, *right_red; \ - bool red; } ``` 改寫其他巨集 ```c #define rbtn_left_get(x_type, x_field, x_node) \ (x_type*)(((intptr_t)((x_node)->x_field.left)) & ~3)) #define rbtn_right_get(x_type, x_field, x_node) \ ((x_type*)(((intptr_t)((x_node)->x_field.left)) & ~3)) #define rbtn_red_set(x_type, x_field, x_node) \ do { \ (x_node)->x_field.right = \ (x_type*)((uintptr_t)((x_node)->x_field.right) | 1); \ } while (0) #define rbtn_black_set(x_type, x_field, x_node) \ do { \ (x_node)->x_field.right = \ (x_type*)((intptr_t)((x_node)->x_field.right) & ~3); \ } while (0) ``` ### 評估改寫後的紅黑樹效能表現 ## TODO: 實作快取機制 Linux 核心的 rbtree 具備 cache 機制 (參見 [Red-black Trees (rbtree) in Linux](https://www.kernel.org/doc/Documentation/rbtree.txt))，探討其原理，並嘗試在上述程式碼的基礎之上實作。 > 對照《Demystifying the Linux CPU Scheduler》第 3.1 節 "Red-black tree" 相關描述在 Linux 核心的 `rbtree.h` 就已經有定義 `RB_ROOT_CACHED`，它的這個 cache 機制是加入到 root node 裡面, 然後 cache 的是整個紅黑樹的最左邊的 node, 而最左邊的 node 的值也是最小的。為什麼要 cache 整個樹最小的值，可以用 cpu scheduler 為例子作說明: 在《Demystifying the Linux CPU Scheduler》提到紅黑樹是用來記錄那些將被執行的 runnable tasks 的 `vruntime` > Red-black tree The runnable tasks are stored inside a red-black tree (RB- tree), where processes are inserted based on a linear ordering of execution time. ... the task to run next is the one with the smallest vruntime. 而最小的 `vruntime` task 會最先被執行，那在有 cache 最左邊的 node (最小 `vruntime`) 情況下會是 O(1) 的存取時間，而非是 O(log n)。在 linux 核心的紅黑樹的 node 結構如下: ```c struct rb_root { struct rb_node *rb_node; }; struct rb_node { unsigned long __rb_parent_color; struct rb_node *rb_right; struct rb_node *rb_left; } __attribute__((aligned(sizeof(long)))); /* The alignment might seem pointless, but allegedly CRIS needs it */ ``` 加入 cache 機制的 root node 結構: ```c struct rb_root_cached { struct rb_root rb_root; struct rb_node *rb_leftmost; }; ``` 它多了一個 `rb_leftmost` 的指標直接記錄最左邊的 node 的位置，使一開始在訪問時 root node 就可以直接跳過走訪的過程，減少了走訪的時間。