2023q1 Homework1 (quiz1)

--- tags: linux kernel, jserv, lab0-c --- # 2023q1 Homework1 (quiz1) contributed by < [`WangHanChi`](https://github.com/WangHanChi) > > [作業要求](https://hackmd.io/@sysprog/linux2023-quiz1) :::info - 測驗一 - [x] 解釋上述程式碼運作原理 - [x] 針對 [Quick sort](https://en.wikipedia.org/wiki/Quicksort) 提出上述程式碼的改進方案並實作 - [ ] 引入上述的 hybrid sort 策略並在 Linux 核心風格的 circular doubly-linked list 實作，需要量化性能表現並探討不同資料集 (data set) 的影響 - [ ] BONUS: 嘗試對 Linux 核心提出排序程式碼改進的貢獻 - 測驗二 - [x] 解釋上方程式碼運作原理，指出設計和實作的缺失 - [x] 比較測驗 1 和測驗 2 的程式碼，設計實驗來分析二者效能，解釋為何非遞迴的實作未能總是比遞迴的實作快 - [ ] 提出效能改進方案，特別是避免依賴 MAX_DEPTH - [ ] 引入 [Introsort](https://en.wikipedia.org/wiki/Introsort)，也就是 quick sort 和 heap sort - 測驗三 - [x] 解釋上述程式碼運作原理，指出其實作缺陷並改進 - [ ] 在上述 XOR linked list 的基礎，實作測驗 1 和 2 提及的快速排序演算法，注意要引入你改進效能的版本 - [ ] 修改 xorlist_destroy，使其不急著釋放記憶體，而搭配 [atexit(3)](https://man7.org/linux/man-pages/man3/atexit.3.html) 並在程式即將結束執行時，才批次處理資源釋放 ::: ## 測驗一 ### 解釋上述程式碼運作原理這個部份的程式定義了新的結構體 `item` ```c #include <stdint.h> #include "list.h" struct item { uint16_t i; struct list_head list; }; ``` 這個程式比較了 `item` 中的 `i` 這個值，並且回傳值為**參數1 - 參數2** ```c static inline int cmpint(const void *p1, const void *p2) { const uint16_t *i1 = (const uint16_t *) p1; const uint16_t *i2 = (const uint16_t *) p2; return *i1 - *i2; } ``` 以下的程式碼進行了 `quick_sort` 1. 首先設定了兩個 `head` 並且初始化，一個放 less ，另一個放 greater 2. 設定一個 `pivot` ，作為比較大小的基準 3. 走訪所有的鏈結串列，並且逐一比較大小 4. 對二條鏈結串列排序 5. 最後將其接起來 ```c static void list_sort(struct list_head *head) { if (list_empty(head) || list_is_singular(head)) return; struct list_head list_less, list_greater; INIT_LIST_HEAD(&list_less); INIT_LIST_HEAD(&list_greater); struct item *pivot = list_first_entry(head, AAA, BBB); list_del(&pivot->list); struct item *itm = NULL, *is = NULL; list_for_each_entry_safe(itm, is, head, list) { if (cmpint(&itm->i, &pivot->i) < 0) list_move(&itm->list, &list_less); else list_move(&itm->list, &list_greater); } list_sort(&list_less); list_sort(&list_greater); list_add(&pivot->list, head); list_splice(&list_less, head); list_splice_tail(&list_greater, head); } ``` 由 `list_first_entry` 的定義可以知道，`AAA` 需要填入的 `type` 也就是 struct item，而 `BBB` 會需要填入的就是 `member` ，接下來 `CCC` 會是需要走訪所有 list 的迴圈函式，又因為會需要移動節點，所以選擇使用 `list_for_each_entry_safe` 。而至於 quick sort 的 for-loop 會將點分配給基準值 **less 區** 與 **greater 區** ，所以 `DDD` 與 `EEE` 都會是移動節點的巨集，所以就會是使用 `list_move_tail` 到最尾端。最後要把 **greater 區** 移動到鏈結串列的尾巴，所以選用 `list_splice_tail`。 <s> > AAA = struct item > BBB = list > CCC = list_for_each_entry_safe > DDD = list_move_tail > EEE = list_move_tail > FFF = list_splice_tail </s> :::warning 著重在程式碼和其原理，不用貼出參考答案。隨堂測驗是引導學員「誠實面對自己」的手段，我們不該拘泥於其形式。 :notes: jserv ::: ### 提出程式碼的可改進地方並實作由程式碼可以看出 `pivot` 的選選擇相當的重要，若是選擇不當的話，可能會造成 **less 區** 與 **greater 區** 失衡。例如選到 `pivot` 是最小值的時候，會使得這次遞迴無效。為了解決上述的問題，目前有兩種解決方法 1. 使用老師所說的 `Hybrid Sort` 2. 改善 `pivot` 的選擇 --- ## 測驗二 ### 解釋上方程式碼運作原理，指出設計和實作的缺失此題為延續上題，改成使用非遞迴的版本可以看到下面的 stack 會先被初始化後再把 `head` 以及整條鏈結串列都放進這個 stack 裡面 ```c struct list_head stack[MAX_DEPTH]; for (int i = 0; i < MAX_DEPTH; i++) INIT_LIST_HEAD(&stack[i]); int top = -1; list_splice_init(head, &stack[++top]); ``` 接著利用 `list.h` 中提供的函式把 stack 頂端的節點放入 partition 中 ```c INIT_LIST_HEAD(&partition); list_splice_init(&stack[top--], &partition); ``` 分別建立出 **less 區** 與 **greater 區** 分別儲存比 `pivot` 大跟小的節點。而這邊 `pivot` 的選擇就跟測驗 `1` 的選擇方法一致 ```c if (!list_empty(&partition) && !list_is_singular(&partition)) { struct list_head list_less, list_greater; INIT_LIST_HEAD(&list_less); INIT_LIST_HEAD(&list_greater); struct item *pivot = list_first_entry(&partition, struct item, list); list_del(&pivot->list); INIT_LIST_HEAD(&pivot->list); ``` 這邊也跟測驗 `1` 的方法一致，走訪所有的節點，再分成 **less 區** 與 **greater 區**，因此 GGGG 也會是 `list_for_each_entry_safe` ```c struct item *itm = NULL, *is = NULL; GGGG (itm, is, &partition, list) { list_del(&itm->list); if (cmpint(&itm->i, &pivot->i) < 0) list_move(&itm->list, &list_less); else list_move(&itm->list, &list_greater); } ``` 下面這行程式碼的功能是將 `pivot` 插在 list_less 的後面，所以 `HHHH` 應該使用 `list_move_tail` 。再來是檢查 **less 區** 與 **greater 區**是否為空，若不是空的就將其加入 stack 中，所以 `IIII` 與 `JJJJ` 都要填入 `&stack[++top]` ```c HHHH (&pivot->list, &list_less); if (!list_empty(&list_greater)) list_splice_tail(&list_greater, IIII); if (!list_empty(&list_less)) list_splice_tail(&list_less, JJJJ); ``` 這邊探討的是假如 `partition` 只剩下一個節點的情況應該要怎麼處理，可以看到他在這邊將 `partition` 加到 stack 中後，執行了 `while-loop`，這邊我想不透為何要這樣執行，因此參考 [yanjiew1](https://hackmd.io/@yanjiew/linux2023q1-quiz1#%E6%B8%AC%E9%A9%972)，發現是要將 stack top 上的 list 節點移除，所以 `KKKK` 應該要填入的是 `&stack[top--]`。 ```c else { top++; list_splice_tail(&partition, &stack[top]); while (top >= 0 && list_is_singular(&stack[top])) { struct item *tmp = list_first_entry(&stack[top], struct item, list); list_del(&tmp->list); INIT_LIST_HEAD(KKKK); list_add_tail(&tmp->list, head); } } ``` ### 提出程式碼的可改進地方並實作目前觀察到的問題為 1. 程式碼一開始所定義的 MAX_DEPTH 只有 512，因此若是要處理節點數超過 512 個的鏈結串列就會超出界線，或是遇到了 Worst case (例如從 **257 到 1** 的序列)，所設定的 stack 就會發生 [buffer overflow](https://en.wikipedia.org/wiki/Buffer_overflow) ### 比較測驗 `1` 和測驗 `2` 的程式碼 #### 測驗 1 (遞迴版本) 因為要在 lab0-c 這個專案下執行，所以需要修改一些部份才能夠正常執行，我將修改的程式碼放在[這裡](https://github.com/WangHanChi/linux_kernel_quiz/blob/main/quiz1/problem1/sort.c)了，然後也會使用 lab0-c 內部的 `time` 以及 perf 工具才進行效能的測試 | 次數 | 第一次 | 第二次 | 第三次 | 平均 | | ------- | ------ | ------ | ------ | ----- | | 10萬筆 | 0.027 | 0.026 | 0.027 | 0.027 | | 20萬筆 | 0.069 | 0.074 | 0.070 | 0.071 | | 30萬筆 | 0.145 | 0.149 | 0.143 | 0.146 | | 40萬筆 | 0.257 | 0.248 | 0.248 | 0.251 | | 50萬筆 | 0.368 | 0.381 | 0.376 | 0.375 | | 60萬筆 | 0.485 | 0.455 | 0.460 | 0.467 | | 70萬筆 | 0.598 | 0.589 | 0.592 | 0.593 | | 80萬筆 | 0.722 | 0.739 | 0.715 | 0.725 | | 90萬筆 | 0.804 | 0.810 | 0.814 | 0.809 | | 100萬筆 | 0.985 | 0.943 | 0.930 | 0.953 | :::warning 善用 gnuplot 製圖。 :notes: jserv > 好的，學生會多練習 gnuplot 並在下方附上效能比對圖 ::: #### 測驗 2 (非遞迴版本) 同樣也是修改了一些部份，使其可以在lab0-c 這個專案下執行，將檔案放在[這邊](https://github.com/WangHanChi/linux_kernel_quiz/blob/main/quiz1/problem2/sort.c) | 次數 | 第一次 | 第二次 | 第三次 | 平均 | | ------- | ------ | ------ | ------ | ----- | | 10萬筆 | 0.033 | 0.034 | 0.033 | 0.033 | | 20萬筆 | 0.083 | 0.086 | 0.088 | 0.086 | | 30萬筆 | 0.183 | 0.163 | 0.173 | 0.173 | | 40萬筆 | 0.281 | 0.320 | 0.315 | 0.305 | | 50萬筆 | 0.404 | 0.408 | 0.391 | 0.401 | | 60萬筆 | 0.574 | 0.588 | 0.528 | 0.563 | | 70萬筆 | 0.650 | 0.651 | 0.626 | 0.642 | | 80萬筆 | 0.803 | 0.779 | 0.829 | 0.804 | | 90萬筆 | 0.935 | 0.921 | 0.951 | 0.936 | | 100萬筆 | 1.025 | 1.076 | 1.006 | 1.036 | #### 效能比較從上方的詳細數據或是下方的圖表看出非遞迴的版本從 **10 萬** 筆資料後的效能已經落後遞迴的版本了，為了確保完整，因此我嘗試先從資料筆數更低的部份觀察 - 資料 (10 萬筆 ~ 100 萬筆)，==曲線越靠近右下方越好== :::spoiler 圖例不清楚的圖 ![](https://i.imgur.com/1IDXNWe.png) ::: ![](https://i.imgur.com/Z2GQX8t.png) - 資料 (1 萬筆 ~ 10 萬筆)，==曲線越靠近右下方越好== :::spoiler 圖例不清楚的圖 ![](https://i.imgur.com/Rmt1Q2q.png) ::: ![](https://i.imgur.com/IRxv51W.png) 可以從上面的兩張圖片觀察到，不論在資料量多少的情況下，**非遞迴**的版本花費的時間總是比**遞迴**的版本還要多，因此打算用 perf 工具來深入觀察，以下測試是在資料比數為 50 萬筆的時候 - 遞迴版本 ``` Performance counter stats for './qtest -f traces/trace-sort.cmd' (10 runs): 24,990,278 cache-misses # 32.899 % of all cache refs ( +- 0.52% ) 75,427,362 cache-references ( +- 0.41% ) 2,412,927,383 instructions # 0.90 insn per cycle ( +- 0.08% ) 2,703,750,725 cycles ( +- 0.45% ) 0.57981 +- 0.00242 seconds time elapsed ( +- 0.42% ) ``` - 非遞迴版本 ``` Performance counter stats for './qtest -f traces/trace-sort.cmd' (10 runs): 30,481,432 cache-misses # 34.391 % of all cache refs ( +- 0.61% ) 89,248,407 cache-references ( +- 0.84% ) 2,819,848,018 instructions # 0.96 insn per cycle ( +- 0.18% ) 2,983,632,953 cycles ( +- 0.49% ) 0.63051 +- 0.00329 seconds time elapsed ( +- 0.52% ) ``` 從 cache-misses 以及 instructions 的數量就可以發現，非遞迴版的執行速度比起遞迴版的還要慢是相當正常的。因為在指令的部份，非遞迴版本就已經多許多了，在相同的電腦環境之下，是很難超越遞迴版本的。 :::warning 圖例的單位應採 [SI prefix](https://en.wikipedia.org/wiki/Metric_prefix)，亦即 K, M, G 等等，避免用「萬」。座標軸的文字用英語書寫。 :notes: jserv > 已修正, 謝謝老師 ::: :::spoiler 不好的圖 ![](https://i.imgur.com/XkISzM3.png) ::: ### 提出效能改進方案，特別是避免依賴 `MAX_DEPTH` --- ## 測驗三 ### 解釋上述程式碼運作原理，指出其實作缺陷並改進首先可以看到下面的程式碼定義了 `xor_node_t` 與 `xor_list_t` 這兩個結構體，其中，可以把 `xor_list_t` 這個看作是整個 xor_linked_list 的 head 節點，它當中包含了 `head`, `tail` 與 `count` ，用來儲存這條 xor_linked_list 的所有資訊。再來看到 XOR_COMP 這個巨集，可以知道會是通過 **Exclusive or (^)** 這樣的位元運算所得到下一個地址的，所以 `LLLL` 會是 `(uintptr_t)(a) ^ (uintptr_t)(b)` ```c typedef struct _xorlist_node { struct _xorlist_node *cmp; } xor_node_t; typedef struct _xor_list_struct { xor_node_t head, tail; uint32_t count; } xor_list_t; #define XOR_COMP(a, b) ((xor_node_t *) (LLLL)) ``` 下面這段程式碼就是運用了剛剛所定義好的巨集，並且搭配 `assert` 這個功能來進行實作的接下來先在終端機輸入 ```shell $ man assert ``` 來取得第一手 assert 的資料 :::spoiler assert ASSERT(3) Linux Programmer's Manual ASSERT(3) NAME assert - abort the program if assertion is false SYNOPSIS #include <assert.h> void assert(scalar expression); DESCRIPTION This macro can help programmers find bugs in their programs, or handle exceptional cases via a crash that will produce limited debugging output. If expression is false (i.e., compares equal to zero), assert() prints an error message to standard error and terminates the program by calling abort(3). The error message includes the name of the file and function containing the assert() call, the source code line number of the call, and the text of the argument; something like: prog: some_file.c:16: some_func: Assertion `val == 0' failed. If the macro NDEBUG is defined at the moment <assert.h> was last included, the macro assert() generates no code, and hence does nothing at all. It is not recommended to de‐ fine NDEBUG if using assert() to detect error conditions since the software may behave non-deterministically. RETURN VALUE No value is returned. ATTRIBUTES For an explanation of the terms used in this section, see attributes(7). ┌──────────┬───────────────┬─────────┐ │Interface │ Attribute │ Value │ ├──────────┼───────────────┼─────────┤ │assert() │ Thread safety │ MT-Safe │ └──────────┴───────────────┴─────────┘ CONFORMING TO POSIX.1-2001, POSIX.1-2008, C89, C99. In C89, expression is required to be of type int and undefined behavior results if it is not, but in C99 it may have any scalar type. BUGS assert() is implemented as a macro; if the expression tested has side-effects, program behavior will be different depending on whether NDEBUG is defined. This may create Heisenbugs which go away when debugging is turned on. SEE ALSO abort(3), assert_perror(3), exit(3) COLOPHON This page is part of release 5.10 of the Linux man-pages project. A description of the project, information about reporting bugs, and the latest version of this page, can be found at https://www.kernel.org/doc/man-pages/. GNU 2017-09-15 ASSERT(3) ::: 可以確保 assert 中的敘述必定為 ture ，否則就會中斷程式並且報錯，因此運用在這個函式中，就可以得到 n1 && n2 必定有值，再去回傳 XOR_COMP 這個巨集的執行結果 ```c static inline xor_node_t *address_of(xor_node_t *n1, xor_node_t *n2) { assert(n1 && n2); return XOR_COMP(n1, n2); } ``` 這邊定義了許多風格近似於 `list.h` 的巨集，包括了 container_of 等等的可以發現到這邊找尋下一個節點的方式都是透過剛剛所定義好的 `address_of` ```c #define container_of(ptr, type, member) \ __extension__({ \ const __typeof__(((type *) 0)->member) *__pmember = (ptr); \ (type *) ((char *) __pmember - offsetof(type, member)); \ }) #define xorlist_for_each(node, rp, rn, list) \ for (rp = &(list)->head, node = rp->cmp; node != &(list)->tail; \ rn = address_of(rp, node->cmp), rp = node, node = rn) #define xorlist_for_each_prev(node, rp, rn, list) \ for (rp = &(list)->tail, node = rp->cmp; node != &(list)->head; \ rn = address_of(rp, node->cmp), rp = node, node = rn) #define xorlist_for_each_from(node, pos1, pos2, rp, rn, list) \ for (rp = pos2, node = pos1; node != &(list)->tail; \ rn = address_of(rp, node->cmp), rp = node, node = rn) #define xorlist_for_each_from_prev(node, pos1, pos2, rp, rn, list) \ for (rp = pos1, node = pos2; node != &(list)->head; \ rn = address_of(rp, node->cmp), rp = node, node = rn) ``` 這邊使用了老師所提到的 C 語言前置處理器，詳情可以參考[你所不知道的 C 語言：前置處理器應用篇](https://hackmd.io/@sysprog/c-preprocessor) 。 xorlist_delete_prototype 巨集的作用是定義一個名為 `_xorlist_delete_##name` 的函數，這個函數接受一個名為 node 的 `xor_node_t` 指標參數。這個函數在實現時會被用來從一個 xor_linked_list 中刪除給定節點。xorlist_delete_call 巨集的作用是展開為 `_xorlist_delete_##name` ，這裡的 name 是在之前定義 xorlist_delete_prototype 巨集時傳入的。 ```c /* Note that when the delete function success is must return 0. */ #define xorlist_delete_prototype(name, node) \ int _xorlist_delete_##name(xor_node_t *node) #define xorlist_delete_call(name) _xorlist_delete_##name ``` 下面的函式與巨集主要在初始化節點，功能相似於 `INIT_LIST_HEAD` 及 `LIST_HEAD` ```c static inline xor_node_t *xornode_init(xor_node_t *n) { assert(n); n->cmp = NULL; return n; } #define XORNODE_INIT(n) \ do { \ (n).cmp = NULL; \ } while (0) #define XORLIST_INIT(h) \ do { \ (h).head.cmp = &(h).tail; \ (h).tail.cmp = &(h).head; \ (h).count = 0; \ } while (0) ``` 這個函式的命名其實沒有很明確，他是把節點新增到 `head` 的下一個，若是改叫做 `xorlist_add_head` 會明確很多 ```c int xorlist_add(xor_list_t *list, xor_node_t *n) { xor_node_t *real_next; if (!n) return ENOMEM; xor_node_t *real_prev = &list->head; xor_node_t *node = real_prev->cmp; if (node == &list->tail) real_next = &list->tail; else real_next = node; real_prev->cmp = n; n->cmp = XOR_COMP(real_prev, real_next); real_next->cmp = XOR_COMP(n, XOR_COMP(real_prev, real_next->cmp)); list->count++; return 0; } ``` xorlist_add_head 的步驟可以分成==目前只有兩個節點==以及==三個節點以上== 首先先從只有兩個節點的部份開始: 1. 原本鏈結串列只有兩個節點 | Address | A | B | | -------- | ---- | ---- | | 特別名稱 | head | tail | | cmp | B | A | 2. 插入一個節點 C，需要改變的是 A 與 B 的 **cmp** 以及 C 的 cmp 是從 `A^B` 而來的 | Address | A | C | B | | -------- | ------------ | ------------ | ------------ | | 特別名稱 | head | node 1 | tail | | cmp | ~~B~~ ==C== | $A \oplus B$ | ~~A~~ ==C== | 3. 接著插入節點 D，修改加入節點的**前一個節點**以及**後一個節點**的 cmp 以及重新指派新插入節點的 cmp | Address | A | D | C | B | | -------- | ------------ | ------------ | ---------------- | ---- | | 特別名稱 | head | node 2 | node 1 | tail | | cmp | ~~C~~ ==D== | $A \oplus C$ | ==$D \oplus B$== | C | 由此發現，插入節點主要就是修改插入節點前後的 cmp 以及自己的 cmp 可以再插入一個節點來驗證此發現 4. 插入節點 E ，並套用上述結論，發現可以適用! | Address | A | E | D | C | B | | -------- | ------------ | --- | ------------ | ---------------- | ---- | | 特別名稱 | head | node 3 | node 2 | node 1 | tail | | cmp | ~~D~~ ==E== | $A \oplus D$ | ==$E \oplus C$== | $D \oplus B$ | C | ### 在上述 XOR linked list 的基礎，實作測驗 1 和 2 提及的快速排序演算法 ## Reference - [gnuplot 語法解說和示範](https://hackmd.io/@sysprog/Skwp-alOg#gnuplot-script) -