# 2021q1 Homework2 (quiz2) contributed by < `shanihsu` > ###### tags: `linux2021` ## 測驗 1 ### 作答區 #### Q1、Q2、Q4 ```c= static list_ele_t *get_middle(struct list_head *list) { struct list_head *fast = list->next, *slow; list_for_each (slow, list) { if (COND1 || COND2) break; fast = fast->next->next; } return list_entry(TTT, list_ele_t, list); } ``` * COND1 : `fast->next == list` * COND2 : `fast->next->next == list` * TTT : `slow` * `get_middle` 這個 function 功能主要用於尋找 linked list 的中間點,並回傳該中間點的位址 * 由 `list_for_each` 的迭代可看出,透過 `fast` 跟 `slow` 兩變數來尋找 linked list 的中間點,在每圈迭代中 : * `fast` : 暫存 linked list 中與前次 `fast` 相隔兩格的 node * `slow` : 暫存 linked list 中與前次 `slow` 相隔一格的 node * 迭代終止條件 : 因為此 linked list 是 circuler doubly linked list,所以當 `fast->next == list` 或 `fast->next->next == list` 時,代表整個 linked list 已搜尋完畢,即停止迭代搜尋。 * 回傳值 : 當迭代完畢時,fast 會停留在 linked list 的尾端或倒數第二個,而 `slow` 會停留在linked list 中間值的位置,此時只需要呼叫 `list_entry` ,回傳 `slow` 這個 struct 的位址,即可得到該 linked list 的中間點位址。 #### Q3 ```c= void list_merge_sort(queue_t *q) { if (list_is_singular(&q->list)) return; queue_t left; struct list_head sorted; INIT_LIST_HEAD(&left.list); list_cut_position(&left.list, &q->list, MMM); list_merge_sort(&left); list_merge_sort(q); list_merge(&left.list, &q->list, &sorted); INIT_LIST_HEAD(&q->list); list_splice_tail(&sorted, &q->list); } ``` * MMM : `&get_middle(&q->list)->list` * `list_cut_position` 這個 function 功能用來將 linked list 依照中間點分成兩個 linked list , function 定義 : ```c= static inline void list_cut_position(struct list_head *head_to, struct list_head *head_from, struct list_head *node) ``` * function 第三個參數要傳入 struct `list_head` 型態, `get_middle` 回傳的型態為 `list_ele_t *` ,須將其取址後,找到 `__element` 型別,再取出其中的 `list` 元素。 ### 解釋程式碼原理 #### `INIT_LIST_HEAD` * 初始化空的 linked list `head` ```c= static inline void INIT_LIST_HEAD(struct list_head *head) { head->next = head; head->prev = head; } ``` ```graphviz digraph structs { node[shape=record] struct1 [label="<f0> prev|<f1> next", xlabel = "head"]; struct1:f0 -> struct1; struct1:f1 -> struct1; } ``` #### `list_add_tail` * 將 `node` 接在 linked list 的尾端 * 因為是 circuler doubly linked list ,所以其 `head->prev` 相當於此 linked list 的尾端 * 先宣告 `prev` 暫存 `head->prev` ,再將 node 接上 ```c= static inline void list_add_tail(struct list_head *node, struct list_head *head) { struct list_head *prev = head->prev; prev->next = node; node->next = head; node->prev = prev; head->prev = node; } ``` ```graphviz digraph structs { rankdir=LR; node [shape=record]; a [label="{ <prev> prev | <next> next}", xlabel = "head"] b [label="{ <prev> prev | <next> next }"]; c [label="{ <prev> prev | <next> next}",xlabel = "node"]; a -> b ; b -> a ; b -> c ; c -> b ; c:next -> a:prev ; a:prev -> c:next ; } ``` #### `list_del` * 先將欲刪除之 `node` 的 `prev` 與 `next` 暫存 * 再將`prev` 與 `next`相連接,以達到刪除此 `node` * 因為是 circuler doubly linked list ,所以當 `node == head | tail` 時,視為刪除一般 `ndoe` ,不需要特別考慮其狀況 ```c= static inline void list_del(struct list_head *node) { struct list_head *next = node->next, *prev = node->prev; next->prev = prev; prev->next = next; } ``` #### `list_empty` * 當 `head->next == head` 時,代表 `head` 指回自己,此 linked list 為空 ```c= static inline int list_empty(const struct list_head *head) { return (head->next == head); } ``` #### `list_is_singular` * 當 linked list 非空,且 `head->prev` 與 `head->next` 指向同一塊記憶體,表示 linked list 中只有一個 node ```c= static inline int list_is_singular(const struct list_head *head) { return (!list_empty(head) && head->prev == head->next); } ``` #### `list_splice_tail` * 將 `list` 接到 `head` 的尾端 * 運用其為 circuler doubly linked list ,可得知兩個 linked list 之頭尾,並將其暫存後,重新串接 ```c= static inline void list_splice_tail(struct list_head *list, struct list_head *head) { struct list_head *head_last = head->prev; struct list_head *list_first = list->next, *list_last = list->prev; if (list_empty(list)) return; head->prev = list_last; list_last->next = head; list_first->prev = head_last; head_last->next = list_first; } ``` #### `list_cut_position` * `head_from` : 欲切割的 linked list * `node` : 欲切割點 * `head_to` : 切割後以 `node` 為開頭,並串接原本 `head_from` 的下一個點 * 將 `head_from` 以 `node` 為切點切割,切割後分別得到以 `node` 為分界的兩個 linked list * `node` 左邊的 linked list 由 `head_to` 串接,並將 `node` 串接於首 * `node` 右邊的 linked list 由 `head_from`串接 ```c= static inline void list_cut_position(struct list_head *head_to, struct list_head *head_from, struct list_head *node) { struct list_head *head_from_first = head_from->next; if (list_empty(head_from)) return; if (head_from == node) { INIT_LIST_HEAD(head_to); return; } head_from->next = node->next; head_from->next->prev = head_from; head_to->prev = node; node->next = head_to; head_to->next = head_from_first; head_to->next->prev = head_to; } ``` #### `get_middle` * `fast` : 暫存 linked list 中與前次 `fast` 相隔兩格的 node * `slow` : 暫存 linked list 中與前次 `slow` 相隔一格的 node * 透過 `fast` 與 `slow` 兩變數在每次迴圈中移動速度不同,尋找 linked list 的中間點,並回傳該中間點的位址 * 因為此 linked list 是 circuler doubly linked list,所以當 `fast->next == list` 或 `fast->next->next == list` 時,代表整個 linked list 已搜尋完畢,即停止迭代搜尋。 * 回傳值 : 當迭代完畢時,fast 會停留在 linked list 的尾端或倒數第二個,而 `slow` 會停留在linked list 中間值的位置,此時只需要呼叫 `list_entry` ,回傳 `slow` 這個 struct 的位址,即可得到該 linked list 的中間點位址。 ```c= static list_ele_t *get_middle(struct list_head *list) { struct list_head *fast = list->next, *slow; list_for_each (slow, list) { if (fast->next == list || fast->next->next == list) break; fast = fast->next->next; } return list_entry(slow, list_ele_t, list); } ``` #### `list_merge` * 判斷 `lhs->next` 跟 `rhs->next` 的 `value` 大小,將小的 node 從原本所在的 linked list 中刪除,並將其接到 `head` 之後 * 重複上述動作,直到 `lhs` 或 `rhs` 其中之一為空時,將另一 linked list 串接到 `head` 後 ```c= static void list_merge(struct list_head *lhs, struct list_head *rhs, struct list_head *head) { INIT_LIST_HEAD(head); if (list_empty(lhs)) { list_splice_tail(lhs, head); return; } if (list_empty(rhs)) { list_splice_tail(rhs, head); return; } while (!list_empty(lhs) && !list_empty(rhs)) { char *lv = list_entry(lhs->next, list_ele_t, list)->value; char *rv = list_entry(rhs->next, list_ele_t, list)->value; struct list_head *tmp = strcmp(lv, rv) <= 0 ? lhs->next : rhs->next; list_del(tmp); list_add_tail(tmp, head); } list_splice_tail(list_empty(lhs) ? rhs : lhs, head); } ``` ### 改進空間 :::danger 不該急著列出程式碼,相反地,你該討論上述程式碼有哪些缺失,詳列可改進的空間及策略,提出落實的方案,最後才是程式碼和實驗。 :notes: jserv ::: #### `list_merge` * 下面程式碼在第7行及第12行中邏輯有誤,當該 linked list 為空時,將此空的 list 加到 `head` 之後已無意義,應做以下調整,將另一個非空的 linked list 加到 `head` 之後。 ```diff= static void list_merge(struct list_head *lhs, struct list_head *rhs, struct list_head *head) { INIT_LIST_HEAD(head); if (list_empty(lhs)) { - list_splice_tail(lhs, head); + list_splice_tail(rhs, head); return; } if (list_empty(rhs)) { - list_splice_tail(rhs, head); + list_splice_tail(lhs, head); return; } } ``` #### 現有的程式架構 * 有以下三種架構 : ```graphviz digraph structs { subgraph cluster_b{ label = "list_head" node[shape=record] struct1 [label="<f0> *prev|<f1> *next" ]; } } ``` ```graphviz digraph structs { subgraph cluster_head { label="list_ele_t"; node [shape=record]; head [label="value"]; head1 [label="*next"]; list_head [label="{*prev|*next}"]; } } ``` ```graphviz digraph G { subgraph cluster_b { node [shape=record]; label = "queue_t" subgraph cluster_head { label="list_ele_t *head"; node [shape=record]; head [label="value"]; head1 [label="*next"]; list_head [label="{*prev|*next}"]; } list_ele1 [label="size"]; subgraph cluster_tail { label="list_ele_t *tail"; node [shape=record]; tail [label="value"]; tail1 [label="*next"]; list_tail [label="{*prev|*next}"]; } subgraph cluster_list { label="list_head *list"; node [shape=record]; list [label="{*prev|*next}"]; } } } ``` * 由上圖可看出,list_ele_t 中已有 `struct list_head` 紀錄 `*next` ,所以 `struct __element *next` 是多餘的。 * 對於整個 list 而言,有 `list_ele_t` 就可串接整個 list ,也可以紀錄 `head` 跟 `tail` ,不需要再額外使用 `queue_t`。 * 故原本的程式碼可改成以下 : ```diff= typedef struct __element { char *value; - struct __element *next; struct list_head list; } list_ele_t; -typedef struct { - list_ele_t *head; /* Linked list of elements */ - list_ele_t *tail; - size_t size; - struct list_head list; -} queue_t; ``` #### malloc 的回傳值型態 * 在 `q_insert_head` 中 malloc `list_ele_t` 寫法如下 : ```c= list_ele_t *newh = malloc(sizeof(list_ele_t)); ``` * 在[你所不知道的C語言:指標篇](https://hackmd.io/@sysprog/c-pointer#%E4%BD%A0%E6%89%80%E4%B8%8D%E7%9F%A5%E9%81%93%E7%9A%84C%E8%AA%9E%E8%A8%80%EF%BC%9A%E6%8C%87%E6%A8%99%E7%AF%87)中有提到 > `void *` 存在的目的就是為了強迫使用者使用顯式轉型或是強制轉型,以避免 Undefined behavior 產生 但在上述使用 `list_ele_t *` 型態直接取用,沒有先將 `*void` 強制轉型,卻沒有發生錯誤,以下為此問題的探討 : * 根據 [C11 standard](http://open-std.org/JTC1/SC22/WG14/www/docs/n1570.pdf) > C11 [6.3.2.3] Pointers 第一點 A pointer to void may be converted to or from a pointer to any object type. A pointer to any object type may be converted to a pointer to void and back again; the result shall compare equal to the original pointer. > C11 [6.5.16.1] Simple assignment 第二點 In simple assignment (=), the value of the right operand is converted to the type of the assignment expression and replaces the value stored in the object designated by the left operand. 由此可知,在 C 語言中,使用 simple assignment (=), `void*` 是可以被 implicit cast 成為其他型別的 pointer。 :::warning 但在 [sysprog21/linux-list](https://github.com/sysprog21/linux-list/blob/master/examples/insert-sort.c) 中的第 55 行 ```c= item = (struct listitem *) malloc(sizeof(*item)); ``` 卻仍將 `void *` 強制轉型成 `struct listitem` 型態,想請問以 `malloc` 而言,是否一定要強制轉型才是安全的? > 對 C 語言沒影響,但對 C++ 語言則需要 explicit casting ::: ### 將 `lib/list_sort.c` 的實作抽離為可單獨執行 (standalone) 的使用層級應用程式 > [GitHub](https://github.com/shanihsu/Linux_2021/blob/main/quiz2/list_sort.c) 學習 [sysprog21/linux-list](https://github.com/sysprog21/linux-list) 的手法,將 [lib/list_sort.c](https://github.com/torvalds/linux/blob/master/lib/list_sort.c) 的實作抽離為可單獨執行 (standalone) 的使用層級應用程式,並利用隨機產生的 256 個 value 做測試。以下為實作方法 : #### 架構 ```graphviz digraph structs { rankdir=LR; node[shape=record]; subgraph cluster_b{ label = "list_head testlist"; struct1 [label="<f0> *prev|<f1> *next" ]; } subgraph cluster_1 { label="listitem"; head [label="i"]; subgraph cluster_c{ label = "list_head list" node[shape=record] struct2 [label="<f0> *prev|<f1> *next" ]; } } subgraph cluster_2 { label="listitem"; node [shape=record]; head1 [label="i"]; subgraph cluster_d{ label = "list_head list" node[shape=record] struct3 [label="<f0> *prev|<f1> *next" ]; } } subgraph cluster_head { label="listitem"; node [shape=record]; head2 [label="i"]; subgraph cluster_e{ label = "list_head list" node[shape=record] struct4 [label="<f0> *prev|<f1> *next" ]; } } struct1:f1->struct2; struct2:f1->struct3; struct3:f1->struct4; struct2:f0->struct1; struct3:f0->struct2; struct4:f0->struct3; } ``` #### 新增 `listitem` 其中的 `i` 所存的值為欲排序的 `value` 值 ```c= struct listitem { uint16_t i; struct list_head list; }; ``` #### 移除`cmp_func` * 將`lib/list_sort.c` 中的`cmp_func` 移除,改用 `cmpint` 替代,使 `lib/list_sort.c` 成功抽離為可單獨執行的程式。 ```c= static inline int cmpint(const void *p1, const void *p2) { const uint16_t *i1 = (const uint16_t *) p1; const uint16_t *i2 = (const uint16_t *) p2; return *i1 - *i2; } ``` * 使用 `list_entry` ,從 `listitem` 搜尋,並取出存在 `listitem` 中欲比較的值 `i` ,存入 `value_a` 、 `value_b` ,再將此兩個值取址,並傳入 `cmpint` 中比較。 ```c= static struct list_head *merge(struct list_head *a, struct list_head *b) { struct list_head *head, **tail = &head; for (;;) { struct listitem *item = NULL; uint16_t value_a = list_entry(a, __typeof__(*item), list)->i; uint16_t value_b = list_entry(b, __typeof__(*item), list)->i; /* if equal, take 'a' -- important for sort stability */ if (cmpint(&value_a, &value_b) <= 0) { *tail = a; tail = &a->next; a = a->next; if (!a) { *tail = b; break; } } else { *tail = b; tail = &b->next; b = b->next; if (!b) { *tail = a; break; } } } return head; } ``` #### 移除 `likely` 與 `unlikely` macro * `likely` 與 `unlikely` 這兩個 macro 被定義在 `/include/linux/compiler.h` 中,可以讓 compiler 在編譯 assembly code 的時候最佳化,將其移除對程式的正確性沒有影響,故移除以成功抽離 `lib/list_sort.c` ### 說明 Linux 核心的 `lib/list_sort.c` 最佳化手法 #### `likely` 與 `unlikely` branch predictor * 在 `/include/linux/compiler.h` 中有以下定義: ```cpp #define likely(x) __builtin_expect(!!(x), 1) #define unlikely(x) __builtin_expect(!!(x), 0) ``` * 範例 : ```cpp if (likely(x)) { /* execute when x is true */ } else { /* execute when x is false */ } ``` * 使用 `likely` 時,表示其 (x == true) 較常發生, compiler 會將 `likely` 敘述後方的程式碼編譯在判斷之後。 * 使用 `unlikely` 時,表示其 (x == false) 較常發生, compiler 會將 `unlikely` 敘述後方的程式碼編譯在判斷之後。 * 透過 `likely` 跟 `unlikely` ,可以調整編譯器轉換的組合語言順序,將較常發生的判斷式敘述與條件程式碼,編譯到接近的位置 * 當 mispredict 時,需從 memory 一次搬一個區塊記憶體的資料到 cache ,當程式碼透過 `likely` 跟 `unlikely` 編譯後,會產生對 branch predictor 較友善的程式碼,從而縮減 mispredict 的執行成本,但若 `likely` 和 `unlikely` 誤用,會有反效果 #### 將 circular doubly-linked list 轉成 null-terminated singly-linked list 在一開始先將 doubly linked list 轉成 singly-linked list ,再做排序,可以在運算過程減少不必要的 assign ```cpp /* Convert to a null-terminated singly-linked list. */ head->prev->next = NULL; ``` 在最後運算完成,再將其改回 doubly-linked list ```cpp /* And the final links to make a circular doubly-linked list */ tail->next = head; head->prev = tail; ``` #### 將 mergesort 盡可能地以 2:1 的比例合併 [linux/lib/list_sort.c](https://github.com/torvalds/linux/blob/master/lib/list_sort.c) ```c=129 /* * This mergesort is as eager as possible while always performing at least * 2:1 balanced merges. Given two pending sublists of size 2^k, they are * merged to a size-2^(k+1) list as soon as we have 2^k following elements. * * Thus, it will avoid cache thrashing as long as 3*2^k elements can * fit into the cache. Not quite as good as a fully-eager bottom-up * mergesort, but it does use 0.2*n fewer comparisons, so is faster in * the common case that everything fits into L1. */ ``` * 不同於一般對半切的 mergesort ,此 mergesort 盡可能採用 2:1 的比例合併,意思是當有 3 * 2^k^ 個元素存在時,會先將兩個 size 是 2^k^ 的 pending sublists ,合併成一個 size 是 2^k+1^ 的 list , 這個新合併的 list 跟剩下的 2^k^ 個元素剛好是 2:1 ,以此類推一直合併下去。 * 如果這 3 * 2^k^ 個元素可以剛好都在 cache 中,即可避免 cache thrashing 。 :::info TODO: 註解中寫到 "Not quite as good as a fully-eager bottom-up mergesort, but it does use 0.2 * n fewer comparisons" ,深入探討為何可以少 0.2 * n 次的比較。 ::: ```c=139 /* * The merging is controlled by "count", the number of elements in the * pending lists. This is beautifully simple code, but rather subtle. * * Each time we increment "count", we set one bit (bit k) and clear * bits k-1 .. 0. Each time this happens (except the very first time * for each bit, when count increments to 2^k), we merge two lists of * size 2^k into one list of size 2^(k+1). */ ``` * merge 是由變數 `count` 控制, `count` 為整個 list 的元素數量, 其第 k 個 bit 代表現存大小為 2^k^ sublist 的數量。 * 當 count 每增加 1 時,set one bit (bit k) and clear bits k-1 .. 0 ,這邊的 bit k 指的是 the least-significant clear bit in count 。 * `count` 以二進位表示,遇到進位的情況,即表示有兩組相同的 2^k^ 個 sublists 要合併,連續進位表示連續合併。 ```c=196 /* * Data structure invariants: * - All lists are singly linked and null-terminated; prev * pointers are not maintained. * - pending is a prev-linked "list of lists" of sorted * sublists awaiting further merging. * - Each of the sorted sublists is power-of-two in size. * - Sublists are sorted by size and age, smallest & newest at front. * - There are zero to two sublists of each size. * - A pair of pending sublists are merged as soon as the number * of following pending elements equals their size (i.e. * each time count reaches an odd multiple of that size). * That ensures each later final merge will be at worst 2:1. * - Each round consists of: * - Merging the two sublists selected by the highest bit * which flips when count is incremented, and * - Adding an element from the input as a size-1 sublist. */ ``` 從註解中我們可以知道,在每圈迴圈中主要會做兩件事 : * 需要 merge 時,將兩個 sublist merge * 將新的 element 加入 pending 中 * `count` + 1 下方利用表格探討每次迴圈執行結果,舉 `count` = 0~16 為例 : | count decimal | count binary | sublist 4 | sublist 3 | sublist 2 | sublist 1 | new sublist | merge range | merge length | | ------------- | ------------ | --------- | --------- | --------- | --------- | ----------- | ----------- | ------------ | | 0 | 00000 | | | | | 1 | | | | 1 | 00001 | | | | 1 | 1 | | | | 2(merge) | 00010 | | | | 2 | 1 | [0,1] | 2 | | 3 | 00011 | | | 2 | 1 | 1 | | | | 4(merge) | 00100 | | | 2 | 2 | 1 | [2,3] | 2 | | 5(merge) | 00101 | | | 4 | 1 | 1 | [0,3] | 4 | | 6(merge) | 00110 | | | 4 | 2 | 1 | [4,5] | 2 | | 7 | 00111 | | 4 | 2 | 1 | 1 | | | | 8(merge) | 01000 | | 4 | 2 | 2 | 1 | [6,7] | 2 | | 9(merge) | 01001 | | 4 | 4 | 1 | 1 | [4,7] | 4 | | 10(merge) | 01010 | | 4 | 4 | 2 | 1 | [8,9] | 2 | | 11(merge) | 01011 | | 8 | 2 | 1 | 1 | [0,7] | 8 | | 12(merge) | 01100 | | 8 | 2 | 2 | 1 | [10,11] | 2 | | 13(merge) | 01101 | | 8 | 4 | 1 | 1 | [8,11] | 4 | | 14(merge) | 01110 | | 8 | 4 | 2 | 1 | [12,13] | 2 | | 15 | 01111 | 8 | 4 | 2 | 1 | 1 | | | | 16(merge) | 10000 | 8 | 4 | 2 | 2 | 1 | [14,15] | 2 | 其中,`pending` 是一個 list of lists 的結構,將所有等待 merge 的 sublist 串接起來,為了方便理解,我們將這些串接起來的 sublist 編號, sublist k 最大可串接 2^k^ 個元素,當到達容納上限,就會 merge ;而每次新增的 element 則會串接到 new sublist 。 由上表我們可以發現 : * 當 `count` = 2 的冪 - 1 時不會 merge * merge length 在 2^k^-1 ~ 2^k+1^-1 之間的最大值會剛好落在 $$ (2^k-1) + (2^{k+1}-1) \over 2 $$ ### 效能評比 將 [linux/lib/list_sort.c](https://github.com/torvalds/linux/blob/master/lib/list_sort.c) 中的 merge sort 抽出獨立成 [quiz2/list_sort.c](https://github.com/shanihsu/Linux_2021/blob/main/quiz2/list_sort.c) ,並嘗試對 [sysprog21/linux-list](https://github.com/sysprog21/linux-list) 中的 insert sort 與 quick sort 和 [quiz2/list_sort.c](https://github.com/shanihsu/Linux_2021/blob/main/quiz2/list_sort.c) 中的 merge sort 做效能評比 * 先使用 valgrind 測試 memory leak ,發現這三個 sort 均會出現以下的錯誤訊息,但他們執行後的排序結果均是正確的,進一步觀察發現這三個 sort 的錯誤訊息完全相同, heap 狀態也都一樣,這部份使用 gdb 偵錯還是找不出問題,不知道問題出在哪。 :::info TODO: 找出三個程式中 memory leak 的地方。 ::: ``` $ valgrind -v --leak-check=full ./list_sort ==67361== Memcheck, a memory error detector ==67361== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. ==67361== Using Valgrind-3.15.0-608cb11914-20190413 and LibVEX; rerun with -h for copyright info ==67361== Command: ./list_sort ==67361== --67361-- Valgrind options: --67361-- -v --67361-- --leak-check=full --67361-- Contents of /proc/version: --67361-- Linux version 5.8.0-59-generic (buildd@lcy01-amd64-022) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021 --67361-- --67361-- Arch and hwcaps: AMD64, LittleEndian, amd64-cx16-lzcnt-rdtscp-sse3-ssse3-avx-avx2-bmi-f16c-rdrand --67361-- Page sizes: currently 4096, max supported 4096 --67361-- Valgrind library directory: /usr/lib/x86_64-linux-gnu/valgrind --67361-- Reading syms from /media/shani/0FEA0DDE0FEA0DDE/jserv/Linux_2021/quiz2/list_sort --67361-- Reading syms from /usr/lib/x86_64-linux-gnu/ld-2.31.so --67361-- Considering /usr/lib/x86_64-linux-gnu/ld-2.31.so .. --67361-- .. CRC mismatch (computed 975d0390 wanted 30bd717f) --67361-- Considering /lib/x86_64-linux-gnu/ld-2.31.so .. --67361-- .. CRC mismatch (computed 975d0390 wanted 30bd717f) --67361-- Considering /usr/lib/debug/lib/x86_64-linux-gnu/ld-2.31.so .. --67361-- .. CRC is valid --67361-- Reading syms from /usr/lib/x86_64-linux-gnu/valgrind/memcheck-amd64-linux --67361-- object doesn't have a symbol table --67361-- object doesn't have a dynamic symbol table --67361-- Scheduler: using generic scheduler lock implementation. --67361-- Reading suppressions file: /usr/lib/x86_64-linux-gnu/valgrind/default.supp ==67361== embedded gdbserver: reading from /tmp/vgdb-pipe-from-vgdb-to-67361-by-shani-on-??? ==67361== embedded gdbserver: writing to /tmp/vgdb-pipe-to-vgdb-from-67361-by-shani-on-??? ==67361== embedded gdbserver: shared mem /tmp/vgdb-pipe-shared-mem-vgdb-67361-by-shani-on-??? ==67361== ==67361== TO CONTROL THIS PROCESS USING vgdb (which you probably ==67361== don't want to do, unless you know exactly what you're doing, ==67361== or are doing some strange experiment): ==67361== /usr/lib/x86_64-linux-gnu/valgrind/../../bin/vgdb --pid=67361 ...command... ==67361== ==67361== TO DEBUG THIS PROCESS USING GDB: start GDB like this ==67361== /path/to/gdb ./list_sort ==67361== and then give GDB the following command ==67361== target remote | /usr/lib/x86_64-linux-gnu/valgrind/../../bin/vgdb --pid=67361 ==67361== --pid is optional if only one valgrind process is running ==67361== --67361-- REDIR: 0x4022e10 (ld-linux-x86-64.so.2:strlen) redirected to 0x580c9ce2 (???) --67361-- REDIR: 0x4022be0 (ld-linux-x86-64.so.2:index) redirected to 0x580c9cfc (???) --67361-- Reading syms from /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_core-amd64-linux.so --67361-- object doesn't have a symbol table --67361-- Reading syms from /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so --67361-- object doesn't have a symbol table ==67361== WARNING: new redirection conflicts with existing -- ignoring it --67361-- old: 0x04022e10 (strlen ) R-> (0000.0) 0x580c9ce2 ??? --67361-- new: 0x04022e10 (strlen ) R-> (2007.0) 0x0483f060 strlen --67361-- REDIR: 0x401f5f0 (ld-linux-x86-64.so.2:strcmp) redirected to 0x483ffd0 (strcmp) --67361-- REDIR: 0x4023370 (ld-linux-x86-64.so.2:mempcpy) redirected to 0x4843a20 (mempcpy) --67361-- Reading syms from /usr/lib/x86_64-linux-gnu/libc-2.31.so --67361-- Considering /usr/lib/x86_64-linux-gnu/libc-2.31.so .. --67361-- .. CRC mismatch (computed 86b78530 wanted e380f01c) --67361-- Considering /lib/x86_64-linux-gnu/libc-2.31.so .. --67361-- .. CRC mismatch (computed 86b78530 wanted e380f01c) --67361-- Considering /usr/lib/debug/lib/x86_64-linux-gnu/libc-2.31.so .. --67361-- .. CRC is valid --67361-- REDIR: 0x4900600 (libc.so.6:memmove) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x48ff900 (libc.so.6:strncpy) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x4900930 (libc.so.6:strcasecmp) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x48ff220 (libc.so.6:strcat) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x48ff960 (libc.so.6:rindex) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x4901dd0 (libc.so.6:rawmemchr) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x491ce60 (libc.so.6:wmemchr) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x491c9a0 (libc.so.6:wcscmp) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x4900760 (libc.so.6:mempcpy) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x4900590 (libc.so.6:bcmp) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x48ff890 (libc.so.6:strncmp) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x48ff2d0 (libc.so.6:strcmp) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x49006c0 (libc.so.6:memset) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x491c960 (libc.so.6:wcschr) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x48ff7f0 (libc.so.6:strnlen) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x48ff3b0 (libc.so.6:strcspn) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x4900980 (libc.so.6:strncasecmp) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x48ff350 (libc.so.6:strcpy) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x4900ad0 (libc.so.6:memcpy@@GLIBC_2.14) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x491e0d0 (libc.so.6:wcsnlen) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x491c9e0 (libc.so.6:wcscpy) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x48ff9a0 (libc.so.6:strpbrk) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x48ff280 (libc.so.6:index) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x48ff7b0 (libc.so.6:strlen) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x4908d20 (libc.so.6:memrchr) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x49009d0 (libc.so.6:strcasecmp_l) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x4900550 (libc.so.6:memchr) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x491cab0 (libc.so.6:wcslen) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x48ffc60 (libc.so.6:strspn) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x49008d0 (libc.so.6:stpncpy) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x4900870 (libc.so.6:stpcpy) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x4901e10 (libc.so.6:strchrnul) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x4900a20 (libc.so.6:strncasecmp_l) redirected to 0x48311d0 (_vgnU_ifunc_wrapper) --67361-- REDIR: 0x49e8490 (libc.so.6:__strrchr_avx2) redirected to 0x483ea10 (rindex) --67361-- REDIR: 0x48fa260 (libc.so.6:malloc) redirected to 0x483b780 (malloc) --67361-- REDIR: 0x49eb650 (libc.so.6:__mempcpy_avx_unaligned_erms) redirected to 0x4843660 (mempcpy) --67361-- REDIR: 0x49eb670 (libc.so.6:__memcpy_avx_unaligned_erms) redirected to 0x48429f0 (memmove) --67361-- REDIR: 0x48fa850 (libc.so.6:free) redirected to 0x483c9d0 (free) --67361-- REDIR: 0x49e82a0 (libc.so.6:__strchrnul_avx2) redirected to 0x4843540 (strchrnul) ==67361== ==67361== HEAP SUMMARY: ==67361== in use at exit: 0 bytes in 0 blocks ==67361== total heap usage: 257 allocs, 257 frees, 7,168 bytes allocated ==67361== ==67361== All heap blocks were freed -- no leaks are possible ==67361== ==67361== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) ``` 由於原本要做效能評比的三個 sort 目前還找不到 memory leak 的原因,沒辦法進一步做分析 cache 效益,但仍有嘗試使用以下工具 :::info TODO: 解決上述三個 sort memory leak 問題之後,再重新利用以下工具進行效能評比與 cache 效益分析。 ::: * 使用 perf 比較三個 sort 的 cache miss ``` $ sudo sh -c " echo -1 > /proc/sys/kernel/perf_event_paranoid" $ perf stat --repeat 6 -e cache-misses,cache-references,instructions,cycles,L1-dcache-load-misses ./list_sort ``` - [ ] merge sort in Linux kernel ``` Performance counter stats for './list_sort' (6 runs): 1,6311 cache-misses # 29.129 % of all cache refs ( +- 11.83% ) 5,5997 cache-references ( +- 2.07% ) 131,8244 instructions # 1.04 insn per cycle ( +- 0.31% ) 126,6532 cycles ( +- 1.37% ) 1,4966 L1-dcache-load-misses ( +- 2.14% ) 0.0013360 +- 0.0000469 seconds time elapsed ( +- 3.51% ) ``` - [ ] insert sort ``` Performance counter stats for './examples/insert-sort' (6 runs): 2,7948 cache-misses # 47.922 % of all cache refs ( +- 5.11% ) 5,8320 cache-references ( +- 2.57% ) 182,6139 instructions # 1.12 insn per cycle ( +- 0.19% ) 163,5738 cycles ( +- 1.82% ) 1,5430 L1-dcache-load-misses ( +- 2.83% ) 0.0018266 +- 0.0000739 seconds time elapsed ( +- 4.04% ) ``` - [ ] quick sort ``` Performance counter stats for './examples/quick-sort' (6 runs): 1,4498 cache-misses # 25.137 % of all cache refs ( +- 12.04% ) 5,7675 cache-references ( +- 2.51% ) 148,7628 instructions # 1.08 insn per cycle ( +- 0.22% ) 138,1092 cycles ( +- 1.66% ) 1,5233 L1-dcache-load-misses ( +- 2.50% ) 0.001906 +- 0.000127 seconds time elapsed ( +- 6.64% ) ``` 對於目前有 memory leak 的三個 sort 程式而言,觀察到 cache miss 的比例為 insert sort > merge sort in Linux kernel > quick sort 。 * 使用 Cachegrind 觀察 merge sort in Linux kernel 的 cache 狀況 ``` $ valgrind --tool=cachegrind ./list_sort ``` ``` ==68266== Cachegrind, a cache and branch-prediction profiler ==68266== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al. ==68266== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info ==68266== Command: ./list_sort ==68266== --68266-- warning: L3 cache found, using its data for the LL simulation. ==68266== ==68266== I refs: 740,995 ==68266== I1 misses: 1,199 ==68266== LLi misses: 1,173 ==68266== I1 miss rate: 0.16% ==68266== LLi miss rate: 0.16% ==68266== ==68266== D refs: 309,024 (200,368 rd + 108,656 wr) ==68266== D1 misses: 3,377 ( 2,570 rd + 807 wr) ==68266== LLd misses: 2,829 ( 2,087 rd + 742 wr) ==68266== D1 miss rate: 1.1% ( 1.3% + 0.7% ) ==68266== LLd miss rate: 0.9% ( 1.0% + 0.7% ) ==68266== ==68266== LL refs: 4,576 ( 3,769 rd + 807 wr) ==68266== LL misses: 4,002 ( 3,260 rd + 742 wr) ==68266== LL miss rate: 0.4% ( 0.3% + 0.7% ) ``` 我們可以進一步了解程式碼每一行的 cache 狀況 ``` $ cg_annotate --auto=yes cachegrind.out.68266 ``` 得到[結果](https://github.com/shanihsu/Linux_2021/blob/main/quiz2/cache_detail.txt),並從中觀察此 merge sort 對 cache 的效益。 :::info TODO : 1. 設計效能評比程式 2. 完成 quiz1 ,將在 quiz1 提出的改進實作,和 Linux 核心的 lib/list_sort.c 進行效能比較 3. 使用 ==[Cachegrind](https://valgrind.org/docs/manual/cg-manual.html)== 工具分析 cache 效益 :::