# 2021q1 Homework2 (quiz2)
contributed by < `shanihsu` >
###### tags: `linux2021`
## 測驗 1
### 作答區
#### Q1、Q2、Q4
```c=
static list_ele_t *get_middle(struct list_head *list)
{
struct list_head *fast = list->next, *slow;
list_for_each (slow, list) {
if (COND1 || COND2)
break;
fast = fast->next->next;
}
return list_entry(TTT, list_ele_t, list);
}
```
* COND1 : `fast->next == list`
* COND2 : `fast->next->next == list`
* TTT : `slow`
* `get_middle` 這個 function 功能主要用於尋找 linked list 的中間點,並回傳該中間點的位址
* 由 `list_for_each` 的迭代可看出,透過 `fast` 跟 `slow` 兩變數來尋找 linked list 的中間點,在每圈迭代中 :
* `fast` : 暫存 linked list 中與前次 `fast` 相隔兩格的 node
* `slow` : 暫存 linked list 中與前次 `slow` 相隔一格的 node
* 迭代終止條件 : 因為此 linked list 是 circuler doubly linked list,所以當 `fast->next == list` 或 `fast->next->next == list` 時,代表整個 linked list 已搜尋完畢,即停止迭代搜尋。
* 回傳值 : 當迭代完畢時,fast 會停留在 linked list 的尾端或倒數第二個,而 `slow` 會停留在linked list 中間值的位置,此時只需要呼叫 `list_entry` ,回傳 `slow` 這個 struct 的位址,即可得到該 linked list 的中間點位址。
#### Q3
```c=
void list_merge_sort(queue_t *q)
{
if (list_is_singular(&q->list))
return;
queue_t left;
struct list_head sorted;
INIT_LIST_HEAD(&left.list);
list_cut_position(&left.list, &q->list, MMM);
list_merge_sort(&left);
list_merge_sort(q);
list_merge(&left.list, &q->list, &sorted);
INIT_LIST_HEAD(&q->list);
list_splice_tail(&sorted, &q->list);
}
```
* MMM : `&get_middle(&q->list)->list`
* `list_cut_position` 這個 function 功能用來將 linked list 依照中間點分成兩個 linked list , function 定義 :
```c=
static inline void list_cut_position(struct list_head *head_to,
struct list_head *head_from,
struct list_head *node)
```
* function 第三個參數要傳入 struct `list_head` 型態, `get_middle` 回傳的型態為 `list_ele_t *` ,須將其取址後,找到 `__element` 型別,再取出其中的 `list` 元素。
### 解釋程式碼原理
#### `INIT_LIST_HEAD`
* 初始化空的 linked list `head`
```c=
static inline void INIT_LIST_HEAD(struct list_head *head)
{
head->next = head; head->prev = head;
}
```
```graphviz
digraph structs {
node[shape=record]
struct1 [label="<f0> prev|<f1> next", xlabel = "head"];
struct1:f0 -> struct1;
struct1:f1 -> struct1;
}
```
#### `list_add_tail`
* 將 `node` 接在 linked list 的尾端
* 因為是 circuler doubly linked list ,所以其 `head->prev` 相當於此 linked list 的尾端
* 先宣告 `prev` 暫存 `head->prev` ,再將 node 接上
```c=
static inline void list_add_tail(struct list_head *node, struct list_head *head)
{
struct list_head *prev = head->prev;
prev->next = node;
node->next = head;
node->prev = prev;
head->prev = node;
}
```
```graphviz
digraph structs {
rankdir=LR;
node [shape=record];
a [label="{ <prev> prev | <next> next}", xlabel = "head"]
b [label="{ <prev> prev | <next> next }"];
c [label="{ <prev> prev | <next> next}",xlabel = "node"];
a -> b ;
b -> a ;
b -> c ;
c -> b ;
c:next -> a:prev ;
a:prev -> c:next ;
}
```
#### `list_del`
* 先將欲刪除之 `node` 的 `prev` 與 `next` 暫存
* 再將`prev` 與 `next`相連接,以達到刪除此 `node`
* 因為是 circuler doubly linked list ,所以當 `node == head | tail` 時,視為刪除一般 `ndoe` ,不需要特別考慮其狀況
```c=
static inline void list_del(struct list_head *node)
{
struct list_head *next = node->next, *prev = node->prev;
next->prev = prev; prev->next = next;
}
```
#### `list_empty`
* 當 `head->next == head` 時,代表 `head` 指回自己,此 linked list 為空
```c=
static inline int list_empty(const struct list_head *head)
{
return (head->next == head);
}
```
#### `list_is_singular`
* 當 linked list 非空,且 `head->prev` 與 `head->next` 指向同一塊記憶體,表示 linked list 中只有一個 node
```c=
static inline int list_is_singular(const struct list_head *head)
{
return (!list_empty(head) && head->prev == head->next);
}
```
#### `list_splice_tail`
* 將 `list` 接到 `head` 的尾端
* 運用其為 circuler doubly linked list ,可得知兩個 linked list 之頭尾,並將其暫存後,重新串接
```c=
static inline void list_splice_tail(struct list_head *list,
struct list_head *head)
{
struct list_head *head_last = head->prev;
struct list_head *list_first = list->next, *list_last = list->prev;
if (list_empty(list))
return;
head->prev = list_last;
list_last->next = head;
list_first->prev = head_last;
head_last->next = list_first;
}
```
#### `list_cut_position`
* `head_from` : 欲切割的 linked list
* `node` : 欲切割點
* `head_to` : 切割後以 `node` 為開頭,並串接原本 `head_from` 的下一個點
* 將 `head_from` 以 `node` 為切點切割,切割後分別得到以 `node` 為分界的兩個 linked list
* `node` 左邊的 linked list 由 `head_to` 串接,並將 `node` 串接於首
* `node` 右邊的 linked list 由 `head_from`串接
```c=
static inline void list_cut_position(struct list_head *head_to,
struct list_head *head_from,
struct list_head *node)
{
struct list_head *head_from_first = head_from->next;
if (list_empty(head_from))
return;
if (head_from == node) {
INIT_LIST_HEAD(head_to);
return;
}
head_from->next = node->next;
head_from->next->prev = head_from;
head_to->prev = node;
node->next = head_to;
head_to->next = head_from_first;
head_to->next->prev = head_to;
}
```
#### `get_middle`
* `fast` : 暫存 linked list 中與前次 `fast` 相隔兩格的 node
* `slow` : 暫存 linked list 中與前次 `slow` 相隔一格的 node
* 透過 `fast` 與 `slow` 兩變數在每次迴圈中移動速度不同,尋找 linked list 的中間點,並回傳該中間點的位址
* 因為此 linked list 是 circuler doubly linked list,所以當 `fast->next == list` 或 `fast->next->next == list` 時,代表整個 linked list 已搜尋完畢,即停止迭代搜尋。
* 回傳值 : 當迭代完畢時,fast 會停留在 linked list 的尾端或倒數第二個,而 `slow` 會停留在linked list 中間值的位置,此時只需要呼叫 `list_entry` ,回傳 `slow` 這個 struct 的位址,即可得到該 linked list 的中間點位址。
```c=
static list_ele_t *get_middle(struct list_head *list)
{
struct list_head *fast = list->next, *slow;
list_for_each (slow, list) {
if (fast->next == list || fast->next->next == list)
break;
fast = fast->next->next;
}
return list_entry(slow, list_ele_t, list);
}
```
#### `list_merge`
* 判斷 `lhs->next` 跟 `rhs->next` 的 `value` 大小,將小的 node 從原本所在的 linked list 中刪除,並將其接到 `head` 之後
* 重複上述動作,直到 `lhs` 或 `rhs` 其中之一為空時,將另一 linked list 串接到 `head` 後
```c=
static void list_merge(struct list_head *lhs,
struct list_head *rhs,
struct list_head *head)
{
INIT_LIST_HEAD(head);
if (list_empty(lhs)) {
list_splice_tail(lhs, head);
return;
}
if (list_empty(rhs)) {
list_splice_tail(rhs, head);
return;
}
while (!list_empty(lhs) && !list_empty(rhs)) {
char *lv = list_entry(lhs->next, list_ele_t, list)->value;
char *rv = list_entry(rhs->next, list_ele_t, list)->value;
struct list_head *tmp = strcmp(lv, rv) <= 0 ? lhs->next : rhs->next;
list_del(tmp);
list_add_tail(tmp, head);
}
list_splice_tail(list_empty(lhs) ? rhs : lhs, head);
}
```
### 改進空間
:::danger
不該急著列出程式碼,相反地,你該討論上述程式碼有哪些缺失,詳列可改進的空間及策略,提出落實的方案,最後才是程式碼和實驗。
:notes: jserv
:::
#### `list_merge`
* 下面程式碼在第7行及第12行中邏輯有誤,當該 linked list 為空時,將此空的 list 加到 `head` 之後已無意義,應做以下調整,將另一個非空的 linked list 加到 `head` 之後。
```diff=
static void list_merge(struct list_head *lhs,
struct list_head *rhs,
struct list_head *head)
{
INIT_LIST_HEAD(head);
if (list_empty(lhs)) {
- list_splice_tail(lhs, head);
+ list_splice_tail(rhs, head);
return;
}
if (list_empty(rhs)) {
- list_splice_tail(rhs, head);
+ list_splice_tail(lhs, head);
return;
}
}
```
#### 現有的程式架構
* 有以下三種架構 :
```graphviz
digraph structs {
subgraph cluster_b{
label = "list_head"
node[shape=record]
struct1 [label="<f0> *prev|<f1> *next" ];
}
}
```
```graphviz
digraph structs {
subgraph cluster_head {
label="list_ele_t";
node [shape=record];
head [label="value"];
head1 [label="*next"];
list_head [label="{*prev|*next}"];
}
}
```
```graphviz
digraph G {
subgraph cluster_b {
node [shape=record];
label = "queue_t"
subgraph cluster_head {
label="list_ele_t *head";
node [shape=record];
head [label="value"];
head1 [label="*next"];
list_head [label="{*prev|*next}"];
}
list_ele1 [label="size"];
subgraph cluster_tail {
label="list_ele_t *tail";
node [shape=record];
tail [label="value"];
tail1 [label="*next"];
list_tail [label="{*prev|*next}"];
}
subgraph cluster_list {
label="list_head *list";
node [shape=record];
list [label="{*prev|*next}"];
}
}
}
```
* 由上圖可看出,list_ele_t 中已有 `struct list_head` 紀錄 `*next` ,所以 `struct __element *next` 是多餘的。
* 對於整個 list 而言,有 `list_ele_t` 就可串接整個 list ,也可以紀錄 `head` 跟 `tail` ,不需要再額外使用 `queue_t`。
* 故原本的程式碼可改成以下 :
```diff=
typedef struct __element {
char *value;
- struct __element *next;
struct list_head list;
} list_ele_t;
-typedef struct {
- list_ele_t *head; /* Linked list of elements */
- list_ele_t *tail;
- size_t size;
- struct list_head list;
-} queue_t;
```
#### malloc 的回傳值型態
* 在 `q_insert_head` 中 malloc `list_ele_t` 寫法如下 :
```c=
list_ele_t *newh = malloc(sizeof(list_ele_t));
```
* 在[你所不知道的C語言:指標篇](https://hackmd.io/@sysprog/c-pointer#%E4%BD%A0%E6%89%80%E4%B8%8D%E7%9F%A5%E9%81%93%E7%9A%84C%E8%AA%9E%E8%A8%80%EF%BC%9A%E6%8C%87%E6%A8%99%E7%AF%87)中有提到
> `void *` 存在的目的就是為了強迫使用者使用顯式轉型或是強制轉型,以避免 Undefined behavior 產生
但在上述使用 `list_ele_t *` 型態直接取用,沒有先將 `*void` 強制轉型,卻沒有發生錯誤,以下為此問題的探討 :
* 根據 [C11 standard](http://open-std.org/JTC1/SC22/WG14/www/docs/n1570.pdf)
> C11 [6.3.2.3] Pointers 第一點
A pointer to void may be converted to or from a pointer to any object type. A pointer to any object type may be converted to a pointer to void and back again; the result shall compare equal to the original pointer.
> C11 [6.5.16.1] Simple assignment 第二點
In simple assignment (=), the value of the right operand is converted to the type of the assignment expression and replaces the value stored in the object designated by the left operand.
由此可知,在 C 語言中,使用 simple assignment (=), `void*` 是可以被 implicit cast 成為其他型別的 pointer。
:::warning
但在 [sysprog21/linux-list](https://github.com/sysprog21/linux-list/blob/master/examples/insert-sort.c) 中的第 55 行
```c=
item = (struct listitem *) malloc(sizeof(*item));
```
卻仍將 `void *` 強制轉型成 `struct listitem` 型態,想請問以 `malloc` 而言,是否一定要強制轉型才是安全的?
> 對 C 語言沒影響,但對 C++ 語言則需要 explicit casting
:::
### 將 `lib/list_sort.c` 的實作抽離為可單獨執行 (standalone) 的使用層級應用程式
> [GitHub](https://github.com/shanihsu/Linux_2021/blob/main/quiz2/list_sort.c)
學習 [sysprog21/linux-list](https://github.com/sysprog21/linux-list) 的手法,將 [lib/list_sort.c](https://github.com/torvalds/linux/blob/master/lib/list_sort.c) 的實作抽離為可單獨執行 (standalone) 的使用層級應用程式,並利用隨機產生的 256 個 value 做測試。以下為實作方法 :
#### 架構
```graphviz
digraph structs {
rankdir=LR;
node[shape=record];
subgraph cluster_b{
label = "list_head testlist";
struct1 [label="<f0> *prev|<f1> *next" ];
}
subgraph cluster_1 {
label="listitem";
head [label="i"];
subgraph cluster_c{
label = "list_head list"
node[shape=record]
struct2 [label="<f0> *prev|<f1> *next" ];
}
}
subgraph cluster_2 {
label="listitem";
node [shape=record];
head1 [label="i"];
subgraph cluster_d{
label = "list_head list"
node[shape=record]
struct3 [label="<f0> *prev|<f1> *next" ];
}
}
subgraph cluster_head {
label="listitem";
node [shape=record];
head2 [label="i"];
subgraph cluster_e{
label = "list_head list"
node[shape=record]
struct4 [label="<f0> *prev|<f1> *next" ];
}
}
struct1:f1->struct2;
struct2:f1->struct3;
struct3:f1->struct4;
struct2:f0->struct1;
struct3:f0->struct2;
struct4:f0->struct3;
}
```
#### 新增 `listitem`
其中的 `i` 所存的值為欲排序的 `value` 值
```c=
struct listitem {
uint16_t i;
struct list_head list;
};
```
#### 移除`cmp_func`
* 將`lib/list_sort.c` 中的`cmp_func` 移除,改用 `cmpint` 替代,使 `lib/list_sort.c` 成功抽離為可單獨執行的程式。
```c=
static inline int cmpint(const void *p1, const void *p2)
{
const uint16_t *i1 = (const uint16_t *) p1;
const uint16_t *i2 = (const uint16_t *) p2;
return *i1 - *i2;
}
```
* 使用 `list_entry` ,從 `listitem` 搜尋,並取出存在 `listitem` 中欲比較的值 `i` ,存入 `value_a` 、 `value_b` ,再將此兩個值取址,並傳入 `cmpint` 中比較。
```c=
static struct list_head *merge(struct list_head *a, struct list_head *b)
{
struct list_head *head, **tail = &head;
for (;;) {
struct listitem *item = NULL;
uint16_t value_a = list_entry(a, __typeof__(*item), list)->i;
uint16_t value_b = list_entry(b, __typeof__(*item), list)->i;
/* if equal, take 'a' -- important for sort stability */
if (cmpint(&value_a, &value_b) <= 0) {
*tail = a;
tail = &a->next;
a = a->next;
if (!a) {
*tail = b;
break;
}
} else {
*tail = b;
tail = &b->next;
b = b->next;
if (!b) {
*tail = a;
break;
}
}
}
return head;
}
```
#### 移除 `likely` 與 `unlikely` macro
* `likely` 與 `unlikely` 這兩個 macro 被定義在 `/include/linux/compiler.h` 中,可以讓 compiler 在編譯 assembly code 的時候最佳化,將其移除對程式的正確性沒有影響,故移除以成功抽離 `lib/list_sort.c`
### 說明 Linux 核心的 `lib/list_sort.c` 最佳化手法
#### `likely` 與 `unlikely` branch predictor
* 在 `/include/linux/compiler.h` 中有以下定義:
```cpp
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
```
* 範例 :
```cpp
if (likely(x)) {
/* execute when x is true */
}
else {
/* execute when x is false */
}
```
* 使用 `likely` 時,表示其 (x == true) 較常發生, compiler 會將 `likely` 敘述後方的程式碼編譯在判斷之後。
* 使用 `unlikely` 時,表示其 (x == false) 較常發生, compiler 會將 `unlikely` 敘述後方的程式碼編譯在判斷之後。
* 透過 `likely` 跟 `unlikely` ,可以調整編譯器轉換的組合語言順序,將較常發生的判斷式敘述與條件程式碼,編譯到接近的位置
* 當 mispredict 時,需從 memory 一次搬一個區塊記憶體的資料到 cache ,當程式碼透過 `likely` 跟 `unlikely` 編譯後,會產生對 branch predictor 較友善的程式碼,從而縮減 mispredict 的執行成本,但若 `likely` 和 `unlikely` 誤用,會有反效果
#### 將 circular doubly-linked list 轉成 null-terminated singly-linked list
在一開始先將 doubly linked list 轉成 singly-linked list ,再做排序,可以在運算過程減少不必要的 assign
```cpp
/* Convert to a null-terminated singly-linked list. */
head->prev->next = NULL;
```
在最後運算完成,再將其改回 doubly-linked list
```cpp
/* And the final links to make a circular doubly-linked list */
tail->next = head;
head->prev = tail;
```
#### 將 mergesort 盡可能地以 2:1 的比例合併
[linux/lib/list_sort.c](https://github.com/torvalds/linux/blob/master/lib/list_sort.c)
```c=129
/*
* This mergesort is as eager as possible while always performing at least
* 2:1 balanced merges. Given two pending sublists of size 2^k, they are
* merged to a size-2^(k+1) list as soon as we have 2^k following elements.
*
* Thus, it will avoid cache thrashing as long as 3*2^k elements can
* fit into the cache. Not quite as good as a fully-eager bottom-up
* mergesort, but it does use 0.2*n fewer comparisons, so is faster in
* the common case that everything fits into L1.
*/
```
* 不同於一般對半切的 mergesort ,此 mergesort 盡可能採用 2:1 的比例合併,意思是當有 3 * 2^k^ 個元素存在時,會先將兩個 size 是 2^k^ 的 pending sublists ,合併成一個 size 是 2^k+1^ 的 list , 這個新合併的 list 跟剩下的 2^k^ 個元素剛好是 2:1 ,以此類推一直合併下去。
* 如果這 3 * 2^k^ 個元素可以剛好都在 cache 中,即可避免 cache thrashing 。
:::info
TODO:
註解中寫到 "Not quite as good as a fully-eager bottom-up mergesort, but it does use 0.2 * n fewer comparisons" ,深入探討為何可以少 0.2 * n 次的比較。
:::
```c=139
/*
* The merging is controlled by "count", the number of elements in the
* pending lists. This is beautifully simple code, but rather subtle.
*
* Each time we increment "count", we set one bit (bit k) and clear
* bits k-1 .. 0. Each time this happens (except the very first time
* for each bit, when count increments to 2^k), we merge two lists of
* size 2^k into one list of size 2^(k+1).
*/
```
* merge 是由變數 `count` 控制, `count` 為整個 list 的元素數量, 其第 k 個 bit 代表現存大小為 2^k^ sublist 的數量。
* 當 count 每增加 1 時,set one bit (bit k) and clear bits k-1 .. 0 ,這邊的 bit k 指的是 the least-significant clear bit in count 。
* `count` 以二進位表示,遇到進位的情況,即表示有兩組相同的 2^k^ 個 sublists 要合併,連續進位表示連續合併。
```c=196
/*
* Data structure invariants:
* - All lists are singly linked and null-terminated; prev
* pointers are not maintained.
* - pending is a prev-linked "list of lists" of sorted
* sublists awaiting further merging.
* - Each of the sorted sublists is power-of-two in size.
* - Sublists are sorted by size and age, smallest & newest at front.
* - There are zero to two sublists of each size.
* - A pair of pending sublists are merged as soon as the number
* of following pending elements equals their size (i.e.
* each time count reaches an odd multiple of that size).
* That ensures each later final merge will be at worst 2:1.
* - Each round consists of:
* - Merging the two sublists selected by the highest bit
* which flips when count is incremented, and
* - Adding an element from the input as a size-1 sublist.
*/
```
從註解中我們可以知道,在每圈迴圈中主要會做兩件事 :
* 需要 merge 時,將兩個 sublist merge
* 將新的 element 加入 pending 中
* `count` + 1
下方利用表格探討每次迴圈執行結果,舉 `count` = 0~16 為例 :
| count decimal | count binary | sublist 4 | sublist 3 | sublist 2 | sublist 1 | new sublist | merge range | merge length |
| ------------- | ------------ | --------- | --------- | --------- | --------- | ----------- | ----------- | ------------ |
| 0 | 00000 | | | | | 1 | | |
| 1 | 00001 | | | | 1 | 1 | | |
| 2(merge) | 00010 | | | | 2 | 1 | [0,1] | 2 |
| 3 | 00011 | | | 2 | 1 | 1 | | |
| 4(merge) | 00100 | | | 2 | 2 | 1 | [2,3] | 2 |
| 5(merge) | 00101 | | | 4 | 1 | 1 | [0,3] | 4 |
| 6(merge) | 00110 | | | 4 | 2 | 1 | [4,5] | 2 |
| 7 | 00111 | | 4 | 2 | 1 | 1 | | |
| 8(merge) | 01000 | | 4 | 2 | 2 | 1 | [6,7] | 2 |
| 9(merge) | 01001 | | 4 | 4 | 1 | 1 | [4,7] | 4 |
| 10(merge) | 01010 | | 4 | 4 | 2 | 1 | [8,9] | 2 |
| 11(merge) | 01011 | | 8 | 2 | 1 | 1 | [0,7] | 8 |
| 12(merge) | 01100 | | 8 | 2 | 2 | 1 | [10,11] | 2 |
| 13(merge) | 01101 | | 8 | 4 | 1 | 1 | [8,11] | 4 |
| 14(merge) | 01110 | | 8 | 4 | 2 | 1 | [12,13] | 2 |
| 15 | 01111 | 8 | 4 | 2 | 1 | 1 | | |
| 16(merge) | 10000 | 8 | 4 | 2 | 2 | 1 | [14,15] | 2 |
其中,`pending` 是一個 list of lists 的結構,將所有等待 merge 的 sublist 串接起來,為了方便理解,我們將這些串接起來的 sublist 編號, sublist k 最大可串接 2^k^ 個元素,當到達容納上限,就會 merge ;而每次新增的 element 則會串接到 new sublist 。
由上表我們可以發現 :
* 當 `count` = 2 的冪 - 1 時不會 merge
* merge length 在 2^k^-1 ~ 2^k+1^-1 之間的最大值會剛好落在
$$
(2^k-1) + (2^{k+1}-1) \over 2
$$
### 效能評比
將 [linux/lib/list_sort.c](https://github.com/torvalds/linux/blob/master/lib/list_sort.c) 中的 merge sort 抽出獨立成 [quiz2/list_sort.c](https://github.com/shanihsu/Linux_2021/blob/main/quiz2/list_sort.c) ,並嘗試對 [sysprog21/linux-list](https://github.com/sysprog21/linux-list) 中的 insert sort 與 quick sort 和 [quiz2/list_sort.c](https://github.com/shanihsu/Linux_2021/blob/main/quiz2/list_sort.c) 中的 merge sort 做效能評比
* 先使用 valgrind 測試 memory leak ,發現這三個 sort 均會出現以下的錯誤訊息,但他們執行後的排序結果均是正確的,進一步觀察發現這三個 sort 的錯誤訊息完全相同, heap 狀態也都一樣,這部份使用 gdb 偵錯還是找不出問題,不知道問題出在哪。
:::info
TODO:
找出三個程式中 memory leak 的地方。
:::
```
$ valgrind -v --leak-check=full ./list_sort
==67361== Memcheck, a memory error detector
==67361== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==67361== Using Valgrind-3.15.0-608cb11914-20190413 and LibVEX; rerun with -h for copyright info
==67361== Command: ./list_sort
==67361==
--67361-- Valgrind options:
--67361-- -v
--67361-- --leak-check=full
--67361-- Contents of /proc/version:
--67361-- Linux version 5.8.0-59-generic (buildd@lcy01-amd64-022) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021
--67361--
--67361-- Arch and hwcaps: AMD64, LittleEndian, amd64-cx16-lzcnt-rdtscp-sse3-ssse3-avx-avx2-bmi-f16c-rdrand
--67361-- Page sizes: currently 4096, max supported 4096
--67361-- Valgrind library directory: /usr/lib/x86_64-linux-gnu/valgrind
--67361-- Reading syms from /media/shani/0FEA0DDE0FEA0DDE/jserv/Linux_2021/quiz2/list_sort
--67361-- Reading syms from /usr/lib/x86_64-linux-gnu/ld-2.31.so
--67361-- Considering /usr/lib/x86_64-linux-gnu/ld-2.31.so ..
--67361-- .. CRC mismatch (computed 975d0390 wanted 30bd717f)
--67361-- Considering /lib/x86_64-linux-gnu/ld-2.31.so ..
--67361-- .. CRC mismatch (computed 975d0390 wanted 30bd717f)
--67361-- Considering /usr/lib/debug/lib/x86_64-linux-gnu/ld-2.31.so ..
--67361-- .. CRC is valid
--67361-- Reading syms from /usr/lib/x86_64-linux-gnu/valgrind/memcheck-amd64-linux
--67361-- object doesn't have a symbol table
--67361-- object doesn't have a dynamic symbol table
--67361-- Scheduler: using generic scheduler lock implementation.
--67361-- Reading suppressions file: /usr/lib/x86_64-linux-gnu/valgrind/default.supp
==67361== embedded gdbserver: reading from /tmp/vgdb-pipe-from-vgdb-to-67361-by-shani-on-???
==67361== embedded gdbserver: writing to /tmp/vgdb-pipe-to-vgdb-from-67361-by-shani-on-???
==67361== embedded gdbserver: shared mem /tmp/vgdb-pipe-shared-mem-vgdb-67361-by-shani-on-???
==67361==
==67361== TO CONTROL THIS PROCESS USING vgdb (which you probably
==67361== don't want to do, unless you know exactly what you're doing,
==67361== or are doing some strange experiment):
==67361== /usr/lib/x86_64-linux-gnu/valgrind/../../bin/vgdb --pid=67361 ...command...
==67361==
==67361== TO DEBUG THIS PROCESS USING GDB: start GDB like this
==67361== /path/to/gdb ./list_sort
==67361== and then give GDB the following command
==67361== target remote | /usr/lib/x86_64-linux-gnu/valgrind/../../bin/vgdb --pid=67361
==67361== --pid is optional if only one valgrind process is running
==67361==
--67361-- REDIR: 0x4022e10 (ld-linux-x86-64.so.2:strlen) redirected to 0x580c9ce2 (???)
--67361-- REDIR: 0x4022be0 (ld-linux-x86-64.so.2:index) redirected to 0x580c9cfc (???)
--67361-- Reading syms from /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_core-amd64-linux.so
--67361-- object doesn't have a symbol table
--67361-- Reading syms from /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so
--67361-- object doesn't have a symbol table
==67361== WARNING: new redirection conflicts with existing -- ignoring it
--67361-- old: 0x04022e10 (strlen ) R-> (0000.0) 0x580c9ce2 ???
--67361-- new: 0x04022e10 (strlen ) R-> (2007.0) 0x0483f060 strlen
--67361-- REDIR: 0x401f5f0 (ld-linux-x86-64.so.2:strcmp) redirected to 0x483ffd0 (strcmp)
--67361-- REDIR: 0x4023370 (ld-linux-x86-64.so.2:mempcpy) redirected to 0x4843a20 (mempcpy)
--67361-- Reading syms from /usr/lib/x86_64-linux-gnu/libc-2.31.so
--67361-- Considering /usr/lib/x86_64-linux-gnu/libc-2.31.so ..
--67361-- .. CRC mismatch (computed 86b78530 wanted e380f01c)
--67361-- Considering /lib/x86_64-linux-gnu/libc-2.31.so ..
--67361-- .. CRC mismatch (computed 86b78530 wanted e380f01c)
--67361-- Considering /usr/lib/debug/lib/x86_64-linux-gnu/libc-2.31.so ..
--67361-- .. CRC is valid
--67361-- REDIR: 0x4900600 (libc.so.6:memmove) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x48ff900 (libc.so.6:strncpy) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x4900930 (libc.so.6:strcasecmp) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x48ff220 (libc.so.6:strcat) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x48ff960 (libc.so.6:rindex) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x4901dd0 (libc.so.6:rawmemchr) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x491ce60 (libc.so.6:wmemchr) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x491c9a0 (libc.so.6:wcscmp) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x4900760 (libc.so.6:mempcpy) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x4900590 (libc.so.6:bcmp) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x48ff890 (libc.so.6:strncmp) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x48ff2d0 (libc.so.6:strcmp) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x49006c0 (libc.so.6:memset) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x491c960 (libc.so.6:wcschr) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x48ff7f0 (libc.so.6:strnlen) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x48ff3b0 (libc.so.6:strcspn) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x4900980 (libc.so.6:strncasecmp) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x48ff350 (libc.so.6:strcpy) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x4900ad0 (libc.so.6:memcpy@@GLIBC_2.14) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x491e0d0 (libc.so.6:wcsnlen) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x491c9e0 (libc.so.6:wcscpy) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x48ff9a0 (libc.so.6:strpbrk) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x48ff280 (libc.so.6:index) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x48ff7b0 (libc.so.6:strlen) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x4908d20 (libc.so.6:memrchr) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x49009d0 (libc.so.6:strcasecmp_l) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x4900550 (libc.so.6:memchr) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x491cab0 (libc.so.6:wcslen) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x48ffc60 (libc.so.6:strspn) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x49008d0 (libc.so.6:stpncpy) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x4900870 (libc.so.6:stpcpy) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x4901e10 (libc.so.6:strchrnul) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x4900a20 (libc.so.6:strncasecmp_l) redirected to 0x48311d0 (_vgnU_ifunc_wrapper)
--67361-- REDIR: 0x49e8490 (libc.so.6:__strrchr_avx2) redirected to 0x483ea10 (rindex)
--67361-- REDIR: 0x48fa260 (libc.so.6:malloc) redirected to 0x483b780 (malloc)
--67361-- REDIR: 0x49eb650 (libc.so.6:__mempcpy_avx_unaligned_erms) redirected to 0x4843660 (mempcpy)
--67361-- REDIR: 0x49eb670 (libc.so.6:__memcpy_avx_unaligned_erms) redirected to 0x48429f0 (memmove)
--67361-- REDIR: 0x48fa850 (libc.so.6:free) redirected to 0x483c9d0 (free)
--67361-- REDIR: 0x49e82a0 (libc.so.6:__strchrnul_avx2) redirected to 0x4843540 (strchrnul)
==67361==
==67361== HEAP SUMMARY:
==67361== in use at exit: 0 bytes in 0 blocks
==67361== total heap usage: 257 allocs, 257 frees, 7,168 bytes allocated
==67361==
==67361== All heap blocks were freed -- no leaks are possible
==67361==
==67361== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
```
由於原本要做效能評比的三個 sort 目前還找不到 memory leak 的原因,沒辦法進一步做分析 cache 效益,但仍有嘗試使用以下工具
:::info
TODO:
解決上述三個 sort memory leak 問題之後,再重新利用以下工具進行效能評比與 cache 效益分析。
:::
* 使用 perf 比較三個 sort 的 cache miss
```
$ sudo sh -c " echo -1 > /proc/sys/kernel/perf_event_paranoid"
$ perf stat --repeat 6 -e cache-misses,cache-references,instructions,cycles,L1-dcache-load-misses ./list_sort
```
- [ ] merge sort in Linux kernel
```
Performance counter stats for './list_sort' (6 runs):
1,6311 cache-misses # 29.129 % of all cache refs ( +- 11.83% )
5,5997 cache-references ( +- 2.07% )
131,8244 instructions # 1.04 insn per cycle ( +- 0.31% )
126,6532 cycles ( +- 1.37% )
1,4966 L1-dcache-load-misses ( +- 2.14% )
0.0013360 +- 0.0000469 seconds time elapsed ( +- 3.51% )
```
- [ ] insert sort
```
Performance counter stats for './examples/insert-sort' (6 runs):
2,7948 cache-misses # 47.922 % of all cache refs ( +- 5.11% )
5,8320 cache-references ( +- 2.57% )
182,6139 instructions # 1.12 insn per cycle ( +- 0.19% )
163,5738 cycles ( +- 1.82% )
1,5430 L1-dcache-load-misses ( +- 2.83% )
0.0018266 +- 0.0000739 seconds time elapsed ( +- 4.04% )
```
- [ ] quick sort
```
Performance counter stats for './examples/quick-sort' (6 runs):
1,4498 cache-misses # 25.137 % of all cache refs ( +- 12.04% )
5,7675 cache-references ( +- 2.51% )
148,7628 instructions # 1.08 insn per cycle ( +- 0.22% )
138,1092 cycles ( +- 1.66% )
1,5233 L1-dcache-load-misses ( +- 2.50% )
0.001906 +- 0.000127 seconds time elapsed ( +- 6.64% )
```
對於目前有 memory leak 的三個 sort 程式而言,觀察到 cache miss 的比例為 insert sort > merge sort in Linux kernel > quick sort 。
* 使用 Cachegrind 觀察 merge sort in Linux kernel 的 cache 狀況
```
$ valgrind --tool=cachegrind ./list_sort
```
```
==68266== Cachegrind, a cache and branch-prediction profiler
==68266== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==68266== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==68266== Command: ./list_sort
==68266==
--68266-- warning: L3 cache found, using its data for the LL simulation.
==68266==
==68266== I refs: 740,995
==68266== I1 misses: 1,199
==68266== LLi misses: 1,173
==68266== I1 miss rate: 0.16%
==68266== LLi miss rate: 0.16%
==68266==
==68266== D refs: 309,024 (200,368 rd + 108,656 wr)
==68266== D1 misses: 3,377 ( 2,570 rd + 807 wr)
==68266== LLd misses: 2,829 ( 2,087 rd + 742 wr)
==68266== D1 miss rate: 1.1% ( 1.3% + 0.7% )
==68266== LLd miss rate: 0.9% ( 1.0% + 0.7% )
==68266==
==68266== LL refs: 4,576 ( 3,769 rd + 807 wr)
==68266== LL misses: 4,002 ( 3,260 rd + 742 wr)
==68266== LL miss rate: 0.4% ( 0.3% + 0.7% )
```
我們可以進一步了解程式碼每一行的 cache 狀況
```
$ cg_annotate --auto=yes cachegrind.out.68266
```
得到[結果](https://github.com/shanihsu/Linux_2021/blob/main/quiz2/cache_detail.txt),並從中觀察此 merge sort 對 cache 的效益。
:::info
TODO :
1. 設計效能評比程式
2. 完成 quiz1 ,將在 quiz1 提出的改進實作,和 Linux 核心的 lib/list_sort.c 進行效能比較
3. 使用 ==[Cachegrind](https://valgrind.org/docs/manual/cg-manual.html)== 工具分析 cache 效益
:::