owned this note
owned this note
Published
Linked with GitHub
---
tags: linux2022
---
# 2022q1 Homework1 (lab0)
contributed by < `dYu0y` >
## review by `kdnvt`
- 在 `q_swap` 以及 `q_reverse` 中 `list_del()` 加上 `list_add()` 的操作可以用 `list_move()` 來簡化。
- 重新思考 `q_swap` 的實作,有沒有辦法只對原始串列做操作就可達到題目要求的效果,而避免使用 `tmp_list` 以及 `result`。
- 在排序效能的實驗中,你測試了 `list_sort` 以及 `q_sort` 的速度並得出 cache 使用率較佳的結果。但是 cache 只是其中一個議題,事實上 `list_sort` 為了避免差距過大的合併,會特意推遲合併的時機,因此造成 cache 使用率不是最好。但也因為這樣的取捨,讓它不會在某些特定長度有特別差的表現。
- 在實驗的設計上,我建議你可以使用更多不同長度的串列來比較,例如比實驗中的長度 1000000 略長的 1050000 ,或是較短的 525000 ,可能會產生不同的結果。
## 實驗環境
```shell
$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
$ lscpu
Architecture: x86_64
CPU op-mode(s: 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Stepping: 3
CPU MHz: 2600.000
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 5199.98
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-7
```
## [作業要求](https://hackmd.io/@sysprog/linux2022-lab0)
詳細閱讀 [C Programming Lab ](http://www.cs.cmu.edu/~213/labs/cprogramminglab.pdf),依據指示著手修改 queue.[ch] 和連帶的檔案,測試後用 Git 管理各項修改,記得也該實作 q_sort 函式。
- q_new: 創一個空的 queue
- q_free: 釋放掉一個 queue
- q_insert_head: 在 head 插入一個 element (LIFO)
- q_insert_tail: 在 tail 插入一個 element (FIFO), O(1)
- q_remove_head: 把 head 的 element 移除
- q_remove_tail: 把 tail 的 element 移除
- q_size: return queue 的大小
- q_delete_mid: 刪除位於 queue 正中央的元素
- q_delete_dup: 刪除所有含有重複內容的 node
- q_swap: 將 queue 內的節點倆倆互換
- q_reverse: 將 queue 反轉,不配置或釋放任何的 element
- q_sort: 將 queue 由小排到大, sort by nature sort
## 開發過程
著手實作前有先將 list.h 大致瀏覽過一遍,基本上有現成可用的 function 就直接使用不再另行實作了。
### q_new
原本在 `malloc` 前加上了 [casting operator](https://en.wikibooks.org/wiki/C_Programming/Operators_and_type_casting#Cast_operators),看到老師在 [Risheng1128](https://github.com/Risheng1128) 的[作業評論](https://github.com/Risheng1128/lab0-c/commit/d51aee2efc9ef1e1c7faf97f48c4e527a92b05c0#comments)中說到這是可以省略的,所以變成現在的樣子。
```cpp
queue_t *q_new()
{
struct list_head *head;
head = malloc(sizeof(struct list_head));
if (head)
INIT_LIST_HEAD(head);
return head;
}
```
有人可能會寫成類似這樣:
```cpp
queue_t *q_new()
{
struct list_head *head;
head = malloc(sizeof(struct list_head));
if (head) {
INIT_LIST_HEAD(head);
return head;
}
return NULL;
}
```
主要是 `return` 部分,個人認為這樣的寫法有點冗餘,可以和我的實作比較看看。
### q_free
一開始實作的版本沒有注意到有 `q_release_element()` 可以使用,所以大概多了兩行,看到後就直接拿來用了。
原本:
```cpp
void q_free(queue_t *q)
{
if (!l)
return;
element_t *entry, *safe;
list_for_each_entry_safe (entry, safe, l, list) {
free(entry->value);
free(entry);
}
free(l);
}
```
最新版:
```cpp
void q_free(queue_t *q)
{
if (!l)
return;
element_t *entry, *safe;
list_for_each_entry_safe (entry, safe, l, list)
q_release_element(entry);
free(l);
}
```
就我個人的理解,下面的實作是比原本的來得好。
主要的優點如下,考慮 `element_t` 的定義被更改的場景,如:
```cpp
typedef struct {
char *value;
char *name; // 新增!!
struct list_head list;
} element_t;
```
按照舊的實作方式,必須找到所有釋放 `element_t` 資源的程式碼,並逐一加上 `free(name);`,顯然是個容易出錯與產生疏漏的過程。
但是後者不一樣,只需要在 `q_release_element()` 內新增一行 `free(name);`,如此呼叫 `q_release_element()` 的所有地方都仍然可以正確地將資源進行釋放。
而**可能帶來**的壞處就僅僅是多了一層 function call 的成本,而且有非常大的概率編譯器可以進行展開等操作,使得兩者的執行效能相差無幾。
### q_insert_head
我觀察到它與 `q_insert_tail()` 有可以重複使用的部分(建立元素與給予其初始值),因此新增了輔助的 macro `q_create_element_with_value()` 如下:
```cpp
#define q_create_element_with_value(element, val) \
element_t *element = malloc(sizeof(element_t)); \
if (element) { \
element->value = strdup(val); \
if (!element->value) { \
free(element); \
element = NULL; \
} \
}
```
如此 `q_insert_head()` 要做的事就簡單多了:
```cpp
bool q_insert_head(struct list_head *head, char *s)
{
if (!head)
return false;
q_create_element_with_value(element, s);
if (!element)
return false;
list_add(&element->list, head);
return true;
}
```
### q_insert_tail
跟 `q_insert_head()` 的差別僅有最後呼叫的 function 不同而已。
```cpp
bool q_insert_tail(struct list_head *head, char *s)
{
if (!head)
return false;
q_create_element_with_value(element, s);
if (!element)
return false;
list_add_tail(&element->list, head);
return true;
}
```
:::danger
不要說「忘記看到誰」,將心比心,如果換成你被其他同學「致意」,還屢次不被署名,你作何感想?去找出源頭!
:notes: jserv
:::
<s>忘記看到誰</s>
看到 [laneser](https://github.com/laneser) 實作`q_remove_tail()` 時是這樣利用了 `q_remove_head()`,感覺很不錯,盡量減少了程式碼的冗餘,大概如下:
```cpp
bool q_remove_tail(struct list_head *head, char *s)
{
if (!head || !head->prev)
return NULL;
return q_remove_head(head->prev->prev, sp, bufsize);
}
```
運用同樣的概念,`q_insert_tail()` 也可以如此利用 `q_insert_head()` 來變得更簡潔:
```cpp
bool q_insert_tail(struct list_head *head, char *s)
{
return head && q_insert_head(head->prev , s);
}
```
### q_remove_head
有人為了處理字串的複製費了不少心力,比如計算並比較長度等,但是看清楚對於 `strncpy()` 的[說明](https://www.cplusplus.com/reference/cstring/strncpy/),可以發現一些事前處理都是非必要的,反正到了 `strncpy()` 內部還是會再做一遍,不如就不做了省點效能。
以下是簡單的分析,對於下面的呼叫:
```cpp
strncpy(destination, source, num);
```
在呼叫參數都合法的前提下,依照[三一律](https://zh.wikipedia.org/wiki/%E4%B8%89%E5%88%86%E5%BE%8B)可知面對的狀況有三:
1. `strlen(source) < num`:
此時 `strncpy()` 會幫助我們將 `strlen(source)` 的位置開始補 `'\0'` 直到有 `num` 個字元被寫入 `destination`。
結論:呼叫後就結束了,不用理會。
2. `strlen(source) == num`
最後缺一個 `'\0'` 要補一下 `destination[num - 1] = '\0';`。
3. `strlen(source) > num`
同狀況 2.。
由此可知,要覆蓋到所有情況要做的事只有呼叫 `strncpy()` 並在最後補一個 `'\0'` 即可,因此我的實作如下。
```cpp
element_t *q_remove_head(struct list_head *head, char *sp, size_t bufsize)
{
if (!head || list_empty(head))
return NULL;
element_t *element = list_entry(head->next, element_t, list);
list_del_init(&element->list);
if (sp) {
strncpy(sp, element->value, bufsize - 1);
sp[bufsize - 1] = '\0';
}
return element;
}
```
補充一下使用 `list_del_init()` 是為了避免使用者拿到回傳的指標後找到 queue 裡面來,但是看老師在 [tinyynoob](https://github.com/tinyynoob) 的作業評論說是[不必要](https://github.com/tinyynoob/lab0-c/commit/5f8644b1c27e34b0e4ddebf17566c47aa245736e#r66790280)的,我也不懂,看老師之後怎麼講解這一部分。
### q_remove_tail
如同 `q_remove_head()`:
```cpp
element_t *q_remove_tail(struct list_head *head, char *sp, size_t bufsize)
{
if (!head || list_empty(head))
return NULL;
element_t *element = list_entry(head->prev, element_t, list);
list_del_init(&element->list);
if (sp) {
strncpy(sp, element->value, bufsize - 1);
sp[bufsize - 1] = '\0';
}
return element;
}
```
也可以像[前面提到](#q_insert_tail)的 [laneser](https://github.com/laneser) 一樣呼叫 `q_remove_head()` 來實做。
### q_size
由作業的[題目說明](https://hackmd.io/@sysprog/linux2022-lab0#K01-lab0)內提供:
```cpp
int q_size(struct list_head *head)
{
if (!head)
return 0;
int len = 0;
struct list_head *li;
list_for_each (li, head)
len++;
return len;
}
```
### q_delete_mid
簡單的 one pass 實作,省下計算長度的時間。
這裡的想法是,同時有兩個指標從前方及後方往中間找,當兩者碰在一起時,便是在 list 的中間。
至於怎麼知道先動 front 還是 back 才能拿到正確的 node?
自己畫幾個小例子就能知道了。
```cpp
bool q_delete_mid(struct list_head *head)
{
// https://leetcode.com/problems/delete-the-middle-node-of-a-linked-list/
if (!head || list_empty(head))
return false;
struct list_head *front, *back = front = head;
while (true) {
front = front->next;
if (back == front)
break;
back = back->prev;
if (back == front)
break;
}
list_del(back);
q_release_element(list_entry(back, element_t, list));
return true;
}
```
### q_delete_dup
誤會題意了,下面是將重複的元素刪到剩一個的實作,而非題目要求的全部刪除。
e.g. 輸入 `[a b b c d d]`
我的輸出: `[a b c d]`
題目要求: `[a c]`
```cpp
bool q_delete_dup(struct list_head *head)
{
// https://leetcode.com/problems/remove-duplicates-from-sorted-list-ii/
if (!head || list_empty(head))
return false;
element_t *entry, *safe;
list_for_each_entry_safe (entry, safe, head, list) {
if (&safe->list == head)
break;
if (strcmp(entry->value, safe->value) == 0) {
list_del(&entry->list);
q_release_element(entry);
}
}
return true;
}
```
以下是更正版本 [dd5c0c9](https://github.com/dYu0y/lab0-c/commit/dd5c0c9):
```cpp=
bool q_delete_dup(struct list_head *head)
{
// https://leetcode.com/problems/remove-duplicates-from-sorted-list-ii/
if (!head || list_empty(head))
return false;
char *dup_str = NULL;
element_t *entry, *safe;
list_for_each_entry_safe (entry, safe, head, list) {
if (dup_str) {
if (strcmp(dup_str, entry->value) == 0) {
list_del(&entry->list);
q_release_element(entry);
} else {
free(dup_str);
dup_str = NULL;
}
}
if (!dup_str && &safe->list != head) {
if (strcmp(entry->value, safe->value) == 0) {
list_del(&entry->list);
dup_str = entry->value;
free(entry);
}
}
}
free(dup_str);
return true;
}
```
從舊的錯誤版本直接改出來的,新增 dup_str 用來指向重複的 value。
若 queue 中的元素小於兩個,則相當於無事發生。
否則從頭到尾兩兩比較 queue 中的元素,確認兩者的 value 是否重複。
一旦確認到有重複,令 dup_str 指向其中一方的 value,並釋放其容器所佔用的記憶體,接著刪除所有 value 與 dup_str 一致的元素。
刪除完畢後,釋放 dup_str 所指向的空間,繼續尋找下一個重複的元素,如此反覆做到結束,便完整刪除所有含有重複內容的元素了。
不過必須承認目前的程式碼不夠精簡,邏輯也沒辦法一目了然,會找時間試試看能不能對其重構並使其精簡易懂。
### q_swap
使用一些額外的 list 來輔助操作,至少對我來說較好理解。
以下是一個解釋過程的例子:
```
原本的 queue: [a, b, c, d, ...]
tmp_list: []
result: []
```
第 10~12 行的迴圈,每次從 queue 中移出一個 node 並加到 tmp_list 的**前端**。
```
第 1 次迴圈結束
queue: [b, c, d, ...]
tmp_list: [a]
result: []
第 2 次迴圈 13 行 if 之前
queue: [c, d]
tmp_list: [b, a]
result: []
```
可以看出此時前兩個元素已經被交換了,這時 13~15 行把已經被交換的元素加從 tmp_list 移入 result 裡面,並繼續抓取下兩個要交換的元素。
```
第 2 次迴圈結束
queue: [c, d, ...]
tmp_list: []
result: [b, a]
```
如此往復操作幾次,偶數長度的 queue 就已經 swap 完成了,只要透過 19 行的 `list_splice(&result, head)` 把答案換入原本的 queue 內即可。
```
19 行 list_splice 結束
queue: [b, a, d, c, ...]
tmp_list: []
result: []
```
而對於奇數長度的 queue,在迴圈結束後,會留下 queue 內的最後一個元素在 tmp_list 內。
```
原本的 queue: [a, b, c, d, e]
迴圈結束後:
queue: []
tmp_list: [e]
result: [b, a, d, c]
```
第 18 行的 `list_splice(&tmp_list, head)` 的作用是把這個剩餘的元素加回到結果中,接著 19 行的 `list_splice(&result, head)` 再把位於 result 內的 list 置於 queue 的最前面,這樣就得到正確的結果了。
```cpp=
void q_swap(struct list_head *head)
{
// https://leetcode.com/problems/swap-nodes-in-pairs/
if (!head || list_empty(head))
return;
LIST_HEAD(tmp_list);
LIST_HEAD(result);
int i = 0;
struct list_head *node, *safe;
list_for_each_safe (node, safe, head) {
list_del(node);
list_add(node, &tmp_list);
if (++i == 2) {
list_splice_tail_init(&tmp_list, &result);
i = 0;
}
}
list_splice(&tmp_list, head);
list_splice(&result, head);
}
```
### q_reverse
可能不是最高效的實作方式,但應該算簡單易懂的。
```cpp
void q_reverse(struct list_head *head)
{
if (!head || list_empty(head))
return;
LIST_HEAD(reversed_list);
struct list_head *node, *safe;
list_for_each_safe (node, safe, head) {
list_del(node);
list_add(node, &reversed_list);
}
list_splice(&reversed_list, head);
}
```
### q_sort
原本要實做的排序演算法,在 quick sort、heap sort 與 merge sort 之間考慮:
Quick sort 的遞迴版本對於 linked list 也是能簡單的實做出來(參考:[你所不知道的 C 語言: linked list 和非連續記憶體](https://hackmd.io/@sysprog/c-linked-list#Quick-sort-%E9%81%9E%E8%BF%B4%E7%89%88%E6%9C%AC)),但遞迴版本可能會遇上 stack overflow 的問題;而非遞迴版可能會遇到使用 variable-length array 產生的各種疑難雜症。最重到的是,若 pivot 挑選的不好,quick sort 的時間複雜度會提昇至 $O(n^2)$。
Heap sort 則是可以利用 list node 上的 prev 與 next 指標充當 tree node 上的 left 和 right 指標,將 list 轉化成一個 min/max heap 後就可以進行 heap sort 了。可是因為缺少了像 array 那樣具有 $O(1)$ 隨機存取的能力,不好 inplace 的將 linked list 變成 heap。再來是如果 quick sort 會有遞迴導致的 stack overflow 問題,那麼 heap sort 也同樣會遇到。
Merge sort 使用遞迴的實作方式有兩個缺點,其一是為了將 list 對半切分為兩份,每次要往下遞迴下去前須先花費 $O(N)$ 的時間找到 list 的中間點,第二是倘若資料量過大,導致遞迴過深也會有 stack overflow 的問題存在。
幸好[益智遊戲 2048](https://zh.wikipedia.org/wiki/2048_(%E9%81%8A%E6%88%B2)) 提供給我一個靈感可以解決前面提到的問題。
第一個問題:須花費 $O(n)$ 尋找中間點,是因為遞迴的版本是從 top-down 的角度來實做,但若從 bottom-up 的角度來看,問題就變成有 $n$ 個長度為 $1$ 的 sublist 不斷兩兩進行 merge 直到成為唯一的 list 為止,整個[過程類似遊戲 2048](https://imgur.com/eBmJQrC),大致如下圖:

已經有合併好的 sublist 我們將其置於一旁先不管(如下方那排),直到出現一個讓合併可以進行下去的 sublist 出現(如紅圈處的 2),再一口氣進行所有可以進行的合併。
較為精確的過程說明如下:
1. 如果原 list 不為空,就從最前面拿走一個 node,將其視為長度為 $1$ 的 sublist,並將其加入一個 sublist 的集合 $S$ 之中。
2. 檢查 $S$ 中是否有長度一致為 $len$ 的 sublist:
2.1 若有,則將兩者從 $S$ 中取出並合併為新的長度為 $2⋅len$ 的新 sublist,然後將新 sublist 加回 $S$ 中,返回步驟 2.。
2.2 若沒有,返回步驟 1.。
當上方的步驟結束後(原 list 中的節點皆已取完,成為空 list),會得到一個有若干長度為二的冪次的 sublist 的集合,只要不斷取出其中長度最短的兩個 sublist 進行合併後再放回集合,直到裡面只剩下唯一的 sublist 時,合併排序便完成了。
:::warning
不要輕易說「最終版」,理工人說話要精準。
:notes: jserv
:::
版本 [f9040cff](https://github.com/dYu0y/lab0-c/commit/f9040cff),由於不是很好讀,會嘗試重構看看能不能讓每個 function 的邏輯盡量簡單易懂,也比較方便進行說明。
:::warning
`list2d` 這樣的命名很糟糕,應改進
:notes: jserv
> 好的,會再思考看看更好的變數名稱。
:::
版本 [4832a29](https://github.com/dYu0y/lab0-c/commit/4832a29) 與 [9aed5ed](https://github.com/dYu0y/lab0-c/commit/9aed5ed),進行重構,並新增一些輔助的 macros。
版本 [6b87b4e](https://github.com/dYu0y/lab0-c/commit/6b87b4e) 取消與 `list2d` 相關的命名,並暫時改命名為 `list`,若有更好的名稱會再進行更名。
:::spoiler 6b87b4e 版程式碼
```cpp
struct list_slice {
struct list_head *head;
struct list_head **begin;
int i;
int len;
};
#define LIST_SLICE_BEGIN_WITH(slice, begin, length, head) \
struct list_slice slice = {head, &(begin->prev->next), 0, length};
#define LIST_SLICE(slice, length, head) \
struct list_slice slice = {head, &(head->next), 0, length};
#define LIST_SLICE_NEXT(slice) \
({ \
struct list_slice *__slice = slice; \
struct list_head *next = *__slice->begin; \
if (__slice->i == __slice->len || next == __slice->head) \
next = NULL; \
else { \
list_del(next); \
++__slice->i; \
} \
next; \
})
#define LIST_SLICE_EXPAND(slice, length) \
(slice)->len += length - (slice)->i; \
(slice)->i = 0;
#define LIST_NEXT(curr, head) \
({ \
curr = curr->next; \
curr != head ? curr : NULL; \
})
#define MY_CMP(l, r) \
({ \
char *s1 = list_entry(l, element_t, list)->value; \
char *s2 = list_entry(r, element_t, list)->value; \
strcmp(s1, s2); \
})
#define MERGE_TWO_LIST(list1, get_next_fn1, list2, get_next_fn2, cmp) \
({ \
struct list_head *end_of_list1 = list1; \
while (list1 && list2) { \
end_of_list1 = list1; \
if (cmp(list1, list2) > 0) { \
list_add_tail(list2, list1); \
list2 = get_next_fn2; \
} else { \
list1 = get_next_fn1; \
} \
} \
LIST_HEAD(rest_of_list2); \
while (list2) { \
list_add_tail(list2, &rest_of_list2); \
list2 = get_next_fn2; \
} \
list_splice(&rest_of_list2, end_of_list1); \
})
static inline void list_merge(struct list_head *list, int size)
{
LIST_SLICE(slice, 1, list);
struct list_head *next = LIST_SLICE_NEXT(&slice);
if (!next)
__builtin_unreachable();
LIST_HEAD(merged_list);
list_add(next, &merged_list);
int i = 1;
for (; i & size; i <<= 1) {
LIST_SLICE_EXPAND(&slice, i);
struct list_head *list1 = merged_list.next,
*list2 = LIST_SLICE_NEXT(&slice);
MERGE_TWO_LIST(list1, LIST_NEXT(list1, &merged_list), list2,
LIST_SLICE_NEXT(&slice), MY_CMP);
}
list_splice(&merged_list, list);
}
#define LIST_NTH(node, n, head) \
({ \
int i = 0; \
for (node = head->next; node != head && i < n; ++i) \
node = node->next; \
})
#define SPLIT_LSB(len, size) len = size, len &= -size, size &= ~len
static void list_merge_final(struct list_head *list, int size)
{
LIST_HEAD(merged_list);
struct list_head *cut_point;
int len;
SPLIT_LSB(len, size);
LIST_NTH(cut_point, len, list);
list_cut_position(&merged_list, list, cut_point->prev);
for (SPLIT_LSB(len, size); len; SPLIT_LSB(len, size)) {
LIST_SLICE(slice, len, list);
struct list_head *list1 = merged_list.next,
*list2 = LIST_SLICE_NEXT(&slice);
MERGE_TWO_LIST(list1, LIST_NEXT(list1, &merged_list), list2,
LIST_SLICE_NEXT(&slice), MY_CMP);
}
list_splice(&merged_list, list);
}
/*
* Sort elements of queue in ascending order
* No effect if q is NULL or empty. In addition, if q has only one
* element, do nothing.
*/
void q_sort(struct list_head *head)
{
if (!head || list_empty(head) || list_is_singular(head))
return;
LIST_HEAD(list);
int size = 0;
struct list_head *it, *safe;
list_for_each_safe (it, safe, head) {
list_del(it);
list_add(it, &list);
if (size & 1) {
list_merge(&list, size);
}
++size;
}
list_merge_final(&list, size);
list_splice(&list, head);
}
```
:::
## 比較自己實作的 merge sort 和 Linux 核心程式碼之間效能落差
這邊因為比較沒有頭緒接下來要做什麼,還有要怎麼做。
要做什麼和取得資料的過程幾乎都是參考同學的步驟,只有實驗的數據和結果是自己產生的。
### 加入 linux 的 list_sort() 程式碼
[參考 lanser 同學比較的方式](https://hackmd.io/@laneser/linux2022_lab0#%E6%AF%94%E8%BC%83%E8%87%AA%E5%B7%B1%E5%AF%A6%E4%BD%9C%E7%9A%84-merge-sort-%E5%92%8C-Linux-%E6%A0%B8%E5%BF%83%E7%A8%8B%E5%BC%8F%E7%A2%BC%E4%B9%8B%E9%96%93%E6%95%88%E8%83%BD%E8%90%BD%E5%B7%AE),將 `list_sort.c` 的程式碼貼入 `qtest.c` 中,同樣的新增 option `use_linux_sort` 來決定使用的 sort 是自己實做的版本或是 linux 核心所使用的 list_sort。
```diff
// in do_sort()
// ...
+ void (*sort_func)(struct list_head *) =
+ use_linux_sort ? linux_list_sort : q_sort;
set_noallocate_mode(true);
if (exception_setup(true))
- q_sort(l_meta.l);
+ sort_func(l_meta.l);
exception_cancel();
set_noallocate_mode(false);
```
使用以下的 commands 來測量,藉由調整 use_linux_sort 為 0 或 1,來切換 sort 的實做版本。
```
option fail 0
option malloc 0
option use_linux_sort 0
new
it RAND 1000000
time sort
free
```
新增兩個 trace command 檔案來輔助測試:
trace-19-test-my-sort.cmd
```
option fail 0
option malloc 0
option use_linux_sort 0
new
it RAND 1000000
time sort
free
```
trace-20-test-linux-sort.cmd
```
option fail 0
option malloc 0
option use_linux_sort 1
new
it RAND 1000000
time sort
free
```
並新增一個 bash 來進行比較大量的測試
```bash
#!/bin/bash
rm *_result.txt
for i in {1..20}
do
./qtest -v 3 -f traces/trace-19-test-my-sort.cmd >> my_result.txt
done
for i in {1..20}
do
./qtest -v 3 -f traces/trace-20-test-linux-sort.cmd >> linux_result.txt
done
```
使用 grep 指令找出兩邊的執行時間,搭配 sort 更好看出最小值最大值
```
$ grep -E Delta my_result.txt | sort
Delta time = 1.075
Delta time = 1.078
Delta time = 1.089
Delta time = 1.094
Delta time = 1.096
Delta time = 1.096
Delta time = 1.096
Delta time = 1.100
Delta time = 1.102
Delta time = 1.104
Delta time = 1.106
Delta time = 1.106
Delta time = 1.109
Delta time = 1.113
Delta time = 1.113
Delta time = 1.117
Delta time = 1.121
Delta time = 1.121
Delta time = 1.125
Delta time = 1.138
$ grep -E Delta linux_result.txt | sort
Delta time = 1.139
Delta time = 1.151
Delta time = 1.153
Delta time = 1.156
Delta time = 1.170
Delta time = 1.172
Delta time = 1.172
Delta time = 1.173
Delta time = 1.173
Delta time = 1.181
Delta time = 1.185
Delta time = 1.186
Delta time = 1.189
Delta time = 1.193
Delta time = 1.194
Delta time = 1.197
Delta time = 1.234
Delta time = 1.253
Delta time = 1.266
Delta time = 1.269
```
q_sort() 平均執行時間約 1.1 秒,list_sort() 平均執行時間約 1.2 秒,q_sort() 比 list_sort() 快了大約 7%。
### 使用 valgrind cachegrind 分析
蠻驚訝自己的實做結果看起來比較好,參考 [bakudr18](https://hackmd.io/@MR-9Qa9wQLWglSAUyd6UhQ/BkvLMLkeq?fbclid=IwAR101v4CNV0DsfoURf_I5adMTAAlfdt7s1bhxrFD-nHEreoJCknFYx5OG0o#%E4%BA%8B%E5%89%8D%E6%BA%96%E5%82%99) 與 [kdnvt](https://hackmd.io/WuqBNTdFQnuNxkFtR6FIWg?view#%E7%94%A8-valgrind-cachegrind-%E5%88%86%E6%9E%90-cache-miss%EF%BC%9A) 兩位同學的共筆,使用 valgrind 來分析是什麼導致了這樣的結果。
```bash
$ valgrind --tool=cachegrind ./qtest -f traces/trace-19-test-my-sort.cmd
$ valgrind --tool=cachegrind ./qtest -f traces/trace-20-test-linux-sort.cmd
```
trace-19-test-my-sort.cmd 的測試結果(僅列出相關內容)
```
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
914,645,427 1,667 1,647 237,791,305 11,210,108 6,134,757 115,709,062 3,979,143 2,448,753 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
133,854,069 6 6 30,244,165 4,317,492 1,575,233 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
80,452,521 11 11 25,438,380 2,255,556 629,148 7,264,254 1 1 /home/dyu0y/linux2022/lab0-c/queue.c:q_sort
37,903,217 10 10 12,692,437 536,077 188,464 24,690,208 2,434,578 1,060,175 /home/dyu0y/linux2022/lab0-c/list.h:q_sort
```
:::spoiler trace-19-test-my-sort 完整測試資料
```
-------------------------------------------------------------------------------
I1 cache: 32768 B, 64 B, 8-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 6291456 B, 64 B, 12-way associative
Command: ./qtest -f traces/trace-19-test-my-sort.cmd
Data file: cachegrind.out.10276
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation: off
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
914,645,427 1,667 1,647 237,791,305 11,210,108 6,134,757 115,709,062 3,979,143 2,448,753 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
133,854,069 6 6 30,244,165 4,317,492 1,575,233 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
121,811,256 3 3 30,576,104 3 0 11,466,039 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/random_r.c:random_r
97,953,376 26 26 16,673,109 2 1 16,672,866 694,596 694,593 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_malloc
80,452,521 11 11 25,438,380 2,255,556 629,148 7,264,254 1 1 /home/dyu0y/linux2022/lab0-c/queue.c:q_sort
80,262,281 2 2 30,576,106 0 0 7,644,027 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/random.c:random
56,271,812 13 13 15,978,451 25,977 22,746 6,947,300 3 3 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_free
44,822,465 2 2 4,516,328 1 1 7,643,682 0 0 /home/dyu0y/linux2022/lab0-c/qtest.c:fill_rand_string
37,903,217 10 10 12,692,437 536,077 188,464 24,690,208 2,434,578 1,060,175 /home/dyu0y/linux2022/lab0-c/list.h:q_sort
33,346,298 7 7 8,336,621 3 2 2,778,911 1 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:malloc
27,786,599 3 3 6,946,650 1 1 8,335,979 86,818 86,818 /home/dyu0y/linux2022/lab0-c/harness.c:test_malloc
26,690,130 52 50 13,345,026 85 59 19 1 1 ???:???
26,397,292 3 3 8,335,980 381,629 337,572 4,862,654 761,904 606,245 /home/dyu0y/linux2022/lab0-c/harness.c:test_free
19,309,660 5 5 1,389,330 0 0 2,778,660 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms
18,756,659 9 9 2,778,766 2,082,850 1,840,514 694,764 2 2 /home/dyu0y/linux2022/lab0-c/qtest.c:show_queue
15,636,743 1 1 3,127,348 0 0 3,127,349 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/rand.c:rand
14,589,225 2 2 3,473,625 43,407 37,570 0 0 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:free
14,587,902 4 4 4,167,972 347,936 347,339 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_l_avx
9,030,632 2 2 1,389,328 0 0 2,083,992 0 0 /home/dyu0y/linux2022/lab0-c/queue.c:q_insert_tail
8,684,324 1 1 3,821,676 0 0 1,041,996 0 0 /home/dyu0y/linux2022/lab0-c/harness.c:test_strdup
8,336,038 10 10 2,431,347 1 1 1,042,008 0 0 /home/dyu0y/linux2022/lab0-c/qtest.c:do_it
7,644,026 0 0 3,822,013 1 0 0 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/../sysdeps/unix/sysv/linux/x86/lowlevellock.h:random
4,794,041 6 6 1,042,370 0 0 694,894 9 9 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms
4,168,024 6 6 1,042,004 434,193 433,609 347,344 0 0 /home/dyu0y/linux2022/lab0-c/qtest.c:do_sort
4,168,016 3 3 2,084,008 4 4 1,042,004 0 0 /home/dyu0y/linux2022/lab0-c/harness.c:error_check
3,125,988 1 1 1,041,996 86,636 71,489 1,041,996 0 0 /home/dyu0y/linux2022/lab0-c/queue.c:q_release_element
2,778,660 1 1 694,665 0 0 694,665 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_free
2,778,660 0 0 0 0 0 694,665 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_malloc
2,431,347 1 1 347,338 346,737 300,984 347,335 0 0 /home/dyu0y/linux2022/lab0-c/queue.c:q_free
1,736,660 0 0 347,332 0 0 1,389,328 0 0 /home/dyu0y/linux2022/lab0-c/list.h:q_insert_tail
1,389,337 2 2 347,334 347,334 347,334 0 0 0 /home/dyu0y/linux2022/lab0-c/queue.c:q_size
1,389,328 0 0 0 0 0 347,332 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_strdup
1,389,324 2 2 694,662 2 2 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_avx
```
:::
trace-20-test-linux-sort.cmd 的測試結果(僅列出相關內容)
```
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
928,501,647 1,649 1,629 226,685,637 14,082,259 7,643,661 105,960,042 1,833,586 1,674,310 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
131,174,459 6 6 29,640,025 4,695,061 1,967,595 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
79,113,309 1 1 9,027,065 991,304 316,007 13,921,702 1 1 /home/dyu0y/linux2022/lab0-c/qtest.c:merge
47,421,952 1 1 17,783,232 4,153,026 1,480,133 5,927,744 1 1 /home/dyu0y/linux2022/lab0-c/qtest.c:sort_comp
14,422,478 6 6 2,410,564 177,157 172,248 3,099,326 302,042 298,217 /home/dyu0y/linux2022/lab0-c/qtest.c:list_sort
```
:::spoiler trace-20-test-linux-sort 完整測試資料
```
--------------------------------------------------------------------------------
I1 cache: 32768 B, 64 B, 8-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 6291456 B, 64 B, 12-way associative
Command: ./qtest -f traces/trace-20-test-linux-sort.cmd
Data file: cachegrind.out.10300
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation: off
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
928,501,647 1,649 1,629 226,685,637 14,082,259 7,643,661 105,960,042 1,833,586 1,674,310 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
131,174,459 6 6 29,640,025 4,695,061 1,967,595 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
120,648,156 3 3 30,284,152 3 0 11,356,557 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/random_r.c:random_r
97,118,374 26 26 16,530,981 2 1 16,530,738 688,674 688,671 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_malloc
79,495,899 2 2 30,284,152 0 0 7,571,038 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/random.c:random
79,113,309 1 1 9,027,065 991,304 316,007 13,921,702 1 1 /home/dyu0y/linux2022/lab0-c/qtest.c:merge
55,792,130 13 13 15,842,245 25,877 22,642 6,888,080 3 3 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_free
47,421,952 1 1 17,783,232 4,153,026 1,480,133 5,927,744 1 1 /home/dyu0y/linux2022/lab0-c/qtest.c:sort_comp
44,397,182 2 2 4,475,570 1 1 7,572,351 0 0 /home/dyu0y/linux2022/lab0-c/qtest.c:fill_rand_string
33,062,042 7 7 8,265,557 3 2 2,755,223 1 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:malloc
27,549,719 3 3 6,887,430 1 1 8,264,915 86,078 86,078 /home/dyu0y/linux2022/lab0-c/harness.c:test_malloc
26,316,264 52 50 13,158,093 85 59 19 1 1 ???:???
26,172,258 3 3 8,264,916 378,038 333,996 4,821,200 755,544 600,421 /home/dyu0y/linux2022/lab0-c/harness.c:test_free
19,147,522 5 5 1,377,486 0 0 2,754,972 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms
18,596,765 9 9 2,755,078 2,065,107 1,822,940 688,842 2 2 /home/dyu0y/linux2022/lab0-c/qtest.c:show_queue
15,483,880 1 1 3,096,776 0 0 3,096,776 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/rand.c:rand
14,464,863 2 2 3,444,015 42,887 37,282 0 0 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:free
14,463,540 4 4 4,132,440 344,979 344,378 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_l_avx
14,422,478 6 6 2,410,564 177,157 172,248 3,099,326 302,042 298,217 /home/dyu0y/linux2022/lab0-c/qtest.c:list_sort
8,953,646 2 2 1,377,484 0 0 2,066,226 0 0 /home/dyu0y/linux2022/lab0-c/queue.c:q_insert_tail
8,608,029 1 1 3,786,835 0 0 1,033,113 0 0 /home/dyu0y/linux2022/lab0-c/harness.c:test_strdup
8,264,974 10 10 2,410,620 1 1 1,033,125 0 0 /home/dyu0y/linux2022/lab0-c/qtest.c:do_it
7,571,038 0 0 3,785,519 1 0 0 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/../sysdeps/unix/sysv/linux/x86/lowlevellock.h:random
4,754,414 6 6 1,033,487 0 0 688,972 9 9 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms
4,132,491 6 6 1,033,121 430,492 429,933 344,383 0 0 /home/dyu0y/linux2022/lab0-c/qtest.c:do_sort
4,132,484 3 3 2,066,242 4 4 1,033,121 0 0 /home/dyu0y/linux2022/lab0-c/harness.c:error_check
3,099,339 1 1 1,033,113 85,909 71,001 1,033,113 0 0 /home/dyu0y/linux2022/lab0-c/queue.c:q_release_element
2,754,972 1 1 688,743 0 0 688,743 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_free
2,754,972 0 0 0 0 0 688,743 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_malloc
2,410,620 1 1 344,377 343,765 298,378 344,374 0 0 /home/dyu0y/linux2022/lab0-c/queue.c:q_free
1,721,855 0 0 344,371 0 0 1,377,484 0 0 /home/dyu0y/linux2022/lab0-c/list.h:q_insert_tail
1,377,493 2 2 344,373 344,373 344,373 0 0 0 /home/dyu0y/linux2022/lab0-c/queue.c:q_size
1,377,484 0 0 0 0 0 344,371 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_strdup
1,377,480 2 2 688,740 2 2 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_avx
```
:::
> 備註:[kdnvt 提到的](https://hackmd.io/WuqBNTdFQnuNxkFtR6FIWg?both#%E7%94%A8-valgrind-cachegrind-%E5%88%86%E6%9E%90-cache-miss%EF%BC%9A)有多個 strcmp,藉由表格最後一欄 `file:function` 判斷:
> ```
> .../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_avx
> ```
> 帶有 `strcasecmp` 的 strcmp 是 do_sort() 最後用來判斷 queue 是否已正確排序時呼叫的,所以將其忽略不計。
以下是比較表格
| q_sort | DLmr | percent |
| ------:| -----------:| -------:|
| strcmp | 1,575,233 | 65.8% |
| q_sort | 817,612 | 34.2% |
| list_sort | DLmr | percent |
| ---------:| -----------:| -------:|
| strcmp | 1,967,595 | 50.0% |
| sort_cmp | 1,480,133 | 37.6% |
| merge | 316,007 | 8.0% |
| list_sort | 172,248 | 4.4% |
到這邊終於知道為什麼 list_sort() 效能會不如 q_sort() 了,可以看到 list_sort() 光是在字串比較的部份就有總共高達 3,447,728 個 DLmr,佔總數的 87.6%,list_sort() 本身加上 merge() 總共只有 488,255 個而已,大約是 q_sort() 的六成左右,q_sort() 能比較快完全是憑藉著排序過程中進行 strcmp() 時發生 cache miss 的次數較少。
### 改變新增元素時的 malloc() 方式
若使用 [komark06](https://hackmd.io/@komark06/linux2022-lab0#q_free-%E5%AF%A6%E4%BD%9C) 提出的只使用一次 malloc() 配置 element_t 和內部字串的記憶體,搭配 [kdnvt](https://hackmd.io/WuqBNTdFQnuNxkFtR6FIWg?both#%E7%94%A8-valgrind-cachegrind-%E5%88%86%E6%9E%90-cache-miss%EF%BC%9A) 的實驗結果,list_sort() 的速度可能會變得比 q_sort() 還要快。
對於新增元素時 malloc() 的方式進行以下的更改:
```diff
#define q_create_element_with_value(element, val) \
+ int len = strlen(val); \
- element_t *element = malloc(sizeof(element_t)); \
+ element_t *element = malloc(sizeof(element_t) + len + 1); \
if (element) { \
- element->value = strdup(val); \
- if (!element->value) { \
- free(element); \
- element = NULL; \
- } \
+ element->value = (char *) element + sizeof(element_t); \
+ strncpy(element->value, val, len); \
+ element->value[len] = '\0'; \
}
```
改寫後重新測試:
```bash
$ ./test_sort.sh
$ grep -E Delta my_result.txt | sort
Delta time = 0.982
Delta time = 1.002
Delta time = 1.011
Delta time = 1.014
Delta time = 1.026
Delta time = 1.041
Delta time = 1.044
Delta time = 1.045
Delta time = 1.046
Delta time = 1.048
Delta time = 1.048
Delta time = 1.050
Delta time = 1.052
Delta time = 1.066
Delta time = 1.068
Delta time = 1.070
Delta time = 1.079
Delta time = 1.079
Delta time = 1.079
Delta time = 1.094
$ grep -E Delta linux_result.txt | sort
Delta time = 1.090
Delta time = 1.091
Delta time = 1.093
Delta time = 1.094
Delta time = 1.095
Delta time = 1.096
Delta time = 1.099
Delta time = 1.105
Delta time = 1.106
Delta time = 1.106
Delta time = 1.108
Delta time = 1.111
Delta time = 1.114
Delta time = 1.119
Delta time = 1.124
Delta time = 1.133
Delta time = 1.138
Delta time = 1.138
Delta time = 1.144
Delta time = 1.150
```
兩邊的執行時間都有降下來了,不過仍然是 q_sort() 比較快。
不過使用 valgrind 時,卻出現了問題:
沒有使用 valgrind 時,無論是執行 `./test_sort.sh` 或是直接執行 `./qtest -f traces/trace-19-test-my-sort.cmd`,time sort 得出的時間都比 element 與 value 分開 malloc() 的版本來的短。
可是一旦搭配 valgrind 後:
```bash
$ valgrind --tool=cachegrind ./qtest -f traces/trace-19-test-my-sort.cmd
$ valgrind --tool=cachegrind ./qtest -f traces/trace-20-test-linux-sort.cmd
```
trace-19-test-my-sort.cmd 測試結果(僅列出相關內容):
```
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
2,226,715,824 1,691 1,671 584,243,513 24,123,415 14,512,734 263,839,259 9,929,850 5,649,957 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
422,608,446 10 10 93,579,579 6,346,541 1,982,516 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
246,361,546 12 12 78,031,532 5,699,854 1,906,220 22,215,654 1 1 /home/dyu0y/linux2022/lab0-c/queue.c:q_sort
112,855,909 10 10 37,785,639 1,525,410 524,334 73,571,276 6,907,432 2,944,581 /home/dyu0y/linux2022/lab0-c/list.h:q_sort
```
:::spoiler trace-19-test-my-sort 完整測試資料
```
--------------------------------------------------------------------------------
I1 cache: 32768 B, 64 B, 8-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 6291456 B, 64 B, 12-way associative
Command: ./qtest -f traces/trace-19-test-my-sort.cmd
Data file: cachegrind.out.12720
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation: off
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
2,226,715,824 1,691 1,671 584,243,513 24,123,415 14,512,734 263,839,259 9,929,850 5,649,957 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
422,608,446 10 10 93,579,579 6,346,541 1,982,516 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
318,755,732 3 3 80,011,560 3 0 30,004,335 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/random_r.c:random_r
246,361,546 12 12 78,031,532 5,699,854 1,906,220 22,215,654 1 1 /home/dyu0y/linux2022/lab0-c/queue.c:q_sort
210,030,345 2 2 80,011,560 0 0 20,002,890 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/random.c:random
141,008,621 32 32 24,001,841 2 1 24,001,170 999,917 999,902 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_malloc
129,028,637 2 2 13,002,647 1 1 22,004,091 0 0 /home/dyu0y/linux2022/lab0-c/qtest.c:fill_rand_string
112,855,909 10 10 37,785,639 1,525,410 524,334 73,571,276 6,907,432 2,944,581 /home/dyu0y/linux2022/lab0-c/list.h:q_sort
81,007,447 13 13 23,002,169 50,426 47,726 10,001,119 3 3 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_free
69,436,522 51 49 34,718,222 80 54 19 1 1 ???:???
54,000,731 10 10 8,000,110 5,998,819 5,764,077 2,000,100 3 3 /home/dyu0y/linux2022/lab0-c/qtest.c:show_queue
48,002,446 7 7 12,000,665 6 5 4,000,267 1 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:malloc
47,621,997 10 10 11,999,990 250,384 250,256 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_l_avx
45,007,220 1 1 9,001,444 0 0 9,001,444 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/rand.c:rand
41,002,647 3 3 13,002,647 0 0 7,000,000 0 0 /home/dyu0y/linux2022/lab0-c/queue.c:q_insert_tail
40,000,039 3 3 10,000,010 1 1 12,000,011 350,168 350,168 /home/dyu0y/linux2022/lab0-c/harness.c:test_malloc
38,000,051 4 4 12,000,012 998,344 861,495 7,000,006 1,670,832 1,354,293 /home/dyu0y/linux2022/lab0-c/harness.c:test_free
37,600,470 26 25 5,000,077 0 0 2,000,039 3 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcpy-avx2.S:__strncpy_avx2
24,000,070 9 9 7,000,024 1 1 3,000,012 0 0 /home/dyu0y/linux2022/lab0-c/qtest.c:do_it
24,000,024 5 5 2,000,002 0 0 4,000,004 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms
21,001,302 2 2 5,000,310 249,776 242,709 0 0 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:free
20,002,890 0 0 10,001,445 1 0 0 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/../sysdeps/unix/sysv/linux/x86/lowlevellock.h:random
12,000,040 6 6 3,000,008 1,000,002 999,101 1,000,012 0 0 /home/dyu0y/linux2022/lab0-c/qtest.c:do_sort
12,000,032 4 4 6,000,016 4 4 3,000,008 0 0 /home/dyu0y/linux2022/lab0-c/harness.c:error_check
7,000,023 1 1 1,000,006 999,367 931,586 1,000,003 0 0 /home/dyu0y/linux2022/lab0-c/queue.c:q_free
5,000,000 0 0 1,000,000 0 0 4,000,000 0 0 /home/dyu0y/linux2022/lab0-c/list.h:q_insert_tail
4,000,009 1 1 1,000,002 1,000,001 1,000,001 0 0 0 /home/dyu0y/linux2022/lab0-c/queue.c:q_size
4,000,004 1 1 1,000,001 0 0 1,000,001 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_free
4,000,004 0 0 0 0 0 1,000,001 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_malloc
3,999,996 2 2 1,999,998 2 2 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_avx
3,000,000 0 0 0 0 0 1,000,000 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:q_insert_tail
```
:::
trace-20-test-linux-sort.cmd 測試結果(僅列出相關內容):
```
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
2,306,265,760 1,670 1,650 558,955,633 32,291,630 18,792,171 239,096,012 3,849,674 3,527,751 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
421,944,336 10 10 93,434,189 8,864,020 3,524,362 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
248,128,571 1 1 27,686,544 7,808 30 43,373,112 1 1 /home/dyu0y/linux2022/lab0-c/qtest.c:merge
149,492,456 1 1 56,059,671 12,611,175 4,917,600 18,686,557 1 1 /home/dyu0y/linux2022/lab0-c/qtest.c:sort_comp
42,024,269 7 7 6,999,966 257,196 250,067 8,999,976 827,106 821,608 /home/dyu0y/linux2022/lab0-c/qtest.c:list_sort
```
:::spoiler trace-20-test-linux-sort 完整測試資料
```
--------------------------------------------------------------------------------
I1 cache: 32768 B, 64 B, 8-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 6291456 B, 64 B, 12-way associative
Command: ./qtest -f traces/trace-20-test-linux-sort.cmd
Data file: cachegrind.out.13059
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation: off
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
2,306,265,760 1,670 1,650 558,955,633 32,291,630 18,792,171 239,096,012 3,849,674 3,527,751 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
421,944,336 10 10 93,434,189 8,864,020 3,524,362 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
318,687,944 3 3 79,994,544 3 0 29,997,954 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/random_r.c:random_r
248,128,571 1 1 27,686,544 7,808 30 43,373,112 1 1 /home/dyu0y/linux2022/lab0-c/qtest.c:merge
209,985,678 2 2 79,994,544 0 0 19,998,636 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/random.c:random
149,492,456 1 1 56,059,671 12,611,175 4,917,600 18,686,557 1 1 /home/dyu0y/linux2022/lab0-c/qtest.c:sort_comp
141,009,149 32 32 24,001,961 2 1 24,001,200 999,921 999,899 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_malloc
129,004,739 2 2 13,001,360 1 1 22,000,677 0 0 /home/dyu0y/linux2022/lab0-c/qtest.c:fill_rand_string
81,008,124 13 13 23,002,366 50,220 47,576 10,001,211 3 3 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_free
69,374,112 51 49 34,687,017 80 54 19 1 1 ???:???
54,000,731 10 10 8,000,110 5,998,797 5,764,316 2,000,100 3 3 /home/dyu0y/linux2022/lab0-c/qtest.c:show_queue
48,002,434 7 7 12,000,665 6 5 4,000,271 1 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:malloc
47,621,380 7 7 11,999,988 250,195 250,050 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_l_avx
44,996,585 1 1 8,999,317 0 0 8,999,317 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/rand.c:rand
42,024,269 7 7 6,999,966 257,196 250,067 8,999,976 827,106 821,608 /home/dyu0y/linux2022/lab0-c/qtest.c:list_sort
41,001,360 3 3 13,001,360 0 0 7,000,000 0 0 /home/dyu0y/linux2022/lab0-c/queue.c:q_insert_tail
40,000,039 3 3 10,000,010 1 1 12,000,011 350,092 350,092 /home/dyu0y/linux2022/lab0-c/harness.c:test_malloc
38,000,053 4 4 12,000,012 998,417 861,766 7,000,006 1,671,047 1,355,132 /home/dyu0y/linux2022/lab0-c/harness.c:test_free
37,600,372 25 24 5,000,077 0 0 2,000,039 3 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcpy-avx2.S:__strncpy_avx2
24,000,070 9 9 7,000,024 1 1 3,000,012 0 0 /home/dyu0y/linux2022/lab0-c/qtest.c:do_it
24,000,024 5 5 2,000,002 0 0 4,000,004 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms
21,001,302 2 2 5,000,310 249,915 243,001 0 0 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:free
19,998,636 0 0 9,999,318 1 0 0 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/../sysdeps/unix/sysv/linux/x86/lowlevellock.h:random
12,000,039 6 6 3,000,008 1,000,004 999,122 1,000,012 0 0 /home/dyu0y/linux2022/lab0-c/qtest.c:do_sort
12,000,032 4 4 6,000,016 4 4 3,000,008 0 0 /home/dyu0y/linux2022/lab0-c/harness.c:error_check
7,000,023 1 1 1,000,006 999,387 931,573 1,000,003 0 0 /home/dyu0y/linux2022/lab0-c/queue.c:q_free
5,000,000 0 0 1,000,000 0 0 4,000,000 0 0 /home/dyu0y/linux2022/lab0-c/list.h:q_insert_tail
4,000,009 1 1 1,000,002 1,000,001 1,000,001 0 0 0 /home/dyu0y/linux2022/lab0-c/queue.c:q_size
4,000,004 1 1 1,000,001 0 0 1,000,001 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_free
4,000,004 0 0 0 0 0 1,000,001 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_malloc
3,999,996 2 2 1,999,998 2 2 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_avx
3,000,000 0 0 0 0 0 1,000,000 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:q_insert_tail
```
:::
無論是 q_sort() 或是 list_sort() 總體的 DLmr 都提昇為更改 malloc 方式前的兩倍多,並且關鍵的呼叫 strcmp() 時產生的 DLmr 數量也都沒有下降,下方是比較表格。
| q_sort | 改版前的 DLmr | 改版後的 DLmr |
| ------:| -----------:| ------------:|
| strcmp | 1,575,233 | 1,982,516 |
| q_sort | 817,612 | 2,430,554 |
| list_sort | 改版前的 DLmr | 改版後的 DLmr |
| ---------:| -----------:| ------------:|
| strcmp | 1,967,595 | 3,524,362 |
| sort_cmp | 1,480,133 | 4,917,600 |
| merge | 316,007 | 30 |
| list_sort | 172,248 | 250,067 |
Cache miss 發生的總數大幅上升,但單看不使用 valgrind 時的執行時間是變短的,雖然不知道具體發生了什麼事,但只能猜測可能是 valgrind 在模擬和採集數據的過程中出了點問題,才出現了這種矛盾的結果。
## 在 qtest 提供新的命令 shuffle
### 想法
參考 [kdnvt](https://hackmd.io/WuqBNTdFQnuNxkFtR6FIWg?both#%E5%9C%A8-qtest-%E6%8F%90%E4%BE%9B%E6%96%B0%E7%9A%84%E5%91%BD%E4%BB%A4-shuffle) 的紀錄,可以知道直接對 linked list 實做 [Fisher–Yates shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle) 的基本時間複雜度為 $O(n^2)$,其來自遍歷整個 list 需要 $O(n)$,遍歷的過程中,每經過一個節點最多須花費 $O(n)$ 的時間找到與其交換的另一個節點,才能對兩者進行交換。
這邊我有個想法是,將原本的 list 改變為 circular singly-linked list,將目前空出來的 prev 用來儲存隨機數,示意圖如下。
原本的 list:
```
# a circular doubly-linked list:
head <~~> a <~~> b <~~> c <~~> d <~~|
^__________________________________|
```
轉變成 single-linked list 後:
```
# a circular singly-linked list:
head ~~> a ~~> b ~~> c ~~> d ~~|
^_____________________________|
```
使用 prev 儲存隨機出來的數字:
```
# a circular singly-linked list:
head ~~> a ~~> b ~~> c ~~> d ~~|
^ 1 3 2 3 |
|_____________________________|
```
完成以上步驟後,再利用 prev 儲存的數字對整個 list 做排序,雖然過程與結果並不等價於 Fisher–Yates shuffle,但也差不多打亂了原本的順序。
時間複雜度部份:
1. 將 list 變為 singly-linked list:$O(n)$
2. 給每一個節點的 prev 給予一個隨機值: $O(n)$
3. 排序:$O(nlogn)$
4. 修復為 doubly-linked list: $O(n)$
總共為 $O(nlogn)$,其中步驟 1. 和 2. 是可以在同一次的遍歷中進行的。
完整實做待補,對 singly-linked list 進行費時 $O(nlogn)$ 的排序,可以經由修改 q_sort() 後達成。
### 實做
後來發現可以更簡單的達成,先將 q_sort() 進行簡單的重構:
```cpp
#define LIST_SORT_IMPL(head, cmp) \
({ \
LIST_HEAD(list); \
int size = 0; \
struct list_head *it, *safe; \
list_for_each_safe (it, safe, head) { \
list_del(it); \
list_add(it, &list); \
if (size & 1) { \
list_merge(&list, size, cmp); \
} \
++size; \
} \
list_merge_final(&list, size, cmp); \
list_splice(&list, head); \
})
/*
* Sort elements of queue in ascending order
* No effect if q is NULL or empty. In addition, if q has only one
* element, do nothing.
*/
void q_sort(struct list_head *head)
{
if (!head || list_empty(head) || list_is_singular(head))
return;
LIST_SORT_IMPL(head, MY_CMP);
}
```
然後新增以下程式碼:
queue.c
```cpp
#define RAND_CMP(l, r) ({ (rand() < rand()); })
void q_shuffle(struct list_head *head)
{
if (!head || list_empty(head) || list_is_singular(head))
return;
LIST_SORT_IMPL(head, RAND_CMP);
}
```
因為要 commit 時發現不被允許更改 queue.h,所以將 q_shuffle() 的宣告放在 qtest.c 中。
qtest.c
```cpp
void q_shuffle(struct list_head *head);
static bool do_shuffle(int argc, char **argv)
{
if (argc != 1) {
report(1, "%s takes no arguments", argv[0]);
return false;
}
if (!l_meta.l)
report(3, "Warning: Try to access null queue");
error_check();
q_shuffle(l_meta.l);
show_queue(3);
return !error_check();
}
```
並在 console_init() 中加入指令
```diff
static void console_init()
{
// ...
+ ADD_COMMAND(shuffle, " | Shuffle every nodes in queue");
// ...
}
```
測試結果:
```bash
$ ./qtest -v 3
cmd> new
l = []
cmd> ih RAND 1000000
l = [sdoxe zranucmrz vcinebi jgmwdwbb kvadvmdc yqoams ipuvoksl ahboq nvunpubxr ztbydgevp rjjcv fzjxxs xjdfs bmlkmicox ptqwia uwckg lgjzupc rnyren imdynpif nufluy ckfxslap zsfmeel oubxi inzvrqzn qcbsmlm hofpxnxyh kjsiqumc gyuqnscoc ksrrqbsa wmzixugqz ... ]
cmd> time sort
l = [aaaai aaaawu aaablykts aaackg aaacm aaadj aaadwqyiw aaaejhfw aaaffxur aaafllsl aaafmeni aaagdpxq aaaha aaahaow aaahlacqr aaaiikgan aaaiqrs aaajcouag aaajxfi aaakb aaakhlsqp aaakxgpfj aaalfgfnj aaals aaamc aaamg aaamld aaamlgl aaamn aaampxm ... ]
Delta time = 1.086
cmd> time shuffle
l = [urbigi xfeay fzprgy ihzdxto ejskbihw axbpcimf txwvzy aqccwj cyvxfy ksfzdgjv yntxk iurpply ksykscdp qchnhxiu vqipigpx xdurz zzzirxus tzdrdlmpv opqalhena yaakyter macqpxmgs gzlmhp axozn mlzoiwg wmywcqt qmxvh wvgng ouepwqh glljdvpl vtesqs ... ]
Delta time = 1.236
cmd> sort
l = [aaaai aaaawu aaablykts aaackg aaacm aaadj aaadwqyiw aaaejhfw aaaffxur aaafllsl aaafmeni aaagdpxq aaaha aaahaow aaahlacqr aaaiikgan aaaiqrs aaajcouag aaajxfi aaakb aaakhlsqp aaakxgpfj aaalfgfnj aaals aaamc aaamg aaamld aaamlgl aaamn aaampxm ... ]
cmd> quit
Freeing queue
```