facebooc-q0 - HackMD

# facebooc-q0 ###### tags: `facebooc 2022` `linux` contributed by <discord: `pau123`, github: [`paul90317`](https://github.com/paul90317)> [共筆](https://hackmd.io/KgutBL34SBmIjZJIyk4N-Q?view) [GitHub](https://github.com/paul90317/facebooc-q0) 使用的排版命令 ```shell $ astyle --style=linux --suffix=none *.[ch] -H -p -U -f ``` 要留言請用 tag 的方式放到 **整份共筆** 的最底下，例如 :::success [pau1234 筆記](#pau1234-筆記) 這筆記很實用 :notes: pau1234 ::: **禁止人身攻擊** ## pau1234 筆記 > 歡迎參考或修改 [astyle 使用教學](https://hackmd.io/@paul90317/astyle) [vim 使用教學](https://hackmd.io/@paul90317/vim) [gdb 使用教學](https://hackmd.io/@paul90317/gdb) [time 使用教學](https://hackmd.io/@paul90317/time) ## Queue 實作 ### q_new 初始化 queue ```c /* * Create empty queue. * Return NULL if could not allocate space. */ queue_t *q_new() { queue_t *ret = malloc(sizeof(queue_t)); ret->head = NULL; ret->tail = NULL; ret->size = 0; return ret; } ``` ### q_free 清除 queue 記憶體 ```c /* Free all storage used by queue */ void q_free(queue_t *q) { element_t *now = q->head, *next; while (now) { next = now->next; free(now->value); free(now); now = next; } free(q); } ``` ### q_insert_head 將一個 string 利用 `strdup` 複製一份，加入 queue 的起始點。 ```c /* * Attempt to insert element at head of queue. * Return true if successful. * Return false if q is NULL or could not allocate space. * Argument s points to the string to be stored. * The function must explicitly allocate space and copy the string into it. */ bool q_insert_head(queue_t *q, char *s) { if (!q) return false; element_t *t = malloc(sizeof(element_t)); if (!t) return false; t->value = strdup(s); if (!t->value) { + free(t); return false; } t->next = q->head; if (!q->head) q->tail = t; q->head = t; ++q->size; return true; } ``` :::warning 之前忘記考慮 `malloc`、`strdup` 失敗的情況，根據 [strdup - cplusplus](https://en.cppreference.com/w/c/experimental/dynamic/strdup) > ... > If an error occurs, a null pointer is returned and errno may be set. > ... 若失敗 (記憶體不足)，回傳空指針。 ::: ### q_insert_tail 將一個 string 利用 `strdup` 複製一份，加入 queue 的尾巴。 ```c /* * Attempt to insert element at tail of queue. * Return true if successful. * Return false if q is NULL or could not allocate space. * Argument s points to the string to be stored. * The function must explicitly allocate space and copy the string into it. */ bool q_insert_tail(queue_t *q, char *s) { if (!q) return false; element_t *t = malloc(sizeof(element_t)); if (!t) return false; t->value = strdup(s); if (!t->value) return false; t->next = NULL; if (q->tail) { q->tail->next = t; q->tail = t; } else { q->tail = t; q->head = t; } ++q->size; return true; } ``` ### q_remove_head 移除頭元素 ```c /* * Attempt to remove element from head of queue. * Return true if successful. * Return false if queue is NULL or empty. * If sp is non-NULL and an element is removed, copy the removed string to *sp * (up to a maximum of bufsize-1 characters, plus a null terminator.) * The space used by the list element and the string should be freed. */ bool q_remove_head(queue_t *q, char *sp, size_t bufsize) { if (!q || !q->head) return false; element_t *t = q->head; if (sp){ strncpy(sp, t->value, bufsize - 1); + sp[bufsize-1] = 0; } free(t->value); q->head = t->next; free(t); --q->size; return true; } ``` ### q_size 回傳大小 ```c /* * Return number of elements in queue. * Return 0 if q is NULL or empty */ size_t q_size(queue_t *q) { return q ? q->size : 0; } ``` ### q_reverse 利用 stack `stk`，翻轉整個 `queue`。練習題 [206. Reverse Linked List - LeetCode](https://leetcode.com/problems/reverse-linked-list/) ```c /* * Reverse elements in queue * No effect if q is NULL or empty * This function should not allocate or free any list elements * (e.g., by calling q_insert_head, q_insert_tail, or q_remove_head). * It should rearrange the existing ones. */ void q_reverse(queue_t *q) { element_t *stk = NULL, *next, *head = q->head; q->tail = head; while (head) { next = head->next; head->next = stk; stk = head; head = next; } q->head = stk; } ``` ## Merge Sort 實作 ### q_insert_ele_tail 該函數直接整合了把 `t` 的頭元素拿出，放入 `q` 的尾巴的過程。 ```c //by paul90317 /* Insert the head of t to tail of q without malloc and free * Return next of t */ element_t *q_insert_ele_tail(queue_t *q, element_t *t) { element_t *ret = t->next; t->next = NULL; if (!q->size) { q->head = t; q->tail = t; } else { q->tail->next = t; q->tail = t; } ++q->size; return ret; } ``` ### l_mid_last 回傳 list `h` 中間的前一個，因為在 divde 實要把中間切斷，變成兩條 list。而右邊的 list 開頭一定要是中間元素，所以才需要回傳前一個。 ```c //by paul90317 /* Return last element of mid element of h * Return NULL if t only have 1 or 0 element. */ element_t *l_mid_last(element_t *h) { if (!h || !h->next) return NULL; element_t *t = h; h = h->next->next; while (h && h->next) { t = t->next; h = h->next->next; } return t; } ``` ### merge 合併兩個已經排序的 list。練習題 [21. Merge Two Sorted Lists - LeetCode](https://leetcode.com/problems/merge-two-sorted-lists/) ```c //by paul90317 /* Merge two sorted list a, b to dst */ void merge(queue_t *dst, element_t *a, element_t *b) { memset(dst, 0, sizeof(queue_t)); while (a && b) { if (strcmp(a->value, b->value) <= 0) { a = q_insert_ele_tail(dst, a); } else { b = q_insert_ele_tail(dst, b); } } while (a) { a = q_insert_ele_tail(dst, a); } while (b) { b = q_insert_ele_tail(dst, b); } } ``` ### merge sort 遞迴 function，先將 list 切成兩份，對兩份分別呼叫自己做排序，最後再用 `merge` 做合併。 ```c /* * The function's sorting algorithm should be merge sort. */ void merge_sort(element_t **head) { if (!(*head) || !(*head)->next) return; element_t *hh = *head; element_t *half = l_mid_last(hh); element_t *temp = half; half = half->next; temp->next = NULL; merge_sort(&hh); merge_sort(&half); queue_t newq = {0}; merge(&newq, hh, half); *head = newq.head; } ``` :::danger `l_mid_last()` 若是沒有 `h = h->next->next`，將會出現 **Segmentation fault**，`gdb` 顯示如下 ```shell Program received signal SIGSEGV, Segmentation fault. merge_sort (head=head@entry=0x7fffff7ff030) at queue.c:245 245 element_t* hh = *head; ``` 你應該很好奇，為什麼錯誤會發生在這一行，`head` 到這一行是一個 `malloc` 的空間，存取應該要是有效的。那原因只會是 **Stack overflow**，也就是原先挖的 `head`，讀取時會超過界限。發生原因是因為斷點太靠後，例如 `2` 變成 `2 0`, `3` 變成 `3 0`，形成無窮遞迴。 **Stack overflow 是 Segmentation fault 的一種** [`Yu-Wei-Chang` 的筆記 - 已知問題](https://hackmd.io/@Yu-Wei-Chang/HJ5lkM-U8#%E5%B7%B2%E7%9F%A5%E5%95%8F%E9%A1%8C) :::  ## 設計實驗 :::spoiler 環境 ```shell $ lscpu BogoMIPS: 4799.99 Virtualization: VT-x Hypervisor vendor: Microsoft Virtualization type: full L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 8192K Flags: ... ``` ::: 為了完成實驗，我修改了 `test.c`，讓他可以針對不同數據大小去做測試，方法是複製原本的數據。 :::spoiler 修改後的 `test.c` ```clike #include <assert.h> #include <stdbool.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include "queue.h" /*static void q_show(queue_t *q) { for (element_t *e = q->head; e; e = e->next) printf("%s", e->value); }*/ static bool validate(queue_t *q) { size_t size = q->size; q_sort(q); size_t cnt = 1; for (element_t *e = q->head; e->next; e = e->next) { ++cnt; if (strcmp(e->value, e->next->value) > 0) return false; } if (cnt != size) return false; q_reverse(q); cnt = 1; for (element_t *e = q->head; e->next; e = e->next) { ++cnt; if (strcmp(e->value, e->next->value) < 0) return false; } if (cnt != size) return false; return true; } int main(int argc, char**argv) { queue_t *q = q_new(); char buf[256]; FILE *fp = fopen("cities.txt", "r"); if (!fp) { perror("failed to open cities.txt"); exit(EXIT_FAILURE); } while (fgets(buf, 256, fp)) q_insert_head(q, buf); fclose(fp); element_t *curr = q->head; size_t test_size = argc >= 2 ? strtoull(argv[1], NULL, 10) : 0; for (size_t i = 0; i < test_size; curr = curr->next, ++i) { q_insert_tail(q, curr->value); } puts("Start!"); assert(validate(q)); puts("Finish!"); q_free(q); return 0; } ``` ```shell $ time ./test [額外資料大小] ``` ::: ## `strdup` 有無測試考慮以下兩種 `q_insert_ele_tail` 的實作方式，進行效能與快取測試。 ### 實作一 ```c element_t * q_insert_ele_tail(queue_t *q, element_t *t) { element_t *ret = t->next; q_insert_tail(q, t->value); free(t->value); free(t); return ret; } ``` 程式簡單，但因為 `q_insert_tail` 會 **複製記憶體** 以及 **記憶體內容**，所以外面的記憶體需要 `t` 與 `t->value` 都需要 `free`，增加兩個多餘的記憶體配置操作，以下是結果。 :::spoiler cachegrind 一般大小 ```shell $ valgrind --tool=cachegrind ./test ==574== Cachegrind, a cache and branch-prediction profiler ==574== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al. ==574== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info ==574== Command: ./test ==574== --574-- warning: L3 cache found, using its data for the LL simulation. Start! Finish! ==574== ==574== I refs: 675,750,046 ==574== I1 misses: 1,221 ==574== LLi misses: 1,193 ==574== I1 miss rate: 0.00% ==574== LLi miss rate: 0.00% ==574== ==574== D refs: 271,443,253 (176,990,984 rd + 94,452,269 wr) ==574== D1 misses: 3,934,493 ( 3,804,416 rd + 130,077 wr) ==574== LLd misses: 107,314 ( 2,057 rd + 105,257 wr) ==574== D1 miss rate: 1.4% ( 2.1% + 0.1% ) ==574== LLd miss rate: 0.0% ( 0.0% + 0.1% ) ==574== ==574== LL refs: 3,935,714 ( 3,805,637 rd + 130,077 wr) ==574== LL misses: 108,507 ( 3,250 rd + 105,257 wr) ==574== LL miss rate: 0.0% ( 0.0% + 0.1% ) ``` ::: :::spoiler time ```shell $ time ./test 10000000 Start! Finish! real 0m35.136s user 0m34.418s sys 0m0.261s ``` ::: ### 實作二 ```c element_t *q_insert_ele_tail(queue_t *q, element_t *t) { element_t *ret = t->next; t->next = NULL; if (!q->size) { q->head = t; q->tail = t; } else { q->tail->next = t; q->tail = t; } ++q->size; return ret; } ``` 這是不需要額外 `strdup`、`free` 的實作，`valgrind` 結果如下。 :::spoiler cachegrind 1千萬大小 ```shell $ valgrind --tool=cachegrind ./test 10000000 ==535== Cachegrind, a cache and branch-prediction profiler ==535== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al. ==535== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info ==535== Command: ./test 10000000 ==535== --535-- warning: L3 cache found, using its data for the LL simulation. ==535== brk segment overflow in thread #1: can't grow to 0x4a4b000 ==535== (see section Limitations in user manual) ==535== NOTE: further instances of this message will not be shown Start! Finish! ==535== ==535== I refs: 23,267,018,490 ==535== I1 misses: 1,261 ==535== LLi misses: 1,249 ==535== I1 miss rate: 0.00% ==535== LLi miss rate: 0.00% ==535== ==535== D refs: 8,371,756,742 (5,993,606,568 rd + 2,378,150,174 wr) ==535== D1 misses: 577,302,877 ( 565,694,399 rd + 11,608,478 wr) ==535== LLd misses: 326,976,374 ( 315,710,868 rd + 11,265,506 wr) ==535== D1 miss rate: 6.9% ( 9.4% + 0.5% ) ==535== LLd miss rate: 3.9% ( 5.3% + 0.5% ) ==535== ==535== LL refs: 577,304,138 ( 565,695,660 rd + 11,608,478 wr) ==535== LL misses: 326,977,623 ( 315,712,117 rd + 11,265,506 wr) ==535== LL miss rate: 1.0% ( 1.1% + 0.5% ) ``` ::: :::warning ```shell ==535== brk segment overflow in thread #1: can't grow to 0x4a4b000 ==535== (see section Limitations in user manual) ==535== NOTE: further instances of this message will not be shown ``` https://stackoverflow.com/questions/35129135/valgrind-reporting-a-segment-overflow 該訊息是因為 `queue` 設太大 (1千萬)，記憶體不夠，需要用 `mmap()` 這樣的函數去模擬，詳細點網址。 ::: :::spoiler cachegrind 一般大小 ```shell $ valgrind --tool=cachegrind ./test ==582== Cachegrind, a cache and branch-prediction profiler ==582== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al. ==582== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info ==582== Command: ./test ==582== --582-- warning: L3 cache found, using its data for the LL simulation. Start! Finish! ==582== ==582== I refs: 173,273,997 ==582== I1 misses: 1,219 ==582== LLi misses: 1,191 ==582== I1 miss rate: 0.00% ==582== LLi miss rate: 0.00% ==582== ==582== D refs: 65,105,910 (43,871,317 rd + 21,234,593 wr) ==582== D1 misses: 2,949,205 ( 2,841,039 rd + 108,166 wr) ==582== LLd misses: 107,310 ( 2,057 rd + 105,253 wr) ==582== D1 miss rate: 4.5% ( 6.5% + 0.5% ) ==582== LLd miss rate: 0.2% ( 0.0% + 0.5% ) ==582== ==582== LL refs: 2,950,424 ( 2,842,258 rd + 108,166 wr) ==582== LL misses: 108,501 ( 3,248 rd + 105,253 wr) ==582== LL miss rate: 0.0% ( 0.0% + 0.5% ) ``` ::: :::spoiler time ```shell $ time ./test 10000000 Start! Finish! real 0m16.724s user 0m15.837s sys 0m0.381s ``` ::: ### 分析比較要分析，要先了解 cache，首先 cache 分很多層 (L1, L2, L3)，數字越大離 CPU 越遠，大小越大，memory stall 也會越高。而 cachegrind 顯示的 LLi 與 LLd 是 instruction 和 data 的 cache，會比 LL 更接近 CPU 一點。  我們可以看出 LLd misses (rd) 沒有區別，為什麼 ? `strdup` 會增加 reference 次數，因為 `strdup` 需要 **複製記憶體內容**，但是字串的記憶體是連續且短的，才導致 miss 次數沒有顯著增加。可以看到實作一不論哪一個 cache 的 reference time 都高了不少，而最後時間的部分是實作二明顯比較短。 ## Iteration v.s. Recursive 該實驗參考自 [`bakudr18` 的筆記 - merge sort list 的結論](https://hackmd.io/@bakudr18/BkvLMLkeq#merge-sort-list-%E7%B5%90%E8%AB%96) > ... > 比較兩個方法的 q_sort 實作，divide and conquer 比起分段合併少走訪 nlog(n) 個 node，而且更 cache friendly。 > ... 於是我想設計一個實驗，讓 iteration 的 merge sort 不用分段合併的順序去合併，而是使用 cache friendly iteration 的合併方式。看一下是否會比 cache friendly recursive 的 merge sort 還要快。 ### 分段合併 example ```shell= ;一開始有 8 條 sorted list，長度皆為 1 1 1 1 1 1 1 1 1 ;前兩個 1 合併 2 1 1 1 1 1 1 ;前兩個 1 合併 2 2 1 1 1 1 ;前兩個 1 合併 2 2 2 1 1 ;前兩個 1 合併 2 2 2 2 ;沒有 1，前兩個 2 合併 4 2 2 ;前兩個 2 合併 4 4 ;沒有 2，前兩個 4 合併 8 ;最後得到長度為 8 的 sorted list ``` ### cache friendly iteration cache friendly 的實作方式就是將 8 條 sorted list 輸入到 stack 中，過程中只要 stack 最上面兩個 sorted list 長度相同，就合併。 stack|input|註解 --|--|-- (empty)|1 1 1 1 1 1 1 1|一開始有 8 條 sorted list 1|1 1 1 1 1 1 1|輸入 1 1 1|1 1 1 1 1 1|輸入 1 2|1 1 1 1 1 1|合併 2 1|1 1 1 1 1|輸入 1 2 1 1|1 1 1 1|輸入 1 2 2|1 1 1 1|合併 4|1 1 1 1|合併 4 1|1 1 1|輸入 1 4 1 1|1 1|輸入 1 4 2|1 1|合併 4 2 1|1|輸入 1 4 2 1 1|(empty)|輸入 1 4 2 2|(empty)|合併 4 4|(empty)|合併 8|(empty)|合併可以看到整個過程都是 cache friendly，這個演算法時作時需要有 stack 這樣一個額外空間，所需的複雜度是 $O(lgn)$，這點跟 merge sort 的 function stack 是一樣的。 ### code ```c //實作簡單的 log2 size_t lg(int x) { size_t cnt = -1; while (x) { ++cnt; x >>= 1; } return cnt; } ``` ```c //確認可不可以合併 bool can_merge(queue_t *stack, size_t stack_size) { return stack_size >= 2 && stack[stack_size - 2].size == stack[stack_size - 1].size; } ``` ```c //演算法 void q_sort(queue_t *q) { if (!q || q->size <= 1) return; //initialize stack queue_t *stack = malloc((lg(q->size) + 1) * sizeof(queue_t)); size_t stack_size = 0; element_t *head = q->head, *temp; while (head) { //extract one element temp = head; head = head->next; temp->next = NULL; //input one element stack[stack_size].head = temp; stack[stack_size].tail = temp; stack[stack_size].size = 1; ++stack_size; //merge until can't merge anymore while (can_merge(stack, stack_size)) { int sz = stack[stack_size - 2].size + stack[stack_size - 1].size; merge(stack + stack_size - 2, stack[stack_size - 2].head, stack[stack_size - 1].head); --stack_size; stack[stack_size - 1].size = sz; } } //merge until remained one element while (stack_size >= 2) { int sz = stack[stack_size - 2].size + stack[stack_size - 1].size; merge(stack + stack_size - 2, stack[stack_size - 2].head, stack[stack_size - 1].head); --stack_size; stack[stack_size - 1].size = sz; } memcpy(q, stack, sizeof(queue_t)); free(stack); } ``` :::spoiler time ```shell $ time ./test 10000000 Start! Finish! real 0m15.311s user 0m14.534s sys 0m0.272s ``` ::: :::spoiler cachegrind ```shell $ valgrind --tool=cachegrind ./test 10000000 ==440== Cachegrind, a cache and branch-prediction profiler ==440== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al. ==440== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info ==440== Command: ./test 10000000 ==440== --440-- warning: L3 cache found, using its data for the LL simulation. ==440== brk segment overflow in thread #1: can't grow to 0x4a4b000 ==440== (see section Limitations in user manual) ==440== NOTE: further instances of this message will not be shown Start! Finish! ==440== ==440== I refs: 22,829,542,611 ==440== I1 misses: 1,262 ==440== LLi misses: 1,250 ==440== I1 miss rate: 0.00% ==440== LLi miss rate: 0.00% ==440== ==440== D refs: 7,999,995,859 (5,676,338,287 rd + 2,323,657,572 wr) ==440== D1 misses: 362,632,224 ( 351,291,993 rd + 11,340,231 wr) ==440== LLd misses: 232,662,485 ( 221,398,889 rd + 11,263,596 wr) ==440== D1 miss rate: 4.5% ( 6.2% + 0.5% ) ==440== LLd miss rate: 2.9% ( 3.9% + 0.5% ) ==440== ==440== LL refs: 362,633,486 ( 351,293,255 rd + 11,340,231 wr) ==440== LL misses: 232,663,735 ( 221,400,139 rd + 11,263,596 wr) ==440== LL miss rate: 0.8% ( 0.8% + 0.5% ) ``` ::: ### 結果分析雖然時間差不多，但可以看出 cache misses 下降了不少。 ## merge 改進在參考 [`chenzenyan` 的 `m_sorted`](https://hackmd.io/@chenzenyang0905/r1cyAFlzs) 之後，我發現 `merge` 其實也可以更漂亮，於是我開始著手改進。 ```c element_t *better_merge(element_t *a, element_t *b) { element_t *ret, **next = &ret; //一開始，下一個要修改的是 ret while (a && b) { if (strcmp(a->value, b->value) <= 0) { *next = a; a = a->next; } else { *next = b; b = b->next; } next = &(*next)->next; //紀錄下一個要修改的位置 } *next = a ? a : b;//接上剩下的 return ret; } ``` `better_merge` 利用指標的指標 `next` 去存 (要修改的指標) 的指標，不論是要修改回傳的頭 `ret`、`a->next`、`b->next`，都只要用 `next = 指標的指標`。若要對其儲存的值 (指標) 修改，只要 `*next = 指標` 就好。以下是修改後的 `merge_sort`。 ```c bool can_merge(size_t *sizes, size_t stack_size) { return stack_size >= 2 && sizes[stack_size - 2] == sizes[stack_size - 1]; } void merge_sort(element_t **qhead) { if (!(*qhead) || !(*qhead)->next) return; //initialize stack size_t stack_size = 0, *sizes, qsize = 0; element_t *head = *qhead, *temp = head, **stack; while (temp) { ++qsize; temp = temp->next; } stack = malloc((lg(qsize) + 1) * sizeof(element_t*)); sizes = malloc((lg(qsize) + 1) * sizeof(size_t)); //input and merge while (head) { //extract one element temp = head; head = head->next; temp->next = NULL; //input one element stack[stack_size] = temp; sizes[stack_size] = 1; ++stack_size; //merge until can't merge anymore while (can_merge(sizes, stack_size)) { sizes[stack_size - 2] += sizes[stack_size - 1]; stack[stack_size - 2] = better_merge(stack[stack_size - 2], stack[stack_size - 1]); --stack_size; } } //merge until remained one element while (stack_size >= 2) { sizes[stack_size - 2] += sizes[stack_size - 1]; stack[stack_size - 2] = better_merge(stack[stack_size - 2], stack[stack_size - 1]); --stack_size; } *qhead=stack[0]; free(stack); free(sizes); } ``` :::spoiler time ```shell $ time ./test 10000000 Start! Finish! real 0m16.364s user 0m15.564s sys 0m0.342s ``` ::: :::spoiler cachegrind ```shell $ valgrind --tool=cachegrind ./test 10000000 ==395== Cachegrind, a cache and branch-prediction profiler ==395== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al. ==395== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info ==395== Command: ./test 10000000 ==395== --395-- warning: L3 cache found, using its data for the LL simulation. ==395== brk segment overflow in thread #1: can't grow to 0x4a4b000 ==395== (see section Limitations in user manual) ==395== NOTE: further instances of this message will not be shown Start! Finish! ==395== ==395== I refs: 20,690,199,519 ==395== I1 misses: 1,263 ==395== LLi misses: 1,251 ==395== I1 miss rate: 0.00% ==395== LLi miss rate: 0.00% ==395== ==395== D refs: 6,579,894,694 (5,005,315,944 rd + 1,574,578,750 wr) ==395== D1 misses: 382,864,914 ( 371,509,691 rd + 11,355,223 wr) ==395== LLd misses: 252,849,266 ( 241,585,684 rd + 11,263,582 wr) ==395== D1 miss rate: 5.8% ( 7.4% + 0.7% ) ==395== LLd miss rate: 3.8% ( 4.8% + 0.7% ) ==395== ==395== LL refs: 382,866,177 ( 371,510,954 rd + 11,355,223 wr) ==395== LL misses: 252,850,517 ( 241,586,935 rd + 11,263,582 wr) ==395== LL miss rate: 0.9% ( 0.9% + 0.7% ) ``` ::: :::spoiler leak-check ```shell $ valgrind --leak-check=full ./test ==418== Memcheck, a memory error detector ==418== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. ==418== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info ==418== Command: ./test ==418== Start! Finish! ==418== ==418== HEAP SUMMARY: ==418== in use at exit: 0 bytes in 0 blocks ==418== total heap usage: 187,660 allocs, 187,660 frees, 3,870,463 bytes allocated ==418== ==418== All heap blocks were freed -- no leaks are possible ==418== ==418== For counts of detected and suppressed errors, rerun with: -v ==418== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) ``` ::: 改進之後 LLd cache misses 上升了不少。雖然 refs 是下降的，但這結果明顯變差。以上的程式碼每次要對堆疊做操作，都會要跨越 `stack`、`sizes` 兩個陣列。於是我打算用 `list_t` 將 `element_t`、`size_t` 包起來，讓存取堆疊的跨度可以縮小。 ### 記憶體重新排列 ```c typedef struct list { size_t size; element_t *head; } list_t; ``` 理論上修改後應該要更加 cache friendly。 ```c void merge_sort(element_t **qhead) { if (!(*qhead) || !(*qhead)->next) return; //initialize stack size_t stack_size = 0, qsize = 0; element_t *head = *qhead, *temp = head; list_t *stack; while (temp) { ++qsize; temp = temp->next; } stack = malloc((lg(qsize) + 1) * sizeof(list_t)); //input and merge while (head) { //extract one element temp = head; head = head->next; temp->next = NULL; //input one element stack[stack_size].head = temp; stack[stack_size].size = 1; ++stack_size; //merge until can't merge anymore while (can_merge(stack, stack_size)) { stack[stack_size - 2].size += stack[stack_size - 1].size; stack[stack_size - 2].head = better_merge(stack[stack_size - 2].head, stack[stack_size - 1].head); --stack_size; } } //merge until remained one element while (stack_size >= 2) { stack[stack_size - 2].size += stack[stack_size - 1].size; stack[stack_size - 2].head = better_merge(stack[stack_size - 2].head, stack[stack_size - 1].head); --stack_size; } *qhead = stack[0].head; free(stack); } ``` :::spoiler time ```shell $ time ./test 10000000 Start! Finish! real 0m15.957s user 0m15.201s sys 0m0.271s ``` ::: :::spoiler cachegrind ```shell $ valgrind --tool=cachegrind ./test 10000000 ==485== Cachegrind, a cache and branch-prediction profiler ==485== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al. ==485== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info ==485== Command: ./test 10000000 ==485== --485-- warning: L3 cache found, using its data for the LL simulation. ==485== brk segment overflow in thread #1: can't grow to 0x4a4b000 ==485== (see section Limitations in user manual) ==485== NOTE: further instances of this message will not be shown Start! Finish! ==485== ==485== I refs: 20,720,480,669 ==485== I1 misses: 1,261 ==485== LLi misses: 1,249 ==485== I1 miss rate: 0.00% ==485== LLi miss rate: 0.00% ==485== ==485== D refs: 6,579,894,595 (5,005,315,887 rd + 1,574,578,708 wr) ==485== D1 misses: 382,792,170 ( 371,438,588 rd + 11,353,582 wr) ==485== LLd misses: 252,849,013 ( 241,585,434 rd + 11,263,579 wr) ==485== D1 miss rate: 5.8% ( 7.4% + 0.7% ) ==485== LLd miss rate: 3.8% ( 4.8% + 0.7% ) ==485== ==485== LL refs: 382,793,431 ( 371,439,849 rd + 11,353,582 wr) ==485== LL misses: 252,850,262 ( 241,586,683 rd + 11,263,579 wr) ==485== LL miss rate: 0.9% ( 0.9% + 0.7% ) ``` ::: ### 結果時間變快，其實結果變化都不大，畢竟 `lg(10000000)` 大概等於 23，陣列的大小應該還是在同一個 page，結果差異大部分是誤差。