# facebooc-q0
###### tags: `facebooc 2022` `linux`
contributed by <discord: `pau123`, github: [`paul90317`](https://github.com/paul90317)>
[共筆](https://hackmd.io/KgutBL34SBmIjZJIyk4N-Q?view)
[GitHub](https://github.com/paul90317/facebooc-q0)
使用的排版命令
```shell
$ astyle --style=linux --suffix=none *.[ch] -H -p -U -f
```
要留言請用 tag 的方式放到 **整份共筆** 的最底下,例如
:::success
[pau1234 筆記](#pau1234-筆記)
這筆記很實用
:notes: pau1234
:::
**禁止人身攻擊**
## pau1234 筆記
> 歡迎參考或修改
[astyle 使用教學](https://hackmd.io/@paul90317/astyle)
[vim 使用教學](https://hackmd.io/@paul90317/vim)
[gdb 使用教學](https://hackmd.io/@paul90317/gdb)
[time 使用教學](https://hackmd.io/@paul90317/time)
## Queue 實作
### q_new
初始化 queue
```c
/*
* Create empty queue.
* Return NULL if could not allocate space.
*/
queue_t *q_new()
{
queue_t *ret = malloc(sizeof(queue_t));
ret->head = NULL;
ret->tail = NULL;
ret->size = 0;
return ret;
}
```
### q_free
清除 queue 記憶體
```c
/* Free all storage used by queue */
void q_free(queue_t *q)
{
element_t *now = q->head, *next;
while (now) {
next = now->next;
free(now->value);
free(now);
now = next;
}
free(q);
}
```
### q_insert_head
將一個 string 利用 `strdup` 複製一份,加入 queue 的起始點。
```c
/*
* Attempt to insert element at head of queue.
* Return true if successful.
* Return false if q is NULL or could not allocate space.
* Argument s points to the string to be stored.
* The function must explicitly allocate space and copy the string into it.
*/
bool q_insert_head(queue_t *q, char *s)
{
if (!q)
return false;
element_t *t = malloc(sizeof(element_t));
if (!t)
return false;
t->value = strdup(s);
if (!t->value) {
+ free(t);
return false;
}
t->next = q->head;
if (!q->head)
q->tail = t;
q->head = t;
++q->size;
return true;
}
```
:::warning
之前忘記考慮 `malloc`、`strdup` 失敗的情況,根據 [strdup - cplusplus](https://en.cppreference.com/w/c/experimental/dynamic/strdup)
> ...
> If an error occurs, a null pointer is returned and errno may be set.
> ...
若失敗 (記憶體不足),回傳空指針。
:::
### q_insert_tail
將一個 string 利用 `strdup` 複製一份,加入 queue 的尾巴。
```c
/*
* Attempt to insert element at tail of queue.
* Return true if successful.
* Return false if q is NULL or could not allocate space.
* Argument s points to the string to be stored.
* The function must explicitly allocate space and copy the string into it.
*/
bool q_insert_tail(queue_t *q, char *s)
{
if (!q)
return false;
element_t *t = malloc(sizeof(element_t));
if (!t)
return false;
t->value = strdup(s);
if (!t->value)
return false;
t->next = NULL;
if (q->tail) {
q->tail->next = t;
q->tail = t;
} else {
q->tail = t;
q->head = t;
}
++q->size;
return true;
}
```
### q_remove_head
移除頭元素
```c
/*
* Attempt to remove element from head of queue.
* Return true if successful.
* Return false if queue is NULL or empty.
* If sp is non-NULL and an element is removed, copy the removed string to *sp
* (up to a maximum of bufsize-1 characters, plus a null terminator.)
* The space used by the list element and the string should be freed.
*/
bool q_remove_head(queue_t *q, char *sp, size_t bufsize)
{
if (!q || !q->head)
return false;
element_t *t = q->head;
if (sp){
strncpy(sp, t->value, bufsize - 1);
+ sp[bufsize-1] = 0;
}
free(t->value);
q->head = t->next;
free(t);
--q->size;
return true;
}
```
### q_size
回傳大小
```c
/*
* Return number of elements in queue.
* Return 0 if q is NULL or empty
*/
size_t q_size(queue_t *q)
{
return q ? q->size : 0;
}
```
### q_reverse
利用 stack `stk`,翻轉整個 `queue`。
練習題 [206. Reverse Linked List - LeetCode](https://leetcode.com/problems/reverse-linked-list/)
```c
/*
* Reverse elements in queue
* No effect if q is NULL or empty
* This function should not allocate or free any list elements
* (e.g., by calling q_insert_head, q_insert_tail, or q_remove_head).
* It should rearrange the existing ones.
*/
void q_reverse(queue_t *q)
{
element_t *stk = NULL, *next, *head = q->head;
q->tail = head;
while (head) {
next = head->next;
head->next = stk;
stk = head;
head = next;
}
q->head = stk;
}
```
## Merge Sort 實作
### q_insert_ele_tail
該函數直接整合了把 `t` 的頭元素拿出,放入 `q` 的尾巴的過程。
```c
//by paul90317
/* Insert the head of t to tail of q without malloc and free
* Return next of t
*/
element_t *q_insert_ele_tail(queue_t *q, element_t *t)
{
element_t *ret = t->next;
t->next = NULL;
if (!q->size) {
q->head = t;
q->tail = t;
} else {
q->tail->next = t;
q->tail = t;
}
++q->size;
return ret;
}
```
### l_mid_last
回傳 list `h` 中間的前一個,因為在 divde 實要把中間切斷,變成兩條 list。而右邊的 list 開頭一定要是中間元素,所以才需要回傳前一個。
```c
//by paul90317
/* Return last element of mid element of h
* Return NULL if t only have 1 or 0 element.
*/
element_t *l_mid_last(element_t *h)
{
if (!h || !h->next)
return NULL;
element_t *t = h;
h = h->next->next;
while (h && h->next) {
t = t->next;
h = h->next->next;
}
return t;
}
```
### merge
合併兩個已經排序的 list。
練習題 [21. Merge Two Sorted Lists - LeetCode](https://leetcode.com/problems/merge-two-sorted-lists/)
```c
//by paul90317
/* Merge two sorted list a, b to dst
*/
void merge(queue_t *dst, element_t *a, element_t *b)
{
memset(dst, 0, sizeof(queue_t));
while (a && b) {
if (strcmp(a->value, b->value) <= 0) {
a = q_insert_ele_tail(dst, a);
} else {
b = q_insert_ele_tail(dst, b);
}
}
while (a) {
a = q_insert_ele_tail(dst, a);
}
while (b) {
b = q_insert_ele_tail(dst, b);
}
}
```
### merge sort
遞迴 function,先將 list 切成兩份,對兩份分別呼叫自己做排序,最後再用 `merge` 做合併。
```c
/*
* The function's sorting algorithm should be merge sort.
*/
void merge_sort(element_t **head)
{
if (!(*head) || !(*head)->next)
return;
element_t *hh = *head;
element_t *half = l_mid_last(hh);
element_t *temp = half;
half = half->next;
temp->next = NULL;
merge_sort(&hh);
merge_sort(&half);
queue_t newq = {0};
merge(&newq, hh, half);
*head = newq.head;
}
```
:::danger
`l_mid_last()` 若是沒有 `h = h->next->next`,將會出現 **Segmentation fault**,`gdb` 顯示如下
```shell
Program received signal SIGSEGV, Segmentation fault.
merge_sort (head=head@entry=0x7fffff7ff030) at queue.c:245
245 element_t* hh = *head;
```
你應該很好奇,為什麼錯誤會發生在這一行,`head` 到這一行是一個 `malloc` 的空間,存取應該要是有效的。
那原因只會是 **Stack overflow**,也就是原先挖的 `head`,讀取時會超過界限。
發生原因是因為斷點太靠後,例如 `2` 變成 `2 0`, `3` 變成 `3 0`,形成無窮遞迴。
**Stack overflow 是 Segmentation fault 的一種**
[`Yu-Wei-Chang` 的筆記 - 已知問題](https://hackmd.io/@Yu-Wei-Chang/HJ5lkM-U8#%E5%B7%B2%E7%9F%A5%E5%95%8F%E9%A1%8C)
:::
<!--
:::danger
請改進以下幾點:
1. 請排碼排好。可以使用以下工具,幫你自動格式化。
```
$ sudo apt-get install -y astyle
$ astyle --style=linux --suffix=none XXXX.[ch]
```
2. 不要出現無意義的文字。
3. 不要把整段程式碼一次複製貼上,然後不做任何描述或解釋。
:notes: Astus
:::
-->
## 設計實驗
:::spoiler 環境
```shell
$ lscpu
BogoMIPS: 4799.99
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
Flags: ...
```
:::
為了完成實驗,我修改了 `test.c`,讓他可以針對不同數據大小去做測試,方法是複製原本的數據。
:::spoiler 修改後的 `test.c`
```clike
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "queue.h"
/*static void q_show(queue_t *q)
{
for (element_t *e = q->head; e; e = e->next)
printf("%s", e->value);
}*/
static bool validate(queue_t *q)
{
size_t size = q->size;
q_sort(q);
size_t cnt = 1;
for (element_t *e = q->head; e->next; e = e->next) {
++cnt;
if (strcmp(e->value, e->next->value) > 0)
return false;
}
if (cnt != size)
return false;
q_reverse(q);
cnt = 1;
for (element_t *e = q->head; e->next; e = e->next) {
++cnt;
if (strcmp(e->value, e->next->value) < 0)
return false;
}
if (cnt != size)
return false;
return true;
}
int main(int argc, char**argv)
{
queue_t *q = q_new();
char buf[256];
FILE *fp = fopen("cities.txt", "r");
if (!fp) {
perror("failed to open cities.txt");
exit(EXIT_FAILURE);
}
while (fgets(buf, 256, fp))
q_insert_head(q, buf);
fclose(fp);
element_t *curr = q->head;
size_t test_size = argc >= 2 ? strtoull(argv[1], NULL, 10) : 0;
for (size_t i = 0; i < test_size; curr = curr->next, ++i) {
q_insert_tail(q, curr->value);
}
puts("Start!");
assert(validate(q));
puts("Finish!");
q_free(q);
return 0;
}
```
```shell
$ time ./test [額外資料大小]
```
:::
## `strdup` 有無測試
考慮以下兩種 `q_insert_ele_tail` 的實作方式,進行效能與快取測試。
### 實作一
```c
element_t * q_insert_ele_tail(queue_t *q, element_t *t)
{
element_t *ret = t->next;
q_insert_tail(q, t->value);
free(t->value);
free(t);
return ret;
}
```
程式簡單,但因為 `q_insert_tail` 會 **複製記憶體** 以及 **記憶體內容**,所以外面的記憶體需要 `t` 與 `t->value` 都需要 `free`,增加兩個多餘的記憶體配置操作,以下是結果。
:::spoiler cachegrind 一般大小
```shell
$ valgrind --tool=cachegrind ./test
==574== Cachegrind, a cache and branch-prediction profiler
==574== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==574== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==574== Command: ./test
==574==
--574-- warning: L3 cache found, using its data for the LL simulation.
Start!
Finish!
==574==
==574== I refs: 675,750,046
==574== I1 misses: 1,221
==574== LLi misses: 1,193
==574== I1 miss rate: 0.00%
==574== LLi miss rate: 0.00%
==574==
==574== D refs: 271,443,253 (176,990,984 rd + 94,452,269 wr)
==574== D1 misses: 3,934,493 ( 3,804,416 rd + 130,077 wr)
==574== LLd misses: 107,314 ( 2,057 rd + 105,257 wr)
==574== D1 miss rate: 1.4% ( 2.1% + 0.1% )
==574== LLd miss rate: 0.0% ( 0.0% + 0.1% )
==574==
==574== LL refs: 3,935,714 ( 3,805,637 rd + 130,077 wr)
==574== LL misses: 108,507 ( 3,250 rd + 105,257 wr)
==574== LL miss rate: 0.0% ( 0.0% + 0.1% )
```
:::
:::spoiler time
```shell
$ time ./test 10000000
Start!
Finish!
real 0m35.136s
user 0m34.418s
sys 0m0.261s
```
:::
### 實作二
```c
element_t *q_insert_ele_tail(queue_t *q, element_t *t)
{
element_t *ret = t->next;
t->next = NULL;
if (!q->size) {
q->head = t;
q->tail = t;
} else {
q->tail->next = t;
q->tail = t;
}
++q->size;
return ret;
}
```
這是不需要額外 `strdup`、`free` 的實作,`valgrind` 結果如下。
:::spoiler cachegrind 1千萬大小
```shell
$ valgrind --tool=cachegrind ./test 10000000
==535== Cachegrind, a cache and branch-prediction profiler
==535== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==535== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==535== Command: ./test 10000000
==535==
--535-- warning: L3 cache found, using its data for the LL simulation.
==535== brk segment overflow in thread #1: can't grow to 0x4a4b000
==535== (see section Limitations in user manual)
==535== NOTE: further instances of this message will not be shown
Start!
Finish!
==535==
==535== I refs: 23,267,018,490
==535== I1 misses: 1,261
==535== LLi misses: 1,249
==535== I1 miss rate: 0.00%
==535== LLi miss rate: 0.00%
==535==
==535== D refs: 8,371,756,742 (5,993,606,568 rd + 2,378,150,174 wr)
==535== D1 misses: 577,302,877 ( 565,694,399 rd + 11,608,478 wr)
==535== LLd misses: 326,976,374 ( 315,710,868 rd + 11,265,506 wr)
==535== D1 miss rate: 6.9% ( 9.4% + 0.5% )
==535== LLd miss rate: 3.9% ( 5.3% + 0.5% )
==535==
==535== LL refs: 577,304,138 ( 565,695,660 rd + 11,608,478 wr)
==535== LL misses: 326,977,623 ( 315,712,117 rd + 11,265,506 wr)
==535== LL miss rate: 1.0% ( 1.1% + 0.5% )
```
:::
:::warning
```shell
==535== brk segment overflow in thread #1: can't grow to 0x4a4b000
==535== (see section Limitations in user manual)
==535== NOTE: further instances of this message will not be shown
```
https://stackoverflow.com/questions/35129135/valgrind-reporting-a-segment-overflow
該訊息是因為 `queue` 設太大 (1千萬),記憶體不夠,需要用 `mmap()` 這樣的函數去模擬,詳細點網址。
:::
:::spoiler cachegrind 一般大小
```shell
$ valgrind --tool=cachegrind ./test
==582== Cachegrind, a cache and branch-prediction profiler
==582== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==582== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==582== Command: ./test
==582==
--582-- warning: L3 cache found, using its data for the LL simulation.
Start!
Finish!
==582==
==582== I refs: 173,273,997
==582== I1 misses: 1,219
==582== LLi misses: 1,191
==582== I1 miss rate: 0.00%
==582== LLi miss rate: 0.00%
==582==
==582== D refs: 65,105,910 (43,871,317 rd + 21,234,593 wr)
==582== D1 misses: 2,949,205 ( 2,841,039 rd + 108,166 wr)
==582== LLd misses: 107,310 ( 2,057 rd + 105,253 wr)
==582== D1 miss rate: 4.5% ( 6.5% + 0.5% )
==582== LLd miss rate: 0.2% ( 0.0% + 0.5% )
==582==
==582== LL refs: 2,950,424 ( 2,842,258 rd + 108,166 wr)
==582== LL misses: 108,501 ( 3,248 rd + 105,253 wr)
==582== LL miss rate: 0.0% ( 0.0% + 0.5% )
```
:::
:::spoiler time
```shell
$ time ./test 10000000
Start!
Finish!
real 0m16.724s
user 0m15.837s
sys 0m0.381s
```
:::
### 分析比較
要分析,要先了解 cache,首先 cache 分很多層 (L1, L2, L3),數字越大離 CPU 越遠,大小越大,memory stall 也會越高。而 cachegrind 顯示的 LLi 與 LLd 是 instruction 和 data 的 cache,會比 LL 更接近 CPU 一點。
<!--
我讀了 [`bakudr18` 的筆記](https://hackmd.io/@bakudr18/BkvLMLkeq#%E4%BA%8B%E5%89%8D%E6%BA%96%E5%82%99) 之後,明白 cachegrind 的資訊到底該怎麼讀,重點是看 LLd misses (rd) 以及 DLmr,但是我的 cachegrind 應該是因為用 wsl,所以找不到 DLmr。LLd misses (rd) 白話來說是讀取 last level data cache 的 miss 次數。
-->
我們可以看出 LLd misses (rd) 沒有區別,為什麼 ?
`strdup` 會增加 reference 次數,因為 `strdup` 需要 **複製記憶體內容**,但是字串的記憶體是連續且短的,才導致 miss 次數沒有顯著增加。
可以看到 實作一 不論哪一個 cache 的 reference time 都高了不少,而最後時間的部分是 實作二 明顯比較短。
## Iteration v.s. Recursive
該實驗參考自 [`bakudr18` 的筆記 - merge sort list 的結論](https://hackmd.io/@bakudr18/BkvLMLkeq#merge-sort-list-%E7%B5%90%E8%AB%96)
> ...
> 比較兩個方法的 q_sort 實作,divide and conquer 比起分段合併少走訪 nlog(n) 個 node,而且更 cache friendly。
> ...
於是我想設計一個實驗,讓 iteration 的 merge sort 不用分段合併的順序去合併,而是使用 cache friendly iteration 的合併方式。看一下是否會比 cache friendly recursive 的 merge sort 還要快。
### 分段合併 example
```shell=
;一開始有 8 條 sorted list,長度皆為 1
1 1 1 1 1 1 1 1
;前兩個 1 合併
2 1 1 1 1 1 1
;前兩個 1 合併
2 2 1 1 1 1
;前兩個 1 合併
2 2 2 1 1
;前兩個 1 合併
2 2 2 2
;沒有 1,前兩個 2 合併
4 2 2
;前兩個 2 合併
4 4
;沒有 2,前兩個 4 合併
8
;最後得到長度為 8 的 sorted list
```
### cache friendly iteration
cache friendly 的實作方式就是將 8 條 sorted list 輸入到 stack 中,過程中只要 stack 最上面兩個 sorted list 長度相同,就合併。
stack|input|註解
--|--|--
(empty)|1 1 1 1 1 1 1 1|一開始有 8 條 sorted list
1|1 1 1 1 1 1 1|輸入 1
1 1|1 1 1 1 1 1|輸入 1
2|1 1 1 1 1 1|合併
2 1|1 1 1 1 1|輸入 1
2 1 1|1 1 1 1|輸入 1
2 2|1 1 1 1|合併
4|1 1 1 1|合併
4 1|1 1 1|輸入 1
4 1 1|1 1|輸入 1
4 2|1 1|合併
4 2 1|1|輸入 1
4 2 1 1|(empty)|輸入 1
4 2 2|(empty)|合併
4 4|(empty)|合併
8|(empty)|合併
可以看到整個過程都是 cache friendly,這個演算法時作時需要有 stack 這樣一個額外空間,所需的複雜度是 $O(lgn)$,這點跟 merge sort 的 function stack 是一樣的。
### code
```c
//實作簡單的 log2
size_t lg(int x)
{
size_t cnt = -1;
while (x) {
++cnt;
x >>= 1;
}
return cnt;
}
```
```c
//確認可不可以合併
bool can_merge(queue_t *stack, size_t stack_size)
{
return stack_size >= 2 && stack[stack_size - 2].size == stack[stack_size - 1].size;
}
```
```c
//演算法
void q_sort(queue_t *q)
{
if (!q || q->size <= 1)
return;
//initialize stack
queue_t *stack = malloc((lg(q->size) + 1) * sizeof(queue_t));
size_t stack_size = 0;
element_t *head = q->head, *temp;
while (head) {
//extract one element
temp = head;
head = head->next;
temp->next = NULL;
//input one element
stack[stack_size].head = temp;
stack[stack_size].tail = temp;
stack[stack_size].size = 1;
++stack_size;
//merge until can't merge anymore
while (can_merge(stack, stack_size)) {
int sz = stack[stack_size - 2].size + stack[stack_size - 1].size;
merge(stack + stack_size - 2, stack[stack_size - 2].head, stack[stack_size - 1].head);
--stack_size;
stack[stack_size - 1].size = sz;
}
}
//merge until remained one element
while (stack_size >= 2) {
int sz = stack[stack_size - 2].size + stack[stack_size - 1].size;
merge(stack + stack_size - 2, stack[stack_size - 2].head, stack[stack_size - 1].head);
--stack_size;
stack[stack_size - 1].size = sz;
}
memcpy(q, stack, sizeof(queue_t));
free(stack);
}
```
:::spoiler time
```shell
$ time ./test 10000000
Start!
Finish!
real 0m15.311s
user 0m14.534s
sys 0m0.272s
```
:::
:::spoiler cachegrind
```shell
$ valgrind --tool=cachegrind ./test 10000000
==440== Cachegrind, a cache and branch-prediction profiler
==440== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==440== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==440== Command: ./test 10000000
==440==
--440-- warning: L3 cache found, using its data for the LL simulation.
==440== brk segment overflow in thread #1: can't grow to 0x4a4b000
==440== (see section Limitations in user manual)
==440== NOTE: further instances of this message will not be shown
Start!
Finish!
==440==
==440== I refs: 22,829,542,611
==440== I1 misses: 1,262
==440== LLi misses: 1,250
==440== I1 miss rate: 0.00%
==440== LLi miss rate: 0.00%
==440==
==440== D refs: 7,999,995,859 (5,676,338,287 rd + 2,323,657,572 wr)
==440== D1 misses: 362,632,224 ( 351,291,993 rd + 11,340,231 wr)
==440== LLd misses: 232,662,485 ( 221,398,889 rd + 11,263,596 wr)
==440== D1 miss rate: 4.5% ( 6.2% + 0.5% )
==440== LLd miss rate: 2.9% ( 3.9% + 0.5% )
==440==
==440== LL refs: 362,633,486 ( 351,293,255 rd + 11,340,231 wr)
==440== LL misses: 232,663,735 ( 221,400,139 rd + 11,263,596 wr)
==440== LL miss rate: 0.8% ( 0.8% + 0.5% )
```
:::
### 結果分析
雖然時間差不多,但可以看出 cache misses 下降了不少。
## merge 改進
在參考 [`chenzenyan` 的 `m_sorted`](https://hackmd.io/@chenzenyang0905/r1cyAFlzs) 之後,我發現 `merge` 其實也可以更漂亮,於是我開始著手改進。
```c
element_t *better_merge(element_t *a, element_t *b)
{
element_t *ret, **next = &ret; //一開始,下一個要修改的是 ret
while (a && b) {
if (strcmp(a->value, b->value) <= 0) {
*next = a;
a = a->next;
} else {
*next = b;
b = b->next;
}
next = &(*next)->next; //紀錄下一個要修改的位置
}
*next = a ? a : b;//接上剩下的
return ret;
}
```
`better_merge` 利用指標的指標 `next` 去存 (要修改的指標) 的指標,不論是要修改回傳的頭 `ret`、`a->next`、`b->next`,都只要用 `next = 指標的指標`。若要對其儲存的值 (指標) 修改,只要 `*next = 指標` 就好。
以下是修改後的 `merge_sort`。
```c
bool can_merge(size_t *sizes, size_t stack_size)
{
return stack_size >= 2 && sizes[stack_size - 2] == sizes[stack_size - 1];
}
void merge_sort(element_t **qhead)
{
if (!(*qhead) || !(*qhead)->next)
return;
//initialize stack
size_t stack_size = 0, *sizes, qsize = 0;
element_t *head = *qhead, *temp = head, **stack;
while (temp) {
++qsize;
temp = temp->next;
}
stack = malloc((lg(qsize) + 1) * sizeof(element_t*));
sizes = malloc((lg(qsize) + 1) * sizeof(size_t));
//input and merge
while (head) {
//extract one element
temp = head;
head = head->next;
temp->next = NULL;
//input one element
stack[stack_size] = temp;
sizes[stack_size] = 1;
++stack_size;
//merge until can't merge anymore
while (can_merge(sizes, stack_size)) {
sizes[stack_size - 2] += sizes[stack_size - 1];
stack[stack_size - 2] = better_merge(stack[stack_size - 2], stack[stack_size - 1]);
--stack_size;
}
}
//merge until remained one element
while (stack_size >= 2) {
sizes[stack_size - 2] += sizes[stack_size - 1];
stack[stack_size - 2] = better_merge(stack[stack_size - 2], stack[stack_size - 1]);
--stack_size;
}
*qhead=stack[0];
free(stack);
free(sizes);
}
```
:::spoiler time
```shell
$ time ./test 10000000
Start!
Finish!
real 0m16.364s
user 0m15.564s
sys 0m0.342s
```
:::
:::spoiler cachegrind
```shell
$ valgrind --tool=cachegrind ./test 10000000
==395== Cachegrind, a cache and branch-prediction profiler
==395== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==395== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==395== Command: ./test 10000000
==395==
--395-- warning: L3 cache found, using its data for the LL simulation.
==395== brk segment overflow in thread #1: can't grow to 0x4a4b000
==395== (see section Limitations in user manual)
==395== NOTE: further instances of this message will not be shown
Start!
Finish!
==395==
==395== I refs: 20,690,199,519
==395== I1 misses: 1,263
==395== LLi misses: 1,251
==395== I1 miss rate: 0.00%
==395== LLi miss rate: 0.00%
==395==
==395== D refs: 6,579,894,694 (5,005,315,944 rd + 1,574,578,750 wr)
==395== D1 misses: 382,864,914 ( 371,509,691 rd + 11,355,223 wr)
==395== LLd misses: 252,849,266 ( 241,585,684 rd + 11,263,582 wr)
==395== D1 miss rate: 5.8% ( 7.4% + 0.7% )
==395== LLd miss rate: 3.8% ( 4.8% + 0.7% )
==395==
==395== LL refs: 382,866,177 ( 371,510,954 rd + 11,355,223 wr)
==395== LL misses: 252,850,517 ( 241,586,935 rd + 11,263,582 wr)
==395== LL miss rate: 0.9% ( 0.9% + 0.7% )
```
:::
:::spoiler leak-check
```shell
$ valgrind --leak-check=full ./test
==418== Memcheck, a memory error detector
==418== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==418== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==418== Command: ./test
==418==
Start!
Finish!
==418==
==418== HEAP SUMMARY:
==418== in use at exit: 0 bytes in 0 blocks
==418== total heap usage: 187,660 allocs, 187,660 frees, 3,870,463 bytes allocated
==418==
==418== All heap blocks were freed -- no leaks are possible
==418==
==418== For counts of detected and suppressed errors, rerun with: -v
==418== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
```
:::
改進之後 LLd cache misses 上升了不少。雖然 refs 是下降的,但這結果明顯變差。
以上的程式碼每次要對堆疊做操作,都會要跨越 `stack`、`sizes` 兩個陣列。
於是我打算用 `list_t` 將 `element_t`、`size_t` 包起來,讓存取堆疊的跨度可以縮小。
### 記憶體重新排列
```c
typedef struct list {
size_t size;
element_t *head;
} list_t;
```
理論上修改後應該要更加 cache friendly。
```c
void merge_sort(element_t **qhead)
{
if (!(*qhead) || !(*qhead)->next)
return;
//initialize stack
size_t stack_size = 0, qsize = 0;
element_t *head = *qhead, *temp = head;
list_t *stack;
while (temp) {
++qsize;
temp = temp->next;
}
stack = malloc((lg(qsize) + 1) * sizeof(list_t));
//input and merge
while (head) {
//extract one element
temp = head;
head = head->next;
temp->next = NULL;
//input one element
stack[stack_size].head = temp;
stack[stack_size].size = 1;
++stack_size;
//merge until can't merge anymore
while (can_merge(stack, stack_size)) {
stack[stack_size - 2].size += stack[stack_size - 1].size;
stack[stack_size - 2].head = better_merge(stack[stack_size - 2].head, stack[stack_size - 1].head);
--stack_size;
}
}
//merge until remained one element
while (stack_size >= 2) {
stack[stack_size - 2].size += stack[stack_size - 1].size;
stack[stack_size - 2].head = better_merge(stack[stack_size - 2].head, stack[stack_size - 1].head);
--stack_size;
}
*qhead = stack[0].head;
free(stack);
}
```
:::spoiler time
```shell
$ time ./test 10000000
Start!
Finish!
real 0m15.957s
user 0m15.201s
sys 0m0.271s
```
:::
:::spoiler cachegrind
```shell
$ valgrind --tool=cachegrind ./test 10000000
==485== Cachegrind, a cache and branch-prediction profiler
==485== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==485== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==485== Command: ./test 10000000
==485==
--485-- warning: L3 cache found, using its data for the LL simulation.
==485== brk segment overflow in thread #1: can't grow to 0x4a4b000
==485== (see section Limitations in user manual)
==485== NOTE: further instances of this message will not be shown
Start!
Finish!
==485==
==485== I refs: 20,720,480,669
==485== I1 misses: 1,261
==485== LLi misses: 1,249
==485== I1 miss rate: 0.00%
==485== LLi miss rate: 0.00%
==485==
==485== D refs: 6,579,894,595 (5,005,315,887 rd + 1,574,578,708 wr)
==485== D1 misses: 382,792,170 ( 371,438,588 rd + 11,353,582 wr)
==485== LLd misses: 252,849,013 ( 241,585,434 rd + 11,263,579 wr)
==485== D1 miss rate: 5.8% ( 7.4% + 0.7% )
==485== LLd miss rate: 3.8% ( 4.8% + 0.7% )
==485==
==485== LL refs: 382,793,431 ( 371,439,849 rd + 11,353,582 wr)
==485== LL misses: 252,850,262 ( 241,586,683 rd + 11,263,579 wr)
==485== LL miss rate: 0.9% ( 0.9% + 0.7% )
```
:::
### 結果
時間變快,其實結果變化都不大,畢竟 `lg(10000000)` 大概等於 23,陣列的大小應該還是在同一個 page,結果差異大部分是誤差。