2023q1 Homework1 (lab0)

contributed by < hongpei100 >

實驗環境

$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i5-8300H CPU @ 2.30GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            10
    CPU max MHz:         4000.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4599.93

在終端機執行 export LC_ALL=C，再執行上述命令。lscpu 套件目前的中文翻譯品質不佳。
:notes: jserv

:bulb: 已使用上述指令命令，將 lscpu 命令的語言從中文敘述轉成英文敘述
注意資訊科技詞彙翻譯，command 的翻譯是「命令」。

Lab0 作業要求

作業要求

開發流程紀錄

了解 Linux 內部資料結構

透過閱讀你所不知道的 C 語言: linked list 和非連續記憶體 :

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

透過上面兩張圖可以發現，資料結構為 Circular doubly linked list，並且有以下觀察 :

List 的 head 不存資料，而是透過 next 指向第一個有存資料的節點
當一個 list 為 empty 狀態，它的 head->next 會等於 head

並且可透過 Linux 提供的巨集，來簡化程式碼的維護。

針對佇列的操作修改程式碼

q_new

透過 malloc 函式，分配一塊大小為 struct list_head 的記憶體給變數 q
確認佇列 q 有被成功分配到記憶體，並透過 INIT_LIST_HEAD 對佇列做初始化

:bulb: INIT_LIST_HEAD 會將 head 的 prev 與 next 指標指向 head

/* Create an empty queue */
struct list_head *q_new()
{
    struct list_head *q = malloc(sizeof(struct list_head));

    if (!q)
        return NULL;

    INIT_LIST_HEAD(q);
    return q;
}

注意程式碼風格規範
malloc 不需要額外的轉型，亦即 (struct list_head *) malloc(...) 非必要

:notes: jserv

已透過命令 clang-format -i queue.c 來排版，符合程式碼風格規範
已將 malloc 轉型移除

q_free

在實作清除佇列記憶體時，回想起教材內部有針對 linked list 的 NULL 與 empty 狀態做討論，並且為了盡量減少分支:

linked list 為 NULL 或者是 empty，會直接回傳
若 linked list 有帶資料的節點存在，則以巨集 list_for_each_entry_safe 對 list 走訪，並參考 kdnvt 學員的 code review，得知可以善用 queue.h 內部提供的 q_release_element 來移除 linked list element

/* Free all storage used by queue */
void q_free(struct list_head *l)
{
    if (l && !list_empty(l)) {
        element_t *entry = NULL, *safe = NULL;
        list_for_each_entry_safe (entry, safe, l, list)
            q_release_element(entry);
        free(l);
    }
}

q_insert_head

在實作這部份時，發現沒有提供函式來新增 linked list element，故先在 queue.c 當中，定義一個 q_new_ele 的函式來創造新的節點 :

/* Create a new element to put into list */
element_t *q_new_ele(char *s)
{
    if (!s)
        return NULL;

    element_t *new_ele = malloc(sizeof(element_t));
    if (!new_ele)
        return NULL;

    INIT_LIST_HEAD(&new_ele->list);

    int len = strlen(s) + 1;
    new_ele->value = malloc(sizeof(char) * len);
    if (!new_ele->value) {
        free(new_ele);
        return NULL;
    }

    strncpy(new_ele->value, s, len);
    new_ele->value[len - 1] = '\0';
    return new_ele;
}

:bulb: 記得複製長度為 n 的字串的時候，記憶體要分配 n + 1 個空間，留一個空間給 \0

有了上述函式，則可實作 LIFO 的節點插入 :

若 linked list 為 NULL 就先回傳
透過 q_new_ele 創立一個節點，並觀察是否為 NULL 以確認創建成功
最後透過巨集 list_add 來將節點放至 linked list 的開頭

:bulb: 觀察 list_add 的函式，可知函式參數兩者皆為 struct list_head*，因此第一個參數放入的是節點 new_node 底下的 list

/* Insert an element at head of queue */
bool q_insert_head(struct list_head *head, char *s)
{
    if (!head)
        return false;

    element_t *new_node = q_new_ele(s);
    if (!new_node)
        return false;

    list_add(new_node->list, head);
    return true;
}

q_insert_tail

跟 q_insert_head 的概念相似，可重複利用新增節點的函式 q_new_ele，並透過巨集 list_add_tail 來將節點插入尾端.

/* Insert an element at tail of queue */
bool q_insert_tail(struct list_head *head, char *s)
{
    if (!head)
        return false;
    
    element_t *new_node = q_new_ele(s);
    if (!new_node)
        return false;
    
    list_add_tail(new_node->list, head);
    return true;
}

q_remove_head

這邊提供三個函式參數，在閱讀 queue.h 此函式的解釋後，知道可利用 buf_size - 1 來複製第一個 entry 的字串到 sp 內。

而透過 list.h 可找到巨集 list_del_init 來輔助 remove element

步驟如下 :

若 linked list 為 NULL 或是 empty，就回傳 NULL，表示沒有任何被移除的節點
若 linked list 有帶資料的節點存在，則分別利用 head->next 和 list_first_entry 來取得 linked list 開頭的 element 和該元素封裝的 entry
利用 strndup 做字串複製到 sp，避免 sp 沒有被分配到記憶體
最後透過 list_del_init 將第一個節點從 linked list 移除
回傳移除節點的指標

:bulb: 不論 sp 是否為 NULL，仍然要把該資料節點 remove

/* Remove an element from head of queue */
element_t *q_remove_head(struct list_head *head, char *sp, size_t bufsize)
{
    if (!head || list_empty(head))
        return NULL;
        
    struct list_head *r_node = head->next;
    element_t *r_ele = list_first_entry(head, element_t, list);
    int min_len = 0;

    if (sp && bufsize) {
        min_len = strlen(r_ele->value) + 1 > bufsize ? bufsize
                                                     : strlen(r_ele->value) + 1;
        strncpy(sp, r_ele->value, min_len);
        sp[min_len - 1] = '\0';
    }
    list_del_init(r_node);
    return r_ele;
}

q_remove_tail

跟 q_remove_head 的概念相似，差別在於要取的節點為 linked list 尾端節點，因此可透過 head->prev 與 list_last_entry 取得 linked list 開頭的 element 和該元素封裝的 entry

/* Remove an element from tail of queue */
element_t *q_remove_tail(struct list_head *head, char *sp, size_t bufsize)
{
    if (!head || list_empty(head))
        return NULL;

    struct list_head *r_node = head->prev;
    element_t *r_ele = list_last_entry(head, element_t, list);
    int min_len = 0;

    if (sp && bufsize) {
        min_len = strlen(r_ele->value) + 1 > bufsize ? bufsize
                                                     : strlen(r_ele->value) + 1;
        strncpy(sp, r_ele->value, min_len);
        sp[min_len - 1] = '\0';
    }
    list_del_init(r_node);
    return r_ele;
}

q_size

若 linked list 為 NULL 或者是 empty，就直接回傳 0，表示 linked list 沒有任何帶資料的節點
若 linked list 有帶資料的節點，利用巨集 list_for_each 來對 linked list 走訪，計算節點數量

:bulb: 若 linked list 能夠維護一個整數變數 cnt，在插入或刪除節點時更新 cnt，這個函式應該能從 O(n)變到 O(1)

這樣對每個節點佔用的空間會有顯著衝擊。工程就是一系列的取捨 (trade-off)
:notes: jserv

/* Return number of elements in queue */
int q_size(struct list_head *head)
{
    if (!head || list_empty(head))
        return 0;
        
    int cnt = 0;
    struct list_head *tmp = NULL;

    list_for_each (tmp, head)
        cnt++;
    return cnt;
}

q_delete_mid

若 linked list 的節點數量為n，則一個 linked list 的中間節點，為在 0-base indexing 之下的第 floor(n/2) 的 element :

藉由 q_size 來算出中間節點的 element 為第 idx 個
透過巨集 list_for_each 走訪 linked list，找到第 idx 個 element
再透過巨集 list_del_init 來移除中間節點
最後使用巨集 q_release_element，來將節點記憶體釋放

:bulb: 注意到 : 這個函式為 delete，但是巨集提供的 list_del_init 只會把節點移除，須呼叫巨集 q_release_element 函式來把節點記憶體空間歸還給作業系統

/* Delete the middle node in queue */
bool q_delete_mid(struct list_head *head)
{
    if (!head || list_empty(head))
        return false;

    int idx = q_size(head) / 2;
    int cnt = 0;
    struct list_head *tmp = NULL;
    element_t *d_node = NULL;
    
    list_for_each (tmp, head) {
        if (cnt == idx)
            break;
        cnt++;
    }

    d_node = list_entry(tmp, element_t, list);
    list_del_init(tmp);
    q_release_element(d_node);
    return true;
}

q_delete_dup

這個函式想的比較久，有幾個想法: 令整個鏈結串列有

n

個節點。

直覺使用二層 for 迴圈來尋找重複的節點，時間複雜度是 O(n^2)

我試著找尋更有效率的解法來尋找重複的節點，於是想到第二種方法:
2. hash table 的 Chaining :同樣以鏈結串列實作，且對於搜尋單一字串，至多是 O(n)。

ASCII table
觀察 ASCII table 的字母對應的十進位，可發現最小值為 32，最大值為 126，因此建立 hash table 並使其具備

126 - 32 + 1 = 95

個 bucket，並使用每個節點字串的第一個字母，來計算 hash value，就可以進行建表與搜尋。

但是在實作到一半的過程發現，我們只有維護一個鏈結串列，若為了要刪除重複資料的節點，需要在走訪鏈結串列過程中，一併建出 hash table。這樣不僅時間複雜度沒有改善，反而還浪費了空間。

分析如下 : 針對 worst case 分析，若當前鏈結串列有 10 個節點，字串各別為 a 開頭，且每個字串皆相異。依據上述方法，會建出 1 個 bucket，但有 10 個節點鏈結串列。

在此過程中，若走訪至第一個節點，則需要檢查 bucket 第一個節點 ; 走訪第二個節點，則需要檢查 bucket 前兩個節點 ; 依此類推，故時間複雜度仍然為 O(n^2)，且還浪費與原本鏈結串列相同大小的空間 O(n)

結論 : 此方法比第一種方法來的更差

先將 linked list 做排序，在迭代比較相鄰節點的字串，若有重複，就刪除當前走訪到的節點

:bulb: 此方法是讓 unsorted list 轉變成 sorted list，並走訪整個 linked list，因此只需要花 O(n^logn) 即可完成。若題目嚴謹要求只能在 unsorted list 上操作，則此方法無效，須採用前兩種方法。

:bulb: 最後發現在作業要求與 leetcode 題目說明之下，給定的 linked list 為已經排序的，故選擇第三種方法。

:bulb: 這邊採用 marked 變數來紀錄重複資料節點刪除的狀況: 若有遇到兩兩相鄰節點資料相同，則設立 marked 為 1 ; 反之若資料不相同，則設為 0。這個變數的重點在於 : 當重複資料節點刪除時，遇到 safe 繞回 head，或者是重複資料節點刪除到只剩下一個的時候，需要把剩下那一個節點也刪除掉，也就是代表 marked 從 1 變回 0。

/* Delete all nodes that have duplicate string */
bool q_delete_dup(struct list_head *head)
{
    if (!head)
        return false;
        
    int marked = 0;
    element_t *entry = NULL, *safe = NULL;

    list_for_each_entry_safe (entry, safe, head, list) {
        if (&safe->list == head) {
            if (marked) {
                list_del_init(&entry->list);
                q_release_element(entry);
            }
            break;
        }

        if (strcmp(entry->value, safe->value) == 0) {
            list_del_init(&entry->list);
            q_release_element(entry);
            marked = 1;
        }
        else {
            if (marked) {
                list_del_init(&entry->list);
                q_release_element(entry);
            }
            marked = 0;
        }
    }
    return true;
}

q_swap

根據 leetcode 題目的描述，linked list 的節點交換，不能夠直接交換兩個節點的資料，須更改節點本身的 link。

一開始想到的作法比較直觀 : 利用 list_for_each(tmp, safe, head, list) 走訪整個 linked list，並額外使用 struct list_head *prev, *next 分別當作 tmp 節點的前一個節點，與 safe 節點的後一個節點，來做 swap :

/* Swap every two adjacent nodes */
void q_swap(struct list_head *head)
{
    if (head && !list_empty(head)) {
        struct list_head *tmp = NULL, *safe = NULL;
        list_for_each_safe (tmp, safe, head) {
            struct list_head *prev = tmp->prev, *next = safe->next;

            tmp->prev = safe;
            safe->next = tmp;
            tmp->next = next;
            safe->prev = prev;
        }
    }
}

後來經由觀摩 kdvnt 的 q_swap 想法，得知可以善用 list.h 裡面的巨集 list_move_tail 來調動節點。

但是在這邊思考了一下，有兩個疑惑的點 :

為什麼要用 list_move_tail ? 用 list_move 一定也能達到相同的操作

的確可以做到，兩者差別在於 :

list_move_tail(current->next, current)，current node 放在第二個參數，移動的節點是 current node 的下一個節點;
list_move(current, current->next)，current node 放在第一個參數，移動的節點就是 current node

list_move_tail 強調把 current node 當作 head，把其下一個節點，放至 head->prev ;
list_move 則強調把 current node 下一個節點當 head，把 current node 放至 head->next;

:bulb: 注意 : 佇列的 head，跟 list_move 與 list_move_tail 的參數的 head 不能混淆

為什麼不用 list_for_each_safe ?

因為迴圈中止條件需要多設立一個 node->next != head，否則會繼續 swap

綜合以上描述，新的 q_swap 函式如下 :

/* Swap every two adjacent nodes */
void q_swap(struct list_head *head)
{
    if (!head || list_empty(head))
        return;

    struct list_head *cur = NULL;

    for (cur = head->next; cur != head && cur->next != head; cur = cur->next)
        list_move_tail(cur->next, cur);
}

q_reverse

想法與 q_swap 相似，利用 list_move，讓每個 current node 的下一個節點，都被調到 linked list head 的下一個節點

/* Reverse elements in queue */
void q_reverse(struct list_head *head)
{
    if (!head || list_empty(head))
        return;

    struct list_head *cur = NULL;

    for (cur = head->next; cur != head && cur->next != head;)
        list_move(cur->next, head);
}

q_reverse_k

想法與 q_reverse 相似，每一次以 k 個節點作為單位，進行反向順序排列。
唯一不同的點在於，當每 k 個節點反向排序完成時，需要更新 tmp_head 至當前 k 個節點的最後一個，作為下一群 k個節點的 head，才能正確操作 list_move。
若走訪到後面，已經剩餘不足 k 個節點，就不需要 reverse。

/* Reverse the nodes of the list k at a time */
void q_reverseK(struct list_head *head, int k)
{
    if (!head || list_empty(head))
        return;

    struct list_head *cur = NULL, *tmp_head = head;
    int i, j;

    for (i = 0; i < q_size(head) && q_size(head) - i >= k; i += k) {
        for (j = 0, cur = tmp_head->next; j < k - 1; j++)
            list_move(cur->next, tmp_head);
        tmp_head = cur;
    }
}

q_sort

根據 linked list sort，這邊採用 Merge sort 來排序 linked list，而 Merge sort 可分成 partition 和 merge 的部份來思考 :

Partition :
每一次的 Partition，都會把當前的 linked list 分成兩半，分別為 left 與 right。這邊透過兩個 struct list_head* 的指標，分別為 slow 和 fast，來決定這一個 linked list 的中間節點，再根據中間節點來分割成兩個 linked list。

在 while 迴圈中，fast 每次都會比 slow 走的多一個節點，這個代表每次第一個與第二個 linked list都會各新增一個節點，直到 fast 遇到 head

:bulb: 不過 queue.c 已經有實作 q_size，可透過迭代尋找在 0-based indexing 之下的第 floor(n/2) 個節點，當作 slow，而 fast = slow->next。

:bulb: 當找到第一個與第二個 linked list 的分界點，會發現 Merge sort 在遞迴呼叫時，放入的參數是 linked list 的 head，其不存放任何資料。此時發現，可使用 LIST_HEAD 來產生 left 跟 right 的 head

接著，再透過 list_cut_position，把包含 slow 以前的 linked list 放入 left，剩下的放入 right 內，並遞迴的進行 partition。
2. Merge:
不斷分別比較第一個與第二個 list 內的節點資料大小，並呼叫 list_move_tail 來把節點的 value 較小者，放入 head linked list 的尾端。

等到 left 與 right 其中一個為 empty，就呼叫 list_splice_tail_init 來把另一個不為 empty 的 linked list 內部所有節點，都放到 head 的尾端，並呼叫 free 來把 q1 和 q2 的記憶體歸還給作業系統，最後回傳 head。注意這個函式不需要在額外 malloc 一塊空間來放資料，而是直接放入 head 即可，因為 head 傳入時必定為 empty 。

void merge(struct list_head *head, struct list_head *q1, struct list_head *q2)
{
    element_t *cur_q1 = NULL, *cur_q2 = NULL;

    while (!list_empty(q1) && !list_empty(q2)) {
        cur_q1 = list_entry(q1->next, element_t, list);
        cur_q2 = list_entry(q2->next, element_t, list);

        if (strcmp(cur_q1->value, cur_q2->value) <= 0)
            list_move_tail(q1->next, head);
        else
            list_move_tail(q2->next, head);
    }

    if (q1 && !list_empty(q1))
        list_splice_tail_init(q1, head);
    if (q2 && !list_empty(q2))
        list_splice_tail_init(q2, head);
}

void q_sort(struct list_head *head)
{
    if (!head || list_empty(head) || list_is_singular(head))
        return;

    struct list_head *fast = head->next->next, *slow = head->next;

    // split list
    while (fast != head && fast->next != head) {
        slow = slow->next;
        fast = fast->next->next;
    }

    // generate left & right head
    LIST_HEAD(left);
    LIST_HEAD(right);
    list_cut_position(&left, head, slow);
    list_splice_init(head, &right);
    q_sort(&left);
    q_sort(&right);
    merge(head, &left, &right);
}

q_descend

根據 leetcode - Remove Nodes From Linked List 題目可知，對於 linked list 內的每個節點，若當前節點右邊有任意一個節點的值，嚴格大於其本身的值，則當前節點會從 linked list 內被刪除。

由上敘述，可知從 linked list 的最後一個節點開始，依序向左檢驗每一個節點的值，並持續以 max_ele 紀錄當前遇到擁有最大 ASCII 值的節點，並刪除字串小於該最大 ASCII 字串的節點，最後再透過 q_size 回傳當前的節點數量。

/* Remove every node which has a node with a strictly greater value anywhere to
 * the right side of it */
int q_descend(struct list_head *head)
{
    if (!head || list_empty(head))
        return;

    struct list_head *cur = NULL;
    element_t *cur_ele = NULL,
              *max_ele = list_entry(head->prev, element_t, list);

    for (cur = max_ele->list->prev; cur != head; cur = max_ele->list->prev) {
        cur_ele = list_entry(cur, element_t, list);
        if (strcmp(max_ele->value, cur_ele->value) > 0) {
            list_del_init(cur);
            q_release_element(cur_ele);
        } else
            max_ele = cur_ele;
    }
    return q_size(head);
}

q_merge

一開始看到這個函式，非常疑惑為什麼不是給雙重指標，如 struct lish_ead **head 的形式，這樣才能存取到每一個 queue 的開頭。
後來經由觀摩 paul90317 的 q_merge，得知這個 struct list_head *head 確實是每一個 queue 的開頭，經由 queue.h 內部的下列結構體可知，head 就是 queue_context_t 結構下的 struct list_Head *q。

/**
 * queue_contex_t - The context managing a chain of queues
 * @q: pointer to the head of the queue
 * @chain: used by chaining the heads of queues
 * @size: the length of this queue
 * @id: the unique identification number
 */
typedef struct {
    struct list_head *q;
    struct list_head chain;
    int size;
    int id;
} queue_contex_t;

於是開始思考可以利用 Merge sort 內的 merge 函式，每走訪一次所有的 queue，就讓兩兩相鄰的 queue 合併。
想法如下:

若 queue 為 NULL 或 empty，代表沒有任何 queue 有資料，回傳 0
若有多個 queue，令總共要 merge k linked lists :
- 總共需要走訪整個 chain celling(k/2) 次，由於 C 的整數除法為無條件捨去，因此可改寫成 (k + 1) / 2
- 每一次走訪需要合併的次數為當前 queue 的數量的一半
- 每次利用 cur 和 empty 當指標走訪所有的 queues。
- cur 每一次都會合併當前的 queue 和下一個 queue，放到 temp
- empty 則指向第一個為 empty 的 queue ，讓 cur 合併好的 temp queue 放到 empty
在每一次走訪完之後，須確認這次合併的 queue 的數量 :
- 若是為偶數，代表這一次走訪，每一個 queue 都有被合併到
- 若是為奇數，代表這一次走訪，還有一個 queue 沒有被合併，需要將其搬到最後一個合併的 queue 的下一個 queue

/* Merge all the queues into one sorted queue, which is in ascending order */
int q_merge(struct list_head *head)
{
    if (!head || list_empty(head))
        return 0;

    int k;

    // Total k queue needs ceiling(n/2) times to merge, which means (k + 1) / 2
    for (k = q_size(head); k > 1 ; k = (k + 1) / 2) {
        struct list_head *cur = head->next, *empty = head->next;
        // k queue needs k/ 2 times merge
        for (int i = 0; i < k / 2; i++) {
            LIST_HEAD(temp);
            merge(&temp, list_entry(cur, queue_contex_t, chain)->q, list_entry(cur->next, queue_contex_t, chain)->q);
            list_splice_tail(&temp, list_entry(empty, queue_contex_t, chain)->q);

            cur = cur->next->next;
            empty = empty->next;
        }
        // if there has odd number queues, put the last queue to the next of the last combined queue
        if (k % 2)
            list_splice_tail_init(list_entry(cur, queue_contex_t, chain)->q, list_entry(empty, queue_contex_t, chain)->q);
    }

    return q_size(list_entry(head->next, queue_contex_t, chain)->q);
}

評分結果

目前得到 95 分，剩下 5 分目前無法得知不是 constant time 的原因，在多次重跑的情況之下，q_insert_tail, q_insert_head, q_remove_tail, and q_remove_head 皆有出現 Probably constant time 過。

+++ TESTING trace trace-17-complexity:
Test if time complexity of q_insert_tail, q_insert_head, q_remove_tail, and q_remove_head is constant
ERROR: Probably not constant time or wrong implementation
ERROR: Probably not constant time or wrong implementation
Probably constant time
ERROR: Probably not constant time or wrong implementation
–- trace-17-complexity 0/5
–- TOTAL 95/100

開發過程中的疑惑

linked list 的結構體宣告如下 :

struct list_head {
    struct list_head *prev;
    struct list_head *next;
};

而封裝 linked list 的結構 element_T 結構體宣告如下 :

typedef struct {
    char *value;
    struct list_head list;
} element_t;

為什麼 element_t 內的成員 list，不是以指標的形式宣告? 每一個 linked list 的 prev 與 next 都是指標，那照理來講，element_t 的成員 list 應該是指標，才能讓其它節點的 prev 或 next 指向 list 這個節點。

在 qtest 提供新的命令 shuffle

根據 Fisher–Yates shuffle 演算法，若資料結構內有 n 個 element，則以陣列實作的方法如下 :

總共有 n 個 element，代表索引從 0 ~ n-1
令 i 從陣列索引 n-1 至 0，不斷從範圍 [0,i] 內取出一個隨機數值 j，把第 i 個索引元素，跟第 j 個索引元素交換。

注意，[0,i] 是包含 0 與 i 索引。

Pseudo code 引用自維基百科:

-- To shuffle an array a of n elements (indices 0..n-1):
for i from n−1 downto 1 do
     j ← random integer such that 0 ≤ j ≤ i
     exchange a[j] and a[i]

在 linked list 內的 Fisher–Yates shuffle，則可以透過巨集 list_move tail 與 list_move 的輔助來完成 shuffle，想法如下 :

透過 q_size 取得共 cnt 個節點，並逐次遞減 cnt
令 safe 節點為不會被修改的節點，每次的 safe->prev 就是會被交換的節點之一，safe 會不斷的移動至 safe->prev，直到 head->next 等同於 safe
- 這邊 safe->prev 對應到陣列實作版本的 a[i] 元素
從 [0,cnt] 隨機抽取一個數值 j，並以 cur 指標，從 linked list head，移動 j + 1 次至要被交換的令一個元素。
- 這邊 cur 對應到陣列實作版本的 a[j] 元素
令 temp = cur->prev 當作傳入 list_move 的 head 參數 ; 並令 safe 當作傳 list_move_tail 的 head 參數 :
- 分別以 list_move_tail 與 list_move 來移動 cur 節點與 safe->prev->prev 節點
:bulb: 在 linked list 交換節點有兩個要注意的地方 :
1. 如果第 i 個節點與第 j 個節點相同，就不需要交換
2. 避免第 j+1 個節點與第 i 個節點相同，使用 list_move_tail 與 list_move，才不會發生相同節點呼叫 list_move 或 list_move_tail 的巨集時，存取到 list_del 過後的節點

/* Shuffle elements in queue */
void q_shuffle(struct list_head *head)
{
    if (!head || list_empty(head) || list_is_singular(head))
        return;

    srand(199);
    struct list_head *cur = NULL, *safe = NULL, *temp = NULL;
    int j = 0, cnt = q_size(head);

    for (safe = head; cnt >= 1; cnt--, safe = safe->prev) {
        j = rand() % cnt;
        for (cur = head; j >= 0; j--)
            cur = cur->next;

        if (cur != safe->prev) {
            temp = cur->prev;
            list_move_tail(cur, safe);
            list_move(safe->prev->prev, temp);
        }
    }
}

並在 qtest.c 內部加入下列程式碼 :

static bool do_shuffle(int argc, char *argv[])
{
    if (argc != 1) {
        report(1, "%s takes too much arguments", argv[0]);
        return false;
    }

    if (!current || !current->q)
        report(3, "Warning: Calling ascend on null queue");
    error_check();

    set_noallocate_mode(true);
    if (exception_setup(true))
        q_shuffle(current->q);
    exception_cancel();

    set_noallocate_mode(false);
    q_show(3);
    return !error_check();
}

static void console_init()
{
    ADD_COMMAND(shuffle, "Shuffle queue", "");
    ...
}

簡短探討 random shuffle linked list

根據 What-is-the-most-efficient-way-to-randomize-shuffle-a-linked-list 的討論，可以得知 :

在 O(1) space complexity 之下，可以做到最快的 random shuffle 為 O(nlogn) time complexity
在 O(n) space complexity 之下，可以做到最快的 random shuffle 為 O(n) time complexity

並且 Fisher–Yates shuffle 可以在 linked list 內節點數量多的時候，得到較好的效率，其方法是透過一個額外的指標陣列，以索引的方式存取到節點，來做到節點交換，時間複雜度可以達到 O(n)。若像上述實作的方式，利用第二層 for 迴圈找到相應的節點，則時間複雜度 O(n^2)。

解釋本程式的 “simulation” 模式

論文重點整理

根據 Dude, is my code constant time? :

不同的 measurements 是經由不同的 p percentile pre-processing 處理得到的
作者選擇捨棄大於 p percentile 的 timing distribution，是因為作者 low tails of time distribution 較容易觀察的到 constant time implementation 和 non-constant time implementation 之間的差異
作者採用兩種資料的來源是基於 fix-vs-random testing 的想法 : 所以在做實驗比較的時候，一個是來自於 fixed-value，也就是固定的 clocks cycle. 作者使用的 Cycle counter 常見於現代 CPU，如 x86 或 ARM 家族都有 ; 另一種是 random value，也就是隨機的 clocks cycle，
實驗主要使用 Welch's t test 來衡量兩個 timing distribution
Figure 1 : x 軸代表一個函式執行了多少個 cycles，y 軸為機率密度函數，代表函式擁有 y % 的機率，執行最多有 x 個 cycles.
- 可看到 Constant time 是一個 step function
- 也可看到 Constant time 跟 Non-constant time 執行的 CDF 之間的差異，是很明顯的 timing attack
Figure 2 : x 軸代表這兩個 timing distribution 有多少個 measurements; y 軸代表 t 值，同一條線代表同一種的 pre-processing 參數組。

解釋 Student’s t-distribution 及程式實作的原理。

Student's t-distribution :
- t's test 又稱為 Student's t-test ，其類似於常態分佈，可以獨立雙樣本檢定兩個分佈的群組，並驗證主張 "兩個分佈相近，都是 constant time" 的 Null hypothesis.
- 但是通常會假設兩個分佈的變異數具有同性質
- 而 Welch's t-test 是 Student's t-test 的改版，可用於兩個分佈擁有不同變異數時，他們的均值多相近
- Welch's t t-test 透過 degree of freedom 的概念來解決 Student's t-test 之下的變異數不同問題。
- 當 t 值越大，代表兩個分佈差越遠
- 如果 Welch's t-test 接受 Null hypothesis，代表我們的函式有可能是 Constant time，但無法完全保證一定就是 Constant time。
- 如果 Welch's t-test 拒絕 Null hypothesis，代表我們的函式可能不是 Constant time 或者絕對不是 Constant time，取決於 Critical values。
Welford’s method :
- 參考 Welford’s method，可知這種方法，可避免一般計算變異數，會採用的 Second-pass algorithm，其須大量暫存中間計算值，例如 : One-pass 會先計算一筆資料的平均值，Second-pass 再計算 Square difference from the mean。
- Welford's method 想法是，觀察第 N 與第 N - 1 差平方和，可以透過簡化，以 bottom-up 的方式來推出第 N 的變異數與標準差
- 在程式碼裡，對應到 ttest.c 內的 t_push 函式
Welch's t-test 原理 :
- Welch's t-test
  - Welch's t-test 的計算式如下 :
  - 其對應的 Degree of freedom 計算式如下 :
    
    :bulb: 兩個 t 分佈的變異數若不相同的話，自由度需要做加權調整，才可用於 t 檢定
  - 類似於卡方分佈的概念，我們可透過查詢 t-distribution table 內，找到在先前設定的 significance level + 計算完的 degree of freedom，來找到對應的 critical values :
    
    :bulb: 若兩者分佈的變異數相同的話，我們可以雙尾檢定的 significance level 來查表 ; 若兩個分佈中，一方的變異數大於另一方，則採用單尾檢定的 significance level。由於論文要做 fix-vs-random test，須採單尾檢定。
  - 最後驗證 Null hypothesis :
    - 如果計算出來的 t 統計量
      $t^{^{'}}$ ，大於我們查表得到的 critical values
      $t_{s i g n i f i c a n c e l e v e l, d e g r e e o f f r e e d o m}$ ，則我們拒絕 Null hypothesis，接受 alternative hypothesis，意義代表兩者的分佈，平均值有顯著差異。
    - 反之，若小於 critical values，我們則接受 Null hypothesis，拒絕 alternative hypothesis，意義代表兩者分佈的均值相近。
- 在程式碼裡面，對應到 fixture.c 內的 report 函式
:bulb: Welch's t-test 的計算式與範例可參考此網站

程式實作原理 : 先做 Code trace，再做分析。

Code trace

首先看到 fixture.c 的前幾行註解可知，該檔案以多次不同的輸入來衡量一個函式的執行時間，並利用 Welch's t-test 來決定其是否為 Constant time implementation。

根據 fixture.c:

先看到 test_const 其接收參數 char *text 是被衡量的 function 名稱，其執行多次的 doit 來衡量此函式。

// fixture.c
#define ENOUGH_MEASURE 10000
#define TEST_TRIES 10

...

static bool test_const(char *text, int mode)
{
    bool result = false;
    t = malloc(sizeof(t_context_t));

    for (int cnt = 0; cnt < TEST_TRIES; ++cnt) {
        printf("Testing %s...(%d/%d)\n\n", text, cnt, TEST_TRIES);
        init_once();
        for (int i = 0; i < ENOUGH_MEASURE / (N_MEASURES - DROP_SIZE * 2) + 1;
             ++i)
            result = doit(mode);
        printf("\033[A\033[2K\033[A\033[2K");
        if (result)
            break;
    }
    free(t);
    return result;
}

接著可發現 doit 就是用來執行一整個衡量流程的函式 :

static bool doit(int mode)
{
    int64_t *before_ticks = calloc(N_MEASURES + 1, sizeof(int64_t));
    int64_t *after_ticks = calloc(N_MEASURES + 1, sizeof(int64_t));
    int64_t *exec_times = calloc(N_MEASURES, sizeof(int64_t));
    uint8_t *classes = calloc(N_MEASURES, sizeof(uint8_t));
    uint8_t *input_data = calloc(N_MEASURES * CHUNK_SIZE, sizeof(uint8_t));

    if (!before_ticks || !after_ticks || !exec_times || !classes ||
        !input_data) {
        die();
    }

    prepare_inputs(input_data, classes);

    bool ret = measure(before_ticks, after_ticks, input_data, mode);
    differentiate(exec_times, before_ticks, after_ticks);
    update_statistics(exec_times, classes);
    ret &= report();

    free(before_ticks);
    free(after_ticks);
    free(exec_times);
    free(classes);
    free(input_data);

    return ret;
}

其會先根據 constan.c 內的 prepare_inputs 來準備 N_MEASURES = 150 筆隨機字串當測試資料，也就是一次 test 會有 150 筆 measurements:

// constant.c
void prepare_inputs(uint8_t *input_data, uint8_t *classes)
{
    randombytes(input_data, N_MEASURES * CHUNK_SIZE);
    for (size_t i = 0; i < N_MEASURES; i++) {
        classes[i] = randombit();
        if (classes[i] == 0)
            memset(input_data + (size_t) i * CHUNK_SIZE, 0, CHUNK_SIZE);
    }

    for (size_t i = 0; i < N_MEASURES; ++i) {
        /* Generate random string */
        randombytes((uint8_t *) random_string[i], 7);
        random_string[i][7] = 0;
    }
}

// constan.h
/* Number of measurements per test */
#define N_MEASURES 150

準備好測試資料的輸入之後，可看到 measure 函式，會針對不同的函式，會從剛剛準備好的測試資料內隨機存取一筆資料來衡量其 CPU cycles，方法為 :
- 在執行的函式開始之前，插入 beforeticks[i] = cpucycles()，紀錄當前在第幾個 cpu cycle
- 在執行的函式結束之後，插入 afterticks[i] = cpucycles()，紀錄執行後在第幾個 cpu cycle

:question: 不知道為何每一個函式 case，內部都有一個 dut_insert_head( get_random_string(), *(uint16_t *) (input_data + i * CHUNK_SIZE) % 10000);

bool measure(int64_t *before_ticks,
             int64_t *after_ticks,
             uint8_t *input_data,
             int mode)
{
    assert(mode == DUT(insert_head) || mode == DUT(insert_tail) ||
           mode == DUT(remove_head) || mode == DUT(remove_tail));

    switch (mode) {
    case DUT(insert_head):
        for (size_t i = DROP_SIZE; i < N_MEASURES - DROP_SIZE; i++) {
            char *s = get_random_string();
            dut_new();
            dut_insert_head(
                get_random_string(),
                *(uint16_t *) (input_data + i * CHUNK_SIZE) % 10000);
            int before_size = q_size(l);
            before_ticks[i] = cpucycles();
            dut_insert_head(s, 1);
            after_ticks[i] = cpucycles();
            int after_size = q_size(l);
            dut_free();
            if (before_size != after_size - 1)
                return false;
        }
        break;
    case DUT(insert_tail):
        for (size_t i = DROP_SIZE; i < N_MEASURES - DROP_SIZE; i++) {
            char *s = get_random_string();
            dut_new();
            dut_insert_head(
                get_random_string(),
                *(uint16_t *) (input_data + i * CHUNK_SIZE) % 10000);
            int before_size = q_size(l);
            before_ticks[i] = cpucycles();
            dut_insert_tail(s, 1);
            after_ticks[i] = cpucycles();
            int after_size = q_size(l);
            dut_free();
            if (before_size != after_size - 1)
                return false;
        }
        break;
    case DUT(remove_head):
        for (size_t i = DROP_SIZE; i < N_MEASURES - DROP_SIZE; i++) {
            dut_new();
            dut_insert_head(
                get_random_string(),
                *(uint16_t *) (input_data + i * CHUNK_SIZE) % 10000 + 1);
            int before_size = q_size(l);
            before_ticks[i] = cpucycles();
            element_t *e = q_remove_head(l, NULL, 0);
            after_ticks[i] = cpucycles();
            int after_size = q_size(l);
            if (e)
                q_release_element(e);
            dut_free();
            if (before_size != after_size + 1)
                return false;
        }
        break;
    case DUT(remove_tail):
        for (size_t i = DROP_SIZE; i < N_MEASURES - DROP_SIZE; i++) {
            dut_new();
            dut_insert_head(
                get_random_string(),
                *(uint16_t *) (input_data + i * CHUNK_SIZE) % 10000 + 1);
            int before_size = q_size(l);
            before_ticks[i] = cpucycles();
            element_t *e = q_remove_tail(l, NULL, 0);
            after_ticks[i] = cpucycles();
            int after_size = q_size(l);
            if (e)
                q_release_element(e);
            dut_free();
            if (before_size != after_size + 1)
                return false;
        }
        break;
    default:
        for (size_t i = DROP_SIZE; i < N_MEASURES - DROP_SIZE; i++) {
            dut_new();
            dut_insert_head(
                get_random_string(),
                *(uint16_t *) (input_data + i * CHUNK_SIZE) % 10000);
            before_ticks[i] = cpucycles();
            dut_size(1);
            after_ticks[i] = cpucycles();
            dut_free();
        }
    }
    return true;
}

接著透過 differentiate 來將函式測試時，執行完畢時的 clocks cycle 減去開始執行時的 clocks cycle，得到針對不同 input 時，執行一次函式所需要的 clocks cycle :

static void differentiate(int64_t *exec_times,
                          const int64_t *before_ticks,
                          const int64_t *after_ticks)
{
    for (size_t i = 0; i < N_MEASURES; i++)
        exec_times[i] = after_ticks[i] - before_ticks[i];
}

再利用update_statistics 來執行 Welford’s method，以 single-pass 的方式計算變異數

// fixture.c
static void update_statistics(const int64_t *exec_times, uint8_t *classes)
{
    for (size_t i = 0; i < N_MEASURES; i++) {
        int64_t difference = exec_times[i];
        /* CPU cycle counter overflowed or dropped measurement */
        if (difference <= 0)
            continue;

        /* do a t-test on the execution time */
        t_push(t, difference, classes[i]);
    }
}

// ttest.c
void t_push(t_context_t *ctx, double x, uint8_t class)
{
    assert(class == 0 || class == 1);
    ctx->n[class]++;

    /* Welford method for computing online variance
     * in a numerically stable way.
     */
    double delta = x - ctx->mean[class];
    ctx->mean[class] = ctx->mean[class] + delta / ctx->n[class];
    ctx->m2[class] = ctx->m2[class] + delta * (x - ctx->mean[class]);
}

最後再利用 report，執行 Welch's t-test 來看是否要拒絕 Null hypothesis，其 Null hypothesis 為 : 我們寫的函式等同於 Constant time :

// fixture.c
static bool report(void)
{
    double max_t = fabs(t_compute(t));
    double number_traces_max_t = t->n[0] + t->n[1];
    double max_tau = max_t / sqrt(number_traces_max_t);

    printf("\033[A\033[2K");
    printf("meas: %7.2lf M, ", (number_traces_max_t / 1e6));
    if (number_traces_max_t < ENOUGH_MEASURE) {
        printf("not enough measurements (%.0f still to go).\n",
               ENOUGH_MEASURE - number_traces_max_t);
        return false;
    }

    /* max_t: the t statistic value
     * max_tau: a t value normalized by sqrt(number of measurements).
     *          this way we can compare max_tau taken with different
     *          number of measurements. This is sort of "distance
     *          between distributions", independent of number of
     *          measurements.
     * (5/tau)^2: how many measurements we would need to barely
     *            detect the leak, if present. "barely detect the
     *            leak" = have a t value greater than 5.
     */
    printf("max t: %+7.2f, max tau: %.2e, (5/tau)^2: %.2e.\n", max_t, max_tau,
           (double) (5 * 5) / (double) (max_tau * max_tau));

    /* Definitely not constant time */
    if (max_t > t_threshold_bananas)
        return false;

    /* Probably not constant time. */
    if (max_t > t_threshold_moderate)
        return false;

    /* For the moment, maybe constant time. */
    return true;
}

// ttest.c
double t_compute(t_context_t *ctx)
{
    double var[2] = {0.0, 0.0};
    var[0] = ctx->m2[0] / (ctx->n[0] - 1);
    var[1] = ctx->m2[1] / (ctx->n[1] - 1);
    double num = (ctx->mean[0] - ctx->mean[1]);
    double den = sqrt(var[0] / ctx->n[0] + var[1] / ctx->n[1]);
    double t_value = num / den;
    return t_value;
}

實作分析與缺陷

利用 fix-vs-random test:
- 程式內部會不斷以 constant sample 與 random sample，來驗證函式以 random sample 下是否貼近 constant sample 下的均值，並以多次測試來決定其是否為 constant time
沒有做論文內提及的 Cropping:
- Cropping 是一種在 statistical analysis 之前的 pre-processing 方法
- Cropping 可根據不同的 p percentile 來捨棄掉大於 p 的 measurements
- 作者希望專注在做 low cycle count tail 的單尾檢定，並認為 upper tail 容易被 date-independent noise 給影響
  
  :bulb: Data-independent noise 像是執行函式時，被 OS 發出中斷訊號，或者其它影響函式測試的行為

根據上述，如果沒有做 Cropping 的話，因為 fixture.c 對於每一個函式，都會做 ENOUGH_MEASURE / (N_MEASURES - DROP_SIZE * 2) + 1 次的測試，並且測試回合數為 TEST_TRICES = 10，因此在執行過程中，執行次數多且時間長，upper tail 容易被影響，因此 t 檢定值可能會隨著 data-independent noise 而使得每次測試結果不同。

在我每一次以 simulation mode 跑測試時，四個函式的結果每次測出來的結果都不太一樣，經過追蹤程式碼，並畫出當自動評分認為不是 constant time 時， insert_head 一個測試的Cycle count 如下 :

可發現在接近 x = 80 的 measurements，有峰值接近 1000 的 cycle，可能是函式被 OS 的 interrupt 的中斷影響，而中斷可能來自於 Linux 的排程，也有可能來自於 I/O。

引入 lib/list_sort.c

根據 linux/list_sort.c，可分析 linux 核心的 linked list 排序由三個函式組成 : merge, merge_final 和 list_sort

為了引入 linux 的排序演算法，並且跟我的 merge sort 能夠互相切換，我先在 queue.h 內引入了下列程式碼:

#define LINUX_SORT
typedef unsigned char u8;
#define likely(x)    __builtin_expect(!!(x), 1)
#define unlikely(x)  __builtin_expect(!!(x), 0)

其中:

u8 代表 unsigned 型態且為 8 bit 的資料，故這邊直接以 typedef unsigned char 來定義
根據 likely and unlikely macro 的解釋可知，__bultin_expect 是 gcc 的內建 function ，用來將 branch 的相關資訊提供給 compiler，告訴 compiler 設計者期望的比較結果，以便對程式碼改變分支順序來進行優化。
因此 likely(x) 代表我們認為它發生 x == 1的機率比較高，也告訴 Compiler 要條件式內 x == 1 對應到的程式碼，接在後面執行
unlikely(x) 則代表發生x == 0 的機率比較高，告訴 Compiler 要條件式內 x == 0 對應到的程式碼，接在後面執行

:bulb: 因此 likely 和 unlikely 有助於程式減少分支跳躍的成本，但編譯時 gcc 要開啟最佳化， __builtin_expect 才有效果

:bulb: __builtin_expect 使用 !!(x) 的原因在於，連續兩次的 NOT 運算，能夠保證其值為 0 或 1，例如 !!(5) = !(0) = 1

並重新改寫了 queue.c 內的 q_sort 與新增 linux_cmp:

利用 predecessor 來切換使用的是我的 merge sort 或是 linux 的 sort
並根據 linux/list_sort.h，找到 list_cmp_func_t 的定義，再參考 Function pointer，實作出 linux_cmp

/* Sort elements of queue in ascending order */
void q_sort(struct list_head *head)
{
    #ifndef LINUX_SORT
        if (!head || list_empty(head) || list_is_singular(head))
            return;

        struct list_head *fast = head->next->next, *slow = head->next;

        // split list
        while (fast != head && fast->next != head) {
            slow = slow->next;
            fast = fast->next->next;
        }

        // generate left & right head
        LIST_HEAD(left);
        LIST_HEAD(right);
        list_cut_position(&left, head, slow);
        list_splice_init(head, &right);
        q_sort(&left);
        q_sort(&right);
        my_merge(head, &left, &right);
    #else
        list_sort(NULL, head, linux_cmp);
    #endif
}

static int linux_cmp(void *priv , const struct list_head *a, const struct list_head *b) {

    char *a_str = list_entry(a, element_t, list)->value;
    char *b_str = list_entry(b, element_t, list)->value;

    return strcmp(a_str, b_str);
}

接著我使用 linux 系統的 perf，為一套系統效能分析的工具，可提供多個面向的效能比較，使用方式可參考成大資工wiki - perf

先寫好執行各別在

10^{6}, 10^{7}, 10^{8}

排序資料的命令

:bulb: perf 缺點在於若要量測一段程式碼，須將其獨立出來。這邊使用的命令，測量上也包括了 new，無法最公平的比較

# Test of my merge sort v.s. Linux sort
# 執行插入一百萬次的腳本
option fail 0
option malloc 0
new
ih RAND 1000000
sort

並利用 Python 作為腳本，執行 perf 來觀察每一個排序的效能比較 :

import subprocess
from math import pow

nums = [int(pow(10, k)) for k in range(6, 9)]
for k in nums:
    p = subprocess.run(f"""sudo perf stat -C 1,2,3,4,5,6,7 -e branches -e instructions -e context-switches -e cpu-cycles -e cache-misses -e cache-references -e L1-dcache-loads -e system_time -e user_time --repeat 10 -o perf_report/sortcmp_perf_{k}_report.txt ./qtest -f traces/my-trace-sorting-cmp-{k}.cmd""",
    shell=True)

得到的結果比較如下 :

my merge sort

資料數量	branches	instructions	context-switches	cpu-cycles	cache-misses
$10^{6}$	679,012,523	4,875,172,271	978	7,829,057,985	101,640,677
$10^{7}$	788,000,774	5,622,147,933	1,623	9,277,850,351	140,175,351
$10^{8}$	630,853,888	5,085,136,193	1,423	9,329,504,976	138,576,715

資料數量	cache-references	L1-dcache-loads	system_time	user_time
$10^{6}$	209,742,446	1,001,857,248	3,010,816,900	10,992,776,200
$10^{7}$	255,647,029	1,152,871,723	3,329,508,000	12,598,190,500
$10^{8}$	254,038,731	1,173,757,396	3,321,200,400	12,635,215,600

list_sort

資料數量	branches	instructions	context-switches	cpu-cycles	cache-misses
$10^{6}$	526,535,184	4,222,727,130	956	7,149,526,207	91,477,708
$10^{7}$	606,694,876	5,152,602,028	1,298	8,656,871,637	98,835,155
$10^{8}$	778,221,524	5,612,350,276	1,043	8,469,879,742	93,550,735

資料數量	cache-references	L1-dcache-loads	system_time	user_time
$10^{6}$	159,253,765	840,097,790	2,892,750,700	9,682,878,100
$10^{7}$	198,423,050	1,179,638,533	3,351,852,000	11,284,926,800
$10^{8}$	196,866,365	1,118,001,716	3,265,051,300	11,206,414,100

由上述統計結果，我們進行下列的討論與結論 :

排序不需要大量 kernel-space 才能執行的函式，因此排序消耗的時間大多都是在 user-space :
- 可看到我的 merge sort 消耗在 user-space 的時間就多於 linux 的 list_sort
- Linux 比較快的地方在於它以 bottom-up 的方式排序元素，不需要 Top-down 遞迴呼叫的時間，也省下使用遞迴堆疊的空間
再來看到 branches，我實作的 merge sort 也比 linux 的 list_sort 還要多 :
- 因為 list_sort 有利用 __builtin_expect 來減少分支跳躍指令的情形，可以讓編譯器最佳化指令的排序
- 而且我的排序是以 Top-down 的方式，先做 partition，再 Bottom up 的組合起來 ; 而 Linux 的 list_sort 是以 bottom up 的方式直接把元素組合起來，因此所需的條件分支，比我的少更多
最後看到 cache-misses 的部份，我實作的 merge sort，明顯在每個層級比 list_sort 的快取失誤還要來的多 :
- 主因是 list_sort 在合併的時候有考量到 2:1 的原則，也就是維持已合併與合併的 linked list 為 2:1，防止 cache threashing 的現象