2022q1 Homework1 (lab0)

contributed by < tinyynoob >

作業要求

開發前準備

因為之前嘗試寫過 2021 年之作業，因此需要對 lab0-c 重新進行 fork。

首先，將原有的遠端 repo 刪除，之後建立一個新的遠端 repo 2021-lab0 ，並將舊的本地 repo 推到該遠端庫。最後，重新 fork 新版 lab0-c。

開發環境

Linux tinyynoob-notebook 5.13.0-30-generic #33~20.04.1-Ubuntu SMP Mon Feb 7 14:25:10 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

開發紀錄

queue.c

q_size()

加入以下兩行，用於檢測 NULL 佇列及空佇列

    if (!head || list_empty(head))
        return 0;

觀察 list.h ，得知 list_for_each() 為一 macro ，其作用為走訪 head 中的每一項元素。

#define list_for_each(node, head) \
    for (node = (head)->next; node != (head); node = node->next)

因此可透過完整走訪整個 list 來計數 list 中的元素個數。

struct list_head *q_new()
{
    struct list_head *newNode =
        (struct list_head *) malloc(sizeof(struct list_head));
    if (!newNode)
        return NULL;
    INIT_LIST_HEAD(newNode);
    return it;
}

q_new()

配置空間並初始化，若配置失敗則回傳 NULL 。

struct list_head *newNode =
    (struct list_head *) malloc(sizeof(struct list_head));
if (!newNode)
    return NULL;
INIT_LIST_HEAD(newNode);
return newNode;

q_insert_head()

撰寫程式碼如下




















bool q_insert_head(struct list_head *head, char *s)
{
    if (!head)
        return false;
    element_t *newNode = (element_t *) malloc(sizeof(element_t));
    if (!newNode)
        return false;
    INIT_LIST_HEAD(&newNode->list);
    newNode->value = NULL;
    size_t size = strlen(s);
    newNode->value = (char *) malloc(
        sizeof(char) * (size + 1));  //+1 to include the null terminator
    if (!newNode->value) {
        free(newNode);
        return false;
    }
    strncpy(newNode->value, s, sizeof(char) * (size + 1));
    list_add(&newNode->list, head);
    return true;
}

在 newNode 配置成功後先將其所有成員初始化避免錯誤操作。
其中第 8 行之用法源於 list.h 中對於 INIT_LIST_HEAD() 之註解

 * This can also be used to initialize a unlinked list node.

為 newNode->value 配置空間時需要加 1 ，才能為 \0 安排空間。另外因 char 不一定佔據 1 byte ，因此不可將 sizeof(char) 省略。

接著第 10 至 17 行將 s 內的字串複製至 newNode->value ，過程中使用 strncpy() 以避免 strcpy() 之潛在危險。

等上述操作皆成功完成後，才呼叫 kernel API 進行鏈結串列之相連。

q_insert_tail()

內容與 q_insert_head() 幾乎相同，僅將第 18 行之 list_add() 置換為 list_add_tail() 。

完成一部份後先 ./qtest 進行測試，發現 q_size() 的 return 值有錯

queue.c: In function ‘q_size’:
queue.c:137:12: error: returning ‘struct list_head *’ from a function with return type ‘int’ makes integer from pointer without a cast [-Werror=int-conversion]
  137 |     return it;
      |            ^~

因此將 it 修正為 size

q_free()

程式碼如下

void q_free(struct list_head *l)
{
    if (!l)  // prevent doubly free
        return;
    while (!list_empty(l)) {
        element_t *toDel = container_of(l->next, element_t, list);
        list_del(l->next);
        q_release_element(toDel);
    }
    free(l);
}

首先須判斷 l 是否為 NULL 來避免被使用者 doubly free 。

而後使用迴圈依序刪除節點，先使用 list_del() 將節點移出鏈結串列，再將整個容器清除，其中 q_release_element() 定義如下

void q_release_element(element_t *e)
{
    free(e->value);
    free(e);
}

串列為空後執行 free(l)

I think q_release_element() should be an inline function.

詳述你的考量、設計實驗，並拿實際的效益來說服其他人。
:notes: jserv

q_remove_head()

移除首節點，若 sp 非空，則將 entry 內之字串複製至 sp 。













element_t *q_remove_head(struct list_head *head, char *sp, size_t bufsize)
{
    if (!head || list_empty(head))
        return NULL;
    element_t *toRm = container_of(head->next, element_t, list);
    list_del_init(head->next);
    if (sp) {
        size_t len = strnlen(toRm->value, bufsize - 1);
        strncpy(sp, toRm->value, len);
        sp[len] = '\0';
    }
    return toRm;
}

根據 man page

size_t strnlen(const char *s, size_t maxlen);

有如下敘述

The strnlen() function returns the number of bytes in the string pointed to by s, excluding the terminating null byte ('\0'), but at most maxlen.

因此題目關於 bufsize 的要求可以用 strnlen() 來達成，此外因其計算長度時並未計入 \0 ，所以程式第 10 行直接存取第 len 個元素將剛好是 \0 需要在的位置。

q_remove_tail()

實作方式與 q_remove_head() 相似。

q_delete_mid()

我們可以利用環狀且雙向的特性，讓兩個指標從 head 開始反方向移動，直到它們交會時便是串列中點的位置。

但需要特別注意 list size 為偶數的情況，為了找到中點

⌊ n / 2 ⌋ = n / 2

，我們不能讓兩個指標同時動作，需要由 forwd 先走一步，觀察是否交會，再由 backwd 走一步，否則在偶數的情況下兩者將擦身而過。

bool q_delete_mid(struct list_head *head)
{
    if (!head || list_empty(head))
        return false;
    struct list_head *forwd = head->next, *backwd = head->prev;
    while (forwd != backwd) {
        forwd = forwd->next;
        if (forwd == backwd)
            break;
        backwd = backwd->prev;
    }
    element_t *todel = container_of(forwd, element_t, list);
    list_del(forwd);
    q_release_element(todel);
    return true;
}

或許可以先使用 q_size() 判斷奇偶數，再視結果用不同的方式尋找中點。
不過考量 q_size() 需要走訪全部節點，因此我估計其執行速度將不如目前版本。

q_delete_dup()

建立一條 list trashcan 收集欲刪除的節點，再於最後一口氣刪除。
























bool q_delete_dup(struct list_head *head)
{
    if (!head || list_empty(head))
        return false;
    struct list_head *trashcan = q_new();
    if (!trashcan)
        return false;
    struct list_head *it;
    list_for_each (it, head) {  // go through entire list
        char *curr = container_of(it, element_t, list)->value;
        /* if this entry is not duplicate */
        if (it->next == head ||
            strcmp(curr, container_of(it->next, element_t, list)->value))
            continue;
        /* if duplicate */
        it = it->prev;
        do {
            list_move(it->next, trashcan);
        } while (it->next != head &&
                 !strcmp(curr, container_of(it->next, element_t, list)->value));
    }
    q_free(trashcan);
    return true;
}

顧及走訪的速度，我們讓非 duplicate 的節點快速通過，有 duplicate 的情況放到迴圈後半部處理。

由於 duplicate 節點被要求全數刪除，因此在發現 duplicate 後需要先倒退一步(第 16 行)，再把資料內容等於 curr 的節點一一移到 trashcan 。

發現 duplicate ：

倒退一步：

接著由 17 到 20 行循環移除 it->next ，直到

q_swap()

此函式要求每遇到兩個節點便進行 swap ，因此我們引入 swap_pair() 使用，其功能為交換兩個節點。









static inline void swap_pair(struct list_head **head)
{
    struct list_head *first = *head, *second = first->next;
    first->next = second->next;
    second->next = first;
    second->prev = first->prev;
    first->prev = second;
    *head = second;
}

before :

after swap_pair() :

經由畫圖的過程，我發現了 swap_pair() 的錯誤，於是在第 3 行之後加入

    second->next->prev = first;

藉著 swap_pair() 我們可以簡潔實作 q_swap() 如下

void q_swap(struct list_head *head)
{
    if (!head)
        return;
    for (struct list_head **it = &head->next;
         *it != head && (*it)->next != head; it = &(*it)->next->next)
        swap_pair(it);
}

q_reverse()

由於我們的 list 是 doubly-linked ，因此可以直接交換每個節點的 next 跟 prev 達成 reverse ，程式碼如下：

void q_reverse(struct list_head *head)
{
    if (!head || list_empty(head))
        return;
    for (struct list_head *it = head->next; it != head; it = it->prev)
        swapptr((char **) &it->next, (char **) &it->prev);
    swapptr((char **) &head->next, (char **) &head->prev);
}

需要注意到迴圈方向 it = it->prev 其實是在往前走，因為我們已經交換了 next 跟 prev 。另外，迴圈不可由head 起始，否則會直接觸發中止條件結束。

static inline void swapptr(char **a, char **b)
{
    char *tmp = *a;
    *a = *b;
    *b = tmp;
}

目前的交換函式如上，我希望能藉由 xor 來達成不佔額外空間的交換，然而語法：

(__intptr_t)(*a) ^= (__intptr_t)*b;

卻會導致錯誤訊息 運算式必須是可修改的 Lvalue ，或許需要再查看規格書尋找原因。

gcc 內建的中文訊息是個悲劇，Lvalue 不該翻譯為「左值」，在 C99 後，lvalue 正式名稱是 locator value，跟「左」無關。
:notes: jserv

我發覺到不應於 assignment 左方進行轉型，然而改成

static inline void swapptr(char **a, char **b)
{
    (*a) = (__intptr_t)(*a) ^ (__intptr_t)(*b);
    (*b) = (__intptr_t)(*b) ^ (__intptr_t)(*a);
    (*a) = (__intptr_t)(*a) ^ (__intptr_t)(*b);
}

編譯時仍出現錯誤訊息

queue.c:280:10: error: assignment to ‘char *’ from ‘long int’ makes pointer from integer without a cast [-Werror=int-conversion]
  280 |     (*a) = (__intptr_t)(*a) ^ (__intptr_t)(*b);
      |          ^

根據 C99 規格書：

An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation.

我認定整數型別轉為指標型別應是合法操作。

推測可能是指標型態不合，再將程式碼修改為

static inline void swapptr(char **a, char **b)
{
    (*a) = (char *) ((__intptr_t)(*a) ^ (__intptr_t)(*b));
    (*b) = (char *) ((__intptr_t)(*b) ^ (__intptr_t)(*a));
    (*a) = (char *) ((__intptr_t)(*a) ^ (__intptr_t)(*b));
}

終於編譯通過並得到正確的交換結果。

我不確定規格書上 intptr_t 與這個 __intptr_t 的相異之處

q_sort()

註：由於此節內容略多，因此我提高其層級

基於上一次的經驗，我直接選擇使用 merge sort 來進行排序。

稍微瀏覽這篇筆記後，嘗試自己撰寫。首先將 list 切成單一節點來跳過慢慢切細的過程，目前我想不出 doubly-linked 的特性能做什麼利用，因此仍以和 singly-linked 相同的方式一步一步 merge 回去，第一個版本使用迭代實作。

























void q_sort(struct list_head *head)
{
    if (!head || list_empty(head) || list_is_singular(head))
        return;
    size_t size = q_size(head);
    struct list_head *lists[size];
    struct list_head *it = head->next;
    for (size_t i = 0; i < size; i++) {
        lists[i] = it;
        it = it->next;
        it->prev->next = NULL;
    }
    for (size_t n = size; n > 1; n = n / 2 + (n & 1)) {
        for (size_t i = 0; i < n / 2; i++)
            lists[i] = merge_sorted_singly(
                lists[i], lists[n / 2 + (n & 1) + i]);
    }
    head->next = lists[0];
    lists[0]->prev = head;
    for (it = lists[0]; it->next; it = it->next)
        it->next->prev = it;
    it->next = head;
    head->prev = it;
    return;
}

n / 2 + (n & 1) 之功能為將整數除以 2 後無條件進位

第 6 至 12 行：將 list 分割為單一節點
第 13 至 17 行：視為 singly-linked list ，兩兩一組進行合併
第 18 至 23 行：連結 head ，並重新走訪排序後的 list ，將每個節點的 prev 成員指向上一個

struct list_head *merge_sorted_singly(struct list_head *l1,
                                      struct list_head *l2)
{
    struct list_head *ans = NULL, **prev = &ans;
    while (l1 && l2) {
        struct list_head **small =
            strcmp(container_of(l1, element_t, list)->value,
                   container_of(l2, element_t, list)->value) < 0
                ? &l1
                : &l2;
        *prev = *small;
        (*small) = (*small)->next;
        prev = &(*prev)->next;
    }
    *prev = l1 ? l1 : l2;
    return ans;
}

試著使用 conditional operator 以及 indirect pointer 來簡化 merge 的程式碼。

我對於這個版本不大滿意，仍有多處不夠優美。目前期望的改進方向是在 merge 的過程中就處理掉 prev 成員的問題，另外我也希望能避免使用 lists 陣列來降低記憶體用量。

使用 make test 進行測試

scripts/driver.py -c
---     Trace           Points
+++ TESTING trace trace-01-ops:
# Test of insert_head and remove_head
---     trace-01-ops    5/5
+++ TESTING trace trace-02-ops:
# Test of insert_head, insert_tail, remove_head, remove_tail, and delete_mid
---     trace-02-ops    6/6
+++ TESTING trace trace-03-ops:
# Test of insert_head, insert_tail, reverse, and remove_head
---     trace-03-ops    6/6
+++ TESTING trace trace-04-ops:
# Test of insert_head, insert_tail, size, swap, and sort
---     trace-04-ops    6/6
+++ TESTING trace trace-05-ops:
# Test of insert_head, insert_tail, remove_head, reverse, size, swap, and sort
---     trace-05-ops    6/6
+++ TESTING trace trace-06-ops:
# Test of insert_head, insert_tail, delete duplicate, and sort
---     trace-06-ops    6/6
+++ TESTING trace trace-07-string:
# Test of truncated strings
---     trace-07-string 6/6
+++ TESTING trace trace-08-robust:
# Test operations on empty queue
---     trace-08-robust 6/6
+++ TESTING trace trace-09-robust:
# Test remove_head with NULL argument
---     trace-09-robust 6/6
+++ TESTING trace trace-10-robust:
# Test operations on NULL queue
---     trace-10-robust 6/6
+++ TESTING trace trace-11-malloc:
# Test of malloc failure on insert_head
---     trace-11-malloc 6/6
+++ TESTING trace trace-12-malloc:
# Test of malloc failure on insert_tail
---     trace-12-malloc 6/6
+++ TESTING trace trace-13-malloc:
# Test of malloc failure on new
---     trace-13-malloc 6/6
+++ TESTING trace trace-14-perf:
# Test performance of insert_tail, reverse, and sort
---     trace-14-perf   0/6
+++ TESTING trace trace-15-perf:
# Test performance of sort with random and descending orders
# 10000: all correct sorting algorithms are expected pass
# 50000: sorting algorithms with O(n^2) time complexity are expected failed
# 100000: sorting algorithms with O(nlogn) time complexity are expected pass
---     trace-15-perf   6/6
+++ TESTING trace trace-16-perf:
# Test performance of insert_tail
---     trace-16-perf   6/6
+++ TESTING trace trace-17-complexity:
# Test if time complexity of q_insert_tail, q_insert_head, q_remove_tail, and q_remove_head is constant
Probably constant time
Probably constant time
Probably constant time
Probably constant time
---     trace-17-complexity     5/5
---     TOTAL           94/100

發現雖然正確性測試都有通過，但 q_sort() 的速度不夠快，還需要再想辦法做改進。

將原先 q_sort() 第 11 行中的 it->prev 替換為 lists[i] 減少不必要的取值。

    for (size_t i = 0; i < size; i++) {
        lists[i] = it;
        it = it->next;
        lists[i]->next = NULL;
    }

接著我參考了 linux/lib/list_sort.c 的實作，將重建 prev 的部份融入到最後一次的 merge 中，減少一個 pass 的處理。修改成：

void q_sort(struct list_head *head)
{
    if (!head || list_empty(head) || list_is_singular(head))
        return;
    const size_t size = q_size(head);
    struct list_head *lists[size];
    struct list_head *it = head->next;
    for (size_t i = 0; i < size; i++) {
        lists[i] = it;
        it = it->next;
        lists[i]->next = NULL;
    }
    for (size_t n = size; n > 2; n = n / 2 + (n & 1)) {
        for (size_t i = 0, j = n - 1; i < j; i++, j--)
            lists[i] = merge_sorted_singly(lists[i], lists[j]);
    }
    final_merge(head, lists[0], lists[1]);
}

其中 lists 合併的方式略有調整，為了避免每次都要計算合併對象，直接在迴圈中引入 j 變數進行頭尾合併，如此也降低了 array index 對 n 的依賴。

此外也可以看見最後一次 merge 被移出迴圈並改寫為 final_merge() ，其結構大致與 merge_sorted_singly() 相似，主要差異在於增加了對 prev 成員的處理。

void final_merge(struct list_head *head,
                 struct list_head *l1,
                 struct list_head *l2)
{
    struct list_head *it = head;
    while (l1 && l2) {
        struct list_head **small =
            strcmp(container_of(l1, element_t, list)->value,
                   container_of(l2, element_t, list)->value) <= 0
                ? &l1
                : &l2;
        (*small)->prev = it;
        it->next = *small;
        it = it->next;
        (*small) = (*small)->next;
    }
    for (it->next = l1 ? l1 : l2; it->next; it = it->next)
        it->next->prev = it;
    it->next = head;
    head->prev = it;
}

以

size = 6

為例圖解 q_sort() ：

⟹

n = 6 {\begin{cases} merge (0, 5) \\ merge (1, 4) \\ merge (2, 3) \end{cases}

n = ⌈ 6 / 2 ⌉ = 3 {\begin{cases} merge (0, 2) \end{cases}

n = ⌈ 3 / 2 ⌉ = 2 final_merge (0, 1)

然而，經過修改後的程式碼仍無法通過速度測試，目前對於接續該如何優化沒有頭緒。

從討論串得到一些啟發，或許可以從 locality 的角度改進，使得程式執行上對 cache 更加友善。

突然有些好奇，會不會有同樣一份程式碼跑測試，在 cache 比較好的電腦測試通過，而在 cache 比較差勁的電腦卻 fail 的情形。

目前的想法卡在，若策略為使相鄰的 list 兩兩合併，例如 lists[i] 併 lists[i+1] ，那麼該如何處置奇數個 list 的問題，若一味地將落單的 list 併到最後，可能使得最後一條 list 出現越來越長的不平衡現象影響合併速度。

我發現自己看錯錯誤訊息，一直誤以為是 sort 的速度太慢才沒有通過 trace-15 。剛才發現我錯的是 trace-14 ，之後照著命令內容操作出現以下結果：

cmd> option fail 0
cmd> option malloc 0
cmd> new
l = []
cmd> ih dolphin 1000000
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]
cmd> reverse
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]
cmd> it gerbil 1000000
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]
cmd> reverse
l = [gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil ... ]
cmd> sort
程式記憶體區段錯誤 (核心已傾印)

然而測試使用較小的 size 是可以通過的：

cmd> option fail 0
cmd> option malloc 0
cmd> new
l = []
cmd> ih dolphin 100000
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]
cmd> it gerbil 100000
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]
cmd> reverse
l = [gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil ... ]
cmd> sort
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]

推測可能是因為此宣告 struct list_head *lists[size] 佔用了太大的空間。

接著觀摩 bakudr18 同學的實驗，讓我學習到可以針對自己的電腦硬體做測試。

我電腦的 cache 資訊如下

L1d 快取：                       128 KiB
L1i 快取：                       128 KiB
L2 快取：                        1 MiB
L3 快取：                        6 MiB

我想測試看看是否輸入資料的 size 塞得進某一層 cache 就能通過，首先因為 struct list_head* 佔用的空間為 8 bytes ，而 6 MiB 大約等於 6000000 bytes ，計算 6000000 除以 8 約為 750000 ，因此我重複使用 trace-14 的測試，並選擇兩個 cases 分別為

$300000 + 300000 = 600000$
$400000 + 400000 = 800000$

得出結果

cmd> new
l = []
cmd> ih dolphin 300000
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]
cmd> it gerbil 300000
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]
cmd> reverse
l = [gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil ... ]
cmd> sort
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]

cmd> new
l = []
cmd> ih dolphin 400000
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]
cmd> it gerbil 400000
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]
cmd> reverse
l = [gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil ... ]
cmd> sort
ERROR: Time limit exceeded.  Either you are in an infinite loop, or your code is too inefficient
Segmentation fault occurred.  You dereferenced a NULL or invalid pointer
中止 (核心已傾印)

再多測一個

350000 + 350000

：

cmd> new
l = []
cmd> ih dolphin 350000
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]
cmd> it gerbil 350000
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]
cmd> reverse
l = [gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil gerbil ... ]
cmd> sort
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]

於是我猜想 L3 cache 即是函式是否超時的關鍵。

我無法解釋為何這些 "dolphin" 和 "gerbil" 字串沒有佔用到 cache 空間

cmd> new
l = []
cmd> ih dolphin 300000
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]
cmd> it gerbil 300000
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]
cmd> time sort
l = [dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin dolphin ... ]
Delta time = 1.114

使用 time 命令測試後發現我撰寫的 q_sort() 執行速度真的非常緩慢。

有了這樣的認識後，我想將處理方式改成

\begin{aligned} 2000000 & = 500000 + 500000 + 500000 + 500000 \\ = 1000000 + 1000000 \\ = 2000000 \end{aligned}

然而問題在於，即使一開始可以分成每個 sublist 500000 個元素來裝進 L3 cache ，但是終究需要將它們合併到 1000000 ，此時又會遭遇 Time limit exceeded ，不曉得該如何解決。

測試使用快慢指標分割再合併的方法，竟然就通過測試了。我原先一直以為這會更慢，也有閱覽 Merge Sort 與它的變化效能測試節，得此結果使我非常納悶。

程式碼

void q_sort(struct list_head *head)
{
    if (!head || list_empty(head) || list_is_singular(head))
        return;
    head->prev->next = NULL;
    mergesort(&head->next);
    struct list_head *prev = head;
    for (struct list_head *it = head->next; it; it = it->next) {
        it->prev = prev;
        prev = it;
    }
    head->prev = prev;
    prev->next = head;
}

/* merge sort for singly linked list*/
void mergesort(struct list_head **head)
{
    if (!*head || !(*head)->next)
        return;
    struct list_head *mid = split_list(*head);
    mergesort(head);
    mergesort(&mid);
    *head = merge_sorted_singly(*head, mid);
}

/* split the list and return a pointer to mid */
struct list_head *split_list(struct list_head *const head)
{
    struct list_head *fast = head->next, *slow = head;
    while (fast) {
        if (fast->next)
            fast = fast->next->next;
        else
            fast = fast->next;
        slow = slow->next;
    }
    slow->prev->next = NULL;
    return slow;
}

與方法 1 使用相同的 merge_sorted_singly()

在 q_sort() 我們主要試了兩個方式：

直接分割成單一節點再進行頭尾合併 (非遞迴版)
使用快慢指標遞迴分割合併

方法 1 sketch ：

\begin{aligned} 8 & = 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 \\ = 2 + 2 + 2 + 2 \\ = 4 + 4 \\ = 8 \end{aligned}

方法 2 sketch ：

\begin{aligned} 8 & = 4 + 4 \\ = 2 + 2 + 2 + 2 \\ = 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 \\ = 2 + 2 + 2 + 2 \\ = 4 + 4 \\ = 8 \end{aligned}

我認為非常反直覺的原因在於，在其它運算都相同的情況下，既然方法 1 省去了開始時一步一步分割的步驟，為何方法 1 的速度會慢於方法 2 呢？真是神奇。

補：其實它們的運算並不相同，方法 1 是頭尾合併，而方法 2 是相鄰合併

嘗試學習 bakudr18 同學使用 valgrind 的 cachegrind 進行實驗。我直接使用 qtest 進行實驗，並挑選一些重要的項目展現於此。

Cachegrind doc

試驗中對方法 1, 2 輸入相同的 command ：

$ valgrind --tool=cachegrind ./qtest 2
cmd> new
cmd> ih dolphin 50000
cmd> it gerbil 50000
cmd> reverse
cmd> sort
cmd> quit

方法 1 結果：

--------------------------------------------------------------------------------
Ir         I1mr ILmr Dr        D1mr      DLmr      Dw        D1mw    DLmw     file:function
--------------------------------------------------------------------------------
28,213,903   71   63 4,802,424        71        64 4,802,289 200,061 199,997  /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_malloc
27,124,304   10    9 6,103,634 1,322,531 1,139,711         0       0       0  /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
24,258,396    2    2 6,766,220 1,479,787 1,198,364 4,459,724       2       2  /home/tinyynoob/code/2022-lab0/queue.c:merge_sorted_singly
16,207,984   40   38 4,602,325    12,512     1,569 2,001,228       2       2  /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_free
 9,604,588    7    7 2,401,257        23        22   800,505       6       5  /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:malloc
 9,451,354   14   14 1,400,189 1,047,151   468,209   350,175       6       6  /home/tinyynoob/code/2022-lab0/qtest.c:show_queue
 8,736,761    7    7 1,900,187   571,477   272,502 1,418,140   2,052       7  /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:malloc_consolidate
 8,000,039    6    4 2,000,010         2         2 2,400,011  24,994  24,993  /home/tinyynoob/code/2022-lab0/harness.c:test_malloc
 7,600,072    5    5 2,400,012    99,297    54,379 1,400,006 221,914 100,187  /home/tinyynoob/code/2022-lab0/harness.c:test_free

若我們僅觀察 last-level cache data read misses (DLmr) 之結果， cache miss 熱點主要分佈於：

strcmp() 1,139,711
merge_sorted_singly() 1,198,364
show_queue() 468,209
malloc_consolidate() 272,502

雖然是這樣做，但是我不懂為什麼 last level cache 是關鍵

方法 2 結果：

--------------------------------------------------------------------------------
Ir         I1mr ILmr Dr        D1mr      DLmr    Dw        D1mw    DLmw     file:function
--------------------------------------------------------------------------------
28,212,992   64   64 4,802,256        64      60 4,802,172 200,058 199,995  /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_malloc
18,070,400    2    2 5,075,110   511,304  22,664 3,445,062     512       2  /home/tinyynoob/code/2022-lab0/queue.c:merge_sorted_singly
17,988,657    7    7 4,076,356   550,022 120,395         0       0       0  /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
16,207,558   33   31 4,602,202        29      27 2,001,165       2       2  /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_free
 9,604,482    7    7 2,401,221        23      22   800,487       6       5  /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:malloc
 9,451,354   18   17 1,400,189 1,047,144 468,655   350,175       6       6  /home/tinyynoob/code/2022-lab0/qtest.c:show_queue
 8,900,420    7    7 1,900,175   324,665 237,735 1,499,996       4       4  /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:malloc_consolidate
 8,000,039    4    4 2,000,010         2       2 2,400,011  24,994  24,993  /home/tinyynoob/code/2022-lab0/harness.c:test_malloc
 7,700,038    5    5 2,400,012    50,003  50,003 1,400,006  62,502  62,502  /home/tinyynoob/code/2022-lab0/harness.c:test_free
 6,374,228    5    4 1,600,008         6       3   600,003       0       0  /build/glibc-eX1tMB/glibc-2.31/stdlib/random_r.c:random_r
 6,299,564    1    1 2,683,950 1,272,142 158,255    99,999       0       0  /home/tinyynoob/code/2022-lab0/queue.c:split_list

cache miss 熱點主要分佈於：

show_queue() 468,655
malloc_consolidate() 237,735

可以觀察到在方法 2 之中，雖然 merge_sorted_singly(), strcmp() 及 split_list() 也有著蠻高的 D1mr ，但是 DLmr 卻能相對於方法 1 降到非常低 (約 0.1 million) 。

我希望能整合兩者的優點，寫出一支從單一節點開始相鄰合併的程式。

qtest.c

Shuffle 命令

打開 qtest.c ，於 console_init() 中加入新命令：

    ADD_COMMAND(shuffle, "                | Shuffle the contents in queue");

並在 qtest.c 中加入實作函式 do_shuffle() ：




























static bool do_shuffle(int argc, char *argv[])
{
    if (argc != 1) {
        report(1, "%s takes no arguments", argv[0]);
        return false;
    }

    if (!l_meta.l)
        report(3, "Warning: Try to access null queue");
    error_check();

    struct list_head old;
    old.next = l_meta.l->next;
    old.prev = l_meta.l->prev;
    old.next->prev = &old;
    old.prev->next = &old;

    INIT_LIST_HEAD(l_meta.l);
    for (int i = l_meta.size; i; i--) {
        int roll = rand() % i;
        struct list_head *it = old.next;
        for (int j = 0; j < roll; j++)
            it = it->next;
        list_move_tail(it, l_meta.l);
    }
    show_queue(3);
    return !error_check();
}

主要依循 Fisher–Yates shuffle 算法，完成上方的 12 至 25 行。

shuffle 所做的事情簡單來說就是改變節點的順序，我認為 swap 的方式比較適合於 array ，因此對於 linked list ，我選用的方法是隨機從原 list 取出節點放入新 list ，直到全數節點都被取出。

思考了之後發現 old 這個變數根本非必要，因為第 19 和 20 行的作用會限制 struck out 的作用範圍，因此直接將洗出來的節點串到 list 的最後即可，被插入尾端的節點不會再次被選中。

程式碼可縮短為：

static bool do_shuffle(int argc, char *argv[])
{
    if (argc != 1) {
        report(1, "%s takes no arguments", argv[0]);
        return false;
    }

    if (!l_meta.l)
        report(3, "Warning: Try to access null queue");
    error_check();

    for (int i = l_meta.size; i; i--) {
        int roll = rand() % i;
        struct list_head *it = l_meta.l->next;
        for (int j = 0; j < roll; j++)
            it = it->next;
        list_move_tail(it, l_meta.l);
    }
    show_queue(3);
    return !error_check();
}

使用 Valgrind

依序輸入命令 make clean>>make SANITIZER=1>>./qtest>>help 卻未遇見任何錯誤。

而當我嘗試使用 Valgrind 去檢測 qtest 的執行時，會出現錯誤訊息

$ valgrind --tool=memcheck ./qtest
==37195==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.

先輸入

$ valgrind -q --leak-check=full ./qtest

之後再試一次

$ valgrind --tool=memcheck ./qtest

就順利開始執行了，原因不明。
測試幾個指令後都沒遇到 valgrind memcheck 回報錯誤。

而輸入命令 make valgrind 會得到落落長的分析結果

結果在這裡，原本太長只擷取一段

$ make valgrind
# Explicitly disable sanitizer(s)
make clean SANITIZER=0 qtest
make[1]: 進入目錄「/home/tinyynoob/code/2022-lab0」
rm -f qtest.o report.o console.o harness.o queue.o random.o dudect/constant.o dudect/fixture.o dudect/ttest.o linenoise.o .qtest.o.d .report.o.d .console.o.d .harness.o.d .queue.o.d .random.o.d .dudect/constant.o.d .dudect/fixture.o.d .dudect/ttest.o.d .linenoise.o.d *~ qtest /tmp/qtest.*
rm -rf .dudect
rm -rf *.dSYM
(cd traces; rm -f *~)
  CC    qtest.o
  CC    report.o
  CC    console.o
  CC    harness.o
  CC    queue.o
  CC    random.o
  CC    dudect/constant.o
  CC    dudect/fixture.o
  CC    dudect/ttest.o
  CC    linenoise.o
  LD    qtest
make[1]: 離開目錄「/home/tinyynoob/code/2022-lab0」
cp qtest /tmp/qtest.9sYRe7
chmod u+x /tmp/qtest.9sYRe7
sed -i "s/alarm/isnan/g" /tmp/qtest.9sYRe7
scripts/driver.py -p /tmp/qtest.9sYRe7 --valgrind 
---     Trace           Points
+++ TESTING trace trace-01-ops:
# Test of insert_head and remove_head
==37706== 8 bytes in 1 blocks are still reachable in loss record 1 of 3
==37706==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==37706==    by 0x4A5450E: strdup (strdup.c:42)
==37706==    by 0x110A8C: linenoiseHistoryAdd (linenoise.c:1236)
==37706==    by 0x11161F: linenoiseHistoryLoad (linenoise.c:1325)
==37706==    by 0x10C7B2: main (qtest.c:974)
==37706== 
==37706== 116 bytes in 19 blocks are still reachable in loss record 2 of 3
==37706==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==37706==    by 0x4A5450E: strdup (strdup.c:42)
==37706==    by 0x110A00: linenoiseHistoryAdd (linenoise.c:1236)
==37706==    by 0x11161F: linenoiseHistoryLoad (linenoise.c:1325)
==37706==    by 0x10C7B2: main (qtest.c:974)
==37706== 
==37706== 160 bytes in 1 blocks are still reachable in loss record 3 of 3
==37706==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==37706==    by 0x110A4C: linenoiseHistoryAdd (linenoise.c:1224)
==37706==    by 0x11161F: linenoiseHistoryLoad (linenoise.c:1325)
==37706==    by 0x10C7B2: main (qtest.c:974)
==37706== 
---     trace-01-ops    5/5
...

因為實在執行太久了，所以我在看到 trace-16 的結果後就先按下 ctrl+c 。

可以發現回報的都是重複的內容，沿著訊息的指示尋找，可以發現錯誤訊息的路徑是 qtest.c 的 main() => linenoiseHistoryLoad() => linenoiseHistoryAdd() => malloc(), strdup() ， log 指出錯誤主要來自這兩行程式碼

history = malloc(sizeof(char *) * history_max_len);

linecopy = strdup(line);

不過因為出現的都是 still reachable ，根據本次作業說明，我們可以去查看 global variable ，我推測問題出在 history 變數。

我懷疑是因為 qtest.c 的 main() 在結束前沒有 free history 相關的操作才出現這個問題。於是我去下載了 linenoise 提供的 example.c ，觀察其結束前也沒有 free history 一類的呼叫。但我同樣用 valgrind 的 memcheck 工具去測試 example.c ，卻沒有出現任何報錯。回頭發覺

$ valgrind --tool=memcheck ./qtest

並照著 trace-xx 的內容輸入也沒有狀況發生，因此應該是我根本沒搞懂 make valgrind 做了什麼事情。

我在 run_console() 中找到了 linenoiseFree() 呼叫，不過它跟 history 也沒有關係。

我去查詢了 memcheck manual ，得到這段解釋

"Still reachable". This covers cases 1 and 2 (for the BBB blocks) above. A start-pointer or chain of start-pointers to the block is found. Since the block is still pointed at, the programmer could, at least in principle, have freed it before program exit. "Still reachable" blocks are very common and arguably not a problem. So, by default, Memcheck won't report such blocks individually.

但我還是沒有解決問題，為了了解 make valgrind 到底在做什麼，引入 make VERBOSE=1 ，並再次執行得到

結果

$ make valgrind
# Explicitly disable sanitizer(s)
make clean SANITIZER=0 qtest
make[1]: 進入目錄「/home/tinyynoob/code/2022-lab0」
rm -f qtest.o report.o console.o harness.o queue.o random.o dudect/constant.o dudect/fixture.o dudect/ttest.o linenoise.o .qtest.o.d .report.o.d .console.o.d .harness.o.d .queue.o.d .random.o.d .dudect/constant.o.d .dudect/fixture.o.d .dudect/ttest.o.d .linenoise.o.d *~ qtest /tmp/qtest.*
rm -rf .dudect
rm -rf *.dSYM
(cd traces; rm -f *~)
  CC    qtest.o
  CC    report.o
  CC    console.o
  CC    harness.o
  CC    queue.o
  CC    random.o
  CC    dudect/constant.o
  CC    dudect/fixture.o
  CC    dudect/ttest.o
  CC    linenoise.o
  LD    qtest
make[1]: 離開目錄「/home/tinyynoob/code/2022-lab0」
cp qtest /tmp/qtest.NAa83W
chmod u+x /tmp/qtest.NAa83W
sed -i "s/alarm/isnan/g" /tmp/qtest.NAa83W
scripts/driver.py -p /tmp/qtest.NAa83W --valgrind 
---     Trace           Points
+++ TESTING trace trace-01-ops:
# Test of insert_head and remove_head
==47704== 4 bytes in 1 blocks are still reachable in loss record 1 of 3
==47704==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==47704==    by 0x4A5450E: strdup (strdup.c:42)
==47704==    by 0x110A8C: linenoiseHistoryAdd (linenoise.c:1236)
==47704==    by 0x11161F: linenoiseHistoryLoad (linenoise.c:1325)
==47704==    by 0x10C7B2: main (qtest.c:974)
==47704== 
==47704== 142 bytes in 19 blocks are still reachable in loss record 2 of 3
==47704==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==47704==    by 0x4A5450E: strdup (strdup.c:42)
==47704==    by 0x110A00: linenoiseHistoryAdd (linenoise.c:1236)
==47704==    by 0x11161F: linenoiseHistoryLoad (linenoise.c:1325)
==47704==    by 0x10C7B2: main (qtest.c:974)
==47704== 
==47704== 160 bytes in 1 blocks are still reachable in loss record 3 of 3
==47704==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==47704==    by 0x110A4C: linenoiseHistoryAdd (linenoise.c:1224)
==47704==    by 0x11161F: linenoiseHistoryLoad (linenoise.c:1325)
==47704==    by 0x10C7B2: main (qtest.c:974)
==47704== 
---     trace-01-ops    5/5
...
...

不知道為什麼手動輸入命令沒有訊息，而透過 driver.py 就會偵測到 still reachable 。

我發現 linenoise.c 試圖透過 linenoiseAtExit() 來清除 history ：

/* At exit we'll try to fix the terminal to the initial conditions. */
static void linenoiseAtExit(void)
{
    disableRawMode(STDIN_FILENO);
    freeHistory();
}

並且發現其註冊寫在 enableRawMode() 中，並使用atexit_registered 讓 linenoiseAtExit() 只會註冊一次：
















/* Raw mode: 1960 magic shit. */
static int enableRawMode(int fd)
{
    struct termios raw;

    if (!isatty(STDIN_FILENO))
        goto fatal;
    if (!atexit_registered) {
        puts("I am registering linenoiseAtExit.\n");
        atexit(linenoiseAtExit);
        atexit_registered = 1;
    }
    if (tcgetattr(fd, &orig_termios) == -1)
        goto fatal;
    ...
    ...

我在第 9 行處加了 print 來查看它是否有被註冊，得到：

$ ./qtest
I am registering linenoiseAtExit.

cmd> new
l = []
cmd> ih aaa
l = [aaa]
cmd> free
l = NULL
cmd> quit
Freeing queue

表示 linenoiseAtExit() 應該有被正確註冊才對。

再使用 algrind ./test 了解到直接執行並不會記憶體洩漏的訊息，因此需要去觀察 drive.py 到底是怎麼做的。

嘗試執行

$ valgrind ./qtest -v 3 -f traces/trace-01-ops.cmd

也會得到錯誤訊息：

cmd> # Test of insert_head and remove_head
cmd> option fail 0
cmd> option malloc 0
cmd> new
l = []
cmd> ih gerbil
l = [gerbil]
cmd> ih bear
l = [bear gerbil]
cmd> ih dolphin
l = [dolphin bear gerbil]
cmd> rh dolphin
Removed dolphin from queue
l = [bear gerbil]
cmd> rh bear
Removed bear from queue
l = [gerbil]
cmd> rh gerbil
Removed gerbil from queue
l = []
cmd> 
Freeing queue
==150407== 2 bytes in 1 blocks are still reachable in loss record 1 of 3
==150407==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==150407==    by 0x4A4C3BE: strdup (strdup.c:42)
==150407==    by 0x110A98: linenoiseHistoryAdd (linenoise.c:1235)
==150407==    by 0x11162B: linenoiseHistoryLoad (linenoise.c:1324)
==150407==    by 0x10C7B2: main (qtest.c:976)
==150407== 
==150407== 104 bytes in 19 blocks are still reachable in loss record 2 of 3
==150407==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==150407==    by 0x4A4C3BE: strdup (strdup.c:42)
==150407==    by 0x110A0C: linenoiseHistoryAdd (linenoise.c:1235)
==150407==    by 0x11162B: linenoiseHistoryLoad (linenoise.c:1324)
==150407==    by 0x10C7B2: main (qtest.c:976)
==150407== 
==150407== 160 bytes in 1 blocks are still reachable in loss record 3 of 3
==150407==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==150407==    by 0x110A58: linenoiseHistoryAdd (linenoise.c:1223)
==150407==    by 0x11162B: linenoiseHistoryLoad (linenoise.c:1324)
==150407==    by 0x10C7B2: main (qtest.c:976)
==150407==

且其中也沒有看見我安插的 I am registering linenoiseAtExit. 訊息，代表 linenoiseAtExit() 沒有被正確註冊。

由於 -v 只是在設定 VERBOSE level ，因此推測是跟 -f 有關，嘗試只用 -f 執行確實也有跳出 valgrind 訊息。

探討 qtest.c 裡的 main() ：

        case 'f':
            strncpy(buf, optarg, BUFSIZE);
            buf[BUFSIZE - 1] = '\0';
            infile_name = buf;
            break;

我發現我在整個 program 中遍尋不著 optarg 變數的宣告。

藉著 man optarg 查到了：

       int getopt(int argc, char * const argv[],
                  const char *optstring);

       extern char *optarg;

optstring is a string containing the legitimate option characters. If such a character is followed by a colon, the option requires an argument, so getopt() places a pointer to the following text in the same argv-element, or the text of the following argv-element, in optarg.

因此在這裡 optarg 應該會存下字串 "traces/trace-01-ops.cmd" 。

繼續追查 file descriptor 如何使用，在 console.c 中發現：

/* Create new buffer for named file.
 * Name == NULL for stdin.
 * Return true if successful.
 */
static bool push_file(char *fname)
{
    int fd = fname ? open(fname, O_RDONLY) : STDIN_FILENO;
    has_infile = fname ? true : false;
...

於是我們發現 stdin 跟外來的 file 會得到不同 has_infile 值，最終會在 run_console() 中走進不同的分支。

bool run_console(char *infile_name)
{
    if (!push_file(infile_name)) {
        report(1, "ERROR: Could not open source file '%s'", infile_name);
        return false;
    }

    if (!has_infile) {
        char *cmdline;
        while ((cmdline = linenoise(prompt)) != NULL) {
            interpret_cmd(cmdline);
            linenoiseHistoryAdd(cmdline);       /* Add to the history. */
            linenoiseHistorySave(HISTORY_FILE); /* Save the history on disk. */
            linenoiseFree(cmdline);
        }
    } else {
        while (!cmd_done())
            cmd_select(0, NULL, NULL, NULL, NULL);
    }

    return err_cnt == 0;
}

cmd_select() 程式碼

/*
 * Handle command processing in program that uses select as main control loop.
 * Like select, but checks whether command input either present in internal
 * buffer
 * or readable from command input.  If so, that command is executed.
 * Same return as select.  Command input file removed from readfds
 *
 * nfds should be set to the maximum file descriptor for network sockets.
 * If nfds == 0, this indicates that there is no pending network activity
 */
int cmd_select(int nfds,
               fd_set *readfds,
               fd_set *writefds,
               fd_set *exceptfds,
               struct timeval *timeout)
{
    int infd;
    fd_set local_readset;

    if (cmd_done())
        return 0;

    if (!block_flag) {
        /* Process any commands in input buffer */
        if (!readfds)
            readfds = &local_readset;

        /* Add input fd to readset for select */
        infd = buf_stack->fd;
        FD_SET(infd, readfds);
        if (infd == STDIN_FILENO && prompt_flag) {
            printf("%s", prompt);
            fflush(stdout);
            prompt_flag = true;
        }

        if (infd >= nfds)
            nfds = infd + 1;
    }
    if (nfds == 0)
        return 0;

    int result = select(nfds, readfds, writefds, exceptfds, timeout);
    if (result <= 0)
        return result;

    infd = buf_stack->fd;
    if (readfds && FD_ISSET(infd, readfds)) {
        /* Commandline input available */
        FD_CLR(infd, readfds);
        result--;
        if (has_infile) {
            char *cmdline;
            cmdline = readline();
            if (cmdline)
                interpret_cmd(cmdline);
        }
    }
    return result;
}

由於 cmd_select() 中並沒有用到 linenoise 相關的功能，因此也沒有地方可以註冊 linenoiseAtExit() ，而既然不會用到，我就直接改成讓 main() 不去 trigger linenoise ：

+   if (!infile_name) {
        linenoiseSetCompletionCallback(completion);

        linenoiseHistorySetMaxLen(HISTORY_LEN);
        linenoiseHistoryLoad(HISTORY_FILE); /* Load the history at startup */
+   }

修改後可見 make valgrind 的錯誤訊息大幅縮減，對於 history 變數不再報錯。

result

$ make valgrind
# Explicitly disable sanitizer(s)
make clean SANITIZER=0 qtest
make[1]: 進入目錄「/home/tinyynoob/code/lab0-c」
rm -f qtest.o report.o console.o harness.o queue.o random.o dudect/constant.o dudect/fixture.o dudect/ttest.o linenoise.o .qtest.o.d .report.o.d .console.o.d .harness.o.d .queue.o.d .random.o.d .dudect/constant.o.d .dudect/fixture.o.d .dudect/ttest.o.d .linenoise.o.d *~ qtest /tmp/qtest.*
rm -rf .dudect
rm -rf *.dSYM
(cd traces; rm -f *~)
  CC    qtest.o
  CC    report.o
  CC    console.o
  CC    harness.o
  CC    queue.o
  CC    random.o
  CC    dudect/constant.o
  CC    dudect/fixture.o
  CC    dudect/ttest.o
  CC    linenoise.o
  LD    qtest
make[1]: 離開目錄「/home/tinyynoob/code/lab0-c」
cp qtest /tmp/qtest.IVK82L
chmod u+x /tmp/qtest.IVK82L
sed -i "s/alarm/isnan/g" /tmp/qtest.IVK82L
scripts/driver.py -p /tmp/qtest.IVK82L --valgrind 
---     Trace           Points
+++ TESTING trace trace-01-ops:
# Test of insert_head and remove_head
---     trace-01-ops    5/5
+++ TESTING trace trace-02-ops:
# Test of insert_head, insert_tail, remove_head, remove_tail, and delete_mid
---     trace-02-ops    6/6
+++ TESTING trace trace-03-ops:
# Test of insert_head, insert_tail, reverse, and remove_head
---     trace-03-ops    6/6
+++ TESTING trace trace-04-ops:
# Test of insert_head, insert_tail, size, swap, and sort
---     trace-04-ops    6/6
+++ TESTING trace trace-05-ops:
# Test of insert_head, insert_tail, remove_head, reverse, size, swap, and sort
---     trace-05-ops    6/6
+++ TESTING trace trace-06-ops:
# Test of insert_head, insert_tail, delete duplicate, and sort
---     trace-06-ops    6/6
+++ TESTING trace trace-07-string:
# Test of truncated strings
---     trace-07-string 6/6
+++ TESTING trace trace-08-robust:
# Test operations on empty queue
---     trace-08-robust 6/6
+++ TESTING trace trace-09-robust:
# Test remove_head with NULL argument
---     trace-09-robust 6/6
+++ TESTING trace trace-10-robust:
# Test operations on NULL queue
---     trace-10-robust 6/6
+++ TESTING trace trace-11-malloc:
# Test of malloc failure on insert_head
---     trace-11-malloc 6/6
+++ TESTING trace trace-12-malloc:
# Test of malloc failure on insert_tail
---     trace-12-malloc 6/6
+++ TESTING trace trace-13-malloc:
# Test of malloc failure on new
---     trace-13-malloc 6/6
+++ TESTING trace trace-14-perf:
# Test performance of insert_tail, reverse, and sort
---     trace-14-perf   6/6
+++ TESTING trace trace-15-perf:
# Test performance of sort with random and descending orders
# 10000: all correct sorting algorithms are expected pass
# 50000: sorting algorithms with O(n^2) time complexity are expected failed
# 100000: sorting algorithms with O(nlogn) time complexity are expected pass
---     trace-15-perf   6/6
+++ TESTING trace trace-16-perf:
# Test performance of insert_tail
---     trace-16-perf   6/6
+++ TESTING trace trace-17-complexity:
# Test if time complexity of q_insert_tail, q_insert_head, q_remove_tail, and q_remove_head is constant
Probably constant time
ERROR: Probably not constant time
Probably constant time
Probably constant time
==152278== 32 bytes in 1 blocks are still reachable in loss record 1 of 29
==152278==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==152278==    by 0x10CE63: malloc_or_fail (report.c:189)
==152278==    by 0x10D8ED: add_cmd (console.c:89)
==152278==    by 0x10DC77: init_cmd (console.c:408)
==152278==    by 0x10C5A9: main (qtest.c:969)
==152278== 
==152278== 32 bytes in 1 blocks are still reachable in loss record 2 of 29
==152278==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==152278==    by 0x10CE63: malloc_or_fail (report.c:189)
==152278==    by 0x10D8ED: add_cmd (console.c:89)
==152278==    by 0x10DC91: init_cmd (console.c:409)
==152278==    by 0x10C5A9: main (qtest.c:969)
==152278== 
==152278== 32 bytes in 1 blocks are still reachable in loss record 3 of 29
==152278==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==152278==    by 0x10CE63: malloc_or_fail (report.c:189)
==152278==    by 0x10D8ED: add_cmd (console.c:89)
==152278==    by 0x10DCAB: init_cmd (console.c:410)
==152278==    by 0x10C5A9: main (qtest.c:969)
==152278== 
==152278== 32 bytes in 1 blocks are still reachable in loss record 4 of 29
==152278==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==152278==    by 0x10CE63: malloc_or_fail (report.c:189)
==152278==    by 0x10D8ED: add_cmd (console.c:89)
==152278==    by 0x10DCC5: init_cmd (console.c:411)
==152278==    by 0x10C5A9: main (qtest.c:969)
==152278== 
...
==152278== 32 bytes in 1 blocks are still reachable in loss record 22 of 29
==152278==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==152278==    by 0x10CE63: malloc_or_fail (report.c:189)
==152278==    by 0x10D8ED: add_cmd (console.c:89)
==152278==    by 0x10C72F: console_init (qtest.c:822)
==152278==    by 0x10C72F: main (qtest.c:970)
==152278== 
==152278== 40 bytes in 1 blocks are still reachable in loss record 23 of 29
==152278==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==152278==    by 0x10CE63: malloc_or_fail (report.c:189)
==152278==    by 0x10D967: add_param (console.c:110)
==152278==    by 0x10DD32: init_cmd (console.c:415)
==152278==    by 0x10C5A9: main (qtest.c:969)
==152278== 
==152278== 40 bytes in 1 blocks are still reachable in loss record 24 of 29
==152278==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==152278==    by 0x10CE63: malloc_or_fail (report.c:189)
==152278==    by 0x10D967: add_param (console.c:110)
==152278==    by 0x10DD51: init_cmd (console.c:416)
==152278==    by 0x10C5A9: main (qtest.c:969)
==152278== 
==152278== 40 bytes in 1 blocks are still reachable in loss record 25 of 29
==152278==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==152278==    by 0x10CE63: malloc_or_fail (report.c:189)
==152278==    by 0x10D967: add_param (console.c:110)
==152278==    by 0x10DD70: init_cmd (console.c:417)
==152278==    by 0x10C5A9: main (qtest.c:969)
==152278== 
...
==152278== 
==152278== 40 bytes in 1 blocks are still reachable in loss record 29 of 29
==152278==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==152278==    by 0x10CE63: malloc_or_fail (report.c:189)
==152278==    by 0x10D967: add_param (console.c:110)
==152278==    by 0x10C78C: console_init (qtest.c:827)
==152278==    by 0x10C78C: main (qtest.c:970)
==152278== 
---     trace-17-complexity     0/5
---     TOTAL           95/100

剩餘的都是 cmd 或 param 在配置空間後沒有釋放，並且集中發生在 trace-17-complexity.cmd 為 input file 時。由於其它測資都不會出現這種 leak 的情況，我們需要先釐清命令：

option simulation 1

到底會發生什麼事。

立起 simulation 之後，會去執行 dudect/ 目錄下的時間測試程式，而其中也沒看到有什麼和 cmd, param 有關的地方。

接著我們發現 cmd 和 param 的清除都 console.c 的 do_quit() 進行：

/* Built-in commands */
static bool do_quit(int argc, char *argv[])
{
    puts("I am in do_quit().\n");
    cmd_ptr c = cmd_list;
    bool ok = true;
    while (c) {
        cmd_ptr ele = c;
        c = c->next;
        free_block(ele, sizeof(cmd_ele));
    }

    param_ptr p = param_list;
    while (p) {
        param_ptr ele = p;
        p = p->next;
        free_block(ele, sizeof(param_ele));
    }

    while (buf_stack)
        pop_file();

    for (int i = 0; i < quit_helper_cnt; i++) {
        ok = ok && quit_helpers[i](argc, argv);
    }

    quit_flag = true;
    return ok;
}

即使是沒有下 quit 命令也會因為 quit_flag 尚未立起而自動在 finish_cmd() 呼叫。

而手動測試 rt 發現並不會看到 valgrind 報錯：

$ valgrind ./qtest
cmd> option simulation 1
cmd> rt
ERROR: Probably not constant time
cmd> quit
Freeing queue

因為手動測試一定要靠 quit 離開，所以可能不太準確，這次將 trace-17-complexity.cmd 中的 it, ih 及 rh 註解掉，只留下 rt 測試：

結果

$ valgrind ./qtest -f traces/trace-17-complexity.cmd
cmd> # Test if time complexity of q_insert_tail, q_insert_head, q_remove_tail, and q_remove_head is constant
cmd> option simulation 1
cmd> #it
Unknown command '#it'
cmd> #ih
Unknown command '#ih'
cmd> #rh
Unknown command '#rh'
cmd> rt
ERROR: Probably not constant time
cmd> option simulation 0
==175742== 32 bytes in 1 blocks are still reachable in loss record 1 of 29
==175742==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==175742==    by 0x10CE63: malloc_or_fail (report.c:189)
==175742==    by 0x10D8F9: add_cmd (console.c:89)
==175742==    by 0x10DC83: init_cmd (console.c:409)
==175742==    by 0x10C5A9: main (qtest.c:969)
==175742== 
==175742== 32 bytes in 1 blocks are still reachable in loss record 2 of 29
==175742==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==175742==    by 0x10CE63: malloc_or_fail (report.c:189)
==175742==    by 0x10D8F9: add_cmd (console.c:89)
==175742==    by 0x10DC9D: init_cmd (console.c:410)
==175742==    by 0x10C5A9: main (qtest.c:969)
==175742== 
==175742== 32 bytes in 1 blocks are still reachable in loss record 3 of 29
==175742==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==175742==    by 0x10CE63: malloc_or_fail (report.c:189)
==175742==    by 0x10D8F9: add_cmd (console.c:89)
==175742==    by 0x10DCB7: init_cmd (console.c:411)
==175742==    by 0x10C5A9: main (qtest.c:969)
==175742== 
==175742== 32 bytes in 1 blocks are still reachable in loss record 4 of 29
==175742==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==175742==    by 0x10CE63: malloc_or_fail (report.c:189)
==175742==    by 0x10D8F9: add_cmd (console.c:89)
==175742==    by 0x10DCD1: init_cmd (console.c:412)
==175742==    by 0x10C5A9: main (qtest.c:969)
==175742== 
==175742== 32 bytes in 1 blocks are still reachable in loss record 5 of 29
==175742==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==175742==    by 0x10CE63: malloc_or_fail (report.c:189)
==175742==    by 0x10D8F9: add_cmd (console.c:89)
==175742==    by 0x10DCEB: init_cmd (console.c:413)
==175742==    by 0x10C5A9: main (qtest.c:969)
...

未完全解決

simulation

嘗試透過 massif 工具觀察 simulation 過程的使用量，先嘗試對 ih 進行操作，得

可以看出過程中的 memory consumption 不斷在波動，也可以得知最高點就在第 1 個 snapshot 。

加入 web 命令

研讀 Dude, is my code constant time?

這篇論文提出使用統計學方法來測試給定的程式是否為 constant time

解釋測試程式對 constant time 的測量

嘗試從 qtest.c 中對 is_insert_head_const() 的呼叫開始追查，可以發現大部分的相關程式都在 dudect/ 路徑下，並且測試流程主要由 fixture.c 的 doit() 所掌控。

static bool doit(int mode)
{
    int64_t *before_ticks = calloc(n_measure + 1, sizeof(int64_t));
    int64_t *after_ticks = calloc(n_measure + 1, sizeof(int64_t));
    int64_t *exec_times = calloc(n_measure, sizeof(int64_t));
    uint8_t *classes = calloc(n_measure, sizeof(uint8_t));
    uint8_t *input_data = calloc(n_measure * chunk_size, sizeof(uint8_t));

    if (!before_ticks || !after_ticks || !exec_times || !classes ||
        !input_data) {
        die();
    }

    prepare_inputs(input_data, classes);

    measure(before_ticks, after_ticks, input_data, mode);
    differentiate(exec_times, before_ticks, after_ticks);
    update_statistics(exec_times, classes);
    bool ret = report();

    free(before_ticks);
    free(after_ticks);
    free(exec_times);
    free(classes);
    free(input_data);

    return ret;
}

prepare_inputs() 函式大致的工作是產生一些隨機的資料、字串等等，主要透過 random.c 來實現，其中比較有趣的部份是用到了 /dev/urandom 這個東西。根據我在維基百科找到的說明，它可以藉由對 entropy 的控管來產生真正的亂數。

/dev/urandom

接著 measure() 會根據不同的操作 (ih, it, rh, rt) 進行不同的測量，例如要測量 insert head 時會執行以下程式碼：











for (size_t i = drop_size; i < n_measure - drop_size; i++) {
    char *s = get_random_string();
    dut_new();
    dut_insert_head(
        get_random_string(),
        *(uint16_t *) (input_data + i * chunk_size) % 10000);
    before_ticks[i] = cpucycles();
    dut_insert_head(s, 1);
    after_ticks[i] = cpucycles();
    dut_free();
}

其中 dut_xxx() 模樣的函式都是為了測試而對原本的 q_xxx() 函式進行重新包裝的 macro 。

get_random_string() 則是取出一在 prepare_inputs() 時生成的隨機字串，每個字串長度 7 ，

cpucycles() 依據不同平台而有不同定義，在 i386 及 x86_64 架構下定義為

    unsigned int hi, lo;
    __asm__ volatile("rdtsc\n\t" : "=a"(lo), "=d"(hi));
    return ((int64_t) lo) | (((int64_t) hi) << 32);

不知道中間那一行是什麼，只覺得怎麼看都不像 C 語言，查詢 GNU C doc ，了解到這是內嵌組合語言的語法，並且它的 qualifier 是要放在後面，所以這個 volatile 才會長成這樣子。

這篇文章提到了：

在intel pentium以上級別的cpu中，有一個稱為“時間戳（time stamp）”的部件，它以64位無符號整型數的格式，記錄了自cpu上電以來所經過的時鐘週期數。
在pentium以上的cpu中，提供了一條機器指令rdtsc（read time stamp counter）來讀取這個時間戳的數字，並將其儲存在edx:eax暫存器對中。

所以 cpucycles() 使用兩個變數儲存從 register 取得的值，再用 bitwise 的方式合成並以 int64_t 的格式回傳。

不過我也有看到文章表示這樣的用法已經不合時宜：

多核時代不宜再用 x86 的 RDTSC 指令測試指令週期和時間

至此 cpucycles() 的功能已大致理解，接著繼續探討 measure() 留下的兩個問題：

為何要先執行多次 dut_insert_head() 且不量測 cpu cycle ，之後再獨立執行 1 次 dut_insert_head() 來進行測量呢？
為何迴圈的頭尾要去掉兩段 drop_size ？

針對 1. ，我發現不管是要 measure 哪一種操作，前面都固定是 dut_insert_head() ，內容如上方的 4 至 6 行。不同操作間改變的僅有兩個 cpucycles() 間的部份。

完整的 measure() 函式

void measure(int64_t *before_ticks,
             int64_t *after_ticks,
             uint8_t *input_data,
             int mode)
{
    assert(mode == test_insert_head || mode == test_insert_tail ||
           mode == test_remove_head || mode == test_remove_tail);

    switch (mode) {
    case test_insert_head:
        for (size_t i = drop_size; i < n_measure - drop_size; i++) {
            char *s = get_random_string();
            dut_new();
            dut_insert_head(
                get_random_string(),
                *(uint16_t *) (input_data + i * chunk_size) % 10000);
            before_ticks[i] = cpucycles();
            dut_insert_head(s, 1);
            after_ticks[i] = cpucycles();
            dut_free();
        }
        break;
    case test_insert_tail:
        for (size_t i = drop_size; i < n_measure - drop_size; i++) {
            char *s = get_random_string();
            dut_new();
            dut_insert_head(
                get_random_string(),
                *(uint16_t *) (input_data + i * chunk_size) % 10000);
            before_ticks[i] = cpucycles();
            dut_insert_tail(s, 1);
            after_ticks[i] = cpucycles();
            dut_free();
        }
        break;
    case test_remove_head:
        for (size_t i = drop_size; i < n_measure - drop_size; i++) {
            dut_new();
            dut_insert_head(
                get_random_string(),
                *(uint16_t *) (input_data + i * chunk_size) % 10000);
            before_ticks[i] = cpucycles();
            element_t *e = q_remove_head(l, NULL, 0);
            after_ticks[i] = cpucycles();
            if (e)
                q_release_element(e);
            dut_free();
        }
        break;
    case test_remove_tail:
        for (size_t i = drop_size; i < n_measure - drop_size; i++) {
            dut_new();
            dut_insert_head(
                get_random_string(),
                *(uint16_t *) (input_data + i * chunk_size) % 10000);
            before_ticks[i] = cpucycles();
            element_t *e = q_remove_tail(l, NULL, 0);
            after_ticks[i] = cpucycles();
            if (e)
                q_release_element(e);
            dut_free();
        }
        break;
    default:
        for (size_t i = drop_size; i < n_measure - drop_size; i++) {
            dut_new();
            dut_insert_head(
                get_random_string(),
                *(uint16_t *) (input_data + i * chunk_size) % 10000);
            before_ticks[i] = cpucycles();
            dut_size(1);
            after_ticks[i] = cpucycles();
            dut_free();
        }
    }
}

先往後看，在執行完 measure() 之後要執行的是 differentiate() ，它會把 after_ticks 和 before_ticks 相減儲存至 exec_time() ，就是數學上差分的概念。

接著是 update_statistics()

根據註解的提點， Welford's online algorithm