2024q1 Homework1 (lab0)

contributed by < yourui1017 >

Reviewed by `yu-hsiennn`

可以利用 list.h 裡面的 API 來實作，以精簡呈現出來的程式碼。

開發環境

$ gcc --version
(Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   39 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          12
On-line CPU(s) list:             0-11
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
CPU family:                      6
Model:                           158
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
Stepping:                        10
CPU max MHz:                     4100.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4399.99
Virtualization:                  VT-x
Caches (sum of all):     
L1d:                             192 KiB (6 instances)
L1i:                             192 KiB (6 instances)
L2:                              1.5 MiB (6 instances)
L3:                              9 MiB (1 instance)
NUMA:                    
NUMA node(s):                    1
NUMA node0 CPU(s):               0-11

針對佇列的操作

重要的結構體

struct list_head {
    struct list_head *prev;
    struct list_head *next;
};

typedef struct {
    char *value;
    struct list_head list;
} element_t;

q_new

struct list_head *q_new()
{
    struct list_head *head = malloc(sizeof(struct list_head));
    if (!head)
        return NULL;
    INIT_LIST_HEAD(head);
    return head;
}

allocate 的翻譯是「配置」，以區格 dispatch (分發/分配) 的翻譯。

已更改

首先用 malloc 分配配置計憶體大小為 struct list_head 的空間，並根據 C 語言規格書[7.20.3]，如果記憶體空間不足以配置，則函式應該回傳 NULL 。

If the space cannot be allocated, a null pointer is returned

相反的，如果記憶體空間足夠配置，則使用 list.h 的參考函式INIT_LIST_HEAD 函式初始化 head 並且回傳 head 。

q_free

void q_free(struct list_head *head) {
    if(!head)
        return;
    element_t *entry, *safe;
    list_for_each_entry_safe(entry, safe, head, list){
        free(entry->value);
        free(entry);
    }
    free(head);
}

先判斷 head 是否為 NULL ，再使用 list.h 的list_for_each_entry_safe 函式走訪每個 entry ，並考量到 value 可能會以動態記憶體宣告，因此先將 char * 變數型別存放的記憶體釋放，再釋放 entry 的記憶體。

改善漢語表達，上述說法不精確。

q_insert_head

bool q_insert_head(struct list_head *head, char *s)
{
    if (!head)
        return false;
    
    element_t *new = malloc(sizeof(element_t));
    if(!new)
        return false;
-
-   new->value = s;
+   int str_size = sizeof(char) * (strlen(s) + 1);
+   new->value = malloc(str_size);
+   if (!new->value) {
+       free(new);
+       return false;
+   }
+   strncpy(new->value, s, str_size - 1);
    list_add(&new->list, head);    
    return true;
}

原本的寫法沒有考慮到 value 沒有被 malloc 的情況，因此經過考慮後，先配置 element_t 的記憶體位置後再配置 value 的記憶體位置，並將 s 字串複製給予 value 。

HackMD 不是讓你張貼完整程式碼的地方，GitHub 才是！因此你在開發紀錄只該列出關鍵程式碼 (善用 diff 標示)，可附上對應 GitHub commit 的超連結，列出程式碼是為了「檢討」和「便於他人參與討論」，不是用來「假裝自己有付出」

在完成 q_insert_head 和 q_remove_head 兩個函式並使用 make test 測試，發現在 trace-01-ops 出現錯誤

+++ TESTING trace trace-01-ops:
# Test of insert_head and remove_head
ERROR: Removed value dolphinU��� != expected value dolphin
ERROR: Removed value bearU��� != expected value bear
ERROR: Removed value gerbilU��� != expected value gerbil

因此回頭檢查q_insert_head 和 q_remove_head 兩個函式，發現在q_insert_head 中的 strcpyn 的引數錯誤，引數更正如下：

-    strncpy(new->value, s, str_size - 1);
+    strncpy(new->value, s, str_size);

q_remove_head

element_t *q_remove_head(struct list_head *head, char *sp, size_t bufsize)
{
    if (!head || list_empty(head))
        return NULL;

    element_t *remove = list_first_entry(head, element_t, list);

    if (sp) 
        strncpy(sp, remove->value, bufsize - 1);
    list_del(&remove->list);
    return remove;
}

使用 list.h 的 list_first_entry 找到首個 element_t ，並將該 element_t 的 value 複製給 sp ，再將該 element_t 移走。

改進你的漢語表達。

q_delete_mid

bool q_delete_mid(struct list_head *head)
{
    if (!head || list_empty(head))
            return false;

    struct list_head **indir = &head->next;

    for (struct list_head *fast = head->next;
         fast != head && fast->next != head; fast = fast->next->next)
        indir = &(*indir)->next;
    struct list_head *del = *indir;
    list_del(del);
-   free(del);
+   free(list_entry(del, element_t, list)->value);
+   free(list_entry(del, element_t, list));
    return true;
}

參考你所不知道的 C 語言: linked list 和非連續記憶體 案例探討: Leetcode 2095. Delete the Middle Node of a Linked List使用快慢指標找到中間節點，再根據此節點找到 element_t 並釋放該 element_t 和 value的記憶體位置。

q_sort

原始版本的 mergeTwoLists 程式碼如下：

struct list_head *mergeTwoLists(struct list_head *L1, struct list_head *L2, bool descend) {
    struct list_head *head = NULL, **ptr = &head, **node;

    for (node = NULL; L1 && L2; *node = (*node)->next) {
        if (descend)
            node = (list_entry(L1, element_t, list)->value > list_entry(L2, element_t, list)->value) ? &L1: &L2;
        else
            node = (list_entry(L1, element_t, list)->value < list_entry(L2, element_t, list)->value) ? &L1: &L2;
        
        *ptr = *node;
        ptr = &(*ptr)->next;
    }
    *ptr = (struct list_head *)((uintptr_t) L1 | (uintptr_t) L2);
    return head;
}

但因為再跑 make test 時，一直解決不了在trace-03-ops出現的報錯如下：

+++ TESTING trace trace-03-ops:
# Test of insert_head, insert_tail, remove_head, reverse and merge
ERROR: Not sorted in ascending order
ERROR: Not sorted in ascending order
ERROR: Not sorted in ascending order
ERROR: Removed value a != expected value z
ERROR: Removed value b != expected value r
Error limit exceeded.  Stopping command execution

由於多次的檢查都認為程式碼的邏輯是正確的，因此參考 Risheng1128 Hackmd ，發現該同學的程式碼中有 strcmp 語法，才得知在 make test 時輸入的會是字串而不是數字，因而針對程式碼進行更正。

strcmp 不是語法 (syntax)，謹慎用詞！

更正的部份如下：

+        element_t *L1_entry = list_entry(L1, element_t, list);
+        element_t *L2_entry = list_entry(L2, element_t, list);
         if (descend)
-            node = (list_entry(L1, element_t, list)->value >
-                    list_entry(L2, element_t, list)->value)
-                       ? &L1
-                       : &L2;
+            node = (strcmp(L1_entry->value, L2_entry->value) > 0) ? &L1 : &L2;
         else
-            node = (list_entry(L1, element_t, list)->value <
-                    list_entry(L2, element_t, list)->value)
-                       ? &L1
-                       : &L2;
+            node = (strcmp(L1_entry->value, L2_entry->value) < 0) ? &L1 : &L2;

q_reverseK, q_descend

原始的程式碼如下：

void q_reverseK(struct list_head *head, int k)
{
    // https://leetcode.com/problems/reverse-nodes-in-k-group/
    if (!head || list_empty(head) || list_is_singular(head))
        return;

    bool flag = false;
    struct list_head **ptr = &head;
    struct list_head **tmp = &head;
    while (true) {
        for (int i = 0; i < k; i++) {
            ptr = &(*ptr)->next;
            if (*ptr == head) {
                flag = true;
                break;
            }
        }
        if (flag)
            break;

        ptr = tmp;
        for (int j = 0; j < k; j++) {
            list_move((*ptr)->next, (*tmp));
            ptr = &(*ptr)->next;
        }
        tmp = ptr;
    }
}

int q_descend(struct list_head *head)
{
    if (!head || list_empty(head))
        return 0;

    struct list_head *cur = head->prev->prev, *tmp = head->prev;
    int count = 1;
    while (cur != head) {
        count++;
        int comp = strcmp(list_entry(cur, element_t, list)->value,
                          list_entry(tmp, element_t, list)->value);
        if (comp <= 0) {
            struct list_head *del = cur;
            cur = cur->prev;
            list_del(del);
            q_release_element(list_entry(del, element_t, list));
            count--;
        } else
            cur = cur->prev;
    }
    // https://leetcode.com/problems/remove-nodes-from-linked-list/
    return 0;
    return count;
}

但因為再跑 make test 時，發現在trace-06-ops出現的報錯如下：

+++ TESTING trace trace-06-ops:
# Test of insert_head, insert_tail, delete duplicate, sort, descend and reverseK
ERROR: Removed value c != expected value d
ERROR: Removed value d != expected value c
ERROR: Removed value c != expected value b
ERROR: Removed value b != expected value a
ERROR: Removed value c != expected value a
Error limit exceeded.  Stopping command execution

經過檢查發現是 q_reverseK 和 q_descend 函式的邏輯錯誤。
釐清整體的架構後，更正 q_reverseK 函式的指標使用方法;更正 q_descend 函式中最小值，修正邏輯錯誤。

更正的部份如下：

q_reverseK

     bool flag = false;
-    struct list_head **ptr = &head;
-    struct list_head **tmp = &head;
+    struct list_head *ptr = head;
+    struct list_head *tmp = head;
     while (true) {
         for (int i = 0; i < k; i++) {
-            ptr = &(*ptr)->next;
-            if (*ptr == head) {
+            ptr = ptr->next;
+            if (ptr == head) {
                 flag = true;
                 break;
             }

q_descend

             q_release_element(list_entry(del, element_t, list));
             count--;
-        } else
+        } else {
+            tmp = cur;
             cur = cur->prev;
+        }

研讀論文 Dude, is my code constant time?

本篇論文測量執行時間的方式是對執行時間進行洩漏偵測測試。

步驟一:測量執行時間

Classes definition : dudect 會輸入 fix-vs-random 兩種不同類別的資料並且使用 leakage detection 進行測試。
Cycle counters : 使用CPU的 cycle counter 計算實際的執行時間。
Environmental conditions : 隨機選擇資料類別，希望減少環境改變對結果的影響。

步驟二:後處理

Cropping : 執行時間較久測資很有可能會有偏差，因此把超過時間 threshold 的測量刪除。
Higher-order preprocessing : 使用CJRR99 的Higher-order preprocessing。

步驟三:靜態測試

t-test : 透過 Welch’s t-test ，輸入兩筆測試資料，確認兩筆資料的執行時間分布是否有違反 null hypothesis: "the two timing distributions are equal" 。如果檢測失敗代表執行時間不是常數時間。
Non-parametric tests : 使用 Non-parametric tests 針對這些未知分布做更穩健的測試。

trace-17-ops

有鑑於 trace-17-ops 一直報錯，因此追蹤 make test 給定的命令檔如下：

# Test if time complexity of q_insert_tail, q_insert_head, q_remove_tail, and q_remove_head is constant
option simulation 1
it
ih
rh
rt
option simulation 0

可以發現它會將 simulation flag 設成1，並執行 it 等等指令。

因此再追蹤到 qtest.c 的 queue_insert 中，找到 qtest.c 會呼叫 is_insert_tail_const() ，最後追蹤到 fixture.c 會根據論文 Dude, is my code constant time? 進行執行時間為 constant time 的驗證。

在作業要求有提到 oreparaz/dudect 的程式碼具備 percentile 的處理，但在 lab0-c 則無，因此比較兩者差異。

在比較完 fixture.c 和 oreparaz/dudect 可以發現 lab0-c 並沒有考慮到論文 Dude, is my code constant time? 中，步驟二後處理的 CROP ，因此將 oreparaz/dudect中的 percentile 在 fixture 中實做。

其中要特別注意到 prepare_percentiles 只能夠做一次。

/*
 set different thresholds for cropping measurements.
 the exponential tendency is meant to approximately match
 the measurements distribution, but there's not more science
 than that.
*/

根據 oreparaz/dudect 在 prepare_percentiles 的註解中，可以知道 prepare_percentiles 會給定要被去除的 threshole ，因此必須保持 threshold 固定。

最後 doit 就會如以下的程式碼：

    prepare_inputs(input_data, classes);
    bool ret = measure(before_ticks, after_ticks, input_data, mode);
    bool first_time = percentiles[NUMBER_PERCENTILES - 1] == 0;
    differentiate(exec_times, before_ticks, after_ticks);
    if (first_time) {
        // throw away the first batch of measurements.
        // this helps warming things up.
        prepare_percentiles(percentiles, exec_times);
    } else {
        update_statistics(exec_times, classes, percentiles);
        ret &= report();
    }

引入 Linux 核心原始碼到 lab0-c

參考 Risheng1128 將list_sort 引入 lab0-c。

在 Makefile 中新增 compare 執行 traces/trace-sort.cmd

traces/trace-sort.cmd 的內容如下：
可藉由調整 Rand 調整資料的數量。

option fail 0
option malloc 0
new
ih RAND 100000/300000/500000
time
sort
time

輸入指令

make compare

結果如下：

資料總數	q_sort() 測試1(s)	q_sort() 測試2(s)	q_sort() 測試3(s)	q_sort() 測試4(s)	q_sort() 測試5(s)
100000	0.08	0.076	0.078	0.075	0.076
300000	0.302	0.309	0.316	0.309	0.313
500000	0.614	0.585	0.578	0.576	0.585

資料總數	list_sort() 測試1(s)	list_sort() 測試2(s)	list_sort() 測試3(s)	list_sort() 測試4(s)	list_sort() 測試5(s)
100000	0.052	0.053	0.053	0.056	0.052
300000	0.205	0.207	0.206	0.202	0.204
500000	0.383	0.372	0.374	0.375	0.374

資料總數	q_sort() 平均(s)	list_sort() 平均(s)	效能比較(%)
100000	0.077	0.053	31.16%
300000	0.310	0.205	33.87%
500000	0.588	0.375	36.22%

可以看到 list_sort() 的效能比起 q_sort() 還要好，且也可以觀察到隨著資料總數大小增加，效能的差距也變得愈加明顯。

接下來使用 Linux 效能分析工具: Perf 比較效能細項的部份。

使用更改kernel權限

sudo sysctl kernel.perf_event_paranoid=<parameter>

輸入指令

perf stat --repeat 10 -e cache-misses,cache-references,instructions,cycles ./qtest -f traces/trace-sort.cmd

讓它根據 trace-sort.cmd (此處的 Rand 為500000)針對 cache-misses 、 cache-references 、 instructions 、 cycles 做10次測試並且輸出平均值。

結果如下：

	q_sort	list_sort
cache-misses	56,154,930	45,669,350
cache-references	102,054,812	79,353,435
instructions	2,414,610,171	2,394,503,043
cycles	4,502,998,337	3,601,189,041
insn per cycle	0.54	0.66

可以看到不管是 cache-misses 、 cache-references 、 instructions 還是 cycles ， list_sort() 都是優於 q_sort()。

在 Linux 核心專題: 改進 lib/list_sort.c 中提及可以使用 bottom-up 的方法改善 cache-thrshing 的問題，以加快速度。

因為不懂 top-down 和 bottom-up 的實作差別，所以參考了動態規劃(Dynamic Programming)，裡面提及 top-down 和 bottom-up 的差別，且通常 top-down 會使用遞迴來實作，而 bottom-up 通常使用迭代實作，因此我又開始疑惑，所以遞迴的速度一定會比迭代還要慢嗎？

又參考了關於遞迴加快速度的迷思？才終於對遞迴和迭代有了一點點的概念。

接著，參考 Merge Sort 與它的變化使用了迭代的方式完成 Merge sort，並且發現，因為程式碼實作的關係，遞迴版的 Merge sort 比起迭代版的 Merge sort有更多的 Cache miss，導致速度變慢。

參考自 csotaku0926 同學。

bottom-up 的過程就是一直合併，cache 元素參照的機會更大。
top-down 涉及 parition，這麼做會導致元素不斷從 cache 進出，而導致 cache thrashing 的問題。

TODO：嘗試實作 Timsort 演算法，引入 lab0-c 與 list_sort 進行比較。

Timsort 內容請看 2024q1 Homework2 (quiz1+2) 。

參考資料

CPU Cache 原理探討