2025q1 Homework1 (lab0)

contributed by Potassium-chromate

作業書寫規範:

無論標題和內文中，中文和英文字元之間要有空白字元 (對排版和文字搜尋有利)
文字訊息 (尤其是程式執行結果) 請避免用圖片來表示，否則不好搜尋和分類
共筆書寫請考慮到日後協作，避免過多的個人色彩，用詞儘量中性
不要在筆記內加入 [TOC] : 筆記左上方已有 Table of Contents (TOC) 功能，不需要畫蛇添足
不要變更預設的 CSS 也不要加入任何佈景主題: 這是「開發紀錄」，用於評分和接受同儕的檢閱
當在筆記中貼入程式碼時，避免非必要的行號，也就是該手動將 c= 或 cpp= 變更為 c 或 cpp。行號只在後續討論明確需要行號時，才要出現，否則維持精簡的展現。可留意「你所不知道的 C 語言: linked list 和非連續記憶體」裡頭程式碼展現的方式
HackMD 不是讓你張貼完整程式碼的地方，GitHub 才是！因此你在開發紀錄只該列出關鍵程式碼 (善用 diff 標示)，可附上對應 GitHub commit 的超連結，列出程式碼是為了「檢討」和「便於他人參與討論」
留意科技詞彙的使用，請參見「資訊科技詞彙翻譯」及「詞彙對照表」
不要濫用 :::info, :::success, :::warning 等標示，儘量用清晰的文字書寫。:::danger 則僅限授課教師作為批注使用
避免過多的中英文混用，已有明確翻譯詞彙者，例如「鏈結串列」(linked list) 和「佇列」(queue)，就使用該中文詞彙，英文則留給變數名稱、人名，或者缺乏通用翻譯詞彙的場景
在中文敘述中，使用全形標點符號，例如該用「，」，而非 ","。注意書名號的使用，即 〈 和 〉，非「小於」和「大於」符號
避免使用不必要的 emoji 字元

開發環境

$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

$ lscpu
eason@eason-System-Product-Name:~$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   16
  On-line CPU(s) list:    0-15
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i5-14400F
    CPU family:           6
    Model:                183
    Thread(s) per core:   2
    Core(s) per socket:   10
    Socket(s):            1
    Stepping:             1
    CPU max MHz:          4700.0000
    CPU min MHz:          800.0000
    BogoMIPS:             4992.00
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge m
                          ca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 s
                          s ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc 
                          art arch_perfmon pebs bts rep_good nopl xtopology nons
                          top_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq 
                          dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 
                          xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_d
                          eadline_timer aes xsave avx f16c rdrand lahf_lm abm 3d
                          nowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_
                          enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsb
                          ase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed
                           adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsa
                          vec xgetbv1 xsaves split_lock_detect user_shstk avx_vn
                          ni dtherm ida arat pln pts hwp hwp_notify hwp_act_wind
                          ow hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg
                           gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_
                          clear serialize arch_lbr ibt flush_l1d arch_capabiliti
                          es
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    416 KiB (10 instances)
  L1i:                    448 KiB (10 instances)
  L2:                     9.5 MiB (7 instances)
  L3:                     20 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-15

實作指定佇列操作

在撰寫程式的過程中，我注意到結構體的定義方式有兩種，並開始思考為何會有所差異。一種是使用 typedef struct ，並在結尾指定名稱，例如 element_t ；另一種是直接使用 struct 定義名稱，例如 list_head。具體範例如下：

//在queue.h中定義的結構
typedef struct {
    char *value;
    struct list_head list;
} element_t;

//在list.h中定義的結構
struct list_head {
    struct list_head *prev;
    struct list_head *next;
};

為了解答疑惑，首先看看 C99 是如何描述 typedef:

C99 (6.7.1.3)

The typedef specifier is called a ‘‘storage-class specifier’’ for syntactic convenience
only; it is discussed in 6.7.7. The meanings of the various linkages and storage durations were discussed in 6.2.2 and 6.2.4.

從 C99 規格書中的描述中，我們可以得知 typedef 實際上與儲存持續時間或連結無關——它純粹用於定義新類型名稱（別名）。而在 6.7.7 有提到更詳細說明，如下：

C99 (6.7.7.3)

In a declaration whose storage-class specifier is typedef, each declarator defines an
identifier to be a typedef name that denotes the type specified for the identifier in the way described in 6.7.5. Any array size expressions associated with variable length array declarators are evaluated each time the declaration of the typedef name is reached in the order of execution. A typedef declaration does not introduce a new type, only a synonym for the type so specified. That is, in the following declarations:
typedef T type_ident;
type_ident D;
type_ident is defined as a typedef name with the type specified by the declaration specifiers in T (known as T), and the identifier in D has the type ‘‘derived-declaratortype-list T’’ where the erived-declarator-type-list is specified by the declarators of D. A typedef name shares the same name space as other identifiers declared in ordinary declarators.

可以發現在規格書中給的範例中，typedef 告訴編譯器 type_ident 不是變數而是型別別名。的 type_ident 的實際類型會由 T 決定。並且從說明中可以發現，如果 T 中包含可變長度的陣列，則每次 type_ident 被呼叫時都會重新評估陣列大小。

分配記憶體方式

在先前的分析中，我們了解到 element_t 已經透過 typedef 定義了別名，因此在分配記憶體時，可以直接在 sizeof() 中使用其別名。而對於 list_head，由於沒有經過 typedef 建立別名，因此需要在 sizeof() 中明確指定為 struct list_head。

正確的記憶體分配方式：

對於 element_t 由於 element_t 是 typedef 定義的別名，因此可以直接使用：
```
element_t *new_node = malloc(sizeof(element_t));
```
對於 list_head 由於 list_head 沒有透過 typedef 定義別名，因此需要加上 struct：
```
struct list_head *new_node = malloc(sizeof(struct list_head));
```

錯誤的記憶體分配方式：

在 element_t 上使用 struct element_t 是別名，無需加 struct。如果硬加上，會導致編譯錯誤：
錯誤訊息：

queue.c:38:49: error: invalid application of ‘sizeof’ to incomplete type ‘struct element_t’
38 |      struct element_t *new_node = malloc(sizeof(struct element_t));
  |                                                 ^~~~~~

這是因為編譯器無法找到名為 struct element_t 的完整定義。

在 list_head 上省略 struct 如果省略 struct，也會導致編譯錯誤，因為 list_head 並未設置別名：
```
list_head *new_node = malloc(sizeof(list_head));
```
錯誤訊息：
```
error: unknown type name ‘list_head’
```

`q_free`

以下是我一開始的的撰寫方式：

void q_free(struct list_head *head) {
   while(head){
       struct list_head *temp = head;
       if(head->next){
           head = head->next;
           head->prev = NULL;
       }
       free(temp);
   }
   return;
}

可以發現會出現以下三點問題:

由於每個 node 是以 element_t 的結構存在，因此單純釋放 list_head 結構的記憶體會出問題，正確的作法是先用container_of (已被巨集取代成 list_entry )來先找出 element_t 的地址後，在進行記憶體釋放。
element_t 中的 value 由於有經過 malloc 分配記憶體所以也要被釋放。
從上面的程式碼可以發現，進行到最後一個節點時 head->next 會指向最一開始的頭，但是實際上該點先前已經被釋放掉，因此最後會出現free(NULL)的情況。

正確的實做方式可以參考: Commit b62717d

`q_insert_head` / `q_insert_tail`

起初，我的想法是直接複製 q_insert_head 的邏輯，然後再進行微幅的修改。然而，我發現可以直接利用 q_insert_head 來簡化 q_insert_tail 的實作，如下所示：

bool q_insert_tail(struct list_head *head, char *s)
{   
    return q_insert_head(head->prev, s);
}

這樣的實作方式避免了重複程式碼，同時還提高了程式的可讀性與維護性。

完整程式碼可以參考: Commit 0ad386a

`q_remove_head` / `q_remove_tail`

具體的開發過程與 q_insert_head / q_insert_tail 相同，q_remove_tail 同樣是可以重複利用 q_ㄐemove_head 來簡化實做過程。

完整程式碼可以參考: Commit 0ad386a

`q_delete_mid`

本函式的核心在於如何尋找鏈結串列的中點。

常見方法：使用快慢指標，快指標一次走兩步，慢指標一次走一步，當快指標到達尾端時，慢指標正好指向中間節點。
本函式的方法：由於這是雙向鏈結串列，可以直接用兩個指標從兩端向外移動，這樣無需額外的快慢指標，簡化了程式邏輯。

bool q_delete_mid(struct list_head *head)
{
    if (!head || list_empty(head)) {
        return false;
    }
    struct list_head *first = head->next;
    struct list_head *last = head->prev;
    while (first != last && first->next != last) {
        first = first->next;
        last = last->prev;
    }
    list_del(last);
    free(list_entry(last, element_t, list)->value);
    free(list_entry(last, element_t, list));
    return true;
}

q_reverse

具體的作法為：

遍歷鍊結串列，交換節點指標：
- 以 node 作為當前節點，tmp 作為下一個節點，以確保不丟失鍊結串列結構。
- 交換 node->next 和 node->prev，然後移動到下一個節點。
調整 head 的 next 和 prev：
- head->next 原本指向舊的第一個節點，head->prev 指向舊的最後一個節點。
- 反轉後，這兩者需要對調，以確保 head 正確連接新鍊結串列的頭部與尾部。

void q_reverse(struct list_head *head)
{
    if (!head || list_empty(head) || list_is_singular(head)) {
        return;
    }

    struct list_head *cur, *tmp;
    list_for_each_safe(cur, tmp, head) {
        struct list_head *swap = cur->next;
        cur->next = cur->prev;
        cur->prev = swap;
    }

    struct list_head *swap = head->next;
    head->next = head->prev;
    head->prev = swap;
}

完整程式碼可以參照：

Commit 5c4f5a1: reverse
Commit 736c314: reverseK

q_sort

一共會用到兩個輔助函式，來完成merge sort

mergeSortList：負責 Divide & Conquer，將鏈表拆分並排序。
merge：負責合併兩條已排序的鏈表。

遇到的問題

mergeSortList 僅處理 next 指標，但未維護 prev，導致排序後的鏈表缺失 prev 連結，不是完整的雙向鏈表。
缺少 head 的正確連接，可能導致 head->prev 失效，進而發生 Segmentation Fault。

解決方案

修復 prev 指標，確保雙向鏈表結構完整
- mergeSortList 不會處理 prev，因此需手動遍歷重新設置 prev，確保每個節點的前向指標正確。
確保 head 連接完整
- 重新找到排序後的最後一個節點 (tail)，確保 head 和 tail 正確連接，使鏈表仍然是循環雙向鏈表。

void q_sort(struct list_head *head, bool descend)
{
    if (!head || list_empty(head) || list_is_singular(head))
        return;
    // break the doubly linked structure
    struct list_head *temp = head;
    head->prev->next = NULL;
    head->next = mergeSortList(head->next, descend);

+    struct list_head *curr = head->next;
+
+    while (curr->next != NULL) {
+        curr->prev = temp;
+        temp = curr;
+        curr = curr->next;
+    }
+    curr->prev = temp;
+    curr->next = head;
+    head->prev = curr;
}

q_merge

遇到的問題

與 q_sort 類似，q_merge 在 merge 過程後未正確維護雙向鏈表 (prev 指標)，導致 Segmentation Fault。

解決方案
解決方案也與 q_sort 的相同，最後在遍歷整個連結串列手動連接 prev 即可。

int q_merge(struct list_head *head, bool descend)
{
    if (!head || list_empty(head))
        return 0;
    else if (list_is_singular(head))
        return q_size(list_first_entry(head, queue_contex_t, chain)->q);

    queue_contex_t *node = list_entry(head->next, queue_contex_t, chain);
    node->q->prev->next = NULL;
    struct list_head *temp = node->chain.next;
    for (int i = 0; i < ((node->size) - 1); i++) {
        queue_contex_t *next_node = list_entry(temp, queue_contex_t, chain);
        next_node->q->prev->next = NULL;
        node->q->next = merge(node->q->next, next_node->q->next, descend);
        INIT_LIST_HEAD(next_node->q);
        temp = temp->next;
    }

    struct list_head *curr = node->q->next;
    struct list_head *pre = node->q;
+    while (curr->next != NULL) {
+        curr->prev = pre;
+        pre = curr;
+        curr = curr->next;
+    }
+    curr->prev = pre;
+    curr->next = node->q;
+    node->q->prev = curr;

    return q_size(node->q);
}

論文閱讀：Dude, is my code constant time?

Introduction

本文討論了定時攻擊（Time Attack）問題，這是一種旁道攻擊，攻擊者通過測量加密實現的執行時間來提取敏感訊息，如金鑰或密碼。常見的例子是密碼比對，如果採用逐字比對，當遇到第一個不同的字母時會立即回傳錯誤。這種方式容易被駭客利用，因為駭客可以通過測量回傳時間來推測已經匹配的字母數量。正確的做法應該是將密碼比對至最後一個字母，從而避免時間差異泄露敏感信息。

目前有多種工具可用於偵測可變時間程式碼，如ctgrind、Flow-tracker和ctverif，但這些工具通常需要準確的硬體模型才能正常運行。由於CPU製造商提供的資料有限，且微程式碼更新可能會改變行為，這使得工具的運行變得具有挑戰性。

本文介紹了一種新工具——dudect，它旨在檢測特定平台上給定程式碼片段的執行時間是否恆定。與靜態分析或硬體建模不同，dudect通過執行時序測量並進行統計分析來偵測時間變化。這樣就可以繞過了硬體建模的複雜性，並提供一種更直接、簡便的方式來評估加密函數在目標平台上的執行時間是否恆定。

Approach

論文使用的方法旨在透過測量兩個不同輸入資料類別的執行時間並分析這些類別的時序分佈在統計上是否不同來檢測加密函數中的時序洩漏。其過程可以分為三步驟。

第一步：測量執行時間

類別定義：

將輸入資料分為兩類以進行檢測。常見的方法是固定與隨機測試，其中一類輸入固定為恆定值，另一類輸入隨機值。

Cycle 計數器：

現代CPU 提供可用於精確計時測量的週期計數器（例如，x86 CPU 中的TSC 或ARM 中的systick）。對於低階處理器，可以使用高解析度計時器或外部裝置。

環境條件：

為了減少環境因素的影響，在進行測量之前，測量值會被隨機分配到兩個輸入類別之一。這有助於確保結果不會因外部變數而產生偏差。

第二步：後處裡

Cropping:
執行時間的分佈通常顯示正偏差( positive skew )，可能會由於作業系統中斷進程執行導致時間較長。 cropping 可以去除極值，以提高檢測的準確性。
Higher-order preprocessing:
根據所使用的統計測試，可以應用更先進的預處理技術，例如中心乘積（模擬高階差分功率分析或 DPA）。這些方法捕捉更細微的洩漏模式。

這部分就偏統計了，對於統計我是完全不懂。關於DPA，目前只知道是量測電路或設備的功耗來分析其行為。

第三步：統計檢驗

收集和處理資料後，將應用統計測試來確定兩個類別的時間分佈是否顯著不同。

t 檢定：

可以使用 t 檢定等基本檢定來比較兩個分佈的平均值。原假設失敗（即分佈相等）表示存在時序洩漏，這意味著平均執行時間會根據輸入而變化。

關於t-檢定，可以參考: 數據檢定之t檢定

t 檢定與變異數的關聯:
t 檢定的假設之一是樣本來源的總體分佈具有某種變異程度。變異數是量化這種分佈特徵的核心，當變異數較大時，代表數據的內部波動性較強，可能會降低檢定結果的顯著性。

簡單來說，變異數的引入是為了確保 t 檢定能準確評估差異的顯著性，避免因忽略數據波動性而得出誤導性結論。

非參數檢定：

Kolmogorov-Smirnov是一種非參數檢驗，可用於更一般的情況，因為它不會假設時序資料的特定分佈。雖然它可能速度較慢並且需要更多樣本，但它的優點是對不同的資料分佈更穩健。

改進lab-0c

在進行測試時，發現最後一個 constant time 測試無法通過。一開始推測可能是在分配字串記憶體時，因字串長度不同導致時間行為存在差異，使其時間複雜度呈現

O (1)

而非理想的常數時間。後來查詢 queue.h 發現，字串的最大長度已限定為 1024 字節。於是，在執行 q_insert_head 和 q_insert_tail 時，統一分配 1024 字節的記憶體空間。此外，考慮到不同平台在記憶體分配上的時間行為存在差異，我採用了 calloc 函數來初始化記憶體值，期望統一操作的時間複雜度為

O (1024)

O (1)

，從而提升整體操作的穩定性，試圖達成常數時間的目標。然而，測試結果依然未能通過。具體實現如下：

element_t *new_node = calloc(1, sizeof(element_t));
if(!new_node){
    printf("free\n");
    return false;
}
INIT_LIST_HEAD(&new_node->list);
new_node->value = calloc(1,1024);
if (!new_node->value) {
    free(new_node);
    printf("free\n");
    return false;
}
memcpy(new_node->value, s, 1024);
list_add(&new_node->list, head);
return true;

為了探究問題的根源，我們可以繼續查閱 dudect/fixture.c 中的程式碼。根據分析，update_statistics 函數會使用韋爾福德法來滾動式更新兩個 class 執行時間的平均值與變異數，並將這些統計數據記錄在 t_context_t 結構中。接著，透過 t_compute 函數計算 t 值，以判斷 class 0 和 class 1 是否存在顯著差異。

為了進一步調查，我對 update_statistics 進行了一些修改，將兩個 class 的執行時間額外保存至文字檔。然後，我利用 Python 繪圖工具將這些數據進行視覺化。繪圖結果顯示，僅從圖形上就可以清楚地觀察到兩者之間的顯著差異。經過後續的 t 檢定計算，結果顯示 t 值高達 65，這進一步證實了兩個 class 執行時間在統計上存在極其顯著的差異。

至於class 0 與 class 1之間的資料有何差異？在第一步：測量執行時間有提到，論文作者是用固定與隨機測試，其中一類輸入固定為恆定值，另一類輸入隨機值。在 constant.c 中的 prepare_inputs 可以發現其作法與之相同。當 class == 1 時，輸入資料為隨機值，反之當 class == 0 時，所有位元皆設為 0 。

for (size_t i = 0; i < N_MEASURES; i++) {
        classes[i] = randombit();
        if (classes[i] == 0)
            memset(input_data + (size_t) i * CHUNK_SIZE, 0, CHUNK_SIZE);
    }

在分析 fixture.c 中的去極值過程後，我發現原本的 measure 函數存在一些問題。具體來說，該函數會直接捨棄 DROP_SIZE 個最前面和最末尾的元素。然而，這樣的處理並未正確去除極值，因為 update_statistics 函數會處理整個 exec_times 陣列，這樣就無法有效過濾掉極端值。

為了解決這個問題，正確的做法應該是：

在 measure 之後，先保留完整長度的 exec_times 陣列，並對其進行排序。
接著，update_statistics 函數應該只處理排除前後 DROP_SIZE 個元素後的數據，這樣才能有效地去除極端值，保證統計結果的準確性。

bool ret = measure(before_ticks, after_ticks, input_data, mode);
differentiate(exec_times, before_ticks, after_ticks);
+ sort(exec_times);
update_statistics(exec_times, classes);
ret &= report();

bool measure(int64_t *before_ticks,
             int64_t *after_ticks,
             uint8_t *input_data,
             int mode)
{
    assert(mode == DUT(insert_head) || mode == DUT(insert_tail) ||
           mode == DUT(remove_head) || mode == DUT(remove_tail));

    switch (mode) {
    case DUT(insert_head):
-       for (size_t i = DROP_SIZE; i < N_MEASURES - DROP_SIZE; i++) {
+       for (size_t i = 0; i < N_MEASURES; i++) {

static void update_statistics(const int64_t *exec_times, uint8_t *classes)
{
-   for (size_t i = 0; i < N_MEASURES; i++) {
+   for (size_t i = DROP_SIZE; i < (N_MEASURES - DROP_SIZE); i++) {

修正完以上三處，即可通過最後一筆測試。

最後我們再來，看看去除極端值的數據分佈，可以發現結果相當不錯，數據幾乎一模一樣。

class 0: 19940 筆
class 1: 20100 筆
T-Statistic: -0.1832
P-Value: 8.5466e-01

實作 shuffle 命令

在開始實做前先要來了解老師在作業提供腳本是如何執行的。

subprocess 模組介紹
Python 3.5 引入了 subprocess 模組，允許 Python 腳本執行 shell 命令，並與外部程式進行互動。
指令執行方式
程式中包含以下命令：

command = './qtest -v 3'

這表示程式透過 qtest 介面來插入新節點並執行排序。

如何在 `qtest` 中新增 `shuffle` 命令？

在 qtest.c 中，所有命令都是透過巨集 ADD_COMMAND 來添加，而 ADD_COMMAND 會呼叫 add_cmd 來將命令加入 cmd_list。add_cmd 會按照字母順序插入命令，因此我們在 qtest 中新增 shuffle 命令的步驟如下：

實作 do_shuffle 函式
- do_shuffle 需要呼叫 queue.c 中的 q_shuffle 來執行隊列隨機打亂的邏輯。
使用 ADD_COMMAND 註冊 shuffle 命令
- 在 qtest.c 中新增 ADD_COMMAND(shuffle, "Shuffle queue", NULL);
- ADD_COMMAND 會展開為 add_cmd("shuffle", do_shuffle, "Shuffle queue", NULL);
- add_cmd 會依照字母順序將 shuffle 插入 cmd_list。

add_cmd 運作方式：

cmd_list 是一個鏈結串列，存放所有命令的資訊。
add_cmd 會遍歷 cmd_list，並按照命令名稱的字母順序插入新的命令。
每個命令對應一個函式（operation），當 qtest 執行該命令時，會調用對應的函式。

#define ADD_COMMAND(cmd, msg, param) add_cmd(#cmd, do_##cmd, msg, param)
void add_cmd(char *name, cmd_func_t operation, char *summary, char *param)
{
    cmd_element_t *next_cmd = cmd_list;
    cmd_element_t **last_loc = &cmd_list;
    while (next_cmd && strcmp(name, next_cmd->name) > 0) {
        last_loc = &next_cmd->next;
        next_cmd = next_cmd->next;
    }

    cmd_element_t *cmd = malloc_or_fail(sizeof(cmd_element_t), "add_cmd");
    cmd->name = name;
    cmd->operation = operation;
    cmd->summary = summary;
    cmd->param = param;
    cmd->next = next_cmd;
    *last_loc = cmd;
}

實做結果

以下是做完 10,000 次的結果：

Expectation:  416
Observation:  {'1234': 412, '1243': 393, '1324': 391, '1342': 435, '1423': 399, '1432': 443, '2134': 442, '2143': 428, '2314': 442, '2341': 419, '2413': 428, '2431': 408, '3124': 401, '3142': 445, '3214': 444, '3241': 381, '3412': 369, '3421': 435, '4123': 414, '4132': 408, '4213': 402, '4231': 398, '4312': 430, '4321': 433}
chi square sum:  26.39423076923077

shuffle_result

以下是做完 1,000,000 次的結果：

Expectation:  41666
Observation:  {'1234': 41600, '1243': 41734, '1324': 41298, '1342': 41622, '1423': 41622, '1432': 41409, '2134': 42013, '2143': 41370, '2314': 41702, '2341': 41579, '2413': 41454, '2431': 41598, '3124': 41715, '3142': 41556, '3214': 41803, '3241': 41972, '3412': 41932, '3421': 41665, '4123': 42055, '4132': 41875, '4213': 41596, '4231': 41712, '4312': 41326, '4321': 41792}
chi square sum:  24.286948591177456

shuffle_result

可以發現做到 1,000,000 次左右時，結果已經趨於穩定。

研讀 lib/list_sort.c 並嘗試改進

整合 list_sort.c 到 lab0-c

cmp 函式實作與 struct debug_el 差異

在撰寫 cmp 的過程中，我參考了在 linux kernel 的 linux/lib/test_list_sort.c 是如何實做的，可以發現與我最大的差別在於 struct debug_el 的使用。

static int cmp(void *priv, const struct list_head *a, const struct list_head *b)
{
+    struct debug_el *ela, *elb;

+    ela = container_of(a, struct debug_el, list);
+    elb = container_of(b, struct debug_el, list);

-    element_t *node_a = container_of(a, element_t, list);
-    element_t *node_b = container_of(b, element_t, list);
}

在 Linux Kernel 的 test_list_sort.c 中，原本的 cmp 函式使用 element_t，而我的實作則使用 debug_el。這讓我好奇 debug_el 在這個環境中的作用。

struct debug_el 的作用

在 linux/lib/test_list_sort.c 可以發現 debug_el 的定義。

struct debug_el {
	unsigned int poison1;
	struct list_head list;
	unsigned int poison2;
	int value;
	unsigned int serial;
};

該結構的記憶體排列方式如下：

| poison1 | list_head (next, prev) | poison2 | value | serial |

為何使用 poison1 和 poison2？

poison1 和 poison2 主要用於 檢查 list_head 是否有溢位，這對於調試（debugging）來說非常有用。
這些欄位通常被設定為 特殊的魔數 (magic number)，如 0xDEADBEEF，如果這些值在程式執行中被意外改變，則表示發生了 記憶體溢位 (buffer overflow) 或指標錯誤 (pointer corruption)。

在本次的實作中，沒有調試 (debugging) 需求，因此可以直接使用一般的結構，例如： element_t

隨機字元生成函式

在本次的實作中，我使用以下方法來生成隨機字元，字元範圍為 '0' (ASCII 48) 到 'z' (ASCII 122)：

//ascii range: 48 ~ 122
for (int i = 0; i < len; i++) {
        char key = rand() % (75) + 48;
        str[i] = key;
    }

實做細節可以參考 Commit 88f2468

list_sort 與 q_sort 的效能比較

以下為 q_sort 排序 45000 個節點，每個節點10個字元的結果。

 Performance counter stats for './linklisttest' (100 runs):

             17.15 msec task-clock                       #    0.986 CPUs utilized               ( +-  0.72% )
                 0      context-switches                 #    0.000 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
               759      page-faults                      #   44.266 K/sec                       ( +-  0.01% )
     <not counted>      cpu_atom/cycles/                                                        ( +- 26.16% )  (0.00%)
         6962,7495      cpu_core/cycles/                 #    4.061 GHz                         ( +-  0.32% )
     <not counted>      cpu_atom/instructions/                                                  ( +- 29.08% )  (0.00%)
       1,1811,1240      cpu_core/instructions/           #   15.31  insn per cycle              ( +-  0.53% )
     <not counted>      cpu_atom/branches/                                                      ( +- 29.92% )  (0.00%)
         2248,2911      cpu_core/branches/               #    1.311 G/sec                       ( +-  0.61% )
     <not counted>      cpu_atom/branch-misses/                                                 ( +- 28.35% )  (0.00%)
           46,9295      cpu_core/branch-misses/          #   22.77% of all branches             ( +-  0.93% )
             TopdownL1 (cpu_core)                 #     26.5 %  tma_backend_bound      
                                                  #     22.3 %  tma_bad_speculation    
                                                  #     21.5 %  tma_frontend_bound     
                                                  #     29.8 %  tma_retiring             ( +-  0.32% )

          0.017387 +- 0.000125 seconds time elapsed  ( +-  0.72% )

以下為 list_sort 排序 45000 個節點，每個節點10個字元的結果。

$ sudo perf stat -r 100 ./linklisttest

 Performance counter stats for './linklisttest' (100 runs):

             16.32 msec task-clock                       #    0.984 CPUs utilized               ( +-  1.35% )
                 1      context-switches                 #   61.268 /sec                        ( +-  7.85% )
                 0      cpu-migrations                   #    0.000 /sec                      
               758      page-faults                      #   46.442 K/sec                       ( +-  0.01% )
     <not counted>      cpu_atom/cycles/                                                        ( +- 16.66% )  (0.00%)
         6466,1161      cpu_core/cycles/                 #    3.962 GHz                         ( +-  1.12% )
     <not counted>      cpu_atom/instructions/                                                  ( +- 17.88% )  (0.00%)
       1,1780,4013      cpu_core/instructions/           #    8.02  insn per cycle              ( +-  1.02% )
     <not counted>      cpu_atom/branches/                                                      ( +- 17.95% )  (0.00%)
         2038,1859      cpu_core/branches/               #    1.249 G/sec                       ( +-  1.09% )
     <not counted>      cpu_atom/branch-misses/                                                 ( +- 17.86% )  (0.00%)
           48,9608      cpu_core/branch-misses/          #   10.77% of all branches             ( +-  2.18% )
             TopdownL1 (cpu_core)                 #     19.7 %  tma_backend_bound      
                                                  #     25.7 %  tma_bad_speculation    
                                                  #     21.9 %  tma_frontend_bound     
                                                  #     32.6 %  tma_retiring             ( +-  1.12% )

          0.016580 +- 0.000229 seconds time elapsed  ( +-  1.38% )

以下是我分別測試 5000, 10000, 15000 與 20000 個節點，使用q_sort 與 list_sort 排序的結果。

Valgrind + 自動測試程式

massif

Massif 是 Valgrind 的一個工具，專門用來分析 heap memory 使用情況。它可以追蹤記憶體配置的變化，並提供視覺化圖表來顯示程式執行時的記憶體用量。

Massif 會在不同時間點擷取記憶體快照 (snapshot)，以便分析記憶體使用趨勢：

. (Normal Snapshot)：程式執行過程中定期擷取的快照。
@ (Detailed Snapshot)：更詳細的快照，包含更深入的記憶體資訊。
# (Peak Snapshot)：記錄記憶體使用高峰，但僅在釋放 (deallocation) 發生後才會擷取。

以下是利用 massif 來觀察 make test 的記憶體使用量的結果。

$ valgrind --tool=massif --massif-out-file=massif.out make test
$ ms_print massif.out
--------------------------------------------------------------------------------


    KB
269.7^                                                                    #   
     |                                                          @         #   
     |                                                          @         #:::
     |                                                   :: ::::@:::::::::#:::
     |                              :  ::  ::  :::::::@::: :: : @: :::::::#:::
     |                    :@:::::::::::: ::: :::::::: @::: :: : @: :::::::#:::
     |         ::  ::@:::::@: ::: : :::: : : : :::::: @::: :: : @: :::::::#:::
     |        :::::: @::: :@: ::: : :::: : : : :::::: @::: :: : @: :::::::#:::
     |    :::::::: : @::: :@: ::: : :::: : : : :::::: @::: :: : @: :::::::#:::
     |    :: ::::: : @::: :@: ::: : :::: : : : :::::: @::: :: : @: :::::::#:::
     |  @@:: ::::: : @::: :@: ::: : :::: : : : :::::: @::: :: : @: :::::::#:::
     |  @ :: ::::: : @::: :@: ::: : :::: : : : :::::: @::: :: : @: :::::::#:::
     |  @ :: ::::: : @::: :@: ::: : :::: : : : :::::: @::: :: : @: :::::::#:::
     |  @ :: ::::: : @::: :@: ::: : :::: : : : :::::: @::: :: : @: :::::::#:::
     |  @ :: ::::: : @::: :@: ::: : :::: : : : :::::: @::: :: : @: :::::::#:::
     |  @ :: ::::: : @::: :@: ::: : :::: : : : :::::: @::: :: : @: :::::::#:::
     |  @ :: ::::: : @::: :@: ::: : :::: : : : :::::: @::: :: : @: :::::::#:::
     |  @ :: ::::: : @::: :@: ::: : :::: : : : :::::: @::: :: : @: :::::::#:::
     |  @ :: ::::: : @::: :@: ::: : :::: : : : :::::: @::: :: : @: :::::::#:::
     |  @ :: ::::: : @::: :@: ::: : :::: : : : :::::: @::: :: : @: :::::::#:::
   0 +----------------------------------------------------------------------->Mi
     0                                                                   14.12

Number of snapshots: 57
 Detailed snapshots: [2, 11, 16, 35, 42, 52 (peak)]

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1        269,644              136              120            16            0
  2        502,134          143,536          134,917         8,619            0
  3        890,232          169,560          156,753        12,807            0
  4      1,228,242          174,048          158,423        15,625            0
  5      1,462,847          179,432          161,463        17,969            0
  6      1,773,765          185,000          164,664        20,336            0
  7      1,950,742          193,504          172,314        21,190            0
  8      2,221,092          196,960          175,578        21,382            0
  9      2,445,901          191,136          169,658        21,478            0
 10      2,750,316          200,496          178,874        21,622            0
 11      3,147,652          197,984          176,122        21,862            0
 12      3,341,947          194,968          173,106        21,862            0
 13      3,605,482          198,344          176,338        22,006            0
 14      3,881,990          206,848          184,706        22,142            0
 15      4,243,741          211,544          189,138        22,406            0
 16      4,437,633          208,744          186,258        22,486            0

massif-visualizer

Massif-Visualizer 是一個獨立的 GUI 應用程式，專門用來讀取 massif.out.<pid> 檔案，並以圖形化方式顯示記憶體變化，可以更直觀地分析程式的記憶體使用趨勢。

觀測 make test 記憶體使用情況
在執行 Massif 來分析 make test 的記憶體使用情況後，會在目前的資料夾中生成一個記憶體報告檔：

valgrind --tool=massif --massif-out-file=massif.out make test

執行後，會在當前資料夾生成類似以下名稱的檔案：

massif.out.XXXXX

其中 XXXXX 是該執行程序的 Process ID (PID)。

使用 Massif-Visualizer 來視覺化記憶體變化
我們可以使用 massif-visualizer 來開啟這個報告，將記憶體變化呈現在圖表上：

massif-visualizer massif.out.XXXXX

比較亂數產生器 (Xorshift vs. 內建) 的品質

作業生成亂數的方式

在 qtest.c 中，我們可以找到 fill_rand_string 函式，其主要功能是產生隨機字串，其中的關鍵點在於 randombytes 是核心的亂數生成函式，randombytes 產生的隨機數被轉換為對應的字元，存入 buf 中。

static void fill_rand_string(char *buf, size_t buf_size)
{
    size_t len = 0;
    while (len < MIN_RANDSTR_LEN)
        len = rand() % buf_size;

    uint64_t randstr_buf_64[MAX_RANDSTR_LEN] = {0};
    randombytes((uint8_t *) randstr_buf_64, len * sizeof(uint64_t));
    for (size_t n = 0; n < len; n++)
        buf[n] = charset[randstr_buf_64[n] % (sizeof(charset) - 1)];

    buf[len] = '\0';
}

在 random.c 中，randombytes 的實作根據不同作業系統選擇適當的方法來產生隨機數。儘管程式碼看似複雜，本質上只是根據系統環境決定使用哪種系統呼叫來獲取亂數。

本次實驗在 Linux 系統上進行，並且使用 glibc (GNU C函式庫)，因此 randombytes 會透過 linux_getrandom 來取得隨機數據。

int randombytes(uint8_t *buf, size_t n)
{
#if defined(__linux__) || defined(__GNU__)
#if defined(USE_GLIBC)
    /* Use getrandom system call */
    return linux_getrandom(buf, n);
#elif defined(SYS_getrandom)
    /* Use getrandom system call */
    return linux_getrandom(buf, n);
#else
    /* When we have enough entropy, we can read from /dev/urandom */
    return linux_urandom(buf, n);
#endif
#elif defined(BSD)
    return bsd_randombytes(buf, n);
#else
#error "randombytes(...) is not supported on this platform"
#endif
}

在 Linux kernel 的 linux/drivers/char/random.c 檔案中，getrandom() 是一個提供給使用者空間的系統呼叫，負責生成安全的亂數。其核心機制依賴於加密隨機數產生器（CRNG, Cryptographic Random Number Generator），而其能否正常運作，關鍵在於 crng_ready() 的狀態。

crng_ready() 用於確保 CRNG 已經獲得足夠的熵，能夠產生安全的隨機數。若 CRNG 尚未就緒，getrandom() 可能會：

阻塞（Block），直到足夠的熵被收集並初始化 CRNG。
返回 -EAGAIN 錯誤。

而熵池（Entropy Pool）熵的來源於，Linux 內核從多種不可預測的系統事件收集熵，這些事件包括但不限於：

滑鼠移動
鍵盤輸入
磁碟 I/O 活動
網路封包到達時間

這些隨機性來源會被用來初始化內核的熵池（Entropy Pool），進而驅動 CRNG。

Linux 的偽隨機數產生器 (CSPRNG) 運作原理具體可以參考：
Understanding the Red Hat Enterprise Linux random number generator interface

SYSCALL_DEFINE3(getrandom, char __user *, ubuf, size_t, len, unsigned int, flags)
{
	struct iov_iter iter;
	int ret;

	if (flags & ~(GRND_NONBLOCK | GRND_RANDOM | GRND_INSECURE))
		return -EINVAL;

	/*
	 * Requesting insecure and blocking randomness at the same time makes
	 * no sense.
	 */
	if ((flags & (GRND_INSECURE | GRND_RANDOM)) == (GRND_INSECURE | GRND_RANDOM))
		return -EINVAL;

	if (!crng_ready() && !(flags & GRND_INSECURE)) {
		if (flags & GRND_NONBLOCK)
			return -EAGAIN;
		ret = wait_for_random_bytes();
		if (unlikely(ret))
			return ret;
	}

	ret = import_ubuf(ITER_DEST, ubuf, len, &iter);
	if (unlikely(ret))
		return ret;
	return get_random_bytes_user(&iter);
}

Xorshift實做

以下是我的實做，並已將其整合進 random.c 中。但是我認為有其潛在的安全性問題，整個過程其實是可逆的，所以並不適合用於安全性需求高的應用，未來還有改進的空間。

uint64_t xorshift64(void)
{
    xorshift_state ^= xorshift_state << 11;
    xorshift_state ^= xorshift_state >> 15;
    xorshift_state ^= xorshift_state << 7;
    xorshift_state ^= xorshift_state >> 31;
    xorshift_state ^= xorshift_state << 23;
    return xorshift_state;
}

int randombytes_xor(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint8_t) (xorshift64() & 0xFF);

    return 0;
}

以下是對比 程式內建的亂數產生器 與 自實作的 xorshift64 生成數據的常態分佈圖。
normal_distribution

具體實做細節可以參考 Commit de78eee

解釋 select 系統

介紹 select 與 file descriptor

在解釋 select 是什麼之前，先來了解一下 linux 中的 file descriptor(檔案描述符)

參考自：理解linux中的file descriptor(檔案描述子)
file descriptors table由使用者程序所有，每個行程都有一個這樣的表，這裡記錄了行程開啟的檔案所代表的fd，fd的值對應到file table中的條目(entry)。

另外，每個行程都會預留3個預設的fd: stdin、stdout、stderr;它們的值分別是0、1，2。

Integer value Name symbolic constant file stream

0 Standard input STDIN_FILENO stdin

1 Standard output STDOUT_FILENO stdout

2 Standard error STDERR_FILENO stderr

file table
file table是全域唯一的表，由系統核心維護。這個表記錄了所有進程打開的檔案的狀態(是否可讀、可寫等狀態)，同時它也映射到inode table中的entry。

inode table
inode table同樣是全域唯一的，它指向了真正的檔案位址(磁碟中的位置)，每個entry全域唯一。

Integer value	Name	symbolic constant	file stream
0	Standard input	STDIN_FILENO	stdin
1	Standard output	STDOUT_FILENO	stdout
2	Standard error	STDERR_FILENO	stderr

簡單來說當進程開啟一個檔案（或另一個 I/O 流）時，作業系統會為其指派一個檔案描述符，進程使用該描述符進行後續的讀取、寫入或關閉操作。

再來了解一下 select 是什麼，在 select(2) 中有提到 select 的輸入如下：

int select(int nfds, fd_set *_Nullable restrict readfds,
                  fd_set *_Nullable restrict writefds,
                  fd_set *_Nullable restrict exceptfds,
                  struct timeval *_Nullable restrict timeout);

在 man7.org > Linux > man-pages 中也可以找到以下敘述：

fd_set
A structure type that can represent a set of file descriptors.
According to POSIX, the maximum number of file descriptors in an
fd_set structure is the value of the macro FD_SETSIZE.

由此可知 select 是用於管理大量檔案描述符。

在閱讀完 select(2) - Linux man page 後，我得到以下結論：
select() 用於同時監視多個檔案描述符，以檢查它們是否已準備好進行 I/O 操作（讀取、寫入或異常情況）。這樣做的優點有：

一次處理多個檔案描述符
- 不用阻塞單一 read() 或多個 write()，可以用 select() 一次檢查多個檔案描述符。
避免運傳資源浪費
- 不會在循環中不斷檢查檔案描述符（浪費 CPU 週期），而是 select() 會休眠直到至少一個位於 fd_set 中的檔案描述符準備就緒，才會返回值。
適用於任何 FD
- 適用於套接字 (sockets)、管道 (pipes)和其他常規文件。

select 應用案例

在 web.c 中的 web_eventmux 函式，我們可以看到 select 的實際應用。由於 lab-0 已整合網頁伺服器，並透過 API 讓使用者執行指令（如 new、ih 10），來間接操作 qtest 介面，因此在伺服器運作時，需要監控多個檔案描述符（FD），包括標準輸入（stdin）與伺服器 socket（server_fd），以處理使用者輸入或是新的連線請求。

正如前面所提到的，如果使用阻塞式的 read() 或 accept() 來進行請求處理會消耗許多資源，因此可以採用 select() 以更有效率的方式來等待並監聽這些事件，僅在有數據可讀或連線請求可接受時才進行處理。

以下是 select() 在 web.c 的具體用途：

呼叫 select()
這會持續阻塞，直到發生以下情況：
- 有使用者輸入的新資料（STDIN_FILENO）。
- 有新連接 server_fd。
```
int result = select(max_fd + 1, &listenset, NULL, NULL, NULL);
```
檢查連線是否就緒 (server_fd)

如果有新的用戶端連接，accept() 則會建立一個專用於與該特定用戶端通訊的新 socket，並回傳一個新的檔案描述符（web_connfd）來代表此連線。

web_recv() 會讀取 HTTP 請求 → 發送回應 → 將來自客戶端的請求資料儲存在 buf。

if (server_fd > 0 && FD_ISSET(server_fd, &listenset)) {
    FD_CLR(server_fd, &listenset);
    struct sockaddr_in clientaddr;
    socklen_t clientlen = sizeof(clientaddr);
    int web_connfd =
        accept(server_fd, (struct sockaddr *) &clientaddr, &clientlen);

    char *p = web_recv(web_connfd, &clientaddr);
    char *buffer = "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n";
    web_send(web_connfd, buffer);
    strncpy(buf, p, strlen(p) + 1);
    free(p);
    close(web_connfd);
    return strlen(buf);
}

rio 應用案例

在 web.c 有用到函式 rio_read()，以及對應的資料結構 rio_t。參考教材: CS:APP 第 10 章重點提示，以下是我對於 rio 的理解。

為什麼 RIO 比傳統的 read 更有效率?
當我們使用 read(fd, buf, size) 來從檔案或 socket 讀取資料時，每次讀取都會觸發 系統呼叫 (system call)，而這帶來幾個問題：

緩衝區 (buffer) 無法充分利用：
如果我們逐行讀取，但行的大小不固定，每次 read() 可能只讀取幾個字元，導致緩衝區被浪費。
系統呼叫的開銷大：
每次 read() 會進行 user mode → kernel mode 切換，這個上下文切換 (context switch) 成本非常高，會降低程式的效能。
讀取的效率低：
read() 可能一次只返回部分資料，導致程式需要反覆呼叫 read()，增加了不必要的開銷。

RIO 如何解決這些問題？
RIO (Robust I/O) 使用更高效的機制來處理 I/O，主要透過利用位於 user space 的緩衝區來減少系統呼叫次數，以提升效能。在 lab-0 的實做方式具體方式如下：

使用 rio_t 結構：
該結構位於 user space，用於來維護一個較大的緩衝區(在 lab-0 中為 1KB)
批量讀取 (Batch Read)：
rio_read() 一次性地讀取整個 buffer 大小的資料。這樣只需要呼叫一次 read() 就可以填滿 buffer，減少系統呼叫的次數。
進程只需從 rio_t 的 buffer 讀取資料：
當程式需要讀取數據時，優先從 rio_t 的緩衝區中取得。只有當 buffer 為空時，才會再呼叫 read() 來讀取新的資料。

以下為對應的程式碼：

typedef struct {
    int fd;            /* descriptor for this buf */
    int count;         /* unread byte in this buf */
    char *bufptr;      /* next unread byte in this buf */
    char buf[BUFSIZE]; /* internal buffer */
} rio_t;

static ssize_t rio_read(rio_t *rp, char *usrbuf, size_t n)
{
    int cnt;
    while (rp->count <= 0) { /* refill if buf is empty */
        rp->count = read(rp->fd, rp->buf, sizeof(rp->buf));
        if (rp->count < 0) {
            if (errno != EINTR) /* interrupted by sig handler return */
                return -1;
        } else if (rp->count == 0) { /* EOF */
            return 0;
        } else
            rp->bufptr = rp->buf; /* reset buffer ptr */
    }

    /* Copy min(n, rp->count) bytes from internal buf to user buf */
    cnt = n;
    if (rp->count < n)
        cnt = rp->count;
    memcpy(usrbuf, rp->bufptr, cnt);
    rp->bufptr += cnt;
    rp->count -= cnt;
    return cnt;
}