2024q1 Homework1 (lab0)

contributed by < Kuanch >

Reviewed by `Lccgth`

佇列實作部分應只貼出關鍵程式碼，例如 q_merge 中可省略變數宣告的程式碼，以文字敘述。

for (struct list_head *p = head->next->next; p != head; p = p->next) {
    queue_contex_t *node = list_entry(p, queue_contex_t, chain);
    list_splice_init(node->q, merged_list->q);
}
q_sort(merged_list->q, descend);

在佇列實作部分可列出相應的 git commit 紀錄，方便日後優化時進行比較。

在 q_sort 與 list_sort 效能比較中，將 list_sort 的 likely 和 unlikely 巨集進行更動，會影響編譯時的最佳化，可以參考 gcc 13.5 第 6.59 章中的 __builtin_expect。

Kuanch

減少張貼的程式碼

將 likely 和 unlikely 補回進行實驗；在 qsort 與 listsort 在 l1d 以及 mem_load_retired 之實驗時已經補回。此外亦針對 likely 和 unlikely 產生的 asm 有分析。

Reviewed by `lintin528`

github 上佇列實作分為多次 commit 在維護的時候會較方便。

開發環境

$ uname -a
Linux 5.15.0-100-generic #110~20.04.1-Ubuntu SMP

$ lscpu
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      39 bits physical, 48 bits virtual
CPU(s):                             8
On-line CPU(s) list:                0-7
Thread(s) per core:                 2
Core(s) per socket:                 4
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              140
Model name:                         11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
Stepping:                           1
CPU MHz:                            776.504
CPU max MHz:                        4200.0000
CPU min MHz:                        400.0000
BogoMIPS:                           4838.40
Virtualization:                     VT-x
L1d cache:                          192 KiB
L1i cache:                          128 KiB
L2 cache:                           5 MiB
L3 cache:                           8 MiB
NUMA node0 CPU(s):                  0-7
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ valgrind --version
valgrind-3.15.0

$ cppcheck --version
Cppcheck 1.90

實作指定的佇列操作

以下暫時僅展示有修正改進、有探討議題的操作，完整的程式碼請見 [Github](https://github.com/Kuanch/lab0-c/blob/queue/queue.c)。大部分程式碼雖能通過測試，但可讀性不足，仍待重構，並將重構完畢的程式碼放上。

你的洞見呢？

佇列實作

`q_sort`

程式碼列出是為了討論和檢討，倘若你沒有進行討論，就不要列出。讀者要看你的洞見。

Kuanch 依照老師和同學建議減少張貼的程式碼

q_sort 要求 In-place 操作，故使用 LIST_HEAD(new_head); 為實作重點；對比於 malloc 的動態分配(Dynamic Allocation)，此處為靜態分配(Static Allocation)，後者有幾項特點

編譯期決定，編譯器管理生命週期且無需佔用作業系統資源
由於由編譯器管理，不存在記憶體破碎 (memory fragmentation)
分配在Stack上，其容量有限，大的物件可能會導致溢出 (stack overflow)

無論標題和內文中，中文和英文字元之間要有空白字元 (對排版和文字搜尋有利)。

留意各式細節，唯有重視小處並步步為營，方可挑戰原始程式碼規模超越三千七百萬行的 Linux 核心。

`q_merge`

本例中將前述常用的結構 element_t 改為 queue_context_t，每一個包含 list_head 的 queue_context_t 帶有一個佇列開頭 node->q，透過 list_splice_init 合併到第一個 queue_context_t 當中的佇列，之後便可進行排序。

我認為 queue_context_t 特別之處在於每一個元素都能夠再帶有一個佇列，所以也許可以考慮以下這種應用

#include <stdio.h>
#include <stdlib.h>

struct list_head {
    struct list_head *next, *prev;
};

typedef struct {
    struct list_head list;
    int id;
    int runtime;
    int priority;
    double energy_cost;
    // other properties of tasks
    // ...
} task_t;

typedef struct {
    struct list_head q;     // Head of the queue of tasks
    struct list_head *other_task_queue; 
    int size;               // Number of tasks in the queue
    int id;                 // Identifier for the queue
} queue_context_t;

void link_task_queue(queue_context_t queue1, queue_context_t queue2) {
    queue1.other_task_queue = &queue2.q;
    queue2.other_task_queue = &queue1.q;
}

int main() {
    queue_context_t queue_task_type1;
    queue_context_t queue_task_type2;

    // initialize task queue, enqueue, schedule etc.
    // ...
        
    // link task queue
    link_task_queue(queue_task_type1, queue_task_type2);

    return 0;
}

假設 queue_task_type1 和 queue_task_type２ 為性質相似的任務佇列，在內可以儲存各自的任務，亦可走訪有相似性質的其他任務佇列。

分析與引入 `lib/list_sort.c`

修改 `lib/list_sort.c`

原先想直接與 linux/lib/list_sort.c 編譯，但屢試無果，故參考 komark06 的方式移植到 queue.c 2023q1 Homework1 (lab0)。

~~此處不列出所有修改，只列出需要討論的部分，完整修改請見 [queue.c](https://github.com/Kuanch/lab0-c/blob/list_sort/queue.c#L53)。~~

「本該如此」的事，就不要寫。

`merge`

-__attribute__((nonnull(2,3,4)))
-static struct list_head *merge(void *priv, list_cmp_func_t cmp,
-				struct list_head *a, struct list_head *b)
+__attribute__((nonnull)) struct list_head *merge(struct list_head *a,
+                                                 struct list_head *b)

`merge_final`

-        if (unlikely(!++count))
-            cmp(priv, b, b);
+        // if (!++count)
+        //    strcmp(list_entry(b, element_t, list)->value,
+        //           list_entry(b, element_t, list)->value);

`list_sort`

-        if (likely(bits)) {
-            struct list_head *a = *tail, *b = a->prev;
+        if (bits) {
+            struct list_head *a = *tail, *b = a->prev;

q_sort 與 list_sort 效能比較

使用 perf 進行分析，測試程式如下

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

#include "queue.h"

int main() {
    struct list_head *q = q_new();
    int list_size = 100000;
    for (int i = 0; i < list_size; i++) {
        char s = (char) (rand() % 26 + 'a');
        q_insert_tail(q, &s);
    }
    // q_sort(q, true);
    list_sort(q);
    q_free(q);
}

以下使用 ~~perf state -r 5~~perf stat -r 5，可以參考 Linux 效能分析工具: Perf

 Performance counter stats for './q_sort' (5 runs):

         7,2825.90 msec task-clock                #    1.000 CPUs utilized            ( +-  0.28% )
               362      context-switches          #    4.968 /sec                     ( +-  6.35% )
                 8      cpu-migrations            #    0.115 /sec                     ( +- 22.14% )
              3582      page-faults               #   49.186 /sec                     ( +-  0.01% )
    2957,7153,0020      cycles                    #    4.061 GHz                      ( +-  0.22% )
     625,4618,0370      instructions              #    0.21  insn per cycle           ( +-  0.00% )
     208,1336,9257      branches                  #  285.796 M/sec                    ( +-  0.00% )
          124,9445      branch-misses             #    0.01% of all branches          ( +-  0.59% )
  1,4784,0455,5899      slots                     #   20.301 G/sec                    ( +-  0.22% )
     285,2832,2723      topdown-retiring          #      1.9% retiring                ( +-  1.54% )
     846,9593,7809      topdown-bad-spec          #      5.7% bad speculation         ( +-  8.49% )
      69,6204,8854      topdown-fe-bound          #      0.5% frontend bound          ( +-  2.68% )
  1,3755,0196,3845      topdown-be-bound          #     92.0% backend bound           ( +-  0.25% )

        72.831 +- 0.206 seconds time elapsed  ( +-  0.28% )


         Performance counter stats for './list_sort' (5 runs):

     7,3189.10 msec task-clock                #    1.000 CPUs utilized            ( +-  0.17% )
           384      context-switches          #    5.252 /sec                     ( +-  5.30% )
            12      cpu-migrations            #    0.158 /sec                     ( +- 34.55% )
          3582      page-faults               #   48.942 /sec                     ( +-  0.01% )
2972,8367,5031      cycles                    #    4.062 GHz                      ( +-  0.18% )
 626,1171,8129      instructions              #    0.21  insn per cycle           ( +-  0.00% )
 208,3703,4674      branches                  #  284.701 M/sec                    ( +-  0.00% )
      126,7828      branch-misses             #    0.01% of all branches          ( +-  1.30% )
1,4852,8263,7616      slots                     #   20.294 G/sec                    ( +-  0.17% )
 282,5978,1184      topdown-retiring          #      1.9% retiring                ( +-  1.87% )
 908,7784,9886      topdown-bad-spec          #      6.0% bad speculation         ( +-  6.01% )
  63,9197,5271      topdown-fe-bound          #      0.4% frontend bound          ( +-  4.39% )
1,3783,7450,0815      topdown-be-bound          #     91.7% backend bound           ( +-  0.33% )

        73.193 +- 0.123 seconds time elapsed  ( +-  0.17% )

整理重要項為表格 :

	my sort	list sort
task clock	72825.9 msec	73189.1 msec
context switches	362	384
page faults	3582	3582
branches misses	124,9445	126,7828

在作業說明中提到

用 bottom up 實作 merge sort 對 cache 較友善，因為過程中就是一直合併，cache 被參照到的機會更大。而 top down 是會先做 partition 再來 merge，但 partition 本身對 cache 不友善，在 cache 移進移出（內容不斷更新，導致 cache thrashing。

以及改進 lib/list_sort.c 中提到

lib/list_sort.c 中實作的 merge sort 演算法採用 bottom-up 策略。它首先將給定的串列轉換為一個以空結尾的單向鏈結串列。該演算法維護一個稱為 "pending" 的數據結構，表示等待進一步合併的已排序子串列。

雖然可以見到幾乎無差距，但原先設想是，我的實作方法是遞迴，每次都會切出一半的資料進行兩次遞迴(即我理解的 top-down 而非 bottom-up)，理論上應該會導致 cache thrashing、有大量的 page-faults？

此處效能測量可能有誤，待釐清。

3 / 7 更新可能被編譯器最佳化了？除了調整最佳化等級，可以用 perf + 觀察 asm 看是否有顯著差距

設計新實驗

以下實驗針對幾點做改進

執行時間樣本的分析，如分布 (distribution)
設計具有某些特性的輸入，而非隨機輸入
增加 perf 試驗數

時間複雜度

我們依然分別隨機產生資料，但本次係從 10 個資料點，每次增加 10 個資料並進行排序，直到 20000 個資料點，紀錄 CPU cycles 後繪製成下圖

可以見到我所實作的 q_sort 隨著資料增加，CPU cycles 有指數增加的趨勢，相對於 list sort 接近於線性的增加，在大量資料的狀況下已有明顯差距，以下為測試程式碼

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>

#include "queue.h"
#include "dudect/cpucycles.h"

void static inline q_sort_wrapper(struct list_head *head) {
    bool predefined_descend = true;
    q_sort(head, predefined_descend);
}

int main(int argc, char** argv) {
    if (argc != 2) {
        printf("Usage: %s <sort_name>\n", argv[0]);
        return 1;
    }
    char* sort_name = argv[1];
    char* save_file;
    void (*sort_func)(struct list_head *head);
    if (strcmp(sort_name, "qsort") == 0) {
        sort_func = &q_sort_wrapper;
        save_file = "qsort_perf.txt";
    }
    else if (strcmp(sort_name, "listsort") == 0) {
        sort_func = list_sort;
        save_file = "listsort_perf.txt";
    }
    else {
        printf("Invalid sort name: %s\n", sort_name);
        return 1;
    }
    struct list_head *q = q_new();
    FILE *fileout = fopen(save_file, "w");
    for (int num_data = 10; num_data <= 20000; num_data += 10) {
        q = q_new();
        for (int i = 0; i < num_data; i++) {
            char s = (char) (rand() % 26 + 'a');
            q_insert_tail(q, &s);
        }
        _mm_mfence();
        int64_t start = cpucycles();
        sort_func(q);
        int64_t end = cpucycles();
        // save the result to text for analysis later
        fprintf(fileout, "%d %ld\n", num_data, end - start);
    }
}

分析樣本分布

多少樣本數才足以取得具有代表性的統計量？

問題的答案因母體分配而異，若假設母體為常態分佈，則根據《Educational research competencies for analysis and application》 (Gay, L. R., 1992)，至少需有 30 個樣本數，小於 30 則影響大多數檢定之檢定力，如 z-test；又根據〈A Practical Guide for Using Statistical Tests to Assess Randomized Algorithms in Software Engineering〉(Andrea Arcuri, 2011)，至少 30 次試驗是一個普遍適用的標準，但對於部分檢定，如 U test，100 次仍不足，而 1000 次則是推薦的次數。

以隨機產生的資料，每次有 20000 筆資料，進行 1000 次排序，刪除了大於第 99 分位數 (

P_{99}

) 的資料後，得到的 CPU Cycle 如下圖

作圖後可以顯著地看到兩者分布的有顯著差異，我們可以判斷兩者母體服從於不同分布；而就視覺上看，其母體可能非常態分布，若需要執行統計檢定需要注意，可以透過 K-S 檢定判斷是否服從常態分布。

增加 `perf` 試驗數和不同樣本特性

本次除了增加 perf 的試驗數外，每一次試驗都將進行一千次的排序，即

    int num_test = 1000;
    for (int test = 0; test < num_test; test++) {
        q = q_new();
        data_prod(q);
        // print_list(q);
        _mm_mfence();
        int64_t start = cpucycles();
        sort_func(q);
        int64_t end = cpucycles();
        // printf("Cycles: %ld\n", end - start);
        // print_list(q);
        fprintf(fileout, "%ld ", end - start);
    }

猜測 list_sort 可能在特定場景能發揮更好的效能，故我們準備 2 種資料，依據這 2 種資料，使用 perf stat -r 1000 得到以下四種結果

大致排序完成而尾部少數不等

我們以以下程式碼產生資料

void static inline almost_sorted_data(struct list_head *head) {
    for (int i = 0; i < 10000; i++) {
        char s = (char) (rand() % 26 + 'a');
        q_insert_tail(head, &s);
    }
    q_sort(head, false);
    for (int i = 0; i < 100; i++) {
        char s = (char) (rand() % 26 + 'a');
        q_insert_tail(head, &s);
    }
}

 Performance counter stats for './sort_dist qsort almost_sort' (1000 runs):

           3137.17 msec task-clock                #    0.998 CPUs utilized            ( +-  0.01% )
                 8      context-switches          #    2.544 /sec                     ( +-  1.39% )
                 0      cpu-migrations            #    0.000 /sec                   
           35,5163      page-faults               #  112.949 K/sec                    ( +-  0.00% )
     131,3528,1701      cycles                    #    4.177 GHz                      ( +-  0.01% )
     206,3582,6760      instructions              #    1.57  insn per cycle           ( +-  0.00% )
      45,1489,8824      branches                  #    1.436 G/sec                    ( +-  0.00% )
       1,0356,1613      branch-misses             #    2.29% of all branches          ( +-  0.01% )
     656,3148,1975      slots                     #   20.872 G/sec                    ( +-  0.01% )
     203,3289,0494      topdown-retiring          #     30.9% retiring                ( +-  0.02% )
     154,4270,1641      topdown-bad-spec          #     23.2% bad speculation         ( +-  0.06% )
     118,3940,4591      topdown-fe-bound          #     18.1% frontend bound          ( +-  0.03% )
     180,5324,4299      topdown-be-bound          #     27.8% backend bound           ( +-  0.03% )

          3.144975 +- 0.000249 seconds time elapsed  ( +-  0.01% )


 Performance counter stats for './sort_dist listsort almost_sort' (1000 runs):

           2889.82 msec task-clock                #    0.999 CPUs utilized            ( +-  0.01% )
                 7      context-switches          #    2.421 /sec                     ( +-  1.51% )
                 2      cpu-migrations            #    0.692 /sec                     ( +-  1.12% )
           35,5164      page-faults               #  122.833 K/sec                    ( +-  0.00% )
     120,8325,0954      cycles                    #    4.179 GHz                      ( +-  0.01% )
     189,2326,3768      instructions              #    1.57  insn per cycle           ( +-  0.00% )
      41,4924,8982      branches                  #    1.435 G/sec                    ( +-  0.00% )
       1,0522,7313      branch-misses             #    2.54% of all branches          ( +-  0.01% )
     603,1795,1240      slots                     #   20.861 G/sec                    ( +-  0.01% )
     189,2327,8820      topdown-retiring          #     31.3% retiring                ( +-  0.01% )
     151,3862,3056      topdown-bad-spec          #     25.4% bad speculation         ( +-  0.05% )
     120,6359,0248      topdown-fe-bound          #     20.0% frontend bound          ( +-  0.02% )
     141,9245,9115      topdown-be-bound          #     23.3% backend bound           ( +-  0.04% )

          2.892066 +- 0.000265 seconds time elapsed  ( +-  0.01% )

整理重要項為表格 :

	my sort	list sort
task clock	3137.17 msec	2889.82 msec
context switches	8	7
page faults	35,5163	35,5164
instructions	206,3582,6760	189,2326,3768
branches misses	1,0356,1613	1,0522,7313

反向排序完成

以以下程式碼產生測試資料

void static inline reverse_sorted_data(struct list_head *head) {
    for (int i = 0; i < 10000; i++) {
        char s = (char) (rand() % 26 + 'a');
        q_insert_head(head, &s);
    }
    q_sort(head, true);
}

Performance counter stats for './sort_dist qsort reverse_sort' (1000 runs):

           3198.79 msec task-clock                #    1.002 CPUs utilized            ( +-  0.01% )
                11      context-switches          #    3.447 /sec                     ( +-  1.05% )
                 1      cpu-migrations            #    0.313 /sec                     ( +-  2.77% )
           35,1647      page-faults               #  110.201 K/sec                    ( +-  0.00% )
     133,7968,5032      cycles                    #    4.193 GHz                      ( +-  0.01% )
     206,4549,2457      instructions              #    1.55  insn per cycle           ( +-  0.00% )
      44,8644,2627      branches                  #    1.406 G/sec                    ( +-  0.00% )
       1,0598,1816      branch-misses             #    2.36% of all branches          ( +-  0.01% )
     666,4662,4310      slots                     #   20.886 G/sec                    ( +-  0.01% )
     201,2466,6948      topdown-retiring          #     30.4% retiring                ( +-  0.02% )
     167,2699,5905      topdown-bad-spec          #     24.8% bad speculation         ( +-  0.06% )
     111,9575,3395      topdown-fe-bound          #     16.8% frontend bound          ( +-  0.05% )
     188,1787,0393      topdown-be-bound          #     28.0% backend bound           ( +-  0.03% )

          3.191458 +- 0.000349 seconds time elapsed  ( +-  0.01% )


 Performance counter stats for './sort_dist listsort reverse_sort' (1000 runs):

           2869.04 msec task-clock                #    0.997 CPUs utilized            ( +-  0.01% )
                13      context-switches          #    4.517 /sec                     ( +-  0.78% )
                 0      cpu-migrations            #    0.000 /sec                   
           35,1649      page-faults               #  122.183 K/sec                    ( +-  0.00% )
     120,0394,6287      cycles                    #    4.171 GHz                      ( +-  0.01% )
     191,8827,7875      instructions              #    1.59  insn per cycle           ( +-  0.00% )
      40,7025,6305      branches                  #    1.414 G/sec                    ( +-  0.00% )
       1,0781,1929      branch-misses             #    2.65% of all branches          ( +-  0.01% )
     600,1660,7175      slots                     #   20.853 G/sec                    ( +-  0.01% )
     190,6409,8749      topdown-retiring          #     31.8% retiring                ( +-  0.02% )
     169,4586,5555      topdown-bad-spec          #     27.8% bad speculation         ( +-  0.05% )
     105,9116,5972      topdown-fe-bound          #     18.0% frontend bound          ( +-  0.03% )
     134,1547,6897      topdown-be-bound          #     22.4% backend bound           ( +-  0.04% )

          2.878567 +- 0.000240 seconds time elapsed  ( +-  0.01% )

整理重要項為表格 :

	my sort	list sort
task clock	3198.79 msec	2869.04 msec
context switches	11	13
page faults	35,1647	35,1649
instructions	206,4549,2457	191,8827,7875
branches misses	1,0598,1816	1,0781,1929

新實驗後我們可以暫時整理三點：

list sort 確實顯著快於我所寫的 qsort
引述相關文獻，就實驗次數檢視本實驗是否具證據力
為什麼比較不好仍待驗證：
- page-fault 沒有特別大，為什麼？因 cache 足夠大？
- instructions 較多，多了什麼？

L1 cache 分析

和老師討論後發現原先 perf 並沒有顯示出 Cache 項次，在補上了 likely 及 unlikely 後，數據如下

 Performance counter stats for './perf_sort qsort' (1000 runs):

       1,1367,2224      L1-dcache-loads                                               ( +-  0.08% )  (36.89%)
       1,0174,7077      L1-dcache-loads-misses    #   91.49% of all L1-dcache accesses  ( +-  0.01% )  (37.03%)
           40,2304      LLC-loads                                                     ( +-  0.83% )  (37.03%)
              3121      LLC-loads-misses          #    0.91% of all LL-cache accesses  ( +-  3.67% )  (37.03%)
       1,0315,2129      l1d.replacement                                               ( +-  0.02% )  (36.12%)
            2,2600      l1d_pend_miss.l2_stall                                        ( +-  9.23% )  (35.98%)
          113,4894      mem_load_retired.l1_hit                                       ( +-  0.55% )  (35.98%)
       1,0398,5941      mem_load_retired.l1_miss                                      ( +-  0.01% )  (35.98%)
       8,4183,5218      cycle_activity.cycles_l1d_miss                                     ( +-  0.02% )  (35.98%)
       8,4118,9368      cycle_activity.stalls_l1d_miss                                     ( +-  0.02% )  (35.98%)
       1,0465,9441      mem_inst_retired.all_loads                                     ( +-  0.01% )  (35.98%)

         0.3565312 +- 0.0000646 seconds time elapsed  ( +-  0.02% )


 Performance counter stats for './perf_sort listsort' (1000 runs):

       1,1180,5254      L1-dcache-loads                                               ( +-  0.06% )  (35.94%)
       1,0282,2597      L1-dcache-loads-misses    #   92.29% of all L1-dcache accesses  ( +-  0.01% )  (37.05%)
           55,1223      LLC-loads                                                     ( +-  0.79% )  (37.88%)
              5174      LLC-loads-misses          #    1.06% of all LL-cache accesses  ( +-  3.52% )  (37.88%)
       1,0396,6378      l1d.replacement                                               ( +-  0.01% )  (37.43%)
            2,7826      l1d_pend_miss.l2_stall                                        ( +-  5.49% )  (36.33%)
          102,7050      mem_load_retired.l1_hit                                       ( +-  2.02% )  (35.50%)
       1,0478,8193      mem_load_retired.l1_miss                                      ( +-  0.01% )  (35.50%)
       8,5502,9344      cycle_activity.cycles_l1d_miss                                     ( +-  0.02% )  (35.50%)
       8,5046,0227      cycle_activity.stalls_l1d_miss                                     ( +-  0.02% )  (35.50%)
       1,0576,1424      mem_inst_retired.all_loads                                     ( +-  0.01% )  (35.50%)

         0.3605272 +- 0.0000668 seconds time elapsed  ( +-  0.02% )

整理重要項為表格 :

	my sort	list sort
task clock	3198.79 msec	2869.04 msec
context switches	11	13
page faults	35,1647	35,1649
instructions	206,4549,2457	191,8827,7875
branches misses	1,0598,1816	1,0781,1929

Impacts of likely and unlikely

引入到 lab0-c 專案

在 queue.c 加入

+bool do_lsort(int argc, char *argv[]) {...}
...
+    ADD_COMMAND(lsort, "Sort the queue with list sort in Linux Kernel", "");

即可使用 lsort 呼叫 list_sort 進行排序。我們也可以加入 option time_sort，用以快速測量排序的速度

+    struct timespec start, end;
+    if (current && exception_setup(true)) {
+        clock_gettime(CLOCK_MONOTONIC, &start);
+        list_sort(current->q);
+        clock_gettime(CLOCK_MONOTONIC, &end);
+        if (time_sort) {
+            uint64_t delta_us = (end.tv_sec - start.tv_sec) * 1000000 + (end.tv_nsec - start.tv_nsec) / 1000;
+            report(4, "Time to sort: %ld microsecond", delta_us);
+        }
+    }
...
+    add_param("time_sort", &time_sort, "If report the time of sort", NULL);

效果如下

option fail 0
option malloc 0
option time_sort 1 
new
ih RAND 100000
lsort
free

// l = []
// l = [rajxbcvef qxakpf adhydugvm ... ]
// Time to sort: 30542 microsecond
// l = [aaacprq aaahp aaalbkd ... ]

提供新的命令 shuffle

實作請見 Github commit 使用作業說明中提供的測試程式，計算卡方值以及繪圖

由於我們希望檢驗演算法輸出的資料是否符合均勻分布 (uniform distribution)，故我們假設期望值

E = 100000 / 24 \approx 4166

即隨機 shffle 100000次，平均分配到 24 種狀況，每種狀況平均應該出現 4166 次。依照測試程式計算，樣本之卡方值為 30.16，依照自由度 5 查表，由於 20.52 < 30.16，得到 p-value = 0.001，故我們可以推論，在設定

α = 0.05

之下，不拒絕虛無假設，依照抽樣結果，本演算法得到的資料分布不是均勻分布的可能性小於 0.001。

能夠見到畫出的圖片亦近似於均勻分布

研讀論文〈Dude, is my code constant time?〉

完整開發請見 Github branch fix_dudect

~~TL;DR~~

不要濫用 TL;DR，誠實面對自己，闡述你的洞見，並接受他人批評指教，從而及早發現自身認知不足。

本節做到的是

理解並記錄本論文如何透過統計檢定，推論演算法是否為線性複雜度
理解並記錄計算 CPU Cycle 的做法以及問題
如何解決樣本不滿足檢定前提 (解決後 CI 即通過)

Dude, is my code constant time? 使用統計方法判斷固定輸入資料與隨機輸入資料的執行時間是否有不同，若有則表示演算法可能會因為輸入資料而有不同的效能，導致了非線性複雜度的可能。(結合 Time attack，應該有更好的表達方式，再修改)

We first measure the execution time for two different input data classes, and then check whether the two timing distributions are statistically different.

文中亦闡明了演算法的過程，即

Measure execution time
Apply post-processing
Apply statistical test

我們可以依照上述過程找到專案中的相應程式碼，亦可參照該論文釋出的程式碼 dudect.h。

Measure execution time

首步為計算 CPU cycles，定義於 dudect/cpucycles.h

static inline int64_t cpucycles(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int hi, lo;
    __asm__ volatile("rdtsc\n\t" : "=a"(lo), "=d"(hi));
    return ((int64_t) lo) | (((int64_t) hi) << 32);
#elif defined(__aarch64__)
    // ....
#endif
}

透過指令 rdtsc ，即 Read Time-Stamp Counter，該指令會傳回自 CPU 啟動以來的循環 cycle 數，其中 lo 為低位的 32 位元，hi 為高位的 32 位元，((int64_t) lo) | (((int64_t) hi) << 32) 即可組成 CPU cycles。

這樣的計算具有幾個特徵，

不考慮 out-of-order，在計算途中可能會混入其他指令；也是論文中提到要進行後處理的主因。
假定 CPU 頻率為定值，而事實上可能有高有低

特別的是，dudect.h 中的 cpucycles() 為

static inline int64_t cpucycles(void) {
  _mm_mfence();
  return (int64_t)__rdtsc();
}

_mm_mfence() 應是指令 MFENCE ，其作用為確保在指令 rdtsc 之前，先前的所有記憶體讀寫都應該被完成，故在計算 CPU cycles 時便不會引入先前其他指令。

為何 MFENCE 能夠使 rdtsc 更準確計算 CPU cycles (ChatGPT)

Serialization of Memory Operations: MFENCE ensures that all load (read) and store (write) operations that appear before the MFENCE in the program order are completed before any load or store operations that come after the MFENCE. It effectively draws a line in the sand, saying "no memory operation that comes after this point can be completed until all memory operations before this point are done."

Impact on rdtsc: The rdtsc instruction reads the time-stamp counter, which counts the number of cycles since the processor was reset. It's a non-serializing instruction, which means it doesn't by itself ensure that previous instructions have completed. By placing an MFENCE before rdtsc, you can ensure that all prior memory operations have been fully carried out before the cycle count is read.

Preventing Speculative Execution: Speculative execution is another form of out-of-order execution where the processor guesses the path of execution and begins executing ahead. If the guess is incorrect, the results are discarded. MFENCE prevents the results of any speculative execution that involve memory operations from being committed until all previous memory operations are confirmed to be completed.

Apply post-processing

Student's t-test 用於估計小樣本的未知標準差的常態分布之期望值，若樣本數大則可使用中央極限定理求其母體參數；該論文使用 Welch’s t-test 作為檢定方法，其基於 Student's t-test。

比較 dudect/fixture.c 可得知並未實作 prepare_percentiles() ，這可能導致不符合 Welch’s t-test 假設；由於其假設樣本來自於常態分布，而執行時間則是左偏態(執行時間短的任務佔大多數)，故我們需要捨棄左側一部份樣本來達到常態分布假設。

Apply statistical test

在 dudect/fixture.c 中，可以找到 update_statistics()，比對 dudect.h，缺乏移除左側長尾的部分

-static void update_statistics(const int64_t *exec_times, uint8_t *classes)
+static void update_statistics(const int64_t *exec_times,
+                              uint8_t *classes,
+                              const int64_t *percentiles)
 {
     for (size_t i = 0; i < N_MEASURES; i++) {
         int64_t difference = exec_times[i];
@@ -74,6 +109,14 @@ static void update_statistics(const int64_t *exec_times, uint8_t *classes)
 
         /* do a t-test on the execution time */
         t_push(t, difference, classes[i]);
+
+        // t-test on cropped execution times, for several cropping thresholds.
+        for (size_t crop_index = 0; crop_index < DUDECT_NUMBER_PERCENTILES;
+             crop_index++) {
+            if (difference < percentiles[crop_index]) {
+                t_push(t, difference, classes[i]);
+            }
+        }
     }
 }

在加入這部分後，線上 CI 通過，達到 100/100，見 fix_dudect。

指出論文和程式碼實作的出入之處。

補充

In-place 操作: 引用 ChatGPT

Here are the key reasons why in-place operations hold significant importance:

Limited Memory Resources

Performance Optimization

Real-time Constraints

Atomicity and Synchronization

Reducing Fragmentation

Why don't we replace all memory allocations with static? (stupid question but useful) 引用 ChatGPT

Data might not be known at compile time
Precise life management might be necessary
Large data might not fit into stack

疑難雜症

enable Git workflow

$ make
Fatal: GitHub Actions MUST be activated.
Check this article: https://docs.github.com/en/actions/managing-workflow-runs/disabling-and-enabling-a-workflow
Then, execute 'make' again.
make: *** [Makefile:34: .git/hooks/applied] Error 1

參考作業說明 - GitHub Actions 設定

留意作業說明！

tic-tac-toe

蒙地卡羅搜尋樹 (Monte Carlo Tree Search) 如何運作

下棋的時候，我們是先在腦中知道可以走的走法、試誤，譬如炮二平五、俥五進六，發現某些棋對局勢更好 (更有獲勝的機會、想像不到對手有好的走法)，我們就選擇這樣的走法；對手根據我們的走法，也同樣選擇，若他的選擇在我們的估計內，我們會發現勝率越來越高，最終獲勝。

參考上圖，各項流程的意義是 1. Simulation

基於目前的局勢猜測勝率，此處實作是雙方隨機落子，即蒙地卡羅模擬法，僅有 1.0 0.0 0.5 三種離散結果。

2. Backpropagation

依照隨機模擬的結果，更新某一個落子的勝率以及被選到的次數；前一輪被選中的落子，會在下一輪才被 Bcakpropagation 更新勝率。

3. Selection

根據 UCT(upper confidence bound applied to trees) 「真正」選擇下一步，在 Backpropagation 當中預估的勝率越大，前項越大 (exoloitation)；在先前被選中的次數越少，後項越大，鼓勵選取未曾探索過的落子 (exploration)，兩者互相拮抗；實作中設定為：

U C T = \frac{s c o r e_{c h}}{n u m_v i s i t s_{c h}} \times \sqrt{\frac{2 \cdot l n (n u m_t o t a l_v i s i t s_{p r})}{n u m_v i s i t s_{c h}}}

$s c o r e_{c h}$ 代表使用該子節點的勝率
$n u m_v i s i t s_{c h}$ 代表該子節點曾被使用於模擬的次數
$n u m_t o t a l_v i s i t s_{p r}$ 代表該節點的親代節點曾被使用於模擬的次數

依照上式，我們可以推論 a. 當使用該節點在模擬中獲得更好的分數，越可能再次選擇它向下模擬 b. 當該節點的親代節點越常被使用，我們越可能再次選擇它向下模擬 c. 當該節點被選擇的次數太多，我們盡量避免使用它向下模擬

4. Expansion

如果某一個落子之後的結果還沒有被預估（葉節點），則將目前合法的落子初始化成他的子節點

( 如果這樣落子，會有多少勝率 (Simulation and Backpropagation) -> 思考這樣落子後，後續的發展 (Expansion) )

\times N

-> 在

N

次落子中選擇一個勝率最高的走法 (Selection)

定點數取代浮點數

定點數就是將某一正數放大後的表示方法，縮放係數 (scalar factor) 係指將正數轉換成定點數需要的乘數；如 12.34 在縮放係數為

2^{8}

時，即是

12.34 \times 2^{8} = 3159.04

；換句話說，當我們使用定點數來表示數字時，3159.04 代表 12.34。

我們可以將十進位轉為二進位來觀察

定點數 12.34:　0b1100.01010111000010100011110101110000101000111101011100001 3159.04:0b110001010111.000010100011110101110000101000111101011100001

可以觀察到，3159.04之定點數是將 12.34 向左位移 8 個位元，即

12.34 << 8 ⟹ 12.34 \times 2^{8} = 3159.04

IEEE 754 12.34:　 0b0.10000000010.1000101011100001010001111010111000010100011110101110 3159.04: 0b0.10000001010.1000101011100001010001111010111000010100011110101110

而 IEEE 754 中，sign 與 mantissa 相同，而雙精度下指數項之 bias 為

2^{10} - 1

；而當我們乘以

2^{8}

，即是

(2^{10} - 1) + 3 + 8 ⟹ 0 b 10000000010 + 0 b 00000001000 = 0 b 10000001010

What is the purpose of "bias" in IEEE 754 ?

怎麼把十進位轉換為 IEEE 754 浮點數？

寫作定點數
$8.125 ⟹ 1000.001$
轉換為科學記號
$1000.001 ⟹ 1.000001 \times 2^{3}$
指數項為
$2^{b i a s + n}$
$127 + 3 ⟹ 0 b 10000010$
尾數項為科學記號小數後
$0 b 000001$
$8.125 ⟹ 0.10000010 .00000100000000000000000$

如果只是 0.5 呢？因為是

2^{- 1}

，所以指數項是 127 - 1 0.025？127 - 2　即　0b0.01111101.00000000000000000000000 0.0125？127 - 3 即 0b0.01111100.00000000000000000000000 0.00625？127 - 4 即 0b0.01111011.00000000000000000000000

會發現，當僅有小數時，bias 是用於被扣除的！僅有小數時，可以儲存更小精度的位數，故 IEEE 754 的值域才能夠達

還有簡化比較，若沒有加上 bias，以上述狀況，指數項是負的，我們必須要在指數項引入二補數，直接比較會導致負數大於正數，譬如

11111100 > 01111101

但前者是負數，需要特別處理；引入後儲存的皆是正數，就可以直接比較。

在決定縮放係數前，我們首先觀察 score 之值域，以求至少在儲存浮點數以及其四則運算中誤差最小；去除掉第一次拜訪所得 DBL_MAX 之值，最大值為 3.552684，最小值為 0。

當縮放係數為

$2^{8}$ ：

可表示為 11.10001101
轉換回浮點數得到 3.550781，損失為 0.001903

當縮放係數為 =

$2^{16}$ ：

可表示為 11.1000110101111100
轉換回浮點數得到 3.552673，損失為 0.000011

當縮放係數為 =

$2^{20}$ ：

可表示為 11.10001101011111001011
轉換回浮點數得到 3.552684，損失為 0

雖然採用越大的縮放係數能夠逼近更小的位數，使得損失降低，但並非越大越好；接下來我們展示定點數的乘除，就能發現縮放係數的大小事實上是一種取捨。

Q Format

定點數的乘法

定點數乘法的問題在於容易上溢位 (overflow)，假設縮放係數為

2^{16}

，並考慮以下程式碼








double a = 2.25;
double b = 5;
printf("%f\n", a * b);

// fixed point number
unsigned int c = a * (1 << FIXED_POINT_SCALE);
unsigned int d = b * (1 << FIXED_POINT_SCALE);
printf("%f\n", (double) (c * d) / pow(2, FIXED_POINT_SCALE * 2));

當 FIXED_POINT_SCALE = 14 時，兩者相等；當 FIXED_POINT_SCALE = 15 時，定點數乘法計算為 3.25，是因為左移過程中有兩個位元溢出，故與原先值差為 8；FIXED_POINT_SCALE = 16 時，計算結果為 0.25，亦是溢出兩個位元。

故設計不符需求的縮放係數可能會導致上溢位；而我們可以檢測當兩乘數之最高位元相加，當大於系統位元數，就必須處理上溢位。

定點數的除法

定點數除法則會導致下溢位 (underflow)，除了上溢位的反例，即過小的縮放係數會導致小數位的損失外，當兩數的最高位數相近，或除數大於被除數，也會導致下溢位；我們可以先放大被除數來避免，如

static unsigned int fixed_point_divide(unsigned int a, unsigned int b)
{
    unsigned int result;
    int a_ls = __fls(a);
    int b_ls = __fls(b);
    if ((b > a || b_ls - a_ls >= FIXED_POINT_SCALE) && a_ls + FIXED_POINT_SCALE < 32)
        result = (a << FIXED_POINT_SCALE) / b;
    else
        result = (a / b) << FIXED_POINT_SCALE;
    return result;
}

int main()
{
    double a = 123;
    double b = 45;
    printf("%f\n", a / b);                                   // 2.733333

    unsigned int c = a * (1 << FIXED_POINT_SCALE);
    unsigned int d = b * (1 << FIXED_POINT_SCALE);
    double r1 = c / d;
    double r2 = fixed_point_divide(c, d);
    printf("%f\n", (double) r1);                             // 2.000000
    printf("%f\n", (double) r2 / (1 << FIXED_POINT_SCALE));  // 2.730469
}

透過 __fls() 得到最高位數進行檢查避免上溢位，以上程式碼可以處理

a > 2^{16} - 1

的狀況；經實驗，在 #define ITERATIONS 100000 的情況下，log(n_total) / n_visits 約在 1.9 附近，遠小於

2^{16}

；且後續應用中，被除數 a >= 除數 b 必然成立，故不採用 __fls() 檢查除法中的上溢位，直接將 (a << FIXED_POINT_SCALE) / b 作為除法結果。

根據牛頓勘根法 (Newton's Method) 求近似平方根

牛頓勘根法的一般式可寫作

x_{n + 1} = x_{n} - \frac{f (x_{n})}{f^{'} (x_{n})}

逐項迭代後，

x_{n + 1}

最後將逼近

f (x) = 0

之解。

故我們將

f (x, c) = x^{2} - c

帶入，其中 c 為任意欲求平方根之常數；如令 c = 2，依據牛頓勘根法可球的

\sqrt{2}

之近似值；以以下程式碼實作：

#define FIXED_POINT_SCALE 8
#define TOLERANCE (1 << (FIXED_POINT_SCALE / 2))

static inline unsigned int fixed_point_divide(unsigned int a, unsigned int b)
{
    return (a << FIXED_POINT_SCALE) / b;
}

static inline unsigned int fixed_point_sqrt(unsigned int x)
{
    unsigned int r = x > 1 ? (x >> 1) : 1;
    unsigned int r_new;
    while (1) {
        if (r == 0)
            return r;
        unsigned int div = fixed_point_divide(x, r);
        r_new = (r + div) >> 1;
        unsigned int diff = r_new - r;
        if (diff < TOLERANCE) {
            return r_new;
        }
        r = r_new;
    }
}

int main()
{
    double max_diff = 0;
    double sum_diff = 0;
    double std_diff = 0;
    double *diff_arr = malloc(65535 * sizeof(double));
    for (int a = 1; a < 65536; a++) {
        float sqrt_2 = sqrt(a);
        unsigned int approx_sqrt_2 = fixed_point_sqrt(a * (1 << FIXED_POINT_SCALE));
        double approx_sqrt_2_double = fixed_point_to_double(approx_sqrt_2, FIXED_POINT_SCALE);
        double diff = fabs(sqrt_2 - approx_sqrt_2_double);
        diff_arr[a - 1] = diff;
        sum_diff += diff;
        if (diff > max_diff) {
            max_diff = diff;
        }
    }
    double mean_diff = sum_diff / 65535;
    for (int i = 0; i < 65535; i++) {
        std_diff += pow(diff_arr[i] - mean_diff, 2);
    }
    std_diff = sqrt(std_diff / 65535);
    printf("%f, %f, %f\n", max_diff, mean_diff, std_diff);
}

經以上實驗，牛頓勘根法的近似值與浮點數 sqrt 之最大差值為 0.003906，其平均值為 0.001957，標準差為 0.001132。

礙於顯示效果，我們每 100 個數才作圖：

由於差異極小，很難辨別誤差，但確實展示出極佳的逼近效果。

根據牛頓勘根法 (Newton's Method) 求近似對數

依據上述方法，令

f (x, c) = 2^{x} - c

，求得對數之迭代公式為

x_{n + 1} = x_{n} - \frac{2^{x_{n}} - c}{2^{x_{n}} l n 2}



















static inline unsigned int fixed_point_log(unsigned int x)
{
    if (x == 1)
        return 0;
    unsigned int l = x > 1 ? (x >> 1) : 1;
    unsigned int l_new;
    while (1) {
        if (l == 0)
            return l;
        unsigned int fx0 = pow(2, (double) l / (1 << FIXED_POINT_SCALE)) * (1 << FIXED_POINT_SCALE);
        unsigned int div = fixed_point_divide(x - fx0, fixed_point_multiply(fx0, LN2));
        l_new = l + div;
        unsigned int diff = l_new - l;
        if (diff < 1) {
            return l_new;
        }
        l = l_new;
    }
}

在實作時發現重大問題是，當

x_{n}

帶有小數，我們無法使用定點數計算出

2^{x_{n}}

；可以看到在第 10 行，仍是採用 pow()；參考 marvin0102，應引入尤拉公式。

使用定點數實作 mcts 之誤差

主要將 uct_score 改為定點數運算：

static inline unsigned int uct_score(unsigned int n_total, unsigned int n_visits, unsigned int score)
{
    if (n_visits == 0)
-       return DBL_MAX;
+       return UINT_MAX;
-   return score / n_visits +
-          EXPLORATION_FACTOR * sqrt(log(n_total) / n_visits);
+   unsigned int exploitation = fixed_point_divide(score, n_visits);
+   unsigned int exploration = fixed_point_sqrt(fixed_point_divide(fixed_point_log(n_total), n_visits));
+   return exploitation + EXPLORATION_FACTOR * exploration;
}

與原先 uct_score 之值相減後計算出誤差，作圖如下

見 commit 08cf048。

允許使用者切換對弈模式

如 lab0，新增 ai_vs_ai 參數，並使用 do_ttt 包裝 ttt 函數，將控制參數傳入即可，見 commit 7d11da4。

惟為了使用 function pointer 使程式更簡潔，將 negamax_predict 傳回類型改為整數類，因目前實作中暫不需要 move_t；且三者皆改為傳入 (char *, char *)，隨後再轉型回各自之型態。

在實作 coroutine 前，係直接取代 get_input，而直接導向原先電腦演算法，此時已經可以是 "電腦 v.s. 電腦" 對弈。

-               move = get_input(turn);
-               if (table[move] == ' ') {
+               if (ai_vs_ai)
+#ifdef USE_RL
+                   move = ai_algorithm(table, &agent);
+#else
+                   move = ai_algorithm(table, &turn);
+#endif
+               else
+                   move = get_input(turn);
+               if (table[move] == ' ')
                    break;
-               }

顯示時間

引入 coroutine 進行電腦 v.s. 電腦對弈

考慮 concurrent-programs/coro/coro.c

參考資料

Jim Huang

2024/03/14 13:31:23

q_sort 與 list_sort 效能

你如何準備測試資料？目前僅有隨機字串作為輸入，無法涵蓋其他非隨機分佈的資料。

Kuanch

2024/03/17 12:33:45

已設計新實驗，發現兩者確有顯著差距

2024/03/14 13:34:22

揭露硬體資訊

2024/03/15 13:32:17

已更新！

2024/03/14 13:35:07

3 / 7 更新

不需要標注日期，HackMD 會追蹤

2024/03/14 13:35:49

可能被編譯器最佳化了

列出產生的組合語言指令序列來佐證

2024/03/14 13:38:39

理論上

什麼理論？告知理論的出處

2024/05/01 07:31:45

量化！

2024/05/01 07:34:14

https://hackmd.io/@sysprog/sqrt

403 Forbidden - HackMDBuild together with Markdown

HackMD

2024q1 Homework1 (lab0)

Reviewed by Lccgth

Reviewed by lintin528

開發環境

實作指定的佇列操作

佇列實作

q_sort

q_merge

相關資料結構及操作

container_of

分析與引入 lib/list_sort.c

修改 lib/list_sort.c

merge

merge_final

list_sort

q_sort 與 list_sort 效能比較

設計新實驗

時間複雜度

分析樣本分布

增加 perf 試驗數和不同樣本特性

L1 cache 分析

Impacts of likely and unlikely

引入到 lab0-c 專案

提供新的命令 shuffle

研讀論文〈Dude, is my code constant time?〉

Measure execution time

Apply post-processing

Apply statistical test

補充

疑難雜症

enable Git workflow

tic-tac-toe

蒙地卡羅搜尋樹 (Monte Carlo Tree Search) 如何運作

定點數取代浮點數

Q Format

定點數的乘法

定點數的除法

根據牛頓勘根法 (Newton's Method) 求近似平方根

根據牛頓勘根法 (Newton's Method) 求近似對數

使用定點數實作 mcts 之誤差

允許使用者切換對弈模式

顯示時間

引入 coroutine 進行電腦 v.s. 電腦對弈

參考資料

Read more

Review with code: Linux Scheduler 閱讀筆記 (1)

EEVDF Notes

Demystifying the Linux CPU Scheduler 閱讀筆記 2024

Chapter 6: The Memory Hierarchy 閱讀筆記

Reviewed by `Lccgth`

Reviewed by `lintin528`

`q_sort`

`q_merge`

`container_of`

分析與引入 `lib/list_sort.c`

修改 `lib/list_sort.c`

`merge`

`merge_final`

`list_sort`

增加 `perf` 試驗數和不同樣本特性