2024q1 Homework1 (lab0)

contributed by < 25077667 >

平常當社畜，又有其他 side project 要寫，大家都好忙，大家一起加油！

前置踩坑

➜  ~ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --enable-languages=ada,c,c++,d,fortran,go,lto,m2,objc,obj-c++ --enable-bootstrap --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --with-build-config=bootstrap-lto --with-linker-hash-style=gnu --with-system-zlib --enable-__cxa_atexit --enable-cet=auto --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object --enable-libstdcxx-backtrace --enable-link-serialization=1 --enable-linker-build-id --enable-lto --enable-multilib --enable-plugin --enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch --disable-werror
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.2.1 20230801 (GCC)

➜  ~ ldd --version
ldd (GNU libc) 2.39
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

筆記 GTX 750-Ti driver EOL

因為有一台電腦好久沒開機了，開機滾系統滾到最新發現 750-Ti 的 NV 裝置驅動程式不再支援，所以需要往前查看舊版到那裡：
在 mint 論壇上面看到 535 是最後一版 mint
所以，我們會需要額外掛上 nvidia-535xx-dkms

paru nvidia-535xx-dkms

我們就可以快樂得：「Linux。啟動」

Wed Feb 21 00:27:07 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 750 Ti      Off | 00000000:01:00.0  On |                  N/A |
| 28%   44C    P0               2W /  38W |    486MiB /  2048MiB |     13%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     11432      G   /usr/lib/Xorg                               156MiB |
|    0   N/A  N/A     11577    C+G   /usr/lib/gnome-remote-desktop-daemon         28MiB |
|    0   N/A  N/A     11592      G   /usr/bin/gnome-shell                        141MiB |
|    0   N/A  N/A     12082      G   ...on,Ipfs --variations-seed-version=1      102MiB |
|    0   N/A  N/A     12419      G   /usr/bin/kgx                                 46MiB |
+---------------------------------------------------------------------------------------+

cppcheck version

root@b3f6fb6d357a:/# cppcheck --version
Cppcheck 2.7
root@b3f6fb6d357a:/# 
────────────────────────────────────
➜  lab0-c git:(master) cppcheck --version
Cppcheck 2.12.0
➜  lab0-c git:(master)

上面是 Ubuntu:22.04 (in docker) 下面是我的 Archlinux (host)

直接執行似乎有 cppcheck 版本不同，導致 default enable checking flags 的不同，導致 git pre-commit hook 會叫：

➜  lab0-c git:(master) git commit
console.c:287:20: style: Parameter 'vname' can be declared as pointer to const [constParameterPointer]
bool get_int(char *vname, int *loc)
                   ^
console.c:397:36: style: Parameter 'argv' can be declared as const array. However it seems that 'do_web' is a callback function, if 'argv' is declared with const you might also need to cast function pointer(s). [constParameterCallback]
static bool do_web(int argc, char *argv[])
                                   ^
console.c:432:5: note: You might need to cast the function pointer here
    ADD_COMMAND(web, "Read commands from builtin web server", "[port]");
    ^
console.c:397:36: note: Parameter 'argv' can be declared as const array
static bool do_web(int argc, char *argv[])
                                   ^
report.c:266:19: error: Returning pointer to local variable 'ss' that will be invalid when returning. [returnDanglingLifetime]
    return strncpy(ss, s, len + 1);
                  ^
report.c:266:20: note: Passed to 'strncpy'.
    return strncpy(ss, s, len + 1);
                   ^
report.c:256:11: note: Variable created here.
    char *ss = malloc(len + 1);
          ^
report.c:266:19: note: Returning pointer to local variable 'ss' that will be invalid when returning.
    return strncpy(ss, s, len + 1);
                  ^
report.c:48:24: style: Parameter 'file_name' can be declared as pointer to const [constParameterPointer]
bool set_logfile(char *file_name)
                       ^
report.c:164:28: style: Parameter 'format' can be declared as pointer to const [constParameterPointer]
static void fail_fun(char *format, char *msg)
                           ^
report.c:164:42: style: Parameter 'msg' can be declared as pointer to const [constParameterPointer]
static void fail_fun(char *format, char *msg)
                                         ^
report.c:249:29: style: Parameter 's' can be declared as pointer to const [constParameterPointer]
char *strsave_or_fail(char *s, char *fun_name)
                            ^
log2_lshift16.h:15:1: information: ValueFlow analysis is limited in log2_lshift16. Use --check-level=exhaustive if full analysis is wanted. [checkLevelNormal]
{
^

Fail to pass static analysis.

註：這情形在 390ade9eca44b432acb758084e7122cc2c46e485 上
➜ lab0-c git:(master) git show HEAD | grep -o '[a-z0-9]{40}'
390ade9eca44b432acb758084e7122cc2c46e485

最後這些討論詳見：

cppcheck 一直叫

經過上面的 PR 我開始實作 queue.c

「作」是古老的字，甲骨文時代就存在，最初的涵義是「起」，現代漢語裡仍然使用的「振作」、「一鼓作氣」、「槍聲大作」中的「作」，都是「起」的意思。在這個意義上跟「做」不衝突，因為「做」無此含義。「做」是後造字，始於宋元時代，當「即使」、「播弄」、「做作」講。

對照 Dictionary.com 對 implement 一詞的解釋:

to fulfill; perform; carry out:
to put into effect according to or by means of a definite plan or procedure.
Computers. to realize or instantiate (an element in a program), often under certain conditions as specified by the software involved.
to fill out or supplement

「實作」會是符合上述 "implement" 語意的譯詞，取「起」以「創作」的意思，而「製造」本質上是固定輸入並通常產生固定輸出的行為。

\to

資訊科技詞彙翻譯

但我最後發現 cppcheck 這個 const pointer checking 在 2.12 真是有夠糟糕，一堆 false positive
它一直叫阻礙了我寫 queue.c
我在這筆 commit 385f691 把 queue.c 的 constant checking 關閉

舉例來說：我們在 q_ascend

int q_ascend(struct list_head *head)

這功能是要做到「從頭到尾看過去，移除 rhs 小於 lhs 的節點」
我在實作過程中，用到 list_for_each_safe 巨集以方便我移出節點不會尋訪失敗。
但 cppcheck 卻推導不出 head 透過 container_of 之後的節點可能會被移除。
（確實，container_of 這透過 __typeof__ 計算出 offset 取出該物件的方式很 hacky）
而作為基礎建設的 cppcheck 會因此推導不出來，會對這作業的開發造成困擾。因此我決定抑制 cppcheck 在 queue.c 的所有 constParameterPointer 警告。

補齊指定的佇列操作

目前得分為 100 分

`q_new`, `q_free`

這邊的佇列有點特別，他是我們觀念上 FIFO 的沒錯，但在這 lab0-c 中 queue_contex_t 與 element_t 的關係是:

沒錯，其實 q_context 與 element_t 二者相接的機制是相同的，都是 Linux 核心風格的 struct list_head。

運作機制詳見: Linked list 在 Linux 核心原始程式碼

因此我們在 q_new 與 q_free 分別對應分配與釋放

/* Create an empty queue */
struct list_head *q_new()
{
    struct list_head *head = malloc(sizeof(struct list_head));
    if (!head) {
        free(head);
        return NULL;
    }

    INIT_LIST_HEAD(head);
    return head;
}

/* Free all storage used by queue */
void q_free(struct list_head *l)
{
    if (__glibc_unlikely(!l))
        return;

    if (__glibc_unlikely(list_empty(l))) {
        free(l);
        return;
    }

    struct list_head *cur = NULL, *safe = NULL;
    list_for_each_safe (cur, safe, l)
        q_release_element(container_of(cur, element_t, list));
    free(l);
}

TODO: 使用 Linux 風格的 unlikely 巨集。

收到，預計會引入 Linux 風格的 unlikely 巨集 scc

insert, remove

這邊承襲我去年的風格，把相同的邏輯抽出來到輔助函式，以減少不冗餘的程式碼，於是這四個

insert head
insert tail
remove head
remove tail

成為底層實作的包裝。

/* Insert an element at head of queue */
bool q_insert_head(struct list_head *head, char *s)
{
    return q_insert(head, s, list_add);
}

/* Insert an element at tail of queue */
bool q_insert_tail(struct list_head *head, char *s)
{
    return q_insert(head, s, list_add_tail);
}

/* Remove an element from head of queue */
element_t *q_remove_head(struct list_head *head, char *sp, size_t bufsize)
{
    return q_remove(head, sp, bufsize, head->next);
}

/* Remove an element from tail of queue */
element_t *q_remove_tail(struct list_head *head, char *sp, size_t bufsize)
{
    return q_remove(head, sp, bufsize, head->prev);
}

size

寫到這邊有在思考，是否能夠做一些 Linux kernel 喜歡的 hack：把額外資訊與某些 bit 壓在一起，例如 Linux 核心的紅黑樹，因為指標最低 2 個位元是沒有作用的，所以以把顏色的資料放進去其中一個位元中。

這邊原本也想實作類似的效果，我們可以做出如下觀察：

測試環境幾乎都是 64 位元
libc 對於 memory arena 的實作不清楚從第幾版開始便是使用 mmap 作為 memory chunk 的底層機制 malloc internals
在 make test hooking malloc 底下，經過測試前四個 byte 幾乎都是 0x00, 0x00, 0x??, 0x?? 相當固定。
- 這樣相當合理，因為 intel 在 10th Generation Intel® Processor Families 當中提到： Maximum Guest Address Width (MGAW) 就只有 48-bit
- This field indicates the maximum DMA virtual addressability supported by remapping hardware. The Maximum Guest Address Width (MGAW) is computed as (N+1), where N is the value reported in this field. For example, a hardware implementation supporting 48-bit MGAW reports a value of 47 (101111b) in this field. If the value in this field is X, untranslated and translated DMA requests to addresses above 2(x+1)-1 are always blocked by hardware. Translations requests to address above 2(x+1)-1 from allowed devices return a null Translation Completion Data Entry with R=W=O. Guest addressability for a given DMA request is limited to the minimum of the value reported through this field and the adjusted guest address width of the corresponding page-table structure. (Adjusted guest address widths supported by hardware are reported through the SAGAW field). Implementations are recommended to support MGAW at least equal to the physical addressability (host address width) of the platform.
- 因此前兩個 byte 必為 0
同時會變動的兩個 byte 在同一支 process 的數值也相當固定。詳細可參考 memory layout

因此一度有這樣的實作：

#ifdef __IS_64BIT__
static unsigned long long regular_msb = 0;
#define REGULAR_MSB_MASK 0xffffffff00000000

static inline size_t my_strlen(const char *s)
{
    if (__glibc_unlikely(!s))
        return 0;

    const unsigned long long len = (unsigned long long) s ^ regular_msb;
    return (size_t) len;
}
// ...

如果他是 64 位元，我就把前 4 bytes 與字串長度 xor。希望可以將如此開銷降低為

O (1)

在虛擬機器實作中，壓縮 64 位元指標是其中一種改進策略，例如 Pointer Compression in V8，不過在原生應用程式往往會得不償失。

同意，但對 string 相關的操作可以從 amortized constant time 成為 always constant time。
如此舉措的好處與重點在於：可擴展性（scalability）scc

結果改了改，發現還要改 qtest.c 覺得成本太高了，我還是先不考慮這 minor issue。

ascend, descend

考慮這樣的架構：假如鏈結串列是一串佇列，把左邊當作 head、右邊當作 tail。

在 descend 的情境下：這樣由左至右掃過去，如果看到最大比現在的還大，要把現在這串鏈結串列斷鏈後移除。這樣實作起來過於麻煩。
如此我們可以反過來：從尾走到頭，看到的比我小的，那我就把它丟棄（因為由左至右是嚴格遞減，換句話說就是由又至左嚴格遞增），

這時候我們需要反向的 iteration 但 list.h 沒有，git commit hook 也不希望我們修改。

因為作業的設計是讓學員善用既有的 List API。

我們可以寫在自己的 queue.c 裡。

// We introduce these macros to make our q_remove_nodes_with_order easier to
// implement.
// ---------------------------------------------------------
/**
 * list_prev_entry - get the prev element in list
 * @pos:	the type * to cursor
 * @member:	the name of the list_head within the struct.
 */
#define list_prev_entry(pos, member) \
    list_entry((pos)->member.prev, typeof(*(pos)), member)

/**
 * list_for_each_entry_safe_reverse - iterate backwards over list safe against
 * removal
 * @pos:	the type * to use as a loop cursor.
 * @n:		another type * to use as temporary storage
 * @head:	the head for your list.
 * @member:	the name of the list_head within the struct.
 *
 * Iterate backwards over list of given type, safe against removal
 * of list entry.
 */
#define list_for_each_entry_safe_reverse(pos, n, head, member) \
    for (pos = list_last_entry(head, typeof(*pos), member),    \
        n = list_prev_entry(pos, member);                      \
         &pos->member != (head); pos = n, n = list_prev_entry(n, member))

// ---------------------------------------------------------

這兩個巨集是我從 Linux source code 上拔下來的，記得這篇筆記是 GPL。

未必！只能說上方程式碼以 GNU GPLv2 發布，但本篇筆記不是直接建構在該程式碼之上的衍生著作，關鍵在於作用範圍。

事實上我對衍生著作的理解相當有限。我的理解是對原作之改作（含擴充）即為衍生著作。這邊我使用 GPL 的著作，使其擴充到我的應用案例，我會謹慎起見，認定這屬於 GPL 保護。scc

有了反過來的想法，就可以寫的相當簡潔：

static inline int q_remove_nodes_with_order(
    struct list_head *head,
    bool (*comp_func)(const element_t *, const element_t *))
{
    element_t *current = NULL;
    element_t *pivot = NULL;
    element_t *safe = NULL;
    int count = 0;

    list_for_each_entry_safe_reverse(current, safe, head, list)
    {
        if (!pivot || comp_func(current, pivot)) {
            pivot = current;
            count++;
        } else {
            list_del(&current->list);
            q_release_element(current);
            current = safe;
        }
    }

    return count;
}

merge, sort

這邊的基本策略是 merge sort。作為 devide and conqure 的基本問題，我們可以考慮這問題的 optimize sub-structure。
合併兩個已排序的鏈結串列: merge_sorted_queues

static int merge_sorted_queues(struct list_head *dest,
                               struct list_head *src,
                               bool is_descend)
{
    // Preconditions:
    // 1. Both dest and src are non-empty.
    // 2. Both queues are sorted according to the order specified by is_descend.
    // 3. Postcondition: src is emptied and its elements are merged into dest.

    typedef bool (*comp_func_t)(const element_t *, const element_t *);
    comp_func_t comp_func = is_descend ? compare_greater : compare_less;

    int total_size = q_size(dest) + q_size(src);

    // Iterate through src, inserting each element into the appropriate position
    // in dest. The sorted nature of dest and src allows for optimized insertion
    // by maintaining a reference to the current insertion point in dest.
    struct list_head *cur_src = NULL, *safe = NULL, *dest_next = dest->next;
    list_for_each_safe (cur_src, safe, src) {
        element_t *cur = list_entry(cur_src, element_t, list);
        bool inserted = false;

        // Traverse dest to find the correct insertion point for the current src
        // element.
        while (dest_next != dest) {
            element_t *dest_ele = list_entry(dest_next, element_t, list);
            if (comp_func(cur, dest_ele)) {
                // Insert before the current destination element if the
                // comparison condition is met.
                list_move_tail(cur_src, dest_next);
                inserted = true;
                break;
            }
            dest_next = dest_next->next;
        }

        // If the element from src was not inserted, it's either the largest
        // (ascending order) or the smallest (descending order) compared to all
        // elements in dest. Thus, append it to the end of dest.
        if (!inserted) {
            list_move_tail(cur_src, dest);
            // Optimizing for subsequent insertions by updating the insertion
            // point.
            dest_next = dest->prev->next;
        }
    }

    return total_size;
}

merge_sorted_queues將兩個已排序的鏈結串列（dest和src）合併到排序的鏈結串列（dest），並提供升序或降序排序的選項。以下是演算法的主要特點：

前提假設：
- dest 和 src鏈結串列均存在有效的資料節點，且已按指定順序（由 is_descend 決定）排序。
- 合併後，src 被清空，其所有節點都轉移到 dest 中。
比較函式：
- 根據 is_descend 標誌，選擇 compare_greater 或 compare_less 作為節點比較函式，支持升序或降序合併。
合併邏輯：
- 對 src 中的每個節點，找到其在 dest 中的正確插入位置，保持排序順序。
- 如果找到插入位置，則將 src 節點移至 dest 的該位置；若未找到，則將其添加到 dest 的尾端。
最佳化：
- 通過維護 dest 中目前插入點的參考（dest_next），減少每次插入時走訪節點的次數，提高效率。
複雜度：
- 在最壞情況下，演算法的時間複雜度可能達到
  $O (N + M)$ ，但平均情況下由於特定的改善措施而略好，普遍情況是
  $O (N)$ 。
後置條件：
- 執行後，src 被清空，dest 包含兩串列的所有節點，按指定的 is_descend 順序排序，函式傳回串列合併後的總空間。

謝謝 GPT 小朋友的解釋。

使用 ChatGPT 一類的工具得到內文後，應自行調整為符合作業規範的用語和書寫風格。隨時為了與更大的工程團隊的協作，做好必要的準備！

我有意稱此 AI 助手為 GPT 有以下兩個原因

已先入為主厭惡大眾對 ChatGPT 的神格化，認為其無所不能。

我使用的未必是 OpenAI 的 Generative Pre-trained Transformer，我有時後會使用其他不公開的 Transformer 模型協助潤飾語句。 scc

reverseK, reverse

因為 reverse a 鏈結串列 with given range 是 reverseK 的子問題。

改進你的漢語表達。

切長度為 K 的鏈結串列下來、反轉、拼接上去。

TODO: 提供這演算法是 optimizal 的證明

所以我們可以觀察 reverse a鏈結串列這一問題：
要把整串鏈結串列反轉，必然至少需要走訪

N

個節點，（因為每個都要改到）。
那麼我們可以有一個優雅的想法：「走過去，每個節點都重新取代舊頭為新頭」。
如下範例：

[0, 1, 2, 3, 4]
[1, 0, 2, 3, 4]
[2, 1, 0, 3, 4]
[3, 2, 1, 0, 4]
[4, 3, 2, 1, 0]

/* Reverse elements in queue */
void q_reverse(struct list_head *head)
{
    if (__glibc_unlikely(!head || list_empty(head)))
        return;
    struct list_head *cur = NULL, *safe = NULL;
    list_for_each_safe (cur, safe, head)
        list_move(cur, head);  // Move to the beginning
}

註：如果這 linked list 如果有足夠良好的封裝，只需要將 iterator 調換即可。

改進 dudect 實作

很不幸的，完成上面實作，只能獲得 95 分，在最後 test 17 如同作業說明上面提到的：

注意：現有實作存在若干致命缺陷，請討論並提出解決方案
在 oreparaz/dudect 的程式碼具備 percentile 的處理，但在 lab0-c 則無。

因此我們去觀察 oreparaz/dudect 的原始碼 line 429：

bool first_time = ctx->percentiles[DUDECT_NUMBER_PERCENTILES - 1] == 0;

dudect_state_t ret = DUDECT_NO_LEAKAGE_EVIDENCE_YET;
if (first_time) {
    // throw away the first batch of measurements.
    // this helps warming things up.
    prepare_percentiles(ctx);
} else {
    update_statistics(ctx);
    ret = report(ctx);
}

再看到 prepare_percentiles

/*
 set different thresholds for cropping measurements.
 the exponential tendency is meant to approximately match
 the measurements distribution, but there's not more science
 than that.
*/
static void prepare_percentiles(dudect_ctx_t *ctx) {
  qsort(ctx->exec_times, ctx->config->number_measurements, sizeof(int64_t), (int (*)(const void *, const void *))cmp);
  for (size_t i = 0; i < DUDECT_NUMBER_PERCENTILES; i++) {
    ctx->percentiles[i] = percentile(
        ctx->exec_times, 1 - (pow(0.5, 10 * (double)(i + 1) / DUDECT_NUMBER_PERCENTILES)),
        ctx->config->number_measurements);
  }
}

大略是：把極端值去除，只取

1 - {0.5}^{10 \cdot \frac{i + 1}{N}}

以下的部份。

但是為什麼是這神秘數字？

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

不知道，原作者論文也只有說：

a) Pre-processing: We pre-process measurements prior to statistical processing. We discard measurements that are larger than the
$p$ percentile, for various
$^{2}$ values of
$p$ . The rationale here is to test a restricted distribution range, especially the lower cycle count tail. The upper tail may be more influenced by data-independent noise. This (heuristic) process gives good empirical results (makes detection faster); nevertheless we also process measurements without pre-processing.

觀察這神秘算式似乎有一個問題，它收斂到 1 的速度相當快，給定我們的 DUDECT_NUMBER_PERCENTILES 常數。
大約有一半的 measure value (即

1 - {0.5}^{10 * \frac{1}{2}}

1 - {0.5}^{5}

= 0.96875) 後面的 0.03125

粗略得說，在這 percentiles 蒐集完採樣後，會有一半的資料來自於右尾的 3%

我們可以透過 Python 來畫圖：
我們令

x

sample data 是由小到大排列，的第

x

項抽樣資料

y

是這邊的 DUDECT_NUMBER_PERCENTILES 即抽樣數
那在本次 lab0-c 的實驗 150 底下，會長得像

這張圖縱軸的意義是被抽取到的機率密度。這樣後面 3% 幾乎每次都會抽到，這真得合乎預期嗎？
這樣的評估似乎怪怪的。

TODO: 採納 @jouae 的建議，修正圖表。

而我們把 DUDECT_NUMBER_PERCENTILES 當作變數，畫作三維圖片會是：

同樣縱軸會是被抽到的機率密度。
x 是由小到大的第 x 筆資料。
y 是抽樣數，[150, 1500]

測試 oreparaz/dudect 之實作

測試用的命令序列修改自 traces/trace-17-complexity.cmd

option simulation 1
it
option simulation 0

我們透過 printf 大法，把這神奇算式的值匯出至 local file 並使用 Python dict 做出輸出

update_statistics(exec_times, classes);
save_exec_times_to_file("exec_times.txt", exec_times);
ret &= report();

#!/usr/bin/env python3
import matplotlib.pyplot as plt

# Read data from file
data = []
with open("./exec_times.txt", "r") as file:
    for line in file:
        data.append(int(line.strip()))

# plot them in a histogram, x-axis is time, y-axis frequency
# table is a dict, key is the time, value is the frequency
table = {}
for time in data:
    if time in table:
        table[time] += 1
    else:
        table[time] = 1

# plot the histogram
plt.bar(table.keys(), table.values(), color="g", edgecolor="black", linewidth=15)

plt.xlabel("Cost Time", fontsize=12, fontweight="bold")
plt.ylabel("Frequency", fontsize=12, fontweight="bold")

# Make the ticks bolder and larger
plt.xticks(fontsize=10, fontweight="bold")
plt.yticks(fontsize=10, fontweight="bold")

# Set a thicker border for the plot
plt.gca().spines["top"].set_linewidth(1.5)
plt.gca().spines["right"].set_linewidth(1.5)
plt.gca().spines["left"].set_linewidth(1.5)
plt.gca().spines["bottom"].set_linewidth(1.5)

# Remove plt.legend() if it's not needed
# If you need a legend, ensure to provide a label in plt.bar(...) and uncomment the next line
plt.legend()
plt.show()

bold

這圖似乎與原始數據之分佈有所偏離：
我們把 percentile 註解起來，即 without any processing。
原始數據為：

顯然 oreparaz/dudect 的原始碼 line 429 之神奇公式有所缺陷，並不能充分反應我們 lab0-c 預期結果。

提出我的修正

高端的食材,只需要最樸素的烹飪方式

https://zhuanlan.zhihu.com/p/401221696

我的 percentile 取法為：

\frac{f l o a t i n g (i)}{N}

floating 意思是 floating point 減少因為總是取 floor 而採到 exactly same sample data

這樣的作法意義是：raw data 我全都要，但最右尾的我不要
即只把最後一筆離群替除，而不是如原論文有 50% 資料反覆取末 3%

scc_filter

註: 此更動需要重新編譯程式，綠圖與藍圖並非同資料集，但幾乎獨立同分佈(Independent and identically distributed)

為什麼最右尾的我不要？

簡單來說：在評測(benchmark) 中 jittering 有可能是 cache miss 或 soft/hard interrupt 等導致向右偏移。

然而，我們做實驗需要控制變因，減少變異數極大的依變項

我們在這評估的是 cpu cycles ，然而一次 page fault 的 cache miss 就是 600ns
一次 L1 cache miss 大約 10 cycle，大約 2.5 ns (我電腦 4.100GHz)
一次 L2 cache miss 大約 600 cycle，大約 150 ns

參見: What is the Cost of an L1 Cache Miss?

諸如此類的原因，我請 ChatGPT 幫我總結：

在電腦科學背景下，尤其是在分析性能指標或系統行為時，異常值可能會顯著影響數據的解釋。有時出於特定原因，丟棄右尾離群值（明顯高於大多數資料集的資料點）是合理的，特別是當這些離群值不代表系統的典型行為或效能時。以下是電腦科學家可能丟棄右尾異常值的三大原因，暗示快取未命中和中斷等情況：

非代表性事件：右尾異常值通常代表罕見的非代表性事件，這些事件可能會扭曲系統典型行為的分析。例如，由於罕見的中斷或不尋常的快取未命中情況而導致的執行時間突然激增可能無法準確反映系統的常規效能。透過丟棄這些異常值，電腦科學家旨在分析更能代表系統標準操作條件的核心效能指標。
專注於常見情況效能：在系統設計和最佳化中，重點通常是提高「常見情況」效能，而不是針對罕見的最壞情況進行最佳化。例如，雖然快取未命中可能會導致資料存取時間顯著延遲，但如果此類事件很少發生（因此在效能測量中顯示為右尾異常值），電腦科學家可能會將它們排除在分析之外，以集中精力優化更頻繁的緩存命中場景。這種方法有助於做出有利於大多數營運的決策，從而提供更有效的整體系統改進。
預測建模的資料標準化：開發預測模型或演算法時，異常值可能會扭曲模型，導致典型用例的預測不準確。右尾異常值（例如由顯著偏離正常值的不可預測中斷引起的異常值）可能會影響這些模型的準確性。透過刪除這些異常值，電腦科學家可以對資料集進行標準化，確保模型根據更準確地反映預期操作參數的資料進行訓練。這種標準化有助於創建更好地預測系統在正常條件下的行為的模型。

總之，在計算機科學中丟棄右尾異常值通常是一種刻意的選擇，以專注於最具代表性的數據，以進行系統性能分析、常見案例場景的優化以及準確的預測建模。這種方法有助於做出更明智的決策，從而提高系統的效率和可靠性，而不會受到罕見的、非代表性事件的過度影響。

如此，我的修正會更貼近原始數據，排除離群值。

為什麼原始論文的會運作？

答：有 50% 的資料取自相同 3% 的地方，甚至越末尾取越多次，這檢定意義過低，很容易認為是 constant time。

能不能提出原始論文的 false negative 而我比較好的測試資料？

發現早前我 prepare_percentiles 的缺失

當我寫到「能不能提出原始論文的 false negative 而我比較好的測試資料？」小節，花了大概幾個小時都沒有找到有足夠檢定力的測試資料。想說怎麼可能？！

我的評估方式 b38057f 是將 insert remove 時，都額外走訪一整串完整的 queue ，使其複雜度成為

O (N)

然而，很不幸的，在我的改善算式與原作的 percentile 機制皆無法 classify 出這不是 constant time (因為我已改為

O (N)

)。
覺得肯定有哪裡出現問題，使得無法拒絕 Null Hypothesis

先驗證我 insert remove 是 O(N)

同樣 printf 大法，把 raw data print 出來後使用 Python 畫圖：

Figure_1

可觀察到，走訪整個佇列的操作未被編譯器最佳化所消除，換句話說，就是可以確認原始資料是

O (N)

的時間複雜度。

追蹤程式邏輯

doit function

在這會做幾個步驟：

prepare_inputs
measure
differentiate
prepare_percentiles
update_statistics

我們追進去觀察 prepare_inputs 實做細節會發現：

randombytes(input_data, N_MEASURES * CHUNK_SIZE);
for (size_t i = 0; i < N_MEASURES; i++) {
    classes[i] = randombit();
    if (classes[i] == 0)
        memset(input_data + (size_t) i * CHUNK_SIZE, 0, CHUNK_SIZE);
}

如果第 i 項是 Null hypothesis 時，則將該區段的記憶體清空。這意謂著：Alternativce hypothesis 的資料與 Null hypothesis 有可能是左右相鄰的。

即：input_data 這一個陣列同時包含 null hypothesis 的資料集與 alternativce hypothesis 的資料集。

measure function

我們在 measure 的時候，是呼叫到 constant.c 中的 measure 函式

bool measure(int64_t *before_ticks,
             int64_t *after_ticks,
             uint8_t *input_data,
             int mode)
{
    assert(mode == DUT(insert_head) || mode == DUT(insert_tail) ||
           mode == DUT(remove_head) || mode == DUT(remove_tail));

    switch (mode) {
    case DUT(insert_head):
        for (size_t i = DROP_SIZE; i < N_MEASURES - DROP_SIZE; i++) {
            char *s = get_random_string();
            dut_new();
            dut_insert_head(
                get_random_string(),
                *(uint16_t *) (input_data + i * CHUNK_SIZE) % ENOUGH_MEASURE);
            int before_size = q_size(l);
            before_ticks[i] = cpucycles();
            dut_insert_head(s, 1);
            after_ticks[i] = cpucycles();
            int after_size = q_size(l);
            dut_free();
            if (before_size != after_size - 1)
                return false;
        }
        break;
    // ...
    }           
    return true;
}

在這是將 Null hypothesis 與 Alternative hypothesis 一起被計算 exec_time

所以在上面畫的圖會發現 0 是如此之多，佔據一半的資料量，因為那些是 Null hypothesis 的資料。

prepare_percentiles function

接下來我們將目光轉向 prepare_percentiles 函式。

static void prepare_percentiles(int64_t *exec_times)
{
    qsort(exec_times, N_MEASURES, sizeof(int64_t), compare_int64);

#ifdef DUDECT
    for (size_t i = 0; i < N_MEASURES; i++)
        exec_times[i] = percentile(
            exec_times, (1 - pow(0.5, 10 * (double) (i + 1) / N_MEASURES)),
            N_MEASURES);
#else
    for (size_t i = 0; i < N_MEASURES; i++)
        exec_times[i] =
            percentile(exec_times, (double) i / N_MEASURES, N_MEASURES);
#endif
}

赫然發現，我們直接 qsort exec_times 邏輯上不正確！！
在這直接 qsort 陣列 exec_times 會使得 null hypothesis 的資料一起被 sort，而這並不符合預期。

先前那樣操作是沒有細追 code ，以為 class[0] 與 class[1] 屬於不同 array 。

對比 Dude 論文原作 dudect/src/dudect.h prepare_percentiles

/*
 set different thresholds for cropping measurements.
 the exponential tendency is meant to approximately match
 the measurements distribution, but there's not more science
 than that.
*/
static void prepare_percentiles(dudect_ctx_t *ctx) {
  qsort(ctx->exec_times, ctx->config->number_measurements, sizeof(int64_t), (int (*)(const void *, const void *))cmp);
  for (size_t i = 0; i < DUDECT_NUMBER_PERCENTILES; i++) {
    ctx->percentiles[i] = percentile(
        ctx->exec_times, 1 - (pow(0.5, 10 * (double)(i + 1) / DUDECT_NUMBER_PERCENTILES)),
        ctx->config->number_measurements);
  }
}

而在原作的 update_statistics 中，將 percential 的 T-test 分配到不同的 ttest_ctxs[crop_index + 1] 計算

static void update_statistics(dudect_ctx_t *ctx) {
  for (size_t i = 10; i < (ctx->config->number_measurements-1); i++) {
    int64_t difference = ctx->exec_times[i];

    if (difference < 0) {
      continue; // the cpu cycle counter overflowed, just throw away the measurement
    }

    // t-test on the execution time
    t_push(ctx->ttest_ctxs[0], difference, ctx->classes[i]);

    // t-test on cropped execution times, for several cropping thresholds.
    for (size_t crop_index = 0; crop_index < DUDECT_NUMBER_PERCENTILES; crop_index++) {
      if (difference < ctx->percentiles[crop_index]) {
        t_push(ctx->ttest_ctxs[crop_index + 1], difference, ctx->classes[i]);
      }
    }

    // The last if condition is trivial ...
  }
}

目前我有兩個問題尚未釐清：

為什麼 qsort 後取 percentile 會維持 class[0] 或 class[1] 屬性
- 舉例來說：
  - 原始資料 ctx->exec_times 為： {0, 42, 31, 10430, 0, 0, 1471 …}
  - prepare_inputs 初始化的 class 為 {0, 1, 1, 1, 0, 0, 1 …}
  - 如此為何 qsort 後取 ctx->percentiles[i] = percentile(ctx->exec_times ...); 是可以與 class 配對上？以至於 update_statistics 中，可以：

if (difference < ctx->percentiles[crop_index]) {
    t_push(ctx->ttest_ctxs[crop_index + 1], difference, ctx->classes[i]);
}

update_statistics 中 for crop_index 迴圈是將各自 percentile 的資料放到各自的 ctx->ttest_ctxs 。這似乎意味著，相同 percentile 的資料僅會與相同 percentile 比較到。這樣是合理的嗎？
- 舉例來說：這比較會是：第一次 measure 的 42% 與第 9527 次 measure 的 42% 做 T-test，因為它們都在 ctx->ttest_ctxs[42 + 1] 上。這樣是合理的嗎？
- 給定我待估測(estimate)的演算法，且測試資料區間固定，即僅對 insert, remove 測試
  $\forall N \in (0, 10001)$ 次操作。
  - 此時第 k 百分位的估測，本身就該屬於常態分佈。
  - 退一步而言之，根據 CLT，對任何分佈平均值的分配會收斂到常態分佈。

Yan-Jie

2024/02/26 10:58:59

一般來說在數學裡頭不會把縱軸的 ylabel 記作 f(x,150) 而且這邊 f(x,150) 也沒有定義清楚 f(x,y) 是機率密度函數 (PDF) 嗎？還是其他函數？

2024/02/26 11:03:50

如果 x 跟 y 軸都有明確文字定義，而且是有物理因次，就直接把 xlabel ylabel 寫上他們的定義跟因次這是在論文跟期刊上 figure 正常的作法

2024/02/26 11:05:33

那文字定義部份太長，就會用 caption 來說明整個圖片的架構。

楊志璿

2024/02/28 06:56:15

@jouae 感謝建議，確實我在作圖時有疏忽。會在今天修正