2024q1 Homework6 (integration)

contributed by < teresasa0731 >

實驗環境

$ gcc --version
gcc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Multiple GCC Versions on Ubuntu
原本的 gcc 是 11.4.0 版本，無法編譯ksort，故先安裝 gcc-12。
安裝參考

自我檢查清單

研讀前述 Linux 效能分析描述，在自己的實體電腦運作 GNU/Linux，做好必要的設定和準備工作
$\to$ 從中也該理解為何不希望在虛擬機器中進行實驗;
閱讀〈Linux 核心模組運作原理〉並對照 Linux 核心原始程式碼 (v6.1+)，解釋 insmod 後，Linux 核心模組的符號 (symbol) 如何被 Linux 核心找到 (使用 List API)、MODULE_LICENSE 巨集指定的授權條款又對核心有什麼影響 (GPL 與否對於可用的符號列表有關)，以及藉由 strace 追蹤 Linux 核心的掛載，涉及哪些系統呼叫和子系統？

〈Linux 核心模組運作原理〉列出的程式碼較舊，歡迎編輯頁面，更新到 Linux v6.1 以上。
閱讀《The Linux Kernel Module Programming Guide》(LKMPG) 並解釋 simrupt 程式碼裡頭的 mutex lock 的使用方式，並探討能否改寫為 lock-free;

參照 2021 年的筆記。歡迎貢獻 LKMPG!

$\to$ 搭配閱讀〈並行和多執行緒程式設計〉
探討 Timsort, Pattern Defeating Quicksort (pdqsort) 及 Linux 核心 lib/sort.c 在排序過程中的平均比較次數，並提供對應的數學證明;

對照 fluxsort 和 crumsort 的分析和效能評比方式
研讀 CMWQ (Concurrency Managed Workqueue) 文件，對照 simrupt 專案的執行表現，留意到 worker-pools 類型可指定 "Bound" 來分配及限制特定 worker 執行於指定的 CPU，Linux 核心如何做到？CMWQ 關聯的 worker thread 又如何與 CPU 排程器互動？

搭配閱讀《Demystifying the Linux CPU Scheduler》
解釋 xoroshiro128+ 的原理 (對照〈Scrambled Linear Pseudorandom Number Generators〉論文)，並利用 ksort 提供的 xoro 核心模組，比較 Linux 核心內建的 /dev/random 及 /dev/urandom 的速度，說明 xoroshiro128+ 是否有速度的優勢？其弱點又是什麼？

$\to$ 搭配閱讀: 不亂的「亂數」
解釋 ksort 如何運用 CMWQ 達到並行的排序;

開發紀錄

研讀 CMWQ (Concurrency Managed Workqueue) 文件

在linux kernel 中有提供 workqueue 的機制用來存放工作項目，以下探討舊的 workqueue 存在什麼問題以至於需要引入 CMWQ ？以及新的 CMWQ 怎麼實現呼叫功能＆與舊的系統相容。

為何需要 CMWQ ？

在非同步執行環境中會定義 work，相應處理的線程稱之 worker，並以 workqueue 來儲存工作項目。根據原本 wq 設計，單線程 (single-thread, ST) wq 在整個系統中只有一個工作線程;多線程（multi-thread, MT）在每一個 CPU 都有一個工作線程，前者會有 deadlock 問題，後者則是隨著 CPU 核心增加而創建大量工作線程，進一步導致 PID 空間浪費。

CMWQ 重新設計 wq 實作方式：

保持原有wq API的兼容性。
每一個 CPU 維持自己的 workers pool，並在 wq 中共享。
自動調節 worker pool 和平行度。

CMWQ 提出 worker pool （類似一種 thread pool），在系統中存在 worker pools 不與特定 wq 相關連而是所有 wq 共享，並根據 flag 交付該 work 給某個 worker pool。

When a work item is queued to a workqueue, the target worker-pool is determined according to the queue parameters and workqueue attributes and appended on the shared worklist of the worker-pool. For example, unless specifically overridden, a work item of a bound workqueue will be queued on the worklist of either normal or highpri worker-pool that is associated to the CPU the issuer is running on.

workqueue.[ch]

根據linux 程式碼 /include/linux/workqueue.h可以看到在 struct workqueue_attrs 定義了 wq 的屬性，其中有 cpumask_var_t cpumask 來限定 work 分配到的 CPU

Work items in this workqueue are affine to these CPUs and not allowed to execute on other CPUs. A pool serving a workqueue must have the same @cpumask.

此處提及使用此遮罩可以限定在與工作提交相同的 CPU 上執行，也就是說在同一 wq 的資源持必須有相同 @cpumask，避免工作在 CPU 間遷移，盡可能使用 CPU 快取的區域局部性。而在 /kernel/workqueue.c 則透過 workqueue_init_early() 函式來進行初始化的設定，包含記憶體分配、cpumasks 和 idr。

解釋 ksort 如何運用 CMWQ 達到並行的排序

在 ksort/sort_impl.c 中引入了 linux/workqueue.h。根據 qsort_algo() 邏輯，可以看到當元素數量小於 7 時會使用 Insertion Sort：

    if (n < 7) {
        for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es)
            for (pl = pm; pl > (char *) a && CMP(thunk, pl - es, pl) > 0;
                 pl -= es)
                q_swap(pl, pl - es);
        return;
    }

元素數目大於 7 時則改用 Quick Sort 排序：

    pm = (char *) a + (n / 2) * es;
    if (n > 7) {
        pl = (char *) a;
        pn = (char *) a + (n - 1) * es;
        if (n > 40) {
            d = (n / 8) * es;
            pl = med3(pl, pl + d, pl + 2 * d, cmp, thunk);
            pm = med3(pm - d, pm, pm + d, cmp, thunk);
            pn = med3(pn - 2 * d, pn - d, pn, cmp, thunk);
        }
        pm = med3(pl, pm, pn, cmp, thunk);
    }
    q_swap(a, pm);
    pa = pb = (char *) a + es;

    pc = pd = (char *) a + (n - 1) * es;
    for (;;) {
        while (pb <= pc && (r = CMP(thunk, pb, a)) <= 0) {
            if (r == 0) {
                swap_cnt = 1;
                q_swap(pa, pb);
                pa += es;
            }
            pb += es;
        }
        while (pb <= pc && (r = CMP(thunk, pc, a)) >= 0) {
            if (r == 0) {
                swap_cnt = 1;
                q_swap(pc, pd);
                pd -= es;
            }
            pc -= es;
        }
        if (pb > pc)
            break;
        q_swap(pb, pc);
        swap_cnt = 1;
        pb += es;
        pc -= es;
    }

    pn = (char *) a + n * es;
    r = min(pa - (char *) a, pb - pa);
    vecswap(a, pb - r, r);
    r = min(pd - pc, pn - pd - (long) es);
    vecswap(pb, pn - r, r);

    if (swap_cnt == 0) { /* Switch to insertion sort */
        r = 1 + n / 4;   /* n >= 7, so r >= 2 */
        for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es)
            for (pl = pm; pl > (char *) a && CMP(thunk, pl - es, pl) > 0;
                 pl -= es) {
                q_swap(pl, pl - es);
                if (++swap_cnt > r)
                    goto nevermind;
            }
        return;
    }

而若是沒有發生交換，即swap_cnt == 0，則切換回插入排序；若是交換次數超過閾值 r = 1 + n / 4 則會跳入並行處理區間 nevermind:

nevermind:
    nl = (pb - pa) / es;
    nr = (pd - pc) / es;

    if (nl > 100 && nr > 100) {
        struct qsort *q = kmalloc(sizeof(struct qsort), GFP_KERNEL);
        init_qsort(q, a, nl, c);
        queue_work(workqueue, &q->w);
    } else if (nl > 0) {
        qs->a = a;
        qs->n = nl;
        /* The struct qsort is used for recursive call, so don't free it in
         * this iteration.
         */
        do_free = false;
        qsort_algo(w);
    }

    if (nr > 0) {
        a = pn - nr * es;
        n = nr;
        goto top;
    }

    if (do_free)
        kfree(qs);
}

若是分區後左右子陣列都大於 100 ，此時會分配新的 qsort 結構給左子陣列並加入 wq 並行處理

2024q1 Homework6 (integration)

實驗環境

自我檢查清單

開發紀錄

研讀 CMWQ (Concurrency Managed Workqueue) 文件

為何需要 CMWQ ？

workqueue.[ch]

解釋 ksort 如何運用 CMWQ 達到並行的排序

Read more

2024q1 Homework5 (assessment)

Linux Random Number Generator

2024 年 Linux 核心設計/實作 課程資源閱讀筆記

2024q1 Homework1 (lab0)

2024 年 Linux 核心設計/實作課程資源閱讀筆記