2024q1 Homework6 (integration)

# 2024q1 Homework6 (integration) contributed by < [Appmedia06](https://github.com/Appmedia06) > ## 開發環境 ``` $ gcc --version gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Copyright (C) 2021 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 39 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Vendor ID: GenuineIntel Model name: 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz CPU family: 6 Model: 140 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 Stepping: 1 CPU max MHz: 4200.0000 CPU min MHz: 400.0000 BogoMIPS: 4838.40 Virtualization features: Virtualization: VT-x Caches (sum of all): L1d: 192 KiB (4 instances) L1i: 128 KiB (4 instances) L2: 5 MiB (4 instances) L3: 8 MiB (1 instance) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-7 ``` ## 自我檢查清單 - [x] 研讀前述 ==Linux 效能分析== 描述，在自己的實體電腦運作 GNU/Linux，做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗; - [x] 閱讀〈[Linux 核心模組運作原理](https://hackmd.io/@sysprog/linux-kernel-module)〉並對照 Linux 核心原始程式碼 (v6.1+)，解釋 `insmod` 後，Linux 核心模組的符號 (symbol) 如何被 Linux 核心找到 (使用 List API)、`MODULE_LICENSE` 巨集指定的授權條款又對核心有什麼影響 (GPL 與否對於可用的符號列表有關)，以及藉由 [strace](https://man7.org/linux/man-pages/man1/strace.1.html) 追蹤 Linux 核心的掛載，涉及哪些系統呼叫和子系統？ - [x] 閱讀《[The Linux Kernel Module Programming Guide](https://sysprog21.github.io/lkmpg/)》(LKMPG) 並解釋 [simrupt](https://github.com/sysprog21/simrupt) 程式碼裡頭的 mutex lock 的使用方式，並探討能否改寫為 [lock-free](https://hackmd.io/@sysprog/concurrency-lockfree); > 參照 [2021 年的筆記](https://hackmd.io/@linD026/simrupt-vwifi)。歡迎貢獻 LKMPG! > $\to$ 搭配閱讀〈[並行和多執行緒程式設計](https://hackmd.io/@sysprog/concurrency)〉 - [x] 探討 Timsort, Pattern Defeating Quicksort (pdqsort) 及 Linux 核心 [lib/sort.c](https://github.com/torvalds/linux/blob/master/lib/sort.c) 在排序過程中的平均比較次數，並提供對應的數學證明; > 對照 [fluxsort](https://github.com/scandum/fluxsort) 和 [crumsort](https://github.com/scandum/crumsort) 的分析和效能評比方式 - [x] 研讀 [CMWQ](https://www.kernel.org/doc/html/latest/core-api/workqueue.html) (Concurrency Managed Workqueue) 文件，對照 [simrupt](https://github.com/sysprog21/simrupt) 專案的執行表現，留意到 worker-pools 類型可指定 "Bound" 來分配及限制特定 worker 執行於指定的 CPU，Linux 核心如何做到？CMWQ 關聯的 worker thread 又如何與 CPU 排程器互動？ > 搭配閱讀《Demystifying the Linux CPU Scheduler》 - [ ] 解釋 `xoroshiro128+` 的原理 (對照〈[Scrambled Linear Pseudorandom Number Generators](https://arxiv.org/pdf/1805.01407.pdf)〉論文)，並利用 [ksort](https://github.com/sysprog21/ksort) 提供的 `xoro` 核心模組，比較 Linux 核心內建的 `/dev/random` 及 `/dev/urandom` 的速度，說明 `xoroshiro128+` 是否有速度的優勢？其弱點又是什麼？ > $\to$ 搭配閱讀: [不亂的「亂數」](https://blog.cruciferslab.net/?p=599) - [ ] 解釋 [ksort](https://github.com/sysprog21/ksort) 如何運用 CMWQ 達到並行的排序; ## 處理並行排序的 Linux 核心模組的前期準備根據 [M06: integration](https://hackmd.io/@sysprog/linux2024-integration-a#-ksort-%E8%99%95%E7%90%86%E4%B8%A6%E8%A1%8C%E6%8E%92%E5%BA%8F%E7%9A%84-Linux-%E6%A0%B8%E5%BF%83%E6%A8%A1%E7%B5%84) 的步驟準備。我的電腦上已經有 Linux 系統了。我們需要將 UEFI Secure Boot 的功能**關閉**。可以重新開機，選擇 UEFI 的選項，找到進階設定中的 Security ，把 Secure Boot 設置為 Disable 。接著就可以按照步驟將 `ksort` 專案載入，但在使用 `make check` 命令編譯後，出現以下警告，是編譯器版本不同的問題，但尚未處理。 ``` warning: the compiler differs from the one used to build the kernel The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 You are using: gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 ``` 透過以下命令就可以將模組載入核心，使用 `insmod` 將模組載入核心後才會有下面的裝置檔案 `/dev/sort`。 ``` $ sudo insmod sort.ko ``` ### 閱讀〈Linux 核心模組運作原理〉 #### `insmod` 解釋在解釋 `insmod` 命令之前，先來說明 Linux 核心模組(module) 到底是什麼？ Linux 核心模組是可以動態載入到 Linux 核心而無需重新啟動的程式碼片段。它旨在擴展核心的功能，允許包含新功能或支援各種硬體設備或檔案系統。核心模組允許在核心中新增或刪除程式碼，而無需重新編譯或修改整個核心。由於它們允許按需添加設備驅動程序，因此它們對於支援編譯核心時不可用的硬體特別有用。有動態載當就就靜態載入。靜態載入的缺點就是有問題會很難處理，每次修改一個地方就要重新編譯核心並下載，導致效率不好。而若採用靜態載入的驅動程式較多，就會導致核心非常龐大，浪費資源。核心模組儲存在副檔名為 `.ko` 的檔案中，而這個檔案是一個可執行的二進制文件，我們要透過 `insmod` 或 `modprobe` 命令將其載入到核心中，若是載入核心，就要在 kernel space 執行，需要將原本在 user space 中的文件透過 system call 的方式處理文件並將其複製到 kernel space 的記憶體空間中。一旦加載後，該模組就成為作業系統核心的一個組成部分，並且可以與其他系統元件互動。 ![linux-kernel-module-1](https://hackmd.io/_uploads/B1CXCE-DR.png) > 資料及圖片來源: [What is a Kernel Module in Linux?](https://www.scaler.com/topics/linux-kernel-module/) 以教材中的 Hello 程式來解釋。 ```cpp #include <linux/init.h> #include <linux/module.h> MODULE_LICENSE("Dual BSD/GPL"); static int hello_init(void) { printk(KERN_INFO "Hello, world\n"); return 0; } static void hello_exit(void) { printk(KERN_INFO "Goodbye, cruel world\n"); } module_init(hello_init); module_exit(hello_exit); ``` * 包含核心模組開發所需的頭文件。 ```cpp #include <linux/init.h> #include <linux/module.h> ``` * 這些巨集定義模組的授權、作者身份和描述。在這種情況下，該模組是根據 GPL（GNU 通用公共授權）獲得許可。 ```cpp MODULE_LICENSE("Dual BSD/GPL"); ``` 根據 [GPL](https://zh.wikipedia.org/zh-tw/GNU%E9%80%9A%E7%94%A8%E5%85%AC%E5%85%B1%E8%AE%B8%E5%8F%AF%E8%AF%81) 可以知道 GPL(General Public License)是一個自由軟體授權條款，而從 [Linux kernel licensing rules](https://docs.kernel.org/process/license-rules.html) 得知有以下幾種條款。 1. `GPL`：GNU General Public License，通用公共許可證。 2. `GPL v2`：GNU General Public License version 2，第二版的通用公共許可證。 3. `GPL and additional rights`：表示模組除了遵守 GPL 還有額外的權利。 4. `Dual BSD/GPL`：表示模組同時遵守 BSD 和 GPL 許可證。 5. `Proprietary`：表示模組是專有軟件，不遵守 GPL。 * 該函數是模組的初始化例程。當使用 insmod 命令將模組載入到核心時會呼叫它。根據這篇 [Message logging with printk](https://www.kernel.org/doc/html/latest/translations/zh_CN/core-api/printk-basics.html) 可以知道 printk 是 Linux 核心中的一個函數，用於核心狀態消息打印和日誌記錄。類似於用戶態中的 printf 函數。`<loglevel>` 表示日誌級別，用於指示消息的嚴重程度。範例中的 `KERN_INFO` 就是信息性消息。 ```cpp static int hello_init(void) { printk(KERN_INFO "Hello, world\n"); return 0; } ``` * 該函數是模組的退出例程，當使用 `rmmod` 命令從核心卸載模組時呼叫函數。 ```cpp static void hello_exit(void) { printk(KERN_INFO "Goodbye, cruel world\n"); } ``` #### 使用 strace 工具追蹤 Linux 核心的掛載，涉及哪些系統呼叫和子系統？輸入 `$ sudo strace insmod fibdrv.ko` 命令，即可以觀察式執行的 system call。 * 設置特定於體系結構的行程或執行緒狀態，如`arch_prctl`。 * `mmap` 將文件映射到記憶體空間上，`access` 檢查呼叫程序是否可以存取該文件路徑名。`mprotect` 更改呼叫的存取保護行程的記憶體區間。 * `openat` 打開模組文件，最後再用 `close` 關起來。 * `finit_module` 將可 ELF(Executable and Linking Format) 載入到 kernel space 並初始化模組。 ## 閱讀《[The Linux Kernel Module Programming Guide](https://sysprog21.github.io/lkmpg/)》(LKMPG) 並解釋 [simrupt](https://github.com/sysprog21/simrupt) 程式碼 ### 背景知識 [simrupt](https://github.com/sysprog21/simrupt) 模擬硬體中斷，並將模擬的中斷事件數據傳遞到 User space。而整體的操作流程如下： 1. 使用 `simrupt_init` 函式加載模組，初始化 FIFO 緩衝區並設置定時器。 2. 定時器周期性觸發，調用 `timer_handler` 。 3. 調用 `process_data` 函式生成模擬數據並將其放入快速緩衝區。 4. 任務從快速緩衝區中提取數據，並將其放入 FIFO 緩衝區。 5. User space 的程式通過讀取數據設備來獲取這些數據。 ### kfifo 值得注意的是， FIFO 緩衝區是使用定義在 [include/linux/kfifo.h](https://github.com/torvalds/linux/blob/master/include/linux/kfifo.h) 的 Linux API 。它是一個**環狀佇列**。並且在 One producer One Consumer(1P1C) 的情況下是可以無鎖。以下是 kfifo 的示意圖： ![Screenshot from 2024-07-04 21-28-45](https://hackmd.io/_uploads/HJ8sm74D0.png) 先看一下 kfifo 的結構體： ```cpp struct __kfifo { unsigned int in; unsigned int out; unsigned int mask; unsigned int esize; void *data; }; ``` kfifo 使用了 `in` 和 `out` 兩個變量分別作為加入和離開索引。而在最簡單的多執行緒，也就是 1P1C 的情況下，兩個執行緒做加入和離開時就不會發生同時存取到同一個變數，進而達成無鎖的狀況。 * 加入 `n` 個數據 `in` 就加 `n` * 離開 `m` 個數據 `out` 就加 `m` * `out` 不能比 `in` 大，若 `out` 等於 `in` 代表佇列為空 * `in` 不能比 `out` 大超過 FIFO 的空間考慮到如果一直有數據加入，`in` 的數值會越來越大，但因為其是 `unsigned int` 型態，所以當 `in` 到 $2^{32}$ 後，就會自動變成 0 了。這邊整理幾個 kfifo 的特色： 1. `in` 和 `out` 使用 `unsigned int` 型態，在加入和離開時不對其取餘數，而是讓其自然溢出，**讓其保持 `in - out` 永遠是佇列的大小**。 2. **保證佇列的大小為 2 的幂**，這樣做的目的是為了使加入和離開數據的讀取不用取餘數，而是利用 `mask` 變數去 AND 上 `in` 或 `out`，而 `mask` 本身就是佇列大小減一。根據 [lib/kfifo.c](https://elixir.bootlin.com/linux/v4.14/source/lib/kfifo.c#L109): ```cpp=109 off &= fifo->mask; ``` 3. 使用 [Memory barrier](https://zh.wikipedia.org/zh-tw/%E5%86%85%E5%AD%98%E5%B1%8F%E9%9A%9C) 來達成 1P1C 的情況下 kfifo 的無鎖並行(Concurrency)。以下是 kfifo 在加入和離開時用的幾個 Memory barrier 操作。 * `smp_rmb` 保證讀取操作之間不會出現亂序 * `smp_wmb` 保證寫入操作之間不會出現亂序 * `smp_mb` 保證讀寫操作都不會出現亂序 ### simrupt 專案解釋本專案旨在模擬一個中斷的 Linux 設備驅動程式，並將資料傳回 User space。先上流程圖： ![simrupt](https://hackmd.io/_uploads/ryM9BYqw0.png) 這邊一步一步來解釋： #### Timer 使用計時器 `timer_handler` 來模擬週期性的 IRQ (Interrupt request) 。會發出中斷請求。 ```cpp /* Timer to simulate a periodic IRQ */ static struct timer_list timer; static void timer_handler(struct timer_list *__timer) { ktime_t tv_start, tv_end; s64 nsecs; pr_info("simrupt: [CPU#%d] enter %s\n", smp_processor_id(), __func__); /* We are using a kernel timer to simulate a hard-irq, so we must expect * to be in softirq context here. */ WARN_ON_ONCE(!in_softirq()); /* Disable interrupts for this CPU to simulate real interrupt context */ local_irq_disable(); tv_start = ktime_get(); process_data(); tv_end = ktime_get(); nsecs = (s64) ktime_to_ns(ktime_sub(tv_end, tv_start)); pr_info("simrupt: [CPU#%d] %s in_irq: %llu usec\n", smp_processor_id(), __func__, (unsigned long long) nsecs >> 10); mod_timer(&timer, jiffies + msecs_to_jiffies(delay)); local_irq_enable(); } ``` ### process_data 收到 Timer 的中斷信號後，`process_data` 函式呼叫`fast_buf_put(update_simrupt_data());` ，其中 `update_simrupt_data()` 會產生資料，這些資料的範圍在 0x20 到 0x7E 之間，即 ASCII 中的可顯示字元，這些資料會被放入 circular buffer 中，最後交由 `tasklet_schedule` 進行排程。 ### fast circular buffer Simrupt 在將產生的資料放入 kfifo 前，會先使用這個「更快」的緩衝區來存放資料。 ```cpp /* We use an additional "faster" circular buffer to quickly store data from * interrupt context, before adding them to the kfifo. */ static struct circ_buf fast_buf; ``` fast circular buffer，顧名思義，是一個環狀的緩衝區，它用兩個變數 `head` 和 `tail` 來只是佇列的大小。當 `head` 和 `tail` 相同時代表緩衝區為空。 * `fast_buf_put` 扮演一個 producer 的角色，藉由 `CIRC_SPACE()` 判斷 buffer 中是否有剩餘空間，並更新 `head index`。它會將生成的資料放入 fast circular buffer 。 ```cpp static int fast_buf_put(unsigned char val) { struct circ_buf *ring = &fast_buf; unsigned long head = ring->head; /* prevent the compiler from merging or refetching accesses for tail */ unsigned long tail = READ_ONCE(ring->tail); /* is circular buffer full? */ if (unlikely(!CIRC_SPACE(head, tail, PAGE_SIZE))) return -ENOMEM; ring->buf[ring->head] = val; /* commit the item before incrementing the head */ smp_wmb(); /* update header pointer */ ring->head = (ring->head + 1) & (PAGE_SIZE - 1); return 0; } ``` * `fast_buf_get` 扮演一個 consumer 的角色，會從緩衝區中取得資料，並更新 `tail index`。它會將 fast circular buffer 的資料放入 work queue。 ```cpp static int fast_buf_get(void) { struct circ_buf *ring = &fast_buf; /* prevent the compiler from merging or refetching accesses for tail */ unsigned long head = READ_ONCE(ring->head), tail = ring->tail; int ret; if (unlikely(!CIRC_CNT(head, tail, PAGE_SIZE))) return -ENOENT; /* read index before reading contents at that index */ smp_rmb(); /* extract item from the buffer */ ret = ring->buf[tail]; /* finish reading descriptor before incrementing tail */ smp_mb(); /* increment the tail pointer */ ring->tail = (tail + 1) & (PAGE_SIZE - 1); return ret; } ``` #### tasklet 先來考慮一下 Linux 中斷的流程，大致分成兩個部份： 1. 中斷 handler (top half) 2. deferable task (bottom half) 對於中斷 handler (top half) 是中斷中緊急的部份，例如cpu與週邊裝置之間的交互，取得狀態，ack狀態，收發資料等等。而 deferable task 的工作則是相對不緊急的，例如後續的資料收發。因此對於作業系統而言，我可以設計一個機制來將 top half 和 bottom half 的任務分離，盡可能加速 top half 的工作。而 tasklet 就是這樣的機制，用來推遲不重要的任務的執行。 tasklet 是基於 softirq 之上建立的，但最大的差別在於 tasklet 可以動態配置且可以被用在驅動裝置上。使用 `queue_work` 將 work 放入 workqueue 中，並記錄執行時間。 ```cpp /** * queue_work - queue work on a workqueue * @wq: workqueue to use * @work: work to queue */ static inline bool queue_work(struct workqueue_struct *wq, struct work_struct *work) { return queue_work_on(WORK_CPU_UNBOUND, wq, work); } ``` #### workqueue 請參照下面筆記提到 workqueue 的運作原理。這裡用了兩個 mutex_lock ，一個是 producer，一個是 comsumer 的鎖。producer 的鎖是用來防止多個任務寫入 kfifo ，comsumer 的鎖是用來阻止其他任務取得 circular buffer 的資料。 ```cpp /* Workqueue handler: executed by a kernel thread */ static void simrupt_work_func(struct work_struct *w) { int val, cpu; /* This code runs from a kernel thread, so softirqs and hard-irqs must * be enabled. */ WARN_ON_ONCE(in_softirq()); WARN_ON_ONCE(in_interrupt()); /* Pretend to simulate access to per-CPU data, disabling preemption * during the pr_info(). */ cpu = get_cpu(); pr_info("simrupt: [CPU#%d] %s\n", cpu, __func__); put_cpu(); while (1) { /* Consume data from the circular buffer */ mutex_lock(&consumer_lock); val = fast_buf_get(); mutex_unlock(&consumer_lock); if (val < 0) break; /* Store data to the kfifo buffer */ mutex_lock(&producer_lock); produce_data(val); mutex_unlock(&producer_lock); } wake_up_interruptible(&rx_wait); } ``` #### kfifo 使用 `produce_data` 將資料插入 kfifo 的緩衝區裡。 ```cpp /* Insert a value into the kfifo buffer */ static void produce_data(unsigned char val) { /* Implement a kind of circular FIFO here (skip oldest element if kfifo * buffer is full). */ unsigned int len = kfifo_in(&rx_fifo, &val, sizeof(val)); if (unlikely(len < sizeof(val)) && printk_ratelimit()) pr_warn("%s: %zu bytes dropped\n", __func__, sizeof(val) - len); pr_debug("simrupt: %s: in %u/%u bytes\n", __func__, len, kfifo_len(&rx_fifo)); } ``` 最後，使用 `kfifo_to_user` 將最多 len 個 bytes 資料從 fifo 移到 userspace。再透過 `simrupt_read` 從 user space 讀取資料。 ### mutex lock 改寫為 lock-free 從上面的說明可以知道在 1P1C 的情況下 kfifo 是無鎖並行的，因此不需要對 kfifo 處理，而是要處理快速緩衝區 `fast_buf` 要改為無鎖操作。原本的 mutex lock 為一個行程進入臨界區，若是無法獲得鎖的情況下，會進入阻塞狀態。而這邊使用到的是 `mutex_lock_interruptible` 函式，其和一般的 mutex lock 的差別就是提供 mutex 在睡眠狀態下仍然能被中斷。 ```cpp ... if (mutex_lock_interruptible(&read_lock)) return -ERESTARTSYS; ... mutex_unlock(&read_lock); ``` 參考了[並行程式設計: Atomics 操作](https://hackmd.io/@sysprog/concurrency-atomics#wait-free-amp-lock-free)裡面提到: > mutex 導致效能下降的因素，往往是因臨界區過長且過於頻繁，也就限制並行程度，或競爭過於激烈 (context switch 開銷大幅增加)。lock-free/wait-free 演算法的價值在於，其保證一個或所有執行緒始終有進展，而非絕對的高效率。lock-free 和 wait-free 演算法在一種狀況會獲得更好的效率：演算法本身可用少量 atomic 指令實作 —— 實作 lock 也要用到 atomic 指令，當演算法本身用少數指令即可完成時，相比額外用 lock 當然就會有優勢。因此我嘗試引入 `linux/atomic.h` 的原子操作來實現，但還未確定效能上是否有較 mutex lock 更優異。 ```cpp static int fast_buf_get(void) { struct circ_buf *ring = &fast_buf; unsigned long head = atomic_read(&ring->head); unsigned long tail = atomic_read(&ring->tail); int ret; if (CIRC_CNT(head, tail, PAGE_SIZE) == 0) return -ENOENT; smp_rmb(); ret = ring->buf[tail]; atomic_set(&ring->tail, (tail + 1) & (PAGE_SIZE - 1)); return ret; } ``` ## 探討 Timsort, pdqsort、list_sort 平均比較次數本節要探討 Timsort、pdqsort、list_sort 的平均比較次數，Timsort 和 list_sort 我之前已經分析過了，而 pdqsort 則是結合了插入排序法(Insertion sort)、堆積排序法(Heap sort)以及快速排序法(Quick sort)。 * [我的 Timsort 筆記](https://hackmd.io/DMjzkMZrSjWBSMznxlSQtw?view#quiz1-%E6%B8%AC%E9%A9%972) * [我的 list_sort 筆記](https://hackmd.io/@Appmedia/Linux_kernel_list_sort) * [pdqsort 筆記](https://hackmd.io/@Chang-Chia-Chi/pdqsort) > 參考 Chang-Chia-Chi ### pdqsort Pattern Defeating Quicksort(pdqsort) 是一種多種理想排序的混合排序演算法，其核心思想就是判斷現在數組的狀況來選擇要使用的演算法。和快速排序法一樣是 unstable 的，而時間複雜度上，在最好的情況為 $O(n)$ ，而最差的情況則為 $O(nlogn)$ 。 * pdqsort 使用到的排序演算法 | Algorithm | Best case | Average case | Worst case | Stability | | -------------- | ---------- | ------------ | ---------- | --------- | | Insertion sort | $O(n)$ | $O(n^2)$ | $O(n^2)$ | Stable | | Quick sort | $O(nlogn)$ | $O(nlogn)$ | $O(n^2)$ | Unstable | | Heap sort | $O(nlogn)$ | $O(nlogn)$ | $O(nlogn)$ | Unstable | 從上表可以看出三個不同演算法的特性： 1. Insrtion sort 在數組大小較短時，通常會有非常好的表現。 2. Quick sort 在平均狀況下的時間複雜度為 $O(nlogn)$ ，但是一旦 `pivot` 的選擇不好時，就會退化到 $O(n^2)$ ，因此選擇一個好的 `pivot` 是快速排序法的重點。 > 這邊可以參考我之前做的[實驗](https://hackmd.io/DMjzkMZrSjWBSMznxlSQtw#%E9%81%BF%E5%85%8D%E6%9C%80%E5%B7%AE%E7%8B%80%E6%B3%81%E7%9A%84%E5%BF%AB%E9%80%9F%E6%8E%92%E5%BA%8F%E5%AF%A6%E4%BD%9C) 3. Heap sort 則保證了就算在最差情況其時間複雜度仍有 $O(nlogn)$ 。根據這三個演算法各自的優點，pdqsort 會根據不同情況使用不同演算法： * 當長度小於一定數值時，執行插入排序法，$O(n)$ * 在一般狀況時，執行快速排序法，$O(nlogn)$ * 若是發現快速排序法的效能不佳，則執行堆積排序法，$O(nlogn)$ ![pdqsort](https://hackmd.io/_uploads/SJ-y0bYv0.png) ### 實驗我將 Timsort、[pdqsort](https://github.com/hankluo6/pdqsort/blob/main/pdqsort.c)、list_sort 三種排序演算法整理，來探討其比較次數及執行時間的差別。 > 原始碼請參照 [Appmedia06/sort-comparison](https://github.com/Appmedia06/sort-comparison) 實驗環境如本筆記開頭所示。克隆專案後，可以直接執行 `make` 命令編譯，便可以執行實驗程式： ``` $ ./main <sort type> <size> <rand size> ``` * <sort type>: * 0: `list_sort` * 1: `Timsort` * 2: `pdqsort` * <size>: 數列長度 * <rand size>: 數列中隨機數的數量這邊說明一下，在做實驗前都有將快取的資料清除，而因為本程式包含生成數列的部份以及因為數列部份不一，因此 perf 效能工具不好量測，因此我在排序函式前後放置量測時間變數。 #### 數列長度 10000，完全隨機 | | list_sort | Timsort | pdqsort | | ------------- | --------- | -------- | -------- | | 比較次數(次) | 121006 | 131788 | 165964 | | 排序時間(sec) | 0.004242 | 0.004398 | 0.001908 | #### 數列長度 50000，完全隨機 | | list_sort | Timsort | pdqsort | | ------------- | --------- | -------- | -------- | | 比較次數(次) | 721306 | 770839 | 954597 | | 排序時間(sec) | 0.011339 | 0.010800 | 0.009194 | #### 數列長度 10000，一半隨機 | | list_sort | Timsort | pdqsort | | ------------- | --------- | -------- | -------- | | 比較次數(次) | 98943 | 75600 | 125043 | | 排序時間(sec) | 0.002629 | 0.002530 | 0.001190 | #### 數列長度 50000，一半隨機 | | list_sort | Timsort | pdqsort | | ------------- | --------- | -------- | -------- | | 比較次數(次) | 579353 | 451795 | 670515 | | 排序時間(sec) | 0.007334 | 0.006555 | 0.006200 | #### 數列長度 10000，已排序（升序） | | list_sort | Timsort | pdqsort | | ------------- | --------- | -------- | -------- | | 比較次數(次) | 73777 | 21070 | 20008 | | 排序時間(sec) | 0.000972 | 0.000354 | 0.000103 | #### 數列長度 50000，已排序（升序） | | list_sort | Timsort | pdqsort | | ------------- | --------- | -------- | -------- | | 比較次數(次) | 437120 | 105515 | 100008 | | 排序時間(sec) | 0.003127 | 0.001418 | 0.000532 | #### 視覺化 * 比較次數，數列長度 50000 ![compare](https://hackmd.io/_uploads/S1qsbVYwA.png) * 時間，數列長度 50000 ![time](https://hackmd.io/_uploads/rkEh-EKvR.png) ### 總結探討 pdqsort ，作為混合排序法的演算法，可以看到當數列從亂序到已排序，比較次數明顯下降，便可以推斷出因為以排序了，導致快速排序法的效能不好，因此改為堆積排序法。而 pdqsort 不管在哪個數列的情況都是贏過另外兩個演算法，也證明了混合各自演算法優點是一個好的決策。這邊有一個有趣的發現，就是前面提到每次實驗都會先將快取清除才開始。Timsort 和 pdqsort 其實有沒有清掉快取較能並沒有插到非常多。但是 list_sort 若是沒有清掉快取效能速度竟然可以快了 $2.6$ 倍，這也證明了 list_sort 是對快取友善的。 ### 參考 > [Pattern-defeating Quicksort](https://arxiv.org/abs/2106.05123) > [Pattern-defeating Quicksort 閱讀筆記](https://hackmd.io/@Chang-Chia-Chi/pdqsort) > [打造 Go 语言最快的排序算法](https://blog.csdn.net/ByteDanceTech/article/details/124464192) > [hankluo6/pdqsort](https://github.com/hankluo6/pdqsort) ## 研讀 [CMWQ](https://www.kernel.org/doc/html/latest/core-api/workqueue.html) 文件，對照 [simrupt](https://github.com/sysprog21/simrupt) 專案的執行表現 > 可以參照以下筆記，寫的非常詳細 > [M06: integration](https://hackmd.io/@sysprog/linux2024-integration-c#M06-integration) ### Concurrency Managed Workqueue Linux 核心提供 workqueue 介面以達成非同步 (asynchronous) 的流程執行，使用者以一個 "work item" 來描述需被執行的 context，將其加入到所建立佇列之中，Linux 核心隨後會選出合適的 "worker" thread 來完成，處理結束者會從佇列中刪除，目的是要簡化執行緒的建立，可根據目前系統的 CPU 個數建立執行緒，使得執行緒得以並行。而為了解決以前 workqueue 的問題： * 使用 ST workqueue 得到的並行等級較低，但較省資源。 * 使用 MT workqueue 固然有嚴重的消耗資源，能提高的並行等級卻很有限。就有了 CMWQ 的產生，其具有以下特性： * 保持與原本的 workqueue API 的兼容性。 * 捨棄各個 workqueue 獨立對應一組 worker-pools 的作法。取而代之，所有 workqueue 會共享 per-CPU worker pools，並按需提供靈活的並行等級，避免大量的資源耗損 * worker pool 和並行等級間的調整由系統內部處理，對使用者來說可以視為一個黑盒子。 ### Bound & Unbound 考慮到 worker-pools 有兩種類型: 1. Bound worker-pools: 它會綁定特定的 CPU，使管理的 worker 執行在指定的 CPU 上執行，而每個 CPU 中會有二個 worker-pools 一個為高優先級的，另一個給普通優先級的。 2. Unbound worker-pools: thread pool 用於處理不綁定特定 CPU，其 thread pool 是動態變化，藉由設定 workqueue 的屬性建立對應的 worker-pools。

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.