Linux 核心專題: 重作 lab0

# Linux 核心專題: 重作 lab0 > 執行人: kurtislin ### Reviewed by `MuChengChen` >我針對了三種數據大小去做測試 > - 1028 （2的冪再多一點） > - 4096 （2的冪） > - 50000 為什麼測試 list sort 的效能差異是選用這 3 個數據大小做測試 ? 有數學上的理論依據嗎 ? > [name=kurtislin] > 在第一篇論文中題到了資料的大小對於merge sort的比較次數有週期性的影響，當測試資料數量 n 接近 2的冪時所需的比較次數較低如果為 2的冪再加上小部份的資料那所需的比較次數就會增多我想去測試是否在這兩個特殊情況下比較次數的差異 ## TODO: 研讀作業提及的論文並驗證 > 依據[第一份作業](https://hackmd.io/@sysprog/linux2025-lab0/)指示，研讀指定的論文，紀錄你的認知並進行驗證，應當包含 Linux 核心原始程式碼 `lib/list_sort.c` 背後的原理。驗證過程中，務必排除相關的干擾。針對 dudect 程式碼和其論文，也該進行相似的投入，掌握理論和實務。 ## 研究重點本專題聚焦於深入理解 linux 核心的 list sort 實作及理論基礎 :::danger 注意書寫規範：中英文間用一個半行空白字元區隔。 ::: ## linux list sort 和一般merge sort的差別 ### 一般merge sort * top down 遞迴方式 * 固定 1:1 切割 ### linux list sort * bottom up * depth first order * 使用 pending list 去管理不同大小的子序列並確保每次合併都是2:1平衡的 ### linux list sort採用以上策略的目的採用bottom up 的原因 * 可以不使用遞迴以避免stack overflow的風險 * top down 需要先去做partition再去merge ，bottom up 只需要去 merge 就好，在partition 的過程中每層都要完整遍歷子序列 cache 要一直去更新，很常要用到的資料已經不在 cache 裡面了所以會導致 cache thrashing * 要先知道資料的數量採用 depth first 是要讓只要有兩個 list 是一樣大小就讓他們合併，可以讓前期的合併能盡量塞得下 L1 cache ，不用一開始就把整個資料集搬進L2 cache或DRAM 減少 cmp() 呼叫的次數， cmp 是 function pointer ，每次呼叫都有 control hazard 的風險 function pointer 造成 control hazard 是因為 cpu 必須等到執行時才能知道要跳轉的目標地址沒有辦法去提前預設和準備後續指令導致 pipeline 停頓或需要清空 ## [commit b5c56e0](https://github.com/torvalds/linux/commit/b5c56e0cdd62979dd538e5363b06be5bdf735a09) 這份commit 之前的linux list sort * bottom-up * 使用depth-fist order * 考慮cache blocking * 使用simple eager merge 作者在這個commit新增的功能 * 保證2:1平衡這個commit的主要目標是去減少 `cmp()` 的使用 #### CONFIG_RETPOLINE 作者提及CONFIG_RETPOLINE 使間接函式呼叫的成本更高所以值得花心思在減少`cmp()`的使用 >CONFIG_RETPOLINE has severely degraded indirect function call performance, so it's worth putting some effort into reducing the number of times cmp() is called. CONFIG_RETPOLINE 是為了防止有人惡意去毒化分支預測器它利用 return + 無限循環去困住惡意的推測執行 [Spectre Attacks: Exploiting Speculative Execution](https://ieeexplore.ieee.org/document/8835233) CONFIG_RETPOLINE 會讓指令增加導致間接函式呼叫的成本提高 ### 減少cmp使用比較次數可以用以下形式表達 \begin{gather*} nlog_2 n - Kn + O(1) \end{gather*} merege sort在領導係數的部分 $\ nlog_2 (n)$ 和理論最優解 $\log_2 (n!)$ 都是 1 所以能改進的部分就是針對 K 去改進 (K越高越好) ### 論文1: Bottom-up Mergesort: A Detailed Analysis Wolfgang Panny, Helmut Prodinger Algorithmica 14(4):340--354, October 1995 https://doi.org/10.1007/BF01294131 https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.5260 這篇論文有針對bottom up merge sort在平均情況下需要用到的比較次數推出了數學表達式 #### bottom up pseudocode code ``` begin length:= 1 ; while length< N do begin p := 0; q := length; while q + length <= N do begin Merge(a[p + 1 .. p + length] , a[q + 1 .. q + length]); p := q + length; q := p + length end; if q < N then Merge(a[p + 1 .. p + length], a[q + 1 .. N]); length := 2 * length end end; ``` #### top down vs bottom up comparison times | | Best case | Average case | Worst case | |-----------|-----------|--------------|------------| | Top-down | $$\frac{1}{2}N \lg N - 0.146N$$ | $$N \lg N - 1.248N$$ | $$N \lg N - 0.943N$$ | | Bottom-up | $$\frac{1}{2}N \lg N - 0.146N$$ | $$N \lg N - 0.965N$$ | $$N \lg N - 0.701N$$ | 可以看到top down 在 average worse 中都優於 bottom up 但考慮到 cache thrashing ，實作上就不會去採用 top down #### bottom up average case 的 K 和 $log_2(N)$ 的週期關係 ![截圖 2025-07-02 11.46.49 PM](https://hackmd.io/_uploads/B1aOVCfrxe.png) X軸是 $log_2(N)$ 這張圖的Ｙ軸是 $\frac{C_{a,bu}(N) - N \lg N}{N}$ 相當於[commit b5c56e0](https://github.com/torvalds/linux/commit/b5c56e0cdd62979dd538e5363b06be5bdf735a09)中提到的-K，兩者是負相關 $\frac{C_{a,bu}(N) - N \lg N}{N}=-K$ K的平均是0.9650736678 >the mean value of B*(x) is $c_{0} + 1 - \alpha' = -0.9650736678$ 可以看到在bottom up average中Ｋ會隨著 $log_2(N)$ 有週期性的變化在N剛好超過2的冪時，K會顯著降低 ::: info [commit b5c56e0](https://github.com/torvalds/linux/commit/b5c56e0cdd62979dd538e5363b06be5bdf735a09) 就是想要解決 bottom up 要怎麼避免這種 unbalance的情況 ::: 而在[commit b5c56e0](https://github.com/torvalds/linux/commit/b5c56e0cdd62979dd538e5363b06be5bdf735a09) 之前還沒有採用保證 2:1 平衡的方法，作者對於 bottom up average case 的測試得出 K=1.01，而理論預測 K=0.965 >Because of this, bottom-up mergesort achieves K < 0.5 for such sizes, and has an average (over all sizes) K of around 1. (My experiments show K=1.01, while theory predicts K=0.965.) ### 論文2 The cost distribution of queue-mergesort, optimal mergesorts, and power-of-two rules Wei-Mei Chen, Hsien-Kuei Hwang, Gen-Huey Chen Journal of Algorithms 30(2); Pages 423--448, February 1999 https://doi.org/10.1006/jagm.1998.0986 https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.5380 queue merge sort 是一個用到balanced power of two 這篇論文中提到 ### queue merge sort vs top down merge sort vs bottom up merge sort | | Worst case | Average case | |-----|------------|--------------| | QM | $$n \log_2 n - 0.943n$$ | $$n \log_2 n - 1.207n$$ | | TDM | $$n \log_2 n - 0.943n$$ | $$n \log_2 n - 1.248n$$ | | BUM | $$n \log_2 n - 0.701n$$ | $$n \log_2 n - 1.246n$$ | 作者提到他的實作和 queue merge sort 在 average case 中都是1.207，比原本使用使用 simple eager merge 的 bottom up 還要多了 0.2*n ，而且和 top down 只差了0.04*n 不直接採用 queue merge sort 的原因是他是 bread first ，他和減少 cache thrashing 的目的有衝突 > This implementation is as eager as possible while ensuring that all merge passes are at worst 1:2 unbalanced. This achieves the same average K=1.207 as queue-mergesort, which is 0.2*n better then bottom-up, and only 0.04*n behind top-down mergesort. ## 測試list sort的效能差異 ### 比較對象 * 我自己實作的 merge sort * top-down * 使用 depth-fist order * linux list sort before commit * bottom up * 使用 depth-fist order * simple eager merge * linux list sort after commit * bottom-up * 使用 depth-fist order * 確保不超過2:1平衡程式碼我閱讀了[學長之前的紀錄](https://hackmd.io/WuqBNTdFQnuNxkFtR6FIWg?view#%E7%A0%94%E8%AE%80-liblist_sortc) 和 [list sort 合併方式](https://hackmd.io/@sysprog/linux2025-lab0/%2F%40sysprog%2Flinux2025-lab0-e#%E5%90%88%E4%BD%B5%E6%96%B9%E5%BC%8F)再搭配[commit b5c56e0](https://github.com/torvalds/linux/commit/b5c56e0cdd62979dd538e5363b06be5bdf735a09) 去理解作者是如何運用 count 來維持 2:1 的合併 ```c do { size_t bits; struct list_head **tail = &pending; /* Find the least-significant clear bit in count */ for (bits = count; bits & 1; bits >>= 1) tail = &(*tail)->prev; /* Do the indicated merge */ if (likely(bits)) { struct list_head *a = *tail, *b = a->prev; a = merge(priv, cmp, b, a); /* Install the merged result in place of the inputs */ a->prev = b->prev; *tail = a; } /* Move one element from input list to pending */ list->prev = pending; pending = list; list = list->next; pending->next = NULL; count++; } while (list); ``` ### 設計實驗參考[移除干擾因素](https://hackmd.io/@sysprog/linux2025-kxo/%2F%40sysprog%2Flinux2025-kxo-b#-Linux-%E6%95%88%E8%83%BD%E5%88%86%E6%9E%90) 針對三種數據大小去做測試 * 1028 （2的冪再多一點） * 4096 （2的冪） * 50000 取得以下兩個數據 * number of comparison and k * number of cycles 每個條件都是生成五筆隨機資料去做測試並去取平均 ### 測試結果: ### 1. **Number of Comparisons** ``` Algorithm Size 1,028 Size 4,096 Size 50,000 --------- ---------- ---------- ----------- original 8,982 43,942 718,222 (K=1.268/1.272/1.245) kernel_old 9,586 43,942 733,146 (K=0.681/1.272/0.947) kernel_new 8,981 43,942 721,138 (K=1.269/1.272/1.187) ``` * 在 number of comparisons 上可以從 k 發現當 N 為 1028 這種極端情況時舊的 linux sort 在比較次數上明顯更差 * 而在數量為 4096 時，觀察到三個演算法都對 2 的冪有一樣的比較次數 * 在數量 50000 時可以看到在比較次數 orginal < kernel_new < kernel_old 這個結果，驗證了論文指出在一般情況下 top down的比較次數會比 bottom up 還要更多 ### 2. **CPU Cycles** ``` Algorithm Size 1,028 Size 4,096 Size 50,000 --------- ----------- ------------ ------------- original 15,279,117 272,130,010 54,116,629,173 kernel_old 15,311,390 271,450,065 54,507,399,687 kernel_new 15,120,920 271,656,366 54,422,878,795 ``` 在這邊我著重於比較 kernel_old 和 kernel_new 因為在我的測試結果中儘管 original 的效能看起來跟另外兩個但差不多但考慮到 top down 會有呼叫遞迴導致 stack overflow 的風險，還有 cache thrashing ，而且還要提前知道資料的數量，所以在 kernel 中的 list sort 不會去採用它在 1028 和 50000 這兩個大小的情況下 kernel_new 的在 cpu cycles 和 comparison times 上表現都比 kernel_old 好但是在 4096 時，在比較次數相等的情況下 kernel_new 的 cycles 更多，這有可能是更多的 cache miss 造成的(還沒去驗證) <s>[Screenshot from 2025-07-03 20-33-57](https://hackmd.io/_uploads/BJFfFx4rlg.png) </s> :::danger 文字訊不要用圖片展現，尊重視覺障礙的人群。 ::: :::danger 注意書寫規範：中英文間用一個半行空白字元區隔。 ::: ## TODO: 彙整其他學員的成果 > 紀錄你從其他學員成果獲得的啟發，應適度重現實驗並詳實標注