2024 年 Linux 核心設計/實作課程作業 —— integration (D)

--- title: 2024 年 Linux 核心設計/實作課程作業 —— integration (D) description: 引導學員開發 Linux 核心模組，預期知悉核心模組如何掛載進入 Linux 核心、Virtual File System (VFS) 概況、character device driver 撰寫，和準備對應的測試及效能分析工具 tags: linux2024 --- # M06: integration :::warning :warning: 請務必詳閱作業描述 [(一)](/@sysprog/linux2024-integration-a), [(二)](/@sysprog/linux2024-integration-b), [(三)](/@sysprog/linux2024-integration-c) 及 [(四)](/@sysprog/linux2024-integration-d) ::: > 主講人: [jserv](https://wiki.csie.ncku.edu.tw/User/jserv) / 課程討論區: [2024 年系統軟體課程](https://www.facebook.com/groups/system.software2024/) :mega: 返回「[Linux 核心設計/實作](https://wiki.csie.ncku.edu.tw/linux/schedule)」課程進度表 ## Linux 核心的 `lib/sort.c` 的實作參見 [Bottom-up Heapsort](https://hackmd.io/@Uduru0522/bottomup-heapsort-analyze)，對照 git log 提及的論文，解釋其原理，包含數學推導。藉由 perf 一類的效能分析工具，探討 Linux 核心 [lib/sort.c](https://github.com/torvalds/linux/blob/master/lib/sort.c) 實作的效益。 > :warning: 近期參與過 Linux 核心課程的學員 visitorckw 對 Linux 核心的 `lib/sort.c` 進行一系列改進，但下方討論並未充分反映。 ### 〈[BOTTOM-UP-HEAPSORT](https://www.sciencedirect.com/science/article/pii/030439759390364Y)〉論文研讀先來複習資料結構學到的 heap sort ，利用 [heap](https://en.wikipedia.org/wiki/Heap_(data_structure)) 的性質，資料量為 n 的陣列 `a`，先將此陣列利用 **reheap** (部分資料結構參考書將此稱為 `heapify` ) 調整為符合 min-heap (此論文都以 min-heap 來討論) 結構，再將 $a[1]$ 的值跟 $a[n]$ 交換，再對 $a[1]$ ~ $a[n-1]$ 進行 **reheap**。(論文內陣列從 $a[1]$ 開始存值，而不是 $a[0]$) :::info ***HEAPSORT(a)*** 1) /* heap creation phase */ 2) for ( **i** from **$\lfloor \frac{n}{2} \rfloor$** downto **1** ) begin 3) &emsp;***reheap*** **(n, i)** 4) end 6) /* select phase */ 7) for ( **m** from **n** downto **2** ) begin 8) &emsp;**interchange** **a[1]** and **a[m]** 9) &emsp;if **m** $\neq$ 2 then ***reheap*** **(m - 1, 1)** 10) end ::: :::info ***reheap(m, i)*** 1) if (**i** > **$\frac{m}{2}$**) exit 2) if (**i** < **$\frac{m}{2}$**) **MIN** = min( **a[i]**, **a[2 * i]**, **a[2 * i + 1]** ) with 2 comparisons 3) if (**i** = **$\frac{m}{2}$**) **MIN** = min( **a[i]**, **a[2 * i]** ) with 1 comparison 4) if ( **MIN** = **a[i]** ) exit 5) else if ( **MIN** = **a[2 * i]** ) begin 6) &emsp;**interchange** **a[i]** and **a[2 * i]** 7) &emsp;***reheap*** **(m, 2 * i)** 8) end 9) else if ( **MIN** = **a[2 * i + 1]** ) begin 10) &emsp;**interchange** **a[i]** and **a[2 * i + 1]** 11) &emsp;***reheap*** **(m, 2 * i + 1)** 12) end ::: 在進行 **reheap** 時，第一個參數 $m$，總是會放該陣列的最後一筆資料，因 heap 為 complete binary tree ，因此 $a[\frac{m}{2}]$ 指的是 $a[m]$ 的 parent 。找出 `i` 以及 `i` 的 child 之中最小的為 `MIN` ，若 `MIN` 就是 $a[i]$ ，則終止，反之將最小的那 index 設為新 `i` ，重複這個做法，直到第 1 行或是第 5 行的中止條件成立。是由上而下的進行 interchange ，來將樹調整成 min-heap 。考慮 heap 高度為 d (root 所在的高度為 0)，上述的 **reheap** 最多會執行 $2 \times d$ 次的 comparisons ，以及 d 次的 interchange ，然而，因為在執行 **heap-sort** 時，每一輪會把 heap 中最後一個值 $a[m]$ (考慮儲存最後一個值的 index 為 $m$) 放到 root ，在進行 **reheap** 會將 $a[m]$ 放置合適的位置，但這個位置會傾向靠近於 leaves，因此論文提到一個新的方法來找出代替 reheap 找出合適的位置，且有較少的 comparison 次數以及 interchange 次數。搭配圖片解說，以下方的 heap 作為示範。($n = 13$，$n$ 為資料量)，原文演算法的描述排版會做點修改，但執行順序以及方式不會做改變。 :::danger :warning: 陣列初始位置的 index 為 $1$ 而非 $0$ ::: ```graphviz graph heap{ 4 -- 7 4 -- 8 7 -- 23 7 -- 21 8 -- 14 8 -- 10 23 -- 31 23 -- 29 21 -- 24 21 -- 25 14 -- 15 14 -- 17 } ``` 因為要討論的 **reheap** 的執行方式，那我們就要先進行第一次的 **heapsort** ，也就是將最後一個位置的值與 root 的值交換，接著要對這棵樹進行 **reheap** 。(不會動到已排序好的值，此例為從 root 移到最後一個位置的 4) ```graphviz graph heap1{ 17 -- 7 17 -- 8 7 -- 23 7 -- 21 8 -- 14 8 -- 10 23 -- 31 23 -- 29 21 -- 24 21 -- 25 14 -- 15 sorted[label = 4, color = gray] 14 -- sorted } ``` 首先是 **leaf-search**，目的是要找出 **special leaf** ， **special leaf** 為從 root 開始，找出目前節點中的 childs 較小的那個，再指向該點，直到指到最後一層，也就是這個 heap 的 leaf ，該點就是 **special leaf** 。 :::info ***leaf-search(m, i)*** 1) **j = i** 2) while ( **2 * j < m** ) begin 3) &emsp;if ( **a[2 * j]** < **a[2 * j + 1]** ) **j = 2 * j** 4) &emsp;else **j = 2 * j + 1** 5) end 6) if ( **2 * j = m** ) **j = m** 7) return **j** ::: 橘色節點的 $24$ 為 **special leaf** ，而走訪的路徑(紅色節點的部分與橘色節點)為 **special path** 。下圖執行 ***leaf-search(12, 1)*** 會回傳 $10$，其為 **special leaf** 的 index 。 ```graphviz graph heap2{ 17[label = 17, color = red] 17 -- 7 17 -- 8 7[label = 7, color = red] 7 -- 23 21[label = 21, color = red] 7 -- 21 8 -- 14 8 -- 10 23 -- 31 23 -- 29 24[label = 24, color = orange] 21 -- 24 21 -- 25 14 -- 15 sorted[label = 4, color = gray] 14 -- sorted } ``` **bottom-up-search** 為找出應該放 $a[m]$ 的合適位置。透過 **leaf-search** 找出 special leaf ，再由這個 leaf ，再拿 $a[m]$ 從這點往上比較，找出適合放入的位置，該位置稱為 **special object**。 :::info ***bottom-up-search(i, j)*** /* **j** is the output of ***leaf-search(m, i)*** */ 1) while ( **a[i] < a[j]** ) **j = $\lfloor \frac{j}{2} \rfloor$** 2) return **j** ::: 從 **special leaf** 往上找，直到找到比 root 還要小的值。(紫色節點) 下圖執行 ***bottom-up-search(1 , 10)*** 會回傳 $2$，其為 `reheap` 後， root 要換過去的位置。 ```graphviz graph heap3{ 7[label = 7, color = purple] 17 -- 7 17 -- 8 7 -- 23 7 -- 21 8 -- 14 8 -- 10 23 -- 31 23 -- 29 24[label = 24, color = orange] 21 -- 24 21 -- 25 14 -- 15 sorted[label = 4, color = gray] 14 -- sorted } ``` **interchange-1** 為 root 移至指定位置，且 root 到指定位置上的節點往上移動。 :::info ***interchange-1(i, j)*** /* **j** is the output of ***bottom-up-search(i, j)*** */ 1) **l** = **$\lfloor log(\frac{j}{i}) \rfloor$** 2) **x** = **a[i]** 3) for ( **k** from **l - 1** downto **0** ) **a[$\lfloor\frac{j}{2^{k+1}}\rfloor$] = a[$\lfloor \frac{j}{2^k} \rfloor$]** 4) **a[j]** = **x** ::: 下圖為執行 ***interchange-1(1, 2)*** 後的結果，滿足 heap 的性質。(不考慮最後一個已排序的節點) (root 移至 $a[2]$，原本 $a[2]$ 的值往其親代節點移動，因此移到 root) ```graphviz graph heap4{ 17[label = 17, color = purple] 7 -- 17 7 -- 8 17 -- 23 17 -- 21 8 -- 14 8 -- 10 23 -- 31 23 -- 29 24[label = 24, color = orange] 21 -- 24 21 -- 25 14 -- 15 sorted[label = 4, color = gray] 14 -- sorted } ``` 因此原本的 **reheap** 可以被修改成 bottom-up 形式，為 :::info ***bottom-up-reheap(m, i)*** 1. **j** = **leaf-search(m, i)** 2. **j** = **bottom-up-search(i, j)** 3. **interchange-1(i, j)** ::: 考慮任一個值任意排序的陣列(每個值都相異)，對其做 **HEAPSORT** ，首先要先將陣列調整為 heap，需要做 **reheap($n, \lfloor \frac{n}{2} \rfloor$)**, **reheap($n, \lfloor \frac{n}{2} \rfloor-1$)**, ..., **reheap($n, 1$)**，因上述透過 bottom-up 的方式改良，因此上述函式會呼叫 **leaf-search($n, \lfloor \frac{n}{2} \rfloor$)**, **leaf-search($n, \lfloor \frac{n}{2} \rfloor-1$)**, ..., **leaf-search($n, 1$)**，而這些 **leaf-search** 最少會執行 $n - \lceil \log{(n+1)} \rceil - \lceil \log{(n)} \rceil + 1$ 次 comparisons，最多會執行 $n - 2$ 次 comparisons。(論文中的 lemma 4.1) 證明: 先考慮 $n = 2^{k}-1$ 的情況，也就是 full binary tree (共 $k$ 層，第 $0$ 層至第 $k-1$ 層)，第 $0$ 層的節點 (root) 要找到 **special leaf** 要進行 $k - 1$ 次的 comparisons，第 $1$ 層則是 $k - 2$ 次的 comparisons，第 $k - 2$ 層則是 $1$ 次，因為第 $i$ 層有 $2^i$ 個節點，因此總共會進行 $\begin{split}\sum_{i = 0}^{k-2}2^i(k-1-i) & = 2^0(k-1)+2^1(k-2)+...+2^{k-2}(1) \\ & =(2^0+2^1+...+2^{k-2})+(2^0+2^1+...+2^{k-3})+...+(2^0+2^1)+2^0 \\ & =(2^{k-1}-1)+(2^{k-2}-1)+...+(2^2-1)+(2^1-1) \\ & = (2^{k-1}+2^{k-2}+...+2^1)-(k-1)\times 1 \\ & = (2^{k-1}+2^{k-2}+...+2^1+2^0)-2^0-(k-1)\times 1 \\ & =(2^k-1)-k \\ & =n - \lceil\log{(n+1)} \rceil \end{split}$ 次的 comparisons。接著考慮不為 full binary tree 的情況，在第 $k$ 層加入 $i$ 個節點。 $i$ 為偶數，才會增加 worst-case 下 comparison 次數，也就是說 $i=0$ 與 $i=1$ 在 worst-case 增加comparison 次數是一樣的，$i=2$ 與 $i=3$ 在 worst-case 增加的 comparison次數是一樣的，...，以此類推，在第 $k$ 層增加 $i$ 個節點，會影響第 $k-1$ 層中的前 $\lceil\frac{(i-1)}{2}\rceil$ 個節點，在 worst-case 下，他們會多做一次的 comparison，而第 $k-2$ 層中前 $\lceil\frac{(i-1)}{2^2}\rceil$ 個節點也要多做一次的 comparison，...，第 $0$ 層則會有 $\lceil\frac{i-1}{2^{k}}\rceil$ 個節點要多做一次 comparison ，因此共要多做 $\begin{split}\sum_{j=1}^{k}\lceil\frac{(i-1)}{2^j} \rceil\end{split}$ 次。 $\begin{split}\sum_{j=1}^{k}\lceil\frac{(i-1)}{2^j} \rceil\end{split}$ 的 upper bound 為 $i-2+k$，證明如下: $\begin{split}\sum_{j=1}^{k}\lceil\frac{(i-1)}{2^j} \rceil & = \sum_{j=1}^{k}\frac{(i-1)}{2^j}+(\sum_{j=1}^{k}\lceil\frac{(i-1)}{2^j} \rceil-\sum_{j=1}^{k}\frac{(i-1)}{2^j})\\ &\leq(i-1)(1-\frac{1}{2^k})+\sum_{j=1}^{k}(1-\frac{1}{2^j}) \\ & =i-1-\frac{i}{2^k}+\frac{1}{2^k}+k-(1-\frac{1}{2^k})\\ & =i-2+k+\frac{2-i}{2^k}\\ & \leq i-2+k \ \ \ (\text{for}\ i \geq 2) \end{split}$ 因為 $i<2$ 不會增加 comparison 次數，所以只考慮 $i\geq 2$ 的情況。因此對於非 full binary tree 來說， worst-case 下，comparison 最多會進行 $\begin{split}\underbrace{(2^k-1-k)}_{\text{full binary tree}}+(i-2+k) &= (2^k-1+i)-2 =n-2 \ 次。\end{split}$ 對於非 full binary tree 的 best-case ，就是 root 到第 $n$ 個節點的路徑上的所有點，在進行 comparison 時，都指向右子節點，遠離第 $n$ 個節點，這樣這些點進行 **leaf-search** 都會少一個 comparison ，而這路徑上的點(不包含 leaf)共有 $\lceil\log n\rceil-1$ 個，因此 best-case 下，comparison的次數至少會是 $n-\lceil\log(n+1)\rceil-(\lceil\log n\rceil - 1)=n-\lceil\log(n+1)\rceil-\lceil\log n\rceil + 1$ 次。 best-case 發生在 $n = 2$。雖然分析出 best-case 與 worst-case 情況下的 comparison 次數，但實際上在調整至 heap 的過程，不管是在 best-case 還是 worst-case ，comparison 次數其實是差不多的，平均 comparison 次數就還是以 $n+\Theta(\log n)$ 考慮。接著來討論 **bottom-up-search** 在陣列調整至 heap 時的執行次數，考慮透過 **leaf-search** 時找出的 **special path** ，將這個 path(不包含 root ) 用 $b(1),b(2),...b(d)$ 表示，$b(0) = -\infty,b(d+1) = \infty$ ， $x$ 為要執行 **reheap** 的 root，目的就是要找出滿足 $b(j)<x<b(j+1)$ 的 $j$，接著 **HEAPSORT** 以及 **bottom-up-search** 的 comparison 次數為以下表格。 | | $j<d$ | $j=d$ | |:--------------------:| :------: |:-----:| | **HEAPSORT** | $2(j+1)$ | $2d$ | | **bottom-up-search** | $d-j+1$ | $d$ | :::info **HEAPSORT** 每一層在做 comparison 時，會做 $2$ 次比較，因此要乘上 $2$ 。 ::: 用 $l_{\text{HS}}$ 表示 **HEAPSORT** 的 comparison 次數的==一半==，以及 $l_{\text{BUS}}$ 表示 **bottom-up-search** 的 comparison 次數。再藉由上述表格，可以得到以下結論 : | | $l_{\text{HS}}+l_{\text{BUS}}$ | |:---------------------:|:------------------------------:| | $0<j<d$ | $d+2$ | | $j=0 \text{ or } j=d$ | $d+1$ | 接著透過〈[An average case analysis of Floyd's algorithm to construct heaps](https://www.sciencedirect.com/science/article/pii/S0019995884800534)〉的結論，**HEAPSORT** 平均 comparison 次數為 $(\alpha_1+2\alpha_2-2)n+\Theta(\log n)$ 次，以及平均 interchange 次數為 $(\alpha_1+\alpha_2-2)n+\Theta(\log n)$ 次，其中 $\alpha_1=1.6066951...,\alpha_2=1.1373387...,$ $\alpha_i$ 的公式為 $\begin{split}\alpha_i=\sum_{j=1}^{\infty}(2^j-1)^{-i}\end{split}$ 。(論文中的 Thm 4.2) 因為在調整 heap 的過程中， **bottom-up-reheap** 會呼叫 $\lfloor\frac{n}{2}\rfloor$ 次，因此其中的 **bottom-up-search** 也同樣會執行 $\lfloor\frac{n}{2}\rfloor$ 次，以 $L_{\text{BUS}},L_{\text{HS}},D$ 作為隨機變數，分別代表的是 $l_{\text{BUS}},l_{\text{HS}},d$ 的總和，目標是要求出 $E(L_{\text{BUS}})$ ，也就是在調整至 heap 的過程，執行 **bottom-up-search** 的次數的期望值。藉由上述的 lemma 4.1 跟 Thm 4.2 ，可以得出 $$ E(D)=n+\Theta(\log n)、E(L_{\text{HS}})=(\alpha_1/2+\alpha_2-1)n+\Theta(\log n) $$ 用 $T$ 來表示呼叫 **bottom-up-search** 且 $l_{\text{HS}}+l_{\text{BUS}}=d+1$ 的次數，則 $\lfloor\frac{n}{2}\rfloor-T$ 則為呼叫 **bottom-up-search** 且 $l_{\text{HS}}+l_{\text{BUS}}=d+2$ 的次數。因此 $$ \begin{split} E(L_{\text{BUS}}) &= E(T)\cdot(E(d)+1-E({l_{\text{HS}}}))+(\lfloor\frac{n}{2}\rfloor-E(T))\cdot(E(d)+2-E(l_{\text{HS}})) \\ &=\lfloor\frac{n}{2}\rfloor E(d)+2\lfloor\frac{n}{2}\rfloor-\lfloor\frac{n}{2}\rfloor E(l_{\text{HS}})-E(T) \\ &=E(D)+2\lfloor\frac{n}{2}\rfloor-E(L_{\text{HS}})-E(T) \\ &=n+2\lfloor\frac{n}{2}\rfloor-(\alpha_1/2+\alpha_2-1)n-E(T)+\Theta(\log n) \\ &=(3-\alpha_1/2-\alpha_2)n-E(T)+\Theta(\log n) \end{split} $$ 處理 $E(L_{\text{BUS}})$ 中的 $E(T)$，考慮 $j=0$ 的情況，也就是 **special object** 恰巧就為該子樹的 root ，若該子樹有 r 個節點，則發生 $j=0$ 的機率為 $\frac{1}{j}$，考慮 full binary tree ，總高度為 $h$ ，在允許 $\Theta(\log n)$ 的誤差，第 $0$ 層有 $1$ 個子樹且該樹的節點為 $n$ 個，第 $1$ 層有 $2$ 個子樹且該樹的節點大致為 $\frac{n}{2}$ 個，...，第 $h-2$ 層有 $2^{h-2}$ 個子樹且該樹節點大致為 $\frac{2}{2^{h-2}}$，因此 $j=0$ 的狀況發生在各個非 leaf 的節點上的機率為 $$ \begin{split} \beta=\sum_{h=2}^{\infty}[2^h(2^h-1)]^{-1} = 0.1066952 \end{split} $$ 因此發生在 $j=0$ 的情況下的期望值為 $\beta n+\Theta(\log n)$ 。接著是 $j=d$ 的情況，在處理前，先說明一個等式，$c$ 為在執行 **reheap** 時的 comparison 次數，$i$ 為 interchange 次數，$s$ 為 **special object** 的子代節點的個數，則以下等式成立 : $c=2\times i+s$ 考慮 $C、I、S$ 分別為 $c、i、s$ 在調整 heap 的總和，由上述等式可得 : $$ E(C)=2\times E(I)+E(S) \Rightarrow E(S)=E(C)-2\times E(I) $$ 因為二元樹的關係，$s$ 只有三個可能性，也就是 $s\in \{0,1,2\}$，$s=1$ 的情況，也就是 **special object** 為第 $\frac{n}{2}$ 個節點，且 $n$ 為偶數，這情況最多只會發生 $\log n$ 次( root 到最後一個節點這路徑上的節點才有可能會發生)，當 $n$ 很大時， $s=1$ 發生次數少到可以不用考慮，所以我們可以改成 $s\in \{0,2\}$，設 $p_{s=0}、p_{s=2}$ 分別為 $s=0, s=2$ 發生的機率，可得 $$ \begin{align} &E(s)=0\times p_{s=0}+2\times p_{s=2}\\ \Rightarrow &E(s)=2\times p_{s=2}\\ \Rightarrow &\lfloor\frac{n}{2}\rfloor E(s)=\lfloor\frac{n}{2}\rfloor\times 2\times p_{s=2} \\ \Rightarrow &E(S)=\lfloor\frac{n}{2}\rfloor\times 2\times p_{s=2}\\ \Rightarrow &p_{s=2}=(E(S)/\lfloor\frac{n}{2}\rfloor)/2=2-\alpha_1+\Theta(n^{-1}\log n) \end{align} $$ 則 $s=0$ 就能求出， $p_{s=0}=1-p_{s=2}=\alpha_1-1+\Theta(n^{-1}\log n)。$ 再回來看要處理 $j=d$ 的情況，也就是 **special object** 為 leaf 的情況，剛好就是 $s=0$ 的意思，因此發生在 $j=d$ 的情況下的期望值為 $$ \lfloor\frac{n}{2}\rfloor p_{s=0}=\lfloor\frac{n}{2}\rfloor(\alpha_1-1+\Theta(n^{-1}\log n))=(\alpha_{1}/2-1/2)n+\Theta(\log n) $$ 求出 $j=0、j=d$ 的期望值後，$E(T)$ 為這兩者之和，因此 $E(T)=(\alpha_1/2-1/2+\beta)n+\Theta(\log n)$，帶回 $E(L_{\text{BUS }})$ 中，可得 : $$ \begin{split}E(L_{\text{BUS}})&=(3-\alpha_1/2-\alpha_2)n-(\alpha_1/2-1/2+\beta)n+\Theta(\log n)\\ &=(7/2-\alpha_1-\alpha_2+\beta)n+\Theta(\log n)\end{split} $$ 因此得到以下結論，在進行 **bottom-up-HEAPSORT** 一開始 heap creation phase 時 comparison 平均花費次數為 $$ \begin{split} &\underbrace{n+\Theta(\log n)}_{\text{leaf-search}}+\underbrace{(7/2-\alpha_1-\alpha_2+\beta)n+\Theta(\log n)}_{\text{bottom-up-search}} \\ = &(9/2-\alpha_1-\alpha_2+\beta)n+\Theta(\log n)\\ \approx &1.649271n \end{split} $$ 接著分析 select phase 的 comparison ，會進行 **leaf-search(n-1,1)、leaf-search(n-2,1)、...、leaf-search(2,1)** ，在執行 **leaf-search(i,1)** 的 worst-case 下，共會執行 $\lfloor\log(i-1)\rfloor$ 次的 comparisons，worst-case 下，全部 **leaf-search** 的 comparisons 次數為 $$ \begin{split} \sum_{i=2}^{n-1}\lfloor\log(i-1) \rfloor &= \sum_{i=1}^{n-2}\lfloor\log i\rfloor \\ &=\sum_{i=1}^{\lfloor\log(n-2)\rfloor-1}i2^i+\lfloor\log(n-2)\rfloor(n-2^{\lfloor\log(n-2)\rfloor}-1)\\ &= n\lfloor\log (n-2)\rfloor-2\cdot2^{\lfloor\log(n-2)\rfloor}-\lfloor\log(n-2)\rfloor+2 \end{split} $$ 將以上調整至 $n\log n-c(n)n$ 的形式，則 $$ c(n)=(\log n-\lfloor\log(n-2)\rfloor)+2\cdot2^{\lfloor\log(n-2)\rfloor}/n $$ $c(n)$ 在 $n\approx 1.386\cdot 2^k$ 時為最大值，此時 $c(n)\approx 1.91393$ 。 worst-case 下，**leaf-search** 會執行 $n\log n-1.91393n$ 次 comparisons。但這是在 worst-case 下，並不是每次找到的 **special path** 總會是最長的，也是有可能找到較短的 **special path**，其長度為 $\lfloor\log (n-1)\rfloor-1$ ，在論文的實驗下，在 $n\approx1.4\cdot2^k$ 約會有 $0.519$ 的機率找到的 **special path** 不是最長的。綜上所述，**leaf search** 平均下會執行 $n\log n -1.91393n-0.519n=n\log n -2.43293n$ 次 comparisons。 **bottom-up-search** 也會做 comparison，因為每次 **bottom-up-search** 至少會做一次的 comparison，且大部分作的次數都會很少(因為在做 heap sort 要將最後一個值 x 與 root 交換，再使 x 下沉至 x 應該在的位子，因 x 為 heap 的葉子，預期其值會相當大，因此 x 新的位子也會相當靠近最後一層)，論文實驗分析，會執行 $1.17n$ 次的 comparisons。那執行整體 **bottom-up-HEAPSORT** 預計共會花 $$ \begin{split} \underbrace{1.649271n}_{\text{heap creation phase}}+\underbrace{n\log n-2.43293n}_{\text{select phase leaf-search}}+\underbrace{1.17n}_{\text{select phase bottom-up-search}}=n\log n+0.386341n \end{split} $$ 而論文中則是用猜想的方式，以 $n\log n+d(n)n$ 為執行 comparison 次數之期望值，來進行實驗，得到以下結果(部分節錄) | n | 1000 | 2000 | 3000 | 4000 | ... | 27000 | 28000 | 29000 | 30000 | | ------ | ----- |:-----:|:-----:|:-----:|:---:|:-----:|:-----:|:-----:|:-----:| | $d(n)$ | 0.345 | 0.349 | 0.383 | 0.358 | ... | 0.381 | 0.378 | 0.373 | 0.371 | 透過實驗結果，這可解釋 [lib/sort.c](https://github.com/torvalds/linux/blob/master/lib/sort.c) 中，開頭的 > This performs n\*log2(n) + 0.37\*n + o(n) comparisons on average 考慮 $o(n)$ 為容許的誤差，實驗結果也與分析出來的 $n\log n+0.386341n$ 相當接近。 > and 1.5\*n\*log2(n) + O(n) in the (very contrived) worst case. 也可從論文中 Thm 5.1 中得出。 ### 建立相關實驗環境 [commit f8f230a](https://github.com/kart81604/linux2023_final/commit/f8f230acbd6f8b8ef794407fd806ceb0084677ac) 建立一個陽春簡單的 driver 可用 Linux 核心中的 sort 排序，及準備相關的使用者空間的測試程式。 [commit 0fcbe0f](https://github.com/kart81604/linux2023_final/commit/0fcbe0f6521bb678dbc305c90b4362c4d16f7a2c) 引入 [xoroshiro128+](https://en.wikipedia.org/wiki/Xoroshiro128%2B) 這個 [Pseudorandom number generator](https://en.wikipedia.org/wiki/Pseudorandom_number_generator)，以下為引入後的時間表現圖。 ![](https://hackmd.io/_uploads/Hk4He2Tvn.png) 利用 perf stat 來查看資源利用。 ``` Performance counter stats for './client': 9,396.16 msec task-clock # 1.000 CPUs utilized 40 context-switches # 4.257 /sec 1 cpu-migrations # 0.106 /sec 60 page-faults # 6.386 /sec 31,746,894,265 cycles # 3.379 GHz 23,920,611,236 instructions # 0.75 insn per cycle 3,855,864,140 branches # 410.366 M/sec 658,348,492 branch-misses # 17.07% of all branches 9.396961274 seconds time elapsed 0.011975000 seconds user 9.364692000 seconds sys ``` 利用 `perf record` 來查看熱點(僅列出與 sort 有關的函式)。 ``` 49.76% client [kernel.kallsyms] [k] cmpint64 37.61% client [kernel.kallsyms] [k] sort_heap 10.00% client [kernel.kallsyms] [k] do_swap 0.88% client [kernel.kallsyms] [k] sort_read ``` 不意外的，花最多時間 comparison 上。 [commit 2751912](https://github.com/kart81604/linux2023_final/commit/27519128b5b7302f359e853cb5bf57210fe1bca8) 將每次執行 comparison 次數與論文中的 $n\log(n)+0.37n$ 做比較，結果兩者高度接近！ ![](https://hackmd.io/_uploads/BJu2ptSO3.png) ## Hybrid sorting 除了在教科書出現的經典排序演算法，如 merge sort 和 quick sort，在諸多軟體會採用截長補短的策略，融合兩項或更多排序演算法，這樣的案例包含 [Timsort](https://en.wikipedia.org/wiki/Timsort) 和 [Introsort](https://en.wikipedia.org/wiki/Introsort)。 [Introsort](https://en.wikipedia.org/wiki/Introsort) 可看做 quick sort 的強化，遞迴分割陣列，區間越來越短，數字也幾乎排序好。對於幾乎已經排序好的短區間，quick sort 表現不佳，於是切換為 heapsort。分水嶺通常設定成 $\log{N^2} = 2 \times log{N}$，`N` 是輸入資料的長度。以下是參考的 [Introsort](https://en.wikipedia.org/wiki/Introsort) 實作: ```c /** * introsort is comprised of: * - Quicksort with explicit stack instead of recursion, and ignoring small * partitions * - Binary heapsort with Floyd's optimization, for stack depth > 2log2(n) * - Final shellsort pass on skipped small partitions (small gaps only) */ #include <limits.h> #include <stdint.h> #include <string.h> typedef int (*comparator_t)(const void *, const void *); typedef struct { char *low, *high; } stack_node_t; #define idx(x) (x) * data_size /* manual indexing */ #define STACK_SIZE (sizeof(size_t) * CHAR_BIT) /* size constant */ static inline int __log2(size_t x) { return 63 - __builtin_clzll(x); } #define WORD_ALIGNED 1 static inline void swap(void *restrict _a, void *restrict _b, size_t data_size) { #if WORD_ALIGNED long *a = (long *) _a, *b = (long *) _b; while ((data_size--) / sizeof(long) > 0) { long tmp = *a; *a++ = *b; *b++ = tmp; } #else char *a = (char *) _a, *b = (char *) _b; while (data_size-- > 0) { char tmp = *a; *a++ = *b; *b++ = tmp; } #endif } void sort(void *_array, size_t length, size_t data_size, comparator_t comparator) { if (length == 0) return; char *array = (char *) _array; const size_t max_thresh = data_size << 2; const int max_depth = __log2(length) << 1; /* Temporary storage used by both heapsort and shellsort */ char tmp[data_size]; if (length > 16) { char *low = array, *high = array + idx(length - 1); stack_node_t stack[STACK_SIZE] = {0}; stack_node_t *top = stack + 1; int depth = 0; while (stack < top) { /* Exceeded max depth: do heapsort on this partition */ if (depth > max_depth) { size_t part_length = (size_t)((high - low) / data_size) - 1; if (part_length > 0) { size_t i, j, k = part_length >> 1; /* heapification */ do { i = k; j = (i << 1) + 2; memcpy(tmp, low + idx(i), data_size); while (j <= part_length) { if (j < part_length) j += (comparator(low + idx(j), low + idx(j + 1)) < 0); if (comparator(low + idx(j), tmp) <= 0) break; memcpy(low + idx(i), low + idx(j), data_size); i = j; j = (i << 1) + 2; } memcpy(low + idx(i), tmp, data_size); } while (k-- > 0); /* heapsort */ do { i = part_length; j = 0; memcpy(tmp, low + idx(part_length), data_size); /* Floyd's optimization: * Not checking low[j] <= tmp saves nlog2(n) comparisons */ while (j < part_length) { if (j < part_length - 1) j += (comparator(low + idx(j), low + idx(j + 1)) < 0); memcpy(low + idx(i), low + idx(j), data_size); i = j; j = (i << 1) + 2; } /* Compensate for Floyd's optimization by sifting down * tmp. This adds O(n) comparisons and moves. */ while (i > 1) { j = (i - 2) >> 1; if (comparator(tmp, low + idx(j)) <= 0) break; memcpy(low + idx(i), low + idx(j), data_size); i = j; } memcpy(low + idx(i), tmp, data_size); } while (part_length-- > 0); } /* pop next partition from stack */ --top; --depth; low = top->low; high = top->high; continue; } /* 3-way "Dutch national flag" partition */ char *mid = low + data_size * ((high - low) / data_size >> 1); if (comparator(mid, low) < 0) swap(mid, low, data_size); if (comparator(mid, high) > 0) swap(mid, high, data_size); else goto skip; if (comparator(mid, low) < 0) swap(mid, low, data_size); skip: char *left = low + data_size, *right = high - data_size; /* sort this partition */ do { while (comparator(left, mid) < 0) left += data_size; while (comparator(mid, right) < 0) right -= data_size; if (left < right) { swap(left, right, data_size); if (mid == left) mid = right; else if (mid == right) mid = left; left += data_size, right -= data_size; } else if (left == right) { left += data_size, right -= data_size; break; } } while (left <= right); /* Prepare the next iteration * Push larger partition and sort the other; unless one or both * smaller than threshold, then leave it to the final shellsort. * Both below threshold */ if ((size_t)(right - low) <= max_thresh && (size_t)(high - left) <= max_thresh) { --top; --depth; low = top->low; high = top->high; } /* Left below threshold */ else if ((size_t)(right - low) <= max_thresh && (size_t)(high - left) > max_thresh) { low = left; } /* Right below threshold */ else if ((size_t)(right - low) > max_thresh && (size_t)(high - left) <= max_thresh) { high = right; } else { /* Push big left, sort smaller right */ if ((right - low) > (high - left)) { top->low = low, top->high = right; low = left; ++top; ++depth; } else { /* Push big right, sort smaller left */ top->low = left, top->high = high; high = right; ++top; ++depth; } } } } /* Clean up the leftovers with shellsort. * Already mostly sorted; use only small gaps. */ const size_t gaps[2] = {1ul, 4ul}; int i = 1; do { if (length < 4) continue; for (size_t j = gaps[i], k = j; j < length; k = ++j) { memcpy(tmp, array + idx(k), data_size); while (k >= gaps[i] && comparator(array + idx(k - gaps[i]), tmp) > 0) { memcpy(array + idx(k), array + idx(k - gaps[i]), data_size); k -= gaps[i]; } memcpy(array + idx(k), tmp, data_size); } } while (i-- > 0); } ``` [Introsort](https://en.wikipedia.org/wiki/Introsort) 又可進一步擴充為 [Pattern Defeating Quicksort](https://github.com/orlp/pdqsort)，簡稱 pdqsort，進行細部改良，可參見論文〈[Pattern-defeating Quicksort](https://drive.google.com/file/d/0B1-vl-dPgKm_T0Fxeno1a0lGT0E/view)〉。 ![image](https://hackmd.io/_uploads/B1PBmESg0.png) [swenson/sort](https://github.com/swenson/sort) 彙整多項排序實作，並提供多種情境的效能評估，參考執行輸出: ``` Qsort 100000 x86_64 109 9216431.2 ns/op Binary_insertion_sort 100000 x86_64 1 1020912000.0 ns/op Bitonic_sort 100000 x86_64 2 994466000.0 ns/op Quick_sort 100000 x86_64 185 5419470.3 ns/op Merge_sort 100000 x86_64 151 6638079.5 ns/op Heap_sort 100000 x86_64 164 6099676.8 ns/op Shell_sort 100000 x86_64 113 8874380.5 ns/op Tim_sort 100000 x86_64 135 7445088.9 ns/op Merge_sort_in_place 100000 x86_64 158 6372259.5 ns/op Grail_sort 100000 x86_64 122 8244967.2 ns/op Sqrt_sort 100000 x86_64 144 6978180.6 ns/op Rec_stable_sort 100000 x86_64 36 27847444.4 ns/op Grail_sort_dyn_buffer 100000 x86_64 139 7201316.5 ns/op ``` 延伸閱讀: * CppCon 2019: Andrei Alexandrescu "[Speed Is Found In The Minds of People](https://youtu.be/FJJTYQYB1JQ)" - [簡報檔案](https://github.com/CppCon/CppCon2019/blob/master/Presentations/speed_is_found_in_the_minds_of_people/speed_is_found_in_the_minds_of_people__andrei_alexandrescu__cppcon_2019.pdf) > In all likelihood, sorting is one of the most researched classes of algorithms. It is a fundamental task in Computer Science, both on its own and as a step in other algorithms. Efficient algorithms for sorting and searching are now taught in core undergraduate classes. Are they at their best, or is there more blood to squeeze from that stone? This talk will explore a few less known – but more allegro! – variants of classic sorting algorithms. And as they say, the road matters more than the destination. Along the way, we'll encounter many wondrous surprises and we'll learn how to cope with the puzzling behavior of modern complex architectures. ### 引入到 Linux 核心 [commit fd5f87b](https://github.com/kart81604/linux2023_final/commit/fd5f87b9a437d035e5cd403064ca999fd0cd070a) 引入 introsort。透過 `perf stat` 來查看 introsort 的資源利用。 ``` Performance counter stats for './client': 6,553.87 msec task-clock # 0.999 CPUs utilized 223 context-switches # 34.026 /sec 3 cpu-migrations # 0.458 /sec 60 page-faults # 9.155 /sec 22,107,043,269 cycles # 3.373 GHz 13,762,687,088 instructions # 0.62 insn per cycle 2,719,459,349 branches # 414.939 M/sec 617,936,408 branch-misses # 22.72% of all branches 6.559289342 seconds time elapsed 0.011913000 seconds user 6.492720000 seconds sys ``` 及 `perf record` (僅列出與 sort 有關的函式)。 ``` 63.88% client [kernel.kallsyms] [k] cmpint64 32.18% client [kernel.kallsyms] [k] sort_intro 1.35% client [kernel.kallsyms] [k] sort_read ``` 並與 bottom-up heapsort 比較，這次紀錄方法使用 `printk` 印出兩者的執行時間，再透過 `dmesg` 將資料印出，接著做成圖表。因為 dmesg 的 ring buffer 為 8000+ 個，因此資料量從 10000 調整至 8000。比較成果如下圖所示。 ```shell $ sudo dmesg -t > ttest.txt ``` 若 dmesg 已有其他的系統訊息，可先用以下命令清除。 ```shell $ sudo dmesg -C ``` 執行時間的比較： ![](https://hackmd.io/_uploads/Syq4XoRw2.png) 延伸閱讀: (涉及排序演算法和現代處理器的關聯) * [Hoare’s Rebuttal and Bubble Sort’s Comeback](https://blog.reverberate.org/2020/05/29/hoares-rebuttal-bubble-sorts-comeback.html) * [Lomuto’s Comeback](https://dlang.org/blog/2020/05/14/lomutos-comeback/) ### pdqsort > [commit 94d6aa](https://github.com/hankluo6/ksort/commit/94d6aa23be7051dd4113f1b976f00daa8bb921dc) 測試 100000 筆資料排序 ![image](https://hackmd.io/_uploads/rJ7qHNBx0.png) pdqsort 與 intro sort 相比，速度又快上許多，但在 comparison 的次數上並沒有提昇。quick sort 的比較次數應比 heap sort 還少，但在 intro sort 與 pdqsort 中皆使用到 insertion sort 做排序，而 insertion sort 雖在速度上有優勢 (良好的 locality 與小範圍排序)，但其比較次數是隨著元素個數成長，其值應比 heap sort 的平均狀況 $1.5\times n \log2n + O(n)$ 還高。