2024q1 Homework2 (quiz1+2)

# 2024q1 Homework2 (quiz1+2) contributed by < [weihsinyeh](https://github.com/weihsinyeh/Sorting-Lab) > ## [第一週測驗題](https://hackmd.io/@sysprog/linux2024-quiz1) ### 測驗一參考 [Optimized QuickSort — C Implementation (Non-Recursive)](https://alienryderflex.com/quicksort/)以實作作者 Darel Rex Finley 想法並提出改進。不用遞迴方式原因是，遞迴執行效率比直接使用真正的陣列慢。同時因為遞迴其實用 stack 作為 their private array，可能會 stack overflow 導致無預警的 system crashing。為了避免無預警的狀況發生，因此改設定 `MAX_LEVELS` 為 1000 的陣列如果到達上限就 `return NO` 。 [Wikipedia: C language implementation of QuickSort](http://en.wikipedia.org/wiki/Quicksort#C) 上的 R 每次都只移動一個，而 L 則一次移到比 pivot 大的值上面然後將與 R 交換所以此時 R 有可能比 pivot 大但是也被交換到排序較小的地方。當 L 比 R 大的時候，再將 pivot 放到 L-1 的地方。作者有三個改進的想法 : 1. 既然 L 可以一次移到比 pivot 大的值的位置， R 也可以一次移到比 pivot 小的值的位置。因此當 pivot 要放到 L-1 的地方時排序比 pivot 小的都是值比 pivot 小。 2. 用 pivot 作為交換，由於 pivot 就紀錄當前處理的範圍是 L 的大小。 `while (arr[R] >= piv && L < R)` 從最右邊往左邊尋找比 pivot 小的值，一旦找到，就可以直接將它放在原先存 L 的大小的地方 `arr[L]`，因為 pivot 已經紀錄了，再把 L 往前移一個。接著 `while (arr[L] <= piv && L < R)` 從最左邊往右邊尋找比pivot 大的值後就可以直接放到 `arr[R]` 的位置，因為剛剛已經將 `arr[R]` 存到 pivot 的地方。而原先存比 pivot 大的值 arr[L] 則可以直接存 pivot。示意圖如下 ```graphviz digraph structs { node[shape=plaintext]; pivot [fontcolor="red"]; L [fontcolor="green"]; R [fontcolor="green"]; node[shape=box]; structpiv [label= "4"]; struct1 [label= "1"]; struct2 [label= "2"]; struct3 [label= "3"]; struct4 [label= "4"]; struct5 [label= "5"]; struct7 [label= "7"]; { rank="same"; struct4 -> struct1 struct1 -> struct3 struct3 -> struct5 struct5 -> struct2 struct2 -> struct7 } pivot -> structpiv; L -> struct4; R -> struct7; } ``` ```c=1 while (arr[R] >= piv && L < R) // 先找到哪個比 pivot 小 R--; if (L < R) arr[L++] = arr[R]; // swap 的第一步 ``` ```graphviz digraph structs { node[shape=plaintext]; pivot [fontcolor="red"]; L [fontcolor="green"]; R [fontcolor="green"]; node[shape=box]; structpiv [label= "4"]; struct1 [label= "1"]; struct2 [label= "2"]; struct3 [label= "3"]; struct4 [label= "2"]; struct5 [label= "5"]; struct7 [label= "7"]; { rank="same"; struct4 -> struct1 struct1 -> struct3 struct3 -> struct5 struct5 -> struct2 struct2 -> struct7 } pivot -> structpiv; L -> struct4; R -> struct2; } ``` 這裡就是上方程式碼的第四行 `L++` ```graphviz digraph structs { node[shape=plaintext]; pivot [fontcolor="red"]; L [fontcolor="green"]; R [fontcolor="green"]; node[shape=box]; structpiv [label= "4"]; struct1 [label= "1"]; struct2 [label= "2"]; struct3 [label= "3"]; struct4 [label= "2"]; struct5 [label= "5"]; struct7 [label= "7"]; { rank="same"; struct4 -> struct1 struct1 -> struct3 struct3 -> struct5 struct5 -> struct2 struct2 -> struct7 } pivot -> structpiv; L -> struct1; R -> struct2; } ``` ```c while (arr[L] <= piv && L < R) // 先找到哪個比 pivot 大 L++; ``` ```graphviz digraph structs { node[shape=plaintext]; pivot [fontcolor="red"]; L [fontcolor="green"]; R [fontcolor="green"]; node[shape=box]; structpiv [label= "4"]; struct1 [label= "1"]; struct2 [label= "2"]; struct3 [label= "3"]; struct4 [label= "2"]; struct5 [label= "5"]; struct7 [label= "7"]; { rank="same"; struct4 -> struct1 struct1 -> struct3 struct3 -> struct5 struct5 -> struct2 struct2 -> struct7 } pivot -> structpiv; L -> struct5; R -> struct2; } ``` ```c＝1 if (L < R) arr[R--] = arr[L]; // swap 的第二步 ``` ```graphviz digraph structs { node[shape=plaintext]; pivot [fontcolor="red"]; L [fontcolor="green"]; R [fontcolor="green"]; node[shape=box]; structpiv [label= "4"]; struct1 [label= "1"]; struct2 [label= "5"]; struct3 [label= "3"]; struct4 [label= "2"]; struct5 [label= "5"]; struct7 [label= "7"]; { rank="same"; struct4 -> struct1 struct1 -> struct3 struct3 -> struct5 struct5 -> struct2 struct2 -> struct7 } pivot -> structpiv; L -> struct5; R -> struct2; } ``` 上方程式碼第二行 `R--` ```graphviz digraph structs { node[shape=plaintext]; pivot [fontcolor="red"]; L [fontcolor="green"]; R [fontcolor="green"]; node[shape=box]; structpiv [label= "4"]; struct1 [label= "1"]; struct2 [label= "5"]; struct3 [label= "3"]; struct4 [label= "2"]; struct5 [label= "5"]; struct7 [label= "7"]; { rank="same"; struct4 -> struct1 struct1 -> struct3 struct3 -> struct5 struct5 -> struct2 struct2 -> struct7 } pivot -> structpiv; L -> struct5; R -> struct5; } ``` ```c arr[L] = piv; // swap 的第三步 ``` ```graphviz digraph structs { node[shape=plaintext]; pivot [fontcolor="red"]; L [fontcolor="green"]; R [fontcolor="green"]; node[shape=box]; structpiv [label= "4"]; struct1 [label= "1"]; struct2 [label= "5"]; struct3 [label= "3"]; struct4 [label= "2"]; struct5 [label= "4"]; struct7 [label= "7"]; { rank="same"; struct4 -> struct1 struct1 -> struct3 struct3 -> struct5 struct5 -> struct2 struct2 -> struct7 } pivot -> structpiv; L -> struct5; R -> struct5; } ``` 比起平常寫程式都會直接要多用一個 `tmp` 的變數，這裡作者巧妙地運用 pivot 的變數來 swap 。讓我很開心又學習到新的寫程式技巧。 3. 透過 stack 紀錄每次要處理的為第 i 個編號的，相對應 beg[i] 與 end[i] 紀錄排序的範圍，而 Wikipedia 中則是會透過遞迴的方式每次將新的範圍代錄去紀錄每次要處理的排序範圍。這裡我也認為超級酷的，作者很巧的運用 stack 的想法，假設原先是用第 i 個 block 透過遞迴的方式分裂成二個繼續進行那就直接重複使用第 i 個 block 與第 i + 1 個 block 就好了，由於再也不會使用第 i 個 block ，因此不需用第 i + 1 個 block 與第 i + 2 個block。 ```c beg[0] = 0; end[0] = elements; ``` ```graphviz digraph struct{ rankdir = "LR"; subgraph cluster0 { rank = same { indices begarray } color = white; indices [ shape = record, color = white, label = "{ 0 | 1 | 2 | 3 | 4 | ... }" xlabel=begin ]; begarray [ shape = record, label = "{ 0 | | | | | ... }", name = "sdf"]; } nodesep = 0; } ``` ```graphviz digraph struct{ rankdir = "LR"; subgraph cluster0 { rank = same { indices begarray } color = white; indices [ shape = record, color = white, label = "{ 0 | 1 | 2 | 3 | 4 | ... }" xlabel=end]; begarray [ shape = record, label = "{ 6 | | | | | ... }"]; } nodesep = 0; } ``` ```c=1 beg[i + 1] = L + 1; end[i + 1] = end[i]; end[i++] = L; ``` 這裡有趣的寫法是第一個是不用重新設定 `beg[i]` 了因為就用以前的，再來因為要使用 `end[i]` 當作 `end[i+1]` ，所以不能先設定改變 `end[i]`。此外這裡最藝術的是為什麼右邊(也就是 pivot 大的值)要放在 i+1 的編號，而左邊 (較 pivot 小的值)要放在 i 的編號不能顛倒。因為這跟上面 `i++` 有關係。也就是下一回合要處理的都會是右邊的部分，如果右邊找到新的 pivot 又會被分成兩半，再繼續往編號大的繼續增加，不會覆蓋掉以前左半邊的 beg 與 end 的資料。確保不會覆蓋掉左半邊的 beg 與 end 的資料後，那下個問題就是確認程式真的會回到左半邊去處理嗎? 當 L<R 就會分裂成兩半邊。所以最極端的狀況是剩二個節點，所以在分兩半的時候分配到 `i` 與 `i+1` 但因為都只剩一個所以 `L == R` 則都做 `i--; i--` 所以回到 `i-1` 也就是回到左邊了。 ```graphviz digraph struct{ rankdir = "LR"; subgraph cluster0 { rank = same { indices begarray } color = white; indices [ shape = record, color = white, label = "{ 0 | 1 | 2 | 3 | 4 | ... }" xlabel=begin ]; begarray [ shape = record, label = "{ 0 | 4 | | | | ... }"]; } nodesep = 0; } ``` ```graphviz digraph struct{ rankdir = "LR"; subgraph cluster0 { rank = same { indices begarray } color = white; indices [ shape = record, color = white, label = "{ 0 | 1 | 2 | 3 | 4 | ... }" xlabel=end ]; begarray [ shape = record, label = "{ 3 | 6 | | | | ... }"]; } nodesep = 0; } ``` #### 程式原理 :::warning 注意書名號 (即 `〈` 和 `〉`) 的使用，不該用大於 (即 `>`) 和小於 (即 `<`) 取代，不僅排版效果不同，也會造成 HackMD 解析的問題。 ::: 測驗題中的程式碼 list_tail 與 list_free 是有品味的程式碼，透過傳入指標的指標，來移除處理特例的情況。因為參數為指標的指標所以把它們都變成示意圖後比較好懂。 **指標的指標** `node_t** left == address X` 而 `X 是 **存放 structure node_t 的東西**的位址` ```graphviz digraph structs { node[shape=box]; struct1 [label= "X" xlabel = "Address : &left"]; { rank="same"; struct1 } } ``` **指標** `*left == address Y` 而 `Y 是 structure node_t的位址` ```graphviz digraph structs { node[shape=box]; struct2 [label= "Y" xlabel = "Address : X"]; { rank="same"; struct2 } } ``` **值** `**left == address Z` 而 `裡面存的東西是 structure node_t` ```graphviz digraph structs { node[shape=box]; struct3 [label= "node_t" xlabel = "Address : Y"]; struct4 [label= "Z" ]; { rank="same"; struct3 -> struct4 [label=" node *next "] } } ``` ```graphviz digraph structs { node[shape=box]; struct5 [label= "node_t" xlabel = "Address : Z"]; { rank="same"; struct5 } } ``` > **C99 standard (§ 6.5.2.3) Structure and union members** > The first operand of the -> operator shall have type "pointer to qualified or unqualified structure" or "pointer to qualified or unqualified union", and the second operand shall name a member of the type pointed to. 參閱規格書可知，指標的作用是紀錄一個記憶體的位址。 (`*left`) 是存 "qualified or unqualified structure" 的位址也就如**指標**所說的存 `structure node_t的位址`。而第二個運算子則為在位址裡住的其他成員 `next`。`next` 為**指標**。所以當要讓迭代器 `left` 也就是**指標的指標**移動的話，則將**指標** `next` 的位址指派給迭代器。接下來是測驗中的答案 [commit 803b10d ](https://github.com/weihsinyeh/Sorting-Lab/commit/803b10d82a8bee62fa7d7b0987900e9ee0258528) ```c node_t *list_tail(node_t **left) { while ((*left) && (*left)->next) left = &((*left)->next); return *left; } int list_length(node_t **left) { int n = 0; while (*left) { ++n; left = &((*left)->next); } return n; } ``` 看到這一段的時候這跟原先在陣列做 quick_sort 的差別在除了 begin 與 end 的陣列存的式 node_t 的位址外。原本看到這段，其實在想 `i += 2` 最後到底會不會回到 0，因為上面非遞迴的方式有 `i += 1`。因為原先作者把 pivot 放到左邊 L 裡的最大一個元素，而這邊這是多開一個 `mid` 去存 pivot 。雖然我覺得這個好像可以改成上面把 pivot 也存在 L 的最大的元素，但我想這樣可以讓每次都少比較一個元素，犧牲空間換時間。把 pivot 獨立存在 `begin[i+1]`，這樣 `if (L != R)` 也不會比較，所以到 `else` 的地方直接將 pivot 放到最後的result去了 `i--`。那這樣就跟作者的想法一樣 `i` 一定會減到為 0 的地方。 ```c void quick_sort(node_t **list) { int n = list_length(list); int value; int i = 0; int max_level = 2 * n; node_t *begin[max_level], *end[max_level]; node_t *result = NULL, *left = NULL, *right = NULL; begin[0] = *list; end[0] = list_tail(list); while (i >= 0) { node_t *L = begin[i], *R = end[i]; if (L != R) { node_t *pivot = L; value = pivot->value; node_t *p = pivot->next; pivot->next = NULL; while (p) { node_t *n = p; p = p->next; list_add(n->value > value ? &right : &left, n); } begin[i] = left; end[i] = list_tail(&left); begin[i + 1] = pivot; end[i + 1] = pivot; begin[i + 2] = right; end[i + 2] = list_tail(&right); left = right = NULL; i += 2; } else { if (L) list_add(&result, L); i--; } } *list = result; } ``` 使用 Linux 核心風格的 List API 改寫上述程式碼，引用 lab0-c 的 list.h [commit 0560344](https://github.com/weihsinyeh/Sorting-Lab/commit/0560344b5cdbb8a1fabba316d562266af1be2374)。同時採用 lab0-c queue.c 的寫法來改寫成 main.c 。因為開成環形雙向串列，所以這邊就不用 end 來存資料了。同時也發現我把程式碼寫成這樣。這裡忽略無用的部份，我把比 pivot 小的值都放右邊，比 pivot 大的值都放左邊。也因此我在下面要合併到 result 的時候是用黏到尾端而不是像上面黏到開頭。以下為我的程式碼 ```c if (L != R) { ... list_for_each_safe (iter, next, begin[i]) { list_del(iter); list_add(iter, list_entry(iter, node_t, list)->value < value ? right : left); } ... } else { if (!(!L || list_empty(L))) list_splice_tail(begin[i], result); i--; } ``` 因為這個發現，我又再認真思考一次通常連結到尾端不是很合理嗎 ? 但這件事情從 `begin[i] = left; begin[i + 1] = pivot; begin[i + 2] = right;` right (存比 pivot 大的值) 相比 left (存比 pivot 小的值)是放到 begin 的較後方。那因為 `i+=2` 所以都會先處理 right 再處理 pivot 再處理 left 。以下為測驗題的程式碼 ```c void list_add(node_t **list, node_t *node_t) // origin and new { node_t->next = *list; *list = node_t; } ``` ```c if (L) list_add(&result, L); i--; ``` 因此示意圖為如下，都從開頭加入才合理。 ``` ----------- --------------- --------------------- 頭 right 尾頭 pivot right 尾頭 left pivot right 尾 ----------- --------------- --------------------- ``` #### 實驗接下來的實驗，是用 lab0-c 的 Dudect 裡面的 cpucycles.h 裡的 `cpucycles` 函式計算實驗時間。這個做法參考 [yehsudo](https://hackmd.io/@yehsudo/linux2024-homework2)。 Quick Sort 的 worst case 為 $O(n^2)$ 如下圖實驗數據時間複雜度也是 $O(n^2)$ 的曲線。 > **[wikipedia](https://en.wikipedia.org/wiki/Quicksort)** > Quicksort with this scheme degrades to $O(n^2)$ when the array is already in order, due to the partition being the worst possible one. 分別用隨機的輸入資料、遞增排序的輸入資料與遞減排序的輸入資料做比較。可以發現原本輸入資料(遞增)與排序規則(遞增) 反而還會增加時間。從實作方式，與輸入資料兩項作分析。實作方式一：我的方法將比 pivot 小的資料加入 R ，比 pivot 大的資料加入 L 。最後都將 L 插入尾端 ![image](https://hackmd.io/_uploads/BJxSp2Rnpp.png) 實作方式二：測驗題的方法將比 pivot 小的資料加入 L ，比 pivot 大的資料加入 R 。最後都將 L 插入頭端 ![image](https://hackmd.io/_uploads/ryTseuap6.png) 由兩者實作的方式，我原先認為沒有差別，就算有差別也只是因為測試資料的隨機性。然而卻在這兩張圖上看到我的方式所花費的時間比測驗題的方式好，但目前不知發生原因。從輸入資料可知，Quick sort 的效率會因為 pivot 的選擇而有很大的差別而有很大的影響，worst case 就發生在 pivot 每次都切挑到最大值或是最小值，而 best case 則是每次都剛好 pivot 挑到中位數。 #### Random pivot 這裡因為在排序遞增的序列，隨機選擇 pivot，會讓它與其他比較有失公平，因此我選擇用遞增規則排序，遞減的序列來做隨機選擇 pivot 的實驗。 ```c void random_pivot(struct list_head *head) { if (!head || list_is_singular(head)) return; int r = rand() % list_length(head); struct list_head *cur; list_for_each (cur, head) { if (r == 0) break; r--; } if (head->next != cur) list_swap(head->next, cur); } ``` 輸入資料分別用隨機的、遞增排序的、遞減排序的做比較從實驗結果發現將 pivot 透過隨機的方式選可以對無論是用隨機的輸入資料、在排序過的輸入資料(遞增)與排序過程(遞增)一樣、排序過的輸入資料(遞減)與排序過程(遞增)相反，都表現得很好。 ![image](https://hackmd.io/_uploads/r14FhAn6T.png) 綜合一起來看： ![image](https://hackmd.io/_uploads/ByaGaC366.png) 為何這些曲線不是平滑的，跟作業系統的存取有關，不連續的記憶體操作，有 page fault 。如果是陣列（連續記憶體）的話會減少此干擾。同時考慮到 CPU 排程，因為電腦是多工的，有發生中斷訊號，去切換現在要處理的行程，因此時間不只含程式碼的運行。還包含切換行程的時間。同時排程演算法有 round-robin 或 priority queue ... 等，所以其實所計算的時間包含其他行程的運行。 shuffle 函式運行，隨機裡面切排讓它有些是排序的，這種輸入資料讓 Timsort 可以表現的較好。如果是很隨機的輸入資料，用 Timsort 就相當於合併排序。 #### Use the median of three for the pivot value. > **Essential Improvements to basic quicksort** Use the median of three for the pivot value. > Quicksort is slowest when the pivot is always the smallest or largest possible value. The best possible pivot is the median of the segment $b[h..k]$ being sorted. That median can actually be calculated and used, but the calculation is too slow. Instead, one generally uses the median of three values: $b[h]$, $b[(h+k)/2]$, and $b[k]$. To the right, we define a method to swap the median of three values of $b[h..k]$ into $b[h]$. 這裡用 lab0-c 的二個反向的指標去計算終點，作為中點，在將他與 L 最左邊的元素交換。 ![runtime](https://hackmd.io/_uploads/HknZMW0p6.png) 雖然用 median 與 random 的效果差不多，都可以解決輸入資料是排序過的，用 quick sort 效果很差，但是 median 卻比random好，其中的差別就只差在 random 要計算現在要交換哪個值，而 median 不用計算，直接走訪串列即可。所以才看到 median pivot 較 random pivot 用時少。 ```c void mid_pivot(struct list_head *head) { if (!head || list_is_singular(head)) return; struct list_head *posptr = head->next; struct list_head *negptr = head->prev; while (posptr != negptr && posptr->next != negptr) { posptr = posptr->next; negptr = negptr->prev; } if (head->next != posptr) list_swap(head->next, posptr); } ``` ### 測驗二 [Tim sort的動畫](https://www.chrislaux.com/timsort) Tim sort 的演算法分為兩部份，第一部份為分割第二部份為合併。分割的時候，先找出遞減跟遞增的數列。 1. 如果數列的長度小於 minrun ，則補後面的數字到該數列，並透過插入排序將其變為遞增。 2. 如果數列的長大大於 minrun 且為遞減則反轉。再來 minrun 為多少。其實我在查這件是的時候意外發現了一個從沒有想過得問題 Insertion sort is a simple and efficient algorithm for small input sizes. 那到底怎樣的數量叫做 small input sizes ? 這個問題我還不知道要去哪裡查，也不知道是跟誰比較有效率。但 Tim sort 選擇 minrun 是考量插入排序在輸入數量上的效率，因為每個 run 如果其中數列的數字並不是遞增，就要用插入排序。 #### find_run 此函式的功能是要回傳 run 的頭尾兩個節點，並確保是 ascengind order 。這裡會將 length 存在 head->next->prev ，然後再 run_size 的時候可以用它回傳 size_t 的型態。 ```c head->next->prev = (struct list_head *) len; ``` > **C99 standard (§ 6.5.3.4 The sizeof operator)** > The sizeof operator yields the size (in bytes) of its operand, which may be an expression or the parenthesized name of a type. The size is determined from the type of the operand. The result is an integer. If the type of the operand is a variable length array type, the operand is evaluated; otherwise, the operand is not evaluated and the result is an integer constant. 而因為 size_t 是 size0f 運算子所回傳的型態，所以 size_t 的大小 (in bytes) 閱讀規格書後，我可以理解用 size_t 的用法，但對於要將 size_t 當成記憶體位址去存放，並沒有很清楚。 > **C99 standard (§ 7.18.3 Limits of other integer types)** > — limit of size_t > SIZE_MAX 65535 > Its implementation-defined value shall be equal to or greater in magnitude (absolute value) than the corresponding value given below, with the same sign. #### run_size & merge_at 合併 `at` 與 `at->prev` 成為 list。這裡比較特別的 `size_t` 參閱規格書後發現 `size_t` 可以回傳指向該物件的指標。第三點所說的 > **C99 standard (§ 6.5.3.4)** > * The value of the result is implementation-defined, and its type (an unsigned integer type) is size_t, defined in <stddef.h> (and other headers) > * A principal use of the sizeof operator is in communication with routines such as storage allocators and I/O systems. > * A storage-allocation function might accept a size (in bytes) of an object to allocate and return a pointer to void. For example: `extern void *alloc(size_t);` >TODO: 要確認。跟我原先想的好像不是同件事。回傳 size_t 0 `run_size(at) + run_size(at->prev)` 操作讓 $len=0+$`前面與其合併的 runsize` 可能發生的情況：　$A$ $B$ $C$ 中的是 $B$ 或 $C$ 都已經合併到其他位置上。所以現在是 NULL。stk_size < 3 ```graphviz digraph structs { node[shape=plaintext]; at [fontcolor="red"]; node[shape=box]; struct1 [label= "head \n NULL"]; { rank="same"; struct1 } at -> struct1 } ``` 回傳 size_t 1 `run_size(at) + run_size(at->prev)` 操作讓 $len=1+$`前面與其合併的 runsize` ，因為後面沒東西，所以 `at` 這個 run 的大小一定為 1。可能發生的情況：　`at` 指向 $B$，而 $C$ 已經合併到其他位置，`stk_size = 2` `at` 指向 $C$，而 $C$ 後面已經沒有 run 了，`stk_size = 3` ```graphviz digraph structs { node[shape=plaintext]; at [fontcolor="red"]; node[shape=box]; struct1 [label= "head"]; struct3 [label= "head->next\n NULL"]; { rank="same"; struct1 -> struct3 } at -> struct1 } ``` 回傳 size_t pointer to head `run_size(at) + run_size(at->prev)` 操作讓 $len=$ `當前 at 的runsize` $+$ `前面與其合併的 runsize` ```graphviz digraph structs { node[shape=plaintext]; at [fontcolor="red"]; node[shape=box]; struct1 [label= "head"]; struct3 [label= "head->next"]; { rank="same"; struct1 -> struct3 } at -> struct1 } ``` ```c static inline size_t run_size(struct list_head *head) { if (!head) return 0; if (!head->next) return 1; return (size_t) (head->next->prev); } ``` ```graphviz digraph structs { node[shape=plaintext]; node[shape=box]; struct4 [label= "prev"]; struct1 [label= "at->prev"]; struct3 [label= "at"]; struct5 [label= "next"]; { rank="same"; struct4 -> struct1 struct1 -> struct3 struct3 -> struct5 } } ``` ```graphviz digraph structs { node[shape=plaintext]; node[shape=box]; struct4 [label= "prev"]; struct1 [label= "list"]; { rank="same"; struct4 -> struct1 } } ``` 由於下方的合併操作只採取 $A+(B+C)$ 或 $(A+B)+C$ 所以每次合併玩都是放在 prev 的位置，$(B+C)$ 後放在 $B$ 的位置， $(A+B)$放在 $A$ 的位置 ```c static struct list_head *merge_at(void *priv, list_cmp_func_t cmp, struct list_head *at) { size_t len = run_size(at) + run_size(at->prev); struct list_head *prev = at->prev->prev; struct list_head *list = merge(priv, cmp, at->prev, at); list->prev = prev; list->next->prev = (struct list_head *) len; --stk_size; // 因為有二個合併，所以 stk_size 相當於 pop 二個合併再 push 一個。 return list; } ``` #### merge_force_collapse >堆疊頂端的 3 個 run 是否滿足以下原則： >* $A$ 的長度要大於 $B$ 和 $C$ 的長度總和。 >* $B$ 的長度要大於 $C$ 的長度 >確保堆疊中 run 的長度平衡，且因合併只能在相鄰的兩個 run 間進行，以確保排序的穩定性，因此合併操作只採取 $A+(B+C)$ 或 $(A+B)+C$ 的形式進行因此下方程式中第 5 行先比較 $A$ `(tp->prev->prev)` 跟 $C$ `tp` 的大小，如果 $A < C$ 且滿足 $A > (B+C)$ 則一定要 $(A+B)+C$ 才能滿足。所以合併 `(tp->prev->prev)` 與 `(tp->prev)`。如果 $A > C$，合併 $(B+C)$ 可以確保合併後的 run，因為 $(B+ C)$ 合併後是放在 $B$ 的位置。而要確保 $B$ 長度盡量大於 $C$。如果是 $(A+B)+C$，那下一輪 $A$ 的位置是放這次合併的 $(A+B)$，$B$ 是放 $C$ 但卻不能確保會到於新進來的 $C$。 > TODO : 確認這裡的數學推導下一輪一定滿足也滿足上方的 > **[Merging Pattern](https://timwellsaid.com/tims-take-on-timsort/)** > If we have Y:190, Z:90, A:30, B:20, C:15, since A is not greater than B + C, it violates #1, we must merge ABC by merging the smaller of A or C with B, so we merge B and C together, resulting Y:190, Z:90, A:30, BC:35. Now we notice A is less than BC, which violates variants #2, we have to keep merging, result Y:190, Z:90, ABC:75. Now the runs satisfy both invariants and we can stop now. 這裡表示不用每次 run 都符合上述的原則。 ```c=1 struct list_head *merge_force_collapse(void *priv, list_cmp_func_t cmp, struct list_head *tp) { while (stk_size >= 3) { if (run_size(tp->prev->prev) < run_size(tp)) { tp->prev = merge_at(priv, cmp, tp->prev); } else { tp = merge_at(priv, cmp, tp); } } return tp; } struct list_head *merge_collapse(void *priv, list_cmp_func_t cmp, struct list_head *tp) { int n; while ((n = stk_size) >= 2) { if ((n >= 3 && run_size(tp->prev->prev) <= run_size(tp->prev) + run_size(tp)) || (n >= 4 && run_size(tp->prev->prev->prev) <= run_size(tp->prev->prev) + run_size(tp->prev))) { if (run_size(tp->prev->prev) < run_size(tp)) { tp->prev = merge_at(priv, cmp, tp->prev); } else { tp = merge_at(priv, cmp, tp); } } else if (run_size(tp->prev) <= run_size(tp)) { tp = merge_at(priv, cmp, tp); } else { break; } } return tp; } ``` #### 提出改進方案並予以實作接下來看 [Wikipedia: Timsort](https://en.wikipedia.org/wiki/Timsort#Merge_direction) 提及的 Galloping mode during merge，在原本的程式碼中沒有對應的實作，因此我參考其中的演算法，首先用二分搜尋找要插入的位置。 > In this mode, the algorithm performs a two-stage search for the place in the run R1 where the next element x of the run R2 would be inserted. In the first stage it performs an exponential search, also known as a galloping search, until finding a k such that $R1[2^{k−1} − 1] < x <= R1[2^k − 1]$, i.e. a region of uncertainty comprising $2^{k−1} − 1$ consecutive elements of R1. The second stage performs a straight binary search of this region to find the exact location in R1 for x. Galloping mode is an attempt to adapt the merge algorithm to the pattern of intervals between elements in runs. 但做到一半我發現如果要用二分搜尋法在鏈結串列中，那就要找到中點。找到中點的同時，不就可以直接找到那個我要插入的位置嗎？想到這裡後我認為要找到插入的位置比較可以省時間的辦法用最好寫程式的簡易辦法，就是我就用兩個指標去只前後找，這樣也不用用一個問題去解決一個問題。因為 Timsort 會用 Galloping mode 的方法是因為這承襲原先 Timsort 演算法的精神。可能兩個序列為： $a_1, a_2, a_3, a_4,$ 與 $b_1, b_2, b_3, b_4, c_1, c_2$ 而 $a_i < a_j < b_i < b_j < c_i < c_j, i <j$ 那這樣就可以直接將第二個序列插入到第一個序列而不是一直比較然後交換指標。所以如果只有一個指標從前面開始找要插入到哪裡，就根本沒有解決問題，只是解決了減少交換指標的問題而已。然而在鏈結串列中用二元搜尋沒有效果，所以我就試試看用頭尾指標。但我後來發現頭尾指標並不是一個很容易的實作方法，所以我改成另一種我將 sub list 的數值如果都小於另一個 list 後再一起移動。 > [commit 34fe47e](https://github.com/weihsinyeh/Sorting-Lab/commit/34fe47e3962ad5aeab8a476de33c219a6b68d8d5) 心得：雖然寫完後發現是這麼的直覺，但其實在想的過程我卻繞著超大一圈，我一直想用 list_splice 與 list_cut_position 來做。但我越寫考慮的事情越多，一切變得複雜。正當想放棄去睡覺時，我把所有程式碼都刪掉，重新審視短短的程式，發現竟然不用處理 prev。那就可以直接捨棄 LIST_API 。呼應如果一件事很複雜那一定是我用錯方式了。這裡從隨機的輸入資料看不出來 Gallop 有特別大的效果 ![image](https://hackmd.io/_uploads/Bkm5HY8Ra.png) 還發現 Gallop mode 比較次數還變比較多。但我認為比較次數要是一樣的，因為只是把比較的操作往前移。 ![image](https://hackmd.io/_uploads/BkgbxdFUA6.png) 後來再仔細思考程式碼，的確比較一定會比較多。舉個worst case為例：$L1 = {a1, a3, a5, a7, a9}$ 與 $L2 = {a2, a4, a6, a8, a10}$ 而 $a_i < a_j, i < j$。那每次兩個串列的元素比較都一定會多比一次，而這多一次的比較（比較下一個元素是否依然也小於另一個串列的當前元素）下一次也要再比一次。因為串列的元素大小為交替出現，所以不會在當前的回合將下一個元素也加入至要回傳的合併串列，因此當然在下一輪，下一個元素依然出現原先的串列中。接下來試試設計輸入資料，使輸入資料符合 Timsort 演算法設計的初衷:部份資料為排序過的。不知道是因為設計的輸入資料不夠好，還是我的改進根本沒有用，我換了很多種輸入測試資料，兩者的執行時間的差距微忽其微，不標顏色的話，我可能都認為他們是出自同一個實作方式。之所以會有兩條曲線我認為是因為我設計的輸入資料 0.25 的機率會出現排序過的資料。 > [commit ef0805b](https://github.com/weihsinyeh/Sorting-Lab/commit/ef0805b6b50b77284fbd9ef4a759f28cd82684b3) 執行時間 ![image](https://hackmd.io/_uploads/BJHUpsUAa.png) 比較次數 ![image](https://hackmd.io/_uploads/SJWQg38AT.png) 所以重新去看了 [Timsort 的效果描述](https://bugs.python.org/file4451/timsort.txt) 在文章中比較了 "一對一比較" 與 "galloping" 兩種合併演算法。 galloping 的優勢在於在輸入資料有部份排序或是出現重複元素的情況下，能夠減少交換回傳合併串列的指標變更次數。缺點則是在輸入資料的大小是交替出現在二個串列中，那比較次數反而會很大。但因為我認為我的修改是將「當合併二個排序過的序列時，其中一個序列的一部分，在另一個序列的相鄰節點的範圍區間內，那就不用每比較一個節點都要存取到 tail 這個變數」考慮進去了。所以我又透過 `valgrind --tool=cachegrind` 來比較 cache miss 與 cache reference 的次數。以下是比較的結果：我的實作方式 ```shell I refs: 13,935,157,138 I1 misses: 1,380 LLi misses: 1,330 I1 miss rate: 0.00% LLi miss rate: 0.00% D refs: 10,956,528,365 (6,281,425,834 rd + 4,675,102,531 wr) D1 misses: 139,523,388 ( 62,472,994 rd + 77,050,394 wr) LLd misses: 38,308 ( 2,111 rd + 36,197 wr) D1 miss rate: 1.3% ( 1.0% + 1.6% ) LLd miss rate: 0.0% ( 0.0% + 0.0% ) LL refs: 139,524,768 ( 62,474,374 rd + 77,050,394 wr) LL misses: 39,638 ( 3,441 rd + 36,197 wr) LL miss rate: 0.0% ( 0.0% + 0.0% ) ``` Linux 的方式 ```shell I refs: 13,955,149,160 I1 misses: 1,376 LLi misses: 1,327 I1 miss rate: 0.00% LLi miss rate: 0.00% D refs: 11,046,271,135 (6,257,687,985 rd + 4,788,583,150 wr) D1 misses: 139,515,397 ( 62,473,019 rd + 77,042,378 wr) LLd misses: 38,308 ( 2,111 rd + 36,197 wr) D1 miss rate: 1.3% ( 1.0% + 1.6% ) LLd miss rate: 0.0% ( 0.0% + 0.0% ) LL refs: 139,516,773 ( 62,474,395 rd + 77,042,378 wr) LL misses: 39,635 ( 3,438 rd + 36,197 wr) LL miss rate: 0.0% ( 0.0% + 0.0% ) ``` 雖然這兩者差異不大，但 Linux 的 miss 次數較少。首先這個修改而言不算改進，因為我自以為我的想法將某種例子考慮進去了，但有幾個問題卻沒被納入考量：這某種例子常出現嗎？即便我嘗試修改輸入資料以讓它真的很常出現，但我的修改卻沒有達到預期的效果。第二，改變了目標是減少存取到 tail 這個變數的次數，但這操作是發生在鏈節串列，不是在連續記憶體，而 tail 是存放節點的位址，對快取把一個區塊取代成 tail 影響不大，因為原先次數，但這操作是發生在鏈節串列，不是在連續記憶體，而 tail 是存放節點的位址，對快取存的也是迭代器而已，也就是存鏈結串列其中一個元素的記憶體位址。所以才會看到 miss 的次數沒有降低，因為這存取的區塊記憶體用量低，不會影響到有 256KB 的 L1 cache 。那它之所以會增加的原因我還沒想到。 > 補充說明：以計算機結構的角度，有可能跟 cache 有關係，像是先把一個資料讀進來用 PREFATCH 。跟 cache line有關係。 ``` 64 B 8 B ｜ 8 B | 8 B | 8 B ``` > 讀第一個 8B 其實可以讀所有的 64B 。但對於鏈節串列不連續的記憶體，可以用 prefatch 先讀。[remove prefatch.h](https://github.com/torvalds/linux/blob/master/include/linux/prefetch.h)只有非常數量非常多的時候 prefatch 才有用。 :::warning 諾貝爾經濟學獎得主、美國經濟學家 Joseph E. Stiglitz 於 1987年發表〈[Technological Change, Sunk Costs, and Competition](https://academiccommons.columbia.edu/doi/10.7916/D8NS14TN/download)〉一文，提出「沉沒成本」([sunk cost](https://en.wikipedia.org/wiki/Sunk_cost)) 的概念，指「已花費掉且無法復原的成本」。 > a cost that has already been incurred and cannot be recovered. 你將上述經濟學用語硬套在工程討論上，適合嗎？ ::: #### [fennecJ 筆記](https://hackmd.io/@fennecJ/linux2024-q1q2#%E4%BB%A5-linux-%E6%A0%B8%E5%BF%83%E9%A2%A8%E6%A0%BC%E4%B9%8B-API-%E9%80%B2%E8%A1%8C%E6%94%B9%E5%AF%AB) ##### timsort 效能與記憶體用量的平衡：做不到 `inplace` ，但 stack 大小不會太大也可以。透過程式技巧（傳遞指標 `priv` ）opaque 隱藏實作細節：紀錄比較次數，傳遞指標 `priv` 去紀錄次數。去看是不是 stable sorting。不同 sample 下 compare 的次數。統計把資料維度提高。排序是統計學的手法。去比較不一樣的資料分佈。cpython ##### adaptive hiversSort 符合特別的一條線。 ##### power sort 計算兩個 run 的 power 盡可能讓合併的 run 越平衡，比較次數降低。在合併 run 的時機不一樣 --- ## [第二週測驗題](https://hackmd.io/@sysprog/linux2024-quiz2) ### 測驗 `1` 建立二元樹需要有透過得知中序（in-order）與前序 (pre-order) 或是後序（post-order)才能具體說明一個獨一無二的二元樹的原因說明舉例如下。前序是 `[A B C]` 透過前序知道 `A` 是中間的那個點，但卻無法透過前序知道 `B` `C` 相對於A的左邊還是右邊。因此產生的二元樹可能如下中序分別是 `[B C A]`, `[B A C]` , `[A B C]` ``` A A A / / \ \ B B C B \ \ C C (1) (2) (3) ``` 第一個樹：由中序知道 `A` 是中間的那個點，`B` `C` 相對於`A`在左邊。所以都在 `A` 的左子樹中。第二個樹：由中序知道 `A` 是中間的那個點，`B` 相對於 `A` 在左邊，所以在 `A` 的左子樹中。`C` 相對於`A`在右邊，所以在 `A` 的右子樹中。第三個樹：由中序知道 `A` 是中間的那個點，`B` `C` 相對於`A`在右邊。所以都在 `A` 的右子樹中。所以 DFS 中的實作是透過 preorder 會藉由 `find` 找到`preorder[pre_low]` 在`inorder`中的位置 `idx`。由圖得知在 `idx` 左邊的都會被放到左子樹，在`idx`右邊的都會被放到右子樹。因此對應到左子樹的在中序涵蓋的範圍： `[in_low,idx-1]`，右子樹在中序涵蓋的範圍： `[idx+1,in_high]` ```c tn->left = dfs(preorder, pre_low + 1, pre_low + (idx - in_low), inorder, in_low, idx - 1, in_heads, size); tn->right = dfs(preorder, pre_high - (in_high - idx - 1), pre_high, inorder, idx + 1, in_high, in_heads, size); ``` 而觀察剛剛的前序可以知道他的分佈的規則就是 `[中點左子樹右子樹]`，從這裡就可以看到只要知道左子樹有多少個，右子樹有多少個，就可以知道在前序的範圍了。而因為 `pre_low` 是中點左子樹在前序的涵蓋範圍： `[pre_low + 1, pre_low + (idx - in_low)]` 其實就是`pre_low`往後左子樹的在中序涵蓋的範圍： `[in_low,idx-1]` 右子樹在前序的涵蓋範圍： `[pre_high - (in_high - idx - 1), pre_high]` 其實就是`pre_high`往前右子樹的在中序涵蓋的範圍： `[idx+1,in_high]` 也因此從這樣的分析其實知道如果題目提供的只有後序與中序，我只要從後序的最後一個往前讀。改變範圍即可因為後序的分佈是 `[左子樹右子樹中點]`，`post_high`是中點。所以左子樹在後序的涵蓋範圍：`[post_low, post_low + (idx - 1 - in_low)]` 右子樹在後序的涵蓋範圍：`[post_high - 1 - (in_high - (idx+1)) , post_high-1]` 所以這樣也就明白中序在建立二元樹中缺一不可的角色，而前序與後序則是可以只挑其中一個。而 `find` 函式則是用 hash table 去紀錄 `in-order` 的位置而已，可換成其他資料結構來實作。做preorder postorder 都是為了做編譯器最佳化時使用。 Constant folding 為什麼可以做計算，讓常數乘常數還是常數。 #### 在 Linux 核心的 cgroups 相關原始程式碼，找出 pre-order walk 程式碼並探討 [include/linux/cgroup.h-css_for_each_descendant_pre](https://github.com/torvalds/linux/blob/master/include/linux/cgroup.h) [include/linux/lru_cache.h](https://github.com/torvalds/linux/blob/master/include/linux/lru_cache.h?fbclid=IwAR2LZH5MCIjZtj9k0hHCR8rrLqq_u6EPWQlzvdQGiB8ios62X4tRPR8sSdE) ### 測驗 `2` ### 測驗 `3` [But what are Hamming codes? The origin of error correction](https://youtu.be/X8jsijhllIA)看這個影片，原本很疑惑為什麼列的矩陣要長這樣，所以我就把那些遮罩畫出來。 which column? Q0$\left(\begin{array}{cccc} 1&0&1&0\\ 1&0&1&0\\ 1&0&1&0\\ 1&0&1&0\\ \end{array}\right)$ or $\left(\begin{array}{cccc} 0&\style{background-color:#F28500}1&0&1\\ 0&1&0&1\\ 0&1&0&1\\ 0&1&0&1\\ \end{array}\right)$ $\to$ $\left(\begin{array}{cccc}A\\A\\A\\A\\\end{array}\right)$ or $\left(\begin{array}{cccc}5\\5\\5\\5\\\end{array}\right)$ Q1$\left(\begin{array}{cccc} 1&1&0&0\\ 1&1&0&0\\ 1&1&0&0\\ 1&1&0&0\\ \end{array}\right)$ or $\left(\begin{array}{cccc} 0&0&\style{background-color:#F28500}1&1\\ 0&0&1&1\\ 0&0&1&1\\ 0&0&1&1\\ \end{array}\right)$ $\to$ $\left(\begin{array}{cccc}C\\C\\C\\C\\\end{array}\right)$ or $\left(\begin{array}{cccc}3\\3\\3\\3\\\end{array}\right)$ which row? Q2$\left(\begin{array}{cccc} 1 & 1 & 1 & 1\\ 0 & 0 & 0 & 0\\ 1 & 1 & 1 & 1\\ 0 & 0 & 0 & 0\\ \end{array}\right)$ or $\left(\begin{array}{cccc} 0 & 0 & 0 & 0\\ \style{background-color:#F28500}1 & 1 & 1 & 1\\ 0 & 0 & 0 & 0\\ 1 & 1 & 1 & 1\\ \end{array}\right)$ $\to$ $\left(\begin{array}{cccc}F\\0\\F\\0\\\end{array}\right)$ or $\left(\begin{array}{cccc}0\\F\\0\\F\\\end{array}\right)$ Q3$\left(\begin{array}{cccc} 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ \end{array}\right)$ or $\left(\begin{array}{cccc} 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ \style{background-color:#F28500}1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1\\ \end{array}\right)$ $\to$ $\left(\begin{array}{cccc}F\\F\\0\\0\\\end{array}\right)$ or $\left(\begin{array}{cccc}0\\0\\F\\F\\\end{array}\right)$ 畫出來後我才了解甚麼這8個遮罩能夠拿成二進位的效果。將 Q0~Q3 先想像成第 0 個位元到第 3 個位元，然後左邊是清為 0，右邊是設為 1。問題在於為什麼 Q0 的兩個矩陣一定要是這兩個。觀察時就突然想到[上課討論到的 count bit](https://hackmd.io/@sysprog/HJff_gdwp#Change-Problem)做mask的時候的概念(上面用$\to$表示對應的概念)。而影片中的奇偶位元為 1, 2, 4, 8 。選這幾個位置是因為他們剛好是每次 mask 的第一個位置。這也讓分布為 2 的冪。之所以會說是 15-11 Hamming code，再設置 4 個同位元讓各個mask 的 block都為偶數後，還需傳遞結果說明是否為error condition，因此再犧牲第0個位元 parity check the whole block。$16-(1)=15$。再用4個奇偶位元$15-(4)=11$，讓整塊都為偶數。 --- ## 課程回顧 ### [你所不知道的 C 語言:浮點數運算](https://youtu.be/1QmbHXYc8_g?t=8943) 1985 年 Intel 推動 IEEE 754 規格發展。而後 Google 提出 BFloat16 規格來針對深度學習，需要加速器 GPU, Tensorflow, TPU 等，而 CPU 無可避免地要大量存取資料，所以如果能將浮點數占用的資源縮減，單位時間處理的資料量就會倍增。相比整數每個表示的數字都是均勻存在的，而浮點數運算計算的優點在可表示極大跟極小的值，而越接近零的情況可以表示的 accurate 的數字會越多。 `0 / 0` 會出現 signaling Nan 在電腦裡面不是未定義是有定義不過要另外處理。 underflow 與 overflow 作數值處理，要逼近 0 。subnormal 與 denormals 在逼近 0 的地方一直前進，以表示接近 0 的數值。 Quient NaN Signaling NaN 只要有程式就會停止運行或是跳到另外處理。不能隨便把 32 位元的數值改成 64 未免以避免它會 underflow 或 overflow 。因為 CPU 怎樣傳資料給 TPU 都已經先定義好。如果直接增加會讓每次作運算都有效能上的衝擊。 Lazy FP state restore 如果做context swithc的時候要切換通用暫存器，也要切換浮點數暫存器。會很花時間。所以如果只有一個程式有用到浮點數，那A PU 裡面的值則不清掉。不切換 APU Spectre and Meltdown 因為 APU 裡面的東西沒有被清掉，所以遺留一些必要的資訊。要修正來自硬體的 side channel attack 導致越差的執行效能。 Fixed point 的背後實作方式是透過整數。 Fused Madd (Multiply-Add operation) :::warning 何時會用到 fused madd 一類的操作？影像處理領域很多案例。 ::: 浮水印 ### 你所不知道的 C 語言:函式呼叫篇詳見〈[calloc 與 malloc + memset 不同](https://hackmd.io/@sysprog/c-function#ROP-%E6%94%BB%E6%93%8A)〉 > overcommit: malloc() 是一個「你要多少，我先給你」的操作。先前使用 stack 會發生 segfault 其實是一個"正確"的錯誤提醒，而非真正的錯誤。因此，為了防止對同一個指標執行兩次 `free()` 操作而導致程式失敗，應該在每次使用 `free()` 後將指標設為 `NULL`。這是因為釋放一個空指標不會有任何影響，且能避免潛在的問題。見 C99 (7.20.3.2) The free function > The free function causes the space pointed to by ptr to be deallocated, that is, made available for further allocation. If ptr is a null pointer, no action occurs. Parameter 為型式參數 (formal parameter)。 Argument 為實際參數 (actual argument)。 heap allocator 是記憶體裡面常說的 heap 。然而他跟資料結構中的 heap 也息息相關都是分堆。 * 詳見〈[你所不知道的 C 語言：藏在 Heap 裡的細節](https://hackmd.io/@sysprog/c-function#%E8%97%8F%E5%9C%A8-Heap-%E8%A3%A1%E7%9A%84%E7%B4%B0%E7%AF%80)〉 > 為了方便管理 free 出來的 chunk， glibc 會去將一些特定大小的 chunk 做集中管理，其集中管理的機制又分為：fast bin small bin large bin 等等。工具 : * [A graphical processor simulator and assembly editor for the RISC-V ISA](https://github.com/mortbopet/Ripes): 視覺化模擬 CPU。 * [scheduling internals](https://tontinton.com/posts/scheduling-internals/) --- ## 研讀 Linux 核心原始程式碼的 [lib/list_sort.c](https://github.com/torvalds/linux/blob/master/lib/list_sort.c) 程式碼主體採用 bottom up 實作 merge sort，該途徑對 cache 較友善，因為過程中就是一直合併，cache 被參照到的機會更大。相較之下， top down 是會先做 partition 再來 merge，但 partition 本身對 cache 不友善，在 cache 移進移出（內容不斷更新），導致 [cache thrashing](https://en.wikipedia.org/wiki/Thrashing_(computer_science))。 ### 合併方式合併方式是當有 $3 \times 2^k$ 個節點時，合併前二個 $2^k$ 變成 $2^{k + 1}$，並留下一個 $2^k$ 不動，維持著合併：不合併為 $2 : 1$ 的比例。 > This mergesort is as eager as possible while always performing at least 2:1 balanced merges. Given two pending sublists of size 2^k, they are merged to a size-2^(k+1) list as soon as we have 2^k following elements. 而用 $2 : 1$ 比例原因如下。思考選擇合併排序演算法的原因 : 合併排序演算法的時間複雜度是穩定的，不論在無論是最壞情況、平均情況還是最佳情況，算法的執行時間都不會偏離複雜度$O(n\log{n})$。根據合併排序法 $T(n) = 2T(\dfrac{n}{2}) + \theta (n)$ ，第一項是將原問題分成二個子問題做遞迴，第二項是因為合併過程為同時拜訪二個排序過的子串列，因此是線性的。由此可知先前條件為「二個合併的要是一樣大小的最好」，無論在 bottom-up (power of two 合併的二個串列為 2 的冪) 或是 top-down (half-half way 合併的二個串列為原本串列的一半) 都是符合先前條件的。選擇 bottom-up 的原因可從快取 cache 與資料結構 list 的角度分析。前者由於 bottom-up 實作 merge sort 過程中就是一直合併，快取被參照到的機會較大為 cache friendly 。後者因排序的資料結構是串列，用 top-down 需等所有節點都加入至串列才能分成二個子串列，這需考慮節點的數量已得知要從哪裡分開。但是，串列每次都只加入一個節點，剛好是 bottom-up 的想法。因此雖然在論文〈[Bottom-up mergesort — A detailed analysis](https://link.springer.com/content/pdf/10.1007/BF01294131.pdf)〉提到實驗發現 top-down 能用較少的比較次數，經過取捨後仍選擇 bottom-up。 > [commit b5c56e0](https://github.com/torvalds/linux/commit/b5c56e0cdd62979dd538e5363b06be5bdf735a09) "lib/list_sort: optimize number of calls to comparison function" > However, that requires knowing the number of entries to be sorted ahead of time, and making a full pass over the input to count it conflicts with a second performance goal, which is cache blocking. 下面這段話的 fully-eager bottom-up 指當有二個串列都有 $2^k$ 數量的節點就合併。 > Thus, it will avoid cache thrashing as long as $3 \times 2^k$ elements can fit into the cache. Not quite as good as a fully-eager bottom-up mergesort, but it does use 0.2*n fewer comparisons, so is faster in the common case that everything fits into L1. 也就是 fully-eager bottom-up 有可以避免 cache trashing 的優點，然而缺點卻是比 $2:1$ 的想法多 $0.2n$ 的比較次數。開發者在二者間做取捨，最終選擇比較重視比較次數，因為只要 $3 \times 2^k$ 放得進去 L1 快取也可以避免 cache trashing 。 fully-eager bottom up 可避免 cache trashing 是因為 cache trashing 這個問題來自整體效能低， CPU 為了提高效能而加快運算，而問題其實出自存取記憶體需花費時間，才導致瓶頸瓶頸。CPU 為了提高效能而加快運算反而增加存取記憶體的次數，因此導致效能不增反減。而 fully-eager bottom up 讓 CPU 一有二個串列都有 $2^k$ 數量的節點就合併。運算次數高，因此不會在提高效能以加快運算。 fully-eager bottom-up 會有較多的比較次數的原因是 $T(n) = 2T(\dfrac{n}{2}) + \theta (n)$ 。當二個已排序的鏈結串列加起來總長是 $n$ ，那二個串列越平衡越好合併達到排序的效果會發揮最好。舉例已排序與未排序比例為 $16 : 1$ ，則平均要加入的那 1 個節點需要與已經排序過的串列做 8 次的比較。然而 $4 : 4$ 合併也是做 8 次的比較。花費相同比較次數，但前者只為加入一個節點。而會產生 $16 : 1$ 的結果是因為採用 fully-eager bottom-up 的方式。接下來看看論文其中〈[Queue-Mergesort](https://www.sciencedirect.com/science/article/pii/002001909390088Q?via%3Dihub)〉可以了解為什麼 fully-eager bottom-up 產生 $16 : 1$ 為什麼比較次數會比較多次，同時也可以知道要怎麼改善 bottom-up 讓這個變成 optimal。 ### Linux 核心排序實作的演進探討 [lib/list_sort.c](https://github.com/torvalds/linux/blob/master/lib/list_sort.c) 相關實作時，除了觀察程式碼，也該理解為何 Linux 核心開發者採用這段程式碼，推敲開發者是如何思考及進行取捨。我們可觀察 [lib/list_sort.c](https://github.com/torvalds/linux/blob/master/lib/list_sort.c) 最近一次演算法上的改動在 2019 年 5 月 15 日，即 [commit b5c56e0c](https://github.com/torvalds/linux/commit/b5c56e0cdd62979dd538e5363b06be5bdf735a09), **lib/list_sort: optimize number of calls to comparison function** 引入，其中引用 3 篇論文： 1. Bottom-up Mergesort: A Detailed Analysis Wolfgang Panny, Helmut Prodinger Algorithmica 14(4):340--354, October 1995 https://doi.org/10.1007/BF01294131 https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.5260 2. The cost distribution of queue-mergesort, optimal mergesorts, and power-of-two rules Wei-Mei Chen, Hsien-Kuei Hwang, Gen-Huey Chen Journal of Algorithms 30(2); Pages 423--448, February 1999 https://doi.org/10.1006/jagm.1998.0986 https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.5380 3. Queue-Mergesort Mordecai J. Golin, Robert Sedgewick Information Processing Letters, 48(5):253--259, 10 December 1993 https://doi.org/10.1016/0020-0190(93)90088-q 開發者探討 merge sort 的三種實作方式，分別為 **top-down mergesort**、**bottom-up mergesort** 和 **queue-mergesort**，以及開發者提出的方式，以下簡述不同的實作方式： **1. Top-down mergesort** * 有最少的 average case、worst case 的 comparison 次數。 * 需要使用遞迴或是額外空間作為 user stack。 * 需要事先知道整個 list 的大小。下圖例子是 balanced power-of-2 rule dividing strategy: ```graphviz digraph G { splines=line node[shape=box, height=.1];sorted_1 sorted_2 sorted_3 sorted_4 sorted_5 sorted_6 sorted_7 sorted_8; node[shape=record, height=.1];input result divide_41 divide_42 divide_21 divide_22 divide_23 divide_24 merge_21 merge_22 merge_23 merge_24 merge_41 merge_42; node[shape=none];merge_pad input[label="<f0>2|<f1>5|<f2>4|<f3>6|<f4>8|<f5>1|<f6>7|<f7>3"] result[label="<f0>1|<f1>2|<f2>3|<f3>4|<f4>5|<f5>6|<f6>7|<f7>8"] divide_41[label="<f1>2|<f2>5|<f3>4|<f4>6"] divide_42[label="<f1>8|<f2>1|<f3>7|<f4>3"] divide_21[label="<f0>2|<f1>5"] divide_22[label="<f0>4|<f1>6"] divide_23[label="<f0>8|<f1>1"] divide_24[label="<f0>7|<f1>3"] sorted_1[label="1"] sorted_2[label="2"] sorted_3[label="3"] sorted_4[label="4"] sorted_5[label="5"] sorted_6[label="6"] sorted_7[label="7"] sorted_8[label="8"] merge_21[label="<f0>2|<f1>5"] merge_22[label="<f0>4|<f1>6"] merge_23[label="<f0>1|<f1>8"] merge_24[label="<f0>3|<f1>7"] merge_41[label="<f1>2|<f2>4|<f3>5|<f4>6"] merge_42[label="<f1>1|<f2>3|<f3>7|<f4>8"] merge_pad[label=""] input -> divide_41 input -> divide_42 divide_41 -> divide_21 divide_41 -> divide_22 divide_42 -> divide_23 divide_42 -> divide_24 divide_21:f0 -> sorted_2 divide_21:f1 -> sorted_5 divide_22:f0 -> sorted_4 divide_22:f1 -> sorted_6 divide_23:f0 -> sorted_8 divide_23:f1 -> sorted_1 divide_24:f0 -> sorted_7 divide_24:f1 -> sorted_3 sorted_1; sorted_2; sorted_3; sorted_4; sorted_5; sorted_6; sorted_7; sorted_8; sorted_2 -> merge_21:f0 sorted_5 -> merge_21:f1 sorted_4 -> merge_22:f0 sorted_6 -> merge_22:f1 sorted_8 -> merge_23:f1 sorted_1 -> merge_23:f0 sorted_7 -> merge_24:f1 sorted_3 -> merge_24:f0 merge_22 -> merge_pad[style=invis] merge_23 -> merge_pad[style=invis] merge_pad -> result[style=invis] merge_21 -> merge_41 merge_22 -> merge_41 merge_23 -> merge_42 merge_24 -> merge_42 merge_41 -> result merge_42 -> result } ``` **2. Bottom-up mergesort** * 在這幾種變形中需要最多的 comparison 次數。 ```graphviz digraph G { splines=line node[shape=record, height=.1];input result merge_21 merge_22 merge_23 merge_24 merge_41 merge_42; input[label="<f0>2|<f1>5|<f2>4|<f3>6|<f4>8|<f5>1|<f6>7|<f7>3"] result[label="<f0>1|<f1>2|<f2>3|<f3>4|<f4>5|<f5>6|<f6>7|<f7>8"] merge_21[label="<f0>2|<f1>5"] merge_22[label="<f0>4|<f1>6"] merge_23[label="<f0>1|<f1>8"] merge_24[label="<f0>3|<f1>7"] merge_41[label="<f1>2|<f2>4|<f3>5|<f4>6"] merge_42[label="<f1>1|<f2>3|<f3>7|<f4>8"] input:f0 -> merge_21:f0 input:f1 -> merge_21:f1:n input:f2 -> merge_22:f0 input:f3 -> merge_22:f1 input:f4 -> merge_23:f1 input:f5 -> merge_23:f0 input:f6 -> merge_24:f1 input:f7 -> merge_24:f0:n merge_21:f0 -> merge_41:f1 merge_21:f1 -> merge_41:f3 merge_22:f0 -> merge_41:f2 merge_22:f1 -> merge_41:f4 merge_23:f0 -> merge_42:f1 merge_23:f1 -> merge_42:f4 merge_24:f0 -> merge_42:f2 merge_24:f1 -> merge_42:f3 merge_41:f1 -> result:f1 merge_41:f2 -> result:f3 merge_41:f3 -> result:f4 merge_41:f4 -> result:f5 merge_42:f1 -> result:f0 merge_42:f2 -> result:f2 merge_42:f3 -> result:f6 merge_42:f4 -> result:f7 } ``` **3. Queue-mergesort** * 特別適合用於鏈結串列的排序。 * queue-mergesort comparison 的次數少於 bottom-up mergesort，略高於 top-down mergesort。 **4. lib/list_sort.c** 查看更之前的版本 [commit 043b3f7b](https://github.com/torvalds/linux/commit/043b3f7b6388fca6be86ca82979f66c5723a0d10)也會發現是用 bottom-up mergesort 實作。接下來要說明bottom-up 從前面研究發現是比較次數最多的，那開發者如何改進他，讓他也可以達到 optimal。 ### 證明 `list_sort` 的演算法是 optimal mergesort optimal merge sort 定義：在最差的情況下，比較次數小於等於所有其他 mergesort。對所有 mergesort，都可以繪製出一個 merge tree 來表示其合併的順序及長度，每一個節點的權重代表當下的串列長度。方形的是 external node (leaf)，圓形的是 internal node 。如下圖所示： ```graphviz digraph BST { node [fontname="Arial" ]; l1 [ label = "5" ]; l21 [ label = "3" ]; l22 [ label = "2" ]; l31 [ label = "2" ]; l32 [shape=rect label = "1" ]; l33 [shape=rect label = "1" ]; l34 [shape=rect label = "1" ]; l41 [shape=rect label = "1" ]; l42 [shape=rect label = "1" ]; l1 -> { l21 l22 }; l21 -> { l31 l32 }; l22 -> { l33 l34 }; l31 -> { l41 l42 }; } ``` internal node 的數量為 leaf 數減一。在最差的情況下，合併 (x,y) 的次數為 x+y-1 ，因此最差情況下總共的比較次數就是 $\sum\limits_{v}^{T}({w(v)} - 1) = \sum\limits_{v}^{T}{w(v)} - (n-1)$ 其中 $n$ 為這次排序的節點總數，也是 leaf 的總數。而 $v$ 為所有的 internal node 。$w(v)$ 為該 internal node 的權重（也就是長度）。在這邊因為對所有合併排序 n - 1 都是固定的，所以只要看 $\sum\limits_{v}^{T}{w(v)}$ 就好。所以從上面的例子知道比較次數為 $(1 + 1) \times 3 + (1+1+1) \times 2 = 12$ 接下來看論文〈[Queue-Mergesort](https://www.sciencedirect.com/science/article/pii/002001909390088Q?via%3Dihub)〉 queue-mergesort 是 optimal merge-sort 的原因可以聯想 Huffman-encoding 來理解。佇列中所有串列都是以節點數量遞增排列。將佇列開頭二個串列合併加入佇列尾端。此合併方式會有最少量的比較次數為 optimal 。透過數學歸納法證明。第 $1$ 步每個串列都只有一個節點，第一次合併會從開頭取二個擁有一個節點的串列合併，變成一個有二個節點的串列加入尾端中，此時仍滿足串列以節點數量遞增排列。第 $N$ 步假設串列以節點數量遞增排列，將最少節點數量的二個串列從開頭取出，合併加入尾端。第 $N+1$ 步要合併的二個串列，由於遞增關係，一定比 $N$ 步合併的二個串列還多節點數量。那合併後即將加入到尾端的串列一定比目前尾端的串列還大。 ``` head 1 1 1 1 1 1 tail head 1 1 1 1 2 tail head 1 1 2 2 tail head 2 2 2 tail head 2 4 tail head 6 tail ``` 那這件事情的重點就在於 optimal 的由來是因為每次都是用數量最接近的二個串列來合併。那這樣就不會有前面講的不平衡的合併問題，像是 $16:1$ 的問題合併與不合並的比例過大出現。論文中還比較 queue-mergesort, top-down 與 bottom-up 分別會有多少個比較次數。 queue-mergesort 的比較次數就是 $(1+1+1+1+1+1)\times4 + (1+1+1+1+1)\times3=39$ ```graphviz digraph BST { node [fontname="Arial" ]; l1 [ label = "11" ]; l21 [ label = "7" ]; l22, l31 [ label = "4" ]; l32 [ label = "3" ]; l33, l34, l41, l42, l43 [ label = "2" ]; l44, l45, l46, l47, l48, l51, l52, l53, l54, l55, l56 [shape=rect label = "1" ]; l1 -> { l21 l22 }; l21 -> { l31 l32 }; l22 -> { l33 l34 }; l31 -> { l41 l42 }; l32 -> { l43 l44 }; l33 -> { l45 l46 }; l34 -> { l47 l48 }; l41 -> { l51 l52 }; l42 -> { l53 l54 }; l43 -> { l55 l56 }; } ``` top-down merge sort 的比較次數就是 $(1+1+1+1+1+1)\times4 + (1+1+1+1+1)\times3=39$ ```graphviz digraph BST { node [fontname="Arial" ]; l1 [ label = "11" ]; l21 [ label = "5" ]; l22 [ label = "6" ]; l31, l44, l45, l47 [ label = "2" ]; l32, l33, l34 [ label = "3" ]; l41, l42, l43, l46, l48, l51, l52, l53, l54, l55, l56 [shape=rect label = "1" ]; l1 -> { l21 l22 }; l21 -> { l31 l32 }; l22 -> { l33 l34 }; l31 -> { l41 l42 }; l32 -> { l43 l44 }; l33 -> { l45 l46 }; l34 -> { l47 l48 }; l44 -> { l51 l52 }; l45 -> { l53 l54 }; l47 -> { l55 l56 }; } ``` bottom-up merge sort 的比較次數就是 $(1+1+1+1+1+1+1+1)\times4 + (1+1)\times3 + 1\times2=40$ ```graphviz digraph BST { node [fontname="Arial" ]; l1 [ label = "11" ]; l21 [ label = "8" ]; l31, l32 [ label = "4" ]; l22 [ label = "3" ]; l33, l41, l42, l43, l44 [ label = "2" ]; l34, l45, l46, l51, l52, l53, l54, l55, l56, l57, l58 [shape=rect label = "1" ]; l1 -> { l21 l22 }; l21 -> { l31 l32 }; l22 -> { l33 l34 }; l31 -> { l41 l42 }; l32 -> { l43 l44 }; l33 -> { l45 l46 }; l41 -> { l51 l52 }; l42 -> { l53 l54 }; l43 -> { l55 l56 }; l44 -> { l57 l58 }; } ``` 從這裡就清楚看出為什麼〈[Bottom-up mergesort — A detailed analysis](https://link.springer.com/content/pdf/10.1007/BF01294131.pdf)〉提到 top-down 能用較少的比較次數，因為每次都是將次要合併的二個串列一定是由一個 $N$ 個節點的串列分一半成二個子串列。因為每次都是用數量相等或差一個的二個串列來合併。而變成二個二元樹表示，則只能有兩中情況，深度相同、深度差 1 。由於比較次數是依據深度計算，那盡量要越接近 Complete binary tree(leaf 節點深度)差 1 越好。這之後會用到。之所以 Linux 核心的 [lib/list_sort.c](https://github.com/torvalds/linux/blob/master/lib/list_sort.c) 不採用 top-down 與 queue-merge 的考量如前所述，後者是因 breadth-first multi-pass structure 並依照前面的數學歸納法可知。 queue sort 的方式要等待所有節點都加入，形成一堆擁有一個節點的串列後，才能合併。否則如果合併到一半，又有節點加入就違反歸納法中，第 $N+1$ 步要合併的二個串列，一定比 $N$ 步合併的二個串列還多節點數量的假設。 bottom up 的方式因為一有新節點加入，就盡量合併，是比較 cache-frinedly 的作法。 > This cache-friendly depth-first merge order depends on us merging the beginning of the input as much as possible before we've even seen the end of the input (and thus know its size). 所以既然知道其他 merge sort 的方法是如何達到 optimal 的如論文〈[Queue-Mergesort](https://www.sciencedirect.com/science/article/pii/002001909390088Q?via%3Dihub)〉下方所言，就可運用這樣的特性優化原先比較次數較多的 bottom-up 的方法。 > We use the fact that the weight of a merge tree is equal to its external path length. The height h(f) of a node I in a tree is the distance of a path from 1 to the root. The external path length of a tree T’ is the sum E(T’) = $\sum\limits_{l\ a\ leaf\ of\ T'}{h(l)}$ > Thus a merge tree T describes an optimal merge-sort on n items if and only if T has minimum external path length $\sum\limits_{l\ a\ leaf}{h(l)}$. It is known that this occurs if and only if the following condition is satisfied: all of T’s leaves are located on its bottom two levels. 因此可知，只要證明 `list_sort` 的 merge tree 符合所有 leaf 都在最下面兩層這個條件，就可以證明它是 optimal merge sort 。但因 fully-eager bottom-up 採用 cache-friendly 的方式然而卻導致較多比較次數。那所改善的目標的就是繼續維持 bottom-up cache-frinedly 的特性，同時改變合併方式以降低比較次數盡可能接近 optimal 。而這樣的解法就是 $2:1$ 的合併方式。分析 `list_sort` 的演算法，可以得出以下特性： 1. 在合併的過程中，最多只有一個子串列不是 2 的冪（第二階段排最後的子串列）。 2. 在合併的過程中，二者差異不大於兩倍。證明合併過程中對所有長度 n 的子串列，其 merge tree 的所有 leaf 都在最下面兩層。證明過程：第一階段所有合併皆為 2 的冪，符合命題。用數學歸納法證明第二階段：最小端的子串列只可能是 (1,1) 或 (1,2)，二者合併都符合命題。對第二階段過程中的任意合併，假設其二個子串列都符合命題。則合併的過程中，由第一點可知，其中一個子串列一定為 2 的冪，因此其 merge tree 為 full binary tree ，所有 leaf 都在最底層。另其長度為 $2^{k_1}$ ，則 merge tree 的高度為 $k_1$ 。當另一個子串列如果也是 2 的冪，因為第二點中二者差異不大於兩倍，因此其大小只可能是 $2^{k_1-1}$ 或 $2^{k_1+1}$ 。 merge tree 的高度 $k_2 = k_1\pm1$，符合命題。當第二的子串列不為 2 的冪，那根據假設，此串列的 merge tree 所有 leaf 都在最下面兩層，而其中一層就是 $k_1$ ，否則會違反第二點二者差異不大於兩倍的描述，因此也符合命題。根據數學歸納法，對第二階段過程中的任意合併，其 merge tree 的所有 leaf 都在最下面兩層。所以可以得出最後一次合併所產生的 merge tree 也符合命題。根據論文中所述，可得知 `list_sort` 中的演算法也是 optimal merge sort 。 > The merging is controlled by "count", the number of elements in the pending lists. This is beautifully simple code, but rather subtle. Each time we increment "count", we set one bit (bit k) and clear bits k-1 .. 0. Each time this happens (except the very first time for each bit, when count increments to 2^k), we merge two lists of size 2^k into one list of size 2^(k+1). 下面是圖解上方的證明當 11 `1 0 1 1` $\to$ 12 `1 1 0 0` 的時候，第二個位元變成 1 。所以代表進位。讓在圖上的過程是 ```graphviz digraph BST { node [fontname="Arial" ]; l31, l32 [ label = "4" ]; l33, l41, l42, l43, l44 [ label = "2" ]; l34, l51, l52, l53, l54, l55, l56, l57, l58 [shape=rect label = "1" ]; l45, l46 [shape=rect label = "1" color = "green" ]; l31 -> { l41 l42 }; l32 -> { l43 l44 }; l33 -> { l45 l46 }; l41 -> { l51 l52 }; l42 -> { l53 l54 }; l43 -> { l55 l56 }; l44 -> { l57 l58 }; } ``` 因為是第二個位元被 set 了，所以會有兩個有 $2^2$ 的鏈結串列合併，也就是下圖的二個綠色的點，因此才說合併排序與未合併排序的比例是 $2 : 1$ 相當於二個綠色的串列 : 最右邊沒接起來的節點數量。 ```graphviz digraph BST { node [fontname="Arial" ]; l21 [ label = "8" ]; l31, l32 [ label = "4" color = "green" font="bold"]; l33, l41, l42, l43, l44[ label = "2" ]; l45, l46, l47, l51, l52, l53, l54, l55, l56, l57, l58 [shape=rect label = "1" ]; l48 [shape=rect label = "1" color = "red"]; l21 -> { l31 l32 }; l31 -> { l41 l42 }; l32 -> { l43 l44 }; l33 -> { l45 l46 }; l41 -> { l51 l52 }; l42 -> { l53 l54 }; l43 -> { l55 l56 }; l44 -> { l57 l58 }; } ``` 在看一個例子特別的例子。以 count 變化 $14 \to 15$ 二進位變化為 `1 1 1 0` $\to$ `1 1 1 1`來說明： ```graphviz digraph BST { node [fontname="Arial" ]; l21 [ label = "8" ]; l22 [ label = "4" ]; l23 [ label = "2" ]; l22, l31, l32 [ label = "4" ]; l33 [ label = "2" ]; l22 [ label = "4" ]; l33, l34, l41, l42, l43, l44[ label = "2" ]; l35, l36, l45, l46, l47, l51, l52, l53, l54, l55, l56, l57, l58 [shape=rect label = "1" ]; l48 [shape=rect label = "1"]; new [shape=rect label = "1" color = "red"]; l21 -> { l31 l32 }; l22 -> { l33 l34 }; l23 -> { l35 l36 }; l31 -> { l41 l42 }; l32 -> { l43 l44 }; l33 -> { l45 l46 }; l34 -> { l47 l48 }; l41 -> { l51 l52 }; l42 -> { l53 l54 }; l43 -> { l55 l56 }; l44 -> { l57 l58 }; } ``` 接下來如何從 `1 1 1 1` 知道合併排序是從哪兩個鏈結串列要做合併呢？ `1 1 1 1` 對應為所代表的十進制是 `8 4 2 1`。那想像有四個鏈結串列，這分別有8個節點、4個節點、2個節點、1個節點。當前四個鏈結串列的表示 $8 + 4 + 2 + 1$ 那合併兩個鏈節而來的話上一步可能有（已知pending會是哪四個鏈結串列所表示）。 $8 + 4 + (1 + 1) + 1$ 合併二個串列擁有節點數 $2^0$ $8 + (2 + 2) + 2 + 1$ 合併二個串列擁有節點數 $2^1$ $(4 + 4) + 4 + 2 + 1$ 合併二個串列擁有節點數 $2^2$ 那怎麼知道是要合併二個串列擁有節點數 $2^0$？跟上面比對可以看的出來這取決於新的節點是加入哪一群。上面的是新的節點加入可以造成 $3 \to 4$ 因此合併前二個擁有 4 個節點的串列。那現在這個情況是讓 $0 \to 1$ 因此合併前二個擁有一個節點的串列。 ```graphviz digraph BST { node [fontname="Arial" ]; l21 [ label = "8" ]; l22 [ label = "4" ]; l23 [ label = "2" ]; l24 [ label = "1" color ="red"]; l22, l31, l32 [ label = "4" ]; l33 [ label = "2" ]; l22 [ label = "4" ]; l33, l34, l41, l42, l43, l44[ label = "2" ]; l35, l36 [shape=rect label = "1" color ="green"]; l45, l46, l47, l51, l52, l53, l54, l55, l56, l57, l58 [shape=rect label = "1" ]; l48 [shape=rect label = "1"]; l21 -> { l31 l32 }; l22 -> { l33 l34 }; l23 -> { l35 l36 }; l31 -> { l41 l42 }; l32 -> { l43 l44 }; l33 -> { l45 l46 }; l34 -> { l47 l48 }; l41 -> { l51 l52 }; l42 -> { l53 l54 }; l43 -> { l55 l56 }; l44 -> { l57 l58 }; } ``` 該操作目的是為了要讓所有排列的串列是符合節點數量遞增的順序。當新節點加入後可以造成第 k 個 bit 變動，代表形成了一個未排序的 $2^k$。那合併兩個 $2^k$ 後，則合併與未合併變成變成 $2:1$，當放入這些到 L1 cache 成為 $3 \times 2^k$ 進行排序。那到底 $3 \times 2^k$ 放進去的優勢在哪裡。看上面這張圖 8 跟 4 的 list 如果要合併，那就會變成下圖，那這樣葉節點的深度都會只差一。 ```graphviz digraph BST { node [fontname="Arial" ]; l11 [ label = " "] l21 [ label = "8" ]; l22 [ label = "4" ]; l22, l31, l32 [ label = "4" ]; l33 [ label = "2" ]; l22 [ label = "4" ]; l33, l34, l41, l42, l43, l44[ label = "2" ]; l45, l46, l47, l51, l52, l53, l54, l55, l56, l57, l58 [shape=rect label = "1" ]; l48 [shape=rect label = "1"]; l11 -> { l21 l22 } l21 -> { l31 l32 }; l22 -> { l33 l34 }; l31 -> { l41 l42 }; l32 -> { l43 l44 }; l33 -> { l45 l46 }; l34 -> { l47 l48 }; l41 -> { l51 l52 }; l42 -> { l53 l54 }; l43 -> { l55 l56 }; l44 -> { l57 l58 }; } ``` 因此知道有三群子樹，將其中二個相同大小的子樹合併多出一個深度。那只要確保下一次合併的時候也必定會出現一個跟相同大小的子樹，那深度就只會差一層而已。因此$2 : 1$ 就是這樣來的所以這也是可以在上面表格用二進制的關係，因為造成一個位元被 set 代表形成一個完整二元樹了，那就前兩個相同大小的二元樹合併已形成深度多一層的二元樹，下次將這兩個合併就會只差一層。因此讓 leaf 都在最下面兩層。而符合上面數學推導只要證明 list_sort 的 merge tree 符合所有 leaf 都在最下面兩層這個條件，就可以證明它是 optimal merge sort 。接著就可以用表格把剛剛的想法解釋進去。 ### 何時合併 `count` 為 pending list 中節點的數量，在 `count` 變 `count + 1` 後，可以觀察到第 `k` 個 bit 會改為 1，0 ~ `k - 1` 個 bit 會變 0，此時會將 2 個 $2^k$ 合併，並留下一個 $2^k$。每當 `count + 1`，會有二種情況： 1. 當 `count + 1` 後為 $2^k$，就不合併 (只有 leading 1 是 1，其餘都為 0)，案例 `count` = 1（$0001_2$） `count + 1` = 2（$0010_2$）因為 `count + 1` 的結果為 2，是 2 的冪，所以此種情況不合併。 2. 當 count + 1 後不為 $2^k$，就合併，案例 `count` = 2（$0010_2$） `count + 1` = 3（$0011_2$）因為 `count + 1` 為 3，不是 2 的冪，所以此種情況要合併。可以觀察到在 `count` 變 `count + 1` 後，第 k 個 bit 會改為 1，0 ~ `k - 1` 個 bit 會變 0。而這裡的 k 為 0，所以會將 2 個 $2^0$ 合併，並留下一個 $2^0$，也就是合併 2 個 1 為 2，留一個 1 不合併。以下是 `count` 從 0 一直加到 16，合併的狀況：（括號內是當次被合併的節點加起來的節點數量，用加號來分隔尚未合併的節點，黃色底則是告知此次合併的是 1 + 1, 2 + 2 或 4 + 4 等。）而這裡也可以用二進制可來說明為什麼每次當第 K 個的位元被設為 1。則一定有 $2^k$ 個因為最小 significant 的的位元被設為 1 ，代表必定還有較大的 significant 的位元為 1 ，否則他就是當前二進位值唯一被設為 1 的位元，那就是 2 的冪，不用合併。知道這件事後，最小 significant 的的位元被設為 1，則說明該位元曾經被清為 0。而在之所以清為 0，則是因為他在更過去被設為 1 。因此時間軸從過去到現在就是 $1\to0\to1$。已知第 $K$ 個位元被設為 1 代表有 $2^K$ 個節點未被合併。再由時間軸來看必定過去有二群 $2^K$個節點出現過，才會有 $1\to0\to1$ 的變化。因此就把這二群 $2^K$ 個節點合併。則合併的二群 $2^K$ 個節點合併 $:2^K$ 個節點未被合併$=2:1$ | count 變化 | count 二進位值 | 是否合併 | pending 上的節點 | | -------- | -------- | -------- | -------- | | 0 $\to$ 1 | 0000 $\to$ 0001 | 不合併（$2^0$） | 1 | | 1 $\to$ 2 | 0001 $\to$ 0010 | 不合併（$2^1$） | 1 + 1 | | 2 $\to$ 3 | 0010 $\to$ 001==1== | 合併 | (2) + 1 | | 3 $\to$ 4 | 0011 $\to$ 0100 | 不合併（$2^2$） | 2 + 1 + 1 | | 4 $\to$ 5 | 0100 $\to$ 010==1== | 合併 | 2 + (2) + 1| | 5 $\to$ 6 | 0101 $\to$ 01==1==0 | 合併 | (4) + 1 + 1| | 6 $\to$ 7 | 0110 $\to$ 011==1== | 合併 | 4 + (2) + 1| | 7 $\to$ 8 | 0111 $\to$ 1000 | 不合併（$2^3$） | 4 + 2 + 1 + 1| | 8 $\to$ 9 | 1000 $\to$ 100==1== | 合併 | 4 + 2 + (2) + 1| | 9 $\to$ 10 | 1001 $\to$ 10==1==0 | 合併 | 4 + (4) + 1 + 1| | 10 $\to$ 11 | 1010 $\to$ 101==1== | 合併 | 4 + 4 + (2) + 1| | 11 $\to$ 12 | 1011 $\to$ 1==1==00 | 合併 | (8) + 2 + 1 + 1| | 12 $\to$ 13 | 1100 $\to$ 110==1== | 合併 | 8 + 2 + (2) + 1| | 13 $\to$ 14 | 1101 $\to$ 11==1==0 | 合併 | 8 + (4) + 1 + 1| | 14 $\to$ 15 | 1110 $\to$ 111==1== | 合併 | 8 + 4 + (2) + 1| | 15 $\to$ 16 | 1111 $\to$ 10000 | 不合併（$2^4$） | 8 + 4 + 2 + 1 + 1 | 接下來的程式碼就可用來描述這些想法。 ### `list_sort` > 以下原始程式碼取自 Linux 核心原始程式碼，配合解說進行適度修剪。 ```c void list_sort(void *priv, struct list_head *head, list_cmp_func_t cmp) { struct list_head *list = head->next, *pending = NULL; size_t count = 0; /* Count of pending */ if (list == head->prev) /* Zero or one elements */ return; /* Convert to a null-terminated singly-linked list. */ head->prev->next = NULL; ... ``` + `priv`: 從 `merge` 函式可見 priv 會被當作 cmp 的參數傳入，在其他地方不會用到。 + `head`: 傳入指向 `struct list_head` 的指標，和原本自己寫的 `q_sort` 傳入的參數一樣 + `cmp`: 排序過程中用以比較的函式，以函式指標的型別傳入 cmp 參數有考慮到通用性，但會增加 function call 的成本。 `list` 指向 `head` 的第一個節點，`pending` 指向 `NULL`，先檢查是否沒有任何元素或只有一個元素，然後將 `head` 前一個節點指向的下一個位置指向 NULL，將雙向鏈結串列變成單向。 ```c=1 do { size_t bits; struct list_head **tail = &pending; /* Find the least-significant clear bit in count */ for (bits = count; bits & 1; bits >>= 1) tail = &(*tail)->prev; /* Do the indicated merge */ if (likely(bits)) { struct list_head *a = *tail, *b = a->prev; a = merge(priv, cmp, b, a); /* Install the merged result in place of the inputs */ a->prev = b->prev; *tail = a; } /* Move one element from input list to pending */ list->prev = pending; pending = list; list = list->next; pending->next = NULL; count++; } while (list); ``` 上方程式碼中在 while 迴圈，會先讓 `tail` 往前移動到待合併的節點。怎麼判斷串列要合併就是看上面程式碼的 5-7 行的地方。當找到 least-significant clear bit 即二進位表示中最低位為0的位元，而這個位元會在下一次新增節點的時候被 set 為 1 。符合二進制的概念。而進位代表，多產生了一個 2 的冪 $2^k$。此時就會觸發合併操作，當前指向的位置的串列與前一個串列的數量也會剛好是 $2^k$ 合併變成 $2\times2^k$ 節點數量的串列。其中與新加入節點多產生的 $2^k$ 未排序的串列就是 $2:1$。然後 if 判斷是否需要合併。那何時不需要合併，回憶就知道是 2 的冪次方時不用合併，在變為 2 的冪次的上一次的 `count` 是所有的位元都為 1。那就會執行 5-7 行的程式碼直到 `bits` 變為 0。所以也就部會執行合併的過程。而 `bits` 為 0 只發生在加一個節點後總節點數為 2 的冪次方。所以 `bits` 為 0 是較少發生的情況。`bits` 為 1 較常發生，所以用 `likely` 提示編譯器進行分支預測的最佳化 (儘管不保證有一定有效)。最後移動 pending 和 list 的位置，直到 list 沒有節點。pending 從原先沒有節點，然後一次次將節點從 list 中搬到 pending，等到 list 沒有節點就代表現階段結束。 ```c /* End of input; merge together all the pending lists. */ list = pending; pending = pending->prev; for (;;) { struct list_head *next = pending->prev; if (!next) break; list = merge(priv, cmp, pending, list); pending = next; } /* The final merge, rebuilding prev links */ merge_final(priv, cmp, head, pending, list); } ``` 接下來上方這段 for 迴圈的程式碼就是因為如果現在有 15 個節點 pending 上面有 8 4 2 1 的串列，接下來要將這些串列全都合併起來就可以了。 ### merge ```c static struct list_head * merge(void *priv, list_cmp_func_t cmp, struct list_head *a, struct list_head *b) { struct list_head *head, **tail = &head; for (;;) { /* if equal, take 'a' -- important for sort stability */ if (cmp(priv, a, b) <= 0) { *tail = a; tail = &a->next; a = a->next; if (!a) { *tail = b; break; } } else { *tail = b; tail = &b->next; b = b->next; if (!b) { *tail = a; break; } } } return head; } ``` 當 `cmp(priv, a, b) <= 0` 表示 a 的值小於 b，因為由小排序到大，所以先接 a 再接 b，`cmp(priv, a, b) > 0` 表示 a 的值大於 b，則是先接 b 再接 a。其中 `head` 會在排序過 list 的最前面，並回傳回去。 ### 讀 The Cost Distribution of Queue-Mergesort, Optimal Mergesorts, and Power-of-2 Rules ### 排序過程圖示以內含 4, 3, 2, 1 的鏈結串列 `list` 為例，使用 `list_sort` 進行排序，隨著 `count` 增加，`pending` 內節點數量漸漸增加，而 `list` 內的節點數量逐漸減少，以下從 `count` 的變化來解說每個步驟。 #### count = 0 $\to$ count = 1 list 指向 head->next： ```graphviz digraph ele_list { rankdir=LR node[shape=record] head [label="head|{<left>prev|<right>next}", style="bold"] node4 [label="4|{<left>prev|<right>next}", style="bold"] node3 [label="3|{<left>prev|<right>next}", style="bold"] node2 [label="2|{<left>prev|<right>next}", style="bold"] node1 [label="1|{<left>prev|<right>next}", style="bold"] list [label="list", style="normal", color="white"] head:right:e -> node4:w node4:left:w -> head:e node4:right:e -> node3:w node3:left:w -> node4:e node3:right:e -> node2:w node2:left:w -> node3:e node2:right:e -> node1:w node1:left:w -> node2:e list -> node4:n } ``` 因為 bits 為 0，不需合併。`list->prev = pending` 因為 pending 為 NULL，所以 `list->prev` 也指向 NULL： ```graphviz digraph ele_list { rankdir=LR node[shape=record] head [label="head|{<left>prev|<right>next}", style="bold"] node4 [label="4|{<left>prev|<right>next}", style="bold"] node3 [label="3|{<left>prev|<right>next}", style="bold"] node2 [label="2|{<left>prev|<right>next}", style="bold"] node1 [label="1|{<left>prev|<right>next}", style="bold"] list [label="list", style="normal", color="white"] head:right:e -> node4:w node4:right:e -> node3:w node3:left:w -> node4:e node3:right:e -> node2:w node2:left:w -> node3:e node2:right:e -> node1:w node1:left:w -> node2:e list -> node4:n } ``` pending 節點數量 + 1，list 節點數量 - 1，`count + 1`，`pending->next = NULL`： ```graphviz digraph ele_list { rankdir=LR node[shape=record] head [label="head|{<left>prev|<right>next}", style="bold"] node4 [label="4|{<left>prev|<right>next}", style="bold"] node3 [label="3|{<left>prev|<right>next}", style="bold"] node2 [label="2|{<left>prev|<right>next}", style="bold"] node1 [label="1|{<left>prev|<right>next}", style="bold"] list [label="list", style="normal", color="white"] pending [label="pending", style="normal", color="white"] tail [label="tail", style="normal", color="white"] head:right:e -> node4:w node4:right:e -> node3:w[color="white"] node3:left:w -> node4:e node3:right:e -> node2:w node2:left:w -> node3:e node2:right:e -> node1:w node1:left:w -> node2:e list -> node3:n pending -> node4:n tail -> pending:n } ``` #### count = 1 $\to$ count = 2 ```graphviz digraph ele_list { rankdir=LR node[shape=record] head [label="head|{<left>prev|<right>next}", style="bold"] node4 [label="4|{<left>prev|<right>next}", style="bold"] node3 [label="3|{<left>prev|<right>next}", style="bold"] node2 [label="2|{<left>prev|<right>next}", style="bold"] node1 [label="1|{<left>prev|<right>next}", style="bold"] list [label="list", style="normal", color="white"] pending [label="pending", style="normal", color="white"] tail [label="tail", style="normal", color="white"] head:right:e -> node4:w node4:right:e -> node3:w[color="white"] node3:left:w -> node4:e node3:right:e -> node2:w node2:left:w -> node3:e node2:right:e -> node1:w node1:left:w -> node2:e list -> node3:n pending -> node4:n tail -> node4:left:s } ``` ```graphviz digraph ele_list { rankdir=LR node[shape=record] head [label="head|{<left>prev|<right>next}", style="bold"] node4 [label="4|{<left>prev|<right>next}", style="bold"] node3 [label="3|{<left>prev|<right>next}", style="bold"] node2 [label="2|{<left>prev|<right>next}", style="bold"] node1 [label="1|{<left>prev|<right>next}", style="bold"] list [label="list", style="normal", color="white"] pending [label="pending", style="normal", color="white"] head:right:e -> node4:w node4:right:e -> node3:w[color="white"] node3:left:w -> node4:e node3:right:e -> node2:w[color="white"] node2:left:w -> node3:e node2:right:e -> node1:w node1:left:w -> node2:e list -> node2:n pending -> node3:n } ``` #### count = 2 $\to$ count = 3 ```graphviz digraph ele_list { rankdir=LR node[shape=record] head [label="head|{<left>prev|<right>next}", style="bold"] node4 [label="4|{<left>prev|<right>next}", style="bold"] node3 [label="3|{<left>prev|<right>next}", style="bold"] node2 [label="2|{<left>prev|<right>next}", style="bold"] node1 [label="1|{<left>prev|<right>next}", style="bold"] list [label="list", style="normal", color="white"] pending [label="pending", style="normal", color="white"] tail [label="tail", style="normal", color="white"] a [label="a", style="normal", color="white"] b [label="b", style="normal", color="white"] head:right:e -> node4:w node4:right:e -> node3:w[color="white"] node3:left:w -> node4:e node3:right:e -> node2:w[color="white"] node3:right:e -> node4:e node2:left:w -> node3:e node2:right:e -> node1:w node1:left:w -> node2:e list -> node2:n pending -> node3:n tail -> pending a -> node3:n b -> node4:n } ``` ```graphviz digraph ele_list { rankdir=LR node[shape=record] head [label="head|{<left>prev|<right>next}", style="bold"] node4 [label="4|{<left>prev|<right>next}", style="bold"] node3 [label="3|{<left>prev|<right>next}", style="bold"] node2 [label="2|{<left>prev|<right>next}", style="bold"] node1 [label="1|{<left>prev|<right>next}", style="bold"] list [label="list", style="normal", color="white"] pending [label="pending", style="normal", color="white"] tail [label="tail", style="normal", color="white"] a [label="a", style="normal", color="white"] b [label="b", style="normal", color="white"] head:right:e -> node4:w node4:right:e -> node3:w[color="white"] node3:right:e -> node2:w[color="white"] node3:right:e -> node4:e node2:left:w -> node3:e node2:right:e -> node1:w node1:left:w -> node2:e list -> node2:n pending -> node3:n tail -> pending a -> node3:n b -> node4:n } ``` ```graphviz digraph ele_list { rankdir=LR node[shape=record] head [label="head|{<left>prev|<right>next}", style="bold"] node4 [label="4|{<left>prev|<right>next}", style="bold"] node3 [label="3|{<left>prev|<right>next}", style="bold"] node2 [label="2|{<left>prev|<right>next}", style="bold"] node1 [label="1|{<left>prev|<right>next}", style="bold"] list [label="list", style="normal", color="white"] pending [label="pending", style="normal", color="white"] tail [label="tail", style="normal", color="white"] head:right:e -> node4:w node4:right:e -> node3:w[color="white"] node3:right:e -> node2:w[color="white"] node3:right:s -> node4:e node2:left:w -> node3:e node2:right:e -> node1:w[color="white"] node1:left:w -> node2:e list -> node1:n pending -> node2:n tail -> pending } ``` #### count = 3 $\to$ count = 4 ```graphviz digraph ele_list { rankdir=LR node[shape=record] head [label="head|{<left>prev|<right>next}", style="bold"] node4 [label="4|{<left>prev|<right>next}", style="bold"] node3 [label="3|{<left>prev|<right>next}", style="bold"] node2 [label="2|{<left>prev|<right>next}", style="bold"] node1 [label="1|{<left>prev|<right>next}", style="bold"] list [label="list", style="normal", color="white"] pending [label="pending", style="normal", color="white"] tail [label="tail", style="normal", color="white"] head:right:e -> node4:w node4:right:e -> node3:w[color="white"] node3:right:e -> node2:w[color="white"] node3:right:s -> node4:e node2:left:w -> node3:e node2:right:e -> node1:w[color="white"] node1:left:w -> node2:e list -> node1:n pending -> node2:n tail -> node3:left:s } ``` ```graphviz digraph ele_list { rankdir=LR node[shape=record] head [label="head|{<left>prev|<right>next}", style="bold"] node4 [label="4|{<left>prev|<right>next}", style="bold"] node3 [label="3|{<left>prev|<right>next}", style="bold"] node2 [label="2|{<left>prev|<right>next}", style="bold"] node1 [label="1|{<left>prev|<right>next}", style="bold"] pending [label="pending", style="normal", color="white"] tail [label="tail", style="normal", color="white"] head:right:e -> node4:w node4:right:e -> node3:w[color="white"] node3:right:e -> node2:w[color="white"] node3:right:s -> node4:e node2:left:w -> node3:e node2:right:e -> node1:w[color="white"] node1:left:w -> node2:e pending -> node1:n tail -> node3:left:s } ``` #### count = 4 ```graphviz digraph ele_list { rankdir=LR node[shape=record] head [label="head|{<left>prev|<right>next}", style="bold"] node4 [label="4|{<left>prev|<right>next}", style="bold"] node3 [label="3|{<left>prev|<right>next}", style="bold"] node2 [label="2|{<left>prev|<right>next}", style="bold"] node1 [label="1|{<left>prev|<right>next}", style="bold"] list [label="list", style="normal", color="white"] pending [label="pending", style="normal", color="white"] tail [label="tail", style="normal", color="white"] next [label="next", style="normal", color="white"] a [label="a", style="normal", color="white"] b [label="b", style="normal", color="white"] head:right:e -> node1:w node4:right:e -> node3:w[color="white"] node3:right:e -> node2:w[color="white"] node3:right:s -> node4:e node2:left:w -> node3:e node1:left:w -> head:e pending -> node2:n next -> node3:n list -> node1:n a -> node2:n b -> node1:n } ``` ```graphviz digraph ele_list { rankdir=LR node[shape=record] head [label="head|{<left>prev|<right>next}", style="bold"] node4 [label="4|{<left>prev|<right>next}", style="bold"] node3 [label="3|{<left>prev|<right>next}", style="bold"] node2 [label="2|{<left>prev|<right>next}", style="bold"] node1 [label="1|{<left>prev|<right>next}", style="bold"] list [label="list", style="normal", color="white"] pending [label="pending", style="normal", color="white"] tail [label="tail", style="normal", color="white"] next [label="next", style="normal", color="white"] a [label="a", style="normal", color="white"] b [label="b", style="normal", color="white"] head:right:e -> node1:w node4:right:e -> node3:w[color="white"] node3:right:e -> node2:w[color="white"] node3:right:s -> node4:e node2:left:w -> node3:e node1:left:w -> head:e pending -> node2:n next -> node3:n list -> node1:n tail -> node1:n a -> node2:n b -> node2:n } ``` ```graphviz digraph ele_list { rankdir=LR node[shape=record] head [label="head|{<left>prev|<right>next}", style="bold"] node4 [label="4|{<left>prev|<right>next}", style="bold"] node3 [label="3|{<left>prev|<right>next}", style="bold"] node2 [label="2|{<left>prev|<right>next}", style="bold"] node1 [label="1|{<left>prev|<right>next}", style="bold"] list [label="list", style="normal", color="white"] pending [label="pending", style="normal", color="white"] tail [label="tail", style="normal", color="white"] next [label="next", style="normal", color="white"] a [label="a", style="normal", color="white"] b [label="b", style="normal", color="white"] head:right:e -> node1:w node1:left:w -> head:e node1:right:e -> node2:w node2:left:w -> node1:e node2:right:e -> node3:w node3:left:w -> node2:e node3:right:e -> node4:w node4:left:w -> node3:e list -> node1:n pending -> node2:n a -> node2:n b -> node4:n tail -> node4:n } ``` ### 證明前提：所有子串列其長度皆為 2 的冪，所有合併皆為同大小串列。設 $t_i$ 為目前長度小於 $2^i$ 的所有子串列的節點數總和。$i$ 為正整數。第一點：證明當 $t_i$ 從 $3\times2^{i-1}-1$ 加一變成 $3\times2^{i-1}$ 時，不存在正整數 $j \neq i$ 使得 $t_j$ = $3\times2^{j-1}$ 。假設存在 $j>i$ 且 $t_j$ = $3\times2^{j-1}$ 。則有一個正整數 $l$ 使得 $3\times2^{i-1} + l = 3\times2^{j-1}$ 。因為長度小於 $2^{i}$ 的所有子串列的節點數總和都包含在 $t_i$ 裡面了，所以 $l$ 都在長度大於等於 $2^i$ 的串列裡。因此可以把 $l$ 寫成 $A\times2^i$ 。其中 A 等於任意正整數。 $3\times2^{i-1} + A\times2^i = 3\times2^{j-1}$ $(3+2A)\times2^{i-1} = 3\times2^{j-1}$ $\dfrac{(3+2A)}{3} = 2^{j-i}$ $1+\dfrac{2A}{3} = 2^{j-i}$ 令 $A = 3B$ 代入，$B$ 為任意正整數。 $1+2B = 2^{j-i}$ 因為 $j>i$ ，所以 $2^{j-i}$ 必為偶數，不存在任何正整數 $B$ 使得 $1+2B$ 等於偶數，假設不成立，因此不存在 $j>i$ 且 $t_j$ = $3\times2^j$ 。假設存在 $j<i$ 且 $t_j$ = $3\times2^{j-1}$ 。則有一個正整數 $l$ 使得 $3\times2^{i-1} = 3\times2^{j-1}+l$ 。因為長度小於 $2^{j}$ 的所有子串列的節點數總和都包含在 $t_j$ 裡面了，所以 $l$ 都在長度大於等於 $2^j$ 的串列裡。因此可以把 $l$ 寫成 $A\times2^j$ 。其中 A 等於任意正整數。 $3\times2^{i-1} = 3\times2^{j-1} + A\times2^j$ $3\times2^{i-1} = (3+2A)\times2^{j-1}$ $\dfrac{(3+2A)}{3} = 2^{i-j}$ $1+\dfrac{2A}{3} = 2^{i-j}$ 令 $A = 3B$ 代入，$B$ 為任意正整數。 $1+2B = 2^{i-j}$ 因為 $j<i$ ，所以 $2^{i-j}$ 必為偶數，不存在任正整數 $B$ 使得 $1+2B$ 等於偶數，假設不成立，因此不存在 $j<i$ 且 $t_j$ = $3\times2^j$ 。總和上述兩點，可得證。第二點：證明在「能保持差異最大 $2:1$ 的情況下，若可合併則進行合併」的條件下，當$t_{n+1}$ 從 $3\times2^n-1$ 增加到 $3\times2^n$ 時，一定存在二個長度為 $2^n$ 的子串列可供合併。當 $n = 0$，因為長度最短為 1， $3\times2^0$ = 3，長度分別為 1,1,1 ，因此可以合併，成立。設當 $n = k$ 時成立，則當 n = k+1 時，必須證明當 $t_{k+2}$ 增加到 $3\times2^{k+1}$ 時，一定存在二個長度為 $2^{k+1}$ 的子串列可合併。第一個 $2^{k+1}$ 子串列在 $t_{k+1}$來到 $3\times2^k$ 時根據假設誕生。此時 $t_{k+2}$ 也等於 $3\times2^k$ ，合併後 $t_{k+2}$ = $3\times2^k$，$t_{k+1}$ = $2^k$ 。第二個 $2^{k+1}$ 子串列在 $t_{k+1}$再次來到 $3\times2^k$ 時根據假設誕生，此時 $t_{k+2}$ 來到 $5\times2^k$ 。由上述兩點可知，在 $t_{k+2}$ 從 0 增加到 $3\times2^{k+1}$ 的過程中，一定會經過 $3\times2^k$ 以及 $5\times2^k$ ，並在這兩時刻產生出長度為 $2^{k+1}$ 的串列。所以當 $t_{k+2}$ 增加到 $6\times2^{k} = 3\times2^{k+1}$ 時，一定存在二個長度為 $2^{k+1}$ 的子串列可供合併，根據數學歸納法得證。第三點：證明 $2^i$ 長度串列的合併(二個 $2^{i-1}$ 合併成 $2^i$)只會發生在 $t_i$ 變成 $3\times2^{i-1}$ 。由第二點可知，對所有自然數 $i$ ，$t_i$ 不會超過 $3\times 2^{i-1}$ ，因為只要等於 $3\times2^{i-1}$ 就會合併串列並變回 $2^{i-1}$ 。所以合併不會發生在 $t_i > 3\times2^{i-1}$ 。當 $t_i < 3\times 2^{i-1}$ ，此時合併成 $2^i$ 串列無法保證兩條串列比例差異不會超過 $2:1$ ，所以不會合併。故合併只會發生在 $t_i = 3\times 2^{i-1}$ 得證。綜合上述三點，可得出當 count 增加到 count+1 ，最多只會存在一正整數 $i$ 使得 $t_i$ 為 $3\times 2^{i-1}$ 。若此 $i$ 存在，則一定可合併成長度為 $2^i$ 的串列，且此串列為此時合併的唯一串列。每次 count 遞增所合併的串列也不會造成更長串列的連鎖合併，因為合併成 $2^i$ 只會改變 $t_i$ 的值，不會改變其他的 $t$ 值，所以不會有其他 $t_j$ 增加成 $3\times 2^{j-1}$ 而造成合併。