研讀 Linux 核心的 lib/list_sort.c 原始程式碼

contributed by < DokiDokiPB >

返回 quiz2 作業

研讀 Linux 核心的 lib/list_sort.c 原始程式碼，學習 sysprog21/linux-list 的手法，將 lib/list_sort.c 的實作抽離為可單獨執行 (standalone) 的使用層級應用程式，並設計效能評比的程式碼，說明 Linux 核心的 lib/list_sort.c 最佳化手法

查詢該程式碼提交時的 comment：optimize number of calls to comparison function

This patch avoids badly unbalanced merges on unlucky input sizes. It slightly increases the code size, but saves an average of

0.2 * n

calls to cmp().

Unfortunately, there's not a lot of low-hanging fruit in a merge sort; it already performs only

n * l o g 2 (n) - K * n + O (1)

compares. The leading coefficient is already at the theoretical limit (

l o g 2 (n!)

corresponds to

K = 1.4427

), so we're fighting over the linear term, and the best mergesort can do is

K = 1.2645

, achieved when n is a power of 2.

當執行排序中的 list size 為
$2$ 的次方時，能夠減少比較次數。
此 patch 表示能夠改進的部分有限，
$K 數值最差也有 1.4427$ ，目前情況是
$K = 1.2645$

The differences between mergesort variants appear when n is not a power of 2; K is a function of the fractional part of

l o g 2 (n)

. Top-down mergesort does best of all, achieving a minimum

K = 1.2408

, and an average (over all sizes)

K = 1.248

However, that requires knowing the number of entries to be sorted ahead of time, and making a full pass over the input to count it conflicts with a second performance goal, which is cache blocking.

Obviously, we have to read the entire list into L1 cache at some point, and performance is best if it fits. But if it doesn't fit, each full pass over the input causes a cache miss per element, which is undesirable.

效能好壞取決於，當排序中的 list size 不為
$2$ 的次方的時候。
Top-down mergesort 能夠切割良好的 sublist size，但因為實作上 list_head 使用的是 linked list，必須走一遍 list 去取得 size，會影響效能。
而影響效能的第二考量為 cache blocking

While textbooks explain bottom-up mergesort as a succession of merging passes, practical implementations do merging in depth-first order: as soon as two lists of the same size are available, they are merged.
This allows as many merge passes as possible to fit into L1; only the final few merges force cache misses.

This cache-friendly depth-first merge order depends on us merging the beginning of the input as much as possible before we've even seen the end of the input (and thus know its size).

一般的 bottom-up mergesort 是 level order merge.
這裡的 depth-first 表示透過修改合併順序，能夠減少 cache miss 的機率。

The simple eager merge pattern causes bad performance when n is just over a power of 2. If n=1028, the final merge is between 1024- and 4-element lists, which is wasteful of comparisons. (This is actually worse on average than n=1025, because a 1204:1 merge will, on average, end after 512 compares, while 1024:4 will walk 4/5 of the list.)

Because of this, bottom-up mergesort achieves K < 0.5 for such sizes, and has an average (over all sizes) K of around 1. (My experiments show K=1.01, while theory predicts K=0.965.)

在某些在保持
$2^{k}$ 個情況下、depth-first order 時，依然會遇到 bad performance，例如 1024 + 4 個的情況。

There are "worst-case optimal" variants of bottom-up mergesort which avoid this bad performance, but the algorithms given in the literature, such as queue-mergesort and boustrodephonic mergesort, depend on the breadth-first multi-pass structure that we are trying to avoid.

This implementation is as eager as possible while ensuring that all merge passes are at worst 1:2 unbalanced. This achieves the same average K=1.207 as queue-mergesort, which is 0.2n better then bottom-up, and only 0.04n behind top-down mergesort.

結論是讓兩個將要合併的 sublists 保持在最差
$2 : 1$ 個

Specifically, defers merging two lists of size 2^k until it is known that there are 2^k additional inputs following. This ensures that the final uneven merges triggered by reaching the end of the input will be at worst 2:1. This will avoid cache misses as long as 3*2^k elements fit into the cache.

跟 depth-first order 不同的合併順序：有調整過合併的順序，延遲
$2^{k}$ 個 list 直到遇到另一個
$2^{k}$ ，再跟其合併。

避免「舉燭」，應該一邊了解程式碼，一邊實作。
from deleted message.

list/list_sort.c

`cmp_func`


typedef int __attribute__((nonnull(2,3))) (*cmp_func)(void *,
		struct list_head const *, struct list_head const *);

declare cmp_func as pointer to function (pointer to struct list_head const, pointer to struct list_head const) returning int

來源 cdecl.org

檔案開頭即定義 compare function 樣式 cmp_func，透過實作自定義的 cmp_func 來達到比較效果。

__attribute__((nonnull(2,3)))

The keyword __attribute__ allows you to specify special attributes when making a declaration. This keyword is followed by an attribute specification inside double parentheses. The following attributes are currently defined for functions on all targets: … nonnull. Several other attributes are defined for functions on particular target systems

來源

表示 __attribute__ 屬於 function attribute，表示第二、第三的輸入參數必定不為 NULL

`list_sort()`

查看 function 註解

 * This mergesort is as eager as possible while always performing at least
 * 2:1 balanced merges.  Given two pending sublists of size 2^k, they are
 * merged to a size-2^(k+1) list as soon as we have 2^k following elements.
 *
 * ...
 *
 * The merging is controlled by "count", the number of elements in the
 * pending lists.  This is beautiully simple code, but rather subtle.
 *
 * Each time we increment "count", we set one bit (bit k) and clear
 * bits k-1 .. 0.  Each time this happens (except the very first time
 * for each bit, when count increments to 2^k), we merge two lists of
 * size 2^k into one list of size 2^(k+1).
 * This merge happens exactly when the count reaches an odd multiple of
 * 2^k, which is when we have 2^k elements pending in smaller lists,
 * so it's safe to merge away two lists of size 2^k.
 
 * There are six states we distinguish.  "x" represents some arbitrary
 * bits, and "y" represents some arbitrary non-zero bits:
 * 0:  00x: 0 pending of size 2^k;           x pending of sizes < 2^k
 * 1:  01x: 0 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
 * 2: x10x: 0 pending of size 2^k; 2^k     + x pending of sizes < 2^k
 * 3: x11x: 1 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
 * 4: y00x: 1 pending of size 2^k; 2^k     + x pending of sizes < 2^k
 * 5: y01x: 2 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
 * (merge and loop back to state 2)

盡量最差保持
$2 : 1$ 待合併比例
實作方式是使用 count 變數去記錄 the number of elements in the pending lists
例如 count 以二進位表示，遇到進位的情況，即表示有兩組相同的
$2^{k}$ 個 sublists 要合併；連續進位表示連續合併。
k 在這就定義為進位後的最高位 bit.
因為是 specifically-depth-first-order bottom-up mergesort，透過列出 count 以及合併的方式，可以簡單輕易了解其意思：

參考 hankluo6 的共筆表格
跟玩 2048 遊戲有點類似。

Count Decimal	Count Binary	Merge	Range	Size
0	00000	No	X	X
1	00001	No	X	X
2	00010	Yes	[0,1]	2
3	00011	No	X	X
4	00100	Yes	[2,3]	2
5	00101	Yes	[0,3]	4
6	00110	Yes	[4,5]	2
7	00111	No	X	X
8	01000	Yes	[6,7]	2
9	01001	Yes	[4,7]	4
10	01010	Yes	[8,9]	2
11	01011	Yes	[0,7]	8
12	01100	Yes	[10,11]	2
13	01101	Yes	[8,11]	4
14	01110	Yes	[12,13]	2
15	01111	No	X	X
16	10000	Yes	[14,15]	2

list_sort























































__attribute__((nonnull(2,3)))
void list_sort(void *priv, struct list_head *head,
		int (*cmp)(void *priv, struct list_head *a,
			struct list_head *b))
{
    struct list_head *list = head->next, *pending = NULL;
	size_t count = 0;	/* Count of pending */

	if (list == head->prev)	/* Zero or one elements */
		return;

	/* Convert to a null-terminated singly-linked list. */
    head->prev->next = NULL;

	/*
	 * ...
	 */
    do {
        size_t bits;
        struct list_head **tail = &pending;

		/* Find the least-significant clear bit in count */
	    for (bits = count; bits & 1; bits >>= 1)
	    	tail = &(*tail)->prev;
		/* Do the indicated merge */
	    if (likely(bits)) {
            struct list_head *a = *tail, *b = a->prev;
            a = merge(priv, (cmp_func)cmp, b, a);
            /* Install the merged result in place of the inputs */
            a->prev = b->prev;
            *tail = a;
        }

		/* Move one element from input list to pending */
        list->prev = pending;
        pending = list;
        list = list->next;
        pending->next = NULL;
        count++;
    } while (list);

	/* End of input; merge together all the pending lists. */
    list = pending;
    pending = pending->prev;
    for (;;) {
        struct list_head *next = pending->prev;

        if (!next)
            break;
        list = merge(priv, (cmp_func)cmp, pending, list);
        pending = next;
    }
    /* The final merge, rebuilding prev links */
    merge_final(priv, (cmp_func)cmp, head, pending, list);
}

透過操作 * pending 以及 list->prev 來串接 sorted lists.
sort lists 中會以 next 成員串接 elements.
在進入下個 while-loop前都會新增一個 element 進入 * pending 中
下列表格出其串接的模式
- 其中數字為 sorted lists 中的數量
- - 表示 edge
- 例如以 Graphviz 表示下列表格中 1-1-2 的情形：

while-loop 開始時的 Count	Merge	while-loop 開始時的 pending list	merge function 後 pending list
0	No	NULL	X
1	No	1	X
2	Yes	1-1	2
3	No	1-2	X
4	Yes	1-1-2	2-2
5	Yes	1-2-2	1-4
6	Yes	1-1-4	2-4
7	No	1-2-4	X
8	Yes	1-1-2-4	2-2-4
9	Yes	1-2-2-4	1-4-4
10	Yes	1-1-4-4	2-4-4
11	Yes	1-2-4-4	1-2-8
12	Yes	1-1-2-8	2-2-8
13	Yes	1-2-2-8	1-4-8
14	Yes	1-1-4-8	2-4-8
15	No	1-2-4-8	X
16	Yes	1-1-2-4-8	2-2-4-8