contributed by < DokiDokiPB
>
研讀 Linux 核心的 lib/list_sort.c 原始程式碼,學習 sysprog21/linux-list 的手法,將 lib/list_sort.c 的實作抽離為可單獨執行 (standalone) 的使用層級應用程式,並設計效能評比的程式碼,說明 Linux 核心的 lib/list_sort.c 最佳化手法
查詢該程式碼提交時的 comment:optimize number of calls to comparison function
This patch avoids badly unbalanced merges on unlucky input sizes. It slightly increases the code size, but saves an average of
Unfortunately, there's not a lot of low-hanging fruit in a merge sort; it already performs only
The differences between mergesort variants appear when n is not a power of 2; K is a function of the fractional part of
However, that requires knowing the number of entries to be sorted ahead of time, and making a full pass over the input to count it conflicts with a second performance goal, which is cache blocking.
Obviously, we have to read the entire list into L1 cache at some point, and performance is best if it fits. But if it doesn't fit, each full pass over the input causes a cache miss per element, which is undesirable.
While textbooks explain bottom-up mergesort as a succession of merging passes, practical implementations do merging in depth-first order: as soon as two lists of the same size are available, they are merged.
This allows as many merge passes as possible to fit into L1; only the final few merges force cache misses.
This cache-friendly depth-first merge order depends on us merging the beginning of the input as much as possible before we've even seen the end of the input (and thus know its size).
The simple eager merge pattern causes bad performance when n is just over a power of 2. If n=1028, the final merge is between 1024- and 4-element lists, which is wasteful of comparisons. (This is actually worse on average than n=1025, because a 1204:1 merge will, on average, end after 512 compares, while 1024:4 will walk 4/5 of the list.)
Because of this, bottom-up mergesort achieves K < 0.5 for such sizes, and has an average (over all sizes) K of around 1. (My experiments show K=1.01, while theory predicts K=0.965.)
There are "worst-case optimal" variants of bottom-up mergesort which avoid this bad performance, but the algorithms given in the literature, such as queue-mergesort and boustrodephonic mergesort, depend on the breadth-first multi-pass structure that we are trying to avoid.
This implementation is as eager as possible while ensuring that all merge passes are at worst 1:2 unbalanced. This achieves the same average K=1.207 as queue-mergesort, which is 0.2n better then bottom-up, and only 0.04n behind top-down mergesort.
Specifically, defers merging two lists of size 2^k until it is known that there are 2^k additional inputs following. This ensures that the final uneven merges triggered by reaching the end of the input will be at worst 2:1. This will avoid cache misses as long as 3*2^k elements fit into the cache.
避免「舉燭」,應該一邊了解程式碼,一邊實作。
from deleted message.
cmp_func
typedef int __attribute__((nonnull(2,3))) (*cmp_func)(void *,
struct list_head const *, struct list_head const *);
declare cmp_func as pointer to function (pointer to struct list_head const, pointer to struct list_head const) returning int
檔案開頭即定義 compare function 樣式 cmp_func,透過實作自定義的 cmp_func 來達到比較效果。
__attribute__((nonnull(2,3)))
The keyword
__attribute__
allows you to specify special attributes when making a declaration. This keyword is followed by an attribute specification inside double parentheses. The following attributes are currently defined for functions on all targets: … nonnull. Several other attributes are defined for functions on particular target systems表示
__attribute__
屬於 function attribute,表示第二、第三的輸入參數必定不為 NULL
list_sort()
查看 function 註解
* This mergesort is as eager as possible while always performing at least
* 2:1 balanced merges. Given two pending sublists of size 2^k, they are
* merged to a size-2^(k+1) list as soon as we have 2^k following elements.
*
* ...
*
* The merging is controlled by "count", the number of elements in the
* pending lists. This is beautiully simple code, but rather subtle.
*
* Each time we increment "count", we set one bit (bit k) and clear
* bits k-1 .. 0. Each time this happens (except the very first time
* for each bit, when count increments to 2^k), we merge two lists of
* size 2^k into one list of size 2^(k+1).
* This merge happens exactly when the count reaches an odd multiple of
* 2^k, which is when we have 2^k elements pending in smaller lists,
* so it's safe to merge away two lists of size 2^k.
* There are six states we distinguish. "x" represents some arbitrary
* bits, and "y" represents some arbitrary non-zero bits:
* 0: 00x: 0 pending of size 2^k; x pending of sizes < 2^k
* 1: 01x: 0 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
* 2: x10x: 0 pending of size 2^k; 2^k + x pending of sizes < 2^k
* 3: x11x: 1 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
* 4: y00x: 1 pending of size 2^k; 2^k + x pending of sizes < 2^k
* 5: y01x: 2 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
* (merge and loop back to state 2)
count
變數去記錄 the number of elements in the pending lists參考 hankluo6 的共筆 表格
跟玩 2048 遊戲有點類似。
Count Decimal | Count Binary | Merge | Range | Size |
---|---|---|---|---|
0 | 00000 | No | X | X |
1 | 00001 | No | X | X |
2 | 00010 | Yes | [0,1] | 2 |
3 | 00011 | No | X | X |
4 | 00100 | Yes | [2,3] | 2 |
5 | 00101 | Yes | [0,3] | 4 |
6 | 00110 | Yes | [4,5] | 2 |
7 | 00111 | No | X | X |
8 | 01000 | Yes | [6,7] | 2 |
9 | 01001 | Yes | [4,7] | 4 |
10 | 01010 | Yes | [8,9] | 2 |
11 | 01011 | Yes | [0,7] | 8 |
12 | 01100 | Yes | [10,11] | 2 |
13 | 01101 | Yes | [8,11] | 4 |
14 | 01110 | Yes | [12,13] | 2 |
15 | 01111 | No | X | X |
16 | 10000 | Yes | [14,15] | 2 |
list_sort
__attribute__((nonnull(2,3)))
void list_sort(void *priv, struct list_head *head,
int (*cmp)(void *priv, struct list_head *a,
struct list_head *b))
{
struct list_head *list = head->next, *pending = NULL;
size_t count = 0; /* Count of pending */
if (list == head->prev) /* Zero or one elements */
return;
/* Convert to a null-terminated singly-linked list. */
head->prev->next = NULL;
/*
* ...
*/
do {
size_t bits;
struct list_head **tail = &pending;
/* Find the least-significant clear bit in count */
for (bits = count; bits & 1; bits >>= 1)
tail = &(*tail)->prev;
/* Do the indicated merge */
if (likely(bits)) {
struct list_head *a = *tail, *b = a->prev;
a = merge(priv, (cmp_func)cmp, b, a);
/* Install the merged result in place of the inputs */
a->prev = b->prev;
*tail = a;
}
/* Move one element from input list to pending */
list->prev = pending;
pending = list;
list = list->next;
pending->next = NULL;
count++;
} while (list);
/* End of input; merge together all the pending lists. */
list = pending;
pending = pending->prev;
for (;;) {
struct list_head *next = pending->prev;
if (!next)
break;
list = merge(priv, (cmp_func)cmp, pending, list);
pending = next;
}
/* The final merge, rebuilding prev links */
merge_final(priv, (cmp_func)cmp, head, pending, list);
}
* pending
以及 list->prev
來串接 sorted lists.next
成員串接 elements. * pending
中-
表示 edge1-1-2
的情形:while-loop 開始時的 Count | Merge | while-loop 開始時的 pending list | merge function 後 pending list |
---|---|---|---|
0 | No | NULL | X |
1 | No | 1 | X |
2 | Yes | 1-1 | 2 |
3 | No | 1-2 | X |
4 | Yes | 1-1-2 | 2-2 |
5 | Yes | 1-2-2 | 1-4 |
6 | Yes | 1-1-4 | 2-4 |
7 | No | 1-2-4 | X |
8 | Yes | 1-1-2-4 | 2-2-4 |
9 | Yes | 1-2-2-4 | 1-4-4 |
10 | Yes | 1-1-4-4 | 2-4-4 |
11 | Yes | 1-2-4-4 | 1-2-8 |
12 | Yes | 1-1-2-8 | 2-2-8 |
13 | Yes | 1-2-2-8 | 1-4-8 |
14 | Yes | 1-1-4-8 | 2-4-8 |
15 | No | 1-2-4-8 | X |
16 | Yes | 1-1-2-4-8 | 2-2-4-8 |