2021q1 Homework5 (sort)

# 2021q1 Homework5 (sort) contributed by < `hankluo6` > > [GitHub](https://github.com/hankluo6/ksort) ### 取得測量資料 > [commit 1d9afc](https://github.com/hankluo6/ksort/commit/1d9afccb1e2135dcb48f7ed4559c37889ac163da) 使用 `dev_read` 做排序，並透過 `copy_to_user` 回傳執行時間 ```c static ssize_t dev_read(struct file *filep, char *buffer, size_t len, loff_t *offset) { /* Give at most 8 bytes per read */ ktime_t kt; int *arr; arr = kmalloc_array(TEST_LEN, sizeof(*arr), GFP_KERNEL); for (size_t i = 0; i < TEST_LEN; ++i) { arr[i] = next(); } kt = ktime_get(); sort_impl(arr, TEST_LEN, sizeof(*arr), cmpint, NULL); kt = ktime_sub(ktime_get(), kt); uint64_t times = ktime_to_us(kt); /* copy_to_user has the format ( * to, *from, size) and ret 0 on success */ int n_notcopied = copy_to_user(buffer, &times, len); kfree(arr); if (0 != n_notcopied) { printk(KERN_ALERT "XORO: Failed to read %d/%ld bytes\n", n_notcopied, len); return -EFAULT; } printk(KERN_INFO "XORO: read %ld bytes\n", len); return len; } ``` 因為要在 `dev_read` 中使用 `cmpint`，所以 `cmpint` 不能被宣告為 `__init` ```diff -static int __init cmpint(const void *a, const void *b) +static int cmpint(const void *a, const void *b) { return *(int *) a - *(int *) b; } ``` > [include/linux/init.h](https://github.com/torvalds/linux/blob/7f75285ca572eaabc028cf78c6ab5473d0d160be/include/linux/init.h#L15) > > These macros are used to mark some functions or initialized data (doesn't apply to uninitialized data) as `initialization` functions. The kernel can take this as hint that the function is used only during the initialization phase and free up used memory resources after 被標示為 `__init` 的 function 會被放在 `.init.text` 區段，在初始化結束後，該區段記憶體便會被釋放。 --- ### 整合 [swenson/sort](https://github.com/swenson/sort) 至 ksort > [commit 95172c](https://github.com/hankluo6/ksort/commit/95172cf8c0e0856bcc3976e91e216aca75f67c86) > 此部份參考 [Julian-Chu](https://hackmd.io/@Julian-Chu/linux2021q1-sort#%E6%94%B9%E5%AF%AB-swensonsort-%E7%9A%84%E7%A8%8B%E5%BC%8F%E7%A2%BC%E4%B8%A6%E6%95%B4%E5%90%88%E9%80%B2-ksort) 新增 64 位元的排序比較函式： ```c static int cmpint64(const void *a, const void *b) { uint64_t a_val = *(uint64_t *) a; uint64_t b_val = *(uint64_t *) b; if (a_val > b_val) return 1; if (a_val == b_val) return 0; return -1; } ``` `sort.h` 內 user space 函式改為 kernel space，並將超過 frame buffer 的函式改為使用 heap。另外 `sort.h` 沒辦法通過靜態分析，故我更改 `sort.h` 使其能符合靜態檢查 ([commit 452941](https://github.com/hankluo6/ksort/commit/452941592ff767122993d6899b6b227db14f5351))。 :::danger 避免書寫為 "frame buffer"，後者是 Linux 核心裝置驅動程式裡頭重要的介面，見 [Linux framebuffer](https://en.wikipedia.org/wiki/Linux_framebuffer) :notes: jserv ::: 測試資料為 100 次實驗，每次排序 10000 筆資料的時間，並將 95% 信賴區間外的極端值設為平均值方便作圖： ![](https://i.imgur.com/YdSJ7Cr.png) ![](https://i.imgur.com/ZDOaCRb.png) ![](https://i.imgur.com/WL8Rflh.png) ![](https://i.imgur.com/3R27Azv.png) 排序時間順序為： | Algorithm | Average Time (ns) | |:--------------------- |:----------------- | | merge sort in place | 538154 | | merge_sort | 542572 | | quick_sort | 546211 | | sqrt_sort | 648992 | | tim sort | 653751 | | shell_sort | 669552 | | grail_sort_dyn_buffer | 724964 | | heap_sort | 786883 | | grail sort | 811060 | | rec stable sort | 2087141 | | kernel heap sort | 2244015 | | bitonic sort | 11530609 | | binary insertion sort | 11586982 | | selection_sort | 105409396 | | bubble sort | 116702741 | --- ### 實作 intro sort 將作業說明提供的 intro sort 整合至 ksort 中，並比較效能： ![](https://i.imgur.com/69bgjs6.png) 可以發現 Kernel heap sort 的執行速度較 intro sort 好。在 Linux 核心內部，我們在意 comparator 函式的呼叫次數，比較二者: ![](https://i.imgur.com/ArRSp3g.png) Kernel heap sort 明顯比 intro sort 的比較次數來的少，且另外能發現到 Kernel heap sort 的比較次數相當**穩定**。 :::warning "compare" 為動詞，製作圖表時應改為 "comparison"。你需要解釋演算法設計上，比較次數的落差。 :notes: jserv ::: 此份 intro sort 的實作應能再改進。改進方案 `1`: 參考 [gcc/stl_algo.h](https://github.com/gcc-mirror/gcc/blob/ee351f7fdbd82f8947fe9a0e74cea65d216a8549/libstdc%2B%2B-v3/include/bits/stl_algo.h#L1855)，使用 16 當作 threshold，當小於此數字時，改用 insertion sort 排序，結果： ![](https://i.imgur.com/hfomhu7.png) ![](https://i.imgur.com/z2plnJB.png) 改進方案 `2`: 在程式最後使用 shell sort 做最後的排序，此時未排序好的元素為數量 < thresh 的 block 以及 heap sort 時剩下的一兩個元素，如果使用 shell sort 會需要額外比較至少 $O(n)$ 次，另外 shell sort 容易產生 cache miss，是我們不希望的情形，故改為使用 insertion sort，結果： ![](https://i.imgur.com/UyR4rdH.png) ![](https://i.imgur.com/BJiRF2C.png) 改進方案 `3`: shell sort 內使用 `tmp` 紀錄要交換的變數，再使用 memcpy 做交換，學習 Kernel heap sort 的手法，使用 swap 取代 memcpy，便能減少兩次的記憶體交換 (`tmp` to `array` 及 `array` to `tmp`)。 ```diff /* do_swap() is copied from heap.c */ void sort(void *_array, size_t length, size_t data_size, comparator_t comparator) { ... do { if (num < 4) continue; for (size_t j = gaps[i], k = j; j < num; k = ++j) { - memcpy(tmp, array + idx(k), size); while (k >= gaps[i] && - comparator(array + idx(k - gaps[i]), tmp) > 0) { + comparator(array + idx(k - gaps[i]), array + idx(k)) > 0) { - memcpy(array + idx(k), array + idx(k - gaps[i]), size); do_swap(array + idx(k), array + idx(k - gaps[i]), size, 0); k -= gaps[i]; } - memcpy(array + idx(k), tmp, size); } } while (i-- > 0); } ``` ![](https://i.imgur.com/Aq1xx6r.png) 將上述改進方案 `1`, `2`, `3` 全部整合 > [commit c29814](https://github.com/hankluo6/ksort/commit/c29814579afaae4a81926ca4a7de3e3083b2f52a) ![](https://i.imgur.com/DaYDajA.png) ![](https://i.imgur.com/RJe3pQj.png) 雖然 compare 次數還是比 heap sort 還多，但執行時間已比 heap sort 還要快一些。 :::warning 可對照 Go 的實作 [src/sort/sort.go](https://golang.org/src/sort/sort.go) quick sort (以及 introsort) 的效能受限於 branch prediction 的表現，可見 [Quicksort Pivot Imbalance](https://github.com/lorenzhs/quicksort-pivot-imbalance) 的討論，其中 "Super Scalar Sample Sort" 的實作可見 [ssssort](https://github.com/lorenzhs/ssssort)。摘自 Hacker News [Pattern-defeating quicksort](https://news.ycombinator.com/item?id=14666710) 的討論: > Here's a fun fact: Typical quicksorts (and introsorts) in standard libraries spend most of the time doing literally nothing - just waiting for the next instruction because of failed branch prediction! If you manage to eliminate branch misprediction, you can easily make sorting twice as fast! At least that is the case if you're sorting items by an integer key, or a tuple of integers, or something primitive like that (i.e. when comparison is rather cheap). > > Pdqsort efficiently eliminates branch mispredictions and brings some other improvements over introsort as well - for example, the complexity becomes $O(nk)$ if the input array is of length n and consists of only k different values. Of course, worst-case complexity is always $O(n \log n)$. :notes: jserv ::: --- ### pdqsort #### 以 C 實作 pdqsort 詳細程式碼見 [pdqsort/pdqsort.c](https://github.com/hankluo6/pdqsort/blob/main/pdqsort.c) :::warning 此實作並不完整，`partition_right_branchless` 的效能比單純使用 `partition_right` 較差，整體效能也比原本 pdqsort.cpp 還差，另外 heap sort 會有排序錯誤。 ::: #### 移植到 Kernel 中如果直接將程式碼在 kernel space 中執行，會發現效能與 user space 落差極大 (約為 user space 的兩倍時間)。 `unguarded_insertion_sort` 為 pdqsort 中的數量小於 `insertion_sort_threshold` 時呼叫的 insertion sort，將 `unguarded_insertion_sort` 內的 `memcpy` 改為 `swap`(`do_swap` 為 `heap.c` 中的 swap function)： ```diff static void unguarded_insertion_sort(void *_begin, void *_end, size_t size, cmp_func_t cmp_func) { char *begin = (char *)_begin; char *end = (char *)_end; if (begin == end) return; char *tmp = kmalloc(size, GFP_KERNEL); for (char *cur = begin + idx(1); cur != end; cur += idx(1)) { char *sift = cur; char *sift_1 = cur - idx(1); if (cmp_func(sift, sift_1)) { memcpy(tmp, sift, size); do { - memcpy(sift, sift_1, size); + do_swap(sift, sift_1, size, 0); sift -= idx(1); } while (cmp_func(tmp, sift_1 -= idx(1))); memcpy(sift, tmp, size); } } kfree(tmp); } ``` ![](https://i.imgur.com/jlZ5uee.png) 會看到效能落差極大，故推測是 `memcpy` 讓 pdqsort 在 kernel 中表現不好。在 LKM 中直接測試 `memcpy` 與 `swap` 的效率： ```c for (size_t i = 0; i < 100000; ++i) { uint64_t val1 = next(), val2 = next(); a[i] = val1; b[i] = val2; c[i] = val1; d[i] = val2; } kt = ktime_get(); for (int i = 0; i < 100000; ++i) { swap_words_64(&a[i], &b[i], sizeof(uint64_t)); } kt = ktime_sub(ktime_get(), kt); printk("swap: %llu\n", ktime_to_ns(kt)); kt = ktime_get(); for (int i = 0; i < 100000; ++i) { memcpy(&c[i], &d[i], sizeof(uint64_t)); } kt = ktime_sub(ktime_get(), kt); printk("memcpy: %llu\n", ktime_to_ns(kt)); ``` ![](https://i.imgur.com/cvonBw1.png) 卻會發現 `memcpy` 效率比 `swap` 較好。參考 [newlib/memcpy.c](https://github.com/eblot/newlib/blob/master/newlib/libc/string/memcpy.c?fbclid=IwAR3MffSotnnFLngDvDD3f3BE5NV44t7K7nhcaZxHQygshJzp-T7HT96SUKw) 自己重寫 `memcpy` ，並比較 sort 的效能： ![](https://i.imgur.com/pKd3pNP.png) 自己寫的 `memcpy` 效能在 pdqsort 中也比 kernel 內部的 `memcpy` 還快，另外測試使用 [lib/string.c](https://github.com/torvalds/linux/blob/9f4ad9e425a1d3b6a34617b8ea226d56a119a717/lib/string.c#L881) 未優化的 `memcpy` 版本 (逐 byte 複製)，在 pdqsort 中時間一樣比使用 kernel 內部的 `memcpy` 還要快，故應不是演算法造成的誤差，透過 `printk` 印出 `memcpy` 分別執行的時間： * default memcpy: 44.33 ns * customized memcpy: 22.7 ns 故先避免在 kernel 內使用預設的 `memcpy`。 :::warning 猜測為 function call 或 page faults 相關問題，但不知道該怎麼驗證？ > 見 https://www.kernel.org/doc/htmldocs/kernel-api/ch04s03.html > 注意 "populate" 這項操作，允許你對特定的 page 進行 prefault/prealloc，更詳細的說明: [mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory](https://lwn.net/Articles/846501/) > :notes: jserv ::: > [commit 94d6aa](https://github.com/hankluo6/ksort/commit/94d6aa23be7051dd4113f1b976f00daa8bb921dc) * 測試 100000 筆資料排序 ![](https://i.imgur.com/TQCv0tl.png) ![](https://i.imgur.com/sDugQio.png) pdqsort 與 intro sort 相比，速度又快上許多，但在 comparison 的次數上並沒有提昇。理論上，quick sort 的比較次數應比 heap sort 還少，但在 intro sort 與 pdqsort 中皆使用到 insertion sort 做排序，而 insertion sort 雖在速度上有優勢 (良好的 locality 與小範圍排序)，但其比較次數是隨著元素個數成長，其值應比 heap sort 的 average case: $1.5\times n \log2n + O(n)$ 還高出許多，此部份需再做額外數學證明。 ###### tags: `linux2021`

Read more

Linked List Sort

2022q1 Homework5 (quiz5)

2022q1 Homework5 (quiz6)

2022q1 Homework6 (ktcp)