2021q1 Homework5 (sort)

contributed by < hankluo6 >

GitHub

取得測量資料

commit 1d9afc

使用 dev_read 做排序，並透過 copy_to_user 回傳執行時間

static ssize_t dev_read(struct file *filep,
                        char *buffer,
                        size_t len,
                        loff_t *offset)
{
    /* Give at most 8 bytes per read */
    ktime_t kt;
    int *arr;

    arr = kmalloc_array(TEST_LEN, sizeof(*arr), GFP_KERNEL);
    for (size_t i = 0; i < TEST_LEN; ++i) {
        arr[i] = next();
    }

    kt = ktime_get();
    sort_impl(arr, TEST_LEN, sizeof(*arr), cmpint, NULL);
    kt = ktime_sub(ktime_get(), kt);
    uint64_t times = ktime_to_us(kt);
    /* copy_to_user has the format ( * to, *from, size) and ret 0 on success */

    int n_notcopied = copy_to_user(buffer, &times, len);

    kfree(arr);
    if (0 != n_notcopied) {
        printk(KERN_ALERT "XORO: Failed to read %d/%ld bytes\n", n_notcopied,
               len);
        return -EFAULT;
    }
    printk(KERN_INFO "XORO: read %ld bytes\n", len);
    return len;
}

因為要在 dev_read 中使用 cmpint，所以 cmpint 不能被宣告為 __init

-static int __init cmpint(const void *a, const void *b)
+static int cmpint(const void *a, const void *b)
{
    return *(int *) a - *(int *) b;
}

include/linux/init.h

These macros are used to mark some functions or initialized data (doesn't apply to uninitialized data) as initialization functions. The kernel can take this as hint that the function is used only during the initialization phase and free up used memory resources after

被標示為 __init 的 function 會被放在 .init.text 區段，在初始化結束後，該區段記憶體便會被釋放。

整合 swenson/sort 至 ksort

commit 95172c
此部份參考 Julian-Chu

新增 64 位元的排序比較函式：

static int cmpint64(const void *a, const void *b)
{
    uint64_t a_val = *(uint64_t *) a;
    uint64_t b_val = *(uint64_t *) b;
    if (a_val > b_val)
        return 1;
    if (a_val == b_val)
        return 0;
    return -1;
}

sort.h 內 user space 函式改為 kernel space，並將超過 frame buffer 的函式改為使用 heap。另外 sort.h 沒辦法通過靜態分析，故我更改 sort.h 使其能符合靜態檢查 (commit 452941)。

避免書寫為 "frame buffer"，後者是 Linux 核心裝置驅動程式裡頭重要的介面，見 Linux framebuffer

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

jserv

測試資料為 100 次實驗，每次排序 10000 筆資料的時間，並將 95% 信賴區間外的極端值設為平均值方便作圖：

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

排序時間順序為：

Algorithm	Average Time (ns)
merge sort in place	538154
merge_sort	542572
quick_sort	546211
sqrt_sort	648992
tim sort	653751
shell_sort	669552
grail_sort_dyn_buffer	724964
heap_sort	786883
grail sort	811060
rec stable sort	2087141
kernel heap sort	2244015
bitonic sort	11530609
binary insertion sort	11586982
selection_sort	105409396
bubble sort	116702741

實作 intro sort

將作業說明提供的 intro sort 整合至 ksort 中，並比較效能：

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

可以發現 Kernel heap sort 的執行速度較 intro sort 好。

在 Linux 核心內部，我們在意 comparator 函式的呼叫次數，比較二者:

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Kernel heap sort 明顯比 intro sort 的比較次數來的少，且另外能發現到 Kernel heap sort 的比較次數相當穩定。

"compare" 為動詞，製作圖表時應改為 "comparison"。你需要解釋演算法設計上，比較次數的落差。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

jserv

此份 intro sort 的實作應能再改進。

改進方案 1: 參考 gcc/stl_algo.h，使用 16 當作 threshold，當小於此數字時，改用 insertion sort 排序，結果：

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

改進方案 2: 在程式最後使用 shell sort 做最後的排序，此時未排序好的元素為數量 < thresh 的 block 以及 heap sort 時剩下的一兩個元素，如果使用 shell sort 會需要額外比較至少

O (n)

次，另外 shell sort 容易產生 cache miss，是我們不希望的情形，故改為使用 insertion sort，結果：

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

改進方案 3: shell sort 內使用 tmp 紀錄要交換的變數，再使用 memcpy 做交換，學習 Kernel heap sort 的手法，使用 swap 取代 memcpy，便能減少兩次的記憶體交換 (tmp to array 及 array to tmp)。

/* do_swap() is copied from heap.c */

void sort(void *_array,
          size_t length,
          size_t data_size,
          comparator_t comparator)
{
    ...
    do {
        if (num < 4)
            continue;

        for (size_t j = gaps[i], k = j; j < num; k = ++j) {
-           memcpy(tmp, array + idx(k), size);

            while (k >= gaps[i] &&
-               comparator(array + idx(k - gaps[i]), tmp) > 0) {
+               comparator(array + idx(k - gaps[i]), array + idx(k)) > 0) {
-               memcpy(array + idx(k), array + idx(k - gaps[i]), size);
                do_swap(array + idx(k), array + idx(k - gaps[i]), size, 0);
                k -= gaps[i];
            }

-           memcpy(array + idx(k), tmp, size);
        }
    } while (i-- > 0);
}

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

將上述改進方案 1, 2, 3 全部整合

commit c29814

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

雖然 compare 次數還是比 heap sort 還多，但執行時間已比 heap sort 還要快一些。

可對照 Go 的實作 src/sort/sort.go

quick sort (以及 introsort) 的效能受限於 branch prediction 的表現，可見 Quicksort Pivot Imbalance 的討論，其中 "Super Scalar Sample Sort" 的實作可見 ssssort。

摘自 Hacker News Pattern-defeating quicksort 的討論:

Here's a fun fact: Typical quicksorts (and introsorts) in standard libraries spend most of the time doing literally nothing - just waiting for the next instruction because of failed branch prediction! If you manage to eliminate branch misprediction, you can easily make sorting twice as fast! At least that is the case if you're sorting items by an integer key, or a tuple of integers, or something primitive like that (i.e. when comparison is rather cheap).

Pdqsort efficiently eliminates branch mispredictions and brings some other improvements over introsort as well - for example, the complexity becomes
$O (n k)$ if the input array is of length n and consists of only k different values. Of course, worst-case complexity is always
$O (n \log n)$ .

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

jserv

pdqsort

以 C 實作 pdqsort

詳細程式碼見 pdqsort/pdqsort.c

此實作並不完整，partition_right_branchless 的效能比單純使用 partition_right 較差，整體效能也比原本 pdqsort.cpp 還差，另外 heap sort 會有排序錯誤。

移植到 Kernel 中

如果直接將程式碼在 kernel space 中執行，會發現效能與 user space 落差極大 (約為 user space 的兩倍時間)。

unguarded_insertion_sort 為 pdqsort 中的數量小於 insertion_sort_threshold 時呼叫的 insertion sort，將 unguarded_insertion_sort 內的 memcpy 改為 swap(do_swap 為 heap.c 中的 swap function)：

static void unguarded_insertion_sort(void *_begin, void *_end, size_t size, cmp_func_t cmp_func)
{
    char *begin = (char *)_begin;
    char *end = (char *)_end;
    
    if (begin == end)
        return;
        
    char *tmp = kmalloc(size, GFP_KERNEL);

    for (char *cur = begin + idx(1); cur != end; cur += idx(1)) {
        char *sift = cur;
        char *sift_1 = cur - idx(1);

        if (cmp_func(sift, sift_1)) {
            memcpy(tmp, sift, size);
            do {
-               memcpy(sift, sift_1, size);
+               do_swap(sift, sift_1, size, 0);
                sift -= idx(1);
            } while (cmp_func(tmp, sift_1 -= idx(1)));
            memcpy(sift, tmp, size);
        }
    }
    kfree(tmp);
}

會看到效能落差極大，故推測是 memcpy 讓 pdqsort 在 kernel 中表現不好。

在 LKM 中直接測試 memcpy 與 swap 的效率：

for (size_t i = 0; i < 100000; ++i) {
    uint64_t val1 = next(), val2 = next();
    a[i] = val1;
    b[i] = val2;
    c[i] = val1;
    d[i] = val2;
}

kt = ktime_get();
for (int i = 0; i < 100000; ++i) {
    swap_words_64(&a[i], &b[i], sizeof(uint64_t));
}
kt = ktime_sub(ktime_get(), kt);
printk("swap: %llu\n", ktime_to_ns(kt));


kt = ktime_get();
for (int i = 0; i < 100000; ++i) {
    memcpy(&c[i], &d[i], sizeof(uint64_t));
}
kt = ktime_sub(ktime_get(), kt);
printk("memcpy: %llu\n", ktime_to_ns(kt));

卻會發現 memcpy 效率比 swap 較好。

參考 newlib/memcpy.c 自己重寫 memcpy ，並比較 sort 的效能：

自己寫的 memcpy 效能在 pdqsort 中也比 kernel 內部的 memcpy 還快，另外測試使用 lib/string.c 未優化的 memcpy 版本 (逐 byte 複製)，在 pdqsort 中時間一樣比使用 kernel 內部的 memcpy 還要快，故應不是演算法造成的誤差，透過 printk 印出 memcpy 分別執行的時間：

default memcpy: 44.33 ns
customized memcpy: 22.7 ns

故先避免在 kernel 內使用預設的 memcpy。

猜測為 function call 或 page faults 相關問題，但不知道該怎麼驗證？

見 https://www.kernel.org/doc/htmldocs/kernel-api/ch04s03.html
注意 "populate" 這項操作，允許你對特定的 page 進行 prefault/prealloc，更詳細的說明: mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

jserv

commit 94d6aa

測試 100000 筆資料排序

pdqsort 與 intro sort 相比，速度又快上許多，但在 comparison 的次數上並沒有提昇。理論上，quick sort 的比較次數應比 heap sort 還少，但在 intro sort 與 pdqsort 中皆使用到 insertion sort 做排序，而 insertion sort 雖在速度上有優勢 (良好的 locality 與小範圍排序)，但其比較次數是隨著元素個數成長，其值應比 heap sort 的 average case:
$1.5 \times n \log 2 n + O (n)$ 還高出許多，此部份需再做額外數學證明。

2021q1 Homework5 (sort)

取得測量資料

整合 swenson/sort 至 ksort

實作 intro sort

pdqsort

以 C 實作 pdqsort

移植到 Kernel 中

tags: linux2021

Read more

Linked List Sort

2022q1 Homework5 (quiz5)

2022q1 Homework5 (quiz6)

2022q1 Homework6 (ktcp)

tags: `linux2021`