Linux 核心專題: 改進 ksort

執行人: ChengChaoChun
專題解說錄影
 GitHub

Reviewed by `yu-hsiennn`

應放上 commit 紀錄，且根據好的 Git Commit Message 來書寫，只單看你的 commit message 無法快速理解到這次你的更新。

感謝同學的糾正，已補上 commit 紀錄

Reviewed by `vestata`

Timsort 可以加入部分排序的資料分佈比較，因為 Timsort 演算法針對部分排序好的資料，比較能顯示差異。

感謝同學的建議，已加入部分排序的資料分佈比較，包含 descending order ， ascending order ， push front ， pipe organ 的實驗。

Reviewed by `randyuncle`

請問有辦法提供更多資料分佈的實驗嗎？例如說 listsort.txt 中提到的資料分佈（部份排序的資料、quick sort 的 worst case 等）。

任務簡介

重作第六次作業的 ksort，探討並行化的資料排序實作手法。

TODO: 依據第六次作業規範，重作 ksort

要確保並行化的資料排序結果正確，善用 kernel thread / workqueue / CMWQ 來分發運算工作到有效的 CPU 核。
善用 CMWQ 達成排序的並行處理，需要設計對應的檢驗程式。
可適度引用其他學員的成果，務必註明出處並確認其實作有效益

並行排序實作 (`sort_impl.c`)

ksort 使用〈Engineering a Sort Function〉提出的 quick sort 實作並改進，加上 Linux 核心的 CMWQ 達成並行化的資料排序。

在 sort_impl.c 的 161 行多了一個 swap_cnt 等於 0 變數的判斷式，如果swap_cnt 等於 0 就切換成插入排序。往前看到 130 行到 153 行程式，可以發現一但需要使用 q_swap 巨集，swap_cnt 就會被設定為 1 。因此判斷式成立的條件是，在第一次進入 for 循環時， pb 指標前的元素都小於 pivot key，且 pc 指標之後的元素都大於 pivot key ，也就是說當前陣列有可能已經大致是有序的，此時使用插入排序可能可以達到最好的情況，即時間複雜度為

n

。

提供數學證明。

已嘗試證明
沒有！為何不翻閱你的教科書，從權威材料學習呢？

變數 r 設定交換次數的上限。儘管此時 swap_cnt 為 0 ，但是使用插入排序來排序 pb 指標之前的元素有可能會形成最壞的情況，即時間複雜度為

n^{2}

，同樣的，排序 pc 指標之後的元素也可能會形成最壞的情況，因此這裡限制了交換次數的上限來保證排序的效率。

if (swap_cnt == 0) { /* Switch to insertion sort */
    r = 1 + n / 4;   /* n >= 7, so r >= 2 */
    for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es)
        for (pl = pm; pl > (char *) a && CMP(thunk, pl - es, pl) > 0;pl -= es) {
                q_swap(pl, pl - es);
                if (++swap_cnt > r)
                    goto nevermind;
        }
    return;
}

證明 :
在選擇完 pivot key 之後，pivot key 會被交換到陣列的第一個位置(126行)。swap_cnt 等於 0 的情況如下圖，也就是 b 指標與 c 指標相遇，同時 b 指標掃描過的元素都比 pivot key 要小， c 指標掃描過的元素都比 pivot key 要大，此時就會跳出 130 行的 for 循環，且swap_cnt等於 0。(沒有等於 pivot key 的元素)，然後就換成使用插入排序。
Image Not Showing Possible Reasons
- The image was uploaded to a note which you don't have access to
- The note which the image was originally uploaded to has been deleted
Learn More →

swap_cnt 等於 0 的最好情況 (交換次數最少):

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

pivot key 經過 157 行的 vecswap 之後會被置換到中間，此時時間複雜度為

n

。

swap_cnt 等於 0 的最壞情況 (交換次數最少):

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

pivot key 經過 157 行的 vecswap 之後會被置換到中間，此時陣列不是單調遞減，但時間複雜度接近

n^{2}

。

此時無法確定陣列當前的情況，因此設置一個交換次數的上限，來確保時間複雜度不會走向

n^{2}

。

數學證明 :
排序無非是執行 compare 和 swap 操作， compare 後再決定是否要 swap。

令 f(n) 函數表示排序 n 個元素的時間複雜度

插入排序最好情況:
f(n) = f(n-1) + 1 (+1 表示比較次數)
= f(n-2) + 1 + 1
= f(n-3) + 1 + 1 + 1
= f(n-n) + 1 * n
= n (共 n 次比較， 0 次交換)

元素每交換一次就會多一次比較次數，在這個插入排序中交換次數的上限為
$1 + n / 4$ ，也就是說比較次數最多是
$n + 1 + n / 4$ 。

因此這個插入排序的 upper bound 就是
$5 n / 4 + 1$ ，一但超過就會終止插入排序。

注意用語：

jump 是「跳躍」，而非「跳轉」，這操作沒有「轉」

\to

教育部重編國語詞典的「轉」解釋。

已修改，謝謝糾正

nl 與 nr 分別表示 pb 指標以前待排序的元素個數與 pc 指標以後待排序的元素個數。

當 nl > 100 且 nr > 100 : 新建一個 work item 並放到 workqueue 來排序 pb 指標以前的元素(並行排序)，而 pc 指標以後的元素則跳轉跳躍到 top 標籤排序(在當前 work item)。
當 nl > 0 : 遞迴執行函式來排序 pb 指標以前的元素。遞迴在當前執行流上立刻排序，而 CMWQ 是非同步的執行，因此待排序元素較少時使用遞迴更有效率。

nevermind:
    nl = (pb - pa) / es;
    nr = (pd - pc) / es;

    if (nl > 100 && nr > 100) {
        struct qsort *q = kmalloc(sizeof(struct qsort), GFP_KERNEL);
        init_qsort(q, a, nl, c);
        queue_work(workqueue, &q->w);
    } else if (nl > 0) {
        qs->a = a;
        qs->n = nl;
        /* The struct qsort is used for recursive call, so don't free it in
         * this iteration.
         */
        do_free = false;
        qsort_algo(w);
    }

    if (nr > 0) {
        a = pn - nr * es;
        n = nr;
        goto top;
    }

    if (do_free)
        kfree(qs);

TODO: 整合第一週測驗提到的 Timsort

予以並行化

注意用語：

algorithm 是「演算法」，而非「算法」(the way to calculate)

已修改，謝謝糾正
不用急著說「已修改」，你應該依循本課程教材規範的術語，完整修改和調整後才能說「已修改」。

commit

第一週測驗提到的 Timsort 算法演算法是以雙向鏈結串列來輔助排序的。在我們的排序模組中使用 copy_from_user 函式接收來自 user space 的資料(連續的記憶體空間，也就是陣列)，因此這裡需要建立雙向鏈結串列，並將資料複製到雙向鏈結串列上，才能由 Timsort 演算法排序。

這裡我們走訪陣列中的每一筆資料時，配置一個節點並記錄資料，然後添加到鏈結串列的尾部。

uint64_t *sort_buffer = kmalloc(size, GFP_KERNEL);
if (!sort_buffer)
    return 0;
    
len = copy_from_user(sort_buffer, buf, size);
if (len != 0)
    return 0;

struct list_head head;
INIT_LIST_HEAD(&head);
int ele_num = size / es;
for (int i = 0; i < ele_num; i++) {
    element_t *elem = kmalloc(sizeof(element_t), GFP_KERNEL);
    elem->val = *(sort_buffer + i);
    list_add_tail(&elem->list, &head);
}

排序完成後我們需要把資料結構由鏈結串列轉換為陣列，然後釋放鏈結串列配置的記憶體空間，最後再使用 copy_to_user 傳遞到 user spcae。

struct list_head *pos;
uint64_t *buf_p = sort_buffer;
element_t *entry;
list_for_each (pos, &head) {
    entry = list_entry(pos, element_t, list);
    *(buf_p) = entry->val;
    buf_p++;
}

/* Free list */
element_t *node, *safe;
list_for_each_entry_safe (node, safe, &head, list) {
    list_del(&node->list);
    kfree(node);
}

len = copy_to_user(buf, sort_buffer, size);
if (len != 0)
    return 0;

Timsort 並行化

修改 Timsort 的函式介面，函式接收一個 work_struct 參數。

void timsort(struct work_struct *w);

原先的 Timsort 需要接收三個參數，這裡定義一個結構體，用來記錄這三個參數以及 work_struct 。

struct timsort_struct {
    struct work_struct w;
    void *priv;
    struct list_head *h;
    list_cmp_func_t cmp;
};

將 timsort_struct 結構體初始化，然後將 work_item 放到 workqueue。

void timsort_init(struct timsort_struct *t,
                  void *priv_,
                  struct list_head *h_,
                  list_cmp_func_t cmp_)
{
    INIT_WORK(&t->w, timsort);
    t->priv = priv_;
    t->h = h_;
    t->cmp = cmp_;
};

timsort_init(&tim_, &count, &head, tim_compare);
queue_work(workqueue, &tim_.w);
drain_workqueue(workqueue);

接著看到 Timsort 函式，原先的 Timsort 函式要接收 void 指標， list_head 指標，以及比較函式，而我們的 Timsort 改成接收一個 work_struct 指標。這時使用 container_of 巨集就可以得到 timsort_struct 結構體在記憶體起始位置，並取得排序的參數。

struct timsort_struct *t = container_of(w, struct timsort_struct, w);
struct list_head *head = t->h;
list_cmp_func_t cmp = t->cmp;
void *priv = t->priv;

如何確保上述程式碼是正確且有效的實作？

已貼上實作 github 連結
所以呢？你到底如何驗證？只用少數的測試就能說明程式碼正確無誤嗎？倘若你搭乘的飛機，其背後的工程師也用這樣的態度去面對飛機的自動控制系統，你趕搭飛機嗎？補上驗證方式的探討、數學分析，並充分從實驗進行檢討。

驗證實作有效性一 :

commit

在 CMWQ 提到 :

work item 包含一個紀錄要非同步執行的函式的指標

A work item is a simple struct that holds a pointer to the function that is to be executed asynchronously.

kworkers 依次執行 workqueue 中的函式

For threaded workqueues, special purpose threads, called [k]workers, execute the functions off of the queue, one after the other.

參考 <Linux内核设计与实现（第三版）> 這本書 :

行程在準備就緒或者正在運行的狀態都是TASK_RUNNING， Linux 並沒有區分行程的 ready state 和 running state。

TASK_RUNNING 在 Linux 核心中的定義 :
#define TASK_RUNNING 0x00000000

3.驗證方法

實際執行結果 :

sort_selector : 1 選擇 Timsort
timsort 執行過程中將 user 行程的狀態輸出，此時狀態為 TASK_RUNNING

sudo dmesg 
[  127.631722] sort_selector : 1
[  127.631746] sort_read (pid=2885, comm=user, state=0)
[  127.631844] timsort (pid=24, comm=kworker/1:0, state=0)
[  127.631846] comm = user , task pid = 2885, task state = 0
[  127.632194] timsort time_cost  289419 ns
[  127.632196] sorted by timsort

TODO: 實作 Pattern Defeating Quicksort (pdqsort)

比較 Timsort, pdqsort 和 Linux 核心 lib/sort.c，並確保可從使用者層級的程式對裝置檔案進行設定 (如 write 系統呼叫)，讓這些排序實作存在於同一個 Linux 核心模組，並得以切換和比較。
過程中需要量化執行時間，可在 Linux 核心模組和使用者空間 (userspace) 中測量。在 Linux 核心模組中，可用 ktime 系列的 API，而在使用者層級可用 clock_gettime 相關 API，分別用 gnuplot 製圖。
善用統計模型，除去極端數值，過程中應詳述你的手法;
需要針對不同的情境去準備測試資料;

commit

目前在 ksort 模組有 qsort ， linux sort ， pdqsort ，以及 Timsort 。

使用 write 系統呼叫來選擇使用的排序。當 sort_selector 為 1 時，使用 Timsort ，當 sort_selector 為 2 時，使用 linux sort ，當 sort_selector 為 3 時，使用 pdqsort，當 sort_selector 為 4 時，使用 qsort。

static int sort_selector = 0;

static ssize_t sort_write(struct file *file,
                          const char *buf,
                          size_t size,
                          loff_t *offset)
{
    void *set_sort_buf = kmalloc(size, GFP_KERNEL);
    if (!set_sort_buf)
        return 0;

    unsigned long len;
    len = copy_from_user(set_sort_buf, buf, size);
    if (len != 0)
        return 0;

    int set_sort = *(int *)set_sort_buf;
    if(set_sort > 0 && set_sort < 5){
        sort_selector = set_sort;
        printk("sort_selector : %d", set_sort);
        return set_sort;
    }else{
        printk("sort_selector error");
        return 0;
    }
}

在 user.c 測試結果如下

sudo ./user 1

sort by timsort
set seccuess  1
Soring succeeded!
------------------------
sudo ./user 2

sort by linux lib sort
set seccuess  2
Soring succeeded!
------------------------
sudo ./user 3

sort by pdqsort
set seccuess  3
Soring succeeded!
------------------------
sudo ./user 4

sort by qsort
set seccuess  4
Soring succeeded!

排序時間比較

降序

使用者空間 (userspace) 中測量

Linux 核心模組中測量

Fisher Yates shuffling

使用者空間 (userspace) 中測量

Linux 核心模組中測量

分析

在 kernel space 中， Timsort 和 pdqsort 的表現差不多，但是明顯要優於 linux sort 。而在 user space 中， Timsort 的表現都是最差的。這是因為 userspace 在系統呼叫後， kernel 先走訪了一次陣列並且同時建立鏈結串列，才進行排序。而在排序完後又走訪二次鏈結串列，一次將資料複製到陣列上，一次釋放鏈結串列節點，最後才回到 user space 。

上述實驗是否涵蓋多種資料分布的情境？
orlp/pdqsort 提到的最差、最佳，和平均狀況應予以考慮。

Orsan Peters (Timsort 作者) 提到 pdqsort 對於單調遞增，單調遞減，相同元素，或者嚴格遞增單最後一個元素為 0 的陣列(作者稱為 push front)都可以有不錯的表現。下圖是作者的實驗數據，可以看到 shuffled 和 pipe organ (元素序列為小到大然後再大到小) 陣列明顯的比其他序列的陣列排序時間要長。

下面的實驗來驗證 pdqsort 的排序效率並與 linuxsort 和 timsort 比較。

all equal

使用者空間 (userspace) 中測量
Linux 核心模組中測量

ascending order

使用者空間 (userspace) 中測量
Linux 核心模組中測量

push front

使用者空間 (userspace) 中測量
Linux 核心模組中測量

pipe organ

使用者空間 (userspace) 中測量
Linux 核心模組中測量

經過一系列的實驗，可以發現 shuffled (Fisher Yates shuffling) 和 pipe organ 的排序時間確實是要比其他序列的陣列排序時間要更長。在作者提供的實驗數據中 pipe organ 的排序時間是要略長於 shuffled 的，而我實際實驗的結果是 Fisher Yates shuffling 排序時間是要比 pipe organ 更長的，這可能是陣列長度不一致(作者實驗的陣列長度是

10^{6}

)，也有可能是陣列洗牌方式的影響。

TODO: 引入 PRNG

考慮到產生亂數的時間和可預測性，改用 xorshift128+ 作為 PRNG，並善用 xoro 核心模組。
資料排序要考慮到演算法最差的狀況，不只有亂數輸入。
針對執行結果，予以討論，附上數學分析和解讀。

xoroshiro128+

使用者空間 (userspace) 中測量
Linux 核心模組中測量

xorshift128+

使用者空間 (userspace) 中測量
Linux 核心模組中測量

Linux 核心專題: 改進 ksort

Reviewed by yu-hsiennn

Reviewed by vestata

Reviewed by randyuncle

任務簡介

TODO: 依據第六次作業規範，重作 ksort

並行排序實作 (sort_impl.c)

TODO: 整合第一週測驗提到的 Timsort

Timsort 並行化

TODO: 實作 Pattern Defeating Quicksort (pdqsort)

排序時間比較

降序

Fisher Yates shuffling

分析

all equal

ascending order

push front

pipe organ

TODO: 引入 PRNG

xoroshiro128+

xorshift128+

Reviewed by `yu-hsiennn`

Reviewed by `vestata`

Reviewed by `randyuncle`

並行排序實作 (`sort_impl.c`)