# 2019q3 Homework4 (quiz4)
contributed by < `93i7xo2` >
###### tags: `sysprog2019`
Source: [2019q3 第 4 週測驗題](https://hackmd.io/@sysprog/B1keOS-dB)
### 測驗 `1`
考慮一個單向 linked list:
```cpp
typedef struct __list {
    int data;
    struct __list *next;
} list;
```
在不存在環狀結構的狀況下,以下函式能夠對 linked list 元素從小到大排序:
```cpp
list *sort(list *start) {
    if (!start || !start->next)
        return start;
    list *left = start;
    list *right = left->next;
    LL0;
    left = sort(left);
    right = sort(right);
    for (list *merge = NULL; left || right; ) {
        if (!right || (left && left->data < right->data)) {
            if (!merge) {
                LL1;
            } else {
                LL2;
                merge = merge->next;
            }
            LL3;
        } else {
            if (!merge) {
                LL4;
            } else {
                LL5;
                merge = merge->next;
            }
            LL6;
        }
    }
    return start;
}
```
請補完程式碼。
:::success
延伸問題:
- [x] 解釋上述程式運作原理;
- [x] 指出程式改進空間,特別是考慮到 [Optimizing merge sort](https://en.wikipedia.org/wiki/Merge_sort#Optimizing_merge_sort);
- [x] 將上述 singly-linked list 擴充為 circular doubly-linked list 並重新實作對應的 `sort`;
- [ ] 依循 Linux 核心 [include/linux/list.h](https://github.com/torvalds/linux/blob/master/include/linux/list.h) 程式碼的方式,改寫上述排序程式;
- [ ] 嘗試將原本遞迴的程式改寫為 iterative 版本;
- [ ] 將 [2019q1 第 3 週測驗題 (下)](https://hackmd.io/@sysprog/SkrVSKiU4) 提及的 XOR linked list 列入考量,也實作對應的排序程式,並比較效能和記憶體開銷;
- [ ] 比照 [List Sort](https://arxiv.org/pdf/1310.7890.pdf) 的手法來分析前述程式 (含擴充版本) 的效能表現;
:::
#### 實驗環境
```bash!
~$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       39 bits physical, 48 bits virtual
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               94
Model name:          Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Stepping:            3
CPU MHz:             800.012
CPU max MHz:         4000.0000
CPU min MHz:         800.0000
BogoMIPS:            6816.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
```
#### 1. 實作分析
完整程式碼如下
```cpp=
list *sort(list *start) {
    if (!start || !start->next)
        return start;
    list *left = start;
    list *right = left->next;
    left->next = NULL;
    left = sort(left);
    right = sort(right);
    for (list *merge = NULL; left || right; ) {
        if (!right || (left && left->data < right->data)) {
            if (!merge) {
                start = merge = left;
            } else {
                merge->next = left;
                merge = merge->next;
            }
            left = left->next;
        } else {
            if (!merge) {
                start = merge = right;
            } else {
                merge->next = right;
                merge = merge->next;
            }
            right = right->next;
        }
    }
    return start;
}
```
程式碼已經明示是用 merge sort,merge sort 分為兩部份
- Divide
- Conquer
**Divide**
   - 將長度為 n 的 list 切成 left, right 長度分別為 1, n-1 的 list
        > LL0 為 `left->next=NULL`,left才是獨立的list
 
**Conquer**
   - 合併經過排序的list,用迴圈逐一循序從 left, right 最前面挑選兩者之中最小的元素(`__list`)放在以 `merge` 為首的 list。有兩種選擇:
        - 從 left 挑選
        - 從 right 挑選
        從 left 挑選一種情況是 right 為 `NULL` 只得從 left 挑選 (迴圈內不允許left和right同時為NULL),另一種情況是 right, left 皆非 NULL 且 `left->data < right->data` ,反之則從 right 挑選。是 `sort()` 最重要的部份。
        > LL1 = `start = merge = left;`
        > LL2 = `merge->next = left;`
        > LL4 = `start = merge = right;`
        > LL5 = `merge->next = right;`
        因為 `merge` 會隨著元素加入而更新,所以先用 start 記下 list 的起始點
        加入後,也把原本 left, right 更新一下:
        > LL3 = `left = left->next;`
        > LL6 = `right = right->next;`
#### 2. 指出程式改進空間,特別是考慮到 [Optimizing merge sort](https://en.wikipedia.org/wiki/Merge_sort#Optimizing_merge_sort);
用長度為`n`的 `list* l`,分析 `sort(l)` 的時間複雜度,得到 $O(n^2)$ 的結果
$\begin{split}
T(n) &= 
T(1)+T(n-1)+cn, where\ c\ is\ a\ constant \\
&= 2\cdot{T(1)} + T(n-2) + c\cdot(n+(n-1)) \\
&= (n-1)\cdot{T(1)} + T(1) + c\cdot(n+(n-1)+(n-2)+...+2) \\
&= O(n^2)
\end{split}$
以下改良 [kaeteyaruyo](https://hackmd.io/@kaeteyaruyo/2019q3_homework4) 提出的 merge sort,將子問題切成平均的大小減少遞迴深度。也引入 [Fundamental Sorting Algorithms](https://victoramelkin.com/ta/cs32/pa/2/pa2.html) 提到 tiled merge sort 此種 cache-aware 演算法的精神,對於排序大量資料而言, quick sort 和 merge sort 勝過 insertion sort ( $O(n\cdot{log(n)}) < O(n^2)$ ),對於小量資料則取決於運行的機器。因此在 merge sort 將問題切成 L1 cache size 後變不再往下遞迴,直接用 insertion sort 做排序。
```cpp!
list *sort(list *start, unsigned size) {
    if (!start || !start->next)
        return start;
    /* If list size is smaller than L1 cache size (32k), do insertion sort. */
    if (!(sizeof(list) * size >> 15)) {
        list *current = start;
        list *sorted = NULL; // To store head of sorted list.
        while (current != NULL) {
            list *next = current->next;
            sortedInsert(&sorted, current);
            current = next;
        }
        return sorted;
    }
    /* Otherwise, do merge sort. */
    ...
}
```
在用小量資料測試原始版本的 `sort()` 也有前幾次測得時間稍慢的而後穩定的現象:
```
$ make test
Elapsed time
------------
0.0087318420
0.0051324368
0.0033595562
0.0033812523
0.0033354759
0.0033311844
0.0028159618
0.0023837090
0.0020971298
0.0018613338
0.0017888546
0.0017638206
0.0017621517
0.0017526150
0.0017971992
```
依照講師的意見,在 `sort()` 前先從頭走訪所有節點10000次再進行測量(1000次不可行),所測得時間趨於穩定:
```
$ make test
Elapsed time
------------
0.0018594265
0.0017547607
0.0018105507
0.0018079281
0.0017924309
0.0017812252
0.0017433167
0.0017609596
0.0017414093
```
對應的走訪程式碼如下:
```cpp
void print(list *head, int enable) {
    if (!head) return;
    if (enable){
        while (head) {
            printf(" %d", head->data);
            if ((head = head->next)) printf("->");
            else printf("\n");
        }
        return;
    }
    while (head) {
        head = head->next;
    }
}
int main(){
    ...
    // traversaling the list
    for (int i = 0; i < 1000; ++i) {
        print(l, 0);
    }
    double t1, t2;
    t1 = tvgetf();
    l = sort(l);
    t2 = tvgetf();
    printf("%.10f\n", t2 - t1); // sec
    ...
}
```
在此基礎上,將原始版本與改良過的版本用 $10^5$ ~ $10^2$ 不同數量級的亂數群各測試10次。 `sizeof(list)=16`,為了凸顯改良後的 merge sort 於 linked list 大小超過 L1 dcache (32kB) 有所改善,測試的亂數群的大小要超過8192個。(`32*1024/4=8192`)。
程式會先產生測試資料
```bash!
dd if=/dev/urandom of=input.dat bs=$((size*4)) count=1
```
接著從 input.dat 讀取 20000 個 `int32_t` 的整數,執行排序。比較下列各版本的效能:
- [原始版本](https://github.com/93i7xo2/sysprog2019q3-quiz4/blob/master/merge-original.c)
- [kaeteyaruyo 版本](https://github.com/93i7xo2/sysprog2019q3-quiz4/blob/master/merge-kaeteyaruyo.c)
- [kaeteyaruyo 版本 + cache-aware method (optimize)](https://github.com/93i7xo2/sysprog2019q3-quiz4/blob/master/merge-optimize.c)
對應的[程式碼](https://github.com/93i7xo2/sysprog2019q3-quiz4)和測試腳本使用方法如下:
```bash
# Be patient, it will take a long time.
~$ make
~$ make perf
~$ make plot
~$ eog output.png
```
測試結果:

改良後的版本比 kaeteyaruyo 的還差
> 待用 perf 分析
:::warning
linked list 個別元素的存取成本要和 cache-aware merge sort 一同考慮
:notes: jserv
:::
#### 3. 將上述 singly-linked list 擴充為 circular doubly-linked list 並重新實作對應的 sort
合併迴圈終止條件不變,但 circular doubly-linked list 不會出現指向 `NULL` 的情況,因此在每輪取出 `left/right` 首項元素前檢查 list ,若取出後為空則在最後更新 `left/right` 為 `NULL` ,方便下一輪的檢查。
```cpp!=
list *_left = NULL;
if (left->next != left){
    /*
     * If 'left' consists of more than one node,
     * make the remaining part as a new circular doubly-linked list.
     * Otherwise, set '_left' as NULL.
     */
    left->next->prev = left->prev;
    left->prev->next = left->next;
    /* Store the new head. */
    _left = left->next;
}
...
left = _left;
```
circular doubly-linked list 的指標操作,容易寫錯造成 infinte loop 或 segmentation fault 。
```cpp!=
/* Insert a node at the end of 'merge'. */
if (!merge) {
    start = merge = left;
    left->prev = left->next = left;
} else {
    left->prev = merge;
    left->next = merge->next;
    merge->next->prev= left;
    merge->next = left;
    merge = merge->next;
}
```
[完整程式碼](https://github.com/93i7xo2/sysprog2019q3-quiz4/blob/master/merge-circular.c):
```cpp!=
list *sort(list *start) {
    /* If list is empty or consists of only one node. */
    if (!start || start == start->next)
        return start;
    /* Divide list into two parts. */
    list *left = start;
    list *right = left->next;
    /* For right list. */
    right->prev = left->prev;
    left->prev->next = right;
    /* For left list. */
    left->prev = left;
    left->next = left;
    left = sort(left);
    right = sort(right);
    for (list *merge = NULL; left || right;) {
        /* Use 'merge' to record the last element in list. */
        if (!right || (left && left->data < right->data)) {
            /* A head of a remaining part. */
            list *_left = NULL;
            if (left->next != left){
                /*
                 * If 'left' consists of more than one node,
                 * make the remaining part as a new circular doubly-linked list.
                 * Otherwise, set '_left' as NULL.
                 */
                left->next->prev = left->prev;
                left->prev->next = left->next;
                /* Store the new head. */
                _left = left->next;
            }
            /* Insert a node at the end of 'merge'. */
            if (!merge) {
                start = merge = left;
                left->prev = left->next = left;
            } else {
                left->prev = merge;
                left->next = merge->next;
                merge->next->prev= left;
                merge->next = left;
                merge = merge->next;
            }
            left = _left;
        } else {
            list *_right = NULL;
            if (right->next != right){
                right->next->prev = right->prev;
                right->prev->next = right->next;
                _right = right->next;
            }
            if (!merge) {
                start = merge = right;
                right->prev = right->next = right;
            } else {
                right->prev = merge;
                right->next = merge->next;
                merge->next->prev= right;
                merge->next = right;
                merge = merge->next;
            }
            right = _right;
        }
    }
    return start;
}
```
以 100/1000/10000 不同數量的資料實測 10 次和原始版本比較:

#### 4. 依循 Linux 核心 include/linux/list.h 程式碼的方式,改寫上述排序程式
#### 5. 嘗試將原本遞迴的程式改寫為 iterative 版本
#### 6. 將 2019q1 第 3 週測驗題 (下) 提及的 XOR linked list 列入考量,也實作對應的排序程式,並比較效能和記憶體開銷
#### 7. 比照 List Sort 的手法來分析前述程式 (含擴充版本) 的效能表現
### Reference
- [使用 perf_events 分析程式效能](https://zh-blog.logan.tw/2019/07/10/analyze-program-performance-with-perf-events/)
- [簡介 perf_events 與 Call Graph](https://zh-blog.logan.tw/2019/10/06/intro-to-perf-events-and-call-graph/)
- [簡單學 makefile:makefile 介紹與範例程式 ](https://mropengate.blogspot.com/2018/01/makefile.html)
- [Calculate column average using bash shell ](https://linuxconfig.org/calculate-column-average-using-bash-shell)
- [How do I use shell variables in an awk script?](https://stackoverflow.com/questions/19075671/how-do-i-use-shell-variables-in-an-awk-script)
- [Insert external variable in a AWK pattern](https://www.unix.com/shell-programming-and-scripting/123678-insert-external-variable-awk-pattern.html)
- [An Awk Primer/Awk Command-Line Examples](https://en.wikibooks.org/wiki/An_Awk_Primer/Awk_Command-Line_Examples)
- [A large collection of Gnuplot examples](https://alvinalexander.com/technology/gnuplot-charts-graphs-examples)