# 2019q3 Homework4 (quiz4)
contributed by < `93i7xo2` >
###### tags: `sysprog2019`
Source: [2019q3 第 4 週測驗題](https://hackmd.io/@sysprog/B1keOS-dB)
### 測驗 `1`
考慮一個單向 linked list:
```cpp
typedef struct __list {
int data;
struct __list *next;
} list;
```
在不存在環狀結構的狀況下,以下函式能夠對 linked list 元素從小到大排序:
```cpp
list *sort(list *start) {
if (!start || !start->next)
return start;
list *left = start;
list *right = left->next;
LL0;
left = sort(left);
right = sort(right);
for (list *merge = NULL; left || right; ) {
if (!right || (left && left->data < right->data)) {
if (!merge) {
LL1;
} else {
LL2;
merge = merge->next;
}
LL3;
} else {
if (!merge) {
LL4;
} else {
LL5;
merge = merge->next;
}
LL6;
}
}
return start;
}
```
請補完程式碼。
:::success
延伸問題:
- [x] 解釋上述程式運作原理;
- [x] 指出程式改進空間,特別是考慮到 [Optimizing merge sort](https://en.wikipedia.org/wiki/Merge_sort#Optimizing_merge_sort);
- [x] 將上述 singly-linked list 擴充為 circular doubly-linked list 並重新實作對應的 `sort`;
- [ ] 依循 Linux 核心 [include/linux/list.h](https://github.com/torvalds/linux/blob/master/include/linux/list.h) 程式碼的方式,改寫上述排序程式;
- [ ] 嘗試將原本遞迴的程式改寫為 iterative 版本;
- [ ] 將 [2019q1 第 3 週測驗題 (下)](https://hackmd.io/@sysprog/SkrVSKiU4) 提及的 XOR linked list 列入考量,也實作對應的排序程式,並比較效能和記憶體開銷;
- [ ] 比照 [List Sort](https://arxiv.org/pdf/1310.7890.pdf) 的手法來分析前述程式 (含擴充版本) 的效能表現;
:::
#### 實驗環境
```bash!
~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Stepping: 3
CPU MHz: 800.012
CPU max MHz: 4000.0000
CPU min MHz: 800.0000
BogoMIPS: 6816.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
```
#### 1. 實作分析
完整程式碼如下
```cpp=
list *sort(list *start) {
if (!start || !start->next)
return start;
list *left = start;
list *right = left->next;
left->next = NULL;
left = sort(left);
right = sort(right);
for (list *merge = NULL; left || right; ) {
if (!right || (left && left->data < right->data)) {
if (!merge) {
start = merge = left;
} else {
merge->next = left;
merge = merge->next;
}
left = left->next;
} else {
if (!merge) {
start = merge = right;
} else {
merge->next = right;
merge = merge->next;
}
right = right->next;
}
}
return start;
}
```
程式碼已經明示是用 merge sort,merge sort 分為兩部份
- Divide
- Conquer
**Divide**
- 將長度為 n 的 list 切成 left, right 長度分別為 1, n-1 的 list
> LL0 為 `left->next=NULL`,left才是獨立的list
**Conquer**
- 合併經過排序的list,用迴圈逐一循序從 left, right 最前面挑選兩者之中最小的元素(`__list`)放在以 `merge` 為首的 list。有兩種選擇:
- 從 left 挑選
- 從 right 挑選
從 left 挑選一種情況是 right 為 `NULL` 只得從 left 挑選 (迴圈內不允許left和right同時為NULL),另一種情況是 right, left 皆非 NULL 且 `left->data < right->data` ,反之則從 right 挑選。是 `sort()` 最重要的部份。
> LL1 = `start = merge = left;`
> LL2 = `merge->next = left;`
> LL4 = `start = merge = right;`
> LL5 = `merge->next = right;`
因為 `merge` 會隨著元素加入而更新,所以先用 start 記下 list 的起始點
加入後,也把原本 left, right 更新一下:
> LL3 = `left = left->next;`
> LL6 = `right = right->next;`
#### 2. 指出程式改進空間,特別是考慮到 [Optimizing merge sort](https://en.wikipedia.org/wiki/Merge_sort#Optimizing_merge_sort);
用長度為`n`的 `list* l`,分析 `sort(l)` 的時間複雜度,得到 $O(n^2)$ 的結果
$\begin{split}
T(n) &=
T(1)+T(n-1)+cn, where\ c\ is\ a\ constant \\
&= 2\cdot{T(1)} + T(n-2) + c\cdot(n+(n-1)) \\
&= (n-1)\cdot{T(1)} + T(1) + c\cdot(n+(n-1)+(n-2)+...+2) \\
&= O(n^2)
\end{split}$
以下改良 [kaeteyaruyo](https://hackmd.io/@kaeteyaruyo/2019q3_homework4) 提出的 merge sort,將子問題切成平均的大小減少遞迴深度。也引入 [Fundamental Sorting Algorithms](https://victoramelkin.com/ta/cs32/pa/2/pa2.html) 提到 tiled merge sort 此種 cache-aware 演算法的精神,對於排序大量資料而言, quick sort 和 merge sort 勝過 insertion sort ( $O(n\cdot{log(n)}) < O(n^2)$ ),對於小量資料則取決於運行的機器。因此在 merge sort 將問題切成 L1 cache size 後變不再往下遞迴,直接用 insertion sort 做排序。
```cpp!
list *sort(list *start, unsigned size) {
if (!start || !start->next)
return start;
/* If list size is smaller than L1 cache size (32k), do insertion sort. */
if (!(sizeof(list) * size >> 15)) {
list *current = start;
list *sorted = NULL; // To store head of sorted list.
while (current != NULL) {
list *next = current->next;
sortedInsert(&sorted, current);
current = next;
}
return sorted;
}
/* Otherwise, do merge sort. */
...
}
```
在用小量資料測試原始版本的 `sort()` 也有前幾次測得時間稍慢的而後穩定的現象:
```
$ make test
Elapsed time
------------
0.0087318420
0.0051324368
0.0033595562
0.0033812523
0.0033354759
0.0033311844
0.0028159618
0.0023837090
0.0020971298
0.0018613338
0.0017888546
0.0017638206
0.0017621517
0.0017526150
0.0017971992
```
依照講師的意見,在 `sort()` 前先從頭走訪所有節點10000次再進行測量(1000次不可行),所測得時間趨於穩定:
```
$ make test
Elapsed time
------------
0.0018594265
0.0017547607
0.0018105507
0.0018079281
0.0017924309
0.0017812252
0.0017433167
0.0017609596
0.0017414093
```
對應的走訪程式碼如下:
```cpp
void print(list *head, int enable) {
if (!head) return;
if (enable){
while (head) {
printf(" %d", head->data);
if ((head = head->next)) printf("->");
else printf("\n");
}
return;
}
while (head) {
head = head->next;
}
}
int main(){
...
// traversaling the list
for (int i = 0; i < 1000; ++i) {
print(l, 0);
}
double t1, t2;
t1 = tvgetf();
l = sort(l);
t2 = tvgetf();
printf("%.10f\n", t2 - t1); // sec
...
}
```
在此基礎上,將原始版本與改良過的版本用 $10^5$ ~ $10^2$ 不同數量級的亂數群各測試10次。 `sizeof(list)=16`,為了凸顯改良後的 merge sort 於 linked list 大小超過 L1 dcache (32kB) 有所改善,測試的亂數群的大小要超過8192個。(`32*1024/4=8192`)。
程式會先產生測試資料
```bash!
dd if=/dev/urandom of=input.dat bs=$((size*4)) count=1
```
接著從 input.dat 讀取 20000 個 `int32_t` 的整數,執行排序。比較下列各版本的效能:
- [原始版本](https://github.com/93i7xo2/sysprog2019q3-quiz4/blob/master/merge-original.c)
- [kaeteyaruyo 版本](https://github.com/93i7xo2/sysprog2019q3-quiz4/blob/master/merge-kaeteyaruyo.c)
- [kaeteyaruyo 版本 + cache-aware method (optimize)](https://github.com/93i7xo2/sysprog2019q3-quiz4/blob/master/merge-optimize.c)
對應的[程式碼](https://github.com/93i7xo2/sysprog2019q3-quiz4)和測試腳本使用方法如下:
```bash
# Be patient, it will take a long time.
~$ make
~$ make perf
~$ make plot
~$ eog output.png
```
測試結果:

改良後的版本比 kaeteyaruyo 的還差
> 待用 perf 分析
:::warning
linked list 個別元素的存取成本要和 cache-aware merge sort 一同考慮
:notes: jserv
:::
#### 3. 將上述 singly-linked list 擴充為 circular doubly-linked list 並重新實作對應的 sort
合併迴圈終止條件不變,但 circular doubly-linked list 不會出現指向 `NULL` 的情況,因此在每輪取出 `left/right` 首項元素前檢查 list ,若取出後為空則在最後更新 `left/right` 為 `NULL` ,方便下一輪的檢查。
```cpp!=
list *_left = NULL;
if (left->next != left){
/*
* If 'left' consists of more than one node,
* make the remaining part as a new circular doubly-linked list.
* Otherwise, set '_left' as NULL.
*/
left->next->prev = left->prev;
left->prev->next = left->next;
/* Store the new head. */
_left = left->next;
}
...
left = _left;
```
circular doubly-linked list 的指標操作,容易寫錯造成 infinte loop 或 segmentation fault 。
```cpp!=
/* Insert a node at the end of 'merge'. */
if (!merge) {
start = merge = left;
left->prev = left->next = left;
} else {
left->prev = merge;
left->next = merge->next;
merge->next->prev= left;
merge->next = left;
merge = merge->next;
}
```
[完整程式碼](https://github.com/93i7xo2/sysprog2019q3-quiz4/blob/master/merge-circular.c):
```cpp!=
list *sort(list *start) {
/* If list is empty or consists of only one node. */
if (!start || start == start->next)
return start;
/* Divide list into two parts. */
list *left = start;
list *right = left->next;
/* For right list. */
right->prev = left->prev;
left->prev->next = right;
/* For left list. */
left->prev = left;
left->next = left;
left = sort(left);
right = sort(right);
for (list *merge = NULL; left || right;) {
/* Use 'merge' to record the last element in list. */
if (!right || (left && left->data < right->data)) {
/* A head of a remaining part. */
list *_left = NULL;
if (left->next != left){
/*
* If 'left' consists of more than one node,
* make the remaining part as a new circular doubly-linked list.
* Otherwise, set '_left' as NULL.
*/
left->next->prev = left->prev;
left->prev->next = left->next;
/* Store the new head. */
_left = left->next;
}
/* Insert a node at the end of 'merge'. */
if (!merge) {
start = merge = left;
left->prev = left->next = left;
} else {
left->prev = merge;
left->next = merge->next;
merge->next->prev= left;
merge->next = left;
merge = merge->next;
}
left = _left;
} else {
list *_right = NULL;
if (right->next != right){
right->next->prev = right->prev;
right->prev->next = right->next;
_right = right->next;
}
if (!merge) {
start = merge = right;
right->prev = right->next = right;
} else {
right->prev = merge;
right->next = merge->next;
merge->next->prev= right;
merge->next = right;
merge = merge->next;
}
right = _right;
}
}
return start;
}
```
以 100/1000/10000 不同數量的資料實測 10 次和原始版本比較:

#### 4. 依循 Linux 核心 include/linux/list.h 程式碼的方式,改寫上述排序程式
#### 5. 嘗試將原本遞迴的程式改寫為 iterative 版本
#### 6. 將 2019q1 第 3 週測驗題 (下) 提及的 XOR linked list 列入考量,也實作對應的排序程式,並比較效能和記憶體開銷
#### 7. 比照 List Sort 的手法來分析前述程式 (含擴充版本) 的效能表現
### Reference
- [使用 perf_events 分析程式效能](https://zh-blog.logan.tw/2019/07/10/analyze-program-performance-with-perf-events/)
- [簡介 perf_events 與 Call Graph](https://zh-blog.logan.tw/2019/10/06/intro-to-perf-events-and-call-graph/)
- [簡單學 makefile:makefile 介紹與範例程式 ](https://mropengate.blogspot.com/2018/01/makefile.html)
- [Calculate column average using bash shell ](https://linuxconfig.org/calculate-column-average-using-bash-shell)
- [How do I use shell variables in an awk script?](https://stackoverflow.com/questions/19075671/how-do-i-use-shell-variables-in-an-awk-script)
- [Insert external variable in a AWK pattern](https://www.unix.com/shell-programming-and-scripting/123678-insert-external-variable-awk-pattern.html)
- [An Awk Primer/Awk Command-Line Examples](https://en.wikibooks.org/wiki/An_Awk_Primer/Awk_Command-Line_Examples)
- [A large collection of Gnuplot examples](https://alvinalexander.com/technology/gnuplot-charts-graphs-examples)