# Linked List Sort

### Optimized QuickSort

> [GitHub](https://github.com/hankluo6/quicksort-ll)

```cpp
// quickSort
//
// This public-domain C implementation by Darel Rex Finley.
//
// * Returns YES if sort was successful, or NO if the nested
//   pivots went too deep, in which case your array will have
//   been re-ordered, but probably not sorted correctly.
//
// * This function assumes it is called with valid parameters.
//
// * Example calls:
//     quickSort(&myArray[0],5); // sorts elements 0, 1, 2, 3, and 4
//     quickSort(&myArray[3],5); // sorts elements 3, 4, 5, 6, and 7

bool quickSort(int *arr, int elements)
{
#define MAX_LEVELS 1000
    int piv, beg[MAX_LEVELS], end[MAX_LEVELS], i = 0, L, R;

    beg[0] = 0;
    end[0] = elements;
    while (i >= 0) {
        L = beg[i];
        R = end[i] - 1;
        if (L < R) {
            piv = arr[L];
            if (i == MAX_LEVELS - 1)
                return NO;
            while (L < R) {
                while (arr[R] >= piv && L < R)
                    R--;
                if (L < R)
                    arr[L++] = arr[R];
                while (arr[L] <= piv && L < R)
                    L++;
                if (L < R)
                    arr[R--] = arr[L];
            }
            arr[L] = piv;
            beg[i + 1] = L + 1;
            end[i + 1] = end[i];
            end[i++] = L;
        } else {
            i--;
        }
    }
    return YES;
}
```

(`YES` and `NO` simply stand for `true` and `false`; they need to be defined for the snippet to compile on its own.)

We first studied [Optimized QuickSort — C Implementation (Non-Recursive)](https://alienryderflex.com/quicksort/). It replaces the recursion of the original quicksort with a `while` loop whose body divides into two parts:

* the code in the `if` branch performs the partition (split) step of quicksort
* the code in the `else` branch corresponds to returning from a recursive call: it pops the simulated stack

The arrays `beg` and `end` record the boundaries of each sub-range, and the variable `i` acts as the stack pointer that selects which `beg`/`end` entries are currently being processed.

* The content of the initial array (the extra blank cell on the right represents index 7, one past the end, so the `end` pointers in the next figures have somewhere to point)

```graphviz
digraph {
    node [color=black shape=none margin=0 height=.3]
    values [label=<
        <table border="0" cellborder="1" cellspacing="0">
        <tr>
            <td port="f0" bgcolor="lightyellow">14</td>
            <td port="f1" bgcolor="lightyellow">22</td>
            <td port="f2" bgcolor="lightyellow">10</td>
            <td port="f3" bgcolor="lightyellow">15</td>
            <td port="f4" bgcolor="lightyellow">31</td>
            <td port="f5" bgcolor="lightyellow">30</td>
            <td port="f6" bgcolor="lightyellow">12</td>
            <td port="f7" bgcolor="lightyellow"><font color="white">00</font></td>
        </tr>
        </table>
    >];
}
```

* After the first partition (performed at `i = 0`)

```graphviz
digraph {
    node [shape=plaintext, fontsize=18];
    "i=0" [color=white];
    node [shape=plaintext, fontcolor=g, fontsize=18];
    "invis" [style=invis];
    node [color=black shape=none margin=0 height=.3]
    values [label=<
        <table border="0" cellborder="1" cellspacing="0">
        <tr>
            <td port="f0" bgcolor="gray">12</td>
            <td port="f1" bgcolor="gray">10</td>
            <td port="f2" bgcolor="gray">14</td>
            <td port="f3" bgcolor="lightyellow">15</td>
            <td port="f4" bgcolor="lightyellow">31</td>
            <td port="f5" bgcolor="lightyellow">30</td>
            <td port="f6" bgcolor="lightyellow">22</td>
            <td port="f7" bgcolor="lightyellow"><font color="white">00</font></td>
        </tr>
        </table>
    >];
    "beg[0]"->values:f0
    "beg[1]"->values:f3
    "end[0]"->values:f2
    "end[1]"->values:f7
    "i=0"->"invis" [style=invis]
    { rank=same; "i=0"; values }
    { rank=same; "invis"; "end[0]"; "end[1]" }
}
```

* After the second partition (performed at `i = 1`)

```graphviz
digraph {
    node [shape=plaintext, fontcolor=g, fontsize=18];
    "i=1" [color=white];
    node [shape=plaintext, fontcolor=g, fontsize=18];
    "invis" [style=invis];
    node [color=black shape=none margin=0 height=.3]
    values [label=<
        <table border="0" cellborder="1" cellspacing="0">
        <tr>
            <td port="f0" bgcolor="gray">12</td>
            <td port="f1" bgcolor="gray">10</td>
            <td port="f2" bgcolor="gray">14</td>
            <td port="f3" bgcolor="gray">15</td>
            <td port="f4" bgcolor="lightyellow">31</td>
            <td port="f5" bgcolor="lightyellow">30</td>
            <td port="f6" bgcolor="lightyellow">22</td>
            <td port="f7" bgcolor="lightyellow"><font color="white">00</font></td>
        </tr>
        </table>
    >];
    "beg[0]"->values:f0
    "beg[1]"->values:f3
    "beg[2]"->values:f4
    "end[0]"->values:f2
    "end[1]"->values:f3
    "end[2]"->values:f7
    "i=1"->"invis" [style=invis]
    { rank=same; "i=1"; values }
    { rank=same; "invis"; "end[0]"; "end[1]"; "end[2]"; }
}
```

* After the third partition (performed at `i = 2`)

```graphviz
digraph {
    node [shape=plaintext, fontcolor=g, fontsize=18];
    "i=2" [color=white];
    node [shape=plaintext, fontcolor=g, fontsize=18];
    "invis" [style=invis];
    node [color=black shape=none margin=0 height=.3]
    values [label=<
        <table border="0" cellborder="1" cellspacing="0">
        <tr>
            <td port="f0" bgcolor="gray">12</td>
            <td port="f1" bgcolor="gray">10</td>
            <td port="f2" bgcolor="gray">14</td>
            <td port="f3" bgcolor="gray">15</td>
            <td port="f4" bgcolor="gray">22</td>
            <td port="f5" bgcolor="gray">30</td>
            <td port="f6" bgcolor="lightyellow">31</td>
            <td port="f7" bgcolor="lightyellow"><font color="white">00</font></td>
        </tr>
        </table>
    >];
    "beg[0]"->values:f0
    "beg[1]"->values:f3
    "beg[2]"->values:f4
    "beg[3]"->values:f7
    "end[0]"->values:f2
    "end[1]"->values:f3
    "end[2]"->values:f6
    "end[3]"->values:f7
    "i=2"->"invis" [style=invis]
    { rank=same; "i=2"; values }
    { rank=same; "invis"; "end[0]"; "end[1]"; "end[2]"; "end[3]" }
}
```

We can observe that the array keeps being split into subarrays and that the rightmost subarray is always processed first. In other words, the largest values still unsorted are settled first and accumulate at the right end, so "everything to the right of the current range is already in its final position" holds as a [loop invariant](https://en.wikipedia.org/wiki/Loop_invariant) of the outer loop.

We can picture this as a recursion tree: the array keeps splitting into two parts until it can split no further, and the traversal always descends into the rightmost subtree first. Each time a leaf is reached, the value in that leaf is the next one to be placed into its final position.

The same idea carries over to a linked list: we descend into the right side of the tree, using the variable `i` to simulate the stack, and whenever the next value to be placed is found, we insert it at the front of the result list with `list_add_node_t`. To keep the code concise, the pivot is also pushed as its own one-node range, so the tree is effectively a ternary tree (left part, pivot, right part).

```cpp
void quicksort_nonrecursive(node_t **list)
{
    int value;
    int i = 0;
    int max_level = 100;
    node_t *node_beg[max_level], *node_end[max_level];
    node_t *result = NULL, *left = NULL, *right = NULL;
    node_t *pivot, *p, *L, *R;

    node_beg[0] = *list;
    node_end[0] = get_list_tail(list);

    while (i >= 0) {
        L = node_beg[i];
        R = node_end[i];
        if (L != R) {
            /* Partition the current range around its first node. */
            pivot = L;
            value = pivot->value;
            p = pivot->next;
            pivot->next = NULL;
            while (p) {
                node_t *n = p;
                p = p->next;
                list_add_node_t(n->value > value ? &right : &left, n);
            }
            /* Push left part, pivot, and right part; process the right part first. */
            node_beg[i] = left;
            node_end[i] = get_list_tail(&left);
            node_beg[i + 1] = pivot;
            node_end[i + 1] = pivot;
            node_beg[i + 2] = right;
            node_end[i + 2] = get_list_tail(&right);

            left = NULL;
            right = NULL;
            i += 2;
        } else {
            /* Single node (or empty range): prepend it to the sorted result. */
            if (L)
                list_add_node_t(&result, L);
            i--;
        }
    }
    *list = result;
}
```
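The function above, and the later snippets, rely on a few list helpers from the repository (`list_add_node_t`, `get_list_tail`, `list_concat`). Their authoritative definitions are in the linked repository; a minimal sketch consistent with how they are used here might look like this (the `node_t` shown is trimmed down, the full definition with the red-black-tree fields appears later):

```c
#include <stddef.h>

/* Trimmed-down node type for illustration; the repository's full
 * definition (shown later) also carries red-black-tree fields. */
typedef struct __node {
    int value;
    struct __node *next;
} node_t;

/* Push a node onto the front of a list. */
void list_add_node_t(node_t **list, node_t *node)
{
    node->next = *list;
    *list = node;
}

/* Return the last node of a list, or NULL if the list is empty. */
node_t *get_list_tail(node_t **list)
{
    node_t *node = *list;
    if (!node)
        return NULL;
    while (node->next)
        node = node->next;
    return node;
}

/* Append the list `right` after the last node of `*left`. */
void list_concat(node_t **left, node_t *right)
{
    while (*left)
        left = &(*left)->next;
    *left = right;
}
```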
* Performance analysis (input: a 2000-node linked list)

* Compiled with `-O0`:

  ![](https://i.imgur.com/aV5qniU.png)

  On average, the non-recursive version runs faster than the recursive version.

* Compiled with `-O3`:

  ![](https://i.imgur.com/jzHe6A6.png)

  The gap shrinks considerably, which suggests the compiler already optimizes the recursive version well.

We can improve further by implementing an `introsort`:

```c
void insert_sorted(node_t *entry, node_t **list)
{
    while (*list && (*list)->value < entry->value)
        list = &(*list)->next;
    entry->next = *list;
    *list = entry;
}

void insertsort(node_t **list)
{
    node_t *sorted = NULL;
    node_t *cur = *list;
    while (cur) {
        node_t *node = cur;
        cur = cur->next;
        insert_sorted(node, &sorted);
    }
    *list = sorted;
}
```

* Record the lengths of the left and right lists while partitioning

```c
while (p) {
    node_t *n = p;
    p = p->next;
    if (n->value > value) {
        list_add_node_t(&right, n);
        r++;
    } else {
        list_add_node_t(&left, n);
        l++;
    }
}
```

* When a partition has fewer than 20 nodes, use insertion sort instead of quicksort

```c
if (l < 20)
    insertsort(&left);
else
    quicksort_recursion(&left);
if (r < 20)
    insertsort(&right);
else
    quicksort_recursion(&right);
```

* When the recursion depth exceeds `max_level`, fall back to tree sort

```c
void introsort(node_t **list, int max_level, int insert)
{
    if (!*list)
        return;
    if (max_level == 0) {
        treesort(list);
        return;
    }

    node_t *pivot = *list;
    int value = pivot->value;
    int l = 0, r = 0;

    node_t *p = pivot->next;
    pivot->next = NULL;

    node_t *left = NULL, *right = NULL;
    while (p) {
        node_t *n = p;
        p = p->next;
        if (n->value > value) {
            list_add_node_t(&right, n);
            r++;
        } else {
            list_add_node_t(&left, n);
            l++;
        }
    }

    if (l < insert)
        insertsort(&left);
    else
        introsort(&left, max_level - 1, insert);
    if (r < insert)
        insertsort(&right);
    else
        introsort(&right, max_level - 1, insert);

    node_t *result = NULL;
    list_concat(&result, left);
    list_concat(&result, pivot);
    list_concat(&result, right);
    *list = result;
}
```

The tree sort implementation is based on a red-black tree, so it has $O(n \log n)$ time complexity even in the worst case. We take the code from [rv32emu-next/c_map.h](https://github.com/sysprog21/rv32emu-next/blob/master/c_map.h) and [rv32emu-next/c_map.c](https://github.com/sysprog21/rv32emu-next/blob/master/c_map.c), and add new fields to `node_t`:

```c
typedef struct __node {
    int value;
    struct __node *next;              /* for linked list */
    struct __node *left, *right, *up; /* for red-black tree */
    c_map_color_t color;              /* for red-black tree */
} node_t;
```

Tree sort pushes each node of the linked list into the red-black tree and then fetches them back with an in-order traversal; the `treesort` code is shown below.
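Before looking at `treesort`, here is one possible way to drive the `introsort` above. The `sort_list` wrapper below is hypothetical (not part of the repository): it picks the depth limit from the list length using the $2\lfloor \log_2 N\rfloor$ rule that the measurements in the next section point to, and keeps the insertion-sort cutoff of 20 used in the snippets above.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical driver: sort a list of n nodes with the introsort above.
 * The depth limit 2 * floor(log2(n)) matches the value found to be best
 * in the measurements below; 20 is the insertion-sort cutoff used above. */
void sort_list(node_t **list, size_t n)
{
    int max_level = (n > 1) ? 2 * (int) floor(log2((double) n)) : 1;
    introsort(list, max_level, 20);
}
```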
```c
void treesort(node_t **list)
{
    node_t **record = list;
    c_map_t map = c_map_new(sizeof(int), sizeof(NULL), c_map_cmp_int);

    /* Insert every list node into the red-black tree. */
    while (*list) {
        c_map_insert(map, *list, NULL);
        list = &(*list)->next;
    }

    /* Re-chain the nodes by walking the tree in order. */
    node_t *node = c_map_first(map), *first = node;
    for (; node; node = c_map_next(node)) {
        *list = node;
        list = &(*list)->next;
    }
    *list = NULL;
    *record = first;
    free(map);
}
```

### Performance Analysis

* Test code

```c
for (int max_level = 1; max_level < (1 << 15); max_level <<= 1) {
    for (int insert = 10; insert < 40; ++insert) {
        size_t times = 100;
        time = 0;
        while (times--) {
            size_t count = 100000;
            int *test_arr = malloc(sizeof(int) * count);
            for (int i = 0; i < count; ++i)
                test_arr[i] = i;
            shuffle(test_arr, count);
            while (count--)
                list1 = list_make_node_t(list1, test_arr[count]);

            clock_gettime(CLOCK_MONOTONIC, &tt1);
            introsort(&list1, max_level, insert);
            clock_gettime(CLOCK_MONOTONIC, &tt2);
            time += diff_in_ns(tt1, tt2);

            list_free(&list1);
            free(test_arr);
        }
        fprintf(fd, "%f ", time / 100.0);
    }
    fprintf(fd, "\n");
}
```

| max_level | insert | time (ns) |
|:---------:|:------:|:---------:|
| 16 | 21 | 43538009 |
| 32 | 21 | 38184258 |
| 64 | 21 | 49350176 |
| 128 | 21 | 49877200 |

* [Detailed output](https://github.com/hankluo6/quicksort-ll/blob/main/introsort_parameter.txt)
* `max_level` decides when to fall back to tree sort
* `insert` decides when to fall back to insertion sort

From the output, the optimal `max_level` is 32, which is roughly $2 \times \log_2(100000) \approx 33$. This matches the result in the paper [Introspective Sorting and Selection Algorithms](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.5196&rep=rep1&type=pdf).

> Excerpt from Introspective Sorting and Selection Algorithms, page 5:
>
> depth limit is $2\lfloor \log_2 N\rfloor$, since this value empirically produces good results.

* We compare the sorting algorithms on 100000 unique random numbers. Tree sort outperforms the other algorithms, which matches the result of [A Comparative Study of Linked List Sorting Algorithms](https://pages.mtu.edu/~shene/PUBLICATIONS/1996/3Conline.pdf). The weakness of quicksort is probably the recursion overhead, which makes its runtime higher than tree sort's.

* We also compare with [linux/lib/list_sort.c](https://github.com/torvalds/linux/blob/master/lib/list_sort.c), tested on 100000 unique random numbers

![](https://i.imgur.com/F2CPCyi.png)

We use `perf` to analyze tree sort and `list_sort`:

- `list_sort`

```
Performance counter stats for './list_sort.out':

         4,416,717      cache-misses              #    0.531 % of all cache refs      (66.65%)
       832,301,750      cache-references                                              (66.68%)
    48,222,171,264      cycles                                                        (66.69%)
    16,549,168,666      branch-instructions                                           (66.69%)
        26,319,955      branch-misses             #    0.16% of all branches          (66.67%)
       564,284,840      L1-dcache-load-misses                                         (66.63%)

      12.103174289 seconds time elapsed

      11.414579000 seconds user
       0.688155000 seconds sys
```

- `tree_sort`

```
Performance counter stats for './tree_sort.out':

     1,347,843,845      cache-misses              #   20.133 % of all cache refs      (66.65%)
     6,694,777,854      cache-references                                              (66.66%)
   226,565,049,469      cycles                                                        (66.68%)
    28,693,147,859      branch-instructions                                           (66.68%)
     1,123,005,940      branch-misses             #    3.91% of all branches          (66.67%)
     3,115,288,209      L1-dcache-load-misses                                         (66.66%)

      58.205776700 seconds time elapsed

      58.170830000 seconds user
       0.007998000 seconds sys
```

> From the [comment](https://github.com/torvalds/linux/blob/05a59d79793d482f628a31753c671f2e92178a21/lib/list_sort.c#L136) in `list_sort.c`:
>
> Thus, it will avoid cache thrashing as long as $3 \times 2^k$ elements can fit into the cache.
> Not quite as good as a fully-eager bottom-up mergesort, but it does use $0.2 \times n$ fewer comparisons, so is faster in the common case that everything fits into L1.

So the design of `list_sort` is cache-friendly. All of `list_sort`'s counters are lower than tree sort's, which also shows why the Linux kernel uses merge sort rather than tree sort.

The performance bottleneck of tree sort likely comes from the red-black tree itself. In the perf analysis, the largest share of L1-dcache-load-misses and branch-misses comes from `obj->comparator(&node->value, &cur->value)` in `c_map_insert`, and the largest share of cache-misses comes from `cur->left = node` and `cur->right = node`, also inside `c_map_insert`. We can therefore speculate that the bottleneck is the cost of building the red-black tree: every insertion has to traverse the tree and compare against existing nodes, which drives the runtime up.