We first observed Optimized QuickSort — C Implementation (Non-Recursive), we could found that it use while
loop to represent recursive part in original quick sort, and then divide to two parts:
if
is the split function in quicksortelse
is the merge function in quicksortIt also use beg
and end
to record the range. The variable i
is used to simulate the behavior of stack, and it will change the range of beg
and end
.
The content of initializaed array
The first times meet the condition in while loop
The second times meet the condition in while loop
The third times meet the condition in while loop
We can observe from above that the array is divided into subarry constantly, and choose the rightmost array to process first. That is, it always choose the maximum value of unsorted array each time, and place it into right side. Then it can make the array in right side meet the characteristics of Loop invariant. We can imagine that it divide array into two parts, and then divide into two parts until the end, then traverse from the rightmost subtree. In each time it encounter the leaft, this number in the leaf is the value that need to be sort at this time.
We can use the above concept into linked list: we search into right of the tree, using variable i
to simluate the operation of stack. When we found the value that need to be sorted, then insert to the front of the final list through list_add_node_t
. To concise my code, We also consider pivot in the loop, so the tree should be ternary tree.
Performance Analysis
Input 2000 nodes linked list
Compile with -O0
:
We can found that the average performance of non recursive version is higher than recursive version.
Compile with -O3
:
The difference has been reduced. It can prove that the compiler has good optimazation for recursive function.
We can improve by implementing introsort
:
When the recurvise times exceed max_level
, used tree sort instead.
The implementation of tree sort is based on red-black tree, so it have time complexity in the worst case. We fetch the code from rv32emu-next/c_map.h and rv32emu-next/c_map.c, and create a new field in node_t
.
tree sort will push node from linked list into red-black tree, and fetch using inorder traversal.
test code
max_level | insert | time(ns) |
---|---|---|
16 | 21 | 43538009 |
32 | 21 | 38184258 |
64 | 21 | 49350176 |
128 | 21 | 49877200 |
max_level
decide when to use tree sortinsert
decide when to use insertion sortIn the output, we can see that the optimal max_level
is 32, which is around . It is also the same result from the paper Introspective Sorting and Selection Algorithms.
Excerpt from Introspective Sorting and Selection Algorithms page 5:
depth limit is , since this value empirically produces good results.
We compare sorting algorithm with 100000 unique number
We can know that the performance of tree sort is better than other algorithms. This is the same result of A Comparative Study of Linked List Sorting Algorithms. The problem of quick sort might be the recursion, making its runtime higher than tree sort.
We also compare with linux/lib/list_sort.c, testing on 100000 random unique number
Try to use perf to analyse tree sort and list_sort
list_sort
tree_sort
in the comment of list_sort.c
Thus, it will avoid cache thrashing as long as elements can fit into the cache. Not quite as good as a fully-eager bottom-up mergesort, but it does use fewer comparisons, so is faster in the common case that everything fits into L1.
Therefore, the design of list_sort is cache-friendly. We can see that the metrics of list_sort is lower than tree sort, which also shows that why linux kernel use merge sort instead of tree sort.
The performance bottleneck of tree sort might come from the red-black tree. Throughout perf analysis, the highest value of L1-cache-misses and branch-misses is from obj->comparator(&node->value, &cur->value)
in c_map_insert
; the highest value of cache-misses comes from cur->left = node
and cur->right = node
, which is also inside c_map_insert
. As a result, we can speculate that the performance bottlenect might because of the cost from building red-black tree. It needs to traverse and compare the node inside the tree, causing runtime to increase.