
Linked List Sort

Optimized QuickSort

GitHub

//  quickSort
//
//  This public-domain C implementation by Darel Rex Finley.
//
//  * Returns YES if sort was successful, or NO if the nested
//    pivots went too deep, in which case your array will have
//    been re-ordered, but probably not sorted correctly.
//
//  * This function assumes it is called with valid parameters.
//
//  * Example calls:
//    quickSort(&myArray[0],5); // sorts elements 0, 1, 2, 3, and 4
//    quickSort(&myArray[3],5); // sorts elements 3, 4, 5, 6, and 7

bool quickSort(int *arr, int elements) {
  #define MAX_LEVELS  1000

  int piv, beg[MAX_LEVELS], end[MAX_LEVELS], i=0, L, R ;

  beg[0]=0; end[0]=elements;
  while (i>=0) {
    L=beg[i]; R=end[i]-1;
    if (L<R) {
      piv=arr[L]; if (i==MAX_LEVELS-1) return NO;
      while (L<R) {
        while (arr[R]>=piv && L<R) R--; if (L<R) arr[L++]=arr[R];
        while (arr[L]<=piv && L<R) L++; if (L<R) arr[R--]=arr[L]; }
      arr[L]=piv; beg[i+1]=L+1; end[i+1]=end[i]; end[i++]=L; }
    else {
      i--; }}
  return YES; }
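
The snippet above uses bool together with YES and NO, which standard C does not define; a minimal set of definitions that lets it compile as shown (an assumption on my part, not part of the original source):

#include <stdbool.h>

/* Assumed project-local macros: the snippet returns YES/NO, which are not
   defined by standard C. */
#define YES true
#define NO  false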

We first look at Optimized QuickSort — C Implementation (Non-Recursive). It uses a while loop to replace the recursive calls of the original quicksort, and its body divides into two parts:

  • the code in the if branch is the partition step of quicksort
  • the code in the else branch pops one level off the simulated stack

It also uses beg and end to record the ranges. The variable i simulates the behavior of a stack, indexing into beg and end.

  • The content of the initialized array

    values (cells f0–f7): 14, 22, 10, 15, 31, 30, 12, 00
  • The first time the condition in the while loop is met (i = 0)

    values (cells f0–f7): 12, 10, 14, 15, 31, 30, 22, 00
    beg[0] → f0, end[0] → f2
    beg[1] → f3, end[1] → f7
  • The second time the condition in the while loop is met (i = 1)

    values (cells f0–f7): 12, 10, 14, 15, 31, 30, 22, 00
    beg[0] → f0, end[0] → f2
    beg[1] → f3, end[1] → f3
    beg[2] → f4, end[2] → f7
  • The third time the condition in the while loop is met (i = 2)

    values (cells f0–f7): 12, 10, 14, 15, 22, 30, 31, 00
    beg[0] → f0, end[0] → f2
    beg[1] → f3, end[1] → f3
    beg[2] → f4, end[2] → f6
    beg[3] → f7, end[3] → f7

We can observe from the above that the array is repeatedly divided into subarrays, and the rightmost subarray is always processed first. That is, each time the maximum value of the unsorted part is chosen and placed on the right side, so the right side of the array satisfies the loop invariant. We can imagine the array being divided into two parts, and each part divided again until the end, then traversed from the rightmost subtree. Each time a leaf is reached, the number in that leaf is the value to be placed next in sorted order.

We can apply the above concept to a linked list: we search down the right side of the tree, using the variable i to simulate the operation of a stack. When we find the value that should be placed next, we insert it at the front of the final list through list_add_node_t. To keep the code concise, the pivot is also handled in the loop, so the tree becomes a ternary tree.

void quicksort_nonrecursive(node_t **list)
{
    int value;
    int i = 0;
    int max_level = 100;
    node_t *node_beg[max_level], *node_end[max_level];
    node_t *result = NULL, *left = NULL, *right = NULL;
    node_t *pivot, *p, *L, *R;
    
    node_beg[0] = *list;
    node_end[0] = get_list_tail(list);
    
    while (i >= 0) {
        L = node_beg[i]; 
        R = node_end[i];
        
        if (L != R) {
            pivot = L;
            value = pivot->value;
            p = pivot->next;
            pivot->next = NULL;
            
            while (p) {
                node_t *n = p;
                p = p->next;
                list_add_node_t(n->value > value ? &right : &left, n);
            }

            node_beg[i] = left;
            node_end[i] = get_list_tail(&left);
            node_beg[i + 1] = pivot;
            node_end[i + 1] = pivot;
            node_beg[i + 2] = right;
            node_end[i + 2] = get_list_tail(&right);
            
            left = NULL;
            right = NULL;
            i += 2;
        }
        else {
            if (L)
                list_add_node_t(&result, L);
            i--;
        }
    }
    *list = result;
}
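
The routine above relies on two list helpers that are not shown in this section. The sketch below is only a guess at how they might look, inferred from how they are used; the repository's actual definitions may differ. node_t is the linked-list node type (later extended with red-black-tree fields for tree sort).

/* Sketch (assumption): push node n onto the front of *list. */
void list_add_node_t(node_t **list, node_t *n)
{
    n->next = *list;
    *list = n;
}

/* Sketch (assumption): return the last node of *list, or NULL if it is empty. */
node_t *get_list_tail(node_t **list)
{
    node_t *cur = *list;
    if (!cur)
        return NULL;
    while (cur->next)
        cur = cur->next;
    return cur;
}
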
  • Performance Analysis

    Input: a linked list of 2000 nodes

    • Compile with -O0


      We can see that, on average, the non-recursive version performs better than the recursive version.

    • Compile with -O3


      The difference is reduced, which shows that the compiler optimizes the recursive function well.

We can improve this further by implementing introsort:

void insert_sorted(node_t *entry, node_t **list)
{
    while (*list && (*list)->value < entry->value)
        list = &(*list)->next;
    entry->next = *list;
    *list = entry;
}

void insertsort(node_t **list) 
{
    node_t *sorted = NULL;
    node_t *cur = *list;
    while (cur) {
        node_t *node = cur;
        cur = cur->next;
        insert_sorted(node, &sorted);
    }
    *list = sorted;
}
  • Record the lengths of the left and right lists during partitioning
while (p) {
    node_t *n = p;
    p = p->next;
    if (n->value > value) {
        list_add_node_t(&right, n);
        r++;
    }
    else {
        list_add_node_t(&left, n);
        l++;
    }
}
  • When the length is less than 20, use insertion sort instead of quicksort
if (l < 20)
    insertsort(&left);
else 
    quicksort_recursion(&left);
if (r < 20)
    insertsort(&right);
else
    quicksort_recursion(&right);

When the recursion depth exceeds max_level, tree sort is used instead.

void introsort(node_t **list, int max_level, int insert)
{
    if (!*list)
        return;

    if (max_level == 0) {
        treesort(list);
        return;
    }

    node_t *pivot = *list;
    int value = pivot->value;
    int l = 0, r = 0;
    node_t *p = pivot->next;
    pivot->next = NULL;

    node_t *left = NULL, *right = NULL;
    while (p) {
        node_t *n = p;
        p = p->next;
        if (n->value > value) {
            list_add_node_t(&right, n);
            r++;
        }
        else {
            list_add_node_t(&left, n);
            l++;
        }
    }

    if (l < insert) {
        insertsort(&left);
    }
    else { 
        introsort(&left, max_level - 1, insert);
    }
    if (r < insert) {
        insertsort(&right);
    }
    else {
        introsort(&right, max_level - 1, insert);
    }
    node_t *result = NULL;
    list_concat(&result, left);
    list_concat(&result, pivot); 
    list_concat(&result, right);
    *list = result;
}
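
The introsort above also relies on list_concat to stitch the three partitions back together. A minimal sketch of what such a helper might look like, inferred only from how it is called (the repository's actual definition may differ):

/* Sketch (assumption): append the list `tail` to the end of *head.
   If *head is empty, `tail` simply becomes the whole list. */
void list_concat(node_t **head, node_t *tail)
{
    while (*head)
        head = &(*head)->next;
    *head = tail;
}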

The implementation of tree sort is based on a red-black tree, so it has O(n log n) time complexity in the worst case. We fetch the code from rv32emu-next/c_map.h and rv32emu-next/c_map.c, and add new fields to node_t.

typedef struct __node {                   
    int value;
    struct __node *next; // for linked list
    struct __node *left, *right, *up; // for red-black tree
    c_map_color_t color; // for red-black tree
} node_t;

Tree sort pushes each node of the linked list into a red-black tree, and then fetches them back with an in-order traversal.

void treesort(node_t **list) {
    node_t **record = list;
    c_map_t map = c_map_new(sizeof(int), sizeof(NULL), c_map_cmp_int);
    while (*list) {
        c_map_insert(map, *list, NULL);
        list = &(*list)->next;
    }
    node_t *node = c_map_first(map), *first = node;
    for ( ;node; node = c_map_next(node)) {
        *list = node;
        list = &(*list)->next;
    }
    *list = NULL;
    *record = first;
    free(map);
}

Performance Analysis

  • test code

    for (int max_level = 1; max_level < (1 << 15); max_level <<= 1) {
        for (int insert = 10; insert < 40; ++insert) {
            size_t times = 100;
            time = 0;
            while (times--) {
                size_t count = 100000;
                int *test_arr = malloc(sizeof(int) * count);

                for (int i = 0; i < count; ++i) {
                    test_arr[i] = i;
                }
                shuffle(test_arr, count);

                while (count--) {
                    list1 = list_make_node_t(list1, test_arr[count]);
                }
                clock_gettime(CLOCK_MONOTONIC, &tt1);
                introsort(&list1, max_level, insert);
                clock_gettime(CLOCK_MONOTONIC, &tt2);
                time += diff_in_ns(tt1, tt2);
                list_free(&list1);
                free(test_arr);
            }
            fprintf(fd, "%f ", time / 100.0);
        }
        fprintf(fd, "\n");
    }
    
    max_level  insert  time (ns)
    16         21      43538009
    32         21      38184258
    64         21      49350176
    128        21      49877200
    • detailed output
    • max_level decides when to use tree sort
    • insert decides when to use insertion sort

    In the output, we can see that the optimal max_level is 32, which is around 2 × log2(100000) ≈ 33. This matches the result in the paper Introspective Sorting and Selection Algorithms.

    Excerpt from Introspective Sorting and Selection Algorithms, page 5: the depth limit is 2 × log2(N), since this value empirically produces good results.
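
    Following that rule, one way to choose the parameters is to compute the depth limit from the list length. The sketch below assumes a hypothetical list_length() helper and reuses the insert threshold of 21 from the table above.

    #include <math.h>

    /* Sketch: pick introsort's depth limit as 2 * log2(n), per the paper's
       recommendation. list_length() is a hypothetical helper that counts
       the nodes of the list. */
    void sort_with_introsort(node_t **list)
    {
        int n = list_length(list);                  /* e.g. 100000 in the test above */
        int max_level = 2 * (int) log2((double) n); /* about 33 for n = 100000 */
        introsort(list, max_level, 21);
    }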

  • We compare the sorting algorithms on 100000 unique numbers

    We can see that tree sort performs better than the other algorithms. This matches the result of A Comparative Study of Linked List Sorting Algorithms. The problem with quicksort might be the recursion, which makes its runtime higher than tree sort's.

  • We also compare with linux/lib/list_sort.c, testing on 100000 random unique numbers

    We use perf to analyze tree sort and list_sort:

    • list_sort
     Performance counter stats for './list_sort.out':

         4,416,717      cache-misses              #    0.531 % of all cache refs      (66.65%)
       832,301,750      cache-references                                              (66.68%)
    48,222,171,264      cycles                                                        (66.69%)
    16,549,168,666      branch-instructions                                           (66.69%)
        26,319,955      branch-misses             #    0.16% of all branches          (66.67%)
       564,284,840      L1-dcache-load-misses                                         (66.63%)

      12.103174289 seconds time elapsed

      11.414579000 seconds user
       0.688155000 seconds sys
    
    • tree_sort
     Performance counter stats for './tree_sort.out':

     1,347,843,845      cache-misses              #   20.133 % of all cache refs      (66.65%)
     6,694,777,854      cache-references                                              (66.66%)
   226,565,049,469      cycles                                                        (66.68%)
    28,693,147,859      branch-instructions                                           (66.68%)
     1,123,005,940      branch-misses             #    3.91% of all branches          (66.67%)
     3,115,288,209      L1-dcache-load-misses                                         (66.66%)

      58.205776700 seconds time elapsed

      58.170830000 seconds user
       0.007998000 seconds sys
    

    In the comment of list_sort.c:

    Thus, it will avoid cache thrashing as long as 3 × 2^k elements can fit into the cache. Not quite as good as a fully-eager bottom-up mergesort, but it does use 0.2 × n fewer comparisons, so is faster in the common case that everything fits into L1.

    Therefore, the design of list_sort is cache-friendly. We can see that the metrics of list_sort are lower than those of tree sort, which also shows why the Linux kernel uses merge sort instead of tree sort.

    The performance bottleneck of tree sort might come from the red-black tree. Through perf analysis, the highest counts of L1-dcache-load-misses and branch-misses come from obj->comparator(&node->value, &cur->value) in c_map_insert; the highest count of cache-misses comes from cur->left = node and cur->right = node, which are also inside c_map_insert. As a result, we can speculate that the performance bottleneck is the cost of building the red-black tree: it needs to traverse and compare the nodes inside the tree, which increases the runtime.