# linux2023-summer-quiz linux2023: [Raymonna]() ## Problem α−1 I want to compare two codes, [red-black tree code from xieqing](https://github.com/xieqing/red-black-tree) and [S-Tree code from Professor Huang (Jserv)](https://gist.github.com/jserv/4dfaf78cf516cc20f4bc55ce388c195d), to get a more comprehensive understanding of the design for the data structure, respectively. * Insertion: * S-Tree * Step1 : find a empty node(the leaf) to insert the new value **a** or if it already exists in the tree, finish the insertion. * Step2: update the tree in order to make the new tree still follows the constraint for the height of the tree(same as the rule for AVL tree). * The insertion complexity is **O(nlogn)** ::: spoiler Code for insertion operation (S-Tree) ```c= struct treeint *treeint_insert(int a) { struct st_node *p = NULL; enum st_dir d; for (struct st_node *n = st_root(tree); n;) { struct treeint *t = container_of(n, struct treeint, st_n); if (a == t->value) return t; p = n; if (a < t->value) { n = st_left(n); d = LEFT; } else if (a > t->value) { n = st_right(n); d = RIGHT; } } struct treeint *i = calloc(sizeof(struct treeint), 1); if (st_root(tree)) st_insert(&st_root(tree), p, &i->st_n, d); else st_root(tree) = &i->st_n; i->value = a; return i; } ``` ```c= void st_insert(struct st_node **root, struct st_node *p, struct st_node *n, enum st_dir d) { if (d == LEFT) st_left(p) = n; else st_right(p) = n; st_parent(n) = p; st_update(root, n); } ``` ```c= static inline void st_update(struct st_node **root, struct st_node *n) { if (!n) return; int b = st_balance(n); int prev_hint = n->hint; struct st_node *p = st_parent(n); if (b < -1) { /* leaning to the right */ if (n == *root) *root = st_right(n); st_rotate_right(n); } else if (b > 1) { /* leaning to the left */ if (n == *root) *root = st_left(n); st_rotate_left(n); } n->hint = st_max_hint(n); /* Update the parent node if necessary */ if (n->hint == 0 || n->hint != prev_hint) st_update(root, p); } ``` ::: * red-black tree * Properties a. The same number of black nodes for all path b. Not allow two red adjacent nodes c. The color of root must be black, and the leaf node must be black as well d. The inserted node be initialized as red node e. The number of rotation in each insertion is at most 2 * Step1 : Search the position for insertion * Step2 : Check the color of parent; if color of the parent node is black, then finish the insertion; otherwise, go to the *insert_repair()* * Step3 : In the *insert_repair()*, following is the procedure(remember in this step, the color of parent node is red for sure) (1). Check the color of uncle, red or black. (2). If the color of uncle node is black, we need to further check the relative position between parent node and grandparent node (3). Following is the graph to visualize the manipulation of the *insert_repair()* based on the assumption that the parent node is the left child of the grandparent node. ![](https://hackmd.io/_uploads/r1nMz5o2h.jpg) ::: spoiler Code for insertion operation(red-black tree) ```c= rbnode *rb_insert(rbtree *rbt, void *data) { rbnode *current, *parent; rbnode *new_node; /* do a binary search to find where it should be */ current = RB_FIRST(rbt); parent = RB_ROOT(rbt); while (current != RB_NIL(rbt)) { int cmp; cmp = rbt->compare(data, current->data); #ifndef RB_DUP if (cmp == 0) { rbt->destroy(current->data); current->data = data; return current; /* updated */ } #endif parent = current; current = cmp < 0 ? current->left : current->right; } /* replace the termination NIL pointer with the new node pointer */ current = new_node = (rbnode *) malloc(sizeof(rbnode)); if (current == NULL) return NULL; /* out of memory */ current->left = current->right = RB_NIL(rbt); current->parent = parent; current->color = RED; current->data = data; if (parent == RB_ROOT(rbt) || rbt->compare(data, parent->data) < 0) parent->left = current; else parent->right = current; #ifdef RB_MIN if (rbt->min == NULL || rbt->compare(current->data, rbt->min->data) < 0) rbt->min = current; #endif if (current->parent->color == RED) { insert_repair(rbt, current); } else { //Do nothing } RB_FIRST(rbt)->color = BLACK; return new_node; } ``` ```c= void insert_repair(rbtree *rbt, rbnode *current) { rbnode *uncle; do { /* current node is RED and parent node is RED */ if (current->parent == current->parent->parent->left) { uncle = current->parent->parent->right; if (uncle->color == RED) { /* insertion into 4-children cluster */ /* split */ current->parent->color = BLACK; uncle->color = BLACK; /* send grandparent node up the tree */ current = current->parent->parent; /* goto loop or break */ current->color = RED; } else { /* insertion into 3-children cluster */ /* equivalent BST */ if (current == current->parent->right) { current = current->parent; rotate_left(rbt, current); } /* 3-children cluster has two representations */ current->parent->color = BLACK; /* thus goto break */ current->parent->parent->color = RED; rotate_right(rbt, current->parent->parent); } } else { uncle = current->parent->parent->left; if (uncle->color == RED) { /* insertion into 4-children cluster */ /* split */ current->parent->color = BLACK; uncle->color = BLACK; /* send grandparent node up the tree */ current = current->parent->parent; /* goto loop or break */ current->color = RED; } else { /* insertion into 3-children cluster */ /* equivalent BST */ if (current == current->parent->left) { current = current->parent; rotate_right(rbt, current); } /* 3-children cluster has two representations */ current->parent->color = BLACK; /* thus goto break */ current->parent->parent->color = RED; rotate_left(rbt, current->parent->parent); } } } while (current->parent->color == RED); } ``` ```c= void insert_repair(rbtree *rbt, rbnode *current) { rbnode *uncle; do { /* current node is RED and parent node is RED */ /*The parent node is the left node of the grandparent*/ if (current->parent == current->parent->parent->left) { uncle = current->parent->parent->right; if (uncle->color == RED) { /* insertion into 4-children cluster */ /* split */ current->parent->color = BLACK; uncle->color = BLACK; /* send grandparent node up the tree */ current = current->parent->parent; /* goto loop or break */ current->color = RED; } else { /* insertion into 3-children cluster */ /* equivalent BST */ if (current == current->parent->right) { current = current->parent; rotate_left(rbt, current); } current->parent->color = BLACK; /* thus goto break */ current->parent->parent->color = RED; rotate_right(rbt, current->parent->parent); } } else { uncle = current->parent->parent->left; if (uncle->color == RED) { /* insertion into 4-children cluster */ /* split */ current->parent->color = BLACK; uncle->color = BLACK; /* send grandparent node up the tree */ current = current->parent->parent; /* goto loop or break */ current->color = RED; } else { /* insertion into 3-children cluster */ /* equivalent BST */ if (current == current->parent->left) { current = current->parent; rotate_right(rbt, current); } /* 3-children cluster has two representations */ current->parent->color = BLACK; /* thus goto break */ current->parent->parent->color = RED; rotate_left(rbt, current->parent->parent); } } } while (current->parent->color == RED); } ``` ::: * Removement: * S-Tree: * In the design of S-Tree, if we want to remove, let's say 5, in the sorted sequence [1, 3, 5, 6, 7, 8], then we would choose 6 to be the replaced node. If the removed node is 8, then in this case, the replaced node would be 7. * The differnce between S-Tree and the AVL tree is the way to handle the deleted node. S-Tree applies some technique used in RB tree, but due to the constraint on the structure, which is the evaluation of height, the update time is still more than the red-black tree. But since the advantage of AVL tree is the less search time(in that it is more balanced than the red black tree), it still has some benefits to use this structure. * red-black tree: * Properties: (a). At most 3 rotation for each removement(in this case, the function *delete_repair()*) (b). In order to get a more complete understanding of the whole procedure of deletion, following is the visualization of the deltetion in RB tree. ![](https://hackmd.io/_uploads/rJ_KUdn33.jpg) ![](https://hackmd.io/_uploads/Bkpo8_22n.jpg) ![](https://hackmd.io/_uploads/Bk298d322.jpg) :::spoiler code for deletion in red-black tree ```c= int rb_delete(rbtree *rbt, rbnode *node, int keep) { delete_repair_time = 0; rbnode *target, *child; int data; data = node->data; /* choose node's in-order successor if it has two children */ if (node->left == RB_NIL(rbt) || node->right == RB_NIL(rbt)) { target = node; } else { target = rb_successor(rbt, node); /* node->right must not be NIL, thus move down */ /*That is, target must be one of the grandchild of node*/ node->data = target->data; /* data swapped */ } child = (target->left == RB_NIL(rbt)) ? target->right : target->left; /* child may be NIL */ if (target->color == BLACK) { if (child->color == RED) { //Inorder to compromise the loss of target(which is a black node) child->color = BLACK; } else if (target == RB_FIRST(rbt)) { //target is the root node //And since target must be the child of node, but no one is older than root, thus, target == node == root in this case //We could conclude that target is the root which has no child node //That is, target == node == the last node in this tree //Therefore, no modification needed in this circumstances } else { delete_repair(rbt, target); } } else { } if (child != RB_NIL(rbt)) child->parent = target->parent; if (target == target->parent->left) target->parent->left = child; else target->parent->right = child; free(target); return data; } void delete_repair(rbtree *rbt, rbnode *current) { delete_repair_time++; /* current must be black now, which would be deleted after delete_repair() */ /* current node has at most one child, and this child must be black */ rbnode *sibling; do { if (current == current->parent->left) { sibling = current->parent->right; if (sibling->color == RED) { sibling->color = BLACK; current->parent->color = RED; rotate_left(rbt, current->parent); sibling = current->parent->right; } /* sibling node must be BLACK now */ if (sibling->right->color == BLACK && sibling->left->color == BLACK) {//Case 2 sibling->color = RED; if (current->parent->color == RED) { current->parent->color = BLACK; break; /* goto break */ } else { //If the current->parent is black, then we let the current = current->parent and go back to Case 2 current = current->parent; /* goto loop (go back to Case 1 or 2 or 3 or 4)*/ } } else { if (sibling->right->color == BLACK) {//Case 3 sibling->left->color = BLACK; sibling->color = RED; rotate_right(rbt, sibling); sibling = current->parent->right; } /* transfer by rotation and recoloring */ sibling->color = current->parent->color;//Case 4 current->parent->color = BLACK; sibling->right->color = BLACK; rotate_left(rbt, current->parent); break; /* goto break */ } } else {/* When current == current->parent->right */ sibling = current->parent->left; if (sibling->color == RED) { sibling->color = BLACK; current->parent->color = RED; rotate_right(rbt, current->parent); sibling = current->parent->left; } /* sibling node must be BLACK now */ if (sibling->right->color == BLACK && sibling->left->color == BLACK) { sibling->color = RED; if (current->parent->color == RED) { current->parent->color = BLACK; break; /* goto break */ } else { current = current->parent; /* goto loop */ } } else { if (sibling->left->color == BLACK) { sibling->right->color = BLACK; sibling->color = RED; rotate_left(rbt, sibling); sibling = current->parent->left; } /* transfer by rotation and recoloring */ sibling->color = current->parent->color; current->parent->color = BLACK; sibling->left->color = BLACK; rotate_right(rbt, current->parent); break; /* goto break */ } } } while (current != RB_FIRST(rbt)); } ``` ::: ## Problem α−2 ```c struct treeint *treeint_insert(int a) { struct st_node *p = NULL; enum st_dir d; if (!st_root(tree)) { struct treeint *i = calloc(sizeof(struct treeint), 1); st_root(tree) = &i->st_n; i->value = a; return i; } struct st_node *n = st_root(tree); struct treeint *t = container_of(n, struct treeint, st_n); int t_value = t->value; while (n) { if (a != t_value) { p = n; if (a < t_value) { n = st_left(n); d = LEFT; } else if (a > t_value) { n = st_right(n); d = RIGHT; } if (n) { t = container_of(n, struct treeint, st_n); t_value = t->value; // Prefetch t->value for the next iteration } } else { return t; } } struct treeint *i = calloc(sizeof(struct treeint), 1); if (st_root(tree)) st_insert(&st_root(tree), p, &i->st_n, d); else st_root(tree) = &i->st_n; i->value = a; return i; } ``` According the observation from the **perf annotate**, we found that the bottleneck seems to locate in the insertion operation. Since I still have no better idea how to improve the perfomance from modifying the data structure of the tree, I try some common techique like **data prefetching** and **branch reduction**. The difference of performance could be shown in the following analyzing result: ``` //original code Performance counter stats for './a.out 10000000': 1,0003.21 msec task-clock # 1.000 CPUs utilized 56 context-switches # 5.598 /sec 3 cpu-migrations # 0.300 /sec 7,4136 page-faults # 7.411 K/sec 409,2041,1639 cycles # 4.091 GHz 109,2868,4639 instructions # 0.27 insn per cycle 20,4573,0049 branches # 204.507 M/sec 1,2757,6871 branch-misses # 6.24% of all branches 10.003724785 seconds time elapsed 9.900060000 seconds user 0.104000000 seconds sys ``` ``` //modified code Performance counter stats for './a.out 10000000': 9226.03 msec task-clock # 1.000 CPUs utilized 50 context-switches # 5.419 /sec 2 cpu-migrations # 0.217 /sec 7,4154 page-faults # 8.037 K/sec 367,9784,9550 cycles # 3.988 GHz 65,2797,4641 instructions # 0.18 insn per cycle 16,8063,1869 branches # 182.162 M/sec 1,2964,1154 branch-misses # 7.71% of all branches 9.227114780 seconds time elapsed 9.094714000 seconds user 0.131981000 seconds sys ``` ## Problem α−3 ### Performance comparison: * S-Tree ```shell= $ sudo perf stat ./a.out n = 100 [ After insertions ] Removing... average_remove_update_time = 2.770000 [ After removals ] Performance counter stats for './a.out': 1.07 msec task-clock # 0.500 CPUs utilized 0 context-switches # 0.000 /sec 0 cpu-migrations # 0.000 /sec 65 page-faults # 60.765 K/sec 110,0440 cpu_core/cycles/ # 1.029 G/sec <not counted> cpu_atom/cycles/ (0.00%) 129,4924 cpu_core/instructions/ # 1.211 G/sec <not counted> cpu_atom/instructions/ (0.00%) 24,1982 cpu_core/branches/ # 226.216 M/sec <not counted> cpu_atom/branches/ (0.00%) 7675 cpu_core/branch-misses/ # 7.175 M/sec <not counted> cpu_atom/branch-misses/ (0.00%) 660,2640 cpu_core/slots/ # 6.172 G/sec 144,9991 cpu_core/topdown-retiring/ # 22.0% Retiring 77,6781 cpu_core/topdown-bad-spec/ # 11.8% Bad Speculation 295,1768 cpu_core/topdown-fe-bound/ # 44.7% Frontend Bound 142,4098 cpu_core/topdown-be-bound/ # 21.6% Backend Bound 25,8927 cpu_core/topdown-heavy-ops/ # 3.9% Heavy Operations # 18.0% Light Operations 75,0888 cpu_core/topdown-br-mispredict/ # 11.4% Branch Mispredict # 0.4% Machine Clears 204,5523 cpu_core/topdown-fetch-lat/ # 31.0% Fetch Latency # 13.7% Fetch Bandwidth 64,7317 cpu_core/topdown-mem-bound/ # 9.8% Memory Bound # 11.8% Core Bound 0.002138885 seconds time elapsed 0.002099000 seconds user 0.000000000 seconds sys ``` ```shell= $ time ./a.out 10000000 [ After insertions ] Removing... average_remove_update_time = 1.358752 [ After removals ] real 0m19.515s user 0m19.435s sys 0m0.080s ``` ![](https://hackmd.io/_uploads/rkFW5uhnn.png) * red-black tree ```shell= $ sudo perf stat ./a.out average insert repair time = 0.580000 Red-Black Tree creation, insertion and deletion: Performance counter stats for './a.out': 1.14 msec task-clock # 0.556 CPUs utilized 0 context-switches # 0.000 /sec 0 cpu-migrations # 0.000 /sec 67 page-faults # 58.784 K/sec 111,0907 cpu_core/cycles/ # 974.672 M/sec <not counted> cpu_atom/cycles/ (0.00%) 131,9557 cpu_core/instructions/ # 1.158 G/sec <not counted> cpu_atom/instructions/ (0.00%) 24,4022 cpu_core/branches/ # 214.097 M/sec <not counted> cpu_atom/branches/ (0.00%) 7664 cpu_core/branch-misses/ # 6.724 M/sec <not counted> cpu_atom/branch-misses/ (0.00%) 666,5442 cpu_core/slots/ # 5.848 G/sec 146,3783 cpu_core/topdown-retiring/ # 22.0% Retiring 81,0308 cpu_core/topdown-bad-spec/ # 12.2% Bad Speculation 295,3705 cpu_core/topdown-fe-bound/ # 44.3% Frontend Bound 143,7644 cpu_core/topdown-be-bound/ # 21.6% Backend Bound 26,1389 cpu_core/topdown-heavy-ops/ # 3.9% Heavy Operations # 18.0% Light Operations 75,8030 cpu_core/topdown-br-mispredict/ # 11.4% Branch Mispredict # 0.8% Machine Clears 206,4980 cpu_core/topdown-fetch-lat/ # 31.0% Fetch Latency # 13.3% Fetch Bandwidth 62,7335 cpu_core/topdown-mem-bound/ # 9.4% Memory Bound # 12.2% Core Bound 0.002048556 seconds time elapsed 0.001987000 seconds user 0.000000000 seconds sys ``` ```shell= $ time ./a.out 10000000 $ time ./a.out 10000000 average insert repair time = 0.635921 Red-Black Tree creation, insertion and deletion: real 0m9.738s user 0m9.621s sys 0m0.116s ``` ![](https://hackmd.io/_uploads/S1IX5d3n2.png) * Summary: The basic reason of the difference in performance is the upate time per insertion and removement in STree and RBTree, if you observe the update time for each insertion/deletetion in stree, it is far more than the ones in rbtree(which is called x_repair() in the provided code). Therefore, using the rbtree is still a better choice when it comes to frequent insertions and deletetions. ## Problem β-1 ```c static inline uintptr_t align_up(uintptr_t sz, size_t alignment) { uintptr_t mask = alignment - 1; if((alignment & mask) == 0) { return (sz + mask) & (~mask); } return (((sz + mask) / alignment) *alignment); } ``` Let's take a look at the simple example: align_up(121, 4): * mask = 3; ~mask = 1111_1100 * sz = 121; sz + 3 = 124 * 124 & (~mask) = 124 - (124%4) = 124 ## Problem β-2 ```c //arch/powerpc/platforms/ps3/os-area.c static unsigned int db_align_up(unsigned int val, unsigned int size) { return (val + (size - 1)) & (~(size - 1)); } ``` ```c //arch/powerpc/platforms/ps3/os-area.c /** * db_for_each_64 - Iterator for 64 bit entries. * * A NULL value for id can be used to match all entries. * OS_AREA_DB_OWNER_ANY and OS_AREA_DB_KEY_ANY can be used to match all. */ static int db_for_each_64(const struct os_area_db *db, const struct os_area_db_id *match_id, struct db_iterator *i) { next: if (!i->db) { i->db = db; i->match_id = match_id ? *match_id : os_area_db_id_any; i->idx = (void *)db + db->index_64; i->last_idx = i->idx + db->count_64; i->value_64 = (void *)db + db->index_64 + db_align_up(db->count_64, 8); } else { i->idx++; i->value_64++; } if (i->idx >= i->last_idx) { pr_debug("%s:%d: reached end\n", __func__, __LINE__); return 0; } if (i->match_id.owner != OS_AREA_DB_OWNER_ANY && i->match_id.owner != (int)i->idx->owner) goto next; if (i->match_id.key != OS_AREA_DB_KEY_ANY && i->match_id.key != (int)i->idx->key) goto next; return 1; } ``` The code in **db_align_up()** uses the same technique for rounding up the value based on the bits you provide(in this case, the *size*). ## Problem γ-1 Code Review for [qsort_mt provided by Professor Huang(Jserv)](https://gist.github.com/jserv/38bca4ebfea1c434e1dfc15337f80eeb) We divide the code into two parts to discus. The pthread design and the sorting algorithm design * pthread design * qsort_mt() * First, it initialize the threads, including the mutex and condition variable for the leading thread(in this case, qs in qsort_mt(), which is responsible to manage the pool of threads) and the pool of threads(which is contained in the qs->common) * qsort_thread() * The execution function for each thread in the pool. Note that the threads wouldn't execute right after the *pthread_create()*. We need to wait for the qs->st become non-idle state, which would wake up one of the threads by signaling it. * Once one of the threads in the pool is unlocked, it begains to execute the *qsort_algo()*. * allocate_thread() * Once there exists idle thread when executing *qsort_algo()*, it calls the allocate_thread(c) to get the free thread(Notice that c->pool[i].st = ts_work is needed be protected by the c->pool[i].mtx_al, or it would generate race condition risk) * sorting algorithm design * In the qsort_algo(), it divides into 3 situations.(1). if n < 7, we could simply use bubble sort to tackle this problem. (2). if n => 7 and the unsorted sequence is not strictly ascending, then it would use qsort, (3). otherwise, it would use insertion sort. * There are some techniques that helps to improving the performance of the algorithm. * Find the approximate median(that is, take the eight points into consideration and set the median among these points), which would make the divided sequence more balanced. * Move all the elements with the same value of **a** to switch to the middle of sequence. * The usage of allocation of the pthreads which is not intuitive for me. First, the left part of subsequence after the first iteration of quick sort would dispatch to other available thread to execute, if any; otherwise it would execute the unsorted subsequence on the current thread continuously. The right part of the unsorted subsequence would always continue to execute on the current thread, since if the thread execute to this step, it means it could neglect the remaining sorting task for the left part, and would have the spare time to execute the right unsorted part. ## Problem γ-2 ```c static struct qsort *allocate_thread(struct common *c) { verify(pthread_mutex_lock(&c->mtx_al)); for (int i = 0; i < c->nthreads; i++) if (c->pool[i].st == ts_idle) { c->idlethreads--; verify(pthread_mutex_lock(&c->pool[i].mtx_st));//@@ c->pool[i].st = ts_work; verify(pthread_mutex_unlock(&c->mtx_al));//@@ return (&c->pool[i]); } verify(pthread_mutex_unlock(&c->mtx_al)); return (NULL); } ``` ## Problem γ-3 Combine the heapsort when n < c->forkelem ```c= static void sift_down(char *a, size_t start, size_t end, size_t es, cmp_t *cmp, int swaptype) { size_t root = start; size_t child; while ((child = 2 * root) <= end) { if (child < end && CMP(cmp, a + (child - 1) * es, a + child * es) < 0) { child++; } if (CMP(cmp, a + (root - 1) * es, a + (child - 1) * es) < 0) { swap(a + (root - 1) * es, a + (child - 1) * es); root = child; } else { return; } } } static void heapsort(void *base, size_t n, size_t es, cmp_t *cmp, int swaptype) { char *a = base; size_t i, size; /* Build the heap */ for (i = (n / 2); i > 0; i--) {//From those node which have leaf nodes sift_down(a, i, n, es, cmp, swaptype); } /* Perform the heapsort */ for (size = n; size > 1; size--) { swap(a, a + (size - 1) * es); sift_down(a, 1, size - 1, es, cmp, swaptype); } } qsort_algo() { ... if (n < c->forkelem) { // Switch to heapsort heapsort(a, n, es, cmp, swaptype); return; } ... } ``` * Result(the modified code is worse than the original code, still continue to find and try the other way to improve the performance): * original: ![](https://hackmd.io/_uploads/Bym9EMa2n.png) * add heap sort: ![](https://hackmd.io/_uploads/Hkif4Mp3h.png) * Substitute the qsort into heap sort. * If we want to properly utilize the advantage of pthread in the heap sort, the way I currently think of is to use the merge of the heap sort, which could dispatch the work to each of the threads in the pool. [TODO] * Heap sort + pthread * Heap sort + cuda * Quick sort + cuda * Analyze and comparison of the performance of each design with the qsort_mt