# 2024q1 Homework1 (lab0)
contributed by < `millaker` >

## Development Environment

```
$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

$ lscpu
Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           48 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  16
On-line CPU(s) list:     0-15
Vendor ID:               AuthenticAMD
Model name:              AMD Ryzen 7 7700 8-Core Processor
CPU family:              25
Model:                   97
Thread(s) per core:      2
Core(s) per socket:      8
Socket(s):               1
Stepping:                2
Frequency boost:         enabled
CPU max MHz:             5388.2808
CPU min MHz:             3000.0000
BogoMIPS:                7585.53
```

## Queue implementation

### `q_new`

The first version of `q_new` allocated an `element_t` and used it as the head of the queue, returning the address of its embedded `list_head`. Since the `value` field of the head is never used, a bare `list_head` is enough to represent the head of the queue, so I modified `q_new()` to allocate only a `list_head`.

```diff
struct list_head *q_new()
{
-    element_t *tmp = (element_t *) malloc(sizeof(*tmp));
+    struct list_head *tmp = (struct list_head *) malloc(sizeof(*tmp));
    if (!tmp) {
        return NULL;
    }
-    INIT_LIST_HEAD(&tmp->list);
-    tmp->value = NULL;
-    return &tmp->list;
+    INIT_LIST_HEAD(tmp);
+    return tmp;
}
```

### `q_insert_head`, `q_insert_tail`

These two functions perform the same operation except that one inserts at the list head and the other at the list tail. Since the list is doubly linked, locating either end takes only one pointer dereference.
```c
/* Insert an element at head of queue */
bool q_insert_head(struct list_head *head, char *s)
{
    if (!head)
        return false;
    element_t *new_node = (element_t *) malloc(sizeof(*new_node));
    if (!new_node)
        return false;
    new_node->value = (char *) malloc(sizeof(char) * (strlen(s) + 1));
    if (!new_node->value) {
        free(new_node);
        return false;
    }
    strncpy(new_node->value, s, strlen(s) + 1);
    list_add(&new_node->list, head);
    return true;
}
```

After the new `element_t` and its `value` buffer are successfully allocated, the string is copied into `value` using `strncpy`. The only difference between `q_insert_head` and `q_insert_tail` is the list-insertion function called at the end. Since `strlen(s)` yields the same value in the `malloc()` and `strncpy()` calls, it only needs to be computed once, so I added the line below before its first use:

```c
size_t len = strlen(s);
```

### `q_remove_head`, `q_remove_tail`

Like the insert functions, these two take no effort to locate the head and tail node. At most `bufsize` bytes of the string are copied to the provided buffer `sp` with `strncpy`, and `sp[bufsize - 1]` is set to `'\0'` to guarantee termination. Since the function is named `remove`, no memory is freed; the target node is only unlinked from the list.

```c
/* Remove an element from head of queue */
element_t *q_remove_head(struct list_head *head, char *sp, size_t bufsize)
{
    if (!head || head == head->next)
        return NULL;
    element_t *removed = list_entry(head->next, element_t, list);
    strncpy(sp, removed->value, bufsize);
    sp[bufsize - 1] = '\0';
    list_del(&removed->list);
    return removed;
}
```

### `q_ascend`, `q_descend`

:::danger
Address your motivations and considerations here.
:::

```c
/* Remove every node which has a node with a strictly less value anywhere to
 * the right side of it */
int q_ascend(struct list_head *head)
{
    // https://leetcode.com/problems/remove-nodes-from-linked-list/
    if (!head || list_empty(head))
        return 0;
    struct list_head *curr = head->prev;
    char *curr_max = list_entry(curr, element_t, list)->value;
    int i = 0;
    while (curr != head) {
        element_t *temp = list_entry(curr, element_t, list);
        if (strcmp(temp->value, curr_max) <= 0) {
            curr_max = temp->value;
            curr = curr->prev;
            i++;
            continue;
        }
        // Remove current node
        struct list_head *next_node = curr->prev;
        list_del(curr);
        curr = next_node;
    }
    return i;
}
```

The listing above is my initial implementation of `q_ascend`. The description of `q_ascend` and `q_descend` in `queue.h` states that these two functions **remove** every node that violates the ordering rule, so I assumed that unlinking the node without freeing it would be enough and the rest would be handled by external code. However, memory leaks showed up when I verified the implementation. The revised version frees the removed node, so no leak occurs:

```diff
        list_del(curr);
+       free(temp->value);
+       free(temp);
        curr = next_node;
```

### `q_reverse`

:::danger
Address your motivations and considerations here.
:::

My `q_reverse` traverses the whole list and swaps the `prev` and `next` pointers of every node.

### `q_reverseK`

`q_reverseK` is similar to `q_reverse` but reverses the list in groups of `k` nodes. Like `q_reverse`, I traverse the whole list while counting nodes. When `count` reaches `k`, the marked sublist is attached to a temporary `list_head` and passed to `q_reverse`. Afterwards, the reversed sublist is reconnected to the queue.
```c
// Attach to temp head
temp_head.next = reverse_first;
temp_head.prev = curr;
reverse_first->prev = &temp_head;
curr->next = &temp_head;
q_reverse(&temp_head);
// Rebuild link
temp_head.next->prev = prev;
temp_head.prev->next = next;
prev->next = temp_head.next;
next->prev = temp_head.prev;
```

:::warning
Facilitate List APIs to shorten your implementation.
:::

The modification was done in [commit 3ac2a83](https://github.com/millaker/lab0-c/commit/3ac2a8320ac9b66ed146e2a26b9fd3af22b88011). The manual relinking before and after the call to `q_reverse()` was replaced with `list_cut_position()` and `list_splice_init()`.

```diff
    // Store prev and next
    struct list_head *prev = reverse_first->prev, *next = curr->next;
-   // Attach to temp head
-   temp_head.next = reverse_first;
-   temp_head.prev = curr;
-   reverse_first->prev = &temp_head;
-   curr->next = &temp_head;
+   list_cut_position(&temp_head, prev, curr);
    q_reverse(&temp_head);
-   // Rebuild link
-   temp_head.next->prev = prev;
-   temp_head.prev->next = next;
-   prev->next = temp_head.next;
-   next->prev = temp_head.prev;
-   // Restore vars
+   list_splice_init(&temp_head, prev);
    i = k - 1;
    reverse_first = next;
    curr = next;
```

### `q_sort`

I implemented `q_sort` as a mergesort with two helper functions, `merge_list` and `sort_list`. `sort_list` splits the list in half at the middle node and recursively calls itself until the list cannot be split further. `merge_list` merges two sorted lists into one. Since we cannot allocate any additional nodes, the list is first converted into a singly linked list to simplify these two functions; `sort_list` and `merge_list` operate on lists without a sentinel head node. The `prev` links are rebuilt after the sorting is done.
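As a sketch of what `merge_list` does (not the exact lab0-c code: the `node_t` type below is a hypothetical stand-in for `element_t`, keeping only a `value` and a `next` link), merging two sorted singly linked lists without allocating any node:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical singly linked node used only for this sketch. */
typedef struct node {
    const char *value;
    struct node *next;
} node_t;

/* Merge two sorted NULL-terminated singly linked lists in ascending
 * order, reusing the existing nodes (no allocation). */
static node_t *merge_list(node_t *a, node_t *b)
{
    node_t head = {0}; /* temporary on-stack head */
    node_t *tail = &head;
    while (a && b) {
        if (strcmp(a->value, b->value) <= 0) {
            tail->next = a;
            a = a->next;
        } else {
            tail->next = b;
            b = b->next;
        }
        tail = tail->next;
    }
    tail->next = a ? a : b; /* append the non-empty remainder */
    return head.next;
}
```

The on-stack temporary head avoids special-casing the first node; because nodes are relinked rather than copied, the merge is O(1) in extra space.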
#### Linux [`list_sort.c`](https://github.com/torvalds/linux/blob/master/lib/list_sort.c) study

Instead of using a traditional bottom-up mergesort, Linux's `list_sort()` maintains a `pending` list of sorted sublists and moves one element from the input onto `pending` in each iteration. The `pending` sublists are chained together through the `prev` pointer of their first `list_head`. Whenever `pending` holds two sublists of size $2^k$, they are merged into one sublist of size $2^{k+1}$. The following shows the `pending` list:

```graphviz
digraph G {
    L2 -> L1 [label = prev];
    L1 -> L1_1 -> L1_2 -> L1_3;
    L2 -> L2_1;
    L3 -> L3_1;
    L3 -> L2 [label = prev];
    L4 -> L3 [label = prev];
    {rank = same; L1; L2; L3; L4;}
}
```

```c=
do {
    size_t bits;
    struct list_head **tail = &pending;

    /* Find the least-significant clear bit in count */
    for (bits = count; bits & 1; bits >>= 1)
        tail = &(*tail)->prev;
    /* Do the indicated merge */
    if (likely(bits)) {
        struct list_head *a = *tail, *b = a->prev;

        a = merge(priv, cmp, b, a);
        /* Install the merged result in place of the inputs */
        a->prev = b->prev;
        *tail = a;
    }

    /* Move one element from input list to pending */
    list->prev = pending;
    pending = list;
    list = list->next;
    pending->next = NULL;
    count++;
} while (list);
```

The listing above is the main loop of the sort:
1. Line 6~7: Find the next pending sublist to be merged
2. Line 9~15: Merge the two sublists and put the result back into `pending`
3. Line 18~22: Move one new element from the input onto `pending`

After the input is exhausted, the remaining `pending` sublists are merged into the final sorted list by `merge_final()`. Because the `prev` link is borrowed for chaining `pending` and no longer serves its original purpose during sorting, the `prev` links are restored after the merge.

#### Why is this faster?

From the commit message of [`commit b5c56e0`](https://github.com/torvalds/linux/commit/b5c56e0cdd62979dd538e5363b06be5bdf735a09):

> Obviously, we have to read the entire list into L1 cache at some point, and performance is best if it fits.
> But if it doesn't fit, each full pass over the input causes a cache miss per element, which is undesirable.

:::danger
Show me the measurements!
:::

A traditional bottom-up mergesort has to re-read the whole list on every merge pass. To avoid this, `list_sort()` keeps the front of the list as sorted as possible, merging two $2^k$ sublists into one $2^{k+1}$ sublist as soon as they exist, so only the last few merges suffer cache misses.

To show the effect described in the commit message, I used [Perforator](https://github.com/zyedidia/perforator?tab=readme-ov-file) to probe cache misses while sorting.

> `Perforator` is a tool for recording performance metrics over subregions of a program (e.g., functions) using the Linux "perf" interface.

That is, it measures at the function level instead of the whole program. `Cachegrind` can simulate a simple cache system but runs too slowly; `perf` reads the real hardware counters.

The structure used in the test is as follows:

```c
typedef struct {
    struct list_head list;
    int val;
    int seq;
} element_t;
```

The structure contains a `list_head` with two 8-byte pointers (on a 64-bit machine) and two `int`s, 24 bytes in total. The official page for the AMD 7700 lists 512 KB of L1 cache, so roughly 21,000 `element_t`s fit in L1 (a rough estimate). To make the L1 cache misses visible, I tested with a list size of 64,000, about three times that estimate.
For 64,000 completely random elements, each configuration run 1000 times, the averaged results were summarized with a Python script:

```
Region: td_sort          # Top-down mergesort
Average cpu-cycles: 45852878.67
Average cache-misses: 64878.18
Average cache-references: 995086.32
Average time-elapsed: 8.73
Average comparisons: 470571.36

Region: list_sort        # Linux bottom-up mergesort
Average cpu-cycles: 33957086.30
Average cache-misses: 42590.56
Average cache-references: 688218.42
Average time-elapsed: 6.46
Average comparisons: 470774.09
```

With nearly identical comparison counts, my top-down mergesort issued about 45% more cache references than `list_sort` (equivalently, `list_sort` issued about 30% fewer). Investigating, I found that my `merge` function reads memory more often: its exit condition dereferences two `struct list_head` pointers per iteration, while `list_sort`'s implementation dereferences only one. However, this has little effect on cache *misses* as long as the pointers referenced by the exit condition are already in the cache. For the same data, `list_sort`'s cache misses are about 35% lower, which directly translates into the shorter execution time.

#### Porting Linux `list_sort()` to `qtest`

After removing some Linux-specific hints and annotations and writing my own `cmp()` function, I was able to sort the queue using `list_sort()`.

#### `likely()` and `unlikely()` macros

I noticed two identifiers I had not seen before. Further investigation led to [include/linux/compiler.h](https://github.com/torvalds/linux/blame/master/include/linux/compiler.h); they are two macros defined in `compiler.h`:

```c
# define likely(x)      __builtin_expect(!!(x), 1)
# define unlikely(x)    __builtin_expect(!!(x), 0)
```

Both use GCC's `__builtin_expect()`, which provides branch prediction hints to the compiler.

> Built-in Function: long __builtin_expect (long exp, long c)
> You may use __builtin_expect to provide the compiler with branch prediction information.
> In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.

- [GCC DOC](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-_005f_005fbuiltin_005fexpect)

The double negation `!!(x)` guarantees that the argument is normalized to 0 or 1 while retaining its original truth value.

#### Comparing my sort with Linux `list_sort()`

I wrote a simple test command to time each sort:

```
option verbose 2
new
it RAND 1000000
time sort
free
new
it RAND 1000000
time linux_list_sort
free
```

Result:

```
Delta time = 0.755
Delta time = 0.732
```

The result shows a slightly shorter execution time for `list_sort()`, but not by much.

:::danger
Why? Have you revised the experiments?
:::

The revised experiments can be found in the [previous section](https://hackmd.io/KbgAPOOvT-uqnX9IbsA15g?both#Why-is-this-faster). To make more accurate measurements, I moved the two sorting implementations to a separate file and ran the tests on integer values.

#### More detailed comparison

Refer to: [CPython listsort.txt](https://github.com/python/cpython/blob/main/Objects/listsort.txt)

A set of different data distributions can be compared:
1. *sort: all random data
2. \sort: descending data
3. /sort: ascending data
4. 3sort: ascending, then 3 random exchanges
5. +sort: ascending, then 10 random values at the end
6. %sort: ascending, then randomly replace 1% of elements with random values
7. ~sort: many duplicates
8. =sort: all equal
9. !sort: worst case scenario

#### Data preparation for sort benchmark

> [commit 61da385](https://github.com/millaker/lab0-c/commit/61da385bc673d6b37f9ec1662e47cb2a42c14113)

The implementation refers to CPython's [sortperf.py](https://github.com/python/cpython/blob/0c7dc494f2a32494f8971a236ba59c0c35f48d94/Tools/scripts/sortperf.py#L10), which is used to benchmark Python's list sort. The worst-case pattern in `sortperf.py` is designed to trigger the worst case of a traditional quicksort, with a distribution like `[3, 2, 1, 0, 0, 1, 2, 3]`; worst cases for mergesort and Timsort still need to be added to the tested dataset.

#### Comparing Timsort and Linux `list_sort()` mergesort

> [commit 972c8d9](https://github.com/millaker/lab0-c/commit/972c8d9f3e7880fd314a68d8bd7bce1627aa0077)

The test iterates through all currently available patterns 1~8, with sizes from 20 to 12,000 in steps of 100, 100 runs each. Since the [Ryzen R7-7700](https://www.amd.com/en/products/apu/amd-ryzen-7-7700) has 512 KB of L1 cache, the list length must exceed roughly 22,000 elements (512 KB / 24 B; three 64-bit pointers per `element_t`) to populate the entire L1 cache. The largest size in the current test is 12,000, so the effect of cache misses in these tests is negligible. Further tests with `cachegrind` on larger lists could make the cache more influential.

#### All random

> The entire test results, including logs and plots, can be found in [commit a956867](https://github.com/millaker/lab0-c/commit/a9568674b5d683fd04b6be9685e6305606ba8030). I will only show the comparison-count plots, since time and number of comparisons are positively correlated.

![Mixed_1_cmps](https://hackmd.io/_uploads/S16rnoVCa.png)

From the plot above, we can see that the current implementation of Timsort performs more comparisons than Linux mergesort on completely random data.
#### Ascending, Descending

![Mixed_2_cmps](https://hackmd.io/_uploads/HJOGZ2N0a.png)
![Mixed_3_cmps](https://hackmd.io/_uploads/S1hzZhECa.png)

With completely ascending or descending data, which matches Timsort's assumption about natural runs in real-world data, Timsort runs in O(n) time.

#### Ascending, 3 random exchanges

![Mixed_4_cmps](https://hackmd.io/_uploads/BJbnz34Cp.png)

In this pattern, the random exchanges sabotage the perfect run that Timsort favors, resulting in less consistent comparison counts.

#### Ascending, 10 random at tail

![Mixed_5_cmps](https://hackmd.io/_uploads/S1q_m3VC6.png)

In this pattern, the ascending nature is mostly retained, and the result is similar to the fully ascending case.

#### Ascending, 1% random

![Mixed_6_cmps](https://hackmd.io/_uploads/Hyl9EhERp.png)

In this scenario, 1% of all nodes are replaced with random data. Timsort becomes less stable but still outperforms Linux mergesort at every tested list size.

#### Many duplicates

![Mixed_7_cmps](https://hackmd.io/_uploads/r1rXrnNCT.png)

The data preparation was implemented as below:

```c
} else if (pattern == DUP) {
    char *dup_str[4] = {"abcde", "bcdef", "cdefg", "defgh"};
    for (int i = 0; i < size; i++) {
        q_insert_tail(temp, dup_str[i % 4]);
    }
```

Four pre-defined strings are inserted into the list in turn. I did not think much about the strings when coding and accidentally created an ascending order, which can be expected to favor Timsort: it can continuously find small natural runs and merge them.

#### Many duplicates - random

> Fix in next patch

#### All equal

![Mixed_8_cmps](https://hackmd.io/_uploads/HyFFUnNRT.png)

This pattern falls into the ascending-order category, since the comparison is done as `a >= b`.
#### Comparison between patterns

![Timsort_all_cmps](https://hackmd.io/_uploads/H1eqgpEC6.png)
![Mergesort_all_cmps](https://hackmd.io/_uploads/SJIcl64A6.png)

From the 8 patterns above, we can see that Linux mergesort has more stable performance (almost the same comparison count) across all data patterns, while Timsort beats mergesort by a wide margin on data with ascending or descending natural runs. Therefore, to improve Timsort, I must find a way to speed it up on random data.

#### Comparison on large lists

### `q_ascend`, `q_descend`

These two functions are similar in nature: each removes every node that has a value violating the ordering rule somewhere to its right. Since the queue is a doubly linked list, traversing it backwards is trivial. A variable `curr_max` records the current maximum for `q_descend` (and the current minimum for `q_ascend`); it is compared against every node and updated whenever a new value supersedes it.

### Is a null-pointer guard required for every list value dereference?

Since every node is created by `q_insert_head` or `q_insert_tail`, which always allocate memory for the string value, is `if (!ptr) return NULL;` really required?

:::warning
Show your proposed work and expected benefits.
:::

### `q_shuffle`

My `q_shuffle` implements the Fisher-Yates shuffle. The node to be swapped is chosen with the standard library's `rand() % size`, where `size` is the number of nodes not yet fixed. Two special cases are checked: the target being the current node, and the current node's successor being the target.
```c
void q_shuffle(struct list_head *head)
{
    if (!head || list_empty(head))
        return;
    struct list_head *curr = head->next;
    int size = q_size(head);
    while (size > 1) {
        int num = rand() % size--;
        struct list_head *target, *next_temp = curr->next;
        for (target = curr; num > 0; target = target->next, num--)
            ;
        if (curr != target)
            list_swap(curr, target);
        if (next_temp != target)
            curr = next_temp;
    }
}
```

A helper function `list_swap()` was added, modeled on the Linux kernel's `list_swap()` in `list.h`. If `list_swap` is given the same pointer for both arguments, it breaks; if I can find a way to remove this limitation, the extra check in `q_shuffle` can be eliminated.

```c
static inline void list_swap(struct list_head *a, struct list_head *b)
{
    struct list_head *pos = b;
    list_del(b);
    // Replace a with b
    b->next = a->next;
    b->next->prev = b;
    b->prev = a->prev;
    b->prev->next = b;
    if (pos == a)
        pos = b;
    list_add(a, pos);
}
```

#### Randomness test

To test whether `q_shuffle()` returns a random ordering every time, I wrote a test function that shuffles the sequence 1, 2, 3, 4, 5 for N iterations. Every iteration starts from the same initial ordering. The result is dumped to an output file and processed by a Python script.

```c
for (int i = 0; i < test_size; i++) {
    struct list_head *temp = q_new();
    q_insert_tail(temp, "1");
    q_insert_tail(temp, "2");
    q_insert_tail(temp, "3");
    q_insert_tail(temp, "4");
    q_insert_tail(temp, "5");
    q_shuffle(temp);
    // Dump shuffle result to output file
    element_t *entry;
    list_for_each_entry (entry, temp, list) {
        fprintf(outfile, "%s", entry->value);
    }
    fprintf(outfile, "\n");
    fflush(outfile);
    if ((i + 1) % 1000 == 0)
        printf("Progress (%d/%d)\n", i, test_size);
    q_free(temp);
}
```

I ran 100000 iterations with the implementation in [commit a12d370](https://github.com/millaker/lab0-c/commit/a12d37044fc865695632671572a33765f6632d71). The result is shown below:

![image](https://hackmd.io/_uploads/Sy6o6zypa.png)

With 5! = 120 possible permutations, every permutation should occur about 8333 times to be truly random. However, we can clearly see that some orderings never occur, while one specific ordering got over 30,000 counts. I found that `list_swap()` inserted at the wrong position; the plot below shows the result after fixing the mistake in [commit 09cab71](https://github.com/millaker/lab0-c/commit/09cab71a4d52e5dd4268f2198764cd0f8ac1b5d4):

![image](https://hackmd.io/_uploads/Hy8xzS1Tp.png)

Now every possible outcome is near the expected count of 8333. After adding more iterations to the test, the counts become even more evenly distributed:

![image](https://hackmd.io/_uploads/Sy3T7Bypa.png)

In theory, the Fisher-Yates shuffle can be thought of as picking one element from the remaining list and putting it at the current position, so the quality of the shuffle is determined by how "random" the choice of target is. From the man page rand(3):

> The rand() function returns a pseudo-random integer in the range 0 to RAND_MAX inclusive (i.e., the mathematical range [0, RAND_MAX]).

As long as the list size is much smaller than `RAND_MAX`, the probability of each entry being picked is nearly uniform.

## Web server and linenoise issue

To understand why linenoise stops working after the `web` command executes, I studied:
1. How linenoise works
2. How the `qtest` console interprets a command

### linenoise execution flow

`run_console` in `console.c` calls linenoise's main entry point `linenoise()` and hands terminal handling over to linenoise. The internal execution flow is, in brief, `linenoise()` -> `line_raw()` -> `line_edit()`. A short description of each function:
- `linenoise()`: main entry; returns the characters obtained from the user
- `line_raw()`: switches the terminal into raw mode (non-canonical, among other things)
- `line_edit()`: the main editing capabilities live here; it reads one character at a time with a blocking `read()` and handles features like autocompletion and history
The `read()` call is shown below. When this function returns, the passed-in `char *buf` has been populated with the user's input.

```c
nread = read(l.ifd, &c, 1);
if (nread <= 0)
    return l.len;

/* Only autocomplete when the callback is set. It returns < 0 when
 * there was an error reading from fd. Otherwise it will return the
 * character that should be handled next. */
if (c == 9 && completion_callback != NULL) {
    c = complete_line(&l);
    /* Return on errors */
    if (c < 0)
        return l.len;
    /* Read next character when 0 */
    if (c == 0)
        continue;
}
... // Process other editing features
```

---

### `web` command

The `web` command simply opens a listening socket and assigns `false` to `use_linenoise`. Looking at `run_console()`, we can see that this assignment breaks the while loop and switches to another command-processing routine, `cmd_select()`, which supports both `web_fd` and stdin inputs, instead of `linenoise()`.

```c
if (!has_infile) {
    char *cmdline;
    while (use_linenoise && (cmdline = linenoise(prompt))) {
        interpret_cmd(cmdline);
        line_history_add(cmdline);       /* Add to the history. */
        line_history_save(HISTORY_FILE); /* Save the history on disk. */
        line_free(cmdline);
        while (buf_stack && buf_stack->fd != STDIN_FILENO)
            cmd_select(0, NULL, NULL, NULL, NULL);
        has_infile = false;
    }
    if (!use_linenoise) {
        while (!cmd_done())
            cmd_select(0, NULL, NULL, NULL, NULL);
    }
} else {
    while (!cmd_done())
        cmd_select(0, NULL, NULL, NULL, NULL);
}
```

To support both linenoise's line-editing features and web socket input, I need to change how linenoise reads characters in `line_edit()`.

### Handle multiple input sources in `line_edit`

:::danger
Refine the subjects. Avoid mentioning "modify" some files. Make it informative!
:::

In `line_edit`, the file descriptors to be monitored are collected and passed to `select()`. `select` is given no timeout, so it blocks indefinitely until an fd is ready.
In the original character-receiving loop, a check is added for whether the web interface is enabled and `web_fd` is set. The web operations are copied from `cmd_select()` in `console.c`: the received string is copied into the buffer and its length is returned.

```c
/* line_edit() */
...
// Construct readfds for select
int nfds = 0;
fd_set readfds;
FD_ZERO(&readfds);
FD_SET(l.ifd, &readfds);
if (web_enabled && web_fd != -1) {
    FD_SET(web_fd, &readfds);
}
if (l.ifd >= nfds) {
    nfds = l.ifd + 1;
}
if (web_fd >= nfds) {
    nfds = web_fd + 1;
}
...
// In while loop
// Wait for input indefinitely
int result = select(nfds, &readfds, NULL, NULL, 0);
if (result <= 0)
    return 0;
// Check is web
if (web_enabled && FD_ISSET(web_fd, &readfds)) {
    FD_CLR(web_fd, &readfds);
    struct sockaddr_in clientaddr;
    socklen_t clientlen = sizeof(clientaddr);
    int web_connfd =
        accept(web_fd, (struct sockaddr *) &clientaddr, &clientlen);
    char *p = web_recv(web_connfd, &clientaddr);
    char *buffer = "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n";
    web_send(web_connfd, buffer);
    close(web_connfd);
    if (p) {
        l.len = strlen(p);
        strncpy(l.buf, p, l.len + 1);
        free(p);
    } else {
        l.len = 0;
    }
    return l.len;
}
```

### Always use the `linenoise()` entry point instead of `cmd_select()`

Since the web feature is now integrated into linenoise, the variable `use_linenoise` is deleted and `run_console` always uses linenoise as the input source when there is no input file.

```c
while ((cmdline = linenoise(prompt))) {
    interpret_cmd(cmdline);
    line_history_add(cmdline);       /* Add to the history. */
    line_history_save(HISTORY_FILE); /* Save the history on disk. */
    line_free(cmdline);
    while (buf_stack && buf_stack->fd != STDIN_FILENO)
        cmd_select(0, NULL, NULL, NULL, NULL);
    has_infile = false;
}
```

### `console.c`: the `prompt_flag` variable is useless

In `console.c`, `prompt_flag` is used only at line 583 and no other file can modify its value; therefore it can be removed.

```c
// line 24
static bool prompt_flag = true;
...
// line 583
if (infd == STDIN_FILENO && prompt_flag) {
    printf("%s", prompt);
    fflush(stdout);
    prompt_flag = true;
}
```

## *Dude, is my code constant time?* reading and implementation

The approach to detecting timing leakage consists of three steps:
1. Measure execution time
2. Apply post-processing
3. Apply a statistical test (t-test)

Post-processing includes cropping large measurements, because the main process may be interrupted by the OS or other activities. The paper chooses Welch's t-test, which tests the equivalence of two means; failure of this test indicates a timing leakage (i.e., the code is not constant time).

## Xorshift PRNG

> Test in [commit 1f48bfb](https://github.com/millaker/linux2024-quiz/commit/1f48bfb5c7f2bff39bc78bb981c028c802d442c1)

Since the [wiki page](https://en.wikipedia.org/wiki/Xorshift) on xorshift PRNGs only gives algorithms for 32-bit, 64-bit, and 128-bit generators, I needed to construct an 8-bit generator for `random_byte()`. My first thought was to use the low 8 bits of `xorshift32()`:

```c
uint32_t xorshift32()
{
    /* Algorithm "xor" from p. 4 of Marsaglia, "Xorshift RNGs" */
    static uint32_t x = 1;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

uint32_t xorshift8()
{
    return xorshift32() & 0xFF;
}
```

To check the randomness of this method, I generated 100,000,000 numbers and compared the resulting Shannon entropy against the expected value using a Python script:

```
Expected :5.541263545158426
Result   :5.523522120891769
```

![count](https://hackmd.io/_uploads/rJknwz1kA.png)

### xorshift-based `randombyte()`

> [commit 05c5a23](https://github.com/millaker/lab0-c/commit/05c5a2319ae8553be3488132e88b110675db9242)

The new `randombyte()` uses the `xorshift8()` described above. A new option, `prng`, activates the xorshift-based `randombyte()`.

## How does `qtest` detect time limits?

`qtest` uses `sigsetjmp()` and `siglongjmp()` pairs to achieve this.
The first call to `exception_setup()` saves the current stack and CPU state into a `jmp_buf` for later `siglongjmp()` calls. At the same time, an `alarm()` is set to trigger one second later. According to the manual page:

> setjmp() and sigsetjmp() return 0 if returning directly, and nonzero when returning from longjmp(3) or siglongjmp(3) using the saved context.

This lets us distinguish a direct call from a return via `longjmp`. When the program receives a signal, irrespective of its type, a custom exception handler is called that `longjmp()`s back to the previously saved point. `alarm(0)` is executed to cancel any previously scheduled alarm.

> [alarm](https://man7.org/linux/man-pages/man2/alarm.2.html): If seconds is zero, any pending alarm is canceled.

## ttt

### Fixed point

To choose the most appropriate scaling factor for my own fixed-point representation, I tracked the `double score` variable used in `mcts.c:select_move()`.

```
1.893018e+00 1.893018e+00 1.893018e+00 2.893018e+00 2.893018e+00
1.797693e+308 1.797693e+308 1.797693e+308 1.797693e+308 1.797693e+308
1.797693e+308 1.797693e+308 1.797693e+308 1.797693e+308 1.797693e+308
1.972770e+00 1.972770e+00 1.972770e+00 2.972770e+00 2.972770e+00
2.972770e+00 1.797693e+308 1.797693e+308 1.797693e+308 1.797693e+308
```

Above is a small part of `score` during `mcts` execution. Aside from the normal values, there are many occurrences of the number `1.797693e+308`, which is the maximum value ($1.797693 \times 10^{308} \approx 2^{1024}$) representable by an IEEE 754 double. I therefore clipped values larger than $10^{100}$ or smaller than $-10^{100}$ and plotted the distribution:

![double](https://hackmd.io/_uploads/SJ6XBxNeC.png)

The remaining values all lie between 0 and 3.5, which means the precision of the fractional part matters more than the range of the integer part in this use case.
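A value range of [0, 3.5] with fraction-heavy precision needs suggests keeping most bits below the binary point. Below is a minimal sketch of fixed-point arithmetic with 30 fractional bits; the `fix_t` type, the 64-bit storage, and the `__int128` intermediate (a GCC extension) are my assumptions for illustration, not the actual lab0-c code:

```c
#include <stdint.h>

#define FIX_SHIFT 30 /* 30 fractional bits */

typedef int64_t fix_t; /* assumed 64-bit storage */

static inline fix_t fix_from_int(int64_t x)
{
    return x << FIX_SHIFT;
}

/* Addition and subtraction need no re-scaling. */
static inline fix_t fix_add(fix_t a, fix_t b)
{
    return a + b;
}

/* A product carries 60 fractional bits, so shift back down by 30.
 * The 128-bit intermediate (GCC extension) avoids overflow. */
static inline fix_t fix_mul(fix_t a, fix_t b)
{
    return (fix_t) (((__int128) a * b) >> FIX_SHIFT);
}

/* Pre-shift the dividend so the quotient keeps 30 fractional bits. */
static inline fix_t fix_div(fix_t a, fix_t b)
{
    return (fix_t) (((__int128) a << FIX_SHIFT) / b);
}
```

With 30 fractional bits the smallest representable step is $2^{-30} \approx 9.3 \times 10^{-10}$, plenty for scores that stay below 3.5.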
> commit [10b7951](https://github.com/millaker/lab0-c/commit/10b7951fba43f95af6f0b1abb7e41e6ce64a292b)

Based on the analysis above, I chose 30 as the scaling factor for my fixed-point representation. The four basic arithmetic operations can be implemented as the normal integer operations, with re-scaling shifts where required. Three extra functions are provided: `fix_sqrt()` for the fixed-point square root, `fix_log2()` for the base-2 logarithm, and `fix_log10()` for the base-10 logarithm. `fix_sqrt()` uses digit-by-digit calculation. `fix_log2()`'s algorithm follows http://www.claysturner.com/dsp/binarylogarithm.pdf. `fix_log10()` is calculated as `fix_log2(x)` divided by $\log_2 10$, which follows from the change-of-base identity

$$\log_a b = \frac{\log_2 b}{\log_2 a}$$

To check that the implementation maps back to `double` correctly, I wrote a simple test, `test_fix.c`.

![dvf_sqrt](https://hackmd.io/_uploads/rkgEAdVg0.png)
![dvf_log](https://hackmd.io/_uploads/rJuEAuVxR.png)
![dvf_log10](https://hackmd.io/_uploads/SJLvCO4g0.png)

From the three figures above, we can see that `sqrt` and `log2` are nearly identical to `double` arithmetic in C. For `log10`, however, some precision is lost in the $\frac{\log_2 b}{\log_2 a}$ division: the quotient was truncated at the decimal point, losing all of its fractional precision.

### Bot vs Bot gameplay

To process keystrokes, refresh the screen, and simulate a bot-vs-bot game at the same time, user-space coroutine support is integrated. The implementation follows [preempt_sched.c](https://github.com/sysprog21/concurrent-programs/tree/master/preempt_sched): it uses `ucontext` and schedules the tasks in round-robin order. A timer simulates the hardware timer a kernel would use; every time the specified interval elapses, a `SIGALRM` signal is delivered to the process, triggering the `timer_handler()` signal handler, which performs the next scheduling decision.
When testing the program, I occasionally encountered error messages like:

```
malloc.c:3096: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.
Aborted
```

or

```
corrupted doubly-linked list
```

At first, I thought the error came from my use of the user-space coroutines. I inspected several places where memory operations happen intensively, such as drawing the screen buffer or `realloc`ing it, but the problem persisted. The message `corrupted doubly-linked list` misled me into searching for a problem in the maintained task list; it actually refers to the doubly linked list used internally by the C memory allocator. The heap was being corrupted by a stack overflow.

```c
static struct task_struct *task_alloc(task_callback_t *func, void *arg)
{
    struct task_struct *task = calloc(1, sizeof(*task));
    task->stack = calloc(1, 1 << 20);
    task->callback = func;
    task->arg = arg;
    return task;
}
```

The original implementation used `1 << 20` bytes, or 1 MB, for each task's stack. I changed the stack size to `1 << 25` bytes (32 MB), and the problem never occurred again.

The screen drawing method follows [mazu-editor](https://github.com/jserv/mazu-editor): a screen buffer is first populated with the required information and then written to the terminal with a single `write()` each cycle.

The two AI algorithms in the match are MCTS and negamax; they run in the same task since they take turns evaluating the board. A quick demo is shown below:

{%youtube dT5qhnbDhbo%}

Note that pressing `<Ctrl-P>` locks the board until it is pressed again, and `<Ctrl-Q>` cleans everything up and ends the simulation immediately.