# 2020q3 Homework1 (lab0)
contributed by `guaneec`
###### tags: `sysprog2020`
[GitHub repo](https://github.com/guaneec/sysprog-2020q3-lab0-c)
## Logs
I finished the lab (100/100) before creating this document. So this will be me trying to replay my thoughts through the git commits.
### [Initial fork](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/eaa393ef05d11af57a7a5e923aad23834917e063)
Skimmed through the [description](https://hackmd.io/@sysprog/2020-lab0) and decided to "just code it".
Ran `make test`. Found that `trace-17-complexity` takes a long time to run and sometimes freezes my remote machine. Commented out trace 17 in `driver.py` but kept it unstaged.
Current score: 0/95 (trace-17 commented out)
### [Implement q_insert_head](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/36675507ec70bcf99743f91fc5fafcb46c8f0cf3)
### [Implement q_remove_head](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/5f202f34e0f24e402e10e4a8594f06802bfb8802)
Figured I probably should make `trace-1` pass first. Saw `ih` and `rh` and guessed, without reading the code, that they mean `insert head` and `remove head`. Implemented `q_insert_head` and `q_remove_head`. `trace-1` passed, and so did traces 7, 8, and 9, which was surprising.
Felt that the comments above the functions were really useful. The code almost writes itself.
Current score: 24/95
### [Add tail to queue and implement q_insert_tail](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/36843b223d6e9b868e9a86a3a9176e192c838b5b)
`it`, insert tail it is. The cached tail allows some constant time operations. This passes trace-2.
Current score: 30/95
### [Implement q_reverse](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/a3b06e0bf575f3fdf67bf4003209f398f0b04061)
Passed `trace-3`
Current score: 36/95
### [Fix q_new to initialize tail](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/c47139efd19a827202a8dfd68a4ab4322f4a56f7)
Found out that I forgot to update `q_new`. This was not caught by the tests.
Current score: 36/95
### [Implement q_size](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/b24a575dbda70d24c3208be19334b010a72755d3)
Errors related to size are gone. The score is unchanged, though.
Current score: 36/95
### [Implement q_free](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/d4be78dd8bd56e808704fb4ec0bd569174b919ac)
Removed errors `Freed queue, but x blocks are still allocated`
Passed traces 12, 13, 14
Current score: 54/95
### [Implement q_sort](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/32e902013c2094ca021b9b5ef855efd8f9fab27b)
This might be the most complicated part of this lab. Followed [an algorithm outline](https://www.chiark.greenend.org.uk/~sgtatham/algorithms/listsort.html) to implement a non-recursive bottom-up Merge Sort.
Passed 16 but not 15. Removed the `time_limit` to find out that trace 15 runs in ~1.5 s on my machine. Bummer.
Passed 4, 5, 16
Current score: 71/95
### [Fix q_remove_head not terminating partial strings](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/87b15bd0d5cb4d0ff921152277e1f5074a68dcc5)
Time to remove the Xs in trace 6. Learned that `strncpy` does not null-terminate the destination when the source is longer than the size limit.
Passed 6
Current score: 77/95
### [Fix q_new to handle failure](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/8b3238fd0424d7f8260ca40eb971a22aead2e09a)
Passed 10
Current score: 83/95
### [Fix q_insert_head to handle failures](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/1561816a2d91a41a72d4efb558416dfcf46b8bf9)
Passed 11
Current score: 89/95
### [Rewrite merge sort for better performance](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/a504cbc15bd03d18fd60f051f613cb0ba21c9511)
Adapted the [top-down recursive merge sort](https://www.geeksforgeeks.org/merge-sort-for-linked-list/) from geeksforgeeks. Passed 15. Why tho.
**TODO**: investigate
Passed 15 (and 17)
Current score: 100/100
### [Fix calculation of loop length](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/8563352597fe025016ccdeb44eff06e6f336d9da)
This bug made the input size unreasonably small and sped up simulations (x100). Glad I found it.
## Code
### q_new
```cpp
/*
* Create empty queue.
* Return NULL if could not allocate space.
*/
queue_t *q_new()
{
queue_t *q = malloc(sizeof(queue_t));
if (!q)
return NULL;
q->head = NULL;
q->tail = NULL;
q->size = 0;
return q;
}
```
### q_free
```cpp
/* Free all storage used by queue */
void q_free(queue_t *q)
{
if (!q)
return;
list_ele_t *p = q->head;
while (p) {
list_ele_t *tmp = p;
p = p->next;
free(tmp->value);
free(tmp);
}
free(q);
}
```
### q_insert_head
```cpp
/*
* Attempt to insert element at head of queue.
* Return true if successful.
* Return false if q is NULL or could not allocate space.
* Argument s points to the string to be stored.
* The function must explicitly allocate space and copy the string into it.
*/
bool q_insert_head(queue_t *q, char *s)
{
list_ele_t *newh;
if (!q)
return false;
newh = malloc(sizeof(list_ele_t));
/* Don't forget to allocate space for the string and copy it */
if (!newh)
return false;
newh->value = strdup(s);
/* What if either call to malloc returns NULL? */
if (!newh->value) {
free(newh);
return false;
}
if (!q->head) {
q->tail = newh;
}
newh->next = q->head;
q->head = newh;
++q->size;
return true;
}
```
### q_insert_tail
```cpp
/*
* Attempt to insert element at tail of queue.
* Return true if successful.
* Return false if q is NULL or could not allocate space.
* Argument s points to the string to be stored.
* The function must explicitly allocate space and copy the string into it.
*/
bool q_insert_tail(queue_t *q, char *s)
{
if (!q)
return false;
list_ele_t *e = malloc(sizeof(list_ele_t));
if (!e)
return false;
e->value = strdup(s);
e->next = NULL;
if (!e->value) {
free(e);
return false;
}
if (!q->tail) {
q->head = q->tail = e;
} else {
q->tail->next = e;
}
q->tail = e;
++q->size;
return true;
}
```
### q_remove_head
```cpp
/*
* Attempt to remove element from head of queue.
* Return true if successful.
* Return false if queue is NULL or empty.
* If sp is non-NULL and an element is removed, copy the removed string to *sp
* (up to a maximum of bufsize-1 characters, plus a null terminator.)
* The space used by the list element and the string should be freed.
*/
bool q_remove_head(queue_t *q, char *sp, size_t bufsize)
{
if (!q || !q->head)
return false;
if (sp) {
strncpy(sp, q->head->value, bufsize);
sp[bufsize - 1] = '\0';
}
list_ele_t *e = q->head;
q->head = q->head->next;
if (e->value)
free(e->value);
free(e);
if (!q->head)
q->tail = NULL;
--q->size;
return true;
}
```
### q_size
```cpp
/*
* Return number of elements in queue.
* Return 0 if q is NULL or empty
*/
int q_size(queue_t *q)
{
if (!q || !q->head)
return 0;
return q->size;
}
```
### q_reverse
```cpp
/*
* Reverse elements in queue
* No effect if q is NULL or empty
* This function should not allocate or free any list elements
* (e.g., by calling q_insert_head, q_insert_tail, or q_remove_head).
* It should rearrange the existing ones.
*/
void q_reverse(queue_t *q)
{
if (!q || !q->head)
return;
list_ele_t *oldhead = q->head;
list_ele_t *prev = NULL, *cur = q->head, *next;
while (cur->next) {
next = cur->next;
cur->next = prev;
prev = cur;
cur = next;
}
cur->next = prev;
q->head = cur;
q->tail = oldhead;
}
```
### q_sort
```cpp
// bisect the list
void split_list(list_ele_t *src, list_ele_t **front, list_ele_t **back)
{
list_ele_t *fast = src->next, *slow = src;
while (fast) {
fast = fast->next;
if (fast) {
slow = slow->next;
fast = fast->next;
}
}
*front = src;
*back = slow->next;
slow->next = NULL;
}
// merge 2 sorted lists
list_ele_t *merge_sorted(list_ele_t *a, list_ele_t *b)
{
list_ele_t root = {.next = NULL};
list_ele_t *tail = &root;
while (a || b) {
list_ele_t **c =
(!b || (a && strcmp(a->value, b->value) <= 0)) ? &a : &b;
if (!(root.next))
root.next = *c;
tail = tail->next = *c;
*c = (*c)->next;
}
return root.next;
}
// adapted from https://www.geeksforgeeks.org/merge-sort-for-linked-list/
// my version was just a bit too slow :'(
void mergesort(list_ele_t **headref)
{
list_ele_t *head = *headref;
list_ele_t *a, *b;
if (!head || !(head->next)) {
return;
}
split_list(head, &a, &b);
mergesort(&a);
mergesort(&b);
*headref = merge_sorted(a, b);
}
/*
* Sort elements of queue in ascending order
* No effect if q is NULL or empty. In addition, if q has only one
* element, do nothing.
*/
void q_sort(queue_t *q)
{
// merge sort
int n = q_size(q);
if (!n)
return;
mergesort(&q->head);
// update tail, probably could do this more efficiently
list_ele_t *c;
for (c = q->head; c->next; c = c->next)
;
q->tail = c;
}
```
## Valgrind
```bash
$ make valgrind
# Explicitly disable sanitizer(s)
make clean SANITIZER=0 qtest
make[1]: Entering directory '/home/dan/workspace/sysprog/lab0-c'
rm -f qtest.o report.o console.o harness.o queue.o random.o dudect/constant.o dudect/fixture.o dudect/ttest.o .qtest.o.d .report.o.d .console.o.d .harness.o.d .queue.o.d .random.o.d .dudect/constant.o.d .dudect/fixture.o.d .dudect/ttest.o.d *~ qtest /tmp/qtest.*
rm -rf .dudect
rm -rf *.dSYM
(cd traces; rm -f *~)
CC qtest.o
CC report.o
CC console.o
CC harness.o
CC queue.o
CC random.o
CC dudect/constant.o
CC dudect/fixture.o
CC dudect/ttest.o
LD qtest
make[1]: Leaving directory '/home/dan/workspace/sysprog/lab0-c'
cp qtest /tmp/qtest.01PRXq
chmod u+x /tmp/qtest.01PRXq
sed -i "s/alarm/isnan/g" /tmp/qtest.01PRXq
scripts/driver.py -p /tmp/qtest.01PRXq --valgrind
--- Trace Points
+++ TESTING trace trace-01-ops:
# Test of insert_head and remove_head
--- trace-01-ops 6/6
+++ TESTING trace trace-02-ops:
# Test of insert_head, insert_tail, and remove_head
--- trace-02-ops 6/6
+++ TESTING trace trace-03-ops:
# Test of insert_head, insert_tail, reverse, and remove_head
--- trace-03-ops 6/6
+++ TESTING trace trace-04-ops:
# Test of insert_head, insert_tail, size, and sort
--- trace-04-ops 6/6
+++ TESTING trace trace-05-ops:
# Test of insert_head, insert_tail, remove_head, reverse, size, and sort
--- trace-05-ops 5/5
+++ TESTING trace trace-06-string:
# Test of truncated strings
--- trace-06-string 6/6
+++ TESTING trace trace-07-robust:
# Test operations on NULL queue
--- trace-07-robust 6/6
+++ TESTING trace trace-08-robust:
# Test operations on empty queue
--- trace-08-robust 6/6
+++ TESTING trace trace-09-robust:
# Test remove_head with NULL argument
--- trace-09-robust 6/6
+++ TESTING trace trace-10-malloc:
# Test of malloc failure on new
--- trace-10-malloc 6/6
+++ TESTING trace trace-11-malloc:
# Test of malloc failure on insert_head
--- trace-11-malloc 6/6
+++ TESTING trace trace-12-malloc:
# Test of malloc failure on insert_tail
--- trace-12-malloc 6/6
+++ TESTING trace trace-13-perf:
# Test performance of insert_tail
--- trace-13-perf 6/6
+++ TESTING trace trace-14-perf:
# Test performance of size
--- trace-14-perf 6/6
+++ TESTING trace trace-15-perf:
# Test performance of insert_tail, size, reverse, and sort
--- trace-15-perf 6/6
+++ TESTING trace trace-16-perf:
# Test performance of sort with random and descending orders
# 10000: all correct sorting algorithms are expected pass
# 50000: sorting algorithms with O(n^2) time complexity are expected failed
# 100000: sorting algorithms with O(nlogn) time complexity are expected pass
--- trace-16-perf 6/6
+++ TESTING trace trace-17-complexity:
# Test if q_insert_tail and q_size is constant time complexity
Probably constant time
Probably constant time
--- trace-17-complexity 5/5
--- TOTAL 100/100
Test with specific case by running command:
scripts/driver.py -p /tmp/qtest.01PRXq --valgrind -t <tid>
```
No error in sight. Does this mean we're good?
### Massif
#### Input
```
option simulation 1
# custom parameter, limits number of measurements
option measures 1000
size
```
#### Memory graph
```
MB
1.224^ #
| @ # :
| @ # :
| @ # : ::
| @ # : : ::
| @ # : : : :
| @ # ::: : : : ::
| @ # :: : : : :
| @ # :: : : : :
| @ # : ::: ::: : : : :: :
| @ # : ::: : : : : : : : ::
| @ # : ::: : : : : : : : : :
| @ # : ::: : : : : :: : : : ::::: :
| @ ::# : ::: : : : : : : : : : ::: :
| :@ : : #: : ::: : : : : : : : : : ::: : :
| :@ : : #: : ::: : :::: :: : : : : : ::: : :
| :@ : : #::::: ::: : :: : :: : : : ::::: : ::: : :
| :@ :: : #:: :: ::::: : :: : :: : : : : : : : : ::: : :
| :@ :: : #:: :: ::::: : :: : :: : @: : : : : : :: : ::: @ :
| :::@:::: : #:: ::::::::: : :: : :: :: @: :: : : : ::::: ::::: ::: @::
0 +----------------------------------------------------------------------->Gi
0 2.904
```
The graph reflects the fact that the simulation is done by initializing queues with random lengths.
## Dude, is my code constant time?
This simply runs a t-test on the run time distributions of two different input classes -- fixed and random.
The current implementation runs 10 trials of dudect with 10000 samples each, returning true if any of the t-values is less than 10.
The measurements are done in the `measure` function in `constant.c`:
```cpp=
void measure(int64_t *before_ticks,
int64_t *after_ticks,
uint8_t *input_data,
int mode)
{
assert(mode == test_insert_tail || mode == test_size);
if (mode == test_insert_tail) {
for (size_t i = drop_size; i < number_measurements - drop_size; i++) {
char *s = get_random_string();
dut_new();
dut_insert_head(
get_random_string(),
*(uint16_t *) (input_data + i * chunk_size) % 10000);
before_ticks[i] = cpucycles();
dut_insert_tail(s, 1);
after_ticks[i] = cpucycles();
dut_free();
}
} else {
for (size_t i = drop_size; i < number_measurements - drop_size; i++) {
dut_new();
dut_insert_head(
get_random_string(),
*(uint16_t *) (input_data + i * chunk_size) % 10000);
before_ticks[i] = cpucycles();
dut_size(1);
after_ticks[i] = cpucycles();
dut_free();
}
}
}
```
The t-test tests if two distributions have the same mean.
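As a quick reference, here's how that statistic could be computed on two timing samples with scipy (a toy sketch with made-up arrays, not project data):
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# two made-up timing samples (cycles): fixed-input class vs random-input class
fixed = rng.normal(30, 5, 10000)
random_ = rng.normal(31, 5, 10000)

# Welch's t-test (the variant used here): does not assume equal variances
t, p = stats.ttest_ind(fixed, random_, equal_var=False)

# the same statistic by hand: t = (mu0 - mu1) / sqrt(s0^2/n0 + s1^2/n1)
mu0, mu1 = fixed.mean(), random_.mean()
v0, v1 = fixed.var(ddof=1), random_.var(ddof=1)
t_by_hand = (mu0 - mu1) / np.sqrt(v0 / len(fixed) + v1 / len(random_))

print(t, t_by_hand, p)  # the two t values agree
```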
### A better testing method
Let's have a look at how [the original dudect](https://github.com/oreparaz/dudect) tool works. We can build the binaries and see how they run:
```
$ ./dudect_cmpmemcmp_-O2
meas: 0.64 M, max t: +1.52, max tau: 1.91e-03, (5/tau)^2: 6.89e+06. For the moment, maybe constant time.
meas: 1.24 M, max t: +72.41, max tau: 6.50e-02, (5/tau)^2: 5.93e+03. Probably not constant time.
meas: 1.85 M, max t: +104.35, max tau: 7.68e-02, (5/tau)^2: 4.24e+03. Probably not constant time.
meas: 2.46 M, max t: +119.49, max tau: 7.62e-02, (5/tau)^2: 4.30e+03. Probably not constant time.
meas: 3.07 M, max t: +153.38, max tau: 8.75e-02, (5/tau)^2: 3.26e+03. Probably not constant time.
meas: 3.68 M, max t: +173.48, max tau: 9.04e-02, (5/tau)^2: 3.06e+03. Probably not constant time.
meas: 4.29 M, max t: +205.48, max tau: 9.93e-02, (5/tau)^2: 2.54e+03. Probably not constant time.
meas: 4.91 M, max t: +215.53, max tau: 9.73e-02, (5/tau)^2: 2.64e+03. Probably not constant time.
meas: 5.53 M, max t: +235.33, max tau: 1.00e-01, (5/tau)^2: 2.49e+03. Probably not constant time.
^C
```
If a non-constant-time algorithm is tested, the max t value grows as more samples are gathered. The tool stops on user interrupt or reaching a "bananas" t-value (500).
If a constant time algorithm is tested instead:
```
$ ./dudect_cmpct_-O2
meas: 1.00 M, max t: +2.72, max tau: 2.72e-03, (5/tau)^2: 3.37e+06. For the moment, maybe constant time.
meas: 0.98 M, max t: +2.58, max tau: 2.61e-03, (5/tau)^2: 3.68e+06. For the moment, maybe constant time.
meas: 2.00 M, max t: +2.32, max tau: 1.64e-03, (5/tau)^2: 9.26e+06. For the moment, maybe constant time.
meas: 3.03 M, max t: +1.87, max tau: 1.08e-03, (5/tau)^2: 2.16e+07. For the moment, maybe constant time.
meas: 3.01 M, max t: +1.86, max tau: 1.07e-03, (5/tau)^2: 2.17e+07. For the moment, maybe constant time.
meas: 5.97 M, max t: +1.59, max tau: 6.49e-04, (5/tau)^2: 5.94e+07. For the moment, maybe constant time.
meas: 6.97 M, max t: +1.53, max tau: 5.78e-04, (5/tau)^2: 7.48e+07. For the moment, maybe constant time.
meas: 7.97 M, max t: +1.87, max tau: 6.63e-04, (5/tau)^2: 5.70e+07. For the moment, maybe constant time.
meas: 8.97 M, max t: +2.07, max tau: 6.91e-04, (5/tau)^2: 5.24e+07. For the moment, maybe constant time.
meas: 9.96 M, max t: +2.03, max tau: 6.42e-04, (5/tau)^2: 6.06e+07. For the moment, maybe constant time.
meas: 10.96 M, max t: +2.31, max tau: 6.97e-04, (5/tau)^2: 5.15e+07. For the moment, maybe constant time.
meas: 11.96 M, max t: +2.71, max tau: 7.83e-04, (5/tau)^2: 4.08e+07. For the moment, maybe constant time.
meas: 12.95 M, max t: +2.51, max tau: 6.96e-04, (5/tau)^2: 5.16e+07. For the moment, maybe constant time.
...
```
max t floats around 0 and never grows above the threshold (10).
In theory, as $N \to \infty$, $|t| \to \infty$ for any two distributions with unequal means, and $t \sim \mathcal{N}(0, 1)$ otherwise.
As mentioned above, the current implementation runs 10 trials with 10000 samples each, returning true if any of the t-values is less than 10. However, as we can see in the output of `./dudect_cmpmemcmp_-O2`, `max t` grows as `meas` increases. Each of the 10 t-values is likely to be much smaller than the single value a trial with 100000 samples would return, so false positives (reporting constant time when it isn't) are likely. Furthermore, since there is early-exit logic, this effectively takes the minimum of all t-values.
Repeated small runs decrease the test's effectiveness; it's better to combine all measurement results.
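A toy simulation (synthetic numbers, not measurement data) illustrates why splitting hurts: with a small but real difference in means, every 10000-sample trial stays under the threshold while a single combined test does not.
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# two classes whose true means differ slightly, i.e. not constant time
a = rng.normal(30.0, 10.0, 100000)
b = rng.normal(30.7, 10.0, 100000)

# ten independent trials of 10000 samples each
ts = [abs(stats.ttest_ind(a[i * 10000:(i + 1) * 10000],
                          b[i * 10000:(i + 1) * 10000],
                          equal_var=False).statistic)
      for i in range(10)]

# one combined trial with all 100000 samples
t_all = abs(stats.ttest_ind(a, b, equal_var=False).statistic)

print(max(ts), t_all)
# each per-trial |t| stays well below the threshold of 10 ("constant time"),
# while the combined |t| is around 15 and correctly flags the difference
```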
### Equalizing setup code
The problem here is that in lines 11 and 22 of `measure`,
```
dut_insert_head(
get_random_string(),
*(uint16_t *) (input_data + i * chunk_size) % 10000);
```
does a different amount of work depending on the input class. Although it might not seem to be a problem judging purely from the code, due to modern CPU magic (branch prediction), the setup code can actually affect the timing.
To deal with this, we simply have to run similar setup code no matter what the input is.
```cpp
// updated to equalize setup work
void measure(int64_t *before_ticks,
int64_t *after_ticks,
uint8_t *input_data,
uint8_t *classes,
int mode)
{
assert(mode == test_insert_tail || mode == test_size);
if (mode == test_insert_tail) {
for (size_t i = drop_size; i < number_measurements - drop_size; i++) {
char *s = get_random_string();
queue_t *q0 = q_new();
queue_t *q1 = q_new();
queue_t *q = classes[i] ? q1 : q0;
            // always fill q1 with the same (random) amount of work,
            // regardless of which class this measurement belongs to
            int n = *(uint16_t *) (input_data + i * chunk_size) % 10000;
            for (int j = 0; j < n; ++j) {
                q_insert_head(q1, get_random_string());
            }
before_ticks[i] = cpucycles();
q_insert_tail(q, s);
after_ticks[i] = cpucycles();
q_free(q0);
q_free(q1);
}
} else {
for (size_t i = drop_size; i < number_measurements - drop_size; i++) {
queue_t *q0 = q_new();
queue_t *q1 = q_new();
queue_t *q = classes[i] ? q1 : q0;
            int n = *(uint16_t *) (input_data + i * chunk_size) % 10000;
            for (int j = 0; j < n; ++j) {
                q_insert_head(q1, get_random_string());
            }
before_ticks[i] = cpucycles();
q_size(q);
after_ticks[i] = cpucycles();
q_free(q0);
q_free(q1);
}
}
}
```
Here, 2 queues are initialized and `q1` is always filled with a random number of items. The only thing that depends on the input class is the queue to operate on (`q`), which can be selected with minimal overhead.
I overwrote the old `measure` and made the old method available behind an `old_measure` console parameter.
Experimental data suggesting this works (tested on [this commit](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/8563352597fe025016ccdeb44eff06e6f336d9da)):
```
# === OLD METHOD ===
$ time ./qtest < i.txt
cmd> cmd> option old 1
cmd> option simulation 1
cmd> option measures 100000
cmd> new
q = []
cmd> cmd> size
Testing size...
meas: 0.10 M,
max t: +49.62, max tau: 1.57e-01, (5/tau)^2: 1.02e+03, mu0: 3.08e+01, mu1: 2.25e+01, dmu: -8.27e+00, s0: 3.45e+01, s1: 3.45e+01, m20: 5.95e+07, m21: 1.02e+07.
ERROR: Probably not constant time
real 0m21.075s
user 0m20.838s
sys 0m0.237s
```
```
# === NEW METHOD ===
$ time ./qtest < i.txt
cmd> cmd> # option old 1
cmd> option simulation 1
cmd> option measures 100000
cmd> new
q = []
cmd> cmd> size
Testing size...
meas: 0.10 M,
max t: +1.09, max tau: 3.46e-03, (5/tau)^2: 2.09e+06, mu0: 2.41e+01, mu1: 2.29e+01, dmu: -1.17e+00, s0: 2.39e+02, s1: 2.39e+02, m20: 2.85e+09, m21: 2.94e+06.
Probably constant time
real 0m44.938s
user 0m44.654s
sys 0m0.284s
```
This takes about twice as long to run (time spent inserting useless nodes), but gives a much better t-value. With the old method, inputs with length 0 actually take much more compute time than inputs with random lengths.
### Outliers heavily skew the t-score
This is probably why many others still pass trace-17 even though the measurement method is flawed.
Here I gathered measurement data from a ~100000-sample run.
```ipython
In [1]: import scipy.stats
In [2]: import numpy as np
In [3]: a = np.loadtxt('meas.txt')
In [4]: ps = {i: np.percentile(a, i) for i in range(101)}
In [5]: ps
Out[5]:
{0: 18.0,
[...]
40: 18.0,
41: 24.0,
[...]
98: 24.0,
99: 45.0,
100: 40242.0}
```
Most of the values are quite concentrated, but there are also some obvious outliers.
If we take only the normal values, then the t-test is quite sensitive to mean differences:
```ipython
In [5]: scipy.stats.ttest_ind(a[a<50], a[a<50]+1, equal_var=False)
Out[5]: Ttest_indResult(statistic=-62.52779779889665, pvalue=0.0)
```
However, if we don't filter the outliers out, the t-value becomes really low since the variances are inflated:
```ipython
In [6]: scipy.stats.ttest_ind(a[a<50], a+1, equal_var=False)
Out[6]: Ttest_indResult(statistic=-3.878930485333941, pvalue=0.00010498372340539558)
In [7]: scipy.stats.ttest_ind(a, a+1, equal_var=False)
Out[7]: Ttest_indResult(statistic=-1.2852869614896472, pvalue=0.1986935115874919)
```
In other words, when outliers are present, the t-test is more likely to report false positives.
The original dudect combats this by running multiple t-tests with different cut-off thresholds and selecting the one that yields the maximum t-statistic. This is also why the report log prints `max t` instead of `t`.
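The idea can be sketched in a few lines (my rough reconstruction, not dudect's actual code): run the same Welch test on several upper-percentile croppings of the data and keep the largest |t|.
```python
import numpy as np
from scipy import stats

def max_t(x0, x1, percentiles=(100, 99, 98, 95, 90)):
    """Crop both classes at several upper percentiles of the pooled data
    and return the largest |t| seen -- a rough sketch of dudect's strategy."""
    pooled = np.concatenate([x0, x1])
    best = 0.0
    for p in percentiles:
        threshold = np.percentile(pooled, p)
        a, b = x0[x0 <= threshold], x1[x1 <= threshold]
        best = max(best, abs(stats.ttest_ind(a, b, equal_var=False).statistic))
    return best
```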
### Constant time != Equal time
O(1) time complexity means that the compute time is bounded as $n \to \infty$. It's perfectly acceptable to treat $n=0$ or $n=1$ as special cases with different running times. But since the t-test only compares the means of the distributions, algorithms with special cases fail the test. In this case, $n=0$ is a special case in my implementation of `q_insert_tail`, so given a large enough sample size, dudect reports that it isn't constant time.
### `drop_size`
Currently, the constant `drop_size` doesn't do anything meaningful -- it merely pads the array that stores measurement results:
```cpp
for (size_t i = drop_size; i < number_measurements - drop_size; i++)
/* do measure i */
```
I guess this is actually meant to implement [discarding the first and last results](https://github.com/oreparaz/dudect/blob/5d3d92d45c335bff1cc6a27e276f7c406b7e89e4/src/fixture.c#L83), as suggested in a comment by dudect's authors:
> // XXX we could throw away the first 1e4 and the last 1e4 measurements,
// to minimize measurement noise. test if this actually improves anything.
Let's test this. We can plot the frequency of measurement outliers against their index in the measurement loop above:

It's then clear that the first measurement in each batch is much more likely to be an outlier.
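Something like the following can produce such a plot (a sketch; the dump file `meas_indexed.txt`, containing one `index_in_batch cycles` pair per measurement, is hypothetical instrumentation, not part of the repo):
```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical dump: one "index_in_batch cycles" pair per measurement
idx, cycles = np.loadtxt('meas_indexed.txt', unpack=True)
idx = idx.astype(int)

# call anything above the 99th percentile an outlier
outlier = cycles > np.percentile(cycles, 99)

# fraction of outliers at each position in the measurement loop
freq = np.bincount(idx[outlier], minlength=idx.max() + 1) / np.bincount(idx)
plt.plot(freq)
plt.xlabel('index in batch')
plt.ylabel('outlier frequency')
plt.show()
```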
[The code to fix this is simple](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/01be8e90d5d5fbd45eacd5a18f1b15a4005c973c). A graph reveals that this does work.

---
Added after deadline
### Deciding the length of random input
The current input value for class 1 (the random class) is drawn from an (approximately) uniform distribution $U\{0,9999\}$. Do we actually need up to 9999 items to test constant-ness? Does it even need to be random? After all, what we wish to do is detect whether $T(0) \neq T(n)$ for some $n \neq 0$. It seems safe to set $n=1$ instead of a random value, since usually $T(0) \neq T(1) \Leftrightarrow T(0) \neq T(n)$. If one wants to be extra safe, one could test against, say, $T(10)$ instead. This speeds up the measurement rate by *a lot*.
* 0 vs U(0, 9999)
```
$ time ./qtest -f i.txt
cmd> # class 1 ~ U(0-9999)
cmd> option simulation 1
cmd> option measures 10000
cmd> size
Testing size...
meas: 0.01 M,
max t [0]: +1.08, max tau: 1.08e-02, (5/tau)^2: 2.16e+05, mu0: 2.24e+01, mu1: 2.25e+01, dmu: 1.86e-01, s0: 6.33e+00, s1: 6.33e+00, m20: 2.04e+05, m21: 5.36e+05.
Probably constant time
real 0m6.140s
user 0m6.112s
sys 0m0.028s
```
* 0 vs 1
```
$ time ./qtest -f i.txt
cmd> # class 1 = 1
cmd> option simulation 1
cmd> option measures 10000
cmd> size
Testing size...
meas: 0.01 M,
max t [0]: +1.12, max tau: 1.12e-02, (5/tau)^2: 2.00e+05, mu0: 3.83e+01, mu1: 3.78e+01, dmu: -5.53e-01, s0: 2.55e+01, s1: 2.55e+01, m20: 3.25e+06, m21: 2.86e+06.
Probably constant time
real 0m0.050s
user 0m0.023s
sys 0m0.028s
```
The major upside of this is that we can now gather many more samples and detect even the tiniest difference in sample means.
Consider the `q_size` function. A version I wrote contains a redundant check on `q->head`:
```cpp
int q_size(queue_t *q)
{
if (!q || !q->head)
return 0;
return q->size;
}
```
With the redundant check, the test reports that it isn't constant time (given enough samples):
:::spoiler q_size 0 vs 1: t=13.51
```
$ time ./qtest -f i.txt
cmd> # class 1 = 1
cmd> option simulation 1
cmd> option measures 10000000
cmd> size
Testing size...
meas: 10.00 M,
max t [1]: +13.51, max tau: 4.27e-03, (5/tau)^2: 1.37e+06, mu0: 3.49e+01, mu1: 3.50e+01, dmu: 7.77e-02, s0: 9.09e+00, s1: 9.09e+00, m20: 4.13e+08, m21: 4.13e+08.
ERROR: Probably not constant time
real 0m39.793s
user 0m13.160s
sys 0m26.632s
```
:::
If we remove the check:
```cpp
int q_size(queue_t *q)
{
if (!q)
return 0;
return q->size;
}
```
we get a positive result:
:::spoiler q_size 0 vs 1: t=1.47
```
$ time ./qtest -f i.txt
cmd> # 0 vs 1
cmd> option simulation 1
cmd> option measures 10000000
cmd> size
Testing size...
meas: 10.00 M,
max t [73]: +1.47, max tau: 4.64e-04, (5/tau)^2: 1.16e+08, mu0: 3.20e+01, mu1: 3.20e+01, dmu: -6.87e-03, s0: 7.42e+00, s1: 7.42e+00, m20: 2.75e+08, m21: 2.74e+08.
Probably constant time
real 0m40.202s
user 0m12.897s
sys 0m27.304s
```
:::
Now our test is so sensitive, the test on `q_insert_tail` fails miserably:
:::spoiler q_insert_tail 0 vs 1: t=1286.31
```
$ time ./qtest -f i.txt
cmd> # 0 vs 1
cmd> option simulation 1
cmd> option measures 10000000
cmd> it
Testing insert_tail...
meas: 10.00 M,
max t [1]: +1286.31, max tau: 4.07e-01, (5/tau)^2: 1.51e+02, mu0: 2.03e+02, mu1: 2.21e+02, dmu: 1.74e+01, s0: 2.25e+01, s1: 2.25e+01, m20: 2.52e+09, m21: 2.06e+09.
ERROR: Probably not constant time
real 0m41.202s
user 0m14.567s
sys 0m26.634s
```
:::
As mentioned in [Constant time != Equal time](#Constant-time--Equal-time) above, the test detects constant time in the strict sense, while what we are interested in is probably constant time in the asymptotic sense. We can get around this by skipping the special cases and testing non-special-case inputs, for example 1 vs 2.
:::spoiler q_insert_tail 1 vs 2: t=1.63
```
$ time ./qtest -f i.txt
cmd> # 1 vs 2
cmd> option simulation 1
cmd> option measures 10000000
cmd> it
Testing insert_tail...
meas: 10.00 M,
max t [0]: +1.63, max tau: 5.14e-04, (5/tau)^2: 9.45e+07, mu0: 2.18e+02, mu1: 2.18e+02, dmu: 3.43e-01, s0: 3.23e+02, s1: 3.23e+02, m20: 5.22e+11, m21: 5.90e+11.
Probably constant time
real 0m43.614s
user 0m16.244s
sys 0m27.368s
```
:::
We can verify this works for `q_size` as well:
:::spoiler q_size 1 vs 2: t=0.58
```
$ time ./qtest -f i.txt
cmd> # 1 vs 2
cmd> option simulation 1
cmd> option measures 10000000
cmd> size
Testing size...
meas: 10.00 M,
max t [0]: +0.58, max tau: 1.84e-04, (5/tau)^2: 7.42e+08, mu0: 3.39e+01, mu1: 3.39e+01, dmu: -2.59e-02, s0: 7.12e+01, s1: 7.12e+01, m20: 2.53e+10, m21: 2.46e+10.
Probably constant time
real 0m42.059s
user 0m15.692s
sys 0m26.367s
```
:::
### The Kolmogorov–Smirnov test
:::warning
WIP
:::
Constant time means that the distribution of execution times does not depend on input length. But the t-test only compares means. Can we do better? A quick google on `test same distribution` introduced me to the [Kolmogorov–Smirnov test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test).
> [The K-S test] is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples.
(update: apparently dudect's paper also mentioned this test but opted to not use it)
The K-S test outputs a single statistic $D$, the maximum distance between the two empirical CDFs. The distribution of $D$ does not depend on the null distribution, as long as the null distribution is continuous.
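scipy already ships the two-sample version, so trying it out on two timing dumps is a one-liner (the file names are placeholders for wherever the two classes' measurements are stored):
```python
import numpy as np
from scipy import stats

a = np.loadtxt('meas_class0.txt')  # placeholder: class-0 (fixed input) timings
b = np.loadtxt('meas_class1.txt')  # placeholder: class-1 (random input) timings

# D is the maximum distance between the two empirical CDFs
d, p = stats.ks_2samp(a, b)
print(d, p)
```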
:::info
References
* [Nonparametric goodness-of-fit tests for discrete null distributions (2011)](https://pdfs.semanticscholar.org/d9c0/69f62de3dc1d13c76fe8e38cffff248fdb6b.pdf#page=34)
:::
TODO: read more and experiment
### Inspecting the CDF
In the CDF plots below, the data is taken from a 100000-sample run and preprocessed by removing values that deviate from the median by more than 1.1 times as much as the 98th percentile does (chopping from above only).
```python
import numpy as np

# the numbers are completely arbitrary
# this only chops from above
def chop(a):
m = np.median(a)
p = np.percentile(a, 98)
k = 1.1
return a[a-m<(p-m)*k]
```






We can also break the data down by the previous input to see how it affects the run time:

Here we can see that the execution time is lower when the current input is the same as the previous one.
:::warning
:question: Why do the CDFs for insert_tail match partially? It would be much easier to explain if they match entirely or not at all. But why do they converge at ~250 cycles?
:::
:::warning
:dizzy_face: I re-ran the measurements on the same machine some time later and the long overlapping tails are gone:

I have no f-ing clue why.
:::
---
Misc. graphs
The same experiment, run in WSL (on a Windows 10 laptop) instead of the Ubuntu workstation used above:

Broken down by 2 bits of history (the two previous inputs):

### Comparing the t-test and the KS-test
We wish to verify the statement from the paper:
> [...] the disadvantage [of KS-tests] is that they may converge slower and require more samples.
This seems counter-intuitive: why would comparing entire CDFs have weaker power than comparing just the means?
We conduct experiments on 2 cases, based on experimental data, to compare the rate of convergence of the t-test and KS-test:
* the null hypothesis is true (the distributions are identical)
* the alternative hypothesis is true (the distributions are distinct)
For each case, we draw 2 sets of N samples from the measured data and calculate the p-values with the corresponding statistical tests (a rough resampling sketch follows the result plots below).
* $H_0$: Null hypothesis, the two sets are drawn from the population $T_{\mathrm{insert\_tail}}(0)$
* $H_1$: Alternative hypothesis, one set is drawn from $T_{\mathrm{insert\_tail}}(0)$ and the other is drawn from $T_{insert\_tail}(1)$




Observations:
* when $H_0$ is true, the t-test gives the correct p-values: at a 95% confidence level it wrongly rejects $H_0$ in 5% of the cases (type I error)
* when $H_0$ is true, the ks-test is overly conservative: at an 80% confidence level it wrongly rejects $H_0$ in only ~10% of the cases (type I error)
* when $H_1$ is true, the ks-test actually converges much faster than the t-test and has a lower type II error rate
### Reviewing work done by others in previous semesters
In [AdrianHuang's report on lab0](https://hackmd.io/U5AimAfrSeS-cLvP6vhNUQ?view) from 2020q1, they also noticed a difference between the distributions when the compute time is supposed to be equal. They found a way to fix this, with an approach different from [ours above](#Equalizing-setup-code). In [their fix](https://github.com/AdrianHuang/lab0-c/commit/52fcbd12400cb3544692a10665aa55747e961e24), they sorted the measured execution times before passing the values to the t-test.
```cpp
// ...
differentiate(exec_times, before_ticks, after_ticks);
qsort(exec_times, number_measurements, sizeof(int64_t),
(int (*)(const void *, const void *)) cmp);
update_statistics(exec_times, classes);
// ...
```
However, the sorting process changes the order of `exec_times` but not `classes`, so in `update_statistics(exec_times, classes)`, the two arrays are effectively unrelated and `update_statistics` does not output meaningful statistics.
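A tiny numpy sketch (synthetic data) shows the consequence: once `exec_times` is sorted but `classes` is not, the labels partition the sorted array essentially at random, so the two group means coincide and the t-statistic collapses toward 0 even for blatantly different distributions.
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
classes = rng.integers(0, 2, 100000)
# class 1 is clearly slower, so the test should reject "constant time"
times = np.where(classes == 1,
                 rng.normal(60, 5, 100000),
                 rng.normal(30, 5, 100000))

t_paired = stats.ttest_ind(times[classes == 0], times[classes == 1],
                           equal_var=False).statistic
times_sorted = np.sort(times)  # exec_times reordered ...
t_broken = stats.ttest_ind(times_sorted[classes == 0],
                           times_sorted[classes == 1],
                           equal_var=False).statistic  # ... but classes are not
print(t_paired, t_broken)  # |t_paired| is huge, |t_broken| is tiny
```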
### Quantitative analysis of the effectiveness of tests
:::warning
WIP
:::
Pick a confidence level $1-\alpha$ in advance.
The type I error rate $\alpha$ by definition should not depend on the sample size $N$. On the other hand, the type II error rate $\beta$ depends on the [power of the test](https://en.wikipedia.org/wiki/Power_of_a_test).
The original `simulation` runs 10 trials of the t-test with 10000 samples each and a confidence level $1-\alpha$. This has an overall type I error rate of $\alpha^{10}$ and a type II error rate of $1 - (1-\beta_\mathrm{10k})^{10}$.
Our proposed change combines all samples into a single run. The type I error rate is $\alpha$ and the type II error rate is $\beta_\mathrm{100k}$.
Since $\alpha$ is usually chosen to be negligible, the difference between the type I error rates $\alpha^{10}$ and $\alpha$ doesn't really matter. In contrast, $\beta$ can potentially be large. Even if we assume $\beta_\mathrm{10k}=\beta_\mathrm{100k}$, the type II error rate is much larger when the samples are split into 10 runs: $1-(1-\beta)^{10} > \beta$. For example, when $\beta=10\%$, $1-(1-\beta)^{10}$ is about $65\%$. The gap only widens because in most cases $\beta_\mathrm{10k} \gg \beta_\mathrm{100k}$.
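For concreteness (a trivial back-of-the-envelope check, not project code):
```python
beta = 0.10                      # assumed per-trial type II error rate
combined = 1 - (1 - beta) ** 10  # ten independent 10000-sample trials
print(combined)                  # ~0.65: a 65% chance of a false "constant time"
```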
### The t-statistic threshold
:::warning
WIP
:::
In the code, some thresholds for the t-statistic are defined:
```cpp
// threshold values for Welch's t-test
#define t_threshold_bananas 500 // test failed, with overwhelming probability
#define t_threshold_moderate \
10 // test failed. Pankaj likes 4.5 but let's be more lenient
```
How do these values affect the effectiveness of the t-test?
These thresholds affect both $\alpha$ and $\beta$.
Let's look at the type I error rate $\alpha$ first.
$\alpha$ is the total area under the two tails of the pdf chopped off by the threshold. For a symmetric pdf, $\alpha = 2(1-F_T(t_{thres}))$ where $F_T(t)$ is the cdf of the t-statistic.
For the t-test, the t-statistic follows a t-distribution with roughly $2N-2$ degrees of freedom when $H_0$ is true. In the implementation, `enough_measurements = 10000`, so the dof is at least about 10000. Plugging in the parameters, we get
$$
\begin{aligned}
\alpha_{4.5} &< 7 \cdot 10^{-6} \\
\alpha_{10} &< 2 \cdot 10^{-23} \\
\alpha_{500} &< 1 \cdot 10^{-3200} \\
\end{aligned}
$$
Calculations (Wolfram|Alpha): [t=4.5](https://www.wolframalpha.com/input/?i=2%281+-+CDF%5BStudentTDistribution%5B10000%5D%2C+4.5%5D%29) [t=10](https://www.wolframalpha.com/input/?i=2%281+-+CDF%5BStudentTDistribution%5B10000%5D%2C+10%5D%29) [t=500](https://www.wolframalpha.com/input/?i=2%281+-+CDF%5BStudentTDistribution%5B10000%5D%2C+500%5D%29)
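The same tail probabilities can be checked with scipy (a quick sketch; it uses the symmetry $\alpha = 2F_T(-t_{thres})$ to avoid the precision loss of computing `1 - cdf`; the $t=500$ case underflows to 0 in double precision):
```python
from scipy import stats

df = 10000  # degrees of freedom, as in the Wolfram|Alpha links above
for t in (4.5, 10, 500):
    alpha = 2 * stats.t.cdf(-t, df)  # two-sided tail probability
    print(t, alpha)
# roughly: 4.5 -> 7e-06, 10 -> 2e-23, 500 -> 0.0 (underflow)
```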
With $\alpha$ values so low, it seems reasonable to reject $H_0$ if any $t>t_{thres}$ is observed.
What about $\beta$?
### The normality assumption
:::warning
WIP
:::
The t-test, Student's or Welch's, assumes that the underlying distribution follows a normal distribution. But clearly our data is not normal. How does this affect the t-test?
From our experiments above, Welch's t-test gives the correct p-value. Can this be explained mathematically?
[Someone](https://stats.stackexchange.com/a/9575) pointed out that for large sample sizes the normality assumption is not needed because of the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem), but no rigorous derivation is given.
### A pattern in the cycle counts
The data points can be plotted as a histogram:

In this histogram, the width of each bin is exactly 1.
Interestingly, the bars are spaced evenly. A further inspection shows that the spacing is 3.
Let's plot the times mod 3:

It seems that all measurements are multiples of 3. But on closer inspection, the frequencies of 1 and 2 (mod 3) are not zero:
```
{0: 0.9997531901700336, 1: 1.42782546261545e-05, 2: 0.00023253157534023041}
```
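These frequencies can be reproduced with something like the following (reusing the `meas.txt` dump from earlier):
```python
import numpy as np

a = np.loadtxt('meas.txt').astype(int)
mods, counts = np.unique(a % 3, return_counts=True)
print(dict(zip(mods.tolist(), (counts / len(a)).tolist())))
```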
:::warning
Why is this the case? Why the tiny number of exceptions?
:::
### The Mann–Whitney U test
The Mann–Whitney U test is another non-parametric test. Unlike the ks-test, the distribution of the statistic $U$ does not depend on whether the null distribution is continuous or not. This seems like a perfect test for our use case.
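scipy provides it as `mannwhitneyu`; a minimal sketch on the two class dumps (placeholder file names as before):
```python
import numpy as np
from scipy import stats

a = np.loadtxt('meas_class0.txt')  # placeholder: class-0 timings
b = np.loadtxt('meas_class1.txt')  # placeholder: class-1 timings

# two-sided test: does one class tend to produce larger timings than the other?
u, p = stats.mannwhitneyu(a, b, alternative='two-sided')
print(u, p)
```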
Let's see what kind of p-values the test outputs.


The type I error rate matches the nominal value.
The $H_1$ curve converges super fast! Much faster than the ks-test and t-test.
### Comparison of power of the 3 tests
Experimental approximation of power of the tests with $\alpha=0.01$ under different sample sizes.

The performance is actually similar.
Let's try $\alpha_{10} \approx 2 \cdot 10^{-23}$ as computed above.

The ks-test is slightly better than the other two tests. However, a small increase in sample size easily cancels out the difference.
What I learned:
:::info
Sample size matters much more than the choice of test.
:::