2020q1 Homework1 (lab0)

# 2020q1 Homework1 (lab0) contributed by < `colinyoyo26` > ## 實驗環境 - 作業系統: Ubuntu Linux 16.04.6 - 編譯器版本： gcc 7.4.0 :::warning 你的開發環境能升級到 Ubuntu Linux 18.04 或更新的版本嗎？後續的實驗需要 Linux v4.15+ :notes: jserv ::: > 好的，正在升級中 ## 實作過程 ### queue.h #### `queue_t` 資料結構的部份增加了 `tail`和 `size`分別為了在 O(1) 時間 insertion 還有 return size ```cpp typedef struct { list_ele_t *head; list_ele_t *tail; size_t size; } queue_t; ``` ### queue.c #### `q_insert_head` 比起去年，今年多了安全性的檢查，所以把 `strcpy` 替換成了 `strncpy` ```cpp bool q_insert_head(queue_t *q, char *s) { list_ele_t *newh; char *str; if (!q) return false; size_t len = strlen(s); newh = malloc(sizeof(list_ele_t)); str = malloc(len * sizeof(char) + 1); if (!newh || !str) { free(str); free(newh); return false; } strncpy(str, s, len); str[len] = '\0'; newh->value = str; newh->next = q->head; q->head = newh; if (!q->tail) q->tail = q->head; q->size++; return true; } ``` #### `q_insert_tail` 這部份和 `q_insert_head` 差不多 ```cpp bool q_insert_tail(queue_t *q, char *s) { list_ele_t *newh; char *str; if (!q || !s) return false; size_t len = strlen(s); newh = malloc(sizeof(list_ele_t)); str = malloc(len * sizeof(char) + 1); if (!newh || !str) { free(str); free(newh); return false; } strncpy(str, s, len); str[len] = '\0'; newh->value = str; if (!q->head) { q->head = newh; q->tail = newh; } q->tail->next = newh; newh->next = NULL; q->tail = newh; q->size++; return true; } ``` #### `q_remove_head` ```cpp bool q_remove_head(queue_t *q, char *sp, size_t bufsize) { if (!q || !q->head) return false; if (bufsize > 0 && sp) { strncpy(sp, q->head->value, bufsize - 1); sp[bufsize - 1] = '\0'; } list_ele_t *tem = q->head; q->head = q->head->next; free(tem->value); free(tem); q->size--; return true; } ``` #### `q_size` 直接回傳多加入的 `size` member ```cpp int q_size(queue_t *q) { return !q ? 0 : q->size; } ``` #### `q_reverse` ```cpp void q_reverse(queue_t *q) { if (!q || q->size < 2) return; list_ele_t *cur = q->head, *prev = NULL, *tem; q->tail = q->head; while (cur) { tem = cur; cur = cur->next; tem->next = prev; prev = tem; } q->head = prev; } ``` #### `q_free` 直接 reuse `q_remove_head` 的部份避免重複的 code 為 `free` 和 `q_remove_head` 都會檢查 NULL 所以沒有特別檢查 ```cpp void q_free(queue_t *q) { while (q_remove_head(q, NULL, 0)) ; free(q); } ``` #### `q_sort` 先提及 natural sort 的部份，在 lab0-c 目錄加入 `strnatcmp.h` 和 `strnatcmp.c` 然後在 `queue.h` include `strnatcmp.h` 並用 macro 包裝函式讓 code 比較易讀一點 ```cpp #include "strnatcmp.h" ...... #define LESS_THAN(a, b) strnatcmp(a, b) < 0 ``` `Makefile` 的 `OBJS` 也要加入 `strnatcmp.o`用於靜態連結 ``` OBJS := qtest.o report.o console.o harness.o queue.o strnatcmp.o\ random.o dudect/constant.o dudect/fixture.o dudect/ttest.o ``` `qtest.c` 和 `q_sort` 比較的地方改成 ```cpp LESS_THAN(e->next->value, e->value)` ``` 剛改完 `make test` 因為呼叫 function call 成本的關係會有超時的情況，在 `strnatcmp.c` 內用 `inline` 宣告`strnatcmp0` 就解決了我採用 merge-sort 實作 `q_sort` ，利用 `q->size` 找到分段點，分成 `left` `right` 分別排序然後在 `q_merge` merge 起來， `LESS_THAN` 用於比較字典排序 - [ ] `q_sort` ```cpp void q_sort(queue_t *q) { if (!q || q->size < 2) return; queue_t left, right; left.size = (q->size >> 1) + (q->size & 1); right.size = q->size >> 1; list_ele_t *cur = left.head = q->head; right.tail = q->tail; for (size_t i = 0; i < left.size - 1; i++) cur = cur->next; left.tail = cur; right.head = cur->next; left.tail->next = right.tail->next = NULL; q->head = q->tail = NULL; q_sort(&left); q_sort(&right); q_merge(&left, &right, q); } ``` - [ ] `q_merge` ```cpp static void q_merge(queue_t *left, queue_t *right, queue_t *q) { q->size = left->size + right->size; list_ele_t *l = left->head, *r = right->head; list_ele_t *tem = NULL; q->head = LESS_THAN(left->head->value, right->head->value) ? left->head : right->head; q->tail = q->head; for (size_t i = 0; i < q->size; i++) { if (!r || (l && LESS_THAN(l->value, r->value))) { tem = l; l = l->next; } else { tem = r; r = r->next; } q->tail->next = tem; q->tail = tem; } tem->next = NULL; } ``` ## Dude, is my code constant time? 利用 leakage detection test 測試程式是否是 constant time ，優點是在這個方法下不需要硬體模型，分為以下三個步驟進行 #### Repeatedly measure exeution time - 分別用兩類(class) input data ，論文中提到的是 fix-vs-random ， fixed class 是為了 corner-case #### Apply post-processing - Cropping: 去掉某些極端值 (可能是 OS interrupt 造成......) - High-order preprocessing: (不太懂) #### Apply statistical test - null hypothesis: "the two timing distributions are equal" - if statistical test reject the null hypothesis, then we know it's not run in constant time - [Welch's t-test](https://en.wikipedia.org/wiki/Welch's_t-test): lab0 使用的測試，以下較完整介紹 ### Welch’s t-test 先簡介 Student’s t-distribution 就是抽樣下的常態分佈，抽樣越多就會越接近標準常態分佈(因為越接近母體)，圖中的 one-tailed test 和 two-tailed test 和 alternative hypothesis 有關，我們只需要看兩組數據是否不一樣，所以用 two-tailed test > alternativw hypothesis: "the two timing distributions are not equal" > ![](https://i.imgur.com/jeWKqt1.gif) 接著是 Welch’s t-test - 先計算出兩個 class (class 0, class 1) 的平均值和變異數 - Welch's t-test defines the statistic t by the following formula: - $t = \frac{\bar{X_0} - \bar{X_1}}{\sqrt{\frac{Var_0}{N_0} + \frac{Var_1}{N_1}}}$ - $t$ 越接近 0 代表兩組數據的差異越小 - lab0 實作上只有到這裡，然後直接檢查 $t$ 是否大於 `t_threshold_moderate` - approximate df (degree of freedom) using the [Welch–Satterthwaite equation](https://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equation) - 由 df 我們可以得到相應的 t-distribution 如上圖的鐘型曲線 - 圖上的藍白交界處分別為 +-t - 把藍色部份積分起來可以得到 p value 若 p 小於特定值 (通常 0.05 這部份沒有深入理解) 我們可以說兩組數據有統計上的顯著差異 (statistically significant difference) ## Dudect in lab0 知道理論後，接著是 lab0 的實作 ### `ttest.h` - [ ] `t_ctx` ```cpp typedef struct { double mean[2]; double m2[2]; double n[2]; } t_ctx; ``` - `mean[i]` 為 class i 的平均數 - `m2[i]` 為 class i 的 $M_2$ (which is $Var \cdot (N - 1)$) - `n[i]` 為 class i 的 sample 數 ($N$) ### `fixture.c` 這裡可以看到 `is_size_const` 和 `is_insert_tail_const` 用於測試， `q_size` `q_insert_tail` 是不是 constant time ，以下藉由 `is_insert_tail_const` 說明程式實作原理。 - [ ] `bool is_insert_tail_const(void)` :::info `t` is pointer to `t_ctx` `q` is pointer to `queue_t` ::: - `init_once()` 用於初始化 `*t` 和 `q` (指向 NULL) - 接著會 call `doit(0)` 先測量執行時間，再做統計分析 (t-test) ```cpp bool is_insert_tail_const(void) { bool result = false; t = malloc(sizeof(t_ctx)); for (int cnt = 0; cnt < test_tries; ++cnt) { printf("Testing insert_tail...(%d/%d)\n\n", cnt, test_tries); init_once(); for (int i = 0; i < enough_measurements / (number_measurements - drop_size * 2) + 1; ++i) result = doit(0); printf("\033[A\033[2K\033[A\033[2K"); if (result == true) break; } free(t); return result; } ``` 接著看到 `doit` function - [ ] `static bool doit(int mode)` - `mode` 用於分辨 `is_size_const` 和 `is_insert_tail_const` - `prepare_inputs(input_data, classes)` 會用亂數產生 `input_data` 的值 - 每個成員大小為 `sizeof(uint8_t) * chunk_size` - 用 `classes` 分辨他是 class 0 或 class 1 (也是隨機)，被指定到 class 0 的會執行以下 code 把前 2 bytes 會被設為 0 - `*(uint16_t *) (input_data + i * chunk_size) = 0x00` - `measure(before_ticks, after_ticks, input_data, mode)` 先對新建的 queue 中執行數次 `q_insert_head` 接著再測量一次 `q_insert_tail` 的執行時間 (紀錄在 `before_ticks` 和 `after_ticks`) - random string 的插入次數為 `*(uint16_t *) (input_data + i * chunk_size)` 所以 class 0 不會事先插入任何元素 - `differentiate(exec_times, before_ticks, after_ticks)` 由 `before_ticks` 和 `after_ticks` 相減得到執行時間，紀錄在 `exec_times` - `update_statistics` 裡面會 call `t_push` 計算出兩個 class 分別的平均值 (mean) 和 $M_2$ ( $M_2 = Var \cdot (N - 1)$ ) - 計算平均值得公式，把 `delta` 展開可以得到 $\bar{x_i} = \bar{x_{i-1}} + \frac{x_i - \bar{x_{i-1}}}{N} = \frac{(N - 1) \cdot \bar{x_{i-1}} + x_i}{N}$ - $(N - 1) \cdot \bar{x_{i-1}}$ 是前 $N - 1$ 項的總和 - code 中的 `x` 對應到 $x_i$ (第 $i$ 個 sample (execution time)) - 上述的 $\bar{x_{i-1}}$ 對應到 code 中的 `ctx->mean[class]` (更新前) - $N$ 對應到 `ctx->n[class]` (`ctx->n[class]++` 之後) - 接著 $M_2$ 的公式，根據 [Welford's online algorithm](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance) $M_{2,i} = M_{2,{i-1}} + (x_i - \bar{x_{i-1}}) \cdot (x_i - \bar{x_{i}})$ - 把 `delta` 展開後符合公式 ```cpp void t_push(t_ctx *ctx, double x, uint8_t class) { assert(class == 0 || class == 1); ctx->n[class]++; double delta = x - ctx->mean[class]; ctx->mean[class] = ctx->mean[class] + delta / ctx->n[class]; ctx->m2[class] = ctx->m2[class] + delta * (x - ctx->mean[class]); } ``` - `report()` 內會在呼叫 `t_compute` 進行 [Welch's t-test](https://en.wikipedia.org/wiki/Welch's_t-test) - $t = \frac{\bar{X_0} - \bar{X_1}}{\sqrt{\frac{Var_0}{N_0} + \frac{Var_1}{N_1}}}$ - $\bar{X_i}$ 為 class i 的平均值 - $Var_i$ 為 class i 的變異數 - $t$ 對應到 code 中的 `t_value` ```cpp double t_compute(t_ctx *ctx) { double var[2] = {0.0, 0.0}; var[0] = ctx->m2[0] / (ctx->n[0] - 1); var[1] = ctx->m2[1] / (ctx->n[1] - 1); double num = (ctx->mean[0] - ctx->mean[1]); double den = sqrt(var[0] / ctx->n[0] + var[1] / ctx->n[1]); double t_value = num / den; return t_value; } ``` - 最後檢查 $|t|$ 是否大於 `t_threshold_moderate` - if yes, reject the null hypothesis ```cpp static bool doit(int mode) { int64_t *before_ticks = calloc(number_measurements + 1, sizeof(int64_t)); int64_t *after_ticks = calloc(number_measurements + 1, sizeof(int64_t)); int64_t *exec_times = calloc(number_measurements, sizeof(int64_t)); uint8_t *classes = calloc(number_measurements, sizeof(uint8_t)); uint8_t *input_data = calloc(number_measurements * chunk_size, sizeof(uint8_t)); if (!before_ticks || !after_ticks || !exec_times || !classes || !input_data) { die(); } prepare_inputs(input_data, classes); measure(before_ticks, after_ticks, input_data, mode); differentiate(exec_times, before_ticks, after_ticks); update_statistics(exec_times, classes); bool ret = report(); free(before_ticks); free(after_ticks); free(exec_times); free(classes); free(input_data); return ret; } ``` ## `select` system call & `console.c` ### `select` 簡單來說 `select` 被用來同時監控多個 file descriptors 並 block 直到有 file descriptors 可以執行 I/O operation 或是 timeout ，資料來自 Linux Programmer's Manual 因為篇幅關係就不貼上來以下是 `select` 的宣告 ```cpp int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout); ``` - `nfds` is the highest-numbered file descriptor in any of the three sets, plus 1. - `timeout` argument specifies the interval that select() should block waiting for a file descriptor to become ready. The call will block until either: - a file descriptor becomes ready - the call is interrupted by a signal handler - the timeout expires > If timeout is NULL (no timeout), select() can block indefinitely. `fd_set` 以及用來操作 `fd_set` 的四個 macro - `fd_set` a fixed size buffer - Executing `FD_CLR()` or `FD_SET()` with a value of fd that is negative or is equal to or larger than `FD_SETSIZE` will result in undefined behavior. - `FD_ZERO()` clear a set - `FD_SET()` add a given file descriptor from a set - `FD_CLR()` remove a given file descriptor from a set - `FD_ISSET()` tests to see if a file descriptor is part of the set - this is useful after select() returns. return value - 一般情況回傳 #file descriptor - timeout return 0 - error return -1 ### `console.c` 以下分析 `console.c` 實作，先簡單介紹 RIO (Robust I/O)，他有以下特點 - 自動處理 short count :::info 這個地方課本沒有寫得很具體，是指 `rio_read` 遇到 ``'\n'`` 還會繼續讀，不會像 `read` 直接回傳 short count 嗎 ::: - application-level buffe (對應到下面資料結構的 buf member) - buffer 空時才會 call `read` 重新填滿 - 優點以 `console.c` 的 `readline()`為例，在檢查 `\n` 時不用對每一個 byte 都 call `read` - [ ] `struct RIO_ELE` ```cpp typedef struct RIO_ELE rio_t, *rio_ptr; struct RIO_ELE { int fd; /* File descriptor */ int cnt; /* Unread bytes in internal buffer */ char *bufptr; /* Next unread byte in internal buffer */ char buf[RIO_BUFSIZE]; /* Internal buffer */ rio_ptr prev; /* Next element in stack */ }; ``` - [ ] `int cmd_select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout)` 這部份函式分成三段來看，call `cmd_select` 時會先執行以下迴圈，把 `buf_stack` 內的 command 先執行完 - `block_flag` 初始為 false 且值沒有被更改所以一直都為 flase - 有宣告 `static` 且 include 的檔案中沒有修改到 - `read_ready()`檢查 `buf_stack` 是否有完整 command line - `readline()` 對應到 CSAPP 就是 `rio_readlineb` - `interpret_cmd(cmdline)` 直譯 `cmdline` - 裡面會 call `parse_args` 對 `cmpline` 進行預處理 (把空白換成 `'\0'`) - 預處理完後會 call `interpret_cmda` 比對 `cmd_list` (command 的 metadata 被用 linked list 串起來) 找出對應的 command ```cpp while (!block_flag && read_ready()) { cmdline = readline(); interpret_cmd(cmdline); prompt_flag = true; } ``` 下面這段的目的是把 `buf_stack->fd` 加到 `readfds` 這個 `fd_set` 內，並把 `nfds` 設成最大的 file descriptor + 1 ，最後會回傳參數三個 set 包含的 file descriptor 總數 - `if (!readfds)` 若參數 `readfds` 傳入 `NULL` 則把 `readfds` 指向 local 空的 set ```cpp if (!block_flag) { /* Process any commands in input buffer */ if (!readfds) readfds = &local_readset; /* Add input fd to readset for select */ infd = buf_stack->fd; FD_SET(infd, readfds); if (infd == STDIN_FILENO && prompt_flag) { printf("%s", prompt); fflush(stdout); prompt_flag = true; } if (infd >= nfds) nfds = infd + 1; } if (nfds == 0) return 0; ``` 接下來是最後的部份 `select` 等到可以進行 I/O operation (讀取 `buf_stack->fd`)，並執行寫入的 command - 第一個 `if` 用於檢查 `select` timeout 或 error ，接著把剛剛加入的 file descriptor 從 set 移除 ```cpp int result = select(nfds, readfds, writefds, exceptfds, timeout); if (result <= 0) return result; infd = buf_stack->fd; if (readfds && FD_ISSET(infd, readfds)) { /* Commandline input available */ FD_CLR(infd, readfds); result--; cmdline = readline(); if (cmdline) interpret_cmd(cmdline); } return result; ``` ## 程式現有缺陷目前朝 `readline` 函式的方向想，for loop 中的第一個 `if` 最多進去一次，剩下的部份等於是一次搬一個 byte 並檢查，如果 command 很長會成為效能瓶頸，於是想到用 `memchr` 和 `memcpy` 試著優化 (不過還沒實驗驗證) ```cpp for (cnt = 0; cnt < RIO_BUFSIZE - 2; cnt++) { if (buf_stack->cnt <= 0) { /* Need to read from input file */ buf_stack->cnt = read(buf_stack->fd, buf_stack->buf, RIO_BUFSIZE); buf_stack->bufptr = buf_stack->buf; if (buf_stack->cnt <= 0) { /* Encountered EOF */ pop_file(); if (cnt > 0) { /* Last line of file did not terminate with newline. */ /* Terminate line & return it */ *lptr++ = '\n'; *lptr++ = '\0'; if (echo) { report_noreturn(1, prompt); report_noreturn(1, linebuf); } return linebuf; } return NULL; } } /* Have text in buffer */ c = *buf_stack->bufptr++; *lptr++ = c; buf_stack->cnt--; if (c == '\n') break; } ``` 我試著改成以下，用 `memchr` 和 `memcpy` 叫有效率的處理資料，且這邊 while loop 最多跑兩次 - [Github](https://github.com/colinyoyo26/2020q1-lab0-c/blob/test2/console.c) ```cpp while (!end && cnt < RIO_BUFSIZE - 2) { if (buf_stack->cnt <= 0) { /* Need to read from input file */ buf_stack->cnt = read(buf_stack->fd, buf_stack->buf, RIO_BUFSIZE); buf_stack->bufptr = buf_stack->buf; if (buf_stack->cnt <= 0) { /* Encountered EOF */ pop_file(); if (cnt > 0) { /* Last line of file did not terminate with newline. */ /* Terminate line & return it */ *lptr++ = '\n'; *lptr++ = '\0'; if (echo) { report_noreturn(1, prompt); report_noreturn(1, linebuf); } return linebuf; } return NULL; } } size_t len = buf_stack->cnt < RIO_BUFSIZE - 2 - cnt ? buf_stack->cnt : RIO_BUFSIZE - 2 - cnt; end = (char *) memchr(buf_stack->bufptr, '\n', len); size_t offset = end ? (intptr_t) end - (intptr_t) buf_stack->bufptr + 1 : len; memcpy(lptr, buf_stack->bufptr, offset); buf_stack->cnt -= offset; buf_stack->bufptr += offset; cnt += offset; lptr += offset; } ``` :::danger 上面的超連結是 dead link，你有對應的測試計畫嗎？ :notes: jserv ::: > 已更正連結，目前只有想到比較 `make test` 時花在 `readline` 的總共時間，晚點動手實驗看看先用 perf 觀察，但沒有發現到熱點 `$ sudo perf record -e cycles:u make test && sudo perf report` 但還是做了簡單的實驗，把 `readline` 改名成 `read_line` 然後在用一個 inline function `readline` 包裝它，把每次呼叫這個函式的時間加到 `total_time` 最後程式結束時會在 `run_console` 最後印出 `total_time` 的值先清空 cache ，然後 `make test` ``` $ sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches" $ make test ``` 試了幾次發現效能反而比較差一點，先推測試因為 command 都很短，所以呼叫 `memchr` 的成本反更大 ```cpp double total_time = 0; static inline char *readline() { double time = gettime(); char *buf = read_line(); total_time += gettime() - time; return buf; } ``` `read_ready` 也用一樣的邏輯修改 ```cpp static bool read_ready() { if (!buf_stack || !memchr(buf_stack->bufptr, '\n', buf_stack->cnt)) return false; return true; } ``` ## Reference * [Dude, is my code constant time?](https://eprint.iacr.org/2016/1123.pdf) * video: [Student's t-test](https://www.youtube.com/watch?v=pTmLQvMM-1M) * [Welford's online algorithm](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance) * [統計學:大家都喜歡問的系列-p值是什麼](https://medium.com/@chih.sheng.huang821/%E7%B5%B1%E8%A8%88%E5%AD%B8-%E5%A4%A7%E5%AE%B6%E9%83%BD%E5%96%9C%E6%AD%A1%E5%95%8F%E7%9A%84%E7%B3%BB%E5%88%97-p%E5%80%BC%E6%98%AF%E4%BB%80%E9%BA%BC-2c03dbe8fddf) * [CS:APP 第 10 章重點提示](https://hackmd.io/@sysprog/H1TtmVTTz)