--- tags: linux2024 --- # 2024q1 Homework1 (lab0) contributed by < [howardjchen](https://github.com/howardjchen) > :::danger 不要濫用 `:::` 標示,內容比格式更重要。倘若內文充斥大量 `:::`,他人很難跟你討論,從而違反作業公開發表並接受他人批評的初衷。 著手改進! ::: ## 開發環境 > MacBook Air M1 with Lima installed ubuntu 22.04 ``` Static hostname: lima-ubuntu-lts Icon name: computer-vm Chassis: vm Machine ID: 6f0bcb7423b94dceae613cf4c2b3ce4e Boot ID: 8a203eea063e4693a53eb5d07ae9cebe Virtualization: qemu Operating System: Ubuntu 22.04.3 LTS Kernel: Linux 5.15.0-92-generic Architecture: arm64 Hardware Vendor: QEMU Hardware Model: QEMU Virtual Machine ``` Reference: - [lima](https://dockerbook.tw/docs/alternatives/lima) - [lima 介紹](https://www.youtube.com/watch?v=aV4l85XHFGA&ab_channel=%E9%BA%A6%E5%85%9C%E6%90%9EIT) ### 安裝必要工具 ```shell $ sudo apt install build-essential git-core valgrind $ sudo apt install cppcheck clang-format aspell colordiff ``` | 名稱 | 作用 | | --------------- | ----------------- | | build-essential | GNU Toolchain and necessary headers/libraries | | git-core | distributed revision control system | | valgrind | dynamic binary instrumentation (DBI) framework | | cppcheck | static C/C++ code analysis | | clang-format | format C/C++/Obj-C code | | aspell | spell-checker | | colordiff | colorize 'diff' output | ## 針對佇列操作的實作 ### `q_new` [commit e08f65f](https://github.com/sysprog21/lab0-c/commit/e08f65f40b7dde07af70f087023df6d994907a51) ## q_insert_head 當執行測試案例 trace-03-ops.cmd 時,make test 失敗。在運行時階段使用 make SANITIZER=1 檢查記憶體後,我發現有一些程式碼需要修正。 ``` TESTING trace trace-11-malloc: Test of malloc failure on insert_head ERROR: Freed queue, but 4 blocks are still allocated --- trace-11-malloc 0/6 TESTING trace trace-12-malloc: Test of malloc failure on insert_tail ERROR: Freed queue, but 3 blocks are still allocated --- trace-12-malloc 0/6 ``` :::danger git commit message 標題不明確,應簡述你修正的問題。 你不該說 "Sometimes",而是清楚陳述,何時會導致問題,又,你的解決方案能否全面迴避問題,是否會有副作用 (side effect)。 ::: commit 8f3592ef31 Author: Howard Chen <s880367@gmail.com> Date: Sat Mar 2 00:55:51 2024 +0800 Fix bug Sometimes malloc may failed. Add more protection to check the condition when malloc and strdup failed scenario ```diff @@ -4,14 +4,15 @@ bool q_insert_head(struct list_head *head, char *s) return false; element_t *el = malloc(sizeof(element_t)); - char *str = strdup(s); + if (!el) + return false; - if (!str || !el) { + el->value = strdup(s); + if (!el->value) { free(el); return false; } - el->value = str; list_add(&el->list, head); return true; ``` ## q_insert_tail :::danger 列出程式碼是為了討論和檢討,不是假裝有付出! ::: ```diff @@ -23,14 +24,15 @@ bool q_insert_tail(struct list_head *head, char *s) return false; element_t *el = malloc(sizeof(element_t)); - char *str = strdup(s); + if (!el) + return false; - if (!str || !el) { + el->value = strdup(s); + if (!el->value) { free(el); return false; } - el->value = str; list_add_tail(&el->list, head); return true; ``` ## q_remove_head ``` **16159 ERROR: AddressSanitizer: heap-buffer-overflow on address 0xffff8810063a at pc 0xffff8c38ab3c bp 0xffffc7d7c7e0 sp 0xffffc7d7c838** READ of size 1025 at 0xffff8810063a thread T0 #0 0xffff8c38ab38 in __interceptor_memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827 #1 0xaaaac1cf245c in memcpy /usr/include/aarch64-linux-gnu/bits/string_fortified.h:29 #2 0xaaaac1cf245c in q_remove_head /tmp/lima/linux2024/lab0-c/queue.c:88 #3 0xaaaac1ceca08 in queue_remove /tmp/lima/linux2024/lab0-c/qtest.c:353 #4 0xaaaac1cecbfc in do_rh /tmp/lima/linux2024/lab0-c/qtest.c:410 #5 0xaaaac1cef68c in interpret_cmda /tmp/lima/linux2024/lab0-c/console.c:181 #6 0xaaaac1cf0188 in interpret_cmd /tmp/lima/linux2024/lab0-c/console.c:201 #7 0xaaaac1cf0958 in cmd_select /tmp/lima/linux2024/lab0-c/console.c:610 #8 0xaaaac1cf1998 in run_console /tmp/lima/linux2024/lab0-c/console.c:705 #9 0xaaaac1cede4c in main /tmp/lima/linux2024/lab0-c/qtest.c:1258 #10 0xffff8c1273f8 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 #11 0xffff8c1274c8 in __libc_start_main_impl ../csu/libc-start.c:392 #12 0xaaaac1ce97ec in _start (/tmp/lima/linux2024/lab0-c/qtest+0x97ec) ``` 我沒有看清楚題目的描述,copy的大小應該是要小於等於 bufsize "If sp is non-NULL and an element is removed, copy the removed string to *sp" :::danger 列出程式碼是為了討論和檢討,不是假裝有付出! ::: ```diff element_t *q_remove_head(struct list_head *head, char *sp, size_t bufsize) { element_t *ele; if (!head || list_empty(head)) return NULL; ele = list_first_entry(head, element_t, list); - if (sp) - memcpy(sp, ele->value, bufsize); + if (sp) { + if(strlen(ele->value) < bufsize) + memcpy(sp, ele->value, strlen(ele->value) + 1); + else + memcpy(sp, ele->value, bufsize); + } list_del(head->next); return ele; } ``` ### [commit 1e80a1ca](https://github.com/sysprog21/lab0-c/commit/1e80a1ca0b83d5f5a33792a9a3052b298838f930) :::danger 列出程式碼是為了討論和檢討,不是假裝有付出! ::: ``` Author: Howard Chen <s880367@gmail.com> Date: Wed Feb 28 01:06:53 2024 +0800 Fix buffer size less than copy length When bufsize is less than the copy length of string Need to copy only length of (bufsize - 1) and tail a '\0' in the end of the sp ``` ```diff diff --git a/queue.c b/queue.c index 06c348d..0a2d036 100644 --- a/queue.c +++ b/queue.c @@ -125,11 +125,14 @@ element_t *q_remove_head(struct list_head *head, char *sp, size_t bufsize) ele = list_first_entry(head, element_t, list); if (sp) { - if (strlen(ele->value) < bufsize) + if (strlen(ele->value) <= bufsize) memcpy(sp, ele->value, strlen(ele->value) + 1); - else - memcpy(sp, ele->value, bufsize); + else { + memcpy(sp, ele->value, bufsize-1); + sp[bufsize - 1] = '\0'; + } } + list_del(head->next); return ele; ``` ## q_delete_mid > Delete the middle node in list. > The middle node of a linked list of size n is the > ⌊n / 2⌋th node from the start using 0-based indexing. > If there're six element, the third member should be return. 這裡先從```q_size```的數量來決定middle是哪一個node **TODO: 嘗試使用fast/slow pointer改善效能** ```c bool q_delete_mid(struct list_head *head) { if (!head || list_empty(head)) return false; element_t *ele; struct list_head *slow = head->next; int mid = q_size(head); /* * TODO: use slow/fast pointer * struct list_head *slow, *fast; * slow = head->next; * fast = slow->next->next; */ mid = (mid & 0x1) ? mid >> 1 : (mid >> 1) - 1; while (mid > 0) { slow = slow->next; mid--; } ele = list_entry(slow, element_t, list); list_del(slow); q_release_element(ele); return true; } ``` 嘗試使用fast/slow pointer改善效能 **[commit db72bce](https://github.com/sysprog21/lab0-c/commit/db72bced1111f4fb11737dd386e86e41dbb08444)** :::danger * 避免過多的中英文混用,已有明確翻譯詞彙者,例如「鏈結串列」(linked list) 和「佇列」(queue),就使用該中文詞彙,英文則留給變數名稱、人名,或者缺乏通用翻譯詞彙的場景。 * 無論標題和內文中,中文和英文字元之間要有空白字元 (對排版和文字搜尋有利)。 ::: ## q_reverse > Reverse elements in queue > No effect if q is NULL or empty > This function should not allocate or free any list elements > (e.g., by calling q_insert_head, q_insert_tail, or q_remove_head). > It should rearrange the existing ones. ** TODO: Try to use ```list_move```** ```c void q_reverse(struct list_head *head) { struct list_head *cur = head, *tmp; if (!head || list_empty(head)) return; while (cur->next != head) { tmp = cur->next; cur->next = cur->prev; cur->prev = tmp; cur = tmp; } tmp = cur->next; cur->next = cur->prev; cur->prev = tmp; } ``` 參考 [alanjian85](https://hackmd.io/@alanjian85/lab0-2023#q_sort)的方法使用```list_move()```可以讓code更加簡潔 :::danger 改進你的漢語表達。 ::: ``` commit 35374d9 Author: Howard Chen <s880367@gmail.com> Date: Thu Feb 29 10:16:38 2024 +0800 Fix fast pointer iteration For q_delete_mid fix fast pointer initilization. For q_reverse, adopting list_move() API to improve code simplicity. Note that I use list_for_each_safe() not list_for_each() since after each move of node to after head, node will change location so there will be new node->next. But we don't want to use the new node->next instead of the original node->next with is safe. Therefore, list_for_each_safe() was adopt to make sure each iteration use the correct node->next. ``` ```c void q_reverse(struct list_head *head) { struct list_head *node, *safe; if (!head || list_empty(head) || list_is_singular(head)) return; list_for_each_safe (node, safe, head) { list_move(node, head); } } ``` ## q_swap 原本想法是手動去改動每一個node的```next``` and ```prev```,參考[alanjian85](https://hackmd.io/@alanjian85/lab0-2023#q_sort)的方法使用```list_move()```把每一個```node->next```當作該```node```的```head```並移動到```node->next```的後面 - 注意這邊使用```list_for_each```不是```list_for_each_safe```,因為```node```移動過後直接使用下一個node即可 - 這邊沒有考慮到到queue size是奇數的時候,最後一個node不需要做swap,補上patch修正:[commit a56d48c](https://github.com/sysprog21/lab0-c/commit/a56d48c9c15c72263d44bbac666706cdd84f0a1f) ```diff diff --git a/queue.c b/queue.c index 001f0fb..4ee3ef5 100644 --- a/queue.c +++ b/queue.c @@ -193,7 +193,8 @@ void q_swap(struct list_head *head) if (!head || list_empty(head)) return; list_for_each (node, head) { - list_move(node, node->next); + if (node->next != head) + list_move(node, node->next); } } ``` ## q_sort > Sort elements of queue in ascending/descending order ### 想法 - 採用比較直觀的遞迴方法 - 利用 fats/slow pointer 找出 mid pointer 把 list 一分為二,因為這是針對 circular linked list 所以tail->next 並不會指向一個 NULL pointer。因此把 tail->next 每一次都指向 head ```graphviz digraph { rankdir = LR; label = "unsort"; node [shape=circle]; h [label="head"]; edge [weight=3]; h -> 1 -> 2 -> 3 -> 4 -> 5 -> 6; edge [weight=3]; h->6 -> 5 -> 4 -> 3 -> 2 -> 1 -> h; 6 -> h; node [shape=none]; edge[weight = 3] mid->4 slow->3 fast->6 node [shape=none label=""]; edge[weight = 1 style="invis"] nn->h; } ``` - 分成兩個 sub-list之後,把 sub-list 的 tail 都指向 head 。這樣之後 fast ptr 在 traverse 的時候才知道尾巴在哪裡 ```c mid = slow->next; slow->next = head; ``` - 在針對個別 sub-list 進行一次 split ```graphviz digraph { rankdir = LR; label = "left list right list"; node [shape=circle]; h1 [label="head"]; h2 [label="head"]; edge [weight=3 style=""]; 1 -> 2 -> 3-> h1; 3 -> 2 -> 1; 4 -> 5 -> 6->h2; 6 -> 5 -> 4; edge [weight=3 style="invis"]; h1 -> 4 } ``` - split 的目標是要分到最後只剩下一個 node - 一開始會先得到只有一個 node 的 left list - right list 還沒分完,針對 right list 再做一次 split ```graphviz digraph { rankdir = LR; label = "right list left list"; node [shape=record]; h1 [label="head"]; h2 [label="head"]; node [shape=circle]; edge [weight=3]; 2->3->h1; 3->2; 1->h2; edge [weight=3 style="invis"]; 1->2; 2->1; } ``` ```graphviz digraph { rankdir = LR; label = "right list left list"; node [shape=record]; h1 [label="head"]; h2 [label="head"]; node [shape=circle]; edge [weight=3]; 5->6->h1; 6->5; 4->h2; edge [weight=3 style="invis"]; 4->5; 5->4; } ``` :::danger 避免非必要的項目縮排 (即 `* ` 和 `- `),以清晰、明確,且流暢的漢語書寫。 ::: 對 right list 再做一次 split 後,會得到各自只有一個 node 的 left and right list 接下來就比較各自value的大小後進行 merge ```graphviz digraph { rankdir = LR; label = "right list left list"; node [shape=record]; h1 [label="head"]; h2 [label="head"]; node [shape=circle]; edge [weight=3]; 2->h1; 3->h2; edge [weight=3 style="invis"]; h2->2; } ``` ```graphviz digraph { rankdir = LR; label = "right list left list"; node [shape=record]; h1 [label="head"]; h2 [label="head"]; node [shape=circle]; edge [weight=3]; 5->h1; 6->h2; edge [weight=3 style="invis"]; h2->5; } ``` :bulb: Merge 完之後會在做一次 linear traverse 因為要把 ```head->prev``` 接上 tail node ```c struct list_head *tmp = head->next; while (tmp->next != head) tmp = tmp->next; head->prev = tmp; ``` ## q_delete_dup 因為原本的 queue 有先排序過,重複的節點會相鄰,因此我們只要刪除相鄰的節點中有重複的即可 但不知道為什麼在跑```trace-06-ops.cmd```會出現 error ``` **Error on trace-06-ops.cmd** l = [aryizker gerbil gerbil gerbil lion lion pdhnbsroo tfcitf uopmysyyj zebra zebra] l = [aryizker pdhnbsroo tfcitf uopmysyyj] l = NULL **ERROR: There is no queue, but 4 blocks are still allocated** ``` :::danger 改進你的漢語表達。 ::: 因此用Valgrind檢查是否有沒釋放的記憶體,**發現是 ```strdup``` 之後沒有對malloc出來的記憶體空間做釋放** Commit: [Release the memory created by strdup](https://github.com/sysprog21/lab0-c/commit/1de151db17a64ac785b1074cbdb25fc799a2d2a6) ``` $ valgrind -q --leak-check=full ./qtest -v 3 -f traces/trace-06-ops.cmd ``` ```bash ERROR: Freed queue, but 4 blocks are still allocated ==4478== 168 bytes in 4 blocks are still reachable in loss record 1 of 1 ==4478== at 0x4865058: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-arm64-linux.so) ==4478== by 0x10FB7F: test_malloc (harness.c:133) ==4478== by 0x10FE17: test_strdup (harness.c:212) ==4478== by 0x1103FB: q_delete_dup (queue.c:176) ==4478== by 0x10B683: do_dedup (qtest.c:465) ==4478== by 0x10E71B: interpret_cmda (console.c:181) ==4478== by 0x10ED7F: interpret_cmd (console.c:201) ==4478== by 0x10FA47: run_console (console.c:691) ==4478== by 0x10C9A7: main (qtest.c:1258) ==4478== ``` ## q_descend and q_ascend 想法是使用暴力法,針對每一個節點向右觀察,將大於或小於的節點刪掉,此方法的time complexity=O(n^2) **TODO: 使用hash table 理論上可以將time complexity降到O(n**) ## q_reverseK 想法是針對sub node先做reverse,然後再更新tmp head到下一組 sub node的prev 針對最後一組sub node如果數量小於k的話,就不做reverse ## queue_contex_t - 在```q_merge```中需要使用 ```queue_contex_t``` 來得到每一個 queue - 但```queue_contex_t```真的有點難想像,參考[SPFishcool](https://hackmd.io/MlPKtUJJS0C8j4UQcwMk2g?both)在[2023q1 Homework1 (lab0)](https://hackmd.io/MlPKtUJJS0C8j4UQcwMk2g?both)畫的圖,非常清楚的畫出queue與```queue_contex_t```的關係如下 所以其實```q```與```chain```分別指向的是不同的linked list >![](https://i.imgur.com/80aWxMq.png) >queue_contex_t架構圖 ```c typedef struct { struct list_head *q; struct list_head chain; int size; int id; } queue_contex_t; ``` - 再加入 `queue_chain_t` 其完整佇列架構就完成了! >![](https://i.imgur.com/NuEJyGr.png) >queue 為上圖的紅色虛線部份 ## q_merge 想法是把所有的 queue 都先 move 到first contex底下的queue再一起sort ```c int q_merge(struct list_head *head, bool descend) { if (!head || list_empty(head)) return 0; if (list_is_singular(head)) return q_size(list_first_entry(head, queue_contex_t, chain)->q); queue_contex_t *first, *target; first = list_first_entry(head, queue_contex_t, chain); /* For move each target's queue to first context''s queue */ list_for_each_entry(target, head->next, chain) { if (target->id == first->id) break; list_splice_tail_init(target->q, first->q); } q_sort(first->q, descend); head = first->q; return q_size(head); } ``` ## 圖解list_cut_position ```list_cut_position```不太好懂,以下圖來示範 ```graphviz digraph { rankdir = LR; label = "Original node"; node [shape=circle]; h [label="head_from"]; edge [weight=3]; h -> 1 -> 2 -> 3 -> 4 -> 5 -> 6; edge [weight=3]; 6 -> 5 -> 4 -> 3 -> 2 -> 1 -> h; node [shape=none]; edge[weight = 3] node_cut->4 node [shape=none label=""]; edge[weight = 1 style="invis"] nn->h; } ``` ```graphviz digraph { rankdir = LR; label = "After cut"; node [shape=circle]; h [label="head_from"]; node [shape=circle]; h2 [label="head_to"]; edge [weight=3]; h -> 5 -> 6; edge [weight=3]; 6 -> 5 -> h; edge [weight=3]; h2 -> 1 -> 2 -> 3 -> 4; edge [weight=3]; 4 -> 3 -> 2 -> 1 -> h2; node [shape=none]; edge[weight = 3] node_cut->4 } ``` ```c static inline void list_cut_position(struct list_head *head_to, struct list_head *head_from, struct list_head *node) ``` :::danger 1. 無論標題和內文中,中文和英文字元之間要有空白字元 (對排版和文字搜尋有利) 2. HackMD 不是讓你張貼完整程式碼的地方,GitHub 才是!因此你在開發紀錄只該列出關鍵程式碼 (善用 diff 標示),可附上對應 GitHub commit 的超連結,列出程式碼是為了「檢討」和「便於他人參與討論」,不是用來「假裝自己有付出」 ::: ## * 研讀 Linux 核心原始程式碼的 [lib/list_sort.c](https://github.com/torvalds/linux/blob/master/lib/list_sort.c) #### 比較自己的 mergesort_list 和 Linux 核心程式碼之間效能落差 - 為了要分析兩者之間的效能差異,採用下面3個 tool 來進行效能分析 1. [perf 原理和實務](/qDsVNbZwQia8UGW1ItAhEg) 2. [ Cachegrind: a cache and branch-prediction profiler](https://valgrind.org/docs/manual/cg-manual.html) 參考 [bakudr18](https://hackmd.io/@MR-9Qa9wQLWglSAUyd6UhQ/BkvLMLkeq) 同學的作業發現 Valgrind 還有這一個強大的工具可以使用 Cachegrind 是 Valgrind 底下的其中一個 Tool , 專門用來量測 cache 使用與 branch predictor。執行 `valgrind --tool=cachegrind PROGRAM` 來紀錄程式執行過程,cg_annotate FILE 可以列出 function-by-function 的 cache access 結果。 3. [callgrind](https://valgrind.org/info/tools.html#callgrind) + [kcachegrind](https://kcachegrind.github.io/html/Home.html) - callgrind 是 cachegrind 的一個 extension 除了可以分析 cache miss 之外,還可以透過 kcachegrind 進一步把整個程式的 call graph 畫出來,並分析 bottleneck 在那一個 function > Callgrind, by Josef Weidendorfer, is an extension to Cachegrind. It provides all the information that Cachegrind does, plus extra information about callgraphs. It was folded into the main Valgrind distribution in version 3.2.0. Available separately is an amazing visualisation tool, KCachegrind, which gives a much better overview of the data that Callgrind collects; it can also be used to visualise Cachegrind's output. ### 1. 利用 perf stat 分析 使用 perf stat 因為已經有個要優化的目標,我們來對 ```trace-15-perf.cmd``` 使用 perf stat 工具分析 ```q_sort``` cache miss 情形 ``` sudo perf stat -r 100 -e cache-misses,cache-references,instructions,cycles ./qtest -f ./traces/trace-15-perf.cmd ``` - --repeat <n>或是-r <n> 可以重複執行 n 次該程序,並顯示每個 event 的變化區間。 cache-misses,cache-references和 instructions,cycles類似這種成對的 event,若同時出現 perf 會很貼心幫你計算比例。 - ==**Linux kernel API: list_sort**== ``` Performance counter stats for './qtest -f ./traces/trace-15-perf.cmd' (100 runs): 2,399,125 cache-misses # 7.139 % of all cache refs ( +- 0.90% ) 33,230,475 cache-references ( +- 0.37% ) 762,773,394 instructions # 1.07 insn per cycle ( +- 0.00% ) 712,856,204 cycles ( +- 0.16% ) 0.153062 +- 0.000261 seconds time elapsed ( +- 0.17% ) ``` - ==**自己寫的 mergesort_list**:== ``` Performance counter stats for './qtest -f ./traces/trace-15-perf.cmd' (100 runs): 4,088,496 cache-misses # 10.292 % of all cache refs ( +- 4.22% ) 40,234,841 cache-references ( +- 0.21% ) 774,742,603 instructions # 0.88 insn per cycle ( +- 0.01% ) 864,907,903 cycles ( +- 0.86% ) 0.19061 +- 0.00175 seconds time elapsed ( +- 0.92% ) ``` - 執行時間比 ```list_sort``` 多花 ==0.037 秒== - cache reference 數比 ```list_sort```多, cache miss 也比較多 - IPC 的數量比 ```list_sort``` 少,同時 cycle 數量又比較多 ### perf record & perf report 有別於 stat,reocrd 可以針對函式級別進行 event 統計,方便我們對程序「熱點」作更精細的分析和優化。我們可以針對 ```./qtest -f ./traces/trace-15-perf.cmd``` 使用 perf record 進行 branch 情況分析 ```shell $ sudo perf record -e branch-misses:u,branch-instructions:u ./qtest -f ./traces/trace-15-perf.cmd ``` ```shell $ perf report ``` `:u`是讓 perf 只統計發生在 user space 的 event。最後可以觀察到迴圈展開前後 branch-instructions 的差距。 ``` Samples: 395 of event 'branch-misses:u', Event count (approx.): 3058424 Overhead Command Shared Object Symbol 47.58% qtest libc-2.31.so [.] __strcmp_avx2 ▒ 38.43% qtest qtest [.] merge_list ▒ 3.50% qtest qtest [.] mergesort_list ▒ 3.21% qtest qtest [.] randombytes ◆ 0.99% qtest libc-2.31.so [.] _int_malloc ▒ 0.93% qtest libc-2.31.so [.] __random ▒ 0.93% qtest qtest [.] test_free ▒ 0.81% qtest qtest [.] 0x00000000000029b0 ▒ 0.76% qtest libc-2.31.so [.] rand ▒ 0.68% qtest libc-2.31.so [.] cfree@GLIBC_2.2.5 ``` ![Screenshot from 2024-03-15 22-50-39](https://hackmd.io/_uploads/HJgPxkzAp.png) ### 2. 利用 Cachegrind 進行效能分析 > Cachegrind is a cache profiler. It performs detailed simulation of the I1, D1 and L2 caches in your CPU and so can accurately pinpoint the sources of cache misses in your code. > > It identifies the number of cache misses, memory references and instructions executed for each line of source code, with per-function, per-module and whole-program summaries. It is useful with programs written in any language. Cachegrind runs programs about 20--100x slower than normal. - 嘗試使用 valgrind 裡面的令一個 tool cachegrind 進一步分析每一個 function 的 cache miss :::info :bulb: 要注意的是 Ir counts 基本上就是 assembly instructions 執行的數量 ::: - Ir: I cache reads (instructions executed) - I1mr: I1 cache read misses (instruction wasn't in I1 cache but was in L2) - I2mr: L2 cache instruction read misses (instruction wasn't in I1 or L2 cache, had to be fetched from memory) - Dr: D cache reads (memory reads) - D1mr: D1 cache read misses (data location not in D1 cache, but in L2) - D2mr: L2 cache data read misses (location not in D1 or L2) - Dw: D cache writes (memory writes) - D1mw: D1 cache write misses (location not in D1 cache, but in L2) - D2mw: L2 cache data write misses (location not in D1 or L2) #### 1. Linux kernel API: list_sort |Ir | I1mr |ILmr |Dr |D1mr |DLmr | Dw |D1mw |DLmw | | |-----------|-------|------|------------|---------- |-----|------------|----------|---------|-----------| |490,987,422| 1,600 |1,523 |120,610,370 |==9,552,254== |2,373| 53,881,626 |1,072,698 |225,686 |PROGRAM TOTALS| #### 2. mergesort_list |Ir | I1mr |ILmr |Dr |D1mr |DLmr | Dw |D1mw |DLmw | | |-----------|-------|------|------------|---------- |-----|------------|----------|---------|-----------| |491,935,138| 1,591 |1,520 | 129,754,521| ==12,671,085==|2,373| 55,280,397 | 655,445 | 225,686 | PROGRAM TOTALS| - 可以發現 D1-cache 的 read miss 是 12671085 比 ```list_sort``` 的 9552254 還要多 - cg_annotate 可以列出 function-by-function 的 cache access 結果 - 可以發現 ```mergesort_list``` cache read miss 的數量最高: 3,790,923 #### ==Linux kernel API: list_sort== ``` -------------------------------------------------------------------------------- I1 cache: 32768 B, 64 B, 8-way associative D1 cache: 32768 B, 64 B, 8-way associative LL cache: 16777216 B, 64 B, 16-way associative Command: ./qtest -f ./traces/trace-15-perf.cmd Data file: cachegrind.out.8480 Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Thresholds: 0.1 100 100 100 100 100 100 100 100 Include dirs: User annotated: Auto-annotation: off -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw -------------------------------------------------------------------------------- 490,987,422 1,600 1,523 120,610,370 9,552,254 2,373 53,881,626 1,072,698 225,686 PROGRAM TOTALS -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function -------------------------------------------------------------------------------- 81,029,740 10 7 18,309,555 2,749,259 6 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2 56,088,696 3 3 14,078,944 9 0 5,279,604 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/random_r.c:random_r 51,068,936 3 3 6,643,424 545,282 0 9,446,992 6 0 /home/howard/linux2022/lab0-c/queue.c:merge 36,957,228 2 2 14,078,944 0 0 3,519,736 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/random.c:random 31,597,984 33 32 5,806,658 118,397 1 5,326,225 199,973 199,966 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_malloc 29,292,304 0 0 10,984,614 2,343,235 0 3,661,538 55 0 /home/howard/linux2022/lab0-c/queue.c:sort_comp 25,929,916 13 13 7,362,909 29,887 0 3,201,628 13 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_free 20,642,030 5 3 2,080,425 3 1 3,520,290 0 0 /home/howard/linux2022/lab0-c/qtest.c:fill_rand_string 17,284,143 8 7 2,560,594 1,911,422 0 640,546 15 0 /home/howard/linux2022/lab0-c/qtest.c:show_queue 14,735,314 7 7 3,841,529 55 0 1,490,679 0 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:malloc 14,370,820 51 46 7,185,371 343 0 19 1 1 ???:??? 13,439,748 4 4 3,839,928 320,314 5 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_l_avx 12,872,482 8 6 2,239,856 163,277 0 2,675,554 424,853 0 /home/howard/linux2022/lab0-c/queue.c:list_sort 12,800,117 4 4 3,200,030 3 1 3,840,033 83,760 24,997 /home/howard/linux2022/lab0-c/harness.c:test_malloc 12,160,190 5 5 3,840,036 204,679 0 2,240,018 362,456 0 /home/howard/linux2022/lab0-c/harness.c:test_free 8,895,194 5 5 640,009 0 0 1,280,024 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms 7,199,325 1 1 1,439,865 0 0 1,439,865 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/rand.c:rand 6,723,171 2 2 1,600,755 42,463 0 0 0 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:free 6,080,425 2 2 2,080,425 0 0 1,120,000 0 0 /home/howard/linux2022/lab0-c/queue.c:q_new_element 5,440,228 12 10 1,120,078 6 0 640,042 0 0 /home/howard/linux2022/lab0-c/qtest.c:do_ih 3,840,222 9 7 960,042 400,023 0 320,072 0 0 /home/howard/linux2022/lab0-c/qtest.c:do_sort 3,519,736 0 0 1,759,868 3 0 0 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/../sysdeps/unix/sysv/linux/x86/lowlevellock.h:random 3,200,156 1 1 1,600,078 19 1 800,039 0 0 /home/howard/linux2022/lab0-c/harness.c:error_check 2,240,000 1 1 320,000 0 0 320,000 0 0 /home/howard/linux2022/lab0-c/queue.c:q_insert_head 2,217,818 6 6 482,020 0 0 321,231 28 1 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms 1,440,000 1 1 480,000 39,566 0 480,000 0 0 /home/howard/linux2022/lab0-c/queue.c:q_release_element 1,280,054 4 2 320,012 320,012 0 0 0 0 /home/howard/linux2022/lab0-c/queue.c:q_size 1,280,030 1 1 320,006 200,007 0 320,006 0 0 /home/howard/linux2022/lab0-c/queue.c:q_reverse 1,280,012 0 0 320,003 0 0 320,003 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_free 1,280,012 0 0 0 0 0 320,003 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_malloc 1,279,976 2 2 639,988 12 0 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_avx 1,120,061 1 1 160,016 158,903 0 160,009 0 0 /home/howard/linux2022/lab0-c/queue.c:q_free 800,000 0 0 160,000 0 0 640,000 0 0 /home/howard/linux2022/lab0-c/list.h:q_insert_head 640,000 0 0 0 0 0 160,000 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:q_new_element ``` #### ==mergesort_list== ``` -------------------------------------------------------------------------------- I1 cache: 32768 B, 64 B, 8-way associative D1 cache: 32768 B, 64 B, 8-way associative LL cache: 16777216 B, 64 B, 16-way associative Command: ./qtest -f ./traces/trace-15-perf.cmd Data file: cachegrind.out.8759 Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Thresholds: 0.1 100 100 100 100 100 100 100 100 Include dirs: User annotated: Auto-annotation: off -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw -------------------------------------------------------------------------------- 491,935,138 1,591 1,520 129,754,521 12,671,085 2,373 55,280,397 655,445 225,686 PROGRAM TOTALS -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function -------------------------------------------------------------------------------- 81,881,583 10 7 18,501,150 2,311,756 6 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2 61,645,878 2 2 17,679,374 2,496,562 0 13,979,517 1,434 0 /home/howard/linux2022/lab0-c/queue.c:merge_list 56,109,412 3 3 14,084,144 9 0 5,281,554 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/random_r.c:random_r 36,970,878 2 2 14,084,144 0 0 3,521,036 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/random.c:random 31,597,984 33 32 5,806,658 118,331 1 5,326,225 199,973 199,966 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_malloc 30,289,174 2 2 10,772,834 3,790,923 0 3,199,958 6,207 0 /home/howard/linux2022/lab0-c/queue.c:mergesort_list 25,929,916 13 13 7,362,909 30,097 0 3,201,628 13 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_free 20,638,222 3 3 2,079,231 3 1 3,519,746 0 0 /home/howard/linux2022/lab0-c/qtest.c:fill_rand_string 17,284,143 8 7 2,560,594 1,911,521 0 640,546 15 0 /home/howard/linux2022/lab0-c/qtest.c:show_queue 14,735,314 7 7 3,841,529 55 0 1,490,679 0 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:malloc 14,448,758 51 46 7,224,340 770 0 19 1 1 ???:??? 13,439,748 4 4 3,839,928 320,324 5 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_l_avx 12,800,117 4 4 3,200,030 3 1 3,840,033 83,702 24,997 /home/howard/linux2022/lab0-c/harness.c:test_malloc 12,160,172 5 5 3,840,036 204,644 0 2,240,018 362,516 0 /home/howard/linux2022/lab0-c/harness.c:test_free 8,897,142 5 5 640,009 0 0 1,280,024 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms 7,202,575 1 1 1,440,515 0 0 1,440,515 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/rand.c:rand 6,723,171 2 2 1,600,755 42,508 0 0 0 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:free 6,079,231 2 2 2,079,231 0 0 1,120,000 0 0 /home/howard/linux2022/lab0-c/queue.c:q_new_element 5,440,228 12 10 1,120,078 6 0 640,042 0 0 /home/howard/linux2022/lab0-c/qtest.c:do_ih 3,840,222 9 7 960,042 400,026 0 320,072 0 0 /home/howard/linux2022/lab0-c/qtest.c:do_sort 3,521,036 0 0 1,760,518 3 0 0 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/../sysdeps/unix/sysv/linux/x86/lowlevellock.h:random 3,200,156 1 1 1,600,078 19 1 800,039 0 0 /home/howard/linux2022/lab0-c/harness.c:error_check 2,240,000 0 0 320,000 0 0 320,000 0 0 /home/howard/linux2022/lab0-c/queue.c:q_insert_head 2,218,828 6 6 482,020 0 0 321,231 28 1 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms 1,440,000 0 0 480,000 39,548 0 480,000 0 0 /home/howard/linux2022/lab0-c/queue.c:q_release_element 1,280,090 2 2 320,018 320,002 0 30 18 0 /home/howard/linux2022/lab0-c/queue.c:q_sort 1,280,054 2 2 320,012 320,012 0 0 0 0 /home/howard/linux2022/lab0-c/queue.c:q_size 1,280,030 1 1 320,006 200,007 0 320,006 0 0 /home/howard/linux2022/lab0-c/queue.c:q_reverse 1,280,012 0 0 320,003 0 0 320,003 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_free 1,280,012 0 0 0 0 0 320,003 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_malloc 1,279,976 2 2 639,988 12 0 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_avx 1,120,061 1 1 160,016 158,869 0 160,009 0 0 /home/howard/linux2022/lab0-c/queue.c:q_free 800,000 0 0 160,000 0 0 640,000 0 0 /home/howard/linux2022/lab0-c/list.h:q_insert_head 640,000 0 0 0 0 0 160,000 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:q_new_element ``` ### 3. Callgrind + Kcachegrind 進行分析 - 首先利用 callgrind 產生分析的 dump ```valgrind --tool=callgrind ./qtest -f ./traces/trace-15-perf.cmd``` - 在利用 kcachegrind 把 dump 視覺化 ```kcachegrind callgrind.out.8840``` #### ==mergort_list 分析== - 可以發現 ```mergesort_list``` 是在sort過程中被 call 次數最多的 function - kcachegrind 可以更進一步分析出 mergesort_list 裡面, 被執行次數最多的 function ![](https://i.imgur.com/pQE0mmz.png) ```graphviz digraph "callgraph" { F55606ab5d2a0 [label="malloc\n46 331 795"]; F55606ab61740 [label="random\n79 022 370"]; F55606ab63e10 [label="0x000000000010a700\n46 970 869"]; F55606ab67610 [label="0x000000000010a4a0\n33 279 149"]; F55606ab67b50 [label="free\n32 639 121"]; F55606ab68030 [label="0x000000000010a880\n89 103 476"]; F55606ab68600 [label="rand\n86 223 160"]; F55606ab72b80 [label="0x000000000010a670\n83 994 134"]; F55606ab73110 [label="__strcmp_avx2\n77 043 651"]; F55606ab775e0 [label="_start\n491 870 341"]; F55606ab77bf0 [label="(below main)\n491 870 341"]; F55606ab78590 [label="fill_rand_string\n109 749 685"]; F55606ab7dc10 [label="main\n491 870 341"]; F55606ab81710 [label="run_console\n491 771 018"]; F55606ab83550 [label="do_free\n54 369 103"]; F55606ab849e0 [label="q_free\n54 366 859"]; F55606ab84f80 [label="do_ih\n217 809 624"]; F55606ab85ff0 [label="q_insert_head\n96 945 101"]; F55606ab86550 [label="do_sort\n213 187 520"]; F55606ab873e0 [label="q_sort\n182 639 830"]; F55606ab88dc0 [label="q_release_element\n53 246 318"]; F55606ab891a0 [label="test_free\n51 806 318"]; F55606ab89720 [label="merge_list\n141 898 843"]; F55606ab89c70 [label="q_new_element\n93 905 101"]; F55606ab8a7e0 [label="mergesort_list\n181 359 722", color="red:blue"]; F55606ab8b140 [label="mergesort_list'2\n171 067 879", color="red:blue"]; F55606ab8b9c0 [label="test_malloc\n84 337 256"]; F55606ab91bd0 [label="interpret_cmda\n491 693 518"]; F55606ab94930 [label="cmd_select\n491 770 474"]; F55606ab95c60 [label="interpret_cmd\n491 735 542"]; F55606ac4beb0 [label="random_r\n45 898 943"]; F55606ac5c2e0 [label="_int_malloc\n31 548 854"]; F55606ac7bc80 [label="_int_free\n25 918 826"]; F55606ab5d2a0 -> F55606ac5c2e0 [weight=2,label="31 548 854 (214 646x)"]; F55606ab61740 -> F55606ac4beb0 [weight=2,label="45 898 943 (1 440 149x)"]; F55606ab63e10 -> F55606ab5d2a0 [weight=2,label="46 331 795 (319 537x)"]; F55606ab67610 -> F55606ab67b50 [weight=2,label="32 639 121 (320 014x)"]; F55606ab67b50 -> F55606ac7bc80 [weight=2,label="25 918 826 (320 014x)"]; F55606ab68030 -> F55606ab68600 [weight=2,label="86 223 160 (1 440 158x)"]; F55606ab68600 -> F55606ab61740 [weight=2,label="79 022 370 (1 440 158x)"]; F55606ab72b80 -> F55606ab73110 [weight=2,label="77 043 651 (3 475 241x)"]; F55606ab775e0 -> F55606ab77bf0 [weight=2,label="491 870 341 (1x)"]; F55606ab77bf0 -> F55606ab7dc10 [weight=2,label="491 870 341 (1x)"]; F55606ab78590 -> F55606ab68030 [weight=2,label="89 103 476 (1 440 158x)"]; F55606ab7dc10 -> F55606ab81710 [weight=2,label="491 771 018 (1x)"]; F55606ab81710 -> F55606ab94930 [weight=2,label="491 770 474 (25x)"]; F55606ab83550 -> F55606ab849e0 [weight=2,label="54 366 859 (3x)"]; F55606ab849e0 -> F55606ab88dc0 [weight=2,label="53 246 318 (160 000x)"]; F55606ab84f80 -> F55606ab78590 [weight=2,label="109 749 685 (160 000x)"]; F55606ab84f80 -> F55606ab85ff0 [weight=2,label="96 945 101 (160 000x)"]; F55606ab85ff0 -> F55606ab89c70 [weight=2,label="93 905 101 (160 000x)"]; F55606ab86550 -> F55606ab873e0 [weight=2,label="182 639 830 (6x)"]; F55606ab873e0 -> F55606ab8a7e0 [weight=2,label="181 359 722 (6x)"]; F55606ab88dc0 -> F55606ab891a0 [weight=2,label="51 806 318 (320 000x)"]; F55606ab891a0 -> F55606ab67610 [weight=2,label="33 279 149 (320 000x)"]; F55606ab89720 -> F55606ab72b80 [weight=2,label="83 994 134 (3 475 257x)"]; F55606ab89c70 -> F55606ab8b9c0 [weight=2,label="84 337 256 (320 000x)"]; F55606ab8a7e0 -> F55606ab8b140 [weight=2,label="171 067 879 (12x)"]; F55606ab8b140 -> F55606ab89720 [weight=2,label="141 898 843 (319 988x)"]; F55606ab8b140 -> F55606ab8b140 [weight=3,label="1 328 340 680 (639 976x)"]; F55606ab8b9c0 -> F55606ab63e10 [weight=2,label="46 970 869 (320 001x)"]; F55606ab91bd0 -> F55606ab83550 [weight=2,label="54 369 103 (3x)"]; F55606ab91bd0 -> F55606ab84f80 [weight=2,label="217 809 624 (3x)"]; F55606ab91bd0 -> F55606ab86550 [weight=2,label="213 187 520 (6x)"]; F55606ab94930 -> F55606ab95c60 [weight=2,label="491 735 542 (24x)"]; F55606ab95c60 -> F55606ab91bd0 [weight=2,label="491 693 518 (24x)"]; } ``` :::info :bulb: cachegrind 和 callgrind 都指出需要改進的地方在 mergesort_list ::: ### 嘗試改進 mergelist_sort 的效能 **[Commit: Add flag to improve sorting performance](https://github.com/sysprog21/lab0-c/commit/2dc604f7e1ae72c7add20c99aea0340278fab7a3)** **[Commit: Use indirect pointer to make code more elegant](https://github.com/sysprog21/lab0-c/commit/456091b0ca92ef4f75221ab817ca97a37b7be641)** Merge 完之後會在做一次 linear traverse 因為要把 ```head->prev``` 接上 tail node 我們嘗試把這個動作移動到 recursion 的時候進行 如此就不用在最後來要做一次 traverse 找 tail node 另一方面我們導入indirect pointer讓code更加簡潔 如此在每一次merge的時候就不需要另外宣告tmp list ```diff static void merge_list(struct list_head *l1, struct list_head *l2, - struct list_head *head, + struct list_head **phead, bool descend, size_t cnt) - LIST_HEAD(tmp_head); - struct list_head *ptr = &tmp_head, *tmp = &tmp_head; + struct list_head *ptr = (*phead); ``` ==改進後== ``` Performance counter stats for './qtest -f ./traces/trace-15-perf.cmd' (100 runs): 3,069,662 cache-misses # 7.862 % of all cache refs ( +- 0.91% ) 38,993,371 cache-references ( +- 0.34% ) 772,745,730 instructions # 0.94 insn per cycle ( +- 0.00% ) 822,278,196 cycles ( +- 0.14% ) 0.175408 +- 0.000268 seconds time elapsed ( +- 0.15% ) ``` 改進前 ``` Performance counter stats for './qtest -f ./traces/trace-15-perf.cmd' (100 runs): 4,088,496 cache-misses # 10.292 % of all cache refs ( +- 4.22% ) 40,234,841 cache-references ( +- 0.21% ) 774,742,603 instructions # 0.88 insn per cycle ( +- 0.01% ) 864,907,903 cycles ( +- 0.86% ) 0.19061 +- 0.00175 seconds time elapsed ( +- 0.92% ) ``` > 待解釋 - 把效能提升到 0.175408 秒 比原來的 0.19061 ==提升了8%== - cache miss 也減少 IPC也提昇了一點點 ## 整合 tiny-web-server 節錄自[select(2)](https://man7.org/linux/man-pages/man2/select.2.html) Linux manuma page: > FD_ZERO(&fdset) initializes a descriptor set fdset to the null set. > FD_SET(fd, &fdset) includes a particular descriptor fd in fdset. > FD_CLR(fd, &fdset) removes fd from fdset. > FD_ISSET(fd, &fdset) is non-zero if fd is a member of fdset, zero otherwise. > FD_COPY(&fdset_orig, &fdset_copy) replaces an already allocated &fdset_copy file descriptor set with a copy of &fdset_orig. > If timeout is not a null pointer, it specifies a maximum interval to wait for the selection to complete. > If timeout is a null pointer, the select blocks indefinitely. **- 參考[vax-r](https://hackmd.io/DSdogwE1Se-d-E5qr-000w?view) 在 [整合網頁伺服器](https://hackmd.io/DSdogwE1Se-d-E5qr-000w?view)中提到** `web` 命令執行後, `linenoise` 套件會失效是因為 `do_web` 函式當中,如果開啟 web server 並監聽 port 9999 , `use_linenoise` 變數會被設為 `false` 如果強制將 `use_linenoise` 設為 `true` ,反而導致瀏覽器嘗試傳送請求給 web server 時會出現無限延遲。 一旦使用 `web` 命令後, `linenoise` 套件被關閉並且程式使用 `cmd_select` 來同時控制 command line input 和網頁瀏覽器請求。 因此我們將 `linenoise` 函式引入 `cmd_select` 當中 ```diff - char *cmdline = readline(); + char *cmdline = linenoise(prompt); ``` 執行結果會發現,開啟 web server 之後 `linenoise` 套件必須要按下 Enter 才能夠使用,這裡分析 `linenoise` 套件總共會卡在兩個路圈里面,其一是 `line_edit()` 的 `read()`,另一個是 `complete_line`裡面,因此必須把 `select`放在 `line_edit()`裡面來監控`stdin`的`fd`與`web_fd` **值得一提的是當fd ready之後,執行完相關的動作後會用`FD_CLR`清掉相關的fd,這個時候記得需要把fd再設定回來,不然`select`就會不會監控任何fd導致一直卡在sleect裡面** [Commit: Complete linenoise working during web open](https://github.com/sysprog21/lab0-c/commit/08805a9452023b3c8a9c89b675bc13d39c62c3f7) ```diff @@ -915,6 +916,8 @@ static int line_edit(int stdin_fd, l.cols = get_columns(stdin_fd, stdout_fd); l.maxrows = 0; l.history_index = 0; + int nfds = 0; + fd_set local_readset; /* Buffer starts empty. */ l.buf[0] = '\0'; @@ -932,6 +935,31 @@ static int line_edit(int stdin_fd, int nread; char seq[5]; + FD_ZERO(&local_readset); + FD_SET(STDIN_FILENO, &local_readset); + nfds = STDIN_FILENO; + + if (web_fd > 0) { + FD_SET(web_fd, &local_readset); + nfds = nfds > web_fd ? nfds : web_fd; + } + + if (web_fd > 0) { + select(nfds + 1, &local_readset, NULL, NULL, NULL); + + if (web_fd > 0 && FD_ISSET(web_fd, &local_readset)) { + FD_CLR(web_fd, &local_readset); + char *p = web_action(web_fd); + strncpy(buf, p, strlen(p) + 1); + nread = strlen(p) + 1; + free(p); + + return nread; + } + + FD_CLR(STDIN_FILENO, &local_readset); + } + nread = read(l.ifd, &c, 1); if (nread <= 0) return l.len; ``` ## 提供新的命令 `shuffle` 在 `qtest` 提供新的命令 `shuffle`,藉由 [Fisher–Yates shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle) 演算法,對佇列中所有節點進行洗牌 (shuffle) 操作,針對每一個被選出來節點透過 `list_move_tail`移動到佇列的最尾端直到所有佇列的節點都被移動過為止 測試[1,2,3,4]做shuffle 1000000次的直方圖 chi square sum: 19.458359333749343 ![shuffle_result](https://hackmd.io/_uploads/SJFwn0SCT.png) 可以觀察到 shuffle 產生的結果大致符合 uniform distritbution * 在 `qtest` 中執行 `option entropy 1` 並搭配 `ih RAND 10` 一類的命令,觀察亂數字串的分布,並運用「資訊與熵互補,資訊就是負熵」的觀念,設計一組新的命令或選項,得以在 `qtest` 切換不同的 PRNG 的實作 (可選定 [Xorshift](https://en.wikipedia.org/wiki/Xorshift)),搭配統計分析工具,比較亂數產生器 (如 Xorshift vs. 內建) 的品質 * 參見: [由大數法則理解訊息熵 (entropy) 的數學意義](https://hackmd.io/@8dSak6oVTweMeAe9fXWCPA/ryoegPgw_) * 參見: [The Linux Pseudorandom Number Generator Revisited](https://eprint.iacr.org/2012/251.pdf) ## 研讀論文〈[Dude, is my code constant time?](https://eprint.iacr.org/2016/1123.pdf)〉 ### 實作 percentile > 到目前為止 trace trace-17-complexity 一直沒有過 > --- trace-17-complexity 0/5 > **--- TOTAL 95/100** 在看完[oreparaz/dudect](https://github.com/oreparaz/dudect) 的程式碼之後,嘗試實作 percentile功能 [commit f5aa820](https://github.com/howardjchen/lab0-c/commit/f5aa82038dc42423aa77f56a221088643d0bd95f#) 在論文[Dude, is my code constant time?](https://eprint.iacr.org/2016/1123.pdf) 裡面中提到,percentile的作用是在static processing之前先做pre-processing,如此可以去除極值 >a) Pre-processing: We pre-process measurements prior to statistical processing. We discard measurements that are larger than the percentile, for various values of p. The rationale here is to test a restricted distribution range, especially the lower cycle count tail. The upper tail may be more influenced by data-independent noise. This heuristic process gives good empirical results (makes detection faster); nevertheless, we also process measurements without pre-processing. 但是發現remove_tail跟 remove_head 一直無法通過,後來才發現似乎是 platform 的問題 因為我是用Apple M1 macbook Air + lima VM,在qtest.c中發現 ```c /* FIXME: It is known that both functions is_remove_tail_const() and * is_remove_head_const() can not pass dudect on Apple M1 (based on Arm64). * We shall figure out the exact reasons and resolve later. */ ``` 所以後來改到Apple的system底下測試就成功通過了 --- trace-17-complexity 5/5 --- TOTAL 100/100 ![Screenshot 2024-03-04 at 12.54.41 AM](https://hackmd.io/_uploads/r1CdjQf6a.png) :::danger 1. 無論標題和內文中,中文和英文字元之間要有空白字元 (對排版和文字搜尋有利)。 2. 避免過多的中英文混用,已有明確翻譯詞彙者,例如「鏈結串列」(linked list) 和「佇列」(queue),就使用該中文詞彙,英文則留給變數名稱、人名,或者缺乏通用翻譯詞彙的場景。 3. 改進你的漢語表達。 ::: ## Corutine: Coro ### 程式輸出錯誤 在執行 `coro` 程式的時候,發現stdout的內容很奇怪,看起來像是有log丟失的樣子,尤其是 `ssh` 到遠端的伺服器這個現象又更明顯,後來發現是開了 [byobu](https://github.com/dustinkirkland/byobu) 造成的,換成原本的 Termianl就一切正常了 ```diff -Task 0: n = 70 -Task 1: n = 70 -Task 2: n = 70 -Task 0 fib(0) = 0 + Task 1 fib(1) = 1 Task 2 0 Task 0: resume ``` ```diff Task 2 3 Task 0: resume Task 0 fib(8) = 21 -Task 1: resume -Task 1 fib(9) = 34 + + Task 2: resume Task 2 4 Task 0: resume ``` 研讀完 `coro` 發現所有的 fib 數列計算都是由 `void task0(void *arg)` 在執行,`void task1(void *arg)`只有負責打印出 `task->i`,但在 `main` 裡面宣告了三個 `task`,其中 `task0` 和 `task1`使用 `void task0(void *arg)` ,而`task2` 使用的是`void task2(void *arg)` 最重要的核心是 `schedule` 和 `task_switch`, `schedule`是負責啟動所有的 task,每一個task在初始化之後,會將自身的任務加到 linked list 中,之後所有的任務分配就是交由 `task_switch` 執行 ```c void schedule(void) { static int i; setjmp(sched); while (ntasks-- > 0) { struct arg arg = args[i]; tasks[i++](&arg); printf("Never reached\n"); } task_switch(); } ``` `task_switch` 的工作負責將在linked list中的task取出來,透過 `longjmp` 跳回到task function裡面的 `setjmp` ```c static void task_switch() { if (!list_empty(&tasklist)) { struct task *t = list_first_entry(&tasklist, struct task, list); list_del(&t->list); cur_task = t; longjmp(t->env, 1); } } ``` 可以發現每當`void task0(void *arg)`和`void task1(void *arg)`執行完任務時,會再透過 `task_add(task)` 把自己加到 linked list 中,讓之後 `task_switch` 執行的時候可以取出任務來執行 ```c for (; task->i < task->n; task->i += 2) { if (setjmp(task->env) == 0) { long long res = fib_sequence(task->i); printf("%s fib(%d) = %lld\n", task->task_name, task->i, res); task_add(task); task_switch(); } task = cur_task; printf("%s: resume\n", task->task_name); } ``` ## 整合 tic-tac-toe 遊戲功能 [Commit: Implement task0 and task1 working concurrently ](https://github.com/sysprog21/lab0-c/commit/5d5cc339997e1a400d8ce3b33c90fb92b7b4eeb0)[Commit: Fix issue for environment parameters not align ](https://github.com/sysprog21/lab0-c/commit/ff2901f8c2be7a0b2e85bfcd7152f13556e5d73f) 我們參考[coro](https://github.com/sysprog21/concurrent-programs/tree/master/coro) 透過`setjmp` 和 `longjmp` 來實作 corutine,目前可以成功讓兩個不同演算法的 task 進行對弈一次,在加入鍵盤事件的時候,一開始發現 Ctrl-Q會沒辦法被偵測到,後來研讀[Build Your Own Text Editor](https://viewsourcecode.org/snaptoken/kilo/)在第二章的部分 [2.Entering raw mode](https://viewsourcecode.org/snaptoken/kilo/02.enteringRawMode.html)中有提到: > By default, Ctrl-S and Ctrl-Q are used for software flow control. Ctrl-S stops data from being transmitted to the terminal until you press Ctrl-Q. This originates in the days when you might want to pause the transmission of data to let a device like a printer catch up. Let’s just turn off that feature. > IXON comes from <termios.h>. The I stands for “input flag” (which it is, unlike the other I flags we’ve seen so far) and XON comes from the names of the two control characters that Ctrl-S and Ctrl-Q produce: XOFF to pause transmission and XON to resume transmission. 因此我們在 `enable_raw_mode` 的時候針對這些特殊的keyboard binding 做 disable [Commit: Disable Ctrl-S and Ctrl-Q](https://github.com/sysprog21/lab0-c/commit/f95866cf7347f6fda9735187132ebebd70624497) ```diff - raw.c_iflag &= ~(BRKINT | ICRNL | INPCK | ISTRIP); + raw.c_iflag &= ~(BRKINT | ICRNL | INPCK | ISTRIP | IXON); ``` ### error: Illegal instruction (core dumped) 連續多次電腦 vs 電腦對戰的時候,執行到第二次都會發生 `error: Illegal instruction (core dumped)` 的錯誤,因此我們使用 `make SANITIZER=1` 來查看,發現有 stack buffer overflow 的問題 ``` ==175450==ERROR: AddressSanitizer: stack-buffer-overflow on address 0xffffe6406710 at pc 0xaaaac80afd3c bp 0xffffe64065b0 sp 0xffffe64065c0 READ of size 8 at 0xffffe6406710 thread T0 #0 0xaaaac80afd38 in schedule ttt/ttt.c:232 #1 0xaaaac80b018c in corutine_ai ttt/ttt.c:400 #2 0xaaaac809dac8 in do_ttt /tmp/lima/linux2024/lab0-c/qtest.c:1061 #3 0xaaaac80a2d0c in interpret_cmda /tmp/lima/linux2024/lab0-c/console.c:182 #4 0xaaaac80a3808 in interpret_cmd /tmp/lima/linux2024/lab0-c/console.c:202 #5 0xaaaac80a4ec8 in run_console /tmp/lima/linux2024/lab0-c/console.c:699 #6 0xaaaac80a14cc in main /tmp/lima/linux2024/lab0-c/qtest.c:1647 #7 0xffffacfa73f8 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 #8 0xffffacfa74c8 in __libc_start_main_impl ../csu/libc-start.c:392 #9 0xaaaac809b86c in _start (/tmp/lima/linux2024/lab0-c/qtest+0xb86c) Address 0xffffe6406710 is located in stack of thread T0 at offset 80 in frame #0 0xaaaac80b0038 in corutine_ai ttt/ttt.c:383 This frame has 2 object(s): [32, 48) 'registered_task' (line 384) [64, 80) 'registered_arg' (line 388) <== Memory access at offset 80 overflows this variable HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork ``` ### solution [Commit: Fix stack overflow when running PC vs PC again](https://github.com/sysprog21/lab0-c/commit/86f551e45c8b83fe6ac009b97071250b0a6f2a00) 原因出在連續多次電腦 vs 電腦對戰的時候,上一場的 `i=2` 會被帶到下一場,導致在存取 `args[i]` 的時候會超出陣列的長度而產生 stack buffer overflow,但很納悶的是在進入 `schedule` 之前都有先初始化並宣告 `static int i` 但為何第二次執行的時候 `i` 沒有被初始化,而還停留在第一次對戰結束之後的 `i=2`目前原因不明,不知道是否跟 `setjmp` 的時候 `jmp_buf sched` 會保留當下的環境變數有關 ```diff @@ -232,8 +236,11 @@ void schedule(void) } task_switch(); if (rawmode == 1) disable_raw_mode(); + + /* Assign back initial value */ + i = 0; } ``` #### Issue(待釐清) 目前程式執行上沒有問題,但利用 `make SANITIZER=1` 來監測執行時期記憶體是否有正確的使用,發現在從 `task_switch` 利用 `longjmp` 跳回 `kb_event_task` 的 `setjmp` 會有以下錯誤產生 ``` *** longjmp causes uninitialized stack frame ***: terminated Aborted (core dumped) ```