2022q1 Homework1 (lab0)

--- tags: linux2022 --- # 2022q1 Homework1 (lab0) contributed by < [howardjchen](https://github.com/howardjchen) > ## Goal of Homework Impelment following functions in `queue.[ch]`: - [x] `q_new`: 建立新的「空」佇列; - [x] `q_free`: 釋放佇列所佔用的記憶體; - [x] `q_insert_head`: 在佇列開頭 (head) 插入 (insert) 給定的新節點 (以 LIFO 準則); - [x] `q_insert_tail`: 在佇列尾端 (tail) 插入 (insert) 給定的新節點 (以 FIFO 準則); - [x] `q_remove_head`: 在佇列開頭 (head) 移去 (remove) 給定的節點; - [x] `q_release_element`: 釋放特定節點的記憶體空間; - [x] `q_size`: 計算目前佇列的節點總量; - [x] `q_delete_mid`: 移走佇列中間的節點，詳見 LeetCode 2095. Delete the Middle Node of a Linked List - [x] `q_delete_dup`: 在已經排序的狀況，移走佇列中具備重複內容的節點，詳見 LeetCode 82. Remove Duplicates from Sorted List II - [x] `q_swap`: 交換佇列中鄰近的節點，詳見 LeetCode 24. Swap Nodes in Pairs - [x] `q_reverse`: 以反向順序重新排列鏈結串列，該函式不該配置或釋放任何鏈結串列節點，換言之，它只能重排已經存在的節點; - [x] `q_sort`: 以遞增順序來排序鏈結串列的節點，可參閱 Linked List Sort 得知實作手法; - [x] 滿足 $ make test 自動評分系統的所有項目。 - [x] 研讀 Linux 核心原始程式碼的 lib/list_sort.c 並嘗試引入到 lab0-c 專案，比較你自己實作的 merge sort 和 Linux 核心程式碼之間效能落差 - [x] 除了修改程式，也要編輯「作業區」，增添開發紀錄和 GitHub 連結，除了提及你如何逐步達到自動評分程式的要求外，共筆也要涵蓋以下: - [x] 開啟 Address Sanitizer，修正 qtest 執行過程中的錯誤 - [x] 先執行 qtest 再於命令提示列輸入 help 命令，會使開啟 Address Sanitizer 觸發錯誤，應予以排除 - [x] 運用 Valgrind 排除 qtest 實作的記憶體錯誤，並透過 Massif 視覺化 “simulation” 過程中的記憶體使用量，需要設計對應的實驗 - [x] 詳閱「你所不知道的 C 語言: linked list 和非連續記憶體」並應用裡頭提及的手法，例如 pointer to pointer (指向某個指標的指標，或說 indirect pointer)，讓程式碼更精簡且有效 - [ ] 在 qtest 提供新的命令 shuffle，允許藉由 Fisher–Yates shuffle 演算法，對佇列中所有節點進行洗牌 (shuffle) 操作 - [ ] 在 qtest 提供新的命令 web，提供 web 伺服器功能，注意: web 伺服器運作過程中，qtest 仍可接受其他命令，可依據前述說明，嘗試整合 tiny-web-server - [ ] 解釋 select 系統呼叫在本程式的使用方式，並分析 console.c 的實作，說明其中運用 CS:APP RIO 套件的原理和考量點。可對照參考 CS:APP 第 10 章重點提示 - [ ] 研讀論文〈Dude, is my code constant time?〉，解釋本程式的 “simulation” 模式是如何透過以實驗而非理論分析，達到驗證時間複雜度，需要解釋 Student’s t-distribution 及程式實作的原理。注意：現有實作存在若干致命缺陷，請討論並提出解決方案; - [ ] 指出現有程式的缺陷 (提示: 和 RIO 套件有關) 或者可改進之處 (例如更全面地使用 Linux 核心風格的鏈結串列 API)，嘗試實作並提交 pull request ## 開發環境 ### Ubuntu Linux 系統資訊 ``` Static hostname: pcie-lab Icon name: computer-desktop Chassis: desktop Machine ID: fd5e7ae180f94e0cbbb273229cc399e3 Boot ID: 45ad9e2f7b1349a8bb6bb449830c36a1 Operating System: Ubuntu 20.04.2 LTS Kernel: Linux 5.11.0-34-generic Architecture: x86-64 ``` ``` gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 Copyright (C) 2019 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 165 Model name: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz Stepping: 5 CPU MHz: 2900.000 CPU max MHz: 4800.0000 CPU min MHz: 800.0000 BogoMIPS: 5799.77 Virtualization: VT-x L1d cache: 256 KiB L1i cache: 256 KiB L2 cache: 2 MiB L3 cache: 16 MiB NUMA node0 CPU(s): 0-15 Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected ``` ### 安裝必要工具 ```shell $ sudo apt install build-essential git-core valgrind $ sudo apt install cppcheck clang-format aspell colordiff ``` | 名稱 | 作用 | | --------------- | ----------------- | | build-essential | GNU Toolchain and necessary headers/libraries | | git-core | distributed revision control system | | valgrind | dynamic binary instrumentation (DBI) framework | | cppcheck | static C/C++ code analysis | | clang-format | format C/C++/Obj-C code | | aspell | spell-checker | | colordiff | colorize 'diff' output | # 開發紀錄 ## q_new > Create empty queue. > Return NULL if could not allocate space. ```c struct list_head *q_new() { struct list_head *q = malloc(sizeof(struct list_head)); if (!q) return NULL; INIT_LIST_HEAD(q); return q; } ``` ## q_insert_head > Attempt to insert element at head of queue. > Return true if successful. > Return false if q is NULL or could not allocate space. :::info 一開始針對inset head沒有頭緒，在考慮要把malloc element寫在insert head / insert tail裡面，還是獨立出來。這裡先參考 a12345645 的寫法，把 create element 獨立出來 TODO: 比較兩者寫法在程式執行時期效能的差異 Ref: https://github.com/a12345645/lab0-c ::: :::warning 你也該清楚說明動機和相關考量。補上相關超連結。 :notes: jserv ::: ```c bool q_insert_head(struct list_head *head, char *s) { element_t *ele = q_new_element(s); i f(!ele || !head) return false; list_add(&ele->list, head); return true; } ``` Note that ```ele``` needs to be free if cannot allocate space for ```ele->value``` ```c static element_t *q_new_element(char *s) { element_t *ele = malloc(sizeof(element_t)); size_t len = strlen(s) + 1; if(!ele) return NULL; ele->value = malloc(sizeof(len)); if(!ele->value) { free(ele); return NULL; } memcpy(ele->value, s, len); INIT_LIST_HEAD(&ele->list); return ele; }; ``` ## q_insert_tail > Attempt to insert element at tail of queue. > Return true if successful. > Return false if q is NULL or could not allocate space. ```c bool q_insert_tail(struct list_head *head, char *s) { element_t *ele = q_new_element(s); if(!ele || !head) return false; list_add_tail(&ele->list, head); return true; return true; } ``` ## q_remove_head > Attempt to remove element from head of queue. > Return target element. > Return NULL if queue is NULL or empty. > If sp is non-NULL and an element is removed, copy the removed string to *sp ```c element_t *q_remove_head(struct list_head *head, char *sp, size_t bufsize) { element_t *ele; if (!head || list_empty(head)) return NULL; ele = list_first_entry(head, element_t, list); if (sp) memcpy(sp, ele->value, bufsize); list_del(head->next); return ele; } ``` However, ```make test``` failed at test case ```trace-03-ops.cmd```. After using ```make SANITIZER=1``` to check the memory during run time stage, I found there were some codes needs to be fixed. ### Error 1 :::danger ERROR: AddressSanitizer: heap-buffer-overflow on address 0x604000000380 at pc 0x7f775c378480 bp 0x7ffc95fb8120 sp 0x7ffc95fb78c8 READ of size 1025 at 0x604000000380 thread T0 #0 0x7f775c37847f (/lib/x86_64-linux-gnu/libasan.so.5+0x9b47f) #1 0x557ca6e60bcc in memcpy /usr/include/x86_64-linux-gnu/bits/string_fortified.h:34 #2 0x557ca6e60bcc in q_remove_head /home/howard/linux2022/lab0-c/queue.c:128 #3 0x557ca6e5b0c4 in do_remove /home/howard/linux2022/lab0-c/qtest.c:381 #4 0x557ca6e5b312 in do_rh /home/howard/linux2022/lab0-c/qtest.c:440 #5 0x557ca6e5ddf4 in interpret_cmda /home/howard/linux2022/lab0-c/console.c:185 #6 0x557ca6e5e793 in interpret_cmd /home/howard/linux2022/lab0-c/console.c:208 #7 0x557ca6e5f9c2 in cmd_select /home/howard/linux2022/lab0-c/console.c:583 #8 0x557ca6e5ff3c in run_console /home/howard/linux2022/lab0-c/console.c:656 #9 0x557ca6e5cac3 in main /home/howard/linux2022/lab0-c/qtest.c:962 #10 0x7f775bfc30b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2) #11 0x557ca6e59b8d in _start (/home/howard/linux2022/lab0-c/qtest+0x8b8d) ::: :::success 我沒有看清楚題目的描述，copy的大小應該是要小於等於 bufsize "If sp is non-NULL and an element is removed, copy the removed string to *sp" ::: ```diff= element_t *q_remove_head(struct list_head *head, char *sp, size_t bufsize) { element_t *ele; if (!head || list_empty(head)) return NULL; ele = list_first_entry(head, element_t, list); - if (sp) - memcpy(sp, ele->value, bufsize); + if (sp) { + if(strlen(ele->value) < bufsize) + memcpy(sp, ele->value, strlen(ele->value) + 1); + else + memcpy(sp, ele->value, bufsize); + } list_del(head->next); return ele; } ``` ### - commit 88982a If ```bufsize``` < ```strlen(ele->value)```, we only could copy length of ```bufsize - 1``` and tail a null termination in the end of ```sp``` ``` Author: Howard Chen <s880367@gmail.com> Date: Wed Feb 23 03:00:43 2022 +0800 Fix string copy issue When bufsize is less than the copy length of string Need to opey only length of (bufsize - 1) and tail a '\0' in the end of the sp ``` ```diff= diff --git a/queue.c b/queue.c index 06c348d..0a2d036 100644 --- a/queue.c +++ b/queue.c @@ -125,11 +125,14 @@ element_t *q_remove_head(struct list_head *head, char *sp, size_t bufsize) ele = list_first_entry(head, element_t, list); if (sp) { - if (strlen(ele->value) < bufsize) + if (strlen(ele->value) <= bufsize) memcpy(sp, ele->value, strlen(ele->value) + 1); - else - memcpy(sp, ele->value, bufsize); + else { + memcpy(sp, ele->value, bufsize-1); + sp[bufsize - 1] = '\0'; + } } + list_del(head->next); return ele; ``` ### Error 2 :::danger 0x604000000380 is located 0 bytes to the right of 48-byte region [0x604000000350,0x604000000380) allocated by thread T0 here: #0 0x7f775c3eabc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8) #1 0x557ca6e5fff6 in test_malloc /home/howard/linux2022/lab0-c/harness.c:138 #2 0x557ca6e608d9 in q_new_element /home/howard/linux2022/lab0-c/queue.c:27 #3 0x557ca6e60acd in q_insert_tail /home/howard/linux2022/lab0-c/queue.c:96 #4 0x557ca6e5b893 in do_it /home/howard/linux2022/lab0-c/qtest.c:290 #5 0x557ca6e5ddf4 in interpret_cmda /home/howard/linux2022/lab0-c/console.c:185 #6 0x557ca6e5e793 in interpret_cmd /home/howard/linux2022/lab0-c/console.c:208 #7 0x557ca6e5f9c2 in cmd_select /home/howard/linux2022/lab0-c/console.c:583 #8 0x557ca6e5ff3c in run_console /home/howard/linux2022/lab0-c/console.c:656 #9 0x557ca6e5cac3 in main /home/howard/linux2022/lab0-c/qtest.c:962 #10 0x7f775bfc30b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2) ::: :::info 另一個問題是malloc element大小填錯 ::: ```diff= static element_t *q_new_element(char *s) { element_t *ele = malloc(sizeof(element_t)); size_t len = strlen(s) + 1; if (!ele) return NULL; - ele->value = malloc(sizeof(len)); + ele->value = malloc(len * sizeof(char)); if (!ele->value) { free(ele); return NULL; } memcpy(ele->value, s, len); INIT_LIST_HEAD(&ele->list); return ele; }; ``` :::success 後來參考 lance 發現C語言的標準函示庫裏面有 strdup 可以直接使用不用去擔心malloc的大小事多少，真是太方便了 Ref: https://github.com/laneser/lab0-c/blob/master/queue.c ::: ``` STRDUP(3) Linux Programmer's Manual STRDUP(3) SYNOPSIS #include <string.h> char *strdup(const char *s); char *strndup(const char *s, size_t n); char *strdupa(const char *s); char *strndupa(const char *s, size_t n); DESCRIPTION The strdup() function returns a pointer to a new string which is a duplicate of the string s. Memory for the new string is obtained with malloc(3), and can be freed with free(3). The strndup() function is similar, but copies at most n bytes. If s is longer than n, only n bytes are copied, and a terminating null byte ('\0') is added. strdupa() and strndupa() are similar, but use alloca(3) to allocate the buffer. They are available only when using the GNU GCC suite, and suffer from the same limitations described in alloca(3). RETURN VALUE top On success, the strdup() function returns a pointer to the duplicated string. It returns NULL if insufficient memory was available, with errno set to indicate the error. ERRORS top ENOMEM Insufficient memory available to allocate duplicate string. ``` ```diff= element_t *new_element(char *s) { element_t *new_ele = malloc(sizeof(element_t)); if (!new_ele) return NULL; + new_ele->value = strdup(s); if (!new_ele->value) { free(new_ele); return NULL; } return new_ele; } ``` ## q_delete_mid > Delete the middle node in list. > The middle node of a linked list of size n is the > ⌊n / 2⌋th node from the start using 0-based indexing. > If there're six element, the third member should be return. - 這裡先從```q_size```的數量來決定middle是哪一個node - **TODO: 嘗試使用fast/slow pointer改善效能** ```c bool q_delete_mid(struct list_head *head) { if (!head || list_empty(head)) return false; element_t *ele; struct list_head *slow = head->next; int mid = q_size(head); /* * TODO: use slow/fast pointer * struct list_head *slow, *fast; * slow = head->next; * fast = slow->next->next; */ mid = (mid & 0x1) ? mid >> 1 : (mid >> 1) - 1; while (mid > 0) { slow = slow->next; mid--; } ele = list_entry(slow, element_t, list); list_del(slow); q_release_element(ele); return true; } ``` ## q_delete_dup > Delete all nodes that have duplicate string, > leaving only distinct strings from the original list. > Note: this function always be called after sorting, in other words, > list is guaranteed to be sorted in ascending order. - 因為有先 sort 過，所以直接比較相鄰兩個node的value ```c bool q_delete_dup(struct list_head *head) { if (!head || list_empty(head)) return false; element_t *ele, *safe; list_for_each_entry_safe (ele, safe, head, list) { if (&safe->list != head && !strcmp(ele->value, safe->value)) { list_del(&(ele->list)); q_release_element(ele); } } return true; } ``` ## q_swap > Attempt to swap every two adjacent nodes. ```c void q_swap(struct list_head *head) { struct list_head *left = head, *right, *tmp_l, *tmp_r; if (!head || list_empty(head)) return; while ((left->next->next != head) && (left->next != head)) { right = left->next->next; left = left->next; tmp_l = left->prev; tmp_r = right->next; left->prev = right; right->next = left; left->next = tmp_r; right->prev = tmp_l; tmp_l->next = right; tmp_r->prev = left; } } ``` ## q_reverse > Reverse elements in queue > No effect if q is NULL or empty > This function should not allocate or free any list elements > (e.g., by calling q_insert_head, q_insert_tail, or q_remove_head). > It should rearrange the existing ones. ```c void q_reverse(struct list_head *head) { struct list_head *cur = head, *tmp; if (!head || list_empty(head)) return; while (cur->next != head) { tmp = cur->next; cur->next = cur->prev; cur->prev = tmp; cur = tmp; } tmp = cur->next; cur->next = cur->prev; cur->prev = tmp; } ``` ## 運用 Valgrind 排除 qtest 實作的記憶體錯誤 ### 檢測 trace-01-ops.cmd的記憶體錯誤 ``` valgrind --tool=memcheck ./qtest -f traces/trace-01-ops.cmd ``` ``` cmd> # Test of insert_head and remove_head cmd> option fail 0 cmd> option malloc 0 cmd> new l = [] cmd> ih gerbil l = [gerbil] cmd> ih bear l = [bear gerbil] cmd> ih dolphin l = [dolphin bear gerbil] cmd> rh dolphin Removed dolphin from queue l = [bear gerbil] cmd> rh bear Removed bear from queue l = [gerbil] cmd> rh gerbil Removed gerbil from queue l = [] cmd> Freeing queue ==48453== 5 bytes in 1 blocks are still reachable in loss record 1 of 3 ==48453== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==48453== by 0x4A5250E: strdup (strdup.c:42) ==48453== by 0x1109CF: linenoiseHistoryAdd (linenoise.c:1236) ==48453== by 0x111562: linenoiseHistoryLoad (linenoise.c:1325) ==48453== by 0x10C6B0: main (qtest.c:951) ==48453== ==48453== 101 bytes in 19 blocks are still reachable in loss record 2 of 3 ==48453== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==48453== by 0x4A5250E: strdup (strdup.c:42) ==48453== by 0x110943: linenoiseHistoryAdd (linenoise.c:1236) ==48453== by 0x111562: linenoiseHistoryLoad (linenoise.c:1325) ==48453== by 0x10C6B0: main (qtest.c:951) ==48453== ==48453== 160 bytes in 1 blocks are still reachable in loss record 3 of 3 ==48453== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==48453== by 0x11098F: linenoiseHistoryAdd (linenoise.c:1224) ==48453== by 0x111562: linenoiseHistoryLoad (linenoise.c:1325) ==48453== by 0x10C6B0: main (qtest.c:951) ==48453== ``` > still reachable: 程式結束時有未釋放的記憶體，不過卻還有指標指著，通常會發生在 global 變數即便是在所用的函式庫裡頭配置的記憶體，也可偵測到，這也是動態分析手法的優勢。 - 可以用massif繪製圖形 ``` valgrind --tool=memcheck ./qtest -f traces/trace-01-ops.cmd ``` ![](https://i.imgur.com/XkXSK58.png) - From the error message of valgrind, I found that the problem happened when the pointer of ```strdup``` was not freed under some condition. Because in ```linenoiseHistoryLoad```, the function will allocate a memory space using ```strdup```, I started to find where might be the problem that cause the non-free issue. - The memory space for linenoise will all be freed via ```linenoiseFree```, whcih is loacted in ```run_console(infile_name);``` - In the function of ```run_console```, I found that if the source file cannot be opened, the function will immediately returned without calling ```linenoiseFree```. ```c bool run_console(char *infile_name) { if (!push_file(infile_name)) { report(1, "ERROR: Could not open source file '%s'", infile_name); return false; } if (!has_infile) { char *cmdline; while ((cmdline = linenoise(prompt)) != NULL) { interpret_cmd(cmdline); linenoiseHistoryAdd(cmdline); /* Add to the history. */ linenoiseHistorySave(HISTORY_FILE); /* Save the history on disk. */ linenoiseFree(cmdline); } } else { while (!cmd_done()) cmd_select(0, NULL, NULL, NULL, NULL); } return err_cnt == 0; } ``` - Therefore, I tried to move ```linenoiseHistoryLoad``` into ```run_console``` as below codes that could solve the problem. ```diff diff --git a/console.c b/console.c index 6ad338a..a9a766e 100644 --- a/console.c +++ b/console.c @@ -645,6 +645,7 @@ bool run_console(char *infile_name) if (!has_infile) { char *cmdline; + linenoiseHistoryLoad(HISTORY_FILE); /* Load the history at startup */ while ((cmdline = linenoise(prompt)) != NULL) { interpret_cmd(cmdline); linenoiseHistoryAdd(cmdline); /* Add to the history. */ diff --git a/qtest.c b/qtest.c index dd40972..4566f49 100644 --- a/qtest.c +++ b/qtest.c @@ -948,7 +948,6 @@ int main(int argc, char *argv[]) linenoiseSetCompletionCallback(completion); linenoiseHistorySetMaxLen(HISTORY_LEN); - linenoiseHistoryLoad(HISTORY_FILE); /* Load the history at startup */ set_verblevel(level); if (level > 1) { set_echo(true); ``` - Before solving the issue, there are 160B heap cost ![](https://i.imgur.com/izHb6Xv.png) - After solve the problem, we could check out the head profiler that the 160B caost was eliminated ![](https://i.imgur.com/XcTu4yo.png) - 紀錄一下在 trace ```linenoiseHistoryLoad``` 發現 ```memmove```這個 standar library :::info memmove() 用來複製內存內容，其原型為： void * memmove(void *dest, const void *src, size_t num); memmove() 與memcpy()類似都是用來複製src 所指的內存內容前num 個字節到dest 所指的地址上。不同的是，memmove() 更為靈活，當src 和dest 所指的內存區域重疊時，memmove() 仍然可以正確的處理，不過執行效率上會比使用memcpy() 略慢些。 ::: ![](https://i.imgur.com/dHnFXqC.png) - [C語言memmove()函數：複製內存內容（可以處理重疊的內存塊）](http://c.biancheng.net/cpp/html/156.html) - [memmove(3) — Linux manual page](https://man7.org/linux/man-pages/man3/memmove.3.html) ## 測 make valgrind 的記憶體錯誤 - 發現在 trace-02-ops & trace-17-ops 會有錯誤 #### 1. 嘗試針對 trace-02-ops & trace-017-ops的記憶體錯誤進行修改 ``` +++ TESTING trace trace-02-ops: # Test of insert_head, insert_tail, remove_head, remove_tail, and delete_mid ==110811== Invalid read of size 8 ==110811== at 0x4842C30: memmove (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==110811== by 0x10E950: memcpy (string_fortified.h:34) ==110811== by 0x10E950: q_remove_tail (queue.c:158) ==110811== by 0x10B73C: do_remove (qtest.c:380) ==110811== by 0x10BA4A: do_rt (qtest.c:445) ==110811== by 0x10D1BF: interpret_cmda (console.c:185) ==110811== by 0x10D70B: interpret_cmd (console.c:208) ==110811== by 0x10DF51: cmd_select (console.c:583) ==110811== by 0x10E230: run_console (console.c:657) ==110811== by 0x10C6D4: main (qtest.c:961) ==110811== Address 0x4ba9908 is 8 bytes before a block of size 2,049 alloc'd ==110811== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==110811== by 0x10B68C: do_remove (qtest.c:348) ==110811== by 0x10BA4A: do_rt (qtest.c:445) ==110811== by 0x10D1BF: interpret_cmda (console.c:185) ==110811== by 0x10D70B: interpret_cmd (console.c:208) ==110811== by 0x10DF51: cmd_select (console.c:583) ==110811== by 0x10E230: run_console (console.c:657) ==110811== by 0x10C6D4: main (qtest.c:961) ``` - 發現跟之前在 ```remove_head```是一樣的問題 :::info 我沒有看清楚題目的描述，copy的大小應該是要小於等於 bufsize “If sp is non-NULL and an element is removed, copy the removed string to *sp” ::: ```diff @@ -154,8 +154,14 @@ element_t *q_remove_tail(struct list_head *head, char *sp, size_t bufsize) return NULL; ele = list_last_entry(head, element_t, list); - if (sp) - memcpy(sp, ele->value, bufsize); + if (sp) { + if (strlen(ele->value) <= bufsize) + memcpy(sp, ele->value, strlen(ele->value) + 1); + else { + memcpy(sp, ele->value, bufsize - 1); + sp[bufsize - 1] = '\0'; + } + } list_del(head->prev); ``` ## make test 通過 ![](https://i.imgur.com/3yauzxr.png =400x400) ## q_sort via Linux kernel API > Sort elements of queue in ascending order > No effect if q is NULL or empty. In addition, if q has only one > element, do nothing. - 先嘗試使用 linux kernel API 的```list_sort``` ```c void q_sort(struct list_head *head) { if (!head || list_empty(head) || list_is_singular(head)) return; list_sort(NULL, head, sort_comp); } ``` ## 自己實做 q_sort > Sort elements of queue in ascending order > No effect if q is NULL or empty. In addition, if q has only one > element, do nothing. ### 想法 - 採用比較直觀的遞迴方法 - 利用 fats/slow pointer 找出 mid pointer 把 list 一分為二，因為這是針對 circular linked list 所以tail->next 並不會指向一個 NULL pointer。因此把 tail->next 每一次都指向 head ```graphviz digraph { rankdir = LR; label = "unsort"; node [shape=circle]; h [label="head"]; edge [weight=3]; h -> 1 -> 2 -> 3 -> 4 -> 5 -> 6; edge [weight=3]; h->6 -> 5 -> 4 -> 3 -> 2 -> 1 -> h; 6 -> h; node [shape=none]; edge[weight = 3] mid->4 slow->3 fast->6 node [shape=none label=""]; edge[weight = 1 style="invis"] nn->h; } ``` :::info - 分成兩個 sub-list之後，把 sub-list 的 tail 都指向 head 。這樣之後 fast ptr 在 traverse 的時候才知道尾巴在哪裡 ```c mid = slow->next; slow->next = head; ``` - 在針對個別 sub-list 進行一次 split ```graphviz digraph { rankdir = LR; label = "left list right list"; node [shape=circle]; h1 [label="head"]; h2 [label="head"]; edge [weight=3 style=""]; 1 -> 2 -> 3-> h1; 3 -> 2 -> 1; 4 -> 5 -> 6->h2; 6 -> 5 -> 4; edge [weight=3 style="invis"]; h1 -> 4 } ``` ::: :::info - split 的目標是要分到最後只剩下一個 node - 一開始會先得到只有一個 node 的 left list - right list 還沒分完，針對 right list 再做一次 split ```graphviz digraph { rankdir = LR; label = "right list left list"; node [shape=record]; h1 [label="head"]; h2 [label="head"]; node [shape=circle]; edge [weight=3]; 2->3->h1; 3->2; 1->h2; edge [weight=3 style="invis"]; 1->2; 2->1; } ``` ```graphviz digraph { rankdir = LR; label = "right list left list"; node [shape=record]; h1 [label="head"]; h2 [label="head"]; node [shape=circle]; edge [weight=3]; 5->6->h1; 6->5; 4->h2; edge [weight=3 style="invis"]; 4->5; 5->4; } ``` ::: :::info - 對 right list 再做一次 split 後，會得到各自只有一個 node 的 left and right list - 接下來就比較各自value的大小後進行 merge ```graphviz digraph { rankdir = LR; label = "right list left list"; node [shape=record]; h1 [label="head"]; h2 [label="head"]; node [shape=circle]; edge [weight=3]; 2->h1; 3->h2; edge [weight=3 style="invis"]; h2->2; } ``` ```graphviz digraph { rankdir = LR; label = "right list left list"; node [shape=record]; h1 [label="head"]; h2 [label="head"]; node [shape=circle]; edge [weight=3]; 5->h1; 6->h2; edge [weight=3 style="invis"]; h2->5; } ``` ::: ```c struct list_head *merge_list(struct list_head *l1, struct list_head *l2, struct list_head *head) { LIST_HEAD(tmp_head); struct list_head *ptr = &tmp_head, *tmp = &tmp_head; while ((l1 != head) && (l2 != head)) { if (strcmp(list_entry(l1, element_t, list)->value, list_entry(l2, element_t, list)->value) <= 0) { ptr->next = l1; l1->prev = ptr; l1 = l1->next; } else { ptr->next = l2; l2->prev = ptr; l2 = l2->next; } ptr = ptr->next; } ptr->next = (l1 != head) ? l1 : l2; (ptr->next)->prev = ptr; return tmp->next; } ``` ```c static struct list_head *mergesort_list(struct list_head *node, struct list_head *head) { struct list_head *left, *right; if (!head || node->next == head) return node; struct list_head *slow = node; struct list_head *fast = slow->next->next; struct list_head *mid = NULL; while ((fast->next != head) && (fast != head)) { slow = slow->next; fast = fast->next->next; } mid = slow->next; slow->next = head; left = mergesort_list(node, head); right = mergesort_list(mid, head); return merge_list(left, right, head); } ``` ```c void q_sort(struct list_head *head) { if (!head || list_empty(head) || list_is_singular(head)) return; #if SORT_BY_KERNEL_API list_sort(NULL, head, sort_comp); #else head->next = mergesort_list(head->next, head); head->next->prev = head; struct list_head *tmp = head->next; while (tmp->next != head) tmp = tmp->next; head->prev = tmp; #endif } ``` :::info :bulb: Merge 完之後會在做一次 linear traverse 因為要把 ```head->prev``` 接上 tail node ```c struct list_head *tmp = head->next; while (tmp->next != head) tmp = tmp->next; head->prev = tmp; ``` ::: ## 比較自己的 mergesort_list 和 Linux 核心程式碼之間效能落差 - 為了要分析兩者之間的效能差異，採用下面3個 tool 來進行效能分析 1. [perf 原理和實務](/qDsVNbZwQia8UGW1ItAhEg) 2. [ Cachegrind: a cache and branch-prediction profiler](https://valgrind.org/docs/manual/cg-manual.html) - 參考 [bakudr18](https://hackmd.io/@MR-9Qa9wQLWglSAUyd6UhQ/BkvLMLkeq) 同學的作業發現 Valgrind 還有這一個強大的工具可以使用 - Cachegrind 是 Valgrind 底下的其中一個 Tool , 專門用來量測 cache 使用與 branch predictor。執行 valgrind --tool=cachegrind PROGRAM 來紀錄程式執行過程，cg_annotate FILE 可以列出 function-by-function 的 cache access 結果。 3. [callgrind](https://valgrind.org/info/tools.html#callgrind) + [kcachegrind](https://kcachegrind.github.io/html/Home.html) - callgrind 是 cachegrind 的一個 extension 除了可以分析 cache miss 之外，還可以透過 kcachegrind 進一步把整個程式的 call graph 畫出來，並分析 bottleneck 在那一個 function > Callgrind, by Josef Weidendorfer, is an extension to Cachegrind. It provides all the information that Cachegrind does, plus extra information about callgraphs. It was folded into the main Valgrind distribution in version 3.2.0. Available separately is an amazing visualisation tool, KCachegrind, which gives a much better overview of the data that Callgrind collects; it can also be used to visualise Cachegrind's output. ### 1. 利用 perf stat 分析使用 perf stat 因為已經有個要優化的目標，我們來對 ```trace-15-perf.cmd``` 使用 perf stat 工具分析 ```q_sort``` cache miss 情形 ``` sudo perf stat -r 100 -e cache-misses,cache-references,instructions,cycles ./qtest -f ./traces/trace-15-perf.cmd ``` - --repeat <n>或是-r <n> 可以重複執行 n 次該程序，並顯示每個 event 的變化區間。 cache-misses,cache-references和 instructions,cycles類似這種成對的 event，若同時出現 perf 會很貼心幫你計算比例。 - ==**Linux kernel API: list_sort**== ``` Performance counter stats for './qtest -f ./traces/trace-15-perf.cmd' (100 runs): 2,214,941 cache-misses # 6.624 % of all cache refs ( +- 1.41% ) 33,436,750 cache-references ( +- 0.38% ) 505,342,377 instructions # 0.85 insn per cycle ( +- 0.01% ) 595,963,686 cycles ( +- 0.24% ) 0.127784 +- 0.000338 seconds time elapsed ( +- 0.26% ) ``` - ==**自己寫的 mergesort_list**:== ``` Performance counter stats for './qtest -f ./traces/trace-15-perf.cmd' (100 runs): 2,553,846 cache-misses # 6.602 % of all cache refs ( +- 1.18% ) 38,681,778 cache-references ( +- 0.45% ) 506,314,523 instructions # 0.71 insn per cycle ( +- 0.01% ) 711,354,384 cycles ( +- 0.21% ) 0.152460 +- 0.000337 seconds time elapsed ( +- 0.22% ) ``` - 執行時間比 ```list_sort``` 多花 ==0.024 秒== - cache reference 數比 ```list_sort```多， cache miss 也比較多 - IPC 的數量比 ```list_sort``` 少，同時 cycle 數量又比較多 ### perf record & perf report 有別於 stat，reocrd 可以針對函式級別進行 event 統計，方便我們對程序「熱點」作更精細的分析和優化。我們可以針對 ```./qtest -f ./traces/trace-15-perf.cmd``` 使用 perf record 進行 branch 情況分析 ```shell $ sudo perf record -e branch-misses:u,branch-instructions:u ./qtest -f ./traces/trace-15-perf.cmd ``` ```shell $ perf report ``` `:u`是讓 perf 只統計發生在 user space 的 event。最後可以觀察到迴圈展開前後 branch-instructions 的差距。 ``` Samples: 337 of event 'branch-misses:u', Event count (approx.): 3112600 Overhead Command Shared Object Symbol 41.14% qtest qtest [.] merge_list 38.73% qtest libc-2.31.so [.] __strcmp_avx2 4.12% qtest libc-2.31.so [.] __random 3.20% qtest qtest [.] mergesort_list 2.47% qtest qtest [.] 0x0000000000002880 1.74% qtest libc-2.31.so [.] __random_r 1.30% qtest qtest [.] test_free 1.29% qtest qtest [.] q_new_element 0.99% qtest qtest [.] fill_rand_string 0.84% qtest libc-2.31.so [.] __memmove_avx_unaligned_erms 0.78% qtest libc-2.31.so [.] rand 0.75% qtest libc-2.31.so [.] _int_malloc 0.68% qtest qtest [.] q_insert_head 0.44% qtest qtest [.] test_malloc 0.40% qtest libc-2.31.so [.] __memset_avx2_unaligned_erms 0.32% qtest qtest [.] error_check 0.24% qtest libc-2.31.so [.] _int_free 0.23% qtest qtest [.] 0x00000000000024a0 0.16% qtest ld-2.31.so [.] _dl_lookup_symbol_x 0.14% qtest qtest [.] 0x00000000000026f0 0.03% qtest ld-2.31.so [.] __GI___tunables_init 0.00% qtest ld-2.31.so [.] _dl_start ``` ![](https://i.imgur.com/DIfzLOD.png) ### 2. 利用 Cachegrind 進行效能分析 > Cachegrind is a cache profiler. It performs detailed simulation of the I1, D1 and L2 caches in your CPU and so can accurately pinpoint the sources of cache misses in your code. > > It identifies the number of cache misses, memory references and instructions executed for each line of source code, with per-function, per-module and whole-program summaries. It is useful with programs written in any language. Cachegrind runs programs about 20--100x slower than normal. - 嘗試使用 valgrind 裡面的令一個 tool cachegrind 進一步分析每一個 function 的 cache miss :::info :bulb: 要注意的是 Ir counts 基本上就是 assembly instructions 執行的數量 ::: - Ir: I cache reads (instructions executed) - I1mr: I1 cache read misses (instruction wasn't in I1 cache but was in L2) - I2mr: L2 cache instruction read misses (instruction wasn't in I1 or L2 cache, had to be fetched from memory) - Dr: D cache reads (memory reads) - D1mr: D1 cache read misses (data location not in D1 cache, but in L2) - D2mr: L2 cache data read misses (location not in D1 or L2) - Dw: D cache writes (memory writes) - D1mw: D1 cache write misses (location not in D1 cache, but in L2) - D2mw: L2 cache data write misses (location not in D1 or L2) #### 1. Linux kernel API: list_sort |Ir | I1mr |ILmr |Dr |D1mr |DLmr | Dw |D1mw |DLmw | | |-----------|-------|------|------------|---------- |-----|------------|----------|---------|-----------| |490,987,422| 1,600 |1,523 |120,610,370 |==9,552,254== |2,373| 53,881,626 |1,072,698 |225,686 |PROGRAM TOTALS| #### 2. mergesort_list |Ir | I1mr |ILmr |Dr |D1mr |DLmr | Dw |D1mw |DLmw | | |-----------|-------|------|------------|---------- |-----|------------|----------|---------|-----------| |491,935,138| 1,591 |1,520 | 129,754,521| ==12,671,085==|2,373| 55,280,397 | 655,445 | 225,686 | PROGRAM TOTALS| - 可以發現 D1-cache 的 read miss 是 12671085 比 ```list_sort``` 的 9552254 還要多 - cg_annotate 可以列出 function-by-function 的 cache access 結果 - 可以發現 ```mergesort_list``` cache read miss 的數量最高： 3,790,923 #### ==Linux kernel API: list_sort== ``` -------------------------------------------------------------------------------- I1 cache: 32768 B, 64 B, 8-way associative D1 cache: 32768 B, 64 B, 8-way associative LL cache: 16777216 B, 64 B, 16-way associative Command: ./qtest -f ./traces/trace-15-perf.cmd Data file: cachegrind.out.8480 Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Thresholds: 0.1 100 100 100 100 100 100 100 100 Include dirs: User annotated: Auto-annotation: off -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw -------------------------------------------------------------------------------- 490,987,422 1,600 1,523 120,610,370 9,552,254 2,373 53,881,626 1,072,698 225,686 PROGRAM TOTALS -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function -------------------------------------------------------------------------------- 81,029,740 10 7 18,309,555 2,749,259 6 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2 56,088,696 3 3 14,078,944 9 0 5,279,604 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/random_r.c:random_r 51,068,936 3 3 6,643,424 545,282 0 9,446,992 6 0 /home/howard/linux2022/lab0-c/queue.c:merge 36,957,228 2 2 14,078,944 0 0 3,519,736 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/random.c:random 31,597,984 33 32 5,806,658 118,397 1 5,326,225 199,973 199,966 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_malloc 29,292,304 0 0 10,984,614 2,343,235 0 3,661,538 55 0 /home/howard/linux2022/lab0-c/queue.c:sort_comp 25,929,916 13 13 7,362,909 29,887 0 3,201,628 13 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_free 20,642,030 5 3 2,080,425 3 1 3,520,290 0 0 /home/howard/linux2022/lab0-c/qtest.c:fill_rand_string 17,284,143 8 7 2,560,594 1,911,422 0 640,546 15 0 /home/howard/linux2022/lab0-c/qtest.c:show_queue 14,735,314 7 7 3,841,529 55 0 1,490,679 0 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:malloc 14,370,820 51 46 7,185,371 343 0 19 1 1 ???:??? 13,439,748 4 4 3,839,928 320,314 5 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_l_avx 12,872,482 8 6 2,239,856 163,277 0 2,675,554 424,853 0 /home/howard/linux2022/lab0-c/queue.c:list_sort 12,800,117 4 4 3,200,030 3 1 3,840,033 83,760 24,997 /home/howard/linux2022/lab0-c/harness.c:test_malloc 12,160,190 5 5 3,840,036 204,679 0 2,240,018 362,456 0 /home/howard/linux2022/lab0-c/harness.c:test_free 8,895,194 5 5 640,009 0 0 1,280,024 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms 7,199,325 1 1 1,439,865 0 0 1,439,865 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/rand.c:rand 6,723,171 2 2 1,600,755 42,463 0 0 0 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:free 6,080,425 2 2 2,080,425 0 0 1,120,000 0 0 /home/howard/linux2022/lab0-c/queue.c:q_new_element 5,440,228 12 10 1,120,078 6 0 640,042 0 0 /home/howard/linux2022/lab0-c/qtest.c:do_ih 3,840,222 9 7 960,042 400,023 0 320,072 0 0 /home/howard/linux2022/lab0-c/qtest.c:do_sort 3,519,736 0 0 1,759,868 3 0 0 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/../sysdeps/unix/sysv/linux/x86/lowlevellock.h:random 3,200,156 1 1 1,600,078 19 1 800,039 0 0 /home/howard/linux2022/lab0-c/harness.c:error_check 2,240,000 1 1 320,000 0 0 320,000 0 0 /home/howard/linux2022/lab0-c/queue.c:q_insert_head 2,217,818 6 6 482,020 0 0 321,231 28 1 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms 1,440,000 1 1 480,000 39,566 0 480,000 0 0 /home/howard/linux2022/lab0-c/queue.c:q_release_element 1,280,054 4 2 320,012 320,012 0 0 0 0 /home/howard/linux2022/lab0-c/queue.c:q_size 1,280,030 1 1 320,006 200,007 0 320,006 0 0 /home/howard/linux2022/lab0-c/queue.c:q_reverse 1,280,012 0 0 320,003 0 0 320,003 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_free 1,280,012 0 0 0 0 0 320,003 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_malloc 1,279,976 2 2 639,988 12 0 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_avx 1,120,061 1 1 160,016 158,903 0 160,009 0 0 /home/howard/linux2022/lab0-c/queue.c:q_free 800,000 0 0 160,000 0 0 640,000 0 0 /home/howard/linux2022/lab0-c/list.h:q_insert_head 640,000 0 0 0 0 0 160,000 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:q_new_element ``` #### ==mergesort_list== ``` -------------------------------------------------------------------------------- I1 cache: 32768 B, 64 B, 8-way associative D1 cache: 32768 B, 64 B, 8-way associative LL cache: 16777216 B, 64 B, 16-way associative Command: ./qtest -f ./traces/trace-15-perf.cmd Data file: cachegrind.out.8759 Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Thresholds: 0.1 100 100 100 100 100 100 100 100 Include dirs: User annotated: Auto-annotation: off -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw -------------------------------------------------------------------------------- 491,935,138 1,591 1,520 129,754,521 12,671,085 2,373 55,280,397 655,445 225,686 PROGRAM TOTALS -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function -------------------------------------------------------------------------------- 81,881,583 10 7 18,501,150 2,311,756 6 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2 61,645,878 2 2 17,679,374 2,496,562 0 13,979,517 1,434 0 /home/howard/linux2022/lab0-c/queue.c:merge_list 56,109,412 3 3 14,084,144 9 0 5,281,554 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/random_r.c:random_r 36,970,878 2 2 14,084,144 0 0 3,521,036 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/random.c:random 31,597,984 33 32 5,806,658 118,331 1 5,326,225 199,973 199,966 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_malloc 30,289,174 2 2 10,772,834 3,790,923 0 3,199,958 6,207 0 /home/howard/linux2022/lab0-c/queue.c:mergesort_list 25,929,916 13 13 7,362,909 30,097 0 3,201,628 13 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_free 20,638,222 3 3 2,079,231 3 1 3,519,746 0 0 /home/howard/linux2022/lab0-c/qtest.c:fill_rand_string 17,284,143 8 7 2,560,594 1,911,521 0 640,546 15 0 /home/howard/linux2022/lab0-c/qtest.c:show_queue 14,735,314 7 7 3,841,529 55 0 1,490,679 0 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:malloc 14,448,758 51 46 7,224,340 770 0 19 1 1 ???:??? 13,439,748 4 4 3,839,928 320,324 5 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_l_avx 12,800,117 4 4 3,200,030 3 1 3,840,033 83,702 24,997 /home/howard/linux2022/lab0-c/harness.c:test_malloc 12,160,172 5 5 3,840,036 204,644 0 2,240,018 362,516 0 /home/howard/linux2022/lab0-c/harness.c:test_free 8,897,142 5 5 640,009 0 0 1,280,024 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms 7,202,575 1 1 1,440,515 0 0 1,440,515 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/rand.c:rand 6,723,171 2 2 1,600,755 42,508 0 0 0 0 /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:free 6,079,231 2 2 2,079,231 0 0 1,120,000 0 0 /home/howard/linux2022/lab0-c/queue.c:q_new_element 5,440,228 12 10 1,120,078 6 0 640,042 0 0 /home/howard/linux2022/lab0-c/qtest.c:do_ih 3,840,222 9 7 960,042 400,026 0 320,072 0 0 /home/howard/linux2022/lab0-c/qtest.c:do_sort 3,521,036 0 0 1,760,518 3 0 0 0 0 /build/glibc-eX1tMB/glibc-2.31/stdlib/../sysdeps/unix/sysv/linux/x86/lowlevellock.h:random 3,200,156 1 1 1,600,078 19 1 800,039 0 0 /home/howard/linux2022/lab0-c/harness.c:error_check 2,240,000 0 0 320,000 0 0 320,000 0 0 /home/howard/linux2022/lab0-c/queue.c:q_insert_head 2,218,828 6 6 482,020 0 0 321,231 28 1 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms 1,440,000 0 0 480,000 39,548 0 480,000 0 0 /home/howard/linux2022/lab0-c/queue.c:q_release_element 1,280,090 2 2 320,018 320,002 0 30 18 0 /home/howard/linux2022/lab0-c/queue.c:q_sort 1,280,054 2 2 320,012 320,012 0 0 0 0 /home/howard/linux2022/lab0-c/queue.c:q_size 1,280,030 1 1 320,006 200,007 0 320,006 0 0 /home/howard/linux2022/lab0-c/queue.c:q_reverse 1,280,012 0 0 320,003 0 0 320,003 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_free 1,280,012 0 0 0 0 0 320,003 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_malloc 1,279,976 2 2 639,988 12 0 0 0 0 /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_avx 1,120,061 1 1 160,016 158,869 0 160,009 0 0 /home/howard/linux2022/lab0-c/queue.c:q_free 800,000 0 0 160,000 0 0 640,000 0 0 /home/howard/linux2022/lab0-c/list.h:q_insert_head 640,000 0 0 0 0 0 160,000 0 0 /usr/include/x86_64-linux-gnu/bits/string_fortified.h:q_new_element ``` ### 3. Callgrind + Kcachegrind 進行分析 - 首先利用 callgrind 產生分析的 dump ```valgrind --tool=callgrind ./qtest -f ./traces/trace-15-perf.cmd``` - 在利用 kcachegrind 把 dump 視覺化 ```kcachegrind callgrind.out.8840``` #### ==mergort_list 分析== - 可以發現 ```mergesort_list``` 是在sort過程中被 call 次數最多的 function - kcachegrind 可以更進一步分析出 mergesort_list 裡面，被執行次數最多的 function ![](https://i.imgur.com/pQE0mmz.png) ```graphviz digraph "callgraph" { F55606ab5d2a0 [label="malloc\n46 331 795"]; F55606ab61740 [label="random\n79 022 370"]; F55606ab63e10 [label="0x000000000010a700\n46 970 869"]; F55606ab67610 [label="0x000000000010a4a0\n33 279 149"]; F55606ab67b50 [label="free\n32 639 121"]; F55606ab68030 [label="0x000000000010a880\n89 103 476"]; F55606ab68600 [label="rand\n86 223 160"]; F55606ab72b80 [label="0x000000000010a670\n83 994 134"]; F55606ab73110 [label="__strcmp_avx2\n77 043 651"]; F55606ab775e0 [label="_start\n491 870 341"]; F55606ab77bf0 [label="(below main)\n491 870 341"]; F55606ab78590 [label="fill_rand_string\n109 749 685"]; F55606ab7dc10 [label="main\n491 870 341"]; F55606ab81710 [label="run_console\n491 771 018"]; F55606ab83550 [label="do_free\n54 369 103"]; F55606ab849e0 [label="q_free\n54 366 859"]; F55606ab84f80 [label="do_ih\n217 809 624"]; F55606ab85ff0 [label="q_insert_head\n96 945 101"]; F55606ab86550 [label="do_sort\n213 187 520"]; F55606ab873e0 [label="q_sort\n182 639 830"]; F55606ab88dc0 [label="q_release_element\n53 246 318"]; F55606ab891a0 [label="test_free\n51 806 318"]; F55606ab89720 [label="merge_list\n141 898 843"]; F55606ab89c70 [label="q_new_element\n93 905 101"]; F55606ab8a7e0 [label="mergesort_list\n181 359 722", color="red:blue"]; F55606ab8b140 [label="mergesort_list'2\n171 067 879", color="red:blue"]; F55606ab8b9c0 [label="test_malloc\n84 337 256"]; F55606ab91bd0 [label="interpret_cmda\n491 693 518"]; F55606ab94930 [label="cmd_select\n491 770 474"]; F55606ab95c60 [label="interpret_cmd\n491 735 542"]; F55606ac4beb0 [label="random_r\n45 898 943"]; F55606ac5c2e0 [label="_int_malloc\n31 548 854"]; F55606ac7bc80 [label="_int_free\n25 918 826"]; F55606ab5d2a0 -> F55606ac5c2e0 [weight=2,label="31 548 854 (214 646x)"]; F55606ab61740 -> F55606ac4beb0 [weight=2,label="45 898 943 (1 440 149x)"]; F55606ab63e10 -> F55606ab5d2a0 [weight=2,label="46 331 795 (319 537x)"]; F55606ab67610 -> F55606ab67b50 [weight=2,label="32 639 121 (320 014x)"]; F55606ab67b50 -> F55606ac7bc80 [weight=2,label="25 918 826 (320 014x)"]; F55606ab68030 -> F55606ab68600 [weight=2,label="86 223 160 (1 440 158x)"]; F55606ab68600 -> F55606ab61740 [weight=2,label="79 022 370 (1 440 158x)"]; F55606ab72b80 -> F55606ab73110 [weight=2,label="77 043 651 (3 475 241x)"]; F55606ab775e0 -> F55606ab77bf0 [weight=2,label="491 870 341 (1x)"]; F55606ab77bf0 -> F55606ab7dc10 [weight=2,label="491 870 341 (1x)"]; F55606ab78590 -> F55606ab68030 [weight=2,label="89 103 476 (1 440 158x)"]; F55606ab7dc10 -> F55606ab81710 [weight=2,label="491 771 018 (1x)"]; F55606ab81710 -> F55606ab94930 [weight=2,label="491 770 474 (25x)"]; F55606ab83550 -> F55606ab849e0 [weight=2,label="54 366 859 (3x)"]; F55606ab849e0 -> F55606ab88dc0 [weight=2,label="53 246 318 (160 000x)"]; F55606ab84f80 -> F55606ab78590 [weight=2,label="109 749 685 (160 000x)"]; F55606ab84f80 -> F55606ab85ff0 [weight=2,label="96 945 101 (160 000x)"]; F55606ab85ff0 -> F55606ab89c70 [weight=2,label="93 905 101 (160 000x)"]; F55606ab86550 -> F55606ab873e0 [weight=2,label="182 639 830 (6x)"]; F55606ab873e0 -> F55606ab8a7e0 [weight=2,label="181 359 722 (6x)"]; F55606ab88dc0 -> F55606ab891a0 [weight=2,label="51 806 318 (320 000x)"]; F55606ab891a0 -> F55606ab67610 [weight=2,label="33 279 149 (320 000x)"]; F55606ab89720 -> F55606ab72b80 [weight=2,label="83 994 134 (3 475 257x)"]; F55606ab89c70 -> F55606ab8b9c0 [weight=2,label="84 337 256 (320 000x)"]; F55606ab8a7e0 -> F55606ab8b140 [weight=2,label="171 067 879 (12x)"]; F55606ab8b140 -> F55606ab89720 [weight=2,label="141 898 843 (319 988x)"]; F55606ab8b140 -> F55606ab8b140 [weight=3,label="1 328 340 680 (639 976x)"]; F55606ab8b9c0 -> F55606ab63e10 [weight=2,label="46 970 869 (320 001x)"]; F55606ab91bd0 -> F55606ab83550 [weight=2,label="54 369 103 (3x)"]; F55606ab91bd0 -> F55606ab84f80 [weight=2,label="217 809 624 (3x)"]; F55606ab91bd0 -> F55606ab86550 [weight=2,label="213 187 520 (6x)"]; F55606ab94930 -> F55606ab95c60 [weight=2,label="491 735 542 (24x)"]; F55606ab95c60 -> F55606ab91bd0 [weight=2,label="491 693 518 (24x)"]; } ``` :::info :bulb: cachegrind 和 callgrind 都指出需要改進的地方在 mergesort_list ::: ### 嘗試改進 mergelist_sort 的效能 ==改進後== ``` Performance counter stats for './qtest -f ./traces/trace-15-perf.cmd' (100 runs): 2,468,345 cache-misses # 6.314 % of all cache refs ( +- 0.86% ) 39,092,004 cache-references ( +- 0.09% ) 504,908,000 instructions # 0.75 insn per cycle ( +- 0.01% ) 676,718,388 cycles ( +- 0.15% ) 0.144814 +- 0.000239 seconds time elapsed ( +- 0.16% ) ``` ==改進前== ``` Performance counter stats for './qtest -f ./traces/trace-15-perf.cmd' (100 runs): 2,546,624 cache-misses # 6.600 % of all cache refs ( +- 0.91% ) 38,587,995 cache-references ( +- 0.44% ) 506,296,965 instructions # 0.72 insn per cycle ( +- 0.01% ) 701,463,458 cycles ( +- 0.17% ) 0.150247 +- 0.000268 seconds time elapsed ( +- 0.18% ) ``` > 待解釋 - 把效能提升到 0.144814 秒比原來的 0.150247 ==提升了3.6%== - cache miss 也減少 IPC也提昇了一點點 ## 實做 q_shuffle - 因為 queue.h 沒辦法 edit 後 push ，所以直接加在 ```qtest.c``` 裡面 ```c static void swap(element_t *a, element_t *b) { char *tmp = b->value; b->value = a->value; a->value = tmp; } ``` ```c /* * Using Fisher–Yates shuffle algorithm * -- To shuffle an array a of n elements (indices 0..n-1): * for i from n−1 downto 1 do * j ← random integer such that 0 ≤ j ≤ i * exchange a[j] and a[i] */ static void q_shuffle(struct list_head *head) { if (!head || list_empty(head) || list_is_singular(head)) return; int n = q_size(head); element_t **ele_a = malloc(n * sizeof(element_t)); int i = 0, j; element_t *ele; list_for_each_entry (ele, head, list) { ele_a[i] = ele; i++; } for (j = n - 1; j > 0; j--) { i = rand() % j; swap(ele_a[j], ele_a[i]); } free(ele_a); } ``` ```c static bool do_shuffle(int argc, char *argv[]) { if (argc != 1) { report(1, "%s takes no arguments", argv[0]); return false; } if (!l_meta.l) report(3, "Warning: Try to access null queue"); error_check(); if (exception_setup(true)) q_shuffle(l_meta.l); exception_cancel(); show_queue(3); return !error_check(); } ``` ## 嘗試整合 tiny-web-server > FD_ZERO(&fdset) initializes a descriptor set fdset to the null set. > FD_SET(fd, &fdset) includes a particular descriptor fd in fdset. > FD_CLR(fd, &fdset) removes fd from fdset. > FD_ISSET(fd, &fdset) is non-zero if fd is a member of fdset, zero otherwise. > FD_COPY(&fdset_orig, &fdset_copy) replaces an already allocated &fdset_copy file descriptor set with a copy of &fdset_orig. > If timeout is not a null pointer, it specifies a maximum interval to wait for the selection to complete. > If timeout is a null pointer, the select blocks indefinitely. **- 參考[vax-r](https://hackmd.io/DSdogwE1Se-d-E5qr-000w?view) 在 [整合網頁伺服器](https://hackmd.io/DSdogwE1Se-d-E5qr-000w?view)中提到** `web` 命令執行後， `linenoise` 套件會失效是因為 `do_web` 函式當中，如果開啟 web server 並監聽 port 9999 ， `use_linenoise` 變數會被設為 `false` 如果強制將 `use_linenoise` 設為 `true` ，反而導致瀏覽器嘗試傳送請求給 web server 時會出現無限延遲。一旦使用 `web` 命令後， `linenoise` 套件被關閉並且程式使用 `cmd_select` 來同時控制 command line input 和網頁瀏覽器請求。因此我們將 `linenoise` 函式引入 `cmd_select` 當中 ```diff - char *cmdline = readline(); + char *cmdline = linenoise(prompt); ``` 執行結果會發現，開啟 web server 之後 `linenoise` 套件會間隔性的能夠使用，而不能使用的時候就是 `line_edit` 當中的 `read` 被阻塞的時候。 TODO: fix the bug ## 研讀論文〈Dude, is my code constant time?〉 ## 參考資訊 * [HackMD_教學](https://hackmd.io/c/tutorials-tw/) * [C Programming Lab: Assessing Your C Programming Skills](http://www.cs.cmu.edu/~213/labs/cprogramminglab.pdf) * [ASCII_Table](http://www.asciitable.com/)