2024q1 Homework2 (quiz1+2)

# 2024q1 Homework2 (quiz1+2) contributed by < [HotMercury](https://github.com/HotMercury) > [作業要求](https://hackmd.io/@sysprog/BkmmEQCn6) ## 第一次測驗 ### Testing 1 : non-recursive link list quick sort 參考 [Optimized QuickSort — C Implementation (Non-Recursive)](https://alienryderflex.com/quicksort/) 來實作非遞迴的 quick sort, 首先為了比較 recursive and non-recursive 的差異，我實作了一個 [recursive 版本的 quick sort](https://github.com/HotMercury/sort/blob/main/quicksort.c), 通過實驗數據來比對兩者的差異。分析 [Optimized QuickSort](https://alienryderflex.com/quicksort/) 如何做到 non-recursive 的作法，主要是使用到了 stack 的概念，多了兩個 array 來存放這次需要 begin and end 範圍，以下程式碼為主要的分割概念。 ```c arr[L]=piv; beg[i+1]=L+1; end[i+1]=end[i]; end[i++]=L; ``` 我們以 10 個 node 來做例子，假設第一次的 pivot 實際位於第七個 node, 最一開始的時候我們是檢查 0 ~ 10, 當第一個 pivot 結束後，應該要將他們拆開成 0 ~ 7 and 8 ~ 10, 所以目前的 array 就會變成以下的樣子，直到變成 i 變成 0 後跳出迴圈。 | i 迴圈 | 0 | 1 | 2| | -------- | -------- | -------- |--| | begin | 0 | 7 |?| | end | 7 | 10 |?| #### compare times 觀察以下的 swap, 每次 compare 成功就必須做三次的轉換，然而在 optimized quick sort 裡卻只需要兩次的轉換。 ```c int temp = arr[i]; arr[i] = arr[j]; arr[j] = temp; ``` 為什麼可以做到只有兩次呢？當一開始的時候我們會將 pivot 設為第一個的值(L)，從 R(lsb) 開始找到替換值，直接覆蓋 L 值，之後再從 R 值開始找到替代值，替代 L, 所以省下的那次是因為我們不是直接找到兩個需要交換的值，再透過 tmp 來交換。 #### cycle time 透過 [dudect](https://github.com/oreparaz/dudect) 測量的 [cycle time](https://stackoverflow.com/questions/12631856/difference-between-rdtscp-rdtsc-memory-and-cpuid-rdtsc/12634857#12634857) 的方式，在程式中插入 x86 rdtscp, 可以計算出精準度較高的 cycle time ```c uint64_t rdtscp() { uint64_t tsc; asm volatile ("rdtscp; " // serializing read of tsc "shl $32,%%rdx; " // shift higher 32 bits stored in rdx up "or %%rdx,%%rax" // and or onto rax : "=a"(tsc) // output to tsc variable : : "%rcx", "%rdx"); // rcx and rdx are clobbered return tsc; } ``` average 10 times |cycle time | 10000 | 100000 | | -------- | -------- | -------- | | non-recursive | 26613060 | 172404088 | | recursive | 34850950 | 304100050 | 實驗中發現的問題可以發現 non-recursive 的速度快很多，然而 non-recursive 會有 stack 爆炸的問題，使用 valgrind 發現手動調整就可以通過，所以這個以 stack 紀錄範圍的方式說不定可以再優化。 ``` valgrind --leak-check=full --max-stackframe=6400000 ./a.out ``` #### link list optimized quick sort **shuffle** 我發現這裡使用到了 [Fisher-Yates](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle)的亂數方法，可以獲取 1 ~ n-1 的隨機數字。一開始很不能明白為什麼不用 mod 可以找到特定範圍的數字，以下 code 的概念使用了分區的方式，我先將 RAND_MAX 分割範圍，然後只要用除法就可以知道這次在哪個範圍，i 的用處就是控制切割範圍大小。 ```c size_t j = i + rand() / (RAND_MAX / (n - i) + 1); ``` 這裡有一段 shuffle 的 code, 可以運用數學的方式交換而不需要多宣告一個參數，但如果數字過大就會有 overflow 的問題。 ```diff! - int t = array[j]; - array[j] = array[i]; - array[i] = t; + array[j] += array[i]; + array[i] = array[j] - array[i]; + array[j] -= array[i]; ``` **quick sort** 假設目前要排列的 link 為以下圖示，首先會選出 pivot 這裡選定為 4. ```mermaid graph LR; A(4) B(1) C(7) D(5) E(3) A --> B --> C --> D --> E ``` 再來將左邊與右邊分開 *begin[i]* ```mermaid graph LR; B(1) E(3) B --> E ``` *begin[i+1]* ```mermaid graph LR; A(4) ``` *begin[i+2]* ```mermaid graph LR; B(5) E(7) B --> E ``` 如果不需要比對的時候就會直接將值由大到小的加入到 result list, 排序及完成。 #### 可改進點 - pivot 挑選 - begin and end 的 array 大小 ### Testing 2 : [ Timsort](https://en.wikipedia.org/wiki/Timsort) *glossary* - run : 切割出的資料 - run size : 有幾個資料 - run length : 總共的 run 切割數特性：為了解決日常可能會遇到的排序，通常為非亂序，有一定排序的 list 改良 (相較於 mersge sort) - 降低 cache : 會在部分排序完後嘗試合併，減少 cache miss 所造成的成本 - 降低記憶體開銷 : 只需要配置重疊部分 - 減少進行合併次數 #### 構思 *不看實作嘗試理解及揣測* [Timsort](https://en.wikipedia.org/wiki/Timsort) 混合 merge sort 以及 insertion sort 且為 stable 的排序演算法，特性是可以有效的處理日常數據 e.g. 大部分以排序的資料。在 timsort 中會把一次切割的資料稱為 run, 而這個 run 所切割的大小會大大影響效能，我們會先限制一個最小 run length 的大小，在這個 run 裡面會是用 insertion sort 的方式(insertion sort 在資料集少的時候表現好)。為了達到 balance 的 merge, timsort merge 時會以 stack top 的三個 run 來做 merge 的判斷，且 merge 的判斷都是以相鄰的 run 來合併，這樣可以達到 cache friendly。 ```graphviz digraph G { node [shape=box, style="rounded,filled", fillcolor="#dbe5f1", fontname="Arial"]; edge [fontname="Arial"]; Start [label="Start", shape=ellipse]; Insert_Run [label="Insert Run into Stack"]; Merge_Check [label="Merge Check"]; Merge [label="Merge X and Y"]; Replace [label="Replace X and Y with Z"]; Z_Check [label="Z > Y + X?"]; Y_Check [label="Y > X?"]; End [label="End", shape=ellipse]; Start -> Insert_Run; Insert_Run -> Merge_Check; Merge_Check -> Merge [label="|Z| ≤ |Y| + |X|"]; Merge_Check -> Z_Check [label="|Z| > |Y| + |X|"]; Merge -> Replace; Replace -> Merge_Check; Z_Check -> End [label="Yes"]; Z_Check -> Y_Check [label="No"]; Y_Check -> End [label="Yes"]; Y_Check -> Insert_Run [label="No"]; } ``` 如果 $X+Y > Z$ 會作以下 merge ```graphviz digraph G { rankdir=LR; node [shape=record, style="filled", fillcolor="", fontname="Arial"]; edge [fontname="Arial"]; stack [label="X | <from>Y | Z", shape=record]; subgraph cluster_merged { label="Merged Stack" style="rounded"; Merged [label="<to>X + Y | Z", shape=record]; } stack:from -> Merged:to; } ``` 而針對 array 的版本 timsort 為了解決 merge sort 總是會使用額外的空間來合併，提出了一種方法，只須針對重疊的部份作 merge。e.g. [1,2,3,6,10] merge [4,5,7,9,12,14,17] 只須判斷 [4~10] 就可以全部做完，因為剩下的都不會互相影響。 [Galloping mode](https://en.wikipedia.org/wiki/Timsort#Galloping_mode_during_merge) 會設置 minimum galloping threshold 來判斷是否要做 galloping mode, 此機制是每次從一個 run 找到要插入的點，再從被插入的 run 找到對應插入點。 e.g. - $A[1,2,3,7,8,9]$ - $B[4,5,6,10,11,12]$ 找到 4 inserted point - $A[1,2,3,4,7,8,9]$ - $B[5,6,10,11,12]$ 找到 7 inserted point - $A[1,2,3,4,5,6,7,8,9]$ - $B[10,11,12]$ ...依此類推 #### Timsort 實作遇到的問題 `find_run` 這裡目前切 run 的方式沒有做過多的處理，只是把已經排好的切成一個 run, 所以需要查閱資料並實作有率的 run 切割。此外這裡有個很有趣的計數方式，因為我們在切 run 的時候只使用到單向鍊結，，因此可以把 length 用 point 的方式藏在 prev. ```c head->next->prev = (struct list_head *) len; ``` `merge` 與 libsort 的 merge 方式一樣，在看到第 2,3,4 行有注意到，如果 $a_{0-n}$ 連續n1筆都是小於 $b_0$, `*tail = a` 是否等於做白工。 ```c= if (cmp(priv, a, b) <= 0) { *tail = a; tail = &a->next; a = a->next; if (!a) { *tail = b; break; } ``` `merge_at` 把自己跟前一條 list 做 merge `merge_collapse` 依照 stak_size 來決定合併的方式，先假設 A, B, C, D, E 為 stack 目前紀錄的長度且有順序的概念。這樣可以確保鄰近的 list 會先 merge 且 merge 間的 length 不會差太多。 ``` - if (n >= 3 且 A <= B + C) || (n >= 4 且 A <= B + C) - if (A < C) 做 B 的 merge - else 做 C 的 merge - else (n >=2 且 A <= B) - 做 B 的 merge ``` #### 嘗試改良切割 run 目前完全是依據當前有多少連續上升或下降的方式切割，而現在要注意的點有以下：切割出的總數量因該小於 2 的冪，且長度不能太短。思考：如何分割是可以達到較佳的效果？站在 merge 的角度要讓 merge 的兩個 run 長度差不多。 - insertion sort 在 64 個值以下會有較佳的效能，而 run size 也要小於等於 2 的冪，如何作到？根據 [Minimum run size](https://en.wikipedia.org/wiki/Timsort#Minimum_run_size) run length 在 32 ~ 64 會有較佳的表現，這裡有提到一個方法，首先提取 msb 的 6 個 bit, 如果後面還有就 + 1 這樣就可以得到 minimim run size, 原理是：6 個 bit 代表我們要取 32 ~ 64 的範圍，而在 2 進位中假設 $0b10$ 的第 1 位元代表由兩的第 0 位元組成，所以就可以確保 merge 時是兩兩相似長度合併且會 <= 2 的冪，再來是加入 insertion sort 來處理 run 內部的排序。如何找到對應的 6 位 MSB 可以搭配 [GCC Built-in function](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) `__builtin_clz` > Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined. ```c uint64_t mask = (1ULL << 6) - 1; uin64_t num_bits = 64 - __builtin_clzll(total_length); int min_size = (total_length >> (num_bits - 6)) & mask; min_size += !!(total_length & mask << (num_bits-6)); ``` #### timsort_test 方式在 `check_list` 中發現實作了測試 stable 的方式，這是我以前在測試時不會想到要考慮正確性的點，透過加入額外 seq 的方式可以檢查最後排序的結果是否 stable，這裡還有個思考點：是否可以在排序時就檢查？ ```c int unstable = 0; list_for_each_entry_safe (entry, safe, head, list) { if (entry->list.next != head) { if (entry->val == safe->val && entry->seq > safe->seq) unstable++; } } ``` 在產生亂數的地方看到了一段註解 [Assume ASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization) ASLR 是一種安全機制，通常在操作系統中使用，它會隨機地改變程序的內存地址佈局，增加攻擊者對特定記憶體位置進行攻擊的難度，所以使用的 main 每次位置都會不一樣。 #### 實驗設計 - minimal run size 調整大小的影響。 ## 第二次測驗 ### Testing 1 目標是利用 preorder 加上 inorder 做出 unique 的數列，並加以測試。 e.g. - Input : preorder = [3,9,20,15,7], inorder = [9,3,15,20,7] - Output: [3,9,20,null,null,15,7] 這邊注意到 `struct hlist` 成員有 `*next` 以及 `**prev`, 為了了解原因參考了[Linux 核心的 hash table 實作](https://hackmd.io/@sysprog/linux-hashtable)，因此在這裡整理我的理解。為什麼要以 `**pprev` 的方式而不是 doubly link list? 因為我們可以透過控制 prev 達到不用再將 list_head 傳入當 function 就可以處理刪除 first node 的例外處理。 ```c struct hlist_head { struct hlist_node *first; }; struct hlist_node { struct hlist_node *next, **pprev; }; ``` ```graphviz digraph structs { rankdir=LR; node [shape=none] subgraph cluster1{ label="hlist_head[num]"; node[shape=record] struct1[label= "NULL|<f1>fisrt|<f2>first|NULL"]; } subgraph cluster2{ node[shape=record]; hk_1[label="hlist_node | {<prev>pprev | <next>next}"]; label="hash_key1" } subgraph cluster_2 { style=filled; color=lightgrey; node [shape=record]; hk_2 [label="hlist_node | {<prev>pprev | <next>next}"]; label="hash_key 2" } subgraph cluster_3 { style=filled; color=lightblue; node [shape=record]; hk_3 [label="hlist_node | {<prev>pprev | <next>next}"]; label="hash_key 3" hk_3:next->null } struct1:f1 -> hk_1:prev hk_1:prev:s -> struct1:f1 hk_1:next ->hk_2:prev hk_2:prev:s -> hk_1:next hk_2:next ->hk_3:prev hk_3:prev:s -> hk_2:next } ``` `buildTree` - 動態建立跟 inorder 數量一樣的 hlist_head : `in_heads` - 將對應 hash key 插入 `in_head` - 呼叫 dfs `dfs` 這是一個遞迴搜尋法，因為 preorder 的第一個元素及為 root 所以先設定好 root, 使用 `find` 找到對應 index(從這裡可以知道 hash 是為了加速查找，不用走過整個 link list), 接下來討論 left and right 該怎麼接上。 - left : 根據從 in_head 找到的 index 限制下次的左邊範圍， - right : 限制右邊的範圍以下面的例子來說，index 為 1 ,所以下次左邊的範圍會被限制在 1~1 間，右邊則會被限制在 2~4 之間 *preorder* 選中第一個當作 root ```graphviz digraph G { rankdir=LR node [shape=box, color=purple] node1 [label=3 style=filled] node2 [label=9] node3 [label=20] node4 [label=15] node5 [label=7] node1 -> node2; node2-> node3[arrowhead=vee]; node3->node4 node4->node5 } ``` *inorder* 分別分出屬於左子樹 or 右子樹。 ```graphviz digraph G { rankdir=LR node [shape=box, color=purple] node1 [label=9] node2 [label=3 style=filled] node3 [label=15] node4 [label=20] node5 [label=7] node1 -> node2; node2-> node3[arrowhead=vee]; node3->node4 node4->node5 } ``` #### 在 Linux 核心的 cgroups 相關原始程式碼，找出 pre-order walk 程式碼並探討 *cgroups ： control groups* [patch of cgroups writeback support](https://lore.kernel.org/all/1432329245-5844-1-git-send-email-tj@kernel.org/) 我在閱讀 `fsync` syscall 時為了瞭解 write back 的流程注意到 [Writeback and control groups](https://lwn.net/Articles/648292/), 把資料從 memory 寫入 persistent device 是一件棘手的事，kernel 必須在 dirty pages 與 limited I/O bandwidth 間平衡，以下提出為什麼需要使用 cgroups. > There is a "magical function" called balance_dirty_pages() that will, if need be, throttle processes dirtying a lot of pages in order to match the rate at which pages are being dirtied and the rate at which they can be cleaned. It works reasonably well in current kernels, but it only operates globally; it is not equipped to deal with control groups 然而 block controller 與 memory controller 並不能好好協調 > Writeback is clearly related to both memory use and I/O bandwidth, but the control-group mechanism offers no way to enable controllers to work together :::warning 我總結一下這邊的問題，如果有不對的地方請幫我指正 block control 並不能得到 memory control 的資訊，因此並不會去執行特定 control group 的資料，這樣會影響 global dirty-page watermarks ::: kernel 中有兩個 struct 會參與 write back `backing_dev_info` 知道寫入 device 以及追蹤 IO bandwith, `bdi_writeback` 設定回寫。因此為了解決上述問題，將更多的資訊從 `backing_dev_info` 移到 `bdi_writeback`. 更多資訊可以從 commit 52ebea749aaed195245701a8f90a23d672c7a933 取得。 > balance_dirty_pages() is changed to use the per-group bdi_writeback structure, as are other pieces of the writeback-control mechanism. Tejun described it as being mostly "a giant plumbing job. 在 `submit_bio` 裡面做的第一件事就跟 cgroup 有關西，不過還有待理解。 ```c blk_qc_t submit_bio(struct bio *bio) { if (blkcg_punt_bio_submit(bio)) return BLK_QC_T_NONE; ``` #### 可改進點目前是把 indorder 數量當作 hash table 的預設空間大小，且還有可能發生碰撞，所以其實不需要 hash table 只需要用 index 當作 key 就好，但如果當測資很大時，array 沒那麼大時，就可以構思 hash table 的大小的設計。 ### Testing 2 分析 LRU 實作，會將目前最久沒用的 cache 剔除，這裡維護兩種 list, doubly link list and hlist, 在這裡 list 維護所有的 cache, 會形成一條長鍊，所以可以簡單的插入以及剔除，而如果我們想從這條長鍊找到對應的 node 將會付出很大的成本，這時候 hlist 就派上用場，可以很容易的找到資料。 - `lRUCacheCreate` 初始化內部成員 `list_head` and `hlist_head` - `lRUCacheFree` 釋放所有 cache - `lRUCacheGet` 這裡會將回傳對應 key 值的 value, 且 `list_move` 會將此次的 node 移到 list_head 的頭，這樣便可以達成 LRU 的需求。 - `lRUCachePut` 想要將值放入 cache, 有兩種情況，如果 cache 存在則移到 MSB 位置，否則查看是否還有空間，如果沒空間還要額外做剔除 cache 的動作，接下來再將新的 cache 放入 list and hlist. #### 在 Linux 核心找出 LRU 相關程式碼並探討 > linux kernel 5.14.1 先觀察 [struct page](https://elixir.bootlin.com/linux/v5.14.1/source/include/linux/mm_types.h#L70) 首先注意到用到了大量的 `union` 這表示這個 struct 被用到很多不同的場景 ```c /** * @lru: Pageout list, eg. active_list protected by * lruvec->lru_lock. Sometimes used as a generic list * by the page owner. */ struct list_head lru; ``` 當發生 page fault 時，會伴隨發生踢除 page cache 的行為。 ### Testing 3 find n_th bit 此 macro 在 linux kernel 中的 [bitmap.h](https://elixir.bootlin.com/linux/latest/source/include/linux/bitmap.h#L215) 中出現，主要的目的是設定一個 mask 從低位元開始的 nbits 都是1 其他都是 0, e.g. 輸入3 得到 0b11. 0UL 右移會以 0 的方式補其左邊，而為什麼 nbit 要加上負號，假設輸入 3 表示應該右移 64 - 3 位, 因為負號是 $\sim nbit+1$ 也就是 $-3 + 1$ 所以 $64 - 1 -3-1=61$ 剛好是需要右移的量。這裡的 >> 證實了 unsign shift 是以 logical shift 的方式補齊 > 找資料輔助說明 ```c #define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1))) ``` 以下的code 參考 [GCC built-in functions](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) 是用來 ```c! #define small_const_nbits(nbits) \ (__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG && (nbits) > 0) ``` - `find_nth_bit`