2024q1 Homework5 (assessment)

# 2024q1 Homework5 (assessment) contributed by < [ChenFuhuangKye](https://github.com/kyehuang) > ## 因為自動飲料機而延畢的那一年有些看似簡單的事情背後卻是經過無數次的實驗才得出的結果，文章中有一句話讓我十分有感觸，「一項產業進步的速度，很大程度取決於做實驗修正問題的速度。」有時候有些事情只有接觸過後才會了解設計的意涵。大學時，學校的專業科目只要求我們考試時會過就好了，對於實作我其實是沒有什麼機會接觸的，只依稀記得某某演算法的概念總覺得不難，然而如果叫我手刻出來我是完全不行的。因此我決定在大學畢業的暑假每天都寫一題 leetcode 。在練題目時，我常常會沒有考慮到例外狀況，要等到將程式碼提交後，測資會告訴你哪邊出錯了，我在進行修改。由於程式碼一提交，不到一分鐘我就可以知道結果了，我沒有訓練自己在一開始就考慮例外狀況，反而是我一直提交給網站，網站跟我說哪邊出狀況了我在修改。我很難想像機械系每次繳交設計圖需要一兩天才會有結果，將設計圖修改後又需要等待，期間心理的壓力。這幾周我看了需多的程式碼，原來一個簡潔有架構的程式是如此優美。舉第一次作業為例，我第一次看到 list.h 中有許多的巨集，我其實不知道這些要怎麼使用，總想著我直接寫就好了不打算使用巨集，再觀摩了 [yanjiew](https://hackmd.io/@yanjiew/linux2023q1-lab0) 同學的作品，我才了解使用巨集會讓程式的架構變成很清晰易理解。指導教授常說一個好的函式不該超過 30 行，應把功能切開呼叫不同函式處理，這時候我才知道他的堅持是非常有意義的。 ## 閱讀 <Linux 核心模組運作原理> 編譯核心模組的命令 ``` $ make -C /lib/modules/`uname -r`/build M=`pwd` modules ``` ``make -C /lib/modules/$`uname -r`/build M=$`pwd` modules`` 的作用是在指定的核心目錄``/lib/modules/$`uname -r`/build`` 執行 make 。這裡使用 `-c` 參數是用來切換到該目錄，並在此目錄執行 Makefile 。 `` M=$`pwd` `` 是告訴 `/build` 底下的 Makefile ，我的程式當前所在位置。 `modules` 是一個 target ，用於編譯目錄中所有被標記為模組的對象 (由 `obj-m` 指定)。總結來說 ``make -C /lib/modules/$`uname -r`/build M=$`pwd` modules`` 可以利用核心的目錄下的 Makefile 編譯當前目錄的模組，且允許在自己的目錄下維護獨立的 Makefile，無須更改核心中的 Makefile。接著查看 `insmod` 手冊 ``` NAME insmod - Simple program to insert a module into the Linux Kernel SYNOPSIS insmod [filename] [module options...] DESCRIPTION insmod is a trivial program to insert a module into the kernel. Most users will want to use modprobe(8) instead, which is more clever and can handle module dependencies. Only the most general of error messages are reported: as the work of trying to link the module is now done inside the kernel, the dmesg usually gives more information about errors. ``` `insmod` 的描述為插入模組到 Linux 核心。利用 `sudo strace insmod fibdrv.ko` 可以發現會呼叫 `finit_module` 。`finit_module` 呼叫 [`idempotent_init_module`](https://elixir.bootlin.com/linux/latest/source/kernel/module/main.c#L3194) ， idempotent_init_module 會在呼叫 [`init_module_from_file`](https://elixir.bootlin.com/linux/latest/source/kernel/module/main.c#L3131) ，最後在利用 init_module_from_file 呼叫 [load_module](https://elixir.bootlin.com/linux/latest/source/kernel/module/main.c#L2836) 。 ## 想投入的專案 ### 做 lab0 並彙整學員成果 ### 以 eBPF 建構 TCP 伺服器 --- 10 = 8 + 2 = (2 << 3) + 2 operator priority: shift 和 + 何者優先權高 (看 C11 規格) ## TODO: 撰寫程式碼、測試 https://wiki.archlinux.org/title/Modalias IEE 754. Assume float is 32-bit ```c float float_mul10(float x) { // using bitwise operation; No mul/div uint32_t *num = (uint32_t*) &x; uint32_t sign = (*num)>>31; uint32_t exp = (((*num)>>23) & 0xFF ) -127; uint32_t frac = ((*num) & 0x7FFFFF); exp = (exp+3); frac = frac + ((frac |0x800000)>>2); if (frac & 0x800000) { exp = exp + 1; frac = (frac & 0x7FFFFF)>>1; } *num = (sign<<31) | ((exp+127)<<23) | (frac & 0x7FFFFF); return x; } ``` 執行結果 ``` x = 0.190000, 10*x = 1.900000, float_mul10(x) = 1.900000 x = 0.361000, 10*x = 3.610000, float_mul10(x) = 3.610000 x = 0.685900, 10*x = 6.859000, float_mul10(x) = 6.859000 x = 1.303210, 10*x = 13.032099, float_mul10(x) = 13.032099 x = 2.476099, 10*x = 24.760988, float_mul10(x) = 24.760986 x = 4.704587, 10*x = 47.045876, float_mul10(x) = 47.045872 x = 8.938716, 10*x = 89.387161, float_mul10(x) = 89.387154 x = 16.983561, 10*x = 169.835602, float_mul10(x) = 169.835602 x = 32.268764, 10*x = 322.687653, float_mul10(x) = 322.687622 x = 61.310654, 10*x = 613.106567, float_mul10(x) = 613.106506 ``` 思考方式假設 $x = (-1)^{sign}*2^{E-127}*(1+\sum^{23}_{i=1}b_{23-i}2^{-i})$ 將$x$乘$10$ $10*x=10*(-1)^{sign}*2^{E-127}*(1+\sum^{23}_{i=1}b_{23-i}2^{-i})$ $10*x=8*\frac{5}{4}*(-1)^{sign}*2^{E-127}*(1+\sum^{23}_{i=1}b_{23-i}2^{-i})$ $10*x=(-1)^{sign}*8(2^{E-127})*\frac{5}{4}*(1+\sum^{23}_{i=1}b_{23-i}2^{-i})$ $10*x=(-1)^{sign}*(2^{E+3-127})*\frac{5}{4}(1+\sum^{23}_{i=1}b_{23-i}2^{-i})$ 展開 $\frac{5}{4}(1+\sum^{23}_{i=1}b_{23-i}2^{-i})$ $\frac{5}{4}(1+\sum^{23}_{i=1}b_{23-i}2^{-i}) = (1+\sum^{23}_{i=1}b_{23-i}2^{-i}) + \frac{1}{4}(1+\sum^{23}_{i=1}b_{23-i}2^{-i})$ $\frac{5}{4}(1+\sum^{23}_{i=1}b_{23-i}2^{-i}) = (1+\sum^{23}_{i=1}b_{23-i}2^{-i}) + (\frac{1}{4}+\sum^{23}_{i=1}b_{23-i}2^{-i-2})$ 因此 exp 加 3 符合 $2^{E+3-127}$ ， frac 加frac 的四分之一符合 $(1+\sum^{23}_{i=1}b_{23-i}2^{-i}) + (\frac{1}{4}+\sum^{23}_{i=1}b_{23-i}2^{-i-2}))$ 如果 frac 溢位，將 exp 加 1 並把 frac 除 2 。根據 [C Operator Precedence](https://en.cppreference.com/w/c/language/operator_precedence) 的優先表，可以發現 \+ 的優先權比 shift 還要高。 ## TODO: lab0 重做, 強化統計的基礎 (例如需要多大的樣本空間?), sort!, 善用工具 (perf, valgrind, massif) 除了比較執行時間外，判斷排序演算法的好壞還可以觀察其發生 cache missing 的情況。由於 CPU 與 cache 之間的溝通速度遠快於與記憶體的溝通速度，如果能夠降低 cache missing 的次數，則代表排序演算法較優。然而，每次計算 cache missing 的結果都可能有所不同，因此找到適當的樣本大小來代表演算法的 cache missing 是非常重要的。 ### merge_sort 實作 merge_sort 可以利用 `Divide and Conquer` 使用遞迴方法，每次遞迴先將 list 分成兩個左右兩個區塊，直到 list 不能分割，接著合併左右 list 直到全部合併。 ```c void merge_sort(struct list_head *head, bool descend) { if (!head || list_empty(head) || list_is_singular(head)) return; struct list_head *mid = head; for (struct list_head *fast = head->next; fast != head && fast->next != head; fast = fast->next->next) mid = mid->next; LIST_HEAD(left); LIST_HEAD(right); list_cut_position(&left, head, mid); list_splice_init(head, &right); merge_sort(&left, descend); merge_sort(&right, descend); merge_list(head, &left, &right, descend); } void merge_list(struct list_head *head, struct list_head *left, struct list_head *right, bool descend) { LIST_HEAD(tmp); while (!list_empty(left) && !list_empty(right)) { element_t *left_entry = list_entry(left->next, element_t, list); element_t *right_entry = list_entry(right->next, element_t, list); if (descend ? strcmp(left_entry->value, right_entry->value) > 0 : strcmp(left_entry->value, right_entry->value) < 0) { list_move_tail(left->next, &tmp); } else { list_move_tail(right->next, &tmp); } } list_splice_tail_init(list_empty(left) ? right : left, &tmp); list_splice_init(&tmp, head); } ``` ### 使用 perf 計算演算法的 cache missing 首先，利用 `lscpu` 得知測試環境的 L3 cache 大小為 12MB，`list_head` 大小為 16B 。可以將樣本大小設為 12 * 1024 * 64。我設計了一個測試程式，並測試。 ```c #include <stdio.h> #include <stdlib.h> #include "queue.h" #define N 12 * 1024 * 64 int main() { struct list_head *head = q_new(); for(int i = 0; i < N; i++) q_insert_head(head, "world"); q_sort(head, 0); return 0; } ``` ``` $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 39 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 12 On-line CPU(s) list: 0-11 Vendor ID: GenuineIntel Model name: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz CPU family: 6 Model: 165 Thread(s) per core: 2 Core(s) per socket: 6 Socket(s): 1 Stepping: 2 CPU max MHz: 5000.0000 CPU min MHz: 800.0000 BogoMIPS: 5199.98 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fx sr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts re p_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi pku ospke md_clear flush_l1d arch_ capabilities Virtualization features: Virtualization: VT-x Caches (sum of all): L1d: 192 KiB (6 instances) L1i: 192 KiB (6 instances) L2: 1.5 MiB (6 instances) L3: 12 MiB (1 instance) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-11 Vulnerabilities: Gather data sampling: Mitigation; Microcode Itlb multihit: KVM: Mitigation: VMX disabled L1tf: Not affected Mds: Not affected Meltdown: Not affected Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable Retbleed: Mitigation; Enhanced IBRS Spec rstack overflow: Not affected Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Syscall hardening, KVM SW loop Srbds: Mitigation; Microcode Tsx async abort: Not affected ``` ``` $ sudo perf stat -e cache-references,cache-misses ./merge_sort Performance counter stats for './merge_sort': 59,127,873 cache-references 23,874,713 cache-misses # 40.38% of all cache refs 0.377605633 seconds time elapsed 0.351518000 seconds user 0.023967000 seconds sys ``` 重複執行 5 次，分別得到 | | 第一次取樣 | 第二次取樣 | 第三次取樣 | 第四次取樣 | 第五次取樣 | |:----------------:|:----------:|:----------:|:----------:|:----------:|:----------:| | cache-references | 59127873 | 57615096 | 57638423 | 57674422 | 59593397 | | cache-misses | 23874713 | 24929428 | 24034141 | 24130806 | 24184553 | 如何確保這些分佈是否都相同可以利用卡方檢定 (Chi-Squared Test) 中的同質性檢定 ([Test of Homogeneity](https://courses.lumenlearning.com/wm-concepts-statistics/chapter/test-of-homogeneity/)) ，檢驗對於同一個類別變數，資料內不同群體出現的頻率分佈是否相同。同質性檢定建立的假設如下： * 虛無假設 H0: 對於 cache missing 類別，不同的子群體之間存在**相同**的分佈。 * 對立假設 H1: 對於 cache missing 類別，不同的子群體之間存在**不同**的分佈。卡方檢定中需要找出觀察與預期，我們需要定義出預期，這邊我們的虛無假設是『對於 cache missing 類別，不同的子群體之間存在**相同**的分佈。』，因此我們的同質性預期可以用 cache 狀況代表。 * cache-references ，總共有 291,649,211 * cache-misses ，總共有 121,153,641 * 預期的 cache missing 比例為 121,153,641 / 291,649,211 = 41.51% 利用 python 程式計算同質性檢定 ```python import numpy as np from scipy import stats output = stats.chi2_contingency( observed = np.array([ [35253160 ,32685668 ,33604282 ,33543616 ,35408844], [23874713 ,24929428 ,24034141 ,24130806 ,24184553] ]) ) print(output) ``` ``` $ python main.py Chi2ContingencyResult(statistic=np.float64(129007.8621825518), pvalue=np.float64(0.0), dof=4, expected_freq=array([[34565635.80424913, 33681279.64222307, 33694916.40176633, 33715961.10133328, 34837777.0504282 ], [24562237.19575087, 23933816.35777693, 23943506.59823367, 23958460.89866673, 24755619.94957181]])) ``` 可以得到 * Statistic ($x^2$) : 129007.8621825518 * 值越大代表觀察頻率與期望頻率差異顯著 * p-value: 0.0 * 值為 0小於0.05 ，在同質性檢定的結論，不是每次的 cache missing 都一樣因此, 本次的同質性檢定得出各組數據關係不相同，而是互相獨立。雖然說可以大致推測出 cache missing 比例為 41.51％，然而經過卡方檢驗後，可以得知每次取樣的結果彼此互相獨立。有可能與測試程式有關，因為計算 cache missing 也有包含 list 在增加 node 的過程。 #### 增加採樣次數我寫了一個腳本，進行了 10000 次採樣，其 p-value 也接近於 0 。我好奇這些資料的分佈狀況，於是以 1% 為單位累加出現的次數，並畫出了cache miss機率分佈圖。結果顯示，cache missing 大致呈現常態分佈，但誤差範圍很大，這與同質性檢驗的結果一致，且本次的執行結果與之前 cache missing 結果不一樣。由於每次檢驗彼此獨立，因此 perf 計算出的 cache missing 數據，除非兩種演算法的結果差異達到一定程度，否則無法用於評估演算法的好壞。 ![Figure_1](https://hackmd.io/_uploads/Hy_e5DtUC.png) ```bash #!/bin/bash # Initialize result arrays cache_references_results=() cache_misses_results=() cache_correct_results=() # Repeat 10000 times for i in {1..10000} do # Run perf command and store the output in a variable output=$(perf stat -e cache-references,cache-misses ./merge_sort 2>&1) # Extract cache-references and cache-misses values from the output cache_references=$(echo "$output" | grep 'cache-references' | awk '{print $1}' | tr -d ',') cache_misses=$(echo "$output" | grep 'cache-misses' | awk '{print $1}' | tr -d ',') # Calculate cache_correct cache_correct=$((cache_references - cache_misses)) # Store the results in arrays cache_references_results+=("$cache_references") cache_misses_results+=("$cache_misses") cache_correct_results+=("$cache_correct") cache_miss_rate=$(awk "BEGIN {print ($cache_misses/$cache_references)*100}") # Print the extracted values for the current iteration echo "Run $i:" echo "Cache References: $cache_references" echo "Cache Misses: $cache_misses" echo "Cache Correct: $cache_correct" echo "Cache missing: $cache_miss_rate%" echo "---------------------------------" done # Write all cache references results to a file echo "${cache_references_results[@]}" > cache_references.txt # Write all cache misses results to a file echo "${cache_misses_results[@]}" > cache_misses.txt # Write all cache correct results to a file echo "${cache_correct_results[@]}" > cache_correct.txt # Print all results echo "All Cache References Results: ${cache_references_results[@]}" echo "All Cache Misses Results: ${cache_misses_results[@]}" echo "All Cache Correct Results: ${cache_correct_results[@]}" ``` ### 使用 valgrind 觀察 cache misssing valgrind 可以觀察到程式運作時 cache missing 發生的狀況，可以找出是那一行程式發生的多次的 cache missing 。 ``` $valgrind --tool=cachegrind ./merge_sort ==558249== ./.valgrindrc was not read as it is either not a regular file, ==558249== or is world writeable, or is not owned by the current user. ==558249== Cachegrind, a cache and branch-prediction profiler ==558249== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al. ==558249== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info ==558249== Command: ./merge_sort ==558249== --558249-- warning: L3 cache found, using its data for the LL simulation. --558249-- warning: specified LL cache: line_size 64 assoc 16 total_size 12,582,912 --558249-- warning: simulated LL cache: line_size 64 assoc 24 total_size 12,582,912 ==558249== brk segment overflow in thread #1: can't grow to 0x485b000 ==558249== (see section Limitations in user manual) ==558249== NOTE: further instances of this message will not be shown ==558249== ==558249== I refs: 2,264,282,423 ==558249== I1 misses: 1,334 ==558249== LLi misses: 1,326 ==558249== I1 miss rate: 0.00% ==558249== LLi miss rate: 0.00% ==558249== ==558249== D refs: 1,348,111,526 (871,215,763 rd + 476,895,763 wr) ==558249== D1 misses: 24,598,017 ( 19,308,724 rd + 5,289,293 wr) ==558249== LLd misses: 7,576,318 ( 4,952,807 rd + 2,623,511 wr) ==558249== D1 miss rate: 1.8% ( 2.2% + 1.1% ) ==558249== LLd miss rate: 0.6% ( 0.6% + 0.6% ) ==558249== ==558249== LL refs: 24,599,351 ( 19,310,058 rd + 5,289,293 wr) ==558249== LL misses: 7,577,644 ( 4,954,133 rd + 2,623,511 wr) ==558249== LL miss rate: 0.2% ( 0.2% + 0.6% ) $ sudo cg_annotate cachegrind.out.558249 -------------------------------------------------------------------------------- I1 cache: 32768 B, 64 B, 8-way associative D1 cache: 32768 B, 64 B, 8-way associative LL cache: 12582912 B, 64 B, 24-way associative Command: ./merge_sort Data file: cachegrind.out.558249 Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Thresholds: 0.1 100 100 100 100 100 100 100 100 Include dirs: User annotated: Auto-annotation: on -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw -------------------------------------------------------------------------------- 2,264,282,423 (100.0%) 1,334 (100.0%) 1,326 (100.0%) 871,215,763 (100.0%) 19,308,724 (100.0%) 4,952,807 (100.0%) 476,895,763 (100.0%) 5,289,293 (100.0%) 2,623,511 (100.0%) PROGRAM TOTALS -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function -------------------------------------------------------------------------------- 352,321,488 (15.56%) 6 ( 0.45%) 6 ( 0.45%) 125,829,110 (14.44%) 4,121 ( 0.02%) 14 ( 0.00%) 73,138,163 (15.34%) 0 0 ???:merge_list 243,793,840 (10.77%) 1 ( 0.07%) 1 ( 0.08%) 121,896,920 (13.99%) 1,365 ( 0.01%) 5 ( 0.00%) 48,758,768 (10.22%) 10,090 ( 0.19%) 74 ( 0.00%) ???:list_empty 221,769,976 ( 9.79%) 29 ( 2.17%) 29 ( 2.19%) 37,750,909 ( 4.33%) 0 0 37,746,111 ( 7.91%) 1,572,303 (29.73%) 1,571,760 (59.91%) ./malloc/./malloc/malloc.c:_int_malloc 186,122,168 ( 8.22%) 5 ( 0.37%) 5 ( 0.38%) 98,828,268 (11.34%) 13,331,089 (69.04%) 3,119,860 (62.99%) 33,292,271 ( 6.98%) 1,365 ( 0.03%) 9 ( 0.00%) ???:merge_sort 173,958,096 ( 7.68%) 4 ( 0.30%) 4 ( 0.30%) 39,321,600 ( 4.51%) 4,919,860 (25.48%) 1,622,063 (32.75%) 0 0 0 ./string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2 173,015,040 ( 7.64%) 0 0 94,371,840 (10.83%) 0 0 62,914,560 (13.19%) 0 0 ???:list_add_tail 141,557,760 ( 6.25%) 1 ( 0.07%) 1 ( 0.08%) 78,643,200 ( 9.03%) 1,034,147 ( 5.36%) 209,220 ( 4.22%) 47,185,920 ( 9.89%) 3,277,277 (61.96%) 655,694 (24.99%) ???:list_del 125,829,120 ( 5.56%) 1 ( 0.07%) 1 ( 0.08%) 39,321,600 ( 4.51%) 0 0 39,321,600 ( 8.25%) 0 0 ???:list_move_tail 89,653,302 ( 3.96%) 6 ( 0.45%) 6 ( 0.45%) 33,030,163 ( 3.79%) 1 ( 0.00%) 0 25,165,839 ( 5.28%) 196,159 ( 3.71%) 196,156 ( 7.48%) ???:test_malloc 69,205,812 ( 3.06%) 7 ( 0.52%) 7 ( 0.53%) 17,301,518 ( 1.99%) 1 ( 0.00%) 1 ( 0.00%) 6,291,973 ( 1.32%) 0 0 ./malloc/./malloc/malloc.c:malloc 53,477,308 ( 2.36%) 2 ( 0.15%) 2 ( 0.15%) 26,738,654 ( 3.07%) 4,095 ( 0.02%) 15 ( 0.00%) 17,301,482 ( 3.63%) 10,770 ( 0.20%) 73 ( 0.00%) ???:list_splice 50,128,732 ( 2.21%) 3 ( 0.22%) 3 ( 0.23%) 12,582,920 ( 1.44%) 4 ( 0.00%) 2 ( 0.00%) 4,718,595 ( 0.99%) 0 0 ./stdlib/./stdlib/random_r.c:random_r 33,030,165 ( 1.46%) 3 ( 0.22%) 3 ( 0.23%) 12,582,920 ( 1.44%) 1 ( 0.00%) 1 ( 0.00%) 3,145,730 ( 0.66%) 0 0 ./stdlib/./stdlib/random.c:random 32,243,671 ( 1.42%) 4 ( 0.30%) 4 ( 0.30%) 17,301,482 ( 1.99%) 3,699 ( 0.02%) 17 ( 0.00%) 9,437,172 ( 1.98%) 212,607 ( 4.02%) 197,017 ( 7.51%) ???:list_cut_position 29,884,435 ( 1.32%) 3 ( 0.22%) 3 ( 0.23%) 9,437,190 ( 1.08%) 1 ( 0.00%) 1 ( 0.00%) 4,718,595 ( 0.99%) 0 0 ???:fail_allocation 29,097,966 ( 1.29%) 1 ( 0.07%) 1 ( 0.08%) 11,010,041 ( 1.26%) 0 0 4,718,589 ( 0.99%) 0 0 ???:list_is_singular 28,311,744 ( 1.25%) 31 ( 2.32%) 30 ( 2.26%) 14,155,838 ( 1.62%) 1,049 ( 0.01%) 13 ( 0.00%) 20 ( 0.00%) 1 ( 0.00%) 1 ( 0.00%) ???:??? 28,311,528 ( 1.25%) 2 ( 0.15%) 2 ( 0.15%) 14,155,764 ( 1.62%) 0 0 9,437,176 ( 1.98%) 0 0 ???:INIT_LIST_HEAD 26,738,654 ( 1.18%) 1 ( 0.07%) 1 ( 0.08%) 13,369,327 ( 1.53%) 0 0 8,650,741 ( 1.81%) 4,049 ( 0.08%) 12 ( 0.00%) ???:list_splice_tail 25,952,256 ( 1.15%) 3 ( 0.22%) 3 ( 0.23%) 8,650,752 ( 0.99%) 0 0 7,077,888 ( 1.48%) 0 0 ???:q_insert_head 25,165,792 ( 1.11%) 1 ( 0.07%) 1 ( 0.08%) 7,864,310 ( 0.90%) 0 0 7,864,310 ( 1.65%) 0 0 ???:list_splice_init 22,806,540 ( 1.01%) 3 ( 0.22%) 3 ( 0.23%) 1,572,865 ( 0.18%) 0 0 3,145,730 ( 0.66%) 0 0 ./string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms 20,447,245 ( 0.90%) 0 0 9,437,190 ( 1.08%) 0 0 4,718,595 ( 0.99%) 0 0 ???:find_footer 18,874,368 ( 0.83%) 2 ( 0.15%) 2 ( 0.15%) 6,291,456 ( 0.72%) 0 0 5,505,024 ( 1.15%) 0 0 ???:test_strdup 17,301,504 ( 0.76%) 1 ( 0.07%) 1 ( 0.08%) 9,437,184 ( 1.08%) 0 0 6,291,456 ( 1.32%) 0 0 ???:list_add 12,582,896 ( 0.56%) 0 0 3,932,155 ( 0.45%) 0 0 3,932,155 ( 0.82%) 0 0 ???:list_splice_tail_init 11,796,480 ( 0.52%) 2 ( 0.15%) 2 ( 0.15%) 2,359,296 ( 0.27%) 0 0 1,572,864 ( 0.33%) 0 0 ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms 11,010,048 ( 0.49%) 1 ( 0.07%) 1 ( 0.08%) 1,572,864 ( 0.18%) 1 ( 0.00%) 1 ( 0.00%) 0 0 0 ./string/../sysdeps/x86_64/multiarch/strlen-avx2.S:__strlen_avx2 6,291,474 ( 0.28%) 0 0 2,359,300 ( 0.27%) 0 0 786,437 ( 0.16%) 0 0 ???:main 3,145,219 ( 0.14%) 2 ( 0.15%) 2 ( 0.15%) 0 0 0 1 ( 0.00%) 0 0 ./malloc/./malloc/arena.c:malloc -------------------------------------------------------------------------------- The following files chosen for auto-annotation could not be found: -------------------------------------------------------------------------------- ./malloc/./malloc/arena.c ./malloc/./malloc/malloc.c ./stdlib/./stdlib/random.c ./stdlib/./stdlib/random_r.c ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S ./string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S ./string/../sysdeps/x86_64/multiarch/strcmp-avx2.S ./string/../sysdeps/x86_64/multiarch/strlen-avx2.S ``` 我們可以發現，merge_sort 發生了 13,331,089 次 cache miss，這是因為遞迴呼叫容易多次出現 cache miss。同樣地，list_del 出現了 1,034,147 次 cache miss，這是由於 list_move_tail 會呼叫 list_del，而 list_del 需要多次讀取 node 前後的節點位置，因此容易發生 cache miss。透過 Valgrind，可以觀察程式運作時的 cache miss 狀況，分析其原因，並改善演算法。