# 2024q1 Homework5 (assessment)
contributed by < [ChenFuhuangKye](https://github.com/kyehuang) >
## 因為自動飲料機而延畢的那一年
有些看似簡單的事情背後卻是經過無數次的實驗才得出的結果,文章中有一句話讓我十分有感觸,「一項產業進步的速度,很大程度取決於做實驗修正問題的速度。」有時候有些事情只有接觸過後才會了解設計的意涵。大學時,學校的專業科目只要求我們考試時會過就好了,對於實作我其實是沒有什麼機會接觸的,只依稀記得某某演算法的概念總覺得不難,然而如果叫我手刻出來我是完全不行的。因此我決定在大學畢業的暑假每天都寫一題 leetcode 。在練題目時,我常常會沒有考慮到例外狀況,要等到將程式碼提交後,測資會告訴你哪邊出錯了,我在進行修改。由於程式碼一提交,不到一分鐘我就可以知道結果了,我沒有訓練自己在一開始就考慮例外狀況,反而是我一直提交給網站,網站跟我說哪邊出狀況了我在修改。我很難想像機械系每次繳交設計圖需要一兩天才會有結果,將設計圖修改後又需要等待,期間心理的壓力。
這幾周我看了需多的程式碼,原來一個簡潔有架構的程式是如此優美。舉第一次作業為例,我第一次看到 list.h 中有許多的巨集,我其實不知道這些要怎麼使用,總想著我直接寫就好了不打算使用巨集,再觀摩了 [yanjiew](https://hackmd.io/@yanjiew/linux2023q1-lab0) 同學的作品,我才了解使用巨集會讓程式的架構變成很清晰易理解。指導教授常說一個好的函式不該超過 30 行,應把功能切開呼叫不同函式處理,這時候我才知道他的堅持是非常有意義的。
## 閱讀 <Linux 核心模組運作原理>
編譯核心模組的命令
```
$ make -C /lib/modules/`uname -r`/build M=`pwd` modules
```
``make -C /lib/modules/$`uname -r`/build M=$`pwd` modules`` 的作用是在指定的核心目錄``/lib/modules/$`uname -r`/build`` 執行 make 。這裡使用 `-c` 參數是用來切換到該目錄,並在此目錄執行 Makefile 。
`` M=$`pwd` `` 是告訴 `/build` 底下的 Makefile ,我的程式當前所在位置。
`modules` 是一個 target ,用於編譯目錄中所有被標記為模組的對象 (由 `obj-m` 指定)。
總結來說 ``make -C /lib/modules/$`uname -r`/build M=$`pwd` modules`` 可以利用核心的目錄下的 Makefile 編譯當前目錄的模組,且允許在自己的目錄下維護獨立的 Makefile, 無須更改核心中的 Makefile。
接著查看 `insmod` 手冊
```
NAME
insmod - Simple program to insert a module into the Linux Kernel
SYNOPSIS
insmod [filename] [module options...]
DESCRIPTION
insmod is a trivial program to insert a module into the kernel. Most
users will want to use modprobe(8) instead, which is more clever and
can handle module dependencies.
Only the most general of error messages are reported: as the work of
trying to link the module is now done inside the kernel, the dmesg
usually gives more information about errors.
```
`insmod` 的描述為插入模組到 Linux 核心。
利用 `sudo strace insmod fibdrv.ko` 可以發現會呼叫 `finit_module` 。`finit_module` 呼叫 [`idempotent_init_module`](https://elixir.bootlin.com/linux/latest/source/kernel/module/main.c#L3194) , idempotent_init_module 會在呼叫 [`init_module_from_file`](https://elixir.bootlin.com/linux/latest/source/kernel/module/main.c#L3131) ,最後在利用 init_module_from_file 呼叫 [load_module](https://elixir.bootlin.com/linux/latest/source/kernel/module/main.c#L2836) 。
## 想投入的專案
### 做 lab0 並彙整學員成果
### 以 eBPF 建構 TCP 伺服器
---
10 = 8 + 2 = (2 << 3) + 2
operator priority: shift 和 + 何者優先權高 (看 C11 規格)
## TODO: 撰寫程式碼、測試
https://wiki.archlinux.org/title/Modalias
IEE 754. Assume float is 32-bit
```c
float float_mul10(float x)
{
// using bitwise operation; No mul/div
uint32_t *num = (uint32_t*) &x;
uint32_t sign = (*num)>>31;
uint32_t exp = (((*num)>>23) & 0xFF ) -127;
uint32_t frac = ((*num) & 0x7FFFFF);
exp = (exp+3);
frac = frac + ((frac |0x800000)>>2);
if (frac & 0x800000)
{
exp = exp + 1;
frac = (frac & 0x7FFFFF)>>1;
}
*num = (sign<<31) | ((exp+127)<<23) | (frac & 0x7FFFFF);
return x;
}
```
執行結果
```
x = 0.190000, 10*x = 1.900000, float_mul10(x) = 1.900000
x = 0.361000, 10*x = 3.610000, float_mul10(x) = 3.610000
x = 0.685900, 10*x = 6.859000, float_mul10(x) = 6.859000
x = 1.303210, 10*x = 13.032099, float_mul10(x) = 13.032099
x = 2.476099, 10*x = 24.760988, float_mul10(x) = 24.760986
x = 4.704587, 10*x = 47.045876, float_mul10(x) = 47.045872
x = 8.938716, 10*x = 89.387161, float_mul10(x) = 89.387154
x = 16.983561, 10*x = 169.835602, float_mul10(x) = 169.835602
x = 32.268764, 10*x = 322.687653, float_mul10(x) = 322.687622
x = 61.310654, 10*x = 613.106567, float_mul10(x) = 613.106506
```
思考方式
假設 $x = (-1)^{sign}*2^{E-127}*(1+\sum^{23}_{i=1}b_{23-i}2^{-i})$
將$x$乘$10$
$10*x=10*(-1)^{sign}*2^{E-127}*(1+\sum^{23}_{i=1}b_{23-i}2^{-i})$
$10*x=8*\frac{5}{4}*(-1)^{sign}*2^{E-127}*(1+\sum^{23}_{i=1}b_{23-i}2^{-i})$
$10*x=(-1)^{sign}*8(2^{E-127})*\frac{5}{4}*(1+\sum^{23}_{i=1}b_{23-i}2^{-i})$
$10*x=(-1)^{sign}*(2^{E+3-127})*\frac{5}{4}(1+\sum^{23}_{i=1}b_{23-i}2^{-i})$
展開 $\frac{5}{4}(1+\sum^{23}_{i=1}b_{23-i}2^{-i})$
$\frac{5}{4}(1+\sum^{23}_{i=1}b_{23-i}2^{-i}) = (1+\sum^{23}_{i=1}b_{23-i}2^{-i}) + \frac{1}{4}(1+\sum^{23}_{i=1}b_{23-i}2^{-i})$
$\frac{5}{4}(1+\sum^{23}_{i=1}b_{23-i}2^{-i}) = (1+\sum^{23}_{i=1}b_{23-i}2^{-i}) + (\frac{1}{4}+\sum^{23}_{i=1}b_{23-i}2^{-i-2})$
因此 exp 加 3 符合 $2^{E+3-127}$ , frac 加frac 的四分之一 符合 $(1+\sum^{23}_{i=1}b_{23-i}2^{-i}) + (\frac{1}{4}+\sum^{23}_{i=1}b_{23-i}2^{-i-2}))$
如果 frac 溢位 ,將 exp 加 1 並把 frac 除 2 。
根據 [C Operator Precedence](https://en.cppreference.com/w/c/language/operator_precedence) 的優先表,可以發現 \+ 的優先權比 shift 還要高。
## TODO: lab0 重做, 強化統計的基礎 (例如需要多大的樣本空間?), sort!, 善用工具 (perf, valgrind, massif)
除了比較執行時間外,判斷排序演算法的好壞還可以觀察其發生 cache missing 的情況。由於 CPU 與 cache 之間的溝通速度遠快於與記憶體的溝通速度,如果能夠降低 cache missing 的次數,則代表排序演算法較優。然而,每次計算 cache missing 的結果都可能有所不同,因此找到適當的樣本大小來代表演算法的 cache missing 是非常重要的。
### merge_sort
實作 merge_sort 可以利用 `Divide and Conquer` 使用遞迴方法,每次遞迴先將 list 分成兩個左右兩個區塊,直到 list 不能分割,接著合併左右 list 直到全部合併。
```c
void merge_sort(struct list_head *head, bool descend)
{
if (!head || list_empty(head) || list_is_singular(head))
return;
struct list_head *mid = head;
for (struct list_head *fast = head->next;
fast != head && fast->next != head; fast = fast->next->next)
mid = mid->next;
LIST_HEAD(left);
LIST_HEAD(right);
list_cut_position(&left, head, mid);
list_splice_init(head, &right);
merge_sort(&left, descend);
merge_sort(&right, descend);
merge_list(head, &left, &right, descend);
}
void merge_list(struct list_head *head,
struct list_head *left,
struct list_head *right,
bool descend)
{
LIST_HEAD(tmp);
while (!list_empty(left) && !list_empty(right)) {
element_t *left_entry = list_entry(left->next, element_t, list);
element_t *right_entry = list_entry(right->next, element_t, list);
if (descend ? strcmp(left_entry->value, right_entry->value) > 0
: strcmp(left_entry->value, right_entry->value) < 0) {
list_move_tail(left->next, &tmp);
} else {
list_move_tail(right->next, &tmp);
}
}
list_splice_tail_init(list_empty(left) ? right : left, &tmp);
list_splice_init(&tmp, head);
}
```
### 使用 perf 計算演算法的 cache missing
首先,利用 `lscpu` 得知測試環境的 L3 cache 大小為 12MB,`list_head` 大小為 16B 。可以將樣本大小設為 12 * 1024 * 64。我設計了一個測試程式,並測試。
```c
#include <stdio.h>
#include <stdlib.h>
#include "queue.h"
#define N 12 * 1024 * 64
int main() {
struct list_head *head = q_new();
for(int i = 0; i < N; i++)
q_insert_head(head, "world");
q_sort(head, 0);
return 0;
}
```
```
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
CPU family: 6
Model: 165
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 2
CPU max MHz: 5000.0000
CPU min MHz: 800.0000
BogoMIPS: 5199.98
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fx
sr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts re
p_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est
tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes
xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb
stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep
bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm
ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi pku ospke md_clear flush_l1d arch_
capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 192 KiB (6 instances)
L1i: 192 KiB (6 instances)
L2: 1.5 MiB (6 instances)
L3: 12 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-11
Vulnerabilities:
Gather data sampling: Mitigation; Microcode
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Retbleed: Mitigation; Enhanced IBRS
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence;
BHI Syscall hardening, KVM SW loop
Srbds: Mitigation; Microcode
Tsx async abort: Not affected
```
```
$ sudo perf stat -e cache-references,cache-misses ./merge_sort
Performance counter stats for './merge_sort':
59,127,873 cache-references
23,874,713 cache-misses # 40.38% of all cache refs
0.377605633 seconds time elapsed
0.351518000 seconds user
0.023967000 seconds sys
```
重複執行 5 次,分別得到
| | 第一次取樣 | 第二次取樣 | 第三次取樣 | 第四次取樣 | 第五次取樣 |
|:----------------:|:----------:|:----------:|:----------:|:----------:|:----------:|
| cache-references | 59127873 | 57615096 | 57638423 | 57674422 | 59593397 |
| cache-misses | 23874713 | 24929428 | 24034141 | 24130806 | 24184553 |
如何確保這些分佈是否都相同可以利用卡方檢定 (Chi-Squared Test) 中的同質性檢定 ([Test of Homogeneity](https://courses.lumenlearning.com/wm-concepts-statistics/chapter/test-of-homogeneity/)) ,檢驗對於同一個類別變數,資料內不同群體出現的頻率分佈是否相同。
同質性檢定建立的假設如下:
* 虛無假設 H0: 對於 cache missing 類別,不同的子群體之間存在**相同**的分佈。
* 對立假設 H1: 對於 cache missing 類別,不同的子群體之間存在**不同**的分佈。
卡方檢定中需要找出觀察與預期,我們需要定義出預期,這邊我們的虛無假設是『對於 cache missing 類別,不同的子群體之間存在**相同**的分佈。』,因此我們的同質性預期可以用 cache 狀況代表。
* cache-references ,總共有 291,649,211
* cache-misses ,總共有 121,153,641
* 預期的 cache missing 比例為 121,153,641 / 291,649,211 = 41.51%
利用 python 程式計算同質性檢定
```python
import numpy as np
from scipy import stats
output = stats.chi2_contingency(
observed = np.array([
[35253160 ,32685668 ,33604282 ,33543616 ,35408844],
[23874713 ,24929428 ,24034141 ,24130806 ,24184553]
])
)
print(output)
```
```
$ python main.py
Chi2ContingencyResult(statistic=np.float64(129007.8621825518), pvalue=np.float64(0.0), dof=4, expected_freq=array([[34565635.80424913, 33681279.64222307, 33694916.40176633,
33715961.10133328, 34837777.0504282 ],
[24562237.19575087, 23933816.35777693, 23943506.59823367,
23958460.89866673, 24755619.94957181]]))
```
可以得到
* Statistic ($x^2$) : 129007.8621825518
* 值越大代表觀察頻率與期望頻率差異顯著
* p-value: 0.0
* 值為 0小於0.05 ,在同質性檢定的結論,不是每次的 cache missing 都一樣
因此, 本次的同質性檢定得出各組數據關係不相同,而是互相獨立。雖然說可以大致推測出 cache missing 比例為 41.51% ,然而經過卡方檢驗後,可以得知每次取樣的結果彼此互相獨立。有可能與測試程式有關,因為計算 cache missing 也有包含 list 在增加 node 的過程。
#### 增加採樣次數
我寫了一個腳本,進行了 10000 次採樣,其 p-value 也接近於 0 。我好奇這些資料的分佈狀況,於是以 1% 為單位累加出現的次數,並畫出了cache miss機率分佈圖。結果顯示,cache missing 大致呈現常態分佈,但誤差範圍很大,這與同質性檢驗的結果一致,且本次的執行結果與之前 cache missing 結果不一樣。由於每次檢驗彼此獨立,因此 perf 計算出的 cache missing 數據,除非兩種演算法的結果差異達到一定程度,否則無法用於評估演算法的好壞。
![Figure_1](https://hackmd.io/_uploads/Hy_e5DtUC.png)
```bash
#!/bin/bash
# Initialize result arrays
cache_references_results=()
cache_misses_results=()
cache_correct_results=()
# Repeat 10000 times
for i in {1..10000}
do
# Run perf command and store the output in a variable
output=$(perf stat -e cache-references,cache-misses ./merge_sort 2>&1)
# Extract cache-references and cache-misses values from the output
cache_references=$(echo "$output" | grep 'cache-references' | awk '{print $1}' | tr -d ',')
cache_misses=$(echo "$output" | grep 'cache-misses' | awk '{print $1}' | tr -d ',')
# Calculate cache_correct
cache_correct=$((cache_references - cache_misses))
# Store the results in arrays
cache_references_results+=("$cache_references")
cache_misses_results+=("$cache_misses")
cache_correct_results+=("$cache_correct")
cache_miss_rate=$(awk "BEGIN {print ($cache_misses/$cache_references)*100}")
# Print the extracted values for the current iteration
echo "Run $i:"
echo "Cache References: $cache_references"
echo "Cache Misses: $cache_misses"
echo "Cache Correct: $cache_correct"
echo "Cache missing: $cache_miss_rate%"
echo "---------------------------------"
done
# Write all cache references results to a file
echo "${cache_references_results[@]}" > cache_references.txt
# Write all cache misses results to a file
echo "${cache_misses_results[@]}" > cache_misses.txt
# Write all cache correct results to a file
echo "${cache_correct_results[@]}" > cache_correct.txt
# Print all results
echo "All Cache References Results: ${cache_references_results[@]}"
echo "All Cache Misses Results: ${cache_misses_results[@]}"
echo "All Cache Correct Results: ${cache_correct_results[@]}"
```
### 使用 valgrind 觀察 cache misssing
valgrind 可以觀察到程式運作時 cache missing 發生的狀況,可以找出是那一行程式發生的多次的 cache missing 。
```
$valgrind --tool=cachegrind ./merge_sort
==558249== ./.valgrindrc was not read as it is either not a regular file,
==558249== or is world writeable, or is not owned by the current user.
==558249== Cachegrind, a cache and branch-prediction profiler
==558249== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==558249== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==558249== Command: ./merge_sort
==558249==
--558249-- warning: L3 cache found, using its data for the LL simulation.
--558249-- warning: specified LL cache: line_size 64 assoc 16 total_size 12,582,912
--558249-- warning: simulated LL cache: line_size 64 assoc 24 total_size 12,582,912
==558249== brk segment overflow in thread #1: can't grow to 0x485b000
==558249== (see section Limitations in user manual)
==558249== NOTE: further instances of this message will not be shown
==558249==
==558249== I refs: 2,264,282,423
==558249== I1 misses: 1,334
==558249== LLi misses: 1,326
==558249== I1 miss rate: 0.00%
==558249== LLi miss rate: 0.00%
==558249==
==558249== D refs: 1,348,111,526 (871,215,763 rd + 476,895,763 wr)
==558249== D1 misses: 24,598,017 ( 19,308,724 rd + 5,289,293 wr)
==558249== LLd misses: 7,576,318 ( 4,952,807 rd + 2,623,511 wr)
==558249== D1 miss rate: 1.8% ( 2.2% + 1.1% )
==558249== LLd miss rate: 0.6% ( 0.6% + 0.6% )
==558249==
==558249== LL refs: 24,599,351 ( 19,310,058 rd + 5,289,293 wr)
==558249== LL misses: 7,577,644 ( 4,954,133 rd + 2,623,511 wr)
==558249== LL miss rate: 0.2% ( 0.2% + 0.6% )
$ sudo cg_annotate cachegrind.out.558249
--------------------------------------------------------------------------------
I1 cache: 32768 B, 64 B, 8-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 12582912 B, 64 B, 24-way associative
Command: ./merge_sort
Data file: cachegrind.out.558249
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation: on
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
2,264,282,423 (100.0%) 1,334 (100.0%) 1,326 (100.0%) 871,215,763 (100.0%) 19,308,724 (100.0%) 4,952,807 (100.0%) 476,895,763 (100.0%) 5,289,293 (100.0%) 2,623,511 (100.0%) PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
352,321,488 (15.56%) 6 ( 0.45%) 6 ( 0.45%) 125,829,110 (14.44%) 4,121 ( 0.02%) 14 ( 0.00%) 73,138,163 (15.34%) 0 0 ???:merge_list
243,793,840 (10.77%) 1 ( 0.07%) 1 ( 0.08%) 121,896,920 (13.99%) 1,365 ( 0.01%) 5 ( 0.00%) 48,758,768 (10.22%) 10,090 ( 0.19%) 74 ( 0.00%) ???:list_empty
221,769,976 ( 9.79%) 29 ( 2.17%) 29 ( 2.19%) 37,750,909 ( 4.33%) 0 0 37,746,111 ( 7.91%) 1,572,303 (29.73%) 1,571,760 (59.91%) ./malloc/./malloc/malloc.c:_int_malloc
186,122,168 ( 8.22%) 5 ( 0.37%) 5 ( 0.38%) 98,828,268 (11.34%) 13,331,089 (69.04%) 3,119,860 (62.99%) 33,292,271 ( 6.98%) 1,365 ( 0.03%) 9 ( 0.00%) ???:merge_sort
173,958,096 ( 7.68%) 4 ( 0.30%) 4 ( 0.30%) 39,321,600 ( 4.51%) 4,919,860 (25.48%) 1,622,063 (32.75%) 0 0 0 ./string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
173,015,040 ( 7.64%) 0 0 94,371,840 (10.83%) 0 0 62,914,560 (13.19%) 0 0 ???:list_add_tail
141,557,760 ( 6.25%) 1 ( 0.07%) 1 ( 0.08%) 78,643,200 ( 9.03%) 1,034,147 ( 5.36%) 209,220 ( 4.22%) 47,185,920 ( 9.89%) 3,277,277 (61.96%) 655,694 (24.99%) ???:list_del
125,829,120 ( 5.56%) 1 ( 0.07%) 1 ( 0.08%) 39,321,600 ( 4.51%) 0 0 39,321,600 ( 8.25%) 0 0 ???:list_move_tail
89,653,302 ( 3.96%) 6 ( 0.45%) 6 ( 0.45%) 33,030,163 ( 3.79%) 1 ( 0.00%) 0 25,165,839 ( 5.28%) 196,159 ( 3.71%) 196,156 ( 7.48%) ???:test_malloc
69,205,812 ( 3.06%) 7 ( 0.52%) 7 ( 0.53%) 17,301,518 ( 1.99%) 1 ( 0.00%) 1 ( 0.00%) 6,291,973 ( 1.32%) 0 0 ./malloc/./malloc/malloc.c:malloc
53,477,308 ( 2.36%) 2 ( 0.15%) 2 ( 0.15%) 26,738,654 ( 3.07%) 4,095 ( 0.02%) 15 ( 0.00%) 17,301,482 ( 3.63%) 10,770 ( 0.20%) 73 ( 0.00%) ???:list_splice
50,128,732 ( 2.21%) 3 ( 0.22%) 3 ( 0.23%) 12,582,920 ( 1.44%) 4 ( 0.00%) 2 ( 0.00%) 4,718,595 ( 0.99%) 0 0 ./stdlib/./stdlib/random_r.c:random_r
33,030,165 ( 1.46%) 3 ( 0.22%) 3 ( 0.23%) 12,582,920 ( 1.44%) 1 ( 0.00%) 1 ( 0.00%) 3,145,730 ( 0.66%) 0 0 ./stdlib/./stdlib/random.c:random
32,243,671 ( 1.42%) 4 ( 0.30%) 4 ( 0.30%) 17,301,482 ( 1.99%) 3,699 ( 0.02%) 17 ( 0.00%) 9,437,172 ( 1.98%) 212,607 ( 4.02%) 197,017 ( 7.51%) ???:list_cut_position
29,884,435 ( 1.32%) 3 ( 0.22%) 3 ( 0.23%) 9,437,190 ( 1.08%) 1 ( 0.00%) 1 ( 0.00%) 4,718,595 ( 0.99%) 0 0 ???:fail_allocation
29,097,966 ( 1.29%) 1 ( 0.07%) 1 ( 0.08%) 11,010,041 ( 1.26%) 0 0 4,718,589 ( 0.99%) 0 0 ???:list_is_singular
28,311,744 ( 1.25%) 31 ( 2.32%) 30 ( 2.26%) 14,155,838 ( 1.62%) 1,049 ( 0.01%) 13 ( 0.00%) 20 ( 0.00%) 1 ( 0.00%) 1 ( 0.00%) ???:???
28,311,528 ( 1.25%) 2 ( 0.15%) 2 ( 0.15%) 14,155,764 ( 1.62%) 0 0 9,437,176 ( 1.98%) 0 0 ???:INIT_LIST_HEAD
26,738,654 ( 1.18%) 1 ( 0.07%) 1 ( 0.08%) 13,369,327 ( 1.53%) 0 0 8,650,741 ( 1.81%) 4,049 ( 0.08%) 12 ( 0.00%) ???:list_splice_tail
25,952,256 ( 1.15%) 3 ( 0.22%) 3 ( 0.23%) 8,650,752 ( 0.99%) 0 0 7,077,888 ( 1.48%) 0 0 ???:q_insert_head
25,165,792 ( 1.11%) 1 ( 0.07%) 1 ( 0.08%) 7,864,310 ( 0.90%) 0 0 7,864,310 ( 1.65%) 0 0 ???:list_splice_init
22,806,540 ( 1.01%) 3 ( 0.22%) 3 ( 0.23%) 1,572,865 ( 0.18%) 0 0 3,145,730 ( 0.66%) 0 0 ./string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms
20,447,245 ( 0.90%) 0 0 9,437,190 ( 1.08%) 0 0 4,718,595 ( 0.99%) 0 0 ???:find_footer
18,874,368 ( 0.83%) 2 ( 0.15%) 2 ( 0.15%) 6,291,456 ( 0.72%) 0 0 5,505,024 ( 1.15%) 0 0 ???:test_strdup
17,301,504 ( 0.76%) 1 ( 0.07%) 1 ( 0.08%) 9,437,184 ( 1.08%) 0 0 6,291,456 ( 1.32%) 0 0 ???:list_add
12,582,896 ( 0.56%) 0 0 3,932,155 ( 0.45%) 0 0 3,932,155 ( 0.82%) 0 0 ???:list_splice_tail_init
11,796,480 ( 0.52%) 2 ( 0.15%) 2 ( 0.15%) 2,359,296 ( 0.27%) 0 0 1,572,864 ( 0.33%) 0 0 ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms
11,010,048 ( 0.49%) 1 ( 0.07%) 1 ( 0.08%) 1,572,864 ( 0.18%) 1 ( 0.00%) 1 ( 0.00%) 0 0 0 ./string/../sysdeps/x86_64/multiarch/strlen-avx2.S:__strlen_avx2
6,291,474 ( 0.28%) 0 0 2,359,300 ( 0.27%) 0 0 786,437 ( 0.16%) 0 0 ???:main
3,145,219 ( 0.14%) 2 ( 0.15%) 2 ( 0.15%) 0 0 0 1 ( 0.00%) 0 0 ./malloc/./malloc/arena.c:malloc
--------------------------------------------------------------------------------
The following files chosen for auto-annotation could not be found:
--------------------------------------------------------------------------------
./malloc/./malloc/arena.c
./malloc/./malloc/malloc.c
./stdlib/./stdlib/random.c
./stdlib/./stdlib/random_r.c
./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
./string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
./string/../sysdeps/x86_64/multiarch/strcmp-avx2.S
./string/../sysdeps/x86_64/multiarch/strlen-avx2.S
```
我們可以發現,merge_sort 發生了 13,331,089 次 cache miss,這是因為遞迴呼叫容易多次出現 cache miss。同樣地,list_del 出現了 1,034,147 次 cache miss,這是由於 list_move_tail 會呼叫 list_del,而 list_del 需要多次讀取 node 前後的節點位置,因此容易發生 cache miss。透過 Valgrind,可以觀察程式運作時的 cache miss 狀況,分析其原因,並改善演算法。