Linux 核心專題: 重做 lab0

執行人: ChenFuhuangKye
解說影片連結

Reviewed by `LindaTing0106`

透過 Valgrind，可以觀察程式運作時的 cache miss 狀況，分析其原因，並改善程式碼。

有針對使 cache miss 降低的議題對程式碼進行改善了嗎？

Reviewed by `kevinzxc1217`

而 list_del 需要多次讀取 node 前後的節點位置，因此容易發生 cache miss

這裡為什麼讀取 node 前後節點位置會容易發生 cache miss 呢?

Reviewed by `chloe0919`

可以不需要貼上完整程式碼，貼上特別要說明的段落即可。
hackmd 撰寫請符合中文和英文字元之間要有空白字元

由於每次檢驗彼此獨立，因此 perf 計算出的 cache missing 數據，除非兩種演算法的結果差異達到一定程度，否則無法用於評估演算法的好壞。

可以再清楚說明為何每次檢驗彼此獨立，會影響 perf 計算出的 chche missing 在評估演算法上的效能。

Reviewed by `Lccgth`

我們可以發現，merge_sort 發生了 13,331,089 次 cache miss，這是因為遞迴呼叫容易多次出現 cache miss。

可以說明為什麼遞迴呼叫容易多次出現 cache miss。

遞迴呼叫中，每次的調用會存取不同的記憶體位置，可能會導致 cache miss。

Reviewed by `mesohandsome`

重複執行 5 次，分別得到

執行 5 次是否太少？

Reviewed by `dockyu`

如果
$p$ 小於 0.05 ，拒絕虛無假設，硬幣為不公平硬幣

$χ^{2} = 36$ ，
$p < 0.05$ ， 1不符合虛無假設，硬幣為不公平硬幣

$χ^{2} = 1$ ，
$0.50 < p < 0.25$ ，
$p > 0.05$ ，符合虛無假設，硬幣為公平硬幣

$χ^{2} = 0$ ，
$p = 0.99 > 0.05$ ，符合虛無假設，硬幣為公平硬幣

可以說明

p

是如何得出的。

首先確認自由度，由於硬幣只有正反面，假設正面出現機率為 x ，反面出現的機率為 1 - x ，由此可以知道拋硬幣的自由度為 1 。接著利用卡方值查表，舉第 1 組為例，此時的卡方值為 36 、自由度為 1 ，而當自由度為 1 且 p = 0.05 時，卡方值為 3.84 ，可以發現第一組的卡方值大於 p = 0.05 時，因此硬幣為不公平硬幣。

Reviewed by `hugo0406`

能補上觀察圖示「以數學分析來觀察 shuffle 1M 次的結果」的文字敘述嗎?

Reviewed by `lintin528`

可以嘗試設計幾個資料集來測試不同的排序演算法之表現，又或是他們出現 worst case 或 best case 的情形並比較。

Reviewed by `randyuncle`

最後的用 valgrind 觀察 cache missing 的部份不需用把整個包含警告訊息的結果貼上來，可以只貼出關鍵的數據就好，方便閱讀。

任務簡介

重做 lab0，並強化統計相關議題。

TODO: 依據第一次作業規範，重作 lab0

善用 git rebase 操作，在最新的 sysprog21/lab0-c 開發。
可適度彙整其他學員的成果，但務必指明出處並確認其內容有效

以數學分析來觀察 shuffle 1M 次的結果

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

改進上圖的 x 軸標示

已更新

Expectation:  41666
Observation:  {'1234': 41392, '1243': 41907, '1324': 41598, '1342': 41563, '1423': 42164, '1432': 41460, '2134': 41834, '2143': 41677, '2314': 41740, '2341': 41727, '2413': 41207, '2431': 41311, '3124': 41722, '3142': 41696, '3214': 41826, '3241': 41624, '3412': 41788, '3421': 41507, '4123': 41514, '4132': 41613, '4213': 41921, '4231': 41857, '4312': 41867, '4321': 41485}
chi square sum:  26.045792732683726

說好的數學呢？
列出 Lemma 和 Theorem，並詳盡討論。翻出久違的教科書。

可以透過卡方檢驗比較期望出現頻率分佈是否符合預期的分佈。

卡方檢驗方法

計算卡方值

χ^{2}

，檢驗卡方值的大小。

卡方值的公式如下

χ^{2} = \sum_{i = 1}^{k} \frac{(O_{i} - E_{i})^{2}}{E_{i}}

$O_{i}$ 為類別觀察的次數
$E_{i}$ 為類別期望的次數
k 為類別的數量
自由度為 df = k-1

舉拋 100 次公平硬幣為例，我們期望會有 50 次正面、 50 次反面。
舉三個例子

如果觀察到有 80 次正面、 20 次反面
- $χ^{2} = \frac{(80 - 50)^{2}}{50} + \frac{(20 - 50)^{2}}{50} = 36$
如果觀察到有 55 次正面、45 次反面
- $χ^{2} = \frac{(55 - 50)^{2}}{50} + \frac{(45 - 50)^{2}}{50} = 1$
如果觀察到有 50 次正面、 50 次反面
- $χ^{2} = \frac{(50 - 50)^{2}}{50} + \frac{(50 - 50)^{2}}{50} = 0$

可以發現當

χ^{2}

值越小，觀察次數與期望次數越接近，反之當

χ^{2}

值越大，觀察次數與期望次數差異越大，越不符合期望。

由於拋 100 次公平硬幣中，只有正面跟反面兩個類別，因此自由度為 1 。現在有了卡方值以及自由度，可以透過查表來判斷是否為公平硬幣。

以上面三個例子為例，

虛無假設 H0: 硬幣為公平硬幣
對立假設 H1: 硬幣為不公平硬幣

如果

p

小於 0.05 ，拒絕虛無假設，硬幣為不公平硬幣

$χ^{2} = 36$ ，
$p < 0.05$ ， 1不符合虛無假設，硬幣為不公平硬幣
$χ^{2} = 1$ ，
$0.50 < p < 0.25$ ，
$p > 0.05$ ，符合虛無假設，硬幣為公平硬幣
$χ^{2} = 0$ ，
$p = 0.99 > 0.05$ ，符合虛無假設，硬幣為公平硬幣

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

修正圖片的存取權限

已修正

如果 shuffle 是公平的代表發牌結果出現頻率要一樣，以發了一百萬次牌為例，會有 24 種狀況，每種狀況出現的期望次數為 41666 次。

可以透過卡方檢驗比較期望出现頻率分佈是否符合預期的分佈。如果洗牌是公平的，代表發牌结果出現頻率應當相同。以發了一百萬次牌為例，會有24種情况，每種情況出現的期望次數為 41666 次。

配適度檢定建立的假設如下：

虛無假設 H0: 數據符合出現的頻率分佈
對立假設 H1: 數據不符合出現的頻率分佈

計算的 p-value 為 0.299 大於 0.005 ，無法拒絕虛無假設，因此結論 shuffle 的結果符合期望的結果。

$ python test_shuffle.py
Power_divergenceResult(statistic=np.float64(26.045119999999997), 
pvalue=np.float64(0.29874079973512085))

import numpy as np
from scipy import stats


f_obs = [41392, 41907, 41598, 41563, 42164, 41460, 41834, 41677, 41740, 41727, 41207, 41311,
         41722, 41696, 41826, 41624, 41788, 41507, 41514, 41613, 41921, 41857, 41867, 41485]

total_obs = sum(f_obs)
f_exp = [total_obs / len(f_obs)] * len(f_obs)

output = stats.chisquare(f_obs=f_obs, f_exp=f_exp)

print(output)

研讀 Linux 核心原始程式碼的 lib/list_sort.c

list_sort 參數的意義。

priv : list_sort 函式不會去使用，而是傳遞給 cmp 比較函式，可以用於傳遞需要在比較函式中使用的額外訊息。
head : 為鏈結串列的開頭。
cmp : 為一個比較函式，用於比較兩個元素之間的排列順序，並回傳一個整數來表示兩個元素的比較結果。

由於 head 跟 cmp 不能為 NULL ，加上 __attribute__((nonnull(2, 3))) 讓編譯器可以發出警告。

__attribute__((nonnull(2, 3))) void list_sort(void *priv,
                                              struct list_head *head,
                                              list_cmp_func_t cmp)

list_sort 初始值

list : 這是一個指向鏈結串列中第一個節點的指標，並確保此鏈結串列數量不是 0 或是 1 。
pending : 這是一個指向待處理節點的指標，其初始值為 NULL 。
count : 這是一個整數值，用來紀錄待處理節點的數量。

注意用語，務必詳閱 https://hackmd.io/@sysprog/it-vocabulary

list_sort 有分成三個區塊。

逐一走訪每個節點，並把節點逐一加入 pending ，當 pending 的數量 + 1 後為 2 的冪就不合併，反之合併。
當 list 中沒有節點時，將 pending 中所有節點合併。
最後整理成環狀串列。

執行步驟

`在 do while` 之前

count = 1

加入 6 到 pending

count = 2

加入 5 到 pending

count = 3

加入 4 到 pending

count = 4

加入 3 到 pending ，5 節點的分支與 6 節點的分支合併

count = 5

加入 2 到 pending ， 3 節點的分支與 4 節點的分支合併。

count = 6

加入 1 到 pending ， 3 節點的分支與 5 節點的分支合併。

當 `list` 沒有節點

將 1 節點的分支、 2 節點的分支與 3 節點的分支合併。

最後將頭尾接上

強化統計的基礎

例如針對排序和其分析，需要多大的樣本空間？

TODO: 改進鏈結串列的排序

拉近與 Linux 核心鏈結串列排序的距離，甚至在特定狀況得以超越其性能。

TODO: 善用工具進行分析

如 perf, valgrind, massif

除了比較執行時間外，判斷排序演算法的好壞還可以觀察其發生 cache missing 的情況。由於 CPU 與 cache 之間的溝通速度遠快於與記憶體的溝通速度，如果能夠降低 cache missing 的次數，則代表排序演算法較優。然而，每次計算 cache missing 的結果都可能有所不同，因此找到適當的樣本大小來代表演算法的 cache missing 是非常重要的。

merge_sort

實作 merge_sort 可以利用 Divide and Conquer 使用遞迴方法，每次遞迴先將 list 分成兩個左右兩個區塊，直到 list 不能分割，接著合併左右 list 直到全部合併。

void merge_sort(struct list_head *head, bool descend)
{
    if (!head || list_empty(head) || list_is_singular(head))
        return;

    struct list_head *mid = head;
    for (struct list_head *fast = head->next;
         fast != head && fast->next != head; fast = fast->next->next)
        mid = mid->next;

    LIST_HEAD(left);
    LIST_HEAD(right);

    list_cut_position(&left, head, mid);
    list_splice_init(head, &right);

    merge_sort(&left, descend);
    merge_sort(&right, descend);

    merge_list(head, &left, &right, descend);
}

void merge_list(struct list_head *head,
                struct list_head *left,
                struct list_head *right,
                bool descend)
{
    LIST_HEAD(tmp);

    while (!list_empty(left) && !list_empty(right)) {
        element_t *left_entry = list_entry(left->next, element_t, list);
        element_t *right_entry = list_entry(right->next, element_t, list);
        if (descend ? strcmp(left_entry->value, right_entry->value) > 0
                    : strcmp(left_entry->value, right_entry->value) < 0) {
            list_move_tail(left->next, &tmp);
        } else {
            list_move_tail(right->next, &tmp);
        }
    }
    list_splice_tail_init(list_empty(left) ? right : left, &tmp);
    list_splice_init(&tmp, head);
}

使用 perf 計算資料排序程式的 cache missing

首先，利用 lscpu 得知測試環境的 L3 cache 大小為 12MB，list_head 大小為 16B 。可以將樣本大小設為 12 * 1024 * 64。我設計了一個測試程式，並測試。

#include <stdio.h>
#include <stdlib.h>
#include "queue.h"

#define N 12 * 1024 * 64

int main() {
    struct list_head *head = q_new();

    for(int i = 0; i < N; i++)
        q_insert_head(head, "world");

    q_sort(head, 0);    
    return 0;   
}

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
    CPU family:          6
    Model:               165
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            2
    CPU max MHz:         5000.0000
    CPU min MHz:         800.0000
    BogoMIPS:            5199.98

$ sudo perf stat -e cache-references,cache-misses ./merge_sort

 Performance counter stats for './merge_sort':

        59,127,873      cache-references                                                      
        23,874,713      cache-misses                     #   40.38% of all cache refs         

       0.377605633 seconds time elapsed

       0.351518000 seconds user
       0.023967000 seconds sys

重複執行 5 次，分別得到

為何重複 5 次呢？而不是其他次數？你的統計學素養在哪？

	第一次取樣	第二次取樣	第三次取樣	第四次取樣	第五次取樣
cache-references	59127873	57615096	57638423	57674422	59593397
cache-misses	23874713	24929428	24034141	24130806	24184553

如何確保這些分佈是否都相同可以利用卡方檢定 (Chi-Squared Test) 中的同質性檢定 (Test of Homogeneity) ，檢驗對於同一個類別變數，資料內不同群體出現的頻率分佈是否相同。

同質性檢定建立的假設如下：

虛無假設 H0: 對於 cache missing 類別，不同的子群體之間存在相同的分佈。
對立假設 H1: 對於 cache missing 類別，不同的子群體之間存在不同的分佈。

卡方檢定中需要找出觀察與預期，我們需要定義出預期，這邊我們的虛無假設是『對於 cache missing 類別，不同的子群體之間存在相同的分佈。』，因此我們的同質性預期可以用 cache 狀況代表。

cache-references ，總共有 291,649,211
cache-misses ，總共有 121,153,641
預期的 cache missing 比例為 121,153,641 / 291,649,211 = 41.51%

利用 python 程式計算同質性檢定

import numpy as np 
from scipy import stats


output =  stats.chi2_contingency(
    observed = np.array([
        [35253160	,32685668	,33604282	,33543616	,35408844],
        [23874713 	,24929428 	,24034141 	,24130806 	,24184553]
    ])    
)

print(output)

$ python main.py 
Chi2ContingencyResult(statistic=np.float64(129007.8621825518), pvalue=np.float64(0.0), dof=4, expected_freq=array([[34565635.80424913, 33681279.64222307, 33694916.40176633,
        33715961.10133328, 34837777.0504282 ],
       [24562237.19575087, 23933816.35777693, 23943506.59823367,
        23958460.89866673, 24755619.94957181]]))

可以得到

Statistic (
$x^{2}$ ) : 129007.8621825518
- 值越大代表觀察頻率與期望頻率差異顯著
p-value: 0.0
- 值為 0小於0.05 ，在同質性檢定的結論，不是每次的 cache missing 都一樣

因此, 本次的同質性檢定得出各組數據關係不相同，而是互相獨立。雖然說可以大致推測出 cache missing 比例為 41.51％，然而經過卡方檢驗後，可以得知每次取樣的結果彼此互相獨立。有可能與測試程式有關，因為計算 cache missing 也有包含 list 在增加 node 的過程。

增加採樣次數

我寫了一個腳本，進行了 10000 次採樣，其 p-value 也接近於 0 。我好奇這些資料的分佈狀況，於是以 1% 為單位累加出現的次數，並畫出了cache miss機率分佈圖。結果顯示，cache missing 大致呈現常態分佈，但誤差範圍很大，這與同質性檢驗的結果一致，且本次的執行結果與之前 cache missing 結果不一樣。由於每次檢驗彼此獨立，因此 perf 計算出的 cache missing 數據，除非兩種演算法的結果差異達到一定程度，否則無法用於評估演算法的好壞。

#!/bin/bash

# Initialize result arrays
cache_references_results=()
cache_misses_results=()
cache_correct_results=()

# Repeat 10000 times
for i in {1..10000}
do
  # Run perf command and store the output in a variable
  output=$(perf stat -e cache-references,cache-misses ./merge_sort 2>&1)

  # Extract cache-references and cache-misses values from the output
  cache_references=$(echo "$output" | grep 'cache-references' | awk '{print $1}' | tr -d ',')
  cache_misses=$(echo "$output" | grep 'cache-misses' | awk '{print $1}' | tr -d ',')

  # Calculate cache_correct
  cache_correct=$((cache_references - cache_misses))

  # Store the results in arrays
  cache_references_results+=("$cache_references")
  cache_misses_results+=("$cache_misses")
  cache_correct_results+=("$cache_correct")
  cache_miss_rate=$(awk "BEGIN {print ($cache_misses/$cache_references)*100}")

  # Print the extracted values for the current iteration
  echo "Run $i:"
  echo "Cache References: $cache_references"
  echo "Cache Misses: $cache_misses"
  echo "Cache Correct: $cache_correct"
  echo "Cache missing: $cache_miss_rate%"
  echo "---------------------------------"
done

# Write all cache references results to a file
echo "${cache_references_results[@]}" > cache_references.txt

# Write all cache misses results to a file
echo "${cache_misses_results[@]}" > cache_misses.txt

# Write all cache correct results to a file
echo "${cache_correct_results[@]}" > cache_correct.txt

# Print all results
echo "All Cache References Results: ${cache_references_results[@]}"
echo "All Cache Misses Results: ${cache_misses_results[@]}"
echo "All Cache Correct Results: ${cache_correct_results[@]}"

使用 valgrind 觀察 cache misssing

valgrind 可以觀察到程式運作時 cache missing 發生的狀況，可以找出是那一行程式發生的多次的 cache missing 。

$valgrind --tool=cachegrind ./merge_sort 
==558249== ./.valgrindrc was not read as it is either not a regular file,
==558249==     or is world writeable, or is not owned by the current user.
==558249== Cachegrind, a cache and branch-prediction profiler
==558249== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==558249== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==558249== Command: ./merge_sort
==558249== 
--558249-- warning: L3 cache found, using its data for the LL simulation.
--558249-- warning: specified LL cache: line_size 64  assoc 16  total_size 12,582,912
--558249-- warning: simulated LL cache: line_size 64  assoc 24  total_size 12,582,912
==558249== brk segment overflow in thread #1: can't grow to 0x485b000
==558249== (see section Limitations in user manual)
==558249== NOTE: further instances of this message will not be shown
==558249== 
==558249== I   refs:      2,264,282,423
==558249== I1  misses:            1,334
==558249== LLi misses:            1,326
==558249== I1  miss rate:          0.00%
==558249== LLi miss rate:          0.00%
==558249== 
==558249== D   refs:      1,348,111,526  (871,215,763 rd   + 476,895,763 wr)
==558249== D1  misses:       24,598,017  ( 19,308,724 rd   +   5,289,293 wr)
==558249== LLd misses:        7,576,318  (  4,952,807 rd   +   2,623,511 wr)
==558249== D1  miss rate:           1.8% (        2.2%     +         1.1%  )
==558249== LLd miss rate:           0.6% (        0.6%     +         0.6%  )
==558249== 
==558249== LL refs:          24,599,351  ( 19,310,058 rd   +   5,289,293 wr)
==558249== LL misses:         7,577,644  (  4,954,133 rd   +   2,623,511 wr)
==558249== LL miss rate:            0.2% (        0.2%     +         0.6%  )


$ sudo cg_annotate cachegrind.out.558249 
--------------------------------------------------------------------------------
I1 cache:         32768 B, 64 B, 8-way associative
D1 cache:         32768 B, 64 B, 8-way associative
LL cache:         12582912 B, 64 B, 24-way associative
Command:          ./merge_sort
Data file:        cachegrind.out.558249
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:     
User annotated:   
Auto-annotation:  on

--------------------------------------------------------------------------------
Ir                     I1mr           ILmr           Dr                   D1mr                DLmr               Dw                   D1mw               DLmw               
--------------------------------------------------------------------------------
2,264,282,423 (100.0%) 1,334 (100.0%) 1,326 (100.0%) 871,215,763 (100.0%) 19,308,724 (100.0%) 4,952,807 (100.0%) 476,895,763 (100.0%) 5,289,293 (100.0%) 2,623,511 (100.0%)  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir                   I1mr        ILmr        Dr                   D1mr                DLmr               Dw                  D1mw               DLmw                file:function
--------------------------------------------------------------------------------
352,321,488 (15.56%)  6 ( 0.45%)  6 ( 0.45%) 125,829,110 (14.44%)      4,121 ( 0.02%)        14 ( 0.00%) 73,138,163 (15.34%)         0                  0           ???:merge_list
243,793,840 (10.77%)  1 ( 0.07%)  1 ( 0.08%) 121,896,920 (13.99%)      1,365 ( 0.01%)         5 ( 0.00%) 48,758,768 (10.22%)    10,090 ( 0.19%)        74 ( 0.00%)  ???:list_empty
221,769,976 ( 9.79%) 29 ( 2.17%) 29 ( 2.19%)  37,750,909 ( 4.33%)          0                  0          37,746,111 ( 7.91%) 1,572,303 (29.73%) 1,571,760 (59.91%)  ./malloc/./malloc/malloc.c:_int_malloc
186,122,168 ( 8.22%)  5 ( 0.37%)  5 ( 0.38%)  98,828,268 (11.34%) 13,331,089 (69.04%) 3,119,860 (62.99%) 33,292,271 ( 6.98%)     1,365 ( 0.03%)         9 ( 0.00%)  ???:merge_sort
173,958,096 ( 7.68%)  4 ( 0.30%)  4 ( 0.30%)  39,321,600 ( 4.51%)  4,919,860 (25.48%) 1,622,063 (32.75%)          0                  0                  0           ./string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
173,015,040 ( 7.64%)  0           0           94,371,840 (10.83%)          0                  0          62,914,560 (13.19%)         0                  0           ???:list_add_tail
141,557,760 ( 6.25%)  1 ( 0.07%)  1 ( 0.08%)  78,643,200 ( 9.03%)  1,034,147 ( 5.36%)   209,220 ( 4.22%) 47,185,920 ( 9.89%) 3,277,277 (61.96%)   655,694 (24.99%)  ???:list_del
125,829,120 ( 5.56%)  1 ( 0.07%)  1 ( 0.08%)  39,321,600 ( 4.51%)          0                  0          39,321,600 ( 8.25%)         0                  0           ???:list_move_tail
 89,653,302 ( 3.96%)  6 ( 0.45%)  6 ( 0.45%)  33,030,163 ( 3.79%)          1 ( 0.00%)         0          25,165,839 ( 5.28%)   196,159 ( 3.71%)   196,156 ( 7.48%)  ???:test_malloc
 69,205,812 ( 3.06%)  7 ( 0.52%)  7 ( 0.53%)  17,301,518 ( 1.99%)          1 ( 0.00%)         1 ( 0.00%)  6,291,973 ( 1.32%)         0                  0           ./malloc/./malloc/malloc.c:malloc
 53,477,308 ( 2.36%)  2 ( 0.15%)  2 ( 0.15%)  26,738,654 ( 3.07%)      4,095 ( 0.02%)        15 ( 0.00%) 17,301,482 ( 3.63%)    10,770 ( 0.20%)        73 ( 0.00%)  ???:list_splice
 50,128,732 ( 2.21%)  3 ( 0.22%)  3 ( 0.23%)  12,582,920 ( 1.44%)          4 ( 0.00%)         2 ( 0.00%)  4,718,595 ( 0.99%)         0                  0           ./stdlib/./stdlib/random_r.c:random_r
 33,030,165 ( 1.46%)  3 ( 0.22%)  3 ( 0.23%)  12,582,920 ( 1.44%)          1 ( 0.00%)         1 ( 0.00%)  3,145,730 ( 0.66%)         0                  0           ./stdlib/./stdlib/random.c:random
 32,243,671 ( 1.42%)  4 ( 0.30%)  4 ( 0.30%)  17,301,482 ( 1.99%)      3,699 ( 0.02%)        17 ( 0.00%)  9,437,172 ( 1.98%)   212,607 ( 4.02%)   197,017 ( 7.51%)  ???:list_cut_position
 29,884,435 ( 1.32%)  3 ( 0.22%)  3 ( 0.23%)   9,437,190 ( 1.08%)          1 ( 0.00%)         1 ( 0.00%)  4,718,595 ( 0.99%)         0                  0           ???:fail_allocation
 29,097,966 ( 1.29%)  1 ( 0.07%)  1 ( 0.08%)  11,010,041 ( 1.26%)          0                  0           4,718,589 ( 0.99%)         0                  0           ???:list_is_singular
 28,311,744 ( 1.25%) 31 ( 2.32%) 30 ( 2.26%)  14,155,838 ( 1.62%)      1,049 ( 0.01%)        13 ( 0.00%)         20 ( 0.00%)         1 ( 0.00%)         1 ( 0.00%)  ???:???
 28,311,528 ( 1.25%)  2 ( 0.15%)  2 ( 0.15%)  14,155,764 ( 1.62%)          0                  0           9,437,176 ( 1.98%)         0                  0           ???:INIT_LIST_HEAD
 26,738,654 ( 1.18%)  1 ( 0.07%)  1 ( 0.08%)  13,369,327 ( 1.53%)          0                  0           8,650,741 ( 1.81%)     4,049 ( 0.08%)        12 ( 0.00%)  ???:list_splice_tail
 25,952,256 ( 1.15%)  3 ( 0.22%)  3 ( 0.23%)   8,650,752 ( 0.99%)          0                  0           7,077,888 ( 1.48%)         0                  0           ???:q_insert_head
 25,165,792 ( 1.11%)  1 ( 0.07%)  1 ( 0.08%)   7,864,310 ( 0.90%)          0                  0           7,864,310 ( 1.65%)         0                  0           ???:list_splice_init
 22,806,540 ( 1.01%)  3 ( 0.22%)  3 ( 0.23%)   1,572,865 ( 0.18%)          0                  0           3,145,730 ( 0.66%)         0                  0           ./string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms
 20,447,245 ( 0.90%)  0           0            9,437,190 ( 1.08%)          0                  0           4,718,595 ( 0.99%)         0                  0           ???:find_footer
 18,874,368 ( 0.83%)  2 ( 0.15%)  2 ( 0.15%)   6,291,456 ( 0.72%)          0                  0           5,505,024 ( 1.15%)         0                  0           ???:test_strdup
 17,301,504 ( 0.76%)  1 ( 0.07%)  1 ( 0.08%)   9,437,184 ( 1.08%)          0                  0           6,291,456 ( 1.32%)         0                  0           ???:list_add
 12,582,896 ( 0.56%)  0           0            3,932,155 ( 0.45%)          0                  0           3,932,155 ( 0.82%)         0                  0           ???:list_splice_tail_init
 11,796,480 ( 0.52%)  2 ( 0.15%)  2 ( 0.15%)   2,359,296 ( 0.27%)          0                  0           1,572,864 ( 0.33%)         0                  0           ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms
 11,010,048 ( 0.49%)  1 ( 0.07%)  1 ( 0.08%)   1,572,864 ( 0.18%)          1 ( 0.00%)         1 ( 0.00%)          0                  0                  0           ./string/../sysdeps/x86_64/multiarch/strlen-avx2.S:__strlen_avx2
  6,291,474 ( 0.28%)  0           0            2,359,300 ( 0.27%)          0                  0             786,437 ( 0.16%)         0                  0           ???:main
  3,145,219 ( 0.14%)  2 ( 0.15%)  2 ( 0.15%)           0                   0                  0                   1 ( 0.00%)         0                  0           ./malloc/./malloc/arena.c:malloc

--------------------------------------------------------------------------------
The following files chosen for auto-annotation could not be found:
--------------------------------------------------------------------------------
  ./malloc/./malloc/arena.c
  ./malloc/./malloc/malloc.c
  ./stdlib/./stdlib/random.c
  ./stdlib/./stdlib/random_r.c
  ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
  ./string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
  ./string/../sysdeps/x86_64/multiarch/strcmp-avx2.S
  ./string/../sysdeps/x86_64/multiarch/strlen-avx2.S

我們可以發現，merge_sort 發生了 13,331,089 次 cache miss，這是因為遞迴呼叫容易多次出現 cache miss。同樣地，list_del 出現了 1,034,147 次 cache miss，這是由於 list_move_tail 會呼叫 list_del，而 list_del 需要多次讀取 node 前後的節點位置，因此容易發生 cache miss。透過 Valgrind，可以觀察程式運作時的 cache miss 狀況，分析其原因，並改善程式碼。