Linux 核心專題: 並行化的合併排序及其改進

執行人: youjiaw
專題解說影片

任務簡介

回顧〈並行程式設計講座〉，運用其中的手法改寫經典的合併排序 (參見第九週測驗題)，逐步提升並行表現並予以量化分析。

TODO: 回顧〈並行程式設計講座〉並紀錄問題

對照 Linux 核心專題: 並行程式設計教材修訂

〈並行程式設計: Atomics 操作〉提到 alignas 可以避免 false sharing 或增進記憶體的存取效率，但是我查閱 Linux 核心原始程式碼卻沒找到此操作。

隨後發現 Linux 核心是使用 attribute((aligned(sizeof(long)))) 這種對齊方式。

參照 concurrent-programs 裡頭的程式碼。
Lock-free Thread Pool 的內文提到「若偵測到目前待執行的工作數目小於 worker thread 總數，則要透過 condition variable 喚醒可能處於等待狀態的 worker thread。」，為什麼不是大於？

TODO: 並行化的合併排序

第九週測驗第一題藉由 fork(2) 和 mmap(2) 系統呼叫來實作並行版本的合併排序，以此為出發，探討可能的實作和改進手法。過程中務必確保執行結果正確，且維持 stable sorting 的特性。

首先觀察原程式碼的效能瓶頸:

$ perf record ./original

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

可見主要瓶頸為:

merge
_int_malloc
get_mem_cgroup_from_mm

組合語言

merge() 
Event: cpu_core/cycles:P/

Percent        
               
               
             Disassembly of section .text:
               
             0000000000001269 <merge>:
             merge():   
               endbr64        
               push    %rbp   
               mov     %rsp,%rbp
               sub     $0x40,%rsp
               mov     %rdi,-0x28(%rbp)
               mov     %esi,-0x2c(%rbp)
               mov     %rdx,-0x38(%rbp)
               mov     %ecx,-0x30(%rbp)
               mov     -0x2c(%rbp),%edx
               mov     -0x30(%rbp),%eax
               add     %edx,%eax
               cltq           
               mov     $0x4,%esi
               mov     %rax,%rdi
             → call    calloc@plt
               mov     %rax,-0x8(%rbp)
               movl    $0x0,-0x14(%rbp)
               movl    $0x0,-0x10(%rbp)
  1.27         movl    $0x0,-0xc(%rbp)
             ↓ jmp     f3     
         4f:   mov     -0x14(%rbp),%eax
               cltq           
               lea     0x0(,%rax,4),%rdx
  8.46         mov     -0x28(%rbp),%rax
               add     %rdx,%rax
  5.40         mov     (%rax),%edx
               mov     -0x10(%rbp),%eax
  2.95         cltq           
               lea     0x0(,%rax,4),%rcx
  1.71         mov     -0x38(%rbp),%rax
               add     %rcx,%rax
  4.23         mov     (%rax),%eax
  1.62         cmp     %eax,%edx
  5.98       ↓ jg      ba     
 12.36         mov     -0x14(%rbp),%eax
               lea     0x1(%rax),%edx
  4.07         mov     %edx,-0x14(%rbp)
               cltq           
  0.28         lea     0x0(,%rax,4),%rdx
               mov     -0x28(%rbp),%rax
  1.73         lea     (%rdx,%rax,1),%rcx
               mov     -0xc(%rbp),%eax
               lea     0x1(%rax),%edx
               mov     %edx,-0xc(%rbp)
  3.31         cltq           
  0.89         lea     0x0(,%rax,4),%rdx
  0.64         mov     -0x8(%rbp),%rax
               add     %rax,%rdx
  5.59         mov     (%rcx),%eax
               mov     %eax,(%rdx)
  3.10       ↓ jmp     f3     
 12.72   ba:   mov     -0x10(%rbp),%eax
  1.85         lea     0x1(%rax),%edx
  1.20         mov     %edx,-0x10(%rbp)
               cltq           
               lea     0x0(,%rax,4),%rdx
               mov     -0x38(%rbp),%rax
  1.45         lea     (%rdx,%rax,1),%rcx
  3.44         mov     -0xc(%rbp),%eax
               lea     0x1(%rax),%edx
               mov     %edx,-0xc(%rbp)
  2.30         cltq           
  2.09         lea     0x0(,%rax,4),%rdx
               mov     -0x8(%rbp),%rax
               add     %rax,%rdx
  0.51         mov     (%rcx),%eax
  1.25         mov     %eax,(%rdx)
  2.67   f3:   mov     -0x14(%rbp),%eax
  0.81         cmp     -0x2c(%rbp),%eax
             ↓ jge     142    
  0.43         mov     -0x10(%rbp),%eax
               cmp     -0x30(%rbp),%eax
  0.47       ↑ jl      4f     
             ↓ jmp     142    
  0.37  109:   mov     -0x14(%rbp),%eax
               lea     0x1(%rax),%edx
               mov     %edx,-0x14(%rbp)
               cltq           
               lea     0x0(,%rax,4),%rdx
               mov     -0x28(%rbp),%rax
               lea     (%rdx,%rax,1),%rcx
               mov     -0xc(%rbp),%eax
  1.09         lea     0x1(%rax),%edx
               mov     %edx,-0xc(%rbp)
               cltq           
               lea     0x0(,%rax,4),%rdx
               mov     -0x8(%rbp),%rax
               add     %rax,%rdx
               mov     (%rcx),%eax
  1.36         mov     %eax,(%rdx)
        142:   mov     -0x14(%rbp),%eax
               cmp     -0x2c(%rbp),%eax
             ↑ jl      109    
  1.29       ↓ jmp     185    
  1.09  14c:   mov     -0x10(%rbp),%eax
               lea     0x1(%rax),%edx
               mov     %edx,-0x10(%rbp)
               cltq           
               lea     0x0(,%rax,4),%rdx
               mov     -0x38(%rbp),%rax
               lea     (%rdx,%rax,1),%rcx
               mov     -0xc(%rbp),%eax
               lea     0x1(%rax),%edx
               mov     %edx,-0xc(%rbp)
               cltq           
               lea     0x0(,%rax,4),%rdx
               mov     -0x8(%rbp),%rax
               add     %rax,%rdx
               mov     (%rcx),%eax
               mov     %eax,(%rdx)
        185:   mov     -0x10(%rbp),%eax
               cmp     -0x30(%rbp),%eax
             ↑ jl      14c    
               mov     -0x8(%rbp),%rax
               leave          
             ← ret

先從 merge() 開始分析，透過 annotate 功能觀察其組合語言，可以發現耗時最多的是:

12.36%: mov -0x14(%rbp),%eax
12.72%: mov -0x10(%rbp),%eax

說明 merge() 最耗時的操作是進行記憶體存取，需要頻繁從堆疊讀取資料，並進行陣列操作。

同時也觀察 cache-misses 等指標:

$ perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ./original

 Performance counter stats for './original' (5 runs):

          143,1481      cpu_core/cache-misses/           #   47.85% of all cache refs           ( +-  2.61% )  (90.52%)
           73,8702      cpu_atom/cache-misses/           #   24.69% of all cache refs           ( +-  4.77% )  (64.18%)
          299,1534      cpu_core/cache-references/                                              ( +-  3.69% )  (90.52%)
          117,6258      cpu_atom/cache-references/                                              ( +-  4.45% )  (64.18%)
      10,4709,9229      cpu_core/instructions/           #    1.55  insn per cycle              ( +-  2.55% )  (90.52%)
       5,4809,1881      cpu_atom/instructions/           #    0.81  insn per cycle              ( +-  3.87% )  (64.18%)
       6,7690,5518      cpu_core/cycles/                                                        ( +-  2.45% )  (90.52%)
       3,2715,8147      cpu_atom/cycles/                                                        ( +-  3.92% )  (64.18%)

          0.026511 +- 0.000203 seconds time elapsed  ( +-  0.77% )

可見 P-Core 相較於 E-Core 處理了較多的指令和週期，同時有較高的 cache miss rate，代表 CPU 需要頻繁從主記憶體讀取資料，而不是從快取讀取。

以上數據反映了目前 merge() 函式的記憶體存取效率不佳，特別是在 cache 利用率方面。因此，將嘗試優化記憶體存取模式。

TODO: 嘗試改進並進行充分的量化分析

考慮以下手法：

實作對 cache 更友善的合併排序

引入 coroutine 和多執行緒，研讀並行程式設計: 排程器原理

避免遞迴，參考 Linux 核心原始程式碼的資料排序策略

降低 fork 的成本，例如預先 fork (pre-forking)

實作 workqueue，善用第九週測驗第三題提及的 work stealing

降低記憶體操作的成本，搭配課程教材，理解 Linux 核心記憶體管理

2025

開發環境

$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  24
  On-line CPU(s) list:   0-23
Vendor ID:               GenuineIntel
  Model name:            13th Gen Intel(R) Core(TM) i7-13700
    CPU family:          6
    Model:               183
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           1
    Stepping:            1
    CPU max MHz:         5200.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4224.00

程式碼風格

使用 lab0 規範的程式碼書寫風格，用 clang-format 確認一致。並且編譯時指定 -O2 最佳化層級。

clang-format

BasedOnStyle: Chromium
Language: Cpp
MaxEmptyLinesToKeep: 3
IndentCaseLabels: false
AllowShortIfStatementsOnASingleLine: false
AllowShortCaseLabelsOnASingleLine: false
AllowShortLoopsOnASingleLine: false
DerivePointerAlignment: false
PointerAlignment: Right
SpaceAfterCStyleCast: true
TabWidth: 4
UseTab: Never
IndentWidth: 4
BreakBeforeBraces: Linux
AccessModifierOffset: -4

釋放記憶體資源

$ valgrind -q --leak-check=full ./original

在檢查記憶體洩漏時，會印出一大串如下的訊息：

==421823== 4,000,000 bytes in 1 blocks are definitely lost in loss record 1,028 of 1,028
==421823==    at 0x484DA83: calloc
==421823==    by 0x109299: merge
==421823==    by 0x10951C: merge_sort
==421823==    by 0x10970D: main

查看程式碼可以發現 merge() 中用 calloc() 分配記憶體，但隨後都沒有對應的釋放。

因此先更改呼叫函式的方式，用指標來儲存回傳值，並在 memcpy() 後釋放該記憶體。

-   memcpy(arr, merge(arr, left_len, arr + left_len, right_len),
-          len * sizeof(int));
+   int *merged = merge(arr, left_len, arr + left_len, right_len);
+   memcpy(arr, merged, len * sizeof(int));
+   free(merged);

更改後再次執行 valgrind 測試就不會報錯了。
隨後執行 perf 測試可以發現 cache-misses, cache-references 與 cache misses rate 均有所下降，代表減少了不必要的記憶體佔用後，讓資料更可能在 cache 中被找到，減少了對主記憶體的存取。

更改前:

 Performance counter stats for './original' (5 runs):

          131,5945      cpu_core/cache-misses/           #   46.97% of all cache refs           ( +-  2.18% )  (88.01%)
           67,5690      cpu_atom/cache-misses/           #   24.12% of all cache refs           ( +-  5.76% )  (69.25%)
          280,1727      cpu_core/cache-references/                                              ( +-  3.68% )  (88.01%)
          120,0657      cpu_atom/cache-references/                                              ( +-  3.69% )  (69.25%)
       6,1812,2717      cpu_core/instructions/           #    1.22  insn per cycle              ( +-  3.75% )  (88.01%)
       3,4086,8797      cpu_atom/instructions/           #    0.67  insn per cycle              ( +-  4.30% )  (69.25%)
       5,0592,3045      cpu_core/cycles/                                                        ( +-  3.08% )  (88.01%)
       2,5114,8103      cpu_atom/cycles/                                                        ( +-  3.59% )  (69.25%)

          0.019212 +- 0.000170 seconds time elapsed  ( +-  0.89% )

更改後:

 Performance counter stats for './modified' (5 runs):

           33,1258      cpu_core/cache-misses/           #   19.19% of all cache refs           ( +- 15.58% )  (80.94%)
           12,8966      cpu_atom/cache-misses/           #    7.47% of all cache refs           ( +- 30.75% )  (80.65%)
          172,6061      cpu_core/cache-references/                                              ( +-  3.99% )  (80.94%)
           48,5840      cpu_atom/cache-references/                                              ( +- 21.71% )  (80.65%)
       5,7529,8070      cpu_core/instructions/           #    1.34  insn per cycle              ( +-  1.88% )  (80.94%)
       3,5251,3433      cpu_atom/instructions/           #    0.82  insn per cycle              ( +-  3.32% )  (80.65%)
       4,2877,7371      cpu_core/cycles/                                                        ( +-  2.17% )  (80.94%)
       2,2721,1217      cpu_atom/cycles/                                                        ( +-  4.90% )  (80.65%)

          0.017647 +- 0.000440 seconds time elapsed  ( +-  2.49% )

減少臨時記憶體空間

merge() 中會呼叫 calloc() 來配置臨時的記憶體空間，用來儲存排序完的陣列，長度為 left_len + right_len。

若是我們不使用額外的記憶體，直接修改原始陣列內容，會遇到一個問題，就是會沒辦法紀錄被覆蓋掉的資料，但我們其實不需要備份所有的資料，只要左半邊就好，因為我們是從左排到右，所以右半邊的資料不是已經被用過了，就是還沒輪到它們所以不會被覆蓋掉。
因此，我們可以只儲存左半邊的資料，並修改原陣列來完成合併排序。而參數的 int *right_half 指標也可以省略，直接藉由 left_len + right_idx 來存取就好。

merge():

static void merge(int *arr, const int left_len, const int right_len)
{
    int *left_copy = calloc(left_len, sizeof(int));
    memcpy(left_copy, arr, left_len * sizeof(int));
    int left_idx = 0, right_idx = 0, cur_idx = 0;

    while (left_idx < left_len && right_idx < right_len) {
        if (left_copy[left_idx] <= arr[left_len + right_idx]) {
            arr[cur_idx++] = left_copy[left_idx++];
        } else {
            arr[cur_idx++] = arr[left_len + right_idx++];
        }
    }

    while (left_idx < left_len)
        arr[cur_idx++] = left_copy[left_idx++];

    free(left_copy);
}

merge_sort():

-   int *merged = merge(arr, left_len, arr + left_len, right_len);
-   memcpy(arr, merged, len * sizeof(int));
-   free(merged);
+   merge(arr, left_len, right_len);

可以看到，修改過後所有指標的開銷都有減少。

 Performance counter stats for './modified' (5 runs):

           22,0829      cpu_core/cache-misses/           #   14.37% of all cache refs           ( +- 13.42% )  (77.69%)
            4,1056      cpu_atom/cache-misses/           #    2.67% of all cache refs           ( +- 14.10% )  (63.23%)
          153,6400      cpu_core/cache-references/                                              ( +-  3.30% )  (77.69%)
           29,7139      cpu_atom/cache-references/                                              ( +- 11.92% )  (63.23%)
       5,6610,4774      cpu_core/instructions/           #    1.34  insn per cycle              ( +-  2.59% )  (77.69%)
       3,4230,1899      cpu_atom/instructions/           #    0.81  insn per cycle              ( +-  4.57% )  (63.23%)
       4,2147,8510      cpu_core/cycles/                                                        ( +-  2.32% )  (77.69%)
       2,2180,9128      cpu_atom/cycles/                                                        ( +-  4.88% )  (63.23%)

          0.016568 +- 0.000197 seconds time elapsed  ( +-  1.19% )

避免遞迴

原程式碼的合併排序採用遞迴的 top down 方式，先將陣列不斷切半，直到長度為 1 再開始進行合併。但是因為 merge_sort() 的參數是使用指標 int *arr，所以遞迴時只是傳遞指標，實際的陣列資料不會一直被複製。
而 process 數量為最多

2^{f o r k_c o u n t}

個，假設輸入陣列為 [8,3,2,9,1,4] 且設定 fork_count < 2 ，則任務切分方式以及負責的 process 如下所示:

可見在 fork_count = 2 之後，左半邊的任務就完全交由 P2 處理了。

若是要實作非遞迴的 merge_sort() 函式，分配任務的方式會不同。
































void merge_sort(int *arr, const int len)
{
    for (int curr_size = 1; curr_size < len; curr_size = curr_size * 2) {
        for (int left_start = 0; left_start < len;
             left_start += curr_size * 2) {
            int right_start = left_start + curr_size;
            if (right_start >= len)
                continue;

            int left_len = curr_size;
            int right_len = (right_start + curr_size <= len)
                                ? curr_size
                                : (len - right_start);

            if (fork_count < 32) {
                pid_t pid = fork();
                fork_count++;

                if (pid == 0) {
                    merge(arr + left_start, left_len, right_len);
                    exit(0);
                }
            } else {
                merge(arr + left_start, left_len, right_len);
            }
        }

        while (wait(NULL) > 0)
            ;
        fork_count = 0;
    }
}

上方函式從排序長度為 1 開始由左到右進行合併，每個 child process 都負責執行同樣 curr_size 的任務，並且為了避免排序的順序錯誤，parent 要等待所有 child 執行完成 (上方 32 行)，才能繼續執行下一個 curr_size 的任務，並且每輪使用最多

2^{5}

個 process。

但這麼做的問題是，對於每一個 curr_size 都會建立一次 child process，導致 fork() 和 wait() 的開銷變大。而且這樣分層排序的方法，也會讓同個陣列位址被存取的時間間隔變最長，導致時間局部性不好。

TODO:
參考 Linux 核心原始程式碼，實作時間局部性較好的資料排序策略

由結果可見，所有指標的開銷都增加。

 Performance counter stats for './modified_iterative' (5 runs):

           83,0962      cpu_core/cache-misses/           #   12.72% of all cache refs           ( +-  4.22% )
           61,4374      cpu_atom/cache-misses/           #    9.40% of all cache refs           ( +- 18.71% )  (2.04%)
          653,3341      cpu_core/cache-references/                                              ( +-  1.74% )
          255,8276      cpu_atom/cache-references/                                              ( +- 25.04% )  (2.04%)
       9,7570,2379      cpu_core/instructions/           #    1.64  insn per cycle              ( +-  0.21% )
       2,7850,6605      cpu_atom/instructions/           #    0.47  insn per cycle              ( +- 13.88% )  (2.04%)
       5,9651,5203      cpu_core/cycles/                                                        ( +-  0.38% )
       2,4837,2240      cpu_atom/cycles/                                                        ( +-  1.86% )  (2.04%)

          0.076095 +- 0.000184 seconds time elapsed  ( +-  0.24% )

TODO:

使用 thread 取代 process，降低記憶體的開銷。
用 thread pool 進行管理，減少 process 建立與同步成本。
使用 in-place merge 來減少記憶體分配。

使用 thread 取代 process

比照建立 process 的方式，使用 thread_count 紀錄 thread 的數量，每輪最多建立 32 個，並且最後會用 pthread_join() 來確保執行順序的正確。

static int thread_count = 0;
#define MAX_THREADS 32

typedef struct {
    int *arr;
    int left_len;
    int right_len;
} merge_args_t;

void *merge_thread(void *args)
{
    merge_args_t *m_args = (merge_args_t *) args;
    merge(m_args->arr, m_args->left_len, m_args->right_len);
    return NULL;
}

void merge_sort(int *arr, const int len)
{
    pthread_t threads[MAX_THREADS];
    merge_args_t args[MAX_THREADS];
    for (int curr_size = 1; curr_size < len; curr_size = curr_size * 2) {
        for (int left_start = 0; left_start < len;
             left_start += curr_size * 2) {
            int right_start = left_start + curr_size;
            if (right_start >= len)
                continue;

            int left_len = curr_size;
            int right_len = (right_start + curr_size <= len)
                                ? curr_size
                                : (len - right_start);

            if (thread_count < MAX_THREADS) {
                args[thread_count] =
                    (merge_args_t){arr + left_start, left_len, right_len};
                pthread_create(&threads[thread_count], NULL, merge_thread,
                               &args[thread_count]);
                thread_count++;
            } else {
                merge(arr + left_start, left_len, right_len);
            }
        }

        for (int i = 0; i < thread_count; i++)
            pthread_join(threads[i], NULL);
        thread_count = 0;
    }
}

從結果可以發現，cache-misses、cache-references、instructions 與 cycles 都有下降，但是執行時間卻相似。

Why?

 Performance counter stats for './modified_iterative_t' (5 runs):

           46,0530      cpu_core/cache-misses/           #   13.66% of all cache refs           ( +-  5.87% )  (99.70%)
           26,6420      cpu_atom/cache-misses/           #    7.90% of all cache refs           ( +- 15.29% )  (3.98%)
          337,2048      cpu_core/cache-references/                                              ( +-  2.34% )  (99.70%)
          158,0680      cpu_atom/cache-references/                                              ( +-  9.91% )  (3.98%)
       7,6828,7175      cpu_core/instructions/           #    1.60  insn per cycle              ( +-  0.51% )  (99.70%)
       2,7916,0568      cpu_atom/instructions/           #    0.58  insn per cycle              ( +- 25.60% )  (3.98%)
       4,7960,8226      cpu_core/cycles/                                                        ( +-  0.42% )  (99.70%)
       2,5337,3752      cpu_atom/cycles/                                                        ( +-  2.12% )  (3.98%)

          0.076603 +- 0.000370 seconds time elapsed  ( +-  0.48% )

若是觀察 task-clock、context-switches、cpu-migrations、page-faults，可以發現更改成 thread 之後，page faults 減少了超過 70%，但是 context-switches 卻變成三倍多，也讓 CPU 利用率從 1.527 降到 1.3。
process:

 Performance counter stats for './modified_iterative' (5 runs):

            156.35 msec task-clock                       #    1.539 CPUs utilized               ( +-  0.19% )
                50      context-switches                 #  319.793 /sec                        ( +-  4.08% )
                10      cpu-migrations                   #   63.959 /sec                        ( +-  5.48% )
            1,9236      page-faults                      #  123.031 K/sec                       ( +-  0.99% )

          0.101603 +- 0.000432 seconds time elapsed  ( +-  0.42% )

thread:

 Performance counter stats for './modified_iterative_t' (5 runs):

            131.33 msec task-clock                       #    1.315 CPUs utilized               ( +-  0.60% )
               179      context-switches                 #    1.363 K/sec                       ( +-  5.40% )
                 5      cpu-migrations                   #   38.071 /sec                        ( +- 17.20% )
              4530      page-faults                      #   34.492 K/sec                       ( +-  0.20% )

          0.099849 +- 0.000446 seconds time elapsed  ( +-  0.45% )

繼續觀察其他指標
process:

 Performance counter stats for './modified_iterative' (5 runs):

       4,5227,4303      cpu_core/L1-dcache-loads/                                               ( +-  1.15% )  (59.36%)
       2,0018,9808      cpu_atom/L1-dcache-loads/                                               ( +-  8.69% )  (8.13%)
          216,9259      cpu_core/L1-dcache-load-misses/  #    0.48% of all L1-dcache accesses   ( +-  4.26% )  (53.45%)
   <not supported>      cpu_atom/L1-dcache-load-misses/
          365,3226      cpu_core/L1-icache-load-misses/                                         ( +-  4.63% )  (53.30%)
          139,5034      cpu_atom/L1-icache-load-misses/                                         ( +-  6.02% )  (9.20%)
           56,7227      cpu_core/LLC-loads/                                                     ( +-  6.01% )  (58.64%)
           27,8307      cpu_atom/LLC-loads/                                                     ( +-  5.34% )  (9.20%)
            1,0014      cpu_core/LLC-load-misses/        #    1.77% of all L1-icache accesses   ( +-  8.65% )  (63.56%)
                 0      cpu_atom/LLC-load-misses/                                               (9.20%)
       4,1159,7939      cpu_core/dTLB-loads/                                                    ( +-  3.71% )  (66.60%)
       1,9909,9233      cpu_atom/dTLB-loads/                                                    ( +-  8.24% )  (9.20%)
            6,6703      cpu_core/dTLB-load-misses/       #    0.02% of all dTLB cache accesses  ( +-  6.98% )  (64.80%)
           10,8972      cpu_atom/dTLB-load-misses/       #    0.03% of all dTLB cache accesses  ( +-  4.59% )  (9.20%)
            1,9319      cpu_core/iTLB-load-misses/                                              ( +-  3.86% )  (61.59%)
              3237      cpu_atom/iTLB-load-misses/                                              ( +- 61.80% )  (1.23%)

          0.103435 +- 0.000518 seconds time elapsed  ( +-  0.50% )

thread:

 Performance counter stats for './modified_iterative_t' (5 runs):

       4,5940,4428      cpu_core/L1-dcache-loads/                                               ( +-  0.89% )  (61.23%)
       2,2018,4369      cpu_atom/L1-dcache-loads/                                               ( +-  8.83% )  (5.90%)
          130,3731      cpu_core/L1-dcache-load-misses/  #    0.28% of all L1-dcache accesses   ( +-  6.47% )  (60.44%)
   <not supported>      cpu_atom/L1-dcache-load-misses/
          268,2266      cpu_core/L1-icache-load-misses/                                         ( +-  4.89% )  (46.26%)
          109,4907      cpu_atom/L1-icache-load-misses/                                         ( +-  2.23% )  (5.90%)
           27,9794      cpu_core/LLC-loads/                                                     ( +-  2.64% )  (49.28%)
           26,8749      cpu_atom/LLC-loads/                                                     ( +-  2.75% )  (5.90%)
              4029      cpu_core/LLC-load-misses/        #    1.44% of all L1-icache accesses   ( +- 11.77% )  (54.54%)
                 0      cpu_atom/LLC-load-misses/                                               (5.90%)
       3,7674,1682      cpu_core/dTLB-loads/                                                    ( +-  2.52% )  (55.48%)
       2,0273,9010      cpu_atom/dTLB-loads/                                                    ( +-  1.09% )  (5.90%)
            2,4798      cpu_core/dTLB-load-misses/       #    0.01% of all dTLB cache accesses  ( +-  5.57% )  (64.09%)
            6,0715      cpu_atom/dTLB-load-misses/       #    0.02% of all dTLB cache accesses  ( +-  6.01% )  (5.90%)
            1,1996      cpu_core/iTLB-load-misses/                                              ( +-  2.02% )  (62.42%)
     <not counted>      cpu_atom/iTLB-load-misses/                                              ( +- 61.24% )  (0.00%)

          0.103509 +- 0.000460 seconds time elapsed  ( +-  0.44% )

總結使用 thread 取代 process 之後:

執行時間: 相似
CPU 效能變化:
- instructions、cycles 下降:
- context-switches 增加、task-clock 下降: CPU 利用率降低。
Cache 效能提升:
- cache-misses、cache-references 下降。
- L1-dcache-loads 相似，但 miss rate 下降: 局部性更好，可以更好利用高層快取。
- LLC-loads、LLC-load-misses 下降: 上層快取效率變高，且需要再去主記憶體讀取資料的次數變少。
- TLB 效能提升: 記憶體存取的局部性更好。
- page-faults 減少: 使用相同的 page table，不需複製。

所以因為 thread 可以共享部分資源，讓整體記憶體的存取效率變好，對 cache 更友善。但這種共享可能也帶來了更多的同步需求，因此 context switch 增加，CPU 利用率降低，導致整體執行時間沒有明顯改善。

建立 thread pool

若是有 thread pool，就可以減少反覆建立、銷毀 thread 的成本。
參考〈案例: Thread Pool 實作和改進〉的 tpool.c，加入 thread pool，其中因為共享變數 future 是用來儲存函式 func 的執行結果，在本任務中不需要，所以將相關程式碼移除。

相關程式碼: tpool

merge_sort() 的實作方式為於 thread pool 中先建立一定數量的 threads，並利用 semaphore 確保每層排序順序。
Thread 完成任務後會呼叫 sem_post()，而兩層迴圈中間會呼叫 tasks 次的 sem_wait()，以確保目前 curr_size 大小的所有任務都完成了，

void *merge_thread(arg *m_args)
{
    merge(m_args->arr, m_args->left_len, m_args->right_len);
    sem_post(m_args->sem);
    free(m_args);
    return NULL;
}

void merge_sort(int *arr, const int len)
{
    tpool_t pool = tpool_create(32);
    if (!pool) {
        fprintf(stderr, "Failed to create thread pool\n");
        exit(1);
    }

    for (int curr_size = 1; curr_size < len; curr_size = curr_size * 2) {
        sem_t level_done;
        sem_init(&level_done, 0, 0);
        int tasks = 0;

        for (int left_start = 0; left_start < len; left_start += curr_size * 2) {
            int right_start = left_start + curr_size;
            if (right_start >= len)
                continue;

            int left_len = curr_size;
            int right_len = (right_start + curr_size <= len) 
                                ? curr_size
                                : (len - right_start);

            arg *task_arg = malloc(sizeof(arg));
            if (!task_arg) {
                fprintf(stderr, "Failed to allocate memory for task argument\n");
                exit(1);
            }
            task_arg->arr = arr + left_start;
            task_arg->left_len = left_len;
            task_arg->right_len = right_len;
            task_arg->sem = &level_done;

            tpool_apply(pool, merge_thread, task_arg);

            tasks++;
        }

        // 等待這一層的所有任務完成
        for (int i = 0; i < tasks; i++) {
            sem_wait(&level_done);
        }
        sem_destroy(&level_done);
    }
    tpool_join(pool);
}

但更改之後所有開銷都增加了，執行時間更是變為 30 倍。

 Performance counter stats for './main' (5 runs):

           57,6853      cpu_core/cache-misses/           #    0.08% of all cache refs           ( +-  9.15% )  (60.75%)
           17,8755      cpu_atom/cache-misses/           #    0.02% of all cache refs           ( +- 12.24% )  (39.29%)
       7,2701,3541      cpu_core/cache-references/                                              ( +-  0.85% )  (60.75%)
       2,8862,0876      cpu_atom/cache-references/                                              ( +-  0.90% )  (39.29%)
     347,9882,1379      cpu_core/instructions/           #    0.75  insn per cycle              ( +-  0.64% )  (60.75%)
     224,3670,5415      cpu_atom/instructions/           #    0.48  insn per cycle              ( +-  0.49% )  (39.29%)
     463,4978,1281      cpu_core/cycles/                                                        ( +-  0.70% )  (60.75%)
     356,2195,8753      cpu_atom/cycles/                                                        ( +-  0.77% )  (39.29%)

            2.1690 +- 0.0222 seconds time elapsed  ( +-  1.02% )

tpool.c 的 threadtask_t 是單向鏈結串列，所以每次在 jobqueue_fetch() 中取得要執行的任務時，都要從頭走訪到尾端，並做刪除的動作。因此嘗試新增 *prev 指標，讓取得待執行的任務變為 O(1)。

struct __threadtask *next, *prev;

執行時間減少了 1/4。

 Performance counter stats for './main' (5 runs):

           57,9547      cpu_core/cache-misses/           #    0.08% of all cache refs           ( +-  8.00% )  (61.85%)
           19,4331      cpu_atom/cache-misses/           #    0.03% of all cache refs           ( +- 23.98% )  (38.21%)
       7,1004,1050      cpu_core/cache-references/                                              ( +-  0.99% )  (61.85%)
       2,9220,3883      cpu_atom/cache-references/                                              ( +-  1.44% )  (38.21%)
     278,0485,3627      cpu_core/instructions/           #    0.55  insn per cycle              ( +-  1.37% )  (61.85%)
     206,4901,3957      cpu_atom/instructions/           #    0.41  insn per cycle              ( +-  1.22% )  (38.21%)
     506,3622,5954      cpu_core/cycles/                                                        ( +-  0.79% )  (61.85%)
     385,2950,8312      cpu_atom/cycles/                                                        ( +-  0.89% )  (38.21%)

            1.4898 +- 0.0220 seconds time elapsed  ( +-  1.48% )

參考 Linux 核心原始程式碼的資料排序策略

list_sort.c 中使用 count 和一個 pending list 來觀察要合併的長度，維持著合併：不合併為 2 : 1 的比例。這樣的方式可以讓資料的時間局部性變好。

我參考此策略，嘗試實作了用於陣列的簡易版本。用一個 pending list 陣列儲存目前已合併的各區段長度，若是目前的區段長度與前一個長度相同，就直接合併。

例如，以下為每次加入一個節點，直至節點數量為 8 的合併順序 (括號內是當次被合併的節點加起來的節點數量):

1. 1
2. 1,1 -> (2)
3. 2,1
4. 2,1,1 -> 2,(2) -> (4)
5. 4,1
6. 4,1,1 -> 4,(2)
7. 4,2,1
8. 4,2,1,1 -> 4,2,(2) ->4,(4) -> (8)

並且可以記錄陣列的長度總和，來直接計算出要合併的陣列起始位置，像是若要將 4,1,1 合併為 4,(2)，就是合併 arr + (6-1) 和 arr + (6-2) 兩個長度為 1 的段落；將 4,2,2 合併為 4,(4) 則是合併 arr + (8-2) 和 arr + (8-4) 兩個長度為2的段落。

無並行的版本

這裡先實作無並行的版本，只需要更改 merge_sort():

#define MAX_SEGMENTS 20

void merge_sort(int *arr, const int len)
{
    if (len <= 1)
        return;

    int pending[MAX_SEGMENTS];
    int pending_count = 0, total_length = 0;
    
    for (int i = 0; i < len; i++) {
        // adding an item
        pending[pending_count] = 1;
        pending_count++;
        total_length++;
        
        // keep merging segments with the same length
        while (pending_count >= 2 &&
               pending[pending_count - 1] == pending[pending_count - 2]) {
            int merge_length = pending[pending_count - 1];
            int left_start = total_length - 2 * merge_length;

            merge(arr + left_start, merge_length, merge_length);
            
            // update the length of the last segment
            pending_count -= 1;
            pending[pending_count - 1] = 2 * merge_length;
        }
    }
    
    // merge the remaining segments
    while (pending_count > 1) {
        int right_length = pending[pending_count - 1];
        int left_length = pending[pending_count - 2];
        int left_start = total_length - right_length - left_length;

        merge(arr + left_start, left_length, right_length);
        
        pending_count -= 1;
        pending[pending_count - 1] = left_length + right_length;
    }
}

其中 pending_count 是紀錄段落的數量，total_length 則是元素總數量。而因為 N_ITEMS 為 1000000 時，pending list 中最大的元素會是

2^{19}

，所以最大長度時元素為 1, 2, 4, 8, 16, …, 2^19，共 20 個元素，所以 MAX_SEGMENTS 可設為 20。

執行結果為:

 Performance counter stats for './modified_iter_linux' (5 runs):

            6,3650      cpu_core/cache-misses/           #    9.99% of all cache refs           ( +- 42.33% )
     <not counted>      cpu_atom/cache-misses/                                                  (0.00%)
           63,6948      cpu_core/cache-references/                                              ( +-  1.05% )
     <not counted>      cpu_atom/cache-references/                                              (0.00%)
       6,8959,6667      cpu_core/instructions/           #    1.71  insn per cycle              ( +-  0.00% )
     <not counted>      cpu_atom/instructions/                                                  (0.00%)
       4,0360,2493      cpu_core/cycles/                                                        ( +-  0.07% )
     <not counted>      cpu_atom/cycles/                                                        (0.00%)

          0.078652 +- 0.000273 seconds time elapsed  ( +-  0.35% )

相較於使用以下單純的遞迴程式，cache-misses 數量減少為 70%，說明此種合併順序對快取更友善。

void merge_sort(int *arr, const int len)
{
    if (len == 1)
        return;

    const int mid = len / 2;
    const int left_len = len - mid;
    const int right_len = mid;

    merge_sort(arr, left_len);
    merge_sort(arr + left_len, right_len);

    merge(arr, left_len, right_len);
}

 Performance counter stats for './original_mono' (5 runs):

            8,6442      cpu_core/cache-misses/           #   13.58% of all cache refs           ( +- 13.83% )
     <not counted>      cpu_atom/cache-misses/                                                  (0.00%)
           63,6460      cpu_core/cache-references/                                              ( +-  1.24% )
     <not counted>      cpu_atom/cache-references/                                              (0.00%)
       6,7629,6846      cpu_core/instructions/           #    1.66  insn per cycle              ( +-  0.01% )
     <not counted>      cpu_atom/instructions/                                                  (0.00%)
       4,0749,0414      cpu_core/cycles/                                                        ( +-  0.12% )
     <not counted>      cpu_atom/cycles/                                                        (0.00%)

          0.079299 +- 0.000230 seconds time elapsed  ( +-  0.29% )

TODO: 改為並行版本，要如何控制執行順序的正確？

並行的版本

因為所有的合併任務必須按照順序執行，若是依序指派任務給不同 thread 執行，卻又要求 thread 互相等待，就會失去並行的效果。

所以我們可以直接把整個陣列切分為 thread 數量的等份，讓每個 thread 負責排序不同段落，最後再將這些已排序的段落，進行最後的排序。

所以可以將 merge_sort() 改為:

#define MAX_THREADS 32
#define MAX_SEGMENTS 20

typedef struct {
    int *arr;
    int len;
} merge_args_t;

static void *merge_thread(void *args)
{
    merge_args_t *m_args = (merge_args_t *) args;
    merge_sort_part(m_args->arr, m_args->len);
    return NULL;
}

void merge_sort(int *arr, const int len)
{
    pthread_t threads[MAX_THREADS];
    merge_args_t args[MAX_THREADS];

    int arr_len = len / MAX_THREADS,
        final_len = len - arr_len * (MAX_THREADS - 1);
    for (int i = 0; i < MAX_THREADS; i++) {
        if (i == MAX_THREADS - 1)
            arr_len = final_len;

        args[i] = (merge_args_t){arr + i * arr_len, arr_len};
        pthread_create(&threads[i], NULL, merge_thread, &args[i]);
    }

    for (int i = 0; i < MAX_THREADS; i++)
        pthread_join(threads[i], NULL);

    // merge the sorted parts from 'right to left'
    // because in merge function, the left part is copied to a new array
    for (int i = MAX_THREADS - 1; i > 0; i--)
        merge(arr + (i - 1) * arr_len, arr_len, len - i * arr_len);
}

merge_sort() 先計算分為 MAX_THREADS 個段落時，每個段落的大小為何，再依序分配任務給每個 thread，其中 merge_sort_part() 即為上方無並行版本的合併函式。

測試結果:

 Performance counter stats for './modified_iter_linux' (5 runs):

            7,5328      cpu_core/cache-misses/           #    5.47% of all cache refs           ( +- 18.40% )  (92.06%)
            1,9366      cpu_atom/cache-misses/           #    1.41% of all cache refs           ( +-  9.24% )  (81.82%)
          137,6053      cpu_core/cache-references/                                              ( +-  1.57% )  (92.06%)
            9,2884      cpu_atom/cache-references/                                              ( +-  1.15% )  (81.82%)
       7,5655,4861      cpu_core/instructions/           #    1.91  insn per cycle              ( +-  1.44% )  (92.06%)
       2,8855,0509      cpu_atom/instructions/           #    0.73  insn per cycle              ( +-  2.00% )  (81.82%)
       3,9550,1810      cpu_core/cycles/                                                        ( +-  1.66% )  (92.06%)
       1,6743,5612      cpu_atom/cycles/                                                        ( +-  1.97% )  (81.82%)

          0.023243 +- 0.000221 seconds time elapsed  ( +-  0.95% )

可見改為並行程式之後，執行時間縮短了 70%。

與原程式碼 (修正記憶體釋放錯誤之後) 進行比較

與原程式碼相比，雖然執行時間增加了35%，但是 cache-references 減少了 20%，而 cache-misses 更是減少了 78%。

原程式碼結果:

 Performance counter stats for './modified' (5 runs):

           33,1258      cpu_core/cache-misses/           #   19.19% of all cache refs           ( +- 15.58% )  (80.94%)
           12,8966      cpu_atom/cache-misses/           #    7.47% of all cache refs           ( +- 30.75% )  (80.65%)
          172,6061      cpu_core/cache-references/                                              ( +-  3.99% )  (80.94%)
           48,5840      cpu_atom/cache-references/                                              ( +- 21.71% )  (80.65%)
       5,7529,8070      cpu_core/instructions/           #    1.34  insn per cycle              ( +-  1.88% )  (80.94%)
       3,5251,3433      cpu_atom/instructions/           #    0.82  insn per cycle              ( +-  3.32% )  (80.65%)
       4,2877,7371      cpu_core/cycles/                                                        ( +-  2.17% )  (80.94%)
       2,2721,1217      cpu_atom/cycles/                                                        ( +-  4.90% )  (80.65%)

          0.017647 +- 0.000440 seconds time elapsed  ( +-  2.49% )

繼續測試其他指標，可以發現 l1 的 data 與 instruction load-misses 下降超過 40%，LLC-load-misses 更是下降了 86%。

 Performance counter stats for './modified_iter_linux' (5 runs):

       1,2277,0500      cpu_core/L1-dcache-loads/        #    1.131 G/sec                       ( +-  4.00% )  (41.89%)
         5849,7485      cpu_atom/L1-dcache-loads/        #  538.905 M/sec                       ( +-  4.55% )  (72.51%)
           29,2287      cpu_core/L1-dcache-load-misses/  #    0.24% of all L1-dcache accesses   ( +-  8.66% )  (91.50%)
   <not supported>      cpu_atom/L1-dcache-load-misses/
           54,6282      cpu_core/L1-icache-load-misses/                                         ( +-  1.30% )  (94.60%)
           12,1926      cpu_atom/L1-icache-load-misses/                                         ( +-  6.88% )  (72.51%)
            6,6629      cpu_core/LLC-loads/              #  613.816 K/sec                       ( +-  0.99% )  (94.44%)
              7656      cpu_atom/LLC-loads/              #   70.530 K/sec                       ( +- 12.90% )  (72.51%)
              1072      cpu_core/LLC-load-misses/        #    1.61% of all L1-icache accesses   ( +- 33.29% )  (99.35%)
                 0      cpu_atom/LLC-load-misses/                                               (72.51%)
            108.55 msec task-clock                       #    4.624 CPUs utilized               ( +-  0.26% )
                55      context-switches                 #  506.684 /sec                        ( +- 10.00% )
                15      cpu-migrations                   #  138.187 /sec                        ( +-  6.46% )
              1646      page-faults                      #   15.164 K/sec                       ( +-  0.31% )

          0.023473 +- 0.000167 seconds time elapsed  ( +-  0.71% )

原程式碼結果:

 Performance counter stats for './original' (5 runs):

         9584,8026      cpu_core/L1-dcache-loads/        #  827.441 M/sec                       ( +-  5.14% )  (40.78%)
         7197,7040      cpu_atom/L1-dcache-loads/        #  621.367 M/sec                       ( +-  5.26% )  (65.04%)
           51,5655      cpu_core/L1-dcache-load-misses/  #    0.54% of all L1-dcache accesses   ( +- 13.16% )  (92.74%)
   <not supported>      cpu_atom/L1-dcache-load-misses/
           93,9572      cpu_core/L1-icache-load-misses/                                         ( +-  4.79% )  (95.53%)
           25,0156      cpu_atom/L1-icache-load-misses/                                         ( +-  2.78% )  (65.04%)
            9,2860      cpu_core/LLC-loads/              #  801.646 K/sec                       ( +-  7.31% )  (97.46%)
            4,1364      cpu_atom/LLC-loads/              #  357.089 K/sec                       ( +- 18.47% )  (65.04%)
              7717      cpu_core/LLC-load-misses/        #    8.31% of all L1-icache accesses   ( +- 11.40% )  (70.79%)
              1235      cpu_atom/LLC-load-misses/        #    1.33% of all L1-icache accesses   ( +-100.06% )  (65.04%)
            115.84 msec task-clock                       #    6.611 CPUs utilized               ( +-  0.71% )
                39      context-switches                 #  336.681 /sec                        ( +-  5.71% )
                25      cpu-migrations                   #  215.821 /sec                        ( +- 11.07% )
              8208      page-faults                      #   70.858 K/sec                       ( +-  0.15% )

          0.017522 +- 0.000546 seconds time elapsed  ( +-  3.12% )

總結

指標	原程式碼結果	最終程式碼結果	改善幅度
cache-misses (cpu_core)	33,1258 次	7,5328 次	+78%
cache-misses (cpu_atom)	12,8966 次	1,9366 次	+85%
cache-references (cpu_core)	172,6061 次	137,6053 次	+20%
cache-references (cpu_atom)	48,5840 次	9,2884 次	+80%
L1-dcache-load-misses	51,5655 次	29,2287 次	+43%
L1-icache-load-misses (cpu_core)	93,9572 次	54,6282 次	+41%
L1-icache-load-misses (cpu_atom)	25,0156 次	12,1926 次	+51%
LLC-load-misses (cpu_core)	7717 次	1072 次	+86%
LLC-load-misses (cpu_atom)	1235 次	0 次	+100%
page-faults	8208 次	1646 次	+80%
context-switches	39 次	55 次	-41%
cpu-migrations	25 次	15 次	+40%
task-clock	115.84 msec	108.55 msec	+6%
執行時間	0.017522 秒	0.023473 秒	-34%

這些結果表明，這樣的排序方法可以更好的利用快取，讓資料還在快取上層時再次被使用。

2024

開發環境

$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  20
  On-line CPU(s) list:   0-19
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  10
    Socket(s):           1
    Stepping:            7
    CPU max MHz:         4700.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            7399.70
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   320 KiB (10 instances)
  L1i:                   320 KiB (10 instances)
  L2:                    10 MiB (10 instances)
  L3:                    19.3 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-19

實作 workqueue 及 pre-forking

參考〈案例: Thread Pool 實作和改進〉的 tpool.c，加入 thread pool。
首先更改 main 函式，在呼叫 merge_sort() 之前先於 thread pool 中建立指定數量的 thread，並在排序完成後將資源釋放，這樣的做法可以減少動態建立和釋放的開銷。

+   pool = tpool_create(4);
    merge_sort(arr, N_ITEMS);
+   tpool_join(pool);

merge_sort() 函式先將任務依照中間點，切分為左半邊與右半邊，並以 arg 結構體記錄任務的變數，接著使用 tpool_apply() 將各任務要執行的函式與變數加入 workqueue 中。
最後等待任務執行完，就釋放相關資源，並呼叫 merge()。

























static void merge_sort(int *arr, const int len)
{
    if (len == 1)
        return;

    const int mid = len / 2;
    const int left_len = len - mid;
    const int right_len = mid;

    arg left_task = {arr, left_len};
    arg right_task = {arr + left_len, right_len};

    tpool_future_t left_future = tpool_apply(pool, merge_sort_task, &left_task);
    tpool_future_t right_future =
        tpool_apply(pool, merge_sort_task, &right_task);

    tpool_future_get(left_future, 0);
    tpool_future_destroy(left_future);

    tpool_future_get(right_future, 0);
    tpool_future_destroy(right_future);

    memcpy(arr, merge(arr, left_len, arr + left_len, right_len),
           len * sizeof(int));
}

其中，新增的 arg 結構體與 merge_sort_task() 函式如下。

typedef struct __arg{
    int *data;
    const int len;
} arg;

void *merge_sort_task(arg *task) {
    int *arr = (*task).data;
    int len = (*task).len;
    merge_sort(arr, len);
}

但是在這樣的更改下，程式無法順利執行完成。經過測試發現，所有 threads 都會在卡在 merge_sort() 第 16 行的 tpool_future_get() 中，呼叫 pthread_cond_wait(&future->cond_finished, &future->mutex) 等待子任務的完成，卻一直沒有被喚醒。

進一步分析發現，因為 thread 是透過 jobqueue_fetch() 來取得並執行任務（下方第 1 行），也就是 merge_sort()，而上述等待情況所對應的喚醒函式 pthread_cond_broadcast() 需要等 merge_sort() 執行完成後才會被呼叫（下方第 11 行），但 merge_sort() 本身又在等待子任務完成，形成了所有 thread 執行的任務都在等待子任務完成，這種無法結束的等待情況。













void *ret_value = task->func(task->arg);
pthread_mutex_lock(&task->future->mutex);
if (task->future->flag & __FUTURE_DESTROYED) {
    pthread_mutex_unlock(&task->future->mutex);
    pthread_mutex_destroy(&task->future->mutex);
    pthread_cond_destroy(&task->future->cond_finished);
    free(task->future);
} else {
    task->future->flag |= __FUTURE_FINISHED;
    task->future->result = ret_value;
    pthread_cond_broadcast(&task->future->cond_finished);
    pthread_mutex_unlock(&task->future->mutex);
}

因為目前沒有想到其他的解決辦法，所以直接將 merge_sort() 改為非遞迴，來避免任務與子任務之間的等待狀況。

更改為非遞迴的合併排序

非遞迴版本中，要交給 thread 執行的任務就沒有 merge_sort() 了，只有 merge()，main thread 會在 merge_sort() 中提前切分好所有要進行 merge() 的任務，並將這些任務加入 workqueue，更改後的 arg 結構體與 merge_task() 函式如下。

減少參數，例如 mid。

謝謝老師的提醒，在目前的程式碼中 mid 無法直接由結構體的其他參數算出，需要使用 merge_sort() 中的 current_size。

typedef struct __arg{
    int *arr;
    int left;
    int mid;
    int right;
} arg;

void *merge_task(void *args)
{
    arg *task = (arg *)args;
    memcpy(task->arr + task->left, merge(task->arr, task->left, task->mid, task->right), (task->right - task->left + 1) * sizeof(int));
    return NULL;
}

merge_sort() 中，等待 future 任務執行完成與釋放資源的函式，可以選擇放在兩個迴圈之間，或是放在最內層的迴圈。
若是如下方放在迴圈外面，除了 futures 陣列大小難以決定以外，一次性的把所有任務加入，也會導致執行順序無法確定，造成結果錯誤的狀況。

void merge_sort(int *arr, int left, int right)
{
    int current_size;
    int left_start;

    tpool_future_t futures[N_ITEMS];
    int i = 0;
    for (current_size = 1; current_size <= right - left;
         current_size = 2 * current_size) {
        for (left_start = left; left_start < right;
             left_start += 2 * current_size) {
            int right_end = MIN(left_start + 2 * current_size - 1, N_ITEMS - 1);
            int mid = MIN(left_start + current_size - 1, N_ITEMS - 1);

            arg *merge_arg = malloc(sizeof(arg));
            merge_arg->arr = arr;
            merge_arg->left = left_start;
            merge_arg->mid = mid;
            merge_arg->right = right_end;

            futures[i++] = tpool_apply(pool, merge_task, merge_arg);
        }
    }

    for (int j = 0; j < i; j++) {
        tpool_future_get(futures[j], 0);
        tpool_future_destroy(futures[j]);
    }
}

如下方 N_ITEMS = 4 的例子，workqueue 會被加入第一層的兩個任務，及第二層的一個任務，假設 thread 數量為 3，且各取得了一個任務，這時如果第二層的任務比任何一個第一層的任務早完成，排序結果就可能會出錯。

- 正確例子
            (4)    (3)   (2)    (1)
              \    /      \      /
              (3, 4)       (1, 2)
                 \           /
                 (1, 2, 3, 4)

- 錯誤例子（先排前兩個，再排前兩個和後兩個，最後排後兩個）
            (4)    (3)   (2)    (1)
1st merge:    \    /      
              (3, 4)       (2, 1) -> 不變
2nd merge:       \           /   
                 (2, 1, 3, 4)     -> 假設 (2, 1) 已排序過，所以按順序加入
3rd merge:              \  /      -> 排序 arr[2], arr[3]
                 (2, 1, 3, 4)

如果將函式放在兩個迴圈之間，相當於每次都只先加入一層的任務，並等待這些任務執行完畢，再繼續加入下一層的任務，直到結束。這樣就可以保證上述不確定的情況被排除。

注意書寫規範：

使用 lab0 規範的程式碼書寫風格，務必用 clang-format 確認一致

已修正程式碼，謝謝老師。

void merge_sort(tpool_t pool, int *arr, int left, int right, int size)
{
    int current_size;
    int left_start;

    for (current_size = 1; current_size <= right - left;
         current_size = 2 * current_size) {
        int len = size / (2 * current_size) + 1;
        tpool_future_t futures[len];
        int i = 0;

        for (left_start = left; left_start < right;
             left_start += 2 * current_size) {
            int right_end = MIN(left_start + 2 * current_size - 1, size - 1);
            int mid = MIN(left_start + current_size - 1, size - 1);

            arg *merge_arg = malloc(sizeof(arg));
            merge_arg->arr = arr;
            merge_arg->left = left_start;
            merge_arg->mid = mid;
            merge_arg->right = right_end;

            futures[i++] = tpool_apply(pool, merge_task, merge_arg);
        }
        for (int j = 0; j < i; j++) {
            tpool_future_get(futures[j], 0);
            tpool_future_destroy(futures[j]);
        }
    }
}

若是將函式放在最內層的迴圈內，則是代表每加入一個任務，就等待他做完才加入下一個任務，失去了並行的能力。
但是我認為這種方法可以讓剛加入的任務還在 memory hierachy 上層時，就先進行合併，所以對 cache 較友善，而且也不需要多餘的記憶體空間來存放 futures 陣列，將在後續效能評估進行討論。

不用「猜想」，用效能分析工具來驗證你的想法。

好的，以補在下方的效能評估。

for (current_size = 1; current_size <= right - left;
     current_size = 2 * current_size) {
    for (left_start = left; left_start < right;
         left_start += 2 * current_size) {
        int right_end = MIN(left_start + 2 * current_size - 1, N_ITEMS - 1);
        int mid = MIN(left_start + current_size - 1, N_ITEMS - 1);

        arg *merge_arg = malloc(sizeof(arg));
        merge_arg->arr = arr;
        merge_arg->left = left_start;
        merge_arg->mid = mid;
        merge_arg->right = right_end;

        tpool_future_t future = tpool_apply(pool, merge_task, merge_arg);
        tpool_future_get(future, 0);
        tpool_future_destroy(future);
    }
}

排序測試和效能評估機制

使用以下兩種方式進行評估：

執行時間
原先參考論文〈Dude, is my code constant time?〉，利用 CPU 提供的週期計數器 TSC register 作為評估執行時間的方法，但是 Game Timing and Multicore Processors 提及在多處理器和雙核系統上不保證計數器在處理器核之間同步處理，因此使用 rdtsc 需要將待測試的程式限制在一個 CPU 上 (set CPU affinity)以此減少誤差。
因此，我決定使用 clock_gettime() 來測量，避免 cpu pinning 以及同步的問題。

This use of RDTSC for timing suffers from these fundamental issues:
Discontinuous values. Using RDTSC directly assumes that the thread is always running on the same processor. Multiprocessor and dual-core systems do not guarantee synchronization of their cycle counters between cores.
…
使用 perf 工具進行效能分析。

以下測試皆固定建立 10 個 thread。

首先比較上方提到的，將等待任務執行完成與釋放資源的函式放在兩個迴圈之間（以下簡稱 A 方法），或是最內層的迴圈（以下簡稱 B 方法）。

對照 CS:APP 第 5 章

執行時間

使用 clock_gettime() 測量 10 種陣列大小，從 100,000 筆資料開始，每次加 100,000 一直到 1,000,000，並且每個大小都會重複執行 5 次，最後再取執行時間的平均。
可以發現 A 方法在所有資料大小的平均執行時間都比 B 方法快 3 倍左右。

Perf

使用 perf stat 分析

觀察下方結果可以發現，有如先前所推測的，B 方法雖然執行速度較慢、花費的 instructions, cycles 比較多，但是 cache-misses 比例從 3.66% 下降到了 0.21%。

而我原先認為 B 方法因為一加入任務就執行，所以 L1 data cache 的 misses 比例會比較小，但與測試的結果不相符，兩個方法的比例是差不多的。
經過查閱發現，在我的開發環境中，每個 cpu core 都有獨立的 L1, L2 cache，L3 cache 才有共享。因此，若是負責加入任務的 main thread 與執行任務的 thread 是由不同核執行，這些資料就不會被共享到。

Method	instructions	cycles	cache-references	cache-misses	L1-dcache-loads	L1-dcache-loads-misses
A 方法	950,1385,8505	1566,6598,5158	11,3224,7852	4139,2926 (3.66% of all cache refs)	258,8940,6382	16,7917,5049 (6.49% of all L1-dcache accesses)
B 方法	4812,2255,9812	6422,3224,9672	41,6341,6998	860,1043 (0.21% of all cache refs)	1154,0040,5239	73,3492,6426 (6.36% of all L1-dcache accesses)

而 context-switches 次數也相差 5 倍，所以整體而言，A 方法無論是在執行時間，還是資源的消耗都明顯優於 B 方法。

Method	context-switches	branch-misses	migrations
A 方法	1286,8241	3,1163,8462	1,5208
Ｂ方法	6420,2873	11,1517,4417	2,9538

比較原程式碼與使用 A 方法的程式碼

原程式碼的 fork_count 在達到限制次數之前，每次 fork 都會增加當前的 process 數量，代表最多有

2^{n}

個 process 同時運行。所以我認為可以在 A 方法建立相同數量的 thread 來進行比較。

原程式碼：

fork_count	instructions	cycles	cache-references	cache-misses	context-switches	branch-misses	migrations	page-faults
2	69,9744,1954 (1.58 insn per cycle)	44,2110,0134	836,6862	353,7818 (42.28% of all cache refs)	11	5417,7829	4	13,5404
3	70,0072,1959 (1.59 insn per cycle)	43,9481,5979	931,2532	361,0411 (38.77% of all cache refs)	19	5398,8739	10	13,5530
4	70,0375,8024 (1.58 insn per cycle)	44,3996,1195	895,0482	361,2487 (40.36% of all cache ref)	24	5403,8126	10	13,5783

A 方法：

thread_count	instructions	cycles	cache-references	cache-misses	context-switches	branch-misses	migrations	page-faults
4	385,1146,9270 (0.73 insn per cycle)	528,7587,6798	5,5372,3054	4634,6027 (8.37% of all cache refs)	251,5514	1,1819,8786	2262	19,6335
8	690,0228,5445 (0.62 insn per cycle)	1107,2844,8769	8,6049,3112	4758,8317 (5.53% of all cache refs)	644,9611	2,0449,7558	9244	19,6417
16	1216,9900,6744 (0.56 insn per cycle)	2178,4866,1794	14,9976,0521	5521,1937 (3.68% of all cache refs)	1326,1207	3,2319,3069	15,9792	19,6589

經過測試發現，原程式碼的執行時間和使用的資源並不會隨著 fork 數量的增加而有顯著的改變，而 A 方法卻變成所有開銷都會隨著 thread 數量增加而變多。

但是在 A 方法中，cache misses 佔 cache-references 的比例是隨著 thread 數量增加而變少的，而且為原程式碼的

1 / 5

至

1 / 10

倍。這代表在記憶體的存取上，使用 thread pool 改進是有效的，可以獲得更好的 locality of reference。

雖然 cache miss 比例減少了，但程式總體的執行時間和 CPU 資源使用仍然較高，代表效能瓶頸可能在其他地方。

接著用 perf record 進行分析，僅列出部分輸出進行討論。

原程式碼：

# Total Lost Samples: 0
# Samples: 4K of event 'cycles'
# Event count (approx.): 4313695937

# Overhead  Command     Shared Object      Symbol                              
# ........  ..........  .................  ....................................
    64.61%  merge_sort  merge_sort         [.] merge
     5.28%  merge_sort  libc.so.6          [.] _int_malloc
     4.50%  merge_sort  merge_sort         [.] merge_sort
     3.32%  merge_sort  libc.so.6          [.] __memmove_evex_unaligned_erms
     2.61%  merge_sort  libc.so.6          [.] __libc_calloc
     2.23%  merge_sort  [kernel.kallsyms]  [k] asm_exc_page_fault
     1.87%  merge_sort  merge_sort         [.] rand_next
     1.63%  merge_sort  merge_sort         [.] main
     1.06%  merge_sort  merge_sort         [.] iabs
     1.01%  merge_sort  [kernel.kallsyms]  [k] __rcu_read_lock
…

A 方法：

# Total Lost Samples: 0
# Samples: 94K of event 'cycles'
# Event count (approx.): 37079673786

# Overhead  Command  Shared Object      Symbol                                     
# ........  .......  .................  ...........................................
    18.07%  main     [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
     5.16%  main     [kernel.kallsyms]  [k] psi_group_change
     4.68%  main     libc.so.6          [.] pthread_cond_wait@@GLIBC_2.3.2
     3.94%  main     [kernel.kallsyms]  [k] futex_wake
     3.72%  main     [kernel.kallsyms]  [k] _raw_spin_lock
     3.43%  main     libc.so.6          [.] pthread_mutex_lock@@GLIBC_2.2.5
     2.47%  main     [kernel.kallsyms]  [k] futex_q_lock
     2.11%  main     libc.so.6          [.] __GI___lll_lock_wait
     1.82%  main     [kernel.kallsyms]  [k] __get_user_nocheck_4
     1.80%  main     [kernel.kallsyms]  [k] llist_add_batch
     1.80%  main     main               [.] merge
     1.68%  main     main               [.] jobqueue_fetch
     1.65%  main     [kernel.kallsyms]  [k] ttwu_queue_wakelist
     1.59%  main     [kernel.kallsyms]  [k] __futex_unqueue
     
…

原程式碼的 merge 函數佔用了 64.61% 的 CPU 時間，是主要的效能瓶頸，但是在 A 方法中，merge 函數僅佔 1.8%，佔用最多的是 native_queued_spin_lock_slowpath，還有其他與同步相關的函式。
代表 CPU 花大量時間在處理同步機制，這個結果也可以對應到上方的 instructions per cycle，隨著 thread 數量越多 instructions per cycle 逐漸減少。

分析結果

原始 fork 版本：

Fork 建立的是獨立的 process，有各自的記憶體空間，減少了資源競爭。
Child process 結束後會直接退出，減少了長期運行的開銷。

我的版本:

Threads 共享同一個記憶體空間，可能導致更多的資源競爭和同步開銷。
Threads 直到程式結束前才會全部退出，可能導致額外的管理開銷。