2022q1 Homework3 (quiz3)

contribute by < bakudr18 >

測驗題目

測驗 `8`

void remove_data_origin(tree *t, int d)
{
    tnode_t **p = t;
    // search node with d data
    while (*p != NULL && (*p)->data != d) {
        if (d < (*p)->data)
            p = &(*p)->left; // go to left sub-tree
        else
            p = &(*p)->right; // go to right sub-tree
    }
    tnode_t *q = *p;

    // d not found
    if (!q)
        return;

    // rebuild btree
    if (!q->left) // no left sub-tree exists, link to right sub-tree directly
        *p = q->right;
    else if(!q->right) // no right sub-tree exists, link to left sub-tree directly
        *p = q->left;
    else { // both left and right sub-trees exist
        tnode_t **r = &q->right;
        while ((*r)->left) // find next node in in-order
            r = &(*r)->left;
        q->data = (*r)->data;
        q = *r;
        *r = q->right;
    }
    free(q);
}

程式碼運作原理列在註解上，並且已改用 C99 實作，方便後續做效能比較。
針對演算法方面，我實在想不到有更好的作法可以改善，因此我嘗試往減少 branch 的方向思考。

嘗試改寫成 branchless 實作

首先，參考你所不知道的 C 語言: linked list 和非連續記憶體中 mergeTwoLists 的作法，先針對 search node 的部份以 ternary operator 改寫。

while (*p != NULL && (*p)->data != d) {
-        if (d < (*p)->data)
-            p = &(*p)->left;
-        else
-            p = &(*p)->right;
+        p = (d < (*p)->data) ? &(*p)->left : &(*p)->right;
    }

參考 Branchless Programming 的解釋

But the compiler actually elects to do something different. Instead of going with this arithmetic trick, it used a special cmov (“conditional move”) instruction that assigns a value based on a condition (which is computed and checked using the flags register, the same way as for jumps)

針對 ternary operator，編譯器有可能根據此選擇使用 cmov 指令來優化，以類似以下數學式的概念來避免分支：

x = c \times a + (1 - c) \times b

這種技巧被稱做 predication 。

接著來做實驗，先察看 gcc 版本

$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

因為需要看組合語言，因此我先複習了 CSAPP: Chap3 Machine Programs 了解 x86 instruction 的基礎。

首先以 -O0 編譯優化選項來測試，並且以 objdump 來觀察反組譯出來的結果，使用 -S 選項可讓 gcc 這個 compiler driver 停在組合語言輸出階段。

$ gcc -O0 -g -o b0 btree.c
$ objdump -S b0
00000000000012e4 <remove_data_origin>:
void remove_data_origin(tree *t, int d)
{
    ...
        if (d < (*p)->data)
    1301:       48 8b 45 e8             mov    -0x18(%rbp),%rax
    1305:       48 8b 00                mov    (%rax),%rax
    1308:       8b 00                   mov    (%rax),%eax
    130a:       39 45 d4                cmp    %eax,-0x2c(%rbp)
    130d:       7d 11                   jge    1320 <remove_data_origin+0x3c>
            p = &(*p)->left;
    130f:       48 8b 45 e8             mov    -0x18(%rbp),%rax
    1313:       48 8b 00                mov    (%rax),%rax
    1316:       48 83 c0 08             add    $0x8,%rax
    131a:       48 89 45 e8             mov    %rax,-0x18(%rbp)
    131e:       eb 0f                   jmp    132f <remove_data_origin+0x4b>
        else
            p = &(*p)->right;
    1320:       48 8b 45 e8             mov    -0x18(%rbp),%rax
    1324:       48 8b 00                mov    (%rax),%rax
    1327:       48 83 c0 10             add    $0x10,%rax
    132b:       48 89 45 e8             mov    %rax,-0x18(%rbp)
    ...
    
0000000000001402 <remove_data_branchless>:
void remove_data_branchless(tree *t, int d)
{
    ...
        p = (d < (*p)->data) ? &(*p)->left : &(*p)->right;
    141f:       48 8b 45 e8             mov    -0x18(%rbp),%rax
    1423:       48 8b 00                mov    (%rax),%rax
    1426:       8b 00                   mov    (%rax),%eax
    1428:       39 45 d4                cmp    %eax,-0x2c(%rbp)
    142b:       7d 0d                   jge    143a <remove_data_branchless+0x38>
    142d:       48 8b 45 e8             mov    -0x18(%rbp),%rax
    1431:       48 8b 00                mov    (%rax),%rax
    1434:       48 83 c0 08             add    $0x8,%rax
    1438:       eb 0b                   jmp    1445 <remove_data_branchless+0x43>
    143a:       48 8b 45 e8             mov    -0x18(%rbp),%rax
    143e:       48 8b 00                mov    (%rax),%rax
    1441:       48 83 c0 10             add    $0x10,%rax
    1445:       48 89 45 e8             mov    %rax,-0x18(%rbp)
    ...

以上結果發現 complier 產生的組合語言是完全一樣的，而且都沒有用到 cmov 指令。
接著再以 -O1 優化選項實驗一次，結果如下

$ gcc -O1 -g -o b1 btree.c
$ objdump -S b1
void remove_data_origin(tree *t, int d)
{
    ...
        if (d < (*p)->data)
            p = &(*p)->left;
    121e:       48 8d 47 08             lea    0x8(%rdi),%rax
    1222:       48 83 c7 10             add    $0x10,%rdi
    1226:       39 f2                   cmp    %esi,%edx
    1228:       48 0f 4e c7             cmovle %rdi,%rax
    ...
    
void remove_data_branchless(tree *t, int d)
{
    ...
        p = (d < (*p)->data) ? &(*p)->left : &(*p)->right;
    12ac:       48 8d 47 08             lea    0x8(%rdi),%rax
    12b0:       48 83 c7 10             add    $0x10,%rdi
    12b4:       39 f2                   cmp    %esi,%edx
    12b6:       48 0f 4e c7             cmovle %rdi,%rax
    ...

可以發現到兩者接使用了 cmov 指令做優化，因此可以知道 if else 與 ternary operator 的作法在 gcc 9.3.0 編譯出來的結果基本上是一致的。

回到 remove_data 程式碼，在 rebuild btree 時，需要判斷左右 sub-tree 是否都存在，這裡同樣用你所不知道的 C 語言: linked list 和非連續記憶體中 mergeTwoLists 的技巧，以 OR 來減少分支。

    // rebuild btree
-    if (!q->left)
-        *p = q->right;
-    else if(!q->right)
-        *p = q->left;
+    if (!q->left || !q->right)
+        *p = (tnode_t*) ((intptr_t)q->left | (intptr_t)q->right);
    else {
        ...

雖然還是需要 if 判斷，但因為 *p 被 assign 的值變成只有一種，因此猜測此修改會減少 branch mispredicted 的次數，同樣的先比較編譯後的 assembly，這次僅列出 -O1 編譯優化的結果。

$ gcc -O1 -g -o b1 btree.c
$ objdump -S b1
void remove_data_origin(tree *t, int d)
{
    ...
        *p = q->right;
    1236:       48 8b 57 10             mov    0x10(%rdi),%rdx
        *p = q->left;
    123f:       48 89 10                mov    %rdx,(%rax)
    ...
        if (!q->left)
    1273:       48 8b 57 08             mov    0x8(%rdi),%rdx
    1277:       48 85 d2                test   %rdx,%rdx
    127a:       74 ba                   je     1236 <remove_data_origin+0x2d>
    else if(!q->right)
    127c:       48 8b 4f 10             mov    0x10(%rdi),%rcx
    1280:       48 85 c9                test   %rcx,%rcx
    1283:       74 ba                   je     123f <remove_data_origin+0x36>
    ...
    
void remove_data_branchless(tree *t, int d)
{
    ...
        *p = (tnode_t*) ((intptr_t)q->left | (intptr_t)q->right);
    12c4:       48 0b 57 10             or     0x10(%rdi),%rdx
    12c8:       48 89 10                mov    %rdx,(%rax)
    ...
    if (!q->left || !q->right)
    12fc:       48 8b 57 08             mov    0x8(%rdi),%rdx
    1300:       48 85 d2                test   %rdx,%rdx
    1303:       74 bf                   je     12c4 <remove_data_branchless+0x2d>
    1305:       48 8b 4f 10             mov    0x10(%rdi),%rcx
    1309:       48 85 c9                test   %rcx,%rcx
    130c:       74 b6                   je     12c4 <remove_data_branchless+0x2d>
    ...

單看組合語言輸出，看不出到底哪個執行結果會比較好，因此就實際跑跑看程式做比較，這裡會使用到 HW1 介紹過的 Cachegrind ，Cachegrind 除了可以觀察 cache miss ，開啟 --branch-sim=yes 選項還可以觀察 branch prediction 結果。

以下為測試程式碼，為了盡可能讓 rebuild btree 時 if else 都有可能發生，因此我刻意製造了 unbalanced binary tree。

int main(int argc, char **argv){
    tree t = NULL;
    srand(0);
    int method = 0;
    if (argc == 2)
        method = argv[1][0] - '0';

    for(int i = 0; i < 10000; i++){
        insert_data(&t, i);
    }
    for(int i = 0; i < 10000; i++){
        insert_data(&t, 10000 - i);
    }
    for(int i = 0; i < 10000; i++){
        insert_data(&t, rand() % 5000);
        insert_data(&t, 10000 - rand() % 5000);
    }
    for(int i = 0; i < 20000; i++){
        remove_data[method](&t, rand() % 10000);
    }
    
    return !isValidBST(&t);
}

然後是 Cachegrind 的執行結果

$ cg_annotate cachegrind.out.remove_data_origin
--------------------------------------------------------------------------------
Command:          ./btree 0
--------------------------------------------------------------------------------
Ir            I1mr  ILmr  Dr          D1mr        DLmr  Dw        D1mw    DLmw   Bc          Bcm    Bi      Bim 
--------------------------------------------------------------------------------
2,300,288,281 1,058 1,032 580,380,613 145,956,805 2,081 2,116,315 209,039 24,121 379,375,954 82,392 118,710 142  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir            I1mr ILmr Dr          D1mr       DLmr Dw      D1mw    DLmw   Bc          Bcm    Bi     Bim  file:function
--------------------------------------------------------------------------------
1,399,706,891    2    2 399,939,111 97,569,261    0 120,000  33,927      0 199,969,555 40,032      0   0  /home/bakud/linux2022/btree/btree.c:insert_data
  886,258,952    2    2 177,317,480 47,548,118    0  53,206  15,574      0 177,246,217 24,296      0   0  /home/bakud/linux2022/btree/btree.c:remove_data_origin
    5,640,090   21   21     960,031    230,178    0 960,013 122,398 20,000     760,038     42      0   0  /build/glibc-sMfBJT/glibc-2.31/malloc/malloc.c:_int_malloc

$ cg_annotate cachegrind.out.remove_data_branchless 
--------------------------------------------------------------------------------
Command:          ./btree 1
--------------------------------------------------------------------------------
Ir            I1mr  ILmr  Dr          D1mr        DLmr  Dw        D1mw    DLmw   Bc          Bcm    Bi      Bim 
--------------------------------------------------------------------------------
2,300,302,778 1,058 1,032 580,380,613 145,956,805 2,081 2,116,315 209,039 24,121 379,375,954 79,797 118,710 142  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir            I1mr ILmr Dr          D1mr       DLmr Dw      D1mw    DLmw   Bc          Bcm    Bi     Bim  file:function
--------------------------------------------------------------------------------
1,399,706,891    2    2 399,939,111 97,569,261    0 120,000  33,927      0 199,969,555 40,032      0   0  /home/bakud/linux2022/btree/btree.c:insert_data
  886,273,449    2    2 177,317,480 47,548,118    0  53,206  15,574      0 177,246,217 23,644      0   0  /home/bakud/linux2022/btree/btree.c:remove_data_branchless
    5,640,090   21   21     960,031    230,178    0 960,013 122,398 20,000     760,038     42      0   0  /build/glibc-sMfBJT/glibc-2.31/malloc/malloc.c:_int_malloc

其中 Bcm 為 conditional branches mispredicted，從結果來看，remove_data_origin 的 Bcm 為 24296 次，而 remove_data_branchless 的 Bcm 為 23644 次，看起來確實有少一些 branch mispredicted，但說實在的結果差強人意，我猜測可能不同微處理器架構甚至不會有改善？

Static B-Trees from Algorithmica / HPC

上述所使用的 Binary Search Tree (以下簡稱為 BST ) 結構如下，會需要對每一個 node 配置記憶體空間，如果使用 malloc 一個一個配置，那麼記憶體就很有可能不是連續的，而不連續的記憶體自然的 Locality 也就更差，同時零碎的記憶體也容易造成記憶體管理的困難。
關於 Locality 的基礎知識可以參照 CSAPP Chap 6.2。

typedef struct tnode {
    int data;   
    struct tnode *left;
    struct tnode *right;
} tnode_t;

有一個簡單的方法可以改進 memory layout ，那就是用 array 來實作 BST ，對於每一個 node

k

，其 child 為 node

2 k + 1

與 node

2 k + 2

。不過 BST 一次只會比較一個 Node data ，如果需要一次比較多個 node ，可以使用 B-Tree 搭配 SIMD 的實作方式來達到。為了方便實驗，以下的實作都假設 tree 為 static ，也就是 build tree 之後就不會再修改 tree node。

B-tree 的概念如下，每一個 node 不再只有一個 key ，而每個 block 可以包含

k

個 pointer 指向 child nodes ，以及

B = (k - 1)

筆 key，且每一個 block 皆已排序，而對於每一個 child

i

，key value 會在其 parent 的 key

(i - 1)

到

i

之間。

假設

k

為定值，則 B-tree height 相對於 BST 減少了

\frac{l o g_{2} (n)}{l o g_{k} (n)} = \frac{l o g (k)}{l o g (2)} = l o g_{2} (k)

，以下實作設定

B = (k - 1) = 16

，則每一個 block 佔 64 bytes，剛好可以存進 Cache line 中，以避免需要多次 fetch data 到 Cache line 造成效能影響。

註: 可以透過 $ cat /proc/cpuinfo 中的 clfush_size 檢查 cache line 大小。

B-tree search version 1

#define B 16    // block size 
/* 
 * N: total number of key, the last block will 
 * be padded with dummy value (INT_MAX) if N is not multiples of B
 */
#define N 160

#define nblocks  ((N + B - 1) / B)

int btree[nblocks][B];

// Find child block index of block k index i
int go(int k, int i) { return k * (B + 1) + i + 1; }

接著 construct b-tree 的實作如下，其中 a 為 input sorted array。

void build(int k) {
    static int t = 0;
    if (k < nblocks) {
        for (int i = 0; i < B; i++) {
            // build child block of block k index i first
            build(go(k, i));
            // set block k index i
            btree[k][i] = (t < N ? a[t++] : INT_MAX);
        }
        // build last child block of block k
        build(go(k, B));
    }
}

在 Search function 中，引入了 SIMD 來達到平行 cmp 多比資料，在 x86 CPU 提供了 avx 指令集給 user 使用，其中 avx2 指令集會使用 256 bits SIMD register 進行平行運算，avx2 相關的 intrinsic function 可以透過 #include <x86instrin.h> 來引入，但需記得在編譯時開啟以下選項。

#pragma GCC target("avx2")
#pragma GCC optimize("O3")

在比較 target 是否大於 block key 時，一般的作法是透過 iteration 來將比較結果一個一個紀錄在 bitmask，,也就是以下的 mask_naive，但若使用 SIMD instruction 實作，可以一次平行比較 8 個 key，其中 __m256i 會用 256 bits SIMD register 來存 8/16/32/64 bits integer，詳細的 intrinsic function 說明可以在 Intel Intrinsics Guide 查詢，以下會簡單以註解解釋。

int mask_naive = (1 << B);
for (int i = 0; i < B; i++)
    mask_naive |= (btree[k][i] >= x) << i;

typedef __m256i reg;

int cmp(reg x_vec, int* y_ptr) {
    reg y_vec = _mm256_load_si256((reg*) y_ptr); // load 8 sorted elements
    reg mask = _mm256_cmpgt_epi32(x_vec, y_vec); // compare against the key
    return _mm256_movemask_ps((__m256) mask);    // extract the 8-bit mask
}

reg x = _mm256_set1_epi32(target);
int mask = ~(
    cmp(x, &btree[k][0]) +
    (cmp(x, &btree[k][8]) << 8)
);

綜合以上，search function v1 統整實作如下，其中 __builtin_ffs 為 GCC Extension，會回傳 1 + index of the least significant 1-bit 。

int lower_bound(int _x) {
    int k = 0, res = INT_MAX;
    reg x = _mm256_set1_epi32(_x); // copy 8 int _x to reg x
    while (k < nblocks) {
        int mask = ~(
            cmp(x, &btree[k][0]) +
            (cmp(x, &btree[k][8]) << 8)
        );
        int i = __builtin_ffs(mask) - 1;
        if (i < B)
            res = btree[k][i];
        k = go(k, i);
    }
    return res;
}

補充：在 Linux Kernel 中的 arch/x86/include/asm/bitops.h 可以找到 x86 架構下 ffs 是以 bsf 指令實作的。

B-tree search version 2 - HugePage

事前準備：

Algorithmica / HPC 使用 hugepage 的概念來改善效能，為了解釋 Hugepage ，需要先理解 TLB 是什麼。

參考自 CSAPP Chap 9.2 ，Translation lookaside buffer (TLB) 是一塊 cache 專門用來協助 MMU 加速 address translation ，TLB 內會維護 page table ，MMU 將 Virtual Page Number (VPN) 傳給 TLB 後，TLB 以此 VPN 找到對應的 Page table entry (PTE) 回傳給 MMU ，則 MMU 就可以把 PTE translate 成 physical address 傳給 cache/memory ，讓 cache/memory 找到對應的 data 。若 TLB 上找不到 PTE，表示 TLB miss ，MMU 就需要去 L1 cache 上把 PTE 撈上來更新到 TLB 上。

TLB 的大小是固定的，當 page size 越大 (VPO 佔用 bits 愈多)，表示一個 PTE mapping 到的 page 越大，但相對的 TLB 能放的 PTE 數量就越少。TLB 的硬體概念其實與一般使用的 Cache 類似，同樣都有 multi-level hierarchy 與 set-asscoicative 的硬體架構，現代 CPU 大多有提供 two-level TLB，TLB 的資訊可以透過 cpuid 來察看，例如我的 CPU TLB 資訊如下

Deterministic Address Translation Parameters:
        L1 Code TLB: 4KB pages
                     8-way set associative
                     16 entries
                     Shared by max 2 threads

        L1 Code TLB: 2MB or 4MB pages
                     8-way set associative
                     2 entries
                     Shared by max 2 threads

  L1 Store-only TLB: 4KB, 2MB, 4MB or 1GB pages
                     fully associative
                     Shared by max 2 threads

   L1 Load-only TLB: 4KB pages
                     4-way set associative
                     Shared by max 2 threads

   L1 Load-only TLB: 2MB or 4MB pages
                     4-way set associative
                     Shared by max 2 threads

   L1 Load-only TLB: 1GB pages
                     fully associative
                     Shared by max 2 threads

      L2 Shared TLB: 4KB, 2MB or 4MB pages
                     8-way set associative
                     128 entries
                     Shared by max 2 threads

      L2 Shared TLB: 4KB or 1GB pages
                     8-way set associative
                     128 entries
                     Shared by max 2 threads

我所使用的 CPU 型號為 Intel Core i7-1165G7 ，可能因為比較新所以 cpuid 的支援不完全，因此沒有看到 L1 data TLB 的 entries 數量，另外特別的是，這顆 CPU 的 L1 data TLB 又分成了 Store-only 與 Load-only ，且使用了不同的 set-assiciative ，不確定是因為哪些特殊需求而設計的。

疑問1：目前不太確定如何直接從 Intel 的 spec 找到對應型號的 TLB 資訊。

補充: 後來在 instlatx64 網站上看到了 Intel Core i7-1165G7 的 TLB 資訊，以下節錄片段。

CPU#000 Vendor: GenuineIntel Family: 6 Model: 8c Stepping: 1 CoreType:0x200806c1
CPU#000 Type: "11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz"
CPU#000 AffMask: 0x0000000000000001 
CPU#000 PhysMask:0x00000000000000ff 
CPU#000 APIC_ID:0x00000000 Phys_ID:000 Core_ID:00 SMT_ID:00 
CPU#000 L1I cache:     32KB,    64 byte cache line,  8 way, SMask:0000000000000003 
CPU#000 L1D cache:     48KB,    64 byte cache line, 12 way, SMask:0000000000000003 
CPU#000 L2  cache:   1280KB,    64 byte cache line, 20 way, SMask:0000000000000003, non-inclusive
CPU#000 L3  cache:  12288KB,    64 byte cache line, 12 way, SMask:00000000000000ff, hashed LLC
CPU#000 L1I 4K TLB:                    128 entries,  8 way, SMask:0000000000000003
CPU#000 L1I 2M TLB:                     16 entries,  8 way, SMask:0000000000000003
CPU#000 L1D 4K TLB(loads):              64 entries,  4 way, SMask:0000000000000003
CPU#000 L1D 2M TLB(loads):              32 entries,  4 way, SMask:0000000000000003
CPU#000 L1D 1G TLB(loads):               8 entries,   full, SMask:0000000000000003
CPU#000 L1D 4K+2M+1G TLB(stores):       16 entries,   full, SMask:0000000000000003
CPU#000 L2I+D 4K+2M TLB:              1024 entries,  8 way, SMask:0000000000000003
CPU#000 L2I+D 4K+1G TLB:              1024 entries,  8 way, SMask:0000000000000003

疑問2: The Linux Virtual Memory System 中49頁提到

Why HugePages?

Enlarge TLB size
…

我的理解是 TLB 是一種特殊用途的 cache，大小應該是固定的，要怎麼 Enlarge?

自問自答，察看 kernel doc HugeTLB Pages ，猜測這個 "Enlarge TLB size" 指的是 Linux kernel 中的 Hugetlb ，表示 pool of huge pages 的 memory 容量 (kB)。

參考 kernel doc Transparent Hugepage Support

the TLB miss will run faster (especially with virtualization using nested pagetables but almost always also on bare metal without virtualization)

a single TLB entry will be mapping a much larger amount of virtual memory in turn reducing the number of TLB misses. With virtualization and nested pagetables the TLB can be mapped of larger size only if both KVM and the Linux guest are using hugepages but a significant speedup already happens if only one of the two is using hugepages just because of the fact the TLB miss is going to run faster.

使用 Hugepage 增加 page size 主要有兩個優點，一是 TLB miss 所花費時間更短，推測原因是 PTE 數量變少因此搜尋速度更快；二是一個 PTE mapping 到更大的 memory ，自然也就減少了 TLB miss 的次數，因此適合用在一塊特定記憶體範圍做多次操作的應用上。另外 Hugepage 還有一個相對前兩點沒這麼顯著的小優點，就是因為在初次使用 memory 時會一次 load 比較大的 page size 進來，例如 page size 2MB 就一次 load 2MB 到 memory 上，因此相對的 page fault 發生次數更少。

我們可以透過 madvise 的 MADV_HUGEPAGE 來調整特定記憶體位置所使用的 page size ，

MADV_HUGEPAGE (since Linux 2.6.38)
Enable Transparent Huge Pages (THP) for pages in the range specified by addr and length. Currently, Transparent Huge Pages work only with private anonymous pages (see mmap(2)). The kernel will regularly scan the areas marked as huge page candidates to replace them with huge pages. The kernel will also allocate huge pages directly when the region is naturally aligned to the huge page size (see posix_memalign(2)).

int (*btree)[B];
const int P = 1 << 21;                        // page size in bytes (2MB)
const int T = (64 * nblocks + P - 1) / P * P; // can only allocate whole number of pages
btree = (int(*)[16]) std::aligned_alloc(P, T);
madvise(btree, T, MADV_HUGEPAGE);

version1 v.s. version2

測試程式如下，在建完大小為 N 的 btree 後，會針對範圍 0~N-1 隨機搜尋 2N 次，為了避免編譯器優化而沒有執行到 lower_bound ，因此每次搜尋完都會做一次比較判斷搜尋結果是否正確。

補充：後來以 objdump -S static_btree 觀察 assembly ，發現不管有沒有用 if(res != target) 判斷結果，編譯器都會產生 callq <lower_bound> 的指令，因此推測沒有判斷結果也不會造成 lower_bound 不會被執行到。

int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Error: static_btree needs at least 2 arguments\n");
        printf("./static_btree <number> [method]\n");
        return -1;
    }
    N = 1 << atoi(argv[1]);

    nblocks = (N + B - 1) / B;
    T = ((64 * nblocks + P - 1) / P * P);  // can only allocate whole number of pages

    if (argc == 3 && argv[2] == "0") {
        btree = malloc(nblocks * B * sizeof(int));
        if(!btree)
            goto alloc_fail;
    } else {
        btree = (int(*)[B]) aligned_alloc(P, T);
        if(!btree)
            goto alloc_fail;
        if(madvise(btree, T, MADV_HUGEPAGE)){
            printf("huge page setting failed\n");
            free(btree);
            return -1;
        }
    }

    // build B-tree by sorted array
    build(0);

    // search target
    int target, res;
    int n = N * 2;
    srand(0);
    for(int i = 0; i < n; i++){
        target = rand() % N;
        res = lower_bound(target);
        if(res != target){
            printf("search error\n");
            return -2;
        }
    }

    free(btree);
    return 0;

alloc_fail:
    printf("allocate memory %d failed\n", N);
    return -1;
}

做效能分析前先參照之前自己寫過的共筆建立測試環境，包含以下

/etc/default/grub 新增 isolcpu=7 ，然後以 taskset -c 7 program 執行測試程式
抑制 address space layout randomization (ASLR)
設定 scaling_governor 為 performance
針對 Intel 處理器，關閉 turbo mode
關閉 CPU 7 的 SMP IRQ afffinity

首先透過 perf-stat 來測量 data TLB load misses 的次數，結果如下，發現使用 hugepage 並沒有降低 dTLB-load-misses ，而且測試結果誤差範圍很大(16~20%)可能不是很有參考價值。

# S-Tree
$ sudo perf stat -C 7 -r 20 -e dTLB-load-misses taskset -c 7 ./static_btree 24 0

 Performance counter stats for 'CPU(s) 7' (20 runs):

              1529      dTLB-load-misses                                              ( +- 16.12% )

           7.93306 +- 0.00982 seconds time elapsed  ( +-  0.12% )

# S-Tree + HugePage
$ sudo perf stat -C 7 -r 20 -e dTLB-load-misses taskset -c 7 ./static_btree 24 1
 Performance counter stats for 'CPU(s) 7' (20 runs):

              2038      dTLB-load-misses                                              ( +- 20.39% )

            7.8228 +- 0.0405 seconds time elapsed  ( +-  0.52% )

另外使用 perf-record 來紀錄更細節的操作，-C 7 表示只觀察 CPU 7，參考 KYG-yaya573142 的共筆，編譯時開啟 --fomit-frame-pointer ，然後以 -g --call-graph dwarf 紀錄 stack trace 。

最後 perf-report 結果如下，使用 version1 S-tree 發生 33 次 dTLB-load-misses，其中 22.3% 發生在 lower_bound。而使用 version2 S-tree + Hugepage 發生 25 次 dTLB-load-misses，其中 13.14% 發生在 lower_bound，但是同樣的多次量測會發現這個結果誤差也很大，目前沒有方向要怎麼找原因。

$ sudo perf record -C 7 -r 20 -g --call-graph dwarf -e dTLB-load-misses taskset -c 7 ./static_btree 24 0
$ sudo perf report -g graph,0.5,caller
Samples: 33  of event 'dTLB-load-misses', Event count (approx.): 1193
  Children      Self  Command       Shared Object      Symbol
+   30.85%     0.00%  perf-exec     [unknown]          [k] 0xffffffffffffffff                                                                                                         ▒
+   30.85%     0.00%  perf-exec     libm-2.31.so       [.] __ccoshf128                                                                                                                ▒
+   30.85%     0.00%  perf-exec     [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe                                                                                             ▒
+   30.85%     0.00%  perf-exec     [kernel.kallsyms]  [k] do_syscall_64                                                                                                              ◆
+   30.85%     0.00%  perf-exec     [kernel.kallsyms]  [k] __x64_sys_execve                                                                                                           ▒
+   27.58%     0.00%  static_btree  static_btree       [.] _start                                                                                                                     ▒
+   27.58%     0.00%  static_btree  libc-2.31.so       [.] __libc_start_main                                                                                                          ▒
+   24.64%     0.00%  static_btree  static_btree       [.] main                                                                                                                       ▒
+   22.30%     0.00%  static_btree  static_btree       [.] lower_bound                                                                                                                ▒
+   17.60%     7.04%  static_btree  static_btree       [.] cmp

$ sudo perf record -C 7 -r 20 -g --call-graph dwarf -e dTLB-load-misses taskset -c 7 ./static_btree 24 1
$ sudo perf report -g graph,0.5,caller
Samples: 25  of event 'dTLB-load-misses', Event count (approx.): 974
  Children      Self  Command          Shared Object      Symbol                                                                                                          ▒q                                                                                                     ▒
+   13.14%     0.00%  static_btree     static_btree       [.] lower_bound                                                                                                             ▒
+   13.14%     2.87%  static_btree     static_btree       [.] cmp

在 perf-report 頁面按 e 可以展開 symbol callchains 的結果如下 Details，可以發現大部份 dTLB-load-misses 都發生在 _mm256_movemask_ps 內，對照到 assembly 應該只有一行 instruction vmovmskps %ymm0,%eax，不了解為何內部會紀錄到 apic timer interrupt 等看起來不相關的 function。

-   13.14%     2.87%  static_btree     static_btree       [.] cmp                                                                                                                     ◆
   - 10.27% cmp                                                                                                                                                                       ▒
      - 8.62% _mm256_movemask_ps (inlined)                                                                                                                                            ▒
         - asm_sysvec_apic_timer_interrupt                                                                                                                                            ▒
            - sysvec_apic_timer_interrupt                                                                                                                                             ▒
               - 5.75% __sysvec_apic_timer_interrupt                                                                                                                                  ▒
                    hrtimer_interrupt                                                                                                                                                 ▒
                    __hrtimer_run_queues                                                                                                                                              ▒
                    tick_sched_timer                                                                                                                                                  ▒
                    tick_sched_handle.isra.0                                                                                                                                          ▒
                    update_process_times                                                                                                                                              ▒
                    scheduler_tick                                                                                                                                                    ▒
                    task_tick_fair                                                                                                                                                    ▒
                  - update_cfs_group                                                                                                                                                  ▒
                     - reweight_entity                                                                                                                                                ▒
                          update_curr                                                                                                                                                 ▒
               - 1.44% irqentry_exit                                                                                                                                                  ▒
                    irqentry_exit_to_user_mode                                                                                                                                        ▒
                    exit_to_user_mode_prepare                                                                                                                                         ▒
                    schedule                                                                                                                                                          ▒
                    __sched_text_start                                                                                                                                                ▒
                    psi_task_switch                                                                                                                                                   ▒
      - 1.64% asm_sysvec_apic_timer_interrupt                                                                                                                                         ▒
           irq_enter_rcu                                                                                                                                                              ▒
   - 2.87% _start                                                                                                                                                                     ▒
        __libc_start_main                                                                                                                                                             ▒
        main                                                                                                                                                                          ▒
        lower_bound                                                                                                                                                                   ▒
        cmp                                                                                                                                                                           ▒
        _mm256_movemask_ps (inlined)

註：perf-report 的結果中， Children 表示此 symbol (function) 包含內部 call 到其他的 function 的 event 總和佔整體多少 percent 。Self 表示此 Symbol 不包含內部 call 的其他 function 的 event 總和。

參考資料：

使用 perf_events 分析程式效能

再來直接測量執行時間，測試程式如下，其中 CLOCK_MONOTONIC 為單調遞增的計時方式，這裡以 moving average algorithm 計算量測時間的平均值。

int main(int argc, char **argv)
{
    ...
    struct timespec t1, t2;
    
    // search target
    int target, res;
    int n = 1024;
    long long el;
    double avg = 0; 
    srand(0);
    for(int i = 0; i < n; i++){
        target = rand() % N;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        res = lower_bound(target);
        clock_gettime(CLOCK_MONOTONIC, &t2);
        if(res != target){
            printf("search error\n");
            return -2;
        }
        // moving average
        el = elapsed(&t1, &t2);
        ouble delta_intvl = (double)el - avg;
        avg += delta_intvl / (i + 1);
    }
    printf("%f\n", avg);

以 fibdrv 使用過的 python script 統計在 array size 遞增時對照的執行時間，測試 50 次並去除 95% 信賴區間以外的值。結果同樣的沒有顯著差異。

B-tree search version 3 - optimize SIMD

繼續跟著 Algorithmica/HPC 針對 SIMD 進行優化，version 1 需要使用兩次 _mm256_movemask_ps 和一次 left shift 才能產生一個 16 bits mask，因此這裡希望使用其他指令來優化。 avx2 有支援 pack instruction ，而其中 _mm256_packs_epi32 可以將兩個 32 bits signed integers a, b 經過 saturate and pack 後回傳 16 bits signed integer ，舉例來說，a[31:0] = 0xabcdffff 經過 Saturate16 後回傳 dst[15:0] = 0xffff ，而 pack 會將 saturate 後的值用以下順序放到 256 bits SIMD register 上。

dst[15:0] := Saturate16(a[31:0])
dst[31:16] := Saturate16(a[63:32])
dst[47:32] := Saturate16(a[95:64])
dst[63:48] := Saturate16(a[127:96])
dst[79:64] := Saturate16(b[31:0])
dst[95:80] := Saturate16(b[63:32])
dst[111:96] := Saturate16(b[95:64])
dst[127:112] := Saturate16(b[127:96])
dst[143:128] := Saturate16(a[159:128])
dst[159:144] := Saturate16(a[191:160])
dst[175:160] := Saturate16(a[223:192])
dst[191:176] := Saturate16(a[255:224])
dst[207:192] := Saturate16(b[159:128])
dst[223:208] := Saturate16(b[191:160])
dst[239:224] := Saturate16(b[223:192])
dst[255:240] := Saturate16(b[255:224])

可以發現到 register a, b 的 pack 順序並不是

[a 0, a 1, a 2, a 3, a 4, a 5, a 6, a 7, b 0, b 1, b 2, b 3, b 4, b 5, b 6, b 7]

而是

[a 0, a 1, a 2, a 3, b 0, b 1, b 2, b 3, a 4, a 5, a 6, a 7, b 4, b 5, b 6, b 7]

因此在使用 _mm256_packs_epi32 之前，有必要特別針對 b-tree 各個 blocks 的 key order 進行處理，而處理方式如下，透過 _mm256_permutevar8x32_epi32 指令，根據 perm_mask 去交換 register x 的順序，因此以下 permute 函式會將從 node 開始的 16 個 32 bits integer （也就是 b-tree 的一個 block) ，其中間 index 4~9 以 perm_mask 的順序重新排序。

void permute(int *node) {
    const reg perm_mask = _mm256_setr_epi32(4,5,6,7,0,1,2,3);
    reg* middle = (reg*) (node + 4);
    reg x = _mm256_loadu_si256(middle);
    x = _mm256_permutevar8x32_epi32(x, perm_mask);
    _mm256_storeu_si256(middle, x);
}

可以事先在 build b-tree 階段，對每一個 block 事先做 permute ，就可以在 search 階段比較各個 key value 與 target 產生 bitmask 時(即以下 rank 函式），直接使用 _mm256_packs_epi32 而不必在意順序。

void build_op(int k)
{
    static int t = 0;
    if (k < nblocks) {
        for (int i = 0; i < B; i++) {
            build_op(go(k, i));
            btree[k][i] = (a < N ? a++ : INT_MAX);
        }
+       permute(btree[k]);
        build_op(go(k, B));
    }
}

unsigned rank(reg x, int* y) {
    reg a = _mm256_load_si256((reg*) y);
    reg b = _mm256_load_si256((reg*) (y + 8));

    reg ca = _mm256_cmpgt_epi32(a, x);
    reg cb = _mm256_cmpgt_epi32(b, x);

    reg c = _mm256_packs_epi32(ca, cb);
    int mask = _mm256_movemask_epi8(c);

    // we need to divide the result by two because we call movemask_epi8 on 16-bit masks:
    return mask ? __builtin_ctz(mask) >> 1 : 16;
}

這個 rank 有兩個需要注意的細節，一是 _mm256_movemask_epi8 回傳值為 32 bits mask，因此最後 count trailing zero 的結果需要除以 2。二是原文作者使用 __tzcnt_u32 而非 __builtin_ctz ，這需要 CPU 有支援 LZCNT 指令集才能用，但我使用的 CPU 對這個指令集支援不完整(限定只能在 eax register 使用，可以用上面提到過的 CPUID 察看到) ，因此這裡以 __builtin_ctz 取代，只是需要特別針對 mask == 0 做處理。

補充：後來發現 gcc 以 O3 編譯優化後的 assembly 就會自己將 __builtin_ctz 組譯成 tzcnt 指令在 eax register 上使用。

因此最後 search function 如下，這裡同樣有兩個細節，一是 _mm256_set1_epi32 輸入值要減一，否則結果會變成 upper_bound 。二是因為 b-tree 的儲存順序有用 permute 修改過，因此在 update 函式內提取每個 block 的 index i 需要經過 translate 。

const int translate[17] = {
    0, 1, 2, 3,
    8, 9, 10, 11,
    4, 5, 6, 7,
    12, 13, 14, 15,
    0
};

void update(int *res, int* node, unsigned i) {
    int val = node[translate[i]];
    *res = (i < B ? val : *res);
}

int lower_bound_op(int _x) {
    int k = 0, res = INT_MAX;
    reg x = _mm256_set1_epi32(_x - 1);
    for (int h = 0; h < H; h++) {
        unsigned i = rank(x, btree[k]);
        update(&res, btree[k], i);
        k = go(k, i);
    }
    // the last branch:
    if (k < nblocks) {
        unsigned i = rank(x, btree[k]);
        update(&res, btree[k], i);
    }
    return res;
}

參考資料：

Why is permute needed in parallel SIMD/SSE/AVX ?

比較 version 1, 2, 3

與前次實驗方式相同，每個 version 執行 1024 次搜尋取平均，重複 50 次後去除 outlier 值再取平均，最後結果如下，結果發現到使用 pack instruction 結果變得更慢… ，因此開始思考是不是 CPU 差異導致。

原作者使用的是 AMD Zen 2 CPU，而我使用的是 Intel 11th gen. Core Tiger Lake microarchitecture ，需要找的是 _mm256_movemask_ps 使用的指令 vmovmskps，以及 _mm256_packs_epi32 使用的指令 vpackssdw 。

首先察看 Agner Fog's instruction tables ，Page 110 可以找到 Zen 2 的 vmovmskps latency 為 5 cycles ，而 Page 107 可以找到 vpackssdw 的 latency 為 1 cycle。接著是 Tiger Lake ，Page 343 可以看到 vmovmskps latency 為 1 cycles ，但找不到 vpackssdw 的資訊，因此另外去找了 Intel® 64 and IA-32 Architectures Optimization Reference Manual 其中的 Appendix D: INSTRUCTION LATENCY AND THROUGHPUT ，但只找的到最接近的為 Skylake microarchitecture 的資訊，其 vpacksswb (指令的功能類似，但改為 pack 16 bits integer) 的 latency 為 1 cycle，簡單整理成以下表格

Latency	Zen 2	Tiger Lake
`vmovmskps`	5	1
`vpackssdw`	1	1 (Skylake VPACKSSWB)

假設 Tiger Lake vpackssdw latency 為 1 cycle ，則使用 search version 3 還需要對 mask 除以 2，以及 update 需要做 translate 等反而可能需要執行更多的指令，推測確實是造成 version 3 更慢的原因。

補充: 後來又在 stackoverflow 中看到 instlatx64 網站有量測 Intel Core i7-1165G7 指令的 latency，其中 vpackssdw 為 3 cycles，反而比 Zen 2 更慢…，因此上述推測又更合理了。

測驗 `9`

考慮以下進行記憶體地址對齊 (alignment) 的巨集:

/* maximum alignment needed for any type on this platform, rounded up to a
   power of two */
#define MAX_ALIGNMENT 16

/* Given a size, round up to the next multiple of sizeof(void *) */
#define ROUND_UP_TO_ALIGNMENT_SIZE(x) \
    (((x) + MAX_ALIGNMENT - 1) & ~(MAX_ALIGNMENT - 1))

ROUND_UP_TO_ALIGNMENT_SIZE 會使傳入值 x 向上對齊到二進位

10000_{2}

的倍數，

/*
 *  "x" represents some arbitary bits, "abcd" represents four arbitary bits
 *  x0abcd -> x10000 if any of abcd bits is not zero
 *  x0abcd -> x00000 if all of abcd bits are zero
 */

當 x 的 least 4 bits 皆為 0 時，x 加上

1111_{2}

不會進位，最後 & ~

(1111_{2})

截掉 least 4 bits 即可。
相反的，當 x 的 least 4 bits 不為 0 時， x 加上

1111_{2}

必定會有 1 bit 進位到第 5 個 bit，後續一樣 & ~

(1111_{2})

截掉 least 4 bits 即可。

以相同邏輯實作 ROUND_DOWN 版本，只需截掉 least 4 bits 即可。

#define ROUND_DOWN_TO_ALIGNMENT_SIZE(x) \
    ((x) & ~(MAX_ALIGNMENT - 1))

Linux Kernel 中的 include/linux/math.h 可以找到跟 round up/down 相關實作，並且針對要對齊的值是否為 power of 2 有不同實作。

首先看到 round up/down 對齊 power of 2 的部份

/*
 * This looks more complex than it should be. But we need to
 * get the type for the ~ right in round_down (it needs to be
 * as wide as the result!), and we want to evaluate the macro
 * arguments just once each.
 */
#define __round_mask(x, y) ((__typeof__(x))((y)-1))

/**
 * round_up - round up to next specified power of 2
 * @x: the value to round
 * @y: multiple to round up to (must be a power of 2)
 *
 * Rounds @x up to next multiple of @y (which must be a power of 2).
 * To perform arbitrary rounding up, use roundup() below.
 */
#define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)

/**
 * round_down - round down to next specified power of 2
 * @x: the value to round
 * @y: multiple to round down to (must be a power of 2)
 *
 * Rounds @x down to next multiple of @y (which must be a power of 2).
 * To perform arbitrary rounding down, use rounddown() below.
 */
#define round_down(x, y) ((x) & ~__round_mask(x, y))

__round_mask 中的 typeof 為 GNU extension，回傳 x 的型態，y 為 power of 2 value，因此 (y)-1 會變成 0 到 log2(y) bits 皆為 1 的 bitmask。

round_up 中，若 x 的 least log2(y) bits 皆為 0，則 ((x)-1 | __round_mask(x,y)) 仍會維持 (x)-1，因此 +1 後變回 x 回傳。
若 x 的 least log2(y) bits 有任意位元不為 0，則 ((x)-1 | __round_mask(x,y)) 會讓 least log2(y) bits 變為 1，但剩餘的高位 bits 仍維持原樣，因此 +1 後會進一位。

round_down 原理跟 ROUND_DOWN_TO_ALIGNMENT_SIZE 相同，就不重複解釋了。

若參數 y 不為 power of 2 的倍數，則要使用以下定義

/**
 * roundup - round up to the next specified multiple
 * @x: the value to up
 * @y: multiple to round up to
 *
 * Rounds @x up to next multiple of @y. If @y will always be a power
 * of 2, consider using the faster round_up().
 */
#define roundup(x, y) (					\
{							\
	typeof(y) __y = y;				\
	(((x) + (__y - 1)) / __y) * __y;		\
}							\
)
/**
 * rounddown - round down to next specified multiple
 * @x: the value to round
 * @y: multiple to round down to
 *
 * Rounds @x down to next multiple of @y. If @y will always be a power
 * of 2, consider using the faster round_down().
 */
#define rounddown(x, y) (				\
{							\
	typeof(x) __x = (x);				\
	__x - (__x % (y));				\
}							\
)

這裡 round up 與 round down 就只使用單純的數學解法來實作，其中 ({...}) 同樣為 GNU extension 的用法，會回傳最後一行的結果。而第一行使用 typeof(y) __y = y; 是為了避免參數 y 展開後被多次計算 (double evaluation) 造成錯誤結果，例如 roundup(a, b++)。

測驗 10

考慮以下巨集可得到二個表示式進行整數除法操作時，最接近的整數值:









#define DIVIDE_ROUND_CLOSEST(x, divisor)                       \
    ({                                                         \
        typeof(x) __x = x;                                     \
        typeof(divisor) __d = divisor;                         \
        (((typeof(x)) -1) > 0 || ((typeof(divisor)) -1) > 0 || \
         (((__x) > 0) == ((__d) > 0)))                         \
            ? ((RRR) / (__d))                  \
            : ((SSS) / (__d));                 \
    })

題目說明中的這句，對照程式碼對不太起來

注意: 當除數 (即 divisor) 為負值時，行為沒有定義。

如果 divisor 永遠為正值，看起來第 6 行的 (((__x) > 0) == ((__d) > 0))) 判斷式就有點多餘？

先假設 x 和 divisor 都是 unsigned integer, 則 -1 會被轉型 unsigned integer，表示 ((typeof(x)) -1) > 0 成立，針對 x 和 divisor 都大於零的情況，

(x + (d / 2)) / d

即為解答，除以 2 可以用 right shift 1 取代，因此

RRR: __x + (__d >> 1)

假設 x 和 divisor 皆為 signed integer，若 x 與 divisor 同號，則 ((__x) > 0) == ((__d) > 0)) 成立，結果同上，若 x 與 divisor 不同號，先列出以下幾個例子：

DIVIDE_ROUND_CLOSEST(-1, 3) = 0
DIVIDE_ROUND_CLOSEST(-2, 3) = -1
DIVIDE_ROUND_CLOSEST(2, -3) = -1

可以發現，與前一個解答類似，只需要改個負號即可，即

(x - (d / 2)) / d

，因此，

SSS: __x - (__d >> 1)

Linux Kernel 中的 include/linux/math.h 可以找到一樣的實作，註解提到

/*
 * Result is undefined for negative divisors if the dividend
 * variable type is unsigned and for negative dividends if 
 * the divisor variable type is unsigned.
 */

2022q1 Homework3 (quiz3)

測驗 8

嘗試改寫成 branchless 實作

Static B-Trees from Algorithmica / HPC

B-tree search version 1

B-tree search version 2 - HugePage

version1 v.s. version2

B-tree search version 3 - optimize SIMD

比較 version 1, 2, 3

測驗 9

測驗 10

Read more

2022q1 Homework1 (lab0)

2022q1 Homework5 (quiz5)

2022q1 Homework4 (quiz4)

2021q1 Homework2 (quiz2)

測驗 `8`

測驗 `9`