2020q3 Homework3 (dict)

contributed by < quantabase13 >

CPY 和 REF 實作的效能差異分析

在 dict 目錄下完成編譯後，執行

$ ./test_common --bench CPY

可以得到 bench_ref.txt 文件。
為了確認這份文件中的資料代表的意思，觀察了test_common.c 中的程式碼，其中一部份如下：

    if (argc == 3 && strcmp(argv[1], "--bench") == 0) {
        int stat = bench_test(root, BENCH_TEST_FILE, LMAX);
        tst_free(root);
        free(pool);
        return stat;
    }

bench_test 的程式碼被定義在 bench.c ，裡面的程式碼如下：

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include "bench.h"

#define DICT_FILE "cities.txt"
#define WORDMAX 256
#define PREFIX_LEN 3

double tvgetf()
{
    struct timespec ts;
    double sec;

    clock_gettime(CLOCK_REALTIME, &ts);
    sec = ts.tv_nsec;
    sec /= 1e9;
    sec += ts.tv_sec;

    return sec;
}

int bench_test(const tst_node *root, char *out_file, const int max)
{
    char prefix[PREFIX_LEN + 1] = "";
    char word[WORDMAX] = "";
    char **sgl;
    FILE *fp = fopen(out_file, "w");
    FILE *dict = fopen(DICT_FILE, "r");
    int idx = 0, sidx = 0;
    double t1, t2;

    if (!fp || !dict) {
        if (fp) {
            fprintf(stderr, "error: file open failed in '%s'.\n", DICT_FILE);
            fclose(fp);
        }
        if (dict) {
            fprintf(stderr, "error: file open failed in '%s'.\n", out_file);
            fclose(dict);
        }
        return 1;
    }

    sgl = (char **) malloc(sizeof(char *) * max);
    while (fscanf(dict, "%s", word) != EOF) {
        if (strlen(word) < sizeof(prefix) - 1)
            continue;
        strncpy(prefix, word, sizeof(prefix) - 1);
        t1 = tvgetf();
        tst_search_prefix(root, prefix, sgl, &sidx, max);
        t2 = tvgetf();
        fprintf(fp, "%d %f msec\n", idx, (t2 - t1) * 1000);
        idx++;
    }

    free(sgl);
    fclose(fp);
    fclose(dict);
    return 0;
}

關鍵是中間的 while 迴圈。仔細觀察後可以發現 bench_test 的工作內容就是將 cities.txt 中的城市名稱取出，取其前三個字母作為 prefix (根據 #define PREFIX_LEN 3 )，傳進函式 tst_search_prefix ，最後將 tst_search_prefix 執行的時間寫入 bench_ref.txt。
從 test_common.c 中，可以知道命令

$ ./test_common --bench CPY

裡的 CPY 決定了 string 在 ternary search tree 的儲存方式。
試著將其視覺化後，得到下圖：

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

如果改成執行

$ ./test_common --bench REF

視覺化後能得到下圖：

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

在 inputfile 皆為 cities.txt 的條件下，可以發現 CPY 的版本與 REF 比起來，執行 tst_search_prefix 有明顯速度上的差異。
嘗試用 perf 來細部分析這兩種不同命令，得到如下兩個結果：



















ho@ho-Nitro-AN515-54:~/dict$ sudo perf stat ./test_common --bench CPY
CPY mechanism
ternary_tree, loaded 206849 words in 0.098053 sec

 Performance counter stats for './test_common --bench CPY':

         11,433.94 msec task-clock                #    1.000 CPUs utilized          
                34      context-switches          #    0.003 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
             7,347      page-faults               #    0.643 K/sec                  
    47,111,915,437      cycles                    #    4.120 GHz                    
    50,429,200,586      instructions              #    1.07  insn per cycle         
     7,537,729,166      branches                  #  659.241 M/sec                  
       232,021,975      branch-misses             #    3.08% of all branches        

      11.439066191 seconds time elapsed

      11.390165000 seconds user
       0.043992000 seconds sys



















ho@ho-Nitro-AN515-54:~/dict$ sudo perf stat ./test_common --bench REF
REF mechanism
ternary_tree, loaded 206849 words in 0.096954 sec

 Performance counter stats for './test_common --bench REF':

         12,613.26 msec task-clock                #    1.000 CPUs utilized          
                43      context-switches          #    0.003 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
             7,170      page-faults               #    0.568 K/sec                  
    51,906,233,997      cycles                    #    4.115 GHz                    
    50,404,576,872      instructions              #    0.97  insn per cycle         
     7,533,050,891      branches                  #  597.232 M/sec                  
       232,130,089      branch-misses             #    3.08% of all branches        

      12.614219162 seconds time elapsed

      12.569555000 seconds user
       0.044005000 seconds sys

奇怪的是，執行


ho@ho-Nitro-AN515-54:~/dict$ sudo perf stat ./test_common --bench REF

出來的結果中，發現每個 cycle 只執行不到1個 instruction。
為了深入釐清，我們使用 perf record -g 取樣：


ho@ho-Nitro-AN515-54:~/dict$ sudo perf record -g ./test_common --bench REF

接著以 perf report 讀取報告：


ho@ho-Nitro-AN515-54:~/dict$ sudo perf report -g graph,0.5,caller

得到以下結果：



















Samples: 51K of event 'cycles', Event count (approx.): 52400465598
  Children      Self  Command      Shared Object      Symbol
+   99.48%     0.00%  test_common  [unknown]          [k] 0x0c96258d4c544155
+   99.48%     0.00%  test_common  libc-2.27.so       [.] __libc_start_main
+   99.48%     0.09%  test_common  test_common        [.] main
+   98.06%     0.01%  test_common  test_common        [.] bench_test
+   97.99%     0.48%  test_common  test_common        [.] tst_search_prefix
+   97.50%    97.38%  test_common  test_common        [.] tst_suggest
+    0.51%     0.00%  test_common  libc-2.27.so       [.] fprintf
     0.45%     0.12%  test_common  libc-2.27.so       [.] vfprintf
     0.36%     0.36%  test_common  test_common        [.] tst_ins_del
     0.28%     0.26%  test_common  libc-2.27.so       [.] __GI___printf_fp_l
     0.25%     0.00%  test_common  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwf
     0.25%     0.01%  test_common  [kernel.kallsyms]  [k] do_syscall_64
     0.21%     0.00%  test_common  libc-2.27.so       [.] __GI___libc_write
     0.20%     0.00%  test_common  [kernel.kallsyms]  [k] __x64_sys_write
     0.20%     0.00%  test_common  [kernel.kallsyms]  [k] ksys_write
     0.20%     0.00%  test_common  [kernel.kallsyms]  [k] vfs_write
     0.16%     0.08%  test_common  test_common        [.] bloom_add

發現 tst_suggest 花掉最多CPU 週期(self overhead 佔總體的 97.38%)，因此我們決定細部分析 tst_suggest 所用到的指令：


























Samples: 51K of event 'cycles', 4000 Hz, Event count (approx.): 52400465598
tst_suggest  /home/ho/dict/test_common [Percent: local period]
  1.10 │       mov    %al,-0xc(%rbp)
       │     if (!p || *n >= max)
  2.09 │       cmpq   $0x0,-0x8(%rbp)
  0.89 │     ↓ je     119
  2.73 │       mov    -0x28(%rbp),%rax
  1.17 │       mov    (%rax),%eax
  0.62 │       cmp    %eax,-0x10(%rbp)
       │     ↓ jle    119
       │     return;
       │     tst_suggest(p->lokid, c, nchr, a, n, max);
  0.15 │       movsbl -0xc(%rbp),%esi
  0.18 │       mov    -0x8(%rbp),%rax
 36.73 │       mov    0x8(%rax),%rax
  0.29 │       mov    -0x10(%rbp),%r8d
  0.08 │       mov    -0x28(%rbp),%rdi
  0.12 │       mov    -0x20(%rbp),%rcx
  0.72 │       mov    -0x18(%rbp),%rdx
  0.22 │       mov    %r8d,%r9d
  0.09 │       mov    %rdi,%r8
  0.10 │       mov    %rax,%rdi
  0.92 │     → callq  tst_suggest
       │     if (p->key)
  0.68 │       mov    -0x8(%rbp),%rax
  0.66 │       movzbl (%rax),%eax

這裡列出的其實就是 function 的 disassembly code。我們觀察到 tst_suggest(p->lokid, c, nchr, a, n, max); 這個 function call 中，mov 0x8(%rax),%rax 指令所占時間比例最高，於是我們想確認 %rax 對應 function call 中的那一個 argument。

參考相關的第一手資料 x86-64 psABI 3.2.3 [Parameter Passing]

• Arguments of types (signed and unsigned) _Bool , char , short , int , long , long long ,
and pointers are in the INTEGER class.

Passing Once arguments are classified, the registers get assigned (in left-to-right
order) for passing as follows:

If the class is MEMORY, pass the argument on the stack.

If the class is INTEGER, the next available register of the sequence %rdi,
%rsi, %rdx, %rcx, %r8 and %r9 is used.
.

If the class is SSE, the next available vector register is used, the registers
are taken in the order from %xmm0 to %xmm7.

If the class is SSEUP, the eightbyte is passed in the next available eightbyte
chunk of the last used vector register.

If the class is X87, X87UP or COMPLEX_X87, it is passed in memory.

我們可以確認，在 tst_suggest(p->lokid, c, nchr, a, n, max); 中：

%rdi 存的是 p->lokid
%esi 存的是 c
%rdx 存的是 nchr
%rcx 存的是 a
%r8 存的是 n
%r9d 存的是 max

並且，p->lokid 這項操作花了最長的時間。
到這裡似乎可以深入分析是否是 cache misses 造成的影響了，但是我遇到一個不知如何解釋的狀況：
執行命令


$ sudo perf record -g -e cycles  ./test_common --bench REF

接著細部分析 tst_suggest 的指令熱點，會發現：

























Samples: 56K of event 'cycles', 4000 Hz, Event count (approx.): 57186636114
tst_suggest  /home/ho/dict/test_common [Percent: local period]
Percent│     if (!p || *n >= max)
  0.98 │       cmpq   $0x0,-0x8(%rbp)
  1.90 │     ↓ je     119
  1.61 │       mov    -0x28(%rbp),%rax
  1.41 │       mov    (%rax),%eax
  1.01 │       cmp    %eax,-0x10(%rbp)
       │     ↓ jle    119
       │     return;
       │     tst_suggest(p->lokid, c, nchr, a, n, max);
  0.52 │       movsbl -0xc(%rbp),%esi
  0.14 │       mov    -0x8(%rbp),%rax
  0.17 │       mov    0x8(%rax),%rax
 37.93 │       mov    -0x10(%rbp),%r8d
  0.24 │       mov    -0x28(%rbp),%rdi
  0.09 │       mov    -0x20(%rbp),%rcx
  0.09 │       mov    -0x18(%rbp),%rdx
  0.60 │       mov    %r8d,%r9d
  0.21 │       mov    %rdi,%r8
  0.11 │       mov    %rax,%rdi
  0.09 │     → callq  tst_suggest
       │     if (p->key)
  0.31 │       mov    -0x8(%rbp),%rax
  0.51 │       movzbl (%rax),%eax

同樣都使用相同的 event ，只是差別在有沒有在命令列顯式給出使用的 event (根據Perf wiki , perf record 預設採用 cycles 作為 Sampling event)，最後的熱點竟然不同。
同時，若使用perf top 命令即時監控的話，會發現熱點仍然是 mov 0x8(%rax),%rax 。
另外，若是用 perf 分析 CPY 的版本，也會得到類似的 output:




















Samples: 46K of event 'cycles', Event count (approx.): 47723392335
  Children      Self  Command      Shared Object      Symbol
+   99.46%     0.00%  test_common  [unknown]          [.] 0x0c96258d4c544155       ◆
+   99.46%     0.00%  test_common  libc-2.27.so       [.] __libc_start_main        ▒
+   99.46%     0.13%  test_common  test_common        [.] main                     ▒
+   97.95%     0.02%  test_common  test_common        [.] bench_test               ▒
+   97.84%     0.50%  test_common  test_common        [.] tst_search_prefix        ▒
+   97.36%    97.25%  test_common  test_common        [.] tst_suggest              ▒
+    0.52%     0.00%  test_common  libc-2.27.so       [.] fprintf                  ▒
     0.47%     0.13%  test_common  libc-2.27.so       [.] vfprintf                 ▒
     0.40%     0.39%  test_common  test_common        [.] tst_ins_del              ▒
     0.28%     0.26%  test_common  libc-2.27.so       [.] __GI___printf_fp_l       ▒
     0.22%     0.00%  test_common  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hw▒
     0.22%     0.01%  test_common  [kernel.kallsyms]  [k] do_syscall_64            ▒
     0.17%     0.00%  test_common  libc-2.27.so       [.] __GI___libc_write        ▒
     0.17%     0.00%  test_common  [kernel.kallsyms]  [k] __x64_sys_write          ▒
     0.17%     0.00%  test_common  [kernel.kallsyms]  [k] ksys_write               ▒
     0.16%     0.03%  test_common  libc-2.27.so       [.] __isoc99_fscanf          ▒
     0.15%     0.00%  test_common  [kernel.kallsyms]  [k] vfs_write                ▒
     0.15%     0.06%  test_common  test_common        [.] bloom_add                ▒





















Samples: 46K of event 'cycles', 4000 Hz, Event count (approx.): 47723392335
tst_suggest  /home/ho/dict/test_common [Percent: local period]
  1.51 │     ↓ je     119                                                          ▒
  4.55 │       mov    -0x28(%rbp),%rax                                             ▒
  1.65 │       mov    (%rax),%eax                                                  ▒
  0.76 │       cmp    %eax,-0x10(%rbp)                                             ▒
       │     ↓ jle    119                                                          ▒
       │     return;                                                               ▒
       │     tst_suggest(p->lokid, c, nchr, a, n, max);                            ▒
  0.11 │       movsbl -0xc(%rbp),%esi                                              ▒
  0.19 │       mov    -0x8(%rbp),%rax                                              ▒
 40.62 │       mov    0x8(%rax),%rax                                               ▒
  0.33 │       mov    -0x10(%rbp),%r8d                                             ▒
  0.08 │       mov    -0x28(%rbp),%rdi                                             ▒
  0.09 │       mov    -0x20(%rbp),%rcx                                             ◆
  0.80 │       mov    -0x18(%rbp),%rdx                                             ▒
  0.26 │       mov    %r8d,%r9d                                                    ▒
  0.05 │       mov    %rdi,%r8                                                     ▒
  0.12 │       mov    %rax,%rdi                                                    ▒
  1.01 │     → callq  tst_suggest                                                  ▒
Press 'h' for help on key bindings

如果兩種實作都有同樣位置的熱點，是否表示 CPY 的實作方式的其實沒辦法降低 cache misses ?
我們比較看看 cache-misses event 的數量：

以下是 CPY 的版本。
輸入命令為


$ sudo perf record -g -e cache-misses,cache-references,instructions,cycles  ./test_common --bench CPY

得到















Samples: 46K of event 'cache-misses', Event count (approx.): 1078849051
  Children      Self  Command      Shared Object      Symbol
+   99.52%     0.00%  test_common  [unknown]          [.] 0x0c96258d4c544155
+   99.52%     0.00%  test_common  libc-2.27.so       [.] __libc_start_main
+   99.52%     0.00%  test_common  test_common        [.] main
+   99.21%     0.01%  test_common  test_common        [.] bench_test
+   99.14%     0.26%  test_common  test_common        [.] tst_search_prefix
+   98.89%    98.78%  test_common  test_common        [.] tst_suggest
     0.35%     0.00%  test_common  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
     0.35%     0.00%  test_common  [kernel.kallsyms]  [k] do_syscall_64
     0.30%     0.00%  test_common  libc-2.27.so       [.] __GI___libc_write
     0.30%     0.00%  test_common  [kernel.kallsyms]  [k] __x64_sys_write
     0.30%     0.00%  test_common  [kernel.kallsyms]  [k] ksys_write
     0.30%     0.00%  test_common  [kernel.kallsyms]  [k] vfs_write
     0.23%     0.00%  test_common  [kernel.kallsyms]  [k] __vfs_write

以下是 REF 的版本。
輸入命令為


$ sudo perf record -g -e cache-misses,cache-references,instructions,cycles  ./test_common --bench REF

得到















Samples: 54K of event 'cache-misses', Event count (approx.): 1329973047
  Children      Self  Command      Shared Object      Symbol
+   99.55%     0.00%  test_common  [unknown]          [k] 0x0c96258d4c544155
+   99.55%     0.00%  test_common  libc-2.27.so       [.] __libc_start_main
+   99.54%     0.00%  test_common  test_common        [.] main
+   99.30%     0.01%  test_common  test_common        [.] bench_test
+   99.26%     0.18%  test_common  test_common        [.] tst_search_prefix
+   99.08%    98.70%  test_common  test_common        [.] tst_suggest
     0.34%     0.00%  test_common  [kernel.kallsyms]  [k] do_syscall_64
     0.34%     0.00%  test_common  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
     0.31%     0.00%  test_common  libc-2.27.so       [.] __GI___libc_write
     0.30%     0.00%  test_common  [kernel.kallsyms]  [k] __x64_sys_write
     0.30%     0.00%  test_common  [kernel.kallsyms]  [k] ksys_write
     0.30%     0.00%  test_common  [kernel.kallsyms]  [k] vfs_write
     0.24%     0.00%  test_common  [kernel.kallsyms]  [k] __vfs_write

兩者的 cache-misses 的數量比為

\frac{1329973047}{1078849051} \approx 1.23

。
再看看當輸入命令為


$ sudo perf record -e cache--misses -g  ./test_common --bench CPY

或


$ sudo perf record -e cache-misses -g  ./test_common --bench REF

時，$ perf top -e cache-misses 會給出哪個指令是熱點的判斷。
結果發現熱點是 mov -0x10(%rbp),%r8d，也就是 mov 0x8(%rax),%rax 的下一條指令。

研讀 Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters 論文

論文的連結在此
Xor Filters 的演算法可分為四個部份：

Membership Tests
Construction
Mapping Step
Assigning Step

以下我依序針對各個部份記錄我的理解。

Membership Tests

這部份的演算法很簡短，只有比較

f i n g e r p r i n t (x)

是否與

B [h_{0} (x)] x o r B [h_{1} (x)] x o r B [h_{2} (x)]

相等。我們透過這部份的演算法得知 input

x

是否屬於當初構建 Xor Filter 時所用的 Key Set (就是判斷資料庫裡有沒有

x

)。

Construction

從這部份可以看出演算法的架構，內容如下：

我們需要：
- set of keys S
- a fingerprint function
演算法的工作流程：
- 隨機選擇3個 hash function
  $h_{0}$ ,
  $h_{1}$ ,
  $h_{2}$ , 直到
  $m a p (S, h_{0}, h_{1}, h_{2})$ 返回 Sucess 以及一個 stack
  $σ$ 。(關於這裡的隨機，有特別強調要 independent from the fingerprint function, 但我不懂一個 function indepentdent from 另一個 function 的定義是什麼。需要查閱相關資料。)
- 設 B 為一個大小為
  $⌊ 1.23 \cdot | S | ⌋ + 32$ 的 array, 其中每一個元素有 k bits。
- $a s s i g n (σ, B, h_{0}, h_{1}, h_{2})$
我們最後能得到：
- array B and the hash functions
  $h_{0}$ ,
  $h_{1}$ ,
  $h_{2}$

這一部份出現了兩個新 function

m a p (S, h_{0}, h_{1}, h_{2})

及

a s s i g n (σ, B, h_{0}, h_{1}, h_{2})

，它們是 Xor Filter 這個演算法的核心，分別對應到 Mapping Step 和 Assigning Step

Mapping Step

m a p ()

的想法如下：

設
$H$ 為大小為
$⌊ 1.23 \cdot | S | ⌋ + 32$ 的 array, 其中每一個元素皆為 key 的集合(一開始為空)。
對用來構建 Xor Filter 的 Key Set 中的每一個成員
$x$ ，計算
$h_{0} (x)$ ,
$h_{1} (x)$ ,
$h_{2} (x)$ ，並將
$x$ 填入 H 中對應的集合
$H [h_{0} (x)]$ 、
$H [h_{1} (x)]$ 、
$H [h_{2} (x)]$ 中

接下來：

初始化一個 empty queue Q，並遍歷整個
$H$ ，如果
$H (i)$ 只有一個元素，就將數字
$i$ 加入 Q 中。

最後：

初始化一個 empty stack
$σ$ 。
檢查 Q 是否為空，如果不是，就將一個元素
$i$ 從 Q 中移除，並且：
- 如果
  $H [i]$ 只有一個元素
  $x$ ，就把 (
  $x$ ,
  $i$ ) push 到 stack
  $σ$
- 把
  $x$ 從
  $H [h_{j} (x)]$ 移除 (j = 0 to 2)，並且：
  - 如果移除
    $x$ 後
    $H [h_{j} (x)]$ 只剩一個元素，就把
    $h_{j} (x)$ 加入 Q 中。

如果 stack

σ

大小｜

σ

｜=｜S｜，return success and the stack

σ

。

我們在此先分析 stack

σ

元素的特性：
根據

如果
$H [i]$ 只有一個元素
$x$ ，就把 (
$x$ ,
$i$ ) push 到 stack
$σ$

這項規則，每次將 (

x

i

) push 到 stack

σ

後，

i

會與用之後遇到的每個

x'

計算出的

h_{0} (x')

h_{1} (x')

h_{2} (x')

不同

因此，我們最後得到的 stack

σ

有一個非常重要、之後的 Assigning Step 會用到的特點：

$i$ is different from
$h_{0} (x')$ ,
$h_{1} (x')$ ,
$h_{2} (x')$ for all keys
$x'$ encountered so far.

這裡的 encounter so far 的時間線是指從 stack 頂端到 stack 底端遍歷的方向。

Assigning Step

a s s i g n ()

的參數是 stack

σ

, B, hash funtions

h_{0}

、

h_{1}

、

h_{2}

。這一步的工作如下：

對 stack

σ

中的每一個元素(

x

i

)：

令
$B [i] = 0$
令
$B [i] = f i n g e r p r i n t (x) x o r B [h_{0} (x)] x o r B [h_{1} (x)] x o r B [h_{2} (x)]$

由於 Stack 中元素的特性：

$i$ is different from
$h_{0} (x')$ ,
$h_{1} (x')$ ,
$h_{2} (x')$ for all keys
$x'$ encountered so far.

所以每個

B [i]

都只會被 set 一次。

雖然不妨礙理解演算法，但一個令人疑惑的問題是，Assigning Step 中，，

B [h_{0} (x)] 、 B [h_{1} (x)] 、 B [h_{2} (x)]

(至少其中兩個) 在還沒被初始化前就被引用，這樣的表達方式是否合理？

參閱對應程式碼實作 xor_singleheader，引入 dict 程式碼

xor_singleheader 的實作程式碼在此
xor_singleheader 實作上跟 bloom filter 有一個差別，導致不能直接將其引入程式碼：

用來構建 Xor filter 的 Key Set，其中不能有重複的元素。

相反的，bloom filter 容許重複元素。
因此，決定先對資料進行前處理。前處理的步驟如下：

將 cities.txt 資料根據資料檔案的分析所述加以修改，以利解析。
讀取資料至 array，並進行排序。
將 array 中重複的元素移除。
針對 array 中每個元素計算其 hash 值（使用 djb2 作為 hash function）

以下是我前處理的核心程式碼(第一次嘗試)：




































u_int64_t *cities_parsing(FILE *fp, char **stringSet, int *setsize)
{
    int size = 0;
    char line[WRDMAX];
    while (fgets(line, WRDMAX, fp)!= NULL) {
        char city[WRDMAX] = "";
        char province[WRDMAX] = "";
        char nation[WRDMAX] = "";
        sscanf(line, "%[^,^\n], %[^,^\n], %[^,^\n]", city, province, nation);
        char *city_dup = strdup(city);
        char *province_dup = strdup(province);
        char *nation_dup = strdup(nation);
        if (*nation) {
            *(stringSet + size) = city_dup;
            *(stringSet + size + 1) = province_dup;
            *(stringSet + size + 2) = nation_dup;
            size = size + 3;
        } else {
            *(stringSet + size) = city_dup;
            *(stringSet + size + 1) = province_dup;
            free(nation_dup);
            size = size + 2;
        }
    }
    
    qsort(stringSet, size, sizeof(char *), cstring_cmp);
    rm_dup(stringSet, &size);
    *setsize = size;
    u_int64_t *stringSet_hashed = malloc(size * sizeof(u_int64_t));

    for (int j = 0; j < size; j++) {
        *(stringSet_hashed + j) = djb2(*(stringSet + j));
    }
    return stringSet_hashed;

}

我把它稱作第一次嘗試，是因為檢查 hash 出來的結果時，發現有duplicate (共有 1 個)。
把 hash funciton 從 dbj2 換成 jenkins 後就沒有 duplicate 發生。
使用的 jenkins 實作如下：















static unsigned long jenkins(const void *_str)
{
    const char *key = _str;
    unsigned long hash = 0;
    while (*key) {
        hash += *key;
        hash += (hash << 10);
        hash ^= (hash >> 6);
        key++;
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

接下來就是實作 benchmark 的相關程式:

Xor filter 的 benchmark 程式碼部份如下：
































    FILE *fp_xor = fopen(IN_FILE, "r");
    char **stringSet = malloc(300000 * sizeof(char *));
    u_int64_t *stringSet_hashed = cities_parsing(fp_xor, stringSet, &setsize);
    xor8_t filter;
    xor8_allocate(setsize, &filter);
    bool is_ok = xor8_populate(stringSet_hashed, setsize, &filter);
    if (!is_ok) {
        printf("You have duplicate keys");
        return 0;
    }
    if (argc == 3 && strcmp(argv[1], "--benchXOR") == 0) {
        int min = 0;
        int max = setsize - 1;
        FILE *fp = fopen(BENCH_TEST_FILE, "w");
        for (int i = 0; i < setsize; i++) {
            int x = rand() % (max - min) + min;
            char *tmp = stringSet[x];
            t1 = tvgetf();
            if (!(xor8_contain(jenkins(tmp), &filter))) {
                t2 = tvgetf();
                fprintf(fp, "%d %f msec\n", i, (t2 - t1) * 1000);
                continue;
            }
            res = tst_search(root, tmp);
            t2 = tvgetf();
            fprintf(fp, "%d %f msec\n", i, (t2 - t1) * 1000);
        }
    tst_free(root);
    free(pool);
    return 0;
    }

想法是：從 stringSet 裡隨機選擇一個 string (stringSet 就是儲存 cities.txt 中所有名稱（且不重複）的 array)，將其輸入 Xor filter 確認是否存在。

以下是分別針對 CPY 及 REF 測試的圖：

奇怪的地方是，兩張圖在 y = 0 ( x 軸) 處都仍有數據點分佈，理論上這些數據點代表不屬於 filter 的 string (直接出局不用到 tree 裡 search)。但所有輸入的 string 都屬於 filter，應該不會出現這種分佈。

關於 Bloom filter 的部份，benchmark程式碼如下：





















    if (argc == 3 && strcmp(argv[1], "--benchBloom") == 0) {
    int min = 0;
    int max = setsize - 1;
    FILE *fp = fopen(BENCH_TEST_FILE, "w");
    for (int i = 0; i < setsize; i++) {
        int x = rand() % (max - min) + min;
        char *tmp = stringSet[x];
        t1 = tvgetf();
        if (!(bloom_test(bloom, word))) {
            t2 = tvgetf();
            fprintf(fp, "%d %f msec\n", i, (t2 - t1) * 1000);
            continue;
        }
        res = tst_search(root, tmp);
        t2 = tvgetf();
        fprintf(fp, "%d %f msec\n", i, (t2 - t1) * 1000);
    }
    tst_free(root);
    free(pool);
    return 0;
    }

以下是分別針對 CPY 及 REF 測試的圖：

到這邊可以知道我的 bench mark 設計一定有問題，需要再花時間去釐清。

2020q3 Homework3 (dict)

CPY 和 REF 實作的效能差異分析

研讀 Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters 論文

Membership Tests

Construction

Mapping Step

Assigning Step

參閱對應程式碼實作 xor_singleheader， 引入 dict 程式碼

Read more

2020q3 quiz11

2020q3 Homework3 (quiz3)

2020q3 Homework6 (kcalc)

2020q3 Homework6 (quiz6)

參閱對應程式碼實作 xor_singleheader，引入 dict 程式碼