2020q3 Homework3 (dict)

# 2020q3 Homework3 (dict) contributed by < `blueskyson` > ###### tags: `linux2020` ## 目錄 [TOC] ## 視覺化 ternary search tree + bloom filter 的效能表現並分析 ### ternary search tree 效能首先執行 `$ ./test_common --bench CPY` ，輸出 bench_ref.txt ，再使用 gnuplot 畫出分佈圖， gnuplot 指令如下: ``` set xlabel 'city-string index' set ylabel 'time(msec)' set title 'tst\_search\_prefix time costs' plot [:260000][:]'bench_ref.txt' using 1:2 with points ``` 輸出圖表如下： ![](https://i.imgur.com/k43qcm6.png) 此圖的 y 座標為 tst.c 中 `tst_search_prefix()` 每次搜尋補字所消耗的時間，單位為微秒。 x 座標則是代表 `tst_search_prefix()` 讀入第 x 個字串，每個字串是 cities.txt 的城市或國家名中取前 3 個字元，使用 3 個字元當作 prefix 搜尋。總共從 cities.txt 讀取大約 260000 個字串。再來執行 `$ ./test_common --bench REF` ，一樣會輸出 bench_ref.txt ，分佈圖為: ![](https://i.imgur.com/7Db9Y1D.png) 比較 CPY 和 REF 機制，發現 CPY 消耗的時間大部份都在 0.6 秒以下， REF 消耗的時間大部分在 0.7 至 0.8 秒以下。比較搜尋時間， CPY 的效能比 REF 好。 ### bloom filter 效能由於 `bench.c` 沒有內建的 bloom filter 效能分析，我在 `bench.c` 加入類似 `bench_test()` 的函式，它會從 cities.txt 中讀取地名，並且去除 `,` 和 `\n` 字元，然後呼叫 `bloom_test()` 檢查其是否在， bloom filter 中，以下展開為程式碼: :::spoiler bench_test2() ```c= int bench_test2(bloom_t bloom, const tst_node *root, char *out_file, const int max) { char word[WORDMAX] = ""; char **sgl; FILE *fp = fopen(out_file, "w"); FILE *dict = fopen(DICT_FILE, "r"); int idx = 0; double t1, t2; if (!fp || !dict) { if (fp) { fprintf(stderr, "error: file open failed in '%s'.\n", DICT_FILE); fclose(fp); } if (dict) { fprintf(stderr, "error: file open failed in '%s'.\n", out_file); fclose(dict); } return 1; } sgl = (char **) malloc(sizeof(char *) * max); while (fscanf(dict, "%s", word) != EOF) { int lastchar = strlen(word) - 1; if (word[lastchar] == ',' || word[lastchar] == '\n') word[lastchar] = '\0'; t1 = tvgetf(); bloom_test(bloom, word); t2 = tvgetf(); fprintf(fp, "%d %f msec\n", idx, (t2 - t1) * 1000); idx++; } free(sgl); fclose(fp); fclose(dict); return 0; } ``` ::: <br> 然後我在 `test_common.c` 中加入: ```c= if (argc == 3 && strcmp(argv[1], "--bench2") == 0) { int stat = bench_test2(bloom, root, "bench_ref2.txt", LMAX); tst_free(root); free(pool); return stat; } ``` 做完這些更動之後，執行 `$ ./test_common --bench2 CPY` 即可輸出每次執行 bloom_test 所消耗的時間 bench_ref2.txt ，畫出分佈圖: ![](https://i.imgur.com/DlpfwxE.png) 執行 `$ ./test_common --bench2 REF` ![](https://i.imgur.com/GQTlq7l.png) 可以發現不論是 CPY 或 REF 執行 bloom_test 的速度都非常快，幾乎在 0.002 微秒以下。 ## 以 perf 分析 TST 程式碼的執行時期行為並且充分解讀首先執行 5 次 perf ，分析 `./test_common --bench CPY` 的行為: ``` $ perf stat -e branches,branch-misses,cache-misses,cycles,instructions,context-switches ./test_common --bench CPY CPY mechanism ternary_tree, loaded 206849 words in 0.142208 sec Performance counter stats for './test_common --bench CPY': 7,545,228,775 branches 236,997,577 branch-misses # 3.14% of all branches 1,589,264,930 cache-misses 51,561,737,832 cycles 52,179,870,840 instructions # 1.01 insn per cycle 57 context-switches 16.466583643 seconds time elapsed 16.421854000 seconds user 0.036004000 seconds sys ``` 執行 5 次 perf ，分析 `./test_common --bench REF` 的行為: ``` $ perf stat -e branches,branch-misses,cache-misses,cycles,instructions,context-switches ./test_common --bench REF REF mechanism ternary_tree, loaded 206849 words in 0.154480 sec Performance counter stats for './test_common --bench REF': 7,541,064,992 branches 237,372,869 branch-misses # 3.15% of all branches 1,682,579,144 cache-misses 57,721,488,007 cycles 52,159,621,475 instructions # 0.90 insn per cycle 67 context-switches 18.673287298 seconds time elapsed 18.639914000 seconds user 0.031999000 seconds sys ``` 分析 perf 的資訊，發現 CPY 和 REF 機制耗費的 `instruction` 數量差不多，都在 520 億個指令左右，然而 CPY 卻耗費較少執行時間。 CPY 可以耗費較少時間是因為 `tst` 每次建立新的 `leaf` 時，會在 heap 分配新的空間儲存字串，因此在寫入以及走訪 tst 時，有比較大的機率存取到相鄰的記憶體位置，進而降低 `cache misses` 。從 perf 的數據也可發現 CPY 的 `cache misses` 為 1,589,264,930 ， REF 的 `cache misses` 為 1,682,579,144 ，符合前述推論。 ## 考慮到 CPY 和 REF 這兩種實作機制，思考效能落差的緣故，並且提出後續效能改善機制同前面所述，效能落差顯現於 `cashe misses` 。改善機制: 目前還沒有想法 ## 研讀 [Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters](https://arxiv.org/abs/1912.08258) 論文並參閱對應程式碼實作 xor_singleheader，引入上述 dict 程式碼中，需要比較原有的 bloom filter 和 xor filter 在錯誤率及記憶體開銷