2018q3 Homework3(dict)

# 2018q3 Homework3(dict) contributed by < `type59ty` > ###### tags: `sysprog2018`, `hw3` --- - 實驗環境 ``` System: Linux pop-os 4.15.0-34-generic gcc version: 7.3.0 memory: 15.5G cpu: Intel® Core™ i7-4770 CPU @ 3.40GHz × 8 ``` ## 預期目標 - [ ] 學習 [ternary search tree](https://en.wikipedia.org/wiki/Ternary_search_tree) 作為 [auto-complete](https://en.wikipedia.org/wiki/Autocomplete) 和 prefix search 的實作機制 - [ ] 學習 [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) 和 [ternary search tree](https://en.wikipedia.org/wiki/Ternary_search_tree) 的整合議題 - bit-wise operation 的實踐 - [ ] 設計效能評比 (benchmark) 的程式框架 - [ ] 學習 GNU make 和 `Makefile` 撰寫 - [ ] 複習機率統計和對應的數學基礎 ## Trie (字典樹) 和 Ternary search tree (TST) ### Trie - 定義：字典樹(Trie)，也被稱為單詞搜尋樹，是一種很特別的樹狀資料結構，可以快速的依照字母插入、尋找字串，特別**適用於大量字串出現**的時候。 - 優點：跟 binary search tree 相比， Trie 不把整個字串包在同一個 node 中，而是根據 node 的位置決定代表的字串為何。也就是一個看結果(最終 node)，一個看過程 (經過的 node)。因此利用 trie，可以**快速的找出有相同 prefix 的字串**。 - 缺點：若是**資料非常大量，而且幾乎沒有相同的 prefix**，將會非常**浪費儲存空間**。 ### Ternary search tree - 定義：Ternary search tree (TST)，這種樹的每個節點最多只會有 3 個 children，分別代表小於，等於，大於。 - 優點：解決 Trie 空間浪費的缺點，同時保有 Trie 的搜尋效率和 Binary search tree (BST) 的空間利用率優點 - 缺點： TST 插入的順序會影響比較的次數 - [Trie v.s Ternary search tree](https://www.cnblogs.com/rush/archive/2012/12/30/2839996.html) ## [ternary search tree](https://en.wikipedia.org/wiki/Ternary_search_tree) - 從 github 上 fork 後 clone 下來 ```c forest@pop-os:~$ git clone https://github.com/type59ty/dict.git ``` - 測試： ```c forest@pop-os:~/dict$ ./test_cpy ternary_tree, loaded 259112 words in 0.101007 sec Commands: a add word to the tree f find word in tree s search words matching prefix d delete word from the tree q quit, freeing all data choice:a enter word to add: NCKU NCKU - inserted in 0.000009536743 sec. (259113 words in tree) choice:f find word in tree: NCKU found NCKU in 0.000003 sec. choice: s find words matching prefix (at least 1 char): Ta Ta - searched prefix in 0.002033 sec suggest[0] : Tañgo, suggest[1] : Taşca, suggest[2] : Taşköprü, …… ``` ## 程式碼修正觀察 test_cpy.c 和 test_ref.c 兩者的輸出，發現只有 test_ref 用到 bloom filter 的機制，需要在 test_cpy.c 中補上程式碼 ### ./test_cpy ```c find word in tree: Taiwan found Taiwan in 0.000001 sec. ``` ### ./test_ref ```c find word in tree: Taiwan Bloomfilter found Taiwan in 0.000001 sec. Probability of false positives:0.009693 ---------- Tree found Taiwan in 0.000001 sec. ``` ### 宣告 - #include "bloom.h" - #define TableSize 5000000 - #define HashNumber 2 - bloom_t bloom = bloom_create(TableSize); - tst_ins_del 這個函式來自 tst.h ，當我們要在 TST 中插入/刪除一個單字的時候，函式會將單字一個個插入/刪除 TST ```c while ((rtn = fscanf(fp, "%s", word)) != EOF) { char *p = word; if( !tst_ins_del(&root, &p, INS, CPY) ){ … } … } ``` - 因為要加入 bloom filter 機制，所以讀檔時要把單字一併加入 bloom table，因此在底下加入： ```c while ((rtn = fscanf(fp, "%s", word)) != EOF) { char *p = word; if( !tst_ins_del(&root, &p, INS, CPY) ){ … }else { /* update bloom filter */ bloom_add(bloom, word); } idx++; word += (strlen(word) + 1); } ``` :::warning - 先執行看看，結果得到錯誤訊息 ```c test_cpy.c:59:14: error: assignment to expression with array type word += (strlen(word) + 1); ^~ ``` ::: - 回頭看一下 word 的型態為 array，所以要改成： ```c *word += (strlen(word) + 1); ``` - 其他部份也一併修改，完成 test_cpy.c - 而 test_ref.c 中也缺少一些程式碼： ```c if (argc == 2 && strcmp(argv[1], "--bench") == 0) { int stat = bench_test(root, BENCH_TEST_FILE, LMAX); tst_free_all(root); return stat; } FILE *output; output = fopen("ref.txt", "a"); if (output != NULL) { fprintf(output, "%.6f\n", t2 - t1); fclose(output); } else printf("open file error\n"); ``` - 修改完成，執行 ./test_cpy --bench 測試 ```c forest@pop-os:~/dict$ ./test_cpy --bench ternary_tree, loaded 259112 words in 0.139431 sec ``` - ./test_ref --bench ```c forest@pop-os:~/dict$ ./test_ref --bench ternary_tree, loaded 259112 words in 0.140784 sec 程式記憶體區段錯誤 (核心已傾印) ``` :::warning 雖然資料都有成功 load 到 bench_ref.txt，可是卻顯示 segmentation fault，這方面需要再研究一下... ::: ### bench.c 這個程式會 copy 每個單字的前3個字元( prefix )，使用 TST 搜尋並將花費時間紀錄下來 ```c char prefix[3] = ""; …… strncpy(prefix, word, 3); …… ``` 但是字串的結尾都會擺一個 null character ，並佔用一個 char 空間，所以宣告是錯的，應該改成： ```c char prefix[4] = ""; ``` 還有一個地方也有問題，就是關於時間的單位，一開始宣告 sec 為 10^-9^秒 ```c sec /= 1e9; ``` 但最後紀錄時間時，後面乘上 10^6^ ，所以單位應該變成 micro sec (10^-3^ sec) ```c fprintf(fp, "%d %f msec\n", idx, (t2 - t1) * 1000000); ``` ## Makefile 修改 - 原本的 Makefile ```c TESTS = test_cpy test_ref TEST_DATA = s Tai CFLAGS = -O0 -Wall -Werror -g # Control the build verbosity ifeq ("$(VERBOSE)","1") Q := VECHO = @true else Q := @ VECHO = @printf endif GIT_HOOKS := .git/hooks/applied .PHONY: all clean all: $(GIT_HOOKS) $(TESTS) $(GIT_HOOKS): @scripts/install-git-hooks @echo OBJS_LIB = \ tst.o bloom.o OBJS := \ $(OBJS_LIB) \ test_cpy.o \ test_ref.o deps := $(OBJS:%.o=.%.o.d) test_%: test_%.o $(OBJS_LIB) $(VECHO) " LD\t$@\n" $(Q)$(CC) $(LDFLAGS) -o $@ $^ -lm %.o: %.c $(VECHO) " CC\t$@\n" $(Q)$(CC) -o $@ $(CFLAGS) -c -MMD -MF .$@.d $< test: $(TESTS) echo 3 | sudo tee /proc/sys/vm/drop_caches; perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_cpy --bench $(TEST_DATA) perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_ref --bench $(TEST_DATA) bench: $(TESTS) @for test in $(TESTS); do\ ./$$test --bench $(TEST_DATA); \ done clean: $(RM) $(TESTS) $(OBJS) $(RM) $(deps) rm -f bench_cpy.txt bench_ref.txt ref.txt cpy.txt caculate -include $(deps) ``` - 修改 $make test 前面增加一行刪除 cpy.txt 和 ref.txt，如果 test 跑第二次以上，他的資料會不斷 append 到這兩個檔案，所以應該是每次 test 都要先清空 ```c test: $(TESTS) rm -f cpy.txt ref.txt echo 3 | sudo tee /proc/sys/vm/drop_caches; perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_cpy --bench $(TEST_DATA) perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_ref --bench $(TEST_DATA) ``` - 加入 $make plot calculate.c 會讀取 cpy.txt 和 ref.txt，並分別計算執行100次 test 的平均值，寫入 output.txt 檔案，以供 runtimept.gp 繪製圖表 ```c plot: gcc -o calculate calculate.c ./calculate gnuplot scripts/runtime*.gp eog runtime*.png ``` - 修改 $make clean 新增一段刪除 gnuplot 所產生的圖表 ```c clean: $(RM) $(TESTS) $(OBJS) $(RM) $(deps) rm -f bench_cpy.txt bench_ref.txt ref.txt cpy.txt runtime*.png caculate ``` - 修改 $make bench 執行 prefix 效能評估 ```c bench: ./test_cpy --bench ./test_ref --bench ``` ## Linux 效能分析工具: Perf - 這邊使用分別執行100次 test_cpy --bench 和 test_ref --bench 並使用 perf 觀察其 cache-miss、cache-reference、instruction 和 clock cycle ```c forest@pop-os:~/dict$ sudo make test rm -f cpy.txt ref.txt echo 3 | sudo tee /proc/sys/vm/drop_caches; 3 perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_cpy --bench s Tai ternary_tree, loaded 259112 words in 0.132259 sec ``` ```c Performance counter stats for './test_cpy --bench s Tai' (100 runs): 1,201,574 cache-misses # 44.844 % of all cache refs ( +- 0.31% ) 2,679,453 cache-references ( +- 0.22% ) 586,983,628 instructions # 0.97 insn per cycle ( +- 0.03% ) 604,831,839 cycles ( +- 0.15% ) 0.161409468 seconds time elapsed ( +- 0.19% ) ``` ```c Performance counter stats for './test_ref --bench s Tai' (100 runs): 1,274,843 cache-misses # 44.349 % of all cache refs ( +- 0.87% ) 2,874,558 cache-references ( +- 0.20% ) 588,566,508 instructions # 0.95 insn per cycle ( +- 0.00% ) 619,669,874 cycles ( +- 0.36% ) 0.164743728 seconds time elapsed ( +- 0.37% ) ``` ## gnuplot - $make plot ```c plot: gcc -o caculate calculate.c ./caculate gnuplot scripts/runtime*.gp eog runtime*.png ``` 在 scripts 資料夾內有幾個 .gp 檔，是用來製作 gnuplot 用的，但其中有一些部份要改： - runtime3.gp 觀察程式碼可以發現，這個程式會讀取 bench_cpy.txt 的資料，其中 plot 這邊 x 軸要改成 247613 (因為有 0~247613筆資料)， y 軸讓它自動設定，( x 軸如果也自動設定，它會跳到下一個級距，會空一塊在那邊) 且 y軸時間單位應改為 micro sec ( ms ) ```c reset set xlabel 'prefix' set ylabel 'time( ms )' set title 'perfomance comparison' set term png enhanced font 'Verdana,10' set output 'runtime3.png' set format x "%10.0f" set xtic 30000 set xtics rotate by 45 right plot [:247614][:]'bench_cpy.txt' using 1:2 with points title 'cpy',\ ``` - runtimept.gp 也一樣 ```c reset set xlabel 'prefix' set ylabel 'time( ms )' set title 'perfomance comparison' set term png enhanced font 'Verdana,10' set output 'runtime2.png' set format x "%10.0f" set xtic 30000 set xtics rotate by 45 right plot [0:247613][:]'bench_cpy.txt' using 1:2 with points title 'cpy',\ 'bench_ref.txt' using 1:2 with points title 'ref',\ ``` 最後結果： ![](https://i.imgur.com/uYT2Aop.png) ![](https://i.imgur.com/GuXB4p4.png) ![](https://i.imgur.com/GyPzT0A.png)