2018q3 Homework3 (dict)

2018q3 Homework3 (dict) === contributed by < [Jyun-Neng](https://github.com/Jyun-Neng) > ## Ternary Search Tree (TST) * TST node structure * key: 儲存的字元 * refcnt: 不太明白，只知是用於刪除時。 * TST node 會指向兩種 node * 該字串的下一個字元。 ex: ab (儲存字元 a 的 node 會指向儲存字元 b 的 node) * 前一個字元相同但該字元不同。 ex: ab, ae (儲存字元 a 的 node 會先指向儲存字元 b 的 node，而儲存字元 b 的 node 在指向儲存字元 e 的 node，因為它們前一個字元相同) * 一個 node 最多只有 3 children：大於目前字元 (hikid)、小於目前字元 (lokid)、等於目前字元 (eqkid) ```C= /** ternary search tree node. */ typedef struct tst_node { char key; /* char key for node (null for node with string) */ unsigned refcnt; /* refcnt tracks occurrence of word (for delete) */ struct tst_node *lokid; /* ternary low child pointer */ struct tst_node *eqkid; /* ternary equal child pointer */ struct tst_node *hikid; /* ternary high child pointer */ } tst_node; ``` * `tst_search_prefix` * 這個函式的目的為只需給定前幾個的字元即可搜尋到相關的所有字串。 * 在 `tst.c` 第 354 行有這段程式碼，目的為計算字串長度。 ```C /* get length of s */ for (; *start; start++) ; /* wait */ const size_t nchr = start - s; ``` * 我在這改成使用 `strlen()` 也能得到一樣的結果。 ```C /* get length of s */ const size_t nchr = strlen(s); ``` ## Ternary Search Tree 測試程式執行以下 commands ```shell $make $./test_cpy --bench ``` 會由 `bench.c` 裡的函式 `bench_test` 來測試 TST，並將搜尋的時間記錄在 `bench_cpy.txt`。若要將搜尋時間繪製成圖表，執行以下 commands ```shell $gnuplot scripts/runtime3.gp $eog runtime3.png ``` ### Bugs #### 時間單位問題 `bench.c` 第 53 行輸出到 `out_file` (執行 `$./test_cpy --bench` `out_file=bench_cpy.txt`) 的時間是 (t2 - t1) * 1000000 而 t1, t2 都是 nano 等級的，所以應該是 micro second 而非 second。 ```C fprintf(fp, "%d %f sec\n", idx, (t2 - t1) * 1000000); ``` #### `bench_test`問題使用 `bench_test` 得出的搜尋時間並繪製成圖表後發現搜尋時間幾乎都是 0。 ![](https://i.imgur.com/7wYWcas.png) 將搜尋後得到的字串一起輸出到 `out_file` 看字串搜尋的結果。 ```C fprintf(fp, "%d %f msec %s\n", idx, (t2 - t1) * 1000000, *sgl); ``` 結果都是 null。 ``` 0 0.953674 msec (null) 1 0.476837 msec (null) 2 0.238419 msec (null) 3 0.476837 msec (null) 4 0.238419 msec (null) 5 0.000000 msec (null) 6 0.238419 msec (null) 7 0.238419 msec (null) 8 0.238419 msec (null) 9 0.238419 msec (null) ... ``` 原因是出在 `bench_test` 內所給定的變數 `prefix` 大小只有 3，而後 `strncpy` 直接複製了 `word` 的前三字元給 `prefix`，並沒有多餘的空間給 `\0`。 ```C char prefix[3] = ""; ... strncpy(prefix, word, 3); ``` 沒有給 null character 會導致後面 `tst_search_prefix` 在計算字串長度時發生錯誤，而不會得到剛好是 3 的字串長度。 ```C /* get length of s */ for (; *start; start++) ; /* wait */ const size_t nchr = start - s; ``` 將 `prefix` 的大小改為 4 即可解決問題。 ```C char prefix[4] = ""; ``` 再次測試及繪製圖表後，便可搜尋到字串且搜尋時間就不全部都是 0。 ``` 0 157.833099 msec Shaţrah, 1 405.788422 msec Chièvres, 2 192.880630 msec Buea, 3 34.093857 msec Air 4 159.263611 msec Argalastí, 5 110.149384 msec Mumbai, 6 51.259995 msec Indāpur, 7 187.158585 msec Mexborough, 8 58.889389 msec Citendejé, 9 84.161758 msec Disūq, 10 63.419342 msec Fedderingen, ... ``` ![](https://i.imgur.com/d95QSXC.png) ### strcpy segmentation fault `test_cpy.c` 及 `test_ref.c` 中以下程式碼，可能產生 segmentation fault。 ```clike if (argc > 1 && strcmp(argv[1], "--bench") == 0) // a for auto strcpy(word, argv[2]); ... case 'a': ... if (argc > 1 && strcmp(argv[1], "--bench") == 0) strcpy(word, argv[3]); ``` 會發生 segmentation fault 的原因是出自 `argc` 的判斷，即便 `argc > 1` 也並不一定存在 `argv[2] argv[3]`。將 `argc` 的判斷改成下面的程式碼即可。 ```clike if (argc > 2 && strcmp(argv[1], "--bench") == 0) // a for auto strcpy(word, argv[2]); ... case 'a': ... if (argc > 3 && strcmp(argv[1], "--bench") == 0) strcpy(word, argv[3]); ``` ## gnuplot ```shell $./test_cpy --bench $./test_ref --bench ``` 執行這兩個指令會用 `cities.txt` 每個字串的前三字原來作為搜尋字串，總共搜尋 247614 個字串。在`makefile` 中加入 gnuplot 的功能 ```clike bench: $(TESTS) @for test in $(TESTS); do\ ./$$test --bench; \ done plot: bench gnuplot scripts/runtime3.gp eog runtime3.png ``` 在 terminal 輸入以下指令便可產生圖表記錄 cpy 及 ref 的搜尋時間。 ```shell $make plot ``` ![](https://i.imgur.com/pH8S8dB.png) :::info ref 是以 Bloom Filter 的方式實作，而 Bloom Filter 主要的貢獻是讓搜尋的時間複雜度變成 $O(1)$ 但需要額外的 table 來記錄。 cpy 是以 ternary search tree 的方式實作，而其主要貢獻是降低儲存空間但會增加搜尋時間。然而，由上面的圖表會發現 ref 的搜尋時間會較 cpy 的搜尋時間慢，是因為資料還不夠多筆嗎？ ::: :::success 因為這個測試都是記錄 `tst_search_prefix` 的搜尋時間，而沒有記錄 bloom filter 的搜尋時間。 ::: ## perf 分別對 `test_cpy.c` 及 `test_ref.c` 做 `bench_test` 並分析其效能。 ```shell Performance counter stats for './test_cpy --bench' (10 runs): 3,3808,1127 cache-misses # 45.692 % of all cache refs ( +- 0.67% ) 7,3991,6571 cache-references ( +- 0.25% ) 471,2391,2619 instructions # 0.80 insn per cycle ( +- 0.00% ) 592,4065,7521 cycles ( +- 0.38% ) 17.596857160 seconds time elapsed ( +- 0.43% ) ``` ```shell Performance counter stats for './test_ref --bench' (10 runs): 3,8419,5293 cache-misses # 46.074 % of all cache refs ( +- 0.47% ) 8,3386,6508 cache-references ( +- 0.27% ) 471,9135,3298 instructions # 0.70 insn per cycle ( +- 0.00% ) 671,5535,0460 cycles ( +- 0.22% ) 19.893304368 seconds time elapsed ( +- 0.27% ) ``` 結果發現 bloom filter 會有較多的 cache miss。 ### `tst_search` v.s `bloom_test` 我在 `test_ref.c` 的 f case 加入下面這段程式碼，產生 `ref.txt` 及 `cpy.txt`，目的分別為記錄 `bloom_test` 及 `tst_search` 找到該字串的時間。 ```clike= case 'f': ... t1 = tvgetf(); if (bloom_test(bloom, word) == 1) { t2 = tvgetf(); FILE *output; output = fopen("ref.txt", "a"); if (output != NULL) { fprintf(output, "%.6f\n", (t2 - t1) * 1000000); fclose(output); } else printf("open file error\n"); ... t1 = tvgetf(); res = tst_search(root, word); t2 = tvgetf(); output = fopen("cpy.txt", "a"); if (output != NULL) { fprintf(output, "%.6f\n", (t2 - t1) * 1000000); fclose(output); } else printf("open file error\n"); ... ``` 並在 `makefile` 加入這段指令 ```clike TEST_FIND = f Taiwan ... test_find: $(TESTS) echo 3 | sudo tee /proc/sys/vm/drop_caches; perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_ref --bench $(TEST_FIND) test_plot: gnuplot scripts/runtimeFind.gp eog runtimeFind.png ``` 在 terminal 輸入以下指令後繪製圖表。 ```shell make test_find make test_plot ``` ![](https://i.imgur.com/FC3CND7.png) 可以發現 bloom filter 的搜尋速度會比 TST 的搜尋速度快。 :::info 前一個的圖表是用 `bench_test`，而 `bench_test` 是計算 `tst_search_prefix` 的搜尋時間，所以理應兩個的時間都差不多，因為都是用 `cities.txt` 的資料做測試。這次的測試才是真正的對 bloom filter 及 tst 的搜尋速度做比對。那前者的時間差別應該是沒有意義的才多囉？ ::: ## Questions ### `tst_free` & `tst_free_all` `tst_free` 就是 recursive 的釋放 node 的記憶體位址，當 `p` 有記錄字元的時候代表會有 child `eqkid` 存在。 ```clike= /** free the ternary search tree rooted at p, data storage external. */ void tst_free(tst_node *p) { if (!p) return; tst_free(p->lokid); if (p->key) tst_free(p->eqkid); tst_free(p->hikid); free(p); } ``` `tst_free_all` 也是一樣的道理，但在第 10-11 行，我不懂為什麼 key 是 null 的時候卻要做 `free(p->eqkid)`。 ```clike= /** free the ternary search tree rooted at p, data storage internal. */ void tst_free_all(tst_node *p) { if (!p) return; tst_free_all(p->lokid); if (p->key) tst_free_all(p->eqkid); tst_free_all(p->hikid); if (!p->key) free(p->eqkid); free(p); } ``` 在 `test_ref.c` 使用 `tst_free` 來釋放記憶體空間，若用 `tst_free_all` 的話會導致 segmentation fault，而 `test_cpy.c` 使用的是 `tst_free_all`。目前還不懂為什麼 `test_ref.c` 不能使用 `tst_free_all`，`test_ref.c` 及 `test_cpy.c` 記錄 node 的方式有何不同。 ### delete word from the tree 執行 `./test_cpy` 後，`cities.txt` 的字串都會被載入並存在 tst 中，但若要對這些字串做刪除的動作的話會有問題。 ```shell $./test_cpy ternary_tree, loaded 259112 words in 0.122297 sec ... choice: f find word in tree: Taiwan found Taiwan in 0.000001 sec. ... choice: d enter word to del: Taiwan deleting Taiwan Taiwan (refcnt: 19) not removed. delete failed. choice: a enter word to add: aid aid - inserted in 0.000006198883 sec. (259113 words in tree) ... choice: d enter word to del: aid deleting aid deleted aid in 0.000004 sec ``` Taiwan 是 `cities.txt` 裡的資料，所以一開始就可以找到，但卻無法刪除，而我額外新增的字串 aid 就可以被刪除。