E02: dict - HackMD

# E02: dict ## ternary search tree 程式碼觀察 * 取得 [dict](https://github.com/sysprog21/dict) 程式碼，編譯並測試，執行`./test_ref`: ``` ternary_tree, loaded 259112 words in 0.178311 sec Commands: a add word to the tree f find word in tree s search words matching prefix d delete word from the tree q quit, freeing all data choice: ``` 輸入`s`之後，再輸入`Tai`： ``` find words matching prefix (at least 1 char): Tai Tai - searched prefix in 0.000417 sec suggest[0] : Tai’an, suggest[1] : Tai, suggest[2] : Taibon suggest[3] : Taicheng, suggest[4] : Taichung, suggest[5] : Taiden, suggest[6] : Taigum, suggest[7] : Taikang, suggest[8] : Tailai, suggest[9] : Taillades, suggest[10] : Taillan-Médoc, suggest[11] : Taillecourt, suggest[12] : Tain, suggest[13] : Tainan, suggest[14] : Taino, suggest[15] : Tainter suggest[16] : Taintrux, suggest[17] : Taiobeiras, suggest[18] : Taipa, suggest[19] : Taipalsaari, suggest[20] : Taipei, suggest[21] : Taiping, suggest[22] : Taipu, suggest[23] : Tairan suggest[24] : Tairua, suggest[25] : Taisen-ri, suggest[26] : Taissy, suggest[27] : Taitung suggest[28] : Taivalkoski, suggest[29] : Taivassalo, suggest[30] : Taiwala, suggest[31] : Taiwan suggest[32] : Taixing, suggest[33] : Taiynsha, suggest[34] : Taiyuan, suggest[35] : Taizhou, ``` * 使用Makefile中的`make test`: ``` perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_cpy --bench $(TEST_DATA) perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_ref --bench $(TEST_DATA) ``` 幫你針對test_cpy與test_ref兩種建立節點的方式進行100次的bench測試，並統計cache-misses， cache-references, instructions以及cycles的結果。 bench的內容為從字典載入所有字串，然後進行`search words matching prefix` 。結果如下： * test_cpy: ``` Performance counter stats for './test_cpy --bench s Tai' (100 runs): 5,223,267 cache-misses # 53.491 % of all cache refs ( +- 0.33% ) 9,764,729 cache-references ( +- 0.27% ) 535,030,756 instructions # 1.02 insn per cycle ( +- 0.01% ) 525,324,872 cycles ( +- 0.62% ) 0.160240267 seconds time elapsed ( +- 0.71% ) ``` * test_ref: ``` Performance counter stats for './test_ref --bench s Tai' (100 runs): 6,084,546 cache-misses # 53.151 % of all cache refs ( +- 0.45% ) 11,447,693 cache-references ( +- 0.35% ) 595,669,012 instructions # 1.03 insn per cycle ( +- 0.00% ) 575,768,356 cycles ( +- 0.51% ) 0.175606162 seconds time elapsed ( +- 0.61% ) ``` 原本的作法在進行`search words matching prefix` 之後會發現許多城市或國家名稱後多出`,`的符號，且輸入`f`後，搜尋`Taipei`會失敗。 ``` choice: f find word in tree: Taipei Taipei not found. ``` 原因在於`test_ref.c`或`test_cpy.c`在使用stdin時，調用了`fscanf`如下，以`test_ref.c`為例: ```clike fscanf(fp, "%s", Top)) != EOF ``` 將`fp`所指向的字典檔案的內容，以`%s`格式讀取到`Top`所指向的位置，而字典檔案的內容為: ``` Shanghai, China Buenos Aires, Argentina Mumbai, India Mexico City, Distrito Federal, Mexico Karachi, Pakistan ... ``` 因此會輸入`Shanghai,`、`China`、`Buenos Aires,`...的字串，再交給TST處理。這裡將`test_ref.c`在讀取檔案的部分稍作修改: ```clike while ((rtn = fscanf(fp, "%s", Top)) != EOF) { unsigned int len = strlen(Top); if(Top[len-1] == ',') { Top[len-1] = '\0'; len--; } char *p = Top; /* insert reference to each string */ if (!tst_ins_del(&root, &p, INS, REF)) { /* fail to insert */ fprintf(stderr, "error: memory exhausted, tst_insert.\n"); fclose(fp); return 1; } else { /* update bloom filter */ bloom_add(bloom, Top); } idx++; Top += (len + 1); } ``` 編譯後執行./test_ref得到: ``` find words matching prefix (at least 1 char): Tai Tai - searched prefix in 0.000423 sec suggest[0] : Tai’an suggest[1] : Tai suggest[2] : Taibon suggest[3] : Taicheng suggest[4] : Taichung suggest[5] : Taiden suggest[6] : Taigum suggest[7] : Taikang suggest[8] : Tailai suggest[9] : Taillades suggest[10] : Taillan-Médoc suggest[11] : Taillecourt suggest[12] : Tain suggest[13] : Tainan suggest[14] : Taino suggest[15] : Tainter suggest[16] : Taintrux suggest[17] : Taiobeiras suggest[18] : Taipa suggest[19] : Taipalsaari suggest[20] : Taipei suggest[21] : Taiping suggest[22] : Taipu suggest[23] : Tairan suggest[24] : Tairua suggest[25] : Taisen-ri suggest[26] : Taissy suggest[27] : Taitung suggest[28] : Taivalkoski suggest[29] : Taivassalo suggest[30] : Taiwala suggest[31] : Taiwan suggest[32] : Taixing suggest[33] : Taiynsha suggest[34] : Taiyuan suggest[35] : Taizhou ``` `,`符號未被插入TST且搜尋`Taipei`可以得到結果: ``` choice: f find word in tree: Taipei Bloomfilter found Taipei in 0.000002 sec. Probability of false positives:0.009693 ---------- Tree found Taipei in 0.000002 sec. ``` 由於目前只修改test_ref，針對這部份的bench結果如下： ``` Performance counter stats for './test_ref --bench s Tai' (100 runs): 5,133,110 cache-misses # 49.970 % of all cache refs ( +- 0.31% ) 10,272,480 cache-references ( +- 0.39% ) 557,437,682 instructions # 1.01 insn per cycle ( +- 0.00% ) 551,481,633 cycles ( +- 0.11% ) 0.159990496 seconds time elapsed ( +- 0.13% ) ``` ## 字典輸入順序: 考慮到TST插入字串的順序會影響樹狀圖的結構，如下圖: - [ ] 用 `bb -> bc -> bd` 的順序插入 ![](https://i.imgur.com/qqfaSkv.png) - [ ] 用 `bc -> bd -> bb` 的順序插入 ![](https://i.imgur.com/B4YZLvh.png) 以上述兩種情況將字串插入TST，發現整體樹狀圖高度不同，因此實作計算樹狀圖高度函數: ```clike= int tst_max_depth(tst_node *p, int len) { if(!p || !p->key) return 0; len++; //printf("%c %d\n", p->key, len); int lo = tst_max_depth(p->lokid, len); int eq = tst_max_depth(p->eqkid, len); int hi = tst_max_depth(p->hikid, len); return MAX(MAX(len, lo), MAX(eq, hi)); } ``` 計算出原本cities.txt的排列下，TST的高度為**52**: ``` ternary_tree, loaded 259112 words in 0.406383 sec Commands: a add word to the tree f find word in tree s search words matching prefix d delete word from the tree q quit, freeing all data h calculate depth choice: h Depth:52 ``` 將cities.txt的內容以alphabetical order重新排列，再逐一插入TST，發現字典以alphabetical order排列下，樹狀圖高度增加為**55**: ``` Commands: a add word to the tree f find word in tree s search words matching prefix d delete word from the tree q quit, freeing all data h calculate depth choice: h Depth:55 ``` ## 新增繪圖功能 to be continue