# E02: dict
## ternary search tree 程式碼觀察
* 取得 [dict](https://github.com/sysprog21/dict) 程式碼,編譯並測試,執行`./test_ref`:
```
ternary_tree, loaded 259112 words in 0.178311 sec
Commands:
a add word to the tree
f find word in tree
s search words matching prefix
d delete word from the tree
q quit, freeing all data
choice:
```
輸入`s`之後,再輸入`Tai`:
```
find words matching prefix (at least 1 char): Tai
Tai - searched prefix in 0.000417 sec
suggest[0] : Tai’an,
suggest[1] : Tai,
suggest[2] : Taibon
suggest[3] : Taicheng,
suggest[4] : Taichung,
suggest[5] : Taiden,
suggest[6] : Taigum,
suggest[7] : Taikang,
suggest[8] : Tailai,
suggest[9] : Taillades,
suggest[10] : Taillan-Médoc,
suggest[11] : Taillecourt,
suggest[12] : Tain,
suggest[13] : Tainan,
suggest[14] : Taino,
suggest[15] : Tainter
suggest[16] : Taintrux,
suggest[17] : Taiobeiras,
suggest[18] : Taipa,
suggest[19] : Taipalsaari,
suggest[20] : Taipei,
suggest[21] : Taiping,
suggest[22] : Taipu,
suggest[23] : Tairan
suggest[24] : Tairua,
suggest[25] : Taisen-ri,
suggest[26] : Taissy,
suggest[27] : Taitung
suggest[28] : Taivalkoski,
suggest[29] : Taivassalo,
suggest[30] : Taiwala,
suggest[31] : Taiwan
suggest[32] : Taixing,
suggest[33] : Taiynsha,
suggest[34] : Taiyuan,
suggest[35] : Taizhou,
```
* 使用Makefile中的`make test`:
```
perf stat --repeat 100 \
-e cache-misses,cache-references,instructions,cycles \
./test_cpy --bench $(TEST_DATA)
perf stat --repeat 100 \
-e cache-misses,cache-references,instructions,cycles \
./test_ref --bench $(TEST_DATA)
```
幫你針對test_cpy與test_ref兩種建立節點的方式進行100次的bench測試, 並統計cache-misses, cache-references, instructions以及cycles的結果。 bench的內容為從字典載入所有字串, 然後進行`search words matching prefix` 。 結果如下:
* test_cpy:
```
Performance counter stats for './test_cpy --bench s Tai' (100 runs):
5,223,267 cache-misses # 53.491 % of all cache refs ( +- 0.33% )
9,764,729 cache-references ( +- 0.27% )
535,030,756 instructions # 1.02 insn per cycle ( +- 0.01% )
525,324,872 cycles ( +- 0.62% )
0.160240267 seconds time elapsed ( +- 0.71% )
```
* test_ref:
```
Performance counter stats for './test_ref --bench s Tai' (100 runs):
6,084,546 cache-misses # 53.151 % of all cache refs ( +- 0.45% )
11,447,693 cache-references ( +- 0.35% )
595,669,012 instructions # 1.03 insn per cycle ( +- 0.00% )
575,768,356 cycles ( +- 0.51% )
0.175606162 seconds time elapsed ( +- 0.61% )
```
原本的作法在進行`search words matching prefix` 之後會發現許多城市或國家名稱後多出`,`的符號,且輸入`f`後,搜尋`Taipei`會失敗。
```
choice: f
find word in tree: Taipei
Taipei not found.
```
原因在於`test_ref.c`或`test_cpy.c`在使用stdin時,調用了`fscanf`如下,以`test_ref.c`為例:
```clike
fscanf(fp, "%s", Top)) != EOF
```
將`fp`所指向的字典檔案的內容,以`%s`格式讀取到`Top`所指向的位置,而字典檔案的內容為:
```
Shanghai, China
Buenos Aires, Argentina
Mumbai, India
Mexico City, Distrito Federal, Mexico
Karachi, Pakistan
...
```
因此會輸入`Shanghai,`、`China`、`Buenos Aires,`...的字串,再交給TST處理。
這裡將`test_ref.c`在讀取檔案的部分稍作修改:
```clike
while ((rtn = fscanf(fp, "%s", Top)) != EOF) {
unsigned int len = strlen(Top);
if(Top[len-1] == ',') {
Top[len-1] = '\0';
len--;
}
char *p = Top;
/* insert reference to each string */
if (!tst_ins_del(&root, &p, INS, REF)) { /* fail to insert */
fprintf(stderr, "error: memory exhausted, tst_insert.\n");
fclose(fp);
return 1;
} else { /* update bloom filter */
bloom_add(bloom, Top);
}
idx++;
Top += (len + 1);
}
```
編譯後執行./test_ref得到:
```
find words matching prefix (at least 1 char): Tai
Tai - searched prefix in 0.000423 sec
suggest[0] : Tai’an
suggest[1] : Tai
suggest[2] : Taibon
suggest[3] : Taicheng
suggest[4] : Taichung
suggest[5] : Taiden
suggest[6] : Taigum
suggest[7] : Taikang
suggest[8] : Tailai
suggest[9] : Taillades
suggest[10] : Taillan-Médoc
suggest[11] : Taillecourt
suggest[12] : Tain
suggest[13] : Tainan
suggest[14] : Taino
suggest[15] : Tainter
suggest[16] : Taintrux
suggest[17] : Taiobeiras
suggest[18] : Taipa
suggest[19] : Taipalsaari
suggest[20] : Taipei
suggest[21] : Taiping
suggest[22] : Taipu
suggest[23] : Tairan
suggest[24] : Tairua
suggest[25] : Taisen-ri
suggest[26] : Taissy
suggest[27] : Taitung
suggest[28] : Taivalkoski
suggest[29] : Taivassalo
suggest[30] : Taiwala
suggest[31] : Taiwan
suggest[32] : Taixing
suggest[33] : Taiynsha
suggest[34] : Taiyuan
suggest[35] : Taizhou
```
`,`符號未被插入TST且搜尋`Taipei`可以得到結果:
```
choice: f
find word in tree: Taipei
Bloomfilter found Taipei in 0.000002 sec.
Probability of false positives:0.009693
----------
Tree found Taipei in 0.000002 sec.
```
由於目前只修改test_ref,針對這部份的bench結果如下:
```
Performance counter stats for './test_ref --bench s Tai' (100 runs):
5,133,110 cache-misses # 49.970 % of all cache refs ( +- 0.31% )
10,272,480 cache-references ( +- 0.39% )
557,437,682 instructions # 1.01 insn per cycle ( +- 0.00% )
551,481,633 cycles ( +- 0.11% )
0.159990496 seconds time elapsed ( +- 0.13% )
```
## 字典輸入順序:
考慮到TST插入字串的順序會影響樹狀圖的結構,如下圖:
- [ ] 用 `bb -> bc -> bd` 的順序插入

- [ ] 用 `bc -> bd -> bb` 的順序插入

以上述兩種情況將字串插入TST,發現整體樹狀圖高度不同,因此實作計算樹狀圖高度函數:
```clike=
int tst_max_depth(tst_node *p, int len)
{
if(!p || !p->key)
return 0;
len++;
//printf("%c %d\n", p->key, len);
int lo = tst_max_depth(p->lokid, len);
int eq = tst_max_depth(p->eqkid, len);
int hi = tst_max_depth(p->hikid, len);
return MAX(MAX(len, lo), MAX(eq, hi));
}
```
計算出原本cities.txt的排列下,TST的高度為**52**:
```
ternary_tree, loaded 259112 words in 0.406383 sec
Commands:
a add word to the tree
f find word in tree
s search words matching prefix
d delete word from the tree
q quit, freeing all data
h calculate depth
choice: h
Depth:52
```
將cities.txt的內容以alphabetical order重新排列,再逐一插入TST,發現字典以alphabetical order排列下,樹狀圖高度增加為**55**:
```
Commands:
a add word to the tree
f find word in tree
s search words matching prefix
d delete word from the tree
q quit, freeing all data
h calculate depth
choice: h
Depth:55
```
## 新增繪圖功能
to be continue