2017q3 Homework2 (prefix-search)

contributed by <Lihikoka>
GitHub

Reviewed by `ZixinYang`

讀入資料時, 把最後一個 character 改成 0, 不只會改掉城市的逗號, 也會把國家的最後一個 character 改掉, 所以應先判斷逗號是否存在。
缺少程式執行效率的分析。
直接將 freeword 改為 0, 對 CPY 版本來說, malloc 記憶體空間之後就不再釋放記憶體了。不應在這裡限制 CPY 版本的操作。
chunking (Locality)

Ternary Search Tree 原理

Trie (trie 的發明者 Edward Fredkin 把它讀作 [tri]，但是其他作者把它讀作 [traɪ]) 是一種特殊的 tree，可以用在前綴搜尋。Ternary search tree 是 trie 的一種特例，trie 可以在每個節點有任意個子節點，但是 TST 只允許每個節點有三個子節點，且三個子節點分別為：lo kid、equal kid、hi kid

equal kid 所含的字元為 word 本身的字元
lo kid 所含的字元為比 equal kid 小 (ascii code) 的字元
hi kid 所含的字元為比 equal kid 大 (ascii code) 的字元

用 Wikipedia 的例子來解說 (此 TST 包含的 word 有 as、at、cup、cute、he、i 和 us)：

      c
    / | \
   a  u  h
   |  |  | \
   t  t  e  u
 /  / |   / |
s  p  e  i  s

Insertion

為了 insert cute, 先建立根節點 c ，然後依序將 u、t、e 插入 equal kid 即完成
為了 insert at，先把 at 的第一個字元 a 和根節點 c 作比較，發現不一樣且 a 比 c 小，所以在根節點的 lo kid 插入 a ；接著在節點 a 的 equal kid 插入 t
為了 insert as ，先把 as 的第一個字元 a 和「根節點 c 、跟節點的 lo kid 」作比較，發現跟 lo kid 一樣，因此繼續往下找；將 as 的第二個字元 s 和「 a 節點的 equal kid 」作比較，發現不一樣，又 a 節點的 lo kid 為空，因此將 s 插入 a 的 lo kid
依此類推…

Search

為了搜尋 cute ，先將根節點和 cute 的第一個字元作比較，發現一樣，因此沿著 equal kid 走，將 cute 的第二個字元 u 和剛剛的 equal kid 作比較，發現又一樣，重複上述步驟，發現 cute 的每個字元都走得完，因此找到了 cute。

Auto Complete or Prefix Search

假設輸入的 prefix 為 "cu" ，根據上面的 TST ，走完節點 c 、u 後有兩種選擇： cup 和 cute ，因此可將符合 prefix 的 word 列出來，達成 auto complete ，此技術可以用在搜尋引擎、輸入法補字的功能上。

特性分析 TST v.s. Hashing

Hashing

Need to examine the entire key ( because that is the way the hash function works )
Search hits and misses cost the same
The running time and performance relues heavily on the hash function
Does not support as much as operations than TST ( sorting )

Ternary Search Tree

Works only for strings
Only examines just enough key characters
Search miss may only involve a few characters
Support more operations ( sorting )
Faster than hashing ( for missis especially ) and flexible than binary search tree

開發環境

Linux 4.8.0-53-generic
Linux Mint 18.2 Cinnamon 64-bit
Model name:            Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K

探索程式

首先執行 test_cpy 程式，依照作業說明輸入：

choice: s   
find words matching prefix (at least 1 char): Tain
  Tain - searched prefix in 0.000005 sec

suggest[0] : Tain,
suggest[1] : Tainan,
suggest[2] : Taino,
suggest[3] : Tainter
suggest[4] : Taintrux,

可以發現有幾個單字後面帶有 , ，原因是讀檔的時候是讀整個字串

while ((rtn = fscanf(fp, "%s", word)) != EOF)

這樣會造成之後執行 f (find) 指令的時候的麻煩

choice: f
find word in tree: Tainan
  Tainan not found.
  
choice: f
find word in tree: Tainan,
  found Tainan, in 0.000002 sec.

為了實驗的方便性，決定修正這個問題，在 while 迴圈內加入

size_t len = strlen(word);
word[len - 1] = 0;

實驗結果

choice: s
find words matching prefix (at least 1 char): Tain
  Tain - searched prefix in 0.000005 sec

suggest[0] : Tain
suggest[1] : Tainan
suggest[2] : Taino
suggest[3] : Tainte
suggest[4] : Taintrux

同學你好，這樣好像也會把國家的最後一個字也刪掉，變成國家找不到。
yuan922

此次作業的 TST 結構

Youtube: Ternary search tree introduction 影片中提到插入一個單字時也會插入一個 value 至最後一個字元，例如 puts('cat', 728) 即代表會將 728 插入到 't' 對應到的節點的某個欄位上，在此次作業並不是採用插入 value 的方式，而是採用「將整個單字的字串插入至最後一個單字的 eqkid 的 eqkid ，並且不管其原本是 tst_node 的資料型態，檢查的時候直接 cast 成 char* 」，tree 的結構圖如 github 的 README.md 所示：

 o
                              |-c
                              |-0
                  ------------+------------
                  |lokid      |eqkid      |hikid
                  o           o           o
                              |-a
                              |-0
                          ----+----
                          |   |   |    note: any of the lokid or hikid nodes
                              o              can also have pointers to nodes
                              |-t            for words that "cat" or "ca" is
                              |-0            a partial prefix to.
                          ----+----
                          |   |   |
                              o
                              |-0
                              |-1    <== the refcnt is only relevant to the final node
                          ----+----
                          |   |   |
                        NULL  o  NULL
                            "cat"

第一個被插入的 word 為「 Shanghai 」，可用 gdb 確認上述結構：

(gdb) p *(root)
$16 = {key = 83 'S', refcnt = 1, lokid = 0x605810, eqkid = 0x605690, hikid = 0x606890}
(gdb) p *(root->eqkid)
$17 = {key = 104 'h', refcnt = 1, lokid = 0x607280, eqkid = 0x6056c0, hikid = 0x607370}
(gdb) p *(root->eqkid->eqkid)
$18 = {key = 97 'a', refcnt = 1, lokid = 0x61f3a0, eqkid = 0x6056f0, hikid = 0x60d2b0}
(gdb) p *(root->eqkid->eqkid->eqkid)
$19 = {key = 110 'n', refcnt = 1, lokid = 0x64a420, eqkid = 0x605720, hikid = 0x639950}
(gdb) p *(root->eqkid->eqkid->eqkid->eqkid)
$20 = {key = 103 'g', refcnt = 1, lokid = 0x780290, eqkid = 0x605750, hikid = 0x61ce50}
(gdb) p *(root->eqkid->eqkid->eqkid->eqkid->eqkid)
$21 = {key = 104 'h', refcnt = 1, lokid = 0xdade90, eqkid = 0x605780, hikid = 0x62f5d0}
(gdb) p *(root->eqkid->eqkid->eqkid->eqkid->eqkid->eqkid)
$22 = {key = 97 'a', refcnt = 1, lokid = 0x0, eqkid = 0x6057b0, hikid = 0x0}
(gdb) p *(root->eqkid->eqkid->eqkid->eqkid->eqkid->eqkid->eqkid)
$23 = {key = 105 'i', refcnt = 1, lokid = 0x0, eqkid = 0x6057e0, hikid = 0x0}
(gdb) p *(root->eqkid->eqkid->eqkid->eqkid->eqkid->eqkid->eqkid->eqkid)
$24 = {key = 0 '\000', refcnt = 1, lokid = 0x0, eqkid = 0x7ffff30ce020, hikid = 0x0}
(gdb) p (char *) (root->eqkid->eqkid->eqkid->eqkid->eqkid->eqkid->eqkid->eqkid->eqkid)
$25 = 0x7ffff30ce020 "Shanghai"

練習透過 GDB scripts，開發出走訪給定 TST 裡頭的元素的自動化方法
參見 What are the best ways to automate a GDB debugging session? 和簡易 GDB Script 教學
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
jserv

這麼做的好處在於可以確定搜尋到的是一個完整的 word

修正程式碼

尋找問題

首先將 test_ref.c 中註解要求修改之處之 CPY 改成 REF ，並觀察輸出，

Commands:
 a  add word to the tree
 f  find word in tree
 s  search words matching prefix
 d  delete word from the tree
 q  quit, freeing all data

choice: s
find words matching prefix (at least 1 char): Tain
  Tain - searched prefix in 0.000005 sec

suggest[0] : Tain
suggest[1] : Tain
suggest[2] : Tain
suggest[3] : Tain
suggest[4] : Tain
suggest[5] : Tain
suggest[6] : Tain
suggest[7] : Tain
suggest[8] : Tain
suggest[9] : Tain
suggest[10] : Tain
suggest[11] : Tain

可以發現 prefix searching 的過程有問題。

choice: f   
find word in tree: Tain
  found Tain in 0.000001 sec.

choice: f
find word in tree: Tainan
  found Tainan in 0.000001 sec.

可以發現 find 也有問題，

輸入 q 的話則會得到：

choice: q
*** Error in `./test_ref': free(): invalid pointer: 0x00007ffc3e917500 ***

第一行 Error in `./test_ref': free(): invalid pointer: 0x00007ffc3e917500 告訴我們進行 free() 操作時 pointer 無效，原因是 pointer 指到不可 free 的記憶體位址。

觀察 test_ref.c 的 tst_ins_del 函式，發現第 273 行










if (cpy) { /* allocate storage for 's' */
    const char *eqdata = strdup(*s);
    if (!eqdata)
        return NULL;
    curr->eqkid = (tst_node *) eqdata;
    return (void *) eqdata;
} else { /* save pointer to 's' (allocated elsewhere) */
    curr->eqkid = (tst_node *) *s;
    return (void *) *s;
}

發現如果是 cpy 輸入的是 1 (CPY) 的話，會執行 strdup ，如果是 0 (REF) 的話則直接將 curr->eqkid 指到 *s 指到的 address，但是 *s 指到的位址的內容是會變的 (隨著每次讀的 word 而改變) ，因此問題就是：

「 tree node 指到的記憶體位址是外部的記憶體位址 (void *tst_ins_del(tst_node **root, char *const *s, const int del, const int cpy) 裡的 *s ) ，因此每次搜尋的時候會輸出給定的搜尋字串」

但是並沒有分配記憶體給 curr->eqkid，解決辦法就是如同註解所說的在其他處宣告。最簡單的解決方式是在 test_ref.c 將 char *p = word 改成 char *p = strdup(word)。

但是 strdup 的實作過程 (連結為 apple 實作 strdup 的 source code) 中會使用 malloc ，因此用多了容易有記憶體破碎的問題，記憶體破碎會導致 cache miss 的機會變大 (spatial locality) ，因此決定引入 memory pool ，一次分配一大塊連續的記憶體。

使用連續記憶體 (memory pool)

需要描述你的「動機」和「考量點」
過程中又參考什麼文獻呢？ (至少列出一篇論文)
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
jserv

已補上動機、考量點、參考資料，參考資料為 Stack Overflow。Lihikoka

動機 & 理由

要宣告連續記憶體有多種方法可以實現，可以用 array of pointer 或 memory pool ，這邊選擇 memory pool ，因為每次要分配的記憶體大小都不一樣 (word array 的最大 size 雖然都相同，都是 WRDMAX ，但是實際要分配的記憶體大小的只有有效字元的部份，這部份小於或等於 WRDMAX ) ，因此在此次作業的 case 中採用 array of pointer 會浪費記憶體空間。

Memory pool 程式碼

參考 Stackoverflow 上別人寫的簡易 memory pool ：

























pool *pool_create(size_t size)
{
    pool *p = (pool *) malloc(size + sizeof(pool));
    p->next = (char *) &p[1];
    p->end = p->next + size;
    return p;
}

void pool_destroy(pool *p)
{
    free(p);
}

size_t pool_available(pool *p)
{
    return p->end - p->next;
}

void *pool_alloc(pool *p, size_t size)
{
    if (pool_available(p) < size) return NULL;
    void *mem = p->next;
    p->next += size;
    return mem;
}

將 p = strdup(word) 改寫成

p = (char *) pool_alloc(mpool, sizeof word);
strcpy(p, word);

Command q 的地方要改寫成

pool_destroy(mpool);
tst_free(root);

另外，觀察 tst.c 內 tst_ins_del function 內呼叫　tst_del_word 之處：

tst_del_word(root, curr, &stk, 1);

最後一個 parameter 1 代入 argument freeword，對應到 tst_del_word 的註解：

/** if 'freeword = 1' the copy of word allocated and
 *  stored as the node->eqkid is freed, if 'freeword = 0', node->eqkid is
 *  stored elsewhere and not freed
 */

若 freeword = 1 ，會 free 掉記憶體， freeword = 0 則不會，在 memory pool 的 case 比較適合此種方法，因為記憶體是配置在 memory pool 內，由於使用的 memory pool API 並沒有包含釋放記憶體 (此處的釋放指的是對 memory pool 而言為 available) ，所以 freeword 應該傳入 0 。

參考資料

Youtube: Ternary search tree introduction