2017q3 Homework2 (prefix-search)

contributed by <amikai>

tags: `sysprog2017`

Reviewed by `Yuessiah`

學習不是將原文記下來，而是該知識經過理解，且能用自己的話說出來，故沒有翻不出的句子，只有能不能理解的概念。建議理解了該知識再寫進報告中。
建議在有出現程式碼片段的地方都使用 Syntax highlighting，幫你補上惹。
若是質疑別人的想法，應該好好問清楚(可直接在 hackmd 中引言發問)。我有稍微回覆你對 strdup 的疑慮，參考一下。

Trie

Trie 是一種有序的樹狀資料結構, 通常被用來儲存 key 為字串的 dynamic set 或是 associative array. Tire 不像 BST, Trie 並不會將 key 儲存在節點中, 而是取決於它所在的位置.

Trie 這個詞來自於 retireval 裡的 trie. Robert Sedgewick 在他的線上課程裡說這個字念 try 並不是念 tree, 原因是需要跟 binary tree 做出區別.

property:

在 Trie 裡某個 node 所有的子代都會有共同的前綴字串, 也就是此節點所對應到的字串.
root 所對應到的字串為空字串
每個 node 不一定會有對應的 value, 而有 value 的值就為字串的字尾
只有在 leaf 節點或是某些你有興趣的節點會有 value

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

以此圖來看所記錄的 key 為 by, sea, sells, she, shells, the (property 2).

同學您好，請問你的圖片來源是哪來的qq？ David Kuo
Algorithm 4/e, Robert Sedgewick 這是書裡的內容 as23041248

search operation

如何知道是 search hit 還是 search miss

若你需要尋找 shells, 從樹根進行搜尋到最後 s 節點的時候發現有 value 對應則會 hit

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

search miss 有兩種情況:
case1: 若你需要尋找 shell, 從樹根往下進行搜尋到 l 節點的時候發現沒有 value 對應則為 miss

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

case2: 若你需要尋找 shore 發現下個節點沒有 o 則 search miss

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

insert operation

在 insert 也是相當的簡單, 要插入某個字串時, 只要對此字串進行 search operation:

case1. 若某個字串裡字元無法繼續往下搜尋, 則在此 node 插入一個 children node , 反覆到此字串中的字元插入完畢, 最後在將字串的結尾字元的 node 放入 value.

case2. 若搜尋此字串時, node 的字元皆有對應到字串, 則直接在此結尾字元的 node 放入 value 即可.

黑色為原來的 trie, 紅色為 insert 後所做的改變:

delete operation

step1. 先找到字串 (key) 相對應的 node 並將此 node 的 value 設定為 null.

setp2. 若此 node 的 link 皆為 null 則刪除此 node, 並且遞迴 case 2

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Ternary search tree

TST(ternary search tree) 是一種 trie (有時又叫做 prefix tree), TST 的 node 排放的方式類似於 BST(binary search tree), 不同的是 TST 會有三個子個.

TST 和其他的 prefix tree 一樣, TST 可以用在 associate map 並且支援 incremental string search, However. ternary search trees are more space efficient compared to standard prefix trees, at the cost of speed(還不知道怎麼翻). 一般的 TST 通常用來拼音檢查, 自動補全(auto-completion)

auto complete 在繁體中文一般翻譯為「自動補完」、「自動補齊」
incremental string search 可翻譯為「漸進式字串搜尋」
上段英文是說，相較於典型的 prefix tree，TST 對空間使用更節省，犧牲執行速度作為代價
"at the cost of" 是指「以…為代價」(高中英文程度)

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

jserv

在線上課程裡, Robert Sedgewick 提到如何用 Trie 來紀錄一些英文詞彙, 首先他提到了 26 way tries , 想法非常的直覺, 每個 node 就直接紀錄 26 個字母:

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

此 Trie 裡所記錄的 string 為: she, sea, sells

光看這個 trie 就知道所耗費掉的空間很可怕: 若有 n 個節點則耗費 26n 個 link, 而 null link 會高達

26 n - (n - 1) = 25 n + 1

. 備註: 一個節點有 26 個 link, 所以總共有 26n 個 link, 有用到的 link 為 n - 1 (1 為 root), 所以沒用到的就為

26 n - (n - 1)

而 Robert Sedgewick (他就是 TST 發明人之一) 使用 TST 巧妙地對空間做了大幅的改善, 但是沒有 26 個字母要怎麼記住所有的字串呢? TST 的 3 個 link 從左而右分別對應到分別對應到: 當前字串的字元小於, 等於, 大於節點裡的字

下圖為一個一般的 trie 和 TST 比較的範例:

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

search operation:

首先我們會將待搜尋的字串和 TST 的 root 做比較, 若相同則走 middle link, 小於走 left link, 大於走 right link, 如此這樣反覆的遞迴. 若字串搜尋到結尾了或是搜尋到 leaf 節點還沒搜尋到則為 search miss, 反之則為 search hit.

和給定的程式碼可匹配嗎？列出程式碼並依據上述分析做比較

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

jserv

insert operation

和 Tries 相同, 使用 TST search 之後, 做出相對應的操作即可.

delete operation

The delete() method for TSTs requires more work. Essentially, each char- acter in the key to be deleted belongs to a BST. In a trie, we could remove the link corresponding to a character by setting the corresponding entry in the array of links to null; in a TST, we have to use BST node deletion to remove the node corresponding to the character.

參考資料:

著手開發

首先看到程式定義了 tst_ins_del, 若 del 為 0 則此函數是插入, 若為非 0 此函數為刪除, 若 cpy 為 0 使用

void *tst_ins_del(tst_node **root,
                  char *const *s,
                  const int del,
                  const int cpy);

依據 Fixme 的指示將 CPY 都改成 REF, 並且搜尋 Yo, 得到不正確的輸出:

調整用語，「爆炸」的行為顯然與共筆前後文不一致
技術文件在於清楚闡述前因後果、你的思路、實驗，以及觀察推論
"jserv"

Commands:
 a  add word to the tree
 f  find word in tree
 s  search words matching prefix
 d  delete word from the tree
 q  quit, freeing all data

choice: s
find words matching prefix (at least 1 char): Yo
...
suggest[688] : Yo
suggest[689] : Yo
suggest[690] : Yo
suggest[691] : Yo
suggest[692] : Yo
suggest[693] : Yo
suggest[694] : Yo

這裡總共會有 694 Yo, 所以無法直接畫出整棵樹, 因為資料集太大, 所以我將資料集替換成以下, 並且先追蹤原本使用的 test_cpy:

aa
ab

經過 gdb 追蹤 test_cpy 之後, 樹會長成以下這樣 (? 為 '\0', # 為 NULL):

                            a
                       +----+----+
                       |    |    |
                       v    v    v
                       #    a    b
                            +    +
                            |    |
                            v    v
                            ?    ?
                            +    +
                            |    |
                            v    v
                            a    a

有趣的是, 為何 '\0' 下還有數字呢 ? 這或許就是老師的實做方式了, 左邊最底下 a 的位址, 其實就是存放 aa 字串的開頭的位址, 右邊最底下 a 的位址, 其實就是存放 ab 的位址, search 時只是抓到最底下的字串.

接下來我在繼續使用 test_ref , 發現 search a 時會出現:

suggest[0] : a
suggest[1] : a

用 gdb 觀察後發現建立完成的 trie 會跟 test_cpy 相同, 有趣的是左邊最底下的 a 和右邊最底下的 a 會參照到一模一樣的位址, 而且參照到的位置就是在 fgets 讀入使用這輸入字串時所用的 word 變數, 所以每次參照到的位址都一樣, 所以就會印出一樣的字串.

看到別人的作法都是使用 strdup 或是用 malloc (或是使用此函數使用 malloc ) 去避開一樣的位址, 但是這不就是原本 test_cpy 使用的方式嗎? 在這邊我還蠻質疑的, 但是我又無法提供更好的作法. 所以我就沒有對程式再做修改

test_cpy 的作法會使記憶體位置散亂在各個地方，對 cache spatial locality 不好，並不是說用 strdup 就不好，這是兩回事。
"Yuessiah"

在用 gdb 追 code 的時候發現根本不會經過 char *p = NULL; 這個敘述, 所以可以直接砍掉

依據給定的 coding style (4 個空格，而非 tab) 去排版在共筆中所有的程式碼。為了培養「打群架」的能力，我們要著重開發紀律和對細節的高度掌握
"jserv"

switch (*word) {
    char *p = NULL; // To be delete
    case 'a':
        printf("enter word to add: ");
        if (!fgets(word, sizeof word, stdin)) {
            fprintf(stderr, "error: insufficient input.\n");
            break;
        }
....

forward declartion

tst.h 裡寫了此行 typedef struct tst_node tst_node; 並將 tst_node 的實作隱藏在 tst.c 檔案裡, 當別的想要使用 tst_node 時會先 #inlude tst.h, 在使用 tst_node 時不可直接存取其內部屬性, 此特性可以用來封裝資訊 (encapsulation) , 達到資料隱藏的效果, 要存取內部的資料欄位時只能看 tst.c 是否有提供方法, 而 tst.c 裡提供了以下三種方法就是為了存取 tst_node 裡的資料:

char tst_get_key(const tst_node *node);
unsigned tst_get_refcnt(const tst_node *node);
char *tst_get_string(const tst_node *node);

另一個好處則是可以分離介面和實作, OOP 裡有一個很有名的 Design Principles: Program to an interface, not an implementation, 當介面和實作分離時, 實作可以抽換, 並且不會影響到其他程式. 這點對於大型的系統非常重要, 舉例: 當你有一個大系統使用到 tst_node 時, 現在 tst_node 刪除了一個欄位, 則必須將此系統所有的使用到此欄位的地方進行修改, 但是介面和實作分離之後, 只需要更動 tst.c 裡的內容

參考資料:

http://www.artima.com/lejava/articles/designprinciples.html

2017q3 Homework2 (prefix-search)

tags: sysprog2017

Reviewed by Yuessiah

Trie

search operation

insert operation

delete operation

Ternary search tree

search operation:

insert operation

delete operation

著手開發

forward declartion

tags: `sysprog2017`

Reviewed by `Yuessiah`