2017q1 Homework3 (prefix-search)

# 2017q1 Homework3 (prefix-search) contributed by <`river85511`> ### Reviewed by `ZixinYang` - 觀察程式碼的時候, 不要用語法來註記內容, 如 ( while 的部份, for 的部份... ), 應直接以該段程式碼的用意來註記。 - 程式解析清楚的, 但缺乏自己動手改寫的過程, 也沒有缺乏程式效能的分析。 - forward declaration 除了讓標頭檔的內容方便閱讀以外, 也可以減少一改寫 structure 的內容, 有 include 該個標頭檔的檔案都要重新編譯一次所需的時間。 --- ## 開發環境 ``` Lubuntu 16.04 4.10.0-37-generic Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 3072K ``` ## Ternary Search Tree (TST) ### TST - Introduction * TST * 是一種 [trie](https://en.wikipedia.org/wiki/Trie)（亦稱 prefix-tree) 的結構 * Keys 是一串字串，e.g. `ann`。 * 每個節點 (Node) 最多有三個子分支，以及一個 Key * `Key` 用來記錄 Keys 中的其中一個字元 (包括 `EndOfString`)。 * `low` ，用來指向比當前的 Node 的 Key 小的 Node。 * `equal`，用來指向與當前的 Node 的 Key 一樣大的 Node。 * `high` ，用來指向比當前的 Node 的 Key 大的 Node。 * 觀察在 `tst.c` 中 `tst_node` ``` C /** ternary search tree node. */ typedef struct tst_node { char key; unsigned refcnt; struct tst_node *lokid; struct tst_node *eqkid; struct tst_node *hikid; } tst_node; ``` ``` o |-key |-refcnt ------------+------------ |lokid |eqkid |hikid o o o ``` * 發現 * 跟 [Trie](https://www.cs.bu.edu/teaching/c/tree/trie/) 中介紹的結構以外，還多了 `refcnt` * `refcnt` 是用來記錄一個 Keys 出現的次數。 * 舉例來說，若把 `cat` 插入 TST 中，會是如此的結構： ``` o |-c |-0 ------------+------------ |lokid |eqkid |hikid o o o |-a |-0 ----+---- | | | note: any of the lokid or hikid nodes o can also have pointers to nodes |-t for words that "cat" or "ca" is |-0 a partial prefix to. ----+---- | | | o |-0 |-1 <== the refcnt is only relevant to the final node ----+---- | | | NULL o NULL "cat" ``` * 發現： * 在記錄 `cat` 中的 `t` 的 `tst_node` 之後多了一個記錄 `0` 的 `Key` 的 `tst_node` * 此 `key` 為 `0` 的 `tst_node` 用來表示 `cat` 這個字串的 `EndOfString`，以及記錄 `cat` 這個字串的出現的次數 (因此 `refcnt` 為 `1`)。 --- ## Ternary Search Tree Operation * 三種操作 (Operation) 為： * **搜尋** (Search) * **插入** (Insert) * **刪除** (Delete) ### TST - Search Operation * 兩種搜尋方式 * 找完整的字串 * 找字首 (prefix) * 「找完整的字串」在 `test_cpy.c` 中相對應的程式碼 ``` C case 'f': res = tst_search(root, word); if (res) printf(" found %s in %.6f sec.\n", (char *) res, t2 - t1); else printf(" %s not found.\n", word); break; ``` * 「找字首」在 `test_cpy.c` 中相對應的程式碼 ``` C case 's': res = tst_search_prefix(root, word, sgl, &sidx, LMAX); if (res) { printf(" %s - searched prefix in %.6f sec\n\n", word, t2 - t1); for (int i = 0; i < sidx; i++) printf("suggest[%d] : %s\n", i, sgl[i]); } else printf(" %s - not found\n", word); break; ``` ### TST - Insertion * 觀察在 `test_cpy.c` 中相對應的部份。 ``` C case 'a': res = tst_ins_del(&root, &p, INS, CPY); if (res) { idx++; printf(" %s - inserted in %.6f sec. (%d words in tree)\n", (char *) res, t2 - t1, idx); } else printf(" %s - already exists in list.\n", (char *) res); break; ``` ### TST - Deletion * 兩種刪除功能 * 刪除一個字串 * 刪除整個 TST * 「刪除一個字串」在 `test_cpy.c` 中相對應的程式碼 ``` C case 'd': res = tst_ins_del(&root, &p, DEL, CPY); if (res) printf(" delete failed.\n"); else { printf(" deleted %s in %.6f sec\n", word, t2 - t1); idx--; } break; ``` * 「刪除整個 TST 」在 `test_cpy.c` 中相對應的程式碼 ``` C case 'q': tst_free_all(root); return 0; break; ``` ## 觀察程式碼 * 從 `test_cpy.c` 可以發現 `TST` 的操作在 `tst.c` 中對應的函式為： * 搜尋字串 \ 字首：`tst_search()`、`tst_search_prefix()` * 插入、刪除字串：`tst_ins_del()` * 刪除整個 `TST`：`tst_free_all()` > TODO：觀察上述的函式程式碼 ### 搜尋字串 `tst_search()` * `tst.c` 中的程式碼 ``` c void *tst_search(const tst_node *p, const char *s) { const tst_node *curr = p; while (curr) { /* loop over each char in 's' */ int diff = *s - curr->key; /* calculate the difference */ if (diff == 0) { /* handle the equal case */ if (*s == 0) /* if *s = curr->key = nul-char, 's' found */ return (void *) curr->eqkid; /* return pointer to 's' */ s++; curr = curr->eqkid; } else if (diff < 0) /* handle the less than case */ curr = curr->lokid; else curr = curr->hikid; /* handle the greater than case */ } return NULL; } ``` * `while` 迴圈的內容解讀 * `tst_node` 類型的指標 `curr` 指向的 `tst_node` 中的 `key` 會和待搜索字串中的字元 `*s` 作比較 * 若是 `*s` > `key` 則將 `curr` 指向 `curr->hikid` 所指向的記憶體位置。 * 若是 `*s` < `key` 則將 `curr` 指向 `curr->lokid` 所指向的記憶體位置。 * 若是 `*s` = `key` * 且 `*s` 為字串中 `EOS (EndOfString)`，則表示找到待搜索的字串了，因此回傳 `curr->eqkid` 的位置。(因為 `curr->eqkid` 中紀錄著待搜索的字串，如上面提及的 `cat` 之例子。) * 若 `*s` 不是 `EOS`，則繼續在 `curr->eqkid` 中搜尋字串中的下一個字元(`s++`)。 * 重複以上步驟直到 `curr` 為 `NULL` 為止。 ### 搜尋字首 `tst_search_prefix()` * `tst.c` 中的程式碼 ``` C void *tst_search_prefix(const tst_node *root, const char *s, char **a, int *n, const int max) { const tst_node *curr = root; const char *start = s; if (!*s) return NULL; /* get length of s */ for (; *start; start++) ; /* wait */ const size_t nchr = start - s; start = s; /* reset start to s */ *n = 0; /* initialize n - 0 */ /* Loop while we haven't hit a NULL node or returned */ while (curr) { int diff = *s - curr->key; /* calculate the difference */ if (diff == 0) { /* handle the equal case */ /* check if prefix number of chars reached */ if ((size_t)(s - start) == nchr - 1) { /* call tst_suggest to fill a with pointer to matching words */ tst_suggest(curr, curr->key, nchr, a, n, max); return (void *) curr; } if (*s == 0) /* no matching prefix found in tree */ return (void *) curr->eqkid; s++; curr = curr->eqkid; } else if (diff < 0) /* handle the less than case */ curr = curr->lokid; else curr = curr->hikid; /* handle the greater than case */ } return NULL; } ``` * 可以被分成兩個部份來解讀 1. 紀錄字串長度的部份 2. `while` 迴圈的部份 * 先來看`1. 紀錄字串長度的部份` ```C /* get length of s */ for (; *start; start++) ; /* wait */ const size_t nchr = start - s; start = s; /* reset start to s */ *n = 0; /* initialize n - 0 */ ``` * 解讀： * 先透過 `for` 迴圈將 `start` 指向字串的最後一個字元 (`EOS`) * 用 `size_t` 型態的變數 `nchr` 來紀錄最後一個字元到第一個字元（也就是字串的全長） * 再來是 `2. while 迴圈的部份` ``` C while (curr) { int diff = *s - curr->key; /* calculate the difference */ if (diff == 0) { /* handle the equal case */ /* check if prefix number of chars reached */ if ((size_t)(s - start) == nchr - 1) { /* call tst_suggest to fill a with pointer to matching words */ tst_suggest(curr, curr->key, nchr, a, n, max); return (void *) curr; } if (*s == 0) /* no matching prefix found in tree */ return (void *) curr->eqkid; s++; curr = curr->eqkid; } else if (diff < 0) /* handle the less than case */ curr = curr->lokid; else curr = curr->hikid; /* handle the greater than case */ } ``` * 解讀 * 這個 `while` 迴圈和 `tst_search()` 中的 `while` 迴圈最大的差別在於：當搜尋到待搜尋字串的`EOS` 的前一個字元時，就表示找到相符合的字首了，接著會呼叫 `tst_suggest()` 來將符合字首要求的字串全部儲存在一個 `char` 型態的雙指標(`double pointer`) `a` 中。 ### 插入、刪除字串：`tst_ins_del()` * 在 `tst.c` 中的程式碼 ``` C void *tst_ins_del(tst_node **root, char *const *s, const int del, const int cpy) { int diff; const char *p = *s; tst_stack stk = {.data = {NULL}, .idx = 0}; tst_node *curr, **pcurr; if (!root || !*s) return NULL; /* validate parameters */ if (strlen(*s) + 1 > STKMAX / 2) /* limit length to 1/2 STKMAX */ return NULL; /* 128 char word lenght is plenty */ pcurr = root; /* start at root */ while ((curr = *pcurr)) { /* iterate to insertion node */ diff = *p - curr->key; /* get ASCII diff for >, <, = */ if (diff == 0) { /* if char equal to node->key */ if (*p++ == 0) { /* check if word is duplicate */ if (del) { /* delete instead of insert */ (curr->refcnt)--; /* decrement reference count */ /* chk refcnt, del 's', return NULL on successful del */ return tst_del_word(root, curr, &stk, 1); } else curr->refcnt++; /* increment refcnt if word exists */ return (void *) curr->eqkid; /* pointer to word / NULL on del */ } pcurr = &(curr->eqkid); /* get next eqkid pointer address */ } else if (diff < 0) { /* if char less than node->key */ pcurr = &(curr->lokid); /* get next lokid pointer address */ } else { /* if char greater than node->key */ pcurr = &(curr->hikid); /* get next hikid pointer address */ } if (del) tst_stack_push(&stk, curr); /* push node on stack for del */ } /* if not duplicate, insert remaining chars into tree rooted at curr */ for (;;) { /* allocate memory for node, and fill. use calloc (or include * string.h and initialize w/memset) to avoid valgrind warning * "Conditional jump or move depends on uninitialised value(s)" */ if (!(*pcurr = calloc(1, sizeof **pcurr))) { fprintf(stderr, "error: tst_insert(), memory exhausted.\n"); return NULL; } curr = *pcurr; curr->key = *p; curr->refcnt = 1; curr->lokid = curr->hikid = curr->eqkid = NULL; if (!*root) /* handle assignment to root if no root */ *root = *pcurr; /* Place nodes until end of the string, at end of stign allocate * space for data, copy data as final eqkid, and return. */ if (*p++ == 0) { if (cpy) { /* allocate storage for 's' */ const char *eqdata = strdup(*s); if (!eqdata) return NULL; curr->eqkid = (tst_node *) eqdata; return (void *) eqdata; } else { /* save pointer to 's' (allocated elsewhere) */ curr->eqkid = (tst_node *) *s; return (void *) *s; } } pcurr = &(curr->eqkid); } } ``` * 可以被分成三個部份來解讀 1. 檢查字串是否符合規範 2. `while` 迴圈 3. `for` 迴圈 * 首先是 `1. 檢查字串是否符合規範` ```C if (!root || !*s) return NULL; /* validate parameters */ if (strlen(*s) + 1 > STKMAX / 2) /* limit length to 1/2 STKMAX */ return NULL; /* 128 char word lenght is plenty */ ``` * 解讀 * 由於 `STKMAX` 在 `tst.c` 中的定義為 `WRDMAX` 的兩倍，因此若字串的長度加上 `EOS` 後比 `WRDMAX` 大的話，該字串就無法被插入到 `TST` 中。(註：在 `tst.c` 中，`WRDMAX` 被定義為 `128`) * 若待被插入或刪除的字串為 `0` 的話，則不插入或刪除。 * 接著是 `2. while 迴圈` ```C pcurr = root; /* start at root */ while ((curr = *pcurr)) { /* iterate to insertion node */ diff = *p - curr->key; /* get ASCII diff for >, <, = */ if (diff == 0) { /* if char equal to node->key */ if (*p++ == 0) { /* check if word is duplicate */ if (del) { /* delete instead of insert */ (curr->refcnt)--; /* decrement reference count */ /* chk refcnt, del 's', return NULL on successful del */ return tst_del_word(root, curr, &stk, 1); } else curr->refcnt++; /* increment refcnt if word exists */ return (void *) curr->eqkid; /* pointer to word / NULL on del */ } pcurr = &(curr->eqkid); /* get next eqkid pointer address */ } else if (diff < 0) { /* if char less than node->key */ pcurr = &(curr->lokid); /* get next lokid pointer address */ } else { /* if char greater than node->key */ pcurr = &(curr->hikid); /* get next hikid pointer address */ } if (del) tst_stack_push(&stk, curr); /* push node on stack for del */ } ``` * 解讀時可以分成兩個情況： 1. `del == 0` 2. `del == 1` * 第一種情況： * 比較 `curr` 指向的 `tst_node` 的 `key` 和字串中的字元 `*p` 的大小 * 若是 `*p` > `key`，則讓 `pcurr` 指向指向`curr->lokid`的指標的記憶體位置。 * 若是 `*p` < `key`，則讓 `pcurr` 指向指向`curr->hikid`的指標的記憶體位置。 * 若是 `*p` = `key` * 且字串的下一個字元(`*s++`)為 `EOS` 時，將`curr` 指向的 `tst_node` 的 `refcnt` 值加一。 * 若字串的下一個字元不為 `EOS`，則將 `pcurr` 指向指向`curr->eqkid` 的指標的記憶體位置。 * 第二種情況： * 將含有待刪除的字串的 `EOS` 之前的字元的 `Node` 的記憶體位置儲存在 `stk` 中。 * 當字串的下一個字元(`s++`)為 `EOS` 時，將 `curr` 指向的 `tst_node` 的 `refcnt` 的值減一，並將 `stk` 中所儲存的記憶體的內容，透過呼叫 `tst_del_word()` 來釋放。 > 為什麼在新增或刪除字串時，不是將 `curr->eqkid` 的 `refcnt` 的值加一或減一？[name=David Kuo] * 最後是 `3. for 迴圈` ```C /* if not duplicate, insert remaining chars into tree rooted at curr */ for (;;) { /* allocate memory for node, and fill. use calloc (or include * string.h and initialize w/memset) to avoid valgrind warning * "Conditional jump or move depends on uninitialised value(s)" */ if (!(*pcurr = calloc(1, sizeof **pcurr))) { fprintf(stderr, "error: tst_insert(), memory exhausted.\n"); return NULL; } curr = *pcurr; curr->key = *p; curr->refcnt = 1; curr->lokid = curr->hikid = curr->eqkid = NULL; if (!*root) /* handle assignment to root if no root */ *root = *pcurr; /* Place nodes until end of the string, at end of stign allocate * space for data, copy data as final eqkid, and return. */ if (*p++ == 0) { if (cpy) { /* allocate storage for 's' */ const char *eqdata = strdup(*s); if (!eqdata) return NULL; curr->eqkid = (tst_node *) eqdata; return (void *) eqdata; } else { /* save pointer to 's' (allocated elsewhere) */ curr->eqkid = (tst_node *) *s; return (void *) *s; } } pcurr = &(curr->eqkid); } ``` ## 觀察字典檔 > TODO：觀察字典檔 * `Cities.tst` * 內容為世界上各個地區的城市名稱 * 共有 `93827` 筆資料。 * 第一筆資料是 `Shanghai, China` * 排序沒有遵守 `A-Z` 的大小。 ## 測試程式 > TODO：根據觀察的結果，預期可能會有的問題後，透過測試程式 `test_cpy` 來確認自己的想法，最後修正程式中的 bug。 ``` $ ./test_cpy ternary_tree, loaded 259112 words in 0.149612 sec Commands: a add word to the tree f find word in tree s search words matching prefix d delete word from the tree q quit, freeing all data ``` ## 問題探討 ### 觀察 `tst.h` 和 `tst.c` 為何要 Forward Declaration * 若別的程式員想要了解 `tst.c` 裡的函式的功能時，就可以在 `tst.h` 中很方便的閱讀當初的開發者對於 `tst.c` 裡各個函式的註解。