2020q3 Homework3 (dict)

# 2020q3 Homework3 (dict) contributed by < `justapig9020` > ###### tags: `sysprog2020` > [題目](https://hackmd.io/@sysprog/2020-dict) > [GitHub](https://github.com/justapig9020/dict) ## 程式說明 ### 運行模式此程式分為兩個模式，區別為輸入的字串保存的方式。 1. CPY ( copy ) > 在 `CPY` 模式中，輸入的字串在插入至樹中時會複製一份，因此樹葉上的字串各自獨立保存。 3. REF ( reference ) > 在 `REF` 模式中，輸入的字串會統一保存於 `pool` 中。 `pool` 在程式開始時會 `allocate` 一塊記憶體位址，樹葉實際上是指向這塊記憶體之中的對應位址，因此字串彼此相鄰。 ```cpp if (CPYmask) { /* memory pool */ pool = (char *) malloc(poolsize * sizeof(char)); Top = pool; } ``` ### 測試程式程式讀取給定字典檔，並透過對輸入字串前處理將每個單字獨立出來。在不同模式中輸入的 `Buffer` 有所不同。 - CPY > 由於字串在插入後會進行 `deep copy` ，此模式下使用拋棄式的 `buffer` ，每次讀檔都會將上次內容覆蓋掉。 ```cpp char word[WRDMAX] = ""; ... char *Top = word; ``` - REF > 直接將輸入保存於先前 `allocate` 的 `buffer` 上。 ```cpp if (CPYmask) { /* memory pool */ pool = (char *) malloc(poolsize * sizeof(char)); Top = pool; } ``` 接下來依序將單字插入 `ternary search tree` 並更新 `bloom filter` 。 ```cpp= while (fgets(buf, WORDMAX, fp)) { /* do pre-processing */ while (*Top) { if (!tst_ins_del(&root, Top, INS, REF)) { /* fail to insert */ ... } bloom_add(bloom, Top); ... } ... } ``` 最後進入 `for(;;)` 等待命令並做對應的處理。 ## Ternary Search Tree ### 資料結構主要資料結構 - 節點 ```cpp typedef struct tst_node { char key; unsigned refcnt; struct tst_node *lokid; struct tst_node *eqkid; struct tst_node *hikid; } tst_node; ``` ![](https://i.imgur.com/ClTTi8y.jpg) 其中 `key` 為該節點代表的字元，`refcnt` 則是以該結點為結尾的字串被插入的次數。另外每一節點指向另外3個節點，分別代表比較後 - 大於 `hikid` - 等於 `eqkid` - 小於 `lokid` 的狀況。接下來搭配程式來了解運作方式。 ### 插入由於刪除與插入都需要走訪樹中的特定分支，因此插入與刪除被寫成同一個 `function` 。首先我們聚焦於插入功能。首先透過慾插入之字串走訪樹，若可以在樹中找到該字串則將對應的 `refcnt` 加一並返回該字串的位址。 ```cpp= while ((curr = *pcurr)) { /* iterate to insertion node */ diff = *p - curr->key; /* get ASCII diff for >, <, = */ if (diff == 0) { /* if char equal to node->key */ if (*p++ == 0) { /* check if word is duplicate */ if (del) { /* delete instead of insert */ ... } else curr->refcnt++; /* increment refcnt if word exists */ return (void *) curr->eqkid; /* pointer to word / NULL on del */ } pcurr = &(curr->eqkid); /* get next eqkid pointer address */ } else if (diff < 0) { /* if char less than node->key */ pcurr = &(curr->lokid); /* get next lokid pointer address */ } else { /* if char greater than node->key */ pcurr = &(curr->hikid); /* get next hikid pointer address */ } ... } ``` 若比對完樹上現有結點仍有字元並未被比較到(簡而言之，樹中沒有當前想加入的字串)，則依序新增結點直到所有剩餘字元都加入樹中。 ```cpp= for (;;) { /* allocate memory for node, and fill. use calloc (or include * string.h and initialize w/memset) to avoid valgrind warning * "Conditional jump or move depends on uninitialised value(s)" */ if (!(*pcurr = calloc(1, sizeof **pcurr))) { fprintf(stderr, "error: tst_insert(), memory exhausted.\n"); return NULL; } curr = *pcurr; curr->key = *p; curr->refcnt = 1; curr->lokid = curr->hikid = curr->eqkid = NULL; if (!*root) /* handle assignment to root if no root */ *root = *pcurr; if (*p++ == 0) { ... } pcurr = &(curr->eqkid); } ``` :::info 第 `15` - `16` 行的判斷式似乎是多餘的。由於先前透過 `while` 迴圈走訪分支時， `root` 已是判斷條件， ```cpp pcurr = root; while ((curr = *pcurr)) ``` 換言之若 `*root == NULL` 則第一次進入 `for` 迴圈時 `pcurr` 會指向 `root` 。因此在 `calloc` 時 `root` 就會被分配到記憶體位址，並無需再進行額外的判斷了。 > 很好，等你做實驗確認後，就送 pull request 來改進程式碼！ > :notes: jserv ::: 最後根據給定的模式將字串新增至 `\0` 結點的 `eqkid`。 ```cpp if (*p++ == 0) { if (cpy) { /* allocate storage for 's' */ const char *eqdata = strdup(s); if (!eqdata) return NULL; curr->eqkid = (tst_node *) eqdata; return (void *) eqdata; } else { /* save pointer to 's' (allocated elsewhere) */ curr->eqkid = (tst_node *) s; return (void *) s; } } ``` :::info 將走訪功能獨立出來，並將插入與刪除分離成兩個 `function` 。如此一來不僅不用花費分支度判斷目前模式，且我認為於架構上會比較整潔。嘗試更改[程式](https://github.com/justapig9020/dict/tree/find_next)為上述架構。 > 這是你自己的作業和開發紀錄，當然觀點是自己的，請避免說「我個人的觀點」，你本來就沒有代表誰發言，不要因電視新聞看多了，忘了直觀的漢語怎麼書寫。有疑慮就去證實，並且送 pull request 過來！ > :notes: jserv > 已修正，以下為思考過程 > ![](https://i.imgur.com/3s4zc7T.png) > :notes: justapig9020 ::: ### 刪除刪除也是透過 `tst_ins_del` 進行。差別在於 `del` 在走訪樹時會將走過的節點透過 `tst_stack_push` 推入 `stk` 。 ```cpp tst_stack_push(&stk, curr); /* push node on stack for del */ ``` 如果最後有在樹中找到慾刪除的字串，就會將對應的 `refcnt` 減一並呼叫 `tst_del_word` 進行刪除。首先如果 `refcnt` 不為 `0` 代表此字串還有被參考，因此只返回指向目標結點的指標。若 `refcnt` 為 `0` 則代表以無參考，需刪除字串。 ```cpp if (!victim->refcnt) { /* if last occurrence */ ... } else /* node->refcnt non-zero */ printf(" %s (refcnt: %u) not removed.\n", (char *) node->eqkid, node->refcnt); ``` 刪除字串，首由下往上先將該分支中沒有 `lokid` \ `hikid` 的節點刪除。在 `tst` 中若兩字串在開頭有共用的子字串的話，這兩字串會共用節點直到出現分歧。後加入的字串會由分歧點開始，依據該字元大小被加入 `lokid` 或 `hikid` 為根的子樹。換言之，當一節點沒有上述兩分支的話，意味著該節點並未被其他字串參考。由此作為判斷將沒有被參考的節點刪除。 ```cpp while (!parent->lokid && !parent->hikid && !victim->lokid && !victim->hikid) { parent->eqkid = NULL; free(victim); victim = parent; parent = tst_stack_pop(stk); if (!parent) { /* last word & root node */ free(victim); return (void *) (*root = NULL); } } ``` 此處需注意的是， `tst_stack_pop` 便是將之前走訪時推入的節點 `pop` 出來。透過此方法，便可以反向走訪之前的節點。當該 `stack` 為空時，意味著已經回到整顆樹的根了，此時呼叫此函數會返回 `NULL` 。 ```cpp static void *tst_stack_pop(tst_stack *s) { if (!s->idx) return NULL; void *node = s->data[--(s->idx)]; s->data[s->idx] = NULL; return node; } ``` 當碰到第一個有分支的結點後有幾種狀況。 #### `victim->lokid` 與 `victim->hikid` 皆存在首先，當由於 `victim->lokid->hikid` 或 `victim->hikid->lokid` 沒有子樹的狀況下，可以直接將另一棵子樹接上，並利用被接上的節點取代 `victim` ，詳細狀況如下圖所示。 1. `victim->lokid->hikid = victim->hikid` ![](https://i.imgur.com/1Ze0igv.jpg) ![](https://i.imgur.com/D09ISgl.jpg) 2. `victim->hikid->lokid = victim->lokid` ![](https://i.imgur.com/3h5tZAN.jpg) ![](https://i.imgur.com/lSGO4F9.jpg) 以情況 `1` 為例，由於`victim->lokid->key` 必定小於 `victim->hikid->key` 所以可以不透過比較直接完成串接。另一情況同理。然而當上述兩子樹都不為空時，由於無法直接串接子樹，在此實作中選擇移除該節點的 `eqkid` ，藉此讓以移除的字串無法再次匹配。 ![](https://i.imgur.com/NpRml96.jpg) ![](https://i.imgur.com/CO7XcrJ.jpg) 在刪除節點時會使用到 `macro` - `del_node` ，在此 `macro` 中會檢查 `victim` 在 `parent` 中的位置，並將要替換的子樹串接上去。 --- #### `victim->lokid` 或 `victim->hikid` 中只有一個存在由於 `victim` 只有一棵子樹，無需額外處理直接將子樹透過 `del_node` 串接上 `parent` 即可。 #### `vintim` 沒有子樹 --- ## Bloom Filter ### 新增