2020q3 Homework3 (dict)

# 2020q3 Homework3 (dict) contributed by < `Sisker1111` > ###### tags: `進階電腦系統理論與實作` [TOC] ## 作業說明及程式碼 > [I04: dict](https://hackmd.io/@sysprog/2020-dict) > [GitHub](https://github.com/Sisker1111/dict) ## key word ternary search tree、Bloom filter、bit-wise operation、benchmark、GNU make and Makefile、Valgrind、perf. # 程式原理 ## CPY 與 REF 機制原始程式中，會根據輸入參數分成 CPY 和 REF 兩種模式，並分別作對應的處理。 ```cpp= if (!strcmp(argv[1], "CPY") || (argc > 2 && !strcmp(argv[2], "CPY"))) { CPYmask = 0; REF = DEL; printf("CPY mechanism\n"); } else printf("REF mechanism\n"); ``` 在 REF 模式下，`CPYmask` = -1，因此改配置給一個更大的空間給`Top`。 ```cpp long poolsize = 2000000 * WRDMAX; enum { INS, DEL, WRDMAX = 256, STKMAX = 512, LMAX = 1024 }; .... if (CPYmask) { /* memory pool */ pool = (char *) malloc(poolsize * sizeof(char)); Top = pool; } ``` * 接著會開始讀取`"cities.txt"`這個檔案，根據`,`及`\n`兩個字元去做切割 ```cpp= while (fgets(buf, WORDMAX, fp)) { int offset = 0; for (int i = 0, j = 0; buf[i + offset]; i++) { Top[i] = (buf[i + j] == ',' || buf[i + j] == '\n') ? '\0' : buf[i + j]; j += (buf[i + j] == ','); } while (*Top) { if (!tst_ins(&root, Top, INS, REF)) { . . } } ``` * 再來，呼叫 tst_ins 來插入 tst 的節點，如果是在 CPY 模式下，會透過 strdup 去呼叫 malloc 配置一塊等同於 Input 字串`s`的空間，並複製其內容到`eqdata`，如果是在 REF 模式下，因為已事先為 Top 配置了一個很大的記憶體空間來擺放所有的待搜尋字串，因此只要 refererence 過去即可 ```cpp= if (cpy) { /* allocate storage for 's' */ const char *eqdata = strdup(s); if (!eqdata) return NULL; curr->eqkid = (tst_node *) eqdata; return (void *) eqdata; } else { /* save pointer to 's' (allocated elsewhere) */ curr->eqkid = (tst_node *) s; return (void *) s; } ``` `Top -= offset & ~CPYmask;`這個機制會判斷是否讓 Top 回到起始位置，在 CPY 模式 Top 會回到 word 的起始位置，REF 模式則保持不變. 最後，我認為 CPY 及 REF 兩種機制應該是在效能與空間成本的一個tradeoff， CPY 機制下雖然要多花 branch 去存取字串，但每次只需配置一定大小的空間， REF 雖然只要 refererence 過去就可以存取該字串，但須先配置一個很大的空間. ## tst_search 計算目前搜尋到的 `prefix = *s` 以及被搜尋的 `node key = cur` 兩個字元的差值並存入`diff`，用來判斷兩個字元是否相等. * 當`diff == 0` 判斷 `*s`是否是空字元，是的話則 search 完成，否則繼續比對下一個字元，往`eqkid`移動. * 當`diff < 0` 表示`*s` 比`cur->key`小，往`lokid`移動. * 當`diff > 0` 表示`*s` 比`cur->key`大，往`hikid`移動. ```cpp= while (curr) { /* loop over each char in 's' */ int diff = *s - curr->key; /* calculate the difference */ if (diff == 0) { /* handle the equal case */ if (*s == 0) /* if *s = curr->key = nul-char, 's' found */ return (void *) curr->eqkid; /* return pointer to 's' */ s++; curr = curr->eqkid; } else if (diff < 0) /* handle the less than case */ curr = curr->lokid; else curr = curr->hikid; /* handle the greater than case */ } ``` ## tst_ins 原始程式 tst_ins() 中，第一個 while 的目的是要判斷要 insert 的字串是否已經存在 Ternary search tree中: ```cpp= //tst.c const char *p = s; tst_node *curr, **pcurr; if (!root || !s) return NULL; /* validate parameters */ if (strlen(s) + 1 > STKMAX / 2) /* limit length to 1/2 STKMAX */ return NULL; /* 128 char word length is plenty */ pcurr = root; while ((curr = *pcurr)) { if (*p == 0 && curr->key == 0) { curr->refcnt++; return (void *) curr->eqkid; } pcurr = next_node(pcurr, &p); } ``` * `while ((curr = *pcurr))`用來判斷 tst 走到的地方是否還有節點. * `next_node()`負責把 Insert 字串`*s`和 tst 中的節點比大小，操作和 search 時相同. * `if (*p == 0 && curr->key == 0)`成立代表`*s`和 tst 都同時走到底並且確立`*s`已經存在 tst 中，並紀錄重複插入的次數(`curr->refcnt++;`)，最後直接返還一個 NULL ，在底下這個 test_common 程式中可以看到，`if(res)`成立時才會計算花費的時間. ```cpp= //test_common.c if (bloom_test(bloom, Top)) /* if detected by filter, skip */ res = NULL; else { /* update via tree traversal and bloom filter */ bloom_add(bloom, Top); res = tst_ins(&root, Top, REF); } t2 = tvgetf(); if (res) { idx++; Top += (strlen(Top) + 1) & CPYmask; printf(" %s - inserted in %.10f sec. (%d words in tree)\n", (char *) res, t2 - t1, idx); } ``` 最後，我們要加剩餘的字元都加進 Ternary search tree 中，一直到`*p`結束為止. ```cpp= //test_common.c for (;;) { if (!(*pcurr = calloc(1, sizeof **pcurr))) { fprintf(stderr, "error: tst_insert(), memory exhausted.\n"); return NULL; } curr = *pcurr; curr->key = *p; curr->refcnt = 1; curr->lokid = curr->hikid = curr->eqkid = NULL; if (*p++ == 0) { if (cpy) { /* allocate storage for 's' */ . . . } else { . . . } } pcurr = &(curr->eqkid); } ``` ## tst_del_word 透過 `tst_del_word` 將 TST 上的字串進行刪除。 ```cpp tst_node *victim = node; tst_node *parent = tst_stack_pop(stk); if (!victim->refcnt) { if (!victim->key && freeword) free(victim->eqkid); while (!parent->lokid && !parent->hikid && !victim->lokid && !victim->hikid) { ... } ``` * `victim` 會從等待被刪除的字串本身開始。 * 如果該 `refcnt` > 0，表示尚有 reference 存在，則不需刪除。 * `if (!victim->key && freeword) free(victim->eqkid)`，判斷是否為 CPY 模式，若是，則首先釋放字串本身。 * while 迴圈中，如果待移除的 `victim` 和其 `parent` 沒有其他 child，則可以直接釋放。 ```cpp if (victim->lokid && victim->hikid) { /* victim has both lokid/hikid */ if (!victim->lokid->hikid) { /* check for hikid in lo tree */ ... } else if (!victim->hikid->lokid) { /* check for lokid in hi tree */ ... } else /* can't rotate, return, leaving victim->eqkid NULL */ return NULL; } else if (victim->lokid) { /* only lokid, replace victim with lokid */ ... } else if (victim->hikid) { /* only hikid, replace victim with hikid */ ... } else { /* victim - no children, but parent has other children */ ... ``` * 在該移除的候選節點 `victim` 的同時，如果`victim`具有 low child 和 high child，則需要做相對應的處理，這部分的程式碼相對龐大，詳情請見[原始程式碼](https://github.com/Sisker1111/dict): ## 改善 Bloom filter 所用的資源在 bloom filter 中，我們只需要 1 個 bit 就能表達我們需要的狀態，但 C 語言裡並沒有定義 1 bit 的 Boolean Type，儘管我們只需要 1 bit 但系統最低仍會分配 1 byte 的空間，因此我們會浪費 7/8 的 table 空間，改進方式如下: ```cpp= bloom_t bloom_create(size_t size) { bloom_t res = calloc(1, sizeof(struct bloom_filter)); res->size = size; //res->bits = malloc(size + 7); res->bits = calloc((size + 7) >> 3, 1); bloom_add_hash(res, djb2); bloom_add_hash(res, jenkins); return res; } ``` 這樣做的目的是，使得每個 byte 的每個 bit 都是 Bloom filter 的一個 element (意即 bit 和 bit 間是獨立看待的)，因此，在搜尋時我們會需要先找到該 hash value 位於哪個 "byte"，然後再找出位於哪個 "bit"，以下程式在做這件事情: ```cpp= bool bloom_test(bloom_t filter, const void *item) { struct bloom_hash *h = filter->func; uint8_t *bits = filter->bits; while (h) { unsigned int hash = h->func(item); hash %= filter->size; if (!(bits[hash >> 3] & (0x40 >> (hash & 7)))) { return false; } h = h->next; } return true; } ``` # Ternary search tree + bloom filter 的效能表現