2020q3 Homework3 (dict)

--- tags: Linux2020 --- # 2020q3 Homework3 (dict) contributed by < `YLowy` > ## 修改 dict 中錯誤 [bloom.c 的錯誤更正](https://hackmd.io/@YLowy/HkyD5pqBv#%E5%B0%8D%E6%96%BC-bloomc-%E9%80%B2%E8%A1%8C%E9%8C%AF%E8%AA%A4%E6%9B%B4%E6%AD%A3) > [sysprog21/dict](https://github.com/sysprog21/dict) 已引入修正，請 git merge 或 rebase > :notes: jserv ## Code Overview 開始前先解讀 dict 程式碼 ### test_common.c - [ ] main() ```c= int main(int argc, char **argv) { ... int CPYmask = -1; ... if (!strcmp(argv[1], "CPY") || (argc > 2 && !strcmp(argv[2], "CPY"))) { CPYmask = 0; REF = DEL; printf("CPY mechanism\n"); } else printf("REF mechanism\n"); ... ``` :::warning :warning: 上述程式碼寫得很差，你大可直接改寫。可參考 [x-compressor](https://github.com/jserv/x-compressor/blob/master/x.c#L59-L71)，檢查 `argv[0]` 的名稱，後者就是執行檔的名稱，因此你可以建立 `test-common` 執行檔後，透過 [soft link](https://www.cyberciti.biz/faq/creating-soft-link-or-symbolic-link/) 來建立形如 `test-cpy` 和 `test-ref` 的檔案，再由 C 程式判斷其作用。對照看 [x-compressor/Makefile](https://github.com/jserv/x-compressor/blob/master/Makefile#L17-L19) :notes: jserv ::: CPYmask 預設為 0xFFFFFFFF 若是對此程式傳入參數 " CPY " 會將 CPYmask 值帶入成 0x00000000 並在之後讀檔案 cities.txt 之後， t1 紀錄當下時間 ```c=61 t1 = tvgetf(); ``` 首先建立 bloom filter table ```c=63 bloom_t bloom = bloom_create(TableSize); ``` 如果以 REF 機制會建立一個 memory pool , 並且讓 Top 指向該 pool 首位 ```c=67 if (CPYmask) { /* memory pool */ pool = (char *) malloc(poolsize * sizeof(char)); Top = pool; } ``` 再來做基本的資料處理 ```c=74 while (fgets(buf, WORDMAX, fp)) { int offset = 0; for (int i = 0, j = 0; buf[i + offset]; i++) { Top[i] = (buf[i + j] == ',' || buf[i + j] == '\n') ? '\0' : buf[i + j]; j += (buf[i + j] == ','); } ... ``` 要解讀這裡首先要先觀察 cities.txt 內容 ``` Shanghai, China Buenos Aires, Argentina Mumbai, India Mexico City, Distrito Federal, Mexico Karachi, Pakistan İstanbul, Turkey Delhi, India Manila, Philippines Moscow, Russia ... ``` :::danger 注意：不是「每一行」格式都相同，作業說明已提及例外狀況。 :notes: jserv ::: ~~我們可以發現每一行中的格式為 "城市","空格""國家"~~ ... ASCII " "(空格) -> 0x20 ASCII "\0"(NUL) -> 0x00 ... 而這裡想要進行的處理為將該行讀取到的資料 ( buf ) 的逗號、逗號後的空格、跳行取代掉舉例而言， `Shanghai, China` 會轉變成 `Shanghai\0China\0` `Mexico City, Distrito Federal, Mexico` 會轉變成 `Mexico City\0Distrito Federal\0Mexico\0` 所以讀取一行內的資料會穿插 "\0" 變成多組字串形成字串組所以會用下面的 `while (*Top) {...` 將此字串組餵進 tst 以及 bloom filter 中並讓 Top 指向轉換後資料的首位 ```c=81 while (*Top) { if (!tst_ins_del(&root, Top, INS, REF)) { /* fail to insert */ fprintf(stderr, "error: memory exhausted, tst_insert.\n"); fclose(fp); return 1; } bloom_add(bloom, Top); idx++; int len = strlen(Top); offset += len + 1; Top += len + 1; } Top -= offset & ~CPYmask; memset(Top, '\0', WORDMAX); } ``` 取得資料後，會將該資料餵進 tst_ins_del(...) 以及 bloom_add(...) idx 為總加入單字數 offset 指派為該資料長度 + 1 以計算位移 Top 為 pointer type ，用意為讓 Top 能夠指向 " 下一筆欲加入處理過後資料的位置 " 這邊還要解釋 CPY REF 兩種機制在此處不同的處理 : 1. REF 機制中 CPYmask 為 0xFFFFFFFF 在`Top -= offset & ~CPYmask;` 中 Top = Top - offset & 0x00000000 Top 並不會作修正先前提到 REF 架構時候會建立一個 memory pool 並讓將位址指派給 Top Top 會成為 memory pool 中紀錄下一筆資料的起始位址 2. CPY 機制中 CPYmask 為 0x00000000 在`Top -= offset & ~CPYmask;` 中 Top = Top - offset & 0xFFFFFFFF Top 會成為原先的指向 word 的位址 ```c=36 char word[WRDMAX] = ""; ``` 之後兩者都在該 Top 指標指向空間往後 WORDMAX 大小做資料清空，全部帶入"\0", 並重複動作讓所有 cities.txt 的資料寫入 tst 以及 bloom filter 全部做完之後會寫入到第二個時間 t2 ```c=96 t2 = tvgetf(); ``` t2 - t1 此處可以知道選用方法花費多少時間建立 ternary_tree 在 main 函式中的 tst_ins_del(&root, Top, INS, REF) 也必須了解在一開始的 Enumeration 中宣告 ```c=15 enum { INS, DEL, WRDMAX = 256, STKMAX = 512, LMAX = 1024 }; ``` 其中 INS , DEL 用來帶進 tst_ins_del 的參數後面的 REF 在 main程式碼中用來表示為 CPY 或者 REF 機制 ```c=49 REF = DEL; ``` 我個人不喜歡這樣的的表達，應該要有更容易解讀方式實作 ### tst.c - [ ] tst_ins_del ```c=277 void *tst_ins_del(tst_node **root, const char *s, const int del, const int cpy) { int diff; const char *p = s; tst_stack stk = {.data = {NULL}, .idx = 0}; tst_node *curr, **pcurr; ``` diff 用來表示加入字串的當時ASCII - 字元當前 node 的ASCII數值 (似乎可以省略這個變數?) 創建一個 stack stk (為何指標陣列大小要給 2 倍的WORDMAX?) ```c=225 if (!root || !s) return NULL; /* validate parameters */ if (strlen(s) + 1 > STKMAX / 2) /* limit length to 1/2 STKMAX */ return NULL; /* 128 char word length is plenty */ ``` 檢查輸入是否合理 ```c=230 pcurr = root; /* start at root */ while ((curr = *pcurr)) { /* iterate to insertion node */ diff = *p - curr->key; /* get ASCII diff for >, <, = */ if (diff == 0) { /* if char equal to node->key */ if (*p++ == 0) { /* check if word is duplicate */ if (del) { /* delete instead of insert */ (curr->refcnt)--; /* decrement reference count */ /* chk refcnt, del 's', return NULL on successful del */ return tst_del_word(root, curr, &stk, cpy); } else curr->refcnt++; /* increment refcnt if word exists */ return (void *) curr->eqkid; /* pointer to word / NULL on del */ } pcurr = &(curr->eqkid); /* get next eqkid pointer address */ } else if (diff < 0) { /* if char less than node->key */ pcurr = &(curr->lokid); /* get next lokid pointer address */ } else { /* if char greater than node->key */ pcurr = &(curr->hikid); /* get next hikid pointer address */ } if (del) tst_stack_push(&stk, curr); /* push node on stack for del */ } ``` 考慮到目前使用到 INSERT , 先預想 del = 0 可以把上述程式碼簡化成容易理解版本: ```c=230 pcurr = root; while ((curr = *pcurr)) { diff = *p - curr->key; if (diff == 0) { if (*p++ == 0) { curr->refcnt++; return (void *) curr->eqkid; } pcurr = &(curr->eqkid); } else if (diff < 0) { pcurr = &(curr->lokid); } else { pcurr = &(curr->hikid); } } ``` 很容易看得出curr 會指向當前 tst 中正要開始做當下字串 INSERT 的 node ```c=257 for (;;) { /* allocate memory for node, and fill. use calloc (or include * string.h and initialize w/memset) to avoid valgrind warning * "Conditional jump or move depends on uninitialised value(s)" */ if (!(*pcurr = calloc(1, sizeof **pcurr))) { fprintf(stderr, "error: tst_insert(), memory exhausted.\n"); return NULL; } curr = *pcurr; curr->key = *p; curr->refcnt = 1; curr->lokid = curr->hikid = curr->eqkid = NULL; if (!*root) /* handle assignment to root if no root */ *root = *pcurr; /* Place nodes until end of the string, at end of stign allocate * space for data, copy data as final eqkid, and return. */ if (*p++ == 0) { if (cpy) { /* allocate storage for 's' */ const char *eqdata = strdup(s); if (!eqdata) return NULL; curr->eqkid = (tst_node *) eqdata; return (void *) eqdata; } else { /* save pointer to 's' (allocated elsewhere) */ curr->eqkid = (tst_node *) s; return (void *) s; } } pcurr = &(curr->eqkid); } ``` 上面會對於接下來要加入的字元建立並且新增該 tst_node 空間，並且在讀取到最後一個字元時 ( 也就是該字元的下一個字元為 '\0' ) 時檢查是 CPY 或者 REF 機制如果是 CPY 機制以 strdup 複製一個等同 s 指向的字串，並指派給 eqkid 如果是 REF 機制則直接把該字串的 pointer (也就是 s) 指派給 eqkid 為什麼要那麼做?原因 : #### [分飾多角的 eqkid](https://hackmd.io/@sysprog/2020-dict#%E5%88%86%E9%A3%BE%E5%A4%9A%E8%A7%92%E7%9A%84-eqkid) Linux man page 對於 [strdup](https://man7.org/linux/man-pages/man3/strdup.3p.html) 的描述: > The strdup() function returns a pointer to a new string which is a duplicate of the string s. Memory for the new string is obtained with malloc(3), and can be freed with free(3). 查 C11 規格書其實沒有講到 strdup() ，後來發現這其實並非規範在 ISO C standard 中，仰賴實作，以 malloc() 在 heap 中建立副本。當在 Insert 新的單字的時候，會一路 malloc 建立一條 tst_node list 並在最後的 eqkid 指向一個 malloc 的字串。在 Insert 像是 "Pneumonoultramicroscopicsilicovolcanoconiosis" 這種很難出現分支的單字，基本上在 Heap 中的 tst_node 以及最後的 eqkid 指向的空間會建立在一起。 - [ ] tst_del_word 這邊程式碼非常長，而且 branch 行為非常多，如果不從脈絡以及大綱開始看會很難看懂這段程式碼的核心行為是對於每一種 del 前提條件下進行調整主要行為分為兩種 : 1. 刪掉該點之後如何接上不該被刪掉的節點 2. 接上節點時候要考量的 Rotate 行為以下拆成 5 個最大部分進行，當該 "tst_node" 成為欲刪除目標時候考量行為 ```cpp static void *tst_del_word(tst_node **root, tst_node *node, tst_stack *stk, const int freeword) { tst_node *victim = node; /* begin deletion w/victim */ tst_node *parent = tst_stack_pop(stk); /* parent to victim */ if (!victim->refcnt) { /* if last occurrence */ if (!victim->key && freeword) /* check key is nul */ free(victim->eqkid); /* free string (data) */ /* remove unique suffix chain - parent & victim nodes * have no children. simple remove until the first parent * found with children. */ while (!parent->lokid && !parent->hikid && !victim->lokid && !victim->hikid) { // A Part } /* check if victim is prefix for others (victim has lo/hi node). * if both lo & hi children, check if lokid->hikid present, if not, * move hikid to lokid->hikid, replace node with lokid and free node. * if lokid->hikid present, check hikid->lokid. If not present, then * move lokid to hikid->lokid, replace node with hikid free node. */ if (victim->lokid && victim->hikid) { /* victim has both lokid/hikid */ // B Part } else if (victim->lokid) { /* only lokid, replace victim with lokid */ // C Part } else if (victim->hikid) { /* only hikid, replace victim with hikid */ //D Part } else { /* victim - no children, but parent has other children */ //E Part } } else /* node->refcnt non-zero */ printf(" %s (refcnt: %u) not removed.\n", (char *) node->eqkid, node->refcnt); return victim; /* return NULL on successful free, *node otherwise */ } ``` #### A Part 最值觀的刪除條件，無後顧之憂所以我全都要刪 ![](https://i.imgur.com/huz7hrO.gif) #### B Part 有了左右兩個子節點時，就要連同考慮該節點刪除後該如何接，rotate 方式其實與資料結構中的紅黑數相似，不過這邊要考慮三個節點，所以程式碼看起來比較多 ![](https://i.imgur.com/65hcHjJ.gif) #### C Part 直接將父節點的 eqkid 指向自己的子節點即可 ![](https://i.imgur.com/x9yW1mp.gif) #### D Part 同 C Part ，不過改成hikid ![](https://i.imgur.com/8jZUaYm.gif) #### E Part 自身無子節點，可以父節點有其他的子節點，勢必要讓其中一個取代自己成為父節點指向的 eqkid ，這邊也要考慮 rotate 問題所以程式碼看起來很大 ![](https://i.imgur.com/6RegFOF.gif) ### bloom.c 觀察 bloom filter 所進行之行為在加入單字前會先建立 bloom filter table 資料結構如下: test_common.c `bloom_t bloom = bloom_create(TableSize);` ```graphviz digraph A { rankdir=L; a [label="bloom" shape = circle] b [shape=record fontsize=24 label="{<fun>func|<bits>bits|<sizes>size = 5000K}"] c [shape=record fontsize=24 label="{bitTable}"] d [shape=record fontsize=24 label="{function pointer = \"djb2\"|<next>next}"] e [shape=record fontsize=24 label="{function pointer = \"jenkins\"|<next>next}"] null [label="Null" shape = circle] a->b b:fun->d b:bits->c d:next->e e:next->null } ``` 再來會對該單字進行 Hash 提並且記錄在 bitmap 中 `bloom_add(bloom, Top);` 其對應方式如下 ```c=83 void bloom_add(bloom_t filter, const void *item) { struct bloom_hash *h = filter->func; uint8_t *bits = filter->bits; while (h) { unsigned int hash = h->func(item); hash %= filter->size; bits[hash >> 3] |= 0x40 >> (hash & 7); h = h->next; } } ``` 由於用 bitmap 紀錄，先找到此對應的 byte 位置 bits[hash >> 3] : ```graphviz digraph A{ rankdir=LR; b [shape=record fontsize=24 label="{ 0| 0| 0| 0| 0| 0| 0| 0 }"] } ``` hash>>3 hash & 7 => hash & 00000111 hash 值會剩下後 3 bit ，也就是該 byte 之 offset (000 ~ 111) 也就是 0 到 7 再來對該bytes 進行 " or " 操作紀錄 Hash map |= 0x40 >> (hash & 7) ... 對該 bit 進行 0 -> 1 操作 ```graphviz digraph A{ rankdir=LR; b [shape=record fontsize=24 label="{ 0|1| 0| 0| 0| 0| 0| 0 }"] } ``` 為何是 0x40 ?? 應該是 0x80 (0b10000000) 才對 ### 對於 bloom.c 進行錯誤更正早在一開始測試自己的輸入時就發現這個問題，有些原本記錄在 tst 中的資料並沒有能被找到， bloom filter 只會有 "顯示有但事實上沒有" 的錯誤情況，如果 bloom filter 沒能找到之後也不會再 tst 中找尋 ```c=168 if (bloom_test(bloom, word)) { t2 = tvgetf(); printf(" Bloomfilter found %s in %.6f sec.\n", word, t2 - t1); printf( " Probability of false positives:%lf\n", pow(1 - exp(-(double) HashNumber / (double) ((double) TableSize / (double) idx)), HashNumber)); t1 = tvgetf(); res = tst_search(root, word); t2 = tvgetf(); if (res) printf(" ----------\n Tree found %s in %.6f sec.\n", (char *) res, t2 - t1); else printf(" ----------\n %s not found by tree.\n", word); } else printf(" %s not found by bloom filter.\n", word); ``` 在inputdata.txt: ``` ap, apple, applepen, applepencil,Penum ``` make 完之後會發現他載入5個單字選用 serach ap 結果會找到 4 個單字 ``` choice: s find words matching prefix (at least 1 char): ap ap - searched prefix in 0.000001 sec suggest[0] : ap suggest[1] : apple suggest[2] : applepen suggest[3] : applepencil ``` find ap :+1: ``` choice: f find word in tree: ap Bloomfilter found ap in 0.000001 sec. Probability of false positives:0.000000 ---------- Tree found ap in 0.000000 sec. ``` find apple :-1: ``` Commands: a add word to the tree f find word in tree s search words matching prefix d delete word from the tree q quit, freeing all data choice: f find word in tree: apple apple not found by bloom filter. ``` 全部單字中只有 apple 並未能找到我懷疑是 bloom filter 這裡寫錯了對 bloom_add, bloom_test 進行修改 0x40 ->0x80 ```c=83 void bloom_add(bloom_t filter, const void *item) { struct bloom_hash *h = filter->func; uint8_t *bits = filter->bits; while (h) { unsigned int hash = h->func(item); hash %= filter->size; bits[hash >> 3] |= 0x80 >> (hash & 7); h = h->next; } } bool bloom_test(bloom_t filter, const void *item) { struct bloom_hash *h = filter->func; uint8_t *bits = filter->bits; while (h) { unsigned int hash = h->func(item); hash %= filter->size; if (!(bits[hash >> 3] & (0x80 >> (hash & 7)))) { return false; } h = h->next; } return true; } ``` remake 之後以相同的 inputdata 再試一次 ``` Commands: a add word to the tree f find word in tree s search words matching prefix d delete word from the tree q quit, freeing all data choice: f find word in tree: apple Bloomfilter found apple in 0.000002 sec. Probability of false positives:0.000000 ---------- Tree found apple in 0.000001 sec. ``` 成功找到問題點，果然是這邊的問題 ### GDB 實驗新增 testinput.txt : ``` ap, apple apple, Pneum ``` 修改 test_common.c : ```c=32 #define IN_FILE "testinput.txt" ``` ``` $ make clean $ make ``` ``` $ gdb -q ./test_common ``` ``` Reading symbols from ./test_common... (gdb) b main Breakpoint 1 at 0x185f: file test_common.c, line 35. (gdb) r CPY Starting program: /home/ylowy/linux2020/dict/test_common CPY Breakpoint 1, main (argc=0, argv=0x0) at test_common.c:35 35 { (gdb) until 97 CPY mechanism main (argc=2, argv=0x7fffffffe518) at test_common.c:97 97 fclose(fp); (gdb) print t1 $1 = 1601403009.8308451 (gdb) print t2 $2 = 1601403009.8309977 ``` 執行到 tst 建立完成， t1 , t2 時間 ``` (gdb) ptype tst_node type = struct tst_node { char key; unsigned int refcnt; struct tst_node *lokid; struct tst_node *eqkid; struct tst_node *hikid; } (gdb) print sizeof(tst_node) $39 = 32 ``` >先前提到的 alignment 問題 sizeof(char) = 1 sizeof(unsigned int) = 4 sizeof(tst_node *) = 8 ``` (gdb) p *root $3 = {key = 97 'a', refcnt = 1, lokid = 0x55555555ca90, eqkid = 0x55555555c930, hikid = 0x0} (gdb) p root $4 = (tst_node *) 0x55555555c900 ``` ``` (gdb) print *(char*)(root->eqkid->eqkid->eqkid) $11 = 97 'a' (gdb) print *(char*)(root->eqkid->eqkid->eqkid+1) $12 = 112 'p' (gdb) print *(char*)(root->eqkid->eqkid->eqkid+2) $13 = 0 '\000' ``` 可以清楚了解第一個字串 " ap " 建立過程，以及在 " key = '\0' " node 建立的 eqkid指向字串空間 " 'a' , 'p' , '\0' " ``` (gdb) print*root $7 = {key = 97 'a', refcnt = 1, lokid = 0x55555555ca90, eqkid = 0x55555555c930, hikid = 0x0} (gdb) print root $8 = (tst_node *) 0x55555555c900 (gdb) print root->eqkid $11 = (struct tst_node *) 0x55555555c930 (gdb) print *(root->eqkid) $12 = {key = 112 'p', refcnt = 1, lokid = 0x0, eqkid = 0x55555555c960, hikid = 0x0} (gdb) print root->eqkid->eqkid $13 = (struct tst_node *) 0x55555555c960 (gdb) print *(root->eqkid->eqkid) $14 = {key = 0 '\000', refcnt = 1, lokid = 0x0, eqkid = 0x55555555c990, hikid = 0x55555555c9b0} ``` 在 55 c9 00 中建立 root (32 bytes tst_node) 第一個 97 代表 'a' (char) -> 3 個 byte 是沒用的第二個 1 代表 'refcnt' (unsigned int) 第三四個 1431685776 21845 ( 10進位的 lokid address ) -> 0x55555555ca90 第五六個 1431685424 21845 ( 10進位的 eqkid address ) -> 0x55555555c930 第七八個 0 0 ( 10進位的 hikid address ) -> 0x000000000000 `0x55555555c920: 0 0 49 0` :::info ~~這行不確定存放甚麼不過也~~說明就算是接著 malloc() 也不會把記憶體位置連續配置在一起答案: 當在使用 malloc 要求 N bytes 空間的時候，會在 Heap 中配置 N+4 的空間，並把前4個 byte 寫入該配置空間大小，回傳第 5 個位址。這也是為何 free() 不需要空間大小，只需要位移 -4 取得 malloc 時期配置空間。 ::: root -> eqkid 指向的紀體空間 ( ASCII('p') = 112 ) `0x55555555c930: 112 1 0 0` 依序搜索下去 ``` 0x55555555c960: 0 1 0 0 0x55555555c970: 1431685520 21845 1431685552 21845 ``` 1431685520 21845 ( 0x55555555c990 ) -> 該 tst_node 的 eqkid 由於是尾端('\0') 該eqkid 會指向字串 ``` 0x55555555c900: 97 1 1431685776 21845 0x55555555c910: 1431685424 21845 0 0 0x55555555c920: 0 0 49 0 0x55555555c930: 112 1 0 0 0x55555555c940: 1431685472 21845 0 0 0x55555555c950: 0 0 49 0 0x55555555c960: 0 1 0 0 0x55555555c970: 1431685520 21845 1431685552 21845 0x55555555c980: 0 0 33 0 0x55555555c990: 28769 0 0 0 0x55555555c9a0: 0 0 49 0 0x55555555c9b0: 112 1 0 0 0x55555555c9c0: 1431685600 21845 0 0 0x55555555c9d0: 0 0 49 0 0x55555555c9e0: 108 1 0 0 0x55555555c9f0: 1431685648 21845 0 0 0x55555555ca00: 0 0 49 0 0x55555555ca10: 101 1 0 0 ``` 對於該 0x55555555c990 找尋 ``` (gdb) print *(char*)(0x55555555c990) $15 = 97 'a' (gdb) print (char*)(0x55555555c990) $16 = 0x55555555c990 "ap" ``` 可以發現跟我們實作的內容相同 textinput.txt 在 Heap 空間中可以確定是字串是建構在該 list 下一個 ``` (gdb) print (char*)(root->lokid->eqkid->eqkid->eqkid->eqkid->eqkid->eqkid) $25 = 0x55555555cbb0 "Pneum" ``` 再來試試看 REF 機制 `(gdb) r REF` 版面緣故直接找到尾端 node ，而 REF 架構中 eqkid 會指向遙遠的 0x7fffd9390010 ``` (gdb) p *(root->eqkid->eqkid) $6 = {key = 0 '\000', refcnt = 1, lokid = 0x0, eqkid = 0x7fffd9390010, hikid = 0x55555555c990} ``` 該 process 的 Heap 到 Stack 的距離以上是使用 GDB 觀察 test_common CPY 以及 REF 機制的 Insert Data.txt 行為的分析，剛好回顧一下 GDB 操作 ## 在 GitHub 上 fork dict，視覺化 ternary search tree + bloom filter 的效能表現並分析。注意，需要設計實驗，允許不同的字串輸入並探討 TST 回應的統計分佈 ``` ./test_common --bench CPY ``` 在測試程式中有這樣提到 bench_test() test_common.c ```c=103 if (argc == 3 && strcmp(argv[1], "--bench") == 0) { int stat = bench_test(root, BENCH_TEST_FILE, LMAX); tst_free(root); free(pool); return stat; } ``` 呼叫到的程式碼主要如下，從 cities 中的每個單字取前 3 個字元做搜尋，並且判斷其時間分布 bench.c ```c=47 while (fscanf(dict, "%s", word) != EOF) { if (strlen(word) < sizeof(prefix) - 1) continue; strncpy(prefix, word, sizeof(prefix) - 1); t1 = tvgetf(); tst_search_prefix(root, prefix, sgl, &sidx, max); t2 = tvgetf(); fprintf(fp, "%d %f msec\n", idx, (t2 - t1) * 1000); idx++; } ``` 這邊可以定義要每個單字的前幾個字元做比對 ```c=9 #define PREFIX_LEN 3 ``` ### 實驗一根據原先程式 bench.c 中測量 tst_prefix_search 程式以前 3 個字元做 suggest 測量時間照題目所敘述存在此段差異原因為 cache miss bench cpy ![](https://i.imgur.com/kKTtXL9.png) bench ref ![](https://i.imgur.com/wXARN9x.png) 觀察輸出圖發現，兩機制在搜尋上所花費時間其實差不了多少，但是 CPY 機制可以讓花費最大時間壓在 0.7 左右，並且相對穩定。再來思考這樣的實驗的合理性，實驗一在於給定 "一定可以找到" `tst_search_prefix(root, prefix, sgl, &sidx, max);`的前提上做實驗 ### 實驗二為了量測 bloom filter 的效能表現的 tst_search 量測 CPY : ![](https://i.imgur.com/Gf5vfsa.png) REF : ![](https://i.imgur.com/upYZvV4.png) 想法為量測單次執行時間，但是這並不是很好的量測實驗，一來要考量到 tst_search 的 input ，二來是時間過於短，應該要不影響輸入資料差距情況下測量多次測量以觀察時間。 ### 實驗三實驗資料其實想挺久的，後來我的作法為直接引入新的 cities2.txt ，並把裡面一個字元換成另一個字元。實驗中進行 100 次測試，每次測試會把比對資料 (cities.txt, cities2.txt, cities3.txt...) 內的單詞一一進行比對並且統計總時間。這樣比起隨機輸入測試好處是不會存在 "sdsddede " , "ewrwed" 這種毫無意義的測試，並起透過計算知道輸入的資料錯誤率，達到相對的驗證。第一個測試我讓 cities.txt 裡面的 `'C'` 更換為 `'B'`，像是單辭 China 會被替換為 Bhina 。而以下都是以 CPY 機制進行的實驗。 ``` ternary_tree, loaded 206849 words in 0.157690 sec C word count : 17126 ``` error rate : 8.279469564 % citiesChange.py ```python= import os def alter(file,old_str,new_str): with open(file, "r") as f1,open("cities2.txt", "w+") as f2: for line in f1: if old_str in line: line = line.replace(old_str, new_str) f2.write(line) alter("cities.txt", "C", "B") ``` 先從原本資料 cities.txt 進行比對，由於每個單辭都確定是可以找到的，我們可以預期再增加 bloom filter 之後時間花費會提升，畢竟增加了 hash 計算以及一些分支行為。而以下也是為我們預期結果，再確定都可以找到情況下，幒測量時間上升了 15ms 左右。錯誤率 0% (0/206849) ![](https://i.imgur.com/QvYRSH1.png) ![](https://i.imgur.com/UMpUIve.png) 再來替換`'a'` 該單字再總資料中出現相當頻繁，以提升比對資料錯誤率進行實驗。我們可以觀察到花費時間還是上升。但是總體時間上升情形不像上個實驗那麼嚴重錯誤率 8.27% (17126/206849) ![](https://i.imgur.com/vM4022B.png) ![](https://i.imgur.com/thDUE2S.png) 在原先 cities.txt 中，再次替換字元 `'e'` 使錯誤率上升並做觀察。在有 bloom filter 情形所花費時間相對穩定，卻還是沒有使用 bloom filter 花費時間較低。錯誤率 74.06% (153206/206849) ![](https://i.imgur.com/8sfxOKq.png) ![](https://i.imgur.com/9DHdvX1.png) 最後替換 `'a'`,`'e'`,`'i'`,`'o'`，讓錯誤率上升至非常高以做觀察，未使用 bloom filter 效率還是比較高，這似乎與我們預先猜想情逛不符合。錯誤率 98.90% (204590/206849) ![](https://i.imgur.com/Mdk78vx.png) ![](https://i.imgur.com/ozcPNeC.png) :::success 實驗結果與預期不同，我認為應該是在 tst 中結構不夠大，使得 "實際找尋的成本" 小於 "利用 filter " 判斷的成本，若是能增加此 tst 的內容，應該是可以得到預期結果。 ::: ## 以 perf 分析 TST 程式碼的執行時期行為並且充分解讀 ### Perf 實驗 `make test` ``` Performance counter stats for './test_common --bench CPY s Tai' (100 runs): 510,9294 cache-misses # 52.157 % of all cache refs ( +- 0.50% ) 979,6034 cache-references ( +- 0.12% ) 5,8790,6063 instructions # 1.16 insn per cycle ( +- 0.02% ) 5,0593,6526 cycles ( +- 0.18% ) 0.189083 +- 0.000529 seconds time elapsed ( +- 0.28% ) ``` ``` Performance counter stats for './test_common --bench REF s Tai' (100 runs): 499,0523 cache-misses # 51.495 % of all cache refs ( +- 0.60% ) 969,1313 cache-references ( +- 0.14% ) 5,5442,8501 instructions # 1.11 insn per cycle ( +- 0.00% ) 4,9995,3329 cycles ( +- 0.22% ) 0.186238 +- 0.000442 seconds time elapsed ( +- 0.24% ) ``` ## 研讀 Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters 論文並參閱對應程式碼實作 xor_singleheader，引入上述 dict 程式碼中，需要比較原有的 bloom filter 和 xor filter 在錯誤率及記憶體開銷 ### 研讀 [Xor Filters](https://arxiv.org/abs/1912.08258) Bloom filter 為 dict 所使用實作 Cuckoo Filter 可以比 Bloom filter 使用更少的空間以及更快的速度而本篇的 xor filter 可以有更少的空間需求又更快的速度 U : 一個大箱子放置許多 String S : 從 U 箱子中取出一些 String x :　S 裡面的一個 element fingerprint(x) 得到一個 k-bits 值想像成 ```c= char *x ={'a','p','p','l','e'}; uintk_t myStringfgpt = fingerprint(x); ``` array B : k-bits 值得陣列大小為 c 個，c 比 |S| 大些(題目設定) ```c= #define c ⌊1.23 · |S|⌋ + 32 uintk_t B[c]; ``` 想像B陣列平均成 3 等分 3 個 hash function h0, h1, h2 分別會 hash 到分別三塊上 Mapping | 陣列 | finger print | | ---- | ------------ | | 0 | 00010011... | | 1 | 00010110... | | 2 | 01000011... | | 3 | 10010011... | | 4 | 01010011... | | 5 | 01110011... | | 6 | 00010011... | | 7 | 00000111... | | 8 | 00011011... | 只要在 U 這個大箱子裡的 element y, h0(y) 範圍只會在0-2，h1(y) 範圍只會在3-5，h2(y) 範圍只會在6-9 ![](https://i.imgur.com/SFgHilV.png) ![](https://i.imgur.com/8rWMqX9.png) ![](https://i.imgur.com/bjICGtA.png) 這裡要注意 H array 和 Q queue H 可以存放多筆 key 的 array (這裡實作應該是 assign pointer to queue 方式) S 裡面的元素都帶入之後 Q 先放置第一輪 "單獨 key" 的位置，並且 push 至 Q 中 Q 完成之後再將其中元素依依 pop 出來，再將三個 key 從 H 移除，如果移除後 H[i] 這個位置又只有一個 key 就 push 進 Q 裡，重覆到 H 中沒有任何元素我的輸出會是一個 Stack σ ![](https://i.imgur.com/OGmJB5C.png) 其實跟著這 4 個演算法走就可以知道在幹嘛了 Algo 4 : B[i] ← fingerprint(x) ~~xor~~ = B[h0(x)] xor B[h1(x)] xor B[h2(x)] 稍微更正下 XOR FILTERS B[h0(x)] xor B[h1(x)] xor B[h2(x)] = fingerprint(x) 這裡使用 XOR 操作跟 [HW2q6](https://hackmd.io/@YLowy/BJMA34fHP#%E6%B8%AC%E9%A9%97-6) 有一點點小小相似，都使用到相同的 XOR 特性 ### 參閱對應程式碼實作 [xor_singleheader](https://github.com/FastFilter/xor_singleheader)，引入上述 dict 程式碼 ## 如何縮減頻繁 malloc 帶來的效能衝擊呢？提示: memory pool ## 在現有的開放原始碼專案中，找出使用 bloom filter 的專案，並解析其導入的考量