prefix-search - HackMD

# prefix-search contributed by <`f74034067`>, <`catpig1630`> :::warning TODO: 嘗試導入 [Bloom Filter](https://en.wikipedia.org/wiki/Bloom_filter)，實作快速字串搜尋機制。如果字串根本不存在於資料庫中，我們根本不需走訪 ternary search tree * [Bloom Filters by Example](https://llimllib.github.io/bloomfilter-tutorial/) * [A cache efficient thread safe bloom filter](https://github.com/thomas-dean/bloomfilter) * [Approximate matching using Hierarchical Bloom Filter Trees](https://github.com/ishnid/mrsh-hbft) * [Golomb-coded map (GCM)](https://github.com/0xcb/Golomb-coded-map) :notes: jserv ::: [功能描述](https://hackmd.io/s/ByFWoR_tz) ## Trie (字典樹) ＆ Ternary search tree (TST) [rex662624](https://hackmd.io/s/SJbHD5sYM#) 的解說十分詳盡完整 ### Trie trie 與 binary search tree 不同在於，不是把整個字串包在同一個node 中，而是根據 node 的位置決定代表的字串為何。也就是一個看結果( 最終 node )，一個看過程(經過的node)。因此利用 trie，可以快速的找出有相同 prefix 的字串。以下將舉例示範 trie 如何儲存 data。 :::info 一開始，trie tree 起始於一個空的root - ◎ ::: ``` ◎ ``` :::info 若今插入一個字串 and 1. 首先將 and 拆解成 a, n ,d 2. 因為 a 第一次出現，所以在 root 建立一個新children node ,value= a 3. 接下來從node "a" 出發，剩下 n,d 4. n 也是第一次出現,因此在 a 建立一個新 children node ,value= n 5. d 比照 4 作法。 ::: ``` ◎ / a / n / d ``` :::info 若此時插入一個字串 ant 1. ant 拆解成 a , n , t，從 root 出發 2. node "a" 已經存在，因此不再在 root 創造新的 children 3. 再來看 n , node "a" 往 children 看發現 node "n"已存在，因此也不創造 children 4. 再來看 t 發現 n 的 children 中沒有此 node，因此創造新 children "t" ::: ``` ◎ / a / n / \ d t ``` :::info 如此再插入 bear ,bea 就會產生像這樣子的 tree 1. 有 value 的就是有 string 存在的 node 2. 因此 search "ant" 會 return 值 0 ，表示 trie 內有此string 3. 而 search "be" 會 retrun NULL ，因為並沒有值在 node "e"。 ::: ``` ◎ / \ a b / \ n e / \ \ d t a (3) (0) (1) \ r (2) ``` 以上，我們注意到 Trie 中，若有儲存的字串有許多相同的 prefix ，那麼 trie 可以節省大量儲存空間，因為 Trie 中每個字串都是經由 character by character 方法進行儲存，所以具有相同 prefix 的部份都會有共同的 node 。相對的，若是資料非常大量，而且幾乎沒有相同的 prefix 將會非常浪費儲存空間。因為 trie 在 memory 的儲存方式其實是像下圖這樣，每個 node 建立時，都必須同時創造26個值為NULL的子節點。(下圖的 universal set ={a,b,c,d,e}5所以只創造了 5 個) ![](https://i.imgur.com/S7VQnPf.png) ### Ternary search tree 因此用此 Ternary search tree ，就能解決上面的問題。這種樹的每個節點最多只會有3個 children，分別代表小於，等於，大於。以下範例解釋 Ternary search tree 。 :::info 插入一個字串 and 1. 與 trie 概念相同，但是在root 是直接擺 "a" 而不需要一個null 。 ::: ``` a | n | d ``` :::info 接著插入單字ant 1. 我們先比對 a ，發現 a 和 root 的字母相同（a=a） 2. 因此 pointer 往下，來到 n 節點，發現 n 節點和 ant 的第二個字母相同（n=n） 3. 因此 pointer 再往下，來到 d 節點，發現 d 節點和 ant 的第三個字母不同，而且 **t>d** 4. 因此 d 節點的 right child 成為 t ，成功將 ant 放入原本的 TST 中 ::: ``` a | n- | | d t ``` :::info 接著插入單字age 1. 我們先比對 a ，發現 a 和 root 的字母相同（a=a) 2. 因此 pointer 往下，來到 n 節點，發現 n 節點和 age 的第二個字母不同，而且 **g<n** 3. 因此 n 節點的 left child 成為 g ，成功將 age 放入原本的 TST 中 ::: ``` a | -n | | g d- | | e t ``` [ Ternary Search Tree ](https://www.cs.usfca.edu/~galles/visualization/TST.html) 此網址提供了用動畫表示的方式，可以清楚看到如何加入與刪除 string 的流程。從 Ternary Search Tree 中，我們可以發現幾種特性。 1. 解決 trie tree 中大量佔用記憶體的情況，因為 TST 中，每個 node 最多只會有3個 child 2. 但尋找時間在有些 case 中，會比 trie 慢，而這是因為 TST 插入的順序會影響比較的次數。例如以下兩張圖，上圖是用 `aa->ac->ae->af->ag->ah` 的次序插入，而下圖是 `af->ae->ac->aa->ag->ah`的順序。試想今天要尋找 ah ，則上圖要比較6次，下圖則是只要比較3次。而若是建立 trie ，只需要2次。 ![](https://i.imgur.com/2ZZy7ym.png) ![](https://i.imgur.com/fPkOILf.png) ## 程式碼修正 ### 修改 Makefile，`$ make bench` 進行效能評比參考 [raygoah](https://hackmd.io/dwDOR7H2Q8qyxGxSOsc-vA?view) 的作法 * `Makefile` ```clike= TEST_INPUT = f Taiwan bench: $(TESTS) echo 3 | sudo tee /proc/sys/vm/drop_caches perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_cpy --bench $(TEST_INPUT) echo 3 | sudo tee /proc/sys/vm/drop_caches perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_ref --bench $(TEST_INPUT) ``` * `$ make bench` 執行結果 ``` Performance counter stats for './test_cpy --bench f Taiwan' (100 runs): 488,395 cache-misses # 29.139 % of all cache refs ( +- 1.26% ) 1,676,097 cache-references ( +- 0.47% ) 449,033,851 instructions # 1.08 insn per cycle ( +- 0.01% ) 417,234,828 cycles ( +- 0.48% ) 0.148296779 seconds time elapsed ( +- 0.56% ) ``` ### 修改 FIXME * 首先修改 `test_ref.c` 標注 `FIXME` 的地方 ```clike= - res = tst_ins_del(&root, &p, INS, CPY); + res = tst_ins_del(&root, &p, INS, REF); ``` * 執行 `test_ref.c`，會發現搜尋結果有誤 ``` Commands: a add word to the tree f find word in tree s search words matching prefix d delete word from the tree q quit, freeing all data choice: s find words matching prefix (at least 1 char): Tain Tain - searched prefix in 0.000010 sec suggest[0] : Tain suggest[1] : Tain suggest[2] : Tain suggest[3] : Tain suggest[4] : Tain suggest[5] : Tain suggest[6] : Tain suggest[7] : Tain suggest[8] : Tain suggest[9] : Tain suggest[10] : Tain suggest[11] : Tain ``` * `tst.c` 中與 CPY 有關的程式碼 ```clike= if (*p++ == 0) { if (cpy) { /* allocate storage for 's' */ const char *eqdata = strdup(*s); if (!eqdata) return NULL; curr->eqkid = (tst_node *) eqdata; return (void *) eqdata; } else { /* save pointer to 's' (allocated elsewhere) */ curr->eqkid = (tst_node *) *s; return (void *) *s; } } ``` CPY 的版本由於 strdup(*s) 會先用 malloc() 配置與參數 s 字串相同大小的空間，然後將 s 字串的內容複製到該空間，所以每次回傳的地址都不相同。 REF 的版本中，每次傳入 tst_ins_del() 的 *p 都相同，curr -> eqkid 當然也就都指向同一個位址。 ### 修正 REF 版本 searching 功能參考 [tina0405](https://hackmd.io/s/SkEFpwh2-#) 的作法 * 建立一個叫 word_arr 二維陣列，每次讀入一個新的城市就依序存入 word_arr[ ][ WRDMAX ] 中，如此一來每次傳入 tst_ins_del() 的 *p 就不會一樣了。 * 若是在 main() 裡面建立一個 word_arr[ 259112 ][ WRDMAX ] 會造成程序 core dumped ，所以我選擇宣告成全域變數。 ```clike= char word_arr[259112][WRDMAX]; int main() { ... while ((rtn = fscanf(fp, "%s", word_arr[idx])) != EOF) { char *p = word_arr[idx]; /* FIXME: insert reference to each string */ if (!tst_ins_del(&root, &p, INS, REF)) { fprintf(stderr, "error: memory exhausted, tst_insert.\n"); fclose(fp); return 1; } idx++; } ... } ``` * 執行 delete 的結果發現仍然有誤 ``` ternary_tree, loaded 259112 words in 0.166198 sec Commands: a add word to the tree f find word in tree s search words matching prefix d delete word from the tree q quit, freeing all data choice: d enter word to del: China deleting China China (refcnt: 1099) not removed. delete failed. ``` * 執行 quit 的結果發現仍然有誤 ``` ternary_tree, loaded 259112 words in 0.153593 sec Commands: a add word to the tree f find word in tree s search words matching prefix d delete word from the tree q quit, freeing all data choice: q *** Error in `./test_ref': free(): invalid pointer: 0x0000000000fb8000 *** ======= Backtrace: ========= /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f2bbcc487e5] /lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f2bbcc5137a] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f2bbcc5553c] ./test_ref[0x4021e4] ./test_ref[0x4021b9] ./test_ref[0x4021b9] ./test_ref[0x4021b9] ./test_ref[0x4021b9] ./test_ref[0x4021b9] ./test_ref[0x4021b9] ./test_ref[0x40219e] ./test_ref[0x40219e] ./test_ref[0x40219e] ./test_ref[0x4021b9] ./test_ref[0x40219e] ./test_ref[0x40219e] ./test_ref[0x4021b9] ./test_ref[0x40219e] ./test_ref[0x40219e] ./test_ref[0x40219e] ./test_ref[0x40219e] ./test_ref[0x40219e] ./test_ref[0x401348] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f2bbcbf1830] ./test_ref[0x400979] ======= Memory map: ======== 00400000-00403000 r-xp 00000000 08:08 261623 /home/suchihhan/prefix-search/test_ref 00602000-00603000 r--p 00002000 08:08 261623 /home/suchihhan/prefix-search/test_ref 00603000-00604000 rw-p 00003000 08:08 261623 /home/suchihhan/prefix-search/test_ref 00604000-04546000 rw-p 00000000 00:00 0 0511e000-06831000 rw-p 00000000 00:00 0 [heap] 7f2bb8000000-7f2bb8021000 rw-p 00000000 00:00 0 7f2bb8021000-7f2bbc000000 ---p 00000000 00:00 0 7f2bbc9bb000-7f2bbc9d1000 r-xp 00000000 08:07 136362 /lib/x86_64-linux-gnu/libgcc_s.so.1 7f2bbc9d1000-7f2bbcbd0000 ---p 00016000 08:07 136362 /lib/x86_64-linux-gnu/libgcc_s.so.1 7f2bbcbd0000-7f2bbcbd1000 rw-p 00015000 08:07 136362 /lib/x86_64-linux-gnu/libgcc_s.so.1 7f2bbcbd1000-7f2bbcd91000 r-xp 00000000 08:07 135915 /lib/x86_64-linux-gnu/libc-2.23.so 7f2bbcd91000-7f2bbcf91000 ---p 001c0000 08:07 135915 /lib/x86_64-linux-gnu/libc-2.23.so 7f2bbcf91000-7f2bbcf95000 r--p 001c0000 08:07 135915 /lib/x86_64-linux-gnu/libc-2.23.so 7f2bbcf95000-7f2bbcf97000 rw-p 001c4000 08:07 135915 /lib/x86_64-linux-gnu/libc-2.23.so 7f2bbcf97000-7f2bbcf9b000 rw-p 00000000 00:00 0 7f2bbcf9b000-7f2bbcfc1000 r-xp 00000000 08:07 135871 /lib/x86_64-linux-gnu/ld-2.23.so 7f2bbd1a3000-7f2bbd1a6000 rw-p 00000000 00:00 0 7f2bbd1bf000-7f2bbd1c0000 rw-p 00000000 00:00 0 7f2bbd1c0000-7f2bbd1c1000 r--p 00025000 08:07 135871 /lib/x86_64-linux-gnu/ld-2.23.so 7f2bbd1c1000-7f2bbd1c2000 rw-p 00026000 08:07 135871 /lib/x86_64-linux-gnu/ld-2.23.so 7f2bbd1c2000-7f2bbd1c3000 rw-p 00000000 00:00 0 7ffc4547e000-7ffc4549f000 rw-p 00000000 00:00 0 [stack] 7ffc455c9000-7ffc455cc000 r--p 00000000 00:00 0 [vvar] 7ffc455cc000-7ffc455ce000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] 已經終止 (core dumped) ``` ### 修正 REF 版本 delete 功能 * 修改 `tst.c` 中 tst_del_word() 的最後一個參數，將原本寫死的 1 改為 cpy ```clike= -tst_del_word(root, curr, &stk, 1); +tst_del_word(root, curr, &stk, cpy); ``` ### 修正 REF 版本 quit 功能 * quit 功能會出錯則是因為 `test_ref.c` 中 tst_free_all(root) 會 free 到 REF 版本中宣告的二維陣列的位址，若改成 tst_free(root) 就不會出錯了 ```clike= -tst_free_all(root); +tst_free(root) ``` ## 比較 REF 與 CPY 的效能先來測試一下兩種版本的效能，以執行 `s Tai` 100次為例 * $ make bench ``` Performance counter stats for './test_cpy --bench s Tai' (100 runs): 405,229 cache-misses # 25.640 % of all cache refs ( +- 1.07% ) 1,580,480 cache-references ( +- 0.32% ) 449,637,901 instructions # 1.17 insn per cycle ( +- 0.00% ) 385,322,883 cycles ( +- 0.22% ) 0.132403198 seconds time elapsed ( +- 0.32% ) ``` ``` Performance counter stats for './test_ref --bench s Tai' (100 runs): 715,608 cache-misses # 35.563 % of all cache refs ( +- 1.47% ) 2,012,240 cache-references ( +- 0.42% ) 469,495,339 instructions # 1.00 insn per cycle ( +- 0.00% ) 470,839,058 cycles ( +- 0.49% ) 0.162691614 seconds time elapsed ( +- 0.62% ) ``` * CPY 優於 REF 在 REF 版本，使用陣列儲存 ternary search tree 中的某些節點，cache 的 spatial locality 特性反而有機會在搜尋中將許多尚不需用到的 data 搬到 cache 中，造成 cache-misses 較高，並且由於陣列宣告成固定大小，還會產生 memory fragmentation 的問題。 ## 實做 memory pool 參考 [ChiuYiTang](https://hackmd.io/s/ByeocUhnZ#) 的作法 * 只紀錄兩個指標，一個 pPool 指向整個 memory pool 的開頭，一個 pTop 則指向下一次要分配出的記憶體位址。 ```clike= int main() { ... char *pPool = (char *) malloc(poolSize * sizeof(char)); char *pTop = pPool; while ((rtn = fscanf(fp, "%s", pTop)) != EOF) { t1 = tvgetf(); char *p = pTop; /* FIXME: insert reference to each string */ if (!tst_ins_del(&root, &p, INS, REF)) { fprintf(stderr, "error: memory exhausted, tst_insert.\n"); fclose(fp); return 1; } idx++; pTop += (strlen(pTop) + 1); } ... case 'a': printf("enter word to add: "); if (bench_flag == 0) { if (!fgets(pTop, sizeof word, stdin)) { fprintf(stderr, "error: insufficient input.\n"); break; } rmcrlf(pTop); } else { strcpy(pTop, argv[3]); } p = pTop; t1 = tvgetf(); /* FIXME: insert reference to each string */ res = tst_ins_del(&root, &p, INS, REF); t2 = tvgetf(); if (res) { idx++; pTop += (strlen(pTop) + 1); printf(" %s - inserted in %.6f sec. (%d words in tree)\n",(char *) res, t2 - t1, idx); } else printf(" %s - already exists in list.\n", (char *) res); } ... case 'q': tst_free(root); free(pPool); return 0; ... ``` * $ make bench ``` Performance counter stats for './test_cpy --bench s Tai' (100 runs): 444,998 cache-misses # 27.656 % of all cache refs ( +- 1.78% ) 1,609,058 cache-references ( +- 0.60% ) 480,816,403 instructions # 1.09 insn per cycle ( +- 0.02% ) 442,471,864 cycles ( +- 0.50% ) 0.152579698 seconds time elapsed ( +- 0.61% ) ``` ``` Performance counter stats for './test_ref --bench s Tai' (100 runs): 404,904 cache-misses # 25.170 % of all cache refs ( +- 1.48% ) 1,608,690 cache-references ( +- 0.46% ) 464,059,245 instructions # 1.10 insn per cycle ( +- 0.00% ) 422,942,275 cycles ( +- 0.47% ) 0.145401520 seconds time elapsed ( +- 0.56% ) ``` 參考 [rex662624](https://hackmd.io/s/SJbHD5sYM#) 將 insert 的時間抓出來比較 * $ make plot ![](https://i.imgur.com/Qw5JCoh.png) memory pool 的好處在於省去了頻繁的 malloc() 與 free()，可以看到在建立 ternary search tree 的效能上 REF 略優於 CPY。 ## Prefix 缺陷 ### 搜尋城市名問題 * 城市中間有空白會被分開 * 逗號會被加到城市名稱裡面可以利用 fscanf 裡類似正規表達式的特殊寫法直接找出字串裡的程式名稱和國家名稱把原本的%s 改成 %256[^,], %256[^\n]\n 其中[^\n]代表的是不包含 \n 字元，詳細說明可以參考 fprintf 的 man page ### 刪除不存在的字串會變成輸入 ### 發現 prefix search 查詢 A 的時候會 segmentation fault ### reference [afcidk](https://hackmd.io/s/HynvxMxcM#) [bauuuu1021](https://hackmd.io/s/rJd31ZRYz#Prefix-%E7%BC%BA%E9%99%B7) [2017重做同學](https://hackmd.io/s/r1nQinIbz#) ## tst_traverse_fn [TingL7](https://hackmd.io/s/r11sE83cz#)的解說十分詳盡完整 * callback function * 根據維基百科定義：a **callback** is any executable code that is passed as an argument to other code, which is expected to _call back_ (execute) the argument at a given time. * 在 c 語言中， callback function 就是將 function pointer 當作參數傳入另一個 function 中執行。 * `tst_traverse_fn` function * 就是一個遞迴的 callback function ，它將 fn() 當作參數。 ```c /** tst_traverse_fn(), traverse tree calling 'fn' on each word. * prototype for 'fn' is void fn(const void *, void *). data can * be NULL if unused. */ void tst_traverse_fn(const tst_node *p, void(fn)(const void *, void *), void *data) { if (!p) return; tst_traverse_fn(p->lokid, fn, data); if (p->key) tst_traverse_fn(p->eqkid, fn, data); else fn(p, data); tst_traverse_fn(p->hikid, fn, data); } ``` * 這個函式的功能就是走訪整個樹的節點，並且在 leaves 執行使用者自訂函式 fn()。 * 應用： * 計算 leaves 數量 ``` void fn(const void *p, void *data) { ++*(int *)data; } tst_traverse_fn(root, fn, &data); printf("data: %d\n", data); ``` 結果 ``` data: 85500 ``` * 效能評比可以比較兩者的走訪速度。 ``` t1 = tvgetf(); tst_traverse_fn(root, fn, &data); t2 = tvgetf(); printf("cpy time: %.6f\n", t2-t1); ``` 結果 ``` cpy time: 0.015064 ref time: 0.015275 ``` * 印出資料庫中所有的國家、城市 ``` void fn(const void *p, void *data){ printf("%s\n", tst_get_string(p)); } tst_traverse_fn(root, fn, &data); ``` 結果 ``` ... Tufino, Tuftonboro, Tuganay, Tugas, Tugaya, Tugbong, Tugdan, Tuggen, Tuglie, Tugolesskiy Tugos, Tuguegarao Tugulym, Tugun, ... ```