2018q1 Homework4 (prefix-search)

--- tags: linyunwen --- # 2018q1 Homework4 (prefix-search) ###### tags: `linyunwen` contributed by <`LinYunWen`> - [D04: prefix-search](https://hackmd.io/s/ByFWoR_tz) :::info ## Envorinment * Architecture: x86_64 * CPU 作業模式： 32-bit, 64-bit * Byte Order: Little Endian * CPU(s): 4 * On-line CPU(s) list: 0-3 * 每核心執行緒數：2 * 每通訊端核心數：2 * Socket(s): 1 * NUMA 節點： 1 * 供應商識別號： GenuineIntel * CPU 家族： 6 * 型號： 60 * Model name: Intel(R) Core(TM) i5-4200H CPU @ 2.80GHz * 製程： 3 * CPU MHz： 2793.554 * CPU max MHz: 3400.0000 * CPU min MHz: 800.0000 * BogoMIPS: 5587.10 * 虛擬： VT-x * L1d 快取： 32K * L1i 快取： 32K * L2 快取： 256K * L3 快取： 3072K ::: ## 了解 ternary search tree - 閱讀 - [wiki-Ternary tree](https://en.wikipedia.org/wiki/Ternary_search_tree) - [字典樹介紹](http://pisces.ck.tp.edu.tw/~peng/index.php?action=showfile&file=f743c2923f8170798f62a585257fdd8436cd73b6d) - [```rex662624``` 的介紹](https://hackmd.io/s/SJbHD5sYM) - 理解 1. 原以為 ternary tree 就是字典樹，但其實這樣的說法並不完全，ternary tree 是字典樹 (tire) 的一種 ### tire (字典樹) - 也被稱之為單詞搜尋樹、前綴樹，是一種特殊的樹狀資料結構，善用字串的特性，將每個字的共同前輟 (common prefix) 當作儲存的依據，加速搜尋時間。每個節點 (node) 都代表一個字母，根結點 (root) 除外 ```(根節點要應付空字串)```，因此每一個節點的子孫都有相同的前輟 - 重點是 1. 每個節點可以有 26 個子節點 (若不分大小寫的話) 2. 每個節點都可能代表一組字串，不一定要到達葉子 3. 整棵樹的高度為最長字串長度 + 1 ![](https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Trie_example.svg/250px-Trie_example.svg.png) - 在應用上，大多出現在 1. 詞頻統計 2. 前輟匹配如相關搜尋，或是網頁上的字詞搜尋 ### Ternary tree (三元樹) - 他是介於 tire 、 binary tree 的樹狀結構，因為每個節點不是像字元樹一樣都存下來，也不是像二元樹般只有兩個，三元樹有三個子節點，個別為```等位子節點 (equal kid)```、```低位子節點 (lo kid)```、```高位子節點 (hi kid)``` - 重點是 1. 因為根節點上是直接存放第一個 insert 的第一個字母 2. 其每個節點只有三個子節點，因此相較原本的 tire 會比較節省空間 3. 但是也因此造成樹狀結構較深，搜尋的速度在某些狀況下會相對比較慢 ![](https://i.imgur.com/EFrZ5yp.png) ![](https://i.imgur.com/s9f4plv.png) 圖為上圖的 Ternary tree - 在應用上，大多用於拼寫檢查、自動完成 :::warning TST 有個隱藏問題，它會因為字串插入的順序不同，而產生不同結構的樹，也因此如何安排插入順序使得深度最淺 ```（每個節點都塞滿三個子節點）``` ，也是值得思考 ::: ## 理解 prefix search 程式碼 - 首先觀察整份檔案，看到 - tst.c, tst.h => 為 TST 相關函式，如 insert, delete 等等 > 可以看到一個很重要的地方是 > ```c=21 > // tst.h > void *tst_ins_del(tst_node **root, > char *const *s, > const int del, > const int cpy); > ``` > 最後一個參數即為要以 copy or reference 的方式執行 - cities.txt => 為存放所有城市資料的文件檔 - test_cpy.c, test_ref.c => 為 main function 所在，主要為程式執行時所用 copy or reference > 兩者接設有 > ```c=8 > /** constants insert, delete, max word(s) & stack nodes */ > enum { INS, DEL, WRDMAX = 256, STKMAX = 512, LMAX = 1024 }; > #define REF INS > #define CPY DEL > ``` > 因為 enum 是遞增，故 ```INS=0```, ```DEL=1``` > 同時 ```REF=0```, ```CPY=1``` - 進一步看到 FIXME 所在 1. 在 test_ref.c ```c=54 /* FIXME: insert reference to each string */ if (!tst_ins_del(&root, &p, INS, CPY)) { fprintf(stderr, "error: memory exhausted, tst_insert.\n"); fclose(fp); return 1; } ``` 2. 在 test_ref.c ```c=89 /* FIXME: insert reference to each string */ res = tst_ins_del(&root, &p, INS, CPY); t2 = tvgetf(); if (res) { idx++; printf(" %s - inserted in %.6f sec. (%d words in tree)\n", (char *) res, t2 - t1, idx); } else printf(" %s - already exists in list.\n", (char *) res); ``` 3. 在 test_ref.c ```c=141 /* FIXME: remove reference to each string */ res = tst_ins_del(&root, &p, DEL, CPY); t2 = tvgetf(); if (res) printf(" delete failed.\n"); else { printf(" deleted %s in %.6f sec\n", word, t2 - t1); idx--; } ``` ### 排除 FIXME > brench bug/fix-me 在 test_ref.c 中應該要操作的是 reference 的方式，因此 ```tst_ins_del``` 最後一個參數應為 REF 1. test_ref.c (55) ```clike - tst_ins_del(&root, &p, INS, CPY) + tst_ins_del(&root, &p, INS, REF) // 樣式參考 <rex662624> ``` 更改完後執行，發現有錯誤 ```shell= choice: s find words matching prefix (at least 1 char): Tain Tain - searched prefix in 0.000010 sec suggest[0] : Tain suggest[1] : Tain suggest[2] : Tain suggest[3] : Tain suggest[4] : Tain suggest[5] : Tain suggest[6] : Tain suggest[7] : Tain suggest[8] : Tain suggest[9] : Tain suggest[10] : Tain ``` 因為只更改了 REF ，很有可能是 ```tst_ins_del``` 在某部份的運行跟預想不一樣，剛好看到 272 行是控制點 ```c=268 /* Place nodes until end of the string, at end of string allocate * space for data, copy data as final eqkid, and return. */ if (*p++ == 0) { if (cpy) { /* allocate storage for 's' */ const char *eqdata = strdup(*s); if (!eqdata) return NULL; curr->eqkid = (tst_node *) eqdata; return (void *) eqdata; } else { /* save pointer to 's' (allocated elsewhere) */ curr->eqkid = (tst_node *) *s; return (void *) *s; } } ``` > 先看到 273 行這裡利用 ```strdup``` (string duplication) 來對字串重新安排一個空間進行複製，因此對 copy 來說沒有什麼問題 > :::info ```p``` 為指向字串第一個字母的指標，因此 ```*p``` 為該個字母，而 ```*p++``` 表示的是檢驗字串中的每個字母，並利用 ++ 指向下個字母的位址，雖然 ++ 最後做，先 ```*p``` 再 ```++``` ，但是後來的 ++ 只針對 p ，也就是 ```p++```，而非字母的數值加一，故 ```*p++ == 0``` 為 true 表示字串結束 ::: 而會造成這個錯誤的原因就是，因為 s 是由 main function 裡的 p 傳入位址，而 p 又是讀取 word 的位址 ```c=52 while ((rtn = fscanf(fp, "%s", word)) != EOF) { char *p = word; if (!tst_ins_del(&root, &p, INS, REF)) { ... ``` word 只是讀取資料時的一個變數，他在讀取每一筆資料時，也只是變更其值，而非更改位址，因此每次傳入的 p 都是相同位址， s 也就都為相同值如果輸出 s 的位置來看 ```c=279 // tst.c printf("%p \n", s); curr->eqkid = (tst_node *) strdup(*s); return (void *) *s; ``` 結果就是如下 ```shell= linyunwen@linyunwen-X550JK:~/course/embedded/hw2/prefix-search$ ./test_ref 0x7ffceefcffb8 0x7ffceefcffb8 0x7ffceefcffb8 0x7ffceefcffb8 0x7ffceefcffb8 0x7ffceefcffb8 ... ``` 因此要有另外一個位址紀錄字串，簡單來說可以像下面這樣，但是這個就跟 copy 的版本一樣了，所以嘗試在別的地方宣告出這個值 ```c=271 // tst.c if (*p++ == 0) { const char *eqdata = strdup(*s); if (cpy) { /* allocate storage for 's' */ if (!eqdata) return NULL; curr->eqkid = (tst_node *) eqdata; return (void *) eqdata; } else { /* save pointer to 's' (allocated elsewhere) */ curr->eqkid = (tst_node *) eqdata; return (void *) *eqdata; } } ``` 參考 <`tina0405`> 的作法是在一開始就建立一個 array 用來存放那些字串。讀取時，改為將這些字串位址傳入，就不會有同一個重複位址的問題建立 ```word_saver``` 來儲存 words ```c=54 char word_saver[10000][WRDMAX]; while ((rtn = fscanf(fp, "%s", word_saver[idx])) != EOF) { char *p = word_saver[idx]; if (!tst_ins_del(&root, &p, INS, REF)) { ... ``` 因為整體上無法容納全部 25 萬多筆的城市名，因此暫且只能先將數量維持在 10000 左右 - test ```shell perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./test_cpy perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./test_ref ``` ```shell Performance counter stats for './test_cpy' (100 runs): 27,323 cache-misses # 37.271 % of all cache refs ( +- 2.53% ) 73,309 cache-references ( +- 2.17% ) 20,472,722 instructions # 0.99 insn per cycle ( +- 0.02% ) 20,612,748 cycles ( +- 1.47% ) 0.006955518 seconds time elapsed ( +- 1.84% ) ---------------------------------------------------------------- Performance counter stats for './test_ref' (100 runs): 39,510 cache-misses # 39.766 % of all cache refs ( +- 2.78% ) 99,355 cache-references ( +- 2.09% ) 20,939,620 instructions # 0.84 insn per cycle ( +- 0.03% ) 25,067,981 cycles ( +- 1.53% ) 0.008300689 seconds time elapsed ( +- 1.52% ) ``` :::warning 問題: 1. 雖說似乎有個解法，但是這樣先建立一個儲存空間來存放輸入的資料，和在插入前先複製一份的複製版本好像沒有太大差別，這樣 reference 的效果在哪裡？ 2. 當嘗試 strdup 動態分配記憶體空間時，可以將全部的程式資料都倒進來，但是安排成 array 就無法，是因為 array 記憶體要是連續的嗎？還是因為 strdup 會給剛好的大小， array 會浪費很多空間？ ::: ## 建立 make bench - 要修改 makefile 使得可以利用 make bench 或 --bench 參數來讓程式執行效能分析 - 參考上一份作業類似部份 ```c= EXEC = phonebook_orig phonebook_opt cache-test: $(EXEC) perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./phonebook_orig perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./phonebook_opt ``` 修改為 ```c= TESTS = test_cpy test_ref bench: $(TESTS) perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_cpy perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_ref ``` 但是問題會先遇到，執行時它是個無限迴圈，並且會要手動輸入參數，因此這部份利用 argv, argc 來處理，先假定預期輸入為 ```shell= // 單純分析 ./test_cpy --bench // 自動輸入參數 ./test_cpy --bench s Man ``` - [C argc and argv Examples to Parse Command Line Arguments](https://www.thegeekstuff.com/2013/01/c-argc-argv/) - [argc and argv](http://crasseux.com/books/ctutorial/argc-and-argv.html) :::warning char **argv 和 char *argv[] 沒有太大差別 - [C/C++ 程式設計教學：main 函數讀取命令列參數，argc 與 argv 用法](https://blog.gtwang.org/programming/c-cpp-tutorial-argc-argv-read-command-line-arguments/) ::: 建立兩個 flag 來判斷接下來要用哪種模式 ```c= int bench_flag = 0, arg_flag = 0; // bench if (strcmp(argv[1], "--bench") == 0) { bench_flag = 1; } if (argc >= 4) { arg_flag = 1; } ``` ```c=90 // input mode if (arg_flag) { strcpy(word, argv[2]); } else { fgets(word, sizeof word, stdin); } ``` ```c=133 // input search prefix if (arg_flag) { strcpy(word, argv[3]); } else { if (!fgets(word, sizeof word, stdin)) { fprintf(stderr, "error: insufficient input.\n"); break; } } ``` > 小插曲: > 原本想用 boolen type 宣告 flag ，但是遇到 ```error: unknown type name ‘bool’``` ，查一下才知道原來 boolen type 並非一直都有 > [error: unknown type name ‘bool’ - Stackoverflow](https://stackoverflow.com/questions/8133074/error-unknown-type-name-bool) 接好可以吃參數的接口後，就將 makefile 也更改成引入參數的樣式 ```c= bench: $(TESTS) perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_cpy --bench s Man perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_ref --bench s Man ``` - 希望可以有 cycle 數的分析，並且畫出圖來 - 因此首先設立個 random 機制，隨機選取一個 data 的前三個字作為 prefix - [亂數的使用](http://dhcp.tcgs.tc.edu.tw/c/p005.htm) ```c= #define SEARCH_PREFIX 3 srand(time(NULL)); rand_i = rand()%10000; if (idx == rand_i) { strncpy(search_prefix, word, (size_t) SEARCH_PREFIX); } ``` 再來將 search prefix 用剛剛隨機選取到的作為輸入 ```c= if (arg_flag) { // strcpy(word, argv[3]); strcpy(word, search_prefix); } else { ... ``` 基本上在取得 cycle 數然後輸出成檔案，將其用 gnuplot 畫出即可，但是目前卡在取得 cycle 數上，我在網路上找不太到能夠抓取 cycle 的方式，大多都為取時間