2018q1 Homework4 (prefix-search)

tags: `linyunwen`

contributed by <LinYunWen>

D04: prefix-search

Envorinment

Architecture: x86_64
CPU 作業模式： 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
每核心執行緒數：2
每通訊端核心數：2
Socket(s): 1
NUMA 節點： 1
供應商識別號： GenuineIntel
CPU 家族： 6
型號： 60
Model name: Intel® Core™ i5-4200H CPU @ 2.80GHz
製程： 3
CPU MHz： 2793.554
CPU max MHz: 3400.0000
CPU min MHz: 800.0000
BogoMIPS: 5587.10
虛擬： VT-x
L1d 快取： 32K
L1i 快取： 32K
L2 快取： 256K
L3 快取： 3072K

了解 ternary search tree

閱讀
理解
1. 原以為 ternary tree 就是字典樹，但其實這樣的說法並不完全，ternary tree 是字典樹 (tire) 的一種
tire (字典樹)
- 也被稱之為單詞搜尋樹、前綴樹，是一種特殊的樹狀資料結構，善用字串的特性，將每個字的共同前輟 (common prefix) 當作儲存的依據，加速搜尋時間。每個節點 (node) 都代表一個字母，根結點 (root) 除外 (根節點要應付空字串)，因此每一個節點的子孫都有相同的前輟
- 重點是
  1. 每個節點可以有 26 個子節點 (若不分大小寫的話)
  2. 每個節點都可能代表一組字串，不一定要到達葉子
  3. 整棵樹的高度為最長字串長度 + 1
- 在應用上，大多出現在
  1. 詞頻統計
  2. 前輟匹配
  如相關搜尋，或是網頁上的字詞搜尋
Ternary tree (三元樹)
- 他是介於 tire 、 binary tree 的樹狀結構，因為每個節點不是像字元樹一樣都存下來，也不是像二元樹般只有兩個，三元樹有三個子節點，個別為等位子節點 (equal kid)、低位子節點 (lo kid)、高位子節點 (hi kid)
- 重點是
  1. 因為根節點上是直接存放第一個 insert 的第一個字母
  2. 其每個節點只有三個子節點，因此相較原本的 tire 會比較節省空間
  3. 但是也因此造成樹狀結構較深，搜尋的速度在某些狀況下會相對比較慢
  圖為上圖的 Ternary tree
- 在應用上，大多用於拼寫檢查、自動完成
TST 有個隱藏問題，它會因為字串插入的順序不同，而產生不同結構的樹，也因此如何安排插入順序使得深度最淺 （每個節點都塞滿三個子節點） ，也是值得思考

理解 prefix search 程式碼

首先觀察整份檔案，看到
- tst.c, tst.h => 為 TST 相關函式，如 insert, delete 等等
  可以看到一個很重要的地方是
```
// tst.h
void *tst_ins_del(tst_node **root,
                  char *const *s,
                  const int del,
                  const int cpy);
```
  最後一個參數即為要以 copy or reference 的方式執行
- cities.txt => 為存放所有城市資料的文件檔
- test_cpy.c, test_ref.c => 為 main function 所在，主要為程式執行時所用 copy or reference
  兩者接設有
```
/** constants insert, delete, max word(s) & stack nodes */
enum { INS, DEL, WRDMAX = 256, STKMAX = 512, LMAX = 1024 };
#define REF INS
#define CPY DEL
```
  因為 enum 是遞增，故 INS=0, DEL=1
  同時 REF=0, CPY=1

進一步看到 FIXME 所在

在 test_ref.c






/* FIXME: insert reference to each string */
if (!tst_ins_del(&root, &p, INS, CPY)) {
    fprintf(stderr, "error: memory exhausted, tst_insert.\n");
    fclose(fp);
    return 1;
}

在 test_ref.c









/* FIXME: insert reference to each string */
res = tst_ins_del(&root, &p, INS, CPY);
t2 = tvgetf();
if (res) {
    idx++;
    printf("  %s - inserted in %.6f sec. (%d words in tree)\n",
           (char *) res, t2 - t1, idx);
} else
    printf("  %s - already exists in list.\n", (char *) res);

在 test_ref.c









/* FIXME: remove reference to each string */
res = tst_ins_del(&root, &p, DEL, CPY);
t2 = tvgetf();
if (res)
    printf("  delete failed.\n");
else {
    printf("  deleted %s in %.6f sec\n", word, t2 - t1);
    idx--;
}

排除 FIXME

brench bug/fix-me

在 test_ref.c 中應該要操作的是 reference 的方式，因此 tst_ins_del 最後一個參數應為 REF

test_ref.c (55)

- tst_ins_del(&root, &p, INS, CPY)
+ tst_ins_del(&root, &p, INS, REF)
// 樣式參考 <rex662624>

更改完後執行，發現有錯誤















choice: s
find words matching prefix (at least 1 char): Tain
  Tain - searched prefix in 0.000010 sec

suggest[0] : Tain
suggest[1] : Tain
suggest[2] : Tain
suggest[3] : Tain
suggest[4] : Tain
suggest[5] : Tain
suggest[6] : Tain
suggest[7] : Tain
suggest[8] : Tain
suggest[9] : Tain
suggest[10] : Tain

因為只更改了 REF ，很有可能是 tst_ins_del 在某部份的運行跟預想不一樣，剛好看到 272 行是控制點















/* Place nodes until end of the string, at end of string allocate
 * space for data, copy data as final eqkid, and return.
 */
if (*p++ == 0) {
    if (cpy) { /* allocate storage for 's' */
        const char *eqdata = strdup(*s);
        if (!eqdata)
            return NULL;
        curr->eqkid = (tst_node *) eqdata;
        return (void *) eqdata;
    } else { /* save pointer to 's' (allocated elsewhere) */
        curr->eqkid = (tst_node *) *s;
        return (void *) *s;
    }
}

先看到 273 行這裡利用 strdup (string duplication) 來對字串重新安排一個空間進行複製，因此對 copy 來說沒有什麼問題

p 為指向字串第一個字母的指標，因此 *p 為該個字母，而 *p++ 表示的是檢驗字串中的每個字母，並利用 ++ 指向下個字母的位址，雖然 ++ 最後做，先 *p 再 ++ ，但是後來的 ++ 只針對 p ，也就是 p++，而非字母的數值加一，故 *p++ == 0 為 true 表示字串結束

而會造成這個錯誤的原因就是，因為 s 是由 main function 裡的 p 傳入位址，而 p 又是讀取 word 的位址




while ((rtn = fscanf(fp, "%s", word)) != EOF) {
    char *p = word;
    if (!tst_ins_del(&root, &p, INS, REF)) {
    ...

word 只是讀取資料時的一個變數，他在讀取每一筆資料時，也只是變更其值，而非更改位址，因此每次傳入的 p 都是相同位址， s 也就都為相同值
如果輸出 s 的位置來看




// tst.c
printf("%p \n", s);
curr->eqkid = (tst_node *) strdup(*s);
return (void *) *s;

結果就是如下








linyunwen@linyunwen-X550JK:~/course/embedded/hw2/prefix-search$ ./test_ref
0x7ffceefcffb8
0x7ffceefcffb8
0x7ffceefcffb8
0x7ffceefcffb8
0x7ffceefcffb8
0x7ffceefcffb8
...

因此要有另外一個位址紀錄字串，簡單來說可以像下面這樣，但是這個就跟 copy 的版本一樣了，所以嘗試在別的地方宣告出這個值













// tst.c
if (*p++ == 0) {
    const char *eqdata = strdup(*s);
    if (cpy) { /* allocate storage for 's' */
        if (!eqdata)
            return NULL;
        curr->eqkid = (tst_node *) eqdata;
        return (void *) eqdata;
    } else { /* save pointer to 's' (allocated elsewhere) */
        curr->eqkid = (tst_node *) eqdata;
        return (void *) *eqdata;
    }
}

參考 <tina0405> 的作法是在一開始就建立一個 array 用來存放那些字串。讀取時，改為將這些字串位址傳入，就不會有同一個重複位址的問題

建立 word_saver 來儲存 words





char word_saver[10000][WRDMAX];
while ((rtn = fscanf(fp, "%s", word_saver[idx])) != EOF) {
    char *p = word_saver[idx];
    if (!tst_ins_del(&root, &p, INS, REF)) {
    ...

因為整體上無法容納全部 25 萬多筆的城市名，因此暫且只能先將數量維持在 10000 左右

test

perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./test_cpy

perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./test_ref

 Performance counter stats for './test_cpy' (100 runs):

        27,323      cache-misses              #   37.271 % of all cache refs      ( +-  2.53% )
        73,309      cache-references                                              ( +-  2.17% )
    20,472,722      instructions              #    0.99  insn per cycle           ( +-  0.02% )
    20,612,748      cycles                                                        ( +-  1.47% )

   0.006955518 seconds time elapsed                                          ( +-  1.84% )
----------------------------------------------------------------
 Performance counter stats for './test_ref' (100 runs):

        39,510      cache-misses              #   39.766 % of all cache refs      ( +-  2.78% )
        99,355      cache-references                                              ( +-  2.09% )
    20,939,620      instructions              #    0.84  insn per cycle           ( +-  0.03% )
    25,067,981      cycles                                                        ( +-  1.53% )

   0.008300689 seconds time elapsed                                          ( +-  1.52% )

問題:

雖說似乎有個解法，但是這樣先建立一個儲存空間來存放輸入的資料，和在插入前先複製一份的複製版本好像沒有太大差別，這樣 reference 的效果在哪裡？
當嘗試 strdup 動態分配記憶體空間時，可以將全部的程式資料都倒進來，但是安排成 array 就無法，是因為 array 記憶體要是連續的嗎？還是因為 strdup 會給剛好的大小， array 會浪費很多空間？

建立 make bench

要修改 makefile 使得可以利用 make bench 或 –bench 參數來讓程式執行效能分析
參考上一份作業類似部份









EXEC = phonebook_orig phonebook_opt

cache-test: $(EXEC)
    perf stat --repeat 100 \
        -e cache-misses,cache-references,instructions,cycles \
        ./phonebook_orig
    perf stat --repeat 100 \
        -e cache-misses,cache-references,instructions,cycles \
        ./phonebook_opt

修改為









TESTS = test_cpy test_ref
    
bench: $(TESTS)
	perf stat --repeat 100 \
		-e cache-misses,cache-references,instructions,cycles \
		./test_cpy
	perf stat --repeat 100 \
		-e cache-misses,cache-references,instructions,cycles \
		./test_ref

但是問題會先遇到，執行時它是個無限迴圈，並且會要手動輸入參數，因此這部份利用 argv, argc 來處理，先假定預期輸入為





// 單純分析
./test_cpy --bench 

// 自動輸入參數
./test_cpy --bench s Man

C argc and argv Examples to Parse Command Line Arguments

argc and argv

char **argv 和 char *argv[] 沒有太大差別

C/C++ 程式設計教學：main 函數讀取命令列參數，argc 與 argv 用法

建立兩個 flag 來判斷接下來要用哪種模式









int bench_flag = 0, arg_flag = 0;

// bench 
if (strcmp(argv[1], "--bench") == 0) {
    bench_flag = 1;
}
if (argc >= 4) {
    arg_flag = 1;
}






// input mode
if (arg_flag) {
    strcpy(word, argv[2]);
} else {
    fgets(word, sizeof word, stdin);
}









// input search prefix
if (arg_flag) {
    strcpy(word, argv[3]);
} else {
    if (!fgets(word, sizeof word, stdin)) {
        fprintf(stderr, "error: insufficient input.\n");
        break;
    }
}

小插曲:
原本想用 boolen type 宣告 flag ，但是遇到 error: unknown type name ‘bool’ ，查一下才知道原來 boolen type 並非一直都有
error: unknown type name ‘bool’ - Stackoverflow

接好可以吃參數的接口後，就將 makefile 也更改成引入參數的樣式







bench: $(TESTS)
perf stat --repeat 100 \
	-e cache-misses,cache-references,instructions,cycles \
	./test_cpy --bench s Man
perf stat --repeat 100 \
	-e cache-misses,cache-references,instructions,cycles \
	./test_ref --bench s Man

希望可以有 cycle 數的分析，並且畫出圖來

因此首先設立個 random 機制，隨機選取一個 data 的前三個字作為 prefix

亂數的使用






#define SEARCH_PREFIX 3
srand(time(NULL));
rand_i = rand()%10000;
if (idx == rand_i) {
    strncpy(search_prefix, word, (size_t) SEARCH_PREFIX);
}

再來將 search prefix 用剛剛隨機選取到的作為輸入





if (arg_flag) {
    // strcpy(word, argv[3]);
    strcpy(word, search_prefix);
} else {
...

基本上在取得 cycle 數然後輸出成檔案，將其用 gnuplot 畫出即可，但是目前卡在取得 cycle 數上，我在網路上找不太到能夠抓取 cycle 的方式，大多都為取時間