2018q1 Homework2 (prefix-search)

contributed by <HexRabbit, Yuessiah^[1]>

Reviewed by `Yuessiah`

用詞應該更精確一點，或是提出符合邏輯的解釋。

開發環境

                   -`                 
                  .o+`                 HexRabbit
                 `ooo/                 OS: Arch Linux 
                `+oooo:                Kernel: x86_64 Linux 4.13.3-1-zen
               `+oooooo:               Virtualization: VT-x
               -+oooooo+:              CPU: Intel Core i5-3210M
             `/:-:++oooo+:             L1d cache: 32K
            `/++++/+++++++:            L1i cache: 32K
           `/++++++++++++++:           L2 cache: 256K
          `/+++ooooooooooooo/`         L3 cache: 3072K
         ./ooosssso++osssssso+`        GPU: intel
        .oossssso-````/ossssss+`       RAM: 7876MiB
       -osssssso.      :ssssssso.      
      :osssssss/        osssso+++.     
     /ossssssss/        +ssssooo/-      
   `/ossssso+/:-        -:/+osssso+-   
  `+sso+:-`                 `.-/+oso:  
 `++:.                           `-/+/
 .`                                 `/

Ternary Search Tree (TST)

一般用來儲存字串的資料結構有四種：Hash Table, Binary Search Tree, Trie, Ternary Search Tree

Hash Table 的查詢速度是最快的

O (1)

、建表時間為

O (n)

，但是需要另外為 table 分配一段空間，且對於不同大小的 data set 需要採用不同的 table size 才能避免 collision 發生。

Binary Search Tree 查詢與建表時間是中規中矩的

O (l o g_{2} n)

，但是在最糟的情況下時間複雜度可能轉為

O (n)

，除了要儲存的字串外不須另外分配空間。

Trie 的時間複雜度是

O (n)

，能夠實現前兩者難以做到的 prefix-search，缺點是會有大量的空指針表，造成空間開銷過大。

Ternary Search Tree 結合了 Binary Search Tree 與 Trie 的特性，結構與 Trie 十分相似，也同時有二元樹的影子，操作的時間複雜度為

O (l o g_{2} n)

，對於龐大資料量的壓縮也相當可觀。

"有二元樹的影子"，建議補充說明。
"Yuessiah"

以這次提供的 cities.txt 搜尋城市景點為例( 93827 筆, 984410 字元)，操作方式是先查詢城市後透過指標(64 bit)存取存放該城市的景點的記憶體，以下為各資料結構的記憶體消耗量：

Hash Table: 981140 + 49157(table size)*4 + 93827*8*2 = 2679000 Byte

Binary Search Tree: 981140 + 93827*8*3 = 3232988 Byte

Trie: 無資料，不過若從資料結構來判斷則可能比 TST 大
(在 TST 單層的 node 數少於(26*8)/(3*8+4+1) = 7個時)

Ternary Search Tree: 981140 + 16710080(實驗得出) = 17691220 Byte

結果相當悲劇，空間消耗量為前兩者的 6 ~ 7 倍，

「悲劇」不是精確的用語，請調整。這是技術報告，不是流水帳
"Yuessiah"

這可能是因為資料量不夠大，抑或是資料長短的差距過大，造成無法有效重複利用同一節點來節省空間。

圖像化

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

研究紀錄

沒想到才剛把專案 git clone 下來後執行 make 就突然噴出出現 error，

「噴出」不是精確的用語，請調整。這是技術報告，不是流水帳

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

jserv

test_cpy.c: In function ‘main’:
test_cpy.c:82:9: error: statement will never be executed [-Werror=switch-unreachable]
   char *p = NULL;
         ^

仔細看了一下才發現原來是個 warning 而已，不過被編譯選項 -Werror 改成 error 了，查看程式碼後發現該位置上的指令永遠不會被執行到

switch (*word) {
    char *p = NULL;  <--
    case 'a':
         :
         :
         :
}

查閱C99規格書§6.8.4.2，該處對此狀況提供了一個範例：

switch (expr)
{
    int i = 4;
    f(i);
    case 0:
        i = 17;
        /* falls through into default code */
    default:
        printf("%d\n", i);
}

the object whose identifier is i exists with automatic storage duration (within the block) but is never initialized, and thus if the controlling expression has a nonzero value, the call to the printf function will access an indeterminate value. Similarly, the call to the function f cannot be reached.

發現根據敘述 i 會被分配記憶體空間但永遠不會被初始化，所以可能會出現 undefined 的情形，將其改寫成在 switch 外宣告便可避免。
(不過在本份程式碼中沒有在宣告後直接使用的情形發生，即便不修改也不會出錯)

解決上述問題後，便透過 python 腳本 (下方 generate_input.py) 隨機從 cities.txt 和 female-names.txt 各抓出 10000 筆資料來測試有和沒有搜尋到資料的時間，前者是一定找得到的字串集 match_input.txt，後者為保證找不到的字串集 bad_input.txt ，之所以選擇 female-names 作為測資的理由是我想瞭解在完全搜尋不到資料的情形下會花費多少時間。(也意外的找到一個漏洞)

需要交代 female-names.txt 的出處，以及選用動機
"jserv"

感謝 Yuessiah 同學訂正
HexRabbit

generate_input.py:

#!/usr/bin/env python3
from random import randint
import sys


def main():
    name = open('dictionary/female-names.txt', 'r')
    city = open('dictionary/cities.txt', 'r')
    bad_in = open('bad_input.txt', 'w')
    match_in = open('match_input.txt', 'w')
    bad = []
    match = []

    num = int(sys.argv[1])

    for line in name:
        bad.append(line[:len(line)-1])

    for line in city:
        match.append(line[:len(line)-1])


    for i in range(num):
        bad_in.write('f\n')
        bad_in.write(bad[randint(0, len(bad)-1)]+'\n')

    for i in range(num):
        data = bad[randint(0, len(bad)-1)]
        bad_in.write('s\n')
        bad_in.write(data[:randint(1, len(data))]+'\n')

    for i in range(num):
        bad_in.write('d\n')
        bad_in.write(bad[randint(0, len(bad)-1)]+'\n')

    bad_in.write('e\nq\n')
    bad_in.close()

    for i in range(num):
        match_in.write('f\n')
        match_in.write(match[randint(0, len(match)-1)]+'\n')

    for i in range(num):
        data = match[randint(0, len(match)-1)]
        match_in.write('s\n')
        match_in.write(data[:randint(1, len(data))]+'\n')

    for i in range(num):
        match_in.write('d\n')
        match_in.write(match[randint(0, len(match)-1)]+'\n')

    match_in.write('e\nq\n')
    match_in.close()


if __name__ == '__main__':
    main()

沒想到在測試時居然會隨機出現 core dump，而在朋友的電腦上卻完全不會發生，令人相當頭痛。
把整份 code 都看過一遍還是沒有想法，不過後來用 gdb 跟了一次之後便很快的找出錯誤(早知道一開始就這樣做了…)

Program received signal SIGSEGV, Segmentation fault.                                                                                    
0x0000555555556082 in tst_suggest (p=0x5555561780c0, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:330       
330             a[(*n)++] = (char *) p->eqkid;                                                                                          
(gdb) bt                                                                                                                                
#0  0x0000555555556082 in tst_suggest (p=0x5555561780c0, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:330   
#1  0x000055555555603d in tst_suggest (p=0x555556178090, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:328   
#2  0x000055555555603d in tst_suggest (p=0x555556178060, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:328   
#3  0x000055555555603d in tst_suggest (p=0x555556178030, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:328   
#4  0x000055555555603d in tst_suggest (p=0x555556178000, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:328   
#5  0x000055555555603d in tst_suggest (p=0x555556177fd0, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:328   
#6  0x000055555555603d in tst_suggest (p=0x555556177fa0, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:328   
#7  0x00005555555560af in tst_suggest (p=0x555555dccbe0, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:331   
#8  0x0000555555556008 in tst_suggest (p=0x555555a60f10, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:326   
#9  0x0000555555556008 in tst_suggest (p=0x5555557d7690, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:326   
#10 0x000055555555603d in tst_suggest (p=0x5555557d7660, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:328   
#11 0x0000555555556008 in tst_suggest (p=0x555555780b80, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:326   
#12 0x00005555555560af in tst_suggest (p=0x55555576c320, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:331   
#13 0x0000555555556008 in tst_suggest (p=0x55555576b870, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:326   
#14 0x00005555555560af in tst_suggest (p=0x555555761590, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:331   
#15 0x000055555555603d in tst_suggest (p=0x555555761560, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:328   
#16 0x0000555555556008 in tst_suggest (p=0x555555759ac0, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:326   
#17 0x00005555555560af in tst_suggest (p=0x555555759980, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:331   
#18 0x000055555555603d in tst_suggest (p=0x555555759950, c=65 'A', nchr=1, a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:328   
#19 0x0000555555556199 in tst_search_prefix (root=0x5555557594a0, s=0x7fffffffe230 "\200yvUUU", a=0x7fffffffc230, n=0x7fffffffc1b4, max=1024) at tst.c:369        
#20 0x0000555555555145 in main (argc=1, argv=0x7fffffffe428) at test_cpy.c:126
(gdb) p *n
$4 = 1467

這才發現問題 (528a510)出在以下的 tst_suggest() 函式，如果在因為 *n == max 而 return 時，若前一層的 tst_suggest() 執行到 if (p->key) 的位置，並且p->key == '\0'，便可能使得 *n > max 而讓判斷式失效。
所以只要將該判斷式移至(*n)++之前即可。

實際上在研究過後，發現這是一個本應不該發生的bug

其發生的前提是:
某 node N 的 N->key == '\0'，且可以從 N->lokid 找到另一個 suggestion 讓 *n == max

但問題就是若該 node 的 key 為 '\0'，則應該是位於最左邊，不會有讓條件成立的可能…

查看 tst.c 中 insert node 的部分(tst_ins_del) 似乎也沒有問題

最後透過 gdb 在 *n == max 設置斷點追蹤發現一件奇怪的現象

gef> p *(struct tst_node *)p
$15 = {
  key = 0x0, 
  refcnt = 0xbb, 
  lokid = 0x63c670, 
  eqkid = 0x613730, 
  hikid = 0x62f020
}

gef> p *(struct tst_node *)p->lokid  
$16 = {
  key = 0xc4, 
  refcnt = 0x1, 
  lokid = 0x930500, 
  eqkid = 0x63c6a0, 
  hikid = 0x717460
}

'\0' 節點的確有 lokid，而且他的 key = 0xc4 不是一個正常的 ascii 字元
看到這我終於想到輸入可能有問題了，急忙打開 cities.txt 查看，搜尋由 A 開頭的資料

Ahmadābād, India
Al Jīzah, Egypt

才發現問題是出在 Unicode 身上，在讀入資料時是 char-by-char 的讀取，所以 Unicode 字元會被拆成多個 char 讀入，上述兩筆資料中的 Unicode 若對應到 binary 則會變成
'ā' == b'11000100 10000001'
'ī' == b'11000100 10101011'
又 char 是有 signed 且儲存相減結果的 diff 也是 signed，則此狀況下的 diff 為負數

        int diff;
        ...
        
        diff = *p - curr->key;        /* get ASCII diff for >, <, = */

而造成在 insert node 時的這個比較式

        } else if (diff < 0) {      /* if char less than node->key */
            pcurr = &(curr->lokid); /* get next lokid pointer address */
        } else {                    /* if char greater than node->key */
            pcurr = &(curr->hikid); /* get next hikid pointer address */
        }

把 Unicode 字元分配到 '\0' node 的 lokid(左邊) 去，而引發這個 bug

另外，我猜測在朋友的環境不會出現錯誤訊息的原因可能是因為gcc的版本不同，
使得編譯出的程式在 stack 上為 return address 與該 array(a) 之間留下較大的空間。(待確認)

附上對應的 Git commit 連結
"jserv"

感謝老師訂正
"HexRabbit

"function" 當作非數學用途時 (如上方段落)，繁體中文翻譯為「函式」，而非「函數」

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

jserv

{
-   if (!p || *n == max)
+   if (!p)
        return;
    tst_suggest(p->lokid, c, nchr, a, n, max);
    if (p->key)
        tst_suggest(p->eqkid, c, nchr, a, n, max);
    else if (*(((char *) p->eqkid) + nchr - 1) == c) {
+       if (*n == max) return;
        a[(*n)++] = (char *) p->eqkid;
    }
    tst_suggest(p->hikid, c, nchr, a, n, max);
}

本以為程式已經沒有問題了，但是我們在測試途中意外發現若在字元 X 不存在時執行 'd' 來刪除 X 第一次會失敗，但重新刪除一次居然會成功?
仔細搜尋了一下程式碼，發現錯誤發生在 tst_ins_del() 函式中，如果給定的 key 不在該層 TST 中， curr 會一路被向右(或左)更新，最後等於 NULL，此時會跳出 while，意外的進入到下方提供給 insert 模式使用的程式碼中，造成錯誤的 allocate 字串至 TST 上。
解決方式相當簡單，便是在下方 for 迴圈中檢查 del 變數即可。

while ((curr = *pcurr)) {
        diff = *p - curr->key;
        if (diff == 0) {
                :
                :
            pcurr = &(curr->eqkid); /* get next eqkid pointer address */
        } else if (diff < 0) {      /* if char less than node->key */
            pcurr = &(curr->lokid); /* get next lokid pointer address */
        } else {                    /* if char greater than node->key */
            pcurr = &(curr->hikid); /* get next hikid pointer address */
        }
                :
}

/* if not duplicate, insert remaining chars into tree rooted at curr */
for (;;) {
        /* allocate memory for node ... */
        :
        :
}

優化方式

嘗試修改原先串在鍊尾的字串的儲存方式，原本在 test_cpy.c 的作法是：
使用 strdup() 複製字串至一個空間，並將指標指向他

if (cpy) { /* allocate storage for 's' */
    const char *eqdata = strdup(*s);
    if (!eqdata)
        return NULL;
    curr->eqkid = (tst_node *) eqdata;
    return (void *) eqdata;
}

但此種方法的缺點十分明顯，不斷向系統要求記憶體空間可能導致資料片段化，也會花費大量運算時間。嘗試改進成一次分配較多空間，並減少 strdup() 呼叫。

char *pool_request(char *s)
{
    if(reflen + strlen(s) + 1 >= REFMAX) {
        refer[++ord] = (char*)malloc(sizeof(char)*REFMAX);
        reflen = 0;
    }
    char *fp = strcpy(refer[ord] + reflen, tmp);
    reflen += strlen(s) + 1;

    return fp;
}

效能分析

利用前述的隨機生成測試資料的小程式 generate_input.py 輸入 num 會分別產生兩個測試輸入檔案 (case 'f', 's', 'd' 分別操作 num 次 )
一個是一定找得到的字串集 match_input.txt
另一個保證找不到的字串集 bad_input.txt

CPY 的版本(num = 10000)

 Performance counter stats for './test_cpy':                                                                                            
                                                                                                                                        
       248,029,200      cache-misses:u            #   86.363 % of all cache refs      (83.31%)                                          
       287,192,521      cache-references:u                                            (83.33%)                                          
        83,810,174      branch-misses:u           #    2.63% of all branches          (83.33%)                                          
     3,188,692,031      branches:u                                                    (66.69%)                                          
    18,949,347,145      instructions:u            #    0.62  insn per cycle                                              (83.36%)       
    30,511,103,031      cycles:u                                                      (83.34%)                                          
                                                                                                                                        
      10.475868243 seconds time elapsed                                                                                                 
                                                                                                                                        
                                                                                                                                        
 Performance counter stats for './test_cpy':                                                                                            
                                                                                                                                        
         1,098,522      cache-misses:u            #   57.685 % of all cache refs      (81.99%)                                          
         1,904,339      cache-references:u                                            (83.65%)                                          
         1,561,908      branch-misses:u           #    1.53% of all branches          (83.80%)                                          
       102,304,780      branches:u                                                    (67.69%)                                          
       499,282,191      instructions:u            #    1.07  insn per cycle                                              (83.90%)       
       468,214,590      cycles:u                                                      (83.64%)                                          
                                                                                                                                        
       0.167016863 seconds time elapsed

REF的版本(num = 10000)

 Performance counter stats for './test_ref':                                                                                            
                                                                                                                                        
       268,692,862      cache-misses:u            #   85.492 % of all cache refs      (83.33%)                                          
       314,288,495      cache-references:u                                            (83.33%)                                          
        85,748,320      branch-misses:u           #    2.70% of all branches          (83.33%)                                          
     3,179,698,929      branches:u                                                    (66.70%)                                          
    18,939,634,492      instructions:u            #    0.56  insn per cycle                                              (83.35%)
    33,757,518,712      cycles:u                                                      (83.33%)

      11.246064683 seconds time elapsed


 Performance counter stats for './test_ref':

           285,129      cache-misses:u            #   30.635 % of all cache refs      (80.67%)
           930,744      cache-references:u                                            (83.39%)
         1,355,382      branch-misses:u           #    1.89% of all branches          (83.39%)
        71,861,226      branches:u                                                    (68.97%)
       372,931,052      instructions:u            #    1.22  insn per cycle                                              (85.89%)
       305,050,244      cycles:u                                                      (84.88%)

       0.108702688 seconds time elapsed

實際應用

auto-completion
spell-checking
prefix search

本共筆是與Yuessiah共同研究撰寫而成 ↩︎

2018q1 Homework2 (prefix-search)

Reviewed by Yuessiah

開發環境

Ternary Search Tree (TST)

圖像化

研究紀錄

優化方式

效能分析

實際應用

Reviewed by `Yuessiah`