2017q3 Homework2 (prefix-search)

tags: `sysprogram`

contributed by <st9007a>

Github

Reviewed by `amikai`

建議提及理論的 prefix search 以及 TST, 在關聯到實做, 報告會比較有完整性
使用 s 來搜尋 Tai 的結果, 建議提供正確的版本作為對照, 可以用一些 grep 的方式之類的
tst_traverse_fn 這個函式老師以提供了, 建議先用原本的函式的 code 做舉例 (e.g. 印出整棵樹的值)，在提到 javascipt 的應用會比較流暢, 直接跳到 javascript 舉例好像怪怪的

以上是小弟淺見, 如有冒犯請見諒

Reviewed by `HTYISABUG`

C 有提供 getopt.h 幫助編程者管理 option, 直接用 argv 取得 argument 不夠嚴謹

開發環境

Linux 4.4.0-92-generic
Ubuntu 16.04.1 LTS
L1d cache:  32K
L1i cache:  32K
L2 cache:   256K
L3 cache:   6144K

修正 test_ref.c

首先打開，test_ref.c 會發現需要修正的地方都使用 copy 的方式

tst_ins_del(&root, &p, INS, CPY)

如果直接把 CPY 改成 REF 然後執行的話，雖然能正確編譯，但是執行結果是錯誤的，以下是使用 s 來搜尋 Tai 的結果

suggest[961] : Tai
suggest[962] : Tai
suggest[963] : Tai
suggest[964] : Tai
suggest[965] : Tai
suggest[966] : Tai
suggest[967] : Tai
suggest[968] : Tai
suggest[969] : Tai
suggest[970] : Tai

檢查 tst.c 的 tst_ins_del

void *tst_ins_del(tst_node **root, char *const *s, const int del, const int cpy)
{
    ...
    
    for (;;) {
        ...
        
        if (*p++ == 0) {
            if (cpy) { /* allocate storage for 's' */
                const char *eqdata = strdup(*s);
                if (!eqdata)
                    return NULL;
                curr->eqkid = (tst_node *) eqdata;
                return (void *) eqdata;
            } else { /* save pointer to 's' (allocated elsewhere) */
                curr->eqkid = (tst_node *) *s;
                return (void *) *s;
            }
        }
        pcurr = &(curr->eqkid);
    }
}

從上面程式碼可以看到如果 cpy 被設為 0 時，*s 會被直接存進去，而配合 test_ref.c 的程式碼，*s 對映到 word 這個 pointer to character，換句話說所有 insert 操作的節點都存到以 word 為起點的連續記憶體位址的其中一個位址，所以修正方法就很明顯了，就是不要讓他們都存取到 word 的位址

p = strdup(word);
tst_ins_del(&root, &p, INS, REF);

strdup 的代價要考慮，對 cache locality 有衝擊。不能只比較 prefix search 時間，建立 TST 的時間和空間開銷也該考慮。
"jserv"
更新嘗試使用連續記憶體的方式來取代 strdup
"st9007a"

我的作法是在 word assign 給 p 之前用 strdup 重新分配一段空間，如此就能保證每次 p 的值是不同的記憶體位址，由於 tst_ins_del 裡面有針對 *s 是否為 null 做判斷，所以這邊就不對 p == null 的狀況做檢查，編譯出來後執行結果就正常了

suggest[28] : Taivalkoski,
suggest[29] : Taivassalo,
suggest[30] : Taiwala,
suggest[31] : Taiwan
suggest[32] : Taixing,
suggest[33] : Taiynsha,
suggest[34] : Taiyuan,
suggest[35] : Taizhou,

效能評比

寫一段程式來針對 prefix 長度為 3 的所有可能組合作測試

static const char *upper_case_alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
static const char *lower_case_alphabet = "abcdefghijklmnopqrstuvwxyz";

int bench_test(const tst_node *root, char *out_file, const int max)
{
    char prefix[3];
    char **sgl;
    FILE *fp = fopen(out_file, "w");
    int idx = 0, sidx = 0;
    double t1, t2;

    if (!fp) {
        fprintf(stderr, "error: file open failed '%s'.\n", out_file);
        return 1;
    }

    sgl = (char **)malloc(sizeof(char *) * max);
    for (int i = 0; i < 26; i++) {
        prefix[0] = upper_case_alphabet[i];
        for (int j = 0; j < 26; j++) {
            prefix[1] = lower_case_alphabet[j];
            for (int k = 0; k < 26; k++) {
                prefix[2] = lower_case_alphabet[k];
                t1 = tvgetf();
                tst_search_prefix(root, prefix, sgl, &sidx, max);
                t2 = tvgetf();
                idx++;
                fprintf(fp, "%d %.6f sec\n", idx, t2 - t1);
            }
        }
    }
    fclose(fp);
    return 0;
}

回傳值代表程式執行結果，是給 main function 回傳用的，將這個 function 應用在 test_cpy.c 跟 test_ref.c 中

if (argc == 2 && strcmp(argv[1], "--bench") == 0) {
    int stat = bench_test(root, BENCH_TEST_FILE, LMAX);
    tst_free_all(root);
    return stat;
}

在 Makefile 中加入 $ make bench

BIN = \
      test_cpy \
      test_ref

bench: $(BIN)
      @for test in $(BIN); do\
          ./$$test --bench; \
      done

首先先比較看看建立 tst 的速度

$ make bench
CPY: ternary_tree, loaded 259112 words in 0.093352 sec
REF: ternary_tree, loaded 259112 words in 0.098250 sec

$ make bench
CPY: ternary_tree, loaded 259112 words in 0.089738 sec
REF: ternary_tree, loaded 259112 words in 0.109493 sec

$ make bench
CPY: ternary_tree, loaded 259112 words in 0.089678 sec
REF: ternary_tree, loaded 259112 words in 0.096827 sec

執行三次可以看到 cpy 的方法都比 ref 快 $0.05 \sim 0.1$ 秒

上圖為針對 prefix 長度為 3 的所有可能做搜尋的執行速度比較，x 軸的數字代表的是第幾次搜尋，看起來好像差不多，找兩個比較明顯起伏的區間來看看

這兩個區間分別是 $1000 \sim 2000$ 跟 $12000 \sim 14000$ ，從這兩個區間可以看出來 cpy 的執行速度比 ref 快一點

修改 bench_test

由於原本的 bench_test 用三個 for loop 來測試所有字串長度為 3 的 prefix，這樣的作法會讓程式碼可讀性變差，而且也不符合實際狀況，所以改成用一個字典來儲存要測試的 prefix，這裡的字典取自 cities.txt 前 10000 筆資料

while (fscanf(dict, "%s", word) != EOF) {
    if (strlen(word) < 4) continue;
    strncpy(prefix, word, 3);
    t1 = tvgetf();
    tst_search_prefix(root, prefix, sgl, &sidx, max);
    t2 = tvgetf();
    fprintf(fp, "%d %.6f sec\n", idx, t2 - t1);
    idx++;
}

修正 test_ref.c (Part 2)

由於之前的做法使用了 strdup 來避免參考到同一段位址的問題，但是根據第一周 phonebook 作業的結論，使用非連續記憶體來存取龐大的資料會導致cache miss 機率提高，所以這裡換一個方法來修正 test_ref.c 中建立 tst 的部分

enum { INS, DEL, WRDMAX = 256, STKMAX = 512, LMAX = 1024, DICTMAX = 300000 };

char (*cities_dict)[WRDMAX] = malloc(sizeof *cities_dict * DICTMAX);

while ((rtn = fscanf(fp, "%s", cities_dict[idx])) != EOF) {
    char *p = cities_dict[idx];
    if (!tst_ins_del(&root, &p, INS, REF)) {
        fprintf(stderr, "error: memory exhausted, tst_insert.\n");
        fclose(fp);
        return 1;
    }
    idx++;
}

這裡使用一個字串陣列 cities_dict 來存取資料，如此可以保證資料不會參考到同一段位址，同時一次性的分配記憶體也比每次 strdup 有效率多了，接著執行後選擇 q 這個指令會發現有錯誤

*** Error in `./test_ref': free(): invalid pointer: 0x00007fa76abe4f10 ***

研究了一下發現原本 q 這個指令所使用的 api 是 tst_free_all，這個 function 的註解有說明這個 api 是用來釋放儲存在內部的資料用的，但是現在的做法等同於把資料存在 cities_dict 這個變數中，所以要改成使用 tst_free 這個 api，註解有提到這個 api 是用來釋放儲存在外部的資料

再來是新增資料的部分，保持資料都存在 cities_dict 內，所以用 strcpy 把讀到的字複製進去

strcpy(cities_dict[idx], word);
p = cities_dict[idx];

最後刪除資料的部分，由於 cpy 跟 ref 的操作只會在新增時有差別，所以不對 word 做任何處理

分析 copy / reference 兩種方式的效能

這裡利用 $ perf stat 分別對 $ ./test_cpy --bench 以及 $ ./test_ref --bench 做測試

Performance counter stats for './test_ref --bench' (100 runs):

       1,9833,7717  cache-misses      #  58.596 % of all cache refs  ( +-  0.03% )
       3,3848,2694  cache-references                                 ( +-  0.03% )
      55,1851,5511  instructions      #   0.98  insns per cycle      ( +-  0.00% )
      56,5946,9075  cycles                                           ( +-  0.02% )

       2.193260663 seconds time elapsed                              ( +-  0.05% )

 Performance counter stats for './test_cpy --bench' (100 runs):

       1,6970,8821  cache-misses      #  54.942 % of all cache refs  ( +-  0.05% )
       3,0888,9526  cache-references                                 ( +-  0.07% )
      55,4853,7937  instructions      #   1.11  insns per cycle      ( +-  0.00% )
      49,8513,9041  cycles                                           ( +-  0.04% )

       1.866943152 seconds time elapsed                              ( +-  0.15% )

可以看到 cpy 的效能比 ref 還好，有意思的是 ref 的 cache miss 比 cpy 還高，這邊可以分成兩部分來探討：

建立 tst
prefix search

不要搞錯動名詞使用的場合，抽象詞彙場合用 cache miss，而探討在特定場合的行為時，才用 cache missing，如 Cache Missing for Fun and Profit
"jserv"

建立 tst

建立 tst 時，cpy 透過重複存取 word 來完成 tst，而 ref 則是在 cities_dict 上移動來完成 tst，以重複存取變數的效率來說 cpy 應該是比較好的
再來考慮到空間使用的開銷，cpy 只要維持 tst 的結構就好，而 ref 除了 tst 以外，還需要額外保存 cities_dict 這個變數，因此空間使用的效率上也是 cpy 比較好

prefix search

雖然 ref 的 key 是由連續記憶體來存取，但是一個 word 的長度為 256 bytes，資料尺寸過大讓 cache 沒辦法有好的效率，接著是 search 時在節點上的移動是在非連續的記憶體上，所以 prefix search 並沒有讓 ref 佔太大優勢

Benchmark 分析與改進

原本拿來做測試的用的字典檔是拿原本的字典檔前 10000 筆資料的前 3 個字元去做 prefix 的測試，其測試結果可以在上一節看到，但是字典檔的排序是隨機的，在這種狀況下沒辦法有效測試 cache hit 的效能( temporal locality )，所以將測試用字典檔做排序後重新測試

Performance counter stats for './test_cpy --bench' (100 runs):

       1,0217,7314  cache-misses      #  34.198 % of all cache refs  ( +-  0.20% )
       2,9878,5062  cache-references                                 ( +-  0.07% )

       1.408432966 seconds time elapsed                              ( +-  0.13% )

Performance counter stats for './test_ref --bench' (100 runs):

       1,2274,2242  cache-misses      #  37.353 % of all cache refs  ( +-  0.16% )
       3,2860,1004  cache-references                                 ( +-  0.08% )

       1.606459157 seconds time elapsed                              ( +-  0.16% )

以上為新的測試結果，兩種方法的 cache miss 都有下降不少，其原因就是這種測試方式讓資料的存取有 temporal locality 的特性

引入 Memory Pool

參考 tina0405 同學共筆，嘗試將之前用字串陣列來存取外部資料的方式改成用 memory pool 來存取外部資料

typedef struct __MEM_POOL MemPool;

MemPool *new_mem_pool(size_t size);
char *poalloc(size_t size, MemPool *p);

char *get_pool_curr(MemPool *p);
void mv_pool_curr(int dist, MemPool *p);

以上為 mempool.h 內容，由於在建立 tst 時，若使用 poalloc() 會需要在額外做 strcpy() 把 word 複製到 p 中，所以額外設計 get_pool_curr() 跟 mv_pool_curr() 來取代建立 tst 時的　memory pool allocation

Performance counter stats for './test_cpy --bench' (100 runs):

       1,0299,9538  cache-misses      #  34.449 % of all cache refs  ( +-  0.20% )
       2,9898,9009  cache-references                                 ( +-  0.05% )

       1.399240554 seconds time elapsed                              ( +-  0.16% )

Performance counter stats for './test_ref --bench' (100 runs):

       1,1457,0305  cache-misses      #  34.869 % of all cache refs  ( +-  0.15% )
       3,2857,3670  cache-references                                 ( +-  0.05% )

       1.549336182 seconds time elapsed                               ( +-  0.15% )

其結果 reference 的方式較原本下降了 3% 左右

問題討論

詳細觀察 tst.h 和 tst.c 為何要 forward declaration 呢？分離介面和實作有何好處呢？這樣的思維又該如何運用於 phonebook 呢？提出想法並且實作

forward declaration 跟分離介面可以在保持相同介面的狀況下抽換 context，在介面被大量使用的狀況下只需修改實作介面的的部分即可讓所有使用介面的程式碼套用，forward declaration 也有封裝資料的作用，能保護 struct 的欄位不會被直接存取
在第一周的 phonebook 的作業中可以實作一個通用介面 phonebook.h

typedef struct __PHONE_BOOK_ENTRY entry;

entry *findName(char lastName[], entry *pHead);
void append(char lastName[], entry *pHead);

然後依照不同介面實作方式來寫在不同的 .c 檔，但是寫完後基本上編譯是不會過的，因為 main.c 裡面會直接操作 struct 的欄位，由於 main.c 只有 include 通用介面，而通用介面並沒有做出完整的 struct 所以會出現 incomplete type 的錯誤
修正方法可以在通用介面中加入存取 struct 欄位的 function 來取代直接操作 struct 欄位

char *getLastName(entry *pNode);

相關程式碼已經更新在 github

研究程式碼的 tst_traverse_fn 函式，並思考如何運用這個實作，注意到 callback function 的程式開發技巧/模式

wikipedia 提到 callback 就是將一段可執行的程式碼(即 function )作為參數傳入另一個 function，可以被應用到不少場合，比如說 Javascript 操作 Array 的 function 就有大量使用 callback (sort, filter, map … 等等)

var arr = [1, 3, 2, 60, 0];
var sorted = arr.sort(function (a, b) { return a - b; });     // 排序陣列
var lessThan3 = arr.filter(function (el) { return el < 3; }); // 過濾掉陣列中大於等於 3 的 element
var mapTo = arr.map(function (el) { return el + 1; }); // 將陣列的每個 element 加 3

在 tst_traverse_fn 中，如果不傳入 fn 的話，他就只會單純的遍歷整棵樹，可以傳入一個 fn 來印出每個節點的資訊，如此就能利用 tst_traverse_fn 來 debug

參考資料

stackoverflow – Malloc a 2D array in C
wikipedia – callback (computer programming)
tina0405 同學共筆
 Memory and Locality