2020q3 Homework3 (dict)

# 2020q3 Homework3 (dict) contributed by < `zzzxxx00019` > > [dict](https://hackmd.io/@sysprog/2020-dict) ## 程式運作解讀： ### test_common.c 程式碼解讀 * 在程式執行初期， REF 被預設為 INS ， CPYmask 預設為 -1 ， REF 用於決定程式執行何種模式 * 若執行 CPY 模式， REF 會被設為 DEL 、 CPYmask 設為 0 * 透過輸入命令 `./test_common CPY` 進入 CPY 模式，其他字串則進入 REF 模式 ```cpp if (!strcmp(argv[1], "CPY") || (argc > 2 && !strcmp(argv[2], "CPY"))) { CPYmask = 0; REF = DEL; printf("CPY mechanism\n"); } else printf("REF mechanism\n"); ``` * Pool 主要用於 REF 機制，讀檔時將地名永久存入 memory pool ，若 `CPYmask = -1` 條件式成立 * 在 `test_common.c` 中， `poolsize = 2000000 * WRDMAX` ，而 `WRDMAX = 256` * 當使用 REF 機制時，將預先分配 poolsize 個字元空間供讀取檔案與暫存使用 ```cpp if (CPYmask) { /* memory pool */ pool = (char *) malloc(poolsize * sizeof(char)); Top = pool; ``` * 讀入的字串根據 `','` 與 `'\n'` 做為切割依據 * 透過 `tst_ins_del` 將資料插入 Ternary search tree * 而 CPY 與 REF 在此的差異在於 `Top -= offset & ~CPYmask` 出來的結果 * 當 CPY 模式， `CPYmask = 0` ， `~CPYmask = 0xff...f` ， Top 會被推回原點 * 當 REF 模式， `CPYmask = -1` ， `CPYmask = 0x0` ， Top 維持在插入字串後的位址 ```cpp char buf[WORDMAX]; while (fgets(buf, WORDMAX, fp)) { int offset = 0; for (int i = 0, j = 0; buf[i + offset]; i++) { Top[i] = (buf[i + j] == ',' || buf[i + j] == '\n') ? '\0' : buf[i + j]; j += (buf[i + j] == ','); } while (*Top) { if (!tst_ins_del(&root, Top, INS, REF)) { /* fail to insert */ fprintf(stderr, "error: memory exhausted, tst_insert.\n"); fclose(fp); return 1; } bloom_add(bloom, Top); idx++; int len = strlen(Top); offset += len + 1; Top += len + 1; } Top -= offset & ~CPYmask; memset(Top, '\0', WORDMAX); } ``` * 透過輸入指令 --bench ，可使用 bench_test 這個 function 去對 ternary search tree 進行測試 ```cpp if (argc == 3 && strcmp(argv[1], "--bench") == 0) ， int stat = bench_test(root, BENCH_TEST_FILE, LMAX); tst_free(root); free(pool); return stat; ``` * 觀察到 `bench_test` 中，最核心的迴圈部分，其實就是藉由讀取原本的 `cities.txt` ，去搜尋裡面字串的部分，所以理論上每個字串應該都要存在於 `tst` 中，再將每個字串搜尋所需要的時間輸出到 `bench_ref.txt` 中 ```cpp while (fscanf(dict, "%s", word) != EOF) { if (strlen(word) < sizeof(prefix) - 1) continue; strncpy(prefix, word, sizeof(prefix) - 1); t1 = tvgetf(); tst_search_prefix(root, prefix, sgl, &sidx, max); t2 = tvgetf(); fprintf(fp, "%d %f msec\n", idx, (t2 - t1) * 1000); idx++; } ``` ### tst_ins_del 程式碼解讀 ( 某次更新中已移除此 function ) * tst_ins_del 有兩種模式，分別別為 delete 與 insert ，由 del 這個變數決定何種模式，由 cpy 變數決定 insert 的機制為 cpy 還是 ref * 目前已將 `tst_ins_del` 拆為 `tst_ins` 、 `tst_del` ```cpp void *tst_ins_del(tst_node **root, const char *s, const int del, const int cpy) ``` * 透過 `while` 迴圈去搜尋字串 s 是否存在於 tst 中，若為 del 模式， stack 會記錄當前走訪路徑，為節點刪除後做準備 * 若成功搜尋到該字串，可分 del 與 ins 兩種狀況去看會如何處理： * del 模式會呼叫 `tst_del_word` 並透過 stack 將這個字串刪除 * ins 模式則將 `refcnt + 1` ，並回傳該字串 ```cpp pcurr = root; /* start at root */ while ((curr = *pcurr)) { /* iterate to insertion node */ diff = *p - curr->key; /* get ASCII diff for >, <, = */ if (diff == 0) { /* if char equal to node->key */ if (*p++ == 0) { /* check if word is duplicate */ if (del) { /* delete instead of insert */ (curr->refcnt)--; /* decrement reference count */ /* chk refcnt, del 's', return NULL on successful del */ return tst_del_word(root, curr, &stk, cpy); } else curr->refcnt++; /* increment refcnt if word exists */ return (void *) curr->eqkid; /* pointer to word / NULL on del */ } pcurr = &(curr->eqkid); /* get next eqkid pointer address */ } else if (diff < 0) { /* if char less than node->key */ pcurr = &(curr->lokid); /* get next lokid pointer address */ } else { /* if char greater than node->key */ pcurr = &(curr->hikid); /* get next hikid pointer address */ } if (del) tst_stack_push(&stk, curr); /* push node on stack for del */ } ``` * 如果要插入的字串尚未存在 tst 中，則透過以下迴圈去進行插入的動作，將該字串所缺乏的節點做補齊 * 而 CPY 與 REF 在 insert 的動作上就有所差異： CPY 透過 `strdup` 分配一個儲存空間，而 REF 則在一開始就分配一個巨大的儲存空間提供新資料插入使用 * `char * strdup(const char *s)` 的用途是透過 `malloc` 分配一個與字串 s 相同空間的大小，再將 s 的內容複製到該地址，並將地址回傳 ```cpp for (;;) { /* allocate memory for node, and fill. use calloc (or include * string.h and initialize w/memset) to avoid valgrind warning * "Conditional jump or move depends on uninitialised value(s)" */ if (!(*pcurr = calloc(1, sizeof **pcurr))) { fprintf(stderr, "error: tst_insert(), memory exhausted.\n"); return NULL; } curr = *pcurr; curr->key = *p; curr->refcnt = 1; curr->lokid = curr->hikid = curr->eqkid = NULL; if (!*root) /* handle assignment to root if no root */ *root = *pcurr; /* Place nodes until end of the string, at end of stign allocate * space for data, copy data as final eqkid, and return. */ if (*p++ == 0) { if (cpy) { /* allocate storage for 's' */ const char *eqdata = strdup(s); if (!eqdata) return NULL; curr->eqkid = (tst_node *) eqdata; return (void *) eqdata; } else { /* save pointer to 's' (allocated elsewhere) */ curr->eqkid = (tst_node *) s; return (void *) s; } } pcurr = &(curr->eqkid); } ``` ## 視覺化 ternary search tree + bloom filter 的效能表現並分析 ### 透過 gnuplot ，將 CPY 與 REF 透過 cities.txt 為字典檔所搜尋的時間繪製成圖與分析在原始程式碼中， bench_test 直接使用 tst_search_prefix 去搜尋字串，沒有預先透過 bloom_test 檢查，為了方便測試，稍微改寫了一下程式，將 --bloom 加入指令：透過輸入 `./test_common --bloom CPY` ，在執行 `tst_search_prefix` 前會先透過 `bloom filter` 判斷字串是否可能出現在 tst 中在 test_common.c 中加入此段程式碼： ```cpp else if (argc == 3 && strcmp(argv[1], "--bloom") == 0) { ... int stat = benchbloom_test(root, BENCH_TEST_FILE, LMAX, &bloom); ... return stat; } ``` * 在 `bench.c` 方面，考慮到 `bloom filter` 的特性，並不允許 `fscanf` 這樣的模式輸入尋找字串 ( 會發生空白截斷 ) ，因此參考 `test_common.c` 在字串讀入與處理的部分，對字串做相對的處理，才不會發生 `bloom filter` 找不到符合字串的情況 * 為了實驗的公平性，在 `no bloom filter` 的狀況下，也必須透過 `fgets` 去讀檔，且都不透過 `prefix` 去對字串做切割處理，在一般環境與 bloom filter 的差異下，就在於是否有透過 `bloom_test` 預測字串可能存在於 tst 中 ```cpp while (fgets(buf, WORDMAX, dict)) { for (int i = 0, j = 0; buf[i]; i++) { word[i] = (buf[i + j] == ',' || buf[i + j] == '\n') ? '\0' : buf[i + j]; j += (buf[i + j] == ','); } ... if (bloom_test(*filter, word)) { tst_search_prefix(root, word, sgl, &sidx, max); } ... } ``` ==gunplot 製圖結果（搜尋字典檔為 "cities.txt" ）== ![](https://i.imgur.com/If5K3kK.png) ==CPY mechanism 分析結果 ( No bloom filter )== ``` Performance counter stats for './test_common --bench CPY': 207.75 msec task-clock # 0.999 CPUs utilized 0 context-switches # 0.000 K/sec 0 cpu-migrations # 0.000 K/sec 7349 page-faults # 0.035 M/sec 8,5749,8592 cycles # 4.128 GHz 11,0039,1657 instructions # 1.28 insn per cycle 1,7779,8904 branches # 855.822 M/sec 432,6607 branch-misses # 2.43% of all branches 0.207985136 seconds time elapsed 0.179999000 seconds user 0.027999000 seconds sys ``` ==REF mechanism 分析結果 ( No bloom filter )== ``` Performance counter stats for './test_common --bench REF': 214.12 msec task-clock # 0.999 CPUs utilized 1 context-switches # 0.005 K/sec 0 cpu-migrations # 0.000 K/sec 7169 page-faults # 0.033 M/sec 8,8678,6840 cycles # 4.141 GHz 10,7758,3942 instructions # 1.22 insn per cycle 1,7331,4011 branches # 809.408 M/sec 460,4517 branch-misses # 2.66% of all branches 0.214399980 seconds time elapsed 0.206320000 seconds user 0.008091000 seconds sys ``` ==CPY mechanism 分析結果 ( bloom filter )== ``` Performance counter stats for './test_common --bloom CPY': 217.18 msec task-clock # 0.999 CPUs utilized 2 context-switches # 0.009 K/sec 0 cpu-migrations # 0.000 K/sec 7346 page-faults # 0.034 M/sec 8,9835,1949 cycles # 4.136 GHz 11,4112,9491 instructions # 1.27 insn per cycle 1,8157,1321 branches # 836.023 M/sec 436,5518 branch-misses # 2.40% of all branches 0.217446657 seconds time elapsed 0.201304000 seconds user 0.016104000 seconds sys ``` ==REF mechanism 分析結果 ( bloom filter )== ``` Performance counter stats for './test_common --bloom REF': 224.07 msec task-clock # 0.999 CPUs utilized 1 context-switches # 0.004 K/sec 0 cpu-migrations # 0.000 K/sec 7168 page-faults # 0.032 M/sec 9,2713,6034 cycles # 4.138 GHz 11,1810,8196 instructions # 1.21 insn per cycle 1,7687,3963 branches # 789.361 M/sec 464,1634 branch-misses # 2.62% of all branches 0.224305167 seconds time elapsed 0.220304000 seconds user 0.004005000 seconds sys ``` * 從 `perf` 分析結果來看，在沒有 `bloom filter` 實驗中， `CPY` 與 `REF` 在效率上並沒有很大的落差，此部分還有待深入探討 * 因為搜尋的字典檔是相同於建立 `tst` 的字典檔，因此在實驗過程中，每個字都會 hit ，`bloom filter` 就成為了搜尋過程的一大累贅，因此從上圖可以觀察到，透過 `bloom filter` 所耗費的時間較多 * `bloom filter` 是用來檢查搜尋字串的正確性，避免在龐大的資料庫，搜尋一個確定不存在的資料 ==gunplot 製圖結果（使用字典檔的字串幾乎不存在於 cities.txt 中）== ![](https://i.imgur.com/kpLcrLx.png) * 可以發現 `bloom filter` ，在一開始就過濾掉不存在於 `hash table` 裡面的字串，因此不會浪費額外的資源去做 search 的動作，相較於不使用 `bloom filter` 的實驗，必須搜尋到某個階段才能確定字串不存在於 `tst` ，因此需浪費更多的額外資源與時間 :::warning To Do：思考 CPY 與 REF 在效能上有何落差。 :::