2018q3 Homework3 (dict)

2018q3 Homework3 (dict) === contributed by <[LiuJuiHung](https://github.com/LiuJuiHung)> --- ## 作業目標 * 在 GitHub 上 fork dict，研讀並修改 Makefile 以允許 $ make plot 時，視覺化 ternary search tree + bloom filter 的效能表現。注意，需要設計實驗，允許不同的字串輸入並探討 TST 回應的統計分佈 * 要考慮到 prefix search 的特性還有詞彙的涵蓋率 * 限用 gnuplot 來製圖 * 參閱 Hash Table Benchmarks * 以 perf 分析 TST 程式碼的執行時期行為並且充分解讀 * 考慮到 CPY 和 REF 這兩種實作機制，思考效能落差的緣故，並且提出後續效能改善機制 * 充分量化並以現代處理器架構的行為解讀 * 注意：這兩者都該加入 Bloom filter，而且分析具體效益 (原有程式存在若干缺陷，也該指出並修正) --- ## [Bloom Filter](https://blog.csdn.net/jiaomeng/article/details/1495500) Bloom Filter 是一種空間效率很高的隨機數據結構，利用 table 很簡潔地表示一個集合，並能判斷一個元素是否屬於這個集合。 Bloom Filter 的這種高效是有一定代價的：在判斷一個元素是否屬於某個集合時，有可能會把不屬於這個集合的元素誤認為屬於這個集合（false positive）。因此，Bloom Filter不適合那些“零錯誤”的應用場合。而在能容忍低錯誤率的應用場合下，Bloom Filter通過極少的錯誤換取了存儲空間的極大節省。 * 一個 n-bits 的 table ，初始值都是 `0` * 使用 k 個相互獨立的 hash function ，$h_0$、$h_1$、$h_2$...$h_{k-1}$ * 每當要加入新字串 s 時，會將 s 透過這 k 個 hash function 各自轉換為 table index (範圍 : 0~n-1) ，所以有 k 個 hash function ，就該有 k 個 index ，然後將該 table[index] 上的 bit set ![](https://i.imgur.com/txGvXCh.png) --- ## Install perf and Testing 1. 安裝 linux-tools-common ``` $ sudo apt-get install linux-tools-common ``` 2. 安裝 linux-tools-4.15.0-34-generic ``` $ sudo apt-get install linux-tools-4.15.0-34-generic ``` --- ## Makefile 解析 * 變數 * `TESTS` 指的是 `test_cpy.c` 、`test_ref.c` 這兩個檔案 * `TEST_DATA` 是執行 `test_cpy` 和 `test_ref` 時 ``--bench` 的參數 * `CFLAGS` 為 gcc 的參數 ```clike=1 TESTS = test_cpy test_ref TEST_DATA = s Tai CFLAGS = -O0 -Wall -Werror -g ``` * 為了 `all` 和 `clean` 能順利執行，利用 `.PHONY` 指定 `all` 和 `clean` 為 fake 項目，這樣執行 `make` ， make 就不會去檢查這兩個是否被執行過 ```clike=1 .PHONY: all clean ``` * `$(VECHO) " CC\t$@\n"` ：印出編譯訊息 * `$(Q)$(CC) -o $@ $(CFLAGS) -c -MMD -MF .$@.d $<` : * `$(Q)` 變數為 `@` ，為不顯示執行的命令 * `$(CC)` 為利用 `gcc` 編譯 * `-o` 為 output 也就是輸出 * `$@` 為目前的檔案 * `$(CFLAGS)` 就等於 `-O0 -Wall -Werror -g` * `-c` 產生 `*.o` 檔的 object 文件 * `-MMD -MF .$@.d $<` 產生與檔案相關的訊息，不包含 `include` 的 dependency 檔案，並將訊息導入到 `*.d` 檔中 ```clike=1 %.o: %.c $(VECHO) " CC\t$@\n" $(Q)$(CC) -o $@ $(CFLAGS) -c -MMD -MF .$@.d $< ``` * `$(Q)$(CC) $(LDFLAGS) -o $@ $^ -lm` 主要目的是連結各個 object 檔並產生執行檔 * `-lm` 表示使用 libm.so (或 libm.a) 這個函式庫 * `-l` : 是『加入某個函式庫(library)』的意思 * `m` : 是 libm.so 這個函式庫，其中， lib 與副檔名(.a 或 .so)不需要寫 ```clike=1 test_%: test_%.o $(OBJS_LIB) $(VECHO) " LD\t$@\n" $(Q)$(CC) $(LDFLAGS) -o $@ $^ -lm ``` --- ## Ternary Search Trie (TST) * TST 資料結構 * key : 用來儲存字元 * refcnt : 刪除字元時使用 * 每個 node 下最多只有 3 個 children: * lokid (low child) : 小於目前字元 * eqkid (equal child) : 等於目前字元 * hikid (high child) : 大於目前字元 ```clike=1 typedef struct tst_node { char key; /* char key for node (null for node with string) */ unsigned refcnt; /* refcnt tracks occurrence of word (for delete) */ struct tst_node *lokid; /* ternary low child pointer */ struct tst_node *eqkid; /* ternary equal child pointer */ struct tst_node *hikid; /* ternary high child pointer */ } tst_node; ``` * test_cpy.c ```clike=1 FILE *fp = fopen(IN_FILE, "r"); ``` 1. `fopen()` 觀察 * 以下為所屬的資料庫及格式 ``` #include <stdio.h> FILE *fopen(const char *path, const char *mode); ``` * `man page` 描述 ``` The fopen() function opens the file whose name is the string pointed to by path and associates a stream with it. The argument mode points to a string beginning with one of the following sequences (possibly followed by additional characters, as described below): r Open text file for reading. The stream is positioned at the beginning of the file. r+ Open for reading and writing.The stream is positioned at the beginning of the file. w Truncate file to zero length or create text file for writing. The stream is positioned at the beginning of the file. w+ Open for reading and writing.The file is created if it does not exist, otherwise it is truncated. The stream is positioned at the beginning of the file. a Open for appending (writing at end of file). The file is created if it does not exist. The stream is positioned at the end of the file. a+ Open for reading and appending (writing at end of file).The file is created if it does not exist. The initial file position for reading is at the beginning of the file, but output is always appended to the end of the file. ``` 2. 紀錄讀入多少 words ```clike=1 t1 = tvgetf(); while ((rtn = fscanf(fp, "%s", word)) != EOF) { char *p = word; if (!tst_ins_del(&root, &p, INS, CPY)) { fprintf(stderr, "error: memory exhausted, tst_insert.\n"); fclose(fp); return 1; } idx++; } t2 = tvgetf(); fclose(fp); printf("ternary_tree, loaded %d words in %.6f sec\n", idx, t2 - t1); ``` ```clike=1 ternary_tree, loaded 259112 words in 0.130151 sec ``` 3. `tvgetf()` 計時 * `clock_gettime` 獲取特定的時間 * 參數 `CLOCK_REALTIME` 為系統當前的時間 ```clike=1 double tvgetf() { struct timespec ts; double sec; clock_gettime(CLOCK_REALTIME, &ts); sec = ts.tv_nsec; sec /= 1e9; sec += ts.tv_sec; return sec; } ``` ```clike=1 struct timespec { __time_t tv_sec; /* Seconds. */ __syscall_slong_t tv_nsec; /* Nanoseconds. */ }; ``` --- ## TST 測試執行以下指令會執行 `bench.c` 裡的 `bench_test` 函式，並將搜尋的時間紀錄在 `bench_cpy.txt` 裡 ``` $ make $ ./test_cpy --bench ``` 將搜尋時間繪製成圖表 ``` $ gnuplot script/runtime3.gp $ eog runtime3.gp ``` ### Bug Fix 1. 發現在 `bench.c` 中函式 `bench_test` array 大小錯誤原 `bench.c` 程式碼 ```clike=1 char prefix[3] = ""; ... ``` 因為 `prefix` 的大小只有 3，而在執行 `strncpy` 時，沒有將 `\0` 放到 `prefix` 裡，故在使用時會出現問題 `man page` 中提到 >The strncpy() function is similar, except that at most n bytes of src are copied. Warning: If there is no null byte among the first n bytes of src, the string placed in dest will not be null-terminated. 將 `array` 大小修正為 4 2. 時間因 t1、 t2 都是 nsec，所以 `(t2 - t1) * 1000000` 應該是 ms(micro second) ```clike=1 char prefix[4] = ""; ... strncpy(prefix, word, 3); t1 = tvgetf(); tst_search_prefix(root, prefix, sgl, &sidx, max); t2 = tvgetf(); fprintf(fp, "%d %f msec %s\n", idx, (t2 - t1) * 1000000, *sgl); idx++; ... ```