2019q3 Homework5 (dict)

# 2019q3 Homework5 (dict) contributed by < `kaeteyaruyo` > ## 作業要求 ### 詳細閱讀 [隱藏在字典裡的數學](https://hackmd.io/@sysprog/BkE3uSvdN)，並在自己的電腦環境重現裡頭的實驗，應該充分記錄所見所聞 ### fork [dict](https://github.com/sysprog21/dict)，研讀並修改 `Makefile` * 允許 `$ make plot` 時，視覺化 ternary search tree + bloom filter 的效能表現。 * 需要設計實驗，允許不同的字串輸入並探討 TST 回應的統計分佈 * 要考慮到 prefix search 的特性還有詞彙的涵蓋率 * 限用 gnuplot 來製圖 * 參閱 [Hash Table Benchmarks](http://incise.org/hash-table-benchmarks.html) ### 以 perf 分析 TST 程式碼的執行時期行為並且充分解讀 ### 考慮到 CPY 和 REF 這兩種實作機制，思考效能落差的緣故，並且提出後續效能改善機制 * 充分量化並以現代處理器架構的行為解讀 ### 實作 [Quotient filter](https://en.wikipedia.org/wiki/Quotient_filter) 並整合進 [dict](https://github.com/sysprog21/dict) 程式碼中，需要比較兩者的錯誤率及記憶體開銷 ### 研讀 [dict 改善 & 效能探討](https://hackmd.io/@j4ywJU0OTzS2BlwpJo6THA/r1ZdcbFpm)，改善現有程式碼改善部份如下。 ## [dict](https://github.com/sysprog21/dict) 程式碼檢查事項 1. Bloom filter 空間浪費: 整個 table 都把 byte 當 bit 來用，浪費 1/8 的 table 空間 :::info 應為 7 / 8? ::: ```cpp void bloom_add(bloom_t filter, const void *item) { struct bloom_hash *h = filter->func; uint8_t *bits = filter->bits; while (h) { unsigned int hash = h->func(item); hash %= filter->size; bits[hash] = 1; // 每個 uint8_t 不是 1 就是 0 h = h->next; } } ``` 在 `bloom_filter` 的結構定義中，使用了 `size_t` 這個型態記錄 bit table 的大小，其單位為 byte。從上述 `bloom_add()` 的程式碼中也可以發現我們是使用 `uint8_t` 這樣的型態在使用這塊記憶體。但其實一個 byte 可以當作 8 個 bit 在使用，也就是說，我們實際上浪費了 7 / 8 的空間。換句話說，如果空間使用得當（8 個 bit 全用上），我們實際上只需要配置現在的 1 / 8 的空間。因此，我們先從 `bloom_create()` 開始修改。原本傳入的 `size` 參數是以 byte 為單位，我們保留原本的數字，但是在解讀這個數字時改以 bit 為單位，並且在 `calloc()` 時只配置 `size >> 3`，也就是這個數字的 1 / 8 個 byte 的空間。 ```cpp bloom_t bloom_create(size_t size) { bloom_t res = calloc(1, sizeof(struct bloom_filter)); res->size = size; // Store size in bit res->bits = calloc(size >> 3, 1); // But alloc memory in byte, with 1 / 8 size bloom_add_hash(res, djb2); bloom_add_hash(res, jenkins); return res; } ``` 為了完整用上全部的空間，我們引入了 [bit array](http://www.mathcs.emory.edu/~cheung/Courses/255/Syllabus/1-C-intro/bit-array.html) 的機制，分別實作了 `bloom_add()` 中用到的 set 操作，和 `bloom_test()` 當中用到的 test 操作： ```cpp #define set_bit(table, index) (table[(index / 8)] |= (1 << (index % 8))) #define test_bit(table, index) (table[(index / 8)] & (1 << (index % 8))) ``` 如此一來，便能將上述兩個函式改成這樣，達到 100% 的空間使用率： ```cpp void bloom_add(bloom_t filter, const void *item) { struct bloom_hash *h = filter->func; uint8_t *bits = filter->bits; // Convert from void* to uint8_t* while (h) { unsigned hash = h->func(item); hash %= filter->size; // Index in bit set_bit(bits, hash); h = h->next; } } bool bloom_test(bloom_t filter, const void *item) { struct bloom_hash *h = filter->func; uint8_t *bits = filter->bits; while (h) { unsigned int hash = h->func(item); hash %= filter->size; if (!test_bit(bits, hash)) { return false; } h = h->next; } return true; } ``` 需注意，此實作版本僅在要求的 table 大小恰為 8 的倍數個 bit 時才能正常運作。若 table 的大小不為 8 的倍數，尾端 `filter->size % 8` 個 bit 的部份不會被配置到，在使用到這部份的 bit 時，會引起 array index out of range 的錯誤。 2. `test_cpy.c` 和 `test_ref.c` 兩個檔案有許多部分重複，是否能引入 `test-common.c` 來統合呢？執行 `diff test_cpy.c test_ref.c`，找出兩個檔案不同的部份如下： * 20 行處，使用的測試檔檔名不同 ```bash 20c20 < #define BENCH_TEST_FILE "bench_cpy.txt" --- > #define BENCH_TEST_FILE "bench_ref.txt" ``` * 45 行處，推測應該是沒有改到參數，所以我統一改成 `IN_FILE` ```bash 45c45 < fprintf(stderr, "error: file open failed '%s'.\n", IN_FILE); --- > fprintf(stderr, "error: file open failed '%s'.\n", argv[1]); ``` * 53 處的 while 迴圈， CPY 的版本只利用一個字串變數做暫存，而 REF 的版本則是用一個 memory pool 儲存所有讀進來的字串，所以有關於 memory pool 的額外程式碼 ```bash < while ((rtn = fscanf(fp, "%s", word)) != EOF) { < if (!tst_ins_del(&root, word, INS, CPY)) { < fprintf(stderr, "error: memory exhausted, tst_insert.\n"); < fclose(fp); < return 1; < } < bloom_add(bloom, word); < idx++; < } --- > /* memory pool */ > char *pool = (char *) malloc(poolsize * sizeof(char)); > char *Top = pool; > while ((rtn = fscanf(fp, "%s", Top)) != EOF) { > /* insert reference to each string */ > if (!tst_ins_del(&root, Top, INS, REF)) { /* fail to insert */ > fprintf(stderr, "error: memory exhausted, tst_insert.\n"); > fclose(fp); > return 1; > } > bloom_add(bloom, Top); > idx++; > Top += strlen(Top) + 1; > } ``` * CPY 69 行 / REF 74 行處，CPY 版本把資料儲存在 tst 裏面，而 REF 把資料儲存在 tst 外，所以使用了不同版本的 free。 CPY 202 / REF 210 行處亦同。 ```bash 69c74 < tst_free_all(root); --- > tst_free(root); ... 203c210 < tst_free_all(root); --- > tst_free(root); ``` * 74 / 79 行處，使用的 output 檔名不同，可以像 `BENCH_TEST_FILE` 一樣用定義的解決 ```bash 74c79 < output = fopen("cpy.txt", "a"); --- > output = fopen("ref.txt", "a"); ``` * `switch` 內， `case 'a'` 處，因為 CPY 和 REF 版本使用不同的方式儲存字串（`word` 和 `Top`），所以會有 memory pool 有無被使用的差別 ```bash < case 'a': < printf("enter word to add: "); < if (argc > 1 && strcmp(argv[1], "--bench") == 0) < strcpy(word, argv[3]); < else if (!fgets(word, sizeof word, stdin)) { < fprintf(stderr, "error: insufficient input.\n"); < break; < } < rmcrlf(word); < < t1 = tvgetf(); < if (bloom_test(bloom, word)) /* if detected by filter, skip */ < res = NULL; < else { /* update via tree traversal and bloom filter */ < bloom_add(bloom, word); < res = tst_ins_del(&root, word, INS, CPY); < } < t2 = tvgetf(); < if (res) { < idx++; < printf(" %s - inserted in %.12f sec. (%d words in tree)\n", < (char *) res, t2 - t1, idx); < } else < printf(" %s - already exists in list.\n", (char *) res); < < if (argc > 1 && strcmp(argv[1], "--bench") == 0) // a for auto < goto quit; < break; --- > case 'a': > printf("enter word to add: "); > if (argc > 1 && strcmp(argv[1], "--bench") == 0) > strcpy(Top, argv[3]); > else if (!fgets(Top, sizeof word, stdin)) { > fprintf(stderr, "error: insufficient input.\n"); > break; > } > rmcrlf(Top); > > t1 = tvgetf(); > if (bloom_test(bloom, Top)) /* if detected by filter, skip */ > res = NULL; > else { /* update via tree traversal and bloom filter */ > bloom_add(bloom, Top); > res = tst_ins_del(&root, Top, INS, REF); > } > t2 = tvgetf(); > if (res) { > idx++; > Top += (strlen(Top) + 1); > printf(" %s - inserted in %.12f sec. (%d words in tree)\n", > (char *) res, t2 - t1, idx); > } else > printf(" %s - already exists in list.\n", (char *) res); > > if (argc > 1 && strcmp(argv[1], "--bench") == 0) // a for auto > goto quit; > break; ``` * CPY 185 行 / REF 191 行處，使用了不同版本的參數 ```bash 185c191,192 < res = tst_ins_del(&root, word, DEL, CPY); --- > /* FIXME: remove reference to each string */ > res = tst_ins_del(&root, word, DEL, REF); ``` 統整以上差異後，可以發現大部分不同的程式碼在於使用的參數 / 檔案名稱不同，以及有無使用到 memory pool的部份。由於差異不大，可以用 define macro 的方式，在編譯時期選擇不同版本的程式碼。我將整合後的版本實作成一個 [test_common.c](https://github.com/kaeteyaruyo/dict/blob/master/test_common.c)，檔案中的部份內容如下： ```cpp #if !defined(REF) && !defined(CPY) #define REF #endif #ifdef REF // for reference mode, tst_ins_del()'s cpy argument should be zero #define MODE 0 #define BENCH_TEST_FILE "bench_ref.txt" #define OUTPUT_FILE "ref.txt" #define TST_FREE tst_free #endif #ifdef CPY // for cpy mode, tst_ins_del()'s cpy argument should be non-zero #define MODE 1 #define BENCH_TEST_FILE "bench_cpy.txt" #define OUTPUT_FILE "cpy.txt" #define TST_FREE tst_free_all #endif ... #ifdef REF /* memory pool */ char *pool = (char *) malloc(poolsize * sizeof(char)); char *buffer = pool; #endif #ifdef CPY char *buffer = word; #endif ``` 搭配 gcc 的 `-Dmacro` 參數，在編譯時若使用 `-DREF`，就會編譯出 REF 的版本，反之用 `-DCPY`，就會編譯出 CPY 的版本，如果忘了打，會預設編譯出 REF 的版本。但這個專案有引入了 makefile，所以我嘗試把這項功能也整合進 makefile 裡面： ```shell test_exe: test_common.o $(OBJS_LIB) $(VECHO) " LD\t$@\n" $(Q)$(CC) $(LDFLAGS) -o test_$(shell echo $(MODE) | tr A-Z a-z) $^ -lm %.o: %.c $(VECHO) " CC\t$@\n" $(Q)$(CC) -D$(MODE) -o $@ $(CFLAGS) -c -MMD -MF .$@.d $< ``` 執行 `make test_exe MODE=[REF, CPY]` ，就能編譯出對應的版本。此外，為了讓執行 `make [all]` 可以跑出和以前一樣的結果，我另外加了幾條規則： ```shell test_ref: FORCE rm -f test_common.o $(MAKE) test_exe MODE=REF test_cpy: FORCE rm -f test_common.o $(MAKE) test_exe MODE=CPY FORCE: ; ``` 由於 `test_exe` 的規則相依於 `test_common.c` 但是這個規則沒有參數就無法編譯，並且其內容物也不應該共享，所以只能強制讓 `test_[ref, cpy]` 這兩條規則無條件執行，因此加入了 `FORCE` 規則（不這麼做的話，就算 `test_common.c` 有所更新，也不會重新編譯），並且在編譯另一個版本之前，先將 `test_common.o` 清理掉，這樣才會重新編譯到這個檔案。 3. 城市名後面跟着逗號: 應該區分城市名和國家名。另外，記錄城市名時不應該把逗號也存下來首先我必須先確定這個檔案當中到底有哪幾種格式的資料，才能決定如何處理 input。在 VS code 當中以 regular expression 搜尋過整個檔案，可以發現： * `,(.*)?,(.*)?,(.*)?,`: 有 4 個逗號的資料，共有 1 個結果 * `Villa Presidente Frei, Ñuñoa, Santiago, Chile, Chile`，依序為 `建築物名稱, 城鎮名, 城市名, 國名`。其中城市名的部份為 `Santiago, Chile`，猜測為 "Santiago de Chile" 的縮寫 * `,(.*)?,(.*)?,`: 有 3 個逗號的資料，共有 2 個結果 * `Oliver-Valdefierro, Oliver, Valdefierro, Spain`，依序為 `城市名, 區名, 區名, 國名`。其中 Oliver 和 Valdefierro 是兩個相同層級的行政區 * `Earth, TX, Texas, United States`，依序為 `城市名, 州名縮寫, 州名, 國名`。這算是例外，因為除此之外的所有美國的地名都是遵循 `城市名, 州名, 國名` 的格式。 * `,(.*)?,`: 有 2 個逗號的資料，共有 19188 個結果 * 無法全數列出，但大部份為美國、加拿大、墨西哥、澳洲等有州層級的行政區規劃的國家，所以可以粗略推測這種情況的資料，中間的部份都是州 * 剩下的就是只有一個逗號的國家，格式為 `城市名, 國名` 以資料科學的角度來看，在拿到一筆 raw data 時應當先做前處理，整理並儘量去除當中的雜訊（即如上述的例外狀況）。由於我們已經偵測出所有例外狀況了，並且這些資料並不多，因此可以直接根據我們的需求將這些資料整理成正確的格式。就目前這份文字檔，我會選擇將上述有三個以上逗號的資料取代如下： * `Villa Presidente Frei, Ñuñoa, Santiago, Chile, Chile` -> `Ñuñoa, Chile` * 因為智利其他的地名都是遵循 `城市名, 國名` 的格式，而且 `Santiago, Chile` 已經存在了 * `Oliver-Valdefierro, Oliver, Valdefierro, Spain` -> `Oliver-Valdefierro, Spain` * 因為就西班牙的行政區分類，這兩個地名本來就被視為是[一個區](https://es.wikipedia.org/wiki/Oliver-Valdefierro)，且大部份西班牙的地名都是遵循 `城市名, 國名` 的格式 * `Earth, TX, Texas, United States` -> `Earth, Texas, United States` * 因為除此之外的所有美國的地名都是遵循 `城市名, 州名, 國名` 的格式還有一個小例外是 `Xeraco,Jaraco, Spain` ，他的城市名的逗號後方沒有空白鍵（就只有這筆資料是這樣）。 Google 了一下發現是因為 Xeraco 和 Jaraco 是同一個地方，只是拼法不同，因此保留西班牙本地常用的 Xeraco 即可。如此一來，這份資料就只剩下「有一個逗號的資料」跟「有兩個逗號的資料」了，在我們假設「有兩個逗號的資料」都是`城市名, 州名/省名, 國名`的情況下，我們可以用以下程式碼讀取輸入（此方法會將逗號和換行符都排除）： ```cpp char line[WRDMAX]; while (fgets(line, WRDMAX, fp) != NULL) { char city[WRDMAX] = "", province[WRDMAX] = "", nation[WRDMAX] = ""; sscanf(line, "%[^,^\n], %[^,^\n], %[^,^\n]", city, province, nation); // Do what you want } ``` 在本專案中，我單純將所有取得的地名插入樹中而已。詳細實作方法請見 [test_common.c 第 68 行處](https://github.com/kaeteyaruyo/dict/blob/master/test_common.c)。 4. memory leaks 的偵測及修正執行 `valgrind --leak-check=full ./test_ref --bench`，測得 memory leak 的狀況如下： ```bash ==17885== HEAP SUMMARY: ==17885== in use at exit: 517,008,248 bytes in 6 blocks ==17885== total heap usage: 502,763 allocs, 502,757 frees, 533,111,216 bytes allocated ==17885== ==17885== 8,192 bytes in 1 blocks are definitely lost in loss record 3 of 6 ==17885== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==17885== by 0x400DEC: bench_test (bench.c:46) ==17885== by 0x4011FE: main (test_ref.c:73) ==17885== ==17885== 5,000,056 (24 direct, 5,000,032 indirect) bytes in 1 blocks are definitely lost in loss record 5 of 6 ==17885== at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==17885== by 0x402A57: bloom_create (bloom.c:45) ==17885== by 0x401078: main (test_ref.c:51) ==17885== ==17885== 512,000,000 bytes in 1 blocks are possibly lost in loss record 6 of 6 ==17885== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==17885== by 0x40108E: main (test_ref.c:54) ==17885== ==17885== LEAK SUMMARY: ==17885== definitely lost: 8,216 bytes in 2 blocks ==17885== indirectly lost: 5,000,032 bytes in 3 blocks ==17885== possibly lost: 512,000,000 bytes in 1 blocks ==17885== still reachable: 0 bytes in 0 blocks ==17885== suppressed: 0 bytes in 0 blocks ==17885== ==17885== For counts of detected and suppressed errors, rerun with: -v ==17885== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0) ``` * `test_ref.c` 73 行，呼叫至 `bench.c` 46 行處， `sgl` 使用的空間在 `bench_test()` 結束前都沒有被釋放。考慮到 `bench_test()` 被呼叫過後沒多久程式必然會結束，且 `sgl` 最終亦不會在此函式外部被使用，因此應當在此函式內直接釋放該記憶體。 * 修正方法：在 `bench_test()` 尾端 `return 0;` 前增加一行 `free（sgl）;` * `test_ref.c` 51 行，呼叫至 `bloom.c` 45 行處，使用 `bloom_create()` 配置的記憶體最終沒有被釋放。這是因為在 `--bench` 模式下，程式最終不會跑到 `quit` 區塊，而是在執行完 `bench_test()` 後直接結束。雖然 tst 本身的記憶體有記得被釋放掉，但是 bloom filter 和 memory pool 的部份卻被遺忘了。 * 修正方法：在 73 行呼叫完 `bench_test()` 後，增加一行 `bloom_free(bloom);`，再行 `return 0;` * `test_ref.c` 54 行處， `pool` 沒有被釋放。無論是在 `--bench` 模式還是一般模式下，都沒有會釋放 `pool` 的程式碼。 * 修正方法：在 73 行呼叫完 `bench_test()` 後，以及 `main（）` 尾端程式結束前，增加一行 `free(pool);`，再行 `return 0;` 此外，在非 `--bench` 模式下執行 `a` 指令新增字串時，會有以下錯誤 ```bash ==18909== Conditional jump or move depends on uninitialised value(s) ==18909== at 0x402C94: bloom_test (bloom.c:102) ==18909== by 0x40141F: main (test_ref.c:115) ``` * 在 `bloom.c` 當中的 `bloom_test()` 函式，有一條 `if (!(bits[hash])` 判斷式，這個判斷式中的 `bits` 是來源於 `uint8_t *bits = filter->bits;`，而 `filter->bits` 是用 `malloc()` 配置出來的，因此導致這項警告。 * 修正方法:將 `bloom_create()` 中的 `res->bits = malloc(size);` 改為 `res->bits = calloc(size, 1);` 即可以 `valgrind --leak-check=full ./test_cpy --bench` 測得的錯誤及原因大致相同（因為錯誤多半來自共用的檔案），修正方法亦相同，因此不再贅述 5. 如何縮減頻繁 malloc 帶來的效能衝擊呢？提示: [memory pool](https://en.wikipedia.org/wiki/Memory_pool) 由於儲存字串用的 memory pool 在 REF 版本中已經實作了，因此我猜想這裡指的應該是在 `tst_ins_del()` 當中頻繁以 `calloc()` 配置節點空間造成的效能衝擊。這部份可以仿照 REF 版本，先直接配置一個足夠大的連續 heap 空間，並且在需要空間時直接從中取得。然而 REF 版本中的 pool 只會一直增添新的資料，並沒有從中移除資料的操作發生，因此可以用 stack 的方式管理；而 tst 則會有移除節點的操作，我們無法確定要被刪除的節點他使用的記憶體落在 memory pool 的哪個地方，因此將其視為 stack 管理是不可行的。如何管理 memory pool，使之不會有記憶體碎片產生，是這個問題的關鍵。（待補完） ## 參考資料 * [Select a function at compile time - Stack Overflow](https://stackoverflow.com/questions/1640567/select-a-function-at-compile-time/1640592) * [[C] 使用gcc參數 -Dmacro的編譯小技巧 - KT0nE's 程式筆記](http://kt0ne.pixnet.net/blog/post/63048996-%5Bc%5D-%E4%BD%BF%E7%94%A8gcc%E5%8F%83%E6%95%B8--dmacro%E7%9A%84%E7%B7%A8%E8%AD%AF%E5%B0%8F%E6%8A%80%E5%B7%A7) * [How can I force Make to execute a recipe all the time - Stack Overflow](https://stackoverflow.com/questions/26226106/how-can-i-force-make-to-execute-a-recipe-all-the-time) * [Array of bits](http://www.mathcs.emory.edu/~cheung/Courses/255/Syllabus/1-C-intro/bit-array.html)