dict 改善 & 效能探討

# dict 改善 & 效能探討 ###### tags: `C語言` contributed < `asd757817`,`yichung279` > ## 原專案 bugs 修正 1. 原專案在讀取 cities.txt 時，遇到地名含有空白會有 bug fscanf 在遇到空白、換行都會停止讀取，因此城鎮名稱含有空白的會儲存錯誤，修改後只會紀錄地名或城鎮名(cities.txt 內的格式為：地名, 城鎮, 國家 OR 城鎮名, 國家) * test_cpy.c 修改 ```C /* * 將這段程式碼內容修改成下面程式碼 while ((rtn = fscanf(fp, "%s", Top)) != EOF) { ...省略... } */ char buf[WORDMAX]; while (fgets(buf, WORDMAX, fp)) { char *token = ","; char *s = strtok(buf, token); char *p = s; if (!tst_ins_del(&root, &p, INS, CPY)) { fprintf(stderr, "error: memory exhausted, tst_insert.\n"); fclose(fp); return 1; } idx++; } ``` * test_ref.c 修改 ```C /* * 將這段程式碼內容修改成下面程式碼 while ((rtn = fscanf(fp, "%s", Top)) != EOF) { ...省略... } */ char buf[WORDMAX]; while (fgets(buf, WORDMAX, fp)) { char *token = ","; strcpy(Top, strtok(buf, token)); rmcrlf(Top); char *p = Top; /* insert reference to each string */ if (!tst_ins_del(&root, &p, INS, REF)) { /* fail to insert */ fprintf(stderr, "error: memory exhausted, tst_insert.\n"); fclose(fp); return 1; } else { /* update bloom filter */ bloom_add(bloom, Top); } idx++; Top += (strlen(Top) + 1); } ``` 2. 原專案 bloomfilter 宣告 table 時，使用 malloc()，但未將 table 初始化，因此改用 calloc()。 ``` res->bits = malloc(size); ``` 改為 ``` res->bits = calloc(size, 1); ``` 3. 使用易於增加數量的雜湊函數由於舊的 bloom filter 使用雜湊函數 (hash function) 時，只將兩種不同的雜湊函數放進 bloom filter 中，如此一來難以更改函數數量，也給測試帶來許多不變，因此改用 murmur3 這個能加入種子 (seed) 的函數做為雜湊函數。 ```clike unsigned int murmur3(const void *_str, unsigned int seed) { size_t len = strlen(_str); const uint8_t *key = _str; uint32_t h = seed; if (len > 3) { const uint32_t *key_x4 = (const uint32_t *) key; size_t i = len >> 2; do { uint32_t k = *key_x4++; k *= 0xcc9e2d51; k = (k << 15) | (k >> 17); k *= 0x1b873593; h ^= k; h = (h << 13) | (h >> 19); h = (h * 5) + 0xe6546b64; } while (--i); key = (const uint8_t *) key_x4; } if (len & 3) { size_t i = len & 3; uint32_t k = 0; key = &key[i - 1]; do { k <<= 8; k |= *key--; } while (--i); k *= 0xcc9e2d51; k = (k << 15) | (k >> 17); k *= 0x1b873593; h ^= k; } h ^= len; h ^= h >> 16; h *= 0x85ebca6b; h ^= h >> 13; h *= 0xc2b2ae35; h ^= h >> 16; return h; } ``` ## bloom filter 結構修改 * bloom filter: 包含 hash　function，一個　table 還有　table size。 * hash function: 以 linked list 的結構串在一起，每個結點包含一個指向函式的指標。修改之後增加 seed，讓每個 hash function 自行記錄自己的 seed。 ```clike typedef unsigned int (*hash_function)(const void *data, unsigned int seed); typedef struct bloom_filter *bloom_t; struct bloom_hash { hash_function func; unsigned int seed; struct bloom_hash *next; }; struct bloom_filter { struct bloom_hash *func; void *bits; size_t size; }; ``` ## bloom filter error rate 視覺化以 hash function的數量、table size 的大小做為橫軸縱軸，以錯誤率做為深淺。而畫面中紫色的線為 $y=ln2\cdot\frac{x}{93827}$，其中93827 為字典中 item 數量。 ![](https://i.imgur.com/6SKmS2x.png) * 把線性規劃的概念代入，我們想要最小的空間及執行速度，並符合一定的錯誤率，我們可以很清楚的知道我們想要的值在可行域的左下角。為了讓圖片不明顯的部分，能被看得清楚，將圖片轉為灰階後做直方圖均衡化 ![](https://i.imgur.com/DxibxrA.png) ![](https://i.imgur.com/pXWBJuw.png) * 把此圖視為高線圖，可以發現山谷都在 $k=ln2\frac{m}{n}$上，而在這裡實際的意義就是錯誤率最小值發生在 $k=ln2\frac{m}{n}$與推導的結果相符透過開四次方根把錯誤率之間的差距加大在畫一次圖，我們能更明顯的看出上述的內容。 ![](https://i.imgur.com/53VgivI.png) ## Bloom filter 的 hash function 數目、 table size 與搜尋時間之關係探討使用 make plot 呈現不同 hash function 數量與 table 大小的搜尋效能 ``` plot: $(TESTS) python gen_test.py ./test_cpy --bench ./test_ref --bench gnuplot scripts/bloom_err_rate.gp gnuplot scripts/runtime3.gp eog runtime3.png eog bloom_err_rate.png ``` * hash function 數量從 1 至 15，hash table 大小從 1 至 40，統計各種組合的總搜尋時間，與單純使用 prefix search 做 600次比較。 * 測試檔案：將原字典內字串第二字取代為\$，即第二個字元便不在 tree 中。 ### 測試檔說明原 cities.txt 檔案內容 ``` Shanghai, China Buenos Aires, Argentina Mumbai, India Mexico City, Distrito Federal, Mexico Karachi, Pakistan İstanbul, Turkey ...以下省略... ``` * 測試檔以原 cities.txt 為基礎，將第 n 個字元取代成 "$"，表示搜尋到第 n 字元即可判斷搜尋字串不在字典內部 * 測試檔命名為： case_n.txt，n 表示第 n 個字元取代成 "$" 如： case_02.txt 內容如下 ``` S$anghai,China B$enos Aires,Argentina M$mbai,India M$xico City,Distrito Federal,Mexico K$rachi,Pakistan ...以下省略... ``` ### 實驗： * bloom filter 雜湊函式數量從 1 ~ 15、table size 由 1~40 共600種組合 * 與不加 bloom filter 做600次進行比較 ### 測試環境 ``` Architecture: x86_64 CPU 作業模式： 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 每核心執行緒數：2 每通訊端核心數：4 CPU MHz： 1600.092 CPU max MHz: 1600.0000 CPU min MHz: 400.0000 L1d 快取： 32K L1i 快取： 32K L2 快取： 256K L3 快取： 6144K ``` ### 情境1: 搜尋字串全部在字典內（以 cities.txt 作為輸入搜尋） ![](https://i.imgur.com/PxjAxi5.png) 從結果中可以發現： 1. tst_search 重複測試 600 次的搜尋時間大致上相同 2. 因為全部的搜尋字串都在字典內，因此 bloom filter 進行的預測時間是浪費的，隨著table size、雜湊函式數量提高，浪費的時間愈多 ### 情境2：搜尋字串在第一個字元即可判斷不在字典內（case_1.txt) ![](https://i.imgur.com/eJiaj8f.png) 1. 加上 bloom filter 後的執行時間為 L 型：在這個情境中，若 bloom filter 預測錯誤，進行一個搜尋所花的時間為 bloom filter + tst_search 的時間；若是預測正確則是只需要花 bloom filter 的時間，因此預測的錯誤率愈低搜尋所需要的時間愈短。已知： ==bloom filter 預測的錯誤率計算公式為 $(1-e^{-\frac{kn}{m}})^k$== **其中 k為 hash function 數量、m 為 table size、n 為字串數量** 依據上述公式，當 k 固定時，m 增加錯誤率會逐漸下降，繪製出的圖形也是呈現 L 形與輸出結果相符。 2. 把這張圖 bloom filter 的部分與 bloom filter錯誤率的 heat map 一起看，由於錯誤率與執行時間成正比，這張圖 bloom filter 的部分，正是一列一列拆開的 heat map。 3. 因為在第一個字元就可以判斷出字串不在字典中，tst_search 進行搜尋的時間反而少於進行 hash 的時間。 ### 情境3: 搜尋到最後一個字元才發現字串不在字典內 ![](https://i.imgur.com/7OTUiPM.png) 1. 第三個 L 之後加入 bloom filter 的搜尋時間全在不加 bloom filter 之上，這與==搜尋的字串長度==有關，雖然搜尋字串全部不在字典中且需要到最後一個字元才能發現，但因為字串長度不長，當雜湊函式的數量大於3個之後，進行雜湊運算與預測的時間反而大於 tst_search，因此在輸入的搜尋字串長度與本次情境相近時，bloom filter 的雜湊函式數量小於3個才可以使效率提高。 ## 以 Perf 效能分析工具探討程式效能 Perf 可以取樣的事件種類很多，利用其中的 cache-misses 可以分析程式是否有善用 Locality of reference. 分析 ./test_cpy --bench ### 測試情境1：以 cities.txt 作為輸入，檢查程式在空間局部性的利用狀況 ``` ternary_tree, loaded 93827 words in 0.076877 sec Performance counter stats for './test_cpy --bench': 2,989,918 cache-misses 0.155304671 seconds time elapsed ``` miss 的數量非常多，可見程式在空間局部性的利用很差，進一步使用 perf 查看造成高 miss 的函式 ```shell $ perf record -e cache-misses ./test_cpy --bench && sudo perf report ternary_tree, loaded 93827 words in 0.091907 sec [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.041 MB perf.data (715 samples) ] 29.06% test_cpy test_cpy [.] tst_free_all 17.41% test_cpy test_cpy [.] tst_search 5.95% test_cpy test_cpy [.] tst_ins_del 5.13% test_cpy libc-2.23.so [.] free 4.07% test_cpy [kernel.kallsyms] [k] clear_page_erms . . . ``` 可以發現 tst_free_all 與 tst_search 兩函式是造成高 miss 的主因。 * tst_free_all 的部分：將整個資料結構釋放時，由於資料量過於龐大，很自然地，我們的快取不會存有整個資料結構，這部分的 cache misses 無可厚非。 * tst_search 的部分： ```shell while (curr) { /* loop over each char in 's' */ │ ↓ jmp 68 │ int diff = *s - curr->key; /* calculate the difference */ 3.47 │ e: mov 0xc(%ebp),%eax │ movzbl (%eax),%eax 0.69 │ movsbl %al,%edx 0.69 │ mov -0x8(%ebp),%eax 0.69 │ movzbl (%eax),%eax 39.58 │ movsbl %al,%eax 1.39 │ sub %eax,%edx 2.78 │ mov %edx,%eax │ mov %eax,-0x4(%ebp) │ //*s - curr->key; │ if (diff == 0) { /* handle the equal case */ ``` 發現是 tst_search 中計算搜尋字串與當前節點的差 `int diff = *s - curr->key;` 一步實驗後，發現`*s` dereference 搜尋字串是罪魁禍首，由於每次搜尋的字串都略有不同，對於 cache 來說每次都是第一次存取，自然有 cache misses。相反地，若短時間內重複搜尋，便有時間上的 locality，進而避免 cache misses。我們將在下方的實驗證實這件事。分析過後，我們知道除了搜尋外，還有其他無可避免的 cache misses，我們搜尋空字串後，觀察程式運作基本 miss 數量。 ```shell $ perf stat --repeat 10 -e cache-misses ./test_cpy --bench Performance counter stats for './test_cpy --bench' (10 runs): 2,116,940 cache-misses ( +- 1.17% ) 0.105038320 seconds time elapsed ( +- 1.17% ) ``` 以上面得到的數量作為程式運作基本 miss 數量。 ### 情境2：驗證搜尋過的字串確實有存入 cache 中分為3個情況進行檢查： 1. 原輸入輸入檔1: ``` 's Gravenmoe$,Netherlands 's-Gravenzand$,Netherlands 's-Hertogenbosc$,Netherlands 't Hofke,Netherlands A Coruña,Spain A Estrada,Spain A dos Cunhado$,Portugal Aabenraa,Denmark Aach,Germany Aachen,Germany Aadorf,Switzerland ...以下省略... ``` 2. 同一個輸入重入輸入3次輸入檔2： ``` 's Gravenmoe$,Netherlands 's Gravenmoe$,Netherlands 's Gravenmoe$,Netherlands 's-Gravenzand$,Netherlands 's-Gravenzand$,Netherlands 's-Gravenzand$,Netherlands 's-Hertogenbosc$,Netherlands 's-Hertogenbosc$,Netherlands 's-Hertogenbosc$,Netherlands 't Hofke,Netherlands 't Hofke,Netherlands 't Hofke,Netherlands A Coruña,Spain A Coruña,Spain A Coruña,Spain A Estrada,Spain A Estrada,Spain A Estrada,Spain ...以下省略... ``` 3. 原輸入搜尋後再以原輸入搜尋一次輸入檔3 ``` 's Gravenmoe$,Netherlands 's-Gravenzand$,Netherlands 's-Hertogenbosc$,Netherlands 't Hofke,Netherlands A Coruña,Spain ...以下省略... 's Gravenmoe$,Netherlands 's-Gravenzand$,Netherlands 's-Hertogenbosc$,Netherlands 't Hofke,Netherlands A Coruña,Spain ...以下省略 ``` 結果1 ```shell $ perf stat --repeat 10 -e cache-misses ./test_cpy --bench Performance counter stats for './test_cpy --bench' (10 runs): 3,120,522 cache-misses ( +- 2.02% ) 0.154313338 seconds time elapsed ( +- 0.84% ) ``` 結果2 ```shell $ perf stat --repeat 10 -e cache-misses ./test_cpy --bench Performance counter stats for './test_cpy --bench' (10 runs): 3,066,592 cache-misses ( +- 2.71% ) 0.181553031 seconds time elapsed ( +- 0.60% ) ``` 結果3 ```shell $ perf stat --repeat 10 -e cache-misses ./test_cpy --bench Performance counter stats for './test_cpy --bench' (10 runs): 3,840,617 cache-misses ( +- 2.04% ) 0.199821574 seconds time elapsed ( +- 0.37% ) ``` 結果1、2、3都必須再減去程式基本的 miss 數量。依據輸出結果1、2來推斷：由於第一次搜尋時會把資料存放在 cache 中，因次再次搜尋時並沒有發生 cache miss，因此兩個情境的 cache miss 數量相差不多。結果1、3在減去基本的 miss 數量後，兩者差距約2倍，表示在第二次搜尋時 cache 中已經沒有前一次的輸入，出現第二次的 miss。將輸入每 n 個分成一組，以組為單位重複搜尋，測試 cache 大約可以存放多少筆輸入： n = 20 ```shell perf stat --repeat 10 -e cache-misses ./test_cpy --bench Performance counter stats for './test_cpy --bench' (10 runs): 3,178,390 cache-misses ( +- 1.05% ) 0.187150742 seconds time elapsed ( +- 0.62% ) ``` n = 1000 ```shell $ perf stat --repeat 10 -e cache-misses ./test_cpy --bench Performance counter stats for './test_cpy --bench' (10 runs): 3,187,476 cache-misses ( +- 1.48% ) 0.195568548 seconds time elapsed ( +- 0.66% ) ``` n = 1500 ```shell $ perf stat --repeat 10 -e cache-misses ./test_cpy --bench Performance counter stats for './test_cpy --bench' (10 runs): 3,325,723 cache-misses ( +- 1.52% ) 0.197077394 seconds time elapsed ( +- 0.53% ) ``` n = 2000 ```shell $ perf stat --repeat 10 -e cache-misses ./test_cpy --bench Performance counter stats for './test_cpy --bench' (10 runs): 3,667,668 cache-misses ( +- 5.20% ) 0.205070631 seconds time elapsed ( +- 2.07% ) ``` 依照實驗的電腦配備，約1000個輸入內重複的輸入可以直接在 cache 中取得資料