2020q3 Homework3 (dict)

# 2020q3 Homework3 (dict) contributed by < `shauming1020` > ###### tags: `homework` ## 效能表現與分析 > 視覺化 ternary search tree + bloom filter 的效能表現並分析 > 1. 要考慮到 prefix search 的特性還有詞彙的涵蓋率 > 2. 限用 gnuplot 來製圖 - 測量在不同情境下 ```res = tst_search(root, word);``` 的執行時間 - Hash 參數 ```cpp #define TableSize 50000000 /* size of bloom filter */ #define HashNumber 5000000 /* number of hash functions */ ``` - 從 cities 中隨機取 5000 筆當作測資 ### 情境 1: 搜尋字串全部在字典內 > 以 cities.txt 作為輸入搜尋 ![](https://i.imgur.com/BV9lW5p.png) ![](https://i.imgur.com/SGYYpZL.png) 由於輸入都是 cities 中的字串，因此大多情況下沒有採用 bloom filter 的搜尋時間會來的快，省去額外執行 bloom filter 的時間 ### 情境 2: 搜尋字串在第一個字元即可判斷不在字典內 > e.g. word[0] = '$'; ![](https://i.imgur.com/m1GabyV.png) ![](https://i.imgur.com/rm82XqM.png) 由於第一個字在字典中沒有出現過，因此至多比較常數次數(英文字母的大小寫個數)即可知道沒有這個字串，可以很明顯看到採用 bloom filter 的執行時間會比沒有的還來得久 ### 情境 3: 搜尋字串全部在字典內的 prefix > e.g. Tokyo 在 cities 內，取其 prefix 'Toky' 作為輸入 ![](https://i.imgur.com/iNZmWrf.png) ![](https://i.imgur.com/BSKYGJN.png) 以 prefix 作為輸入，TST 走訪到 leaf node 的上一層才知 miss，因此在大多情況下，無論是否有 bloom filter，執行時間都會比情境 1 和 2 來得多 ### 情境 4: 搜尋隨機字串 > 隨機生成長度 1 - 256 的字串 ![](https://i.imgur.com/MyDbdMg.png) ![](https://i.imgur.com/KKUR5yB.png) ## 以 perf 分析 TST 程式碼的執行時期行為並且充分解讀 - Search ``` bash= echo 3 | sudo tee /proc/sys/vm/drop_caches; sudo perf stat --repeat 500 \ -e cache-misses,cache-references,instructions,cycles \ ./test_common --bloom CPY "" sudo perf stat --repeat 500 \ -e cache-misses,cache-references,instructions,cycles \ ./test_common --bloom REF "" ``` > $./test_common --bloom CPY search-a-word > 1. bloom or bloom-wo，是否啟動 bloom filter > 2. CPY or REF，模式的選擇 > 3. search-a-word，在 TST 中搜尋一個 word，當 word 的長度為 0 時 (word="")，則從 cities 中隨機取 5000 筆資料用來分析上述四種情況 - 分析結果 ```log Performance counter stats for './test_common --bloom CPY ' (500 runs): 3,447,745 cache-misses # 31.710 % of all cache refs ( +- 0.64% ) 10,872,620 cache-references ( +- 0.17% ) 721,588,119 instructions # 1.22 insn per cycle ( +- 0.01% ) 591,450,395 cycles ( +- 0.16% ) 0.126787 +- 0.000248 seconds time elapsed ( +- 0.20% ) ``` ```log Performance counter stats for './test_common --bloom REF ' (500 runs): 2,976,359 cache-misses # 27.486 % of all cache refs ( +- 0.59% ) 10,828,747 cache-references ( +- 0.13% ) 697,580,790 instructions # 1.22 insn per cycle ( +- 0.00% ) 574,036,069 cycles ( +- 0.12% ) 0.122579 +- 0.000206 seconds time elapsed ( +- 0.17% ) ``` - Report ```log $perf record -e cache-misses ./test_common --bloom CPY "" && perf report 37.12% test_common test_common [.] tst_free_all 7.76% test_common test_common [.] tst_search 5.18% test_common test_common [.] tst_ins_del ``` ```log $perf record -e cache-misses ./test_common --bloom REF "" && perf report 36.60% test_common test_common [.] tst_free 8.92% test_common test_common [.] tst_search 5.97% test_common test_common [.] tst_ins_del ``` - Prefix Search ```bash= echo 3 | sudo tee /proc/sys/vm/drop_caches; sudo perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_common --bench CPY "s Tai" sudo perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_common --bench REF "s Tai" ``` - 分析結果 ```log Performance counter stats for './test_common --bench CPY s Tai' (500 runs): 2,721,171 cache-misses # 27.798 % of all cache refs ( +- 0.57% ) 9,789,154 cache-references ( +- 0.17% ) 584,808,232 instructions # 1.23 insn per cycle ( +- 0.01% ) 476,011,493 cycles ( +- 0.11% ) 0.102034 +- 0.000192 seconds time elapsed ( +- 0.19% ) ``` ```log Performance counter stats for './test_common --bench REF s Tai' (500 runs): 2,227,871 cache-misses # 23.064 % of all cache refs ( +- 0.30% ) 9,659,418 cache-references ( +- 0.14% ) 550,516,998 instructions # 1.20 insn per cycle ( +- 0.00% ) 458,495,558 cycles ( +- 0.05% ) 0.0976760 +- 0.0000603 seconds time elapsed ( +- 0.06% ) ``` - 從結果中發現 | 比較 | CPY | REF | | -------- | -------- | -------- | | cache-misses| 多 | 少 | | instructions | 多 | 少 | | running-time | 較慢 | 較快 | ## CPY & REF > 考慮到 CPY 和 REF 這兩種實作機制，思考效能落差的緣故，並且提出後續效能改善機制 > 1. 充分量化並以現代處理器架構的行為解讀 > 2. 如何兼顧執行時間和空間運用效率？ - REF 機制會配置 memory pool 來存放資料，而 CPY 機制則採用變數 word 作為 buffer 暫存一筆資料 ```c= test_common.c /* memory pool */ pool = (char *) malloc(poolsize * sizeof(char)); Top = pool; ``` - CPY 機制當插入節點時。走訪到 string 尾端會額外 malloc 與 s 相同長度的空間，並讓 curr->eqkid 指向，而 REF 機制則是直接讓 curr->eqkid 指向 Top (即 pool 內的空間) ```c= tst_ins_del /* Place nodes until end of the string, at end of stign allocate * space for data, copy data as final eqkid, and return. */ if (*p++ == 0) { if (cpy) { /* allocate storage for 's' */ const char *eqdata = strdup(s); if (!eqdata) return NULL; curr->eqkid = (tst_node *) eqdata; return (void *) eqdata; } else { /* save pointer to 's' (allocated elsewhere) */ curr->eqkid = (tst_node *) s; return (void *) s; } } ``` ### 思考效能落差的緣故 - CPY 機制所需要執行的指令較多，例如在 insert 時 ```const char *eqdata = strdup(s);```及 tst_free_all ```free(p->eqkid);``` 等等 - 參考[隱藏在字典裡的數學](https://hackmd.io/@sysprog/BkE3uSvdN#%E4%BD%BF%E7%94%A8-perf-%E5%88%86%E6%9E%90%E7%A8%8B%E5%BC%8F%E6%95%88%E8%83%BD)，預期 REF cache-misses 次數會較多，但從我的實驗結果來看和參考來源不一致，這裡沒有頭緒... - 重新省思自己的實驗設計流程 ## 研讀 Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters 論文並參閱對應程式碼實作 xor_singleheader > 1. 引入上述 dict 程式碼中，需要比較原有的 bloom filter 和 xor filter 在錯誤率及記憶體開銷 > 2. 善用 Valgrind 和 Massif，使用方式參見 I01: lab0 ## 如何縮減頻繁 malloc 帶來的效能衝擊 > 1. 儲存字串用的 memory pool 在 REF 版本中已實作，但在 tst_ins_del() 當中頻繁以 calloc() 配置節點空間造成的效能衝擊，應許以改進。 > 2. REF 版本中的 pool 只會一直增添新的資料，並沒有從中移除資料的操作發生，因此可以用更進階的手段來管理 > 3. tst 則會有移除節點的操作，我們無法確定要被刪除的節點他使用的記憶體落在 memory pool 的哪個地方，如何管理 memory pool，使之不會有記憶體碎片產生，是這個問題的關鍵。 ## 在現有的開放原始碼專案中，找出使用 bloom filter 的專案，並解析其導入的考量