2017q3 Homework1 (phonebook) [分組報告]

# 2017q3 Homework1 (phonebook) [分組報告] ###### tags: `進階電腦系統理論與實作 (Fall 2017)` contributed by < `jcyztw`, `davidshih0908` > ## 預計完成項目 * 改善效能 - [ ] 使用 binary search tree 或其他資料結構改寫 - [ ] 改善電話簿的效能分析，透過大量的重複執行，找出 95% 信賴區間，而且允許動態新增資料 (較符合現實) 和查詢 - [x] 支援 fuzzy search，允許搜尋 lastName 時，不用精準打出正確的寫法 - [x] 引入 memory pool，降低記憶體存取的成本 * 查閱相關資料 - [ ] [cache 原理與實驗的參考及引用資料](https://hackmd.io/s/S14L26-_l#%E5%8F%83%E8%80%83%E5%8F%8A%E5%BC%95%E7%94%A8%E8%B3%87%E6%96%99) - [x] [ [fuzzy search](http://www.informit.com/articles/article.aspx?p=1848528) ] [Levenshtein Distance](https://charles620016.hackpad.com/ep/pad/static/Japi4qFyAzt) (編輯距離) - [x] 何謂信賴區間，機率密度函數(PDF)，以及累積分佈函數(CDF) - [x] 作業共筆：[ierosodin](https://hackmd.io/s/SJ4uqs9Yx) - [ ] [春季班期中報告](https://hackmd.io/s/H1g0s3sax) * 回答問題 - [ ] 資料難道「數大就是美」嗎？ - [ ] 如果我們要建立符合真實電話簿情境的輸入，該怎麼作？ ## 開發環境 ``` Architecture: x86_64 CPU 作業模式： 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 每核心執行緒數：2 每通訊端核心數：4 Socket(s): 1 NUMA 節點： 1 供應商識別號： GenuineIntel CPU 家族： 6 型號： 158 Model name: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz 製程： 9 CPU MHz： 900.000 CPU max MHz: 4200.0000 CPU min MHz: 800.0000 BogoMIPS: 7200.00 虛擬： VT-x L1d 快取： 32K L1i 快取： 32K L2 快取： 256K L3 快取： 8192K NUMA node0 CPU(s): 0-7 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp ``` ## 改善效能 ### 優化方案：memory pool 參考 [ierosodin](https://hackmd.io/s/SJ4uqs9Yx) 實作 memory pool 的[程式碼](https://github.com/ierosodin/phonebook)，依樣畫葫蘆實作，測試看看使用 memory pool 可以將 miss rate 降到多低。我在實作 memory pool 時，是基於 hash table 方案來加以改善，兩者差別只是在建立電話簿中的資料時 ( `append()`)，每筆 entry 配置記憶體的方式不同。hash table 的作法是使用 `malloc()` 配置記憶體，而 memory pool 是用一開始配置好的 pool 直接指定可使用的記憶體位址。 ```clike= entry *append(char lastName[], entry *e) { e->pNext = (entry *) pool_alloc(mpool, sizeof(entry)); e = e->pNext; strcpy(e->lastName, lastName); e->pNext = NULL; return e; } ``` 上面的程式碼為 memory pool 優化方案透過 `pool_alloc()` 直接指定可使用的記憶體位址。以下是 memory pool 相關操作 (create, allocate, free) 的程式碼片段： ```clike= pool *pool_create(size_t size) { pool *p = (pool *) malloc(size + sizeof(pool)); if (p) { p->next = p + sizeof(pool); p->end = p + size + sizeof(pool); } return p; } void *pool_alloc(pool *p, size_t size) { void *mem = NULL; if (p->end - p->next >= size) { mem = p->next; p->next = p->next + size; } return mem; } void pool_destroy(pool *p) { free(p); } ``` 利用 `perf stat -e` 指令來紀錄 cache miss，cache reference 次數，總指令數以及總 cycle 數，下方表格用來比較 miss rate： |original|resized structure|hash table|memory pool| |:------:|:---------------:|:--------:|:---------:| |87.693%|51.938%|43.080%|37.338%| ``` Performance counter stats for './phonebook_orig' (100 runs): 5,345,827 cache-misses # 87.693 % of all cache refs ( +- 0.17% ) 6,096,065 cache-references ( +- 0.07% ) 322,369,931 instructions # 1.37 insn per cycle ( +- 0.02% ) 235,405,977 cycles ( +- 0.47% ) 0.057201688 seconds time elapsed ( +- 0.57% ) Performance counter stats for './phonebook_opt' (100 runs): 1,443,369 cache-misses # 51.938 % of all cache refs ( +- 0.65% ) 2,779,036 cache-references ( +- 0.15% ) 283,246,415 instructions # 2.15 insn per cycle ( +- 0.02% ) 131,848,129 cycles ( +- 0.64% ) 0.032153434 seconds time elapsed ( +- 0.66% ) Performance counter stats for './phonebook_opt2' (100 runs): 1,725,825 cache-misses # 43.080 % of all cache refs ( +- 1.10% ) 4,006,108 cache-references ( +- 1.18% ) 264,386,230 instructions # 1.33 insn per cycle ( +- 0.02% ) 199,015,555 cycles ( +- 0.40% ) 0.048235382 seconds time elapsed ( +- 0.43% ) Performance counter stats for './phonebook_mpool' (100 runs): 308,512 cache-misses # 37.338 % of all cache refs ( +- 1.03% ) 826,268 cache-references ( +- 0.30% ) 157,945,451 instructions # 1.84 insn per cycle ( +- 0.03% ) 85,842,871 cycles ( +- 0.44% ) 0.021161593 seconds time elapsed ( +- 0.49% ) ``` 由下圖可看出 memory pool 方案是所有情況中 `append()` 以及 `findName()` 中執行時間花費最短的。 ![](https://i.imgur.com/heJJPsE.png) ### 實作 fuzzy search 我們實作了兩種搜尋相似字串的演算法： * Levenshtein distance * Wagner-Fischer algorithm 透過這兩種演算法可以算出 edit distance (意即將一字串轉換成另一個字串所需之編輯次數，編輯可分為插入、刪除以及取代三種操作)。以下是實作 Wagner-Fischer algorithm，搜尋 `words.txt` 中，與 `kaneshiro` 的相似字串： ``` $ ./phonebook_fuzzy kaneshiro similarity = 3 last names which edit distance is less or equal to similarity: kanashi janeiro koshiro yashiro kanesian aleshire maeshiro kazihiro kunashir kazuhiro taushiro kaneshite kanephore lanesboro kunashiri kanephoros ``` similarity 表示設定的 edit distance，此處是設為 3。以上搜尋到的 last name，都是與 `kaneshiro` 的 similarity 小於或等於 3 的字串。參考[維基百科](https://en.wikipedia.org/wiki/Wagner–Fischer_algorithm)，Wagner-Fischer algorithm 的程式碼如下： ```clike= int wagnerFischer(const char *s, const char *t) { int len_s = strlen(s), len_t = strlen(t); int distance[len_s + 1][len_t + 1]; for (int i = 0; i <= len_s; i++) { distance[i][0] = i; } for (int j = 0; j <= len_t; j++) { distance[0][j] = j; } for (int j = 1; j <= len_t; j++) { for (int i = 1; i <= len_s; i++) { if (s[i - 1] == t[j - 1]) distance[i][j] = distance[i - 1][j - 1]; else distance[i][j] = min( distance[i-1][j] + 1, distance[i][j-1] + 1, distance[i-1][j-1] + 1); } } return distance[len_s][len_t]; } ``` --- ## 資料閱讀 ### fuzzy search * 模糊搜尋(fuzzy search)，找出與你給定字串相似的字串，一般來說可以透過查字典的方法找出與你相近的字串。常用於 `search engine`，e.g. google search 中，當你輸入 `mississipp` ， google search 會建議你要不要搜尋 `mississippi`。 * 找出相似字串的演算法有很多種，依據 [Fuzzy search strings](http://ntz-develop.blogspot.tw/2011/03/fuzzy-string-search.html) ，以下針對下列兩種作說明: * Levenshtein distance * BK-trees #### Levenshtein distance Levenshtein distance 是將一字串轉換成另一個字串所需之編輯次數，能使用的編輯方式有**插入**，**刪除**或是**取代**三種。若將這 3 個操作方式的 cost 視為相同，則「總 cost = 總編輯次數 $\times$ 每編輯一次的 cost 」。其數學式為： $$ lev_{a,b}(i,j) = \left\{ \begin{array}{ll} max(i,j) & \mbox{if $min(i,j) = 0,$} \\ min\left\{\begin{array}{ll} lev_{a,b}(i-1,j)+1 \\ lev_{a,b}(i,j-1)+1 \\ lev_{a,b}(i-1,j-1)+1_{(a_i \neq b_j)} \end{array} \right. & \mbox{otherwise.} \end{array} \right. $$ 以字串 `kitten` 與 `sitting` 做示範，若想將 `kitten` 轉換成 `sitting`： 1. kitten -> sitten ( s 取代 k ) 2. sitten -> sittin ( i 取代 e ) 3. sittin -> sitting (字串尾加上 g ) 得到總編輯次數等於 3。在實作上可以採用 `recursive` 或者 `iterative`： * `recursive` 不是很有效率的做法，會有大量的重複計算，因此不作介紹 * `iterative` 的話可以採用 Wagner–Fischer algorithm 來實作 (利用 dynamic programming 的方式算兩字串的編輯距離)，其演算法如下： ![](https://i.imgur.com/DeHcq4u.png) 很直接的就是設初值，然後開始建表。以 `GARVEY` 和 `AVERY` 為例，算出 edit distance 為 3。 ![](https://i.imgur.com/TQ6BMsg.gif) #### BK-trees BK-trees 是建立在 Levenshtein distance 的基礎上的樹狀資料結構。若有一棵叫作 **T** 的 BK-tree，由於 `BOOK` 與 `BOOKS` 的 Levenshtein distance 為 1，所以在 `BOOK` 與 `BOOKS` 相連的邊上標上 1 (**T** 見下圖)。 ![](https://i.imgur.com/eZou6qf.png) 以下介紹**如何插入節點至 BK-tree** 以及如何**在 BK-tree 中尋找相似的字串**。 * BK-tree 插入的規則 1. 計算欲插入字串與 BK-tree 根節點的 Levenshtein distance (令此 distance 為 N) 2. 找尋是否有等於 N 的邊值 (所有根節點與其所有子節點相連的邊，每邊都會有一個值紀錄此節點與父點的 Levenshtein distance) 2-a. 若有，以邊值等於 N 的子節點作為新的根節點，重回步驟 1 執行，直到找到插入的位置 2-b. 若無，直接將欲插入的字串插到根節點之下，成為根節點的新子點舉例來說 (使用上圖作例子)，假如若想插入一字串 `BOO` ，首先算出與 root (`BOOK`) 的 Levenshtein distance (得到 distance = 1)，找到子節點 `BOOKS` 作為新的 root (distance = 1)；接著計算 `BOO` 跟 `BOOKS` 的 Levenshtein distance，得到 distance = 2。因為找不到邊值相符的子節點，便將 `BOO` 插在 `BOOKS` 之下 (如下圖)。 ![](https://i.imgur.com/NPDNA2s.png) * BK-tree 中找相似字串的節點 ![](https://i.imgur.com/W2a1tiV.png) 以上圖的 BK-tree 來搜尋字串 `CAQE` 為例，令 distance tolerance N = 1： 1. 計算根節點 `BOOK` 與 `CAQE` 的 Levenshtein distance (distance = 4)，而 distance 並沒有小於或等於 N ($4 \leq 1$ 不成立)，因此 `BOOK` 不列入匹配的字串選擇中。再接著搜尋邊值落在 $(distance-1, distance+1) = (3,5)$ 範圍內的 `BOOK` 子節點。 2. 符合 1 最後敘述的子節點只有 `CAKE`，重複 1 的作法，得到 distance = 1，distance 等於 N，故將 `CAKE` 列入匹配的字串中。接下來搜尋邊值落在 $(distance-1, distance+1)=(0,2)$ 範圍內的 `CAKE` 子節點。 3. 符合 2 最後敘述的子節點有 `CAPE` 和 `CART`，重複 1 的作法：`CAQE` 與 `CAPE` 的 distance = 1，因此將 `CAPE` 列入匹配的字串選擇中；`CAQE` 與 `CART` 的 distance = 2，因 distance 大於 N，故不將 `CART` 列入匹配字串。在 distance tolerance 為 1 的情況下，最終得到的相似字串有：`CAKE` 以及 `CAPE`。 ### 信賴區間，機率密度函數 (PDF) 以及累積分佈函數 (CDF) 名詞解釋： * [信賴區間](https://en.wikipedia.org/wiki/Confidence_interval) 信賴區間 (Confidence Interval) 是對機率樣本的某個總體參數的區間估計；而信賴區間展現這個總體參數的真實值**有一定機率**落在與該測量結果有關的某對應區間，這個「一定機率」稱為信心水準。舉例來說，某選舉的民意調查中，「在 95% 信心水準下，抽樣誤差在正負 3.1% 以內，甲候選人獲得 37% 的選民支持，乙候選人則有 23% 的支持率。」意即我們有 95% 的信心，支持甲的選民比例，在 (0.339, 0.401) 範圍內，而支持乙的選民比例，在 (0.199, 0.261) 的範圍內。這兩個範圍即**信賴區間**。 * [機率密度函數 (PDF, Probability Density Function)](https://en.wikipedia.org/wiki/Probability_density_function) 一個描述這個連續隨機變量 (continuous random variable) 的輸出值，在某個確定的取值點附近的可能性的函數。而隨機變量的取值落在某個區域之內的機率，則為機率密度函數在這個區域上的積分。以下使用 ASPE4e 課本 (見參考資料) P.112 範例 4-1 輔助說明。 $X$ 表示在一細薄銅線 (thin copper wire) 中測量到的電流 (單位為 mA，毫安培)，假設 $X$ 的範圍為 [0, 20 mA]，且 $X$ 的機率密度函數 $f(x) = 0.05, 0 \leq x \leq 20$，此時**銅線中電流小於 10 毫安培的機率**為何？不在 $X$ 的範圍內的 $f(x) = 0$，範例所求機率等於下圖中的陰影範圍。 $$ P(x < 10) = \int_0^{10} f(x)dx = \int_0^{10} 0.05\ dx = 0.5 $$ ![](https://i.imgur.com/mkAflCv.jpg) * [累積分佈函數 (CDF, Cumulative Distribution Function)](https://en.wikipedia.org/wiki/Cumulative_distribution_function) 是機率密度函數的積分，能完整描述一個實隨機變量 $X$ 的機率分布。舉例說明前，首先看看 ASPE4e 課本給的累積分佈函數的數學定義：一個連續隨機變數 (continuous random variable) $X$ 的累積分佈函數為 $$ F(x) = P(X \leq x) = \int_{-\infty}^x f(u)\ du,\ \mbox{for $-\infty < x < \infty$} $$ 舉例：延續在機率密度函數部份的例子，找出 $X$ 的累積分佈函數。 $X$ 的 CDF 由三個部份組成： If $x < 0$, $f(x) = 0$, therefore $F(x) = 0$, for $x < 0$ And $F(x) = \int_0^x f(u)du = 0.05x$, for $0 \leq x < 20$ Finally, $F(x) = \int_0^x f(u)du = 1$, for $20 \leq x$ 整理之後得到以下結果： $$ F(x) = \left\{ \begin{array}{lc} 0 & x < 0\\ 0.05x & 0 \leq x < 20\\ 1 & 20 \leq x \end{array} \right. $$ 上述式子作圖後如下： ![](https://i.imgur.com/2FufnOH.jpg) ## 參考資料 * 共筆 * 同學共筆：[st9007a](https://hackmd.io/s/BJvNXiEib) * 選讀共筆：[0xff07](https://hackmd.io/s/SkpIRTvKe)，[nekoneko](https://hackmd.io/s/rJCs_ZVa)，[ierosodin](https://hackmd.io/s/SJ4uqs9Yx)，[charles620016](https://charles620016.hackpad.com/ep/pad/static/Japi4qFyAzt) * [Hashing & Hash Tables](http://www.eecs.wsu.edu/~ananth/CptS223/Lectures/hashing.pdf "Cpt S 223. School of EECS, WSU") * Memory pool * [Implement own memory pool](https://stackoverflow.com/questions/11749386/implement-own-memory-pool "stackoverflow") * [C 語言動態記憶體配置教學：malloc、free 等函數](https://blog.gtwang.org/programming/c-memory-functions-malloc-free/) * [malloc(3) - Linux man page](https://linux.die.net/man/3/malloc) * [Fuzzy string search](http://ntz-develop.blogspot.tw/2011/03/fuzzy-string-search.html) * [Levenshtein distance from Wikipedia](https://en.wikipedia.org/wiki/Levenshtein_distance) * [The BK-Tree – A Data Structure for Spell Checking](https://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/) * [BK-Tree | Introduction & Implementation](http://www.geeksforgeeks.org/bk-tree-introduction-implementation/) * [什麼是信賴區間](http://mathcenter.ck.tp.edu.tw/MCenter/Ctrl/ePaper/ePaperOpenFileX.ashx?autoKey=11 "鄭惟厚教授/淡江大學數學系") * _Applied Statistics and Probability for Engineers, 4th ed._ (ASPE4e), Douglas C. Montgomery, George C. Runger * Probability Density Function (P.111) * Cumulative Distribution Function (P.114)