# Phonebook
contributed by <`hugikun999`>, <`claaaaassic`>, <`Weinux`>
GitHub: [phonebook](https://github.com/claaaaassic/phonebook)
:::info
Task: following B01: phonebook, support fuzzy search, so that searching for a lastName does not require typing the exact spelling. Point out the problems of the original dataset (with engineering analysis backed by statistics) and improve it.
:::
- [x] fuzzy search
- [x] cache block size (linked list, binary search tree)
- [x] 95% confidence interval
- [ ] delete, insert API
- [x] input data
- [x] impact of mmap on bulk reads
- [x] relationship between hash table size and collisions
- [x] Memory Pool
## Impact of mmap on bulk reads
[Which is fastest: read, fread, ifstream or mmap?](http://lemire.me/blog/2012/06/26/which-is-fastest-read-fread-ifstream-or-mmap/)
1. The setup overhead of `mmap` is heavier than that of `read`.
2. Memory maps are faster than `read` for random access.
3. A memory map keeps its pages in memory until you finish your work. With `read`, the file's pages may already have been flushed out of memory.
4. Use memory maps if you access data **randomly**, keep it around for a long time, or know you can share it with other processes.
If your access pattern is **sequential**, `mmap` is unlikely to be more efficient than the other functions.
5. Memory maps are less robust; the slightest mistake can crash your program.
6. `mmap` accesses memory directly.
`read` fetches the data from disk into the page cache and then copies it to the buffer you specify.
* Downsides of mmap:
    * page faults when pages are first touched
    * quite noticeable setup and teardown costs
* Upsides of mmap:
    * pays off if the data gets re-used over and over again
    * if the data is large, the kernel can drop pages as memory pressure forces the system to page things out
Conclusion: `mmap` is only more efficient for random access; for sequential access, `fread` is the more efficient choice.
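To make the comparison concrete, here is a minimal sketch of the two access paths (my own illustration in POSIX C, a byte-sum over a file, not the linked benchmark itself):
```C
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sum all bytes of a file with read() into a small buffer. */
long sum_read(const char *path)
{
    char buf[4096];
    long sum = 0;
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        for (ssize_t i = 0; i < n; i++)
            sum += (unsigned char) buf[i];
    close(fd);
    return sum;
}

/* Sum all bytes by mapping the whole file and touching its pages. */
long sum_mmap(const char *path)
{
    long sum = 0;
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    struct stat st;
    fstat(fd, &st);
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    for (off_t i = 0; i < st.st_size; i++)
        sum += (unsigned char) p[i];
    munmap(p, st.st_size);
    close(fd);
    return sum;
}
```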
### [Code to test](https://eklausmeier.wordpress.com/2016/02/03/performance-comparison-mmap-versus-read-versus-fread/)
The post above provides a program for comparing `mmap` and `fread`. I used words.txt as the input file and repeated the measurement 100 times with `$ perf stat`. `mmap` shows a far lower cache-miss count, but overall execution is not faster.
(1) fread
```shell
Performance counter stats for './tbytesum1 -f words.txt' (100 runs):
6,491 branch-misses ( +- 0.42% )
21,510 cache-misses # 40.280 % of all cache refs ( +- 2.39% )
53,401 cache-references ( +- 1.02% )
22,061,482 cycles ( +- 0.21% )
0.007478928 seconds time elapsed ( +- 0.36% )
```
(2) mmap
```shell
Performance counter stats for './tbytesum1 -m words.txt' (100 runs):
6,554 branch-misses ( +- 0.29% )
6,788 cache-misses # 29.101 % of all cache refs ( +- 3.07% )
23,325 cache-references ( +- 0.65% )
25,130,427 cycles ( +- 0.10% )
0.008581382 seconds time elapsed ( +- 1.01% )
```
### Actual comparison on phonebook
Judging from total elapsed time, the mmap version is not faster but slower. The reason is that it **first aligns `words.txt`** with `file_align()`, which adds many extra instructions compared to the original version, so the total time does not improve; however, cache misses drop by roughly ten percent.
```C
/* Align every record to MAX_LAST_NAME_SIZE bytes, then map the
 * aligned file into memory so entries can be indexed directly. */
file_align(DICT_FILE, FILE_OUT, MAX_LAST_NAME_SIZE);
fd = open(FILE_OUT, O_RDONLY | O_NONBLOCK);
if (fd == -1) {   /* open() returns -1 on failure, not 0 */
    printf("cannot open the file\n");
    return -1;
}
size = fsize(FILE_OUT);
data_count = size / (sizeof(char) * MAX_LAST_NAME_SIZE);
map = mmap(NULL, size, PROT_READ, MAP_SHARED | MAP_NONBLOCK, fd, 0);
```
```shell
Performance counter stats for './phonebook_orig' (100 runs):
1,239,627 cache-misses # 88.189 % of all cache refs ( +- 0.76% )
1,405,652 cache-references ( +- 0.86% )
321,272,267 instructions # 1.45 insns per cycle ( +- 0.01% )
221,877,205 cycles ( +- 0.29% )
0.071603623 seconds time elapsed ( +- 0.32% )
```
```shell
Performance counter stats for './phonebook_mmap' (100 runs):
1,083,608 cache-misses # 76.881 % of all cache refs ( +- 0.16% )
1,409,459 cache-references ( +- 0.54% )
415,664,954 instructions # 1.26 insns per cycle ( +- 0.02% )
328,644,359 cycles ( +- 0.31% )
0.105448071 seconds time elapsed ( +- 0.38% )
```
> Here you can run `$ perf record ./phonebook_orig` followed by `$ perf annotate --no-source` to see how much time each assembly instruction costs.
> ![](https://i.imgur.com/q8FO6T3.png)
From gprof we can see that the time is clearly dominated by `file_align()`.
> Remember to compile with `-pg` when using `gprof`.
```shell
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
66.78 0.02 0.02 1 20.03 20.03 file_align
33.39 0.03 0.01 3 3.34 3.34 findName
0.00 0.03 0.00 349900 0.00 0.00 append
0.00 0.03 0.00 2 0.00 0.00 diff_in_second
0.00 0.03 0.00 1 0.00 0.00 fsize
```
Comparing from append()'s point of view, mmap-ing in advance saves about forty percent of the time.
![](https://i.imgur.com/ptRIIis.png)
- [ ] Study MARSSx86 (http://marss86.org/~marss86/index.php/Home)
## memory pool
The idea of a memory pool is to malloc a chunk of memory **in advance** and, whenever memory is needed later, hand pieces out from that pre-allocated chunk. In a program that calls malloc heavily, a memory pool saves the overhead of the many malloc calls; the instruction count measured by perf dropped by about twenty-five percent.
* Advantages:
    * Releasing the memory for thousands of objects in a pool is just one operation.
    * Memory pools can be grouped in hierarchical tree structures.
    * Fixed-size block memory pools do not need to store allocation metadata for each allocation.
    * Allows deterministic behavior on real-time systems, avoiding out-of-memory errors.
    * Avoids fragmentation.
    * Multithreaded programs can reduce their synchronization overhead.
* Disadvantages:
    * Needs to be tuned for the application that deploys it.
```C
/* Carve one entry at a time out of the pre-allocated pool
 * instead of calling malloc() once per record. */
for (int i = 0; i < data_count; i++) {
    entry *next = (entry *) mp_get(mp, sizeof(entry));
    e = append_mp(map + i * MAX_LAST_NAME_SIZE, e, next);
}

/* Bump-pointer allocation: return the current head and advance it. */
void *mp_get(Memory_pool *mp, size_t size)
{
    if ((size_t) (mp->tail - mp->head) < size)
        return NULL;
    void *tmp = mp->head;
    mp->head = mp->head + size;
    return tmp;
}
```
```shell
Performance counter stats for './phonebook_mp' (100 runs):
733,750 cache-misses # 72.755 % of all cache refs ( +- 0.30% )
1,008,523 cache-references ( +- 0.22% )
310,568,327 instructions # 1.15 insns per cycle ( +- 0.04% )
270,114,028 cycles ( +- 0.17% )
0.088669643 seconds time elapsed ( +- 0.23% )
```
![](https://i.imgur.com/5hoAoqP.png)
> This chart shows the times with mmap + memory pool.
### Fast Efficient Fixed-Size Memory Pool
#### I. INTRODUCTION
* Naive memory pool implementations initialize all the memory pool segments when created. This can be expensive since it is usually necessary to loop over all the uninitialized segments.
#### II. RELATED WORK
* What is presented here is not novel but a modification of an existing technique, whereby loops and initialization overheads are removed.
* It is among the most memory-efficient implementations available, since it has a very small memory footprint while giving O(1) access time.
#### IV. HOW IT WORKS
* When the pool is created, we obtain a contiguous section of memory whose start and end addresses we know.
* Needs a bookkeeping algorithm to keep track of which blocks are used and unused as they are allocated and deallocated.
> The bookkeeping algorithm keeps a list of the unused blocks.
* Uses a four-byte index number to identify any memory location.
> Each block must be at least 4 bytes, since it has to store the block number of the next unused block.
* A variable is initialized to tell us how many of the n blocks have been appended to the unused list.
> This variable takes no extra memory; it is stored inside the unused blocks themselves.
> ![](https://i.imgur.com/H6Ifw9f.png)
* We keep track of the head of the unused list of blocks, updating it after each allocation; when a block is de-allocated, it is appended back onto the unused list.
* Keep track of how many blocks have been added to the list and stop appending new blocks once the upper limit is reached. A sketch of this scheme follows below.
* Writing a custom memory pool allocator can be both difficult and error prone.
> Compared with a fixed-size memory pool, a general allocator needs more complex and time-consuming checking mechanisms.
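Below is a sketch of the paper's scheme in C (my own reconstruction of sections IV-V, not code from this repository): each unused block stores the 4-byte index of the next unused block, and blocks are initialized lazily, one per allocation, so creating the pool needs no loop.
```C
#include <stdint.h>

typedef struct {
    uint32_t num_blocks;      /* total number of blocks n            */
    uint32_t block_size;      /* bytes per block, must be >= 4       */
    uint32_t num_free;        /* blocks currently available          */
    uint32_t num_initialized; /* blocks already linked into the list */
    uint8_t *mem;             /* start of the contiguous region      */
    uint8_t *head;            /* head of the unused-block list       */
} fixed_pool;

void pool_init(fixed_pool *p, uint8_t *mem, uint32_t n, uint32_t bsize)
{
    /* No loop over the blocks here: initialization is deferred. */
    p->num_blocks = n;
    p->block_size = bsize;
    p->num_free = n;
    p->num_initialized = 0;
    p->mem = mem;
    p->head = mem;
}

static uint8_t *addr_from_index(fixed_pool *p, uint32_t i)
{
    return p->mem + i * p->block_size;
}

void *pool_alloc(fixed_pool *p)
{
    /* Lazily link one more block into the unused list per call,
     * instead of looping over all blocks at creation time. */
    if (p->num_initialized < p->num_blocks) {
        uint32_t *slot = (uint32_t *) addr_from_index(p, p->num_initialized);
        *slot = ++p->num_initialized;
    }
    if (p->num_free == 0)
        return NULL;
    void *ret = p->head;
    if (--p->num_free > 0)
        p->head = addr_from_index(p, *(uint32_t *) p->head);
    else
        p->head = NULL;
    return ret;
}

void pool_free(fixed_pool *p, void *block)
{
    /* The freed block records the index of the old head and
     * becomes the new head of the unused list. */
    *(uint32_t *) block = p->head
                          ? (uint32_t) ((p->head - p->mem) / p->block_size)
                          : p->num_blocks;
    p->head = block;
    p->num_free++;
}
```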
#### V. IMPLEMENTATION
* Four essential public functions: Create, Destroy, Allocate, and De-allocate.
> create/destroy are used rather than constructor/destructor, which makes dynamic creation easier.
* Combines the fixed pool allocator with overloading of the new and delete operators.
* The greatest care must be exercised to ensure that classes and structures have their constructors and destructors called manually.
#### VI. LIMITATIONS
* Needs a contiguous block of memory.
* Hardware limitations (e.g. page faults) are not discussed.
* The issue of using the memory pool in a multi-threaded environment is not addressed.
* If the requested memory is dramatically smaller than the slot size, lots of memory is wasted.
* If the requested memory is greater than the slot size, it is impossible to allocate from the pool.
* Finding a suitable block of memory would require considerable searching overhead, with unsuitable and unusable memory scattered around.
#### VII. RESIZING
* This list resides in the unused memory and is incrementally extended when a memory block is allocated.
* The algorithm currently always initializes the next unused memory block during the allocation call.
> An extra check could be added to decide whether the next unused block really needs initializing.
* The large pool of memory could be resized down without needing to destroy and re-create the pool.
#### IX. CONCLUSION AND FURTHER WORK
* Keep It Short and Simple
* Need to investigate if the algorithm could be optimised to use less decisional logic
* Exploring hardware considerations
> e.g. caching, paging, registers, memory alignment, threading
## Fuzzy search
* The problem of fuzzy string searching can be formulated as follows:
:::info
Find in the text or dictionary of size n all the words that match the given word (or start with the given word), taking into account k possible differences (errors).
:::
The algorithms can be grouped as follows:
* Levenshtein distance
* Damerau-Levenshtein distance
* Bitap algorithm with modifications by Wu and Manber
* Spell-checker method
* N-gram method
* Signature hashing method
* BK-trees
These algorithms compute the **distance** between two words; the distance expresses how similar they are.
### Levenshtein Distance
The edit distance is the number of operations (**deletion**, **substitution**, or **insertion** of characters) needed to turn one string (the source) into another (the target). Comparing edit distances tells us how similar two strings are.
![](https://i.imgur.com/5MpfP8L.png)
The implementation follows the algorithm on Wikipedia. The recursive version does an enormous amount of work: short strings can be searched, but longer strings spend far too long computing the edit distance. A dynamic-programming sketch follows after the code.
For example, turning `kitten` into `sitting`:
step 1: kitten → sitten (substitution of "s" for "k")
step 2: sitten → sittin (substitution of "i" for "e")
step 3: sittin → sitting (insertion of "g" at the end)
```C
int LevenshteinDistance(const char *s, int len_s, const char *t, int len_t)
{
    int cost;

    /* base case: empty strings */
    if (len_s == 0) return len_t;
    if (len_t == 0) return len_s;

    /* test if last characters of the strings match */
    if (s[len_s - 1] == t[len_t - 1])
        cost = 0;
    else
        cost = 1;

    /* return minimum of: delete char from s, delete char from t,
     * and substitute the last char of both */
    return minimum(LevenshteinDistance(s, len_s - 1, t, len_t    ) + 1,
                   LevenshteinDistance(s, len_s    , t, len_t - 1) + 1,
                   LevenshteinDistance(s, len_s - 1, t, len_t - 1) + cost);
}
```
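The recursive version above recomputes the same subproblems exponentially many times. A dynamic-programming rewrite (a sketch of the standard algorithm, not the repository's code) brings the cost down to O(len_s × len_t) time with only two rows of storage:
```C
#include <stdlib.h>

/* Dynamic-programming Levenshtein distance: O(len_s * len_t) time,
 * two rolling rows instead of exponential recursion. */
int LevenshteinDistanceDP(const char *s, int len_s, const char *t, int len_t)
{
    int *prev = malloc((len_t + 1) * sizeof(int));
    int *curr = malloc((len_t + 1) * sizeof(int));
    for (int j = 0; j <= len_t; j++)
        prev[j] = j;               /* distance from the empty prefix of s */
    for (int i = 1; i <= len_s; i++) {
        curr[0] = i;               /* distance to the empty prefix of t   */
        for (int j = 1; j <= len_t; j++) {
            int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            int del = prev[j] + 1;        /* delete from s     */
            int ins = curr[j - 1] + 1;    /* insert into s     */
            int sub = prev[j - 1] + cost; /* substitute        */
            int min = del < ins ? del : ins;
            curr[j] = min < sub ? min : sub;
        }
        int *tmp = prev;           /* swap the two rows */
        prev = curr;
        curr = tmp;
    }
    int result = prev[len_t];
    free(prev);
    free(curr);
    return result;
}
```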
```shell
hugikun@hugikun:~/embedded/team/phonebook$ ./phonebook_mp hjka
size of entry : 136 bytes
Can not find the name you want.Do you want to fuzzy search similar name?
Y/N: Y
The lastName similar to hjka:
aeka
ahka
akka
.
.
.
execution time of append() : 0.017253 sec
execution time of findName() : 70.463820 sec
```
### SOUNDEX
SOUNDEX compares on the basis of English pronunciation: each letter is assigned a digit according to its phonetic class, and strings are compared by the resulting SOUNDEX codes. Many different strings map to the same code, so fuzzy search returns far too many results, and other methods are needed to narrow the search down to the target string (a sketch of one possible encoder follows the code below).
```shell
b, f, p, v = 1
c, g, j, k, q, s, x, z = 2
d, t = 3
l = 4
m, n = 5
r = 6
```
```C
/* Print every lastName whose Soundex code matches the query's code. */
void fuzzy_search(char *lastName, entry *e)
{
    int tar_ans = Soundex(lastName);
    printf("The lastName similar to %s:\n", lastName);
    while (e != NULL) {
        int ans = Soundex(e->lastName);
        if (ans == tar_ans)
            printf("%s: %d\n", e->lastName, ans);
        e = e->pNext;
    }
}
```
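The repository's `Soundex()` itself is not shown above. Below is a minimal sketch that matches the call signature; summing the digit classes of all letters is **an assumption on my part** (the actual implementation may differ), but it reproduces the heavy-collision behavior seen in the run that follows.
```C
/* Map each lowercase letter to its SOUNDEX digit class
 * (vowels, h, w, y map to 0) and sum the digits. Because many
 * names share one sum, this code collides very often. */
static const int soundex_digit[26] = {
    /* a  b  c  d  e  f  g  h  i  j  k  l  m */
       0, 1, 2, 3, 0, 1, 2, 0, 0, 2, 2, 4, 5,
    /* n  o  p  q  r  s  t  u  v  w  x  y  z */
       5, 0, 1, 2, 6, 2, 3, 0, 1, 0, 0, 0, 2 };

int Soundex(char *str)
{
    int code = 0;
    for (; *str; str++) {
        if (*str >= 'a' && *str <= 'z')
            code += soundex_digit[*str - 'a'];
    }
    return code;
}
```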
```shell
hugikun@hugikun:~/embedded/team/phonebook$ ./phonebook_mp asd
size of entry : 136 bytes
Can not find the name you want.Do you want to fuzzy search similar name?
Y/N: Y
The lastName similar to asd:
aaccf: 5
aaem: 5
aaiun: 5
aana: 5
aani: 5
ababdeh: 5
.
.
.
zway: 5
zythia: 5
execution time of append() : 0.016174 sec
execution time of findName() : 0.054875 sec
```
### Combining Levenshtein Distance and SOUNDEX
Because computing the edit distance is far too expensive, the two methods above are combined: the cheap SOUNDEX pass filters out part of the strings first, and then Levenshtein Distance computes the edit distance on the remainder. For now this only works for short strings. For long strings the search time drops dramatically, but the correct string can no longer be found: if the SOUNDEX threshold is too tight, the correct string is filtered out, while loosening it makes the computation huge again.
> No good way to solve this yet.
```shell
hugikun@hugikun:~/embedded/team/phonebook$ ./phonebook_mp aappmedicp
size of entry : 136 bytes
Can not find the name you want.Do you want to fuzzy search similar name?
Y/N: Y
The lastName similar to aappmedicp:
execution time of append() : 0.023745 sec
execution time of findName() : 323.052746 sec
```
## Input Data
- [ ] statistics
- [ ] engineering analysis
- [x] improvement
### Analysis
To check whether the dataset is representative, the 1990 census from the [United States Census Bureau](https://www.census.gov/) is used as a control group.
A suitable dataset needs:
* real names
* a realistic name distribution
Below are the words in ==words.txt== that also appear in the census data:
| | Census | words.txt |
|:-:|:-------:|:---------:|
| Total records | 6290251 | 349900 |
| Distinct last names | 88799 | 27820 |
| Distinct female first names | 4275 | 3083 |
| Distinct male first names | 1219 | 1071 |
| Valid ratio | 98.38% | 9.14% |
Although ==words.txt== covers a fairly high share of the real names in the census, those names make up only 9.14% of the file; the other three hundred thousand plus entries are meaningless words. A real phonebook would never look like this, so the file as a whole is not suitable as phonebook names.
I then ran the same code on ==./dictionary/all-names.txt==. About 24% of this file are real names, more than in the previous file, and it does contain repeated last and first names, but the amount of data is too small to serve as a dataset with realistic proportions.
| | Census | all-names.txt |
|:-:|:-------:|:---------:|
| Total records | 6290251 | 16751 |
| Non-real names | 0 | 12705 |
| Total last names | 6290251 | 6851 |
| Female first names | 3184399 | 5514 |
| Male first names | 3003954 | 2048 |
| Female first-name share | 50.62% | 32.92% |
| Male first-name share | 47.76% | 12.23% |
#### Investigating entropy analysis
[Entropy (information theory)](https://en.wikipedia.org/wiki/Entropy_(information_theory)) : entropy (more specifically, Shannon entropy) is the expected value (average) of the information contained in each message. 'Messages' can be modeled by any flow of information.
> In information theory, entropy is a measure of uncertainty. In the information world, higher entropy means more information can be carried; lower entropy means less.
Entropy measures the uncertainty or disorder of a system. Put another way: the higher the entropy of a dataset, the messier and less predictable the data.
However, I found no suitable way to apply it to analyzing or improving the dataset. Entropy can measure how much information a word carries, or whether compression improved the compression ratio. We could compute the disorder of the observed name frequencies and the disorder of the real census frequencies, but even if the two entropies are close, that only says some words appear with frequencies similar to reality; it cannot tell us whether the equally frequent words are the same words. So this method is not used for the analysis.
For example, suppose the real-world distribution of names is
```
SMITH, SMITH, SMITH, SMITH, JOHNSON, JOHNSON, JOHNSON, BROWN, BROWN, DAVIS
```
but the data I am analyzing is
```
JONES, JONES, JONES, JONES, LEE, LEE, LEE, HILL, HILL, SCOTT
```
The two lists share the same frequency profile (4, 3, 2, 1 out of 10), so their entropies are exactly equal (about 1.85 bits for the frequency distribution), yet the equally frequent words are different. That is why I consider entropy unsuitable for this problem.
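For reference, the Shannon entropy of a frequency distribution is
$$H = -\sum_i p_i \log_2 p_i$$
Both lists above have probabilities $0.4, 0.3, 0.2, 0.1$, so
$$H = -(0.4\log_2 0.4 + 0.3\log_2 0.3 + 0.2\log_2 0.2 + 0.1\log_2 0.1) \approx 1.85 \text{ bits}$$
for either list, even though the words themselves differ.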
### Improvement
For the improvement, I took the 1990 name [data](https://www.census.gov/topics/population/genealogy/data/1990_census.html) from the [United States Census Bureau](https://www.census.gov/). That census collected data on 7,200,000 people, split into three statistics (last names, male first names, female first names), each sorted by frequency of appearance.
```
File Name Valid Records Unique Names
1. dist.all.last 6,290,251 88,799
2. dist.female.first 3,184,399 4,275
3. dist.male.first 3,003,954 1,219
```
First, note that our dataset is the phonebook contents themselves: among 349900 people, even the lowest-frequency name appears about 4 times, and identical names in a phonebook cannot be told apart, so every record in the test data below uses first name + last name.
Both the first-name and last-name files come with frequencies of appearance. The improved dataset pairs first names, drawn according to their proportions, with last names also drawn according to theirs (a generator sketch follows after the sample). It looks roughly like this:
```shell
MARY MILES
MARY STEWART
MARY AGEE
MARY ANDERSON
MARY CRAVEN
MARY CLARK
MARY FLUDD
MARY FORREST
MARY BARELA
MARY LARA
MARY KELLER
MARY RENEAU
```
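A minimal sketch of such a generator (my own illustration, not the script in the repository; it assumes the census files use the format `NAME freq(%) cum_freq(%) rank` and writes each name a number of times proportional to its frequency; pairing with last names is a second pass of the same kind):
```C
#include <stdio.h>

/* Read a census-format file (NAME freq cum_freq rank) and print each
 * name round(freq% * total / 100) times. */
static void emit_proportional(const char *path, long total)
{
    FILE *fp = fopen(path, "r");
    if (!fp)
        return;

    char name[64];
    double freq, cum;
    int rank;
    while (fscanf(fp, "%63s %lf %lf %d", name, &freq, &cum, &rank) == 4) {
        long copies = (long) (freq * total / 100.0 + 0.5);
        for (long i = 0; i < copies; i++)
            printf("%s\n", name);
    }
    fclose(fp);
}

int main(void)
{
    /* dist.female.first is one of the three census files. */
    emit_proportional("dist.female.first", 2861372);
    return 0;
}
```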
Finally, the improved data is compared against the census. The improved dataset contains real last names and real first names, and its male/female first-name ratio is close to reality.
| | Census | full-names.txt |
|:-:|:-------:|:----------:|
| Total records | 6290251 | 5567224 |
| Last names | 6290251 | 5567224 |
| Female first names | 3184399 | 2861372 |
| Male first names | 3003954 | 2727188 |
| Female first-name share | 50.62% | 51.40% |
| Male first-name share | 47.76% | 48.99% |
## 95% confidence interval
The measured times fluctuate, so a confidence-interval computation was added to main.c to make the append() and findName() timings more reliable. A 95% confidence interval is used here, meaning the true value lies within the interval with 95% probability.
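As a sketch of the computation (assuming the usual normal approximation): with sample mean $\bar{x}$, sample standard deviation $s$, and $n = 100$ runs, the interval is
$$\bar{x} \pm 1.96\,\frac{s}{\sqrt{n}}$$
where 1.96 is the 97.5th percentile of the standard normal distribution.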
## Relationship between hash table size and collisions
**Hash function**
* Transforms a message or data into a **hash value (hash code, hash sum, or hash)**. The hash value serves as an array index, and the data is stored at that index. A hash value is usually represented by a short string of random-looking letters and digits.
**Hash table**
* A hash table uses the hash value derived from the data as an index, so data can be looked up quickly.
![](https://i.imgur.com/D85eZMz.png)
* In general, if two hash values are **different** (under the same function), the two original inputs are also **different**.
* But if two hash values are **the same** (under the same function), the original inputs are **not necessarily** the same; this is a collision. On a collision, the data is simply chained after the same index.
* A good hash function should have genuinely **uniform**, random-looking output, and a random hash function rarely shows a very high collision rate; still, a small, estimable number of collisions is unavoidable in practice.
* The common hash functions below are therefore compared to analyze their output uniformity and collision rate, along with the differences in cache misses and append()/findName() times inside phonebook.
### 不同 Hash function 的比較
* Implements the 8 hash functions (SDBM, RS, JS, PJW, ELF, BKDR, DJB2, AP) from [各种字符串Hash函数比较](https://www.byvoid.com/zhs/blog/string-hash-compare)
* Tries the object-oriented approach from the w5 assignment [matrix_oo](https://github.com/sysprog21/matrix_oo) to encapsulate the hash functions
phonebook_hash.h
```c=30
typedef struct __HASH_ALGO {
    unsigned int (*hashfuction)(char *str); /* string -> hash value */
    hashTable* (*init)();                   /* allocate the table   */
    void (*free)(hashTable *ht);            /* release the table    */
} hashAlgo;

entry* findName(char lastName[], hashTable *ht, hashAlgo *algo);
void append(char lastName[], hashTable *ht, hashAlgo *algo);
extern hashAlgo SDBMHashProvider;
extern hashAlgo RSHashProvider;
extern hashAlgo JSHashProvider;
extern hashAlgo PJWHashProvider;
extern hashAlgo ELFHashProvider;
extern hashAlgo BKDRHashProvider;
extern hashAlgo DJB2HashProvider;
extern hashAlgo APHashProvider;
```
Then switching hash functions is just a matter of pointing at a different implementation:
```c=212
hashAlgo BKDRHashProvider = {
    .hashfuction = BKDRHash,
    .init = initHashTable,
    .free = freeHashTable,
};
```
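For reference, the BKDR hash selected above is the classic multiply-by-a-small-prime string hash. A minimal sketch (the seed 131 is the customary choice; the repository's version may differ in seed or in how the result is reduced):
```C
/* BKDR string hash: hash = hash * seed + c for each character,
 * with a small prime seed. The caller reduces the result modulo
 * the hash table size to get a bucket index. */
unsigned int BKDRHash(char *str)
{
    unsigned int seed = 131; /* 31, 131, 1313, 13131, ... also work */
    unsigned int hash = 0;
    while (*str)
        hash = hash * seed + (unsigned char) *str++;
    return hash;
}
```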
This provider pattern makes switching inside main.c convenient without lots of conditionals; it could probably be written even more cleanly with the **ELF linker set** technique the instructor mentioned.
main.c
```c=
hashAlgo *hash_function_providers[] = {
    &SDBMHashProvider,
    &RSHashProvider,
    &JSHashProvider,
    &PJWHashProvider,
    &ELFHashProvider,
    &BKDRHashProvider,
    &DJB2HashProvider,
    &APHashProvider
};

#ifdef HASH
/* init hash table */
hashAlgo *algo = hash_function_providers[0];
hashTable *ht = algo->init();
#endif
```
- [x] comparison charts
### Analysis of table size and collisions
* The distribution of the input data `words.txt` under each hash function is analyzed below for different table sizes
#### **Table size : 3000**
* Plotting all the distributions shows that ELF and PJW vary a great deal; the mean, standard deviation, and coefficient of variation of the bucket loads are as follows
![](https://i.imgur.com/n9RtSQB.png)
| | SDBM | RS | JS | PJW | ELF | BKDR | DJB2 | AP |
|:----:|:-----:|:------:|:----:|:-----:|:------:|:----:|:-----:|:-----:|
| Mean | 116.63 | 116.63 | 116.63 | 116.63 | 116.63 | 116.63 | 116.63 | 116.63 |
| Standard deviation | 10.686 | 10.683 | 10.785 | 56.979 | 56.979 | ++**10.592**++ | 10.753 | 10.666 |
| Coefficient of variation | 0.09162 | 0.09160 | 0.09247 | 0.4885 | 0.4885 | ++**0.0908**++ | 0.0921 | 0.0914 |
* Since the input data `words.txt` has 349900 entries, each bucket should ideally receive 349900 / 3000 ≈ 116.63 entries on average. Looking at the standard deviation and coefficient of variation, BKDR is the most evenly distributed (smallest standard deviation and coefficient of variation).
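As a statistical sanity check (my own addition, not in the original analysis): if a hash function behaved perfectly randomly, the load of each bucket would be approximately Poisson-distributed with $\lambda = n/m$, so the standard deviation should be about
$$\sigma \approx \sqrt{\lambda} = \sqrt{349900/3000} \approx 10.80$$
This is essentially what SDBM, RS, JS, BKDR, DJB2, and AP achieve, while PJW and ELF sit far above it. The same holds at the larger table sizes below: $\sqrt{349900/30000} \approx 3.42$ and $\sqrt{349900/300000} \approx 1.08$.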
![](https://i.imgur.com/dqPgXAY.png)
![](https://i.imgur.com/kfYJobt.png)
#### **Table size : 30000**
* Next, the table size is raised to 30000
* At this table size RS performs best; apart from PJW and ELF, the functions differ very little
![](https://i.imgur.com/kubFl2Y.png)
| | SDBM | RS | JS | PJW | ELF | BKDR | DJB2 | AP |
|:----:|:-----:|:------:|:----:|:-----:|:------:|:----:|:-----:|:-----:|
| Standard deviation | 3.4135 | ++**3.3923**++ | 3.4240 | 10.2899 | 10.2899 | 3.4080 | 3.4108 | 3.3959 |
| Coefficient of variation | 0.2926 | ++**0.2908**++ | 0.2935 | 0.8822 | 0.8822 | 0.2921 | 0.2924 | 0.2911 |
![](https://i.imgur.com/sQvdbNp.png)
![](https://i.imgur.com/2Q36pdu.png)
#### **Table size : 300000**
* Finally, the table size is raised to 300000, the same order of magnitude as the number of input records. BKDR again performs best; although it differs from the other functions only in the third decimal place, overall BKDR has the most stable distribution across all three table sizes
> PJW and ELF always produce exactly the same distribution. At first this looked like an implementation bug, but ELF hash is essentially a variant of the PJW hash, so identical behavior is plausible.
![](https://i.imgur.com/z4ZrNQU.png)
| | SDBM | RS | JS | PJW | ELF | BKDR | DJB2 | AP |
|:----:|:-----:|:------:|:----:|:-----:|:------:|:----:|:-----:|:-----:|
| Standard deviation | 1.0791 | 1.0802 | 1.0796 | 1.4691 | 1.4691 | ++**1.0788**++ | 1.0795 | 1.0816 |
| Coefficient of variation | 0.9252 | 0.9262 | 0.9257 | 1.2596 | 1.2596 | ++**0.9250**++ | 0.9256 | 0.9274 |
![](https://i.imgur.com/mjj79pj.png)
![](https://i.imgur.com/ORMx3Nr.png)
## Binary search tree and cache line size
### Binary search tree
In the original program the phonebook data is already sorted (from A to Z). The implementation therefore follows [Sorted Linked List to Balanced BST](http://www.geeksforgeeks.org/sorted-linked-list-to-balanced-bst/): the sorted linked list built by append() is converted into a binary search tree. Building the BST takes the list head and the total node count as input and returns the root node (4 or 3 in the examples below).
```text=
Input: Linked List 1->2->3->4->5->6->7
Output: A Balanced BST
        4
      /   \
     2     6
    / \   / \
   1   3 5   7

Input: Linked List 1->2->3->4
Output: A Balanced BST
      3
    /  \
   2    4
  /
 1
```
```c=34
node *sortedListToBST(entry *e)
{
    /* Walk the list once to count its length. */
    int countListLength = 0;
    entry *tmp = e;
    while (tmp) {
        countListLength++;
        tmp = tmp->pNext;
    }
    return sortedListToBSTRecur(&e, countListLength);
}

node *sortedListToBSTRecur(entry **e, int listlength)
{
    if (listlength <= 0)
        return NULL;
    /* Build the left subtree from the first half, consume the middle
     * node as the root, then build the right subtree from the rest;
     * *e advances through the list as nodes are consumed. */
    node *left = sortedListToBSTRecur(e, listlength / 2);
    node *root = (node *) malloc(sizeof(node));
    root->left = left;
    root->pEntry = *e;
    *e = (*e)->pNext;
    root->right = sortedListToBSTRecur(e, listlength - listlength / 2 - 1);
    return root;
}
```
```
Performance counter stats for './phonebook_bst' (100 runs):
524,818 cache-misses # 68.953 % of all cache refs ( +- 0.16% )
761,122 cache-references ( +- 0.31% )
330,462,207 instructions # 1.72 insns per cycle ( +- 0.02% )
192,097,333 cycles ( +- 0.28% )
0.074406735 seconds time elapsed ( +- 1.27% )
```
![](https://i.imgur.com/qWK0nD8.png)
* Here the hash table size is 42937 and the hash function is BKDR
### cache line size
This part follows [vtim9907](https://github.com/vtim9907/phonebook)'s implementation. To match the cache line characteristics, the node structure is changed to 64 bytes. The original node holds only two pointers to the left and right subtrees plus one pointer to a phonebook entry, for a size of 8 * 3 = 24 bytes:
```c=
typedef struct __PHONE_BOOK_BST_NODE {
    struct __PHONE_BOOK_BST_NODE *left;
    struct __PHONE_BOOK_BST_NODE *right;
    entry *pEntry;
} node;
```
To enlarge the node, the records from `words.txt` are grouped, six to a group in the example below. The node gains an array of 6 pointers to phonebook entries, for a size of 8 * 6 + 8 * 2 = 64 bytes:
```c=
typedef struct __PHONE_BOOK_BST_NODE {
    struct __PHONE_BOOK_BST_NODE *left;
    struct __PHONE_BOOK_BST_NODE *right;
    entry *pEntry[6];
} node;
```
```
Performance counter stats for './phonebook_bst' (100 runs):
453,643 cache-misses # 63.317 % of all cache refs ( +- 0.19% )
716,468 cache-references ( +- 0.25% )
251,680,487 instructions # 1.58 insns per cycle ( +- 0.04% )
159,466,589 cycles ( +- 0.20% )
0.060881942 seconds time elapsed
```
![](https://i.imgur.com/WxuWXqM.png)
* append() time drops from 0.064934 to 0.051413, about 20% lower
* Total execution time drops from 0.074406735 to 0.060881942, about 18% lower
* The cache-miss rate also drops, from 68.9% to 63.3%
## reference
- [heathcliffYang's notes](https://hackmd.io/s/Hyr9PhFYx)
- [hash function 觀念和實務](https://hackmd.io/s/HJln3jU_e)
- [git branch and remote repo](https://ihower.tw/blog/archives/2620)
- [Perf -- Linux下的系统性能调优工具,第 1 部分](https://www.ibm.com/developerworks/cn/linux/l-cn-perf1/)
- Fast Efficient Fixed-Size Memory Pool
- [Using Fuzzy Matching to Search by Sound with Python](http://www.informit.com/articles/article.aspx?p=1848528)
- [NIKITA'S BLOG: Fuzzy string search](http://ntz-develop.blogspot.tw/2011/03/fuzzy-string-search.html)
- [Charles' notes](https://charles620016.hackpad.com/ep/pad/static/Japi4qFyAzt)
- [SOUNDEX](https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions148.htm)
- [Levenshtein distance (Wikipedia)](https://en.wikipedia.org/wiki/Levenshtein_distance)
- [Entropy (information theory) (Wikipedia)](https://en.wikipedia.org/wiki/Entropy_(information_theory))
- [vtim9907's notes](https://hackmd.io/s/r1YyTRqFe#phonebook-*%16)