# Phonebook
contributed by <`hugikun999`>, <`claaaaassic`>, <`Weinux`>
GitHub: [phonebook](https://github.com/claaaaassic/phonebook)
:::info
Task: following B01: phonebook, support fuzzy search, so that searching for a lastName does not require typing the exact spelling. Point out the problems of the original dataset (with engineering analysis backed by statistics) and improve it.
:::
- [x] fuzzy search
- [x] cache block size (linked list, binary search tree)
- [x] 95% confidence interval
- [ ] delete, insert API
- [x] input data
- [x] impact of mmap on bulk reads
- [x] relationship between hash table size and collisions
- [x] Memory Pool
## Impact of mmap on bulk reads
[Which is fastest: read, fread, ifstream or mmap?](http://lemire.me/blog/2012/06/26/which-is-fastest-read-fread-ifstream-or-mmap/)
1. The setup overhead of `mmap` is heavier than that of `read`.
2. Memory maps are faster than `read` for random access.
3. A memory map keeps its pages in memory until you finish your work. With `read`, the file's pages may already have been flushed out of memory.
4. Use memory maps if you access data **randomly**, keep it around for a long time, or know you can share it with other processes.
If your access pattern is **sequential**, `mmap` is unlikely to be more efficient than the other functions.
5. Memory maps are less robust; the slightest mistake can crash your program.
6. `mmap` accesses memory directly.
`read` fetches the data from disk into the page cache and then copies it to the buffer you specify.
* Downsides of mmap:
    * page faults when pages are first touched
    * quite noticeable setup and teardown costs
* Upsides of mmap:
    * pays off if the data gets re-used over and over again
    * if the data is large, the kernel can drop pages as memory pressure forces the system to page things out
Conclusion: `mmap` is only more efficient for random access; for sequential access, `fread` is the more efficient choice.
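To make the comparison concrete, here is a minimal sketch of the two access paths (my own illustration in POSIX C, a byte-sum over a file, not the linked benchmark itself):
```C
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sum all bytes of a file with read() into a small buffer. */
long sum_read(const char *path)
{
    char buf[4096];
    long sum = 0;
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        for (ssize_t i = 0; i < n; i++)
            sum += (unsigned char) buf[i];
    close(fd);
    return sum;
}

/* Sum all bytes by mapping the whole file and touching its pages. */
long sum_mmap(const char *path)
{
    long sum = 0;
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    struct stat st;
    fstat(fd, &st);
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    for (off_t i = 0; i < st.st_size; i++)
        sum += (unsigned char) p[i];
    munmap(p, st.st_size);
    close(fd);
    return sum;
}
```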
### [Code to test](https://eklausmeier.wordpress.com/2016/02/03/performance-comparison-mmap-versus-read-versus-fread/)
The post above provides a program for comparing `mmap` and `fread`. I used words.txt as the input file and repeated the measurement 100 times with `$ perf stat`. `mmap` shows a far lower cache-miss count, but overall execution is not faster.
(1) fread
```shell
Performance counter stats for './tbytesum1 -f words.txt' (100 runs):
6,491 branch-misses ( +- 0.42% )
21,510 cache-misses # 40.280 % of all cache refs ( +- 2.39% )
53,401 cache-references ( +- 1.02% )
22,061,482 cycles ( +- 0.21% )
0.007478928 seconds time elapsed ( +- 0.36% )
```
(2) mmap
```shell
Performance counter stats for './tbytesum1 -m words.txt' (100 runs):
6,554 branch-misses ( +- 0.29% )
6,788 cache-misses # 29.101 % of all cache refs ( +- 3.07% )
23,325 cache-references ( +- 0.65% )
25,130,427 cycles ( +- 0.10% )
0.008581382 seconds time elapsed ( +- 1.01% )
```
### Actual comparison on phonebook
Judging from total elapsed time, the mmap version is not faster but slower. The reason is that it **first aligns `words.txt`** with `file_align()`, which adds many extra instructions compared to the original version, so the total time does not improve; however, cache misses drop by roughly ten percent.
```C
/* Align every record to MAX_LAST_NAME_SIZE bytes, then map the
 * aligned file into memory so entries can be indexed directly. */
file_align(DICT_FILE, FILE_OUT, MAX_LAST_NAME_SIZE);
fd = open(FILE_OUT, O_RDONLY | O_NONBLOCK);
if (fd == -1) {   /* open() returns -1 on failure, not 0 */
    printf("cannot open the file\n");
    return -1;
}
size = fsize(FILE_OUT);
data_count = size / (sizeof(char) * MAX_LAST_NAME_SIZE);
map = mmap(NULL, size, PROT_READ, MAP_SHARED | MAP_NONBLOCK, fd, 0);
```
```shell
Performance counter stats for './phonebook_orig' (100 runs):
1,239,627 cache-misses # 88.189 % of all cache refs ( +- 0.76% )
1,405,652 cache-references ( +- 0.86% )
321,272,267 instructions # 1.45 insns per cycle ( +- 0.01% )
221,877,205 cycles ( +- 0.29% )
0.071603623 seconds time elapsed ( +- 0.32% )
```
```shell
Performance counter stats for './phonebook_mmap' (100 runs):
1,083,608 cache-misses # 76.881 % of all cache refs ( +- 0.16% )
1,409,459 cache-references ( +- 0.54% )
415,664,954 instructions # 1.26 insns per cycle ( +- 0.02% )
328,644,359 cycles ( +- 0.31% )
0.105448071 seconds time elapsed ( +- 0.38% )
```
> Here you can run `$ perf record ./phonebook_orig` followed by `$ perf annotate --no-source` to see how much time each assembly instruction costs.
> ![](https://i.imgur.com/q8FO6T3.png)
From gprof we can see that the time is clearly dominated by `file_align()`.
> Remember to compile with `-pg` when using `gprof`.
```shell
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
66.78 0.02 0.02 1 20.03 20.03 file_align
33.39 0.03 0.01 3 3.34 3.34 findName
0.00 0.03 0.00 349900 0.00 0.00 append
0.00 0.03 0.00 2 0.00 0.00 diff_in_second
0.00 0.03 0.00 1 0.00 0.00 fsize
```
Comparing from append()'s point of view, mmap-ing in advance saves about forty percent of the time.
![](https://i.imgur.com/ptRIIis.png)
- [ ] Study MARSSx86 (http://marss86.org/~marss86/index.php/Home)
## memory pool
The idea of a memory pool is to malloc a chunk of memory **in advance** and, whenever memory is needed later, hand pieces out from that pre-allocated chunk. In a program that calls malloc heavily, a memory pool saves the overhead of the many malloc calls; the instruction count measured by perf dropped by about twenty-five percent.
* Advantages:
    * Releasing the memory for thousands of objects in a pool is just one operation.
    * Memory pools can be grouped in hierarchical tree structures.
    * Fixed-size block memory pools do not need to store allocation metadata for each allocation.
    * Allows deterministic behavior on real-time systems, avoiding out-of-memory errors.
    * Avoids fragmentation.
    * Multithreaded programs can reduce their synchronization overhead.
* Disadvantages:
    * Needs to be tuned for the application that deploys it.
```C
/* Carve one entry at a time out of the pre-allocated pool
 * instead of calling malloc() once per record. */
for (int i = 0; i < data_count; i++) {
    entry *next = (entry *) mp_get(mp, sizeof(entry));
    e = append_mp(map + i * MAX_LAST_NAME_SIZE, e, next);
}

/* Bump-pointer allocation: return the current head and advance it. */
void *mp_get(Memory_pool *mp, size_t size)
{
    if ((size_t) (mp->tail - mp->head) < size)
        return NULL;
    void *tmp = mp->head;
    mp->head = mp->head + size;
    return tmp;
}
```
```shell
Performance counter stats for './phonebook_mp' (100 runs):
733,750 cache-misses # 72.755 % of all cache refs ( +- 0.30% )
1,008,523 cache-references ( +- 0.22% )
310,568,327 instructions # 1.15 insns per cycle ( +- 0.04% )
270,114,028 cycles ( +- 0.17% )
0.088669643 seconds time elapsed ( +- 0.23% )
```
![](https://i.imgur.com/5hoAoqP.png)
> This chart shows the times with mmap + memory pool.
### Fast Efficient Fixed-Size Memory Pool
#### I. INTRODUCTION
* Naive memory pool implementations initialize all the memory pool segments when created. This can be expensive since it is usually necessary to loop over all the uninitialized segments.
#### II. RELATED WORK
* What is presented here is not novel but a modification of an existing technique, whereby loops and initialization overheads are removed.
* It is among the most memory-efficient implementations available, since it has a very small memory footprint while giving O(1) access time.
#### IV. HOW IT WORKS
* When the pool is created, we obtain a contiguous section of memory whose start and end addresses we know.
* Needs a bookkeeping algorithm to keep track of which blocks are used and unused as they are allocated and deallocated.
> The bookkeeping algorithm keeps a list of the unused blocks.
* Uses a four-byte index number to identify any memory location.
> Each block must be at least 4 bytes, since it has to store the block number of the next unused block.
* A variable is initialized to tell us how many of the n blocks have been appended to the unused list.
> This variable takes no extra memory; it is stored inside the unused blocks themselves.
> ![](https://i.imgur.com/H6Ifw9f.png)
* We keep track of the head of the unused list of blocks, updating it after each allocation; when a block is de-allocated, it is appended back onto the unused list.
* Keep track of how many blocks have been added to the list and stop appending new blocks once the upper limit is reached. A sketch of this scheme follows below.
* Writing a custom memory pool allocator can be both difficult and error prone.
> Compared with a fixed-size memory pool, a general allocator needs more complex and time-consuming checking mechanisms.
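Below is a sketch of the paper's scheme in C (my own reconstruction of sections IV-V, not code from this repository): each unused block stores the 4-byte index of the next unused block, and blocks are initialized lazily, one per allocation, so creating the pool needs no loop.
```C
#include <stdint.h>

typedef struct {
    uint32_t num_blocks;      /* total number of blocks n            */
    uint32_t block_size;      /* bytes per block, must be >= 4       */
    uint32_t num_free;        /* blocks currently available          */
    uint32_t num_initialized; /* blocks already linked into the list */
    uint8_t *mem;             /* start of the contiguous region      */
    uint8_t *head;            /* head of the unused-block list       */
} fixed_pool;

void pool_init(fixed_pool *p, uint8_t *mem, uint32_t n, uint32_t bsize)
{
    /* No loop over the blocks here: initialization is deferred. */
    p->num_blocks = n;
    p->block_size = bsize;
    p->num_free = n;
    p->num_initialized = 0;
    p->mem = mem;
    p->head = mem;
}

static uint8_t *addr_from_index(fixed_pool *p, uint32_t i)
{
    return p->mem + i * p->block_size;
}

void *pool_alloc(fixed_pool *p)
{
    /* Lazily link one more block into the unused list per call,
     * instead of looping over all blocks at creation time. */
    if (p->num_initialized < p->num_blocks) {
        uint32_t *slot = (uint32_t *) addr_from_index(p, p->num_initialized);
        *slot = ++p->num_initialized;
    }
    if (p->num_free == 0)
        return NULL;
    void *ret = p->head;
    if (--p->num_free > 0)
        p->head = addr_from_index(p, *(uint32_t *) p->head);
    else
        p->head = NULL;
    return ret;
}

void pool_free(fixed_pool *p, void *block)
{
    /* The freed block records the index of the old head and
     * becomes the new head of the unused list. */
    *(uint32_t *) block = p->head
                          ? (uint32_t) ((p->head - p->mem) / p->block_size)
                          : p->num_blocks;
    p->head = block;
    p->num_free++;
}
```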
#### V. IMPLEMENTATION
* Four essential public functions: Create, Destroy, Allocate, and De-allocate.
> create/destroy are used rather than constructor/destructor, which makes dynamic creation easier.
* Combines the fixed pool allocator with overloading of the new and delete operators.
* The greatest care must be exercised to ensure that classes and structures have their constructors and destructors called manually.
#### VI. LIMITATIONS
* Needs a contiguous block of memory.
* Hardware limitations (e.g. page faults) are not discussed.
* The issue of using the memory pool in a multi-threaded environment is not addressed.
* If the requested memory is dramatically smaller than the slot size, lots of memory is wasted.
* If the requested memory is greater than the slot size, it is impossible to allocate from the pool.
* Finding a suitable block of memory would require considerable searching overhead, with unsuitable and unusable memory scattered around.
#### VII. RESIZING
* This list resides in the unused memory and is incrementally extended when a memory block is allocated.
* The algorithm currently always initializes the next unused memory block during the allocation call.
> An extra check could be added to decide whether the next unused block really needs initializing.
* The large pool of memory could be resized down without needing to destroy and re-create the pool.
#### IX. CONCLUSION AND FURTHER WORK
* Keep It Short and Simple
* Need to investigate if the algorithm could be optimised to use less decisional logic
* Exploring hardware considerations
> e.g. caching, paging, registers, memory alignment, threading
## Fuzzy search
* The problem of fuzzy string searching can be formulated as follows:
:::info
Find in the text or dictionary of size n all the words that match the given word (or start with the given word), taking into account k possible differences (errors).
:::
The algorithms can be grouped as follows:
* Levenshtein distance
* Damerau-Levenshtein distance
* Bitap algorithm with modifications by Wu and Manber
* Spell-checker method
* N-gram method
* Signature hashing method
* BK-trees
These algorithms compute the **distance** between two words; the distance expresses how similar they are.
### Levenshtein Distance
The edit distance is the number of operations (**deletion**, **substitution**, or **insertion** of characters) needed to turn one string (the source) into another (the target). Comparing edit distances tells us how similar two strings are.
![](https://i.imgur.com/5MpfP8L.png)
The implementation follows the algorithm on Wikipedia. The recursive version does an enormous amount of work: short strings can be searched, but longer strings spend far too long computing the edit distance. A dynamic-programming sketch follows after the code.
For example, turning `kitten` into `sitting`:
step 1: kitten → sitten (substitution of "s" for "k")
step 2: sitten → sittin (substitution of "i" for "e")
step 3: sittin → sitting (insertion of "g" at the end)
```C
int LevenshteinDistance(const char *s, int len_s, const char *t, int len_t)
{
    int cost;

    /* base case: empty strings */
    if (len_s == 0) return len_t;
    if (len_t == 0) return len_s;

    /* test if last characters of the strings match */
    if (s[len_s - 1] == t[len_t - 1])
        cost = 0;
    else
        cost = 1;

    /* return minimum of: delete char from s, delete char from t,
     * and substitute the last char of both */
    return minimum(LevenshteinDistance(s, len_s - 1, t, len_t    ) + 1,
                   LevenshteinDistance(s, len_s    , t, len_t - 1) + 1,
                   LevenshteinDistance(s, len_s - 1, t, len_t - 1) + cost);
}
```
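The recursive version above recomputes the same subproblems exponentially many times. A dynamic-programming rewrite (a sketch of the standard algorithm, not the repository's code) brings the cost down to O(len_s × len_t) time with only two rows of storage:
```C
#include <stdlib.h>

/* Dynamic-programming Levenshtein distance: O(len_s * len_t) time,
 * two rolling rows instead of exponential recursion. */
int LevenshteinDistanceDP(const char *s, int len_s, const char *t, int len_t)
{
    int *prev = malloc((len_t + 1) * sizeof(int));
    int *curr = malloc((len_t + 1) * sizeof(int));
    for (int j = 0; j <= len_t; j++)
        prev[j] = j;               /* distance from the empty prefix of s */
    for (int i = 1; i <= len_s; i++) {
        curr[0] = i;               /* distance to the empty prefix of t   */
        for (int j = 1; j <= len_t; j++) {
            int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            int del = prev[j] + 1;        /* delete from s     */
            int ins = curr[j - 1] + 1;    /* insert into s     */
            int sub = prev[j - 1] + cost; /* substitute        */
            int min = del < ins ? del : ins;
            curr[j] = min < sub ? min : sub;
        }
        int *tmp = prev;           /* swap the two rows */
        prev = curr;
        curr = tmp;
    }
    int result = prev[len_t];
    free(prev);
    free(curr);
    return result;
}
```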
```shell
hugikun@hugikun:~/embedded/team/phonebook$ ./phonebook_mp hjka
size of entry : 136 bytes
Can not find the name you want.Do you want to fuzzy search similar name?
Y/N: Y
The lastName similar to hjka:
aeka
ahka
akka
.
.
.
execution time of append() : 0.017253 sec
execution time of findName() : 70.463820 sec
```
### SOUNDEX
SOUNDEX compares on the basis of English pronunciation: each letter is assigned a digit according to its phonetic class, and strings are compared by the resulting SOUNDEX codes. Many different strings map to the same code, so fuzzy search returns far too many results, and other methods are needed to narrow the search down to the target string (a sketch of one possible encoder follows the code below).
```shell
b, f, p, v = 1
c, g, j, k, q, s, x, z = 2
d, t = 3
l = 4
m, n = 5
r = 6
```
```C
/* Print every lastName whose Soundex code matches the query's code. */
void fuzzy_search(char *lastName, entry *e)
{
    int tar_ans = Soundex(lastName);
    printf("The lastName similar to %s:\n", lastName);
    while (e != NULL) {
        int ans = Soundex(e->lastName);
        if (ans == tar_ans)
            printf("%s: %d\n", e->lastName, ans);
        e = e->pNext;
    }
}
```
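The repository's `Soundex()` itself is not shown above. Below is a minimal sketch that matches the call signature; summing the digit classes of all letters is **an assumption on my part** (the actual implementation may differ), but it reproduces the heavy-collision behavior seen in the run that follows.
```C
/* Map each lowercase letter to its SOUNDEX digit class
 * (vowels, h, w, y map to 0) and sum the digits. Because many
 * names share one sum, this code collides very often. */
static const int soundex_digit[26] = {
    /* a  b  c  d  e  f  g  h  i  j  k  l  m */
       0, 1, 2, 3, 0, 1, 2, 0, 0, 2, 2, 4, 5,
    /* n  o  p  q  r  s  t  u  v  w  x  y  z */
       5, 0, 1, 2, 6, 2, 3, 0, 1, 0, 0, 0, 2 };

int Soundex(char *str)
{
    int code = 0;
    for (; *str; str++) {
        if (*str >= 'a' && *str <= 'z')
            code += soundex_digit[*str - 'a'];
    }
    return code;
}
```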
```shell
hugikun@hugikun:~/embedded/team/phonebook$ ./phonebook_mp asd
size of entry : 136 bytes
Can not find the name you want.Do you want to fuzzy search similar name?
Y/N: Y
The lastName similar to asd:
aaccf: 5
aaem: 5
aaiun: 5
aana: 5
aani: 5
ababdeh: 5
.
.
.
zway: 5
zythia: 5
execution time of append() : 0.016174 sec
execution time of findName() : 0.054875 sec
```
### Combining Levenshtein Distance and SOUNDEX
Because computing the edit distance is far too expensive, the two methods above are combined: the cheap SOUNDEX pass filters out part of the strings first, and then Levenshtein Distance computes the edit distance on the remainder. For now this only works for short strings. For long strings the search time drops dramatically, but the correct string can no longer be found: if the SOUNDEX threshold is too tight, the correct string is filtered out, while loosening it makes the computation huge again.
> No good way to solve this yet.
```shell
hugikun@hugikun:~/embedded/team/phonebook$ ./phonebook_mp aappmedicp
size of entry : 136 bytes
Can not find the name you want.Do you want to fuzzy search similar name?
Y/N: Y
The lastName similar to aappmedicp:
execution time of append() : 0.023745 sec
execution time of findName() : 323.052746 sec
```
## Input Data
- [ ] statistics
- [ ] engineering analysis
- [x] improvement
### Analysis
To check whether the dataset is representative, the 1990 census from the [United States Census Bureau](https://www.census.gov/) is used as a control group.
A suitable dataset needs:
* real names
* a realistic name distribution
Below are the words in ==words.txt== that also appear in the census data:
| | Census | words.txt |
|:-:|:-------:|:---------:|
| Total records | 6290251 | 349900 |
| Distinct last names | 88799 | 27820 |
| Distinct female first names | 4275 | 3083 |
| Distinct male first names | 1219 | 1071 |
| Valid ratio | 98.38% | 9.14% |
Although ==words.txt== covers a fairly high share of the real names in the census, those names make up only 9.14% of the file; the other three hundred thousand plus entries are meaningless words. A real phonebook would never look like this, so the file as a whole is not suitable as phonebook names.
I then ran the same code on ==./dictionary/all-names.txt==. About 24% of this file are real names, more than in the previous file, and it does contain repeated last and first names, but the amount of data is too small to serve as a dataset with realistic proportions.
| | Census | all-names.txt |
|:-:|:-------:|:---------:|
| Total records | 6290251 | 16751 |
| Non-real names | 0 | 12705 |
| Total last names | 6290251 | 6851 |
| Female first names | 3184399 | 5514 |
| Male first names | 3003954 | 2048 |
| Female first-name share | 50.62% | 32.92% |
| Male first-name share | 47.76% | 12.23% |
#### Investigating entropy analysis
[Entropy (information theory)](https://en.wikipedia.org/wiki/Entropy_(information_theory)) : entropy (more specifically, Shannon entropy) is the expected value (average) of the information contained in each message. 'Messages' can be modeled by any flow of information.
> In information theory, entropy is a measure of uncertainty. In the information world, higher entropy means more information can be carried; lower entropy means less.
Entropy measures the uncertainty or disorder of a system. Put another way: the higher the entropy of a dataset, the messier and less predictable the data.
However, I found no suitable way to apply it to analyzing or improving the dataset. Entropy can measure how much information a word carries, or whether compression improved the compression ratio. We could compute the disorder of the observed name frequencies and the disorder of the real census frequencies, but even if the two entropies are close, that only says some words appear with frequencies similar to reality; it cannot tell us whether the equally frequent words are the same words. So this method is not used for the analysis.
For example, suppose the real-world distribution of names is
```
SMITH, SMITH, SMITH, SMITH, JOHNSON, JOHNSON, JOHNSON, BROWN, BROWN, DAVIS
```
but the data I am analyzing is
```
JONES, JONES, JONES, JONES, LEE, LEE, LEE, HILL, HILL, SCOTT
```
The two lists share the same frequency profile (4, 3, 2, 1 out of 10), so their entropies are exactly equal (about 1.85 bits for the frequency distribution), yet the equally frequent words are different. That is why I consider entropy unsuitable for this problem.
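For reference, the Shannon entropy of a frequency distribution is
$$H = -\sum_i p_i \log_2 p_i$$
Both lists above have probabilities $0.4, 0.3, 0.2, 0.1$, so
$$H = -(0.4\log_2 0.4 + 0.3\log_2 0.3 + 0.2\log_2 0.2 + 0.1\log_2 0.1) \approx 1.85 \text{ bits}$$
for either list, even though the words themselves differ.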
### Improvement
For the improvement, I took the 1990 name [data](https://www.census.gov/topics/population/genealogy/data/1990_census.html) from the [United States Census Bureau](https://www.census.gov/). That census collected data on 7,200,000 people, split into three statistics (last names, male first names, female first names), each sorted by frequency of appearance.
```
File Name Valid Records Unique Names
1. dist.all.last 6,290,251 88,799
2. dist.female.first 3,184,399 4,275
3. dist.male.first 3,003,954 1,219
```
First, note that our dataset is the phonebook contents themselves: among 349900 people, even the lowest-frequency name appears about 4 times, and identical names in a phonebook cannot be told apart, so every record in the test data below uses first name + last name.
Both the first-name and last-name files come with frequencies of appearance. The improved dataset pairs first names, drawn according to their proportions, with last names also drawn according to theirs (a generator sketch follows after the sample). It looks roughly like this:
```shell
MARY MILES
MARY STEWART
MARY AGEE
MARY ANDERSON
MARY CRAVEN
MARY CLARK
MARY FLUDD
MARY FORREST
MARY BARELA
MARY LARA
MARY KELLER
MARY RENEAU
```
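A minimal sketch of such a generator (my own illustration, not the script in the repository; it assumes the census files use the format `NAME freq(%) cum_freq(%) rank` and writes each name a number of times proportional to its frequency; pairing with last names is a second pass of the same kind):
```C
#include <stdio.h>

/* Read a census-format file (NAME freq cum_freq rank) and print each
 * name round(freq% * total / 100) times. */
static void emit_proportional(const char *path, long total)
{
    FILE *fp = fopen(path, "r");
    if (!fp)
        return;

    char name[64];
    double freq, cum;
    int rank;
    while (fscanf(fp, "%63s %lf %lf %d", name, &freq, &cum, &rank) == 4) {
        long copies = (long) (freq * total / 100.0 + 0.5);
        for (long i = 0; i < copies; i++)
            printf("%s\n", name);
    }
    fclose(fp);
}

int main(void)
{
    /* dist.female.first is one of the three census files. */
    emit_proportional("dist.female.first", 2861372);
    return 0;
}
```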
Finally, the improved data is compared against the census. The improved dataset contains real last names and real first names, and its male/female first-name ratio is close to reality.
| | Census | full-names.txt |
|:-:|:-------:|:----------:|
| Total records | 6290251 | 5567224 |
| Last names | 6290251 | 5567224 |
| Female first names | 3184399 | 2861372 |
| Male first names | 3003954 | 2727188 |
| Female first-name share | 50.62% | 51.40% |
| Male first-name share | 47.76% | 48.99% |
## 95% confidence interval
The measured times fluctuate, so a confidence-interval computation was added to main.c to make the append() and findName() timings more reliable. A 95% confidence interval is used here, meaning the true value lies within the interval with 95% probability.
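As a sketch of the computation (assuming the usual normal approximation): with sample mean $\bar{x}$, sample standard deviation $s$, and $n = 100$ runs, the interval is
$$\bar{x} \pm 1.96\,\frac{s}{\sqrt{n}}$$
where 1.96 is the 97.5th percentile of the standard normal distribution.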
## Relationship between hash table size and collisions
**Hash function**
* Transforms a message or data into a **hash value (hash code, hash sum, or hash)**. The hash value serves as an array index, and the data is stored at that index. A hash value is usually represented by a short string of random-looking letters and digits.
**Hash table**
* A hash table uses the hash value derived from the data as an index, so data can be looked up quickly.
![](https://i.imgur.com/D85eZMz.png)
* In general, if two hash values are **different** (under the same function), the two original inputs are also **different**.
* But if two hash values are **the same** (under the same function), the original inputs are **not necessarily** the same; this is a collision. On a collision, the data is simply chained after the same index.
* A good hash function should have genuinely **uniform**, random-looking output, and a random hash function rarely shows a very high collision rate; still, a small, estimable number of collisions is unavoidable in practice.
* The common hash functions below are therefore compared to analyze their output uniformity and collision rate, along with the differences in cache misses and append()/findName() times inside phonebook.
### 不同 Hash function 的比較
* Implements the 8 hash functions (SDBM, RS, JS, PJW, ELF, BKDR, DJB2, AP) from [各种字符串Hash函数比较](https://www.byvoid.com/zhs/blog/string-hash-compare)
* Tries the object-oriented approach from the w5 assignment [matrix_oo](https://github.com/sysprog21/matrix_oo) to encapsulate the hash functions
phonebook_hash.h
```c=30
typedef struct __HASH_ALGO {
    unsigned int (*hashfuction)(char *str); /* string -> hash value */
    hashTable* (*init)();                   /* allocate the table   */
    void (*free)(hashTable *ht);            /* release the table    */
} hashAlgo;

entry* findName(char lastName[], hashTable *ht, hashAlgo *algo);
void append(char lastName[], hashTable *ht, hashAlgo *algo);
extern hashAlgo SDBMHashProvider;
extern hashAlgo RSHashProvider;
extern hashAlgo JSHashProvider;
extern hashAlgo PJWHashProvider;
extern hashAlgo ELFHashProvider;
extern hashAlgo BKDRHashProvider;
extern hashAlgo DJB2HashProvider;
extern hashAlgo APHashProvider;
```
Then switching hash functions is just a matter of pointing at a different implementation:
```c=212
hashAlgo BKDRHashProvider = {
    .hashfuction = BKDRHash,
    .init = initHashTable,
    .free = freeHashTable,
};
```
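For reference, the BKDR hash selected above is the classic multiply-by-a-small-prime string hash. A minimal sketch (the seed 131 is the customary choice; the repository's version may differ in seed or in how the result is reduced):
```C
/* BKDR string hash: hash = hash * seed + c for each character,
 * with a small prime seed. The caller reduces the result modulo
 * the hash table size to get a bucket index. */
unsigned int BKDRHash(char *str)
{
    unsigned int seed = 131; /* 31, 131, 1313, 13131, ... also work */
    unsigned int hash = 0;
    while (*str)
        hash = hash * seed + (unsigned char) *str++;
    return hash;
}
```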
This provider pattern makes switching inside main.c convenient without lots of conditionals; it could probably be written even more cleanly with the **ELF linker set** technique the instructor mentioned.
main.c
```c=
hashAlgo *hash_function_providers[] = {
    &SDBMHashProvider,
    &RSHashProvider,
    &JSHashProvider,
    &PJWHashProvider,
    &ELFHashProvider,
    &BKDRHashProvider,
    &DJB2HashProvider,
    &APHashProvider
};

#ifdef HASH
/* init hash table */
hashAlgo *algo = hash_function_providers[0];
hashTable *ht = algo->init();
#endif
```
- [x] comparison charts
### Analysis of table size and collisions
* The distribution of the input data `words.txt` under each hash function is analyzed below for different table sizes
#### **Table size : 3000**
* Plotting all the distributions shows that ELF and PJW vary a great deal; the mean, standard deviation, and coefficient of variation of the bucket loads are as follows
![](https://i.imgur.com/n9RtSQB.png)
| | SDBM | RS | JS | PJW | ELF | BKDR | DJB2 | AP |
|:----:|:-----:|:------:|:----:|:-----:|:------:|:----:|:-----:|:-----:|
| Mean | 116.63 | 116.63 | 116.63 | 116.63 | 116.63 | 116.63 | 116.63 | 116.63 |
| Standard deviation | 10.686 | 10.683 | 10.785 | 56.979 | 56.979 | ++**10.592**++ | 10.753 | 10.666 |
| Coefficient of variation | 0.09162 | 0.09160 | 0.09247 | 0.4885 | 0.4885 | ++**0.0908**++ | 0.0921 | 0.0914 |
* Since the input data `words.txt` has 349900 entries, each bucket should ideally receive 349900 / 3000 ≈ 116.63 entries on average. Looking at the standard deviation and coefficient of variation, BKDR is the most evenly distributed (smallest standard deviation and coefficient of variation).
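As a statistical sanity check (my own addition, not in the original analysis): if a hash function behaved perfectly randomly, the load of each bucket would be approximately Poisson-distributed with $\lambda = n/m$, so the standard deviation should be about
$$\sigma \approx \sqrt{\lambda} = \sqrt{349900/3000} \approx 10.80$$
This is essentially what SDBM, RS, JS, BKDR, DJB2, and AP achieve, while PJW and ELF sit far above it. The same holds at the larger table sizes below: $\sqrt{349900/30000} \approx 3.42$ and $\sqrt{349900/300000} \approx 1.08$.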
![](https://i.imgur.com/dqPgXAY.png)
![](https://i.imgur.com/kfYJobt.png)
#### **Table size : 30000**
* Next, the table size is raised to 30000
* At this table size RS performs best; apart from PJW and ELF, the functions differ very little
![](https://i.imgur.com/kubFl2Y.png)
| | SDBM | RS | JS | PJW | ELF | BKDR | DJB2 | AP |
|:----:|:-----:|:------:|:----:|:-----:|:------:|:----:|:-----:|:-----:|
| Standard deviation | 3.4135 | ++**3.3923**++ | 3.4240 | 10.2899 | 10.2899 | 3.4080 | 3.4108 | 3.3959 |
| Coefficient of variation | 0.2926 | ++**0.2908**++ | 0.2935 | 0.8822 | 0.8822 | 0.2921 | 0.2924 | 0.2911 |
![](https://i.imgur.com/sQvdbNp.png)
![](https://i.imgur.com/2Q36pdu.png)
#### **Table size : 300000**
* Finally, the table size is raised to 300000, the same order of magnitude as the number of input records. BKDR again performs best; although it differs from the other functions only in the third decimal place, overall BKDR has the most stable distribution across all three table sizes
> PJW and ELF always produce exactly the same distribution. At first this looked like an implementation bug, but ELF hash is essentially a variant of the PJW hash, so identical behavior is plausible.
![](https://i.imgur.com/z4ZrNQU.png)
| | SDBM | RS | JS | PJW | ELF | BKDR | DJB2 | AP |
|:----:|:-----:|:------:|:----:|:-----:|:------:|:----:|:-----:|:-----:|
| Standard deviation | 1.0791 | 1.0802 | 1.0796 | 1.4691 | 1.4691 | ++**1.0788**++ | 1.0795 | 1.0816 |
| Coefficient of variation | 0.9252 | 0.9262 | 0.9257 | 1.2596 | 1.2596 | ++**0.9250**++ | 0.9256 | 0.9274 |
![](https://i.imgur.com/mjj79pj.png)
![](https://i.imgur.com/ORMx3Nr.png)
## Binary search tree and cache line size
### Binary search tree
In the original program the phonebook data is already sorted (from A to Z). The implementation therefore follows [Sorted Linked List to Balanced BST](http://www.geeksforgeeks.org/sorted-linked-list-to-balanced-bst/): the sorted linked list built by append() is converted into a binary search tree. Building the BST takes the list head and the total node count as input and returns the root node (4 or 3 in the examples below).
```text=
Input: Linked List 1->2->3->4->5->6->7
Output: A Balanced BST
        4
      /   \
     2     6
    / \   / \
   1   3 5   7

Input: Linked List 1->2->3->4
Output: A Balanced BST
      3
    /  \
   2    4
  /
 1
```
```c=34
node *sortedListToBST(entry *e)
{
    /* Walk the list once to count its length. */
    int countListLength = 0;
    entry *tmp = e;
    while (tmp) {
        countListLength++;
        tmp = tmp->pNext;
    }
    return sortedListToBSTRecur(&e, countListLength);
}

node *sortedListToBSTRecur(entry **e, int listlength)
{
    if (listlength <= 0)
        return NULL;
    /* Build the left subtree from the first half, consume the middle
     * node as the root, then build the right subtree from the rest;
     * *e advances through the list as nodes are consumed. */
    node *left = sortedListToBSTRecur(e, listlength / 2);
    node *root = (node *) malloc(sizeof(node));
    root->left = left;
    root->pEntry = *e;
    *e = (*e)->pNext;
    root->right = sortedListToBSTRecur(e, listlength - listlength / 2 - 1);
    return root;
}
```
```
Performance counter stats for './phonebook_bst' (100 runs):
524,818 cache-misses # 68.953 % of all cache refs ( +- 0.16% )
761,122 cache-references ( +- 0.31% )
330,462,207 instructions # 1.72 insns per cycle ( +- 0.02% )
192,097,333 cycles ( +- 0.28% )
0.074406735 seconds time elapsed ( +- 1.27% )
```
![](https://i.imgur.com/qWK0nD8.png)
* Here the hash table size is 42937 and the hash function is BKDR
### cache line size
This part follows [vtim9907](https://github.com/vtim9907/phonebook)'s implementation. To match the cache line characteristics, the node structure is changed to 64 bytes. The original node holds only two pointers to the left and right subtrees plus one pointer to a phonebook entry, for a size of 8 * 3 = 24 bytes:
```c=
typedef struct __PHONE_BOOK_BST_NODE {
    struct __PHONE_BOOK_BST_NODE *left;
    struct __PHONE_BOOK_BST_NODE *right;
    entry *pEntry;
} node;
```
To enlarge the node, the records from `words.txt` are grouped, six to a group in the example below. The node gains an array of 6 pointers to phonebook entries, for a size of 8 * 6 + 8 * 2 = 64 bytes:
```c=
typedef struct __PHONE_BOOK_BST_NODE {
    struct __PHONE_BOOK_BST_NODE *left;
    struct __PHONE_BOOK_BST_NODE *right;
    entry *pEntry[6];
} node;
```
```
Performance counter stats for './phonebook_bst' (100 runs):
453,643 cache-misses # 63.317 % of all cache refs ( +- 0.19% )
716,468 cache-references ( +- 0.25% )
251,680,487 instructions # 1.58 insns per cycle ( +- 0.04% )
159,466,589 cycles ( +- 0.20% )
0.060881942 seconds time elapsed
```
![](https://i.imgur.com/WxuWXqM.png)
* append() time drops from 0.064934 to 0.051413, about 20% lower
* Total execution time drops from 0.074406735 to 0.060881942, about 18% lower
* The cache-miss rate also drops, from 68.9% to 63.3%
## reference
- [heathcliffYang's notes](https://hackmd.io/s/Hyr9PhFYx)
- [hash function 觀念和實務](https://hackmd.io/s/HJln3jU_e)
- [git branch and remote repo](https://ihower.tw/blog/archives/2620)
- [Perf -- Linux下的系统性能调优工具,第 1 部分](https://www.ibm.com/developerworks/cn/linux/l-cn-perf1/)
- Fast Efficient Fixed-Size Memory Pool
- [Using Fuzzy Matching to Search by Sound with Python](http://www.informit.com/articles/article.aspx?p=1848528)
- [NIKITA'S BLOG: Fuzzy string search](http://ntz-develop.blogspot.tw/2011/03/fuzzy-string-search.html)
- [Charles' notes](https://charles620016.hackpad.com/ep/pad/static/Japi4qFyAzt)
- [SOUNDEX](https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions148.htm)
- [Levenshtein distance (Wikipedia)](https://en.wikipedia.org/wiki/Levenshtein_distance)
- [Entropy (information theory) (Wikipedia)](https://en.wikipedia.org/wiki/Entropy_(information_theory))
- [vtim9907's notes](https://hackmd.io/s/r1YyTRqFe#phonebook-*%16)