Phonebook

contribured <hugikun999>, <claaaaassic>, <Weinux>
GitHub: phonebook

Task: 比照 B01: phonebook，支援 fuzzy search，允許搜尋 lastName 時，不用精準打出正確的寫法。指出原有 dataset 的問題 (需要有工程分析，拿出統計學) 並改進。

mmap 大量讀取時的影響

Which is fastest: read, fread, ifstream or mmap?

Overhead on mmap is heavily than read.
Memory maps is faster than randomly access.
Memory map keep using page from memory until you finish your work.However,with read,files may have been flushed out memory.
Use memory maps if you access data randomly, keep it around for a long time, or if you know you can share it with other processes.
If your access pattern is sequential, mmap can't work efficiently than other funcion.
Memory map is less robust,any slightest mistake can make your make program crash.
mmap directly access memory.
read fetches the data from disk to page cache and then copy them to position you specify.

Downsides to mmap:
- resulting page fault
- quite noticeable setup and teardown costs
Upsides of mmap:
- if the data gets re-used over and over again
- if the data is large，The kernel can forget pages as memory pressure forces the system to page stuff out.

Conclude: mmap 僅在 random access memory 的時候會比較有效率，如果為 sequential access 則使用 fread 會比較有效率。

Code to test

這裡有提供一個可以測試 mmap 和 fread的程式，我把 word.txt 當成輸入檔並用 $perf sate 重覆測試 100 次，可以發現 mmap 的 cache-missed 低了很大一個幅度，但是總體執行卻沒有比較快。
(1)fread

 Performance counter stats for './tbytesum1 -f words.txt' (100 runs):

              6491      branch-misses                                                 ( +-  0.42% )
            2,1510      cache-misses              #   40.280 % of all cache refs      ( +-  2.39% )
            5,3401      cache-references                                              ( +-  1.02% )
         2206,1482      cycles                                                        ( +-  0.21% )

       0.007478928 seconds time elapsed                                          ( +-  0.36% )

(2)mmap

 Performance counter stats for './tbytesum1 -m words.txt' (100 runs):

              6554      branch-misses                                                 ( +-  0.29% )
              6788      cache-misses              #   29.101 % of all cache refs      ( +-  3.07% )
            2,3325      cache-references                                              ( +-  0.65% )
         2513,0427      cycles                                                        ( +-  0.10% )

       0.008581382 seconds time elapsed                                          ( +-  1.01% )

phonebook 實際比較

mmap 版從整體的消耗時間看起來沒有減少反而增加，其原因在於先將 word.txt 做 align，比起原版額外增加了許多的 instructions，故導致最後的整體時間並沒有比較快，但是可以發現 cache-misses 已經減少了一成左右。

    file_align(DICT_FILE, FILE_OUT, MAX_LAST_NAME_SIZE);

    fd = open(FILE_OUT, O_RDONLY | O_NONBLOCK);
    if (!fd) {
        printf("cannot open the file\n");
        return -1;
    }
    size = fsize(FILE_OUT);
    data_count = size / (sizeof(char) * MAX_LAST_NAME_SIZE);

    map = mmap(NULL, size, PROT_READ, MAP_SHARED | MAP_NONBLOCK, fd, 0);

 Performance counter stats for './phonebook_orig' (100 runs):

          123,9627      cache-misses              #   88.189 % of all cache refs      ( +-  0.76% )
          140,5652      cache-references                                              ( +-  0.86% )
       3,2127,2267      instructions              #    1.45  insns per cycle          ( +-  0.01% )
       2,2187,7205      cycles                                                        ( +-  0.29% )

       0.071603623 seconds time elapsed                                          ( +-  0.32%

 Performance counter stats for './phonebook_mmap' (100 runs):

          108,3608      cache-misses              #   76.881 % of all cache refs      ( +-  0.16% )
          140,9459      cache-references                                              ( +-  0.54% )
       4,1566,4954      instructions              #    1.26  insns per cycle          ( +-  0.02% )
       3,2864,4359      cycles                                                        ( +-  0.31% )

       0.105448071 seconds time elapsed                                          ( +-  0.38% )

這裡可以利用 $perf record ./phonebook_orig $perf annotate --no-source 查看轉成組語後每個所花費的時間。

從 gprof 可以看到時間明顯都花在 file_align() 上。

要使用 gprof 時記得要加 -pg

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 66.78      0.02     0.02        1    20.03    20.03  file_align
 33.39      0.03     0.01        3     3.34     3.34  findName
  0.00      0.03     0.00   349900     0.00     0.00  append
  0.00      0.03     0.00        2     0.00     0.00  diff_in_second
  0.00      0.03     0.00        1     0.00     0.00  fsize

從 append 的角度去比較就可以發現事先 mmap，可以省下四成的時間。

研究MARSSx86 (http://marss86.org/~marss86/index.php/Home）

memory pool

memory pool 的概念是透過預先 malloc 一段記憶體空間，等到需要記憶體的時候再去從剛剛已經先配置好的空間拿取。在一個大量使用 malloc 的程式中，透過 memory pool 可以省去多次呼叫 malloc 所造成的 overhead，perf 所測出來的 instructions 減少二成五了。

advantage:
- The memory release for thousands of objects in a pool is just one operation
- Memory pools can be grouped in hierarchical tree structures
- Fixed-size block memory pools do not need to store allocation metadata for each allocation
- Allows deterministic behavior on real-time systems avoiding the out of memory errors.
- Avoid fragmentation
- multithread program could decrease overhead on snychronization
disadvantage:
- need to be tuned for the application which deploys them

for (int i = 0; i < data_count; i++) {
    entry *next = (entry *) mp_get(mp, sizeof(entry));
    e = append_mp(map + i * MAX_LAST_NAME_SIZE, e, next);
}

void *mp_get(Memory_pool *mp, size_t size)
{
    if (mp->tail - mp->head < size)
        return NULL;

    void *tmp = mp->head;
    mp->head = mp->head + size;

    return tmp;
}

 Performance counter stats for './phonebook_mp' (100 runs):

           73,3750      cache-misses              #   72.755 % of all cache refs      ( +-  0.30% )
          100,8523      cache-references                                              ( +-  0.22% )
       3,1056,8327      instructions              #    1.15  insns per cycle          ( +-  0.04% )
       2,7011,4028      cycles                                                        ( +-  0.17% )

       0.088669643 seconds time elapsed                                          ( +-  0.23% )

該圖為 mmap + memory pool 的時間

Fast Efficient Fixed-Size Memory Pool

I. INTRODUCTION

Naive memory pool implementations initialize all the memory pool segments when created. This can be expensive since it is usually necessary to loop over all the uninitialized segments.

Present here is not novel, but is a modification of an existing technique;whereby loops and initialization overheads are removed.
the most memory efficient implementation available since it has very little memory footprint and while giving an O(1) access time.

IV. HOW IT WORKS

When the pool is created, we obtain a continuous section of memory that we know the start and end address of.
Needs a bookkeeping algorithm keep track of which blocks are used and un-used as they are allocated and deallocated.

Bookkeeping algorithm keeping a list of the unused blocks.

Use a four-byte index number to store identify any memory location.

每個 block 必須大於4 bytes，因為必須存下一個 unused 的 block number.

Initialize a variable to inform us of how many of the n blocks have been appended to the unused list.

這個變數不會另外使用記憶體，而是將變數存在未使用的 block 內。

We keep track of the head of the unused list of blocks and is updated after each allocation.when a block de-allocated append it to the list of unused blocks.
keep track of how many blocks have been added to the list and stop appending new blocks when we have reached the upper limit.
Writing a custom memory pool allocator can be both difficult and error prone.

相比 fixed size memory pool 需要較複雜且耗時檢查的機制。

V. IMPLEMENTATION

Four essential public functions: Create, Destroy, Allocate, and De-allocate.

使用 create/destroy 而非 constructor/destructor，便於動態新增。

Combining the fixed pool allocator with overloading the new and delete operators.
The greatest care must be exercised to ensure that classes and structures have their constructors and destructors manually called.

VI. LIMITATIONS

Need a continuous block of memory.
Not discussed hardware limitations(e.q:page faults)
Not addressed the issue of using the memory pool in a multi-threaded environment.
If the requested memory is dramatically smaller than the slot-size then lots of memory will be wasted.
If the requested memory is greater than the slot-size then it is impossible to allocate memory from the pool.
a suitable block of memory would require considerable searching overhead.Unsuitable and unusable memory being scattered around.

VII. RESIZING

This list resides in the unused memory and is incrementally extended when a memory block is allocated.
The algorithm currently always initializes the next unused memory block during the allocation call.

可以額外設計檢查機制是否真的需要初始化下一個尚未被使用的 block。

The large pool of memory could be resized-down without needing to destroy and re-create the pool.

IX. CONCLUSION AND FURTHER WORK

Keep It Short and Simple
Need to investigate if the algorithm could be optimised to use less decisional logic
Exploring hardware considerations

e.g: caching, paging, registers, memory alignment, threading

Fuzzy search

The problem of fuzzy string searching can be formulated as follows:

Find in the text or dictionary of size n all the words that match the given word (or start with the given word), taking into account k possible differences (errors).

其演算法可以分為

Levenshtein distance
Damerau-Levenshtein distance
Bitap algorithm with modifications by Wu and Manber
Spell-checker method
N-gram method
Signature hashing method
BK-trees
透過演算法將兩個字之間的距離算出來，以距離表達兩個字的相似程度。

Levenshtein Distance

透過對一個字串(source)做刪除、替換、增加字元以形成另一個字串(target)，中間所做的運算次數，即為編輯距離(distance)。比較編輯距離的大小，可以找出兩個字串之間的相似程度。

參考維基百科的演算法實作，遞迴版的運算量很大，搜尋較短的字串時可以找的出結果，但遇到比較長的字串會花費許多時間在計算編輯距離。

例如將kitten轉成sitting：
step 1: kitten → sitten (substitution of "s" for "k")
step 2: sitten → sittin (substitution of "i" for "e")
step 3: sittin → sitting (insertion of "g" at the end)

int LevenshteinDistance(const char *s, int len_s, const char *t, int len_t)
{ 
  int cost;

  /* base case: empty strings */
  if (len_s == 0) return len_t;
  if (len_t == 0) return len_s;

  /* test if last characters of the strings match */
  if (s[len_s-1] == t[len_t-1])
      cost = 0;
  else
      cost = 1;

  /* return minimum of delete char from s, delete char from t, and delete char from both */
  return minimum(LevenshteinDistance(s, len_s - 1, t, len_t    ) + 1,
                 LevenshteinDistance(s, len_s    , t, len_t - 1) + 1,
                 LevenshteinDistance(s, len_s - 1, t, len_t - 1) + cost);
}

hugikun@hugikun:~/embedded/team/phonebook$ ./phonebook_mp hjka
size of entry : 136 bytes
Can not find the name you want.Do you want to fuzzy search similar name?
Y/N: Y
The lastName similar to hjka:
aeka
ahka
akka
 .
 .
 .
execution time of append() : 0.017253 sec
execution time of findName() : 70.463820 sec

SOUNDEX

以英文的發音作為比較的基礎，將每個字母依照其發音的特性給予一個數字，比較字串按照 SOUNDEX 法則所得出的結果進行比較。此種方法在同一個數字會對應許多字串，造成 fuzzy search 的結果過多，需要透過其他方法使其更精準搜尋出目標字串。

b, f, p, v = 1
c, g, j, k, q, s, x, z = 2
d, t = 3
l = 4
m, n = 5
r = 6

void fuzzy_search(char *lastName, entry *e)
{
    int tar_ans = Soundex(lastName);

    printf("The lastName similar to %s:\n", lastName);

    while (e != NULL) {
        int ans = Soundex(e->lastName);
        if (ans == tar_ans)
            printf("%s: %d\n", e->lastName, ans);
        e = e->pNext;
    }
}

hugikun@hugikun:~/embedded/team/phonebook$ ./phonebook_mp asd
size of entry : 136 bytes
Can not find the name you want.Do you want to fuzzy search similar name?
Y/N: Y
The lastName similar to asd:
aaccf: 5
aaem: 5
aaiun: 5
aana: 5
aani: 5
ababdeh: 5
    .
    .
    .
zway: 5
zythia: 5
execution time of append() : 0.016174 sec
execution time of findName() : 0.054875 sec

combine Levenshtein Distance and SOUNDEX

由於計算編輯距離的運算量實在是太大，所以合併了上述的兩種方法，先透過運算量小的 SOUNDEX 過慮部份的字串再透過 Levenshtein Distance 計算所需的編輯距離。目前只能用在短字串的搜尋，對於長字串雖然已經大幅減少搜尋的時間，但是卻無法找出正確的字串。SOUNDEX 的限制給的太小把正確的字串給濾掉了，如果把限制放寬又會造成運算太過龐大。

還沒想到好的方法解決這個問題

hugikun@hugikun:~/embedded/team/phonebook$ ./phonebook_mp aappmedicp
size of entry : 136 bytes
Can not find the name you want.Do you want to fuzzy search similar name?
Y/N: Y
The lastName similar to aappmedicp:
execution time of append() : 0.023745 sec
execution time of findName() : 323.052746 sec

Input Data

統計學
工程分析
改善

分析

為了分析 dataset 是否有代表性，拿美國人口調查局中 1990 年做的人口普查來作為對照組

合適的 dataset

真實姓名
姓名分佈

下面是 words.txt 有出現在人口調查中的字

	人口調查	words.txt
總人數	6290251	349900
姓氏種類	88799	27820
女性名字種類	4275	3083
男姓名字種類	1219	1071
有效比例	98.38%	9.14%

雖然說 words.txt 裡面涵蓋到人口調查中真實名字的比例蠻高的，但只有佔所有資料 9.14% ，有其他三十幾萬的資料都是無意義的字，在真實電話簿中不會有這樣的情形出現，整份資料當成電話簿中的姓名我想是不適合的。

然後順手用同樣的程式碼做了 ./dictionary/all-names.txt 的分析，這份資料裡面有 24% 左右是真實姓名，雖然比上一份資料的真實姓名多，而且有重複的姓氏或名字，但是數量太少，如果要當作有按照真實比例的 dataset 還是不太夠

	人口調查	all-names.txt
總人數	6290251	16751
非真實姓名	0	12705
姓氏總數	6290251	6851
女性名字	3184399	5514
男姓名字	3003954	2048
女姓名字所佔總數比例	50.62%	32.92%
男姓名字所佔總數比例	47.76%	12.23%

研究如何使用熵分析

Entropy (information theory) : entropy (more specifically, Shannon entropy) is the expected value (average) of the information contained in each message. 'Messages' can be modeled by any flow of information.

在資訊理論裡面，熵是對不確定性的測量。但是在資訊世界，熵越高，則能傳輸越多的資訊，熵越低，則意味著傳輸的資訊越少。

熵是對某個系統之不確定性或混亂程度的度量方法，也可以想成，如果 dataset 的熵值越大，資較就越混亂，不確定性越高，越無法預測

不過我沒有找到適合用在分析或改善 dataset 的應用，熵可以用來分析一個字當中有包含多少的資訊，或是分析壓縮過後的資料是否有提高壓縮率，雖然說可以知道姓氏與名字出現的次數所呈現的亂度，也可以計算出真實統計出來的亂度，若是亂度接近只能說某些字出現的頻率與真實相似，但是這並不能知道同樣頻率的字與真實是否為同一個，所以不採用此方法分析

舉個例子，假設真實世界姓名的分佈是

SMITH, SMITH, SMITH, SMITH, JOHNSON, JOHNSON, JOHNSON, BROWN, BROWN, DAVIS

但是我分析的資料是

JONES, JONES, JONES, JONES, LEE, LEE, LEE, HILL, HILL, SCOTT

兩筆資料的 entropy 都是大約 4.94，頻率一樣的字卻不相同，我認為是不適合這個問題

改善

改善的部份，我在美國人口調查局中拿他們 1990 的姓名資料，這次人口普查蒐集 7,200,000 人的資料，一共分成姓氏、男性名子與女性名字三份統計，並且依據出現的頻率下去做排序

	 File Name		Valid Records	  Unique Names	
						 
  1. dist.all.last		6,290,251         88,799 
  2. dist.female.first		3,184,399          4,275
  3. dist.male.first		3,003,954          1.219

首先考慮到我們的 dataset 就是電話簿本身的資料，在 349900 人中出現比例最低也有大約 4 個人，而電話簿同樣名字無法識別誰是誰，所以以下的測試資料會全都使用名字 + 姓氏來當作一筆資料

名子與姓氏的資料都有出現的比例，改善後的 dataset 依照名字比例數量配上也依照比例的姓氏，大概是像這樣子

MARY MILES
MARY STEWART
MARY AGEE
MARY ANDERSON
MARY CRAVEN
MARY CLARK
MARY FLUDD
MARY FORREST
MARY BARELA
MARY LARA
MARY KELLER
MARY RENEAU

最後是改善過的資料與人口調查的比較，改善過後的 dataset 包含真實姓氏與真實名字，男女性名字比例與真實接近

	人口調查	full-names.txt
總人數	6290251	5567224
姓氏	6290251	5567224
女性名字	3184399	2861372
男姓名字	3003954	2727188
女姓名字所佔比例	50.62%	51.40%
男姓名字所佔比例	47.76%	48.99%

95%信賴區間

原本測量出的時間會有些誤差，在 main.c 中加入信賴區間的計算，使得 append 及 findname 量測出的時間較準確，這邊使用 95 ％信賴區間表示真實資料有95％的可能性會落在這個區間。

Hash table size 和 collision 的關係

Hash function ( 雜湊函式 )

是把訊息或資料轉換，建立一個叫做 雜湊值（hash values，hash codes，hash sums，或hashes）,這個雜湊值就當作是陣列的索引，資料就儲存在這個索引的位置中。雜湊值通常用一個短的隨機字母和數字組成的字串來代表。

hash table

使用雜湊表能夠利用資料所產生的雜湊值當索引, 快速尋找資料

一般而言如果兩個雜湊值是 不相同 的（根據同一函式），那麼這兩個雜湊值的原始輸入也是 不相同 的
但如果兩個雜湊值是相同的（根據同一函式），那麼這兩個雜湊值的原始輸入 不一定 是相同的, 而這時就發生了衝突 (collision), 遇到衝突時就資料就接在同一個索引後面即可.
一個好的雜湊函式應該具有均勻的真正隨機輸出。且隨機雜湊函式不太會出現非常高的衝突率。但是，少量的可以估計的衝突在實際狀況下是不可避免的.
因此以下根據常見的 hash function 做比較, 來分析它的隨機輸出與衝突率, 並且比較在 phonebook 內的 cache miss , append() findName() 時間的差異.

不同 Hash function 的比較

參考各种字符串Hash函数比较中 8 種 hash function ( SDBM,RS,JS,PJW,ELF,BKDR,DJB2,AP ) 的實作
試著使用在 w5 作業中 matrix_oo 物件導向的做法, 把 hash function 封裝起來

phonebook_hash.h


















typedef struct __HASH_ALGO {
    unsigned int (*hashfuction)(char *str);
    hashTable* (*init)();
    void (*free)(hashTable *ht );
} hashAlgo;


entry* findName(char lastName[], hashTable *ht, hashAlgo *algo);
void append(char lastName[], hashTable *ht, hashAlgo *algo);

extern hashAlgo SDBMHashProvider;
extern hashAlgo RSHashProvider;
extern hashAlgo JSHashProvider;
extern hashAlgo PJWHashProvider;
extern hashAlgo ELFHashProvider;
extern hashAlgo BKDRHashProvider;
extern hashAlgo DJB2HashProvider;
extern hashAlgo APHashProvider;

接著只要在指向不同實作就可切換不同 hash function





hashAlgo BKDRHashProvider = {
    .hashfuction = BKDRHash,
    .init =initHashTable,
    .free = freeHashTable,
};

如此一來在 main.c 裡就比較方便切換也不用寫太多的判斷, 不過應該還可以透過老師說過的 ELF linker set 的方法在寫得更漂亮.

main.c

















hashAlgo * hash_function_providers[] = {
    &SDBMHashProvider,
    &RSHashProvider,
    &JSHashProvider,
    &PJWHashProvider,
    &ELFHashProvider,
    &BKDRHashProvider,
    &DJB2HashProvider,
    &APHashProvider
};

#ifdef HASH
    /* init hash table */
    hashAlgo *algo = hash_function_providers[0];
    hashTable *ht = algo->init();
#endif

比較圖表

table size 和 collision 分析

以下透過不同的 table size 來分析不同的 hash fuction 下 input data word.txt 分布情形

Table size : 3000

將所有分布圖表全部顯示會發現 ELF 與 PJW 的分佈變動很大, 統計資料觀察資料的平均值標準差與變異數的分布情況如下

| |SDBM | RS |JS|PJW|ELF|BKDR|DJB2|AB|
|:––:|:––-:|:–––:|:––:|:––-:|:–––:|:––:|:––-:|:––-:|:––-:|
|Mean |116.63|116.63|116.63|116.63|116.6|116.63| 116.63| 116.63|
|Standard deviation|10.686|10.683|10.785|56.979|56.979|10.592|10.753|10.666|
|Variations|0.09162|0.09160|0.09247|0.4885|0.4885|0.0908|0.0921|0.0914|

由於 input data word.txt 的所有資料為 349900 筆, 因此理論上平均分配到每個 hash[tag] 內的資料應該為 116.6333 筆資料, 在觀察邊準差與變異數分布發現 BKDR 分布的情形最為平均 (邊準差與變異數皆為最小)

Table size : 30000

接著將 Table size 提高到 30000 再來觀察
在這個 table size 下表現最好的為 RS function , 除了 ,PJW 與 ELF 之外數據都沒有差太多

| |SDBM | RS |JS|PJW|ELF|BKDR|DJB2|AB|
|:––:|:––-:|:–––:|:––:|:––-:|:–––:|:––:|:––-:|:––-:|:––-:|
|Standard deviation|3.4135|3.3923|3.4240|10.2899|10.2899|3.4080|3.4108|3.3959|
|Variations|0.2926|0.2908|0.2935|0.8822|0.8822|0.2921|0.2924|0.2911|

Table size : 300000

最後將 table size 提高到 300000 跟 input data 的資料總數是同一個 order 再觀察 BKDR 的表現還是最好, 雖然與其他 fuction 的差異, 僅僅只有小數點後 3 位整體比較下來 BKDR 在三個不同 table size 的分布最為穩定

PJW 與 ELF 的資料分布都完全相同看起來要回頭看一下實作是不是有問題

| |SDBM | RS |JS|PJW|ELF|BKDR|DJB2|AB|
|:––:|:––-:|:–––:|:––:|:––-:|:–––:|:––:|:––-:|:––-:|:––-:|
|Standard deviation|1.0791|1.0802|1.0796|1.4691|1.4691|1.0788|1.0795|1.0816|
|Variations|0.9252|0.9262|0.9257|1.2596|1.2596|0.9250|0.9256|0.9274|

Binary-search tree 與 cache line size

Binary search tree

由於在原始版本程式中, 電話簿中的資料是被按照排序過的 ( 由Ａ到Ｚ). 因此實作參考了 Sorted Linked List to Balanced BST ，將原本由 append() 建立已排序好的 linked list 轉成 Binary search tree 的結構, 建立 binary searh tree 需要給 head 與 list 總數為輸入，最後回傳 root 節點 (如下列表示的 4 or 3)

Input: Linked List 1->2->3->4->5->6->7
Output: A Balanced BST
        4
      /   \
     2     6
   /  \   / \
  1   3  4   7  

Input: Linked List 1->2->3->4
Output: A Balanced BST
      3   
    /  \  
   2    4 
 / 
1























node *sortedListToBST (entry *e)
{
    int countListLength = 0;
    entry *tmp = e;
    while (tmp) {
        countListLength++;
        tmp = tmp->pNext;
    }
    return sortedListToBSTRecur(&e, countListLength);
}

node *sortedListToBSTRecur(entry **e, int listlength)
{
    if (listlength <= 0)
        return NULL;
    node *left = sortedListToBSTRecur(e, listlength/2);
    node *root = (node *) malloc(sizeof(node));
    root->left = left;
    root->pEntry = *e;
    *e = (*e)->pNext;
    root->right = sortedListToBSTRecur(e, listlength - listlength/2 - 1);
    return root;
}

 Performance counter stats for './phonebook_bst' (100 runs):

           524,818      cache-misses              #   68.953 % of all cache refs      ( +-  0.16% )
           761,122      cache-references                                              ( +-  0.31% )
       330,462,207      instructions              #    1.72  insns per cycle          ( +-  0.02% )
       192,097,333      cycles                                                        ( +-  0.28% )

       0.074406735 seconds time elapsed                                          ( +-  1.27% )

其中 hash table size 是 42937使用的 function 是 BKDR

cache line size

這邊參考 vtim9907 的實做,為了符合 cache line 的特性，決定把 node 的結構更該成 64 bytes 的大小，原來 node 的結構就只包含兩個分別指到左、右子樹的指標，和一個指到一個 phonebook entry 的指標：大小為 8 * 3 = 24 bytes





typedef struct __PHONE_BOOK_BST_NODE {
    struct __PHONE_BOOK_BST_NODE *left;
    struct __PHONE_BOOK_BST_NODE *right;
    entry *pEntry;
} node;

為了增加 node 的大小, 將原本的 word.txt 中的資料將它分組, 如下面的例子即是將資料 6 個一組. node 中增加了一個陣列裡頭包含了 6 個指向 phonebook entry 的指標 :大小為 8 * 6 + 8 * 2 = 64 bytes





typedef struct __PHONE_BOOK_BST_NODE {
    struct __PHONE_BOOK_BST_NODE *left;
    struct __PHONE_BOOK_BST_NODE *right;
    entry *pEntry[6] ;
} node;

Performance counter stats for './phonebook_bst' (100 runs):

           453,643      cache-misses              #   63.317 % of all cache refs      ( +-  0.19% )
           716,468      cache-references                                              ( +-  0.25% )
       251,680,487      instructions              #    1.58  insns per cycle          ( +-  0.04% )
       159,466,589      cycles                                                        ( +-  0.20% )

       0.060881942 seconds time elapsed

可以看到 append() time 從原本的 0.064934 下降到 0.051413 下降約 20%
整體執行時間由 0.074406735 下降到 0.060881942) 下降約 18%
Cache-miss 也是從 68.9% 下降到 63.3%

reference

heathcliffYang 共筆
 hash function 觀念和實務
 git branch and remote repo
Perf – Linux下的系统性能调优工具，第 1 部分
Fast Efficient Fixed-Size Memory Pool
Using Fuzzy Matching to Search by Sound with Python
NIKITA'S BLOG: Fuzzy string search
Charles 共筆
 SOUNDEX
Levenshtein distance wiki
Entropy (information theory) wikipedia
vtim9907 共筆