2017q1 Homework1 (phonebook)

contributed by <csielee>

Reviewed by `baimao8437`

注意共筆中該有的空格 $make plot 應為 $ make plot

已修正空格問題東霖

commit message 注意開頭大寫

好的東霖

在 memory pool 中一次 malloc 大量空間，能有效減少時間開銷，但是沒有考慮到 malloc 失敗的情況。
使用的 data set 雖已改進，不只用 zyxel 做測試，但仍可能為無意義的英文字母，尚未使用符合現實的英文姓氏做實驗。

開發環境

OS: ubuntu 16.04 LTS
Architecture: x86_64
CPU 作業模式： 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
Model name: Intel® Core™ i5-5200U CPU @ 2.20GHz
CPU MHz： 2500.109
CPU max MHz: 2700.0000
CPU min MHz: 500.0000
L1d 快取： 32K, 8-way Set-associative
L1i 快取： 32K, 8-way Set-associative
L2 快取： 256K, 8-way Set-associative
L3 快取： 3072K, 12-way Set-associative
記憶體： 8G

東霖衝阿!!亮谷

好的！東霖

原始版本

使用$ make cache-test得到

 Performance counter stats for './phonebook_orig' (100 runs):

         3,411,843      cache-misses              #   91.811 % of all cache refs      ( +-  0.05% )
         3,716,148      cache-references                                              ( +-  0.15% )
       262,364,338      instructions              #    1.46  insn per cycle           ( +-  0.02% )
       179,522,561      cycles                                                        ( +-  0.21% )

       0.070422742 seconds time elapsed                                          ( +-  1.54% )

發現 cache miss 高的嚇人，高達90%以上

而執行$ make plot，看output.txt得到的執行時間

size of entry : 136 bytes
execution time of append() : 0.048714 sec
execution time of findName() : 0.005552 sec

優化方法A–改變資料結構(減少entry大小)

從原始版本的執行資料中，發現 cache miss 很高
研讀cache 原理和實驗之後
還是不太能夠理解為什麼 entry 的大小136byte會讓 cache miss 如此高
在我的理解當中，每次產生一個新的 entry 然後對他進行操作
就是一次的 cache reference，但在前面 append() 函數的階段
每次都是創一個新的然後操作他，應該 cachemiss 非常高是很合理的
(因為都是新的位址，理論上每次 reference 就 miss)
而在後面 findname() 階段才會重複用到相同的 entry，才有機會 hit

而且我發現當文章當中計算 cache miss 的方法似乎怪怪的

cache miss
1574000 (總 block)/ 3(一個 entry 需要 3 block) = 524667 組 entry
524667 / 16(index 數) = 32790(tag 數，填到相同 index 的個數)
由上知一個 cache line 最多可以放 2 組 entry，所以有兩次機會
32790/2=16395(可能被對到次數)
16395 * 16(index 數)* 3(entry 佔的 block 數) = 786960(cache hit 次數)
1574000-78960 = 1495040(cache miss)
1495040/1574000 = 94%(miss rate)

倒數第2行 1574000-78960應該是1574000-786960
如此一來cache miss應該是787040
787040/1574000 = 50%(miss rate)
不會得到perf stat算出來的答案

but，人生永遠有那個but
嘗試將資料結構除了 lastname 以外的部份改變成 pointer
讓 entry 的大小從136byte變成96byte

typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    char *firstName;
    char *email;
    char *phone;
    char *cell;
    char *addr1;
    char *addr2;
    char *city;
    char *state;
    char *zip;
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

卻能發現 cache miss 的減少，以及執行時間的縮短
但是 cache reference 並沒有改變太多

 Performance counter stats for './phonebook_opt' (100 runs):

         2,679,614      cache-misses              #   73.957 % of all cache refs      ( +-  0.06% )
         3,623,214      cache-references                                              ( +-  0.14% )
       258,304,089      instructions              #    1.67  insn per cycle           ( +-  0.02% )
       155,053,519      cycles                                                        ( +-  0.18% )

       0.059083326 seconds time elapsed                                          ( +-  0.25% )

往這個方向繼續加強，只讓entry當中有具體的lastName可以去當索引被搜尋，其餘資料只剩一個指標去指引
這樣等找到要的entry再去讀其餘資料
entry大小變為32byte

typedef struct __PHONE_BOOK_ENTRY_DATA {
	char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
} entryData;
typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    struct __PHONE_BOOK_ENTRY_DATA *data;	
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

觀察效能發現 cache miss 跟 cache reference 竟然都減少了50%左右
執行時間也有縮短

 Performance counter stats for './phonebook_opt' (100 runs):

         1,196,690      cache-misses              #   73.574 % of all cache refs      ( +-  0.13% )
         1,626,509      cache-references                                              ( +-  0.42% )
       244,622,117      instructions              #    1.97  insn per cycle           ( +-  0.02% )
       124,207,231      cycles                                                        ( +-  0.55% )

       0.047873363 seconds time elapsed                                          ( +-  0.65% )

做到這裡
讓我產生新的疑問，為什麼 entry 的大小也會連帶令 cache reference 減少，難道不是有多少資料要處理就要 reference 多少次嗎
觀察main.c，發現裏面都是先把字典中的資料一個一個append進去
然後去搜尋zyxel，照裡來說，所需要處理的資料量應該是相同的
也就是說存取變數記憶體位址的次數不會有太大的改變

這些疑問，我會放在心中
持續尋找解答東霖

後來重新聽老師上課，並且自己重新思考後
得出新的結論
entry越小，一個entry所佔據的block越小，因此reference也會相對減少
而cache miss的減少，主要是因為cache存取資料的時候
會一次把cache line的大小都讀進cache
因此如果entry可以容納進cache line，這樣可以讓miss的機會變少東霖

用張圖總結資料結構改變帶來的效能增進

優化方法B–使用hash function

比較過不同的 hash function
得知 BKDR 是較好的方法
建立 hash table，同時加入 BKDR
為了避免每次在發生碰撞後，要從頭跑到尾的時間
紀錄每次的頭跟尾巴，加速 append 的時間

//phonebook_opt.h
typedef struct __ENTRY_HASH {
	entry *pHead;
	entry *pTail;
} entry_hash;
entry_hash hash_table[HASH_SIZE];

//phonebook_opt.c
entry *findName(char lastName[], entry *pHead)
{
	pHead = hash_table[hash(lastName)].pHead;
	while(pHead!=NULL)
	{
		if(strcasecmp(pHead->lastName,lastName)==0)
			return pHead;
		pHead = pHead->pNext;
	}
    return NULL;
}
entry *append(char lastName[], entry *e)
{
	entry_hash *t = &hash_table[hash(lastName)];
	if(t->pHead == NULL)
	{
		t->pHead = (entry *) malloc(sizeof(entry));
		t->pHead->pNext = (entry *) malloc(sizeof(entry));
		t->pTail = t->pHead->pNext;
		strcpy(t->pTail->lastName,lastName);
		t->pTail->pNext = NULL;
	}
	else
	{
		t->pTail->pNext = (entry *) malloc(sizeof(entry));
		t->pTail = t->pTail->pNext;
		strcpy(t->pTail->lastName,lastName);
		t->pTail->pNext = NULL;
	}
    return e;
}
unsigned int BKDRHash(char *str)
{
    unsigned int seed=131;
	unsigned int hash=0;
	while(*str)
	    hash = hash * seed +(*str++);
	
	return (hash & HASH_SIZE);
}
unsigned int hash(char str[])
{
	return BKDRHash(str);
}

經由 perf 觀察得知，cache miss 有減少

 Performance counter stats for './phonebook_opt' (100 runs):

           427,869      cache-misses              #   37.063 % of all cache refs      ( +-  0.54% )
         1,154,428      cache-references                                              ( +-  0.46% )
       239,938,464      instructions              #    1.70  insn per cycle           ( +-  0.02% )
       140,743,767      cycles                                                        ( +-  0.24% )

       0.054507888 seconds time elapsed                                          ( +-  0.35% )

藉由圖表呈現

append() 的時間有稍微上升，主要原因是建立hash的串列上比較複雜
要先計算出 hash 的值，再找到 table 中的串列並增加進去
但是 findName() 的時間大幅縮短，主因是利用 hash 分成很多不同的串列，藉此減少拜訪的串列長度，因此時間縮短

改進findName的測試方法

從main.c中發現，最後 findName() 所用的字串都是zyxel
我在想是否能夠改進這部份的code，讓 findName 每次找都不太一樣
去觀察是否前面的優化是有幫助的

    //因為執行100次的速度太快，只使用time(NULL)當種子不夠好
    srand(time(NULL)+cpu_time1*1000);
    //count是利用前面append的時候計算測資有多少筆
    int choose = rand()%count+1;
    char input[MAX_LAST_NAME_SIZE] = "zyxel";
    count = 0;
    fp = fopen(DICT_FILE,"r");
    if(fp == NULL) {
        printf("cannot open the file\n");
    }
    i = 0;
    while (fgets(line, sizeof(line), fp)) {
        count++;
        if(count == choose) {
            while (line[i] != '\0')
                i++;
            line[i - 1] = '\0';
            strcpy(input,line);
            break;
        }
    }

修改完後，利用$ make plot觀察，的確每次查詢的字串都不相同
且不會有連續兩到三次以上都是相同的字串
得到的圖如下

發現到原始版本的 findName 時間有下降
我想這是因為不再每次都是搜尋zyxel的 wrost case
因此 linked list 的效能損失沒那麼明顯

而hash版本看起來毫無影響，我預期是因為BKDR hash函數的分佈很平均
因此就算每次都搜尋不同的字串，遇到 linked list 也不會太長

BKDR hash的分佈

BKDR hash 的分佈還需要驗證東霖

上面是將 words.txt 放進去 BKDR hash
並將每個 hash value 的個數紀錄下來變成圖表
從上圖可以觀察到從0~2047(我設定的 hash size)
大部分的 hash value 所對應到的字串並不會差異很大
把 349900/2048 大約就等於是 170
而這個分佈圖算出來的變異係數是 0.077291

./hashdistribute 
HASH_SIZE = 2047
total numbers of string = 349900,Standard deviation = 13.205107,coefficient of variation = 0.077291

根據95%信賴區間(兩倍標準差)，我們可以預估95%的hash value會有 143.6~196.4 個字串

Hash 函數的好壞

只測試一種 hash 不夠表現 BKDR hash 是不是分佈夠平均
多試試看不同的 hash function 和不同的HASH SIZE

unsigned int BKDRHash(char *str)
{
    unsigned int seed = 131;
    unsigned int hash = 0;
    while(*str)
        hash = hash * seed +(*str++);
    return (hash & HASH_SIZE);
}
unsigned int APHash(char *str)
{
    unsigned int hash = 0;
    int i;
    for (i=0; *str; i++)
    {
        if ((i & 1) == 0)
            hash ^= ((hash << 7) ^ (*str++) ^ (hash >> 3));
        else
            hash ^= (~((hash << 11) ^ (*str++) ^ (hash >> 5)));
    }
    return (hash & HASH_SIZE);
}
unsigned int DJB2Hash(char *str)
{
    unsigned int hash = 5381;
    int c;
    while (c = *str++)
        hash = ((hash << 5) + hash) + c;
    return (hash & HASH_SIZE );
}
unsigned int ELFHash(char *str)
{
    unsigned int hash = 0;
    unsigned int x    = 0;
    while (*str) {
        hash = (hash << 4) + (*str++);
        if ((x = hash & 0xF0000000L) != 0) {
            hash ^= (x >> 24);
            hash &= ~x;
        }
    }
    return (hash & HASH_SIZE);
}

hash function\hash size	0x7FF	0x7FFF
BKDR	0.077291	0.308122
AP	0.080878	0.305714
DJB2	0.076022	0.304758
ELF	1.008573	1.240482

相比之下，發現在各種字符串Hash函數比較文章當中
ELF hash 的確是分佈的非常差
而 BKDR 跟 DJB2 都比較優秀，AP 則是差了一點

優化版本C–使用memory pool

因為頻繁的使用malloc，也會造成效能的損失
因此事先malloc好一塊記憶體，需要的時候取一段出來
能夠有效減少malloc的次數

再實作memory pool遇到了一個困難
就是我一次 malloc 1000個entry的空間
用完就在 malloc 1000個空間，本來預期可以有效的減少malloc的次數
但是卻在執行階段出現 segmentation falut

int entry_memory_count=POOL_SIZE;
entry *entry_memorypool;

void *getEntry()
{
    if(entry_memory_count==POOL_SIZE) {
        printf("get pool\n");
        entry_memorypool = (entry *)malloc(sizeof(entry)*POOL_SIZE);
        entry_memory_count=0;
    }
    //entry *e = entry_memorypool+sizeof(entry)*entry_memory_count;
    void *a = &entry_memorypool[entry_memory_count];
    entry_memory_count++;
    return a;
}

因此嘗試使用gdb跟valgrind去找出問題
用gdb找到問題發生在append()的時候，在某個 lastName 要加進來的時候
要去存取pNext就發生錯誤，但是那明明就是在我 malloc 出來的記憶體範圍內
上面473,62bff0，是代表著getEntry所給予的記憶體位址跟在每次1000個 entry 當中的473個
我去觀察 pNext 的位置是 entry 32bytes當中的24byte到31byte(0x62bff0+0x18 = 0x62c008)
所以我遲遲找不出問題點

...
472,62bfd0
473,62bff0

Program received signal SIGSEGV, Segmentation fault.                   
0x00000000004010c7 in append (lastName=0x7fffffffd8b0 "acceptances",   
    e=0x60b240) at phonebook_opt.c:54                                  
54          t->pTail->pNext = NULL;

(gdb) print t->pTail
$1 = (entry *) 0x62bff0

(gdb) print t->pTail->pNext
Cannot access memory at address 0x62c008

(gdb) print t->pTail->lastName
$2 = "acceptances\000\000\000\000"

後來去使用valgrind嘗試執行去觀察記憶體的狀況
程式能夠順利執行完，但是卻出現很多錯誤
大多是無效的寫入跟讀取

==21788== Invalid write of size 8
==21788==    at 0x4010C7: append (in /home/csielee/embedded/phonebook-1/phonebook_opt)
==21788==    by 0x400B92: main (in /home/csielee/embedded/phonebook-1/phonebook_opt)
==21788==  Address 0x5230f68 is 14,520 bytes inside an unallocated block of size 4,020,528 in arena "client"

所以我在想是不是malloc出來的空間並不連續，當成陣列來分配是會有問題的

後來發現，以上都是我腦殘所導致的
原因出現優化版本B所使用的hashtable

因為在我的 BKDR hash 當中
不是利用 % 去把hash值控制在 HASH_SIZE 而是利用 &
所以hash值會落在 0~HASH_SIZE ，因此hashtable必須宣告 HASH_SIZE+1 這麼多
而我當初只有宣告 HASH_SIZE ，所以當hash值是 HASH_SIZE 的時候，去存取就會發生問題

正確實作memory pool之後，呼叫 malloc 的次數下降了非常多
讓 append() 的時間有效的下降，以下是圖表

而因為這個機會，特別去注意到memory leak的問題
利用 valgrind 檢查

==24671== HEAP SUMMARY:
==24671==     in use at exit: 11,265,536 bytes in 2,398 blocks
==24671==   total heap usage: 2,406 allocs, 8 frees, 11,280,536 bytes allocated
==24671== 
==24671== LEAK SUMMARY:
==24671==    definitely lost: 0 bytes in 0 blocks
==24671==    indirectly lost: 0 bytes in 0 blocks
==24671==      possibly lost: 0 bytes in 0 blocks
==24671==    still reachable: 11,265,536 bytes in 2,398 blocks

發現並沒有leak的問題，但是卻有沒有釋放掉的記憶體
就是hash_table以及使用的entry

因為當初寫的 getEntry() 並沒有去紀錄 malloc 哪些位址
所以直接釋放 entry 會有 double free 的問題
因此改寫code

問題回答

Q：有代表性嗎？跟真實英文姓氏差異大嗎？: 觀察了words.txt，發現蠻多無意義不像是名字的字串。如此一來，代表性可能不太夠。也跟真實姓名差異頗大。
Q：資料難道「數大就是美」嗎？: 個人認為不是，資料要有代表性有意義的才好。資料量看起來龐大，但是如果太多跟真實情況相差太多的資料，反而會無法真正解決問題。
Q：如果我們要建立符合真實電話簿情境的輸入，該怎麼作？: 感覺可以去尋找英文名字的名冊，從中取得真實且存在的英文姓名。

不要說「感覺」，理工科系的學生說話要非常精準，不要假裝自己是文青。 –jserv

好的，我會繼續讓自己說的話更精準東霖

額外補充

push.default

這一次在寫這個作業的時候，使用$git push的時候
很突然的跳出警告，以前使用都沒有遇到
想說來研究發生了什麼

warning: push.default is unset; its implicit value has changed in
Git 2.0 from 'matching' to 'simple'. To squelch this message
and maintain the traditional behavior, use:

  git config --global push.default matching

To squelch this message and adopt the new behavior now, use:

  git config --global push.default simple

When push.default is set to 'matching', git will push local branches
to the remote branches that already exist with the same name.

Since Git 2.0, Git defaults to the more conservative 'simple'
behavior, which only pushes the current branch to the corresponding
remote branch that 'git pull' uses to update the current branch.

See 'git help config' and search for 'push.default' for further information.
(the 'simple' mode was introduced in Git 1.7.11. Use the similar mode
'current' instead of 'simple' if you sometimes use older versions of Git)

參考了Git push与pull的默认行为後得知
原來是隨著github的改版，在push指令的預設行為有了不同的設定
因此會特別跳出來警告，避免使用者不知道現在push的行為是什麼
也希望使用者能夠設定預設行為是什麼

除了警告提到的matching、simple之外
還有很多不同模式

nothing
- 什麼都不做，除非有指定push的branch
current
- push branch到遠端同名branch，遠端branch不存在會建立一個
upstream
- 將本地branch上傳到upstream branch
simple
- 遠端跟本地branch要同名才會允許操作
matching
- 當本地跟遠端branch同名就都push

參考資料

示範用的phonebook
去年的我
 Git push与pull的默认行为
 cache 原理和實驗
 勃興大大的筆記