2017q1 Homework1 (phonebook)

contributed by < etc276 >

tags: `etc276`

Reviewed by `yanang`

在 append() 中，當發生相同 hash value ，以 linked list 串連心增加的 data ，若考慮將資料重最前端插入，則可以減少尋找尾端的時間。




    entry *pHead = (entry *) malloc (sizeof(entry));
    pHead->pNext = e[hash];
    e[hash] = pHead;
    strcpy(pHead->lastName,lastName);

不同的 hash function 和不同的 table size 都會影響實驗結果，可以參考 how to choose size of hash table
針對你的實驗結果，就我的實驗結果和各位同學的共筆展示，使用 hash table 來搜尋資料應是非常迅速。
- time of finName() = 0.002348 是完全不符合預期的，是否在繪圖時使用了錯誤的資料來源?

開發環境

Ubuntu 16.10 (64bit)

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 61
Model name:            Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
Stepping:              4
CPU MHz:               2400.878
CPU max MHz:           3000.0000
CPU min MHz:           500.0000
BogoMIPS:              4788.90
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              4096K
NUMA node0 CPU(s):     0-3

開發紀錄

Original













typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

自 GitHub 取得作業程式碼 (先要 fork 到自己帳號在 clone )

$ git clone https://github.com/etc276
$ cd phonebook

執行程式，並用 perf 檢查效能，可以發現 cache miss 高達 81.73%

Performance counter stats for './phonebook_orig' (100 runs):

         3,430,005      cache-misses              #   81.729 % of all cache refs      ( +-  0.03% )
         4,196,826      cache-references                                              ( +-  0.19% )
       261,712,845      instructions              #    1.34  insn per cycle           ( +-  0.02% )
       194,998,572      cycles                                                        ( +-  0.26% )

       0.066929775 seconds time elapsed                                          ( +-  0.31% )

透過 gnuplot 建立執行時間的圖表

Optimization 1 (By struct)

















typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    struct __PHONE_BOOK_ENTRY *pNext;
    struct __PHONE_BOOK_DETAIL *pDetail;
} entry;

typedef struct __PHONE_BOOK_DETAIL {
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
} detail;

目前我對 cache miss 對效能影響的理解為，由於記憶體使用上並不連續，造成 latency 變長。所以先從 struct 下手，讓會用到的資料盡量連續，而方法為將 lastname 以外的資料存在別的地方，減少單一 entry 的大小。

如下方所示，cache miss下降到 69.64%，執行時間也稍有下降。

 Performance counter stats for './phonebook_opt' (100 runs):

         1,210,125      cache-misses              #   69.643 % of all cache refs      ( +-  0.19% )
         1,737,624      cache-references                                              ( +-  0.58% )
       243,060,991      instructions              #    1.75  insn per cycle           ( +-  0.02% )
       139,089,184      cycles                                                        ( +-  0.78% )

       0.048515702 seconds time elapsed                                          ( +-  0.93% )

Optimization 2 (By hash)























entry *findName(char lastName[], entry *e[])
{
    int hash = BKDRHash(lastName)%TABLE_SIZE;
    entry *tmp = e[hash];
    while (tmp) {
        if (!strcasecmp(lastName, tmp->lastName))
            return tmp;
        tmp = tmp->pNext;
    }
    return NULL;
}

void append(char lastName[], entry *e[])
{
    int hash = BKDRHash(lastName)%TABLE_SIZE;
    entry *tmp = e[hash];
    while (tmp->pNext) {
        tmp = tmp->pNext;
    }
    entry *node = (entry *) malloc(sizeof(entry));
    tmp->pNext = node;
    strcpy(node->lastName, lastName);
}

減少 cache miss 之後，下一步想辦法從演算法減少時間。 original 的版本就是從頭到尾跑一遍查詢，這在資料量龐大時效能會有明顯影響，所以就想用 O(1) 的 hash 去實現。而在參考完 Section 8 哈希表（Hash Table），決定使用 BKDR 加上"拉鍊法"處理 hash 和碰撞的問題。

而實際上 coding 才想到自己其實不知道要怎麼選 table size，參考完为什么hash table的大小最好取一个不接近2^p的素数還是不知道要選用什麼數量級的質數，再繼續看其他人的開發紀錄，發現有人實驗過，最後決定選用 20k 附近且有在這篇中出現的 21911 作為 table size.

cache miss 下降到 36.93%

 Performance counter stats for './phonebook_opt' (100 runs):

            15,649      cache-misses              #   36.930 % of all cache refs      ( +-  1.25% )
            42,374      cache-references                                              ( +-  0.61% )
           691,447      instructions              #    0.59  insn per cycle           ( +-  0.20% )
         1,169,857      cycles                                                        ( +-  1.09% )

       0.121802201 seconds time elapsed                                          ( +-  1.69% )

所以你還是沒搞懂，快去「找教科書」(不要茫茫網海亂找) 來讀，我等你的報告 –jserv

問題思考

有代表性嗎？跟真實英文姓氏差異大嗎？
- 我們先查看各字母使用的頻率，可以發現差異相當大
  
  圖片來源 : https://zh.wikipedia.org/wiki/字母频率
- 然後再查看英文單詞中，首字母的頻率
- 再查看 work.txt 以及 illusion030 的共筆可得知此 dataset 的真實姓名不多
words.txt 中有大量沒有意義的單字(ex.aaaaa)，寫一個比較的程式發現，words.txt 裡總共 349900 筆資料，但是只有 7426 筆資料有在 all-names.txt 裡出現，所以 words.txt 對 phonebook 來說代表性很低。
- 而且真實情況中， lastName 一定會發生重複的情況

首字母	單詞頻率
t	16.671%	
a	11.602%	
s	7.755%	
h	7.232%	
w	6.753%	
i	6.286%	
o	6.264%	
b	4.702%	
m	4.374%	
f	3.779%	
c	3.511%	
l	2.705%	
d	2.670%	
p	2.545%	
n	2.365%	
e	2.007%	
g	1.950%	
r	1.653%	
y	1.620%	
u	1.487%	
v	0.649%	
j	0.597%	
k	0.590%	
q	0.173%	
x	0.037%	
z	0.034%

資料難道「數大就是美」嗎？
寫程式是要給其他人使用的，所以數據的選擇也要盡量真實。以此 dataset 為例，就算增加 work.txt 裡的 "偽" 數據，也無法"美"。例如 Michael 在英文名字中出現的頻率遠比 Zuzzolo 高，如果我們單純地用 lastName 去做 hash 處理，會發現前者的 link list 過長，而後者就只有一個 entry 的情況出現。
如果我們要建立符合真實電話簿情境的輸入，該怎麼作？
要先確定這本電話簿的使用情況，如年代、地區等等，然後去戶政事務所或是相關網站找出一部份的姓名，並試圖增加重複性或抽樣等方法當作母群體，去實際執行並抓出 95% 的信賴區間，以符合真實情況。

進度與小結

「預計」就不要寫出來，Show me the code!
HackMD 會幫你記錄每筆紀錄的變更和具體時間，把時間省下來發現並解決問題 –jserv

* 2/27 以前在安裝 ubuntu 和閱讀相關資烙 * 2/28 開始寫開發紀綠(一些自己的認知和 original 的測試結果) * 3/01 第一次 commit (Opt by struct) * 3/02 (預計) 完成 Opt by hash

心得

在寫 hash 的時候，產生一堆 bug ，時間詭異，圖也沒有出現。後來才知道是 main.c 錯太多東西，導致像 orig.txt 等文件並沒有產生成功，索性刪掉重新 clone 。之後先仔細閱讀 Makefile 裡各 source code 的相依性，才知道要先從 opt.h 修改起，像是我修改 function 的 prototye，這時 main.c 就需要去利用 #ifdef OPT 做修改，最後才是 opt.c 的小 bug .

2017q1 Homework1 (phonebook)

tags: `etc276`

Reviewed by `yanang`

開發環境

相關連結

開發紀錄

Original

Optimization 1 (By struct)

Optimization 2 (By hash)

問題思考

相關知識

cache

參考資料

perf

gunplot

astyle

進度與小結

心得

2017q1 Homework1 (phonebook)

tags: etc276

Reviewed by yanang

開發環境

相關連結

開發紀錄

Original

Optimization 1 (By struct)

Optimization 2 (By hash)

問題思考

相關知識

cache

參考資料

perf

gunplot

astyle

進度與小結

心得

tags: `etc276`

Reviewed by `yanang`