2017q3 Homework1 (phonebook)

# 2017q3 Homework1 (phonebook) ###### tags: `sysprogram` contributed by <`st9007a`> >> 不要打錯字，預期是 "contributed"，之前少了 "d" 字母 >> [name="jserv][color=red] ### Reviewed by `amikai` - 在 Hash Function 碰撞分析, y 軸的 frequency 是什麼的頻率, x 軸代表什麼意義, 這張圖有幾條 bar 是比較密的, 有幾條是比較稀疏的好像沒有特別的解釋, 上面那張圖跟下面的表格在文章裡好像沒有敘述他們有什麼關聯 (P.S. 可能程度比較差看不太懂) - 在問題與討論, 第一個問題, 建議同學可以引進真正的資料集, 並且多測試幾種, 以佐證你所說的 load factor 0.7 - 0.8 之間有教好的效果, 並且讓此實驗更完備 - 在問題與討論,第二個問題, 建議同學可以進一步嘗試丟入更龐大的資料, 以佐證自己的結論 - 這位同學的研究, 多半是在 hash 上, 在 append 裡大量的使用了 malloc ,這也是降低效能的因素之一, 建議這位同學可以嘗試引入 mmap 或是 memory pool - 老師好像只提到 `git commit --amend` ，這似乎只能修改前一個 commit message, 如果要修改很前面的 commit 可以考慮使用 `git rebase -i` 以上為小弟簡短的建議, 如有冒犯請見諒 ## [Github](https://github.com/st9007a/phonebook) ## 開發環境 ``` Linux 4.4.0-92-generic Ubuntu 16.04.1 LTS L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 6144K ``` ## 原始程式碼首先看看 `phonebook_orig.h` ```c #ifndef _PHONEBOOK_H #define _PHONEBOOK_H #define MAX_LAST_NAME_SIZE 16 /* original version */ typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; struct __PHONE_BOOK_ENTRY *pNext; } entry; entry *findName(char lastName[], entry *pHead); entry *append(char lastName[], entry *e); #endif ``` 可以知道 entry 的資料結構是 linked list，接著看看 `phonebook_orig.c` ```c #include <stdio.h> #include <stdlib.h> #include <string.h> #include <ctype.h> #include "phonebook_orig.h" /* original version */ entry *findName(char lastName[], entry *pHead) { while (pHead != NULL) { if (strcasecmp(lastName, pHead->lastName) == 0) return pHead; pHead = pHead->pNext; } return NULL; } entry *append(char lastName[], entry *e) { /* allocate memory for the new entry and put lastName */ e->pNext = (entry *) malloc(sizeof(entry)); e = e->pNext; strcpy(e->lastName, lastName); e->pNext = NULL; return e; } ``` 可以知道裡面的 `findName()` 實作了 linked list 的 search ，而 `append()` 則是在 entry 後面插入一個新的 entry 接著來看看原始效能如何，在 command line 上輸入 `$ perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ./phonebook_orig` 出現如下的顯示： ``` Performance counter stats for './phonebook_orig' (5 runs): 457,6717 cache-misses # 88.239 % of all cache refs ( +- 0.33% ) 518,6740 cache-references ( +- 0.38% ) 2,6177,2530 instructions # 1.53 insns per cycle ( +- 0.12% ) 1,7155,7629 cycles ( +- 0.87% ) 0.169891497 seconds time elapsed ( +- 34.68% ) ``` 可以看到 cache miss 有 88.239% ## 縮減結構的size 根據 [cache line](http://cenalulu.github.io/linux/all-about-cpu-cache/) 裡面有個實驗，當存取的資料超過 cache line 的大小後，執行時間會上升，原因在於填充 cache line 的時間大於存取資料的時間，所以我認為縮減資料結構的大小應該可以提升效能，即盡可能讓越多資料被 cache 在同一條 cache line 上越好，所以就直接在資料結構上動手腳，其餘都先保持原樣 > access 在台灣的慣用翻譯詞是「存取」，而非「訪問」 > [name="jserv"][color=red] ```c typedef struct __PHONE_BOOK_INFO { char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; } info; typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; struct __PHONE_BOOK_INFO *pInfo; struct __PHONE_BOOK_ENTRY *pNext; } entry; ``` 以上使用了一個額外的 struct 來將比較詳細的資訊記下來，而 `lastName` 是搜尋用的指標，因此保留在 `entry` 裡面，比較一下原本跟修改後的資料大小，直接執行 `phonebook_orig` 跟 `phonebook_opt` 來看結果 ``` $ ./phonebook_orig size of entry : 136 bytes execution time of append() : 0.036390 sec execution time of findName() : 0.004765 sec $ ./phonebook_opt size of entry : 32 bytes execution time of append() : 0.030494 sec execution time of findName() : 0.001589 sec ``` 可以注意到修改後 `entry` 的 size 由 136bytes 縮減為 32bytes，除此之外還可以發現 `findName()` 的執行速度提升了大約三倍左右，接著使用 perf 來看看 cache miss 的狀況如何，在 command line 輸入 `$ make cache-test` ``` Performance counter stats for './phonebook_orig' (100 runs): 453,4382 cache-misses # 88.507 % of all cache refs ( +- 0.08% ) 512,3211 cache-references ( +- 0.06% ) 2,6126,0775 instructions # 1.54 insns per cycle ( +- 0.02% ) 1,7013,9033 cycles ( +- 0.07% ) 0.104214970 seconds time elapsed ( +- 1.82% ) Performance counter stats for './phonebook_opt' (100 runs): 138,5229 cache-misses # 57.016 % of all cache refs ( +- 0.15% ) 242,9525 cache-references ( +- 0.09% ) 2,4394,8709 instructions # 2.16 insns per cycle ( +- 0.02% ) 1,1295,1297 cycles ( +- 0.06% ) 0.088586763 seconds time elapsed ( +- 1.07% ) ``` cache miss 從 88.507% 降至 57.016%，下降了 30% 左右 ## 改善查詢速度 **注：這裡的 entry 結構沿用上一節為縮減 size 後的結構** 我們都知道一般的 linked list 搜尋的時間複雜度為 $O(n)$，所以我試著加速查詢速度，首先使用 hash table 來取代單純的 linkedlist 結構，hash table 根據處理 collision 的方式分為兩種： 1. 使用 Linked list 把「Hashing 到同一個 slot」的資料串起來 2. 使用 Probing Method 來尋找 Table 中「空的 slot」存放資料我想要盡可能不浪費 hash table 的空間，所以我先用方法 2 來實作，而 probing method 先用比較簡單的方式，也就是 linear probing，首先在 `phonebook_opt.h` 定義 hash table 的資料結構，新增 hash function，以及修改 `findName()` 跟 `append()` 的輸入參數 ```c #define HASH_TABLE_SIZE 350000 typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; struct __PHONE_BOOK_INFO *pInfo; } entry; typedef struct { entry *cell[HASH_TABLE_SIZE]; } hashTable; unsigned int BKDRHash(char *str); entry *findName(char lastName[], hashTable *table); void append(char lastName[], hashTable *table); ``` 這裡 hash function 我是參考 [各種字串符 Hash 函數比較](https://www.byvoid.com/zhs/blog/string-hash-compare) 這篇，使用 BKDR 雜湊法，實際狀況可能需要考慮到雜湊出來的分布，hash table 的長度則是參考 `dictionary/words.txt` 的行數定出來的，在 entry 中我拿掉了 pNext，因為這個 hash table 並不需要 linked list，接下來在 `phonebook_opt.c` 中修改 `findName()` 跟 `append()` ```c unsigned int BKDRHash(char *str) { unsigned int seed = 131; unsigned int hash = 0; while (*str) { hash = hash * seed + (*str++); } return (hash & 0x7FFFFFFF); } entry *findName(char lastName[], hashTable *table) { unsigned int i = BKDRHash(lastName) % HASH_TABLE_SIZE; while (strcmp(table->cell[i]->lastName, lastName) != 0) { i = (i + 1) % HASH_TABLE_SIZE; if (!table->cell[i]) return NULL; } return table->cell[i]; } void append(char lastName[], hashTable *table) { unsigned int i = BKDRHash(lastName) % HASH_TABLE_SIZE; while (table->cell[i]) { i = (i + 1) % HASH_TABLE_SIZE; } entry *e = (entry *) malloc(sizeof(entry)); e->pInfo = (info *) malloc(sizeof(entry)); strcpy(e->lastName, lastName); table->cell[i] = e; } ``` 然後修改 `main.c` 裡面有關 `findName` 跟 `append` 的介面後，先直接執行看看： :::danger 請尊重台灣文化和傳統，將 interface 翻譯為繁體中文的「介面」一詞，而非「接口」 --jserv ::: ``` $ make $ ./phonebook_opt size of entry : 24 bytes execution time of append() : 0.843872 sec execution time of findName() : 0.001289 sec ``` 發現 `append()` 的執行速度慢了非常多，其原因在於原本的 `append()` 的時間複雜度為 $O(1)$ 而修改後時間複雜度變成了 $O(n/k)$ 了，在理論上沒有 collision 的狀況時間複雜度應該保持 $O(1)$，所以猜測裡面的 collision 應該滿嚴重的，不過還是先用 perf 來看看效能如何 ``` Performance counter stats for './phonebook_opt' (100 runs): 173,0000 cache-misses # 3.975 % of all cache refs ( +- 0.30% ) 4352,1870 cache-references ( +- 0.19% ) 32,1506,9274 instructions # 1.06 insns per cycle ( +- 0.00% ) 30,3073,3228 cycles ( +- 0.01% ) 0.901060342 seconds time elapsed ( +- 0.11% ) ``` 可以看到 cache miss 已經降至 3.975%，然而 instruction 跟 cycle 數量遠大於上一節的狀況(10 倍以上)，所以如果要繼續使用 hash table 勢必要想辦法解決 collision 的問題 ## 改善 `append()` 執行速度首先考慮到可能是 hash table 的長度過於接近資料庫的大小，所以參考 [wikipedia](https://zh.wikipedia.org/wiki/%E5%93%88%E5%B8%8C%E8%A1%A8) 以及 [stackoverflow](https://stackoverflow.com/questions/22741966/how-to-choose-size-of-hash-table)，裡面提到資料總數與 hash table 的長度比值應該要介於 0.7 與 0.8 之間，這個比值被稱為 load factor，所以這邊分別嘗試使用 $load factor = 0.7\ ,\ 0.75\ ,\ 0.8$ 來做比較 ![](https://i.imgur.com/WtRGjVl.png) >請調整圖中的數據排列 >[name=課程助教][color=red] Y 軸我使用 logscale 來表示，主要是為了讓 `findName()` 的數據間的差距表現的明顯一點，圖中是使用單次執行的時間，並且每次執行前會先使用 `$ echo 1 | sudo tee /proc/sys/vm/drop_caches` 來清除快取，因為有無清除快取影響執行時間滿多的(有快取的狀況下，執行速度大約快了 5 倍) 從上圖可以看出 load factor 如果介於 0.7 到 0.8 之間的執行時間都是差不多的，都在 0.3 秒左右，接著來比較它們的 cache miss ![](https://i.imgur.com/rrKg4Zl.png) >> 統一寫作 "cache miss"，後面不加 -ing >> [name="jserv"][color=red] 很明顯的 load factor 為 0.8 時在效能上是比較好的，其 cache miss 跟縮減過資料尺寸的純 linked list ( cache miss = 57.016% )相比，下降了 23% 左右再來，根據不同的資料庫以及 hash function，其 hash 出來的分布也會不一樣，我認為可以嘗試 load factor 大於 0.8 的狀況，所以將 load factor 調整為 0.9 看看其結果如何 ``` $ echo 1 | sudo tee /proc/sys/vm/drop_caches $ ./phonebook_opt size of entry : 24 bytes execution time of append() : 0.312205 sec execution time of findName() : 0.000001 sec ``` ``` $ make cache-test Performance counter stats for './phonebook_opt' (100 runs): 105,6438 cache-misses # 27.359 % of all cache refs ( +- 0.33% ) 386,1389 cache-references ( +- 0.05% ) 3,6838,3004 instructions # 1.55 insns per cycle ( +- 0.02% ) 2,3769,1406 cycles ( +- 0.10% ) 0.118767684 seconds time elapsed ( +- 2.17% ) ``` 跟 load factor 為 0.8 時相比，執行時間增加了 0.002sec 但是 cahce miss 減少了 7%，我認為以這個案例來說，load factor = 0.9還算可以接受 ## 試試另一種 hash table 前幾節使用了 open addressing 的 hash table，也就是利用 probing method 來尋找 hash table 中空的 slot，接著來嘗試 chaining 的 hash table，也就是使用 linked list 將 collision 的資料鏈結起來，被鏈結的資料不會佔用 hash table 的 slot，所以理論上 collision 的狀況會比較不嚴重，附上修改後的程式碼 ```c typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; struct __PHONE_BOOK_INFO *pInfo; struct __PHONE_BOOK_ENTRY *pNext; } entry; ``` ```c entry *findName(char lastName[], hashTable *table) { unsigned int i = BKDRHash(lastName) % HASH_TABLE_SIZE; entry *head = table->cell[i]; while (strcmp(head->lastName, lastName) != 0) { head = head->pNext; if (head == NULL) break; } return head; } void append(char lastName[], hashTable *table) { unsigned int i = BKDRHash(lastName) % HASH_TABLE_SIZE; entry *e = (entry *) malloc(sizeof(entry)); e->pInfo = (info *) malloc(sizeof(entry)); e->pNext = NULL; strcpy(e->lastName, lastName); if (table->cell[i] == NULL) { table->cell[i] = e; return; } entry *head = table->cell[i]; while (head->pNext != NULL) { head = head->pNext; } head->pNext = e; } ``` 接著直接執行看看結果如何，這裡我先設定 load factor = 0.8 ``` $ echo 1 | sudo tee /proc/sys/vm/drop_caches $ ./phonebook_chaining size of entry : 32 bytes execution time of append() : 0.302260 sec execution time of findName() : 0.000000 sec ``` 可以看到 `append()` 的速度比 open addressing 的 hash table 快了約 0.007sec，接著來看看 cache miss 狀況如何 ``` Performance counter stats for './phonebook_chaining' (100 runs): 210,3447 cache-misses # 55.627 % of all cache refs ( +- 0.09% ) 378,1313 cache-references ( +- 0.16% ) 3,1909,8415 instructions # 1.63 insns per cycle ( +- 0.03% ) 1,9627,3235 cycles ( +- 0.15% ) 0.099198112 seconds time elapse ( +- 2.84% ) ``` 發現到 cache miss 反而提升了，其原因是當 CPU 向 cache 存取資料時，如果 cache 中不存在該資料會向 main memory 複製一份包含該資料的一段 cache line size 大小的資料到 cache line 上，因此在該資料周遭的其他資料也會一起被快取住，而 chaining 的 hash table 雖然使 slot 上的資料較不會發生 collision 但是也使 slot 上的資料分布變得比較不連續，資料之間可能會被空 slot 給隔開，所以在 cache line 被快取住的有效資料就相對比較少，同時 linked list 結構的記憶體本身也並非連續分布，因此造成 cache miss 的機率上升 > 「cache line 會快取住一段連續的記憶體」的陳述不精確，請翻閱計算機組織與結構的書籍 > [name="jserv"][color=red] > 已修正陳述內容 > [name="st9007a"] ## Hash Function 碰撞分析首先先對 dataset 中每筆資料 hash 後的結果做統計，x 軸代表資料 hash 出來後的數字，y軸代表資料經過 hash 後的結果出現的次數，例如：(15000, 3) 可以解讀為整個 dataset 中有 3 筆資料 hash 後的結果為 15000，這裡的 hash table 的 load factor 設定為 8： ![](https://i.imgur.com/y35ycOT.png) 會發現由於資料範圍太大，所以 1 到 4 次的頻率看起來差不多，所以接上圖轉換成表格來看 | hash結果重複次數 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | | ------ | - | - | - | - | - | - | - | | 資料數量 | 156580 | 126182 | 50397 | 13436 | 2740 | 516 | 49 | 上面表格是統計每筆資料 hash 後與其他資料重複的次數，可以看到只有大約 44% 的資料沒有重複，也就是說理想狀況下有 $1-(156580 +\frac{126182}{2}+\frac{50397}{3}+\frac{13436}{4}+\frac{2740}{5}+\frac{526}{6}+\frac{49}{7}) / 349900 = 31.275\%$ 的資料會發生碰撞，但是考慮到 linear probing 會把空的 slot 給佔走，經過實際測試總共有 40.742% 的資料發生碰撞 ## 使用接近真實情況的 dataset 由於原本的 dataset 中的英文姓氏為隨機的字母排列，因此不夠具有代表性，因此在 dataset 換成 `dictionary/all-names.txt`，並且對 open addressing hash table 做 load factor 對 cache miss 的重新比較 | load factor | 0.8 | 0.75 | 0.7 | | ----------- | --- | ---- | --- | | cache miss | 12.561% | 12.903% | 12.610% | 重新比較後發現到，cache miss 在 0.8, 0.75, 0.7 之間非常相近，其原因在於這次使用的 dataset 資料量較小 ( 共 16750 筆 )，接著是 load factor = 0.8 以及 0.7 時的 cache miss 非常接近，反而 load factor = 0.75 時的 cahce miss 最高 ## 問題討論 ### 本例使用的 dataset 有代表性嗎？跟真實英文姓氏差異大嗎？我們可以從 `main.c` 中很容易的知道，程式使用的 dataset 來自於 `dictionary/words.txt`，而將其打開來檢查就可以發現，裡面的 word 應該是按字母順序隨機產生的，因此它並不具有代表性( 與真實狀況相去甚遠 )，跟英文姓氏差異極大 ### 資料難道數大就是美嗎？根據上面的實驗以及 cache 的原理來看的話，理論上資料量越大，cache miss 發生的機率越高，同樣的，資料儲存的內容越多，cache miss 發生機率也會提高，因此在處理龐大的數據時，必須慎選資料結構 ### 如果我們要建立符合真實電話簿情境的輸入，該怎麼作？使用符合真實情況的 dataset，接著除了查詢功能外可以加入新增與刪除功能，如果繼續選用 hash table 的話，可以適時的 resize 它的大小，保持 load factor 介於 0.7 到 0.8 之間 ## 參考資料 [關於 CPU cache -- 程序猿需要知道的那些事](http://cenalulu.github.io/linux/all-about-cpu-cache/) [各種字串符 Hash 函數比較](https://www.byvoid.com/zhs/blog/string-hash-compare) [Hash Table：Intro (簡介)](http://alrightchiu.github.io/SecondRound/hash-tableintrojian-jie.html) [wikipedia -- hash table](https://zh.wikipedia.org/wiki/%E5%93%88%E5%B8%8C%E8%A1%A8) [stackoverflow -- How to choose size of hash table?](https://stackoverflow.com/questions/22741966/how-to-choose-size-of-hash-table) [stackoverflow -- How do cache lines work?](https://stackoverflow.com/questions/3928995/how-do-cache-lines-work) [知乎 -- 請教CPU的cache中關於line,block,index等的理解?](https://www.zhihu.com/question/24612442)