# 2017q3 Homework1 (phonebook) contributed by <`vonchuang`> ### Reviewed by `ZixinYang` - 本篇用了幾個不同的 hash function 並比較其效能, 建議作者再多闡述不同 hash function 的特性, 及分析造成效能高低的原因。 - 建議不同 hash 的程式名稱不要用 hash1, hash2, 而使用 hash function 的名稱。 :::info 1. 已分析 hash function 實作效益 2. 缺乏 data set 更換和擬定分析策略 ::: - [ ] ptmalloc、malloc 記憶體配置的連續性 ## 開發環境 $ cat /etc/issue Ubuntu 16.04.3 LTS $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 2 Core(s) per socket: 2 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 37 Model name: Intel(R) Core(TM) i3 CPU M 350 @ 2.27GHz Stepping: 2 CPU MHz: 1866.000 CPU max MHz: 2266.0000 CPU min MHz: 933.0000 BogoMIPS: 4522.32 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 3072K NUMA node0 CPU(s): 0-3 ## Cache - [ ] **TODO:cache miss,caching** - [ ] **TODO:分析結果** - [ ] [Why is it faster to process a sorted array than an unsorted array? ](https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array) * 早期計算機系統的 memory 只有三層:CPU registers,main DRAM memory,和 dist memory。但由於 CPU 和 main memory 之間的速度差距逐漸擴大,系統設計者被迫在 CPU register file 和 main memory 間插入一個小的 SRAM cache memory(稱為L1 cache)。 ![Processor v.s. Memory Performance](https://www.extremetech.com/wp-content/uploads/2014/08/CPU-DRAM.png) > The L1 cache can be accessed nearly as fast as the registers, typically in 2 to 4 clock cycles. * 隨著 CPU 和 main memory 之間的性能不斷增大,系統設計者在 L1 cache 和 main memory 間插入一個更大的 cache,稱為 L2 cache。 > L2 cache can be accessed in about 10 clock cycles. ![](https://i.imgur.com/ikPJz9z.png) *圖片來源:[Computer Systems: A Programmer's Perspective](http://csapp.cs.cmu.edu/)* * 有些現代系統還包括有一個更大的 cache,稱為 L3 cache。 > L3 cache sits between the L2 cache and main memory in the memory hierarchy and can be accessed in love with0 or 40 cycles. ![](https://i.imgur.com/iFlL1SS.png) *圖片來源:[Computer Systems: A Programmer's Perspective](http://csapp.cs.cmu.edu/)* ### Real Cache Hierarchy * 事實上,cache 既保存數據(data),也保存指令(instruction)。只保存指令的 cache 稱為 i-cache;只保存數據的 cache 稱為 d-cache;既保存指令,又包括數據的,稱為 unidied cache。 * 現代的處理器包括獨立的 i-cache 和 d-cache。如此,處理器能同時讀取一個指令和依組數據。 * i-cache 通常只讀取,因此比較簡單。 > The two cache are often optimized to different access patterns and can have different block sizes,associativites,and capacities.Also,having seperate caches ensures that data accessed do not create conflict misses with instruction accesses,and vice versa,at the cost of a potential increase in capacity misses. * 舉 Intel Core i7 為例,以下為 Intel Core i7 的 Cache 結構、特性: ![](https://i.imgur.com/SYEILw0.png) ![](https://i.imgur.com/j7v1RpW.png) *圖片來源:[Computer Systems: A Programmer's Perspective](http://csapp.cs.cmu.edu/)* ### 參考資料 * [cache 原理和實驗](https://hackmd.io/s/S14L26-_l) * [Computer Systems: A Programmer's Perspective](http://csapp.cs.cmu.edu/) ## 案例分析:Phone Book優化前 * `$make run` size of entry : 136 bytes execution time of append() : 0.086821 sec execution time of findName() : 0.012002 sec * perf `$perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./phonebook_orig` Performance counter stats for './phonebook_orig' (10 runs): 946,438 cache-misses # 86.699 % of all cache refs ( +- 1.01% ) (66.53%) 1,091,635 cache-references ( +- 0.99% ) (67.35%) 2,452,718 L1-dcache-load-misses ( +- 0.71% ) (67.43%) 896,913 L1-dcache-store-misses ( +- 0.45% ) (67.48%) 1,839,844 L1-dcache-prefetch-misses ( +- 2.38% ) (67.10%) 576,013 L1-icache-load-misses ( +- 9.58% ) (66.08%) 0.128343367 seconds time elapsed ( +- 3.62% ) `$ sudo perf record -F 12500 -e cache-misses ./phonebook_orig ` size of entry : 136 bytes execution time of append() : 0.084899 sec execution time of findName() : 0.012862 sec [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.068 MB perf.data (1488 samples) ] `$sudo perf report` Overhead Command Shared Object Symbol 53.37% phonebook_orig [kernel.kallsyms] [k] clear_page 11.23% phonebook_orig phonebook_orig [.] findLastName 6.53% phonebook_orig libc-2.23.so [.] __strcasecmp_l_sse42 5.33% phonebook_orig [kernel.kallsyms] [k] handle_mm_fault 4.77% phonebook_orig [kernel.kallsyms] [k] pmd_page_vaddr 2.24% phonebook_orig [kernel.kallsyms] [k] unmap_page_range 2.21% phonebook_orig [kernel.kallsyms] [k] __rmqueue 1.46% phonebook_orig [kernel.kallsyms] [k] copy_user_generic_string 1.29% phonebook_orig [kernel.kallsyms] [k] get_page_from_freelist 1.26% phonebook_orig [kernel.kallsyms] [k] try_charge 1.22% phonebook_orig [kernel.kallsyms] [k] __alloc_pages_nodemask 1.14% phonebook_orig [kernel.kallsyms] [k] free_pcppages_bulk 1.06% phonebook_orig [kernel.kallsyms] [k] get_mem_cgroup_from_mm 0.76% phonebook_orig [kernel.kallsyms] [k] mem_cgroup_try_charge 0.62% phonebook_orig [kernel.kallsyms] [k] alloc_pages_vma * gunplot ![](https://i.imgur.com/CoxuRG2.png) ### 參考資料 ## 優化:改善不直觀的程式碼 ### findName * 在 `phonebook_opt.c`中,有一 function 叫 findname。 entry *findName(char lastName[], entry *pHead) * 但其是由輸入 lastname 去找,故其原有名稱並不精確,在此改為 findLastName。 entry *findLastName(char lastName[], entry *pHead) ## 優化:例外處理(Exception handling) ### malloc * 在`phonebook_orig.c`中的 function append(),其並沒有考慮到 malloc 函式呼叫失敗的例外處理。 ```c= entry *append(char lastName[], entry *e) { /* allocate memory for the new entry and put lastName */ e->pNext = (entry *) malloc(sizeof(entry)); e = e->pNext; strcpy(e->lastName, lastName); e->pNext = NULL; return e; } ``` * 在此以 assert 實作 malloc 的 expection handling。 ```c= entry *append(char lastName[], entry *e) { /* allocate memory for the new entry and put lastName */ e->pNext = (entry *) malloc(sizeof(entry)); assert(e->pNext && "malloc for e->Next error."); e = e->pNext; strcpy(e->lastName, lastName); e->pNext = NULL; return e; } ``` ## 優化:縮小 struct * 此處所用開發環境之L1 cache 為 32K bits,而`phonebook_orig`中的 struct __PHONE_BOOK_ENTRY 為136 bytes。 $32\cdot1024/(136\cdot8)=30.12$ ```c= typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; struct __PHONE_BOOK_ENTRY *pNext; } entry; ``` * 故在執行 findLastName 時,L1 cache 最多只能放 30 個 entry。而在`word.txt`中,有近35萬筆資料,在執行中是必會發生 cache miss,降低程式效率。 * 然查本案件情形,其目的只是要找 last name,與其他資料無關。故可將 last name 從 __PHONE_BOOK_ENTRY 獨立出來,以縮小 struct 大小,進而降低 cache miss。 * 優化後結果: $perf record -F 12500 -e cache-misses ./phonebook_opt size of entry : 32 bytes execution time of append() : 0.071934 sec execution time of findName() : 0.005095 sec [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.050 MB perf.data (1013 samples) ] $perf report Overhead Command Shared Object Symbol 54.36% phonebook_opt [kernel.kallsyms] [k] clear_page 7.26% phonebook_opt libc-2.23.so [.] __strcasecmp_l_sse42 6.96% phonebook_opt [kernel.kallsyms] [k] handle_mm_fault 3.65% phonebook_opt [kernel.kallsyms] [k] copy_user_generic_string 3.24% phonebook_opt [kernel.kallsyms] [k] __rmqueue 2.91% phonebook_opt [kernel.kallsyms] [k] pmd_page_vaddr 2.87% phonebook_opt phonebook_opt [.] findLastName $perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./phonebook_opt Performance counter stats for './phonebook_opt' (10 runs): 356,696 cache-misses # 92.695 % of all cache refs ( +- 1.60% ) (62.61%) 384,808 cache-references ( +- 2.58% ) (64.87%) 1,397,588 L1-dcache-load-misses ( +- 2.64% ) (68.68%) 274,496 L1-dcache-store-misses ( +- 1.42% ) (71.04%) 449,449 L1-dcache-prefetch-misses ( +- 22.10% ) (70.08%) 331,442 L1-icache-load-misses ( +- 4.48% ) (65.40%) 0.085531137 seconds time elapsed ( +- 1.00% ) findLastName 的 cache miss overhead 從原本的 11.23% 降到 2.87%。 * gunplot ![](https://i.imgur.com/hWYxSTv.png) append() 執行時間降低約 18%。 findLastName() 執行時間降低約 60%。 ## 優化:使用 Hash function 在`phonebook_orig.c`中,`findLastName`的時間複雜度為 $O(N)$。在此嘗試沿用之前的 struct,並建立 Hash Table 及 Hash function,以降低尋找時間。 ### Hash Table ### DJB Hash * 先將 hash table size 設為1024(參考[ryanwang522](https://hackmd.io/s/Sk-qaWiYl)) * DJB Hash Function ```c= unsigned int BJDHash( char *str){ unsigned int hash = 5381; while (*str) hash = ((hash << 5) + hash) + (*str++); return (hash % MAX_HASH_TABLE_SIZE); } ``` * perf $perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./phonebook_hash Performance counter stats for './phonebook_hash' (10 runs): 327,504 cache-misses # 80.183 % of all cache refs ( +- 1.56% ) (64.15%) 408,445 cache-references ( +- 3.94% ) (65.74%) 640,421 L1-dcache-load-misses ( +- 3.13% ) (68.02%) 316,427 L1-dcache-store-misses ( +- 1.07% ) (70.02%) 52,671 L1-dcache-prefetch-misses ( +- 6.83% ) (69.38%) 426,756 L1-icache-load-misses ( +- 6.84% ) (65.78%) 0.089159747 seconds time elapsed ( +- 1.77% ) * gunplot ![](https://i.imgur.com/ux4VBlG.png) * 由此可以看到,因為有了 hash table,使得尋找時間得以降低,但建立 table 也使 append()的執行時間增加。 * 而若依據 [參考](https://stackoverflow.com/questions/22741966/how-to-choose-size-of-hash-table),將 load factor 定在0.75。現有資料 349,900 筆,故將 Hash Table size 設為 $349,900/0.75=466,533$。以 perf 測得數值為: Performance counter stats for './phonebook_hash' (10 runs): 941,119 cache-misses # 67.372 % of all cache refs ( +- 0.50% ) (65.94%) 1,396,901 cache-references ( +- 2.25% ) (66.21%) 1,963,602 L1-dcache-load-misses ( +- 1.17% ) (66.90%) 525,652 L1-dcache-store-misses ( +- 0.74% ) (67.66%) 105,146 L1-dcache-prefetch-misses ( +- 10.88% ) (67.74%) 1,708,760 L1-icache-load-misses ( +- 2.31% ) (66.72%) 0.261410686 seconds time elapsed ( +- 1.31% ) 整體執行時間及 cache miss 反而是增加的 - [ ] 待研究 * 參考資料:[Hash Functions](http://www.cse.yorku.ca/~oz/hash.html) ### BKDR Hash * 沿用上述 hash table size = 466553 ```c= unsigned int BKDRHash( char *str){ unsigned int hash = 0; unsigned int seed = 131; //31 131 1313 13131 etc... while (*str) hash = (hash*seed) + (*str++); return (hash % MAX_HASH_TABLE_SIZE); } ``` * gunplot ![](https://i.imgur.com/UZfKQdp.png) * 參考資料:[各种字符串Hash函数比较](https://www.byvoid.com/zhs/blog/string-hash-compare) ### One-at-a Time Hash ```c= unsigned int OneAtATimeHash( char *str){ unsigned int hash = 0,i = 0; while (*str){ hash += str[i]; hash += (hash << 10); hash ^= (hash >> 6); ++i; *str++; } hash += (hash << 3); hash ^= (hash >> 11); hash += (hash << 15); return (hash % MAX_HASH_TABLE_SIZE); } ``` * 參考資料:[ A hash function for hash Table lookup](http://www.burtleburtle.net/bob/hash/doobs.html) ### Hash 比較 ![](https://i.imgur.com/XOhBe5M.png) | | DJB | BKDR | One-at-a-Time | |--|-----|------|---------------| |spend time|0.104402s|0.123361s|0.129983s| |append() time|0.114534s|0.133128s|0.137205s |findName() time|0.000000s|0.000000s|0.000000s| |cache miss|1,041,945|671,454|624,490| |cache reference|1,788,171|1,004,645|958,847| |cache miss/cache reference|58.269%|66.835%|65.129%| ## 本例選用的 dataset (也就是輸入電話簿的姓名) 是否存在問題? ### 有代表性嗎?跟真實英文姓氏差異大嗎? 在輸入的檔案`words.txt`中,有許多明顯非姓名、甚至是無意義的字,並且比例不低,故此份 dataset 並不具代表性,與真實英文姓氏有一定程度的差異。 ### 資料難道「數大就是美」嗎? 不一定,當資料量多時,相比於少量資料,使用適當與不適當尋找方法的時間差會變大,也更容易發生 cache miss,故在實作時,需要花更多時間及精力在優化上。 ### 如果我們要建立符合真實電話簿情境的輸入,該怎麼作? * 使用適當的 dataset * 加入新增和刪除資料的功能 ### 參考資料 * [探討演算法和計算機結構對程式的影響:以實作電話簿為例](https://hackmd.io/s/SklSEW6G-) * [Programming Small](https://hackmd.io/s/SkfffA-jZ) * [perf 原理和實務](https://hackmd.io/s/B11109rdg) * [共筆:nekoneko](https://hackmd.io/s/rJCs_ZVa) * [共筆:ryanwang522](https://hackmd.io/s/Sk-qaWiYl) * [共筆:st9007a](https://hackmd.io/s/BJvNXiEib) * [共筆:陳品睿同學](https://paper.dropbox.com/doc/2016q1-Homework1-JRcf4eP183WKmeISyqlkp)