Try   HackMD

2017q3 Homework1 (phonebook)

contributed by <vonchuang>

Reviewed by ZixinYang

  • 本篇用了幾個不同的 hash function 並比較其效能, 建議作者再多闡述不同 hash function 的特性, 及分析造成效能高低的原因。
  • 建議不同 hash 的程式名稱不要用 hash1, hash2, 而使用 hash function 的名稱。
  1. 已分析 hash function 實作效益
  2. 缺乏 data set 更換和擬定分析策略
  • ptmalloc、malloc 記憶體配置的連續性

開發環境

​​​​$ cat /etc/issue
​​​​Ubuntu 16.04.3 LTS
​​​​
​​​​$ lscpu
​​​​Architecture:          x86_64
​​​​CPU op-mode(s):        32-bit, 64-bit
​​​​Byte Order:            Little Endian
​​​​CPU(s):                4
​​​​On-line CPU(s) list:   0-3
​​​​Thread(s) per core:    2
​​​​Core(s) per socket:    2
​​​​Socket(s):             1
​​​​NUMA node(s):          1
​​​​Vendor ID:             GenuineIntel
​​​​CPU family:            6
​​​​Model:                 37
​​​​Model name:            Intel(R) Core(TM) i3 CPU       M 350  @ 2.27GHz
​​​​Stepping:              2
​​​​CPU MHz:               1866.000
​​​​CPU max MHz:           2266.0000
​​​​CPU min MHz:           933.0000
​​​​BogoMIPS:              4522.32
​​​​Virtualization:        VT-x
​​​​L1d cache:             32K
​​​​L1i cache:             32K
​​​​L2 cache:              256K
​​​​L3 cache:              3072K
​​​​NUMA node0 CPU(s):     0-3

Cache

  • 早期計算機系統的 memory 只有三層:CPU registers,main DRAM memory,和 dist memory。但由於 CPU 和 main memory 之間的速度差距逐漸擴大,系統設計者被迫在 CPU register file 和 main memory 間插入一個小的 SRAM cache memory(稱為L1 cache)。

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

The L1 cache can be accessed nearly as fast as the registers, typically in 2 to 4 clock cycles.

  • 隨著 CPU 和 main memory 之間的性能不斷增大,系統設計者在 L1 cache 和 main memory 間插入一個更大的 cache,稱為 L2 cache。

L2 cache can be accessed in about 10 clock cycles.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

圖片來源:Computer Systems: A Programmer's Perspective

  • 有些現代系統還包括有一個更大的 cache,稱為 L3 cache。

L3 cache sits between the L2 cache and main memory in the memory hierarchy and can be accessed in love with0 or 40 cycles.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

圖片來源:Computer Systems: A Programmer's Perspective

Real Cache Hierarchy

  • 事實上,cache 既保存數據(data),也保存指令(instruction)。只保存指令的 cache 稱為 i-cache;只保存數據的 cache 稱為 d-cache;既保存指令,又包括數據的,稱為 unidied cache。
  • 現代的處理器包括獨立的 i-cache 和 d-cache。如此,處理器能同時讀取一個指令和依組數據。
  • i-cache 通常只讀取,因此比較簡單。

The two cache are often optimized to different access patterns and can have different block sizes,associativites,and capacities.Also,having seperate caches ensures that data accessed do not create conflict misses with instruction accesses,and vice versa,at the cost of a potential increase in capacity misses.

  • 舉 Intel Core i7 為例,以下為 Intel Core i7 的 Cache 結構、特性:

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

圖片來源:Computer Systems: A Programmer's Perspective

參考資料

案例分析:Phone Book優化前

  • $make run

    ​​​​​​  size of entry : 136 bytes
    ​​​​​​  execution time of append() : 0.086821 sec
    ​​​​​​  execution time of findName() : 0.012002 sec
    
  • perf
    $perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./phonebook_orig

    ​​​​​​   Performance counter stats for './phonebook_orig' (10 runs):
    
    ​​​​​​     946,438      cache-misses              #   86.699 % of all cache refs      ( +-  1.01% )  (66.53%)
    ​​​​​​   1,091,635      cache-references                                              ( +-  0.99% )  (67.35%)
    ​​​​​​   2,452,718      L1-dcache-load-misses                                         ( +-  0.71% )  (67.43%)
    ​​​​​​     896,913      L1-dcache-store-misses                                        ( +-  0.45% )  (67.48%)
    ​​​​​​   1,839,844      L1-dcache-prefetch-misses                                     ( +-  2.38% )  (67.10%)
    ​​​​​​     576,013      L1-icache-load-misses                                         ( +-  9.58% )  (66.08%)
    
    ​​​​​​ 0.128343367 seconds time elapsed                                          ( +-  3.62% )
    

$ sudo perf record -F 12500 -e cache-misses ./phonebook_orig

​​​​size of entry : 136 bytes
​​​​execution time of append() : 0.084899 sec
​​​​execution time of findName() : 0.012862 sec
​​​​[ perf record: Woken up 1 times to write data ]
​​​​[ perf record: Captured and wrote 0.068 MB perf.data (1488 samples) ]

$sudo perf report

​​​​Overhead  Command         Shared Object      Symbol                         
​​​​  53.37%  phonebook_orig  [kernel.kallsyms]  [k] clear_page                 
​​​​  11.23%  phonebook_orig  phonebook_orig     [.] findLastName               
​​​​   6.53%  phonebook_orig  libc-2.23.so       [.] __strcasecmp_l_sse42       
​​​​   5.33%  phonebook_orig  [kernel.kallsyms]  [k] handle_mm_fault            
​​​​   4.77%  phonebook_orig  [kernel.kallsyms]  [k] pmd_page_vaddr             
​​​​   2.24%  phonebook_orig  [kernel.kallsyms]  [k] unmap_page_range           
​​​​   2.21%  phonebook_orig  [kernel.kallsyms]  [k] __rmqueue                  
​​​​   1.46%  phonebook_orig  [kernel.kallsyms]  [k] copy_user_generic_string   
​​​​   1.29%  phonebook_orig  [kernel.kallsyms]  [k] get_page_from_freelist     
​​​​   1.26%  phonebook_orig  [kernel.kallsyms]  [k] try_charge                 
​​​​   1.22%  phonebook_orig  [kernel.kallsyms]  [k] __alloc_pages_nodemask     
​​​​   1.14%  phonebook_orig  [kernel.kallsyms]  [k] free_pcppages_bulk         
​​​​   1.06%  phonebook_orig  [kernel.kallsyms]  [k] get_mem_cgroup_from_mm     
​​​​   0.76%  phonebook_orig  [kernel.kallsyms]  [k] mem_cgroup_try_charge      
​​​​   0.62%  phonebook_orig  [kernel.kallsyms]  [k] alloc_pages_vma
  • gunplot
    Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More →

參考資料

優化:改善不直觀的程式碼

findName

  • phonebook_opt.c中,有一 function 叫 findname。

    ​​​​​​  entry *findName(char lastName[], entry *pHead)
    
  • 但其是由輸入 lastname 去找,故其原有名稱並不精確,在此改為 findLastName。

    ​​​​​​  entry *findLastName(char lastName[], entry *pHead)
    

優化:例外處理(Exception handling)

malloc

  • phonebook_orig.c中的 function append(),其並沒有考慮到 malloc 函式呼叫失敗的例外處理。
entry *append(char lastName[], entry *e) { /* allocate memory for the new entry and put lastName */ e->pNext = (entry *) malloc(sizeof(entry)); e = e->pNext; strcpy(e->lastName, lastName); e->pNext = NULL; return e; }
  • 在此以 assert 實作 malloc 的 expection handling。
entry *append(char lastName[], entry *e) { /* allocate memory for the new entry and put lastName */ e->pNext = (entry *) malloc(sizeof(entry)); assert(e->pNext && "malloc for e->Next error."); e = e->pNext; strcpy(e->lastName, lastName); e->pNext = NULL; return e; }

優化:縮小 struct

  • 此處所用開發環境之L1 cache 為 32K bits,而phonebook_orig中的 struct __PHONE_BOOK_ENTRY 為136 bytes。
    321024/1368=30.12
typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; struct __PHONE_BOOK_ENTRY *pNext; } entry;
  • 故在執行 findLastName 時,L1 cache 最多只能放 30 個 entry。而在word.txt中,有近35萬筆資料,在執行中是必會發生 cache miss,降低程式效率。

  • 然查本案件情形,其目的只是要找 last name,與其他資料無關。故可將 last name 從 __PHONE_BOOK_ENTRY 獨立出來,以縮小 struct 大小,進而降低 cache miss。

  • 優化後結果:

    ​​​​​​  $perf record -F 12500 -e cache-misses ./phonebook_opt
    ​​​​​​  
    ​​​​​​  size of entry : 32 bytes
    ​​​​​​  execution time of append() : 0.071934 sec
    ​​​​​​  execution time of findName() : 0.005095 sec
    ​​​​​​  [ perf record: Woken up 1 times to write data ]
    ​​​​​​  [ perf record: Captured and wrote 0.050 MB perf.data (1013 samples) ]
    ​​​​​​  
    ​​​​​​  $perf report
    ​​​​​​  
    ​​​​​​  Overhead  Command        Shared Object      Symbol                                                                    
    ​​​​​​    54.36%  phonebook_opt  [kernel.kallsyms]  [k] clear_page                                                            
    ​​​​​​     7.26%  phonebook_opt  libc-2.23.so       [.] __strcasecmp_l_sse42                                                  
    ​​​​​​     6.96%  phonebook_opt  [kernel.kallsyms]  [k] handle_mm_fault                                                       
    ​​​​​​     3.65%  phonebook_opt  [kernel.kallsyms]  [k] copy_user_generic_string                                              
    ​​​​​​     3.24%  phonebook_opt  [kernel.kallsyms]  [k] __rmqueue                                                             
    ​​​​​​     2.91%  phonebook_opt  [kernel.kallsyms]  [k] pmd_page_vaddr                                                        
    ​​​​​​     2.87%  phonebook_opt  phonebook_opt      [.] findLastName 
    
    ​​​​​​  $perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./phonebook_opt
    ​​​​​​  
    ​​​​​​   Performance counter stats for './phonebook_opt' (10 runs):
    
    ​​​​​​     356,696      cache-misses              #   92.695 % of all cache refs      ( +-  1.60% )  (62.61%)
    ​​​​​​     384,808      cache-references                                              ( +-  2.58% )  (64.87%)
    ​​​​​​   1,397,588      L1-dcache-load-misses                                         ( +-  2.64% )  (68.68%)
    ​​​​​​     274,496      L1-dcache-store-misses                                        ( +-  1.42% )  (71.04%)
    ​​​​​​     449,449      L1-dcache-prefetch-misses                                     ( +- 22.10% )  (70.08%)
    ​​​​​​     331,442      L1-icache-load-misses                                         ( +-  4.48% )  (65.40%)
    
    ​​​​​​ 0.085531137 seconds time elapsed                                          ( +-  1.00% )
    

findLastName 的 cache miss overhead 從原本的 11.23% 降到 2.87%。

  • gunplot

append() 執行時間降低約 18%。
findLastName() 執行時間降低約 60%。

優化:使用 Hash function

phonebook_orig.c中,findLastName的時間複雜度為

O(N)。在此嘗試沿用之前的 struct,並建立 Hash Table 及 Hash function,以降低尋找時間。

Hash Table

DJB Hash

  • 先將 hash table size 設為1024(參考ryanwang522

  • DJB Hash Function

unsigned int BJDHash( char *str){ unsigned int hash = 5381; while (*str) hash = ((hash << 5) + hash) + (*str++); return (hash % MAX_HASH_TABLE_SIZE); }
  • perf

    ​​​​​​  $perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./phonebook_hash
    
    ​​​​​​   Performance counter stats for './phonebook_hash' (10 runs):
    
    ​​​​​​     327,504      cache-misses              #   80.183 % of all cache refs      ( +-  1.56% )  (64.15%)
    ​​​​​​     408,445      cache-references                                              ( +-  3.94% )  (65.74%)
    ​​​​​​     640,421      L1-dcache-load-misses                                         ( +-  3.13% )  (68.02%)
    ​​​​​​     316,427      L1-dcache-store-misses                                        ( +-  1.07% )  (70.02%)
    ​​​​​​      52,671      L1-dcache-prefetch-misses                                     ( +-  6.83% )  (69.38%)
    ​​​​​​     426,756      L1-icache-load-misses                                         ( +-  6.84% )  (65.78%)
    
    ​​​​​​ 0.089159747 seconds time elapsed                                          ( +-  1.77% )
    
  • gunplot

  • 由此可以看到,因為有了 hash table,使得尋找時間得以降低,但建立 table 也使 append()的執行時間增加。

  • 而若依據 參考,將 load factor 定在0.75。現有資料 349,900 筆,故將 Hash Table size 設為

    349,900/0.75=466,533。以 perf 測得數值為:

    ​​​​​​   Performance counter stats for './phonebook_hash' (10 runs):
    
    ​​​​​​     941,119      cache-misses              #   67.372 % of all cache refs      ( +-  0.50% )  (65.94%)
    ​​​​​​   1,396,901      cache-references                                              ( +-  2.25% )  (66.21%)
    ​​​​​​   1,963,602      L1-dcache-load-misses                                         ( +-  1.17% )  (66.90%)
    ​​​​​​     525,652      L1-dcache-store-misses                                        ( +-  0.74% )  (67.66%)
    ​​​​​​     105,146      L1-dcache-prefetch-misses                                     ( +- 10.88% )  (67.74%)
    ​​​​​​   1,708,760      L1-icache-load-misses                                         ( +-  2.31% )  (66.72%)
    
    ​​​​​​ 0.261410686 seconds time elapsed                                          ( +-  1.31% )
    

整體執行時間及 cache miss 反而是增加的

  • 待研究

BKDR Hash

  • 沿用上述 hash table size = 466553
unsigned int BKDRHash( char *str){ unsigned int hash = 0; unsigned int seed = 131; //31 131 1313 13131 etc... while (*str) hash = (hash*seed) + (*str++); return (hash % MAX_HASH_TABLE_SIZE); }

One-at-a Time Hash

unsigned int OneAtATimeHash( char *str){ unsigned int hash = 0,i = 0; while (*str){ hash += str[i]; hash += (hash << 10); hash ^= (hash >> 6); ++i; *str++; } hash += (hash << 3); hash ^= (hash >> 11); hash += (hash << 15); return (hash % MAX_HASH_TABLE_SIZE); }

Hash 比較

DJB BKDR One-at-a-Time
spend time 0.104402s 0.123361s 0.129983s
append() time 0.114534s 0.133128s 0.137205s
findName() time 0.000000s 0.000000s 0.000000s
cache miss 1,041,945 671,454 624,490
cache reference 1,788,171 1,004,645 958,847
cache miss/cache reference 58.269% 66.835% 65.129%

本例選用的 dataset (也就是輸入電話簿的姓名) 是否存在問題?

有代表性嗎?跟真實英文姓氏差異大嗎?

在輸入的檔案words.txt中,有許多明顯非姓名、甚至是無意義的字,並且比例不低,故此份 dataset 並不具代表性,與真實英文姓氏有一定程度的差異。

資料難道「數大就是美」嗎?

不一定,當資料量多時,相比於少量資料,使用適當與不適當尋找方法的時間差會變大,也更容易發生 cache miss,故在實作時,需要花更多時間及精力在優化上。

如果我們要建立符合真實電話簿情境的輸入,該怎麼作?

  • 使用適當的 dataset
  • 加入新增和刪除資料的功能

參考資料