contributed by <vonchuang>, ZixinYang
$ cat /etc/issue
Ubuntu 16.04.3 LTS
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 37
Model name: Intel(R) Core(TM) i3 CPU M 350 @ 2.27GHz
Stepping: 2
CPU MHz: 1866.000
CPU max MHz: 2266.0000
CPU min MHz: 933.0000
BogoMIPS: 4522.32
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
The L1 cache can be accessed nearly as fast as the registers, typically in 2 to 4 clock cycles.
L2 cache can be accessed in about 10 clock cycles.
The L3 cache sits between the L2 cache and main memory in the memory hierarchy and can be accessed in 30 or 40 cycles.
The two caches are often optimized for different access patterns and can have different block sizes, associativities, and capacities. Also, having separate caches ensures that data accesses do not create conflict misses with instruction accesses, and vice versa, at the cost of a potential increase in capacity misses.
$make run
size of entry : 136 bytes
execution time of append() : 0.086821 sec
execution time of findName() : 0.012002 sec
perf
$perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./phonebook_orig
Performance counter stats for './phonebook_orig' (10 runs):
946,438 cache-misses # 86.699 % of all cache refs ( +- 1.01% ) (66.53%)
1,091,635 cache-references ( +- 0.99% ) (67.35%)
2,452,718 L1-dcache-load-misses ( +- 0.71% ) (67.43%)
896,913 L1-dcache-store-misses ( +- 0.45% ) (67.48%)
1,839,844 L1-dcache-prefetch-misses ( +- 2.38% ) (67.10%)
576,013 L1-icache-load-misses ( +- 9.58% ) (66.08%)
0.128343367 seconds time elapsed ( +- 3.62% )
$ sudo perf record -F 12500 -e cache-misses ./phonebook_orig
size of entry : 136 bytes
execution time of append() : 0.084899 sec
execution time of findName() : 0.012862 sec
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.068 MB perf.data (1488 samples) ]
$sudo perf report
Overhead Command Shared Object Symbol
53.37% phonebook_orig [kernel.kallsyms] [k] clear_page
11.23% phonebook_orig phonebook_orig [.] findLastName
6.53% phonebook_orig libc-2.23.so [.] __strcasecmp_l_sse42
5.33% phonebook_orig [kernel.kallsyms] [k] handle_mm_fault
4.77% phonebook_orig [kernel.kallsyms] [k] pmd_page_vaddr
2.24% phonebook_orig [kernel.kallsyms] [k] unmap_page_range
2.21% phonebook_orig [kernel.kallsyms] [k] __rmqueue
1.46% phonebook_orig [kernel.kallsyms] [k] copy_user_generic_string
1.29% phonebook_orig [kernel.kallsyms] [k] get_page_from_freelist
1.26% phonebook_orig [kernel.kallsyms] [k] try_charge
1.22% phonebook_orig [kernel.kallsyms] [k] __alloc_pages_nodemask
1.14% phonebook_orig [kernel.kallsyms] [k] free_pcppages_bulk
1.06% phonebook_orig [kernel.kallsyms] [k] get_mem_cgroup_from_mm
0.76% phonebook_orig [kernel.kallsyms] [k] mem_cgroup_try_charge
0.62% phonebook_orig [kernel.kallsyms] [k] alloc_pages_vma
In phonebook_opt.c there is a function named findName:
entry *findName(char lastName[], entry *pHead)
However, it searches by the input lastName, so the original name is imprecise; it is renamed to findLastName here:
entry *findLastName(char lastName[], entry *pHead)
The function append() in phonebook_orig.c does not handle the case where malloc() fails:
entry *append(char lastName[], entry *e)
{
    /* allocate memory for the new entry and put lastName */
    e->pNext = (entry *) malloc(sizeof(entry));
    e = e->pNext;
    strcpy(e->lastName, lastName);
    e->pNext = NULL;
    return e;
}
An assert() is added right after malloc() to catch the failure (this requires #include <assert.h>):
entry *append(char lastName[], entry *e)
{
    /* allocate memory for the new entry and put lastName */
    e->pNext = (entry *) malloc(sizeof(entry));
    assert(e->pNext && "malloc for e->pNext error.");
    e = e->pNext;
    strcpy(e->lastName, lastName);
    e->pNext = NULL;
    return e;
}
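assert() aborts the whole process and is compiled out when NDEBUG is defined, so another option is to report the failure to the caller. A minimal sketch of that alternative (not the version benchmarked here; the caller is assumed to check for NULL):
/* Sketch: propagate allocation failure instead of aborting. */
entry *append(char lastName[], entry *e)
{
    entry *new_entry = (entry *) malloc(sizeof(entry));
    if (!new_entry)
        return NULL;                 /* let the caller decide how to recover */
    strcpy(new_entry->lastName, lastName);
    new_entry->pNext = NULL;
    e->pNext = new_entry;            /* link only after the allocation succeeds */
    return new_entry;
}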
In phonebook_orig, the struct __PHONE_BOOK_ENTRY is 136 bytes:
typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;
So while findLastName is running, the 32 KB L1 data cache can hold at most about 240 of these 136-byte entries, yet words.txt contains nearly 350,000 records, so cache misses are bound to occur during execution and drag down performance.
In this case, however, the goal is only to look up the last name; the other fields are irrelevant. The last name can therefore be split out of __PHONE_BOOK_ENTRY to shrink the struct and, in turn, reduce cache misses.
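A minimal sketch of one way to do the split is shown below. The __DETAIL_ENTRY type and the detail field are illustrative names, not taken from the actual phonebook_opt.c, and MAX_LAST_NAME_SIZE is assumed to be 16; with 8-byte pointers the hot struct is then 16 + 8 + 8 = 32 bytes, matching the "size of entry : 32 bytes" reported below.
/* Sketch: keep only what findLastName touches in the list node;
 * everything else moves behind a pointer that is only followed
 * when the full record is actually needed. Names are illustrative. */
typedef struct __DETAIL_ENTRY {
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
} detail_entry;

typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];   /* assumed to be 16 bytes */
    struct __DETAIL_ENTRY *detail;       /* allocated lazily if needed */
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;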
Results after optimization:
$perf record -F 12500 -e cache-misses ./phonebook_opt
size of entry : 32 bytes
execution time of append() : 0.071934 sec
execution time of findName() : 0.005095 sec
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.050 MB perf.data (1013 samples) ]
$perf report
Overhead Command Shared Object Symbol
54.36% phonebook_opt [kernel.kallsyms] [k] clear_page
7.26% phonebook_opt libc-2.23.so [.] __strcasecmp_l_sse42
6.96% phonebook_opt [kernel.kallsyms] [k] handle_mm_fault
3.65% phonebook_opt [kernel.kallsyms] [k] copy_user_generic_string
3.24% phonebook_opt [kernel.kallsyms] [k] __rmqueue
2.91% phonebook_opt [kernel.kallsyms] [k] pmd_page_vaddr
2.87% phonebook_opt phonebook_opt [.] findLastName
$perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./phonebook_opt
Performance counter stats for './phonebook_opt' (10 runs):
356,696 cache-misses # 92.695 % of all cache refs ( +- 1.60% ) (62.61%)
384,808 cache-references ( +- 2.58% ) (64.87%)
1,397,588 L1-dcache-load-misses ( +- 2.64% ) (68.68%)
274,496 L1-dcache-store-misses ( +- 1.42% ) (71.04%)
449,449 L1-dcache-prefetch-misses ( +- 22.10% ) (70.08%)
331,442 L1-icache-load-misses ( +- 4.48% ) (65.40%)
0.085531137 seconds time elapsed ( +- 1.00% )
The cache-miss overhead of findLastName drops from 11.23% to 2.87%.
The execution time of append() decreases by about 18%.
The execution time of findLastName() decreases by about 60%.
In phonebook_orig.c, findLastName walks the linked list one node at a time, so its time complexity is O(n). To speed the lookup up, a hash table is introduced; the table size is first set to 1024 (following ryanwang522).
DJB Hash Function
unsigned int BJDHash(char *str)
{
    unsigned int hash = 5381;
    while (*str)
        hash = ((hash << 5) + hash) + (*str++);
    return (hash % MAX_HASH_TABLE_SIZE);
}
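For context, a minimal sketch of how the buckets could be wired into the append and lookup paths; hashTable, appendHash, and findLastNameHash are illustrative names and may not match phonebook_hash.c (the snippet assumes <stdlib.h>, <string.h>, and <strings.h> are included):
/* Sketch: separate chaining with the DJB hash above; names are illustrative. */
static entry *hashTable[MAX_HASH_TABLE_SIZE];

void appendHash(char lastName[])
{
    unsigned int idx = BJDHash(lastName);
    entry *e = (entry *) malloc(sizeof(entry));
    if (!e)
        return;                        /* allocation failed: drop the record */
    strcpy(e->lastName, lastName);
    e->pNext = hashTable[idx];         /* push onto this bucket's chain */
    hashTable[idx] = e;
}

entry *findLastNameHash(char lastName[])
{
    for (entry *e = hashTable[BJDHash(lastName)]; e; e = e->pNext)
        if (strcasecmp(lastName, e->lastName) == 0)
            return e;
    return NULL;
}
With this layout the search only walks a single bucket instead of the whole list, which is why findName() drops to effectively zero in the comparison table further down.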
perf
$perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./phonebook_hash
Performance counter stats for './phonebook_hash' (10 runs):
327,504 cache-misses # 80.183 % of all cache refs ( +- 1.56% ) (64.15%)
408,445 cache-references ( +- 3.94% ) (65.74%)
640,421 L1-dcache-load-misses ( +- 3.13% ) (68.02%)
316,427 L1-dcache-store-misses ( +- 1.07% ) (70.02%)
52,671 L1-dcache-prefetch-misses ( +- 6.83% ) (69.38%)
426,756 L1-icache-load-misses ( +- 6.84% ) (65.78%)
0.089159747 seconds time elapsed ( +- 1.77% )
gnuplot
From the plot we can see that the hash table greatly reduces the search time, but building the table also increases the execution time of append().
Following the reference, if the load factor is set to 0.75, then with 349,900 records on hand the hash table size should be roughly 349,900 / 0.75 ≈ 466,534.
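As a sketch (the exact value used in the experiment is not recorded in this write-up; the number below just rounds up 349,900 / 0.75):
/* bucket count for a 0.75 target load factor: 349,900 / 0.75 ≈ 466,534 */
#define MAX_HASH_TABLE_SIZE 466534
Re-running perf with this larger table: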
Performance counter stats for './phonebook_hash' (10 runs):
941,119 cache-misses # 67.372 % of all cache refs ( +- 0.50% ) (65.94%)
1,396,901 cache-references ( +- 2.25% ) (66.21%)
1,963,602 L1-dcache-load-misses ( +- 1.17% ) (66.90%)
525,652 L1-dcache-store-misses ( +- 0.74% ) (67.66%)
105,146 L1-dcache-prefetch-misses ( +- 10.88% ) (67.74%)
1,708,760 L1-icache-load-misses ( +- 2.31% ) (66.72%)
0.261410686 seconds time elapsed ( +- 1.31% )
With the larger table, both the overall execution time and the number of cache misses actually increase.
unsigned int BKDRHash(char *str)
{
    unsigned int hash = 0;
    unsigned int seed = 131; /* 31 131 1313 13131 etc... */
    while (*str)
        hash = (hash * seed) + (*str++);
    return (hash % MAX_HASH_TABLE_SIZE);
}
unsigned int OneAtATimeHash(char *str)
{
    unsigned int hash = 0;
    while (*str) {
        hash += (unsigned char) *str++;  /* mix in one character at a time */
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return (hash % MAX_HASH_TABLE_SIZE);
}
| | DJB | BKDR | One-at-a-Time |
|---|---|---|---|
| spend time | 0.104402s | 0.123361s | 0.129983s |
| append() time | 0.114534s | 0.133128s | 0.137205s |
| findName() time | 0.000000s | 0.000000s | 0.000000s |
| cache miss | 1,041,945 | 671,454 | 624,490 |
| cache reference | 1,788,171 | 1,004,645 | 958,847 |
| cache miss/cache reference | 58.269% | 66.835% | 65.129% |
The input file words.txt contains many strings that are clearly not names, some of them outright meaningless, and their proportion is not small, so this dataset is not representative and differs considerably from real English surnames.
Not necessarily: as the amount of data grows, the time gap between a suitable and an unsuitable search method widens and cache misses become more likely, so more time and effort have to be spent on optimization.