2017q1 Homework1 (phonebook)

tags: `sysprog2017` `phonebook`

contributed by <stanleytazi>

Reviewed by `etc276`

這次 commit message 的 title 是 "Replace single linked-list by hash table" ，但內文並沒有具體補充說明 hash 後會比 single linked-list 優化的原因(看完內容沒有比看完標題時更懂這次 commit 的理由，以及更改後的優缺點)
在 Hackmd 上放 code 時，盡量都使用 clike = ，以便使用者閱讀時知道行數(例如 memory pool 那邊)，也較好提出建議，開發環境的內容也需要排版。
後面有張圖顯示優化後的 append() 時間並沒有下降，我猜測原因是 hash table size 的選擇，可以參考這篇和共筆，選擇特定數量級的質數以優化

開發環境

CPU 基本資訊

指令 $ lscpu

Architecture:          x86_64
CPU 作業模式：    32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
每核心執行緒數：2
每通訊端核心數：2
Socket(s):             1
NUMA 節點：         1
供應商識別號：  GenuineIntel
CPU 家族：          6
型號：              69
Model name:            Intel(R) Core(TM) i5-4260U CPU @ 1.40GHz
製程：              1
CPU MHz：             1400.312
CPU max MHz:           2700.0000
CPU min MHz:           800.0000
BogoMIPS:              4000.20
虛擬：              VT-x
L1d 快取：          32K
L1i 快取：          32K
L2 快取：           256K
L3 快取：           3072K
NUMA node0 CPU(s):     0-3

cache line size

中英文字間請以空白隔開!
課程助教

指令 $ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size

得到 64(bytes? 猜測，待文件或是實驗查證)
在補充資料中皆有提到目前主流是64B

實驗查證
cache補充資料中有一個程式來觀察 cache line size 對不同 array size 做 read 操作時的影響。
結果如下：()

原始版本執行結果

觀察cache使用狀況

$ make plot

可以看到 cache-misses 的比例很高，首要目標降低 cache-misses rate

Performance counter stats for './phonebook_orig' (100 runs):

         1,024,254      cache-misses              #   90.150 % of all cache refs    
         1,140,377      cache-references                                            
       263,685,295      instructions              #    1.48  insns per cycle        
       178,278,149      cycles                                                      

       0.067626164 seconds time elapsed                                          ( +-  0.13% )

執行時間 (optimized 尚未實作)

現有append()&findName()只需要lastName就可以了
原先的儲存人員資料的structure太多資訊在上述兩個function是不需要的

typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

perf top
- 可以觀察到消耗CPU cycle最多的為strcasecmp，可以查看function實作看是否有更改的空間或是利用其他方式提前判斷兩個字串是不一樣的而不用呼叫此function
- 第二名為findName，這邊會需要更改 single-linked list的連結方式以及改善cache misses rate

可實驗方向

更改 struct __PHONE_BOOK_ENTRY 儲存資料方式
減少呼叫strcasecmp
append內所需要的空間可以在程式一開始產生出來
利用 hash function

實驗一減少cache misses rate

更改 struct __PHONE_BOOK_ENTRY 儲存資料方式

in phonebook_opt.h

typedef struct {
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];  
} entry_detail;

typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    struct __PHONE_BOOK_ENTRY *pNext;
    entry_detail *info;
} entry;

與 orig 版本 append()&findName() 執行時間比較
phonebook_opt 的 cache使用狀況

Performance counter stats for './phonebook_opt' (100 runs):

           114,443      cache-misses              #   29.189 % of all cache refs    
           397,487      cache-references                                            
       245,583,148      instructions              #    2.00  insns per cycle        
       122,015,515      cycles                                                      

       0.047291252 seconds time elapsed                                          ( +-  0.26% )

實驗二 - 利用 hash function

改用 hash table

這邊使用作業說明(各种字符串Hash函数比较)提及的BKDRHash











unsigned int BKDRHash(char *str)
{
    unsigned int seed = 131;
    unsigned int hash = 0;

    while (*str) {
        hash = hash * seed + (*str++);
    }

    return (hash & 0x7FFFFFFF);
}

把Hash value也儲存到 entry裡面








typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    unsigned int hashVal;
    entry_detail *info;
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

findName() & append() 實作



































#define MAX_HASH_TABLE_SIZE 1024
entry *findName(char lastName[], entry *pHead)
{
    unsigned int  hVal = BKDRHash(lastName);
    unsigned char hashTableIndex = hVal % MAX_HASH_TABLE_SIZE;

    pHead = hashTable[hashTableIndex];

    while (pHead != NULL) {
        if (pHead->hashVal == hVal &&
                strcasecmp(lastName, pHead->lastName) == 0)
            return pHead;
        pHead = pHead->pNext;
    }
    return NULL;
}

entry *append(char lastName[], entry *e)
{
    unsigned char hashTableIndex ;

    /* allocate memory for the new entry and put lastName */
    entry *hashEntry = (entry *) malloc(sizeof(entry));
    strcpy(hashEntry->lastName, lastName);
    hashEntry->hashVal = BKDRHash(lastName);
    hashTableIndex = hashEntry->hashVal % MAX_HASH_TABLE_SIZE;

    if(hashTable[hashTableIndex]) {
        hashEntry->pNext = hashTable[hashTableIndex];
    }

    hashTable[hashTableIndex] = hashEntry;

    return hashEntry;
}

與 phonebook_orig 版本實行時間比較圖
- 可以看到 append()所花費的時間比實驗一的還多，與orig版本一樣，但是findName() 的執行時間縮短很多
利用 perf 可以看到 cpu 主要花在 BKDRHash()


    # Samples: 528  of event 'cycles'
# Event count (approx.): 131645130
#
# Overhead  Command        Shared Object      Symbol                                
# ........  .............  .................  ......................................
#
    24.16%  phonebook_opt  phonebook_opt      [.] BKDRHash                          
    18.10%  phonebook_opt  phonebook_opt      [.] main                              
    10.62%  phonebook_opt  libc-2.21.so       [.] _int_malloc                       
     9.17%  phonebook_opt  phonebook_opt      [.] append                            
     8.80%  phonebook_opt  libc-2.21.so       [.] _IO_fgets                         
     5.28%  phonebook_opt  libc-2.21.so       [.] _IO_getline_info                  
     4.85%  phonebook_opt  libc-2.21.so       [.] __strcpy_sse2_unaligned           
     3.48%  phonebook_opt  libc-2.21.so       [.] malloc                            
     2.31%  phonebook_opt  libc-2.21.so       [.] __memcpy_sse2                     
     1.52%  phonebook_opt  [kernel.kallsyms]  [k] page_fault                        
     1.29%  phonebook_opt  [kernel.kallsyms]  [k] clear_page_c_e                    
     1.16%  phonebook_opt  libc-2.21.so       [.] memchr                            
     1.09%  phonebook_opt  [kernel.kallsyms]  [k] _raw_spin_lock

cache misses rate
- 可以看到cache misses rate 比實驗一還要糟糕，應該是在entry裡又多了一個hashVal element，使得每次每一條cache line所能儲存的entry個數減少

Performance counter stats for './phonebook_opt' (5 runs):

       132,277,864      cycles                                                      
       236,245,171      instructions              #    1.80  insns per cycle        
           143,696      cache-references                                            
            68,468      cache-misses              #   44.361 % of all cache refs    
         5,047,187      bus-cycles                                                  

       0.057255147 seconds time elapsed                                          ( +-  9.29% )

因為 hash table 只有1024個 entry，所以對於 BKDRHash 產生的hash value 會有許多的 collision，想建立一個 memory pool 是對於 hash table中在同一個 index 下的所有 free entry 會是相鄰位置的，目的想在traverse 某一格 hash table 的 single-linked list 時，減少 cache misses rate，同時間也想利用memory pool 來減少每次 malloc()產生的負擔，實驗進行在下個項目

針對hash table 中每個index 建立鄰近位置的memory pool

新增 initFreeEntryPool() 和 allocEmptyEntry()

// create a memory pool 
void initFreeEntryPool(unsigned int freeBufferNum)
{
    int i,j;
    entry *freeEntry = NULL;

    for(i=0; i < MAX_HASH_TABLE_SIZE; i++) {
        freeEntry = (entry *) malloc(sizeof(entry));
        freePool[i] = freeEntry;
        for(j=1; j < freeBufferNum; j++) {
            if(freeEntry) {
                freeEntry->pNext = (entry *) malloc(sizeof(entry));
                freeEntry = freeEntry->pNext;
                freeEntry->pNext = NULL;
            }
        }
    }
}

//allocate a free entry
entry *allocEmptyEntry(unsigned int tableIndex)
{
    entry *output = NULL;
    if(tableIndex < MAX_HASH_TABLE_SIZE) {
        output = freePool[tableIndex];
        if(output)
            freePool[tableIndex] = output->pNext;
    }

    return output;
}

cache misses rate

Performance counter stats for './phonebook_opt' (100 runs):

           305,065      cache-misses              #   67.835 % of all cache refs    
           431,262      cache-references                                            
       292,667,383      instructions              #    1.46  insns per cycle        
       199,296,907      cycles                                                      

       0.078124204 seconds time elapsed                                          ( +-  0.24% )

執行時間比較
perf top

  41.23%  phonebook_opt  [.] allocEmptyEntry
  20.12%  phonebook_opt  [.] main
  11.57%  phonebook_opt  [.] BKDRHash
   7.13%  phonebook_opt  [.] append
   4.33%  libc-2.21.so   [.] _IO_fgets
   3.25%  libc-2.21.so   [.] _IO_getline_info
   2.21%  [kernel]       [k] page_fault
   2.13%  libc-2.21.so   [.] __memcpy_sse2
   1.45%  libc-2.21.so   [.] malloc

perf record -e cache-refereces,cache-misses -F 10000 ./phonebook_opt

# Samples: 628  of event 'cache-references'
# Event count (approx.): 496297
#
# Overhead  Command        Shared Object      Symbol                            
# ........  .............  .................  ..................................
#
    44.82%  phonebook_opt  libc-2.21.so       [.] _IO_fgets                     
     9.20%  phonebook_opt  phonebook_opt      [.] allocEmptyEntry               
     6.15%  phonebook_opt  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
     6.15%  phonebook_opt  phonebook_opt      [.] append                        
     4.81%  phonebook_opt  phonebook_opt      [.] main                          
     1.87%  phonebook_opt  [kernel.kallsyms]  [k] clear_page_c_e                
     1.72%  phonebook_opt  [kernel.kallsyms]  [k] free_pages_and_swap_cache     
     1.37%  phonebook_opt  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave        
     1.35%  phonebook_opt  [kernel.kallsyms]  [k] __rmqueue                     
     1.26%  phonebook_opt  phonebook_opt      [.] BKDRHash                      
     1.25%  phonebook_opt  libc-2.21.so       [.] _IO_getline_info              
     1.03%  phonebook_opt  [kernel.kallsyms]  [k] page_remove_rmap 

#
# Total Lost Samples: 0
#
# Samples: 742  of event 'cache-misses'
# Event count (approx.): 331210
#
# Overhead  Command        Shared Object      Symbol                             
# ........  .............  .................  ...................................
#
    61.73%  phonebook_opt  libc-2.21.so       [.] _IO_fgets                      
     7.07%  phonebook_opt  phonebook_opt      [.] append                         
     5.49%  phonebook_opt  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string 
     2.40%  phonebook_opt  [kernel.kallsyms]  [k] clear_page_c_e                 
     2.36%  phonebook_opt  phonebook_opt      [.] allocEmptyEntry                
     2.15%  phonebook_opt  [kernel.kallsyms]  [k] __rmqueue                      
     1.54%  phonebook_opt  [kernel.kallsyms]  [k] unmap_page_range               
     1.52%  phonebook_opt  [kernel.kallsyms]  [k] ext4_discard_preallocations    
     1.44%  phonebook_opt  [kernel.kallsyms]  [k] free_pcppages_bulk             
     1.35%  phonebook_opt  [kernel.kallsyms]  [k] __mod_zone_page_state          
     0.97%  phonebook_opt  phonebook_opt      [.] BKDRHash

結果討論
- 可以從 perf 後的結果發現新增的 allocEmptyEntry()反而耗費cpu很多執行時間
- 如何解釋這現象？
  - 多建立了一個特地想讓每個 hash bucket所串接的entry都是相鄰位置，反而會造成每次拿 free entry時都會有cache miss
  - 所以就單純建立一個linked list，每次存取都會是從head拿新的 free entry
- 感覺對於cache的使用狀況還是沒有完全了解
  - 做了一個簡短的筆記

不懂就不懂，不要假文青用「感覺」，理工科系的學生說話要精確。
快去讀書，並且把你的認知貼出來 –jserv

是! 會誠實面對自己!