Try   HackMD

2016q3 Homework1 (phonebook)

contributed by <nekoneko>

tags: sys2016 nekoneko homework

Reviewed by ChenYi

  • 請試著把#define OPT 1 改成 #define OPT 0試試看
  • 記得commit前將沒有用到的code註解刪除
  • 如果要測的資料遠大於hash table所能使用的空間的話,處理方案為何?

開發環境

  • 系統環境: Lubuntu 16.04.1 LTS(64-bit)
  • linux kernel 版本: 4.4.0-38-generic
  • processor: 4x Intel® Core i3 CPU M330 @2.13GHz
  • memory: 3833MB
  • L1d cache: 32K
  • L1i cache: 32K
  • L2 cache: 256K
  • L3 cache: 3072K

Git設定

  • 因為把舊有的Ubuntu 14.04重灌成lubuntu 16.04.1,git 和 ssh key的設定都要重新來。參考GitHub 設定指引Generating an SSH key完成git ssh key的設定。
  • 參考初次設定Git,設定git configure
  • 參照dada同學上學期的共筆,設定terminal prompt的顏色。
    terminal prompt
  • clone完phonebook repository,做以下設定便可以使用ssh key做git push
    $ git remote add origin git@github.com:your_account/your_repository.git

vimrc設定

其實以前都有設定過,只是這次重灌,重新設定之外,做點紀錄

Astyle

astyle --style=kr --indent=spaces=4 --indent-switches --suffix=none *.[ch]
  • style=kr : bracket style為K&R的風格
  • suffix=none : 取消自動備份

gnu plot

Perf

照著Linux 效能分析工具: Perf資料的perf_top_example.c來執行perf top,但是沒有跑出預期結果

  • 將kernel.perf_event_paranoid改為-1(權限全開)
$ sudo sh -c " echo -1 > /proc/sys/kernel/perf_event_paranoid"
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
  • 編譯
g++ -c perf_top_example.c
g++ perf_top_example.o -o example
./example
  • 執行perf
sudo perf top -p $pid

sudo perf top 顯示也還是如上圖,但是下-e參數的話,是會有資訊的。應該是文件沒看完,看熟文件後,來想辦法找出一點頭緒cheng hung lin

  • cyclesevent

參考文件中,提到預設的perfomance event為cycles,然而從上圖來看,預設卻是cycles:pp。接著執行perf list來查閱有cycles關鍵字的event。

  • cpu-cycles OR cycles
  • ref-cycles
  • stalled-cycles-backend OR idle-cycles-backend
  • stalled-cycles-frontend OR idle-cycles-frontend
  • cpu-cycles OR cpu/cpu-cycles/

實際上,在我的筆電上是沒有cycles:pp的event。所以改成以下就可看到我們要的cyclesevent

$ sudo perf top -e cycles -p $pid
  • 結果

「程式輸出的文字訊息」請勿使用圖片表示,請改由文字貼上課程助教
好的,之後不會再使用圖片,謝謝提醒cheng hung lin
可以用 perf --stdio 來作為文字輸出 louielu

perf 基本指令

  • perf top
    • -e <event>: 選擇要測試的PMU event
    • -p <pid>: 針對某個$pid process做profiling
    • perf state
    • repeat <n>: 重複執行n次
    • :u : 測量的instructions只限在user level
    • :k : kernel level
  • perf record
    • 紀錄command的profile到perf.data
    • -F: 取樣的平均頻率
    • -c: 取樣的間距
    • example: -F 250,以平均每秒250個sample的平率取樣
    • example: -c 2000,監測event每2000次取樣一次
  • perf report
    • 讀取perf.data來顯示profile

背景知識

Branch Prediction

  • Static predict v.s. Dynamic predict
  • Branch History Table (BHT)
  • Correlating Branch Predictor
  • General (m,n) Branch Predictors
  • Branch Target Buffer (BTB)

if-else statement

for (int i = 0; i < max; i++) if (<condition>) sum++;

branch fetch

之後有時間再回來看cheng hung lin

Cache miss

複習白算盤 4e 5.2 章節

  • 書中提到cache miss對於效能的影響
    The processing of a cache miss create a pipeline stall as opposed to an interrupt, which would require saving the state of registers

  • cahe miss on write

    • write-through:
      當write miss 發生時會同時寫入cache 和main memory,但是會花長的時間
    • write buffer:
      processor將要寫入的資料放進write buffer後,就可繼續執行而不被暫停。當processor寫入的速度小於寫入main memory時,processor仍然會有stall產生的機會。(burst)
    • write-back:
      只先寫入cache,當cache blocks將被替換,再寫回main memory(原文是:The modified block is written to the lower level of the hierarchy when it is replaced.)

phonebook

原始版本

$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_orig

  • 觀察的事件有

    • cache-misses
    • cache-references
    • L1-dcache-load-misses
    • L1-dcache-store-misses
    • L1-dcache-prefetch-misses
    • L1-icache-load-misses
    • instructions
    • cycles
  • 結果

size of entry : 136 bytes
execution time of append() : 0.085346 sec
execution time of findName() : 0.011937 sec

 Performance counter stats for './phonebook_orig' (10 runs):

           988,062      cache-misses              #   93.309 % of all cache refs      ( +-  0.87% )  (61.95%)
         1,058,908      cache-references                                              ( +-  0.63% )  (61.94%)
         2,644,837      L1-dcache-load-misses                                         ( +-  0.56% )  (61.94%)
           877,813      L1-dcache-store-misses                                        ( +-  1.34% )  (50.10%)
         1,294,117      L1-dcache-prefetch-misses                                     ( +-  1.38% )  (52.38%)
           526,219      L1-icache-load-misses                                         ( +-  5.38% )  (54.38%)
       260,342,568      instructions              #    1.00  insns per cycle          ( +-  0.36% )  (65.70%)
       260,840,687      cycles                                                        ( +-  0.61% )  (63.41%)

       0.129547167 seconds time elapsed                                          ( +-  2.20% )
  • caches-misses高達93.309%
  • append() 的時間 0.085346 sec
  • findName() 的時間 0.011937 sec

$ perf record -F 12500 -e cache-misses ./phonebook_orig && perf report

  • findName的Overhead佔10.73%
  • append所花的時間相對findName來的多,但是cache miss的測試上,findName比append的overhead還來的多。
  • 從main.c的程式碼來看,append和findName分別是用cpu_time1和cpu_time2這兩個變數來紀錄,cpu_time1紀錄的時間是整筆資料append的時間,也就是呼叫了349900次的append(),時間上本來就會比findName大
  • append:
0.05%  phonebook_orig  phonebook_orig     [.] append 

append在cache miss的測試上,並沒有每次都有紀錄。cheng hung lin

  • 原始版本的圖

phonebook_origin

優化版本1 - 調整struct內容

  • 目標: 縮小struct的大小

原始struct大小為136Byte,findName 和 append需要只有last name的資訊。

  • 原始的entry
/* original version */ typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; struct __PHONE_BOOK_ENTRY *pNext; } entry;
  • 縮小過後的struct
#define OPT 1 typedef struct __DETAIL_ENTRY { char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; } detailEntry; typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; detailEntry *detail; /* char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; */ struct __PHONE_BOOK_ENTRY *pNext; } entry;
  • phonebook_opt.h 裡要把#define OPT 1註解拿掉,回到main.c裡可以看到
#if defined(OPT)
    output = fopen("opt.txt", "a");
#else
  • defined說明

  • perf stat 結果

size of entry : 32 bytes
execution time of append() : 0.103262 sec
execution time of findName() : 0.016726 sec

 Performance counter stats for './phonebook_opt' (10 runs):

         1,669,477      cache-misses              #   94.412 % of all cache refs      ( +-  0.44% )  (62.14%)
         1,768,295      cache-references                                              ( +-  0.45% )  (62.22%)
         2,823,014      L1-dcache-load-misses                                         ( +-  0.47% )  (62.07%)
         1,058,923      L1-dcache-store-misses                                        ( +-  0.87% )  (50.66%)
         1,672,812      L1-dcache-prefetch-misses                                     ( +-  2.14% )  (53.15%)
           559,604      L1-icache-load-misses                                         ( +-  7.74% )  (52.70%)
       330,247,777      instructions              #    1.00  insns per cycle          ( +-  0.63% )  (64.64%)
       328,716,232      cycles                                                        ( +-  0.41% )  (62.74%)

       0.161549694 seconds time elapsed                                          ( +-  1.50% )

phonebook_opt

如同參考文件寫的,append的時候還有malloc detailEntry的話,cache miss不降反生,從93.309%升到94.412 %,但是這裡可以看出來findName的cache miss比原先版本的下降,從10.3%到3.03%。

那94.412%的整體cache miss是給detail配置記憶體空間造成的,但要從何才能觀察到呢?cheng hung lin
原本編譯參數自己有加上-g,忘記刪除,所以上面數據重新更改為去除-gflagscheng hung linMon, Sep 26, 2016 8:27 PM

detail pointer不配置記憶體而先初始成NULL

size of entry : 32 bytes
execution time of append() : 0.072660 sec
execution time of findName() : 0.004983 sec

 Performance counter stats for './phonebook_opt' (10 runs):

           349,355      cache-misses              #   78.215 % of all cache refs      ( +-  0.77% )  (59.17%)
           446,659      cache-references                                              ( +-  1.63% )  (59.07%)
           954,744      L1-dcache-load-misses                                         ( +-  3.86% )  (60.24%)
           295,731      L1-dcache-store-misses                                        ( +-  3.02% )  (53.95%)
         1,506,670      L1-dcache-prefetch-misses                                     ( +-  5.08% )  (57.94%)
           272,066      L1-icache-load-misses                                         ( +-  4.59% )  (55.98%)
       261,005,283      instructions              #    1.41  insns per cycle          ( +-  0.98% )  (66.21%)
       185,111,487      cycles                                                        ( +-  0.69% )  (61.76%)

       0.102715691 seconds time elapsed                                          ( +- 12.42% )

phonebook_opt

  • findName cache misses Overhead: 2.69%

phonebook_opt

  • append(): 時間減少15.42% 速度為原來的1.18倍
  • findName(): 時間減少58.22% 速度為原來的2.39倍

優化版本2 hash function

  • hash
    當初資料結構沒學好,從原本的教科書(Fundamentals of data structures in c)和網路資料看起。

    • 定義:
      將key經過hash function預算後,轉化成bucket address。
      • collision:不同的key,產生相同的hash值 (frome wikipedia)
      • overflow: bucket裡的slot滿了
  • 實做
    參考了Yuron大大和ayueh0822大大的共筆,初步了解hashing的過程,分別是 建立hash table、hash function把key轉成buckect address、將資料存進bucket address。因為在插入資料到hash table時,有可能會有overflow和collision的狀況,從ayueh0822大大共筆的圖來看,每個buckect 的slot用linked list來存可以避免這樣個狀況。

overflow和collision其實還是沒有很懂,等實做完後再來查資料。

  • overflow and collision

  • hash table structure

#define TWO_POWER_NUM 8 #define MAX_HASH_TABLE_SIZE 1 << TWO_POWER_NUM typedef struct __HASH_SLOT { entry *data; struct __HASH_SLOT *pNext; } slot_unit; typedef struct __HASH_BUCKET { slot_unit *slot; } hash_bucket; hash_bucket hashTable[MAX_HASH_TABLE_SIZE];

再精簡一些

#define TWO_POWER_NUM 8 #define MAX_HASH_TABLE_SIZE 1 << TWO_POWER_NUM typedef struct __HASH_SLOT { entry *data; struct __HASH_SLOT *pNext; } slot_unit; slot_unit hashTable[MAX_HASH_TABLE_SIZE];
  • string conversion
    參照Yuron大大的筆記和資節課本,單純將每個字元相加產生key
unsigned int stringToInt(char *key) { int number = 0; while (*key) number += *key++; return number; }
  • unsigned int v.s. int bit 的表示差別
  • hashFunction
    • 方法:division
unsigned int hashFunction(unsigned int key) { return key & ((1<<TWO_POWER_NUM)-1); }
  • [ ]key & ((1<<TWO_POWER_NUM)-1)是否作到預期結果

  • [ ]是否有比key % 256來的快

  • findName

entry *findName(char lastname[], entry *pHead) { unsigned int hashPos; slot_unit *slot; hashPos = hashFunction(stringToInt(lastname)); slot = hashTable[hashPos]; while (slot!=NULL) { if (strcasecmp(lastname, slot->data->lastName) == 0) return slot->data; slot = slot->pNext; } return NULL; }
  • append
/* allocate memory for the new entry and put lastName */ e->pNext = (entry *) malloc(sizeof(entry)); e->detail = NULL; e = e->pNext; strcpy(e->lastName, lastName); e->pNext = NULL; unsigned int hashPos; slot_unit *newSlot; hashPos = hashFunction(stringToInt(lastName)); newSlot = (slot_unit *) malloc(sizeof(slot_unit)); newSlot->pNext = hashTable[hashPos]; newSlot->data = e; hashTable[hashPos] = newSlot;
  • main.c
    hash table 初始化
#if defined(HASH) unsigned int __i; for (__i=0;__i < MAX_HASH_TABLE_SIZE;++__i) hashTable[__i] = NULL; #endif

這個版本的程式碼,無法跑出結果,應該是卡在某個loop中,來debugcheng hung linTue, Sep 27, 2016 3:09 AM

stringToInt函式第五行number += *key;,改成*key++cheng hung linTue, Sep 27, 2016 3:28 AM

使用hash table的資料結構,代表說原本的linked list可以不需要用到了,目前的版本保留原本的linked list是多餘的,等bug解完再修cheng hung lin

  • 結果
size of entry : 32 bytes
execution time of append() : 0.103418 sec
execution time of findName() : 0.000000 sec

 Performance counter stats for './phonebook_hash' (10 runs):

           484,838      cache-misses              #   89.380 % of all cache refs      ( +-  1.35% )  (59.20%)
           542,444      cache-references                                              ( +-  1.22% )  (63.18%)
           781,892      L1-dcache-load-misses                                         ( +-  2.35% )  (65.95%)
           491,304      L1-dcache-store-misses                                        ( +-  0.47% )  (55.71%)
            68,136      L1-dcache-prefetch-misses                                     ( +- 12.38% )  (53.77%)
           380,728      L1-icache-load-misses                                         ( +-  5.78% )  (50.07%)
       304,166,446      instructions              #    1.36  insns per cycle          ( +-  0.37% )  (61.18%)
       224,161,591      cycles                                                        ( +-  0.74% )  (57.94%)

       0.111725998 seconds time elapsed                                          ( +-  2.45% )
  • findName所花的時間減少,但是cache-misses反而增加
  • [ ]為什麼cache-misses增加
  • [ ]何謂cache-references
Samples: 1K of event 'cache-misses', Event count (approx.): 515758                                                                    
Overhead  Command         Shared Object      Symbol64.32%  phonebook_hash  [kernel.kallsyms]  [k] clear_page                                                                          ▒
   8.63%  phonebook_hash  [kernel.kallsyms]  [k] pte_lockptr.isra.13                                                                 ▒
   3.73%  phonebook_hash  [kernel.kallsyms]  [k] try_charge                                                                          ▒
   3.48%  phonebook_hash  [kernel.kallsyms]  [k] copy_user_generic_string                                                            ▒
   2.74%  phonebook_hash  [kernel.kallsyms]  [k] unmap_page_range                                                                    ▒
   2.62%  phonebook_hash  [kernel.kallsyms]  [k] __rmqueue.isra.79                                                                   ▒
   2.48%  phonebook_hash  [kernel.kallsyms]  [k] get_page_from_freelist                                                              ▒
   1.74%  phonebook_hash  [kernel.kallsyms]  [k] get_mem_cgroup_from_mm                                                              ▒
   1.57%  phonebook_hash  [kernel.kallsyms]  [k] free_pcppages_bulk                                                                  ◆
   1.29%  phonebook_hash  [kernel.kallsyms]  [k] handle_mm_fault                                                                     ▒
   0.98%  phonebook_hash  [kernel.kallsyms]  [k] __alloc_pages_nodemask                                                              ▒
   0.94%  phonebook_hash  [kernel.kallsyms]  [k] mem_cgroup_try_charge                                                               ▒

64.32%, 8.63%都為紅字,其餘為綠字cheng hung lin

phonebook_hash

  • slots 分佈
    • h(x)=x mod M
    • M = 64
    • M = 256
    • M = 1024
    • M = 42737
    • M = 42737
size of entry : 32 bytes
execution time of append() : 0.109029 sec
execution time of findName() : 0.000000 sec

 Performance counter stats for './phonebook_hash' (10 runs):

           877,411      cache-misses              #   87.872 % of all cache refs      ( +-  1.44% )  (61.75%)
           998,513      cache-references                                              ( +-  1.43% )  (63.25%)
         1,406,025      L1-dcache-load-misses                                         ( +-  1.37% )  (64.22%)
           536,177      L1-dcache-store-misses                                        ( +-  1.04% )  (52.47%)
            71,249      L1-dcache-prefetch-misses                                     ( +- 22.30% )  (51.38%)
           603,052      L1-icache-load-misses                                         ( +-  4.61% )  (49.54%)
       321,729,410      instructions              #    0.93  insns per cycle          ( +-  2.92% )  (62.35%)
       347,147,596      cycles                                                        ( +-  0.76% )  (60.77%)

       0.169511003 seconds time elapsed                                          ( +-  0.80% )
  • M = 42737
    將原本的兩次malloc降到一次
entry *append(char lastName[], entry *e) { /* allocate memory for the new entry and put lastName */ unsigned int hashPos; hashPos = hashFunction(stringToInt(lastName)); e = (entry *) malloc(sizeof(entry)); e->pNext = hashTable[hashPos]; strcpy(e->lastName, lastName); e->detail = NULL; hashTable[hashPos] = e; return e; }
size of entry : 32 bytes
execution time of append() : 0.087568 sec
execution time of findName() : 0.000000 sec

 Performance counter stats for './phonebook_hash' (10 runs):

           692,774      cache-misses              #   87.907 % of all cache refs      ( +-  0.90% )  (59.08%)
           788,076      cache-references                                              ( +-  1.63% )  (61.45%)
         1,110,403      L1-dcache-load-misses                                         ( +-  1.20% )  (63.80%)
           300,690      L1-dcache-store-misses                                        ( +-  2.65% )  (54.17%)
            79,456      L1-dcache-prefetch-misses                                     ( +-  4.96% )  (53.95%)
           448,250      L1-icache-load-misses                                         ( +-  5.02% )  (51.45%)
       295,086,356      instructions              #    0.98  insns per cycle          ( +-  1.58% )  (62.71%)
       299,696,349      cycles                                                        ( +-  0.56% )  (59.94%)

       0.148717215 seconds time elapsed                                          ( +-  3.20% )

hash funtion 2

  • hash function
    • M = 42737
unsigned int hash(hash_table *hashtable , char *str) { unsigned int hash_value = 0; while (*str) hash_value = (hash_value << 5) - hash_value + (*str++); return (hash_value % MAX_HASH_TABLE_SIZE); }
  • slot 分佈

  • cache misses rate: 61.825 %
size of entry : 32 bytes
execution time of append() : 0.093066 sec
execution time of findName() : 0.000010 sec

 Performance counter stats for './phonebook_hash' (10 runs):

           695,569      cache-misses              #   61.825 % of all cache refs      ( +-  0.40% )  (60.64%)
         1,125,064      cache-references                                              ( +-  0.35% )  (60.66%)
         1,524,817      L1-dcache-load-misses                                         ( +-  0.77% )  (61.47%)
           321,656      L1-dcache-store-misses                                        ( +-  0.73% )  (52.48%)
            48,764      L1-dcache-prefetch-misses                                     ( +-  1.28% )  (54.86%)
           567,795      L1-icache-load-misses                                         ( +-  3.91% )  (53.19%)
       281,766,203      instructions              #    0.89  insns per cycle          ( +-  1.17% )  (64.34%)
       316,135,640      cycles                                                        ( +-  0.63% )  (61.96%)

       0.159598584 seconds time elapsed                                          ( +-  3.83% )
  48.91%  phonebook_hash1  phonebook_hash1    [.] main
  30.85%  phonebook_hash1  [kernel.kallsyms]  [k] clear_page
   3.48%  phonebook_hash1  [kernel.kallsyms]  [k] pte_lockptr.isra.13
   2.60%  phonebook_hash1  [kernel.kallsyms]  [k] copy_user_generic_string
   1.65%  phonebook_hash1  [kernel.kallsyms]  [k] try_charge
   1.49%  phonebook_hash1  phonebook_hash1    [.] append
   1.37%  phonebook_hash1  [kernel.kallsyms]  [k] unmap_page_range
   1.30%  phonebook_hash1  [kernel.kallsyms]  [k] get_page_from_freelist
   1.17%  phonebook_hash1  [kernel.kallsyms]  [k] __rmqueue.isra.79
   0.91%  phonebook_hash1  [kernel.kallsyms]  [k] get_mem_cgroup_from_mm
   0.80%  phonebook_hash1  [kernel.kallsyms]  [k] mem_cgroup_try_charge
   0.75%  phonebook_hash1  [kernel.kallsyms]  [k] free_pcppages_bulk
   0.51%  phonebook_hash1  [kernel.kallsyms]  [k] handle_mm_fault

  • cache miss rate 下降,但是cache miss沒有減少,反而是cache-references增加

  • 為了測試slot,phonebook_hash,phonebook_hash1,都會在讀取整個hash table一遍,試著將以下程式碼註解

#if 0 FILE *__fp; entry *__entry; unsigned int __count; __fp = fopen("hash_slots.txt", "w"); for (__i=0; __i<MAX_HASH_TABLE_SIZE; ++__i) { __count = 0; __entry = hashTable[__i]; while (__entry) { ++__count; __entry = __entry->pNext; } fprintf(__fp, "%d %d\n", __i, __count); } fclose(__fp); #endif
size of entry : 32 bytes execution time of append() : 0.087941 sec execution time of findName() : 0.000000 sec Performance counter stats for './phonebook_hash1' (10 runs): 337,303 cache-misses # 53.906 % of all cache refs ( +- 0.37% ) (64.80%) 625,727 cache-references ( +- 1.56% ) (64.38%) 912,984 L1-dcache-load-misses ( +- 1.76% ) (65.78%) 313,882 L1-dcache-store-misses ( +- 0.70% ) (69.79%) 62,026 L1-dcache-prefetch-misses ( +- 5.81% ) (70.86%) 290,789 L1-icache-load-misses ( +- 6.16% ) (67.31%) 0.105764890 seconds time elapsed ( +- 13.96% )
  • 所以讀取整個hash table也會造成cache miss

小結

  • 相對其他人的測出的cache miss rate從9x%降到3x%,我測出的只有從94.412 %降到78.215 %

延伸閱讀: Intel Ivy Bridge Cache Replacement Policy

這可見 Intel i3, i5, i7 之間 cache 設計的落差 jserv

  • division的hash function為減少collision,要以"質數"來取餘數

perfect hash functions

Generating Perfect Hash Functions
Perfect Hash Function Generator

參考資料