Try   HackMD

2016q3 Homework1 (phonebook)

contributed by <louielu>

Reviewed by jserv

  • 沒有嘗試不同的 data set,原本的程式輸入是已排序、非典型英文姓氏,這與現實不匹配
  • 實作提到透過引入 hash 加速 fineName() 的操作,但缺乏不同 hash function 的效能比較和設計取捨
  • append() 中,malloc() 是個顯著的時間開銷,缺乏減緩效能衝擊的方案,而且沒考慮到 malloc() 失敗的情況
    Image Not Showing Possible Reasons
    • The image was uploaded to a note which you don't have access to
    • The note which the image was originally uploaded to has been deleted
    Learn More →
    • 在上圖的環境中,可用記憶體不足以載入 35 萬筆電話資料,於是連 phonebook_orig 執行都會失敗:
    ​​​​$ ./phonebook_orig 
    ​​​​size of entry : 136 bytes
    ​​​​程式記憶體區段錯誤
    
  • 卻乏搜尋演算法的評估和效能分析
  • 考慮到電話簿需要作到動態資料新增和刪除,若引入 hash,面對大量資料時,會有什麼影響?
  • 儘管已經整理頗多 perf 和效能測量的資料,但並未反映到此程式效能分析,除了 cache miss,還請一併探討 branch prediction accuracy 等議題
  • main.c 無法透過 function pointer 來切換和比較不同實作的效能落差,應該先設計一份可通用的軟體界面,然後將 binary tree, hash table, trie 等不同實作機制加入
  • append()findName() 時間加入統計的意義不大,真實應用往往是個別操作,特別在圖表的呈現

開發環境

  • CPU: Intel® Core i7-2640M CPU @ 2.80GHz
  • MEM: 8GB
  • Cache:
    • L1d cache: 32K
    • L1i cache: 32K
    • L2 cache: 256K
    • L3 cache: 4096K

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

開發目標

  • 精通 perf
  • 改善 phonebook 效能
  • findName cache miss
  • append()
  • fuzzy

perf

perf - raw counter

格式:rUUEE (U = Umask Value, E = Event num)
範例:

  • Umask: 02H
  • Event: 11H
  • Description:Counts 256-bit packed double-percision floating-polint insturctions
  • Event Mask Mnemonic:SIMD_FP_256.PACKED_DOUBLE

使用指令:$ taskset -c 1 perf stat -e r0211 ./time_test_avx (借用一下 compute-pi, -c 指定 core 1)
讀取register確認:$ sudo rdmsr -p 1 0x186; output 110211 (-p 指定 core 1)
perf 輸出:

 Performance counter stats for './time_test_avx':
       678,616,723      r0211:u                     
       1.698941816 seconds time elapsed

換一個沒有用到 SIMD 的
$ taskset -c 1 perf stat -e r0211 ./time_test_baseline

perf 輸出:

 Performance counter stats for './time_test_baseline':
                 0      r0211:u                                             
       5.186517630 seconds time elapsed

明顯會是 0 (不是 0 就是你打錯raw counter,或是你用的不是 sandy bridge)

Reading PERV_EVT_SEL MSRs (Model-specific register) value

rdmsr 這個工具可以用來讀取 CPU MSRs (Model-specific register) 的資料。
根據 Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B p.3989 34.7 MSRS IN INTEL PROCESSOR FAMILY (INTEL MICROARCHITECTURE CODE NAME SANDY BRIDGE)Table 34-10 (p.3994) 指出,Register Address 186H ~ 18DH 是 IA32_PERFEVTSEL 的暫存器位置。

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

所以剛才才會指定 rdmsr 0x186,因為他是 IA32_PERFEVTSEL0 的位置

Misleading perf event name!

既然知道怎麼看到現在 event 的 raw code,我們就要來驗證一下,到底 perf list 出來的 events 跟我們想像中的有沒有一樣?

Reference

example of raw counter umask and event num.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

v## 案例分析: phonebook

檔案分析

main.c

main.c 分為兩個函數,一個是 diff_in_second 負責計算時間差,一個是 main 負責運行整著處理程序。

7: #include IMPL

會使用到 gcc -DIMPL="" 的 macro,可以參考 gcc help

-D name=definition
The contents of definition are tokenized and
processed as if they appearedduring translation
phase three in a #define directive.
In particular, the definition will be truncated
by embedded newline characters.

大意就是說 gcc -Dname=definition 的東西會被翻譯成 #define name definition。透過這樣的方式,我們可以只用一個 main.c 檔來運行 orig 或是 opt 版本的 phonebook。

!!必須要注意!!
在 shell 裏面的話 gcc -DIMPL"\"phonebook_orig.h\"" 雙引號要做跳脫字元,要不然會一直出錯

main.c:7:10: error: #include expects "FILENAME" or <FILENAME>
 #include IMPL

延伸閱讀: 警告:"no newline at end of file" jserv

9: #define DICT_FILE "./dictionary/words.txt"

這邊定義使用的字典檔,可以看到是使用 /dictionary/words.txt

46: #if defined(__GNUC__)

gcc 特有的 predefine macro,可以用來偵測 compiler 是不是 gcc,搭配下一行 gcc 的 buildin funtion 來使用。

可以拿起手邊的 tccclang 來玩,加個程式碼

#if defined(__GNUC__) puts("__GNUC__"); __builtin___clear_cache((char *) pHead, (char *) pHead + sizeof(entry)); #endif

gcc -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_gcc
tcc -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_tcc
clang -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_clang

執行 phonebook_orig_[gcc,tcc,clang] 只有 phonebook_orig_gcc 會輸出 __GNUC__

47: __builtin__clear_cache((char *) pHead, (char *) pHead + sizeof(entry))

gcc 的 builtin function __builtin__clear_cache,用來清除從 begin 到 end 之間的 instructions cache。

(gdb) x/136x (char *) pHead
0x602240:	0x00000000	0x00000000	0x00000000	0x00000000
0x602250:	0x00000000	0x00000000	0x00000000	0x00000000
0x602260:	0x00000000	0x00000000	0x00000000	0x00000000
0x602270:	0x00000000	0x00000000	0x00000000	0x00000000
0x602280:	0x00000000	0x00000000	0x00000000	0x00000000
0x602290:	0x00000000	0x00000000	0x00000000	0x00000000
0x6022a0:	0x00000000	0x00000000	0x00000000	0x00000000
0x6022b0:	0x00000000	0x00000000	0x00000000	0x00000000
0x6022c0:	0x00000000	0x00000000	0x00000411	0x00000000
0x6022d0:	0x657a6973	0x20666f20	0x72746e65	0x203a2079
0x6022e0:	0x20363331	0x65747962	0x00000a73	0x00000000
0x6022f0:	0x00000000	0x00000000	0x00000000	0x00000000
0x602300:	0x00000000	0x00000000	0x00000000	0x00000000
50	    clock_gettime(CLOCK_REALTIME, &start);
(gdb) x/136x (char *) pHead
0x602240:	0x00000000	0x00000000	0x00000000	0x00000000
0x602250:	0x00000000	0x00000000	0x00000000	0x00000000
0x602260:	0x00000000	0x00000000	0x00000000	0x00000000
0x602270:	0x00000000	0x00000000	0x00000000	0x00000000
0x602290:	0x00000000	0x00000000	0x00000000	0x00000000
0x6022a0:	0x00000000	0x00000000	0x00000000	0x00000000
0x6022b0:	0x00000000	0x00000000	0x00000000	0x00000000
0x6022c0:	0x00000000	0x00000000	0x00000411	0x00000000
0x6022d0:	0x4e475f5f	0x5f5f4355	0x72746e0a	0x203a2079
0x6022e0:	0x20363331	0x65747962	0x00000a73	0x00000000
0x6022f0:	0x00000000	0x00000000	0x00000000	0x00000000
0x602300:	0x00000000	0x00000000	0x00000000	0x00000000

中間 0x6022d0 的數值被改變了,不知道有何作用。

49-59: clock_gettime(CLOCK_REALTIME, &start)

這邊負責 append 資料到 *e 裡面

e = append(line, e)

使用這種方式來 append,可以降低一般 linked list append 需要 O(n) 的時間,可是必須要維護一個 list head,所以可以看到 line 63: e = pHead,把 pHead 指向的位置傳給 e,讓 e 指向 list head 的地方。

66-80: char input[MAX_LAST_NAME_SIZE] = "zyxel";

這邊先段來看

char input[MAX_LAST_NAME_SIZE] = "zyxel";`

這行把等等要找的 last name 設定為 "zyxel",看到 words.txt 裏面,zyxel 是倒數第 10 個,代表 findName() 一定要加班了 (幫QQ)

e = pHead

因為 e 在 append 完之後現在在 last element,我們要讓 e 指回 first element。

assert(findName(input, e) && "Did you implement findName() in " IMPL "?"); assert(0 == strcmp(findName(input, e)->lastName, "zyxel"));
#if defined(__GNUC__) __builtin___clear_cache((char *) pHead, (char *) pHead + sizeof(entry)); #endif /* compute the execution time */ clock_gettime(CLOCK_REALTIME, &start); findName(input, e); clock_gettime(CLOCK_REALTIME, &end); cpu_time2 = diff_in_second(start, end);

未優化版本

$ make run

size of entry : 136 bytes
execution time of append() : 0.046524 sec
execution time of findName() : 0.006071 sec

basic CPU statistic
$ perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles ./phone_orig

 Performance counter stats for './phonebook_orig':

       175,357,733      cycles:u                                                    
       229,437,098      instructions:u            #    1.31  insn per cycle                                            
         1,301,277      cache-references:u                                          
         1,277,761      cache-misses:u            #   98.193 % of all cache refs    
         5,058,973      bus-cycles:u                                                

       0.066305491 seconds time elapsed

CPU L1 cache staticstic
$ perf stat -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-prefetch-misses,L1-dcache-store-misses,L1-icache-load-misses ./phonebook_orig

 Performance counter stats for './phonebook_orig':

         1,315,576      cache-misses:u            #  120.087 % of all cache refs      (59.07%)
         1,095,522      cache-references:u                                            (60.37%)
         1,231,509      L1-dcache-load-misses:u   #    1.95% of all L1-dcache hits    (78.15%)
        63,186,159      L1-dcache-loads:u                                             (81.73%)
         3,369,852      L1-dcache-prefetch-misses:u                                     (43.53%)
            35,271      L1-dcache-store-misses:u                                      (39.19%)
            21,174      L1-icache-load-misses:u                                       (56.70%)

       0.068884934 seconds time elapsed

縮減 entry size

!!!WARNING!!!

開始之前,不要被表了,先看 code,記得把 phonebook_opt.h 的 #define OPT 1 註解拿掉,要不然你怎麼跑 make plot 都是沒用的bb

將 entry size 從 136 bytes 縮小為 32 bytes,append 與 findName 的效率都有所提升。

append 提升 21.97% 速度,findName 提升 122.31 % 速度。

typedef struct __LAST_NAME_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    phonebook_detail *detail;
    struct __LAST_NAME_ENTRY *pNext;
} entry;

orig cache-test

 Performance counter stats for './phonebook_orig' (100 runs):

         1,181,680      cache-misses:u            #   95.942 % of all cache refs      ( +-  0.49% )
         1,231,662      cache-references:u                                            ( +-  0.41% )
       229,689,183      instructions:u            #    1.24  insn per cycle                                              ( +-  0.02% )
       185,425,108      cycles:u                                                      ( +-  0.46% )

       0.079698901 seconds time elapsed                                          ( +-  0.83% )

entry size cache-test

 Performance counter stats for './phonebook_opt' (100 runs):

            89,105      cache-misses:u            #   32.843 % of all cache refs      ( +-  0.98% )
           271,309      cache-references:u                                            ( +-  0.70% )
       233,120,203      instructions:u            #    1.62  insn per cycle                                              ( +-  0.02% )
       143,480,575      cycles:u                                                      ( +-  0.35% )

       0.053849988 seconds time elapsed                                          ( +-  0.78% )

cache misses 從 95% 下降到 32%!太神啦~

Hash table

Hash function

從記憶裏面找到這個網站:
各种字符串Hash函数比较 by byvoid

採用推薦的 BKDR hash function (就是 K&R,鼎鼎大名的 Brian Kernighan 和 Dennis Ritchie 在《The C Programming Language》裏面寫的 hash function:

unsigned int BKDRHash(char *str) { unsigned int seed = 131; unsigned int hash = 0; while (*str) { hash = hash * seed + (*str++); } return (hash & 0x7FFFFFFF); }

Hash table

因為很喜歡使用 Python,所以決定了解 Python 的 dict 是如何實作的,同時運用在這次的作業當中。

參考 Python Dictionary Implementation by Laurent Luce,可以發現一件重要的事情:

Open Addressing

系統程式的老師上課有教過,可是忘記了。看到 wiki 就想起來了。

Open addressing 是當 hash table 出現撞擊的時候的處理方式,運用不同的探針 (probe) 來決定這個 value 的 index。

Open addressing 最大的好處就是,可以將 hash table 的查詢時間降低為 O(1),以往用 list 來處理 collision 的話會讓查詢時間出現 worse case O(n)。

常見的 probe 以下幾種:

  • Linear probing: +1 大法
  • Quadratic probing : new_index = f(old_index);

Python 使用 Quadratic probing,使用的 function 如下:

perturb = hash; index = (index << 2) + index + perturb + 1; perturb >>= PERTURB_SHIFT;

PERTURB_SHIFT 預設是 5,不能小於 1。

假設 hash table size = 32,hash = 187875484, index = 28
跑 probing 的結果會是:

28 -> 9 -> 18 -> 11 -> 29 -> 5 -> 31 -> 28 ->
13 -> 2 -> 11 -> 24 -> 25 -> 30 -> 23 -> 20 ->
5 -> 26 -> 3 -> 16 -> 17 -> ....

Dict Structure

Python dict 的 structure 總共用到兩個,一個是 PyDictEntry 作為 key value structure、一個是 PyDictObject 作為整個 Dict 的 Object

可以在 svn.python.org 找到 dictobject.c 的原始碼,如果你想要做的話,打開他,你會用到的。

在這邊把他特化為 phonebook 專用的 DictEntryDictObject 如下:

typedef struct { unsigned int hash; char lastName[MAX_LAST_NAME_SIZE]; phonebook_detail *detail; } DictEntry; typedef struct _dictobject DictObject; struct _dictobject { unsigned int ma_mask; int ma_used; int ma_fill; int ma_size; DictEntry *ma_table; DictEntry *(*ma_lookup)(DictObject *mp, char key[], unsigned int hash); };
  • unsigned int ma_mask: hash mask, size - 1
  • ma_used:目前使用的 slot 數量
  • ma_fill:目前使用的 slot + dummy slot,dummy slot 是刪除 slot 之後,會將 slot 設定為 dummy slot
  • ma_size:目前 ma_table 的大小
  • ma_table:hash table
  • ma_lookup:lookup function

Dict function

DictObject *Dict_New(); int Dict_SetItem(DictObject *mp, char key[]); int Dict_Lookup(DictObject *mp, char key[]); DictEntry *lookdict(DictObject *mp, char key[], unsigned int hash); int insertdict(DictObject *mp, char key[], const unsigned int hash); void insertdict_clean(DictObject *mp, char key[], unsigned int hash, phonebook_detail *value); int dictresize(DictObject *mp, int minused); unsigned int BKDRHash(char *s);

Hashtable 效能

 Performance counter stats for './phonebook_opt' (100 runs):

         1,553,358      cache-misses:u            #   68.207 % of all cache refs      ( +-  0.28% )
         2,277,402      cache-references:u                                            ( +-  0.20% )
       270,305,408      instructions:u            #    0.62  insn per cycle                                              ( +-  0.02% )
       437,887,064      cycles:u                                                      ( +-  0.34% )

       0.162588847 seconds time elapsed                                          ( +-  0.60% )

  • cache-misses 提升到 68.207%
  • append 時間也上升到 0.147秒,跟 orig 相比慢了3倍
  • findName 時間降到 0 秒

電話簿這種東西建完就可以用很久了,findName 降到 0 應該滿棒的啊!

Performance 比較 - abandoned

  • append() all word in dictionary/words.txt
  • findName("zyxel") 100 times

這不是個好的實驗設計。

 Performance counter stats for './phonebook_orig' (100 runs):

        42,431,897      cache-misses:u            #   99.071 % of all cache refs      ( +-  0.11% )
        42,829,855      cache-references:u                                            ( +-  0.09% )
     2,164,254,681      instructions:u            #    1.04  insn per cycle                                              ( +-  0.00% )
     2,079,356,180      cycles:u                                                      ( +-  0.42% )

       0.653479071 seconds time elapsed                                          ( +-  0.27% )
 Performance counter stats for './phonebook_opt' (100 runs):

         2,481,889      cache-misses #   33.237 % of all cache refs ( +-  0.36% )
         7,467,180      cache-references ( +-  0.29% )
     2,167,727,565      instructions #    2.21  insn per cycle ( +-  0.00% )
       981,860,196      cycles ( +-  0.18% )

       0.410072312 seconds time elapsed ( +-  1.42% )
 Performance counter stats for './phonebook_opt_hash' (100 runs):

         1,556,307      cache-misses #   68.441 % of all cache refs ( +-  0.43% )
         2,273,932      cache-references ( +-  0.21% )
       270,227,846      instructions #    0.65  insn per cycle ( +-  0.02% )
       412,602,985      cycles ( +-  0.29% )

       0.211090389 seconds time elapsed ( +-  1.22% )

Performance - Intel Hierarchical Top-Down Performance Characterization Methodology

  • Reference from: http://www.brendangregg.com/methodology.html

  • Are UOPs issued?

    • If yes:
      • Are UOPs retired?
        • If yes: retiring (good)
        • If no: investigate bad speculations
    • If no:
      • Allocation stall?
        • If yes: investigate back-end stalls
        • If no: investigate front-end stalls

CPU front-end 意義

簡單來說前端就是:取得指令、以及把指令解碼成微指令。sandy bridge 系列可以一次傳 4 個 uops 給後端

中文翻譯 by louielu

CPU back-end 意義

簡單來說後端就是:執行 uops,有 parallelism

中文翻譯 by louielu

phonebook case

$ perf stat -d ./phonebook_oirg
 Performance counter stats for './phonebook_orig':

         66.166969      task-clock:u (msec)       #    0.995 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            12,346      page-faults:u             #    0.187 M/sec                  
       113,680,547      cycles:u                  #    1.718 GHz                      (34.24%)
        82,373,177      stalled-cycles-frontend:u #   72.46% frontend cycles idle     (48.06%)
        64,255,629      stalled-cycles-backend:u  #  56.52% backend cycles idle       (50.65%)
       186,811,699      instructions:u            #    1.64  insn per cycle         
                                                  #    0.44  stalled cycles per insn  (62.39%)
        34,733,972      branches:u                #  524.944 M/sec                    (63.81%)
           789,273      branch-misses:u           #    2.27% of all branches          (62.64%)
        49,760,981      L1-dcache-loads:u         #  752.052 M/sec                    (44.50%)
            56,790      L1-dcache-load-misses:u   #    0.11% of all L1-dcache hits    (18.82%)
             4,932      LLC-loads:u               #    0.075 M/sec                    (23.20%)
   <not supported>      LLC-load-misses:u                                           

       0.066500871 seconds time elapsed
$ perf stat -d ./phonebook_opt     
 Performance counter stats for './phonebook_opt':

         41.165953      task-clock:u (msec)       #    0.993 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             4,147      page-faults:u             #    0.101 M/sec                  
        89,513,769      cycles:u                  #    2.174 GHz                      (30.69%)
        47,728,782      stalled-cycles-frontend:u #   53.32% frontend cycles idle     (45.97%)
        30,544,573      stalled-cycles-backend:u  #  34.12% backend cycles idle       (50.45%)
       186,816,945      instructions:u            #    2.09  insn per cycle         
                                                  #    0.26  stalled cycles per insn  (61.88%)
        35,932,936      branches:u                #  872.880 M/sec                    (63.67%)
           848,250      branch-misses:u           #    2.36% of all branches          (70.17%)
        54,843,255      L1-dcache-loads:u         # 1332.248 M/sec                    (41.03%)
            26,195      L1-dcache-load-misses:u   #    0.05% of all L1-dcache hits    (15.29%)
             2,297      LLC-loads:u               #    0.056 M/sec                    (22.03%)
   <not supported>      LLC-load-misses:u                                           

       0.041471519 seconds time elapsed

A Re think of WHY?

  • 在不改變讀取方式以及 append code 的狀況下,為什麼降低 struct size 可以增加 append 效能?
    • sizeof((entry *) 0x0) -> 8
      • entry pointer 的大小不會改變啊。
    • 推測是 malloc size 對於速度有差異
    • proof it.

pthread boss/worker model

pthreads-example/example3-boss-worker-model.c