# 2016q3 Homework1 (phonebook) contributed by <`louielu`> ### Reviewed by `jserv` * 沒有嘗試不同的 data set,原本的程式輸入是已排序、非典型英文姓氏,這與現實不匹配 * 實作提到透過引入 hash 加速 `fineName()` 的操作,但缺乏不同 hash function 的效能比較和設計取捨 * 在 `append()` 中,`malloc()` 是個顯著的時間開銷,缺乏減緩效能衝擊的方案,而且沒考慮到 `malloc()` 失敗的情況 ![image](https://hackmd.io/_uploads/B1ArXW4pT.png) * 在上圖的環境中,可用記憶體不足以載入 35 萬筆電話資料,於是連 `phonebook_orig` 執行都會失敗: ```shell $ ./phonebook_orig size of entry : 136 bytes 程式記憶體區段錯誤 ``` * 卻乏搜尋演算法的評估和效能分析 * 考慮到電話簿需要作到動態資料新增和刪除,若引入 hash,面對大量資料時,會有什麼影響? * 儘管已經整理頗多 perf 和效能測量的資料,但並未反映到此程式效能分析,除了 cache miss,還請一併探討 branch prediction accuracy 等議題 * `main.c` 無法透過 function pointer 來切換和比較不同實作的效能落差,應該先設計一份可通用的軟體界面,然後將 binary tree, hash table, trie 等不同實作機制加入 * 將 `append()` 和 `findName()` 時間加入統計的意義不大,真實應用往往是個別操作,特別在圖表的呈現 ## 開發環境 * CPU: Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz * MEM: 8GB * Cache: * L1d cache: 32K * L1i cache: 32K * L2 cache: 256K * L3 cache: 4096K ![](https://i.imgur.com/jx9h9w8.png) * [Portable Hardware Locality](https://www.open-mpi.org/projects/hwloc/) (hwloc) ## 開發目標 * 精通 perf * 改善 phonebook 效能 * findName cache miss * append() * fuzzy ## perf * [perf Examples](http://www.brendangregg.com/perf.html) by Brendan Gregg * [lmbench](http://www.bitmover.com/lmbench/) * [如何有效測量 performance ? (FB: embedded2016)](https://www.facebook.com/groups/system.software2016/permalink/1124571464289024/) ### perf - raw counter 格式:`rUUEE` (U = Umask Value, E = Event num) 範例: * Umask: 02H * Event: 11H * Description:`Counts 256-bit packed double-percision floating-polint insturctions` * Event Mask Mnemonic:`SIMD_FP_256.PACKED_DOUBLE` 使用指令:`$ taskset -c 1 perf stat -e r0211 ./time_test_avx` (借用一下 compute-pi, -c 指定 core 1) 讀取register確認:`$ sudo rdmsr -p 1 0x186; output 110211` (-p 指定 core 1) perf 輸出: ``` Performance counter stats for './time_test_avx': 678,616,723 r0211:u 1.698941816 seconds time elapsed ``` 換一個沒有用到 SIMD 的 `$ taskset -c 1 perf stat -e r0211 ./time_test_baseline` perf 輸出: ``` Performance counter stats for './time_test_baseline': 0 r0211:u 5.186517630 seconds time elapsed ``` 明顯會是 0 (不是 0 就是你打錯raw counter,或是你用的不是 sandy bridge) ### Reading PERV_EVT_SEL MSRs (Model-specific register) value `rdmsr` 這個工具可以用來讀取 CPU MSRs (Model-specific register) 的資料。 根據 `Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B` p.3989 `34.7 MSRS IN INTEL PROCESSOR FAMILY (INTEL MICROARCHITECTURE CODE NAME SANDY BRIDGE)` 的 `Table 34-10 (p.3994)` 指出,Register Address 186H ~ 18DH 是 `IA32_PERFEVTSEL` 的暫存器位置。 ![](https://i.imgur.com/uxBc3X9.png) 所以剛才才會指定 `rdmsr 0x186`,因為他是 `IA32_PERFEVTSEL0` 的位置 ### Misleading perf event name! 既然知道怎麼看到現在 event 的 raw code,我們就要來驗證一下,到底 perf list 出來的 events 跟我們想像中的有沒有一樣? ### Reference * [Linux perf #raw counter](http://www.brendangregg.com/perf.html) by Brendan Gregg's * [Intel Sandy Bridge Microarchitecture events](http://oprofile.sourceforge.net/docs/intel-sandybridge-events.php) - 常用 profile events 整理 (可是還是有差,他沒有 event num. 跟 umask value) * [Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B](https://www.cs.uaf.edu/2011/fall/cs301/intel_x64_megadoc_2011.pdf ) - p.3158 `CHAPTER19 Performance-Monitoring Events`, p.3159 `19.2 Performance Monitoring Events for Intel Core Processor 2xxx Series` example of raw counter umask and event num. ![](https://i.imgur.com/u1YQ46c.png) * [About Events for Intel(R) Microarchitecture Code Name Sandy Bridge](https://software.intel.com/sites/products/documentation/doclib/stdxe/2013SP1/amplifierxe/snb/index.htm) (searchable) * [Understanding L1, L2, L3 cache misses](https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/557604) v## 案例分析: phonebook ### 檔案分析 ### main.c `main.c` 分為兩個函數,一個是 `diff_in_second` 負責計算時間差,一個是 `main` 負責運行整著處理程序。 #### 7: `#include IMPL` 會使用到 gcc -DIMPL="" 的 macro,可以參考 gcc --help > -D name=definition > The contents of definition are tokenized and > processed as if they appearedduring translation > phase three in a `#define` directive. > In particular, the definition will be truncated > by embedded newline characters. 大意就是說 gcc -Dname=definition 的東西會被翻譯成 `#define name definition`。透過這樣的方式,我們可以只用一個 `main.c` 檔來運行 orig 或是 opt 版本的 phonebook。 !!必須要注意!! 在 shell 裏面的話 `gcc -DIMPL"\"phonebook_orig.h\""` 雙引號要做跳脫字元,要不然會一直出錯 ``` main.c:7:10: error: #include expects "FILENAME" or <FILENAME> #include IMPL ``` >> 延伸閱讀: [警告:"no newline at end of file"](http://blog.linux.org.tw/~jserv/archives/001933.html) [name=jserv] >> #### 9: `#define DICT_FILE "./dictionary/words.txt"` 這邊定義使用的字典檔,可以看到是使用 /dictionary/words.txt #### 46: `#if defined(__GNUC__)` gcc 特有的 predefine macro,可以用來偵測 compiler 是不是 gcc,搭配下一行 gcc 的 buildin funtion 來使用。 可以拿起手邊的 `tcc` 跟 `clang` 來玩,加個程式碼 ```c= #if defined(__GNUC__) puts("__GNUC__"); __builtin___clear_cache((char *) pHead, (char *) pHead + sizeof(entry)); #endif ``` `gcc -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_gcc` `tcc -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_tcc` `clang -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_clang` 執行 `phonebook_orig_[gcc,tcc,clang]` 只有 `phonebook_orig_gcc` 會輸出 `__GNUC__` #### 47: `__builtin__clear_cache((char *) pHead, (char *) pHead + sizeof(entry))` gcc 的 [builtin function](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) `__builtin__clear_cache`,用來清除從 begin 到 end 之間的 instructions cache。 ``` (gdb) x/136x (char *) pHead 0x602240: 0x00000000 0x00000000 0x00000000 0x00000000 0x602250: 0x00000000 0x00000000 0x00000000 0x00000000 0x602260: 0x00000000 0x00000000 0x00000000 0x00000000 0x602270: 0x00000000 0x00000000 0x00000000 0x00000000 0x602280: 0x00000000 0x00000000 0x00000000 0x00000000 0x602290: 0x00000000 0x00000000 0x00000000 0x00000000 0x6022a0: 0x00000000 0x00000000 0x00000000 0x00000000 0x6022b0: 0x00000000 0x00000000 0x00000000 0x00000000 0x6022c0: 0x00000000 0x00000000 0x00000411 0x00000000 0x6022d0: 0x657a6973 0x20666f20 0x72746e65 0x203a2079 0x6022e0: 0x20363331 0x65747962 0x00000a73 0x00000000 0x6022f0: 0x00000000 0x00000000 0x00000000 0x00000000 0x602300: 0x00000000 0x00000000 0x00000000 0x00000000 ``` ``` 50 clock_gettime(CLOCK_REALTIME, &start); (gdb) x/136x (char *) pHead 0x602240: 0x00000000 0x00000000 0x00000000 0x00000000 0x602250: 0x00000000 0x00000000 0x00000000 0x00000000 0x602260: 0x00000000 0x00000000 0x00000000 0x00000000 0x602270: 0x00000000 0x00000000 0x00000000 0x00000000 0x602290: 0x00000000 0x00000000 0x00000000 0x00000000 0x6022a0: 0x00000000 0x00000000 0x00000000 0x00000000 0x6022b0: 0x00000000 0x00000000 0x00000000 0x00000000 0x6022c0: 0x00000000 0x00000000 0x00000411 0x00000000 0x6022d0: 0x4e475f5f 0x5f5f4355 0x72746e0a 0x203a2079 0x6022e0: 0x20363331 0x65747962 0x00000a73 0x00000000 0x6022f0: 0x00000000 0x00000000 0x00000000 0x00000000 0x602300: 0x00000000 0x00000000 0x00000000 0x00000000 ``` 中間 0x6022d0 的數值被改變了,不知道有何作用。 #### 49-59: `clock_gettime(CLOCK_REALTIME, &start)` 這邊負責 append 資料到 `*e` 裡面 ```c= e = append(line, e) ``` 使用這種方式來 append,可以降低一般 linked list append 需要 O(n) 的時間,可是必須要維護一個 list head,所以可以看到 line 63: `e = pHead`,把 pHead 指向的位置傳給 e,讓 e 指向 list head 的地方。 ![](https://i.imgur.com/8xCb7RO.png) #### 66-80: `char input[MAX_LAST_NAME_SIZE] = "zyxel";` 這邊先段來看 ```c=66 char input[MAX_LAST_NAME_SIZE] = "zyxel";` ``` 這行把等等要找的 last name 設定為 "zyxel",看到 `words.txt` 裏面,zyxel 是倒數第 10 個,代表` findName()` 一定要加班了 (幫QQ) ```c=67 e = pHead ``` 因為 e 在 append 完之後現在在 last element,我們要讓 e 指回 first element。 ```c=69 assert(findName(input, e) && "Did you implement findName() in " IMPL "?"); assert(0 == strcmp(findName(input, e)->lastName, "zyxel")); ``` ```c=73 #if defined(__GNUC__) __builtin___clear_cache((char *) pHead, (char *) pHead + sizeof(entry)); #endif /* compute the execution time */ clock_gettime(CLOCK_REALTIME, &start); findName(input, e); clock_gettime(CLOCK_REALTIME, &end); cpu_time2 = diff_in_second(start, end); ``` ## 未優化版本 `$ make run` ``` size of entry : 136 bytes execution time of append() : 0.046524 sec execution time of findName() : 0.006071 sec ``` basic CPU statistic `$ perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles ./phone_orig` ``` Performance counter stats for './phonebook_orig': 175,357,733 cycles:u 229,437,098 instructions:u # 1.31 insn per cycle 1,301,277 cache-references:u 1,277,761 cache-misses:u # 98.193 % of all cache refs 5,058,973 bus-cycles:u 0.066305491 seconds time elapsed ``` CPU L1 cache staticstic ` $ perf stat -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-prefetch-misses,L1-dcache-store-misses,L1-icache-load-misses ./phonebook_orig` ``` Performance counter stats for './phonebook_orig': 1,315,576 cache-misses:u # 120.087 % of all cache refs (59.07%) 1,095,522 cache-references:u (60.37%) 1,231,509 L1-dcache-load-misses:u # 1.95% of all L1-dcache hits (78.15%) 63,186,159 L1-dcache-loads:u (81.73%) 3,369,852 L1-dcache-prefetch-misses:u (43.53%) 35,271 L1-dcache-store-misses:u (39.19%) 21,174 L1-icache-load-misses:u (56.70%) 0.068884934 seconds time elapsed ``` ## 縮減 entry size !!!WARNING!!! > 開始之前,不要被表了,先看 code,記得把 phonebook_opt.h 的 `#define OPT 1` 註解拿掉,要不然你怎麼跑 `make plot` 都是沒用的bb 將 entry size 從 136 bytes 縮小為 32 bytes,append 與 findName 的效率都有所提升。 `append` 提升 21.97% 速度,`findName` 提升 122.31 % 速度。 ```c typedef struct __LAST_NAME_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; phonebook_detail *detail; struct __LAST_NAME_ENTRY *pNext; } entry; ``` ![](https://i.imgur.com/4DR7hp3.png) orig cache-test ``` Performance counter stats for './phonebook_orig' (100 runs): 1,181,680 cache-misses:u # 95.942 % of all cache refs ( +- 0.49% ) 1,231,662 cache-references:u ( +- 0.41% ) 229,689,183 instructions:u # 1.24 insn per cycle ( +- 0.02% ) 185,425,108 cycles:u ( +- 0.46% ) 0.079698901 seconds time elapsed ( +- 0.83% ) ``` entry size cache-test ``` Performance counter stats for './phonebook_opt' (100 runs): 89,105 cache-misses:u # 32.843 % of all cache refs ( +- 0.98% ) 271,309 cache-references:u ( +- 0.70% ) 233,120,203 instructions:u # 1.62 insn per cycle ( +- 0.02% ) 143,480,575 cycles:u ( +- 0.35% ) 0.053849988 seconds time elapsed ( +- 0.78% ) ``` cache misses 從 95% 下降到 32%!太神啦~ ## Hash table ### Hash function 從記憶裏面找到這個網站: [各种字符串Hash函数比较](https://www.byvoid.com/blog/string-hash-compare) by byvoid 採用推薦的 `BKDR hash function` (就是 K&R,鼎鼎大名的 Brian Kernighan 和 Dennis Ritchie 在《[The C Programming Language](https://en.wikipedia.org/wiki/The_C_Programming_Language)》裏面寫的 hash function: ```c= unsigned int BKDRHash(char *str) { unsigned int seed = 131; unsigned int hash = 0; while (*str) { hash = hash * seed + (*str++); } return (hash & 0x7FFFFFFF); } ``` ### Hash table 因為很喜歡使用 Python,所以決定了解 Python 的 `dict` 是如何實作的,同時運用在這次的作業當中。 參考 [Python Dictionary Implementation](http://www.laurentluce.com/posts/python-dictionary-implementation/) by Laurent Luce,可以發現一件重要的事情: * Python dict 是採用 [open addressing](https://en.wikipedia.org/wiki/Open_addressing) 的方式處理 collision #### Open Addressing 系統程式的老師上課有教過,可是忘記了......。看到 wiki 就想起來了。 Open addressing 是當 hash table 出現撞擊的時候的處理方式,運用不同的探針 (probe) 來決定這個 value 的 index。 Open addressing 最大的好處就是,可以將 hash table 的查詢時間降低為 O(1),以往用 list 來處理 collision 的話會讓查詢時間出現 worse case O(n)。 常見的 probe 以下幾種: * Linear probing: +1 大法 * Quadratic probing : new_index = f(old_index); Python 使用 Quadratic probing,使用的 function 如下: ```c= perturb = hash; index = (index << 2) + index + perturb + 1; perturb >>= PERTURB_SHIFT; ``` PERTURB_SHIFT 預設是 5,不能小於 1。 假設 hash table size = 32,hash = 187875484, index = 28 跑 probing 的結果會是: ``` 28 -> 9 -> 18 -> 11 -> 29 -> 5 -> 31 -> 28 -> 13 -> 2 -> 11 -> 24 -> 25 -> 30 -> 23 -> 20 -> 5 -> 26 -> 3 -> 16 -> 17 -> .... ``` #### Dict Structure Python dict 的 structure 總共用到兩個,一個是 `PyDictEntry` 作為 key value structure、一個是 `PyDictObject` 作為整個 Dict 的 Object 可以在 [svn.python.org](http://svn.python.org/view/python/trunk/Objects/dictobject.c?view=markup) 找到 `dictobject.c` 的原始碼,如果你想要做的話,打開他,你會用到的。 在這邊把他特化為 phonebook 專用的 `DictEntry` 跟 `DictObject` 如下: ```c= typedef struct { unsigned int hash; char lastName[MAX_LAST_NAME_SIZE]; phonebook_detail *detail; } DictEntry; typedef struct _dictobject DictObject; struct _dictobject { unsigned int ma_mask; int ma_used; int ma_fill; int ma_size; DictEntry *ma_table; DictEntry *(*ma_lookup)(DictObject *mp, char key[], unsigned int hash); }; ``` * `unsigned int ma_mask`: hash mask, size - 1 * `ma_used`:目前使用的 slot 數量 * `ma_fill`:目前使用的 slot + dummy slot,dummy slot 是刪除 slot 之後,會將 slot 設定為 dummy slot * `ma_size`:目前 ma_table 的大小 * `ma_table`:hash table * `ma_lookup`:lookup function #### Dict function ```c= DictObject *Dict_New(); int Dict_SetItem(DictObject *mp, char key[]); int Dict_Lookup(DictObject *mp, char key[]); DictEntry *lookdict(DictObject *mp, char key[], unsigned int hash); int insertdict(DictObject *mp, char key[], const unsigned int hash); void insertdict_clean(DictObject *mp, char key[], unsigned int hash, phonebook_detail *value); int dictresize(DictObject *mp, int minused); unsigned int BKDRHash(char *s); ``` ### Hashtable 效能 ``` Performance counter stats for './phonebook_opt' (100 runs): 1,553,358 cache-misses:u # 68.207 % of all cache refs ( +- 0.28% ) 2,277,402 cache-references:u ( +- 0.20% ) 270,305,408 instructions:u # 0.62 insn per cycle ( +- 0.02% ) 437,887,064 cycles:u ( +- 0.34% ) 0.162588847 seconds time elapsed ( +- 0.60% ) ``` ![](https://i.imgur.com/2iq4E05.png) * cache-misses 提升到 68.207% * append 時間也上升到 0.147秒,跟 orig 相比慢了3倍 * findName 時間降到 0 秒 電話簿這種東西建完就可以用很久了,findName 降到 0 應該滿棒的啊! ## Performance 比較 - abandoned * append() all word in `dictionary/words.txt` * `findName("zyxel")` 100 times 這不是個好的實驗設計。 ``` Performance counter stats for './phonebook_orig' (100 runs): 42,431,897 cache-misses:u # 99.071 % of all cache refs ( +- 0.11% ) 42,829,855 cache-references:u ( +- 0.09% ) 2,164,254,681 instructions:u # 1.04 insn per cycle ( +- 0.00% ) 2,079,356,180 cycles:u ( +- 0.42% ) 0.653479071 seconds time elapsed ( +- 0.27% ) ``` ``` Performance counter stats for './phonebook_opt' (100 runs): 2,481,889 cache-misses # 33.237 % of all cache refs ( +- 0.36% ) 7,467,180 cache-references ( +- 0.29% ) 2,167,727,565 instructions # 2.21 insn per cycle ( +- 0.00% ) 981,860,196 cycles ( +- 0.18% ) 0.410072312 seconds time elapsed ( +- 1.42% ) ``` ``` Performance counter stats for './phonebook_opt_hash' (100 runs): 1,556,307 cache-misses # 68.441 % of all cache refs ( +- 0.43% ) 2,273,932 cache-references ( +- 0.21% ) 270,227,846 instructions # 0.65 insn per cycle ( +- 0.02% ) 412,602,985 cycles ( +- 0.29% ) 0.211090389 seconds time elapsed ( +- 1.22% ) ``` ![](https://i.imgur.com/ECYVJE8.png) ## Performance - Intel Hierarchical Top-Down Performance Characterization Methodology * Reference from: http://www.brendangregg.com/methodology.html * Are UOPs issued? * If yes: * Are UOPs retired? * If yes: retiring (good) * If no: investigate bad speculations * If no: * Allocation stall? * If yes: investigate back-end stalls * If no: investigate front-end stalls ### CPU front-end 意義 簡單來說前端就是:取得指令、以及把指令解碼成微指令。sandy bridge 系列可以一次傳 4 個 uops 給後端 [中文翻譯](https://blog.louie.lu/2016/09/07/pipeline-speak-learning-more-about-intel-microarchitechture-codename-sandy-bridge-%E4%B8%AD%E6%96%87%E7%BF%BB%E8%AD%AF/) by louielu ### CPU back-end 意義 簡單來說後端就是:執行 uops,有 parallelism [中文翻譯](https://blog.louie.lu/2016/09/07/pipeline-speak-part-2-the-second-part-of-the-sandy-bridge-pipeline-%E4%B8%AD%E6%96%87%E7%BF%BB%E8%AD%AF/) by louielu ### phonebook case ``` $ perf stat -d ./phonebook_oirg Performance counter stats for './phonebook_orig': 66.166969 task-clock:u (msec) # 0.995 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 12,346 page-faults:u # 0.187 M/sec 113,680,547 cycles:u # 1.718 GHz (34.24%) 82,373,177 stalled-cycles-frontend:u # 72.46% frontend cycles idle (48.06%) 64,255,629 stalled-cycles-backend:u # 56.52% backend cycles idle (50.65%) 186,811,699 instructions:u # 1.64 insn per cycle # 0.44 stalled cycles per insn (62.39%) 34,733,972 branches:u # 524.944 M/sec (63.81%) 789,273 branch-misses:u # 2.27% of all branches (62.64%) 49,760,981 L1-dcache-loads:u # 752.052 M/sec (44.50%) 56,790 L1-dcache-load-misses:u # 0.11% of all L1-dcache hits (18.82%) 4,932 LLC-loads:u # 0.075 M/sec (23.20%) <not supported> LLC-load-misses:u 0.066500871 seconds time elapsed ``` ``` $ perf stat -d ./phonebook_opt Performance counter stats for './phonebook_opt': 41.165953 task-clock:u (msec) # 0.993 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 4,147 page-faults:u # 0.101 M/sec 89,513,769 cycles:u # 2.174 GHz (30.69%) 47,728,782 stalled-cycles-frontend:u # 53.32% frontend cycles idle (45.97%) 30,544,573 stalled-cycles-backend:u # 34.12% backend cycles idle (50.45%) 186,816,945 instructions:u # 2.09 insn per cycle # 0.26 stalled cycles per insn (61.88%) 35,932,936 branches:u # 872.880 M/sec (63.67%) 848,250 branch-misses:u # 2.36% of all branches (70.17%) 54,843,255 L1-dcache-loads:u # 1332.248 M/sec (41.03%) 26,195 L1-dcache-load-misses:u # 0.05% of all L1-dcache hits (15.29%) 2,297 LLC-loads:u # 0.056 M/sec (22.03%) <not supported> LLC-load-misses:u 0.041471519 seconds time elapsed ``` ## A Re think of WHY? * 在不改變讀取方式以及 append code 的狀況下,為什麼降低 struct size 可以增加 append 效能? * sizeof((entry *) 0x0) -> 8 * entry pointer 的大小不會改變啊。 * 推測是 malloc size 對於速度有差異 * proof it. ## pthread boss/worker model [pthreads-example/example3-boss-worker-model.c](https://github.com/fbroom/pthreads-examples/blob/master/example3-boss-worker-model.c)