# 2016q3 Homework1 (phonebook)
contributed by <`louielu`>
### Reviewed by `jserv`
* 沒有嘗試不同的 data set,原本的程式輸入是已排序、非典型英文姓氏,這與現實不匹配
* 實作提到透過引入 hash 加速 `fineName()` 的操作,但缺乏不同 hash function 的效能比較和設計取捨
* 在 `append()` 中,`malloc()` 是個顯著的時間開銷,缺乏減緩效能衝擊的方案,而且沒考慮到 `malloc()` 失敗的情況
![image](https://hackmd.io/_uploads/B1ArXW4pT.png)
* 在上圖的環境中,可用記憶體不足以載入 35 萬筆電話資料,於是連 `phonebook_orig` 執行都會失敗:
```shell
$ ./phonebook_orig
size of entry : 136 bytes
程式記憶體區段錯誤
```
* 卻乏搜尋演算法的評估和效能分析
* 考慮到電話簿需要作到動態資料新增和刪除,若引入 hash,面對大量資料時,會有什麼影響?
* 儘管已經整理頗多 perf 和效能測量的資料,但並未反映到此程式效能分析,除了 cache miss,還請一併探討 branch prediction accuracy 等議題
* `main.c` 無法透過 function pointer 來切換和比較不同實作的效能落差,應該先設計一份可通用的軟體界面,然後將 binary tree, hash table, trie 等不同實作機制加入
* 將 `append()` 和 `findName()` 時間加入統計的意義不大,真實應用往往是個別操作,特別在圖表的呈現
## 開發環境
* CPU: Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
* MEM: 8GB
* Cache:
* L1d cache: 32K
* L1i cache: 32K
* L2 cache: 256K
* L3 cache: 4096K
![](https://i.imgur.com/jx9h9w8.png)
* [Portable Hardware Locality](https://www.open-mpi.org/projects/hwloc/) (hwloc)
## 開發目標
* 精通 perf
* 改善 phonebook 效能
* findName cache miss
* append()
* fuzzy
## perf
* [perf Examples](http://www.brendangregg.com/perf.html) by Brendan Gregg
* [lmbench](http://www.bitmover.com/lmbench/)
* [如何有效測量 performance ? (FB: embedded2016)](https://www.facebook.com/groups/system.software2016/permalink/1124571464289024/)
### perf - raw counter
格式:`rUUEE` (U = Umask Value, E = Event num)
範例:
* Umask: 02H
* Event: 11H
* Description:`Counts 256-bit packed double-percision floating-polint insturctions`
* Event Mask Mnemonic:`SIMD_FP_256.PACKED_DOUBLE`
使用指令:`$ taskset -c 1 perf stat -e r0211 ./time_test_avx` (借用一下 compute-pi, -c 指定 core 1)
讀取register確認:`$ sudo rdmsr -p 1 0x186; output 110211` (-p 指定 core 1)
perf 輸出:
```
Performance counter stats for './time_test_avx':
678,616,723 r0211:u
1.698941816 seconds time elapsed
```
換一個沒有用到 SIMD 的
`$ taskset -c 1 perf stat -e r0211 ./time_test_baseline`
perf 輸出:
```
Performance counter stats for './time_test_baseline':
0 r0211:u
5.186517630 seconds time elapsed
```
明顯會是 0 (不是 0 就是你打錯raw counter,或是你用的不是 sandy bridge)
### Reading PERV_EVT_SEL MSRs (Model-specific register) value
`rdmsr` 這個工具可以用來讀取 CPU MSRs (Model-specific register) 的資料。
根據 `Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B` p.3989 `34.7 MSRS IN INTEL PROCESSOR FAMILY (INTEL MICROARCHITECTURE CODE NAME SANDY BRIDGE)` 的 `Table 34-10 (p.3994)` 指出,Register Address 186H ~ 18DH 是 `IA32_PERFEVTSEL` 的暫存器位置。
![](https://i.imgur.com/uxBc3X9.png)
所以剛才才會指定 `rdmsr 0x186`,因為他是 `IA32_PERFEVTSEL0` 的位置
### Misleading perf event name!
既然知道怎麼看到現在 event 的 raw code,我們就要來驗證一下,到底 perf list 出來的 events 跟我們想像中的有沒有一樣?
### Reference
* [Linux perf #raw counter](http://www.brendangregg.com/perf.html) by Brendan Gregg's
* [Intel Sandy Bridge Microarchitecture events](http://oprofile.sourceforge.net/docs/intel-sandybridge-events.php) - 常用 profile events 整理 (可是還是有差,他沒有 event num. 跟 umask value)
* [Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B](https://www.cs.uaf.edu/2011/fall/cs301/intel_x64_megadoc_2011.pdf
) - p.3158 `CHAPTER19 Performance-Monitoring Events`, p.3159 `19.2 Performance Monitoring Events for Intel Core Processor 2xxx Series`
example of raw counter umask and event num.
![](https://i.imgur.com/u1YQ46c.png)
* [About Events for Intel(R) Microarchitecture Code Name Sandy Bridge](https://software.intel.com/sites/products/documentation/doclib/stdxe/2013SP1/amplifierxe/snb/index.htm) (searchable)
* [Understanding L1, L2, L3 cache misses](https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/557604)
v## 案例分析: phonebook
### 檔案分析
### main.c
`main.c` 分為兩個函數,一個是 `diff_in_second` 負責計算時間差,一個是 `main` 負責運行整著處理程序。
#### 7: `#include IMPL`
會使用到 gcc -DIMPL="" 的 macro,可以參考 gcc --help
> -D name=definition
> The contents of definition are tokenized and
> processed as if they appearedduring translation
> phase three in a `#define` directive.
> In particular, the definition will be truncated
> by embedded newline characters.
大意就是說 gcc -Dname=definition 的東西會被翻譯成 `#define name definition`。透過這樣的方式,我們可以只用一個 `main.c` 檔來運行 orig 或是 opt 版本的 phonebook。
!!必須要注意!!
在 shell 裏面的話 `gcc -DIMPL"\"phonebook_orig.h\""` 雙引號要做跳脫字元,要不然會一直出錯
```
main.c:7:10: error: #include expects "FILENAME" or <FILENAME>
#include IMPL
```
>> 延伸閱讀: [警告:"no newline at end of file"](http://blog.linux.org.tw/~jserv/archives/001933.html) [name=jserv]
>>
#### 9: `#define DICT_FILE "./dictionary/words.txt"`
這邊定義使用的字典檔,可以看到是使用 /dictionary/words.txt
#### 46: `#if defined(__GNUC__)`
gcc 特有的 predefine macro,可以用來偵測 compiler 是不是 gcc,搭配下一行 gcc 的 buildin funtion 來使用。
可以拿起手邊的 `tcc` 跟 `clang` 來玩,加個程式碼
```c=
#if defined(__GNUC__)
puts("__GNUC__");
__builtin___clear_cache((char *) pHead,
(char *) pHead + sizeof(entry));
#endif
```
`gcc -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_gcc`
`tcc -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_tcc`
`clang -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_clang`
執行 `phonebook_orig_[gcc,tcc,clang]` 只有 `phonebook_orig_gcc` 會輸出 `__GNUC__`
#### 47: `__builtin__clear_cache((char *) pHead, (char *) pHead + sizeof(entry))`
gcc 的 [builtin function](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) `__builtin__clear_cache`,用來清除從 begin 到 end 之間的 instructions cache。
```
(gdb) x/136x (char *) pHead
0x602240: 0x00000000 0x00000000 0x00000000 0x00000000
0x602250: 0x00000000 0x00000000 0x00000000 0x00000000
0x602260: 0x00000000 0x00000000 0x00000000 0x00000000
0x602270: 0x00000000 0x00000000 0x00000000 0x00000000
0x602280: 0x00000000 0x00000000 0x00000000 0x00000000
0x602290: 0x00000000 0x00000000 0x00000000 0x00000000
0x6022a0: 0x00000000 0x00000000 0x00000000 0x00000000
0x6022b0: 0x00000000 0x00000000 0x00000000 0x00000000
0x6022c0: 0x00000000 0x00000000 0x00000411 0x00000000
0x6022d0: 0x657a6973 0x20666f20 0x72746e65 0x203a2079
0x6022e0: 0x20363331 0x65747962 0x00000a73 0x00000000
0x6022f0: 0x00000000 0x00000000 0x00000000 0x00000000
0x602300: 0x00000000 0x00000000 0x00000000 0x00000000
```
```
50 clock_gettime(CLOCK_REALTIME, &start);
(gdb) x/136x (char *) pHead
0x602240: 0x00000000 0x00000000 0x00000000 0x00000000
0x602250: 0x00000000 0x00000000 0x00000000 0x00000000
0x602260: 0x00000000 0x00000000 0x00000000 0x00000000
0x602270: 0x00000000 0x00000000 0x00000000 0x00000000
0x602290: 0x00000000 0x00000000 0x00000000 0x00000000
0x6022a0: 0x00000000 0x00000000 0x00000000 0x00000000
0x6022b0: 0x00000000 0x00000000 0x00000000 0x00000000
0x6022c0: 0x00000000 0x00000000 0x00000411 0x00000000
0x6022d0: 0x4e475f5f 0x5f5f4355 0x72746e0a 0x203a2079
0x6022e0: 0x20363331 0x65747962 0x00000a73 0x00000000
0x6022f0: 0x00000000 0x00000000 0x00000000 0x00000000
0x602300: 0x00000000 0x00000000 0x00000000 0x00000000
```
中間 0x6022d0 的數值被改變了,不知道有何作用。
#### 49-59: `clock_gettime(CLOCK_REALTIME, &start)`
這邊負責 append 資料到 `*e` 裡面
```c=
e = append(line, e)
```
使用這種方式來 append,可以降低一般 linked list append 需要 O(n) 的時間,可是必須要維護一個 list head,所以可以看到 line 63: `e = pHead`,把 pHead 指向的位置傳給 e,讓 e 指向 list head 的地方。
![](https://i.imgur.com/8xCb7RO.png)
#### 66-80: `char input[MAX_LAST_NAME_SIZE] = "zyxel";`
這邊先段來看
```c=66
char input[MAX_LAST_NAME_SIZE] = "zyxel";`
```
這行把等等要找的 last name 設定為 "zyxel",看到 `words.txt` 裏面,zyxel 是倒數第 10 個,代表` findName()` 一定要加班了 (幫QQ)
```c=67
e = pHead
```
因為 e 在 append 完之後現在在 last element,我們要讓 e 指回 first element。
```c=69
assert(findName(input, e) &&
"Did you implement findName() in " IMPL "?");
assert(0 == strcmp(findName(input, e)->lastName, "zyxel"));
```
```c=73
#if defined(__GNUC__)
__builtin___clear_cache((char *) pHead, (char *) pHead + sizeof(entry));
#endif
/* compute the execution time */
clock_gettime(CLOCK_REALTIME, &start);
findName(input, e);
clock_gettime(CLOCK_REALTIME, &end);
cpu_time2 = diff_in_second(start, end);
```
## 未優化版本
`$ make run`
```
size of entry : 136 bytes
execution time of append() : 0.046524 sec
execution time of findName() : 0.006071 sec
```
basic CPU statistic
`$ perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles ./phone_orig`
```
Performance counter stats for './phonebook_orig':
175,357,733 cycles:u
229,437,098 instructions:u # 1.31 insn per cycle
1,301,277 cache-references:u
1,277,761 cache-misses:u # 98.193 % of all cache refs
5,058,973 bus-cycles:u
0.066305491 seconds time elapsed
```
CPU L1 cache staticstic
` $ perf stat -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-prefetch-misses,L1-dcache-store-misses,L1-icache-load-misses ./phonebook_orig`
```
Performance counter stats for './phonebook_orig':
1,315,576 cache-misses:u # 120.087 % of all cache refs (59.07%)
1,095,522 cache-references:u (60.37%)
1,231,509 L1-dcache-load-misses:u # 1.95% of all L1-dcache hits (78.15%)
63,186,159 L1-dcache-loads:u (81.73%)
3,369,852 L1-dcache-prefetch-misses:u (43.53%)
35,271 L1-dcache-store-misses:u (39.19%)
21,174 L1-icache-load-misses:u (56.70%)
0.068884934 seconds time elapsed
```
## 縮減 entry size
!!!WARNING!!!
> 開始之前,不要被表了,先看 code,記得把 phonebook_opt.h 的 `#define OPT 1` 註解拿掉,要不然你怎麼跑 `make plot` 都是沒用的bb
將 entry size 從 136 bytes 縮小為 32 bytes,append 與 findName 的效率都有所提升。
`append` 提升 21.97% 速度,`findName` 提升 122.31 % 速度。
```c
typedef struct __LAST_NAME_ENTRY {
char lastName[MAX_LAST_NAME_SIZE];
phonebook_detail *detail;
struct __LAST_NAME_ENTRY *pNext;
} entry;
```
![](https://i.imgur.com/4DR7hp3.png)
orig cache-test
```
Performance counter stats for './phonebook_orig' (100 runs):
1,181,680 cache-misses:u # 95.942 % of all cache refs ( +- 0.49% )
1,231,662 cache-references:u ( +- 0.41% )
229,689,183 instructions:u # 1.24 insn per cycle ( +- 0.02% )
185,425,108 cycles:u ( +- 0.46% )
0.079698901 seconds time elapsed ( +- 0.83% )
```
entry size cache-test
```
Performance counter stats for './phonebook_opt' (100 runs):
89,105 cache-misses:u # 32.843 % of all cache refs ( +- 0.98% )
271,309 cache-references:u ( +- 0.70% )
233,120,203 instructions:u # 1.62 insn per cycle ( +- 0.02% )
143,480,575 cycles:u ( +- 0.35% )
0.053849988 seconds time elapsed ( +- 0.78% )
```
cache misses 從 95% 下降到 32%!太神啦~
## Hash table
### Hash function
從記憶裏面找到這個網站:
[各种字符串Hash函数比较](https://www.byvoid.com/blog/string-hash-compare) by byvoid
採用推薦的 `BKDR hash function` (就是 K&R,鼎鼎大名的 Brian Kernighan 和 Dennis Ritchie 在《[The C Programming Language](https://en.wikipedia.org/wiki/The_C_Programming_Language)》裏面寫的 hash function:
```c=
unsigned int BKDRHash(char *str)
{
unsigned int seed = 131;
unsigned int hash = 0;
while (*str) {
hash = hash * seed + (*str++);
}
return (hash & 0x7FFFFFFF);
}
```
### Hash table
因為很喜歡使用 Python,所以決定了解 Python 的 `dict` 是如何實作的,同時運用在這次的作業當中。
參考 [Python Dictionary Implementation](http://www.laurentluce.com/posts/python-dictionary-implementation/) by Laurent Luce,可以發現一件重要的事情:
* Python dict 是採用 [open addressing](https://en.wikipedia.org/wiki/Open_addressing) 的方式處理 collision
#### Open Addressing
系統程式的老師上課有教過,可是忘記了......。看到 wiki 就想起來了。
Open addressing 是當 hash table 出現撞擊的時候的處理方式,運用不同的探針 (probe) 來決定這個 value 的 index。
Open addressing 最大的好處就是,可以將 hash table 的查詢時間降低為 O(1),以往用 list 來處理 collision 的話會讓查詢時間出現 worse case O(n)。
常見的 probe 以下幾種:
* Linear probing: +1 大法
* Quadratic probing : new_index = f(old_index);
Python 使用 Quadratic probing,使用的 function 如下:
```c=
perturb = hash;
index = (index << 2) + index + perturb + 1;
perturb >>= PERTURB_SHIFT;
```
PERTURB_SHIFT 預設是 5,不能小於 1。
假設 hash table size = 32,hash = 187875484, index = 28
跑 probing 的結果會是:
```
28 -> 9 -> 18 -> 11 -> 29 -> 5 -> 31 -> 28 ->
13 -> 2 -> 11 -> 24 -> 25 -> 30 -> 23 -> 20 ->
5 -> 26 -> 3 -> 16 -> 17 -> ....
```
#### Dict Structure
Python dict 的 structure 總共用到兩個,一個是 `PyDictEntry` 作為 key value structure、一個是 `PyDictObject` 作為整個 Dict 的 Object
可以在 [svn.python.org](http://svn.python.org/view/python/trunk/Objects/dictobject.c?view=markup) 找到 `dictobject.c` 的原始碼,如果你想要做的話,打開他,你會用到的。
在這邊把他特化為 phonebook 專用的 `DictEntry` 跟 `DictObject` 如下:
```c=
typedef struct {
unsigned int hash;
char lastName[MAX_LAST_NAME_SIZE];
phonebook_detail *detail;
} DictEntry;
typedef struct _dictobject DictObject;
struct _dictobject {
unsigned int ma_mask;
int ma_used;
int ma_fill;
int ma_size;
DictEntry *ma_table;
DictEntry *(*ma_lookup)(DictObject *mp,
char key[],
unsigned int hash);
};
```
* `unsigned int ma_mask`: hash mask, size - 1
* `ma_used`:目前使用的 slot 數量
* `ma_fill`:目前使用的 slot + dummy slot,dummy slot 是刪除 slot 之後,會將 slot 設定為 dummy slot
* `ma_size`:目前 ma_table 的大小
* `ma_table`:hash table
* `ma_lookup`:lookup function
#### Dict function
```c=
DictObject *Dict_New();
int Dict_SetItem(DictObject *mp, char key[]);
int Dict_Lookup(DictObject *mp, char key[]);
DictEntry *lookdict(DictObject *mp, char key[], unsigned int hash);
int insertdict(DictObject *mp, char key[], const unsigned int hash);
void insertdict_clean(DictObject *mp, char key[],
unsigned int hash, phonebook_detail *value);
int dictresize(DictObject *mp, int minused);
unsigned int BKDRHash(char *s);
```
### Hashtable 效能
```
Performance counter stats for './phonebook_opt' (100 runs):
1,553,358 cache-misses:u # 68.207 % of all cache refs ( +- 0.28% )
2,277,402 cache-references:u ( +- 0.20% )
270,305,408 instructions:u # 0.62 insn per cycle ( +- 0.02% )
437,887,064 cycles:u ( +- 0.34% )
0.162588847 seconds time elapsed ( +- 0.60% )
```
![](https://i.imgur.com/2iq4E05.png)
* cache-misses 提升到 68.207%
* append 時間也上升到 0.147秒,跟 orig 相比慢了3倍
* findName 時間降到 0 秒
電話簿這種東西建完就可以用很久了,findName 降到 0 應該滿棒的啊!
## Performance 比較 - abandoned
* append() all word in `dictionary/words.txt`
* `findName("zyxel")` 100 times
這不是個好的實驗設計。
```
Performance counter stats for './phonebook_orig' (100 runs):
42,431,897 cache-misses:u # 99.071 % of all cache refs ( +- 0.11% )
42,829,855 cache-references:u ( +- 0.09% )
2,164,254,681 instructions:u # 1.04 insn per cycle ( +- 0.00% )
2,079,356,180 cycles:u ( +- 0.42% )
0.653479071 seconds time elapsed ( +- 0.27% )
```
```
Performance counter stats for './phonebook_opt' (100 runs):
2,481,889 cache-misses # 33.237 % of all cache refs ( +- 0.36% )
7,467,180 cache-references ( +- 0.29% )
2,167,727,565 instructions # 2.21 insn per cycle ( +- 0.00% )
981,860,196 cycles ( +- 0.18% )
0.410072312 seconds time elapsed ( +- 1.42% )
```
```
Performance counter stats for './phonebook_opt_hash' (100 runs):
1,556,307 cache-misses # 68.441 % of all cache refs ( +- 0.43% )
2,273,932 cache-references ( +- 0.21% )
270,227,846 instructions # 0.65 insn per cycle ( +- 0.02% )
412,602,985 cycles ( +- 0.29% )
0.211090389 seconds time elapsed ( +- 1.22% )
```
![](https://i.imgur.com/ECYVJE8.png)
## Performance - Intel Hierarchical Top-Down Performance Characterization Methodology
* Reference from: http://www.brendangregg.com/methodology.html
* Are UOPs issued?
* If yes:
* Are UOPs retired?
* If yes: retiring (good)
* If no: investigate bad speculations
* If no:
* Allocation stall?
* If yes: investigate back-end stalls
* If no: investigate front-end stalls
### CPU front-end 意義
簡單來說前端就是:取得指令、以及把指令解碼成微指令。sandy bridge 系列可以一次傳 4 個 uops 給後端
[中文翻譯](https://blog.louie.lu/2016/09/07/pipeline-speak-learning-more-about-intel-microarchitechture-codename-sandy-bridge-%E4%B8%AD%E6%96%87%E7%BF%BB%E8%AD%AF/) by louielu
### CPU back-end 意義
簡單來說後端就是:執行 uops,有 parallelism
[中文翻譯](https://blog.louie.lu/2016/09/07/pipeline-speak-part-2-the-second-part-of-the-sandy-bridge-pipeline-%E4%B8%AD%E6%96%87%E7%BF%BB%E8%AD%AF/) by louielu
### phonebook case
```
$ perf stat -d ./phonebook_oirg
Performance counter stats for './phonebook_orig':
66.166969 task-clock:u (msec) # 0.995 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
12,346 page-faults:u # 0.187 M/sec
113,680,547 cycles:u # 1.718 GHz (34.24%)
82,373,177 stalled-cycles-frontend:u # 72.46% frontend cycles idle (48.06%)
64,255,629 stalled-cycles-backend:u # 56.52% backend cycles idle (50.65%)
186,811,699 instructions:u # 1.64 insn per cycle
# 0.44 stalled cycles per insn (62.39%)
34,733,972 branches:u # 524.944 M/sec (63.81%)
789,273 branch-misses:u # 2.27% of all branches (62.64%)
49,760,981 L1-dcache-loads:u # 752.052 M/sec (44.50%)
56,790 L1-dcache-load-misses:u # 0.11% of all L1-dcache hits (18.82%)
4,932 LLC-loads:u # 0.075 M/sec (23.20%)
<not supported> LLC-load-misses:u
0.066500871 seconds time elapsed
```
```
$ perf stat -d ./phonebook_opt
Performance counter stats for './phonebook_opt':
41.165953 task-clock:u (msec) # 0.993 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
4,147 page-faults:u # 0.101 M/sec
89,513,769 cycles:u # 2.174 GHz (30.69%)
47,728,782 stalled-cycles-frontend:u # 53.32% frontend cycles idle (45.97%)
30,544,573 stalled-cycles-backend:u # 34.12% backend cycles idle (50.45%)
186,816,945 instructions:u # 2.09 insn per cycle
# 0.26 stalled cycles per insn (61.88%)
35,932,936 branches:u # 872.880 M/sec (63.67%)
848,250 branch-misses:u # 2.36% of all branches (70.17%)
54,843,255 L1-dcache-loads:u # 1332.248 M/sec (41.03%)
26,195 L1-dcache-load-misses:u # 0.05% of all L1-dcache hits (15.29%)
2,297 LLC-loads:u # 0.056 M/sec (22.03%)
<not supported> LLC-load-misses:u
0.041471519 seconds time elapsed
```
## A Re think of WHY?
* 在不改變讀取方式以及 append code 的狀況下,為什麼降低 struct size 可以增加 append 效能?
* sizeof((entry *) 0x0) -> 8
* entry pointer 的大小不會改變啊。
* 推測是 malloc size 對於速度有差異
* proof it.
## pthread boss/worker model
[pthreads-example/example3-boss-worker-model.c](https://github.com/fbroom/pthreads-examples/blob/master/example3-boss-worker-model.c)