contributed by <louielu>

Feedback from jserv:
- The report exercises findName(), but lacks a performance comparison of different hash functions and a discussion of the design trade-offs.
- In append(), malloc() is a significant time cost; there is no plan for reducing its performance impact, and the case where malloc() fails is not handled.
- phonebook_orig crashes on every run:
  $ ./phonebook_orig
  size of entry : 136 bytes
  Segmentation fault
- main.c cannot switch between implementations through function pointers to compare their performance; a common software interface should be designed first, and then binary tree, hash table, trie and other implementations plugged into it.
- Adding the times of append() and findName() into one statistic is not very meaningful; real applications usually perform these operations separately, and this matters especially in how the charts are presented.

Format: rUUEE (U = Umask value, E = Event number)
Example: SIMD_FP_256.PACKED_DOUBLE counts 256-bit packed double-precision floating-point instructions (event number 11H, umask 02H, hence r0211).
Command used: $ taskset -c 1 perf stat -e r0211 ./time_test_avx
(borrowing the compute-pi test program; -c pins it to core 1)
Read the MSR back to confirm: $ sudo rdmsr -p 1 0x186; output 110211
(-p selects core 1)
perf output:
Performance counter stats for './time_test_avx':
678,616,723 r0211:u
1.698941816 seconds time elapsed
Now switch to a build that does not use SIMD:
$ taskset -c 1 perf stat -e r0211 ./time_test_baseline
perf output:
Performance counter stats for './time_test_baseline':
0 r0211:u
5.186517630 seconds time elapsed
The count is clearly 0 (if it is not 0, you either typed the wrong raw counter or you are not on Sandy Bridge).
rdmsr
This tool reads the CPU's MSRs (model-specific registers).
According to the Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B,
p.3989, 34.7 "MSRS IN INTEL PROCESSOR FAMILY (INTEL MICROARCHITECTURE CODE NAME SANDY BRIDGE)",
Table 34-10 (p.3994), register addresses 186H ~ 18DH are the IA32_PERFEVTSEL registers.
That is why rdmsr 0x186 was used above: 0x186 is the address of IA32_PERFEVTSEL0.
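As a sanity check, the value 0x110211 read back above can be decoded by hand. A minimal sketch, assuming only the IA32_PERFEVTSELx field layout from the SDM (event select in bits [7:0], umask in bits [15:8], USR/OS/EN flags above them):

```c
#include <stdio.h>
#include <stdint.h>

/* Decode an IA32_PERFEVTSELx value into its fields (layout per the SDM). */
int main(void)
{
    uint64_t msr = 0x110211;  /* value read back with rdmsr above */

    printf("event select : 0x%02llx\n", (unsigned long long)(msr & 0xff));
    printf("umask        : 0x%02llx\n", (unsigned long long)((msr >> 8) & 0xff));
    printf("USR (bit 16) : %llu\n", (unsigned long long)((msr >> 16) & 1));
    printf("OS  (bit 17) : %llu\n", (unsigned long long)((msr >> 17) & 1));
    printf("EN  (bit 22) : %llu\n", (unsigned long long)((msr >> 22) & 1));
    return 0;
}
```

Event select 0x11 and umask 0x02 match the r0211 raw counter we asked for, and the USR bit matches the :u suffix perf reports.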
Now that we know how to read the raw code of the currently programmed event, let's verify whether the events reported by perf list really match what we expect.
Chapter 19 "Performance-Monitoring Events", p.3159, 19.2 "Performance Monitoring Events for Intel Core Processor 2xxx Series" lists the umask and event number for each event, which is exactly what the raw counter encoding is built from.
## Case Study: phonebook
main.c
main.c consists of two functions: diff_in_second, which computes the elapsed time between two timestamps, and main, which drives the whole process.
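For reference, a timespec-difference helper along these lines; this is a sketch of what diff_in_second() typically looks like, not necessarily the exact code in main.c:

```c
#include <time.h>

/* Return the elapsed time between t1 and t2 in seconds. */
static double diff_in_second(struct timespec t1, struct timespec t2)
{
    struct timespec diff;
    if (t2.tv_nsec - t1.tv_nsec < 0) {
        diff.tv_sec = t2.tv_sec - t1.tv_sec - 1;
        diff.tv_nsec = t2.tv_nsec - t1.tv_nsec + 1000000000;
    } else {
        diff.tv_sec = t2.tv_sec - t1.tv_sec;
        diff.tv_nsec = t2.tv_nsec - t1.tv_nsec;
    }
    return (diff.tv_sec + diff.tv_nsec / 1000000000.0);
}
```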
#include IMPL
This relies on the macro passed on the command line with gcc -DIMPL="..."; see the gcc documentation for -D:
-D name=definition
    The contents of definition are tokenized and processed as if they
    appeared during translation phase three in a #define directive.
    In particular, the definition will be truncated by embedded newline
    characters.
In short, gcc -Dname=definition behaves as if the source contained #define name definition. This way a single main.c can build and run either the orig or the opt version of the phonebook.
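A hypothetical minimal demo of the trick (the file name demo.c is made up; it assumes phonebook_orig.h is in the include path and defines entry):

```c
/* demo.c - built with: gcc -DIMPL="\"phonebook_orig.h\"" demo.c -o demo
 * The -D option makes the #include below expand exactly as if the file
 * contained: #define IMPL "phonebook_orig.h" */
#include <stdio.h>
#include IMPL

int main(void)
{
    printf("size of entry : %zu bytes\n", sizeof(entry));
    return 0;
}
```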
!! Important !!
In the shell, the inner double quotes of gcc -DIMPL="\"phonebook_orig.h\"" must be escaped, otherwise the build keeps failing with:
main.c:7:10: error: #include expects "FILENAME" or <FILENAME>
#include IMPL
Further reading: warning: "no newline at end of file" by jserv
#define DICT_FILE "./dictionary/words.txt"
This defines the dictionary file to use; as shown, it is ./dictionary/words.txt.
#if defined(__GNUC__)
This is a gcc predefined macro that can be used to detect whether the compiler is gcc; here it guards the gcc builtin function used on the next line.
You can grab tcc and clang and experiment: add this code,
#if defined(__GNUC__)
puts("__GNUC__");
__builtin___clear_cache((char *) pHead,
(char *) pHead + sizeof(entry));
#endif
gcc -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_gcc
tcc -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_tcc
clang -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_clang
Running phonebook_orig_[gcc,tcc,clang], only phonebook_orig_gcc prints __GNUC__.
__builtin___clear_cache((char *) pHead, (char *) pHead + sizeof(entry))
__builtin___clear_cache is a gcc builtin function that flushes the instruction cache for the region between begin and end.
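For context, this builtin is normally needed after writing machine code into memory at run time (e.g. in a JIT), so stale bytes are not fetched from the instruction cache. A minimal sketch of that typical use, with a made-up helper name:

```c
#include <string.h>

/* Copy freshly generated machine code into buf, then flush the
 * instruction cache for that range before jumping into it. */
void publish_code(char *buf, const char *code, size_t len)
{
    memcpy(buf, code, len);
    __builtin___clear_cache(buf, buf + len);
}
```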
(gdb) x/136x (char *) pHead
0x602240: 0x00000000 0x00000000 0x00000000 0x00000000
0x602250: 0x00000000 0x00000000 0x00000000 0x00000000
0x602260: 0x00000000 0x00000000 0x00000000 0x00000000
0x602270: 0x00000000 0x00000000 0x00000000 0x00000000
0x602280: 0x00000000 0x00000000 0x00000000 0x00000000
0x602290: 0x00000000 0x00000000 0x00000000 0x00000000
0x6022a0: 0x00000000 0x00000000 0x00000000 0x00000000
0x6022b0: 0x00000000 0x00000000 0x00000000 0x00000000
0x6022c0: 0x00000000 0x00000000 0x00000411 0x00000000
0x6022d0: 0x657a6973 0x20666f20 0x72746e65 0x203a2079
0x6022e0: 0x20363331 0x65747962 0x00000a73 0x00000000
0x6022f0: 0x00000000 0x00000000 0x00000000 0x00000000
0x602300: 0x00000000 0x00000000 0x00000000 0x00000000
50 clock_gettime(CLOCK_REALTIME, &start);
(gdb) x/136x (char *) pHead
0x602240: 0x00000000 0x00000000 0x00000000 0x00000000
0x602250: 0x00000000 0x00000000 0x00000000 0x00000000
0x602260: 0x00000000 0x00000000 0x00000000 0x00000000
0x602270: 0x00000000 0x00000000 0x00000000 0x00000000
0x602290: 0x00000000 0x00000000 0x00000000 0x00000000
0x6022a0: 0x00000000 0x00000000 0x00000000 0x00000000
0x6022b0: 0x00000000 0x00000000 0x00000000 0x00000000
0x6022c0: 0x00000000 0x00000000 0x00000411 0x00000000
0x6022d0: 0x4e475f5f 0x5f5f4355 0x72746e0a 0x203a2079
0x6022e0: 0x20363331 0x65747962 0x00000a73 0x00000000
0x6022f0: 0x00000000 0x00000000 0x00000000 0x00000000
0x602300: 0x00000000 0x00000000 0x00000000 0x00000000
The values around 0x6022d0 changed in between; it is not obvious what effect this has. (Note that those words are ASCII text: in the first dump they spell out "size of entry : 136 bytes", and in the second dump "__GNUC__" has overwritten the start of that string, i.e. these are the strings just printed, which suggests this region overlaps stdio's output buffer rather than phonebook data.)
clock_gettime(CLOCK_REALTIME, &start)
This part appends data into *e:
e = append(line, e)
Appending this way avoids the O(n) cost of a plain linked-list append (walking to the tail every time), at the cost of having to keep track of the list head separately. That is why line 63 has e = pHead: it hands the position pHead points to back to e, so e points at the list head again.
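A sketch of this tail-pointer append, close to what phonebook_orig.c does (the entry type comes from phonebook_orig.h; the malloc() failure check is an addition, not in the original):

```c
#include <stdlib.h>
#include <string.h>

/* Append a new node after the current tail e and return the new tail.
 * Because the caller passes the previous tail back in each time, no walk
 * from the head is needed, so every append is O(1). */
entry *append(char lastName[], entry *e)
{
    e->pNext = malloc(sizeof(entry));
    if (!e->pNext)          /* added: don't let a failed malloc() go unchecked */
        return NULL;
    e = e->pNext;
    strcpy(e->lastName, lastName);
    e->pNext = NULL;
    return e;
}
```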
char input[MAX_LAST_NAME_SIZE] = "zyxel";
Let's take this part step by step.
char input[MAX_LAST_NAME_SIZE] = "zyxel";
This line sets the last name to search for to "zyxel". In words.txt, zyxel is the 10th entry from the end, which means findName() has to walk almost the entire list before it finds a match (poor thing).
e = pHead
Because e ended up at the last element after all the appends, we point it back at the first element.
assert(findName(input, e) &&
"Did you implement findName() in " IMPL "?");
assert(0 == strcmp(findName(input, e)->lastName, "zyxel"));
#if defined(__GNUC__)
__builtin___clear_cache((char *) pHead, (char *) pHead + sizeof(entry));
#endif
/* compute the execution time */
clock_gettime(CLOCK_REALTIME, &start);
findName(input, e);
clock_gettime(CLOCK_REALTIME, &end);
cpu_time2 = diff_in_second(start, end);
$ make run
size of entry : 136 bytes
execution time of append() : 0.046524 sec
execution time of findName() : 0.006071 sec
Basic CPU statistics
$ perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles ./phonebook_orig
Performance counter stats for './phonebook_orig':
175,357,733 cycles:u
229,437,098 instructions:u # 1.31 insn per cycle
1,301,277 cache-references:u
1,277,761 cache-misses:u # 98.193 % of all cache refs
5,058,973 bus-cycles:u
0.066305491 seconds time elapsed
CPU L1 cache statistics
$ perf stat -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-prefetch-misses,L1-dcache-store-misses,L1-icache-load-misses ./phonebook_orig
Performance counter stats for './phonebook_orig':
1,315,576 cache-misses:u # 120.087 % of all cache refs (59.07%)
1,095,522 cache-references:u (60.37%)
1,231,509 L1-dcache-load-misses:u # 1.95% of all L1-dcache hits (78.15%)
63,186,159 L1-dcache-loads:u (81.73%)
3,369,852 L1-dcache-prefetch-misses:u (43.53%)
35,271 L1-dcache-store-misses:u (39.19%)
21,174 L1-icache-load-misses:u (56.70%)
0.068884934 seconds time elapsed
!!!WARNING!!!
Before you start, don't just stare at the chart; read the code first. Remember to uncomment
#define OPT 1
in phonebook_opt.h, otherwise running make plot over and over will get you nowhere.
Shrinking the entry size from 136 bytes to 32 bytes improves both append and findName: append becomes 21.97% faster and findName becomes 122.31% faster.
typedef struct __LAST_NAME_ENTRY {
char lastName[MAX_LAST_NAME_SIZE];
phonebook_detail *detail;
struct __LAST_NAME_ENTRY *pNext;
} entry;
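With MAX_LAST_NAME_SIZE = 16, this entry is 16 bytes of name plus two 8-byte pointers, i.e. 32 bytes on x86-64. The rarely touched fields move into a separate detail structure that only gets allocated when actually needed; a sketch of what it might contain (field names are illustrative, taken from the original 136-byte entry):

```c
typedef struct __PHONEBOOK_DETAIL {
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
} phonebook_detail;
```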
orig cache test
Performance counter stats for './phonebook_orig' (100 runs):
1,181,680 cache-misses:u # 95.942 % of all cache refs ( +- 0.49% )
1,231,662 cache-references:u ( +- 0.41% )
229,689,183 instructions:u # 1.24 insn per cycle ( +- 0.02% )
185,425,108 cycles:u ( +- 0.46% )
0.079698901 seconds time elapsed ( +- 0.83% )
reduced entry size cache test
Performance counter stats for './phonebook_opt' (100 runs):
89,105 cache-misses:u # 32.843 % of all cache refs ( +- 0.98% )
271,309 cache-references:u ( +- 0.70% )
233,120,203 instructions:u # 1.62 insn per cycle ( +- 0.02% )
143,480,575 cycles:u ( +- 0.35% )
0.053849988 seconds time elapsed ( +- 0.78% )
Cache misses drop from 95% to 32%. Amazing!
Digging through memory I found this page:
各种字符串Hash函数比较 (a comparison of string hash functions) by byvoid
and adopted the recommended BKDR hash function (named after Brian Kernighan and Dennis Ritchie; it is a variant of the string hash from their famous book "The C Programming Language"):
unsigned int BKDRHash(char *str)
{
unsigned int seed = 131;
unsigned int hash = 0;
while (*str) {
hash = hash * seed + (*str++);
}
return (hash & 0x7FFFFFFF);
}
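Usage is then a hash followed by a mask; a small sketch with a hypothetical power-of-two TABLE_SIZE, so the modulo reduces to a cheap bitwise AND (the same size - 1 trick that shows up as ma_mask below):

```c
#define TABLE_SIZE (1 << 16)   /* hypothetical power-of-two table size */

/* Map a last name to a bucket index using BKDRHash() above. */
static unsigned int bucket_of(char *lastName)
{
    return BKDRHash(lastName) & (TABLE_SIZE - 1);
}
```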
Since I really like Python, I decided to study how Python's dict is implemented and apply it to this assignment.
Reading "Python Dictionary Implementation" by Laurent Luce reveals an important point: CPython's dict uses open addressing.
My system programming teacher covered this in class, but I had forgotten... seeing the wiki page brought it back.
Open addressing is a way of handling collisions in a hash table: instead of chaining, a probe sequence decides which index a value ends up in.
Its big advantage is that, as long as the table stays sparse, lookups remain O(1) on average without the pointer chasing and extra allocations of chained lists, whereas long collision chains can degrade a lookup toward O(n).
Common probing schemes include linear probing, quadratic probing, and double hashing.
Python uses its own perturbed probing sequence (related to, but not exactly, textbook quadratic probing); the recurrence is:
perturb = hash;
index = (index << 2) + index + perturb + 1;
perturb >>= PERTURB_SHIFT;
PERTURB_SHIFT defaults to 5 and must not be smaller than 1.
Suppose the hash table size is 32, hash = 187875484, and the initial index is 28.
Running the probing then yields:
28 -> 9 -> 18 -> 11 -> 29 -> 5 -> 31 -> 28 ->
13 -> 2 -> 11 -> 24 -> 25 -> 30 -> 23 -> 20 ->
5 -> 26 -> 3 -> 16 -> 17 -> ....
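The sequence above can be reproduced in a few lines of C. This sketch follows CPython's recurrence directly, assuming PERTURB_SHIFT = 5, a table of size 32, and the same hash 187875484:

```c
#include <stdio.h>

#define PERTURB_SHIFT 5

int main(void)
{
    unsigned int mask = 31;              /* table size 32 */
    unsigned int hash = 187875484u;
    unsigned int perturb = hash;
    unsigned int i = hash & mask;        /* first slot: 28 */

    for (int n = 0; n < 21; n++) {
        printf("%u -> ", i & mask);
        i = (i << 2) + i + perturb + 1;  /* i = i*5 + perturb + 1 */
        perturb >>= PERTURB_SHIFT;
    }
    printf("...\n");
    return 0;
}
```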
Python's dict uses two structures in total: PyDictEntry as the key/value structure, and PyDictObject as the object for the whole dict.
The source of dictobject.c can be found on svn.python.org; if you want to implement this yourself, open it, you will need it.
Here they are specialized into phonebook-specific DictEntry and DictObject structures as follows:
typedef struct {
unsigned int hash;
char lastName[MAX_LAST_NAME_SIZE];
phonebook_detail *detail;
} DictEntry;
typedef struct _dictobject DictObject;
struct _dictobject {
unsigned int ma_mask;
int ma_used;
int ma_fill;
int ma_size;
DictEntry *ma_table;
DictEntry *(*ma_lookup)(DictObject *mp,
char key[],
unsigned int hash);
};
- ma_mask: the hash mask, size - 1
- ma_used: number of slots currently in use
- ma_fill: slots in use plus dummy slots (when a slot is deleted it is marked as a dummy slot)
- ma_size: current size of ma_table
- ma_table: the hash table itself
- ma_lookup: the lookup function
DictObject *Dict_New();
int Dict_SetItem(DictObject *mp, char key[]);
int Dict_Lookup(DictObject *mp, char key[]);
DictEntry *lookdict(DictObject *mp, char key[], unsigned int hash);
int insertdict(DictObject *mp, char key[], const unsigned int hash);
void insertdict_clean(DictObject *mp, char key[],
unsigned int hash, phonebook_detail *value);
int dictresize(DictObject *mp, int minused);
unsigned int BKDRHash(char *s);
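To make the lookup concrete, here is a sketch of how lookdict() can probe the table with this scheme. It assumes PERTURB_SHIFT = 5, that an unused slot is marked by an empty lastName, that the table always keeps some free slots (dictresize() takes care of that), and it ignores dummy (deleted) slots for brevity; the real phonebook_opt code may differ in these details:

```c
#include <string.h>

#define PERTURB_SHIFT 5

DictEntry *lookdict(DictObject *mp, char key[], unsigned int hash)
{
    unsigned int mask = mp->ma_mask;
    unsigned int i = hash & mask;
    unsigned int perturb;
    DictEntry *ep;

    for (perturb = hash; ; perturb >>= PERTURB_SHIFT) {
        ep = &mp->ma_table[i & mask];
        if (ep->lastName[0] == '\0')      /* free slot: key is not present */
            return ep;
        if (ep->hash == hash && !strcmp(ep->lastName, key))
            return ep;                    /* found the matching entry */
        i = (i << 2) + i + perturb + 1;   /* next probe: i*5 + perturb + 1 */
    }
}
```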
Performance counter stats for './phonebook_opt' (100 runs):
1,553,358 cache-misses:u # 68.207 % of all cache refs ( +- 0.28% )
2,277,402 cache-references:u ( +- 0.20% )
270,305,408 instructions:u # 0.62 insn per cycle ( +- 0.02% )
437,887,064 cycles:u ( +- 0.34% )
0.162588847 seconds time elapsed ( +- 0.60% )
A phonebook is built once and then used for a long time, so getting findName down to essentially 0 should be great!
New measurement: load dictionary/words.txt, then call findName("zyxel") 100 times. (Admittedly, this is not a good experiment design.)
Performance counter stats for './phonebook_orig' (100 runs):
42,431,897 cache-misses:u # 99.071 % of all cache refs ( +- 0.11% )
42,829,855 cache-references:u ( +- 0.09% )
2,164,254,681 instructions:u # 1.04 insn per cycle ( +- 0.00% )
2,079,356,180 cycles:u ( +- 0.42% )
0.653479071 seconds time elapsed ( +- 0.27% )
Performance counter stats for './phonebook_opt' (100 runs):
2,481,889 cache-misses # 33.237 % of all cache refs ( +- 0.36% )
7,467,180 cache-references ( +- 0.29% )
2,167,727,565 instructions # 2.21 insn per cycle ( +- 0.00% )
981,860,196 cycles ( +- 0.18% )
0.410072312 seconds time elapsed ( +- 1.42% )
Performance counter stats for './phonebook_opt_hash' (100 runs):
1,556,307 cache-misses # 68.441 % of all cache refs ( +- 0.43% )
2,273,932 cache-references ( +- 0.21% )
270,227,846 instructions # 0.65 insn per cycle ( +- 0.02% )
412,602,985 cycles ( +- 0.29% )
0.211090389 seconds time elapsed ( +- 1.22% )
Reference from: http://www.brendangregg.com/methodology.html
Are UOPs issued?
In short, the frontend fetches instructions and decodes them into micro-ops (uops); the Sandy Bridge family can issue up to 4 uops per cycle to the backend.
(translated/paraphrased by louielu)
In short, the backend executes the uops, with parallelism across its execution resources.
(translated/paraphrased by louielu)
$ perf stat -d ./phonebook_orig
Performance counter stats for './phonebook_orig':
66.166969 task-clock:u (msec) # 0.995 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
12,346 page-faults:u # 0.187 M/sec
113,680,547 cycles:u # 1.718 GHz (34.24%)
82,373,177 stalled-cycles-frontend:u # 72.46% frontend cycles idle (48.06%)
64,255,629 stalled-cycles-backend:u # 56.52% backend cycles idle (50.65%)
186,811,699 instructions:u # 1.64 insn per cycle
# 0.44 stalled cycles per insn (62.39%)
34,733,972 branches:u # 524.944 M/sec (63.81%)
789,273 branch-misses:u # 2.27% of all branches (62.64%)
49,760,981 L1-dcache-loads:u # 752.052 M/sec (44.50%)
56,790 L1-dcache-load-misses:u # 0.11% of all L1-dcache hits (18.82%)
4,932 LLC-loads:u # 0.075 M/sec (23.20%)
<not supported> LLC-load-misses:u
0.066500871 seconds time elapsed
$ perf stat -d ./phonebook_opt
Performance counter stats for './phonebook_opt':
41.165953 task-clock:u (msec) # 0.993 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
4,147 page-faults:u # 0.101 M/sec
89,513,769 cycles:u # 2.174 GHz (30.69%)
47,728,782 stalled-cycles-frontend:u # 53.32% frontend cycles idle (45.97%)
30,544,573 stalled-cycles-backend:u # 34.12% backend cycles idle (50.45%)
186,816,945 instructions:u # 2.09 insn per cycle
# 0.26 stalled cycles per insn (61.88%)
35,932,936 branches:u # 872.880 M/sec (63.67%)
848,250 branch-misses:u # 2.36% of all branches (70.17%)
54,843,255 L1-dcache-loads:u # 1332.248 M/sec (41.03%)
26,195 L1-dcache-load-misses:u # 0.05% of all L1-dcache hits (15.29%)
2,297 LLC-loads:u # 0.056 M/sec (22.03%)
<not supported> LLC-load-misses:u
0.041471519 seconds time elapsed