contributed by <mingnus
>
ncku
sysprog
perf
linux
phonebook
作業系統: Ubuntu 16.04.1, kernel 4.4.0-38
硬體: Intel Celeron N3150
安裝作業所需套件: gnuplot-x11, linux-tools-common, linux-tools-4.4.0-38-generic
[optional] 調整系統locale (例如LC_NUMERIC會影響perf stat千位數逗號的位置)
http://www.davidpai.tw/ubuntu/2011/ubuntu-set-locale/
(Ubuntu可從UI設定: System Settings > Language Support > Regional Formats; 設定後重開機)
調整 CPU 電源設定, 以測得 CPU 最大效能
cat /sys/devices/system/cpu/cpu<N>/cpufreq/scaling_available_governors
echo performance > /sys/devices/system/cpu/cpu<N>/cpufreq/scaling_governors
http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial
$ sudo sh -c "echo 0 > /proc/sys/kernel/perf_event_paranoid"
$ sudo sh -c "echo 0 > /proc/sys/kernel/kptr_restrict"
## perf stat [-r <repeat_count>] [-e <events>] <command>
$ perf stat --repeat 100 \
-e cache-misses,cache-references,instructions,cycles \
./phonebook_orig
sample output:
Performance counter stats for './phonebook_orig' (100 runs):
3,422,897 cache-misses # 29.110 % of all cache refs ( +- 0.18% ) (74.90%)
11,758,558 cache-references ( +- 0.27% ) (50.51%)
265,764,121 instructions # 0.63 insns per cycle ( +- 0.07% ) (75.42%)
424,232,571 cycles ( +- 0.08% ) (75.24%)
0.212240981 seconds time elapsed ( +- 0.25% )
行末括號內的數字是events multiplex percentages, 當取樣的PMU event類型超過PMU個數時就會啟用此功能: https://perf.wiki.kernel.org/index.php/Tutorial#multiplexing_and_scaling_events
$ sudo perf record -e cache-misses,cache-references,instructions,cycles ./phonebook_orig
$ sudo perf report
"It might be confusing that the sum of all the 'children' overhead values exceeds 100% since each of them is already an accumulationof 'self' overhead of its child functions. But with this enabled, users can find which function has the most overhead even if samples are spread over the children.”
Operations
實驗: 觀察bandwidth運作期間的event count。該程式是將一個heap buffer(連續空間)內的資料反覆讀進register來測試memory throughput。以下觀察循序讀取時的event count。
(注意: 最好不要用–fastest參數執行bandwidth,否則本程式觸發的event count不夠多, 不易和背景程式的events區隔)
Cache misses in _Reader.L2():
│ Disassembly of section .text:
│
│ 0000000000401588 <_Reader.L2>:
│ _Reader.L2():
1.87 │ 0: mov (%r10),%rax
0.34 │ mov 0x8(%r10),%rax
0.68 │ mov 0x10(%r10),%rax
1.92 │ mov 0x18(%r10),%rax
4.58 │ mov 0x20(%r10),%rax
4.30 │ mov 0x28(%r10),%rax
1.47 │ mov 0x30(%r10),%rax
2.32 │ mov 0x38(%r10),%rax
1.81 │ mov 0x40(%r10),%rax
2.38 │ mov 0x48(%r10),%rax
2.43 │ mov 0x50(%r10),%rax
1.81 │ mov 0x58(%r10),%rax
* 9.50 │ mov 0x60(%r10),%rax
*10.46 │ mov 0x68(%r10),%rax
2.04 │ mov 0x70(%r10),%rax
2.43 │ mov 0x78(%r10),%rax
1.30 │ mov 0x80(%r10),%rax
1.70 │ mov 0x88(%r10),%rax
2.32 │ mov 0x90(%r10),%rax
1.36 │ mov 0x98(%r10),%rax
* 6.90 │ mov 0xa0(%r10),%rax
* 6.50 │ mov 0xa8(%r10),%rax
1.47 │ mov 0xb0(%r10),%rax
2.21 │ mov 0xb8(%r10),%rax
2.71 │ mov 0xc0(%r10),%rax
2.09 │ mov 0xc8(%r10),%rax
0.96 │ mov 0xd0(%r10),%rax
0.79 │ mov 0xd8(%r10),%rax
4.47 │ mov 0xe0(%r10),%rax
* 5.20 │ mov 0xe8(%r10),%rax
2.04 │ mov 0xf0(%r10),%rax
3.00 │ mov 0xf8(%r10),%rax
2.94 │ add $0x100,%r10
0.51 │ cmp %rsi,%r10
1.13 │ ↑ jb 0
│ dec %rdx
0.06 │ ↓ jne fffffffffffffffd
│ pop %r10
│ ← retq
see also:
本程式是連續呼叫malloc()建立鏈結串列, 所有的串列節點理論上是線性分布於heap空間。但因struct __PHONE_BOOK_ENTRY
大小超過cache line size (64 bytes),所以每當存取下一個串列節點的姓名欄位時, 必然引發L1-dcache-loads events, 也容易產生cache miss。
(問題: memory access pattern與cache prefetch的資料量是否有關? 我們現在只存取每個節點開頭的16 bytes lastName, CPU是否會prefetch超過64 byte的資料?)
perf output:
Performance counter stats for './phonebook_orig' (100 runs):
1,326,960 L1-dcache-load-misses ( +- 0.37% ) (59.37%)
3,344,492 cache-misses # 29.130 % of all cache refs ( +- 0.29% ) (41.10%)
11,481,124 cache-references ( +- 0.26% ) (41.32%)
264,859,089 instructions # 0.62 insns per cycle ( +- 0.11% ) (60.90%)
425,551,760 cycles ( +- 0.09% ) (59.41%)
0.212769801 seconds time elapsed ( +- 0.20% )
加強資料的special locality: 將struct __PHONE_BOOK_ENTRY
內非鍵值欄位移出,結構縮小為32 bytes。
perf output:
Performance counter stats for './phonebook_opt' (100 runs):
529,675 L1-dcache-load-misses ( +- 1.23% ) (58.87%)
1,217,650 cache-misses # 27.935 % of all cache refs ( +- 0.63% ) (40.89%)
4,358,932 cache-references ( +- 0.39% ) (41.92%)
246,162,656 instructions # 0.70 insns per cycle ( +- 0.14% ) (61.60%)
349,640,074 cycles ( +- 0.08% ) (59.62%)
0.175088380 seconds time elapsed ( +- 0.24% )
基於實驗(1), 再建立binary search tree以加快搜尋速度: 以額外函式將sorted linked list轉為binary search tree。
struct __PHONE_BOOK_ENTRY
裡新增兩個指向左右子樹的指標,struct size增為48 bytes。
perf output:
Performance counter stats for './phonebook_opt' (100 runs):
164,926 L1-dcache-load-misses ( +- 1.69% ) (59.74%)
813,046 cache-misses # 19.045 % of all cache refs ( +- 0.86% ) (40.67%)
4,269,014 cache-references ( +- 0.58% ) (41.01%)
206,653,954 instructions # 0.73 insns per cycle ( +- 0.16% ) (61.62%)
283,291,500 cycles ( +- 0.10% ) (60.09%)
0.143766307 seconds time elapsed ( +- 0.34% )
測試前均釋放page caches
$ echo 3 > /proc/sys/vm/drop_caches
值得注意的是,如果把struct size改為和cache line一樣大小的話 (64 bytes),效能仍較原始版本佳 (原因待查)
program output:
size of entry : 64 bytes
execution time of append() : 0.127740 sec
execution time of findName() : 0.017339 sec
perf output:
Performance counter stats for './phonebook_opt' (100 runs):
865,688 L1-dcache-load-misses ( +- 0.29% ) (59.67%)
1,910,430 cache-misses # 27.976 % of all cache refs ( +- 0.50% ) (41.08%)
6,828,855 cache-references ( +- 0.23% ) (40.91%)
253,911,736 instructions # 0.68 insns per cycle ( +- 0.14% ) (60.72%)
373,165,944 cycles ( +- 0.09% ) (59.44%)
0.186791556 seconds time elapsed ( +- 0.21% )
main.c line 69: assert()參數IMPL沒被展開
#include <stdio.h>
#include <assert.h>
#define MY_STR "my string"
int main() {
assert(0 && "test: " MY_STR);
return 0;
}
test: test.c:7: main: Assertion `0 && "test: " MY_STR' failed.
$ gcc -E -DM test.c
#define assert(expr) ((expr) ? __ASSERT_VOID_CAST (0) : __assert_fail (#expr, __FILE__, __LINE__, __ASSERT_FUNCTION))
$ gcc -E -P test.c
int main() {
((0 && "test: " "my string") ? (void) (0) : __assert_fail ("0 && \"test: \" MY_STR", "test.c", 7, __PRETTY_FUNCTION__));
}
可以看到傳入__assert_fail()
的IMPL沒被展開