contributed by <KobeYu
>
github:https://github.com/kobe0308/phonebook
作業說明: https://hackmd.io/s/S1RVdgza
kobeyu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 58
Model name: Intel(R) Core(TM) i5-3427U CPU @ 1.80GHz
Stepping: 9
CPU MHz: 2796.207
CPU max MHz: 2800.0000
CPU min MHz: 800.0000
BogoMIPS: 4589.39
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
.
├── calculate.c
├── dictionary
│ ├── all-names.txt
│ ├── female-names.txt
│ ├── male-names.txt
│ ├── Prime-10000.csv
│ ├── tolowercase.c
│ └── words.txt
├── LICENSE
├── main.c
├── Makefile
├── phonebook_opt.c
├── phonebook_opt.h
├── phonebook_orig.c
├── phonebook_orig.h
├── README.md
└── scripts
├── pre-commit.hook
└── runtime.gp
在進行效能測試時,會讓讓程式執行100次,並記錄每次執行的時間,最後calculate.c就會把所有的執行結果取平均,然後透過gunplot繪出不同的執行結果
此資料節內存放的是大量的字典檔,此專案用到的是words.txt要從大量的數據中,找到目標的字串
這是一個可以正常運行的版本,是此次作業要比較的對象,目標是能夠找到效能更好的方法.
在git的操作流程中,在commit一次訊息之前,能夠先運行一小段script程式,在這邊老師的設定是使用astyle來檢查程式碼的格式是否符合k&r的風格
執行make plot後,會透過這個描述檔,執行的結果會製成圖檔runtime.png
關閉最佳化,先以改變演算法或資料結構的方式進行最佳化
CFLAGS_orig = -O0
CFLAGS_opt = -O0
gcc -D是用來定義一個名稱,在此老師根據編譯不同的執行檔分別帶入phonebook_opt.h與phonebook_orig.h,可以共用main清楚的分離不同功能模組的程式碼
gcc -D
perf可以量測的效能指標很多,透過perf list可以知道所有可量測的事件,這邊先參考Makefile中在意的事件為主(cache-misses,cache-references,instructions,cycles)
$ perf stat -e cache-misses,cache-references,instructions,cycles ./phonebook_orig
Performance counter stats for './phonebook_orig':
174,4231 cache-misses # 94.821 % of all cache refs
183,9492 cache-references
2,6396,0693 instructions # 1.49 insns per cycle
1,7755,3941 cycles
0.114544631 seconds time elapsed
從上述結果可以看到有很高的cache-misses(94.821%),接著透過perf record來看看大部分的週期(cycles)都花在哪個函式裡面.
$ perf record ./phonebook_orig && perf report
Samples: 548 of event 'cycles:pp', Event count (approx.): 177767811
Overhead Command Shared Object Symbol
15.01% phonebook_orig phonebook_orig [.] main
12.33% phonebook_orig libc-2.23.so [.] __strcasecmp_l_avx
9.88% phonebook_orig libc-2.23.so [.] _int_malloc
9.69% phonebook_orig phonebook_orig [.] findName
8.66% phonebook_orig libc-2.23.so [.] _IO_fgets
5.71% phonebook_orig libc-2.23.so [.] __strcpy_sse2_unaligned
5.64% phonebook_orig libc-2.23.so [.] __memcpy_sse2
4.60% phonebook_orig libc-2.23.so [.] _IO_getline_info
....
以執行結果來看,大部分的時間都花在findName裡面,所以根據perf量測的結果我們先朝著降低cache-misses與減少findName所需要的執行時間為出發點.
perf之所以能夠量測到硬體的資訊是因為CPU中的PMU,PMU的全名是performance monitor unit,而根據perf wiki的內容來看會perf能夠量測到的硬體事件取決於硬體製造商是否支援,Intel的文件在此: Intel PMU event tables: Appendix A of manual here (待讀!)
根據perf的量測結果,效能瓶頸在於cache-misses與執行findName的週期數(cycles)過多,所以先針對cache-misses提出的策略是降低結構entry的資料大小
將原本的資料從原本的136Byte將到32Byte,
sizeof(entry) => 136Byte
$ perf stat -e cache-misses,cache-references,instructions,cycles ./phonebook_orig
Performance counter stats for './phonebook_orig' (100 runs):
198,9886 cache-misses # 96.073 % of all cache refs ( +- 0.18% )
207,1232 cache-references ( +- 0.17% )
2,6201,1014 instructions # 1.42 insns per cycle ( +- 0.02% )
1,8478,3054 cycles ( +- 0.14% )
0.073231265 seconds time elapsed ( +- 0.38% )
sizeof(entry) => 32Byte
$ perf stat -e cache-misses,cache-references,instructions,cycles ./phonebook_opt
Performance counter stats for './phonebook_opt' (100 runs):
41,8951 cache-misses # 73.309 % of all cache refs ( +- 0.21% )
57,1486 cache-references ( +- 0.75% )
2,4420,5622 instructions # 1.75 insns per cycle ( +- 0.02% )
1,3939,9549 cycles ( +- 0.85% )
0.055475061 seconds time elapsed ( +- 1.05% )
sizeof(entry) => 24Byte
$ perf stat -e cache-misses,cache-references,instructions,cycles ./phonebook_opt
Performance counter stats for './phonebook_opt':
43,6432 cache-misses # 79.510 % of all cache refs
54,8901 cache-references
4,0384,2330 instructions # 2.00 insns per cycle
2,0207,4955 cycles
0.120265124 seconds time elapsed
在新的結構中,若是將MAX_LAST_NAME_SIZE從16改為15…9,sizeof(entry)一樣是32,但如果從16改為8則sizeof(entry)會變成24,看起來應該是跟data alignment有關.
typedef struct __PHONE_BOOK_ENTRY {
char lastName[MAX_LAST_NAME_SIZE];
struct __PHONE_BOOK_DETAIL_INFO *pDetailInfo;
struct __PHONE_BOOK_ENTRY *pNext;
} entry;
把entry資料結構從36Byte在降到24Byte,cache-misses反而又升高了,還在思考可能的原因是什麼.
一篇關於hash table的介紹與範例程式 link
origin 的版本
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ perf record ./phonebook_orig
$ perf report
Samples: 564 of event 'cycles:pp', Event count (approx.): 188719537
Overhead Command Shared Object Symbol
13.45% phonebook_orig phonebook_orig [.] main
12.68% phonebook_orig libc-2.23.so [.] __strcasecmp_l_avx
11.76% phonebook_orig phonebook_orig [.] findName
9.90% phonebook_orig libc-2.23.so [.] _int_malloc
9.30% phonebook_orig libc-2.23.so [.] _IO_fgets
6.52% phonebook_orig libc-2.23.so [.] __memcpy_sse2
4.88% phonebook_orig libc-2.23.so [.] __strcpy_sse2_unaligned
opt(sizeof(entry)=32)的版本
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ perf record ./phonebook_opt
$ perf report
Samples: 427 of event 'cycles:pp', Event count (approx.): 136560977
Overhead Command Shared Object Symbol
22.03% phonebook_opt phonebook_opt [.] main
14.02% phonebook_opt libc-2.23.so [.] __strcasecmp_l_avx
10.31% phonebook_opt libc-2.23.so [.] _IO_fgets
9.52% phonebook_opt libc-2.23.so [.] _int_malloc
8.48% phonebook_opt libc-2.23.so [.] __memcpy_sse2
4.63% phonebook_opt libc-2.23.so [.] __strcpy_sse2_unaligned
4.47% phonebook_opt libc-2.23.so [.] _IO_getline_info
4.34% phonebook_opt libc-2.23.so [.] memchr
3.44% phonebook_opt phonebook_opt [.] append
2.91% phonebook_opt libc-2.23.so [.] malloc
1.94% phonebook_opt phonebook_opt [.] findName
hash table + originy(sizeof(entry)=136)的版本
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ perf record ./phonebook_hashtb
$ perf report
Samples: 917 of event 'cycles:pp', Event count (approx.): 450670184
Overhead Command Shared Object Symbol
59.34% phonebook_hasht phonebook_hashtb [.] append
8.45% phonebook_hasht phonebook_hashtb [.] BKDRHash
6.06% phonebook_hasht phonebook_hashtb [.] main
3.80% phonebook_hasht libc-2.23.so [.] _int_malloc
3.29% phonebook_hasht libc-2.23.so [.] _IO_fgets
1.83% phonebook_hasht libc-2.23.so [.] __memcpy_sse2
hash table + opt(sizeof(entry) = 32)的版本
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ perf record ./phonebook_hashtb
$ perf report
Samples: 867 of event 'cycles:pp', Event count (approx.): 438113332
Overhead Command Shared Object Symbol
58.01% phonebook_hasht phonebook_hashtb [.] append
8.72% phonebook_hasht phonebook_hashtb [.] BKDRHash
5.55% phonebook_hasht libc-2.23.so [.] _int_malloc
5.49% phonebook_hasht phonebook_hashtb [.] main
2.99% phonebook_hasht libc-2.23.so [.] __strcpy_sse2_unaligned
2.94% phonebook_hasht libc-2.23.so [.] _IO_fgets
2.52% phonebook_hasht libc-2.23.so [.] _IO_getline_info
2.10% phonebook_hasht libc-2.23.so [.] malloc
再加入了hash table後看到了瓶頸反而變成了計算key值的BKDRHash()
綜合比較結果