Try   HackMD

2016q3 Homework1 (phonebook)

contributed by <KobeYu>
github:https://github.com/kobe0308/phonebook
作業說明: https://hackmd.io/s/S1RVdgza

tags: kobeyu

目標

  1. 熟悉perf效能分析工具
  2. 使用perf分析效能瓶頸
  3. 根據已知的效能瓶頸進行最佳化

執行環境 Ubuntu 16.04 @MacBook Air

​Architecture:          x86_64
​CPU op-mode(s):        32-bit, 64-bit
​Byte Order:            Little Endian
​CPU(s):                4
​On-line CPU(s) list:   0-3
​Thread(s) per core:    2
​Core(s) per socket:    2
​Socket(s):             1
​NUMA node(s):          1
​Vendor ID:             GenuineIntel
​CPU family:            6
​Model:                 58
​Model name:            Intel(R) Core(TM) i5-3427U CPU @ 1.80GHz
​Stepping:              9
​CPU MHz:               2796.207
​CPU max MHz:           2800.0000
​CPU min MHz:           800.0000
​BogoMIPS:              4589.39
​Virtualization:        VT-x
​L1d cache:             32K
​L1i cache:             32K
​L2 cache:              256K
​L3 cache:              3072K
​NUMA node0 CPU(s):     0-3
​Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts

資料夾結構

​.
​├── calculate.c
​├── dictionary
​│   ├── all-names.txt
​│   ├── female-names.txt
​│   ├── male-names.txt
​│   ├── Prime-10000.csv
​│   ├── tolowercase.c
​│   └── words.txt
​├── LICENSE
​├── main.c
​├── Makefile
​├── phonebook_opt.c
​├── phonebook_opt.h
​├── phonebook_orig.c
​├── phonebook_orig.h
​├── README.md
​└── scripts
​	├── pre-commit.hook
​	└── runtime.gp

calculate.c

在進行效能測試時,會讓讓程式執行100次,並記錄每次執行的時間,最後calculate.c就會把所有的執行結果取平均,然後透過gunplot繪出不同的執行結果

dictionary

此資料節內存放的是大量的字典檔,此專案用到的是words.txt要從大量的數據中,找到目標的字串

phonebook_orig.[ch]

這是一個可以正常運行的版本,是此次作業要比較的對象,目標是能夠找到效能更好的方法.

scripts

pre-commit.hook

在git的操作流程中,在commit一次訊息之前,能夠先運行一小段script程式,在這邊老師的設定是使用astyle來檢查程式碼的格式是否符合k&r的風格

git hook

Astyle K&R

runtime_gp

執行make plot後,會透過這個描述檔,執行的結果會製成圖檔runtime.png

Makefile

關閉最佳化,先以改變演算法或資料結構的方式進行最佳化
CFLAGS_orig = -O0
CFLAGS_opt = -O0

gcc -D是用來定義一個名稱,在此老師根據編譯不同的執行檔分別帶入phonebook_opt.h與phonebook_orig.h,可以共用main清楚的分離不同功能模組的程式碼
gcc -D

效能分析 - Perf

perf可以量測的效能指標很多,透過perf list可以知道所有可量測的事件,這邊先參考Makefile中在意的事件為主(cache-misses,cache-references,instructions,cycles)

$ perf stat -e cache-misses,cache-references,instructions,cycles ./phonebook_orig

​ Performance counter stats for './phonebook_orig':

​​​​      174,4231      cache-misses              #   94.821 % of all cache refs    
​​​​      183,9492      cache-references                                            
​​​​   2,6396,0693      instructions              #    1.49  insns per cycle        
​​​​   1,7755,3941      cycles                                                      

​​​​   0.114544631 seconds time elapsed

從上述結果可以看到有很高的cache-misses(94.821%),接著透過perf record來看看大部分的週期(cycles)都花在哪個函式裡面.

$ perf record ./phonebook_orig && perf report

​Samples: 548  of event 'cycles:pp', Event count (approx.): 177767811
​Overhead  Command         Shared Object     Symbol
​  15.01%  phonebook_orig  phonebook_orig    [.] main
​  12.33%  phonebook_orig  libc-2.23.so      [.] __strcasecmp_l_avx
​   9.88%  phonebook_orig  libc-2.23.so      [.] _int_malloc
​   9.69%  phonebook_orig  phonebook_orig    [.] findName
​   8.66%  phonebook_orig  libc-2.23.so      [.] _IO_fgets
​   5.71%  phonebook_orig  libc-2.23.so      [.] __strcpy_sse2_unaligned
​   5.64%  phonebook_orig  libc-2.23.so      [.] __memcpy_sse2
​   4.60%  phonebook_orig  libc-2.23.so      [.] _IO_getline_info
​​​​   ....

以執行結果來看,大部分的時間都花在findName裡面,所以根據perf量測的結果我們先朝著降低cache-misses與減少findName所需要的執行時間為出發點.

心中的疑問

  1. 根據執行結果findName的執行時間比append少,但為什麼cycles數反倒是findName以較多?
  2. perf -d -d -d列出了很多跟L1 cache的統計數據,跟-e cache-misses的關聯是什麼或者說cache-misses是怎麼計算出來的?

Reference

  1. perf stat -d可以列出更多細節,但要如何分析這些數據需要經驗跟背景知識,根據louielu同學的作業紀錄,看到了Brendan D. Gregg的個人網站在介紹linux效能分析的經驗與觀點,值得花時間好好在這上面研究.
  2. perf wiki link1 link2
  3. IBM工程師對perf的介紹 link

perf與PMU的關係

perf之所以能夠量測到硬體的資訊是因為CPU中的PMU,PMU的全名是performance monitor unit,而根據perf wiki的內容來看會perf能夠量測到的硬體事件取決於硬體製造商是否支援,Intel的文件在此: Intel PMU event tables: Appendix A of manual here (待讀!)

最佳化策略

根據perf的量測結果,效能瓶頸在於cache-misses與執行findName的週期數(cycles)過多,所以先針對cache-misses提出的策略是降低結構entry的資料大小

降低entry的資料大小

將原本的資料從原本的136Byte將到32Byte,

sizeof(entry) => 136Byte
$ perf stat -e cache-misses,cache-references,instructions,cycles ./phonebook_orig

​ Performance counter stats for './phonebook_orig' (100 runs):

​​​​      198,9886      cache-misses              #   96.073 % of all cache refs      ( +-  0.18% )
​​​​      207,1232      cache-references                                              ( +-  0.17% )
​​​​   2,6201,1014      instructions              #    1.42  insns per cycle          ( +-  0.02% )
​​​​   1,8478,3054      cycles                                                        ( +-  0.14% )

​​​​   0.073231265 seconds time elapsed                                          ( +-  0.38% )

sizeof(entry) => 32Byte
$ perf stat -e cache-misses,cache-references,instructions,cycles ./phonebook_opt

​Performance counter stats for './phonebook_opt'  (100 runs):

​​​​       41,8951      cache-misses              #   73.309 % of all cache refs      ( +-  0.21% )
​​​​       57,1486      cache-references                                              ( +-  0.75% )
​​​​    2,4420,5622      instructions              #    1.75  insns per cycle          ( +-  0.02% )
​​​​   1,3939,9549      cycles                                                        ( +-  0.85% )

​​​​    0.055475061 seconds time elapsed                                          ( +-  1.05% )

sizeof(entry) => 24Byte
$ perf stat -e cache-misses,cache-references,instructions,cycles ./phonebook_opt

​Performance counter stats for './phonebook_opt':

​​​​       43,6432      cache-misses              #   79.510 % of all cache refs    
​​​​       54,8901      cache-references                                            
​​​​   4,0384,2330      instructions              #    2.00  insns per cycle        
​​​​   2,0207,4955      cycles                                                      

​​​​   0.120265124 seconds time elapsed

心中的疑問

  1. 在新的結構中,若是將MAX_LAST_NAME_SIZE從16改為159,sizeof(entry)一樣是32,但如果從16改為8則sizeof(entry)會變成24,看起來應該是跟data alignment有關.

    typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    struct __PHONE_BOOK_DETAIL_INFO *pDetailInfo;
    struct __PHONE_BOOK_ENTRY *pNext;
    } entry;

  2. 把entry資料結構從36Byte在降到24Byte,cache-misses反而又升高了,還在思考可能的原因是什麼.

以Hash table降低findName的執行週期數

一篇關於hash table的介紹與範例程式 link

origin 的版本
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ perf record ./phonebook_orig
$ perf report

​Samples: 564  of event 'cycles:pp', Event count (approx.): 188719537
​Overhead  Command         Shared Object     Symbol
​  13.45%  phonebook_orig  phonebook_orig    [.] main
​  12.68%  phonebook_orig  libc-2.23.so      [.] __strcasecmp_l_avx
​  11.76%  phonebook_orig  phonebook_orig    [.] findName
​   9.90%  phonebook_orig  libc-2.23.so      [.] _int_malloc
​   9.30%  phonebook_orig  libc-2.23.so      [.] _IO_fgets
​   6.52%  phonebook_orig  libc-2.23.so      [.] __memcpy_sse2
​   4.88%  phonebook_orig  libc-2.23.so      [.] __strcpy_sse2_unaligned

opt(sizeof(entry)=32)的版本
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ perf record ./phonebook_opt
$ perf report

​Samples: 427  of event 'cycles:pp', Event count (approx.): 136560977
​Overhead  Command        Shared Object     Symbol
​  22.03%  phonebook_opt  phonebook_opt     [.] main
​  14.02%  phonebook_opt  libc-2.23.so      [.] __strcasecmp_l_avx
​  10.31%  phonebook_opt  libc-2.23.so      [.] _IO_fgets
​   9.52%  phonebook_opt  libc-2.23.so      [.] _int_malloc
​   8.48%  phonebook_opt  libc-2.23.so      [.] __memcpy_sse2
​   4.63%  phonebook_opt  libc-2.23.so      [.] __strcpy_sse2_unaligned
​   4.47%  phonebook_opt  libc-2.23.so      [.] _IO_getline_info
​   4.34%  phonebook_opt  libc-2.23.so      [.] memchr
​   3.44%  phonebook_opt  phonebook_opt     [.] append
​   2.91%  phonebook_opt  libc-2.23.so      [.] malloc
​   1.94%  phonebook_opt  phonebook_opt     [.] findName

hash table + originy(sizeof(entry)=136)的版本
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ perf record ./phonebook_hashtb
$ perf report

​Samples: 917  of event 'cycles:pp', Event count (approx.): 450670184
​Overhead  Command          Shared Object     Symbol
​  59.34%  phonebook_hasht  phonebook_hashtb  [.] append
​   8.45%  phonebook_hasht  phonebook_hashtb  [.] BKDRHash
​   6.06%  phonebook_hasht  phonebook_hashtb  [.] main
​   3.80%  phonebook_hasht  libc-2.23.so      [.] _int_malloc
​   3.29%  phonebook_hasht  libc-2.23.so      [.] _IO_fgets
​   1.83%  phonebook_hasht  libc-2.23.so      [.] __memcpy_sse2

hash table + opt(sizeof(entry) = 32)的版本
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ perf record ./phonebook_hashtb
$ perf report

​Samples: 867  of event 'cycles:pp', Event count (approx.): 438113332
​Overhead  Command          Shared Object     Symbol
​  58.01%  phonebook_hasht  phonebook_hashtb  [.] append
​   8.72%  phonebook_hasht  phonebook_hashtb  [.] BKDRHash
​   5.55%  phonebook_hasht  libc-2.23.so      [.] _int_malloc
​   5.49%  phonebook_hasht  phonebook_hashtb  [.] main
​   2.99%  phonebook_hasht  libc-2.23.so      [.] __strcpy_sse2_unaligned
​   2.94%  phonebook_hasht  libc-2.23.so      [.] _IO_fgets
​   2.52%  phonebook_hasht  libc-2.23.so      [.] _IO_getline_info
​   2.10%  phonebook_hasht  libc-2.23.so      [.] malloc

再加入了hash table後看到了瓶頸反而變成了計算key值的BKDRHash()


綜合比較結果