# 2017q1 Homework1 (phonebook) contributed by < `Hsiang` > ==作業解說 [video](https://www.youtube.com/watch?v=ZICRLKf_bVw)== ### Reviewed by `yangyang95` - 文字訊息不要用圖片表示,以方便搜尋 - Cache-misses 的比例 92.475 "%" ,百分比需明確標註 - .gitignore 裡面可以寫成 *.txt *.png 來忽略特定檔案格式 - commit [1d98e806cd66156ab0393942921262bfae052a2b]() 筆記中應介紹三種 Hash function 是如何實作,之間的差異在哪 這個 commit 也應考慮分成三個,一次一個實作,並配上適合的 message - 可以做圖比較各實作 append() findName() 時間差異,並放在最後,方便比較 --- ### 目標 * 熟悉 Perf * 優化 phonebook 效能 * 學習關於Cache基礎 ### 開發環境 ![](https://i.imgur.com/W1xcHgT.png) 作業系統: ``` uname -a Linux ubuntu 4.8.0-39-generic #42~16.04.1-Ubuntu SMP Mon Feb 20 15:06:07 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux ``` ### Cache 架構 ``` ⚡ lstopo Machine (16GB) Package L#0 + L3 L#0 (6144KB) L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 PU L#0 (P#0) PU L#1 (P#4) L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 PU L#2 (P#1) PU L#3 (P#5) L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 PU L#4 (P#2) PU L#5 (P#6) L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 PU L#6 (P#3) PU L#7 (P#7) HostBridge L#0 PCI 8086:0166 GPU L#0 "renderD128" GPU L#1 "card0" GPU L#2 "controlD64" PCIBridge PCI 14e4:1686 Net L#3 "enp1s0f0" PCIBridge PCI 14e4:4331 Net L#4 "wlp2s0" PCI 8086:1e03 Block(Disk) L#5 "sda" Block(Disk) L#6 "sdb" ``` * L1i Cache 32KB * L1d Cache 32KB * L2 Cache 256KB * L3 Cache 6144KB ##### Cache line size ``` ⚡ getconf LEVEL1_DACAHE_LINESIZE 64 ``` ### perf 設定一般user ```sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'``` ### 問題分析 * cache miss 是什麼? ### 效能分析-優化前 * `./phonebook_orig` ``` size of entry : 136 bytes execution time of append() : 0.082711 sec execution time of findName() : 0.006030 sec ``` * Stat 分析 ``` ⚡ sudo perf stat -r 100 ./phonebook_orig Performance counter stats for './phonebook_orig' (100 runs): 62.434152 task-clock (msec) # 0.996 CPUs utilized ( +- 2.01% ) 0 context-switches # 0.002 K/sec ( +- 25.73% ) 0 cpu-migrations # 0.000 K/sec 12,354 page-faults # 0.198 M/sec ( +- 0.00% ) 177,805,159 cycles # 2.848 GHz ( +- 0.69% ) 84,945,067 stalled-cycles-frontend # 47.77% frontend cycles idle ( +- 1.33) 264,406,729 instructions # 1.49 insn per cycle # 0.32 stalled cycles per insn ( +- 0.02% ) 48,204,531 branches # 772.086 M/sec ( +- 0.01% ) 923,397 branch-misses # 1.92% of all branches ( +- 0.93% ) 0.062661753 seconds time elapsed ( +- 2.00% ) ``` ==由於0.996 CPUs utilized非常接近1屬CPU BOUND型== * 針對 Event Types 分析 ``` ⚡ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict" 取消禁用 Kenerl pointer ⚡ sudo perf stat -r 100 -e \ cache-misses,cache-references, \ L1-dcache-load-misses, \ L1-dcache-store-misses, \ L1-dcache-prefetch-misses, \ L1-icache-load-misses, \ instructions, \ cycles \ ./phonebook_orig ``` ``` Performance counter stats for './phonebook_orig' (100 runs): 1,790,769 cache-misses # 92.475 % of all cache refs ( +- 1.41% ) (61.16%) 1,936,484 cache-references ( +- 1.32% ) (64.08%) 2,679,276 L1-dcache-load-misses ( +- 1.14% ) (65.20%) 1,000,437 L1-dcache-store-misses ( +- 0.70% ) (54.45%) 799,336 L1-dcache-prefetch-misses ( +- 3.44% ) (53.35%) 125,428 L1-icache-load-misses ( +- 1.45% ) (49.85%) 257,663,444 instructions # 1.56 insn per cycle ( +- 0.34% ) (61.42%) 165,474,641 cycles ( +- 0.36% ) (58.85%) 0.095967552 seconds time elapsed ( +- 1.83% ) ``` ==Cache-misses 的比例92.475== * perf hotspot ``` ⚡ sudo perf record -e cache-misses ./phonebook_orig ⚡ sudo perf report Samples: 350 of event 'cache-misses', Event count (approx.): 1655901 Overhead Command Shared Object Symbol # red 41.68% phonebook_orig libc-2.23.so [.] __strcasecmp_l_avx 35.11% phonebook_orig [kernel.kallsyms] [k] clear_page_c_e 5.39% phonebook_orig phonebook_orig [.] findName # green 4.07% phonebook_orig [kernel.kallsyms] [k] _raw_spin_lock 2.51% phonebook_orig [kernel.kallsyms] [k] get_mem_cgroup_from_mm 2.09% phonebook_orig [kernel.kallsyms] [k] copy_user_enhanced_fast_string 1.97% phonebook_orig [kernel.kallsyms] [k] get_page_from_freelist 1.27% phonebook_orig [kernel.kallsyms] [k] handle_mm_fault 1.15% phonebook_orig [kernel.kallsyms] [k] pte_lockptr.isra.15 0.81% phonebook_orig [kernel.kallsyms] [k] try_charge 0.80% phonebook_orig [kernel.kallsyms] [k] mem_cgroup_try_charge 0.64% phonebook_orig libc-2.23.so [.] __strcasecmp_avx 0.64% phonebook_orig [kernel.kallsyms] [k] unmap_page_range 0.62% phonebook_orig [kernel.kallsyms] [k] free_hot_cold_page_list 0.58% phonebook_orig [kernel.kallsyms] [k] free_pcppages_bulk 0.57% phonebook_orig [kernel.kallsyms] [k] __alloc_pages_nodemask 0.11% phonebook_orig ld-2.23.so [.] _dl_map_object_from_fd # white 0.00% perf [kernel.kallsyms] [k] __perf_event_header__init_id 0.00% perf [kernel.kallsyms] [k] strlcpy 0.00% perf [kernel.kallsyms] [k] perf_event_addr_filters_exec ``` 根據Hotspot分析出來 前三個function是造成cache-miss主要的原因: * __strcasecmp_l_avx -> int strcasecmp (const char *s1, const char *s2) * clear_page_c_e -> * findName -> entry *findName(char lastName[], entry *pHead) 由函數指出字串比對次數過多導致miss次數變多 ### 改善方向 由上面的分析數據顯示以下: * 儲存資料的結構與演算法導致miss次數增加 * 搜尋名字的演算法沒有效率 ### Optimization L1d Cache size = 32KB,Cache line size = 64Bytes Cache Line的個數 ==32 * 1024 / 64== = 512 每筆Entry有136Bytes,因此一次存入L1d的數量就只有 ==32 * 512 / 136== 大約 120筆 Entry 根據大小改善的原始結構後 每筆Entry剩下24Bytes,因此一次存入L1d的數量增加為 ==32 * 512 / 24== 大約 682筆 Entry ##### phonebook_orig.h ```clike= #define MAX_LAST_NAME_SIZE 16 typedef struct __PHONE_BOOK_ENTRY { struct __PHONE_BOOK_ENTRY *pNext; char lastName[MAX_LAST_NAME_SIZE]; } ``` ``` Performance counter stats for './phonebook_opt' (100 runs): 216,188 cache-misses # 72.132 % of all cache refs ( +- 0.48% ) 299,709 cache-references ( +- 0.37% ) 0.061514855 seconds time elapsed ``` 時間上的成長 ``` size of entry : 136 bytes execution time of append() : 0.084283 sec execution time of findName() : 0.006502 sec =========================================== size of entry : 24 bytes execution time of append() : 0.068817 sec execution time of findName() : 0.003831 sec ``` 減少不必要的欄位空間,使cache-misses率下降了約20%,搜尋以及儲存資料的時間也有些微下降 #### Hash Function(BKDR) 整理了 [Hash Function](https://hackmd.io/OwIwTALAHGwCYFoDMA2AnGhEQGMUKigkTSgFMBWARioAYBDOFAM1qA==?both) 先是建立了一張3571大小的TABLE,再不減少Entry的情況下進行分析 ``` size of entry :136 bytes execution time of append() : 0.110690 sec execution time of findName() : 0.000001 sec ============================================================== Performance counter stats for './phonebook_hash' (100 runs): 2,514,577 cache-misses # 56.257 % of all cache refs ( +- 0.27% ) (57.49%) 4,469,801 cache-references ( +- 0.31% ) (58.22%) 5,879,673 L1-dcache-load-misses # 5.82% of all L1-dcache hits ( +- 0.25% ) (58.54%) 101,099,783 L1-dcache-loads ( +- 0.60% ) (50.24%) 629,741 L1-dcache-prefetch-misses ( +- 0.99% ) (28.11%) 3,190,335 L1-dcache-store-misses ( +- 0.43% ) (29.31%) 130,865 L1-icache-load-misses ( +- 0.88% ) (43.36%) 0.164596361 seconds time elapsed ``` 減少大小後 ``` size of entry :32 bytes execution time of append() : 0.099975 sec execution time of findName() : 0.000001 sec ============================================================== Performance counter stats for './phonebook_hash' (100 runs): 1,701,515 cache-misses # 45.586 % of all cache refs ( +- 0.32% ) (57.57%) 3,732,515 cache-references ( +- 0.35% ) (58.29%) 5,336,750 L1-dcache-load-misses # 5.56% of all L1-dcache hits ( +- 0.40% ) (58.54%) 95,991,463 L1-dcache-loads ( +- 0.87% ) (50.07%) 482,912 L1-dcache-prefetch-misses ( +- 0.77% ) (28.06%) 2,756,774 L1-dcache-store-misses ( +- 0.80% ) (29.43%) 70,032 L1-icache-load-misses ( +- 0.79% ) (43.42%) 0.162973286 seconds time elapsed ``` 由數據分析換上Hash Function後 Cache-miss的比例降低不少 ==92.475 -> 56.257== append時間多出==0.03 sec== findName時間則下降至==0.000001 sec== 然而減少Hash大小也減少的些微Cache-miss的比例 ==56.257 -> 45.586== append時間下降 ==0.011== sec :::info Cache-misses 45.586% 依然偏高 ::: --- 學習ierosodin的分析,也著手分析大小對於分佈的影響 ![](https://i.imgur.com/E0drSSN.png) ![](https://i.imgur.com/UtFt6IY.png) ![](https://i.imgur.com/rtCQpA0.png) ![](https://i.imgur.com/OnVaMkl.png) ``` (mod 9997) size of entry :32 bytes execution time of append() : 0.101987 sec execution time of findName() : 0.000000 sec ============================================================== Performance counter stats for './phonebook_hash' (100 runs): 1,676,158 cache-misses # 43.189 % of all cache refs ( +- 0.29% ) (57.55%) 3,880,979 cache-references ( +- 0.34% ) (58.36%) 5,470,038 L1-dcache-load-misses # 5.51% of all L1-dcache hits ( +- 0.42% ) (58.80%) 99,279,186 L1-dcache-loads ( +- 0.71% ) (50.62%) 483,688 L1-dcache-prefetch-misses ( +- 0.84% ) (27.92%) 2,852,689 L1-dcache-store-misses ( +- 0.80% ) (29.11%) 75,499 L1-icache-load-misses ( +- 0.86% ) (43.37%) 0.167409641 seconds time elapsed ``` 根據結果 當把TABLE_SIZE設成9997時,search時間下降到小於 10^-6^ sec 且append時間只增加了==0.002 sec==,Cache-misses 也多下降了 ==2%== 也驗證了分佈越均勻確實可以下降Cache-misses的比例 --- * perf hotspot ``` Samples: 695 of event 'cache-misses', Event count (approx.): 1660031 Overhead Command Shared Object Symbol 49.30% phonebook_hash libc-2.23.so [.] malloc_consolidate 22.13% phonebook_hash libc-2.23.so [.] _int_free 12.29% phonebook_hash [kernel.kallsyms] [k] clear_page_c_e ``` 根據顯示 malloc的部分還是造成了許多Cache-misses的情況,也因此在釋放記憶體的時間增加了Caches misses的比例。 ```clike= //main.c#55 entry *pHead, *e[TABLE_SIZE]; pHead = (entry *) malloc(sizeof(entry) * TABLE_SIZE); printf("size of entry :%lu bytes\n", sizeof(entry)); for (i = 0; i < TABLE_SIZE; i++) { e[i] = &pHead[i]; e[i]->pNext = NULL; } i = 0; ``` 著手改善的方向memory pool ### 參考資料 ##### 網路資源 * [Perf Wiki](https://perf.wiki.kernel.org/index.php/Tutorial) * [Linux 效能分析工具: Perf](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial) * [Perf 介紹](http://mropengate.blogspot.tw/2016/04/perf.html) * [Perf Example](http://www.brendangregg.com/perf.html) * [Perf 教學-1](https://www.ibm.com/developerworks/cn/linux/l-cn-perf1/) * [Perf 教學-2](https://www.ibm.com/developerworks/cn/linux/l-cn-perf2/index.html) * [Cache explain](http://cenalulu.github.io/linux/all-about-cpu-cache/) * [Struct with Memory](http://www.catb.org/esr/structure-packing/) ##### 書籍 * [書籍-計算機組織與設計(軟硬體介面) 5-3章節] ##### 共筆 * [jayfeng0225](https://hackmd.io/EYBgHAbMCGAsCMBaaATcjYCYCmAzRAnCMAOyJp4G4DGKE8AzNEA=?view) * [ierosodin](https://hackmd.io/s/SJ4uqs9Yx#2017q1-homework1-phonebook) * [陳品睿](https://embedded2016.hackpad.com/2016q1-Homework1-utBhpkDFVMh)