# 2017q1 Homework1 (phonebook)
contributed by < `Hsiang` >
==作業解說 [video](https://www.youtube.com/watch?v=ZICRLKf_bVw)==
### Reviewed by `yangyang95`
- 文字訊息不要用圖片表示,以方便搜尋
- Cache-misses 的比例 92.475 "%" ,百分比需明確標註
- .gitignore 裡面可以寫成 *.txt *.png 來忽略特定檔案格式
- commit [1d98e806cd66156ab0393942921262bfae052a2b]()
筆記中應介紹三種 Hash function 是如何實作,之間的差異在哪
這個 commit 也應考慮分成三個,一次一個實作,並配上適合的 message
- 可以做圖比較各實作 append() findName() 時間差異,並放在最後,方便比較
---
### 目標
* 熟悉 Perf
* 優化 phonebook 效能
* 學習關於Cache基礎
### 開發環境

作業系統:
```
uname -a
Linux ubuntu 4.8.0-39-generic #42~16.04.1-Ubuntu SMP Mon Feb 20 15:06:07 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
```
### Cache 架構
```
⚡ lstopo
Machine (16GB)
Package L#0 + L3 L#0 (6144KB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#4)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#5)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#6)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#7)
HostBridge L#0
PCI 8086:0166
GPU L#0 "renderD128"
GPU L#1 "card0"
GPU L#2 "controlD64"
PCIBridge
PCI 14e4:1686
Net L#3 "enp1s0f0"
PCIBridge
PCI 14e4:4331
Net L#4 "wlp2s0"
PCI 8086:1e03
Block(Disk) L#5 "sda"
Block(Disk) L#6 "sdb"
```
* L1i Cache 32KB
* L1d Cache 32KB
* L2 Cache 256KB
* L3 Cache 6144KB
##### Cache line size
```
⚡ getconf LEVEL1_DACAHE_LINESIZE
64
```
### perf
設定一般user ```sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'```
### 問題分析
* cache miss 是什麼?
### 效能分析-優化前
* `./phonebook_orig`
```
size of entry : 136 bytes
execution time of append() : 0.082711 sec
execution time of findName() : 0.006030 sec
```
* Stat 分析
```
⚡ sudo perf stat -r 100 ./phonebook_orig
Performance counter stats for './phonebook_orig' (100 runs):
62.434152 task-clock (msec) # 0.996 CPUs utilized ( +- 2.01% )
0 context-switches # 0.002 K/sec ( +- 25.73% )
0 cpu-migrations # 0.000 K/sec
12,354 page-faults # 0.198 M/sec ( +- 0.00% )
177,805,159 cycles # 2.848 GHz ( +- 0.69% )
84,945,067 stalled-cycles-frontend # 47.77% frontend cycles idle ( +- 1.33)
264,406,729 instructions # 1.49 insn per cycle
# 0.32 stalled cycles per insn ( +- 0.02% )
48,204,531 branches # 772.086 M/sec ( +- 0.01% )
923,397 branch-misses # 1.92% of all branches ( +- 0.93% )
0.062661753 seconds time elapsed ( +- 2.00% )
```
==由於0.996 CPUs utilized非常接近1屬CPU BOUND型==
* 針對 Event Types 分析
```
⚡ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict" 取消禁用 Kenerl pointer
⚡ sudo perf stat -r 100 -e \
cache-misses,cache-references, \
L1-dcache-load-misses, \
L1-dcache-store-misses, \
L1-dcache-prefetch-misses, \
L1-icache-load-misses, \
instructions, \
cycles \
./phonebook_orig
```
```
Performance counter stats for './phonebook_orig' (100 runs):
1,790,769 cache-misses # 92.475 % of all cache refs ( +- 1.41% ) (61.16%)
1,936,484 cache-references ( +- 1.32% ) (64.08%)
2,679,276 L1-dcache-load-misses ( +- 1.14% ) (65.20%)
1,000,437 L1-dcache-store-misses ( +- 0.70% ) (54.45%)
799,336 L1-dcache-prefetch-misses ( +- 3.44% ) (53.35%)
125,428 L1-icache-load-misses ( +- 1.45% ) (49.85%)
257,663,444 instructions # 1.56 insn per cycle ( +- 0.34% ) (61.42%)
165,474,641 cycles ( +- 0.36% ) (58.85%)
0.095967552 seconds time elapsed ( +- 1.83% )
```
==Cache-misses 的比例92.475==
* perf hotspot
```
⚡ sudo perf record -e cache-misses ./phonebook_orig
⚡ sudo perf report
Samples: 350 of event 'cache-misses', Event count (approx.): 1655901
Overhead Command Shared Object Symbol
# red
41.68% phonebook_orig libc-2.23.so [.] __strcasecmp_l_avx
35.11% phonebook_orig [kernel.kallsyms] [k] clear_page_c_e
5.39% phonebook_orig phonebook_orig [.] findName
# green
4.07% phonebook_orig [kernel.kallsyms] [k] _raw_spin_lock
2.51% phonebook_orig [kernel.kallsyms] [k] get_mem_cgroup_from_mm
2.09% phonebook_orig [kernel.kallsyms] [k] copy_user_enhanced_fast_string
1.97% phonebook_orig [kernel.kallsyms] [k] get_page_from_freelist
1.27% phonebook_orig [kernel.kallsyms] [k] handle_mm_fault
1.15% phonebook_orig [kernel.kallsyms] [k] pte_lockptr.isra.15
0.81% phonebook_orig [kernel.kallsyms] [k] try_charge
0.80% phonebook_orig [kernel.kallsyms] [k] mem_cgroup_try_charge
0.64% phonebook_orig libc-2.23.so [.] __strcasecmp_avx
0.64% phonebook_orig [kernel.kallsyms] [k] unmap_page_range
0.62% phonebook_orig [kernel.kallsyms] [k] free_hot_cold_page_list
0.58% phonebook_orig [kernel.kallsyms] [k] free_pcppages_bulk
0.57% phonebook_orig [kernel.kallsyms] [k] __alloc_pages_nodemask
0.11% phonebook_orig ld-2.23.so [.] _dl_map_object_from_fd
# white
0.00% perf [kernel.kallsyms] [k] __perf_event_header__init_id
0.00% perf [kernel.kallsyms] [k] strlcpy
0.00% perf [kernel.kallsyms] [k] perf_event_addr_filters_exec
```
根據Hotspot分析出來 前三個function是造成cache-miss主要的原因:
* __strcasecmp_l_avx -> int strcasecmp (const char *s1, const char *s2)
* clear_page_c_e ->
* findName -> entry *findName(char lastName[], entry *pHead)
由函數指出字串比對次數過多導致miss次數變多
### 改善方向
由上面的分析數據顯示以下:
* 儲存資料的結構與演算法導致miss次數增加
* 搜尋名字的演算法沒有效率
### Optimization
L1d Cache size = 32KB,Cache line size = 64Bytes
Cache Line的個數
==32 * 1024 / 64== = 512
每筆Entry有136Bytes,因此一次存入L1d的數量就只有
==32 * 512 / 136== 大約 120筆 Entry
根據大小改善的原始結構後
每筆Entry剩下24Bytes,因此一次存入L1d的數量增加為
==32 * 512 / 24== 大約 682筆 Entry
##### phonebook_orig.h
```clike=
#define MAX_LAST_NAME_SIZE 16
typedef struct __PHONE_BOOK_ENTRY {
struct __PHONE_BOOK_ENTRY *pNext;
char lastName[MAX_LAST_NAME_SIZE];
}
```
```
Performance counter stats for './phonebook_opt' (100 runs):
216,188 cache-misses # 72.132 % of all cache refs ( +- 0.48% )
299,709 cache-references ( +- 0.37% )
0.061514855 seconds time elapsed
```
時間上的成長
```
size of entry : 136 bytes
execution time of append() : 0.084283 sec
execution time of findName() : 0.006502 sec
===========================================
size of entry : 24 bytes
execution time of append() : 0.068817 sec
execution time of findName() : 0.003831 sec
```
減少不必要的欄位空間,使cache-misses率下降了約20%,搜尋以及儲存資料的時間也有些微下降
#### Hash Function(BKDR)
整理了 [Hash Function](https://hackmd.io/OwIwTALAHGwCYFoDMA2AnGhEQGMUKigkTSgFMBWARioAYBDOFAM1qA==?both)
先是建立了一張3571大小的TABLE,再不減少Entry的情況下進行分析
```
size of entry :136 bytes
execution time of append() : 0.110690 sec
execution time of findName() : 0.000001 sec
==============================================================
Performance counter stats for './phonebook_hash' (100 runs):
2,514,577 cache-misses # 56.257 % of all cache refs ( +- 0.27% ) (57.49%)
4,469,801 cache-references ( +- 0.31% ) (58.22%)
5,879,673 L1-dcache-load-misses # 5.82% of all L1-dcache hits ( +- 0.25% ) (58.54%)
101,099,783 L1-dcache-loads ( +- 0.60% ) (50.24%)
629,741 L1-dcache-prefetch-misses ( +- 0.99% ) (28.11%)
3,190,335 L1-dcache-store-misses ( +- 0.43% ) (29.31%)
130,865 L1-icache-load-misses ( +- 0.88% ) (43.36%)
0.164596361 seconds time elapsed
```
減少大小後
```
size of entry :32 bytes
execution time of append() : 0.099975 sec
execution time of findName() : 0.000001 sec
==============================================================
Performance counter stats for './phonebook_hash' (100 runs):
1,701,515 cache-misses # 45.586 % of all cache refs ( +- 0.32% ) (57.57%)
3,732,515 cache-references ( +- 0.35% ) (58.29%)
5,336,750 L1-dcache-load-misses # 5.56% of all L1-dcache hits ( +- 0.40% ) (58.54%)
95,991,463 L1-dcache-loads ( +- 0.87% ) (50.07%)
482,912 L1-dcache-prefetch-misses ( +- 0.77% ) (28.06%)
2,756,774 L1-dcache-store-misses ( +- 0.80% ) (29.43%)
70,032 L1-icache-load-misses ( +- 0.79% ) (43.42%)
0.162973286 seconds time elapsed
```
由數據分析換上Hash Function後
Cache-miss的比例降低不少 ==92.475 -> 56.257==
append時間多出==0.03 sec==
findName時間則下降至==0.000001 sec==
然而減少Hash大小也減少的些微Cache-miss的比例 ==56.257 -> 45.586==
append時間下降 ==0.011== sec
:::info
Cache-misses 45.586% 依然偏高
:::
---
學習ierosodin的分析,也著手分析大小對於分佈的影響




```
(mod 9997)
size of entry :32 bytes
execution time of append() : 0.101987 sec
execution time of findName() : 0.000000 sec
==============================================================
Performance counter stats for './phonebook_hash' (100 runs):
1,676,158 cache-misses # 43.189 % of all cache refs ( +- 0.29% ) (57.55%)
3,880,979 cache-references ( +- 0.34% ) (58.36%)
5,470,038 L1-dcache-load-misses # 5.51% of all L1-dcache hits ( +- 0.42% ) (58.80%)
99,279,186 L1-dcache-loads ( +- 0.71% ) (50.62%)
483,688 L1-dcache-prefetch-misses ( +- 0.84% ) (27.92%)
2,852,689 L1-dcache-store-misses ( +- 0.80% ) (29.11%)
75,499 L1-icache-load-misses ( +- 0.86% ) (43.37%)
0.167409641 seconds time elapsed
```
根據結果
當把TABLE_SIZE設成9997時,search時間下降到小於 10^-6^ sec
且append時間只增加了==0.002 sec==,Cache-misses 也多下降了 ==2%==
也驗證了分佈越均勻確實可以下降Cache-misses的比例
---
* perf hotspot
```
Samples: 695 of event 'cache-misses', Event count (approx.): 1660031
Overhead Command Shared Object Symbol
49.30% phonebook_hash libc-2.23.so [.] malloc_consolidate
22.13% phonebook_hash libc-2.23.so [.] _int_free
12.29% phonebook_hash [kernel.kallsyms] [k] clear_page_c_e
```
根據顯示 malloc的部分還是造成了許多Cache-misses的情況,也因此在釋放記憶體的時間增加了Caches misses的比例。
```clike=
//main.c#55
entry *pHead, *e[TABLE_SIZE];
pHead = (entry *) malloc(sizeof(entry) * TABLE_SIZE);
printf("size of entry :%lu bytes\n", sizeof(entry));
for (i = 0; i < TABLE_SIZE; i++) {
e[i] = &pHead[i];
e[i]->pNext = NULL;
}
i = 0;
```
著手改善的方向memory pool
### 參考資料
##### 網路資源
* [Perf Wiki](https://perf.wiki.kernel.org/index.php/Tutorial)
* [Linux 效能分析工具: Perf](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial)
* [Perf 介紹](http://mropengate.blogspot.tw/2016/04/perf.html)
* [Perf Example](http://www.brendangregg.com/perf.html)
* [Perf 教學-1](https://www.ibm.com/developerworks/cn/linux/l-cn-perf1/)
* [Perf 教學-2](https://www.ibm.com/developerworks/cn/linux/l-cn-perf2/index.html)
* [Cache explain](http://cenalulu.github.io/linux/all-about-cpu-cache/)
* [Struct with Memory](http://www.catb.org/esr/structure-packing/)
##### 書籍
* [書籍-計算機組織與設計(軟硬體介面) 5-3章節]
##### 共筆
* [jayfeng0225](https://hackmd.io/EYBgHAbMCGAsCMBaaATcjYCYCmAzRAnCMAOyJp4G4DGKE8AzNEA=?view)
* [ierosodin](https://hackmd.io/s/SJ4uqs9Yx#2017q1-homework1-phonebook)
* [陳品睿](https://embedded2016.hackpad.com/2016q1-Homework1-utBhpkDFVMh)