contributed by < Hsiang
>
作業解說 video
yangyang95
作業系統:
uname -a
Linux ubuntu 4.8.0-39-generic #42~16.04.1-Ubuntu SMP Mon Feb 20 15:06:07 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
⚡ lstopo
Machine (16GB)
Package L#0 + L3 L#0 (6144KB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#4)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#5)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#6)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#7)
HostBridge L#0
PCI 8086:0166
GPU L#0 "renderD128"
GPU L#1 "card0"
GPU L#2 "controlD64"
PCIBridge
PCI 14e4:1686
Net L#3 "enp1s0f0"
PCIBridge
PCI 14e4:4331
Net L#4 "wlp2s0"
PCI 8086:1e03
Block(Disk) L#5 "sda"
Block(Disk) L#6 "sdb"
⚡ getconf LEVEL1_DACAHE_LINESIZE
64
設定一般user sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'
./phonebook_orig
size of entry : 136 bytes
execution time of append() : 0.082711 sec
execution time of findName() : 0.006030 sec
⚡ sudo perf stat -r 100 ./phonebook_orig
Performance counter stats for './phonebook_orig' (100 runs):
62.434152 task-clock (msec) # 0.996 CPUs utilized ( +- 2.01% )
0 context-switches # 0.002 K/sec ( +- 25.73% )
0 cpu-migrations # 0.000 K/sec
12,354 page-faults # 0.198 M/sec ( +- 0.00% )
177,805,159 cycles # 2.848 GHz ( +- 0.69% )
84,945,067 stalled-cycles-frontend # 47.77% frontend cycles idle ( +- 1.33)
264,406,729 instructions # 1.49 insn per cycle
# 0.32 stalled cycles per insn ( +- 0.02% )
48,204,531 branches # 772.086 M/sec ( +- 0.01% )
923,397 branch-misses # 1.92% of all branches ( +- 0.93% )
0.062661753 seconds time elapsed ( +- 2.00% )
由於0.996 CPUs utilized非常接近1屬CPU BOUND型
⚡ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict" 取消禁用 Kenerl pointer
⚡ sudo perf stat -r 100 -e \
cache-misses,cache-references, \
L1-dcache-load-misses, \
L1-dcache-store-misses, \
L1-dcache-prefetch-misses, \
L1-icache-load-misses, \
instructions, \
cycles \
./phonebook_orig
Performance counter stats for './phonebook_orig' (100 runs):
1,790,769 cache-misses # 92.475 % of all cache refs ( +- 1.41% ) (61.16%)
1,936,484 cache-references ( +- 1.32% ) (64.08%)
2,679,276 L1-dcache-load-misses ( +- 1.14% ) (65.20%)
1,000,437 L1-dcache-store-misses ( +- 0.70% ) (54.45%)
799,336 L1-dcache-prefetch-misses ( +- 3.44% ) (53.35%)
125,428 L1-icache-load-misses ( +- 1.45% ) (49.85%)
257,663,444 instructions # 1.56 insn per cycle ( +- 0.34% ) (61.42%)
165,474,641 cycles ( +- 0.36% ) (58.85%)
0.095967552 seconds time elapsed ( +- 1.83% )
Cache-misses 的比例92.475
⚡ sudo perf record -e cache-misses ./phonebook_orig
⚡ sudo perf report
Samples: 350 of event 'cache-misses', Event count (approx.): 1655901
Overhead Command Shared Object Symbol
# red
41.68% phonebook_orig libc-2.23.so [.] __strcasecmp_l_avx
35.11% phonebook_orig [kernel.kallsyms] [k] clear_page_c_e
5.39% phonebook_orig phonebook_orig [.] findName
# green
4.07% phonebook_orig [kernel.kallsyms] [k] _raw_spin_lock
2.51% phonebook_orig [kernel.kallsyms] [k] get_mem_cgroup_from_mm
2.09% phonebook_orig [kernel.kallsyms] [k] copy_user_enhanced_fast_string
1.97% phonebook_orig [kernel.kallsyms] [k] get_page_from_freelist
1.27% phonebook_orig [kernel.kallsyms] [k] handle_mm_fault
1.15% phonebook_orig [kernel.kallsyms] [k] pte_lockptr.isra.15
0.81% phonebook_orig [kernel.kallsyms] [k] try_charge
0.80% phonebook_orig [kernel.kallsyms] [k] mem_cgroup_try_charge
0.64% phonebook_orig libc-2.23.so [.] __strcasecmp_avx
0.64% phonebook_orig [kernel.kallsyms] [k] unmap_page_range
0.62% phonebook_orig [kernel.kallsyms] [k] free_hot_cold_page_list
0.58% phonebook_orig [kernel.kallsyms] [k] free_pcppages_bulk
0.57% phonebook_orig [kernel.kallsyms] [k] __alloc_pages_nodemask
0.11% phonebook_orig ld-2.23.so [.] _dl_map_object_from_fd
# white
0.00% perf [kernel.kallsyms] [k] __perf_event_header__init_id
0.00% perf [kernel.kallsyms] [k] strlcpy
0.00% perf [kernel.kallsyms] [k] perf_event_addr_filters_exec
根據Hotspot分析出來 前三個function是造成cache-miss主要的原因:
由函數指出字串比對次數過多導致miss次數變多
由上面的分析數據顯示以下:
L1d Cache size = 32KB,Cache line size = 64Bytes
Cache Line的個數
32 * 1024 / 64 = 512
每筆Entry有136Bytes,因此一次存入L1d的數量就只有
32 * 512 / 136 大約 120筆 Entry
根據大小改善的原始結構後
每筆Entry剩下24Bytes,因此一次存入L1d的數量增加為
32 * 512 / 24 大約 682筆 Entry
#define MAX_LAST_NAME_SIZE 16
typedef struct __PHONE_BOOK_ENTRY {
struct __PHONE_BOOK_ENTRY *pNext;
char lastName[MAX_LAST_NAME_SIZE];
}
Performance counter stats for './phonebook_opt' (100 runs):
216,188 cache-misses # 72.132 % of all cache refs ( +- 0.48% )
299,709 cache-references ( +- 0.37% )
0.061514855 seconds time elapsed
時間上的成長
size of entry : 136 bytes
execution time of append() : 0.084283 sec
execution time of findName() : 0.006502 sec
===========================================
size of entry : 24 bytes
execution time of append() : 0.068817 sec
execution time of findName() : 0.003831 sec
減少不必要的欄位空間,使cache-misses率下降了約20%,搜尋以及儲存資料的時間也有些微下降
整理了 Hash Function
先是建立了一張3571大小的TABLE,再不減少Entry的情況下進行分析
size of entry :136 bytes
execution time of append() : 0.110690 sec
execution time of findName() : 0.000001 sec
==============================================================
Performance counter stats for './phonebook_hash' (100 runs):
2,514,577 cache-misses # 56.257 % of all cache refs ( +- 0.27% ) (57.49%)
4,469,801 cache-references ( +- 0.31% ) (58.22%)
5,879,673 L1-dcache-load-misses # 5.82% of all L1-dcache hits ( +- 0.25% ) (58.54%)
101,099,783 L1-dcache-loads ( +- 0.60% ) (50.24%)
629,741 L1-dcache-prefetch-misses ( +- 0.99% ) (28.11%)
3,190,335 L1-dcache-store-misses ( +- 0.43% ) (29.31%)
130,865 L1-icache-load-misses ( +- 0.88% ) (43.36%)
0.164596361 seconds time elapsed
減少大小後
size of entry :32 bytes
execution time of append() : 0.099975 sec
execution time of findName() : 0.000001 sec
==============================================================
Performance counter stats for './phonebook_hash' (100 runs):
1,701,515 cache-misses # 45.586 % of all cache refs ( +- 0.32% ) (57.57%)
3,732,515 cache-references ( +- 0.35% ) (58.29%)
5,336,750 L1-dcache-load-misses # 5.56% of all L1-dcache hits ( +- 0.40% ) (58.54%)
95,991,463 L1-dcache-loads ( +- 0.87% ) (50.07%)
482,912 L1-dcache-prefetch-misses ( +- 0.77% ) (28.06%)
2,756,774 L1-dcache-store-misses ( +- 0.80% ) (29.43%)
70,032 L1-icache-load-misses ( +- 0.79% ) (43.42%)
0.162973286 seconds time elapsed
由數據分析換上Hash Function後
Cache-miss的比例降低不少 92.475 -> 56.257
append時間多出0.03 sec
findName時間則下降至0.000001 sec
然而減少Hash大小也減少的些微Cache-miss的比例 56.257 -> 45.586
append時間下降 0.011 sec
Cache-misses 45.586% 依然偏高
學習ierosodin的分析,也著手分析大小對於分佈的影響
(mod 9997)
size of entry :32 bytes
execution time of append() : 0.101987 sec
execution time of findName() : 0.000000 sec
==============================================================
Performance counter stats for './phonebook_hash' (100 runs):
1,676,158 cache-misses # 43.189 % of all cache refs ( +- 0.29% ) (57.55%)
3,880,979 cache-references ( +- 0.34% ) (58.36%)
5,470,038 L1-dcache-load-misses # 5.51% of all L1-dcache hits ( +- 0.42% ) (58.80%)
99,279,186 L1-dcache-loads ( +- 0.71% ) (50.62%)
483,688 L1-dcache-prefetch-misses ( +- 0.84% ) (27.92%)
2,852,689 L1-dcache-store-misses ( +- 0.80% ) (29.11%)
75,499 L1-icache-load-misses ( +- 0.86% ) (43.37%)
0.167409641 seconds time elapsed
根據結果
當把TABLE_SIZE設成9997時,search時間下降到小於 10-6 sec
且append時間只增加了0.002 sec,Cache-misses 也多下降了 2%
也驗證了分佈越均勻確實可以下降Cache-misses的比例
Samples: 695 of event 'cache-misses', Event count (approx.): 1660031
Overhead Command Shared Object Symbol
49.30% phonebook_hash libc-2.23.so [.] malloc_consolidate
22.13% phonebook_hash libc-2.23.so [.] _int_free
12.29% phonebook_hash [kernel.kallsyms] [k] clear_page_c_e
根據顯示 malloc的部分還是造成了許多Cache-misses的情況,也因此在釋放記憶體的時間增加了Caches misses的比例。
//main.c#55
entry *pHead, *e[TABLE_SIZE];
pHead = (entry *) malloc(sizeof(entry) * TABLE_SIZE);
printf("size of entry :%lu bytes\n", sizeof(entry));
for (i = 0; i < TABLE_SIZE; i++) {
e[i] = &pHead[i];
e[i]->pNext = NULL;
}
i = 0;
著手改善的方向memory pool