contributed by <Jim00000
>
ubuntu 16.04 LTS
拜拜lubuntu,哈囉ubuntu
$ lscpu
Architecture: x86_64
CPU 作業模式: 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
每核心執行緒數:2
每通訊端核心數:4
Socket(s): 1
NUMA 節點: 1
供應商識別號: GenuineIntel
CPU 家族: 6
型號: 58
Model name: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
CPU MHz: 1263.750
CPU max MHz: 3400.0000
CPU min MHz: 1200.0000
BogoMIPS: 4789.11
虛擬: VT-x
L1d 快取: 32K
L1i 快取: 32K
L2 快取: 256K
L3 快取: 6144K
NUMA node0 CPU(s): 0-7
sudo sh -c " echo -1 > /proc/sys/kernel/perf_event_paranoid"
sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
$ perf list
節錄部份perf可以觸發的events
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
context-switches OR cs [Software event]
cpu-clock [Software event]
dummy [Software event]
page-faults OR faults [Software event]
task-clock [Software event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-loads [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
L1-dcache-store-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
branch-load-misses [Hardware cache event]
branch-loads [Hardware cache event]
branch-instructions OR cpu/branch-instructions/ [Kernel PMU event]
branch-misses OR cpu/branch-misses/ [Kernel PMU event]
bus-cycles OR cpu/bus-cycles/ [Kernel PMU event]
cache-misses OR cpu/cache-misses/ [Kernel PMU event]
cache-references OR cpu/cache-references/ [Kernel PMU event]
cpu-cycles OR cpu/cpu-cycles/ [Kernel PMU event]
$ perf top
即時分析系統熱點
$ perf top -p $pid
根據pid 查看process的熱點
$ perf top –event event
–count event period to sample
針對特定程式檢查事件
$ perf stat –repeat 5 -e cache-misses,cache-references,instructions,cycles ./a.out
$ perf stat –repeat 5 -e cache-misses:u,cache-references:u,instructions:u,cycles:u ./a.out
:u 代表只統計發生在user space的事件
$ ./a.out & sudo perf top -p $!
針對a.out的程式做效能分析
在Learn Git Branching這個網站多做幾次就大致理解如何使用git了
#測站 | 淡水 | 鞍部 | 臺北 | 基隆 | 竹子湖 |
#月份
1 103.9 294.3 83.2 331.6 232.6
2 174.8 329.2 170.3 397.0 273.5
3 194.5 281.8 180.4 321.0 227.1
4 179.3 247.9 177.8 242.0 207.2
5 216.1 321.2 234.5 285.1 267.4
6 243.4 345.8 325.9 301.6 314.8
7 149.2 266.1 245.1 148.4 247.7
8 202.9 422.5 322.1 210.1 439.5
9 299.1 758.5 360.5 423.5 717.4
10 173.9 703.5 148.9 400.3 683.9
11 120.7 534.7 83.1 399.6 488.8
12 97.6 357.6 73.3 311.8 289.1
set title "臺北月平均降雨量"
set xlabel "月份"
set ylabel "降雨量(毫米)"
set terminal png font " Times_New_Roman,12 "
set output "statistic.png"
set xtics 1 ,1 ,12
set key left
plot \
"data.csv" using 1:2 with linespoints linewidth 2 title "淡水", \
"data.csv" using 1:3 with linespoints linewidth 2 title "鞍部", \
"data.csv" using 1:4 with linespoints linewidth 2 title "臺北", \
"data.csv" using 1:5 with linespoints linewidth 2 title "基隆", \
"data.csv" using 1:6 with linespoints linewidth 2 title "竹子湖" \
multi-language indentation and reformatting filters
http://astyle.sourceforge.net/astyle.html
使用astyle檢查程式排版
ln -sf ../../scripts/pre-commit.hook .git/hooks/pre-commit
echo 3 | sudo tee /proc/sys/vm/drop_caches
man tee
watch -d -t "./phonebook_orig && echo 3 | sudo tee /proc/sys/vm/drop_caches"
man watch
-d
-t
.PHONY
#include <assert.h>
man assert
void assert(scalar expression);
#if defined(__GNUC__)
__builtin___clear_cache((char *) pHead, (char *) pHead + sizeof(entry));
void __builtin___clear_cache (char *begin, char *end)
原始Phonebook版本使用Linklist串接每一筆資料
entry *findName(char lastName[], entry *pHead)
while (pHead != NULL) {
if (strcasecmp(lastName, pHead->lastName) == 0)
return pHead;
pHead = pHead->pNext;
}
return NULL;
}
entry *append(char lastName[], entry *e)
{
/* allocate memory for the new entry and put lastName */
e->pNext = (entry *) malloc(sizeof(entry));
e = e->pNext;
strcpy(e->lastName, lastName);
e->pNext = NULL;
return e;
}
strcasecmp
比較字串
$ perf stat --repeat 100 -e cache-misses:u,cache-references:u,instructions:u,cycles:u ./phonebook_orig
Performance counter stats for './phonebook_orig' (100 runs):
1,543,383 cache-misses:u # 97.442 % of all cache refs ( +- 0.17% )
1,583,900 cache-references:u ( +- 0.18% )
290,557,320 instructions:u # 1.67 insns per cycle ( +- 0.02% )
173,701,898 cycles:u ( +- 0.40% )
0.095503279 seconds time elapsed ( +- 1.44% )
$ ./phonebook_orig & perf top -p $!
Overhead Shared Object Symbol
22.78% libc-2.23.so [.] __strcasecmp_l_avx
17.41% phonebook_orig [.] findName
9.29% phonebook_orig [.] main
6.70% libc-2.23.so [.] _int_malloc
4.95% libc-2.23.so [.] _IO_fgets
3.86% [kernel] [k] clear_page_c_e
3.27% libc-2.23.so [.] __memcpy_sse2
2.96% libc-2.23.so [.] malloc
2.81% libc-2.23.so [.] _IO_getline_info
2.69% phonebook_orig [.] strcasecmp@plt
2.29% [kernel] [k] free_pcppages_bulk
1.93% phonebook_orig [.] append
1.87% libc-2.23.so [.] __strcpy_sse2_unaligned
1.72% [kernel] [k] page_remove_rmap
1.64% libc-2.23.so [.] __strcasecmp_avx
1.52% libc-2.23.so [.] memchr
1.14% [kernel] [k] release_pages
1.03% [kernel] [k] native_irq_return_iret
0.98% [kernel] [k] down_read_trylock
0.91% [kernel] [k] __rmqueue.isra.79
0.57% [kernel] [k] free_hot_cold_page_list
0.57% [kernel] [k] free_hot_cold_page
0.57% [kernel] [k] free_pages_prepare
0.57% [kernel] [k] unmap_page_range
0.57% [kernel] [k] __mod_zone_page_state
0.52% [kernel] [k] __schedule
0.52% [kernel] [k] copy_user_enhanced_fast_string
0.51% [kernel] [k] apic_timer_interrupt
struct __PHONE_BOOK_ENTRY
struct __PHONE_BOOK_ENTRY
struct __PHONE_BOOK_ENTRY
改善cache-misses
將struct __PHONE_BOOK_CONTENT
內不必要的資訊搬移至別處,減小空間(縮小到24bytes)
typedef struct __PHONE_BOOK_ENTRY entry;
typedef struct __PHONE_BOOK_CONTENT Content;
struct __PHONE_BOOK_ENTRY {
char *lastName;
entry *pNext;
Content *pContent;
};
struct __PHONE_BOOK_CONTENT {
char firstName[16];
char email[16];
char phone[10];
char cell[10];
char addr1[16];
char addr2[16];
char city[16];
char state[2];
char zip[5];
};
cache-misses縮小了,這是合理的,因為cache可以放入更多的資料組進去
Performance counter stats for './phonebook_opt' (100 runs):
89,853 cache-misses:u # 33.448 % of all cache refs ( +- 4.40% )
268,634 cache-references:u ( +- 3.05% )
400,792,662 instructions:u # 2.18 insns per cycle ( +- 0.02% )
184,158,414 cycles:u ( +- 0.38% )
0.087245876 seconds time elapsed ( +- 2.01% )
有變動的地方為struct __PHONE_BOOK_CONTENT
大小以及append()
的實作
可以從效能比較圖中看到finName()
的時間縮減了,但是在append()
的時間增加了
finName()
的效能增益來自於cache miss的減少,而append()
的時間增加了可能是多了一次malloc()
的使用
如果將lastname改成固定為16bytes
struct __PHONE_BOOK_ENTRY {
char lastName[MAX_LAST_NAME_SIZE];
entry *pNext;
Content *pContent;
};
跟第一個版本相比,這個版本的append()
的時間會比上一版還要短,是因為少了malloc()
的使用
,不過在稍稍浪費了少量的記憶體(在多筆資料下浪費的量會非常可觀!!!)
結論:cache miss的減少確實影響著程式的效能
從perf效能工具的圖表,overhead第1名是__strcasecmp_l_avx
,程式花了很多cycles在strcasecmp()
上面
使用hash來取代strcasecmp()
或許是個很好的作法
我將比較http://www.cse.yorku.ca/~oz/hash.html內的各種hash function,找出哪一個hash function比較有效率
djb2
與loselose
兩者之間djb2 0.000000705 seconds
sdbm 0.000000404 seconds
loselose 0.000000381 seconds
hash function將採用sdbm
演算法實作
hash function的比較目前只有思考時間的部份,可能還不夠全面
hash
替換lastName
以hash換掉lastName
struct __PHONE_BOOK_ENTRY {
unsigned long hash;
entry *pNext;
Content *pContent;
};
將findName()
內的字串比較換成為比較hash
,hash
必須在使用findName()
之前算出來,避免重複計算hash
entry *findName(unsigned long hash, entry *pHead)
{
while (pHead != NULL) {
if (pHead->hash == hash )
return pHead;
pHead = pHead->pNext;
}
return NULL;
}
struct __PHONE_BOOK_ENTRY
大小沒有變,cache-misses的比率沒有太大的變化
Performance counter stats for './phonebook_opt' (100 runs):
50,211 cache-misses:u # 35.883 % of all cache refs ( +- 3.40% )
139,928 cache-references:u ( +- 2.73% )
278,444,297 instructions:u # 1.97 insns per cycle ( +- 0.00% )
141,480,873 cycles:u ( +- 0.31% )
0.061914541 seconds time elapsed ( +- 2.13% )
程式的熱點依舊花在findName
上很多時間,很多都花在遍歷linklist了
19.24% phonebook_opt [.] findName
12.64% phonebook_opt [.] sdbm
11.11% libc-2.23.so [.] _int_free
10.50% phonebook_opt [.] main
7.63% phonebook_opt [.] free_entry
4.81% libc-2.23.so [.] __memcpy_sse2
4.17% phonebook_opt [.] free_Content
3.80% libc-2.23.so [.] malloc
3.74% libc-2.23.so [.] _IO_fgets
3.17% libc-2.23.so [.] _IO_getline_info
2.99% phonebook_opt [.] append
2.78% libc-2.23.so [.] free
2.43% libc-2.23.so [.] _int_malloc
2.33% libc-2.23.so [.] memchr
1.25% [kernel] [k] copy_user_enhanced_fast_string
0.71% [kernel] [k] uncharge_list
0.70% [kernel] [k] release_pages
0.70% [kernel] [k] __tlb_remove_page
0.69% [kernel] [k] __khugepaged_exit
0.69% phonebook_opt [.] free@plt
0.69% ld-2.23.so [.] do_lookup_x
append()
看不出來有明顯的變化,但是findName()
的時間有稍微降低
結論:使用hash代替strcasecmp
做比較可以減少一些時間,但是程式需要遍歷整個linklist直到找到字串為止的方法依然不變,需要想辦法減少遍歷整個linklist
合理使用雜湊表,減少遍歷整個linklist的可能性
因為原始版與優化版差異性變大,在這裡我將會對其做分檔處理以方便區分及理解程式碼
建立一個hash table,內含固定數量的buckets,每個buckets內含一條串列(請想像成linklist[buckets_num]
)
struct __PHONE_BOOK_HASH_TABLE {
unsigned int buckets_num;
entry **buckets;
};
從結果來看cache-misses有點高
Performance counter stats for './phonebook_opt' (100 runs):
87,790 cache-misses:u # 66.771 % of all cache refs ( +- 1.91% )
131,478 cache-references:u ( +- 2.13% )
275,992,495 instructions:u # 2.12 insns per cycle ( +- 0.00% )
130,270,449 cycles:u ( +- 0.24% )
0.060449365 seconds time elapsed ( +- 2.17% )
程式熱點從findName()
轉移至main()
與sdbm()
為主了
24.60% phonebook_opt [.] main
16.55% phonebook_opt [.] sdbm
14.34% libc-2.23.so [.] _int_malloc
8.75% phonebook_opt [.] append
8.59% libc-2.23.so [.] _IO_fgets
5.18% libc-2.23.so [.] __memcpy_sse2
3.39% libc-2.23.so [.] malloc
3.19% [kernel] [k] clear_page_c_e
2.34% [kernel] [k] free_pcppages_bulk
2.27% libc-2.23.so [.] _IO_getline_info
2.22% [kernel] [k] native_irq_return_iret
1.17% [kernel] [k] unmap_page_range
1.17% [kernel] [k] __pagevec_lru_add_fn
1.13% [kernel] [k] copy_user_enhanced_fast_string
1.11% [kernel] [k] __rmqueue.isra.79
1.11% [kernel] [k] _raw_spin_lock
0.99% [kernel] [k] release_pages
0.96% [kernel] [k] get_mem_cgroup_from_mm
0.93% [kernel] [k] handle_mm_fault
0.01% [kernel] [k] perf_pmu_enable.part.89
0.00% [kernel] [k] native_sched_clock
0.00% [kernel] [k] native_write_msr_safe
append()
因為多了算hash以及一些搬移,時間會比原始來得多一點是可以預期的,findName()
的時間降至接近於0了
結論:使用hash table讓資料分散在不同buckets,然後在findName()
時藉由字串算出的hash找到bucket後就不用找其他bucket內的資料,在findName()
的效能提升是明顯的
在實作hash table(Key對應到Object)中每個Object都是一條Linklist,所以仍然無法避免遍歷這一條linklist如果資料比較靠近list尾端,在最差的情況下還是得從頭比對到最後
使用binary search tree取代linklist,增加搜尋的效率
因為寫紅黑樹是一件大工程,為了驗證使用紅黑樹改善效能是可行的就先拿現成的程式碼來用吧
比起使用串列,紅黑樹在cache-misses上有一定程度的改善 (But why?)
Performance counter stats for './phonebook_opt' (100 runs):
1,247,330 cache-misses:u # 33.152 % of all cache refs ( +- 0.71% )
3,762,470 cache-references:u ( +- 0.22% )
401,991,447 instructions:u # 0.61 insns per cycle ( +- 0.00% )
658,891,050 cycles:u ( +- 0.41% )
0.242765557 seconds time elapsed ( +- 1.04% )
紅黑樹在insert和insert後的調整上花了很多cycles,銷毀紅黑樹也很花時間這倒是很意外
52.20% phonebook_opt [.] RBT_insert
13.19% phonebook_opt [.] RBT_destroy
7.78% phonebook_opt [.] RBT_insert_fixup
5.37% libc-2.23.so [.] _int_free
3.90% phonebook_opt [.] main
3.40% phonebook_opt [.] sdbm
3.03% libc-2.23.so [.] _int_malloc
1.75% libc-2.23.so [.] _IO_getline_info
1.37% phonebook_opt [.] append
1.08% libc-2.23.so [.] __memcpy_sse2
0.96% phonebook_opt [.] RBT_create_node
0.75% libc-2.23.so [.] _IO_fgets
0.65% libc-2.23.so [.] malloc
append()
的時間突破了天際(差不多是0.2秒左右),而findName()
一樣是趨近於0的時間花費,理論上findName()
用紅黑樹實做要比串列好
與f5120125同學實做的binary search tree相比較可以發現f5120125做得相當好(在append()
上花比較少的時間),把串列直接轉換成binary search tree並考慮到二元樹傾斜的情況,考慮周到,但是有個缺點是來自於測資必須事先經過排序(本作業的測資本來就是排序好的不用做排序),若在動態新增資料的方面上是有問題的
結論:使用二元樹搜尋確實能大幅度改善findName()
的時間,但是在append()
上付出高額的時間代價,在效能trade-off上需要評估