contributed by <vonchuang>, ZixinYang
$ cat /etc/issue
Ubuntu 16.04.3 LTS
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 37
Model name: Intel(R) Core(TM) i3 CPU M 350 @ 2.27GHz
Stepping: 2
CPU MHz: 1866.000
CPU max MHz: 2266.0000
CPU min MHz: 933.0000
BogoMIPS: 4522.32
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
The L1 cache can be accessed nearly as fast as the registers, typically in 2 to 4 clock cycles.
L2 cache can be accessed in about 10 clock cycles.
The L3 cache sits between the L2 cache and main memory in the memory hierarchy and can be accessed in 30 or 40 cycles.
The two caches are often optimized for different access patterns and can have different block sizes, associativities, and capacities. Also, having separate caches ensures that data accesses do not create conflict misses with instruction accesses, and vice versa, at the cost of a potential increase in capacity misses.
$make run
size of entry : 136 bytes
execution time of append() : 0.086821 sec
execution time of findName() : 0.012002 sec
perf
$perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./phonebook_orig
Performance counter stats for './phonebook_orig' (10 runs):
946,438 cache-misses # 86.699 % of all cache refs ( +- 1.01% ) (66.53%)
1,091,635 cache-references ( +- 0.99% ) (67.35%)
2,452,718 L1-dcache-load-misses ( +- 0.71% ) (67.43%)
896,913 L1-dcache-store-misses ( +- 0.45% ) (67.48%)
1,839,844 L1-dcache-prefetch-misses ( +- 2.38% ) (67.10%)
576,013 L1-icache-load-misses ( +- 9.58% ) (66.08%)
0.128343367 seconds time elapsed ( +- 3.62% )
$ sudo perf record -F 12500 -e cache-misses ./phonebook_orig
size of entry : 136 bytes
execution time of append() : 0.084899 sec
execution time of findName() : 0.012862 sec
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.068 MB perf.data (1488 samples) ]
$sudo perf report
Overhead Command Shared Object Symbol
53.37% phonebook_orig [kernel.kallsyms] [k] clear_page
11.23% phonebook_orig phonebook_orig [.] findLastName
6.53% phonebook_orig libc-2.23.so [.] __strcasecmp_l_sse42
5.33% phonebook_orig [kernel.kallsyms] [k] handle_mm_fault
4.77% phonebook_orig [kernel.kallsyms] [k] pmd_page_vaddr
2.24% phonebook_orig [kernel.kallsyms] [k] unmap_page_range
2.21% phonebook_orig [kernel.kallsyms] [k] __rmqueue
1.46% phonebook_orig [kernel.kallsyms] [k] copy_user_generic_string
1.29% phonebook_orig [kernel.kallsyms] [k] get_page_from_freelist
1.26% phonebook_orig [kernel.kallsyms] [k] try_charge
1.22% phonebook_orig [kernel.kallsyms] [k] __alloc_pages_nodemask
1.14% phonebook_orig [kernel.kallsyms] [k] free_pcppages_bulk
1.06% phonebook_orig [kernel.kallsyms] [k] get_mem_cgroup_from_mm
0.76% phonebook_orig [kernel.kallsyms] [k] mem_cgroup_try_charge
0.62% phonebook_orig [kernel.kallsyms] [k] alloc_pages_vma
In phonebook_opt.c there is a function named findName:
entry *findName(char lastName[], entry *pHead)
However, it searches by the input lastName, so the original name is imprecise; it is renamed to findLastName here:
entry *findLastName(char lastName[], entry *pHead)
The function append() in phonebook_orig.c does not handle the case where malloc() fails:
entry *append(char lastName[], entry *e)
{
    /* allocate memory for the new entry and put lastName */
    e->pNext = (entry *) malloc(sizeof(entry));
    e = e->pNext;
    strcpy(e->lastName, lastName);
    e->pNext = NULL;
    return e;
}
An assert() is added right after malloc() to catch the failure (this requires #include <assert.h>):
entry *append(char lastName[], entry *e)
{
    /* allocate memory for the new entry and put lastName */
    e->pNext = (entry *) malloc(sizeof(entry));
    assert(e->pNext && "malloc for e->pNext error.");
    e = e->pNext;
    strcpy(e->lastName, lastName);
    e->pNext = NULL;
    return e;
}
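assert() aborts the whole process and is compiled out when NDEBUG is defined, so another option is to report the failure to the caller. A minimal sketch of that alternative (not the version benchmarked here; the caller is assumed to check for NULL):
/* Sketch: propagate allocation failure instead of aborting. */
entry *append(char lastName[], entry *e)
{
    entry *new_entry = (entry *) malloc(sizeof(entry));
    if (!new_entry)
        return NULL;                 /* let the caller decide how to recover */
    strcpy(new_entry->lastName, lastName);
    new_entry->pNext = NULL;
    e->pNext = new_entry;            /* link only after the allocation succeeds */
    return new_entry;
}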
In phonebook_orig, the struct __PHONE_BOOK_ENTRY is 136 bytes:
typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;
So while findLastName is running, the 32 KB L1 data cache can hold at most about 240 of these 136-byte entries, yet words.txt contains nearly 350,000 records, so cache misses are bound to occur during execution and drag down performance.
In this case, however, the goal is only to look up the last name; the other fields are irrelevant. The last name can therefore be split out of __PHONE_BOOK_ENTRY to shrink the struct and, in turn, reduce cache misses.
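A minimal sketch of one way to do the split is shown below. The __DETAIL_ENTRY type and the detail field are illustrative names, not taken from the actual phonebook_opt.c, and MAX_LAST_NAME_SIZE is assumed to be 16; with 8-byte pointers the hot struct is then 16 + 8 + 8 = 32 bytes, matching the "size of entry : 32 bytes" reported below.
/* Sketch: keep only what findLastName touches in the list node;
 * everything else moves behind a pointer that is only followed
 * when the full record is actually needed. Names are illustrative. */
typedef struct __DETAIL_ENTRY {
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
} detail_entry;

typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];   /* assumed to be 16 bytes */
    struct __DETAIL_ENTRY *detail;       /* allocated lazily if needed */
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;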
Results after optimization:
$perf record -F 12500 -e cache-misses ./phonebook_opt
size of entry : 32 bytes
execution time of append() : 0.071934 sec
execution time of findName() : 0.005095 sec
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.050 MB perf.data (1013 samples) ]
$perf report
Overhead Command Shared Object Symbol
54.36% phonebook_opt [kernel.kallsyms] [k] clear_page
7.26% phonebook_opt libc-2.23.so [.] __strcasecmp_l_sse42
6.96% phonebook_opt [kernel.kallsyms] [k] handle_mm_fault
3.65% phonebook_opt [kernel.kallsyms] [k] copy_user_generic_string
3.24% phonebook_opt [kernel.kallsyms] [k] __rmqueue
2.91% phonebook_opt [kernel.kallsyms] [k] pmd_page_vaddr
2.87% phonebook_opt phonebook_opt [.] findLastName
$perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./phonebook_opt
Performance counter stats for './phonebook_opt' (10 runs):
356,696 cache-misses # 92.695 % of all cache refs ( +- 1.60% ) (62.61%)
384,808 cache-references ( +- 2.58% ) (64.87%)
1,397,588 L1-dcache-load-misses ( +- 2.64% ) (68.68%)
274,496 L1-dcache-store-misses ( +- 1.42% ) (71.04%)
449,449 L1-dcache-prefetch-misses ( +- 22.10% ) (70.08%)
331,442 L1-icache-load-misses ( +- 4.48% ) (65.40%)
0.085531137 seconds time elapsed ( +- 1.00% )
The cache-miss overhead of findLastName drops from 11.23% to 2.87%.
The execution time of append() decreases by about 18%.
The execution time of findLastName() decreases by about 60%.
In phonebook_orig.c, findLastName walks the linked list one node at a time, so its time complexity is O(n). To speed the lookup up, a hash table is introduced; the table size is first set to 1024 (following ryanwang522).
DJB Hash Function
unsigned int BJDHash(char *str)
{
    unsigned int hash = 5381;
    while (*str)
        hash = ((hash << 5) + hash) + (*str++);
    return (hash % MAX_HASH_TABLE_SIZE);
}
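For context, a minimal sketch of how the buckets could be wired into the append and lookup paths; hashTable, appendHash, and findLastNameHash are illustrative names and may not match phonebook_hash.c (the snippet assumes <stdlib.h>, <string.h>, and <strings.h> are included):
/* Sketch: separate chaining with the DJB hash above; names are illustrative. */
static entry *hashTable[MAX_HASH_TABLE_SIZE];

void appendHash(char lastName[])
{
    unsigned int idx = BJDHash(lastName);
    entry *e = (entry *) malloc(sizeof(entry));
    if (!e)
        return;                        /* allocation failed: drop the record */
    strcpy(e->lastName, lastName);
    e->pNext = hashTable[idx];         /* push onto this bucket's chain */
    hashTable[idx] = e;
}

entry *findLastNameHash(char lastName[])
{
    for (entry *e = hashTable[BJDHash(lastName)]; e; e = e->pNext)
        if (strcasecmp(lastName, e->lastName) == 0)
            return e;
    return NULL;
}
With this layout the search only walks a single bucket instead of the whole list, which is why findName() drops to effectively zero in the comparison table further down.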
perf
$perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./phonebook_hash
Performance counter stats for './phonebook_hash' (10 runs):
327,504 cache-misses # 80.183 % of all cache refs ( +- 1.56% ) (64.15%)
408,445 cache-references ( +- 3.94% ) (65.74%)
640,421 L1-dcache-load-misses ( +- 3.13% ) (68.02%)
316,427 L1-dcache-store-misses ( +- 1.07% ) (70.02%)
52,671 L1-dcache-prefetch-misses ( +- 6.83% ) (69.38%)
426,756 L1-icache-load-misses ( +- 6.84% ) (65.78%)
0.089159747 seconds time elapsed ( +- 1.77% )
gnuplot
From the plot we can see that the hash table greatly reduces the search time, but building the table also increases the execution time of append().
Following the reference, if the load factor is set to 0.75, then with 349,900 records on hand the hash table size should be roughly 349,900 / 0.75 ≈ 466,534.
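As a sketch (the exact value used in the experiment is not recorded in this write-up; the number below just rounds up 349,900 / 0.75):
/* bucket count for a 0.75 target load factor: 349,900 / 0.75 ≈ 466,534 */
#define MAX_HASH_TABLE_SIZE 466534
Re-running perf with this larger table: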
Performance counter stats for './phonebook_hash' (10 runs):
941,119 cache-misses # 67.372 % of all cache refs ( +- 0.50% ) (65.94%)
1,396,901 cache-references ( +- 2.25% ) (66.21%)
1,963,602 L1-dcache-load-misses ( +- 1.17% ) (66.90%)
525,652 L1-dcache-store-misses ( +- 0.74% ) (67.66%)
105,146 L1-dcache-prefetch-misses ( +- 10.88% ) (67.74%)
1,708,760 L1-icache-load-misses ( +- 2.31% ) (66.72%)
0.261410686 seconds time elapsed ( +- 1.31% )
With the larger table, both the overall execution time and the number of cache misses actually increase.
unsigned int BKDRHash(char *str)
{
    unsigned int hash = 0;
    unsigned int seed = 131; /* 31 131 1313 13131 etc... */
    while (*str)
        hash = (hash * seed) + (*str++);
    return (hash % MAX_HASH_TABLE_SIZE);
}
unsigned int OneAtATimeHash(char *str)
{
    unsigned int hash = 0;
    while (*str) {
        hash += (unsigned char) *str++;  /* mix in one character at a time */
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return (hash % MAX_HASH_TABLE_SIZE);
}
| | DJB | BKDR | One-at-a-Time |
|---|---|---|---|
| spend time | 0.104402s | 0.123361s | 0.129983s |
| append() time | 0.114534s | 0.133128s | 0.137205s |
| findName() time | 0.000000s | 0.000000s | 0.000000s |
| cache miss | 1,041,945 | 671,454 | 624,490 |
| cache reference | 1,788,171 | 1,004,645 | 958,847 |
| cache miss/cache reference | 58.269% | 66.835% | 65.129% |
The input file words.txt contains many strings that are clearly not names, some of them outright meaningless, and their proportion is not small, so this dataset is not representative and differs considerably from real English surnames.
Not necessarily: as the amount of data grows, the time gap between a suitable and an unsuitable search method widens and cache misses become more likely, so more time and effort have to be spent on optimization.