contributed by <HMKRL
>
$ uname -a
Linux alfheim 4.10.0-33-generic #37-Ubuntu SMP Fri Aug 11 10:55:28 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Stepping: 9
CPU MHz: 899.951
CPU max MHz: 3800.0000
CPU min MHz: 800.0000
BogoMIPS: 5616.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
$ free
total used free shared buff/cache available
Mem: 16235072 2420160 12605616 487012 1209296 13048120
Swap: 2097148 0 2097148
make plot
畫出的圖並沒有變化,因此在 cache-test
的部份加入了以下指令刪除原本的紀錄檔:cache-test: $(EXEC)
rm -f ./*.txt
perf stat --repeat 100 \
-e cache-misses,cache-references,instructions,cycles \
./phonebook_orig
perf stat --repeat 100 \
-e cache-misses,cache-references,instructions,cycles \
./phonebook_opt
make plot
部份,利用 eog
在分析圖產生後直接顯示:plot: output.txt
gnuplot scripts/runtime.gp
eog ./runtime.png
請注意到 Don't invoke program eog. You should not assume the availability of X11.
課程助教
想法:除了 lastName 以外的欄位都不是搜尋時會用到的
因此另外定義 struct __PHONE_BOOK_ADDITION
儲存這些欄位
並在原本的資料結構中預留指標,有需要時再分配記憶體。
優點:
size of entry 縮小後由於佔用 cache 較小,應能降低cache-miss
缺點:
需要多一次 malloc
來分配記憶體給這些欄位,可能增加 append()
時間
實測
將資料結構改為以下型態
typedef struct __PHONE_BOOK_ADDITION {
char firstName[16];
char email[16];
char phone[10];
char cell[10];
char addr1[16];
char addr2[16];
char city[16];
char state[2];
char zip[5];
} addition;
typedef struct __PHONE_BOOK_ENTRY {
char lastName[MAX_LAST_NAME_SIZE];
struct __PHONE_BOOK_ADDITION *addtionInfo;
struct __PHONE_BOOK_ENTRY *pNext;
} entry;
entry size 變為 32 bytes, cache-miss 則變為 64.499 %
這樣的調整使得「entry size 變為 32 bytes」,是否還能縮減呢?與其用指標
addtionInfo
來指向完整資料的記憶體空間,不如換用可預先計算範圍的數值,並用uint16_t
(或更小的型態) 來記錄。
"jserv"
若能事先確認需額外儲存的資料量,則可以採用 uint16_t
等資料型態,但需特別注意:若為做額外處理,entry size 會因為 alignments 而保持在 32 bytes , 無法有效減小:
typedef struct __PHONE_BOOK_ENTRY {
char lastName[MAX_LAST_NAME_SIZE];
uint16_t index;
struct __PHONE_BOOK_ENTRY *pNext;
} entry;
執行結果:
size of entry : 32 bytes
execution time of append() : 0.033276 sec
execution time of findName() : 0.001591 sec
為了能確實減小尺寸,可以加入 __attribute__((__packed__))
(reference)
typedef struct __attribute__((__packed__)) __PHONE_BOOK_ENTRY {
char lastName[MAX_LAST_NAME_SIZE];
uint16_t index;
struct __PHONE_BOOK_ENTRY *pNext;
} entry;
執行結果:
size of entry : 26 bytes
execution time of append() : 0.029881 sec
execution time of findName() : 0.001542 sec
應該用 perf 比較 cache miss 一類的落差
"jserv"
透過perf驗證後,發現雖然減小了entry size, 但cache miss及時間表現卻沒有進展(甚至有些微落後)
32-byte entry cache miss:
Performance counter stats for './phonebook_orig' (100 runs):
1,665,405 cache-misses # 72.560 % of all cache refs ( +- 0.17% )
2,295,210 cache-references ( +- 0.25% )
242,975,461 instructions # 1.57 insn per cycle ( +- 0.02% )
155,039,764 cycles ( +- 0.55% )
26-byte entry cache miss:
Performance counter stats for './phonebook_opt' (100 runs):
1,667,705 cache-misses # 73.910 % of all cache refs ( +- 0.19% )
2,256,388 cache-references ( +- 0.22% )
242,905,795 instructions # 1.55 insn per cycle ( +- 0.02% )
157,022,276 cycles ( +- 0.66% )
猜想原因:雖然entry size透過取消alignment減小,但CPU存取cache時仍然採用aligned模式,因此沒有幫助
實作 hash table, 採用 dj2b hash function 對 lastName
字串做 hash
並建立 Size 為2048的Hash table
entry *findName(char lastName[], entry **pArr)
{
entry *pHead = pArr[hash(lastName) % HASH_TABLE_SIZE];
while (pHead != NULL) {
if (strcasecmp(lastName, pHead->lastName) == 0)
return pHead;
pHead = pHead->pNext;
}
return NULL;
}
entry *append(char lastName[], entry **pArr)
{
/* allocate memory for the new entry and put lastName */
int index = hash(lastName) % HASH_TABLE_SIZE;
entry *e = (entry *) malloc(sizeof(entry));
e->pNext = pArr[index];
pArr[index] = e;
strcpy(e->lastName, lastName);
return e;
}
結果:
orig | opt | hash | |
---|---|---|---|
cache miss | 94.140% | 67.303% | 73.737% |
append time | 0.039918 | 0.032211 | 0.038637 |
find time | 0.005459 | 0.002070 | 0.000000 |
比起縮小 entry size 後的版本,append()
時間增加,findName()
時間則大幅縮短