2017q3 Homework1 (phonebook)

# 2017q3 Homework1 (phonebook) contributed by <`HMKRL`> ## 系統環境 ``` $ uname -a Linux alfheim 4.10.0-33-generic #37-Ubuntu SMP Fri Aug 11 10:55:28 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 158 Model name: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz Stepping: 9 CPU MHz: 899.951 CPU max MHz: 3800.0000 CPU min MHz: 800.0000 BogoMIPS: 5616.00 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 6144K NUMA node0 CPU(s): 0-7 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp $ free total used free shared buff/cache available Mem: 16235072 2420160 12605616 487012 1209296 13048120 Swap: 2097148 0 2097148 ``` ## 對預設 Makefile 進行修改 - 進行測試時，發現雖然對程式碼做了修改，但 `make plot` 畫出的圖並沒有變化，因此在 `cache-test` 的部份加入了以下指令刪除原本的紀錄檔： ```shell cache-test: $(EXEC) rm -f ./*.txt perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./phonebook_orig perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./phonebook_opt ``` - 為了方便觀察結果，修改了 `make plot` 部份，利用 `eog` 在分析圖產生後直接顯示： ```shell plot: output.txt gnuplot scripts/runtime.gp eog ./runtime.png ``` >請注意到 [Don't invoke program eog. You should not assume the availability of X11.](https://github.com/HMKRL/phonebook/commit/96bb305836ce3eba69b0aeb2e51b6468971abb53#commitcomment-24519785) >[name=課程助教][color=red] ## 減小 size of entry - 想法：除了 lastName 以外的欄位都不是搜尋時會用到的因此另外定義 `struct __PHONE_BOOK_ADDITION` 儲存這些欄位並在原本的資料結構中預留指標，有需要時再分配記憶體。 - 優點： size of entry 縮小後由於佔用 cache 較小，應能降低cache-miss - 缺點：需要多一次 `malloc` 來分配記憶體給這些欄位，可能增加 `append()` 時間 - 實測將資料結構改為以下型態 ```C typedef struct __PHONE_BOOK_ADDITION { char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; } addition; typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; struct __PHONE_BOOK_ADDITION *addtionInfo; struct __PHONE_BOOK_ENTRY *pNext; } entry; ``` entry size 變為 32 bytes, cache-miss 則變為 64.499 % > 這樣的調整使得「entry size 變為 32 bytes」，是否還能縮減呢？與其用指標 `addtionInfo` 來指向完整資料的記憶體空間，不如換用可預先計算範圍的數值，並用 `uint16_t` (或更小的型態) 來記錄。 > [name="jserv"][color=red] ![](https://i.imgur.com/y5hRJIf.png) 若能事先確認需額外儲存的資料量，則可以採用 `uint16_t` 等資料型態，但需特別注意：若為做額外處理，entry size 會因為 alignments 而保持在 32 bytes , 無法有效減小： ``` typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; uint16_t index; struct __PHONE_BOOK_ENTRY *pNext; } entry; ``` 執行結果： ``` size of entry : 32 bytes execution time of append() : 0.033276 sec execution time of findName() : 0.001591 sec ``` 為了能確實減小尺寸，可以加入 `__attribute__((__packed__))` ([reference](https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/Common-Type-Attributes.html#Common-Type-Attributes)) ``` typedef struct __attribute__((__packed__)) __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; uint16_t index; struct __PHONE_BOOK_ENTRY *pNext; } entry; ``` 執行結果： ``` size of entry : 26 bytes execution time of append() : 0.029881 sec execution time of findName() : 0.001542 sec ``` > 應該用 perf 比較 cache miss 一類的落差 > [name="jserv"][color=red] 透過perf驗證後，發現雖然減小了entry size, 但cache miss及時間表現卻沒有進展（甚至有些微落後） 32-byte entry cache miss: ``` Performance counter stats for './phonebook_orig' (100 runs): 1,665,405 cache-misses # 72.560 % of all cache refs ( +- 0.17% ) 2,295,210 cache-references ( +- 0.25% ) 242,975,461 instructions # 1.57 insn per cycle ( +- 0.02% ) 155,039,764 cycles ( +- 0.55% ) ``` 26-byte entry cache miss: ``` Performance counter stats for './phonebook_opt' (100 runs): 1,667,705 cache-misses # 73.910 % of all cache refs ( +- 0.19% ) 2,256,388 cache-references ( +- 0.22% ) 242,905,795 instructions # 1.55 insn per cycle ( +- 0.02% ) 157,022,276 cycles ( +- 0.66% ) ``` ![](https://i.imgur.com/NPKDndt.png) 猜想原因：雖然entry size透過取消alignment減小，但CPU存取cache時仍然採用aligned模式，因此沒有幫助 ## 嘗試採用hash table加速實作 hash table, 採用 [dj2b hash function](http://www.cse.yorku.ca/~oz/hash.html) 對 `lastName` 字串做 hash 並建立 Size 為2048的Hash table ``` entry *findName(char lastName[], entry **pArr) { entry *pHead = pArr[hash(lastName) % HASH_TABLE_SIZE]; while (pHead != NULL) { if (strcasecmp(lastName, pHead->lastName) == 0) return pHead; pHead = pHead->pNext; } return NULL; } ``` ``` entry *append(char lastName[], entry **pArr) { /* allocate memory for the new entry and put lastName */ int index = hash(lastName) % HASH_TABLE_SIZE; entry *e = (entry *) malloc(sizeof(entry)); e->pNext = pArr[index]; pArr[index] = e; strcpy(e->lastName, lastName); return e; } ``` 結果： ![](https://i.imgur.com/OgTEqOE.png) | | orig| opt| hash| |:---------:|:------:|:------:|:------:| |cache miss | 94.140%| 67.303%| 73.737%| |append time|0.039918|0.032211|0.038637| |find time |0.005459|0.002070|0.000000| 比起縮小 entry size 後的版本，`append()` 時間增加，`findName()` 時間則大幅縮短