2017q3 Homework1 (phonebook)

# 2017q3 Homework1 (phonebook) contributed by < `lukechin` > 作業要求: https://hackmd.io/s/HJJUmdXsZ ## 開發環境 * 硬體規格： ```shell lukechin@lukechin-X250-Linux:~$ lscpu Architecture: x86_64 CPU 作業模式： 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 每核心執行緒數：2 每通訊端核心數：2 Socket(s): 1 NUMA 節點： 1 供應商識別號： GenuineIntel CPU 家族： 6 型號： 61 Model name: Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz 製程： 4 CPU MHz： 2195.043 CPU max MHz: 2700.0000 CPU min MHz: 500.0000 BogoMIPS: 4390.08 虛擬： VT-x L1d 快取： 32K L1i 快取： 32K L2 快取： 256K L3 快取： 3072K NUMA node0 CPU(s): 0-3 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap xsaveopt dtherm ida arat pln pts ``` * Linux版本： ```shell lukechin@lukechin-X250-Linux:~$ uname -r 4.13.0-26-generic ``` ## 預期目標 - [ ] 準備 GNU/Linux 開發工具 - [ ] 學習使用 Git 與 GitHub - [ ] 學習效能分析工具 - [ ] 研究軟體最佳化 - [ ] 技術報告寫作練習 ## 準備工具 * 依照[作業說明](https://hackmd.io/s/B17yi6WoW)準備開發環境 ``` $ sudo apt-get update $ sudo apt-get install build-essential $ sudo apt-get install linux-tools-common linux-tools-generic $ sudo apt-get install cppcheck astyle colordiff gnuplot ``` * 根據[GNU/Linux 開發工具共筆](https://hackmd.io/c/rJKbX1pFZ/https%3A%2F%2Fhackmd.io%2Fs%2Fr1Psrf0KW)熟悉開發工具 * 利用 HackMD 寫下開發紀錄 * 熟悉效能分析工具: Perf * 熟悉製圖工具: gnuplot * 熟悉 Makefile 語法 ## perf 原理和實務 * kernel.perf_event_paranoid 是用來決定你在沒有 root 權限下使用 perf 時，你可以取得哪些 event data。我輸入以下指令得到的預設值是 3。 ```bash $ cat /proc/sys/kernel/perf_event_paranoid 3 ``` * 一共有四種權限值: * `2` : 不允許任何量測。但部份用來查看或分析已存在的紀錄的指令仍可使用，如 perf ls、perf report、perf timechart、 perf trace。 * `1` : 不允許 CPU events data。但可以使用 perf stat、perf record 並取得 Kernel profiling data。 * `0` : 不允許 raw tracepoint access。但可以使用 perf stat、perf record 並取得 CPU events data。 * `-1`: 權限全開。 * 要檢測 cache miss event ，需要先取消 kernel pointer 的禁用。 ```bash $ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict" ``` * 重新登入時發現 kptr_strict 的值又會變回 1，故每次進行實驗之前記得要再重新確認是否已經取消 kernel pointer 的禁用。 * 第一次閱讀 perf 原理和實務時發現不認識的名詞 * superscalar: 指一個時鐘週期觸發 (issue) 多條指令的 pipeline機器架構。 * 比如 Intel 的 Pentium 處理器，內部有兩個執行單元，在一個時鐘週期內允許執行兩條指令。 * 這樣稱為 dual-issue，可想像為一個 packet 裡同時有兩組 pipelined 的 instruction * out-of-order execution: 在處理器內部，不同指令所需要的執行時間和時鐘週期是不同的，如果嚴格按照程序的執行順序執行，那麼就無法充分利用處理器的 pipeline。因此指令有可能需要被亂序執行。 * 需重新複習 branch prediction 及 branch pattern。（詳細待補） ## 先來分析未優化前的效能 * 為了使測試更為精準，再執行測試之前，要先清空cache ```echo 1 | sudo tee /proc/sys/vm/drop_caches``` * 輸入指令 ```./phonebook_orig``` ```bash size of entry : 136 bytes execution time of append() : 0.055069 sec execution time of findName() : 0.005760 sec ``` * 執行 ```make run``` 可以看到 phonebook_orig 大約每兩秒執行一次，打開 Makefile 可以看到實際上是在執行以下這段程式碼 ``` watch -d -t "./phonebook_orig && echo 3 | sudo tee /proc/sys/vm/drop_caches" ``` * watch 為 Linux 內建的一個指令，可用於重複執行某件事情，預設為每兩秒執行一次 * ```-d (or --differences)```: watch 會反白顯示變化的區域。若使用```-d=cumulative```則會把變動過的地方都反白顯示出來。 * ``` -n (or --interval)```: watch預設每2秒運行一下程式，可以用-n或-interval後加上秒數來指定間隔的時間。 * ```-t (or -no-title)```: 會關閉watch命令在頂部的時間間隔、執行指令以及當前時間的輸出。 * 這邊試著將 Makefile 裡 make run 的指令改成 ```watch -d -n 5 "./phonebook_orig && echo 3 | sudo tee /proc/sys/vm/drop_caches"``` * 以下為輸出結果 ``` Every 5.0s: ./phonebook_orig && echo 3 | sudo tee /proc/sys/vm/drop_caches Sat Jan 20 03:46:22 2018 size of entry : 136 bytes execution time of append() : 0.052655 sec execution time of findName() : 0.006034 sec 3 ``` * make cache-test ``` Performance counter stats for './phonebook_orig' (100 runs): 3,452,012 cache-misses # 92.071 % of all cache refs ( +- 0.05% ) 3,749,275 cache-references ( +- 0.08% ) 262,732,870 instructions # 1.36 insn per cycle ( +- 0.02% ) 193,673,268 cycles ( +- 0.11% ) 0.074537705 seconds time elapsed ( +- 0.17% ) ``` 可以看到 cache-misses 高達 92.071%，總計 193,673,268 個 cycles * 從 Makefile 裡可以看到 `make cache-test` 的指令是執行 `perf stat`來進行測試 ``` perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./phonebook_orig ``` `-e` 可以指定要量測的event，若沒有指定的話預設會有10種常用的events `--repeat <n>`或是 `-r <n>` 可以重複執行 n 次該程序，並顯示每個 event 的變化區間。 * 執行 `make plot` ![](https://i.imgur.com/Dnfpby3.png) 因為還沒對程式進行優化，所以兩個結果是一樣的。 ## 開始做實驗 ### step 1: 縮小 strutuce 大小觀察 phonebook_orig.c 的程式碼可以發現，雖然在 `__PHONE_BOOK_ENTRY` 中定義了許多使用者的資料，但實際上 append 跟 findName 只有用到 lastName 去搜尋而已，所以先試著註解掉其他使用者資料。 * code ```clike= typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; /*char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5];*/ struct __PHONE_BOOK_ENTRY *pNext; } entry; ``` * 縮小 structure size ``` size of entry : 24 bytes execution time of append() : 0.039951 sec execution time of findName() : 0.002136 sec ``` * make cache-test ``` Performance counter stats for './phonebook_opt' (100 runs): 843,641 cache-misses # 75.363 % of all cache refs ( +- 0.09% ) 1,119,436 cache-references ( +- 0.15% ) 241,336,359 instructions # 1.98 insn per cycle ( +- 0.02% ) 122,166,026 cycles ( +- 0.15% ) 0.047273819 seconds time elapsed ( +- 0.20% ) ``` * make plot ![](https://i.imgur.com/WR5liKg.png) * 觀察與結論可以發現當 structure size 從 136 bytes 縮小到 24 bytes 時，讓 cache-miss 有了改善，從 92% 降低到 75%，cycle 數從 193,673,268 降低到 122,166,026 個 cycles。 ### step 2: 還原 phonebook 該含有的資訊量 * code ```clike= typedef struct __PHONE_BOOK_DETAIL { char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; } phonebookdetail; typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; phonebookdetail *detail; struct __PHONE_BOOK_ENTRY *pNext; } entry; ``` * size 從 24 bytes 變成 32 bytes ``` size of entry : 32 bytes execution time of append() : 0.045141 sec execution time of findName() : 0.002260 sec ``` * make cache-test ``` Performance counter stats for './phonebook_opt2' (100 runs): 1,229,308 cache-misses # 71.144 % of all cache refs ( +- 0.09% ) 1,727,921 cache-references ( +- 0.18% ) 244,817,721 instructions # 1.90 insn per cycle ( +- 0.02% ) 129,150,043 cycles ( +- 0.17% ) 0.050132034 seconds time elapsed ( +- 0.30% ) ``` * make plot ![](https://i.imgur.com/Nt89Ax9.png) * 觀察與結論跟第一個實驗相比，雖然因為 structure size 較大的關係，需要花比較多的時間在 append() 與 findName()，但也因為擁有教完整的訊息，比較符合現實中電話簿的狀況。另外，在 cache-test裡發現，雖然 cache-misses 發生的次數變多了，但發生的百分比卻下降了。 ### step 3: 實做 DKBRHash * code ```clike= unsigned int BKDRHash(char lastName[]) { unsigned int seed = 131; //31, 131,1313, ...etc unsigned int hash = 0; while(*lastName) hash = hash * seed + (*lastName++); return hash; } ``` * 執行 ``` size of entry : 136 bytes zyxel is found! zyxel is found! zyxel is found! execution time of append() : 0.063970 sec execution time of findName() : 0.000012 sec ``` * cache-miss ``` Performance counter stats for './phonebook_hash' (100 runs): 1,103,430 cache-misses # 53.294 % of all cache refs ( +- 0.08% ) 2,070,474 cache-references ( +- 0.13% ) 262,799,248 instructions # 1.47 insn per cycle ( +- 0.02% ) 179,179,332 cycles ( +- 0.27% ) 0.068228267 seconds time elapsed ( +- 0.28% ) ``` * plot ![](https://i.imgur.com/JsrX1BJ.png) * 觀察與結論雖然為了建出 hash table 導致 append() 的時間大幅上升，但在 findName() 的搜尋時間有著極大的進步。在同樣使用 136 bytes 的資料結構下，cache-misses也從 92% 降低至 53%。 ### step 4: DKBRHash + 縮小結構馬上來試試看利用前面縮小過後的資料結構看執行的效率如何 * 執行 ``` size of entry : 32 bytes zyxel is found! zyxel is found! zyxel is found! execution time of append() : 0.053068 sec execution time of findName() : 0.000001 sec ``` * cache-test ``` Performance counter stats for './phonebook_hash' (100 runs): 439,791 cache-misses # 36.721 % of all cache refs ( +- 0.45% ) 1,197,662 cache-references ( +- 0.97% ) 244,494,343 instructions # 1.66 insn per cycle ( +- 0.02% ) 146,955,437 cycles ( +- 0.55% ) 0.056842941 seconds time elapsed ( +- 0.61% ) ``` * plot ![](https://i.imgur.com/TJz4s7a.png) * 觀察與結論在縮小結構之後，DKBRHash 的 append time也有了大幅的進步，但在 findName 的搜尋時間則差不多。 * 若要使用 pre-define 使兩個不同的定義指定到同一個方式上時，可以使用以下寫法 `#if defined(HASH) || defined(HASH2)` ## 問題討論本例選用的 dataset (也就是輸入電話簿的姓名) 是否存在問題？ 1. 有代表性嗎？跟真實英文姓氏差異大嗎？ 2. 資料難道「數大就是美」嗎？ 3. 如果我們要建立符合真實電話簿情境的輸入，該怎麼作？接下來還能如何改善？ 1. 使用 binary search tree 或其他資料結構改寫 2. 藉由 hash function 來加速查詢，請詳閱 hash function 觀念和實務 3. 既然 first name, last name, address 都是合法的英文 (可假設成立)，使用字串壓縮的演算法，降低資料表示的成本 ## 參考資料 [Which hashing algorithm is best for uniqueness and speed?](https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed) [cache 原理和實驗](https://hackmd.io/s/S14L26-_l) [ryanwang522](https://hackmd.io/s/Sk-qaWiYl)