2017q3 Homework1 (phonebook)

# 2017q3 Homework1 (phonebook) contributed by < `changyuanhua` > ### Reviewed by `xdennisx` - 實做提到透過引入 hash 加速 `fineName()` 的操作，但缺乏不同 hash function 的效能比較和設計取捨 - 文中提到使用 hsah 之後 append() 時間上升，但卻未解釋為什麼會上升 - Git commit message 都有點太簡略，很難了解實際做了什麼事 - commit [d9e23d45016e2ebe243ffe6d79895577954a1d1c](https://github.com/changyuanhua/phonebook/commit/d9e23d45016e2ebe243ffe6d79895577954a1d1c) 中，寫 Reduce the size of structure，但實作上應該是把不需要的資料用另一個結構包起來，與 commit message 不太相符 - 卻乏搜尋演算法的評估和效能分析 ## 開發環境 * 輸入指令 ` lscpu ` ``` Architecture: x86_64 CPU 作業模式： 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 每核心執行緒數：1 每通訊端核心數：4 Socket(s): 1 NUMA 節點： 1 供應商識別號： GenuineIntel CPU 家族： 6 型號： 94 Model name: Intel(R) Core(TM) i5-6300HQ CPU @ 2.30GHz 製程： 3 CPU MHz： 799.890 CPU max MHz: 3200.0000 CPU min MHz: 800.0000 BogoMIPS: 4608.00 虛擬： VT-x L1d 快取： 32K L1i 快取： 32K L2 快取： 256K L3 快取： 6144K NUMA node0 CPU(s): 0-3 ``` >請列出相關硬體資訊，如：cache, memory >[color=red][name=課程助教] >ok, 謝謝[name=changyuanhua][color=black] ## 準備過程 * **準備 GNU/Linux 開發工具** * win10 進入 BIOS 步驟 * 設定 -> 更新與安全性 -> 復原 -> 進階啟動 -> 疑難排解 -> 進階選項 -> UEFI韌體設定 -> 重新啟動 --- 就可以進入 BIOS * ACER 電腦在看[輕鬆學會 Windows / Ubuntu 雙系統安裝](https://www.youtube.com/watch?v=rjpBTT6AmeU)完後, 如果沒有成功, 可以試著看[Acer Aspire E15 will not dual boot](http://askubuntu.com/questions/627416/acer-aspire-e15-will-not-dual-boot), 從文中第35個步驟開始做 * 閱讀 [perf 原理和實務](https://hackmd.io/s/B11109rdg#) * 閱讀 [cache 原理和實驗](https://hackmd.io/s/S14L26-_l#) ## 使用 astyle + Git 實做自動程式碼排版檢查 * 輸入指令 ` astyle --style=kr --indent=spaces=4 --indent-switches --suffix=none *.[ch] ` ## Perf 效能分析 * [perf 教學](https://hackmd.io/s/B11109rdg#) * Perf 觸發的事件分為三類： * **hardware** : 由 PMU 產生的事件，比如 cache-misses、cpu-cycles、instructions、branch-misses …等等，通常是當需要瞭解程序對硬體特性的使用情況時會使用。 * **software** : 是核心程式產生的事件，比如 context-switches、page-faults、cpu-clock、cpu-migrations …等等。 * **tracepoint** : 是核心中的靜態 tracepoint 所觸發的事件，這些 tracepoint 用來判斷在程式執行時期，核心的行為細節，比如 slab 記憶體配置器的配置次數等。 * **perf top** * 跟平常 Linux 內建的 top 指令很相似。它能夠「即時」的分析各個函式在某個 event 上的熱點 * **perf stat** * 相較於 top，使用 perf stat 往往是你已經有個要優化的目標，對這個目標進行特定或一系列的 event 檢查，進而了解該程序的效能概況 * **perf record & perf report** * 可以針對函式級別進行 event 統計，方便我們對程序「熱點」作更精細的分析和優化 ## gnuplot 繪圖 * [gnuplot 繪圖教學](https://hackmd.io/s/Skwp-alOg#) ## cache * [cache 原理和實驗](https://hackmd.io/s/S14L26-_l#) * cache miss * cache 裡找不到需要的指令或資料，因此須繼續往 main memory 找，而 latency (延遲)時間會愈長 * cache miss 還分為以下三種類型： 1.instruction read miss 2.data read miss 3.data write miss ## 遇到問題 * 輸入指令 ` make plot ` * 遇到問題 ``` Error: You may not have permission to collect stats. Consider tweaking /proc/sys/kernel/perf_event_paranoid, which controls use of the performance events system by unprivileged users (without CAP_SYS_ADMIN). The current value is 3: -1: Allow use of (almost) all events by all users >= 0: Disallow raw tracepoint access by users without CAP_IOC_LOCK >= 1: Disallow CPU event access by users without CAP_SYS_ADMIN >= 2: Disallow kernel profiling by users without CAP_SYS_ADMIN * 解決指令 ``` sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid' ``` * 發現圖檔重複出現相同數據 * 沒有把之前的數據清除 * 指令 ``` make clean ``` ## 未優化前效能分析 * code ```clike= /* original version */ entry *findName(char lastName[], entry *pHead) { while (pHead != NULL) { if (strcasecmp(lastName, pHead->lastName) == 0) return pHead; pHead = pHead->pNext; } return NULL; } entry *append(char lastName[], entry *e) { /* allocate memory for the new entry and put lastName */ e->pNext = (entry *) malloc(sizeof(entry)); e = e->pNext; strcpy(e->lastName, lastName); e->pNext = NULL; return e; } ``` entry *findName function 是一個重頭找 lastname 到尾的程式碼 entry *append function 是一個分配記憶體以及存放 lastName 的程式碼 * 要先清空 cache ` $ echo 1 | sudo tee /proc/sys/vm/drop_caches ` 再分析效能 * 輸入指令 ` ./phonebook_orig ` ``` size of entry : 136 bytes execution time of append() : 0.046820 sec execution time of findName() : 0.005392 sec ``` 顯示執行時間，以及 entry 大小 * 當輸入指令 ` make cache-test ` 後，會顯現下面資訊，是執行效能分析 `perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./phonebook_orig ` ``` Performance counter stats for './phonebook_orig' (100 runs): 4,616,916 cache-misses # 92.740 % of all cache refs ( +- 0.18% ) 4,978,335 cache-references ( +- 0.07% ) 262,592,789 instructions # 1.54 insn per cycle ( +- 0.02% ) 171,063,276 cycles ( +- 0.07% ) 0.059616342 seconds time elapsed ( +- 0.23% ) ``` 可以看出 cache-misses 蠻大的 * plot ![](https://i.imgur.com/MMlYw64.png) 這圖是2個相同的程式碼，是我用來試試 gnuplot 繪圖 ## 實驗 #### 實驗1： phonebook_opt.h 檔，刪除沒有用到的陣列，減少structure大小 >請注意程式碼縮排，為四個空白[name=課程助教][color=red] >>ok, 謝謝[name=changyuanhua][color=black] * code ```clike= typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; /* char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; */ struct __PHONE_BOOK_ENTRY *pNext; } entry; ``` * size 縮減 ``` size of entry : 24 bytes execution time of append() : 0.079211 sec execution time of findName() : 0.003613 sec ``` 因為把目前沒用的到陣列註解掉，所以 entry size 下降，但我發現下降的 size 跟他的陣列大小並沒有相對應，像是我單獨註解 char state[2] 整體的 size 就沒有下降，依然為136 bytes * cache misses ``` Performance counter stats for './phonebook_opt' (100 runs): 1,010,356 cache-misses # 62.711 % of all cache refs ( +- 0.70% ) 1,611,123 cache-references ( +- 0.12% ) 241,093,207 instructions # 2.21 insn per cycle ( +- 0.02% ) 108,989,869 cycles ( +- 0.07% ) 0.037727117 seconds time elapsed ( +- 0.51% ) ``` * plot ![](https://i.imgur.com/9i3vt7O.png) * 觀察與結論與原始檔案相互比較，可以看到 cache misses 值明顯的下降（92.740％ -> 62.711%)，也可以看到 size of entry 從 136 bytes -> 24 bytes ，以及 append() 和 findName() 的時間下降 >記得push回repo歐 [name=亮谷][color=#0fff56] >ok,謝謝[name=changyuanhua][color=black] #### 實驗2： phonebook_opt.h 檔，減少structure大小 * code ```clike= typedef struct __PHONE_BOOK_DETAIL { char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; } phonebookdetail; typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; phonebookdetail *detail; struct __PHONE_BOOK_ENTRY *pNext; } entry; ``` * size 縮減 ``` size of entry : 32 bytes execution time of append() : 0.032439 sec execution time of findName() : 0.001829 sec ``` * cache misses ``` Performance counter stats for './phonebook_opt' (100 runs): 1,425,648 cache-misses # 63.673 % of all cache refs ( +- 0.48% ) 2,239,025 cache-references ( +- 0.14% ) 244,508,202 instructions # 2.13 insn per cycle ( +- 0.02% ) 114,603,354 cycles ( +- 0.10% ) 0.039536986 seconds time elapsed ( +- 0.72% ) ``` * plot ![](https://i.imgur.com/QT1ZJcs.png) * 觀察與結論與原始檔案相互比較，可以看到 cache misses 值明顯的下降（92.740％ -> 63.673%)，也可以看到 size of entry 從 136 bytes -> 32 bytes ，以及 append() 和 findName() 的時間下降，再與實驗一做比較，除了 size of entry 值，可以發現並沒有明顯的差異，只是稍長一些，因為 size of entry 值變大，但有保有其他資訊，比較符合真實電話簿的性能 #### 實驗3：DKBRHash * code ```cstyle= unsigned int BKDRHash(char lastName[]) { unsigned int seed=131; // 31 131 1313 13131 etc.. unsigned int hash=0; while(*lastName) hash = hash * seed + (*lastName++); return hash; } ``` * ``` size of entry : 136 bytes execution time of append() : 0.047629 sec execution time of findName() : 0.000001 sec ``` * cache misses ``` Performance counter stats for './phonebook_hash' (100 runs): 1,243,609 cache-misses # 47.014 % of all cache refs ( +- 0.10% ) 2,645,181 cache-references ( +- 0.10% ) 262,250,871 instructions # 1.74 insn per cycle ( +- 0.02% ) 150,897,816 cycles ( +- 0.05% ) 0.051010663 seconds time elapsed ( +- 0.26% ) ``` * plot ![](https://i.imgur.com/tKjkihJ.png) * 觀察與結論與原始檔案相互比較，可以看到 cache misses 值明顯的下降（92.740％ -> 47.014%)，也可以看到 findName() 的時間顯著下降，但 append() 時間卻略為上升，因為多了些額外的運算 #### 實驗4：DKBRHash + structure 減少 * ``` size of entry : 32 bytes execution time of append() : 0.042075 sec execution time of findName() : 0.000001 sec ``` * cache miss ``` Performance counter stats for './phonebook_hash' (100 runs): 470,619 cache-misses # 31.964 % of all cache refs ( +- 0.22% ) 1,472,352 cache-references ( +- 0.19% ) 244,445,874 instructions # 1.98 insn per cycle ( +- 0.02% ) 123,749,401 cycles ( +- 0.06% ) 0.042495388 seconds time elapsed ( +- 0.27% ) ``` * plot ![](https://i.imgur.com/0BDdPG0.png) * 觀察與結論與原始檔案相互比較，可以看到 cache misses 值明顯的下降（92.740％ -> 31.964%)，也可以看到 size of entry 從 136 bytes -> 32 bytes ，以及 findName() 的時間顯著下降，但 append() 時間卻略為上升 ## 問題討論 1.回答以下問題：本例選用的 dataset (也就是輸入電話簿的姓名) 是否存在問題？ * 有代表性嗎？跟真實英文姓氏差異大嗎？在 word.txt 中可以看到很多字都是無意義的，只是英文字母的排列組合 * 資料難道「數大就是美」嗎？並非數大就是美，我認為是看你目前的需求，因為當資料量龐大資訊齊全，你所需要花費搜尋的時間也較多，而假設你只需要姓名與電話，但資料中多了生日與地址，你將多花費不必要的時間 * 如果我們要建立符合真實電話簿情境的輸入，該怎麼作？電話簿裡應該要是真實的英文姓名，而不是一些英文字母的排列組合 ## 問題待解決 * 問題一 * 在我輸入指令 ` sudo sh -c " echo -1 > /proc/sys/kernel/perf_event_paranoid" ` 前 , 輸出指令 ` sudo perf top -p $pid & perf --stdio ` 或是 ` sudo perf top -p $pid & perf --stdio ` 所呈現的資料約相似 * 但在我輸入指令 ` sudo sh -c " echo -1 > /proc/sys/kernel/perf_event_paranoid" ` 後 * 當我輸入指令 ` sudo perf top -p $pid & perf --stdio ` * 所呈現的資料極為怪異 ``` Samples: 17 of event 'cycles:ppp', Event count (approx.):277764931792 Overhead Shared O Symbol 100.00% example [.] _Z19compute_pi_baselinem 0.00% example [.] main 0.00% [kernel] [k] do_nmi ``` * 當我輸入指令 ` sudo perf top -e cycles -p $pid & perf --stdio ` * 所呈現的資料與輸入指令 ` sudo sh -c " echo -1 > /proc/sys/kernel/perf_event_paranoid" ` 前相似 ``` Samples: 3K of event 'cycles', Event count (approx.): 745556798 Overhead Shared O Symbol 99.35% example [.] _Z19compute_pi_baselinem 0.38% [kernel] [k] i2c_dw_isr 0.13% [kernel] [k] idma64_irq 0.03% [kernel] [k] _raw_spin_lock 0.03% [kernel] [k] intel_pmu_enable_all 0.03% [kernel] [k] cpu_load_update 0.03% [kernel] [k] irq_work_run_list ``` >不知道問題是跟什麼有關係[name=changyuanhua][color=blue] ## 參考資料 [nekoneko](https://hackmd.io/s/rJCs_ZVa#) [hash function](https://hackmd.io/s/HJln3jU_e#) [bkdrhash](http://blog.csdn.net/wanglx_/article/details/40400693) [jkrvivian](https://hackmd.io/s/SyOQgOQa#) [hash table 簡單說明](http://www.csie.ntnu.edu.tw/~u91029/Set.html) [hash table 說明](http://alrightchiu.github.io/SecondRound/hash-tableintrojian-jie.html) [字符串 hash 函數比較](https://www.byvoid.com/zhs/blog/string-hash-compare) [BKDRhash 實現](http://www.codeceo.com/article/hash-based-on-bkdrhash.html)