# 2017q1 Homework1 (phonebook) contributed by < `0xff07` > ## 題目 * 原始碼在[這裡](https://github.com/0xff07/phonebook) * 作業要求在[這裡](https://hackmd.io/s/rJYD4UPKe#) ## 開發環境 Ubuntu 16.04.2 Linux version 4.8.0-36-generic gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 3072K `$lscpu` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 2 Core(s) per socket: 2 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 69 Model name: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz Stepping: 1 CPU MHz: 887.109 CPU max MHz: 2900.0000 CPU min MHz: 800.0000 BogoMIPS: 4800.34 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 3072K NUMA node0 CPU(s): 0-3 >中英文字間請已空白隔開[color=red][name=課程助教] <s> ## 進度 - [ ] 2/20 : 學習 perf, gnuplot, 並實際編譯給定的程式, 看看能不能重現結果。 </s> >> 不用特別標日期,HackMD 會精準追蹤每次修改的內容和時間 [name="jserv"] ## 待查資料 - [ ] Kernel pointer ? ## 目標 計劃跟做計算流體力學的教授做專題, 這些優化的技能學起來, 看看可不可以用上! # 工具 目前看來這份作業要使用的工具有: - [ ] 熟悉 Perf(gprof 跟 valgrind 不知道會不會有幫助?) - [ ] 熟悉 astyle 的基本使用 - [ ] 熟悉 gnuplot 所以開始啃東西吧! ## Perf 我把perf想像成超級高級的「系統監視器」。 >> 用途不同,請重新閱讀 [perf](https://perf.wiki.kernel.org/),不要混用 tracer 和 monitor [name="jserv"] ### 檢查開啟 首先是檢查perf有無啟動。利用作業說明中的[perf 原理和實務](https://hackmd.io/s/B11109rdg#)這篇文章,先看看自己有沒有啟動perf: ``` shell $ cat "/boot/config-`uname -r`" | grep "PERF_EVENT" ``` 結果發現以下錯誤: cat: '/boot/config-`uname -r`': No such file or directory 本來不知道為什麼, 不過5分鐘之後發現自己前一陣子把預設的shell換成fish, 而fish並不是POSIX標準的shell. 把預設的shell 改回 bash: ```shell $ chsh -s /bin/bash ``` 然後執行剛剛的命令, 終端機輸出: CONFIG_PERF_EVENTS_INTEL_UNCORE=y CONFIG_HAVE_PERF_EVENTS=y CONFIG_PERF_EVENTS=y CONFIG_HAVE_PERF_EVENTS_NMI=y 看來perf是有啟動的。接下來就開始安裝東西了! ### 安裝 利用文章當中提供的第二個方法安裝: ```shell $ sudo apt-get install linux-tools-common ``` 然後發現以下的警告訊息: WARNING: perf not found for kernel 4.4.0-31 You may need to install the following packages for this specific kernel: linux-tools-4.4.0-31-generic linux-cloud-tools-4.4.0-31-generic You may also want to install one of the following packages to keep up to date: linux-tools-generic linux-cloud-tools-generic 如文章所說少了一些東西。把他們安裝好 ```shell $ sudo apt-get install linux-tools-generic $ sudo apt-get install linux-cloud-tools-generic $ sudo apt-get install linux-tools-4.4.0-31-generic $ sudo apt-get install linux-cloud-tools-4.4.0-31-generic ``` 接著執行 ```shell $ perf list ``` 得到以下的結果: ``` List of pre-defined events (to be used in -e): branch-instructions OR branches [Hardware event] branch-misses [Hardware event] bus-cycles [Hardware event] cache-misses [Hardware event] cache-references [Hardware event] cpu-cycles OR cycles [Hardware event] instructions [Hardware event] ref-cycles [Hardware event] alignment-faults [Software event] bpf-output [Software event] context-switches OR cs [Software event] cpu-clock [Software event] cpu-migrations OR migrations [Software event] dummy [Software event] emulation-faults [Software event] major-faults [Software event] minor-faults [Software event] page-faults OR faults [Software event] task-clock [Software event] ``` <!--![](https://i.imgur.com/9FHj5HH.png) --> >> 文字訊息請避免用圖片來表示,否則不好搜尋和分類 [name="jserv"] >> Ok, 已把文字補上! [name=Fernando Lin] <!--(做這部份時有研究了一下ubuntu該如何截圖。[這裡](http://blog.xuite.net/yh96301/blog/242377657-Ubuntu+14.04%E6%93%B7%E5%8FDF%96%E8%9E%A2%E5%B9%95)有提供一個好用的螢幕截圖方法)--> 接著文章中有提到如果沒有root權限, 可能沒辦法取得所有的event data. 但是我決定暫時使用sudo. 如果直接用vim 編輯 ``` $ vim /proc/sys/kernel/perf_event_paranoid ``` 存檔的時候vim抱怨錯誤: "/proc/sys/kernel/perf_event_paranoid" E667: Fsync failed 還沒有研究是什麼原因。 不過如果仿照下面的指令, 比如說想要把權限改成-1: ```SHELL $ sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoi ``` 這樣他就會乖乖改成-1了。 接著文章提到需要解除「kernel pointer」的使用限制,才可以觀察cach miss: ``` shell $ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict" ``` 不過我想先看看「如果不解除」的話,會在哪裡出事, 因此暫時不執行他。 ### 測試perf 接下來編譯文章中提到的程式碼,並且依照測試執行perf. 執行結果如下: ``` 99.24% example [.] compute_pi_baseline 0.40% [kernel] [k] native_queued_spin_lock_slowpath 0.10% [kernel] [k] osl_readl 0.05% [kernel] [k] wlc_channel_set_chanspec 0.05% [kernel] [k] unmap_page_range 0.05% [kernel] [k] delay_tsc 0.05% [kernel] [k] clm_limits 0.05% [kernel] [k] osl_readw 0.00% [kernel] [k] native_irq_return_iret 0.00% [kernel] [k] native_write_msr_safe ``` > 現在還看不懂下面標籤是什麼意思, 不過大概猜[k]是跟kernel有關的數據。 繼續往下看文章。 ### 背景知識 這裡可以參照文章中的文字。有些在計算機組織都學過了(e.g. cach, branch類指令可能對效能帶來的不利等), 而沒有接觸過的主要有: - [ ] PMU : 硬體層面的效能監控 - [ ] dual-issue - [ ] tracepoint 再查tracepoint時發現了這個影片: {%youtube 37v14rMtALs%} >> 請附上你觀看演講後的認知,這樣旁人可以跟你做深入的討論 [name="jserv"] ### 用法 1. `perf help` <想問的東西> 2. `perf list` : 列出可支援的event。白話文就是列出可以看的東西 3. `perf top` : 跟「系統監視器」有點像, 不過perf更豪華, 除了哪個程式佔用最多效能, 連「那個程式中的哪個函數」都可以知道。 如果有指定的event想觀測, 可以用類似下面的用法 : perf top -p <process ID> -e <想觀察的event> -c <觀察周期cycle數> 常用的event包含: * cache-misses * cache-references * instructions * cycles > `--repeat <重複次數>` : 可以取重複測試的平均 4. `perf stat <某個程式>`:可以塞編好的binary。選項同上。 5. `perf record <某個程式>` : 一樣是可以塞編好的binary , 但是可以更細部追蹤到各種函式的效能 ## GNU Plot GNU Plot相對之下比較熟悉,之前有些作業有用到它。 [這裡](https://www.electricmonk.nl/log/2014/07/12/generating-good-looking-charts-with-gnuplot/)有篇關於把圖畫得漂亮的小文章。 ## astyle # 執行 ```bash $ cd phonebook $ make $ make run ``` 輸出結果如下 size of entry : 136 bytes execution time of append() : 0.044346 sec execution time of findName() : 0.005026 sec 3 另外,執行 ```BASH $ make plot ``` [](https://i.imgur.com/wfOzvEe.png) 以及會看到他再抱怨沒有優化版的實作。該開始動腦了 # phonebook 優化 - [x] 計劃0. 改變資料結構: 由一個大的linked list, 改成依照字母分成26個linked list - [x] 計劃1. 平行化 - [ ] 計劃2. 把「歷史紀錄」或是「常用的」記下來。 >> 今天寫完下面兩種優化之後, 突然發現作業要求裡面有 : >> 「main.c 應該只有一份,不要建立新的 main(),善用 Makefile 定義對應的 CFLAGS」 >> 不小心把這點忽略了, 寫了兩個 main.c >> 會再對程式做出修改。 ## 改成26條 Linked-List 原先得資料結構是將全部的名單放在同一條linked list, 並照字母排列. 如果要查詢的名字在很後面, 會遍歷很多不必要的姓名。 所以新增一個 main_opt.c , 其內容為將 main.c 中的 ```C entry *pHead, *e; ``` 調整成: ```C entry *pHead[26], *e[26]; ``` 並將建立部份改成 : ```C int first = (int)(line[0] - 'a'); e[first] = append(line, e[first]); ``` 將查詢部份改成: ```C int first = (int)(input[0] - 'a'); findName(input, e[first]); ``` 並對 Makefile 做出相應的調整。 執行結果為 : ``` Performance counter stats for './phonebook_opt' (100 runs): 159,196 cache-misses # 65.293 % of all cache refs ( +- 0.82% ) 243,817 cache-references ( +- 0.61% ) 205,701,663 instructions # 1.53 insn per cycle ( +- 0.02% ) 134,051,189 cycles ( +- 0.23% ) 0.054094570 seconds time elapsed ( +- 0.37% ) ``` 而未修改的版本為: ``` Performance counter stats for './phonebook_orig' (100 runs): 1,024,194 cache-misses # 88.881 % of all cache refs ( +- 0.24% ) 1,152,320 cache-references ( +- 0.23% ) 262,075,842 instructions # 1.46 insn per cycle ( +- 0.02% ) 179,420,610 cycles ( +- 0.28% ) 0.064920713 seconds time elapsed ( +- 0.44% ) ``` 可以發現 cache misses 從 88.881 % 降至 65.293 % 。 但意外的是圖畫出來花費時間居然一樣。 應該是 runtime.gp 或 Makefile 部份沒調整好。 正在調整中。 :::danger 這個推測合理嗎?請重新探討 --jserv ::: ## 平行化 主要的構想是這樣 : 1. 每固定數目(比如說 5000 筆資料), 紀錄一個節點的位址 2. 搜尋時, 依照執行緒的數目, 對這些中繼點做 cyclic partition 方式的平行化來搜尋資料。 3. 用openMP做嘗試 >> 這裡寫了很醜的 code , 而且感覺(應該說是一定)有更好更漂亮的寫法, 不過我還是想試試看完成平行化的查詢。 為了從這些很醜的 code 中理出頭緒, 所以先邊寫邊紀錄。 首先先設定每 5000 筆資料設立一個中繼點。 在 `phonebook_opt.h` 中: ```C #define PORTION_SIZE 5000 ``` 並且自行定義一個結構來儲存這些中繼點 : ```C typedef struct __segments { int cap; int size; entry** seg; }segments; ``` >> C11 支援 [Unnamed Structure and Union Fields](https://gcc.gnu.org/onlinedocs/gcc/Unnamed-Fields.html),所以你可以改寫為: ``` typedef struct { int cap; int size; entry **seg; } segments; ``` >> 看起來更清爽。注意 coding style: `} segments` 中間有空白,`entry **seg` 的指標跟著變數 `seg` [name="jserv"] 照理說不能期待能預先知道資料數目, 所以打算用動態的方式紀錄這些中繼點, 用以下的方式紀錄中繼點資料 : ```C void push_back(segments *s, entry* e) { if((s->size) + 1 >= s->cap){ (s->cap) *= 2; s->seg = realloc(s->seg, sizeof(entry*)*s->cap); s->seg[(s->size)++] = e; }else{ s->seg[(s->size)++] = e; } } ``` 接著修改 `findName()` 。 先寫出每個執行緒需要執行的任務 `__findName()` 。 這個跟原始版的 `findName()` 幾乎一樣, 只是把 while 迴圈中的條件由 : ```C while (pHead != NULL){ ... pHead = pHead -> pNext; } ``` 改成 : ```C int count = 0; while (pHead != NULL && count < PORTION_SIZE){ ... pHead = pHead -> pNext; count++; } ``` 最後是`findName`。 因為多了「紀錄中繼站」的陣列, 以及「執行緒的數目」, 所以參數變成 : ```C entry *findName(char lastName[], entry *pHead, segments *s, int nthread); ``` 而整個 `findName` 的實作如下 : ``` C entry *findName(char lastName[], entry *pHead, segments *s, int nthread) { int num = s->size; entry **res = calloc(sizeof(entry*), num); omp_set_num_threads(nthread); #pragma omp parallel { int id = omp_get_thread_num(); for(int i = id; i < num; i+=nthread){ res[i] = __findName(lastName, s->seg[i]); } } for(int i = 0; i < num; i++){ if(res[i]){ return res[i]; } } return NULL; } ``` 執行結果如下 : ``` Performance counter stats for './phonebook_opt' (100 runs): 533,129 cache-misses # 53.601 % of all cache refs ( +- 0.57% ) 994,635 cache-references ( +- 0.17% ) 279,567,254 instructions # 1.19 insn per cycle ( +- 0.19% ) 235,699,030 cycles ( +- 0.50% ) 0.066403495 seconds time elapsed ( +- 0.42% ) ``` 而這次測驗中, 原始版本如下 : ``` Performance counter stats for './phonebook_orig' (100 runs): 1,032,975 cache-misses # 88.910 % of all cache refs ( +- 0.13% ) 1,161,817 cache-references ( +- 0.12% ) 262,089,603 instructions # 1.45 insn per cycle ( +- 0.02% ) 180,133,449 cycles ( +- 0.17% ) 0.064475132 seconds time elapsed ( +- 0.25% ) ``` 減少了35%左右的cache miss >> 因為 gnuplot 的 script 還沒修好, 所以這裡暫時選取幾組時間資料來展示時間上的提升。 會儘快把圖補上! :::danger 請解釋 `#pragma omp parallel` 搭配迴圈切割的效益 --jserv ::: 優化前的時間 : ``` size of entry : 136 bytes execution time of append() : 0.044632 sec execution time of findName() : 0.006063 sec size of entry : 136 bytes execution time of append() : 0.046319 sec execution time of findName() : 0.004952 sec size of entry : 136 bytes execution time of append() : 0.044687 sec execution time of findName() : 0.005105 sec size of entry : 136 bytes execution time of append() : 0.048199 sec execution time of findName() : 0.004924 sec size of entry : 136 bytes execution time of append() : 0.043892 sec execution time of findName() : 0.004837 sec size of entry : 136 bytes execution time of append() : 0.046132 sec execution time of findName() : 0.004944 sec size of entry : 136 bytes execution time of append() : 0.044600 sec execution time of findName() : 0.004987 sec size of entry : 136 bytes execution time of append() : 0.045488 sec execution time of findName() : 0.004838 sec size of entry : 136 bytes execution time of append() : 0.044481 sec execution time of findName() : 0.005026 sec size of entry : 136 bytes execution time of append() : 0.047144 sec execution time of findName() : 0.005013 sec ``` 優化後 ``` execution time of append() : 0.045227 sec execution time of findName() : 0.002550 sec size of entry : 136 bytes execution time of append() : 0.044574 sec execution time of findName() : 0.002549 sec size of entry : 136 bytes execution time of append() : 0.046196 sec execution time of findName() : 0.002515 sec size of entry : 136 bytes execution time of append() : 0.045013 sec execution time of findName() : 0.003184 sec size of entry : 136 bytes execution time of append() : 0.045519 sec execution time of findName() : 0.002459 sec size of entry : 136 bytes execution time of append() : 0.044297 sec execution time of findName() : 0.002481 sec size of entry : 136 bytes execution time of append() : 0.045520 sec execution time of findName() : 0.002538 sec size of entry : 136 bytes execution time of append() : 0.044734 sec execution time of findName() : 0.002483 sec size of entry : 136 bytes execution time of append() : 0.046052 sec execution time of findName() : 0.002525 sec size of entry : 136 bytes execution time of append() : 0.044142 sec execution time of findName() : 0.002481 sec size of entry : 136 bytes execution time of append() : 0.047937 sec execution time of findName() : 0.002493 sec ``` <!-- 今天(2/20)下午經過資工系館的時候突然看到一門課: ![](https://i.imgur.com/Hh9rpWM.png) 而且很佛心的有影片: [Information Theory and Coding Technique](http://cmlab.csie.ntu.edu.tw/~itct/) 不知道會不會對這個議題有幫助,也許明天可以看看(雖然我都看不懂)。 感覺如果可以知道查詢可能的機率分佈,就有機會針對這個分布去做優化。 不知道是什麼分佈就猜常態分佈好了。 --> ## 問題討論 回答以下問題:本例選用的 dataset (也就是輸入電話簿的姓名) 是否存在問題? 1. 有代表性嗎?跟真實英文姓氏差異大嗎? 首先做一點簡單的統計。參考維基百科上[英文字母出現頻率](https://en.wikipedia.org/wiki/Letter_frequency)中,目前英文字母的出現頻率 ![](https://i.imgur.com/LBMUClG.png) 接著統計 phonebook 中,words.txt 中各個字母出現的次數: ![](https://i.imgur.com/J68Y3Qo.png) 以及 phonebook 作業中,提供的 all-name.txt 中各個單字的出現次數 ![](https://i.imgur.com/uCipG3S.png) 若比 all-name.txt 與 words.txt 當中的頻率,可以發現兩者在字母頻率的趨勢上頗接近,表示 words.txt 一定程度上還原了使用「字母」的習慣 但是字母組成比例相同,不表示「與英文接近」,也未必代表跟真實情況接近。 舉例來說,這裡的程式假定所有 last name 只出現一次,但實際上的電話簿會可能會有很多人同姓、同名等等。如果這樣的話,用 hash 實作就必須考慮更多事情。 2. 資料難道「數大就是美」嗎? 3. 如果我們要建立符合真實電話簿情境的輸入,該怎麼作?