Phonebook Optimization

# Phonebook Optimization ## 輸入法 homework都是在ubuntu上作業，對於從windows轉到linux的user，其中一個面臨的問題就是輸入法，示範[安裝嘸蝦米輸入法](https://hackmd.io/GzAsDMEZQYwTgLTgKYA4DMDQEMBGuE5ldUEATAJjN3TMgFYAGYZAdiA=) ## Development Environment ### sudo lshw --->>> this command lshw (list hardware) can get more detailed hardware information.<br>下面只是一部分資訊 * OS : 14.04.1-Ubuntu * kernel version : 3.19.0-25-generic * model name : Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz * L1d cache: 32K * L1i cache: 32K * L2 cache: 3072K * memory description: System Memory physical id: 20 slot: System board or motherboard size: 8GiB * bank:0 description: SODIMM DDR3 Synchronous 1066 MHz (0.9 ns) product: HMT351S6EFR8C-PB vendor: Hynix Semiconductor (Hyundai Electronics) physical id: 0 serial: 2736E85A slot: DIMM 1 size: 4GiB width: 64 bits clock: 1066MHz (0.9ns) * bank:1 description: SODIMM DDR3 Synchronous 1066 MHz (0.9 ns) product: HMT351S6EFR8C-PB vendor: Hynix Semiconductor (Hyundai Electronics) physical id: 1 serial: 0B7C485B slot: DIMM 2 size: 4GiB width: 64 bits clock: 1066MHz (0.9ns) ## Git 設定參考同學 [GitHub 設定指引](http://wiki.csie.ncku.edu.tw/github) ## Astyle * astyle的參數，對source code編排的影響，請參考(http://astyle.sourceforge.net/astyle.html#_indent-switches)<br> * ## 作業要求的coding style * –suffix=none : 取消自動備份 * --indent-switches Indent 'switch' blocks so that the 'case X:' statements are indented in the switch block. The entire case block is indented. ``` $ astyle --style=kr --indent=spaces=4 --indent-switches --suffix=none *.[ch] ``` ## PI近似法以下不像蒙地卡羅法，那是什麼方法，為什麼可以求出PI? ``` #include <stdio.h> #include <unistd.h> double compute_pi_baseline(size_t N) { double pi = 0.0; double dt = 1.0 / N; for (size_t i = 0; i < N; i++) { double x = (double) i / N; pi += dt / (1.0 + x * x); } return pi * 4.0; } int main() { printf("pid: %d\n", getpid()); sleep(10); compute_pi_baseline(50000000); return 0; } ``` [蒙地卡羅法](http://latinboy.pixnet.net/blog/post/23461935-%E7%AC%AC%E4%B8%80%E6%AC%A1%E4%BD%BF%E7%94%A8%E4%BA%82%E6%95%B8%E5%B0%B1%E4%B8%8A%E6%89%8B---%E8%92%99%E5%9C%B0%E5%8D%A1%E7%BE%85%E6%B3%95%E6%B1%82%E5%9C%93%E5%91%A8%E7%8E%87)裡有一個橢圓的相關方程式，(xx + yy < 1) 這是什麼？ ## Perf 參考[Linux 效能分析工具: Perf](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial) ``` $ sudo su <--- 切到root $ echo -1 > /proc/sys/kernel/perf_event_paranoid --- 開啟可以取得所有event data的權限 $ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict" --- 如果要檢測 cache miss event ，需要先取消 kernel pointer 的禁用。 ``` Ex:待測程式需印出PID，並延遲幾秒執行，讓另一個terminal可以執行perf top -p $pid（填上待測程式的PID），抓取相關的event data. * 分析性能問題時需要瞭解的[背景知識](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial) * 硬體特性之 cache：存取速率非常快，充分利用 cache 是軟體效能改善過程中，非常重要的部分。 * Pipeline：一個時鐘週期內可以同時處理三條指令，分別被 pipeline 的不同部分處理。 * Superscalar：一個時鐘週期觸發 (issue) 多條指令的 pipeline機器架構，比如 Intel 的 Pentium 處理器，內部有兩個執行單元，在一個時鐘週期內允許執行兩條指令。==> 這樣稱為 dual-issue，可想像為一個 packet 裡同時有兩組 pipelined 的 instruction * out-of-order execution：在處理器內部，不同指令所需要的執行時間和時鐘週期是不同的，如果嚴格按照程序的執行順序執行，那麼就無法充分利用處理器的 pipeline。因此指令有可能被亂序執行 * Pipeline，Superscalar，out-of-order execution執行的指令有一個基本要求，即相鄰的指令相互沒有依賴關係。假如某條指令需要依賴前面一條指令的執行結果數據，那麼 pipeline 便失去作用，因為第二條指令必須等待第一條指令完成。 * Branch Prediction：若分支的結果是跳到其它指令，則pipeline後面的執行單元必需flush，從而影響效能。 * static prediction * Opcode based 查 * Displacement based (forward not taken, backward taken) 查 * Compiler directed (branch likely, branch not likely) 查如何設定 * Dynamic prediction * Use part of the PC (low-order bits) to index buffer/table – Why? – Multiple branches may share the same bit 查 Backward branches for loops will be mispredicted twice -- once upon loop exit and then again on loop entry查 * Branch penalty involves both misprediction rate and branch frequency, and is higher for integer benchmarks查 * Prediction accuracy improves with buffer size, but does not improve beyond 4K entries 查 * [Correlating or Two-level Predictors](http://www.cs.ucr.edu/~gupta/teaching/203A-09/My6.pdf)查 ## 原始版本 * Data structure ``` 6 /* original version */ 7 typedef struct __PHONE_BOOK_ENTRY { 8 char lastName[MAX_LAST_NAME_SIZE]; 9 char firstName[16]; 10 char email[16]; 11 char phone[10]; 12 char cell[10]; 13 char addr1[16]; 14 char addr2[16]; 15 char city[16]; 16 char state[2]; 17 char zip[5]; 18 struct __PHONE_BOOK_ENTRY *pNext; 19 } entry; ``` * 指令 ``` $ sudo su $ echo -1 > /proc/sys/kernel/perf_event_paranoid $ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict" $ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_orig ``` * perf stat 結果 ``` size of entry : 136 bytes execution time of append() : 0.088558 sec execution time of findName() : 0.017283 sec Performance counter stats for './phonebook_orig' (10 runs): 827,166 cache-misses # 29.994 % of all cache refs ( +- 0.95% ) [44.38%] 2,757,744 cache-references ( +- 0.77% ) [31.05%] 2,838,813 L1-dcache-load-misses ( +- 1.31% ) [28.96%] 5,272,629 L1-dcache-store-misses ( +- 1.29% ) [28.35%] 0 L1-dcache-prefetch-misses 67,225 L1-icache-load-misses ( +- 3.70% ) [29.64%] 259,001,566 instructions # 0.79 insns per cycle ( +- 0.62% ) [44.15%] 328,689,512 cycles ( +- 0.40% ) [42.98%] 0.146228190 seconds time elapsed ( +- 0.43% ) ``` ## 程式優化-1 (縮小 data structure) * Data structure: ``` 27 typedef struct __PHONE_BOOK_ENTRY { 28 char lastName[MAX_LAST_NAME_SIZE]; 29 #ifdef DATA_STRUCTURE_OPTIMIZATION_SMALLER_SIZE 30 releated *related_information; 31 #else 32 char firstName[16]; 33 char email[16]; 34 char phone[10]; 35 char cell[10]; 36 char addr1[16]; 37 char addr2[16]; 38 char city[16]; 39 char state[2]; 40 char zip[5]; 41 #endif 42 struct __PHONE_BOOK_ENTRY *pNext; 43 } entry; ``` * 指令 ``` $ sudo su $ echo -1 > /proc/sys/kernel/perf_event_paranoid $ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict" $ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_opt ``` * perf stat output ``` size of entry : 32 bytes execution time of append() : 0.064288 sec execution time of findName() : 0.005590 sec Performance counter stats for './phonebook_opt' (10 runs): 70,125 cache-misses # 8.085 % of all cache refs ( +- 0.76% ) [42.66%] 867,400 cache-references ( +- 7.07% ) [29.23%] 634,781 L1-dcache-load-misses ( +- 2.59% ) [31.64%] 1,630,373 L1-dcache-store-misses ( +- 2.77% ) [33.57%] 0 L1-dcache-prefetch-misses 40,509 L1-icache-load-misses ( +- 6.46% ) [31.99%] 256,956,863 instructions # 1.32 insns per cycle ( +- 0.86% ) [45.52%] 193,948,741 cycles ( +- 0.54% ) [43.05%] 0.084300406 seconds time elapsed ( +- 0.54% ) ``` ## 程式優化-2 (Bucket Link List) * Data structure： * Bucket Link List <br> 不確定是什麼名字暫時取為Bucket Link List * ![](https://i.imgur.com/h3BJyXH.png) ``` 46 typedef struct __LEVEL1_COMPARE { 47 char first_char_of_name; 48 entry *link; 49 } level1_compare; ``` * 指令 ``` $ sudo su $ echo -1 > /proc/sys/kernel/perf_event_paranoid $ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict" $ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_opt_2 ``` * perf stat output ``` size of entry : 32 bytes execution time of append() : 18.591619 sec execution time of findName() : 0.000032 sec Performance counter stats for './phonebook_opt_2' (10 runs): 199,603 cache-misses # 0.008 % of all cache refs ( +- 6.19% ) [42.85%] 2,489,462,022 cache-references ( +- 0.02% ) [28.60%] 5,701,701,395 L1-dcache-load-misses ( +- 0.30% ) [28.59%] 10,203,165 L1-dcache-store-misses ( +- 3.69% ) [28.58%] 0 L1-dcache-prefetch-misses 1,635,607 L1-icache-load-misses ( +- 1.90% ) [28.58%] 23,329,940,160 instructions # 0.53 insns per cycle ( +- 0.02% ) [42.86%] 44,357,537,174 cycles ( +- 0.12% ) [42.84%] 18.643948613 seconds time elapsed ( +- 0.13% ) ``` ## 程式優化-3 (Bucket Link List - modifed) * Data structure： * ![](https://i.imgur.com/ZUhNzZg.png) ``` 46 typedef struct __LEVEL1_COMPARE { 47 char first_char_of_name; 48 entry *the_last_node; 49 entry *link; 50 } level1_compare; ``` * 指令 ``` $ sudo su $ echo -1 > /proc/sys/kernel/perf_event_paranoid $ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict" $ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_opt_3 ``` * perf stat output： ``` size of entry : 32 bytes execution time of append() : 0.070129 sec execution time of findName() : 0.000032 sec Performance counter stats for './phonebook_opt_3' (10 runs): 31,586 cache-misses # 5.733 % of all cache refs ( +- 3.19% ) [41.21%] 550,933 cache-references ( +- 1.64% ) [37.54%] 425,916 L1-dcache-load-misses ( +- 0.95% ) [34.90%] 1,906,506 L1-dcache-store-misses ( +- 0.92% ) [32.54%] 0 L1-dcache-prefetch-misses 63,040 L1-icache-load-misses ( +- 8.00% ) [27.99%] 206,689,110 instructions # 1.22 insns per cycle ( +- 0.38% ) [38.99%] 168,967,583 cycles ( +- 0.69% ) [34.78%] 0.073860286 seconds time elapsed ( +- 0.73% ) ``` * 結果比較 <table> <tr> <td></td> <td>original</td> <td>優化1</td> <td>優化2</td> <td>優化3</td> </tr> <td>execution time of append()</td> <td>0.088558 sec</td> <td>0.064288 sec</td> <td>18.591619 sec</td> <td>0.070129 sec</td> </tr> <tr> <td>execution time of findName()</td> <td>0.017283 sec</td> <td>0.005590 sec</td> <td>0.000032 sec</td> <td>0.000032 sec</td> </tr> <tr> <td>cache-misses</td> <td>827,166</td> <td>70,125</td> <td>199,603</td> <td>31,586</td> </tr> <tr> <td>cache-references</td> <td>2,757,744</td> <td>867,400</td> <td>2,489,462,022</td> <td>550,933</td> </tr> <tr> <td>instructions</td> <td>259,001,566</td> <td>256,956,863</td> <td>23,329,940,160</td> <td>206,689,110</td> </tr> <tr> <td>cycles</td> <td>328,689,512</td> <td>193,948,741</td> <td>44,357,537,174</td> <td>168,967,583</td> </tr> <tr> <td>append() cache-misses</td> <td></td> <td></td> <td>39.36%</td> <td></td> </tr> <tr> <td>findName() cache-misses</td> <td>48.96%</td> <td>6.89%</td> <td></td> <td></td> </tr> </table> >如何取得較小的function misses ? [youchihwang][time=Fri, Oct 7, 2016 5:10 AM] >gnuplot，如何設定長方圖的寬度呢？[youchihwang][time=Fri, Oct 7, 2016 5:11 AM] original vs 優化1 1：structure減小，cache可以放進更多需要儲存的資料，pde因此減少cache-misses,但為什麼可以減少cache-references. 2：cache-misses減小了，run time也就減少。優化1 vs 優化2 加了bucket link list, 將name的第一個字母做為index, 分別儲存在不同的link list。 1：append的作法，是以要加入的name的第一個字母做為index，再到對應的link list,依序搜尋直到最後一個node後，再加入name,因此runtime會較長 2：findname的作法，以要搜尋的name的第一個字母為index，再到相對應的link list裡搜尋，所以搜尋的時間會比從頭到尾的搜尋來得快。優化2 vs 優化3 優化2的append時間太長的原因是以name的第一個字母做為index，再到相對應的link list去搜尋，直到最後一個node再搜入，為了加速執行時間，所以在bucket link list的index structure加入了指向最後一個node的member, 因此在插入name時，即可直接跳到最後一個node並插入其後。 ## 繼續編輯