# Phonebook Optimization
## 輸入法
homework都是在ubuntu上作業,
對於從windows轉到linux的user,
其中一個面臨的問題就是輸入法,
示範[安裝嘸蝦米輸入法](https://hackmd.io/GzAsDMEZQYwTgLTgKYA4DMDQEMBGuE5ldUEATAJjN3TMgFYAGYZAdiA=)
## Development Environment
### sudo lshw
--->>> this command lshw (list hardware) can get more detailed hardware information.<br>下面只是一部分資訊
* OS : 14.04.1-Ubuntu
* kernel version : 3.19.0-25-generic
* model name : Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz
* L1d cache: 32K
* L1i cache: 32K
* L2 cache: 3072K
* memory
description: System Memory
physical id: 20
slot: System board or motherboard
size: 8GiB
* bank:0
description: SODIMM DDR3 Synchronous 1066 MHz (0.9 ns)
product: HMT351S6EFR8C-PB
vendor: Hynix Semiconductor (Hyundai Electronics)
physical id: 0
serial: 2736E85A
slot: DIMM 1
size: 4GiB
width: 64 bits
clock: 1066MHz (0.9ns)
* bank:1
description: SODIMM DDR3 Synchronous 1066 MHz (0.9 ns)
product: HMT351S6EFR8C-PB
vendor: Hynix Semiconductor (Hyundai Electronics)
physical id: 1
serial: 0B7C485B
slot: DIMM 2
size: 4GiB
width: 64 bits
clock: 1066MHz (0.9ns)
## Git 設定
參考同學 [GitHub 設定指引](http://wiki.csie.ncku.edu.tw/github)
## Astyle
* astyle的參數,對source code編排的影響,請參考(http://astyle.sourceforge.net/astyle.html#_indent-switches)<br>
* ## 作業要求的coding style
* –suffix=none : 取消自動備份
* --indent-switches
Indent 'switch' blocks so that the 'case X:' statements are indented in the switch block. The entire case block is indented.
```
$ astyle --style=kr --indent=spaces=4 --indent-switches --suffix=none *.[ch]
```
## PI近似法
以下不像蒙地卡羅法,那是什麼方法,為什麼可以求出PI?
```
#include <stdio.h>
#include <unistd.h>
double compute_pi_baseline(size_t N) {
double pi = 0.0;
double dt = 1.0 / N;
for (size_t i = 0; i < N; i++) {
double x = (double) i / N;
pi += dt / (1.0 + x * x);
}
return pi * 4.0;
}
int main() {
printf("pid: %d\n", getpid());
sleep(10);
compute_pi_baseline(50000000);
return 0;
}
```
[蒙地卡羅法](http://latinboy.pixnet.net/blog/post/23461935-%E7%AC%AC%E4%B8%80%E6%AC%A1%E4%BD%BF%E7%94%A8%E4%BA%82%E6%95%B8%E5%B0%B1%E4%B8%8A%E6%89%8B---%E8%92%99%E5%9C%B0%E5%8D%A1%E7%BE%85%E6%B3%95%E6%B1%82%E5%9C%93%E5%91%A8%E7%8E%87)裡有一個橢圓的相關方程式,(xx + yy < 1) 這是什麼?
## Perf
參考[Linux 效能分析工具: Perf](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial)
```
$ sudo su <--- 切到root
$ echo -1 > /proc/sys/kernel/perf_event_paranoid --- 開啟可以取得所有event data的權限
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict" --- 如果要檢測 cache miss event ,需要先取消 kernel pointer 的禁用。
```
Ex:待測程式需印出PID,並延遲幾秒執行,讓另一個terminal可以執行perf top -p $pid(填上待測程式的PID),抓取相關的event data.
* 分析性能問題時需要瞭解的[背景知識](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial)
* 硬體特性之 cache:存取速率非常快,充分利用 cache 是軟體效能改善過程中,非常重要的部分。
* Pipeline:一個時鐘週期內可以同時處理三條指令,分別被 pipeline 的不同部分處理。
* Superscalar:一個時鐘週期觸發 (issue) 多條指令的 pipeline機器架構,比如 Intel 的 Pentium 處理器,內部有兩個執行單元,在一個時鐘週期內允許執行兩條指令。==> 這樣稱為 dual-issue,可想像為一個 packet 裡同時有兩組 pipelined 的 instruction
* out-of-order execution:在處理器內部,不同指令所需要的執行時間和時鐘週期是不同的,如果嚴格按照程序的執行順序執行,那麼就無法充分利用處理器的 pipeline。因此指令有可能被亂序執行
* Pipeline,Superscalar,out-of-order execution執行的指令有一個基本要求,即相鄰的指令相互沒有依賴關係。假如某條指令需要依賴前面一條指令的執行結果數據,那麼 pipeline 便失去作用,因為第二條指令必須等待第一條指令完成。
* Branch Prediction:若分支的結果是跳到其它指令,則pipeline後面的執行單元必需flush,從而影響效能。
* static prediction
* Opcode based 查
* Displacement based (forward not taken, backward taken) 查
* Compiler directed (branch likely, branch not likely) 查如何設定
* Dynamic prediction
* Use part of the PC (low-order bits) to index buffer/table – Why?
– Multiple branches may share the same bit 查
Backward branches for loops will be mispredicted twice
-- once upon loop exit and then again on loop entry查
* Branch penalty involves both misprediction rate and branch
frequency, and is higher for integer benchmarks查
* Prediction accuracy improves with buffer size, but does not improve beyond 4K entries 查
* [Correlating or Two-level Predictors](http://www.cs.ucr.edu/~gupta/teaching/203A-09/My6.pdf)查
## 原始版本
* Data structure
```
6 /* original version */
7 typedef struct __PHONE_BOOK_ENTRY {
8 char lastName[MAX_LAST_NAME_SIZE];
9 char firstName[16];
10 char email[16];
11 char phone[10];
12 char cell[10];
13 char addr1[16];
14 char addr2[16];
15 char city[16];
16 char state[2];
17 char zip[5];
18 struct __PHONE_BOOK_ENTRY *pNext;
19 } entry;
```
* 指令
```
$ sudo su
$ echo -1 > /proc/sys/kernel/perf_event_paranoid
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
$ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_orig
```
* perf stat 結果
```
size of entry : 136 bytes
execution time of append() : 0.088558 sec
execution time of findName() : 0.017283 sec
Performance counter stats for './phonebook_orig' (10 runs):
827,166 cache-misses # 29.994 % of all cache refs ( +- 0.95% ) [44.38%]
2,757,744 cache-references ( +- 0.77% ) [31.05%]
2,838,813 L1-dcache-load-misses ( +- 1.31% ) [28.96%]
5,272,629 L1-dcache-store-misses ( +- 1.29% ) [28.35%]
0 L1-dcache-prefetch-misses
67,225 L1-icache-load-misses ( +- 3.70% ) [29.64%]
259,001,566 instructions # 0.79 insns per cycle ( +- 0.62% ) [44.15%]
328,689,512 cycles ( +- 0.40% ) [42.98%]
0.146228190 seconds time elapsed ( +- 0.43% )
```
## 程式優化-1 (縮小 data structure)
* Data structure:
```
27 typedef struct __PHONE_BOOK_ENTRY {
28 char lastName[MAX_LAST_NAME_SIZE];
29 #ifdef DATA_STRUCTURE_OPTIMIZATION_SMALLER_SIZE
30 releated *related_information;
31 #else
32 char firstName[16];
33 char email[16];
34 char phone[10];
35 char cell[10];
36 char addr1[16];
37 char addr2[16];
38 char city[16];
39 char state[2];
40 char zip[5];
41 #endif
42 struct __PHONE_BOOK_ENTRY *pNext;
43 } entry;
```
* 指令
```
$ sudo su
$ echo -1 > /proc/sys/kernel/perf_event_paranoid
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
$ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_opt
```
* perf stat output
```
size of entry : 32 bytes
execution time of append() : 0.064288 sec
execution time of findName() : 0.005590 sec
Performance counter stats for './phonebook_opt' (10 runs):
70,125 cache-misses # 8.085 % of all cache refs ( +- 0.76% ) [42.66%]
867,400 cache-references ( +- 7.07% ) [29.23%]
634,781 L1-dcache-load-misses ( +- 2.59% ) [31.64%]
1,630,373 L1-dcache-store-misses ( +- 2.77% ) [33.57%]
0 L1-dcache-prefetch-misses
40,509 L1-icache-load-misses ( +- 6.46% ) [31.99%]
256,956,863 instructions # 1.32 insns per cycle ( +- 0.86% ) [45.52%]
193,948,741 cycles ( +- 0.54% ) [43.05%]
0.084300406 seconds time elapsed ( +- 0.54% )
```
## 程式優化-2 (Bucket Link List)
* Data structure:
* Bucket Link List <br> 不確定是什麼名字暫時取為Bucket Link List
* 
```
46 typedef struct __LEVEL1_COMPARE {
47 char first_char_of_name;
48 entry *link;
49 } level1_compare;
```
* 指令
```
$ sudo su
$ echo -1 > /proc/sys/kernel/perf_event_paranoid
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
$ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_opt_2
```
* perf stat output
```
size of entry : 32 bytes
execution time of append() : 18.591619 sec
execution time of findName() : 0.000032 sec
Performance counter stats for './phonebook_opt_2' (10 runs):
199,603 cache-misses # 0.008 % of all cache refs ( +- 6.19% ) [42.85%]
2,489,462,022 cache-references ( +- 0.02% ) [28.60%]
5,701,701,395 L1-dcache-load-misses ( +- 0.30% ) [28.59%]
10,203,165 L1-dcache-store-misses ( +- 3.69% ) [28.58%]
0 L1-dcache-prefetch-misses
1,635,607 L1-icache-load-misses ( +- 1.90% ) [28.58%]
23,329,940,160 instructions # 0.53 insns per cycle ( +- 0.02% ) [42.86%]
44,357,537,174 cycles ( +- 0.12% ) [42.84%]
18.643948613 seconds time elapsed ( +- 0.13% )
```
## 程式優化-3 (Bucket Link List - modifed)
* Data structure:
* 
```
46 typedef struct __LEVEL1_COMPARE {
47 char first_char_of_name;
48 entry *the_last_node;
49 entry *link;
50 } level1_compare;
```
* 指令
```
$ sudo su
$ echo -1 > /proc/sys/kernel/perf_event_paranoid
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
$ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_opt_3
```
* perf stat output:
```
size of entry : 32 bytes
execution time of append() : 0.070129 sec
execution time of findName() : 0.000032 sec
Performance counter stats for './phonebook_opt_3' (10 runs):
31,586 cache-misses # 5.733 % of all cache refs ( +- 3.19% ) [41.21%]
550,933 cache-references ( +- 1.64% ) [37.54%]
425,916 L1-dcache-load-misses ( +- 0.95% ) [34.90%]
1,906,506 L1-dcache-store-misses ( +- 0.92% ) [32.54%]
0 L1-dcache-prefetch-misses
63,040 L1-icache-load-misses ( +- 8.00% ) [27.99%]
206,689,110 instructions # 1.22 insns per cycle ( +- 0.38% ) [38.99%]
168,967,583 cycles ( +- 0.69% ) [34.78%]
0.073860286 seconds time elapsed ( +- 0.73% )
```
* 結果比較
<table>
<tr>
<td></td>
<td>original</td>
<td>優化1</td>
<td>優化2</td>
<td>優化3</td>
</tr>
<td>execution time of append()</td>
<td>0.088558 sec</td>
<td>0.064288 sec</td>
<td>18.591619 sec</td>
<td>0.070129 sec</td>
</tr>
<tr>
<td>execution time of findName()</td>
<td>0.017283 sec</td>
<td>0.005590 sec</td>
<td>0.000032 sec</td>
<td>0.000032 sec</td>
</tr>
<tr>
<td>cache-misses</td>
<td>827,166</td>
<td>70,125</td>
<td>199,603</td>
<td>31,586</td>
</tr>
<tr>
<td>cache-references</td>
<td>2,757,744</td>
<td>867,400</td>
<td>2,489,462,022</td>
<td>550,933</td>
</tr>
<tr>
<td>instructions</td>
<td>259,001,566</td>
<td>256,956,863</td>
<td>23,329,940,160</td>
<td>206,689,110</td>
</tr>
<tr>
<td>cycles</td>
<td>328,689,512</td>
<td>193,948,741</td>
<td>44,357,537,174</td>
<td>168,967,583</td>
</tr>
<tr>
<td>append() cache-misses</td>
<td></td>
<td></td>
<td>39.36%</td>
<td></td>
</tr>
<tr>
<td>findName() cache-misses</td>
<td>48.96%</td>
<td>6.89%</td>
<td></td>
<td></td>
</tr>
</table>
>如何取得較小的function misses ? [youchihwang][time=Fri, Oct 7, 2016 5:10 AM]
>gnuplot,如何設定長方圖的寬度呢?[youchihwang][time=Fri, Oct 7, 2016 5:11 AM]
original vs 優化1
1:structure減小,cache可以放進更多需要儲存的資料,pde因此減少cache-misses,但為什麼可以減少cache-references.
2:cache-misses減小了,run time也就減少。
優化1 vs 優化2
加了bucket link list, 將name的第一個字母做為index, 分別儲存在不同的link list。
1:append的作法,是以要加入的name的第一個字母做為index,再到對應的link list,依序搜尋直到最後一個node後,再加入name,因此runtime會較長
2:findname的作法,以要搜尋的name的第一個字母為index,再到相對應的link list裡搜尋,所以搜尋的時間會比從頭到尾的搜尋來得快。
優化2 vs 優化3
優化2的append時間太長的原因是以name的第一個字母做為index,再到相對應的link list去搜尋,直到最後一個node再搜入,為了加速執行時間,所以在bucket link list的index structure加入了指向最後一個node的member, 因此在插入name時,即可直接跳到最後一個node並插入其後。
## 繼續編輯