# 2017q1 Homework1 (phonebook)
contributed by < `0xff07` >
## 題目
* 原始碼在[這裡](https://github.com/0xff07/phonebook)
* 作業要求在[這裡](https://hackmd.io/s/rJYD4UPKe#)
## 開發環境
Ubuntu 16.04.2
Linux version 4.8.0-36-generic
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
`$lscpu`
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 69
Model name: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
Stepping: 1
CPU MHz: 887.109
CPU max MHz: 2900.0000
CPU min MHz: 800.0000
BogoMIPS: 4800.34
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
>中英文字間請已空白隔開[color=red][name=課程助教]
<s>
## 進度
- [ ] 2/20 : 學習 perf, gnuplot, 並實際編譯給定的程式, 看看能不能重現結果。
</s>
>> 不用特別標日期,HackMD 會精準追蹤每次修改的內容和時間 [name="jserv"]
## 待查資料
- [ ] Kernel pointer ?
## 目標
計劃跟做計算流體力學的教授做專題, 這些優化的技能學起來, 看看可不可以用上!
# 工具
目前看來這份作業要使用的工具有:
- [ ] 熟悉 Perf(gprof 跟 valgrind 不知道會不會有幫助?)
- [ ] 熟悉 astyle 的基本使用
- [ ] 熟悉 gnuplot
所以開始啃東西吧!
## Perf
我把perf想像成超級高級的「系統監視器」。
>> 用途不同,請重新閱讀 [perf](https://perf.wiki.kernel.org/),不要混用 tracer 和 monitor [name="jserv"]
### 檢查開啟
首先是檢查perf有無啟動。利用作業說明中的[perf 原理和實務](https://hackmd.io/s/B11109rdg#)這篇文章,先看看自己有沒有啟動perf:
``` shell
$ cat "/boot/config-`uname -r`" | grep "PERF_EVENT"
```
結果發現以下錯誤:
cat: '/boot/config-`uname -r`': No such file or directory
本來不知道為什麼, 不過5分鐘之後發現自己前一陣子把預設的shell換成fish, 而fish並不是POSIX標準的shell.
把預設的shell 改回 bash:
```shell
$ chsh -s /bin/bash
```
然後執行剛剛的命令, 終端機輸出:
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_EVENTS=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
看來perf是有啟動的。接下來就開始安裝東西了!
### 安裝
利用文章當中提供的第二個方法安裝:
```shell
$ sudo apt-get install linux-tools-common
```
然後發現以下的警告訊息:
WARNING: perf not found for kernel 4.4.0-31
You may need to install the following packages for this specific kernel:
linux-tools-4.4.0-31-generic
linux-cloud-tools-4.4.0-31-generic
You may also want to install one of the following packages to keep up to date:
linux-tools-generic
linux-cloud-tools-generic
如文章所說少了一些東西。把他們安裝好
```shell
$ sudo apt-get install linux-tools-generic
$ sudo apt-get install linux-cloud-tools-generic
$ sudo apt-get install linux-tools-4.4.0-31-generic
$ sudo apt-get install linux-cloud-tools-4.4.0-31-generic
```
接著執行
```shell
$ perf list
```
得到以下的結果:
```
List of pre-defined events (to be used in -e):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
alignment-faults [Software event]
bpf-output [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
page-faults OR faults [Software event]
task-clock [Software event]
```
<!-- -->
>> 文字訊息請避免用圖片來表示,否則不好搜尋和分類 [name="jserv"]
>> Ok, 已把文字補上! [name=Fernando Lin]
<!--(做這部份時有研究了一下ubuntu該如何截圖。[這裡](http://blog.xuite.net/yh96301/blog/242377657-Ubuntu+14.04%E6%93%B7%E5%8FDF%96%E8%9E%A2%E5%B9%95)有提供一個好用的螢幕截圖方法)-->
接著文章中有提到如果沒有root權限, 可能沒辦法取得所有的event data. 但是我決定暫時使用sudo.
如果直接用vim 編輯
```
$ vim /proc/sys/kernel/perf_event_paranoid
```
存檔的時候vim抱怨錯誤:
"/proc/sys/kernel/perf_event_paranoid" E667: Fsync failed
還沒有研究是什麼原因。 不過如果仿照下面的指令, 比如說想要把權限改成-1:
```SHELL
$ sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoi
```
這樣他就會乖乖改成-1了。
接著文章提到需要解除「kernel pointer」的使用限制,才可以觀察cach miss:
``` shell
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
```
不過我想先看看「如果不解除」的話,會在哪裡出事, 因此暫時不執行他。
### 測試perf
接下來編譯文章中提到的程式碼,並且依照測試執行perf. 執行結果如下:
```
99.24% example [.] compute_pi_baseline
0.40% [kernel] [k] native_queued_spin_lock_slowpath
0.10% [kernel] [k] osl_readl
0.05% [kernel] [k] wlc_channel_set_chanspec
0.05% [kernel] [k] unmap_page_range
0.05% [kernel] [k] delay_tsc
0.05% [kernel] [k] clm_limits
0.05% [kernel] [k] osl_readw
0.00% [kernel] [k] native_irq_return_iret
0.00% [kernel] [k] native_write_msr_safe
```
> 現在還看不懂下面標籤是什麼意思, 不過大概猜[k]是跟kernel有關的數據。 繼續往下看文章。
### 背景知識
這裡可以參照文章中的文字。有些在計算機組織都學過了(e.g. cach, branch類指令可能對效能帶來的不利等), 而沒有接觸過的主要有:
- [ ] PMU : 硬體層面的效能監控
- [ ] dual-issue
- [ ] tracepoint
再查tracepoint時發現了這個影片:
{%youtube 37v14rMtALs%}
>> 請附上你觀看演講後的認知,這樣旁人可以跟你做深入的討論 [name="jserv"]
### 用法
1. `perf help` <想問的東西>
2. `perf list` : 列出可支援的event。白話文就是列出可以看的東西
3. `perf top` : 跟「系統監視器」有點像, 不過perf更豪華, 除了哪個程式佔用最多效能, 連「那個程式中的哪個函數」都可以知道。
如果有指定的event想觀測, 可以用類似下面的用法 :
perf top -p <process ID> -e <想觀察的event> -c <觀察周期cycle數>
常用的event包含:
* cache-misses
* cache-references
* instructions
* cycles
> `--repeat <重複次數>` : 可以取重複測試的平均
4. `perf stat <某個程式>`:可以塞編好的binary。選項同上。
5. `perf record <某個程式>` : 一樣是可以塞編好的binary , 但是可以更細部追蹤到各種函式的效能
## GNU Plot
GNU Plot相對之下比較熟悉,之前有些作業有用到它。
[這裡](https://www.electricmonk.nl/log/2014/07/12/generating-good-looking-charts-with-gnuplot/)有篇關於把圖畫得漂亮的小文章。
## astyle
# 執行
```bash
$ cd phonebook
$ make
$ make run
```
輸出結果如下
size of entry : 136 bytes
execution time of append() : 0.044346 sec
execution time of findName() : 0.005026 sec
3
另外,執行
```BASH
$ make plot
```
[](https://i.imgur.com/wfOzvEe.png)
以及會看到他再抱怨沒有優化版的實作。該開始動腦了
# phonebook 優化
- [x] 計劃0. 改變資料結構: 由一個大的linked list, 改成依照字母分成26個linked list
- [x] 計劃1. 平行化
- [ ] 計劃2. 把「歷史紀錄」或是「常用的」記下來。
>> 今天寫完下面兩種優化之後, 突然發現作業要求裡面有 :
>> 「main.c 應該只有一份,不要建立新的 main(),善用 Makefile 定義對應的 CFLAGS」
>> 不小心把這點忽略了, 寫了兩個 main.c
>> 會再對程式做出修改。
## 改成26條 Linked-List
原先得資料結構是將全部的名單放在同一條linked list, 並照字母排列. 如果要查詢的名字在很後面, 會遍歷很多不必要的姓名。 所以新增一個 main_opt.c , 其內容為將 main.c 中的
```C
entry *pHead, *e;
```
調整成:
```C
entry *pHead[26], *e[26];
```
並將建立部份改成 :
```C
int first = (int)(line[0] - 'a');
e[first] = append(line, e[first]);
```
將查詢部份改成:
```C
int first = (int)(input[0] - 'a');
findName(input, e[first]);
```
並對 Makefile 做出相應的調整。
執行結果為 :
```
Performance counter stats for './phonebook_opt' (100 runs):
159,196 cache-misses # 65.293 % of all cache refs ( +- 0.82% )
243,817 cache-references ( +- 0.61% )
205,701,663 instructions # 1.53 insn per cycle ( +- 0.02% )
134,051,189 cycles ( +- 0.23% )
0.054094570 seconds time elapsed ( +- 0.37% )
```
而未修改的版本為:
```
Performance counter stats for './phonebook_orig' (100 runs):
1,024,194 cache-misses # 88.881 % of all cache refs ( +- 0.24% )
1,152,320 cache-references ( +- 0.23% )
262,075,842 instructions # 1.46 insn per cycle ( +- 0.02% )
179,420,610 cycles ( +- 0.28% )
0.064920713 seconds time elapsed ( +- 0.44% )
```
可以發現 cache misses 從 88.881 % 降至 65.293 % 。 但意外的是圖畫出來花費時間居然一樣。 應該是 runtime.gp 或 Makefile 部份沒調整好。 正在調整中。
:::danger
這個推測合理嗎?請重新探討 --jserv
:::
## 平行化
主要的構想是這樣 :
1. 每固定數目(比如說 5000 筆資料), 紀錄一個節點的位址
2. 搜尋時, 依照執行緒的數目, 對這些中繼點做 cyclic partition 方式的平行化來搜尋資料。
3. 用openMP做嘗試
>> 這裡寫了很醜的 code , 而且感覺(應該說是一定)有更好更漂亮的寫法, 不過我還是想試試看完成平行化的查詢。 為了從這些很醜的 code 中理出頭緒, 所以先邊寫邊紀錄。
首先先設定每 5000 筆資料設立一個中繼點。 在 `phonebook_opt.h` 中:
```C
#define PORTION_SIZE 5000
```
並且自行定義一個結構來儲存這些中繼點 :
```C
typedef struct __segments {
int cap;
int size;
entry** seg;
}segments;
```
>> C11 支援 [Unnamed Structure and Union Fields](https://gcc.gnu.org/onlinedocs/gcc/Unnamed-Fields.html),所以你可以改寫為:
```
typedef struct {
int cap;
int size;
entry **seg;
} segments;
```
>> 看起來更清爽。注意 coding style: `} segments` 中間有空白,`entry **seg` 的指標跟著變數 `seg` [name="jserv"]
照理說不能期待能預先知道資料數目, 所以打算用動態的方式紀錄這些中繼點, 用以下的方式紀錄中繼點資料 :
```C
void push_back(segments *s, entry* e)
{
if((s->size) + 1 >= s->cap){
(s->cap) *= 2;
s->seg = realloc(s->seg, sizeof(entry*)*s->cap);
s->seg[(s->size)++] = e;
}else{
s->seg[(s->size)++] = e;
}
}
```
接著修改 `findName()` 。
先寫出每個執行緒需要執行的任務 `__findName()` 。 這個跟原始版的 `findName()` 幾乎一樣, 只是把 while 迴圈中的條件由 :
```C
while (pHead != NULL){
...
pHead = pHead -> pNext;
}
```
改成 :
```C
int count = 0;
while (pHead != NULL && count < PORTION_SIZE){
...
pHead = pHead -> pNext;
count++;
}
```
最後是`findName`。 因為多了「紀錄中繼站」的陣列, 以及「執行緒的數目」, 所以參數變成 :
```C
entry *findName(char lastName[], entry *pHead, segments *s, int nthread);
```
而整個 `findName` 的實作如下 :
``` C
entry *findName(char lastName[], entry *pHead, segments *s, int nthread)
{
int num = s->size;
entry **res = calloc(sizeof(entry*), num);
omp_set_num_threads(nthread);
#pragma omp parallel
{
int id = omp_get_thread_num();
for(int i = id; i < num; i+=nthread){
res[i] = __findName(lastName, s->seg[i]);
}
}
for(int i = 0; i < num; i++){
if(res[i]){
return res[i];
}
}
return NULL;
}
```
執行結果如下 :
```
Performance counter stats for './phonebook_opt' (100 runs):
533,129 cache-misses # 53.601 % of all cache refs ( +- 0.57% )
994,635 cache-references ( +- 0.17% )
279,567,254 instructions # 1.19 insn per cycle ( +- 0.19% )
235,699,030 cycles ( +- 0.50% )
0.066403495 seconds time elapsed ( +- 0.42% )
```
而這次測驗中, 原始版本如下 :
```
Performance counter stats for './phonebook_orig' (100 runs):
1,032,975 cache-misses # 88.910 % of all cache refs ( +- 0.13% )
1,161,817 cache-references ( +- 0.12% )
262,089,603 instructions # 1.45 insn per cycle ( +- 0.02% )
180,133,449 cycles ( +- 0.17% )
0.064475132 seconds time elapsed ( +- 0.25% )
```
減少了35%左右的cache miss
>> 因為 gnuplot 的 script 還沒修好, 所以這裡暫時選取幾組時間資料來展示時間上的提升。 會儘快把圖補上!
:::danger
請解釋 `#pragma omp parallel` 搭配迴圈切割的效益 --jserv
:::
優化前的時間 :
```
size of entry : 136 bytes
execution time of append() : 0.044632 sec
execution time of findName() : 0.006063 sec
size of entry : 136 bytes
execution time of append() : 0.046319 sec
execution time of findName() : 0.004952 sec
size of entry : 136 bytes
execution time of append() : 0.044687 sec
execution time of findName() : 0.005105 sec
size of entry : 136 bytes
execution time of append() : 0.048199 sec
execution time of findName() : 0.004924 sec
size of entry : 136 bytes
execution time of append() : 0.043892 sec
execution time of findName() : 0.004837 sec
size of entry : 136 bytes
execution time of append() : 0.046132 sec
execution time of findName() : 0.004944 sec
size of entry : 136 bytes
execution time of append() : 0.044600 sec
execution time of findName() : 0.004987 sec
size of entry : 136 bytes
execution time of append() : 0.045488 sec
execution time of findName() : 0.004838 sec
size of entry : 136 bytes
execution time of append() : 0.044481 sec
execution time of findName() : 0.005026 sec
size of entry : 136 bytes
execution time of append() : 0.047144 sec
execution time of findName() : 0.005013 sec
```
優化後
```
execution time of append() : 0.045227 sec
execution time of findName() : 0.002550 sec
size of entry : 136 bytes
execution time of append() : 0.044574 sec
execution time of findName() : 0.002549 sec
size of entry : 136 bytes
execution time of append() : 0.046196 sec
execution time of findName() : 0.002515 sec
size of entry : 136 bytes
execution time of append() : 0.045013 sec
execution time of findName() : 0.003184 sec
size of entry : 136 bytes
execution time of append() : 0.045519 sec
execution time of findName() : 0.002459 sec
size of entry : 136 bytes
execution time of append() : 0.044297 sec
execution time of findName() : 0.002481 sec
size of entry : 136 bytes
execution time of append() : 0.045520 sec
execution time of findName() : 0.002538 sec
size of entry : 136 bytes
execution time of append() : 0.044734 sec
execution time of findName() : 0.002483 sec
size of entry : 136 bytes
execution time of append() : 0.046052 sec
execution time of findName() : 0.002525 sec
size of entry : 136 bytes
execution time of append() : 0.044142 sec
execution time of findName() : 0.002481 sec
size of entry : 136 bytes
execution time of append() : 0.047937 sec
execution time of findName() : 0.002493 sec
```
<!--
今天(2/20)下午經過資工系館的時候突然看到一門課:

而且很佛心的有影片:
[Information Theory and Coding Technique](http://cmlab.csie.ntu.edu.tw/~itct/)
不知道會不會對這個議題有幫助,也許明天可以看看(雖然我都看不懂)。
感覺如果可以知道查詢可能的機率分佈,就有機會針對這個分布去做優化。
不知道是什麼分佈就猜常態分佈好了。
-->
## 問題討論
回答以下問題:本例選用的 dataset (也就是輸入電話簿的姓名) 是否存在問題?
1. 有代表性嗎?跟真實英文姓氏差異大嗎?
首先做一點簡單的統計。參考維基百科上[英文字母出現頻率](https://en.wikipedia.org/wiki/Letter_frequency)中,目前英文字母的出現頻率

接著統計 phonebook 中,words.txt 中各個字母出現的次數:

以及 phonebook 作業中,提供的 all-name.txt 中各個單字的出現次數

若比 all-name.txt 與 words.txt 當中的頻率,可以發現兩者在字母頻率的趨勢上頗接近,表示 words.txt 一定程度上還原了使用「字母」的習慣
但是字母組成比例相同,不表示「與英文接近」,也未必代表跟真實情況接近。
舉例來說,這裡的程式假定所有 last name 只出現一次,但實際上的電話簿會可能會有很多人同姓、同名等等。如果這樣的話,用 hash 實作就必須考慮更多事情。
2. 資料難道「數大就是美」嗎?
3. 如果我們要建立符合真實電話簿情境的輸入,該怎麼作?