2017q1 Homework1 (phonebook)

contributed by < 0xff07 >

題目

原始碼在這裡
作業要求在這裡

開發環境

Ubuntu 16.04.2
Linux version 4.8.0-36-generic

gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K

$lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 69
Model name: Intel® Core™ i5-4258U CPU @ 2.40GHz
Stepping: 1
CPU MHz: 887.109
CPU max MHz: 2900.0000
CPU min MHz: 800.0000
BogoMIPS: 4800.34
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3

中英文字間請已空白隔開課程助教

~~## 進度 - [ ] 2/20 : 學習 perf, gnuplot, 並實際編譯給定的程式, 看看能不能重現結果。~~

待查資料

Kernel pointer ?

目標

計劃跟做計算流體力學的教授做專題，這些優化的技能學起來，看看可不可以用上！

工具

目前看來這份作業要使用的工具有：

熟悉 Perf（gprof 跟 valgrind 不知道會不會有幫助？）
熟悉 astyle 的基本使用
熟悉 gnuplot

所以開始啃東西吧！

Perf

我把perf想像成超級高級的「系統監視器」。

用途不同，請重新閱讀 perf，不要混用 tracer 和 monitor "jserv"

檢查開啟

首先是檢查perf有無啟動。利用作業說明中的perf 原理和實務這篇文章，先看看自己有沒有啟動perf:

＄ cat "/boot/config-`uname -r`" | grep "PERF_EVENT"

結果發現以下錯誤：

cat: '/boot/config-`uname -r`': No such file or directory

本來不知道為什麼，不過5分鐘之後發現自己前一陣子把預設的shell換成fish, 而fish並不是POSIX標準的shell.

把預設的shell 改回 bash:

$ chsh -s /bin/bash

然後執行剛剛的命令, 終端機輸出：

CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_EVENTS=y
CONFIG_HAVE_PERF_EVENTS_NMI=y

看來perf是有啟動的。接下來就開始安裝東西了！

安裝

利用文章當中提供的第二個方法安裝：

$ sudo apt-get install linux-tools-common

然後發現以下的警告訊息：

WARNING: perf not found for kernel 4.4.0-31

  You may need to install the following packages for this specific kernel:
    linux-tools-4.4.0-31-generic
    linux-cloud-tools-4.4.0-31-generic

  You may also want to install one of the following packages to keep up to date:
    linux-tools-generic
    linux-cloud-tools-generic

如文章所說少了一些東西。把他們安裝好

$ sudo apt-get install linux-tools-generic
$ sudo apt-get install linux-cloud-tools-generic
$ sudo apt-get install linux-tools-4.4.0-31-generic
$ sudo apt-get install linux-cloud-tools-4.4.0-31-generic

接著執行

$ perf list

得到以下的結果：


List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

文字訊息請避免用圖片來表示，否則不好搜尋和分類 "jserv"

Ok, 已把文字補上! Fernando Lin

接著文章中有提到如果沒有root權限, 可能沒辦法取得所有的event data. 但是我決定暫時使用sudo.

如果直接用vim 編輯

$ vim /proc/sys/kernel/perf_event_paranoid

存檔的時候vim抱怨錯誤：

"/proc/sys/kernel/perf_event_paranoid" E667: Fsync failed

還沒有研究是什麼原因。不過如果仿照下面的指令, 比如說想要把權限改成-1：

$ sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoi

這樣他就會乖乖改成-1了。

接著文章提到需要解除「kernel pointer」的使用限制，才可以觀察cach miss：

$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"

不過我想先看看「如果不解除」的話，會在哪裡出事, 因此暫時不執行他。

測試perf

接下來編譯文章中提到的程式碼，並且依照測試執行perf. 執行結果如下：

  99.24%  example   [.] compute_pi_baseline
   0.40%  [kernel]  [k] native_queued_spin_lock_slowpath
   0.10%  [kernel]  [k] osl_readl
   0.05%  [kernel]  [k] wlc_channel_set_chanspec
   0.05%  [kernel]  [k] unmap_page_range
   0.05%  [kernel]  [k] delay_tsc
   0.05%  [kernel]  [k] clm_limits
   0.05%  [kernel]  [k] osl_readw
   0.00%  [kernel]  [k] native_irq_return_iret
   0.00%  [kernel]  [k] native_write_msr_safe

現在還看不懂下面標籤是什麼意思，不過大概猜[k]是跟kernel有關的數據。繼續往下看文章。

背景知識

這裡可以參照文章中的文字。有些在計算機組織都學過了（e.g. cach, branch類指令可能對效能帶來的不利等）, 而沒有接觸過的主要有：

PMU : 硬體層面的效能監控
dual-issue
tracepoint

再查tracepoint時發現了這個影片：

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

請附上你觀看演講後的認知，這樣旁人可以跟你做深入的討論 "jserv"

用法

perf help <想問的東西>
perf list : 列出可支援的event。白話文就是列出可以看的東西
perf top : 跟「系統監視器」有點像，不過perf更豪華, 除了哪個程式佔用最多效能, 連「那個程式中的哪個函數」都可以知道。
如果有指定的event想觀測, 可以用類似下面的用法 :
```
 perf top -p <process ID> -e <想觀察的event> -c <觀察周期cycle數>
```
常用的event包含：
- cache-misses
- cache-references
- instructions
- cycles
--repeat <重複次數> : 可以取重複測試的平均
perf stat <某個程式>:可以塞編好的binary。選項同上。
perf record <某個程式> : 一樣是可以塞編好的binary , 但是可以更細部追蹤到各種函式的效能

GNU Plot

GNU Plot相對之下比較熟悉，之前有些作業有用到它。

這裡有篇關於把圖畫得漂亮的小文章。

astyle

執行

$ cd phonebook
$ make
$ make run

輸出結果如下

size of entry : 136 bytes
execution time of append() : 0.044346 sec
execution time of findName() : 0.005026 sec
3

另外，執行

$ make plot

以及會看到他再抱怨沒有優化版的實作。該開始動腦了

phonebook 優化

計劃0. 改變資料結構：由一個大的linked list, 改成依照字母分成26個linked list
計劃1. 平行化
計劃2. 把「歷史紀錄」或是「常用的」記下來。

今天寫完下面兩種優化之後，突然發現作業要求裡面有：
「main.c 應該只有一份，不要建立新的 main()，善用 Makefile 定義對應的 CFLAGS」
不小心把這點忽略了, 寫了兩個 main.c
會再對程式做出修改。

改成26條 Linked-List

原先得資料結構是將全部的名單放在同一條linked list, 並照字母排列. 如果要查詢的名字在很後面，會遍歷很多不必要的姓名。所以新增一個 main_opt.c , 其內容為將 main.c 中的

entry *pHead, *e;

調整成：

entry *pHead[26], *e[26];

並將建立部份改成：

    int first = (int)(line[0] - 'a');
    e[first] = append(line, e[first]);

將查詢部份改成：

int first = (int)(input[0] - 'a');
findName(input, e[first]);

並對 Makefile 做出相應的調整。

執行結果為：

 Performance counter stats for './phonebook_opt' (100 runs):

           159,196      cache-misses              #   65.293 % of all cache refs      ( +-  0.82% )
           243,817      cache-references                                              ( +-  0.61% )
       205,701,663      instructions              #    1.53  insn per cycle           ( +-  0.02% )
       134,051,189      cycles                                                        ( +-  0.23% )

       0.054094570 seconds time elapsed                                          ( +-  0.37% )

而未修改的版本為：

 Performance counter stats for './phonebook_orig' (100 runs):

         1,024,194      cache-misses              #   88.881 % of all cache refs      ( +-  0.24% )
         1,152,320      cache-references                                              ( +-  0.23% )
       262,075,842      instructions              #    1.46  insn per cycle           ( +-  0.02% )
       179,420,610      cycles                                                        ( +-  0.28% )

       0.064920713 seconds time elapsed                                          ( +-  0.44% )

可以發現 cache misses 從 88.881 % 降至 65.293 % 。但意外的是圖畫出來花費時間居然一樣。應該是 runtime.gp 或 Makefile 部份沒調整好。正在調整中。

這個推測合理嗎？請重新探討 –jserv

平行化

主要的構想是這樣：

每固定數目（比如說 5000 筆資料），紀錄一個節點的位址
搜尋時，依照執行緒的數目，對這些中繼點做 cyclic partition 方式的平行化來搜尋資料。
用openMP做嘗試

這裡寫了很醜的 code , 而且感覺（應該說是一定）有更好更漂亮的寫法, 不過我還是想試試看完成平行化的查詢。為了從這些很醜的 code 中理出頭緒, 所以先邊寫邊紀錄。

首先先設定每 5000 筆資料設立一個中繼點。在 phonebook_opt.h 中：

#define PORTION_SIZE 5000

並且自行定義一個結構來儲存這些中繼點：

typedef struct __segments {
    int cap;
    int size;
    entry** seg;
}segments;

C11 支援 Unnamed Structure and Union Fields，所以你可以改寫為:

typedef struct {
    int cap;
    int size;
    entry **seg;
} segments;

看起來更清爽。注意 coding style: } segments 中間有空白，entry **seg 的指標跟著變數 seg "jserv"

照理說不能期待能預先知道資料數目, 所以打算用動態的方式紀錄這些中繼點, 用以下的方式紀錄中繼點資料：

void push_back(segments *s, entry* e)
{
    if((s->size) + 1 >= s->cap){
        (s->cap) *= 2;
        s->seg = realloc(s->seg, sizeof(entry*)*s->cap);
        s->seg[(s->size)++] = e;
    }else{
        s->seg[(s->size)++] = e;
    }
}

接著修改 findName（） 。

先寫出每個執行緒需要執行的任務 __findName() 。這個跟原始版的 findName（） 幾乎一樣，只是把 while 迴圈中的條件由：

 while (pHead != NULL){
     ...
     pHead = pHead -> pNext;
 }

改成：

int count = 0;
while (pHead != NULL && count < PORTION_SIZE){
    ...
    pHead = pHead -> pNext;
    count++;
}

最後是findName。因為多了「紀錄中繼站」的陣列, 以及「執行緒的數目」, 所以參數變成：

entry *findName(char lastName[], entry *pHead, segments *s, int nthread);

而整個 findName 的實作如下：

entry *findName(char lastName[], entry *pHead, segments *s, int nthread)
{
    int num = s->size;
    entry **res = calloc(sizeof(entry*), num);

    omp_set_num_threads(nthread);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        for(int i = id; i < num; i+=nthread){
            res[i] = __findName(lastName, s->seg[i]);
        }
    }


    for(int i = 0; i < num; i++){
        if(res[i]){
            return res[i];
        }
    }
    return NULL;
}

執行結果如下：

 Performance counter stats for './phonebook_opt' (100 runs):

           533,129      cache-misses              #   53.601 % of all cache refs      ( +-  0.57% )
           994,635      cache-references                                              ( +-  0.17% )
       279,567,254      instructions              #    1.19  insn per cycle           ( +-  0.19% )
       235,699,030      cycles                                                        ( +-  0.50% )

       0.066403495 seconds time elapsed                                          ( +-  0.42% )

而這次測驗中, 原始版本如下：

 Performance counter stats for './phonebook_orig' (100 runs):

         1,032,975      cache-misses              #   88.910 % of all cache refs      ( +-  0.13% )
         1,161,817      cache-references                                              ( +-  0.12% )
       262,089,603      instructions              #    1.45  insn per cycle           ( +-  0.02% )
       180,133,449      cycles                                                        ( +-  0.17% )

       0.064475132 seconds time elapsed                                          ( +-  0.25% )

減少了35%左右的cache miss

因為 gnuplot 的 script 還沒修好，所以這裡暫時選取幾組時間資料來展示時間上的提升。會儘快把圖補上！

請解釋 #pragma omp parallel 搭配迴圈切割的效益 –jserv

優化前的時間：

size of entry : 136 bytes
execution time of append() : 0.044632 sec
execution time of findName() : 0.006063 sec
size of entry : 136 bytes
execution time of append() : 0.046319 sec
execution time of findName() : 0.004952 sec
size of entry : 136 bytes
execution time of append() : 0.044687 sec
execution time of findName() : 0.005105 sec
size of entry : 136 bytes
execution time of append() : 0.048199 sec
execution time of findName() : 0.004924 sec
size of entry : 136 bytes
execution time of append() : 0.043892 sec
execution time of findName() : 0.004837 sec
size of entry : 136 bytes
execution time of append() : 0.046132 sec
execution time of findName() : 0.004944 sec
size of entry : 136 bytes
execution time of append() : 0.044600 sec
execution time of findName() : 0.004987 sec
size of entry : 136 bytes
execution time of append() : 0.045488 sec
execution time of findName() : 0.004838 sec
size of entry : 136 bytes
execution time of append() : 0.044481 sec
execution time of findName() : 0.005026 sec
size of entry : 136 bytes
execution time of append() : 0.047144 sec
execution time of findName() : 0.005013 sec

優化後

execution time of append() : 0.045227 sec
execution time of findName() : 0.002550 sec
size of entry : 136 bytes
execution time of append() : 0.044574 sec
execution time of findName() : 0.002549 sec
size of entry : 136 bytes
execution time of append() : 0.046196 sec
execution time of findName() : 0.002515 sec
size of entry : 136 bytes
execution time of append() : 0.045013 sec
execution time of findName() : 0.003184 sec
size of entry : 136 bytes
execution time of append() : 0.045519 sec
execution time of findName() : 0.002459 sec
size of entry : 136 bytes
execution time of append() : 0.044297 sec
execution time of findName() : 0.002481 sec
size of entry : 136 bytes
execution time of append() : 0.045520 sec
execution time of findName() : 0.002538 sec
size of entry : 136 bytes
execution time of append() : 0.044734 sec
execution time of findName() : 0.002483 sec
size of entry : 136 bytes
execution time of append() : 0.046052 sec
execution time of findName() : 0.002525 sec
size of entry : 136 bytes
execution time of append() : 0.044142 sec
execution time of findName() : 0.002481 sec
size of entry : 136 bytes
execution time of append() : 0.047937 sec
execution time of findName() : 0.002493 sec

問題討論

回答以下問題：本例選用的 dataset (也就是輸入電話簿的姓名) 是否存在問題？

有代表性嗎？跟真實英文姓氏差異大嗎？

首先做一點簡單的統計。參考維基百科上英文字母出現頻率中，目前英文字母的出現頻率

接著統計 phonebook 中，words.txt 中各個字母出現的次數：

以及 phonebook 作業中，提供的 all-name.txt 中各個單字的出現次數

若比 all-name.txt 與 words.txt 當中的頻率，可以發現兩者在字母頻率的趨勢上頗接近，表示 words.txt 一定程度上還原了使用「字母」的習慣

但是字母組成比例相同，不表示「與英文接近」，也未必代表跟真實情況接近。

舉例來說，這裡的程式假定所有 last name 只出現一次，但實際上的電話簿會可能會有很多人同姓、同名等等。如果這樣的話，用 hash 實作就必須考慮更多事情。
資料難道「數大就是美」嗎？
如果我們要建立符合真實電話簿情境的輸入，該怎麼作？