Phonebook Optimization

輸入法

homework都是在ubuntu上作業，
對於從windows轉到linux的user，
其中一個面臨的問題就是輸入法，
示範安裝嘸蝦米輸入法

Development Environment

sudo lshw

–->>> this command lshw (list hardware) can get more detailed hardware information.
下面只是一部分資訊

OS : 14.04.1-Ubuntu
kernel version : 3.19.0-25-generic
model name : Intel® Core™2 Duo CPU P8400 @ 2.26GHz
L1d cache: 32K
L1i cache: 32K
L2 cache: 3072K
memory
description: System Memory
physical id: 20
slot: System board or motherboard
size: 8GiB
- bank:0
  description: SODIMM DDR3 Synchronous 1066 MHz (0.9 ns)
  product: HMT351S6EFR8C-PB
  vendor: Hynix Semiconductor (Hyundai Electronics)
  physical id: 0
  serial: 2736E85A
  slot: DIMM 1
  size: 4GiB
  width: 64 bits
  clock: 1066MHz (0.9ns)
- bank:1
  description: SODIMM DDR3 Synchronous 1066 MHz (0.9 ns)
  product: HMT351S6EFR8C-PB
  vendor: Hynix Semiconductor (Hyundai Electronics)
  physical id: 1
  serial: 0B7C485B
  slot: DIMM 2
  size: 4GiB
  width: 64 bits
  clock: 1066MHz (0.9ns)

Git 設定

參考同學 GitHub 設定指引

Astyle

astyle的參數，對source code編排的影響，請參考(http://astyle.sourceforge.net/astyle.html#_indent-switches)
作業要求的coding style
- –suffix=none : 取消自動備份
- –indent-switches
  Indent 'switch' blocks so that the 'case X:' statements are indented in the switch block. The entire case block is indented.

$ astyle --style=kr --indent=spaces=4 --indent-switches --suffix=none *.[ch]

PI近似法

以下不像蒙地卡羅法，那是什麼方法，為什麼可以求出PI?

#include <stdio.h>
#include <unistd.h>

double compute_pi_baseline(size_t N) {
    double pi = 0.0;
    double dt = 1.0 / N;
    for (size_t i = 0; i < N; i++) {
        double x = (double) i / N;
        pi += dt / (1.0 + x * x);
    }
    return pi * 4.0;
}
int main() {
    printf("pid: %d\n", getpid());
    sleep(10);
    compute_pi_baseline(50000000);
    return 0;
}

蒙地卡羅法裡有一個橢圓的相關方程式，(xx + yy < 1) 這是什麼？

Perf

參考Linux 效能分析工具: Perf

$ sudo su <--- 切到root
$ echo -1 >  /proc/sys/kernel/perf_event_paranoid --- 開啟可以取得所有event data的權限
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict" --- 如果要檢測 cache miss event ，需要先取消 kernel pointer 的禁用。

Ex:待測程式需印出PID，並延遲幾秒執行，讓另一個terminal可以執行perf top -p $pid（填上待測程式的PID），抓取相關的event data.

分析性能問題時需要瞭解的背景知識
- 硬體特性之 cache：存取速率非常快，充分利用 cache 是軟體效能改善過程中，非常重要的部分。
- Pipeline：一個時鐘週期內可以同時處理三條指令，分別被 pipeline 的不同部分處理。
- Superscalar：一個時鐘週期觸發 (issue) 多條指令的 pipeline機器架構，比如 Intel 的 Pentium 處理器，內部有兩個執行單元，在一個時鐘週期內允許執行兩條指令。==> 這樣稱為 dual-issue，可想像為一個 packet 裡同時有兩組 pipelined 的 instruction
- out-of-order execution：在處理器內部，不同指令所需要的執行時間和時鐘週期是不同的，如果嚴格按照程序的執行順序執行，那麼就無法充分利用處理器的 pipeline。因此指令有可能被亂序執行
- Pipeline，Superscalar，out-of-order execution執行的指令有一個基本要求，即相鄰的指令相互沒有依賴關係。假如某條指令需要依賴前面一條指令的執行結果數據，那麼 pipeline 便失去作用，因為第二條指令必須等待第一條指令完成。
- Branch Prediction：若分支的結果是跳到其它指令，則pipeline後面的執行單元必需flush，從而影響效能。
  - static prediction
    - Opcode based 查
    - Displacement based (forward not taken, backward taken) 查
    - Compiler directed (branch likely, branch not likely) 查如何設定
  - Dynamic prediction
    - Use part of the PC (low-order bits) to index buffer/table – Why?
      – Multiple branches may share the same bit 查
      Backward branches for loops will be mispredicted twice
      – once upon loop exit and then again on loop entry查
  - Branch penalty involves both misprediction rate and branch
    frequency, and is higher for integer benchmarks查
  - Prediction accuracy improves with buffer size, but does not improve beyond 4K entries 查
  - Correlating or Two-level Predictors查

原始版本

Data structure

  6 /* original version */
  7 typedef struct __PHONE_BOOK_ENTRY {
  8     char lastName[MAX_LAST_NAME_SIZE];
  9     char firstName[16];
 10     char email[16];
 11     char phone[10];
 12     char cell[10];
 13     char addr1[16];
 14     char addr2[16];
 15     char city[16];
 16     char state[2];
 17     char zip[5];
 18     struct __PHONE_BOOK_ENTRY *pNext;
 19 } entry;

指令

$ sudo su
$ echo -1 >  /proc/sys/kernel/perf_event_paranoid
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
$ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_orig

perf stat 結果

size of entry : 136 bytes
execution time of append() : 0.088558 sec
execution time of findName() : 0.017283 sec

 Performance counter stats for './phonebook_orig' (10 runs):

           827,166      cache-misses              #   29.994 % of all cache refs      ( +-  0.95% ) [44.38%]
         2,757,744      cache-references                                              ( +-  0.77% ) [31.05%]
         2,838,813      L1-dcache-load-misses                                         ( +-  1.31% ) [28.96%]
         5,272,629      L1-dcache-store-misses                                        ( +-  1.29% ) [28.35%]
                 0      L1-dcache-prefetch-misses                                   
            67,225      L1-icache-load-misses                                         ( +-  3.70% ) [29.64%]
       259,001,566      instructions              #    0.79  insns per cycle          ( +-  0.62% ) [44.15%]
       328,689,512      cycles                     ( +-  0.40% ) [42.98%]

       0.146228190 seconds time elapsed                                          ( +-  0.43% )

程式優化-1 (縮小 data structure)

Data structure:

 27 typedef struct __PHONE_BOOK_ENTRY {   
 28     char lastName[MAX_LAST_NAME_SIZE];
 29 #ifdef DATA_STRUCTURE_OPTIMIZATION_SMALLER_SIZE
 30     releated *related_information;
 31 #else
 32     char firstName[16];
 33     char email[16];
 34     char phone[10];
 35     char cell[10];
 36     char addr1[16];
 37     char addr2[16];
 38     char city[16];
 39     char state[2];
 40     char zip[5];
 41 #endif
 42     struct __PHONE_BOOK_ENTRY *pNext;
 43 } entry;

指令

$ sudo su
$ echo -1 >  /proc/sys/kernel/perf_event_paranoid
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
$ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_opt

perf stat output

size of entry : 32 bytes
execution time of append() : 0.064288 sec
execution time of findName() : 0.005590 sec

 Performance counter stats for './phonebook_opt' (10 runs):

            70,125      cache-misses              #    8.085 % of all cache refs      ( +-  0.76% ) [42.66%]
           867,400      cache-references                                              ( +-  7.07% ) [29.23%]
           634,781      L1-dcache-load-misses                                         ( +-  2.59% ) [31.64%]
         1,630,373      L1-dcache-store-misses                                        ( +-  2.77% ) [33.57%]
                 0      L1-dcache-prefetch-misses                                   
            40,509      L1-icache-load-misses                                         ( +-  6.46% ) [31.99%]
       256,956,863      instructions              #    1.32  insns per cycle          ( +-  0.86% ) [45.52%]
       193,948,741      cycles                     ( +-  0.54% ) [43.05%]

       0.084300406 seconds time elapsed                                          ( +-  0.54% )

程式優化-2 (Bucket Link List)

Data structure：
- Bucket Link List
  不確定是什麼名字暫時取為Bucket Link List

 46 typedef struct __LEVEL1_COMPARE {
 47     char first_char_of_name;
 48     entry *link;
 49 } level1_compare;

指令

$ sudo su
$ echo -1 >  /proc/sys/kernel/perf_event_paranoid
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
$ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_opt_2

perf stat output

size of entry : 32 bytes
execution time of append() : 18.591619 sec
execution time of findName() : 0.000032 sec

 Performance counter stats for './phonebook_opt_2' (10 runs):

           199,603      cache-misses              #    0.008 % of all cache refs      ( +-  6.19% ) [42.85%]
     2,489,462,022      cache-references                                              ( +-  0.02% ) [28.60%]
     5,701,701,395      L1-dcache-load-misses                                         ( +-  0.30% ) [28.59%]
        10,203,165      L1-dcache-store-misses                                        ( +-  3.69% ) [28.58%]
                 0      L1-dcache-prefetch-misses                                   
         1,635,607      L1-icache-load-misses                                         ( +-  1.90% ) [28.58%]
    23,329,940,160      instructions              #    0.53  insns per cycle          ( +-  0.02% ) [42.86%]
    44,357,537,174      cycles                     ( +-  0.12% ) [42.84%]

      18.643948613 seconds time elapsed                                          ( +-  0.13% )

程式優化-3 (Bucket Link List - modifed)

Data structure：

 46 typedef struct __LEVEL1_COMPARE {
 47     char first_char_of_name;
 48     entry *the_last_node;
 49     entry *link;
 50 } level1_compare;

指令

$ sudo su
$ echo -1 >  /proc/sys/kernel/perf_event_paranoid
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
$ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_opt_3

perf stat output：

size of entry : 32 bytes
execution time of append() : 0.070129 sec
execution time of findName() : 0.000032 sec

 Performance counter stats for './phonebook_opt_3' (10 runs):

            31,586      cache-misses              #    5.733 % of all cache refs      ( +-  3.19% ) [41.21%]
           550,933      cache-references                                              ( +-  1.64% ) [37.54%]
           425,916      L1-dcache-load-misses                                         ( +-  0.95% ) [34.90%]
         1,906,506      L1-dcache-store-misses                                        ( +-  0.92% ) [32.54%]
                 0      L1-dcache-prefetch-misses                                   
            63,040      L1-icache-load-misses                                         ( +-  8.00% ) [27.99%]
       206,689,110      instructions              #    1.22  insns per cycle          ( +-  0.38% ) [38.99%]
       168,967,583      cycles                     ( +-  0.69% ) [34.78%]

       0.073860286 seconds time elapsed                                          ( +-  0.73% )

結果比較

	original	優化1	優化2	優化3
execution time of append()	0.088558 sec	0.064288 sec	18.591619 sec	0.070129 sec
execution time of findName()	0.017283 sec	0.005590 sec	0.000032 sec	0.000032 sec
cache-misses	827,166	70,125	199,603	31,586
cache-references	2,757,744	867,400	2,489,462,022	550,933
instructions	259,001,566	256,956,863	23,329,940,160	206,689,110
cycles	328,689,512	193,948,741	44,357,537,174	168,967,583
append() cache-misses			39.36%
findName() cache-misses	48.96%	6.89%

如何取得較小的function misses ? [youchihwang]Fri, Oct 7, 2016 5:10 AM
gnuplot，如何設定長方圖的寬度呢？[youchihwang]Fri, Oct 7, 2016 5:11 AM

original vs 優化1
1：structure減小，cache可以放進更多需要儲存的資料，pde因此減少cache-misses,但為什麼可以減少cache-references.
2：cache-misses減小了，run time也就減少。

優化1 vs 優化2
加了bucket link list, 將name的第一個字母做為index, 分別儲存在不同的link list。
1：append的作法，是以要加入的name的第一個字母做為index，再到對應的link list,依序搜尋直到最後一個node後，再加入name,因此runtime會較長
2：findname的作法，以要搜尋的name的第一個字母為index，再到相對應的link list裡搜尋，所以搜尋的時間會比從頭到尾的搜尋來得快。

優化2 vs 優化3
優化2的append時間太長的原因是以name的第一個字母做為index，再到相對應的link list去搜尋，直到最後一個node再搜入，為了加速執行時間，所以在bucket link list的index structure加入了指向最後一個node的member, 因此在插入name時，即可直接跳到最後一個node並插入其後。