Try   HackMD

Phonebook Optimization

輸入法

homework都是在ubuntu上作業,
對於從windows轉到linux的user,
其中一個面臨的問題就是輸入法,
示範安裝嘸蝦米輸入法

Development Environment

sudo lshw

->>> this command lshw (list hardware) can get more detailed hardware information.
下面只是一部分資訊

  • OS : 14.04.1-Ubuntu
  • kernel version : 3.19.0-25-generic
  • model name : Intel® Core2 Duo CPU P8400 @ 2.26GHz
  • L1d cache: 32K
  • L1i cache: 32K
  • L2 cache: 3072K
  • memory
    description: System Memory
    physical id: 20
    slot: System board or motherboard
    size: 8GiB
    • bank:0
      description: SODIMM DDR3 Synchronous 1066 MHz (0.9 ns)
      product: HMT351S6EFR8C-PB
      vendor: Hynix Semiconductor (Hyundai Electronics)
      physical id: 0
      serial: 2736E85A
      slot: DIMM 1
      size: 4GiB
      width: 64 bits
      clock: 1066MHz (0.9ns)
    • bank:1
      description: SODIMM DDR3 Synchronous 1066 MHz (0.9 ns)
      product: HMT351S6EFR8C-PB
      vendor: Hynix Semiconductor (Hyundai Electronics)
      physical id: 1
      serial: 0B7C485B
      slot: DIMM 2
      size: 4GiB
      width: 64 bits
      clock: 1066MHz (0.9ns)

Git 設定

參考同學 GitHub 設定指引

Astyle

  • astyle的參數,對source code編排的影響,請參考(http://astyle.sourceforge.net/astyle.html#_indent-switches)

  • 作業要求的coding style

    • –suffix=none : 取消自動備份
    • indent-switches
      Indent 'switch' blocks so that the 'case X:' statements are indented in the switch block. The entire case block is indented.
$ astyle --style=kr --indent=spaces=4 --indent-switches --suffix=none *.[ch]

PI近似法

以下不像蒙地卡羅法,那是什麼方法,為什麼可以求出PI?

#include <stdio.h>
#include <unistd.h>

double compute_pi_baseline(size_t N) {
    double pi = 0.0;
    double dt = 1.0 / N;
    for (size_t i = 0; i < N; i++) {
        double x = (double) i / N;
        pi += dt / (1.0 + x * x);
    }
    return pi * 4.0;
}
int main() {
    printf("pid: %d\n", getpid());
    sleep(10);
    compute_pi_baseline(50000000);
    return 0;
}

蒙地卡羅法裡有一個橢圓的相關方程式,(xx + yy < 1) 這是什麼?

Perf

參考Linux 效能分析工具: Perf

$ sudo su <--- 切到root
$ echo -1 >  /proc/sys/kernel/perf_event_paranoid --- 開啟可以取得所有event data的權限
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict" --- 如果要檢測 cache miss event ,需要先取消 kernel pointer 的禁用。

Ex:待測程式需印出PID,並延遲幾秒執行,讓另一個terminal可以執行perf top -p $pid(填上待測程式的PID),抓取相關的event data.

  • 分析性能問題時需要瞭解的背景知識
    • 硬體特性之 cache:存取速率非常快,充分利用 cache 是軟體效能改善過程中,非常重要的部分。
    • Pipeline:一個時鐘週期內可以同時處理三條指令,分別被 pipeline 的不同部分處理。
    • Superscalar:一個時鐘週期觸發 (issue) 多條指令的 pipeline機器架構,比如 Intel 的 Pentium 處理器,內部有兩個執行單元,在一個時鐘週期內允許執行兩條指令。==> 這樣稱為 dual-issue,可想像為一個 packet 裡同時有兩組 pipelined 的 instruction
    • out-of-order execution:在處理器內部,不同指令所需要的執行時間和時鐘週期是不同的,如果嚴格按照程序的執行順序執行,那麼就無法充分利用處理器的 pipeline。因此指令有可能被亂序執行
    • Pipeline,Superscalar,out-of-order execution執行的指令有一個基本要求,即相鄰的指令相互沒有依賴關係。假如某條指令需要依賴前面一條指令的執行結果數據,那麼 pipeline 便失去作用,因為第二條指令必須等待第一條指令完成。
    • Branch Prediction:若分支的結果是跳到其它指令,則pipeline後面的執行單元必需flush,從而影響效能。
      • static prediction
        • Opcode based 查
        • Displacement based (forward not taken, backward taken) 查
        • Compiler directed (branch likely, branch not likely) 查如何設定
      • Dynamic prediction
        • Use part of the PC (low-order bits) to index buffer/table – Why?
          – Multiple branches may share the same bit 查
          Backward branches for loops will be mispredicted twice
          once upon loop exit and then again on loop entry查
      • Branch penalty involves both misprediction rate and branch
        frequency, and is higher for integer benchmarks查
      • Prediction accuracy improves with buffer size, but does not improve beyond 4K entries 查
      • Correlating or Two-level Predictors

原始版本

  • Data structure
  6 /* original version */
  7 typedef struct __PHONE_BOOK_ENTRY {
  8     char lastName[MAX_LAST_NAME_SIZE];
  9     char firstName[16];
 10     char email[16];
 11     char phone[10];
 12     char cell[10];
 13     char addr1[16];
 14     char addr2[16];
 15     char city[16];
 16     char state[2];
 17     char zip[5];
 18     struct __PHONE_BOOK_ENTRY *pNext;
 19 } entry;

  • 指令
$ sudo su
$ echo -1 >  /proc/sys/kernel/perf_event_paranoid
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
$ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_orig
  • perf stat 結果
size of entry : 136 bytes
execution time of append() : 0.088558 sec
execution time of findName() : 0.017283 sec

 Performance counter stats for './phonebook_orig' (10 runs):

           827,166      cache-misses              #   29.994 % of all cache refs      ( +-  0.95% ) [44.38%]
         2,757,744      cache-references                                              ( +-  0.77% ) [31.05%]
         2,838,813      L1-dcache-load-misses                                         ( +-  1.31% ) [28.96%]
         5,272,629      L1-dcache-store-misses                                        ( +-  1.29% ) [28.35%]
                 0      L1-dcache-prefetch-misses                                   
            67,225      L1-icache-load-misses                                         ( +-  3.70% ) [29.64%]
       259,001,566      instructions              #    0.79  insns per cycle          ( +-  0.62% ) [44.15%]
       328,689,512      cycles                     ( +-  0.40% ) [42.98%]

       0.146228190 seconds time elapsed                                          ( +-  0.43% )

程式優化-1 (縮小 data structure)

  • Data structure:
 27 typedef struct __PHONE_BOOK_ENTRY {   
 28     char lastName[MAX_LAST_NAME_SIZE];
 29 #ifdef DATA_STRUCTURE_OPTIMIZATION_SMALLER_SIZE
 30     releated *related_information;
 31 #else
 32     char firstName[16];
 33     char email[16];
 34     char phone[10];
 35     char cell[10];
 36     char addr1[16];
 37     char addr2[16];
 38     char city[16];
 39     char state[2];
 40     char zip[5];
 41 #endif
 42     struct __PHONE_BOOK_ENTRY *pNext;
 43 } entry;
  • 指令
$ sudo su
$ echo -1 >  /proc/sys/kernel/perf_event_paranoid
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
$ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_opt
  • perf stat output
size of entry : 32 bytes
execution time of append() : 0.064288 sec
execution time of findName() : 0.005590 sec

 Performance counter stats for './phonebook_opt' (10 runs):

            70,125      cache-misses              #    8.085 % of all cache refs      ( +-  0.76% ) [42.66%]
           867,400      cache-references                                              ( +-  7.07% ) [29.23%]
           634,781      L1-dcache-load-misses                                         ( +-  2.59% ) [31.64%]
         1,630,373      L1-dcache-store-misses                                        ( +-  2.77% ) [33.57%]
                 0      L1-dcache-prefetch-misses                                   
            40,509      L1-icache-load-misses                                         ( +-  6.46% ) [31.99%]
       256,956,863      instructions              #    1.32  insns per cycle          ( +-  0.86% ) [45.52%]
       193,948,741      cycles                     ( +-  0.54% ) [43.05%]

       0.084300406 seconds time elapsed                                          ( +-  0.54% )
  • Data structure:
    • Bucket Link List
      不確定是什麼名字暫時取為Bucket Link List
 46 typedef struct __LEVEL1_COMPARE {
 47     char first_char_of_name;
 48     entry *link;
 49 } level1_compare;
  • 指令
$ sudo su
$ echo -1 >  /proc/sys/kernel/perf_event_paranoid
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
$ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_opt_2
  • perf stat output
size of entry : 32 bytes
execution time of append() : 18.591619 sec
execution time of findName() : 0.000032 sec

 Performance counter stats for './phonebook_opt_2' (10 runs):

           199,603      cache-misses              #    0.008 % of all cache refs      ( +-  6.19% ) [42.85%]
     2,489,462,022      cache-references                                              ( +-  0.02% ) [28.60%]
     5,701,701,395      L1-dcache-load-misses                                         ( +-  0.30% ) [28.59%]
        10,203,165      L1-dcache-store-misses                                        ( +-  3.69% ) [28.58%]
                 0      L1-dcache-prefetch-misses                                   
         1,635,607      L1-icache-load-misses                                         ( +-  1.90% ) [28.58%]
    23,329,940,160      instructions              #    0.53  insns per cycle          ( +-  0.02% ) [42.86%]
    44,357,537,174      cycles                     ( +-  0.12% ) [42.84%]

      18.643948613 seconds time elapsed                                          ( +-  0.13% )
  • Data structure:
 46 typedef struct __LEVEL1_COMPARE {
 47     char first_char_of_name;
 48     entry *the_last_node;
 49     entry *link;
 50 } level1_compare;

  • 指令
$ sudo su
$ echo -1 >  /proc/sys/kernel/perf_event_paranoid
$ sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
$ sudo perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses,instructions,cycles ./phonebook_opt_3
  • perf stat output:
size of entry : 32 bytes
execution time of append() : 0.070129 sec
execution time of findName() : 0.000032 sec

 Performance counter stats for './phonebook_opt_3' (10 runs):

            31,586      cache-misses              #    5.733 % of all cache refs      ( +-  3.19% ) [41.21%]
           550,933      cache-references                                              ( +-  1.64% ) [37.54%]
           425,916      L1-dcache-load-misses                                         ( +-  0.95% ) [34.90%]
         1,906,506      L1-dcache-store-misses                                        ( +-  0.92% ) [32.54%]
                 0      L1-dcache-prefetch-misses                                   
            63,040      L1-icache-load-misses                                         ( +-  8.00% ) [27.99%]
       206,689,110      instructions              #    1.22  insns per cycle          ( +-  0.38% ) [38.99%]
       168,967,583      cycles                     ( +-  0.69% ) [34.78%]

       0.073860286 seconds time elapsed                                          ( +-  0.73% )
  • 結果比較
original 優化1 優化2 優化3
execution time of append() 0.088558 sec 0.064288 sec 18.591619 sec 0.070129 sec
execution time of findName() 0.017283 sec 0.005590 sec 0.000032 sec 0.000032 sec
cache-misses 827,166 70,125 199,603 31,586
cache-references 2,757,744 867,400 2,489,462,022 550,933
instructions 259,001,566 256,956,863 23,329,940,160 206,689,110
cycles 328,689,512 193,948,741 44,357,537,174 168,967,583
append() cache-misses 39.36%
findName() cache-misses 48.96% 6.89%

如何取得較小的function misses ? [youchihwang]Fri, Oct 7, 2016 5:10 AM
gnuplot,如何設定長方圖的寬度呢?[youchihwang]Fri, Oct 7, 2016 5:11 AM

original vs 優化1
1:structure減小,cache可以放進更多需要儲存的資料,pde因此減少cache-misses,但為什麼可以減少cache-references.
2:cache-misses減小了,run time也就減少。

優化1 vs 優化2
加了bucket link list, 將name的第一個字母做為index, 分別儲存在不同的link list。
1:append的作法,是以要加入的name的第一個字母做為index,再到對應的link list,依序搜尋直到最後一個node後,再加入name,因此runtime會較長
2:findname的作法,以要搜尋的name的第一個字母為index,再到相對應的link list裡搜尋,所以搜尋的時間會比從頭到尾的搜尋來得快。

優化2 vs 優化3
優化2的append時間太長的原因是以name的第一個字母做為index,再到相對應的link list去搜尋,直到最後一個node再搜入,為了加速執行時間,所以在bucket link list的index structure加入了指向最後一個node的member, 因此在插入name時,即可直接跳到最後一個node並插入其後。

繼續編輯