2016q3 Homework1 (phonebook)

contributed by <beieve7028>

預期目標

準備 GNU/Linux 開發工具
學習使用 Git 與 GitHub
學習效能分析工具
研究軟體最佳化

實驗環境

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 94
Stepping:              3
CPU MHz:               3491.640
BogoMIPS:              6815.99
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7

先前準備

自己學習使用HackMD與Markdown的~~防痴呆~~筆記
安裝與更新linux必須套件(reference)
$ sudo apt-get update
$ sudo apt-get install build-essential
$ sudo apt-get install linux-tools-common linux-tools-generic
$ sudo apt-get install astyle colordiff gnuplot

perf 學習

perf 安裝筆記

使用perf說明提到的指令時有出現提到的錯誤

$ perf list
WARNING: perf not found for kernel 4.2.0-42

  You may need to install the following packages for this specific kernel:
    linux-tools-4.2.0-42-generic
    linux-cloud-tools-4.2.0-42-generic

  You may also want to install one of the following packages to keep up to date:
    linux-tools-generic-lts-<series>
    linux-cloud-tools-generic-lts-<series>

因此依照建議的指令安裝正確的perf

$ uname -a
Linux os-2 4.2.0-42-generic #49~14.04.1-Ubuntu SMP Wed Jun 29 20:22:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ sudo apt-get install linux-tools-4.2.0-42-generic linux-cloud-tools-4.2.0-42-generic

再次使用

$ perf list
List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]

  alignment-faults                                   [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

就可以正常執行了

perf 練習

在video當中老師有對perf稍作說明，我先藉由老師計算pi的範例當作學習perf的進入點，在複製perf說明內的程式碼並進行編譯：


















#include <stdio.h>
#include <unistd.h>

double compute_pi_baseline(size_t N) {
    double pi = 0.0;
    double dt = 1.0 / N;
    for (size_t i = 0; i < N; i++) {
        double x = (double) i / N;
        pi += dt / (1.0 + x * x);
    }
    return pi * 4.0;
}
int main() {
    printf("pid: %d\n", getpid());
    sleep(10);
    compute_pi_baseline(50000000);
    return 0;
}

在經過編譯時遇到了狀況：

$ gcc -o pi pi.c
pi.c: In function ‘compute_pi_baseline’:
pi.c:7:5: error: ‘for’ loop initial declarations are only allowed in C99 mode
     for (size_t i = 0; i < N; i++) {
     ^
pi.c:7:5: note: use option -std=c99 or -std=gnu99 to compile your code

得到一個錯誤，因為在範例中是使用g++來編譯，而我是採用gcc進行編譯，而從當中的說明可以見到應該是gcc預設編譯的版本不在c99之後的問題。要解決這個問題直覺上有兩個方法

將迴圈內的變數宣告拉到迴圈外
使用g++編譯
從編譯器給予的指示說明，以c99標準指示編譯器編譯

為了能重複利用程式碼減少修改，我選擇指名c99的指示進行編譯後就可以順利編譯：
$ gcc -o pi pi.c -std=c99

接著開兩個terminal視窗各別輸入指令：

$ ./pi
pid: 25306

$ sudo perf top -p 25306
Samples: 2K of event 'cycles', Event count (approx.): 148490621                 
Overhead  Shared O  Symbol                                                      
  99.96%  pi        [.] compute_pi_baseline
   0.04%  [kernel]  [k] uncharge_list

從perf的數據可以得知這支程式計算的開銷都在計算pi的部份

perf stat 練習

為了更熟悉perf的使用，我依照參考資料一步步操作，在能理解$sudo perf top的使用方式後，進一步以perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ~/Desktop/cache_misses去觀察cache miss的情況。

cache_misses.c:








static char array[10000][10000];
int main (void){
  int i, j;
  for (i = 0; i < 10000; i++)
    for (j = 0; j < 10000; j++)
       array[j][i]++;
  return 0;
}

$ perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ~/Desktop/cache_misses

 Performance counter stats for '/home/oslab/Desktop/cache_misses' (5 runs):

         5,278,828      cache-misses              #    2.125 % of all cache refs    
       248,157,964      cache-references                                            
     2,005,916,097      instructions              #    1.88  insns per cycle        
     1,057,336,702      cycles                                                      

       0.271109771 seconds time elapsed                                          ( +-  0.57% )

依照perf教學把i, j對調如參考之料所述可以發現cache-references有顯著的下降，原因在space locality與陣列在線性記憶體內擺放的方式有關：

$ perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ~/Desktop/cache_misses

 Performance counter stats for '/home/oslab/Desktop/cache_misses' (5 runs):

         1,909,115      cache-misses              #   37.448 % of all cache refs    
         5,289,898      cache-references                                            
     2,005,906,440      instructions              #    2.49  insns per cycle        
       807,624,474      cycles                                                      

       0.207120866 seconds time elapsed                                          ( +-  0.89% )

(練習到這時發現自己對於cache-references的意思很模糊，因此就去騷擾請問老師這方面概念的關鍵字，後來老師說要看資料不要從網路找，因此趁著颱風還不會把我吹走時趕緊去宿舍拿相關的書…QQ)

會減少cache-references的次數在於資料放進cache時是以一個cache line(64bytes)為單位，如果資料在存取時具有Spatial locality的特性，或是程式寫法能盡量保有cache friendly就能減少cache-misses與cache-references，因為之後要存取的資料已隨著前面的cache reference一起放置在cache內

在紀錄上面文字敘述時，又冒出了一個新的疑問「cache-references怎麼也會降低？」，看了一本爛書結果還是充滿著疑惑，於是再度前往給老師電教導後才比較了解cache。這時深深覺得以前都是一知半解的，等到真的碰觸這塊就顯得以前學習都是無用的。

原本以為cache-references是對cache的存取次數，也是一開始抱持的疑問，因為每次access都會經過cache，那數量怎麼會有差別，這也是自己對名詞定義的不清楚，才使得自己布滿障礙。

perf record & perf report

在以perf教學進行當中perf record&perf report練習時，因為敘述比較簡短，因此從當中的參考資料perf 性能分析实例——使用perf优化cache利用率進行學習，也順便了解code編寫方式對於cache miss的影響，在了解record&report的用法後，再回頭以perf教學的例子練習。






















#define N 5000000
static int array[N] = { 0 };
void normal_loop(int a) {
    int i;
    for (i = 0; i < N; i++)
        array[i] = array[i]+a;
}
void unroll_loop(int a) {
    int i;
    for (i = 0; i < N; i+=5){
        array[i] = array[i]+1;
        array[i+1] = array[i+1]+a;
        array[i+2] = array[i+2]+a;
        array[i+3] = array[i+3]+a;
        array[i+4] = array[i+4]+a;
    }
}
int main() {
    normal_loop(1);
    unroll_loop(1);
    return 0;
}

經過上面程式以perf進行分析後，在instructions可以看到下面結果，會發現normal_loop佔了比較大的overhead。

Samples: 64  of event 'branch-instructions:u', Event count (approx.): 6031975   
Overhead  Command          Shared Object        Symbol                          
  84.68%  perf_record_exa  perf_record_example  [.] normal_loop
  14.94%  perf_record_exa  perf_record_example  [.] unroll_loop
   0.31%  perf_record_exa  ld-2.19.so           [.] _dl_relocate_object
   0.07%  perf_record_exa  ld-2.19.so           [.] dl_main
   0.01%  perf_record_exa  ld-2.19.so           [.] _dl_start
   0.00%  perf_record_exa  ld-2.19.so           [.] _start

作業

先完成必要的準備

fork一份code到自己的repositories
clone 到 local
cd ~/cs2016q3/ && git clone https://github.com/believe7028/phonebook.git && cd phonebook/

在README.md當中提到必須使用自動排版工具astyle以及使用一個hookpre-commit，因此先去查了打造專屬於你的 Git 工作流程、2016q1 Homework 1與Git 客製化 - Git Hooks

修改結構成員

在老師講解的影片中，提到以前的學長先以修改資料結構大小的方式進行優化，我也以此概念先進行改善，因為這種方式是簡單且直觀的。在過程中共修改兩個相關的檔案：

phonebook_opt.h




























#ifndef _PHONEBOOK_H
#define _PHONEBOOK_H

#define MAX_LAST_NAME_SIZE 16

typedef struct __PHONE_BOOK_DETAIL {
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
} detail;

typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    struct __PHONE_BOOK_DETAIL *pDetail;
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

entry *findName(char lastName[], entry *pHead);
entry *append(char lastName[], entry *e);

#endif

phonebook_opt.c






























#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#include "phonebook_opt.h"

/* FILL YOUR OWN IMPLEMENTATION HERE! */
entry *findName(char lastName[], entry *pHead)
{
    while (pHead != NULL) {
        if (strcasecmp(lastName, pHead->lastName) == 0)
            return pHead;
        pHead = pHead->pNext;
    }
    return NULL;
}

entry *append(char lastName[], entry *e)
{
    /* allocate memory for the new entry and put lastName */
    e->pNext = (entry *) malloc(sizeof(entry));
    e = e->pNext;
    strcpy(e->lastName, lastName);
    e->pDetail = (detail *) malloc(sizeof(detail));
    e->pNext = NULL;

    return e;
}

在phonebook_opt.在當中，其實

e->pDetail = (detail *) malloc(sizeof(detail));

是沒有用到的，因為實際上沒有詳細的資料，但是為了符合真實情況，依然把這部份撰寫完成以貼近真實。

ori:

$ sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_orig
 Performance counter stats for './phonebook_orig' (10 runs):

         4,579,669      cache-misses              #   93.033 % of all cache refs    
         4,831,849      cache-references                                            

       0.047660639 seconds time elapsed                                          ( +-  2.25% )

opt:

$ sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_opt
 Performance counter stats for './phonebook_opt' (10 runs):

         5,535,335      cache-misses              #   90.737 % of all cache refs    
         6,082,763      cache-references                                            

       0.069276455 seconds time elapsed                                          ( +-  2.46% )

經由上面比較可以發現cache miss 降低的幅度很低，而且執行時間都反而增加，因此我回去影片找老師提到的範例來查詢原因，後來發現差異應該是在當中










lastNameEntry *appendOptimal(char lastName[], lastNameEntry *lne)
{
    /* allocate memory for the new entry and put lastName in it.*/
    lne->pNext = (lastNameEntry *) malloc(sizeof(lastNameEntry));
    lne = lne->pNext;
    strcpy(lne->lastName, lastName);
    lne->pNext = NULL;

    return lne;
}

雖然在概念上把詳細資料放在另一個結構裡，但在實作上他不會宣告詳細資料的結構。為了驗證是否差異在此，我註解了我會宣告詳細資料的那行並再次進行計算：





























#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#include "phonebook_opt.h"

/* FILL YOUR OWN IMPLEMENTATION HERE! */
entry *findName(char lastName[], entry *pHead)
{
    while (pHead != NULL) {
        if (strcasecmp(lastName, pHead->lastName) == 0)
            return pHead;
        pHead = pHead->pNext;
    }
    return NULL;
}

entry *append(char lastName[], entry *e)
{
    /* allocate memory for the new entry and put lastName */
    e->pNext = (entry *) malloc(sizeof(entry));
    e = e->pNext;
    strcpy(e->lastName, lastName);
    //e->pDetail = (detail *) malloc(sizeof(detail));
    e->pNext = NULL;

    return e;
}

ori:

$ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_orig
 Performance counter stats for './phonebook_orig' (10 runs):

         4,466,097      cache-misses              #   90.015 % of all cache refs    
         4,980,070      cache-references                                            

       0.048785512 seconds time elapsed                                          ( +-  2.25% )

opt:

$ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_opt
 Performance counter stats for './phonebook_opt' (10 runs):

         1,450,470      cache-misses              #   64.407 % of all cache refs    
         2,333,493      cache-references                                            

       0.031379584 seconds time elapsed                                          ( +-  3.77% )

經過註解詳細資料結構的記憶體配置指令後，可以見到在cache miss有大幅的改善，且數據的表現也比較接近，執行時間的開銷也有降低。

字典樹

在調整資料結構的大小之後，進一部想改變搜尋方式，直覺上有雜湊搜尋與字典樹兩種，於是我先採用字典樹看看是否在數據上有較好的表現：

ori:

$ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_orig
 Performance counter stats for './phonebook_orig' (10 runs):

         4,544,119      cache-misses              #   92.462 % of all cache refs    
         4,789,443      cache-references                                            

       0.048927051 seconds time elapsed                                          ( +-  2.43% )

opt:

$ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_opt
 Performance counter stats for './phonebook_opt' (10 runs):

         4,935,383      cache-misses              #   87.514 % of all cache refs    
         5,649,657      cache-references                                            

       0.135761925 seconds time elapsed                                          ( +-  0.92% )

可以見到雖然雖然採用了字典樹，在搜尋的時間有很好的表現，但是在append的步驟需要較高的時間開銷，因為建構字典樹節點的步驟相較採用link list來的多，而所在意的cache miss反而增加到87%。

lastName 編碼壓縮

表示一個字符需要一個byte，lastName因為只有小寫英文，這表示8bits 共256種可能卻只用到其中26个可能，因此就想以此加壓縮，原本一開始是想採用26進位，但想到會有乘法的操作，因此用位移操作符在計算時會比較快速，因此定為每5bits表示一個小寫英文字，會先把連續字符lastName轉換為一數字，在比較時則以此數值為主要依據：









unsigned long long int name2Int(char lastName[])
{
    unsigned long long int num = 0;
    int index = -1;
    while(lastName[++index] != '\0') {
        num= (num << 5) + (lastName[index] - 'a' + 1);
    }
    return num;
}

當中 -'a' + 1會有+1是考慮到表示'\0'的問題，為了未來可以從數字轉換為字符串，因此小寫英文字符都從1開始，0則用來表示'\0'，才能區別一個數值為0是已無英文字符，如果英文從0開始編號，則一個數值為0有可能是"aa\0"的表示法。

ori:

$ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_orig
 Performance counter stats for './phonebook_orig' (10 runs):

         4,376,367      cache-misses              #   89.796 % of all cache refs    
         4,907,063      cache-references                                            

       0.049096653 seconds time elapsed                                          ( +-  2.43% )

opt:

$ echo 1 | sudo tee /proc/sys/vm/drop_caches && sudo perf stat --repeat 10 -e cache-misses,cache-references ./phonebook_opt
 Performance counter stats for './phonebook_opt' (10 runs):

         7,314,511      cache-misses              #   93.471 % of all cache refs    
         7,795,438      cache-references                                            

       0.092457792 seconds time elapsed                                          ( +-  1.22% )

經過分析後，可以發現不管在chche miss或是執行效率都變得比較差，還有許多可以在改進的地方。