2016q3 Homework1 (phonebook)

contributed by <louielu>

Reviewed by `jserv`

沒有嘗試不同的 data set，原本的程式輸入是已排序、非典型英文姓氏，這與現實不匹配
實作提到透過引入 hash 加速 fineName() 的操作，但缺乏不同 hash function 的效能比較和設計取捨
在 append() 中，malloc() 是個顯著的時間開銷，缺乏減緩效能衝擊的方案，而且沒考慮到 malloc() 失敗的情況
Image Not Showing Possible Reasons
- The image was uploaded to a note which you don't have access to
- The note which the image was originally uploaded to has been deleted
Learn More →
- 在上圖的環境中，可用記憶體不足以載入 35 萬筆電話資料，於是連 phonebook_orig 執行都會失敗:
```
$ ./phonebook_orig 
size of entry : 136 bytes
程式記憶體區段錯誤
```
卻乏搜尋演算法的評估和效能分析
考慮到電話簿需要作到動態資料新增和刪除，若引入 hash，面對大量資料時，會有什麼影響？
儘管已經整理頗多 perf 和效能測量的資料，但並未反映到此程式效能分析，除了 cache miss，還請一併探討 branch prediction accuracy 等議題
main.c 無法透過 function pointer 來切換和比較不同實作的效能落差，應該先設計一份可通用的軟體界面，然後將 binary tree, hash table, trie 等不同實作機制加入
將 append() 和 findName() 時間加入統計的意義不大，真實應用往往是個別操作，特別在圖表的呈現

開發環境

CPU: Intel® Core™ i7-2640M CPU @ 2.80GHz
MEM: 8GB
Cache:
- L1d cache: 32K
- L1i cache: 32K
- L2 cache: 256K
- L3 cache: 4096K

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Portable Hardware Locality (hwloc)

開發目標

精通 perf
改善 phonebook 效能
findName cache miss
append()
fuzzy

perf

perf - raw counter

格式：rUUEE (U = Umask Value, E = Event num)
範例：

Umask: 02H
Event: 11H
Description：Counts 256-bit packed double-percision floating-polint insturctions
Event Mask Mnemonic：SIMD_FP_256.PACKED_DOUBLE

使用指令：$ taskset -c 1 perf stat -e r0211 ./time_test_avx (借用一下 compute-pi, -c 指定 core 1)
讀取register確認：$ sudo rdmsr -p 1 0x186; output 110211 (-p 指定 core 1)
perf 輸出：

 Performance counter stats for './time_test_avx':
       678,616,723      r0211:u                     
       1.698941816 seconds time elapsed

換一個沒有用到 SIMD 的
$ taskset -c 1 perf stat -e r0211 ./time_test_baseline

perf 輸出：

 Performance counter stats for './time_test_baseline':
                 0      r0211:u                                             
       5.186517630 seconds time elapsed

明顯會是 0 (不是 0 就是你打錯raw counter，或是你用的不是 sandy bridge)

Reading PERV_EVT_SEL MSRs (Model-specific register) value

rdmsr 這個工具可以用來讀取 CPU MSRs (Model-specific register) 的資料。
根據 Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B p.3989 34.7 MSRS IN INTEL PROCESSOR FAMILY (INTEL MICROARCHITECTURE CODE NAME SANDY BRIDGE) 的 Table 34-10 (p.3994) 指出，Register Address 186H ~ 18DH 是 IA32_PERFEVTSEL 的暫存器位置。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

所以剛才才會指定 rdmsr 0x186，因為他是 IA32_PERFEVTSEL0 的位置

Misleading perf event name！

既然知道怎麼看到現在 event 的 raw code，我們就要來驗證一下，到底 perf list 出來的 events 跟我們想像中的有沒有一樣？

Reference

Linux perf #raw counter by Brendan Gregg's
Intel Sandy Bridge Microarchitecture events - 常用 profile events 整理 (可是還是有差，他沒有 event num. 跟 umask value)
Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B - p.3158 CHAPTER19 Performance-Monitoring Events, p.3159 19.2 Performance Monitoring Events for Intel Core Processor 2xxx Series

example of raw counter umask and event num.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

v## 案例分析: phonebook

檔案分析

main.c

main.c 分為兩個函數，一個是 diff_in_second 負責計算時間差，一個是 main 負責運行整著處理程序。

7: `#include IMPL`

會使用到 gcc -DIMPL="" 的 macro，可以參考 gcc –help

-D name=definition
The contents of definition are tokenized and
processed as if they appearedduring translation
phase three in a #define directive.
In particular, the definition will be truncated
by embedded newline characters.

大意就是說 gcc -Dname=definition 的東西會被翻譯成 #define name definition。透過這樣的方式，我們可以只用一個 main.c 檔來運行 orig 或是 opt 版本的 phonebook。

！！必須要注意！！
在 shell 裏面的話 gcc -DIMPL"\"phonebook_orig.h\"" 雙引號要做跳脫字元，要不然會一直出錯

main.c:7:10: error: #include expects "FILENAME" or <FILENAME>
 #include IMPL

延伸閱讀: 警告："no newline at end of file" jserv

9: `#define DICT_FILE "./dictionary/words.txt"`

這邊定義使用的字典檔，可以看到是使用 /dictionary/words.txt

46: `#if defined(GNUC)`

gcc 特有的 predefine macro，可以用來偵測 compiler 是不是 gcc，搭配下一行 gcc 的 buildin funtion 來使用。

可以拿起手邊的 tcc 跟 clang 來玩，加個程式碼





#if defined(__GNUC__)
    puts("__GNUC__");
    __builtin___clear_cache((char *) pHead,
                            (char *) pHead + sizeof(entry));
#endif

gcc -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_gcc
tcc -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_tcc
clang -DIMPL="\"phonebook_orig.h\"" phonebook_orig.c main.c -o phonebook_orig_clang

執行 phonebook_orig_[gcc,tcc,clang] 只有 phonebook_orig_gcc 會輸出 __GNUC__

47: `builtinclear_cache((char ) pHead, (char ) pHead + sizeof(entry))`

gcc 的 builtin function __builtin__clear_cache，用來清除從 begin 到 end 之間的 instructions cache。

(gdb) x/136x (char *) pHead
0x602240:	0x00000000	0x00000000	0x00000000	0x00000000
0x602250:	0x00000000	0x00000000	0x00000000	0x00000000
0x602260:	0x00000000	0x00000000	0x00000000	0x00000000
0x602270:	0x00000000	0x00000000	0x00000000	0x00000000
0x602280:	0x00000000	0x00000000	0x00000000	0x00000000
0x602290:	0x00000000	0x00000000	0x00000000	0x00000000
0x6022a0:	0x00000000	0x00000000	0x00000000	0x00000000
0x6022b0:	0x00000000	0x00000000	0x00000000	0x00000000
0x6022c0:	0x00000000	0x00000000	0x00000411	0x00000000
0x6022d0:	0x657a6973	0x20666f20	0x72746e65	0x203a2079
0x6022e0:	0x20363331	0x65747962	0x00000a73	0x00000000
0x6022f0:	0x00000000	0x00000000	0x00000000	0x00000000
0x602300:	0x00000000	0x00000000	0x00000000	0x00000000

50	    clock_gettime(CLOCK_REALTIME, &start);
(gdb) x/136x (char *) pHead
0x602240:	0x00000000	0x00000000	0x00000000	0x00000000
0x602250:	0x00000000	0x00000000	0x00000000	0x00000000
0x602260:	0x00000000	0x00000000	0x00000000	0x00000000
0x602270:	0x00000000	0x00000000	0x00000000	0x00000000
0x602290:	0x00000000	0x00000000	0x00000000	0x00000000
0x6022a0:	0x00000000	0x00000000	0x00000000	0x00000000
0x6022b0:	0x00000000	0x00000000	0x00000000	0x00000000
0x6022c0:	0x00000000	0x00000000	0x00000411	0x00000000
0x6022d0:	0x4e475f5f	0x5f5f4355	0x72746e0a	0x203a2079
0x6022e0:	0x20363331	0x65747962	0x00000a73	0x00000000
0x6022f0:	0x00000000	0x00000000	0x00000000	0x00000000
0x602300:	0x00000000	0x00000000	0x00000000	0x00000000

中間 0x6022d0 的數值被改變了，不知道有何作用。

49-59: `clock_gettime(CLOCK_REALTIME, &start)`

這邊負責 append 資料到 *e 裡面


e = append(line, e)

使用這種方式來 append，可以降低一般 linked list append 需要 O(n) 的時間，可是必須要維護一個 list head，所以可以看到 line 63: e = pHead，把 pHead 指向的位置傳給 e，讓 e 指向 list head 的地方。

66-80: `char input[MAX_LAST_NAME_SIZE] = "zyxel";`

這邊先段來看


char input[MAX_LAST_NAME_SIZE] = "zyxel";`

這行把等等要找的 last name 設定為 "zyxel"，看到 words.txt 裏面，zyxel 是倒數第 10 個，代表 findName() 一定要加班了 (幫QQ)


e = pHead

因為 e 在 append 完之後現在在 last element，我們要讓 e 指回 first element。



    assert(findName(input, e) &&
           "Did you implement findName() in " IMPL "?");
    assert(0 == strcmp(findName(input, e)->lastName, "zyxel"));








#if defined(__GNUC__)
    __builtin___clear_cache((char *) pHead, (char *) pHead + sizeof(entry));
#endif
    /* compute the execution time */
    clock_gettime(CLOCK_REALTIME, &start);
    findName(input, e);
    clock_gettime(CLOCK_REALTIME, &end);
    cpu_time2 = diff_in_second(start, end);

未優化版本

$ make run

size of entry : 136 bytes
execution time of append() : 0.046524 sec
execution time of findName() : 0.006071 sec

basic CPU statistic
$ perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles ./phone_orig

 Performance counter stats for './phonebook_orig':

       175,357,733      cycles:u                                                    
       229,437,098      instructions:u            #    1.31  insn per cycle                                            
         1,301,277      cache-references:u                                          
         1,277,761      cache-misses:u            #   98.193 % of all cache refs    
         5,058,973      bus-cycles:u                                                

       0.066305491 seconds time elapsed

CPU L1 cache staticstic
$ perf stat -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-prefetch-misses,L1-dcache-store-misses,L1-icache-load-misses ./phonebook_orig

 Performance counter stats for './phonebook_orig':

         1,315,576      cache-misses:u            #  120.087 % of all cache refs      (59.07%)
         1,095,522      cache-references:u                                            (60.37%)
         1,231,509      L1-dcache-load-misses:u   #    1.95% of all L1-dcache hits    (78.15%)
        63,186,159      L1-dcache-loads:u                                             (81.73%)
         3,369,852      L1-dcache-prefetch-misses:u                                     (43.53%)
            35,271      L1-dcache-store-misses:u                                      (39.19%)
            21,174      L1-icache-load-misses:u                                       (56.70%)

       0.068884934 seconds time elapsed

縮減 entry size

!!!WARNING!!!

開始之前，不要被表了，先看 code，記得把 phonebook_opt.h 的 #define OPT 1 註解拿掉，要不然你怎麼跑 make plot 都是沒用的bb

將 entry size 從 136 bytes 縮小為 32 bytes，append 與 findName 的效率都有所提升。

append 提升 21.97% 速度，findName 提升 122.31 % 速度。

typedef struct __LAST_NAME_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    phonebook_detail *detail;
    struct __LAST_NAME_ENTRY *pNext;
} entry;

orig cache-test

 Performance counter stats for './phonebook_orig' (100 runs):

         1,181,680      cache-misses:u            #   95.942 % of all cache refs      ( +-  0.49% )
         1,231,662      cache-references:u                                            ( +-  0.41% )
       229,689,183      instructions:u            #    1.24  insn per cycle                                              ( +-  0.02% )
       185,425,108      cycles:u                                                      ( +-  0.46% )

       0.079698901 seconds time elapsed                                          ( +-  0.83% )

entry size cache-test

 Performance counter stats for './phonebook_opt' (100 runs):

            89,105      cache-misses:u            #   32.843 % of all cache refs      ( +-  0.98% )
           271,309      cache-references:u                                            ( +-  0.70% )
       233,120,203      instructions:u            #    1.62  insn per cycle                                              ( +-  0.02% )
       143,480,575      cycles:u                                                      ( +-  0.35% )

       0.053849988 seconds time elapsed                                          ( +-  0.78% )

cache misses 從 95% 下降到 32%！太神啦~

Hash table

Hash function

從記憶裏面找到這個網站：
各种字符串Hash函数比较 by byvoid

採用推薦的 BKDR hash function (就是 K&R，鼎鼎大名的 Brian Kernighan 和 Dennis Ritchie 在《The C Programming Language》裏面寫的 hash function：










unsigned int BKDRHash(char *str)
{
    unsigned int seed = 131;
    unsigned int hash = 0;
    while (*str) {
        hash = hash * seed + (*str++);
    }
	
    return (hash & 0x7FFFFFFF);
}

Hash table

因為很喜歡使用 Python，所以決定了解 Python 的 dict 是如何實作的，同時運用在這次的作業當中。

參考 Python Dictionary Implementation by Laurent Luce，可以發現一件重要的事情：

Python dict 是採用 open addressing 的方式處理 collision

Open Addressing

系統程式的老師上課有教過，可是忘記了…。看到 wiki 就想起來了。

Open addressing 是當 hash table 出現撞擊的時候的處理方式，運用不同的探針 (probe) 來決定這個 value 的 index。

Open addressing 最大的好處就是，可以將 hash table 的查詢時間降低為 O(1)，以往用 list 來處理 collision 的話會讓查詢時間出現 worse case O(n)。

常見的 probe 以下幾種：

Linear probing: +1 大法
Quadratic probing : new_index = f(old_index);

Python 使用 Quadratic probing，使用的 function 如下：




perturb = hash;

index = (index << 2) + index + perturb + 1;
perturb >>= PERTURB_SHIFT;

PERTURB_SHIFT 預設是 5，不能小於 1。

假設 hash table size = 32，hash = 187875484, index = 28
跑 probing 的結果會是：

28 -> 9 -> 18 -> 11 -> 29 -> 5 -> 31 -> 28 ->
13 -> 2 -> 11 -> 24 -> 25 -> 30 -> 23 -> 20 ->
5 -> 26 -> 3 -> 16 -> 17 -> ....

Dict Structure

Python dict 的 structure 總共用到兩個，一個是 PyDictEntry 作為 key value structure、一個是 PyDictObject 作為整個 Dict 的 Object

可以在 svn.python.org 找到 dictobject.c 的原始碼，如果你想要做的話，打開他，你會用到的。

在這邊把他特化為 phonebook 專用的 DictEntry 跟 DictObject 如下：

















typedef struct {
    unsigned int hash;
    char lastName[MAX_LAST_NAME_SIZE];
    phonebook_detail *detail;
} DictEntry;

typedef struct _dictobject DictObject;
struct _dictobject {
    unsigned int ma_mask;
    int ma_used;
    int ma_fill;
    int ma_size;
    DictEntry *ma_table;
    DictEntry *(*ma_lookup)(DictObject *mp,
                             char key[], 
                             unsigned int hash);
};

unsigned int ma_mask： hash mask, size - 1
ma_used：目前使用的 slot 數量
ma_fill：目前使用的 slot + dummy slot，dummy slot 是刪除 slot 之後，會將 slot 設定為 dummy slot
ma_size：目前 ma_table 的大小
ma_table：hash table
ma_lookup：lookup function

Dict function











DictObject *Dict_New();

int Dict_SetItem(DictObject *mp, char key[]);
int Dict_Lookup(DictObject *mp, char key[]);

DictEntry *lookdict(DictObject *mp, char key[], unsigned int hash);
int insertdict(DictObject *mp, char key[], const unsigned int hash);
void insertdict_clean(DictObject *mp, char key[],
                      unsigned int hash, phonebook_detail *value);
int dictresize(DictObject *mp, int minused);
unsigned int BKDRHash(char *s);

Hashtable 效能

 Performance counter stats for './phonebook_opt' (100 runs):

         1,553,358      cache-misses:u            #   68.207 % of all cache refs      ( +-  0.28% )
         2,277,402      cache-references:u                                            ( +-  0.20% )
       270,305,408      instructions:u            #    0.62  insn per cycle                                              ( +-  0.02% )
       437,887,064      cycles:u                                                      ( +-  0.34% )

       0.162588847 seconds time elapsed                                          ( +-  0.60% )

cache-misses 提升到 68.207%
append 時間也上升到 0.147秒，跟 orig 相比慢了3倍
findName 時間降到 0 秒

電話簿這種東西建完就可以用很久了，findName 降到 0 應該滿棒的啊！

Performance 比較 - abandoned

append() all word in dictionary/words.txt
findName("zyxel") 100 times

這不是個好的實驗設計。

 Performance counter stats for './phonebook_orig' (100 runs):

        42,431,897      cache-misses:u            #   99.071 % of all cache refs      ( +-  0.11% )
        42,829,855      cache-references:u                                            ( +-  0.09% )
     2,164,254,681      instructions:u            #    1.04  insn per cycle                                              ( +-  0.00% )
     2,079,356,180      cycles:u                                                      ( +-  0.42% )

       0.653479071 seconds time elapsed                                          ( +-  0.27% )

 Performance counter stats for './phonebook_opt' (100 runs):

         2,481,889      cache-misses #   33.237 % of all cache refs ( +-  0.36% )
         7,467,180      cache-references ( +-  0.29% )
     2,167,727,565      instructions #    2.21  insn per cycle ( +-  0.00% )
       981,860,196      cycles ( +-  0.18% )

       0.410072312 seconds time elapsed ( +-  1.42% )

 Performance counter stats for './phonebook_opt_hash' (100 runs):

         1,556,307      cache-misses #   68.441 % of all cache refs ( +-  0.43% )
         2,273,932      cache-references ( +-  0.21% )
       270,227,846      instructions #    0.65  insn per cycle ( +-  0.02% )
       412,602,985      cycles ( +-  0.29% )

       0.211090389 seconds time elapsed ( +-  1.22% )

Performance - Intel Hierarchical Top-Down Performance Characterization Methodology

Reference from: http://www.brendangregg.com/methodology.html
Are UOPs issued?
- If yes:
  - Are UOPs retired?
    - If yes: retiring (good)
    - If no: investigate bad speculations
- If no:
  - Allocation stall?
    - If yes: investigate back-end stalls
    - If no: investigate front-end stalls

CPU front-end 意義

簡單來說前端就是：取得指令、以及把指令解碼成微指令。sandy bridge 系列可以一次傳 4 個 uops 給後端

中文翻譯 by louielu

CPU back-end 意義

簡單來說後端就是：執行 uops，有 parallelism

中文翻譯 by louielu

phonebook case

$ perf stat -d ./phonebook_oirg
 Performance counter stats for './phonebook_orig':

         66.166969      task-clock:u (msec)       #    0.995 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            12,346      page-faults:u             #    0.187 M/sec                  
       113,680,547      cycles:u                  #    1.718 GHz                      (34.24%)
        82,373,177      stalled-cycles-frontend:u #   72.46% frontend cycles idle     (48.06%)
        64,255,629      stalled-cycles-backend:u  #  56.52% backend cycles idle       (50.65%)
       186,811,699      instructions:u            #    1.64  insn per cycle         
                                                  #    0.44  stalled cycles per insn  (62.39%)
        34,733,972      branches:u                #  524.944 M/sec                    (63.81%)
           789,273      branch-misses:u           #    2.27% of all branches          (62.64%)
        49,760,981      L1-dcache-loads:u         #  752.052 M/sec                    (44.50%)
            56,790      L1-dcache-load-misses:u   #    0.11% of all L1-dcache hits    (18.82%)
             4,932      LLC-loads:u               #    0.075 M/sec                    (23.20%)
   <not supported>      LLC-load-misses:u                                           

       0.066500871 seconds time elapsed

$ perf stat -d ./phonebook_opt     
 Performance counter stats for './phonebook_opt':

         41.165953      task-clock:u (msec)       #    0.993 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
             4,147      page-faults:u             #    0.101 M/sec                  
        89,513,769      cycles:u                  #    2.174 GHz                      (30.69%)
        47,728,782      stalled-cycles-frontend:u #   53.32% frontend cycles idle     (45.97%)
        30,544,573      stalled-cycles-backend:u  #  34.12% backend cycles idle       (50.45%)
       186,816,945      instructions:u            #    2.09  insn per cycle         
                                                  #    0.26  stalled cycles per insn  (61.88%)
        35,932,936      branches:u                #  872.880 M/sec                    (63.67%)
           848,250      branch-misses:u           #    2.36% of all branches          (70.17%)
        54,843,255      L1-dcache-loads:u         # 1332.248 M/sec                    (41.03%)
            26,195      L1-dcache-load-misses:u   #    0.05% of all L1-dcache hits    (15.29%)
             2,297      LLC-loads:u               #    0.056 M/sec                    (22.03%)
   <not supported>      LLC-load-misses:u                                           

       0.041471519 seconds time elapsed

A Re think of WHY?

在不改變讀取方式以及 append code 的狀況下，為什麼降低 struct size 可以增加 append 效能？
- sizeof((entry *) 0x0) -> 8
  - entry pointer 的大小不會改變啊。
- 推測是 malloc size 對於速度有差異
- proof it.

pthread boss/worker model

pthreads-example/example3-boss-worker-model.c

2016q3 Homework1 (phonebook)

Reviewed by jserv

開發環境

開發目標

perf

perf - raw counter

Reading PERV_EVT_SEL MSRs (Model-specific register) value

Misleading perf event name！

Reference

檔案分析

main.c

7: #include IMPL

9: #define DICT_FILE "./dictionary/words.txt"

46: #if defined(__GNUC__)

47: __builtin__clear_cache((char *) pHead, (char *) pHead + sizeof(entry))

49-59: clock_gettime(CLOCK_REALTIME, &start)

66-80: char input[MAX_LAST_NAME_SIZE] = "zyxel";

未優化版本

縮減 entry size

Hash table

Hash function

Hash table

Open Addressing

Dict Structure

Dict function

Hashtable 效能

Performance 比較 - abandoned

Performance - Intel Hierarchical Top-Down Performance Characterization Methodology

CPU front-end 意義

CPU back-end 意義

phonebook case

A Re think of WHY?

pthread boss/worker model

Reviewed by `jserv`

7: `#include IMPL`

9: `#define DICT_FILE "./dictionary/words.txt"`

46: `#if defined(GNUC)`

47: `builtinclear_cache((char ) pHead, (char ) pHead + sizeof(entry))`

49-59: `clock_gettime(CLOCK_REALTIME, &start)`

66-80: `char input[MAX_LAST_NAME_SIZE] = "zyxel";`