2017q3 Homework1 (phonebook)

tags: `sysprogram`

contributed by <st9007a>

不要打錯字，預期是 "contributed"，之前少了 "d" 字母
"jserv

Reviewed by `amikai`

在 Hash Function 碰撞分析, y 軸的 frequency 是什麼的頻率, x 軸代表什麼意義, 這張圖有幾條 bar 是比較密的, 有幾條是比較稀疏的好像沒有特別的解釋, 上面那張圖跟下面的表格在文章裡好像沒有敘述他們有什麼關聯 (P.S. 可能程度比較差看不太懂)
在問題與討論, 第一個問題, 建議同學可以引進真正的資料集, 並且多測試幾種, 以佐證你所說的 load factor 0.7 - 0.8 之間有教好的效果, 並且讓此實驗更完備
在問題與討論,第二個問題, 建議同學可以進一步嘗試丟入更龐大的資料, 以佐證自己的結論
這位同學的研究, 多半是在 hash 上, 在 append 裡大量的使用了 malloc ,這也是降低效能的因素之一, 建議這位同學可以嘗試引入 mmap 或是 memory pool
老師好像只提到 git commit --amend ，這似乎只能修改前一個 commit message, 如果要修改很前面的 commit 可以考慮使用 git rebase -i

以上為小弟簡短的建議, 如有冒犯請見諒

Github

開發環境

Linux 4.4.0-92-generic
Ubuntu 16.04.1 LTS
L1d cache:  32K
L1i cache:  32K
L2 cache:   256K
L3 cache:   6144K

原始程式碼

首先看看 phonebook_orig.h

#ifndef _PHONEBOOK_H
#define _PHONEBOOK_H

#define MAX_LAST_NAME_SIZE 16

/* original version */
typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

entry *findName(char lastName[], entry *pHead);
entry *append(char lastName[], entry *e);

#endif

可以知道 entry 的資料結構是 linked list，接著看看 phonebook_orig.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#include "phonebook_orig.h"

/* original version */
entry *findName(char lastName[], entry *pHead)
{
    while (pHead != NULL) {
        if (strcasecmp(lastName, pHead->lastName) == 0)
            return pHead;
        pHead = pHead->pNext;
    }
    return NULL;
}

entry *append(char lastName[], entry *e)
{
    /* allocate memory for the new entry and put lastName */
    e->pNext = (entry *) malloc(sizeof(entry));
    e = e->pNext;
    strcpy(e->lastName, lastName);
    e->pNext = NULL;

    return e;
}

可以知道裡面的 findName() 實作了 linked list 的 search ，而 append() 則是在 entry 後面插入一個新的 entry

接著來看看原始效能如何，在 command line 上輸入 $ perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ./phonebook_orig

出現如下的顯示：

 Performance counter stats for './phonebook_orig' (5 runs):

          457,6717  cache-misses       #  88.239 % of all cache refs   ( +-  0.33% )
          518,6740  cache-references                                   ( +-  0.38% )
       2,6177,2530  instructions       #   1.53  insns per cycle       ( +-  0.12% )
       1,7155,7629  cycles                                             ( +-  0.87% )

       0.169891497 seconds time elapsed                                ( +- 34.68% )

可以看到 cache miss 有 88.239%

縮減結構的size

根據 cache line 裡面有個實驗，當存取的資料超過 cache line 的大小後，執行時間會上升，原因在於填充 cache line 的時間大於存取資料的時間，所以我認為縮減資料結構的大小應該可以提升效能，即盡可能讓越多資料被 cache 在同一條 cache line 上越好，所以就直接在資料結構上動手腳，其餘都先保持原樣

access 在台灣的慣用翻譯詞是「存取」，而非「訪問」
"jserv"

typedef struct __PHONE_BOOK_INFO {
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
} info;

typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    struct __PHONE_BOOK_INFO *pInfo;
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

以上使用了一個額外的 struct 來將比較詳細的資訊記下來，而 lastName 是搜尋用的指標，因此保留在 entry 裡面，比較一下原本跟修改後的資料大小，直接執行 phonebook_orig 跟 phonebook_opt 來看結果

$ ./phonebook_orig
size of entry : 136 bytes
execution time of append() : 0.036390 sec
execution time of findName() : 0.004765 sec

$ ./phonebook_opt
size of entry : 32 bytes
execution time of append() : 0.030494 sec
execution time of findName() : 0.001589 sec

可以注意到修改後 entry 的 size 由 136bytes 縮減為 32bytes，除此之外還可以發現 findName() 的執行速度提升了大約三倍左右，接著使用 perf 來看看 cache miss 的狀況如何，在 command line 輸入 $ make cache-test

 Performance counter stats for './phonebook_orig' (100 runs):

          453,4382  cache-misses      #  88.507 % of all cache refs      ( +-  0.08% )
          512,3211  cache-references                                     ( +-  0.06% )
       2,6126,0775  instructions      #   1.54  insns per cycle          ( +-  0.02% )
       1,7013,9033  cycles                                               ( +-  0.07% )

       0.104214970 seconds time elapsed                                  ( +-  1.82% )


 Performance counter stats for './phonebook_opt' (100 runs):

          138,5229  cache-misses      #  57.016 % of all cache refs      ( +-  0.15% )
          242,9525  cache-references                                     ( +-  0.09% )
       2,4394,8709  instructions      #   2.16  insns per cycle          ( +-  0.02% )
       1,1295,1297  cycles                                               ( +-  0.06% )

       0.088586763 seconds time elapsed                                  ( +-  1.07% )

cache miss 從 88.507% 降至 57.016%，下降了 30% 左右

改善查詢速度

注：這裡的 entry 結構沿用上一節為縮減 size 後的結構

我們都知道一般的 linked list 搜尋的時間複雜度為

O (n)

，所以我試著加速查詢速度，首先使用 hash table 來取代單純的 linkedlist 結構，hash table 根據處理 collision 的方式分為兩種：

使用 Linked list 把「Hashing 到同一個 slot」的資料串起來
使用 Probing Method 來尋找 Table 中「空的 slot」存放資料

我想要盡可能不浪費 hash table 的空間，所以我先用方法 2 來實作，而 probing method 先用比較簡單的方式，也就是 linear probing，首先在 phonebook_opt.h 定義 hash table 的資料結構，新增 hash function，以及修改 findName() 跟 append() 的輸入參數

#define HASH_TABLE_SIZE 350000

typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    struct __PHONE_BOOK_INFO *pInfo;
} entry;

typedef struct {
    entry *cell[HASH_TABLE_SIZE];
} hashTable;

unsigned int BKDRHash(char *str);
entry *findName(char lastName[], hashTable *table);
void append(char lastName[], hashTable *table);

這裡 hash function 我是參考各種字串符 Hash 函數比較這篇，使用 BKDR 雜湊法，實際狀況可能需要考慮到雜湊出來的分布，hash table 的長度則是參考 dictionary/words.txt 的行數定出來的，在 entry 中我拿掉了 pNext，因為這個 hash table 並不需要 linked list，接下來在 phonebook_opt.c 中修改 findName() 跟 append()

unsigned int BKDRHash(char *str)
{
    unsigned int seed = 131;
    unsigned int hash = 0;

    while (*str) {
        hash = hash * seed + (*str++);
    }

    return (hash & 0x7FFFFFFF);
}

entry *findName(char lastName[], hashTable *table)
{
    unsigned int i = BKDRHash(lastName) % HASH_TABLE_SIZE;
    while (strcmp(table->cell[i]->lastName, lastName) != 0) {
        i = (i + 1) % HASH_TABLE_SIZE;
        if (!table->cell[i]) return NULL;
    }
    return table->cell[i];

}

void append(char lastName[], hashTable *table)
{
    unsigned int i = BKDRHash(lastName) % HASH_TABLE_SIZE;
    while (table->cell[i]) {
        i = (i + 1) % HASH_TABLE_SIZE;
    }

    entry *e = (entry *) malloc(sizeof(entry));
    e->pInfo = (info *) malloc(sizeof(entry));
    strcpy(e->lastName, lastName);
    table->cell[i] = e;
 }

然後修改 main.c 裡面有關 findName 跟 append 的介面後，先直接執行看看：

請尊重台灣文化和傳統，將 interface 翻譯為繁體中文的「介面」一詞，而非「接口」 –jserv

$ make
$ ./phonebook_opt
size of entry : 24 bytes
execution time of append() : 0.843872 sec
execution time of findName() : 0.001289 sec

發現 append() 的執行速度慢了非常多，其原因在於原本的 append() 的時間複雜度為

O (1)

而修改後時間複雜度變成了

O (n / k)

了，在理論上沒有 collision 的狀況時間複雜度應該保持

O (1)

，所以猜測裡面的 collision 應該滿嚴重的，不過還是先用 perf 來看看效能如何

 Performance counter stats for './phonebook_opt' (100 runs):

          173,0000  cache-misses      #  3.975 % of all cache refs      ( +-  0.30% )
         4352,1870  cache-references                                    ( +-  0.19% )
      32,1506,9274  instructions      #  1.06  insns per cycle          ( +-  0.00% )
      30,3073,3228  cycles                                              ( +-  0.01% )

       0.901060342 seconds time elapsed                                 ( +-  0.11% )

可以看到 cache miss 已經降至 3.975%，然而 instruction 跟 cycle 數量遠大於上一節的狀況(10 倍以上)，所以如果要繼續使用 hash table 勢必要想辦法解決 collision 的問題

改善 `append()` 執行速度

首先考慮到可能是 hash table 的長度過於接近資料庫的大小，所以參考 wikipedia 以及 stackoverflow，裡面提到資料總數與 hash table 的長度比值應該要介於 0.7 與 0.8 之間，這個比值被稱為 load factor，所以這邊分別嘗試使用

l o a d f a c t o r = 0.7, 0.75, 0.8

來做比較

請調整圖中的數據排列
課程助教

Y 軸我使用 logscale 來表示，主要是為了讓 findName() 的數據間的差距表現的明顯一點，圖中是使用單次執行的時間，並且每次執行前會先使用 $ echo 1 | sudo tee /proc/sys/vm/drop_caches 來清除快取，因為有無清除快取影響執行時間滿多的(有快取的狀況下，執行速度大約快了 5 倍)
從上圖可以看出 load factor 如果介於 0.7 到 0.8 之間的執行時間都是差不多的，都在 0.3 秒左右，接著來比較它們的 cache miss

統一寫作 "cache miss"，後面不加 -ing
"jserv"

很明顯的 load factor 為 0.8 時在效能上是比較好的，其 cache miss 跟縮減過資料尺寸的純 linked list ( cache miss = 57.016% )相比，下降了 23% 左右
再來，根據不同的資料庫以及 hash function，其 hash 出來的分布也會不一樣，我認為可以嘗試 load factor 大於 0.8 的狀況，所以將 load factor 調整為 0.9 看看其結果如何

$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ ./phonebook_opt
size of entry : 24 bytes
execution time of append() : 0.312205 sec
execution time of findName() : 0.000001 sec

$ make cache-test
 Performance counter stats for './phonebook_opt' (100 runs):

          105,6438  cache-misses      #  27.359 % of all cache refs      ( +-  0.33% )
          386,1389  cache-references                                     ( +-  0.05% )
       3,6838,3004  instructions      #   1.55  insns per cycle          ( +-  0.02% )
       2,3769,1406  cycles                                               ( +-  0.10% )

       0.118767684 seconds time elapsed                                  ( +-  2.17% )

跟 load factor 為 0.8 時相比，執行時間增加了 0.002sec 但是 cahce miss 減少了 7%，我認為以這個案例來說，load factor = 0.9還算可以接受

試試另一種 hash table

前幾節使用了 open addressing 的 hash table，也就是利用 probing method 來尋找 hash table 中空的 slot，接著來嘗試 chaining 的 hash table，也就是使用 linked list 將 collision 的資料鏈結起來，被鏈結的資料不會佔用 hash table 的 slot，所以理論上 collision 的狀況會比較不嚴重，附上修改後的程式碼

typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    struct __PHONE_BOOK_INFO *pInfo;
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

entry *findName(char lastName[], hashTable *table)
{
    unsigned int i = BKDRHash(lastName) % HASH_TABLE_SIZE;
    entry *head = table->cell[i];
    while (strcmp(head->lastName, lastName) != 0) {
        head = head->pNext;
        if (head == NULL) break;
    }
    return head;

}

void append(char lastName[], hashTable *table)
{
    unsigned int i = BKDRHash(lastName) % HASH_TABLE_SIZE;
    entry *e = (entry *) malloc(sizeof(entry));
    e->pInfo = (info *) malloc(sizeof(entry));
    e->pNext = NULL;
    strcpy(e->lastName, lastName);

    if (table->cell[i] == NULL) {
        table->cell[i] = e;
        return;
    }

    entry *head = table->cell[i];
    while (head->pNext != NULL) {
        head = head->pNext;
    }

    head->pNext = e;
}

接著直接執行看看結果如何，這裡我先設定 load factor = 0.8

$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ ./phonebook_chaining
size of entry : 32 bytes
execution time of append() : 0.302260 sec
execution time of findName() : 0.000000 sec

可以看到 append() 的速度比 open addressing 的 hash table 快了約 0.007sec，接著來看看 cache miss 狀況如何

 Performance counter stats for './phonebook_chaining' (100 runs):

          210,3447  cache-misses      #  55.627 % of all cache refs      ( +-  0.09% )
          378,1313  cache-references                                     ( +-  0.16% )
       3,1909,8415  instructions      #   1.63  insns per cycle          ( +-  0.03% )
       1,9627,3235  cycles                                               ( +-  0.15% )

       0.099198112 seconds time elapse                                   ( +-  2.84% )

發現到 cache miss 反而提升了，其原因是當 CPU 向 cache 存取資料時，如果 cache 中不存在該資料會向 main memory 複製一份包含該資料的一段 cache line size 大小的資料到 cache line 上，因此在該資料周遭的其他資料也會一起被快取住，而 chaining 的 hash table 雖然使 slot 上的資料較不會發生 collision 但是也使 slot 上的資料分布變得比較不連續，資料之間可能會被空 slot 給隔開，所以在 cache line 被快取住的有效資料就相對比較少，同時 linked list 結構的記憶體本身也並非連續分布，因此造成 cache miss 的機率上升

「cache line 會快取住一段連續的記憶體」的陳述不精確，請翻閱計算機組織與結構的書籍
"jserv"
已修正陳述內容
"st9007a"

Hash Function 碰撞分析

首先先對 dataset 中每筆資料 hash 後的結果做統計，x 軸代表資料 hash 出來後的數字，y軸代表資料經過 hash 後的結果出現的次數，例如：(15000, 3) 可以解讀為整個 dataset 中有 3 筆資料 hash 後的結果為 15000，這裡的 hash table 的 load factor 設定為 8：

會發現由於資料範圍太大，所以 1 到 4 次的頻率看起來差不多，所以接上圖轉換成表格來看

hash結果重複次數	0	1	2	3	4	5	6
資料數量	156580	126182	50397	13436	2740	516	49

上面表格是統計每筆資料 hash 後與其他資料重複的次數，可以看到只有大約 44% 的資料沒有重複，也就是說理想狀況下有

1 - (156580 + \frac{126182}{2} + \frac{50397}{3} + \frac{13436}{4} + \frac{2740}{5} + \frac{526}{6} + \frac{49}{7}) / 349900 = 31.275 %

的資料會發生碰撞，但是考慮到 linear probing 會把空的 slot 給佔走，經過實際測試總共有 40.742% 的資料發生碰撞

使用接近真實情況的 dataset

由於原本的 dataset 中的英文姓氏為隨機的字母排列，因此不夠具有代表性，因此在 dataset 換成 dictionary/all-names.txt，並且對 open addressing hash table 做 load factor 對 cache miss 的重新比較

load factor	0.8	0.75	0.7
cache miss	12.561%	12.903%	12.610%

重新比較後發現到，cache miss 在 0.8, 0.75, 0.7 之間非常相近，其原因在於這次使用的 dataset 資料量較小 ( 共 16750 筆 )，接著是 load factor = 0.8 以及 0.7 時的 cache miss 非常接近，反而 load factor = 0.75 時的 cahce miss 最高

問題討論

本例使用的 dataset 有代表性嗎？跟真實英文姓氏差異大嗎？

我們可以從 main.c 中很容易的知道，程式使用的 dataset 來自於 dictionary/words.txt，而將其打開來檢查就可以發現，裡面的 word 應該是按字母順序隨機產生的，因此它並不具有代表性( 與真實狀況相去甚遠 )，跟英文姓氏差異極大

資料難道數大就是美嗎？

根據上面的實驗以及 cache 的原理來看的話，理論上資料量越大，cache miss 發生的機率越高，同樣的，資料儲存的內容越多，cache miss 發生機率也會提高，因此在處理龐大的數據時，必須慎選資料結構

如果我們要建立符合真實電話簿情境的輸入，該怎麼作？

使用符合真實情況的 dataset，接著除了查詢功能外可以加入新增與刪除功能，如果繼續選用 hash table 的話，可以適時的 resize 它的大小，保持 load factor 介於 0.7 到 0.8 之間

參考資料

關於 CPU cache – 程序猿需要知道的那些事
 各種字串符 Hash 函數比較
 Hash Table：Intro (簡介)
wikipedia – hash table
stackoverflow – How to choose size of hash table?
stackoverflow – How do cache lines work?
知乎 – 請教CPU的cache中關於line,block,index等的理解?