2017q1 Homework1 (phonebook)

contributed by <ryanwang522>

Reviewed by csielee

關於 strcasecmp() 的 byte-by-byte comparsion 就是會將傳入的字串 s1 s2 將給個字元個別比較(字元大小是一個 byte )，並紀錄相差的數值，最後將數值回傳。因此回傳值有大於0、等於0、小於0，三種狀況
在BST實作當中，並沒有把建立BST的時間計算進 append() 的時間，這樣會造成BST好像比較有效率的錯覺。
在BST中，可以加入能夠動態增加資料。因為目前phonebook是都會先新增然後查詢，但實際狀況是會動態的查詢跟新增
儘量不要修改 main.c 使用 findname 或 append 的函數方法，可以利用 OOP 的觀念，將不同版本的操作隱藏在函數裡

開發環境

OS: ubuntu 16.04 LTS
L1d cache： 32K
L1i cache： 32K
L2 cache： 256K
L3 cache： 3072K
Architecture：x86_64
CPU op-mode(s)： 32-bit, 64-bit
Byte Order： Little Endian
CPU(s)： 4
Model name: Intel® Core™ i5-4200H CPU @ 2.80GHz

原始版本

初始設定並初步執行
- 一開始為了了解 Git 的指令操作花了不少時間，但還是只弄懂了一點點皮毛，背後詳細的原理有待之後再來深入研究。
- 大致瀏覽過相關資料後，覺得還是實地動手做做看吧！
清空 cache

$ echo 1 | sudo tee /proc/sys/vm/drop_caches

片段輸出

$ sudo make cache-test

size of entry : 136 bytes
execution time of append() : 0.042719 sec
execution time of findName() : 0.006162 sec


 Performance counter stats for './phonebook_orig' (100 runs):

          123,6017      cache-misses              #   85.500 % of all cache refs      ( +-  0.29% )
          144,5636      cache-references                                              ( +-  0.19% )
       2,6300,2705      instructions              #    1.25  insn per cycle           ( +-  0.02% )
       2,1018,3982      cycles                                                        ( +-  0.17% )

       0.064090511 seconds time elapsed                                          ( +-  0.19% )

看到報表後覺得怪怪的，miss rate 和其他人好像有些差異Ryan Wang

接著來試試新鮮的 gnuplot
$ sudo make plot

初步構思

想法：
- 觀察main.c的程式碼，不管是append()或findName()在原本龐大的 struct 裡面只有用到 lastName，縮小結構內容應該對於效能改善助益不少。
- 使用 hash function 增加查找效率。

實際嘗試

縮小 Structure 內容

優化前版本的 Structure size = 136 bytes ，而我的 layer1 cache 也只有 32k bytes 頂多也只能放 32 * 1024 / 136 = 240.94 筆資料,在查找將近 35 萬筆資料時，無可避免的會發生大量 cache-miss，縮小體積後的 struct size = 32 bytes，能放 1024 筆資料，應該能有效減少 cache-miss。
更改 structure 內容以縮小體積
- 將原本的詳細資料獨立出來，透過指標存取。
- 避免影響程式效能，先不 malloc() 給 detail 所指向的內容，往後需要用到內容時再分配空間給它就好。

















typedef struct __PHONE_BOOK_DETAIL_ENTRY {
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
} detailEntry;

typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    detailEntry *detail;
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

執行並觀察

$ sudo perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./phonebook_opt

檢測結果

size of entry : 32 bytes
execution time of append() : 0.032987 sec
execution time of findName() : 0.002091 sec

 Performance counter stats for './phonebook_opt' (100 runs):

           16,1809      cache-misses              #   40.110 % of all cache refs      ( +-  0.73% )
           40,3410      cache-references                                              ( +-  0.81% )
       2,4494,4046      instructions              #    1.78  insn per cycle           ( +-  0.02% )
       1,3755,3543      cycles                                                        ( +-  0.51% )

       0.042066988 seconds time elapsed                                          ( +-  0.57% )

$ sudo make plot

發現 Cache-misses 從原本的 85.5％降至 40.11％！
表示原本的 structure 放置太多查找不需要的內容對 cache-misses 的 overhead 影響不小。

使用 Hash function

記得以前在上 OS 課程時有學過 hash 的基本概念，主要是由建立適當大小的 hash table 後，接著再由雜湊函數得到的雜湊值對照 table 去增加查找效率。而需要注意的地方有以下幾點。

如何決定 Table size
- 這裡我是根據電腦的 cache 所能存放的大小決定給 1024，避免查表造成過多的 cache-misses。
使用何種 Hash function
- 看了提供的資料以及一些先前同學的共筆，djb2 這個函數的均勻度似乎不錯，且改善效能的程度也算顯著，不過後來看到作業區提供的 BKDR ，就想說來試試吧！

unsigned int BKDRHash(char *str)
{
    unsigned int seed = 31; 
    unsigned int hash = 0;

    while(*str)
        hash = hash * seed + (*str++);

    return (hash % MAX_TABLE_SIZE);
}

之所以會將原來的雜湊值對 table size 取餘數是因為 size 大小的限制，而當 Collision（碰撞）發生，我採取的方法是 Chaining，直接將重複的資料串在一起，示意圖如下。

接著我對 main.c 做了相對應的修改，原本的 singly linked list 就保留著沒使用。



























    while (fgets(line, sizeof(line), fp)) {
        while (line[i] != '\0')
            i++;
        line[i - 1] = '\0';
        i = 0;

        unsigned int hashVal = BKDRHash(line);

        if (hashTable[hashVal] == NULL) {
            hashTable[hashVal] = (entry *) malloc(sizeof(entry));
            tableHead[hashVal] = hashTable[hashVal];
        }   

        hashTable[hashVal] = append(line, hashTable[hashVal]);
    }   

...

    /* the givn last name to find */
    char input[MAX_LAST_NAME_SIZE] = "zyxel";
    unsigned int hashVal = BKDRHash(input);

    e = tableHead[hashVal];

    assert(findName(input, e) &&
           "Did you implement findName() in " IMPL "?");
    assert(0 == strcmp(findName(input, e)->lastName, "zyxel"));

發現好像直接修改 main.c 在執行 make 時會因為 #ifdef 的問題導致有原始版本的編譯錯誤，不過就先測試看看，以後再來修改Ryan Wang

$ make phonebook_opt

$ sudo perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./phonebook_opt

執行結果

size of entry : 32 bytes
execution time of append() : 0.042810 sec
execution time of findName() : 0.000003 sec

 Performance counter stats for './phonebook_opt' (100 runs):

           11,1048      cache-misses              #   26.915 % of all cache refs      ( +-  3.29% )
           41,2592      cache-references                                              ( +-  0.33% )
       2,3471,2581      instructions              #    1.56  insn per cycle           ( +-  0.02% )
       1,5043,4217      cycles                                                        ( +-  0.58% )

       0.046055866 seconds time elapsed                                          ( +-  0.60% )

發現 append() 的執行時間有些許的增加
- 想了想覺得是因為建立 hash table 所造成的 overhead
- 不過看了 findName() 的時間大幅下降，證明了使用 hash function 雖然會犧牲一點append()的效能，但著實能改善效率。
Cache misses 從 40% 降為 26.91% ，因為適當的 table size 使得不僅查找效率提昇，miss rate 也順帶降低。
參考 Conditional Compilation，解決先前的 orig 版本編譯問題。
- 現在想想應該對append()以及findName()進行修改的，直接改main.c似乎有點不明智QQ
$ sudo make plot
- 這時發現畫出來的圖跟之前一樣，研究了一下發現 output.txt還是舊的，執行$ make clean後就可以正常使用 plot了。
因為先前 commit 的內容被 jserv 教授糾正了不少錯誤的寫法，所以花了一點時間研究 How to write a git commit message，這裡稍微整理一下。
Commit message 大致分為：
- Subject line
  - 長度限制為 50 個 Characters，開頭大寫且不需要加句點。
  - 內容需簡短扼要，如果真的無法在長度限制內撰寫代表你應該分成多次進行 commit。
  - 簡單的檢查方法就是將你的 Subject line 代入「 If applied, this commit will your subject line here 」
- Message body
  - 長度限制在 72 個 Characters 左右。
  - 顧名思義，body 就是讓你詳細解釋此次 commit 的原因以及內容，可以適時的進行分段。
最後看到一句不錯的話紀錄一下
- The future maintainer that thanks you may be yourself！

使用 Binary Search Tree

在想要怎麼實做的時候因為不太懂「字串」要依照什麼規則放入 BST 裡面，閱讀了許多資料後發現是利用 strcasecmp() 的回傳值作為要放置在左子樹還是右子樹的依據。

Google 了一下 strcasecmp()

The strcasecmp() function performs a byte-by-byte comparison of the
strings s1 and s2, ignoring the case of the characters.  It returns
an integer less than, equal to, or greater than zero if s1 is found,
respectively, to be less than, to match, or be greater than s2.

這裡還是不太懂 byte-by-byte comparsion 的實際運作方法，回頭再來研究。 Ryan Wang

實作
- 參考Sorted Linked List to Balanced BST ，利用 Linked list 已經照順序排序好的特性，透過遞迴呼叫，不用進行比較就可以建 BST 。





























int Length(entry *head)
{
    int count = 0; 
    entry *current = head;
    
    while (current != NULL) {
        count++;
        current = current -> pNext;
    }
    
    return count;
}

treeNode *BuildBST(entry **headRef, int length)
{
    if(length <= 0)
        return NULL;
    
    treeNode *left = BuildBST(headRef, length/2);
    treeNode *root = (treeNode *) malloc(sizeof(treeNode));
    root->entry = *headRef;
    root->left = left;

    *headRef = (*headRef)->pNext;

    root->right = BuildBST(headRef, length - (length / 2 + 1));
        
    return root;
}

接著修改findName()，利用循序的方式找出目標。

















entry *findName(char lastName[], treeNode *root)
{
    treeNode *current = root;
    int result;

    while (current != NULL && (result = strcasecmp(lastName, current->entry->lastName)) != 0) {
        if (result < 0)
            current = current -> left;
        else 
            current = current -> right;
    }

    if (!root)
        return (current->entry);
    else
        return NULL;
}

與先前 hash function 版本以及原始版本進行比較
- ./phonebook_bst

size of entry : 32 bytes
execution time of append() : 0.037696 sec
execution time of findName() : 0.000000 sec

 Performance counter stats for './phonebook_bst' (100 runs):

           20,8875      cache-misses              #   53.635 % of all cache refs      ( +-  0.70% )
           38,9440      cache-references                                              ( +-  0.82% )
       2,8554,9675      instructions              #    1.57  insn per cycle           ( +-  0.02% )
       1,8135,1595      cycles                                                        ( +-  0.67% )

       0.055963158 seconds time elapsed                                          ( +-  0.80% )

可見 findName() 的時間又比 hash function 版本有些微的減少。

更改了一下 runtime.gp

原本 Linked list 的查找時間複雜度為 O(n) ，將資料結構調整成 BST 後時間複雜度為 O(logn)，由圖表也可得知查找效率也大幅改善。

針對 Cache line 調整 BST 結構

問題探討

本例選用的 dataset (也就是輸入電話簿的姓名) 是否存在問題？
- 輸入電話簿的姓名可能會有相同姓名的情況出現，需要進而比對其他資料的，可以選用每個人獨立的 ID 作為電話簿的 dataset ，避免多餘的查詢比對。
有代表性嗎？跟真實英文姓氏差異大嗎？
- 覺得跟真實情況有些微的出入，真實英文姓氏為 firstname + lastName，而本例只有 firstname 。
資料難道「數大就是美」嗎？
- 大量的資料要配合適合的資料結構去儲存，以及有效率的演算法為基礎的查詢方式，單純大量且沒有經過縝密規劃的大量資料反而會成會負擔。
如果我們要建立符合真實電話簿情境的輸入，該怎麼作？
- 利用現有的英文姓名去做隨機配對，避免字典檔的一些不符合姓名規則的輸入資料。