2016q3 Homework1 / phonebook

tags: `jserv`

Date: 2016/09/30

contribute by <hankgo>

github
youtube

電腦環境

OS: Lubuntu 16.04 (upgrade from 15.10)
因為舊的版本無法用內建的圖形介面升級，說缺少元件，所以使用終端機輸入指令升級。參考網址





  sudo apt-get update
  sudo apt-get upgrade
  sudo apt-get -f install   # (not "install -f"!)
  sudo apt-get -y install apt
  sudo do-release-upgrade

CPU: Intel i3-2350m ( 2 core / 4 thread )
- 使用指令 $ lscpu
- L1 data cache = 32K
- L1 instruction cache = 32K
- L2 cache = 256K
- L3 cache = 3072K
RAM: 8GB

github 指令

參考資料
$ git clone xxxx.git fork 之後下載檔案
$ git status 看狀態
$ git add [filename] 將檔案加入追蹤清單
- $ git add . 將全部檔案加入追蹤清單
- $ git add -i 互動模式，用挑選的加入清單
- $ git reset [filename] 特定檔案從 stage 移開 (移開追蹤)
- $ git reset . 檔案從 stage 移開
$ git commit commit，事後會問你要加入什麼訊息
- $ git commit -m "message" 一併將 message commit
- $ git commit -v 顯示檔案變動增減記錄
如何修改/取消上一次的 commit
- $ git commit --amend 修改上一次的 commit 訊息。
- $ git commit --amend [filename1] [filename2]... 將檔案1、檔案2加入上一次的 commit。
- $ git reset HEAD^ --soft 取消剛剛的 commit，但保留修改過的檔案。
- $ git reset HEAD^ --hard 取消剛剛的 commit，回到再上一次 commit的乾淨狀態。
git log 查看過去 commit 的紀錄
- git log -stat 看到更詳盡的訊息
- git log -p 看到檔案更詳細的變更內容
基本流程：修改檔案 => 加入 stage (git add) => 提交( git commit )=> 繼續修改其他檔案 => 上傳到 github (git push)

改善方式

reduce structure size

phonebook_orig.h



















/* original version */
typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

entry *findName(char lastName[], entry *pHead);
entry *append(char lastName[], entry *e);

#endif

phonebook_opt.h

















typedef struct __DETAIL_ENTRY {
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
} detailEntry;

typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    detailEntry *detail;
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

因為目前的搜尋只會使用到 lastName 而已，所以先將其他的欄位用別的 detailEntry structure 包起來，原始的 __PHONE_BOOK_ENTRY 只剩下 lastName、pNext、還有 detailEntry 這樣，名字搜尋到之後，真的有需要再去 detailEntry 去撈資料。如此一來，原始的 structure 的大小就會縮減，cache可以塞的 structure 的數量會增加。

這個嘗試之中，尚未去更動 phonebook_orig.c 的內容， phonebook_opt.c 的內容目前是跟 phonebook_orig.c 的內容是相同的，沒有做太大更動。

OPTION Control

改好了之後，跑了數次發現怎麼圖表都是一樣的。發現我的 Program 沒有把 opt.txt 印出來，再看看哪邊有跟 opt.txt 有關係，發現以下 code，兩個 orig 跟 opt 版本都是共用一個 main，是利用 OPT 的 define 狀況來控制適用哪一個 output file，未來要實作 hash optimized 的版本，也可能需要適當的修改這段 code 來共用 main.c。所以這個 #if define(OPT) 是關鍵





#if defined(OPT)
    output = fopen("opt.txt", "a");
#else
    output = fopen("orig.txt", "a");
#endif

一開始在 main.c 找了好久找不到，經過很長一段時間發現 phonebook_opt.h 有下列這段 code，只是一開始 clone 的時候是被註解的。解除註解就順利產生 opt.txt 也順利做出第一個改善效能的成果。

#define OPT 1

未來要再增加使用 hash function 的效果，則需要新增新的 OPT 定義和輸出，故修改如下 (MAKEFILE 等也需要一併修改)：

參考資料：

Hash Function

前置修改

我將第二種優化的方式寫在新的檔案 phonebook_hash，然後共用同一個 main.c。根據觀察 makefile 等檔案的結果，其他的檔案需要做對應的調整

Makefile：要增加 CFLAGS_hash、增加 phonebook_hash 的 compile (EXEC、phoebook_hash 的動作)、增加 phonebook_hash 的 perf test、make clean 要記得清 phonebook_hash 所產生的 hash.txt
main.c：利用 define OPT 來增加 output.txt 檔名 switch 的支援 (opt_hash.txt)

#if defined(OPT)
    output = fopen("opt.txt", "a");
#elif defined(HASH)
    output = fopen("hash.txt", "a");    
#else
    output = fopen("orig.txt", "a");
#endif

calculate.c：增加 phonebook_hash 的時間計算
scripts/runtime.gp：增加屬於 hash 方式的圖表

hash function 簡介

hash function 常用於密碼學。常見的函數有一對一、多對一，一對一的函數可以求出其函數的反函數 (可以根據輸出結果反推輸入值)。而多對一的函數則沒辦法有反函數。密碼學有時候會利用 hash function 來製作真實密碼的部分特徵。hash function 其實就是對原始輸入字串做一連串的特殊函數運算，輸出一個固定長度的結果，也就是原始輸入字串的特徵。hash function 個人覺得重要的特性如下：

不容易從輸出的資訊反推回原始資訊
相同輸出資訊，但由兩組以上相同資訊的字串輸入到 hash function 的機率很低，若有則為 hash collision，以下為範例：

hash function h(x) = x ^ 2
h(5) = 25; h(-5) = 25，這時候就有 hash collision

所以根據上述特性，hash function 函數的設計上，要盡量不容易被反推，且要注意避免 collisiion。

根據前人和同學的心得，我也先嘗試了 djb2，是個專門處理文字字串的 hash function，不管字串長度為何，都可以得到一個 unsigned int 型態的 hash value，djb2 的原型如下：










    unsigned long hash(unsigned char *str)
    {
        unsigned long hash = 5381;
        int c;

        while (c = *str++)
            hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

        return hash;
    }

根據我們所需，修改成以下版本，並導入 hashTable 取餘數的想法：











unsigned int djb2Hash(char lastName[])
{
    unsigned long hash = 5381;
    int c;
    
    while ((c = *lastName++)) {
        hash = (( hash << 5 ) + hash ) + c; // hash*33+c
    }
    
    return (unsigned int) hash % HASH_TABLE_SIZE;
}

實做想法

我們會設法建立一個 hash table，先對這些 hash value 進行分組。目前作業提供的 words.txt 資料約有 35萬筆，最理想的狀況是，我們可以平均地分成若干組，每一組都有共同的 index，當我們要查詢某個名字時也先利用共同的 hash function 求出 index，在到某一組去搜尋，就可以免去從頭到尾慢慢搜尋的窘狀。那要分成幾組呢？根據以前有學到 k-means clustering，若在一個理想狀況下，最適合的組數就是直接對資料筆數開根號。目前我們的資料有 35萬筆，所以開根號後為 591.x，我就先取個 591 組來試試看。

目前使用的 djb2 的 output 為 unsigned int，我們利用 hash value 取餘數的方式來分組，創立了一個 hashTable 的 array，而 index 為 hash value 除以 HASH_TABLE_SIZE 的餘數。該 array 為其中一組符合某 index 的 entry。根據 entry 的特性，可以利用 linklist 把整串符合某 index 的 entry 通通串起來。感覺大概如下圖：

那怎麼樣建立這個 hashTable 呢？將資料帶入 hashTable 是由 append 方法處理的。先利用 hash function 求出 hash value，並取餘數求出 hashIndex。而放入資料的方式，類似 stack 的方式，放在 hashTable 裡面的 entry，是最後一筆放進去的資料，如果又有心的資料要進來，則需要進行 push。概念如附圖。























entry *append(char lastName[], entry *e)
{
    
    // save the new entry first
    entry *newEntry = (entry *) malloc(sizeof(entry));
    strcpy(newEntry->lastName, lastName);
    
    // calculate the hash value
    unsigned int hashIndex = djb2Hash(lastName);
    
    // check whether the hashTable is empty, create or lineup
    if(hashTable[hashIndex] == NULL) {
    
        hashTable[hashIndex] = newEntry;
        hashTable[hashIndex]->pNext = NULL;
    } else {
    
        newEntry->pNext = hashTable[hashIndex];
        hashTable[hashIndex] = newEntry;
    }
    
    return newEntry;
}

接著就是進行 findName()，找尋名字的動作。基本上就是根據 hashIndex 先到某組資料的頭，開始依序的去找尋組內的名字比對。後續流程跟原始的 findName() 相似































entry *findName(char lastName[], entry *pHead)
{
    // choose which hashtable to find
    unsigned int hashTarget = djb2Hash(lastName);
    entry *probe = NULL;
    
    if ( hashTable[hashTarget] == NULL ) {
        return NULL;
    } else {
        probe = hashTable[hashTarget];
    }
    
    while ( probe != NULL ) {
    
        if (strcasecmp(lastName, probe->lastName) == 0)
            return probe;
         
        probe = probe->pNext;
    }
    
    return NULL;
    
    
    
    while (pHead != NULL) {
        if (strcasecmp(lastName, pHead->lastName) == 0)
            return pHead;
        pHead = pHead->pNext;
    }
    return NULL;
}

Reference

Result

Original

size of entry : 136 bytes
average execution time of append() : 0.070686 sec
average execution time of findName() : 0.006691 sec

perf stat --repeat 100 \
	-e cache-misses,cache-references,instructions,cycles \
	./phonebook_orig

 Performance counter stats for './phonebook_orig' (100 runs):

         2,019,235      cache-misses              #   91.440 % of all cache refs      ( +-  0.19% )
         2,208,271      cache-references                                              ( +-  0.15% )
       261,401,917      instructions              #    1.37  insns per cycle          ( +-  0.04% )
       191,014,931      cycles                                                        ( +-  0.28% )

       0.099477194 seconds time elapsed

Reduce Structure Size

size of entry : 32 bytes
average execution time of append() : 0.054300 sec
average execution time of findName() : 0.003689 sec

perf stat --repeat 100 \
	-e cache-misses,cache-references,instructions,cycles \
	./phonebook_opt

 Performance counter stats for './phonebook_opt' (100 runs):

           345,704      cache-misses              #   61.357 % of all cache refs      ( +-  0.20% )
           563,430      cache-references                                              ( +-  0.26% )
       244,249,489      instructions              #    1.71  insns per cycle          ( +-  0.03% )
       142,634,916      cycles                                                        ( +-  0.15% )

       0.080214917 seconds time elapsed                                          ( +-  2.47% )

Hash Function

size of entry : 32 bytes
average execution time of append() : 0.083085 sec
average execution time of findName() : 0.000000 sec

perf stat --repeat 100 \
	-e cache-misses,cache-references,instructions,cycles \
	./phonebook_hash

 Performance counter stats for './phonebook_hash' (100 runs):

           295,119      cache-misses              #   73.568 % of all cache refs      ( +-  0.07% )
           401,152      cache-references                                              ( +-  0.38% )
       244,445,915      instructions              #    1.63  insns per cycle          ( +-  0.04% )
       149,928,975      cycles                                                        ( +-  0.51% )

       0.085473493 seconds time elapsed                                          ( +-  2.30% )

PLOT

以結果論，雖然 findName() 的速度上，幾乎是不用花到任何時間。但在一開始建立 hashTable 時需要耗費較多的時間成本。故總耗費時間成本為 append() 時間 + N * findName() 時間，N 為搜尋名字的次數，N 越大才能彰顯出 hash function 方式的威力。

different dataset （老師課堂解說待改善）

目前的 dataset 是依照字母排序的，不符合現在真實情況。
未來要新增名字刪減功能，若有用到 hash function，可能會產生碰撞 (hash collision)。

Reference

舊版筆記

2016q3 Homework1 / phonebook

tags: jserv

電腦環境

github 指令

改善方式

reduce structure size

OPTION Control

Hash Function

前置修改

hash function 簡介

實做想法

Reference

Result

Original

Reduce Structure Size

Hash Function

PLOT

different dataset （老師課堂解說 待改善）

Reference

tags: `jserv`

different dataset （老師課堂解說待改善）