2018q1 Homework1 (phonebook)

contributed by <e94046165>

Reviewed by `BodomMoon`

*findName中的排版可謂驚世駭俗，我很好奇這樣真的能通過Astyle嗎？
HashFunction的部份可考慮是否要追加分佈平均數、極值、標準差等數據，可以讓整體除了看圖說故事以外更有說服力
可以再更加細分數值，圖表中兩個 HashFunction 所對應的 0.000005 還是有差別的，可以將搜尋次數增加或是用更加精細的數值代表

系統環境

$uname -a
Linux acnes-X550JX 4.13.0-32-generic 
#35~16.04.1-Ubuntu SMP Thu Jan 25 10:13:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$lArchitecture:          x86_64
CPU 作業模式：    32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
每核心執行緒數：          2
每通訊端核心數：          4
Socket(s):             1
NUMA 節點：             1
供應商識別號：           GenuineIntel
CPU 家族：              6
型號：                  60
Model name:            Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz
製程：                  3
CPU MHz：              2594.040
CPU max MHz:           3600.0000
CPU min MHz:           800.0000
BogoMIPS:              5188.08
虛擬：                  VT-x
L1d 快取：              32K
L1i 快取：              32K
L2 快取：               256K
L3 快取：               6144K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts

phonebook.orig 各項數據

execution time of append() : 0.037812 sec
execution time of findName() : 0.005215 sec
size of entry : 136 bytes

Performance counter stats for './phonebook_orig' (100 runs):

           546,984      cache-misses              #   69.195 % of all cache refs      ( +-  1.05% )
           790,500      cache-references                                              ( +-  0.25% )
       278,677,063      instructions              #    1.42  insn per cycle           ( +-  0.02% )
       196,434,047      cycles                                                        ( +-  0.27% )

       0.058225744 seconds time elapsed                                          ( +-  0.31% )

減小 size of entry

因 findName 和 append 函式只會用到 struct 中 lastName 這個元素，也就是說其他元素雖然也被分配了空間可是實際上並沒有用到。因此我將原先包含了所有資料的 entry 分成兩個，如下所示：

typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    struct __PHONE_BOOK_DETAIL *details;
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

typedef struct __PHONE_BOOK_DETAIL {
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
} detail;

其中，entry 只包含了必要的 lastName 資訊和指向 details 的指標
如此一來可以降低 size of entry (從136降至32)也順帶降低了 cache miss
在HMKRL的共筆中有讀到，可以加入 __attribute__((__packed__))進一步使 entry size 減少，但 cache miss 和搜尋時間並未隨之改善，故留待之後再做測試。(已補上測試)


Performance counter stats for './phonebook_opt' (100 runs):

           124,032      cache-misses              #   43.280 % of all cache refs     
           286,579      cache-references       
       258,234,106      instructions              #    1.94  insn per cycle         
       132,903,290      cycles                 

       0.039922422 seconds time elapsed

make plot 結果：

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

可以由圖中看出，儘管只是對 struct 的成員做了些許更改，time performance 已經大幅的改善。我想以後我在必須放更多心思在決定"資料的必要性"上，在一個 struct 內塞入大量的資料雖然方便，但可能造成不必要的記憶體浪費和時間成本。

同時，在思考如何減少 entry size 時，我注意到了
#define MAX_LAST_NAME_SIZE 16
跑過簡單的函式後確定字典檔最長的單字為12位元

int longest(char lastName[]){
	if(strlen(lastName)>length){
		length = strlen(lastName);
	}
}

但在我將MAX_LAST_NAME_SIZE由 16 改成 13 時，entry size 並沒有如預期的減少，猜測在 memory allocate 的時候並不是以 1 byte 為單位，可能是 4 或 8 bytes。

確認 alignment 的影響，並且試圖驗證你的猜想

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

jserv

在此同時我也發現了另外一個問題，如果這個 MAX_LAST_NAME_SIZE是一個定值，且大於最長的 LastName 長度，那勢必會造成許多不必要的記憶體浪費，想想（名字）有夠長的黑人…

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

記憶體對齊（alignment）

參考資料OPASS'S BLOG發現：為了提高存取效率，電腦的記憶體通常會設計成以 word-sized 進行存取，現代的系統通常以4bytes(32bit)、8bytes(64bit)作為一個 word size。
而對齊的理由如下：
<未對齊的情況>

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

3個不同的顏色分別代表一筆 char, short, int 資料，當要讀取 int 時，就必須讀取兩個 word，經過 shift 後才能得到正確的資料

<對齊的情況>

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

若在 char 之後加一個 padding byte，使 int 符合4-bytes 對齊，此時存取就簡單多了，只要進行一次memory access。
在加入了__attribute__((packed))後可以看到 entry size 不再是 8 的倍數(8-byte aligned)，entry size 隨著更改 MAX_LAST_NAME_SIZE 而減少

typedef struct __attribute__((packed)) __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    struct __PHONE_BOOK_DETAIL *additional_details;
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

size of entry : 29 bytes

Performance counter stats for './phonebook_opt' (100 runs):
129,992      cache-misses              #   43.776 % of all cache refs

make plot 結果：

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

__attribute__有許多參數可用，其中packed常用在對記憶體必須斤斤計較的嵌入式系統中，使用該屬性可以使得變量或者結構體成員使用最小的對齊方式，即對變量是一字節對齊(參考資料)將資料塞進有限的記憶體當中。

但可以看到 cache miss 沒有降低，可能因為 cpu 讀取記憶體還是依尋本來的 8-bytes aligned 的關係，考慮到時間表現，之後的實驗會移除該__attribute__((packed))設定。

Hash Search

以前曾經寫過幾個 hash search 小程式，不過並未認真考慮 hash function 是否適合資料，自己定義的 hash function 相當隨性。沒有認真考慮過定義的 key 與 index 的關係是否會造成過多的 collision 而拖慢效能，只是隨意更改參數的組合，挑一個看起來不錯的 performance 就草草交件了。
這次將試著比較幾個不同的 hash function 轉換後的資料（bucket）分佈情形，並試著更改參數使得效能能儘可能最佳化。

Hash function-djb2

參考資料網站中提到這是 "…is one of the best string hash functions i know" 就先拿這個 hash function 來試試效果如何吧

unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    eturn hash;
}

以下是配合 hash function 調整過的 findName 與 append

entry *findName(char lastName[], entry **pHead)
{
	//printf("get into find func\n");
	entry *e = pHead[hash(lastName)%HASH_TABLE_SIZE];
        while (e != NULL) {
            if (strcasecmp(lastName, e->lastName) == 0)
                return e;
        e = e->pNext;
    }
    return NULL;
}

entry *append(char lastName[], entry **pHead)
{
    /* allocate memory for the new entry and put lastName */
    unsigned long index = hash(lastName)%HASH_TABLE_SIZE;
    
    entry *e = (entry *) malloc(sizeof(entry));
    e->pNext = pHead[index]->pNext;
    pHead[index]->pNext = e;
    strcpy(e->lastName, lastName);

    return e;
}

make plot 結果：
這次使用的 hash table 大小為 1024，可以看到結果相當不錯，findName 時間非常短。
但在我想要改變 hash table 大小比較時間表現的時候卻發生了神奇的事：不論我把 hash table 大小調成 100, 10, 1 搜尋時間都沒有絲毫增加。理論上在 hash_table_size == 1 時應該要退化成跟 origin 版本一樣從頭做線性搜尋一樣才對啊…？
在檢視了 main.c 檔後發現，原來預設搜尋的 lastName 為 "zyxels"，這個個單字在字典檔中排在接近最末端，也就是說就算在 hash_table_size = 1 的情況下，還是會被排在前幾個，很快就會被檢索到了。看來我以後在選取測資上要更加小心才行
以下將搜尋的 lastName 改為 "inlander" 並比較不同 hash table size 的時間表現差異
make plot 結果

不出所料，append 所花的時間與 table size 沒有太顯著的關聯，但 findName 隨著 table size 的增加而減少。不過我倒是在研究 gnuplot 與 Makefile 花上許多時間，相信熟練了之後他們會是相當方便且有用的工具。

Hash function- sdbm

unsigned long sdbm(unsigned char *str)
{
    unsigned long hash = 0;
    int c;

     while (c = *str++)
         hash = c + (hash << 6) + (hash << 16) - hash;

         return hash;
}

make plot 與 djb2 做比較

table size = 128

table size = 512

本來想直接下結論：djb2 比 jdbm 稍微強大一點
但是我發現這樣說是不公平的，因為我們搜尋的是固定的 lastName ，可能對搜尋這一筆資料而言，其中一個的時間表現會優於另一個，但是對其他資料就不一定了。實際上在換了欲搜尋的 lastName 後發現很難說哪一個的表現比較好，這也延伸出了資料分佈的問題。
…看來又要繼續研究 gnuplot 了
cache miss

 Performance counter stats for './phonebook_hash_jdbm' (100 runs):

           600,824      cache-misses              #   68.733 % of all cache refs      ( +-  0.26% )
           874,139      cache-references                                              ( +-  0.31% )
       289,762,742      instructions              #    1.31  insn per cycle           ( +-  0.02% )
       220,873,272      cycles                                                        ( +-  0.19% )

       0.066087823 seconds time elapsed                                          ( +-  0.20% )

Distribution of Hash table

上圖為使用 djb2 作為 hash function 得到的 hashtable 雖然用折線圖畫出來很醜，不過大致可以看出來每個 Index 值對應到280~400個節點，而 jdbm …資料則較為分散

由上面兩張圖可以看出來，選擇以 djb2 作為 hash function ，搜尋隨機一筆資料所花時間的最大最小值差異應較 jdbm 小。且出現極端情況的機率較低。

問題討論

回答以下問題：本例選用的 dataset (也就是輸入電話簿的姓名) 是否存在問題？

有代表性嗎？跟真實英文姓氏差異大嗎？
Ans:
與真實英文姓氏差異頗大，除了資料已經經過排序（開頭a~z)以外，所有資料也都不重複。資料已經經過排序這點對部份的演算法有所影響，對 append 時間也有影響。而真實的電話簿可能有重複的姓名，這時就必須再加入 firstName 等資訊，才能獲得正確的資料。
資料難道「數大就是美」嗎？
Ans:
不一定，像是在本例中以字典檔代替真實的電話簿，以字典檔實測找出的最佳演算法，可能在真實情境中的效能極差。因為資料的差異性使得字典檔的資料數目再多，也會與電話簿有些許偏差。因此在增加資料的數量同時，也必須考慮資料的正確性才行。
如果我們要建立符合真實電話簿情境的輸入，該怎麼作？
Ans:
真實情境中通常不會在 findName 前一刻才進行 append，真正電話簿的資料應該已經經過 append 以符合特定的搜尋演算法。所以我認為應該可以以時間換取空間，也就是如果花費稍微多一點 append 時間可以減少 memory 花費，這樣的選項是可以考慮的（字串壓縮等）
再者，真實電話簿資料必須考慮動態輸入的可能，所以必須定期檢查演算法的設計，是否因為新加入的資料而效能變差。例如建立 hash table 時可能各個 index 底下的節點數差異不大，但若刻意加入多筆相同 index 的資料將使效能變慢。這時可能就需要重新觀察經過 hash function 之後的資料分佈，找出相對適合的 function。

Reference

HMKRL的共筆
 ryanpatiency給初學者的建議
 OPASS'S BLOG
hash function

2018q1 Homework1 (phonebook)

Reviewed by BodomMoon

系統環境

phonebook.orig 各項數據

減小 size of entry

記憶體對齊（alignment）

Hash Search

Hash function-djb2

Hash function- sdbm

問題討論

Reference

Reviewed by `BodomMoon`