2016q3 Homework1 / phonebook

2016q3 Homework1 / phonebook === ###### tags: `jserv` Date: 2016/09/30 contribute by <`hankgo`> + [github](https://github.com/hankgo/phonebook) + youtube --- # 電腦環境 + OS: Lubuntu 16.04 (upgrade from 15.10) 因為舊的版本無法用內建的圖形介面升級，說缺少元件，所以使用終端機輸入指令升級。[參考網址](http://askubuntu.com/questions/765648/upgrading-from-15-10-to-16-04-apt-not-installed) ```= sudo apt-get update sudo apt-get upgrade sudo apt-get -f install # (not "install -f"!) sudo apt-get -y install apt sudo do-release-upgrade ``` + CPU: [Intel i3-2350m](http://ark.intel.com/zh-tw/products/53438/Intel-Core-i3-2350M-Processor-3M-Cache-2_30-GHz) ( 2 core / 4 thread ) + 使用指令 `$ lscpu` + L1 data cache = 32K + L1 instruction cache = 32K + L2 cache = 256K + L3 cache = 3072K + RAM: 8GB # github 指令 + [參考資料](http://blog.gogojimmy.net/2012/01/17/how-to-use-git-1-git-basic/) + `$ git clone xxxx.git` fork 之後下載檔案 + `$ git status` 看狀態 + `$ git add [filename]` 將檔案加入追蹤清單 + `$ git add .` 將全部檔案加入追蹤清單 + `$ git add -i` 互動模式，用挑選的加入清單 + `$ git reset [filename]` 特定檔案從 stage 移開 (移開追蹤) + `$ git reset .` 檔案從 stage 移開 + `$ git commit` commit，事後會問你要加入什麼訊息 + `$ git commit -m "message"` 一併將 message commit + `$ git commit -v` 顯示檔案變動增減記錄 + 如何修改/取消上一次的 commit + `$ git commit --amend` 修改上一次的 commit 訊息。 + `$ git commit --amend [filename1] [filename2]...` 將檔案1、檔案2加入上一次的 commit。 + `$ git reset HEAD^ --soft` 取消剛剛的 commit，但保留修改過的檔案。 + `$ git reset HEAD^ --hard` 取消剛剛的 commit，回到再上一次 commit的乾淨狀態。 + `git log` 查看過去 commit 的紀錄 + `git log -stat` 看到更詳盡的訊息 + `git log -p` 看到檔案更詳細的變更內容 + 基本流程：修改檔案 => 加入 stage (git add) => 提交( git commit )=> 繼續修改其他檔案 => 上傳到 github (git push) # 改善方式 ## reduce structure size + phonebook_orig.h ```= /* original version */ typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; struct __PHONE_BOOK_ENTRY *pNext; } entry; entry *findName(char lastName[], entry *pHead); entry *append(char lastName[], entry *e); #endif ``` + phonebook_opt.h ```= typedef struct __DETAIL_ENTRY { char firstName[16]; char email[16]; char phone[10]; char cell[10]; char addr1[16]; char addr2[16]; char city[16]; char state[2]; char zip[5]; } detailEntry; typedef struct __PHONE_BOOK_ENTRY { char lastName[MAX_LAST_NAME_SIZE]; detailEntry *detail; struct __PHONE_BOOK_ENTRY *pNext; } entry; ``` 因為目前的搜尋只會使用到 lastName 而已，所以先將其他的欄位用別的 detailEntry structure 包起來，原始的 __PHONE_BOOK_ENTRY 只剩下 lastName、pNext、還有 detailEntry 這樣，名字搜尋到之後，真的有需要再去 detailEntry 去撈資料。如此一來，原始的 structure 的大小就會縮減，cache可以塞的 structure 的數量會增加。這個嘗試之中，尚未去更動 phonebook_orig.c 的內容， phonebook_opt.c 的內容目前是跟 phonebook_orig.c 的內容是相同的，沒有做太大更動。 ### OPTION Control 改好了之後，跑了數次發現怎麼圖表都是一樣的。發現我的 Program 沒有把 opt.txt 印出來，再看看哪邊有跟 opt.txt 有關係，發現以下 code，兩個 orig 跟 opt 版本都是共用一個 main，是利用 OPT 的 define 狀況來控制適用哪一個 output file，未來要實作 hash optimized 的版本，也可能需要適當的修改這段 code 來共用 main.c。所以這個 #if define(OPT) 是關鍵 ```= #if defined(OPT) output = fopen("opt.txt", "a"); #else output = fopen("orig.txt", "a"); #endif ``` 一開始在 main.c 找了好久找不到，經過很長一段時間發現 phonebook_opt.h 有下列這段 code，只是一開始 clone 的時候是被註解的。解除註解就順利產生 opt.txt 也順利做出第一個改善效能的成果。 ``` #define OPT 1 ``` 未來要再增加使用 hash function 的效果，則需要新增新的 OPT 定義和輸出，故修改如下 (MAKEFILE 等也需要一併修改)：參考資料： + [#ifndef, #define, #endif的用法(整理)](http://huenlil.pixnet.net/blog/post/24339151) + [ruby0109 github](https://github.com/ruby0109/phonebook/blob/master/main.c) ## Hash Function ### 前置修改我將第二種優化的方式寫在新的檔案 phonebook_hash，然後共用同一個 main.c。根據觀察 makefile 等檔案的結果，其他的檔案需要做對應的調整 + Makefile：要增加 CFLAGS_hash、增加 phonebook_hash 的 compile (EXEC、phoebook_hash 的動作)、增加 phonebook_hash 的 perf test、make clean 要記得清 phonebook_hash 所產生的 hash.txt + main.c：利用 define OPT 來增加 output.txt 檔名 switch 的支援 (opt_hash.txt) ``` #if defined(OPT) output = fopen("opt.txt", "a"); #elif defined(HASH) output = fopen("hash.txt", "a"); #else output = fopen("orig.txt", "a"); #endif ``` + calculate.c：增加 phonebook_hash 的時間計算 + scripts/runtime.gp：增加屬於 hash 方式的圖表 ### hash function 簡介 hash function 常用於密碼學。常見的函數有一對一、多對一，一對一的函數可以求出其函數的反函數 (可以根據輸出結果反推輸入值)。而多對一的函數則沒辦法有反函數。密碼學有時候會利用 hash function 來製作真實密碼的部分特徵。hash function 其實就是對原始輸入字串做一連串的特殊函數運算，輸出一個固定長度的結果，也就是原始輸入字串的特徵。hash function 個人覺得重要的特性如下： + 不容易從輸出的資訊反推回原始資訊 + 相同輸出資訊，但由兩組以上相同資訊的字串輸入到 hash function 的機率很低，若有則為 hash collision，以下為範例： ``` hash function h(x) = x ^ 2 h(5) = 25; h(-5) = 25，這時候就有 hash collision ``` 所以根據上述特性，hash function 函數的設計上，要盡量不容易被反推，且要注意避免 collisiion。根據前人和同學的心得，我也先嘗試了 djb2，是個專門處理文字字串的 hash function，不管字串長度為何，都可以得到一個 unsigned int 型態的 hash value，djb2 的原型如下： ```= unsigned long hash(unsigned char *str) { unsigned long hash = 5381; int c; while (c = *str++) hash = ((hash << 5) + hash) + c; /* hash * 33 + c */ return hash; } ``` 根據我們所需，修改成以下版本，並導入 hashTable 取餘數的想法： ```= unsigned int djb2Hash(char lastName[]) { unsigned long hash = 5381; int c; while ((c = *lastName++)) { hash = (( hash << 5 ) + hash ) + c; // hash*33+c } return (unsigned int) hash % HASH_TABLE_SIZE; } ``` ### 實做想法我們會設法建立一個 hash table，先對這些 hash value 進行分組。目前作業提供的 words.txt 資料約有 35萬筆，最理想的狀況是，我們可以平均地分成若干組，每一組都有共同的 index，當我們要查詢某個名字時也先利用共同的 hash function 求出 index，在到某一組去搜尋，就可以免去從頭到尾慢慢搜尋的窘狀。那要分成幾組呢？根據以前有學到 k-means clustering，若在一個理想狀況下，最適合的組數就是直接對資料筆數開根號。目前我們的資料有 35萬筆，所以開根號後為 591.x，我就先取個 591 組來試試看。目前使用的 djb2 的 output 為 unsigned int，我們利用 hash value 取餘數的方式來分組，創立了一個 hashTable 的 array，而 index 為 hash value 除以 HASH_TABLE_SIZE 的餘數。該 array 為其中一組符合某 index 的 entry。根據 entry 的特性，可以利用 linklist 把整串符合某 index 的 entry 通通串起來。感覺大概如下圖： ![](http://i.imgur.com/AcCy3rb.jpg) 那怎麼樣建立這個 hashTable 呢？將資料帶入 hashTable 是由 append 方法處理的。先利用 hash function 求出 hash value，並取餘數求出 hashIndex。而放入資料的方式，類似 stack 的方式，放在 hashTable 裡面的 entry，是最後一筆放進去的資料，如果又有心的資料要進來，則需要進行 push。概念如附圖。 ```= entry *append(char lastName[], entry *e) { // save the new entry first entry *newEntry = (entry *) malloc(sizeof(entry)); strcpy(newEntry->lastName, lastName); // calculate the hash value unsigned int hashIndex = djb2Hash(lastName); // check whether the hashTable is empty, create or lineup if(hashTable[hashIndex] == NULL) { hashTable[hashIndex] = newEntry; hashTable[hashIndex]->pNext = NULL; } else { newEntry->pNext = hashTable[hashIndex]; hashTable[hashIndex] = newEntry; } return newEntry; } ``` ![](http://i.imgur.com/Wu5DW2T.jpg) 接著就是進行 findName()，找尋名字的動作。基本上就是根據 hashIndex 先到某組資料的頭，開始依序的去找尋組內的名字比對。後續流程跟原始的 findName() 相似 ```= entry *findName(char lastName[], entry *pHead) { // choose which hashtable to find unsigned int hashTarget = djb2Hash(lastName); entry *probe = NULL; if ( hashTable[hashTarget] == NULL ) { return NULL; } else { probe = hashTable[hashTarget]; } while ( probe != NULL ) { if (strcasecmp(lastName, probe->lastName) == 0) return probe; probe = probe->pNext; } return NULL; while (pHead != NULL) { if (strcasecmp(lastName, pHead->lastName) == 0) return pHead; pHead = pHead->pNext; } return NULL; } ``` ### Reference + [函數的概念](https://market.cloud.edu.tw/content/junior/math/ch_yl/content/math4_1.html) + [密碼系統介紹](http://120.105.184.250/cswang/thit/Security/slides/密碼系統.ppt) + [djb2](http://www.cse.yorku.ca/~oz/hash.html) --- # Result ## Original + size of entry : 136 bytes + average execution time of append() : 0.070686 sec + average execution time of findName() : 0.006691 sec ``` perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./phonebook_orig ``` ``` Performance counter stats for './phonebook_orig' (100 runs): 2,019,235 cache-misses # 91.440 % of all cache refs ( +- 0.19% ) 2,208,271 cache-references ( +- 0.15% ) 261,401,917 instructions # 1.37 insns per cycle ( +- 0.04% ) 191,014,931 cycles ( +- 0.28% ) 0.099477194 seconds time elapsed ``` ## Reduce Structure Size + size of entry : 32 bytes + average execution time of append() : 0.054300 sec + average execution time of findName() : 0.003689 sec ``` perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./phonebook_opt ``` ``` Performance counter stats for './phonebook_opt' (100 runs): 345,704 cache-misses # 61.357 % of all cache refs ( +- 0.20% ) 563,430 cache-references ( +- 0.26% ) 244,249,489 instructions # 1.71 insns per cycle ( +- 0.03% ) 142,634,916 cycles ( +- 0.15% ) 0.080214917 seconds time elapsed ( +- 2.47% ) ``` ## Hash Function + size of entry : 32 bytes + average execution time of append() : 0.083085 sec + average execution time of findName() : 0.000000 sec ``` perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./phonebook_hash ``` ``` Performance counter stats for './phonebook_hash' (100 runs): 295,119 cache-misses # 73.568 % of all cache refs ( +- 0.07% ) 401,152 cache-references ( +- 0.38% ) 244,445,915 instructions # 1.63 insns per cycle ( +- 0.04% ) 149,928,975 cycles ( +- 0.51% ) 0.085473493 seconds time elapsed ( +- 2.30% ) ``` ## PLOT ![](https://i.imgur.com/ipBhqHw.png) 以結果論，雖然 findName() 的速度上，幾乎是不用花到任何時間。但在一開始建立 hashTable 時需要耗費較多的時間成本。故總耗費時間成本為 append() 時間 + N * findName() 時間，N 為搜尋名字的次數，N 越大才能彰顯出 hash function 方式的威力。 --- ## different dataset （老師課堂解說待改善） + 目前的 dataset 是依照字母排序的，不符合現在真實情況。 + 未來要新增名字刪減功能，若有用到 hash function，可能會產生碰撞 (hash collision)。 --- # Reference + [舊版筆記](https://embedded2016.hackpad.com/HW-01-QX2MTGoK0j7)