2017q1 Homework1 (phonebook)

contributed by <vtim9907>

固定格式請記得! 課程助教

Reviewed by `hugikun999`

hash table 有許多種，可以比較各種之間的差異性。
圖表部分的數據可以調整位置，不然疊再一起不容易觀看
考慮所輸入的文字檔是在不事先排序的情況下的效能。
為了更貼近現實可以新增刪除 entry 的功能。
考慮新增的姓名已經存在電話簿，要如何避免重覆製造 entry。

Reviewed by `henry0929016816`

bkdr hash 的乘法可以使用，被乘數左移 5位再減被乘數的方式，雖然結果同樣是跟原本被乘數乘以 31 的值一樣，但是執行速度有差，由於電腦左一的速度很快。
其實網路上都有寫到 hash table size 不要為偶數，然而我有點疑問，2³ = 8 , 2³-1 = 7 ，照裡來說 n%8 的結果為 0~7 ，而 n%7 結果為 0~6 ，可以看到 8 比 7 分的還要均勻，希望可以用圖表給出 2ⁿ-1 比 2ⁿ 還要好的證明。

開發環境

OS : Ubuntu 16.04.2 LTS
Kernel Version : 4.8.0-36-generic
Memory : 32 GB
CPU : Intel® Core™ i5-6600 Processor
- cache :
  - L1i cache : 4*32 KB
  - L1d cache : 4*32 KB
  - L2 cache : 4*256 KB
  - L3 cache : 6144 KB
- L1 cache imformation ( sudo dmidecode -t cache ) :

Handle 0x001E, DMI type 7, 19 bytes
Cache Information
	Socket Designation: L1 Cache
	Configuration: Enabled, Not Socketed, Level 1
	Operational Mode: Write Back
	Location: Internal
	Installed Size: 256 kB
	Maximum Size: 256 kB
	Supported SRAM Types:
		Synchronous
	Installed SRAM Type: Synchronous
	Speed: Unknown
	Error Correction Type: Parity
	System Type: Unified
	Associativity: 8-way Set-associative

用簡單 command 查詢系統部份資訊
- 查 ubuntu 版本 ( -a 會列出相關資訊，若只查版本可以用 -r )
  $ lsb_release -a
- 查 kernel 版本 ( -a 會列出相關資訊，若只查版本可以用 -r )
  $ uname -a
- 查 cache 資訊
  $ sudo dmidecode -t cache
- 查 cpu 資訊
  $ lscpu
  
  比較特別的地方是，用 $ sudo dmidecode -t cache 和 $ lshw 所查到的 cache 資訊吻合，都顯示我 L1 cache 是 128 KB，但是用 $ lscpu 所查到的 cache 資訊卻不相同，顯示我的 L1 cache 只有 32 KB。
  所以我去看了我 cpu 的規格 i5-6600 ，結果我想是因為四核心的關係，每顆核心的 L1i cache 和 L1d cache 依然都還是 32KB，所以所有指令顯示的都沒錯，只是表達的東西不太一樣。

前置作業

孔子說：「工欲善其事必先利其器」，所以開始之前，決定先整理一下作業環境。

安裝好用的截圖工具 ––– Shutter

其實內建有一個螢幕截圖可以用 =w=
課程助教

原來如此XD 一開始不知道就直接裝 Shutter 了＠＠不過滿好用的～徐偉庭
~~我也是用shutter，助教很兇0.0~~亮谷

sudo add-apt-repository ppa:shutter/ppa
sudo apt-get update
sudo apt-get install shutter

修改 vim 的一些基本設定，還有最重要的換顏色！
由於預設的顏色實在有點黯淡，看的不太習慣，決定換點好看的顏色。
- 可以先到這裡找找看有沒有喜歡的 colorscheme
- 把載下來的東西放到 ~/.vim/colors/ 底下（如果沒有該目錄，就先建出來）
- 到 ~/.vimrc 裡設定 colorscheme <載下來的 color scheme 名稱>
- 我 color scheme 主題換 molokai

phonebook 優化

優化前

執行

$ perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./phonebook_orig

得到結果：

size of entry : 136 bytes
execution time of append() : 0.035637 sec
execution time of findName() : 0.004737 sec

cache-miss 的相關資訊：

 Performance counter stats for './phonebook_orig' (10 runs):

          445,2933      cache-misses              #   89.849 % of all cache refs      ( +-  0.37% )
          495,6032      cache-references                                              ( +-  0.35% )
          260,8870      L1-dcache-load-misses                                         ( +-  0.26% )
   <not supported>      L1-dcache-store-misses                                      
   <not supported>      L1-dcache-prefetch-misses                                   
           14,5130      L1-icache-load-misses                                         ( +-  2.06% )

       0.048939923 seconds time elapsed                                          ( +-  1.70% )

Method 1 : 縮小 struct size

參考了教材 Programming Small ，我想先針對改善 cache miss 下手； cache 是很珍貴的資源，所以通常 size 都不大，在這樣的情況下我想先縮小 phonebook 的 struct ，以減少 cache miss 的發生。

是 cache 喔! =w= 課程助教

阿…打錯一堆＠＠感謝提醒！徐偉庭

原本的 struct 總共有 136 bytes，因為我的 cache 只有 256 KB, 所以總共可以存約 32 * 1024 / 136 = 240.9 筆資料，在讀取大量資料時，還是很容易發生海量的 cache miss，所以我把 struct 刪減成 32 bytes :





typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    entry_detail *detail;
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

因為題目只需要搜尋 lastName ，所以可以把 struct 刪減到只剩 32 bytes，其中當然保留了原來的 lastName 還有 pNext ，除此之外還加了一個 detail 的指標，用來指回原來完整的 struct，不過這裡我並沒有 malloc 空間給它，因為再多做 malloc 勢必會產生更多的 cache miss，但畢竟 dictionary 裡只有 lastName 的資料，那我就省下來，暫且不管其他資料，等有需要用到其他更細部的資料時，再用其他 function 處理。

執行

$ perf stat -r 10 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-store-misses,L1-dcache-prefetch-misses,L1-icache-load-misses ./phonebook_opt

得到結果

size of entry : 32 bytes
execution time of append() : 0.027508 sec
execution time of findName() : 0.002284 sec

append 所耗費的時間減少了 0.08 秒左右，想必是因為 size 變小，malloc 消耗的時間因此變短，進而提昇程式的效能。

 Performance counter stats for './phonebook_opt' (10 runs):

          133,2860      cache-misses              #   59.392 % of all cache refs      ( +-  1.61% )
          224,4169      cache-references                                              ( +-  1.03% )
          129,2047      L1-dcache-load-misses                                         ( +-  0.27% )
   <not supported>      L1-dcache-store-misses                                      
   <not supported>      L1-dcache-prefetch-misses                                   
            9,9346      L1-icache-load-misses                                         ( +- 10.19% )

       0.037362806 seconds time elapsed                                          ( +- 15.04% )

相較於優化前，cache miss 減少了約 70%，而 cache references 減少了約 54.7%，總消耗時間節省了約 23.6%。

Method 2 : Hash table

為了更加優化搜尋的效能，選擇使用 hash table 似乎是個不錯的方法，畢竟搜尋一個 index 比一個一個比對字串還要來得快的多。
而根據各种字符串Hash函数比较內的實驗結果顯示，BKDR hash 相較於其他 hash 方法，在各種情形下似乎比較不容易產生碰撞，所以我選擇使用 BKDR hash 來對字串做 hash 產生建 table 用的索引值。












unsigned int BKDR_hash(char *str)
{
    unsigned int seed = 31;
    unsigned int hash = 0;

    while (*str)
    {
        hash = hash * seed + (*str++);
    }

    return (hash & 0x7FFFFFFF);
}

然而在選擇 table 大小時我遇到了一點困難，我不太確定選擇多大的 table size 比較好，然而我上網看了一些文章，有人說最好挑選測資數量的 1.45 倍，並且是個質數，如此比較不容易發生碰撞；也有一部分的人建議直接用測數字的方式，找出效率最好的數字，因為根據測資的屬性不同，產生的結果就會有出入，似乎沒有多大的 size 最好這回事；不過我可以肯定的是，size 太小則碰撞機率太大，size 太大又有可能會 overflow。
所以我的作法是用

2^{n} - 1

，以改變 n 的方式去測，好處是快速不用特別去找質數，也比直接用

2^{n}

還不容易發生碰撞。

請用 LaTeX 表示數學式
課程助教

n = 11

size of entry : 32 bytes
execution time of append() : 0.051272 sec
execution time of findName() : 0.000001 sec

 Performance counter stats for './phonebook_hash' (10 runs):

          598,2769      cache-misses              #   54.948 % of all cache refs      ( +-  1.47% )
         1088,8151      cache-references                                              ( +-  0.15% )
          486,1986      L1-dcache-load-misses                                         ( +-  0.10% )
   <not supported>      L1-dcache-store-misses                                      
   <not supported>      L1-dcache-prefetch-misses                                   
           14,8238      L1-icache-load-misses                                         ( +-  6.55% )

       0.132551090 seconds time elapsed                                          ( +-  3.08% )

n = 12

size of entry : 32 bytes
execution time of append() : 0.053231 sec
execution time of findName() : 0.000000 sec

 Performance counter stats for './phonebook_hash' (10 runs):

          572,2466      cache-misses              #   53.195 % of all cache refs      ( +-  1.19% )
         1075,7484      cache-references                                              ( +-  0.29% )
          496,5232      L1-dcache-load-misses                                         ( +-  0.07% )
   <not supported>      L1-dcache-store-misses                                      
   <not supported>      L1-dcache-prefetch-misses                                   
           13,9759      L1-icache-load-misses                                         ( +-  5.82% )

       0.134102282 seconds time elapsed

n = 13

size of entry : 32 bytes
execution time of append() : 0.048652 sec
execution time of findName() : 0.000000 sec

 Performance counter stats for './phonebook_hash' (10 runs):

          231,0402      cache-misses              #   52.510 % of all cache refs      ( +-  1.61% )
          439,9904      cache-references                                              ( +-  0.77% )
          241,0954      L1-dcache-load-misses                                         ( +-  0.19% )
   <not supported>      L1-dcache-store-misses                                      
   <not supported>      L1-dcache-prefetch-misses                                   
           13,1062      L1-icache-load-misses                                         ( +-  7.89% )

       0.084454930 seconds time elapsed                                          ( +-  7.07% )

可以看到當 n = 13 ，table size = 2^13 - 1 = 8191，此時的 findName 所消耗的時間已經幾乎等於 0，這的確是我們所預期的，建 hash table 後直接用 index 搜尋的確比每個字串慢慢比對來的快的多。
而 append 所花的時間卻變多了，我想是因為 append 裡做了更多的處理，時間被拉高也是可以預期的。

在我準備 free 掉每條 linked list 的時候，意外發現原來先前 main.c 裡並沒有完整的把 linked list 給 free 掉，會造成 memory leak 的情況，所以我以順手改了一下。

然而在我寫 free list 的 function 時，原本想說用 hash table 後，list 的長度應該會短一些，想用 recursive 的方式來釋放全部的 linked list ，沒想到即使 table 的 size 已經大到 16383 時，仍然會造成 segmentation fault 的情況產生…，看來還是應該乖乖用 loop 的方式來釋放 linked list。

Method 3 : Binary search tree

建一棵 binary search tree，把原來創出來的 linked list 放進創出來的樹裡，理論上來說，因為存資料的結構變成 BST ，所以插入、搜索、刪除的時間複雜度期望為 O(log n)，最差的情況為 O(n)，所以我預期 findName 的速度應該會比未優化前快不少。

實際做出來就是很普通的 BST，是在原版的 linked list 建出來之後，在把 list 轉成 BST，原本有想過在一邊讀字串時就一邊建樹，但感覺這樣會使運算量大幅增加，所以就還是實做把建好的 list 轉成 tree。

實際運行成果：

size of entry : 32 bytes
execution time of append() : 0.038473 sec
execution time of findName() : 0.000000 sec

由於仍然保留了縮減過的 struct，所以 struct 的大小只有 32 bytes。
可以看出 findName 的時間一樣縮減到幾乎為 0 秒，BST 果然起了作用！
而 append 比優化前慢了 0.03 秒左右，這是因為我在 append 之後建樹，把建樹的時間也算在 append 裡了，不過只慢了這麼一點還真出乎我意料＠＠

再來看看 cache 的情況：

 Performance counter stats for './phonebook_bst' (10 runs):

          194,2487      cache-misses              #   72.805 % of all cache refs      ( +-  0.69% )
          266,8058      cache-references                                              ( +-  0.35% )
          156,9015      L1-dcache-load-misses                                         ( +-  0.41% )
   <not supported>      L1-dcache-store-misses                                      
   <not supported>      L1-dcache-prefetch-misses                                   
           11,4174      L1-icache-load-misses                                         ( +-  2.27% )

       0.047925877 seconds time elapsed                                          ( +-  5.29% )

cache-miss 相較優化前少了快三百萬次，但主要的原因我認為是因為 struct 還是維持 32 bytes 的關係。

各版本圖形化比較：

Method 4 : 改變 BST node 結構

由於老師提到 BST 可以再做優化，所以我又再次思考我原來的 BST 有哪裡可以優化，我想為了符合 cache line 的特性，決定把 node 的結構更該成 64 bytes 的大小，這樣一個 node 應該就剛好就在同一條 cache line 上 @@"

原來 node 的結構就只包含兩個分別指到左、右子樹的指標，和一個指到一個 phonebook entry 的指標：





typedef struct __BST_NODE {
    struct __BST_NODE *left;
    struct __BST_NODE *right;
    entry *pEntry;
} bst;

更改後的結構與原來的大致雷同，不同的地方是一個節點就包含了 6 個指到 phonebook entry 的指標，有點類似一個比較大的 node 的感覺：





typedef struct __BST_NODE {
    struct __BST_NODE *left;
    struct __BST_NODE *right;
    entry *pEntry[6];
} bst;

雖然不太確定結果會如何，但是我預期在 append 的時候會稍微快一些，因為拿一個 node 就可以塞 6 個值，而且也不會增加 cache-miss，所以應該會比原來的 append 的 cache-miss 更少、速度更快；而 findName 的部份因為一個點要比對六次，所以時間應該是原來 BST 的六倍，但時間複雜度同樣都還是 O(log n)。

經過實測：

改版的 BST append 的時間果然變少，比原來的 BST 快 0.006 秒，效能提昇約 17% ；而 findName 的時間雖然比較慢，但在搜尋同一筆資料的情況下，時間依然接近 0 秒，畢竟時間複雜度相同，應該不會慢太多。

再來看一下 cache-miss 有沒有減少：
原版

 Performance counter stats for './phonebook_bst' (100 runs):

          187,3471      cache-misses              #   71.267 % of all cache refs      ( +-  0.14% )
          262,8803      cache-references                                              ( +-  0.12% )
       3,2452,4839      instructions              #    2.18  insn per cycle           ( +-  0.01% )
       1,4914,7703      cycles                                                        ( +-  0.06% )

       0.041766708 seconds time elapsed                                          ( +-  0.37% )

新版

 Performance counter stats for './phonebook_bst2' (100 runs):

          169,4826      cache-misses              #   69.828 % of all cache refs      ( +-  0.31% )
          242,7143      cache-references                                              ( +-  0.12% )
       2,5219,4566      instructions              #    1.99  insn per cycle           ( +-  0.02% )
       1,2681,9661      cycles                                                        ( +-  0.13% )

       0.036295141 seconds time elapsed                                          ( +-  0.38% )

雖然覺得這樣修改後仍然還有很大的優化空間，不過幸好這次的想法至少有優化到@@"

問題討論

本例選用的 dataset (也就是輸入電話簿的姓名) 是否存在問題？
1. 有代表性嗎？跟真實英文姓氏差異大嗎？
  - 因為本例選用的 dataset 裡有大多數不具意義的英文字母組合，若要談與真實英文姓名的差異，那想必差異慎大；但如果是 dataset 裡代表的是別名或化名之類的資料，那也許就還滿符合真實情況。
  - 討論到 dataset 的代表性，我認為除了資料與真實英文姓氏有差異外，還有一個重點是資料的重複性，以真實情況來看，資料如果夠多，肯定會有一狗票的重複姓氏出現，但是在本例裡的資料似乎沒有重疊的資料，這樣跟真實情況就差異慎大；總體來說，不太具代表性。
2. 資料難道「數大就是美」嗎？
  - 我不這麼認為，就如前一題探討到的代表性，如果所有的資料都不具代表性，那麼再多也不「美」；讓我想到資料探勘，就算餵進去大量的資料，但如果餵的資料質量不好，那麼不僅效率不好，產出的結果也沒有什麼意義，精通資料視覺化以及統計學專家 Kaiser Fung 在 Good models + Bad data = Bad analysis 一文裡也提到，質量不好的資料就算經過很好的模型分析，最後的分析結果的精確度和參考度都是很差的。
3. 如果我們要建立符合真實電話簿情境的輸入，該怎麼作？
  - 其實網路上有很多姓名產生器，但是可能沒辦法快速的產生大把資料，所以比較快的方法大概是去一些列出大量名字的網站，比如說 List of people by name ，把裡面的人名爬下來，其中裡面的姓氏也有重複，也更符合真實情況。