2017q1 Homework1 (phonebook)

contributed by <hugikun999>

Reviewed by `heathcliffYang`

要尋找的目標資料所在位置不同、運用的 append 、搜尋方法不同，都會有很大的差異，應該也要不同資料點做分析（也就是說找其他的例子來比較），不然無法偵測一些特定的錯誤，像是 hash function 越後面分配的資料，會在越上層，所以你並不會找不到資料，但如果你是找中間偏前的資訊，也許因為你資料分配的漏洞，被丟失，建議改查其他英文字母看看有沒有錯誤，在快之外的前提就是正確性。
可以嘗試不同 dataset，也許在這個實驗好的方法，不一定放諸四海皆準。
增加電話簿更多功能，使它更切合我們的真實情況。

環境架設

對自己的電腦硬體要有一定的認知才可以因為環境的不同而採取對應的策略。

lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 60
Model name:            Intel(R) Core(TM) i7-4712MQ CPU @ 2.30GHz
Stepping:              3
CPU MHz:               2300.269
CPU max MHz:           3300.0000
CPU min MHz:           800.0000
BogoMIPS:              4589.58
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts

1.Byte Order
之前沒有特別注意到這個項目，這個項目有兩種排序方法Big-Endian和Littel-Endian
，所有的資料都是由位元組成，由小到大或由大到小的位元去儲存資料。網路統一使用 Big-Endian，所以如果要從我的電腦傳封包出去就必須因為這之間的差別而做出調整。
Understanding Big and Little Endian Byte Order

$cat /proc/cpuinfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 60
model name	: Intel(R) Core(TM) i7-4712MQ CPU @ 2.30GHz
stepping	: 3
microcode	: 0x1e
cpu MHz		: 2474.117
cache size	: 6144 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts
bugs		:
bogomips	: 4589.58
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

這種方法類似第一種，可是會分別印出每個 porcessor 的資訊，這邊只放了一個 porcessor 的資訊。

memory

memory hierarchy
這邊是一個重要的觀念，記憶體存在著這種階層式的概念，由上到下代表著記憶體大小由小到大，下面的圖片可以清楚看到越下層離CPU愈遠，越上層的 access 速度也愈快。增大 Cache 並不和速度成正比，這當中牽涉到很多問題，例如：Cache allocate 、 Cache miss、Cache size，是一個可以伸入探討的議題。
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
memory
當中有講到 Direct mapped caches 和 Fully Associative caches 的優點和缺點，前者限制較大，每個 block 有相對應的 Cache ，相較起來後者彈性許多，block 可以隨意的放入 cache 中，但是缺點正是因為太過自由而造成檢查的花費過大，為了兼顧兩者所以通常採用 n-way set associative cache的型式，先將 Cache 切成幾個 way 再進行 Cache 實體位置的分配。

astyle

這是一個對於排版非常方便的工具，其提供許多的 coding style 可以從其中挑選自己所喜歡的，保持版面的乾淨、統一是一件非常重要的事，這對於日後的維護會有非常大的幫助，可以省下非常多不必要的時間。下面是相關的網站，可以查到一些參數所代表的風格。
Artistic Style 2.06

phonebook

進度要加快亮谷
收到俊逸

analyze file

#if defined(__GNUC__)
gcc 所提供的 pre-defined macro，會偵測 compiler 是否為 gcc。
__builtin___clear_cache()
拿來清空處理器的 instruction cache。

comparison funcion of reading data

Which is fastest: read, fread, ifstream or mmap?

Overhead on mmap is heavily than read.
Memory maps is faster than randomly access.
Memory map keep using page from memory until you finish your work.However,wiht read,files may have been flushed out memory.
Use memory maps if you access data randomly, keep it around for a long time, or if you know you can share it with other processes.
If your access pattern is sequential, mmap can't work efficiently than other funcion.
Memory map is less robust,any slightest mistake can make your make program crash.
mmap directly access memory.
read fetches the data from disk to page cache and then copy them to position you specify.

Code to test

這裡有提供一個可以測試 mmap 和 fread的程式，我把 word.txt 當成輸入檔並用 $perf sate 重覆測試 100 次，可以發現 mmap 的 cache-missed 低了很大一個幅度，但是總體執行卻沒有比較快。
(1)fread

 Performance counter stats for './tbytesum1 -f words.txt' (100 runs):

              6491      branch-misses                                                 ( +-  0.42% )
            2,1510      cache-misses              #   40.280 % of all cache refs      ( +-  2.39% )
            5,3401      cache-references                                              ( +-  1.02% )
         2206,1482      cycles                                                        ( +-  0.21% )

       0.007478928 seconds time elapsed                                          ( +-  0.36% )

(2)mmap

 Performance counter stats for './tbytesum1 -m words.txt' (100 runs):

              6554      branch-misses                                                 ( +-  0.29% )
              6788      cache-misses              #   29.101 % of all cache refs      ( +-  3.07% )
            2,3325      cache-references                                              ( +-  0.65% )
         2513,0427      cycles                                                        ( +-  0.10% )

       0.008581382 seconds time elapsed                                          ( +-  1.01% )

[ ]比較為何 mmap 會花更多 cycle?

Origin

都尚未做任何的優化以前所測得的數據。

 Performance counter stats for './phonebook_orig' (100 runs):

           95,2355      cache-misses              #   83.915 % of all cache refs      ( +-  0.49% )
          113,4908      cache-references                                              ( +-  0.51% )
       2,5815,4183      instructions              #    1.41  insns per cycle          ( +-  0.02% )
       1,8289,4594      cycles                                                        ( +-  0.22% )

       0.059241425 seconds time elapsed                                          ( +-  0.28% )

減少struct大小

第一個比較淺而易見的是 struct 中有一些在找尋時不必用到的資訊，藉由把這些資料獨立出來放在另一個 struct 可以減少 findname 所花費的時間，因為節省了讀入 struct 的時間。
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Performance counter stats for './phonebook_opt' (100 runs):

            12,3285      cache-misses              #   43.156 % of all cache refs      ( +-  0.94% )
            28,5670      cache-references                                              ( +-  0.99% )
       2,4387,8420      instructions              #    1.99  insns per cycle          ( +-  0.02% )
       1,2257,6740      cycles                                                        ( +-  0.21% )

       0.039012140 seconds time elapsed                                          ( +-  0.26% )

GCC's Attributes

GCC 提供 __attribute 這個關鍵字去指定一些特別的屬性。

aligned
如果在宣告 struct 後面加上__attribute__ ((aligned))，compiler 會在編譯時根據你的 struct size 自動設定一個高於 size 的 alignment。
例如：

  struct S { short f[3]; } __attribute__ ((aligned));

一個 short 是2個位元組，這裡 compiler 就會設定大於6的2的最小次方，意即8 bytes，但是這會有浪費 memory 的情況。GCC 網頁中有提到 packed 可以減少 alignment。

packed
這個屬性可以用在 struct 和 union 上(enum不包括)，它可以在 struct 和 union 中的每個成員被放置在最小的記憶體所需空間中。enum 的作法則是在 command line 中加入 -fshort-enums 這個 flags。

test data

利用__attribute__ ((aligned))

typedef struct __attribute__ ((aligned))  __PHONE_BOOK_DETAIL {
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
} detail;

結果

hugikun@hugikun:~/embedded/phonebook$ ./phonebook_opt 
size of entry : 40 bytes
size of detail : 112 bytes
OUT_FILE is : opt.txt
execution time of append() : 0.049275 sec
execution time of findName() : 0.000034 sec

利用__attribute__ ((packed))

typedef struct __attribute__ ((packed))  __PHONE_BOOK_DETAIL {
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
} detail;

結果

size of entry : 40 bytes
size of detail : 107 bytes
OUT_FILE is : opt.txt
execution time of append() : 0.049174 sec
execution time of findName() : 0.000018 sec

conclude

由上面的測試可以知道，透過設定屬性可以減少浪費在每個 struct 上的記憶體空間，但是再搜尋上並沒有明顯的加快，因為 entry 的 size 並沒有因為加上__attribute__((packed)) 而減少，故在搜尋時讀入的大小一樣沒有省下時間。

hash function

這邊我自己有利用 hash function 的觀念實作了自己的想法，同樣都是分類的概念，將每個名字的開頭依照字母分類，所以有 26 個 entry，這個方法有一個限制，輸入的 word.txt 必須是排好順序的情況下才能用這種方法。

 Performance counter stats for './phonebook_opt' (100 runs):

            6,7240      cache-misses              #   51.930 % of all cache refs      ( +-  0.74% )
           12,9482      cache-references                                              ( +-  0.63% )
       1,8556,7555      instructions              #    1.83  insns per cycle          ( +-  0.03% )
       1,0152,9576      cycles                                                        ( +-  0.19% )

       0.033044710 seconds time elapsed                                          ( +-  0.25% )

BKDRHash

(1)
我剛開始的想法是直接將 hash function 出來的值當作依據，每一個 hash 值都代表一個新的 entry head，只有當出來的 hash 值一樣才會在對應到的那個 entry 往下接，因此每次都必須重頭開始比對 hash 值是否已經出現，造成在 append 上的效率十分低落，搜尋的部份由於 table 十分龐大也花費許多時間在比對 hash 值上。總之，結果十分慘烈。

execution time of append() : 976.077524 sec
execution time of findName() : 0.010558 sec

(2)
這邊我將 hash table 的大小設為 31，中間 append 和 fineName 不用使如上次如此多的判斷條件，整體時間往下非常多，但是相比於用開頭字母做分類的方法速度還是比較慢。

Hash table size: 496 bytes
execution time of append() : 0.046426 sec
execution time of findName() : 0.000216 sec

Performance counter stats for './phonebook_hash' (100 runs):

           11,2360      cache-misses              #   54.402 % of all cache refs      ( +-  2.12% )
           20,6536      cache-references                                              ( +-  0.91% )
       2,4072,8347      instructions              #    1.76  insns per cycle          ( +-  0.02% )
       1,3693,3078      cycles                                                        ( +-  0.26% )

       0.044152873 seconds time elapsed                                          ( +-  0.27% )

當 hash table size = 997時，可以看到搜尋的時間明顯變得很短，代表 collision 已經很少發生了，但因為 table 變得很大所以所需要的記憶體也變多了。

Hash table size: 15952 bytes
execution time of append() : 0.045567 sec
execution time of findName() : 0.000003 sec

 Performance counter stats for './phonebook_hash' (100 runs):

            8,5163      cache-misses              #   22.057 % of all cache refs      ( +-  3.19% )
           38,6110      cache-references                                              ( +-  0.66% )
       2,3783,2822      instructions              #    1.72  insns per cycle          ( +-  0.02% )
       1,3810,6493      cycles                                                        ( +-  0.43% )

       0.044743948 seconds time elapsed                                          ( +-  0.43% )

(3)
對於 table size 的大小要如何取決，通常是選質數當作 table 大小，這邊我從測試了10~1000的 size 大小，比較這之間 append() 和 findname() 的花費時間。可以看出來 append() 並沒有因為 size 變大而有顯著的變慢，花費的時間非常不穩定，在700~850之間有一段上升明顯的區塊，隨後又回到之前的平均;反觀在 findname() 這邊有明顯的因為 size 變大而搜尋時間下降的趨勢，大概是呈現指數下降。

ohter hash function
這邊比較了四種 hash function，統一使用 table size 為997，並且使用 perf state 進行了100次執行算出花費時間，可以看出來 JSHash 在 append() 方面花了比較少的時間，但是在 findname() 這邊的時間差距並不大。

TODO

[ ]比較 table size 和 findname() 之間時間的關係，找出其 CP 值。

Open addressing

source
使用 hash function 不可避免的一定會發生 collision 的問題，Open addressing 是利用 probing 來偵測或著是找到在 table 中尚未被使用的位置。目前我在實作上都是使用 link list 的方法，如果發生 collision 則將其鏈結到原有的目標後。但是其實還有許多方法可以解決 collision 問題，如果選用的方法設計妥善可以將搜尋複雜度降到 O(1)，但這也是在 table 夠大的情況下。

Linear Probing
如果原本位置已經被佔用，將以線性方式找尋一個空的儲存位址將資料存入，簡單來說就是原值加上某個固定的常數或著乘上某倍率，可是這會造成某些特定地方的 hash value 過於集中，容易造成搜尋時間拉長。
Chaining
即是目前所利用的 link list 的方法，當 collision 發生時會將目前的東西以鏈結的型式接在已經存在的東西後面。
Quadratic Probing
不同於 Linear Probing 以線性處理， Quadratic Probing 是加上或減去某個值的平方，而這個值會隨著 collision 發生的次數而遞增，如此可以避免同樣的 hash value 過於集中的出現再某些特定地方。
Rehash
這個方法理解比較簡單，就是拿第一次的 hash value再做一次 hash function。

除了上面提到的這幾種方法，其實只要能設定規則都能解決 collision ，但若要找出一個能使 hash value 平均分散的方法則是比較困難的。

Memory Pool

一次割出一塊記憶體空間，每次需要使用的使用的時候再分配出去。malloc 所配置出來的記憶體是零散的，並非像 memory pool 是連續性的記憶體。一次割出一塊記憶體空間可以省下每次 malloc 尋找空記憶體的時間，另外在 free 時也不用一個一個將記憶體空間釋放，可以一次就完成釋放的動作。

other reference

下面是一些我找到有用的參考，有些可以增加工作的效率，值得參考。
打造專屬於你的 Git 工作流程

How to Set Up Command Aliases in Linux/Ubuntu/Debian

vimrc設定