2017q3 Homework1 (phonebook)

tags: `進階電腦系統理論與實作 (Fall 2017)`

contributed by < jcyztw, davidshih0908 >

開發環境

使用 lscpu 指令查詢 cache size。

可看到我的電腦 CPU 使用 32 KB 的 L1 data cache，32 KB 的 L1 instruction cache，以及 2 MB 的 L2 cache。

Architecture:          x86_64
CPU 作業模式：    32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
每核心執行緒數：1
每通訊端核心數：2
Socket(s):             1
NUMA 節點：         1
供應商識別號：  GenuineIntel
CPU 家族：          6
型號：              15
Model name:            Intel(R) Core(TM)2 Duo CPU     T5750  @ 2.00GHz
製程：              13
CPU MHz：             1000.000
CPU max MHz:           2000.0000
CPU min MHz:           1000.0000
BogoMIPS:              3990.25
L1d 快取：          32K
L1i 快取：          32K
L2 快取：           2048K
NUMA node0 CPU(s):     0,1
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm

降低 cache miss

在開始進行作業最主要的要求(降低 cache miss)前，先看看未優化程式的效能檢測數據。

首先先編譯程式，在本機存放 phonebook 原始碼的目錄下輸入以下指令：

$ make
$ make plot

得到結果如下：

 Performance counter stats for './phonebook_orig' (100 runs):

           837,974      cache-misses              #   39.951 % of all cache refs      ( +-  0.20% )  (74.25%)
         2,097,523      cache-references                                              ( +-  0.20% )  (50.82%)
       267,658,436      instructions              #    0.79  insn per cycle           ( +-  0.12% )  (76.32%)
       337,054,013      cycles                                                        ( +-  0.06% )  (75.40%)

       0.173064686 seconds time elapsed                                          ( +-  0.39% )

優化方案 1：縮減 __PHONE_BOOK_ENTRY 的 structure size
優化方案 2：實作 hash function

優化方案 1：縮減 `__PHONE_BOOK_ENTRY` 的 structure size

未優化

 Performance counter stats for './phonebook_orig' (100 runs):

           872,636      cache-misses              #   40.955 % of all cache refs      ( +-  0.15% )  (74.40%)
         2,130,724      cache-references                                              ( +-  0.13% )  (50.40%)
       266,789,767      instructions              #    0.79  insn per cycle           ( +-  0.11% )  (76.09%)
       339,519,402      cycles                                                        ( +-  0.06% )  (75.67%)

       0.175144684 seconds time elapsed                                          ( +-  0.44% )

優化

 Performance counter stats for './phonebook_opt' (100 runs):

            58,812      cache-misses              #    4.534 % of all cache refs      ( +-  0.92% )  (74.14%)
         1,297,099      cache-references                                              ( +-  0.39% )  (52.28%)
       252,106,594      instructions              #    1.25  insn per cycle           ( +-  0.03% )  (76.40%)
       201,546,202      cycles                                                        ( +-  0.12% )  (74.86%)

       0.102934648 seconds time elapsed                                          ( +-  0.21% )

由上面的數據可看出，縮減 structure 的 size 讓 miss rate 由 40.955% 降到了 4.534%，幾乎是變成原先 miss rate 的一成。

優化方案 2：實作 hash function

接續方案 1 繼續改進，用 hash table 取代原先 linked list，將 hashing 實作到 append() 和 findName() 中。
而 hash function 如下：












unsigned int hash(char key[])
{
	unsigned int tableIndex = 0;
	char *lastName = key;

	while(*lastName != '\0') {
		tableIndex = tableIndex + *lastName++;
	}
	tableIndex %= HASH_TABLE_SIZE;

	return tableIndex;
}

此 hash function 採用一個比資料筆數大的一個質數(42737)作為 hash table size，以減少碰撞(collision)。

檢測結果如下：

 Performance counter stats for './phonebook_orig' (100 runs):

           838,855      cache-misses              #   40.132 % of all cache refs      ( +-  0.21% )  (74.29%)
         2,090,233      cache-references                                              ( +-  0.18% )  (50.59%)
       266,490,247      instructions              #    0.79  insn per cycle           ( +-  0.11% )  (76.31%)
       338,661,799      cycles                                                        ( +-  0.05% )  (75.57%)

       0.173595447 seconds time elapsed                                          ( +-  0.43% )


 Performance counter stats for './phonebook_opt' (100 runs):

            57,061      cache-misses              #    4.442 % of all cache refs      ( +-  0.88% )  (74.30%)
         1,284,717      cache-references                                              ( +-  0.47% )  (52.41%)
       251,938,283      instructions              #    1.25  insn per cycle           ( +-  0.04% )  (76.47%)
       201,465,248      cycles                                                        ( +-  0.12% )  (74.67%)

       0.102744419 seconds time elapsed                                          ( +-  0.32% )


 Performance counter stats for './phonebook_orig' (100 runs):

           806,331      cache-misses              #   39.283 % of all cache refs      ( +-  0.13% )  (74.18%)
         2,052,640      cache-references                                              ( +-  0.18% )  (51.05%)
       268,610,061      instructions              #    0.80  insn per cycle           ( +-  0.11% )  (76.44%)
       335,217,963      cycles                                                        ( +-  0.05% )  (75.27%)

       0.171950086 seconds time elapsed                                          ( +-  0.12% )

由圖中可看到 findName() 在改用 hash function 的實作方式後，花費的時間遠遠少於方案 1。

hash function 優化後 miss rate 上升，可能是因為改 main.c 時，我把__builtin___clear_cache 這東西拿掉導致。此現象的原因有待釐清。

回答問題

本作業選用的 dataset (也就是輸入電話簿的姓名) 是否存在問題？

有代表性嗎？跟真實英文姓氏差異大嗎？
簡單節錄我在測試中使用的 dataset 的內容：

aaaa
aaaaa
aaaaaa
aaaaaaa
aaaaaaaa
aaaaaaaah
aaaaaaauugh
aaaaaagh
aaaaaahhhhh
aaaaaaugh
aaaaagh
aaaaah
aaaarthur
aaaaw
aaagghhhh
aaah
aaaugh
aaccf

可看出大多不是真實英文姓名，與實際上的應用有落差，因此個人判斷比較不具有代表性。

資料難道「數大就是美」嗎？

(pending)

如果我們要建立符合真實電話簿情境的輸入，該怎麼作？

(pending)

第十週作業 (Nov. 24)

開發環境(更動)

Architecture:          x86_64
CPU 作業模式：    32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
每核心執行緒數：2
每通訊端核心數：4
Socket(s):             1
NUMA 節點：         1
供應商識別號：  GenuineIntel
CPU 家族：          6
型號：              158
Model name:            Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
製程：              9
CPU MHz：             900.000
CPU max MHz:           4200.0000
CPU min MHz:           800.0000
BogoMIPS:              7200.00
虛擬：              VT-x
L1d 快取：          32K
L1i 快取：          32K
L2 快取：           256K
L3 快取：           8192K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp

優化方案：memory pool

參考 ierosodin 實作 memory pool 的程式碼，依樣畫葫蘆實作，測試看看使用 memory pool 可以將 miss rate 降到多低。

我在實作 memory pool 時，是基於 hash table 方案來加以改善，兩者差別只是在建立電話簿中的資料時 ( append())，每筆 entry 配置記憶體的方式不同。hash table 的作法是使用 malloc() 配置記憶體，而 memory pool 是用一開始配置好的 pool 直接指定可使用的記憶體位址。










entry *append(char lastName[], entry *e)
{
    e->pNext = (entry *) pool_alloc(mpool, sizeof(entry));
    e = e->pNext;
    strcpy(e->lastName, lastName);
    e->pNext = NULL;

    return e;
}

上面的程式碼為 memory pool 優化方案透過 pool_alloc() 直接指定可使用的記憶體位址。

以下是 memory pool 相關操作 (create, allocate, free) 的程式碼片段：
























pool *pool_create(size_t size)
{
    pool *p = (pool *) malloc(size + sizeof(pool));
    if (p) {
        p->next = p + sizeof(pool);
        p->end = p + size + sizeof(pool);
    }
    return p;
}

void *pool_alloc(pool *p, size_t size)
{
    void *mem = NULL;
    if (p->end - p->next >= size) {
        mem = p->next;
        p->next = p->next + size;
    }
    return mem;
}

void pool_destroy(pool *p)
{
    free(p);
}

比較 miss rate：

original	resized structure	hash table	memory pool
87.693%	51.938%	43.080%	37.338%

 Performance counter stats for './phonebook_orig' (100 runs):

         5,345,827      cache-misses              #   87.693 % of all cache refs      ( +-  0.17% )
         6,096,065      cache-references                                              ( +-  0.07% )
       322,369,931      instructions              #    1.37  insn per cycle           ( +-  0.02% )
       235,405,977      cycles                                                        ( +-  0.47% )

       0.057201688 seconds time elapsed                                          ( +-  0.57% )


 Performance counter stats for './phonebook_opt' (100 runs):

         1,443,369      cache-misses              #   51.938 % of all cache refs      ( +-  0.65% )
         2,779,036      cache-references                                              ( +-  0.15% )
       283,246,415      instructions              #    2.15  insn per cycle           ( +-  0.02% )
       131,848,129      cycles                                                        ( +-  0.64% )

       0.032153434 seconds time elapsed                                          ( +-  0.66% )


 Performance counter stats for './phonebook_opt2' (100 runs):

         1,725,825      cache-misses              #   43.080 % of all cache refs      ( +-  1.10% )
         4,006,108      cache-references                                              ( +-  1.18% )
       264,386,230      instructions              #    1.33  insn per cycle           ( +-  0.02% )
       199,015,555      cycles                                                        ( +-  0.40% )

       0.048235382 seconds time elapsed                                          ( +-  0.43% )


 Performance counter stats for './phonebook_mpool' (100 runs):

           308,512      cache-misses              #   37.338 % of all cache refs      ( +-  1.03% )
           826,268      cache-references                                              ( +-  0.30% )
       157,945,451      instructions              #    1.84  insn per cycle           ( +-  0.03% )
        85,842,871      cycles                                                        ( +-  0.44% )

       0.021161593 seconds time elapsed                                          ( +-  0.49% )

由下圖可看出 memory pool 方案是所有情況中 append() 以及 findName() 中執行時間花費最短的。

fuzzy search

模糊搜尋(fuzzy search)，找出與你給定字串相似的字串，一般來說可以透過查字典的方法找出與你相近的字串。常用於 search engine，ex: google search 中，當你輸入 mississipp ， google search 會建議你要不要搜尋 mississippi 。
找出相似字串的演算法有很多種，依Fuzzy search strings所說，有其中下列幾種:
- Levenshtein distance
- BK-trees

Levenshtein distance

Levenshtein distance 是將一字串轉換成另一個字串所需之編輯次數，能使用的編輯方式有 insertions , deletions or substitutions，而且視這3個 operation 的 cost 是一樣的，所以總 cost 就是 總編輯次數 x 每編輯一次的 cost 。

其數學式是

以字串 kitten 與 sitting 來做示範
我想將 kitten 轉變成 sitting

1.kitten -> sitten (s取代k)
2.sitten -> sittin (i取代e)
3.sittin -> sitting (字串尾加上g)

所以總編輯次數為 3 。

在實作上可以採用 recursive 或者 iterative

recursive 是個不有效率的做法，會有大量的重複計算，所以我這裡就不介紹了。
iterative 的話可以採用 Wagner–Fischer algorithm 來實作。
- Wagner–Fischer algorithm:
  利用 dynamic programming 的方式算兩字串的編輯距離。
  
  其演算法為:
  
  很直接的就是設初值，然後開始建 table
  
  上述以 GARVEY 和 AVERY 為例，算出 edit distance 為 3

BK-trees

是建立在 Levenshtein distance 的基礎上的樹狀資料結構。

一開始假如有一棵樹

其中邊的數值，以 BOOK 與 BOOKS 相連的邊為例，數值為1，是因為 BOOK 與 BOOKS 的 Levenshtein distance 為1

BK-trees 插入的規則:

就是先算插入字串與root 的 Levenshtein distance ，這裡我另為 n，接著看 root 的子結點有哪個與 root 的 Levenshtein distance 為 n ，沒有的話直接把新增的字串插在 root 底下，有的話接著那個與 root Levenshtein distance 相距為 n
的子結點繼續以上述方式以 DFS 的方式做下去，最終可以找到插入的點。

假如我今天想插入一個為字串 BOO ，首先先與 root 也就是 BOOK 算出之間的 Levenshtein distance 為 1 然後再繼續跟 BOOKS 比較 Levenshtein distance 為 2 然後無從比較了，就將點插在 BOOKS 後面。

BK-trees 找相似字串的規則:
先決定你可以允許多大的容錯程度，這裡我另容錯程度為 : n ，所以也就是在兩字串的 Levenshtein distance 加減 n 的範圍內的字串皆是我可以搜尋的目標，並且假如搜尋目標與 search partern 的 Levenshtein distance

\leq

n ，便可以將它加入可能是想要的搜尋的字串中。

以搜尋 CAQE 為例 :(另: n = 1)

(1) CAQE 先與 BOOK 算出 Levenshtein distance 為 4 ，4 並未小於 n = 1，所以 BOOK 並未加入可能想要搜尋的字串內，又因為 n = 1，所以搜尋字串的 range 為 (4-1,4+1) = (3,5) 。
(2) 接著看有無與 BOOK(root) Levenshtein distance 介於 (3,5) 的字串，看到 CAKE 是個選擇，於是再拿 CAQE 與 CAKE 算 Levenshtein distance 結果為 1 ，1 小於等於 n = 1 ，所以可以將 CAKE 納入可能想要搜尋的字串中，並且算出接下來搜尋字串的範圍為 (1-1,1+1) = (0,2) 。
(3) 接著繼續依上述範圍找與 CAKE Levenshtein distance 在上述範圍的字串，發現 CAPE 是個選擇，於是拿 CAQE 與 CAPE 算 Levenshtein distance 結果為 1，1 小於等於 n = 1 ，所以可以將 CAPE 納入可能想要搜尋的字串中。
(4) 因為此搜尋是採 DFS 所以退回 CAKE 看有無符合搜尋範圍為 (0,2) ，發現 CART 也符合，，於是拿 CAQE 與 CAPE 算 Levenshtein distance 結果為 2，2 並未小於等於 n = 1 ，所以CART 並未納入可能想要搜尋的字串中。

所以最後在容錯為 1 的情況下，可能是想要搜尋的字串有 CAKE,CAPE 。

參考資料

共筆
- 同學共筆：st9007a
- 選讀共筆：0xff07，nekoneko，ierosodin，charles620016
Hashing & Hash Tables
Memory pool
Fuzzy string search
Levenshtein distance from Wikipedia
The BK-Tree – A Data Structure for Spell Checking
BK-Tree | Introduction & Implementation