2017q3 Homework1 (phonebook) [分組報告]

tags: `進階電腦系統理論與實作 (Fall 2017)`

contributed by < jcyztw, davidshih0908 >

預計完成項目

改善效能

使用 binary search tree 或其他資料結構改寫
改善電話簿的效能分析，透過大量的重複執行，找出 95% 信賴區間，而且允許動態新增資料 (較符合現實) 和查詢
支援 fuzzy search，允許搜尋 lastName 時，不用精準打出正確的寫法
引入 memory pool，降低記憶體存取的成本

查閱相關資料

cache 原理與實驗的參考及引用資料
[ fuzzy search ] Levenshtein Distance (編輯距離)
何謂信賴區間，機率密度函數(PDF)，以及累積分佈函數(CDF)
作業共筆：ierosodin
春季班期中報告

回答問題

資料難道「數大就是美」嗎？
如果我們要建立符合真實電話簿情境的輸入，該怎麼作？

開發環境

Architecture:          x86_64
CPU 作業模式：    32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
每核心執行緒數：2
每通訊端核心數：4
Socket(s):             1
NUMA 節點：         1
供應商識別號：  GenuineIntel
CPU 家族：          6
型號：              158
Model name:            Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
製程：              9
CPU MHz：             900.000
CPU max MHz:           4200.0000
CPU min MHz:           800.0000
BogoMIPS:              7200.00
虛擬：              VT-x
L1d 快取：          32K
L1i 快取：          32K
L2 快取：           256K
L3 快取：           8192K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp

改善效能

優化方案：memory pool

參考 ierosodin 實作 memory pool 的程式碼，依樣畫葫蘆實作，測試看看使用 memory pool 可以將 miss rate 降到多低。

我在實作 memory pool 時，是基於 hash table 方案來加以改善，兩者差別只是在建立電話簿中的資料時 ( append())，每筆 entry 配置記憶體的方式不同。hash table 的作法是使用 malloc() 配置記憶體，而 memory pool 是用一開始配置好的 pool 直接指定可使用的記憶體位址。










entry *append(char lastName[], entry *e)
{
    e->pNext = (entry *) pool_alloc(mpool, sizeof(entry));
    e = e->pNext;
    strcpy(e->lastName, lastName);
    e->pNext = NULL;

    return e;
}

上面的程式碼為 memory pool 優化方案透過 pool_alloc() 直接指定可使用的記憶體位址。

以下是 memory pool 相關操作 (create, allocate, free) 的程式碼片段：
























pool *pool_create(size_t size)
{
    pool *p = (pool *) malloc(size + sizeof(pool));
    if (p) {
        p->next = p + sizeof(pool);
        p->end = p + size + sizeof(pool);
    }
    return p;
}

void *pool_alloc(pool *p, size_t size)
{
    void *mem = NULL;
    if (p->end - p->next >= size) {
        mem = p->next;
        p->next = p->next + size;
    }
    return mem;
}

void pool_destroy(pool *p)
{
    free(p);
}

利用 perf stat -e 指令來紀錄 cache miss，cache reference 次數，總指令數以及總 cycle 數，下方表格用來比較 miss rate：

original	resized structure	hash table	memory pool
87.693%	51.938%	43.080%	37.338%

 Performance counter stats for './phonebook_orig' (100 runs):

         5,345,827      cache-misses              #   87.693 % of all cache refs      ( +-  0.17% )
         6,096,065      cache-references                                              ( +-  0.07% )
       322,369,931      instructions              #    1.37  insn per cycle           ( +-  0.02% )
       235,405,977      cycles                                                        ( +-  0.47% )

       0.057201688 seconds time elapsed                                          ( +-  0.57% )


 Performance counter stats for './phonebook_opt' (100 runs):

         1,443,369      cache-misses              #   51.938 % of all cache refs      ( +-  0.65% )
         2,779,036      cache-references                                              ( +-  0.15% )
       283,246,415      instructions              #    2.15  insn per cycle           ( +-  0.02% )
       131,848,129      cycles                                                        ( +-  0.64% )

       0.032153434 seconds time elapsed                                          ( +-  0.66% )


 Performance counter stats for './phonebook_opt2' (100 runs):

         1,725,825      cache-misses              #   43.080 % of all cache refs      ( +-  1.10% )
         4,006,108      cache-references                                              ( +-  1.18% )
       264,386,230      instructions              #    1.33  insn per cycle           ( +-  0.02% )
       199,015,555      cycles                                                        ( +-  0.40% )

       0.048235382 seconds time elapsed                                          ( +-  0.43% )


 Performance counter stats for './phonebook_mpool' (100 runs):

           308,512      cache-misses              #   37.338 % of all cache refs      ( +-  1.03% )
           826,268      cache-references                                              ( +-  0.30% )
       157,945,451      instructions              #    1.84  insn per cycle           ( +-  0.03% )
        85,842,871      cycles                                                        ( +-  0.44% )

       0.021161593 seconds time elapsed                                          ( +-  0.49% )

由下圖可看出 memory pool 方案是所有情況中 append() 以及 findName() 中執行時間花費最短的。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

實作 fuzzy search

我們實作了兩種搜尋相似字串的演算法：

Levenshtein distance
Wagner-Fischer algorithm

透過這兩種演算法可以算出 edit distance (意即將一字串轉換成另一個字串所需之編輯次數，編輯可分為插入、刪除以及取代三種操作)。

以下是實作 Wagner-Fischer algorithm，搜尋 words.txt 中，與 kaneshiro 的相似字串：

$ ./phonebook_fuzzy kaneshiro
similarity = 3
last names which edit distance is less or equal to similarity:
kanashi
janeiro
koshiro
yashiro
kanesian
aleshire
maeshiro
kazihiro
kunashir
kazuhiro
taushiro
kaneshite
kanephore
lanesboro
kunashiri
kanephoros

similarity 表示設定的 edit distance，此處是設為 3。以上搜尋到的 last name，都是與 kaneshiro 的 similarity 小於或等於 3 的字串。

參考維基百科，Wagner-Fischer algorithm 的程式碼如下：

























int wagnerFischer(const char *s, const char *t)
{
    int len_s = strlen(s), len_t = strlen(t);
    int distance[len_s + 1][len_t + 1];

    for (int i = 0; i <= len_s; i++) {
        distance[i][0] = i;
    }
    for (int j = 0; j <= len_t; j++) {
        distance[0][j] = j;
    }

    for (int j = 1; j <= len_t; j++) {
        for (int i = 1; i <= len_s; i++) {
            if (s[i - 1] == t[j - 1])
                distance[i][j] = distance[i - 1][j - 1];
            else
                distance[i][j] = min(
                                     distance[i-1][j] + 1,
                                     distance[i][j-1] + 1,
                                     distance[i-1][j-1] + 1);
        }
    }
    return distance[len_s][len_t];
}

資料閱讀

fuzzy search

模糊搜尋(fuzzy search)，找出與你給定字串相似的字串，一般來說可以透過查字典的方法找出與你相近的字串。常用於 search engine，e.g. google search 中，當你輸入 mississipp ， google search 會建議你要不要搜尋 mississippi。
找出相似字串的演算法有很多種，依據 Fuzzy search strings ，以下針對下列兩種作說明:
- Levenshtein distance
- BK-trees

Levenshtein distance

Levenshtein distance 是將一字串轉換成另一個字串所需之編輯次數，能使用的編輯方式有插入，刪除或是取代三種。若將這 3 個操作方式的 cost 視為相同，則「總 cost = 總編輯次數

\times

每編輯一次的 cost 」。

其數學式為：

l e v_{a, b} (i, j) = {\begin{cases} m a x (i, j) & if m i n (i, j) = 0, \\ m i n {\begin{cases} l e v_{a, b} (i - 1, j) + 1 \\ l e v_{a, b} (i, j - 1) + 1 \\ l e v_{a, b} (i - 1, j - 1) + 1_{(a_{i} \neq b_{j})} \end{cases} & otherwise. \end{cases}

以字串 kitten 與 sitting 做示範，若想將 kitten 轉換成 sitting：

kitten -> sitten ( s 取代 k )
sitten -> sittin ( i 取代 e )
sittin -> sitting (字串尾加上 g )

得到總編輯次數等於 3。

在實作上可以採用 recursive 或者 iterative：

recursive 不是很有效率的做法，會有大量的重複計算，因此不作介紹
iterative 的話可以採用 Wagner–Fischer algorithm 來實作 (利用 dynamic programming 的方式算兩字串的編輯距離)，其演算法如下：

很直接的就是設初值，然後開始建表。以 GARVEY 和 AVERY 為例，算出 edit distance 為 3。

BK-trees

BK-trees 是建立在 Levenshtein distance 的基礎上的樹狀資料結構。

若有一棵叫作 T 的 BK-tree，由於 BOOK 與 BOOKS 的 Levenshtein distance 為 1，所以在 BOOK 與 BOOKS 相連的邊上標上 1 (T 見下圖)。

以下介紹如何插入節點至 BK-tree 以及如何在 BK-tree 中尋找相似的字串。

BK-tree 插入的規則

計算欲插入字串與 BK-tree 根節點的 Levenshtein distance (令此 distance 為 N)
找尋是否有等於 N 的邊值 (所有根節點與其所有子節點相連的邊，每邊都會有一個值紀錄此節點與父點的 Levenshtein distance)
2-a. 若有，以邊值等於 N 的子節點作為新的根節點，重回步驟 1 執行，直到找到插入的位置
2-b. 若無，直接將欲插入的字串插到根節點之下，成為根節點的新子點

舉例來說 (使用上圖作例子)，假如若想插入一字串 BOO ，首先算出與 root (BOOK) 的 Levenshtein distance (得到 distance = 1)，找到子節點 BOOKS 作為新的 root (distance = 1)；接著計算 BOO 跟 BOOKS 的 Levenshtein distance，得到 distance = 2。因為找不到邊值相符的子節點，便將 BOO 插在 BOOKS 之下 (如下圖)。

BK-tree 中找相似字串的節點

以上圖的 BK-tree 來搜尋字串 CAQE 為例，令 distance tolerance N = 1：

計算根節點 BOOK 與 CAQE 的 Levenshtein distance (distance = 4)，而 distance 並沒有小於或等於 N (
$4 \leq 1$ 不成立)，因此 BOOK 不列入匹配的字串選擇中。再接著搜尋邊值落在
$(d i s t a n c e - 1, d i s t a n c e + 1) = (3, 5)$ 範圍內的 BOOK 子節點。
符合 1 最後敘述的子節點只有 CAKE，重複 1 的作法，得到 distance = 1，distance 等於 N，故將 CAKE 列入匹配的字串中。接下來搜尋邊值落在
$(d i s t a n c e - 1, d i s t a n c e + 1) = (0, 2)$ 範圍內的 CAKE 子節點。
符合 2 最後敘述的子節點有 CAPE 和 CART，重複 1 的作法：CAQE 與 CAPE 的 distance = 1，因此將 CAPE 列入匹配的字串選擇中；CAQE 與 CART 的 distance = 2，因 distance 大於 N，故不將 CART 列入匹配字串。

在 distance tolerance 為 1 的情況下，最終得到的相似字串有：CAKE 以及 CAPE。

信賴區間，機率密度函數 (PDF) 以及累積分佈函數 (CDF)

名詞解釋：

信賴區間

信賴區間 (Confidence Interval) 是對機率樣本的某個總體參數的區間估計；而信賴區間展現這個總體參數的真實值有一定機率落在與該測量結果有關的某對應區間，這個「一定機率」稱為信心水準。

舉例來說，某選舉的民意調查中，「在 95% 信心水準下，抽樣誤差在正負 3.1% 以內，甲候選人獲得 37% 的選民支持，乙候選人則有 23% 的支持率。」意即我們有 95% 的信心，支持甲的選民比例，在 (0.339, 0.401) 範圍內，而支持乙的選民比例，在 (0.199, 0.261) 的範圍內。這兩個範圍即信賴區間。

機率密度函數 (PDF, Probability Density Function)

一個描述這個連續隨機變量 (continuous random variable) 的輸出值，在某個確定的取值點附近的可能性的函數。而隨機變量的取值落在某個區域之內的機率，則為機率密度函數在這個區域上的積分。

以下使用 ASPE4e 課本 (見參考資料) P.112 範例 4-1 輔助說明。

X

表示在一細薄銅線 (thin copper wire) 中測量到的電流 (單位為 mA，毫安培)，假設

X

的範圍為 [0, 20 mA]，且

X

的機率密度函數

f (x) = 0.05, 0 \leq x \leq 20

，此時銅線中電流小於 10 毫安培的機率為何？

不在

X

的範圍內的

f (x) = 0

，範例所求機率等於下圖中的陰影範圍。

P (x < 10) = \int_{0}^{10} f (x) d x = \int_{0}^{10} 0.05 d x = 0.5

累積分佈函數 (CDF, Cumulative Distribution Function)

是機率密度函數的積分，能完整描述一個實隨機變量

X

的機率分布。

舉例說明前，首先看看 ASPE4e 課本給的累積分佈函數的數學定義：

一個連續隨機變數 (continuous random variable)

X

的累積分佈函數為

F (x) = P (X \leq x) = \int_{- \infty}^{x} f (u) d u, for - \infty < x < \infty

舉例：延續在機率密度函數部份的例子，找出

X

的累積分佈函數。

X

的 CDF 由三個部份組成：

x < 0

f (x) = 0

, therefore

F (x) = 0

, for

x < 0

And

F (x) = \int_{0}^{x} f (u) d u = 0.05 x

, for

0 \leq x < 20

Finally,

F (x) = \int_{0}^{x} f (u) d u = 1

, for

20 \leq x

整理之後得到以下結果：

F (x) = {\begin{array}{lc} 0 & x < 0 \\ 0.05 x & 0 \leq x < 20 \\ 1 & 20 \leq x \end{array}

上述式子作圖後如下：

參考資料

共筆
- 同學共筆：st9007a
- 選讀共筆：0xff07，nekoneko，ierosodin，charles620016
Hashing & Hash Tables
Memory pool
Fuzzy string search
Levenshtein distance from Wikipedia
The BK-Tree – A Data Structure for Spell Checking
BK-Tree | Introduction & Implementation
什麼是信賴區間
Applied Statistics and Probability for Engineers, 4th ed. (ASPE4e), Douglas C. Montgomery, George C. Runger
- Probability Density Function (P.111)
- Cumulative Distribution Function (P.114)

2017q3 Homework1 (phonebook) [分組報告]

tags: 進階電腦系統理論與實作 (Fall 2017)

預計完成項目

開發環境

改善效能

優化方案：memory pool

實作 fuzzy search

資料閱讀

fuzzy search

Levenshtein distance

BK-trees

信賴區間，機率密度函數 (PDF) 以及累積分佈函數 (CDF)

參考資料

tags: `進階電腦系統理論與實作 (Fall 2017)`