E02: dict

tags: `sysprog2018`

主講人: jserv / 課程討論區: 2018 年系統軟體課程

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

返回「進階電腦系統理論與實作」課程進度表

預期目標

學習 ternary search tree 作為 auto-complete 和 prefix search 的實作機制
學習 Bloom filter 和 ternary search tree 的整合議題
- bit-wise operation 的實踐
設計效能評比 (benchmark) 的程式框架
學習 GNU make 和 Makefile 撰寫
複習機率統計和對應的數學基礎

Autocompletion 和 ternary search tree

Autocomplete using a ternary search tree
Ternary Search Tree 視覺化
- 用 ab, abs, absolute 等字串逐次輸入並觀察
實際整合案例: Phonebook: adds new contacts and effectively search for contacts via ternary search tree

Trie (字典樹) 和 Ternary search tree (TST)

Trie

trie 與 binary search tree 不同在於，不把整個字串包在同一個 node 中，而是根據 node 的位置決定代表的字串為何。也就是一個看結果(最終 node)，一個看過程 (經過的 node)。因此利用 trie，可以快速的找出有相同 prefix 的字串。以下將舉例示範 trie 如何儲存 data。

Ternary search tree

引入 Ternary search tree (TST)，可望解決上述記體開銷的問題。這種樹的每個節點最多只會有 3 個 children，分別代表小於，等於，大於。

以下範例解釋 Ternary search tree 的運作。

插入一個字串 and
與 trie 概念相同，但是在 root 是直接擺 a，而不需要一個 null

接著插入單字 ant

我們先比對 a，發現 a 和 root 的字母相同（a=a）
因此 pointer 往下，來到 n 節點，發現 n 節點和 ant 的第二個字母相同（n=n）
因此 pointer 再往下，來到 d 節點，發現 d 節點和 ant 的第三個字母不同，而且 t > d
因此 d 節點的 right child 成為 t，成功將 ant 放入原本的 TST 中

接著插入單字 age

我們先比對 a ，發現 a 和 root 的字母相同（a=a)
因此 pointer 往下，來到 n 節點，發現 n 節點和 age 的第二個字母不同，而且 g < n
因此 n 節點的 left child 成為 g，成功將 age 放入原本的 TST 中

Ternary Search Tree 此網址提供了透過動畫呈現，展示如何加入與刪除給定字串的流程。

我們可發現 TST 幾種特性:

解決 trie tree 中大量佔用記憶體的情況，因為 TST 中，每個 node 最多只會有 3 個 child
但尋找時間在某些狀況中，會比 trie 慢，這是因為 TST 插入的順序會影響比較的次數。

考慮以下兩種插入順序：

前者用 aa
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
ac
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
ae
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
af
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
ag
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
ah 的順序
後者是 af
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
ae
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
ac
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
aa
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
ag
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
ah 的順序

試想，如果我們要尋找 ah，那麼前者要比較 6 次，後者只要比較 3 次。而若是建立 trie ，只需要 2 次。

用 aa -> ac -> ae -> af -> ag -> ah 的順序插入
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
用 af -> ae -> ac -> aa -> ag -> ah 的順序插入
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

ternary search tree 的實作程式碼

取得 dict 程式碼，編譯並測試

$ git clone https://github.com/sysprog21/dict
$ cd dict
$ make
$ ./test_cpy

預期會得到以下執行畫面:

ternary_tree, loaded 259112 words in 0.151270 sec

Commands:
 a  add word to the tree
 f  find word in tree
 s  search words matching prefix
 d  delete word from the tree
 q  quit, freeing all data

choice:

按下 f 隨後按下 Enter 鍵，會得到 find word in tree: 的提示畫面，輸入 Taiwan (記得按 Enter)，預期會得到以下訊息:

find word in tree: Taiwan
  found Taiwan in 0.000002 sec.

當再次回到選單時，按下 s 隨後按下 Enter 鍵，會得到 find words matching prefix (at least 1 char): 的提示訊息，輸入 Tain，預期會得到以下訊息:

  Tain - searched prefix in 0.000011 sec

suggest[0] : Tain,
suggest[1] : Tainan,
suggest[2] : Taino,
suggest[3] : Tainter
suggest[4] : Taintrux,

不難發現 prefix-search 的功能，可找出給定開頭字串對應資料庫裡頭的有效組合，你可以用 T 來當作輸入，可得到世界 9 萬多個城市裡頭，以 T 開頭的有效名稱
至於選單裡頭的 a 和 d 就由你去探索具體作用

何謂 Bloom Filter

當我們要在大量的資料中搜尋一個字串時，最簡單的方式就是走訪每個資料存放單位，這樣時間複雜度則是

O (n)

。

而，Bloom Filter 就是利用一個 n bits 的 table，不走訪資料結構就能透過這個 table 預測字串是不是在這個資料結構內。而這樣時間複雜度就變為

O (1)

。

這樣的應用其實很常見。例如在社群網站 facebook ，在搜尋欄鍵入名字時，能夠在 20 億個註冊使用者中很快的找到結果，甚至是根據與使用者的關聯度排序。

實作方式

首先，建立一個 n bits 的 table，並將每個 bit 初始化為 0。

我們所有的字串 set 表示為 S = {x₁ ,x₂ ,x₃…,x_n}， Bloom Filter 會使用 k 個不同的 hash function，每個 hash function 轉換後的範圍都是 0 ~ n-1 (為了能夠對應上面建立的 n bits table)。而對每個 S 內的 element x_i，都需要經過 k 個 hash function ，一一轉換成 k 個 index。轉換過後，table 上的這 k 個 index 的值就必須設定為 1。(注意到所以這裡可能會有同一個 index 被多次設定為 1 的狀況)

Bloom Filter 這樣的機制，是會有錯誤機率的。若今天想要找一個字串 x 在不在 S 中，這樣的資料結構只能保證某個 x₁ 一定不在 S 中，但沒辦法 100% 確定某個 x₂一定在 S 中。因為會有誤判 (false positive ) 的可能。

此外，資料只能夠新增，而不能夠刪除，試想今天有兩個 string x₁ , x₂經過某個 hash function h_i 轉換後的結果 h_i(x₁)=h_i(x₂) ，若今天要刪除x₁ 而把 table 中 set 的 1 改為 0，豈不是連 x₂ 都受到影響了 ?

錯誤率計算

首先假設我們的所有字串 set S 裡面有 n 個字串， hash 總共有 k 個，Bloom Filter 的 table 總共 m bits。我們會判斷一個字串存在於 S 內，是看經過轉換後的每個 bits 都被 set 了，我們就會說可能這個字串在 S 內。但試想若是其實這個字串不在 S 內，但是其他的 a b c 等等字串經過轉換後的 index ，剛好把我們的目標字串轉換後的 index 包含了，就造成了誤判這個字串在 S 內的情況。

如剛剛所述，Bloom Filter 有錯誤機率，而我們開發程式時可能必須把錯誤機率回報給使用者，因此以下將證明錯誤率公式。

當我們把 S 內的所有字串，每一個由 k 個 hash 轉換成 index 並把 table[index] 設為 1，而全部轉完後， table 中某個 bits 仍然是 0 的機率是

(1 - \frac{1}{m})^{k n}

。

其中

(1 - \frac{1}{m})

是每次 hash 轉換後，table 的某個 bits 仍然是 0 的機率。因為我們把 hash 轉換後到每個 index (而這個 index 會被 set) 的機率視為相等 (每人

\frac{1}{m}

)，所以用 1 減掉即為不會被 set 的機率。我們總共需要做 kn 次 hash，所以就得到答案

(1 - \frac{1}{m})^{k n}

。

由

(1 - \frac{1}{m})^{m} \approx e^{- 1}

特性，

(1 - \frac{1}{m})^{k n} = ((1 - \frac{1}{m})^{m})^{\frac{k n}{m}} \approx (e^{- 1})^{\frac{k n}{m}} \approx e^{- \frac{k n}{m}}

以上為全部的字串轉完後某個 bit 還沒有被 set 的機率。

因此誤判的機率=全部經由 k 個 hash 轉完後的 k bit 已經被其他人 set 的機率:
轉完後某個 bit 被 set機率 :

1 - e^{- \frac{k n}{m}}

，因此某 k bit 被 set機率 :

(1 - e^{- \frac{k n}{m}})^{k}

如何選參數值

k: 多少個不同 hash
m: table size 的大小

如何選 k 值?

我們必須讓錯誤率是最小值，因此必須讓

(1 - e^{- \frac{k n}{m}})^{k}

的值是最小。先把原式改寫成

(e^{k l n (1 - e^{- \frac{k n}{m}})})

，我們只要使

k l n (1 - e^{- \frac{k n}{m}})

最小，原式就是最小值。可以看出當

e^{- \frac{k n}{m}} = \frac{1}{2}

時，會達到最小值。因此

k = l n 2 \frac{m}{n}

即能達到最小值。

Bloom Filter calculator 和 Bloom Filter calculator 2 這兩個網站可以透過設定自己要的錯誤率與資料的多寡，得到 k 和 m 值。

在 prefix search 導入 bloom filter 機制

參考的字典檔案裡面總共有 36922 個詞彙。

經由上面的網址知道，當我們容許 5% 錯誤 (=0.05) 時:

table size = 230217
k = 4

先用原來的 cities.txt 檔案做測試，以驗證估程式。

首先，這裡選用了兩種 hash function:jenkins 和djb2，另外 murmur3 hash 也是一個不錯的選擇，這裡主要是選用夠 uniform 的即可。

unsigned int djb2(const void *_str) {
    const char *str = _str;
    unsigned int hash = 5381;
    char c;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c;
    return hash;
}

unsigned int jenkins(const void *_str) {
    const char *key = _str;
    unsigned int hash=0;
    while (*key) {
        hash += *key;
        hash += (hash << 10);
        hash ^= (hash >> 6);
        key++;
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

再來宣告我們的 Bloom filter 主體，這裡用一個結構體涵蓋以下：

func 是用來把我們的 hash function 做成一個 list 串起來，使用這個方法的用意是方便擴充。若我們今天要加進第三個 hash function，只要串進 list ，後面在實做時就會走訪 list 內的所有 function。
bits 是我們的 filter 主體，也就是 table 。 size 則是 table 的大小。

typedef unsigned int (*hash_function)(const void *data);
typedef struct bloom_filter *bloom_t;

struct bloom_hash {
    hash_function func;
    struct bloom_hash *next;
};

struct bloom_filter {
    struct bloom_hash *func;
    void *bits;
    size_t size;
};

主要的程式碼如下。

首先在 bloom_create 的部份，可以看到我們把兩個 hash function 透過bloom_add_hash 加入結構中。

再來 bloom_add 裡面是透過對目標字串 item 套用 hash ，並把 hash 產生的結果 module size (第20行)，這樣才會符合 table size 。最後第21行把產生的 bit 設定為 1 。

最後bloom_test就是去檢驗使用者輸入的 item 是不是在結構當中。與 add大致上都相等，只是把 bit 設定為 1 的部份改為「檢驗是不是 1」，若是任一個 bit 不是 1 ，我們就可以說這個結構裡面沒有這個字串。

這樣的實做，讓我們可以在 bloom_create 初始化完，接下來要加入字串就呼叫 bloom_add ，要檢驗字串就呼叫 bloom_test，不需要其他的操作。

bloom_t bloom_create(size_t size) {
    bloom_t res = calloc(1, sizeof(struct bloom_filter));
    res->size = size;
    res->bits = malloc(size);

    // 加入 hash function
    bloom_add_hash(res, djb2);
    bloom_add_hash(res, jenkins);

    return res;
}

void bloom_add(bloom_t filter, const void *item) {
    struct bloom_hash *h = filter->func;
    uint8_t *bits = filter->bits;
    while (h) {
        unsigned int hash = h->func(item);
        hash %= filter->size;
        bits[hash] = 1;
        h = h->next;
    }
}

bool bloom_test(bloom_t filter, const void *item) {
    struct bloom_hash *h = filter->func;
    uint8_t *bits = filter->bits;
    while (h) {
        unsigned int hash = h->func(item);
        hash %= filter->size;
        if (!(bits[hash])) {
            return false;
        }
        h = h->next;
    }
    return true;
}

再來是 main 改的部份，這裡修改 test_ref.c 檔案。需要更動的地方是 add 和 find 功能 (delete 會失去效用，注意！)

首先 find 功能修改：





















    t1 = tvgetf();
    // 用 bloom filter 判斷是否在 tree 內
    if (bloom_test(bloom, word) == 1) {
        // version1: bloom filter 偵測到在裡面就不走下去
        t2 = tvgetf();
        printf("  Bloomfilter found %s in %.6f sec.\n", word, t2 - t1);
        printf("  Probability of false positives:%lf\n", pow(1 - exp(-(double)HashNumber / (double) ((double) TableSize /(double) idx)), HashNumber));
        
        // version2: loom filter 偵測到在裡面，就去走 tree (防止偵測錯誤)
        t1 = tvgetf();
        res = tst_search(root, word);
        t2 = tvgetf();
        
        if (res)
            printf("  ----------\n  Tree found %s in %.6f sec.\n", (char *) res, t2 - t1);
        else
            printf("  ----------\n  %s not found by tree.\n", word);  
    } else
        printf("  %s not found by bloom filter.\n", word);
    
    break;

由以上可見，第 3 行會先用 bloom filter 去判斷是否在結構內，若不存在的話就不用走訪。如果在的話，程式印出 bloom filter 找到的時間，以及錯誤機率，隨後走訪節點，判斷是否在 tree 裡面。

上述參數的配置為:

TableSize = 5000000 (=五百萬)
hashNumber = 2 (總共兩個 hash)
用原來的 cities.txt 數據量為 259112

計算出理論錯誤率約為 0.009693，也就是不到 1%。

以下是 add 的部份：


















if (bloom_test(bloom, Top) == 1) // 已被 Bloom filter 偵測存在，不要走 tree
    res = NULL;			 
else { // 否則就走訪 tree，並加入 bloom filter
    bloom_add(bloom,Top);
    res = tst_ins_del(&root, &p, INS, REF);	
}
t2 = tvgetf();

if (res) {
    idx++;
    Top += (strlen(Top) + 1);
    printf("  %s - inserted in %.10f sec. (%d words in tree)\n",(char *) res, t2 - t1, idx);  
} else
    printf("  %s - already exists in list.\n", (char *) res);

if (argc > 1 && strcmp(argv[1], "--bench") == 0) // a for auto
    goto quit;
break;

以上程式碼可以看到，第 1 行的部份會先偵測要加入的字串是否在結構中，若是回傳 1 表示 bloom filter 偵測在結構中，已經重複了。因此會把 res 設為 null，這樣下方第 9 行的判斷自然就不會成立。如果偵測字串不在結構中，會跑進第 3 行的 else，加入結構還有 Bloom filter 當中。

GNU Make 的技巧

參見 Makefile header 檔的相依性檢查

Linux 效能分析工具: Perf

Perf 全名是 Performance Event，是在 Linux 2.6.31 以後內建的系統效能分析工具，它隨著核心一併釋出。藉由 perf，應用程式可以利用 PMU (Performance Monitoring Unit), tracepoint 和核心內部的特殊計數器 (counter) 來進行統計，另外還能同時分析運行中的核心程式碼，從而更全面了解應用程式中的效能瓶頸
請詳讀 perf 原理和實務並在自己的 Linux 開發環境實際操作
- 再次提醒：你的 GNU/Linux「不能」也「不該」透過虛擬機器 (如 VMware, VirtualBox, QEMU 等等) 安裝，請務必依據 2017 年春季作業說明指示，將 Linux 安裝於實體硬碟並設定好多重開機

透過統計模型來分析資料

參照交通大學開放課程: 統計學(一) / 統計學(二)
- 95％信賴區間是從樣本數據計算出來的一個區間，在鐘形曲線的條件下，由樣本回推，「可能」會有 95 % 的機會把真正的母體參數包含在區間之中
- 鐘形曲線可見資料的密集與散佈與否
- 圖形上第一個數為平均，第二個數為標準差的平方，我們稱以下圖形為機率密度函數 Probability Density Function (PDF)
- 雖然不一定每種統計數據都是常態分佈，但只要資料是常態分佈，機率密度函數就會成鐘形曲線，這是我們所希望的，因為排除掉忽大忽小的個案，才可以掌握資訊的正確性

把 dict 程式碼和給定的資料分用從 CPY 和 REF，用統計的方式重畫，其中 X-axis 是 cycles 數，Y-axis 是到目前為止已累積多少筆相同 cycle 數的資料。

CPY 分析分佈
REF (+Memory pool) 分析分佈

轉換圖表 X-Y 的數據後我們可以很明顯的看出 REF 資料密集程度遠大於 CPY ，不只在 count 累積的數目，也在 cycles 的 range 有很大的差異
計算平均與標準差：
標準差
$σ = \sqrt{\frac{1}{N} \cdot \sum_{k = 1}^{N} (X_{i} - \overset{―}{x})^{2}}$
- CPY 平均： 5974.116 cycles
- CPY 標準差: 5908.070
- REF 平均: 65.885 cycles
- REF 標準差: 9.854
帶入機率密度函數公式：

$f (x) = \frac{1}{σ \sqrt{2 π}} e^{- \frac{(x - μ)^{2}}{2 σ^{2}}}$
- 把 x 帶入後，在常態分佈下資料展現如下圖。要特別注意是，因為我們判定資料型態像常態分佈才用這個公式算出理想的模型，CPY 表現太分散，其實並不像常態分佈所以才會在算 95% 信賴區間（平均加減 2 倍標準差）時出現負數，而 REF 符合常態分佈，在此假設他們是常態分佈下才能做以下的鐘形曲線圖
- probability 總和為 1，因為資料量大，所以如果愈分散每個 cycles 分到的機率愈小

REF_PDF ideal model (理想情況):
REF_PDF real model (真實情況):

顯然真實情況偏離理想狀況，但我們可將出現機率很小且不符合 95% 信賴區間的數值拿掉不考量。

CPY_PDF real model

CPY 除了非常不符合常態分佈以外，還有一個更糟糕的情況，我們出現機率少的數值竟然佔大多數，可見資料是十分分散的，所以這邊我們不能取 95% 信賴間，因此得重新分析 CPY_PDF。

Poisson distribution

參考 Statistics The Poisson Distribution , 統計應用小小傳 , Wikipedia

The fundamental trait of the Poisson is its asymmetry，and this trait it preserves at any value of
$λ$

有別於我們一般常態分佈的對稱性，不對稱性剛好是卜瓦松分布的特色
- x 軸代的 k 是 indexm, y 軸則是代入 k 後的機率
Probability of events for a Poisson distribution：
P ( k events in interval ) =
$e^{- λ} \cdot \frac{λ^{k}}{k!}$
- 只需要一個參數：單位時間平均事件發生次數 λ
- 要判定一個統計是否可以做出 Poisson distribution 必須要有以下特點：
  (1) the event is something that can be counted in whole numbers
  (2) occurrences are independent
  (3) the average frequency of occurrence for the time period is known
  (4) it is possible to count how many events have occurred
$λ$ 代
$\frac{1}{μ}$ : 這邊的意思拿 CPY 來說就是平均每 5974.116 個 cycle 可以完成一次搜尋
但是為了做圖方便所以將 5974.116 (cycles) 為一個單位時間，這樣
$λ$ 代 1
放大版
ideal 為階梯狀的原因是因為　5974.116 cycles 為一個單位，然後公式裏面又有　ｋ　階層，所以在小數的部份會被忽略，但 real 的部份又是 1 個 cycle 為單位，應以　5974.116 cycles 為單位製圖較容易觀察差異
以 5974.116 cycles 為一個單位製圖
這樣就可確定 CPY 比起 Normal distribution 確實符合　Poisson distribution
代入 Poisson 的 95% confidence interval
- 參考 How to calculate a confidence level for a Poisson distribution?
兩者的 95% 信賴區間
- CPY = 5477.26 ～ 6947.76 (cycles)
- REF (+memory pool) ＝ 46.177 ～ 85.593 (cycles)

作業要求

在 GitHub 上 fork dict，研讀並修改 Makefile 以允許 $ make plot 時，視覺化 ternary search tree + bloom filter 的效能表現。注意，需要設計實驗，允許不同的字串輸入並探討 TST 回應的統計分佈
- 要考慮到 prefix search 的特性還有詞彙的涵蓋率
- 限用 gnuplot 來製圖
- 參閱 Hash Table Benchmarks
以 perf 分析 TST 程式碼的執行時期行為並且充分解讀
考慮到 CPY 和 REF 這兩種實作機制，思考效能落差的緣故，並且提出後續效能改善機制
- 充分量化並以現代處理器架構的行為解讀
- 注意：這兩者都該加入 Bloom filter，而且分析具體效益 (原有程式存在若干缺陷，也該指出並修正)

繳交方式

編輯 Homework 3 作業區共筆，將你的觀察、上述要求的解說、應用場合探討，以及各式效能改善過程，善用 gnuplot 製圖，紀錄於新建立的共筆

截止日期

Oct 17, 2018 (含) 之前進行，不該截止日前一天才動手
越早在 GitHub 上有動態、越早接受 code review，評分越高