D04: prefix-search

tags: `sysprog2018`

Instructor: jserv / Course discussion forum: 2018 System Software course



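Expected goals

Learn the ternary search tree as an implementation mechanism for auto-complete and prefix search
Build on the earlier phonebook work by introducing a ternary search tree so that the phonebook program becomes more user-friendly
In line with the Week 4 material, consider high-performance programming issues that target the characteristics of modern processors
Design a program framework for performance evaluation (benchmarking)
Learn advanced GNU make techniques
Review probability and statistics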

Autocompletion in C

Autocomplete using a ternary search tree
Ternary Search Tree visualization
- Insert strings such as ab, abs, and absolute one at a time and observe how the tree changes
A practical integration example: Phonebook: adds new contacts and effectively search for contacts via ternary search tree
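
To make the structure concrete before looking at the real code, here is a minimal sketch of ternary search tree insertion and exact lookup. The node layout and the names `tst_node`, `tst_insert`, `tst_find`, and `tst_free` are assumptions made for illustration; they are not taken from the prefix-search repository.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical node layout, for illustration only. */
typedef struct tst_node {
    char key;                      /* character stored at this node */
    bool is_end;                   /* true if a word ends at this node */
    struct tst_node *lo, *eq, *hi; /* smaller / next character / larger */
} tst_node;

/* Insert a non-empty word; returns the (possibly newly created) root. */
tst_node *tst_insert(tst_node *root, const char *word)
{
    assert(word && *word);
    if (!root) {
        root = calloc(1, sizeof(*root));
        if (!root)
            abort(); /* out of memory: give up in this sketch */
        root->key = *word;
    }
    if (*word < root->key)
        root->lo = tst_insert(root->lo, word);
    else if (*word > root->key)
        root->hi = tst_insert(root->hi, word);
    else if (word[1])
        root->eq = tst_insert(root->eq, word + 1);
    else
        root->is_end = true;
    return root;
}

/* Return true only if the exact word was inserted before. */
bool tst_find(const tst_node *node, const char *word)
{
    if (!word || !*word)
        return false;
    while (node) {
        if (*word < node->key) {
            node = node->lo;
        } else if (*word > node->key) {
            node = node->hi;
        } else if (word[1]) {
            node = node->eq;
            word++;
        } else {
            return node->is_end;
        }
    }
    return false;
}

/* Release every node of the tree. */
void tst_free(tst_node *node)
{
    if (!node)
        return;
    tst_free(node->lo);
    tst_free(node->eq);
    tst_free(node->hi);
    free(node);
}
```

Inserting ab, abs, and absolute in that order produces a single chain of `eq` links, with `is_end` set on the nodes for b, s, and the final e. That is why ab and abs are found while a and abso are not.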

ternary search tree

Get the prefix-search source code, then build and test it

$ git clone https://github.com/sysprog21/prefix-search
$ cd prefix-search
$ make
$ ./test_cpy

You should see the following output:

ternary_tree, loaded 259112 words in 0.151270 sec

Commands:
 a  add word to the tree
 f  find word in tree
 s  search words matching prefix
 d  delete word from the tree
 q  quit, freeing all data

choice:

Press f followed by Enter to get the `find word in tree:` prompt. Type Taiwan (remember to press Enter); the expected output is:

find word in tree: Taiwan
  found Taiwan in 0.000002 sec.

Back at the menu, press s followed by Enter to get the `find words matching prefix (at least 1 char):` prompt. Type Tain; the expected output is:

  Tain - searched prefix in 0.000011 sec

suggest[0] : Tain,
suggest[1] : Tainan,
suggest[2] : Taino,
suggest[3] : Tainter
suggest[4] : Taintrux,

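This shows what prefix-search does: given a leading string, it finds the valid entries in the database that start with it. For example, entering T lists every valid name beginning with T among the more than 90,000 cities of the world in the data set.
The a and d menu commands are left for you to explore.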
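
The prefix lookup described above can be sketched as follows: walk down the tree along the prefix, then traverse the subtree below the last matched node in order. This is a minimal illustration and not the repository's implementation. The names `node`, `insert`, `collect`, and `prefix_search` and the `WORD_MAX` limit are all assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define WORD_MAX 64 /* assumed upper bound on word length (incl. NUL) */

/* Hypothetical node layout, for illustration only. */
typedef struct node {
    char key;
    int is_end;
    struct node *lo, *eq, *hi;
} node;

/* Insert a non-empty word; returns the (possibly new) root. */
node *insert(node *n, const char *w)
{
    assert(w && *w);
    if (!n) {
        n = calloc(1, sizeof(*n));
        if (!n)
            abort();
        n->key = *w;
    }
    if (*w < n->key)
        n->lo = insert(n->lo, w);
    else if (*w > n->key)
        n->hi = insert(n->hi, w);
    else if (w[1])
        n->eq = insert(n->eq, w + 1);
    else
        n->is_end = 1;
    return n;
}

/* In-order walk; buf[0..depth) holds the characters matched so far. */
static void collect(const node *n, char *buf, size_t depth,
                    char out[][WORD_MAX], size_t max, size_t *count)
{
    if (!n || *count >= max || depth + 1 >= WORD_MAX)
        return;
    collect(n->lo, buf, depth, out, max, count);
    buf[depth] = n->key;
    if (n->is_end && *count < max) {
        buf[depth + 1] = '\0';
        strcpy(out[(*count)++], buf);
    }
    collect(n->eq, buf, depth + 1, out, max, count);
    collect(n->hi, buf, depth, out, max, count);
}

/* Store up to max words starting with prefix in out; return the count. */
size_t prefix_search(const node *root, const char *prefix,
                     char out[][WORD_MAX], size_t max)
{
    size_t len = strlen(prefix), count = 0, i = 0;
    if (len == 0 || len + 1 >= WORD_MAX)
        return 0;
    const node *n = root;
    while (n) { /* descend to the node of the prefix's last character */
        if (prefix[i] < n->key)
            n = n->lo;
        else if (prefix[i] > n->key)
            n = n->hi;
        else if (i + 1 < len) {
            n = n->eq;
            i++;
        } else
            break;
    }
    if (!n)
        return 0;
    char buf[WORD_MAX];
    memcpy(buf, prefix, len);
    buf[len] = '\0';
    if (n->is_end && max > 0) /* the prefix itself is a word */
        strcpy(out[count++], buf);
    collect(n->eq, buf, len, out, max, &count);
    return count;
}
```

Because the traversal visits `lo`, then the node itself, then `hi`, the suggestions come out in character order regardless of insertion order. That is consistent with the order shown in the Tain example above. The sketch never frees its nodes; a real program would release them on exit.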

GNU Make techniques

See: dependency checking for header files in a Makefile

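Analyzing data with statistical models

See the National Chiao Tung University OpenCourseWare: Statistics (I) / Statistics (II)
- A 95% confidence interval is an interval computed from sample data. Assuming a bell-shaped (normal) distribution, an interval built this way from a sample has a 95% chance of containing the true population parameter
- The bell curve shows how concentrated or how spread out the data are
- In the figure, the first number is the mean and the second is the variance (the square of the standard deviation). A curve of this kind is called a Probability Density Function (PDF)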
  [Figure: probability density functions of normal distributions, each labeled with its mean and variance]
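- Not every data set is normally distributed, but whenever the data are, the probability density function takes the shape of a bell curve. This is what we hope for, because the accuracy of the measurements can only be trusted once erratic outliers have been excluded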

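Run the prefix-search code on the given data set in both its CPY and REF variants, and replot the measurements statistically: the X-axis is the cycle count, and the Y-axis is the number of samples accumulated so far with that same cycle count.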
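
The tally behind such a plot can be sketched in a few lines: one counter per cycle count, incremented once per sample. The function name `cycle_histogram` and the decision to count out-of-range samples separately are assumptions for illustration, not part of the assignment code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Tally how many samples have each cycle count.
 * counts[c] ends up holding the number of samples equal to c, for
 * c in [0, nbins). Samples >= nbins are not stored; their number is
 * returned so the caller can see how much data fell off the plot.
 */
size_t cycle_histogram(const unsigned long *cycles, size_t n,
                       unsigned long *counts, size_t nbins)
{
    assert(counts);
    for (size_t i = 0; i < nbins; i++)
        counts[i] = 0;

    size_t out_of_range = 0;
    for (size_t i = 0; i < n; i++) {
        if (cycles[i] < nbins)
            counts[cycles[i]]++;
        else
            out_of_range++;
    }
    return out_of_range;
}
```

Printing each non-zero `counts[c]` next to its `c` gives the two columns (cycles, count) that a plotting tool needs.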

CPY distribution:
[Figure: CPY distribution (X-axis: cycles, Y-axis: count)]
REF (memory pool version) distribution:
[Figure: REF distribution (X-axis: cycles, Y-axis: count)]

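After replotting the data this way, it is obvious that REF is far more tightly clustered than CPY: the two differ greatly not only in the accumulated counts but also in the range of cycles.
Computing the mean and the standard deviation:
Standard deviation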
$\sigma = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} (x_{i} - \bar{x})^{2}}$
- CPY mean: 5974.116 cycles
- CPY standard deviation: 5908.070
- REF mean: 65.885 cycles
- REF standard deviation: 9.854
Substituting into the probability density function:

$f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{- \frac{(x - \mu)^{2}}{2 \sigma^{2}}}$
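- After substituting x, the data under a normal distribution look like the figures below. Note that this formula yields an idealized model only because we judged the data to look normally distributed. CPY is too dispersed and does not actually resemble a normal distribution, which is why its 95% confidence interval (mean ± 2 standard deviations) has a negative lower bound. REF does fit a normal distribution; the bell curves below are drawn under the assumption that the data are normally distributed
- The probabilities sum to 1. Because the data set is large, the more spread out the data are, the smaller the probability assigned to each individual cycle count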
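
As a worked check using only the means and standard deviations reported above, the interval of mean ± 2 standard deviations is:

$\text{CPY}: 5974.116 \pm 2 \times 5908.070 = [-5842.024,\ 17790.256]$

$\text{REF}: 65.885 \pm 2 \times 9.854 = [46.177,\ 85.593]$

The CPY lower bound is negative, which is impossible for a cycle count. The same numbers show how spread lowers the per-cycle probability: the normal density peaks at $x = \mu$ with value $f(\mu) = \frac{1}{\sigma \sqrt{2 \pi}}$, which gives approximately

$f_{\text{REF}}(\mu) \approx \frac{1}{9.854 \times 2.5066} \approx 0.0405, \qquad f_{\text{CPY}}(\mu) \approx \frac{1}{5908.070 \times 2.5066} \approx 6.75 \times 10^{-5}$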

REF_PDF ideal model:
[Figure: REF probability density, ideal model]
REF_PDF real model:
[Figure: REF probability density, real model]

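The real case clearly deviates from the ideal one, but values that occur with very small probability and fall outside the 95% confidence interval can be dropped from consideration.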
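
One way to drop such values is sketched below, assuming the interval is taken as mean ± 2 standard deviations as in the text. The function name `keep_within_2sd` is hypothetical, and the caller supplies the mean and standard deviation.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy every sample lying within mean +/- 2*sd into out[] and return
 * how many were kept. out must have room for n values.
 */
size_t keep_within_2sd(const double *x, size_t n, double mean, double sd,
                       double *out)
{
    assert(sd >= 0.0);
    const double lo = mean - 2.0 * sd;
    const double hi = mean + 2.0 * sd;

    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (x[i] >= lo && x[i] <= hi)
            out[kept++] = x[i];
    }
    return kept;
}
```

With the REF statistics reported earlier (mean 65.885, standard deviation 9.854) the bounds are 46.177 and 85.593, so a sample of 85 cycles is kept while 45, 90, and 300 are discarded.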

CPY_PDF real model: