2019q3 第 5 週測驗題

# 2019q3 第 5 週測驗題 :::danger 注意須知: 1. 答題時請先建立一份 HackMD 共筆，並設定其存取權限為「任何人都可讀取、已登入系統者可編輯」，隨後將 HackMD 網址張貼於表單中。每題都是[申論題](https://zh.wikipedia.org/zh-tw/%E7%94%B3%E8%AB%96%E9%A1%8C)，答題時請==善用 LaTeX 語法==; 2. 可翻閱書籍、可使用網頁搜尋引擎，亦可使用編譯器和除錯器來實驗，但在測驗期間 (`15:00`-`23:50` Oct 10)，禁止任何形式的討論; 3. 以合法 C 語言 (C99 以上的規格) 撰寫程式碼，需列出完整程式碼; 4. 題目較之前測驗更長，請務必耐心閱讀 5. 為了避免學員大量從給定素材中直接複製貼上原文，作答時==請儘量用漢語書寫== (當然，程式碼、數學表達式、關鍵字和科技詞彙不在此限); 6. 測驗題共 6 題，每題 20 分 ::: --- ## 測驗 `1` 本題引導學員實作 [Quotient Filters](https://en.wikipedia.org/wiki/Quotient_filter)，在這之前需要先對 [Bloom Filter](https://en.wikipedia.org/wiki/Bloom_filter) 有概念。 - [ ] 何謂 Bloom Filter 當我們要在大量的資料中搜尋一個字串時，最簡單的方式就是走訪每個資料存放單位，對應的時間複雜度是 $O(n)$。 Bloom filter 利用 hash function，在不用走訪全部元素的前提，**預測**特定字串是否存於資料結構中。因此時間複雜度是 $O(1)$，而非傳統逐一尋訪的 $O(n)$ 。 * n-bits 的 table * k 個 hash fucntion: h~1~ 、 h~1~ ... h~k~ * 每當要加入新字串 s 時，會將 s 透過這 k 個 hash function 各自轉換為 table index (範圍: `0` 到 `n - 1`) ，所以有 k 個 hash function ，就該有 k 個 index ，然後將該 `table[index]` 上的 bit set * 下次若要檢查該字串 s 是否存在時，只需要再跑一次那 k 個 hash function ，並檢查是否所有的 k 個 bit 都是 set 即可 ![](https://i.imgur.com/qa6qAzi.png) * 動畫: [Cube Drone - Bloom Filters](https://www.youtube.com/watch?v=-SuTGoFYjZs) 但這做法**存在錯誤率**，例如原本沒有在資料結構中的字串 s1 經過 hash 轉換後得出的 bit 的位置和另一個存在於資料結構中的字串 s2 經過 hash 之後的結果相同，如此一來，先前不存在的 s1 便會被認為存在，這就是 [false positive](https://en.wikipedia.org/wiki/False_positives_and_false_negatives) (指進行實用測試後，測試結果有機會不能反映出真正的面貌或狀況)。 Bloom filter 一類手法的應用很常見。例如在社群網站 Facebook，在搜尋欄鍵入名字時，能夠在 [20 億個註冊使用者](https://www.bnext.com.tw/article/45104/facebook-maus-surpasses-2-billion) (2017 年統計資料) 中很快的找到結果，甚至是根據與使用者的關聯度排序。延伸閱讀: * [Esoteric Data Structures and Where to Find Them](https://youtu.be/-8UZhDjgeZU?t=607) * [Bloom Filter 影片解說](https://www.youtube.com/watch?v=bEmBh1HtYrw) - [ ] Bloom Filter 實作方式首先，建立一個 n bits 的 table，並將每個 bit 初始化為 0。我們將所有的字串構成的集合 (set) 表示為 S = { x~1~, x~2~, x~3~, ... ,x~n~ }，Bloom Filter 會使用 k 個不同的 hash function，每個 hash function 轉換後的範圍都是 0 到 n-1 (為了能夠對應上面建立的 n bits table)。而對每個 S 內的 element x~i~，都需要經過 k 個 hash function，一一轉換成 k 個 index。轉換過後，table 上的這 k 個 index 的值就必須設定為 `1`。 > 注意: 可能會有同一個 index 被多次設定為 `1` 的狀況 Bloom Filter 這樣的機制，存在一定的錯誤率。若今天想要找一個字串 x 在不在 S 中，這樣的資料結構只能保證某個 x~1~ 一定不在 S 中，但沒辦法 100% 確定某個 x~2~一定在 S 中。因為會有誤判 (false positive ) 的可能。此外，資料只能夠新增，而不能夠刪除，試想今天有兩個字串 x~1~, x~2~ 經過某個 hash function h~i~ 轉換後的結果 h~i~(x~1~) = h~i~(x~2~)，若今天要刪除 x~1~ 而把 table 中 set 的 1 改為 0，豈不是連 x~2~ 都受到影響？ - [ ] Bloom Filter 錯誤率計算首先假設我們的所有字串集合 S 裡面有 n 個字串， hash 總共有 k 個，Bloom Filter 的 table 總共 m bits。我們會判斷一個字串存在於 S 內，是看經過轉換後的每個 bits 都被 set 了，我們就會說可能這個字串在 S 內。但試想若是其實這個字串不在 S 內，但是其他的 a b c 等等字串經過轉換後的 index ，剛好涵蓋了我們的目標字串轉換後的 index，就造成了誤判這個字串在 S 內的情況。如上述，Bloom Filter 存有錯誤機率，程式開發應顧及回報錯誤機率給使用者，以下分析錯誤率。當我們把 S 內的所有字串，每一個由 k 個 hash 轉換成 index 並把 `table[index]` 設為 1，而全部轉完後， table 中某個 bits 仍然是 0 的機率是 $(1-\dfrac{1}{m})^{kn}$ 。其中 $(1-\dfrac{1}{m})$ 是每次 hash 轉換後，table 的某個 bit 仍然是 0 的機率。因為我們把 hash 轉換後到每個 index (而這個 index 會被 set) 的機率視為相等 (每人 $\dfrac{1}{m}$)，所以用 1 減掉即為不會被 set 的機率。我們總共需要做 kn 次 hash，所以就得到答案$(1-\dfrac{1}{m})^{kn}$。由 $(1-\dfrac{1}{m})^{m}≈e^{-1}$ 特性，$(1-\dfrac{1}{m})^{kn}=((1-\dfrac{1}{m})^{m})^{\frac{kn}{m}}≈(e^{-1})^{\frac{kn}{m}}≈e^{-\frac{kn}{m}}$ 以上為全部的字串轉完後某個 bit 還沒有被 set 的機率。因此誤判的機率等同於全部經由 k 個 hash 轉完後的 k bit 已經被其他人 set 的機率: 轉完後某個 bit 被 set 機率是: $1-e^{-\frac{kn}{m}}$，因此某 k bit 被 set 機率為: ==$(1-e^{-\frac{kn}{m}})^k$== 延伸閱讀: * [Bloom Filter](https://blog.csdn.net/jiaomeng/article/details/1495500) ### 如何選參數值 * k: 多少個不同 hash * m: table size 的大小 **如何選 k 值?** 我們必須讓錯誤率是最小值，因此必須讓 $(1-e^{-\frac{kn}{m}})^k$ 的值是最小。先把原式改寫成 $(e^{k\ ln(1-e^{-\frac{kn}{m}})})$，我們只要使 ${k\ ln(1-e^{-\frac{kn}{m}})}$ 最小，原式就是最小值。可以看出當 $e^{-\frac{kn}{m}}=\dfrac{1}{2}$ 時，會達到最小值。因此 ==$k=ln2\dfrac{m}{n}$== 即能達到最小值。 [Bloom Filter calculator](https://hur.st/bloomfilter/?n=5000&p=3&m=&k=45) 和 [Bloom Filter calculator 2](https://krisives.github.io/bloom-calculator/) 這兩個網站可以透過設定自己要的錯誤率與資料的多寡，得到 `k` 和 `m` 值。 ### [Quotient Filters](https://en.wikipedia.org/wiki/Quotient_filter) 請詳細觀看 San Diego State University 的 Rob Edwards 教授的解說影片: [Counting Quotient Filters](https://youtu.be/oMFHkuVoF70) (可開字幕)。一般的 Bloom filter 存在以下問題： * The inability to delete items * Poor scaling out of RAM * The inability to resize dynamically * The inability to count the number of occurrences of each item, especially with skewed input distributions. 在 [A general purpose counting filter: making every bit count](https://blog.acolyer.org/2017/08/08/a-general-purpose-counting-filter-making-every-bit-count/) 提到的 Quotient filter 就著眼於解決上述問題。請嘗試總結上述材料，解釋為何 Quotient filter 得以解決 Bloom filter 存在的限制和 [false positive](https://en.wikipedia.org/wiki/False_positives_and_false_negatives) 問題，需要特別標注對應的場景和分析 (包含缺點，如更大的空間開銷)。 :::success 延伸問題: 1. 舉出 Linux 核心中應用 Bloom filter / Quotient Filter 的案例，像是 [Add Policy Engine Algorithmic Bloom Filter Entries Register](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=418089a850c751ada5e3535d0e71c3312fe3c432) 和 [Linux Socket Filtering aka Berkeley Packet Filter (BPF)](https://www.kernel.org/doc/Documentation/networking/filter.txt)，並予以解說; 2. 本系列課程希望破除一項迷思：很多人 (特別是台灣軟體工程師) 對 Linux 核心存在嚴重偏見，以為只要基本的邏輯觀念，而不需要數學，事實上，Linux 核心大量充斥著各式數學，例如離散數學及機率統計 (歡迎下學期選修「[Linux 核心設計](http://wiki.csie.ncku.edu.tw/linux/schedule)」課程)。在 Linux 核心原始程式碼中，藉由 Bloom Filter 來檢索一個元素是否在一個集合中，空間效率極高，利用 bitwise 簡潔地表示一個集合，空間效率和查詢時間都勝於常見演算法。請回想你撰寫過 (或深入追蹤過) 的程式碼，思考導入 Bloom/Quotient filer 的機會及效益，並實際量化分析。 ::: --- ## 測驗 `2` 承測驗 `1`，考慮 [quotient-filter](https://github.com/sysprog21/quotient-filter) 是個 Quotient filter 實作，編譯和執行方式如下: ```shell $ make $ ./bench ``` 程式執行輸出如下: ``` Testing 201326592 random inserts and 1000000 lookups........... done (27 seconds). Testing 65536 contiguous inserts and 1000000 lookups........... done (72 seconds). ``` 請利用 `quotient-filter.c` 裡頭的函式，實作 `qf_is_subsetof`: ```cpp /* Check if @lhs is a subset of @rhs */ bool qf_is_subsetof(quotient_filter *lhs, quotient_filter *rhs) { qf_iterator qfi; /* Write your own code */ } ``` 其作用是檢查 lhr 指標所指向的 Quotient Filter 是否是 rhs 指標所指向物件的子集合 (subset)，若是則回傳 `true`，反之回傳 `false`。完成上述 `is_subsetof` 函式的實作，作答時需要張貼完整函式宣告及內容。 --- ## 測驗 `3` 承測驗 `2`，實作 `qf_merge`，使其得以合併兩個 Quotient filter 裡頭的各項: ```cpp /* * Initializes qf_out and copies over all elements from qf1 and qf2. * Caution: qf_out holds twice as many entries as either qf1 or qf2. * * Returns false on ENOMEM. */ bool qf_merge(quotient_filter *qf_out, quotient_filter *qf1, quotient_filter *qf2) { qf_iterator qfi; /* Write your code here */ } ``` 註: 依據 [glibc: Error Codes](https://www.gnu.org/software/libc/manual/html_node/Error-Codes.html), `ENOMEM` 意味著 "Cannot allocate memory." > The system cannot allocate more virtual memory because its capacity is full. 完成上述 `qf_merge` 函式的實作，作答時需要張貼完整宣告及內容。 --- ## 測驗 `4` 承測驗 `3`，考慮到操作過程中可能會發生非預期狀況，我們想透過 `qf_is_consistent` 函式來檢驗給定的 quotient_filter 指標對應的物件是否為有效的 Quotient filter: ```cpp #include <assert.h> bool qf_is_consistent(quotient_filter *qf) { assert(qf->qbits); assert(qf->rbits); assert(qf->qf_table); assert(qf->entries <= qf->max_size); /* Write your code here */ } ``` 檢驗的手法很多，例如: ```cpp if (qf->entries == 0) { for (uint64_t start = 0; start < qf->max_size; ++start) if (get_elem(qf, start) != 0) return false; return true; } ``` 請儘量依據 Quotient filter 特性，撰寫對應的 `qf_is_consistent` 驗證程式碼，作答時需要張貼完整宣告及內容，還要簡述你的檢驗項目和想法。 --- ## 測驗 `5` 承測驗 `2`，實作 `qf_remove`，得以自 Quotient filter 中移除特定的 hash: ```cpp /** * Removes a hash from the Quotient filter. * * Caution: If you plan on using this function, make sure that your hash * function emits no more than q+r bits. Consider the following scenario; * insert(qf, A:X) # X is in the lowest q+r bits. * insert(qf, B:X) # This is a no-op, since X is already in the table. * remove(qf, A:X) # X is removed from the table. * * Now, may-contain(qf, B:X) == false, which is a ruinous false negative. * * Returns false if the hash uses more than q+r bits. */ bool qf_remove(quotient_filter *qf, uint64_t hash) { uint64_t highbits = hash >> (qf->qbits + qf->rbits); if (highbits) return false; uint64_t fq = hash_to_quotient(qf, hash); uint64_t fr = hash_to_remainder(qf, hash); uint64_t T_fq = get_elem(qf, fq); if (!is_occupied(T_fq) || !qf->entries) return true; uint64_t start = find_run_index(qf, fq); uint64_t s = start; uint64_t rem; /* Find the offending table index (or give up). */ do { rem = get_remainder(get_elem(qf, s)); if (rem == fr) { break; } else if (rem > fr) { return true; } s = incr(qf, s); } while (is_continuation(get_elem(qf, s))); if (rem != fr) return true; uint64_t kill = (s == fq) ? T_fq : get_elem(qf, s); bool replace_run_start = is_run_start(kill); /* If we are deleting the last entry in a run, clear `is_occupied'. */ if (is_run_start(kill)) { uint64_t next = get_elem(qf, incr(qf, s)); if (!is_continuation(next)) { T_fq = clr_occupied(T_fq); set_elem(qf, fq, T_fq); } } delete_entry(qf, s, fq); if (replace_run_start) { /* Write your code here */ } --qf->entries; return true; } ``` 完成上述 `qf_remove` 函式的實作，作答時需要張貼程式碼，並解說程式碼運作原理，同時也該解說程式註解中 `Caution:` 的考量和限制為何如此。 :::success 延伸問題: 1. 參照 [cqf](https://github.com/splatlab/cqf): A General-Purpose Counting Filter: Counting Quotient Filter 這份實作，學習其[測試程式](https://github.com/splatlab/cqf/blob/master/src/test.c)，除了新增尚有刪去 key/hash，善用測驗 `4` 發展的 `qf_is_consistent` 函式，在操作後檢驗 Quotient filter 的一致性; 2. 設計足以比較 [sysprog21/quotient-filter](https://github.com/sysprog21/quotient-filter) 和 [cqf](https://github.com/splatlab/cqf) 效能表現的實驗，可參照論文 [A General-Purpose Counting Filter: Making Every Bit Count](https://dl.acm.org/citation.cfm?id=3035963) 的手法，並予以分析; 3. 效能分析的過程請善用 Linux 的 [perf 工具](http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial) * 參考 [使用 perf_events 分析程式效能](https://zh-blog.logan.tw/2019/07/10/analyze-program-performance-with-perf-events/) 和 [簡介 perf_events 與 Call Graph](https://zh-blog.logan.tw/2019/10/06/intro-to-perf-events-and-call-graph/) ::: --- ## 測驗 `6` 承測驗 `2`，研究 [cqf](https://github.com/splatlab/cqf): A General-Purpose Counting Filter: Counting Quotient Filter 這份實作，搭配對應的論文 [A General-Purpose Counting Filter: Making Every Bit Count](https://dl.acm.org/citation.cfm?id=3035963)，指出其相較於 [sysprog21/quotient-filter](https://github.com/sysprog21/quotient-filter) 有哪些相同的功能，又額外實作了哪些機制，請儘量描述。例如: 在 [src/gqf_file.c](https://github.com/splatlab/cqf/blob/master/src/gqf_file.c) 的實作中，[cqf](https://github.com/splatlab/cqf) 支援檔案操作，但 [sysprog21/quotient-filter](https://github.com/sysprog21/quotient-filter) 僅能處理 in-memory 的資料，並且 [src/gqf_file.c](https://github.com/splatlab/cqf/blob/master/src/gqf_file.c) 利用 madvise 系統呼叫以提示 Linux 虛擬記憶體管理器做更好的分頁處理。 ---