2019q3 第 5 週測驗

# 2019q3 第 5 週測驗 ###### tags: `sysprog` `sysprog2019` ## 測驗 1 ### The inability to delete items 一般的 Bloom filters 如下圖方式運作： ![](https://i.imgur.com/qa6qAzi.png) *圖1：來自[題目](/@sysprog/BJvfgm2B)的圖* Bloom filters 使用 $k$ 個 hash function 來代表一個元素存在。因此如果我們要刪除一個元素，我們需要將該元素對應到的所有 Hash Function 的 bit 都清除掉。但是有一個問題，我們怎麼知道有沒有其他元素也對應到要被清除的 bit ？（比如說圖1的 $a_1$ 與 $b$，有兩個 bit 彼此都有對應到。）或許可以增加一個 counter 來解決這個問題，但是如此一來空間複雜度又增加了許多。 QF 只使用一個 hash function ，並將對應的 hash 值拆為前 $q$ 個 bit 與後 $r$ 個 bit，以前 $q$ 個 bit 作為 slot 的 index，並將後 $r$ 個 bit 放入 slot 中。 ![](https://adriancolyer.files.wordpress.com/2017/08/cqf-sketch-1.jpeg) *圖2：QF 的原理簡介 - [來源文章](https://blog.acolyer.org/2017/08/08/a-general-purpose-counting-filter-making-every-bit-count/)* 並且 Quotient filters 對重複的 index 採用 [Linear Probing](http://alrightchiu.github.io/SecondRound/hash-tableopen-addressing.html#lp) 的方式，在 QF 還沒有到接近填滿的時候能將 hash 值都儲存下來，因此我們可以實現動態刪除。不過缺點是因為我們多保存了 $r$ 個 bit ，空間複雜度稍微上升。 ## 測驗 2 ```c /* Check if @lhs is a subset of @rhs */ bool qf_is_subsetof(quotient_filter *lhs, quotient_filter *rhs) { qf_iterator qfi; /* Write your own code */ /* Initialize iterator */ qfi_start(lhs, qfi); /* Iterate all elements in lhs */ while (!qfi_done(lhs, qfi)) { hash = qfi_next(lhs, qfi); if (!qf_may_contain(rhs, hash)) { return false; } } return true; } ``` 由於需要存取 `lhs` 的所有 hash 值，時間複雜度為 $O(n)$。 ## 測驗 3 我寫完之後才發現兩個 qf 的 `q` 跟 `r` 可能不一樣大，不過來不及了，只能先留下這種 naive 版本。 ```c /* * Initializes qf_out and copies over all elements from qf1 and qf2. * Caution: qf_out holds twice as many entries as either qf1 or qf2. * * Returns false on ENOMEM. */ bool qf_merge(quotient_filter *qf_out, quotient_filter *qf1, quotient_filter *qf2) { qf_iterator qfi; /* Write your code here */ uint8_t newq = qf1->qbits; uint8_t newr = qf1->rbits; if(!qf_init(qf_out, newq, newr)) { return false; } quotient_filter *qfs[] = {qf1, qf2}; for(int i = 0; i < sizeof(qfs); i++) { qfi_start(qf[i], qfi); while (!qfi_done(qf[i], qfi)) { hash = qfi_next(qf[i], qfi); if (!qf_may_contain(qf_out, hash)) { qf_insert(qf_out, hash) } } } return true; } ``` ## 測驗 4 ```c #include <assert.h> bool qf_is_consistent(quotient_filter *qf) { assert(qf->qbits); assert(qf->rbits); assert(qf->qf_table); assert(qf->entries <= qf->max_size); /* Write your code here */ uint32_t valid_entries = entries; uint64_t curr = 0, prev = 0; for (uint64_t start = 0; start < qf->max_size; ++start) { /* Check valid entries numbers */ if ((curr = get_elem(qf, start)) != 0) { if (--valid_entries < 0) { return false; } if (start > 0) { /* Check the correctness of occupied */ if (is_empty_element(prev) && !is_occupied(curr)) { return false; } } } prev = curr; } return true; } ``` 這裡我做了兩個檢查： 1. 合法 entry 檢查：理論上 empty 的元素在 `get_elem()` 時應該為 `0`。因此結合目前 `entries` 的數量檢查是否符合這個條件。 2. 簡易 `occupied` 檢查：我們可以從 QF 原理中知道，如果某個含有元素的 slot 的前一個 slot 並沒有元素，那麼目前的 slot 的 `occupied` 標誌必定為 `1` 。因此也做了這個檢查。 ## 測驗 5 ```c /** * Removes a hash from the Quotient filter. * * Caution: If you plan on using this function, make sure that your hash * function emits no more than q+r bits. Consider the following scenario; * insert(qf, A:X) # X is in the lowest q+r bits. * insert(qf, B:X) # This is a no-op, since X is already in the table. * remove(qf, A:X) # X is removed from the table. * * Now, may-contain(qf, B:X) == false, which is a ruinous false negative. * * Returns false if the hash uses more than q+r bits. */ bool qf_remove(quotient_filter *qf, uint64_t hash) { uint64_t highbits = hash >> (qf->qbits + qf->rbits); if (highbits) return false; uint64_t fq = hash_to_quotient(qf, hash); uint64_t fr = hash_to_remainder(qf, hash); uint64_t T_fq = get_elem(qf, fq); if (!is_occupied(T_fq) || !qf->entries) return true; uint64_t start = find_run_index(qf, fq); uint64_t s = start; uint64_t rem; /* Find the offending table index (or give up). */ do { rem = get_remainder(get_elem(qf, s)); if (rem == fr) { break; } else if (rem > fr) { return true; } s = incr(qf, s); } while (is_continuation(get_elem(qf, s))); if (rem != fr) return true; uint64_t kill = (s == fq) ? T_fq : get_elem(qf, s); bool replace_run_start = is_run_start(kill); /* If we are deleting the last entry in a run, clear `is_occupied'. */ if (is_run_start(kill)) { uint64_t next = get_elem(qf, incr(qf, s)); if (!is_continuation(next)) { T_fq = clr_occupied(T_fq); set_elem(qf, fq, T_fq); } } delete_entry(qf, s, fq); if (replace_run_start) { /* Write your code here */ } --qf->entries; return true; } ``` 恩...好對不起我花了不少時間在這題，但目前真的沒想到砍完之後還要做什麼。在第一個 `is_run_start(kill)` 中，已經解決了我們砍掉了一個 run 中最後一個 entry 的問題， `delete_entry` 也將整個 cluster 往前移動了，目前暫時沒想法。而 `Caution` 的限制提供的 hash 不能超過 $q+r$ 個 bits，是因為我們在 `qf` 中本來就只有保存 lower $q+r$ 個 bits，其他部分都被我們用 mask 遮掉了。如果今天提供的 hash 值超過 $q+r$ 個 bits，可能會發生 hash 值明明不同，但是因為 lower $q+r$ 個 bits 相同而被誤判為是同一個 hash 的情況發生。因此如果我們採用此限制，便能夠避免掉這種情況。 ## 測驗 6 > 例如: 在 [src/gqf_file.c](https://github.com/splatlab/cqf/blob/master/src/gqf_file.c) 的實作中，[cqf](https://github.com/splatlab/cqf) 支援檔案操作，但 [sysprog21/quotient-filter](https://github.com/sysprog21/quotient-filter) 僅能處理 in-memory 的資料，並且 [src/gqf_file.c](https://github.com/splatlab/cqf/blob/master/src/gqf_file.c) 利用 madvise 系統呼叫以提示 Linux 虛擬記憶體管理器做更好的分頁處理。 > 在 [src/gqf.c](https://github.com/splatlab/cqf/blob/master/src/gqf.c) 的實作中有看到 spin lock 的機制，能確保 thread-safe 。目前[sysprog21/quotient-filter](https://github.com/sysprog21/quotient-filter) 的版本並不能確保 multithreading 不會出問題。