# 2020q3 Project: LevelDB Study
contributed by < `grb72t3yde` >
###### tags: `sysprog2020`

## Reading list
- [x] [SLM-DB: Single-Level Key-Value Store with Persistent Memory](https://www.usenix.org/conference/fast19/presentation/kaiyrakhmet)
- [x] [LevelDB 源码分析系列](https://sf-zhou.github.io/#/LevelDB) by `SF-Zhou`
- [x] [RocksDB Wiki](https://github.com/facebook/rocksdb/wiki)
- [ ] [SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores](https://www.usenix.org/conference/atc19/presentation/balmau)
- [ ] [MatrixKV: Reducing Write Stalls and Write Amplification in LSM-tree Based KV Stores with Matrix Container in NVM](https://www.usenix.org/conference/atc20/presentation/yao)
- [ ] [Jungle: Towards Dynamically Adjustable Key-Value Store by Combining LSM-Tree and Copy-On-Write B+-Tree](https://www.usenix.org/conference/hotstorage19/presentation/ahn)
- [ ] [AC-Key: Adaptive Caching for LSM-based Key-Value Stores](https://www.usenix.org/conference/atc20/presentation/wu-fenggang)
- [x] [Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook](https://www.usenix.org/conference/fast20/presentation/cao-zhichao)
- [x] [split KV](https://www.usenix.org/conference/hotstorage20/presentation/han)

## Research goals for LevelDB
1. Understand how LevelDB incorporates a `bloom filter` to optimize queries
   > $\to$ try introducing the Xor filter
2. Understand how LevelDB implements the memtable with a skip list, and try to improve it
3. Practice using profiling tools (perf, BCC/BPF, and so on) to locate LevelDB's hot spots

## Experimental environment
* Ubuntu 20.04 LTS
* Linux kernel 5.4.0-52-generic

## Building LevelDB
Fetch the latest release of [LevelDB](https://github.com/google/leveldb), version 1.22.
* Build the project
```shell
$ mkdir -p build && cd build
$ cmake -DCMAKE_BUILD_TYPE=Release .. && cmake --build .
```
* Install the LevelDB library
```shell
$ sudo make install
```
If the static library is installed successfully, output similar to the following should appear:
```shell
...
Install the project...
-- Install configuration: "Release"
-- Up-to-date: /usr/local/lib/libleveldb.a
-- Up-to-date: /usr/local/include/leveldb/c.h
-- Up-to-date: /usr/local/include/leveldb/cache.h
-- Up-to-date: /usr/local/include/leveldb/comparator.h
-- Up-to-date: /usr/local/include/leveldb/db.h
-- Up-to-date: /usr/local/include/leveldb/dumpfile.h
-- Up-to-date: /usr/local/include/leveldb/env.h
-- Up-to-date: /usr/local/include/leveldb/export.h
-- Up-to-date: /usr/local/include/leveldb/filter_policy.h
-- Up-to-date: /usr/local/include/leveldb/iterator.h
-- Up-to-date: /usr/local/include/leveldb/options.h
-- Up-to-date: /usr/local/include/leveldb/slice.h
-- Up-to-date: /usr/local/include/leveldb/status.h
-- Up-to-date: /usr/local/include/leveldb/table_builder.h
-- Up-to-date: /usr/local/include/leveldb/table.h
-- Up-to-date: /usr/local/include/leveldb/write_batch.h
-- Up-to-date: /usr/local/lib/cmake/leveldb/leveldbTargets.cmake
-- Up-to-date: /usr/local/lib/cmake/leveldb/leveldbTargets-release.cmake
-- Up-to-date: /usr/local/lib/cmake/leveldb/leveldbConfig.cmake
-- Up-to-date: /usr/local/lib/cmake/leveldb/leveldbConfigVersion.cmake
```
Next, compile the following program, which manipulates a `DB` object to verify that the installation succeeded:
```cpp
#include <cassert>
#include <iostream>

#include "leveldb/db.h"

/* hello_leveldb.cc */
int main(int argc, const char* argv[]) {
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &db);
  assert(status.ok());

  std::cout << "Hello LevelDB!\n";
  delete db;
  return 0;
}
```
```shell
$ clang++ hello_leveldb.cc -o hello_leveldb /usr/local/lib/libleveldb.a -pthread
```

---

## Bloom Filter
### LevelDB and the spirit of the LSM-tree
* `LevelDB` organizes its on-disk files with an `LSM-tree (Log-Structured Merge-Tree)`, which guarantees that both data and log writes are `sequential write`s. This matters greatly on SSDs, since they handle `random write`s poorly; see [SFS: Random Write Considered Harmful in Solid State Drives](https://www.usenix.org/conference/fast12/sfs-random-write-considered-harmful-solid-state-drives).
* A `log-structured` store writes data by `append`ing (and reclaims stale data in the background). Updates are never applied `in-place`: new data is simply written, and the superseded versions wait for `garbage collection (GC)`. Consequently:
  1. write performance improves, but
  2. `random read` performance is sacrificed, and this is exactly why the `Bloom Filter` is introduced.
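To make the append-only update semantics concrete, here is a minimal sketch using LevelDB's public API (building on `hello_leveldb.cc` above); the keys and values are made up for illustration:
```cpp
#include <cassert>
#include <iostream>
#include <string>

#include "leveldb/db.h"

int main() {
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;
  assert(leveldb::DB::Open(options, "/tmp/testdb", &db).ok());

  std::string value;
  assert(db->Put(leveldb::WriteOptions(), "k1", "v1").ok());  // initial write
  assert(db->Put(leveldb::WriteOptions(), "k1", "v2").ok());  // "update" = another appended record
  assert(db->Get(leveldb::ReadOptions(), "k1", &value).ok());
  std::cout << value << '\n';                                 // prints "v2"
  assert(db->Delete(leveldb::WriteOptions(), "k1").ok());     // writes a tombstone, no in-place erase
  delete db;
  return 0;
}
```
From the API's point of view an update is just another `Put()` and a deletion writes a tombstone; the old versions are reclaimed later during compaction.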
### Why LevelDB adopts a Bloom filter
* To read a `key value pair`, we must search **every file whose `key range` contains the `key`**, because **we cannot tell which file actually holds the target `key`**.
> For example, suppose the target is `k = 4`. Given a file s with key range `1~5`, a file s' with key range `3~7` (and any other files whose key range spans `4`), we cannot tell which files really contain the key `4`, so all of them must be searched.
> In fact, possibly none of them has it; possibly all of them do, in which case we can of course tell which copy is the newest version.

:::info
The motivation for introducing a Bloom filter is thus to predict whether a key may exist in a given file, which can shrink the set of files to be searched and mitigate read amplification.
:::

* The figure below is reproduced from the reference data in [SLM-DB: Single-Level Key-Value Store with Persistent Memory](https://www.usenix.org/conference/fast19/presentation/kaiyrakhmet). Observe that as the `value size` grows, the benefit of the `Bloom Filter` becomes more pronounced.

![](https://i.imgur.com/IF96OnG.png)

### Goal of this item: introduce the Xor filter into LevelDB; compare false positive rates and memory overhead
- [x] Study the Bloom filter code used in LevelDB
- [x] Review [Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters](https://arxiv.org/abs/1912.08258)
- [x] Learn from other students' contributions on the Xor filter
    * [eecheng87](https://hackmd.io/@eecheng/HJd1sde8v?fbclid=IwAR3yHXTxB9CPOsxsOH0A0TobMJ3puaVQuX634Xa1t-fk98gzJwD35u1FrJA)
    * [JulianATA](https://hackmd.io/@hPMCWajOS-ORQdEEAQ04-w/H1NVb6u8P?fbclid=IwAR2Rka6XXJAcboMtYkPXo6MBy1JDOFfmqWyWWXqv8ZZcYQBRHwe1vY6nxOw)
- [x] Compare [xor_singleheader](https://github.com/FastFilter/xor_singleheader) with LevelDB's built-in bloom filter

### Comparing the bloom filter in dict with the one in LevelDB
We already covered the Bloom filter in [dict](https://hackmd.io/@sysprog/2020-dict); how does LevelDB's implementation differ? The discussion below reuses the variables of the formula derived in dict, ==$k = \ln 2 \cdot (m/n)$== :

1. LevelDB uses a *block-based bloom filter*; this is best read alongside the [leveldb file format](https://github.com/google/leveldb/blob/master/doc/table_format.md). Within a single `file`, LevelDB builds one filter over all keys of every `2KB` of data-block data and stores it in that `file`'s filter (meta) block, so the system actually holds many small filters. (For the rationale behind this design, see [this document](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter); it also explains why RocksDB dropped this scheme in favor of a *full filter*. The layout of the [filter meta block](https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format#filter-meta-block) itself is described in the RocksDB wiki.) We can inspect the code that writes the filter block in [filter_block.cc](https://github.com/google/leveldb/blob/master/table/filter_block.cc): in the `FilterBlockBuilder` class, `StartBlock` computes `filter_index` in units of 2KB.
```cpp=
...
namespace leveldb {

// See doc/table_format.md for an explanation of the filter block format.

// Generate new filter every 2KB of data
static const size_t kFilterBaseLg = 11;
static const size_t kFilterBase = 1 << kFilterBaseLg;

FilterBlockBuilder::FilterBlockBuilder(const FilterPolicy* policy)
    : policy_(policy) {}

void FilterBlockBuilder::StartBlock(uint64_t block_offset) {
  uint64_t filter_index = (block_offset / kFilterBase);
  assert(filter_index >= filter_offsets_.size());
  while (filter_index > filter_offsets_.size()) {
    GenerateFilter();
  }
}
...
void FilterBlockBuilder::GenerateFilter() {
  const size_t num_keys = start_.size();
  if (num_keys == 0) {
    // Fast path if there are no keys for this filter
    filter_offsets_.push_back(result_.size());
    return;
  }

  // Make list of keys from flattened key structure
  start_.push_back(keys_.size());  // Simplify length computation
  tmp_keys_.resize(num_keys);
  for (size_t i = 0; i < num_keys; i++) {
    const char* base = keys_.data() + start_[i];
    size_t length = start_[i + 1] - start_[i];
    tmp_keys_[i] = Slice(base, length);
  }

  // Generate filter for current set of keys and append to result_.
  filter_offsets_.push_back(result_.size());
  policy_->CreateFilter(&tmp_keys_[0], static_cast<int>(num_keys), &result_);

  tmp_keys_.clear();
  keys_.clear();
  start_.clear();
}
...
```
1. The call to `CreateFilter` at line 40 builds a filter over the key set in `tmp_keys_`. This function is discussed below, together with the bloom filter implementation.
2. For each filter, LevelDB uses `double-hashing` to generate the $k$ hash values. This part is likewise deferred to the bloom filter implementation below.
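As a quick sanity check of the arithmetic in `StartBlock` above, here is a standalone sketch; the constants are copied from `filter_block.cc`, while the sample offsets are made up:
```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Constants copied from table/filter_block.cc: one filter per 2KB
// of data-block offset space.
static const size_t kFilterBaseLg = 11;
static const size_t kFilterBase = 1 << kFilterBaseLg;  // 2048

int main() {
  // Every data block that begins inside the same 2KB window maps to the
  // same filter index; crossing a 2KB boundary starts a new filter.
  const uint64_t offsets[] = {0, 1500, 2048, 5000, 8192};
  for (uint64_t block_offset : offsets)
    std::printf("block_offset %5llu -> filter_index %llu\n",
                (unsigned long long)block_offset,
                (unsigned long long)(block_offset / kFilterBase));
  return 0;
}
```
For these offsets the indices come out as 0, 0, 1, 2, 4, i.e. blocks starting at 0 and 1500 share a filter, while a block starting at 8192 skips ahead, which is why `StartBlock` loops to emit the intervening (empty) filters.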
### LevelDB's bloom filter implementation
Start from [include/leveldb/filter_policy.h](https://github.com/google/leveldb/blob/master/include/leveldb/filter_policy.h), which defines the base class `FilterPolicy` as an interface:
```cpp
class LEVELDB_EXPORT FilterPolicy {
 public:
  virtual ~FilterPolicy();
  virtual const char* Name() const = 0;
  virtual void CreateFilter(const Slice* keys, int n, std::string* dst) const = 0;
  virtual bool KeyMayMatch(const Slice& key, const Slice& filter) const = 0;
};
...
LEVELDB_EXPORT const FilterPolicy* NewBloomFilterPolicy(int bits_per_key);

}  // namespace leveldb
```
For example, a user can define a custom filter policy like this:
```cpp
class CustomFilterPolicy : public leveldb::FilterPolicy {
 private:
  const FilterPolicy* builtin_policy_;

 public:
  CustomFilterPolicy() : builtin_policy_(NewBloomFilterPolicy(10)) {}
  ~CustomFilterPolicy() { delete builtin_policy_; }

  const char* Name() const { return "IgnoreTrailingSpacesFilter"; }

  void CreateFilter(const Slice* keys, int n, std::string* dst) const {
    // Use builtin bloom filter code after removing trailing spaces
    std::vector<Slice> trimmed(n);
    for (int i = 0; i < n; i++) {
      trimmed[i] = RemoveTrailingSpaces(keys[i]);
    }
    builtin_policy_->CreateFilter(&trimmed[0], n, dst);
  }
};
```
LevelDB's own bloom filter implementation is the class `BloomFilterPolicy` in [util/bloom.cc](https://github.com/google/leveldb/blob/master/util/bloom.cc); `CreateFilter` at line 12 is the function invoked when the filter block is built, as described earlier:
```cpp=
class BloomFilterPolicy : public FilterPolicy {
 public:
  explicit BloomFilterPolicy(int bits_per_key) : bits_per_key_(bits_per_key) {
    // We intentionally round down to reduce probing cost a little bit
    k_ = static_cast<size_t>(bits_per_key * 0.69);  // 0.69 =~ ln(2)
    if (k_ < 1) k_ = 1;
    if (k_ > 30) k_ = 30;
  }

  virtual const char* Name() const { return "leveldb.BuiltinBloomFilter2"; }

  virtual void CreateFilter(const Slice* keys, int n, std::string* dst) const {
    // Compute bloom filter size (in both bits and bytes)
    size_t bits = n * bits_per_key_;

    // For small n, we can see a very high false positive rate. Fix it
    // by enforcing a minimum bloom filter length.
    if (bits < 64) bits = 64;

    size_t bytes = (bits + 7) / 8;
    bits = bytes * 8;

    const size_t init_size = dst->size();
    dst->resize(init_size + bytes, 0);
    dst->push_back(static_cast<char>(k_));  // Remember # of probes in filter
    char* array = &(*dst)[init_size];
    for (int i = 0; i < n; i++) {
      // Use double-hashing to generate a sequence of hash values.
      // See analysis in [Kirsch,Mitzenmacher 2006].
      uint32_t h = BloomHash(keys[i]);
      const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
      for (size_t j = 0; j < k_; j++) {
        const uint32_t bitpos = h % bits;
        array[bitpos / 8] |= (1 << (bitpos % 8));
        h += delta;
      }
    }
  }

  virtual bool KeyMayMatch(const Slice& key, const Slice& bloom_filter) const {
    const size_t len = bloom_filter.size();
    if (len < 2) return false;

    const char* array = bloom_filter.data();
    const size_t bits = (len - 1) * 8;

    // Use the encoded k so that we can read filters generated by
    // bloom filters created using different parameters.
    const size_t k = array[len - 1];
    if (k > 30) {
      // Reserved for potentially new encodings for short bloom filters.
      // Consider it a match.
      return true;
    }

    uint32_t h = BloomHash(key);
    const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
    for (size_t j = 0; j < k; j++) {
      const uint32_t bitpos = h % bits;
      if ((array[bitpos / 8] & (1 << (bitpos % 8))) == 0) return false;
      h += delta;
    }
    return true;
  }

 private:
  size_t bits_per_key_;
  size_t k_;
};
}  // namespace

const FilterPolicy* NewBloomFilterPolicy(int bits_per_key) {
  return new BloomFilterPolicy(bits_per_key);
}

}  // namespace leveldb
```
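For context, this is how a policy created by `NewBloomFilterPolicy` is wired into a database, following the pattern in LevelDB's documentation; `bits_per_key = 10` matches the recommended default discussed below, and the database path is made up:
```cpp
#include <cassert>

#include "leveldb/db.h"
#include "leveldb/filter_policy.h"

int main() {
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;
  // Attach a bloom filter policy: every table built from now on carries
  // filter blocks, and reads consult them via KeyMayMatch().
  options.filter_policy = leveldb::NewBloomFilterPolicy(10);
  assert(leveldb::DB::Open(options, "/tmp/testdb_bloom", &db).ok());
  // ... use the database ...
  delete db;
  // The policy must outlive the DB, so release it only afterwards.
  delete options.filter_policy;
  return 0;
}
```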
From the `util/bloom.cc` snippet we can observe several things:
1. `NewBloomFilterPolicy` is the factory function for the built-in bloom filter policy. It takes a single argument, `bits_per_key`. Line 5 shows that this number is exactly $(m/n)$, and it determines $k$. LevelDB furthermore clamps the computed $k$ to the range 1~30, bounding the cost of evaluating hash functions every time a filter is built.
2. The loop at line 27 feeds each key into the hash function to obtain its fingerprint, and at first glance only one hash function (`BloomHash`) is used. This is double hashing: the first hash function is `BloomHash`, and the second is `delta`. See [Less Hashing, Same Performance: Building a Better Bloom Filter](https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf).
3. Line 25 stores the filter's hash-function count $k$ right after the filter (occupying 1 byte), for use at query time.

:::warning
A BloomFilterPolicy is constructed with one `bits_per_key` argument. If $k$ is fixed under a given policy, why does $k$ need to be stored at all?
:::

### Experiment: comparing the false positive probability and memory overhead of LevelDB's built-in bloom filter and the Xor8 filter
As noted above, LevelDB already bounds the time spent computing hash functions, so the comparison below focuses on the memory overhead and the false positive probability (FPP) of LevelDB's bloom filter versus the xor filter; a sketch of how the Xor8 side is driven follows the setup, and the measured numbers come after it.

:::info
`Experimental group`: Xor8 filter
`Control group`: LevelDB bloom filter with `bits_per_key` = 10 (LevelDB's recommended default; it of course affects the memory overhead)
`Test program`: the `VaryingLengths` TEST in [util/bloom_test](https://github.com/google/leveldb/blob/master/util/bloom_test.cc). It first adds `length` consecutive keys, then measures the false positive probability by querying ten thousand keys that were never added. A filter is judged *good* when its false positive rate is 1.25% or lower, and *mediocre* otherwise.
:::
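Here is a minimal sketch of how the Xor8 side of the comparison can be exercised through the [xor_singleheader](https://github.com/FastFilter/xor_singleheader) API; the function names (`xor8_allocate`, `xor8_populate`, `xor8_contain`, `xor8_size_in_bytes`) follow that repo's README, the synthetic 64-bit keys stand in for hashed LevelDB keys, and the actual hookup into `bloom_test.cc` differs:
```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

#include "xorfilter.h"  // from FastFilter/xor_singleheader

int main() {
  const uint32_t n = 10000;
  std::vector<uint64_t> keys(n);
  for (uint32_t i = 0; i < n; i++) keys[i] = i;  // stand-ins for hashed keys

  xor8_t filter;
  assert(xor8_allocate(n, &filter));
  assert(xor8_populate(keys.data(), n, &filter));  // one-shot build; the filter is immutable

  assert(xor8_contain(keys[0], &filter));  // no false negatives by construction

  // Estimate the false positive probability with keys that were never added.
  uint32_t false_positives = 0;
  const uint32_t probes = 10000;
  for (uint32_t i = 0; i < probes; i++)
    if (xor8_contain(n + i, &filter)) false_positives++;
  std::printf("FPP ~= %.2f%%, bytes = %zu\n",
              100.0 * false_positives / probes, xor8_size_in_bytes(&filter));

  xor8_free(&filter);
  return 0;
}
```
Unlike the bloom filter, the xor filter is built in one shot over the full key set and cannot be updated incrementally, which is compatible with how LevelDB writes a filter block once per table file.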
As `length` grows, the Xor filter holds its FPP at $\simeq 0.4\%$ and ultimately passes all 37 tests.

Note also that LevelDB enforces a minimum filter size of 8 bytes, because an overly small filter would produce a high FPP; that is why the first several iterations all report 9 bytes (8 bytes plus 1 byte for $k$). The Xor8 sizes follow the formula in the paper, which also proposes the Xor+ filter to reduce the wasted space. Once `length` reaches 2000, the Xor filter's memory overhead drops below the bloom filter's.

[TODO: plot the table below as a chart]

Results of `BloomTest.VaryingLengths`:

| length | bloom FPP | bloom bytes | xor FPP | xor bytes |
|-------:|----------:|------------:|--------:|----------:|
| 1 | 0.23% | 9 | 0.48% | 33 |
| 2 | 0.44% | 9 | 0.48% | 33 |
| 3 | 0.75% | 9 | 0.53% | 33 |
| 4 | 1.08% | 9 | 0.46% | 36 |
| 5 | 1.20% | 9 | 0.43% | 36 |
| 6 | 1.59% | 9 | 0.36% | 39 |
| 7 | 1.53% | 10 | 0.41% | 39 |
| 8 | 1.81% | 11 | 0.44% | 39 |
| 9 | 0.79% | 13 | 0.43% | 42 |
| 10 | 1.63% | 14 | 0.36% | 42 |
| 20 | 1.24% | 26 | 0.37% | 54 |
| 30 | 0.84% | 39 | 0.43% | 66 |
| 40 | 1.07% | 51 | 0.37% | 81 |
| 50 | 1.09% | 64 | 0.41% | 93 |
| 60 | 1.12% | 76 | 0.36% | 105 |
| 70 | 0.93% | 89 | 0.41% | 117 |
| 80 | 1.16% | 101 | 0.33% | 129 |
| 90 | 1.07% | 114 | 0.36% | 141 |
| 100 | 0.83% | 126 | 0.39% | 153 |
| 200 | 0.96% | 251 | 0.38% | 276 |
| 300 | 0.77% | 376 | 0.37% | 399 |
| 400 | 0.81% | 501 | 0.31% | 522 |
| 500 | 0.74% | 626 | 0.37% | 645 |
| 600 | 0.78% | 751 | 0.41% | 768 |
| 700 | 0.91% | 876 | 0.44% | 891 |
| 800 | 0.88% | 1001 | 0.43% | 1014 |
| 900 | 0.97% | 1126 | 0.52% | 1137 |
| 1000 | 0.90% | 1251 | 0.50% | 1260 |
| 2000 | 0.89% | 2501 | 0.33% | 2490 |
| 3000 | 0.95% | 3751 | 0.32% | 3720 |
| 4000 | 1.01% | 5001 | 0.44% | 4950 |
| 5000 | 0.89% | 6251 | 0.28% | 6180 |
| 6000 | 1.03% | 7501 | 0.43% | 7410 |
| 7000 | 0.78% | 8751 | 0.34% | 8640 |
| 8000 | 1.09% | 10001 | 0.47% | 9870 |
| 9000 | 1.09% | 11251 | 0.47% | 11100 |
| 10000 | 0.81% | 12501 | 0.40% | 12330 |

```
<test result>
bloom : 33 good, 4 mediocre
xor   : 37 good, 0 mediocre
```

#### Discussion: TODO

---

## Skip List
A skip list implements the memtable in LevelDB; it is also used to implement the sorted set data structure in Redis.

Code: [leveldb/db/skiplist.h](https://github.com/google/leveldb/blob/master/db/skiplist.h)

Potential improvements:
* [The Random implementation](https://github.com/google/leveldb/blob/master/util/random.h) is relatively costly to run; it can be replaced with a cheaper PRNG (e.g. [xoshiro128+](http://prng.di.unimi.it/)), substituting the existing `Next` operation. The patch below uses a 16-bit xorshift; a standalone check of this PRNG follows the patch.
* Improve cache effectiveness; see [Skip Lists: Done Right](http://ticki.github.io/blog/skip-lists-done-right/)

Modifying Random:
```diff
--- a/db/skiplist.h
+++ b/db/skiplist.h
@@ -32,10 +32,22 @@
 #include <cstdlib>
 
 #include "util/arena.h"
-#include "util/random.h"
 
 namespace leveldb {
 
+struct rand_internal {
+  uint16_t s;
+};
+
+/* 16-bit xorshift PRNG */
+static inline uint16_t Rand(struct rand_internal* x) {
+  uint16_t xs = x->s;
+  xs ^= xs << 7;
+  xs ^= xs >> 9;
+  xs ^= xs << 8;
+  return (x->s = xs);
+}
+
 class Arena;
 
 template <typename Key, class Comparator>
@@ -138,7 +150,7 @@ class SkipList {
   std::atomic<int> max_height_;  // Height of the entire list
 
   // Read/written only by Insert().
-  Random rnd_;
+  struct rand_internal rnd_;
 };
 
 // Implementation details follow
@@ -243,7 +255,7 @@ int SkipList<Key, Comparator>::RandomHeight() {
   // Increase height with probability 1 in kBranching
   static const unsigned int kBranching = 4;
   int height = 1;
-  while (height < kMaxHeight && ((rnd_.Next() % kBranching) == 0)) {
+  while (height < kMaxHeight && (Rand(&rnd_) % kBranching) == 0) {
     height++;
   }
   assert(height > 0);
@@ -326,8 +338,10 @@ SkipList<Key, Comparator>::SkipList(Comparator cmp, Arena* arena)
     : compare_(cmp),
       arena_(arena),
       head_(NewNode(0 /* any key will do */, kMaxHeight)),
-      max_height_(1),
-      rnd_(0xdeadbeef) {
+      max_height_(1) {
+  static const uint32_t A = 16807;  // bits 14, 8, 7, 5, 2, 1, 0
+  rnd_.s = (uintptr_t)&rnd_ * A;
+  Rand(&rnd_);
   for (int i = 0; i < kMaxHeight; i++) {
     head_->SetNext(i, nullptr);
   }
```
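Here is the standalone check of the 16-bit xorshift introduced by the patch (the shift triple 7/9/8 is copied from it; the rest is a sketch). One caveat worth flagging: xorshift maps a zero state to zero, so the address-based seeding in the constructor implicitly assumes the truncated product is nonzero.
```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

struct rand_internal {
  uint16_t s;
};

/* Same 16-bit xorshift as in the patch above. */
static inline uint16_t Rand(struct rand_internal* x) {
  uint16_t xs = x->s;
  xs ^= xs << 7;
  xs ^= xs >> 9;
  xs ^= xs << 8;
  return (x->s = xs);
}

int main() {
  struct rand_internal r = {0xdead};  // any nonzero seed; a zero state stays zero forever
  uint32_t buckets[4] = {0};

  // The only consumer is RandomHeight(), which uses the value mod
  // kBranching (= 4), so check that the low two bits look uniform.
  const uint32_t rounds = 1u << 20;
  for (uint32_t i = 0; i < rounds; i++) buckets[Rand(&r) % 4]++;

  for (int b = 0; b < 4; b++)
    std::printf("bucket %d: %.2f%%\n", b, 100.0 * buckets[b] / rounds);

  // Starting from a nonzero seed, the state must never reach zero.
  assert(r.s != 0);
  return 0;
}
```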
> Since `kBranching` is so small (`4`), one could even avoid using a PRNG altogether.

---

[Skip Lists: Done Right](http://ticki.github.io/blog/skip-lists-done-right/) points out:
> "In the past five years, people have become increasingly sceptical of skip lists' performance, due to their poor cache behavior when compared to e.g. B-trees"

and, motivated by this, proposes several skip list designs that improve cache effectiveness.

:::warning
[Pointer chasing](https://en.wikichip.org/wiki/pointer_chasing) incurs many TLB and cache misses at run time: the processor must execute long sequences of irregular memory accesses, which sharply lowers cache effectiveness and can even lead to [cache pollution](https://en.wikipedia.org/wiki/Cache_pollution).
Besides mitigating the problem in software, there are also hardware-level approaches, e.g. [Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation](https://users.ece.cmu.edu/~omutlu/pub/in-memory-pointer-chasing-accelerator_iccd16.pdf).
:::

To observe the cache behavior of LevelDB's original skip list implementation, I profile `db_bench` with performance events, running `--benchmarks="fillrandom,readseq"`.

* Environment
```
LevelDB:    version 1.22
Date:       Wed Jan 20 12:07:14 2021
CPU:        8 * Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
CPUCache:   8192 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
WARNING: Snappy compression is not enabled
```

First, use `perf stat` to observe cache-miss and dTLB-miss behavior while running this benchmark:
```shell
$ sudo perf stat --repeat 10 -e dTLB-loads:u,dTLB-load-misses:u,cache-misses:u,cache-references:u \
./db_bench --benchmarks="fillrandom,readseq"
```

* With the original Random implementation
```
 Performance counter stats for './db_bench --benchmarks=fillrandom,readseq' (10 runs):

     4,019,306,118      dTLB-loads                                                    ( +-  0.91% )
           671,019      dTLB-load-misses          #    0.02% of all dTLB cache hits   ( +-  1.62% )
        34,044,704      cache-misses:u            #   13.366 % of all cache refs      ( +-  4.47% )
       254,705,111      cache-references:u                                            ( +-  0.73% )

             4.594 +- 0.234 seconds time elapsed  ( +-  5.08% )
```

Next, use `perf record` to locate the hot spots of the original implementation:
```shell
$ sudo perf record -e dTLB-loads:u,dTLB-load-misses:u,cache-misses:u,cache-references:u \
./db_bench --benchmarks="fillrandom,readseq"
```

which yields:
```
LevelDB:    version 1.22
Date:       Wed Jan 20 12:28:38 2021
CPU:        8 * Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
CPUCache:   8192 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
WARNING: Snappy compression is not enabled
-------------------------------------------------
fillrandom   :       6.808 micros/op;   16.2 MB/s
readseq      :       0.187 micros/op;  590.6 MB/s
```

The perf report for the `cache-misses` event:
```
Samples: 14K of event 'cache-misses:u', Event count (approx.): 27402381
Overhead  Command   Shared Object  Symbol
  16.79%  db_bench  db_bench       [.] leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert
   9.94%  db_bench  libc-2.31.so   [.] __memcmp_avx2_movbe
   6.90%  db_bench  db_bench       [.] leveldb::InternalKeyComparator::Compare
   6.10%  db_bench  db_bench       [.] leveldb::MemTableIterator::key
   5.86%  db_bench  db_bench       [.] leveldb::(anonymous namespace)::MergingIterator::Next
   5.81%  db_bench  libc-2.31.so   [.] __memmove_avx_unaligned_erms
...
```
The perf report for the `dTLB-load-misses` event:
```
Samples: 3K of event 'dTLB-load-misses', Event count (approx.): 586744
Overhead  Command   Shared Object  Symbol
  28.60%  db_bench  db_bench       [.] leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert
  22.71%  db_bench  db_bench       [.] leveldb::MemTableIterator::key
  12.84%  db_bench  db_bench       [.] leveldb::ReadBlock
```

`leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert` turns out to be a hot spot in both miss statistics.

Next, annotate the symbol `leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert`:
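One way to drill into that symbol is `perf annotate`, assuming the `perf.data` produced by the `perf record` step above; `-s`/`--symbol` selects the symbol to annotate, and quoting is needed because of the template brackets:
```shell
$ sudo perf annotate -i perf.data \
    --symbol="leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert"
```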