# 2020q3 Project: LevelDB Study
contributed by < `grb72t3yde` >
###### tags: `sysprog2020`
## Research materials read
- [x] [SLM-DB: Single-Level Key-Value Store with Persistent Memory](https://www.usenix.org/conference/fast19/presentation/kaiyrakhmet)
- [x] [LevelDB source code analysis series](https://sf-zhou.github.io/#/LevelDB) by `SF-Zhou`
- [x] [RocksDB Wiki](https://github.com/facebook/rocksdb/wiki)
- [ ] [SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores](https://www.usenix.org/conference/atc19/presentation/balmau)
- [ ] [MatrixKV: Reducing Write Stalls and Write Amplification in LSM-tree Based KV Stores with Matrix Container in NVM](https://www.usenix.org/conference/atc20/presentation/yao)
- [ ] [Jungle: Towards Dynamically Adjustable Key-Value Store by Combining LSM-Tree and Copy-On-Write B+-Tree](https://www.usenix.org/conference/hotstorage19/presentation/ahn)
- [ ] [AC-Key: Adaptive Caching for LSM-based Key-Value Stores](https://www.usenix.org/conference/atc20/presentation/wu-fenggang)
- [x] [Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook](https://www.usenix.org/conference/fast20/presentation/cao-zhichao)
- [x] [SplitKV](https://www.usenix.org/conference/hotstorage20/presentation/han)
## Research goals for LevelDB
1. Understand how LevelDB adopts a `bloom filter` to optimize queries
> $\to$ try introducing the Xor filter
2. Understand how LevelDB uses a skip list to implement the memtable, and try to improve it
3. Practice performance-analysis tools (perf, BCC/BPF, etc.) to understand LevelDB's hot spots
## Experimental environment
* Ubuntu 20.04 LTS
* Linux kernel 5.4.0-52-generic
## Building LevelDB
Fetch the latest release of [LevelDB](https://github.com/google/leveldb), version 1.22.
* Build the project
```shell
$ mkdir -p build && cd build
$ cmake -DCMAKE_BUILD_TYPE=Release .. && cmake --build .
```
* Install the LevelDB library
```shell
$ sudo make install
```
If the static library was built and installed successfully, you should see output similar to:
```shell
...
Install the project...
-- Install configuration: "Release"
-- Up-to-date: /usr/local/lib/libleveldb.a
-- Up-to-date: /usr/local/include/leveldb/c.h
-- Up-to-date: /usr/local/include/leveldb/cache.h
-- Up-to-date: /usr/local/include/leveldb/comparator.h
-- Up-to-date: /usr/local/include/leveldb/db.h
-- Up-to-date: /usr/local/include/leveldb/dumpfile.h
-- Up-to-date: /usr/local/include/leveldb/env.h
-- Up-to-date: /usr/local/include/leveldb/export.h
-- Up-to-date: /usr/local/include/leveldb/filter_policy.h
-- Up-to-date: /usr/local/include/leveldb/iterator.h
-- Up-to-date: /usr/local/include/leveldb/options.h
-- Up-to-date: /usr/local/include/leveldb/slice.h
-- Up-to-date: /usr/local/include/leveldb/status.h
-- Up-to-date: /usr/local/include/leveldb/table_builder.h
-- Up-to-date: /usr/local/include/leveldb/table.h
-- Up-to-date: /usr/local/include/leveldb/write_batch.h
-- Up-to-date: /usr/local/lib/cmake/leveldb/leveldbTargets.cmake
-- Up-to-date: /usr/local/lib/cmake/leveldb/leveldbTargets-release.cmake
-- Up-to-date: /usr/local/lib/cmake/leveldb/leveldbConfig.cmake
-- Up-to-date: /usr/local/lib/cmake/leveldb/leveldbConfigVersion.cmake
```
Next, compile the following program, which operates on a `DB` object to verify the installation:
```cpp
/* hello_leveldb.cc */
#include <cassert>
#include <iostream>

#include "leveldb/db.h"

int main(int argc, const char* argv[])
{
    leveldb::DB* db;
    leveldb::Options options;
    options.create_if_missing = true;
    leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &db);
    assert(status.ok());
    std::cout << "Hello LevelDB!\n";
    delete db;
    return 0;
}
```
```shell
$ clang++ hello_leveldb.cc -o hello_leveldb /usr/local/lib/libleveldb.a -pthread
```
---
## Bloom Filter
### The spirit of LevelDB and the LSM-tree
* `LevelDB` organizes its on-disk files with an `LSM-tree (Log-Structured Merge-Tree)`, which guarantees that data and log writes are `sequential write`s. This is especially important on SSDs, which handle `random write` poorly; see [SFS: Random Write Considered Harmful in Solid State Drives](https://www.usenix.org/conference/fast12/sfs-random-write-considered-harmful-solid-state-drives).
* A `log-structured` design writes by `append`ing: updates are never applied `in-place`; instead, new records are simply written, and stale ones are later reclaimed in the background by `garbage collection (GC)`. As a result:
    1. write performance improves,
    2. but `random read` performance is sacrificed, and this is exactly why the `Bloom Filter` is introduced
### Why LevelDB uses a Bloom filter
* To read a `key value pair`, we must search **every file whose `key range` contains the `key`**, because **we cannot tell which file actually holds the target `key`**.
> For example, suppose the target is `k = 4`. Given a file s with key range `1~5`, a file s' with key range `3~7`, and any other files whose key ranges span `4`, we cannot tell which of them really contain the key `4`, so every one of them must be searched.
> In fact, none of them may contain it, or all of them may; in the latter case we can at least tell which copy is the newest version.
:::info
The motivation for introducing the Bloom filter is thus to predict whether a key exists in a given file, potentially reducing the number of files to search and mitigating read amplification.
:::
* The chart below is reproduced from the reference data in [SLM-DB: Single-Level Key-Value Store with Persistent Memory](https://www.usenix.org/conference/fast19/presentation/kaiyrakhmet); observe that as the `value size` grows, the benefit of using the `Bloom Filter` becomes more pronounced
![](https://i.imgur.com/IF96OnG.png)
### Goal: introduce the Xor filter into LevelDB and compare false positive rates and memory overhead
- [x] Study the Bloom filter code used in LevelDB
- [x] Review [Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters](https://arxiv.org/abs/1912.08258)
- [x] Study other students' contributions around the Xor filter
    * [eecheng87](https://hackmd.io/@eecheng/HJd1sde8v)
    * [JulianATA](https://hackmd.io/@hPMCWajOS-ORQdEEAQ04-w/H1NVb6u8P)
- [x] Compare [xor_singleheader](https://github.com/FastFilter/xor_singleheader) with LevelDB's built-in bloom filter
### Comparing the bloom filter in dict with the one used in LevelDB
We touched on the Bloom filter in [dict](https://hackmd.io/@sysprog/2020-dict); how does LevelDB's Bloom filter implementation differ? The discussion below reuses the variable meanings of the formula ==$k = \ln 2 \cdot (m/n)$== derived in dict:
1. LevelDB uses a *block-based bloom filter*; this is best read alongside the [leveldb file format](https://github.com/google/leveldb/blob/master/doc/table_format.md). Within each `file`, LevelDB builds one filter from all the keys in every `2KB` unit of data-block contents and stores these filters in the file's filter (meta) block, so the system in fact holds many small filters. (The rationale for this design is discussed in [this document](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter), which also explains why RocksDB later dropped this approach in favor of a *full filter*.)
We can observe the filter-block writing code in [filter_block.cc](https://github.com/google/leveldb/blob/master/table/filter_block.cc): in the FilterBlockBuilder class, StartBlock computes filter_index in 2KB units.
```cpp
...
namespace leveldb {

// See doc/table_format.md for an explanation of the filter block format.

// Generate new filter every 2KB of data
static const size_t kFilterBaseLg = 11;
static const size_t kFilterBase = 1 << kFilterBaseLg;

FilterBlockBuilder::FilterBlockBuilder(const FilterPolicy* policy)
    : policy_(policy) {}

void FilterBlockBuilder::StartBlock(uint64_t block_offset) {
  uint64_t filter_index = (block_offset / kFilterBase);
  assert(filter_index >= filter_offsets_.size());
  while (filter_index > filter_offsets_.size()) {
    GenerateFilter();
  }
}
...
void FilterBlockBuilder::GenerateFilter() {
  const size_t num_keys = start_.size();
  if (num_keys == 0) {
    // Fast path if there are no keys for this filter
    filter_offsets_.push_back(result_.size());
    return;
  }

  // Make list of keys from flattened key structure
  start_.push_back(keys_.size());  // Simplify length computation
  tmp_keys_.resize(num_keys);
  for (size_t i = 0; i < num_keys; i++) {
    const char* base = keys_.data() + start_[i];
    size_t length = start_[i + 1] - start_[i];
    tmp_keys_[i] = Slice(base, length);
  }

  // Generate filter for current set of keys and append to result_.
  filter_offsets_.push_back(result_.size());
  policy_->CreateFilter(&tmp_keys_[0], static_cast<int>(num_keys), &result_);

  tmp_keys_.clear();
  keys_.clear();
  start_.clear();
}
...
```
1. The `policy_->CreateFilter` call in GenerateFilter builds a filter for the key set in tmp_keys_. The function itself is described below, together with the bloom filter implementation.
2. For each filter, LevelDB uses `double-hashing` to produce the $k$ hash values. This part is likewise described below with the bloom filter implementation.
### The bloom filter implementation in LevelDB
We start from [include/leveldb/filter_policy.h](https://github.com/google/leveldb/blob/master/include/leveldb/filter_policy.h), which defines the base class FilterPolicy that serves as the interface.
```cpp
class LEVELDB_EXPORT FilterPolicy {
 public:
  virtual ~FilterPolicy();

  virtual const char* Name() const = 0;

  virtual void CreateFilter(const Slice* keys, int n,
                            std::string* dst) const = 0;

  virtual bool KeyMayMatch(const Slice& key, const Slice& filter) const = 0;
};
...
LEVELDB_EXPORT const FilterPolicy* NewBloomFilterPolicy(int bits_per_key);

}  // namespace leveldb
```
For example, a user may implement a custom filter policy as follows:
```cpp
class CustomFilterPolicy : public leveldb::FilterPolicy {
 private:
  FilterPolicy* builtin_policy_;

 public:
  CustomFilterPolicy() : builtin_policy_(NewBloomFilterPolicy(10)) {}
  ~CustomFilterPolicy() { delete builtin_policy_; }

  const char* Name() const { return "IgnoreTrailingSpacesFilter"; }

  void CreateFilter(const Slice* keys, int n, std::string* dst) const {
    // Use builtin bloom filter code after removing trailing spaces
    std::vector<Slice> trimmed(n);
    for (int i = 0; i < n; i++) {
      trimmed[i] = RemoveTrailingSpaces(keys[i]);
    }
    builtin_policy_->CreateFilter(&trimmed[0], n, dst);
  }
};
```
LevelDB's bloom filter implementation is the BloomFilterPolicy class in [util/bloom.cc](https://github.com/google/leveldb/blob/master/util/bloom.cc); its CreateFilter member is exactly the function invoked when building a filter block, as described earlier:
```cpp
class BloomFilterPolicy : public FilterPolicy {
 public:
  explicit BloomFilterPolicy(int bits_per_key) : bits_per_key_(bits_per_key) {
    // We intentionally round down to reduce probing cost a little bit
    k_ = static_cast<size_t>(bits_per_key * 0.69);  // 0.69 =~ ln(2)
    if (k_ < 1) k_ = 1;
    if (k_ > 30) k_ = 30;
  }

  virtual const char* Name() const { return "leveldb.BuiltinBloomFilter2"; }

  virtual void CreateFilter(const Slice* keys, int n, std::string* dst) const {
    // Compute bloom filter size (in both bits and bytes)
    size_t bits = n * bits_per_key_;

    // For small n, we can see a very high false positive rate. Fix it
    // by enforcing a minimum bloom filter length.
    if (bits < 64) bits = 64;

    size_t bytes = (bits + 7) / 8;
    bits = bytes * 8;

    const size_t init_size = dst->size();
    dst->resize(init_size + bytes, 0);
    dst->push_back(static_cast<char>(k_));  // Remember # of probes in filter
    char* array = &(*dst)[init_size];
    for (int i = 0; i < n; i++) {
      // Use double-hashing to generate a sequence of hash values.
      // See analysis in [Kirsch,Mitzenmacher 2006].
      uint32_t h = BloomHash(keys[i]);
      const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
      for (size_t j = 0; j < k_; j++) {
        const uint32_t bitpos = h % bits;
        array[bitpos / 8] |= (1 << (bitpos % 8));
        h += delta;
      }
    }
  }

  virtual bool KeyMayMatch(const Slice& key, const Slice& bloom_filter) const {
    const size_t len = bloom_filter.size();
    if (len < 2) return false;

    const char* array = bloom_filter.data();
    const size_t bits = (len - 1) * 8;

    // Use the encoded k so that we can read filters generated by
    // bloom filters created using different parameters.
    const size_t k = array[len - 1];
    if (k > 30) {
      // Reserved for potentially new encodings for short bloom filters.
      // Consider it a match.
      return true;
    }

    uint32_t h = BloomHash(key);
    const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
    for (size_t j = 0; j < k; j++) {
      const uint32_t bitpos = h % bits;
      if ((array[bitpos / 8] & (1 << (bitpos % 8))) == 0) return false;
      h += delta;
    }
    return true;
  }

 private:
  size_t bits_per_key_;
  size_t k_;
};
}  // namespace

const FilterPolicy* NewBloomFilterPolicy(int bits_per_key) {
  return new BloomFilterPolicy(bits_per_key);
}

}  // namespace leveldb
```
A few things to note in this code fragment:
1. NewBloomFilterPolicy is the factory for the built-in bloom filter policy; it takes a `bits_per_key` argument. In the constructor, this number is precisely $(m/n)$, and it determines $k$. LevelDB additionally clamps the computed $k$ into the range 1~30, bounding the cost of evaluating hash functions each time a filter is built.
2. The outer loop of CreateFilter feeds each key into the hash function to obtain a fingerprint, and it looks as if only one hash function (BloomHash) were used. In fact this is double hashing: the first hash function is BloomHash and the second is the rotated delta. See [Less Hashing, Same Performance: Building a Better Bloom Filter](https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf).
3. The trailing `push_back` stores this filter's hash-function count $k$ right after the filter bits (one byte), so it can be read back at query time.
:::warning
A BloomFilterPolicy is constructed with a single `bits_per_key` argument. If $k$ is fixed under a given policy, why does $k$ need to be stored at all?
:::
### Experiment: comparing the false positive probability and memory overhead of LevelDB's built-in bloom filter and the Xor8 filter
As noted above, LevelDB already bounds the time spent computing hash functions, so we focus on comparing the memory overhead and FPP (false positive probability) of LevelDB's bloom filter against the xor filter.
:::info
`Experimental group`: Xor8 filter
`Control group`: LevelDB bloom filter with `bits_per_key` = 10 (LevelDB's recommended default; it of course affects the memory overhead)
`Test program`: the VaryingLengths TEST in [util/bloom_test](https://github.com/google/leveldb/blob/master/util/bloom_test.cc). The test first inserts `length` consecutive keys, then measures the false positive probability by querying ten thousand keys that were never inserted. A filter is rated good if its false positive rate is at most 1.25%, and mediocre otherwise.
:::
As `length` grows, the Xor filter keeps its FPP at $\simeq 0.4\%$ and ends up passing all 37 tests.
Also, LevelDB enforces a minimum filter size of 8 bytes, since an overly small filter would yield a high FPP; that is why the first several iterations all report 9 bytes (8 bytes plus the 1-byte $k$). For Xor8, see the size formula in the paper, which also notes that the Xor+ filter can reduce the wasted space. By the time `length` reaches 2000, the Xor filter's memory overhead is already below the bloom filter's.
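A rough consistency check on the bloom figures: with `bits_per_key` = 10, the constructor picks $k = \lfloor 10 \times 0.69 \rfloor = 6$, and the standard approximation for a Bloom filter's false positive probability gives

$$
p \approx \left(1 - e^{-kn/m}\right)^{k} = \left(1 - e^{-6/10}\right)^{6} \approx (0.4512)^{6} \approx 0.84\%
$$

which sits inside the roughly 0.7%~1.1% band measured at the larger `length` values (the smallest `length` values are dominated by the 64-bit minimum filter size).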
[TODO: present the results as charts]
```
==== Test BloomTest.VaryingLengths
@ length = 1
False positives of bloom : 0.23% ; bytes = 9
False positives of xor : 0.48% ; bytes = 33
@ length = 2
False positives of bloom : 0.44% ; bytes = 9
False positives of xor : 0.48% ; bytes = 33
@ length = 3
False positives of bloom : 0.75% ; bytes = 9
False positives of xor : 0.53% ; bytes = 33
@ length = 4
False positives of bloom : 1.08% ; bytes = 9
False positives of xor : 0.46% ; bytes = 36
@ length = 5
False positives of bloom : 1.20% ; bytes = 9
False positives of xor : 0.43% ; bytes = 36
@ length = 6
False positives of bloom : 1.59% ; bytes = 9
False positives of xor : 0.36% ; bytes = 39
@ length = 7
False positives of bloom : 1.53% ; bytes = 10
False positives of xor : 0.41% ; bytes = 39
@ length = 8
False positives of bloom : 1.81% ; bytes = 11
False positives of xor : 0.44% ; bytes = 39
@ length = 9
False positives of bloom : 0.79% ; bytes = 13
False positives of xor : 0.43% ; bytes = 42
@ length = 10
False positives of bloom : 1.63% ; bytes = 14
False positives of xor : 0.36% ; bytes = 42
@ length = 20
False positives of bloom : 1.24% ; bytes = 26
False positives of xor : 0.37% ; bytes = 54
@ length = 30
False positives of bloom : 0.84% ; bytes = 39
False positives of xor : 0.43% ; bytes = 66
@ length = 40
False positives of bloom : 1.07% ; bytes = 51
False positives of xor : 0.37% ; bytes = 81
@ length = 50
False positives of bloom : 1.09% ; bytes = 64
False positives of xor : 0.41% ; bytes = 93
@ length = 60
False positives of bloom : 1.12% ; bytes = 76
False positives of xor : 0.36% ; bytes = 105
@ length = 70
False positives of bloom : 0.93% ; bytes = 89
False positives of xor : 0.41% ; bytes = 117
@ length = 80
False positives of bloom : 1.16% ; bytes = 101
False positives of xor : 0.33% ; bytes = 129
@ length = 90
False positives of bloom : 1.07% ; bytes = 114
False positives of xor : 0.36% ; bytes = 141
@ length = 100
False positives of bloom : 0.83% ; bytes = 126
False positives of xor : 0.39% ; bytes = 153
@ length = 200
False positives of bloom : 0.96% ; bytes = 251
False positives of xor : 0.38% ; bytes = 276
@ length = 300
False positives of bloom : 0.77% ; bytes = 376
False positives of xor : 0.37% ; bytes = 399
@ length = 400
False positives of bloom : 0.81% ; bytes = 501
False positives of xor : 0.31% ; bytes = 522
@ length = 500
False positives of bloom : 0.74% ; bytes = 626
False positives of xor : 0.37% ; bytes = 645
@ length = 600
False positives of bloom : 0.78% ; bytes = 751
False positives of xor : 0.41% ; bytes = 768
@ length = 700
False positives of bloom : 0.91% ; bytes = 876
False positives of xor : 0.44% ; bytes = 891
@ length = 800
False positives of bloom : 0.88% ; bytes = 1001
False positives of xor : 0.43% ; bytes = 1014
@ length = 900
False positives of bloom : 0.97% ; bytes = 1126
False positives of xor : 0.52% ; bytes = 1137
@ length = 1000
False positives of bloom : 0.90% ; bytes = 1251
False positives of xor : 0.50% ; bytes = 1260
@ length = 2000
False positives of bloom : 0.89% ; bytes = 2501
False positives of xor : 0.33% ; bytes = 2490
@ length = 3000
False positives of bloom : 0.95% ; bytes = 3751
False positives of xor : 0.32% ; bytes = 3720
@ length = 4000
False positives of bloom : 1.01% ; bytes = 5001
False positives of xor : 0.44% ; bytes = 4950
@ length = 5000
False positives of bloom : 0.89% ; bytes = 6251
False positives of xor : 0.28% ; bytes = 6180
@ length = 6000
False positives of bloom : 1.03% ; bytes = 7501
False positives of xor : 0.43% ; bytes = 7410
@ length = 7000
False positives of bloom : 0.78% ; bytes = 8751
False positives of xor : 0.34% ; bytes = 8640
@ length = 8000
False positives of bloom : 1.09% ; bytes = 10001
False positives of xor : 0.47% ; bytes = 9870
@ length = 9000
False positives of bloom : 1.09% ; bytes = 11251
False positives of xor : 0.47% ; bytes = 11100
@ length = 10000
False positives of bloom : 0.81% ; bytes = 12501
False positives of xor : 0.40% ; bytes = 12330
<test result>
bloom : 33 good, 4 mediocre
xor : 37 good, 0 mediocre
```
#### Discussion: TODO
---
## Skip List
A skip list implements the memtable in LevelDB; it also implements the sorted-set data structure in Redis.
Code: [leveldb/db/skiplist.h](https://github.com/google/leveldb/blob/master/db/skiplist.h)
Possible improvements:
* [The Random implementation](https://github.com/google/leveldb/blob/master/util/random.h) has a relatively high execution cost; it could be replaced with [xoshiro128+](http://prng.di.unimi.it/), superseding the original `Next` operation
* Improve cache effectiveness; see [Skip Lists: Done Right](http://ticki.github.io/blog/skip-lists-done-right/)
Modifying Random:
```diff
--- a/db/skiplist.h
+++ b/db/skiplist.h
@@ -32,10 +32,22 @@
#include <cstdlib>
#include "util/arena.h"
-#include "util/random.h"
namespace leveldb {
+struct rand_internal {
+ uint16_t s;
+};
+
+/* 16-bit xorshift PRNG */
+static inline uint16_t Rand(struct rand_internal* x) {
+ uint16_t xs = x->s;
+ xs ^= xs << 7;
+ xs ^= xs >> 9;
+ xs ^= xs << 8;
+ return (x->s = xs);
+}
+
class Arena;
template <typename Key, class Comparator>
@@ -138,7 +150,7 @@ class SkipList {
std::atomic<int> max_height_; // Height of the entire list
// Read/written only by Insert().
- Random rnd_;
+ struct rand_internal rnd_;
};
// Implementation details follow
@@ -243,7 +255,7 @@ int SkipList<Key, Comparator>::RandomHeight() {
// Increase height with probability 1 in kBranching
static const unsigned int kBranching = 4;
int height = 1;
- while (height < kMaxHeight && ((rnd_.Next() % kBranching) == 0)) {
+ while (height < kMaxHeight && (Rand(&rnd_) % kBranching) == 0) {
height++;
}
assert(height > 0);
@@ -326,8 +338,10 @@ SkipList<Key, Comparator>::SkipList(Comparator cmp, Arena* arena)
: compare_(cmp),
arena_(arena),
head_(NewNode(0 /* any key will do */, kMaxHeight)),
- max_height_(1),
- rnd_(0xdeadbeef) {
+ max_height_(1) {
+ static const uint32_t A = 16807; // bits 14, 8, 7, 5, 2, 1, 0
+ rnd_.s = (uintptr_t)&rnd_ * A;
+ Rand(&rnd_);
for (int i = 0; i < kMaxHeight; i++) {
head_->SetNext(i, nullptr);
}
```
> Since `kBranching` is so small (`4`), one could even avoid a PRNG altogether
---
[Skip Lists: Done Right](http://ticki.github.io/blog/skip-lists-done-right/) points out:
> "In the past five years, people have become increasingly sceptical of skip lists’ performance, due to their poor cache behavior when compared to e.g. B-trees"
and, motivated by this, proposes several skip-list designs that improve cache effectiveness.
:::warning
[pointer chasing](https://en.wikichip.org/wiki/pointer_chasing) causes many TLB and cache misses at run time: the processor must execute streams of irregular memory accesses, which sharply reduces cache effectiveness and can even cause [cache pollution](https://en.wikipedia.org/wiki/Cache_pollution).
Besides software designs that mitigate the problem, there are also hardware-level approaches, e.g. [Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation](https://users.ece.cmu.edu/~omutlu/pub/in-memory-pointer-chasing-accelerator_iccd16.pdf).
:::
To observe the cache behavior of LevelDB's original skip list implementation, I profiled `db_bench` with performance events, running `--benchmarks="fillrandom,readseq"`.
* Environment
```
LevelDB: version 1.22
Date: Wed Jan 20 12:07:14 2021
CPU: 8 * Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
CPUCache: 8192 KB
Keys: 16 bytes each
Values: 100 bytes each (50 bytes after compression)
Entries: 1000000
RawSize: 110.6 MB (estimated)
FileSize: 62.9 MB (estimated)
WARNING: Snappy compression is not enabled
```
First, use perf stat to observe the cache-miss and dTLB-miss behavior while running this benchmark:
```shell
$ sudo perf stat --repeat 10 -e dTLB-loads:u,dTLB-load-misses:u,cache-misses:u,cache-references:u \
./db_bench --benchmarks="fillrandom,readseq"
```
* Original Random implementation
```
Performance counter stats for './db_bench --benchmarks=fillrandom,readseq' (10 runs):

    4,019,306,118      dTLB-loads                                      ( +- 0.91% )
          671,019      dTLB-load-misses   #  0.02% of all dTLB cache hits ( +- 1.62% )
       34,044,704      cache-misses:u     # 13.366 % of all cache refs  ( +- 4.47% )
      254,705,111      cache-references:u                               ( +- 0.73% )

            4.594 +- 0.234 seconds time elapsed  ( +- 5.08% )
```
Next, use perf record to locate the hot spots of the original implementation:
```shell
$ sudo perf record -e dTLB-loads:u,dTLB-load-misses:u,cache-misses:u,cache-references:u \
./db_bench --benchmarks="fillrandom,readseq"
```
This produces the following output:
```
LevelDB: version 1.22
Date: Wed Jan 20 12:28:38 2021
CPU: 8 * Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
CPUCache: 8192 KB
Keys: 16 bytes each
Values: 100 bytes each (50 bytes after compression)
Entries: 1000000
RawSize: 110.6 MB (estimated)
FileSize: 62.9 MB (estimated)
WARNING: Snappy compression is not enabled
-------------------------------------------------
fillrandom : 6.808 micros/op; 16.2 MB/s
readseq : 0.187 micros/op; 590.6 MB/s
```
The perf cache-misses event report:
```
Samples: 14K of event 'cache-misses:u', Event count (approx.): 27402381
Overhead Command Shared Object Symbol
16.79% db_bench db_bench [.] leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert
9.94% db_bench libc-2.31.so [.] __memcmp_avx2_movbe
6.90% db_bench db_bench [.] leveldb::InternalKeyComparator::Compare
6.10% db_bench db_bench [.] leveldb::MemTableIterator::key
5.86% db_bench db_bench [.] leveldb::(anonymous namespace)::MergingIterator::Next
5.81% db_bench libc-2.31.so [.] __memmove_avx_unaligned_erms
...
```
The perf dTLB-load-misses event report:
```
Samples: 3K of event 'dTLB-load-misses', Event count (approx.): 586744
Overhead Command Shared Object Symbol
28.60% db_bench db_bench [.] leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert
22.71% db_bench db_bench [.] leveldb::MemTableIterator::key
12.84% db_bench db_bench [.] leveldb::ReadBlock
```
The function `leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert` is a hot spot in both miss statistics.
Next, annotate the symbol `leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert`:
* [filter meta block](https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format#filter-meta-block)