# 2020q3 Project: LevelDB Study
contributed by < `grb72t3yde` >
###### tags: `sysprog2020`

## Reading list
- [x] [SLM-DB: Single-Level Key-Value Store with Persistent Memory](https://www.usenix.org/conference/fast19/presentation/kaiyrakhmet)
- [x] [LevelDB 源码分析系列](https://sf-zhou.github.io/#/LevelDB) by `SF-Zhou`
- [x] [RocksDB Wiki](https://github.com/facebook/rocksdb/wiki)
- [ ] [SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores](https://www.usenix.org/conference/atc19/presentation/balmau)
- [ ] [MatrixKV: Reducing Write Stalls and Write Amplification in LSM-tree Based KV Stores with Matrix Container in NVM](https://www.usenix.org/conference/atc20/presentation/yao)
- [ ] [Jungle: Towards Dynamically Adjustable Key-Value Store by Combining LSM-Tree and Copy-On-Write B+-Tree](https://www.usenix.org/conference/hotstorage19/presentation/ahn)
- [ ] [AC-Key: Adaptive Caching for LSM-based Key-Value Stores](https://www.usenix.org/conference/atc20/presentation/wu-fenggang)
- [x] [Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook](https://www.usenix.org/conference/fast20/presentation/cao-zhichao)
- [x] [split KV](https://www.usenix.org/conference/hotstorage20/presentation/han)

## Research goals for LevelDB
1. Understand how LevelDB incorporates a `bloom filter` to optimize queries
   > $\to$ try introducing the Xor filter
2. Understand how LevelDB implements the memtable with a skip list, and try to improve it
3. Practice using profiling tools (perf, BCC/BPF, and so on) to locate LevelDB's hot spots

## Experimental environment
* Ubuntu 20.04 LTS
* Linux kernel 5.4.0-52-generic

## Building LevelDB
Fetch the latest release of [LevelDB](https://github.com/google/leveldb), version 1.22.
* Build the project
```shell
$ mkdir -p build && cd build
$ cmake -DCMAKE_BUILD_TYPE=Release .. && cmake --build .
```
* Install the LevelDB library
```shell
$ sudo make install
```
If the static library is installed successfully, output similar to the following should appear:
```shell
...
Install the project...
-- Install configuration: "Release"
-- Up-to-date: /usr/local/lib/libleveldb.a
-- Up-to-date: /usr/local/include/leveldb/c.h
-- Up-to-date: /usr/local/include/leveldb/cache.h
-- Up-to-date: /usr/local/include/leveldb/comparator.h
-- Up-to-date: /usr/local/include/leveldb/db.h
-- Up-to-date: /usr/local/include/leveldb/dumpfile.h
-- Up-to-date: /usr/local/include/leveldb/env.h
-- Up-to-date: /usr/local/include/leveldb/export.h
-- Up-to-date: /usr/local/include/leveldb/filter_policy.h
-- Up-to-date: /usr/local/include/leveldb/iterator.h
-- Up-to-date: /usr/local/include/leveldb/options.h
-- Up-to-date: /usr/local/include/leveldb/slice.h
-- Up-to-date: /usr/local/include/leveldb/status.h
-- Up-to-date: /usr/local/include/leveldb/table_builder.h
-- Up-to-date: /usr/local/include/leveldb/table.h
-- Up-to-date: /usr/local/include/leveldb/write_batch.h
-- Up-to-date: /usr/local/lib/cmake/leveldb/leveldbTargets.cmake
-- Up-to-date: /usr/local/lib/cmake/leveldb/leveldbTargets-release.cmake
-- Up-to-date: /usr/local/lib/cmake/leveldb/leveldbConfig.cmake
-- Up-to-date: /usr/local/lib/cmake/leveldb/leveldbConfigVersion.cmake
```
Next, compile the following program, which manipulates a `DB` object to verify that the installation succeeded:
```cpp
#include <cassert>
#include <iostream>

#include "leveldb/db.h"

/* hello_leveldb.cc */
int main(int argc, const char* argv[]) {
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &db);
  assert(status.ok());

  std::cout << "Hello LevelDB!\n";
  delete db;
  return 0;
}
```
```shell
$ clang++ hello_leveldb.cc -o hello_leveldb /usr/local/lib/libleveldb.a -pthread
```

---

## Bloom Filter
### LevelDB and the spirit of the LSM-tree
* `LevelDB` organizes its on-disk files with an `LSM-tree (Log-Structured Merge-Tree)`, which guarantees that both data and log writes are `sequential write`s. This matters greatly on SSDs, since they handle `random write`s poorly; see [SFS: Random Write Considered Harmful in Solid State Drives](https://www.usenix.org/conference/fast12/sfs-random-write-considered-harmful-solid-state-drives).
* A `log-structured` store writes data by `append`ing (and reclaims stale data in the background). Updates are never applied `in-place`: new data is simply written, and the superseded versions wait for `garbage collection (GC)`. Consequently:
  1. write performance improves, but
  2. `random read` performance is sacrificed, and this is exactly why the `Bloom Filter` is introduced.
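To make the append-only update semantics concrete, here is a minimal sketch using LevelDB's public API (building on `hello_leveldb.cc` above); the keys and values are made up for illustration:
```cpp
#include <cassert>
#include <iostream>
#include <string>

#include "leveldb/db.h"

int main() {
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;
  assert(leveldb::DB::Open(options, "/tmp/testdb", &db).ok());

  std::string value;
  assert(db->Put(leveldb::WriteOptions(), "k1", "v1").ok());  // initial write
  assert(db->Put(leveldb::WriteOptions(), "k1", "v2").ok());  // "update" = another appended record
  assert(db->Get(leveldb::ReadOptions(), "k1", &value).ok());
  std::cout << value << '\n';                                 // prints "v2"
  assert(db->Delete(leveldb::WriteOptions(), "k1").ok());     // writes a tombstone, no in-place erase
  delete db;
  return 0;
}
```
From the API's point of view an update is just another `Put()` and a deletion writes a tombstone; the old versions are reclaimed later during compaction.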
### Why LevelDB adopts a Bloom filter
* To read a `key value pair`, we must search **every file whose `key range` contains the `key`**, because **we cannot tell which file actually holds the target `key`**.
> For example, suppose the target is `k = 4`. Given a file s with key range `1~5`, a file s' with key range `3~7` (and any other files whose key range spans `4`), we cannot tell which files really contain the key `4`, so all of them must be searched.
> In fact, possibly none of them has it; possibly all of them do, in which case we can of course tell which copy is the newest version.

:::info
The motivation for introducing a Bloom filter is thus to predict whether a key may exist in a given file, which can shrink the set of files to be searched and mitigate read amplification.
:::

* The figure below is reproduced from the reference data in [SLM-DB: Single-Level Key-Value Store with Persistent Memory](https://www.usenix.org/conference/fast19/presentation/kaiyrakhmet). Observe that as the `value size` grows, the benefit of the `Bloom Filter` becomes more pronounced.

![](https://i.imgur.com/IF96OnG.png)

### Goal of this item: introduce the Xor filter into LevelDB; compare false positive rates and memory overhead
- [x] Study the Bloom filter code used in LevelDB
- [x] Review [Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters](https://arxiv.org/abs/1912.08258)
- [x] Learn from other students' contributions on the Xor filter
    * [eecheng87](https://hackmd.io/@eecheng/HJd1sde8v?fbclid=IwAR3yHXTxB9CPOsxsOH0A0TobMJ3puaVQuX634Xa1t-fk98gzJwD35u1FrJA)
    * [JulianATA](https://hackmd.io/@hPMCWajOS-ORQdEEAQ04-w/H1NVb6u8P?fbclid=IwAR2Rka6XXJAcboMtYkPXo6MBy1JDOFfmqWyWWXqv8ZZcYQBRHwe1vY6nxOw)
- [x] Compare [xor_singleheader](https://github.com/FastFilter/xor_singleheader) with LevelDB's built-in bloom filter

### Comparing the bloom filter in dict with the one in LevelDB
We already covered the Bloom filter in [dict](https://hackmd.io/@sysprog/2020-dict); how does LevelDB's implementation differ? The discussion below reuses the variables of the formula derived in dict, ==$k = \ln 2 \cdot (m/n)$== :

1. LevelDB uses a *block-based bloom filter*; this is best read alongside the [leveldb file format](https://github.com/google/leveldb/blob/master/doc/table_format.md). Within a single `file`, LevelDB builds one filter over all keys of every `2KB` of data-block data and stores it in that `file`'s filter (meta) block, so the system actually holds many small filters. (For the rationale behind this design, see [this document](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter); it also explains why RocksDB dropped this scheme in favor of a *full filter*. The layout of the [filter meta block](https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format#filter-meta-block) itself is described in the RocksDB wiki.) We can inspect the code that writes the filter block in [filter_block.cc](https://github.com/google/leveldb/blob/master/table/filter_block.cc): in the `FilterBlockBuilder` class, `StartBlock` computes `filter_index` in units of 2KB.
```cpp=
...
namespace leveldb {

// See doc/table_format.md for an explanation of the filter block format.

// Generate new filter every 2KB of data
static const size_t kFilterBaseLg = 11;
static const size_t kFilterBase = 1 << kFilterBaseLg;

FilterBlockBuilder::FilterBlockBuilder(const FilterPolicy* policy)
    : policy_(policy) {}

void FilterBlockBuilder::StartBlock(uint64_t block_offset) {
  uint64_t filter_index = (block_offset / kFilterBase);
  assert(filter_index >= filter_offsets_.size());
  while (filter_index > filter_offsets_.size()) {
    GenerateFilter();
  }
}
...
void FilterBlockBuilder::GenerateFilter() {
  const size_t num_keys = start_.size();
  if (num_keys == 0) {
    // Fast path if there are no keys for this filter
    filter_offsets_.push_back(result_.size());
    return;
  }

  // Make list of keys from flattened key structure
  start_.push_back(keys_.size());  // Simplify length computation
  tmp_keys_.resize(num_keys);
  for (size_t i = 0; i < num_keys; i++) {
    const char* base = keys_.data() + start_[i];
    size_t length = start_[i + 1] - start_[i];
    tmp_keys_[i] = Slice(base, length);
  }

  // Generate filter for current set of keys and append to result_.
  filter_offsets_.push_back(result_.size());
  policy_->CreateFilter(&tmp_keys_[0], static_cast<int>(num_keys), &result_);

  tmp_keys_.clear();
  keys_.clear();
  start_.clear();
}
...
```
1. The call to `CreateFilter` at line 40 builds a filter over the key set in `tmp_keys_`. This function is discussed below, together with the bloom filter implementation.
2. For each filter, LevelDB uses `double-hashing` to generate the $k$ hash values. This part is likewise deferred to the bloom filter implementation below.
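As a quick sanity check of the arithmetic in `StartBlock` above, here is a standalone sketch; the constants are copied from `filter_block.cc`, while the sample offsets are made up:
```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Constants copied from table/filter_block.cc: one filter per 2KB
// of data-block offset space.
static const size_t kFilterBaseLg = 11;
static const size_t kFilterBase = 1 << kFilterBaseLg;  // 2048

int main() {
  // Every data block that begins inside the same 2KB window maps to the
  // same filter index; crossing a 2KB boundary starts a new filter.
  const uint64_t offsets[] = {0, 1500, 2048, 5000, 8192};
  for (uint64_t block_offset : offsets)
    std::printf("block_offset %5llu -> filter_index %llu\n",
                (unsigned long long)block_offset,
                (unsigned long long)(block_offset / kFilterBase));
  return 0;
}
```
For these offsets the indices come out as 0, 0, 1, 2, 4, i.e. blocks starting at 0 and 1500 share a filter, while a block starting at 8192 skips ahead, which is why `StartBlock` loops to emit the intervening (empty) filters.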
### LevelDB's bloom filter implementation
Start from [include/leveldb/filter_policy.h](https://github.com/google/leveldb/blob/master/include/leveldb/filter_policy.h), which defines the base class `FilterPolicy` as an interface:
```cpp
class LEVELDB_EXPORT FilterPolicy {
 public:
  virtual ~FilterPolicy();
  virtual const char* Name() const = 0;
  virtual void CreateFilter(const Slice* keys, int n, std::string* dst) const = 0;
  virtual bool KeyMayMatch(const Slice& key, const Slice& filter) const = 0;
};
...
LEVELDB_EXPORT const FilterPolicy* NewBloomFilterPolicy(int bits_per_key);

}  // namespace leveldb
```
For example, a user can define a custom filter policy like this:
```cpp
class CustomFilterPolicy : public leveldb::FilterPolicy {
 private:
  const FilterPolicy* builtin_policy_;

 public:
  CustomFilterPolicy() : builtin_policy_(NewBloomFilterPolicy(10)) {}
  ~CustomFilterPolicy() { delete builtin_policy_; }

  const char* Name() const { return "IgnoreTrailingSpacesFilter"; }

  void CreateFilter(const Slice* keys, int n, std::string* dst) const {
    // Use builtin bloom filter code after removing trailing spaces
    std::vector<Slice> trimmed(n);
    for (int i = 0; i < n; i++) {
      trimmed[i] = RemoveTrailingSpaces(keys[i]);
    }
    builtin_policy_->CreateFilter(&trimmed[0], n, dst);
  }
};
```
LevelDB's own bloom filter implementation is the class `BloomFilterPolicy` in [util/bloom.cc](https://github.com/google/leveldb/blob/master/util/bloom.cc); `CreateFilter` at line 12 is the function invoked when the filter block is built, as described earlier:
```cpp=
class BloomFilterPolicy : public FilterPolicy {
 public:
  explicit BloomFilterPolicy(int bits_per_key) : bits_per_key_(bits_per_key) {
    // We intentionally round down to reduce probing cost a little bit
    k_ = static_cast<size_t>(bits_per_key * 0.69);  // 0.69 =~ ln(2)
    if (k_ < 1) k_ = 1;
    if (k_ > 30) k_ = 30;
  }

  virtual const char* Name() const { return "leveldb.BuiltinBloomFilter2"; }

  virtual void CreateFilter(const Slice* keys, int n, std::string* dst) const {
    // Compute bloom filter size (in both bits and bytes)
    size_t bits = n * bits_per_key_;

    // For small n, we can see a very high false positive rate. Fix it
    // by enforcing a minimum bloom filter length.
    if (bits < 64) bits = 64;

    size_t bytes = (bits + 7) / 8;
    bits = bytes * 8;

    const size_t init_size = dst->size();
    dst->resize(init_size + bytes, 0);
    dst->push_back(static_cast<char>(k_));  // Remember # of probes in filter
    char* array = &(*dst)[init_size];
    for (int i = 0; i < n; i++) {
      // Use double-hashing to generate a sequence of hash values.
      // See analysis in [Kirsch,Mitzenmacher 2006].
      uint32_t h = BloomHash(keys[i]);
      const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
      for (size_t j = 0; j < k_; j++) {
        const uint32_t bitpos = h % bits;
        array[bitpos / 8] |= (1 << (bitpos % 8));
        h += delta;
      }
    }
  }

  virtual bool KeyMayMatch(const Slice& key, const Slice& bloom_filter) const {
    const size_t len = bloom_filter.size();
    if (len < 2) return false;

    const char* array = bloom_filter.data();
    const size_t bits = (len - 1) * 8;

    // Use the encoded k so that we can read filters generated by
    // bloom filters created using different parameters.
    const size_t k = array[len - 1];
    if (k > 30) {
      // Reserved for potentially new encodings for short bloom filters.
      // Consider it a match.
      return true;
    }

    uint32_t h = BloomHash(key);
    const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
    for (size_t j = 0; j < k; j++) {
      const uint32_t bitpos = h % bits;
      if ((array[bitpos / 8] & (1 << (bitpos % 8))) == 0) return false;
      h += delta;
    }
    return true;
  }

 private:
  size_t bits_per_key_;
  size_t k_;
};
}  // namespace

const FilterPolicy* NewBloomFilterPolicy(int bits_per_key) {
  return new BloomFilterPolicy(bits_per_key);
}

}  // namespace leveldb
```
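For context, this is how a policy created by `NewBloomFilterPolicy` is wired into a database, following the pattern in LevelDB's documentation; `bits_per_key = 10` matches the recommended default discussed below, and the database path is made up:
```cpp
#include <cassert>

#include "leveldb/db.h"
#include "leveldb/filter_policy.h"

int main() {
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;
  // Attach a bloom filter policy: every table built from now on carries
  // filter blocks, and reads consult them via KeyMayMatch().
  options.filter_policy = leveldb::NewBloomFilterPolicy(10);
  assert(leveldb::DB::Open(options, "/tmp/testdb_bloom", &db).ok());
  // ... use the database ...
  delete db;
  // The policy must outlive the DB, so release it only afterwards.
  delete options.filter_policy;
  return 0;
}
```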
From the `util/bloom.cc` snippet we can observe several things:
1. `NewBloomFilterPolicy` is the factory function for the built-in bloom filter policy. It takes a single argument, `bits_per_key`. Line 5 shows that this number is exactly $(m/n)$, and it determines $k$. LevelDB furthermore clamps the computed $k$ to the range 1~30, bounding the cost of evaluating hash functions every time a filter is built.
2. The loop at line 27 feeds each key into the hash function to obtain its fingerprint, and at first glance only one hash function (`BloomHash`) is used. This is double hashing: the first hash function is `BloomHash`, and the second is `delta`. See [Less Hashing, Same Performance: Building a Better Bloom Filter](https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf).
3. Line 25 stores the filter's hash-function count $k$ right after the filter (occupying 1 byte), for use at query time.

:::warning
A BloomFilterPolicy is constructed with one `bits_per_key` argument. If $k$ is fixed under a given policy, why does $k$ need to be stored at all?
:::

### Experiment: comparing the false positive probability and memory overhead of LevelDB's built-in bloom filter and the Xor8 filter
As noted above, LevelDB already bounds the time spent computing hash functions, so the comparison below focuses on the memory overhead and the false positive probability (FPP) of LevelDB's bloom filter versus the xor filter; a sketch of how the Xor8 side is driven follows the setup, and the measured numbers come after it.

:::info
`Experimental group`: Xor8 filter
`Control group`: LevelDB bloom filter with `bits_per_key` = 10 (LevelDB's recommended default; it of course affects the memory overhead)
`Test program`: the `VaryingLengths` TEST in [util/bloom_test](https://github.com/google/leveldb/blob/master/util/bloom_test.cc). It first adds `length` consecutive keys, then measures the false positive probability by querying ten thousand keys that were never added. A filter is judged *good* when its false positive rate is 1.25% or lower, and *mediocre* otherwise.
:::
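Here is a minimal sketch of how the Xor8 side of the comparison can be exercised through the [xor_singleheader](https://github.com/FastFilter/xor_singleheader) API; the function names (`xor8_allocate`, `xor8_populate`, `xor8_contain`, `xor8_size_in_bytes`) follow that repo's README, the synthetic 64-bit keys stand in for hashed LevelDB keys, and the actual hookup into `bloom_test.cc` differs:
```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

#include "xorfilter.h"  // from FastFilter/xor_singleheader

int main() {
  const uint32_t n = 10000;
  std::vector<uint64_t> keys(n);
  for (uint32_t i = 0; i < n; i++) keys[i] = i;  // stand-ins for hashed keys

  xor8_t filter;
  assert(xor8_allocate(n, &filter));
  assert(xor8_populate(keys.data(), n, &filter));  // one-shot build; the filter is immutable

  assert(xor8_contain(keys[0], &filter));  // no false negatives by construction

  // Estimate the false positive probability with keys that were never added.
  uint32_t false_positives = 0;
  const uint32_t probes = 10000;
  for (uint32_t i = 0; i < probes; i++)
    if (xor8_contain(n + i, &filter)) false_positives++;
  std::printf("FPP ~= %.2f%%, bytes = %zu\n",
              100.0 * false_positives / probes, xor8_size_in_bytes(&filter));

  xor8_free(&filter);
  return 0;
}
```
Unlike the bloom filter, the xor filter is built in one shot over the full key set and cannot be updated incrementally, which is compatible with how LevelDB writes a filter block once per table file.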
As `length` grows, the Xor filter holds its FPP at $\simeq 0.4\%$ and ultimately passes all 37 tests.

Note also that LevelDB enforces a minimum filter size of 8 bytes, because an overly small filter would produce a high FPP; that is why the first several iterations all report 9 bytes (8 bytes plus 1 byte for $k$). The Xor8 sizes follow the formula in the paper, which also proposes the Xor+ filter to reduce the wasted space. Once `length` reaches 2000, the Xor filter's memory overhead drops below the bloom filter's.

[TODO: plot the table below as a chart]

Results of `BloomTest.VaryingLengths`:

| length | bloom FPP | bloom bytes | xor FPP | xor bytes |
|-------:|----------:|------------:|--------:|----------:|
| 1 | 0.23% | 9 | 0.48% | 33 |
| 2 | 0.44% | 9 | 0.48% | 33 |
| 3 | 0.75% | 9 | 0.53% | 33 |
| 4 | 1.08% | 9 | 0.46% | 36 |
| 5 | 1.20% | 9 | 0.43% | 36 |
| 6 | 1.59% | 9 | 0.36% | 39 |
| 7 | 1.53% | 10 | 0.41% | 39 |
| 8 | 1.81% | 11 | 0.44% | 39 |
| 9 | 0.79% | 13 | 0.43% | 42 |
| 10 | 1.63% | 14 | 0.36% | 42 |
| 20 | 1.24% | 26 | 0.37% | 54 |
| 30 | 0.84% | 39 | 0.43% | 66 |
| 40 | 1.07% | 51 | 0.37% | 81 |
| 50 | 1.09% | 64 | 0.41% | 93 |
| 60 | 1.12% | 76 | 0.36% | 105 |
| 70 | 0.93% | 89 | 0.41% | 117 |
| 80 | 1.16% | 101 | 0.33% | 129 |
| 90 | 1.07% | 114 | 0.36% | 141 |
| 100 | 0.83% | 126 | 0.39% | 153 |
| 200 | 0.96% | 251 | 0.38% | 276 |
| 300 | 0.77% | 376 | 0.37% | 399 |
| 400 | 0.81% | 501 | 0.31% | 522 |
| 500 | 0.74% | 626 | 0.37% | 645 |
| 600 | 0.78% | 751 | 0.41% | 768 |
| 700 | 0.91% | 876 | 0.44% | 891 |
| 800 | 0.88% | 1001 | 0.43% | 1014 |
| 900 | 0.97% | 1126 | 0.52% | 1137 |
| 1000 | 0.90% | 1251 | 0.50% | 1260 |
| 2000 | 0.89% | 2501 | 0.33% | 2490 |
| 3000 | 0.95% | 3751 | 0.32% | 3720 |
| 4000 | 1.01% | 5001 | 0.44% | 4950 |
| 5000 | 0.89% | 6251 | 0.28% | 6180 |
| 6000 | 1.03% | 7501 | 0.43% | 7410 |
| 7000 | 0.78% | 8751 | 0.34% | 8640 |
| 8000 | 1.09% | 10001 | 0.47% | 9870 |
| 9000 | 1.09% | 11251 | 0.47% | 11100 |
| 10000 | 0.81% | 12501 | 0.40% | 12330 |

```
<test result>
bloom : 33 good, 4 mediocre
xor   : 37 good, 0 mediocre
```

#### Discussion: TODO

---

## Skip List
A skip list implements the memtable in LevelDB; it is also used to implement the sorted set data structure in Redis.

Code: [leveldb/db/skiplist.h](https://github.com/google/leveldb/blob/master/db/skiplist.h)

Potential improvements:
* [The Random implementation](https://github.com/google/leveldb/blob/master/util/random.h) is relatively costly to run; it can be replaced with a cheaper PRNG (e.g. [xoshiro128+](http://prng.di.unimi.it/)), substituting the existing `Next` operation. The patch below uses a 16-bit xorshift; a standalone check of this PRNG follows the patch.
* Improve cache effectiveness; see [Skip Lists: Done Right](http://ticki.github.io/blog/skip-lists-done-right/)

Modifying Random:
```diff
--- a/db/skiplist.h
+++ b/db/skiplist.h
@@ -32,10 +32,22 @@
 #include <cstdlib>
 
 #include "util/arena.h"
-#include "util/random.h"
 
 namespace leveldb {
 
+struct rand_internal {
+  uint16_t s;
+};
+
+/* 16-bit xorshift PRNG */
+static inline uint16_t Rand(struct rand_internal* x) {
+  uint16_t xs = x->s;
+  xs ^= xs << 7;
+  xs ^= xs >> 9;
+  xs ^= xs << 8;
+  return (x->s = xs);
+}
+
 class Arena;
 
 template <typename Key, class Comparator>
@@ -138,7 +150,7 @@ class SkipList {
   std::atomic<int> max_height_;  // Height of the entire list
 
   // Read/written only by Insert().
-  Random rnd_;
+  struct rand_internal rnd_;
 };
 
 // Implementation details follow
@@ -243,7 +255,7 @@ int SkipList<Key, Comparator>::RandomHeight() {
   // Increase height with probability 1 in kBranching
   static const unsigned int kBranching = 4;
   int height = 1;
-  while (height < kMaxHeight && ((rnd_.Next() % kBranching) == 0)) {
+  while (height < kMaxHeight && (Rand(&rnd_) % kBranching) == 0) {
     height++;
   }
   assert(height > 0);
@@ -326,8 +338,10 @@ SkipList<Key, Comparator>::SkipList(Comparator cmp, Arena* arena)
     : compare_(cmp),
       arena_(arena),
       head_(NewNode(0 /* any key will do */, kMaxHeight)),
-      max_height_(1),
-      rnd_(0xdeadbeef) {
+      max_height_(1) {
+  static const uint32_t A = 16807;  // bits 14, 8, 7, 5, 2, 1, 0
+  rnd_.s = (uintptr_t)&rnd_ * A;
+  Rand(&rnd_);
   for (int i = 0; i < kMaxHeight; i++) {
     head_->SetNext(i, nullptr);
   }
```
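Here is the standalone check of the 16-bit xorshift introduced by the patch (the shift triple 7/9/8 is copied from it; the rest is a sketch). One caveat worth flagging: xorshift maps a zero state to zero, so the address-based seeding in the constructor implicitly assumes the truncated product is nonzero.
```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

struct rand_internal {
  uint16_t s;
};

/* Same 16-bit xorshift as in the patch above. */
static inline uint16_t Rand(struct rand_internal* x) {
  uint16_t xs = x->s;
  xs ^= xs << 7;
  xs ^= xs >> 9;
  xs ^= xs << 8;
  return (x->s = xs);
}

int main() {
  struct rand_internal r = {0xdead};  // any nonzero seed; a zero state stays zero forever
  uint32_t buckets[4] = {0};

  // The only consumer is RandomHeight(), which uses the value mod
  // kBranching (= 4), so check that the low two bits look uniform.
  const uint32_t rounds = 1u << 20;
  for (uint32_t i = 0; i < rounds; i++) buckets[Rand(&r) % 4]++;

  for (int b = 0; b < 4; b++)
    std::printf("bucket %d: %.2f%%\n", b, 100.0 * buckets[b] / rounds);

  // Starting from a nonzero seed, the state must never reach zero.
  assert(r.s != 0);
  return 0;
}
```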
> Since `kBranching` is so small (`4`), one could even avoid using a PRNG altogether.

---

[Skip Lists: Done Right](http://ticki.github.io/blog/skip-lists-done-right/) points out:
> "In the past five years, people have become increasingly sceptical of skip lists' performance, due to their poor cache behavior when compared to e.g. B-trees"

and, motivated by this, proposes several skip list designs that improve cache effectiveness.

:::warning
[Pointer chasing](https://en.wikichip.org/wiki/pointer_chasing) incurs many TLB and cache misses at run time: the processor must execute long sequences of irregular memory accesses, which sharply lowers cache effectiveness and can even lead to [cache pollution](https://en.wikipedia.org/wiki/Cache_pollution).
Besides mitigating the problem in software, there are also hardware-level approaches, e.g. [Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation](https://users.ece.cmu.edu/~omutlu/pub/in-memory-pointer-chasing-accelerator_iccd16.pdf).
:::

To observe the cache behavior of LevelDB's original skip list implementation, I profile `db_bench` with performance events, running `--benchmarks="fillrandom,readseq"`.

* Environment
```
LevelDB:    version 1.22
Date:       Wed Jan 20 12:07:14 2021
CPU:        8 * Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
CPUCache:   8192 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
WARNING: Snappy compression is not enabled
```

First, use `perf stat` to observe cache-miss and dTLB-miss behavior while running this benchmark:
```shell
$ sudo perf stat --repeat 10 -e dTLB-loads:u,dTLB-load-misses:u,cache-misses:u,cache-references:u \
./db_bench --benchmarks="fillrandom,readseq"
```

* With the original Random implementation
```
 Performance counter stats for './db_bench --benchmarks=fillrandom,readseq' (10 runs):

     4,019,306,118      dTLB-loads                                                    ( +-  0.91% )
           671,019      dTLB-load-misses          #    0.02% of all dTLB cache hits   ( +-  1.62% )
        34,044,704      cache-misses:u            #   13.366 % of all cache refs      ( +-  4.47% )
       254,705,111      cache-references:u                                            ( +-  0.73% )

             4.594 +- 0.234 seconds time elapsed  ( +-  5.08% )
```

Next, use `perf record` to locate the hot spots of the original implementation:
```shell
$ sudo perf record -e dTLB-loads:u,dTLB-load-misses:u,cache-misses:u,cache-references:u \
./db_bench --benchmarks="fillrandom,readseq"
```

which yields:
```
LevelDB:    version 1.22
Date:       Wed Jan 20 12:28:38 2021
CPU:        8 * Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
CPUCache:   8192 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
WARNING: Snappy compression is not enabled
-------------------------------------------------
fillrandom   :       6.808 micros/op;   16.2 MB/s
readseq      :       0.187 micros/op;  590.6 MB/s
```

The perf report for the `cache-misses` event:
```
Samples: 14K of event 'cache-misses:u', Event count (approx.): 27402381
Overhead  Command   Shared Object  Symbol
  16.79%  db_bench  db_bench       [.] leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert
   9.94%  db_bench  libc-2.31.so   [.] __memcmp_avx2_movbe
   6.90%  db_bench  db_bench       [.] leveldb::InternalKeyComparator::Compare
   6.10%  db_bench  db_bench       [.] leveldb::MemTableIterator::key
   5.86%  db_bench  db_bench       [.] leveldb::(anonymous namespace)::MergingIterator::Next
   5.81%  db_bench  libc-2.31.so   [.] __memmove_avx_unaligned_erms
...
```
The perf report for the `dTLB-load-misses` event:
```
Samples: 3K of event 'dTLB-load-misses', Event count (approx.): 586744
Overhead  Command   Shared Object  Symbol
  28.60%  db_bench  db_bench       [.] leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert
  22.71%  db_bench  db_bench       [.] leveldb::MemTableIterator::key
  12.84%  db_bench  db_bench       [.] leveldb::ReadBlock
```

`leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert` turns out to be a hot spot in both miss statistics.

Next, annotate the symbol `leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert`:
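One way to drill into that symbol is `perf annotate`, assuming the `perf.data` produced by the `perf record` step above; `-s`/`--symbol` selects the symbol to annotate, and quoting is needed because of the template brackets:
```shell
$ sudo perf annotate -i perf.data \
    --symbol="leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert"
```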