
2020q3 Project: A Study of LevelDB

contributed by < grb72t3yde >

tags: sysprog2020

Research materials read so far

Research goals for LevelDB

  1. Understand how LevelDB incorporates a Bloom filter to optimize queries

    Try introducing an Xor filter instead

  2. Understand how LevelDB uses a skip list to implement the memtable, and attempt improvements
  3. Practice using profiling tools (perf, BCC/BPF, etc.) to identify LevelDB's hot spots

Experimental environment

  • Ubuntu 20.04 LTS
  • Linux kernel 5.4.0-52-generic

Building LevelDB

Obtain the latest LevelDB release, version 1.22.

  • Build the project
$ mkdir -p build && cd build
$ cmake -DCMAKE_BUILD_TYPE=Release .. && cmake --build .
  • Install the LevelDB library
$ sudo make install

If the static library is installed successfully, output similar to the following should appear:

...
Install the project...
-- Install configuration: "Release"
-- Up-to-date: /usr/local/lib/libleveldb.a
-- Up-to-date: /usr/local/include/leveldb/c.h
-- Up-to-date: /usr/local/include/leveldb/cache.h
-- Up-to-date: /usr/local/include/leveldb/comparator.h
-- Up-to-date: /usr/local/include/leveldb/db.h
-- Up-to-date: /usr/local/include/leveldb/dumpfile.h
-- Up-to-date: /usr/local/include/leveldb/env.h
-- Up-to-date: /usr/local/include/leveldb/export.h
-- Up-to-date: /usr/local/include/leveldb/filter_policy.h
-- Up-to-date: /usr/local/include/leveldb/iterator.h
-- Up-to-date: /usr/local/include/leveldb/options.h
-- Up-to-date: /usr/local/include/leveldb/slice.h
-- Up-to-date: /usr/local/include/leveldb/status.h
-- Up-to-date: /usr/local/include/leveldb/table_builder.h
-- Up-to-date: /usr/local/include/leveldb/table.h
-- Up-to-date: /usr/local/include/leveldb/write_batch.h
-- Up-to-date: /usr/local/lib/cmake/leveldb/leveldbTargets.cmake
-- Up-to-date: /usr/local/lib/cmake/leveldb/leveldbTargets-release.cmake
-- Up-to-date: /usr/local/lib/cmake/leveldb/leveldbConfig.cmake
-- Up-to-date: /usr/local/lib/cmake/leveldb/leveldbConfigVersion.cmake

Next, compile the following program, which opens a DB object to verify that the installation succeeded:

#include <iostream>
#include <cassert>
#include "leveldb/db.h"

/* hello_leveldb.cc */
int main(int argc, const char* argv[])
{
    leveldb::DB* db;
    leveldb::Options options;
    options.create_if_missing = true;
    leveldb::Status status = leveldb::DB::Open(options,"/tmp/testdb", &db);
    assert(status.ok());
    std::cout << "Hello LevelDB!\n";
    delete db;
    return 0;
}
$ clang++ hello_leveldb.cc -o hello_leveldb /usr/local/lib/libleveldb.a -pthread
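
Running ./hello_leveldb should print Hello LevelDB!. Once the DB opens successfully, the basic Put/Get API can be exercised as well; a short sketch using the standard LevelDB calls (error handling kept minimal):

// Write a key-value pair, then read it back.
std::string value;
leveldb::Status s = db->Put(leveldb::WriteOptions(), "key1", "value1");
if (s.ok()) s = db->Get(leveldb::ReadOptions(), "key1", &value);
assert(s.ok() && value == "value1");

s = db->Get(leveldb::ReadOptions(), "no-such-key", &value);
assert(s.IsNotFound());  // a missing key is reported via Status, not a crash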

Bloom Filter

LevelDB and the spirit of the LSM-tree

  • LevelDB uses an LSM-tree (Log-Structured Merge-Tree) as its on-disk data structure for organizing files. It guarantees that data and log writes are sequential writes, which matters especially on SSDs (since they handle random writes poorly).
    See SFS: Random Write Considered Harmful in Solid State Drives
  • The log-structured approach writes data by appending, and cleans up stale data in the background. Updates are never applied in place; new data is simply written, and the old versions are left for garbage collection (GC). As a result:
  1. Write performance improves
  2. At the same time, random read performance is sacrificed, which is also the reason for introducing the Bloom filter

Why LevelDB uses a Bloom filter

  • To read a key-value pair, we must search every file whose key range contains the key, because we cannot tell which file actually holds the target key.

For example, suppose the target is k = 4. Given a file s with key range 1~5, a file s' with key range 3~7, and any other files whose key range spans 4, we cannot tell which of them actually contain the key 4, so all of them must be searched.
In fact, it is possible that none of them contains it, or that all of them do; in the latter case we can of course determine which copy is the newest version.

The motivation for introducing a Bloom filter is therefore to predict whether a key exists in a given file, which can reduce the number of files that must be searched and mitigate read amplification.
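
Conceptually, the read path can consult each candidate file's filter before touching the file itself. A rough sketch of the idea (the helper names FilesWhoseRangeCovers, FilterSaysMayMatch and ReadFromFile are hypothetical; this is not LevelDB's actual TableCache/Version code):

std::string value;
for (const FileMetaData* f : FilesWhoseRangeCovers(target_key)) {
  if (!FilterSaysMayMatch(f, target_key))   // filters have no false negatives
    continue;                               // key is definitely absent, skip the disk read
  if (ReadFromFile(f, target_key, &value))  // may still miss: a false positive
    break;                                  // newest version found, stop searching
}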


Goal of this part: introduce an Xor filter into LevelDB and compare the false positive rate and memory overhead.

Comparing the Bloom filter in dict with the one used in LevelDB

We discussed the Bloom filter in the dict lab. How does LevelDB's Bloom filter implementation differ from the one in dict? The explanation below reuses the formula derived in dict, $k = \ln 2 \cdot (m/n)$, and the meaning of its variables:

  1. LevelDB uses block-based Bloom filters. This is easier to understand together with the LevelDB file format. Within a single file, LevelDB builds one filter over the keys of every 2 KB worth of data-block data and stores all of these filters in the file's filter (meta) block, so the system actually contains many small filters. (The reason this works, and the motivation behind it, is discussed in this article, which also explains why RocksDB dropped this scheme in favor of full filters.)

We can look at the code that writes the filter block, filter_block.cc. In the FilterBlockBuilder class, StartBlock uses the 2 KB unit to compute filter_index.

...
namespace leveldb {

// See doc/table_format.md for an explanation of the filter block format.

// Generate new filter every 2KB of data
static const size_t kFilterBaseLg = 11;
static const size_t kFilterBase = 1 << kFilterBaseLg;

FilterBlockBuilder::FilterBlockBuilder(const FilterPolicy* policy)
    : policy_(policy) {}

void FilterBlockBuilder::StartBlock(uint64_t block_offset) {
  uint64_t filter_index = (block_offset / kFilterBase);
  assert(filter_index >= filter_offsets_.size());
  while (filter_index > filter_offsets_.size()) {
    GenerateFilter();
  }
}
...
void FilterBlockBuilder::GenerateFilter() {
  const size_t num_keys = start_.size();
  if (num_keys == 0) {
    // Fast path if there are no keys for this filter
    filter_offsets_.push_back(result_.size());
    return;
  }

  // Make list of keys from flattened key structure
  start_.push_back(keys_.size());  // Simplify length computation
  tmp_keys_.resize(num_keys);
  for (size_t i = 0; i < num_keys; i++) {
    const char* base = keys_.data() + start_[i];
    size_t length = start_[i + 1] - start_[i];
    tmp_keys_[i] = Slice(base, length);
  }

  // Generate filter for current set of keys and append to result_.
  filter_offsets_.push_back(result_.size());
  policy_->CreateFilter(&tmp_keys_[0], static_cast<int>(num_keys), &result_);

  tmp_keys_.clear();
  keys_.clear();
  start_.clear();
}
...
  1. The CreateFilter call in GenerateFilter builds a filter for the keys collected in tmp_keys_. This function is explained below when we get to the Bloom filter implementation.

  2. For each filter, LevelDB uses double hashing to generate the $k$ hash values. This part is also explained below in the Bloom filter implementation.
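
To make the 2 KB partitioning above concrete, here is a small self-contained illustration (my own sketch, not LevelDB code) of the filter_index arithmetic used by StartBlock:

#include <cstdint>
#include <cstdio>

static const uint64_t kFilterBase = 1 << 11;  // 2 KB, as in filter_block.cc

int main() {
  // Data blocks starting at these offsets map to these filter indexes.
  uint64_t offsets[] = {0, 1500, 2048, 5000, 9000};
  for (uint64_t off : offsets)
    printf("block_offset=%llu -> filter_index=%llu\n",
           (unsigned long long)off, (unsigned long long)(off / kFilterBase));
  // e.g. a data block starting at offset 5000 falls under filter index 2.
  return 0;
}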

The Bloom filter implementation in LevelDB

We start from include/leveldb/filter_policy.h, which defines the FilterPolicy base class and provides the interface.

class LEVELDB_EXPORT FilterPolicy {
 public:
  virtual ~FilterPolicy();

  virtual const char* Name() const = 0;

  virtual void CreateFilter(const Slice* keys, int n,
                            std::string* dst) const = 0;

  virtual bool KeyMayMatch(const Slice& key, const Slice& filter) const = 0;
};
...
LEVELDB_EXPORT const FilterPolicy* NewBloomFilterPolicy(int bits_per_key);

}  // namespace leveldb

For example, a user can define a custom filter policy like this:

class CustomFilterPolicy : public leveldb::FilterPolicy {
 private:
  FilterPolicy* builtin_policy_;

 public:
  CustomFilterPolicy() : builtin_policy_(NewBloomFilterPolicy(10)) {}
  ~CustomFilterPolicy() { delete builtin_policy_; }

  const char* Name() const { return "IgnoreTrailingSpacesFilter"; }

  void CreateFilter(const Slice* keys, int n, std::string* dst) const {
    // Use builtin bloom filter code after removing trailing spaces
    std::vector<Slice> trimmed(n);
    for (int i = 0; i < n; i++) {
      trimmed[i] = RemoveTrailingSpaces(keys[i]);
    }
    builtin_policy_->CreateFilter(&trimmed[0], n, dst);
  }

  bool KeyMayMatch(const Slice& key, const Slice& filter) const {
    // Use builtin bloom filter code after removing trailing spaces
    return builtin_policy_->KeyMayMatch(RemoveTrailingSpaces(key), filter);
  }
};
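
To actually use a policy, it is handed to the DB through Options::filter_policy when opening the database; the DB does not take ownership, so the caller frees it afterwards. A short sketch with the built-in policy (a custom policy such as the one above is attached the same way):

leveldb::Options options;
options.create_if_missing = true;
options.filter_policy = leveldb::NewBloomFilterPolicy(10);

leveldb::DB* db;
leveldb::Status s = leveldb::DB::Open(options, "/tmp/testdb", &db);
assert(s.ok());
// ... use the DB; reads now consult the per-file filters ...
delete db;
delete options.filter_policy;  // caller owns the policy object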

LevelDB's built-in Bloom filter implementation is the BloomFilterPolicy class in util/bloom.cc. Its CreateFilter member is the definition of the function called when building the filter block above:

class BloomFilterPolicy : public FilterPolicy {
 public:
  explicit BloomFilterPolicy(int bits_per_key) : bits_per_key_(bits_per_key) {
    // We intentionally round down to reduce probing cost a little bit
    k_ = static_cast<size_t>(bits_per_key * 0.69);  // 0.69 =~ ln(2)
    if (k_ < 1) k_ = 1;
    if (k_ > 30) k_ = 30;
  }

  virtual const char* Name() const { return "leveldb.BuiltinBloomFilter2"; }

  virtual void CreateFilter(const Slice* keys, int n, std::string* dst) const {
    // Compute bloom filter size (in both bits and bytes)
    size_t bits = n * bits_per_key_;

    // For small n, we can see a very high false positive rate.  Fix it
    // by enforcing a minimum bloom filter length.
    if (bits < 64) bits = 64;

    size_t bytes = (bits + 7) / 8;
    bits = bytes * 8;

    const size_t init_size = dst->size();
    dst->resize(init_size + bytes, 0);
    dst->push_back(static_cast<char>(k_));  // Remember # of probes in filter
    char* array = &(*dst)[init_size];
    for (int i = 0; i < n; i++) {
      // Use double-hashing to generate a sequence of hash values.
      // See analysis in [Kirsch,Mitzenmacher 2006].
      uint32_t h = BloomHash(keys[i]);
      const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
      for (size_t j = 0; j < k_; j++) {
        const uint32_t bitpos = h % bits;
        array[bitpos / 8] |= (1 << (bitpos % 8));
        h += delta;
      }
    }
  }

  virtual bool KeyMayMatch(const Slice& key, const Slice& bloom_filter) const {
    const size_t len = bloom_filter.size();
    if (len < 2) return false;

    const char* array = bloom_filter.data();
    const size_t bits = (len - 1) * 8;

    // Use the encoded k so that we can read filters generated by
    // bloom filters created using different parameters.
    const size_t k = array[len - 1];
    if (k > 30) {
      // Reserved for potentially new encodings for short bloom filters.
      // Consider it a match.
      return true;
    }

    uint32_t h = BloomHash(key);
    const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
    for (size_t j = 0; j < k; j++) {
      const uint32_t bitpos = h % bits;
      if ((array[bitpos / 8] & (1 << (bitpos % 8))) == 0) return false;
      h += delta;
    }
    return true;
  }

 private:
  size_t bits_per_key_;
  size_t k_;
};
}  // namespace

const FilterPolicy* NewBloomFilterPolicy(int bits_per_key) {
  return new BloomFilterPolicy(bits_per_key);
}

}  // namespace leveldb

A few things can be observed from the code above:

  1. NewBloomFilterPolicy is the factory function for the built-in Bloom filter policy. Its caller passes in a bits_per_key argument; as the constructor shows, this number is $m/n$, and it determines $k$. LevelDB also clamps the computed $k$ to the range 1~30 to bound the cost of evaluating hash functions every time a filter is built.
  2. The loop over keys in CreateFilter feeds each key into a hash function to obtain its fingerprint. At first glance only one hash function (BloomHash) seems to be used; this is where double hashing comes in: the first hash function is BloomHash and the second is delta. See Less Hashing, Same Performance: Building a Better Bloom Filter.
  3. The dst->push_back(static_cast<char>(k_)) line stores this filter's number of hash functions $k$ at the end of the filter (1 byte), for use at query time.
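
For instance, with the default bits_per_key = 10 (i.e. $m/n = 10$), the constructor computes $k = \lfloor 10 \times 0.69 \rfloor = 6$, so building the filter sets 6 bits per key and a lookup probes 6 bit positions.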

A BloomFilterPolicy is constructed with a single bits_per_key argument. If $k$ is fixed for a given policy, why does each filter need to store $k$ at all?

Experiment: comparing the false positive probability and memory overhead of LevelDB's built-in Bloom filter and the Xor8 filter

As mentioned above, LevelDB already bounds the time spent computing hash functions. Below we focus on comparing the memory overhead and FPP (false positive probability) of LevelDB's Bloom filter and the xor filter.

Experimental group: Xor8 filter
Control group: LevelDB Bloom filter with bits_per_key = 10 (this is LevelDB's recommended default, and it of course affects the memory overhead)
Test program: the VaryingLengths TEST in util/bloom_test. The test first adds length consecutive keys, then measures the false positive probability by querying 10,000 keys that were never added. A run is judged good if its false positive rate is below 1.25%, and mediocre otherwise. A sketch of this measurement idea follows below.
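
A minimal sketch of the measurement (my own illustration, not the actual bloom_test.cc code; the key encoding and policy setup are simplified):

#include <string>
#include <vector>
#include "leveldb/filter_policy.h"
#include "leveldb/slice.h"

// Build a filter over `length` keys, then query 10000 keys that were never
// inserted and count how many the filter wrongly reports as present.
static double MeasureFPP(const leveldb::FilterPolicy* policy, int length) {
  std::vector<std::string> stored;
  stored.reserve(length);  // keep the Slice pointers below valid
  std::vector<leveldb::Slice> keys;
  for (int i = 0; i < length; i++) {
    stored.push_back(std::to_string(i));
    keys.push_back(leveldb::Slice(stored.back()));
  }
  std::string filter;
  policy->CreateFilter(keys.data(), length, &filter);

  int false_positives = 0;
  for (int i = 0; i < 10000; i++) {
    std::string probe = std::to_string(1000000000 + i);  // never added
    if (policy->KeyMayMatch(leveldb::Slice(probe), leveldb::Slice(filter)))
      false_positives++;
  }
  return false_positives / 100.0;  // as a percentage
}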

As length grows, the Xor filter keeps its FPP around 0.4%, and it ends up passing all 37 test cases as good.

In addition, LevelDB enforces a minimum filter size of 8 bytes, because an overly small filter would give a high FPP. That is why the first few iterations all show 9 bytes (8 bytes plus the 1-byte $k$). The Xor8 size follows the formula in the paper, which also notes that an Xor+ filter can reduce the wasted space further. By the time length reaches 2000, the Xor filter's memory overhead is already smaller than the Bloom filter's.
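
Assuming the reference Xor8 construction from the paper, a rough sizing rule consistent with the numbers below is $\text{bytes} \approx 1.23\,n + c$ for a small constant $c$ (one 8-bit fingerprint per slot), versus $\lceil 10n/8 \rceil + 1 \approx 1.25\,n + 1$ bytes for the Bloom filter with bits_per_key = 10; the Bloom filter additionally enforces the 64-bit minimum mentioned above, which dominates at small $n$.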

[TODO: turn this into charts]

==== Test BloomTest.VaryingLengths

 @ length =      1 
False positives of bloom :  0.23% ; bytes =      9
False positives of xor :  0.48%  ; bytes =     33

 @ length =      2 
False positives of bloom :  0.44% ; bytes =      9
False positives of xor :  0.48%  ; bytes =     33

 @ length =      3 
False positives of bloom :  0.75% ; bytes =      9
False positives of xor :  0.53%  ; bytes =     33

 @ length =      4 
False positives of bloom :  1.08% ; bytes =      9
False positives of xor :  0.46%  ; bytes =     36

 @ length =      5 
False positives of bloom :  1.20% ; bytes =      9
False positives of xor :  0.43%  ; bytes =     36

 @ length =      6 
False positives of bloom :  1.59% ; bytes =      9
False positives of xor :  0.36%  ; bytes =     39

 @ length =      7 
False positives of bloom :  1.53% ; bytes =     10
False positives of xor :  0.41%  ; bytes =     39

 @ length =      8 
False positives of bloom :  1.81% ; bytes =     11
False positives of xor :  0.44%  ; bytes =     39

 @ length =      9 
False positives of bloom :  0.79% ; bytes =     13
False positives of xor :  0.43%  ; bytes =     42

 @ length =     10 
False positives of bloom :  1.63% ; bytes =     14
False positives of xor :  0.36%  ; bytes =     42

 @ length =     20 
False positives of bloom :  1.24% ; bytes =     26
False positives of xor :  0.37%  ; bytes =     54

 @ length =     30 
False positives of bloom :  0.84% ; bytes =     39
False positives of xor :  0.43%  ; bytes =     66

 @ length =     40 
False positives of bloom :  1.07% ; bytes =     51
False positives of xor :  0.37%  ; bytes =     81

 @ length =     50 
False positives of bloom :  1.09% ; bytes =     64
False positives of xor :  0.41%  ; bytes =     93

 @ length =     60 
False positives of bloom :  1.12% ; bytes =     76
False positives of xor :  0.36%  ; bytes =    105

 @ length =     70 
False positives of bloom :  0.93% ; bytes =     89
False positives of xor :  0.41%  ; bytes =    117

 @ length =     80 
False positives of bloom :  1.16% ; bytes =    101
False positives of xor :  0.33%  ; bytes =    129

 @ length =     90 
False positives of bloom :  1.07% ; bytes =    114
False positives of xor :  0.36%  ; bytes =    141

 @ length =    100 
False positives of bloom :  0.83% ; bytes =    126
False positives of xor :  0.39%  ; bytes =    153

 @ length =    200 
False positives of bloom :  0.96% ; bytes =    251
False positives of xor :  0.38%  ; bytes =    276

 @ length =    300 
False positives of bloom :  0.77% ; bytes =    376
False positives of xor :  0.37%  ; bytes =    399

 @ length =    400 
False positives of bloom :  0.81% ; bytes =    501
False positives of xor :  0.31%  ; bytes =    522

 @ length =    500 
False positives of bloom :  0.74% ; bytes =    626
False positives of xor :  0.37%  ; bytes =    645

 @ length =    600 
False positives of bloom :  0.78% ; bytes =    751
False positives of xor :  0.41%  ; bytes =    768

 @ length =    700 
False positives of bloom :  0.91% ; bytes =    876
False positives of xor :  0.44%  ; bytes =    891

 @ length =    800 
False positives of bloom :  0.88% ; bytes =   1001
False positives of xor :  0.43%  ; bytes =   1014

 @ length =    900 
False positives of bloom :  0.97% ; bytes =   1126
False positives of xor :  0.52%  ; bytes =   1137

 @ length =   1000 
False positives of bloom :  0.90% ; bytes =   1251
False positives of xor :  0.50%  ; bytes =   1260

 @ length =   2000 
False positives of bloom :  0.89% ; bytes =   2501
False positives of xor :  0.33%  ; bytes =   2490

 @ length =   3000 
False positives of bloom :  0.95% ; bytes =   3751
False positives of xor :  0.32%  ; bytes =   3720

 @ length =   4000 
False positives of bloom :  1.01% ; bytes =   5001
False positives of xor :  0.44%  ; bytes =   4950

 @ length =   5000 
False positives of bloom :  0.89% ; bytes =   6251
False positives of xor :  0.28%  ; bytes =   6180

 @ length =   6000 
False positives of bloom :  1.03% ; bytes =   7501
False positives of xor :  0.43%  ; bytes =   7410

 @ length =   7000 
False positives of bloom :  0.78% ; bytes =   8751
False positives of xor :  0.34%  ; bytes =   8640

 @ length =   8000 
False positives of bloom :  1.09% ; bytes =  10001
False positives of xor :  0.47%  ; bytes =   9870

 @ length =   9000 
False positives of bloom :  1.09% ; bytes =  11251
False positives of xor :  0.47%  ; bytes =  11100

 @ length =  10000 
False positives of bloom :  0.81% ; bytes =  12501
False positives of xor :  0.40%  ; bytes =  12330

<test result>
bloom : 33 good, 4 mediocre
xor : 37 good, 0 mediocre

Discussion: TODO


Skip List

The skip list is used to implement LevelDB's memtable; it is also used to implement the sorted set data structure in Redis.
Code: leveldb/db/skiplist.h
Possible improvements:

Modifying Random

--- a/db/skiplist.h
+++ b/db/skiplist.h
@@ -32,10 +32,22 @@
 #include <cstdlib>
 
 #include "util/arena.h"
-#include "util/random.h"
 
 namespace leveldb {
 
+struct rand_internal {
+  uint16_t s;
+};
+
+/* 16-bit xorshift PRNG */
+static inline uint16_t Rand(struct rand_internal* x) {
+  uint16_t xs = x->s;
+  xs ^= xs << 7;
+  xs ^= xs >> 9;
+  xs ^= xs << 8;
+  return (x->s = xs);
+}
+
 class Arena;
 
 template <typename Key, class Comparator>
@@ -138,7 +150,7 @@ class SkipList {
   std::atomic<int> max_height_;  // Height of the entire list
 
   // Read/written only by Insert().
-  Random rnd_;
+  struct rand_internal rnd_;
 };
 
 // Implementation details follow
@@ -243,7 +255,7 @@ int SkipList<Key, Comparator>::RandomHeight() {
   // Increase height with probability 1 in kBranching
   static const unsigned int kBranching = 4;
   int height = 1;
-  while (height < kMaxHeight && ((rnd_.Next() % kBranching) == 0)) {
+  while (height < kMaxHeight && (Rand(&rnd_) % kBranching) == 0) {
     height++;
   }
   assert(height > 0);
@@ -326,8 +338,10 @@ SkipList<Key, Comparator>::SkipList(Comparator cmp, Arena* arena)
     : compare_(cmp),
       arena_(arena),
       head_(NewNode(0 /* any key will do */, kMaxHeight)),
-      max_height_(1),
-      rnd_(0xdeadbeef) {
+      max_height_(1) {
+  static const uint32_t A = 16807;  // bits 14, 8, 7, 5, 2, 1, 0
+  rnd_.s = (uintptr_t)&rnd_ * A;
+  Rand(&rnd_);
   for (int i = 0; i < kMaxHeight; i++) {
     head_->SetNext(i, nullptr);
   }

Since kBranching is so small (4), one could even avoid using a PRNG here.
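
Related to this remark: RandomHeight only needs a geometric(1/4) distribution capped at kMaxHeight, so the per-level PRNG call can at least be collapsed into a single random draw. A sketch of that idea (my own illustration, not LevelDB's code):

#include <cstdint>

// Derive a skip-list node height from one 32-bit random value.  Each 2-bit
// group equals 0 with probability 1/4, matching (rnd_.Next() % kBranching) == 0
// in the original RandomHeight() loop, so the height follows the same
// geometric(1/4) distribution, capped at kMaxHeight (<= 16 here).
static inline int RandomHeightFromWord(uint32_t r, int kMaxHeight) {
  int height = 1;
  while (height < kMaxHeight && (r & 0x3) == 0) {
    r >>= 2;
    height++;
  }
  return height;
}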


The article Skip Lists: Done Right points out:

"In the past five years, people have become increasingly sceptical of skip lists’ performance, due to their poor cache behavior when compared to e.g. B-trees"

and, motivated by this, proposes several skip list design techniques that improve cache effectiveness.

Pointer chasing causes many TLB and cache misses at run time: the processor must execute long chains of "irregular memory access" instructions, which greatly reduces cache effectiveness and can even pollute the cache.

Besides mitigating this problem through software design, there are also hardware-level approaches, e.g. Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation.

To observe the cache behavior of LevelDB's original skip list implementation, I use performance events to profile db_bench, running --benchmarks="fillrandom,readseq".

  • Environment
LevelDB:    version 1.22
Date:       Wed Jan 20 12:07:14 2021
CPU:        8 * Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
CPUCache:   8192 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
WARNING: Snappy compression is not enabled

First, use perf stat to observe the cache misses and dTLB misses while running this benchmark:

$ sudo perf stat --repeat 10 -e dTLB-loads:u,dTLB-load-misses:u,cache-misses:u,cache-references:u \
     ./db_bench --benchmarks="fillrandom,readseq"
  • Original Random implementation
Performance counter stats for './db_bench --benchmarks=fillrandom,readseq' (10 runs):

      40,1930,6118      dTLB-loads                                                    ( +-  0.91% )
           67,1019      dTLB-load-misses          #    0.02% of all dTLB cache hits   ( +-  1.62% )
         3404,4704      cache-misses:u            #   13.366 % of all cache refs      ( +-  4.47% )
       2,5470,5111      cache-references:u                                            ( +-  0.73% )

             4.594 +- 0.234 seconds time elapsed  ( +-  5.08% )

Next, use perf record to find the hot spots of the original implementation:

$ sudo perf record -e dTLB-loads:u,dTLB-load-misses:u,cache-misses:u,cache-references:u \
     ./db_bench --benchmarks="fillrandom,readseq"

This produces the following benchmark output:

LevelDB:    version 1.22
Date:       Wed Jan 20 12:28:38 2021
CPU:        8 * Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
CPUCache:   8192 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
WARNING: Snappy compression is not enabled
-------------------------------------------------
fillrandom   :       6.808 micros/op;   16.2 MB/s     
readseq      :       0.187 micros/op;  590.6 MB/s 

The perf report for the cache-misses event:

Samples: 14K of event 'cache-misses:u', Event count (approx.): 27402381
Overhead  Command   Shared Object        Symbol
  16.79%  db_bench  db_bench             [.] leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert                             ◆
   9.94%  db_bench  libc-2.31.so         [.] __memcmp_avx2_movbe                                                                                  ▒
   6.90%  db_bench  db_bench             [.] leveldb::InternalKeyComparator::Compare                                                              ▒
   6.10%  db_bench  db_bench             [.] leveldb::MemTableIterator::key                                                                       ▒
   5.86%  db_bench  db_bench             [.] leveldb::(anonymous namespace)::MergingIterator::Next                                                ▒
   5.81%  db_bench  libc-2.31.so         [.] __memmove_avx_unaligned_erms                                                                         ▒
...

The perf report for the dTLB-load-misses event:

Samples: 3K of event 'dTLB-load-misses', Event count (approx.): 586744
Overhead  Command   Shared Object        Symbol
  28.60%  db_bench  db_bench             [.] leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert
  22.71%  db_bench  db_bench             [.] leveldb::MemTableIterator::key
  12.84%  db_bench  db_bench             [.] leveldb::ReadBlock

The function leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert is a hot spot in both miss statistics.

Next, annotate the symbol leveldb::SkipList<char const*, leveldb::MemTable::KeyComparator>::Insert to see where inside the function the misses occur.
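
One possible way to do this: open the recorded profile with perf report and press a on the selected symbol to annotate it, or dump the annotated disassembly for all symbols directly (the exact workflow depends on the perf TUI version):

$ sudo perf report
$ sudo perf annotate --stdio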