2020_OS_Fall_HW3 Key Value Storage

學號： F74076027
姓名：林政傑
系級：資訊 111

2020_OS_Fall_HW3 Key Value Storage

開發環境：

OS: Ubuntu 20.04.1
CPU: Intel® Core™ i7-10750H @ 2.60GHz × 4 (VirtualBox)
Ｍemory: 4 GB
Programming Language: g++ (Ubuntu 9.3.0-10ubuntu2) 9.3.0

程式執行時間

$ time ./F74076027 ./1.input

real	0m0.001s
user	0m0.002s
sys	0m0.000s

程式開發與使用說明

使用說明: 編譯

在專案目錄輸入 make 即可編譯:

$ make
g++ main.cpp -o F74076027 -lpthread

程式開發: 模組

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

上圖為程式的架構圖，首先介紹各個資料結構:

bloom filter
其中有長度為 131072 bytes 的字串，作為過濾器，參考 jserv dict 的 bloom filter 實作，使用 djb2 和 jenkins 作為 hash function ，在記錄 100000 個 key 的情況下， false positive 的機率約為 3% 。
skip-list
為 4 個 level 的 linked-list ，每個節點的 level 數是隨機的，相同 level 的指標只會指向下一個擁有大於等於自己 level 數的節點，而且 key 是按照大小排序的。搜尋 key 時，從最高的 level 慢慢往下搜尋，複雜度為
$O (l o g n)$ 。以下為插入元素的示意圖。
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
圖片來源: https://blog.csdn.net/guangyacyb/article/details/88183219
sorted string table (sstable)
由 skip-list 刷入硬碟，裡面的 key-value 順序是按照 key 的大小來排列，並且每組 key-value 都是 136 bytes ，所以可以直接使用 binary search 在檔案中搜尋 key 值和讀寫對應的 value。每個 sstable (page) 最多會有 100000 組 key-value ，存滿 100000 組 key-value 後便不再改動 key 的值。
metadata
主要記錄每個 sstable 的 bloomfilter 和 sstable 頭尾的 key 值，這些資訊可以幫助程式快速判斷是否需要搜尋此檔案 (page) ，當程式啟動時，會將所有 metadata 拉進記憶體，程式結束時再把更新後的 metadata 刷回硬碟。

程式開發: 流程

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

執行程式時，會先在 storage 目錄中尋找是否有 metadata 。若有，則代表資料庫內已經有東西，將 metadata 讀入記憶體，編號最大的 page 也會以 skip-list 的形式載入記憶體; 若無，則生成空的 metadata 和 skip-list。
接下來開啟 1.input 讀取指令:
- put
  先在 skip-list 中搜尋是否有該 key 值，再一個個比對 page 的 bloom filter ，如果 bloom filter 表明該 key 值存在，就去 page 裡搜尋，因為 page 是以 sstable 儲存的，所以使用 mmap 加 binary search 搜尋，最多只要搜尋
  $l o g_{2} 100000 = 17$ 次就能找到 key 值並複寫 value 。如果很不巧完全搜尋不到，就在 skip-list 中新增一個節點儲存 key-value。
  完成 put 之後，如果 skip-list 的節點數到達 100000 ，則將 skip-list 刷到硬碟，成為一個新的 page ，該 page 的 metadata 也會加進 metadata 的清單。
- get
  搜尋方法跟 put 一模一樣，但是不會複寫 sstable 或新增節點，搜尋結束後將字串丟入 print subroutine (負責寫入 1.output 的執行緒) 的 deque 的尾部。 print subroutine 持續從 deque 的頭部讀取字串、輸出、釋放。
- scan
  一直執行 get 指令的迴圈
重複執行 2. 直到讀取完所有指令。當讀取完所有指令後，將新的 metadata 覆蓋掉硬碟中的 metadata 、將 skip-list 存成最後一個 page、等待 print subroutine 印完 1.output ，程式結束。

效能分析報告

時間分析方法

使用 time.h 的 clock_gettime() 紀錄 wall-clock time

效能分析 1: PUT 的時間

首先分析連續 PUT 2500 萬個不重複的 key 的耗時 (實驗時是使用 0~24999999)，測試方法為產生一個 input 檔，裡面有 2500 萬個 PUT 並且每個 PUT 的 key 值都不相同，每 PUT 10 萬筆 key-value 記錄一次時間，測試硬體為 i7-10750H 和 M.2 SSD 。 PUT 完 2500 萬個 non-repeating key 總共耗時 2601.737108 秒，平均每 PUT 10 萬筆 key 需耗時 10.4 秒，並且 PUT 消耗的時間接近一次函數，如下圖藍色線。

接下來我將相同的 input 檔再次餵給 DataBase ，由於前一次實驗已經將 key-value 存進硬碟了，這次實驗不會再消耗更多儲存空間，而是透過 binary search 在 sstable 中尋找相同的 key ，然後把 value 複寫到硬碟中。 PUT 完 2500 萬個 key 總共耗時 1168.494940 秒，平均每 PUT 10 萬筆 key 需耗時 4.67 秒，並且 PUT 消耗的時間接近一次函數，如下圖紅色線。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

若 PUT 的 key 值為亂數，所消耗的時間理應在紅色和藍色線之間。

由此可知在我的實作中， PUT 已經存在的 key 要比 PUT 新的 key 還要快

效能分析 2: GET 的時間

沿用前面 PUT 實驗的 DataBase ，首先 GET DataBase 中的每個 key (0~24999999) ，花費時間為下圖藍色線，總共需要 877.194100 秒，平均每 GET 10 萬個 key 花費 3.5 秒。

接下來 GET 2500 萬個完全沒出現在 DataBase 的 key ，如下圖紅色線，花費時間為 22.762612 秒，平均每 GET 10 萬個 key 花費 0.09 秒，會如此快速大概是因為 bloom filter 以

O (1)

快速過濾不存在的 key 的關係。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

bloom filter 大幅加速不存在的 key 的搜尋時間

因為 SCAN 是由 for 迴圈執行的 GET ，所以就不分析了。

效能分析 3: 硬碟空間使用量

每個 sstable 通常儲存 10 萬筆 key-value ，並且會搭配一份 metadata ， metadata 中紀錄了 sstable 的首尾 key 值、key 的數量 (int) 以及一個 bloomfilter 。假設有

N

組 key-value ，計算 DataBase 會使用多少硬碟空間:

key 8 byte 、 value 128 byte ，所以 sstable 會消耗
$N * 136$ byte
首尾 key 共 16 byte 、 key 的數量 4 byte 、 bloomfilter 131072 byte ，但是每 100000 筆 key 才會新增一組 metadata ，所以使用空間為
$(⌊ N / 100000 ⌋ + 1) * 131092$ byte
sstable 和 metadata 加總起來即是占用硬碟空間的量

總的來說，若是存入 DataBase 的 key-value 太少，會相對浪費空間，因為 metadata 的 bloomfilter 占用很大的空間；反之，若是存入夠多的 key-value 則能稍微減少 metadata 佔用空間的比率。

每存入 13.6 MB (10 萬個 key-value) ，便會增加一個 131.092 KB 的 metadata ，

(13.6 M B + 131.092 K B) / 13.6 M B \approx 1.0096

，和原始資料的大小相比，大約有 0.9% 的增長。由下圖可以看出: 當 key 值越多， DataBase 和原始資料所占用的空間的比值越來越小，到 1 萬個 key-value 時， DataBase 較原始資料約要多消耗 10% 的空間

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

儲存越多 key-value 對硬碟的使用越有效率

效能分析 4: PUT 效能分析

在前面 "分析 PUT 的時間" 中可以發現 PUT 新的 key-value 到 DataBase 會比 PUT 已經存在 DataBase 中的 key-value 還要慢很多，這很不合理，照理說記憶體內的搜尋會比硬碟搜尋快很多。於是我進一步分析單次 PUT 所消耗的時間，我把 PUT 分為 3 種:

PUT New: 該 key-value 不在 skip-list 也不在 sstable 中
PUT Disk: 該 key-value 在 sstable 中
PUT Mem: 該 key-value 在 skip-list 中

對於這三種形式的 PUT 分別測試 10 萬次，擷取每次 PUT 指令的延遲時間:

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

由上圖可以發現 skip-list 中，結點越多，PUT 的效能越低，甚至在超過 2 萬個 key 時，PUT 的效能會低於使用 binary search 的 sstable (也就是紅色點的 PUT Disk) ，照理來說 PUT Disk 所花的時間會大於 PUT 到記憶體，甚至在超過 8 萬個 key 時，PUT New 和 PUT Mem 的時間會超過 250 微秒，這很沒效率，反觀 PUT Disk 消耗的時間通常少於 100 微秒。

我猜測會有這種現象是因為我 skip-list 生成 level 的機率太高了，生成 level 1 的機率為 1、生成 level 2 的機率為 1/2 ，生成 level 3 為 1/4 ，生成 level 4 為 1/8 。當 key 值過多時，也會生成過多 level 4 ，降低了搜尋效率。

接下來我不斷嘗試調整新結點生成 level 的機率，最後被我調整為: 生成 level 1 的機率為 1、生成 level 2 為 1/16、生成 level 3 為 1/128、生成 level 4 為 1/2048，搜尋的效能變為:

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

由上圖可以發現 PUT New 和 PUT Mem 的效能獲得巨大改善，除了一些離群值，消耗時間幾乎都小於 1 微秒 (圖中的 Y 軸已無法標示)， PUT New 稍微慢一點，因為需要配置新的結點；而 PUT Disk 消耗時間幾乎都介於 35 ~ 100 微秒。 PUT New 和 PUT Mem 已經比 PUT Disk 快非常多。

接下來我用效能分析 1 的方法測試時間，一樣連續 PUT 2500 萬 (0 ~ 2499999) 筆 key-value:

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

可以看見第一次 PUT (藍色線) 從 2601 秒進步為 51 秒，平均每 PUT 10 萬筆 key 需耗時 0.204 秒，加速了約 50 倍，而再次 PUT (紅色線) 所花的時間就和效能分析 1 差不多，因為其需要在硬碟中做 binary search ，效率比優化過後的記憶體 skip-list 差。

適當調整 skip-list 產生 level 的機率，大幅提高記憶體搜尋的速度

優化程式架構

雖然記憶體內搜尋的效能提升很多，但是整個程式對記憶體的使用太少，最多只會有不到 10 萬個 key-value 會同時存在記憶體中的 skip-list ，相當於 13.6 MB 的資料。

一般情況下，扣掉系統占用，至少有 2.5 G 的記憶體可供我使用，只利用 13.6 MB 實在太吝嗇了，為了提升記憶體使用的效率，我微幅修改了程式架構如下:

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

我定義了一個新的物件叫 Cache ，他可以儲存並操作 sstable ，再來在 metadata 新增一個布林值 in_cache ，代表該 page 是否存在 Cache 中。每當有 PUT、GET 存取硬碟的某個 page 時，我便會將該 page 拉進 cache ，這樣下次存取同一個 page 的東西時，就不用再從硬碟內搜尋。當 Cache 中的 page 數量到達上限時 (我的設定為 130 個 page ，也就是 1300 萬組 key-value)，我會將最早被拉進去的 page 刷回硬碟。

此外，我刪除了 print subroutine ，因為為了輸出 output 而開執行緒其實沒有加速，反而容易產生錯誤。

加入 Cache (非 CPU 的快取) 機制

首先，將 page 拉入 cache 或是刷回硬碟都有一個 overhead ，需要花費約 0.017 至 0.03 秒將一個 page 拉入或刷出 cache ，這個機制對於過於分散的 get 非常不好，因為會造成過於頻繁的 page 操作，以下是 page 操作花費時間的分布圖:

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

以效能分析 4 的 PUT Performance 來看，一次 PUT Disk 的時間為 35 ~ 100 微秒 (我後來計算平均約為 47 微秒)，而在 Cache 中來回 Load 和 Flush 一個 page 平均要 0.07 秒，由此可以計算:

將一個 page 拉入 Cache 再放回硬碟的成本約等於

\frac{0.07 * 1000000}{47} = 1489

個 PUT Disk 的時間，算下來這個 overhead 非常龐大。換句話說，我需要對同一個 page 操作 1489 次，才有將其拉入 Cache 的價值。

顯然這個機制對過於分散的 get 非常不好，因為會造成頻繁的 page 刷進刷出，而且對個別 page 的利用度也不足，但是對於連續 GET 或是超級長的 SCAN 有爆炸性的改善。以下是我再度對 PUT 和 SCAN 的測試:

加入 Cache 機制後的效能分析 1

測試連續 PUT 2500 萬個不重複的 key 的耗時 (實驗時是使用 0~24999999)，測試方法為產生一個 input 檔，裡面有 2500 萬個 PUT 並且每個 PUT 的 key 值都不相同，每 PUT 10 萬筆 key-value 記錄一次時間。 PUT 完 2500 萬個 non-repeating key 總共耗時 51.135623 秒，平均每 PUT 10 萬筆 key 需耗時 0.204 秒，如下圖藍色線。

接下來我將相同的 input 檔再次餵給 DataBase ，由於前一次實驗已經將 key-value 存進硬碟了，這次實驗不會再消耗更多儲存空間，而是將硬碟中的 page 拉入 Cache ，透過 binary search 在 sstable 中尋找相同的 key ，然後把 value 複寫到 Cache 中，稍後再將 page 刷回硬碟。 PUT 完 2500 萬個 key 總共耗時 33.908611 秒，比效能分析 1 中的 1168 秒快了約 35 倍，平均每 PUT 10 萬筆 key 需耗時 0.136 秒，如下圖紅色線。

下圖折線最後會有一小段鉛直線為 Cache 將 130 個 page 刷回硬碟所花費的時間，約莫 2.5 到 3 秒。

加入 Cache 機制後的效能分析 2

沿用前面 PUT 實驗的 DataBase ，首先 GET DataBase 中的每個 key (0~24999999) ，花費時間為下圖藍色線，總共需要 28.717777 秒，比 效能分析 2 的 877 秒快了約莫 30 倍，平均每 GET 10 萬個 key 花費 0.114 秒。

接下來 GET 2500 萬個完全沒出現在 DataBase 的 key ，如下圖紅色線，花費時間為 20.373579 秒，平均每 GET 10 萬個 key 花費 0.081 秒。

加入 Cache 機制後的記憶體使用狀況

可以發現當 Cache 塞滿時，會占用 2 GB 左右的記憶體。

結論

在已排序好的 key 中， binary search 會比循序搜尋快得多
儲存越多 key-value 對硬碟的使用越有效率， 1 萬組 key-value 時， DataBase 較原始資料約要多消耗 10% 的空間；超過 10 萬組則為 0.9%。
skip-list 藉由 high level 的躍遷達到加速的效果，這意味著 high level 出現頻率太高會拖慢搜尋速度，於是我將 level 4 生成的機率大幅降低 (1/2048)，使搜尋速度優化了 50 倍。
增加 Cache 機制雖然可以大幅加速存取連續的 key ，但是因為其龐大的 overhead ，使得少量或是分散存取 key 的效率變得極差，約莫要在同一個 page 中操作 1489 次才能勉強打平將 page 拉進 Cahe 的 overhead。

2020_OS_Fall_HW3 Key Value Storage

開發環境：

程式執行時間

程式開發與使用說明

使用說明: 編譯

程式開發: 模組

程式開發: 流程

效能分析報告

時間分析方法

效能分析 1: PUT 的時間

效能分析 2: GET 的時間

效能分析 3: 硬碟空間使用量

效能分析 4: PUT 效能分析

優化程式架構

加入 Cache (非 CPU 的快取) 機制

加入 Cache 機制後的效能分析 1

加入 Cache 機制後的效能分析 2

加入 Cache 機制後的記憶體使用狀況

結論

Read more

2022q1 Homework3 (fibdrv)

學習 Ansible

分析 .NET MVC 官方教學 MvcMovie

SoK 相關調查（由 ChatGPT 統整）