Level DB

Just some writeupunfinished
Edits are welcome. :pencil2:

LevelDB Write

leveldb::DB::Put and leveldb::DB::Delete are implemented in leveldb::DBImpl::Put and leveldb::DBImpl::Delete

兩者皆利用 DBImpl::Write() 進行操作,需要傳入 const WriteOptions&WriteBatch*

首先先設置好 Writer 的 batch 與 sync/async 之後並推入 writers_ 佇列。如果 writer 狀態未完成是第一個 writer 則 MakeRoomForWriter() 並在 log 中加入實際要寫入的 record (batch),其餘則等待 (condition variable wait) 並且完成後回傳狀態。

MakeRoomForWriter() 的用意是要確保 memory table 有足夠的空間可以寫入,包括檢查 Level 0 是否有太多檔案。LevelDB 採階層式寫入,每一層的空間都比上一層還要大,最後一層即 disk,此種設計是為了達到 error recovery 功能、concurrent operations 以及批次寫入。上面所提到的 batch 即是指需要一次寫入的更新,每次寫入一個 batch size 的資料犧牲了隨機讀寫的能力,但是相當符合硬碟循序讀寫表現較好的特性。

LevelDB Read

leveldb::DB::Get is implemented in leveldb::DBImpl::Get

如同其他經典 concurrent 資料結構的設計,這邊一樣採用先取得 snapshot 再讀取的策略,保證讀取當下資料不會被變更,保持一致性。取得當下的 memory table 之後,使用 keysnapshot 組成一個 lkey ,和 status 一併傳入 get method,最後將結果紀錄於 status 回傳。

WIP

The query, insertion and iteration of skip list

Not done yet QQ


How LSM works.

LSM is a hierarchical, orderly, disk-oriented data structure. It sacrifies the partial read efficiency for sqential writes efficiency, comparing to random writes. The key idea is to read/write the same size of a chunk of data at once, to reduce random read/write, based on the characteristics of the disk. We put latest data in memory first and use merge sort to sort the data when it has accumulated enough. Append the result to the disk at the end.

How to perform read and write data flows in a DB engine based on LSM.

Read

  • The latest data is on C0. Check C0 first. Check lower layers when data not found.
  • Use page cache and bloom filter to speed up the existance check in lower layer.

Write

  • Put data in Write Ahead Log for failure recovery.
  • Append data to the C0 layer in memory
  • Compact C0 and C1 with merge sort when C0 reaches certain size. Do the same on Ci and Ci+1 layer. Old data can be deleted when merge is done.
  • Data reaches the lowest layer, which is the disk.