SLM-DB: Single-Level Key-Value Store with Persistent Memory
===
###### tags: `usenix` `fast`
# Problems
* B-tree based KV-stores : show poor write performance
* incur multiple small random writes
* dynamically maintaining a balanced structure -> high write amplification
* more suitable for read-intensive workloads
* LSM-tree based KV-stores : face the challenging issues of high write and read amplification
* LSM-tree is organized with multiple levels of files
* KV pairs are merge-sorted multiple times
---
Introduction to PM
* PM is expected to coexist with disks such as HDDs and SSDs
* However, a new KV-store design for a hybrid system of PM and disks still needs to be explored
* PM carries a role that is more than just a large memory write buffer or read cache
---
# Problems of LSM-tree based KV stores (e.g., LevelDB)
* LevelDB architecture

Components on DRAM ...
* MemTable and Immutable MemTable are basically sorted skiplists
* By default, LevelDB does not synchronously commit KV pairs to the log, because the fsync() needed for each commit slows down writes -> it trades off durability and consistency for higher performance (see the sketch below)
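
This trade-off is visible in LevelDB's public C++ API: `WriteOptions::sync` decides whether each write is fsync'ed. A minimal sketch (the database path and keys are made up for illustration):

```cpp
#include <cassert>
#include <string>
#include "leveldb/db.h"

int main() {
  leveldb::DB* db = nullptr;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/slmdb-notes-demo", &db);
  assert(s.ok());

  // Default: sync == false. The write reaches the MemTable (and the OS
  // buffer for the log) without an fsync(), so a machine crash may lose it.
  leveldb::WriteOptions fast;
  db->Put(fast, "key1", "value1");

  // sync == true forces an fsync() of the log before returning -- durable,
  // but much slower for small writes.
  leveldb::WriteOptions durable;
  durable.sync = true;
  db->Put(durable, "key2", "value2");

  delete db;
  return 0;
}
```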
---
Components on disk ...
* Multiple levels of SSTable files, from $L_0$ to $L_k$
* SSTable files store sorted KV pairs, and the key ranges of the files within a level do not overlap
* Compaction merge-sorts a file's KV pairs with the files whose key ranges overlap in the next level
* Compaction to $L_0$ (i.e., flushing recently inserted data in the Immutable MemTable to $L_0$) does not perform a merge sort in order to increase write throughput -> $L_0$ files can have overlapping key ranges
---
Compaction facilitates fast search, and it also collects garbage in the files; however:
* For read operations
* File search & block search (LevelDB uses a Bloom filter for each block)
* Unnecessary block reads caused by the multilevel search, even with the Bloom filter (see the sketch below)
* High write and read amplification
* Write amplification : caused by maintaining hierarchical levels of sorted files on disk
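
For reference, the Bloom filter is opt-in in LevelDB's C++ API via a filter policy. A minimal sketch (the bits-per-key value and path are just examples): it reduces block reads for point lookups but does not remove the multilevel file search itself.

```cpp
#include <cassert>
#include <string>
#include "leveldb/db.h"
#include "leveldb/filter_policy.h"

int main() {
  leveldb::Options options;
  options.create_if_missing = true;
  // ~10 bits per key gives roughly a 1% false-positive rate, so most point
  // lookups can skip data blocks that cannot contain the key.
  options.filter_policy = leveldb::NewBloomFilterPolicy(10);

  leveldb::DB* db = nullptr;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/slmdb-notes-bloom", &db);
  assert(s.ok());

  std::string value;
  // Get() still probes every level whose key range may contain "key1";
  // the filter only avoids reading data blocks, not the level search.
  db->Get(leveldb::ReadOptions(), "key1", &value);

  delete db;
  delete options.filter_policy;
  return 0;
}
```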
---
# Persistent Memory
* The failure atomicity unit (or granularity) for writes to PM is generally expected to be 8 bytes
* When persisting a data structure in PM, we need to carefully update or change the data structure while ensuring the memory write ordering
* However, in modern processors, memory write operations may be reordered in cache-line units to maximize the memory bandwidth
* Moreover, if the size of the data written to PM is larger than 8 bytes, the data structure can be partially updated on a system failure, resulting in an inconsistent state after recovery (see the ordering sketch below)
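
A minimal sketch of the ordering discipline using x86 intrinsics (cache-line flush plus store fence). The record layout is hypothetical: the multi-cache-line payload is flushed and fenced first, and only then is the 8-byte commit flag written, so a crash can never expose a half-written record as valid.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <immintrin.h>  // _mm_clflush, _mm_sfence

// Hypothetical record layout in a PM-mapped region (illustration only).
struct PmRecord {
  char     payload[48];   // larger than the 8-byte failure-atomicity unit
  uint64_t valid;         // 8-byte commit flag, written last
};

static void flush_range(const void* addr, size_t len) {
  const char* p = static_cast<const char*>(addr);
  for (size_t off = 0; off < len; off += 64)   // 64-byte cache lines
    _mm_clflush(p + off);
}

void persist_record(PmRecord* rec, const char* data, size_t len) {
  // 1. Write and flush the payload.
  std::memcpy(rec->payload, data, len);
  flush_range(rec->payload, sizeof(rec->payload));
  _mm_sfence();            // order: payload must reach PM before the flag

  // 2. Only now set the 8-byte flag; this store is failure-atomic.
  rec->valid = 1;
  _mm_clflush(&rec->valid);
  _mm_sfence();
}

int main() {
  alignas(64) static PmRecord rec{};   // stand-in for a PM-mapped address
  persist_record(&rec, "hello", 5);
  return 0;
}
```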
---
# SLM-DB
* Persistent memory component and single-level disk component
* single level $L_0$ of SSTable files -> reduce write amplification
* persistent B+-tree index -> expedite read operations on the single-level organization of SSTable files
* selective compaction scheme -> maintain a sufficient level of sequentiality of KVs (i.e., a degree of how well KV pairs are stored in sorted order) in SSTables so that it can provide reasonable range query performance
* the state of an on-going compaction is backed up by a compaction log stored in PM -> remains consistent on a system failure
* A single level combined with a B+-tree in PM improves on the traditional multi-level organization
---
# B+ Tree index in PM
When the Immutable MemTable is flushed to an SSTable, the KV pairs are also inserted into the B+-tree:
* file creation thread : creates a new SSTable file and writes the KV pairs from the Immutable MemTable to it; it then flushes the file to the disk and adds all the KV pairs stored in the newly created file to a queue
* B+-tree insertion thread : creates the queue and processes the KV pairs in it one by one, inserting them into the B+-tree (see the sketch below)
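
A minimal sketch of this producer/consumer split. The names and the in-memory map standing in for the persistent B+-tree are made up; the real SLM-DB inserts into a PM-resident B+-tree. The file-creation thread enqueues (key, file, offset) entries while writing the SSTable, and the insertion thread drains the queue into the index.

```cpp
#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <map>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Index entry produced while an SSTable file is being written.
struct IndexEntry {
  std::string key;
  uint64_t file_id;
  uint64_t offset;   // location of the KV pair inside the SSTable
};

std::queue<IndexEntry> queue_;
std::mutex mu_;
std::condition_variable cv_;
bool done_ = false;

// Stand-in for the persistent B+-tree: key -> (file, offset).
std::map<std::string, std::pair<uint64_t, uint64_t>> index_;

// "File creation thread": pretends to flush the Immutable MemTable to an
// SSTable file, then enqueues every written KV pair for indexing.
void file_creation_thread(const std::vector<std::string>& memtable_keys) {
  uint64_t offset = 0;
  for (const auto& k : memtable_keys) {
    // ... write the KV pair to the SSTable file here ...
    std::lock_guard<std::mutex> lg(mu_);
    queue_.push({k, /*file_id=*/1, offset++});
    cv_.notify_one();
  }
  std::lock_guard<std::mutex> lg(mu_);
  done_ = true;
  cv_.notify_one();
}

// "B+-tree insertion thread": drains the queue one entry at a time.
void btree_insertion_thread() {
  std::unique_lock<std::mutex> lk(mu_);
  while (true) {
    cv_.wait(lk, [] { return !queue_.empty() || done_; });
    while (!queue_.empty()) {
      IndexEntry e = queue_.front();
      queue_.pop();
      index_[e.key] = {e.file_id, e.offset};
    }
    if (done_) break;
  }
}

int main() {
  std::thread producer(file_creation_thread,
                       std::vector<std::string>{"a", "b", "c"});
  std::thread consumer(btree_insertion_thread);
  producer.join();
  consumer.join();
  std::cout << "indexed " << index_.size() << " keys\n";
  return 0;
}
```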
---
A B+-tree iterator is implemented to scan keys stored in SSTable files
Iterators support the seek, value, and next methods (a sketch of the interface follows)
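
A minimal sketch of what such an iterator interface might look like; the method names follow the bullet above, and the assumption that leaf entries map each key to its SSTable file and offset is mine.

```cpp
#include <string>

// Hypothetical interface of an iterator over the PM-resident B+-tree.
// Leaf entries are assumed to store (key -> file id, offset), so Value()
// has to read the actual KV pair from the referenced SSTable file.
class BtreeIterator {
 public:
  virtual ~BtreeIterator() = default;

  // Position the iterator at the first entry with key >= target.
  virtual void Seek(const std::string& target) = 0;

  // Advance to the next key in sorted order (possibly in another leaf node).
  virtual void Next() = 0;

  virtual bool Valid() const = 0;
  virtual std::string Key() const = 0;

  // Fetch the value by reading the SSTable block the leaf entry points to.
  virtual std::string Value() const = 0;
};
```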
---
# Selective compaction
SLM-DB supports a selective compaction operation in order
to **collect garbage** of obsolete KVs and improve the sequentiality of KVs in SSTables
* compaction candidate list of SSTables
* compaction is triggered when the number of files in the list is larger than a threshold
* for each file in the list, compute the overlapping ratio as :
* formula...
* The file whose computed overlapping ratio against the other files is the largest is selected as the file to be compacted (see the sketch below)
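
The exact formula is in the paper; as an illustration only, here is a minimal sketch of the victim-selection idea under the assumption that overlap is measured by key-range intersection between candidate files (all names are hypothetical).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Key range of one SSTable file in the compaction candidate list.
struct Candidate {
  uint64_t file_id;
  std::string smallest;
  std::string largest;
};

// True if the key ranges of two files intersect.
static bool Overlaps(const Candidate& a, const Candidate& b) {
  return !(a.largest < b.smallest || b.largest < a.smallest);
}

// Pick the candidate whose key range overlaps the most other candidates.
// NOTE: this simply counts overlapping files; the paper's actual
// overlapping-ratio formula is more precise. Assumes a non-empty list.
uint64_t PickVictim(const std::vector<Candidate>& list) {
  uint64_t victim = list.front().file_id;
  size_t best = 0;
  for (size_t i = 0; i < list.size(); ++i) {
    size_t overlaps = 0;
    for (size_t j = 0; j < list.size(); ++j)
      if (i != j && Overlaps(list[i], list[j])) ++overlaps;
    if (overlaps > best) { best = overlaps; victim = list[i].file_id; }
  }
  return victim;
}
```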
---
* Compaction is implemented with the same two threads described earlier
* During the merge, each KV pair in the file is checked against the B+-tree to determine whether it is obsolete:
* if it is valid, it is merged with the other valid KV pairs;
* if it is obsolete (not found in the B+-tree, or the B+-tree points to another SSTable), the KV pair is dropped
---
* The file creation thread creates a new file for the merged KV pairs and flushes it to the disk, then enqueues the pairs into the B+-tree insertion queue as before
* After the insertion thread has inserted everything in the queue, the creation thread moves on to creating the next file
* This repeats until the compaction completes (a sketch of the validity check follows)
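
A minimal sketch of the validity check during the merge. The in-memory map again stands in for the persistent B+-tree; a pair is kept only if the index still points at the file being compacted.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct KV {
  std::string key;
  std::string value;
};

// Stand-in for the persistent B+-tree: key -> id of the SSTable that
// currently holds the valid version of that key.
using Index = std::map<std::string, uint64_t>;

// Keep a KV pair from file `file_id` only if the B+-tree still points to
// this file; otherwise the pair is obsolete (deleted, or superseded by a
// newer version living in another SSTable) and is dropped.
std::vector<KV> MergeValid(const std::vector<KV>& file_contents,
                           uint64_t file_id, const Index& index) {
  std::vector<KV> merged;
  for (const KV& kv : file_contents) {
    auto it = index.find(kv.key);
    if (it != index.end() && it->second == file_id)
      merged.push_back(kv);      // still valid: goes into the new file
    // else: obsolete, drop it
  }
  return merged;
}
```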
---
# Selection schemes for selective compaction
Live-key ratio - garbage collection
* live-key ratio : # of valid keys / # of total keys in the file
* A newly created SSTable file has a live-key ratio of 1; as its KV pairs are updated through the B+-tree, the ratio gradually decreases
* Files whose ratio drops to 0 are deleted right away; files whose ratio falls below a threshold are added to the candidate list (see the sketch below)
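
A minimal sketch of this bookkeeping. The threshold value and counter names are illustrative assumptions; the idea is just that updates invalidate keys and drive the ratio down.

```cpp
#include <cstdint>

struct FileStats {
  uint64_t file_id;
  uint64_t total_keys;   // keys written when the file was created
  uint64_t valid_keys;   // decremented whenever one of its keys is updated
};

enum class Action { kKeep, kAddToCandidateList, kDeleteFile };

// A new file starts with valid_keys == total_keys (live-key ratio = 1.0).
// Updates to keys stored in the file gradually lower the ratio.
Action Classify(const FileStats& f, double threshold /* e.g. 0.7 */) {
  double live_ratio =
      static_cast<double>(f.valid_keys) / static_cast<double>(f.total_keys);
  if (live_ratio == 0.0) return Action::kDeleteFile;          // pure garbage
  if (live_ratio < threshold) return Action::kAddToCandidateList;
  return Action::kKeep;
}
```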
---
B+-tree leaf node scan - increase sequentiality
* performed while compaction is running
* scans a fixed number of consecutive keys in a round-robin fashion and checks whether they are scattered across too many files; if they are, that whole group of files is added to the candidate list (see the sketch below)
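
A minimal sketch of the round-robin leaf-scan idea; the window size and the "too scattered" threshold are assumptions. Scan a window of consecutive keys and, if they live in more than `max_files` distinct SSTables, the caller adds that whole group of files to the candidate list.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Stand-in for the persistent B+-tree: key -> SSTable file id.
using Index = std::map<std::string, uint64_t>;

// Scan `window` consecutive keys starting at `start_key` and return the set
// of distinct files they live in; if that set is larger than `max_files`,
// the caller adds all of these files to the compaction candidate list.
std::set<uint64_t> ScanWindow(const Index& index, const std::string& start_key,
                              size_t window, size_t max_files,
                              bool* too_scattered) {
  std::set<uint64_t> files;
  auto it = index.lower_bound(start_key);
  for (size_t i = 0; i < window && it != index.end(); ++i, ++it)
    files.insert(it->second);
  *too_scattered = files.size() > max_files;
  return files;
}
```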
---
Degree of sequentiality per range query
* When a range query is served, the same check is applied to see whether the scanned keys are too scattered across files
---
# Experimental conclusions (see the paper for detailed numbers)
* Writes: write amplification is clearly improved
* Reads: the improvement is smaller (my guess is that it is still related to SSTable overlapping)
* Range queries: some cases are worse (sequentiality is not as good)