SLM-DB: Single-Level Key-Value Store with Persistent Memory
===
###### tags: `usenix` `fast`

# Problems
* B-tree based KV stores: poor write performance
    * incur multiple small random writes
    * dynamically maintaining a balanced structure -> high write amplification
    * more suitable for read-intensive workloads
* LSM-tree based KV stores: challenging issues of high write and read amplification
    * the LSM-tree is organized as multiple levels of files
    * KV pairs are merge-sorted multiple times

---

About PM

* PM is expected to coexist with disks such as HDDs and SSDs
* This calls for a new KV-store design based on a hybrid system of PM and disks
* PM should carry a role that is more than just a large memory write buffer or read cache

---

# Problems of LSM-tree based KV stores (e.g., LevelDB)

* LevelDB architecture
![](https://i.imgur.com/6CWJwMd.jpg)

Components in DRAM ...

* MemTable and Immutable MemTable are basically sorted skiplists
* By default, LevelDB does not synchronously commit KV pairs to the log, because the fsync() calls needed for each commit slow down writes -> it trades off durability and consistency for higher performance

---

Components on disk ...

* multiple levels from $L_0$ to $L_k$
* SSTable files store sorted KV pairs, and key ranges do not overlap within a level (except $L_0$)
* compaction merge-sorts a file with the files in the next level whose key ranges overlap
* compaction into $L_0$ (i.e., flushing recently inserted data from the Immutable MemTable to $L_0$) does not perform a merge sort in order to increase write throughput -> $L_0$ files can have overlapping keys

---

Compaction facilitates fast search and also collects garbage in the files; however:

* For read operations
    * file search & block search (LevelDB uses a Bloom filter for each block)
    * unnecessary block reads still occur due to the multi-level search and Bloom filter false positives
* High write and read amplification
    * write: maintaining hierarchical levels of sorted files on disk

---

# Persistent Memory

* The failure atomicity unit (or granularity) for writes to PM is generally expected to be 8 bytes
* When persisting a data structure in PM, we need to carefully update or change the data structure while ensuring memory write ordering
* However, in modern processors, memory write operations may be reordered in cache line units to maximize memory bandwidth
* Moreover, if the data written to PM is larger than 8 bytes, the data structure can be partially updated on a system failure, resulting in an inconsistent state after recovery

---

# SLM-DB

* Persistent memory component and single-level disk component
    * a single level $L_0$ of SSTable files -> reduces write amplification
    * a persistent B+-tree index -> expedites read operations on the single-level organization of SSTable files
    * a selective compaction scheme -> maintains a sufficient level of sequentiality of KVs (i.e., a degree of how well KV pairs are stored in sorted order) in SSTables so that it can provide reasonable range query performance
    * the state of an on-going compaction is backed up by a compaction log stored in PM -> stays consistent across system failures
* A single level combined with a B+-tree in PM improves on the traditional multi-level architecture

---

# B+-tree index in PM

When the Immutable MemTable is flushed to an SSTable, the KV pairs are also inserted into the B+-tree.

File creation thread: creates a new SSTable file and writes the KV pairs from the Immutable MemTable to the file; it then flushes the file to disk and adds all the KV pairs stored in the newly created file to a queue.

B+-tree insertion thread: consumes the queue, processing the KV pairs one by one and inserting them into the B+-tree.

---

A B+-tree iterator is implemented to scan the keys stored in SSTable files.
Iterators support seek, value, and next methods.
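
As a rough illustration of how such an iterator can drive a range scan, here is a minimal, self-contained sketch (not SLM-DB's actual code): a `std::map` stands in for the persistent B+-tree, and `IndexEntry`, `BTreeIndexIterator`, and their field names are hypothetical.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// Illustrative index entry: where a key's value lives on disk.
// (Field names are assumptions, not the paper's actual layout.)
struct IndexEntry {
    uint64_t file_number;  // SSTable file that holds the KV pair
    uint64_t offset;       // offset of the KV pair inside that file
};

// Minimal iterator over a sorted index, standing in for the PM B+-tree.
class BTreeIndexIterator {
 public:
    explicit BTreeIndexIterator(const std::map<std::string, IndexEntry>& index)
        : index_(index), it_(index.begin()) {}

    // Position the iterator at the first key >= target.
    void Seek(const std::string& target) { it_ = index_.lower_bound(target); }

    bool Valid() const { return it_ != index_.end(); }
    void Next() { ++it_; }

    const std::string& key() const { return it_->first; }

    // In SLM-DB the value would be read from the SSTable file the entry
    // points to; here we just return the location for illustration.
    const IndexEntry& value() const { return it_->second; }

 private:
    const std::map<std::string, IndexEntry>& index_;
    std::map<std::string, IndexEntry>::const_iterator it_;
};

int main() {
    std::map<std::string, IndexEntry> index = {
        {"apple", {1, 0}}, {"banana", {1, 128}}, {"cherry", {2, 64}}};

    // Range scan: seek to the start key, then advance with Next().
    BTreeIndexIterator iter(index);
    for (iter.Seek("banana"); iter.Valid(); iter.Next()) {
        std::cout << iter.key() << " -> file " << iter.value().file_number
                  << " @ " << iter.value().offset << "\n";
    }
    return 0;
}
```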
---

# Selective compaction

SLM-DB supports a selective compaction operation in order to **collect garbage** of obsolete KVs and improve the sequentiality of KVs in SSTables.

* a compaction candidate list of SSTables is maintained
* compaction is triggered when the number of files in the list is larger than a threshold
* for each file in the list, the overlapping ratio with the other candidate files is computed:
    * formula...
* the file with the largest overlapping ratio with the other files is selected to be compacted

---

* compaction is implemented with the same two threads described earlier
* during the merge, each KV pair in a file is checked (by searching the B+-tree) to see whether it is obsolete:
    * if it is valid, it is merged with the other valid KV pairs;
    * if it is obsolete (not present in the B+-tree, or the B+-tree points to another SSTable), the KV pair is dropped

---

* the file creation thread creates a new file to write the merged KV pairs and flushes it to disk, then enqueues the pairs for the B+-tree insertion thread as before
* once the insertion thread has inserted the queued entries, the creation thread continues with a new file
* this repeats until the compaction is complete

---

# Selection criteria

Live ratio - garbage collection (see the sketch at the end of this note)

* live ratio = # of valid KV pairs / # of total KV pairs in a file
* a freshly created SSTable file has a live ratio of 1; as its data is updated in the B+-tree, the ratio gradually drops
* files whose ratio reaches 0 are deleted immediately; files below a threshold are added to the candidate list

----

B+-tree leaf node scan - increase sequentiality

* performed along with compaction
* a fixed number of consecutive keys is scanned in a round-robin fashion to check whether they are scattered across too many files; if so, that group of files is added to the candidate list

----

Degree of sequentiality per range query

* while serving a range query, the accessed keys are likewise checked for being too scattered, and the involved files are added to the candidate list

---

# Experimental conclusions (see the paper for detailed numbers)

* write amplification is improved for writes
* reads improve by a smaller margin (my guess is that this is still related to SSTable overlapping)
* range queries are sometimes worse (sequentiality is not as good)
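
To make the live-ratio based candidate selection above concrete, here is a minimal sketch under assumed names (`FileStats`, `SelectGarbageCandidates`, and the 0.5 threshold are illustrative, not taken from the paper):

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Illustrative per-file statistics (names are assumptions, not SLM-DB's).
struct FileStats {
    uint64_t file_number;
    uint64_t total_keys;  // KV pairs written when the SSTable was created
    uint64_t live_keys;   // KV pairs still referenced by the B+-tree
};

// Classify files by live ratio: ratio 0 -> delete immediately,
// ratio below the threshold -> add to the compaction candidate list.
void SelectGarbageCandidates(const std::vector<FileStats>& files,
                             double threshold,
                             std::vector<uint64_t>* to_delete,
                             std::vector<uint64_t>* candidates) {
    for (const FileStats& f : files) {
        double live_ratio =
            f.total_keys == 0 ? 0.0
                              : static_cast<double>(f.live_keys) / f.total_keys;
        if (live_ratio == 0.0) {
            to_delete->push_back(f.file_number);   // holds only obsolete KVs
        } else if (live_ratio < threshold) {
            candidates->push_back(f.file_number);  // worth compacting later
        }
    }
}

int main() {
    std::vector<FileStats> files = {{1, 100, 100}, {2, 100, 40}, {3, 100, 0}};
    std::vector<uint64_t> to_delete, candidates;
    SelectGarbageCandidates(files, /*threshold=*/0.5, &to_delete, &candidates);

    std::cout << "delete:";
    for (uint64_t f : to_delete) std::cout << " " << f;
    std::cout << "\ncandidates:";
    for (uint64_t f : candidates) std::cout << " " << f;
    std::cout << "\n";
    return 0;
}
```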