HiKV: A Hybrid Index Key-Value Store for DRAM-NVM Memory Systems

# HiKV: A Hybrid Index Key-Value Store for DRAM-NVM Memory Systems #### Authors: Fei Xia, Dejun Jiang, Jin Xiong, and Ninghui Sun1 ###### tags: `usenix` ###### papers: [pdf](https://www.usenix.org/system/files/conference/atc17/atc17-xia.pdf) ###### slides: [pptx](https://) ###### introduction: [videos](https://) ###### reference: [1] [key-value storage](https://medium.com/%E9%AB%94%E9%A9%97%E4%BA%BA%E7%94%9F-touch-life/key-value-storage-%E5%84%B2%E5%AD%98-27372bb6b267) , [2] [B$^+$-Tree introduction](https://zh.wikipedia.org/wiki/B%2B%E6%A0%91) ## Abstracts: * 現有的 index design of existing key-value stores for hybrid memory 無法利用到 hybrid memory 的特性( DRAM 寫入快， NVM 寫入慢，兩者讀取速度相似 ) * 此篇論文介紹 "HiKV" ，一種 persistent key-value並在 hybrid memory 中建立 hybrid index，並使用 B$^+$-Tree 及 Hash index 的技術 * HiKV builds and persists the hash index in NVM to retain its inherent ability of fast index searching * HiKV builds the B$^+$-Tree index in DRAM to support range scan and avoids long NVM writes for maintaining consistency of the two indexes * HiKV 對於 hybrid index 使用不同的 concurrency schemes，並採用 ordered-write consistency 來確保 crash consistency * 在 single-threaded performance 上，HiKV 比起最新的 NVM-based key-value stores 降低了86.6％的 latency * 在 multi-threaded performance 上，HiKV 在 YCSB workloads 上增加了6.4x 的 throughput ## 1. Introduction * 在 hybrid DRAM-NVM memory systems 建立 key-value stores 來達到快速的 memory access * Persistent key-value stores (KV stores) 在資料中心的儲存架構中扮演很重要的角色，特別在大規模的生產環境，如：搜尋引擎、數位商業化平台、社群網路、照片儲存等等。 * 在過往的10年間，有許多針對 KV store 的設計和優化，如減少 SSD的寫入放大、減少 indexing 下的記憶體使用、增加高規模下的並行效率(improving concurrency to achieve high scalability) ，但因為這些研究主要專注於 hard disk && SSDs ，傳統的 KV store 無法適用於混和系統(hybrid memory systems) * 舉例來說：許多 indexing structure 都會採用 Log-Structured Merge Tree 來避免 hard disk && SSDs 的小部分隨機寫入 ; 但 hybrid memory system 是 byte-addressable 且 sequential and random access 的 performance 差不多，因此 KV stores 在 hybrid memory system 的設計上應較專注於大粒度(large granularity)的循序寫入，而非考慮寫入放大(write amplification) * Indexing 是設計 key-value stores 很重要的一個部分，使用 KV operations ，如：***Put, Get, Update, Delete, and Scan*** 的效率會很大取決於 indexing structure，搜尋 B$^+$-Tree index 大部分比 hash index 來的效率差 * 在 NVM 上的 B$^+$-Tree index 效能優化一直有在研究，但都主要專注於減少 consistency cost * key-value stores 的規模會被 indexing structure 所限制，如 hash index 允許 index structure 在 multi threads 上執行，但 B$^+$-Tree index 會在分割大 partitions 及融合小 partitions 時產生 expensive data movement，因此還有許多空間做研究並改善 * 在此篇論文中，提出 HiKV, **a Hybrid index Key-Value store to run on hybrid memory**，其主要構想爲： * hash index 放置且保存在 NVM ; B$^+$-Tree index 放置在 DRAM 但非永久保存(因 DRAM volatile) * 保留固有有效率的 hash operations 來支援 single-key operations( ***Get/Put/Update/Delete*** )，此外使用 the sorted indexing in B$^+$-Tree 來加速 ***Scan*** * 在採用 hybrid index 時有幾項挑戰： 1. 當使用某些 KV operations (***Put, Update, and Delete***) latency 可能增加，因 HiKV 需要更新兩個 indexes 來維持 consistent 2. hybrid index 的 scalability 需要小心設計， Partitioning hash index 有良好的 scalability，但 Partitioning B$^+$-Tree index 會有很大的 data migration cost 3. 保證 crash consistency of the hybrid index 會造成大量的 NVM Write * 論文解決方法： 1. 將較慢的 B$^+$-Tree index 放在較快的 DRAM ; 而較快的 hash index 放在較慢的 NVM，且 HiKV 不同步更新 B$^+$-Tree index 來隱藏 latency 2. HiKV 採用 partitioned hash indexes and a global B$^+$-Tree index，HiKV 使用 Hardware Transactional Memory (HTM) 來做 B$^+$-Tree index 的 concurrency control，fine-grained locking to support concurrent accesses within individual hash index partitions. 3. HiKV 採用 selective consistency 只保證 hash index and key-value items by ordered-write 的 consistency. HiKV keeps the B$^+$-Tree index in DRAM and rebuilds it after system failure. * 比較三者：HiKV、NVStore、FPTree Workload：micro-benchmarks、YCSB * 效能評比： 1. micro-benchmarks : * HiKV 比 NVStore 減少 54.5% ~ 83.2% 的 latency * HiKV 比 FPTree 減少 28.3% ~ 86.6% 的 latency 2. YCSB workloads : * HiKV 比 NVStore 快 1.7x 到 5.3x * HiKV 比 FPTree 快 24.2% 到 6.4x * 此篇論文有以下貢獻 1. 將較慢的 B$^+$-Tree index 放在較快的 DRAM ; 而較快的 hash index 放在較慢的 NVM 來善用 hybrid memory 的特性，並有效的支援 KV operations 2. 小心設計 concurrency 機制，來達成 high scalability with partitioned hash indexes and single global B$^+$-Tree index 3. 使用 ordered-write consistency and specific hash index design 來允許 atomic writes 已保證 crash consistency with reduced NVM writes 4. 對於 hybrid index 使用 HiKV 設計，並透過實驗來證明 HiKV 的效率性 ## 2. Background and Motivation ### 2.1 Non-Volatile Memory * NVM 的 read latency 與 DRAM 相仿，write latency 則較慢 * NVM 跟 NAND Flash 一樣， write endurance 有限制，特別是PCM，因此寫入次數要透過軟體層來控制 * NVM 如同 DRAM 可以 random access，與傳統 flash 不同 ![](https://i.imgur.com/qj2L5go.jpg) ### 2.2 KV operations and indexing efficiency * ***Put, Get, Update, and Delete*** 是 KV stores 的 basic operations * Scan( Range Scan ) 現今越來越重要： * *Facebook* 從原先的 MySQL 改成 KV store MyRocks，而 ***Scan*** 是MySQL range query 的重要 operation * Local file systems (i.e. TableFS ) and distributed file systems (i.e. CephFS ), 使用 KV stores 來儲存 metadata，而 ***Scan*** 是第二常使用的 metadata operation ***readdir***的核心 operation * 因此如何有效率的支援 rich KV operations 在 building key-value stores 時非常重要 * hash indexing && sorted indexing 都不能有效率的支援所有 operations，此論文採用 micro-benchmarks 來量化三種 in-memory indexes: hash, skiplist, B$^+$-Tree，並支援五種 KV operations * Figure 1 顯示 in-memory throughput results with 50M key-values * 對於 ***Put/Get/Update/Delete***，hash index 表現最唯有效率，比起需要 multi-level searching 的 skiplist and B$^+$-Tree，hash index 可以使用較少的 memory operations * 然而因 hash index 是 non-sorted indexing，在 ***Scan*** 上會有非常低的 throughput，因為要搜尋所有 index space * 現今的 NVM-based KV stores 採用 B$^+$-Tree 作為 indexing structure，這也是此篇論文的發想原因，希望透過hybrid index 來改善效能 ![](https://i.imgur.com/fOcLrC0.jpg) ## 3. HiKV Design and Implementation * 在這段落將會陳述 HiKV 系統的設計及實作，依序以下描述： 1. 首先描述 hybrid index 的設計 2. 講解到採用 hybrid index 所會遇到的 index updating, concurrency control and crash consistency guaranteeing 的設計問題 3. HiKV recovery 4. implementation of HiKV ### 3.1 Hybrid index * single-key operations (***Put/Get/Update/Delete***) 使用一個 key 值來搜尋 index * 一旦 KV item 鎖定，***Get*** 直接回傳 data ; 而 ***Put/Update/Delete*** 因為需要更新 index entry 及新的 KV item，所以需要有效率的 index searching 及 data persisting，而 Hash indexing 存於 NVM 就有此優點，不僅讀取速度如 DRAM，還會永久保存 index，因此不需要而外的 data copy from DRAM to NVM * 另一方面，***Scan*** 會以 start key && count(or a start key and an end key) 作為 input，這在 sorted indexing 會很有效率，因此 ***Scan*** operation 使用 hybrid index 中的 B$^+$-Tree * 要維持 hybrid index 的 consistent 需要每次 KV writes 都要 update hash index and B$^+$-Tree index，而因為 B$^+$-Tree index 需要許多寫入的操作，因此將 B$^+$-Tree index 放在寫入較快速的 DRAM 而非較慢的 NVM 中 * Figure 2 解釋了 hybrid index in hybrid memory 中的結構 ![](https://i.imgur.com/obmzaRV.jpg) ### 3.2 Index updating #### 3.2.1 Asynchronous updates * 當處理 KV writes (***Put/Update/Delete***) 時，HiKV 需要更新 both the hash index and the B$^+$-Tree index 以保證一致，最直覺的方法當然就是同步更新兩個 indexes(synchronous)，但因為 update B$^+$-Tree index 的成本較高，HiKV 採用非同步的方式 update hybrid index (asynchronous) * 也就是說，HiKV 會同步更新 KV items and the hash index in NVM ，而 B$^+$-Tree index in DRAM 就會在之後非同步更新以減少 latency * Figure 3 顯示 HiKV 處理不同 KV operation 的步驟，以 ***PUT*** 爲例： 1. HiKV 首先使用 *serving thread* 來處理 incoming request，*serving thread* 負責寫入 KV item to NVM (step 1) 2. 將新加入的 index-entry 寫入 hash index (step 2) 3. *serving thread* 會將 *put request* 放入 updating queue 中，然後 return ; 一個非同步的 thread (backend thread) 會從 updating queue 中拿 request 並在 background 中操作 B$^+$-Tree 以避免 KV writes latency。且為避免 system crash 無法更新 B$^+$-Tree，HiKV 可以從 hash index 回復 B$^+$-Tree (Section 3.5) ![](https://i.imgur.com/vOqI0MM.jpg) * 只要當 ***Scan*** 進入時 updating queue 還有 requests ，就會發生 B$^+$-Tree index 不同步的狀況，直接執行 ***Scan operation*** 會提取到舊的錯誤的 data， HiKV 遇到此狀況時，會先鎖住接下來的 write 跟 ***Scan*** ，直到 updating queue 中的 request 被清空，再執行 ***Scan*** * ***Scan*** 跟 subsequent writes on B$^+$-Tree index 的同步機制會由 Hardware Transactional Memory (HTM) 所提供 * 此論文限制 updating queue 的長度爲 4096 requests 以免過多的寫入。 #### 3.2.2 Dynamic threads adaption * 爲了處理高同步的 request，HiKV 需要增加 serving threads 的數量，對於 read-write mixed 的 workload，updaing queue 會接收到非常多 *put write requests* ，如果 backend threads 進度太緩慢， updating queue 變得 full 就會 block 住接下來的 requests，因此 HiKV 需要動態的依據 serving threads 調整 backend threads * 通常會設置一個固定數量的 thread pool 來跑 serving threads && backend threads，並透過動態的 threads 調整機制來決定 serving threads 的數量 ($Nsthd$) and backend threads 的數量 ($Nbthd$) * 基本上要符合 backend threads 處理 updating queue 的速率跟 serving threads 增加 updating queue 的速率，而 processing rate 跟 filling rate 由很多因素決定： 1. KV operation complexity 2. ratio of different KV operations 3. DRAM/NVM performance * 為了要決定 runtime 時的 $Nsthd$ ＆＆ $Nbthd$ ，此論文測量不同 KV operations 的 latencies * ***Scan*** 比 ***Get*** 多花了14倍的時間 * 而 ***Put/Update/Delete*** 之間的延遲時間差不到2x，因此視為有相同的 latency時間 * 假定寫入的數量 normalized to 1: * ***Get*** 數量爲 $Ng$ ; ***Scan*** 數量爲 $Ns$ * ***Get*** 延遲爲 $Lg$ ; ***Scan*** 延遲爲 $Ls$ * backend thread 寫入延遲爲 $Lbw$ ; serving thread 寫入延遲爲 $Lsw$ * 那 $Nsthd$ ＆＆ $Nbthd$ 應要滿足以下兩個式子： 1. ($Ng$ x $Lg$ + $Ns$ x $Ls$ +1 x $Lsw$) / $Nsthd$ = (1 x $Lbw$) / $Nbthd$ 2. $Nsthd$ + $Nbthd$ = $Nthd$ ### 3.3 Differential concurrency * 在 multi-era 下同步控制 ( Concurrency control ) 是增加 KV stores scalability 的關鍵因素 * Partitioning 對於 balanced workload 可以有高產出與規模，因此 HiKV 對於 hash index 採用 keyhash- based partitioning，所有 KV items 依據 hash value of the key 分散到多個 partitions，而每個 partition 使用如 Figure 2 所示的 hash index * HiKV 允許多個 threads 同時 access 同一個 partition 來處理 skewed workload，在 partition 中會有細部的同步控制機制，HiKV 使用 atomic write 來更新 hash index entry( 如Section 3.4所示 )，總結來說，HiKV 可以在其他 thread updating 的時候讀取，不會被鎖住。 * 分割 B$^+$-Tree index 的方法不是 unordered multi-B$^+$-Tree indexes as in Cassandra and Megastore using keyhash-based approach 就是 ordered multi-B$^+$-Tree indexes as in SLIK using range partition. * 然而若是要處理 ***Scan*** 都會需要很費工，因為沒有排序的 B$^+$-Tree index 需要每個index 都要執行 *Scan operation* ，才能回傳 matching result; 若是有排序的 B$^+$-Tree index ，只要在包含該 KV items 的 B$^+$-Tree index 中才能完成，因為合併 index entries，不論 index 變太大或太小，都會降低系統 performance，因此 HiKV 使用 global B$^+$-Tree index ，其中包含所有存在 NVM 的 KV items，並採用 HTM 來處理 global B$^+$-Tree index 同步控制的問題 ### 3.4 Ordered-write consistency * crash consistency 是 KV stores 所需要保證的事，因為 NVM 的 long write latency ，HiKV 需要減少 NVM 的寫入。 1. HiKV 會有選擇性的 consistency，只保證 hash index and KV items 的 consistency，因為 DRAM copy data to NVM 代價很大，所以不保證 B$^+$-Tree index consistency。而遇到 system failure 時，HiKV 回復 B$^+$-Tree index 的機制在 Section 3.5 2. 使用 order-write 來保證 hash index and KV items 的 consistency，傳統的 logging and copy-on-write 會產生兩個 writes ，而 order-write 首先在 out-of-place 的地方更新 KV item，然後使用 atomic write 更新 in-place 的 hash index，KV item 在 atomic write 完成後才會 visible，透過這個方式，就可以保證 crash consistency 而且不需要多個 write * 接下來描述支援 atomic write 的 hash index design 以及 consistency algorithms #### 3.4.1 Hash index design * 現代的處理器支援直接支援 8B atomic write，而使用 cmpxchg16b instruction 可支援 16B atomic writes。 * 然而因為 KV stores 的 key size 就有 16B，直接將 key 跟 KV item 在 hash index entry 的位置放入是不可能達到 atomic writes 的。 * HiKV 採用 16B index entry，包含 64 bits 的 key signature 跟 64 bits 的 KV item 位址 * 一個 hash bucket 包含多個 16B 的 index entries * 每個 KV item 儲存 32 bits kv_length, 32 bits key and value * Key signature 仍可能會跟其他的 key 有衝突，因此如果在 index entry 有相同的 key signature，HiKV 會檢查 index entry 中的 KV items 是否符合 specified key ![](https://i.imgur.com/plXZOAD.jpg) #### 3.4.2 Consistency algorithm * 因為下列原因可能造成 memory write 被 reordered： 1. caching of CPU 2. scheduling of memory controller * HiKV 使用 *sfence* 跟 *clflush* ， *sfence* 可以在現有的 hardware 上強制 memory write ordering and persistency，如果 hardware 有支援，*clflush* 可以被最新的 *CLWB* instruction 的取代 * **Put**： 1. 找到空的 index entry 2. 演算法配置空的空間以儲存 KV item 3. 配置 KV item ( key and value ) 4. 放置 NVM 中 5. 使用 atomic write 來設定 index entry 6. 保存 index entry ![](https://i.imgur.com/UPQ1YSt.jpg) * **Update**： 1. 演算法依據 key 找到原本的 index entry 2. 使用 index entry 來找到 NVM 中原始的 KV item 3. 因為 HiKV 採用 out-of-place update for KV item ，需要配置一個空的地區來儲存新的 KV item 4. 配置 KV item ( key and value ) 5. 放置 NVM 中 6. 使用 atomic write 來設定 index entry 7. 保存 index entry 8. deallocate 原先 KV item 的空間 ![](https://i.imgur.com/5KFgQyN.jpg) * **Delete**： 1. 演算法依據 key 找到原本的 index entry 2. 使用 index entry 來找到 NVM 中原始的 KV item 3. 藉由 atomic write 設定 "0" 在 index entry 4. 保存 index entry 5. deallocate 原先 KV item 的空間 ![](https://i.imgur.com/YvGkDW1.jpg) * KV item 的有效與否是看相對應的 index entry ，因為 index entry 更新時是使用 atomic write 的方式，因此在執行 **Put/Update/Delete** 指令的時候 crash 不會破壞一致性 * 值得注意的是，如果 HiKV 在 NVM allocate 新的空間的時候 crash ，那會造成 memory leak，要透過 underlying libraries and operating system 來支援，這也是這篇論文的==future work==。 ### 3.5 Recovery * Recovery after a normal shutdown : * 在正常的關機下，HiKV 會在 NVM 以連續空間的方式保存 B$^+$-Tree index * HiKV 會儲存開始的 address 並使用 atomic wirte 寫一個 **flag** 來標註是 **normal shutdown** * 當 HiKV 需要回復 index，會先檢查 **flag** 是否是 **normal shutdown** ，如果是，HiKV 會從 NVM 讀取 B$^+$-Tree index 並回復至 DRAM * Recovery after a system failure * 如果是 system failure，HiKV 需要藉由 NVM 中的 hash index and key-value items 來回復 B$^+$-Tree index，而這操作使用 **Scan** 所有 hash indexes 來達成 * 對於每個 hash_index 中的 index_entry，如果值不是 0 ，就利用 **recovery thread** 來插入 key 跟相對應的 KV item 的位址到 B$^+$-Tree index，若值為 0，就將 index_entry 設為 invalid ### 3.6 Implementation * HiKV 以 **MICA** 方式實現不會 loss 的 hash index design * hash bucket 有許多連續的 index entry * 如果遇到 hash collision，HiKV 會循序的搜尋 hash bucket 中的下一個 index entry * B$^+$-Tree leaf node 中的每一個 index entry 包含了整個 key 跟其相對應 KV item 在 NVM 中的位址 * 採用多重 lock-free updating queue 來減少多同步 request 下的資源爭奪，因為所有的 **bcakend thread** 同步處理 updating queue 的 cost 很高 ## 4. Evaluation ### 4.1 Experimental Setup * 實驗環境： * two Intel Xeon E5-2620 v4 processors * running at 2.1 GHz has 8 cores * a shared 20MB last level cache * The memory size in the server is 256GB. * NVM emulation： * 採用 DRAM 來模擬真實的 NVM DIMMs * access latency is 60 ns * write latency is 600 ns * 使用 x86 RDTSCP instruction 讀取 processor timestamp counter and spin counter 直到達到 configured latency * read latency 維持原本的，因 NVM 讀取速度與 DRAM 類似 * 較長的 NVM read latency 在 Section 4.6 模擬 * Workloads： * 使用五個 micro-benchmarks 來模擬 single KV operations ( ***Get/Put/Update/Delete/Scan*** ) * 如同 **YCSB**，隨機產生的 Scan count 少於 100 * 對於每個 micro-benchmarks： * 先輸入 50M key-values 來 warm up KV stores * 執行 50M operations with 隨機選擇的 key-values * 所有 micro-benchmarks 產生的 KV data 會有一致的 distribution * 使用常見的 macro-benchmarks **YCSB** 來預測 mixed operations 的效能 * 執行 50 M key-value opeations * 使用 default configuration of **YCSB**，符合[齊夫定律](https://zh.m.wikipedia.org/zh-tw/%E9%BD%8A%E5%A4%AB%E5%AE%9A%E5%BE%8B)並有99%的傾斜度 * 對於 micro- && macro-benchmarks，都使用 16B 的 key size，爲 production 環境下常見的 size * 在 **Facebook** ，超過 90％的 Memcashed value 接近並略小於 500B ，因此此論文的 value size 基本上採用 256B * Compared systems： * HiKV 跟最新的 NVM KV store *NVStore* 還有 hybrid memory KV store *FPTree* 做比較 * 並沒有跟 disk-based KV stores, such as *RocksDB* 做比較，因為 HiKV 是以 byte-addressable NVM 爲設計，且他的 I/O stack 跟 *RocksDB* 不太一樣 * 並沒有跟無法保證一致性的 KV stores 做比較，如 *Echo* and *Masstree* * 以別人的論文重建 *NVStore* and *FPTree* * **NVStore**： * NVStore 的 index 是優化過的 B$^+$-Tree，稱為 NVTree，會將 entry 放置在 leaf nodes 並 unsorted 以減少 NVM write * 公平起見，將 NVTree 的 innor nodes 放置在 DRAM，如同 HiKV * **FPTree**： * 同樣使用 variation of B$^+$-Tree ，並在每個 unsorted leaf node 中加入 bitmap and fingerprints 以加速搜尋 * 一般的混和記憶體是把 DRAM 當做 NVM 的 cache，以 16B keys and 256B values 來說，HiKV 的 DRAM/NVM 使用率比 **FPTree** 高了 15.4％，所以在此實驗將多餘的 DRAM 作為 FPTree 的 cache，稱為 **FPTree_C**，**FPTree_C** 使用 hash index 跟 LRU replacement policy 來管理 cache ### 4.2 Single-threaded performance * 首先以 micro-bechmarks 測試 single-threaded performace * ***Get/Scan*** 因為只需要讀取 data ，四個 KV stores 都只有 single thread * 因為寫入操作（*Put/Update/Delete*) HiKV 需要一個 backend thread 來維持 B$^+$-Tree index 及一個 serving thread，為公平起見，both NVStore, FPTree, and FPTree C 配置兩個 thread #### 4.2.1 Latency reduction * **Get**: * HiKV 比起 NVStore 減少 83.2％ latency * HiKV 比起 FPTree 減少 86.6％ latency * HiKV 只需要查詢快速的 hash index，但 both NVStore and FPTree 需要循序的查詢 unsorted leaf node * **Put/Update/Delete**: * HiKV 比起 NVStore 減少 54.5% / 58.4% / 65.3% latency * HiKV 比起 FPTree 減少 68.8% / 59.1% / 45.0% latency * 因為搜尋 hash index 非常快速且 HiKV 利用非同步機制隱藏了 B$^+$-Tree index 的 latency * 對於**Put/Update**： * FPTree 需要保存 data 三次 ( bitmap, fingerprints and key-value ) * NVStore 需要保存 data 二次 ( key-value and leaf.number ) * 因此對於**Put/Update** ，FPTree 效能比 NVStore 來的差 * 對於 **Delete**： * HiKV 只需要 invalid 相對應的 index_entry 並保存在 NVM * NVStore 需要新增 key-value with an invalid flag and update the leaf number，因此在 NVM 保存 data 兩次 * 雖然 FPTree 只需要 invalid bitmap 並在 NVM 保存 data一次，因為較沒效率的 tree index 搜尋機制，他的效能仍比 HiKV 來的差 * **Scan**： * HiKV 比起 NVStore 減少 77.7％ latency * HiKV 比起 FPTree 減少 28.3％ latency * 當放 key-value 時， NVStore 不會檢查 key-value 是否在 leaf node，因此在 scanning leaf node 時還要額外確定 key-value 是否有效，因此 NVStore 效能比 HiKV 差 * FPTree 每個 leaf node 都使用 bitmap 來確定 leaf node 中的 key-value entry 是否有效，因此 FPTree 在 **Scan** operation 的效能比 NVStore 好，但仍比 HiKV 差 * FPTree_C 對於 single-key operations 的表現比 FPTree 來的差，因為 micro-benchmarks 有一致的 distribution，因此會造成 low cache hit ratio，FPTree_C 就要花多餘的 overhead 來執行 cache replacement ![](https://i.imgur.com/JrCGuUr.jpg) #### 4.2.2 Throughput improvement * **Get/Scan**: * HiKV 比起 NVStore 增加 5.0x/3.8x throughput * HiKV 比起 FPTree 增加 6.4x/41.2％ throughput ![](https://i.imgur.com/zIPOOmA.jpg) * **Put/Update**: * HiKV 比起 NVStore 增加 10.4%/19.6% throughput * HiKV 比起 FPTree 增加 55.9%/19.6% throughput * **Delete**: * HiKV 比起 NVStore 增加 43.2% throughput * HiKV 比起 FPTree 減少 10.0% throughput * HiKV throughput 的增加幅度在 **Put/Update/Delete** 比起 **Get/Scan** 來的少，因為 NVStore 跟 FPTree 在 write request 時有兩個 threads，而 HiKV 只有一個 thread，而在 read request，這三個 KV stores 都只有一個 thread * 因為 DRAM cache management 的 overhead，FPTree_C 的 throughput 比 FPTree_C 來的少 ### 4.3 Scalability * 透過 macro-benchmark YCSB 來測試 HiKV scalability * 因 network stack 的 latency，不使用原先 client-server 模式的 YCSB 架構，而使用 local YCSB workload generator，如同 default YCSB configurations like **MICA** * HiKV 會依據不同 workload 動態調整 serving thread 跟 backend thread 的比例 * 在實驗中所有 KV stores 都會用相同數量的 threads * 在 16核的 server 下，最多同時跑 32 個 threads * 在 YSCB-A/B/C/D/E/F 下，2 threads 跟 32 threads的效能比例爲： * HiKV -- 13.6/15.5/10.5/15.4/4.3/18.3 * NVStore -- 5.5/7.6/8.2/8.2/4.6/5.5 * FPTree -- 10.0/7.5/7.5/7.7/3.4/10.1 * FPTree_C -- 7.9/8.2/8.8/7.8/3.5/8.8 * 總結來說 HiKV 的 scalability 比 NVStore 及 FPTree 來的好 * **Get ratio** is 95%, 95%, and 75% for YCSB-B, YCSB-D, and YCSB-F，HiKV 在有32 threads 運行時 serving threads 佔了 20 個，然而在 2 threads 運行時 serving thread 只會有 1 個，因此在 32 threads 且 workload 爲 YCSB-B/D/F 的情況下，HiKV 的 throughput 可以比在 2 threads 的情況下多了 15倍 * 在 32 threads 執行時： * HiKV 比起 NVStore 增加 1.7x to 5.3x throughput * HiKV 比起 FPTree 增加 24.2% to 6.3x throughput * HiKV 比起 FPTree_C 增加 24.1% to 3.5x throughput * 對於 read-intensive 跟 skewed workloads，如 YCSB-B/C，FPTree_C 因為 cache hit 比例較高所以表現的比 FPTree 好 * 對於 YCSB-E，HiKV 在 8 threads 以前幾乎以線性做成長，且 8 threads 以後進入 stable 狀態。因為 HiKV 在執行 **Scan** 以前會先鎖住所有 updating queues，因此鎖住 **Put** 等其他 threads * NVStore, FPTree and FPTree_C 可以擴展到 16 threads，不過 HiKV 仍然比他們各多了 1.7x, 24.2%, and 24.1% ![](https://i.imgur.com/PgrBu2U.jpg) ### 4.4 Sensitivity analysis * 此段落，仔細分析 HiKV 對於 NVM write latency 跟 workload dataset size 下的狀況 * 實驗都以 16 threads 進行 ### 4.4.1 Sensitivity to NVM write latency * NVM devices 的寫入延遲沒有定數，所以進行不同 NVM write latency 下的實驗 * Figure 8 表示 NVM write latency 從 50ns to 1400ns 下的 throughput 狀況 * 因為 **Get/Scan** 跟 write latency 沒有關係，因此實驗只做 **Put/Update/Delete** * 並不會顯示 FPTree_C 的結果，因為在平均 distribution 下他的產出會比 FPTree 來的差 * NVStore 及 FPTree 的產出會隨著 NVM write latency 增加而減少 * 在 **Delete** 下，HiKV的 throughput 還是很穩定( NVM write latency 低於 1400ns 時)，因為同步刪除 B$^+$-Tree index 的 latency 即使在 write latency 上升到 1000 ns 時仍然比 hash index 來得高 * 跟 NVStore 及 FPTree 比較起來，HiKV 增加了 **Delete** 的 throughput by 17.6%/80.0%/39.9%，32.9%/38.4%/24.6% for **Put/Update/Delete**，即使write latency 上升到 1400 ns ![](https://i.imgur.com/LoF1idK.jpg) ### 4.4.2 Sensitivity to dataset size * 有一個懷疑是 HiKV 使用 global B$^+$-Tree index 來支援 **Scan** ，隨著 dataset 的提高，是否會限制 HiKV 的 scalability？ * Figure 9 顯示了 key-value 從 10M to 250M 的 throughput * NVStore 無法支援到 250M key-value，因為 NVTree index 的增加使用完了 server memory * 對於 **Put** 總共放了 500M 的 key-value，因為先放了 250M key-values 作為 warm up * **Update** 下 HiKV 的產出維持不變，但 NVStore 及 FPTree 分別下降了 19.3％及 13.1％ * 當 key-value 增加了 25 倍： * **Put**： * HiKV throughput 下降 14.6％ * NVStore throughput 下降 NA * FPTree throughput 下降 7.2% * **Delete**： * HiKV throughput 下降 22.4％ * NVStore throughput 下降 15.6% * FPTree throughput 下降 16.3% * 產出下降的原因是因為增加 dataset size 會增加搜尋的 latency，而 HiKV 執行 **Update** 的 throughput 是由 serving thread 所決定 * 透過此實驗，可以確定 global B$^+$-Tree index 不會限制 HiKV 的 scalability ![](https://i.imgur.com/alhAlH1.jpg) ### 4.5 Performance breakdown * 此段落討論三個問題： 1. 非同步 update 的效率 2. differential concurrency 3. HiKV orderd-write 的一致性 * HiKV_sync 使用一個 thread 同步 update hash index 跟 B$^+$-Tree * HiKV_par 採用 partitioning-based 同步控制 B$^+$-Tree index (多個且照順序) * HiKV_wal 使用傳統的 Write-Ahead Log 來保證一致性 * Figure 10 顯示 **Put** 在NVM write latency 從 50ns to 1400ns 下的平均 latency * 跟 HiKV_sync 相比，HiKV 可以減少 46.7% to 57.8% latency，因 HiKV 非同步更新，所以只需處理 hash index * 跟 HiKV_par 相比，HiKV 可以減少 11.2% to 17.4% latency，HiKV_par 的低效能有兩個原因： 1. 在 B$^+$-Tree index 中合併 index entries 會 block 住 **Put operation** 2. 在 16 threads 執行下，migration thread 會佔用 CPU 的資源 * HiKV_wal 在 hash index 儲存 key and value position(index entry) ，為了保證一致性，HiKV_wal 首先將 key-value 寫入 log area，然後將 value 寫入 NVM，並因 **Put** 在 hash index 寫入 index_entry， * 如果沒有先 logging，直接寫入 value and index_entry 會造成 inconsistency，因為沒有辦法 atomic write * 可以發現 HiKV_wal 需要在 NVM 保存 data 3次，還不能保證 writing value and index_entry to NVM 的 order * HiKV 只需要在 NVM 保存 data 2次，還可達成 order-writing * 跟 HiKV_wal 相比，在 NVM write 爲 1400ns 的情況下，HiKV 可以減少 27.4% latency ![](https://i.imgur.com/9iQMZhI.jpg) * 接下來探討 HiKV 動態調整 thread adaption 機制： 1. 先以 50M key-value 做 warm up 2. 依序執行 YCSB-A/B/C/D/E/F，**Put/Get/Update/Scan**的比例散落在這些 workloads 3. 每個 workloads 執行60秒，總 threads 數爲 16，HiKV-8-8代表 serving thread 有 8 個而 backend thread 有 8 個，HiKV-12-4 表示法亦同 * Figure 11 可以看出除了 YCSB-A 之外， HiKV 皆有最多的 throughput * 在 YCSB-B/C/D/E/F，HiKV 分別超出 HiKV-8-8 by 10.5%-1.0x，超出 HiKV-12-4 by 1.6% * HiKV 會動態調整 thread 比例，如在 YCSB-A，比例爲 8-8，在 YCSB-B，比例爲 13-3，在 YCSB-F，比例爲 9-7 * 若是 read-intensive workload，HiKV 會增加 serving thread 來增加 throughput ![](https://i.imgur.com/oNNK4ck.jpg) ### 4.6 Impact of NVM read latency * 有些研究指出 NVM read 會比 DRAM read 來的久，因此在這邊量測 NVM read latency 對於 system performance 的影響 * 量測讀取速度爲 60ns(DRAM speed) 跟 120ns(dounle DRAM speed) * Figure 12 指出平均 serving latency 會隨 NVM read latency 增加而增加，因為 HiKV 要花更多的時間搜尋 hash index * 然而 NVStore and FPTree 也要花更多的時間當他們搜尋 unsorted leaf nodes 跟切割/合併 leaf nodes * **Get/Scan** operation 下，HiKV 比 NVStore 少了 80.0%/61.8% 的 latency * **Get/Scan** operation 下，HiKV 比 FPTree 少了 82.3%/13.0% 的 latency ![](https://i.imgur.com/i8SgQtM.jpg) ### 4.7 Memory consumption * Figure 13 顯示隨機放入 50M key-values to different KV stores 後的 DRAM and NVM 消耗量 * value size 從 64B to 1KB，曲線表示 DRAM / NVM 的 consumption 比例 * 因為 DRAM 用來儲存 indexes of key-values，DRAM consumption 跟 key-value 的數量及 varied value sizes for both KV stores 的同步 * 相反的，在 both KV stores，NVM 用來儲存 data items 且他的 consumption 隨著 value size 增加 * NVStore 消耗 DRAM 高達 38.27 GB，主要因為當 NVStore 重新建立連續的 innder nodes of NVTree，每個 leaf node 皆創造一個 parent-leaf-node，index size of NVTree 會隨著豎高而指數成長 * HiKV 常常消耗 2.21 GB 的 DRAM 來儲存 B$^+$-Tree index，比起FPTree 來的多，因為 HiKV 的 B$^+$-Tree 要把 NVM 中的每個 unsorted key-value 都 index，而 FPTree 只需要儲存 inner nodes * HiKV-ratio 從40％降到4％，隨著 value size 從 64B to 1KB，256B 的 HiKV-ratio 爲 15.8％ * 雖然 FPTree 多餘的 DRAM 空間用來作為 FPTree_C，但 HiKV 相較於 FPTree_C 仍舊有較佳的 performance，而要透過 migratin part of leaf nods to NVM 來減少 DRAM consumption of B$^+$-Tree 是此論文的==future work== ![](https://i.imgur.com/WR2YCJZ.jpg) ### 4.8 Recovery time * 最後來檢查 HiKV, NVStore, and FPTree 的 recovery performance * 50M key-values 下，NVStore 花 11.03s restore ; FPTree 花 1.74s restore * NVStore 花比較久的時間，因為他配置較大的連續 inner nodes for tree index 且比起 FPTree ，insert key 更加隨機 * 因為 hash index 是 unsorted，HiKV 需要在 NVM 讀取 valid index_entry 並將對映的 key and key-value 位址新增到 B$^+$-Tree，因此 HiKV restore 要 88.24s * 增加 recovery threads 可以加快 recovery 速度，在 16 threads 下只需要 6.28s 來回復 50M key-values ## 5. Related Work * 此段落討論三種方向的 related work: 1. Indexing Structure * 許多分散式 KV stores，如 Cassandra, Megastore, and SLIK，建立多個 indexes for multi-key-value data，如 secondary index for non primary key query * HiKV 依據 single key 建立 hybrid index，並專注於改善 updating hybrid index 的 latency * SILT and dual-stage index 建立多層 indexes 來減少 DRAM consumption of indexes，這兩種技巧與 HiKV 減少 DRAM consumption of B$^+$-Tree 無關 * NVM，特別是 PCM，有寫入次數的限制，因此有一些研究致力於優化 NVM 的 indexing structure 來減少 NVM write，如 Chen et al. 提出 unsorted leaf nodes of B$^+$-Tree 以避免 sorting 的寫入 * HiKV 主要優化 indexing structure 來支援豐富的 KV operations，而非減少 NVM 寫入 2. Concurrency Control * KV Stores 多核的同步機制已經被研究許久，Echo and NVStore use the MVCC for concurrency control. Chronos and MICA uses partitioning for concurrency control of hash tables.PALM is lock-free concurrent B+-Tree. FPTree adopts HTM to handle the concurrency of inner nodes, and fine-grained locks for the concurrency access of leaf nodes. * HiKV 採用相似的技巧，使用 hybrid index 將他分割爲 hash tables and HTM for B$^+$-Tree index 3. NVM consistency guaranteeing * 有一些研究專注於減少 consistency guaranteeing 下的 cost * 有些 research works 使用 differential logging 只記錄修改過的 bytes of a block on journal 以減少 NVM writes，然而 differential logging 無法避免二次寫入 * 許多研究針對 data 大小不同用不同的方式減低 consistency cost，如 atomic-write 在小 data(8B to 16B) 時使用，而大 data 則使用 write-ahead logging and copy-on-write * NVStore and FPTree 使用 orderd-write 來保證 consistency * HiKV 採用 ordered-write 搭配 atomic-write to hash index ## 6. Conclusion * 永久 key-value stores 在現實中已經大量使用，混合式記憶體允許儲存裝置在記憶體中保存 data * 使用 KV stores 可以削減掉 hybrid memory 的一些效能特性，而透過豐富的 KV operations ( ***Put/Get/Update/Delete/Scan*** ) 可以有效率的支援現今的 application * 透過這篇論文的 hybrid index 方法，在 NVM 中使用 hash index 快速搜尋並直接保存 ; 在 DRAM 中使用 B$^+$-Tree index 快速更新及支援 ranging scan * 在 hybrid index 之上使用 HiKV 做管理，有完善的 scalability及 crash consistency 保證，透過實驗也證明效能比現今的 NVM-based KV stores 表現來的好 ----------------------------- > 老師建議 * DRAM comsumption 重要，要了解比例 * 了解 backend threads 處理 updating queue 的機制，因為使用 HTM，如果一次只能一個 thread 處理變成循序的話增加 backend thread 就沒有用處 * 也可以使用 DRAM 作為 NVM write 的 buffer，讓寫入更為快速 * 是否可以把 DRAM 當做 cache，大部分的 DRAM B$^+$-Tree 是放在 NVM 中 * 此篇論文沒有假設放在哪個 NVM 中 * metadata 定義要再了解 > 老師建議