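# Emerging Memory Final Report: Fusion-RAID

contributed by < [HenryChaing](https://github.com/HenryChaing/ca2023-lab3) >

## Abstract

* FusionRAID is built to reduce software overhead.
* It distributes requests across all SSDs, spreading out the application workload.
* In the final performance analysis, FusionRAID reduces median latency by 22%-98% and tail latency by 2.7×-62× compared with conventional disk arrays.

## Introduction

* All-flash arrays (AFAs) have become increasingly common; for example, Australia's ANZ Bank has deployed SSD arrays totaling as much as 400TB. AFAs aggregate the bandwidth of the individual drives to raise IOPS, and they carry error-correcting codes that strengthen the SSDs' error correction.
* We simulate a RAID5 built from HDDs and compare it with an ordinary SSD RAID, observing median latency and tail latency under the write-intensive Exchange benchmark.
* The results show that the SSD array has better median latency, but SSDs have one drawback: once a drive has been in use for a long time (aged), its performance drops noticeably.

![image](https://hackmd.io/_uploads/HyKu0AWLp.png)

* We further examine the latency distribution of each SSD in the 4+1 RAID5 array; the results are listed in Table 2 (aged state only). Comparing the two tables, we see that when SSDs are grouped into a RAID to gain aggregated space and bandwidth, the price may be slower processing of individual requests.
* The added parity updates not only create more work but also involve more drives, so that for the same workload the RAID's P99 latency is almost 400 times that of a single SSD.
* The P999 figures at the end of Table 2 highlight a challenge for SSD RAID: during our 15-minute trace run, three of the drives appear to be performing garbage collection (GC), and their P999 latency is more than 23 times that of the other two. Because operations are tightly coupled across multiple drives, an isolated latency tail on a single drive affects many more of the RAID's requests, making the whole array more vulnerable to performance anomalies.

![image](https://hackmd.io/_uploads/r1fRwyGIp.png)

We then propose a new RAID architecture named FusionRAID, which aims to lower both the average and worst-case latency of SSD RAID, especially for latency-sensitive applications. FusionRAID runs on commodity SSD arrays without any special hardware support or FTL modification, relying instead on three key techniques:

> Flat resource sharing across the SSD pool, to exploit the available I/O concurrency when serving bursty application I/O.

> Shortened write operations that use temporary replicated writes as a prelude to long-term RAID storage, with the replicas carefully placed "in stripe" so that one copy can be converted directly without migrating data.

> A lightweight latency-spike detection and request redirection mechanism that lets requests (both reads and writes) steer clear of SSDs running with severely degraded performance.

## SSD RAID Latency Source Study

### Workload I/O Characteristics

* Next we use workloads to characterize SSD performance. A first observation: when multiple workloads run on the SSDs at the same time, the instantaneous aggregate bandwidth is usually dominated by a few workloads, because they read or write a large amount of data in a short time while the other workloads access only a little data in the same interval.
* We use 8 workloads. We collected five block-level traces by running several popular data-intensive workloads individually on SSDs, and we include three publicly available traces for diversity. Details are recorded in Table 4. We examine mixes of these 8 workloads at millisecond granularity.

![image](https://hackmd.io/_uploads/rkot80GU6.png)

* We captured all five traces on our testbed (see Section 5 for details). Three of them come from representative standard YCSB workloads running on RocksDB: YCSB-A, YCSB-B, and YCSB-Load (specifications in Table 5). The RocksDB database is 40GB in size, with 4KB KV pairs. Another latency-sensitive application trace comes from the TPC-C benchmark, running on a 77GB MySQL database. In addition, we traced TensorFlow (TF), which periodically reads the training dataset and checks the CNN model in pursuit of better accuracy. Finally, we include three traces from the SNIA storage performance evaluation (SPC) repository: VirtualDesktop (VD), the only recent trace from a server environment, plus Exchange and Proxy, two of the heaviest among the 49 traces in the Microsoft and SPC trace collections.
* Based on their average throughput, we roughly divide the 8 traces into a "heavy" group and a "light" group. We examine all three kinds of 2-workload mix: light-light, light-heavy, and heavy-heavy. Figure 2(a) shows the CDF of $R_{maj}$ over all time slots for such mixes. Instantaneous complementarity is quite frequent in the light-heavy (Exchange + YCSB-A) and light-light (VD + TF) mixes, covering 83.6% and 73.9% of all time slots respectively. Even in the heavy-heavy mix (YCSB-Load + TPC-C), 54.1% of the time slots show instantaneous complementarity.
* Using these 8 workloads, we enumerate all 2-workload mixes and find that 25 of them show instantaneous complementarity in more than half of their time slots; the average ratio across all 28 mixes is 67.8% (Figure 2(b)). Across all 70 4-workload mixes, the average ratio of time slots with instantaneous complementarity rises to 91.4%.

![image](https://hackmd.io/_uploads/HkyDtCM86.png)

* Conclusion: our analysis shows that when workloads run together, the chance of instantaneous complementarity is very high. Letting all disks take part in serving the currently busy workloads reduces their queue waiting time, which is the main source of application-induced tail latency.

### Write Overhead in SSD RAID

* Figure 3 breaks down write latency by comparing (4+1) RAID-5 arrays built from three device types: Intel 545s SATA SSDs, 7200RPM SATA HDDs, and RAM disks. All of them are software RAID arrays implemented with MD. We run the Microsoft Enterprise Exchange workload and show the write latency breakdown: read, xor, and write, with the remainder attributed to software overhead. The left plot covers all requests, while the right plot focuses on the 1% of requests with the highest latency. The numbers on top of the bars give the average latency of each group.

![image](https://hackmd.io/_uploads/B1fX6AM8T.png)

* For this consumer-grade SSD, software overhead has become the dominant component of write latency, whereas on HDDs it is barely visible. On average, the tested SSD RAID spends 2.9 times as long on software overhead as on the write itself. Software overhead also pushes the latency of the slowest 1% of requests to 10 times the average latency, owing to factors such as thread context switches and request queuing (at the block layer and in host-side dispatch queues).
* Although software overhead remains the largest category, for the slowest 1% of requests shown in Figure 3 the SSD writes also contribute heavily to SSD RAID tail latency, costing 20.7 times the average write overhead. Section 2.3 discusses this issue in detail.
* Implication: unlike HDD arrays, SSD RAID I/O latency is dominated by software overhead. Unlike RAM disk arrays, that overhead is prolonged by actual I/O and by coordination across disks. This suggests that a shorter write path with fewer dependencies could substantially lower SSD RAID latency, in both the average case and the worst case.

## Approach Overview

### Design Rationale

* To alleviate the request bursts inherent in workloads, FusionRAID spreads requests across all disks in the storage pool (for example a large commodity SSD enclosure), which contains multiple RAID zones. Although individual I/O requests can mostly be answered by a small number of disks, applications often have severe load bursts that directly cause tail latency. FusionRAID uses RAID declustering to smooth these bursts across all disks in the SSD enclosure, cutting such workload-induced tail latency. In a multi-tenant environment, this automatically provides resource elasticity for the varying intensity of individual workloads.
* To reduce the software overhead and inter-disk dependencies of RAID writes, FusionRAID uses replicated writes as a prelude to RAID writes, and later lazily converts the data into the more space-efficient RAID organization for long-term storage. Before this conversion, the block replicas provide the same fault tolerance as the designated RAID level. For example, FusionRAID writes two copies of a block for a RAID5 zone and three copies for RAID6. This shortens the critical path of a write by postponing, and in some cases even avoiding, the lengthy and interference-prone parity update process. The simple, mutually independent operations of a replicated write therefore deliver lower (and more consistent) latency.
* Finally, to steer around SSDs with temporarily degraded performance, FusionRAID continuously monitors the performance behavior of each SSD to detect temporarily unresponsive drives. For this it uses a lightweight spike detection mechanism that issues no extra I/O and needs no knowledge of SSD internals. With RAID declustering over a large SSD pool, FusionRAID can easily redirect writes to unaffected drives, which are likely to remain the majority at any given time. For reads, it can likewise choose a less affected replica, or use existing approaches proposed by systems such as ToleRAID to compute the blocks hosted on an unresponsive SSD from parity data.

### FusionRAID Architecture

* Figure 5 shows the FusionRAID architecture. Multiple virtual RAID arrays (with different RAID configurations) share an underlying pool of dozens or more commodity SSDs. The total logical space of this SSD pool is divided into a RAID zone and a replicated zone, used respectively for long-term, space-efficient storage and for quickly absorbing small, random writes. Note that there is no physical partition between these two virtual zones: they are in fact deliberately interleaved to allow fast conversion from replicated storage to RAID storage (discussed in detail in Section 4.2). Although the discussion and evaluation here use RAID5, FusionRAID applies to other RAID organizations, for example by raising the replication degree of the replicated zone to 3 to accommodate RAID6.
* Internally, FusionRAID uses one mapping table per virtual RAID volume (FBMT in Figure 5) to maintain the mapping from each volume's logical block numbers to FusionRAID's internal logical block numbers (§4.4). The latter can then be mapped to a logical block on some SSD using a deterministic RAID declustering policy based on MOLS (mutually orthogonal Latin squares) (§4.1). For replicated writes, FusionRAID builds a list of available block pairs from a pair of stripes, enabling low-cost replicated-to-RAID conversion (§4.2).
* In addition, FusionRAID performs real-time SSD latency spike detection by monitoring SSD latency, and factors the results into its decisions. In Figure 5, the last SSD is marked unresponsive and will be avoided for reads and writes whenever possible (§4.3).

## FusionRAID Design

### Storage Organization

* FusionRAID introduces the Fusion logical address space, an internal logical block layer between the user-perceived logical block address space and the SSD logical block address space.
* As illustrated in Figure 6, the mapping from the user-perceived space to this internal space is maintained by a block mapping table (detailed in Section 4.4), one per user RAID volume, which supports dynamic mapping for out-of-place writes. The mapping from a Fusion logical block to a logical SSD block is done by RAID declustering, which uses a static mapping function instead of mapping tables.
* Instruction Decode Process Overview: In the Decoder, there are two crucial components: the output control signal lines and the data lines to be transmitted, all determined by the instruction. For instance, `ALU_Op` decides the source for the ALU input, while `MEM_RE/WE` determines whether to read from or write to memory. As for `RegWS`, `Reg1/2RA`, and `immediate`, they determine the destination registers, rs1, rs2, enabling the following stages to read data and execute properly.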
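
As a rough illustration of the two-level address translation described under Storage Organization, the sketch below pairs a per-volume table (dynamic, updated on every out-of-place write) with a table-free static function. Everything in it is an assumption made for illustration: the names `VolumeMap` and `fusion_to_ssd`, the bump allocator, and above all the simple rotation used as the static function, which stands in for the MOLS-based declustering layout and does not reproduce it. Replicas and parity are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


def fusion_to_ssd(fusion_lbn: int, num_ssds: int) -> Tuple[int, int]:
    """Static mapping: Fusion logical block -> (SSD index, SSD logical block).

    Placeholder rotation, NOT the MOLS-based layout: the block in column c of
    row r lands on SSD (c + r) mod num_ssds, so each row starts on a
    different drive and no lookup table is needed.
    """
    row, col = divmod(fusion_lbn, num_ssds)
    return (col + row) % num_ssds, row


@dataclass
class VolumeMap:
    """Per-volume table: user logical block -> Fusion logical block."""
    num_ssds: int
    table: Dict[int, int] = field(default_factory=dict)
    next_free: int = 0  # naive bump allocator, an assumption of this sketch

    def write(self, user_lbn: int) -> Tuple[int, int]:
        # Out-of-place write: take a fresh Fusion block and remap the user block.
        fusion_lbn = self.next_free
        self.next_free += 1
        self.table[user_lbn] = fusion_lbn
        return fusion_to_ssd(fusion_lbn, self.num_ssds)

    def read(self, user_lbn: int) -> Tuple[int, int]:
        # Dynamic table lookup first, then the static function.
        return fusion_to_ssd(self.table[user_lbn], self.num_ssds)


if __name__ == "__main__":
    vol = VolumeMap(num_ssds=4)
    print(vol.write(7))  # first version of user block 7
    print(vol.write(7))  # overwrite goes to a new location
    print(vol.read(7))   # reads follow the latest mapping
```

Only the first level carries state, so an overwrite changes one table entry while the second level stays a pure function of the Fusion block number.
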
## Performance Evaluation

### Experiment Setup

* Testbed
    * We use a SuperMicro 4U storage server with two 12-core Intel Xeon E5-2650 V4 processors and 128GB DDR4 memory, running Ubuntu 16.04 with Linux kernel v4.15.0.
* Workloads
    * We use both trace-driven and real application tests.
    * **trace-driven**: we use the eight traces mentioned in Section 2.1, with major attributes in Table 4.
    * **real application tests**: we evaluate FusionRAID with the popular RocksDB KV store running YCSB workloads.
    * ![image](https://hackmd.io/_uploads/S1t5zCSUp.png)
* RAID Setup
    * We compare FusionRAID with five alternative RAID designs:
    * **LogRAID**, a log-structured RAID-50 that appends all updates.
    * **ToleRAID**, designed for cutting read tail latency.
    * **4-RAID5**, 4 independent (6+1)-disk RAID-5 arrays.
    * **RAID50**, a RAID-50 system that stripes across RAID-5 arrays, spreading work over all 28 SSDs.
    * **NV-RAID**, another alternative design that we implement, which adds an NVRAM write buffer above RAID50.

### Overall Performance (Trace-driven)

* We measure overall performance by co-running multiple workloads. More specifically, we select 20 4-workload mixes randomly from the aforementioned 8 traces, testing 4-RAID5, RAID50, LogRAID, ToleRAID, and FusionRAID on DC SSDs.
* Figure 10 gives the median and tail latencies of the 8 test workloads. The bars show the average value among all their executions in the 20 mixes (the number of executions ranges between 9 and 13), and error bars mark the min/max values.

![image](https://hackmd.io/_uploads/r1ZRXyUIa.png)

* **4-RAID5**: the median and tail latencies increase noticeably under workloads with a larger average write size.
* **RAID50**: despite spreading work to all 28 SSDs, RAID50 does not reduce median latency and often worsens tail latency. The worst cases, meanwhile, are slowed down by one or two unresponsive SSDs.
* **FusionRAID**: significantly reduces both median and tail latencies. Compared with 4-RAID5, FusionRAID shows an average reduction of 49% in median latency across the 8 traces and a maximum of 87%. Compared with RAID50, the average/maximum reductions are 81%/97%, respectively.
* FusionRAID's P99 improvement (Figure 10(b)) is even more impressive, averaging a 15× reduction (up to 32×) from 4-RAID5 and 35× (up to 62×) from RAID50.

### Space Overhead & Real Application Overhead

* **Space consumption** depends on the aggressiveness of the background conversion policy adopted. Figure 13(c) gives the overall space consumption (vs. user data size) for all data written during the trace run. As expected, 4-RAID5's extra space overhead comes from the single parity block in its 6+1 stripe. FusionRAID, with conversion turned off, has a slightly varying space ratio across workloads, due to their different write patterns. With conversion fully performed, one of the replicas is recycled (the other reclaimed) and multiple writes are compacted, returning the space consumption to the 4-RAID5 level.
* ![image](https://hackmd.io/_uploads/BkDdkT8LT.png)
* **Conversion**: FusionRAID's major internal I/O activity is its replicated-to-RAID data conversion. To examine its performance impact, Figure 13(a) shows performance with and without such background conversion. Even with our current brute-force conversion policy, the performance overhead is quite small, thanks to the in-position stripe allocation during replicated writes.
* ![image](https://hackmd.io/_uploads/ByjHHNw86.png)
* **Real application results**: we evaluate with representative YCSB workloads of varied write intensity levels, running RocksDB on top of ext4 above FusionRAID and RAID50. Table 5 lists workload information and results.
* Across all three workloads, read tail latency consistently benefits from FusionRAID, due to its more efficient digestion of request bursts. For writes, even at an update rate as low as 5%, FusionRAID reduces the average latency of the slowest 10% of requests by 4.1×.
* ![image](https://hackmd.io/_uploads/HkVPdEPLT.png)
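
The evaluation reports median improvements as percentage reductions and tail improvements as × factors, which makes the two hard to compare at a glance. The sketch below shows one way to compute both from raw latency samples and to convert between them. The helper names are hypothetical, and the nearest-rank percentile definition is an assumption, since the report does not state how its percentiles were computed.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample that has at least
    p percent of all samples at or below it (0 < p <= 100)."""
    if not samples:
        raise ValueError("need at least one latency sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def percent_reduction(baseline: float, improved: float) -> float:
    """Latency reduction expressed as a percentage of the baseline."""
    return (1 - improved / baseline) * 100


def fold_reduction(baseline: float, improved: float) -> float:
    """Latency reduction expressed as a multiplicative factor."""
    return baseline / improved


def percent_to_fold(pct: float) -> float:
    """Convert a percentage reduction (< 100) into the equivalent factor."""
    return 1 / (1 - pct / 100)


if __name__ == "__main__":
    latencies_us = list(range(1, 101))  # made-up samples, not measured data
    print(percentile(latencies_us, 50), percentile(latencies_us, 99))
    print(round(percent_to_fold(97), 1))
```

Under this conversion, the 97% maximum median reduction against RAID50 corresponds to a factor of roughly 33×, which puts it on the same scale as the P99 figures.
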