---
title: Scalable Parallel Flash Firmware for Many-core Architectures
tag: ssd
---

# Scalable Parallel Flash Firmware for Many-core Architectures

[paper](https://www.usenix.org/system/files/fast20-zhang_jie.pdf)

## Introduction

* Highly parallel I/O services require simultaneously performing many SSD-internal tasks, so the SSD needs multiple cores and a parallel implementation to reach higher throughput
* As the tasks inside the SSD increase, the SSD must address several scalability challenges brought by garbage collection, memory/storage contention, and data consistency management when processing I/O requests in parallel
* We propose DeepFlash, a manycore-based NVMe SSD platform that can process more than one million I/O requests within a second (1MIOPS) while minimizing the requirements of internal resources

## Background

* A high-performance SSD architecture that Marvell recently published
  * Four Gen 3.0 PCIe lanes (4 GB/s)
  * Three embedded processors, each employing two cores

![](https://i.imgur.com/ppVTkIc.png)

#### Datapath from PCIe to flash

* Figure 2b shows the processes of the FTL, which performs steps 3~8

![](https://i.imgur.com/OF9Tn8V.png)

## Challenges to Exceeding 1MIOPS

We extend the baseline SSD architecture to a highly scalable environment: the Intel many-integrated-core (MIC) platform, which **provides a high core count** to study parallelism and scalability.

### Flash scaling

* A single MIC core, on which two threads are initialized to process the tasks of the HIL and the FTL
* 16 MIC cores (one core per flash channel) to manage the flash interface subsystems
* Testing 4KB sequential writes
* Performance bottleneck
  * A single-core SSD controller is insufficient to fetch all the requests
  * Parallelizing I/O accesses across many flash chips is faster than performing address translation on only one core

![](https://i.imgur.com/sPFAxyT.png)

### Core scaling

* Naive achieves only 813K IOPS even with 32 cores, 82.6% lower than Expected, because **contention and consistency management for the memory spaces of internal DRAM** introduce significant synchronization overheads

## Many-to-Many Threading Firmware

### Overview

* The queue-gather stage mainly parses NVMe requests and collects them into the SSD-internal DRAM
* The trans-apply stage mainly buffers the data and translates addresses
* The flash-scatter stage spreads the requests across many underlying flash packages and manages background SSD-internal tasks in parallel

![](https://i.imgur.com/YBUuyIi.png)

* While the procedure of I/O services is managed by many threads in the different data processing paths, the threads can be allocated to any core in the network, in a parallel and scalable manner.

![](https://i.imgur.com/vshj2kn.png)

### Queue-gather Stage

#### NVMe queue management

When a host issues an NVMe command to an SQ and writes the corresponding doorbell, the firmware **fetches the command from the SQ and decodes a non-contiguous set of host physical memory pages** by referring to a kernel list structure called a physical region page (PRP).

* A challenge in employing many cores for parallel queue processing is that multiple NVMQ cores may simultaneously fetch the same set of NVMe commands from a single queue (Fig. 6(a))
* To address this challenge, one can make each core handle only a set of SQ/CQ pairs (Fig. 6(b)); see the sketch after the figure

![](https://i.imgur.com/BGfWEdJ.png)
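Below is a minimal sketch of this per-core SQ/CQ partitioning, assuming a simple modulo mapping from queue identifier to NVMQ thread; all names and constants are illustrative, not taken from the paper.

```c
/* Illustrative sketch, not the paper's code: statically partition NVMe SQ/CQ
 * pairs across NVMQ threads so that no two threads ever fetch the same
 * command from one queue (cf. Fig. 6(b)). */
#include <stdio.h>

#define NUM_NVMQ_THREADS 4U   /* assumed number of NVMQ threads     */
#define NUM_SQS          16U  /* assumed number of host SQ/CQ pairs */

/* Each SQ/CQ pair is owned by exactly one NVMQ thread. */
static unsigned sq_owner(unsigned sq_id)
{
    return sq_id % NUM_NVMQ_THREADS;
}

int main(void)
{
    /* Print the queue-to-thread mapping; a real NVMQ thread would poll only
     * the doorbells of the SQs it owns and skip all the others, so the
     * duplicate-fetch problem of Fig. 6(a) cannot occur. */
    for (unsigned sq = 0; sq < NUM_SQS; sq++)
        printf("SQ/CQ %2u -> NVMQ thread %u\n", sq, sq_owner(sq));
    return 0;
}
```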
#### I/O mutual exclusion

* Request-1 (a write) and request-2 (a read) create two different datapaths but target the same PPA (Fig. 6(c))
* We implement the **index lock as a red-black tree** and run this locking system as a dedicated thread (ILOCK). The red-black tree helps ILOCK quickly identify which lock to use, and reduces the overheads of lock acquisition and release

### Trans-apply Stage

#### Data caching and buffering

* Incarnates SSD-internal memory as a burst buffer by mapping LBAs to DRAM addresses
* CACHE threads are configured with a traditional direct-map cache
* Since all NVMQ threads possibly communicate with a CACHE thread for every I/O request, this can introduce extra latency imposed by passing messages among threads (Fig. 7(a))

![](https://i.imgur.com/W9ObsNr.png)

#### Parallel address translation

* We allocate the FTL address translation to multiple threads
* Address partitioning can make all TRANS threads operate in parallel without interference (Fig. 7(b))

### Flash-scatter Stage

#### Background task scheduling

* GCs can be performed in parallel by allocating separate core(s), referred to as BGC
* BGC(s) record the block numbers that have no more entries to write while TRANS threads process incoming I/O requests
* BGC reclaims blocks and updates the mapping table in the background

#### Journalling

* DeepFlash separates the journalling from TRANS and assigns it to a LOG thread

#### Parallel flash accesses

* FCMD threads
  * Compose flash transactions respecting the flash interface timing
  * Schedule them across different flash resources over the flash physical layer (PHY)

![](https://i.imgur.com/xUak9IX.png)

## Optimizing DeepFlash

#### Parallel Processing for NVMe Queue

Dynamic I/O serialization (DIOS) decouples the fetching and parsing processes of NVMe queue entries.

![](https://i.imgur.com/0rgQ264.png)

#### Index Lock Optimization

* When multiple NVMQ threads contend to acquire or release the same lock because they target the same address range, it can raise
  * lock contention
  * low resource utilization
* Acquisition
  * ILOCK checks the red-black (RB) tree and returns the owner ID for lock acquisition requests whose target LBA has a conflict. The requesting NVMQ thread receives the ID of the owning NVMQ thread and can **forward** the request there to be processed
* Release
  * If the target address is already held by another NVMQ thread, the lock requester can be stalled until the corresponding I/O service is completed
  * ILOCK directly removes the target node without an SQ inference process

![](https://i.imgur.com/24rPy06.png)

#### Non-blocking Cache

* Add a direct path between NVMQ and TRANS threads and make NVMQ threads access CACHE threads only if there is data in CACHE (Fig. 10(b))
* We allocate direct-map table(s) in a shared memory space to accommodate the cache metadata, so that NVMQ threads can look up the cache metadata on their own
* However, this simple approach may introduce inconsistency, so we add an "**evicted LPN**" field to each map table entry (see the sketch below)
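Below is a minimal sketch of such a shared direct-map metadata table, under the assumption that each entry records the currently cached LPN plus the LPN being evicted, so an NVMQ thread can decide on its own whether a request must go through a CACHE thread or can bypass directly to TRANS. The structures, field semantics, and names are illustrative, not the paper's actual implementation (which would also need atomic updates shared between CACHE and NVMQ threads).

```c
/* Illustrative sketch, not the paper's data structures: a direct-map cache
 * metadata table shared between CACHE and NVMQ threads, so NVMQ can look up
 * cache state without exchanging messages. The "evicted LPN" field marks a
 * page whose eviction is still in flight, forcing requests to that LPN to
 * keep going through CACHE until the eviction completes. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define CACHE_ENTRIES 4096U          /* assumed number of direct-map slots */
#define INVALID_LPN   UINT64_MAX

struct cache_meta {
    uint64_t cached_lpn;             /* LPN currently buffered in DRAM */
    uint64_t evicted_lpn;            /* LPN being evicted, if any      */
};

/* Placed in SSD-internal DRAM shared by CACHE and NVMQ threads. */
static struct cache_meta cache_map[CACHE_ENTRIES];

static unsigned slot_of(uint64_t lpn) { return (unsigned)(lpn % CACHE_ENTRIES); }

/* NVMQ-side decision: must this request visit a CACHE thread, or can it
 * bypass straight to a TRANS thread? */
static bool must_visit_cache(uint64_t lpn)
{
    const struct cache_meta *e = &cache_map[slot_of(lpn)];
    return e->cached_lpn == lpn      /* hit: data sits in the burst buffer  */
        || e->evicted_lpn == lpn;    /* eviction in flight: stay consistent */
}

int main(void)
{
    /* At boot, every slot would be initialized to INVALID_LPN. */
    for (unsigned i = 0; i < CACHE_ENTRIES; i++)
        cache_map[i] = (struct cache_meta){ INVALID_LPN, INVALID_LPN };

    cache_map[slot_of(42)].cached_lpn = 42;   /* pretend LPN 42 is buffered */

    printf("LPN 42: %s\n", must_visit_cache(42) ? "go through CACHE" : "bypass to TRANS");
    printf("LPN 43: %s\n", must_visit_cache(43) ? "go through CACHE" : "bypass to TRANS");
    return 0;
}
```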
## Evaluation

#### Implementation platform

* MIC 5120D accelerator that employs 60 lightweight in-order cores (4 hardware threads per core)
* The host employs a Xeon 16-core processor and 256 GB DRAM
* SSD platforms (12 cores)
  * DeepFlash
  * BaseDeepFlash (without the optimization techniques)
  * ManyLayered

![](https://i.imgur.com/ceo9Chj.png)

### Performance Analysis

* ManyLayered outperforms 750SSD and 4600SSD by 1.5× and 45%, on average
* BaseDeepFlash exhibits poor performance when the request size is smaller than 24KB with random patterns, because the NVMQ/ILOCK threads require tight inter-thread communication
* DeepFlash provides 4.8 GB/s and 4.5 GB/s bandwidth for reads and writes, respectively
* ManyLayered suffers from core/flash-level conflicts
* BaseDeepFlash suffers from lock/sync issues

![](https://i.imgur.com/hmT7Y5s.png)

* DeepFlash mostly activates 6.3 cores that run 25 threads to process I/O services in parallel
* Although many cores contend to acquire ILOCK, which leaves more cores idle, the burst buffer successfully hides the long write latency of the flash
* Figure 12e shows the active core decomposition of DeepFlash
* Reads require 23% more flash activities

![](https://i.imgur.com/xoQLUdj.png)

#### Server workload traces

* MPS generates more lock contention, due to more small-size random accesses
* DeepFlash's performance is not as good under DDR workloads because FCMD utilization stays below 56% due to the address patterns of DDR

![](https://i.imgur.com/AgogsyR.png)

### CPU, Energy and Power Analyses

#### CPU usage and different flashes

* As we increase the number of threads further, the performance gains start to diminish due to the overhead of exchanging many messages
* When 19 cores are employed, SLC, MLC, and TLC achieve their maximum bandwidths

![](https://i.imgur.com/KmOCrjs.png)

#### Power/Energy

* DeepFlash with 12 cores consumes 29 W, and this power consumption is higher than that of existing SSDs (20∼30 W)
* However, power-efficient manycores can be used to reduce the power of our prototype

#### Different CPUs

* Figure 14c compares the minimum number of cores that DeepFlash requires to achieve 1MIOPS for both reads and writes
* However, due to the complicated core logic, OoO-1.2G and OoO-2.4G consume 93% and 110% more power than IO-1G to achieve the same level of IOPS

### Performance Analysis of Optimizations

#### NVMQ

* Adding more NVMQ threads implies that the host also requires more cores, since NVMe allocates a queue per host CPU core

![](https://i.imgur.com/1yYwK6w.png)

#### ILOCK

![](https://i.imgur.com/TxNlrrP.png)

#### CACHE

* Even with more CACHE threads, performance gains diminish due to communication overhead
* DeepFlash's 2-Bypass, which employs the bypass technique (with only 2 threads), can be ideal as it requires fewer threads to achieve 1MIOPS

#### Background activities

* Once NVMQ is idle, LOG and BGC resume their work

![](https://i.imgur.com/b98CBEk.png)

#### Steady-state performance

* Figure 17b shows the impact of on-demand garbage collection (FGC) and journalling (FLOG) on the performance of DeepFlash. The results are compared to the ideal performance of DeepFlash (Pristine), which has no GC and LOG activities

## Conclusion

Our emulation prototype on a manycore-integrated accelerator reveals that DeepFlash processes beyond 1MIOPS while successfully hiding the long latency imposed by the internal flash media.