---
title: Scalable Parallel Flash Firmware for Many-core Architectures
tag : ssd
---
# Scalable Parallel Flash Firmware for Many-core Architectures
[paper](https://www.usenix.org/system/files/fast20-zhang_jie.pdf)
## Introduction
* Highly parallel I/O services require performing many SSD-internal tasks simultaneously, so the SSD needs multiple cores and a parallel firmware implementation to achieve higher throughput
* As the tasks inside the SSD increase, the SSD must address several scalability challenges brought by garbage collection, memory/storage contention and data consistency management when processing I/O requests in parallel.
* We propose DeepFlash, a manycore-based NVMe SSD platform that can process more than one million I/O requests per second (1MIOPS) while minimizing internal resource requirements
## Background
* a high-performance SSD architecture that Marvell recently published
* four Gen 3.0 PCIe lanes (4 GB/s)
* three embedded processors, each employing two cores

#### Datapath from PCIe to flash
* Figure 2b shows the FTL process, which performs steps 3~8

## Challenges to Exceeding 1MIOPS
We extend the baseline SSD architecture to a highly scalable environment: the Intel Many Integrated Core (MIC) platform, which **provides a high core count** to study parallelism and scalability
### Flash scaling
* A single MIC core, on which two threads are initialized to process the HIL and FTL tasks
* 16 MIC cores (one core per flash channel) to manage flash interface subsystems
* Testing 4KB sequential writes
* Performance bottleneck
* A single-core SSD controller is insufficient to fetch all the requests
* Parallelizing I/O accesses across many flash chips is faster than performing address translation on only one core, so the single core becomes the bottleneck

### Core scaling
* Naive achieves only 813K IOPS even with 32 cores, which is 82.6% lower than Expected, because **contention and consistency management for the internal DRAM memory space** introduce significant synchronization overheads
## Many-to-Many Threading Firmware
### Overview
* The queue-gather stage mainly parses NVMe requests and collects them into SSD-internal DRAM
* The trans-apply stage mainly buffers the data and translates addresses
* The flash scatter stage spreads the requests across many underlying flash packages and manages background SSD-internal tasks in parallel

* While the I/O service procedure is handled by many threads along different data-processing paths, the threads can be allocated to any core in the network, in a parallel and scalable manner.

### Queue-gather Stage
#### NVMe queue management
When a host issues an NVMe command to an SQ and writes the corresponding doorbell, the firmware **fetches the command from the SQ and decodes a non-contiguous set of host physical memory pages** by referring to a kernel list structure called a physical region page (PRP)
* A challenge in employing many cores for parallel queue processing is that multiple NVMQ cores may simultaneously fetch the same set of NVMe commands from a single queue (Fig. 6(a))
* To address this challenge, one can make each core handle only its own set of SQs/CQs (Fig. 6(b)); a minimal sketch follows
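
A minimal sketch of this static queue-to-core assignment, under assumed structures (`nvme_sq`, `nvme_cmd`) and a simple queue-index-modulo ownership rule; the real firmware fetches entries from host memory over PCIe rather than from a local array:

```c
#include <stdint.h>

#define NUM_SQS        32   /* submission queues exposed to the host   */
#define NUM_NVMQ_CORES 8    /* cores dedicated to NVMe queue handling  */
#define SQ_DEPTH       256

typedef struct {
    uint64_t cdw[8];        /* simplified 64-byte NVMe command          */
} nvme_cmd;

typedef struct {
    nvme_cmd slots[SQ_DEPTH];
    uint16_t head;          /* firmware-consumed index                  */
    uint16_t tail;          /* host-written doorbell value              */
} nvme_sq;

static nvme_sq sqs[NUM_SQS];

/* Each NVMQ core polls only the SQs it owns, so no two cores can ever
 * fetch the same command (the problem illustrated in Fig. 6(a)).       */
static int sq_owned_by(int sq_id, int core_id)
{
    return (sq_id % NUM_NVMQ_CORES) == core_id;
}

void nvmq_core_poll(int core_id, void (*dispatch)(const nvme_cmd *))
{
    for (int q = 0; q < NUM_SQS; q++) {
        if (!sq_owned_by(q, core_id))
            continue;
        nvme_sq *sq = &sqs[q];
        while (sq->head != sq->tail) {          /* commands pending     */
            dispatch(&sq->slots[sq->head]);     /* parse + hand to FTL  */
            sq->head = (sq->head + 1) % SQ_DEPTH;
        }
    }
}
```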

#### I/O mutual exclusion
* Request-1 (a write) and request-2 (a read) take two different paths but target the same PPA (Fig. 6(c))
* We implement the **index lock as a red-black tree** and run this locking system as a dedicated thread (ILOCK). The red-black tree helps ILOCK quickly identify which lock to use and reduces the overheads of lock acquisition and release; see the sketch below
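
A minimal sketch of an index lock keyed by logical page number, using POSIX `tsearch`/`tfind`/`tdelete` from `<search.h>` (glibc implements these on a red-black tree). The node layout and the single-threaded call pattern are illustrative assumptions; in DeepFlash the ILOCK thread would serve such requests via messages:

```c
#include <search.h>   /* tsearch/tfind/tdelete: red-black tree in glibc */
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t lpn;      /* locked logical page number                    */
    int      owner;    /* NVMQ thread currently holding the lock        */
} ilock_node;

static void *ilock_root;   /* tree root managed by the ILOCK thread     */

static int ilock_cmp(const void *a, const void *b)
{
    uint64_t x = ((const ilock_node *)a)->lpn;
    uint64_t y = ((const ilock_node *)b)->lpn;
    return (x > y) - (x < y);
}

/* Returns the requester id if the lock was granted, or the current
 * owner's id if the LPN is already locked.                             */
int ilock_acquire(uint64_t lpn, int requester)
{
    ilock_node key = { .lpn = lpn };
    void *hit = tfind(&key, &ilock_root, ilock_cmp);
    if (hit)                                   /* conflict: report owner */
        return (*(ilock_node **)hit)->owner;

    ilock_node *n = malloc(sizeof(*n));
    n->lpn = lpn;
    n->owner = requester;
    tsearch(n, &ilock_root, ilock_cmp);        /* O(log n) insertion     */
    return requester;
}

void ilock_release(uint64_t lpn)
{
    ilock_node key = { .lpn = lpn };
    void *hit = tfind(&key, &ilock_root, ilock_cmp);
    if (hit) {
        ilock_node *n = *(ilock_node **)hit;
        tdelete(n, &ilock_root, ilock_cmp);
        free(n);
    }
}
```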
### Trans-apply Stage
#### Data caching and buffering
* Incarnate SSD internal memory as a burst buffer by mapping LBAs to DRAM addresses
* CACHE threads are configured as a traditional direct-map cache (a minimal lookup sketch follows this list)
* Since every NVMQ thread may communicate with a CACHE thread for each I/O request, message passing among threads can introduce extra latency (Fig. 7(a))
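
A minimal direct-map lookup sketch for the burst buffer, with an assumed entry layout (valid/dirty bits, tag, DRAM offset) and a set/tag split derived from the LPN:

```c
#include <stdint.h>
#include <stdbool.h>

#define CACHE_SETS (1u << 16)          /* number of direct-mapped slots */

typedef struct {
    bool     valid;
    bool     dirty;
    uint64_t tag;                      /* upper LPN bits                */
    uint64_t dram_off;                 /* offset of the buffered page   */
} cache_entry;

static cache_entry cache_map[CACHE_SETS];

/* Direct-map lookup: each LPN can live in exactly one slot, so a probe
 * is a single index + tag compare (no search).                         */
bool cache_lookup(uint64_t lpn, uint64_t *dram_off)
{
    uint32_t set = lpn % CACHE_SETS;
    uint64_t tag = lpn / CACHE_SETS;
    cache_entry *e = &cache_map[set];

    if (e->valid && e->tag == tag) {   /* hit: serve from internal DRAM */
        *dram_off = e->dram_off;
        return true;
    }
    return false;                      /* miss: fall through to TRANS   */
}

/* Insert (or overwrite) the slot when a write is buffered in DRAM.     */
void cache_fill(uint64_t lpn, uint64_t dram_off, bool dirty)
{
    uint32_t set = lpn % CACHE_SETS;
    cache_map[set] = (cache_entry){ .valid = true, .dirty = dirty,
                                    .tag = lpn / CACHE_SETS,
                                    .dram_off = dram_off };
}
```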

#### Parallel address translation
* We allocate the FTL address translation to multiple threads.
* Address partitioning can make all TRANS threads operate in parallel without interference (Fig. 7(b)); a minimal partitioning sketch follows
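
A minimal sketch of static address partitioning, assuming LPNs are striped across TRANS threads by modulo and each thread owns a private slice of the page-mapping table; the striping rule and table layout are illustrative:

```c
#include <stdint.h>

#define NUM_TRANS_THREADS 4
#define PAGES_PER_THREAD  (1u << 20)   /* mapping-table slice per thread */

/* Each TRANS thread owns a private slice of the page-level mapping
 * table, so translations never touch another thread's state.           */
static uint64_t l2p[NUM_TRANS_THREADS][PAGES_PER_THREAD];

/* Striping rule: LPN modulo the number of TRANS threads.               */
static inline int trans_owner(uint64_t lpn)
{
    return (int)(lpn % NUM_TRANS_THREADS);
}

/* Executed only by the owning TRANS thread, so no locks are needed.    */
uint64_t translate(int thread_id, uint64_t lpn)
{
    uint64_t slot = (lpn / NUM_TRANS_THREADS) % PAGES_PER_THREAD;
    return l2p[thread_id][slot];       /* LPN -> PPN                    */
}

void update_mapping(int thread_id, uint64_t lpn, uint64_t ppn)
{
    uint64_t slot = (lpn / NUM_TRANS_THREADS) % PAGES_PER_THREAD;
    l2p[thread_id][slot] = ppn;
}
```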
### Flash-scatter Stage
#### Background task scheduling
* GCs can be performed in parallel by allocating separate core(s), referred to as BGC.
* BGC(s) record the block numbers that have no more entries to write while TRANS threads process incoming I/O requests
* BGC reclaims blocks and updates the mapping table in the background (a handoff sketch follows this list)
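
A minimal sketch of the TRANS-to-BGC handoff, assuming a single-producer/single-consumer ring in which a TRANS thread records fully written blocks and the BGC thread reclaims them in the background; the names and the reclamation callback are hypothetical:

```c
#include <stdint.h>
#include <stdatomic.h>

#define RING_SLOTS 1024

/* Single-producer / single-consumer ring: one TRANS thread records the
 * blocks that have no free pages left, one BGC thread reclaims them.   */
typedef struct {
    uint32_t    blocks[RING_SLOTS];
    atomic_uint head;                  /* consumed by BGC               */
    atomic_uint tail;                  /* produced by TRANS             */
} full_block_ring;

/* TRANS side: called when the last page of a block has been written.   */
int record_full_block(full_block_ring *r, uint32_t block)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SLOTS)
        return -1;                     /* ring full, retry later        */
    r->blocks[tail % RING_SLOTS] = block;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}

/* BGC side: drain the ring and reclaim blocks in the background.       */
void bgc_run_once(full_block_ring *r,
                  void (*reclaim)(uint32_t block))  /* copies valid pages,
                                                       erases, updates map */
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    while (head != tail) {
        reclaim(r->blocks[head % RING_SLOTS]);
        head++;
    }
    atomic_store_explicit(&r->head, head, memory_order_release);
}
```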
#### Journalling
* DeepFlash separates the journalling from TRANS and assigns it to a LOG thread (a minimal append sketch follows)
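
A minimal sketch of such a LOG thread's append path, assuming journal records of mapping-table updates that are flushed in batches; the record layout and the `flush` callback are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* One journal record per mapping-table update handed over by TRANS.    */
typedef struct {
    uint64_t lpn;
    uint64_t new_ppn;
    uint64_t seq;                      /* monotonic sequence number     */
} log_record;

#define LOG_CAPACITY 4096

static log_record journal[LOG_CAPACITY];
static size_t     log_head;            /* next free slot                */
static uint64_t   log_seq;

/* Executed only by the LOG thread, so no locking is required here;
 * TRANS threads simply enqueue update messages to LOG.                  */
void log_append(uint64_t lpn, uint64_t new_ppn,
                void (*flush)(const log_record *, size_t))
{
    journal[log_head++] = (log_record){ lpn, new_ppn, log_seq++ };
    if (log_head == LOG_CAPACITY) {    /* persist the full batch to flash */
        flush(journal, LOG_CAPACITY);
        log_head = 0;
    }
}
```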
#### Parallel flash accesses
* FCMD
* Compose flash transactions respecting flash interface timing
* Schedule them across different flash resources over the flash physical layer (PHY); a scheduling sketch follows this list
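
A minimal sketch of per-channel transaction queues with round-robin issue, assuming hypothetical `flash_xfer` fields and an `issue` callback standing in for the PHY-level command/address/data cycles and flash timing:

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_CHANNELS 16
#define QUEUE_DEPTH  64

typedef enum { FLASH_READ, FLASH_PROGRAM, FLASH_ERASE } flash_op;

/* One flash transaction: command + address cycles + (optional) data.   */
typedef struct {
    flash_op op;
    uint8_t  way, plane;
    uint32_t block, page;
    void    *data;
} flash_xfer;

typedef struct {
    flash_xfer q[QUEUE_DEPTH];
    size_t     head, tail;
} channel_queue;

static channel_queue chq[NUM_CHANNELS];

int fcmd_enqueue(int channel, flash_xfer x)
{
    channel_queue *c = &chq[channel];
    if (c->tail - c->head == QUEUE_DEPTH)
        return -1;                       /* channel queue full          */
    c->q[c->tail++ % QUEUE_DEPTH] = x;
    return 0;
}

/* Round-robin over channels so independent channels stay busy; the
 * issue callback would drive the PHY while honoring the flash
 * interface timing (e.g., tR/tPROG/tBERS).                              */
void fcmd_schedule(void (*issue)(int channel, const flash_xfer *))
{
    for (int ch = 0; ch < NUM_CHANNELS; ch++) {
        channel_queue *c = &chq[ch];
        if (c->head != c->tail)
            issue(ch, &c->q[c->head++ % QUEUE_DEPTH]);
    }
}
```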

## Optimizing DeepFlash
#### Parallel Processing for NVMe Queue
Dynamic I/O serialization (DIOS) decouples the fetching and parsing processes of NVMe queue entries.
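
One possible reading of this decoupling, sketched with a shared staging ring: one stage copies raw SQ entries in order, while any idle thread claims and parses the next entry. The ring layout and atomic-claim scheme are assumptions, not the paper's exact mechanism, and the wrap-around copy race is ignored for brevity:

```c
#include <stdint.h>
#include <stdatomic.h>

#define RAW_SLOTS 4096

typedef struct { uint64_t cdw[8]; } raw_cmd;   /* unparsed 64-byte entry */

/* Shared staging ring: the fetch stage copies raw SQ entries in order,
 * while multiple parser threads claim them with an atomic counter.      */
static raw_cmd     staging[RAW_SLOTS];
static atomic_uint produced;           /* advanced by the fetch stage    */
static atomic_uint claimed;            /* advanced by parser threads     */

/* Fetch stage: copy one raw entry (e.g., DMA'd from a host SQ).         */
int dios_fetch(const raw_cmd *entry)
{
    unsigned p = atomic_load_explicit(&produced, memory_order_relaxed);
    unsigned c = atomic_load_explicit(&claimed,  memory_order_acquire);
    if (p - c == RAW_SLOTS)
        return -1;                     /* staging ring is full           */
    staging[p % RAW_SLOTS] = *entry;
    atomic_store_explicit(&produced, p + 1, memory_order_release);
    return 0;
}

/* Parse stage: any idle NVMQ thread claims the next unparsed entry, so
 * serial in-order fetching and parallel parsing are decoupled.          */
int dios_parse_next(raw_cmd *out)
{
    unsigned c = atomic_load_explicit(&claimed, memory_order_relaxed);
    for (;;) {
        if (c >= atomic_load_explicit(&produced, memory_order_acquire))
            return -1;                 /* nothing to parse yet           */
        if (atomic_compare_exchange_weak_explicit(&claimed, &c, c + 1,
                memory_order_acq_rel, memory_order_relaxed)) {
            *out = staging[c % RAW_SLOTS];
            return 0;
        }
        /* CAS failed: c now holds the current value; retry.             */
    }
}
```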

#### Index Lock Optimization
* When multiple NVMQ threads contend to acquire or release the same lock because they target the same address range, this can cause
* lock contention
* low resource utilization
* acquisition
* ILOCK checks whether the target LBA has a conflict by looking up the red-black (RB) tree, and returns the owner ID for the lock acquisition request. The requesting NVMQ thread then receives the ID of the owning NVMQ thread and can **forward** the request there to be processed (see the sketch after this list)
* release
* If the target address is already held by another NVMQ thread, the lock requester can be stalled until the corresponding I/O service is completed
* It directly removes the target node without an SQ inference process.
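
A minimal sketch of the forwarding path on a lock conflict, assuming an acquire hook that returns the owner's thread ID (as in the ILOCK sketch above) and a hypothetical per-thread mailbox; instead of stalling, the requester hands the conflicting request to the owner:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint64_t lpn; bool is_write; } io_request;

/* Hooks into the rest of the firmware (all hypothetical names):
 *  - acquire: returns `self` if granted, else the owner's thread ID
 *  - release: drops the RB-tree node for this LPN
 *  - forward: pushes the request into another NVMQ thread's mailbox
 *  - process: services the request on the local thread                 */
typedef struct {
    int  (*acquire)(uint64_t lpn, int self);
    void (*release)(uint64_t lpn);
    void (*forward)(int owner, const io_request *req);
    void (*process)(int self, const io_request *req);
} nvmq_hooks;

void nvmq_handle(const nvmq_hooks *h, int self, const io_request *req)
{
    int owner = h->acquire(req->lpn, self);
    if (owner == self) {
        h->process(self, req);         /* we hold the index lock        */
        h->release(req->lpn);
    } else {
        /* Conflict: do not stall. Forward the request to the owning
         * NVMQ thread, which already serializes I/O on this LPN.        */
        h->forward(owner, req);
    }
}
```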

#### Non-blocking Cache
* Add a direct path between NVMQ and TRANS threads and make NVMQ threads access CACHE threads only if the data is present in CACHE (Fig. 10(b))
* We allocate direct-map table(s) in a shared memory space to accommodate the cache metadata so that NVMQ threads can look up the cache metadata on their own
* However, this simple approach may introduce inconsistency. We add an “**evicted LPN**” field to each map-table entry; a sketch of the shared table follows
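
A minimal sketch of the shared map table with the added evicted-LPN field, letting NVMQ threads route requests on their own; the field names and routing rule are illustrative, and the consistency handling is simplified:

```c
#include <stdint.h>

#define MAP_SLOTS (1u << 16)
#define NO_LPN    UINT64_MAX

/* Shared (read-mostly) cache metadata: NVMQ threads look it up on their
 * own instead of messaging a CACHE thread for every request.             */
typedef struct {
    uint64_t cached_lpn;   /* LPN currently buffered in internal DRAM     */
    uint64_t evicted_lpn;  /* LPN being written back to flash, if any     */
} map_entry;

static map_entry shared_map[MAP_SLOTS];

void map_init(void)
{
    for (uint32_t i = 0; i < MAP_SLOTS; i++)
        shared_map[i] = (map_entry){ NO_LPN, NO_LPN };
}

typedef enum { GO_CACHE, GO_TRANS } route;

/* Decide where to send the request without blocking on CACHE.            */
route nvmq_route(uint64_t lpn)
{
    map_entry *e = &shared_map[lpn % MAP_SLOTS];

    if (e->cached_lpn == lpn)
        return GO_CACHE;               /* data sits in the burst buffer    */

    if (e->evicted_lpn == lpn)
        return GO_CACHE;               /* eviction in flight: let CACHE
                                          serialize it to stay consistent  */

    return GO_TRANS;                   /* miss: take the direct path       */
}
```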
## Evaluation
#### Implementation platform
* MIC 5120D accelerator that employs 60 lightweight in-order cores (4 hardware threads per core)
* The host employs a Xeon 16-core processor and 256 GB DRAM
* SSD platform (12 cores)
* DeepFlash
* BaseDeepFlash (without the optimization techniques applied)
* ManyLayered

### Performance Analysis
* ManyLayered outperforms 750SSD and 4600SSD by 1.5× and 45%, on average
* BaseDeepFlash exhibits poor performance when the request size is smaller than 24KB with random patterns, because NVMQ/ILOCK threads require tight inter-thread communication
* DeepFlash provides 4.8 GB/s and 4.5 GB/s bandwidth for reads and writes
* ManyLayered suffers from core/flash-level conflicts
* BaseDeepFlash suffers from lock/sync issues

* DeepFlash mostly activates 6.3 cores that run 25 threads to process I/O services in parallel
* Although many cores contend to acquire ILOCK, which leaves more cores idle, the burst buffer successfully hides the long write latency of the flash
* Figure 12e shows the active core decomposition of DeepFlash
* Reads require 23% more flash activities

#### Server workload traces
* MPS generates more lock contention, due to more small-size random accesses
* DeepFlash's performance is not as good under DDR workloads because FCMD utilization is lower than 56% due to the address patterns of DDR

### CPU, Energy and Power Analyses
#### CPU usage and different flashes
* When we increase the number of threads more, the performance gains start to diminish due to the overhead of exchanging many messages
* When 19 cores are employed, SLC, MLC, and TLC achieve the maximum bandwidths

#### Power/Energy
* DeepFlash with 12 cores consumes 29 W, which is higher than the power consumption of existing SSDs (20∼30 W)
* However, power-efficient manycores can be used to reduce the power of our prototype
#### Different CPUs
* Figure 14c compares the minimum number of cores that DeepFlash requires to achieve 1MIOPS for both reads and writes
* However, due to the complicated core logic, OoO-1.2G and OoO-2.4G consume 93% and 110% more power than IO-1G to achieve the same level of IOPS
### Performance Analysis of Optimization
#### NVMQ
* This implies that the host also requires more cores since NVMe allocates a queue per host CPU core

#### ILOCK

#### CACHE
* Even with more CACHE threads, performance gains diminish due to communication overhead
* DeepFlash’s 2-Bypass, which employs the bypass technique (with only 2 threads), can be ideal as it requires fewer threads to achieve 1MIOPS
#### Background activities
* Once NVMQ is idle, LOG and BGC resume their work

#### Steady-state performance
* Figure 17b shows the impact of on-demand garbage collection (FGC) and journalling (FLOG) on the performance of DeepFlash. The results are compared to the ideal performance of DeepFlash (Pristine), which has no GC and LOG activities
## Conclusion
Our emulation prototype on a manycore-integrated accelerator reveals that DeepFlash processes more than 1MIOPS while successfully hiding the long latency imposed by internal flash media.