Stable Greedy : Adaptive Garbage Collection for Durable Page-Mapping Multichannel SSDs

# Stable Greedy : Adaptive Garbage Collection for Durable Page-Mapping Multichannel SSDs ###### tags: `paper` [toc] ## 1. Introduction  This study presents <font color="red">**Stable Greedy**</font> for **garbage collection** in page-mapping multichannel SSDs, which runs at a *constant time*, requires *limited RAM space* and also consider *flash wear leveling*.  ### Garbage collection Garbage collection have to erase the invalid data produce by out-of-place updates. However, it will also bring out data copying overhead. FTLs adopt some strategies to alleviate the consequences. 1. [Data separation](#2.2-Hot/nonhot-Separation) - Identify frequently updated data - Cluster data with **similar update frequency** in same blocks 2. [Victim selection](#2.3-Victim-Selection) - Reducing frequency of page data copying - Avoid copying the page data which is soon to become invalid. Although there are various techniques of garbage collection for page-level mapping, they are still not capable of handling **modern gigabyte-level flash**.  ### Multichannel architecture Recent novel technology such as **multilevel cells** increased the storage density but at the cost of degrading write performance. Thus, modern SSDs use **multichannel** for higher device write throughput. Every channel manages a reduced set of flash blocks independently in multichannel architectures.  ::: warning Problems under realistic workloads : - Channels have unbalanced inbound write traffic - Different garbage collection overhead - Thus, blocks in different channel may be worn at **different rates** ::: To **prolong device lifetime**, FTLs should perform **wear leveling** of blocks within the same or different channel.  ## 2. Background ### 2.1 Page-level mapping :::info ***Crucial techniques about garbage collection*** - **Hot/nonhot separation** ![](https://i.imgur.com/CHnLzgt.png) To reduce page copy - Identify hotness of data - Write hot data and nonhot data to different flash blocks - Reduce data copy overhead in garbage collection - **Victim selection** To reduce overhead of page copy - Select blocks with a small number of valid pages as a victim for short-term - Avoid selecting blocks with hot data as a victim for long-term :::  ### 2.2 Hot/nonhot Separation #### Different Hot/nonhot separation mechanisms:  - **Dynamic Data Clustering(DAC)** - :-1: : Performance is **highly sensitive** to the selection of age threshold. - :-1: : Require **prohibitively large RAM space** to storing page-level data age.  - Identifying the data carried by small write requests as hot block - :-1: : Large request may also carry hot data. - Using **multiple bloom filters** to track the frequency and recency of data updates - :+1: : Low RAM space overhead. - **N-chance policy** - :-1: : Require manual tuning for optimal performance $\rightarrow$ not adaptive.  ### 2.3 Victim Selection - **Well-known greedy policy** Select block of largest amount of invalid page. Realistic workloads have write localities $\rightarrow$ block may continue to receive more page invalidations. - :-1: : Premature erasure of flash blocks - **Cost-and-benefit policy** Low priority for blocks that have recently received page invalidations. Erase blocks with highest score : ${a \times (1 - u)} \over {2 \times u}$ $a$ : Age of blocks $u$ : Space utilization of blocks - :-1: : Serious scalability issues when rescoring all flash blokcs. - **Cost-Age-Time function** Consider both garbage collection and wear leveling. Erase blocks with lowest score : ${u} \over {((1 - u) \times a ) \times t}$ $t$ : Erase count of block $a$ : Time since the last erasure of blocks $u$ : Space utilization of blocks - :-1: : Serious scalability issues when rescoring all flash blokcs. - Heap-based implementation of **EF-Greedy policy** - Select the block with the largest sum of invalid page periods as the victim. - **List-based victim selection** - With preemptive garbage collection for reducing write latency of SSDs. ### 2.4 Multichannel Architectures :::info - All flash blocks are partitioned evenly among channels. - Locally idealy victim **may not be global best** $\rightarrow$ garbage collection would be less efficient. - Many prior garbage-collection algorithms performed poorly with a reich set of channels. - Channels can independently process flash commands and transfer data. - Channels could wear their blocks at **different rates**.  - Write buffer collect host data and flush them in parallel to as many channels as possible. - Static channel binding. - Dynamic channel binding. :::  ## 3. Stable Greedy : an efficient garbage-collection method :::success Efficient **garbage collection** requires 1. [Separating hot data from nonhot data](#3.1-Data-separation-policy) 2. [Good victim selection](#3.2-Victim-Selection-Policy) ::: ***Multilevel least recently invalidated(LRI) list*** :::info **Definition.** LRI maintains $P$ parallel list, $P$ is the total number of pages in a block. Level-$i$ list of the blocks has $i$ pages of valid data. ::: ![](https://i.imgur.com/9kxjByA.png) - If a block at level-$i$ receives a page invalidation, it will be promoted to the **rightmost of $i-1$th level** $\rightarrow$ The leftmost block of every list is the **LRI block** of the list.  ### 3.1 Data-separation policy > [***Why?***](#2.1-Page-level-mapping) ***Page lifetime*** :::info **Definition.** Time length between a page is written to new data and the instant when data are invalidated by the host. ::: ![](https://i.imgur.com/oVmQNhZ.png) - Page lifetime appear and produce **bimodal**. $\rightarrow$ using a lifetime threshlod for separating hot and nonhot data. - Average lifetime is a good threshold, FTL maintains threshold and recomputes it **periodically**.  ### 3.2 Victim Selection Policy :::success Reducing copy overhead in garbage collection - Short term - Select blocks with **low space utilizations** as victim. - Long term - Avoid a **premature** block erasure. :::  ***Block dormant period*** :::info **Definition.** Interval between the current time and the time of the latest page invalidation in the block. ::: ![](https://i.imgur.com/vTAh1Vn.png) - Majority of the dormant periods were **short** $\rightarrow$ most page invalidations occur in block with short dormant periods. - Using average dormant period as the threshold is effective to identify shory dormant periods. - FTL maintains a sum of block dormant periods and periodically recompute a new threshold. - Victim selection should favor blocks of long dormant period $\rightarrow$ more stable blocks.  :::success **Stable Greedy** - A victim selection policy considers both **block space utilization** and **block stability**. - Selecting a victim from stable blocks $\rightarrow$ minimizing space utilization of victim. - Only checks LRI list head from top level down to bottom level until finding **stable block**.  ::: ### 3.3 Block wear leveling based on multilevel LRI lists  ![](https://i.imgur.com/QV6vGvB.png) ***Old/Young Blocks*** :::info ***Definition.*** A block is an **old block** if the erase count of a block is larger than average, otherwise, the block is a **young block**. ::: - FTL maintains average erase counts(threshold) of all blocks. - :+1: : Erase counts of blocks are stored in their page spare area $\rightarrow$ **no extra space overhead**. :::success **Wear leveling** will be triggered when an old block is selected as victim. 1. Move valid data out of victim. 2. Erase the victim block. 3. Fill the victim with nonhot data (from one or multiple young blocks). 4. The young block becomes the free block instead of the old block ::: - Issues and solutions - Wear leveling must find nonhot data easily. - Stablest block remains in **lower-left corner** of LRI list. - All young block must have equal opportunities to participate wear leveling. - Blocks with long dormant period finally become a contributor of nonhot data. ## 4. Multichannel Architectures - Channels are independent in command processing and data transfer - They can perform garbage collection independently - They maintain their own data. - Multilevel LRI lists - Average page lifetime - Average block dormant period ### 4.1 Write Buffering and Channel Binding  ***Write Buffer*** :::success ***Purpose.*** Small write requests may not use all channels. Multichannel architectures employ a write buffer to improve channel utilizations. ::: ***Channel binding policy*** :::info ***Definition.*** When the write buffer evicts a buffer page, it determines the destination channel of the buffer page. It can be either static or dynamic. ::: #### Static binding policy, RAID-0 style striping - Channel number of a buffer page is residual number of $\frac{logical\ page \ address}{total\ channel\ number}$ - RAID-0 striping binds a set of sequential logical pages to the largest number of channels. - The major advantage is high read throughput - Drawbacks of static binding - Skewed channel write traffic - **Contention** of write buffer space among channels - The contention create many **idle periods** in the channels - Wear of blocks from different channels grew at different rates Because these drawbacks are **inherent in static binding**, this study investigated Stable Greedy with **dynamic binding**. ### 4.2 Dynamic Channel Binding :::info ***Definition.*** Delays channel binding until buffer evictions, and an evicted buffer page can be bound to **any channel**. ::: ![](https://i.imgur.com/dAJzDeC.png)  After the buffer page $a'$ is dispatched to the channel $C_3$, the original channel $C_0$ receives an invalidation for the old page data a. Stable Greedy updates the [information](#4.1-Write-Buffering-and-Channel-Binding) of the original channel $C_0$. #### 4.2.1 Round-Robin over Idle Channels (RRoI) :::info ***Definition.*** Dispatch the buffer pages to channels in **Round-Robin** manner. However, it will skip the busy channel until it finds an idle channel. ::: Major Design goals of dynamic channel binding 1. To keep all channels **busy at all times**. 2. To wear all blocks from different channels **equally**.  Advantages of RRoI - RRoI never generates **idle periods** in channels as long as write buffer is not empty. - At any time instant, a channel is either writing buffer pages or collecting garbage. ***Self-balancing effect*** ![](https://i.imgur.com/4EAWCvf.png) - Write amplification is **inversely proportional** to space overprovision - If total amount of blocks in channels are the same - The channel write amplifications were initially different, but gradually converged. #### 4.2.2 Balancing Channel Remaining lifetime ***Remaining lifetime*** :::info ***Definition.*** The ***remaining lifetime*** of channel $i$ is $$l_i = \frac{(\bar{e} - e^{avg}_i)\times B_i \times P}{w_i \times waf_i}$$ | Parameter | Definition | |:-----------:|:---------------------------------------------------------------------- | | $\bar{e}$ | Erase count limit of blocks | | $e^{avg}_i$ | Average erase count of blocks among channel | | $B_i$ | Amount of flash blocks in channel $C_i$ | | $P$ | Number of pages in a block | | $w_i$ | Average number of buffer pages dispatched to channel in a unit of time | | $waf_i$ | Write amplification ($\text{# data written into flash} \over \text{# data written from host}$) | It represents how long can a channel be used in a unit of time. ::: - $B_i$ and $P$ are constant $\rightarrow$ $l_i$ is related to $w_i$ and $waf_i$. - If a channel has more blocks, it have lower write amplification, which implies longer $l_i$ - Vice versa. - Nonhot data migration strategy barely effect $l_i$ - Increasing $waf_i$ accordingly reduces $w_i$.  ***Duty cycle*** :::info ***Definition.*** Time fraction during which the channel is busy writing buffer pages or performing garbage collection. $$d_i = \frac{w_j \times waf_j}{w_i \times waf_i}\times\frac{(\bar{e} - e^{avg}_i)\times B_i}{(\bar{e} - e^{avg}_j)\times B_j}$$ Let every channel maintain a **page write counter**. When the write counter of a channel reaches a predefined constant $K$, RRoI dispatches $K \times \frac{1-d}{d}$ no operation (NOP) commands. ::: :::success ***Purpose.*** Balancing the **remaining lifetime** of each channels, so that the blocks from different channel reach their **EOL** simultaneously. ::: - FTL idividually computes the duty cycle of every channel with the longest remaining lifetime.   ## 5. Experimental Results ### 5.1 Experimental Setup and Performance Metrices Compared with - Hybrid level - FAST - Superblock - page level - Greedy recycle blocks with smallest amount of valid pages - DAC - Dual Greedy predecessor of Stable Greedy - Workload - PC1 (Windows) - PC2 (Linux) - Server - Multimedia frequent creation and deletion of movie files - MSN MSN storage file server - Expected good result: - Low erase count $\rightarrow$ Efficiency garbage collection - High write throughput $\rightarrow$ Efficiency garbage collection and channel time utilization ### 5.2 Host Workload and Channel Number Garbage collection algorithm compared with Stable Greedy - Dual Greedy - Perform similarly when only one channel was present. - Stable Greedy outperform dual greedy when channel number increased. - [DAC](#2.2-Hot/nonhot-Separation) - DAC cannot adapt to the change of average page lifetime grew large for a long period of time. - DAC perform poorly under the Server workload.  - Greedy - Exhibited best performance under Iometer workload, because it's purely random write without locality. ### 5.3 Channel Binding and Write Buffering ![](https://i.imgur.com/HEmA5sQ.png) - Dynamic binding was considerably superior than static binding, especially with a large number of channels. - Static binding suffer from buffer space contention problem. - :-1: : Some channel might **become idle** $\rightarrow$ cannot utilize channel all the time.  ![](https://i.imgur.com/M25bXna.png) Figure shows that LRU had better erase counts and write throughput in every algorithm.  ### 5.4 Block Wear Leveling and Channel Lifetimes :::info ***Experimental Environment*** Workload : PC1 Total Block number   : $10752$ Numbers of channel : $4$ ($C_0$ to $C_4$) Erase count limit $\bar{e}$    : $500$ Threshold of wear leveling : $32$ Adjust duty cycle after the host wrote $300$GB ::: #### Part1 - All channel contain same amount of flash blocks :::info Flash blocks of each channel  : $2688$ Block-level wear leveling        : **Enable** Channel-level wear leveling    : **Disable** ::: - The blocks within every channel were **evenly worn**, as were the blocks from different channels. - [self-wear-leveling effect](#4.2.1-Round-Robin-over-Idle-Channels-(RROI)) - None of channels experienced **idle period**. #### Part2 - Channel contains different amount of flash blocks :::info Flash blocks of channel        : $1792, 1792, 3584, 3584$ Block-level wear leveling      : **Disable** Channel-level wear leveling : **Disable** ::: - Wear of blocks in the entire device was seriously imbalanced. :::info Flash blocks of channel        : $1792, 1792, 3584, 3584$ Block-level wear leveling      : **Enable** Channel-level wear leveling : **Disable** ::: - The blocks within same channel were **evenly worn**. - But average erase count of channel with **fewer blocks** grew **markedly faster** than channel with **more blocks**. :::info Flash blocks of channel        : $1792, 1792, 3584, 3584$ Block-level wear leveling      : **Enable** Channel-level wear leveling : **Enable** ::: - All of the blocks erase counts were even. - Average erase count in different channels were close.  ***Duty cycle Adjustment*** - Restored the balance of write amplification among channel. - Improved overall write amplification. ### 5.5 A Case study and Overhead Analysis ***Space overhead*** - Linearly related to the total number of blocks. - Block node - Two pointers for LRI lists - Timestamp for first-page writing - Timestamp for latest page invalidation - Each of them comprises $4B$ - Overall space overhead is $Number$ $of$ $Blocks \times 4 \times 4B$ ***Time overhead*** - Linearly related to the number of pages per block. - Victim selection only checks **list-head blocks** in the multilevel LRI lists. ## 6. Conclusion :::success ***Stable Greedy*** - A **constant time, adaptive** victim selection method was proposed based on the **multilevel LRI lists**. - **Hotness Separation:** Using block-level information([page lifetime](#3.1-Data-separation-policy)) to identify page-accurate data hotness. - **Victim selection:** - Short term: Select blocks with low space utilizations. (Greedy) - Long term: Avoid erasing blocks with short [dormant](#3.2-Victim-Selection-Policy) to **reduce the copy overhead** - **Duty cycle adjustment policy:** - Balancing the remaining lifetimes of the channels. - Improved overall write amplification. :::