# Stable Greedy : Adaptive Garbage Collection for Durable Page-Mapping Multichannel SSDs ###### tags: `paper` [toc] ## 1. Introduction <!-- From abstract --> This study presents <font color="red">**Stable Greedy**</font> for **garbage collection** in page-mapping multichannel SSDs, which runs at a *constant time*, requires *limited RAM space* and also consider *flash wear leveling*. <!-- page level High performance of small random writes but require large mapping table Thumb drive using block level to saving space consumer-level SSD using hybrid garbage collection -> data copying overhead sol: identify frequently updated data cluster data with similar update frequency in same blocks victim selection reducing frequency of page data copying during garbage collection avoid copying the page data which is soon to become invalid. Many recent studies focused on garbage collection for hybrid mapping are inapplicable to page-level mapping. --> ### Garbage collection Garbage collection have to erase the invalid data produce by out-of-place updates. However, it will also bring out data copying overhead. FTLs adopt some strategies to alleviate the consequences. 1. [Data separation](#2.2-Hot/nonhot-Separation) - Identify frequently updated data - Cluster data with **similar update frequency** in same blocks 2. [Victim selection](#2.3-Victim-Selection) - Reducing frequency of page data copying - Avoid copying the page data which is soon to become invalid. Although there are various techniques of garbage collection for page-level mapping, they are still not capable of handling **modern gigabyte-level flash**. <!-- primitive gc algos use too many parameters that require manual tuning. --> ### Multichannel architecture Recent novel technology such as **multilevel cells** increased the storage density but at the cost of degrading write performance. Thus, modern SSDs use **multichannel** for higher device write throughput. Every channel manages a reduced set of flash blocks independently in multichannel architectures. <!-- rather than global block set. Modern flash-manufacturing tech (e.g. mlc,tlc,qlc) 犧牲寫入效能及擦寫次數換取更大的空間 --> ::: warning Problems under realistic workloads : - Channels have unbalanced inbound write traffic - Different garbage collection overhead - Thus, blocks in different channel may be worn at **different rates** ::: To **prolong device lifetime**, FTLs should perform **wear leveling** of blocks within the same or different channel. <!-- :::spoiler Data separation policy - Pages of the same block share the same time stamp of data arrivals $\rightarrow$ using block-level information. ::: --> ## 2. Background ### 2.1 Page-level mapping :::info ***Crucial techniques about garbage collection*** - **Hot/nonhot separation** ![](https://i.imgur.com/CHnLzgt.png) To reduce page copy - Identify hotness of data - Write hot data and nonhot data to different flash blocks - Reduce data copy overhead in garbage collection - **Victim selection** To reduce overhead of page copy - Select blocks with a small number of valid pages as a victim for short-term - Avoid selecting blocks with hot data as a victim for long-term ::: <!-- #### Write amplification and spcae overprovision Write amplification 寫入要先copy副本再erase 所以通常flash會有overprovision(OP 超容量快取) 讓flash保留大量的空白block,即便硬碟使用率增加的話寫入效能也能盡量一致 write amplification is inversely proportional to space overprovision --> ### 2.2 Hot/nonhot Separation #### Different Hot/nonhot separation mechanisms: <!-- Separate Newly arriving and garbage-collected data gc data less hot than newly arriving --> - **Dynamic Data Clustering(DAC)** - :-1: : Performance is **highly sensitive** to the selection of age threshold. - :-1: : Require **prohibitively large RAM space** to storing page-level data age. <!-- - Groups flash block into different region for multiple levels of data hotness --> - Identifying the data carried by small write requests as hot block - :-1: : Large request may also carry hot data. - Using **multiple bloom filters** to track the frequency and recency of data updates - :+1: : Low RAM space overhead. - **N-chance policy** - :-1: : Require manual tuning for optimal performance $\rightarrow$ not adaptive. <!-- Different Hot/nonhot separation: Multiple level of data hotness for separation cons: Performance is highly sensitive to the selection of age threshold Require prohibitively large RAM space -> storing page-level data age Identifying data with small write request as hot data cons: Large request may also carry hot data --> ### 2.3 Victim Selection - **Well-known greedy policy** Select block of largest amount of invalid page. Realistic workloads have write localities $\rightarrow$ block may continue to receive more page invalidations. - :-1: : Premature erasure of flash blocks - **Cost-and-benefit policy** Low priority for blocks that have recently received page invalidations. Erase blocks with highest score : ${a \times (1 - u)} \over {2 \times u}$ $a$ : Age of blocks $u$ : Space utilization of blocks - :-1: : Serious scalability issues when rescoring all flash blokcs. - **Cost-Age-Time function** Consider both garbage collection and wear leveling. Erase blocks with lowest score : ${u} \over {((1 - u) \times a ) \times t}$ $t$ : Erase count of block $a$ : Time since the last erasure of blocks $u$ : Space utilization of blocks - :-1: : Serious scalability issues when rescoring all flash blokcs. - Heap-based implementation of **EF-Greedy policy** - Select the block with the largest sum of invalid page periods as the victim. - **List-based victim selection** - With preemptive garbage collection for reducing write latency of SSDs. ### 2.4 Multichannel Architectures :::info - All flash blocks are partitioned evenly among channels. - Locally idealy victim **may not be global best** $\rightarrow$ garbage collection would be less efficient. - Many prior garbage-collection algorithms performed poorly with a reich set of channels. - Channels can independently process flash commands and transfer data. - Channels could wear their blocks at **different rates**. <!-- Manage reduced set of blocks --> - Write buffer collect host data and flush them in parallel to as many channels as possible. - Static channel binding. - Dynamic channel binding. ::: <!-- large difference in channel write traffic and channel write amplification in RAID0 static binding for dyna channel binding little research on balancing flash wear -> propose remapping hot data among channels to balance wear rates --> ## 3. Stable Greedy : an efficient garbage-collection method :::success Efficient **garbage collection** requires 1. [Separating hot data from nonhot data](#3.1-Data-separation-policy) 2. [Good victim selection](#3.2-Victim-Selection-Policy) ::: ***Multilevel least recently invalidated(LRI) list*** :::info **Definition.** LRI maintains $P$ parallel list, $P$ is the total number of pages in a block. Level-$i$ list of the blocks has $i$ pages of valid data. ::: ![](https://i.imgur.com/9kxjByA.png) - If a block at level-$i$ receives a page invalidation, it will be promoted to the **rightmost of $i-1$th level** $\rightarrow$ The leftmost block of every list is the **LRI block** of the list. <!-- 新收到一個page invalidation的block會被推到下個level的最右側,因此每層最左側都是least recently invalidated Realistic workload have temporal locality -> if host modified a sector recently, the sector will be modified again page lifetime are subject only to the host write behaviors --> ### 3.1 Data-separation policy > [***Why?***](#2.1-Page-level-mapping) ***Page lifetime*** :::info **Definition.** Time length between a page is written to new data and the instant when data are invalidated by the host. ::: ![](https://i.imgur.com/oVmQNhZ.png) - Page lifetime appear and produce **bimodal**. $\rightarrow$ using a lifetime threshlod for separating hot and nonhot data. - Average lifetime is a good threshold, FTL maintains threshold and recomputes it **periodically**. <!-- Page lifetimes are subject only to the host write behaviors.!! FTL maintain sum of lifetime and periodically recompute the threshold. computing page lifetimes with reduced space overhead Time of writing the first page of a block can serve as every page writing time in the block FTL stores the first page-writing time of every block in RAM FTL perform comparison of lifetime of invalidated page against threshold. New page data is predicted by how long its prior version stayed alive --> ### 3.2 Victim Selection Policy :::success Reducing copy overhead in garbage collection - Short term - Select blocks with **low space utilizations** as victim. - Long term - Avoid a **premature** block erasure. ::: <!-- 一下就收到invalidation算是long term Block dormant periods -> index of maturity of block erasure --> ***Block dormant period*** :::info **Definition.** Interval between the current time and the time of the latest page invalidation in the block. ::: ![](https://i.imgur.com/vTAh1Vn.png) - Majority of the dormant periods were **short** $\rightarrow$ most page invalidations occur in block with short dormant periods. - Using average dormant period as the threshold is effective to identify shory dormant periods. - FTL maintains a sum of block dormant periods and periodically recompute a new threshold. - Victim selection should favor blocks of long dormant period $\rightarrow$ more stable blocks. <!-- hot data and sequential data contribute short dormant periods victim selection should favor longer dormant periods which is more stable Period length to recompute dormant period is same as recomputing page lifetime. Stable Greedy selecting a victime from stable blocks subject to minimizing space utilization. Stable Greedy only checks list head from top level down to bottom level until finding stable block. --> :::success **Stable Greedy** - A victim selection policy considers both **block space utilization** and **block stability**. - Selecting a victim from stable blocks $\rightarrow$ minimizing space utilization of victim. - Only checks LRI list head from top level down to bottom level until finding **stable block**. <!-- because leftmost is the most stable block --> ::: ### 3.3 Block wear leveling based on multilevel LRI lists <!-- LRI facilitate block level wear leveling nonhot data copying may involve multiple stable young blocks. These contributors cannot be victim until they contribute all their nonhot data. Erase count is checked only when the block is selected as victim. Proposed method can be revised to use the remaining erase count to support uneven cycle limits. --> ![](https://i.imgur.com/QV6vGvB.png) ***Old/Young Blocks*** :::info ***Definition.*** A block is an **old block** if the erase count of a block is larger than average, otherwise, the block is a **young block**. ::: - FTL maintains average erase counts(threshold) of all blocks. - :+1: : Erase counts of blocks are stored in their page spare area $\rightarrow$ **no extra space overhead**. :::success **Wear leveling** will be triggered when an old block is selected as victim. 1. Move valid data out of victim. 2. Erase the victim block. 3. Fill the victim with nonhot data (from one or multiple young blocks). 4. The young block becomes the free block instead of the old block ::: - Issues and solutions - Wear leveling must find nonhot data easily. - Stablest block remains in **lower-left corner** of LRI list. - All young block must have equal opportunities to participate wear leveling. - Blocks with long dormant period finally become a contributor of nonhot data. ## 4. Multichannel Architectures - Channels are independent in command processing and data transfer - They can perform garbage collection independently - They maintain their own data. - Multilevel LRI lists - Average page lifetime - Average block dormant period ### 4.1 Write Buffering and Channel Binding <!-- Realistic workloads : numerous small write requests does not use all channels. This study consider both FIFO and LRU write buffer because of stable greedy is independent of the buffer design. --> ***Write Buffer*** :::success ***Purpose.*** Small write requests may not use all channels. Multichannel architectures employ a write buffer to improve channel utilizations. ::: ***Channel binding policy*** :::info ***Definition.*** When the write buffer evicts a buffer page, it determines the destination channel of the buffer page. It can be either static or dynamic. ::: #### Static binding policy, RAID-0 style striping - Channel number of a buffer page is residual number of $\frac{logical\ page \ address}{total\ channel\ number}$ - RAID-0 striping binds a set of sequential logical pages to the largest number of channels. - The major advantage is high read throughput - Drawbacks of static binding - Skewed channel write traffic - **Contention** of write buffer space among channels - The contention create many **idle periods** in the channels - Wear of blocks from different channels grew at different rates Because these drawbacks are **inherent in static binding**, this study investigated Stable Greedy with **dynamic binding**. ### 4.2 Dynamic Channel Binding :::info ***Definition.*** Delays channel binding until buffer evictions, and an evicted buffer page can be bound to **any channel**. ::: ![](https://i.imgur.com/dAJzDeC.png) <!-- Let the original channel of a buffer page be the channel in which the old version of the page data resides. When write buffer binds a buffer page to a new channel,page invalidation occurs in the original channel of buffer page --> After the buffer page $a'$ is dispatched to the channel $C_3$, the original channel $C_0$ receives an invalidation for the old page data a. Stable Greedy updates the [information](#4.1-Write-Buffering-and-Channel-Binding) of the original channel $C_0$. #### 4.2.1 Round-Robin over Idle Channels (RRoI) :::info ***Definition.*** Dispatch the buffer pages to channels in **Round-Robin** manner. However, it will skip the busy channel until it finds an idle channel. ::: Major Design goals of dynamic channel binding 1. To keep all channels **busy at all times**. 2. To wear all blocks from different channels **equally**. <!-- Write amplification is inversely proportional to space overprovision channel write amplifications were initially different, but gradually converged. If the total amount of blocks in channels are different, RRoI won't converge their write amplification --> Advantages of RRoI - RRoI never generates **idle periods** in channels as long as write buffer is not empty. - At any time instant, a channel is either writing buffer pages or collecting garbage. ***Self-balancing effect*** ![](https://i.imgur.com/4EAWCvf.png) - Write amplification is **inversely proportional** to space overprovision - If total amount of blocks in channels are the same - The channel write amplifications were initially different, but gradually converged. #### 4.2.2 Balancing Channel Remaining lifetime ***Remaining lifetime*** :::info ***Definition.*** The ***remaining lifetime*** of channel $i$ is $$l_i = \frac{(\bar{e} - e^{avg}_i)\times B_i \times P}{w_i \times waf_i}$$ | Parameter | Definition | |:-----------:|:---------------------------------------------------------------------- | | $\bar{e}$ | Erase count limit of blocks | | $e^{avg}_i$ | Average erase count of blocks among channel | | $B_i$ | Amount of flash blocks in channel $C_i$ | | $P$ | Number of pages in a block | | $w_i$ | Average number of buffer pages dispatched to channel in a unit of time | | $waf_i$ | Write amplification ($\text{# data written into flash} \over \text{# data written from host}$) | It represents how long can a channel be used in a unit of time. ::: - $B_i$ and $P$ are constant $\rightarrow$ $l_i$ is related to $w_i$ and $waf_i$. - If a channel has more blocks, it have lower write amplification, which implies longer $l_i$ - Vice versa. - Nonhot data migration strategy barely effect $l_i$ - Increasing $waf_i$ accordingly reduces $w_i$. <!-- 整個Channel的page數量(Bi*P)除上單位時間的寫入page次數(Wi*放大) 再乘上平均p/e與上限的差 且就算移動nonhot data去降低OP以增加WA但因此會讓寫入buffer變少(Wi * WA)所以沒辦法影響remaining lifetime --> ***Duty cycle*** :::info ***Definition.*** Time fraction during which the channel is busy writing buffer pages or performing garbage collection. $$d_i = \frac{w_j \times waf_j}{w_i \times waf_i}\times\frac{(\bar{e} - e^{avg}_i)\times B_i}{(\bar{e} - e^{avg}_j)\times B_j}$$ Let every channel maintain a **page write counter**. When the write counter of a channel reaches a predefined constant $K$, RRoI dispatches $K \times \frac{1-d}{d}$ no operation (NOP) commands. ::: :::success ***Purpose.*** Balancing the **remaining lifetime** of each channels, so that the blocks from different channel reach their **EOL** simultaneously. ::: - FTL idividually computes the duty cycle of every channel with the longest remaining lifetime. <!-- duty cycle去平衡remaining lifetime不同 每個channel的duty cycle是去跟最大remaining lifetime的channel 計算 讓每個channel的remaining lifetime一致? 延後少block的channel出現壞塊 When write counter of a channel reaches a predefined constant K RRoI dispatches K x (1-d)/d no operation command(NOP) Channels with fewer blocks need lower duty cycles --> <!-- Channels with fewer blocks need lower duty cycles FTL periodically recomputes the channel duty cycles everytime after the host has written K pages of data. Self-balancing no longer exists if channels contain different amounts of blocks --> ## 5. Experimental Results ### 5.1 Experimental Setup and Performance Metrices Compared with - Hybrid level - FAST - Superblock - page level - Greedy recycle blocks with smallest amount of valid pages - DAC - Dual Greedy predecessor of Stable Greedy - Workload - PC1 (Windows) - PC2 (Linux) - Server - Multimedia frequent creation and deletion of movie files - MSN MSN storage file server - Expected good result: - Low erase count $\rightarrow$ Efficiency garbage collection - High write throughput $\rightarrow$ Efficiency garbage collection and channel time utilization ### 5.2 Host Workload and Channel Number Garbage collection algorithm compared with Stable Greedy - Dual Greedy - Perform similarly when only one channel was present. - Stable Greedy outperform dual greedy when channel number increased. - [DAC](#2.2-Hot/nonhot-Separation) - DAC cannot adapt to the change of average page lifetime grew large for a long period of time. - DAC perform poorly under the Server workload. <!-- Data separation during gc copy block age does not suggest whether block is receive new page invalidation in near future. It cannot change its parameter during runtime --> - Greedy - Exhibited best performance under Iometer workload, because it's purely random write without locality. ### 5.3 Channel Binding and Write Buffering ![](https://i.imgur.com/HEmA5sQ.png) - Dynamic binding was considerably superior than static binding, especially with a large number of channels. - Static binding suffer from buffer space contention problem. - :-1: : Some channel might **become idle** $\rightarrow$ cannot utilize channel all the time. <!-- two bank share one bus so 4*2 might be slightly slower than 8*1 Because two bank share one bus. --> ![](https://i.imgur.com/M25bXna.png) Figure shows that LRU had better erase counts and write throughput in every algorithm. <!-- - Because LRU reduced write traffic volume by buffering hot data --> ### 5.4 Block Wear Leveling and Channel Lifetimes :::info ***Experimental Environment*** Workload : PC1 Total Block number &nbsp;&thinsp;: $10752$ Numbers of channel : $4$ ($C_0$ to $C_4$) Erase count limit $\bar{e}$ &nbsp;&thinsp;&nbsp;: $500$ Threshold of wear leveling : $32$ Adjust duty cycle after the host wrote $300$GB ::: #### Part1 - All channel contain same amount of flash blocks :::info Flash blocks of each channel&nbsp; : $2688$ Block-level wear leveling &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;: **Enable** Channel-level wear leveling&nbsp;&nbsp;&nbsp;&thinsp;: **Disable** ::: - The blocks within every channel were **evenly worn**, as were the blocks from different channels. - [self-wear-leveling effect](#4.2.1-Round-Robin-over-Idle-Channels-(RROI)) - None of channels experienced **idle period**. #### Part2 - Channel contains different amount of flash blocks :::info Flash blocks of channel &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&thinsp;: $1792, 1792, 3584, 3584$ Block-level wear leveling &nbsp;&nbsp;&nbsp;&nbsp;&thinsp;: **Disable** Channel-level wear leveling&nbsp;: **Disable** ::: - Wear of blocks in the entire device was seriously imbalanced. :::info Flash blocks of channel &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&thinsp;: $1792, 1792, 3584, 3584$ Block-level wear leveling &nbsp;&nbsp;&nbsp;&nbsp;&thinsp;: **Enable** Channel-level wear leveling&nbsp;: **Disable** ::: - The blocks within same channel were **evenly worn**. - But average erase count of channel with **fewer blocks** grew **markedly faster** than channel with **more blocks**. :::info Flash blocks of channel &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&thinsp;: $1792, 1792, 3584, 3584$ Block-level wear leveling &nbsp;&nbsp;&nbsp;&nbsp;&thinsp;: **Enable** Channel-level wear leveling&nbsp;: **Enable** ::: - All of the blocks erase counts were even. - Average erase count in different channels were close. <!-- ***Throughput Degradation*** - Actual decrease of throughput was lower than theoretical. - By using NOP, data migration of valid data balance write amplification among channel. --> ***Duty cycle Adjustment*** - Restored the balance of write amplification among channel. - Improved overall write amplification. ### 5.5 A Case study and Overhead Analysis ***Space overhead*** - Linearly related to the total number of blocks. - Block node - Two pointers for LRI lists - Timestamp for first-page writing - Timestamp for latest page invalidation - Each of them comprises $4B$ - Overall space overhead is $Number$ $of$ $Blocks \times 4 \times 4B$ ***Time overhead*** - Linearly related to the number of pages per block. - Victim selection only checks **list-head blocks** in the multilevel LRI lists. ## 6. Conclusion :::success ***Stable Greedy*** - A **constant time, adaptive** victim selection method was proposed based on the **multilevel LRI lists**. - **Hotness Separation:** Using block-level information([page lifetime](#3.1-Data-separation-policy)) to identify page-accurate data hotness. - **Victim selection:** - Short term: Select blocks with low space utilizations. (Greedy) - Long term: Avoid erasing blocks with short [dormant](#3.2-Victim-Selection-Policy) to **reduce the copy overhead** - **Duty cycle adjustment policy:** - Balancing the remaining lifetimes of the channels. - Improved overall write amplification. ::: <!-- 結論好難寫 -->