<!-- Variables -->
[fig_for_cell]: https://www.cactus-tech.com/wp-content/uploads/2019/03/NAND-Cell-Cut-Away-31e620fa.png
[fig_for_string]: https://www.cactus-tech.com/wp-content/uploads/2019/03/NAND-Cell-String-Small-c937d893.png
[fig_for_array]: https://www.cactus-tech.com/wp-content/uploads/2019/03/NAND-Array-Small-0bca62df.png
[fig_for_page]: https://www.cactus-tech.com/wp-content/uploads/2019/03/NAND-Array-Page-Highlight-Small-16e871e3.png
[fig_for_block]: https://www.cactus-tech.com/wp-content/uploads/2019/03/NAND-Array-Block-Highlight-Small-b883d68a.png
[fig_for_plane]: https://www.cactus-tech.com/wp-content/uploads/2019/03/NAND-Array-Plane-Highlight-Small-56ae3ab4.png
[fig_for_die]: https://www.cactus-tech.com/wp-content/uploads/2019/03/NAND-Die-Multi-Plane-Small-19f75846.png
[fig_for_outside]: https://i.imgur.com/3J313yv.jpg

# Design Tradeoffs for SSD Performance

> [A useful reference](https://ir.nctu.edu.tw/bitstream/11536/45887/1/554201.pdf)
[name=Ting-En Liao]

###### tags: `paper`

[toc]

## Abstract

<font color="red">This paper presents a taxonomy of SSD design choices and analyzes the performance of various SSD configurations.</font>

## 1. Introduction

### SSDs

SSDs offer
- Exceptional bandwidth
- Strong random I/O performance
- Significant savings in the power budget
- A more reliable system

### System Issues

Several system issues are closely tied to SSD performance:
- Data placement
    - Not only to provide load balancing, but also to effect wear-leveling
- Parallelism
    - Bandwidth and operation rate :arrow_up:
- Write ordering
    - Small, randomly-ordered writes are especially tricky
- Workload management
    - Performance is highly workload-dependent

## 2. Background

<!-- A 4GB flash package consists of two dies; the dies can execute commands independently. -->

### Preliminary

:::info
***Terminologies.*** [Reference](https://www.cactus-tech.com/resources/blog/details/solid-state-drives-101)

**Inside a flash chip**

|            |                          Definition                           |               Usage                |           Figure           |
|:---------- |:-------------------------------------------------------------:|:----------------------------------:|:--------------------------:|
| Cell       | A floating-gate transistor                                     | Holds one or more bits of data     | [**Fig.**][fig_for_cell]   |
| String     | A series of connected cells                                    | Stores 32 or 64 bits of data       | [**Fig.**][fig_for_string] |
| Array      | Combined strings                                               | Achieves useful amounts of storage | [**Fig.**][fig_for_array]  |
| Page       | Comprises multiple cells that <br> share the same word line    | Minimum unit to read or write      | [**Fig.**][fig_for_page]   |
| Block      | Comprises multiple pages (strings)                             | Minimum unit to erase              | [**Fig.**][fig_for_block]  |
| Plane      | A bank of blocks                                               |                                    | [**Fig.**][fig_for_plane]  |
| Die (chip) | One or more planes                                             |                                    | [**Fig.**][fig_for_die]    |

**Outside a flash chip**

|                       |                 Definition                  |                  Usage                   |           Figure            |
|:--------------------- |:-------------------------------------------:|:----------------------------------------:|:---------------------------:|
| Chip enable line (CE) |                                             | Selects which chip is active on the bus  | [**Fig.**][fig_for_outside] |
| Data line (bus)       |                                             | Transmits the data                       | [**Fig.**][fig_for_outside] |
| Channel               | Chips connected to the same bus             |                                          | [**Fig.**][fig_for_outside] |
| Bank                  | Chips at the same position across channels  |                                          | [**Fig.**][fig_for_outside] |
| Gang                  | Channels connected to the same CEs          |                                          | [**Fig.**][fig_for_outside] |

---

<!-- Channel: parallelism within a package; gang: parallelism across multiple packages; package: one or multiple dies. -->

***Specification.***
![](https://i.imgur.com/jVsZQTp.png)
:::
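To make the page/block/plane/die hierarchy above concrete, here is a minimal sketch that computes the capacity at each level. The geometry numbers are illustrative assumptions (only the 4KB page and 256KB block sizes appear later in these notes), not values from a specific datasheet.

```python
# A minimal sketch of the NAND geometry hierarchy in the tables above.
# All geometry numbers are illustrative assumptions, not taken from
# any particular datasheet.

PAGE_SIZE = 4 * 1024      # bytes; the minimum read/write unit
PAGES_PER_BLOCK = 64      # a block is the minimum erase unit
BLOCKS_PER_PLANE = 2048   # assumed
PLANES_PER_DIE = 4        # assumed

block_size = PAGE_SIZE * PAGES_PER_BLOCK     # 256 KB (matches Sec. 3.1)
plane_size = block_size * BLOCKS_PER_PLANE   # 512 MB
die_size = plane_size * PLANES_PER_DIE       # 2 GB

print(f"block: {block_size // 2**10} KB")
print(f"plane: {plane_size // 2**20} MB")
print(f"die:   {die_size // 2**30} GB")
```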
### 2.1 Properties of Flash Memory

- The granularities of read/write and erase differ
- Read
    - Page granularity
    - Typically takes $25\mu s$ to read a page into the 4KB data register
    - Then takes roughly $100\mu s$ to shift the data out over the data bus
- Write
    - Page granularity
    - Shifting data into the data register takes $100\mu s$
    - Programming the page from the register into the flash array takes $200\mu s$
- Erase
    - Block granularity
    - Takes $1.5ms$ to erase a whole block
    - Each block endures only a limited number of erase cycles

### 2.2 Bandwidth and Interleaving

<!-- 100µs to move a 4KB page from the on-chip register to the off-chip controller. Serial 25+100µs: 32MB/sec; interleaved: 40MB/sec. -->

The serial interface over which flash packages receive commands and transmit data is a primary bottleneck for SSD performance.

For a **Read** operation, it takes $25\mu s$ to move data into the register and $100\mu s$ to transfer a page from the register to the controller.

For a **Write** operation, it takes $100\mu s$ to transfer data to the register and $200\mu s$ to program the page from the register.

If the two steps are performed serially, without interleaving, the second step dwarfs the first.

![](https://i.imgur.com/urEgqSJ.png)
![](https://i.imgur.com/ToNB7f3.png)

***Interleaving***
- Interleaving the serial transfer time with the program operation doubles the overall bandwidth (see the worked example below)
- A serious constraint for flash that supports interleaving:
    - Operations on the same flash plane-pair cannot be interleaved.

<!-- Copy-back: data need not be read into an external buffer; the on-chip page register can be used directly, allowing data to be copied to another block on-chip without crossing the serial pins. -->
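A quick back-of-the-envelope check of these timings explains the roughly 32MB/s vs. 40MB/s read figures noted above; a minimal sketch, assuming a 4KB page and two interleaved planes sharing one bus:

```python
# Effective read bandwidth with and without interleaving, using the
# timings from Section 2.1 (4KB page, 25us array read, 100us serial
# transfer). Illustrative arithmetic only.

PAGE = 4 * 1024   # bytes
T_READ = 25e-6    # array -> data register (seconds)
T_XFER = 100e-6   # register -> controller over the serial bus

# Serial: each page pays the full 125us before the next can start.
serial_bw = PAGE / (T_READ + T_XFER)

# Interleaved (e.g., two planes): the next page's 25us array read
# hides behind the current page's 100us bus transfer, so throughput
# is limited by the shared bus alone.
interleaved_bw = PAGE / T_XFER

print(f"serial:      {serial_bw / 1e6:.1f} MB/s")       # ~32.8 MB/s
print(f"interleaved: {interleaved_bw / 1e6:.1f} MB/s")  # ~41.0 MB/s
```

The same reasoning applies to writes, with the $100\mu s$ transfer hidden behind the $200\mu s$ program operation.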
## 3. SSD Basics

This section introduces the issues that arise when constructing an SSD from NAND flash components.

---

![](https://i.imgur.com/SyJmWmY.png)
> Generalized block diagram for an SSD

- **Host interface logic**
    - Supports the physical host interface connection
    - Logical disk emulation (such as via an *FTL*)
- **Internal buffer manager**
    - Holds pending and satisfied requests along the primary data path.
- **Flash demux/mux**
    - Emits commands and handles transport of data along the serial connections to the flash packages.
- **Processor and RAM**
    - Required to manage the request flow and the mappings from logical blocks to physical flash locations.

***Interconnection between the flash controller and flash packages***
- A 32GB SSD built with 8 of the Samsung parts would require 136 pins at the flash controller.
- Full connectivity is therefore not feasible for larger configurations.

This paper is primarily concerned with the **organization of the flash array** and the **algorithms** needed to manage mappings between logical disk and physical flash addresses.

### 3.1 Logical Block Map

<!-- With the smallest granularity, parallelism is excellent, but GC becomes hard: data is scattered too widely and locality becomes poor. -->

**Allocation pool**
A target logical page (4KB) is allocated from a pre-determined pool of flash memory. <br> The scope of an allocation pool might be as small as a flash plane or as large as multiple flash packages.

:::info
***Variables of allocation pools***
- **Static map**
    - A portion of each LBA constitutes a fixed mapping to a specific allocation pool.
- **Dynamic map**
    - The non-static portion of an LBA is the lookup key for a mapping within a pool.
- **Logical page size**
    - As large as a flash block (256KB), or as small as a quarter-page (1KB).
- **Page span**
    - A logical page might span related pages on different flash packages, creating the potential for accessing sections of the page in parallel.

***Constraints***
- **Load balancing**
    - I/O operations should be evenly balanced between allocation pools.
- **Parallel access**
    - The assignment of LBAs to physical addresses should interfere as little as possible with the ability to access those LBAs in **parallel**.
- **Block erasure**
    - Flash pages cannot be rewritten without first being erased.
:::

***Design tradeoffs*** (see the sketch after Section 3.3)
- A large portion of the LBA is statically mapped
    - Few LBA bits remain for dynamic mapping, limiting the freedom to load-balance.
- A contiguous range of LBAs is mapped to the same physical die
    - Performance of sequential access in large chunks will suffer.
- Small logical page size
    - More operations are required per request.
- Logical page size equal to the block size
    - Writes smaller than the logical page size result in a **read-modify-write** operation.

<!-- Read-modify-write: ![](https://i.imgur.com/nCLr5Gm.png) If the data being written is smaller than the effective page size, the whole effective page must first be read into a buffer, modified, and written back to flash, incurring extra overhead. -->

### 3.2 Cleaning

<!-- Although a given LBA is statically mapped to a specific allocation pool, cleaning can still operate at a smaller granularity via copy-back. -->

> Nothing new here :poop:

***Cleaning***: moving the non-**superseded** pages of a victim block elsewhere prior to erasure.

***Cleaning efficiency*** $= \frac{\text{superseded pages}}{\text{total pages}}$: used for choosing victim blocks.

***Over-provisioning***: in order to perform **cleaning**, there must be enough spare blocks to absorb incoming writes while cleaning proceeds.

### 3.3 Parallelism and Interconnect Density

Handling I/O requests on multiple flash packages in parallel achieves greater bandwidth and I/O rates. Here are some techniques for obtaining parallelism.

1. **Parallel requests**
    - In a **fully connected** flash array, each element is an independent entity and can therefore accept a separate flow of requests.
    - However, maintaining a queue per element may be an issue.
2. [**Ganging**](#Preliminary)
    - :+1: : Multiple packages can be used in parallel without multiple queues.
    - :-1: : Elements lie idle when requests don't span all of them.
    ![](https://i.imgur.com/hDGhuw2.png)
    > Gang diagrams with different configurations.
    - Figure 4.
        - The controller dynamically selects the target of each command.
        - :+1: : Increases capacity without requiring more pins.
        - :-1: : Does not increase bandwidth.
    - Figure 5.
        - With individual data paths to each package, synchronous operations that span multiple packages can proceed in parallel.
3. **Interleaving** - [Section 2.2](#2.2-Bandwidth-and-Interleaving)
    - Used to improve bandwidth.
4. **Background cleaning**
    - Using internal copy-back does not require data to cross the serial interface, and therefore reduces the cost of cleaning.

<!-- Is the data and command bus shared between different gangs? -->
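Before moving on, here is a minimal sketch tying together the logical block map (Section 3.1) and cleaning (Section 3.2): the low bits of an LBA statically select an allocation pool, a per-pool dynamic map tracks the current physical page, and every overwrite supersedes the old copy for cleaning to reclaim later. All names and parameters are hypothetical illustrations, not the paper's actual data structures.

```python
# A hypothetical page-level logical block map (Sections 3.1-3.2).
# Low LBA bits form the static map (pool selection); within a pool,
# a dynamic map tracks the current physical page of each LBA.

NUM_POOLS = 8  # e.g., one allocation pool per flash package (assumed)

class Pool:
    def __init__(self, num_pages):
        self.page_map = {}        # residual LBA -> physical page
        self.free_pages = list(range(num_pages))
        self.superseded = set()   # garbage pages awaiting cleaning

    def write(self, residual_lba):
        old = self.page_map.get(residual_lba)
        if old is not None:
            self.superseded.add(old)  # old copy becomes garbage
        new = self.free_pages.pop()   # pages are never rewritten in place
        self.page_map[residual_lba] = new
        return new

pools = [Pool(num_pages=1 << 20) for _ in range(NUM_POOLS)]

def write_lba(lba):
    pool = pools[lba % NUM_POOLS]        # static part of the mapping
    return pool.write(lba // NUM_POOLS)  # dynamic part within the pool

write_lba(42)  # first write of LBA 42
write_lba(42)  # overwrite: the previous physical page is now superseded
```

Striping consecutive LBAs across pools (the `lba % NUM_POOLS` above) is what enables the parallel access that Section 3.3 exploits.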
### 3.4 Persistence

In order to recover SSD state after a restart, it is essential to rebuild the **logical block map** and all related data structures.

- Error detection and correction
    - The flash part does not provide it; it is provided by application firmware.
    - **Page metadata** can hold an error-detection code used to determine valid pages and failure-free blocks.
    - It may be possible to extend block lifetime by using more robust error correction.
- Holding the logical block map in non-volatile memory (**Phase-Change** or **Magnetoresistive** RAM)
    - Writable at byte granularity
    - No block-erasure constraints like flash

### 3.5 Industry Trends

- **Consumer portable storage**
    - Appears as USB flash sticks or camera memories
    - Moderate bandwidth for sequential operations
    - Moderate random read performance and very poor random write performance
- **Laptop disk replacement**
    - Random read performance far superior to rotating media
    - Random write performance comparable to rotating media
- **Enterprise/database accelerators**
    - Promise very fast sequential performance
    - Very fast random read and write performance

![](https://i.imgur.com/vZbJMmJ.png)

- USB
    - Almost certainly uses a logical page granularity similar to the flash block size.
    - Limited random read performance $\rightarrow$ only **a single I/O** request is in flight at a time.
- MTron
    - Poor random write performance $\rightarrow$ a large logical page size is being used (read-modify-write costs).

## 4. Design Details and Evaluation

This section introduces a <font color="red">**trace-driven simulation environment**</font> that allows us to gain insight into SSD behavior under various workloads.

### 4.1 Simulator

<!-- Briefly introduces the simulator and what the authors added to simulate the configurations they wanted. -->

A brief introduction to the **simulator** and the **simulation environment**.

### 4.2 Workloads

An introduction to the workloads:

1. TPC-C
    - Instance of a well-established database benchmark
2. Exchange
    - Server running Microsoft Exchange
    - Specialized database workload with a 3:2 read-to-write ratio
3. [IOzone](https://www.iozone.org/)
    - Standard filesystem benchmark
    - Significant sequential I/O component
4. [Postmark](https://www.filesystems.org/docs/auto-pilot/Postmark.html)
    - Standard filesystem benchmark
    - Significant sequential I/O component
    - Very good locality: cleaning can always find a whole invalid block

<!-- Alignment is important: misaligned requests to flash add a page access to every read/write. -->

### 4.3 Simulation Results

<!-- Please help me fix this info box :chr1s878QQ: -->

:::info
***Baseline Configuration***
- $32GB$ of flash
- $8$ fully connected flash packages
- Allocation pools the size of a flash package
- Logical page and stripe size of $4KB$
- Cleaning requires transfers across the serial interface
- Overprovisioned by $15\%$
- Cleaning invoked when free blocks drop below $5\%$
:::

- **Microbenchmarks**
    ![](https://i.imgur.com/jnrOvOU.png)
    - Sequential and random I/Os have equivalent latencies in a **fully-connected** SSD.
    - Latency for write operations reflects the overhead caused by cleaning.
- **Page Size, Striping, and Interleaving**
    - Page size
        - A large logical page size may cause every small write to require a **read-modify-write** operation.
        - Under TPC-C, a 256KB logical page produces latency more than two orders of magnitude greater than a 4KB page.
    <!-- - Striping -->
    <!-- search NCQ -->
    - Interleaving
        - The figure below shows that IOzone and Postmark benefit from interleaving, while TPC-C and Exchange don't.
        - With no queuing, interleaving won't occur.

![](https://i.imgur.com/fssXUnj.png)

- **Gang Performance**
    - Shared-control gang
        - **Asynchronous shared-control**
            - The gang can perform flash operations on different gang members concurrently.
        - **Synchronous shared-control**
            - All packages in a gang are managed in synchrony, by setting the logical page size to span the gang (e.g., an 8-wide gang has a logical page size of $8 \times 4\text{KB} = 32\text{KB}$).
    - Since the logical page size of a synchronous gang is larger than that of the corresponding asynchronous gang $\rightarrow$ fewer simultaneous operations can be performed within a gang.

![](https://i.imgur.com/3i6zx2c.png)

- **Copy-back vs. Inter-plane Transfer**
    - Pages can be moved using the **copy-back** feature without having to transfer across the serial pins.
    - Using the **copy-back** feature, TPC-C shows an improvement in cleaning cost.
    - IOzone and Postmark exhibit perfect cleaning efficiency, so they move no pages during cleaning.

![](https://i.imgur.com/w57Q6Xa.png)

- **Cleaning Threshold**
    - Increasing the minimum free-blocks threshold may affect overall SSD performance, depending on how many pages cleaning must move under the workload.

![](https://i.imgur.com/JNvXjY1.png)
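As a concrete illustration of the cleaning policy running through Sections 3.2 and 4.3, here is a hypothetical greedy cleaner: it triggers once free blocks fall below the threshold ($5\%$ in the baseline) and picks the victim with the highest cleaning efficiency, i.e., the fewest valid pages to move. Names and numbers are illustrative, not the simulator's actual code.

```python
# A hypothetical greedy cleaner (Sections 3.2 and 4.3).
# Cleaning efficiency = superseded pages / total pages; the greedy
# policy erases the block with the highest efficiency, minimizing
# the valid pages that must be copied out before the erase.

PAGES_PER_BLOCK = 64     # assumed
CLEAN_THRESHOLD = 0.05   # baseline: clean when free blocks < 5%

def cleaning_efficiency(superseded_pages):
    return superseded_pages / PAGES_PER_BLOCK

def needs_cleaning(free_blocks, total_blocks):
    return free_blocks / total_blocks < CLEAN_THRESHOLD

def pick_victim(blocks):
    """blocks: dict mapping block id -> count of superseded pages."""
    victim = max(blocks, key=blocks.get)  # greedy choice
    pages_to_move = PAGES_PER_BLOCK - blocks[victim]
    return victim, pages_to_move

# Example: block 7 has 60/64 pages superseded (efficiency ~0.94),
# so only 4 valid pages must be moved before erasing it.
blocks = {3: 10, 7: 60, 9: 32}
print(pick_victim(blocks))  # -> (7, 4)
```

A workload with perfect cleaning efficiency (all 64 pages superseded, as IOzone and Postmark achieve above) moves zero pages, which is why copy-back buys them nothing.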
### 4.4 Tradeoff Summary

:::success
***Brief summary***

| Technique             | Positives        | Negatives           |
|:--------------------- |:---------------- |:------------------- |
| Large allocation pool | Load balancing   | Few intra-chip ops  |
| Large page size       | Small page table | Read-modify-writes  |
| Overprovisioning      | Less cleaning    | Reduced capacity    |
| Ganging               | Sparser wiring   | Reduced parallelism |
| Striping              | Concurrency      | Loss of locality    |
:::

## 5. Wear-leveling

> For further information, please refer to [this paper](https://hackmd.io/7GS0Ony_T8CI-4DYBl_MxA#33-Block-wear-leveling-based-on-multilevel-LRI-lists)

- **Cleaning**
    - Efficient cleaning may reduce overall wear, but it may not wear evenly.
    - A greedy approach may use the same blocks over and over again.

:::info
***Design Goal***
Delay the expiry of any single block.
Ensure that the remaining lifetime across all blocks stays equal within an acceptable variance.
:::

- **Simple approach**
    - Only allow recycling when a candidate's remaining lifetime exceeds a threshold.
    - :-1: : It may exclude many blocks from consideration, so the remaining blocks are recycled more frequently and with poor cleaning efficiency.
- **Greedy strategy**

## 6. Related Work

Skipped

## 7. Conclusion

There is significant interplay between the hardware and software components and the workload. This paper provides insight into how all of these components must cooperate in order to produce an SSD design that meets the performance goals of the targeted workload.