# Lecture 15: File Systems – Naming and Reliability
# Slide Notes
Pretty much nothing you have not seen; the readings have more to do with Lecture 15. Have the slides open during the final.
# Chapter 42 Crash Consistency: FSCK and Journaling
One major challenge faced by a file system is how to update persistent data structures despite the possibility of a **power loss** or **system crash**. This leads us to the problem of **crash consistency**.
## A Detailed Example
Remember that when we append to a file we are adding a new data block to it and thus must update three on-disk structures: the inode, the new data block, and a new version of the data bitmap. (Important to remember for the crash scenarios.)
### Crash Scenarios
There are three possible outcomes in which only one of the writes succeeds:
1. Just the data block is written to the disk without updating the inode and bitmap
2. Just the updated inode is written to disk, so a read through that inode returns garbage / wrong contents from the disk
- this also leaves an inconsistency: the bitmap says the data block has not been allocated, while the inode says it has
3. Just the bitmap is updated which results in a **space leak** and the data block will never get used
There are also three cases in which two of the three writes succeed:
1. The inode and bitmap are written, but the inode points to garbage (the data block never made it to disk)
2. The inode and data block are written, but the bitmap conflicts with the inode data
3. The bitmap and data block are written, but the system has no idea which file the block belongs to, as no inode points to it
#### The Crash Consistency Problem
Ideally we want to move the file system from one consistent state to another **atomically**. This is not easy, as the disk only commits one write at a time (the **crash-consistency problem**, or **consistent-update problem**).
## Solution #1: The File System Checker
Basically this approach lets inconsistencies happen and fixes them later, upon reboot.
**fsck** is a unix tool for finding such inconsistencies and repairing them.
Note: this cannot fix everything (e.g. bitmap and inode are fine but point to garbage data). It only really makes sure the file system metadata is internally consistent.
fsck runs before the file system is mounted and made available; once it finishes, the on-disk file system should be consistent and can be made accessible to users.
Summary of process:
1. First check whether the superblock looks reasonable, e.g. make sure the file system size is greater than the number of blocks that have been allocated.
- goal is to find a corrupt superblock
2. Next scan the inodes, indirect blocks, double indirect blocks, etc., and use them to produce a correct version of the allocation bitmaps; then check that the bitmaps are consistent with the inodes.
3. Each inode is checked for corruption or other problems (e.g. check for a valid type field)
4. verify the link count of each allocated inode.
- if an allocated inode is discovered but no directory refers to it, it is moved to the lost+found directory
5. check for duplicate pointers
6. check for bad block pointers (something pointing outside of its valid range)
7. Check the directories to ensure that no directory is linked more than once in the entire hierarchy
The biggest problem with fsck is that it is too slow and expensive; it is akin to dropping your keys on the floor and then commencing a search-the-entire-house-for-keys recovery algorithm.
## Solution #2: Journaling (or Write-Ahead Logging)
The basic idea of **write-ahead logging**: when updating the disk, before overwriting the structures in place, first write down a little note describing what you are about to do.
If a crash takes place, you can go back, look at the note you made, and try the update again; this way you will know exactly what to fix.

### Data Journaling
What the log looks like:

The transaction begins with a **transaction begin block (TxB)**, which tells us about the pending update and contains some kind of **transaction identifier (TID)**.
**physical logging** is where we put the exact physical contents of the update in the journal. (**logical logging** is a more compact version).
**TxE** is the marker for the end of this transaction (also contains TID).
Once the journal write is safely on disk, we overwrite the old structures in their final on-disk locations; this step is called **checkpointing**, and when it completes the file system is up to date.
Ideally we would issue all the journal block writes at once to speed up the process; however, this is unsafe, because the disk may reorder or only partially complete them when a crash occurs (e.g., the journal entry looks proper but is missing only the data block).
To avoid this problem, the file system issues the transactional write in two steps:
1. write all blocks except the TxE block to the journal, issuing these writes all at once.
2. Only after the first step completes does the file system issue the write of the TxE block; this commits the transaction.
This guarantees that after a crash we can tell whether a transaction was fully committed.
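As a rough illustration of the two-step commit, here is a minimal Python sketch; the dict standing in for the journal, the block names, and the atomic-single-write assumption are all illustrative, not from any real file system.

```python
# Toy journal: maps a journal slot to its contents; assume each single-slot
# write is atomic (as a real disk guarantees for a single sector).
journal = {}

def write_transaction(tid, blocks):
    journal[0] = ("TxB", tid)                  # step 1: TxB plus all logged blocks,
    for i, blk in enumerate(blocks, start=1):  # issued together in any order
        journal[i] = ("DATA", blk)
    # ... wait for the writes above to reach the disk (barrier/flush) ...
    journal[len(blocks) + 1] = ("TxE", tid)    # step 2: TxE alone; this commits

write_transaction(tid=7, blocks=["inode v2", "bitmap v2", "data block Db"])
print(journal)
```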
### Recovery
If the crash happens before the transaction is written safely to the log then the pending disk update is skipped.
If it happens after, the recovery process scans the log and looks for transactions that have been committed to the disk. These transactions are replayed in order, with the file system again attempting to write out the blocks in the transaction to their final on-disk locations (**redo logging**).
Because recovery is rare, a few redundant writes are nothing to worry about.
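A sketch of redo-logging recovery under the same toy model (the journal entry format and the `disk` dict are made up for illustration):

```python
# Scan the journal in order; replay only transactions that are fully committed
# (i.e., a matching TxE was found). Uncommitted transactions are skipped.
def recover(journal_entries, disk):
    tid, pending = None, []
    for kind, payload in journal_entries:
        if kind == "TxB":
            tid, pending = payload, []
        elif kind == "DATA":
            pending.append(payload)               # (final address, contents)
        elif kind == "TxE" and payload == tid:
            for addr, contents in pending:        # redo the committed update
                disk[addr] = contents
    return disk

log = [("TxB", 7), ("DATA", (12, "inode v2")), ("DATA", (4, "bitmap v2"))]
print(recover(log, disk={}))   # no TxE for tid 7, so nothing is replayed
```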
### Batching Log Updates
To avoid excessive write traffic when many files are updated around the same time, some file systems do not commit each update to disk one at a time; instead they buffer all updates into a single global transaction, which is then committed as one.
### Making The Log Finite
Two problems arise when the log becomes full:
1. The larger the log, the longer recovery will take, as the recovery process must replay all the transactions within the log (in order)
2. When the log is full (or nearly full), no further transactions can be committed to disk, thus making the file system useless.
The fix is a **circular log**: once a transaction has been checkpointed, the file system frees the space it was occupying within the journal, allowing the log space to be reused.
### Metadata Journaling
Adding a (data) journal results in double the write traffic, so writes operate at half of peak bandwidth.
The fix is **metadata journaling**, where user data is **not** written to the journal (only metadata is).
Unfortunately this comes with an issue if the data block is written to disk only after the transaction commits: in the crash case where the journaled inode and bitmap made it but the data did not, the data block is not in the log and therefore cannot be recovered; the inode ends up pointing to garbage.
Therefore the data block must be written to disk before the related metadata is committed to the journal.
The resulting protocol (there are variations; see the sketch after this list):
1. Data write: Write data to final location; wait for completion (the wait is optional; see below for details).
2. Journal metadata write: Write the begin block and metadata to the log; wait for writes to complete.
3. Journal commit: Write the transaction commit block (containing TxE) to the log; wait for the write to complete; the transaction (including data) is now committed.
4. Checkpoint metadata: Write the contents of the metadata update to their final locations within the file system.
5. Free: Later, mark the transaction free in journal superblock.
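A minimal sketch of that ordering, assuming hypothetical `write(location, block)` and `flush()` helpers for issuing block I/O and waiting for it:

```python
def append_with_metadata_journal(write, flush, data, metadata, txb, txe):
    write("final data location", data)            # 1. data write
    flush()                                        # (this wait is optional)
    write("journal", txb)                          # 2. journal metadata write
    write("journal", metadata)
    flush()
    write("journal", txe)                          # 3. journal commit
    flush()                                        # transaction now committed
    write("final metadata location", metadata)     # 4. checkpoint metadata
    flush()
    # 5. later: mark the transaction free in the journal superblock

trace = []
append_with_metadata_journal(lambda loc, b: trace.append((loc, b)),
                             lambda: None, "Db", "I[v2]+B[v2]", "TxB", "TxE")
print(trace)
```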
### Tricky Cases: Block Reuse
Suppose you are using some form of metadata journaling: you allocate a block (say, for a directory, whose contents are journaled as metadata), then delete it, and then create a new file whose data reuses that same block.
If a crash occurs and the journal is replayed, the write of the old directory data to that block overwrites the new file's user data with the old directory contents.
One potential solution is to never reuse a block until the delete of said block is checkpointed out of the journal.
Instead, file systems write a **revoke** record for the deletion into the journal; during replay, revoke records are scanned first, and any revoked data is never replayed.
### Wrapping Up Journaling: A Timeline

Finally, note that the time of completion marked for each write in the timelines is arbitrary. In a real system, completion time is determined by the I/O subsystem, which may reorder writes to improve performance.
## Solution #3: Other Approaches
Another approach is **soft updates**, which carefully orders all writes to the file system to ensure that the on-disk structures are never left in an inconsistent state. This requires intricate knowledge of each file system data structure, which adds a fair amount of complexity to the system.
Yet another approach is **copy-on-write**, which never overwrites files or directories in place; rather, it places new updates in previously unused locations on disk. After a number of updates are complete, COW file systems flip the root structure of the file system to include pointers to the newly updated structures.
Another approach is **backpointer-based consistency (BBC)**. To achieve consistency, an additional back pointer is added to every block in the system; when accessing a file, the file system can determine whether the file is consistent by checking that the forward pointer points to a block that refers back to it.
Finally, there is **optimistic crash consistency**, which issues as many writes to disk as possible, using a generalized form of the **transaction checksum** and a few other techniques to detect inconsistencies should they arise.
- improves performance but does require a slightly different disk interface
# Chapter 43 Log-structured File Systems
motivations for log-structured file systems:
- system memories are growing
- there is a large gap between random I/O performance and sequential I/O performance
- Existing File systems perform poorly on many common workloads
- File systems are not RAID-aware
The **log-structured file system (LFS)** buffers all updates in an in-memory segment, and when the segment is full it is written to disk in one long sequential transfer. LFS always writes segments to free locations; it never overwrites existing data in place.
## Writing To Disk Sequentially and Effectively
Writing to disk involves not just the data block but also the **metadata (the inode)**, which contains a pointer to the data block, looking something like this:

This will be addressed more later; for now, note that writing sequentially alone does not guarantee efficient writes.
With disk rotation taken into account, there is a lot of waiting and unused time if sequential writes are issued one at a time (the disk rotates past the target sector between writes).
to achieve efficiency we use **write buffering** where LFS keeps track of updates in memory and when it receives a sufficient number of updates it writes them to the disk all at once.
The large chunk of updates LFS writes at one time is referred to as a **segment**. As long as segments are large enough, the writes will be efficient.

## How Much To Buffer
You want to **amortize** the cost of positioning (seek and rotation) by transferring enough data per write; the goal is large transfers that approach peak bandwidth. Concretely, to achieve a fraction $F$ of the peak transfer rate $R_{peak}$ given a positioning time $T_{position}$, LFS should buffer

$$D = \frac{F}{1-F} \times R_{peak} \times T_{position}$$

bytes before writing out a segment.
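As a worked example with made-up (but typical) numbers, say positioning takes 10 ms, peak bandwidth is 100 MB/s, and we want 90% of peak:

```python
T_position = 0.010            # seconds spent positioning before the transfer
R_peak = 100 * 2**20          # peak transfer rate, bytes/second
F = 0.9                       # fraction of peak bandwidth we want to achieve

D = (F / (1 - F)) * R_peak * T_position
print(f"buffer roughly {D / 2**20:.0f} MB per segment")   # ~9 MB
```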
## Problem: Finding Inodes and Solution Through Indirection: The Inode Map
In log-structured file systems, inodes are more complicated to find, as we have scattered them all throughout the disk, and they keep moving (the latest version of an inode is wherever it was last written).
The LFS designers introduced a **level of indirection** between inode numbers and the inodes, via a data structure called the **inode map (imap)**.
The imap is a structure that takes an inode number as input and produces the disk address of the most recent version of the inode. This is often implemented as a simple array.
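A tiny sketch of the indirection the imap provides (the array size and the addresses are made up):

```python
imap = [None] * 16      # inode number -> disk address of that inode's latest version

def write_inode(inode_number, new_disk_address):
    # LFS writes the new inode copy to a fresh location, then updates the imap.
    imap[inode_number] = new_disk_address

def lookup_inode(inode_number):
    return imap[inode_number]

write_inode(5, 1043)     # inode 5 was just written at disk address 1043
write_inode(5, 2077)     # a later update lands somewhere else entirely
print(lookup_inode(5))   # 2077 -- always the most recent location
```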
The imap needs to be kept persistent so LFS can track inode locations across crashes. Therefore LFS places chunks of the inode map right next to where it is writing all of the other new information, appending them to the data block and inode writes:

## Completing The Solution: The Checkpoint Region
However, we still run into the same issue (how do we find the imap pieces themselves?) and ultimately need some fixed, known location on disk.
The **checkpoint region (CR)** lives at a fixed location and contains pointers to the latest pieces of the inode map; it is only updated periodically, so performance is not badly affected.

## What About Directories
Remember that directories are pretty much just a list of (name, inode number) mappings. This only slightly changes the LFS structure on disk.

The directory gets its own data block, inode, and imap entry (as does the file it refers to).
This indirection through the imap is important; otherwise you could run into the **recursive update problem**, which arises because the file system never updates in place but instead moves data to new locations. Specifically, whenever an inode is updated its location on disk changes; if directories stored inode locations directly, the directory would have to change too, then its parent, and so on up the tree. With the imap, the directory holds the same inode number and only the imap entry changes.
## A New Problem: Garbage Collection
While writing has been made efficient, we also need to deal with the garbage LFS leaves behind: whenever a data block or inode is updated, a new version is written to a new location, leaving the old version behind to be cleaned up.
Otherwise you end up with two versions on disk:

Some file systems (**versioning file systems**) do keep these older versions around so users can restore them.
LFS keeps only the latest live version, so it must periodically do some cleaning (**garbage collection**).
The cleaner works on a segment-by-segment basis, clearing up large contiguous chunks of space for subsequent writes.
Specifically, we expect the cleaner to read in M existing segments, compact their live contents into N new segments (where N < M), and write the N segments to disk in new locations. The old M segments are then freed and can be used by the file system for subsequent writes.
## Determining Block Liveness
For each data block D, LFS records a little extra information: the inode number of the file it belongs to and its offset within that file. This information is kept in the **segment summary block** at the head of each segment.
To determine liveness, LFS looks up the inode (via the imap) and checks whether the inode's pointer at that offset still refers to this block's address; if so, the block is live, otherwise it is garbage.
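A sketch of that check, with in-memory dicts standing in for the on-disk structures (segment summary, imap, and inodes are all simplified here):

```python
def block_is_live(A, segment_summary, imap):
    N, T = segment_summary[A]     # which inode and file offset this block claims
    inode = imap[N]               # look up the latest version of that inode
    return inode[T] == A          # inode still points here -> live; else garbage

imap = {9: [500, 501]}                          # inode 9 -> its block addresses
segment_summary = {500: (9, 0), 321: (9, 0)}    # per-block (inode, offset) info
print(block_is_live(500, segment_summary, imap))   # True: still referenced
print(block_is_live(321, segment_summary, imap))   # False: an old, dead copy
```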

A **version number** can also be used to determine liveness in a more efficient way.
## A Policy Question: Which Blocks To Clean, And When?
There must also be policies for deciding which segments are worth cleaning, and when. This has been heavily researched and debated, with no single answer.
A **hot segment** is one in which the contents are being frequently over-written; thus, for such a segment, the best policy is to wait a long time before cleaning it, as more and more blocks are getting over-written (in new segments) and thus being freed for use.
A **cold segment**, in contrast, may have a few dead blocks but the rest of its contents are relatively stable. Thus, the authors conclude that one should clean cold segments sooner and hot segments later, and develop a heuristic that does exactly that.
## Crash Recovery and The Log
One final problem: what happens if the system crashes while LFS is writing to disk?
LFS organizes these writes in a log, i.e., the checkpoint region points to a head and tail segment, and each segment points to the next segment to be written. LFS also periodically updates the checkpoint region.
To ensure that the CR update happens atomically, LFS actually keeps two CRs, one at either end of the disk, and writes to them alternately. LFS also implements a careful protocol when updating the CR with the latest pointers to the inode map and other information; specifically, it first writes out a header (with timestamp), then the body of the CR, and then finally one last block (also with a timestamp).
If the system crashes during a CR update, LFS can detect this by seeing an inconsistent pair of timestamps. LFS will always choose to use the most recent CR that has consistent timestamps, and thus consistent update of the CR is achieved.
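A sketch of how recovery might pick a usable CR, treating each CR as a (header timestamp, body, trailer timestamp) triple; the layout is illustrative:

```python
def choose_cr(cr_a, cr_b):
    consistent = [cr for cr in (cr_a, cr_b)
                  if cr[0] == cr[2]]             # torn update -> timestamps differ
    return max(consistent, key=lambda cr: cr[0]) # newest consistent one wins

good = (10, "imap pointers v10", 10)
torn = (11, "imap pointers v11", 7)              # crash hit mid-update
print(choose_cr(good, torn))                     # falls back to the v10 CR
```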
Upon reboot, LFS can easily recover by simply reading in the checkpoint region, the imap pieces it points to, and subsequent files and directories; however, the last many seconds of updates (everything since the last CR write) would be lost.
**Roll forward** improves on this: start with the last checkpoint region, find the end of the log (which is included in the CR), and then read through the subsequent segments to see if there are any valid updates within them. If there are, LFS updates the file system accordingly and thus recovers much of the data and metadata written since the last checkpoint.
# Chapter 44 Flash-based SSDs
As we know, a hard-disk drive has moving mechanical parts, whereas **solid-state storage** has no moving parts; it is simply built out of transistors.
In this chapter we study flash, where to write to a given chunk (a **flash page**) you must first erase a bigger chunk (a **flash block**), which is expensive, and where writing too often eventually leads to **wear out**.
## Storing a Single Bit
**Single-level cell (SLC)** flash stores only a single bit within each transistor.
**Multi-level cell (MLC)** flash encodes 2 bits per cell using different levels of charge.
There are even **triple-level cell (TLC)** with 3 bits encoded per cell.
Overall, SLC chips achieve higher performance and are more expensive.
## From Bits to Banks/Planes
Flash chips are organized into banks (or planes), which consist of a large number of cells.
A **bank** is accessed in two different sized units: **blocks/erase blocks** and **pages**. Within each bank there are a large number of blocks; within each block there are a large number of pages.

## Basic Flash Operations
Given this flash organization, there are three low-level operations that a flash chip supports. The **read** command is used to read a page from the flash; **erase** and **program** are used in tandem to write.
- Read: read any page by issuing the read command with the appropriate page number. Being able to access any location uniformly quickly makes flash a **random access** device.
- Erase: before writing, you must erase the entire block the page lies within (so any data you care about must first be copied elsewhere). Erasing is very expensive.
- Program: once a block has been erased, program can be used to change some of the 1's within a page to 0's and write the desired contents of the page to the flash.
Pages start in an **INVALID** state; erasing a block sets all of its pages to **ERASED**; programming a page changes its state to **VALID**.

Writing a page is trickier than reading: not only must you erase before programming, but frequent repetition of this program/erase cycle leads to the biggest reliability problem flash chips face: **wear out**.
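A toy model of one flash block that enforces the erase-before-program rule (the page count and state names follow the description above; everything else is made up):

```python
PAGES_PER_BLOCK = 4

class FlashBlock:
    def __init__(self):
        self.state = ["INVALID"] * PAGES_PER_BLOCK
        self.data = [None] * PAGES_PER_BLOCK

    def erase(self):                       # expensive: wipes the entire block
        self.state = ["ERASED"] * PAGES_PER_BLOCK
        self.data = [None] * PAGES_PER_BLOCK

    def program(self, page, contents):     # allowed only on an erased page
        assert self.state[page] == "ERASED", "must erase before programming"
        self.state[page] = "VALID"
        self.data[page] = contents

b = FlashBlock()
b.erase()
b.program(0, "hello")
print(b.state)     # ['VALID', 'ERASED', 'ERASED', 'ERASED']
```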
## Flash Performance and Reliability
Unlike mechanical disks, which can fail for a wide variety of reasons (including the gruesome and quite physical head crash, where the drive head actually makes contact with the recording surface), flash chips are pure silicon and have fewer reliability concerns. The primary concern is **wear out**: as a flash block is erased and programmed, it slowly accrues a little extra charge, to the point where the block eventually becomes unusable.
However, recent research has shown that lifetimes are much longer than expected.
Another problem is **disturbance**: when accessing a particular page within the flash, it is possible that some bits get flipped in neighboring pages; such bit flips are known as **read disturbs** or **program disturbs**, depending on whether the page is being read or programmed.
## From Raw Flash to Flash-Based SSDs
The standard storage interface is a simple block based one where blocks can be read or written given a block address.
SSDs also contain some amount of volatile memory, which is useful for caching and buffering of data as well as for mapping tables.
The **flash translation layer (FTL)** takes read and write requests on logical blocks and turns them into low-level read, erase, and program commands on the underlying physical blocks and physical pages.
One key optimization technique is to use multiple flash chips in parallel to obtain high performance.
Another performance goal will be to reduce **write amplification** which is defined as the total write traffic issued to the flash chips by the FTL divided by the total write traffic issued by the client to the SSD.
- naive approaches lead to high write amplification and low performance
The main reliability concern is still wear out, so the FTL should try to spread writes across the blocks of the flash as evenly as possible, ensuring that all of the blocks of the device wear out at roughly the same time (**wear leveling**).
## FTL Organization: A Bad Approach
One simple FTL organization is **direct mapped**: a read to logical page N is mapped directly to a read of physical page N.
A write to logical page N is more complicated: the FTL first has to read in the entire block page N is contained within, then erase the block, and then program the old pages as well as the new one.
The problem is that such an operation is very costly, hurting both performance and reliability: if a block contains popular data, it will wear out quickly (see the rough cost estimate below).
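A back-of-the-envelope estimate of why this hurts, using made-up numbers (64 pages per block):

```python
pages_per_block = 64
host_pages_written = 1                       # the client changed one page
flash_pages_programmed = pages_per_block     # but the FTL reprograms the whole block
write_amplification = flash_pages_programmed / host_pages_written
print(write_amplification)                   # 64.0, plus one costly block erase
```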
## A log-structured FTL
With a **log-structured** FTL, on a write to logical block N, the device appends the write to the next free spot in the currently-being-written-to block.
To allow for subsequent reads of block N, the device keeps a **mapping table** which stores the physical address of each logical block in the system.

To wear level, the FTL spreads subsequent writes across other blocks and pages.
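A minimal sketch of a log-structured FTL's write and read paths (the flash dict, mapping table, and log head are all in-memory stand-ins, not a real device model):

```python
mapping_table = {}     # logical block -> physical page
flash = {}             # physical page -> contents
log_head = 0           # next free physical page in the log

def ftl_write(logical_block, contents):
    global log_head
    flash[log_head] = contents
    mapping_table[logical_block] = log_head   # any old mapping is now garbage
    log_head += 1

def ftl_read(logical_block):
    return flash[mapping_table[logical_block]]

ftl_write(100, "a1")
ftl_write(100, "a2")                          # overwrite: page 0 becomes garbage
print(ftl_read(100), mapping_table)           # a2 {100: 1}
```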
downsides to this approach include:
- overwrites of logical blocks lead to **garbage** (old versions of data around the drive and taking up space)
- therefore we periodically perform **garbage collection (GC)**
- there is also high cost of in-memory mapping tables; the larger the device the more memory such tables need.
## Garbage Collection
Take this example, where logical blocks 100 and 101 are written again with new contents c1 and c2:

Even though physical pages 0 and 1 are still marked VALID, they now hold garbage (old versions of blocks 100 and 101).
To clean, we find a block that contains one or more garbage pages, read in the live pages from that block, write those live pages out to the log, and finally reclaim the entire block for use in future writes.

The ideal candidate for reclamation is a block that consists of only dead pages so we do not deal with the expensive data migration.
To reduce GC costs, some SSDs **overprovision** the device by adding extra flash capacity; cleaning can then be delayed and pushed into the **background**, to be handled when the device is less busy.
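A sketch of cleaning one victim block: live pages (those the mapping table still points at) are appended to the log, dead ones are dropped, and the victim can then be erased (all structures here are illustrative):

```python
def collect(victim_pages, mapping_table, append_to_log):
    for physical_page, (logical_block, contents) in victim_pages.items():
        if mapping_table.get(logical_block) == physical_page:   # still live?
            new_page = append_to_log(contents)                  # rewrite it
            mapping_table[logical_block] = new_page
    # every page in the victim block is now dead; the block can be erased

log = []
append = lambda contents: (log.append(contents), len(log) - 1)[1]
table = {100: 7, 101: 5}                     # block 100 lives in page 7; 101 in page 5
collect({7: (100, "c1"), 8: (101, "old")}, table, append)
print(log, table)                            # ['c1'] {100: 0, 101: 5}
```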
## Mapping Table Size
The second cost of log structuring is the potential for extremely large mapping tables; a fully page-mapped FTL is impractical.
### Block-Based Mapping
One approach to reduce the costs of mapping is to only keep a pointer per block of the device instead of per page.
unfortunately, using a block-based mapping inside a log-based FTL does not perform well when dealing with small writes.
To make this work, the logical block address is treated as two portions: a chunk number (used to index the per-block table) and an offset (within the physical block).
The biggest issue occurs when writes are smaller than the physical block size of the device, especially on overwrites, because copying to a new block then happens often and is wasteful.
In these cases, even a small change to one logical block forces the FTL to read the remaining live pages of the old block and write them, along with the new data, to a new block, just to deal with a small amount of garbage (see the chunk/offset example below).
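A quick illustration of the chunk-number/offset split (four pages per block; the numbers are made up):

```python
pages_per_block = 4
logical_block = 2002
chunk_number, offset = divmod(logical_block, pages_per_block)
print(chunk_number, offset)   # 500 2: look up chunk 500 in the table, then add offset 2
```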
### Hybrid Mapping
**Hybrid mapping** keeps a few blocks erased and directs all writes to them; these are called **log blocks**. The FTL keeps per-page mappings for these log blocks.
The FTL thus has two types of mapping table in its memory:
- a small set of per-page mappings in what we call the log table
- a larger set of per-block mappings in the data table
On a read, the FTL first consults the log table; if the logical block's location is not found there, the FTL then consults the data table to find its location and accesses the requested data (see the lookup sketch below).
The key to the hybrid mapping strategy is keeping the number of log blocks small. To keep the number of log blocks small, the FTL has to periodically examine log blocks (which have a pointer per page) and switch them into blocks that can be pointed to by only a single block pointer.
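A sketch of the two-level lookup, with tiny made-up tables (per-page log table consulted first, per-block data table second):

```python
pages_per_block = 4
log_table = {1000: 8}        # logical block -> physical page (recent writes)
data_table = {250: 20}       # chunk number  -> first physical page of its block

def lookup(logical_block):
    if logical_block in log_table:
        return log_table[logical_block]
    chunk, offset = divmod(logical_block, pages_per_block)
    return data_table[chunk] + offset

print(lookup(1000))   # 8: found in the log table
print(lookup(1002))   # 22: chunk 250 starts at page 20, plus offset 2
```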



The easy case is a **switch merge**: if every page of the block has been rewritten in a log block, the FTL can simply switch the block pointer to point at the log block. Handling this is more challenging if, say, only the pages holding a and b were updated.

To reunite the other pages of this physical block, and thus be able to refer to them by only a single block pointer, the FTL performs what is called a **partial merge**. In this operation, logical blocks 1002 and 1003 are read from physical block 2, and then appended to the log. The resulting state of the SSD is the same as the switch merge above; however, in this case, the FTL had to perform extra I/O to achieve its goals, thus increasing write amplification.
The final case encountered by the FTL is known as a **full merge**, and requires even more work. In this case, the FTL must pull together pages from many other blocks to perform cleaning.
Frequent full merges, as is not surprising, can seriously harm performance and thus should be avoided when at all possible
### Page Mapping Plus Caching
The simplest method is to cache only the active parts of the FTL in memory, thus reducing the amount of memory needed.
Of course, the approach can also perform poorly. If memory cannot contain the **working set** of necessary translations, each access will minimally require an extra flash read to first bring in the missing mapping before being able to access the data itself.
Even worse, to make room for the new mapping, the FTL might have to **evict** an old mapping, and if that mapping is **dirty** (i.e., not yet written to the flash persistently), an extra write will also be incurred.
### Wear Leveling
With **wear leveling**, the FTL tries its best to spread the erase/program load across all the blocks of the device evenly. In this manner, all blocks wear out at roughly the same time, instead of a few "popular" blocks quickly becoming unusable.
However, sometimes a block will be filled with long-lived data that does not get over-written; in this case, garbage collection will never reclaim the block, and thus it does not receive its fair share of the write load.
To remedy this problem, the FTL must periodically read all the live data out of such blocks and re-write it elsewhere, thus making the block available for writing again.
## SSD Performance and Cost
### Performance
The biggest difference in performance, as compared to disk drives, is realized when performing random reads and writes; while a typical disk drive can only perform a few hundred random I/Os per second, SSDs can do much better.
While the SSDs obtain tens or even hundreds of MB/s in random I/Os, this “high performance” hard drive has a peak of just a couple MB/s
In terms of sequential performance, the SSDs still perform better, but by a much smaller margin; a hard drive remains a good choice if sequential performance is all you need.
SSD random read performance is not as good as SSD random write performance; the reason for such unexpectedly good random-write performance is the log-structured design of many SSDs, which transforms random writes into sequential ones.
Although the magnitude of the difference between sequential and random I/O is smaller for SSDs than for disks, there is enough of a gap that it is still worth designing file systems to reduce random I/Os.
### Cost
an SSD costs something like $150 for a 250-GB drive; such an SSD costs 60 cents per GB. A typical hard drive costs roughly $50 for 1-TB of storage, which means it costs 5 cents per GB. There is still more than a 10× difference in cost between these two storage media.
If, on the other hand, you are assembling a large data center and wish to store massive amounts of information, the large cost difference will drive you towards hard drives.
## KEY SSD TERMS
- A flash chip consists of many banks, each of which is organized into erase blocks (sometimes just called blocks). Each block is further subdivided into some number of pages.
- Blocks are large (128KB–2MB) and contain many pages, which are relatively small (1KB–8KB).
- To read from flash, issue a read command with an address and length; this allows a client to read one or more pages.
- Writing flash is more complex. First, the client must erase the entire block (which deletes all information within the block). Then, the client can program each page exactly once, thus completing the write.
- A new trim operation is useful to tell the device when a particular block (or range of blocks) is no longer needed.
- Flash reliability is mostly determined by wear out; if a block is erased and programmed too often, it will become unusable.
- A flash-based solid-state storage device (SSD) behaves as if it were a normal block-based read/write disk; by using a flash translation layer (FTL), it transforms reads and writes from a client into reads, erases, and programs to underlying flash chips.
- Most FTLs are log-structured, which reduces the cost of writing by minimizing erase/program cycles. An in-memory translation layer tracks where logical writes were located within the physical medium.
- One key problem with log-structured FTLs is the cost of garbage collection, which leads to write amplification.
- Another problem is the size of the mapping table, which can become quite large. Using a hybrid mapping or just caching hot pieces of the FTL are possible remedies.
- One last problem is wear leveling; the FTL must occasionally migrate data from blocks that are mostly read in order to ensure said blocks also receive their share of the erase/program load.
# Ch 45 Data Integrity and Protection
Specifically, how should a file system or storage system ensure that data is safe, given the unreliable nature of modern storage devices?
## Disk Failure Modes
The earliest model was the **fail-stop** model: either the entire disk is working, or it fails completely, and detection of such a failure is straightforward.
**Latent sector errors (LSEs)** arise when a disk sector has been damaged in some way. (e.g. disk head touches the surface for some reason).
**Error correcting codes (ECC)** are used by the drive to determine whether the on-disk bits in a block are good and, in some cases, to fix them; other times the drive does not have enough information to fix the error.
There are also cases where a disk block becomes corrupt in a way not detectable by the disk itself. For example, buggy disk firmware may write a block to the wrong location; in such a case, the disk ECC indicates the block contents are fine, but from the client’s perspective the wrong block is returned when subsequently accessed.
These types of faults are particularly insidious because they are **silent faults**; the disk gives no indication of the problem when returning the faulty data.
**Fail-partial** is a more modern model: disks can still fail in their entirety, but they can also seem to be working while one or more blocks become inaccessible or hold the wrong contents.
Some additional findings about LSEs are:
- Costly drives with more than one LSE are as likely to develop additional errors as cheaper drives
- For most drives, annual error rate increases in year two
- The number of LSEs increase with disk size
- Most disks with LSEs have less than 50
- Disks with LSEs are more likely to develop additional LSEs
- There exists a significant amount of spatial and temporal locality
- Disk scrubbing is useful (most LSEs were found this way)
Some findings about corruption:
- Chance of corruption varies greatly across different drive models within the same drive class
- Age effects are different across models
- Workload and disk size have little impact on corruption
- Most disks with corruption only have a few corruptions
- Corruption is not independent within a disk or across disks in RAID
- There exists spatial locality, and some temporal locality
- There is a weak correlation with LSEs
## Handling Latent Sector Errors
Latent sector errors are rather straightforward to handle.
When a storage system tries to access a block, and the disk returns an error, the storage system should simply use whatever redundancy mechanism it has to return the correct data.
- e.g. RAID tries to reconstruct the disk (say, onto a hot spare) by reading through all of the other disks in the parity group and recomputing the missing values
## Detecting Corruption: The Checksum
Silent failures are a bit more challenging: detection requires extra machinery, though recovery remains the same as before (you need some other copy of the block around).
**checksum** is simply the result of a function that takes a chunk of data as input and computes a function over said data, producing a small summary of the contents of the data.
The goal of such a computation is to enable a system to detect whether data has somehow been corrupted or altered, by storing the checksum with the data and then confirming upon later access that the data's current checksum matches the originally stored value.
One simple checksum function that some use is based on exclusive or (XOR). With XOR-based checksums, the checksum is computed by XOR’ing each chunk of the data block being checksummed, thus producing a single value that represents the XOR of the entire block.
- has limitations such as if two bits in the same position within each checksummed unit change, the checksum will not detect the corruption
Other checksum functions are described in the reading [here](https://pages.cs.wisc.edu/~remzi/OSTEP/file-integrity.pdf); look for the **Fletcher checksum** and the **cyclic redundancy check (CRC)**.
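Minimal sketches of two such functions, an XOR checksum and a Fletcher-style checksum over a block of bytes (illustrative, not tuned implementations):

```python
def xor_checksum(block: bytes, width: int = 4) -> bytes:
    out = bytearray(width)
    for i, b in enumerate(block):       # XOR successive width-byte chunks together
        out[i % width] ^= b
    return bytes(out)

def fletcher16(block: bytes) -> int:
    s1 = s2 = 0
    for b in block:                     # two running sums catch more errors than XOR
        s1 = (s1 + b) % 255
        s2 = (s2 + s1) % 255
    return (s2 << 8) | s1

data = b"pretend this is one 4KB disk block"
print(xor_checksum(data).hex(), hex(fletcher16(data)))
```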
There is no perfect checksum as it is possible two data blocks with non-identical contents will have identical checksums, something referred to as a collision. (we want to minimize the chance of this)
### Checksum Layout
The most basic approach simply stores a checksum with each disk sector or block.

In this scheme, n checksums are packed together into a single sector, followed by the n data blocks; this works, but updating a checksum requires an extra read-modify-write of the checksum sector, so it is not optimally efficient.

Storing each checksum right next to its data block would require only a single write per update, but may be tricky to achieve without drive support for larger sectors.
## Using Checksums
When reading, we compare two checksums: the **stored checksum** and the **computed checksum**. If they do not match, the data has changed since the time it was stored, and we have detected corruption.
If there is a redundant copy, we correct the data from it; otherwise we simply return an error.
## A New Problem: Misdirected Writes
A rarer failure mode is the **misdirected write**, where the disk writes the data correctly but to the wrong location; the issue becomes even trickier in multi-disk systems (the data could land on the wrong disk).
The fix is to add something extra alongside the checksum: a **physical identifier** (e.g., the disk and block number the data is supposed to live at).

A little extra information, while not strictly needed with perfect disks, can go a long ways in helping detect problematic situations should they arise.
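A sketch of verification that uses both pieces of information; the stored-tuple layout and the toy checksum are made up for illustration:

```python
def verify(expected_disk, expected_block, stored, data, checksum_fn):
    stored_checksum, stored_disk, stored_block = stored
    if checksum_fn(data) != stored_checksum:
        return "corruption"                    # contents changed since written
    if (stored_disk, stored_block) != (expected_disk, expected_block):
        return "misdirected write"             # right data, wrong location
    return "ok"

csum = lambda d: sum(d) % 256                  # toy checksum, illustration only
data = b"block contents"
print(verify(1, 42, (csum(data), 1, 42), data, csum))   # ok
print(verify(1, 42, (csum(data), 1, 43), data, csum))   # misdirected write
```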
## One Last Problem: Lost Writes
A **lost write** occurs when the device informs the upper layer that a write has completed, but in fact it is never persisted; what remains is the old contents of the block rather than the updated new contents.
A classic fix is **write verify** (or **read-after-write**): by immediately reading back the data after a write, a system can ensure that the data indeed reached the disk surface. (This is very slow, roughly doubling the I/O per write.)
The **Zettabyte File System (ZFS)** includes a checksum in each file system inode and indirect block for every block of the file. Thus, even if the write to a data block itself is lost, the checksum within the inode will not match the old data.
## Scrubbing
Some amount of checking occurs when data is accessed by applications, but most data is rarely accessed and thus would remain unchecked.
To check the rarely-accessed data, systems may perform **disk scrubbing**: periodically reading through every block of the system and checking whether the checksums are still valid.
Typical systems schedule scans on a nightly or weekly basis.
## Overheads of Checksumming
There are two distinct kinds of overheads, as is common in computer systems: space and time.
Space overheads come in two forms. The first is on the disk (or other storage medium) itself; each stored checksum takes up room on the disk, which can no longer be used for user data. The second type of space overhead comes in the memory of the system. When accessing data, there must now be room in memory for the checksums as well as the data itself. However, if the system simply checks the checksum and then discards it once done, this overhead is short-lived and not much of a concern.
the time overheads induced by checksumming can be quite noticeable. Minimally, the CPU must compute the checksum over each block, both when the data is stored (to determine the value of the stored checksum) and when it is accessed (to compute the checksum again and compare it against the stored checksum)
One approach to reducing CPU overheads, employed by many systems that use checksums (including network stacks), is to combine data copying and checksumming into one streamlined activity; because the copy is needed anyhow (e.g., to copy the data from the kernel page cache into a user buffer), combined copying/checksumming can be quite effective.
Beyond CPU overheads, some checksumming schemes can induce extra I/O overheads, particularly when checksums are stored distinctly from the data (thus requiring extra I/Os to access them), and for any extra I/O needed for background scrubbing.
The former can be reduced by design; the latter can be tuned and thus its impact limited, perhaps by controlling when such scrubbing activity takes place. The middle of the night, when most (not all!) productive workers have gone to bed, may be a good time to perform such scrubbing activity and increase the robustness of the storage system.