The Block I/O Layer

# The Block I/O Layer 主要分別為存取到底是不是randomly的 (whether the device can seek to one position from another) * Block devices (random) * random access of fixed-size chunks of data(block) * hard disk * floppy drives * blu-ray readers * flash memory * character devices (serial) * accessed as a stream of sequential data, one byte after another. * serial ports * keyboards block device比char divice複雜得多 # Anatomy of a Block Device * block device * sector : 最小的位址單元 * 512bytes最常見 * 要看實際HW設計 * 不能在往下提取更小的單元 * 但同時間可提取多個sectors * block * software的最小邏輯位址單元 * filesystem的抽象化 * filesystems can be accessed only in multiples of a block * 雖然disk上最小的執行單元是sector，但kernel總是用blocks來做各種操作 * block size need to be a multiple of a sector * and need to be power of 2 * block還被規定不能比page size大 * 綜上所述 * Common block sizes are * 512 bytes * 1k bytes * 4kbytes 有些人常常會創造出很多名詞讓人非常迷惑，比如說把sectors叫做hard sectors或 "device blocks"，然後把blocks叫做filesystem blocks或 I/O blocks，這個章節不會這樣叫 sectors就是sectors, blocks就是blocks sector得重要性在於，所有的 device I/O都是以sectors為基礎完成的，the higher-level concept used by the kernel--blocks--is built on top of sectors. ![](https://i.imgur.com/JL45sQ5.png) # Buffers and Buffer Heads Buffer Heads的目的就是把memory buffer跟 disk blocks之間的關係描述清楚 When a block is stored in memory—say, after a read or pending a write—it is stored in a > buffer > * buffer * 每個buffer 1對1 associated with one block * buffer就代表disk block in memory * 前面有提到a block is several sectors but no more than a page size * a page can hold one or more blocks in memory * buffer descriptor * buffer head * struct buffer_head * <linux/buffer_head.h> ![](https://i.imgur.com/QnxQhLc.png) * b_state * the state of this particular buffer * the legal flags are stored in the bh_state_bits enum ![](https://i.imgur.com/eh0zUaY.png) ![](https://i.imgur.com/ZLPCSfZ.png) * bh_state_bits * ![](https://i.imgur.com/Bm9jLDd.png) * ![](https://i.imgur.com/84gYQda.png) * b_count * ![](https://i.imgur.com/6Xl5bFG.png) physical block on disk跟 buffer in memory的關聯用b_bdev的b_blocknr-th logical block來描述 kernel 2.6之前這個structure是主要的，但有缺點而且kernel比較prefer用Page來工作，buffer head的最小單位是blocks這樣很沒效率，所以kernel 2.6之後 block I/O的控制全部都直接透過pages跟address space了，在16章 The Page Cache and Page Writeback有討論 * address_space * pdflush daemons 還有一個缺點是，buffer head只描述了一個單一的buffer. 當在做I/O operations時, kernel只能把一個大的block I/O operations切成好幾個buffer_head來做事,This results in needless overhead and space consumption，所以下面討論改良版 # The bio Structure * bio structure * basic container for block I/O in kernel * <linux/bio.h> * 他把block I/O operations切成 a list of segments * a segment is a chunk of buffer that is contiguous in memory * 所以個別的buffer不用在memory裡是連續的 * (阿不就是多了linklist來描述的好處....) * Vector I/O such as this is called * scatter-gather I/O ![](https://i.imgur.com/ZyEz9SF.png) bio structure的主要目的是代表 in-flight(執行中的動態) block I/O operation most important fields are * bi_io_vec * bi_vcnt * bi_idx ![](https://i.imgur.com/Anovguw.png) # I/O vectors * bi_io_vec * points to an array of bio_vec structures * bio_vec * vector of the form <page, offset, len> * 描述segment * 是放在哪個physical page裡 ![](https://i.imgur.com/b1789RX.png) 總結來說 * 一個block I/O request可以用bio structure表達 * 每個request包含了一個到數個的blocks * 被bio_vec用array的形式存起來 * bio_vec表達了segment跟physical page的對應 # Request Queues see <linux/blkdev.h> Block devices用 request queues來存放pending的block I/O requests. * block device driver負責處理相關事宜 * 空了滿了要怎樣沙小的 * each element in queue are request * 用 struct request表達 * 雖然disk的blocks必須是相鄰的，但放到memory後不用了(前面有提到用linked list串起來了) * each bio structure can describe multiple segments * each request can be composed of multiple bio structures ex: * request * bio1 * segment1 * segment2 * segment3 * bio2 * bio3 # I/O Schedulers sending out requests to the block devices in the order that the kernel issues them. 因為disk的seek動作太慢，kernel不直接送block I/O requests to the disk.kernel 用merging and sorting來改善performance(I/O Schedulers的工作) * process scheduler * 把工作抽象化成process * I/O Scheduler * 把工作抽象化成block I/O requests # The Job of an I/O Scheduler 管理block device's request queue. 有點像process scheduler * decides the order of requests in the queue * and what time each request is dispatched to the block device * 終極目標就是要減少使用disk seeks的時間 * results in greater global throughput 2 primary actions to minimize seeks * Merging * 把多個requests合併成一個 * 考慮一個例子，某個filesystem現在送了一個request給queue, 說從一個file去read a chunk of data好了 * scheduler會去找queue裡面是不是有鄰近的sector的request，有就把它們合併成同一個，總之就是減少disk seek time的次數，有相鄰就一次連續的讀出來就好。(移動磁頭是最耗時間的) * Sorting * 如果沒有可以合併的，把他丟到queue的最尾端等待處理這有點像電梯演算法 This is similar to the algorithm employed in elevators—elevators do not jump all over, wildly, from floor to floor. Instead, they try to move gracefully in a single direction. When the final floor is reached in one direction, the elevator can reverse course and move in the other direction. Because of this similarity, I/O schedulers (or sometimes just their sorting algorithm) are called elevators. # The Linus Elevator The Linus Elevator performs both merging and sorting, When a request is added to the queue * first checked against every other pending request to see whether it is a possible candidate for merging * front merging * If the new request immediately proceeds an existing request * back merging * if the new request immediately precedes an existing request 沒辦法merge再insert進queue內，插入要參考sector等等的資訊插的聰明點 # The Deadline I/O Scheduler ![](https://i.imgur.com/bbsTKlY.png) ![](https://i.imgur.com/DcrDkNZ.png) ![](https://i.imgur.com/RrdBkFy.png) ![](https://i.imgur.com/Lm2clUr.png) ![](https://i.imgur.com/zDUGV6q.png) # The Anticipatory I/O Scheduler ![](https://i.imgur.com/6ItZ6Qi.png) ![](https://i.imgur.com/uGsXFHT.png) ![](https://i.imgur.com/nYFmfl6.png) # The Complete Fair Queuing I/O Scheduler ![](https://i.imgur.com/EgaXVPm.png) # The Noop I/O Scheduler ![](https://i.imgur.com/oYMi8BE.png) ![](https://i.imgur.com/6EUorCP.png) # I/O Scheduler Selection ![](https://i.imgur.com/1K8BoIG.png) 隨時可以在command line 切換 I/O scheduling演算法