
Block device, BIO in linux kernel

tags: block device bio

ztex

Linux storage stack


The block layer

see A block layer introduction:

  1. part 1 the bio layer
  2. part 2 the request layer

The term "block layer" is often used to talk about that part of the Linux kernel which implements the interface that applications and filesystems use to access various storage devices.

The bio layer is a thin layer that takes I/O requests in the form of bio structures and passes them directly to the appropriate make_request_fn() function. It provides various support functions to simplify splitting bios and scheduling the sub-bios, and to allow plugging of the queue. It also performs some other simple tasks such as updating the pgpgin and pgpgout statistics in /proc/vmstat, but mostly it just lets the next level down get on with its work.

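The bio layer receives bios (a bio is one form of I/O request) and hands each one to the appropriate make_request_fn().
It also provides helpers for splitting bios and scheduling the pieces, and it allows the request queue to be plugged.
It also updates the pgpgin and pgpgout statistics exposed in /proc/vmstat.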
ztex
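
The pgpgin and pgpgout counters mentioned above can be read from user space. The following is a minimal sketch, not kernel code: `vmstat_lookup()` and `read_vmstat()` are hypothetical helper names, and the only assumption about the file is its "name value" per line layout.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up one counter in text formatted like /proc/vmstat
 * ("name value" per line). Returns the value, or -1 if the name is absent.
 */
long long vmstat_lookup(const char *text, const char *name)
{
	size_t len;
	const char *line = text;

	assert(text && name);
	len = strlen(name);
	while (line && *line) {
		/* Match the whole name, followed by the separating space. */
		if (strncmp(line, name, len) == 0 && line[len] == ' ')
			return strtoll(line + len + 1, NULL, 10);
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline */
	}
	return -1;
}

/* Read /proc/vmstat into buf (NUL-terminated); returns 0 on success. */
int read_vmstat(char *buf, size_t size)
{
	FILE *f = fopen("/proc/vmstat", "r");
	size_t n;

	if (!f || size == 0) {
		if (f)
			fclose(f);
		return -1;
	}
	n = fread(buf, 1, size - 1, f);
	buf[n] = '\0';
	fclose(f);
	return 0;
}
```

Calling `read_vmstat()` on a buffer of a few tens of kilobytes and then `vmstat_lookup(buf, "pgpgin")` before and after some disk activity should show the counter moving.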

Sometimes the next layer is just the final driver, as with drbd (The Distributed Replicated Block Device) or brd (a RAM based block device). More often the next layer is an intermediate layer such as for the virtual devices provided by md (used for software RAID) and dm (used, for example, by LVM2). Probably the most common is when that intermediate layer is the remainder of the block layer, which I have chosen to call the "request layer".


Access to block devices generally happens through block special devices in /dev, which map to S_IFBLK inodes in the kernel. These inodes act a little bit like symbolic links in that they don't represent the block device directly but simply contain a pointer to the block device as a "major:minor" number pair. Internally the i_bdev field in the inode contains a link to a struct block_device that represents the target device. This block device holds a reference to a second inode: block_device->bd_inode. This inode is more closely involved in I/O to the block device, the original inode in /dev is just a pointer.

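Treat the inode in /dev as a link: it is not the block device itself, only a pointer to it.
i_bdev is the link to the block device; this inode has little to do with the I/O itself.
block_device->bd_inode is the inode that is actually involved in I/O.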

for inode->i_bdev, see: https://elixir.bootlin.com/linux/latest/source/include/linux/fs.h#L725
for block_device->bd_inode see: https://elixir.bootlin.com/linux/latest/source/include/linux/fs.h#L479
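
The "major:minor" pair stored in a /dev node is visible from user space through stat(2). A minimal sketch, assuming a Linux/glibc system; `block_dev_numbers()` is a hypothetical helper name, while `stat()`, `S_ISBLK()`, `major()` and `minor()` are the standard interfaces:

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/*
 * Returns 1 and fills *maj / *min if path is a block special file,
 * 0 if it exists but is something else, -1 if stat() fails.
 */
int block_dev_numbers(const char *path, unsigned int *maj, unsigned int *min)
{
	struct stat st;

	assert(path && maj && min);
	if (stat(path, &st) != 0)
		return -1;
	if (!S_ISBLK(st.st_mode))
		return 0;
	/* For device nodes, st_rdev holds the number the node points at. */
	*maj = major(st.st_rdev);
	*min = minor(st.st_rdev);
	return 1;
}
```

Two different nodes with the same numbers (for example one created by hand with mknod) resolve to the same block device, which is why the note compares these inodes to symbolic links.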

The main role that this second inode plays (which is implemented in fs/block_dev.c, fs/buffer.c, and elsewhere) is to provide a page cache. When the device file is opened without the O_DIRECT flag, the page cache associated with the inode is used to buffer reads, including readahead, and to buffer writes, usually delaying writes until the normal writeback process flushes them out. When O_DIRECT is used, reads and writes go directly to the block device. Similarly when a filesystem mounts a block device, reads and writes from the filesystem usually go directly to the device, though some filesystems (particularly the ext* family) can access the same page cache (traditionally known as the buffer cache in this context) to manage some of the filesystem data.

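When the device is opened with O_DIRECT, reads and writes go straight to the block device; otherwise they are buffered in the page cache, and writes are delayed until the normal writeback process flushes them to the device.
Filesystems usually read and write the device directly, except for some such as the ext* family, which use the same page cache (the buffer cache) to manage part of their data.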
ztex
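
From user space, the practical consequence of O_DIRECT is that the buffer address, the file offset, and the length all have to be suitably aligned. The sketch below assumes a 4096-byte alignment is sufficient (the real requirement depends on the device and kernel); `read_direct()` and `DIO_ALIGN` are hypothetical names.

```c
#define _GNU_SOURCE	/* for O_DIRECT */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define DIO_ALIGN 4096	/* assumed alignment; real devices may need less */

/*
 * Read len bytes at byte offset off, bypassing the page cache.
 * len and off must be multiples of DIO_ALIGN. On success returns a
 * buffer the caller must free(); on failure returns NULL.
 */
void *read_direct(const char *path, off_t off, size_t len)
{
	void *buf = NULL;
	int fd;

	assert(path != NULL);
	if (len == 0 || len % DIO_ALIGN || off % DIO_ALIGN)
		return NULL;
	fd = open(path, O_RDONLY | O_DIRECT);
	if (fd < 0)
		return NULL;
	/* O_DIRECT also needs the memory buffer itself to be aligned. */
	if (posix_memalign(&buf, DIO_ALIGN, len) != 0) {
		close(fd);
		return NULL;
	}
	if (pread(fd, buf, len, off) != (ssize_t)len) {
		free(buf);
		buf = NULL;
	}
	close(fd);
	return buf;
}
```

Dropping O_DIRECT from the open() flags gives the buffered path described above, where the same read may be served from the page cache.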

Another open() flag of particular relevance to block devices is O_EXCL. Block devices have a simple advisory-locking scheme whereby each block device can have at most one "holder". The holder is specified when activating the block device (e.g. using a blkdev_get() or similar call in the kernel); that will fail if a different holder has already claimed the device. Filesystems usually specify a holder when mounting a device to ensure exclusive access. When an application opens a block device with O_EXCL, that causes the newly created struct file to be used as the holder; the open will fail if a filesystem is mounted from the device. If the open is successful, it will block future mount attempts as long as the device remains open. Using O_EXCL doesn't prevent the block device from being opened without O_EXCL, so it doesn't prevent concurrent writes completely — it just makes it easy for applications to test if the block device is in use.

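A block_device has a simple advisory-locking scheme: each block device can have at most one holder at a time. If a holder already exists and a different one tries to claim the device (e.g. via blkdev_get()), the attempt fails.
Filesystems usually keep the device claimed for as long as it is mounted, which is what ensures exclusive access.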
ztex
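
The holder rules can be summarised in a small user-space model. This is only a sketch of the behaviour described above, not the kernel's implementation; `struct toy_bdev`, `toy_bdev_open()` and `toy_bdev_close()` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for struct block_device: only the holder bookkeeping. */
struct toy_bdev {
	const void *holder;	/* current exclusive holder, or NULL */
	int holders;		/* how many exclusive opens the holder has */
	int openers;		/* all opens, exclusive or not */
};

/*
 * Open the device. A NULL holder is a plain (non-exclusive) open and
 * always succeeds; a non-NULL holder is an exclusive claim and fails
 * with -EBUSY when someone else already holds the device.
 */
int toy_bdev_open(struct toy_bdev *bdev, const void *holder)
{
	assert(bdev);
	if (holder) {
		if (bdev->holder && bdev->holder != holder)
			return -EBUSY;
		bdev->holder = holder;
		bdev->holders++;
	}
	bdev->openers++;
	return 0;
}

/* Undo one toy_bdev_open() made with the same holder argument. */
void toy_bdev_close(struct toy_bdev *bdev, const void *holder)
{
	assert(bdev && bdev->openers > 0);
	if (holder) {
		assert(bdev->holder == holder && bdev->holders > 0);
		if (--bdev->holders == 0)
			bdev->holder = NULL;
	}
	bdev->openers--;
}
```

In this model a mounted filesystem and an O_EXCL opener are both just holders, so whichever claims first wins, while opens with a NULL holder are never refused. That is the "advisory" part of the scheme.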

All block devices in Linux are represented by struct gendisk — a "generic disk". This structure doesn't contain a great deal of information and largely serves as a link between the filesystem interface "above" and the lower-layer interface "below". Above the gendisk is one or more struct block_device, which, as we already saw, are linked from inodes in /dev. A gendisk can be associated with multiple block_device structures when it has a partition table. There will be one block_device that represents the whole gendisk, and possibly some others that represent partitions within the gendisk.

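Every block device is represented by a struct gendisk.
struct gendisk can be seen as the link between the filesystem interface above and the lower layers below.
If there is a partition table, one gendisk can be associated with several block_device structures.
One block_device represents the whole gendisk; the other block_devices represent the partitions within it.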
ztex

The "bio" that gives its name to the bio layer is a data structure (struct bio) that carries read and write requests, and assorted other control requests, from the block_device, past the gendisk, and on to the driver. A bio identifies a target device, an offset in the linear address space of the device, a request (typically READ or WRITE), a size, and some memory where data will be copied to or from. Prior to Linux 4.14, the target device would be identified in the bio by a pointer to the struct block_device. Since then it holds a pointer to the struct gendisk together with a partition number, which can be set by bio_set_dev(). This is more natural given the central role of the gendisk structure.

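An I/O request is represented by a struct bio. The structure carries: the target device, an offset in the device's linear address space, the request type (READ or WRITE), a size, and the memory that data is copied to or from.
Before Linux 4.14 the target device was a struct block_device pointer field in the bio. Since commit 74d46992e0d9dee7f1f376de0d56d31614c8a17a it is a gendisk plus a partition number, which can be set with the bio_set_dev() macro.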
ztex
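
To make the "gendisk plus partition number" addressing concrete, here is a user-space model of a bio that targets a partition and is then translated to a whole-disk offset. It is loosely inspired by the partition remapping the kernel performs, but every `demo_` name and the fixed-size partition array are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_PARTS 4

/* Toy partition: a window onto the disk, in 512-byte sectors. */
struct demo_part {
	uint64_t start;		/* first sector on the whole disk */
	uint64_t nr_sects;	/* length; 0 means the slot is unused */
};

/* Toy gendisk: part[0] describes the whole disk, part[1..] partitions. */
struct demo_gendisk {
	struct demo_part part[DEMO_MAX_PARTS];
};

/* Toy bio: just the addressing fields discussed above. */
struct demo_bio {
	struct demo_gendisk *disk;
	unsigned int partno;	/* 0 = whole disk */
	uint64_t sector;	/* offset relative to the partition */
	uint32_t nr_sects;	/* size of the request in sectors */
};

/* Counterpart of bio_set_dev(): point the bio at a disk + partition. */
void demo_bio_set_dev(struct demo_bio *bio, struct demo_gendisk *disk,
		      unsigned int partno)
{
	bio->disk = disk;
	bio->partno = partno;
}

/*
 * Translate a partition-relative bio into a whole-disk one.
 * Returns 0 on success, -1 if the request does not fit the partition.
 */
int demo_partition_remap(struct demo_bio *bio)
{
	const struct demo_part *p;

	assert(bio && bio->disk);
	if (bio->partno >= DEMO_MAX_PARTS)
		return -1;
	p = &bio->disk->part[bio->partno];
	if (bio->sector > p->nr_sects ||
	    bio->nr_sects > p->nr_sects - bio->sector)
		return -1;
	bio->sector += p->start;
	bio->partno = 0;
	return 0;
}
```

With a partition that starts at sector 100, a bio for sector 10 of that partition ends up addressing sector 110 of the disk, and the driver below only ever sees whole-disk offsets.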

Once constructed, a bio is given to the bio layer by calling generic_make_request() or, equivalently, submit_bio(). This does not normally wait for the request to complete, but merely queues it for subsequent handling. generic_make_request() can still block for short periods of time, to wait for memory to become available, for example. A useful way to think about this behavior is that it might wait for previous requests to complete (e.g. to make room on the queue), but not for the new request to complete. If the REQ_NOWAIT flag is set in the bi_opf field, generic_make_request() shouldn't wait at all if there is insufficient space and should, instead, cause the bio to complete with the status set to BLK_STS_AGAIN, or possibly BLK_STS_NOTSUPP. As of this writing, this feature is not yet implemented correctly or consistently.

Once a struct bio has been built, it is handed to the bio layer through generic_make_request() or submit_bio().
Neither function waits for the I/O request to complete. (ztex: so you must always do bio_get() -> submit_bio() -> bio_put(); I got a kernel panic when I left it out, see: https://elixir.bootlin.com/linux/latest/source/include/linux/bio.h#L210)
ztex

/*
 * get a reference to a bio, so it won't disappear. the intended use is
 * something like:
 *
 * bio_get(bio);
 * submit_bio(rw, bio);
 * if (bio->bi_flags ...)
 *	do_something
 * bio_put(bio);
 *
 * without the bio_get(), it could potentially complete I/O before submit_bio
 * returns. and then bio would be freed memory when if (bio->bi_flags ...)
 * runs
 */
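
The lifetime problem in the kernel comment above can be reproduced with a user-space reference-counting model. This is a sketch under stated assumptions: all `toy_` names are hypothetical, the "device" completes every request synchronously (the worst case for the submitter), and the completion handler is assumed to drop the allocation reference.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_bio {
	int refcount;
	int status;			/* set on completion */
	void (*end_io)(struct toy_bio *);
};

int toy_bios_freed;	/* counts frees so the lifetime can be observed */

struct toy_bio *toy_bio_alloc(void (*end_io)(struct toy_bio *))
{
	struct toy_bio *bio = calloc(1, sizeof(*bio));

	if (bio) {
		bio->refcount = 1;	/* the allocation reference */
		bio->end_io = end_io;
	}
	return bio;
}

void toy_bio_get(struct toy_bio *bio)
{
	bio->refcount++;
}

void toy_bio_put(struct toy_bio *bio)
{
	assert(bio->refcount > 0);
	if (--bio->refcount == 0) {
		toy_bios_freed++;
		free(bio);
	}
}

/* A typical completion handler: drop the allocation reference. */
void toy_end_io_put(struct toy_bio *bio)
{
	toy_bio_put(bio);
}

/*
 * Worst case for the submitter: the "device" finishes the I/O before
 * submit returns, so end_io has already run when the caller resumes.
 */
void toy_submit_bio(struct toy_bio *bio)
{
	bio->status = 0;
	if (bio->end_io)
		bio->end_io(bio);
}

/* Submit and then read the status, as in the kernel comment above. */
int toy_write_and_check(void)
{
	struct toy_bio *bio = toy_bio_alloc(toy_end_io_put);
	int status;

	if (!bio)
		return -1;
	toy_bio_get(bio);	/* without this, bio is freed inside submit */
	toy_submit_bio(bio);
	status = bio->status;	/* safe: we still hold a reference */
	toy_bio_put(bio);	/* last reference: the bio is freed here */
	return status;
}
```

Removing the `toy_bio_get()` call turns the `bio->status` read into a use-after-free, which is the same class of bug as the kernel panic mentioned in the note.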

The interface between the bio layer and request layer requires devices to register with the bio layer by calling blk_queue_make_request() and passing a make_request_fn() function that takes a bio. generic_make_request() will call that function for the device identified in the bio. This function must arrange things such that, when the I/O request described by the bio completes, the bi_status field is set to indicate success or failure and call bio_endio() which, in turn, will call the bi_end_io() function stored in the structure.

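The interface between the bio layer and the request layer requires devices to register through
blk_queue_make_request(), passing a make_request_fn() that receives bios.
make_request_fn() must handle the I/O request; when it is done, it must set bi_status to indicate success or failure and call bio_endio(), which in turn calls the bi_end_io() completion callback stored in the struct bio.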
ztex
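
The registration and completion contract can be modelled in a few lines of user-space C. This is a sketch of the control flow only; the `model_` names are hypothetical counterparts of blk_queue_make_request(), generic_make_request() and bio_endio(), and the driver shown completes every request immediately.

```c
#include <assert.h>
#include <stddef.h>

struct model_bio;
struct model_queue;

typedef void (*model_make_request_fn)(struct model_queue *q,
				      struct model_bio *bio);

/* One queue per device; the driver installs its entry point here. */
struct model_queue {
	model_make_request_fn make_request;
};

struct model_bio {
	struct model_queue *queue;	/* target device */
	int status;			/* 0 on success, negative on error */
	void (*end_io)(struct model_bio *bio);
	void *priv;			/* for the submitter's use */
};

/* Counterpart of blk_queue_make_request(): register the driver hook. */
void model_queue_make_request(struct model_queue *q,
			      model_make_request_fn fn)
{
	q->make_request = fn;
}

/* Counterpart of bio_endio(): run the submitter's completion callback. */
void model_bio_endio(struct model_bio *bio)
{
	if (bio->end_io)
		bio->end_io(bio);
}

/* Counterpart of generic_make_request(): hand the bio to its device. */
void model_generic_make_request(struct model_bio *bio)
{
	assert(bio && bio->queue && bio->queue->make_request);
	bio->queue->make_request(bio->queue, bio);
}

/* A trivial driver that completes every bio successfully, at once. */
void null_make_request(struct model_queue *q, struct model_bio *bio)
{
	(void)q;
	bio->status = 0;	/* record the result first ... */
	model_bio_endio(bio);	/* ... then signal completion */
}

/* Example completion callback: bump a counter the submitter passed in. */
void count_completion(struct model_bio *bio)
{
	if (bio->status == 0)
		(*(int *)bio->priv)++;
}
```

The ordering inside `null_make_request()` is the important part: the status has to be in place before the completion callback runs, because the callback is where the submitter reads it.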

The bio structure

see: http://books.gigatux.nl/mirror/kerneldevelopment/0672327201/ch13lev1sec3.html#:~:text=The basic container for block,that is contiguous in memory.

Example code: submit a bio

/**
 * Author: ztex 2020/8/19
 * write_lba(): Write bytes to disk, starting at given LBA
 * @state: disk parsed partitions
 * @lba: the Logical Block Address to start writing at
 * @buffer: source buffer
 * @count: bytes to write
 *
 * Description: Write @count bytes from @buffer into @state->bdev.
 * Returns number of bytes written; a short count (possibly 0) means error.
 *
 * Uses the pre-4.8 block API: bio->bi_bdev and submit_bio(rw, bio).
 */
static size_t write_lba(struct parsed_partitions *state,
		       u64 lba, u8 *buffer, size_t count)
{
	size_t totalwritten = 0;
	struct block_device *bdev = state->bdev;
	struct bio *bio;
	struct page *page;
	struct address_space *mapping = bdev->bd_inode->i_mapping;

	sector_t n = lba * (bdev_logical_block_size(bdev) / 512);

	if (!buffer || lba > last_lba(bdev))
		return 0;

	while (count) {
		size_t copied = 512;
		Sector sect;
		unsigned char *data;

		/* Check the bounds before allocating anything. */
		if (n >= get_capacity(bdev->bd_disk)) {
			state->access_beyond_eod = true;
			pr_warn("[ZTEX] write_lba access beyond eod\n");
			break;
		}

		/* Read the sector through the page cache and patch it in place. */
		data = read_part_sector(state, n, &sect);
		if (!data)
			break;
		if (copied > count)
			copied = count;
		memcpy(data, buffer, copied);

		/* Look up the page that backs sector n (takes a reference). */
		page = read_mapping_page(mapping,
				(pgoff_t)(n >> (PAGE_SHIFT - 9)), NULL);
		if (IS_ERR(page)) {
			put_dev_sector(sect);
			break;
		}

		bio = bio_alloc(GFP_NOIO, 1);
		bio_get(bio);	/* keep the bio valid across submit_bio() */
		bio->bi_bdev = bdev;
		bio->bi_iter.bi_sector = n;

		/* SECTOR_TO_PAGE_OFFSET() is the author's macro, not shown here. */
		pr_warn("[ZTEX][WRITE_LBA] lba: %llu; sector: %llu; offset: %llu\n",
			(unsigned long long)lba, (unsigned long long)n,
			(unsigned long long)SECTOR_TO_PAGE_OFFSET(n));
		/* Block I/O covers whole sectors, so always add 512 bytes. */
		bio_add_page(bio, page, 512, SECTOR_TO_PAGE_OFFSET(n));
		/*
		 * No bi_end_io is set, so nothing drops the allocation
		 * reference or the page reference on completion; a complete
		 * version would set one or use submit_bio_wait().
		 */
		submit_bio(WRITE_FLUSH_FUA, bio);

		put_dev_sector(sect);
		buffer += copied;
		totalwritten += copied;
		count -= copied;
		n++;
		bio_put(bio);
	}
	return totalwritten;
}

explanation

given a block device struct block_device *bdev

in order to write to sector n

1. allocate a bio
bio = bio_alloc(GFP_NOIO, 1);
2. increase reference count
bio_get(bio);
3. associate bio with bdev
bio->bi_bdev = bdev;
4. get the block device inode mapping
struct address_space *mapping = bdev->bd_inode->i_mapping;
5. read the page
page = read_mapping_page(mapping,
				(pgoff_t)(n >> (PAGE_SHIFT - 9)), NULL);
6. convert a given page to its kernel virtual (logical) address, see: https://stackoverflow.com/questions/11602930/linux-kernel-function-page-address
(unsigned char *)page_address(page)
7. A convenient way to add a bio_vec
int bio_add_page(struct bio *bio, struct page *page,
		unsigned int len, unsigned int off)
8. submit bio
submit_bio(WRITE_FLUSH_FUA, bio);
9. decrease reference count
bio_put(bio);
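
The index arithmetic used in steps 5 and 7 can be checked in isolation. The sketch below assumes 4 KiB pages and 512-byte kernel sectors; the function names are hypothetical, and `sector_to_page_offset()` is only a guess at what the note's SECTOR_TO_PAGE_OFFSET() macro computes, since its definition is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT	9	/* the kernel's sector unit is 512 bytes */
#define DEMO_PAGE_SHIFT	12	/* assumption: 4 KiB pages */

/* First 512-byte sector of a logical block, as in write_lba(). */
uint64_t lba_to_sector(uint64_t lba, unsigned int logical_block_size)
{
	assert(logical_block_size >= 512 && logical_block_size % 512 == 0);
	return lba * (logical_block_size >> SECTOR_SHIFT);
}

/* Index of the page-cache page that holds sector n. */
uint64_t sector_to_page_index(uint64_t n)
{
	return n >> (DEMO_PAGE_SHIFT - SECTOR_SHIFT);
}

/* Byte offset of sector n inside that page. */
unsigned int sector_to_page_offset(uint64_t n)
{
	uint64_t sectors_per_page = 1ULL << (DEMO_PAGE_SHIFT - SECTOR_SHIFT);

	return (unsigned int)((n & (sectors_per_page - 1)) << SECTOR_SHIFT);
}
```

With these assumptions a page holds eight sectors, so sector 9 lives in page 1 at byte offset 512, and LBA 3 on a device with 4096-byte logical blocks starts at sector 24.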

Memory mapping

Memory mapping
kmap in Linux (Linux中的kmap)
Driver porting: low-level memory allocation
The Linux generic block layer (Linux通用块设备层)
Chapter 16. Block Drivers
The Linux kernel cache mechanism (Linux内核Cache机制)
Block Device Drivers
https://linux-kernel-labs.github.io/refs/heads/master/labs/block_device_drivers.html