# Block device, BIO in Linux kernel
###### tags: `block device` `bio`

> [name=ztex]

## :memo: Linux storage stack

![The Linux storage stack diagram](https://upload.wikimedia.org/wikipedia/commons/thumb/f/fb/The_Linux_Storage_Stack_Diagram.svg/1052px-The_Linux_Storage_Stack_Diagram.svg.png)

## The block layer

See "A block layer introduction":
1. [part 1 the bio layer](https://lwn.net/Articles/736534/)
2. [part 2 the request layer](https://lwn.net/Articles/738449/)

The term "block layer" is often used to talk about that part of the Linux kernel which implements the interface that **applications and filesystems use to access various storage devices**.

The bio layer is a thin layer that **takes I/O requests in the form of bio structures and passes them directly to the appropriate make_request_fn() function**. It provides various support functions to simplify splitting bios and scheduling the sub-bios, and to allow plugging of the queue. It also performs some other simple tasks such as updating the pgpgin and pgpgout statistics in /proc/vmstat, but mostly it just lets the next level down get on with its work.

> The bio layer receives bios (a bio is a kind of I/O request) and hands each one to the appropriate make_request_fn().
> It also provides helpers for splitting bios and scheduling the sub-bios, and it allows the queue to be plugged.
> In addition, it updates the pgpgin and pgpgout statistics in /proc/vmstat.
> [name=ztex]

**Sometimes the next layer is just the final driver**, as with drbd (The Distributed Replicated Block Device) or brd (a RAM based block device). **More often the next layer is an intermediate layer** such as for the virtual devices provided by md (used for software RAID) and dm (used, for example, by LVM2). Probably the most common is when **that intermediate layer is the remainder of the block layer, which I have chosen to call the "request layer"**.

![](https://i.imgur.com/5Z7q1ps.png)

Access to block devices generally happens through block special devices in /dev, which map to S_IFBLK inodes in the kernel.
These **inodes act a little bit like symbolic links** in that **they don't represent the block device directly but simply contain a pointer to the block device** as a "major:minor" number pair. Internally the **i_bdev field in the inode contains a link to a struct block_device that represents the target device.** This block device holds a reference to a **second inode: block_device->bd_inode. This inode is more closely involved in I/O to the block device, the original inode in /dev is just a pointer**.

> Think of the /dev inode as a kind of link, not as the block device itself: it is a pointer to the block device.
> `inode->i_bdev` is the link to the block device and has little to do with I/O.
> `block_device->bd_inode` is the inode that is actually involved in I/O.

For `inode->i_bdev`, see: https://elixir.bootlin.com/linux/latest/source/include/linux/fs.h#L725
For `block_device->bd_inode`, see: https://elixir.bootlin.com/linux/latest/source/include/linux/fs.h#L479
(Both links follow the `latest` tree, so the line anchors may have drifted and the fields may have moved in newer kernels.)

The main role that this **second inode** plays (which is implemented in fs/block_dev.c, fs/buffer.c, and elsewhere) is to provide a **page cache**. When the device file is opened **without the O_DIRECT flag, the page cache associated with the inode is used to buffer reads**, including readahead, and to buffer writes, usually **delaying writes until the normal writeback process flushes them out**. When **O_DIRECT is used, reads and writes go directly to the block device**. Similarly when a filesystem mounts a block device, reads and writes from the **filesystem usually go directly to the device**, **though some filesystems (particularly the ext\* family) can access the same page cache (traditionally known as the buffer cache in this context) to manage some of the filesystem data**.

> When the device is opened with O_DIRECT, reads and writes go straight to the block device. Otherwise they are buffered, and writes are delayed until the normal writeback process flushes the buffers out to the device.
> Filesystems usually read and write the device directly, except for some (notably the ext* family) that use the same page cache (the buffer cache) to manage part of their data.
> [name=ztex]

Another open() flag of particular relevance to block devices is O_EXCL.
Block devices have a simple advisory-locking scheme whereby each block device can have at most one "holder". The holder is specified when activating the block device (e.g. using a blkdev_get() or similar call in the kernel); that will fail if a different holder has already claimed the device. Filesystems usually specify a holder when mounting a device to ensure exclusive access. When an application opens a block device with O_EXCL, that causes the newly created struct file to be used as the holder; the open will fail if a filesystem is mounted from the device. If the open is successful, it will block future mount attempts as long as the device remains open. Using O_EXCL doesn't prevent the block device from being opened without O_EXCL, so it doesn't prevent concurrent writes completely; it just makes it easy for applications to test if the block device is in use.

> A block device has a simple advisory-locking scheme: each block device can have at most one holder at a time. The holder is claimed through `blkdev_get()`; if a different holder already exists, the attempt fails.
> Filesystems usually keep the device claimed while it is mounted, which is how they guarantee exclusive access.
> [name=ztex]

All block devices in Linux are represented by struct gendisk, a "generic disk". This structure doesn't contain a great deal of information and largely serves as a link between the filesystem interface "above" and the lower-layer interface "below". Above the gendisk is one or more struct block_device, which, as we already saw, are linked from inodes in /dev. A gendisk can be associated with multiple block_device structures when it has a partition table. There will be one block_device that represents the whole gendisk, and possibly some others that represent partitions within the gendisk.

> Every block device is represented by a struct gendisk.
> The gendisk can be seen as the link between the filesystem interface above and the lower layers below.
> When there is a partition table, one gendisk can be associated with several block_device structures.
> One block_device represents the whole gendisk; the other block_device structures represent the partitions inside it.
> [name=ztex]

The "bio" that gives its name to the bio layer is a data structure **(struct bio) that carries read and write requests, and assorted other control requests**, from the block_device, past the gendisk, and on to the driver. A bio identifies a **target device, an offset in the linear address space of the device, a request (typically READ or WRITE), a size, and some memory where data will be copied to or from**. Prior to Linux 4.14, the target device would be identified in the bio by a pointer to the struct block_device. Since [then](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=74d46992e0d9) it holds a pointer to the struct gendisk together with a partition number, which can be set by bio_set_dev(). This is more natural given the central role of the gendisk structure.

> An I/O request is represented by a struct bio. The structure holds: the target device, an offset in the device's linear address space, the request type (READ or WRITE), a size, and the memory that data is copied to or from.
> Before Linux 4.14 the target device was stored in the bio as a pointer to a struct block_device. Since commit 74d46992e0d9dee7f1f376de0d56d31614c8a17a it is a gendisk plus a partition number, which can be set with the `bio_set_dev()` macro.
> [name=ztex]

Once constructed, **a bio is given to the bio layer by calling generic_make_request() or, equivalently, submit_bio()**. **This does not normally wait for the request to complete**, but merely queues it for subsequent handling. generic_make_request() can still block for short periods of time, to wait for memory to become available, for example. A useful way to think about this behavior is that it might wait for previous requests to complete (e.g. to make room on the queue), but not for the new request to complete. If the REQ_NOWAIT flag is set in the bi_opf field, generic_make_request() shouldn't wait at all if there is insufficient space and should, instead, cause the bio to complete with the status set to BLK_STS_AGAIN, or possibly BLK_STS_NOTSUPP. As of this writing, this feature is not yet implemented correctly or consistently.
> Once a struct bio has been built, it is handed to the bio layer with `generic_make_request()` or `submit_bio()`.
> Neither function waits for the I/O request to complete. (ztex: so when the bio is still used after submission, the sequence must be bio_get() -> submit_bio() -> bio_put(). I once hit a kernel panic without it. See: https://elixir.bootlin.com/linux/latest/source/include/linux/bio.h#L210)
> [name=ztex]

```cpp
/*
 * get a reference to a bio, so it won't disappear. the intended use is
 * something like:
 *
 * bio_get(bio);
 * submit_bio(rw, bio);
 * if (bio->bi_flags ...)
 *	do_something
 * bio_put(bio);
 *
 * without the bio_get(), it could potentially complete I/O before submit_bio
 * returns. and then bio would be freed memory when if (bio->bi_flags ...)
 * runs
 */
```

The interface between the bio layer and request layer requires devices to register with the bio layer by calling blk_queue_make_request() and passing a make_request_fn() function that takes a bio. generic_make_request() will call that function for the device identified in the bio. This function must arrange things such that, when the I/O request described by the bio completes, the bi_status field is set to indicate success or failure and call bio_endio() which, in turn, will call the bi_end_io() function stored in the structure.

> The interface between the bio layer and the request layer requires devices to register through `blk_queue_make_request()`, passing a `make_request_fn()` that accepts a bio.
> `make_request_fn()` must handle the I/O request. When the request completes, `bi_status` has to be set to indicate success or failure and `bio_endio()` has to be called, which in turn calls `bi_end_io()`, the completion callback stored in the `struct bio`.
> [name=ztex]

## :package: The bio structure

see: http://books.gigatux.nl/mirror/kerneldevelopment/0672327201/ch13lev1sec3.html#:~:text=The%20basic%20container%20for%20block,that%20is%20contiguous%20in%20memory.
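The reference-counting rule from the `bio_get()` comment and the `bi_end_io()` completion callback can be modelled in user space. The sketch below is a toy model, not kernel code: `struct toy_bio` and every `toy_*` function are hypothetical names invented for illustration, and the "device" completes the request immediately, which is the worst case the comment warns about.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for struct bio (hypothetical, user space only). */
struct toy_bio {
	int refcount;                     /* 1 right after allocation */
	int status;                       /* 0 means success, set at completion */
	void (*end_io)(struct toy_bio *); /* optional completion callback */
	int *freed_flag;                  /* lets the caller observe the free */
};

static struct toy_bio *toy_bio_alloc(int *freed_flag)
{
	struct toy_bio *bio = calloc(1, sizeof(*bio));

	if (!bio)
		return NULL;
	bio->refcount = 1;
	bio->freed_flag = freed_flag;
	*freed_flag = 0;
	return bio;
}

static void toy_bio_get(struct toy_bio *bio)
{
	bio->refcount++;
}

static void toy_bio_put(struct toy_bio *bio)
{
	assert(bio->refcount > 0);
	if (--bio->refcount == 0) {
		*bio->freed_flag = 1;
		free(bio);
	}
}

/* Completion path: record the status, then run the callback if one is set. */
static void toy_bio_endio(struct toy_bio *bio, int status)
{
	bio->status = status;
	if (bio->end_io)
		bio->end_io(bio);
}

/* Completion callback that drops the allocation reference. */
static void toy_end_io_put(struct toy_bio *bio)
{
	toy_bio_put(bio);
}

/* "Submit" that completes at once, before returning to the caller. */
static void toy_submit_bio(struct toy_bio *bio)
{
	toy_bio_endio(bio, 0);
}

/*
 * Submit a toy bio and read its status afterwards.
 * Returns the status, or -1 if allocation failed.
 */
int toy_write_and_check(int *freed_after_submit, int *freed_at_end)
{
	int freed;
	int status;
	struct toy_bio *bio = toy_bio_alloc(&freed);

	if (!bio)
		return -1;
	bio->end_io = toy_end_io_put;

	toy_bio_get(bio);             /* extra reference held by the submitter */
	toy_submit_bio(bio);          /* completion drops the allocation reference */
	*freed_after_submit = freed;  /* still alive thanks to toy_bio_get() */
	status = bio->status;         /* safe to touch the bio here */
	toy_bio_put(bio);             /* last reference: the bio is freed now */
	*freed_at_end = freed;
	return status;
}
```

Without the `toy_bio_get()` call, the completion callback would drop the only reference and the read of `bio->status` would touch freed memory, which mirrors the situation described in the kernel comment.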
## :page_facing_up: Example code: submit a bio

```cpp
/**
 * Author: ztex 2020/8/19
 * write_lba(): Write bytes to disk, starting at given LBA
 * @state: disk parsed partitions
 * @lba: the Logical Block Address of the partition table
 * @buffer: source buffer
 * @count: bytes to write
 *
 * Description: Write @count bytes from @buffer into @state->bdev.
 * Returns number of bytes submitted for writing, 0 on error.
 *
 * Uses the pre-4.8 block API (bio->bi_bdev, submit_bio(rw, bio)).
 */
static size_t write_lba(struct parsed_partitions *state,
			u64 lba, u8 *buffer, size_t count)
{
	size_t totalwritecount = 0;
	struct block_device *bdev = state->bdev;
	struct bio *bio;
	struct page *page;
	struct address_space *mapping = bdev->bd_inode->i_mapping;
	sector_t n = lba * (bdev_logical_block_size(bdev) / 512);

	if (!buffer || lba > last_lba(bdev))
		return 0;

	while (count) {
		int copied = 512;
		Sector sect;
		unsigned char *data;

		/* Stop at the end of the device before taking any references. */
		if (n >= get_capacity(bdev->bd_disk)) {
			state->access_beyond_eod = true;
			pr_warn("[ZTEX] write_lba access beyond eod\n");
			break;
		}

		/* Copy the new data into the page-cache page backing sector n. */
		data = read_part_sector(state, n, &sect);
		if (!data)
			break;
		if (copied > count)
			copied = count;
		memcpy(data, buffer, copied);

		bio = bio_alloc(GFP_NOIO, 1);
		bio_get(bio);	/* keep the bio valid after submit_bio() */
		bio->bi_bdev = bdev;
		bio->bi_iter.bi_sector = n;

		page = read_mapping_page(mapping,
					 (pgoff_t)(n >> (PAGE_SHIFT - 9)), NULL);
		if (IS_ERR(page) || PageError(page)) {
			if (!IS_ERR(page))
				put_page(page);
			bio_put(bio);	/* drop the bio_get() reference */
			bio_put(bio);	/* drop the bio_alloc() reference */
			put_dev_sector(sect);
			break;
		}

		pr_warn("[ZTEX][WRITE_LBA] lba: %llu; sector: %llu; offset: %llu\n",
			(unsigned long long)lba, (unsigned long long)n,
			(unsigned long long)SECTOR_TO_PAGE_OFFSET(n));
		bio_add_page(bio, page, copied, SECTOR_TO_PAGE_OFFSET(n));

		/* Queues the write; does not wait for it to complete. */
		submit_bio(WRITE_FLUSH_FUA, bio);
		put_dev_sector(sect);

		buffer += copied;
		totalwritecount += copied;
		count -= copied;
		n++;
		/*
		 * NOTE: no bi_end_io is set, so the bio_alloc() reference is
		 * never dropped and the write status is not checked.
		 */
		bio_put(bio);
	}
	return totalwritecount;
}
```

explanation

```
given a block device
	struct block_device *bdev
in order to write to sector n

1. allocate a bio
	bio = bio_alloc(GFP_NOIO, 1);
2. increase reference count
	bio_get(bio);
3. associate bio with bdev
	bio->bi_bdev = bdev;
4. get the block device inode mapping
	struct address_space *mapping = bdev->bd_inode->i_mapping;
5. read the page
	page = read_mapping_page(mapping, (pgoff_t)(n >> (PAGE_SHIFT - 9)), NULL);
6. convert a given page to its logical address, see: https://stackoverflow.com/questions/11602930/linux-kernel-function-page-address
	(unsigned char *)page_address(page)
7. a convenient way to add a bio_vec
	int bio_add_page(struct bio *bio, struct page *page, unsigned int len, unsigned int off)
8. submit bio
	submit_bio(WRITE_FLUSH_FUA, bio);
9. decrease reference count
	bio_put(bio);
```

## :memo: Memory mapping

- [Memory mapping](https://linux-kernel-labs.github.io/refs/heads/master/labs/memory_mapping.html)
- [kmap in Linux](https://zhuanlan.zhihu.com/p/69329911)
- [Driver porting: low-level memory allocation](https://lwn.net/Articles/22909/)
- [The Linux generic block layer](Linux通用块设备层)
- [Chapter 16. Block Drivers](https://www.oreilly.com/library/view/linux-device-drivers/0596005903/ch16.html)
- [The Linux kernel cache mechanism](https://zhuanlan.zhihu.com/p/56823442)
- [Block Device Drivers](https://linux-kernel-labs.github.io/refs/heads/master/labs/block_device_drivers.html)
- [How to read a sector using a bio request in Linux kernel](https://stackoverflow.com/questions/12720420/how-to-read-a-sector-using-a-bio-request-in-linux-kernel)
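The sector arithmetic used in `write_lba()` can be checked in user space. This is a minimal sketch under stated assumptions: pages are 4 KiB (`TOY_PAGE_SHIFT` is 12), sectors are 512 bytes, and `sector_to_page_offset()` is only a guess at what the `SECTOR_TO_PAGE_OFFSET()` macro computes, since the note does not show its definition. All names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SECTOR_SHIFT 9   /* 512-byte sectors, as in write_lba() */
#define TOY_PAGE_SHIFT   12  /* assumption: 4 KiB pages */

/* LBA to 512-byte sector number: lba * (logical_block_size / 512). */
uint64_t lba_to_sector(uint64_t lba, unsigned int logical_block_size)
{
	assert(logical_block_size >= 512 && logical_block_size % 512 == 0);
	return lba * (logical_block_size / 512);
}

/* Index of the page-cache page holding sector n: n >> (PAGE_SHIFT - 9). */
uint64_t sector_to_page_index(uint64_t n)
{
	return n >> (TOY_PAGE_SHIFT - TOY_SECTOR_SHIFT);
}

/* Byte offset of sector n inside its page (assumed macro behaviour). */
unsigned int sector_to_page_offset(uint64_t n)
{
	uint64_t sectors_per_page = 1ULL << (TOY_PAGE_SHIFT - TOY_SECTOR_SHIFT);

	return (unsigned int)((n & (sectors_per_page - 1)) << TOY_SECTOR_SHIFT);
}
```

With these assumptions, eight sectors share one page: sector 7 is the last sector of page 0 at byte offset 3584, and sector 8 starts page 1 at offset 0. On a device with 4096-byte logical blocks, LBA 3 maps to sector 24.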
