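---
title: File System (檔案系統) - 從零開始的開源地下城
tags: Linux, Linux讀書會, Kernel, 從零開始的開源地下城, COMBO-tw
description: An introduction to basic Linux kernel concepts
lang: zh-Hant
GA: G-2QY5YFX2BV
---

# File System

###### tags: `Linux`

## Table of Contents

[TOC]

## Introduction

### What Is a File System

A file system is a method of storing and organizing data on a computer so that the data is easy to access and find. It replaces the data-block view used by physical devices such as hard disks and optical discs with the logical abstractions of files and a tree of directories. Users who store data through a file system do not need to know which block addresses on the disk hold the data; they only need to remember the directory and the file name. Before writing new data, they also do not need to know which blocks on the disk are free: the file system allocates and releases storage space automatically, so users only need to remember which file the data was written to.

### Common File Systems and Formats

As Linux has evolved, the set of file systems it supports has grown quickly. The Linux kernel supports more than a dozen file system types: Btrfs, JFS, ReiserFS, exFAT, ext, ext2, ext3, ext4, XFS, ISO 9660, Minix, MSDOS, UMSDOS, VFAT, NTFS, HPFS, NFS, SMB, SysV, PROC, and others.

Note: write support in the kernel's built-in NTFS driver is unstable, and some distributions do not build that driver by default. A common way to read and write NTFS on Linux is to install an NTFS driver such as NTFS-3G or ufsd. Some distributions have only limited NTFS support.

* EXT2
    * ![](https://hackmd.io/_uploads/H1mYXobL2.png)

### VFS Architecture

![](https://hackmd.io/_uploads/Hy_KmiZL3.png)
![](https://hackmd.io/_uploads/ryht7s-U2.png)

VFS defines an interface one layer above the concrete file systems. Applications access the underlying data through that interface without caring how the lower layer is implemented. With VFS, adding a new file system is straightforward: it only has to implement the interface that VFS specifies.

* Focusing on the relationship between VFS and system calls:
    ![](https://hackmd.io/_uploads/rkX57obI3.png)
* VFS has several main objects: `superblock`, `inode`, `dentry`, `file`, and so on. First, `inode` and `file`:
    * Both `inode` and `file` represent a file in the file system.
    * A `file` is only the data structure the kernel creates for a process when it executes `open`, so each `open` produces a new `file` (in some cases an existing `file` is shared instead and its `file.f_count` is incremented). All of these `file` objects point to the same underlying file, which is the `inode`.
    * ![](https://hackmd.io/_uploads/Bkj9QjZU2.png)
* `superblock` holds the global data of a file system.
* `inode` holds the metadata of one file.
* `dentry` maps a file name to its inode.
* `file` is the handle for an open file; it carries the cursor, that is, the current offset in the file.

### How to Trace a File System

* https://elixir.bootlin.com/linux/v4.14.50/source/fs
* All file systems live under the `fs/` directory of the kernel source tree.
* For example, to trace EXT2, the whole file system is under `fs/ext2`.

#### Overview

A user process writes data to disk through the `write()` system call. But once `write()` returns, has the data actually reached the disk?

When the kernel reads file data it uses read-ahead; when it writes data it uses delayed write. In other words, when `write()` returns, the request has not yet been placed on the block device driver's request queue, so the data has not yet been written to the disk.

> This article does not cover files opened with O_DIRECT or O_SYNC; we only consider asynchronous I/O. The synchronous write path is comparatively simple.

![](https://hackmd.io/_uploads/rkVj7iW83.png)

* Call trace for reference (captured on a 2.6.32 kernel, so some function names differ from the v4.14 source quoted later):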
```
Pid: 4318, comm: cp Tainted: G --------------- HT 2.6.32279.debug #19
Call Trace:
[<ffffffff8135ce59>] ? brd_make_request+0x439/0x510
[<ffffffff814fd503>] ? printk+0x41/0x46
[<ffffffff81256f64>] ? generic_make_request+0x2c4/0x5b0
[<ffffffff8125730c>] ? submit_bio+0xbc/0x160
[<ffffffff811acd46>] ? submit_bh+0xf6/0x150
[<ffffffffa0109773>] ? ext4_mb_init_cache+0x883/0x9f0 [ext4]
[<ffffffff8112b560>] ? __lru_cache_add+0x40/0x90
[<ffffffffa01099fe>] ? ext4_mb_init_group+0x11e/0x210 [ext4]
[<ffffffffa0109bbd>] ? ext4_mb_good_group+0xcd/0x110 [ext4]
[<ffffffffa010b35b>] ? ext4_mb_regular_allocator+0x19b/0x410 [ext4]
[<ffffffff814feffe>] ? mutex_lock+0x1e/0x50
[<ffffffffa010d22d>] ? ext4_mb_new_blocks+0x38d/0x560 [ext4]
[<ffffffffa0100ace>] ? ext4_ext_find_extent+0x2be/0x320 [ext4]
[<ffffffffa0103b83>] ? ext4_ext_get_blocks+0x1113/0x1a10 [ext4]
[<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
[<ffffffff81136959>] ? zone_statistics+0x99/0xc0
[<ffffffff811259f1>] ? get_page_from_freelist+0x3d1/0x820
[<ffffffffa00dfd39>] ? ext4_get_blocks+0xf9/0x2a0 [ext4]
[<ffffffffa00e07ad>] ? ext4_get_block+0xbd/0x120 [ext4]
[<ffffffff811afe1b>] ? __block_prepare_write+0x1db/0x570
[<ffffffffa00e06f0>] ? ext4_get_block+0x0/0x120 [ext4]
[<ffffffff811b048c>] ? block_write_begin_newtrunc+0x5c/0xd0
[<ffffffff811b0893>] ? block_write_begin+0x43/0x90
[<ffffffffa00e06f0>] ? ext4_get_block+0x0/0x120 [ext4]
[<ffffffffa00e4896>] ? ext4_write_begin+0x226/0x2d0 [ext4]
[<ffffffffa00e06f0>] ? ext4_get_block+0x0/0x120 [ext4]
[<ffffffffa00e4ac8>] ? ext4_da_write_begin+0x188/0x200 [ext4]
[<ffffffffa00ab63f>] ? jbd2_journal_dirty_metadata+0xff/0x150 [jbd2]
[<ffffffff814ff6f6>] ? down_read+0x16/0x30
[<ffffffff81114ab3>] ? generic_file_buffered_write+0x123/0x2e0
[<ffffffff810724c7>] ? current_fs_time+0x27/0x30
[<ffffffff81116450>] ? __generic_file_aio_write+0x250/0x480
[<ffffffff8113f1c7>] ? handle_pte_fault+0xf7/0xb50
[<ffffffff811166ef>] ? generic_file_aio_write+0x6f/0xe0
[<ffffffffa00d9131>] ? ext4_file_write+0x61/0x1e0 [ext4]
[<ffffffff8117ad6a>] ? do_sync_write+0xfa/0x140
[<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff810edfc2>] ? ring_buffer_lock_reserve+0xa2/0x160
[<ffffffff810d69e2>] ? audit_syscall_entry+0x272/0x2a0
[<ffffffff81213136>] ? security_file_permission+0x16/0x20
[<ffffffff8117b068>] ? vfs_write+0xb8/0x1a0
[<ffffffff8117ba81>] ? sys_write+0x51/0x90
[<ffffffff8100b308>] ? tracesys+0xd9/0xde
```

#### The Virtual File System and ext4 Layers

* Before reading the source, look at the prototype of `write()`:

```
#include <unistd.h>

ssize_t write(int fd, const void *buf, size_t count);
```

1. `sys_write()`
    * Inside the kernel, the `write()` system call is implemented by `sys_write()`, whose prototype is declared in `include/linux/syscalls.h`.

    ```
    asmlinkage long sys_write(unsigned int fd, const char __user *buf,
                  size_t count);
    ```

    * The implementation of `sys_write()` is in `fs/read_write.c`.

    ```c=581
    SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
            size_t, count)
    {
        struct fd f = fdget_pos(fd);
        ssize_t ret = -EBADF;

        if (f.file) {
            loff_t pos = file_pos_read(f.file);
            ret = vfs_write(f.file, buf, count, &pos);
            if (ret >= 0)
                file_pos_write(f.file, pos);
            fdput_pos(f);
        }

        return ret;
    }
    ```

    * On line 584, `fdget_pos()` looks up the file structure for the open fd. Before writing data we must first `open()` a file; after that there is a file descriptor (fd) and a corresponding `file` object.
    * `file_pos_read()` and `file_pos_write()` read and update the current read/write position.
    * The main focus is `vfs_write()`, covered next.

2. `vfs_write()`
    * The implementation of `vfs_write()` is in `fs/read_write.c`.

    ```
    ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
    {
        ssize_t ret;

        if (!(file->f_mode & FMODE_WRITE))
            return -EBADF;
        if (!(file->f_mode & FMODE_CAN_WRITE))
            return -EINVAL;
        if (unlikely(!access_ok(VERIFY_READ, buf, count)))
            return -EFAULT;

        ret = rw_verify_area(WRITE, file, pos, count);
        if (!ret) {
            if (count > MAX_RW_COUNT)
                count = MAX_RW_COUNT;
            file_start_write(file);
            ret = __vfs_write(file, buf, count, pos);
            if (ret > 0) {
                fsnotify_modify(file);
                add_wchar(current, ret);
            }
            inc_syscw(current);
            file_end_write(file);
        }

        return ret;
    }
    ```

    * On entry it first checks that the file was opened for writing (`FMODE_WRITE`); if not, it returns an error immediately.
    * `rw_verify_area()` checks whether the `count` bytes starting at the current position `f_pos` are covered by a mandatory lock that forbids the write.
    * Once the request has been validated, it calls the write method from the `file_operations` defined by the concrete file system.
    * For ext4, the `file_operations` are defined in `fs/ext4/file.c`.

    ```
    const struct file_operations ext4_file_operations = {
        .llseek            = ext4_llseek,
        .read_iter         = ext4_file_read_iter,
        .write_iter        = ext4_file_write_iter,
        .unlocked_ioctl    = ext4_ioctl,
    #ifdef CONFIG_COMPAT
        .compat_ioctl      = ext4_compat_ioctl,
    #endif
        .mmap              = ext4_file_mmap,
        .open              = ext4_file_open,
        .release           = ext4_release_file,
        .fsync             = ext4_sync_file,
        .get_unmapped_area = thp_get_unmapped_area,
        .splice_read       = generic_file_splice_read,
        .splice_write      = iter_file_splice_write,
        .fallocate         = ext4_fallocate,
    };
    ```

3. `__vfs_write()`
    * The implementation of `__vfs_write()` is in `fs/read_write.c`.

    ```
    ssize_t __vfs_write(struct file *file, const char __user *p, size_t count,
                loff_t *pos)
    {
        if (file->f_op->write)
            return file->f_op->write(file, p, count, pos);
        else if (file->f_op->write_iter)
            return new_sync_write(file, p, count, pos);
        else
            return -EINVAL;
    }
    ```

    * `ext4_file_operations` above has no `write` but does have `write_iter`, so the next step goes through `new_sync_write()`.
    * The implementation of `new_sync_write()` is in `fs/read_write.c`.

    ```
    static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos)
    {
        struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = len };
        struct kiocb kiocb;
        struct iov_iter iter;
        ssize_t ret;

        init_sync_kiocb(&kiocb, filp);
        kiocb.ki_pos = *ppos;
        iov_iter_init(&iter, WRITE, &iov, 1, len);

        ret = call_write_iter(filp, &kiocb, &iter);
        BUG_ON(ret == -EIOCBQUEUED);
        if (ret > 0)
            *ppos = kiocb.ki_pos;
        return ret;
    }
    ```

    * `call_write_iter()` then calls the file system's `write_iter` function.

    ```
    static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
                          struct iov_iter *iter)
    {
        return file->f_op->write_iter(kio, iter);
    }
    ```

4. `ext4_file_write_iter()`
    * This is where we enter EXT4.
    * ![](https://hackmd.io/_uploads/S1XnXi-8h.png)
    * Ext2/3 and other older Linux file systems use indirect block mapping: every block of a file has to be recorded, which makes operations on large files (such as deletion) inefficient. For example, a 100 MB file on Ext3 needs mapping entries for 25,600 data blocks (4 KB each). Ext4 introduces extents to replace that scheme. An extent is a large, contiguous range of physical blocks; with a 4 KB block size, a single ext4 extent can map up to 128 MB of contiguous physical storage. The same 100 MB file can then be described as "the data of this file is stored in the next 25,600 blocks", which is far more efficient. Ext4 supports the new extent mode as well as the traditional block-mapping mode.

    ```
    /*
     * This is the extent on-disk structure.
     * It's used at the bottom of the tree.
     */
    struct ext4_extent {
        __le32  ee_block;    /* first logical block extent covers */
        __le16  ee_len;      /* number of blocks covered by extent */
        __le16  ee_start_hi; /* high 16 bits of physical block */
        __le32  ee_start_lo; /* low 32 bits of physical block */
    };

    /*
     * This is index on-disk structure.
     * It's used at all the levels except the bottom.
     */
    struct ext4_extent_idx {
        __le32  ei_block;    /* index covers logical blocks from 'block' */
        __le32  ei_leaf_lo;  /* pointer to the physical block of the next *
                              * level. leaf or next index could be there */
        __le16  ei_leaf_hi;  /* high 16 bits of physical block */
        __u16   ei_unused;
    };

    /*
     * Each block (leaves and indexes), even inode-stored has header.
     */
    struct ext4_extent_header {
        __le16  eh_magic;      /* probably will support different formats */
        __le16  eh_entries;    /* number of valid entries */
        __le16  eh_max;        /* capacity of store in entries */
        __le16  eh_depth;      /* has tree real underlying blocks? */
        __le32  eh_generation; /* generation of the tree */
    };
    ```

    * ![](https://hackmd.io/_uploads/rJ22msbU3.png)
    * ![](https://hackmd.io/_uploads/Sk-T7jWLn.png)
    * `ext4_file_write_iter()` is defined in `fs/ext4/file.c`.

    ```
    static ssize_t
    ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
    {
        struct inode *inode = file_inode(iocb->ki_filp);
        int o_direct = iocb->ki_flags & IOCB_DIRECT;
        int unaligned_aio = 0;
        int overwrite = 0;
        ssize_t ret;

        if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
            return -EIO;

    #ifdef CONFIG_FS_DAX
        if (IS_DAX(inode))
            return ext4_dax_write_iter(iocb, from);
    #endif
        if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
            return -EOPNOTSUPP;

        if (!inode_trylock(inode)) {
            if (iocb->ki_flags & IOCB_NOWAIT)
                return -EAGAIN;
            inode_lock(inode);
        }

        ret = ext4_write_checks(iocb, from);
        if (ret <= 0)
            goto out;

        /*
         * Unaligned direct AIO must be serialized among each other as zeroing
         * of partial blocks of two competing unaligned AIOs can result in data
         * corruption.
         */
        if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
            !is_sync_kiocb(iocb) &&
            ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
            unaligned_aio = 1;
            ext4_unwritten_wait(inode);
        }

        iocb->private = &overwrite;
        /* Check whether we do a DIO overwrite or not */
        if (o_direct && !unaligned_aio) {
            if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
                if (ext4_should_dioread_nolock(inode))
                    overwrite = 1;
            } else if (iocb->ki_flags & IOCB_NOWAIT) {
                ret = -EAGAIN;
                goto out;
            }
        }

        ret = __generic_file_write_iter(iocb, from);
        inode_unlock(inode);

        if (ret > 0)
            ret = generic_write_sync(iocb, ret);

        return ret;

    out:
        inode_unlock(inode);
        return ret;
    }
    ```

    * `ext4_write_checks()` validates the write (state and permissions).
    * `__generic_file_write_iter()` performs the write itself; on the buffered path it copies the data into the page cache through `generic_perform_write()`.
    * `generic_write_sync()` flushes the written range to disk when the file requires synchronous writes.

That is a quick trace of the path from VFS down to EXT4. I have not traced all of it; the rest is left for you to follow.

## Exercises and Reflection

### Exercises

* We traced as far as the entry point `ext4_file_write_iter()`, but the path continues below it. Practice tracing the rest yourself.
    * Some tips: `__generic_file_write_iter()` -> `generic_perform_write()` -> `a_ops->write_begin()` -> `a_ops->write_end()`
    * Once you reach `write_end()`, you have covered a reasonable stretch.

### Reflection

* Why do we need a file system?
* Why do we start tracing from VFS?
* If we wanted to build our own file system, where could we start (a rough idea is enough)?
* Why does Linux turn everything into a file (Ext4, sysfs, procfs, etc.)?

## References

* https://zh.wikipedia.org/zh-tw/%E6%96%87%E4%BB%B6%E7%B3%BB%E7%BB%9F
* https://hackmd.io/@sysprog/linux-file-system#Linux-Virtual-File-System-%E4%BB%8B%E9%9D%A2
* http://www.ilinuxkernel.com/files/Linux.Kernel.Write.Procedure.pdf
* https://blog.csdn.net/qq_32473685/article/details/103494398
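
## Appendix: Checking the Extent Numbers

The figures quoted in the extent discussion (25,600 blocks for a 100 MB file, 128 MB per extent with 4 KB blocks) can be checked with a little arithmetic. What follows is a minimal user-space sketch, not kernel code. The helper names and the macro `SKETCH_EXT_INIT_MAX_LEN` are made up for illustration; the value `1 << 15` is assumed to match the kernel's limit on the length of an initialized extent, and the field values are taken in host byte order rather than in the little-endian on-disk form.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the most blocks one initialized extent can cover (1 << 15). */
#define SKETCH_EXT_INIT_MAX_LEN (1U << 15)

/* Number of blocks needed to hold file_bytes, rounding up. */
uint64_t blocks_needed(uint64_t file_bytes, uint32_t block_size)
{
    assert(block_size != 0);
    return (file_bytes + block_size - 1) / block_size;
}

/* Largest byte range a single extent can map for a given block size. */
uint64_t max_extent_bytes(uint32_t block_size)
{
    return (uint64_t)SKETCH_EXT_INIT_MAX_LEN * block_size;
}

/* Fewest extents that can describe the file if its blocks are contiguous. */
uint64_t min_extents(uint64_t file_bytes, uint32_t block_size)
{
    uint64_t blocks = blocks_needed(file_bytes, block_size);

    return (blocks + SKETCH_EXT_INIT_MAX_LEN - 1) / SKETCH_EXT_INIT_MAX_LEN;
}

/*
 * Combine the two halves of the starting physical block, as stored in
 * ee_start_hi (high 16 bits) and ee_start_lo (low 32 bits), into one
 * 48-bit block number. Inputs are assumed to be already in host order.
 */
uint64_t extent_start_block(uint16_t start_hi, uint32_t start_lo)
{
    return ((uint64_t)start_hi << 32) | start_lo;
}
```

Calling `blocks_needed(100ULL * 1024 * 1024, 4096)` gives the 25,600 blocks that block mapping has to record one by one, while `min_extents()` with the same arguments shows that a single extent is enough when those blocks are contiguous.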