simplefs - HackMD

# simplefs ## 目標 1. 解說程式中的每個程式碼 2. 解兩隻 bugs 1. ls -a 的 . 跟 .. 2. Directory will be full if more than 128 files ## 補充背景知識 linux source code 本身有提供關於 vfs 的文件 https://github.com/torvalds/linux/blob/master/Documentation/filesystems/vfs.rst [Linux 核心設計: 檔案系統概念及實作手法](https://hackmd.io/@sysprog/linux-file-system?type=view#File-Descriptor-%E5%8F%8A%E9%96%8B%E5%95%9F%E7%9A%84%E6%AA%94%E6%A1%88) 這篇文章把 filesystem 的實作跟 source code 做很清楚的連結跟解說 https://developer.ibm.com/tutorials/l-linux-filesystem/ ## 理解在閱讀 simplefs 的程式碼時需要先理解幾件事情 1. 資料可同時存在於**記憶體**跟**儲存設備**上 2. inode 會同時存在 cache 跟儲存設備上 3. 硬碟寫入屬於非同步 simplefs 的架構大致上分為 * superbock * inode * dir * file * file ( To do ) 從原始程式的 [readme](https://github.com/jserv/simplefs) 可以看到程式碼分為幾個區域 ``` +------------+-------------+-------------------+-------------------+-------------+ | superblock | inode store | inode free bitmap | block free bitmap | data blocks | +------------+-------------+-------------------+-------------------+-------------+ ``` 每個區域為 4KiB 除了最後的 data blocks 外． superblock : 存放 superblock 中的全部資訊，細節可以參考 [superblock](https://hackmd.io/@fwfly/simplefs#superblock) inode store: 這個 block 存放全部的 inode 資訊，大小會隨著儲存裝置大小而有變化 inode free bitmap: 用來記錄哪些 inode 已經被使用 block free bitmap: 用來記錄哪些 block 已經被使用 data blocks: 存放檔案內容的地方 ## 檔案解說 1. fs.c : 處理檔案系統掛載，想知道怎麼掛載可以看這個檔案 2. super.c: super block 的操作，也包含怎麼產生新的 inode 跟刪除 3. inode.c: 處理 inode 的部分，想知道 inode 怎麼建立刪除可以參考這部分 4. dir.c: 處理 dir 的操作，目前只支援瀏覽資料夾 (ls) ## superblock superblock 包含整個檔案系統所需要的資訊，同時也是同步 inode 的地方 ```cpp struct simplefs_sb_info { uint32_t magic; /* Magic number */ uint32_t nr_blocks; /* Total number of blocks (incl sb & inodes) */ uint32_t nr_inodes; /* Total number of inodes */ uint32_t nr_istore_blocks; /* Number of inode store blocks */ uint32_t nr_ifree_blocks; /* Number of inode free bitmap blocks */ uint32_t nr_bfree_blocks; /* Number of block free bitmap blocks */ uint32_t nr_free_inodes; /* Number of free inodes */ uint32_t nr_free_blocks; /* Number of free blocks */ #ifdef __KERNEL__ unsigned long *ifree_bitmap; /* In-memory free inodes bitmap */ unsigned long *bfree_bitmap; /* In-memory free blocks bitmap */ #endif }; ``` 要更簡化的看出資料結構對應的數字，產生一個只有 404k （檔案系統要求的最小空間）大小的空間，然後再用 mkfs.simplefs 去輸出 superblock 寫的資料 ```shell dd if=/dev/zero of=test.img bs=4k count=101 Superblock: (4096) magic=0xdeadce nr_blocks=101 nr_inodes=202 (istore=2 blocks) nr_ifree_blocks=1 nr_bfree_blocks=1 nr_free_inodes=201 nr_free_blocks=95 Inode store: wrote 2 blocks inode size = 40 B Ifree blocks: wrote 1 blocks Bfree blocks: wrote 1 blocks ``` ### magic number magic number 是用來判斷檔案系統是哪一種檔案系統．最通常的 ext4 magic number 是 0xEF53 下面的連結可以知道目前的檔案系統相對應的 magic number 是哪一個 https://github.com/torvalds/linux/blob/master/include/uapi/linux/magic.h 如果知道 magic number，就可以用 dd 找出 magic number 在檔案系統的那個 block. 以 ext4 為例，則可以透過以下命令找到 magic number 的位置 ```shell $ sudo dd if=/dev/sda2 bs=1M count=1 skip=0 | xxd | grep -E '53 ?ef' 00000430: eb09 2d60 0500 ffff 53ef 0100 0100 0000 ..-`....S....... 00053ef0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ ``` 可以看到 53ef 大約落在 00000430 的地方．而這邊也是 superblock 的位置．如果有原始碼，就有辦法知道鄰近的編碼是什麼意思．以 simplefs 為例，magic number 為 DEADCELL ```shell= $ sudo dd if=test.img bs=4k count=2 skip=0 | xxd | grep -E 'ce' 2+0 records in 2+0 records out 8192 bytes (8.2 kB, 8.0 KiB) copied, 0.000144281 s, 56.8 MB/s 00000000: cead de00 0032 0000 1832 0000 e500 0000 .....2...2...... 00000ce0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00001ce0: 0000 0000 0000 0000 0000 0000 0000 0000 ................ ``` 可以看到 magic number 是從第一個 block 開始(00000000)．因為 mkfs.simplefs 第一個寫入的是 superblock ```cpp /* Write superblock (block 0) */ struct superblock *sb = write_superblock(fd, &stat_buf); if (!sb) { perror("write_superblock():"); ret = EXIT_FAILURE; goto fclose; } ``` ### nr_ifree_blocks 跟 nr_bfree_blocks ```cpp uint32_t nr_ifree_blocks; /* Number of inode free bitmap blocks */ uint32_t nr_bfree_blocks; /* Number of block free bitmap blocks */ ``` 這兩個變數是指 `記錄 bitmap 需要多少個 blocks` 透過前面的 mkfs 的輸出可以知道 ``` nr_free_inodes=201 nr_free_blocks=95 ``` 由於 201 跟 95 小於 4096 (simplefs block 最小單位) 所以 nr_ifree_blocks 跟 nr_bfree_blocks 都為 1 從這邊也可以知道，如果檔案系統空間越大，所需要的 blocks 就會越大，同時也會在 map 中產生閒置的空間．以 inode free blocks 為例，在 4096 個 bits 中，實際上只用了 201 個 bits 剩下的都沒有被使用． ### super_ops (to do) 想理解 inode 怎麼同步到硬碟上，以及怎麼產生．可以參考這個部分 ## bitmap 的功用 bitmap 有點像是一個紀錄表，上面記錄著那些 inode 跟 blocks 已經被使用了．如果沒有 bitmap，可能要搜尋整個檔案系統的 inode，或者是硬碟上的 blocks 來找出沒有用到空間或 inode 結構大概就像是一長串的 0 跟 1, 0 表示 inode/block 已經使用，1 則相反．以 inode 為例，如果檔案系統下建立 3 個檔案，在 bitmap 上面會這樣表示 ``` 0000111111........ ``` 第一個 0 是 superblock 後面三個 0 則為新建立的 3 個檔案 ## 理解 inode 建立 inode 是檔案系統中最基本的操作．不論是檔案或資料夾或 link 都需要建立一個 inode. 在進入程式碼前要先知道這幾件事情 1. 所有的 inode 都同時存在 cache 中跟儲存裝置 2. inode 一但建立在 cache 中，就不會被刪除，直到卸載檔案 3. vfs 中的 inode 資料是來自 cache 而不是從檔案直接抓新的資訊 ### 建立新 inode 建立新的 inode 基本上會做以下事情 1. 找到 parent dir 的 block idx ( container_of ) 2. 把 dir 的 block 從儲存設備讀(map)到記憶體 (sb_bread) 3. 從 cache 中建立一個 buffer 並且 map 回去儲存設備 (sb_bread) 4. 用這個 buffer 建立新的 inode，並且跟 dir 的關係連結起來 5. 把 inode 的 buffer mark 為 dirty 6. 跟新 dir 並且把 dir 的 buffer mark 為 dirty 7. 把 inode mark 為 dirty ...經過數秒... 8. 更新儲存設備上的 inode 透過[實作](https://github.com/jserv/simplefs/blob/master/inode.c#L292)可以看到 linux 的記憶體跟儲存裝置是非同步的運行透過把記憶體區塊標記為 dirty ，然後再一次同步到儲存裝置上． #### 建立新的 inode 時的過程 ``` simplefs_create .... inode.c -> simplefs_new_inode .... inode.c -> simplefs_iget .... inode.c -> simplefs_alloc_inode .... super.c -> simplefs_write_inode .... super.c ``` printk 印出來的結果 ``` [Tue May 25 23:22:52 2021] simplefs_create [Tue May 25 23:22:52 2021] simplefs_new_inode [Tue May 25 23:22:52 2021] simplefs_iget [Tue May 25 23:22:52 2021] simplefs_alloc_inode （5秒後） [Tue May 25 23:22:57 2021] simplefs_write_inode (unmount 後) [Tue May 25 23:22:57 2021] simplefs_write_inode [Tue May 25 23:29:09 2021] simplefs_sync_fs [Tue May 25 23:29:09 2021] simplefs_sync_fs [Tue May 25 23:29:09 2021] simplefs_destroy_inode [Tue May 25 23:29:09 2021] simplefs_destroy_inode [Tue May 25 23:29:09 2021] simplefs: unmounted disk ``` 可以看到 simplefs_alloc_inode 跟 simplefs_write_inode 中間相差了 5 秒，從這邊可以知道 linux 的硬碟寫入不是即時的．也可以看到 unmount 後，為了確保資料正確，會一次寫回硬碟．同時也可以看到一開始 alloc 的 [cache]([https://](https://youtu.be/9Kj7TBrvP-8?t=2366)) 其實不是馬上就會做 destory ### simplefs_inode_info, simplefs_inode 跟 inode ? simplefs_inode: 是 simplefs 用來讀取寫入儲存裝置時的特有的 inode 格式 simplefs_inode_info: 負責記錄額外資訊，像是 block idx 跟 link inode 則是 vfs 所提供的 inode 他們之間的關係如下 (to do) ## (Progressing) File simplefs 的操作主要是在從記憶體，寫回硬碟所以主要是這三個 simplefs_write_begin #準備把資料寫進 page cache 中 simplefs_write_end #當資料被寫進 cache 後，處理 inode 更新 simplefs_writepage #把 page cache 的資料寫回硬諜 ::: warning **問題** 當使用者輸入時，是輸入到哪裡 ? userspace memory ? 某處的 buffer ? 所以流程會變成 buffer/userspace memory -> page cache (write) -> disk (writepage) ::: ### 何時發生寫了一個簡單的寫入程式去測試何時發生，答案是當呼叫 fclose 的時候，也就是當使用者結束編輯檔案，在關檔前一次寫回去的時候發生不過 vim 並不會這樣做，vim 在每一次的檔案更動都會去做開檔關檔案，以確保文件就算是寫到一半 crash 了也能找回之前寫的內容． ## dot dot 請參考 https://hackmd.io/@fwfly/Bka0w4n3Y ## (To do ) simplefs 實際上在儲存裝置上的內容那實際上 simplefs 在硬碟上看起來是怎麼樣的？ ## 如果看不懂 simplefs 程式碼，可以檢視以下問題 1. 是否了解資料有分儲存在記憶體跟儲存裝置 (ex: HDD) 2. 是否有用 dd 實際看過 simplefs 在硬碟中的情形 * 是否知道 linux 的寫入為非同步 (可透過 dd 觀察到) 4. 是否了解 c 語言的繼承 5. 是否了解 container_of 6. 是否知道在 simplefs 中 inode 是"全部"同時存放在 cache 跟儲存裝置上的 ## 問題: 1. 什麼時候會用 inode lookup ? 2. dentry 的功用？沒有 dentry 是否會導致某些功能出錯 4. extend 的功用? ~~5. inode store 是做什麼用的~~ ~~6. inode 存在哪裡 ? inode store ?~~ 7. 為什麼沒有 data_block 但是還是可以保留檔案 ? ~~8. `container_of` ? (Linux 的 linked list 實作會用到)~~ 9. dirty 的同步時機 ? 10. buffer 是什麼 ? ## 待讀 https://github.com/torvalds/linux/blob/master/Documentation/filesystems/path-lookup.rst?fbclid=IwAR3qpbZH_DNO1RxXlfn0dgdvHvWO27SP5k0mRgNi4vln6IZ_PLV99RelUEs ## Reference 1. https://kukuruku.co/post/teaching-the-file-system-to-read/ 2. https://stackoverflow.com/questions/48731404/busy-inodes-dentries-after-umount-in-self-written-virtual-fs 3. [Linux内核：一文读懂文件系统、缓冲区高速缓存和块设备、超级块](https://blog.csdn.net/rong_toa/article/details/108885346) 4. https://linux-kernel-labs.github.io/refs/heads/master/index.html 5. [信息安全课程15：rootkit(3)](https://zhuanlan.zhihu.com/p/62188061) 6. [动手写一个简单的文件系统.md](https://www.jianshu.com/p/8966d121263b) 7. https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#The_Super_Block 8. http://tltech.com/info/recovering-broken-partition/ 9. http://www.cs.unc.edu/~porter/courses/cse506/s16/slides/vfs2.pdf 10. http://www.cs.unc.edu/~porter/courses/cse506/s16/slides/vfs.pdf 11. http://www.cs.unc.edu/~porter/courses/cse506/s16/slides/ext4.pdf 12. https://www.stupid-projects.com/container_of/ 13. https://kukuruku.co/post/writing-a-file-system-in-linux-kernel/ 14. 印出一個區間的記憶體內容: https://www.techcrashcourse.com/2016/02/c-program-to-print-memory-representation-of-variable.html 15. https://unix.stackexchange.com/questions/209566/how-to-format-a-partition-inside-of-an-img-file 16. http://www.ilinuxkernel.com/files/Linux.Kernel.Delay.Write.pdf 17. https://github.com/torvalds/linux/commit/5f99f4e79abc64ed9d93a4b0158b21c64ff7f478 ## To do 1. 依照 simplefs 自己刻一個檔案系統，來確認是否真的理解 2. 理解部分 : * 檔案 * dentry * extend : simplefs 怎麼跟系統要 block 4. 解兩隻 bugs 1. ls -a 的 . 跟 .. 2. Directory will be full if more than 128 files