Linux Kernel Project: simplefs

Contributor: fewletter
Project walkthrough video

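Questions

Can KASAN be used instead to detect memory problems?

Reply
In principle yes, though I have not tried it yet. To enable KASAN, kmemleak must first be disabled; otherwise the system crashes while the environment is still being set up (I ran into this myself). The kernel config file has four related options: CONFIG_KASAN, CONFIG_KASAN_GENERIC, CONFIG_KASAN_SW_TAGS, and CONFIG_KASAN_HW_TAGS. The first turns KASAN on, and the other three select which mode it runs in.
As for memory, the command virtme-run --kdir . --mods=auto --qemu-opts 2048 gives the virtual environment more memory, so running out of memory should not be a problem.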

simplefs

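To explore the Linux VFS (virtual file system) interface and how file systems are implemented, we wrote from scratch a minimal file system that runs in Linux kernel mode. The source is a little over a thousand lines, supports basic file and directory operations, and also takes permissions and concurrency into account.

Background

TODO: Fix compilation on Linux v6.3

See Issue #23

Approach: use user-mode-linux or virtme (week 7 course material) to build the Linux v6.3 kernel, try to build simplefs against it, fix the resulting errors, and then submit a pull request.

Tracing the failing code

The error message in Issue #23 points to a declaration in include/linux/fs.h that is new in Linux v6.3. From v5.12 through v6.2 the declaration was: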

void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
	        const struct inode *dir, umode_t mode);

From v6.3 onward it is declared as:

void inode_init_owner(struct mnt_idmap *idmap, struct inode *inode,
                const struct inode *dir, umode_t mode);

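Setting up a virtualized test environment

See the course material on virtualized environments for testing the Linux kernel; the setup below simply follows its steps.

Fetch the Linux kernel source with the following commands, checking out the 6.3 branch to match the environment we need to debug: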

$ git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux
$ cd linux
$ git checkout -b linux-6.3.y origin/linux-6.3.y

Use virtme to select the default kernel configuration, then build:

$ virtme-configkernel --defconfig
$ make ARCH=x86 CROSS_COMPILE=x86_64-linux-gnu- -j6

When the build finishes, it prints:

Kernel: arch/x86/boot/bzImage is ready  (#1)

Then start the virtual test environment:

$ virtme-run --kdir . --mods=auto
./.virtme_mods/lib/modules/0.0.0
No EFI environment detected.
early console in extract_kernel
input_data: 0x0000000002d6d3a8
input_len: 0x0000000000c572f0
output: 0x0000000001000000
output_len: 0x000000000297a5d0
...
[    1.394987] virtme-init: triggering udev coldplug
[    1.472012] virtme-init: waiting for udev to settle
[    1.906750] gzip (97) used greatest stack depth: 13280 bytes left
[    2.454079] virtme-init: udev is done
[    2.558114] ip (124) used greatest stack depth: 12976 bytes left
virtme-init: console is ttyS0

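The virtual environment is simply another Linux instance; the only visible difference is that the prompt changes from $ to #. Next, confirm that it runs the 6.3 kernel configured earlier: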

root@(none):/# uname -r
6.3.3

Check whether the project can be cloned from GitHub directly into this environment:

root@(none):/# git clone https://github.com/fewletter/simplefs.git
fatal: could not create work tree dir 'simplefs': Read-only file system

Cloning directly from GitHub fails because the file system is read-only, so instead copy the already-cloned local project into the environment:

root@(none):/# mkdir -p /tmp/simplefs
root@(none):/# cd tmp/simplefs
root@(none):/tmp/simplefs# cp /home/fewletter/linux2023/simplefs/dir.c .
root@(none):/tmp/simplefs# cp /home/fewletter/linux2023/simplefs/extent.c .
...
root@(none):/tmp/simplefs# cp /home/fewletter/linux2023/simplefs/bitmap.h .
root@(none):/tmp/simplefs# cp /home/fewletter/linux2023/simplefs/Makefile .

Build the project and reproduce the error:

root@(none):/tmp/simplefs# make
cc -std=gnu99 -Wall -o mkfs.simplefs mkfs.c
make -C /lib/modules/6.3.3/build M=/tmp/simplefs modules
make[1]: Entering directory '/home/fewletter/linux2023/linux'
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
  You are using:           gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
  CC [M]  /tmp/simplefs/fs.o
  CC [M]  /tmp/simplefs/super.o
/tmp/simplefs/super.c: In function 'simplefs_fill_super':
/tmp/simplefs/super.c:286:22: error: passing argument 1 of 'inode_init_owner' from incompatible pointer type [-Werror=incompatible-pointer-types]
  286 |     inode_init_owner(&init_user_ns, root_inode, NULL, root_inode->i_mode);
      |                      ^~~~~~~~~~~~~
      |                      |
      |                      struct user_namespace *
In file included from ./include/linux/highmem.h:5,
                 from ./include/linux/bvec.h:10,
                 from ./include/linux/blk_types.h:10,
                 from ./include/linux/buffer_head.h:12,
                 from /tmp/simplefs/super.c:3:
./include/linux/fs.h:1682:41: note: expected 'struct mnt_idmap *' but argument is of type 'struct user_namespace *'
 1682 | void inode_init_owner(struct mnt_idmap *idmap, struct inode *inode,
      |                       ~~~~~~~~~~~~~~~~~~^~~~~
cc1: all warnings being treated as errors
make[2]: *** [scripts/Makefile.build:252: /tmp/simplefs/super.o] Error 1
make[1]: *** [Makefile:2025: /tmp/simplefs] Error 2
make[1]: Leaving directory '/home/fewletter/linux2023/linux'
make: *** [Makefile:9: all] Error 2

The output confirms that an API change in Linux v6.3 breaks the build of this project. The next step is to fix the code.

Updating the code for Linux 6.3

simplefs/super.c contains the following code:

#if USER_NS_REQUIRED()
    inode_init_owner(&init_user_ns, root_inode, NULL, root_inode->i_mode);
#else
    inode_init_owner(root_inode, NULL, root_inode->i_mode);
#endif

where USER_NS_REQUIRED() is defined in simplefs/simplefs.h as:

#define USER_NS_REQUIRED() LINUX_VERSION_CODE >= KERNEL_VERSION(5,12,0)

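The cause is now clear: the only version cutoff the code defines is 5.12, so on 6.3 the user_namespace variant is still selected and the build fails. What changed in the kernel between v6.2 and v6.3 can be seen in the following diff: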

-void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
-	        const struct inode *dir, umode_t mode);
            
+void inode_init_owner(struct mnt_idmap *idmap, struct inode *inode,
+                const struct inode *dir, umode_t mode);

The only change is that struct user_namespace *mnt_userns became struct mnt_idmap *idmap. The struct mnt_idmap is defined in linux/fs/mnt_idmapping.c; put simply, it wraps the original struct user_namespace together with a reference count.

struct mnt_idmap {
	struct user_namespace *owner;
	refcount_t count;
};

/*
 * Carries the initial idmapping of 0:0:4294967295 which is an identity
 * mapping. This means that {g,u}id 0 is mapped to {g,u}id 0, {g,u}id 1 is
 * mapped to {g,u}id 1, [...], {g,u}id 1000 to {g,u}id 1000, [...].
 */
struct mnt_idmap nop_mnt_idmap = {
	.owner	= &init_user_ns,
	.count	= REFCOUNT_INIT(1),
};
EXPORT_SYMBOL_GPL(nop_mnt_idmap);

simplefs.h

Add one more definition to simplefs.h:

#define USER_NS_REQUIRED_6_3() LINUX_VERSION_CODE >= KERNEL_VERSION(6,3,0)

This macro selects the code below when the project is built against Linux v6.3 or later.

super.c

+#if USER_NS_REQUIRED_6_3()
+    inode_init_owner(&nop_mnt_idmap, root_inode, NULL, root_inode->i_mode);
-#if USER_NS_REQUIRED()
+#elif USER_NS_REQUIRED()
     inode_init_owner(&init_user_ns, root_inode, NULL, root_inode->i_mode);
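The version gating above can be sketched in user space. This is only an illustration of why the newest cutoff has to be tested first: the KERNEL_VERSION encoding is reproduced from recent kernels' linux/version.h, while the enum labels and the function name pick_owner_api are hypothetical and do not exist in simplefs.

```c
#include <assert.h>

/* Same encoding recent kernels use in <linux/version.h>, reproduced
 * here so the sketch builds in user space. */
#define KERNEL_VERSION(a, b, c) \
    (((a) << 16) + ((b) << 8) + ((c) > 255 ? 255 : (c)))

/* Hypothetical labels for the three inode_init_owner() call shapes. */
enum owner_api {
    OWNER_API_LEGACY,   /* before v5.12: (inode, dir, mode) */
    OWNER_API_USER_NS,  /* v5.12 to v6.2: (&init_user_ns, inode, dir, mode) */
    OWNER_API_MNT_IDMAP /* v6.3 and later: (&nop_mnt_idmap, inode, dir, mode) */
};

/* Mirrors the #if / #elif / #else chain: the newest cutoff is tested first,
 * otherwise a v6.3 kernel would also satisfy the v5.12 condition. */
static enum owner_api pick_owner_api(unsigned int version_code)
{
    assert(version_code != 0);
    if (version_code >= KERNEL_VERSION(6, 3, 0))
        return OWNER_API_MNT_IDMAP;
    if (version_code >= KERNEL_VERSION(5, 12, 0))
        return OWNER_API_USER_NS;
    return OWNER_API_LEGACY;
}
```

With only the 5.12 check in place, a 6.3.3 kernel falls into the user_namespace branch, which is exactly the build failure reproduced earlier.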

After this change, testing in the virtual environment showed that the super.c error was only part of the problem; most of the remaining compile errors are in inode.c.

inode.c

static struct inode *simplefs_new_inode(struct inode *dir, mode_t mode)
{
    ...
    if (S_ISLNK(mode)) {
+#if USER_NS_REQUIRED_6_3()
+       inode_init_owner(&nop_mnt_idmap, inode, dir, inode->mode);
-#if USER_NS_REQUIRED()
+#elif USER_NS_REQUIRED()
        inode_init_owner(&init_user_ns, inode, dir, mode);
#else
        inode_init_owner(inode, dir, mode);
    ...        
    /* Get a free block for this new inode's index */
    bno = get_free_blocks(sbi, 1);
    if (!bno) {
        ret = -ENOSPC;
        goto put_inode;
    }
    
+#if USER_NS_REQUIRED_6_3()
+       inode_init_owner(&nop_mnt_idmap, inode, dir, inode->mode);
-#if USER_NS_REQUIRED()
+#elif USER_NS_REQUIRED()
        inode_init_owner(&init_user_ns, inode, dir, mode);
#else
        inode_init_owner(inode, dir, mode);
}

...

/*
 * Create a file or directory in this way:
 *   - check filename length and if the parent directory is not full
 *   - create the new inode (allocate inode and blocks)
 *   - cleanup index block of the new inode
 *   - add new file/directory in parent index
 */
+#if USER_NS_REQUIRED_6_3()
+static int simplefs_create(struct mnt_idmap *id,
+                          struct inode *dir,
+                          struct dentry *dentry,
+                          umode_t mode,
+                          bool excl)
-#if USER_NS_REQUIRED()
+#elif USER_NS_REQUIRED()
static int simplefs_create(struct user_namespace *ns,
                           struct inode *dir,
                           struct dentry *dentry,
                           umode_t mode,
                           bool excl)
...

+#if USER_NS_REQUIRED_6_3()
+static int simplefs_rename(struct mnt_idmap *id,
+                          struct inode *old_dir,
+                          struct dentry *old_dentry,
+                          struct inode *new_dir,
+                          struct dentry *new_dentry,
+                          unsigned int flags)
-#if USER_NS_REQUIRED()
+#elif USER_NS_REQUIRED()
static int simplefs_rename(struct user_namespace *ns,

...

+#if USER_NS_REQUIRED_6_3()
+static int simplefs_mkdir(struct mnt_idmap *id,
+                         struct inode *dir,
+                         struct dentry *dentry,
+                         umode_t mode)
+{
+   return simplefs_create(id, dir, dentry, mode | S_IFDIR, 0);
+}
-#if USER_NS_REQUIRED()
+#elif USER_NS_REQUIRED()
static int simplefs_mkdir(struct user_namespace *ns,

...

+#if USER_NS_REQUIRED_6_3()
+static int simplefs_symlink(struct mnt_idmap *id,
+                           struct inode *dir,
+                           struct dentry *dentry,
+                           const char *symname)
-#if USER_NS_REQUIRED()
+#elif USER_NS_REQUIRED()
static int simplefs_symlink(struct user_namespace *ns,

All of the changes in inode.c are tied to the kernel version. Keep in mind that from Linux v6.3 onward struct user_namespace is wrapped in struct mnt_idmap. The build result in the virtual environment follows.

root@(none):/tmp/simplefs# make
cc -std=gnu99 -Wall -o mkfs.simplefs mkfs.c
make -C /lib/modules/6.3.3/build M=/tmp/simplefs modules
make[1]: Entering directory '/home/fewletter/linux2023/linux'
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
  You are using:           gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
  CC [M]  /tmp/simplefs/fs.o
  CC [M]  /tmp/simplefs/super.o
  CC [M]  /tmp/simplefs/inode.o
  CC [M]  /tmp/simplefs/file.o
  CC [M]  /tmp/simplefs/dir.o
  CC [M]  /tmp/simplefs/extent.o
  LD [M]  /tmp/simplefs/simplefs.o
  MODPOST /tmp/simplefs/Module.symvers
  CC [M]  /tmp/simplefs/simplefs.mod.o
  LD [M]  /tmp/simplefs/simplefs.ko
make[1]: Leaving directory '/home/fewletter/linux2023/linux'

The output shows no remaining compile errors.

TODO: Submit a pull request

jserv

Fixing Issue #25

static struct inode *simplefs_new_inode(struct inode *dir, mode_t mode)
{
...
#if MNT_IDMAP_REQUIRED()
       inode_init_owner(&nop_mnt_idmap, inode, dir, inode->mode);
...   
}

According to Issue #25, the code above is wrong. To find the error, I looked at what inode_init_owner does. In Linux v6.3 it is defined as follows:

void inode_init_owner(struct mnt_idmap *idmap, struct inode *inode,
		      const struct inode *dir, umode_t mode)
{
	inode_fsuid_set(inode, idmap);
	if (dir && dir->i_mode & S_ISGID) {
		inode->i_gid = dir->i_gid;

		/* Directories are special, and always inherit S_ISGID */
		if (S_ISDIR(mode))
			mode |= S_ISGID;
	} else
		inode_fsgid_set(inode, idmap);
	inode->i_mode = mode;
}
EXPORT_SYMBOL(inode_init_owner);

The last statement shows that inode->i_mode is initialized from the mode parameter. Since simplefs_new_inode is meant to initialize a new inode from its dir and mode parameters, the last argument to inode_init_owner should be mode.

#if MNT_IDMAP_REQUIRED()
+       inode_init_owner(&nop_mnt_idmap, inode, dir, mode);
-       inode_init_owner(&nop_mnt_idmap, inode, dir, inode->mode);
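The effect of the wrong argument can be shown with a small user-space model. This is a sketch under stated assumptions, not the kernel function: struct inode_model, init_owner_model, and the MODEL_ constants are hypothetical names, uid handling and idmapping are left out, and the caller's group is passed explicitly as fsgid.

```c
#include <assert.h>
#include <stddef.h>

/* Mode bits with the same values Linux uses, redefined locally so the
 * sketch does not depend on feature-test macros. */
#define MODEL_S_IFMT  0170000u
#define MODEL_S_IFDIR 0040000u
#define MODEL_S_IFREG 0100000u
#define MODEL_S_ISGID 0002000u

/* Hypothetical, heavily reduced stand-in for struct inode. */
struct inode_model {
    unsigned int i_mode;
    unsigned int i_gid;
};

/* Models only the group and mode logic of inode_init_owner(). */
static void init_owner_model(struct inode_model *inode,
                             const struct inode_model *dir,
                             unsigned int mode, unsigned int fsgid)
{
    assert(inode != NULL);
    if (dir && (dir->i_mode & MODEL_S_ISGID)) {
        inode->i_gid = dir->i_gid;
        /* Directories always inherit the setgid bit. */
        if ((mode & MODEL_S_IFMT) == MODEL_S_IFDIR)
            mode |= MODEL_S_ISGID;
    } else {
        inode->i_gid = fsgid;
    }
    inode->i_mode = mode;
}
```

Passing the new inode's own, still unset, mode field leaves i_mode at whatever it was before the call, so the requested file type and permissions are lost; passing the mode parameter stores them as intended.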

TODO: Fix the `ls -a` problem

See the 2022 report; check whether the problem still occurs on Linux v5.15+ and try to eliminate it.

Testing `ls -a`

Development machine

fewletter@fewletter-Veriton-M4665G:~$ uname -r
5.15.0-72-generic

Test environment: development machine, Linux v5.15

fewletter@fewletter-Veriton-M4665G:~/linux2023/simplefs$ ls -a
.              dir.o          file.c       .fs.o.cmd   .inode.o.cmd   modules.order        script            simplefs.mod.c       .simplefs.o.cmd
..             .dir.o.cmd     file.o       .git        LICENSE        .modules.order.cmd   simplefs.h        .simplefs.mod.cmd    super.c
bitmap.h       extent.c       .file.o.cmd  .gitignore  Makefile       Module.symvers       simplefs.ko       simplefs.mod.o       super.o
.clang-format  extent.o       fs.c         inode.c     mkfs.c         .Module.symvers.cmd  .simplefs.ko.cmd  .simplefs.mod.o.cmd  .super.o.cmd
dir.c          .extent.o.cmd  fs.o         inode.o     mkfs.simplefs  README.md            simplefs.mod      simplefs.o           test

Test environment: QEMU, Linux v6.3

root@(none):/tmp/simplefs# ls -a
.                    .extent.o.cmd  .modules.order.cmd   .simplefs.o.cmd  Module.symvers  extent.c  fs.c     mkfs.c         simplefs.h     simplefs.mod.o  test
..                   .file.o.cmd    .simplefs.ko.cmd     .super.o.cmd     bitmap.h        extent.o  fs.o     mkfs.simplefs  simplefs.ko     simplefs.o
.Module.symvers.cmd  .fs.o.cmd      .simplefs.mod.cmd    LICENSE          dir.c           file.c    inode.c  modules.order  simplefs.mod    super.c
.dir.o.cmd           .inode.o.cmd   .simplefs.mod.o.cmd  Makefile         dir.o           file.o    inode.o  script         simplefs.mod.c  super.o

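In both environments ls -a lists . and .. as expected. According to the 2022 report, the ls -a bug appeared on a development machine running Linux v5.11 and on a virtual machine running Linux v5.18.

Test environment: QEMU, Linux v5.18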

root@(none):/tmp/simplefs# ls -a
.                    .extent.o.cmd  .modules.order.cmd   .simplefs.o.cmd  bitmap.h  extent.o  fs.o     mkfs.simplefs  simplefs.mod    super.c
..                   .file.o.cmd    .simplefs.ko.cmd     .super.o.cmd     dir.c     file.c    inode.c  modules.order  simplefs.mod.c  super.o
.Module.symvers.cmd  .fs.o.cmd      .simplefs.mod.cmd    Makefile         dir.o     file.o    inode.o  simplefs.h     simplefs.mod.o
.dir.o.cmd           .inode.o.cmd   .simplefs.mod.o.cmd  Module.symvers   extent.c  fs.c      mkfs.c   simplefs.ko    simplefs.o

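With QEMU running Linux v5.18 the bug still does not appear.

Test environment: User-Mode Linux (UML)

References: Building a User-Mode Linux Experimental Environment, Development Log 3

Preparing UML

First, obtain the Linux v5.18 source code: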

$ sudo apt install build-essential libncurses-dev flex bison
$ sudo apt install xz-utils wget ca-certificates bc
$ wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.18.tar.xz
$ tar xvf linux-5.18.tar.xz

Configure the kernel. Note ARCH=um in particular: every later kernel module build must also pass it to target UML.

$ make mrproper
$ make defconfig ARCH=um SUBARCH=x86_64
$ make linux ARCH=um SUBARCH=x86_64 -j `nproc`

Before writing UML.sh, prepare the root file system (rootfs):

$ export REPO=http://dl-cdn.alpinelinux.org/alpine/v3.13/main
$ mkdir -p rootfs
$ curl $REPO/x86_64/APKINDEX.tar.gz | tar -xz -C /tmp/
$ export APK_TOOL=`grep -A1 apk-tools-static /tmp/APKINDEX | cut -c3- | xargs printf "%s-%s.apk"`
$ curl $REPO/x86_64/$APK_TOOL | fakeroot tar -xz -C rootfs
$ fakeroot rootfs/sbin/apk.static \
    --repository $REPO --update-cache \
    --allow-untrusted \
    --root $PWD/rootfs --initdb add alpine-base
$ echo $REPO > rootfs/etc/apk/repositories
$ echo "LABEL=ALPINE_ROOT / auto defaults 1 1" >> rootfs/etc/fstab

Write init.sh as the init program that UML.sh starts, with the following contents:

#!/bin/sh

mount -t proc proc /proc
mount -t sysfs sys /sys

export PS1='UML:\w\ $ '
export PS1='\[\033[01;32mUML:\w\033[00m \$ '

exec /sbin/tini /bin/sh +m

Write UML.sh to launch UML:

#!/bin/sh
./linux umid=uml0 ubd0=/dev/null \
        root=/dev/root rootfstype=hostfs hostfs=./rootfs \
        rw mem=64M init=/init.sh quiet
stty sane ; echo

Start UML:

$ ./UML.sh
UML:/ # ls
bin       etc       init.sh   media     opt       root      sbin      srv       tmp       var
dev       home      lib       mnt       proc      run       simplefs  sys       usr

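Preparing kernel modules for the required version

Next, set up the User-Mode Linux environment inside rootfs. The kernel modules below are the ones we need to build; again remember ARCH=um. After installing, check the target directory, here lib/modules/5.18.0, to confirm that the build succeeded.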

~/linux2023/linux-5.18$ make modules_install INSTALL_MOD_PATH=`pwd`/rootfs ARCH=um
~/linux2023/linux-5.18/rootfs/lib/modules/5.18.0$ ls
build   modules.alias      modules.builtin            modules.builtin.bin      modules.dep      modules.devname  modules.softdep  modules.symbols.bin
kernel  modules.alias.bin  modules.builtin.alias.bin  modules.builtin.modinfo  modules.dep.bin  modules.order    modules.symbols  source

Modifying the Makefile to build against different Linux versions

+v1PATH ?= /home/fewletter/linux2023/linux-5.18/rootfs
+v2PATH ?= /home/fewletter/linux2023/linux-6.3/rootfs
+VERSION ?=

+ifeq ($(VERSION), 5.18.0)
+	KDIR ?= $(v1PATH)/lib/modules/$(VERSION)/build
+else ifeq ($(VERSION), 6.3.0)
+	KDIR ?= $(v2PATH)/lib/modules/$(VERSION)/build
+else
+	KDIR ?= /lib/modules/$(shell uname -r)/build
+endif

ARCH=um is handled as well:

+ifdef ARCH
+       ARCHARG = ARCH=$(ARCH)
+endif
...
-    make -C $(KDIR) M=$(PWD) modules
+    make -C $(KDIR) M=$(PWD) modules $(ARCHARG)
...
-    make -C $(KDIR) M=$(PWD) clean
+    make -C $(KDIR) M=$(PWD) clean $(ARCHARG)

With everything in place, check that simplefs builds against a different kernel version and that the module loads in UML:

$ make all VERSION=5.18.0 ARCH=um
cc -std=gnu99 -Wall -o mkfs.simplefs mkfs.c
make -C /home/fewletter/linux2023/linux-5.18/rootfs/lib/modules/5.18.0/build M=/home/fewletter/linux2023/simplefs modules ARCH=um
make[1]: 進入目錄「/home/fewletter/linux2023/linux-5.18」
  CC [M]  /home/fewletter/linux2023/simplefs/fs.o
  CC [M]  /home/fewletter/linux2023/simplefs/super.o
  CC [M]  /home/fewletter/linux2023/simplefs/inode.o
  CC [M]  /home/fewletter/linux2023/simplefs/file.o
  CC [M]  /home/fewletter/linux2023/simplefs/dir.o
  CC [M]  /home/fewletter/linux2023/simplefs/extent.o
  LD [M]  /home/fewletter/linux2023/simplefs/simplefs.o
  MODPOST /home/fewletter/linux2023/simplefs/Module.symvers
  CC [M]  /home/fewletter/linux2023/simplefs/simplefs.mod.o
  LD [M]  /home/fewletter/linux2023/simplefs/simplefs.ko
make[1]: 離開目錄「/home/fewletter/linux2023/linux-5.18」

$ ./UML.sh
UML:/simplefs # insmod simplefs.ko
UML:/simplefs # lsmod
Module                  Size  Used by    Tainted: G  
simplefs               17205  0

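Tracing Linux kernel code with GDB

Regardless of kernel version, the GDB scripts must first be built inside the kernel source tree, which requires changing .config: set CONFIG_GDB_SCRIPTS=y (which is what the commands below do), comment out CONFIG_DEBUG_INFO_NONE, and add CONFIG_DEBUG_INFO=y.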

$ echo "CONFIG_GDB_SCRIPTS=y" > .config-fragment
$ ARCH=um scripts/kconfig/merge_config.sh .config .config-fragment
$ make ARCH=um scripts_gdb

On Linux v5.18, however, the .config never took effect as intended even with these changes, so gdb kept reporting that it could not find debug symbols:

Reading symbols from vmlinux...
(No debugging symbols found)
Undefined command: "lx-version".  Try "help".

I therefore switched to Linux v6.3 for GDB tracing. Again start by editing .config; on v6.3 only CONFIG_GDB_SCRIPTS=y needs to be set. After building the GDB scripts, start gdb with the following command:

fewletter@fewletter-Veriton-M4665G:~/linux2023/linux-6.3$ gdb -ex "add-auto-load-safe-path scripts/gdb/vmlinux-gdb.py" \
      -ex "file vmlinux" \
      -ex "lx-version" -q
Reading symbols from vmlinux...
Signal        Stop      Print   Pass to program Description
SIGSEGV       No        No      Yes             Segmentation fault
Signal        Stop      Print   Pass to program Description
SIGUSR1       Yes       Yes     No              User defined signal 1
Linux version 6.3.0 (fewletter@fewletter-Veriton-M4665G) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #1 Sat Jun 17 21:18:30 CST 2023
(gdb)

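Starting UML from within gdb, however, left me at the gdb prompt instead of a UML shell, which was unexpected. The log shows that the root file system could not be mounted, which ended in a kernel panic: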

(gdb) r
Starting program: /home/fewletter/linux2023/linux-6.3/vmlinux umid=uml0 root=/dev/root rootfstype=hostfs rootflags=/home/fewletter/linux2023/linux-6.3/rootfs/simplefs/rootfs rw mem=64M init=/init.sh quiet
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 66210]
[Detaching after fork from child process 66211]
[Detaching after fork from child process 66212]
[Detaching after fork from child process 66213]
warning: Corrupted shared library list: 0x605e4da0 != 0x7ffff7ffd9e8
[New Thread 0x7ffff7d8db80 (LWP 66221)]
[New Thread 0x7ffff7d8db80 (LWP 66222)]
umid "uml0" is already in use by pid 50653
Failed to initialize umid "uml0", trying with a random umid
Failed to initialize ubd device 0 :Couldn't determine size of device's file
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
CPU: 0 PID: 1 Comm: swapper Not tainted 6.3.0 #1
Stack:
 6044dcbf 00000000 64803cf0 60034159
 6044dcbf 00000000 63f42930 60a2a000
 64803d20 60390b95 60390b47 60389d00
Call Trace:
 [<60389d5e>] ? _printk+0x0/0x98
 [<6002218b>] show_stack+0x141/0x150
 [<60034159>] ? um_set_signals+0x0/0x43
...
 [<60020e23>] new_thread_handler+0x85/0xb6

Thread 2 received signal SIGTERM, Terminated.
                                             [Switching to Thread 0x7ffff7d8db80 (LWP 66221)]
0x00007ffff7ea1967 in __GI___poll (fds=fds@entry=0x605c9310 <kernel_pollfd>, nfds=nfds@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29      ../sysdeps/unix/sysv/linux/poll.c: 沒有此一檔案或目錄.
(gdb)

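Next, following the "Kernel tracing and analysis with GDB" section of Building a User-Mode Linux Experimental Environment, prepare a gdbinit file and change the gdb invocation to gdb -q -x gdbinit.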

python gdb.COMPLETE_EXPRESSION = gdb.COMPLETE_SYMBOL
add-auto-load-safe-path scripts/gdb/vmlinux-gdb.py
file vmlinux
lx-version
set args umid=uml0 root=/dev/root rootfstype=hostfs rootflags=FULLPATH/rootfs rw mem=64M init=/init.sh quiet
handle SIGSEGV nostop noprint
handle SIGUSR1 nopass stop print

UML finally comes up:

$ gdb -q -x gdbinit
Linux version 6.3.0 (fewletter@fewletter-Veriton-M4665G) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #1 Sat Jun 17 21:18:30 CST 2023
(gdb) run
Starting program: /home/fewletter/linux2023/linux-6.3/vmlinux umid=uml0 root=/dev/root rootfstype=hostfs rootflags=/home/fewletter/linux2023/linux-6.3/rootfs rw mem=64M init=/init.sh quiet
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 66637]
[Detaching after fork from child process 66638]
[Detaching after fork from child process 66639]
[Detaching after fork from child process 66640]
[New Thread 0x7ffff7d8db80 (LWP 66643)]
[New Thread 0x7ffff7d8db80 (LWP 66644)]
Failed to initialize ubd device 0 :Couldn't determine size of device's file
[Detaching after fork from child process 66645]
UML:/ #

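Trying to reproduce the ls -a bug

The "Make and Mount a simplefs Filesystem" section of the 2022 report says the ls -a bug shows up on the host side. The steps to reproduce it are roughly:

Build the Linux v6.3.0 source
Build simplefs against Linux v6.3.0, in the same way as for Linux v5.18
Start gdb, then mount the file system on the test directory inside UML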

UML:/simplefs # insmod simplefs.ko
UML:/simplefs # mount -t simplefs -o loop test.img /test/
UML:/simplefs # df -Th
Filesystem           Type            Size      Used Available Use% Mounted on
root                 hostfs        195.8G     53.6G    132.2G  29% /
devtmpfs             devtmpfs       28.5M         0     28.5M   0% /dev
/dev/loop0           simplefs      200.0M      3.6M    196.4M   2% /test

Check whether ls -a on the host side lists . and ..

UML:/simplefs # cd ..
UML:/ # ls -a
.                                                              opt
..                                                             proc
.PKGINFO                                                       root
.SIGN.RSA.alpine-devel@lists.alpinelinux.org-4a6a0840.rsa.pub  run
.ash_history                                                   sbin
bin                                                            script
dev                                                            simplefs
etc                                                            srv
home                                                           sys
init.sh                                                        test
lib                                                            tmp
media                                                          usr
mnt                                                            var
UML:/ # cd /test
UML:/test # ls -a
.   ..

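After mounting, ls -a lists . and .. both on the host side and inside the test directory, so the bug may no longer exist.

TODO: Improve the VFS and simplefs descriptions

Extend the 2022 report by improving its descriptions of VFS and of the simplefs implementation (HackMD book mode can be used to organize the material).

Extending VFS : Registering and Mounting a Filesystem : Mounting

Initializing the superblock through fill_super

Each mount helper takes a function-pointer parameter, fill_super, so that every file system can supply its own initialization routine. Once the superblock has been obtained, this function is called to initialize it.
- mount_bdev()
  Mounts a file system that resides on a block device, such as a hard disk.
- mount_single()
  Shares a single file system instance across all mounts.
- mount_nodev()
  Mounts a file system that is not backed by a physical device.
Each file system's fill_super() differs in detail, but its main job is always to initialize the superblock. Take ramfs as an example: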

static int ramfs_fill_super(struct super_block *sb, void *data, int silent)
{
        struct ramfs_fs_info *fsi;
        struct inode *inode;
        int err;

        save_mount_options(sb, data);

        fsi = kzalloc(sizeof(struct ramfs_fs_info), GFP_KERNEL);
        sb->s_fs_info = fsi;
        if (!fsi)
                return -ENOMEM;

        err = ramfs_parse_options(data, &fsi->mount_opts);
        if (err)
                return err;

        sb->s_maxbytes          = MAX_LFS_FILESIZE;
        sb->s_blocksize         = PAGE_SIZE;
        sb->s_blocksize_bits    = PAGE_SHIFT;
        sb->s_magic             = RAMFS_MAGIC;
        sb->s_op                = &ramfs_ops;
        sb->s_time_gran         = 1;

        inode = ramfs_get_inode(sb, NULL, S_IFDIR | fsi->mount_opts.mode, 0);
        sb->s_root = d_make_root(inode);
        if (!sb->s_root)
                return -ENOMEM;

        return 0;
}
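The callback pattern described above can be sketched in user space. The names sb_model, fill_super_fn, mount_model, and demo_fill_super are hypothetical, and the magic number and block size are arbitrary demo values; the point is only that the generic mount step knows nothing about the file system beyond the function pointer it is given.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct super_block. */
struct sb_model {
    unsigned long s_magic;
    unsigned long s_blocksize;
    int has_root; /* 1 once a root directory has been set up */
};

/* Same role as the fill_super parameter of the mount helpers. */
typedef int (*fill_super_fn)(struct sb_model *sb, void *data, int silent);

/* Generic mount step: hand a cleared superblock to the file system's
 * callback and refuse the mount if the callback failed or set no root. */
static int mount_model(struct sb_model *sb, void *data, fill_super_fn fill)
{
    int err;

    assert(sb != NULL && fill != NULL);
    sb->s_magic = 0;
    sb->s_blocksize = 0;
    sb->has_root = 0;
    err = fill(sb, data, 0);
    if (err)
        return err;
    return sb->has_root ? 0 : -1;
}

/* One file system's callback; the values are arbitrary demo values. */
static int demo_fill_super(struct sb_model *sb, void *data, int silent)
{
    (void) data;
    (void) silent;
    sb->s_magic = 0x1234;
    sb->s_blocksize = 4096;
    sb->has_root = 1;
    return 0;
}
```

Calling mount_model(&sb, NULL, demo_fill_super) plays the part that mount_bdev() plays for a real file system, with the superblock lookup and device handling omitted.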

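Extending simplefs : Registering and Mounting

Mounting

In Linux, the mount function depends on what kind of device is being mounted, as with the three helpers described in the VFS section. simplefs uses mount_bdev() as its mount function and passes simplefs_fill_super to initialize the superblock.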

struct dentry *simplefs_mount(struct file_system_type *fs_type,
                              int flags,
                              const char *dev_name,
                              void *data)
{
    struct dentry *dentry =
        mount_bdev(fs_type, flags, dev_name, data, simplefs_fill_super);
    if (IS_ERR(dentry))
        pr_err("'%s' mount failure\n", dev_name);
    else
        pr_info("'%s' mount success\n", dev_name);

    return dentry;
}
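simplefs_mount checks the result with IS_ERR() because mount_bdev() reports failure by returning an errno value encoded in the pointer itself. A minimal user-space sketch of that encoding follows; the model_ names are hypothetical, and only the 4095 limit and the comparison follow the kernel's linux/err.h.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer (models ERR_PTR). */
static void *model_err_ptr(long error)
{
    assert(error < 0 && error >= -MODEL_MAX_ERRNO);
    return (void *) (intptr_t) error;
}

/* True if the pointer lies in the top MODEL_MAX_ERRNO addresses
 * (models IS_ERR). */
static int model_is_err(const void *ptr)
{
    return (uintptr_t) ptr >= (uintptr_t) -MODEL_MAX_ERRNO;
}

/* Recover the errno value from an error pointer (models PTR_ERR). */
static long model_ptr_err(const void *ptr)
{
    return (long) (intptr_t) ptr;
}
```

Valid objects never live in the last 4095 bytes of the address space, which is why one return value can carry either a dentry or an error code. Note that a NULL pointer is not treated as an error by this check.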

Extending VFS : Registering and Mounting a Filesystem : Umount

Unmounting

Unmounting a file system tears down the superblock information. This is done mainly by kill_block_super in linux/fs/super.c.

void simplefs_kill_sb(struct super_block *sb)
{
    kill_block_super(sb);

    pr_info("unmounted disk\n");
}

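Extending simplefs : File page cache and read/write block on disk

simplefs supports both reading and writing through the page cache and writing blocks from the file system to disk. Blocks can hold the superblock, inodes, and bitmaps, and as noted earlier in this chapter each block in simplefs is 4 KiB.

struct extent

Before looking at the methods, consider the extent structure. Extents appear in newer file systems and exist to handle large files efficiently. A file system such as bfs, by contrast, operates directly on individual blocks, as in the following code: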

static int bfs_get_block(struct inode *inode, sector_t block,
			struct buffer_head *bh_result, int create)
{
	unsigned long phys;
	int err;
	struct super_block *sb = inode->i_sb;
	struct bfs_sb_info *info = BFS_SB(sb);
	struct bfs_inode_info *bi = BFS_I(inode);

	phys = bi->i_sblock + block;
	if (!create) {
    ...

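The problem with that approach shows up with large files. Handling a file of more than 10 MiB one 4 KiB block at a time takes a lot of time. With extents, 8 blocks are allocated at once for both reads and writes, which reduces the per-block overhead.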
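The arithmetic behind this comparison can be checked with a short sketch. The helper names are hypothetical; the block size and extent length are the 4 KiB and 8 blocks mentioned in the text.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE_BYTES 4096u /* one simplefs block */
#define BLOCKS_PER_EXTENT 8u   /* blocks allocated per extent */

/* Number of blocks needed to hold file_bytes, rounded up. */
static uint64_t blocks_needed(uint64_t file_bytes)
{
    return (file_bytes + BLOCK_SIZE_BYTES - 1) / BLOCK_SIZE_BYTES;
}

/* Number of extents needed to cover those blocks, rounded up. */
static uint64_t extents_needed(uint64_t file_bytes)
{
    uint64_t blocks = blocks_needed(file_bytes);

    assert(blocks <= UINT64_MAX - BLOCKS_PER_EXTENT);
    return (blocks + BLOCKS_PER_EXTENT - 1) / BLOCKS_PER_EXTENT;
}
```

For a 10 MiB file this gives 2560 blocks but only 320 extents, so the number of mapping entries to walk shrinks by a factor of 8.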

inode                                                
+-----------------------+                           
| i_mode = IFDIR | 0644 |          block 93       
| ei_block = 93     ----|------>  +----------------+      
| i_size = 10 KiB       |       0 | ee_block  = 0  |     
| i_blocks = 25         |         | ee_len    = 8  |      extent 94 
+-----------------------+         | ee_start  = 94 |---> +--------+
                                  |----------------|     |        |     
                                1 | ee_block  = 8  |     +--------+
                                  | ee_len    = 8  |      extent 99
                                  | ee_start  = 99 |---> +--------+ 
                                  |----------------|     |        |
                                2 | ee_block  = 16 |     +--------+
                                  | ee_len    = 8  |      extent 66 
                                  | ee_start  = 66 |---> +--------+
                                  |----------------|     |        |
                                  | ...            |     +--------+
                                  |----------------|  
                              341 | ee_block  = 0  | 
                                  | ee_len    = 0  |
                                  | ee_start  = 0  |
                                  +----------------+
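Reading the diagram as a lookup table, a logical block number is translated to a physical block by finding the extent that covers it. The sketch below uses a linear scan for clarity, which is an assumption made here for illustration rather than the search strategy of simplefs_ext_search; struct extent_model and logical_to_physical are hypothetical names, and only the field names come from the diagram.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field names follow the diagram; the struct itself is a stand-in. */
struct extent_model {
    uint32_t ee_block; /* first logical block covered */
    uint32_t ee_len;   /* number of blocks covered */
    uint32_t ee_start; /* first physical block */
};

/* Translate a logical block to a physical block with a linear scan.
 * Returns 0 when no extent covers iblock (block 0 is never file data
 * in this model). */
static uint32_t logical_to_physical(const struct extent_model *ext,
                                    size_t count, uint32_t iblock)
{
    assert(ext != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (ext[i].ee_len == 0)
            break; /* unused slot: no further extents */
        if (iblock >= ext[i].ee_block &&
            iblock - ext[i].ee_block < ext[i].ee_len)
            return ext[i].ee_start + (iblock - ext[i].ee_block);
    }
    return 0;
}
```

With the three extents drawn above, logical block 10 falls in the second extent (ee_block = 8, ee_start = 99) and therefore maps to physical block 101.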

Get blocks from file system

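How is data in the file system mapped to blocks stored on disk? simplefs records file system state in structures such as the superblock and inode, so the code starts with the following two macros to get at the simplefs-specific superblock and inode information: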

#define SIMPLEFS_SB(sb) (sb->s_fs_info)
#define SIMPLEFS_INODE(inode) \
    (container_of(inode, struct simplefs_inode_info, vfs_inode))
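SIMPLEFS_INODE relies on container_of, which steps back from an embedded member to the structure that contains it. A user-space sketch of the idea follows; model_container_of omits the kernel macro's type checking, and the two structs are hypothetical stand-ins that keep only the member names ei_block and vfs_inode from simplefs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space version of container_of(), without the type check. */
#define model_container_of(ptr, type, member) \
    ((type *) ((char *) (ptr) - offsetof(type, member)))

/* Stand-in for the generic VFS inode. */
struct vfs_inode_model {
    unsigned long i_ino;
};

/* Stand-in for simplefs_inode_info: private fields plus the embedded inode. */
struct inode_info_model {
    uint32_t ei_block;
    struct vfs_inode_model vfs_inode;
};

/* Same idea as SIMPLEFS_INODE(): step back from the embedded member. */
static struct inode_info_model *info_of(struct vfs_inode_model *inode)
{
    assert(inode != NULL);
    return model_container_of(inode, struct inode_info_model, vfs_inode);
}
```

Because the VFS only ever hands the file system a pointer to the generic inode, this subtraction is how the file system gets back to its own per-inode data without storing an extra pointer.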

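To determine where on disk a file's data lives, simplefs uses struct simplefs_extent, which records the extent's first logical block, its length, and its first physical block. sb_bread reads the inode's extent index block (ci->ei_block) through the superblock. Once the index is loaded, iblock is used to search it for the extent that covers the requested block.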

static int simplefs_file_get_block(struct inode *inode,
                                   sector_t iblock,
                                   struct buffer_head *bh_result,
                                   int create)
{
    struct super_block *sb = inode->i_sb;
    struct simplefs_sb_info *sbi = SIMPLEFS_SB(sb);
    struct simplefs_inode_info *ci = SIMPLEFS_INODE(inode);
    ...
    bh_index = sb_bread(sb, ci->ei_block);
    if (!bh_index)
        return -EIO;
    index = (struct simplefs_file_ei_block *) bh_index->b_data;
    extent = simplefs_ext_search(index, iblock);
    ...
}

Finally, get_free_blocks finds free space on disk, and map_bh maps the buffer_head to the resulting disk block.

...
if (index->extents[extent].ee_start == 0) {
        if (!create)
            return 0;
        bno = get_free_blocks(sbi, 8);
        if (!bno) {
            ret = -ENOSPC;
            goto brelse_index;
        }
        index->extents[extent].ee_start = bno;
        index->extents[extent].ee_len = 8;
        index->extents[extent].ee_block =
            extent ? index->extents[extent - 1].ee_block +
                         index->extents[extent - 1].ee_len
                   : 0;
        alloc = true;
    } else {
        bno = index->extents[extent].ee_start + iblock -
              index->extents[extent].ee_block;
    }

    /* Map the physical block to the given buffer_head */
    map_bh(bh_result, sb, bno);
...
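The allocate-on-demand branch above can be modelled in user space. This is a sketch under stated assumptions: alloc_blocks is a hypothetical bump allocator standing in for get_free_blocks, struct ext_slot and map_block are illustrative names, and error handling, buffer heads, and locking are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXTENT_BLOCKS 8u

struct ext_slot {
    uint32_t ee_block, ee_len, ee_start;
};

/* Hypothetical bump allocator standing in for get_free_blocks(). */
static uint32_t next_free = 100;

static uint32_t alloc_blocks(uint32_t n)
{
    uint32_t first = next_free;

    next_free += n;
    return first;
}

/* Resolve iblock through slot idx, allocating the extent on demand.
 * Returns the physical block, or 0 for an unallocated hole when
 * create is false. */
static uint32_t map_block(struct ext_slot *ext, size_t idx,
                          uint32_t iblock, int create)
{
    assert(ext != NULL);
    if (ext[idx].ee_start == 0) {
        if (!create)
            return 0;
        ext[idx].ee_start = alloc_blocks(EXTENT_BLOCKS);
        ext[idx].ee_len = EXTENT_BLOCKS;
        /* A new extent continues where the previous one ended. */
        ext[idx].ee_block =
            idx ? ext[idx - 1].ee_block + ext[idx - 1].ee_len : 0;
        return ext[idx].ee_start;
    }
    return ext[idx].ee_start + iblock - ext[idx].ee_block;
}
```

A read of an unallocated range (create == 0) returns without mapping anything, while a write allocates all 8 blocks of the extent at once and later accesses inside that extent need only the offset calculation.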

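TODO: Improve the page cache

See WIP on page cache hooks for disaggregated fs

What is the page cache?

The page cache exists to reduce how often data must be read from storage. For a file system that lives on disk, reading from the disk on every file access would be very slow, whereas pages are held in memory and are much faster to read.
For writes, the page cache first stores the data in memory and marks the page as dirty, meaning the data has been modified but not yet synchronized to disk. The file system later writes these pages back to disk to keep the data consistent.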
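The dirty-page idea can be illustrated with a single cached page in user space. This is a toy model, not how the kernel implements the page cache: struct cached_page, cache_write, and cache_writeback are hypothetical names, and the disk is just a byte array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096

/* One cached page; the backing store is supplied by the caller. */
struct cached_page {
    unsigned char data[PAGE_BYTES];
    bool dirty;
};

/* A write only touches memory and marks the page dirty. */
static void cache_write(struct cached_page *pg, size_t off,
                        const void *src, size_t len)
{
    assert(pg != NULL && off <= PAGE_BYTES && len <= PAGE_BYTES - off);
    memcpy(pg->data + off, src, len);
    pg->dirty = true;
}

/* Write-back copies a dirty page to the backing store and cleans it.
 * Returns true if any data was written. */
static bool cache_writeback(struct cached_page *pg, unsigned char *disk)
{
    assert(pg != NULL && disk != NULL);
    if (!pg->dirty)
        return false;
    memcpy(disk, pg->data, PAGE_BYTES);
    pg->dirty = false;
    return true;
}
```

Between cache_write and cache_writeback the memory copy and the disk copy differ, which is the window the dirty flag exists to track.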

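How does the page cache work in a file system?

struct address_space is defined in linux/fs.h and struct page in linux/mm_types.h. struct page describes a page of data in memory, and the page is the smallest unit of memory management. The excerpt below shows how page relates to address_space.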

struct page {
    ...
    /* See page-flags.h for PAGE_MAPPING_FLAGS */
    struct address_space *mapping;
    union {
        pgoff_t index;		/* Our offset within mapping. */
        unsigned long share;	/* share count for fsdax */
    };
    ...
};

mapping points to the address_space the page belongs to, and index gives the page's offset within that mapping.

Figure source: Memory Mapping
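The relation between a byte position in a file and the index field can be written out directly. The sketch assumes 4 KiB pages (a page shift of 12); the MODEL_ macros and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12 /* assumes 4 KiB pages */
#define MODEL_PAGE_SIZE (1ull << MODEL_PAGE_SHIFT)

/* Page index within the mapping for a byte position in the file. */
static uint64_t page_index_of(uint64_t pos)
{
    return pos >> MODEL_PAGE_SHIFT;
}

/* Byte offset inside that page. */
static uint64_t offset_in_page_of(uint64_t pos)
{
    return pos & (MODEL_PAGE_SIZE - 1);
}

/* Inverse: rebuild the file position from index and in-page offset. */
static uint64_t pos_from(uint64_t index, uint64_t off)
{
    assert(off < MODEL_PAGE_SIZE);
    return (index << MODEL_PAGE_SHIFT) | off;
}
```

For example, byte 10000 of a file lives in the page with index 2, at offset 1808 inside that page.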

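struct address_space is the key structure through which pages are read from or written to a file. The operations themselves come from another structure, address_space_operations *a_ops, which provides the methods for writing dirty pages back to disk and for reading pages in.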

struct address_space {
    struct inode		*host;
    struct xarray		i_pages;
    ...
    const struct address_space_operations *a_ops;
    ...
}

struct address_space_operations {
	int (*writepage)(struct page *page, struct writeback_control *wbc);
	int (*read_folio)(struct file *, struct folio *);

	/* Write back some dirty pages from this mapping. */
	int (*writepages)(struct address_space *, struct writeback_control *);

	/* Mark a folio dirty.  Return true if this dirtied it */
	bool (*dirty_folio)(struct address_space *, struct folio *);

	void (*readahead)(struct readahead_control *);
    ...
}
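The role of a_ops can be sketched as a table of function pointers that generic code calls without knowing which file system is behind it. Everything below is a hypothetical reduction: aops_model keeps two hooks with simplified signatures, and the counters exist only to make the dispatch observable.

```c
#include <assert.h>
#include <stddef.h>

struct mapping_model;

/* Reduced operations table: one read hook and one write-back hook. */
struct aops_model {
    int (*read_page)(struct mapping_model *m, unsigned long index);
    int (*write_page)(struct mapping_model *m, unsigned long index);
};

struct mapping_model {
    const struct aops_model *a_ops;
    unsigned long reads, writes; /* counters for this demo only */
};

static int demo_read(struct mapping_model *m, unsigned long index)
{
    (void) index;
    m->reads++;
    return 0;
}

static int demo_write(struct mapping_model *m, unsigned long index)
{
    (void) index;
    m->writes++;
    return 0;
}

static const struct aops_model demo_aops = {
    .read_page = demo_read,
    .write_page = demo_write,
};

/* Generic code never names a file system; it only follows a_ops. */
static int generic_read(struct mapping_model *m, unsigned long index)
{
    assert(m != NULL && m->a_ops != NULL);
    if (!m->a_ops->read_page)
        return -1; /* operation not provided by this file system */
    return m->a_ops->read_page(m, index);
}
```

A file system opts in to page cache behaviour by filling in the hooks it supports; leaving a hook unset is how it signals that the operation is unavailable.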

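Tracing the page cache with GDB

Preparing the file system image

First build the file system against the kernel version used in the virtual environment, here 6.3.0. Because the module will run under UML, the build must specify ARCH=um. Then create the file system image for this version with make test.img.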

$ make all VERSION=6.3.0 ARCH=um
cc -std=gnu99 -Wall -o mkfs.simplefs mkfs.c
make -C /home/fewletter/linux2023/linux-6.3/rootfs/lib/modules/6.3.0/build M=/home/fewletter/linux2023/simplefs modules ARCH=um
make[1]: 進入目錄「/home/fewletter/linux2023/linux-6.3」
  CC [M]  /home/fewletter/linux2023/simplefs/fs.o
  CC [M]  /home/fewletter/linux2023/simplefs/super.o
  CC [M]  /home/fewletter/linux2023/simplefs/inode.o
  CC [M]  /home/fewletter/linux2023/simplefs/file.o
  CC [M]  /home/fewletter/linux2023/simplefs/dir.o
  CC [M]  /home/fewletter/linux2023/simplefs/extent.o
  LD [M]  /home/fewletter/linux2023/simplefs/simplefs.o
  MODPOST /home/fewletter/linux2023/simplefs/Module.symvers
  CC [M]  /home/fewletter/linux2023/simplefs/simplefs.mod.o
  LD [M]  /home/fewletter/linux2023/simplefs/simplefs.ko
make[1]: Leaving directory '/home/fewletter/linux2023/linux-6.3'

$ make test.img
dd if=/dev/zero of=test.img bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0.102465 s, 2.0 GB/s
./mkfs.simplefs test.img
Superblock: (4096)
        magic=0xdeadce
        nr_blocks=51200
        nr_inodes=51240 (istore=915 blocks)
        nr_ifree_blocks=2
        nr_bfree_blocks=2
        nr_free_inodes=51239
        nr_free_blocks=50280
Inode store: wrote 915 blocks
        inode size = 72 B
Ifree blocks: wrote 2 blocks
Bfree blocks: wrote 2 blocks

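Next, copy the whole simplefs directory, including the built module and the file system image, into rootfs (the root file system). Start GDB and type r to boot UML, then load the module inside UML and mount the file system on the test directory.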

rootfs$ cp -r /home/fewletter/linux2023/simplefs simplefs/
...
(gdb) r
Starting program: /home/fewletter/linux2023/linux-6.3/vmlinux umid=uml0 root=/dev/root rootfstype=hostfs rootflags=/home/fewletter/linux2023/linux-6.3/rootfs rw mem=64M init=/init.sh quiet
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 9053]
[Detaching after fork from child process 9054]
[Detaching after fork from child process 9055]
[Detaching after fork from child process 9056]
[New Thread 0x7ffff7d8db80 (LWP 9057)]
[New Thread 0x7ffff7d8db80 (LWP 9058)]
Failed to initialize ubd device 0 :Couldn't determine size of device's file
[Detaching after fork from child process 9059]
UML:/ # cd simplefs
UML:/simplefs # insmod simplefs.ko
UML:/simplefs # mount -t simplefs -o loop test.img /test/
UML:/simplefs # df -Th
Filesystem           Type            Size      Used Available Use% Mounted on
root                 hostfs        195.8G     53.6G    132.2G  29% /
devtmpfs             devtmpfs       28.5M         0     28.5M   0% /dev
/dev/loop0           simplefs      200.0M      3.6M    196.4M   2% /test

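In another terminal, run pkill -SIGUSR1 -o vmlinux to switch from UML back to GDB.

Setting breakpoints in GDB and tracing the page cache code

Be sure to run lx-symbols in GDB first. It loads the symbols of the kernel modules, and without it GDB cannot debug simplefs.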

(gdb) lx-symbols
loading vmlinux
scanning for modules in /home/fewletter/linux2023/linux-6.3
loading @0x64947000: /home/fewletter/linux2023/linux-6.3/drivers/block/loop.ko
loading @0x649ad000: /home/fewletter/linux2023/linux-6.3/rootfs/simplefs/simplefs/simplefs.ko

Set breakpoints on simplefs_file_get_block, simplefs_write_begin, simplefs_write_end, simplefs_readahead, simplefs_writepage, and simplefs_ext_search, then observe how they behave when a file is created and when it is modified.

(gdb) info b
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00000000649af54d in simplefs_file_get_block at /home/fewletter/linux2023/simplefs/file.c:21
        breakpoint already hit 26 times
2       breakpoint     keep y   0x00000000649af468 in simplefs_write_begin at /home/fewletter/linux2023/simplefs/file.c:126
        breakpoint already hit 22 times
3       breakpoint     keep y   0x00000000649af26f in simplefs_write_end at /home/fewletter/linux2023/simplefs/file.c:167
        breakpoint already hit 24 times
4       breakpoint     keep y   0x00000000649af50a in simplefs_readahead at /home/fewletter/linux2023/simplefs/file.c:86
5       breakpoint     keep y   0x00000000649af52a in simplefs_writepage at /home/fewletter/linux2023/simplefs/file.c:101
        breakpoint already hit 13 times
6       breakpoint     keep y   0x00000000649af983 in simplefs_ext_search at /home/fewletter/linux2023/simplefs/extent.c:14
        breakpoint already hit 8 times

Experiment: creating and modifying a file
After switching back to UML, create a file in the file system. Creating the file starts with simplefs_write_begin and ends with simplefs_write_end.

UML:/ # echo "test1" > test/hello

Thread 1 "vmlinux" hit Breakpoint 2, simplefs_write_begin (file=0x60a3ae00, mapping=0x60a9e400, pos=0, len=6, pagep=0x64897c58, fsdata=0x64897c60)
    at /home/fewletter/linux2023/simplefs/file.c:126
126     {
(gdb) c
Continuing.

Thread 1 "vmlinux" hit Breakpoint 1, simplefs_file_get_block (inode=inode@entry=0x60a9e2b8, iblock=iblock@entry=0, bh_result=bh_result@entry=0x60c29d00, create=create@entry=1)
    at /home/fewletter/linux2023/simplefs/file.c:21
21      {
(gdb) c
Continuing.

Thread 1 "vmlinux" hit Breakpoint 6, simplefs_ext_search (index=index@entry=0x6075b000, iblock=iblock@entry=0) at /home/fewletter/linux2023/simplefs/extent.c:14
14      {
(gdb) c
Continuing.

Thread 1 "vmlinux" hit Breakpoint 3, simplefs_write_end (file=0x60a3ae00, mapping=0x60a9e400, pos=0, len=6, copied=6, page=0x63f38e50, fsdata=0x0 <loop_exit>)
    at /home/fewletter/linux2023/simplefs/file.c:167
167     {
(gdb) c
Continuing.
UML:/ #

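When the file is modified, however, one more breakpoint is hit at the very end: simplefs_writepage, which means the modified data is being written from the page cache back to the physical disk.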

UML:/ # echo "test9" > test/hello

Thread 1 "vmlinux" hit Breakpoint 2, simplefs_write_begin (file=0x60b61500, mapping=0x60a9e400, pos=0, len=6, pagep=0x64897c58, fsdata=0x64897c60)
    at /home/fewletter/linux2023/simplefs/file.c:126
126     {
(gdb) c
Continuing.

Thread 1 "vmlinux" hit Breakpoint 1, simplefs_file_get_block (inode=inode@entry=0x60a9e2b8, iblock=iblock@entry=0, bh_result=bh_result@entry=0x60c2bf70, create=create@entry=1)
    at /home/fewletter/linux2023/simplefs/file.c:21
21      {
(gdb) c
Continuing.

Thread 1 "vmlinux" hit Breakpoint 6, simplefs_ext_search (index=index@entry=0x6075b000, iblock=iblock@entry=0) at /home/fewletter/linux2023/simplefs/extent.c:14
14      {
(gdb) c
Continuing.

Thread 1 "vmlinux" hit Breakpoint 3, simplefs_write_end (file=0x60b61500, mapping=0x60a9e400, pos=0, len=6, copied=6, page=0x63f38e88, fsdata=0x0 <loop_exit>)
    at /home/fewletter/linux2023/simplefs/file.c:167
167     {
(gdb) c
Continuing.
UML:/ # 
Thread 1 "vmlinux" hit Breakpoint 5, simplefs_writepage (page=0x63f38e88, wbc=0x6488bce0) at /home/fewletter/linux2023/simplefs/file.c:101
101     {

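This simple experiment shows that the file's data is already held in the page cache once the file is created. When the file is modified, the write goes to the page cache rather than directly to disk, and the dirty page is then written back to disk on its own afterwards, here right after the final UML:/ # prompt shown above.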
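From user space, the writeback does not have to be left to the kernel's timing: fsync() asks for the dirty pages of one file to be written back before it returns. The following is a minimal sketch using only standard POSIX calls; write_and_sync is a name chosen for this illustration.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a string to path, then force the file's dirty pages to disk.
 * Returns 0 on success, -1 on any failure.
 */
int write_and_sync(const char *path, const char *text)
{
    size_t len = strlen(text);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    /* After write() returns, the data sits in the page cache as dirty. */
    if (write(fd, text, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    /* fsync() blocks until this file's dirty pages are written back. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }

    return close(fd);
}
```

Running something like write_and_sync("test/hello", "test9\n") inside UML with the same breakpoints set would be expected to reach simplefs_writepage before the call returns instead of some time after the prompt comes back. That expectation has not been verified in this write-up.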

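TODO: Eliminate memory errors

Use kmemleak and kasan to eliminate memory errors

Attempting to produce a kmemleak report

According to the kmemleak documentation, kmemleak exposes its interface as a file under debugfs. Enabling it starts with a change to .config. I originally intended to modify the .config of my local kernel, but to be on the safe side I decided to do it in a virtualized environment, which I set up by following "Virtualized environments for testing the Linux kernel".

Modifying .config

The .config file is essential to building the kernel, because it determines which features the kernel is built with. The kernel also ships a script for merging changes into .config, invoked as scripts/kconfig/merge_config.sh .config .config-fragment. For this kernel, CONFIG_DEBUG_INFO and CONFIG_DEBUG_KMEMLEAK need to be enabled.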

$ echo "CONFIG_DEBUG_INFO=y" > .config-fragment
$ scripts/kconfig/merge_config.sh .config .config-fragment

$ echo "CONFIG_DEBUG_KMEMLEAK=y" > .config-fragment
$ scripts/kconfig/merge_config.sh .config .config-fragment

The .config file differs slightly between Linux versions.

Next, build the kernel for the environment:

$ make ARCH=x86 CROSS_COMPILE=x86_64-linux-gnu- -j$(nproc)

After a successful build the following message appears. The number after # is the build count.

Kernel: arch/x86/boot/bzImage is ready (#1)

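Compiling simplefs in the virtual environment

Using QEMU as the virtual environment has one drawback: the module cannot be built on the host against the kernel version of the guest. Running ls .virtme_mods/lib/modules/ in the directory that hosts the virtual environment prints 0.0.0, which means simplefs has to be compiled from inside the virtual environment.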

fewletter@fewletter-Veriton-M4665G:~/linux2023/linux$ virtme-run --kdir . --mods=auto
./.virtme_mods/lib/modules/0.0.0
...
root@(none):/tmp/simplefs# make
cc -std=gnu99 -Wall -o mkfs.simplefs mkfs.c
make -C /lib/modules/6.1.34/build M=/tmp/simplefs modules 
make[1]: Entering directory '/home/fewletter/linux2023/linux'
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
  You are using:           gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
  CC [M]  /tmp/simplefs/fs.o
  CC [M]  /tmp/simplefs/super.o
...
[   98.786243] Tasks state (memory values in pages):
[   98.786582] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[   98.787200] [     98]     0    98     5334      777    61440        0         -1000 systemd-udevd
[   98.787831] [    157]     0   157     1062      144    49152        0             0 bash
[   98.788440] [    178]     0   178      728       68    40960        0             0 make
[   98.789098] [    185]     0   185      793      136    49152        0             0 make
[   98.789761] [    398]     0   398      817      161    45056        0             0 make
[   98.790428] [    422]     0   422      656       30    40960        0             0 sh
[   98.791054] [    423]     0   423      948       47    45056        0             0 gcc
[   98.791680] [    424]     0   424    20244    10738   200704        0             0 cc1
[   98.792298] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,task=cc1,pid=424,uid=0
[   98.793063] Out of memory: Killed process 424 (cc1) total-vm:80976kB, anon-rss:42668kB, file-rss:284kB, shmem-rss:0kB, UID:0 pgtables:196kB oom_score_adj:0
[   98.798426] cc1 (424) used greatest stack depth: 12664 bytes left
gcc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:250: /tmp/simplefs/super.o] Error 1
make[1]: *** [Makefile:2012: /tmp/simplefs] Error 2
make[1]: Leaving directory '/home/fewletter/linux2023/linux'
make: *** [Makefile:23: all] Error 2
root@(none):/tmp/simplefs#

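As the output above shows, simplefs fails to compile. This is unusual, because the project compiled fine in the earlier section on setting up the virtualized test environment. I therefore used crash, following the material on kernel debugging with crash, to find out what went wrong.

Debugging with crash

First, change the command that launches the virtualized environment to the one below. --mods=auto makes the modules under lib/modules/ match the kernel of the virtualized environment, and --qemu-opts -qmp tcp:localhost:4444,server,nowait exposes a QMP socket so that we can connect with telnet, dump the guest memory to an image file, and debug that image with crash.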

$ virtme-run --kdir . --mods=auto --qemu-opts -qmp tcp:localhost:4444,server,nowait

In another terminal, connect with telnet:

$ telnet localhost 4444

The following output appears:

Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
{"QMP": {"version": {"qemu": {"micro": 1, "minor": 2, "major": 4}, "package": "Debian 1:4.2-3ubuntu6.27"}, "capabilities": ["oob"]}}

Then enter the following two commands. "file:vmcore.img" names the output file. It is best to give it an image (img) name, because crash mainly works on dump images.

{ "execute": "qmp_capabilities" }
{ "execute": "dump-guest-memory", "arguments": {"paging": false, "protocol": "file:vmcore.img"}}

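Open yet another terminal and run the following command. crash reads the dump file you pass to it (in this run the dump was saved as kmemleak_mempoolsize2000.img rather than vmcore.img).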

$ crash /home/fewletter/linux2023/linux/vmlinux /home/fewletter/linux2023/linux/kmemleak_mempoolsize2000.img
...
KERNEL: /home/fewletter/linux2023/linux/vmlinux
    DUMPFILE: /home/fewletter/linux2023/linux/kmemleak_mempoolsize2000.img
        CPUS: 1
        DATE: Thu Jun 22 21:57:16 CST 2023
      UPTIME: 00:03:16
LOAD AVERAGE: 0.02, 0.02, 0.00
       TASKS: 45
    NODENAME: (none)
     RELEASE: 5.17.15
     VERSION: #6 SMP Thu Jun 22 21:52:44 CST 2023
     MACHINE: x86_64  (3000 Mhz)
      MEMORY: 127.5 MB
       PANIC: ""
         PID: 0
     COMMAND: "swapper/0"
        TASK: ffffffff8cc14940  [THREAD_INFO: ffffffff8cc14940]
         CPU: 0
       STATE: TASK_RUNNING (ACTIVE)
     WARNING: panic task not found

No kernel panic occurred, but an OOM (out of memory) did, so I used crash to look into what made simplefs fail to compile.

crash> ps
      PID    PPID  CPU       TASK        ST  %MEM      VSZ      RSS  COMM
>       0       0   0  ffffffffb1e14a40  RU   0.0        0        0  [swapper/0]
        1       0   0  ffffa0e041358000  IN   0.4     4116      468  virtme-init      
    ...
       38       2   0  ffffa0e0420c2080  ID   0.0        0        0  [scsi_tmf_0]
       39       2   0  ffffa0e0420c30c0  IN   0.0        0        0  [scsi_eh_1]
       40       2   0  ffffa0e0420c4100  ID   0.0        0        0  [scsi_tmf_1]
       41       2   0  ffffa0e0420c5140  ID   0.0        0        0  [mld]
       42       2   0  ffffa0e0420c6180  ID   0.0        0        0  [ipv6_addrconf]
       43       2   0  ffffa0e043430000  IN   0.0        0        0  [kmemleak]
       98       1   0  ffffa0e0434b1040  IN   2.4    21336     3172  systemd-udevd
      157       1   0  ffffa0e0434930c0  IN   1.0     4248     1304  bash
      432       2   0  ffffa0e043498000  ID   0.0        0        0  [kworker/0:1]
      451       2   0  ffffa0e0434b2080  ID   0.0        0        0  [kworker/0:2]
      452       2   0  ffffa0e0434b5140  ID   0.0        0        0  [kworker/0:0]

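PID 98 (systemd-udevd) appears in the task list of the OOM report above, so I checked what it was doing. Note that the process the OOM killer actually terminated was cc1 (PID 424), which does not appear in the ps output above.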

crash> bt 98
PID: 98       TASK: ffffa0e0434b1040  CPU: 0    COMMAND: "systemd-udevd"
 #0 [ffffaef5c01bbd08] __schedule at ffffffffb137f181
 #1 [ffffaef5c01bbd90] schedule at ffffffffb137f6d5
 #2 [ffffaef5c01bbda8] schedule_hrtimeout_range_clock at ffffffffb1385d42
 #3 [ffffaef5c01bbe18] do_epoll_wait at ffffffffb0905c08
 #4 [ffffaef5c01bbf00] __x64_sys_epoll_wait at ffffffffb0906f70
 #5 [ffffaef5c01bbf38] do_syscall_64 at ffffffffb1377f08
 #6 [ffffaef5c01bbf50] entry_SYSCALL_64_after_hwframe at ffffffffb140009b
    RIP: 00007fe9f356a42a  RSP: 00007ffeebe1dd78  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000559fe09ab8a0  RCX: 00007fe9f356a42a
    RDX: 000000000000000a  RSI: 0000559fe0bfb4c0  RDI: 0000000000000008
    RBP: ffffffffffffffff   R8: 000000000000000a   R9: 00007ffeebfbb0f0
    R10: 00000000ffffffff  R11: 0000000000000246  R12: 0000000000000001
    R13: 000000000000000a  R14: 0000559fe07352ba  R15: 0000559fe09ab8a0
    ORIG_RAX: 00000000000000e8  CS: 0033  SS: 002b
crash> gdb list do_syscall_64
22      
23      #ifdef CONFIG_HAVE_JUMP_LABEL_HACK
24      
25      static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
26      {
27              asm_volatile_goto("1:"
28                      "jmp %l[l_yes] # objtool NOPs this \n\t"
29                      JUMP_TABLE_ENTRY
30                      : :  "i" (key), "i" (2 | branch) : : l_yes);
31      
crash> gdb list __schedule
6431     *          - return from interrupt-handler to user-space
6432     *
6433     * WARNING: must be called with preemption disabled!
6434     */
6435    static void __sched notrace __schedule(unsigned int sched_mode)
6436    {
6437            struct task_struct *prev, *next;
6438            unsigned long *switch_count;
6439            unsigned long prev_state;
6440            struct rq_flags rf;
crash> gdb list schedule
10      
11      DECLARE_PER_CPU(struct task_struct *, current_task);
12      
13      static __always_inline struct task_struct *get_current(void)
14      {
15              return this_cpu_read_stable(current_task);
16      }
17      
18      #define current get_current()
19      
crash>

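A comment in the __schedule source listed above reads "WARNING: must be called with preemption disabled!", so I set CONFIG_PREEMPT_NONE_BUILD=y and CONFIG_PREEMPT_NONE=y in .config and rebuilt the kernel in the same way as before.

After entering the virtualized environment again and trying to compile simplefs, the build still failed, as shown below.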

root@(none):/tmp/simplefs# make
cc -std=gnu99 -Wall -o mkfs.simplefs mkfs.c
make -C /lib/modules/6.1.34/build M=/tmp/simplefs modules 
make[1]: Entering directory '/home/fewletter/linux2023/linux'
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
  You are using:           gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
  CC [M]  /tmp/simplefs/fs.o
  CC [M]  /tmp/simplefs/super.o
...
[   98.786243] Tasks state (memory values in pages):
[   98.786582] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[   98.787200] [     98]     0    98     5334      777    61440        0         -1000 systemd-udevd
[   98.787831] [    157]     0   157     1062      144    49152        0             0 bash
[   98.788440] [    178]     0   178      728       68    40960        0             0 make
[   98.789098] [    185]     0   185      793      136    49152        0             0 make
[   98.789761] [    398]     0   398      817      161    45056        0             0 make
[   98.790428] [    422]     0   422      656       30    40960        0             0 sh
[   98.791054] [    423]     0   423      948       47    45056        0             0 gcc
[   98.791680] [    424]     0   424    20244    10738   200704        0             0 cc1
[   98.792298] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,task=cc1,pid=424,uid=0
[   98.793063] Out of memory: Killed process 424 (cc1) total-vm:80976kB, anon-rss:42668kB, file-rss:284kB, shmem-rss:0kB, UID:0 pgtables:196kB oom_score_adj:0
[   98.798426] cc1 (424) used greatest stack depth: 12664 bytes left
gcc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:250: /tmp/simplefs/super.o] Error 1
make[1]: *** [Makefile:2012: /tmp/simplefs] Error 2
make[1]: Leaving directory '/home/fewletter/linux2023/linux'
make: *** [Makefile:23: all] Error 2

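Since the result shows that the error has nothing to do with preemption, I looked at the crash output again. kmemleak also appears in the process list, which suggests that kmemleak may be what pushes the virtual environment into OOM while simplefs is being compiled.

Adjusting the kmemleak config settings

Below are the kmemleak-related settings in .config. I first changed CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE to 200 (note that the listing shows 16000 for this option), rebuilt the kernel, and entered the virtualized environment to compile simplefs. It still failed.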

CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE=16000
# CONFIG_DEBUG_KMEMLEAK_TEST is not set
CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y
CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y

Finally I set CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y, which keeps kmemleak disabled when the environment boots, and rebuilt the kernel. The build message below showed that this was already my ninth build.

Kernel: arch/x86/boot/bzImage is ready (#9)

Compiling simplefs in the environment then succeeded:

root@(none):/tmp/simplefs# make all
cc -std=gnu99 -Wall -o mkfs.simplefs mkfs.c
make -C /lib/modules/5.17.15/build M=/tmp/simplefs modules 
make[1]: Entering directory '/home/fewletter/linux2023/linux'
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
  You are using:           gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
  CC [M]  /tmp/simplefs/fs.o
  CC [M]  /tmp/simplefs/super.o
  CC [M]  /tmp/simplefs/inode.o
  CC [M]  /tmp/simplefs/file.o
  CC [M]  /tmp/simplefs/dir.o
  CC [M]  /tmp/simplefs/extent.o
  LD [M]  /tmp/simplefs/simplefs.o
  MODPOST /tmp/simplefs/Module.symvers
  CC [M]  /tmp/simplefs/simplefs.mod.o
  LD [M]  /tmp/simplefs/simplefs.ko
make[1]: Leaving directory '/home/fewletter/linux2023/linux'

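At this point, however, the disabled kmemleak could not be turned back on. Note that kmemleak=on is a kernel boot parameter, so typing it at the shell as below only sets a shell variable and does not enable kmemleak.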

root@(none):/tmp/simplefs# mount -t debugfs nodev /sys/kernel/debug/
mount: /sys/kernel/debug: nodev already mounted or mount point busy.
root@(none):/# mount | grep "debug"
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
root@(none):/tmp/simplefs# kmemleak=on 
root@(none):/tmp/simplefs# echo scan=on > /sys/kernel/debug/kmemleak
bash: echo: write error: Operation not permitted
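The "Operation not permitted" message from bash corresponds to the error code EPERM returned by the write. The sketch below performs the same write from C so that the error code can be inspected directly. write_control is a hypothetical helper written for this note, and the kmemleak path is simply the one used in the session above.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write one command to a control file such as /sys/kernel/debug/kmemleak.
 * Returns 0 on success or a negative errno value on failure.
 */
int write_control(const char *path, const char *cmd)
{
    size_t len = strlen(cmd);
    int fd = open(path, O_WRONLY);
    int err = 0;
    ssize_t n;

    if (fd < 0)
        return -errno;

    n = write(fd, cmd, len);
    if (n < 0)
        err = -errno;   /* e.g. -EPERM while kmemleak is disabled */
    else if ((size_t)n != len)
        err = -EIO;     /* short write: treat as an I/O error */

    close(fd);
    return err;
}
```

In the situation shown above, write_control("/sys/kernel/debug/kmemleak", "scan=on") would be expected to return -EPERM, matching the bash error.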

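The conclusion from the errors above is that the memory given to the virtual environment (the crash summary reports 127.5 MB) is not enough to run kmemleak and compile simplefs at the same time. The virtual environment must be given more memory.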
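The numbers in the OOM report can be cross-checked with a little arithmetic. The task table lists total_vm and rss in pages, while the "Killed process" line reports kB. The sketch below converts between the two, assuming 4 KiB pages; pages_to_kb is a helper name made up for this note.

```c
#include <stdint.h>

/* Convert a page count from the OOM task table to kB (1 kB = 1024 bytes). */
uint64_t pages_to_kb(uint64_t pages, uint64_t page_size_bytes)
{
    return pages * page_size_bytes / 1024;
}
```

For cc1, pages_to_kb(10738, 4096) gives 42952 kB, which equals anon-rss 42668 kB plus file-rss 284 kB, and pages_to_kb(20244, 4096) gives 80976 kB, matching total-vm. A single compiler process therefore needed roughly 42 MB of resident memory, about a third of the memory crash reports for the guest.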

Jim Huang

2023/06/19 10:29:02

"directory" should be rendered as 「目錄」, not 「檔案夾」 (Edited)

2023/06/23 00:04:12

Running kmemleak requires more memory. You need to adjust the QEMU options, see: https://github.com/amluto/virtme/blob/master/samples/xfstests (--qemu-opts) (Edited)

2023/06/23 00:10:47

```c
static struct inode *simplefs_new_inode(struct inode *dir, mode_t mode)
{
    ...
#if MNT_IDMAP_REQUIRED()
    inode_init_owner(&nop_mnt_idmap, inode, dir, inode->mode);
    ...
}
```

Since Issue #25 has not been updated, please submit a new pull request directly and credit cbkadal's work (e.g. Reported by Cicak Bin Kadal) (Edited)

2023/06/24 08:58:31

Project walkthrough

Adjust the access permission; it should be "public" (Edited)