# Linux Kernel Project: simplefs
> Contributor: fewletter
> [Project walkthrough video](https://youtu.be/9FFa8GuU8t4)
:::success
:question: Questions
* Could [kasan](https://www.kernel.org/doc/html/latest/dev-tools/kasan.html) be used instead to detect memory problems?
:::
:::info
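* Reply
In principle yes, but I have not tried it in practice. In my experience kmemleak has to be disabled before enabling kasan; otherwise the system crashes while the environment is still being set up. The kernel config provides four related options: `CONFIG_KASAN`, `CONFIG_KASAN_GENERIC`, `CONFIG_KASAN_SW_TAGS`, and `CONFIG_KASAN_HW_TAGS`. The first turns kasan on, and the other three select its mode.
As for memory, the guest's memory can be enlarged by passing a memory size through to QEMU, for example `virtme-run --kdir . --mods=auto --qemu-opts -m 2048` (QEMU's `-m` option takes the size in MiB), so running out of memory should not be a problem.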
:::
## [simplefs](https://github.com/sysprog21/simplefs)
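To explore the Linux VFS (virtual file system) interface and how file systems are implemented, we wrote a minimal file system from scratch that runs in Linux kernel mode. The source is a little over a thousand lines, supports basic file and directory operations, and also takes permissions and concurrency into account.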
## Background
* [Development log 1](https://hackmd.io/@Nahemah1022/rJo1sAtid)
* [Development log 2](https://hackmd.io/@fwfly/simplefs)
* [Development log 3](https://hackmd.io/@freshLiver/linux-vfs-main/)
## TODO: Fix compilation on Linux v6.3
See [Issue #23](https://github.com/sysprog21/simplefs/issues/23)
Approach: build the Linux v6.3 kernel with user-mode-linux or virtme (week 7 course material), try to compile simplefs against it, fix the resulting errors, and then submit a pull request.
### Tracing the failing code
According to the error message in [Issue #23](https://github.com/sysprog21/simplefs/issues/23), the failure traces back to [include/linux/fs.h](https://github.com/torvalds/linux/blob/master/include/linux/fs.h#L1699). The declaration there changed in Linux v6.3; from v5.12 through v6.2 it read
```c
void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
const struct inode *dir, umode_t mode);
```
Since v6.3 the declaration is
```c
void inode_init_owner(struct mnt_idmap *idmap, struct inode *inode,
const struct inode *dir, umode_t mode);
```
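### Setting up a virtualized test environment
This follows [測試 Linux 核心的虛擬化環境](https://hackmd.io/@sysprog/linux-virtme) (a virtualized environment for testing the Linux kernel), essentially step by step as in the course material.
* Run the commands below to fetch the Linux kernel source
* Switch to the 6.3 branch to match the environment that will be debugged later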
```shell
$ git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux
$ cd linux
$ git checkout -b linux-6.3.y origin/linux-6.3.y
```
Use virtme to select the default kernel configuration, then build:
```shell
$ virtme-configkernel --defconfig
$ make ARCH=x86 CROSS_COMPILE=x86_64-linux-gnu- -j6
```
When the build finishes, the following appears:
```shell
Kernel: arch/x86/boot/bzImage is ready (#1)
```
Next, start the virtual test environment:
```shell
$ virtme-run --kdir . --mods=auto
./.virtme_mods/lib/modules/0.0.0
No EFI environment detected.
early console in extract_kernel
input_data: 0x0000000002d6d3a8
input_len: 0x0000000000c572f0
output: 0x0000000001000000
output_len: 0x000000000297a5d0
...
[ 1.394987] virtme-init: triggering udev coldplug
[ 1.472012] virtme-init: waiting for udev to settle
[ 1.906750] gzip (97) used greatest stack depth: 13280 bytes left
[ 2.454079] virtme-init: udev is done
[ 2.558114] ip (124) used greatest stack depth: 12976 bytes left
virtme-init: console is ttyS0
```
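The virtual environment is simply another Linux instance, except that the prompt changes from `$` to `#`. First confirm that it runs the 6.3 kernel configured earlier: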
```shell
root@(none):/# uname -r
6.3.3
```
Check whether the whole project can be cloned from GitHub directly into this environment:
```shell
root@(none):/# git clone https://github.com/fewletter/simplefs.git
fatal: could not create work tree dir 'simplefs': Read-only file system
```
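Cloning from GitHub into this environment does not work because the file system is read-only, so instead copy the project that was already cloned on the host: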
```shell
root@(none):/# mkdir -p /tmp/simplefs
root@(none):/# cd tmp/simplefs
root@(none):/tmp/simplefs# cp /home/fewletter/linux2023/simplefs/dir.c .
root@(none):/tmp/simplefs# cp /home/fewletter/linux2023/simplefs/extent.c .
...
root@(none):/tmp/simplefs# cp /home/fewletter/linux2023/simplefs/bitmap.h .
root@(none):/tmp/simplefs# cp /home/fewletter/linux2023/simplefs/Makefile .
```
Build the project and reproduce the error:
```shell
root@(none):/tmp/simplefs# make
cc -std=gnu99 -Wall -o mkfs.simplefs mkfs.c
make -C /lib/modules/6.3.3/build M=/tmp/simplefs modules
make[1]: Entering directory '/home/fewletter/linux2023/linux'
warning: the compiler differs from the one used to build the kernel
The kernel was built by: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
You are using: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
CC [M] /tmp/simplefs/fs.o
CC [M] /tmp/simplefs/super.o
/tmp/simplefs/super.c: In function 'simplefs_fill_super':
/tmp/simplefs/super.c:286:22: error: passing argument 1 of 'inode_init_owner' from incompatible pointer type [-Werror=incompatible-pointer-types]
286 | inode_init_owner(&init_user_ns, root_inode, NULL, root_inode->i_mode);
| ^~~~~~~~~~~~~
| |
| struct user_namespace *
In file included from ./include/linux/highmem.h:5,
from ./include/linux/bvec.h:10,
from ./include/linux/blk_types.h:10,
from ./include/linux/buffer_head.h:12,
from /tmp/simplefs/super.c:3:
./include/linux/fs.h:1682:41: note: expected 'struct mnt_idmap *' but argument is of type 'struct user_namespace *'
1682 | void inode_init_owner(struct mnt_idmap *idmap, struct inode *inode,
| ~~~~~~~~~~~~~~~~~~^~~~~
cc1: all warnings being treated as errors
make[2]: *** [scripts/Makefile.build:252: /tmp/simplefs/super.o] Error 1
make[1]: *** [Makefile:2025: /tmp/simplefs] Error 2
make[1]: Leaving directory '/home/fewletter/linux2023/linux'
make: *** [Makefile:9: all] Error 2
```
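The output above confirms that Linux v6.3 contains a change that breaks the build of this project. The next step is to modify the code.
### Modifying the code for Linux 6.3
[simplefs/super.c](https://github.com/fewletter/simplefs/blob/master/super.c#L285) contains the following code: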
```c
#if USER_NS_REQUIRED()
inode_init_owner(&init_user_ns, root_inode, NULL, root_inode->i_mode);
#else
inode_init_owner(root_inode, NULL, root_inode->i_mode);
#endif
```
`USER_NS_REQUIRED()` is defined in [simplefs/simplefs.h](https://github.com/fewletter/simplefs/blob/master/simplefs.h#L29) as
```c
#define USER_NS_REQUIRED() LINUX_VERSION_CODE >= KERNEL_VERSION(5,12,0)
```
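The cause of the bug is now clear: the only version boundary this code knows about is 5.12, so on v6.3 the `user_namespace` branch is still selected and no longer compiles. As for what changed in the kernel between v6.2 and v6.3, the following diff gives a hint: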
```diff
-void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
- const struct inode *dir, umode_t mode);
+void inode_init_owner(struct mnt_idmap *idmap, struct inode *inode,
+ const struct inode *dir, umode_t mode);
```
The diff shows that only `struct user_namespace *mnt_userns` was replaced by `struct mnt_idmap *idmap`. The definition of `mnt_idmap` can be found in [linux/fs/mnt_idmapping.c](https://github.com/torvalds/linux/blob/master/fs/mnt_idmapping.c). In short, it wraps the original `user_namespace` pointer together with a reference count in a new structure.
```c
struct mnt_idmap {
struct user_namespace *owner;
refcount_t count;
};
/*
* Carries the initial idmapping of 0:0:4294967295 which is an identity
* mapping. This means that {g,u}id 0 is mapped to {g,u}id 0, {g,u}id 1 is
* mapped to {g,u}id 1, [...], {g,u}id 1000 to {g,u}id 1000, [...].
*/
struct mnt_idmap nop_mnt_idmap = {
.owner = &init_user_ns,
.count = REFCOUNT_INIT(1),
};
EXPORT_SYMBOL_GPL(nop_mnt_idmap);
```
#### simplefs.h
Add one more macro to simplefs.h:
```c
#define USER_NS_REQUIRED_6_3() LINUX_VERSION_CODE >= KERNEL_VERSION(6,3,0)
```
This means that when the project is built against Linux v6.3 or later, this macro selects the code below.
#### super.c
```diff
+#if USER_NS_REQUIRED_6_3()
+ inode_init_owner(&nop_mnt_idmap, root_inode, NULL, root_inode->i_mode);
-#if USER_NS_REQUIRED()
+#elif USER_NS_REQUIRED()
inode_init_owner(&init_user_ns, root_inode, NULL, root_inode->i_mode);
```
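The three-way version check used in this fix can be modeled in user space. The following is a minimal sketch, not code from simplefs: `enum owner_api` and `pick_owner_api()` are hypothetical names introduced only for illustration, and the local `KERNEL_VERSION` macro mirrors the encoding used by recent kernels. It shows why the newest boundary has to be tested first.

```c
#include <assert.h>

/* Mirrors the encoding of the kernel's KERNEL_VERSION() macro. */
#define KERNEL_VERSION(a, b, c) \
    (((a) << 16) + ((b) << 8) + ((c) > 255 ? 255 : (c)))

/* Hypothetical labels for the three inode_init_owner() call shapes. */
enum owner_api {
    OWNER_API_LEGACY,    /* before 5.12: no namespace argument     */
    OWNER_API_USER_NS,   /* 5.12 to 6.2: struct user_namespace *   */
    OWNER_API_MNT_IDMAP, /* 6.3 and later: struct mnt_idmap *      */
};

/* Mirrors the #if / #elif / #else chain: newest boundary first. */
enum owner_api pick_owner_api(int version_code)
{
    assert(version_code >= 0);
    if (version_code >= KERNEL_VERSION(6, 3, 0))
        return OWNER_API_MNT_IDMAP;
    if (version_code >= KERNEL_VERSION(5, 12, 0))
        return OWNER_API_USER_NS;
    return OWNER_API_LEGACY;
}
```

If the 5.12 check came first, every 6.3 kernel would also satisfy it and the `mnt_idmap` branch would never be reached, which is exactly the situation the original single macro was in.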
Testing in the virtual environment after this change showed that the bug in `super.c` was only part of the problem; many more compile errors come from `inode.c`.
#### inode.c
```diff
static struct inode *simplefs_new_inode(struct inode *dir, mode_t mode)
{
...
if (S_ISLNK(mode)) {
+#if USER_NS_REQUIRED_6_3()
+ inode_init_owner(&nop_mnt_idmap, inode, dir, inode->mode);
-#if USER_NS_REQUIRED()
+#elif USER_NS_REQUIRED()
inode_init_owner(&init_user_ns, inode, dir, mode);
#else
inode_init_owner(inode, dir, mode);
...
/* Get a free block for this new inode's index */
bno = get_free_blocks(sbi, 1);
if (!bno) {
ret = -ENOSPC;
goto put_inode;
}
+#if USER_NS_REQUIRED_6_3()
+ inode_init_owner(&nop_mnt_idmap, inode, dir, inode->mode);
-#if USER_NS_REQUIRED()
+#elif USER_NS_REQUIRED()
inode_init_owner(&init_user_ns, inode, dir, mode);
#else
inode_init_owner(inode, dir, mode);
}
...
/*
* Create a file or directory in this way:
* - check filename length and if the parent directory is not full
* - create the new inode (allocate inode and blocks)
* - cleanup index block of the new inode
* - add new file/directory in parent index
*/
+#if USER_NS_REQUIRED_6_3()
+static int simplefs_create(struct mnt_idmap *id,
+ struct inode *dir,
+ struct dentry *dentry,
+ umode_t mode,
+ bool excl)
-#if USER_NS_REQUIRED()
+#elif USER_NS_REQUIRED()
static int simplefs_create(struct user_namespace *ns,
struct inode *dir,
struct dentry *dentry,
umode_t mode,
bool excl)
...
+#if USER_NS_REQUIRED_6_3()
+static int simplefs_rename(struct mnt_idmap *id,
+ struct inode *old_dir,
+ struct dentry *old_dentry,
+ struct inode *new_dir,
+ struct dentry *new_dentry,
+ unsigned int flags)
-#if USER_NS_REQUIRED()
+#elif USER_NS_REQUIRED()
+static int simplefs_rename(struct user_namespace *ns,
...
+#if USER_NS_REQUIRED_6_3()
+static int simplefs_mkdir(struct mnt_idmap *id,
+ struct inode *dir,
+ struct dentry *dentry,
+ umode_t mode)
+{
+ return simplefs_create(id, dir, dentry, mode | S_IFDIR, 0);
+}
-#if USER_NS_REQUIRED()
+#elif USER_NS_REQUIRED()
static int simplefs_mkdir(struct user_namespace *ns,
...
+#if USER_NS_REQUIRED_6_3()
+static int simplefs_symlink(struct mnt_idmap *id,
+ struct inode *dir,
+ struct dentry *dentry,
+ const char *symname)
-#if USER_NS_REQUIRED()
+#elif USER_NS_REQUIRED()
static int simplefs_symlink(struct user_namespace *ns,
```
Everything changed in inode.c is version-dependent code. Keep in mind that from Linux v6.3 on, `user_namespace` is wrapped inside `mnt_idmap`. The test result in the virtual environment follows.
```shell
root@(none):/tmp/simplefs# make
cc -std=gnu99 -Wall -o mkfs.simplefs mkfs.c
make -C /lib/modules/6.3.3/build M=/tmp/simplefs modules
make[1]: Entering directory '/home/fewletter/linux2023/linux'
warning: the compiler differs from the one used to build the kernel
The kernel was built by: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
You are using: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
CC [M] /tmp/simplefs/fs.o
CC [M] /tmp/simplefs/super.o
CC [M] /tmp/simplefs/inode.o
CC [M] /tmp/simplefs/file.o
CC [M] /tmp/simplefs/dir.o
CC [M] /tmp/simplefs/extent.o
LD [M] /tmp/simplefs/simplefs.o
MODPOST /tmp/simplefs/Module.symvers
CC [M] /tmp/simplefs/simplefs.mod.o
LD [M] /tmp/simplefs/simplefs.ko
make[1]: Leaving directory '/home/fewletter/linux2023/linux'
```
The output above shows that the compile errors are gone.
:::warning
TODO: submit a pull request
:notes: jserv
:::
#### Fixing [Issue #25](https://github.com/sysprog21/simplefs/issues/25)
```c
static struct inode *simplefs_new_inode(struct inode *dir, mode_t mode)
{
...
#if MNT_IDMAP_REQUIRED()
inode_init_owner(&nop_mnt_idmap, inode, dir, inode->mode);
...
}
```
According to Issue #25, the code above is wrong. To find the error, I looked at what `inode_init_owner` does. In Linux v6.3 it is implemented as follows:
```c
void inode_init_owner(struct mnt_idmap *idmap, struct inode *inode,
const struct inode *dir, umode_t mode)
{
inode_fsuid_set(inode, idmap);
if (dir && dir->i_mode & S_ISGID) {
inode->i_gid = dir->i_gid;
/* Directories are special, and always inherit S_ISGID */
if (S_ISDIR(mode))
mode |= S_ISGID;
} else
inode_fsgid_set(inode, idmap);
inode->i_mode = mode;
}
EXPORT_SYMBOL(inode_init_owner);
```
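The last assignment shows that `inode->i_mode` is initialized from the `mode` parameter. Since the purpose of `simplefs_new_inode` is to initialize a new `inode` from its two parameters `dir` and `mode`, the last argument passed to `inode_init_owner` should be `mode`.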
```diff
#if MNT_IDMAP_REQUIRED()
+ inode_init_owner(&nop_mnt_idmap, inode, dir, mode);
- inode_init_owner(&nop_mnt_idmap, inode, dir, inode->mode);
```
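To make the effect of the last argument concrete, here is a small user-space model of the ownership logic quoted above. It is only a sketch under stated assumptions: `struct demo_inode`, `demo_init_owner()` and the `DEMO_*` constants are hypothetical stand-ins (the constants use the conventional Linux octal values), and `fsuid`/`fsgid` replace the values the kernel derives from the idmap and the current credentials.

```c
#include <assert.h>
#include <stddef.h>

/* Conventional Linux mode bits, redefined locally for portability. */
#define DEMO_S_IFMT  0170000u
#define DEMO_S_IFDIR 0040000u
#define DEMO_S_ISGID 0002000u

/* Hypothetical stand-in for the few inode fields used here. */
struct demo_inode {
    unsigned int i_mode;
    unsigned int i_uid;
    unsigned int i_gid;
};

/* Model of inode_init_owner(): the mode argument becomes i_mode. */
void demo_init_owner(struct demo_inode *inode, const struct demo_inode *dir,
                     unsigned int mode, unsigned int fsuid,
                     unsigned int fsgid)
{
    assert(inode != NULL);
    inode->i_uid = fsuid;
    if (dir && (dir->i_mode & DEMO_S_ISGID)) {
        /* A setgid parent passes its group to the new inode. */
        inode->i_gid = dir->i_gid;
        /* Directories additionally inherit the setgid bit itself. */
        if ((mode & DEMO_S_IFMT) == DEMO_S_IFDIR)
            mode |= DEMO_S_ISGID;
    } else {
        inode->i_gid = fsgid;
    }
    inode->i_mode = mode;
}
```

Because the function overwrites `i_mode` with whatever it is given, passing a field of the freshly allocated inode instead of the caller's `mode` would discard the requested file type and permissions.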
## TODO: Fix the `ls -a` problem
See the [2022 report](https://hackmd.io/@freshLiver/linux-vfs-main/), check whether the problem still exists on Linux v5.15+, and try to eliminate it.
### Testing `ls -a`
#### Development machine
```shell
fewletter@fewletter-Veriton-M4665G:~$ uname -r
5.15.0-72-generic
```
#### Test environment: development machine, Linux v5.15
```shell
fewletter@fewletter-Veriton-M4665G:~/linux2023/simplefs$ ls -a
. dir.o file.c .fs.o.cmd .inode.o.cmd modules.order script simplefs.mod.c .simplefs.o.cmd
.. .dir.o.cmd file.o .git LICENSE .modules.order.cmd simplefs.h .simplefs.mod.cmd super.c
bitmap.h extent.c .file.o.cmd .gitignore Makefile Module.symvers simplefs.ko simplefs.mod.o super.o
.clang-format extent.o fs.c inode.c mkfs.c .Module.symvers.cmd .simplefs.ko.cmd .simplefs.mod.o.cmd .super.o.cmd
dir.c .extent.o.cmd fs.o inode.o mkfs.simplefs README.md simplefs.mod simplefs.o test
```
#### Test environment: QEMU, Linux v6.3
```shell
root@(none):/tmp/simplefs# ls -a
. .extent.o.cmd .modules.order.cmd .simplefs.o.cmd Module.symvers extent.c fs.c mkfs.c simplefs.h simplefs.mod.o test
.. .file.o.cmd .simplefs.ko.cmd .super.o.cmd bitmap.h extent.o fs.o mkfs.simplefs simplefs.ko simplefs.o
.Module.symvers.cmd .fs.o.cmd .simplefs.mod.cmd LICENSE dir.c file.c inode.c modules.order simplefs.mod super.c
.dir.o.cmd .inode.o.cmd .simplefs.mod.o.cmd Makefile dir.o file.o inode.o script simplefs.mod.c super.o
```
In both environments `ls -a` lists `.` and `..`. According to the [2022 report](https://hackmd.io/@freshLiver/linux-vfs-main/%2FzJqwWG80Su258RVuf4nR2g#Improving-simplefs), the `ls -a` bug appeared on a Linux v5.11 development machine and on a Linux v5.18 virtual machine.
#### Test environment: QEMU, Linux v5.18
```shell
root@(none):/tmp/simplefs# ls -a
. .extent.o.cmd .modules.order.cmd .simplefs.o.cmd bitmap.h extent.o fs.o mkfs.simplefs simplefs.mod super.c
.. .file.o.cmd .simplefs.ko.cmd .super.o.cmd dir.c file.c inode.c modules.order simplefs.mod.c super.o
.Module.symvers.cmd .fs.o.cmd .simplefs.mod.cmd Makefile dir.o file.o inode.o simplefs.h simplefs.mod.o
.dir.o.cmd .inode.o.cmd .simplefs.mod.o.cmd Module.symvers extent.c fs.c mkfs.c simplefs.ko simplefs.o
```
With QEMU as the Linux v5.18 virtual environment, the bug still does not occur.
### Test environment: User-Mode Linux (UML)
References: [Building a User-Mode Linux experiment environment](https://hackmd.io/-KufbNBCRpKhBbZVWCz_CA?view#%E5%BB%BA%E6%A7%8B-User-Mode-Linux-%E7%9A%84%E5%AF%A6%E9%A9%97%E7%92%B0%E5%A2%83), [Development log 3](https://hackmd.io/@freshLiver/linux-vfs-main/%2F9Qym3jjIQIqmGZ2GZpbzTg#Experiment-Environment)
#### Preparing UML
First fetch the Linux v5.18 source:
```shell
$ sudo apt install build-essential libncurses-dev flex bison
$ sudo apt install xz-utils wget ca-certificates bc
$ wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.18.tar.xz
$ tar xvf linux-5.18.tar.xz
```
Configure the kernel. Note `ARCH=um` in particular: every later kernel-module build must pass it in order to target UML.
```shell
$ make mrproper
$ make defconfig ARCH=um SUBARCH=x86_64
$ make linux ARCH=um SUBARCH=x86_64 -j `nproc`
```
Before preparing `UML.sh`, set up the root file system (rootfs for short):
```shell
$ export REPO=http://dl-cdn.alpinelinux.org/alpine/v3.13/main
$ mkdir -p rootfs
$ curl $REPO/x86_64/APKINDEX.tar.gz | tar -xz -C /tmp/
$ export APK_TOOL=`grep -A1 apk-tools-static /tmp/APKINDEX | cut -c3- | xargs printf "%s-%s.apk"`
$ curl $REPO/x86_64/$APK_TOOL | fakeroot tar -xz -C rootfs
$ fakeroot rootfs/sbin/apk.static \
--repository $REPO --update-cache \
--allow-untrusted \
--root $PWD/rootfs --initdb add alpine-base
$ echo $REPO > rootfs/etc/apk/repositories
$ echo "LABEL=ALPINE_ROOT / auto defaults 1 1" >> rootfs/etc/fstab
```
Write `init.sh`, the init program that `UML.sh` starts, with the following content:
```shell
#!/bin/sh
mount -t proc proc /proc
mount -t sysfs sys /sys
export PS1='UML:\w\ $ '
export PS1='\[\033[01;32mUML:\w\033[00m \$ '
exec /sbin/tini /bin/sh +m
```
Write `UML.sh` to launch UML:
```shell
#!/bin/sh
./linux umid=uml0 ubd0=/dev/null \
root=/dev/root rootfstype=hostfs hostfs=./rootfs \
rw mem=64M init=/init.sh quiet
stty sane ; echo
```
Launch UML:
```
$ ./UML.sh
UML:/ # ls
bin etc init.sh media opt root sbin srv tmp var
dev home lib mnt proc run simplefs sys usr
```
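#### Preparing kernel modules for the required version
Next, set up the User-Mode Linux environment inside rootfs. The kernel modules are what needs to be built and installed here; again remember `ARCH=um`. Afterwards, go to the target directory, here `lib/modules/5.18.0`, and check that the installation succeeded.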
```shell
~/linux2023/linux-5.18$ make modules_install INSTALL_MOD_PATH=`pwd`/rootfs ARCH=um
~/linux2023/linux-5.18/rootfs/lib/modules/5.18.0$ ls
build modules.alias modules.builtin modules.builtin.bin modules.dep modules.devname modules.softdep modules.symbols.bin
kernel modules.alias.bin modules.builtin.alias.bin modules.builtin.modinfo modules.dep.bin modules.order modules.symbols source
```
#### Modifying the Makefile to build against different Linux versions
```diff
+v1PATH ?= /home/fewletter/linux2023/linux-5.18/rootfs
+v2PATH ?= /home/fewletter/linux2023/linux-6.3/rootfs
+VERSION ?=
+ifeq ($(VERSION), 5.18.0)
+ KDIR ?= $(v1PATH)/lib/modules/$(VERSION)/build
+else ifeq ($(VERSION), 6.3.0)
+ KDIR ?= $(v2PATH)/lib/modules/$(VERSION)/build
+else
+ KDIR ?= /lib/modules/$(shell uname -r)/build
+endif
```
Take `ARCH=um` into account as well:
```diff
+ifdef ARCH
+ ARCHARG = ARCH=$(ARCH)
+endif
...
- make -C $(KDIR) M=$(PWD) modules
+ make -C $(KDIR) M=$(PWD) modules $(ARCHARG)
...
- make -C $(KDIR) M=$(PWD) clean
+ make -C $(KDIR) M=$(PWD) clean $(ARCHARG)
```
With everything in place, try building simplefs against a different kernel version and loading the module in UML:
```shell
$ make all VERSION=5.18.0 ARCH=um
cc -std=gnu99 -Wall -o mkfs.simplefs mkfs.c
make -C /home/fewletter/linux2023/linux-5.18/rootfs/lib/modules/5.18.0/build M=/home/fewletter/linux2023/simplefs modules ARCH=um
make[1]: 進入目錄「/home/fewletter/linux2023/linux-5.18」
CC [M] /home/fewletter/linux2023/simplefs/fs.o
CC [M] /home/fewletter/linux2023/simplefs/super.o
CC [M] /home/fewletter/linux2023/simplefs/inode.o
CC [M] /home/fewletter/linux2023/simplefs/file.o
CC [M] /home/fewletter/linux2023/simplefs/dir.o
CC [M] /home/fewletter/linux2023/simplefs/extent.o
LD [M] /home/fewletter/linux2023/simplefs/simplefs.o
MODPOST /home/fewletter/linux2023/simplefs/Module.symvers
CC [M] /home/fewletter/linux2023/simplefs/simplefs.mod.o
LD [M] /home/fewletter/linux2023/simplefs/simplefs.ko
make[1]: 離開目錄「/home/fewletter/linux2023/linux-5.18」
$ ./UML.sh
UML:/simplefs # insmod simplefs.ko
UML:/simplefs # lsmod
Module Size Used by Tainted: G
simplefs 17205 0
```
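#### Tracing Linux kernel code with GDB
Regardless of the kernel version, the GDB scripts must first be built inside the kernel source tree, and the `.config` file must be adjusted: set `CONFIG_GDB_SCRIPTS=y` (which is what the commands below do), comment out `CONFIG_DEBUG_INFO_NONE`, and add `CONFIG_DEBUG_INFO=y`.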
```shell
$ echo "CONFIG_GDB_SCRIPTS=y" > .config-fragment
$ ARCH=um scripts/kconfig/merge_config.sh .config .config-fragment
$ make ARCH=um scripts_gdb
```
On Linux v5.18, however, the `.config` changes never took effect as expected. Even with the options above set, gdb kept reporting that no debug symbols could be found.
```shell
Reading symbols from vmlinux...
(No debugging symbols found)
Undefined command: "lx-version". Try "help".
```
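I therefore switched to Linux v6.3 for tracing with GDB. The `.config` file is modified first here as well, and on v6.3 only `CONFIG_GDB_SCRIPTS=y` needed to change. After building the GDB scripts, start gdb with the following command.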
```shell
fewletter@fewletter-Veriton-M4665G:~/linux2023/linux-6.3$ gdb -ex "add-auto-load-safe-path scripts/gdb/vmlinux-gdb.py" \
-ex "file vmlinux" \
-ex "lx-version" -q
Reading symbols from vmlinux...
Signal Stop Print Pass to program Description
SIGSEGV No No Yes Segmentation fault
Signal Stop Print Pass to program Description
SIGUSR1 Yes Yes No User defined signal 1
Linux version 6.3.0 (fewletter@fewletter-Veriton-M4665G) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #1 Sat Jun 17 21:18:30 CST 2023
(gdb)
```
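Starting UML inside gdb at this point did not produce a UML shell. As the log below shows, the kernel panicked (`VFS: Unable to mount root fs`) and control returned to the gdb prompt, which is not the expected behavior.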
```
(gdb) r
Starting program: /home/fewletter/linux2023/linux-6.3/vmlinux umid=uml0 root=/dev/root rootfstype=hostfs rootflags=/home/fewletter/linux2023/linux-6.3/rootfs/simplefs/rootfs rw mem=64M init=/init.sh quiet
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 66210]
[Detaching after fork from child process 66211]
[Detaching after fork from child process 66212]
[Detaching after fork from child process 66213]
warning: Corrupted shared library list: 0x605e4da0 != 0x7ffff7ffd9e8
[New Thread 0x7ffff7d8db80 (LWP 66221)]
[New Thread 0x7ffff7d8db80 (LWP 66222)]
umid "uml0" is already in use by pid 50653
Failed to initialize umid "uml0", trying with a random umid
Failed to initialize ubd device 0 :Couldn't determine size of device's file
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
CPU: 0 PID: 1 Comm: swapper Not tainted 6.3.0 #1
Stack:
6044dcbf 00000000 64803cf0 60034159
6044dcbf 00000000 63f42930 60a2a000
64803d20 60390b95 60390b47 60389d00
Call Trace:
[<60389d5e>] ? _printk+0x0/0x98
[<6002218b>] show_stack+0x141/0x150
[<60034159>] ? um_set_signals+0x0/0x43
...
[<60020e23>] new_thread_handler+0x85/0xb6
Thread 2 received signal SIGTERM, Terminated.
[Switching to Thread 0x7ffff7d8db80 (LWP 66221)]
0x00007ffff7ea1967 in __GI___poll (fds=fds@entry=0x605c9310 <kernel_pollfd>, nfds=nfds@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29 ../sysdeps/unix/sysv/linux/poll.c: 沒有此一檔案或目錄.
(gdb)
```
Next, following the steps in [建構 User-Mode Linux 的實驗環境-搭配GDB進行核心追蹤和分析](https://hackmd.io/@sysprog/user-mode-linux-env#%E6%90%AD%E9%85%8D-GDB-%E9%80%B2%E8%A1%8C%E6%A0%B8%E5%BF%83%E8%BF%BD%E8%B9%A4%E5%92%8C%E5%88%86%E6%9E%90) (kernel tracing and analysis with GDB), prepare a `gdbinit` file and change the gdb start command to `gdb -q -x gdbinit`.
```
python gdb.COMPLETE_EXPRESSION = gdb.COMPLETE_SYMBOL
add-auto-load-safe-path scripts/gdb/vmlinux-gdb.py
file vmlinux
lx-version
set args umid=uml0 root=/dev/root rootfstype=hostfs rootflags=FULLPATH/rootfs rw mem=64M init=/init.sh quiet
handle SIGSEGV nostop noprint
handle SIGUSR1 nopass stop print
```
The UML prompt finally appears.
```shell
$ gdb -q -x gdbinit
Linux version 6.3.0 (fewletter@fewletter-Veriton-M4665G) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #1 Sat Jun 17 21:18:30 CST 2023
(gdb) run
Starting program: /home/fewletter/linux2023/linux-6.3/vmlinux umid=uml0 root=/dev/root rootfstype=hostfs rootflags=/home/fewletter/linux2023/linux-6.3/rootfs rw mem=64M init=/init.sh quiet
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 66637]
[Detaching after fork from child process 66638]
[Detaching after fork from child process 66639]
[Detaching after fork from child process 66640]
[New Thread 0x7ffff7d8db80 (LWP 66643)]
[New Thread 0x7ffff7d8db80 (LWP 66644)]
Failed to initialize ubd device 0 :Couldn't determine size of device's file
[Detaching after fork from child process 66645]
UML:/ #
```
#### Trying to reproduce the `ls -a` bug
[2022 report: Make and Mount a simplefs Filesystem](https://hackmd.io/@freshLiver/linux-vfs-main/%2F9Qym3jjIQIqmGZ2GZpbzTg#Make-and-Mount-a-simplefs-Filesystem) mentions that the `ls -a` bug shows up on the host side. The steps to reproduce it are roughly:
* Build the Linux v6.3.0 source
* Build simplefs against Linux v6.3.0, in the same way as for [Linux v5.18](https://hackmd.io/ywq9yWYcSSa1D7MDh1yKmw?both#%E6%BA%96%E5%82%99%E6%89%80%E9%9C%80%E7%89%88%E6%9C%AC%E7%9A%84%E6%A0%B8%E5%BF%83%E6%A8%A1%E7%B5%84)
* Start gdb and, inside UML, mount the file system on the test directory
```shell
UML:/simplefs # insmod simplefs.ko
UML:/simplefs # mount -t simplefs -o loop test.img /test/
UML:/simplefs # df -Th
Filesystem Type Size Used Available Use% Mounted on
root hostfs 195.8G 53.6G 132.2G 29% /
devtmpfs devtmpfs 28.5M 0 28.5M 0% /dev
/dev/loop0 simplefs 200.0M 3.6M 196.4M 2% /test
```
* Check whether `ls -a` on the host side (here, the hostfs-backed root directory) lists `.` and `..`
```
UML:/simplefs # cd ..
UML:/ # ls -a
. opt
.. proc
.PKGINFO root
.SIGN.RSA.alpine-devel@lists.alpinelinux.org-4a6a0840.rsa.pub run
.ash_history sbin
bin script
dev simplefs
etc srv
home sys
init.sh test
lib tmp
media usr
mnt var
UML:/ # cd /test
UML:/test # ls -a
. ..
```
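These results show that after mounting the file system, `ls -a` lists `.` and `..` both on the host side and inside the test directory, so the bug may no longer exist.
## TODO: Improve the description of VFS and simplefs
Extend the [2022 report](https://hackmd.io/@freshLiver/linux-vfs-main/) and improve its description of VFS and the simplefs implementation (HackMD's [book mode](https://hackmd.io/s/how-to-create-book-tw) can be used to organize the material).
### Extending [VFS : Registering and Mounting a Filesystem : Mounting](https://hackmd.io/@freshLiver/linux-vfs-main/%2FaDeWUhTDSMeU6a_HCt3tBg#Mounting)
* Initializing the superblock through `fill_super`
The mount helpers take a function pointer parameter, `fill_super`, so that each file system can supply its own initialization routine. Once the superblock has been found, it is initialized through this function.
* mount_bdev()
Mounts a file system that resides on a block device, such as a hard disk.
* mount_single()
Shares a single file system instance across all mounts.
* mount_nodev()
Mounts a file system that is not backed by a physical device.
The `fill_super()` of each file system differs slightly, but its main job is always to initialize the superblock. Take the [ramfs file system](https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html) as an example: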
```c
static int ramfs_fill_super(struct super_block *sb, void *data, int silent)
{
struct ramfs_fs_info *fsi;
struct inode *inode;
int err;
save_mount_options(sb, data);
fsi = kzalloc(sizeof(struct ramfs_fs_info), GFP_KERNEL);
sb->s_fs_info = fsi;
if (!fsi)
return -ENOMEM;
err = ramfs_parse_options(data, &fsi->mount_opts);
if (err)
return err;
sb->s_maxbytes = MAX_LFS_FILESIZE;
sb->s_blocksize = PAGE_SIZE;
sb->s_blocksize_bits = PAGE_SHIFT;
sb->s_magic = RAMFS_MAGIC;
sb->s_op = &ramfs_ops;
sb->s_time_gran = 1;
inode = ramfs_get_inode(sb, NULL, S_IFDIR | fsi->mount_opts.mode, 0);
sb->s_root = d_make_root(inode);
if (!sb->s_root)
return -ENOMEM;
return 0;
}
```
### Extending [simplefs : Registering and Mounting](https://hackmd.io/@freshLiver/linux-vfs-main/%2FxuiLSFn8Sji_EkkoCsSpTw#Registering-and-Mounting)
#### Mounting
In Linux, the mount helper depends on the kind of device being mounted, as with the three functions described in [VFS](https://hackmd.io/@freshLiver/linux-vfs-main/%2FaDeWUhTDSMeU6a_HCt3tBg#Mounting). simplefs uses `mount_bdev()` as its mount function and `simplefs_fill_super` to initialize the superblock.
```c
struct dentry *simplefs_mount(struct file_system_type *fs_type,
int flags,
const char *dev_name,
void *data)
{
struct dentry *dentry =
mount_bdev(fs_type, flags, dev_name, data, simplefs_fill_super);
if (IS_ERR(dentry))
pr_err("'%s' mount failure\n", dev_name);
else
pr_info("'%s' mount success\n", dev_name);
return dentry;
}
```
### Extending [VFS : Registering and Mounting a Filesystem : Umount](https://hackmd.io/@freshLiver/linux-vfs-main/%2FaDeWUhTDSMeU6a_HCt3tBg#Unmount)
#### Unmounting
Unmounting a file system tears down the superblock information. It mainly relies on `kill_block_super` from [linux/fs/super.c](https://github.com/torvalds/linux/blob/master/fs/super.c).
```c
void simplefs_kill_sb(struct super_block *sb)
{
kill_block_super(sb);
pr_info("unmounted disk\n");
}
```
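### Extending [Simplefs : File page cache and read/write block on disk](https://hackmd.io/xuiLSFn8Sji_EkkoCsSpTw?view#File-page-cache-and-readwrite-blocks-on-disk)
simplefs supports both reading and writing through the page cache and writing file system blocks out to disk. A block may hold the superblock, inodes, or bitmaps, and as [mentioned earlier in this chapter](https://hackmd.io/xuiLSFn8Sji_EkkoCsSpTw?both#Block), a block in simplefs is 4 KiB.
#### The `extent` structure
Before looking at the methods, consider the `extent` structure first. It exists only in relatively recent file systems and is meant to address the handling of large files. A file system such as [bfs](https://github.com/torvalds/linux/blob/master/fs/bfs/file.c), for example, operates on individual blocks directly, as in the following code: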
```c
static int bfs_get_block(struct inode *inode, sector_t block,
struct buffer_head *bh_result, int create)
{
unsigned long phys;
int err;
struct super_block *sb = inode->i_sb;
struct bfs_sb_info *info = BFS_SB(sb);
struct bfs_inode_info *bi = BFS_I(inode);
phys = bi->i_sblock + block;
if (!create) {
...
```
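What is the problem with this approach? For a large file, say one over 10 MiB, handling it one 4 KiB block at a time inevitably costs a lot of time. With `extent`, 8 blocks are allocated at a time, for writes as well as reads, which reduces the cost of handling every block individually.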
```
inode
+-----------------------+
| i_mode = IFDIR | 0644 | block 93
| ei_block = 93 ----|------> +----------------+
| i_size = 10 KiB | 0 | ee_block = 0 |
| i_blocks = 25 | | ee_len = 8 | extent 94
+-----------------------+ | ee_start = 94 |---> +--------+
|----------------| | |
1 | ee_block = 8 | +--------+
| ee_len = 8 | extent 99
| ee_start = 99 |---> +--------+
|----------------| | |
2 | ee_block = 16 | +--------+
| ee_len = 8 | extent 66
| ee_start = 66 |---> +--------+
|----------------| | |
| ... | +--------+
|----------------|
341 | ee_block = 0 |
| ee_len = 0 |
| ee_start = 0 |
+----------------+
```
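The saving can be estimated with simple arithmetic. The sketch below assumes the 4 KiB block size and 8 blocks per extent mentioned above; the `demo_*` names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BLOCK_SIZE 4096u     /* 4 KiB per block, as stated above */
#define DEMO_BLOCKS_PER_EXTENT 8u /* blocks allocated per extent      */

/* Integer division rounding up; b must be non-zero. */
uint64_t demo_div_round_up(uint64_t a, uint64_t b)
{
    assert(b != 0);
    return (a + b - 1) / b;
}

/* Number of blocks needed to hold file_size bytes. */
uint64_t demo_blocks_for(uint64_t file_size)
{
    return demo_div_round_up(file_size, DEMO_BLOCK_SIZE);
}

/* Number of 8-block extents needed to hold file_size bytes. */
uint64_t demo_extents_for(uint64_t file_size)
{
    return demo_div_round_up(demo_blocks_for(file_size),
                             DEMO_BLOCKS_PER_EXTENT);
}
```

For a 10 MiB file this gives 2560 blocks but only 320 extents, so the number of allocation units to track drops by a factor of eight.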
#### Get blocks from file system
How is file data mapped to the blocks stored on disk? simplefs records file system state in structures such as the superblock and inode, so it starts with the following two macros to obtain the superblock and inode information:
```c
#define SIMPLEFS_SB(sb) (sb->s_fs_info)
#define SIMPLEFS_INODE(inode) \
(container_of(inode, struct simplefs_inode_info, vfs_inode))
```
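The `container_of` pattern used by `SIMPLEFS_INODE` can be tried out in user space. This is a simplified sketch: the macro below is a portable approximation of the kernel's version (without its type checking), and the `demo_*` structures are hypothetical stand-ins for `struct inode` and `struct simplefs_inode_info`.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of: step back from a member to its container. */
#define container_of(ptr, type, member) \
    ((type *) ((char *) (ptr) - offsetof(type, member)))

/* Stand-in for the generic VFS inode. */
struct demo_inode {
    unsigned long i_ino;
};

/* Stand-in for the file-system-specific inode that embeds it. */
struct demo_inode_info {
    unsigned int ei_block; /* block holding the extent index */
    struct demo_inode vfs_inode;
};

/* Same idea as SIMPLEFS_INODE(): recover the private data. */
struct demo_inode_info *DEMO_INODE(struct demo_inode *inode)
{
    assert(inode != NULL);
    return container_of(inode, struct demo_inode_info, vfs_inode);
}
```

The VFS only ever hands the file system a pointer to the embedded generic inode; subtracting the member offset is what lets the file system reach its own fields such as `ei_block`.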
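To know which region of the disk a file maps to, simplefs uses `struct simplefs_extent`, which records the first logical block an extent covers, its length, and its first physical block. `sb_bread` reads the inode's extent index block (`ci->ei_block`) from the device into a `buffer_head`. Once `index` is available, `simplefs_ext_search` uses `iblock` to find the extent, and thus the range of disk block numbers, that holds the requested part of the file.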
```c
static int simplefs_file_get_block(struct inode *inode,
sector_t iblock,
struct buffer_head *bh_result,
int create)
{
struct super_block *sb = inode->i_sb;
struct simplefs_sb_info *sbi = SIMPLEFS_SB(sb);
struct simplefs_inode_info *ci = SIMPLEFS_INODE(inode);
...
bh_index = sb_bread(sb, ci->ei_block);
if (!bh_index)
return -EIO;
index = (struct simplefs_file_ei_block *) bh_index->b_data;
extent = simplefs_ext_search(index, iblock);
...
}
```
Finally, `get_free_blocks` looks for free space on the disk, and `map_bh` maps the `buffer_head` to the resulting block on disk.
```c
...
if (index->extents[extent].ee_start == 0) {
if (!create)
return 0;
bno = get_free_blocks(sbi, 8);
if (!bno) {
ret = -ENOSPC;
goto brelse_index;
}
index->extents[extent].ee_start = bno;
index->extents[extent].ee_len = 8;
index->extents[extent].ee_block =
extent ? index->extents[extent - 1].ee_block +
index->extents[extent - 1].ee_len
: 0;
alloc = true;
} else {
bno = index->extents[extent].ee_start + iblock -
index->extents[extent].ee_block;
}
/* Map the physical block to the given buffer_head */
map_bh(bh_result, sb, bno);
...
```
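The logical-to-physical translation in the `else` branch can be sketched in user space as follows. This is an illustration under assumptions, not the simplefs implementation: `struct demo_extent` only borrows the field names from the diagram, and the linear search in `demo_map_block()` stands in for `simplefs_ext_search`, whose real implementation is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field names follow the diagram above; the struct is a stand-in. */
struct demo_extent {
    uint32_t ee_block; /* first logical block covered by the extent */
    uint32_t ee_len;   /* number of blocks in the extent            */
    uint32_t ee_start; /* first physical block of the extent        */
};

/*
 * Translate logical block iblock into a physical block number.
 * Returns -1 when no allocated extent covers iblock.
 */
int64_t demo_map_block(const struct demo_extent *ext, size_t n,
                       uint32_t iblock)
{
    assert(ext != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ext[i].ee_start == 0)
            break; /* unused slot: nothing allocated from here on */
        if (iblock >= ext[i].ee_block &&
            iblock - ext[i].ee_block < ext[i].ee_len)
            return (int64_t) ext[i].ee_start + (iblock - ext[i].ee_block);
    }
    return -1;
}
```

The return expression is the same formula as in the code above: the physical start of the extent plus the offset of `iblock` within it.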
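## TODO: Improve the page cache
See [WIP on page cache hooks for disaggregated fs](https://github.com/rstutsman/simplefs/commit/c78f20aee3672a03d3a9a33fb30465ee2ca6d88c)
### What is the page cache?
* The page cache exists to reduce how often data has to be read from storage. For a file system that lives on a disk, reading from the disk on every file access would take a long time, whereas pages are kept in memory and can be read much faster.
* For writes, the page cache first stores the data in memory and marks the page as dirty, meaning that the data has been modified but not yet synchronized to disk. The file system later writes these pages back to disk to keep the data consistent.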
```graphviz
digraph {
node [shape=box];
Data [label="Data"];
PageCache [label="Page Cache"];
Dirtypage [label="Marked as dirty page", color="white"];
Filesystem [label="Filesystem"];
StorageDevice [label="Storage Device"];
Data -> PageCache [label=" store"]
PageCache -> Dirtypage;
Dirtypage -> Filesystem [label=" write"];
Filesystem -> StorageDevice [label=" sync"];
}
```
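The write path in the diagram can be modeled with a toy write-back cache. This is a deliberately simplified sketch, assuming a single 4 KiB page and a plain byte array as the "disk"; the `demo_*` names are hypothetical and unrelated to the kernel's actual data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u

/* One cached page plus its dirty flag. */
struct demo_cached_page {
    unsigned char data[DEMO_PAGE_SIZE];
    bool dirty;
};

/* A write only touches the in-memory copy and marks it dirty. */
void demo_cache_write(struct demo_cached_page *pg, size_t off,
                      const void *src, size_t len)
{
    assert(off <= DEMO_PAGE_SIZE && len <= DEMO_PAGE_SIZE - off);
    memcpy(pg->data + off, src, len);
    pg->dirty = true;
}

/* Write-back: copy a dirty page to the backing store, clear the flag.
 * Returns true if anything was written. */
bool demo_cache_sync(struct demo_cached_page *pg, unsigned char *disk)
{
    if (!pg->dirty)
        return false;
    memcpy(disk, pg->data, DEMO_PAGE_SIZE);
    pg->dirty = false;
    return true;
}
```

Between the write and the sync, the cached copy and the backing store disagree; the dirty flag is what records that the page still has to be written back.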
### How does the page cache work within a file system?
`struct address_space` is defined in [linux/fs.h](https://github.com/torvalds/linux/blob/master/include/linux/fs.h#L443), and `struct page` in [linux/mm_types.h](https://github.com/torvalds/linux/blob/master/include/linux/mm_types.h#LL74C3-L74C3). `struct page` describes how data is represented in memory, and a page is also the smallest unit of memory management. The relationship between `page` and `address_space` is shown below.
```c
struct page {
...
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
union {
pgoff_t index; /* Our offset within mapping. */
unsigned long share; /* share count for fsdax */
};
...
}
```
`address_space *mapping` points to the address space the `page` belongs to, and `index` gives the page's offset within that mapping.
![](https://hackmd.io/_uploads/BJnrxeXDn.png)
* Image source: [Memory Mapping](https://linux-kernel-labs.github.io/refs/heads/master/labs/memory_mapping.html)
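How a byte position in a file relates to `index` can be illustrated with a little arithmetic. The sketch below assumes 4 KiB pages (a shift of 12, which is common but architecture dependent); the `demo_*` helpers are hypothetical names for this example only.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12u /* assumed: 4 KiB pages */
#define DEMO_PAGE_SIZE (UINT64_C(1) << DEMO_PAGE_SHIFT)

/* Index of the page within the mapping that holds byte pos. */
uint64_t demo_page_index(uint64_t pos)
{
    return pos >> DEMO_PAGE_SHIFT;
}

/* Offset of byte pos inside its page. */
uint64_t demo_page_offset(uint64_t pos)
{
    return pos & (DEMO_PAGE_SIZE - 1);
}

/* Inverse: rebuild the byte position from (index, offset). */
uint64_t demo_pos_from(uint64_t index, uint64_t offset)
{
    assert(offset < DEMO_PAGE_SIZE);
    return (index << DEMO_PAGE_SHIFT) | offset;
}
```

For example, byte 10000 of a file lives in the page with index 2, at offset 1808 within that page.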
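`struct address_space` is the key structure through which a `page` is read from or written to a file. The operations themselves are provided by another structure, `address_space_operations *a_ops`, which supplies the methods for writing dirty pages back to disk and for reading a `page` in from its location.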
```c
struct address_space {
struct inode *host;
struct xarray i_pages;
...
const struct address_space_operations *a_ops;
...
}
struct address_space_operations {
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*read_folio)(struct file *, struct folio *);
/* Write back some dirty pages from this mapping. */
int (*writepages)(struct address_space *, struct writeback_control *);
/* Mark a folio dirty. Return true if this dirtied it */
bool (*dirty_folio)(struct address_space *, struct folio *);
void (*readahead)(struct readahead_control *);
...
}
```
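### Tracing the page cache with GDB
#### Preparing the file system image
First build the file system against the Linux version of the virtual environment, which is 6.3.0 here. Because the module is built for UML, `ARCH=um` must be specified. Then create the file system image for this version with `make test.img`.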
```shell
make all VERSION=6.3.0 ARCH=um
cc -std=gnu99 -Wall -o mkfs.simplefs mkfs.c
make -C /home/fewletter/linux2023/linux-6.3/rootfs/lib/modules/6.3.0/build M=/home/fewletter/linux2023/simplefs modules ARCH=um
make[1]: 進入目錄「/home/fewletter/linux2023/linux-6.3」
CC [M] /home/fewletter/linux2023/simplefs/fs.o
CC [M] /home/fewletter/linux2023/simplefs/super.o
CC [M] /home/fewletter/linux2023/simplefs/inode.o
CC [M] /home/fewletter/linux2023/simplefs/file.o
CC [M] /home/fewletter/linux2023/simplefs/dir.o
CC [M] /home/fewletter/linux2023/simplefs/extent.o
LD [M] /home/fewletter/linux2023/simplefs/simplefs.o
MODPOST /home/fewletter/linux2023/simplefs/Module.symvers
CC [M] /home/fewletter/linux2023/simplefs/simplefs.mod.o
LD [M] /home/fewletter/linux2023/simplefs/simplefs.ko
make[1]: 離開目錄「/home/fewletter/linux2023/linux-6.3」
$ make test.img
dd if=/dev/zero of=test.img bs=1M count=200
輸入 200+0 個紀錄
輸出 200+0 個紀錄
209715200位元組(210 MB,200 MiB)已複製,0.102465 s,2.0 GB/s
./mkfs.simplefs test.img
Superblock: (4096)
magic=0xdeadce
nr_blocks=51200
nr_inodes=51240 (istore=915 blocks)
nr_ifree_blocks=2
nr_bfree_blocks=2
nr_free_inodes=51239
nr_free_blocks=50280
Inode store: wrote 915 blocks
inode size = 72 B
Ifree blocks: wrote 2 blocks
Bfree blocks: wrote 2 blocks
```
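Next, copy the whole simplefs directory (the built module and the file system image) into rootfs (the root file system). Start GDB and type `r` to boot UML, then load the module inside UML and mount the file system on the `test` directory.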
```shell
rootfs$ cp -r /home/fewletter/linux2023/simplefs simplefs/
...
(gdb) r
Starting program: /home/fewletter/linux2023/linux-6.3/vmlinux umid=uml0 root=/dev/root rootfstype=hostfs rootflags=/home/fewletter/linux2023/linux-6.3/rootfs rw mem=64M init=/init.sh quiet
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 9053]
[Detaching after fork from child process 9054]
[Detaching after fork from child process 9055]
[Detaching after fork from child process 9056]
[New Thread 0x7ffff7d8db80 (LWP 9057)]
[New Thread 0x7ffff7d8db80 (LWP 9058)]
Failed to initialize ubd device 0 :Couldn't determine size of device's file
[Detaching after fork from child process 9059]
UML:/ # cd simplefs
UML:/simplefs # insmod simplefs.ko
UML:/simplefs # mount -t simplefs -o loop test.img /test/
UML:/simplefs # df -Th
Filesystem Type Size Used Available Use% Mounted on
root hostfs 195.8G 53.6G 132.2G 29% /
devtmpfs devtmpfs 28.5M 0 28.5M 0% /dev
/dev/loop0 simplefs 200.0M 3.6M 196.4M 2% /test
```
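In another terminal, run `pkill -SIGUSR1 -o vmlinux` to switch from UML back to GDB.
#### Setting breakpoints in GDB and tracing page cache code
First run `lx-symbols` in GDB. Without it GDB has no symbols for the loaded modules and cannot debug them.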
```shell
(gdb) lx-symbols
loading vmlinux
scanning for modules in /home/fewletter/linux2023/linux-6.3
loading @0x64947000: /home/fewletter/linux2023/linux-6.3/drivers/block/loop.ko
loading @0x649ad000: /home/fewletter/linux2023/linux-6.3/rootfs/simplefs/simplefs/simplefs.ko
```
Set breakpoints on `simplefs_file_get_block`, `simplefs_write_begin`, `simplefs_write_end`, `simplefs_readahead`, `simplefs_writepage`, and `simplefs_ext_search`, then observe how they behave when a file is created and when it is modified.
```shell
(gdb) info b
Num Type Disp Enb Address What
1 breakpoint keep y 0x00000000649af54d in simplefs_file_get_block at /home/fewletter/linux2023/simplefs/file.c:21
breakpoint already hit 26 times
2 breakpoint keep y 0x00000000649af468 in simplefs_write_begin at /home/fewletter/linux2023/simplefs/file.c:126
breakpoint already hit 22 times
3 breakpoint keep y 0x00000000649af26f in simplefs_write_end at /home/fewletter/linux2023/simplefs/file.c:167
breakpoint already hit 24 times
4 breakpoint keep y 0x00000000649af50a in simplefs_readahead at /home/fewletter/linux2023/simplefs/file.c:86
5 breakpoint keep y 0x00000000649af52a in simplefs_writepage at /home/fewletter/linux2023/simplefs/file.c:101
breakpoint already hit 13 times
6 breakpoint keep y 0x00000000649af983 in simplefs_ext_search at /home/fewletter/linux2023/simplefs/extent.c:14
breakpoint already hit 8 times
```
**Experiment: creating and modifying a file**
Switch back to UML and create a file on the file system. Creating the file starts the write in `simplefs_write_begin` and finishes in `simplefs_write_end`.
```shell
UML:/ # echo "test1" > test/hello
Thread 1 "vmlinux" hit Breakpoint 2, simplefs_write_begin (file=0x60a3ae00, mapping=0x60a9e400, pos=0, len=6, pagep=0x64897c58, fsdata=0x64897c60)
at /home/fewletter/linux2023/simplefs/file.c:126
126 {
(gdb) c
Continuing.
Thread 1 "vmlinux" hit Breakpoint 1, simplefs_file_get_block (inode=inode@entry=0x60a9e2b8, iblock=iblock@entry=0, bh_result=bh_result@entry=0x60c29d00, create=create@entry=1)
at /home/fewletter/linux2023/simplefs/file.c:21
21 {
(gdb) c
Continuing.
Thread 1 "vmlinux" hit Breakpoint 6, simplefs_ext_search (index=index@entry=0x6075b000, iblock=iblock@entry=0) at /home/fewletter/linux2023/simplefs/extent.c:14
14 {
(gdb) c
Continuing.
Thread 1 "vmlinux" hit Breakpoint 3, simplefs_write_end (file=0x60a3ae00, mapping=0x60a9e400, pos=0, len=6, copied=6, page=0x63f38e50, fsdata=0x0 <loop_exit>)
at /home/fewletter/linux2023/simplefs/file.c:167
167 {
(gdb) c
Continuing.
UML:/ #
```
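When an existing file is modified, however, one more hit appears at the end: `simplefs_writepage`. It shows the modified contents being written back from the page cache to the physical disk.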
```shell
UML:/ # echo "test9" > test/hello
Thread 1 "vmlinux" hit Breakpoint 2, simplefs_write_begin (file=0x60b61500, mapping=0x60a9e400, pos=0, len=6, pagep=0x64897c58, fsdata=0x64897c60)
at /home/fewletter/linux2023/simplefs/file.c:126
126 {
(gdb) c
Continuing.
Thread 1 "vmlinux" hit Breakpoint 1, simplefs_file_get_block (inode=inode@entry=0x60a9e2b8, iblock=iblock@entry=0, bh_result=bh_result@entry=0x60c2bf70, create=create@entry=1)
at /home/fewletter/linux2023/simplefs/file.c:21
21 {
(gdb) c
Continuing.
Thread 1 "vmlinux" hit Breakpoint 6, simplefs_ext_search (index=index@entry=0x6075b000, iblock=iblock@entry=0) at /home/fewletter/linux2023/simplefs/extent.c:14
14 {
(gdb) c
Continuing.
Thread 1 "vmlinux" hit Breakpoint 3, simplefs_write_end (file=0x60b61500, mapping=0x60a9e400, pos=0, len=6, copied=6, page=0x63f38e88, fsdata=0x0 <loop_exit>)
at /home/fewletter/linux2023/simplefs/file.c:167
167 {
(gdb) c
Continuing.
UML:/ #
Thread 1 "vmlinux" hit Breakpoint 5, simplefs_writepage (page=0x63f38e88, wbc=0x6488bce0) at /home/fewletter/linux2023/simplefs/file.c:101
101 {
```
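This simple experiment shows that a file's data is placed in the page cache as soon as the file is created. When the file is modified, the new data is written into the page cache rather than directly to the disk. Only afterwards, once the command has returned to the final `UML:/ # ` prompt above, does the kernel write it back to disk on its own.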
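The writeback step observed above can also be forced rather than waited for. The following is a minimal sketch, assuming a Linux system that exposes `/proc/meminfo`; `dirty_kb` and `target` are hypothetical names introduced here, and the temporary file only stands in for a path such as `/test/hello` on simplefs.

```shell
# Print how much page-cache memory is dirty (not yet written back), in kB.
dirty_kb() {
    awk '/^Dirty:/ { print $2 }' /proc/meminfo
}

# Hypothetical target; inside UML this could be /test/hello on simplefs.
target=${1:-$(mktemp)}

echo "test9" > "$target"                  # the data lands in the page cache first
echo "Dirty before sync: $(dirty_kb) kB"
sync                                      # ask the kernel to write dirty pages back now
echo "Dirty after sync:  $(dirty_kb) kB"
```

With the breakpoints from this section still set, running `sync` inside UML would be expected to reach `simplefs_writepage` immediately instead of at some later point. That expectation has not been verified on simplefs here.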
## TODO: Eliminate memory errors
Use [kmemleak](https://www.kernel.org/doc/html/latest/dev-tools/kmemleak.html) and [kasan](https://www.kernel.org/doc/html/latest/dev-tools/kasan.html) to eliminate memory errors.
### Attempting to produce a kmemleak report
The [kmemleak](https://www.kernel.org/doc/html/latest/dev-tools/kmemleak.html) documentation shows that kmemleak exposes its interface as a file under [debugfs](https://github.com/torvalds/linux/tree/master/fs/debugfs). Enabling it starts with the kernel .config. I first intended to change the .config of the local (host) kernel, but to be safe I decided to change it only for the virtualized environment, which was built by following [Virtualized environment for testing the Linux kernel](https://hackmd.io/@sysprog/linux-virtme#%E6%B8%AC%E8%A9%A6-Linux-%E6%A0%B8%E5%BF%83%E7%9A%84%E8%99%9B%E6%93%AC%E5%8C%96%E7%92%B0%E5%A2%83).
#### Modifying .config
The .config file determines which features the kernel is built with. The kernel tree provides the script `scripts/kconfig/merge_config.sh`, invoked as `scripts/kconfig/merge_config.sh .config .config-fragment`, to merge changes into .config. For this kernel, `CONFIG_DEBUG_INFO` and `CONFIG_DEBUG_KMEMLEAK` need to be enabled.
```shell
$ echo "CONFIG_DEBUG_INFO=y" > .config-fragment
$ scripts/kconfig/merge_config.sh .config .config-fragment
$ echo "CONFIG_DEBUG_KMEMLEAK=y" > .config-fragment
$ scripts/kconfig/merge_config.sh .config .config-fragment
```
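The two merge runs above can be folded into one by putting both options in a single fragment. This is a sketch rather than the exact procedure used in this note: the existence check is an addition so the snippet does nothing harmful outside a kernel tree, and on kernels from 5.18 onward `CONFIG_DEBUG_INFO` is, as far as I know, selected through a DWARF-version choice (for example `CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT`) rather than set directly, so check the final `grep` output.

```shell
# Collect every option in one fragment. The ">" redirections in the original
# steps overwrite the file, which is why each option needed its own merge run.
cat > .config-fragment <<'EOF'
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_KMEMLEAK=y
EOF

# Only merge when run from the top of a kernel tree that already has a .config.
if [ -x scripts/kconfig/merge_config.sh ] && [ -f .config ]; then
    scripts/kconfig/merge_config.sh .config .config-fragment
    # Confirm what actually ended up in the final configuration.
    grep -E '^CONFIG_DEBUG_(INFO|KMEMLEAK)=' .config
fi
```

`merge_config.sh` prints a warning for any requested value that does not survive into the final .config, which is the quickest way to notice an option that a given kernel version no longer accepts in that form.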
:::warning
The .config options differ slightly between Linux versions.
:::
Next, build the kernel:
```shell
$ make ARCH=x86 CROSS_COMPILE=x86_64-linux-gnu- -j$(nproc)
```
When the build succeeds, the following message appears. The number after `#` is the build count.
```
Kernel: arch/x86/boot/bzImage is ready (#1)
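```
#### Building simplefs in the virtual environment
Using QEMU as the virtual environment has one drawback: the module cannot be built directly against this kernel from the host. Running `ls .virtme_mods/lib/modules/` in the directory that holds the virtual environment prints `0.0.0`, which means simplefs has to be built from inside the virtual environment.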
```
fewletter@fewletter-Veriton-M4665G:~/linux2023/linux$ virtme-run --kdir . --mods=auto
./.virtme_mods/lib/modules/0.0.0
...
root@(none):/tmp/simplefs# make
cc -std=gnu99 -Wall -o mkfs.simplefs mkfs.c
make -C /lib/modules/6.1.34/build M=/tmp/simplefs modules
make[1]: Entering directory '/home/fewletter/linux2023/linux'
warning: the compiler differs from the one used to build the kernel
The kernel was built by: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
You are using: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
CC [M] /tmp/simplefs/fs.o
CC [M] /tmp/simplefs/super.o
...
[ 98.786243] Tasks state (memory values in pages):
[ 98.786582] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 98.787200] [ 98] 0 98 5334 777 61440 0 -1000 systemd-udevd
[ 98.787831] [ 157] 0 157 1062 144 49152 0 0 bash
[ 98.788440] [ 178] 0 178 728 68 40960 0 0 make
[ 98.789098] [ 185] 0 185 793 136 49152 0 0 make
[ 98.789761] [ 398] 0 398 817 161 45056 0 0 make
[ 98.790428] [ 422] 0 422 656 30 40960 0 0 sh
[ 98.791054] [ 423] 0 423 948 47 45056 0 0 gcc
[ 98.791680] [ 424] 0 424 20244 10738 200704 0 0 cc1
[ 98.792298] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,task=cc1,pid=424,uid=0
[ 98.793063] Out of memory: Killed process 424 (cc1) total-vm:80976kB, anon-rss:42668kB, file-rss:284kB, shmem-rss:0kB, UID:0 pgtables:196kB oom_score_adj:0
[ 98.798426] cc1 (424) used greatest stack depth: 12664 bytes left
gcc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:250: /tmp/simplefs/super.o] Error 1
make[1]: *** [Makefile:2012: /tmp/simplefs] Error 2
make[1]: Leaving directory '/home/fewletter/linux2023/linux'
make: *** [Makefile:23: all] Error 2
root@(none):/tmp/simplefs#
```
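As shown above, simplefs fails to build. This is unusual, because in the earlier section [Setting up a virtualized test environment](https://hackmd.io/ywq9yWYcSSa1D7MDh1yKmw?both#%E5%BB%BA%E7%AB%8B%E8%99%9B%E6%93%AC%E5%8C%96%E6%B8%AC%E8%A9%A6%E7%92%B0%E5%A2%83) the project built without problems. I tried [debugging the kernel with crash](https://hackmd.io/@sysprog/linux-virtme#%E6%90%AD%E9%85%8D-crash-%E9%80%B2%E8%A1%8C%E6%A0%B8%E5%BF%83%E5%81%B5%E9%8C%AF) to find out what went wrong.
#### Debugging with crash
First, change the command that launches the virtual environment to the one below. `--mods=auto` makes the module build directory under `lib/modules/` match the kernel version of the virtual environment. `--qemu-opts -qmp tcp:localhost:4444,server,nowait` opens a QMP socket that can be reached with telnet, so that guest memory can be dumped to an image file for crash to analyze.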
```shell
$ virtme-run --kdir . --mods=auto --qemu-opts -qmp tcp:localhost:4444,server,nowait
```
In another terminal, connect with telnet:
```shell
$ telnet localhost 4444
```
The following appears:
```shell
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
{"QMP": {"version": {"qemu": {"micro": 1, "minor": 2, "major": 4}, "package": "Debian 1:4.2-3ubuntu6.27"}, "capabilities": ["oob"]}}
```
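Then enter the following two commands. `"file:vmcore.img"` gives the output file name. Naming it as an image file (.img) is a good idea, because crash works on memory image files.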
```shell
{ "execute": "qmp_capabilities" }
{ "execute": "dump-guest-memory", "arguments": {"paging": false, "protocol": "file:vmcore.img"}}
```
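Open one more terminal and run the command below so that crash reads the dump you saved (in this run the dump had been saved as `kmemleak_mempoolsize2000.img` rather than `vmcore.img`).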
```shell
$ crash /home/fewletter/linux2023/linux/vmlinux /home/fewletter/linux2023/linux/kmemleak_mempoolsize2000.img
...
KERNEL: /home/fewletter/linux2023/linux/vmlinux
DUMPFILE: /home/fewletter/linux2023/linux/kmemleak_mempoolsize2000.img
CPUS: 1
DATE: Thu Jun 22 21:57:16 CST 2023
UPTIME: 00:03:16
LOAD AVERAGE: 0.02, 0.02, 0.00
TASKS: 45
NODENAME: (none)
RELEASE: 5.17.15
VERSION: #6 SMP Thu Jun 22 21:52:44 CST 2023
MACHINE: x86_64 (3000 Mhz)
MEMORY: 127.5 MB
PANIC: ""
PID: 0
COMMAND: "swapper/0"
TASK: ffffffff8cc14940 [THREAD_INFO: ffffffff8cc14940]
CPU: 0
STATE: TASK_RUNNING (ACTIVE)
WARNING: panic task not found
```
No kernel panic occurred, but an OOM (out of memory) did, so I used crash to look for what kept simplefs from building.
```
crash> ps
PID PPID CPU TASK ST %MEM VSZ RSS COMM
> 0 0 0 ffffffffb1e14a40 RU 0.0 0 0 [swapper/0]
1 0 0 ffffa0e041358000 IN 0.4 4116 468 virtme-init
...
38 2 0 ffffa0e0420c2080 ID 0.0 0 0 [scsi_tmf_0]
39 2 0 ffffa0e0420c30c0 IN 0.0 0 0 [scsi_eh_1]
40 2 0 ffffa0e0420c4100 ID 0.0 0 0 [scsi_tmf_1]
41 2 0 ffffa0e0420c5140 ID 0.0 0 0 [mld]
42 2 0 ffffa0e0420c6180 ID 0.0 0 0 [ipv6_addrconf]
43 2 0 ffffa0e043430000 IN 0.0 0 0 [kmemleak]
98 1 0 ffffa0e0434b1040 IN 2.4 21336 3172 systemd-udevd
157 1 0 ffffa0e0434930c0 IN 1.0 4248 1304 bash
432 2 0 ffffa0e043498000 ID 0.0 0 0 [kworker/0:1]
451 2 0 ffffa0e0434b2080 ID 0.0 0 0 [kworker/0:2]
452 2 0 ffffa0e0434b5140 ID 0.0 0 0 [kworker/0:0]
```
PID 98 appears in the OOM report above, so I looked at what PID 98 was doing.
```
crash> bt 98
PID: 98 TASK: ffffa0e0434b1040 CPU: 0 COMMAND: "systemd-udevd"
#0 [ffffaef5c01bbd08] __schedule at ffffffffb137f181
#1 [ffffaef5c01bbd90] schedule at ffffffffb137f6d5
#2 [ffffaef5c01bbda8] schedule_hrtimeout_range_clock at ffffffffb1385d42
#3 [ffffaef5c01bbe18] do_epoll_wait at ffffffffb0905c08
#4 [ffffaef5c01bbf00] __x64_sys_epoll_wait at ffffffffb0906f70
#5 [ffffaef5c01bbf38] do_syscall_64 at ffffffffb1377f08
#6 [ffffaef5c01bbf50] entry_SYSCALL_64_after_hwframe at ffffffffb140009b
RIP: 00007fe9f356a42a RSP: 00007ffeebe1dd78 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000559fe09ab8a0 RCX: 00007fe9f356a42a
RDX: 000000000000000a RSI: 0000559fe0bfb4c0 RDI: 0000000000000008
RBP: ffffffffffffffff R8: 000000000000000a R9: 00007ffeebfbb0f0
R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000001
R13: 000000000000000a R14: 0000559fe07352ba R15: 0000559fe09ab8a0
ORIG_RAX: 00000000000000e8 CS: 0033 SS: 002b
crash> gdb list do_syscall_64
22
23 #ifdef CONFIG_HAVE_JUMP_LABEL_HACK
24
25 static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
26 {
27 asm_volatile_goto("1:"
28 "jmp %l[l_yes] # objtool NOPs this \n\t"
29 JUMP_TABLE_ENTRY
30 : : "i" (key), "i" (2 | branch) : : l_yes);
31
crash> gdb list __schedule
6431 * - return from interrupt-handler to user-space
6432 *
6433 * WARNING: must be called with preemption disabled!
6434 */
6435 static void __sched notrace __schedule(unsigned int sched_mode)
6436 {
6437 struct task_struct *prev, *next;
6438 unsigned long *switch_count;
6439 unsigned long prev_state;
6440 struct rq_flags rf;
crash> gdb list schedule
10
11 DECLARE_PER_CPU(struct task_struct *, current_task);
12
13 static __always_inline struct task_struct *get_current(void)
14 {
15 return this_cpu_read_stable(current_task);
16 }
17
18 #define current get_current()
19
crash>
```
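A comment above `__schedule` warns `must be called with preemption disabled`, so I set `CONFIG_PREEMPT_NONE_BUILD=y` and `CONFIG_PREEMPT_NONE=y` in .config and rebuilt the kernel the same way as before.
I re-entered the virtual environment and tried to build simplefs again. It still failed, with the following result.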
```
root@(none):/tmp/simplefs# make
cc -std=gnu99 -Wall -o mkfs.simplefs mkfs.c
make -C /lib/modules/6.1.34/build M=/tmp/simplefs modules
make[1]: Entering directory '/home/fewletter/linux2023/linux'
warning: the compiler differs from the one used to build the kernel
The kernel was built by: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
You are using: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
CC [M] /tmp/simplefs/fs.o
CC [M] /tmp/simplefs/super.o
...
[ 98.786243] Tasks state (memory values in pages):
[ 98.786582] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 98.787200] [ 98] 0 98 5334 777 61440 0 -1000 systemd-udevd
[ 98.787831] [ 157] 0 157 1062 144 49152 0 0 bash
[ 98.788440] [ 178] 0 178 728 68 40960 0 0 make
[ 98.789098] [ 185] 0 185 793 136 49152 0 0 make
[ 98.789761] [ 398] 0 398 817 161 45056 0 0 make
[ 98.790428] [ 422] 0 422 656 30 40960 0 0 sh
[ 98.791054] [ 423] 0 423 948 47 45056 0 0 gcc
[ 98.791680] [ 424] 0 424 20244 10738 200704 0 0 cc1
[ 98.792298] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,task=cc1,pid=424,uid=0
[ 98.793063] Out of memory: Killed process 424 (cc1) total-vm:80976kB, anon-rss:42668kB, file-rss:284kB, shmem-rss:0kB, UID:0 pgtables:196kB oom_score_adj:0
[ 98.798426] cc1 (424) used greatest stack depth: 12664 bytes left
gcc: fatal error: Killed signal terminated program cc1
compilation terminated.
make[2]: *** [scripts/Makefile.build:250: /tmp/simplefs/super.o] Error 1
make[1]: *** [Makefile:2012: /tmp/simplefs] Error 2
make[1]: Leaving directory '/home/fewletter/linux2023/linux'
make: *** [Makefile:23: all] Error 2
```
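Since the result shows that the error is unrelated to preemption, I looked at the crash output again. kmemleak also appears in the process list, so kmemleak may be what drives the virtual environment into OOM while simplefs is being built.
#### Changing the kmemleak config options
Below are the kmemleak-related options in .config. I first changed `CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE` to 200, rebuilt the kernel, and built simplefs inside the virtual environment. It still failed.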
```
CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE=16000
# CONFIG_DEBUG_KMEMLEAK_TEST is not set
CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y
CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y
```
Finally I set `CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y`, which keeps kmemleak disabled when the environment boots, and rebuilt the kernel. The message below shows that this was already my ninth build.
```
Kernel: arch/x86/boot/bzImage is ready (#9)
```
After that, building simplefs inside the environment succeeded.
```shell
root@(none):/tmp/simplefs# make all
cc -std=gnu99 -Wall -o mkfs.simplefs mkfs.c
make -C /lib/modules/5.17.15/build M=/tmp/simplefs modules
make[1]: Entering directory '/home/fewletter/linux2023/linux'
warning: the compiler differs from the one used to build the kernel
The kernel was built by: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
You are using: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
CC [M] /tmp/simplefs/fs.o
CC [M] /tmp/simplefs/super.o
CC [M] /tmp/simplefs/inode.o
CC [M] /tmp/simplefs/file.o
CC [M] /tmp/simplefs/dir.o
CC [M] /tmp/simplefs/extent.o
LD [M] /tmp/simplefs/simplefs.o
MODPOST /tmp/simplefs/Module.symvers
CC [M] /tmp/simplefs/simplefs.mod.o
LD [M] /tmp/simplefs/simplefs.ko
make[1]: Leaving directory '/home/fewletter/linux2023/linux'
```
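At this point, however, the disabled kmemleak could not be turned back on. Note that `kmemleak=on` is a kernel boot parameter: typed at the shell prompt as below, it only sets a shell variable, so kmemleak stays disabled and the write to its debugfs file is rejected.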
```shell
root@(none):/tmp/simplefs# mount -t debugfs nodev /sys/kernel/debug/
mount: /sys/kernel/debug: nodev already mounted or mount point busy.
root@(none):/# mount | grep "debug"
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
root@(none):/tmp/simplefs# kmemleak=on
root@(none):/tmp/simplefs# echo scan=on > /sys/kernel/debug/kmemleak
bash: echo: write error: Operation not permitted
```
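From these failures I conclude that the memory given to the virtual environment is not enough to run kmemleak and build simplefs at the same time, so the virtual environment must be given more memory.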
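One way to act on this conclusion is sketched below; it has not been run as part of this note. Several things are assumptions: `boot_guest` and `kmemleak_report` are hypothetical helper names, 2048 MiB is an arbitrary size, `--kopt` is assumed to be the `virtme-run` option for appending a kernel parameter, and `-m` is QEMU's memory-size option. The `scan` command and the debugfs path come from the kmemleak documentation.

```shell
# Host side, run from the kernel tree: boot with more RAM and with kmemleak
# enabled on the kernel command line (needed when DEFAULT_OFF=y is set).
# --qemu-opts stays last because the rest of the line is handed to QEMU.
boot_guest() {
    virtme-run --kdir . --mods=auto --kopt kmemleak=on --qemu-opts -m 2048
}

# Guest side: trigger a scan and print any suspected leaks.
# $1 is the kmemleak control file (defaults to the usual debugfs path).
kmemleak_report() {
    ctl=${1:-/sys/kernel/debug/kmemleak}
    if [ ! -w "$ctl" ]; then
        echo "kmemleak interface not writable: $ctl" >&2
        return 1
    fi
    echo scan > "$ctl" && cat "$ctl"
}
```

Call `boot_guest` on the host, then `kmemleak_report` inside the guest after loading and exercising `simplefs.ko`. Running `free -m` in the guest first is a quick check that the larger memory size actually took effect.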