# 2019q1 Homework2 (fibdrv)
contributed by < `johnnylord` >
###### tags: `sysprog2019`
## Experiment environment
```shell
$ uname -a
Linux johnnylord 4.18.0-16-generic #17~18.04.1-Ubuntu SMP Tue Feb 12 13:35:51 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
```
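## Self-check list
### What do macros such as `MODULE_LICENSE`, `MODULE_AUTHOR`, `MODULE_DESCRIPTION`, and `MODULE_VERSION` in `fibdrv.c` do, and how does the kernel learn about them? What happens inside the Linux kernel when `insmod` runs? Cite and explain the relevant Linux kernel source
These macros all embed the corresponding information in the compiled `.ko` file. Since they work the same way, this section focuses on `MODULE_AUTHOR`. The sample code specifies the module author as follows: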
```clike
MODULE_AUTHOR("National Cheng Kung University, Taiwan");
```
**[include/linux/module.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/module.h#L205)**
```cpp
/*
* Author(s), use "Name <email>" or just "Name", for multiple
* authors use multiple MODULE_AUTHOR() statements/lines.
*/
#define MODULE_AUTHOR(_author) MODULE_INFO(author, _author)
```
The comment above describes the format of `_author` and says that `MODULE_AUTHOR` should be used once per author when there are several.
Expanding the macro gives `MODULE_INFO(author, "National Cheng Kung University, Taiwan")`.
```cpp
/* Generic info of form tag = "info" */
#define MODULE_INFO(tag, info) __MODULE_INFO(tag, tag, info)
```
Expanding once more gives `__MODULE_INFO(author, author, "National Cheng Kung University, Taiwan")`.
**[include/linux/moduleparam.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/moduleparam.h)**
```cpp
#ifdef MODULE
#define __MODULE_INFO(tag, name, info) \
static const char __UNIQUE_ID(name)[] \
__used __attribute__((section(".modinfo"), unused, aligned(1))) \
= __stringify(tag) "=" info
#else /* !MODULE */
/* This struct is here for syntactic coherency, it is not used */
#define __MODULE_INFO(tag, name, info) \
struct __UNIQUE_ID(name) {}
#endif
```
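Which definition applies depends on whether `MODULE` is defined. `MODULE` is defined when the code is compiled as a loadable kernel module and is not defined when the module is built into the kernel. Continuing the expansion for the loadable case: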
```cpp
static const char __UNIQUE_ID(author)[] \
__used __attribute__((section(".modinfo"), unused, aligned(1))) \
= __stringify(author) "=" "National Cheng Kung University, Taiwan"
```
**[include/linux/compiler-gcc.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/compiler-gcc.h#L208)**
```cpp
#define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
```
Next, expand `__UNIQUE_ID`. `__COUNTER__` is a predefined macro provided by GCC; it starts at 0 and its value increases by one each time it is used.
```cpp
static const char __PASTE(__PASTE(__UNIQUE_ID_, author), __COUNTER__)[] \
__used __attribute__((section(".modinfo"), unused, aligned(1))) \
= __stringify(author) "=" "National Cheng Kung University, Taiwan"
```
**[include/linux/compiler_types.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/compiler_types.h#L53)**
```cpp
/* Indirect macros required for expanded argument pasting, eg. __LINE__. */
#define ___PASTE(a,b) a##b
#define __PASTE(a,b) ___PASTE(a,b)
```
Continuing the expansion (assuming `__COUNTER__` currently expands to `0`):
```cpp
static const char __UNIQUE_ID_author0[] \
__used __attribute__((section(".modinfo"), unused, aligned(1))) \
= __stringify(author) "=" "National Cheng Kung University, Taiwan"
```
**[include/linux/stringify.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/stringify.h#L10)**
```cpp
#define __stringify_1(x...) #x
#define __stringify(x...) __stringify_1(x)
```
The final expansion gives the following result:
```clike
static const char __UNIQUE_ID_author0[] \
__used __attribute__((section(".modinfo"), unused, aligned(1))) \
= "author=National Cheng Kung University, Taiwan"
```
According to the GCC documentation on variable attributes, `section` places the variable in the specified section, here `.modinfo`.
```
$ objdump -s fibdrv.ko
...
Contents of section .modinfo:
0000 76657273 696f6e3d 302e3100 64657363 version=0.1.desc
0010 72697074 696f6e3d 4669626f 6e616363 ription=Fibonacc
0020 6920656e 67696e65 20647269 76657200 i engine driver.
0030 61757468 6f723d4e 6174696f 6e616c20 author=National
0040 4368656e 67204b75 6e672055 6e697665 Cheng Kung Unive
0050 72736974 792c2054 61697761 6e006c69 rsity, Taiwan.li
0060 63656e73 653d4475 616c204d 49542f47 cense=Dual MIT/G
0070 504c0000 00000000 73726376 65727369 PL......srcversi
0080 6f6e3d34 42373436 37453631 43414238 on=4B7467E61CAB8
0090 32354539 35364446 38330000 00000000 25E956DF83......
00a0 64657065 6e64733d 00726574 706f6c69 depends=.retpoli
00b0 6e653d59 006e616d 653d6669 62647276 ne=Y.name=fibdrv
00c0 00766572 6d616769 633d342e 31382e30 .vermagic=4.18.0
00d0 2d31362d 67656e65 72696320 534d5020 -16-generic SMP
00e0 6d6f645f 756e6c6f 61642000 mod_unload .
...
```
The output above shows that the author information has been written into the `.modinfo` section.
Next, look at what the Linux kernel does when `insmod` is executed.
`strace` shows which system calls are made while running `insmod fibdrv.ko`.
```shell=
$ sudo strace insmod fibdrv.ko
execve("/sbin/insmod", ["insmod", "fibdrv.ko"], 0x7ffeab43f308 /* 25 vars */) = 0
brk(NULL) = 0x561084511000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=83948, ...}) = 0
mmap(NULL, 83948, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0621290000
close(3) = 0
...
close(3) = 0
getcwd("/home/johnnylord/fibdrv", 4096) = 24
stat("/home/johnnylord/fibdrv/fibdrv.ko", {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0
openat(AT_FDCWD, "/home/johnnylord/fibdrv/fibdrv.ko", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0
mmap(NULL, 8288, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f06212a2000
finit_module(3, "", 0) = 0
munmap(0x7f06212a2000, 8288) = 0
close(3) = 0
exit_group(0) = ?
+++ exited with 0 +++
```
Line 17 of the trace shows a call to `finit_module`. Next, look at how the Linux kernel defines `finit_module`.
**[kernel/module.c](https://elixir.bootlin.com/linux/v4.18/source/kernel/module.c)**
```clike=
SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
{
    // ...
    return load_module(&info, uargs, flags);
}
```
Line 4 shows that it calls the `load_module` function.
```cpp
/* Allocate and load the module: note that size of section 0 is always
zero, and we rely on this for optional sections. */
static int load_module(struct load_info *info, const char __user *uargs,
int flags)
{
...
}
```
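As the comment says, `load_module` is roughly where the kernel allocates memory for the module and loads the module's data.
### Why does the function registered with `module_init` get executed when a kernel module is loaded with `insmod`? What does the Linux kernel do?
First, look at the source code: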
```clike
static int __init init_fib_dev(void)
{
    // ...
}
static void __exit exit_fib_dev(void)
{
    // ...
}
module_init(init_fib_dev);
module_exit(exit_fib_dev);
```
**[include/linux/module.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/module.h#L129)**
```clike=
/* Each module must use one module_init(). */
#define module_init(initfn) \
static inline initcall_t __maybe_unused __inittest(void) \
{ return initfn; } \
int init_module(void) __attribute__((alias(#initfn)));
```
Line 5 shows that, after compilation, GCC makes `int init_module(void)` an alias of `initfn`.
The following experiment checks whether the alias really works:
```clike
#include <stdio.h>
int __func() {
    printf("In __func()\n");
    return 0;
}
int func() __attribute__((alias("__func")));
int main() {
    func();
    return 0;
}
```
```shell
$ gcc -o test test.c
$ ./test
In __func()
```
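Therefore, calling `init_module()` is equivalent to calling the user-defined function `initfn`.
So why does `init_module` get called?
This goes back to the previous question: the system calls `finit_module`, which in turn calls `load_module`.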
**[kernel/module.c](https://elixir.bootlin.com/linux/v4.18/source/kernel/module.c#L3785)**
```clike=
/* Allocate and load the module: note that size of section 0 is always
   zero, and we rely on this for optional sections. */
static int load_module(struct load_info *info, const char __user *uargs,
                       int flags)
{
    struct module *mod;
    //...
    /* Figure out module layout, and allocate all the memory. */
    mod = layout_and_allocate(info, flags);
    if (IS_ERR(mod)) {
        err = PTR_ERR(mod);
        goto free_copy;
    }
    // ...
    return do_init_module(mod);
    // ...
}
```
Line 9 shows that, before `do_init_module`, `layout_and_allocate` works out the module's layout and allocates memory for it. Finally, line 15 initializes the module through `do_init_module`.
**[kernel/module.c](https://elixir.bootlin.com/linux/v4.18/source/kernel/module.c#L3456)**
```clike=
/*
* This is where the real work happens.
*
* Keep it uninlined to provide a reliable breakpoint target, e.g. for the gdb
* helper command 'lx-symbols'.
*/
static noinline int do_init_module(struct module *mod)
{
    // ...
    /* Start the module */
    if (mod->init != NULL)
        ret = do_one_initcall(mod->init);
    // ...
}
```
**[init/main.c](https://elixir.bootlin.com/linux/v4.18/source/init/main.c#L874)**
```clike=
int __init_or_module do_one_initcall(initcall_t fn)
{
    // ...
    do_trace_initcall_start(fn);
    ret = fn();
    do_trace_initcall_finish(fn, ret);
    // ...
}
```
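Here `fn` is the `mod->init` that was passed in. The module's init function is executed by the `ret = fn();` statement on line 5 of the code above, which performs the actual initialization of the module.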
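### Run `$ readelf -a fibdrv.ko`, observe how its output relates to the source code and to `modinfo`, and, together with the questions above, explain how an ELF executable file such as `fibdrv.ko` is "inserted" into the Linux kernel
Executable and Linking Format, ELF for short, is a file format that can describe an executable binary file or an object file. Since `fibdrv.ko` in this experiment is not an executable, this section looks at ELF from the object-file point of view. An ELF file can be roughly divided into three parts: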
* **ELF header**
Stores information about this object file.
```shell
$ readelf -h fibdrv.ko
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: REL (Relocatable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x0
Start of program headers: 0 (bytes into file)
Start of section headers: 6688 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 0 (bytes)
Number of program headers: 0
Size of section headers: 64 (bytes)
Number of section headers: 25
Section header string table index: 24
```
* **Section(s)**
Some sections are predefined by the system, such as `.text`, `.data`, and `.bss`, while others are user-defined; in this example, `.modinfo`.
* **Section Header(s)**
Hold the metadata for each section, for example the section's size.
```shell
$ readelf -S fibdrv.ko
There are 25 section headers, starting at offset 0x1a20:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .note.gnu.build-i NOTE 0000000000000000 00000040
0000000000000024 0000000000000000 A 0 0 4
[ 2] .text PROGBITS 0000000000000000 00000070
000000000000015c 0000000000000000 AX 0 0 16
[ 3] .rela.text RELA 0000000000000000 00001218
0000000000000120 0000000000000018 I 22 2 8
[ 4] .init.text PROGBITS 0000000000000000 000001cc
0000000000000153 0000000000000000 AX 0 0 1
[ 5] .rela.init.text RELA 0000000000000000 00001338
00000000000003a8 0000000000000018 I 22 4 8
[ 6] .exit.text PROGBITS 0000000000000000 0000031f
0000000000000040 0000000000000000 AX 0 0 1
[ 7] .rela.exit.text RELA 0000000000000000 000016e0
00000000000000d8 0000000000000018 I 22 6 8
[ 8] __mcount_loc PROGBITS 0000000000000000 0000035f
0000000000000030 0000000000000000 A 0 0 1
[ 9] .rela__mcount_loc RELA 0000000000000000 000017b8
0000000000000090 0000000000000018 I 22 8 8
[10] .rodata.str1.1 PROGBITS 0000000000000000 0000038f
000000000000006e 0000000000000001 AMS 0 0 1
[11] .rodata.str1.8 PROGBITS 0000000000000000 00000400
0000000000000058 0000000000000001 AMS 0 0 8
[12] .rodata PROGBITS 0000000000000000 00000460
0000000000000100 0000000000000000 A 0 0 32
[13] .rela.rodata RELA 0000000000000000 00001848
0000000000000090 0000000000000018 I 22 12 8
[14] .modinfo PROGBITS 0000000000000000 00000560
00000000000000ec 0000000000000000 A 0 0 8
[15] .data PROGBITS 0000000000000000 00000660
0000000000000020 0000000000000000 WA 0 0 32
[16] .rela.data RELA 0000000000000000 000018d8
...
```
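Now consider how the `modinfo` program relates to `fibdrv.ko`. From the first question, the `MODULE_XXX` macros put extra information about the module into the `.modinfo` section of `fibdrv.ko`, so `modinfo` presumably reads that section of `fibdrv.ko` and displays its contents. The following is the description of `modinfo` from `man modinfo`.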
> DESCRIPTION
> ==modinfo extracts information from the Linux Kernel modules== given on the command line. If the module name is not a filename, then the /lib/modules/version directory is
> searched, as is also done by modprobe(8) when loading kernel modules.
> ==modinfo by default lists each attribute of the module in form fieldname : value, for
> easy reading.== The filename is listed the same way (although it's not really an
> attribute).
**Explain how an ELF executable file such as `fibdrv.ko` is "inserted" into the Linux kernel.** In fact, `fibdrv.ko` is not an executable; it is only an object file. If `fibdrv.ko` were an executable, it would contain program headers, but the ELF header shows that it has none.
```shell
$ readelf -h fibdrv.ko
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: REL (Relocatable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x0
Start of program headers: 0 (bytes into file)
Start of section headers: 6688 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 0 (bytes)
Number of program headers: 0
Size of section headers: 64 (bytes)
Number of section headers: 25
Section header string table index: 24
```
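Therefore the `insmod` program (an executable) is needed to insert `fibdrv.ko` into the kernel. A kernel module runs in kernel space, whereas `insmod fibdrv.ko` is a user-space process, so `insmod` presumably has to make a system call through which the kernel module's data is copied from user space into kernel space.
Recall that `insmod` causes the kernel to execute `finit_module`.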
:::warning
When listing Linux kernel source code, state the version and add the corresponding LXR hyperlink
:notes: jserv
:::
**[linux/kernel/module.c](https://elixir.bootlin.com/linux/v4.18/source/kernel/module.c)**
```clike=
SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
{
    struct load_info info = { };
    loff_t size;
    void *hdr;
    int err;
    // ...
    err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX,
                                   READING_MODULE);
    if (err)
        return err;
    info.hdr = hdr;
    info.len = size;
    return load_module(&info, uargs, flags);
}
```
On line 8, `kernel_read_file_from_fd` makes the kernel read a file, which unsurprisingly should be `fibdrv.ko`.
```shell=
$ sudo strace insmod fibdrv.ko
...
getcwd("/home/johnnylord/fibdrv", 4096) = 24
stat("/home/johnnylord/fibdrv/fibdrv.ko", {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0
openat(AT_FDCWD, "/home/johnnylord/fibdrv/fibdrv.ko", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0
mmap(NULL, 8288, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f06212a2000
finit_module(3, "", 0) = 0
...
```
In the `strace insmod fibdrv.ko` output above, the `openat` call on line 5 opens `fibdrv.ko` and obtains file descriptor 3, which the `finit_module` call on line 8 then passes to the kernel.
:::warning
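Next, think about this: the output of `lsmod` has a column named `Used by`, described as "each module's use count and a list of referring modules". How is it implemented? How does the Linux kernel track dependencies between modules and the actual use count (reference counting)?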
:notes: jserv
:::
### How does `fibdrv` use the Linux Virtual File System interface so that a userspace program (here `client.c`) can access the computed Fibonacci sequence? Explain the principle, and write or find a similar Linux kernel module example
![](https://i.imgur.com/m6eIZJi.png)
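VFS defines an interface (API) one level above the concrete file systems, so that a user application can access the underlying data through it without caring how the lower layers are implemented. With VFS, adding a new file system is straightforward: it only needs to implement the interface that VFS defines.
VFS has several main objects, including `superblock`, `inode`, `dentry`, and `file`. For this assignment only `inode` and `file` are explained. Both represent a file in the file system, but a `file` is just the data structure the kernel creates for a process when `open` is executed. Each `open` therefore creates a new `file` (in other cases an existing `file` is shared and only its `f_count` is incremented), yet all of these `file` objects point to the same underlying file, namely the `inode`.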
**[include/linux/fs.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/fs.h#L572)**
```clike
struct inode {
    // ...
    dev_t i_rdev;
    // ...
    union {
        struct pipe_inode_info *i_pipe;
        struct block_device *i_bdev;
        struct cdev *i_cdev;
        char *i_link;
        unsigned i_dir_seq;
    };
    // ...
} __randomize_layout;
```
In `inode`, `i_rdev` holds the device's major and minor numbers (if the node is a device, that is). Depending on the device type, the inode also holds the corresponding device data. The device in this assignment is a char device, so the data structure is `struct cdev`.
**[include/linux/cdev.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/cdev.h#L14)**
```clike
struct cdev {
    struct kobject kobj;
    struct module *owner;
    const struct file_operations *ops;
    struct list_head list;
    dev_t dev;
    unsigned int count;
} __randomize_layout;
```
`struct cdev` has an `ops` field that points to a `struct file_operations`. This structure holds the operations of the char device, such as `open`, `read`, and `write`. Here the Linux kernel borrows a concept from OOP.
`fibdrv.c` defines a set of operations for this char device:
```clike
const struct file_operations fib_fops = {
    .owner = THIS_MODULE,
    .read = fib_read,
    .write = fib_write,
    .open = fib_open,
    .release = fib_release,
    .llseek = fib_device_lseek,
};
```
After defining the operations, the next step is to associate them with the char device and register the char device with the kernel.
```clike=
static int __init init_fib_dev(void)
{
    // ...
    fib_cdev = cdev_alloc();
    if (fib_cdev == NULL) {
        printk(KERN_ALERT "Failed to alloc cdev");
        rc = -1;
        goto failed_cdev;
    }
    cdev_init(fib_cdev, &fib_fops);
    rc = cdev_add(fib_cdev, fib_dev, 1);
    // ...
}
```
The very last step is to create the device node, so that users can interact with the device by operating on this node.
```clike
static int __init init_fib_dev(void)
{
    // ...
    fib_class = class_create(THIS_MODULE, DEV_FIBONACCI_NAME);
    if (!fib_class) {
        printk(KERN_ALERT "Failed to create device class");
        rc = -3;
        goto failed_class_create;
    }
    if (!device_create(fib_class, NULL, fib_dev, NULL, DEV_FIBONACCI_NAME)) {
        printk(KERN_ALERT "Failed to create device");
        rc = -4;
        goto failed_device_create;
    }
    // ...
}
```
### Look up the ktime-related APIs and find a use case (explain with a kernel module and simplified code)
**[drivers/base/power/main.c](https://elixir.bootlin.com/linux/v4.18/source/drivers/base/power/main.c)**
```clike
static ktime_t initcall_debug_start(struct device *dev, void *cb)
{
    // ...
    return ktime_get();
}
static void initcall_debug_report(struct device *dev, ktime_t calltime,
                                  void *cb, int error)
{
    ktime_t rettime;
    s64 nsecs;
    // ...
    rettime = ktime_get();
    nsecs = (s64) ktime_to_ns(ktime_sub(rettime, calltime));
    dev_info(dev, "%pF returned %d after %Ld usecs\n", cb, error,
             (unsigned long long)nsecs >> 10);
}
static int dpm_run_callback(pm_callback_t cb, struct device *dev,
                            pm_message_t state, const char *info)
{
    ktime_t calltime;
    // ...
    calltime = initcall_debug_start(dev, cb);
    // ...
    initcall_debug_report(dev, calltime, cb, error);
    return error;
}
```
This driver uses the ktime API to record execution time. `dpm_run_callback`, for example, records the start time, does its work, and at the end calls `initcall_debug_report` to print how long the execution took.
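### `fibdrv.c` contains `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy`, and so on. In which scenarios are they needed? Write a multi-threaded userspace program to test and observe what goes wrong if the Linux kernel module does not use a mutex
Most current computers are multi-core (symmetric multiprocessing), so several threads may access a shared resource at the same time. To keep shared resources consistent and correct, some mechanism must ensure that only one thread operates on a resource at a time.
A mutex (mutual exclusion) behaves conceptually like a binary semaphore. Before a thread enters a critical section it must acquire the lock (decrementing the semaphore by 1). If another thread then tries to enter the critical section, it finds the lock already taken (semaphore = 0) and goes to sleep. When the first thread leaves the critical section it releases the lock (incrementing the semaphore by 1), and only then can another thread enter.
Now look at the code: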
```clike
static int fib_open(struct inode *inode, struct file *file)
{
    if (!mutex_trylock(&fib_mutex)) {
        printk(KERN_ALERT "fibdrv is in use");
        return -EBUSY;
    }
    return 0;
}
static int fib_release(struct inode *inode, struct file *file)
{
    mutex_unlock(&fib_mutex);
    return 0;
}
```
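Locking and unlocking happen only when the file is opened and closed. Because `mutex_trylock` is used, a thread that tries to open `/dev/fibonacci` while another thread already has it open does not sleep; its `open` fails immediately with `EBUSY`.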
Below is a multi-threaded version of the client program:
```clike
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#include <fcntl.h>
#define FIB_DEV "/dev/fibonacci"
#define THREADS 5
#define OFFSET 100
void *thread_task(void *data)
{
    int fd, i;
    long long sz; /* read() returns a signed ssize_t */
    char kbuffer[100];
    (void) data; /* unused */
    fd = open(FIB_DEV, O_RDWR);
    if (fd < 0) {
        perror("Failed to open character device");
        return NULL; /* end only this thread, not the whole process */
    }
    for (i = OFFSET; i >= 0; i--) {
        lseek(fd, i, SEEK_SET);
        sz = read(fd, kbuffer, 1);
        printf("Reading from " FIB_DEV
               " at offset %d, returned the sequence "
               "%lld.\n",
               i, sz);
    }
    close(fd); /* the last close calls fib_release(), which unlocks the mutex */
    return NULL;
}
int main()
{
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; ++i) {
        pthread_create(&t[i], NULL, thread_task, NULL);
    }
    for (int i = 0; i < THREADS; ++i) {
        pthread_join(t[i], NULL);
    }
    return 0;
}
```
Below are the results with and without the lock in the kernel module.
* Version with the lock
```shell
$ sudo ./thread
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429.
Failed to open character device: Device or resource busy
Reading from /dev/fibonacci at offset 98, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 97, returned the sequence 7540113804746346429.
Failed to open character device: Device or resource busy
Reading from /dev/fibonacci at offset 96, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 95, returned the sequence 7540113804746346429.
Failed to open character device: Device or resource busy
Reading from /dev/fibonacci at offset 94, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 93, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120.
Reading from /dev/fibonacci at offset 89, returned the sequence 1779979416004714189.
Reading from /dev/fibonacci at offset 88, returned the sequence 1100087778366101931.
Failed to open character device: Device or resource busy
Reading from /dev/fibonacci at offset 87, returned the sequence 679891637638612258.
...
```
With the lock, the device (`/dev/fibonacci`) itself is effectively locked: only one thread can have `/dev/fibonacci` open at a time.
* Version without the lock
```
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 98, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 97, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 96, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 95, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 94, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 93, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120.
Reading from /dev/fibonacci at offset 89, returned the sequence 1779979416004714189.
Reading from /dev/fibonacci at offset 88, returned the sequence 1100087778366101931.
Reading from /dev/fibonacci at offset 87, returned the sequence 679891637638612258.
Reading from /dev/fibonacci at offset 86, returned the sequence 420196140727489673.
Reading from /dev/fibonacci at offset 85, returned the sequence 259695496911122585.
Reading from /dev/fibonacci at offset 84, returned the sequence 160500643816367088.
Reading from /dev/fibonacci at offset 83, returned the sequence 99194853094755497.
Reading from /dev/fibonacci at offset 82, returned the sequence 61305790721611591.
Reading from /dev/fibonacci at offset 81, returned the sequence 37889062373143906.
Reading from /dev/fibonacci at offset 80, returned the sequence 23416728348467685.
Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 98, returned the sequence 7540113804746346429.
...
```
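All threads can open `/dev/fibonacci` and read from it at the same time, so the output of the different threads is interleaved in an unpredictable order.
## Performance analysis
* **Unoptimized version**
Before any optimization, the figure below shows the time `client.c` needs to obtain the n-th Fibonacci number, measured in kernel space and in user space. The figure suggests that passing the computed Fibonacci number from kernel space to user space takes roughly `700ns`.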
![](https://i.imgur.com/OTzVkCF.png)
* **Optimized version**
To be continued
### How the time is measured
**[client.c](https://github.com/johnnylord/fibdrv/blob/master/client.c)**
```clike
for (i = 0; i <= offset; i++) {
    lseek(fd, i, SEEK_SET);
    clock_gettime(cid, &start);
    sz = read(fd, kbuffer, 100);
    clock_gettime(cid, &end);
    sec = end.tv_sec - start.tv_sec;
    nanosec = end.tv_nsec - start.tv_nsec;
    sprintf(buffer, "%d %ld %ld\n", i, sec * 1000000000 + nanosec, atol(kbuffer));
    write(fd_out, buffer, strlen(buffer));
    printf("Reading from " FIB_DEV
           " at offset %d, returned the sequence "
           "%lld.\n",
           i, sz);
}
```
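In the program above, the time spent in kernel space is passed back as a string through `kbuffer`, so that `client.c` can output it together with the rest of the data that `gnuplot` needs.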
**[fibdrv.c](https://github.com/johnnylord/fibdrv/blob/master/fibdrv.c)**
```clike=
/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    ktime_t timelapsed, start, end;
    ssize_t retval;
    start = ktime_get();
    retval = (ssize_t) fib_sequence(*offset);
    end = ktime_get();
    timelapsed = ktime_sub(end, start);
    sprintf(kbuffer, "%lld\n", timelapsed);
    copy_to_user(buf, kbuffer, 100);
    return retval;
}
```
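**The interesting part of the code above is the `copy_to_user` call on line 14: why use `copy_to_user` instead of letting `sprintf` write directly into user-space memory?**
The `buf` argument passed to `fib_read` is a user-space pointer, so it should not be dereferenced directly in a kernel module. The reasons are as follows: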
>
> 1. Depending on which architecture your driver is running on, and how the kernel was configured, ==the user-space pointer may not be valid while running in kernel mode at all==. There may be no mapping for that address, or it could point to some other, random data.
>
> 2. Even if the pointer does mean the same thing in kernel space, ==user-space memory is paged, and the memory in question might not be resident in RAM when the system call is made. Attempting to reference the user-space memory directly could generate a page fault, which is something that kernel code is not allowed to do.== The result would be an "oops," which would result in the death of the process that made the system call.
>
> 3. ==The pointer in question has been supplied by a user program, which could be buggy or malicious. If your driver ever blindly dereferences a user-supplied pointer, it provides an open doorway allowing a user-space program to access or overwrite memory anywhere in the system.== If you do not wish to be responsible for compromising the security of your users' systems, you cannot ever dereference a user-space pointer directly.