# 2019q1 Homework2 (fibdrv)
contributed by < `johnnylord` >

###### tags: `sysprog2019`

## Experiment Environment
```shell
$ uname -a
Linux johnnylord 4.18.0-16-generic #17~18.04.1-Ubuntu SMP Tue Feb 12 13:35:51 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
```

## Self-Check List
### What do the `MODULE_LICENSE`, `MODULE_AUTHOR`, `MODULE_DESCRIPTION`, and `MODULE_VERSION` macros in `fibdrv.c` do so that the kernel can learn this information? What does the Linux kernel do internally when `insmod` runs? Cite and explain the relevant Linux kernel source code.

These macros all embed the corresponding information in the compiled `.ko` file. Since they work the same way, this section focuses on `MODULE_AUTHOR`. The sample code specifies the module author as follows:
```clike
MODULE_AUTHOR("National Cheng Kung University, Taiwan");
```

**[include/linux/module.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/module.h#L205)**
```cpp
/*
 * Author(s), use "Name <email>" or just "Name", for multiple
 * authors use multiple MODULE_AUTHOR() statements/lines.
 */
#define MODULE_AUTHOR(_author) MODULE_INFO(author, _author)
```
The comment above describes the format of `_author` and says that `MODULE_AUTHOR` should be invoked once per author when there are several.

Expanding the macro gives `MODULE_INFO(author, "National Cheng Kung University, Taiwan")`.
```cpp
/* Generic info of form tag = "info" */
#define MODULE_INFO(tag, info) __MODULE_INFO(tag, tag, info)
```
Expanding once more gives `__MODULE_INFO(author, author, "National Cheng Kung University, Taiwan")`.

**[include/linux/moduleparam.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/moduleparam.h)**
```cpp
#ifdef MODULE
#define __MODULE_INFO(tag, name, info) \
static const char __UNIQUE_ID(name)[] \
  __used __attribute__((section(".modinfo"), unused, aligned(1))) \
  = __stringify(tag) "=" info
#else  /* !MODULE */
/* This struct is here for syntactic coherency, it is not used */
#define __MODULE_INFO(tag, name, info) \
  struct __UNIQUE_ID(name) {}
#endif
```
Which definition applies depends on whether `MODULE` is defined. `MODULE` is defined when the code is compiled as a loadable kernel module, and it is not defined when the code is built into the kernel. Expanding the macro further:
```cpp
static const char __UNIQUE_ID(author)[] \
__used __attribute__((section(".modinfo"), unused, aligned(1))) \
= __stringify(author) "=" "National Cheng Kung University, Taiwan"
```
**[include/linux/compiler-gcc.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/compiler-gcc.h#L208)**
```cpp
#define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
```
Next, expand `__UNIQUE_ID`. `__COUNTER__` is a predefined macro maintained by GCC: every use expands to an integer, and the value increases by one each time.
```cpp
static const char __PASTE(__PASTE(__UNIQUE_ID_, author), __COUNTER__)[] \
__used __attribute__((section(".modinfo"), unused, aligned(1))) \
= __stringify(author) "=" "National Cheng Kung University, Taiwan"
```

**[include/linux/compiler_types.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/compiler_types.h#L53)**
```cpp
/* Indirect macros required for expanded argument pasting, eg. __LINE__. */
#define ___PASTE(a,b) a##b
#define __PASTE(a,b) ___PASTE(a,b)
```
Expanding again (the numeric suffix comes from `__COUNTER__` and depends on how many times it was used earlier in the translation unit; `0` is used here for illustration):
```cpp
static const char __UNIQUE_ID_author0[] \
__used __attribute__((section(".modinfo"), unused, aligned(1))) \
= __stringify(author) "=" "National Cheng Kung University, Taiwan"
```

**[include/linux/stringify.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/stringify.h#L10)**
```cpp
#define __stringify_1(x...)	#x
#define __stringify(x...)	__stringify_1(x)
```
The final expansion yields the following result:
```clike
static const char __UNIQUE_ID_author0[] \
__used __attribute__((section(".modinfo"), unused, aligned(1))) \
= "author=National Cheng Kung University, Taiwan"
```
According to the GCC documentation on variable attributes, `section` places the variable in the named section, which here is `.modinfo`.
```
$ objdump -s fibdrv.ko
...
Contents of section .modinfo:
 0000 76657273 696f6e3d 302e3100 64657363  version=0.1.desc
 0010 72697074 696f6e3d 4669626f 6e616363  ription=Fibonacc
 0020 6920656e 67696e65 20647269 76657200  i engine driver.
 0030 61757468 6f723d4e 6174696f 6e616c20  author=National 
 0040 4368656e 67204b75 6e672055 6e697665  Cheng Kung Unive
 0050 72736974 792c2054 61697761 6e006c69  rsity, Taiwan.li
 0060 63656e73 653d4475 616c204d 49542f47  cense=Dual MIT/G
 0070 504c0000 00000000 73726376 65727369  PL......srcversi
 0080 6f6e3d34 42373436 37453631 43414238  on=4B7467E61CAB8
 0090 32354539 35364446 38330000 00000000  25E956DF83......
 00a0 64657065 6e64733d 00726574 706f6c69  depends=.retpoli
 00b0 6e653d59 006e616d 653d6669 62647276  ne=Y.name=fibdrv
 00c0 00766572 6d616769 633d342e 31382e30  .vermagic=4.18.0
 00d0 2d31362d 67656e65 72696320 534d5020  -16-generic SMP 
 00e0 6d6f645f 756e6c6f 61642000           mod_unload .
...
```
The dump shows that the author information has been written into the `.modinfo` section.

Next, look at what the Linux kernel does when `insmod` runs. Here `strace` is used to see which system calls are made while executing `insmod fibdrv.ko`.
```shell=
$ sudo strace insmod fibdrv.ko
execve("/sbin/insmod", ["insmod", "fibdrv.ko"], 0x7ffeab43f308 /* 25 vars */) = 0
brk(NULL)                               = 0x561084511000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=83948, ...}) = 0
mmap(NULL, 83948, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0621290000
close(3)                                = 0
...
close(3)                                = 0
getcwd("/home/johnnylord/fibdrv", 4096) = 24
stat("/home/johnnylord/fibdrv/fibdrv.ko", {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0
openat(AT_FDCWD, "/home/johnnylord/fibdrv/fibdrv.ko", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0
mmap(NULL, 8288, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f06212a2000
finit_module(3, "", 0)                  = 0
munmap(0x7f06212a2000, 8288)            = 0
close(3)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++
```
Near the end of the trace, `insmod` calls `finit_module`. The next step is to see how the Linux kernel defines `finit_module`.
**[kernel/module.c](https://elixir.bootlin.com/linux/v4.18/source/kernel/module.c)**
```clike=
SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
{
    // ...
    return load_module(&info, uargs, flags);
}
```
Line 4 shows that it calls the `load_module` function.
```cpp
/* Allocate and load the module: note that size of section 0 is always
   zero, and we rely on this for optional sections. */
static int load_module(struct load_info *info, const char __user *uargs,
                       int flags)
{
    ...
}
```
The comment indicates that `load_module` is roughly where the kernel allocates memory for the module and loads the module's data.

### When a kernel module is loaded through `insmod`, why does the function registered with `module_init` get executed? What does the Linux kernel do?
First, look at the source code:
```clike
static int __init init_fib_dev(void)
{
    // ...
}

static void __exit exit_fib_dev(void)
{
    // ...
}

module_init(init_fib_dev);
module_exit(exit_fib_dev);
```

**[include/linux/module.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/module.h#L129)**
```clike=
/* Each module must use one module_init(). */
#define module_init(initfn) \
    static inline initcall_t __maybe_unused __inittest(void) \
    { return initfn; } \
    int init_module(void) __attribute__((alias(#initfn)));
```
On line 5, GCC's `alias` attribute makes `int init_module(void)` an alias of `initfn` in the compiled output.

The following experiment checks whether the alias really takes effect:
```clike
#include <stdio.h>

int __func()
{
    printf("In __func()\n");
    return 0;
}

int func() __attribute__((alias("__func")));

int main()
{
    func();
    return 0;
}
```
```shell
$ gcc -o test test.c
$ ./test
In __func()
```
So calling `init_module()` is equivalent to calling the user-defined function `initfn`.

Back to the question of why `init_module` gets executed. As described in the previous question, the system calls `finit_module`, which in turn calls `load_module`.

**[kernel/module.c](https://elixir.bootlin.com/linux/v4.18/source/kernel/module.c#L3785)**
```clike=
/* Allocate and load the module: note that size of section 0 is always
   zero, and we rely on this for optional sections. */
static int load_module(struct load_info *info, const char __user *uargs,
                       int flags)
{
    struct module *mod;
    //...
    /* Figure out module layout, and allocate all the memory.
     */
    mod = layout_and_allocate(info, flags);
    if (IS_ERR(mod)) {
        err = PTR_ERR(mod);
        goto free_copy;
    }
    // ...
    return do_init_module(mod);
    // ...
}
```
As shown above, before calling `do_init_module`, `load_module` first calls `layout_and_allocate` to work out the memory layout of the module being loaded and to allocate memory for it. Finally, the `do_init_module(mod)` call initializes the module.

**[kernel/module.c](https://elixir.bootlin.com/linux/v4.18/source/kernel/module.c#L3456)**
```clike=
/*
 * This is where the real work happens.
 *
 * Keep it uninlined to provide a reliable breakpoint target, e.g. for the gdb
 * helper command 'lx-symbols'.
 */
static noinline int do_init_module(struct module *mod)
{
    // ...
    /* Start the module */
    if (mod->init != NULL)
        ret = do_one_initcall(mod->init);
    // ...
}
```

**[init/main.c](https://elixir.bootlin.com/linux/v4.18/source/init/main.c#L874)**
```clike=
int __init_or_module do_one_initcall(initcall_t fn)
{
    // ...
    do_trace_initcall_start(fn);
    ret = fn();
    do_trace_initcall_finish(fn, ret);
    // ...
}
```
Here `fn` is the `mod->init` that was passed in. The module's init function is executed by the `ret = fn();` call in the code above, and this is where the module's real initialization work is done.

### Run `$ readelf -a fibdrv.ko`, observe how its output relates to the source code and to `modinfo`, and, building on the questions above, explain how an ELF executable file such as `fibdrv.ko` is "inserted" into the Linux kernel
Executable and Linking Format is abbreviated as ELF. As the name suggests, it is a file format that can describe either an executable binary file or an object file. Since `fibdrv.ko` in this experiment is not an executable, the explanation here looks at ELF from the object-file point of view. An ELF file can be roughly divided into three parts:
* **ELF header** holds information about the object file itself.
```shell
$ readelf -h fibdrv.ko
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              REL (Relocatable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          0 (bytes into file)
  Start of section headers:          6688 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           0 (bytes)
  Number of program headers:         0
  Size of section headers:           64 (bytes)
  Number of section headers:         25
  Section header string table index: 24
```
* **Section(s)** include system-predefined sections such as
`.text`, `.data`, and `.bss`, as well as user-defined sections, which in this example include `.modinfo`.
* **Section Header(s)** hold metadata about each section, for example the size of a section.
```shell
$ readelf -S fibdrv.ko
There are 25 section headers, starting at offset 0x1a20:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .note.gnu.build-i NOTE             0000000000000000  00000040
       0000000000000024  0000000000000000   A       0     0     4
  [ 2] .text             PROGBITS         0000000000000000  00000070
       000000000000015c  0000000000000000  AX       0     0     16
  [ 3] .rela.text        RELA             0000000000000000  00001218
       0000000000000120  0000000000000018   I      22     2     8
  [ 4] .init.text        PROGBITS         0000000000000000  000001cc
       0000000000000153  0000000000000000  AX       0     0     1
  [ 5] .rela.init.text   RELA             0000000000000000  00001338
       00000000000003a8  0000000000000018   I      22     4     8
  [ 6] .exit.text        PROGBITS         0000000000000000  0000031f
       0000000000000040  0000000000000000  AX       0     0     1
  [ 7] .rela.exit.text   RELA             0000000000000000  000016e0
       00000000000000d8  0000000000000018   I      22     6     8
  [ 8] __mcount_loc      PROGBITS         0000000000000000  0000035f
       0000000000000030  0000000000000000   A       0     0     1
  [ 9] .rela__mcount_loc RELA             0000000000000000  000017b8
       0000000000000090  0000000000000018   I      22     8     8
  [10] .rodata.str1.1    PROGBITS         0000000000000000  0000038f
       000000000000006e  0000000000000001 AMS       0     0     1
  [11] .rodata.str1.8    PROGBITS         0000000000000000  00000400
       0000000000000058  0000000000000001 AMS       0     0     8
  [12] .rodata           PROGBITS         0000000000000000  00000460
       0000000000000100  0000000000000000   A       0     0     32
  [13] .rela.rodata      RELA             0000000000000000  00001848
       0000000000000090  0000000000000018   I      22    12     8
  [14] .modinfo          PROGBITS         0000000000000000  00000560
       00000000000000ec  0000000000000000   A       0     0     8
  [15] .data             PROGBITS         0000000000000000  00000660
       0000000000000020  0000000000000000  WA       0     0     32
  [16] .rela.data        RELA             0000000000000000  000018d8
  ...
```
Next, consider how the `modinfo` program relates to `fibdrv.ko`. As inferred in the first question, the `MODULE_XXX` macros place extra information about the module in the `.modinfo` section of `fibdrv.ko`, so the `modinfo` program presumably reads the `.modinfo` section of `fibdrv.ko` and displays its contents. Below is the description of `modinfo` from `man modinfo`.
> DESCRIPTION
> ==modinfo extracts information from the Linux Kernel modules== given on the command line. If the module name is not a filename, then the /lib/modules/version directory is searched, as is also done by modprobe(8) when loading kernel modules.
> ==modinfo by default lists each attribute of the module in form fieldname : value, for easy reading.== The filename is listed the same way (although it's not really an attribute).

**Explain how an ELF executable file such as `fibdrv.ko` is "inserted" into the Linux kernel.** Strictly speaking, `fibdrv.ko` is not an executable; it is only an object file. If `fibdrv.ko` were an executable, it would contain program headers, but the ELF header shows that there are none.
```shell
$ readelf -h fibdrv.ko
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              REL (Relocatable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          0 (bytes into file)
  Start of section headers:          6688 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           0 (bytes)
  Number of program headers:         0
  Size of section headers:           64 (bytes)
  Number of section headers:         25
  Section header string table index: 24
```
We therefore need the `insmod` program (which is an executable) to insert `fibdrv.ko` into the kernel. A kernel module runs in kernel space, but `insmod fibdrv.ko` is a user-space process, so `insmod` has to invoke a system call that brings the module's data from user space into kernel space.

Recall from earlier that `insmod` makes the kernel execute `finit_module`.
:::warning
When listing Linux kernel source code, state the version and add the corresponding LXR hyperlink.
:notes: jserv
:::

**[linux/kernel/module.c](https://elixir.bootlin.com/linux/v4.18/source/kernel/module.c)**
```clike=
SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
{
    struct load_info info = { };
    loff_t
        size;
    void *hdr;
    int err;
    // ...
    err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX,
                                   READING_MODULE);
    if (err)
        return err;
    info.hdr = hdr;
    info.len = size;

    return load_module(&info, uargs, flags);
}
```
In the `kernel_read_file_from_fd` call, the kernel reads a file through the given file descriptor, and unsurprisingly that file is `fibdrv.ko`.
```shell=
$ sudo strace insmod fibdrv.ko
...
getcwd("/home/johnnylord/fibdrv", 4096) = 24
stat("/home/johnnylord/fibdrv/fibdrv.ko", {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0
openat(AT_FDCWD, "/home/johnnylord/fibdrv/fibdrv.ko", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0
mmap(NULL, 8288, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f06212a2000
finit_module(3, "", 0)                  = 0
...
```
In this `strace insmod fibdrv.ko` output, the `openat` call opens `fibdrv.ko` and obtains file descriptor 3, which is then passed to `finit_module`.

:::warning
Something to think about next: the output of `lsmod` has a column named `Used by`, described as "each module's use count and a list of referring modules". How is that implemented? How does the Linux kernel track dependencies between modules and the actual use count (reference counting)?
:notes: jserv
:::

### How does `fibdrv` use the Linux Virtual File System interface to make the computed Fibonacci sequence accessible to user-space programs (`client.c` in this case)? Explain the principle, and write or find a similar Linux kernel module example
![](https://i.imgur.com/m6eIZJi.png)

VFS defines a higher-level interface (API) on top of the concrete file systems, so that a user application can access the underlying data through the interface VFS defines without caring how it is implemented underneath. With VFS, adding a new file system is straightforward: it only has to implement the interface that VFS defines.

VFS has several main objects, including `superblock`, `inode`, `dentry`, and `file`. For the purposes of this assignment, only `inode` and `file` are explained here. Both represent a file in the file system, but a `file` is just a data structure the kernel creates for a process when `open` is executed. Each call to `open` therefore creates another `file` (or, in some cases, the `f_count` of an existing `file` is incremented instead), yet all of these `file` objects point to the same underlying file, which is the `inode`.

**[include/linux/fs.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/fs.h#L572)**
```clike
struct inode {
    // ...
    dev_t i_rdev;
    // ...
    union {
        struct pipe_inode_info *i_pipe;
        struct block_device *i_bdev;
        struct cdev *i_cdev;
        char *i_link;
        unsigned i_dir_seq;
    };
    // ...
} __randomize_layout;
```
In `inode`, `i_rdev` stores the device's major and minor numbers (when the node represents a device). Depending on the device type, the inode also holds a pointer to the corresponding device data. The device in this assignment is a character device, so the data structure is `struct cdev`.

**[include/linux/cdev.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/cdev.h#L14)**
```clike
struct cdev {
    struct kobject kobj;
    struct module *owner;
    const struct file_operations *ops;
    struct list_head list;
    dev_t dev;
    unsigned int count;
} __randomize_layout;
```
`struct cdev` contains a `struct file_operations` field. This structure holds the operations supported by the character device, such as `open`, `read`, and `write`. This shows the Linux kernel borrowing an object-oriented idea: a table of function pointers plays the role of an object's methods.

`fibdrv.c` defines the set of operations for this character device:
```clike
const struct file_operations fib_fops = {
    .owner = THIS_MODULE,
    .read = fib_read,
    .write = fib_write,
    .open = fib_open,
    .release = fib_release,
    .llseek = fib_device_lseek,
};
```
Once the operations are defined, they are associated with the character device, and the character device is registered with the kernel.
```clike=
static int __init init_fib_dev(void)
{
    // ...
    fib_cdev = cdev_alloc();
    if (fib_cdev == NULL) {
        printk(KERN_ALERT "Failed to alloc cdev");
        rc = -1;
        goto failed_cdev;
    }
    cdev_init(fib_cdev, &fib_fops);
    rc = cdev_add(fib_cdev, fib_dev, 1);
    // ...
}
```
The last step is to create the device node, so that users can interact with the device by operating on that node.
```clike
static int __init init_fib_dev(void)
{
    // ...
    fib_class = class_create(THIS_MODULE, DEV_FIBONACCI_NAME);

    if (!fib_class) {
        printk(KERN_ALERT "Failed to create device class");
        rc = -3;
        goto failed_class_create;
    }

    if (!device_create(fib_class, NULL, fib_dev, NULL, DEV_FIBONACCI_NAME)) {
        printk(KERN_ALERT "Failed to create device");
        rc = -4;
        goto failed_device_create;
    }
    // ...
}
```

### Look up the ktime-related APIs and find a use case (explained with a kernel module and simplified code)
**[drivers/base/power/main.c](https://elixir.bootlin.com/linux/v4.18/source/drivers/base/power/main.c)**
```clike
static ktime_t initcall_debug_start(struct device *dev, void *cb)
{
    // ...
    return ktime_get();
}

static void initcall_debug_report(struct device *dev, ktime_t calltime,
                                  void *cb, int error)
{
    ktime_t rettime;
    s64 nsecs;
    // ...
    rettime = ktime_get();
    nsecs = (s64) ktime_to_ns(ktime_sub(rettime, calltime));

    dev_info(dev, "%pF returned %d after %Ld usecs\n", cb, error,
             (unsigned long long)nsecs >> 10);
}

static int dpm_run_callback(pm_callback_t cb, struct device *dev,
                            pm_message_t state, const char *info)
{
    ktime_t calltime;
    // ...
    calltime = initcall_debug_start(dev, cb);
    // ...
    initcall_debug_report(dev, calltime, cb, error);

    return error;
}
```
This driver code uses the ktime API to record execution time. In `dpm_run_callback`, for example, it first records the start time, does its work, and at the end calls `initcall_debug_report` to print how much time the callback took.

### `fibdrv.c` contains `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, and `mutex_destroy`. In what scenarios are they needed? Write a multithreaded user-space program to test this, and observe what goes wrong if the Linux kernel module does not use a mutex
Most computers today are multi-core (symmetric multiprocessing), so several threads may access a shared resource at the same time. To keep shared resources consistent and correct, some mechanism is needed so that only one thread can operate on a resource at any given moment.

A mutex (mutual exclusion) is conceptually similar to a binary semaphore. Before entering a critical section, a thread must acquire the lock (decrementing the semaphore by 1). If another thread then tries to enter the critical section, it finds that the lock is already taken (semaphore = 0) and, with a blocking call such as `mutex_lock`, goes to sleep. When the first thread leaves the critical section, it releases the lock (incrementing the semaphore by 1) so that other threads can enter. `fibdrv` uses `mutex_trylock` instead, which does not sleep but reports failure immediately when the lock is already held.

Now look at the code:
```clike
static int fib_open(struct inode *inode, struct file *file)
{
    if (!mutex_trylock(&fib_mutex)) {
        printk(KERN_ALERT "fibdrv is in use");
        return -EBUSY;
    }
    return 0;
}

static int fib_release(struct inode *inode, struct file *file)
{
    mutex_unlock(&fib_mutex);
    return 0;
}
```
The driver locks and unlocks only when the device file is opened and closed. If one thread already has `/dev/fibonacci` open, any other thread's attempt to open it fails with `EBUSY`.

Below is a multithreaded version of the client program:
```clike
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#include <fcntl.h>

#define FIB_DEV "/dev/fibonacci"
#define THREADS 5
#define OFFSET 100

void *thread_task(void *data)
{
    int fd, i;
    long long sz; /* signed, to match the %lld format used when printing */

    fd = open(FIB_DEV, O_RDWR);
    if (fd < 0) {
        perror("Failed to open character device");
        exit(1);
    }

    char kbuffer[100];
    for (i = OFFSET; i >= 0; i--) {
        lseek(fd, i, SEEK_SET);
        sz = read(fd, kbuffer, 1);
        printf("Reading from " FIB_DEV
               " at offset %d, returned the sequence "
               "%lld.\n",
               i, sz);
    }

    pthread_exit(NULL);
}

int main()
{
    pthread_t t[THREADS];

    for (int i = 0; i < THREADS; ++i) {
        pthread_create(&t[i], NULL, thread_task, NULL);
    }

    for (int i = 0; i < THREADS; ++i) {
        pthread_join(t[i], NULL);
    }
    return 0;
}
```
Next, compare the results when the kernel module uses the lock and when it does not.
* Version with the lock
```shell
$ sudo ./thread
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429.
Failed to open character device: Device or resource busy
Reading from /dev/fibonacci at offset 98, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 97, returned the sequence 7540113804746346429.
Failed to open character device: Device or resource busy
Reading from /dev/fibonacci at offset 96, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 95, returned the sequence 7540113804746346429.
Failed to open character device: Device or resource busy
Reading from /dev/fibonacci at offset 94, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 93, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120.
Reading from /dev/fibonacci at offset 89, returned the sequence 1779979416004714189.
Reading from /dev/fibonacci at offset 88, returned the sequence 1100087778366101931.
Failed to open character device: Device or resource busy
Reading from /dev/fibonacci at offset 87, returned the sequence 679891637638612258.
...
```
With the lock in place, the device (`/dev/fibonacci`) itself is effectively locked: only one thread at a time can open `/dev/fibonacci`.
* Version without the lock
```
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 98, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 97, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 96, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 95, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 94, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 93, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120.
Reading from /dev/fibonacci at offset 89, returned the sequence 1779979416004714189.
Reading from /dev/fibonacci at offset 88, returned the sequence 1100087778366101931.
Reading from /dev/fibonacci at offset 87, returned the sequence 679891637638612258.
Reading from /dev/fibonacci at offset 86, returned the sequence 420196140727489673.
Reading from /dev/fibonacci at offset 85, returned the sequence 259695496911122585.
Reading from /dev/fibonacci at offset 84, returned the sequence 160500643816367088.
Reading from /dev/fibonacci at offset 83, returned the sequence 99194853094755497.
Reading from /dev/fibonacci at offset 82, returned the sequence 61305790721611591.
Reading from /dev/fibonacci at offset 81, returned the sequence 37889062373143906.
Reading from /dev/fibonacci at offset 80, returned the sequence 23416728348467685.
Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 98, returned the sequence 7540113804746346429.
...
```
Without the lock, all threads can open `/dev/fibonacci` and read from it at the same time, so the output of the different threads is interleaved in no predictable order.

## Performance Analysis
* **Unoptimized version**
  Before any optimization, the plot below shows the time `client.c` needs to obtain the n-th Fibonacci number, measured in kernel space and in user space. Judging from the plot, passing the computed Fibonacci number from kernel space to user space takes roughly `700ns`.
  ![](https://i.imgur.com/OTzVkCF.png)
* **Optimized version**
  To be continued.

### How the time is measured
**[client.c](https://github.com/johnnylord/fibdrv/blob/master/client.c)**
```clike
for (i = 0; i <= offset; i++) {
    lseek(fd, i, SEEK_SET);
    clock_gettime(cid, &start);
    sz = read(fd, kbuffer, 100);
    clock_gettime(cid, &end);

    sec = end.tv_sec - start.tv_sec;
    nanosec = end.tv_nsec - start.tv_nsec;
    sprintf(buffer, "%d %ld %ld\n", i, sec * 1000000000 + nanosec, atol(kbuffer));

    write(fd_out, buffer, strlen(buffer));
    printf("Reading from " FIB_DEV
           " at offset %d, returned the sequence "
           "%lld.\n",
           i, sz);
}
```
In the code above, the time spent in kernel space is passed back as a string through `kbuffer`, so that `client.c` can write it out together with the user-space measurement as the data later plotted with `gnuplot`.

**[fibdrv.c](https://github.com/johnnylord/fibdrv/blob/master/fibdrv.c)**
```clike=
/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    ktime_t timelapsed, start, end;
    ssize_t retval;

    start = ktime_get();
    retval = (ssize_t) fib_sequence(*offset);
    end = ktime_get();

    timelapsed = ktime_sub(end, start);
    sprintf(kbuffer, "%lld\n", timelapsed);
    copy_to_user(buf, kbuffer, 100);
    return retval;
}
```
**The interesting part of the code above is the `copy_to_user` call: why is `copy_to_user` needed, instead of using `sprintf` to write into user-space memory directly?**

`fib_read`
receives `buf` as a user-space pointer, so a kernel module should not dereference it directly. The reasons are as follows:
> > 1. Depending on which architecture your driver is running on, and how the kernel was configured, ==the user-space pointer may not be valid while running in kernel mode at all==. There may be no mapping for that address, or it could point to some other, random data.
> >
> > 2. Even if the pointer does mean the same thing in kernel space, ==user-space memory is paged, and the memory in question might not be resident in RAM when the system call is made. Attempting to reference the user-space memory directly could generate a page fault, which is something that kernel code is not allowed to do.== The result would be an "oops," which would result in the death of the process that made the system call.
> >
> > 3. ==The pointer in question has been supplied by a user program, which could be buggy or malicious. If your driver ever blindly dereferences a user-supplied pointer, it provides an open doorway allowing a user-space program to access or overwrite memory anywhere in the system.== If you do not wish to be responsible for compromising the security of your users' systems, you cannot ever dereference a user-space pointer directly.