# 2019q1 Homework2 (fibdrv)
contributed by < `0xff07` >

## Rough intuition
In `fibdrv`, essentially everything comes down to properly initializing two things, a `const struct file_operations` and a `dev_t`:
```c
const struct file_operations fib_fops = {
    .owner = THIS_MODULE,
    .read = fib_read,
    .write = fib_write,
    .open = fib_open,
    .release = fib_release,
    .llseek = fib_device_lseek,
};
```
These appear to supply the device-specific functions behind file operations such as `read` and `open`. `init_fib_dev` then hands both to the kernel while registering the device, with the `dev_t` ending up in `device_create`.

There are also a number of `MODULE_*` macros whose arguments appear to be assorted metadata about the module.

## Details
### .modinfo
In [include/linux/module.h](https://elixir.bootlin.com/linux/latest/source/include/linux/module.h#L199) of Linux kernel 5.0 we find:
```c
#define MODULE_AUTHOR(_author) MODULE_INFO(author, _author)
```
Further down in the same file:
```c
/* Generic info of form tag = "info" */
#define MODULE_INFO(tag, info) __MODULE_INFO(tag, tag, info)
```
`__MODULE_INFO` is defined in [include/linux/moduleparam.h](https://elixir.bootlin.com/linux/latest/source/include/linux/moduleparam.h#L21):
```c
#ifdef MODULE
#define __MODULE_INFO(tag, name, info)                              \
static const char __UNIQUE_ID(name)[]                               \
  __used __attribute__((section(".modinfo"), unused, aligned(1)))   \
  = __stringify(tag) "=" info
#else  /* !MODULE */
/* This struct is here for syntactic coherency, it is not used */
#define __MODULE_INFO(tag, name, info)                              \
  struct __UNIQUE_ID(name) {}
#endif
```
The `__stringify(tag)` used there is defined in [include/linux/stringify.h](https://elixir.bootlin.com/linux/latest/source/include/linux/stringify.h#L10) as:
```c
#define __stringify_1(x...)     #x
#define __stringify(x...)       __stringify_1(x)
```
So the macro puts a `#` in front of its argument, which is the preprocessor's stringification operator as found in C99. Setting the compiler hints aside, the code therefore roughly means:
```c
#define __MODULE_INFO(tag, name, info) \
    static const char name[] = #tag "=" info
```
This relies on section 6.4.5 (String Literals) of the C11 specification:
> In translation phase 6, the multibyte character sequences specified by any sequence of adjacent character and wide string literal tokens are concatenated into a single multibyte character sequence.
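
The two mechanisms can be tried outside the kernel. The following is a minimal user-space sketch, not the kernel's macro: `MY_MODULE_INFO`, `my_stringify`, the accessor functions, and the section name `.my_modinfo` are hypothetical names made up for this illustration, and it assumes gcc or clang targeting ELF (e.g. Linux).

```c
#include <assert.h>
#include <string.h>

/* Two-step stringification, the same idea as the kernel's __stringify(). */
#define my_stringify_1(x) #x
#define my_stringify(x) my_stringify_1(x)

/*
 * Hypothetical user-space analogue of __MODULE_INFO (not the kernel macro):
 * builds the string "<tag>=<info>" from adjacent string literals and asks
 * the compiler to place it in a custom ELF section.
 */
#define MY_MODULE_INFO(tag, info)                                  \
    static const char my_info_##tag[]                              \
        __attribute__((section(".my_modinfo"), used, aligned(1)))  \
        = my_stringify(tag) "=" info

MY_MODULE_INFO(author, "National Cheng Kung University, Taiwan");
MY_MODULE_INFO(license, "Dual MIT/GPL");

/* Accessors so that other code can inspect the generated strings. */
const char *info_author(void) { return my_info_author; }
const char *info_license(void) { return my_info_license; }

/* Returns 1 if entry starts with "<tag>=", 0 otherwise. */
int info_has_tag(const char *entry, const char *tag)
{
    assert(entry && tag);
    size_t n = strlen(tag);
    return strncmp(entry, tag, n) == 0 && entry[n] == '=';
}
```

Compiling this into an object file and running `readelf -p .my_modinfo` on it should list the `author=...` and `license=...` strings, which is the same kind of inspection applied to `fibdrv.ko` in this section.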
Put simply, string literals written side by side behave as one concatenated string.

Next, look at what `__attribute__` does. It is documented in the [Variable Attributes](https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#Common-Variable-Attributes) section of the gcc manual:
> `section (`*section-name*`)`
> Normally, the compiler places the objects it generates in sections like data and bss. Sometimes, however, you need additional sections, or you need certain particular variables to appear in special sections, for example to map to special hardware. The section attribute specifies that a variable (or function) lives in a particular section.

In other words, this hint places a (global) variable into a custom section of the compiled object. More precisely, the `__MODULE_INFO` macro makes the compiler put a string into the `.modinfo` section, and that string follows the format `<tag>=<info>`. `tag` is something like `author` or `license`, depending on which macro is used; the `<info>` after the equals sign is whatever the developer specifies, for example `National Cheng Kung University` or `Dual MIT/GPL`.

The `.modinfo` section shows up in the output of `$ readelf -a`:
```
Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
...
  [15] .rela.rodata      RELA             0000000000000000  000018a8
       0000000000000090  0000000000000018   I      24    14     8
  [16] .modinfo          PROGBITS         0000000000000000  00000560
       00000000000000ec  0000000000000000   A       0     0     8
  [17] .data             PROGBITS         0000000000000000  00000660
       0000000000000020  0000000000000000  WA       0     0     32
...
```
Going further, open `fibdrv.ko` in vim and view it in hexadecimal mode. At the `Offset` of the `.modinfo` section reported by [readelf](https://sourceware.org/binutils/docs/binutils/readelf.html) (0x560), we find:
```hex
00000530: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000540: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000550: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000560: 7665 7273 696f 6e3d 302e 3100 6465 7363  version=0.1.desc
00000570: 7269 7074 696f 6e3d 4669 626f 6e61 6363  ription=Fibonacc
00000580: 6920 656e 6769 6e65 2064 7269 7665 7200  i engine driver.
00000590: 6175 7468 6f72 3d4e 6174 696f 6e61 6c20  author=National 
000005a0: 4368 656e 6720 4b75 6e67 2055 6e69 7665  Cheng Kung Unive
000005b0: 7273 6974 792c 2054 6169 7761 6e00 6c69  rsity, Taiwan.li
000005c0: 6365 6e73 653d 4475 616c 204d 4954 2f47  cense=Dual MIT/G
000005d0: 504c 0000 0000 0000 7372 6376 6572 7369  PL......srcversi
000005e0: 6f6e 3d32 3444 4335 4642 3745 3736 3038  on=24DC5FB7E7608
000005f0: 4146 3136 4230 4343 3146 0000 0000 0000  AF16B0CC1F......
00000600: 6465 7065 6e64 733d 0072 6574 706f 6c69  depends=.retpoli
00000610: 6e65 3d59 006e 616d 653d 6669 6264 7276  ne=Y.name=fibdrv
00000620: 0076 6572 6d61 6769 633d 342e 3138 2e30  .vermagic=4.18.0
00000630: 2d31 352d 6765 6e65 7269 6320 534d 5020  -15-generic SMP 
00000640: 6d6f 645f 756e 6c6f 6164 2000 0000 0000  mod_unload .....
00000650: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000660: 0000 0000 0000 0000 0000 0000 0000 0000  ................
```

> What do macros in fibdrv.c such as MODULE_LICENSE, MODULE_AUTHOR, MODULE_DESCRIPTION and MODULE_VERSION do so that the kernel learns this information?

### module_init
According to Chapter 2 of [Linux Device Drivers, Third Edition](https://lwn.net/Kernel/LDD3/):
> ...The use of module_init is mandatory. This macro adds a special section to the module’s object code stating where the module’s initialization function is to be found. Without this definition, your initialization function is never called.

So the `module_init` macro records, in the compiled object, where the module's initialization function is located. Similarly, on `module_exit`:
> Once again, the module_exit declaration is necessary to enable the kernel to find your cleanup function.

Looking again in [include/linux/module.h](https://elixir.bootlin.com/linux/latest/source/include/linux/module.h#L199), we find:
```c
#ifndef MODULE
/**
 * module_init() - driver initialization entry point
 * @x: function to be run at kernel boot time or module insertion
 *
 * module_init() will either be called during do_initcalls() (if
 * builtin) or at module insertion time (if a module).  There can only
 * be one per module.
 */
#define module_init(x)  __initcall(x);

/**
 * module_exit() - driver exit entry point
 * @x: function to be run when driver is removed
 *
 * module_exit() will wrap the driver clean-up code
 * with cleanup_module() when used with rmmod when
 * the driver is a module.  If the driver is statically
 * compiled into the kernel, module_exit() has no effect.
 * There can only be one per module.
 */
#define module_exit(x)  __exitcall(x);

#else /* MODULE */

/*
 * In most cases loadable modules do not need custom
 * initcall levels. There are still some valid cases where
 * a driver may be needed early if built in, and does not
 * matter when built as a loadable module. Like bus
 * snooping debug drivers.
 */
#define early_initcall(fn)              module_init(fn)
#define core_initcall(fn)               module_init(fn)
#define core_initcall_sync(fn)          module_init(fn)
#define postcore_initcall(fn)           module_init(fn)
#define postcore_initcall_sync(fn)      module_init(fn)
#define arch_initcall(fn)               module_init(fn)
#define subsys_initcall(fn)             module_init(fn)
#define subsys_initcall_sync(fn)        module_init(fn)
#define fs_initcall(fn)                 module_init(fn)
#define fs_initcall_sync(fn)            module_init(fn)
#define rootfs_initcall(fn)             module_init(fn)
#define device_initcall(fn)             module_init(fn)
#define device_initcall_sync(fn)        module_init(fn)
#define late_initcall(fn)               module_init(fn)
#define late_initcall_sync(fn)          module_init(fn)

#define console_initcall(fn)            module_init(fn)

/* Each module must use one module_init(). */
#define module_init(initfn)                                     \
    static inline initcall_t __maybe_unused __inittest(void)    \
    { return initfn; }                                          \
    int init_module(void) __copy(initfn) __attribute__((alias(#initfn)));

/* This is only required if you want to be unloadable.
 */
#define module_exit(exitfn)                                     \
    static inline exitcall_t __maybe_unused __exittest(void)    \
    { return exitfn; }                                          \
    void cleanup_module(void) __copy(exitfn) __attribute__((alias(#exitfn)));

#endif
```
The comments tell us that for a module inserted after boot, the definitions after `#else` apply. In that case:
```c
#define module_init(initfn)                                     \
    static inline initcall_t __maybe_unused __inittest(void)    \
    { return initfn; }                                          \
    int init_module(void) __copy(initfn) __attribute__((alias(#initfn)));
```
Again there are several compiler hints here. The [Function Attributes](https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes) section of the gcc manual says the following.

On `alias`:
> alias ("target")
>
> The alias attribute causes the declaration to be emitted as an alias for another symbol, which must be specified. For instance,
> ```c
> void __f () { /* Do something. */; }
> void f () __attribute__ ((weak, alias ("__f")));
> ```
> defines ‘f’ to be a weak alias for ‘__f’. In C++, the mangled name for the target must be used. It is an error if ‘__f’ is not defined in the same translation unit.
> This attribute requires assembler and object file support, and may not be available on all targets.

On `copy` (which the kernel wraps as `__copy`):
> copy (function)
>
> The copy attribute applies the set of attributes with which function has been declared to the declaration of the function to which the attribute is applied. The attribute is designed for libraries that define aliases or function resolvers that are expected to specify the same set of attributes as their targets. The copy attribute can be used with functions, variables, or types. However, the kind of symbol to which the attribute is applied (either function or variable) must match the kind of symbol to which the argument refers. The copy attribute copies only syntactic and semantic attributes but not attributes that affect a symbol’s linkage or visibility such as alias, visibility, or weak. The deprecated attribute is also not copied. See Common Type Attributes. See Common Variable Attributes.
>
> For example, the StrongAlias macro below makes use of the alias and copy attributes to define an alias named alloc for function allocate declared with attributes alloc_size, malloc, and nothrow. Thanks to the __typeof__ operator the alias has the same type as the target function. As a result of the copy attribute the alias also shares the same attributes as the target.
> ```c
> #define StrongAlias(TargetFunc, AliasDecl) \
>   extern __typeof__ (TargetFunc) AliasDecl \
>     __attribute__ ((alias (#TargetFunc), copy (TargetFunc)));
>
> extern __attribute__ ((alloc_size (1), malloc, nothrow))
>   void* allocate (size_t);
> StrongAlias (allocate, alloc);
> ```

## Timekeeping mechanisms
The LWN article [Reinventing the timer wheel](https://lwn.net/Articles/646950/) has this description:
> The kernel maintains two types of timers with two distinct use cases. The high-resolution timer ("hrtimer") mechanism provides accurate timers for work that needs to be done in the near future; hrtimer use is relatively rare, but, when hrtimers are used, they almost always run to completion. "Timeouts," instead, are normally used to alert the kernel to an expected event that has failed to arrive — a missing network packet or I/O completion interrupt, for example. The accuracy requirements for these timers are less stringent (it doesn't matter if an I/O timeout comes a few milliseconds late), and, importantly, these timers are usually canceled before they expire.

Time-related facilities are used mainly in two situations:
1. timer: scheduling "do something at a given point in time". Intuitively this is like a train timetable; the important point is that if one train is delayed, every later train is affected in turn, so fairly precise timing is needed.
2.
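timeout: a notification that a deadline has passed, a reminder that "something is late". The simplest example is the use of `SIGALRM` in `qtest`. Since the event is already late anyway, the timing does not need to be as precise.

### jiffies & timer wheel
One way to keep time is to count the timer interrupts that have occurred since the system booted. This count is called `jiffies`. Older Linux kernels provide a timing mechanism built on `jiffies`, called the timer wheel; see [A new approach to kernel timers](https://lwn.net/Articles/152436/) (note: despite "new approach" in the title, the article was written in 2005). The mechanism is somewhat similar to a direct-mapped cache:

![](https://i.imgur.com/mZqKE8B.png)

Take `tv1` as an example: it is an array of 256 entries. The lowest 8 bits of `jiffies` select the element of `tv1` that holds the events to be handled at that tick. `tv2` is in effect an array of `tv1`s: its index advances by one every $2^8$ jiffies. `tv3` and the later levels follow the same pattern.

However, this timer is limited by the frequency of the timer interrupt, and that frequency has an upper bound.

### ktime_t & hrtimer
A newer mechanism breaks away from `jiffies` and builds timekeeping on a new data type, `ktime_t`; see the LWN article [The high-resolution timer API](https://lwn.net/Articles/167897/).

`hrtimer` is a timing mechanism introduced in 2.6.16. It uses the new `ktime_t` type to represent time. The definition of this type varies by architecture, so, like most data structures in Linux, it is meant to be manipulated only through dedicated functions. On x86-64 it is a 64-bit integer. Typical usage is as follows.

Declare and initialize a `ktime_t`:
```c
DEFINE_KTIME(name);
```
This plays the same role for `ktime_t` that `LIST_HEAD` plays for `struct list_head`: it declares a `ktime_t` and initializes it to 0.

Addition and subtraction:
```c
ktime_t ktime_add(ktime_t kt1, ktime_t kt2);
ktime_t ktime_sub(ktime_t kt1, ktime_t kt2);  /* kt1 - kt2 */
ktime_t ktime_add_ns(ktime_t kt, u64 nanoseconds);
```
Conversions to and from other time representations:
```c
ktime_t timespec_to_ktime(struct timespec tspec);
ktime_t timeval_to_ktime(struct timeval tval);
struct timespec ktime_to_timespec(ktime_t kt);
struct timeval ktime_to_timeval(ktime_t kt);
clock_t ktime_to_clock_t(ktime_t kt);
u64 ktime_to_ns(ktime_t kt);
```
See also the [IBM](https://www.ibm.com/developerworks/library/l-timers-list/index.html) article on hrtimer.

## Some observations starting from perf
I started by using `perf` to find where most of the time is spent and, together with information from `strace`, to guess what happens during `insmod`. First, run `perf`:
```shell
$ sudo perf record -g ./client
$ sudo perf report -G
```
The report below comes from a run in which the profiled command was `insmod`:
```
Samples: 9 of event 'cycles:ppp', Event count (approx.): 4057500
  Children      Self  Command  Shared Object      Symbol
+   99.39%     0.00%  insmod   [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+   99.39%     0.00%  insmod   [kernel.kallsyms]  [k] do_syscall_64
+   49.66%    49.66%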
insmod   [kernel.kallsyms]  [k] rb_insert_color
+   49.66%     0.00%  insmod   [unknown]          [k] 0x20646564616f6c00
+   49.66%     0.00%  insmod   libc-2.28.so       [.] syscall
+   49.66%     0.00%  insmod   [kernel.kallsyms]  [k] __x64_sys_finit_module
+   49.66%     0.00%  insmod   [kernel.kallsyms]  [k] __do_sys_finit_module
+   49.66%     0.00%  insmod   [kernel.kallsyms]  [k] load_module
+   49.66%     0.00%  insmod   [kernel.kallsyms]  [k] mod_sysfs_setup
+   49.66%     0.00%  insmod   [kernel.kallsyms]  [k] sysfs_create_group
+   49.66%     0.00%  insmod   [kernel.kallsyms]  [k] internal_create_group
+   49.66%     0.00%  insmod   [kernel.kallsyms]  [k] sysfs_add_file_mode_ns
+   49.66%     0.00%  insmod   [kernel.kallsyms]  [k] __kernfs_create_file
+   49.66%     0.00%  insmod   [kernel.kallsyms]  [k] kernfs_add_one
+   41.30%    41.30%  insmod   [kernel.kallsyms]  [k] vma_interval_tree_insert
+   41.30%     0.00%  insmod   ld-2.28.so         [.] mmap64
+   41.30%     0.00%  insmod   [kernel.kallsyms]  [k] __x64_sys_mmap
+   41.30%     0.00%  insmod   [kernel.kallsyms]  [k] ksys_mmap_pgoff
+   41.30%     0.00%  insmod   [kernel.kallsyms]  [k] vm_mmap_pgoff
+   41.30%     0.00%  insmod   [kernel.kallsyms]  [k] do_mmap
+   41.30%     0.00%  insmod   [kernel.kallsyms]  [k] mmap_region
+   41.30%     0.00%  insmod   [kernel.kallsyms]  [k] vma_link
+   41.30%     0.00%  insmod   [kernel.kallsyms]  [k] __vma_link_file
+    8.43%     8.43%  insmod   [kernel.kallsyms]  [k] move_page_tables
+    8.43%     0.00%  insmod   [unknown]          [k] 0x00007f4bf345d977
+    8.43%     0.00%  insmod   [kernel.kallsyms]  [k] __x64_sys_execve
+    8.43%     0.00%  insmod   [kernel.kallsyms]  [k] __do_execve_file.isra.41
```
Expand `do_syscall_64` to see what happens inside:
```
-   99.39%     0.00%  insmod   [kernel.kallsyms]  [k] do_syscall_64
   - do_syscall_64
      + 49.66% __x64_sys_finit_module
      + 41.30% __x64_sys_mmap
      + 8.43% __x64_sys_execve
```
The system calls with the largest overhead are the memory-related `mmap` and `finit_module`. Expanding `finit_module` shows what its call stack looks like:
```
+   99.39%     0.00%  insmod   [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
-   99.39%     0.00%  insmod   [kernel.kallsyms]  [k] do_syscall_64
   - do_syscall_64
      - 49.66%
__x64_sys_finit_module
           __do_sys_finit_module
           load_module
           mod_sysfs_setup
           sysfs_create_group
           internal_create_group
           sysfs_add_file_mode_ns
           __kernfs_create_file
           kernfs_add_one
           rb_insert_color
      + 41.30% __x64_sys_mmap
      + 8.43% __x64_sys_execve
+   49.66%    49.66%  insmod   [kernel.kallsyms]  [k] rb_insert_color
+   49.66%     0.00%  insmod   [unknown]          [k] 0x20646564616f6c00
...
```
`man` confirms that `finit_module` is a system call. To check that these functions really belong to it, look at how the system call is defined. In [kernel/module.c](https://elixir.bootlin.com/linux/latest/source/kernel/module.c#L3878) we find:
```c
SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
{
    struct load_info info = { };
    loff_t size;
    void *hdr;
    int err;

    err = may_init_module();
    if (err)
        return err;

    pr_debug("finit_module: fd=%d, uargs=%p, flags=%i\n", fd, uargs, flags);

    if (flags & ~(MODULE_INIT_IGNORE_MODVERSIONS
                  |MODULE_INIT_IGNORE_VERMAGIC))
        return -EINVAL;

    err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX,
                                   READING_MODULE);
    if (err)
        return err;
    info.hdr = hdr;
    info.len = size;

    return load_module(&info, uargs, flags);
}
```
What the `SYSCALL_DEFINE*` macros do can be seen in [arch/x86/include/asm/syscall_wrapper.h](https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/syscall_wrapper.h#L157). They all ultimately boil down to [__SYSCALL_DEFINEx](https://elixir.bootlin.com/linux/v5.0/source/arch/x86/include/asm/syscall_wrapper.h):
```c
/*
 * Instead of the generic __SYSCALL_DEFINEx() definition, this macro takes
 * struct pt_regs *regs as the only argument of the syscall stub named
 * __x64_sys_*(). It decodes just the registers it needs and passes them on to
 * the __se_sys_*() wrapper performing sign extension and then to the
 * __do_sys_*() function doing the actual job.
 * These wrappers and functions
 * are inlined (at least in very most cases), meaning that the assembly looks
 * as follows (slightly re-ordered for better readability):
 *
 * <__x64_sys_recv>:            <-- syscall with 4 parameters
 *      callq   <__fentry__>
 *
 *      mov     0x70(%rdi),%rdi <-- decode regs->di
 *      mov     0x68(%rdi),%rsi <-- decode regs->si
 *      mov     0x60(%rdi),%rdx <-- decode regs->dx
 *      mov     0x38(%rdi),%rcx <-- decode regs->r10
 *
 *      xor     %r9d,%r9d       <-- clear %r9
 *      xor     %r8d,%r8d       <-- clear %r8
 *
 *      callq   __sys_recvfrom  <-- do the actual work in __sys_recvfrom()
 *                                  which takes 6 arguments
 *
 *      cltq                    <-- extend return value to 64-bit
 *      retq                    <-- return
 *
 * This approach avoids leaking random user-provided register content down
 * the call chain.
 *
 * If IA32_EMULATION is enabled, this macro generates an additional wrapper
 * named __ia32_sys_*() which decodes the struct pt_regs *regs according
 * to the i386 calling convention (bx, cx, dx, si, di, bp).
 */
#define __SYSCALL_DEFINEx(x, name, ...)                                 \
    asmlinkage long __x64_sys##name(const struct pt_regs *regs);        \
    ALLOW_ERROR_INJECTION(__x64_sys##name, ERRNO);                      \
    static long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__));         \
    static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__));  \
    asmlinkage long __x64_sys##name(const struct pt_regs *regs)         \
    {                                                                   \
        return __se_sys##name(SC_X86_64_REGS_TO_ARGS(x,__VA_ARGS__));   \
    }                                                                   \
    __IA32_SYS_STUBx(x, name, __VA_ARGS__)                              \
    static long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__))          \
    {                                                                   \
        long ret = __do_sys##name(__MAP(x,__SC_CAST,__VA_ARGS__));      \
        __MAP(x,__SC_TEST,__VA_ARGS__);                                 \
        __PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__));               \
        return ret;                                                     \
    }                                                                   \
    static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))
```
The comment says that this macro declares 3 functions:
1. `__x64_sys##name`
2. `__se_sys##name`
3.
`__do_sys##name`

So we can confirm that functions such as `__x64_sys_finit_module` are indeed the functions behind the `finit_module` system call. Now back to `finit_module` itself (that is, the body written inside the `SYSCALL_DEFINE3()` macro). `man finit_module` says:
> **init_module()** loads an ELF image into kernel space, performs any necessary symbol relocations, initializes module parameters to values provided by the caller, and then runs the module's init function.
> ...
> The **finit_module()** system call is like init_module(), but reads the module to be loaded from the file descriptor fd.

Comparing `SYSCALL_DEFINE3(init_module ...)` with `SYSCALL_DEFINE3(finit_module ...)` shows that the two converge: both end with `return load_module`. Opening [load_module](https://elixir.bootlin.com/linux/latest/source/kernel/module.c#L3660):
```c
/* Allocate and load the module: note that size of section 0 is always
   zero, and we rely on this for optional sections. */
static int load_module(struct load_info *info, const char __user *uargs,
                       int flags)
{
    struct module *mod;
    long err = 0;
    char *after_dashes;

    err = elf_header_check(info);
    if (err)
        goto free_copy;

    err = setup_load_info(info, flags);
    if (err)
        goto free_copy;

    if (blacklisted(info->name)) {
        err = -EPERM;
        goto free_copy;
    }

    err = module_sig_check(info, flags);
    if (err)
        goto free_copy;

    err = rewrite_section_headers(info, flags);
    if (err)
        goto free_copy;

    /* Check module struct version now, before we try to use module. */
    if (!check_modstruct_version(info, info->mod)) {
        err = -ENOEXEC;
        goto free_copy;
    }
    ...
    return do_init_module(mod);

 sysfs_cleanup:
    mod_sysfs_teardown(mod);
 coming_cleanup:
    mod->state = MODULE_STATE_GOING;
    destroy_params(mod->kp, mod->num_kp);
    blocking_notifier_call_chain(&module_notify_list,
                                 MODULE_STATE_GOING, mod);
    ...
}
```

> My current understanding is that it performs layer upon layer of checks, allocates space, and so on. What exactly it does, and why these steps are needed, still has to be investigated.

# VFS
Next, run `perf` on `client`. The result:
```
Samples: 9 of event 'cycles:ppp', Event count (approx.): 4123826
  Children      Self  Command  Shared Object      Symbol
+   99.33%     0.00%  client   [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+   99.33%     0.00%  client   [kernel.kallsyms]  [k] do_syscall_64
+   48.74%    48.74%  client   [kernel.kallsyms]  [k] fib_sequence_naive
+   48.74%     0.00%  client   [unknown]          [k] 0x41fd89415541f689
+   48.74%     0.00%  client   libc-2.28.so       [.] __libc_start_main
+   48.74%     0.00%  client   libc-2.28.so       [.] read
+   48.74%     0.00%  client   [kernel.kallsyms]  [k] __x64_sys_read
+   48.74%     0.00%  client   [kernel.kallsyms]  [k] ksys_read
+   48.74%     0.00%  client   [kernel.kallsyms]  [k] vfs_read
+   48.74%     0.00%  client   [kernel.kallsyms]  [k] __vfs_read
+   41.50%    41.50%  client   [kernel.kallsyms]  [k] memcpy_erms
+   41.50%     0.00%  client   ld-2.28.so         [.] mmap64
+   41.50%     0.00%  client   [kernel.kallsyms]  [k] __x64_sys_mmap
+   41.50%     0.00%  client   [kernel.kallsyms]  [k] ksys_mmap_pgoff
+   41.50%     0.00%  client   [kernel.kallsyms]  [k] vm_mmap_pgoff
+   41.50%     0.00%  client   [kernel.kallsyms]  [k] do_mmap
+   41.50%     0.00%  client   [kernel.kallsyms]  [k] mmap_region
+   41.50%     0.00%  client   [kernel.kallsyms]  [k] perf_event_mmap
+   41.50%     0.00%  client   [kernel.kallsyms]  [k] perf_iterate_sb
+   41.50%     0.00%  client   [kernel.kallsyms]  [k] perf_iterate_ctx
+   41.50%     0.00%  client   [kernel.kallsyms]  [k] perf_event_mmap_output
+    9.09%     9.09%  client   [kernel.kallsyms]  [k] free_pgd_range
+    9.09%     0.00%  client   [unknown]          [k] 0x00007f05cb6cc977
```
Setting aside the memory-related parts, `fib_sequence_naive` shows up as a hot spot. Inspecting `do_syscall_64`:
```
-   99.33%     0.00%  client   [kernel.kallsyms]  [k] do_syscall_64
   - do_syscall_64
      + 48.74% __x64_sys_read
      + 41.50% __x64_sys_mmap
      + 9.09% __x64_sys_execve
```
Apart from `mmap`, which is related to memory mapping, the big hot spot is in `__x64_sys_read`. From the earlier discussion we know this belongs to the `read` system call. Expanding further:
```
-   99.33%     0.00%  client   [kernel.kallsyms]  [k] do_syscall_64
   - do_syscall_64
      - 48.74% __x64_sys_read
           ksys_read
           vfs_read
           __vfs_read
           fib_sequence_naive
```
So the call eventually reaches `fib_sequence_naive`. Opening the implementation in [fs/read_write.c](https://elixir.bootlin.com/linux/v5.0/source/fs/read_write.c#L412):
```c
ssize_t __vfs_read(struct file *file, char __user *buf, size_t count,
                   loff_t *pos)
{
    if (file->f_op->read)
        return file->f_op->read(file, buf, count, pos);
    else if (file->f_op->read_iter)
        return new_sync_read(file, buf, count, pos);
    else
        return -EINVAL;
}
```
and the definition of `struct file` in [include/linux/fs.h](https://elixir.bootlin.com/linux/v5.0/source/include/linux/fs.h#L901):
```c
struct file {
    union {
        struct llist_node   fu_llist;
        struct rcu_head     fu_rcuhead;
    } f_u;
    struct path     f_path;
    struct inode        *f_inode;   /* cached value */
    const struct file_operations    *f_op;

    /*
     * Protects f_ep_links, f_flags.
     * Must not be taken from IRQ context.
     */
    spinlock_t      f_lock;
    enum rw_hint        f_write_hint;
    atomic_long_t       f_count;
    unsigned int        f_flags;
    fmode_t         f_mode;
    struct mutex        f_pos_lock;
    loff_t          f_pos;
    struct fown_struct  f_owner;
    const struct cred   *f_cred;
    struct file_ra_state    f_ra;

    u64         f_version;
#ifdef CONFIG_SECURITY
    void            *f_security;
#endif
    /* needed for tty driver, and maybe others */
    void            *private_data;

#ifdef CONFIG_EPOLL
    /* Used by fs/eventpoll.c to link all the hooks to this file */
    struct list_head    f_ep_links;
    struct list_head    f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
    struct address_space    *f_mapping;
    errseq_t        f_wb_err;
} __randomize_layout
  __attribute__((aligned(4)));  /* lest something weird decides that 2 is OK */

struct file_handle {
    __u32 handle_bytes;
    int handle_type;
    /* file identifier */
    unsigned char f_handle[0];
};
```

## Miscellaneous
### Universal I/O Model
Linux follows the philosophy that "everything is a file". The Linux Programming Interface calls this "Universality of I/O":
> One of the distinguishing features of the UNIX I/O model is the concept of universality of I/O. This means that the same four system calls—open(), read(), write(), and close()—are used to perform I/O on all types of files, including devices such as terminals.
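
A small sketch of this universality: the helpers below (`read_some` and `write_some`, hypothetical names chosen for this example) use only `open()`, `read()`, `write()` and `close()`, and nothing in them depends on whether the path names a regular file or a device node. The sketch assumes a Linux system where `/dev/zero` and `/dev/null` exist.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Read up to n bytes from path; returns the byte count, or -1 on error. */
ssize_t read_some(const char *path, void *buf, size_t n)
{
    assert(path && buf);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t got = read(fd, buf, n);
    close(fd);
    return got;
}

/* Write n bytes to path (created if missing); returns the count or -1. */
ssize_t write_some(const char *path, const void *buf, size_t n)
{
    assert(path && buf);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    ssize_t put = write(fd, buf, n);
    close(fd);
    return put;
}
```

Calling `read_some("/dev/zero", buf, 8)` and `read_some` on an ordinary file go through the same `read` system call. With the module loaded, the same call on the driver's device node (assumed here to be `/dev/fibonacci`) would presumably follow the `__vfs_read` path shown above and end up in the `.read` handler from `fib_fops`.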
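As a result, seemingly different kinds of I/O (for example printing to a terminal vs. writing to a file) can all operate under this one consistent abstraction, since all of them can be treated as file operations. The purpose of a device driver is to provide the implementation behind this abstraction, so that the operating system can access the device through a uniform interface. Chapter 14 of the book says:
> A device special file corresponds to a device on the system. Within the kernel, each device type has a corresponding device driver, which handles all I/O requests for the device. A device driver is a unit of kernel code that implements a set of operations that (normally) correspond to input and output actions on an associated piece of hardware. The API provided by device drivers is fixed, and includes operations corresponding to the system calls open(), close(), read(), write(), mmap(), and ioctl().

These operations are, by and large, the file-descriptor-based ones such as `open()`, `read()`, `write()` and `lseek()`; their man pages have the details.

## File Descriptor & Open Files
When performing I/O with functions such as `open()`, `read()` and `write()`, the object being operated on is always a file descriptor. In practice, three levels are involved:
1. The file descriptor as seen by each process
2. The open file table
3. The inode: the actual inode of the "file"

![](https://i.imgur.com/ZA2BYTX.png)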
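
The difference between a file descriptor and an open file table entry can be observed from user space. The sketch below is only an illustration under the assumption of a POSIX system; the function names are hypothetical. It reads through one descriptor and then asks for the file offset through another: descriptors created with `dup()` share one open file table entry (and therefore one offset), while a second `open()` of the same path gets its own entry even though both refer to the same inode.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Read n bytes through fd a, then report the file offset seen via fd b. */
static off_t offset_seen_by(int a, int b, size_t n)
{
    char buf[64];
    assert(n <= sizeof buf);
    if (read(a, buf, n) < 0)
        return -1;
    return lseek(b, 0, SEEK_CUR);
}

/* dup(): two descriptors, one open file table entry, shared offset. */
off_t offset_via_dup(const char *path, size_t n)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int copy = dup(fd);
    if (copy < 0) {
        close(fd);
        return -1;
    }
    off_t off = offset_seen_by(fd, copy, n);
    close(copy);
    close(fd);
    return off;
}

/* Second open(): two open file table entries, independent offsets. */
off_t offset_via_second_open(const char *path, size_t n)
{
    int fd1 = open(path, O_RDONLY);
    if (fd1 < 0)
        return -1;
    int fd2 = open(path, O_RDONLY);
    if (fd2 < 0) {
        close(fd1);
        return -1;
    }
    off_t off = offset_seen_by(fd1, fd2, n);
    close(fd2);
    close(fd1);
    return off;
}
```

For a regular file of at least 4 bytes, `offset_via_dup(path, 4)` reports an offset of 4, because the read through the original descriptor moved the shared offset, whereas `offset_via_second_open(path, 4)` reports 0.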