linux2024-homework6

# 2024q1 Homework6 (integration)
contributed by < [`SuNsHiNe-75`](https://github.com/SuNsHiNe-75) >

## About 〈[Linux 核心模組運作原理](https://hackmd.io/@sysprog/linux-kernel-module)〉

### How the kernel finds a module's symbols after `insmod`

First, note that the main purpose of `insmod` is to load a module into the kernel. Tracing `insmod` with [strace](https://linux.die.net/man/1/strace) shows which system calls it issues; the one that matters is `finit_module`. Inside recent kernels, the handler for that system call calls `idempotent_init_module`, which in turn calls `init_module_from_file`, and only `init_module_from_file` actually calls **`load_module`**.

> `load_module`: roughly, this is where the kernel allocates memory for the module and loads the module's data.

In v4.18 the file that contains it, [kernel/module.c](https://elixir.bootlin.com/linux/v4.18/source/kernel/module.c), also has symbol-lookup functions such as `lookup_module_symbol_name` and `lookup_module_symbol_attrs`. Part of the code is shown below:

```c=
int lookup_module_symbol_name(unsigned long addr, char *symname)
{
	//...
	list_for_each_entry_rcu(mod, &modules, list) {
		if (mod->state == MODULE_STATE_UNFORMED)
			continue;
		if (within_module(addr, mod)) {
			const char *sym;

			sym = get_ksymbol(mod, addr, NULL, NULL);
			//...
		}
	}
	//...
}
```

The `get_ksymbol` call inside the loop is where the symbol handling happens. `get_ksymbol` is defined in the same file, [kernel/module.c](https://elixir.bootlin.com/linux/v4.18/source/kernel/module.c), as follows:

```c=
static const char *get_ksymbol(struct module *mod,
			       unsigned long addr,
			       unsigned long *size,
			       unsigned long *offset)
{
	unsigned int i, best = 0;
	unsigned long nextval;
	struct mod_kallsyms *kallsyms = rcu_dereference_sched(mod->kallsyms);

	/* At worse, next value is at end of module */
	if (within_module_init(addr, mod))
		nextval = (unsigned long)mod->init_layout.base+mod->init_layout.text_size;
	else
		nextval = (unsigned long)mod->core_layout.base+mod->core_layout.text_size;

	/* Scan for closest preceding symbol, and next symbol. (ELF
	   starts real symbols at 1). */
	for (i = 1; i < kallsyms->num_symtab; i++) {
		if (kallsyms->symtab[i].st_shndx == SHN_UNDEF)
			continue;

		/* We ignore unnamed symbols: they're uninformative
		 * and inserted at a whim.
		 */
		if (*symname(kallsyms, i) == '\0'
		    || is_arm_mapping_symbol(symname(kallsyms, i)))
			continue;

		if (kallsyms->symtab[i].st_value <= addr
		    && kallsyms->symtab[i].st_value > kallsyms->symtab[best].st_value)
			best = i;
		if (kallsyms->symtab[i].st_value > addr
		    && kallsyms->symtab[i].st_value < nextval)
			nextval = kallsyms->symtab[i].st_value;
	}

	if (!best)
		return NULL;

	if (size)
		*size = nextval - kallsyms->symtab[best].st_value;
	if (offset)
		*offset = addr - kallsyms->symtab[best].st_value;
	return symname(kallsyms, best);
}
```

:::danger
Watch your terminology!
:::

My reading is that this function maps an address back to a symbol inside a module. It first sets an upper bound `nextval` at the end of the module's init or core text, depending on which of the two regions contains the address. It then walks (<s>遍歷</s>) the symbol table starting from index 1, tracking the symbol that is closest to the address from below (`best`) and the nearest symbol above the address (`nextval`). Finally it returns the name of `best` and, when the caller asks for them, reports the symbol's size (`nextval` minus the symbol's address) and the offset of the address within the symbol.

:::danger
Use UML and QEMU, introduced in week 7, to trace the kernel's behavior and verify this.
:::

Also in [kernel/module.c](https://elixir.bootlin.com/linux/v4.18/source/kernel/module.c), the `find_symbol` function calls `each_symbol_section`, excerpted below:

```c=
/* Returns true as soon as fn returns true, otherwise false. */
bool each_symbol_section(bool (*fn)(const struct symsearch *arr,
				    struct module *owner,
				    void *data),
			 void *data)
{
	struct module *mod;
	static const struct symsearch arr[] = {
		{ __start___ksymtab, __stop___ksymtab, __start___kcrctab,
		  NOT_GPL_ONLY, false },
		{ __start___ksymtab_gpl, __stop___ksymtab_gpl,
		  __start___kcrctab_gpl,
		  GPL_ONLY, false },
		{ __start___ksymtab_gpl_future, __stop___ksymtab_gpl_future,
		  __start___kcrctab_gpl_future,
		  WILL_BE_GPL_ONLY, false },
#ifdef CONFIG_UNUSED_SYMBOLS
		{ __start___ksymtab_unused, __stop___ksymtab_unused,
		  __start___kcrctab_unused,
		  NOT_GPL_ONLY, true },
		{ __start___ksymtab_unused_gpl, __stop___ksymtab_unused_gpl,
		  __start___kcrctab_unused_gpl,
		  GPL_ONLY, true },
#endif
	};

	module_assert_mutex_or_preempt();

	if (each_symbol_in_section(arr, ARRAY_SIZE(arr), NULL, fn, data))
		return true;

	list_for_each_entry_rcu(mod, &modules, list) {
		struct symsearch arr[] = {
			{ mod->syms, mod->syms + mod->num_syms, mod->crcs,
			  NOT_GPL_ONLY, false },
			{ mod->gpl_syms, mod->gpl_syms + mod->num_gpl_syms,
			  mod->gpl_crcs, GPL_ONLY,
			  false },
			{ mod->gpl_future_syms,
			  mod->gpl_future_syms + mod->num_gpl_future_syms,
			  mod->gpl_future_crcs,
			  WILL_BE_GPL_ONLY, false },
#ifdef CONFIG_UNUSED_SYMBOLS
			{ mod->unused_syms,
			  mod->unused_syms + mod->num_unused_syms,
			  mod->unused_crcs,
			  NOT_GPL_ONLY, true },
			{ mod->unused_gpl_syms,
			  mod->unused_gpl_syms + mod->num_unused_gpl_syms,
			  mod->unused_gpl_crcs,
			  GPL_ONLY, true },
#endif
		};

		if (mod->state == MODULE_STATE_UNFORMED)
			continue;

		if (each_symbol_in_section(arr, ARRAY_SIZE(arr), mod, fn, data))
			return true;
	}
	return false;
}
```

This function applies the given callback `fn` to groups of exported-symbol tables. It first walks (<s>遍歷</s>) the kernel's own built-in tables (the static `arr`), then uses `list_for_each_entry_rcu` to walk the list of loaded modules, skipping any module still in `MODULE_STATE_UNFORMED`, and applies the same callback to each module's tables.

> If any call to `fn` returns true, `each_symbol_section` returns true as well; otherwise it returns false.

To sum up, I believe that when `insmod` loads a module's data into the kernel, the kernel links the symbols that the module leaves undefined against the kernel's symbol table (such as the `symtab[]` in the code above).

:::danger
Reading code alone easily leads to the "舉燭" trap (reading meaning into the text that was never there).
:::

When a symbol is needed, the modules are walked with `list_for_each_entry_rcu`. If a matching symbol is present in a symbol table, the lookup succeeds; otherwise, the undefined symbol is linked into the kernel's symbol table.

> RCU (Read-Copy Update) is an efficient synchronization mechanism, commonly used on multi-core systems for reading shared data: readers proceed without taking a lock, which improves performance and reduces memory (<s>內存</s>) consumption.

:::danger
Watch your terminology!
:::

### How the license given to the `MODULE_LICENSE` macro affects the kernel

The definition of the `MODULE_LICENSE` macro and its comment in [include/linux/module.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/module.h#L199) are as follows:

```c
/*
 * The following license idents are currently accepted as indicating free
 * software modules
 *
 *	"GPL"				[GNU Public License v2 or later]
 *	"GPL v2"			[GNU Public License v2]
 *	"GPL and additional rights"	[GNU Public License v2 rights and more]
 *	"Dual BSD/GPL"			[GNU Public License v2
 *					 or BSD license choice]
 *	"Dual MIT/GPL"			[GNU Public License v2
 *					 or MIT license choice]
 *	"Dual MPL/GPL"			[GNU Public License v2
 *					 or Mozilla license choice]
 *
 * The following other idents are available
 *
 *	"Proprietary"			[Non free products]
 *
 * There are dual licensed components, but when running with Linux it is the
 * GPL that is relevant so this is a non issue.
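 * Similarly LGPL linked with GPL
 * is a GPL combined work.
 *
 * This exists for several reasons
 * 1.	So modinfo can show license info for users wanting to vet their setup
 *	is free
 * 2.	So the community can ignore bug reports including proprietary modules
 * 3.	So vendors can do likewise based on their own policies
 */
#define MODULE_LICENSE(_license) MODULE_INFO(license, _license)
```

[Linux 核心模組掛載機制](https://hackmd.io/@sysprog/linux-kernel-module#Linux-%E6%A0%B8%E5%BF%83%E6%A8%A1%E7%B5%84%E6%8E%9B%E8%BC%89%E6%A9%9F%E5%88%B6) expands `MODULE_INFO` using `MODULE_AUTHOR` as its example. By the same expansion, `MODULE_LICENSE` writes the string it is given (for example `Dual MIT/GPL`) into the `.modinfo` section to declare the module's license.

> According to the GCC documentation on variable attributes, `section` places the variable in the specified ELF section, which here is `.modinfo`.

GPL stands for [GNU General Public License](https://zh.wikipedia.org/zh-tw/GNU%E9%80%9A%E7%94%A8%E5%85%AC%E5%85%B1%E8%AE%B8%E5%8F%AF%E8%AF%81). Judging from the comment on the macro above, my inference is that if the license is **GPL v2** or later, the module is treated as free software: it can interact with the kernel, and its symbols are included in the kernel's symbol table for other modules to use.

:::danger
Why does this mechanism exist?
:::

Conversely, if the license is **Proprietary**, the module is treated as proprietary software and the kernel does not provide access to its symbols, meaning other modules cannot use them. Note, however, that the `GPL_ONLY` tables in `each_symbol_section` above point to a restriction in the other direction as well: symbols exported as GPL-only are resolved only for modules that declare a GPL-compatible license. In short, I believe the declared license affects which symbols are usable across modules, so the correct license must be declared to keep a module compatible with the kernel.

:::danger
What first-hand material backs this up? Look through the Linux Kernel Mailing List (LKML).
:::

### Tracing module loading with [strace](https://man7.org/linux/man-pages/man1/strace.1.html): which system calls and subsystems are involved

According to [strace(1)](https://man7.org/linux/man-pages/man1/strace.1.html), what the strace command does can be summarized as follows:

> Each line in the trace contains the system call name, followed by its arguments in parentheses and its return value.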
- Signals are printed as a signal symbol together with the decoded siginfo structure (the man page illustrates this by interrupting a `sleep` command).
- If a system call is in progress while another one is issued from a different thread or process, `strace` tries to preserve the order of the events and marks the ongoing call as `unfinished`; when that call returns, it is marked as `resumed`.
- Arguments are printed in symbolic form.
- Structure pointers are dereferenced and their members are displayed as appropriate.
- System calls unknown to `strace` are printed raw: the system call number is printed in hexadecimal and prefixed with "`syscall_`". For example:

```cshell
syscall_0xbad(0x1, 0x2, 0x3, 0x4, 0x5, 0x6) = -1 ENOSYS (Function not implemented)
```

- Character pointers are dereferenced and printed as C strings.
- Pointers to basic types and arrays are printed using square brackets (`[]`).

For a concrete example of system calls, see the trace of `insmod fibdrv.ko` in [Linux 核心模組掛載機制](https://hackmd.io/@sysprog/linux-kernel-module#Linux-%E6%A0%B8%E5%BF%83%E6%A8%A1%E7%B5%84%E6%8E%9B%E8%BC%89%E6%A9%9F%E5%88%B6):

```shell
execve("/sbin/insmod", ["insmod", "fibdrv.ko"], 0x7ffeab43f308 /* 25 vars */) = 0
brk(NULL)                               = 0x561084511000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=83948, ...}) = 0
mmap(NULL, 83948, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0621290000
close(3)                                = 0
...
close(3)                                = 0
getcwd("/tmp/fibdrv", 4096)             = 24
stat("/tmp/fibdrv/fibdrv.ko", {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0
openat(AT_FDCWD, "/tmp/fibdrv/fibdrv.ko", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0
mmap(NULL, 8288, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f06212a2000
finit_module(3, "", 0)                  = 0
munmap(0x7f06212a2000, 8288)            = 0
close(3)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++
```

## About 〈[The Linux Kernel Module Programming Guide](https://sysprog21.github.io/lkmpg/)〉

### How mutex locks are used in the [simrupt](https://github.com/sysprog21/simrupt) code

The code uses mutex locks in two ways: `read_lock` on its own, and `producer_lock` working together with `consumer_lock`. I describe them separately below.

#### `read_lock`

The mutex is defined and documented as follows:

```c
/* NOTE: the usage of kfifo is safe (no need for extra locking), until there is
 * only one concurrent reader and one concurrent writer.
 * Writes are serialized
 * from the interrupt context, readers are serialized using this mutex.
 */
static DEFINE_MUTEX(read_lock);
```

> Note the comment: readers are serialized using this mutex.

This mutex is used in the `simrupt_read` function:

```c=
static ssize_t simrupt_read(struct file *file,
                            char __user *buf,
                            size_t count,
                            loff_t *ppos)
{
    unsigned int read;
    int ret;

    pr_debug("simrupt: %s(%p, %zd, %lld)\n", __func__, buf, count, *ppos);

    if (unlikely(!access_ok(buf, count)))
        return -EFAULT;

    if (mutex_lock_interruptible(&read_lock))
        return -ERESTARTSYS;

    do {
        ret = kfifo_to_user(&rx_fifo, buf, count, &read);
        if (unlikely(ret < 0))
            break;
        if (read)
            break;
        if (file->f_flags & O_NONBLOCK) {
            ret = -EAGAIN;
            break;
        }
        ret = wait_event_interruptible(rx_wait, kfifo_len(&rx_fifo));
    } while (ret == 0);
    pr_debug("simrupt: %s: out %u/%u bytes\n", __func__, read,
             kfifo_len(&rx_fifo));

    mutex_unlock(&read_lock);

    return ret ? ret : read;
}
```

Lines 14 and 15 acquire the mutex `read_lock` (via `mutex_lock_interruptible`) to protect the subsequent accesses to the shared resource `rx_fifo`. If the wait for the lock is interrupted, for example by a signal, the function returns the error `-ERESTARTSYS`. Once the function is done with `rx_fifo`, line 32 releases `read_lock` with `mutex_unlock` so that another process can access the FIFO if it wants to.

#### `producer_lock` & `consumer_lock`

The mutexes are defined and documented as follows:

```c
/* Mutex to serialize kfifo writers within the workqueue handler */
static DEFINE_MUTEX(producer_lock);

/* Mutex to serialize fast_buf consumers: we can use a mutex because consumers
 * run in workqueue handler (kernel thread context).
 */
static DEFINE_MUTEX(consumer_lock);
```

These mutexes are used in the `simrupt_work_func` function:

```c=
/* Workqueue handler: executed by a kernel thread */
static void simrupt_work_func(struct work_struct *w)
{
    int val, cpu;

    /* This code runs from a kernel thread, so softirqs and hard-irqs must
     * be enabled.
     */
    WARN_ON_ONCE(in_softirq());
    WARN_ON_ONCE(in_interrupt());

    /* Pretend to simulate access to per-CPU data, disabling preemption
     * during the pr_info().
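     */
    cpu = get_cpu();
    pr_info("simrupt: [CPU#%d] %s\n", cpu, __func__);
    put_cpu();

    while (1) {
        /* Consume data from the circular buffer */
        mutex_lock(&consumer_lock);
        val = fast_buf_get();
        mutex_unlock(&consumer_lock);

        if (val < 0)
            break;

        /* Store data to the kfifo buffer */
        mutex_lock(&producer_lock);
        produce_data(val);
        mutex_unlock(&producer_lock);
    }
    wake_up_interruptible(&rx_wait);
}
```

`consumer_lock` protects reads from the circular buffer. As its comment says, it serializes the `fast_buf` consumers: the handler takes `consumer_lock` with `mutex_lock`, fetches one item from `fast_buf`, then releases the lock with `mutex_unlock` so that other threads get a chance to read from the buffer. This keeps handlers that run concurrently from racing on the same entry, which avoids a **race condition**. Likewise, `producer_lock` protects writes to the kfifo buffer; the reasoning is the same as for `consumer_lock`, and it also prevents races with other threads.

:::danger
The code above has room for improvement. Do not rush into "舉燭" as soon as you see code; you need to be able to reason about its behavior and limits and to establish them by experiment.
:::

#### Rewriting as [lock-free](https://hackmd.io/@sysprog/concurrency-lockfree)

Lock-free here means using mechanisms such as atomic operations or compare-and-swap (CAS) to make accesses to data **indivisible**. The example in [Lock-Free Programming](https://hackmd.io/@sysprog/concurrency-lockfree#Deeper-Look-Lock-Free-Programming) puts it well:

> Changes to the data are normally invisible to outsiders, or at least should be (they only see the state before the transfer); what an atomic write does is "make all the changes you made visible to everyone" (that is, show the balance after the transfer).

Also note that lock-less and lock-free overlap in meaning but are defined differently. Lock-less, as the name says, means ensuring that threads access data correctly without using locks. Not using locks does not by itself make code lock-free, though: lock-free means, roughly, that the program as a whole cannot be held up by any single one of its threads.

See the section in [2021q3 Homework (simrupt)](https://hackmd.io/@linD026/simrupt-vwifi#%E6%94%B9%E5%AF%AB-producer_lock-%E5%92%8C-consumer_lock).

> kfifo and the circular buffer both run into concurrency problems: the former only supports one reader and one writer, while the latter only provides the data structure and APIs for things like computing usage.

Therefore, the part where `consumer_lock` protects access to `fast_buf` can be rewritten by introducing atomic operations into the `fast_buf` helper functions `fast_buf_get` and `fast_buf_put`, which realizes the lock-free idea.

> Code reference: [改寫 `producer_lock` 和 `consumer_lock`](https://hackmd.io/@linD026/simrupt-vwifi#%E6%94%B9%E5%AF%AB-producer_lock-%E5%92%8C-consumer_lock)

As for `producer_lock`, the workqueue can lead to several writers trying to access the kfifo buffer at the same time, so the mutex has to stay: `producer_lock` cannot be removed.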
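To make the idea of replacing `consumer_lock` with atomic operations more concrete, here is a minimal user-space sketch written with C11 `<stdatomic.h>`. It is not the simrupt code, not the implementation from the linked note, and not kernel API; in the kernel one would reach for primitives such as `cmpxchg` and `smp_load_acquire` instead. The names `struct ring`, `ring_put`, `ring_get` and `RING_SIZE` are hypothetical. The sketch assumes a single producer (like the interrupt side feeding `fast_buf`), any number of consumers (like concurrent workqueue handlers), and non-negative values, so that `-1` can signal "empty" in the same way `simrupt_work_func` checks `val < 0`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 8u /* hypothetical capacity; must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

static_assert((RING_SIZE & RING_MASK) == 0, "RING_SIZE must be a power of two");

struct ring {
    atomic_int slot[RING_SIZE];
    atomic_uint head; /* next index to write; advanced only by the producer */
    atomic_uint tail; /* next index to read; advanced by consumers via CAS */
};

/* Single producer: store one value, or return false if the ring is full. */
bool ring_put(struct ring *r, int val)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return false;

    atomic_store_explicit(&r->slot[head & RING_MASK], val,
                          memory_order_relaxed);
    /* Release: the slot write becomes visible before the new head does. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}

/* Any number of consumers: take one value, or return -1 if the ring is empty. */
int ring_get(struct ring *r)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    for (;;) {
        unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);

        if (tail == head)
            return -1;

        int val = atomic_load_explicit(&r->slot[tail & RING_MASK],
                                       memory_order_relaxed);

        /* Claim the slot. On failure another consumer won the race, `tail`
         * is reloaded with the current value, and we simply try again.
         */
        if (atomic_compare_exchange_weak_explicit(&r->tail, &tail, tail + 1u,
                                                  memory_order_acq_rel,
                                                  memory_order_acquire))
            return val;
    }
}
```

A caller would declare the ring with static storage (so every field starts at zero), call `ring_put` from the one producer, and call `ring_get` from each consumer until it returns `-1`. A consumer only retries when another consumer has just succeeded, so some thread always makes progress, which is the property the lock-free definition above asks for. Whether this is actually faster than the mutex in simrupt would still have to be measured.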