# LKMPG Study Notes

**Study notes from reading LKMPG.** If you spot a mistake, corrections are welcome. Thanks!

[The Linux Kernel Module Programming Guide](https://sysprog21.github.io/lkmpg/#scheduling-tasks)

# Introduction

A Linux kernel module is a piece of code that can be **loaded and unloaded dynamically** on demand. Modules typically extend the functionality of the Linux kernel without requiring a reboot.

List the modules currently loaded in the kernel:

```shell
$ lsmod
```

Search for a specific module, for example the fat module:

```shell
$ lsmod | grep fat
```

### SecureBoot

Most computers today ship with UEFI SecureBoot enabled. It is a security standard whose purpose is to ensure that the system boots only software trusted by the original equipment manufacturer. You can turn it off either by disabling the SecureBoot option in the BIOS or by using `mokutil`:

```shell
sudo apt install mokutil
sudo mokutil --disable-validation
```

:::info
When Secure Boot is enabled, the Linux kernel only loads kernel modules (.ko) that carry a valid digital signature. Modules that we compile or modify by hand are unsigned by default, so loading them is refused (insmod / modprobe fails).
:::

Only after disabling it can we load and test our own modules.

## Headers

Before starting, install the kernel header files:

```shell
sudo apt-get install linux-headers-`uname -r`
```

## Writing a Hello World! module

Create `hello.c`:

```c
#include <linux/init.h>
#include <linux/module.h>

MODULE_LICENSE("Dual BSD/GPL");

static int __init hello_init(void)
{
    printk(KERN_INFO "Hello, world\n");
    return 0;
}

static void __exit hello_exit(void)
{
    printk(KERN_INFO "Goodbye, cruel world\n");
}

module_init(hello_init);
module_exit(hello_exit);
```

Create the `Makefile`:

```
obj-m := hello.o

clean:
	rm -rf *.o *.ko *.mod.* *.symvers *.order *.mod.cmd *.mod
```

The make command:

```shell
$ make -C /lib/modules/`uname -r`/build M=`pwd` modules
```

A successful build produces a `hello.ko` module, which can be inspected with:

```shell
$ modinfo hello.ko
```

```
filename:       /home/user/kmod/hello.ko
license:        Dual BSD/GPL
srcversion:     F387861272F5CA27DA088DC
depends:
retpoline:      Y
name:           hello
vermagic:       6.11.0-19-generic SMP preempt mod_unload modversions
```

Next, load the module:

```shell
$ sudo insmod hello.ko
```

The loaded module now shows up in `lsmod`. Check the output in /var/log/kern.log:

```
2025-05-10T18:20:22.957234+08:00 user kernel: Hello, world
```

Unload the kernel module:

```shell
$ sudo rmmod hello
```

Check /var/log/kern.log again:

```
2025-05-10T18:22:26.210266+08:00 user kernel: Goodbye, cruel world
```

Every kernel module needs two basic functions, registered through `module_init()` and `module_exit()`. The function registered with `module_init()` is called when the module is loaded into the kernel; it either registers a handler with the kernel or replaces one of the kernel's functions with its own code. The function registered with `module_exit()` is called just before the module is removed and undoes whatever the init function did, so that the module can be unloaded safely.

:::info
Linux coding style indents with tabs, not spaces.
:::

## printk

[Notes on printk](https://huenlil.pixnet.net/blog/post/23271426)

Originally, the kernel printed messages with `printk` plus a priority such as `KERN_INFO`. `KERN_INFO` is the severity of the log entry (informational); others include `KERN_ERR` (error) and `KERN_WARNING` (warning).

```c
printk(KERN_INFO "Hello from kernel\n");
```

Later, shorthand print macros such as `pr_info` and `pr_debug` were introduced:

```c
pr_info("Hello from kernel\n");
```

## LICENSE

A few macros let us declare how a module is licensed, for example "GPL" or "Dual BSD/GPL". These macros are defined in `include/linux/module.h`. The `MODULE_LICENSE` macro names the license we use:

```
MODULE_LICENSE("GPL");
MODULE_AUTHOR("LKMPG");
MODULE_DESCRIPTION("A sample driver");
```

## Passing command-line arguments to a module

Modules can also take command-line arguments. To let your module receive one, first declare a global variable that will hold the value passed on the command line. `module_param()` then makes it possible to supply the value at `insmod` time. The macro takes 3 arguments: the variable name, its data type, and the permissions of the corresponding file in sysfs.

```c
int myint = 3;
module_param(myint, int, 0);
```

Arrays are supported as well through `module_param_array()`, which takes an extra third argument: a pointer to a variable that receives the number of elements passed in. `MODULE_PARM_DESC()` attaches a description to each parameter; `modinfo` lists these descriptions for the module. Example:

<details>
<summary>Full code</summary>

```c
#include <linux/init.h>
#include <linux/kernel.h> /* for ARRAY_SIZE() */
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/printk.h>
#include <linux/stat.h>

MODULE_LICENSE("GPL");

static short int myshort = 1;
static int myint = 420;
static long int mylong = 9999;
static char *mystring = "blah";
static int myintarray[2] = { 420, 420 };
static int arr_argc = 0;

/* module_param(foo, int, 0000)
 * The first param is the parameter's name.
 * The second param is its data type.
 * The final argument is the permissions bits,
 * for exposing parameters in sysfs (if non-zero) at a later stage.
 */
module_param(myshort, short, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP);
MODULE_PARM_DESC(myshort, "A short integer");
module_param(myint, int, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
MODULE_PARM_DESC(myint, "An integer");
module_param(mylong, long, S_IRUSR);
MODULE_PARM_DESC(mylong, "A long integer");
module_param(mystring, charp, 0000);
MODULE_PARM_DESC(mystring, "A character string");

/* module_param_array(name, type, num, perm);
 * The first param is the parameter's (in this case the array's) name.
 * The second param is the data type of the elements of the array.
 * The third argument is a pointer to the variable that will store the number
 * of elements of the array initialized by the user at module loading time.
 * The fourth argument is the permission bits.
 */
module_param_array(myintarray, int, &arr_argc, 0000);
MODULE_PARM_DESC(myintarray, "An array of integers");

static int __init hello_5_init(void)
{
    int i;

    pr_info("Hello, world 5\n=============\n");
    pr_info("myshort is a short integer: %hd\n", myshort);
    pr_info("myint is an integer: %d\n", myint);
    pr_info("mylong is a long integer: %ld\n", mylong);
    pr_info("mystring is a string: %s\n", mystring);

    for (i = 0; i < ARRAY_SIZE(myintarray); i++)
        pr_info("myintarray[%d] = %d\n", i, myintarray[i]);

    pr_info("got %d arguments for myintarray.\n", arr_argc);
    return 0;
}

static void __exit hello_5_exit(void)
{
    pr_info("Goodbye, world 5\n");
}

module_init(hello_5_init);
module_exit(hello_5_exit);
```

</details>

Pass the arguments when loading the module:

```shell
$ sudo insmod hello-5.ko mystring="babe" myintarray=12,-1
$ sudo dmesg | tail -7
```

```
[214907.640336] myshort is a short integer: 1
[214907.640338] myint is an integer: 420
[214907.640339] mylong is a long integer: 9999
[214907.640341] mystring is a string: babe
[214907.640343] myintarray[0] = 12
[214907.640345] myintarray[1] = -1
[214907.640346] got 2 arguments for myintarray.
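```

# Prerequisites

## module begin

A program running in user space starts at `main()`. A kernel module instead starts at the function registered with `module_init`. This entry point tells the kernel what functionality the module provides and sets the kernel up to call the module's functions when needed. Once that is done, the entry function returns and the module stays inactive until the kernel needs it.

## module end

Every module ends by calling either `cleanup_module` or the function designated with `module_exit`, which serves as the module's exit function.

## function available for module

Functions called from a kernel module, such as `pr_info`, are not linked the way an ordinary program links against the libraries it includes. A module is an object file and can only call symbols exported by the kernel (such as `printk()` or `pr_info()`). The definitions of those symbols come from the running kernel, not from the module itself. /proc/kallsyms lists the symbols currently available to modules.

## Name Space

Even the smallest kernel module is linked against the entire kernel and shares a single symbol table with it. If a name in your module collides with a name elsewhere in the kernel or in another module, the result is a symbol clash. The best way to avoid this is to declare all variables as `static`, because in C `static` limits the scope of a variable or function to the file it is defined in.

## Device Driver

One class of module is the device driver, which provides functionality for hardware such as a serial port. On UNIX, each piece of hardware is represented by a file under /dev, called a device file. The device driver is what lets user programs interact with the hardware. For example, the es1370.ko sound card device driver might connect the /dev/sound device file to the Ensoniq IS1370 sound card.

```shell
$ ls -l /dev/sda[1-3]
brw-rw---- 1 root disk 8, 1 Apr 9 2025 /dev/sda1
brw-rw---- 1 root disk 8, 2 Apr 9 2025 /dev/sda2
brw-rw---- 1 root disk 8, 3 Apr 9 2025 /dev/sda3
```

The first number is the device's major number and the second is its minor number. The major number tells you which driver is used to access the hardware. Each driver is assigned a unique major number, and all device files with the same major number are controlled by the same driver. All the major numbers above are 8 because the same driver controls them. The minor number lets the driver tell apart the individual pieces of hardware it controls. In the example above, all three devices are handled by one driver, but they have distinct minor numbers because the driver sees them as different pieces of hardware.

Devices come in two kinds: **character devices** and **block devices**. Block devices have a buffer for requests, so they can choose the best order in which to respond. This matters a lot for storage devices, where reading or writing neighboring regions is more efficient. Block devices also only accept input and return output in blocks. Character devices can use any number of bytes, and most devices are of this kind. The first character of the `ls -l` output tells you which kind a device is.

```shell
user@user:~/kmod$ ls -l /dev/tty1
crw--w---- 1 root tty 4, 1 May 8 12:22 /dev/tty1
```

The first character is 'c', so this is a character device.

`cat /proc/devices` shows the major numbers and device type names currently registered on the system.

When the system was installed, all device files were created with `mknod`. To create a new character device named coffee with major / minor number 12 and 2, simply run:

```shell!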
$ mknod /dev/coffee c 12 2
```

:::info
Device files do not have to live in /dev, but that is the convention.
:::

# Character Device drivers

## The file_operations Structure

The file_operations structure is defined in [include/linux/fs.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/fs.h). Each field is a function pointer to a function that you define in your driver to perform some operation on the device (read, write, open, close, and so on). Here is the [full definition](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/fs.h#n2129) of the structure. Its members can be initialized with C99 [designated initializers](https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html):

```c
struct file_operations fops = {
    .read = device_read,
    .write = device_write,
    .open = device_open,
    .release = device_release
};
```

Members that are not assigned explicitly are initialized to NULL (C99 zero-initializes omitted members).

## Registering A Device

Adding a driver to the system means registering it with the kernel, which amounts to assigning it a major number during module initialization. This is done with `register_chrdev`, declared in the same fs.h header.

```c
int register_chrdev(unsigned int major, const char *name, struct file_operations *fops);
```

Here `major` is the major number we want to request, `name` is the device name that will appear in `/proc/devices`, and `fops` points to the driver's file_operations table. A negative return value means registration failed. So how do we pick a major number that is not already in use?
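As a brief aside on the designated-initializer form of `fops` shown earlier in this section, here is a minimal user-space sketch of the same idea. It assumes a made-up `struct demo_ops` standing in for `struct file_operations`; every `demo_*` name is hypothetical and not part of the kernel API. It only illustrates that members left out of the initializer end up as NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct file_operations. */
struct demo_ops {
    int (*open)(void);
    int (*release)(void);
    long (*read)(char *buf, size_t len);
    long (*write)(const char *buf, size_t len);
};

/* Hypothetical handlers, only here to have something to point at. */
static int demo_open(void)
{
    return 0;
}

static long demo_read(char *buf, size_t len)
{
    assert(buf != NULL);
    if (len == 0)
        return 0;
    buf[0] = 'x'; /* pretend the device produced one byte */
    return 1;
}

/* Only .read and .open are named; order does not matter, and the
 * members that are not mentioned (.release, .write) are zeroed. */
const struct demo_ops demo_fops = {
    .read = demo_read,
    .open = demo_open,
};

/* Returns 1 if the table provides a write handler, 0 otherwise. */
int demo_supports_write(const struct demo_ops *ops)
{
    return ops->write != NULL;
}
```

A caller can check a slot before using it, which is how an unset operation can be told apart from a provided one in this sketch: `demo_supports_write(&demo_fops)` yields 0 because `.write` was never assigned.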
You can ask the kernel for a dynamic major number. If you pass 0 as the major number to `register_chrdev`, the return value is the dynamically allocated major number. The downside is that you cannot create the device files in advance, because you do not know what the major number will be. There are a few ways around this:

- The device driver can print the newly assigned number, and we create the device file by hand.
- The newly registered device shows up in /proc/devices, so we can create the device file by hand or write a shell script that reads that file and creates it.
- The device driver can call `device_create` after a successful registration and `device_destroy` when `cleanup_module` is called.

However, `register_chrdev()` occupies a whole range of minor numbers associated with the given major number, which wastes resources. The cdev interface is recommended instead.

## cdev

First, register the device numbers as before, using either of the following functions:

```c
int register_chrdev_region(dev_t from, unsigned count, const char *name);
int alloc_chrdev_region(dev_t *dev, unsigned baseminor, unsigned count, const char *name);
```

Next, initialize the `struct cdev` for our char device and associate it with the device numbers:

```c
struct cdev *my_cdev = cdev_alloc();
my_cdev->ops = &my_fops;
```

More commonly, the `struct cdev` is embedded in a device-specific structure of your own. In that case it is initialized with `cdev_init` instead:

```c
void cdev_init(struct cdev *cdev, const struct file_operations *fops);
```

Once initialization is complete, `cdev_add` adds the char device to the system.

```c
int cdev_add(struct cdev *p, dev_t dev, unsigned count);
```

## Unregistering A Device

We cannot allow the module to be removed with rmmod while another process still has the device file open. If the module disappeared, a later operation on the file would call the memory location where the driver function used to be, and if another module had been loaded at that address in the meantime, the behavior would be unpredictable. Normally you would guard against this by checking a condition in the exit function and returning a negative value on failure, but `cleanup_module` returns `void`. The kernel does, however, keep a counter of how many processes are using your module. You can see its value with cat /proc/modules or lsmod. If the number is not 0, rmmod fails. [include/linux/module.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/module.h) provides functions to increment, decrement and read this counter:

- `try_module_get(THIS_MODULE)`: increment the current module's count.
- `module_put(THIS_MODULE)`: decrement the current module's count.
- `module_refcount(THIS_MODULE)`: return the current module's count.

## chardev.c

<details>
<summary>Full code</summary>

```c
// chardev.c: Create a read-only char device that says how many times
// you have read from the dev file
#include <linux/atomic.h>
#include <linux/cdev.h>
#include <linux/delay.h>
#include <linux/device.h>
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/kernel.h> /* for sprintf() */
#include <linux/module.h>
#include <linux/printk.h>
#include <linux/types.h>
#include <linux/uaccess.h> /* for get_user and put_user */
#include <linux/version.h>

#include <asm/errno.h>

/* Prototypes - this would normally go in a .h file */
static int device_open(struct inode *, struct file *);
static int device_release(struct inode *, struct file *);
static ssize_t device_read(struct file *, char __user *, size_t, loff_t *);
static ssize_t device_write(struct file *, const char __user *, size_t,
                            loff_t *);

#define DEVICE_NAME "chardev" /* Dev name as it appears in /proc/devices */
#define BUF_LEN 80 /* Max length of the message from the device */

/* Global variables are declared as static, so are global within file */

static int major; /* major number assigned to our device driver */

enum {
    CDEV_NOT_USED,
    CDEV_EXCLUSIVE_OPEN,
};

/* Is device open? Used to prevent multiple access to device */
static atomic_t already_open = ATOMIC_INIT(CDEV_NOT_USED);

static char msg[BUF_LEN + 1]; /* The msg the device will give when asked */

static struct class *cls;

static struct file_operations chardev_fop = {
    .read = device_read,
    .write = device_write,
    .open = device_open,
    .release = device_release,
};

static int __init chardev_init(void)
{
    major = register_chrdev(0, DEVICE_NAME, &chardev_fop);

    if (major < 0) {
        pr_alert("Registering char device failed with %d\n", major);
        return major;
    }

    pr_info("I was assigned major number %d.\n", major);

#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 4, 0)
    cls = class_create(DEVICE_NAME);
#else
    cls = class_create(THIS_MODULE, DEVICE_NAME);
#endif
    device_create(cls, NULL, MKDEV(major, 0), NULL, DEVICE_NAME);

    pr_info("Device created in /dev/%s\n", DEVICE_NAME);

    return 0;
}

static void __exit chardev_exit(void)
{
    device_destroy(cls, MKDEV(major, 0)); /* clear device file */
    class_destroy(cls);

    /* Unregister the device */
    unregister_chrdev(major, DEVICE_NAME);
}

/* Methods */

/* Called when a process tries to open
 * the device file, like "sudo cat /dev/chardev"
 */
static int device_open(struct inode *inode, struct file *file)
{
    static int counter = 0;

    if (atomic_cmpxchg(&already_open, CDEV_NOT_USED, CDEV_EXCLUSIVE_OPEN))
        return -EBUSY;

    sprintf(msg, "I already told you %d times Hello world!\n", counter++);
    try_module_get(THIS_MODULE);

    return 0;
}

/* Called when a process closes the device file */
static int device_release(struct inode *inode, struct file *file)
{
    /* We're now ready for our next caller */
    atomic_set(&already_open, CDEV_NOT_USED);

    /* Decrement the usage count, or else once you opened the file, you will
     * never get rid of the module.
     */
    module_put(THIS_MODULE);

    return 0;
}

/* Called when a process, which already opened the dev file, attempts to
 * read from it.
 */
static ssize_t device_read(struct file *filp, /* see include/linux/fs.h */
                           char __user *buffer, /* buffer to fill with data */
                           size_t length, /* length of the buffer */
                           loff_t *offset)
{
    /* Number of bytes actually written to the buffer */
    int bytes_read = 0;
    const char *msg_ptr = msg;

    if (!*(msg_ptr + *offset)) { /* we are at the end of message */
        *offset = 0; /* reset the offset */
        return 0; /* signify end of file */
    }

    msg_ptr += *offset;

    /* Actually put the data into the buffer */
    while (length && *msg_ptr) {
        /* The buffer is in the user data segment, not the kernel
         * segment so "*" assignment won't work. We have to use
         * put_user which copies data from the kernel data segment to
         * the user data segment.
         */
        put_user(*(msg_ptr++), buffer++);
        length--;
        bytes_read++;
    }

    *offset += bytes_read;

    /* Most read functions return the number of bytes put into the buffer */
    return bytes_read;
}

/* Called when a process writes */
static ssize_t device_write(struct file *filp, const char __user *buff,
                            size_t len, loff_t *off)
{
    pr_alert("Sorry, this operation is not supported.\n");
    return -EINVAL;
}

module_init(chardev_init);
module_exit(chardev_exit);

MODULE_LICENSE("GPL");
```

</details>

# The /proc File System

The proc filesystem is a mechanism for passing information to processes, such as /proc/modules, which provides the module list, and /proc/meminfo, which gathers memory usage statistics. Using it is much like writing a device driver: you build a structure containing pointers to all the handler functions, `init_module` registers it with the kernel, and `cleanup_module` unregisters it. The HelloWorld example below shows how to use a /proc file. It has 3 parts:

- Create the /proc/helloworld file in `init_module`.
- Return a value and a buffer when /proc/helloworld is read through `procfile_read`.
- Delete /proc/helloworld in `cleanup_module`.

When the module is loaded, `proc_create` creates the /proc/helloworld file. Its return value is a pointer to a `struct proc_dir_entry` representing the newly created /proc file, which can be used to configure further attributes of the file. A `NULL` return value means creating /proc/helloworld failed.

Every read of /proc/helloworld calls `procfile_read`. Two of its parameters matter most: `buffer` and `offset`. The second parameter, `char *buffer`, is where the output goes; whatever is written there is handed back to the user-space program (for example cat), which is how the user sees the contents of /proc/helloworld. `offset` tells you how far into the file the reader currently is, so you can decide whether to return more data or end the read. If `procfile_read()` returns a non-zero value, the kernel calls it again.

<details>
<summary> Full code </summary>

```c
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/uaccess.h>
#include <linux/version.h>

#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 6, 0)
#define HAVE_PROC_OPS
#endif

#define procfs_name "helloworld"

static struct proc_dir_entry *our_proc_file;

static ssize_t procfile_read(struct file *file_pointer, char __user *buffer,
                             size_t buffer_length, loff_t *offset)
{
    char s[13] = "HelloWorld!\n";
    int len = sizeof(s);
    ssize_t ret = len;

    if (*offset >= len || copy_to_user(buffer, s, len)) {
        pr_info("copy_to_user failed\n");
        ret = 0;
    } else {
        pr_info("procfile read %s\n", file_pointer->f_path.dentry->d_name.name);
        *offset += len;
    }

    return ret;
}

#ifdef HAVE_PROC_OPS
static const struct proc_ops proc_file_fops = {
    .proc_read = procfile_read,
};
#else
static const struct file_operations proc_file_fops = {
    .read = procfile_read,
};
#endif

static int __init procfs1_init(void)
{
    our_proc_file = proc_create(procfs_name, 0644, NULL, &proc_file_fops);
    if (NULL == our_proc_file) {
        pr_alert("Error:Could not initialize /proc/%s\n", procfs_name);
        return -ENOMEM;
    }

    pr_info("/proc/%s created\n", procfs_name);
    return 0;
}

static void __exit procfs1_exit(void)
{
    proc_remove(our_proc_file);
    pr_info("/proc/%s removed\n", procfs_name);
}

/* The names passed here must match the functions defined above. */
module_init(procfs1_init);
module_exit(procfs1_exit);

MODULE_LICENSE("GPL");
```

</details>

A word on [copy_to_user](https://manpages.debian.org/stretch-backports/linux-manual-4.11/__copy_to_user.9):

```C
unsigned long __copy_to_user(void __user * to, const void * from, unsigned long n);
```

It copies data from kernel space to user space and **returns 0 on success** (otherwise the number of bytes that could not be copied). In the program above, the whole of "HelloWorld!\n" (13 bytes) is copied to the user-space buffer in one go. `*offset` is then advanced by `len` (13), so the next call sees `*offset >= len` and returns 0, which ends the read. After loading helloworld, run

```shell
$ cat /proc/helloworld
```

and you will see HelloWorld!, which shows that the proc file was created successfully.

## Read and Write a /proc file

The previous example showed reading /proc/helloworld. Writing to a /proc file is possible too and works on the same principle. It uses `copy_from_user`, because the data being written comes from user space and has to be moved into kernel space.

A user process can only access its own memory segment. When writing a kernel module you normally need to access the kernel's memory segment, which the system handles automatically. However, when a user-space buffer has to be passed to a kernel module, the kernel function receives a pointer into the memory segment of that process, and this pointer must not be dereferenced directly. The `put_user` and `get_user` macros must be used to access user-space memory:

- get_user: read one character from user space into the kernel
- put_user: write one character from the kernel to user space

To handle several characters at once (for example a whole string or structure), use `copy_to_user` (writing out) or `copy_from_user` (reading in). In other words, on a read the data already lives in kernel space, so no special handling is needed to obtain it; on a write the data comes in from user space, so `copy_from_user` must be used to bring it safely into kernel space.

[copy_from_user](https://manpages.debian.org/testing/linux-manual-4.8/__copy_from_user.9.en.html)

```c
unsigned long __copy_from_user(void * to, const void __user * from, unsigned long n);
```

Next, we extend the original code so that this procfile also supports writing:
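Before the kernel-side handler, the clamp-then-copy pattern it relies on can be tried out in user space. The sketch below is only an analogy under stated assumptions: `memcpy()` stands in for `copy_from_user()`, and `DEMO_MAX_SIZE`, `demo_buffer`, `demo_store` and `demo_contents` are hypothetical names chosen for illustration, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_SIZE 8 /* hypothetical; small so truncation is easy to see */

static char demo_buffer[DEMO_MAX_SIZE];

/*
 * User-space stand-in for a /proc write handler. The length supplied by
 * the caller is never trusted: it is clamped so that the data plus a
 * terminating NUL always fit in demo_buffer. Returns the number of
 * bytes actually stored.
 */
size_t demo_store(const char *src, size_t len)
{
    assert(src != NULL);

    if (len >= DEMO_MAX_SIZE)
        len = DEMO_MAX_SIZE - 1; /* leave room for the terminator */

    memcpy(demo_buffer, src, len); /* plays the role of copy_from_user() */
    demo_buffer[len] = '\0';

    return len;
}

/* Read-only view of what was stored last. */
const char *demo_contents(void)
{
    return demo_buffer;
}
```

Storing 3 bytes keeps all of them, while a 10-byte input is cut to 7 bytes so that the NUL terminator still fits. The kernel handler that follows applies the same clamp to `len` before copying.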
```c
#define PROCFS_MAX_SIZE 1024

/* Buffer that holds the data written to the /proc file */
static char procfs_buffer[PROCFS_MAX_SIZE];
static unsigned long procfs_buffer_size = 0;

/* Called when the /proc file is written */
static ssize_t procfile_write(struct file *file, const char __user *buff,
                              size_t len, loff_t *off)
{
    procfs_buffer_size = len;
    if (procfs_buffer_size >= PROCFS_MAX_SIZE)
        procfs_buffer_size = PROCFS_MAX_SIZE - 1;

    if (copy_from_user(procfs_buffer, buff, procfs_buffer_size))
        return -EFAULT;

    procfs_buffer[procfs_buffer_size] = '\0';
    *off += procfs_buffer_size;
    pr_info("procfile write %s\n", procfs_buffer);

    return procfs_buffer_size;
}
```

Finally, as before, remember to add the write operation to both the proc_ops and the file_operations tables. After loading the module, run `echo "abc" | sudo tee /proc/helloworld` and then `sudo dmesg`; the written text shows up in the log.

## Manage /proc file with standard filesystem

The previous sections used the /proc interface to create readable and writable files. A /proc file can also be managed through its **inode**. In Linux, every filesystem has its own functions for handling inode and file operations, collected in:

- `struct inode_operations`: inode-level operations (such as create and delete)
- `struct proc_ops` / `struct file_operations`: file-level operations (read/write and so on)

`inode_operations` contains a pointer to `proc_ops`. File operations deal with the contents of the file itself, while inode operations deal with how the file is referenced, how links to it are created, and so on. When creating a new file in /proc, you can specify which `inode_operations` it uses; that `inode_operations` points to our `proc_ops`, and `proc_ops` in turn names our `read/write` functions.

```c
static int procfs_open(struct inode *inode, struct file *file)
{
    try_module_get(THIS_MODULE);
    return 0;
}

static int procfs_close(struct inode *inode, struct file *file)
{
    module_put(THIS_MODULE);
    return 0;
}
```

According to the documentation on [`try_module_get` and `module_put`](https://www.kernel.org/doc/html/next/driver-api/basics.html#), these two functions increment and decrement the kernel module's reference count. They prevent the module from being unloaded while in use and let it be unloaded safely once it is not. As mentioned earlier, if the reference count is not 0 at unload time, the system considers the module in use and refuses to unload it. Without this pair the refcount can easily get out of sync, leading to the situation where a module can no longer be removed although nothing is using it. Calling them in open and close is a safeguard.

```c
proc_set_size(our_proc_file, 80);
proc_set_user(our_proc_file, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID);
```

These two functions set attributes of the proc file so that it behaves more like a regular file under /proc (for example, `ls -l /proc/xxx` shows the right size and owner).

<details>
<summary> Full code</summary>

```c
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/uaccess.h>
#include <linux/version.h>
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 10, 0)
#include <linux/minmax.h>
#endif

#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 6, 0)
#define HAVE_PROC_OPS
#endif

#define PROCFS_MAX_SIZE 1024
#define PROCFS_NAME "buffer1k"

static struct proc_dir_entry *our_proc_file;
static char procfs_buffer[PROCFS_MAX_SIZE];
static unsigned long procfs_buffer_size = 0;

/* Called when the /proc file is read */
static ssize_t procfs_read(struct file *filp, char __user *buffer,
                           size_t length, loff_t *offset)
{
    if (*offset || procfs_buffer_size == 0) {
        pr_info("procfs_read: END\n");
        *offset = 0;
        return 0;
    }

    procfs_buffer_size = min(procfs_buffer_size, length);
    if (copy_to_user(buffer, procfs_buffer, procfs_buffer_size))
        return -EFAULT;

    *offset += procfs_buffer_size;
    pr_info("procfs_read: read %lu bytes\n", procfs_buffer_size);

    return procfs_buffer_size;
}

/* Called when the /proc file is written */
static ssize_t procfs_write(struct file *file, const char __user *buffer,
                            size_t len, loff_t *off)
{
    /* min_t() avoids the type mismatch between the int constant and size_t */
    procfs_buffer_size = min_t(unsigned long, PROCFS_MAX_SIZE, len);
    if (copy_from_user(procfs_buffer, buffer, procfs_buffer_size))
        return -EFAULT;

    *off += procfs_buffer_size;
    pr_info("procfs_write: write %lu bytes\n", procfs_buffer_size);

    return procfs_buffer_size;
}

static int procfs_open(struct inode *inode, struct file *file)
{
    pr_info("procfs is opened\n");
    try_module_get(THIS_MODULE);
    return 0;
}

static int procfs_close(struct inode *inode, struct file *file)
{
    pr_info("procfs is closed\n");
    module_put(THIS_MODULE);
    return 0;
}

#ifdef HAVE_PROC_OPS
static const struct proc_ops proc_file_fops = {
    .proc_read = procfs_read,
    .proc_write = procfs_write,
    .proc_open = procfs_open,
    .proc_release = procfs_close,
};
#else
/* struct file_operations uses the plain member names, without proc_ */
static const struct file_operations proc_file_fops = {
    .read = procfs_read,
    .write = procfs_write,
    .open = procfs_open,
    .release = procfs_close,
};
#endif

static int __init procfs_init(void)
{
    our_proc_file = proc_create(PROCFS_NAME, 0644, NULL,
                                &proc_file_fops);
    if (our_proc_file == NULL) {
        pr_alert("Error:Could not initialize /proc/%s\n", PROCFS_NAME);
        return -ENOMEM;
    }
    proc_set_size(our_proc_file, 80);
    proc_set_user(our_proc_file, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID);

    pr_debug("/proc/%s created\n", PROCFS_NAME);
    return 0;
}

static void __exit procfs_exit(void)
{
    remove_proc_entry(PROCFS_NAME, NULL);
    pr_debug("/proc/%s removed\n", PROCFS_NAME);
}

module_init(procfs_init);
module_exit(procfs_exit);

MODULE_LICENSE("GPL");
```

</details>

---

## seq_file API

[Driver porting: The seq_file interface](https://lwn.net/Articles/22355/)

When a /proc file has to show a lot of data (a list of processes, modules, device states and so on), managing the read offset, the formatting and the chunked transfer by hand quickly becomes complicated and error-prone. The kernel therefore provides the `seq_file` API to simplify this. The interface is available through `<linux/seq_file.h>`, and three parts of it deserve attention.

### The iterator interface

To create a virtual file with `seq_file`, you implement a simple iterator that presents the data one record at a time. The iterator must be able to move to a given position, so that the file can be scanned like a regular one. How a position is defined is up to the implementer, but position 0 must be the start of the file. An iterator needs four functions for `seq_file` to work, shown below.

- start()

```c
static void *ct_seq_start(struct seq_file *s, loff_t *pos)
{
    loff_t *spos = kmalloc(sizeof(loff_t), GFP_KERNEL);
    if (!spos)
        return NULL;
    *spos = *pos;
    return spos;
}
```

This function allocates memory with `kmalloc` to hold the current position. Usually it also has to check whether the position is already past the end of the data and return `NULL` in that case, so that `seq_file` knows the file has ended.

- next()

```c
static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos)
{
    loff_t *spos = (loff_t *) v;
    *pos = ++(*spos);
    return spos;
}
```

This function moves to the next record. Here it simply increments the position by 1. If the end of the data has been reached, it should return `NULL`.

- stop()

```c
static void ct_seq_stop(struct seq_file *s, void *v)
{
    kfree(v);
}
```

This function releases the resources the iterator used. If `kmalloc` was used, this is the place to call `kfree()`; otherwise it can be left empty.

- show()

```c
static int ct_seq_show(struct seq_file *s, void *v)
{
    loff_t *spos = (loff_t *) v;
    seq_printf(s, "%lld\n", *spos);
    return 0;
}
```

This function outputs one record. It takes the current position or data from `v` and uses `seq_printf()` to write the formatted result into the buffer that the user will read.

The four functions are then bound together with `seq_operations`:

```c
static struct seq_operations ct_seq_ops = {
    .start = ct_seq_start,
    .next = ct_seq_next,
    .stop = ct_seq_stop,
    .show = ct_seq_show
};
```

The overall flow is as follows. `start()` is called first. If it returns something other than `NULL`, iteration continues with `next()`, and `show()` runs once for each element that `start()` or `next()` returns, writing that element into the user buffer. When `next()` returns `NULL`, or when `start()` returns `NULL` right away because there is no data, `stop()` is called to finish.

![image](https://hackmd.io/_uploads/HyW11GH8el.png)

<details>
<summary> Full code </summary>

```c
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h> /* for seq_file */
#include <linux/version.h>

#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 6, 0)
#define HAVE_PROC_OPS
#endif

#define PROC_NAME "iter"

/* This function is called at the beginning of a sequence.
 * ie, when:
 *   - the /proc file is read (first time)
 *   - after the function stop (end of sequence)
 */
static void *my_seq_start(struct seq_file *s, loff_t *pos)
{
    if (*pos == 0)
        return pos;

    return NULL;
}

static void *my_seq_next(struct seq_file *s, void *v, loff_t *pos)
{
    (*pos)++;
    if (*pos < 10)
        return pos;

    return NULL;
}

static void my_seq_stop(struct seq_file *s, void *v)
{
    /* nothing to do */
}

static int my_seq_show(struct seq_file *s, void *v)
{
    loff_t *spos = (loff_t *)v;

    seq_printf(s, "%lld\n", *spos);
    return 0;
}

/* This structure gathers the functions that manage the sequence */
static struct seq_operations my_seq_ops = {
    .start = my_seq_start,
    .next = my_seq_next,
    .stop = my_seq_stop,
    .show = my_seq_show,
};

/* This function is called when the /proc file is opened. */
static int my_open(struct inode *inode, struct file *file)
{
    return seq_open(file, &my_seq_ops);
}

/* This structure gathers the functions that manage the /proc file */
#ifdef HAVE_PROC_OPS
static const struct proc_ops my_file_ops = {
    .proc_open = my_open,
    .proc_read = seq_read,
    .proc_lseek = seq_lseek,
    .proc_release = seq_release,
};
#else
static const struct file_operations my_file_ops = {
    .open = my_open,
    .read = seq_read,
    .llseek = seq_lseek,
    .release = seq_release,
};
#endif

static int __init seq1_init(void)
{
    struct proc_dir_entry *entry;

    entry = proc_create(PROC_NAME, 0, NULL, &my_file_ops);
    if (entry == NULL) {
        pr_debug("Error: Could not initialize /proc/%s\n", PROC_NAME);
        return -ENOMEM;
    }

    return 0;
}

static void __exit seq1_exit(void)
{
    remove_proc_entry(PROC_NAME, NULL);
    pr_info("/proc/%s removed\n", PROC_NAME);
}

module_init(seq1_init);
module_exit(seq1_exit);

MODULE_LICENSE("GPL");
```

</details>

# sysfs: Interacting with your module

sysfs is a virtual filesystem mounted at /sys. It lets you read and write attributes of kernel objects from user space, which makes it possible to observe device state, toggle hardware switches, debug kernel modules, and so on.

## kobject

Reference: [The zen of kobjects](https://lwn.net/Articles/51437/)

In the Linux kernel, a kobject (kernel object) is a unified object model. It mainly provides the following:

- Through the parent pointer, all kobjects can be arranged in a hierarchy.
- A reference count records how many times a `kobject` is referenced, and the object is released when the count drops to 0 (this was the only function of `kobject` when it was first introduced).
- Together with the `sysfs` virtual filesystem, each `kobject` and its properties are exposed to user space as files.

A `kobject` is usually not used on its own but embedded in another structure, for example:

```c
struct cdev {
    struct kobject kobj;
    struct module *owner;
    struct file_operations *ops;
    struct list_head list;
};
```

Here kobj is embedded in `cdev`. The `container_of()` macro gets you from the `kobject` back to the structure that contains it:

```c
struct cdev *device = container_of(kp, struct cdev, kobj);
```

The prototypes in the following subsections are the ones given in the LWN article. Newer kernels differ: `kobject_init()` now also takes a `kobj_type`, and `kobject_add()` takes a parent and a name.

### Initializing a kobject

A `kobject` is set up by calling `kobject_init()`, which also sets the kobject's reference count to 1.

```c
void kobject_init(struct kobject *kobj);
```

In addition, the `kobject` needs a name, which is the name shown in the `sysfs` directory:

```c
int kobject_set_name(struct kobject *kobj, const char *format, ...);
```

### Reference counts

A kobject continues to exist as long as references to it exist. The functions that manage the reference count are:

```c
struct kobject *kobject_get(struct kobject *kobj);
void kobject_put(struct kobject *kobj);
```

A successful call to `kobject_get` increments the reference count by 1 and returns a pointer to the `kobject`; if the kobject is already going away, it returns `NULL`. When a reference to the `kobject` is dropped, call `kobject_put` to decrement the count by 1. As mentioned above, `kobject_init` sets the count to 1, so remember to call `kobject_put` at the end of your code.

### Hooking into sysfs

For a `kobject` to appear in `/sys`, pass it to `kobject_add`:

```c
int kobject_add(struct kobject *kobj);
```

`kobject_del` removes the `kobject` from `sysfs`:

```c
void kobject_del(struct kobject *kobj);
```

### Releasing a kobject

Suppose a `kobject` is bound to a file in `sysfs`. If a user-space program opens that file (for example with cat or echo), the reference count does not drop to 0 until the file is closed, even if the kernel side no longer needs the object. The memory therefore must not be freed before the reference count reaches zero. Freeing is done in a `release` function; here is an example implementation:

```c
void my_object_release(struct kobject *kobj)
{
    struct my_object *mine = container_of(kobj, struct my_object, kobj);

    /* Perform any additional cleanup on this object, then...
     */
    kfree(mine);
}
```

Every `kobject` must have a `release()` method, and the `kobject` must persist until that method is called. Code that cannot meet these conditions is flawed. Notably, `release()` is not stored in the `kobject` itself but in its `ktype` (`struct kobj_type`).

```c
struct kobj_type {
    void (*release)(struct kobject *);
    struct sysfs_ops *sysfs_ops;
    struct attribute **default_attrs;
};
```

This structure describes a type of `kobject`, and every `kobject` must have an associated `kobj_type`. The link is usually made at initialization by pointing `kobject->ktype` at the `kobj_type`, or it is supplied automatically by the `kset` the object belongs to. The `release` member holds the callback that frees the memory (called when the reference count reaches zero), while `sysfs_ops` and `default_attrs` control how these `kobject`s are represented in `sysfs`.

### kset

A `kset` is a collection of `kobject`s **of the same type**. It looks like an extension of `struct kobj_type`, but the two have different concerns: `struct kobj_type` is about the type of an object, whereas `struct kset` is about aggregation and collection. A `kset` serves the following purposes:

- The kernel can use a `kset` to track all block devices or all PCI drivers.
- Every `kset` contains a `kobject` that can be set as the `parent` of other `kobject`s, which makes `kset` the key to building the device model hierarchy.
- When `kobject`s are added or removed dynamically, the `kset` can decide how those events are reported to user space.

:::info
From an object-oriented point of view, `kset` is the top-level container class. Every `kset` carries its own built-in `kobject`, so it can be treated as a `kobject` itself, which means it can be placed in `sysfs` and become a `/sys/xxx` directory.
:::

A `kset` keeps its children (`kobject`s) in a standard kernel linked list. Each `kobject` knows which `kset` it belongs to through its `kset` field, and it also points at that `kset` through its `parent` field (more precisely, at the `kobject` embedded in the `kset`).

![image](https://hackmd.io/_uploads/BkZn1ao8ll.png)

The `kobject`s in the figure are in fact embedded in other structures (a device structure, or even another `kset`), and the `parent` of a `kobject` does not always have to be the `kset` it belongs to.

A `kset` is initialized and set up much like a `kobject`, with the following functions:

```c
void kset_init(struct kset *kset);
int kset_add(struct kset *kset);        /* add the initialized kset to sysfs */
int kset_register(struct kset *kset);   /* same as kset_init() + kset_add() */
void kset_unregister(struct kset *kset); /* remove and free the memory */
```

Setting `kobj->kset` before calling `kobject_add()` automatically adds the kobject to that `kset`:

```c
kobj->kset = the_kset;
kobject_add(kobj);
```

## attributes

Back to `sysfs`: it provides an interface through which `kobject` attributes in the kernel can be read or set from user space. The basic attribute declaration is:

```c
struct attribute {
    char *name;           /* name of the sysfs file */
    struct module *owner; /* owning module */
    umode_t mode;         /* permissions, e.g. 0444 (read-only) */
};
```

This structure corresponds to one `sysfs` file: the attribute becomes a `/sys/.../xxx` file that can be read or written, with a name and permissions of your choice. To use it, you also need these kernel functions:

```c
int sysfs_create_file(struct kobject *kobj, const struct attribute *attr);
void sysfs_remove_file(struct kobject *kobj, const struct attribute *attr);
```

They register a file with `sysfs` and remove it again; `kobject` is the object the file is attached to.

In device drivers, `struct device_attribute` is used more often:

```c
struct device_attribute {
    struct attribute attr;
    ssize_t (*show)(struct device *dev, struct device_attribute *attr,
                    char *buf);
    ssize_t (*store)(struct device *dev, struct device_attribute *attr,
                     const char *buf, size_t count);
};
```

- `show()`: runs when someone executes `cat /sys/xxx/attrfile`. It must write the data to display into buf.
- `store()`: runs when someone executes `echo value > /sys/xxx/attrfile`. It receives the data written by the user in buf.

These attributes also have to be registered and removed:

```c
int device_create_file(struct device *, const struct device_attribute *);
void device_remove_file(struct device *, const struct device_attribute *);
```

<details>
<summary> Full code </summary>

```c
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/kobject.h>
#include <linux/module.h>
#include <linux/string.h>
#include <linux/sysfs.h>

static struct kobject *mymodule;

/* the variable you want to be able to change */
static int myvariable = 0;

static ssize_t myvariable_show(struct kobject *kobj,
                               struct kobj_attribute *attr, char *buf)
{
    return sprintf(buf, "%d\n", myvariable);
}

static ssize_t myvariable_store(struct kobject *kobj,
                                struct kobj_attribute *attr, const char *buf,
                                size_t count)
{
    sscanf(buf, "%d", &myvariable);
    return count;
}

static struct kobj_attribute myvariable_attributes =
    __ATTR(myvariable, 0660, myvariable_show, myvariable_store);

static int __init mymodule_init(void)
{
    int error = 0;

    pr_info("mymodule: initialized\n");

    mymodule = kobject_create_and_add("mymodule", kernel_kobj);
    if (!mymodule)
        return -ENOMEM;

    error = sysfs_create_file(mymodule, &myvariable_attributes.attr);
    if (error) {
        kobject_put(mymodule);
        pr_info("failed to create the myvariable file "
                "in /sys/kernel/mymodule\n");
    }

    return error;
}

static void __exit mymodule_exit(void)
{
    pr_info("Module release\n");
    kobject_put(mymodule);
}

module_init(mymodule_init);
module_exit(mymodule_exit);

MODULE_LICENSE("GPL");
```

</details>

# Blocking Processes and threads

## sleep

If a kernel module does not want to be bothered by a process over and over, it can put that process to sleep and wake it up once it is ready. This is also why several processes appear to run at the same time even on a single CPU.

The following example demonstrates this. It creates a file `/proc/sleep` whose goal is that **only one process can have the file open at a time**. If a process already has the file open, the module calls `wait_event_interruptible` to make later processes wait.

### wait_event_interruptible

```c
#define wait_event_interruptible(wq_head, condition)                    \
({                                                                      \
    int __ret = 0;                                                      \
    might_sleep();                                                      \
    if (!(condition))                                                   \
        __ret = __wait_event_interruptible(wq_head, condition);         \
    __ret;                                                              \
})
```

- `wq` (the `wq_head` argument): a wait queue, for example one declared with `DECLARE_WAIT_QUEUE_HEAD(myqueue)`
- `condition`: the condition to wait for, for example `flag == 1` or `buffer_len > 0`

The process goes to sleep and is placed on the `wait queue` until `condition` becomes true or a signal arrives. Every time the wait queue `wq` is woken up, `condition` is checked again. You must call `wake_up(&wq)` explicitly after changing `condition`, otherwise the waiters never wake up. For example:

```c
flag = 1;
wake_up(&myqueue); // waiters in wait_event_interruptible must be woken explicitly
```

After waking up, the return value is:

- 0 if `condition` holds
- `-ERESTARTSYS` if the wake-up was caused by a signal (not by the condition becoming true)

---

We can keep a file open with `tail -f`, which keeps reading the end of the file. When `wait_event_interruptible` is called, it sets the task (the kernel data structure that represents a `process`) to the `TASK_INTERRUPTIBLE` state, meaning the task sleeps until it is woken up in some way. The task is added to the `wait queue`, that is, the queue of all tasks waiting to access the file. The function then calls the scheduler, which context-switches to another `process` that can use the CPU, so other programs get to run first.

The key part of the program is acquiring the resource with **compare-and-exchange** (atomic_cmpxchg):

```c
/* 1 if the file is currently open by somebody */
static atomic_t already_open = ATOMIC_INIT(0);
```

This declares an atomic variable `already_open`, initialized to 0, meaning that no process has the file open.

### atomic_cmpxchg

"Atomic" means indivisible, so an "atomic operation" is an execution step that cannot be broken down any further: the smallest unit of operation. Once it starts, it cannot be split in the middle, so no other operation or task can interrupt it before it completes. `atomic_t` is defined as follows:

```c
typedef struct {
    int counter;
} atomic_t;
```

```c
static inline int atomic_cmpxchg(atomic_t *v, int old, int new);
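/*
 * atomic_cmpxchg() - atomic compare and exchange with full ordering
 * @v: pointer to atomic_t
 * @old: int value to compare with
 * @new: int value to assign
 *
 * If (@v == @old), atomically updates @v to @new with full ordering.
 *
 * Unsafe to use in noinstr code; use raw_atomic_cmpxchg() there.
 *
 * Return: The original value of @v.
 */
```

`atomic_cmpxchg` is an atomic function that performs compare-and-exchange in a concurrent environment. It compares the value at the target memory location with the expected value and, if they are equal, writes the new value there. In either case it returns the value the location held before the operation. The whole sequence is atomic, so no other thread can intervene partway through, which avoids race conditions.

Next, look at how it is used in this program. In the `open` function:

```c
/* Try to get without blocking */
if (!atomic_cmpxchg(&already_open, 0, 1)) {
    /* Success without blocking, allow the access */
    return 0;
}
```

If the current value equals `old` (0), it is changed to `new` (1) and the old value is returned; otherwise the current value is returned unchanged. Because the return value is the old value, `if (!atomic_cmpxchg(...))` is true exactly when the original value was 0, meaning this caller acquired the file. Any other process that tries to open the file now finds `already_open` already set to 1, so the call does not return 0 immediately and execution continues. There are two cases:

1. If the caller specified `O_NONBLOCK`, it does not want to block, so `-EAGAIN` is returned.
2. If the caller did not specify `O_NONBLOCK` (blocking is the default), the kernel should put it on the wait queue until the resource is available, then let it continue.

In practice, when `cmpxchg` fails, the file is already held by another process. The code then checks `file->f_flags & O_NONBLOCK`: if `O_NONBLOCK` is set it returns `-EAGAIN` immediately; otherwise it calls `try_module_get()` to hold a reference on the module and enters the sleep loop:

```c
while (atomic_cmpxchg(&already_open, 0, 1)) {
    ... ...
    wait_event_interruptible(waitq, !atomic_read(&already_open));
    ... ...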
```

Here `wait_event_interruptible` puts the process on the wait queue and lets the CPU do other work until `wake_up(&waitq)` is called (inside `module_close()`). In `close`:

```c
static int module_close(struct inode *inode, struct file *file)
{
    atomic_set(&already_open, 0);
    wake_up(&waitq);
    return 0; /* success */
}
```

`atomic_set(&already_open, 0)` resets `already_open` to 0, and `wake_up(&waitq)` then wakes the waiting processes so they can compete for `atomic_cmpxchg` again.

[Full code](https://gist.github.com/leonnig/9ee58d6120bf40acc6ec1f463f55cd5b)

---

## completion

There is another way to guarantee the execution order of processes. Rather than waiting with the `/bin/sleep` command, the kernel offers a mechanism called **completions**.

```c
#include <linux/completion.h>

struct completion {
    unsigned int done;       // Tracks completion state
    wait_queue_head_t wait;  // Queue of waiting threads
};
```

Using a completion involves 3 main parts:

1. Initialize the `struct completion`.
2. Wait, or act as a barrier, with `wait_for_completion()`.
3. Signal with `complete()`.

The example code starts two threads, `crank` and `flywheel`, and the goal is for the `crank` thread to always run before `flywheel`. To achieve this, each thread gets its own completion state, represented by a `struct completion`, and each thread updates its completion when it finishes. The `flywheel` thread calls `wait_for_completion` to make sure it does not start too early, and the `crank` thread calls `complete_all()` to update its completion state so that `flywheel` can proceed.

Reference: [Completions - “wait for completion” barrier APIs](https://docs.kernel.org/scheduler/completion.html)

```c
void wait_for_completion(struct completion *done);
void complete_all(struct completion *);
```

Usage:

```c
CPU#1                                   CPU#2

struct completion setup_done;

init_completion(&setup_done);
initialize_work(...,&setup_done,...);

/* run non-dependent code */            /* do setup */

wait_for_completion(&setup_done);       complete(&setup_done);
```

Even if the `flywheel` thread is started first, after loading the module and running `dmesg` you will see that "turn the crank" is always printed first, followed by the `flywheel` action, because `flywheel` waits for `crank` to complete.

[Full code](https://gist.github.com/leonnig/1d0fc2c2834af5d77d9fd00afd4ec96e)

:::warning
Why not use Pthread?
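Because kernel space uses the Linux kernel API rather than POSIX Threads. `pthread_t` is part of the POSIX thread library (libpthread) and can only be used in user space. In the Linux kernel, every process (or kernel thread) is described by a `struct task_struct`, which is the Linux process descriptor.
:::

# Synchronization

If processes running on different CPUs or threads access the same memory, a race condition can occur. To prevent this, the kernel provides several kinds of mutual exclusion functions. They mark a section of code as **locked** or **unlocked**, so that it is never executed by more than one party at the same time.

## Mutex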
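
The locked/unlocked idea can be sketched in user space with a compare-and-exchange on a single lock word, the same pattern `already_open` used in the sleep example. This is only an analogy built on C11 `<stdatomic.h>`, not the kernel mutex API: the names `demo_lock_t`, `demo_lock_init()`, `demo_trylock()`, and `demo_unlock()` are hypothetical, and a real kernel mutex additionally puts contending tasks to sleep instead of simply failing.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space lock word: 0 = unlocked, 1 = locked. */
typedef struct {
    atomic_int locked;
} demo_lock_t;

/* Start in the unlocked state. */
void demo_lock_init(demo_lock_t *l)
{
    atomic_init(&l->locked, 0);
}

/*
 * Try to take the lock without blocking.
 * Returns true if the lock word went from 0 to 1 (caller now owns it),
 * false if it was already 1 (someone else holds it).
 */
bool demo_trylock(demo_lock_t *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->locked, &expected, 1);
}

/* Release the lock so the next demo_trylock() can succeed. */
void demo_unlock(demo_lock_t *l)
{
    atomic_store(&l->locked, 0);
}
```

On a freshly initialized lock, the first `demo_trylock()` succeeds and a second one fails until `demo_unlock()` is called. The code between a successful `demo_trylock()` and `demo_unlock()` is the "locked" section.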