2020 年春季 Linux 核心設計課程作業 —— fibdrv

--- title: 2020 年春季 Linux 核心設計課程作業 —— fibdrv image: https://i.imgur.com/KXUuZLV.png description: 引導學員開發 Linux 核心模組，預期知悉核心模組如何掛載進入 Linux 核心、Virtual File System (VFS) 概況、character device driver 撰寫，和準備對應的測試及效能分析工具 --- # H03: fibdrv ###### tags: `linux2020` > 主講人: [jserv](http://wiki.csie.ncku.edu.tw/User/jserv) / 課程討論區: [2020 年系統軟體課程](https://www.facebook.com/groups/system.software2020/) :mega: 返回「[Linux 核心設計](http://wiki.csie.ncku.edu.tw/linux/schedule)」課程進度表 ==[作業說明直播錄影](https://youtu.be/go41AXrBR1Y?t=704)== ## :memo: 預期目標 * 撰寫適用於 Linux 核心層級的程式 * 學習 ktimer, copy_to_user 一類的核心 API * 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise) * 初探 Linux VFS * 自動測試機制 * 透過工具進行效能分析 ## :rocket: 費氏數列參照以下材料理解 Fibonacci 數列和對應的計算手法: * [費氏數列分析](https://hackmd.io/@sysprog/fibonacci-number) * [Fast modular Fibonacci](http://fedelebron.com/fast-modular-fibonacci) * 「什麼是費氏數列」短片: [Part I](https://youtu.be/JWGCrICTars), [Part II](https://youtu.be/TA0Dpx0LOeY), [Part III](https://youtu.be/WyDn6wiwW74), [Part IV](https://youtu.be/iCSNHH45EeI) * TED: [The magic of Fibonacci numbers](https://youtu.be/SjSHVDfXHQ4) (有繁體中文字幕) 依據 [Fibonacci and Golden Ratio Formulae](http://www.maths.surrey.ac.uk/hosted-sites/R.Knott/Fibonacci/fibFormulae.html)，考慮 Fibonacci 最初提到不會死亡的兔子繁衍總量問題，假設成年兔子為 `a` ，幼年兔子為 `b` ，我們可得到以下關係: $$ a_{n+1} = a_n + b_n \\ b_{n+1} = a_n $$ 將上述推導寫成矩陣的形式就會變成 $$ \begin{pmatrix} a_{n+1} \\ b_{n+1} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} a_n \\ b_n \end{pmatrix} $$ 則 $\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}$ 就是所謂的 ==Q-matrix==，可進一步改寫為: $$ Q = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} F_2 & F_1 \\ F_1 & F_0 \end{pmatrix} \\ Q^n = \begin{pmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{pmatrix} $$ 解說短片: [The Fibonacci Q-matrix](https://youtu.be/lTHVwsHJrG0) 接著要談及 [Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) 手法，繼續整理: $$ \begin{split} \begin{bmatrix} F(2n+1) \\ F(2n) \end{bmatrix} &= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^{2n} \begin{bmatrix} F(1) \\ F(0) \end{bmatrix}\\ \\ &= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^n \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^n \begin{bmatrix} F(1) \\ F(0) \end{bmatrix}\\ \\ &= \begin{bmatrix} F(n+1) & F(n) \\ F(n) & F(n-1) \end{bmatrix} \begin{bmatrix} F(n+1) & F(n) \\ F(n) & F(n-1) \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix}\\ \\ &= \begin{bmatrix} F(n+1)^2 + F(n)^2\\ F(n)F(n+1) + F(n-1)F(n) \end{bmatrix} \end{split} $$ 因此可得: $$ \begin{split} F(2k) &= F(k)[2F(k+1) - F(k)] \\ F(2k+1) &= F(k+1)^2+F(k)^2 \end{split} $$ 根據 Fibonacci 數列原始定義 `F(0) = 0, F(1) = 1, F(2) = 1`我們就可用這三個值，依據上述公式推導出隨後的數值。對應的虛擬碼如下: ```python Fast_Fib(n) a = 0; b = 1; // m = 0 for i = (number of binary digit in n) to 1 t1 = a*(2*b - a); t2 = b^2 + a^2; a = t1; b = t2; // m *= 2 if (current binary digit == 1) t1 = a + b; // m++ a = b; b = t1; return a; ``` 示範案例: 求解 $F(10)$: 10~10~ = 1010~2~ | i | start | 4 | 3 | 2 | 1 | result | |---|-------|----------|----------|----------|---------|--------| | n | - | ==1==010 | 1==0==10 | 10==1==0 | 101==0== | - | |F(m) | F(0) | F(0*2+1) | F(1*2) | F(2*2+1) | F(5*2) | F(10) | | a | 0 | 1 | 1 | 5 | 55 | 55 | | b | 1 | 1 | 2 | 8 | 89 | - | 對照 $F(11)$ 11~10~ = 1011~2~ | | 1 | 0 |1|1| result | -------- | -------- | -------- |-------- |-------- |-------- | | F(n)| F(0*2+1)| F(1*2) |F(2*2+1)|F(5*2+1) |F(11) | a |1 |1|5|89|89 | b |1|2|8|144 ### 考慮到硬體加速 $F(n)$ 的手法許多現代處理器提供 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令，可搭配上述 [Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) 手法運用: 1. 省略 $F(0)$，直接從 $F(1)$ 開始; 2. clz/[ffs](https://www.kernel.org/doc/htmldocs/kernel-api/API-ffs.html): 先去除數字 MSB 起算的開頭 0 位元，因為真正的數字不包含 leading 0s，所以計算它也沒用，因此需要 [clz](https://en.wikipedia.org/wiki/Find_first_set#CLZ) 計算有多少 leading 0s * 遇到 `0` $\rightarrow$ 進行 fast doubling，也就是求求 $F(2n)$ 和 $F(2n+1)$ * 遇到 `1` $\rightarrow$ 進行 fast doubling，也就是先求 $F(2n)$ 和 $F(2n+1)$，再求 $F(2n+2)$ 可對照 [你所不知道的 C 語言: 遞迴呼叫篇](https://hackmd.io/@sysprog/c-recursion)。 ### Fibonacci 數的應用電腦在計算亂數時，常以新產生出來的數值作為下一次計算的依據，這就是為什麼計算機隨機數大部分都會表示成遞迴定義的數列。 :::info 這裡只探討 [Pseudo-Random Number Generators](https://en.wikipedia.org/wiki/Pseudorandom_number_generator) (PRNG) ::: > 這是一般的 Fibonacci 數列 ```cpp f[0] = 0; f[1] = 1; f[i] = f[i - 1] + f[i - 2]; ``` >數列呈現單調遞增 ``` 1, 1, 2, 3, 5, 8, 13, 21, ... ``` > 如果我們強迫 Fibonacci 數列的數值超出 100 之後折回來 ```cpp f[0] = 0; f[1] = 1; f[i] = (f[i - 1] + f[i - 2]) % 100; ``` > 一開始雖然還是看得到規則，不過整體趨勢已經不再是單調遞增，當數值折回之後規律變得不太明顯 ```cpp 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 44, 33, 77, 10, 87, 97, 84, 81, 65, 46, 11, 57, 68, 25, 93, 18, 11, 29, 40, ... ``` > 甚至可將 Fibonacci 的遞迴數列改成: ```cpp f[0] = 18; f[1] = 83; f[2] = 4; f[3] = 42; f[4] = 71; f[i] = (f[i - 2] + f[i - 5]) % 100; ``` > 這個數列的規則已不容易看穿 ```cpp 18, 83, 4, 42, 71, 60, 54, 64, 96, 35, 56, 89, 20, 85, 55, 41, 44, 61, 29, 16, 70, 60, 31, 89, 47, 59, 7, 90, 96, 37, ... ``` 這種產生隨機數的方法，稱為 [Lagged Fibonacci generator](https://en.wikipedia.org/wiki/Lagged_Fibonacci_generator) (LFG)，是電腦產生隨機數的一種方法。延伸閱讀: * [For The Love of Computing: The Lagged Fibonacci Generator — Where Nature Meet Random Numbers](https://medium.com/asecuritysite-when-bob-met-alice/for-the-love-of-computing-the-lagged-fibonacci-generator-where-nature-meet-random-numbers-f9fb5bd6c237) ## :package: 撰寫 Linux 核心模組請自行參閱以下教材: * [Introduction to Linux kernel driver programming](https://events.linuxfoundation.org/wp-content/uploads/2017/12/Introduction-to-Linux-Kernel-Driver-Programming-Michael-Opdenacker-Bootlin-.pdf) * [Part 1: Introduction](http://derekmolloy.ie/writing-a-linux-kernel-module-part-1-introduction/) * [Part 2: A Character Device](http://derekmolloy.ie/writing-a-linux-kernel-module-part-2-a-character-device/) ## :dart: `fibdrv`: 可輸出 Fibonacci 數列的 Linux 核心模組 ### 前期準備 :::info 1. 開發環境預期和 ==[lab0](https://hackmd.io/@sysprog/linux2020-lab0)== 相似，如果你剛好重新安裝 Ubuntu Linux，請依據指示將必要的開發套件裝好; 2. 從 [Homework2](https://hackmd.io/@sysprog/linux2020-homework2) 起，我們就有分析應用程式和 Linux 核心效能的需求，==不該在虛擬機器環境== 裡頭測試 (但 [container](https://linuxcontainers.org/) 或 [Linux KVM](https://www.linux-kvm.org/page/Main_Page) 可接受)，否則會有效能偏差的問題 $\to$ 及早在實體機器上安裝好 GNU/Linux; ::: * 自從 Linux 核心 4.4 版以來，Ubuntu Linux 預設開啟 `EFI_SECURE_BOOT_SIG_ENFORCE`，這使得核心模組需要適度的簽章才可掛載進入 Linux 核心，為了後續測試的便利，我們需要將 UEFI Secure Boot 的功能**關閉**，請見 [Why do I get “Required key not available” when install 3rd party kernel modules or after a kernel upgrade?](https://askubuntu.com/questions/762254/why-do-i-get-required-key-not-available-when-install-3rd-party-kernel-modules) * 檢查 Linux 核心版本 ```shell $ uname -r ``` 預期是大於等於 `4.15` 的版本，例如 `4.15.0-45-generic`。若在你的開發環境中，核心版本低於 `4.15` 的話，需要更新 Linux 核心，請自行參照相關文件 * 安裝 `linux-headers` 套件 (注意寫法裡頭有 `s`)，以 [Ubuntu Linux 18.04 LTS](https://packages.ubuntu.com/bionic/linux-headers-generic) 為例: ```shell $ sudo apt install linux-headers-`uname -r` ``` * 確認 `linux-headers` 套件已正確安裝於開發環境 ```shell $ dpkg -L linux-headers-4.15.0-45-generic | grep "/lib/modules" ``` 預期得到以下輸出: ``` /lib/modules/4.15.0-45-generic/build ``` * 檢驗目前的使用者身份 ```shell $ whoami ``` 預期為「不是 root 的使用者名稱」，例如 `jserv` (或者你安裝 Ubuntu Linux 指定的登入帳號名稱)。由於測試過程需要用到 sudo，請一併查驗: ```shell $ sudo whoami ``` 預期輸出是 `root` :notes: 在下列操作中，請==避免用 root 帳號輸入命令==，而該善用 `sudo` :::danger 之後的實驗中，我們會重建 root file system，若濫用 root 權限，很可能就把 GNU/Linux 開發環境不小心破壞 (當然，你還是可重新安裝)，現在開始養成好習慣 ::: * 安裝後續會用得到的工具 ```shell $ sudo apt install util-linux strace gnuplot-nox ``` * 取得原始程式碼 ```shell $ git clone https://github.com/sysprog21/fibdrv $ cd fibdrv ``` * 編譯並測試 ```shell $ make check ``` 預期會看到綠色的 ` Passed [-]` 字樣，隨後是 ``` f(93) fail input: 7540113804746346429 expected: 12200160415121876738 ``` 這符合預期，因為給定的 `fibdrv` 存在==缺陷==。 :::info 就因世界不完美，才有我們工程師存在的空間。 ::: * 觀察產生的 `fibdrv.ko` 核心模組 ```shell $ modinfo fibdrv.ko ``` 預期可得以下輸出: ``` description: Fibonacci engine driver author: National Cheng Kung University, Taiwan license: Dual MIT/GPL name: fibdrv vermagic: 4.15.0-45-generic SMP mod_unload ``` * 觀察 `fibdrv.ko` 核心模組在 Linux 核心掛載後的行為（要先透過 `insmod` 將模組載入核心後才會有下面的裝置檔案 `/dev/fibonacci`） ```shell $ ls -l /dev/fibonacci $ cat /sys/class/fibonacci/fibonacci/dev ``` 新建立的裝置檔案 `/dev/fibonacci`，注意到 `236` 這個數字，在你的電腦也許會有出入。試著對照 [fibdev.c](https://github.com/sysprog21/fibdrv/blob/master/fibdrv.c)，找尋彼此的關聯。 ```shell $ cat /sys/module/fibdrv/version ``` 預期輸出是 `0.1`，這和 [fibdev.c](https://github.com/sysprog21/fibdrv/blob/master/fibdrv.c) 透過 `MODULE_VERSION` 所指定的版本號碼相同。 ```shell $ lsmod | grep fibdrv $ cat /sys/module/fibdrv/refcnt ``` 這兩道命令的輸出都是 `0`，意味著目前的 [reference counting](https://en.wikipedia.org/wiki/Reference_counting)。 ### `fibdrv` 核心模組內部觀察使用者層級 (user-level) 的程式如何與 fibdrv 互動: ```cpp fd = open(FIB_DEV, O_RDWR); if (fd < 0) { perror("Failed to open character device"); exit(1); } for (i = 0; i <= offset; i++) { sz = write(fd, write_buf, strlen(write_buf)); printf("Writing to " FIB_DEV ", returned the sequence %lld\n", sz); } for (i = 0; i <= offset; i++) { lseek(fd, i, SEEK_SET); sz = read(fd, buf, 1); printf("Reading from " FIB_DEV " at offset %d, returned the sequence " "%lld.\n", i, sz); } ``` `fibdrv` 設計為一個 [character device](https://linux-kernel-labs.github.io/refs/heads/master/labs/device_drivers.html)，可理解是個能夠循序存取檔案，透過定義相關的函數，可利用存取檔案的系統呼叫以存取 (即 open, read, write, mmap 等等)。因此，使用者層級 (user-level 或 userspace) 的程式可透過 `read` 系統呼叫來得到輸出。接著來看如何實作: ```cpp /* calculate the fibonacci number at given offset */ static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { return (ssize_t) fib_sequence(*offset); } const struct file_operations fib_fops = { .owner = THIS_MODULE, .read = fib_read, .write = fib_write, .open = fib_open, .release = fib_release, .llseek = fib_device_lseek, }; ``` 不難發現是透過 `fib_fops` 中的自行定義的 read 來實作讀取操作，而 `fib_read` 最終會回傳 `fib_sequence(*offset)`，因此就是透過讓使用者指定不同的 offest 作為 Fibonacci 數列的 $x$ 然後透過 read 的回傳值輸出 $fib(x)$ 給使用者。在 [LP64](https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models) 資料模式中，`long long` 僅寬 64 位元，因此若要表示更大的數，需要用兩個以上的變數表示一個數，在這種情況下作加法運算時，若使用 "+" operator ，會發生 overflow，可考慮實作一個更多位元的全加器。 $fib(92)$ 的計算結果用 16 進位表示是 `0x1 11F3 8AD0 840B F6BF`，超過 64 位元整數所夠表示的最大數值，因此考量到要正確輸出 $fib(100)$ 或數列更後面的數值，就必須使用到特製的結構來處理回傳值。以下是參考實作 ```cpp struct BigN { unsigned long long lower, upper; }; ``` 使用 struct BigN 來將一個數字分成兩部份: * 高位的 64 bits 保存 upper 中; * 低位的 64 bit 則是保存在 lower 中; 進行大數加法時，則需要注意 lower 是否需要進位到 upper: ```cpp static inline void addBigN(struct BigN *output, struct BigN x, struct BigN y) { output->upper = x.upper + y.upper; if (y.lower > ~x.lower) output->upper++; output->lower = x.lower + y.lower; } ``` 因為 `x.lower + ~x.lower = ~0`，移項後 `~x.lower = ~0 - x.lower`，亦即 `~x.lower` 表示 `x.lower` 跟最大值(`~0`) 的距離。所以若 `y.lower` 比 `x.lower` 距離最大值的距離還大，就表示相加後會 overflow，需要進位，這時執行 `output->upper++`。 ### [Linux Virtual File System](https://www.win.tue.nl/~aeb/linux/lk/lk-8.html) 介面透過 [Linux Virtual File System](https://www.win.tue.nl/~aeb/linux/lk/lk-8.html) 介面，本核心模組可將計算出來的 Fibonacci 數讓 `client.c` 程式得以存取。 VFS 提供一統各式檔案系統的共用介面，方便使用者操作不同的裝置或檔案時，可用一致的方式存取。Linux 的裝置驅動程式大致分為: * Character Device Driver; * Block Device Driver; * Network Device Driver; ![](https://i.imgur.com/lvjQxCt.png) 在使用裝置前需要對其定義一些 file operation，並將其註冊到 kernel 中。依據 [/include/linux/fs.h](https://elixir.bootlin.com/linux/v4.15/source/include/linux/fs.h#L1692) 中的定義 ```cpp struct file_operations { struct module *owner; loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ... int (*open) (struct inode *, struct file *); ... int (*release) (struct inode *, struct file *); int (*fsync) (struct file *, loff_t, loff_t, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); ... } __randomize_layout; ``` 此手法可參見 [你所不知道的 C 語言：物件導向程式設計篇](https://hackmd.io/@sysprog/c-oop)。以本例來說，宣告一個 file_operations 的資料型態並設定一些對應到 VFS 操作的函式 (fib_read, fib_write 等等)。當在使用者模式程式中呼叫到 read 系統呼叫時，藉由 VFS 將其對應到 fib_read。 ```cpp const struct file_operations fib_fops = { .owner = THIS_MODULE, .read = fib_read, .write = fib_write, .open = fib_open, .release = fib_release, .llseek = fib_device_lseek, }; ``` 對照 [Linux 核心設計: 檔案系統概念及實作手法](https://hackmd.io/@sysprog/linux-file-system)。 ### 核心模式的時間測量 LWN 的 [Reinventing the timer wheel](https://lwn.net/Articles/646950/) 一文中描述道： > The kernel maintains two types of timers with two distinct use cases. The high-resolution timer ("hrtimer") mechanism provides accurate timers for work that needs to be done in the near future; hrtimer use is relatively rare, but, when hrtimers are used, they almost always run to completion. "Timeouts," instead, are normally used to alert the kernel to an expected event that has failed to arrive — a missing network packet or I/O completion interrupt, for example. The accuracy requirements for these timers are less stringent (it doesn't matter if an I/O timeout comes a few milliseconds late), and, importantly, these timers are usually canceled before they expire. 跟計時有關的功能主要用在 2 種情境： 1. timer: 安排「在某個時間點做某件事情」。可想像成火車班次表，這個重要的地方是：若某班火車誤點，會連鎖地影響後面所有班次。因此需要較精準的計時機制; 2. timeout: 用來作為逾時的通知，提醒「有東西遲到了」。最簡單的例子是 `qtest` 裡面的 `SIGALRM` 的使用。這對於時間的精準度要求不高; 其中一種計時方式是用作業系統從開機以來的計時器中斷發生的次數作為計時的依據，這個計時機制叫作 `jiffies`，很早就存在於 Linux 核心。較舊的 Linux 核心提供一個建議在 `jiffies` 上的計時機制，叫作 timer wheel。可以參考 [A new approach to kernel timers](https://lwn.net/Articles/152436/) 一文 (註：標題雖然說是 new approach，但這篇寫作時間是 2005 年)。 ![](https://i.imgur.com/mZqKE8B.png) 以 `tv1` 為例，他是個大小是 256 個陣列，看 `jiffies` 的最低 8 位元代表的數字是多少，就去 `tv1` 陣列對應的元素找對應需要處理的事件。而 tv2 是 tv1 的陣列，每經過 $2^8$ 次 jiffy ，index 就加一。 `tv3` 後面以此類推。不過，這個計時器受限於計時器中斷觸發和實質能處理的的頻率，而這個頻率有其極限。關於中斷，可參見 [Linux 核心設計: 中斷處理和現代架構考量](https://hackmd.io/@sysprog/linux-interrupt)。一個新的機制是跳脫 `jiffies`，而將計時機制建立在一個新的資料結構 `ktime_t` 上面，可參考 LWN [The high-resolution timer API](https://lwn.net/Articles/167897/) 一文。 `hrtimer` 是在 2.6.16 開始有的新的計時機制，裡面使用了 `ktime_t` 這個新的資料結構來進行計時。這個結構體的定義會隨架構有所不同。所以跟大多數 Linux 中的資料結構使用機制類似，都要使用專門的函數來對這個資料型態進行操作。而在 x86-64 中是個 64 位元整數。相關的使用方式如下：宣告並初始化一個 `ktime_t`: ```cpp DEFINE_KTIME(name); ``` 這跟 `LIST_HEAD` 的功能之於 `struct list_head` 相仿，宣告一個 `ktime_t` 並初始化成 `0`。對 ktime 數值進行運算: ```cpp ktime_t ktime_add(ktime_t kt1, ktime_t kt2); ktime_t ktime_sub(ktime_t kt1, ktime_t kt2); /* kt1 - kt2 */ ktime_t ktime_add_ns(ktime_t kt, u64 nanoseconds); ``` 與其他時間相關的轉換： ```cpp ktime_t timespec_to_ktime(struct timespec tspec); ktime_t timeval_to_ktime(struct timeval tval); struct timespec ktime_to_timespec(ktime_t kt); struct timeval ktime_to_timeval(ktime_t kt); clock_t ktime_to_clock_t(ktime_t kt); u64 ktime_to_ns(ktime_t kt); ``` 詳見 [IBM](https://www.ibm.com/developerworks/library/l-timers-list/index.html) 關於 hrtimer 的文章，並對照 [Linux 核心設計: Timer 及其管理機制](https://hackmd.io/@sysprog/linux-timer)。 [ktime 相關的 API](https://www.kernel.org/doc/html/latest/core-api/timekeeping.html) 可用來測量時間，我們可發現到 write 在此核心模組暫時沒作用，於是可挪用來輸出上一次 fib 的執行時間。以下是示範的修改: ```cpp static ktime_t kt; static long long fib_time_proxy(long long k) { kt = ktime_get(); long long result = fib_sequence(k); kt = ktime_sub(ktime_get(), kt); return result; } static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { return (ssize_t) fib_time_proxy(*offset); } static ssize_t fib_write(struct file *file, const char *buf, size_t size, loff_t *offset) { return ktime_to_ns(kt); } ``` ### 關於 [client.c](https://github.com/sysprog21/fibdrv/blob/master/client.c) 由 [read](https://linux.die.net/man/2/read) 的手冊描述可知，若 read 成功，會回傳讀進的 byte 數，而 read 的其中一個輸入參數為 count，即所要讀取的 byte 數，但在 client.c 中，計算出的 Fibonacci 數是 read() 的回傳值。參考 [Linux Driver Tutorial: How to Write a Simple Linux Device Driver](https://www.apriorit.com/dev-blog/195-simple-driver-for-linux-os) 的第 7 部分 "Using Memory Allocated in User Mode": > The user allocates a special buffer in the user-mode address space. And the other action that the read function must perform is to copy the information to this buffer. The address to which a pointer from that space points and the address in the kernel address space may have different values. That's why we cannot simply dereference the pointer. 也就是說，在使用者模式 (user-mode) 的位址空間配置一個 buffer 空間時，核心裝置驅動不能直接寫入該地址，相反的，需要透過 [copy_to_user](https://elixir.bootlin.com/linux/latest/ident/copy_to_user)，將想傳給使用者模式 (即運作中的 `client`) 的字串複製到到 `fib_read` 的 buf 參數後，client 端方可接收到此字串。依據 [Hierarchical performance measurement and modeling of the Linux file system](https://www.researchgate.net/publication/221556402_Hierarchical_performance_measurement_and_modeling_of_the_Linux_file_system) 研究指出，從核心模式複製資料到使用者層級的時間成本是，每個位元組達 22ns，因此進行效能分析時，要考慮到 [copy_to_user](https://elixir.bootlin.com/linux/latest/ident/copy_to_user) 函式的時間開銷，特別留意到資料大小成長後造成的量測誤差。 ## :microscope: Linux 效能分析的提示無論是伺服器或個人電腦裡頭的 CPU 幾乎都是多核架構，通常在多核心的作業系統中常使用處理器的親和性 ([processor affinity](https://en.wikipedia.org/wiki/Processor_affinity)，亦稱 CPU pinning) 讓行程 (process) 在特定的 CPU 核心中持續執行，不受作業系統排程的干擾。將行程綁定在特定的 CPU 核心上有許多優點，例如一個 cache bound 的程式跟其他比較耗費 CPU 計算的程式一起執行時，將程式綁定在特定的 CPU 核心可減少 cache miss 的狀況。另外在兩個行程頻繁的藉由共享記憶體進行資料交換時，將兩個行程都綁定在同一個 [NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access) 節點中也可增進執行效率。在 Linux 系統中若要將特定的處理器核心指定給一個程式或行程使用，可以使用 [taskset](http://man7.org/linux/man-pages/man1/taskset.1.html) 命令來設定或取得行程的處理器親和性。延伸閱讀: * [Linux 核心設計: 不只挑選任務的排程器](https://hackmd.io/@sysprog/linux-scheduler) ### 查看行程的 CPU affinity 若要查看指定行程的處理器親和性，可使用 `taskset` 加上 `-p` 參數再加上該行程的 PID（process ID）： ```shell $ taskset -p PID ``` 其中 PID 就是行程的 ID，例如： ```shell $ taskset -p 1 ``` 參考輸出為: (顯然每台主機搭配 Linux 會有多樣結果) ``` pid 1's current affinity mask: ffffffffffffffff ``` 輸出的 affinity mask 是一個十六進位的 bitmask，將其轉換為二進位格式之後，若位元值為 `1` 則代表該行程可在這個位元對應的 CPU 核心中執行，若位元值為 `0` 則代表該行程不允許在這個位元對應的 CPU 核心中執行。在上面這個例子中十六進位的 ff 轉成二進位的格式會是 `11111111`，這 8 個 1 分別代表該行程可以在第 0 到第 7 個 CPU 核心中執行，最低（LSB; 最右邊）的位元代表第 0 個 CPU 核心，次低的代表第 1 個，以次類推。如果 affinity mask 是一個 `0x11`，則代表可在第 0 個與第 4 個 CPU 核心執行。若對 bitmask 表示法不易掌握，可加上 `-c` 參數，讓 taskset 直接輸出 CPU 的核心列表： ```shell $ taskset -cp 1 ``` 參考輸出為: ``` pid 1's current affinity list: 0-63 ``` ### 將行程固定在特定的 CPU 中執行 [taskset](http://man7.org/linux/man-pages/man1/taskset.1.html) 亦可設定行程的 core mask，將指定的行程固定在特定的 CPU 核心中執行： ```shell $ taskset -p COREMASK PID ``` 其中 COREMASK 就是指定的十六進位 core mask，PID 則為行程的 ID。除此之外，亦可使用 `-c` 參數以 CPU 的核心列表來指定： ```shell $ taskset -cp CORELIST PID ``` 其中 CORELIST 為 CPU 核心列表，以逗點分隔各個核心的編號或是使用連字號指定連續的區間，例如: `0,2,5,7-10`。例如若要將一個行程固定在第 0 個與第 4 個 CPU 核心，則使用： ```shell $ taskset -p 0x11 9030 ``` 輸出為： ``` pid 9030's current affinity mask: ff pid 9030's new affinity mask: 11 ``` 亦可使用 CPU 核心列表的方式： ```shell taskset -cp 0,4 9030 ``` 兩種方式效果一致。在 Linux 中的使用者必須有開啟 [CAP_SYS_NICE](http://man7.org/linux/man-pages/man7/capabilities.7.html) 這個權限，才能更動行程的處理器親和性，而若只是要查看處理器親和性的設定，則沒有限制（任何使用者皆可查詢。 ### 以特定的 CPU 執行程式除了更改現有行程的處理器親和性，使用者也可使用 [taskset](http://man7.org/linux/man-pages/man1/taskset.1.html) 指定 CPU 核心來執行一個新的程式： ```shell $ taskset COREMASK EXECUTABLE ``` 其中 EXECUTABLE 是要執行的程式。例如若要以第 0 個 CPU 核心執行 vlc 則使用： ```shell $ taskset 0x1 vlc ``` ### 限定 CPU 給特定的程式使用 [taskset](http://man7.org/linux/man-pages/man1/taskset.1.html) 可指定行程所使用的 CPU 核心，但不代表其他的行程不會使用這些被指定的 CPU 核心，如果你不想讓其他的行程干擾你要執行的程式，讓指定的核心只能被自己設定的行程使用，可以使用 `isolcpus` 這個 Linux 核心起始參數，這個參數可以讓特定的 CPU 核心在開機時就被保留下來。設定的方式有兩種，一個是在開機時使用 boot loader 所提供的自訂開機參數功能，手動加入 `isolcpus=cpu_id` 參數，或是直接加在 GRUB 的設定檔中，這樣 Linux 的排程器在預設的狀況下就不會將任何一般行程放在這個被保留的 CPU 核心中執行，只有那些被 [taskset](http://man7.org/linux/man-pages/man1/taskset.1.html) 特別指定的行程可使用這些 CPU 核心。舉例來說，如果想讓第 0 個與第 1 個 CPU 核心都被保留下來，則在開機時加入: ``` isolcpus=0,1 ``` 這個 Linux 起始核心參數，然後再使用 [taskset](http://man7.org/linux/man-pages/man1/taskset.1.html) 命令將這兩個 CPU 核心指定給要執行的程式使用即可。 ### 排除干擾效能分析的因素 * 抑制 [address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization) (ASLR) ```shell $ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space" ``` * 設定 scaling_governor 為 performance。準備以下 shell script，檔名為 `performance.sh`: ```shell for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor do echo performance > ${i} done ``` 之後再用 `$ sudo sh performance.sh` 執行 * 針對 Intel 處理器，關閉 turbo mode ```shell $ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" ``` ## :panda_face: Linux 核心模組掛載機制 - [ ] 檔案 `fibdrv.c` 裡頭的 `MODULE_LICENSE`, `MODULE_AUTHOR`, `MODULE_DESCRIPTION`, `MODULE_VERSION` 等巨集做了什麼事，可以讓核心知曉呢？ `insmod` 這命令背後，對應 Linux 核心內部有什麼操作呢？這些巨集本質上就是在編譯過後在 `.ko` 檔 (`ko` 即 kernel object 之意，對比使用者層級的 shared object [相當於 Microsoft Windows 的 DLL]，可對照 [你所不知道的 C 語言：動態連結器篇](https://hackmd.io/@sysprog/c-dynamic-linkage)) 中提供相對應的資訊，由於性質相同，這邊就先專注於 `MODULE_AUTHOR`。在範例程式中我們指定 module 的作者 ```clike MODULE_AUTHOR("National Cheng Kung University, Taiwan"); ``` 以下摘自 **[include/linux/module.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/module.h#L205)**: ```cpp /* * Author(s), use "Name <email>" or just "Name", for multiple * authors use multiple MODULE_AUTHOR() statements/lines. */ #define MODULE_AUTHOR(_author) MODULE_INFO(author, _author) ``` 上述註解說明 `_author` 的格式和若有多個 author 則應該呼叫多次 `MODULE_AUTHOR`。若將巨集展開應得 `MODULE_INFO(author, "National Cheng Kung University, Taiwan")` ```cpp /* Generic info of form tag = "info" */ #define MODULE_INFO(tag, info) __MODULE_INFO(tag, tag, info) ``` 繼續將上述展開得 `__MODULE_INFO(author, author, "National Cheng Kung University, Taiwan")` 以下摘自 **[include/linux/moduleparam.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/moduleparam.h)**: ```cpp #ifdef MODULE #define __MODULE_INFO(tag, name, info) \ static const char __UNIQUE_ID(name)[] \ __used __attribute__((section(".modinfo"), unused, aligned(1))) \ = __stringify(tag) "=" info #else /* !MODULE */ /* This struct is here for syntactic coherency, it is not used */ #define __MODULE_INFO(tag, name, info) \ struct __UNIQUE_ID(name) {} #endif ``` 上述巨集的定義根據 `MODULE` 是否有被定義，`MODULE` 是在此核心模組被編譯時期所定義，若此模組編譯時已內建於核心則不會被定義。繼續將上述巨集展開 ```cpp static const char __UNIQUE_ID(author)[] \ __used __attribute__((section(".modinfo"), unused, aligned(1))) \ = __stringify(author) "=" "National Cheng Kung University, Taiwan" ``` 以下摘自 **[include/linux/compiler-gcc.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/compiler-gcc.h#L208)**: ```cpp #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__) ``` 繼續將 \__UNIQUE_ID 展開，`__COUNTER__` 這個巨集由 GCC 自動更新，每當遇到使用到 `__COUNTER__` 就會將其值加一。 ```cpp static const char __PASTE(__PASTE(__UNIQUE_ID_, author), __COUNTER__)[] \ __used __attribute__((section(".modinfo"), unused, aligned(1))) \ = __stringify(author) "=" "National Cheng Kung University, Taiwan" ``` 1. `__UNIQUE_ID` 會根據參數產生一個不重複的名字（參考[linux/compiler.h](https://elixir.bootlin.com/linux/v4.15/source/include/linux/compiler.h#L161)），其中使用到的技術是利用巨集中的 `##` 來將兩個引數合併成一個新的字串。 2. 透過 `__attribute__` 關鍵字告訴編譯器，這段訊息 1. 要被放在 `.modinfo` 段 2. 應該不會被程式使用到，所以不要產生警告訊息 3. 最小的對齊格式需要是 1 bit 3. 在 [linux/stringfy.h](https://elixir.bootlin.com/linux/v4.15/source/include/linux/stringify.h) 裡頭，我們可以看到 `__stringify` 的目的是為了把引數轉換成字串形式。以 `MODULE_LICENSE(Dual MIT/GPL)` 為例，被展開後的 `__stringify(tag) "=" info` 會是 `"license = Dual MIT/GPL"` 字串。總結這部份，`MODULE_XXX` 系列的巨集在最後都會被轉變成 ```cpp static const char 獨一無二的變數[] = "操作 = 參數" ``` 再放到 `.modinfo` 區段中。這裡對應到 C99/C11 規格書中的 6.4.5 String Literals: > In translation phase 6, the multibyte character sequences specified by any sequence of adjacent character and wide string literal tokens are concatenated into a single multibytecharacter sequence. 大致的意思是把 string literal 並排，等同於一個合併起來的字串。以下摘自 **[include/linux/compiler_types.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/compiler_types.h#L53)**: ```cpp /* Indirect macros required for expanded argument pasting, eg. __LINE__. */ #define ___PASTE(a,b) a##b #define __PASTE(a,b) ___PASTE(a,b) ``` 繼續展開: ```cpp static const char __UNIQUE_ID_author0[] \ __used __attribute__((section(".modinfo"), unused, aligned(1))) \ = __stringify(author) "=" "National Cheng Kung University, Taiwan" ``` 摘自 **[include/linux/stringify.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/stringify.h#L10)**: ```cpp #define __stringify_1(x...) #x #define __stringify(x...) __stringify_1(x) ``` 注意到 `#` 和 `##` 這兩個都是 preprocessor 語法，請參照 [你所不知道的 C 語言：前置處理器應用篇](https://hackmd.io/@sysprog/c-preprocessor) 以得知詳細用法。做最後的展開能夠得到以下的結果 ```cpp static const char __UNIQUE_ID_author0[] \ __used __attribute__((section(".modinfo"), unused, aligned(1))) \ = "author=National Cheng Kung University, Taiwan" ``` 根據 GNU GCC 文件說明對於 Variable attribute 的解說，`section` 會特別將此 variable 放到指定的 ELF section 中，這邊為 `.modinfo`。關於 ELF 的資訊，請參照 [你所不知道的 C 語言：連結器和執行檔資訊](https://hackmd.io/@sysprog/c-linker-loader)。 ``` $ objdump -s fibdrv.ko ... Contents of section .modinfo: 0000 76657273 696f6e3d 302e3100 64657363 version=0.1.desc 0010 72697074 696f6e3d 4669626f 6e616363 ription=Fibonacc 0020 6920656e 67696e65 20647269 76657200 i engine driver. 0030 61757468 6f723d4e 6174696f 6e616c20 author=National 0040 4368656e 67204b75 6e672055 6e697665 Cheng Kung Unive 0050 72736974 792c2054 61697761 6e006c69 rsity, Taiwan.li 0060 63656e73 653d4475 616c204d 49542f47 cense=Dual MIT/G 0070 504c0000 00000000 73726376 65727369 PL......srcversi 0080 6f6e3d34 42373436 37453631 43414238 on=4B7467E61CAB8 0090 32354539 35364446 38330000 00000000 25E956DF83...... 00a0 64657065 6e64733d 00726574 706f6c69 depends=.retpoli 00b0 6e653d59 006e616d 653d6669 62647276 ne=Y.name=fibdrv 00c0 00766572 6d616769 633d342e 31382e30 .vermagic=4.18.0 00d0 2d31362d 67656e65 72696320 534d5020 -16-generic SMP 00e0 6d6f645f 756e6c6f 61642000 mod_unload . ... ``` 上述可以看到 author 的資訊被寫入到 `.modinfo` section 中。更進一步，用 vim 打開 `fibdrv.ko`，並且使用 16 進位模式閱讀，可以在 [readelf](https://sourceware.org/binutils/docs/binutils/readelf.html) 輸出的 `modinfo` 區段的 `offset` (560) 中找到下面內容： ```hex 00000530: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00000540: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00000550: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00000560: 7665 7273 696f 6e3d 302e 3100 6465 7363 version=0.1.desc 00000570: 7269 7074 696f 6e3d 4669 626f 6e61 6363 ription=Fibonacc 00000580: 6920 656e 6769 6e65 2064 7269 7665 7200 i engine driver. 00000590: 6175 7468 6f72 3d4e 6174 696f 6e61 6c20 author=National 000005a0: 4368 656e 6720 4b75 6e67 2055 6e69 7665 Cheng Kung Unive 000005b0: 7273 6974 792c 2054 6169 7761 6e00 6c69 rsity, Taiwan.li 000005c0: 6365 6e73 653d 4475 616c 204d 4954 2f47 cense=Dual MIT/G 000005d0: 504c 0000 0000 0000 7372 6376 6572 7369 PL......srcversi 000005e0: 6f6e 3d32 3444 4335 4642 3745 3736 3038 on=24DC5FB7E7608 000005f0: 4146 3136 4230 4343 3146 0000 0000 0000 AF16B0CC1F...... 00000600: 6465 7065 6e64 733d 0072 6574 706f 6c69 depends=.retpoli 00000610: 6e65 3d59 006e 616d 653d 6669 6264 7276 ne=Y.name=fibdrv 00000620: 0076 6572 6d61 6769 633d 342e 3138 2e30 .vermagic=4.18.0 00000630: 2d31 352d 6765 6e65 7269 6320 534d 5020 -15-generic SMP 00000640: 6d6f 645f 756e 6c6f 6164 2000 0000 0000 mod_unload ..... 00000650: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00000660: 0000 0000 0000 0000 0000 0000 0000 0000 ................ ``` 再來看執行 `insmod` 時 Linux 核心做了什麼。摘自 [Linux Device Driver 3/e](https://lwn.net/Kernel/LDD3/) 第 2 章的資訊： > ...The use of module_init is mandatory. This macro adds a special section to the module’s object code stating where the module’s initialization function is to be found. Without this definition, your initialization function is never called. 可知 `module_init` 巨集在編譯出來的 object 中，加入初始化模組函數的起始位置。類似地，`module_exit` 的相關敘述： > Once again, the module_exit declaration is necessary to enable to kernel to find your cleanup function. ![](https://i.imgur.com/KGC9sez.png) 這邊利用 [strace](https://linux.die.net/man/1/strace) 追蹤執行 `insmod fibdrv.ko` 的過程有哪些系統呼叫被執行: ```shell= $ sudo strace insmod fibdrv.ko execve("/sbin/insmod", ["insmod", "fibdrv.ko"], 0x7ffeab43f308 /* 25 vars */) = 0 brk(NULL) = 0x561084511000 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory) access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=83948, ...}) = 0 mmap(NULL, 83948, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0621290000 close(3) = 0 ... close(3) = 0 getcwd("/home/johnnylord/fibdrv", 4096) = 24 stat("/home/johnnylord/fibdrv/fibdrv.ko", {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0 openat(AT_FDCWD, "/home/johnnylord/fibdrv/fibdrv.ko", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0 mmap(NULL, 8288, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f06212a2000 finit_module(3, "", 0) = 0 munmap(0x7f06212a2000, 8288) = 0 close(3) = 0 exit_group(0) = ? +++ exited with 0 +++m ``` 自上述第 18 行可以發現呼叫到 [finit_module](https://linux.die.net/man/2/finit_module)。去查看 linux 核心中如何宣告和實作 `finit_module`。 **[kernel/module.c](https://elixir.bootlin.com/linux/v4.18/source/kernel/module.c)** ```clike= SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags) { // ... return load_module(&info, uargs, flags); } ``` 在第 4 行可以發現執行 `load_module` 這個函式: ```cpp /* Allocate and load the module: note that size of section 0 is always zero, and we rely on this for optional sections. */ static int load_module(struct load_info *info, const char __user *uargs, int flags) { ... } ``` 而在註解的部分可以看到 `load_module` 大致就是 Linux 核心為模組配置記憶體和載入模組相關資料的地方。 - [ ] 當我們透過 `insmod` 去載入一個核心模組時，為何 `module_init` 所設定的函式得以執行呢？Linux 核心做了什麼事呢? 首先，先看看原始碼 ```cpp static int __init init_fib_dev(void) { // ... } static void __exit exit_fib_dev(void) { // ... } module_init(init_fib_dev); module_exit(exit_fib_dev); ``` 以下摘自 **[include/linux/module.h](https://elixir.bootlin.com/linux/v4.18/source/include/linux/module.h#L129)**: ```cpp= /* Each module must use one module_init(). */ #define module_init(initfn) \ static inline initcall_t __maybe_unused __inittest(void) \ { return initfn; } \ int init_module(void) __attribute__((alias(#initfn))); ``` 在第 5 行，可以看到 gcc 會在編譯過後將 `initfn` 設為 `int init_module(void)` 的別名。 :::info 1. 請參閱 ==[GCC 手冊](https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html)==，得知 `__attribute__((alias( ..)))` 的用法。 2. 可透過 `gcc -E fibdrv.c -I$TREE/include -I$TREE/arch/x86/include -I$TREE/include/uapi` 看到經過前置處理後的程式碼，`$TREE` 指的是核心原始程式碼的路徑，即 `/usr/src/linux-headers-$(uname -r)` 參考 [Stack Overflow - Kernel module source file after preprocessing](https://stackoverflow.com/questions/21177935/kernel-module-source-file-after-preprocessing) ::: 在 `linux/module.h` 裏面，有兩處定義 `module_init`，分別是還沒定義 `MODULE` 的 ```cpp #ifndef MODULE #define module_init(x) __initcall(x); ``` 還有已定義 `MODULE` 的 ```cpp #else /* MODULE */ ... /* Each module must use one module_init(). */ #define module_init(initfn) \ static inline initcall_t __maybe_unused __inittest(void) \ { return initfn; } \ int init_module(void) __copy(initfn) __attribute__((alias(#initfn))); ``` 從前置處理過後的結果可知，這裡選用沒定義 `MODULE` 的 `module_init(x)` ，也就是之後會繼續展開 `__initcall(x)`。 `__initcall(x)` 巨集被定義在 [linux/init.h](https://elixir.bootlin.com/linux/v4.15/source/include/linux/init.h) ```clike #define __initcall(fn) device_initcall(fn) ``` `device_initcall` 展開成 `__define_initcall(fn, 6)`，最後會變成我們預處理看到的結果。最後展開的巨集 ```cpp static inline __attribute__((unused)) __attribute__((no_instrument_function)) initcall_t __attribute__((unused)) __inittest(void) { return init_fib_dev; } int init_module(void) __attribute__((alias("init_fib_dev")));; ``` 再來解讀: 首先是第一段 `static inline ....`，這邊用了一個小技巧讓我們可在編譯時期就知道傳入的 function pointer 是不是合法的。我們回傳的 `return init_fib_dev`，他的資料型態必須要和 `initcall_t` 相同，否則編譯器會報錯。 > 這種作法和 [BUG_ON](https://kernelnewbies.org/FAQ/BUG) 和 C++ 的 [static assertion](https://en.cppreference.com/w/cpp/language/static_assert) 相似。再來是 [init_module](http://man7.org/linux/man-pages/man2/init_module.2.html)，我們知道這個**系統呼叫**讓我們把一個 ELF image 載入到 kernel space，而在最後一行 `int init_module(void) __attribute__((alias("init_fib_dev")))` 的目的是為了替 `init_module` 取一個別名。之所以要這樣做，是因前面的地方有寫到 ```cpp /* These are either module local, or the kernel's dummy ones. */ extern int init_module(void); ``` 這行告訴我們說，有 `init_module` 可以使用，但是不在這個地方實作。那實作在什麼地方呢？就是我們寫的 `init_fib_dev`，因為我們把 `init_module` 取了一個別名叫作 `init_fib_dev`。總結一下，`module_init` 巨集幫我們做 2 件事 1. 檢查傳入的函式，回傳值是否正確 2. 把 `init_module` 和傳入的函式關聯起來，因為 `insmod` 指令實作內部會呼叫 `init_module`。如此一來呼叫 `init_module` 就等同於呼叫我們自己寫的函式。透過以下實驗可確認否達到別名的效果: ```cpp #include <stdio.h> int __func() { printf("In __func()\n"); return 0; } int func() __attribute__((alias("__func"))); /* no function body */ int main() { func(); return 0; } ``` 編譯並執行: ```shell $ gcc -o test test.c $ ./test In __func() ``` 因此執行 `init_module()` 就相當於執行使用者自定義的函式 `initfn`。再來繼續回到為什麼 init_module 會被執行？就要回想系統會呼叫 `finit_module` 再來 `load_module`。摘自 **[kernel/module.c](https://elixir.bootlin.com/linux/v4.18/source/kernel/module.c#L3785)**: ```cpp= /* Allocate and load the module: note that size of section 0 is always zero, and we rely on this for optional sections. */ static int load_module(struct load_info *info, const char __user *uargs, int flags) { struct module *mod; //... /* Figure out module layout, and allocate all the memory. */ mod = layout_and_allocate(info, flags); if (IS_ERR(mod)) { err = PTR_ERR(mod); goto free_copy; } // ... return do_init_module(mod); // ... } ``` 自第 11 行可發現，在 `do_init_module` 之前，核心先做 `layout_and_allocate` 為載入的 module 進行記憶體配置。最後在第 19 行對 module 做初始化。摘自 **[kernel/module.c](https://elixir.bootlin.com/linux/v4.18/source/kernel/module.c#L3456)**: ```cpp /* * This is where the real work happens. * * Keep it uninlined to provide a reliable breakpoint target, e.g. for the gdb * helper command 'lx-symbols'. */ static noinline int do_init_module(struct module *mod) { // ... /* Start the module */ if (mod->init != NULL) ret = do_one_initcall(mod->init); // ... } ``` 摘自 **[init/main.c](https://elixir.bootlin.com/linux/v4.18/source/init/main.c#L874)**: ```cpp= int __init_or_module do_one_initcall(initcall_t fn) { // ... do_trace_initcall_start(fn); ret = fn(); do_trace_initcall_finish(fn, ret); // ... } ``` 可見到 `fn` 亦即傳入的 `mod->init`，核心模組的 init_function 在上述程式碼第 6 行被執行，為核心模組進行真正的初始化的工作。 - [ ] 試著執行 `$ readelf -a fibdrv.ko`, 觀察裡頭的資訊和原始程式碼及 `modinfo` 的關聯，搭配上述提問，解釋像 `fibdrv.ko` 這樣的 ELF 執行檔案是如何「植入」到 Linux 核心 Executable and Linking Format 簡稱為 ELF，可以表示一個 executable binary file 或是 object file。由於這次實驗，`fibdrv.ko` 並非可執行檔，因此這邊專注於解釋 ELF 檔案以 object file 的觀點。ELF 可大致分為 3 個部分： 1. **ELF header** 存放了有關於此 object file 的訊息 ```shell $ readelf -h fibdrv.ko ELF Header: Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 Class: ELF64 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: REL (Relocatable file) Machine: Advanced Micro Devices X86-64 Version: 0x1 Entry point address: 0x0 Start of program headers: 0 (bytes into file) Start of section headers: 6688 (bytes into file) Flags: 0x0 Size of this header: 64 (bytes) Size of program headers: 0 (bytes) Number of program headers: 0 Size of section headers: 64 (bytes) Number of section headers: 25 Section header string table index: 24 ``` 2. **Section(s)** 有系統預定義的 section，如 `.text`, `.data`, `.bss` 等等，但也有使用者定義的 section，在本例中就有 `.modinfo`. 3. **Section Header(s)** 有對應的關於每個 section 的 metadata。例如某 Section 的 size。 ```shell $ readelf -S fibdrv.ko There are 25 section headers, starting at offset 0x1a20: Section Headers: [Nr] Name Type Address Offset Size EntSize Flags Link Info Align [ 0] NULL 0000000000000000 00000000 0000000000000000 0000000000000000 0 0 0 [ 1] .note.gnu.build-i NOTE 0000000000000000 00000040 0000000000000024 0000000000000000 A 0 0 4 [ 2] .text PROGBITS 0000000000000000 00000070 000000000000015c 0000000000000000 AX 0 0 16 [ 3] .rela.text RELA 0000000000000000 00001218 0000000000000120 0000000000000018 I 22 2 8 [ 4] .init.text PROGBITS 0000000000000000 000001cc 0000000000000153 0000000000000000 AX 0 0 1 [ 5] .rela.init.text RELA 0000000000000000 00001338 00000000000003a8 0000000000000018 I 22 4 8 [ 6] .exit.text PROGBITS 0000000000000000 0000031f 0000000000000040 0000000000000000 AX 0 0 1 [ 7] .rela.exit.text RELA 0000000000000000 000016e0 00000000000000d8 0000000000000018 I 22 6 8 [ 8] __mcount_loc PROGBITS 0000000000000000 0000035f 0000000000000030 0000000000000000 A 0 0 1 [ 9] .rela__mcount_loc RELA 0000000000000000 000017b8 0000000000000090 0000000000000018 I 22 8 8 [10] .rodata.str1.1 PROGBITS 0000000000000000 0000038f :q 000000000000006e 0000000000000001 AMS 0 0 1 [11] .rodata.str1.8 PROGBITS 0000000000000000 00000400 0000000000000058 0000000000000001 AMS 0 0 8 [12] .rodata PROGBITS 0000000000000000 00000460 0000000000000100 0000000000000000 A 0 0 32 [13] .rela.rodata RELA 0000000000000000 00001848 0000000000000090 0000000000000018 I 22 12 8 [14] .modinfo PROGBITS 0000000000000000 00000560 00000000000000ec 0000000000000000 A 0 0 8 [15] .data PROGBITS 0000000000000000 00000660 0000000000000020 0000000000000000 WA 0 0 32 [16] .rela.data RELA 0000000000000000 000018d8 ... ``` 再來看 `modinfo` 這個程式和 `fibdrv.ko` 的關聯。由稍早推斷，`MODULE_XXX` 等巨集會將 module 的額外資訊放入 `fibdrv.ko` 中 `.modinfo` 中，`modinfo` 這個程式應該就是到 `fibdrv.ko` 中的 `.modinfo` 區段讀取資料並做顯示。以下是 `man modinfo` 中關於 `modinfo` 的描述。 > DESCRIPTION > ==modinfo extracts information from the Linux Kernel modules== given on the command line. If the module name is not a filename, then the /lib/modules/version directory is > searched, as is also done by modprobe(8) when loading kernel modules. > ==modinfo by default lists each attribute of the module in form fieldname : value, for > easy reading.== The filename is listed the same way (although it's not really an > attribute). - [ ] 解釋像 `fibdrv.ko` 這樣的 ELF 執行檔案是如何「植入」到 Linux 核心 `fibdrv.ko` 不是能在 shell 呼叫並執行的執行檔，它只是 ELF 格式的 object file。如果 `fibdrv.ko` 是執行檔，那麼其內容應該會包含了 Program headers 這些訊息，但是查看 ELF header 可以發現並沒有 Program header。 ```shell $ readelf -h fibdrv.ko ELF Header: Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 Class: ELF64 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: REL (Relocatable file) Machine: Advanced Micro Devices X86-64 Version: 0x1 Entry point address: 0x0 Start of program headers: 0 (bytes into file) Start of section headers: 6688 (bytes into file) Flags: 0x0 Size of this header: 64 (bytes) Size of program headers: 0 (bytes) Number of program headers: 0 Size of section headers: 64 (bytes) Number of section headers: 25 Section header string table index: 24 ``` 因此我們需要透過 `insmod` 這個程式（可執行檔）來將 `fibdrv.ko` 植入核心中。kernel module 是執行在 kernel space 中，但是 `insmod fibdrv.ko` 是一個在 user space 的程序，因此在 `insmod` 中應該需要呼叫相關管理記憶體的 system call，將在 user space 中 kernel module 的資料複製到 kernel space 中。回頭看之前說 `insmod` 會使核心執行 `finit_module` 摘自 **[linux/kernel/module.c](https://elixir.bootlin.com/linux/v4.18/source/kernel/module.c)**: ```cpp= SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags) { struct load_info info = { }; loff_t size; void *hdr; int err; // ... err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX, READING_MODULE); if (err) return err; info.hdr = hdr; info.len = size; return load_module(&info, uargs, flags); } ``` 在第 10 行核心會讀取一個檔案，在本例中，就是 `fibdrv.ko`: ```shell= $ sudo insmod fibdrv.ko ... getcwd("/home/johnnylord/fibdrv", 4096) = 24 stat("/home/johnnylord/fibdrv/fibdrv.ko", {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0 openat(AT_FDCWD, "/home/johnnylord/fibdrv/fibdrv.ko", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0 mmap(NULL, 8288, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f06212a2000 finit_module(3, "", 0) = 0 ... ``` 可見上述執行 `strace insmod fibdrv.ko` 後，在第 5 行開啟 `fibdrv.ko` 這個檔案並得到其 file descriptor 為 `3`。並在第 8 行傳入 `finit_module` 中。 - [ ] sysfs 的原理和實作 Patrick Mochel 撰寫的報告 [The sysfs Filesystem](https://mirrors.edge.kernel.org/pub/linux/kernel/people/mochel/doc/papers/ols-2005/mochel.pdf) 第 1 頁提到: > sysfs is a mechanism for representing kernel objects, their attributes, and their relationships with each other. 並在第 5 頁的 module 中提到： > The module directory contains subdirectories for each module that is loaded into the kernel.The name of each directory is the name of the module -- both the name of the module object file and the internal name of the module. `/sys/module` 這個目錄下會有以已載入 module 名稱命名的子目錄；在 [sysfs(5) -man page](http://man7.org/linux/man-pages/man5/sysfs.5.html) 提到，在 `/sys/module/`"module-name" 目錄中會有一些相關檔案，這些檔案分別紀錄了此 module 的一些資料，例如，傳入的參數值；另外，在 [/kernel/module.c - 1703 行](https://elixir.bootlin.com/linux/latest/source/kernel/module.c#L1721) 中， `module_add_modinfo_attrs`() 中，第 19行有 sysfs 相關的函式 `sysfs_create_file`: ```cpp= static int module_add_modinfo_attrs(struct module *mod) { struct module_attribute *attr; struct module_attribute *temp_attr; int error = 0; int i; mod->modinfo_attrs = kzalloc((sizeof(struct module_attribute) * (ARRAY_SIZE(modinfo_attrs) + 1)), GFP_KERNEL); if (!mod->modinfo_attrs) return -ENOMEM; temp_attr = mod->modinfo_attrs; for (i = 0; (attr = modinfo_attrs[i]) && !error; i++) { if (!attr->test || attr->test(mod)) { memcpy(temp_attr, attr, sizeof(*temp_attr)); sysfs_attr_init(&temp_attr->attr); error = sysfs_create_file(&mod->mkobj.kobj, &temp_attr->attr); ++temp_attr; } } return error; } ``` `sysfs_create_file` 函式的第二個參數為 `&temp_attr->attr`，而 `temp_attr` 是個指向 struct module_attribute [struct module_attribute](https://elixir.bootlin.com/linux/latest/source/include/linux/module.h#L52) 的指標: ```cpp= struct module_attribute { struct attribute attr; ssize_t (*show)(struct module_attribute *, struct module_kobject *, char *); ssize_t (*store)(struct module_attribute *, struct module_kobject *, const char *, size_t count); void (*setup)(struct module *, const char *); int (*test)(struct module *); void (*free)(struct module *); }; ``` 第 1 行宣告 `attr`，其型態為 [struct attribute](https://elixir.bootlin.com/linux/latest/source/include/linux/sysfs.h#L30) ( 定義於 /include/linux/sysfs.h ) : ```cpp struct attribute { const char *name; umode_t mode; #ifdef CONFIG_DEBUG_LOCK_ALLOC bool ignore_lockdep:1; struct lock_class_key *key; struct lock_class_key skey; #endif }; ``` ## 自我檢查清單 - [ ] 研讀上述 ==Linux 效能分析的提示== 描述，在自己的實體電腦運作 GNU/Linux，做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗; - [ ] 研讀上述費氏數列相關材料 (包含論文)，摘錄關鍵手法，並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 - [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise)，思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本; - [ ] `lsmod` 的輸出結果有一欄名為 `Used by`，這是 "each module's use count and a list of referring modules"，但如何實作出來呢？模組間的相依性和實際使用次數 ([reference counting](https://en.wikipedia.org/wiki/Reference_counting)) 在 Linux 核心如何追蹤呢？ - [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣，什麼場景中會需要呢？撰寫多執行緒的 userspace 程式來測試，觀察 Linux 核心模組若沒用到 mutex，到底會發生什麼問題 ## :penguin: 作業要求 * 回答上述==自我檢查清單==的所有問題，需要附上對應的參考資料和必要的程式碼，以第一手材料 (包含自己設計的實驗) 為佳 :::warning :warning: 如果你在 2020 年 3 月 2 日前，已從 GitHub [sysprog21/fibdrv](https://github.com/sysprog21/fibdrv) 進行 fork，請依據 ==[Alternatives to forking into the same account](https://github.community/t5/Support-Protips/Alternatives-to-forking-into-the-same-account/ba-p/7428)== 一文，對舊的 repository 做對應處置，然後重新 fork ::: * 在 GitHub 上 fork [fibdrv](https://github.com/sysprog21/fibdrv)，目標是修正 Fibonacci 數計算的正確性 (現有實作存在缺陷，請指出)，隨後改善 `fibdrv` 計算 Fibinacci 數列的執行效率，過程中需要量化執行時間，可在 Linux 核心模組和使用層級去測量 * 原本的程式碼只能列出到 Fibonacci(100) 而且==部分輸出是錯誤的數值==，請修改程式碼，列出後方更多數值 (注意: 檢查正確性和數值系統的使用) * 務必研讀上述 ==Linux 效能分析的提示== 的描述，降低效能分析過程中的干擾; * 在 Linux 核心模組中，可用 ktime 系列的 API; * 在 userspace 可用 [clock_gettime](https://linux.die.net/man/2/clock_gettime) 相關 API; * 分別[用 gnuplot 製圖](https://hackmd.io/@sysprog/Skwp-alOg)，分析 Fibonacci 數列在核心計算和傳遞到 userspace 的時間開銷，單位需要用 us 或 ns (自行斟酌) * 嘗試解讀上述實驗的時間分佈，特別是隨著 Fibonacci 數列增長後，對於 Linux 核心的影響。可修改 VFS 函式，允許透過寫入 `/dev/fibonacci` 裝置檔案來變更不同的 Fibonacci 求值的實作程式碼。 ![](https://i.imgur.com/7xyCUVO.png) * 用 printk 固然方便，但其執行時間易受各種因素而不穩定，除了透過統計來觀察，也可改用核心的 sysfs 介面，後者可讓我們自行宣告 show 和 store 介面來提供讀寫操作，可參見 [Sample kobject implementation](https://elixir.bootlin.com/linux/latest/source/samples/kobject/kobject-example.c) (注意: 切換到對應的 Linux 核心版本)。 * 逐步最佳化 Fibonacci 的執行時間，引入[費氏數列分析](https://hackmd.io/@sysprog/fibonacci-number) 提到的策略，並善用 [clz / ffs](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令 (Linux 核心有對應的包裝函式)，過程中都要充分量化 ## 繳交方式編輯 [Homework2 作業區共筆](https://hackmd.io/@sysprog/linux2020-homework2)，將你的觀察、上述要求的解說、應用場合探討，以及各式效能改善過程，善用 gnuplot 製圖，紀錄於新建立的共筆 ## 截止日期 Mar 21, 2020 (含) 之前越早在 GitHub 上有動態、越早接受 code review，評分越高 ## 作業觀摩 * [分組報告: fibdrv](https://hackmd.io/@zodf0055980/r1BgdLZ1r) * [2019 年春季班](https://hackmd.io/s/rygjaEK8V)