2022q1 Homework3 (fibdrv)

--- tags: Linux Kernel --- # 2022q1 Homework3 (fibdrv) contributed by < `kevinshieh0225` > > [作業需求](https://hackmd.io/@sysprog/linux2022-fibdrv#-%E4%BD%9C%E6%A5%AD%E8%A6%81%E6%B1%82) > [參考作業<`KYG-yaya573142`>](https://hackmd.io/@KYWeng/rkGdultSU#%E8%B2%BB%E6%B0%8F%E6%95%B8%E5%88%97) > [GitHub](https://github.com/kevinshieh0225/fibdrv) ## 前期準備 :::spoiler 實驗環境 ```shell $ gcc --version gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 142 Model name: Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz Stepping: 11 CPU MHz: 1800.000 CPU max MHz: 3900.0000 CPU min MHz: 400.0000 BogoMIPS: 3600.00 Virtualization: VT-x L1d cache: 128 KiB L1i cache: 128 KiB L2 cache: 1 MiB L3 cache: 6 MiB NUMA node0 CPU(s): 0-7 ``` ::: :::spoiler 依照作業需求執行模組安裝。 ```shell $ uname -r 5.13.0-30-generic // 安裝 linux-headers 套件（Ubuntu Linux 20.04 LTS） $ sudo apt install linux-headers-`uname -r` // 確認 linux-headers 套件已正確安裝於開發環境 $ dpkg -L linux-headers-5.13.0-30-generic | grep "/lib/modules" /lib/modules /lib/modules/5.13.0-30-generic /lib/modules/5.13.0-30-generic/build // check the user identity is not root $ whoami masamaloka $ sudo whoami root // 安裝分析工具 $ sudo apt install util-linux strace gnuplot-nox ``` 複製 `fibdrv` 專案並安裝模組 ```shell $ git clone https://github.com/kevinshieh0225/fibdrv $ cd fibdrv $ make check make -C /lib/modules/5.13.0-30-generic/build M=/home/masamaloka/linux2022/fibdrv modules make[1]: Entering directory '/usr/src/linux-headers-5.13.0-30-generic' make[1]: Leaving directory '/usr/src/linux-headers-5.13.0-30-generic' make unload make[1]: Entering directory '/home/masamaloka/linux2022/fibdrv' sudo rmmod fibdrv || true >/dev/null rmmod: ERROR: Module fibdrv is not currently loaded make[1]: Leaving directory '/home/masamaloka/linux2022/fibdrv' make load make[1]: Entering directory '/home/masamaloka/linux2022/fibdrv' sudo insmod fibdrv.ko insmod: ERROR: could not insert module fibdrv.ko: Operation not permitted make[1]: *** [Makefile:23: load] Error 1 make[1]: Leaving directory '/home/masamaloka/linux2022/fibdrv' make: *** [Makefile:37: check] Error 2 ``` 發生錯誤訊息，參考`YuChengPeng`的[開發紀錄](https://hackmd.io/@YuChengPeng/HkSY7SYw4?type=view)： ```shell // 查看是否有啟用 secureboot: $ dmesg|grep -i secure [ 0.000000] secureboot: Secure boot enabled [ 0.000000] Kernel is locked down from EFI Secure Boot mode; see man kernel_lockdown.7 ``` 參考作業需求說明： > 自從 Linux 核心 4.4 版以來，Ubuntu Linux 預設開啟 EFI_SECURE_BOOT_SIG_ENFORCE，這使得核心模組需要適度的簽章才可掛載進入 Linux 核心，為了後續測試的便利，我們需要將 UEFI Secure Boot 的功能關閉，請見 [Why do I get “Required key not available” when install 3rd party kernel modules or after a kernel upgrade?](https://askubuntu.com/questions/762254/why-do-i-get-required-key-not-available-when-install-3rd-party-kernel-modules) ```shell $ sudo apt install mokutil $ sudo mokutil --disable-validation ``` 設定 8-16 字元密碼，重新開機後系統詢問是否更改安全設定，選 `YES` 並`reboot`，重跑測試： ```shell $ make check make -C /lib/modules/5.13.0-30-generic/build M=/home/masamaloka/linux2022/fibdrv modules make[1]: Entering directory '/usr/src/linux-headers-5.13.0-30-generic' make[1]: Leaving directory '/usr/src/linux-headers-5.13.0-30-generic' make unload make[1]: Entering directory '/home/masamaloka/linux2022/fibdrv' sudo rmmod fibdrv || true >/dev/null [sudo] password for masamaloka: rmmod: ERROR: Module fibdrv is not currently loaded make[1]: Leaving directory '/home/masamaloka/linux2022/fibdrv' make load make[1]: Entering directory '/home/masamaloka/linux2022/fibdrv' sudo insmod fibdrv.ko make[1]: Leaving directory '/home/masamaloka/linux2022/fibdrv' sudo ./client > out make unload make[1]: Entering directory '/home/masamaloka/linux2022/fibdrv' sudo rmmod fibdrv || true >/dev/null make[1]: Leaving directory '/home/masamaloka/linux2022/fibdrv' Passed [-] f(93) fail input: 7540113804746346429 expected: 12200160415121876738 ``` 觀察產生的 `fibdrv.ko` 核心模組。 ```shell $ modinfo fibdrv.ko filename: /home/masamaloka/linux2022/fibdrv/fibdrv.ko version: 0.1 description: Fibonacci engine driver author: National Cheng Kung University, Taiwan license: Dual MIT/GPL srcversion: 9A01E3671A116ADA9F2BB0A depends: retpoline: Y name: fibdrv vermagic: 5.13.0-30-generic SMP mod_unload modversions ``` 透過 `insmod` 將模組載入核心後（參考 `jserv` [Linux 核心模組掛載機制](https://hackmd.io/@sysprog/linux-kmod)），觀察 `fibdrv.ko` 核心模組在 Linux 核心掛載後的行為。 ```shell $ sudo insmod fibdrv.ko $ ls -l /dev/fibonacci crw------- 1 root root 511, 0 三 4 12:58 /dev/fibonacci $ cat /sys/class/fibonacci/fibonacci/dev 511:0 // 輸出是 0.1，這和 fibdev.c 透過 MODULE_VERSION 所指定的版本號碼相同。 $ cat /sys/module/fibdrv/version 0.1 ``` 以下兩個執行結果為 0，所指為目前的 [Reference counting](https://zh.wikipedia.org/wiki/%E5%BC%95%E7%94%A8%E8%AE%A1%E6%95%B0) > 參照計數是電腦程式語言中的一種記憶體管理技術，是指將資源（可以是物件、記憶體或磁碟空間等等）的被參照次數儲存起來，當被參照次數變為零時就將其釋放的過程。使用參照計數技術可以實現自動資源管理的目的。同時參照計數還可以指使用參照計數技術回收未使用資源的垃圾回收演算法。 > > 當建立一個物件的實例並在 heap 上申請記憶體時，物件的參照計數就為1，在其他物件中需要持有這個物件時，就需要把該物件的參照計數加 1，需要釋放一個物件時，就將該物件的參照計數減 1，直至物件的參照計數為0，物件的記憶體會被立刻釋放。 > 不要引用中文 Wikipedia 的詞目，因為裡頭內容往往過時，且詞彙陳舊。 > :notes: jserv ```shell $ lsmod | grep fibdrv fibdrv 16384 0 $ cat /sys/module/fibdrv/refcnt 0 ``` :::: ## fibdrv 核心模組原理與實作理解本環節的目的是要從原始程式碼著手了解 fibdrv kernel module 的原理與實作，可搭配老師提供的 Writing a Linux Kernel Module 文本閱讀，收益良多。 - [Writing a Linux Kernel Module — Part 1: Introduction](http://derekmolloy.ie/writing-a-linux-kernel-module-part-1-introduction/) - [Writing a Linux Kernel Module — Part 2: A Character Device](http://derekmolloy.ie/writing-a-linux-kernel-module-part-2-a-character-device/) ### Makefile ```shell ## 將 fibdrv 編譯為核心模組 $ make -C /lib/modules/5.13.0-30-generic/build M=/home/masamaloka/linux2022/fibdrv modules ## 將模組 fibdrv.ko 載入至核心 $ (make load) sudo insmod fibdrv.ko ## 將模組 fibdrv.ko 卸載 ## ||: 左邊如果不成功執行右邊 ## true: Do nothing, Exit with a status code indicating success. ## >/dev/null: Redirecting output to /dev/null, ## which will make the shell print clean $ (make unload) sudo rmmod fibdrv.ko || true >/dev/null ## 由 ./client 執行驗證，並將結果轉向到 out 檔案。 $ sudo ./client > out ``` ### fibdrv.c `fibdrv.c` 是我們主要宣告並實作核心裝置的程式碼。 `MODULE_` 設定模組資訊，在建立核心模組後可用 `$ modinfo fibdrv.ko` 查看資訊。 `dev_t fib_dev` 宣告 [Device Number](https://www.oreilly.com/library/view/linux-device-drivers/0596000081/ch03s02.html) `struct cdev` 宣告 [Character device drivers](https://linux-kernel-labs.github.io/refs/heads/master/labs/device_drivers.html) > [Part 2: A Character Device](http://derekmolloy.ie/writing-a-linux-kernel-module-part-2-a-character-device/) > A character device typically transfers data to and from a user application — they behave like pipes or serial ports, instantly reading or writing the byte data in a character-by-character stream. They provide the framework for many typical drivers, such as those that are required for interfacing to serial communications, video capture, and audio devices. The main alternative to a character device is a block device. Block devices behave in a similar fashion to regular files, allowing a buffered array of cached data to be viewed or manipulated with operations such as reads, writes, and seeks. `struct class` 宣告 [device class](https://01.org/linuxgraphics/gfx-docs/drm/driver-api/driver-model/class.html) `DEFINE_MUTEX` 宣告 Mutex Lock ```c MODULE_LICENSE("Dual MIT/GPL"); MODULE_AUTHOR("National Cheng Kung University, Taiwan"); MODULE_DESCRIPTION("Fibonacci engine driver"); MODULE_VERSION("0.1"); static dev_t fib_dev = 0; static struct cdev *fib_cdev; static struct class *fib_class; static DEFINE_MUTEX(fib_mutex); ``` `file_operations` 的結構體提供我們驅動程式裝置（device）的操作介面，並讓我們實現介面內部操作（[參考](https://codingnote.cc/zh-tw/p/200440/)）。首先`#include <linux/fs.h>`，透過函式定義 `fib_read` 讀取裝置, `fib_write` 寫入裝置, `fib_open`打開裝置, `fib_release`釋放裝置, `fib_device_lseek`指向 offset 位置。 ```c const struct file_operations fib_fops = { .owner = THIS_MODULE, .read = fib_read, .write = fib_write, .open = fib_open, .release = fib_release, .llseek = fib_device_lseek, }; ``` **程式實作**：雖然我們主要更改的是 `fib_sequence` 的函式演算法部份，但也必須對 character device 的實作有所認知。 ```c /** * fib_sequence: return the fib result on index k num * * in the origin implementation, instinctly add the fib equation from 1 to k: * f(k) = f(k-1) + f(f-2) * * 這是我們主要去更動的函式。須解決高位數運算的問題，並兼顧效能效率。 */ static long long fib_sequence(long long k) /** * fib_open: open the fibdrv device * * use mutex_trylock to check of the ownership of mutex lockk. */ static int fib_open(struct inode *inode, struct file *file) { if (!mutex_trylock(&fib_mutex)) { printk(KERN_ALERT "fibdrv is in use"); return -EBUSY; } return 0; } /** * init_fib_dev: initialize fib device * * initial fib_mutex, fib_class, fib_cdev and fib_dev. * if initial fail, delete and return. */ static int __init init_fib_dev(void) /** * fib_read: calculate the fibonacci number at given offset * * @file: * @buf: * @size: * @offset: */ static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { return (ssize_t) fib_sequence(*offset); } /** * fib_device_lseek: repositions the file offset of the open file description associated with the file descriptor fd to the argument offset according to the directive whence as follows: * * SEEK_SET: The file offset is set to offset bytes. * SEEK_CUR: The file offset is set to its current location plus offset bytes. * SEEK_END: The file offset is set to the size of the file plus offset bytes. * * restrict the positioin as not to reach to overflow. * * @file: * @offset: * @orig: */ static loff_t fib_device_lseek(struct file *file, loff_t offset, int orig) { loff_t new_pos = 0; switch (orig) { case 0: /* SEEK_SET: */ new_pos = offset; break; case 1: /* SEEK_CUR: */ new_pos = file->f_pos + offset; break; case 2: /* SEEK_END: */ new_pos = MAX_LENGTH - offset; break; } if (new_pos > MAX_LENGTH) new_pos = MAX_LENGTH; // max case if (new_pos < 0) new_pos = 0; // min case file->f_pos = new_pos; // This is what we'll use now return new_pos; } /** * fib_release: release the mutex lock for other device to use */ static int fib_release(struct inode *inode, struct file *file) { mutex_unlock(&fib_mutex); return 0; } /** * exit_fib_dev: free and exit the device * * delete fib_mutex, fib_class, fib_cdev and fib_dev. * Unregister the character device. */ static void __exit exit_fib_dev(void) { mutex_destroy(&fib_mutex); device_destroy(fib_class, fib_dev); class_destroy(fib_class); cdev_del(fib_cdev); unregister_chrdev_region(fib_dev, 1); } ``` ### client.c 我們透過 user space 與 kernel device 溝通與測試功能的主程式碼。程式流程大致如下： 1. for 1~offset `write` info to out 2. for 1~offset implement `lseek`, `read` to implement `fib_sequence` 3. for offset~1 implement `lseek`, `read` to implement `fib_sequence` :::warning 避免書寫 `1~offset` 和 `offset~1` 這樣模糊的話，改用 $LaTeX$ 重新排版和確立數值範圍。 :notes: jserv ::: ## 效能分析 > 參考 [KYG-yaya573142](https://hackmd.io/@KYWeng/rkGdultSU#%E6%8E%92%E9%99%A4%E5%B9%B2%E6%93%BE%E6%95%88%E8%83%BD%E5%88%86%E6%9E%90%E7%9A%84%E5%9B%A0%E7%B4%A0) ### 排除干擾效能分析的因素 #### 限定 CPU 給特定的程式使用以 `sudo` 管理者權限更改 `/etc/default/grub` 檔案。修改 `GRUB_CMDLINE_LINUX_DEFAULT` 參數，新增 `"isolcpus=x(,x...)"` 指定 cpu 核心只給特定程式使用。 ```shell $ sudo vim /etc/default/grub ... GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=7" ... ``` 重新開機後檢查行程的 CPU affinity： ```shell $ taskset -cp 1 pid 1's current affinity list: 0-6 $ cat /sys/devices/system/cpu/isolated 7 ``` #### 將程式固定在特定的 CPU 中執行 ```shell $ sudo taskset -c 7 ./client ``` #### 抑制 address space layout randomization (ASLR) ```shell $ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space" ``` #### 設定 scaling_governor 為 performance ```shell $ vim performance.sh for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor do echo performance > ${i} done $ sudo sh performance.sh ``` #### 關閉 Intel 處理器的 turbo mode ```shell $ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" ``` ### kernel space 與 user space 的傳遞時間 #### kernel space 使用 `ktime_t` 從 kernel space 測試時間。 `FIB_TIME_PROXY` 巨集：將待測時間函式傳入，回傳所用時間。考量到 `KYG-yaya573142` 在提及 [clz 的影響](https://hackmd.io/@KYWeng/rkGdultSU#clz-%E7%9A%84%E5%BD%B1%E9%9F%BF)提及，因沒有用到回傳值而被優化掉，使用 `escape` 來確保回傳值不會被編譯器優化掉。參考 [GNU Extended Asm - Volatile]([Extended-Asm.html#Volatile](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Volatile)) ```c /* * prevent compilor for optimize the none return value * from fib_write. */ __attribute__((always_inline)) static inline void escape(void *p) { __asm__ volatile ("" : : "g"(p) : "memory"); } #define FIB_TIME_PROXY(fib_f, k, result) \ ({ \ ktime_t kt = ktime_get(); \ result = fib_f(k); \ (size_t) ktime_sub(ktime_get(), kt); \ }) static ssize_t fib_write(struct file *file, const char *buf, size_t mode, loff_t *offset) { ktime_t kt; uint64_t result = 0; switch (mode) { case 0: kt = FIB_TIME_PROXY(fib_sequence, *offset, result); break; case 1: kt = FIB_TIME_PROXY(fib_fast_doubling, *offset, result); break; default: return 1; } escape(&result); return (ssize_t) ktime_to_ns(kt); } ``` #### user space and system call test 使用 [clock_gettime](https://man7.org/linux/man-pages/man2/clock_getres.2.html) 來取得時間，並將 kernel space runtime - user space runtime 來估測 system call 所佔用的時間。 ```c struct timespec t1, t2; clock_gettime(CLOCK_MONOTONIC, &t1); // kernel space runtime time[0][n] = (double) write(fd, write_buf, 0); clock_gettime(CLOCK_MONOTONIC, &t2); // user space runtime time[1][n] = (double)(t2.tv_sec * 1e9 + t2.tv_nsec) - (t1.tv_sec * 1e9 + t1.tv_nsec); // ns // system call runtime time[2][n] = time[1][n] - time[0][n]; ``` ![](https://i.imgur.com/w6fvRri.png) ## 費氏數列實作 ### iterative 原本的實作依照費氏數列的特性寫成： ```c static long long fib_sequence(long long k) { /* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */ long long f[k + 2]; f[0] = 0; f[1] = 1; for (int i = 2; i <= k; i++) { f[i] = f[i - 1] + f[i - 2]; } return f[k]; } ``` 這個作法會出現一個警告：`FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel.` 直覺上可改以`malloc` 改成指派在 heap memory 中，然而在 kernel module 不可使用 `<stdlib.h>` 的標準函式庫（[參考](https://stackoverflow.com/questions/15662480/error-stdio-h-no-such-file-or-directory-error-during-make)）。可改以 `kmalloc`, `kzmalloc`, `vmalloc` 來使用（[參考](https://blog.csdn.net/lu_embedded/article/details/51588902)）。換個角度想，我們只要用兩個暫存值紀錄前兩次的費氏數值，並更新暫存即可。 ```c static uint64_t fib_sequence_register_ifelse(long long k) { uint64_t state[] = {0,1}; for (int i = 2; i <= k; i++) { if (i & 1) state[0] += state[1]; else state[1] += state[0]; } return state[(k & 1)]; } static uint64_t fib_sequence_register(long long k) { uint64_t state[] = {0,1}; for (int i = 2; i <= k; i++) { state[(i & 1)] += state[((i - 1) & 1)]; } return state[(k & 1)]; } ``` 我的實作改為用 state counter 來決定更新的暫存位置。當 i 為偶數時更新 counter = 0，奇數更新 counter = 1。下面實作用 `state[(i & 1)]` 作為更新的存值 $F(i)$ ，加上原本的`state[(i & 1)]`和 `state[((i - 1) & 1)]` ，分別等於 $F(i-2)+F(i-1)$ 。使用 kernel space 量測執行時間： ![](https://i.imgur.com/xpDSQC8.png) 每次執行的結果不太一樣，需要再觀察。在此實驗中 `with register` 是最快的，但也有時候是執行最慢的。另外用 `if-else` 的作法執行時間是最不穩定的 ### fast doubling 講義中[費氏數列](https://hackmd.io/PMozqaxqRA-jF2KmgTZcPQ?view#-%E8%B2%BB%E6%B0%8F%E6%95%B8%E5%88%97)推導了一種 `fast doubling` 的演算法： $$ \begin{split} F(2k) &= F(k)[2F(k+1) - F(k)] \\ F(2k+1) &= F(k+1)^2+F(k)^2 \end{split} $$ 在網誌 [Calculating Fibonacci Numbers by Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling)中描述了遞迴與迭代的實作方法，我們針對迭代方式做探討： ```c static uint64_t fib_fast_doubling(long long k) { // The position of the highest bit of n. // So we need to loop `h` times to get the answer. // Example: n = (Dec)50 = (Bin)00110010, then h = 6. // ^ 6th bit from right side unsigned int h = (sizeof(k) << 3) - __builtin_clz(k); uint64_t a = 0; // F(0) = 0 uint64_t b = 1; // F(1) = 1 // There is only one `1` in the bits of `mask`. The `1`'s position is same // as the highest bit of n(mask = 2^(h-1) at first), and it will be shifted // right iteratively to do `AND` operation with `n` to check `n_j` is odd or // even, where n_j is defined below. for (unsigned int mask = 1 << (h - 1); mask; mask >>= 1) { // Run h times! // Let j = h-i (looping from i = 1 to i = h), n_j = floor(n / 2^j) = // n >> j (n_j = n when j = 0), k = floor(n_j / 2), then a = F(k), // b = F(k+1) now. uint64_t c = a * (2 * b - a); // F(2k) = F(k) * [ 2 * F(k+1) – F(k) ] uint64_t d = a * a + b * b; // F(2k+1) = F(k)^2 + F(k+1)^2 if (mask & k) { // n_j is odd: k = (n_j-1)/2 => n_j = 2k + 1 a = d; // F(n_j) = F(2k + 1) b = c + d; // F(n_j + 1) = F(2k + 2) = F(2k) + F(2k + 1) } else { // n_j is even: k = n_j/2 => n_j = 2k a = c; // F(n_j) = F(2k) b = d; // F(n_j + 1) = F(2k + 1) } } return a; } ``` 使用 kernel space 量測執行時間： ![](https://i.imgur.com/80PtYtB.png) #### 測試 clz 加速測試相同實驗在： ```c static uint64_t fib_fast_doubling(long long k) { ... for (unsigned int mask = 1 << (31); ... for (unsigned int mask = 1 << (16); ... for (unsigned int mask = 1 << (6); ... for (unsigned int mask = 1 << (31 - __builtin_clz(k)); ... ... } ``` ![](https://i.imgur.com/xtUele1.png) ### why overflow? 參考 < `KYG-yaya573142` > 在 [F(92) 以後的數值錯誤的原因](https://hackmd.io/BZjwYw1FQ1O0NPY5kIS7VQ?view#F92-%E4%BB%A5%E5%BE%8C%E7%9A%84%E6%95%B8%E5%80%BC%E9%8C%AF%E8%AA%A4%E7%9A%84%E5%8E%9F%E5%9B%A0)有十分詳盡的說明，就算使用 `uint_64` 來增長可用位元數量，還是只能推到 F(93) 的正確性。 ## big-num 實作分析 **從參考中學習**：先試著執行 < `KYG-yaya573142` > 原始程式碼： [bn_kernel.h](https://github.com/KYG-yaya573142/fibdrv/blob/master/bn_kernel.h), [bn_kernel.c](https://github.com/KYG-yaya573142/fibdrv/blob/master/bn_kernel.c) :::danger 要將兩個檔案編譯進 .ko 時必須額外編輯 Makefile。 ::: - [ ] 原有的 `Makefile` ``` obj-m += $(TARGET_MODULE).o ``` - [ ] 變更後的 `Makefile` ``` obj-m += $(TARGET_MODULE).o $(TARGET_MODULE)-objs := \ fibdrv.o \ bn_kernel.o \ ``` 注意到我們使用 `-objs` 將 object 連結到 `$(TARGET_MODULE)` kernel object 所以我們要連結的 object 名稱不能等同於 `$(TARGET_MODULE)` ### struct bn 使用結構體做大數存儲，number 是數值陣列，index 從小到大分別為低位元到高位元。 size 代表我們數值陣列的大小，sign 代表我們的正負。 ```c typedef struct _bn { bn_data *number; int size; int sign; } bn; ``` ### [bn_to_string](https://hackmd.io/BZjwYw1FQ1O0NPY5kIS7VQ?view#bn-%E8%B3%87%E6%96%99%E7%B5%90%E6%A7%8B) 由於我們的大數結構體無法直接轉為 string 輸出形式，於是我們使用字串陣列將大數數值從二進位轉換為十進位 ASCII 表示法。 ```c char *bn_to_string(const bn *src) { // log10(x) = log2(x) / log2(10) ~= log2(x) / 3.322 size_t len = (BN_WSIZE * src->size) / 3 + 2 + src->sign; char *s = kmalloc(len, GFP_KERNEL); char *p = s; // 宣告一個長度為 len ，全字串初始化為 '0' 的字串陣列 memset(s, '0', len - 1); s[len - 1] = '\0'; /* * 二進位整數轉換為十進位字串 * ＃選擇二進位整數因為運算效率 * 從最高位元向最低位元，逐字元乘二累加轉換成十進位 */ /* src.number[0] contains least significant bits */ for (int i = src->size - 1; i >= 0; i--) { /* walk through every bit of bn */ for (bn_data d = (bn_data) 1 << (BN_WSIZE - 1); d; d >>= 1) { /* binary -> decimal string * 確認該位元是否為 1-bit * !(x) -> if 0 : 1 ; else : 0 * !!(x) -> reverse the former result */ int carry = !!(d & src->number[i]); /* 處理十進位字串： * 1. 由 bn 的 msb 依序累加到 lsb（乘二累加） * 另外由字串的低位元逐一整理到高位元（以便處理進位問題） * * 2. 乘二累加 * s[j] - '0': 純數 * s[j] = '0' + 純數*2 + carry * * 3. 進位處理 */ for (int j = len - 2; j >= 0; j--) { s[j] += s[j] - '0' + carry; carry = (s[j] > '9'); if (carry) s[j] -= 10; } } } // skip leading zero while (p[0] == '0' && p[1] != '\0') { p++; } if (src->sign) *(--p) = '-'; memmove(s, p, strlen(p) + 1); return s; } ``` ### [bn_lshift](https://hackmd.io/BZjwYw1FQ1O0NPY5kIS7VQ?view#bit-shift) 在 fast doubling 實作中常使用到位元左移，於是定義了大數左移的函式，其中我們最多限制位移量為一個 word size 的大小以下（`uint:32, uint64_t:64`）。 ```c void bn_lshift(bn *src, size_t shift, bn *dest) { size_t z = bn_clz(src); shift %= BN_WSIZE; // only handle shift within BN_WSIZE bits atm if (!shift) return; /* * if shift size is bigger than current save limit, * resize with one more space */ bn_resize(dest, src->size + (shift > z)); /* * if the size change, handle the leading number first * (calculate the carry) */ if (dest->size > src->size) dest->number[dest->size - 1] = src->number[src->size - 1] >> (BN_WSIZE - shift); /* * dest[n] = (src[n] << shift) + (the carry from src[n-1] << shift) */ for (int i = src->size - 1; i > 0; i--) dest->number[i] = src->number[i] << shift | src->number[i - 1] >> (BN_WSIZE - shift); dest->number[0] = src->number[0] << shift; } ``` ### [bn_add / bn_sub](https://hackmd.io/BZjwYw1FQ1O0NPY5kIS7VQ?view#%E5%8A%A0%E6%B3%95%E8%88%87%E6%B8%9B%E6%B3%95) 在學姐的實作中，將兩數減法一起丟入兩數加法中，並依照情境不同，分別對應使用不同的處理。以 psudo code 來呈現 `bn_add`： ```c void bn_sub(const bn *a, const bn *b, bn *c) { /* * c = a - b * make it c = a + (-b) * # just XOR the sign-bit from bn struct * and call bn_add(a, -b, c) */ } void bn_add(const bn *a, const bn *b, bn *c) { /* both positive or negative, consider bn_do_add */ if (a->sign == b->sign) { bn_do_add(a, b, c); c->sign = a->sign; } else { /* different sign, consider bn_do_sub */ /* let a > 0, b < 0 */ if (a->sign) // if a is negative, swap two bn SWAP(a, b); /* compare |a| and |b| */ int cmp = bn_cmp(a, b); if (cmp > 0) { /* |a| > |b| and b < 0, hence c = a - |b| */ bn_do_sub(a, b, c); c->sign = 0; } else if (cmp < 0) { /* |a| < |b| and b < 0, hence c = -(|b| - |a|) */ bn_do_sub(b, a, c); c->sign = 1; } else { /* |a| == |b| */ bn_resize(c, 1); c->number[0] = 0; c->sign = 0; } } } ``` #### bn_do_add ```c /* |c| = |a| + |b| */ static void bn_do_add(const bn *a, const bn *b, bn *c) { /* * 首先計算加法後的位數大小，並重新配置 c 的陣列大小 * max digits = max(sizeof(a) + sizeof(b)) + 1 * ＃ bn_msb -> 計算位元 most significant bit 的大小 * * 位元大小 d 應以多大的 bn 陣列存取： * d = DIV_ROUNDUP(d, BN_WSIZE) (!d 應是多餘的，因必不為零) */ int d = MAX(bn_msb(a), bn_msb(b)) + 1; d = DIV_ROUNDUP(d, BN_WSIZE) + !d; bn_resize(c, d); // round up, min size = 1 bn_data carry = 0; for (int i = 0; i < b->size; i++) { bn_data tmp1 = a->number[i]; bn_data tmp2 = b->number[i]; carry = (tmp1 += carry) < carry; carry += (c->number[i] = tmp1 + tmp2) < tmp2; } if (a->size != b->size) { // deal with the remaining part if asize > bsize for (int i = b->size; i < a->size; i++) { bn_data tmp1 = a->number[i]; carry = (tmp1 += carry) < carry; c->number[i] = tmp1; } } if (!c->number[c->size - 1] && c->size > 1) bn_resize(c, c->size - 1); } ``` #### bn_do_sub ## big num algorithm 改進 ### fib sequence 改進在開始 fast doubling 的改進前，我先對 `bn_fib` 的演算法做改寫，改成使用 state counter 的方式在決定累加的對象。 ```c void bn_fib(bn *dest, unsigned int n) { bn_resize(dest, 1); if (n < 2) { // Fib(0) = 0, Fib(1) = 1 dest->number[0] = n; return; } bn *state[2]; state[0] = bn_alloc(1); state[1] = bn_alloc(1); state[0]->number[0] = 0; state[1]->number[0] = 1; for (unsigned int i = 2; i <= n; ++i) bn_add(state[(i & 1)], state[((i - 1) & 1)], state[(i & 1)]); bn_cpy(dest, state[(n & 1)]); bn_free(state[0]); bn_free(state[1]); } ``` 在透過 `verify.py` 確認數值到 10000 時都是正確後，來看看表現： ![](https://i.imgur.com/luJTH6N.png) （發現原本的圖畫錯了）在減少 `bn_cpy` 和 `bn_swap` 的使用，並以 state counter 來紀錄暫存位置後，果然加速不少。 ### perf 分析在 user space 進行 `perf stat` `perf record` 分析原始程式碼。連結 [bn.c](https://github.com/KYG-yaya573142/fibdrv/blob/6a30bc8f8ca04cbf1ae00cf17699b5e1340e48dc/bn.c) [bn.h](https://github.com/KYG-yaya573142/fibdrv/blob/6a30bc8f8ca04cbf1ae00cf17699b5e1340e48dc/bn.h) 並執行： ```c // client_perf.c #include "bn.h" #define ITH 1000 #define ITER_TIMES 100000 int main(int argc, char const *argv[]) { bn *test = bn_alloc(1); for (int i = 0; i < ITER_TIMES; i++) { bn_fdoubling_v0(test, ITH); escape(test->number); } bn_free(test); return 0; } ``` 使用 `perf stat` 比較原本 `bn_fib` 和我的版本的程式的表現： ```shell $ sudo taskset -c 7 perf stat -r 50 -e cycles,instructions,cache-references,\ cache-misses,branch-instructions,branch-misses ./client_perf ``` 執行 50 次，針對 `cycles/instructions`、`cache`、`branch` 的使用狀況做觀察。 - [ ] `bn_fib_v0` ``` Performance counter stats for './client_perf' (50 runs): 34,2885,9666 cycles ( +- 0.12% ) (66.45%) 88,7702,9350 instructions # 2.59 insn per cycle ( +- 0.04% ) (83.26%) 40,2550 cache-references ( +- 8.58% ) (83.40%) 10,5777 cache-misses # 26.277 % of all cache refs ( +- 11.11% ) (83.33%) 10,4231,1692 branch-instructions ( +- 0.04% ) (83.47%) 211,2169 branch-misses # 0.20% of all branches ( +- 0.24% ) (83.34%) 0.9999 +- 0.0126 seconds time elapsed ( +- 1.26% ) ``` - [ ] `bn_fib_v1` ``` Performance counter stats for './client_perf' (50 runs): 31,7304,9650 cycles ( +- 0.14% ) (66.46%) 83,6467,8577 instructions # 2.64 insn per cycle ( +- 0.05% ) (83.28%) 23,1109 cache-references ( +- 4.92% ) (83.40%) 4,9677 cache-misses # 21.495 % of all cache refs ( +- 7.30% ) (83.34%) 9,9458,4240 branch-instructions ( +- 0.04% ) (83.47%) 188,3470 branch-misses # 0.19% of all branches ( +- 3.52% ) (83.33%) 0.92731 +- 0.00764 seconds time elapsed ( +- 0.82% ) ``` 我的實作方式在指令數、cache、branch 上的表現都略佳，但顯然 $O(n)$ 時間複雜度還是影響執行時間最大的硬傷。再來我們透過 perf record 來查看 fast doubling 的執行狀況並改進。 ```shell= $ sudo perf record -g --call-graph dwarf ./client_perf $ sudo perf report --stdio -g graph,0.5,caller ``` ``` 94.50% 0.00% client_perf client_perf [.] _start | ---_start __libc_start_main main bn_fdoubling_v0 | |--66.51%--bn_mult | | | |--27.50%--bn_mult_add | | | |--8.00%--bn_msb | | | | | --4.52%--bn_clz | | | |--5.71%--bn_alloc | | | | | |--2.56%--__GI___libc_malloc (inlined) | | | | | | | |--0.64%--checked_request2size (inlined) | | | | | | | --0.63%--tcache_get (inlined) | | | | | --0.78%--__memset_avx2_unaligned_erms | | | |--4.48%--bn_free | | | | | |--2.54%--_int_free | | | | | | | --0.64%--tcache_put (inlined) | | | | | --0.82%--__GI___libc_free (inlined) | | | --3.03%--bn_swap | |--9.77%--bn_add | | | --8.82%--bn_do_add | | | --4.78%--bn_msb | | | --4.15%--bn_clz | |--6.76%--bn_sub | | | --6.29%--bn_add | | | |--5.65%--bn_do_sub | | | | | |--1.13%--bn_clz | | | | | --0.64%--bn_resize | | | --0.64%--bn_cmp | |--5.88%--bn_cpy | | | |--2.86%--bn_resize | | | | | --1.43%--__GI___libc_realloc (inlined) | | | | | --1.12%--_int_realloc | | | --0.62%--__memmove_avx_unaligned_erms | --2.54%--bn_lshift | --1.75%--bn_clz ``` - `bn_mult` 佔了 66.61% ，為主要使用。其中 `bn_mult_add` 佔了 27.5%，是需要觀察的。 - 這個版本裡的動態記憶體配置下降很多，原本 `KYWeng` 是佔了 20.74% 在我這裡只剩 5.71%。 ### fast doubling 改進 #### v1: 從主程式演算法下手： 1. 將 `cpy` 改為 `swap`，並減少不必要變數 fast doubling 需要頻繁的指派，在 big-num 初版中使用 `cpy` 來賦值，但其實可以改為用 `swap` 來實現，另外在進一步改寫，可讓中間變數剩下`f2`, `k` 即可，我們用程式碼註解來觀察兩者差異。 2. 修改迴圈的起始值：用 `clz` 先將 i 推到 leading bit 的位置上。（針對 bn_lshift 先改寫） ```c /* dest = src << shift (maximun shift 31) */ void bn_lshift(bn *src, size_t shift, bn *dest) { size_t z = bn_clz(src); shift %= 32; // only handle shift within 32 bits atm if (!shift) return; bn_resize(dest, src->size + (shift > z)); if (dest->size > src->size) dest->number[dest->size - 1] = src->number[src->size - 1] >> (32 - shift); for (int i = src->size - 1; i > 0; i--) dest->number[i] = src->number[i] << shift | src->number[i - 1] >> (32 - shift); dest->number[0] = src->number[0] << shift; } ``` ```c void bn_fdoubling_v0(bn *dest, unsigned int n) { ... for (unsigned int i = 1U << 31; i; i >>= 1) { /* F(2k) = F(k) * [ 2 * F(k+1) – F(k) ] */ // bn_cpy(k1, f2); // k1 = F(k+1) bn_lshift(f2, 1, k1); // k1 = 2* F(k+1) bn_sub(k1, f1, k1); // k1 = 2 * F(k+1) – F(k) bn_mult(k1, f1, k1); // k1 = k1 * f1 = F(2k) /* state: k1 = F(2k) ; k2 = X; f1 = F(k); f2 = F(k+1) */ /* F(2k+1) = F(k)^2 + F(k+1)^2 */ bn_mult(f1, f1, f1); // f1 = F(k)^2 bn_mult(f2, f2, f2); // f2 = F(k+1)^2 bn_cpy(k2, f1); // k2 = F(k)^2 bn_add(k2, f2, k2); // k2 = F(k)^2 + F(k+1)^2 = F(2k+1) /* state: k1 = F(2k) ; k2 = F(2k+1); f1 = X; f2 = X */ if (n & i) { bn_cpy(f1, k2); bn_cpy(f2, k1); bn_add(f2, k2, f2); /* state: f1 = F(2k+1); f2 = F(2k+2) */ } else { bn_cpy(f1, k1); bn_cpy(f2, k2); /* state: f1 = F(2k); f2 = F(2k+1) */ } } ... } void bn_fdoubling_v1(bn *dest, unsigned int n) { ... for (unsigned int i = 1U << (31 - __builtin_clz(n)); i; i >>= 1) { /* F(2k) = F(k) * [ 2 * F(k+1) – F(k) ] */ bn_lshift(f2, 1, k); // k = 2* F(k+1) bn_sub(k, f1, k); // k = 2 * F(k+1) – F(k) bn_mult(k, f1, k); // k = k1 * f1 = F(2k) /* state: k = F(2k); f1 = F(k); f2 = F(k+1) */ /* F(2k+1) = F(k)^2 + F(k+1)^2 */ bn_mult(f1, f1, f1); // f1 = F(k)^2 bn_mult(f2, f2, f2); // f2 = F(k+1)^2 bn_add(f1, f2, f2); // f2 = f1^2 + f2^2 = F(2k+1) now bn_swap(f1, k); // f1 <-> k, f1 = F(2k) now /* state: k = X; f1 = F(2k); f2 = F(2k+1) */ if (n & i) { bn_swap(f1, f2); // f1 = F(2k+1) bn_add(f1, f2, f2); // f2 = F(2k+2) } } ... } ``` 意即是我們針對 `F(2k)` 的運算結果先放在 `k` ，另一邊 `F(2k+1)` 的計算結果放在 `f2`（原本是先將結果放在 `k2`），最後我們再依據情況做指派交換。這樣就少了一個不必要暫存。 ![](https://i.imgur.com/QGswWir.png) 得到大幅的改進。（雖然影響最多的應該是第二點： `clz` 的改變） #### v2 - 善用 64-bit CPU 的優勢 ```c #if defined(__LP64__) || defined(__x86_64__) || defined(__amd64__) || defined(__aarch64__) #define BN_WSIZE 64 #else #define BN_WSIZE 32 #endif #if BN_WSIZE == 64 typedef uint64_t bn_data; typedef unsigned __int128 bn_data_tmp_u; // gcc support __int128 typedef __int128 bn_data_tmp_s; // gcc support __int128 #elif BN_WSIZE == 32 typedef uint32_t bn_data; typedef uint64_t bn_data_tmp; #else #error "BN_WSIZE must be 32 or 64" #endif ``` 將原本程式碼中需要配置記憶體、運算數字暫存（通常是`int`,`unsigned int`）改為 `bn_data`，有許多需要移位計算需要也須注意是否要改成 64 位元表達；需要加法與乘法進位的暫存（通常是`unsigned long long int` ）的部份改為 `bn_data_tmp_u`；需要減法進位的暫存（通常是`long long int`）的部份改為 `bn_data_tmp_s`。 ![](https://i.imgur.com/m4btVVA.png) 前期處理雖然較慢，然而當 `bn` 所需串接處理的資料變長時，利用 64 位元為單位即大大減少了處理的次數。 ### [改善 `bn_add` 的效能](https://hackmd.io/BZjwYw1FQ1O0NPY5kIS7VQ?view#%E6%94%B9%E5%96%84-bn_add-%E7%9A%84%E6%95%88%E8%83%BD) ```diff static void bn_do_add(const bn *a, const bn *b, bn *c) { ... - bn_data_tmp carry = 0; - for (int i = 0; i < c->size; i++) { - bn_data tmp1 = (i < a->size) ? a->number[i] : 0; - bn_data tmp2 = (i < b->size) ? b->number[i] : 0; - carry += (bn_data_tmp) tmp1 + tmp2; - c->number[i] = carry; - carry >>= DATA_BITS; - } + bn_data carry = 0; + for (int i = 0; i < b->size; i++) { + bn_data tmp1 = a->number[i]; + bn_data tmp2 = b->number[i]; + carry = (tmp1 += carry) < carry; + carry += (c->number[i] = tmp1 + tmp2) < tmp2; + } + if (a->size != b->size) { // deal with the remaining part if asize > bsize + for (int i = b->size; i < a->size; i++) { + bn_data tmp1 = a->number[i]; + carry = (tmp1 += carry) < carry; + c->number[i] = tmp1; + } + } ... } ``` ![](https://i.imgur.com/LUv2wBS.png) ### [改善 `bn_mult` 的效能](https://hackmd.io/BZjwYw1FQ1O0NPY5kIS7VQ?view#%E6%94%B9%E5%96%84-bn_mult-%E7%9A%84%E6%95%88%E8%83%BD) ```c /* c[size] += a[size] * k, and return the carry */ static bn_data _mult_partial(const bn_data *a, bn_data asize, const bn_data k, bn_data *c) { if (k == 0) return 0; bn_data carry = 0; for (int i = 0; i < asize; i++) { bn_data high, low; __asm__("mulq %3" : "=a"(low), "=d"(high) : "%0"(a[i]), "rm"(k)); carry = high + ((low += carry) < carry); carry += ((c[i] += low) < low); } return carry; } void bn_mult(const bn *a, const bn *b, bn *c) { ... for (int j = 0; j < b->size; j++) { c->number[a->size + j] = _mult_partial(a->number, a->size, b->number[j], c->number + j); } ... } ``` ![](https://i.imgur.com/0qMLAoe.png) ## 自我檢查清單 - [ ] 研讀上述 ==Linux 效能分析的提示== 描述，在自己的實體電腦運作 GNU/Linux，做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗; - [ ] 研讀上述費氏數列相關材料 (包含論文)，摘錄關鍵手法，並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 - [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise)，思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本; - [ ] 研讀 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU)，指出針對大數運算，有哪些加速運算和縮減記憶體操作成本的舉措？ - [ ] `lsmod` 的輸出結果有一欄名為 `Used by`，這是 "each module's use count and a list of referring modules"，但如何實作出來呢？模組間的相依性和實際使用次數 ([reference counting](https://en.wikipedia.org/wiki/Reference_counting)) 在 Linux 核心如何追蹤呢？ > 搭配閱讀 [The Linux driver implementer’s API guide » Driver Basics](https://www.kernel.org/doc/html/latest/driver-api/basics.html) - [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣，什麼場景中會需要呢？撰寫多執行緒的 userspace 程式來測試，觀察 Linux 核心模組若沒用到 mutex，到底會發生什麼問題。嘗試撰寫使用 [POSIX Thread](https://en.wikipedia.org/wiki/POSIX_Threads) 的程式碼來確認。 ## 延伸閱讀 - [Linux Kernel Documentation](https://linux-kernel-labs.github.io/) - [Linux 讀書會 - 驅動程式類型介紹](https://hackmd.io/@combo-tw/Linux-%E8%AE%80%E6%9B%B8%E6%9C%83/%2F%40combo-tw%2FryRp--nQS#%E9%A9%85%E5%8B%95%E7%A8%8B%E5%BC%8F%E9%A1%9E%E5%9E%8B%E4%BB%8B%E7%B4%B9) - [GNU make](https://www.gnu.org/software/make/manual/html_node/index.html#toc-An-Introduction-to-Makefiles)