2022q1 Homework3 (fibdrv)

# 2022q1 Homework3 (fibdrv) contributed by < [`chengingx`](https://github.com/chengingx) > ###### tags: `linux2022` > [fibdrv](https://hackmd.io/@sysprog/linux2022-fibdrv) ## 預備知識 * [Calculating Fibonacci Numbers by Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) * [The Linux Kernel Module Programming Guide](https://sysprog21.github.io/lkmpg/) ### 費氏數列 $\left \{ \begin{array}{**lr**} F_{2n}=F_n\cdot(2\cdot F_{n+1}-F_n) \\ F_{2n+1}=F_{n+1}^2+F_{n}^2\end{array} \right.$ 對應虛擬碼如下： ```c Fast_Fib(n) a = 0; b = 1; // m = 0 for i = (number of binary digit in n) to 1 t1 = a*(2*b - a); t2 = b^2 + a^2; a = t1; b = t2; // m *= 2 if (current binary digit == 1) t1 = a + b; // m++ a = b; b = t1; return a; ``` ### 考慮到硬體加速 $F(n)$ 的手法許多現代處理器提供 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令，可搭配 [Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) 手法運用: 1. 省略 $F(0)$，直接從 $F(1)$ 開始 2. clz/[ffs](https://www.kernel.org/doc/htmldocs/kernel-api/API-ffs.html): 先去除數字 MSB 起算的開頭 0 位元，因為真正的數字不包含 leading 0s，所以計算它也沒用，因此需要 clz 計算有多少 leading 0s * 遇到 `0` : 求 $F(2n)$ 和 $F(2n+1)$ * 遇到 `1` : 求 $F(2n)$ 和 $F(2n+1)$，再求 $F(2n+2)$ ### 前期測試 * 首先 insert kernel module ```shell $ sudo insmod fibdrv.ko ``` * `sudo lsmod` 可用來觀察有什麼 kernel module 在 kernel 中 ```shell $ sudo lsmod | grep fibdrv fibdrv 16384 0 ``` * 觀察 fibonacci 的 device file ```shell $ ls -l /dev/fibonacci crw------- 1 root root 508, 0 三 15 15:36 /dev/fibonacci ``` * 可以看見 major number 為 508，由 `alloc_chrdev_region` 決定 ```shell $ sudo rmmod fibdrv ``` #### 核心模式的時間測量 ```c static long long fib_time_proxy(long long k) { kt = ktime_get(); long long result = fib_sequence(k); kt = ktime_sub(ktime_get(), kt); return result; } ``` 可發現到 write 在此核心模組暫時沒作用，於是可挪用來輸出上一次 fib 的執行時間 ```c static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { return (ssize_t) fib_time_proxy(*offset); } static ssize_t fib_write(struct file *file, const char *buf, size_t size, loff_t *offset) { return ktime_to_ns(kt); } ``` #### 使用 fast doubling 進行核心模式的時間測量參照 [freshLiver](https://hackmd.io/@freshLiver/2022q1-hw3-fibdrv#fibdrv-%E5%AF%A6%E4%BD%9C) 的實作 (我想破頭) * original ver. 時間複雜度 : $O(n)$ ```c static long long fib_sequence(long long k) { long long f[2] = {0, 1}; for (int i = 1; i <= k; i++){ SWAP(f[0], f[1]); f[0] += f[1]; } return f[0]; } ``` * fast doubling ver. 時間複雜度 : $O(log(n))$ ```c static long long fib_sequence(long long k) { if (k == 0) return 0; long long f[2] = {0, 1}; int m = 63 - __builtin_clzll(k); for (int i = 0; i <= m; i++){ long long a = f[0] * (2 * f[1] - f[0]); // F(2k) long long b = f[1] * f[1] + f[0] * f[0]; // F(2k+1) if ((k >> (m - i)) & 1){ f[0] = b; f[1] = a + b; // F(2k+2) } else { f[0] = a; f[1] = b; } } return f[0]; } ``` > `__builtin_clzll()` 返回左起第一個1之前0的個數參考 [gnuplot 語法解說和示範](https://hackmd.io/@sysprog/Skwp-alOg) 繪圖 ```shell set terminal png set term png enhanced font 'Verdna, 10' set title 'performance comparison' set xlabel 'size' set ylabel 'time(ns)' set output 'runtime.png' set key left plot [0:92][0:150] 'output_original.txt' using 1:2 with lp lw 2 title 'original', 'output_fast_doubling.txt' using 1:2 with lp lw 2 title 'fast doubling' ``` ![](https://i.imgur.com/VgRtGbC.png) :::info $F_{93}$ 之後 overflow 也影響了執行時間 ? ::: #### 將行程固定在特定的 CPU 中執行直接加在 GRUB 的設定檔中，這樣 Linux 的排程器在預設的狀況下就不會將任何一般行程放在這個被保留的 CPU 核心中執行，只有那些被 [taskset](https://man7.org/linux/man-pages/man1/taskset.1.html) 特別指定的行程可使用這些 CPU 核心 ```shell $ sudo vi /etc/default/grub ``` ```shell GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=0" ``` ```shell $ sudo update-grub ``` 確認 `affinity list` (實驗電腦為 6 核) ```shell $ taskset -cp 1 pid 1's current affinity list: 1-5 ``` ```shell $ sudo taskset -c 0 ./client 0 ``` 不知道為何原始版本變快了，然而 fast doubling 仍然振幅很大，之後用統計手法解決 ![](https://i.imgur.com/0DJZ0Eq.png) ## 計算 $F_{93}$ (包含) 之後的 Fibonacci 數 - 使用數字字串並套用 quiz2 SSO (Small String Optimization) 透過 `$ sudo ./client` 觀察得到 $F_{93}$ 發生 overflow ```shell $ sudo ./client ... Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120. Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309. Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 93, returned the sequence -6246583658587674878. ... ``` > $F_{93}=F_{91}+F_{92} = 12200160415121876738$ > $\because 12200160415121876738 > 2^{64} - 1\ (Overflow)$ > $\therefore 12200160415121876738 - 2^{64} = -6246583658587674878 = F_{93}$ * 將 `long long` 改成 [`unsigned __int128`](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fint128.html) 節錄 [fibdrv.c](https://github.com/chengingx/fibdrv/blob/master/fibdrv.c) 重要修改 ```c static unsigned __int128 fib_sequence(long long k) { unsigned __int128 f[2] = {0, 1}; for (int i = 1; i <= k; i++) { __swap(&f[0], &f[1], sizeof(unsigned __int128)); f[0] += f[1]; } return f[0]; } static unsigned __int128 fib_time_proxy(long long k, int method) { kt = ktime_get(); unsigned __int128 result = fib_sequence(k); kt = ktime_sub(ktime_get(), kt); return result; } /* calculate the fibonacci number at given offset */ static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { unsigned __int128 result = fib_time_proxy(*offset, size); if (copy_to_user(buf, &result, sizeof(unsigned __int128))) return -EFAULT; return 1; } ``` original ver. 在 userspace 與 kernel space 的時執行時間 ![](https://i.imgur.com/SD1qVHO.png) ```shell plot [0:][0:] 'output.txt' using 1:2 with lp lw 2 title 'kernel', '' using 1:3 with lp lw 2 title 'userspace', '' using 1:4 with lp lw 2 title 'kernel to user space' ``` ### 統計手法參考 [oscarshiang](https://hackmd.io/@oscarshiang/linux_fibdrv) 所寫的 python 自動腳本，將 process data 寫成函式，使用到的手法是去除超過兩個標準差的資料 ```python def process_data(l, pd, name): print(f"Processing {name} data...") mean, std = pd.mean(axis = 1), pd.std(axis = 1) for i in tqdm(range(len(pd.index))): sum = 0 cnt = 0 for j in pd.iloc[i]: if abs(j - mean[i]) < 2 * std[i]: sum = sum + j; cnt = cnt + 1; sum = sum / cnt; l.append(sum) ``` 透過 `python AutoGenerate.py` 產生 10000 筆資料 ``` $ python AutoGenerate.py Generating data... 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [02:39<00:00, 62.85it/s] ... ``` original ver. ![](https://i.imgur.com/9Rk0mnm.png) fast doubling ver. ![](https://i.imgur.com/S9V6cWD.png) original ver. vs. fast doubling ver. in kernel ![](https://i.imgur.com/REoxxo6.png) ### 不使用硬體指令加速 ```c void fast_doubling(long long k, unsigned __int128 f[]) { if(k == 0){ return; } fast_doubling((k >> 1), f); unsigned __int128 a = f[0] * (2 * f[1] - f[0]); // F(2k) unsigned __int128 b = f[1] * f[1] + f[0] * f[0]; // F(2k+1) if(k & 1){ f[0] = b; f[1] = a + b; } else { f[0] = a; f[1] = b; } } static unsigned __int128 fib_fast_doubling(long long k) { unsigned __int128 f[2] = {0, 1}; fast_doubling(k, f); return f[0]; } ``` ![](https://i.imgur.com/vJcB2nf.png) 可以清楚看見，使用 [clz](https://en.wikipedia.org/wiki/Find_first_set) 指令可以降低執行時間 #### clz 觀察以下程式碼的 assembly code ```c unsigned int clz(unsigned int num) { return __builtin_clz(num); } int main(){ unsigned int res = clz(2); return 0; } ``` 開啟編譯器最佳化進行編譯 ```shell $ gcc -S -O2 -o test2.s test.c -w ``` ```shell clz: bsrl %edi, %eax xorl $31, %eax ret ``` 可以清楚看見有個 `bsrl` 指令，bsr 是 bit-scan reverse 舉例 : $0000000010011000_b$ * $bsf$ 會得到 $3$，$bsr$ 會得到 $7$ * word size 減去 $bsr$ 得到的數值就能得到 leading zeros * `xorl $31, %eax` 表示 $31 - eax$ > 只有 $2^n-1$ 可以使用 xor 代替 $-$，$n=1,2,3,...$ ### 註冊 Device Driver :::info 根據 [Device Driver 重點整理](https://hackmd.io/@chenging/driver) 知道 fops 的兩個重點： 1. fops 是指向 file_operations 結構的指標，驅動程式呼叫 `register_chrdev()` 將 fops 註冊到 kernel 裡後，fops 便成為該 device driver 所實作的 system call 進入點 2. 實作 system call 的函數便是透過 file_operations 結構來定義，我們稱實作 system call 的函數為 driver method ::: 觀察 `fibdrv.c` 中的 `init_fib_dev` 裡面可以看見註冊函式，[`alloc_chrdev_region`](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/fs.h) 會動態產生 major number，若回傳值小於 0 代表註冊失敗 ```c rc = alloc_chrdev_region(&fib_dev, 0, 1, DEV_FIBONACCI_NAME); if (rc < 0) { printk(KERN_ALERT "Failed to register the fibonacci char device. rc = %i", rc); return rc; } ``` [Course 5 - 新字元設備驅動實驗](https://meetonfriday.com/posts/edcd30e6/) 整理 `register_chrdev()` 缺點 1. 會佔用主設備號下的所有次設備號，浪費了很多次設備號 2. 必須手動指定主設備號，也就是說你得先手動查詢哪個主設備號還沒被佔用，然後再去指定 :::info `cdev` 定義在 [include/linux/cdev.h](https://github.com/torvalds/linux/blob/70868a180501d17fea58153c649d56bc18435315/include/linux/cdev.h)，看到了老朋友 `list_head` ```c struct cdev { struct kobject kobj; struct module *owner; const struct file_operations *ops; struct list_head list; dev_t dev; unsigned int count; } __randomize_layout; ``` ::: ```c fib_cdev = cdev_alloc(); if (fib_cdev == NULL) { printk(KERN_ALERT "Failed to alloc cdev"); rc = -1; goto failed_cdev; } ``` Linux 驅動程式是透過 file_operations 來建構 VFS 層的支援 ```c fib_cdev->ops = &fib_fops; rc = cdev_add(fib_cdev, fib_dev, 1); if (rc < 0) { printk(KERN_ALERT "Failed to add cdev"); rc = -2; goto failed_cdev; } ``` 而 file_operation 裡的函數指標，即是指向每一個 system call 的實作函數 ```c const struct file_operations fib_fops = { .owner = THIS_MODULE, .read = fib_read, .write = fib_write, .open = fib_open, .release = fib_release, .llseek = fib_device_lseek, }; ``` ## mutex 對 Linux 核心模組的影響 fibdrv.c 中的 `fib_open` 對應的 system call 是 `open`，也就是在 client.c 中 ```c int fd = open(FIB_DEV, O_RDWR); ``` 當開啟 file (`/dev/fibonacci`) 時，會試圖上鎖 ```c static int fib_open(struct inode *inode, struct file *file) { if (!mutex_trylock(&fib_mutex)) { printk(KERN_ALERT "fibdrv is in use"); return -EBUSY; } return 0; } ``` 假如 file discriptor 是 critical section，也就是 fd 會共用，那麼在 `read` 就會出問題 ```c read(fd, &buf, method); ``` 以下進行實驗，將原本在 main 函式的指令寫在 `racefib` ```c void *racefib(void *arg) { int fd = open(FIB_DEV, O_RDWR); if (fd < 0) { perror("Failed to open character device"); exit(1); } for (int i = 0; i <= 10; i++){ unsigned __int128 buf; lseek(fd, i, SEEK_SET); read(fd, &buf, 1); printf("%s : F(%d): %lld\n", (char *)arg, i, buf); } close(fd); return NULL; } ``` main() 中則寫，記得編譯時加上 `-lpthread` ```c pthread_t t[2]; pthread_create(&t[0], NULL, racefib, "Thread 1"); pthread_create(&t[1], NULL, racefib, "Thread 2"); for (int i = 0; i < 2; i++) pthread_join(t[i], NULL); ``` 可看見輸出有 Failed to open character device: Device or resource busy 字樣 ```shell $ sudo ./client Thread 1 : F(0): 0 Thread 1 : F(1): 1 Thread 1 : F(2): 1 Thread 1 : F(3): 2 Thread 1 : F(4): 3 Thread 1 : F(5): 5 Failed to open character device: Device or resource busy Thread 1 : F(6): 8 Thread 1 : F(7): 13 ``` 將 `mutex` 拿掉，可以發現 $F(n)$ 數字是正確的 ``` $ sudo ./client ... Thread 1 : F(6): 8 Thread 2 : F(4): 3 Thread 2 : F(5): 5 Thread 2 : F(6): 8 Thread 2 : F(7): 13 Thread 1 : F(7): 13 ... ``` 當 `fd` 為 Global variable，且 Thread 1 sleep 1 秒，Thread 2 sleep 2 秒時，數值就會出錯 ```c void *racefib(void *arg) { int num = *((int *)arg); for (int i = 0; i <= 10; i++){ unsigned __int128 buf; lseek(fd, i, SEEK_SET); sleep(num); read(fd, &buf, 1); printf("Thread %d : F(%d): %lld\n", num, i, buf); } return NULL; } ``` 主要是因為 `lseek` 在看 `offset` 時發生的錯誤 ```c int delay_a = 1, delay_b = 2; pthread_create(&t[0], NULL, racefib, &delay_a); pthread_create(&t[1], NULL, racefib, &delay_b); ``` 由輸出結果可以看見兩個 Thread 1，一個 Thread 2 交錯，因為 `offset` 為共用，`Thread 2 : F(1): 2` 其實是 `Thread 1 : F(3)` 的值 ```shell $ sudo ./client Thread 1 : F(0): 0 Thread 2 : F(0): 1 Thread 1 : F(1): 1 Thread 1 : F(2): 1 Thread 2 : F(1): 2 Thread 1 : F(3): 1 Thread 1 : F(4): 3 Thread 2 : F(2): 5 Thread 1 : F(5): 2 Thread 1 : F(6): 8 Thread 2 : F(3): 13 Thread 1 : F(7): 3 Thread 1 : F(8): 21 Thread 2 : F(4): 34 Thread 1 : F(9): 5 Thread 1 : F(10): 55 Thread 2 : F(5): 55 Thread 2 : F(6): 8 Thread 2 : F(7): 13 Thread 2 : F(8): 21 Thread 2 : F(9): 34 Thread 2 : F(10): 55 ``` ## 自我檢查清單 - [x] 研讀上述 Linux 效能分析的提示描述，在自己的實體電腦運作 GNU/Linux，做好必要的設定和準備工作→ 從中也該理解為何不希望在虛擬機器中進行實驗; - [x] 研讀上述費氏數列相關材料 (包含論文)，摘錄關鍵手法，並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 - [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise)，思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本; - [ ] 研讀 [KYG-yaya573142](https://hackmd.io/@KYWeng/rkGdultSU) 的報告，指出針對大數運算，有哪些加速運算和縮減記憶體操作成本的舉措？ - [ ] lsmod 的輸出結果有一欄名為 Used by，這是 “each module’s use count and a list of referring modules”，但如何實作出來呢？模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢？ > 搭配閱讀 [The Linux driver implementer’s API guide » Driver Basics](https://www.kernel.org/doc/html/latest/driver-api/basics.html) - [ ] 注意到 fibdrv.c 存在著 DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy 等字樣，什麼場景中會需要呢？撰寫多執行緒的 userspace 程式來測試，觀察 Linux 核心模組若沒用到 mutex，到底會發生什麼問題。嘗試撰寫使用 POSIX Thread 的程式碼來確認。 ## 參考資料 * [Calculating Fibonacci Numbers by Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) * [The Linux Kernel Module Programming Guide](https://sysprog21.github.io/lkmpg/) * [Device Driver 重點整理](https://hackmd.io/@chenging/driver) * [Course 5 - 新字元設備驅動實驗](https://meetonfriday.com/posts/edcd30e6/) * [gnuplot 語法解說和示範](https://hackmd.io/@sysprog/Skwp-alOg) * [Linux 隱藏 GRUB 開機選單](https://www.ltsplus.com/linux/linux-hide-grub-boot-menu)