--- tags: linux2022 --- # 2022q1 Homework3 (fibdrv) contributed by < [laneser](https://github.com/laneser) > > [作業要求](https://hackmd.io/@sysprog/linux2022-fibdrv) ## 前期準備 ### 實驗環境 ```shell $ gcc --version gcc (Debian 10.2.1-6) 10.2.1 20210110 $ lscpu Architecture: aarch64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 1 Core(s) per socket: 4 Socket(s): 1 Vendor ID: ARM Model: 4 Model name: Cortex-A53 Stepping: r0p4 CPU max MHz: 1200.0000 CPU min MHz: 600.0000 BogoMIPS: 38.40 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Spec store bypass: Not affected Vulnerability Spectre v1: Mitigation; __user pointer sanitization Vulnerability Spectre v2: Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fp asimd evtstrm crc32 cpuid ``` ### 安裝 linux 以及設置環境 這次找 Raspberry Pi 3 Model B v1.2 安裝 linux 來作業. 透過 https://www.raspberrypi.com/software/ 可以很容易把全新的 linux 安裝在單台 Raspberry Pi 上面, 然後在 SD Card 上面的 `/boot/wpa_supplicant.conf` 檔案寫入 wifi 的相關資訊, 就可以透過 vscode 遠端開發. 非常方便又可以隨身攜帶. - 檢查 linux 核心版本 ```shell $ uname -a Linux raspberrypi 5.10.92-v8+ #1514 SMP PREEMPT Mon Jan 17 17:39:38 GMT 2022 aarch64 GNU/Linux ``` - 安裝 linux-headers 套件 ```shell apt-get install raspberrypi-kernel-headers ``` - 確認 linux-headers 套件已正確安裝於開發環境 ```shell $ sudo dpkg -L raspberrypi-kernel-headers | grep "/lib/modules" /lib/modules /lib/modules/5.10.92-v8+ /lib/modules/5.10.92-v8+/build ``` - 檢驗目前的使用者身份 ```shell $ whoami lane $ sudo whoami root ``` - 安裝後續會用得到的工具 ```shell sudo apt install util-linux strace gnuplot-nox git cppcheck clang-format aspell ``` - 取得原始程式碼 ```shell lane@raspberrypi:~ $ git clone https://github.com/laneser/fibdrv Cloning into 'fibdrv'... remote: Enumerating objects: 80, done. remote: Counting objects: 100% (10/10), done. remote: Compressing objects: 100% (7/7), done. remote: Total 80 (delta 3), reused 8 (delta 3), pack-reused 70 Receiving objects: 100% (80/80), 23.56 KiB | 893.00 KiB/s, done. Resolving deltas: 100% (37/37), done. lane@raspberrypi:~ $ cd fibdrv ``` - 編譯並測試 ```shell lane@raspberrypi:~/fibdrv $ make check make -C /lib/modules/5.10.92-v8+/build M=/home/lane/fibdrv modules make[1]: Entering directory '/usr/src/linux-headers-5.10.92-v8+' make[1]: Leaving directory '/usr/src/linux-headers-5.10.92-v8+' make unload make[1]: Entering directory '/home/lane/fibdrv' sudo rmmod fibdrv || true >/dev/null make[1]: Leaving directory '/home/lane/fibdrv' make load make[1]: Entering directory '/home/lane/fibdrv' sudo insmod fibdrv.ko make[1]: Leaving directory '/home/lane/fibdrv' sudo ./client > out make unload make[1]: Entering directory '/home/lane/fibdrv' sudo rmmod fibdrv || true >/dev/null make[1]: Leaving directory '/home/lane/fibdrv' Passed [-] f(93) fail input: 7540113804746346429 expected: 12200160415121876738 ``` - 觀察產生的 fibdrv.ko 核心模組 ```shell $ sudo modinfo fibdrv.ko filename: /home/lane/fibdrv/fibdrv.ko version: 0.1 description: Fibonacci engine driver author: National Cheng Kung University, Taiwan license: Dual MIT/GPL srcversion: 9A01E3671A116ADA9F2BB0A depends: name: fibdrv vermagic: 5.10.92-v8+ SMP preempt mod_unload modversions aarch64 ``` - 觀察 fibdrv.ko 核心模組在 Linux 核心掛載後的行為 ```shell $ sudo insmod ./fibdrv.ko $ ls -l /dev/fibonacci crw------- 1 root root 238, 0 Mar 8 14:10 /dev/fibonacci $ cat /sys/class/fibonacci/fibonacci/dev 238:0 $ cat /sys/module/fibdrv/version 0.1 $ lsmod | grep fibdrv fibdrv 16384 0 $ cat /sys/module/fibdrv/refcnt 0 ``` ### 查看行程的 CPU affinity 編輯 `/boot/cmdline.txt` 在其後面加入 `isolcpus=0` , 重開後搞不定, 因為 `dmesg` 顯示 kernel 抱怨 ```shell Housekeeping: must include one present CPU, using boot CPU:0 ``` 根據規格這台 Raspberry Pi CPU 是 1.2 GHz 64-bit quad-core ARM Cortex-A53, 所以有四核, 可以設定 `isolcpus=3`, 再度重開機: ```shell $ taskset -cp 1 pid 1's current affinity list: 0-2 $ cat /sys/devices/system/cpu/isolated 3 ``` 可以發現 pid1 已經使用 0-2 CPU (原本是 0-3) ### 將行程固定在特定的 CPU 中執行 ```shell $ taskset -cp 3 647 pid 647's current affinity list: 0-2 pid 647's new affinity list: 3 $ taskset -cp 647 pid 647's current affinity list: 3 $ taskset -cp 0-2 647 pid 647's current affinity list: 3 pid 647's new affinity list: 0-2 ``` ### 排除干擾效能分析的因素 #### 抑制 address space layout randomization (ASLR) ```shell $ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space" ``` #### 設定 scaling_governor 為 performance ```shell $ sudo su $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor ondemand $ echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor $ echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor $ echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor $ echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor $ cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor performance ``` ### 研讀費氏數列相關材料 最重要的就是 [Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) 手法, 乃是應用了下列二式 $$ \begin{align} F(2k) = F(k)[2F(k+1) - F(k)]\\ F(2k+1) = F(k+1)^2+F(k)^2 \end{align} $$ 那麼主要的虛擬碼會是 ```cpp static long long fib(long n) { h = find_highest_set_bit(n); // 這部分可用 clz or ffs 從 n 的高位元執行 h 次到 n 的低位元的迴圈 { 0 => 求 F(2n), F(2n+1) 1 => 求 F(2n), F(2n+1), 再獲得 F(2n+2) } } ``` -- ## 量測執行時間 因為要報告一路改善程式的執行時間, 所以第一步是要想辦法獲得 `fibdrv` 的執行時間. 首先看 `fibdrv.c` 可以知道這個 module 有五個介面可以讓 `client.c` 來使用: - open : 可以傳的參數有 read/write 權限, 可以回傳 fd - close : 可以傳入 fd, 回傳 int - write : 可以傳入 fd, char 指標, size_t size, 回傳 ssize_t - read : 可以傳入 fd, char 指標, size_t size, 回傳 ssize_t - seek : 可以傳入 fd, size_t offset, int origin, 回傳 loff_t 參考 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU) 利用 write 來計算時間. 但程式的 `make check` 命令也用到 write 檢驗回傳值 (testing mode), 所以想要同時符合 testing mode 與計算時間: ```cpp static long long (*fib_methods[])(long long) = { fib_sequence, }; /* write operation * testing mode: buf exists, return 1. * calculating mode: buf is NULL * size is method number * return cost time in ns * return 0 if size is out of index */ static ssize_t fib_write(struct file *file, const char *buf, size_t size, loff_t *offset) { if (buf) return 1; if (size >= sizeof(fib_methods) / sizeof(fib_methods[0])) return 0; ktime_t kt = ktime_get(); fib_methods[size](*offset); return (ssize_t) ktime_to_ns(ktime_sub(ktime_get(), kt)); } ``` ## 實作費氏數列 首先解這個警告 `ISO C90 forbids variable length array ‘f’ [-Wvla]`, 因為 ```cpp /* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */ long long f[k + 2]; ``` 雖然直覺就馬上用 `malloc`, 甚至 `kmalloc`, 但 kernel 的世界果然跟平常寫程式不同, 不過既然迴圈裡面同一個時間只用到三個變數, 其實可以丟棄之前的變數, 也就不需要那麼多空間. 但我也寫不出比 [Masamaloka fibdrv](https://hackmd.io/@Masamaloka/fibdrv#iterative) 更精簡的了. ```cpp static long long fib_sequence(long long k) { long long f[] = {0, 1}; for (int i = 2; i <= k; i++) { f[(i & 1)] += f[((i - 1) & 1)]; } return f[(k & 1)]; } ``` 於是就有量測時間的結果 ![](https://i.imgur.com/k2ZAezX.png) 嘗試加上 user space 的 `clock_gettime`: ```cpp struct timespec t1, t2; clock_gettime(CLOCK_MONOTONIC, &t1); time[m * 2][n] = (double) write(fd, NULL, m); clock_gettime(CLOCK_MONOTONIC, &t2); time[m * 2 + 1][n] = (long long) (t2.tv_sec * 1e9 + t2.tv_nsec) - (t1.tv_sec * 1e9 + t1.tv_nsec); ``` ![](https://i.imgur.com/3t8TUv3.png) Raspberry Pi 的 system call 耗費好像跟 intel based 差異很大. ### Fast Doubling ```cpp // ref // https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling static long long fib_fast_doubling(long long n) { unsigned int h = 0; for (unsigned int i = n; i; ++h, i >>= 1) ; uint64_t a = 0; uint64_t b = 1; for (unsigned int mask = 1 << (h - 1); mask; mask >>= 1) { uint64_t c = a * (2 * b - a); uint64_t d = a * a + b * b; if (mask & n) { a = d; b = c + d; } else { a = c; b = d; } } return a; } ``` 思考 clz / ctz 對這個函數的幫助, 看來只有一開始 `h` 的運算可以幫助 ```cpp unsigned int h = 0; for (unsigned int i = n; i; ++h, i >>= 1); ``` 所以寫一版 `clz / ctz` 的運算 ```cpp static long long fib_fast_doubling_clz(long long n) { if (n < 2) return n; uint64_t a = 0; uint64_t b = 1; for (unsigned int mask = 1 << (32 - __builtin_clzll(n)); mask; mask >>= 1) { uint64_t c = a * (2 * b - a); uint64_t d = a * a + b * b; if (mask & n) { a = d; b = c + d; } else { a = c; b = d; } } return a; } ``` ![](https://i.imgur.com/oT4shxY.png) 有趣的事情是, 似乎沒有加速的效果, 好像變得更慢了一點點? [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU) 也提到一樣的事情. 但 KYG-yaya573142 是遇到優化而被 opt-out, 按照目前的 `fib_write`: ```cpp static long long (*fib_methods[])(long long) = {fib_sequence, fib_fast_doubling, fib_fast_doubling_clz}; static ssize_t fib_write(struct file *file, const char *buf, size_t size, loff_t *offset) { if (buf) return 1; if (size >= sizeof(fib_methods) / sizeof(fib_methods[0])) return 0; ktime_t kt = ktime_get(); fib_methods[size](*offset); return (ssize_t) ktime_to_ns(ktime_sub(ktime_get(), kt)); } ``` 一但優化, `fib_sequence` 也會變快才是. 實際用 `objdump -dS fibdrv.ko` 也沒發現被 opt-out. ```shell 00000000000002b0 <fib_write>: 2b0: d503245f bti c 2b4: d503201f nop 2b8: d503201f nop 2bc: d2800020 mov x0, #0x1 // #1 2c0: b4000041 cbz x1, 2c8 <fib_write+0x18> 2c4: d65f03c0 ret 2c8: d503233f paciasp 2cc: a9be7bfd stp x29, x30, [sp, #-32]! 2d0: d2800000 mov x0, #0x0 // #0 2d4: 910003fd mov x29, sp 2d8: a90153f3 stp x19, x20, [sp, #16] 2dc: aa0203f3 mov x19, x2 2e0: f100085f cmp x2, #0x2 2e4: 54000188 b.hi 314 <fib_write+0x64> // b.pmore 2e8: aa0303f4 mov x20, x3 2ec: 94000000 bl 0 <ktime_get> 2f0: 90000002 adrp x2, 0 <fib_sequence> 2f4: 91000042 add x2, x2, #0x0 2f8: aa0003e1 mov x1, x0 2fc: f9400280 ldr x0, [x20] 300: aa0103f4 mov x20, x1 304: f8737841 ldr x1, [x2, x19, lsl #3] 308: d63f0020 blr x1 30c: 94000000 bl 0 <ktime_get> 310: cb140000 sub x0, x0, x20 314: a94153f3 ldp x19, x20, [sp, #16] 318: a8c27bfd ldp x29, x30, [sp], #32 31c: d50323bf autiasp 320: d65f03c0 ret ``` 實際比較 clz 有無的 objdump ```shell 00000000000000a0 <fib_fast_doubling>: a0: d503245f bti c a4: d503201f nop a8: d503201f nop ac: d503233f paciasp b0: 34000380 cbz w0, 120 <fib_fast_doubling+0x80> b4: aa0003e4 mov x4, x0 b8: 2a0003e1 mov w1, w0 bc: 52800003 mov w3, #0x0 // #0 c0: 2a0303e0 mov w0, w3 c4: 53017c21 lsr w1, w1, #1 c8: 11000463 add w3, w3, #0x1 cc: 35ffffa1 cbnz w1, c0 <fib_fast_doubling+0x20> d0: 52800022 mov w2, #0x1 // #1 d4: 1ac02042 lsl w2, w2, w0 d8: 34000242 cbz w2, 120 <fib_fast_doubling+0x80> dc: d2800021 mov x1, #0x1 // #1 e0: d2800000 mov x0, #0x0 // #0 e4: d503201f nop e8: d37ff823 lsl x3, x1, #1 ec: 9b017c21 mul x1, x1, x1 f0: cb000063 sub x3, x3, x0 f4: 9b000401 madd x1, x0, x0, x1 f8: 9b007c60 mul x0, x3, x0 fc: 6a04005f tst w2, w4 100: 54000080 b.eq 110 <fib_fast_doubling+0x70> // b.none 104: 8b010003 add x3, x0, x1 108: aa0103e0 mov x0, x1 10c: aa0303e1 mov x1, x3 110: 53017c42 lsr w2, w2, #1 114: 35fffea2 cbnz w2, e8 <fib_fast_doubling+0x48> 118: d50323bf autiasp 11c: d65f03c0 ret 120: d2800000 mov x0, #0x0 // #0 124: d50323bf autiasp 128: d65f03c0 ret 12c: d503201f nop ``` ```shell 0000000000000234 <fib_fast_doubling_clz>: 234: d503245f bti c 238: d503201f nop 23c: d503201f nop 240: aa0003e4 mov x4, x0 244: d503233f paciasp 248: f100041f cmp x0, #0x1 24c: 540002ad b.le 2a0 <fib_fast_doubling_clz+0x6c> 250: dac01000 clz x0, x0 254: 52800402 mov w2, #0x20 // #32 258: 4b000040 sub w0, w2, w0 25c: 52800022 mov w2, #0x1 // #1 260: 1ac02042 lsl w2, w2, w0 264: 34000222 cbz w2, 2a8 <fib_fast_doubling_clz+0x74> 268: d2800021 mov x1, #0x1 // #1 26c: d2800000 mov x0, #0x0 // #0 270: d37ff823 lsl x3, x1, #1 274: 9b017c21 mul x1, x1, x1 278: cb000063 sub x3, x3, x0 27c: 9b000401 madd x1, x0, x0, x1 280: 9b007c60 mul x0, x3, x0 284: 6a04005f tst w2, w4 288: 54000080 b.eq 298 <fib_fast_doubling_clz+0x64> // b.none 28c: 8b010003 add x3, x0, x1 290: aa0103e0 mov x0, x1 294: aa0303e1 mov x1, x3 298: 53017c42 lsr w2, w2, #1 29c: 35fffea2 cbnz w2, 270 <fib_fast_doubling_clz+0x3c> 2a0: d50323bf autiasp 2a4: d65f03c0 ret 2a8: d2800000 mov x0, #0x0 // #0 2ac: 17fffffd b 2a0 <fib_fast_doubling_clz+0x6c> ``` `fib_fast_doubling_clz` 版本的確比較小, 也用到 clz, 但就是差不多甚至比較慢一點. 如果單純比較 `(32 - __builtin_clzll(n)` vs `for (unsigned int i = n; i; ++h, i >>= 1);` , 就會發現 clz 的確比較快一點. ![](https://i.imgur.com/gBUeLjm.png) 但只運算 `clz` 居然跟運算費氏數列差異為 100 ns 跟 180 ns 這樣的等級, 應該是時間量測的精度不足. 如果是時間量測精確程度問題, 那麼重複 10 次應該可以緩和, 在 `fib_write` 內部量測時間的時候: ```cpp ktime_t kt = ktime_get(); int i = 10; while (i--) fib_methods[size](*offset); return (ssize_t) ktime_to_ns(ktime_sub(ktime_get(), kt)); ``` ![](https://i.imgur.com/e4UcQPT.png) 跟原本耗費 100 ns 相比, 接近 80 的時候才有 10 倍的數據. 而 ARM Cortex-A53 的 clz 表現只有快一點點. ### 思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本 可以觀察到 Fast doubling 當中 ```cpp uint64_t d = a * a + b * b; ``` 用了兩個平方, 而運算 $n^2$ 可以思考: $$ \begin{cases} n^2 = (2*\frac{n}{2})^2 = 4*(\frac{n}{2})^2, & \text{if $n$ is even and $n$ >= 2} \\ n^2 = (2*\frac{n}{2}+1)^2 = 4*(\frac{n}{2})^2 + 4*(\frac{n}{2}) + 1 & \text{if $n$ is odd and $n$ >= 2}\\ \end{cases} $$ --- ## 自我檢查清單 - [x] 研讀 ==Linux 效能分析的提示== 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗; - [x] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 - [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise),思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本; - [ ] 研讀 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU),指出針對大數運算,有哪些加速運算和縮減記憶體操作成本的舉措? - [ ] `lsmod` 的輸出結果有一欄名為 `Used by`,這是 "each module's use count and a list of referring modules",但如何實作出來呢?模組間的相依性和實際使用次數 ([reference counting](https://en.wikipedia.org/wiki/Reference_counting)) 在 Linux 核心如何追蹤呢? > 搭配閱讀 [The Linux driver implementer’s API guide » Driver Basics](https://www.kernel.org/doc/html/latest/driver-api/basics.html) - [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 [POSIX Thread](https://en.wikipedia.org/wiki/POSIX_Threads) 的程式碼來確認。 --- ## 參考資料 - [Calculating fibonacci numbers by fast doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) - [KYG-yaya573142 的報告](https://hackmd.io/BZjwYw1FQ1O0NPY5kIS7VQ) - [Masamaloka fibdrv](https://hackmd.io/@Masamaloka/fibdrv)