2022q1 Homework3 (fibdrv)

--- tags: linux2022 --- # 2022q1 Homework3 (fibdrv) contributed by < `SmallHanley` > > [作業說明](https://hackmd.io/@sysprog/linux2022-fibdrv) > [GitHub](https://github.com/SmallHanley/fibdrv) ## 開發環境 ```shell $ cat /proc/version Linux version 5.13.0-37-generic (buildd@lcy02-amd64-111) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #42~20.04.1-Ubuntu SMP Tue Mar 15 15:44:28 UTC 2022 $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 142 Model name: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz Stepping: 10 CPU MHz: 1800.000 CPU max MHz: 3400.0000 CPU min MHz: 400.0000 BogoMIPS: 3600.00 Virtualisation: VT-x L1d cache: 128 KiB L1i cache: 128 KiB L2 cache: 1 MiB L3 cache: 6 MiB NUMA node0 CPU(s): 0-7 ``` ## fibdrv file_operations [include/linux/fs.h](https://elixir.bootlin.com/linux/latest/source/include/linux/fs.h#L2069) ==TODO== ## Linux 效能分析的提示首先，根據不同的算法，畫出折線圖，橫軸是費氏數列第 n 項，縱軸是計算該項所花費的時間： ![](https://i.imgur.com/DmHULbH.png) :::spoiler gnuplot script ``` set title "Time cost" set xlabel "n th sequence of Fibonacci number" set ylabel "time cost (ns)" set terminal png font " Times_New_Roman,12 " set output "time.png" set xrange [0:100] set xtics 0, 20, 100 set key left plot \ "time" using 1:2 with linespoints linewidth 2 title "original", \ "time" using 1:3 with linespoints linewidth 2 title "fast-exp", \ "time" using 1:4 with linespoints linewidth 2 title "fast-doubling", \ ``` ::: ### 執行多次運算取平均為了減少誤差，我嘗試將執行 10 次的結果取平均，再重新繪製，執行方式： ```shell $ make plot ``` ![](https://i.imgur.com/8c5Cxb0.png) ### 以特定的 CPU 執行程式一個在 userspace 的程式，我們可以使用 [taskset](https://man7.org/linux/man-pages/man1/taskset.1.html) 命令，將指定的行程透過 PID 和 cpumask 固定在指定的 CPU 上執行，又或是將給定的新的程式或命令固定在指定的 CPU 上執行，用法如下： ``` taskset [options] [mask | cpu-list] [pid|cmd [args...]] ``` 至於 kernel module，在閱讀 [Kernel_modules](https://linux-kernel-labs.github.io/refs/heads/master/labs/kernel_modules.html) tutorial 後，發現 kernel module 可以使用 `current` 變數 (需要 include `linux/sched.h`)，該變數代表的是一個指標指向該行程的 `struct task_struct`，這意味著 kernel module 是一個獨立且可以被排程的一個單位 (task)，並用有自己的 PID，我們可以使用 `taskset` 根據 PID 固定在指定的 CPU 上執行。不過我們又面臨一個問題，我們需要程式執行以後才會知道行程的 PID，因此我嘗試使用 [linux/sched](https://github.com/torvalds/linux/blob/master/include/linux/sched.h#L2173) 的 `sched_setaffinity()`，不過遇到以下問題，目前還沒解決： ```shell ERROR: modpost: "sched_setaffinity" [.../fibdrv/fibdrv.ko] undefined! ``` 後來改成在 `client.c` 使用系統呼叫 [sched_setaffinity(2)](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html)，先將 `fibdv` 的 PID 透過 `write` 傳至 `client.c`，再呼叫 `sched_setaffinity()` 將行程固定在 CPU 0。 ### 限定 CPU 給特定的程式使用 [taskset](https://man7.org/linux/man-pages/man1/taskset.1.html) 可指定行程所使用的 CPU 核心，但不代表其他的行程不會使用這些被指定的 CPU 核心，如果你不想讓其他的行程干擾你要執行的程式，讓指定的核心只能被自己設定的行程使用，可以使用 `isolcpus` 這個 Linux 核心起始參數，這個參數可以讓特定的 CPU 核心在開機時就被保留下來。我在開機時使用 boot loader 所提供的自訂開機參數功能，手動加入 `isolcpus=cpu_id` 參數： 1. 開機時進入 GRUB。 2. 按 `e` 編輯開機參數。 3. 往下到 `linux` 開頭這行，加上 `isolcpus=0`。 4. 按 `Ctrl-x` 或是 `F10` 以修改過後的設定開機。觀察 CPU 0 是否在開機後被保留下來： ```shell $ taskset -cp 1 pid 1's current affinity list: 1-7 $ cat /sys/devices/system/cpu/isolated 0 ``` 使用 `htop` 觀察 CPU 0，不過 CPU 0 的使用率並不是一直維持在 0，推測有其他程式也有使用到 sched_setaffinity()，或是中斷處理 (?)，不過可以大幅減低干擾： ![](https://i.imgur.com/oFsbbh1.png) ### 排除干擾效能分析的因素 * 抑制 [address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization) (ASLR) ```shell $ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space" ``` * 設定 scaling_governor 為 performance。準備以下 shell script，檔名為 `performance.sh`: ```shell for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor do echo performance > ${i} done ``` 之後再用 `$ sudo sh performance.sh` 執行 * 針對 Intel 處理器，關閉 turbo mode ```shell $ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" ``` ![](https://i.imgur.com/Hnr6D5w.png) ### 用統計手法去除極端值 ## 引入 bignum ### 資料結構 * 目前只有使用 `unsigned long` 陣列，後續可以加入 `size` 來做優化。 * 後續更新添加 `size` 來做優化，更該於 [commit 61cad9e](https://github.com/SmallHanley/fibdrv/commit/61cad9ea8f953696209fee3d3ce4605515a58299) 。一是降低運算成本，二是方便於乘法引入 `ctz` 加速手段。 * `size` 為表示此一數字所需的陣列大小 (即表示此數需要多少個 `unsigned long` 的空間)， `size` 為 `0` 代表數字為 `0`。 ```c typedef struct { unsigned long val[BN_LENGTH]; unsigned size; } bn_t; ``` ### bn_init() * 將陣列初始化為 `0`。 * `size` 設成 `0`。 ```c static inline void bn_init(bn_t *a) { memset(a->val, 0, sizeof(a->val)); a->size = 0; } ``` ### bn_set_with_pos() * 傳入三個參數，分別為 `bn_t` 變數的位址、`num` 和`pos`。 * 將 `pos` 作為陣列的 index，賦值成 `num`。 * 如果此數原本 `size` 為 `1`，我們又想要將陣列第 `0` 位置設成 `0`，則將 `size` 歸零。 ```c static inline void bn_set_with_pos(bn_t *a, uint64_t num, unsigned pos) { a->val[pos] = num; if (unlikely(!pos && !num && a->size == 1)) { a->size = 0; } else if (pos + 1 > a->size) { a->size = pos + 1; } } ``` ### bn_swap() * 將給定的兩物件交換。 ```c static inline void bn_swap(bn_t *a, bn_t *b) { bn_t tmp = *a; *a = *b; *b = tmp; } ``` ### bn_srl() * 將大數作邏輯右移。 * 考慮到後續大數乘法可能會使用 `ctzl` 優化，目前實作是使用位移運算子 (`<<`、`>>`)，只能處理小於等於 63 的右移，當 n 超過 64 (包含) 是未定義行為。 ```c static void bn_srl(bn_t *a, bn_t *res, unsigned n) { if (!a->size) { bn_init(res); return; } if (!n) { *res = *a; return; } for (int i = 0; i < a->size - 1; i++) { res->val[i] = a->val[i] >> n | a->val[i + 1] << (64 - n); } res->val[a->size - 1] = a->val[a->size - 1] >> n; res->size = (res->val[a->size - 1]) ? a->size : a->size - 1; } ``` ### bn_sll() * 將大數作邏輯左移。 * 同理，目前只能處理小於等於 63 的左移。 * 要注意的是，此處有關 `size` 的實作還是有瑕疵，如果左移超過儲存資料的上限，得到的 `size` 很可能是錯的。 ```c static void bn_sll(bn_t *a, bn_t *res, unsigned n) { if (!a->size) { bn_init(res); return; } if (!n) { *res = *a; return; } unsigned sz = (a->size > BN_LENGTH - 1) ? BN_LENGTH - 1 : a->size; for (int i = sz; i > 0; i--) { res->val[i] = a->val[i] << n | a->val[i - 1] >> (64 - n); } res->val[0] = a->val[0] << n; res->size = (res->val[sz]) ? sz + 1 : sz; } ``` ### bn_add() * 將 `a` 和 `b` 相加的結果存至 `res`。 * 這裡使用 GNU [Built-in Functions to Perform Arithmetic with Overflow Checking](https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html) 的 `__builtin_uaddl_overflow()` 來檢查是否產生溢位，從最低位加到最高位，並將 carry 加至高一位。 * 嘗試改寫，使得大數加法成功在編譯器優化開啟時生成出 [x86 指令集](https://en.wikipedia.org/wiki/X86_instruction_listings) 的 `ADC` 或是 `ADCX`，不過沒有成功。 ```c static void bn_add(bn_t *a, bn_t *b, bn_t *res) { unsigned c = 0; unsigned sz = (a->size > b->size) ? a->size : b->size; for (int i = 0; i < sz; i++) { unsigned c1 = __builtin_uaddl_overflow(a->val[i], b->val[i], &res->val[i]); unsigned c2 = __builtin_uaddl_overflow(res->val[i], c, &res->val[i]); c = c1 | c2; } if (sz < BN_LENGTH) { res->val[sz] = c; res->size = (res->val[sz]) ? sz + 1 : sz; } else { res->size = BN_LENGTH; } } ``` ### bn_sub() * 將 a 和 b 相減的結果存至 res。 ```c static void bn_sub(bn_t *a, bn_t *b, bn_t *res) { unsigned c = 0; for (int i = 0; i < a->size; i++) { unsigned c1 = __builtin_usubl_overflow(a->val[i], b->val[i], &res->val[i]); unsigned c2 = __builtin_usubl_overflow(res->val[i], c, &res->val[i]); c = c1 | c2; } for (int i = a->size - 1; i >= 0; i--) { if (res->val[i]) { res->size = i + 1; return; } } res->size = 0; } ``` ### bn_mul() * 將 a 和 b 相乘的結果存至 res。 * 目前採用 O(n) 的乘法，使用位移運算子和大數加法實作，後續可以增加 `ctzl` 或是不同演算法優化。 ```c static void bn_mul(bn_t a, bn_t b, bn_t *res) { bn_init(res); unsigned sz = b.size; for (int i = 0; i < 64 * sz; i++) { if (b.val[0] & 1) { bn_add(res, &a, res); } bn_sll(&a, &a, 1); bn_srl(&b, &b, 1); } } ``` 乘法除了加上 `size` 來優化，後續也增加 `ctzl` 的優化。原本的實作方式是模擬計算機組織教科書有關乘法器的實作方式，每次都只位移一個 bit。當除數的 bits 為 set 的數量很少的時候，這種作法的位移運算成本會高很多，於是我使用 `ctzl` 加速，可以一次位移多個 bits，實作如下： ```c static void bn_mul(bn_t a, bn_t b, bn_t *res) { bn_init(res); while (b.size) { if (b.val[0]) { unsigned z = __builtin_ctzl(b.val[0]); bn_sll(&a, &a, z); bn_srl(&b, &b, z); bn_add(res, &a, res); b.val[0] &= ~(1ul); } else { bn_sll(&a, &a, 63); bn_srl(&b, &b, 63); } } } ``` 對應的實驗和效能分析可以參考下面圖表。 ### bn2string() ```c static char *bn2string(bn_t *a) { char str[64 * BN_LENGTH / 3 + 2]; memset(str, '0', sizeof(str) - 1); str[sizeof(str) - 1] = '\0'; unsigned sz = a.size; for (int i = 0; i < 64 * sz; i++) { unsigned carry = a.val[sz - 1] >> 63; for (int j = sizeof(str) - 2; j >= 0; j--) { str[j] += str[j] - '0' + carry; carry = (str[j] > '9'); if (carry) str[j] -= 10; } bn_sll(&a, &a, 1); } // search for numbers int i = 0; while (i < sizeof(str) - 2 && str[i] == '0') i++; // passing string back char *p = kmalloc(sizeof(str) - i, GFP_KERNEL); memcpy(p, str + i, sizeof(str) - i); return p; } ``` ### 效能分析引入 bignum 後，根據 Linux 效能分析的提示，繪製出不同實作方式對於不同輸入參數所花的時間比較： ![](https://i.imgur.com/vQs5dYc.png) 將輸入參數擴增至 1000，發現 fast-doubling 和 exponentiating by squaring 與原本 $O(n)$ 時間複雜度相比，花費時間高很多，可見目前採用的乘法實作方式還有很大改進空間： ![](https://i.imgur.com/7BxOmj0.png) 後續改進加入 `size` 和 `ctzl` 優化，前者優化各種運算的次數 (包含加法、減法、位移運算)，後者大幅減少位移運算次數，實驗結果如下： ![](https://i.imgur.com/AHJU9Kw.png) 發現 exponentiating by squaring 的乘法運算量還是過大，fast-doubling 和 sequence 較為接近，不過在前 1000 項還是 sequence 的方式花費時間最少 (後續實驗發現輸入參數數量級要到 $10^5$，fast-doubling 才會追過 sequence)。下圖以 fast-doubling 的實作方式，分析加上 `size` (不含 `ctzl`) 前後花費時間的比較： ![](https://i.imgur.com/JELUuC8.png) 在大數乘法的實作中加入 `ctzl` 優化，可以大幅減少位移運算的次數，下圖是實驗有無 `ctzl` 的比較，發現採用 `ctzl` 的程式碼執行時間大幅縮短： ![](https://i.imgur.com/VRbj8eP.png) 進一步觀察整體程式 (fast-doubling 從 0 執行到 1000) 呼叫大數運算的次數，確實大幅減少 `bn_sll()` 、`bn_srl()` 的呼叫次數： | no-ctz | add | sub | sll | srl | | ------ | ------ | ---- | ------- | ------- | | times | 541118 | 8987 | 2316635 | 2307648 | | with-ctz | add | sub | sll | srl | | -------- | ------ | ---- | ------ | ------ | | times | 541118 | 8987 | 562141 | 553154 | ## Reference [矩陣快速冪](https://hackmd.io/@3xOSPTI6QMGdj6jgMMe08w/HJiorDSpE?type=view) [Fast-doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) [The high-resolution timer API](https://lwn.net/Articles/167897/) [Message logging with printk](https://www.kernel.org/doc/html/latest/core-api/printk-basics.html) [oscarshiang 共筆](https://hackmd.io/@oscarshiang/linux_fibdrv) [Other Built-in Functions Provided by GCC](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) [Kernel modules](https://linux-kernel-labs.github.io/refs/heads/master/labs/kernel_modules.html) [rt:labs Real-time properties of Linux](https://rt-labs.com/docs/p-net/linuxtiming.html) [Built-in Functions to Perform Arithmetic with Overflow Checking](https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html)