2023q1 Homework3 (fibdrv)

--- tags: linux2023 --- # 2023q1 Homework3 (fibdrv) contributed by <`DokiDokiPB` > :::danger 使用一個 grave accent 符號來標注，注意細節！ :notes: jserv ::: #### 實驗環境 ```shell $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 39 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Vendor ID: GenuineIntel Model name: Intel(R) Core(TM) i5-4250U CPU @ 1.30GHz Thread(s) per core: 2 Core(s) per socket: 2 Socket(s): 1 Stepping: 1 CPU max MHz: 2600.0000 CPU min MHz: 800.0000 Virtualization features: Virtualization: VT-x Caches (sum of all): L1d: 64 KiB (2 instances) L1i: 64 KiB (2 instances) L2: 512 KiB (2 instances) L3: 3 MiB (1 instance) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-3 ``` 使用 Macbook Air 2013 A1466 (Ubuntu Linux) 二核四執行緒 :::warning 區分 processor core (核) 和 OS kernel (核心) :notes: jserv ::: 另外舊版的電腦沒有 Secure Boot 的問題，可以直接加載 fibonacci driver #### 設定 VSCode 開發環境 fibdrv 作業本身沒有把 .vscode 的目錄加入 .gitignore 列表中，需要手動新增 `.vscode` :::warning 可提交 pull request。 :notes: jserv ::: 當 VSCode 第一次開啟 workspace 時，會建立 `.vscode` 目錄跟 `c_cpp_properties.json` 檔案，修改該檔案，即可在 VSCode 中使用語法提示： ```json { "configurations": [ { "name": "Linux", "includePath": [ "${workspaceFolder}/**", "/usr/include", "/usr/local/include", "/usr/src/linux-headers-5.19.0-32-generic/include", "/usr/src/linux-headers-5.19.0-32-generic/arch/x86/include", "/usr/src/linux-headers-5.19.0-32-generic/arch/x86/include/generated", "/usr/lib/modules/5.19.0-32-generic/build/include", "/usr/lib/gcc/x86_64-linux-gnu/11/include" ], "defines": [ "KBUILD_MODNAME=\"fibdrv_new\"", "__GNUC__", "__KERNEL__", "MODULE", "_USERSPACEFIB" ], "compilerPath": "/usr/bin/gcc", "cStandard": "gnu17", "cppStandard": "gnu++17", "intelliSenseMode": "linux-gcc-x64" } ], "version": 4 } ``` `_USERSPACEFIB` 是自定義的巨集，在實作大數時，在使用者層級(userspace) 驗證每個函式正確性非常方便。不同的設定為影響 VSCode 解析巨集時候的行為，例如 `"cStandard": "gnu17"`，若是設定成 `"cStandard": "c99"`，部份結構體如 `struct timespec` 就不會被 VSCode 解析 ### 自我檢查清單 - [ ] 研讀上述 Linux 效能分析的提示描述 - [ ] 研讀上述費氏數列相關材料 (包含論文)，摘錄關鍵手法，並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 - [ ] 複習 C 語言數值系統和 bitwise operation，思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本 - [ ] 學習指出針對大數運算的加速運算和縮減記憶體操作成本的舉措，紀錄你的認知和疑惑 - [ ] 注意到 fibdrv.c 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣，什麼場景中會需要呢？ # 開發紀錄 ##### 初步修改 `fib_sequence` 參考作業說明以及 [Fast Doubling 手法說明](https://hackmd.io/@sysprog/linux2023-fibdrv/%2F%40sysprog%2Flinux2023-fibdrv-a#-%E8%B2%BB%E6%B0%8F%E6%95%B8%E5%88%97) 1. $Fib(2k) = Fib(k)[2Fib(k+1) - Fib(k)]$ 2. $Fib(2k + 1) = Fib(k+1)^2 + Fib(k)^2$ 在以下程式碼中，分別代表變數： - $Fib(k) \rightarrow a$ - $Fib(k + 1) \rightarrow b$ ```c static long long fib_sequence(long long k) { unsigned long long mask = ULLONG_MAX ^ (ULLONG_MAX >> 1); mask >>= __builtin_clz(k); unsigned long long a = 0, b = 1; // fib(0), fib(1) for (; mask; mask >>= 1) { unsigned long long c = a * (2 * b - a); unsigned long long d = a * a + b * b; int aset = !(mask & k); int bset = !!(mask & k); a = d ^ ((c ^ d) & -aset); b = d + (c & -bset); } return a; } ``` 其中的 `aset` 與 `best` 的位元技巧改寫自以下程式碼: ```c if (mask & offset) { a = d; b = c + d; } else { a = c; b = d; } ``` 使用 `make load` 後使用 `sudo ./client` 輸出，可以計算到 $Fib(92)$ ``` Reading from /dev/fibonacci at offset 89, returned the sequence 1779979416004714189. Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120. Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309. Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 93, returned the sequence 7540113804746346429. ``` ## 使用 `bn` 結構體實作 Fibonacci 之前撰寫的 [2022 fibdrv](https://hackmd.io/@DokiDokiPB/2022fibdrv) 中使用的 ```decnum``` 程式碼，採用 $10^9$ 進制實作(在 $32$ 位元下)，進位時需要用到除法運算子(```%```)，撰寫程式碼上並沒有比較簡單；對比以 $2$ 進制實作的運算時間差異巨大，因此這次採用 $2$ 進制實作。這裡採用 `_bn` 大數結構體來實作。新增 `bn.c`, `bn.h` 用於實作 `_bn`，並在 `Makefile` 修改名稱，使其可以將該新增檔案編譯 ```diff - TARGET_MODULE := fibdrv + TARGET_MODULE := fibdrv_new - $(TARGET_MODULE)-objs := fibdrv.o + $(TARGET_MODULE)-objs := fibdrv.o bn.o ```  ### 使用 `ufib.c` 在使用者層級進行測試 - 加載費氏數列模組進入 Linux kernel 後，就會很難除錯，很多時候若是實作的函式本身若寫錯，需要花費多餘的時間檢查。 - 引入 `bn` 結構體進 `fibdrv` 模組，相關測試程式碼也需要加入的。雖然 C 語言一定有測試的函式庫可以使用，這裡只單獨新增 `ufib.c` 的檔案測試。新增一個在使用者層級即可編譯與偵錯的程式碼與巨集在 `bn.h` 中新增巨集 `_USERSPACEFIB`，用於在使用者層級測試程式碼用。 ```c #ifndef _USERSPACEFIB // ... #else #include <stdbool.h> #include <stdlib.h> #include <string.h> #define kmalloc(size, flags) (malloc(size)) #define krealloc(ptr, size, flags) (realloc((ptr), (size))) #define kfree(p) (free(p)) typedef unsigned int __u32; typedef signed int __s32; typedef unsigned long __u64; typedef signed long __s64; #define GFP_KERNEL (0) #endif ``` 搭配檔案 `ufib.c`，並在 Makefile 中新增 `$(CC) -g ufib.c bn.c -o ufib.out -D _USERSPACEFIB` 即可編譯對應檔案 ### _bn 結構體 ```c typedef struct _bn { __u32 size; __u32 *num; } bn; void bn_add(bn *a, bn *b, bn *c); void bn_diff(bn *a, bn *b, bn *c); void bn_lshift(bn *a, __u32 s); void bn_mult(bn *a, bn *b, bn *c); #define BN_DIV_ROUND(x, len) (((x) + (len) - 1) / (len)) ``` - `size` 即表示配置位元組的大小，在實作上，結構體不一定用到全部的位元組。 - `BN_DIV_ROUND` 實作上與 `DIV_ROUND` 相同透過 `BN_DIV_ROUND` 巨集 $\lceil \frac{digits + 31}{32} \rceil$ 計算需要的位元組，以數值列出關係 ``` digits macro 0 (0 + 32 - 1) / 32 = 0 1 (1 + 32 - 1) / 32 = 1 2 (2 + 32 - 1) / 32 = 1 32 (32 + 32 - 1) / 32 = 1 ...... 33 (33 + 32 - 1) / 32 = 2 64 (64 + 32 - 1) / 32 = 2 ``` #### bn_add 函式 ```c void bn_add(bn *a, bn *b, bn *c) { __u32 digits = BN_MAX(bn_msb(a), bn_msb(b)) + 1; digits = BN_DIV_ROUND(digits, 32) + !digits; bn_resize(c, digits); bn aa = *a; bn bb = *b; if (a->size < b->size) bn_swap(&aa, &bb); __u64 carry = 0; __u32 i = 0; for (; i < b->size; i++) { carry += (__u64) aa.num[i] + (__u64) bb.num[i]; c->num[i] = carry; carry >>= 32; } for (; i < a->size; i++) { carry += a->num[i]; c->num[i] = carry; carry >>= 32; } if (carry) { c->num[i] = carry; } } ``` 兩數相加最多只會進 $1$ 位，因此 `*c` 需要配置與 `*a` 或 `*b` 中最大位元數量 $+1$。其中若是 `*a == *c` 或 `*b == *c`，使用 `*aa` 與 `*bb` 代為操作加法。 #### bn_diff 函式 ```c void bn_diff(bn *a, bn *b, bn *c) { int abcmp = bn_cmp(a, b); if (abcmp == 0) { bn_set_zero(c); return; } bn aa = *a; bn bb = *b; if (abcmp == 1) bn_swap(&aa, &bb); bn_resize(c, a->size); __u64 carry = 0; __u32 i = 0; for (; i < bb.size; i++) { __u64 x = (__u64) aa.num[i] - (__u64) bb.num[i] - carry; __u64 sign = !!(x >> 32); c->num[i] = x + (sign << 32); carry = sign; } for (; i < aa.size; i++) { __u64 x = (__u64) aa.num[i] - carry; __u64 sign = !!(x >> 32); c->num[i] = x + (sign << 32); carry = sign; } } ``` 在 fast doubling 手法中，減法運算皆為較大的數字減去較小的數字，因此實作上可以改為計算兩者的差值。 1. `bn_cmp` 決定和者數字較大，由大減去小的數字，用來保證是大數減去小數字。 2. 在計算減法過程中，若是某個位數減出負數，則向下一位位數借位。 - 負數在 $2$ 補數系統中，變數 `x` 的位元從 `[63:32]` 皆為 $1$ ，透過 `sign = !!(x >> 32)` 即可判斷正負。並加回一個 `sign << 32`，`carry` 向高位元借位。過程上幾乎與 `bn_add` 函式相同 #### bn_mult 函式 ```c void bn_mult(bn *a, bn *b, bn *c) { __u32 digits = bn_msb(a) + bn_msb(b) + 1; digits = BN_DIV_ROUND(digits, 32) + !digits; bn_resize(c, digits); if (!c->num) return; bn_set_zero(c); for (__u32 i = 0; i < a->size; i++) { for (__u32 j = 0; j < b->size; j++) { __u64 carry = (__u64) a->num[i] * (__u64) b->num[j]; __u32 k = i + j; do { __u64 s = carry + c->num[k]; c->num[k] = s; carry = s >> 32; k++; } while (k < c->size && carry); } } } ``` 與加法、減法不同之處，在於某兩位數乘法之後，至多指派兩組資料到不同記憶體空間，因此必須是 `a != c` 且 `b ！!= c` #### bn_lshift 函式 ```c void bn_lshift(bn *a, __u32 s) { s = s & 0b11111; __u64 carry = 0; __u32 i = 0; for (; i < a->size; i++) { __u64 x = a->num[i]; a->num[i] = (x << s) | carry; carry = x >> (32 - s); } if (carry) { bn_resize(a, i + 1); a->num[i] = carry; } } ``` 最大左移量為 31，在 fast doubleing 手法中只須左移 1，也就是乘以 2 #### `bn_clz` 函式與 `bn_msb` 函式初步的實作從最高位數去計算零的數量 ```c static __u32 bn_clz(const bn *p) { __u32 cnt = 0; for (__s64 i = p->size - 1; i >= 0; i--) { if (p->num[i]) { cnt += __builtin_clz(p->num[i]); break; } else { cnt += 32; } } return cnt; } static __u32 bn_msb(const bn *p) { return (p->size << 5) - bn_clz(p); } ``` ### 實作加法的 Fibonacci 數列該函式實作於 `ufib.c` 中 ```c static void test_bn_add(__u64 k) { bn a, b, c; /* ... */ a.num[0] = 0; b.num[0] = 1; for (__u64 i = 0; i < k; i++) { bn_add(&a, &b, &c); bn_swap(&a, &b); bn_swap(&c, &b); } /* the result is stored in *a */ } ``` ### 實作 Fast Doubling 手法的 Fibonacci 數列該函式實作於 `ufib.c` 中 ```c static void test_bn_fast_doubling(__u64 k) { __u64 mask = 1UL << 63; mask >>= __builtin_clzl(k); bn a, b, c, d; /* ... */ a.num[0] = 0; b.num[0] = 1; for (; mask; mask >>= 1) { bn_cpy(&d, &b); bn_lshift(&d, 1); bn_diff(&d, &a, &d); bn_mult(&a, &d, &c); bn_mult(&a, &a, &d); bn_mult(&b, &b, &a); bn_add(&d, &a, &d); if (mask & k) { bn_add(&c, &d, &b); bn_swap(&a, &d); } else { bn_swap(&c, &a); bn_swap(&d, &b); } } } ``` 觀摩 [yanjiew 同學](https://hackmd.io/@yanjiew/linux2023q1-fibdrv#%E5%AF%A6%E4%BD%9C%E5%A4%A7%E6%95%B8%E9%81%8B%E7%AE%97%E7%89%88%E6%9C%AC%E8%B2%BB%E6%B0%8F%E6%95%B8%E5%88%97)底下留言中可以使用 Hash table，來嘗試引入課堂中提到的 Hash table #### 將大數轉換成字串這裡直接使用 [yaya573142 同學](https://github.com/KYG-yaya573142/fibdrv/blob/master/bn_kernel.c#L47)的 `bn_to_string` 令 $n$ 表示十進制的費式數列的結果，最終需要長度為 $y = \lceil log_{10}{n} \rceil$ 的字串，則 - 可化為 $y = \lceil \frac{log_2{n}}{log_2{10}} \rceil$ - $log_{2}{n}$ 即表示最高位有效位元位置，可藉由 `bn_msb` 或簡單的 `8 * sizeof(__u32) * size` 求出 - $log_{2}{10}$ 約略為 $3.222 \approx 3$ - `y = (8 * sizeof(__u32) * size) / 3 + 3` ## 加入量測時間 ### 在使用者層級量測時間先直接在 `ufib.c` 上直接進行測試加入 `ufib_plot.gp` 檔案 ``` plot [0:500][0:] \ "ufib_time.ut" using 1:2 with linespoints linewidth 2 pointtype 7 pointsize 0.5 title "fast doubling" ``` 對應程式碼： ```c #include <time.h> #define NANOSECOND(x) ((x).tv_sec * 1e9 + (x).tv_nsec) struct timespec t1, t2; for (int i = 0; i < 500; i++) { clock_gettime(CLOCK_MONOTONIC, &t1); clock_gettime(CLOCK_MONOTONIC, &t2); long long ut = (long long) (NANOSECOND(t2) - NANOSECOND(t1)); printf("%d %lld\n", i, ut); } ``` Makefile 新增 ``` uplot: ufib ./ufib.out > ./ufib_time.ut gnuplot scripts/ufib_plot.gp ``` 即可產生圖片 ![](https://i.imgur.com/o2fs6E5.png) 忽略極端值之後，從 1000 ns 開始往上遞增，最終 $Fib(500)$ 為 5220 ns 對比 2022 年寫的 fibdrv 採用 $10^9$ 進制的執行時間差異巨大，起始時間從 10000 ns 開始，有明顯的提昇 :+1:  ### 將 bn 結構體加入 `fibdrv.c` 中 > git commit [41c651ec143d6e606e088e3034098d52ae458e1a](https://github.com/a1091150/fibdrv_2023/commit/41c651ec143d6e606e088e3034098d52ae458e1a) 加入程式碼之後，主要修改 `#define MAX_LENGTH 500` 與 `fib_read` 函式，設計成： - 將輸出字串結果透過 `copy_to_user` 複製回 user space - 將回傳值作為時間間隔 ```c static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { ktime_t kt; kt = ktime_get(); bn ret = fib_fast_doubling(*offset); kt = ktime_sub(ktime_get(), kt); if (!ret.num) goto end; char *s = bn_to_string(&ret); size_t len = strlen(s) + 1; len = len > size ? size : len; copy_to_user(buf, s, len); kfree(s); end: return (ssize_t) ktime_to_ns(kt); } ``` 新增 `tclient.c` 與 `scripts/kfib_plot.gp`，內容與 `ufib_plot.gp` 類似，最終產生結果在圖片中某些測量極端數值，是因為尚未移除干擾。在[作業說明](https://hackmd.io/@sysprog/linux2023-fibdrv/%2F%40sysprog%2Flinux2023-fibdrv-c)中，需要移除干擾，使得結果穩定。 ![](https://i.imgur.com/dkkkpzp.png) ## 初步使用 `perf` 進行量測在之前的 2022 fibdrv 中，使用 `ftrace` 與 `trace-cmd` 追蹤程式碼在 kernel space 中的行為，大致在 user space 下測試差不多。使用 `sudo perf stat -r 2 -e cycles,instructions,cache-misses,cache-references,branch-instructions,branch-misses ./ufib.out`, `sudo perf report --stdio` 觀察結果 ### 測試 `test_fast_doubling` 函式在 `ufib.c` 中的改為測試 `test_fast_doubling`，程式碼主體改為以下程式碼 ```c for (int i = 0; i < 500; i++) { test_bn_fast_doubling(500); } ``` ``` test_bn_fast_doubling |--44.39%--bn_mult | |--11.18%--bn_resize | |--3.75%--asm_sysvec_apic_timer_interrupt ... | --3.73%--bn_msb | |--14.86%--bn_add | --3.73%--bn_msb |--10.94%--bn_diff | --3.52%--bn_cmp, bn_msb, bn_clz |--7.58%--bn_new |--3.74%--bn_cpy --3.74%--bn_resize ``` ``` 7,793,228 cycles ( +- 2.18% ) 17,076,298 instructions # 2.26 insn per cycle ( +- 0.02% ) 4,560 cache-misses # 23.547 % of all cache refs ( +- 16.46% ) 18,457 cache-references ( +- 2.24% ) 2,121,315 branch-instructions ( +- 0.02% ) <not counted> branch-misses (0.00%) 0.0038004 +- 0.0000351 seconds time elapsed ( +- 0.92% ) ``` 觀察輸出，初步有兩種改進方向： - 改善記憶體配置時機跟大小： - `bn_resize` 有相當的比例 - 程式法部份的改進： - 意外的事是 `bn_add`, `bn_diff`, `bn_msb` 總共有 24 % - `bn_mult` 本身的乘法計算就佔去 30 % ### 測試 `test_bn_add` 函式在 `ufib.c` 中的改為測試 `test_bn_add`，程式碼主體改為以下程式碼 ```c for (int i = 0; i < 500; i++) { test_bn_add(500); } ``` `perf` 輸出： ``` |--96.56%--test_bn_add | |--87.57%--bn_add | | |--18.25%--bn_msb | | | --10.87%--bn_clz | | --6.35%--bn_resize | |--4.48%--bn_swap | |--1.79%--bn_resize | --0.88%--bn_msb --0.90%--bn_swap ``` 使用 `sudo perf stat -r 10 -e cycles,instructions,cache-misses,cache-references,branch-instructions,branch-misses ./ufib.out` ``` Performance counter stats for './ufib.out' (10 runs): 66,434,477 cycles ( +- 0.64% ) (64.59%) 148,983,157 instructions # 2.23 insn per cycle ( +- 0.15% ) (84.58%) 5,870 cache-misses # 21.757 % of all cache refs ( +- 16.48% ) (84.58%) 24,720 cache-references ( +- 1.87% ) (84.58%) 14,859,034 branch-instructions ( +- 0.12% ) (84.58%) 32,670 branch-misses # 0.22% of all branches ( +- 2.83% ) (81.68%) 0.026891 +- 0.000233 seconds time elapsed ( +- 0.87% ) ``` `cache-misse` 在第一次執行程式時佔全部的 21 %，在第二次執行時會剩下 5 % 在加法程式碼實作中，目前看起來是沒有更多改進空間