# 2023q1 Homework3 (fibdrv) contributed by < `csm1735` > ## 開發環境 ```shell $ gcc --version gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0 $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 39 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Vendor ID: GenuineIntel Model name: Intel(R) Core(TM) i5-7300HQ CPU @ 2.50GHz CPU family: 6 Model: 158 Thread(s) per core: 1 Core(s) per socket: 4 Socket(s): 1 Stepping: 9 CPU max MHz: 3500.0000 CPU min MHz: 800.0000 BogoMIPS: 4999.90 ``` ## 自我檢查清單 - [x] 研讀上述 ==Linux 效能分析的提示== 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗; - [ ] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 - [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise),思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本; - [ ] 學習指出針對大數運算的加速運算和縮減記憶體操作成本的舉措,紀錄你的認知和疑惑 - [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 [POSIX Thread](https://en.wikipedia.org/wiki/POSIX_Threads) 的程式碼來確認。 $\to$ 搭配閱讀〈[並行和多執行緒程式設計](https://hackmd.io/@sysprog/concurrency)〉 ## 開發紀錄 一開始須先將 Secure Boot 的功能關閉,這樣做 `make check` 時才能得到如下的預期結果: ``` Passed [-] f(93) fail input: 7540113804746346429 expected: 12200160415121876738 ``` 其它指令輸出如下 ```shell $ cat /sys/class/fibonacci/fibonacci/dev 510:0 $ cat /sys/module/fibdrv/version 0.1 $ lsmod | grep fibdrv fibdrv 16384 0 $ cat /sys/module/fibdrv/refcnt 0 ``` ## Fibonacci 數列 ### 使用 Fast Doubling 加速運算 觀察 `fibdrv.c` 中的 `fib_sequence()` 可發現,原先的費氏數列是透過迭代的手法來實做。 ```c f[0] = 0; f[1] = 1; for (int i = 2; i <= k; i++) { f[i] = f[i - 1] + f[i - 2]; } ``` 為了加速運算,我們可透過教材中提到的 [Fast Doubling](https://hackmd.io/@sysprog/linux2023-fibdrv/%2F%40sysprog%2Flinux2023-fibdrv-d#Bottom-up-Fast-Doubling) 手法來改寫,也就是以下公式: $F(2k) = F(k)[2F(k+1)-F(k)]$ $F(2k+1) = F(k+1)^2+F(k)^2$ ```c static long long fib_sequence(long long k) { if (k <= 2) return !!k; uint8_t count = 63 - __builtin_clzll(k); uint64_t n0 = 1, n1 = 1; for (uint64_t i = count; i-- > 0;) { uint64_t fib_2n0 = n0 * ((n1 << 1) - n0); uint64_t fib_2n1 = n0 * n0 + n1 * n1; if (k & (1UL << i)) { n0 = fib_2n1; n1 = fib_2n0 + fib_2n1; } else { n0 = fib_2n0; n1 = fib_2n1; } } return n0; } ``` 由於 `k = 1` 跟 `k = 2` 時的結果皆為 1 ,這邊透過 `!!k` 的手法使此處的回傳值只會有 0 或 1。 `__builtin_clzll` 功能為去計算總共有幾個 leading zero ,以方便我們做位移,因為只要檢查目標數對應的位元,即可知曉下次應以 $fib(2n)$ 還是 $fib(2n+1)$ 為基礎進行計算 ### 計算 F93 (包含) 之後的 Fibonacci 數 將使用的資料由 `long long` 更改為 `uint64_t` ,使得正整數的最大有效值從 $2^{64-1}-1$ 來到 $2^{64}-1$ ,此更改讓 $F(93)$ 可以被正確計算出來,但從 $F(94)$ 開始仍會得到 overflow 後的數值。 ```shell $ sudo ./client ... Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309. Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 93, returned the sequence 12200160415121876738. Reading from /dev/fibonacci at offset 94, returned the sequence 1293530146158671551. ... ``` 因此這邊選擇使用數字字串的手法來實作大數運算,以計算到更後面的數值 ```c typedef struct BigN { char num[128]; } bn; ``` 定義了一個長度為 128 的字串來儲存大數,每個 index 分別用來儲存一個位數。 ```c static void string_add(char *a, char *b, char *out) { int size_a = strlen(a), size_b = strlen(b); int index = 0, carry = 0; for (index = 0; index < size_a; ++index) { int tmp = (index < size_b) ? (a[index] - '0') + (b[index] - '0') + carry : (a[index] - '0') + carry; out[index] = '0' + tmp % 10; carry = tmp / 10; } if (carry) { out[index] = '1'; } out[++index] = '\0'; } ``` 此處為字串的加法,由於 `a = fib[i - 1].num, b = fib[i - 2].num` ,我們可確保 `size_a >= size_b` ,因此我們只需要在 `index < size_b` 的時候去做 `a[index]` 跟 `b[index]` 的相加,當 `index >= size_b` 之後就只需要處理剩餘的 `a` ,而由於加法可能會導致進位,因此用了 `carry` 來儲存進位與否。 ```c void reverse_string(char *s, size_t size) { for (int i = 0; i < size / 2; ++i) { s[i] = s[i] ^ s[size - i - 1]; s[size - i - 1] = s[i] ^ s[size - i - 1]; s[i] = s[i] ^ s[size - i - 1]; } } ``` 這邊將字串做反轉的動作,使輸出能如我們預期,實作上使用到了[你所不知道的 C 語言:數值系統篇](https://hackmd.io/@sysprog/c-numerics#%E9%81%8B%E7%94%A8-bit-wise-operator)中所提到的 [XOR swap](https://en.wikipedia.org/wiki/XOR_swap_algorithm) ,這麼做的好處是完全不需要用到額外的記憶體。 ```c static long long fib_sequence_bn(long long k, char *buf) { bn *fib = kmalloc(sizeof(bn) * (k + 1), GFP_KERNEL); strncpy(fib[0].num, "0", 2); strncpy(fib[1].num, "1", 2); for (int i = 2; i <= k; ++i) { string_add(fib[i - 1].num, fib[i - 2].num, fib[i].num); } uint64_t size = strlen(fib[k].num); reverse_string(fib[k].num, size); __copy_to_user(buf, fib[k].num, size + 1); return size; } ``` 計算出所求的 Fibonacci 數,再透過 `copy_to_user`,將想傳給使用者模式的字串複製到 `fib_read` 的 `buf` 參數後,讓 client 端方可接收到此字串。 ```c /* MAX_LENGTH is set to 92 because * ssize_t can't fit the number > 92 */ #define MAX_LENGTH 500 ``` 由於我們已經可以算到更後面的值了,所以要記得將 `MAX_LENGTH` 從 92 調整到 500 ```shell $ sudo ./client ... Reading from /dev/fibonacci at offset 498, returned the sequence 53254932961459429406936070704742495854129188261636423939579059478176515507039697978099330699648074089624. Reading from /dev/fibonacci at offset 499, returned the sequence 86168291600238450732788312165664788095941068326060883324529903470149056115823592713458328176574447204501. Reading from /dev/fibonacci at offset 500, returned the sequence 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125. ... ``` 如此一來就能正確算到 $F(500)$ 的值。 ### 時間測量與效能分析 引入 [ktime](https://www.kernel.org/doc/html/latest/core-api/timekeeping.html) 相關的 API 來測量在 kernel space 的執行時間 ```c static ktime_t kt; static long long fib_time_proxy(long long k, char *buf) { kt = ktime_get(); long long result = fib_sequence_bn(k, buf); kt = ktime_sub(ktime_get(), kt); return result; } ``` 並對 `fib_read` 跟 `fib_write` 做改寫 ```c static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { return (ssize_t) fib_time_proxy(*offset, buf); } static ssize_t fib_write(struct file *file, const char *buf, size_t size, loff_t *offset) { return ktime_to_ns(kt); } ``` 接著修改 `client.c` 的程式碼,因為要先透過 `read` 去計算 `kt` 的值才能夠透過 `write` 得到執行時間,因此每次都在 `read` 完後去呼叫 `write` ```c for (int i = 0; i <= offset; i++) { lseek(fd, i, SEEK_SET); sz = read(fd, buf, sizeof(buf)); printf("Reading from " FIB_DEV " at offset %d, returned the sequence " "%s.\n", i, buf); long long kt = write(fd, write_buf, 0); printf("ktime is %lld.\n", kt); } ``` 如此一來便可得到預期結果 ``` Reading from /dev/fibonacci at offset 0, returned the sequence 0. ktime is 468. Reading from /dev/fibonacci at offset 1, returned the sequence 1. ktime is 413. Reading from /dev/fibonacci at offset 2, returned the sequence 1. ktime is 282. Reading from /dev/fibonacci at offset 3, returned the sequence 2. ktime is 251. ``` --- 使用 [clock_gettime](https://linux.die.net/man/2/clock_gettime) 來測量在 user space 的執行時間 ```c static inline long user_time() { struct timespec time; clock_gettime(CLOCK_REALTIME, &time); return time.tv_sec * 1e9 + time.tv_nsec; } ``` 在 `client.c` 中定義了 `user_time()` 可透過 `clock_gettime()` 取得當前時間,第一個參數選擇 `CLOCK_REALTIME` 表示使用系統實際時間,需要 `#include <time.h>` ,而其中的 `struct timespec` 定義如下 ```c struct timespec { time_t tv_sec; /* seconds */ long tv_nsec; /* nanoseconds */ }; ``` 有兩個成員,分別代表 seconds 跟 nanoseconds ```c for (int i = 0; i <= offset; i++) { lseek(fd, i, SEEK_SET); long long start = user_time(); sz = read(fd, buf, sizeof(buf)); long long ut = user_time() - start; printf("Reading from " FIB_DEV " at offset %d, returned the sequence " "%s.\n", i, buf); long long kt = write(fd, write_buf, 0); printf("ktime is %lld, utime is %lld, kernel to user is %lld\n", kt, ut, ut - kt); } ``` 如此一來,除了原先的 kernel space time ,還可計算 user space time ,並進一步算出 kernel to user 的傳遞時間 ``` Reading from /dev/fibonacci at offset 0, returned the sequence 0. ktime is 356, utime is 1280, kernel to user is 924 Reading from /dev/fibonacci at offset 1, returned the sequence 1. ktime is 334, utime is 1024, kernel to user is 690 Reading from /dev/fibonacci at offset 2, returned the sequence 1. ktime is 200, utime is 768, kernel to user is 568 Reading from /dev/fibonacci at offset 3, returned the sequence 2. ktime is 182, utime is 768, kernel to user is 586 ``` ![](https://i.imgur.com/yLjEqLJ.png) ### 限定 CPU 給特定的程式使用 [教材](https://hackmd.io/@sysprog/linux2023-fibdrv/%2F%40sysprog%2Flinux2023-fibdrv-c#%E9%99%90%E5%AE%9A-CPU-%E7%B5%A6%E7%89%B9%E5%AE%9A%E7%9A%84%E7%A8%8B%E5%BC%8F%E4%BD%BF%E7%94%A8)中提及了兩種讓特定的 CPU 核在開機時就被保留下來的方法 1. 在開機時使用 boot loader 所提供的自訂開機參數功能,手動加入 isolcpus=cpu_id 參數 2. 直接加在 GRUB 的設定檔中 這邊選擇了以 2. 來實作 > [參考1](https://www.linkedin.com/pulse/cpu-isolation-affinity-linux-vinit-tirnagarwar) [參考2](https://blog.csdn.net/tang05505622334/article/details/96477552) 在 `/etc/default/grub` 中新增 ``` GRUB_CMDLINE_LINUX="isolcpus=0" ``` 並透過以下指令更新 ``` sudo update-grub ``` 接著重新開機,即可檢查是否成功 ``` $ cat /sys/devices/system/cpu/isolated 0 ``` 可看到第 0 個 cpu 被保留了下來,接著就可以透過 taskset 命令將這個 CPU 核指定給要執行的程式使用。 ``` sudo taskset -c 0 ./client ``` ![](https://i.imgur.com/YTeseAi.png) ### 排除干擾效能分析的因素 抑制 address space layout randomization (ASLR) ```shell $ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space" ``` 設定 scaling_governor 為 performance。 準備以下 shell script,檔名為 `performance.sh` : ``` for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor do echo performance > ${i} done ``` 之後再用 `$ sudo sh performance.sh` 執行 針對 Intel 處理器,關閉 turbo mode ```shell $ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" ``` 關閉 SMT (Intel 稱它為 Hyper-Threading) ```shell $ sudo sh -c "echo off > /sys/devices/system/cpu/smt/control" ``` ![](https://i.imgur.com/oh1FaUo.png)