Linux 核心專題: 重做 fibdrv

# Linux 核心專題: 重做 fibdrv > 執行人: paulpeng-popo > [專題解說錄影](https://youtu.be/yPyiDiKKO6k) > [GitHub](https://github.com/paulpeng-popo/fibdrv) :::success :question: 提問清單 * ? ::: ## 任務簡述依據 [fibdrv](https://hackmd.io/@sysprog/linux2023-fibdrv) 作業規範，繼續投入 Linux 核心模組和相關程式的開發。 ### 公式推導假設成年兔子為 `a`，幼年兔子為 `b`，且兔子不死亡 $a_{n+1} = a_n + b_n$ $b_{n+1} = a_n$ ==一般遞迴== $a_{n+1} = a_{n-1} + a_n, n \ge 1$ $a_0 = 0, a_1 = 1$ ==矩陣運算 Q-matrix== $\begin{bmatrix} a_{n+1} \\ b_{n+1} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} a_n \\ b_n \end{bmatrix}$ 又可改寫成 $\begin{bmatrix} a_{n+1} \\ a_n \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} a_n \\ a_{n-1} \end{bmatrix}$ $\begin{bmatrix} a_{n+1} & a_n \\ a_n & a_{n-1} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^n$ ==Fast Doubling== $F(0) = a_0 = 0$ $F(1) = a_1 = 1$ $$ \begin{split} \begin{bmatrix} F(2n+1) \\ F(2n) \end{bmatrix} &= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^{2n} \begin{bmatrix} F(1) \\ F(0) \end{bmatrix}\\ \\ &= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^n \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^n \begin{bmatrix} F(1) \\ F(0) \end{bmatrix}\\ \\ &= \begin{bmatrix} F(n+1) & F(n) \\ F(n) & F(n-1) \end{bmatrix} \begin{bmatrix} F(n+1) & F(n) \\ F(n) & F(n-1) \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix}\\ \\ &= \begin{bmatrix} F(n+1)^2 + F(n)^2\\ F(n)F(n+1) + F(n-1)F(n) \end{bmatrix} \end{split} $$ 其中 $F(n-1) = F(n+1) - F(n)$ 最後得到 $F(2k) = F(k)[ 2F(k+1)-F(k) ]$ $F(2k+1) = F(k+1)^2 + F(k)^2$ ### Fast Doubling $F(2k) = F(k)[ 2F(k+1)-F(k) ]$ $F(2k+1) = F(k+1)^2 + F(k)^2$ 作業說明中的虛擬碼 ```clike Fast_Fib(n) a = 0; b = 1; // m = 0 for i = (number of binary digit in n) to 1 t1 = a*(2*b - a); t2 = b^2 + a^2; a = t1; b = t2; // m *= 2 if (current binary digit == 1) t1 = a + b; // m++ a = b; b = t1; return a; ``` ### 原理這裡 Fast Doubling 利用 **比 $n$ 兩倍小的 $n'$** 來算出 $F(n)$，遞迴的深度則被限制在 $O(\log{n})$ 內，這可以大幅減少重複的計算參考：[Calculating Fibonacci Numbers by Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) ```c /* FILE: fibdrv.c */ static long long fib_fastd(long long n) { if (n < 2) return n; long long a = 0; long long b = 1; for (int j = 63 - 1; j >= 0; j--) { long long c = a * ((b << 1) - a); long long d = a * a + b * b; a = c; b = d; if ((n >> j) & 1LL) { a = d; b = c + d; } } return a; } ``` n 小於 2，直接回傳，不會經過迴圈從 n 的 MSB 開始，依序計算該數的費氏數，直到 n 的 LSB 或剩下之 bit 皆為 0，則提前跳出迴圈先用 `h` 記錄實際的數字共有幾個 bit (不含 leading 0s)，接著從 MSB 開始計算首項費氏數，再藉由迴圈依次增加 `n >> j` 的 bit 數，以計算後續的費氏數 **複雜度分析** - 每次迭代使用 3 個乘法跟 2 個加法，時間複雜度為 $O(1)$ - 根據輸入的 n 大小決定迭代次數，即最多做 $O(log n)$ 次遞迴 - 總共時間複雜度為 $O(1) * O(log n) = O(log n)$ 解釋 fast doubling ==為什麼用 2k 跟 2k + 1== 例如：**長度為 n 個 bit 的數字 $\alpha$** 與 **長度為 n+1 個 bit 的數字 $\beta$** 具有以下關係： - if $\beta$ 要為偶數，則 $\beta = 2\alpha$，也就是 $\beta$ = $\alpha$ << 1 - if $\beta$ 要為奇數，則 $\beta = 2\alpha + 1$，也就是 $\beta$ = $\alpha$ << 1 + 1 剛好就是 binary number 的進位規則使用 `__builtin_clzll` function 做硬體上的計算加速 `clz` (Count Leading Zero) 可以計算數字 `n` 前面有幾個 0，加速取得 `h` 的速度 ```diff - for (int j = 63 - 1; j >= 0; j--) { + long long mask = 1LL << (63 - __builtin_clzll(n)); + for (; mask; mask >>= 1) { - if ((n >> j) & 1LL) { + if (mask & n) { ``` ### 實作大數運算前的效能比較 #### VLA 改進 Linux kernel 不支援 VLA，因此使用 Dynamic Programming 與空間重複利用的技巧，避免動態陣列的使用，同時將 space complexity 從 $O(n)$ 降到 $O(1)$ > Linux 核心限制使用 VLA 的因素 > 在 [The Linux Kernel is now VLA-Free](https://www.phoronix.com/news/Linux-Kills-The-VLA) 有提到 Linux kernel 不再允許 VLA 的使用 > > C99 具有 VLA (Variable-Length Array) 的特性，允許執行時期再決定陣列佔用的空間大小，但可能會造成 stack overflow 等安全疑慮根據教材 [你所不知道的 C 語言：函式呼叫篇](https://hackmd.io/@sysprog/c-function#stack-based-buffer-overflow)，攻擊者可以利用 stack-based-buffer-overflow 來執行惡意程式碼，從而對系統造成威脅 > > 此外，Linus Torvalds 也說過：[USING VLA'S IS ACTIVELY STUPID! It generates much more code, and much _slower_ code (and more fragile code), than just using a fixed key size would have done.](https://lkml.org/lkml/2018/3/7/621)』 ```c /* FILE: fibdrv.c */ static long long fib_sequence(long long k) { long long f[2] = {0, 1}; if (k < 2) return f[k]; for (int i = 2; i <= k; i++) { f[i & 1] = f[0] + f[1]; } return f[k & 1]; } ``` #### 時間測量功能 > 參考 [時間測量與效能分析](https://hackmd.io/@sysprog/linux2023-fibdrv/%2F%40sysprog%2Flinux2023-fibdrv-c) 的說明與 [qwe661234](https://hackmd.io/@qwe661234/linux2022q1-homework3#%E6%95%88%E8%83%BD%E5%88%86%E6%9E%90) 的解釋 > 1. [High resolution timers and dynamic ticks design notes](https://docs.kernel.org/timers/highres.html) > 2. [hrtimers - subsystem for high-resolution kernel timers](https://docs.kernel.org/timers/hrtimers.html) 在 Linux 核心中具有兩種計時方式 ==jiffies== 是 Linux kernel 較早期存在的計時方法，使用作業系統**從開機以來的計時器中斷發生的次數**作為計時的依據，此全域變數稱作 `ticks` 數。計時器中斷為週期性的中斷事件，以固定的時間間隔觸發，利用間隔固定的特性便能達到測量時間的目的。通常情況下，jiffies 的精度為毫秒級別，具體的精度取決於 tick rate 的大小。例如 1000Hz 表示每秒觸發 1000 次的 timer interrupt，因此 jiffies 的增加速度為每毫秒一次。 jiffies 轉換為**秒**的公式為 ```c (unsigned long long)(jiffies - INITIAL_JIFFIES) / HZ; ``` 由以上程式碼可知，假如系統原先設定的 `tick rate (HZ)` 不大於 $10^9$，其時間精度是無法達到 ns 等級的。 ==hrtimer== 是在 2.6.16 開始有的新的計時機制，裡面使用了 `ktime_t` 這個資料結構來進行計時。書中提到，原本基於 `jiffies` 的計時機制 `timer wheel` 在許多層面上有許多缺失，並幾乎無法再優化/整合了，包括： > 1. The forced handling of low-resolution and high-resolution timers in the same way leads to a lot of compromises, macro magic and #ifdef mess. > 2. The unpredictable [O(N)] overhead of cascading leads to delays which necessitate a more complex handling of high resolution timers, which in turn decreases robustness. > 3. The timer wheel data structure is too rigid for high-res timers. 因此實作 `hrtimer` 的設計考量包括 - simplicity - data structure not bound to jiffies or any other granularity. All the kernel logic works at 64-bit nanoseconds resolution - no compromises - simplification of existing, timing related kernel code ktime_t 結構體定義如下 ```c union ktime { s64 tv64; }; typedef union ktime ktime_t; ``` 其結構體內有一型態為 64 位元的有號整數 tv64 跟據 Linux 原始碼 [include/linux/ktime.h](https://github.com/torvalds/linux/blob/master/include/linux/ktime.h)， `ktime_t` 的單位為奈秒 (ns) 透過呼叫 [Linux kernel API](https://elixir.bootlin.com/linux/v4.4/source/kernel/time/timekeeping.c#L665) `ktime_get()`，可以取得**相對於系統啟動時間的偏移量**，因此藉由計算出兩次 `ktime_get()` 的時間差異，便可知其之間的程式碼執行時間花費 ```c /* FILE: fibdrv.c */ static ktime_t kt; static long long fib_time_proxy(long long k) { kt = ktime_get(); long long result = fib_sequence(k); kt = ktime_sub(ktime_get(), kt); return result; } static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { return (ssize_t) fib_time_proxy(*offset); } static ssize_t fib_write(struct file *file, const char *buf, size_t size, loff_t *offset) { return ktime_to_ns(kt); } ``` 在 read 的時候，計算 fib，並同時將所花時間儲存在 `kt` 中接著 write 的時候，將 `kt` 的值轉換成 nanoseconds 再回傳給 client `client` 中順序呼叫 `read` 和 `write`，即可取得所花費的 kernel time ```c /* FILE: client.c */ for (int i = 0; i <= offset; i++) { lseek(fd, i, SEEK_SET); sz = read(fd, buf, 1); kt = write(fd, write_buf, 1); } ``` 執行 `sudo ./client` 後，成功輸出計算費氏數時間，輸出如下 ``` Reading from /dev/fibonacci at offset 0, returned the sequence 0. Time: 331 (ns). Reading from /dev/fibonacci at offset 1, returned the sequence 1. Time: 197 (ns). Reading from /dev/fibonacci at offset 2, returned the sequence 1. Time: 183 (ns). Reading from /dev/fibonacci at offset 3, returned the sequence 2. Time: 113 (ns). Reading from /dev/fibonacci at offset 4, returned the sequence 3. Time: 123 (ns). Reading from /dev/fibonacci at offset 5, returned the sequence 5. Time: 120 (ns). Reading from /dev/fibonacci at offset 6, returned the sequence 8. Time: 81 (ns). Reading from /dev/fibonacci at offset 7, returned the sequence 13. Time: 78 (ns). Reading from /dev/fibonacci at offset 8, returned the sequence 21. ... ``` #### 選擇演算法 > 參考 [yanjiew1](https://hackmd.io/@yanjiew/linux2023q1-fibdrv) 的做法，利用 `fib_read` 傳入 `size` 作為使用不同演算法的條件 ```c static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) ``` ```c /* FILE: fibdrv.c */ switch (size) { case 1: result = fib_sequence(k); break; case 2: result = fib_fastd(k); break; case 3: result = fib_fastd_clz(k); break; default: return -EINVAL; } ``` 使用 gnuplot 製圖，確認 fast doubling 確實加速費氏數的計算，同時發現 `__builtin_clzll` 大幅降低時間開銷 ![](https://hackmd.io/_uploads/HkAnHsWuh.png) NOTE: 此處時間為 kernel mode 中計算的時間，還未加入 user space 的時間開銷比較 #### 排除效能干擾因素使用 `taskset` 預留 cpu 給效能測試使用 1. `$ sudo vim /etc/default/grub` 2. 修改此行 `GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=5"` 3. `$ sudo update-grub` 4. `$ sudo init 6` 將作業說明提到的方法寫成一個 script，包括： 1. 關閉 ASLR（address space layout randomization 2. 將 cpufrq 從 powersave 變成 performance 3. 關閉 turbo mode 4. 取消 cpu smp affinity ```shell #!/bin/sh CPUID=5 MASK=$((1 << $CPUID)) MASK=$((0xfff & ~$MASK)) MASK=`printf "%x" $MASK` ORIG_ASLR=`cat /proc/sys/kernel/randomize_va_space` ORIG_GOV=`cat /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor` ORIG_TURBO=`cat /sys/devices/system/cpu/intel_pstate/no_turbo` ORIG_IRQ=`cat /proc/irq/$CPUID/smp_affinity` sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space" sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" sudo sh -c "echo performance > /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor" for file in /proc/irq/*/smp_affinity; do var=0x$(cat "$file") var=$((var & 0xfdf)) var=`printf "%x" $var` sudo sh -c "echo $var > $file" 2> /dev/null done sudo sh -c "echo $MASK > /proc/irq/$CPUID/smp_affinity" # Load the module and run the client make unload make load python3 scripts/statisic_plot.py -cpu=$CPUID gnuplot -e "filename='scripts/data.txt'" scripts/draw.gp make unload # Restore original settings sudo sh -c "echo $ORIG_ASLR > /proc/sys/kernel/randomize_va_space" sudo sh -c "echo $ORIG_GOV > /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor" sudo sh -c "echo $ORIG_TURBO > /sys/devices/system/cpu/intel_pstate/no_turbo" sudo sh -c "echo $ORIG_IRQ > /proc/irq/$CPUID/smp_affinity" ``` 最後使用 `taskset -c 5 ./client` 指定 client 使用 cpu 5 進行費氏數計算 #### 去除極端值參考 [時間測量與效能分析](https://hackmd.io/@sysprog/linux2023-fibdrv/%2F%40sysprog%2Flinux2023-fibdrv-c) 計算 50 次費氏數的平均，並將離散值做處理，用以得到較平滑的效能比較圖去除的 outlier 以超過==兩個標準==差為條件，移除離 50 次平均值太遠的數據 > [commit 4b6a40c](https://github.com/paulpeng-popo/fibdrv/commit/4b6a40cd4785434a5250e6ef80409fb5e2ea2561#diff-310bc6068100c1b144bacada83ff9be0125cea94160c62aea2ca0b15c3e566fa) ![](https://hackmd.io/_uploads/HkcaqzEO3.png) 以常態分佈曲線來看，當我們排除==兩個標準==以外的數據後，代表此統計手法是在 95% 之信賴區間內取樣平均 ```python def outlier_filter(datas, threshold = 2): datas = np.array(datas) z = np.abs((datas - datas.mean()) / datas.std()) return datas[z < threshold] def data_processing(data_set, n): catgories = data_set[0].shape[0] samples = data_set[0].shape[1] final = np.zeros((catgories, samples)) for c in range(catgories): for s in range(1, samples): final[c][0] = data_set[0][c][0] final[c][s] = \ outlier_filter([data_set[i][c][s] for i in range(n)]).mean() return final ``` ## TODO: 紀錄閱讀作業說明中所有的疑惑閱讀 [fibdrv](https://hackmd.io/@sysprog/linux2023-fibdrv) 作業規範，包含「作業說明錄影」和「Code Review 錄影」，本著「誠實面對自己」的心態，在本頁紀錄所有的疑惑，並與授課教師預約討論。過程中，彙整 [Homework3](https://hackmd.io/@sysprog/linux2023-homework3) 學員的成果，挑選至少三份開發紀錄，提出值得借鏡之處，並重現相關實驗。 > Todo: 嘗試使用 `mlock` 改善 page fault 的情況 > [2020 - KYG-yaya573142](https://hackmd.io/@KYWeng/rkGdultSU#system-call-overhead) > Todo: 嘗試使用 `sysfs` > [willwillhi1](https://hackmd.io/@willwillhi/linux2023q1-fibdrv#%E5%A4%A7%E6%95%B8%E9%81%8B%E7%AE%97) ## TODO: 回覆「自我檢查清單」回答「自我檢查清單」的所有問題，需要附上對應的參考資料和必要的程式碼，以第一手材料 (包含自己設計的實驗) 為佳 - [x] 研讀上述 ==Linux 效能分析的提示== 描述，在自己的實體電腦運作 GNU/Linux，做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗 - 虛擬機技術通常使用虛擬化層來模擬硬體資源，這個虛擬化層介於實際硬體和作業系統之間，可能會引入效能上的延遲或不準確的測量結果。 - 虛擬環境中可能有多個虛擬機共享同一個實際硬體資源，意味著資源分配可能會導致效能變異性。 - 虛擬機中的時間處理不同於實際硬體。虛擬機軟體必須模擬時間的流逝，並為虛擬機中的作業系統和應用程式提供時間服務。 - [ ] 研讀上述費氏數列相關材料 (包含論文)，摘錄關鍵手法，並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 - [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise)，思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本 - [ ] 學習指出針對大數運算的加速運算和縮減記憶體操作成本的舉措，紀錄你的認知和疑惑 - [X] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣，什麼場景中會需要呢？撰寫多執行緒的 userspace 程式來測試，觀察 Linux 核心模組若沒用到 mutex，到底會發生什麼問題。嘗試撰寫使用 [POSIX Thread](https://en.wikipedia.org/wiki/POSIX_Threads) 的程式碼來確認。 $\to$ 搭配閱讀〈[並行和多執行緒程式設計](https://hackmd.io/@sysprog/concurrency)〉 ### 觀察 Linux 核心模組中的 mutex fibdrv 中的 mutex 使用在 `open` 和 `release` 的部份，也就是一次只允許一個 thread 能夠存取 character device ```c static int fib_open(struct inode *inode, struct file *file) { if (!mutex_trylock(&fib_mutex)) { printk(KERN_ALERT "fibdrv is in use"); return -EBUSY; } return 0; } static int fib_release(struct inode *inode, struct file *file) { mutex_unlock(&fib_mutex); return 0; } ``` 現在嘗試將 mutex lock 移除 ```c static int fib_open(struct inode *inode, struct file *file) { /* if (!mutex_trylock(&fib_mutex)) { printk(KERN_ALERT "fibdrv is in use"); return -EBUSY; } */ return 0; } static int fib_release(struct inode *inode, struct file *file) { //mutex_unlock(&fib_mutex); return 0; } ``` 撰寫簡單的 POSIX thread 程式來測試例如：`thread-1` 和 `thread-2` 都同時打開 fibdrc device，讀取 fib(0) ~ fib(49)，並且為了使用延遲讓兩個 thread 交替執行，營造 race condition 的情況 ```c static int fd; void *job(void* arg){ char buf[100]; int id = *((int *) arg); for (int i = 0; i <= 50; i++) { lseek(fd, i, SEEK_SET); sleep(id); read(fd, buf, 0); printf("thread %d read fib(%d): %s\n", id, i, buf); } } int main(){ fd = open("/dev/fibonacci", O_RDWR); if (fd < 0) { perror("Failed to open character device"); exit(1); } pthread_t test[2]; int pid1 = 1; pthread_create(&test[0], NULL, job, (void *), &pid1); int pid2 = 2; pthread_create(&test[1], NULL, job, (void *), &pid2); for(int i = 0; i < 2; i++) pthread_join(test[i], NULL); close(fd); return 0; } ``` 這邊可以看到前幾項發生錯誤了，thread-2 讀取 fib(2) 的值為 2，說明在 thread-2 設定好的 offest 並將要透過 read() 進入 kernel space 讀取之前，file offset 被 thread-1 更動過了，造成錯誤結果 ``` thread 1 read fib(0): 0 thread 2 read fib(0): 1 thread 1 read fib(1): 1 thread 1 read fib(2): 1 thread 2 read fib(1): 2 <------- thread 1 read fib(3): 1 thread 1 read fib(4): 3 thread 2 read fib(2): 5 thread 1 read fib(5): 2 thread 1 read fib(6): 8 ``` 因此我們需要在 device 被 open() 時將 mutex 上鎖，而 release() 時將 mutex 解鎖，保持同時只有一個 thread 能夠控制 device driver ## TODO: 以 [sysprog21/bignum](https://github.com/sysprog21/bignum) 為範本，實作有效的大數運算理解其中的技巧並導入到 fibdrv 中，並留意以下: * 在 Linux 核心模組中，可用 ktime 系列的 API; * 在 userspace 可用 clock_gettime 相關 API; * 善用統計模型，除去極端數值，過程中應詳述你的手法 * 分別用 gnuplot 製圖，分析 Fibonacci 數列在核心計算和傳遞到 userspace 的時間開銷，單位需要用 us 或 ns (自行斟酌) ### 嘗試引入 [sysprog21/bignum](https://github.com/sysprog21/bignum) 程式碼，使 fib 核心模組可以計算 93 以後的數值 > [commit 4105efe](https://github.com/paulpeng-popo/fibdrv/commit/4105efeba818c6a222236a8ba6d62424149befe1) > Todo: 分析細部的程式實作方法，並思考能否加速 > 參考 [bakudr18](https://hackmd.io/@bakudr18/HJ363p_Qd#%E5%A4%A7%E6%95%B8%E9%81%8B%E7%AE%97) ```c typedef struct { uint64_t *digits; /* Digits of number. */ uint32_t size; /* Length of number. */ uint32_t alloc; /* Size of allocation. */ unsigned sign : 1; /* Sign bit. */ } ``` - CASE 1: - $digits[0] = 2$ - $digits[1] = 1$ - bignum = $1 * 2^{64^1} + 2 * 2^{64^0}$ - CASE 2: - $digits[0] = 3$ - $digits[1] = 2$ - $digits[2] = 1$ - bignum = $1 * 2^{64^2} + 2 * 2^{64^1} + 3 * 2^{64^0}$ ### 大數運算之 fast doubling ```c bn *a1 = fib; /* Use output param fib as a1 */ bn_t a0, tmp, a; bn_init_u32(a0, 0); /* a0 = 0 */ bn_set_u32(a1, 1); /* a1 = 1 */ bn_init(tmp); /* tmp = 0 */ bn_init(a); /* Start at second-highest bit set. */ for (uint64_t k = ((uint64_t) 1) << (62 - __builtin_clzll(n)); k; k >>= 1) { /* Both ways use two squares, two adds, one multipy and one shift. */ bn_lshift(a0, 1, a); /* a03 = a0 * 2 */ bn_add(a, a1, a); /* ... + a1 */ bn_sqr(a1, tmp); /* tmp = a1^2 */ bn_sqr(a0, a0); /* a0 = a0 * a0 */ bn_add(a0, tmp, a0); /* ... + a1 * a1 */ bn_mul(a1, a, a1); /* a1 = a1 * a */ if (k & n) { bn_swap(a1, a0); /* a1 <-> a0 */ bn_add(a0, a1, a1); /* a1 += a0 */ } } /* Now a1 (alias of output parameter fib) = F[n] */ ``` 修改 `verify.py` 使 python 腳本可透過參數更動測試的項數大小 ```python # FILE: verify.py parser = argparse.ArgumentParser() parser.add_argument('-k', '--max_k', type=int, default=500) args = parser.parse_args() verify_fib(args.max_k) ``` 修改 `client.c` ```c int offset = MAX_FIB_K; ``` 在編譯 client 時添加 `-D` flag 用作條件編譯 ```makefile client: client.c $(CC) -o $@ $^ -DMAX_FIB_K=$(FIB_K) ``` 執行 ```sh $ make check FIB_K=5000 ``` 得到輸出 ``` make load make[1]: Entering directory '/home/pengpaul/linux2023/fibdrv' sudo insmod fibdrv_new.ko make[1]: Leaving directory '/home/pengpaul/linux2023/fibdrv' sudo ./client > out make unload make[1]: Entering directory '/home/pengpaul/linux2023/fibdrv' sudo rmmod fibdrv_new || true >/dev/null make[1]: Leaving directory '/home/pengpaul/linux2023/fibdrv' Passed [-] ``` 確認在 $fib(5000)$ 以前其演算法的正確性 ### 實作大數運算後的效能比較 #### 使用者模式時間同步因為要測量運算完成後，把資料從核心搬到使用者模式，及使用者模式收到資料的時間。故在使用者模式取得到時間要能跟核心模式進行對應。在使用者模式可以用 `clock_gettime` 函式來取得時間，其中又有不同的 clock id 。跟據 Linux 核心 [ktime API 文件](https://www.kernel.org/doc/html/latest/core-api/timekeeping.html)，在 Linux 核心內，以 `ktime_get` 函式所取得的時間可以對應到 `CLOCK_MONOTONIC` 的時間。 > [探討 clock_gettime 造成 page fault 的原因](https://hackmd.io/@bakudr18/HJ363p_Qd#%E6%8E%A2%E8%A8%8E-clock_gettime-%E9%80%A0%E6%88%90-page-fault-%E7%9A%84%E5%8E%9F%E5%9B%A0) ```c struct timespec { time_t tv_sec; /* seconds */ long tv_nsec; /* nanoseconds */ }; ``` 從 `clock_gettime` 取得的時間結構定義如上，我們可以簡單的將 `tv_sec` * $10^9$ 來換算成與 ktime 對應的 nanoseconds ```c struct timespec start, end; clock_gettime(CLOCK_MONOTONIC, &start); long long sz = read(fd, buf, sizeof(buf)); clock_gettime(CLOCK_MONOTONIC, &end); long long ut = (long long) (end.tv_sec - start.tv_sec) * \ 1e9 + (end.tv_nsec - start.tv_nsec); ``` 通過 user time - kernel time 可以得出 `copy_to_user` system call 的花費時間依據 [Hierarchical performance measurement and modeling of the Linux file system](https://www.researchgate.net/publication/221556402_Hierarchical_performance_measurement_and_modeling_of_the_Linux_file_system) 研究指出，從核心模式複製資料到使用者層級的時間成本是，每個位元組達 22ns。也就是當數字越大，需傳遞的字串越長，時間花費就越多如圖所示，隨著 n 增加，system call 開銷明顯可見地大幅增加 ![](https://hackmd.io/_uploads/HJoV_GVun.png) 當把 n 設到 500 時程式執行的效率越來越差 ![](https://hackmd.io/_uploads/rJdyjJNuh.png) > 有產生階梯圖的狀況，每次階梯的產生剛好發生在 $fib(n)$ 的數值超過一個 `uint64_t` 可表示的大小時，與 [paul90317](https://hackmd.io/@paul90317/2023q1-fibdrv#Bottom-up) 同學遇到相似的結果，但原因可能有待釐清在[作業三說明](https://hackmd.io/@sysprog/linux2023-fibdrv/https%3A%2F%2Fhackmd.io%2F%40sysprog%2Flinux2023-fibdrv-g)中，老師有給出三種提升效能的方法 1. 儘量降低由核心傳遞到應用程式的資料量 2. 確保應用程式不需要過多的計算量才可重現 Fibonacci 數列 3. Integer Encoding Algorithm 技巧針對第一種降低資料傳遞量的做法，可以將 `bn` 結構體先回傳到 user space，再去做 `bn_to_str` 的操作 ```c char *bn_to_str(const unsigned long long *digits, size_t size) { size_t str_size = (size_t) (size * DIGIT_BITS) / 3 + 1; if (size == 0) { char *str = malloc(2); str[0] = '0'; str[1] = '\0'; return str; } else { char *s = malloc(str_size); char *p = s; memset(s, '0', str_size - 1); s[str_size - 1] = '\0'; /* n.digits[0] contains least significant bits */ for (int i = size - 1; i >= 0; i--) { /* walk through every bit of bn */ for (unsigned long long d = 1ULL << 63; d; d >>= 1) { /* binary -> decimal string */ int carry = !!(d & digits[i]); for (int j = str_size - 2; j >= 0; j--) { s[j] += s[j] - '0' + carry; carry = (s[j] > '9'); if (carry) s[j] -= 10; } } } // skip leading zero while (p[0] == '0' && p[1] != '\0') { p++; } memmove(s, p, strlen(p) + 1); return s; } } ``` > Todo: 減少 `bn_to_str()` 迴圈和分支的使用情況實驗證明，`copy_to_user` 複製 `bn` 結構體比複製字串還來的節省時間 system call overhead 維持在 1000 ns 左右，比複製字串的 2000 ns 好上許多 ![](https://hackmd.io/_uploads/rJGnOGNun.png) ![](https://hackmd.io/_uploads/BJzZZgVd2.png) ## TODO: 實作更快速的乘法運算參照 [Schönhage–Strassen algorithm](https://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strassen_algorithm)，在上述大數運算的程式碼基礎之上，改進乘法運算，確保在大多數的案例均可加速，需要有對應的驗證機制。