fibdriv 專題

# fibdriv 專題 contributed by < `yellow-hank` > > [GitHub](https://github.com/yellow-hank/fibdrv) ###### tags: `LinuxKernel` ## 執行環境 :::spoiler ### kernel vesion ```shell $ uname -r 5.4.0-73-generic ``` ### cpu 資訊 ```shell $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 12 On-line CPU(s) list: 0-11 Thread(s) per core: 2 Core(s) per socket: 6 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 158 Model name: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz Stepping: 10 CPU MHz: 800.059 CPU max MHz: 4600.0000 CPU min MHz: 800.0000 BogoMIPS: 6399.96 Virtualization: VT-x L1d cache: 192 KiB L1i cache: 192 KiB L2 cache: 1.5 MiB L3 cache: 12 MiB NUMA node0 CPU(s): 0-11 ``` ::: ## 使用大數結構實做 fibonacci ### 資料結構運用兩個 64 位元表達128位元的數字 ```c= struct Big_number { unsigned long long lower, upper; } ``` ### 大數加法為了之後串接十進位表示式方便，這裡捨去 $\geq 10^{19}$ 的數字，直接進位到 upper 中 ```c= static Big_number bigN_add(Big_number augend, Big_number addend) { Big_number result; result.upper = augend.upper + addend.upper; if (augend.lower > ~addend.lower) { result.upper++; result.lower = augend.lower + addend.lower + 8446744073709551616U; } else { if (augend.lower + addend.lower > 9999999999999999999U) { result.upper++; result.lower = augend.lower + addend.lower - 10000000000000000000U; } else { result.lower = augend.lower + addend.lower; } } return result; } ``` ### 費氏數列使用迴圈的方式來計算費氏數列 ```c= static Big_number fib_sequence(long long k) { /* FIXME: use clz/ctz and fast algorithms to speed up */ Big_number f[3]; f[0].lower = 0; f[0].upper = 0; f[1].lower = 1; f[1].upper = 0; if (k == 0) return f[0]; if (k == 1) return f[1]; for (int i = 2; i <= k; i++) { f[2] = bigN_add(f[0], f[1]); f[0] = f[1]; f[1] = f[2]; } return f[2]; } ``` ### copy_to_user 因為 ssize_t 最大能表示的數字位為 $2^{64} - 1$，所以利用 copy_to_user 將結果傳回 userspace，因為使用 128 位元儲存，能表達的十進位數最多能到 40 位元，所以這裡 buf 的最大空間設計成 40。依據 [Hierarchical performance measurement and modeling of the Linux file system](https://www.researchgate.net/publication/221556402_Hierarchical_performance_measurement_and_modeling_of_the_Linux_file_system) 研究指出，從核心模式複製資料到使用者層級的時間成本是，每個位元組達 22ns，所以當數字增加時，在效能分析時需要考慮進去。 ```c= static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { Big_number result; result = fib_sequence(*offset); int ret = 0; char result_buf[40]; if (result.upper > 0) { snprintf(result_buf, 40, "%llu%019llu", result.upper, result.lower); } else { snprintf(result_buf, 40, "%llu", result.lower); } ret = copy_to_user(buf, result_buf, 40); return ret; } ``` ### 結果使用此資料結構，目前最多能計算到 $fib(184)$ ``` $ make check make -C /lib/modules/5.4.0-73-generic/build M=/home/yellowhank/Desktop/linux_kernel/hw3/fibdrv modules make[1]: Entering directory '/usr/src/linux-headers-5.4.0-73-generic' Building modules, stage 2. MODPOST 1 modules make[1]: Leaving directory '/usr/src/linux-headers-5.4.0-73-generic' make unload make[1]: Entering directory '/home/yellowhank/Desktop/linux_kernel/hw3/fibdrv' sudo rmmod fibdrv || true >/dev/null rmmod: ERROR: Module fibdrv is not currently loaded make[1]: Leaving directory '/home/yellowhank/Desktop/linux_kernel/hw3/fibdrv' make load make[1]: Entering directory '/home/yellowhank/Desktop/linux_kernel/hw3/fibdrv' sudo insmod fibdrv.ko make[1]: Leaving directory '/home/yellowhank/Desktop/linux_kernel/hw3/fibdrv' sudo ./client > out make unload make[1]: Entering directory '/home/yellowhank/Desktop/linux_kernel/hw3/fibdrv' sudo rmmod fibdrv || true >/dev/null make[1]: Leaving directory '/home/yellowhank/Desktop/linux_kernel/hw3/fibdrv' Passed [-] ``` ### 排除干擾效能分析因素 #### 使排程器不賦予特定編號核心任務編輯 /etc/default/grub ``` sudo vim /etc/default/grub ``` 修改此行，在末端加入 `isolcpus=11`，如下列，在選擇 cpu 編號時，考量到 [IRQ affinity](https://cs.uwaterloo.ca/~brecht/servers/apic/SMP-affinity.txt)，應該挑選最後一個有效 CPU 編號 ``` GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=11" ``` 修改完後，輸入下行更新設定 ``` sudo update-grub ``` 重新開機後輸入以下指令可以確認已經排除上述編號 11 的核心 ``` shell $ taskset -cp 1 pid 1's current affinity list: 0-10 ``` #### 指定 client 程式在定編號的 cpu 執行到 Makefile 更改 `sudo ./client > out` 此行為下列 ```shell $ sudo taskset -c 11 ./client ``` #### 抑制 [address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization)(ASLR) ```shell $ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space" ``` #### 設定 scaling_governor 為 performance 讓 CPU 使用最高頻率來執行所有的 process performance_cpu.sh ```shell for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor do echo performance > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor done ``` 執行此 script ``` $ sudo sh performance_cpu.sh ``` #### 針對 Intel CPU 關閉 turbo mode ```shell $ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" ``` #### SMP IRQ affinity 讓編號 11 核心不要去處理 IRQ。使用下列 script 設置 ```shell #!/bin/bash for file in `find /proc/irq -name "smp_affinity"` do var=0x`cat ${file}` var="$(( $var & 0x7ff ))" var=`printf '%.2x' ${var}` sudo bash -c "echo ${var} > ${file}" done sudo bash -c "echo 7ff > /proc/irq/default_smp_affinity" ``` 上述參考自[ KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU) ### 時間量測方式 #### kernel space 時間量測利用 [ktime API](https://www.kernel.org/doc/html/latest/core-api/timekeeping.html)來量測，並選用 ns 作為量測的時間單位 ```c static ssize_t fib_write(struct file *file, const char *buf, size_t size, loff_t *offset) { ktime_t ktime; ktime = ktime_get(); fib_sequence(*offset); ktime = ktime_sub(ktime_get(), ktime); return (ssize_t) ktime_to_ns(ktime); } ``` #### user space 時間量測利用 [clock_gettime API](https://man7.org/linux/man-pages/man2/clock_gettime.2.html) 來量測 user space 時間 ```c clock_gettime(CLOCK_MONOTONIC, &t1); kernel_time = write(fd, write_buf, strlen(write_buf)); clock_gettime(CLOCK_MONOTONIC, &t2); ``` ### 效能分析在測試效能時，請將一些不必要的程式關閉，例如 : chrome browser ，現今的瀏覽器會佔用許多資源，可能會導致效能分析不準確。多次量測有發現會偶爾出現如第一張圖片有很多振盪，大部分分析都會像第二張圖片。 ![](https://i.imgur.com/dxfkQuq.png) ![](https://i.imgur.com/taeDGIn.png) 因為會有上述取樣誤差的原因，所以取樣50次，並且假設此模型為常態分配，剔除2個標準差外的極值，剩下為95％的值取平均，得到下圖 ![](https://i.imgur.com/9RhlpgG.png) 從此圖可以發現到 kernel_to_user 要花大約 550ns 的時間，Fibonacci driver 會隨著數字增加，使運算時間成線性遞增。 ## 使用 __int128 實做 fast doubing fibonacci [GitHub](https://github.com/yellow-hank/fibdrv/tree/fast_doubling) 下列推導引用自 [hw3 fibdriv](https://hackmd.io/D6ExAsVKTEuqHu_Rax86dQ?both) $$ \begin{split} \begin{bmatrix} F(2n+1) \\ F(2n) \end{bmatrix} &= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^{2n} \begin{bmatrix} F(1) \\ F(0) \end{bmatrix}\\ \\ &= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^n \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^n \begin{bmatrix} F(1) \\ F(0) \end{bmatrix}\\ \\ &= \begin{bmatrix} F(n+1) & F(n) \\ F(n) & F(n-1) \end{bmatrix} \begin{bmatrix} F(n+1) & F(n) \\ F(n) & F(n-1) \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix}\\ \\ &= \begin{bmatrix} F(n+1)^2 + F(n)^2\\ F(n)F(n+1) + F(n-1)F(n) \end{bmatrix} \end{split} $$ 依照上述的的公式推導，可以減少遞迴的次數，使費氏數列可以花更少的函式呼叫來達成一樣的結果 ```c unsigned __int128 fib_sequence_fast_db(int k) { unsigned __int128 a = 0, b = 1; for (int i = 31U - (unsigned) __builtin_clz(k); i >= 0; i--) { unsigned __int128 t1 = a * (2 * b - a); unsigned __int128 t2 = b * b + a * a; a = t1; b = t2; if (k & 1 << i) { t1 = a + b; a = b; b = t1; } } return a; } ``` ### 效能分析 ![](https://i.imgur.com/9RhlpgG.png) ![](https://i.imgur.com/6eJOANz.png) 根據上圖的比較，可以發現到 fast_doubling 在 $$ fib(n) \\n\leq 184 $$ 相對迴圈方式快速，且不會隨著數字越大，運算時間有更長的趨勢，呈現水平。