2023q1 Homework2 (fibdrv)

# 2023q1 Homework2 (fibdrv) contributed by < `chunplusplus` > ## 開發環境 ```bash $ gcc --version gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0 $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 39 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: GenuineIntel Model name: Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz CPU family: 6 Model: 165 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 1 Stepping: 2 CPU max MHz: 5100.0000 CPU min MHz: 800.0000 BogoMIPS: 4599.93 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_go od nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm 2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsa ve avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi 2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp pku ospke md_clear flush_l1d arch_capabilities Virtualization features: Virtualization: VT-x Caches (sum of all): L1d: 256 KiB (8 instances) L1i: 256 KiB (8 instances) L2: 2 MiB (8 instances) L3: 16 MiB (1 instance) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-15 ``` ## 大數實作 (base 2) 為了避免 overflow, 按照作業說明, 實作以下 struct 儲存大數及大數相關的 [operators](https://github.com/chunplusplus/fibdrv/blob/master/bn2.h) ```c typedef struct _bn { unsigned int *number; unsigned int size; int sign; } bn; ``` client 需要先計算儲存 Fibonacci 數列所需的空間以接收返回字串 ```c #define LOGPHI (20898764025) #define LOGSQRT5 (34948500216) #define SCALE (100000000000) size_t cal_buf_size(int n) { size_t digits = (n * LOGPHI - LOGSQRT5) / SCALE; return digits + 2; } ``` 輸出 $fib(500000)$ 需時 38 秒 ```shell XPS:~/linux2023/fibdrv$ sudo perf stat ./a.out 500000 > /dev/null Performance counter stats for './a.out 500000': 38,396.36 msec task-clock # 1.000 CPUs utilized 3 context-switches # 0.078 /sec 1 cpu-migrations # 0.026 /sec 148 page-faults # 3.855 /sec 166,247,518,242 cycles # 4.330 GHz 567,270,229,314 instructions # 3.41 insn per cycle 40,825,544,728 branches # 1.063 G/sec 31,960,588 branch-misses # 0.08% of all branches 38.397309680 seconds time elapsed 0.003999000 seconds user 38.391836000 seconds sys ``` 用 `perf record` 發現 98.69% 的時間花在 10 進制字串的解碼運算中。 ```shell XPS:~/linux2023/fibdrv$ sudo perf record -g --call-graph dwarf ./a.out 500000 > /dev/null [ perf record: Woken up 4974 times to write data ] [ perf record: Captured and wrote 1243.631 MB perf.data (153333 samples) ] XPS:~/linux2023/fibdrv$ sudo perf report --stdio -g graph,0.5,caller # To display the perf.data header info, please use --header/--header-only options. # # # Total Lost Samples: 0 # # Samples: 153K of event 'cycles' # Event count (approx.): 166684569759 # # Children Self Command Shared Object Symbol # ........ ........ ....... .................... .................................... # 100.00% 0.00% a.out [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe | ---entry_SYSCALL_64_after_hwframe do_syscall_64 | --100.00%--__x64_sys_read ksys_read vfs_read fib_read | |--98.69%--bn_to_string | --1.30%--bn_fib_fdoubling.part.0 bn_mult ``` ## 大數實作 (base 10) 為了有效率的輸出十進位字串, `bn` 的 `number` 改以 10 的冪 (`BOUND32`) 作為上限值。`BOUND32` 取 1,000,000,000, 是因為 unsigned int 的最大值為 4,294,967,295, 用第 10 位來儲存加法時的進位。 ```c #define MAX_DIGITS 9 #define BOUND32 1000000000U typedef struct _bn { unsigned int *number; unsigned int size; int sign; } bn; ``` 因為 base 10 大數不能像 base 2 大數般有原生的位移操作, 所以相關的 [operators](https://github.com/chunplusplus/fibdrv/blob/master/bn10.h) 實作細節和 base 2 大數的不一樣。新的十進位輸出邏輯變為簡單很多。 ```c for (int i = src->size - 1; i >= 0; i--) { snprintf(p, MAX_DIGITS + 1, "%09u", src->number[i]); p += MAX_DIGITS; } ``` 迭代方式計算 $fib(500000)$ 需時大幅減少為 4.36 秒 ```shell XPS:~/linux2023/fibdrv$ sudo perf stat ./a.out 500000 > /dev/null Performance counter stats for './a.out 500000': 4,361.83 msec task-clock # 1.000 CPUs utilized 3 context-switches # 0.688 /sec 3 cpu-migrations # 0.688 /sec 84 page-faults # 19.258 /sec 18,862,840,766 cycles # 4.325 GHz 78,445,987,521 instructions # 4.16 insn per cycle 8,717,609,976 branches # 1.999 G/sec 1,082,927 branch-misses # 0.01% of all branches 4.363249358 seconds time elapsed 0.000000000 seconds user 4.356633000 seconds sys ``` 絕大部分時間花在大數加法 `bn_do_add` ```shell # Children Self Command Shared Object Symbol # ........ ........ ....... .................... .................................. # 99.99% 0.00% a.out [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe | ---entry_SYSCALL_64_after_hwframe do_syscall_64 | --99.99%--__x64_sys_read ksys_read vfs_read fib_read | --99.93%--bn_fib | --99.77%--bn_do_add ``` Fast Doubling 計算 $fib(500000)$ 需時減少為 1.13 秒 ```shell Performance counter stats for './a.out 500000': 1,131.88 msec task-clock # 0.999 CPUs utilized 1 context-switches # 0.883 /sec 1 cpu-migrations # 0.883 /sec 85 page-faults # 75.096 /sec 4,800,684,889 cycles # 4.241 GHz 10,621,616,396 instructions # 2.21 insn per cycle 743,437,312 branches # 656.816 M/sec 34,882,763 branch-misses # 4.69% of all branches 1.133136415 seconds time elapsed 0.000000000 seconds user 1.132164000 seconds sys ``` 絕大部分時間花在大數加法 `bn_mult` ```shell # To display the perf.data header info, please use --header/--header-only options. # # # Total Lost Samples: 0 # # Samples: 5K of event 'cycles' # Event count (approx.): 5552968697 # # Children Self Command Shared Object Symbol # ........ ........ ....... .................... .................................... # 99.92% 0.00% a.out a.out [.] _start | ---_start __libc_start_main_impl (inlined) __libc_start_call_main (inlined) main __GI___libc_read (inlined) entry_SYSCALL_64_after_hwframe do_syscall_64 __x64_sys_read ksys_read vfs_read | --99.90%--fib_read | --99.88%--bn_fib_fdoubling.part.0 | --99.86%--bn_mult ``` ## 效能分析 ### 量化執行時間 `fib_read` 使用 ktime 系列的 API 計算和返回 kernel space 的時間開銷 ```c static ktime_t kt; static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { kt = ktime_get(); bn *fib = bn_alloc(1); bn_fib_fdoubling(fib, *offset); /* bn_fib(fib, *offset); */ char *str = bn_to_string(fib); copy_to_user(buf, str, strlen(str) + 1); bn_free(fib); kt = ktime_sub(ktime_get(), kt); return ktime_to_ns(kt); } ``` client 用 clock_gettime 取得時間來計算 user space 時間開銷; 傳遞到 user space 的時間開銷等於 user space 和 kernel space 時間開銷的差 ```c #include <time.h> static inline long long get_nanotime() { struct timespec ts; clock_gettime(CLOCK_REALTIME, &ts); return ts.tv_sec * 1e9 + ts.tv_nsec; } ... .. . lseek(fd, i, SEEK_SET); long long start = get_nanotime(); size_t size = cal_buf_size(i); char *buf = malloc(size); long long ktime = read(fd, buf, size); long long utime = get_nanotime() - start; fprintf(data, "%d %lld %lld %lld\n", i, ktime, utime, utime - ktime); ``` ### 統計分析按照作業提示, 去除 95% 區間之外的值, 分別比較第 $0$ 至 $100th$ 和第 $0$ 至 $1000th$ Fibonacci 數列的 kernel space、user space 和傳遞到 user space 的時間開銷 ![](https://i.imgur.com/eWrpMRt.png) ![](https://i.imgur.com/PpNEI1F.png) ## 加速使用 unsigned long long int 作為 bn 的 number 資料型別 ```c #define MAX_DIGITS 18 #define BOUND64 1000000000000000000UL typedef struct _bn { unsigned long long int *number; unsigned int size; int sign; } bn; ``` iterator ```shell Performance counter stats for './a.out 500000': 2,504.36 msec task-clock # 1.000 CPUs utilized 2 context-switches # 0.799 /sec 2 cpu-migrations # 0.799 /sec 84 page-faults # 33.541 /sec 10,648,255,008 cycles # 4.252 GHz 40,703,058,544 instructions # 3.82 insn per cycle 4,361,322,190 branches # 1.741 G/sec 559,728 branch-misses # 0.01% of all branches 2.505285807 seconds time elapsed 0.000000000 seconds user 2.500120000 seconds sys ``` 實作 Fast doubling 時, make 產生以下錯誤, 看起來連結不到 __int128 的除法... ```shell ERROR: modpost: "__udivti3" [/home/kachun/linux2023/fibdrv/fibdrv.ko] undefined! ERROR: modpost: "__udivmodti4" [/home/kachun/linux2023/fibdrv/fibdrv.ko] undefined! ``` https://skanthak.homepage.t-online.de/integer.html