# 2023q1 Homework2 (fibdrv)
contributed by < `chunplusplus` >
## 開發環境
```bash
$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz
CPU family: 6
Model: 165
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 2
CPU max MHz: 5100.0000
CPU min MHz: 800.0000
BogoMIPS: 4599.93
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr
sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_go
od nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm
2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsa
ve avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp
ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi
2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida
arat pln pts hwp hwp_notify hwp_act_window hwp_epp pku ospke md_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 256 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 2 MiB (8 instances)
L3: 16 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
```
## 大數實作 (base 2)
為了避免 overflow, 按照作業說明, 實作以下 struct 儲存大數及大數相關的 [operators](https://github.com/chunplusplus/fibdrv/blob/master/bn2.h)
```c
typedef struct _bn {
unsigned int *number;
unsigned int size;
int sign;
} bn;
```
client 需要先計算儲存 Fibonacci 數列所需的空間以接收返回字串
```c
#define LOGPHI (20898764025)
#define LOGSQRT5 (34948500216)
#define SCALE (100000000000)
size_t cal_buf_size(int n)
{
size_t digits = (n * LOGPHI - LOGSQRT5) / SCALE;
return digits + 2;
}
```
輸出 $fib(500000)$ 需時 38 秒
```shell
XPS:~/linux2023/fibdrv$ sudo perf stat ./a.out 500000 > /dev/null
Performance counter stats for './a.out 500000':
38,396.36 msec task-clock # 1.000 CPUs utilized
3 context-switches # 0.078 /sec
1 cpu-migrations # 0.026 /sec
148 page-faults # 3.855 /sec
166,247,518,242 cycles # 4.330 GHz
567,270,229,314 instructions # 3.41 insn per cycle
40,825,544,728 branches # 1.063 G/sec
31,960,588 branch-misses # 0.08% of all branches
38.397309680 seconds time elapsed
0.003999000 seconds user
38.391836000 seconds sys
```
用 `perf record` 發現 98.69% 的時間花在 10 進制字串的解碼運算中。
```shell
XPS:~/linux2023/fibdrv$ sudo perf record -g --call-graph dwarf ./a.out 500000 > /dev/null
[ perf record: Woken up 4974 times to write data ]
[ perf record: Captured and wrote 1243.631 MB perf.data (153333 samples) ]
XPS:~/linux2023/fibdrv$ sudo perf report --stdio -g graph,0.5,caller
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 153K of event 'cycles'
# Event count (approx.): 166684569759
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... .................... ....................................
#
100.00% 0.00% a.out [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
|
---entry_SYSCALL_64_after_hwframe
do_syscall_64
|
--100.00%--__x64_sys_read
ksys_read
vfs_read
fib_read
|
|--98.69%--bn_to_string
|
--1.30%--bn_fib_fdoubling.part.0
bn_mult
```
## 大數實作 (base 10)
為了有效率的輸出十進位字串, `bn` 的 `number` 改以 10 的冪 (`BOUND32`) 作為上限值。`BOUND32` 取 1,000,000,000, 是因為 unsigned int 的最大值為 4,294,967,295, 用第 10 位來儲存加法時的進位。
```c
#define MAX_DIGITS 9
#define BOUND32 1000000000U
typedef struct _bn {
unsigned int *number;
unsigned int size;
int sign;
} bn;
```
因為 base 10 大數不能像 base 2 大數般有原生的位移操作, 所以相關的 [operators](https://github.com/chunplusplus/fibdrv/blob/master/bn10.h) 實作細節和 base 2 大數的不一樣。
新的十進位輸出邏輯變為簡單很多。
```c
for (int i = src->size - 1; i >= 0; i--) {
snprintf(p, MAX_DIGITS + 1, "%09u", src->number[i]);
p += MAX_DIGITS;
}
```
迭代方式計算 $fib(500000)$ 需時大幅減少為 4.36 秒
```shell
XPS:~/linux2023/fibdrv$ sudo perf stat ./a.out 500000 > /dev/null
Performance counter stats for './a.out 500000':
4,361.83 msec task-clock # 1.000 CPUs utilized
3 context-switches # 0.688 /sec
3 cpu-migrations # 0.688 /sec
84 page-faults # 19.258 /sec
18,862,840,766 cycles # 4.325 GHz
78,445,987,521 instructions # 4.16 insn per cycle
8,717,609,976 branches # 1.999 G/sec
1,082,927 branch-misses # 0.01% of all branches
4.363249358 seconds time elapsed
0.000000000 seconds user
4.356633000 seconds sys
```
絕大部分時間花在大數加法 `bn_do_add`
```shell
# Children Self Command Shared Object Symbol
# ........ ........ ....... .................... ..................................
#
99.99% 0.00% a.out [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
|
---entry_SYSCALL_64_after_hwframe
do_syscall_64
|
--99.99%--__x64_sys_read
ksys_read
vfs_read
fib_read
|
--99.93%--bn_fib
|
--99.77%--bn_do_add
```
Fast Doubling 計算 $fib(500000)$ 需時減少為 1.13 秒
```shell
Performance counter stats for './a.out 500000':
1,131.88 msec task-clock # 0.999 CPUs utilized
1 context-switches # 0.883 /sec
1 cpu-migrations # 0.883 /sec
85 page-faults # 75.096 /sec
4,800,684,889 cycles # 4.241 GHz
10,621,616,396 instructions # 2.21 insn per cycle
743,437,312 branches # 656.816 M/sec
34,882,763 branch-misses # 4.69% of all branches
1.133136415 seconds time elapsed
0.000000000 seconds user
1.132164000 seconds sys
```
絕大部分時間花在大數加法 `bn_mult`
```shell
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 5K of event 'cycles'
# Event count (approx.): 5552968697
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... .................... ....................................
#
99.92% 0.00% a.out a.out [.] _start
|
---_start
__libc_start_main_impl (inlined)
__libc_start_call_main (inlined)
main
__GI___libc_read (inlined)
entry_SYSCALL_64_after_hwframe
do_syscall_64
__x64_sys_read
ksys_read
vfs_read
|
--99.90%--fib_read
|
--99.88%--bn_fib_fdoubling.part.0
|
--99.86%--bn_mult
```
## 效能分析
### 量化執行時間
`fib_read` 使用 ktime 系列的 API 計算和返回 kernel space 的時間開銷
```c
static ktime_t kt;
static ssize_t fib_read(struct file *file,
char *buf,
size_t size,
loff_t *offset)
{
kt = ktime_get();
bn *fib = bn_alloc(1);
bn_fib_fdoubling(fib, *offset);
/* bn_fib(fib, *offset); */
char *str = bn_to_string(fib);
copy_to_user(buf, str, strlen(str) + 1);
bn_free(fib);
kt = ktime_sub(ktime_get(), kt);
return ktime_to_ns(kt);
}
```
client 用 clock_gettime 取得時間來計算 user space 時間開銷; 傳遞到 user space 的時間開銷等於 user space 和 kernel space 時間開銷的差
```c
#include <time.h>
static inline long long get_nanotime()
{
struct timespec ts;
clock_gettime(CLOCK_REALTIME, &ts);
return ts.tv_sec * 1e9 + ts.tv_nsec;
}
...
..
.
lseek(fd, i, SEEK_SET);
long long start = get_nanotime();
size_t size = cal_buf_size(i);
char *buf = malloc(size);
long long ktime = read(fd, buf, size);
long long utime = get_nanotime() - start;
fprintf(data, "%d %lld %lld %lld\n", i, ktime, utime, utime - ktime);
```
### 統計分析
按照作業提示, 去除 95% 區間之外的值, 分別比較第 $0$ 至 $100th$ 和第 $0$ 至 $1000th$ Fibonacci 數列的 kernel space、user space 和傳遞到 user space 的時間開銷


## 加速
使用 unsigned long long int 作為 bn 的 number 資料型別
```c
#define MAX_DIGITS 18
#define BOUND64 1000000000000000000UL
typedef struct _bn {
unsigned long long int *number;
unsigned int size;
int sign;
} bn;
```
iterator
```shell
Performance counter stats for './a.out 500000':
2,504.36 msec task-clock # 1.000 CPUs utilized
2 context-switches # 0.799 /sec
2 cpu-migrations # 0.799 /sec
84 page-faults # 33.541 /sec
10,648,255,008 cycles # 4.252 GHz
40,703,058,544 instructions # 3.82 insn per cycle
4,361,322,190 branches # 1.741 G/sec
559,728 branch-misses # 0.01% of all branches
2.505285807 seconds time elapsed
0.000000000 seconds user
2.500120000 seconds sys
```
實作 Fast doubling 時, make 產生以下錯誤, 看起來連結不到 __int128 的除法...
```shell
ERROR: modpost: "__udivti3" [/home/kachun/linux2023/fibdrv/fibdrv.ko] undefined!
ERROR: modpost: "__udivmodti4" [/home/kachun/linux2023/fibdrv/fibdrv.ko] undefined!
```
https://skanthak.homepage.t-online.de/integer.html