# 2023q1 Homework3 (fibdrv)
contributed by < `fourcolor` >
## 自我檢查清單
- [ ] 研讀上述 ==Linux 效能分析的提示== 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗;
- [ ] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
- [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise),思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
- [ ] 學習指出針對大數運算的加速運算和縮減記憶體操作成本的舉措,紀錄你的認知和疑惑
- [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 [POSIX Thread](https://en.wikipedia.org/wiki/POSIX_Threads) 的程式碼來確認。 $\to$ 搭配閱讀〈[並行和多執行緒程式設計](https://hackmd.io/@sysprog/concurrency)〉
::: spoiler 實驗環境
```bash
$ gcc
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ uname -r
5.19.0-35-generic
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i7-12700KF
CPU family: 6
Model: 151
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 2
CPU max MHz: 5000.0000
CPU min MHz: 800.0000
BogoMIPS: 7219.20
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr p
ge mca cmov pat pse36 clflush dts acpi mmx fxsr ss
e sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm c
onstant_tsc art arch_perfmon pebs bts rep_good nop
l xtopology nonstop_tsc cpuid aperfmperf tsc_known
_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est
tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4
_2 x2apic movbe popcnt tsc_deadline_timer aes xsav
e avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_
fault cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb
stibp ibrs_enhanced tpr_shadow vnmi flexpriority e
pt vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep
bmi2 erms invpcid rdt_a rdseed adx smap clflushopt
clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsav
es split_lock_detect avx_vnni dtherm ida arat pln
pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_
req hfi umip pku ospke waitpkg gfni vaes vpclmulqd
q rdpid movdiri movdir64b fsrm md_clear serialize
arch_lbr ibt flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 512 KiB (12 instances)
L1i: 512 KiB (12 instances)
L2: 12 MiB (9 instances)
L3: 25 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-19
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via
prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user po
inter sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB f
illing, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Not affected
```
:::
## 正確列出 Fibonacci(100)
為了避免 `long long` 在 F93 後超過 $2^{63}-1$ 造成的 overflow,使用數字字串來做運算。這邊參考 [Small/Short String Optimization](https://github.com/AdrianHuang/fibdrv/tree/big_number) 的實做
```c
static long long fib_sequence_str(long long k, void *buf)
{
strNum_t *f = kmalloc((k + 2) * sizeof(strNum_t), GFP_KERNEL);
strncpy(f[0].data, "0", 1);
strncpy(f[1].data, "1", 1);
for (int i = 2; i <= k; i++) {
string_number_add(f[i - 2].data, f[i - 1].data, f[i].data);
}
size_t retSize = strlen(f[k].data);
if (copy_to_user(buf, f[k].data, retSize)) {
kfree(f);
return -EFAULT;
}
return retSize;
}
```
## 時間測量
時間測量得部分參考[時間測量和效能分析](https://hackmd.io/@sysprog/linux2023-fibdrv/%2F%40sysprog%2Flinux2023-fibdrv-c),使用 [ktime API](https://www.kernel.org/doc/html/latest/core-api/timekeeping.html) 實作。由於 fibdrv `write` 的部分是沒有作用的,因此借用他的回傳值作為 `read` 的測量時間。
## 效能分析
效能分析的部分,原本依照[時間測量和效能分析](https://hackmd.io/@sysprog/linux2023-fibdrv/%2F%40sysprog%2Flinux2023-fibdrv-c)
```bash
$ taskset -cp 1
pid 1's current affinity list: 0-19
$ vi /etc/default/grub
GRUB_CMD_LINUX="isolcpus=19"
$ sudo update-grub
$ reboot
```
但發現第 20 個 cpu 核並沒有被保留下來
```bash
$ taskset -cp 1
pid 1's current affinity list: 0-19
```
發現[原因](https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html?highlight=isolcpus)
> Isolate from the general SMP balancing and scheduling algorithms. Note that performing domain isolation this way is irreversible: it's not possible to bring back a CPU to the domains once isolated through isolcpus. It's strongly advised to use cpusets instead to disable scheduler load balancing through the "cpuset.sched_load_balance" file. It offers a much more flexible interface where CPUs can move in and out of an isolated set anytime.
並且推測 isolcpus 在我所使用的 linux 版本中已經棄用了。文件說明使用 [cpusets](https://docs.kernel.org/admin-guide/cgroup-v1/cpusets.html) 來代替此功能。而 cpusets 是 cgroups 的其中一項功能,主要用來控制隔離 Process 所使用的資源。(以下以 [cgroups v2](https://docs.kernel.org/admin-guide/cgroup-v2.html) 為主)
首先在 /sys/fs/cgroup/ 建立一個檔案目錄
```bash
$ mkdir /sys/fs/cgroup/isolate
```
進到 /sys/fs/cgroup/isolate 後可以看到
```bash
$ ls
cgroup.controllers cpu.weight.nice memory.low
cgroup.events hugetlb.1GB.current memory.max
cgroup.freeze hugetlb.1GB.events memory.min
cgroup.kill hugetlb.1GB.events.local memory.numa_stat
cgroup.max.depth hugetlb.1GB.max memory.oom.group
cgroup.max.descendants hugetlb.1GB.numa_stat memory.peak
cgroup.procs hugetlb.1GB.rsvd.current memory.pressure
cgroup.stat hugetlb.1GB.rsvd.max memory.reclaim
cgroup.subtree_control hugetlb.2MB.current memory.stat
cgroup.threads hugetlb.2MB.events memory.swap.current
cgroup.type hugetlb.2MB.events.local memory.swap.events
cpu.idle hugetlb.2MB.max memory.swap.high
cpu.max hugetlb.2MB.numa_stat memory.swap.max
cpu.max.burst hugetlb.2MB.rsvd.current memory.zswap.current
cpu.pressure hugetlb.2MB.rsvd.max memory.zswap.max
cpuset.cpus io.max misc.current
cpuset.cpus.effective io.pressure misc.events
cpuset.cpus.partition io.prio.class misc.max
cpuset.mems io.stat pids.current
cpuset.mems.effective io.weight pids.events
cpu.stat memory.current pids.max
cpu.uclamp.max memory.events rdma.current
cpu.uclamp.min memory.events.local rdma.max
cpu.weight memory.high
```
這邊我們關注幾個檔案
cgroup.controllers: 代表可以掌控的資源類別
cgroup.subtree_control: 代表在此根目錄下的子目錄可以掌控的資源類別
cgroup.procs: 當前執行在該 cgroup 的 PID
cpuset.cpus: 當前 cgroup 可以使用 cpus
cpuset.cpus.effective: 當前 group 有效的 cpus
cpuset.cpus.partition:

若是當前的 cgroup.controllers 不包含 cpu 可以看上一層的 cgroup.subtree_control 是否有包含 cpu ,確認當前 groups 可以掌控 cpu 資源後,要先更改 cpuset.cpus.partition 為 root 或 isolated 才能讓當前 cgroup 的 cpu 與其他 cgroups 獨立,接著再更改 cpuset.cpus 為當前想要隔離的 cpus ,最後確認當前的 cpuset.cpus.effective 與上一層的 cpuset.cpus.effective 沒有交集及成功。
```bash
$ echo 18,19 > cpuset.cpus
$ echo root > cpuset.cpus.partition
$ cat cpuset.cpus.effective
18-19
$ cd .. && cat cpuset.cpus.effective
0-17
$ taskset -cp 1
pid 1's current affinity list: 0-17
```
執行時將當前 PID 寫入 cgroup.procs
```bash
# ~/workplace/fibdrv/
$ echo $$ > /sys/fs/cgroup/isolate/cgroup.procs
$ ./client
```
<!--  -->
---
在沒有隔離cpu資源時

---
隔離cpu資源後,曲線的震盪次數明顯減少

---
## Fast Doubling
首先參考 [加速 Fibonacci 運算](https://hackmd.io/@sysprog/linux2023-fibdrv/%2F%40sysprog%2Flinux2023-fibdrv-d) 並且搭配 copy_to_user
```c
static long long fib_sequence_fast_double(long long k, void *buf)
{
long long *f = kmalloc(sizeof(long long), GFP_KERNEL);
f[0] = fast_double(k);
if (copy_to_user(buf, f, sizeof(long long))) {
kfree(f);
return -EFAULT;
}
return sizeof(long long);
}
```

## 引入 `bn`
參考 [初步支援大數運算](https://hackmd.io/@sysprog/linux2023-fibdrv/%2F%40sysprog%2Flinux2023-fibdrv-d#%E5%88%9D%E6%AD%A5%E6%94%AF%E6%8F%B4%E5%A4%A7%E6%95%B8%E9%81%8B%E7%AE%97) 將 `bn` 運算引入 `fibdrv` 並利用 perf 來分析。

### `perf record`
```bash
99.64% 0.00% client client [.] _start
|
---_start
__libc_start_main_impl (inlined)
__libc_start_call_main (inlined)
main
|
|--97.43%--__GI___libc_read (inlined)
| entry_SYSCALL_64_after_hwframe
| do_syscall_64
| __x64_sys_read
| ksys_read
| |
| --97.12%--vfs_read
| |
| --96.82%--fib_read
| fib_sequence_bn_fast_double
| |
| |--78.59%--bn_to_string
| |
| |--14.04%--bn_mult
| | |
| | |--4.64%--bn_alloc
| | | |
| | | |--1.84%--kmem_cache_alloc_trace
| | | |
| | | --1.52%--memset_erms
| | |
| | |--2.88%--kfree
| | | |
| | | --0.61%--memcg_slab_free_hook
| | |
| | |--0.69%--memcpy_erms
| | |
| | |--0.63%--memset_erms
| | |
| | --0.61%--bn_cpy
| | bn_resize
| | krealloc
| |
| |--1.34%--bn_cpy
| | bn_resize
| | |
| | --0.61%--krealloc
| | __kmalloc_track_caller
| |
| |--0.82%--bn_alloc
| | |
| | --0.51%--kmem_cache_alloc_trace
| |
| --0.61%--memcpy_erms
```
* 會發現 `fib_sequence_bn_fast_double` 花了 78.59% 的時間在 `bn_to_string` 與文章中的相去甚遠。然後我在另一台電腦做了一樣的事情。
::: spoiler 實驗環境
```
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz
Stepping: 11
CPU MHz: 800.007
CPU max MHz: 4200.0000
CPU min MHz: 800.0000
BogoMIPS: 7200.00
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-3
Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cach
e flushes, SMT disabled
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled v
ia prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user
pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, RSB filling
, PBRSB-eIBRS Not affected
Vulnerability Srbds: Mitigation; Microcode
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtr
r pge mca cmov pat pse36 clflush dts acpi mmx f
xsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rd
tscp lm constant_tsc art arch_perfmon pebs bts
rep_good nopl xtopology nonstop_tsc cpuid aperf
mperf pni pclmulqdq dtes64 monitor ds_cpl vmx e
st tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_
1 sse4_2 x2apic movbe popcnt tsc_deadline_timer
aes xsave avx f16c rdrand lahf_lm abm 3dnowpre
fetch cpuid_fault invpcid_single pti ssbd ibrs
ibpb stibp tpr_shadow vnmi flexpriority ept vpi
d ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi
2 erms invpcid mpx rdseed adx smap clflushopt i
ntel_pt xsaveopt xsavec xgetbv1 xsaves dtherm i
da arat pln pts hwp hwp_notify hwp_act_window h
wp_epp md_clear flush_l1d arch_capabilities
```
:::
```bash
87.90% 0.00% client client [.] _start
|
---_start
__libc_start_main (inlined)
main
|
|--77.82%--__GI___read (inlined)
| |
| |--76.04%--entry_SYSCALL_64_after_hwframe
| | do_syscall_64
| | __x64_sys_read
| | ksys_read
| | vfs_read
| | __vfs_read
| | fib_read
| | fib_sequence_bn_fast_double
| | |
| | |--48.81%--bn_mult
| | | |
| | | |--22.94%--bn_alloc
| | | | |
| | | | |--8.04%--__kmalloc
| | | | | |
| | | | | --1.61%--kmalloc_slab
| | | | |
| | | | |--5.82%--memcg_kmem_put_cache
| | | | |
| | | | |--5.54%--memset_erms
| | | | |
| | | | |--2.00%--kmem_cache_alloc_trace
| | | | |
| | | | --1.53%--_cond_resched
| | | |
| | | |--13.35%--bn_free
| | | | kfree
| | | |
| | | |--4.58%--memset_erms
| | | |
| | | --2.20%--memcpy_erms
| | |
| | |--7.81%--bn_sub
| | | bn_add
| | | |
| | | --6.38%--bn_do_sub.isra.0
| | |
| | |--6.22%--memcpy_erms
| | |
| | |--5.38%--bn_to_string
| | |
| | |--3.59%--bn_add
| | | |
| | | --1.75%--bn_resize
| | |
| | |--2.40%--bn_alloc
| | |
| | --1.83%--bn_cpy
| |
| --1.78%--syscall_return_via_sysret
|
```
* 會發現 `bn_mult` 在這另一台電腦會有比較多的站比
::: info
目前推測可能跟 cache 有關?
:::
* 第一台電腦的 cache miss 次數
```bash
Performance counter stats for './client 4' (5 runs):
<not counted> cpu_core/cache-misses/ (0.00%)
6554 cpu_atom/cache-misses/ ( +-140.24% )
<not counted> cpu_core/cache-references/ (0.00%)
7,3179 cpu_atom/cache-references/ ( +- 8.94% )
<not counted> cpu_core/instructions/ (0.00%)
9,4302,4215 cpu_atom/instructions/ ( +- 0.01% )
<not counted> cpu_core/cycles/ (0.00%)
3,0308,9079 cpu_atom/cycles/ ( +- 0.06% )
0.0803402 +- 0.0000555 seconds time elapsed ( +- 0.07% )
```
* 第二台電腦
```bash
Performance counter stats for './client 4' (5 runs):
3,6465 cache-misses # 41.113 % of all cache refs ( +- 7.81% )
8,8695 cache-references ( +- 2.53% )
1093,6314 instructions # 1.27 insn per cycle ( +- 0.05% )
859,4814 cycles ( +- 0.47% )
0.0040462 +- 0.0000896 seconds time elapsed ( +- 2.21% )
```