# 2023q1 Homework3 (fibdrv)
contributed by < `fourcolor` >

## Self-check list

- [ ] Study the ==hints on Linux performance analysis== above, run GNU/Linux on a physical machine, and finish the necessary setup and preparation $\to$ this should also make it clear why the experiments should not be run inside a virtual machine;
- [ ] Study the Fibonacci-related material above (including the paper), extract the key techniques, and think about how instructions such as [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) help with computing Fibonacci numbers. List and explain the key code;
- [ ] Review C [numeral systems](https://hackmd.io/@sysprog/c-numerics) and [bitwise operations](https://hackmd.io/@sysprog/c-bitwise), and think about how the implementation of the fast Fibonacci algorithms can reduce the cost of multiplications;
- [ ] Learn to point out the measures that speed up big-number arithmetic and reduce the cost of memory operations, and record what you understand and what still puzzles you;
- [ ] Note that `fibdrv.c` contains `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy`, and so on. In which scenarios are they needed? Write a multithreaded userspace program to test this, and observe what actually goes wrong when the kernel module does not use a mutex. Try writing code that uses [POSIX Thread](https://en.wikipedia.org/wiki/POSIX_Threads) to confirm it. $\to$ Read together with [Concurrency and multithreaded programming](https://hackmd.io/@sysprog/concurrency)

::: spoiler Experiment environment
```bash
$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ uname -r
5.19.0-35-generic

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  20
  On-line CPU(s) list:   0-19
Vendor ID:               GenuineIntel
  Model name:            12th Gen Intel(R) Core(TM) i7-12700KF
    CPU family:          6
    Model:               151
    Thread(s) per core:  2
    Core(s) per socket:  12
    Socket(s):           1
    Stepping:            2
    CPU max MHz:         5000.0000
    CPU min MHz:         800.0000
    BogoMIPS:            7219.20
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   512 KiB (12 instances)
  L1i:                   512 KiB (12 instances)
  L2:                    12 MiB (9 instances)
  L3:                    25 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-19
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```
:::

## Correctly listing Fibonacci(100)

Because F(93) already exceeds $2^{63}-1$, a `long long` overflows beyond F(92). To list Fibonacci(100) correctly, the numbers are stored and added as decimal digit strings instead. The implementation here follows [Small/Short String Optimization](https://github.com/AdrianHuang/fibdrv/tree/big_number).

```c
static long long fib_sequence_str(long long k, void *buf)
{
    strNum_t *f = kmalloc((k + 2) * sizeof(strNum_t), GFP_KERNEL);
    if (!f)
        return -ENOMEM;

    /* F(0) = "0", F(1) = "1"; copy two bytes so the NUL terminator comes along */
    strncpy(f[0].data, "0", 2);
    strncpy(f[1].data, "1", 2);
    for (int i = 2; i <= k; i++)
        string_number_add(f[i - 2].data, f[i - 1].data, f[i].data);

    size_t retSize = strlen(f[k].data);
    if (copy_to_user(buf, f[k].data, retSize)) {
        kfree(f);
        return -EFAULT;
    }
    kfree(f);
    return retSize;
}
```
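`string_number_add` itself is not shown above. Below is a minimal userspace sketch of how such a helper can work, assuming the digits are stored most-significant first and the destination buffer has room for one extra carry digit; the actual helper in the referenced repository may differ in these details.

```c
#include <stdio.h>
#include <string.h>

/* Add two decimal strings (most significant digit first) and write the result
 * into buf.  buf must hold at least max(strlen(a), strlen(b)) + 2 bytes.
 */
static void string_number_add(const char *a, const char *b, char *buf)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t n = (la > lb ? la : lb) + 1; /* room for a possible carry digit */
    int carry = 0;

    buf[n] = '\0';
    for (size_t i = 0; i < n; i++) {
        int da = i < la ? a[la - 1 - i] - '0' : 0; /* digits from the right */
        int db = i < lb ? b[lb - 1 - i] - '0' : 0;
        int sum = da + db + carry;
        buf[n - 1 - i] = '0' + sum % 10;
        carry = sum / 10;
    }
    /* Drop the leading zero when no carry reached the top digit */
    if (n > 1 && buf[0] == '0')
        memmove(buf, buf + 1, n); /* also moves the terminating NUL */
}

int main(void)
{
    char out[64];
    string_number_add("218922995834555169026", "135301852344706746049", out);
    printf("%s\n", out); /* F(99) + F(98) = F(100) = 354224848179261915075 */
    return 0;
}
```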
## Time measurement

The time measurement follows [Time measurement and performance analysis](https://hackmd.io/@sysprog/linux2023-fibdrv/%2F%40sysprog%2Flinux2023-fibdrv-c) and is implemented with the [ktime API](https://www.kernel.org/doc/html/latest/core-api/timekeeping.html). Since fibdrv's `write` operation does nothing useful, its return value is borrowed to report the time measured for `read`.
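A minimal sketch of this arrangement, assuming the stock fibdrv file-operation signatures and that `fib_read` calls one of the `fib_sequence_*` variants shown in this note (the exact wiring in my tree may differ):

```c
#include <linux/ktime.h>

static ktime_t kt; /* time spent inside the most recent fib_read() */

static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    ssize_t ret;

    kt = ktime_get();                     /* start of the measurement    */
    ret = fib_sequence_str(*offset, buf); /* or any other fib_sequence_* */
    kt = ktime_sub(ktime_get(), kt);      /* elapsed kernel-side time    */
    return ret;
}

/* write() carries no payload in fibdrv, so reuse it to hand the
 * measurement back to userspace as nanoseconds.
 */
static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    return (ssize_t) ktime_to_ns(kt);
}
```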
## Performance analysis

For performance analysis I originally followed [Time measurement and performance analysis](https://hackmd.io/@sysprog/linux2023-fibdrv/%2F%40sysprog%2Flinux2023-fibdrv-c):

```bash
$ taskset -cp 1
pid 1's current affinity list: 0-19
$ vi /etc/default/grub
GRUB_CMD_LINUX="isolcpus=19"
$ sudo update-grub
$ reboot
```

However, after rebooting, the 20th core (CPU 19) was still not reserved:

```bash
$ taskset -cp 1
pid 1's current affinity list: 0-19
```

The [kernel parameter documentation](https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html?highlight=isolcpus) for `isolcpus` notes:

> Isolate from the general SMP balancing and scheduling algorithms. Note that performing domain isolation this way is irreversible: it's not possible to bring back a CPU to the domains once isolated through isolcpus. It's strongly advised to use cpusets instead to disable scheduler load balancing through the "cpuset.sched_load_balance" file. It offers a much more flexible interface where CPUs can move in and out of an isolated set anytime.

I therefore suspect that `isolcpus` has been deprecated in the Linux version I am using. The documentation recommends [cpusets](https://docs.kernel.org/admin-guide/cgroup-v1/cpusets.html) as the replacement. cpuset is one of the cgroup controllers and is mainly used to confine the resources a set of processes may use. (The steps below use [cgroup v2](https://docs.kernel.org/admin-guide/cgroup-v2.html).)

First, create a directory under /sys/fs/cgroup/:

```bash
$ mkdir /sys/fs/cgroup/isolate
```

Entering /sys/fs/cgroup/isolate, we can see:

```bash
$ ls
cgroup.controllers      cpu.weight.nice           memory.low
cgroup.events           hugetlb.1GB.current       memory.max
cgroup.freeze           hugetlb.1GB.events        memory.min
cgroup.kill             hugetlb.1GB.events.local  memory.numa_stat
cgroup.max.depth        hugetlb.1GB.max           memory.oom.group
cgroup.max.descendants  hugetlb.1GB.numa_stat     memory.peak
cgroup.procs            hugetlb.1GB.rsvd.current  memory.pressure
cgroup.stat             hugetlb.1GB.rsvd.max      memory.reclaim
cgroup.subtree_control  hugetlb.2MB.current       memory.stat
cgroup.threads          hugetlb.2MB.events        memory.swap.current
cgroup.type             hugetlb.2MB.events.local  memory.swap.events
cpu.idle                hugetlb.2MB.max           memory.swap.high
cpu.max                 hugetlb.2MB.numa_stat     memory.swap.max
cpu.max.burst           hugetlb.2MB.rsvd.current  memory.zswap.current
cpu.pressure            hugetlb.2MB.rsvd.max      memory.zswap.max
cpuset.cpus             io.max                    misc.current
cpuset.cpus.effective   io.pressure               misc.events
cpuset.cpus.partition   io.prio.class             misc.max
cpuset.mems             io.stat                   pids.current
cpuset.mems.effective   io.weight                 pids.events
cpu.stat                memory.current            pids.max
cpu.uclamp.max          memory.events             rdma.current
cpu.uclamp.min          memory.events.local       rdma.max
cpu.weight              memory.high
```

Here we focus on a few of these files:

- `cgroup.controllers`: the resource controllers available to this cgroup
- `cgroup.subtree_control`: the controllers that child cgroups under this directory may use
- `cgroup.procs`: the PIDs currently running in this cgroup
- `cpuset.cpus`: the CPUs this cgroup is allowed to use
- `cpuset.cpus.effective`: the CPUs currently in effect for this cgroup
- `cpuset.cpus.partition`: ![](https://i.imgur.com/xuuaRrk.png)

If the current `cgroup.controllers` does not include `cpu`, check whether the parent's `cgroup.subtree_control` includes it. Once the cgroup is confirmed to control the cpu resource, first change `cpuset.cpus.partition` to `root` or `isolated` so that this cgroup's CPUs become independent from the other cgroups, then set `cpuset.cpus` to the CPUs to be isolated. Finally, verify that this cgroup's `cpuset.cpus.effective` no longer intersects the parent's `cpuset.cpus.effective`; if so, the isolation succeeded.

```bash
$ echo 18,19 > cpuset.cpus
$ echo root > cpuset.cpus.partition
$ cat cpuset.cpus.effective
18-19
$ cd .. && cat cpuset.cpus.effective
0-17
$ taskset -cp 1
pid 1's current affinity list: 0-17
```

When running the experiment, write the current shell's PID into `cgroup.procs`:

```bash
# ~/workplace/fibdrv/
$ echo $$ > /sys/fs/cgroup/isolate/cgroup.procs
$ ./client
```

<!-- ![](https://i.imgur.com/FhSyrPc.png) -->

---

Without isolating the CPU resources:

![](https://i.imgur.com/l38H4cO.png)

---

After isolating the CPU resources, the curve fluctuates noticeably less often:

![](https://i.imgur.com/F8nhTuK.png)

---

## Fast Doubling

This part starts from [Accelerating Fibonacci computation](https://hackmd.io/@sysprog/linux2023-fibdrv/%2F%40sysprog%2Flinux2023-fibdrv-d), combined with `copy_to_user`; the `fast_double` helper it calls is sketched right after the measurement chart below.

```c
static long long fib_sequence_fast_double(long long k, void *buf)
{
    long long *f = kmalloc(sizeof(long long), GFP_KERNEL);
    if (!f)
        return -ENOMEM;
    f[0] = fast_double(k);
    if (copy_to_user(buf, f, sizeof(long long))) {
        kfree(f);
        return -EFAULT;
    }
    kfree(f);
    return sizeof(long long);
}
```

![](https://i.imgur.com/gXMn0C2.png)
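A minimal userspace sketch of an iterative `fast_double`, assuming the name and signature match the call site above (the helper in my tree may differ). It walks the bits of `k` from the most significant bit down, using the identities $F(2n) = F(n)\,(2F(n+1) - F(n))$ and $F(2n+1) = F(n)^2 + F(n+1)^2$; this is also where a `clz`-style instruction (here `__builtin_clzll`) helps, because it locates the highest set bit directly.

```c
#include <stdint.h>
#include <stdio.h>

/* Iterative fast doubling: process the bits of k from the highest set bit
 * down to bit 0, maintaining a = F(n) and b = F(n + 1).
 * With 64-bit integers the result stays exact only up to F(92).
 */
static long long fast_double(long long k)
{
    if (k == 0)
        return 0;

    uint64_t a = 0, b = 1; /* F(0), F(1) */
    for (int i = 63 - __builtin_clzll(k); i >= 0; i--) {
        uint64_t t1 = a * (2 * b - a); /* F(2n)     */
        uint64_t t2 = a * a + b * b;   /* F(2n + 1) */
        if (k & (1LL << i)) {          /* bit set: n -> 2n + 1 */
            a = t2;
            b = t1 + t2;
        } else {                       /* bit clear: n -> 2n   */
            a = t1;
            b = t2;
        }
    }
    return (long long) a;
}

int main(void)
{
    printf("%lld\n", fast_double(92)); /* 7540113804746346429 */
    return 0;
}
```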
## Introducing `bn`

Following [Preliminary support for big-number arithmetic](https://hackmd.io/@sysprog/linux2023-fibdrv/%2F%40sysprog%2Flinux2023-fibdrv-d#%E5%88%9D%E6%AD%A5%E6%94%AF%E6%8F%B4%E5%A4%A7%E6%95%B8%E9%81%8B%E7%AE%97), the `bn` arithmetic is brought into `fibdrv` and analyzed with perf.

![](https://i.imgur.com/rFmUA8A.png)

### `perf record`

```bash
    99.64%     0.00%  client   client   [.] _start
            |
            ---_start
               __libc_start_main_impl (inlined)
               __libc_start_call_main (inlined)
               main
               |
               |--97.43%--__GI___libc_read (inlined)
               |          entry_SYSCALL_64_after_hwframe
               |          do_syscall_64
               |          __x64_sys_read
               |          ksys_read
               |          |
               |           --97.12%--vfs_read
               |                     |
               |                      --96.82%--fib_read
               |                                fib_sequence_bn_fast_double
               |                                |
               |                                |--78.59%--bn_to_string
               |                                |
               |                                |--14.04%--bn_mult
               |                                |          |
               |                                |          |--4.64%--bn_alloc
               |                                |          |          |
               |                                |          |          |--1.84%--kmem_cache_alloc_trace
               |                                |          |          |
               |                                |          |           --1.52%--memset_erms
               |                                |          |
               |                                |          |--2.88%--kfree
               |                                |          |          |
               |                                |          |           --0.61%--memcg_slab_free_hook
               |                                |          |
               |                                |          |--0.69%--memcpy_erms
               |                                |          |
               |                                |          |--0.63%--memset_erms
               |                                |          |
               |                                |           --0.61%--bn_cpy
               |                                |                     bn_resize
               |                                |                     krealloc
               |                                |
               |                                |--1.34%--bn_cpy
               |                                |          bn_resize
               |                                |          |
               |                                |           --0.61%--krealloc
               |                                |                     __kmalloc_track_caller
               |                                |
               |                                |--0.82%--bn_alloc
               |                                |          |
               |                                |           --0.51%--kmem_cache_alloc_trace
               |                                |
               |                                 --0.61%--memcpy_erms
```

* `fib_sequence_bn_fast_double` spends 78.59% of its time in `bn_to_string`, which is far from what the handout reports (a sketch of what `bn_to_string` typically has to do is given at the end of this section). I then ran the same experiment on another machine.

::: spoiler Experiment environment
```
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           158
Model name:                      Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz
Stepping:                        11
CPU MHz:                         800.007
CPU max MHz:                     4200.0000
CPU min MHz:                     800.0000
BogoMIPS:                        7200.00
Virtualization:                  VT-x
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        6 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Mmio stale data:   Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Retbleed:          Mitigation; IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Mitigation; Microcode
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
```
:::

```bash
    87.90%     0.00%  client   client   [.] _start
            |
            ---_start
               __libc_start_main (inlined)
               main
               |
               |--77.82%--__GI___read (inlined)
               |          |
               |          |--76.04%--entry_SYSCALL_64_after_hwframe
               |          |          do_syscall_64
               |          |          __x64_sys_read
               |          |          ksys_read
               |          |          vfs_read
               |          |          __vfs_read
               |          |          fib_read
               |          |          fib_sequence_bn_fast_double
               |          |          |
               |          |          |--48.81%--bn_mult
               |          |          |          |
               |          |          |          |--22.94%--bn_alloc
               |          |          |          |          |
               |          |          |          |          |--8.04%--__kmalloc
               |          |          |          |          |          |
               |          |          |          |          |           --1.61%--kmalloc_slab
               |          |          |          |          |
               |          |          |          |          |--5.82%--memcg_kmem_put_cache
               |          |          |          |          |
               |          |          |          |          |--5.54%--memset_erms
               |          |          |          |          |
               |          |          |          |          |--2.00%--kmem_cache_alloc_trace
               |          |          |          |          |
               |          |          |          |           --1.53%--_cond_resched
               |          |          |          |
               |          |          |          |--13.35%--bn_free
               |          |          |          |          kfree
               |          |          |          |
               |          |          |          |--4.58%--memset_erms
               |          |          |          |
               |          |          |           --2.20%--memcpy_erms
               |          |          |
               |          |          |--7.81%--bn_sub
               |          |          |          bn_add
               |          |          |          |
               |          |          |           --6.38%--bn_do_sub.isra.0
               |          |          |
               |          |          |--6.22%--memcpy_erms
               |          |          |
               |          |          |--5.38%--bn_to_string
               |          |          |
               |          |          |--3.59%--bn_add
               |          |          |          |
               |          |          |           --1.75%--bn_resize
               |          |          |
               |          |          |--2.40%--bn_alloc
               |          |          |
               |          |           --1.83%--bn_cpy
               |          |
               |           --1.78%--syscall_return_via_sysret
               |
```

* On this second machine, `bn_mult` takes a noticeably larger share of the time instead.

::: info
My current guess is that the difference may be related to the cache?
:::

* Cache-miss counts on the first machine:

```bash
 Performance counter stats for './client 4' (5 runs):

     <not counted>      cpu_core/cache-misses/                                        (0.00%)
              6554      cpu_atom/cache-misses/                                        ( +-140.24% )
     <not counted>      cpu_core/cache-references/                                    (0.00%)
            7,3179      cpu_atom/cache-references/                                    ( +-  8.94% )
     <not counted>      cpu_core/instructions/                                        (0.00%)
       9,4302,4215      cpu_atom/instructions/                                        ( +-  0.01% )
     <not counted>      cpu_core/cycles/                                              (0.00%)
       3,0308,9079      cpu_atom/cycles/                                              ( +-  0.06% )

         0.0803402 +- 0.0000555 seconds time elapsed  ( +-  0.07% )
```

* The second machine:

```bash
 Performance counter stats for './client 4' (5 runs):

            3,6465      cache-misses              #   41.113 % of all cache refs      ( +-  7.81% )
            8,8695      cache-references                                              ( +-  2.53% )
         1093,6314      instructions              #    1.27  insn per cycle           ( +-  0.05% )
          859,4814      cycles                                                        ( +-  0.47% )

         0.0040462 +- 0.0000896 seconds time elapsed  ( +-  2.21% )
```
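As mentioned above, here is a rough illustration of what `bn_to_string` typically has to do. Converting a big number stored as binary limbs into a decimal string takes one pass over all limbs per output digit (a long division by 10), so the conversion alone is quadratic in the length of the number, and it is not surprising to see it rank high in a profile. The `bn` layout and helper below are hypothetical and simplified; they only show the shape of the work, not the actual fibdrv code.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical big-number layout: little-endian 32-bit limbs. */
typedef struct {
    uint32_t num[4]; /* enough limbs for this demo */
    int size;        /* number of limbs in use */
} bn;

/* Each produced decimal digit scans every limb once (long division by 10),
 * so the total work is O(limbs x digits), i.e. quadratic in the number size.
 * buf must be large enough for all digits plus the terminating NUL.
 */
static void bn_to_string(const bn *src, char *buf, size_t len)
{
    bn tmp = *src;
    char *p = buf + len - 1;
    *p = '\0';

    do {
        uint64_t rem = 0;
        for (int i = tmp.size - 1; i >= 0; i--) { /* divide the whole number by 10 */
            uint64_t cur = (rem << 32) | tmp.num[i];
            tmp.num[i] = (uint32_t) (cur / 10);
            rem = cur % 10;
        }
        *--p = '0' + (char) rem; /* next digit, least significant first */
        while (tmp.size > 1 && tmp.num[tmp.size - 1] == 0)
            tmp.size--; /* drop limbs that became zero */
    } while (tmp.size > 1 || tmp.num[0] != 0);

    memmove(buf, p, (size_t) (buf + len - p)); /* shift the string to the front */
}

int main(void)
{
    /* 12345678901234567890 == 0xAB54A98C EB1F0AD2, stored as two limbs */
    bn x = {.num = {0xEB1F0AD2, 0xAB54A98C, 0, 0}, .size = 2};
    char out[64];
    bn_to_string(&x, out, sizeof(out));
    printf("%s\n", out); /* 12345678901234567890 */
    return 0;
}
```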