fibdrv - HackMD

# Fibdrv contributed by < [`ericlinqq`](https://github.com/ericlinqq) > ## 實驗環境 ``` $ fastfetch ## ** eric@eric-macbookair *####****. -------------------- ###, OS: Fedora Linux Asahi Remix 42 (Forty Two [Adams]) aarch64 ...,/#,,,.. Host: Apple MacBook Air (13-inch, M2, 2022) /*,,,,,,,,*,........,, Kernel: Linux 6.16.4-402.asahi.fc42.aarch64+16k ,((((((//*,,,,,,,,,...... Shell: zsh 5.9 ((((((((((((((%............ Display (eDP-1): 2560x1600 @ 60 Hz (as 1598x999) in 13" [Built-in] ,(((((((((((((((@@(............ DE: KDE Plasma 6.4.4 (((((((((((((((((@@@@/............ WM: KWin (Wayland) ,((((((((((((((((((@@@@@&*........... Terminal: ghostty 1.2.0-2.fc42 ((((((((((((((((((((@@@@@@@&,........... CPU: Apple M2 (8) @ 3.20 GHz (((((((((((((((((((((@@@&%&@@@%,.......... GPU: Apple M2 (10) @ 1.40 GHz [Integrated] /(((((((((((((((((((@@@&%%&@@@@(........ Memory: 9.58 GiB / 15.27 GiB (63%) ,((((((((((((((((@@@&&@@&/&@@@/.. /((((((((((((@@@@@@/.../&& .(((((((((@@@@(.... /(((((@@#... .((&, ``` ``` $ lscpu Architecture: aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Vendor ID: Apple Model name: Blizzard-M2 Model: 0 Thread(s) per core: 1 Core(s) per socket: 4 Socket(s): 1 Stepping: 0x1 Frequency boost: disabled CPU(s) scaling MHz: 72% CPU max MHz: 2424.0000 CPU min MHz: 600.0000 BogoMIPS: 48.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 asimdf hm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint i8mm bf16 bti ecv Model name: Avalanche-M2 Model: 0 Thread(s) per core: 1 Core(s) per socket: 4 Socket(s): 1 Stepping: 0x1 CPU(s) scaling MHz: 100% CPU max MHz: 3204.0000 CPU min MHz: 660.0000 BogoMIPS: 48.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 asimdf hm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint i8mm bf16 bti ecv Caches (sum of all): L1d: 768 KiB (8 instances) L1i: 1.3 MiB (8 instances) L2: 20 MiB (2 instances) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-7 ``` ```bash $ gcc --version gcc (GCC) 15.2.1 20250808 (Red Hat 15.2.1-1) ``` 安裝 `kernel-16k-devel` 套件 (對應到作業說明中的 `linux-headers` 套件) ```bash $ sudo dnf install kernel-16k-devel ``` ## 排除干擾效能分析的因素 ### 查詢 CPU 資訊 Apple M2 包含 Avalanche (Performance Core) 以及 Blizzard (Efficiency Core) 兩種 Core，這邊選擇使用 P Core 來作測試藉由以下指令可以看到 CPU 0-3 為 E Core，而 CPU 4 - 7 為 P Core ``` $ lscpu -e CPU NODE SOCKET CORE L1d:L1i:L2 ONLINE MAXMHZ MINMHZ MHZ 0 0 0 0 0:0:0 yes 2424.0000 600.0000 1752.0000 1 0 0 1 1:1:0 yes 2424.0000 600.0000 1752.0000 2 0 0 2 2:2:0 yes 2424.0000 600.0000 1752.0000 3 0 0 3 3:3:0 yes 2424.0000 600.0000 1752.0000 4 0 0 0 4:4:1 yes 3204.0000 660.0000 660.0000 5 0 0 1 5:5:1 yes 3204.0000 660.0000 660.0000 6 0 0 2 6:6:1 yes 3204.0000 660.0000 660.0000 7 0 0 3 7:7:1 yes 3204.0000 660.0000 660.0000 ``` ### 相關 kernel 開機參數參考 [Ubuntu documentation: How to configures CPUs for real-time processing](https://documentation.ubuntu.com/real-time/latest/how-to/cpu-boot-configs) 以下操作皆需要在 `/etc/default/grub` 中的`GRUB_CMDLINE_LINUX_DEFAULT` 新增 kernel 開機參數，並利用 ``` $ sudo grub2-mkconfig -o /etc/grub2.cfg ``` 來更新 GRUB 的設定，重新開機後即生效 #### 保留特定 CPU 避免被排程加入`isolcpus=7` 可以用以下指令來確認是否成功 ``` $ cat /sys/devices/system/cpu/isolated 7 ``` ``` $ taskset -cp 1 pid 1's current affinity list: 0-6 ``` #### 避免特定 CPU 被指派處理 IRQ 加入 `irqaffinity=7` 可以用以下指令來確認 ``` $ find /proc/irq/ -name smp_affinity_list -print -exec cat '{}' ';' /proc/irq/1/smp_affinity_list 0-6 /proc/irq/2/smp_affinity_list 0-6 /proc/irq/3/smp_affinity_list 0-6 /proc/irq/4/smp_affinity_list 0-6 /proc/irq/5/smp_affinity_list 0-6 /proc/irq/6/smp_affinity_list 0-6 ... ``` #### 避免特定 CPU 被排程器時鐘中斷加入 `nohz=on nohz_full=7` 可以用以下指令確認 ``` $ cat /sys/devices/system/cpu/nohz_full 7 ``` #### 避免特定 CPU 處理 RCU callback 加入 `rcu_nocbs=7` 可以用以下指令確認 ``` $ sudo dmesg | grep rcu [ 0.000000] rcu: Offload RCU callbacks from CPUs: 7. ``` ### 以特定 CPU 執行程式 ``` $ taskset -c 7 ./client ``` ### 抑制 ASLR ``` $ sudo bash -c "echo 0 > /proc/sys/kernel/randomize_va_space" ``` ### 設定 scaling_governor 為 performance ``` $ sudo bash -c "echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor" ``` ### 設定 scaling_min_freq ``` $ MAX_FREQ=`cat /sys/devices/system/cpu/cpu7/cpufreq/scaling_max_freq` $ sudo bash -c "echo $MAX_FREQ > /sys/devices/system/cpu/cpu7/cpufreq/scaling_min_freq" ``` ## 時間量測方式 ### 利用 write operation 測試執行時間使用 `ktime` 系列 API 來量測執行時間，並利用 `fib_time_proxy()` 函式作為 wrapper function，用以進行量測，並返回計算結果 ```c static ssize_t fib_write(struct file *file, const char *buf, size_t size, loff_t *offset) { int k = *offset; switch (size) { case 0: fib_time_proxy(fib_sequence, k); break; case 1: fib_time_proxy(fib_sequence_fdoubling, k); break; case 2: fib_time_proxy(fib_sequence_fdoubling_clz, k); break; default: return 1; } return (ssize_t) ktime_to_ns(kt); } ``` `write(fd, buf, 0)` 為測試 naive `write(fd, buf, 1)` 為測試 fast doubling `write(fd, buf, 2)` 為測試 fast doubling w/ clz ```c static ktime_t kt; typedef long long (*fib_ft)(long long); static long long fib_time_proxy(fib_ft fib, long long k) { kt = ktime_get(); long long res = fib(k); kt = ktime_sub(ktime_get(), kt); return res; } ``` ### 執行時間直接輸出執行一次的時間 ![runtime_ns](https://hackmd.io/_uploads/BkScygYjeg.png) 可以觀察到上圖中有兩個問題： 1. Naive 方法的執行時間在某幾個 Fibonacci number 會比其他高出很多，在圖上呈現出細尖的形狀 2. Fast doubling 的方法似乎只有兩種執行時間，不是 0 就是 40 左右以下將針對這兩點分別進行討論 ### Naive 方法部份 F(n) 執行時間過長從上圖可以觀察到當 `n = 0, 3, 7, 11, 15, 23, 31, 63` 時會有相比其他數相差非常巨大的執行時間 #### 查看 Naive 方法的實作方式： ```c static long long fib_sequence(long long k) { long long *f = kmalloc(sizeof(*f) * (k + 2), GFP_KERNEL); ... kfree(f); } ``` 發現會使用 `kmalloc` 來動態配置記憶體，並且會依照傳入的 `k` 數值配置一個 `sizeof(long long) * (k + 2)` 空間，依照環境 LP64 data model，即為 `8 * (k + 2)` Bytes，我們將上述提到的數字代入， | k | 0 | 3 | 7 | 11 | 15 | 23 | 31 | 63 | |:------------:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| | Bytes needed | 16 | 40 | 72 | 104 | 136 | 200 | 264 | 520 | #### 查看 `kmalloc` 會使用到的通用 slab cache ``` $ sudo cat /proc/slabinfo # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> ... kmalloc-32k 0 0 32768 4 8 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-16k 0 0 16384 8 8 : tunables 0 0 0 : slabdata 0 0 0 kmalloc-8k 96 96 8192 16 8 : tunables 0 0 0 : slabdata 6 6 0 kmalloc-4k 224 288 4096 32 8 : tunables 0 0 0 : slabdata 9 9 0 kmalloc-2k 256 256 2048 32 4 : tunables 0 0 0 : slabdata 8 8 0 kmalloc-1k 288 288 1024 32 2 : tunables 0 0 0 : slabdata 9 9 0 kmalloc-512 384 384 512 32 1 : tunables 0 0 0 : slabdata 12 12 0 kmalloc-256 448 448 256 64 1 : tunables 0 0 0 : slabdata 7 7 0 kmalloc-128 896 896 128 128 1 : tunables 0 0 0 : slabdata 7 7 0 kmalloc-64 2048 2048 64 256 1 : tunables 0 0 0 : slabdata 8 8 0 kmalloc-32 3584 3584 32 512 1 : tunables 0 0 0 : slabdata 7 7 0 kmalloc-16 7168 7168 16 1024 1 : tunables 0 0 0 : slabdata 7 7 0 kmalloc-8 14336 14336 8 2048 1 : tunables 0 0 0 : slabdata 7 7 0 kmalloc-192 680 680 192 85 1 : tunables 0 0 0 : slabdata 8 8 0 kmalloc-96 1190 1190 96 170 1 : tunables 0 0 0 : slabdata 7 7 0 ``` 可以看到配置的記憶體都是比所要求空間還要來得大的 slab cache | k | 0 | 3 | 7 | 11 | 15 | 23 | 31 | 63 | |:---------------:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:----:| | Bytes　needed | 16 | 40 | 72 | 104 | 136 | 200 | 264 | 520 | | Bytes allocated | 16 | 64 | 96 | 128 | 192 | 256 | 512 | 1024 | 接著再觀察 `k - 1` 時所配置到的記憶體大小，剛好就是每個 slab cache 的大小 | k - 1 | 2 | 6 | 10 | 14 | 22 | 30 | 62 | |:---------------:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| | Bytes　needed | 32 | 64 | 96 | 128 | 192 | 256 | 512 | | Bytes allocated | 32 | 64 | 96 | 128 | 192 | 256 | 512 | 這些多餘的時間主要花費在要求配置一個新的 slab cache，而無法復用先前剛 `kfree` 掉的空間。後續會使用統計方法來重複取樣，以消除掉這些異常值，另外也會嘗試改寫避免動態配置記憶體。 ### Fast doubling 方法執行時間為 0 觀察上圖，執行時間只有 0 或 40 幾，因此懷疑或許是執行時間小於時鐘精度，造成前後 Cycle count 相等，因此就看到執行時間為 0 的情況 #### 查看 Clock Source ``` $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource arch_sys_counter ``` ``` $ cat /sys/devices/system/clocksource/clocksource0/available_clocksource arch_sys_counter ``` 此為 ARMv8-A 架構的 [System counter](https://developer.arm.com/documentation/102379/0100/System-Counter?lang=en)，我們可以通過讀取[CNTFRQ_EL0](https://developer.arm.com/documentation/ddi0595/2020-12/AArch64-Registers/CNTFRQ-EL0--Counter-timer-Frequency-register?lang=en) 暫存器來得到它的頻率，讀取 [CNTPCT_EL0](https://developer.arm.com/documentation/ddi0601/2025-06/AArch64-Registers/CNTPCT-EL0--Counter-timer-Physical-Count-Register) 得到目前計數值 :::warning 若在 Userspace (EL0) 讀取上述兩個暫存器可能會發生 Illegal Instrucion，須手動設置 [CNTKCTL_EL1](https://developer.arm.com/documentation/ddi0601/2020-12/AArch64-Registers/CNTKCTL-EL1--Counter-timer-Kernel-Control-register) 暫存器之`EL0PCTEN, BIT[0]` 為 1 後才能夠成功讀取。 ::: #### 利用 System counter 查看執行 cycle 數撰寫以下函式來讀取 `CNTPCT_EL0` 暫存器 ```c static unsigned long read_cntpct(void) { unsigned long cnt; asm volatile ("mrs %0, CNTPCT_EL0": "=r" (cnt)::); return cnt; } ``` 改寫 `fib_time_proxy()` 利用執行前後數值相差來計算執行所需 Cycle 數 ```c kt = read_cntpct(); long long res = fib(k); kt = read_cntpct() - kt; ``` ![runtime_cycle](https://hackmd.io/_uploads/r1zpSBqilx.png) 果然可以看到 Fast doubling 的 Cycle 數只有 0 和 1 兩種數值 #### 確認 System conter 的頻率從 kernel log 中確認 ``` $ sudo dmesg | less [ 0.000000] arch_timer: cp15 timer(s) running at 24.00MHz (phys). [ 0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0x588fe9dc0, max_idle_ns: 440795202592 ns [ 0.000000] sched_clock: 56 bits at 24MHz, resolution 41ns, wraps every 4398046511097ns ``` 撰寫以下函式來讀取 `CNTFRQ_EL0` 暫存器 ```c static unsigned long read_cntfreq(void) { unsigned long freq; asm volatile ("mrs %0, CNTFRQ_EL0": "=r" (freq)::); return freq; } ``` 利用 `printk` 輸出資訊 ```c printk(KERN_INFO "CNTFRQ_EL0: %lu\n", read_cntfreq()); ``` 查看 kernel log ``` $ sudo dmesg | less [ 9007.220132] CNTFRQ_EL0: 24000000 ``` System counter 頻率為 24MHz，也就是一個 Cycle 約為 41 ns，因此前面我們只能看到 0 和 42 ns 兩種執行時間 ### 改為使用 PMU 進行分析我們可改用 Performance Monitoring Unit (PMU)，ARM 架構下提供暫存器 Performance Monitor Cycle Counter (PMCC) 用於紀錄處理器 Cycle 的數量，根據 ARMv8-A Manual [PMCCNTR_EL0](https://developer.arm.com/documentation/ddi0595/2021-03/AArch64-Registers/PMCCNTR-EL0--Performance-Monitors-Cycle-Count-Register) 暫存器可以用來讀取 Cycle 的計數值，不過當中也有提醒： > This register is present only when FEAT_PMUv3 is implemented. Otherwise, direct accesses to PMCCNTR_EL0 are UNDEFINED. ``` $ sudo dmesg | grep PMU [ 0.000000] CPU features: detected: Apple IMPDEF PMUv3 Traps [ 0.668723] hw perfevents: enabled with apple_blizzard_pmu PMU driver, 10 (0,000003ff) counters available [ 0.668817] hw perfevents: enabled with apple_avalanche_pmu PMU driver, 10 (0,000003ff) counters available ``` ``` $ ls /sys/bus/event_source/devices | grep pmu apple_avalanche_pmu apple_blizzard_pmu ``` ``` $ cat /sys/bus/event_source/devices/apple_avalanche_pmu/cpus 4-7 ``` 可惜的是，Apple M2 上並沒有使用 armv8-pmuv3，而是使用自家專有的 PMU，因此直接讀取 `PMCCNTR_EL0` 暫存器會發生 Illegal Instruction Exception 不過我們還是可以利用 `perf_events` 來進行效能量測 ### 利用 `perf_events` 在 kernel module 內進行效能量測 #### 利用 open operation 建立 `perf_events` 在 open 時，建立 perf_event 監測 CPU cycle count 利用 [work_on_cpu()](https://github.com/torvalds/linux/blob/cec1e6e5d1ab33403b809f79cd20d6aff124ccfe/include/linux/workqueue.h#L843) 在目標 CPU 上利用函式 [perf_event_create_kerenl_count()](https://github.com/torvalds/linux/blob/07e27ad16399afcd693be20211b0dfae63e0615f/kernel/events/core.c#L13771C1-L13771C34)　建立 `perf_event`，參數設定為`cpu` 設為目標 CPU，`task`　設為目前的 task (task mode)，並將 `fib_ctx` 存入 `file->private_data`　中，方便在對 `/dev/fibonacci` 操作時得到相關執行參數與結果 ```c struct fib_ctx { struct perf_event *pe; int cpu; fib_ft func; long long k; u64 cycle; long long result; }; static int fib_open(struct inode *inode, struct file *file) { ... struct fib_ctx *fc = kzalloc(sizeof(*fc), GFP_KERNEL); if (!fc) return -ENOMEM; fc->cpu = smp_processor_id(); if (!cpu_online(fc->cpu)) { kfree(fc); return -EINVAL; } long ret = work_on_cpu(fc->cpu, fib_create_pe_oncpu, fc); if (ret) { kfree(fc); return ret; } file->private_data = fc; return 0; } ``` 如前面所述，`fib_create_pe_oncpu()`　被放入指定 CPU 執行，並在內部呼叫 [perf_event_create_kernel_counter](https://github.com/torvalds/linux/blob/07e27ad16399afcd693be20211b0dfae63e0615f/kernel/events/core.c#L13771C1-L13771C34) 建立　`perf_event` ```c static long fib_create_pe_oncpu(void *data) { struct fib_ctx *fc = data; fc->pe = perf_event_create_kernel_counter(&attr, fc->cpu, current, NULL, NULL); if (IS_ERR(fc->pe)) return PTR_ERR(fc->pe); return 0; } ``` 設定 `perf_event_attr` 為監控 CPU Cycle，並且獨占 PMU ```c struct perf_event_attr attr = {.type = PERF_TYPE_HARDWARE, .size = sizeof(attr), .config = PERF_COUNT_HW_CPU_CYCLES, .disabled = 0, .exclude_hv = 1, .pinned = 1, .exclusive = 1, .exclude_kernel = 0, .read_format = 0}; ``` #### 利用 write operation 讀取執行待測函式前後的 cycle count 修改先前的 `fib_time_proxy()`，讓其在目標 CPU 上執行測試函式 ```c static long long fib_time_proxy(struct fib_ctx *fc) { work_on_cpu(fc->cpu, fib_cycle_delta, fc); return fc->result; } ``` 透過 [perf_event_read_value()](https://github.com/torvalds/linux/blob/07e27ad16399afcd693be20211b0dfae63e0615f/kernel/events/core.c#L5909) 讀取 `perf_event`　所監測之　Cycle count; ```c static long fib_cycle_delta(void *data) { struct fib_ctx *fc = data; u64 v0 = 0, en0 = 0, run0 = 0; u64 v1 = 0, en1 = 0, run1 = 0; v0 = perf_event_read_value(fc->pe, &en0, &run0); fc->result = fc->func(fc->k); v1 = perf_event_read_value(fc->pe, &en1, &run1); fc->cycle = v1 - v0; return 0; } ``` ![runtime_perf](https://hackmd.io/_uploads/BJi_SBjigg.png) 終於我們得以看到更細粒度的執行時間，並且不同方法執行時間對比更加明顯 ### 改進 Naive 方法因為使用 `kmalloc` 造成執行時間較長且不穩定，考慮這邊我們並沒有紀錄並重複使用 fibonacci number，因此可以直接用變數暫存結果即可，這樣在與 `fast doubling` 方法比較時也會比較公平 ```c static long long fib_sequence2(long long k) { long long a = 0; // f(0) long long b = 1; // f(1) for (int i = 0; i < k / 2; i++) { a += b; // f(i) b += a; // f(i + 1) } return (k & 1) ? b : a; } ``` ![improve_naive](https://hackmd.io/_uploads/r1y5RnIhle.png) ### 以統計方法消除極端值對每個 Fibonacci number 的計算皆重複執行 1000 次，並根據 95% 信賴區間將兩個標準差以外之樣本排除，取剩餘樣本之平均作為最終執行時間。可以將原先 naive 方法的 `kmalloc` 所造成的極端值去除。 ![runtime_stat](https://hackmd.io/_uploads/r1AiyTL3el.png) ### User Space 和 Kernel Space 傳遞時間以原先的 Naive 方法進行測試，同樣使用上節之統計方法，Kernel Module 中使用 `ktime` ， User Space 使用 `clock_gettime` ![syscall_plot](https://hackmd.io/_uploads/ryM6u3Jhge.png) 可以看到 kernel time 和 user time 皆隨著輸入線性增加，system call overhead 則是保持固定