--- tags: linux2022 --- # 2022q1 Homework3 (fibdrv) contributed by < [howardjchen](https://github.com/howardjchen) > > [作業要求](https://hackmd.io/@sysprog/linux2022-fibdrv) ## 自我檢查清單 - [x] 研讀 ==Linux 效能分析的提示== 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗; - [ ] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 - [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise),思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本; - [ ] 研讀 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU),指出針對大數運算,有哪些加速運算和縮減記憶體操作成本的舉措? - [ ] `lsmod` 的輸出結果有一欄名為 `Used by`,這是 "each module's use count and a list of referring modules",但如何實作出來呢?模組間的相依性和實際使用次數 ([reference counting](https://en.wikipedia.org/wiki/Reference_counting)) 在 Linux 核心如何追蹤呢? > 搭配閱讀 [The Linux driver implementer’s API guide » Driver Basics](https://www.kernel.org/doc/html/latest/driver-api/basics.html) - [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 [POSIX Thread](https://en.wikipedia.org/wiki/POSIX_Threads) 的程式碼來確認。 ## 開發環境 ### Ubuntu Linux 系統資訊 ``` Static hostname: pcie-lab Icon name: computer-desktop Chassis: desktop Machine ID: fd5e7ae180f94e0cbbb273229cc399e3 Boot ID: 45ad9e2f7b1349a8bb6bb449830c36a1 Operating System: Ubuntu 20.04.2 LTS Kernel: Linux 5.11.0-34-generic Architecture: x86-64 ``` ``` gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 Copyright (C) 2019 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 165 Model name: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz Stepping: 5 CPU MHz: 2900.000 CPU max MHz: 4800.0000 CPU min MHz: 800.0000 BogoMIPS: 5799.77 Virtualization: VT-x L1d cache: 256 KiB L1i cache: 256 KiB L2 cache: 2 MiB L3 cache: 16 MiB NUMA node0 CPU(s): 0-15 Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected ``` ## 1. 設定 Linux 效能分析環境 由於原本的 fibdrv 只能跑到 offset=92 就會 overflow ,這邊先著手進行修改 $fib(92)$ 的計算結果用 16 進位表示是 `0x1 11F3 8AD0 840B F6BF`,超過 64 位元整數所夠表示的最大數值,因此考量到要正確輸出 $fib(100)$ 或數列更後面的數值,就必須使用到特製的結構來處理回傳值。我們利用作業裡的例子進行實作 ```c= struct BigN { unsigned long long lower, upper; }; ``` 使用 struct BigN 來將一個數字分成兩部份: * 高位的 64 bits 保存 upper 中; * 低位的 64 bit 則是保存在 lower 中; 進行大數加法時,則需要注意 lower 是否需要進位到 upper: ```cpp static inline void addBigN(struct BigN *output, struct BigN x, struct BigN y) { output->upper = x.upper + y.upper; if (y.lower > ~x.lower) output->upper++; output->lower = x.lower + y.lower; } ``` 利用 128 bit 可以表示到 offset=186 ``` [32838.639067] offset: 186 FA63C8D9FA216A8F C8A7213B333270F8 ``` --- 我們先針對 ```fib_sequence``` 來進行時間量測 ```c static struct BigN fib_sequence(long long k) { /* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */ struct BigN f[k + 2]; f[0].upper = 0; f[0].lower = 0; f[1].upper = 0; f[1].lower = 1; for (int i = 2; i <= k; i++) addBigN(&f[i], f[i - 1], f[i - 2]); return f[k]; } static struct BigN fib_time_proxy(long long k) { kt = ktime_get(); struct BigN result = fib_sequence(k); kt = ktime_sub(ktime_get(), kt); return result; } ``` 我們利用 write 來回傳計算的時間 ```c static ssize_t fib_write(struct file *file, const char *buf, size_t size, loff_t *offset) { return ktime_to_ns(kt); } ``` 於是就有量測時間的結果,但量測數值有點浮動 ![](https://i.imgur.com/K04TMqO.png) ### 排除干擾效能分析的因素 #### 限定 CPU 給特定的程式使用 參考 [KYG-yaya573142 的報告](https://hackmd.io/BZjwYw1FQ1O0NPY5kIS7VQ?both) 修改 `/etc/default/grub` 內`GRUB_CMDLINE_LINUX_DEFAULT`,加入 `isolcpus=5` 來指定編號 5 的核心不被排程器賦予任務,修改完成後需 `sudo update-grub` 來更新設定,重開機後即生效 (可從 `/sys/devices/system/cpu/isolated` 確認是否生效) ``` GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=5" ``` 修改後可觀察到 PID 1 - [init](https://en.wikipedia.org/wiki/Init) 的 affinity list 不包含該編號的 CPU ```shell $ taskset -cp 1 pid 1's current affinity list: 0-4,6-15 ``` #### 將程式固定在特定的 CPU 中執行 透過指定處理器親和性 (processor affinity,亦稱 CPU pinning),可以將程式固定在特定的 CPU 中執行 ```shell $ sudo taskset -c 5 ./client ``` #### 抑制 [address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization) (ASLR) ```shell $ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space" ``` #### 設定 scaling_governor 為 performance 準備以下 shell script 來設定 CPU 以最高頻率執行所有 process ```shell $ cat performance.sh for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor do echo performance > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor done $ sudo sh performance.sh ``` #### 關閉 turbo mode 關閉 Intel 處理器的 turbo mode ```shell $ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" ``` #### SMP IRQ affinity 執行上述步驟後進行量測,發現結果仍有飄動的情況 ![](https://i.imgur.com/BVA4C5Y.png) 針對 [SMP IRQ affinity](https://www.kernel.org/doc/Documentation/IRQ-affinity.txt) 進行設置,盡量避免 CPU 5 去處理 IRQ。使用以下 script 進行設置,僅將 CPU 5 從可用清單去除,但不大幅度改變本來系統的設置 (例如 smp_affinity 原本是 0~15,只會被更改為 0~4, 6-15) 註: smp_affinity 和 smp_affinity_list 擇一設定即可 ```bash #!/bin/bash for file in `find /proc/irq -name "smp_affinity"` do var=0x`cat ${file}` var="$(( $var & 0xdf ))" var=`printf '%.2x' ${var}` echo "${var} > ${file}" done echo df > /proc/irq/default_smp_affinity ``` 設置完畢後可以透過 `cat /proc/interrupts` 觀察 CPU 5 的 IRQ 數量,同時也可以量測到更穩定的實驗結果 ![](https://i.imgur.com/5YAexjp.png) ### 量測時間的方式 #### user space 使用 [`clock_gettime`](http://man7.org/linux/man-pages/man2/clock_getres.2.html) 來取得時間 ```cpp #include <time.h> ... struct timespec t1, t2; clock_gettime(CLOCK_MONOTONIC, &t1); // do something here clock_gettime(CLOCK_MONOTONIC, &t2); long long ut = (long long)(t2.tv_sec * 1e9 + t2.tv_nsec) - (t1.tv_sec * 1e9 + t1.tv_nsec); // ns ``` #### kernel space 使用 [ktime](https://www.kernel.org/doc/html/latest/core-api/timekeeping.html) 來量測執行時間,這裡參照作業說明的手法,挪用 `fib_write` 來回傳 kernel space 的執行時間 ```c static struct BigN fib_sequence(long long k) { /* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */ struct BigN f[k + 2]; f[0].upper = 0; f[0].lower = 0; f[1].upper = 0; f[1].lower = 1; for (int i = 2; i <= k; i++) addBigN(&f[i], f[i - 1], f[i - 2]); return f[k]; } static struct BigN fib_time_proxy(long long k) { kt = ktime_get(); struct BigN result = fib_sequence(k); kt = ktime_sub(ktime_get(), kt); return result; } ``` ### user space 與 kernel space 的傳遞時間 使用上述[量測時間的方式](#量測時間的方式)中提到的方式分別量測 user space 及 kernel space 花費的時間,再將兩者相減即可獲得 user space 與 kernel space 的傳遞時間 ```c for (int i = 0; i <= offset; i++) { lseek(fd, i, SEEK_SET); clock_gettime(CLOCK_MONOTONIC, &t1); sz = read(fd, buf, 1); clock_gettime(CLOCK_MONOTONIC, &t2); long long ut = (long long) (t2.tv_sec * 1e9 + t2.tv_nsec) - (t1.tv_sec * 1e9 + t1.tv_nsec); // ns time_val = write(fd, write_buf, strlen(write_buf)); printf("%03d %lld %lld %lld\n", i, time_val, ut, (ut - time_val)); } ``` ![](https://i.imgur.com/TR95vM8.png) ## 用統計手法去除極端值 假設數據分佈接近常態分佈的情況下,一個標準差到三個標準差之內的資料量約佔 68%, 95%, 99.7% ![](https://upload.wikimedia.org/wikipedia/commons/8/8c/Standard_deviation_diagram.svg) > 圖片來源: [wikipedia](https://en.wikipedia.org/wiki/Standard_deviation) ### Python script 實作以及結果 ```python def find_best_mean(file, ans_list): with open(file) as row_data: for line in row_data: # for each row, split and add nums to pool row = line.split() for ele_str in row[1:]: pool.append(int(ele_str)) # Calculate std for each pool data = np.array(pool) left = data.mean() - 2 * data.std() right = data.mean() + 2 * data.std() # Filter out and find mean average of 95% data for x in pool: if left <= x and x <= right: filter_pool.append(x) ans = np.array(filter_pool) ans_list.append(ans.mean()) filter_pool.clear() pool.clear() ``` 我們採用每次取100個樣本並留下 95% 的 data 取平均,可以發現畫出來的圖形平滑許多,不會再有漂移的情況 ![](https://i.imgur.com/C3z83Mg.png) ## 2. clz / ctz 對 Fibonacci 數運算的幫助 > 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 ## 3. 研讀 KYG-yaya573142 的報告 > 指出針對大數運算,有哪些加速運算和縮減記憶體操作成本的舉措? ## 4. lsmod output study > lsmod 的輸出結果有一欄名為 Used by,這是 “each module’s use count and a list of referring modules”,但如何實作出來呢?模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢? 搭配閱讀 The Linux driver implementer’s API guide » Driver Basics ## 5. fibdrv.c 中 mutex 分析 > 注意到 fibdrv.c 存在著 DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 POSIX Thread 的程式碼來確認。 ## 6. 思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本 > 複習 C 語言 數值系統 和 bitwise operation,思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本; ## 7. 改善 fibdrv 效能 > 在 GitHub 上 fork fibdrv,目標是修正 Fibonacci 數計算的正確性 (現有實作存在缺陷,請指出),隨後改善 fibdrv 計算 Fibinacci 數列的執行效率,過程中需要量化執行時間,可在 Linux 核心模組和使用層。 ### 1. 增加 fibdrv 運算數值 > * 原本的程式碼只能列出到 $Fibonacci(100)$ 而且==部分輸出是錯誤的數值==,請修改程式碼,列出後方更多數值 (注意: 檢查正確性和數值系統的使用) 由於原本的 fibdrv 只能跑到 offset=92 就會 overflow ,這邊先著手進行修改 $fib(92)$ 的計算結果用 16 進位表示是 `0x1 11F3 8AD0 840B F6BF`,超過 64 位元整數所夠表示的最大數值,因此考量到要正確輸出 $fib(100)$ 或數列更後面的數值,就必須使用到特製的結構來處理回傳值。我們利用作業裡的例子進行實作 #### string add function 為了要增加運算的位數值,這邊實作個字串相加的 function ```c static char *add_two_string(char *num1, char *num2) { int idx1 = strlen(num1) - 1; int idx2 = strlen(num2) - 1; int n = idx1 > idx2 ? strlen(num1) : strlen(num2); char *ret = kzalloc((n + 2) * sizeof(char), GFP_KERNEL); if (!ret) { printk(KERN_INFO "[%s] Cannot get memory with size %d\n", __func__, n + 2); return NULL; } int idxRet = n; int carry = 0; while (idx1 >= 0 || idx2 >= 0) { int sum = 0; if (idx1 >= 0) { sum += num1[idx1] - '0'; idx1--; } if (idx2 >= 0) { sum += num2[idx2] - '0'; idx2--; } sum += carry; carry = sum / 10; ret[idxRet] = sum % 10 + '0'; idxRet--; } if (carry) { ret[idxRet] = '1'; idxRet--; } return &(ret[idxRet + 1]); } ``` ```c static int fib_sequence(char **f, long long k) { f[0] = kzalloc(sizeof(char) + 1, GFP_KERNEL); f[1] = kzalloc(sizeof(char) + 1, GFP_KERNEL); strncpy(f[0], "0", 2); strncpy(f[1], "1", 2); for (int i = 2; i <= k; i++) { f[i] = add_two_string(f[i - 1], f[i - 2]); if (f[i] == NULL) return -ENOMEM; } return 0; } static int fib_time_proxy(char **f, long long k) { kt = ktime_get(); int ret = fib_sequence(f, k); kt = ktime_sub(ktime_get(), kt); return ret; } ``` 然後我在們 ```fib_read``` 的時候利用```copy_to_user``` 將運算的結果傳到 user space buffer ```c /* calculate the fibonacci number at given offset */ static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { char **f; if (*offset > 1) f = kmalloc((*offset + 1) * sizeof(f), GFP_KERNEL); else f = kmalloc(3 * sizeof(f), GFP_KERNEL); if (!f) { printk(KERN_INFO "[%s] Cannot get memory with size %lld\n", __func__, (*offset + 1)); return -ENOMEM; } int ret = fib_time_proxy(f, *offset); if (ret < 0) return ret; /* * Copy at most size bytes to user space. * Return ''0'' on success and some other value on error. */ if (copy_to_user(buf, f[*offset], strlen(f[*offset])+1)) return -EFAULT; else return 0; return *offset; } ``` **client.c** ```c for (int i = 0; i <= offset; i++) { lseek(fd, i, SEEK_SET); clock_gettime(CLOCK_MONOTONIC, &t1); sz = read(fd, buf, 160); clock_gettime(CLOCK_MONOTONIC, &t2); long long ut = (long long) (t2.tv_sec * 1e9 + t2.tv_nsec) - (t1.tv_sec * 1e9 + t1.tv_nsec); // ns time_val = write(fd, write_buf, strlen(write_buf)); fprintf(data, "%03d %lld %lld %lld\n", i, time_val, ut, (ut - time_val)); printf("Reading from " FIB_DEV " at offset %d, returned the sequence " "%s.\n", i, buf); if (sz < 0) break; } ``` 目前可以成功計算到 ```offset=500``` ``` Reading from /dev/fibonacci at offset 500, returned the sequence 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125. ``` :::danger TODO: 發現在 kfree 我們 kmalloc 出來的 ptr的時候會有 general protection fault 的 error ::: ``` ... [ 109.924479] offset: 0 with 0 [ 109.924491] offset: 1 with 1 [ 109.924495] offset: 2 with 1 [ 109.924499] offset: 3 with 2 [ 109.924503] offset: 4 with 3 [ 109.924507] offset: 5 with 5 [ 109.924514] general protection fault, probably for non-canonical address 0xa0ff89d2c365701a: 0000 [#1] SMP NOPTI [ 109.924516] CPU: 0 PID: 2186 Comm: client Tainted: P C OE 5.13.0-35-generic #40~20.04.1-Ubuntu [ 109.924518] Hardware name: Gigabyte Technology Co., Ltd. H470 HD3/H470 HD3, BIOS F4b 12/15/2020 [ 109.924519] RIP: 0010:__kmalloc+0xfd/0x4a0 [ 109.924523] Code: 08 65 4c 03 05 c4 00 73 52 49 83 78 10 00 4d 8b 20 0f 84 5d 03 00 00 4d 85 e4 0f 84 54 03 00 00 41 8b 47 28 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f b8 00 00 00 4c 89 e0 48 0f c9 48 31 cb [ 109.924524] RSP: 0018:ffff9e360112fd40 EFLAGS: 00010282 [ 109.924526] RAX: a0ff89d2c365701a RBX: 0000000000000001 RCX: 0000000000000032 [ 109.924527] RDX: 0000000000019001 RSI: 0000000000000dc0 RDI: 000000000003a040 [ 109.924528] RBP: ffff9e360112fd78 R08: ffff89d66d63a040 R09: ffff89d2c3657d22 [ 109.924529] R10: 0000000000000004 R11: ffff89d2c3657d23 R12: a0ff89d2c365701a [ 109.924529] R13: 0000000000000000 R14: 0000000000000dc0 R15: ffff89d2c0042200 [ 109.924530] FS: 00007fa0c2ec2540(0000) GS:ffff89d66d600000(0000) knlGS:0000000000000000 [ 109.924531] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 109.924532] CR2: 00005643b0135488 CR3: 0000000148e70003 CR4: 00000000007706f0 [ 109.924533] PKRU: 55555554 [ 109.924534] Call Trace: [ 109.924535] <TASK> [ 109.924536] ? fib_read+0x119/0x258 [fibdrv] [ 109.924539] fib_read+0x119/0x258 [fibdrv] [ 109.924540] vfs_read+0xa0/0x190 [ 109.924543] ksys_read+0x67/0xe0 [ 109.924545] __x64_sys_read+0x1a/0x20 [ 109.924547] do_syscall_64+0x61/0xb0 [ 109.924549] ? vfs_write+0x1c3/0x250 [ 109.924551] ? exit_to_user_mode_prepare+0x3d/0x1c0 [ 109.924553] ? syscall_exit_to_user_mode+0x27/0x50 [ 109.924555] ? __x64_sys_lseek+0x1a/0x20 [ 109.924556] ? do_syscall_64+0x6e/0xb0 [ 109.924558] ? do_syscall_64+0x6e/0xb0 [ 109.924559] ? do_syscall_64+0x6e/0xb0 [ 109.924561] ? exc_page_fault+0x8f/0x170 [ 109.924562] ? asm_exc_page_fault+0x8/0x30 [ 109.924564] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 109.924566] RIP: 0033:0x7fa0c2ddd002 [ 109.924567] Code: c0 e9 c2 fe ff ff 50 48 8d 3d 7a cb 0a 00 e8 d5 1a 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24 [ 109.924568] RSP: 002b:00007ffc6b410d38 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 109.924569] RAX: ffffffffffffffda RBX: 00005643af0315b0 RCX: 00007fa0c2ddd002 [ 109.924570] RDX: 0000000000000001 RSI: 00007ffc6b410daf RDI: 0000000000000003 [ 109.924571] RBP: 00007ffc6b410dd0 R08: 00007ffc6b524080 R09: 000000000000006d [ 109.924572] R10: 00007ffc6b524090 R11: 0000000000000246 R12: 00005643af031200 [ 109.924572] R13: 00007ffc6b410ec0 R14: 0000000000000000 R15: 0000000000000000 [ 109.924574] </TASK> [ 109.924574] Modules linked in: fibdrv(OE) rfcomm cmac algif_hash algif_skcipher af_alg bnep nls_iso8859_1 nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi snd_hda_codec_realtek soundwire_bus snd_hda_codec_generic snd_soc_core ledtrig_audio snd_compress snd_hda_codec_hdmi ac97_bus snd_pcm_dmaengine intel_rapl_msr snd_hda_intel intel_rapl_common snd_intel_dspcfg intel_tcc_cooling nvidia(POE) snd_intel_sdw_acpi snd_hda_codec x86_pkg_temp_thermal snd_hda_core intel_powerclamp coretemp snd_hwdep snd_pcm kvm_intel snd_seq_midi r8188eu(C) mei_hdcp kvm snd_seq_midi_event btusb lib80211 snd_rawmidi crct10dif_pclmul ghash_clmulni_intel drm_kms_helper snd_seq btrtl aesni_intel cfg80211 btbcm cec snd_seq_device btintel crypto_simd snd_timer rc_core cryptd [ 109.924600] bluetooth ucsi_ccg snd fb_sys_fops mei_me rapl typec_ucsi joydev syscopyarea ecdh_generic sysfillrect input_leds ecc sysimgblt typec soundcore intel_cstate gigabyte_wmi mei intel_pch_thermal efi_pstore ee1004 wmi_bmof intel_wmi_thunderbolt mxm_wmi mac_hid acpi_pad acpi_tad sch_fq_codel msr parport_pc ppdev lp parport drm ip_tables x_tables autofs4 hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid crc32_pclmul nvme ahci e1000e i2c_i801 xhci_pci nvme_core i2c_smbus libahci i2c_nvidia_gpu xhci_pci_renesas wmi video pinctrl_cannonlake [ 109.924620] ---[ end trace 35fe98851d94e884 ]--- [ 109.998967] RIP: 0010:__kmalloc+0xfd/0x4a0 [ 109.998971] Code: 08 65 4c 03 05 c4 00 73 52 49 83 78 10 00 4d 8b 20 0f 84 5d 03 00 00 4d 85 e4 0f 84 54 03 00 00 41 8b 47 28 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f b8 00 00 00 4c 89 e0 48 0f c9 48 31 cb [ 109.998973] RSP: 0018:ffff9e360112fd40 EFLAGS: 00010282 [ 109.998974] RAX: a0ff89d2c365701a RBX: 0000000000000001 RCX: 0000000000000032 [ 109.998975] RDX: 0000000000019001 RSI: 0000000000000dc0 RDI: 000000000003a040 [ 109.998976] RBP: ffff9e360112fd78 R08: ffff89d66d63a040 R09: ffff89d2c3657d22 [ 109.998977] R10: 0000000000000004 R11: ffff89d2c3657d23 R12: a0ff89d2c365701a [ 109.998978] R13: 0000000000000000 R14: 0000000000000dc0 R15: ffff89d2c0042200 [ 109.998979] FS: 00007fa0c2ec2540(0000) GS:ffff89d66d600000(0000) knlGS:0000000000000000 [ 109.998980] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 109.998981] CR2: 00005643b0135488 CR3: 0000000148e70003 CR4: 00000000007706f0 [ 109.998982] PKRU: 55555554 ``` ### 2. 改用 sysfs > 用 printk 固然方便,但其執行時間易受各種因素而不穩定,除了透過統計來觀察,也可改用核心的 sysfs 介面,後者可讓我們自行宣告 show 和 store 介面來提供讀寫操作,可參見 [Sample kobject implementation](https://elixir.bootlin.com/linux/latest/source/samples/kobject/kobject-example.c) (注意: 切換到對應的 Linux 核心版本)。 ### 3. 使用 clz / ffs 改善效能 > 逐步最佳化 Fibonacci 的執行時間,引入[費氏數列分析](https://hackmd.io/@sysprog/fibonacci-number) 提到的策略,並善用 [clz / ffs](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令 (Linux 核心有對應的包裝函式),過程中都要充分量化 ### 4. 研讀 bignum 與比較 > 嘗試研讀 [bignum](https://github.com/0xff07/bignum/tree/fibdrv) (`fibdrv` 分支) 的實作,理解其中的技巧並與你的實作進行效能比較,探討個中值得取鏡之處。 > - 原理和分析可見 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU) ### 5. Integer Endocing Algorithm > 當 $Fib(n)$ 隨著 $n$ 越來越大時,由 Linux 核心傳遞字串形式的十進位就成為效能瓶頸,研讀〈[Integer Encoding Algorithm 筆記](https://kkc.github.io/2021/01/30/integer-compression-note/)〉所提及的策略,在 Linux 核心模組先對數值進行編碼和壓縮,以二進位方式傳遞到使用者層級後,再由應用程式予以顯示 $Fib(n)$ 數值 > - 儘量降低由核心傳遞到應用程式的資料量 > - 確保應用程式不需要過多的計算量才可重現 Fibonacci 數列 ## 參考資料 - [Calculating fibonacci numbers by fast doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) - [KYG-yaya573142 的報告](https://hackmd.io/BZjwYw1FQ1O0NPY5kIS7VQ) - [Masamaloka fibdrv](https://hackmd.io/@Masamaloka/fibdrv)