---
tags: linux2022
---
# 2022q1 Homework3 (fibdrv)
contributed by < [howardjchen](https://github.com/howardjchen) >
> [作業要求](https://hackmd.io/@sysprog/linux2022-fibdrv)
## 自我檢查清單
- [x] 研讀 ==Linux 效能分析的提示== 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗;
- [ ] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
- [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise),思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
- [ ] 研讀 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU),指出針對大數運算,有哪些加速運算和縮減記憶體操作成本的舉措?
- [ ] `lsmod` 的輸出結果有一欄名為 `Used by`,這是 "each module's use count and a list of referring modules",但如何實作出來呢?模組間的相依性和實際使用次數 ([reference counting](https://en.wikipedia.org/wiki/Reference_counting)) 在 Linux 核心如何追蹤呢?
> 搭配閱讀 [The Linux driver implementer’s API guide » Driver Basics](https://www.kernel.org/doc/html/latest/driver-api/basics.html)
- [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 [POSIX Thread](https://en.wikipedia.org/wiki/POSIX_Threads) 的程式碼來確認。
## 開發環境
### Ubuntu Linux 系統資訊
```
Static hostname: pcie-lab
Icon name: computer-desktop
Chassis: desktop
Machine ID: fd5e7ae180f94e0cbbb273229cc399e3
Boot ID: 45ad9e2f7b1349a8bb6bb449830c36a1
Operating System: Ubuntu 20.04.2 LTS
Kernel: Linux 5.11.0-34-generic
Architecture: x86-64
```
```
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 165
Model name: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
Stepping: 5
CPU MHz: 2900.000
CPU max MHz: 4800.0000
CPU min MHz: 800.0000
BogoMIPS: 5799.77
Virtualization: VT-x
L1d cache: 256 KiB
L1i cache: 256 KiB
L2 cache: 2 MiB
L3 cache: 16 MiB
NUMA node0 CPU(s): 0-15
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
```
## 1. 設定 Linux 效能分析環境
由於原本的 fibdrv 只能跑到 offset=92 就會 overflow ,這邊先著手進行修改
$fib(92)$ 的計算結果用 16 進位表示是 `0x1 11F3 8AD0 840B F6BF`,超過 64 位元整數所夠表示的最大數值,因此考量到要正確輸出 $fib(100)$ 或數列更後面的數值,就必須使用到特製的結構來處理回傳值。我們利用作業裡的例子進行實作
```c=
struct BigN {
unsigned long long lower, upper;
};
```
使用 struct BigN 來將一個數字分成兩部份:
* 高位的 64 bits 保存 upper 中;
* 低位的 64 bit 則是保存在 lower 中;
進行大數加法時,則需要注意 lower 是否需要進位到 upper:
```cpp
static inline void addBigN(struct BigN *output, struct BigN x, struct BigN y)
{
output->upper = x.upper + y.upper;
if (y.lower > ~x.lower)
output->upper++;
output->lower = x.lower + y.lower;
}
```
利用 128 bit 可以表示到 offset=186
```
[32838.639067] offset: 186 FA63C8D9FA216A8F C8A7213B333270F8
```
---
我們先針對 ```fib_sequence``` 來進行時間量測
```c
static struct BigN fib_sequence(long long k)
{
/* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */
struct BigN f[k + 2];
f[0].upper = 0;
f[0].lower = 0;
f[1].upper = 0;
f[1].lower = 1;
for (int i = 2; i <= k; i++)
addBigN(&f[i], f[i - 1], f[i - 2]);
return f[k];
}
static struct BigN fib_time_proxy(long long k)
{
kt = ktime_get();
struct BigN result = fib_sequence(k);
kt = ktime_sub(ktime_get(), kt);
return result;
}
```
我們利用 write 來回傳計算的時間
```c
static ssize_t fib_write(struct file *file,
const char *buf,
size_t size,
loff_t *offset)
{
return ktime_to_ns(kt);
}
```
於是就有量測時間的結果,但量測數值有點浮動

### 排除干擾效能分析的因素
#### 限定 CPU 給特定的程式使用
參考 [KYG-yaya573142 的報告](https://hackmd.io/BZjwYw1FQ1O0NPY5kIS7VQ?both) 修改 `/etc/default/grub` 內`GRUB_CMDLINE_LINUX_DEFAULT`,加入 `isolcpus=5` 來指定編號 5 的核心不被排程器賦予任務,修改完成後需 `sudo update-grub` 來更新設定,重開機後即生效 (可從 `/sys/devices/system/cpu/isolated` 確認是否生效)
```
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=5"
```
修改後可觀察到 PID 1 - [init](https://en.wikipedia.org/wiki/Init) 的 affinity list 不包含該編號的 CPU
```shell
$ taskset -cp 1
pid 1's current affinity list: 0-4,6-15
```
#### 將程式固定在特定的 CPU 中執行
透過指定處理器親和性 (processor affinity,亦稱 CPU pinning),可以將程式固定在特定的 CPU 中執行
```shell
$ sudo taskset -c 5 ./client
```
#### 抑制 [address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization) (ASLR)
```shell
$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"
```
#### 設定 scaling_governor 為 performance
準備以下 shell script 來設定 CPU 以最高頻率執行所有 process
```shell
$ cat performance.sh
for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
echo performance > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
done
$ sudo sh performance.sh
```
#### 關閉 turbo mode
關閉 Intel 處理器的 turbo mode
```shell
$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
```
#### SMP IRQ affinity
執行上述步驟後進行量測,發現結果仍有飄動的情況

針對 [SMP IRQ affinity](https://www.kernel.org/doc/Documentation/IRQ-affinity.txt) 進行設置,盡量避免 CPU 5 去處理 IRQ。使用以下 script 進行設置,僅將 CPU 5 從可用清單去除,但不大幅度改變本來系統的設置 (例如 smp_affinity 原本是 0~15,只會被更改為 0~4, 6-15)
註: smp_affinity 和 smp_affinity_list 擇一設定即可
```bash
#!/bin/bash
for file in `find /proc/irq -name "smp_affinity"`
do
var=0x`cat ${file}`
var="$(( $var & 0xdf ))"
var=`printf '%.2x' ${var}`
echo "${var} > ${file}"
done
echo df > /proc/irq/default_smp_affinity
```
設置完畢後可以透過 `cat /proc/interrupts` 觀察 CPU 5 的 IRQ 數量,同時也可以量測到更穩定的實驗結果

### 量測時間的方式
#### user space
使用 [`clock_gettime`](http://man7.org/linux/man-pages/man2/clock_getres.2.html) 來取得時間
```cpp
#include <time.h>
...
struct timespec t1, t2;
clock_gettime(CLOCK_MONOTONIC, &t1);
// do something here
clock_gettime(CLOCK_MONOTONIC, &t2);
long long ut = (long long)(t2.tv_sec * 1e9 + t2.tv_nsec)
- (t1.tv_sec * 1e9 + t1.tv_nsec); // ns
```
#### kernel space
使用 [ktime](https://www.kernel.org/doc/html/latest/core-api/timekeeping.html) 來量測執行時間,這裡參照作業說明的手法,挪用 `fib_write` 來回傳 kernel space 的執行時間
```c
static struct BigN fib_sequence(long long k)
{
/* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */
struct BigN f[k + 2];
f[0].upper = 0;
f[0].lower = 0;
f[1].upper = 0;
f[1].lower = 1;
for (int i = 2; i <= k; i++)
addBigN(&f[i], f[i - 1], f[i - 2]);
return f[k];
}
static struct BigN fib_time_proxy(long long k)
{
kt = ktime_get();
struct BigN result = fib_sequence(k);
kt = ktime_sub(ktime_get(), kt);
return result;
}
```
### user space 與 kernel space 的傳遞時間
使用上述[量測時間的方式](#量測時間的方式)中提到的方式分別量測 user space 及 kernel space 花費的時間,再將兩者相減即可獲得 user space 與 kernel space 的傳遞時間
```c
for (int i = 0; i <= offset; i++) {
lseek(fd, i, SEEK_SET);
clock_gettime(CLOCK_MONOTONIC, &t1);
sz = read(fd, buf, 1);
clock_gettime(CLOCK_MONOTONIC, &t2);
long long ut = (long long) (t2.tv_sec * 1e9 + t2.tv_nsec) -
(t1.tv_sec * 1e9 + t1.tv_nsec); // ns
time_val = write(fd, write_buf, strlen(write_buf));
printf("%03d %lld %lld %lld\n", i, time_val, ut, (ut - time_val));
}
```

## 用統計手法去除極端值
假設數據分佈接近常態分佈的情況下,一個標準差到三個標準差之內的資料量約佔 68%, 95%, 99.7%

> 圖片來源: [wikipedia](https://en.wikipedia.org/wiki/Standard_deviation)
### Python script 實作以及結果
```python
def find_best_mean(file, ans_list):
with open(file) as row_data:
for line in row_data:
# for each row, split and add nums to pool
row = line.split()
for ele_str in row[1:]:
pool.append(int(ele_str))
# Calculate std for each pool
data = np.array(pool)
left = data.mean() - 2 * data.std()
right = data.mean() + 2 * data.std()
# Filter out and find mean average of 95% data
for x in pool:
if left <= x and x <= right:
filter_pool.append(x)
ans = np.array(filter_pool)
ans_list.append(ans.mean())
filter_pool.clear()
pool.clear()
```
我們採用每次取100個樣本並留下 95% 的 data 取平均,可以發現畫出來的圖形平滑許多,不會再有漂移的情況

## 2. clz / ctz 對 Fibonacci 數運算的幫助
> 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
## 3. 研讀 KYG-yaya573142 的報告
> 指出針對大數運算,有哪些加速運算和縮減記憶體操作成本的舉措?
## 4. lsmod output study
> lsmod 的輸出結果有一欄名為 Used by,這是 “each module’s use count and a list of referring modules”,但如何實作出來呢?模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢?
搭配閱讀 The Linux driver implementer’s API guide » Driver Basics
## 5. fibdrv.c 中 mutex 分析
> 注意到 fibdrv.c 存在著 DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 POSIX Thread 的程式碼來確認。
## 6. 思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本
> 複習 C 語言 數值系統 和 bitwise operation,思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
## 7. 改善 fibdrv 效能
> 在 GitHub 上 fork fibdrv,目標是修正 Fibonacci 數計算的正確性 (現有實作存在缺陷,請指出),隨後改善 fibdrv 計算 Fibinacci 數列的執行效率,過程中需要量化執行時間,可在 Linux 核心模組和使用層。
### 1. 增加 fibdrv 運算數值
> * 原本的程式碼只能列出到 $Fibonacci(100)$ 而且==部分輸出是錯誤的數值==,請修改程式碼,列出後方更多數值 (注意: 檢查正確性和數值系統的使用)
由於原本的 fibdrv 只能跑到 offset=92 就會 overflow ,這邊先著手進行修改 $fib(92)$ 的計算結果用 16 進位表示是 `0x1 11F3 8AD0 840B F6BF`,超過 64 位元整數所夠表示的最大數值,因此考量到要正確輸出 $fib(100)$ 或數列更後面的數值,就必須使用到特製的結構來處理回傳值。我們利用作業裡的例子進行實作
#### string add function
為了要增加運算的位數值,這邊實作個字串相加的 function
```c
static char *add_two_string(char *num1, char *num2)
{
int idx1 = strlen(num1) - 1;
int idx2 = strlen(num2) - 1;
int n = idx1 > idx2 ? strlen(num1) : strlen(num2);
char *ret = kzalloc((n + 2) * sizeof(char), GFP_KERNEL);
if (!ret) {
printk(KERN_INFO "[%s] Cannot get memory with size %d\n", __func__,
n + 2);
return NULL;
}
int idxRet = n;
int carry = 0;
while (idx1 >= 0 || idx2 >= 0) {
int sum = 0;
if (idx1 >= 0) {
sum += num1[idx1] - '0';
idx1--;
}
if (idx2 >= 0) {
sum += num2[idx2] - '0';
idx2--;
}
sum += carry;
carry = sum / 10;
ret[idxRet] = sum % 10 + '0';
idxRet--;
}
if (carry) {
ret[idxRet] = '1';
idxRet--;
}
return &(ret[idxRet + 1]);
}
```
```c
static int fib_sequence(char **f, long long k)
{
f[0] = kzalloc(sizeof(char) + 1, GFP_KERNEL);
f[1] = kzalloc(sizeof(char) + 1, GFP_KERNEL);
strncpy(f[0], "0", 2);
strncpy(f[1], "1", 2);
for (int i = 2; i <= k; i++) {
f[i] = add_two_string(f[i - 1], f[i - 2]);
if (f[i] == NULL)
return -ENOMEM;
}
return 0;
}
static int fib_time_proxy(char **f, long long k)
{
kt = ktime_get();
int ret = fib_sequence(f, k);
kt = ktime_sub(ktime_get(), kt);
return ret;
}
```
然後我在們 ```fib_read``` 的時候利用```copy_to_user``` 將運算的結果傳到 user space buffer
```c
/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
char *buf,
size_t size,
loff_t *offset)
{
char **f;
if (*offset > 1)
f = kmalloc((*offset + 1) * sizeof(f), GFP_KERNEL);
else
f = kmalloc(3 * sizeof(f), GFP_KERNEL);
if (!f) {
printk(KERN_INFO "[%s] Cannot get memory with size %lld\n", __func__,
(*offset + 1));
return -ENOMEM;
}
int ret = fib_time_proxy(f, *offset);
if (ret < 0)
return ret;
/*
* Copy at most size bytes to user space.
* Return ''0'' on success and some other value on error.
*/
if (copy_to_user(buf, f[*offset], strlen(f[*offset])+1))
return -EFAULT;
else
return 0;
return *offset;
}
```
**client.c**
```c
for (int i = 0; i <= offset; i++) {
lseek(fd, i, SEEK_SET);
clock_gettime(CLOCK_MONOTONIC, &t1);
sz = read(fd, buf, 160);
clock_gettime(CLOCK_MONOTONIC, &t2);
long long ut = (long long) (t2.tv_sec * 1e9 + t2.tv_nsec) -
(t1.tv_sec * 1e9 + t1.tv_nsec); // ns
time_val = write(fd, write_buf, strlen(write_buf));
fprintf(data, "%03d %lld %lld %lld\n", i, time_val, ut,
(ut - time_val));
printf("Reading from " FIB_DEV
" at offset %d, returned the sequence "
"%s.\n",
i, buf);
if (sz < 0)
break;
}
```
目前可以成功計算到 ```offset=500```
```
Reading from /dev/fibonacci at offset 500, returned the sequence 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125.
```
:::danger
TODO: 發現在 kfree 我們 kmalloc 出來的 ptr的時候會有 general protection fault 的 error
:::
```
...
[ 109.924479] offset: 0 with 0
[ 109.924491] offset: 1 with 1
[ 109.924495] offset: 2 with 1
[ 109.924499] offset: 3 with 2
[ 109.924503] offset: 4 with 3
[ 109.924507] offset: 5 with 5
[ 109.924514] general protection fault, probably for non-canonical address 0xa0ff89d2c365701a: 0000 [#1] SMP NOPTI
[ 109.924516] CPU: 0 PID: 2186 Comm: client Tainted: P C OE 5.13.0-35-generic #40~20.04.1-Ubuntu
[ 109.924518] Hardware name: Gigabyte Technology Co., Ltd. H470 HD3/H470 HD3, BIOS F4b 12/15/2020
[ 109.924519] RIP: 0010:__kmalloc+0xfd/0x4a0
[ 109.924523] Code: 08 65 4c 03 05 c4 00 73 52 49 83 78 10 00 4d 8b 20 0f 84 5d 03 00 00 4d 85 e4 0f 84 54 03 00 00 41 8b 47 28 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f b8 00 00 00 4c 89 e0 48 0f c9 48 31 cb
[ 109.924524] RSP: 0018:ffff9e360112fd40 EFLAGS: 00010282
[ 109.924526] RAX: a0ff89d2c365701a RBX: 0000000000000001 RCX: 0000000000000032
[ 109.924527] RDX: 0000000000019001 RSI: 0000000000000dc0 RDI: 000000000003a040
[ 109.924528] RBP: ffff9e360112fd78 R08: ffff89d66d63a040 R09: ffff89d2c3657d22
[ 109.924529] R10: 0000000000000004 R11: ffff89d2c3657d23 R12: a0ff89d2c365701a
[ 109.924529] R13: 0000000000000000 R14: 0000000000000dc0 R15: ffff89d2c0042200
[ 109.924530] FS: 00007fa0c2ec2540(0000) GS:ffff89d66d600000(0000) knlGS:0000000000000000
[ 109.924531] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 109.924532] CR2: 00005643b0135488 CR3: 0000000148e70003 CR4: 00000000007706f0
[ 109.924533] PKRU: 55555554
[ 109.924534] Call Trace:
[ 109.924535] <TASK>
[ 109.924536] ? fib_read+0x119/0x258 [fibdrv]
[ 109.924539] fib_read+0x119/0x258 [fibdrv]
[ 109.924540] vfs_read+0xa0/0x190
[ 109.924543] ksys_read+0x67/0xe0
[ 109.924545] __x64_sys_read+0x1a/0x20
[ 109.924547] do_syscall_64+0x61/0xb0
[ 109.924549] ? vfs_write+0x1c3/0x250
[ 109.924551] ? exit_to_user_mode_prepare+0x3d/0x1c0
[ 109.924553] ? syscall_exit_to_user_mode+0x27/0x50
[ 109.924555] ? __x64_sys_lseek+0x1a/0x20
[ 109.924556] ? do_syscall_64+0x6e/0xb0
[ 109.924558] ? do_syscall_64+0x6e/0xb0
[ 109.924559] ? do_syscall_64+0x6e/0xb0
[ 109.924561] ? exc_page_fault+0x8f/0x170
[ 109.924562] ? asm_exc_page_fault+0x8/0x30
[ 109.924564] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 109.924566] RIP: 0033:0x7fa0c2ddd002
[ 109.924567] Code: c0 e9 c2 fe ff ff 50 48 8d 3d 7a cb 0a 00 e8 d5 1a 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[ 109.924568] RSP: 002b:00007ffc6b410d38 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 109.924569] RAX: ffffffffffffffda RBX: 00005643af0315b0 RCX: 00007fa0c2ddd002
[ 109.924570] RDX: 0000000000000001 RSI: 00007ffc6b410daf RDI: 0000000000000003
[ 109.924571] RBP: 00007ffc6b410dd0 R08: 00007ffc6b524080 R09: 000000000000006d
[ 109.924572] R10: 00007ffc6b524090 R11: 0000000000000246 R12: 00005643af031200
[ 109.924572] R13: 00007ffc6b410ec0 R14: 0000000000000000 R15: 0000000000000000
[ 109.924574] </TASK>
[ 109.924574] Modules linked in: fibdrv(OE) rfcomm cmac algif_hash algif_skcipher af_alg bnep nls_iso8859_1 nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi snd_hda_codec_realtek soundwire_bus snd_hda_codec_generic snd_soc_core ledtrig_audio snd_compress snd_hda_codec_hdmi ac97_bus snd_pcm_dmaengine intel_rapl_msr snd_hda_intel intel_rapl_common snd_intel_dspcfg intel_tcc_cooling nvidia(POE) snd_intel_sdw_acpi snd_hda_codec x86_pkg_temp_thermal snd_hda_core intel_powerclamp coretemp snd_hwdep snd_pcm kvm_intel snd_seq_midi r8188eu(C) mei_hdcp kvm snd_seq_midi_event btusb lib80211 snd_rawmidi crct10dif_pclmul ghash_clmulni_intel drm_kms_helper snd_seq btrtl aesni_intel cfg80211 btbcm cec snd_seq_device btintel crypto_simd snd_timer rc_core cryptd
[ 109.924600] bluetooth ucsi_ccg snd fb_sys_fops mei_me rapl typec_ucsi joydev syscopyarea ecdh_generic sysfillrect input_leds ecc sysimgblt typec soundcore intel_cstate gigabyte_wmi mei intel_pch_thermal efi_pstore ee1004 wmi_bmof intel_wmi_thunderbolt mxm_wmi mac_hid acpi_pad acpi_tad sch_fq_codel msr parport_pc ppdev lp parport drm ip_tables x_tables autofs4 hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid crc32_pclmul nvme ahci e1000e i2c_i801 xhci_pci nvme_core i2c_smbus libahci i2c_nvidia_gpu xhci_pci_renesas wmi video pinctrl_cannonlake
[ 109.924620] ---[ end trace 35fe98851d94e884 ]---
[ 109.998967] RIP: 0010:__kmalloc+0xfd/0x4a0
[ 109.998971] Code: 08 65 4c 03 05 c4 00 73 52 49 83 78 10 00 4d 8b 20 0f 84 5d 03 00 00 4d 85 e4 0f 84 54 03 00 00 41 8b 47 28 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f b8 00 00 00 4c 89 e0 48 0f c9 48 31 cb
[ 109.998973] RSP: 0018:ffff9e360112fd40 EFLAGS: 00010282
[ 109.998974] RAX: a0ff89d2c365701a RBX: 0000000000000001 RCX: 0000000000000032
[ 109.998975] RDX: 0000000000019001 RSI: 0000000000000dc0 RDI: 000000000003a040
[ 109.998976] RBP: ffff9e360112fd78 R08: ffff89d66d63a040 R09: ffff89d2c3657d22
[ 109.998977] R10: 0000000000000004 R11: ffff89d2c3657d23 R12: a0ff89d2c365701a
[ 109.998978] R13: 0000000000000000 R14: 0000000000000dc0 R15: ffff89d2c0042200
[ 109.998979] FS: 00007fa0c2ec2540(0000) GS:ffff89d66d600000(0000) knlGS:0000000000000000
[ 109.998980] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 109.998981] CR2: 00005643b0135488 CR3: 0000000148e70003 CR4: 00000000007706f0
[ 109.998982] PKRU: 55555554
```
### 2. 改用 sysfs
> 用 printk 固然方便,但其執行時間易受各種因素而不穩定,除了透過統計來觀察,也可改用核心的 sysfs 介面,後者可讓我們自行宣告 show 和 store 介面來提供讀寫操作,可參見 [Sample kobject implementation](https://elixir.bootlin.com/linux/latest/source/samples/kobject/kobject-example.c) (注意: 切換到對應的 Linux 核心版本)。
### 3. 使用 clz / ffs 改善效能
> 逐步最佳化 Fibonacci 的執行時間,引入[費氏數列分析](https://hackmd.io/@sysprog/fibonacci-number) 提到的策略,並善用 [clz / ffs](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令 (Linux 核心有對應的包裝函式),過程中都要充分量化
### 4. 研讀 bignum 與比較
> 嘗試研讀 [bignum](https://github.com/0xff07/bignum/tree/fibdrv) (`fibdrv` 分支) 的實作,理解其中的技巧並與你的實作進行效能比較,探討個中值得取鏡之處。
> - 原理和分析可見 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU)
### 5. Integer Endocing Algorithm
> 當 $Fib(n)$ 隨著 $n$ 越來越大時,由 Linux 核心傳遞字串形式的十進位就成為效能瓶頸,研讀〈[Integer Encoding Algorithm 筆記](https://kkc.github.io/2021/01/30/integer-compression-note/)〉所提及的策略,在 Linux 核心模組先對數值進行編碼和壓縮,以二進位方式傳遞到使用者層級後,再由應用程式予以顯示 $Fib(n)$ 數值
> - 儘量降低由核心傳遞到應用程式的資料量
> - 確保應用程式不需要過多的計算量才可重現 Fibonacci 數列
## 參考資料
- [Calculating fibonacci numbers by fast doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling)
- [KYG-yaya573142 的報告](https://hackmd.io/BZjwYw1FQ1O0NPY5kIS7VQ)
- [Masamaloka fibdrv](https://hackmd.io/@Masamaloka/fibdrv)