---
tags: linux2022
---
# 2022q1 Homework3 (fibdrv)
contributed by < [laneser](https://github.com/laneser) >
> [作業要求](https://hackmd.io/@sysprog/linux2022-fibdrv)
## 前期準備
### 實驗環境
```shell
$ gcc --version
gcc (Debian 10.2.1-6) 10.2.1 20210110
$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 4
Model name: Cortex-A53
Stepping: r0p4
CPU max MHz: 1200.0000
CPU min MHz: 600.0000
BogoMIPS: 38.40
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Not affected
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fp asimd evtstrm crc32 cpuid
```
### 安裝 linux 以及設置環境
這次找 Raspberry Pi 3 Model B v1.2 安裝 linux 來作業.
透過 https://www.raspberrypi.com/software/ 可以很容易把全新的 linux 安裝在單台 Raspberry Pi 上面, 然後在 SD Card 上面的 `/boot/wpa_supplicant.conf` 檔案寫入 wifi 的相關資訊, 就可以透過 vscode 遠端開發. 非常方便又可以隨身攜帶.
- 檢查 linux 核心版本
```shell
$ uname -a
Linux raspberrypi 5.10.92-v8+ #1514 SMP PREEMPT Mon Jan 17 17:39:38 GMT 2022 aarch64 GNU/Linux
```
- 安裝 linux-headers 套件
```shell
apt-get install raspberrypi-kernel-headers
```
- 確認 linux-headers 套件已正確安裝於開發環境
```shell
$ sudo dpkg -L raspberrypi-kernel-headers | grep "/lib/modules"
/lib/modules
/lib/modules/5.10.92-v8+
/lib/modules/5.10.92-v8+/build
```
- 檢驗目前的使用者身份
```shell
$ whoami
lane
$ sudo whoami
root
```
- 安裝後續會用得到的工具
```shell
sudo apt install util-linux strace gnuplot-nox git cppcheck clang-format aspell
```
- 取得原始程式碼
```shell
lane@raspberrypi:~ $ git clone https://github.com/laneser/fibdrv
Cloning into 'fibdrv'...
remote: Enumerating objects: 80, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 80 (delta 3), reused 8 (delta 3), pack-reused 70
Receiving objects: 100% (80/80), 23.56 KiB | 893.00 KiB/s, done.
Resolving deltas: 100% (37/37), done.
lane@raspberrypi:~ $ cd fibdrv
```
- 編譯並測試
```shell
lane@raspberrypi:~/fibdrv $ make check
make -C /lib/modules/5.10.92-v8+/build M=/home/lane/fibdrv modules
make[1]: Entering directory '/usr/src/linux-headers-5.10.92-v8+'
make[1]: Leaving directory '/usr/src/linux-headers-5.10.92-v8+'
make unload
make[1]: Entering directory '/home/lane/fibdrv'
sudo rmmod fibdrv || true >/dev/null
make[1]: Leaving directory '/home/lane/fibdrv'
make load
make[1]: Entering directory '/home/lane/fibdrv'
sudo insmod fibdrv.ko
make[1]: Leaving directory '/home/lane/fibdrv'
sudo ./client > out
make unload
make[1]: Entering directory '/home/lane/fibdrv'
sudo rmmod fibdrv || true >/dev/null
make[1]: Leaving directory '/home/lane/fibdrv'
Passed [-]
f(93) fail
input: 7540113804746346429
expected: 12200160415121876738
```
- 觀察產生的 fibdrv.ko 核心模組
```shell
$ sudo modinfo fibdrv.ko
filename: /home/lane/fibdrv/fibdrv.ko
version: 0.1
description: Fibonacci engine driver
author: National Cheng Kung University, Taiwan
license: Dual MIT/GPL
srcversion: 9A01E3671A116ADA9F2BB0A
depends:
name: fibdrv
vermagic: 5.10.92-v8+ SMP preempt mod_unload modversions aarch64
```
- 觀察 fibdrv.ko 核心模組在 Linux 核心掛載後的行為
```shell
$ sudo insmod ./fibdrv.ko
$ ls -l /dev/fibonacci
crw------- 1 root root 238, 0 Mar 8 14:10 /dev/fibonacci
$ cat /sys/class/fibonacci/fibonacci/dev
238:0
$ cat /sys/module/fibdrv/version
0.1
$ lsmod | grep fibdrv
fibdrv 16384 0
$ cat /sys/module/fibdrv/refcnt
0
```
### 查看行程的 CPU affinity
編輯 `/boot/cmdline.txt` 在其後面加入 `isolcpus=0` , 重開後搞不定, 因為 `dmesg` 顯示 kernel 抱怨
```shell
Housekeeping: must include one present CPU, using boot CPU:0
```
根據規格這台 Raspberry Pi CPU 是 1.2 GHz 64-bit quad-core ARM Cortex-A53, 所以有四核, 可以設定 `isolcpus=3`, 再度重開機:
```shell
$ taskset -cp 1
pid 1's current affinity list: 0-2
$ cat /sys/devices/system/cpu/isolated
3
```
可以發現 pid1 已經使用 0-2 CPU (原本是 0-3)
### 將行程固定在特定的 CPU 中執行
```shell
$ taskset -cp 3 647
pid 647's current affinity list: 0-2
pid 647's new affinity list: 3
$ taskset -cp 647
pid 647's current affinity list: 3
$ taskset -cp 0-2 647
pid 647's current affinity list: 3
pid 647's new affinity list: 0-2
```
### 排除干擾效能分析的因素
#### 抑制 address space layout randomization (ASLR)
```shell
$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"
```
#### 設定 scaling_governor 為 performance
```shell
$ sudo su
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
ondemand
$ echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
$ echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
$ echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
$ echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
$ cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
performance
```
### 研讀費氏數列相關材料
最重要的就是 [Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) 手法, 乃是應用了下列二式
$$
\begin{align}
F(2k) = F(k)[2F(k+1) - F(k)]\\
F(2k+1) = F(k+1)^2+F(k)^2
\end{align}
$$
那麼主要的虛擬碼會是
```cpp
static long long fib(long n)
{
h = find_highest_set_bit(n); // 這部分可用 clz or ffs
從 n 的高位元執行 h 次到 n 的低位元的迴圈 {
0 => 求 F(2n), F(2n+1)
1 => 求 F(2n), F(2n+1), 再獲得 F(2n+2)
}
}
```
--
## 量測執行時間
因為要報告一路改善程式的執行時間, 所以第一步是要想辦法獲得 `fibdrv` 的執行時間.
首先看 `fibdrv.c` 可以知道這個 module 有五個介面可以讓 `client.c` 來使用:
- open : 可以傳的參數有 read/write 權限, 可以回傳 fd
- close : 可以傳入 fd, 回傳 int
- write : 可以傳入 fd, char 指標, size_t size, 回傳 ssize_t
- read : 可以傳入 fd, char 指標, size_t size, 回傳 ssize_t
- seek : 可以傳入 fd, size_t offset, int origin, 回傳 loff_t
參考 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU) 利用 write 來計算時間.
但程式的 `make check` 命令也用到 write 檢驗回傳值 (testing mode),
所以想要同時符合 testing mode 與計算時間:
```cpp
static long long (*fib_methods[])(long long) = {
fib_sequence,
};
/* write operation
* testing mode: buf exists, return 1.
* calculating mode: buf is NULL
* size is method number
* return cost time in ns
* return 0 if size is out of index
*/
static ssize_t fib_write(struct file *file,
const char *buf,
size_t size,
loff_t *offset)
{
if (buf)
return 1;
if (size >= sizeof(fib_methods) / sizeof(fib_methods[0]))
return 0;
ktime_t kt = ktime_get();
fib_methods[size](*offset);
return (ssize_t) ktime_to_ns(ktime_sub(ktime_get(), kt));
}
```
## 實作費氏數列
首先解這個警告 `ISO C90 forbids variable length array ‘f’ [-Wvla]`, 因為
```cpp
/* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */
long long f[k + 2];
```
雖然直覺就馬上用 `malloc`, 甚至 `kmalloc`, 但 kernel 的世界果然跟平常寫程式不同,
不過既然迴圈裡面同一個時間只用到三個變數, 其實可以丟棄之前的變數, 也就不需要那麼多空間.
但我也寫不出比 [Masamaloka fibdrv](https://hackmd.io/@Masamaloka/fibdrv#iterative) 更精簡的了.
```cpp
static long long fib_sequence(long long k)
{
long long f[] = {0, 1};
for (int i = 2; i <= k; i++) {
f[(i & 1)] += f[((i - 1) & 1)];
}
return f[(k & 1)];
}
```
於是就有量測時間的結果
![](https://i.imgur.com/k2ZAezX.png)
嘗試加上 user space 的 `clock_gettime`:
```cpp
struct timespec t1, t2;
clock_gettime(CLOCK_MONOTONIC, &t1);
time[m * 2][n] = (double) write(fd, NULL, m);
clock_gettime(CLOCK_MONOTONIC, &t2);
time[m * 2 + 1][n] =
(long long) (t2.tv_sec * 1e9 + t2.tv_nsec) -
(t1.tv_sec * 1e9 + t1.tv_nsec);
```
![](https://i.imgur.com/3t8TUv3.png)
Raspberry Pi 的 system call 耗費好像跟 intel based 差異很大.
### Fast Doubling
```cpp
// ref
// https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling
static long long fib_fast_doubling(long long n)
{
unsigned int h = 0;
for (unsigned int i = n; i; ++h, i >>= 1)
;
uint64_t a = 0;
uint64_t b = 1;
for (unsigned int mask = 1 << (h - 1); mask; mask >>= 1) {
uint64_t c = a * (2 * b - a);
uint64_t d = a * a + b * b;
if (mask & n) {
a = d;
b = c + d;
} else {
a = c;
b = d;
}
}
return a;
}
```
思考 clz / ctz 對這個函數的幫助, 看來只有一開始 `h` 的運算可以幫助
```cpp
unsigned int h = 0;
for (unsigned int i = n; i; ++h, i >>= 1);
```
所以寫一版 `clz / ctz` 的運算
```cpp
static long long fib_fast_doubling_clz(long long n)
{
if (n < 2)
return n;
uint64_t a = 0;
uint64_t b = 1;
for (unsigned int mask = 1 << (32 - __builtin_clzll(n)); mask; mask >>= 1) {
uint64_t c = a * (2 * b - a);
uint64_t d = a * a + b * b;
if (mask & n) {
a = d;
b = c + d;
} else {
a = c;
b = d;
}
}
return a;
}
```
![](https://i.imgur.com/oT4shxY.png)
有趣的事情是, 似乎沒有加速的效果, 好像變得更慢了一點點?
[KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU) 也提到一樣的事情.
但 KYG-yaya573142 是遇到優化而被 opt-out, 按照目前的 `fib_write`:
```cpp
static long long (*fib_methods[])(long long) = {fib_sequence, fib_fast_doubling,
fib_fast_doubling_clz};
static ssize_t fib_write(struct file *file,
const char *buf,
size_t size,
loff_t *offset)
{
if (buf)
return 1;
if (size >= sizeof(fib_methods) / sizeof(fib_methods[0]))
return 0;
ktime_t kt = ktime_get();
fib_methods[size](*offset);
return (ssize_t) ktime_to_ns(ktime_sub(ktime_get(), kt));
}
```
一但優化, `fib_sequence` 也會變快才是.
實際用 `objdump -dS fibdrv.ko` 也沒發現被 opt-out.
```shell
00000000000002b0 <fib_write>:
2b0: d503245f bti c
2b4: d503201f nop
2b8: d503201f nop
2bc: d2800020 mov x0, #0x1 // #1
2c0: b4000041 cbz x1, 2c8 <fib_write+0x18>
2c4: d65f03c0 ret
2c8: d503233f paciasp
2cc: a9be7bfd stp x29, x30, [sp, #-32]!
2d0: d2800000 mov x0, #0x0 // #0
2d4: 910003fd mov x29, sp
2d8: a90153f3 stp x19, x20, [sp, #16]
2dc: aa0203f3 mov x19, x2
2e0: f100085f cmp x2, #0x2
2e4: 54000188 b.hi 314 <fib_write+0x64> // b.pmore
2e8: aa0303f4 mov x20, x3
2ec: 94000000 bl 0 <ktime_get>
2f0: 90000002 adrp x2, 0 <fib_sequence>
2f4: 91000042 add x2, x2, #0x0
2f8: aa0003e1 mov x1, x0
2fc: f9400280 ldr x0, [x20]
300: aa0103f4 mov x20, x1
304: f8737841 ldr x1, [x2, x19, lsl #3]
308: d63f0020 blr x1
30c: 94000000 bl 0 <ktime_get>
310: cb140000 sub x0, x0, x20
314: a94153f3 ldp x19, x20, [sp, #16]
318: a8c27bfd ldp x29, x30, [sp], #32
31c: d50323bf autiasp
320: d65f03c0 ret
```
實際比較 clz 有無的 objdump
```shell
00000000000000a0 <fib_fast_doubling>:
a0: d503245f bti c
a4: d503201f nop
a8: d503201f nop
ac: d503233f paciasp
b0: 34000380 cbz w0, 120 <fib_fast_doubling+0x80>
b4: aa0003e4 mov x4, x0
b8: 2a0003e1 mov w1, w0
bc: 52800003 mov w3, #0x0 // #0
c0: 2a0303e0 mov w0, w3
c4: 53017c21 lsr w1, w1, #1
c8: 11000463 add w3, w3, #0x1
cc: 35ffffa1 cbnz w1, c0 <fib_fast_doubling+0x20>
d0: 52800022 mov w2, #0x1 // #1
d4: 1ac02042 lsl w2, w2, w0
d8: 34000242 cbz w2, 120 <fib_fast_doubling+0x80>
dc: d2800021 mov x1, #0x1 // #1
e0: d2800000 mov x0, #0x0 // #0
e4: d503201f nop
e8: d37ff823 lsl x3, x1, #1
ec: 9b017c21 mul x1, x1, x1
f0: cb000063 sub x3, x3, x0
f4: 9b000401 madd x1, x0, x0, x1
f8: 9b007c60 mul x0, x3, x0
fc: 6a04005f tst w2, w4
100: 54000080 b.eq 110 <fib_fast_doubling+0x70> // b.none
104: 8b010003 add x3, x0, x1
108: aa0103e0 mov x0, x1
10c: aa0303e1 mov x1, x3
110: 53017c42 lsr w2, w2, #1
114: 35fffea2 cbnz w2, e8 <fib_fast_doubling+0x48>
118: d50323bf autiasp
11c: d65f03c0 ret
120: d2800000 mov x0, #0x0 // #0
124: d50323bf autiasp
128: d65f03c0 ret
12c: d503201f nop
```
```shell
0000000000000234 <fib_fast_doubling_clz>:
234: d503245f bti c
238: d503201f nop
23c: d503201f nop
240: aa0003e4 mov x4, x0
244: d503233f paciasp
248: f100041f cmp x0, #0x1
24c: 540002ad b.le 2a0 <fib_fast_doubling_clz+0x6c>
250: dac01000 clz x0, x0
254: 52800402 mov w2, #0x20 // #32
258: 4b000040 sub w0, w2, w0
25c: 52800022 mov w2, #0x1 // #1
260: 1ac02042 lsl w2, w2, w0
264: 34000222 cbz w2, 2a8 <fib_fast_doubling_clz+0x74>
268: d2800021 mov x1, #0x1 // #1
26c: d2800000 mov x0, #0x0 // #0
270: d37ff823 lsl x3, x1, #1
274: 9b017c21 mul x1, x1, x1
278: cb000063 sub x3, x3, x0
27c: 9b000401 madd x1, x0, x0, x1
280: 9b007c60 mul x0, x3, x0
284: 6a04005f tst w2, w4
288: 54000080 b.eq 298 <fib_fast_doubling_clz+0x64> // b.none
28c: 8b010003 add x3, x0, x1
290: aa0103e0 mov x0, x1
294: aa0303e1 mov x1, x3
298: 53017c42 lsr w2, w2, #1
29c: 35fffea2 cbnz w2, 270 <fib_fast_doubling_clz+0x3c>
2a0: d50323bf autiasp
2a4: d65f03c0 ret
2a8: d2800000 mov x0, #0x0 // #0
2ac: 17fffffd b 2a0 <fib_fast_doubling_clz+0x6c>
```
`fib_fast_doubling_clz` 版本的確比較小, 也用到 clz, 但就是差不多甚至比較慢一點.
如果單純比較
`(32 - __builtin_clzll(n)` vs `for (unsigned int i = n; i; ++h, i >>= 1);` , 就會發現 clz 的確比較快一點.
![](https://i.imgur.com/gBUeLjm.png)
但只運算 `clz` 居然跟運算費氏數列差異為 100 ns 跟 180 ns 這樣的等級, 應該是時間量測的精度不足.
如果是時間量測精確程度問題, 那麼重複 10 次應該可以緩和, 在 `fib_write` 內部量測時間的時候:
```cpp
ktime_t kt = ktime_get();
int i = 10;
while (i--)
fib_methods[size](*offset);
return (ssize_t) ktime_to_ns(ktime_sub(ktime_get(), kt));
```
![](https://i.imgur.com/e4UcQPT.png)
跟原本耗費 100 ns 相比, 接近 80 的時候才有 10 倍的數據.
而 ARM Cortex-A53 的 clz 表現只有快一點點.
### 思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本
可以觀察到 Fast doubling 當中
```cpp
uint64_t d = a * a + b * b;
```
用了兩個平方, 而運算 $n^2$ 可以思考:
$$
\begin{cases}
n^2 = (2*\frac{n}{2})^2 = 4*(\frac{n}{2})^2, & \text{if $n$ is even and $n$ >= 2} \\
n^2 = (2*\frac{n}{2}+1)^2 = 4*(\frac{n}{2})^2 + 4*(\frac{n}{2}) + 1 & \text{if $n$ is odd and $n$ >= 2}\\
\end{cases}
$$
---
## 自我檢查清單
- [x] 研讀 ==Linux 效能分析的提示== 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗;
- [x] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
- [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise),思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
- [ ] 研讀 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU),指出針對大數運算,有哪些加速運算和縮減記憶體操作成本的舉措?
- [ ] `lsmod` 的輸出結果有一欄名為 `Used by`,這是 "each module's use count and a list of referring modules",但如何實作出來呢?模組間的相依性和實際使用次數 ([reference counting](https://en.wikipedia.org/wiki/Reference_counting)) 在 Linux 核心如何追蹤呢?
> 搭配閱讀 [The Linux driver implementer’s API guide » Driver Basics](https://www.kernel.org/doc/html/latest/driver-api/basics.html)
- [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 [POSIX Thread](https://en.wikipedia.org/wiki/POSIX_Threads) 的程式碼來確認。
---
## 參考資料
- [Calculating fibonacci numbers by fast doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling)
- [KYG-yaya573142 的報告](https://hackmd.io/BZjwYw1FQ1O0NPY5kIS7VQ)
- [Masamaloka fibdrv](https://hackmd.io/@Masamaloka/fibdrv)