---
tags: linux2022
---
# 2022q1 Homework3 (fibdrv)
contributed by < [Andy240881](https://github.com/Andy240881) >
> [作業要求](https://hackmd.io/@sysprog/linux2022-fibdrv)
## 自我檢查清單
- [x] 研讀 ==Linux 效能分析的提示== 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗;
- [x] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
- [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise),思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
- [ ] 研讀 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU),指出針對大數運算,有哪些加速運算和縮減記憶體操作成本的舉措?
- [ ] `lsmod` 的輸出結果有一欄名為 `Used by`,這是 "each module's use count and a list of referring modules",但如何實作出來呢?模組間的相依性和實際使用次數 ([reference counting](https://en.wikipedia.org/wiki/Reference_counting)) 在 Linux 核心如何追蹤呢?
> 搭配閱讀 [The Linux driver implementer’s API guide » Driver Basics](https://www.kernel.org/doc/html/latest/driver-api/basics.html)
- [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 [POSIX Thread](https://en.wikipedia.org/wiki/POSIX_Threads) 的程式碼來確認。
---
## 前期準備
### 實驗環境
```shell
$ gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
Stepping: 10
CPU MHz: 1000.272
CPU max MHz: 3201.0000
CPU min MHz: 800.0000
BogoMIPS: 6399.96
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0-11
```
### 安裝 linux 以及設置環境
- 檢查 linux 核心版本
```shell
$ uname -a
Linux andy-System-Product-Name 5.4.0-107-generic #121~18.04.1-Ubuntu SMP Thu Mar 24 17:21:33 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
```
- 安裝 linux-headers 套件
```shell
$ sudo apt install linux-headers-`uname -r`
```
- 確認 linux-headers 套件已正確安裝於開發環境
```shell
$ dpkg -L linux-headers-5.4.0-66-generic | grep "/lib/modules"
```
- 檢驗目前的使用者身份
```shell
$ whoami
andy
$ sudo whoami
root
```
- 安裝後續會用得到的工具
```shell
$ sudo apt install util-linux strace gnuplot-nox git cppcheck clang-format aspell
```
- 取得原始程式碼
```shell
$ git clone https://github.com/laneser/fibdrv
```
- 編譯並測試
```shell
$ make check
```
- 觀察產生的 fibdrv.ko 核心模組
```shell
$ sudo modinfo fibdrv.ko
filename: /home/andy/linux2022/fibdrv/fibdrv.ko
version: 0.1
description: Fibonacci engine driver
author: National Cheng Kung University, Taiwan
license: Dual MIT/GPL
srcversion: 050B4D3F798E40B5F2FE613
depends:
retpoline: Y
name: fibdrv
vermagic: 5.4.0-107-generic SMP mod_unload modversions
```
- 觀察 fibdrv.ko 核心模組在 Linux 核心掛載後的行為
```shell
$ sudo insmod ./fibdrv.ko
$ ls -l /dev/fibonacci
crw------- 1 root root 236, 0 Apr 25 18:32 /dev/fibonacci
$ cat /sys/class/fibonacci/fibonacci/dev
236:0
$ cat /sys/module/fibdrv/version
0.1
$ lsmod | grep fibdrv
fibdrv 16384 0
$ cat /sys/module/fibdrv/refcnt
0
```
### 將行程固定到特定的CPU上執行
- 查看行程目前的 CPU affinity。
```shell
$ taskset -cp 1
pid 1's current affinity list: 0-11
```
- 加入 isolcpus=2 到 /etc/default/grub 中GRUB_CMDLINE_LINUX_DEFAULT的那行。
- 更新 grub 設定後重開機。
- CPU:3 已被孤立。
```shell
$ sudo update-grub
$ sudo reboot
$ taskset -cp 1
pid 1's current affinity list: 0,1,3-11
```
- 以 CPU:3 執行 client (測試執行時間)
```shell
$ sudo taskset 0x11 ./client_statistic
```
### 排除干擾效能分析的因素
* 抑制 [address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization) (ASLR)
```shell
$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"
```
* 設定 scaling_governor 為 performance。準備以下 shell script,檔名為 `performance.sh`:
```shell
for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
echo performance > ${i}
done
```
之後再用 `$ sudo sh performance.sh` 執行
### 研讀費氏數列相關材料
最重要的就是 [Fast Doubling](https://scm.iis.sinica.edu.tw/ncs/2019/06/how-to-compute-fibonacci/) 手法, 應用了下列二式
$$
\begin{align}
F(2k) = F(k)[2F(k+1) - F(k)]\\
F(2k+1) = F(k+1)^2+F(k)^2
\end{align}
$$
虛擬碼:
```python
Fast_Fib(n)
a = 0; b = 1; // m = 0
for i = (number of binary digit in n) to 1
t1 = a*(2*b - a);
t2 = b^2 + a^2;
a = t1; b = t2; // m *= 2
if (current binary digit == 1)
t1 = a + b; // m++
a = b; b = t1;
return a;
```
### 虛擬碼的解讀
* 以求 F(n) , n = 10 為例:
* 用 top down 的方式思考,為了應用 Fast doubling 的兩個等式,需依序尋找 F(n >> 1) -> F(n >> 2) -> F(n >> 3)。
* F(1010~2~) -> F(101~2~) -> F(10~2~) -> F(1~2~)
* 用 buttom up 的方式來計算,利用已知 F(0) = 0 、 F(1) = 1, n 為奇數則利用$$A: F(2k+1) = F(k+1)^2+F(k)^2$$,反之則利用$$B: F(2k) = F(k)[2F(k+1) - F(k)]$$
* 計算過程:F(0) ->~A~ F(1) ->~B~ F(2) ->~A~ F(5) ->~B~ F(10)
## 實作費氏數列並觀察運算時間
### 量測執行時間
參考 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU) 與 [laneser](https://hackmd.io/@laneser/fibdrv),利用 write 來計算執行時間。
```c
static long long (*fib_methods[])(long long) = {fib_sequence, fib_sequence_fd,
fib_sequence_fd_clz};
static ssize_t fib_write(struct file *file,
const char *buf,
size_t size,
loff_t *offset)
{
ktime_t kt;
if (buf)
return 1;
kt = ktime_get();
for (int i = 0; i < 10; i++) {
fib_methods[size](*offset);
}
return (ssize_t) ktime_to_ns(ktime_sub(ktime_get(), kt));
}
```
在 user space 利用 `clock_gettime` 計算時間。
```c
clock_gettime(CLOCK_MONOTONIC, &t1);
k_time[m][n] =
(double) write(fd, NULL, m); /* runtime in kernel space */
clock_gettime(CLOCK_MONOTONIC, &t2);
u_time[m][n] = (long long) (t2.tv_sec * 1e9 + t2.tv_nsec) -
(t1.tv_sec * 1e9 + t1.tv_nsec);
```
初步的運算時間統計結果:

### 實作費氏數列
針對 `ISO C90 forbids variable length array ‘f’ [-Wvla]` 的警告,
```c=
/* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */
long long f[k + 2];
```
參考 [Masamaloka fibdrv](https://hackmd.io/@Masamaloka/fibdrv#iterative) 的解法,
```c
static long long fib_sequence(long long k)
{
long long f[] = {0, 1};
for (int i = 2; i <= k; i++) {
f[(i & 1)] += f[((i - 1) & 1)];
}
return f[(k & 1)];
}
```
因為迴圈內一次只會用到三個變數,可以重複利用固定的變數來計算 fibonacci number。
#### Fast doubling
```c
static long long fib_sequence_fd(long long k)
{
unsigned int h = 0;
uint64_t a = 0;
uint64_t b = 1;
for (unsigned int i = k; i; ++h, i >>= 1)
;
for (unsigned int mask = 1U << (h - 1); mask; mask >>= 1) {
uint64_t t1 = a * (2 * b - a); // fib(2m)
uint64_t t2 = b * b + a * a; // fib(2m+1)
if (mask & k) {
a = t2;
b = t1 + t2;
} else {
a = t1;
b = t2;
}
}
return a;
}
```
利用 clz / ctz 加速計算MSB的位置
```c
static long long fib_sequence_fd_clz(long long k)
{
if (k < 2)
return k;
unsigned int h = 0;
uint64_t a = 0;
uint64_t b = 1;
h = 31 - __builtin_clzll(k);
for (unsigned int mask = 1U << h; mask; mask >>= 1) {
uint64_t t1 = a * (2 * b - a); // fib(2m)
uint64_t t2 = b * b + a * a; // fib(2m+1)
if (mask & k) {
a = t2;
b = t1 + t2;
} else {
a = t1;
b = t2;
}
}
return a;
}
```

## 實做大數運算
> 參考 [KYG-yaya573142](https://hackmd.io/@KYWeng/rkGdultSU)
#### Big number 資料結構
為了計算 Fib(92) 之後的數值,利用長度可變動的表示法:
* `number` - 整數陣列,與 bn 的 size、sign 組合使用可以表達 $\pm(2^{32} \times size-1)$ 以內的整數。
* `size` - number 陣列的長度
* `sign` - 0 為正數、1 為負數
```c
/*
* bignum data structure
* number[0] contains least significant bits
* number[size - 1] contains most significant bits
* sign = 1 for negative number
*/
typedef struct _bn {
unsigned int *number;
unsigned int size;
int sign;
} bn;
```
#### bn_to_string
:::spoiler 完整程式碼
```c
/*
* output bn to decimal string
* Note: the returned string should be freed with the kfree()
*/
char *bn_to_string(const bn *src)
{
// log10(x) = log2(x) / log2(10) ~= log2(x) / 3.322
size_t len = (8 * sizeof(int) * src->size) / 3 + 2 + src->sign;
char *s = kmalloc(len, GFP_KERNEL);
char *p = s;
memset(s, '0', len - 1);
s[len - 1] = '\0';
/* src.number[0] contains least significant bits */
for (int i = src->size - 1; i >= 0; i--) {
/* walk through every bit of bn */
for (unsigned int d = 1U << 31; d; d >>= 1) {
int carry = !!(d & src->number[i]);
/* binary -> decimal string */
for (int j = len - 2; j >= 0; j--) {
s[j] += s[j] - '0' + carry;
carry = (s[j] > '9');
if (carry)
s[j] -= 10;
}
}
}
// skip leading zero
while (p[0] == '0' && p[1] != '\0') {
p++;
}
if (src->sign)
*(--p) = '-';
memmove(s, p, strlen(p) + 1);
return s;
}
```
:::
**程式碼解析:**
```c
// log10(x) = log2(x) / log2(10) ~= log2(x) / 3.322
size_t len = (8 * sizeof(int) * src->size) / 3 + 2 + src->sign;
```
* 計算將二進位的整數轉為十進位的字串所需的字串長度,
* `(8 * sizeof(int) * src->size)/3` - $\log_{10}x=\log_2x/log_2{10}$,$\log_2x$ 等於number陣列中所佔有的bit數,但還沒搞懂為什麼是除 3 而不是3.322。
* `+2` - 結尾的`'\0'` 與 (?)。
* `src->sign` - 若為負數,則 sign 為 1,需要佔據一個位置存放負號。
```c=
/* src.number[0] contains least significant bits */
for (int i = src->size - 1; i >= 0; i--) {
/* walk through every bit of bn */
for (unsigned int d = 1U << 31; d; d >>= 1) {
int carry = !!(d & src->number[i]);
/* binary -> decimal string */
for (int j = len - 2; j >= 0; j--) {
s[j] += s[j] - '0' + carry;
carry = (s[j] > '9');
if (carry)
s[j] -= 10;
}
}
}
```
* 將二進位的整數轉為十進位的字串
* `line 2` - 從 number 的高位元到低為元
* `line 5` - 確認 src->number[i] 裡面的每個 bit 是否為 0
* `line 7` - 由字串的低位元到高位元
* `line 8`
1. 可以理解為`s[j] = 2 * int(s[j]) + '0' + carry`,使得 s[j] 乘二累加
2. 以 size = 1, src->number[0] = $0110...0_2$ = $6_{10}$為例,
3. 在 `d = 1U << 2` 時,`carry = 1` ,s[j] = 2 * 0 + '0' + 1 = '1'
4. 在 `d = 1U << 1` 時,`carry = 1` ,s[j] = 2 * 1 + '0' + 1 = '3'
5. 在 `d = 1U` 時,`carry = 0` ,s[j] = 2 * 3 + '0' + 0 = '6'
* `line 9 ~ 11` - 進位處理
#### bn_add / bn_sub
:::spoiler
```c
/*
* c = a + b
* Note: work for c == a or c == b
*/
void bn_add(const bn *a, const bn *b, bn *c)
{
if (a->sign == b->sign) { // both positive or negative
bn_do_add(a, b, c);
c->sign = a->sign;
} else { // different sign
if (a->sign) // let a > 0, b < 0
SWAP(a, b);
int cmp = bn_cmp(a, b);
if (cmp > 0) {
/* |a| > |b| and b < 0, hence c = a - |b| */
bn_do_sub(a, b, c);
c->sign = 0;
} else if (cmp < 0) {
/* |a| < |b| and b < 0, hence c = -(|b| - |a|) */
bn_do_sub(b, a, c);
c->sign = 1;
} else {
/* |a| == |b| */
bn_resize(c, 1);
c->number[0] = 0;
c->sign = 0;
}
}
}
/*
* c = a - b
* Note: work for c == a or c == b
*/
void bn_sub(const bn *a, const bn *b, bn *c)
{
/* xor the sign bit of b and let bn_add handle it */
bn tmp = *b;
tmp.sign ^= 1; // a - b = a + (-b)
bn_add(a, &tmp, c);
}
```
:::
* `bn_add` 負責所有正負號的判斷,所以 bn_sub 只是改變 b 的正負號後,再直接交給 bn_add 判斷,但不能直接改變 b 的數值,所以這裡使用 tmp 來暫時的賦予不同的正負號
* 若 tmp.sign = 1,b < 0,則變號帶入 `bn_add` (a - b = a+ (-b))
* 若 tmp.sign = 0,b > 0,則變號帶入 `bn_add` (a - (-b) = a + b)
* `bn_cmp` 負責比對兩個 bn 物件開絕對值後的大小,邏輯類似 strcmp
##### bn_do_add
:::spoiler 完整程式碼
```c
/* |c| = |a| + |b| */
static void bn_do_add(const bn *a, const bn *b, bn *c)
{
// max digits = max(sizeof(a) + sizeof(b)) + 1
int d = MAX(bn_msb(a), bn_msb(b)) + 1;
d = DIV_ROUNDUP(d, 32) + !d;
bn_resize(c, d); // round up, min size = 1
unsigned long long int carry = 0;
for (int i = 0; i < c->size; i++) {
unsigned int tmp1 = (i < a->size) ? a->number[i] : 0;
unsigned int tmp2 = (i < b->size) ? b->number[i] : 0;
carry += (unsigned long long int) tmp1 + tmp2;
c->number[i] = carry;
carry >>= 32;
}
if (!c->number[c->size - 1] && c->size > 1)
bn_resize(c, c->size - 1);
}
```
:::
```c=
int d = MAX(bn_msb(a), bn_msb(b)) + 1;
d = DIV_ROUNDUP(d, 32) + !d;
bn_resize(c, d); // round up, min size = 1
```
* 計算加法所需的最大位元數
* `line 1` - 取 a, b 中最大的位數當作 d。
* 推測 `+1` 是避免 `MAX(bn_msb(a), bn_msb(b))` 剛好是 32 的倍數,導致相加後溢位。
* `line 2` -
* [DIV_ROUNDUP](https://medium.com/@arunistime/how-div-round-up-works-179f1a2113b5) 回傳 d 是 32 的幾倍。
* 若 `d = 0`,則 `DIV_ROUNDUP` 會得到 0,需要加上 !d 修正,使得最小值至少為 1
* `line 3` - 將 c 的 size 增加至 d
```c=+
unsigned long long int carry = 0;
for (int i = 0; i < c->size; i++) {
unsigned int tmp1 = (i < a->size) ? a->number[i] : 0;
unsigned int tmp2 = (i < b->size) ? b->number[i] : 0;
carry += (unsigned long long int) tmp1 + tmp2;
c->number[i] = carry;
carry >>= 32;
}
```
* 使用 8 bytes 大小的 carry 來實行兩個 4 bytes 項目的加法來避免 overflow。
* `line 8` - 等號右方記得要先將其中一方進行 [integer promotion](https://stackoverflow.com/questions/44455248/integer-promotion-in-c),不然會先被 truncated 然後才 implicit integer promotion
* `line 10` - 保留進位的結果 (carry 32 ~ 63 位元)。
##### bn_do_sub
::: spoiler 完整程式碼
```c
/*
* |c| = |a| - |b|
* Note: |a| > |b| must be true
*/
static void bn_do_sub(const bn *a, const bn *b, bn *c)
{
// max digits = max(sizeof(a) + sizeof(b))
int d = MAX(a->size, b->size);
bn_resize(c, d);
long long int carry = 0;
for (int i = 0; i < c->size; i++) {
unsigned int tmp1 = (i < a->size) ? a->number[i] : 0;
unsigned int tmp2 = (i < b->size) ? b->number[i] : 0;
carry = (long long int) tmp1 - tmp2 - carry;
if (carry < 0) {
c->number[i] = carry + (1LL << 32);
carry = 1;
} else {
c->number[i] = carry;
carry = 0;
}
}
d = bn_clz(c) / 32;
if (d == c->size)
--d;
bn_resize(c, c->size - d);
}
```
:::
* 實際上使用無號整數進行計算,因此若絕對值相減會小於 0,需先對調 a 與 b,並於計算完成後再再補上負號
* 計算的邏輯和 bn_do_add 一樣,不過此時 carry 是作為借位使用
#### bn_multi
:::spoiler
```c=
/*
* c = a x b
* Note: work for c == a or c == b
* using the simple quadratic-time algorithm (long multiplication)
*/
void bn_mult(const bn *a, const bn *b, bn *c)
{
// max digits = sizeof(a) + sizeof(b))
int d = bn_msb(a) + bn_msb(b);
d = DIV_ROUNDUP(d, 32) + !d; // round up, min size = 1
bn *tmp;
/* make it work properly when c == a or c == b */
if (c == a || c == b) {
tmp = c; // save c
c = bn_alloc(d);
} else {
tmp = NULL;
for (int i = 0; i < c->size; i++)
c->number[i] = 0;
bn_resize(c, d);
}
for (int i = 0; i < a->size; i++) {
for (int j = 0; j < b->size; j++) {
unsigned long long int carry = 0;
carry = (unsigned long long int) a->number[i] * b->number[j];
bn_mult_add(c, i + j, carry);
}
}
c->sign = a->sign ^ b->sign;
if (tmp) {
bn_swap(tmp, c); // restore c
bn_free(c);
}
}
```
:::
## 參考資料
* [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU)
* [laneser](https://hackmd.io/@laneser/fibdrv)