# 2022q1 Homework3 (fibdrv)
contributed by < [2020leon](https://github.com/2020leon) >
## [作業要求](https://hackmd.io/@sysprog/linux2022-fibdrv)
### 自我檢查清單
- [x] 研讀上述 ==Linux 效能分析的提示== 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗;
- [ ] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
- [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise),思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
- [ ] 研讀 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU),指出針對大數運算,有哪些加速運算和縮減記憶體操作成本的舉措?
- [ ] `lsmod` 的輸出結果有一欄名為 `Used by`,這是 "each module's use count and a list of referring modules",但如何實作出來呢?模組間的相依性和實際使用次數 ([reference counting](https://en.wikipedia.org/wiki/Reference_counting)) 在 Linux 核心如何追蹤呢?
> 搭配閱讀 [The Linux driver implementer’s API guide » Driver Basics](https://www.kernel.org/doc/html/latest/driver-api/basics.html)
- [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 [POSIX Thread](https://en.wikipedia.org/wiki/POSIX_Threads) 的程式碼來確認。
#### 作業要求
- [ ] 回答上述==自我檢查清單==的所有問題,需要附上對應的參考資料和必要的程式碼,以第一手材料 (包含自己設計的實驗) 為佳
:::warning
:warning: 如果你在 2022 年 3 月 1 日前,已從 GitHub [sysprog21/fibdrv](https://github.com/sysprog21/fibdrv) 進行 fork,請依據 ==[Alternatives to forking into the same account](https://github.community/t5/Support-Protips/Alternatives-to-forking-into-the-same-account/ba-p/7428)== 一文,對舊的 repository 做對應處置,然後重新 fork
:::
- [ ] 在 GitHub 上 fork [fibdrv](https://github.com/sysprog21/fibdrv),目標是修正 Fibonacci 數計算的正確性 (現有實作存在缺陷,請指出),隨後改善 `fibdrv` 計算 Fibinacci 數列的執行效率,過程中需要量化執行時間,可在 Linux 核心模組和使用層級去測量
- 原本的程式碼只能列出到 $Fibonacci(100)$ 而且==部分輸出是錯誤的數值==,請修改程式碼,列出後方更多數值 (注意: 檢查正確性和數值系統的使用)
- 務必研讀上述 ==Linux 效能分析的提示== 的描述,降低效能分析過程中的干擾;
- 在 Linux 核心模組中,可用 ktime 系列的 API;
- 在 userspace 可用 [clock_gettime](https://linux.die.net/man/2/clock_gettime) 相關 API;
- 善用統計模型,除去極端數值,過程中應詳述你的手法
- 分別[用 gnuplot 製圖](https://hackmd.io/@sysprog/Skwp-alOg),分析 Fibonacci 數列在核心計算和傳遞到 userspace 的時間開銷,單位需要用 us 或 ns (自行斟酌)
- 嘗試解讀上述實驗的時間分佈,特別是隨著 Fibonacci 數列增長後,對於 Linux 核心的影響。可修改 VFS 函式,允許透過寫入 `/dev/fibonacci` 裝置檔案來變更不同的 Fibonacci 求值的實作程式碼。
![](https://i.imgur.com/7xyCUVO.png)
- [ ] 用 printk 固然方便,但其執行時間易受各種因素而不穩定,除了透過統計來觀察,也可改用核心的 sysfs 介面,後者可讓我們自行宣告 show 和 store 介面來提供讀寫操作,可參見 [Sample kobject implementation](https://elixir.bootlin.com/linux/latest/source/samples/kobject/kobject-example.c) (注意: 切換到對應的 Linux 核心版本)。
- [ ] 逐步最佳化 Fibonacci 的執行時間,引入[費氏數列分析](https://hackmd.io/@sysprog/fibonacci-number) 提到的策略,並善用 [clz / ffs](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令 (Linux 核心有對應的包裝函式),過程中都要充分量化
- [ ] 嘗試研讀 [bignum](https://github.com/0xff07/bignum/tree/fibdrv) (`fibdrv` 分支) 的實作,理解其中的技巧並與你的實作進行效能比較,探討個中值得取鏡之處。
- 原理和分析可見 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU)
- [ ] 當 $Fib(n)$ 隨著 $n$ 越來越大時,由 Linux 核心傳遞字串形式的十進位就成為效能瓶頸,研讀〈[Integer Encoding Algorithm 筆記](https://kkc.github.io/2021/01/30/integer-compression-note/)〉所提及的策略,在 Linux 核心模組先對數值進行編碼和壓縮,以二進位方式傳遞到使用者層級後,再由應用程式予以顯示 $Fib(n)$ 數值
- 儘量降低由核心傳遞到應用程式的資料量
- 確保應用程式不需要過多的計算量才可重現 Fibonacci 數列
## 環境
```shell
$ uname -a
Linux leon-ubuntu-20 5.13.0-35-generic #40~20.04.1-Ubuntu SMP Mon Mar 7 09:18:32 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 36 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 58
Model name: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
Stepping: 9
CPU MHz: 1700.000
CPU max MHz: 3800.0000
CPU min MHz: 1600.0000
BogoMIPS: 6784.21
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-3
```
## 開發紀錄
### 修正可變長度陣列( Variable-Length Array, VLA )
原程式碼如下。
```c
static long long fib_sequence(long long k)
{
/* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */
long long f[k + 2];
f[0] = 0;
f[1] = 1;
for (int i = 2; i <= k; i++) {
f[i] = f[i - 1] + f[i - 2];
}
return f[k];
}
```
其中宣告處可見 `long long f[k + 2]` ,其雖合乎 C99 標準,但其實有更好的實做可以避免 VLA 。修正後如下。
```c
static long long fib_sequence(long long k)
{
long long a = 0, b = 1;
if (k <= 1)
return k;
for (long long i = k; i > 1; i--) {
k = a + b;
a = b;
b = k;
}
return k;
}
```
### 利用 fast doubling 實做
直接以費波那契數的定義實做,時間複雜度為 $O(n)$ ,若改以 fast doubling 的方式則可使時間複雜度降為 $O(logn)$ 。
根據作業要求所給出的虛擬碼實做 fast doubling ,首次嘗試如下。
```c
#define clz(x) __builtin_clzll(x)
static long long fib_sequence(long long k)
{
long long a = 0, b = 1;
int k_clz = clz(k);
if (k <= 1)
return k;
k <<= k_clz;
for (int i = sizeof(long long) * 8; i > k_clz; i--, k <<= 1) {
long long t1 = a * (2 * b - a);
long long t2 = b * b + a * a;
a = t1;
b = t2;
if (k & (long long) ~(~0ULL >> 1)) {
t1 = a + b;
a = b;
b = t1;
}
}
return a;
}
```
在以上嘗試中,將 `k` 最高位的 1 對齊 MSB ,並逐次將 `k` 左移偵測最高位以實做 fast doubling 。
後來發現更為簡化的實做方式。
```c
#define clz(x) __builtin_clzll(x)
static long long fib_sequence(long long k)
{
if (k <= 1)
return k;
long long a = 0, b = 1;
for (int mask = 1 << (sizeof(long long) * 8 - 1 - clz(k)); mask > 0;
mask >>= 1) {
long long t1 = a * (2 * b - a);
b = b * b + a * a;
a = t1;
if (k & mask) {
t1 = a + b;
a = b;
b = t1;
}
}
return a;
}
```
以上實做中,不再將 `k` 進行位移。反之,以 `mask` 變數作為遮罩,逐次將 `mask` 右移以對 `k` 進行檢查。
最後則是為未來大數運算做準備。由於目前的程式碼在 `fib_read` 函式中直接回傳所算得之費波那契數,然而此機制無法應付數值超過 `ssize_t` 所能表示之最大值的情況。因此回歸 `read` 函式之用法,將所得之數放進 `fib_read` 函式的 `buf` 參數中。
原 `fib_read` 實做如下。
```c
/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
char *buf,
size_t size,
loff_t *offset)
{
return (ssize_t) fib_sequence(*offset);
}
```
先修改 `fib_read` 。當 `buf` 的大小不足以放入 `long long` 整數時,回傳 `-1` 。
```c
/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
char *buf,
size_t size,
loff_t *offset)
{
if (size >= sizeof(long long)) {
*(long long *) buf = fib_sequence(*offset);
return (ssize_t) size;
}
return -1;
}
```
再修改 `client.c` 關於 `read` 的部份。
```diff
int main()
{
long long sz;
- char buf[1];
+ char buf[sizeof(long long)];
...
for (int i = 0; i <= offset; i++) {
lseek(fd, i, SEEK_SET);
- sz = read(fd, buf, 1);
+ sz = read(fd, buf, sizeof(buf));
printf("Reading from " FIB_DEV
" at offset %d, returned the sequence "
"%lld.\n",
- i, sz);
+ i, *(long long *) buf);
}
for (int i = offset; i >= 0; i--) {
lseek(fd, i, SEEK_SET);
- sz = read(fd, buf, 1);
+ sz = read(fd, buf, sizeof(buf));
printf("Reading from " FIB_DEV
" at offset %d, returned the sequence "
"%lld.\n",
- i, sz);
+ i, *(long long *) buf);
}
close(fd);
return 0;
}
```
### 實做大數
在此實做固定長度有號大數。新增 `bignum.h` ,定義 `struct bignum` 。
```c
#define BN_ARRAY_SIZE 7
/* Little-endian */
struct bignum {
uint32_t num[BN_ARRAY_SIZE];
int32_t num_and_sign;
};
```
`struct bignum` 以 little endian 的方式儲存數值:最低位位於 `num[0]` ,最高位位於 `num_and_sign` 。在此選擇以 32 位元整數作為成員型態的原因是考慮在進行運算時可以 64 位元的變數儲存會溢位的數值。將最高位獨立為 `int32_t` 型態可避免在實做時過度轉型。此結構的數值範圍為 $[-2^{255}, 2^{255} - 1]$ 。
接下來實做基本大數運算於 `bignum.c` 。函式原型如下。
```c
/* bignum = 0 */
void bignum_init(struct bignum *bignum);
void bignum_from_int(struct bignum *bignum, int64_t i);
void bignum_to_dec(const struct bignum *bignum, char *str, size_t size);
/* c = a + b */
void bignum_add(const struct bignum *a,
const struct bignum *b,
struct bignum *c);
/* c = a - b */
void bignum_sub(const struct bignum *a,
const struct bignum *b,
struct bignum *c);
/* c = a * b */
void bignum_mul(const struct bignum *a,
const struct bignum *b,
struct bignum *c);
/* c = a / b */
void bignum_div(const struct bignum *a,
const struct bignum *b,
struct bignum *c);
/* b = -a */
void bignum_neg(const struct bignum *a, struct bignum *b);
/* b = |a| */
void bignum_abs(const struct bignum *a, struct bignum *b);
/* b = a << 1 */
void bignum_shl1(const struct bignum *a, struct bignum *b, uint32_t lsb);
/* b = a >> 1 */
void bignum_shr1(const struct bignum *a, struct bignum *b, uint32_t msb);
```
實做後,需修改 `Makefile` 以將 `bignum` 導入模組及 client 中。
```diff
...
-TARGET_MODULE := fibdrv
+TARGET_MODULE := fibdrvmodule
...
+$(TARGET_MODULE)-objs := fibdrv.o bignum.o
...
-client: client.c
+client: client.c bignum.c
$(CC) -o $@ $^
```
在編譯前,需了解到在 user mode 和 kernel mode 所引入的標頭檔不盡相同。可以藉由檢查 `__KERNEL__` 巨集是否有被定義來決定所引入的標頭檔。如下。
```c
#ifdef __KERNEL__
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/string.h>
#else
#include <stdio.h>
#include <string.h>
#endif
```
接著以 fast doubling 和大數實做費波那契數,最後稍稍修改 `fib_read` 及 `client.c` 。
```c
static void fib_sequence(long long k, struct bignum *result)
{
if (!result)
return;
if (k <= 1) {
bignum_from_int(result, k);
return;
}
struct bignum a, b;
bignum_from_int(&a, 0);
bignum_from_int(&b, 1);
for (int mask = 1 << (sizeof(long long) * 8 - 1 - clz(k)); mask > 0;
mask >>= 1) {
struct bignum t;
bignum_shl1(&b, &t, 0);
bignum_sub(&t, &a, &t);
bignum_mul(&t, &a, &t);
bignum_mul(&b, &b, &b);
bignum_mul(&a, &a, &a);
bignum_add(&a, &b, &b);
a = t;
if (k & mask) {
bignum_add(&a, &b, &t);
a = b;
b = t;
}
}
*result = a;
}
```
最後執行 `make check` 發現無法通過測試,原因為 `scripts/expected.txt` 中大於 92 的費波那契數為錯誤的。修改後方可通過測試。
```diff
Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120.
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
-Reading from /dev/fibonacci at offset 93, returned the sequence 7540113804746346429.
-Reading from /dev/fibonacci at offset 94, returned the sequence 7540113804746346429.
-Reading from /dev/fibonacci at offset 95, returned the sequence 7540113804746346429.
-Reading from /dev/fibonacci at offset 96, returned the sequence 7540113804746346429.
-Reading from /dev/fibonacci at offset 97, returned the sequence 7540113804746346429.
-Reading from /dev/fibonacci at offset 98, returned the sequence 7540113804746346429.
-Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429.
-Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
-Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
-Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429.
-Reading from /dev/fibonacci at offset 98, returned the sequence 7540113804746346429.
-Reading from /dev/fibonacci at offset 97, returned the sequence 7540113804746346429.
-Reading from /dev/fibonacci at offset 96, returned the sequence 7540113804746346429.
-Reading from /dev/fibonacci at offset 95, returned the sequence 7540113804746346429.
-Reading from /dev/fibonacci at offset 94, returned the sequence 7540113804746346429.
-Reading from /dev/fibonacci at offset 93, returned the sequence 7540113804746346429.
+Reading from /dev/fibonacci at offset 93, returned the sequence 12200160415121876738.
+Reading from /dev/fibonacci at offset 94, returned the sequence 19740274219868223167.
+Reading from /dev/fibonacci at offset 95, returned the sequence 31940434634990099905.
+Reading from /dev/fibonacci at offset 96, returned the sequence 51680708854858323072.
+Reading from /dev/fibonacci at offset 97, returned the sequence 83621143489848422977.
+Reading from /dev/fibonacci at offset 98, returned the sequence 135301852344706746049.
+Reading from /dev/fibonacci at offset 99, returned the sequence 218922995834555169026.
+Reading from /dev/fibonacci at offset 100, returned the sequence 354224848179261915075.
+Reading from /dev/fibonacci at offset 100, returned the sequence 354224848179261915075.
+Reading from /dev/fibonacci at offset 99, returned the sequence 218922995834555169026.
+Reading from /dev/fibonacci at offset 98, returned the sequence 135301852344706746049.
+Reading from /dev/fibonacci at offset 97, returned the sequence 83621143489848422977.
+Reading from /dev/fibonacci at offset 96, returned the sequence 51680708854858323072.
+Reading from /dev/fibonacci at offset 95, returned the sequence 31940434634990099905.
+Reading from /dev/fibonacci at offset 94, returned the sequence 19740274219868223167.
+Reading from /dev/fibonacci at offset 93, returned the sequence 12200160415121876738.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120.
```
### 效能分析
根據作業要求,將可能會影響效能的因素排除。
```bash
# Disable ASLR
sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"
# Disable inline turbo
sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
# Scaling govenor
for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
sudo sh -c "echo performance > ${i}"
done
```
接著再次改寫程式碼,將多個需比較的內容寫入 driver 中,並以 `enum` 來決定所要執行的程式碼區段。
```c
#define FIBDRV_MODE_SIZE 5
enum fibdrv_mode {
FIBDRV_BIGNUM_FAST,
FIBDRV_BIGNUM_ORIG,
FIBDRV_LL_FAST,
FIBDRV_LL_ORIG,
FIBDRV_TIME
};
```
改寫 `fib_write` ,使之成為使用者改變 driver mode 的函式。
```c
static ssize_t fib_write(struct file *file,
const char *buf,
size_t size,
loff_t *offset)
{
if (size == sizeof(mode) && *(enum fibdrv_mode *) buf < FIBDRV_MODE_SIZE) {
if (copy_from_user(&mode, buf, sizeof(mode)) != 0)
return -1;
} else
mode = FIBDRV_BIGNUM_FAST;
return 1;
}
```
接著在 driver 中安插計時相關程式碼,所得之時間單位為奈秒。
```c
ktime_t kt = ktime_get();
// codes
duration = ktime_sub(ktime_get(), kt); // duration is a global var
```
並在 user mode 中也安插計時相關程式碼。
```c
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
// code
clock_gettime(CLOCK_MONOTONIC, &t1);
// to nanosecond
long long ut = (long long) (t1.tv_sec * 1e9 + t1.tv_nsec) -
(t0.tv_sec * 1e9 + t0.tv_nsec);
```
最後利用 gnuplot 將資料轉換為折線圖。
第一張圖為利用 bignum fast doubling 所得之數據。
![](https://i.imgur.com/Y0LhkZ3.png)
第二張及第三張圖為根據不同的演算法所得之時間數據。
![](https://i.imgur.com/O9fVTSX.png)
![](https://i.imgur.com/ZKbKncj.png)