# 2022q1 Homework3 (fibdrv)
contributed by < [`chengingx`](https://github.com/chengingx) >
###### tags: `linux2022`
> [fibdrv](https://hackmd.io/@sysprog/linux2022-fibdrv)
## 預備知識
* [Calculating Fibonacci Numbers by Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling)
* [The Linux Kernel Module Programming Guide](https://sysprog21.github.io/lkmpg/)
### 費氏數列
$\left \{ \begin{array}{**lr**} F_{2n}=F_n\cdot(2\cdot F_{n+1}-F_n) \\ F_{2n+1}=F_{n+1}^2+F_{n}^2\end{array} \right.$
對應虛擬碼如下:
```c
Fast_Fib(n)
a = 0; b = 1; // m = 0
for i = (number of binary digit in n) to 1
t1 = a*(2*b - a);
t2 = b^2 + a^2;
a = t1; b = t2; // m *= 2
if (current binary digit == 1)
t1 = a + b; // m++
a = b; b = t1;
return a;
```
### 考慮到硬體加速 $F(n)$ 的手法
許多現代處理器提供 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令,可搭配 [Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) 手法運用:
1. 省略 $F(0)$,直接從 $F(1)$ 開始
2. clz/[ffs](https://www.kernel.org/doc/htmldocs/kernel-api/API-ffs.html): 先去除數字 MSB 起算的開頭 0 位元,因為真正的數字不包含 leading 0s,所以計算它也沒用,因此需要 clz 計算有多少 leading 0s
* 遇到 `0` : 求 $F(2n)$ 和 $F(2n+1)$
* 遇到 `1` : 求 $F(2n)$ 和 $F(2n+1)$,再求 $F(2n+2)$
### 前期測試
* 首先 insert kernel module
```shell
$ sudo insmod fibdrv.ko
```
* `sudo lsmod` 可用來觀察有什麼 kernel module 在 kernel 中
```shell
$ sudo lsmod | grep fibdrv
fibdrv 16384 0
```
* 觀察 fibonacci 的 device file
```shell
$ ls -l /dev/fibonacci
crw------- 1 root root 508, 0 三 15 15:36 /dev/fibonacci
```
* 可以看見 major number 為 508,由 `alloc_chrdev_region` 決定
```shell
$ sudo rmmod fibdrv
```
#### 核心模式的時間測量
```c
static long long fib_time_proxy(long long k)
{
kt = ktime_get();
long long result = fib_sequence(k);
kt = ktime_sub(ktime_get(), kt);
return result;
}
```
可發現到 write 在此核心模組暫時沒作用,於是可挪用來輸出上一次 fib 的執行時間
```c
static ssize_t fib_read(struct file *file,
char *buf,
size_t size,
loff_t *offset)
{
return (ssize_t) fib_time_proxy(*offset);
}
static ssize_t fib_write(struct file *file,
const char *buf,
size_t size,
loff_t *offset)
{
return ktime_to_ns(kt);
}
```
#### 使用 fast doubling 進行核心模式的時間測量
參照 [freshLiver](https://hackmd.io/@freshLiver/2022q1-hw3-fibdrv#fibdrv-%E5%AF%A6%E4%BD%9C) 的實作 (我想破頭)
* original ver.
時間複雜度 : $O(n)$
```c
static long long fib_sequence(long long k)
{
long long f[2] = {0, 1};
for (int i = 1; i <= k; i++){
SWAP(f[0], f[1]);
f[0] += f[1];
}
return f[0];
}
```
* fast doubling ver.
時間複雜度 : $O(log(n))$
```c
static long long fib_sequence(long long k)
{
if (k == 0)
return 0;
long long f[2] = {0, 1};
int m = 63 - __builtin_clzll(k);
for (int i = 0; i <= m; i++){
long long a = f[0] * (2 * f[1] - f[0]); // F(2k)
long long b = f[1] * f[1] + f[0] * f[0]; // F(2k+1)
if ((k >> (m - i)) & 1){
f[0] = b;
f[1] = a + b; // F(2k+2)
} else {
f[0] = a;
f[1] = b;
}
}
return f[0];
}
```
> `__builtin_clzll()` 返回左起第一個1之前0的個數
參考 [gnuplot 語法解說和示範](https://hackmd.io/@sysprog/Skwp-alOg) 繪圖
```shell
set terminal png
set term png enhanced font 'Verdna, 10'
set title 'performance comparison'
set xlabel 'size'
set ylabel 'time(ns)'
set output 'runtime.png'
set key left
plot [0:92][0:150] 'output_original.txt' using 1:2 with lp lw 2 title 'original', 'output_fast_doubling.txt' using 1:2 with lp lw 2 title 'fast doubling'
```

:::info
$F_{93}$ 之後 overflow 也影響了執行時間 ?
:::
#### 將行程固定在特定的 CPU 中執行
直接加在 GRUB 的設定檔中,這樣 Linux 的排程器在預設的狀況下就不會將任何一般行程放在這個被保留的 CPU 核心中執行,只有那些被 [taskset](https://man7.org/linux/man-pages/man1/taskset.1.html) 特別指定的行程可使用這些 CPU 核心
```shell
$ sudo vi /etc/default/grub
```
```shell
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=0"
```
```shell
$ sudo update-grub
```
確認 `affinity list` (實驗電腦為 6 核)
```shell
$ taskset -cp 1
pid 1's current affinity list: 1-5
```
```shell
$ sudo taskset -c 0 ./client 0
```
不知道為何原始版本變快了,然而 fast doubling 仍然振幅很大,之後用統計手法解決

## 計算 $F_{93}$ (包含) 之後的 Fibonacci 數 - 使用數字字串並套用 quiz2 SSO (Small String Optimization)
透過 `$ sudo ./client` 觀察得到 $F_{93}$ 發生 overflow
```shell
$ sudo ./client
...
Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120.
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 93, returned the sequence -6246583658587674878.
...
```
> $F_{93}=F_{91}+F_{92} = 12200160415121876738$
> $\because 12200160415121876738 > 2^{64} - 1\ (Overflow)$
> $\therefore 12200160415121876738 - 2^{64} = -6246583658587674878 = F_{93}$
* 將 `long long` 改成 [`unsigned __int128`](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fint128.html)
節錄 [fibdrv.c](https://github.com/chengingx/fibdrv/blob/master/fibdrv.c) 重要修改
```c
static unsigned __int128 fib_sequence(long long k)
{
unsigned __int128 f[2] = {0, 1};
for (int i = 1; i <= k; i++) {
__swap(&f[0], &f[1], sizeof(unsigned __int128));
f[0] += f[1];
}
return f[0];
}
static unsigned __int128 fib_time_proxy(long long k, int method)
{
kt = ktime_get();
unsigned __int128 result = fib_sequence(k);
kt = ktime_sub(ktime_get(), kt);
return result;
}
/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
char *buf,
size_t size,
loff_t *offset)
{
unsigned __int128 result = fib_time_proxy(*offset, size);
if (copy_to_user(buf, &result, sizeof(unsigned __int128)))
return -EFAULT;
return 1;
}
```
original ver. 在 userspace 與 kernel space 的時執行時間

```shell
plot [0:][0:] 'output.txt' using 1:2 with lp lw 2 title 'kernel', '' using 1:3 with lp lw 2 title 'userspace', '' using 1:4 with lp lw 2 title 'kernel to user
space'
```
### 統計手法
參考 [oscarshiang](https://hackmd.io/@oscarshiang/linux_fibdrv) 所寫的 python 自動腳本,將 process data 寫成函式,使用到的手法是去除超過兩個標準差的資料
```python
def process_data(l, pd, name):
print(f"Processing {name} data...")
mean, std = pd.mean(axis = 1), pd.std(axis = 1)
for i in tqdm(range(len(pd.index))):
sum = 0
cnt = 0
for j in pd.iloc[i]:
if abs(j - mean[i]) < 2 * std[i]:
sum = sum + j;
cnt = cnt + 1;
sum = sum / cnt;
l.append(sum)
```
透過 `python AutoGenerate.py` 產生 10000 筆資料
```
$ python AutoGenerate.py
Generating data...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [02:39<00:00, 62.85it/s]
...
```
original ver.

fast doubling ver.

original ver. vs. fast doubling ver. in kernel

### 不使用硬體指令加速
```c
void fast_doubling(long long k, unsigned __int128 f[])
{
if(k == 0){
return;
}
fast_doubling((k >> 1), f);
unsigned __int128 a = f[0] * (2 * f[1] - f[0]); // F(2k)
unsigned __int128 b = f[1] * f[1] + f[0] * f[0]; // F(2k+1)
if(k & 1){
f[0] = b;
f[1] = a + b;
} else {
f[0] = a;
f[1] = b;
}
}
static unsigned __int128 fib_fast_doubling(long long k)
{
unsigned __int128 f[2] = {0, 1};
fast_doubling(k, f);
return f[0];
}
```

可以清楚看見,使用 [clz](https://en.wikipedia.org/wiki/Find_first_set) 指令可以降低執行時間
#### clz
觀察以下程式碼的 assembly code
```c
unsigned int clz(unsigned int num) {
return __builtin_clz(num);
}
int main(){
unsigned int res = clz(2);
return 0;
}
```
開啟編譯器最佳化進行編譯
```shell
$ gcc -S -O2 -o test2.s test.c -w
```
```shell
clz:
bsrl %edi, %eax
xorl $31, %eax
ret
```
可以清楚看見有個 `bsrl` 指令,bsr 是 bit-scan reverse
舉例 :
$0000000010011000_b$
* $bsf$ 會得到 $3$,$bsr$ 會得到 $7$
* word size 減去 $bsr$ 得到的數值就能得到 leading zeros
* `xorl $31, %eax` 表示 $31 - eax$
> 只有 $2^n-1$ 可以使用 xor 代替 $-$,$n=1,2,3,...$
### 註冊 Device Driver
:::info
根據 [Device Driver 重點整理](https://hackmd.io/@chenging/driver) 知道 fops 的兩個重點:
1. fops 是指向 file_operations 結構的指標,驅動程式呼叫 `register_chrdev()` 將 fops 註冊到 kernel 裡後,fops 便成為該 device driver 所實作的 system call 進入點
2. 實作 system call 的函數便是透過 file_operations 結構來定義,我們稱實作 system call 的函數為 driver method
:::
觀察 `fibdrv.c` 中的 `init_fib_dev` 裡面可以看見註冊函式,[`alloc_chrdev_region`](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/fs.h) 會動態產生 major number,若回傳值小於 0 代表註冊失敗
```c
rc = alloc_chrdev_region(&fib_dev, 0, 1, DEV_FIBONACCI_NAME);
if (rc < 0) {
printk(KERN_ALERT
"Failed to register the fibonacci char device. rc = %i",
rc);
return rc;
}
```
[Course 5 - 新字元設備驅動實驗](https://meetonfriday.com/posts/edcd30e6/) 整理 `register_chrdev()` 缺點
1. 會佔用主設備號下的所有次設備號,浪費了很多次設備號
2. 必須手動指定主設備號,也就是說你得先手動查詢哪個主設備號還沒被佔用,然後再去指定
:::info
`cdev` 定義在 [include/linux/cdev.h](https://github.com/torvalds/linux/blob/70868a180501d17fea58153c649d56bc18435315/include/linux/cdev.h),看到了老朋友 `list_head`
```c
struct cdev {
struct kobject kobj;
struct module *owner;
const struct file_operations *ops;
struct list_head list;
dev_t dev;
unsigned int count;
} __randomize_layout;
```
:::
```c
fib_cdev = cdev_alloc();
if (fib_cdev == NULL) {
printk(KERN_ALERT "Failed to alloc cdev");
rc = -1;
goto failed_cdev;
}
```
Linux 驅動程式是透過 file_operations 來建構 VFS 層的支援
```c
fib_cdev->ops = &fib_fops;
rc = cdev_add(fib_cdev, fib_dev, 1);
if (rc < 0) {
printk(KERN_ALERT "Failed to add cdev");
rc = -2;
goto failed_cdev;
}
```
而 file_operation 裡的函數指標,即是指向每一個 system call 的實作函數
```c
const struct file_operations fib_fops = {
.owner = THIS_MODULE,
.read = fib_read,
.write = fib_write,
.open = fib_open,
.release = fib_release,
.llseek = fib_device_lseek,
};
```
## mutex 對 Linux 核心模組的影響
fibdrv.c 中的 `fib_open` 對應的 system call 是 `open`,也就是在 client.c 中
```c
int fd = open(FIB_DEV, O_RDWR);
```
當開啟 file (`/dev/fibonacci`) 時,會試圖上鎖
```c
static int fib_open(struct inode *inode, struct file *file)
{
if (!mutex_trylock(&fib_mutex)) {
printk(KERN_ALERT "fibdrv is in use");
return -EBUSY;
}
return 0;
}
```
假如 file discriptor 是 critical section,也就是 fd 會共用,那麼在 `read` 就會出問題
```c
read(fd, &buf, method);
```
以下進行實驗,將原本在 main 函式的指令寫在 `racefib`
```c
void *racefib(void *arg)
{
int fd = open(FIB_DEV, O_RDWR);
if (fd < 0) {
perror("Failed to open character device");
exit(1);
}
for (int i = 0; i <= 10; i++){
unsigned __int128 buf;
lseek(fd, i, SEEK_SET);
read(fd, &buf, 1);
printf("%s : F(%d): %lld\n", (char *)arg, i, buf);
}
close(fd);
return NULL;
}
```
main() 中則寫,記得編譯時加上 `-lpthread`
```c
pthread_t t[2];
pthread_create(&t[0], NULL, racefib, "Thread 1");
pthread_create(&t[1], NULL, racefib, "Thread 2");
for (int i = 0; i < 2; i++)
pthread_join(t[i], NULL);
```
可看見輸出有 Failed to open character device: Device or resource busy 字樣
```shell
$ sudo ./client
Thread 1 : F(0): 0
Thread 1 : F(1): 1
Thread 1 : F(2): 1
Thread 1 : F(3): 2
Thread 1 : F(4): 3
Thread 1 : F(5): 5
Failed to open character device: Device or resource busy
Thread 1 : F(6): 8
Thread 1 : F(7): 13
```
將 `mutex` 拿掉,可以發現 $F(n)$ 數字是正確的
```
$ sudo ./client
...
Thread 1 : F(6): 8
Thread 2 : F(4): 3
Thread 2 : F(5): 5
Thread 2 : F(6): 8
Thread 2 : F(7): 13
Thread 1 : F(7): 13
...
```
當 `fd` 為 Global variable,且 Thread 1 sleep 1 秒,Thread 2 sleep 2 秒時,數值就會出錯
```c
void *racefib(void *arg)
{
int num = *((int *)arg);
for (int i = 0; i <= 10; i++){
unsigned __int128 buf;
lseek(fd, i, SEEK_SET);
sleep(num);
read(fd, &buf, 1);
printf("Thread %d : F(%d): %lld\n", num, i, buf);
}
return NULL;
}
```
主要是因為 `lseek` 在看 `offset` 時發生的錯誤
```c
int delay_a = 1, delay_b = 2;
pthread_create(&t[0], NULL, racefib, &delay_a);
pthread_create(&t[1], NULL, racefib, &delay_b);
```
由輸出結果可以看見兩個 Thread 1,一個 Thread 2 交錯,因為 `offset` 為共用,`Thread 2 : F(1): 2` 其實是 `Thread 1 : F(3)` 的值
```shell
$ sudo ./client
Thread 1 : F(0): 0
Thread 2 : F(0): 1
Thread 1 : F(1): 1
Thread 1 : F(2): 1
Thread 2 : F(1): 2
Thread 1 : F(3): 1
Thread 1 : F(4): 3
Thread 2 : F(2): 5
Thread 1 : F(5): 2
Thread 1 : F(6): 8
Thread 2 : F(3): 13
Thread 1 : F(7): 3
Thread 1 : F(8): 21
Thread 2 : F(4): 34
Thread 1 : F(9): 5
Thread 1 : F(10): 55
Thread 2 : F(5): 55
Thread 2 : F(6): 8
Thread 2 : F(7): 13
Thread 2 : F(8): 21
Thread 2 : F(9): 34
Thread 2 : F(10): 55
```
## 自我檢查清單
- [x] 研讀上述 Linux 效能分析的提示 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作→ 從中也該理解為何不希望在虛擬機器中進行實驗;
- [x] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
- [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise),思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
- [ ] 研讀 [KYG-yaya573142](https://hackmd.io/@KYWeng/rkGdultSU) 的報告,指出針對大數運算,有哪些加速運算和縮減記憶體操作成本的舉措?
- [ ] lsmod 的輸出結果有一欄名為 Used by,這是 “each module’s use count and a list of referring modules”,但如何實作出來呢?模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢?
> 搭配閱讀 [The Linux driver implementer’s API guide » Driver Basics](https://www.kernel.org/doc/html/latest/driver-api/basics.html)
- [ ] 注意到 fibdrv.c 存在著 DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 POSIX Thread 的程式碼來確認。
## 參考資料
* [Calculating Fibonacci Numbers by Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling)
* [The Linux Kernel Module Programming Guide](https://sysprog21.github.io/lkmpg/)
* [Device Driver 重點整理](https://hackmd.io/@chenging/driver)
* [Course 5 - 新字元設備驅動實驗](https://meetonfriday.com/posts/edcd30e6/)
* [gnuplot 語法解說和示範](https://hackmd.io/@sysprog/Skwp-alOg)
* [Linux 隱藏 GRUB 開機選單](https://www.ltsplus.com/linux/linux-hide-grub-boot-menu)