L04: fibdrv

主講人: jserv / 課程討論區: 2023 年系統軟體課程

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

返回「Linux 核心設計/實作」課程進度表

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

撰寫 Linux 核心模組

請自行參閱以下教材:

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

`fibdrv`: 可輸出 Fibonacci 數列的 Linux 核心模組

前期準備

開發環境預期和 lab0 相似，如果你剛好重新安裝 Ubuntu Linux，請依據指示將必要的開發套件裝好;
自本次作業，我們就有分析應用程式和 Linux 核心效能的需求，不該在虛擬機器環境裡頭測試 (但 container 或 Linux KVM 可接受)，否則會有效能偏差的問題
$\to$ 及早在實體機器上安裝好 GNU/Linux;

自從 Linux 核心 4.4 版以來，Ubuntu Linux 預設開啟 EFI_SECURE_BOOT_SIG_ENFORCE，這使得核心模組需要適度的簽章才可掛載進入 Linux 核心，為了後續測試的便利，我們需要將 UEFI Secure Boot 的功能關閉，請見 Why do I get “Required key not available” when install 3rd party kernel modules or after a kernel upgrade?
檢查 Linux 核心版本
```
$ uname -r
```
預期是大於等於 5.4.0 的版本，例如 5.4.0-66-generic。若在你的開發環境中，核心版本低於 5.4 的話，需要更新 Linux 核心，請自行參照相關文件
安裝 linux-headers 套件 (注意寫法裡頭有 s)，以 Ubuntu Linux 20.04 LTS 為例:
```
$ sudo apt install linux-headers-`uname -r`
```

確認 linux-headers 套件已正確安裝於開發環境

$ dpkg -L linux-headers-5.4.0-66-generic | grep "/lib/modules"

預期得到以下輸出:

/lib/modules/5.4.0-66-generic/build

檢驗目前的使用者身份
```
$ whoami
```
預期為「不是 root 的使用者名稱」，例如 jserv (或者你安裝 Ubuntu Linux 指定的登入帳號名稱)。由於測試過程需要用到 sudo，請一併查驗:
```
$ sudo whoami
```
預期輸出是 root
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
在下列操作中，請避免用 root 帳號輸入命令，而該善用 sudo

之後的實驗中，我們會重建 root file system，若濫用 root 權限，很可能就把 GNU/Linux 開發環境不小心破壞 (當然，你還是可重新安裝)，現在開始養成好習慣

安裝後續會用得到的工具

$ sudo apt install util-linux strace gnuplot-nox

取得原始程式碼

$ git clone https://github.com/sysprog21/fibdrv
$ cd fibdrv

編譯並測試
```
$ make check
```
預期會看到綠色的 Passed [-] 字樣，隨後是
```
f(93) fail
input: 7540113804746346429
expected: 12200160415121876738
```
這符合預期，因為給定的 fibdrv 存在缺陷。

就因世界不完美，才有我們工程師存在的空間。

觀察產生的 fibdrv.ko 核心模組

$ modinfo fibdrv.ko

預期可得以下輸出:

description:    Fibonacci engine driver
author:         National Cheng Kung University, Taiwan
license:        Dual MIT/GPL
name:           fibdrv
vermagic:       5.4.0-45-generic SMP mod_unload

觀察 fibdrv.ko 核心模組在 Linux 核心掛載後的行為（要先透過 insmod 將模組載入核心後才會有下面的裝置檔案 /dev/fibonacci）
模組載入核心：
```
$ sudo insmod fibdrv.ko
```
```
$ ls -l /dev/fibonacci
$ cat /sys/class/fibonacci/fibonacci/dev
```
新建立的裝置檔案 /dev/fibonacci，注意到 236 這個數字，在你的電腦也許會有出入。試著對照 fibdrv.c，找尋彼此的關聯。
```
$ cat /sys/module/fibdrv/version 
```
預期輸出是 0.1，這和 fibdrv.c 透過 MODULE_VERSION 所指定的版本號碼相同。
```
$ lsmod | grep fibdrv
$ cat /sys/module/fibdrv/refcnt
```
這兩道命令的輸出都是 0，意味著目前的 reference counting。

計算 F₉₃ (包含) 之後的 Fibonacci 數 (功能受限)

因 F₉₃ 之後的運算會發生 overflow，導致無法正確地計算結果。可以使用底下方法計算 big number:

使用 GCC __int128 型態，或者自行定義的結構:

 struct BigN {
    unsigned long long lower, upper;
};

使用數字字串做運算，並運用 Small/Short String Optimization。實作程式碼。
底下為數字字串加法實作，細節如下:
- 確保 a 字串長度大於 b 字串
- 將 a 與 b 字串反轉
- 逐字對數字字元做加法運算
- 將得出字串反轉，即得出最終結果

static void string_number_add(xs *a, xs *b, xs *out)
{
    char *data_a, *data_b;
    size_t size_a, size_b;
    int i, carry = 0;
    int sum;

    /*
     * Make sure the string length of 'a' is always greater than
     * the one of 'b'.
     */
    if (xs_size(a) < xs_size(b))
        __swap((void *) &a, (void *) &b, sizeof(void *));

    data_a = xs_data(a);
    data_b = xs_data(b);

    size_a = xs_size(a);
    size_b = xs_size(b);

    reverse_str(data_a, size_a);
    reverse_str(data_b, size_b);

    char buf[size_a + 2];

    for (i = 0; i < size_b; i++) {
        sum = (data_a[i] - '0') + (data_b[i] - '0') + carry;
        buf[i] = '0' + sum % 10;
        carry = sum / 10;
    }

    for (i = size_b; i < size_a; i++) {
        sum = (data_a[i] - '0') + carry;
        buf[i] = '0' + sum % 10;
        carry = sum / 10;
    }

    if (carry)
        buf[i++] = '0' + carry;

    buf[i] = 0;

    reverse_str(buf, i);

    /* Restore the original string */
    reverse_str(data_a, size_a);
    reverse_str(data_b, size_b);

    if (out)
        *out = *xs_tmp(buf);
}

如此一來，計算到 F₅₀₀ (The first 500 Fibonacci numbers ) 也是正確的，結果如下:

$ sudo ./client
...
Reading from /dev/fibonacci at offset 499, returned the sequence 86168291600238450732788312165664788095941068326060883324529903470149056115823592713458328176574447204501.
Reading from /dev/fibonacci at offset 500, returned the sequence 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125.
...

減少 copy to user 傳送的資料量

計算 leading zeros

以

f i b (100) = 354224848179261915075 =

100110011001111011011011101101010011111000101100101001011111111000011_{2}

為例，先對

f i b (100)

取2為底的對數

\log_{2} (354224848179261915075) = 68.26322731561805

無條件進位後是69，要表達到

2^{69} - 1

總共需要 69 個 bit，每個 int 是 32 bit，共需要 3 個 int。
leading zeros 數量為 96-69 = 27個

可以用 bitwise 操作最高位的整數來計算 leading zeros。

int clz (int x){
    if (x == 0) return 32;
    int n = 0;
    if ((x & 0xFFFF0000) == 0) {n += 16; x <<= 16;}
    if ((x & 0xFF000000) == 0) {n +=  8; x <<=  8;}
    if ((x & 0xF0000000) == 0) {n +=  4; x <<=  4;}
    if ((x & 0xC0000000) == 0) {n +=  2; x <<=  2;}
    if ((x & 0x80000000) == 0) n +=  1;
    return n;
}

這個程式碼以二分搜尋法找出 leading zeros，做法如下:

檢查最高 16 bit 是否全為零，全為零的話將 x 的首16個零移除(x 左移16 bit)
檢查

以 fib(100) 為例，最高位整數儲存的是 0x00000013

0x00000013 & 0xFFFF0000 == 0 ， n = 16 ， x = 0x00130000
0x00130000 & 0xFF000000 == 0 ， n = 24 ， x = 0x13000000
0x13000000 & 0xF0000000 != 0
0x13000000 & 0xC0000000 == 0 ， n = 26 ， x = 0x4C000000
0x4C000000 & 0x80000000 == 0 ， n = 27

與之前的計算一致。

計算以 2 為底數的對數

以 32 減去 clz(x) 可獲得 x 所使用到的 bit 數量。
令

n = 32 - c l z (x)

使用這些 bit 可表達的範圍為

2^{n - 1}

2^{n} - 1

，可得

n = 32 - c l z (x) = ⌈ l o g_{2} x ⌉

且

n - 1 = 31 - c l z (x) = ⌊ l o g_{2} x ⌋

。

fibdrv 計算完後，會藉由 copy_to_user() 將結果傳回使用者空間。計算leading zeros，並將 int32 細分為 4 個位元組，呼叫copy_to_user() 時不傳送全為 0 的位元組。

static int my_copy_to_user(const bn *bignum, char __user *buf)
{
    int i = MAX_LENGTH_BN - 1;
    for (; bignum->num[i] == 0; i--)
        ;
    int lzbyte = __builtin_clz(bignum->num[i]) >> 3;

    size_t size = sizeof(int32_t) * (i + 1) - lzbyte;
    copy_to_user(buf, bignum->num, size);

    return size;
}

使用 gcc 內建函式 __builtin_clz 來計算 leading zero bit 數量，其中 >> 3 除以 8 並捨棄餘數換算成 leading zero byte。

針對 little-endian 架構，非零的位元組會被存在較低的記憶體位址。以 fib(100) 為例，需要 3 個整數，最高位整數是 0x00000013，倘若我們只要前 2 個整數以及第 3 個整數的 0x13 由 copy_to_user() 傳送。

將 copy_to_user() 的 size 指定為 2 個整數再加 1 個位元組，就不用傳送 leading zero byte。

將複製的 byte 數量作為 read 的回傳值傳回 user。

sz = my_copy_to_user(a, buf);
...
return sz;

在 client.c 中將 bn 初始化後以 memcpy 複製資料即可。

    sz = read(fd, buf, 1);
        
    bn a;
    bn_init(&a);

    memcpy(a.num, buf, sz);
    int j = MAX_LENGTH_BN;
    for (; a.num[j] == 0; j--)
        ;
    a.length = j + 1;
    ...

預先計算儲存
$F (n)$ 所需的空間

倘若一開始就知道該配置多少空間給 Fibonacci 運算，就能避免空間浪費，或在計算過程中重複呼叫 krealloc。

首先，我們可用 Binet 公式計算任意 Fibonacci 數列中第

n

個數

F n = \frac{(ϕ^{n}) - (1 - ϕ)^{n}}{\sqrt{5}}

ϕ = \frac{1 + \sqrt{5}}{2} (g o l d e n r a t i o)

上式的近似值:

F n = \frac{ϕ^{n}}{\sqrt{5}} (s i n c e (1 - ϕ)^{n} i s v e r y s m a l l f o r l a r g e n)

知道近似值後，我們可使以 10 為底的對數計算其位數。具體如下：

D i g i t s = \log_{10} (\frac{ϕ^{n}}{\sqrt{5}}) D i g i t s = n \log_{10} ϕ - (\frac{\log_{10} 5}{2})

當

n

為足夠大的正整數時，

F (n) \approx 0.4472135955 \times {1.61803398875}^{n}

將近似值取 2 為底的對數，可得到需要使用的位元數

l o g_{2} (0.4472135955 \times {1.61803398875}^{n}) \approx - 1.160964 + n \times 0.694242

再除以 32 可得到需要使用的 uint32 數量

(- 1.160964 + n \times 0.694242) \div 32 = - 0.036280125 + n \times 0.0216950625

Linux 核心無法使用 (正確說法是，不能直接使用) 浮點數運算，因此將算式乘以

10^{10}

，亦即:

⌊ \frac{- 362801250 + n \times 216950625}{10^{10}} ⌋ + 1

此算式的結果就是該事先配置的 uint32 數量。

參考的 C 程式:

#define BUFFSIZE (8 * sizeof(int) * LENGTH / 3 + 2)
#define LOGPHI (0.20898764025)
#define LOGSQRT5 (0.34948500216)

int main()
{
    int offset = 100; /* TODO: try test something bigger than the limit */
    int fd = open(FIB_DEV, O_RDWR);
    if (fd < 0) {
        perror("Failed to open character device");
        exit(1);
    }
    float digits;
    int size;
    for (int i = 0; i <= offset; i++) {
        // calculate how many digits are needed for fib(i)
        // digits needed for fib(n) = n*LOG(Phi) - (LOG √5)
        digits = i * LOGPHI - LOGSQRT5;
        float buf_size = floor(digits / 9);
        size = (int) buf_size;
        char *buf = malloc(sizeof(char) * (BUFFSIZE * size));
        lseek(fd, i, SEEK_SET);
        long long time1 = read(fd, buf, 0);
        long long time2 = read(fd, buf, 1);
        printf("%d %lld %s\n", i, time1, buf); 
        free(buf);
    }

`fibdrv` 核心模組內部

觀察使用者層級 (user-level) 的程式如何與 fibdrv 互動:

    fd = open(FIB_DEV, O_RDWR);

    if (fd < 0) {
        perror("Failed to open character device");
        exit(1);
    }

    for (i = 0; i <= offset; i++) {
        sz = write(fd, write_buf, strlen(write_buf));
        printf("Writing to " FIB_DEV ", returned the sequence %lld\n", sz);
    }

    for (i = 0; i <= offset; i++) {
        lseek(fd, i, SEEK_SET);
        sz = read(fd, buf, 1);
        printf("Reading from " FIB_DEV
               " at offset %d, returned the sequence "
               "%lld.\n",
               i, sz);
    }

fibdrv 設計為一個 character device，可理解是個能夠循序存取檔案，透過定義相關的函數，可利用存取檔案的系統呼叫以存取 (即 open, read, write, mmap 等等)。因此，使用者層級 (user-level 或 userspace) 的程式可透過 read 系統呼叫來得到輸出。

接著來看如何實作:

/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    return (ssize_t) fib_sequence(*offset);
}

const struct file_operations fib_fops = {
    .owner = THIS_MODULE,
    .read = fib_read,
    .write = fib_write,
    .open = fib_open,
    .release = fib_release,
    .llseek = fib_device_lseek,
};

不難發現是透過 fib_fops 中的自行定義的 read 來實作讀取操作，而 fib_read 最終會回傳 fib_sequence(*offset)，因此就是透過讓使用者指定不同的 offest 作為 Fibonacci 數列的

x

然後透過 read 的回傳值輸出

f i b (x)

給使用者。

在 LP64 資料模式中，long long 僅寬 64 位元，因此若要表示更大的數，需要用兩個以上的變數表示一個數，在這種情況下作加法運算時，若使用 "+" operator ，會發生 overflow，可考慮實作一個更多位元的全加器。

f i b (92)

的計算結果用 16 進位表示是 0x1 11F3 8AD0 840B F6BF，超過 64 位元整數所夠表示的最大數值，因此考量到要正確輸出

f i b (100)

或數列更後面的數值，就必須使用到特製的結構來處理回傳值。以下是參考實作

struct BigN {
    unsigned long long lower, upper;
};

使用 struct BigN 來將一個數字分成兩部份:

高位的 64 bits 保存 upper 中;
低位的 64 bit 則是保存在 lower 中;

進行大數加法時，則需要注意 lower 是否需要進位到 upper:

static inline void addBigN(struct BigN *output, struct BigN x, struct BigN y)
{
    output->upper = x.upper + y.upper;
    if (y.lower > ~x.lower)
        output->upper++;
    output->lower = x.lower + y.lower;
}

因為 x.lower + ~x.lower = ~0，移項後 ~x.lower = ~0 - x.lower，亦即 ~x.lower 表示 x.lower 跟最大值(~0) 的距離。所以若 y.lower 比 x.lower 距離最大值的距離還大，就表示相加後會 overflow，需要進位，這時執行 output->upper++。

Linux Virtual File System 介面

透過 Linux Virtual File System 介面，本核心模組可將計算出來的 Fibonacci 數讓 client.c 程式得以存取。

VFS 提供一統各式檔案系統的共用介面，方便使用者操作不同的裝置或檔案時，可用一致的方式存取。Linux 的裝置驅動程式大致分為:

Character Device Driver;
Block Device Driver;
Network Device Driver;

在使用裝置前需要對其定義一些 file operation，並將其註冊到 kernel 中。
依據 /include/linux/fs.h 中的定義

struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    ...
    int (*open) (struct inode *, struct file *);
    ...
    int (*release) (struct inode *, struct file *);
    int (*fsync) (struct file *, loff_t, loff_t, int datasync);
    int (*fasync) (int, struct file *, int);
    int (*lock) (struct file *, int, struct file_lock *);
    ...
} __randomize_layout;

此手法可參見你所不知道的 C 語言：物件導向程式設計篇。

以本例來說，宣告一個 file_operations 的資料型態並設定一些對應到 VFS 操作的函式 (fib_read, fib_write 等等)。當在使用者模式程式中呼叫到 read 系統呼叫時，藉由 VFS 將其對應到 fib_read。

const struct file_operations fib_fops = {
    .owner = THIS_MODULE,
    .read = fib_read,
    .write = fib_write,
    .open = fib_open,
    .release = fib_release,
    .llseek = fib_device_lseek,
};

對照 Linux 核心設計: 檔案系統概念及實作手法。

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

Linux 核心的浮點數運算

Robert Love 在 Linux Kernel Development 一書論及浮點運算:

No (Easy) Use of Floating Point
When using floating-point instructions kernel normally catches a trap and then initiates the transition from integer to floating point mode. Unlike user-space, the kernel does not have the luxury of seamless support for floating point because it cannot easily trap itself. Using a floating point inside the kernel requires manually saving and restoring the floating point registers. Except in the rare cases, no floating-point operations are in the kernel.

解說: Use of floating point in the Linux kernel

Rusty Russell 在 Unreliable Guide To Hacking The Linux Kernel 則說:

The FPU context is not saved; even in user context the FPU state probably won't correspond with the current process: you would mess with some user process' FPU state. If you really want to do this, you would have to explicitly save/restore the full FPU state (and avoid context switches). It is generally a bad idea; use fixed point arithmetic first.

Lazy FP state restore

CVE-2018-3665 存在於 Intel Core 系列微處理器中，因為 speculative execution（推測執行）技術中的一些缺陷加上特定作業系統中對 FPU（Floating point unit）進行 context switch 所產生的漏洞，允許一個本地端的程式洩漏其他程序的 FPU 暫存器內容。

Lazy FP state leak 的原理是透過 FPU 的 lazy state switching 機制達成。因為 FPU 和 SIMD 暫存器不一定會在每個任務持續被使用到，因此作業系統排程器可將不需要使用到 FPU 的任務，標註為不可使用 FPU，而不更改裡面的內容。

然而，在現今的亂序執行 CPU 中，lazy state switching 裡會設定的 "FPU not available" 可能沒辦法馬上被偵測到，導致我們在 process B，但仍然可以存取到 process A 的 FPU 暫存器內容。

基於上述原因，儘管我們在 Linux 核心模式中仍可使用浮點運算，但這不僅會拖慢執行速度，還需要手動儲存/取回相關的 FPU 暫存器，因此核心文件不建議在核心中使用到浮點運算。

浮點運算時間差異比較

afcidk 透過開發簡單的 Linux 核心模組，來測試在單純的浮點數運算及整數運算花費的時間差異。

相關的程式碼可見 floating.c

可見到核心模式中的浮點數運算，時間成本較整數運算高。