J06: fibdrv

tags: `linux2021`

Lecturer: jserv / Course forum: 2021 System Software Course


Back to the "Linux Kernel Design" course schedule

Assignment walkthrough recording / Code Review recording


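Goals

- Write programs that run at the Linux kernel level
  - Learn kernel APIs such as ktimer and copy_to_user
- Review the C numeric system and bitwise operations
- Numerical analysis and strategies for improving computation
- A first look at the Linux VFS
- Automated testing
- Performance analysis with tools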


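Fibonacci sequence

Use the following materials to understand the Fibonacci sequence and the techniques for computing it:

- Analysis of the Fibonacci sequence
- Short videos "What is the Fibonacci sequence": Part I, Part II, Part III, Part IV
- TED: The magic of Fibonacci numbers (Traditional Chinese subtitles available)

Following Fibonacci and Golden Ratio Formulae, consider Fibonacci's original problem of counting a population of rabbits that never die. Let $a_n$ be the number of adult rabbits and $b_n$ the number of young rabbits in generation $n$. This gives the relations:

$$
\begin{aligned}
a_{n+1} &= a_n + b_n \\
b_{n+1} &= a_n
\end{aligned}
$$

Written in matrix form, this becomes

$$
\begin{pmatrix} a_{n+1} \\ b_{n+1} \end{pmatrix} =
\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}
\begin{pmatrix} a_n \\ b_n \end{pmatrix}
$$

The matrix

$$
\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}
$$

is the so-called Q-matrix, which can be rewritten as:

$$
Q = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}
  = \begin{pmatrix} F_2 & F_1 \\ F_1 & F_0 \end{pmatrix},
\qquad
Q^n = \begin{pmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{pmatrix}
$$

Explainer video: The Fibonacci Q-matrix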

Next comes the Fast Doubling method. Continuing the derivation:

$$
\begin{aligned}
\begin{bmatrix} F(2n+1) \\ F(2n) \end{bmatrix}
&= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^{2n}
   \begin{bmatrix} F(1) \\ F(0) \end{bmatrix} \\
&= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^{n}
   \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^{n}
   \begin{bmatrix} F(1) \\ F(0) \end{bmatrix} \\
&= \begin{bmatrix} F(n+1) & F(n) \\ F(n) & F(n-1) \end{bmatrix}
   \begin{bmatrix} F(n+1) & F(n) \\ F(n) & F(n-1) \end{bmatrix}
   \begin{bmatrix} 1 \\ 0 \end{bmatrix} \\
&= \begin{bmatrix} F(n+1)^2 + F(n)^2 \\ F(n) F(n+1) + F(n-1) F(n) \end{bmatrix}
\end{aligned}
$$

Substituting $F(k-1) = F(k+1) - F(k)$ into the second row gives:

$$
\begin{aligned}
F(2k) &= F(k) [2 F(k+1) - F(k)] \\
F(2k+1) &= F(k+1)^2 + F(k)^2
\end{aligned}
$$

By the original definition of the Fibonacci sequence, F(0) = 0, F(1) = 1, F(2) = 1. Starting from these three values, the formulas above yield every later value.

The corresponding pseudocode:

Fast_Fib(n)
    a = 0; b = 1;       // m = 0
    for i = (number of binary digit in n) to 1
        t1 = a*(2*b - a);
        t2 = b^2 + a^2;
        a = t1; b = t2; // m *= 2
        if (current binary digit == 1)
            t1 = a + b; // m++
            a = b; b = t1;
    return a;
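
The pseudocode maps almost directly onto C. The following is a minimal sketch, assuming the result fits in 64 bits (so `n` up to 93); the function name `fib_fast_doubling` is chosen here for illustration and is not part of fibdrv. It scans every bit of `n` from the most significant one, so leading zero bits simply leave `a = 0, b = 1` unchanged.

```c
#include <stdint.h>

/* Fast doubling over the bits of n, most significant bit first.
 * Invariant: a = F(m), b = F(m + 1), where m is the prefix of n read so far.
 * Only valid while F(n) fits in 64 bits, i.e. n <= 93. */
static uint64_t fib_fast_doubling(unsigned int n)
{
    uint64_t a = 0, b = 1; /* m = 0 */

    for (unsigned int mask = 1U << (sizeof(n) * 8 - 1); mask; mask >>= 1) {
        uint64_t t1 = a * (2 * b - a); /* F(2m) */
        uint64_t t2 = b * b + a * a;   /* F(2m + 1) */
        a = t1;
        b = t2; /* m *= 2 */
        if (n & mask) {
            t1 = a + b; /* F(2m + 2) */
            a = b;
            b = t1; /* m++ */
        }
    }
    return a;
}
```

Calling `fib_fast_doubling(10)` follows the same steps as the F(10) walkthrough below, only with the extra no-op iterations for the leading zero bits.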

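Take $F(6)$ as an example:

(Figure: first recursion)

(Figure: second recursion)

The number of recursive calls shrinks. Can it be faster still? Observe that $F(6)$ is split into $F(3)$ and $F(4)$, where $F(4) = F(2) + F(3)$. $F(4)$ can therefore be computed from $F(3)$ and the $F(2)$ already obtained while evaluating $F(3)$, which reduces the number of operations once more:

(Figure: second recursion)

Compared with the original definition of the Fibonacci sequence, the difference is substantial.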

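Worked example: computing $F(10)$, where $10_{10} = 1010_2$:

| i | start | 4 | 3 | 2 | 1 | result |
|---|---|---|---|---|---|---|
| n | - | 1010 | 1010 | 1010 | 1010 | - |
| F(m) | F(0) | F(0*2+1) | F(1*2) | F(2*2+1) | F(5*2) | F(10) |
| a | 0 | 1 | 1 | 5 | 55 | 55 |
| b | 1 | 1 | 2 | 8 | 89 | - |

For comparison, $F(11)$, where $11_{10} = 1011_2$:

| bit | 1 | 0 | 1 | 1 | result |
|---|---|---|---|---|---|
| F(n) | F(0*2+1) | F(1*2) | F(2*2+1) | F(5*2+1) | F(11) |
| a | 1 | 1 | 5 | 89 | 89 |
| b | 1 | 2 | 8 | 144 | - |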

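Computing $F(n)$ with hardware acceleration

Many modern processors provide instructions such as clz / ctz, which pair well with the Fast Doubling method above:

- Skip $F(0)$ and start directly from $F(1)$;
- clz/ffs: strip the leading 0 bits counted from the MSB. Leading 0s are not part of the actual number, so processing them is wasted work; clz tells us how many there are;
- On a 0 bit $\to$ apply fast doubling, that is, compute $F(2n)$ and $F(2n+1)$;
- On a 1 bit $\to$ apply fast doubling to get $F(2n)$ and $F(2n+1)$, then compute $F(2n+2)$.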

Compare with 你所不知道的 C 語言: 遞迴呼叫篇 (the recursion chapter of that series).
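
The steps above can be sketched in C as follows. This is only an illustration under stated assumptions: it relies on `__builtin_clz`, a GCC/Clang builtin whose result is undefined for 0, the name `fib_clz` is made up here, and the result must fit in 64 bits (`n` up to 93).

```c
#include <stdint.h>

static uint64_t fib_clz(unsigned int n)
{
    if (n == 0)
        return 0; /* __builtin_clz(0) is undefined, so handle F(0) up front */

    /* Number of significant bits in n once the leading zeros are skipped */
    int bits = (int) (sizeof(n) * 8) - __builtin_clz(n);

    /* The most significant 1 bit is consumed by starting at F(1), F(2) */
    uint64_t a = 1, b = 1;

    for (int i = bits - 2; i >= 0; i--) {
        uint64_t t1 = a * (2 * b - a); /* F(2m) */
        uint64_t t2 = b * b + a * a;   /* F(2m + 1) */
        a = t1;
        b = t2;
        if (n & (1U << i)) { /* 1 bit: advance once more to F(2m + 1), F(2m + 2) */
            t1 = a + b;
            a = b;
            b = t1;
        }
    }
    return a;
}
```

Compared with scanning all bits of the word, the loop here runs only once per significant bit after the first, which is where clz pays off for small `n`.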

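Applications of Fibonacci numbers

When a computer generates random numbers, it usually feeds each newly produced value into the next computation, which is why most computer random number sequences are defined recursively.

Only Pseudo-Random Number Generators (PRNG) are discussed here.

This is the ordinary Fibonacci sequence: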

f[0] = 0;
f[1] = 1;
f[i] = f[i - 1] + f[i - 2];

The sequence increases monotonically:

1, 1, 2, 3, 5, 8, 13, 21, ...

Now force the values to wrap around once they reach 100:

f[0] = 0;
f[1] = 1;
f[i] = (f[i - 1] + f[i - 2]) % 100;

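The pattern is still visible at first, but the overall trend is no longer monotonic, and once the values wrap around the regularity becomes hard to spot: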

1, 1, 2, 3, 5, 8, 13, 21, 34, 55,
89, 44, 33, 77, 10, 87, 97, 84, 81, 65,
46, 11, 57, 68, 25, 93, 18, 11, 29, 40,
...

The recurrence can even be changed to:

f[0] = 18;
f[1] = 83;
f[2] = 4;
f[3] = 42;
f[4] = 71;
f[i] = (f[i - 2] + f[i - 5]) % 100;

The rule behind this sequence is no longer easy to see through:

18, 83, 4, 42, 71, 60, 54, 64, 96, 35,
56, 89, 20, 85, 55, 41, 44, 61, 29, 16,
70, 60, 31, 89, 47, 59, 7, 90, 96, 37,
...

This way of producing random numbers is called a Lagged Fibonacci generator (LFG).
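
As a quick check, the lagged recurrence above can be run in C. This is a minimal sketch that reuses the five seed values and the lags 2 and 5 from the text; the function name `lfg_fill` is made up for illustration.

```c
#include <stddef.h>

/* Fill f[0..n-1] with the lagged sequence f[i] = (f[i-2] + f[i-5]) % 100,
 * seeded with the five starting values used in the text. */
static void lfg_fill(unsigned int *f, size_t n)
{
    static const unsigned int seed[5] = {18, 83, 4, 42, 71};

    for (size_t i = 0; i < n; i++)
        f[i] = (i < 5) ? seed[i] : (f[i - 2] + f[i - 5]) % 100;
}
```

Filling 30 entries reproduces the three rows listed above, from 18 through 37.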


Writing Linux kernel modules

Please study the following materials on your own:


`fibdrv`: a Linux kernel module that outputs the Fibonacci sequence

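Preparation

The development environment is expected to resemble lab0. If you have just reinstalled Ubuntu Linux, install the required development packages as instructed;
Starting from Homework3, we need to analyze the performance of applications and the Linux kernel. Do not run the tests inside a virtual machine (a container or Linux KVM is acceptable), otherwise the measurements will be skewed
$\to$ install GNU/Linux on a physical machine as early as possible;

Since Linux kernel 4.4, Ubuntu Linux enables EFI_SECURE_BOOT_SIG_ENFORCE by default, so kernel modules must be properly signed before they can be loaded into the Linux kernel. To simplify later testing, turn off UEFI Secure Boot; see Why do I get “Required key not available” when install 3rd party kernel modules or after a kernel upgrade?
Check the Linux kernel version
```
$ uname -r
```
The version is expected to be 5.4.0 or later, for example 5.4.0-66-generic. If the kernel in your development environment is older than 5.4, update the Linux kernel; consult the relevant documentation on your own
Install the linux-headers package (note the trailing s in the name). On Ubuntu Linux 20.04 LTS, for example:
```
$ sudo apt install linux-headers-`uname -r`
```

Confirm that the linux-headers package is correctly installed in the development environment

$ dpkg -L linux-headers-5.4.0-66-generic | grep "/lib/modules"

Expected output:

/lib/modules/5.4.0-66-generic/build

Check the current user identity
```
$ whoami
```
The result should be a user name other than root, for example jserv (or whichever login account you chose when installing Ubuntu Linux). Since the tests need sudo, also verify:
```
$ sudo whoami
```
The expected output is root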
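In the steps below, avoid typing commands as root; use sudo instead.

Later experiments rebuild the root file system. Abusing root privileges can easily wreck your GNU/Linux development environment (you can always reinstall, of course), so build the habit now.

Install the tools needed later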

$ sudo apt install util-linux strace gnuplot-nox

Get the source code

$ git clone https://github.com/sysprog21/fibdrv
$ cd fibdrv

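Build and test
```
$ make check
```
You should see a green Passed [-] message, followed by
```
f(93) fail
input: 7540113804746346429
expected: 12200160415121876738
```
This is expected, because the provided fibdrv is defective.

The world is imperfect, and that is exactly why there is room for us engineers.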

Inspect the generated fibdrv.ko kernel module

$ modinfo fibdrv.ko

Expected output:

description:    Fibonacci engine driver
author:         National Cheng Kung University, Taiwan
license:        Dual MIT/GPL
name:           fibdrv
vermagic:       5.4.0-45-generic SMP mod_unload

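Observe how the fibdrv.ko kernel module behaves once loaded into the Linux kernel (the device file /dev/fibonacci only appears after the module has been loaded with insmod)
```
$ ls -l /dev/fibonacci
$ cat /sys/class/fibonacci/fibonacci/dev
```
Look at the newly created device file /dev/fibonacci and note the number 236, which may differ on your machine. Compare it against fibdrv.c and work out how the two are related.
```
$ cat /sys/module/fibdrv/version 
```
The expected output is 0.1, matching the version number set through MODULE_VERSION in fibdrv.c.
```
$ lsmod | grep fibdrv
$ cat /sys/module/fibdrv/refcnt
```
Both commands report 0 (for lsmod, in the last column), which is the module's current reference count.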

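Computing Fibonacci numbers from F₉₃ (inclusive) onward - using digit strings with the quiz2 SSO (Small String Optimization)

Arithmetic overflows from F₉₃ onward, so the results can no longer be computed correctly. Big numbers can be handled in the following ways:

Use the GCC __int128 type, or a self-defined structure: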

 struct BigN {
    unsigned long long lower, upper;
};

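Operate on digit strings and apply Small/Short String Optimization. Implementation code.
The digit-string addition below works as follows:
- Make sure string a is at least as long as string b
- Reverse strings a and b
- Add the digit characters one position at a time
- Reverse the resulting string to obtain the final result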

static void string_number_add(xs *a, xs *b, xs *out)
{
    char *data_a, *data_b;
    size_t size_a, size_b;
    int i, carry = 0;
    int sum;

    /*
     * Make sure the string length of 'a' is always greater than
     * the one of 'b'.
     */
    if (xs_size(a) < xs_size(b))
        __swap((void *) &a, (void *) &b, sizeof(void *));

    data_a = xs_data(a);
    data_b = xs_data(b);

    size_a = xs_size(a);
    size_b = xs_size(b);

    reverse_str(data_a, size_a);
    reverse_str(data_b, size_b);

    char buf[size_a + 2];

    for (i = 0; i < size_b; i++) {
        sum = (data_a[i] - '0') + (data_b[i] - '0') + carry;
        buf[i] = '0' + sum % 10;
        carry = sum / 10;
    }

    for (i = size_b; i < size_a; i++) {
        sum = (data_a[i] - '0') + carry;
        buf[i] = '0' + sum % 10;
        carry = sum / 10;
    }

    if (carry)
        buf[i++] = '0' + carry;

    buf[i] = 0;

    reverse_str(buf, i);

    /* Restore the original string */
    reverse_str(data_a, size_a);
    reverse_str(data_b, size_b);

    if (out)
        *out = *xs_tmp(buf);
}

With this, results remain correct up to F₅₀₀ (The first 500 Fibonacci numbers), as shown below:

$ sudo ./client
...
Reading from /dev/fibonacci at offset 499, returned the sequence 86168291600238450732788312165664788095941068326060883324529903470149056115823592713458328176574447204501.
Reading from /dev/fibonacci at offset 500, returned the sequence 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125.
...
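
The same digit-by-digit idea can be tried outside the kernel without the xs string type. The following is a simplified user-space sketch, not the fibdrv implementation: `decimal_add` is a hypothetical helper that works on plain NUL-terminated strings and walks them from the last digit instead of reversing them.

```c
#include <string.h>

/* Add two non-negative decimal strings a and b, writing the sum to out.
 * out must have room for max(strlen(a), strlen(b)) + 2 bytes. */
static void decimal_add(const char *a, const char *b, char *out)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t n = la > lb ? la : lb;
    size_t k = n + 1; /* one extra slot for a possible final carry */
    int carry = 0;

    out[k] = '\0';
    for (size_t i = 0; i < n; i++) {
        /* Read digits from the least significant end; pad with 0 */
        int da = i < la ? a[la - 1 - i] - '0' : 0;
        int db = i < lb ? b[lb - 1 - i] - '0' : 0;
        int sum = da + db + carry;
        out[--k] = '0' + sum % 10;
        carry = sum / 10;
    }
    out[--k] = '0' + carry; /* k is 0 here */

    /* Drop the leading '0' when there was no final carry */
    if (out[0] == '0' && n > 0)
        memmove(out, out + 1, n + 1);
}
```

Adding the decimal strings for F(92) and F(93) this way yields F(94), a value that no longer fits in 64 bits.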

Inside the `fibdrv` kernel module

Observe how a user-level program interacts with fibdrv:

    fd = open(FIB_DEV, O_RDWR);

    if (fd < 0) {
        perror("Failed to open character device");
        exit(1);
    }

    for (i = 0; i <= offset; i++) {
        sz = write(fd, write_buf, strlen(write_buf));
        printf("Writing to " FIB_DEV ", returned the sequence %lld\n", sz);
    }

    for (i = 0; i <= offset; i++) {
        lseek(fd, i, SEEK_SET);
        sz = read(fd, buf, 1);
        printf("Reading from " FIB_DEV
               " at offset %d, returned the sequence "
               "%lld.\n",
               i, sz);
    }

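fibdrv is designed as a character device, which can be thought of as a file that is accessed sequentially. Once the relevant functions are defined, it can be accessed through the file-access system calls (open, read, write, mmap, and so on). A user-level (userspace) program can therefore obtain the output through the read system call.

Now look at the implementation: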

/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    return (ssize_t) fib_sequence(*offset);
}

const struct file_operations fib_fops = {
    .owner = THIS_MODULE,
    .read = fib_read,
    .write = fib_write,
    .open = fib_open,
    .release = fib_release,
    .llseek = fib_device_lseek,
};

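Reads are implemented by the custom read handler registered in fib_fops, and fib_read ultimately returns fib_sequence(*offset). In other words, the user chooses the index $x$ in the Fibonacci sequence by setting the offset, and the return value of read delivers $fib(x)$ to the user.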

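In the LP64 data model, long long is only 64 bits wide, so representing larger numbers takes two or more variables per number. Adding such numbers with the plain "+" operator overflows, so consider implementing a wider full adder.

In hexadecimal, $fib(94)$ is 0x1 11F3 8AD0 840B F6BF, which exceeds the largest value a 64-bit integer can represent. To output $fib(100)$ or later values correctly, the return value must be carried in a purpose-built structure. A reference implementation: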

struct BigN {
    unsigned long long lower, upper;
};

struct BigN splits a number into two parts:

the upper 64 bits are stored in upper;
the lower 64 bits are stored in lower;

Big-number addition must check whether lower carries into upper:

static inline void addBigN(struct BigN *output, struct BigN x, struct BigN y)
{
    output->upper = x.upper + y.upper;
    if (y.lower > ~x.lower)
        output->upper++;
    output->lower = x.lower + y.lower;
}

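Because x.lower + ~x.lower = ~0, rearranging gives ~x.lower = ~0 - x.lower; that is, ~x.lower is the distance from x.lower to the maximum value (~0). If y.lower is larger than that distance, the addition overflows and a carry is needed, so output->upper++ is executed.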
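
To see the carry logic at work, the sketch below repeats struct BigN and addBigN from above and adds a hypothetical helper, `fib_bign`, that iterates the plain recurrence in user space. It assumes the result fits in 128 bits, which holds for n up to 186.

```c
struct BigN {
    unsigned long long lower, upper;
};

static inline void addBigN(struct BigN *output, struct BigN x, struct BigN y)
{
    output->upper = x.upper + y.upper;
    if (y.lower > ~x.lower) /* the low halves overflow: carry into upper */
        output->upper++;
    output->lower = x.lower + y.lower;
}

/* Iteratively compute F(n) as a 128-bit value (hypothetical helper) */
static struct BigN fib_bign(unsigned int n)
{
    struct BigN a = {.lower = 0, .upper = 0}; /* F(0) */
    struct BigN b = {.lower = 1, .upper = 0}; /* F(1) */

    for (unsigned int i = 0; i < n; i++) {
        struct BigN t;
        addBigN(&t, a, b);
        a = b;
        b = t;
    }
    return a;
}
```

For n = 94 the carry fires for the first time: upper becomes 1 and lower holds 0x11F38AD0840BF6BF.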

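Linux Virtual File System interface

Through the Linux Virtual File System interface, this kernel module makes the computed Fibonacci numbers accessible to the client.c program.

VFS provides a common interface that unifies the various file systems, so users can access different devices or files in a consistent way. Linux device drivers fall roughly into:

Character Device Driver;
Block Device Driver;
Network Device Driver;

Before a device can be used, its file operations must be defined and registered with the kernel.
According to the definition in include/linux/fs.h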

struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    ...
    int (*open) (struct inode *, struct file *);
    ...
    int (*release) (struct inode *, struct file *);
    int (*fsync) (struct file *, loff_t, loff_t, int datasync);
    int (*fasync) (int, struct file *, int);
    int (*lock) (struct file *, int, struct file_lock *);
    ...
} __randomize_layout;

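For this technique, see 你所不知道的 C 語言：物件導向程式設計篇.

In this module, a file_operations object is declared and filled with the functions that back the VFS operations (fib_read, fib_write, and so on). When a user-mode program invokes the read system call, VFS dispatches it to fib_read.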

const struct file_operations fib_fops = {
    .owner = THIS_MODULE,
    .read = fib_read,
    .write = fib_write,
    .open = fib_open,
    .release = fib_release,
    .llseek = fib_device_lseek,
};

Compare with Linux 核心設計: 檔案系統概念及實作手法.

Measuring time in kernel mode

The LWN article Reinventing the timer wheel states:

The kernel maintains two types of timers with two distinct use cases. The high-resolution timer ("hrtimer") mechanism provides accurate timers for work that needs to be done in the near future; hrtimer use is relatively rare, but, when hrtimers are used, they almost always run to completion. "Timeouts," instead, are normally used to alert the kernel to an expected event that has failed to arrive — a missing network packet or I/O completion interrupt, for example. The accuracy requirements for these timers are less stringent (it doesn't matter if an I/O timeout comes a few milliseconds late), and, importantly, these timers are usually canceled before they expire.

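Timing facilities are mainly used in 2 situations:

timer: schedules "do something at a given point in time". Think of a train timetable; the key point is that one late train delays every later one in a chain reaction, so a fairly precise timing mechanism is needed;
timeout: serves as a notification that "something is late". The simplest example is the use of SIGALRM in qtest. The precision requirement here is low;

One way to keep time is to count the timer interrupts that have occurred since boot. This mechanism is called jiffies and has existed in the Linux kernel for a long time. Older Linux kernels provide a timing mechanism built on jiffies, called the timer wheel. See A new approach to kernel timers (note: despite "new approach" in the title, the article was written in 2005).

Take tv1 as an example: it is an array of 256 entries. The value of the lowest 8 bits of jiffies selects the tv1 entry holding the events to process. tv2 sits one level above tv1: its index advances by one every $2^{8}$ jiffies. tv3 and beyond follow the same pattern.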
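
The index arithmetic can be sketched as follows. The 8-bit width of tv1 comes from the text; the 6-bit width used for tv2 and the levels above it is an assumption modeled on the classic timer wheel layout, and the names `tv1_index` and `tvn_index` are made up for illustration. This is not kernel code.

```c
#define TVR_BITS 8 /* tv1: 256 slots, indexed by the lowest 8 bits */
#define TVN_BITS 6 /* tv2 and above: assumed 64 slots each */
#define TVR_MASK ((1UL << TVR_BITS) - 1)
#define TVN_MASK ((1UL << TVN_BITS) - 1)

/* Slot of tv1 that the given jiffies value points at */
static unsigned int tv1_index(unsigned long jiffies)
{
    return jiffies & TVR_MASK;
}

/* Slot of tvN (level = 2 for tv2, 3 for tv3, ...) for the given jiffies */
static unsigned int tvn_index(unsigned long jiffies, unsigned int level)
{
    return (jiffies >> (TVR_BITS + (level - 2) * TVN_BITS)) & TVN_MASK;
}
```

With these definitions the tv2 index stays at 0 for jiffies 0 through 255 and becomes 1 at jiffies 256, which is the "advance by one every $2^{8}$ jiffies" behavior described above.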

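However, this timer is limited by how often the timer interrupt fires and can actually be serviced, and that frequency has an upper bound. On interrupts, see Linux 核心設計: 中斷處理和現代架構考量.

A newer mechanism breaks away from jiffies and builds timing on a new data type, ktime_t; see the LWN article The high-resolution timer API.

hrtimer is a timing mechanism introduced in 2.6.16; it uses the new ktime_t type to keep time. The definition of this type varies by architecture, so, like most data structures in Linux, it should be manipulated only through dedicated functions. On x86-64 it is a 64-bit integer. Usage is as follows: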

Declare and initialize a ktime_t:

DEFINE_KTIME(name);

This plays the same role for ktime_t as LIST_HEAD does for struct list_head: it declares a ktime_t and initializes it to 0.

Arithmetic on ktime values:

    ktime_t ktime_add(ktime_t kt1, ktime_t kt2);
    ktime_t ktime_sub(ktime_t kt1, ktime_t kt2);  /* kt1 - kt2 */
    ktime_t ktime_add_ns(ktime_t kt, u64 nanoseconds);

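Conversions to and from other time representations (these are the interfaces described in the LWN article; newer kernels may have renamed or removed some of them, so check include/linux/ktime.h for your kernel version):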

    ktime_t timespec_to_ktime(struct timespec tspec);
    ktime_t timeval_to_ktime(struct timeval tval);
    struct timespec ktime_to_timespec(ktime_t kt);
    struct timeval ktime_to_timeval(ktime_t kt);
    clock_t ktime_to_clock_t(ktime_t kt);
    u64 ktime_to_ns(ktime_t kt);

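See IBM's article on hrtimer for details, and compare with Linux 核心設計: Timer 及其管理機制.

The ktime APIs can be used to measure time. Notice that write currently does nothing useful in this kernel module, so it can be repurposed to report the execution time of the previous fib computation. A sample modification: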

static ktime_t kt;

static long long fib_time_proxy(long long k)
{
    kt = ktime_get();
    long long result = fib_sequence(k);
    kt = ktime_sub(ktime_get(), kt);

    return result;
}

static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    return (ssize_t) fib_time_proxy(*offset);
}

static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    return ktime_to_ns(kt);
}
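
The kernel-side figure returned through write can be compared with a user-space measurement taken around the read call. The following is a minimal sketch of the user-space half, assuming clock_gettime with CLOCK_MONOTONIC is available; the helper name `elapsed_ns` is hypothetical and does not appear in client.c.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Nanoseconds elapsed between two clock_gettime() samples.
 *
 * Intended use in a client (sketch):
 *     struct timespec t0, t1;
 *     clock_gettime(CLOCK_MONOTONIC, &t0);
 *     ... read() from the device ...
 *     clock_gettime(CLOCK_MONOTONIC, &t1);
 *     long long user_ns = elapsed_ns(&t0, &t1);
 */
static long long elapsed_ns(const struct timespec *start,
                            const struct timespec *end)
{
    return (long long) (end->tv_sec - start->tv_sec) * 1000000000LL +
           (end->tv_nsec - start->tv_nsec);
}
```

The difference between the user-space and kernel-space figures gives a rough idea of the system call and copy overhead.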

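About client.c

According to the manual page for read, a successful read returns the number of bytes read, and one of its parameters, count, is the number of bytes requested. In client.c, however, the computed Fibonacci number is the return value of read().

See part 7, "Using Memory Allocated in User Mode", of Linux Driver Tutorial: How to Write a Simple Linux Device Driver: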

The user allocates a special buffer in the user-mode address space. And the other action that the read function must perform is to copy the information to this buffer. The address to which a pointer from that space points and the address in the kernel address space may have different values. That's why we cannot simply dereference the pointer.

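In other words, when a buffer is allocated in the user-mode address space, the kernel driver must not write to that address directly. Instead it must use copy_to_user to copy the string intended for user mode (the running client) into the buf parameter of fib_read; only then does the client receive it.

According to the study Hierarchical performance measurement and modeling of the Linux file system, copying data from kernel mode to user level costs as much as 22ns per byte. Performance analysis must therefore account for the time spent in copy_to_user, paying particular attention to the measurement error introduced as the data size grows.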


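Tips for Linux performance analysis

CPUs in servers and personal computers alike are almost all multi-core. On a multi-core operating system, processor affinity (also called CPU pinning) is commonly used to keep a process running on specific CPU cores with less interference from the OS scheduler.

Binding a process to specific CPU cores has several benefits. For example, when a cache-bound program runs alongside more CPU-intensive programs, pinning it to specific cores reduces cache misses. Likewise, when two processes exchange data frequently through shared memory, binding both to the same NUMA node improves efficiency.

On Linux, the taskset command sets or retrieves the processor affinity of a process, and so can dedicate specific processor cores to a program or process.

Viewing a process's CPU affinity

To view the processor affinity of a given process, run taskset with the -p option followed by the PID (process ID) of the process: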

$ taskset -p PID

where PID is the process ID, for example:

$ taskset -p 1

Sample output (results will obviously vary from machine to machine):

pid 1's current affinity mask: ffffffffffffffff

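The affinity mask in the output is a hexadecimal bitmask. After converting it to binary, a bit value of 1 means the process may run on the CPU core corresponding to that bit, and 0 means it may not.

Each ff in the mask above is 11111111 in binary. A mask of ff therefore means the process may run on CPU cores 0 through 7, and the 16-digit mask above allows cores 0 through 63. The lowest bit (LSB, rightmost) stands for CPU core 0, the next one for core 1, and so on.

A mask of 0x11 means the process may run on CPU cores 0 and 4. If the bitmask notation is hard to read, add the -c option to make taskset print the list of CPU cores directly: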

$ taskset -cp 1

Sample output:

pid 1's current affinity list: 0-63
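
The relation between the two notations can be checked with a few lines of C. This is an illustrative sketch only: `mask_to_cpus` is a made-up helper that decodes a mask of up to 64 bits, and it does not call any scheduler API.

```c
/* Store the indices of the set bits of mask in cpus[] (ascending order)
 * and return how many CPUs the mask allows. */
static int mask_to_cpus(unsigned long long mask, int cpus[64])
{
    int count = 0;

    for (int cpu = 0; cpu < 64; cpu++) {
        if (mask & (1ULL << cpu)) /* bit 0 (LSB) stands for CPU core 0 */
            cpus[count++] = cpu;
    }
    return count;
}
```

Decoding 0x11 gives cores 0 and 4, and decoding 0xff gives cores 0 through 7, matching the description above.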

Pinning a process to specific CPUs

taskset can also set a process's core mask, pinning the process to specific CPU cores:

$ taskset -p COREMASK PID

where COREMASK is the hexadecimal core mask and PID is the process ID. Alternatively, use the -c option to specify a list of CPU cores:

$ taskset -cp CORELIST PID

where CORELIST is a list of CPU cores, with core numbers separated by commas or contiguous ranges given with a hyphen, for example: 0,2,5,7-10.

For example, to pin a process to CPU cores 0 and 4:

$ taskset -p 0x11 9030

Output:

pid 9030's current affinity mask: ff
pid 9030's new affinity mask: 11

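The CPU core list form works too:

$ taskset -cp 0,4 9030

Both forms have the same effect.

On Linux, a user can change the processor affinity of their own processes, but needs the CAP_SYS_NICE capability to change it for a process belonging to another user. Merely viewing the affinity setting is unrestricted (any user may query it).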

Running a program on specific CPUs

Besides changing the affinity of an existing process, taskset can launch a new program on specified CPU cores:

$ taskset COREMASK EXECUTABLE

where EXECUTABLE is the program to run.
For example, to run vlc on CPU core 0:

$ taskset 0x1 vlc

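Reserving CPUs for specific programs

taskset controls which CPU cores a process uses, but it does not stop other processes from using those cores. To keep other processes from interfering with your program, so that the chosen cores run only the processes you assign, use the isolcpus Linux kernel boot parameter, which reserves specific CPU cores at boot time.

There are two ways to set it: add isolcpus=cpu_id manually at boot through the boot loader's custom boot parameter feature, or add it directly to the GRUB configuration file. The Linux scheduler then, by default, places no ordinary process on the reserved cores; only processes explicitly assigned with taskset can use them.

For example, to reserve CPU cores 0 and 1, add the following at boot:

isolcpus=0,1

as a Linux kernel boot parameter, then use taskset to assign these two cores to the program you want to run.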

Eliminating factors that disturb performance analysis

Disable address space layout randomization (ASLR)

$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"

Set scaling_governor to performance. Prepare the following shell script, named performance.sh:

for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
    echo performance > ${i}
done

Then run it with $ sudo sh performance.sh

For Intel processors, disable turbo mode

$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"


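How Linux loads kernel modules

What do macros such as MODULE_LICENSE, MODULE_AUTHOR, MODULE_DESCRIPTION, and MODULE_VERSION in fibdrv.c do so that the kernel learns this information? And what happens inside the Linux kernel behind the insmod command?

In essence, these macros embed the corresponding information in the compiled .ko file (ko stands for kernel object, in contrast to a user-level shared object [the equivalent of a DLL on Microsoft Windows]; compare with 你所不知道的 C 語言：動態連結器篇). Since they all work the same way, the discussion below focuses on MODULE_AUTHOR. The sample code specifies the module's author: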

MODULE_AUTHOR("National Cheng Kung University, Taiwan");

Excerpt from include/linux/module.h:

/*
 * Author(s), use "Name <email>" or just "Name", for multiple
 * authors use multiple MODULE_AUTHOR() statements/lines.
 */
#define MODULE_AUTHOR(_author) MODULE_INFO(author, _author)

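The comment above describes the format of _author and says that MODULE_AUTHOR should be invoked once per author when there are several.
Expanding the macro yields MODULE_INFO(author, "National Cheng Kung University, Taiwan")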

/* Generic info of form tag = "info" */
#define MODULE_INFO(tag, info) __MODULE_INFO(tag, tag, info)

Expanding further yields __MODULE_INFO(author, author, "National Cheng Kung University, Taiwan")

Excerpt from include/linux/moduleparam.h:

#ifdef MODULE
#define __MODULE_INFO(tag, name, info)					  \
static const char __UNIQUE_ID(name)[]					  \
  __used __attribute__((section(".modinfo"), unused, aligned(1)))	  \
  = __stringify(tag) "=" info
#else  /* !MODULE */
/* This struct is here for syntactic coherency, it is not used */
#define __MODULE_INFO(tag, name, info)					  \
  struct __UNIQUE_ID(name) {}
#endif

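Which definition is used depends on whether MODULE is defined. MODULE is defined when the code is compiled as a loadable kernel module, and is not defined when it is built into the kernel. Expanding the macro further: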

static const char __UNIQUE_ID(author)[]					  \
  __used __attribute__((section(".modinfo"), unused, aligned(1)))	  \
  = __stringify(author) "=" "National Cheng Kung University, Taiwan"

Excerpt from include/linux/compiler-gcc.h:

#define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)

Next, expand __UNIQUE_ID. The __COUNTER__ macro is maintained by GCC, and its value increases by one each time it is used.

static const char __PASTE(__PASTE(__UNIQUE_ID_, author), __COUNTER__)[]					  \
  __used __attribute__((section(".modinfo"), unused, aligned(1)))	  \
  = __stringify(author) "=" "National Cheng Kung University, Taiwan"

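__UNIQUE_ID generates a unique name from its argument (see linux/compiler.h). The technique relies on the ## operator in macros, which pastes two tokens into a single new identifier.
The __attribute__ keyword tells the compiler that this data
1. must be placed in the .modinfo section
2. is probably not referenced by the program, so no warning should be emitted
3. needs a minimum alignment of 1 byte
In linux/stringify.h we can see that __stringify converts its argument into a string literal. Taking MODULE_LICENSE("Dual MIT/GPL") as an example, the expanded __stringify(tag) "=" info becomes the string "license=Dual MIT/GPL".

In summary, the MODULE_XXX family of macros all end up as

static const char unique_name[] = "tag=info"

placed in the .modinfo section. This corresponds to 6.4.5 String Literals in the C99/C11 specification:

In translation phase 6, the multibyte character sequences specified by any sequence of adjacent character and wide string literal tokens are concatenated into a single multibyte character sequence.

Roughly speaking, adjacent string literals are equivalent to one concatenated string.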

Excerpt from include/linux/compiler_types.h:

/* Indirect macros required for expanded argument pasting, eg. __LINE__. */
#define ___PASTE(a,b) a##b
#define __PASTE(a,b) ___PASTE(a,b)

Expanding further:

static const char __UNIQUE_ID_author0[]			    \
  __used __attribute__((section(".modinfo"), unused, aligned(1)))	  \
  = __stringify(author) "=" "National Cheng Kung University, Taiwan"

Excerpt from include/linux/stringify.h:

#define __stringify_1(x...)	#x
#define __stringify(x...)	__stringify_1(x)

Note that # and ## are both preprocessor syntax; see 你所不知道的 C 語言：前置處理器應用篇 for details.

The final expansion gives the following result

static const char __UNIQUE_ID_author0[] \
  __used __attribute__((section(".modinfo"), unused, aligned(1))) \
  = "author=National Cheng Kung University, Taiwan"

According to the GCC documentation on variable attributes, section places the variable in the specified ELF section, here .modinfo. For more on ELF, see 你所不知道的 C 語言：連結器和執行檔資訊.
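
The stringify and concatenation steps can be reproduced in user space. The sketch below uses standard C only; `STRINGIFY`, `DEMO_INFO`, and `demo_author` are names invented for this demonstration, and the section and alignment attributes are deliberately left out.

```c
#define STRINGIFY_1(x) #x
#define STRINGIFY(x) STRINGIFY_1(x)

/* tag is turned into a string literal, then the three adjacent
 * literals are concatenated by the compiler into one string. */
#define DEMO_INFO(tag, info) STRINGIFY(tag) "=" info

static const char demo_author[] =
    DEMO_INFO(author, "National Cheng Kung University, Taiwan");
```

The array ends up holding a single string of the form tag=info with no spaces around the equals sign, the same layout visible in the .modinfo dump below.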

$ objdump -s fibdrv.ko
...
Contents of section .modinfo:
 0000 76657273 696f6e3d 302e3100 64657363  version=0.1.desc
 0010 72697074 696f6e3d 4669626f 6e616363  ription=Fibonacc
 0020 6920656e 67696e65 20647269 76657200  i engine driver.
 0030 61757468 6f723d4e 6174696f 6e616c20  author=National 
 0040 4368656e 67204b75 6e672055 6e697665  Cheng Kung Unive
 0050 72736974 792c2054 61697761 6e006c69  rsity, Taiwan.li
 0060 63656e73 653d4475 616c204d 49542f47  cense=Dual MIT/G
 0070 504c0000 00000000 73726376 65727369  PL......srcversi
 0080 6f6e3d34 42373436 37453631 43414238  on=4B7467E61CAB8
 0090 32354539 35364446 38330000 00000000  25E956DF83......
 00a0 64657065 6e64733d 00726574 706f6c69  depends=.retpoli
 00b0 6e653d59 006e616d 653d6669 62647276  ne=Y.name=fibdrv
 00c0 00766572 6d616769 633d342e 31382e30  .vermagic=4.18.0
 00d0 2d31362d 67656e65 72696320 534d5020  -16-generic SMP 
 00e0 6d6f645f 756e6c6f 61642000           mod_unload .    
...

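The output above shows that the author information has been written into the .modinfo section.

Going a step further, open fibdrv.ko in vim in hexadecimal mode; at the offset of the .modinfo section reported by readelf (0x560) you will find the following: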

00000530: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000540: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000550: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000560: 7665 7273 696f 6e3d 302e 3100 6465 7363  version=0.1.desc
00000570: 7269 7074 696f 6e3d 4669 626f 6e61 6363  ription=Fibonacc
00000580: 6920 656e 6769 6e65 2064 7269 7665 7200  i engine driver.
00000590: 6175 7468 6f72 3d4e 6174 696f 6e61 6c20  author=National
000005a0: 4368 656e 6720 4b75 6e67 2055 6e69 7665  Cheng Kung Unive
000005b0: 7273 6974 792c 2054 6169 7761 6e00 6c69  rsity, Taiwan.li
000005c0: 6365 6e73 653d 4475 616c 204d 4954 2f47  cense=Dual MIT/G
000005d0: 504c 0000 0000 0000 7372 6376 6572 7369  PL......srcversi
000005e0: 6f6e 3d32 3444 4335 4642 3745 3736 3038  on=24DC5FB7E7608
000005f0: 4146 3136 4230 4343 3146 0000 0000 0000  AF16B0CC1F......
00000600: 6465 7065 6e64 733d 0072 6574 706f 6c69  depends=.retpoli
00000610: 6e65 3d59 006e 616d 653d 6669 6264 7276  ne=Y.name=fibdrv
00000620: 0076 6572 6d61 6769 633d 342e 3138 2e30  .vermagic=4.18.0
00000630: 2d31 352d 6765 6e65 7269 6320 534d 5020  -15-generic SMP
00000640: 6d6f 645f 756e 6c6f 6164 2000 0000 0000  mod_unload .....
00000650: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000660: 0000 0000 0000 0000 0000 0000 0000 0000  ................

Next, see what the Linux kernel does when insmod runs.

From chapter 2 of Linux Device Drivers, 3/e:

…The use of module_init is mandatory. This macro adds a special section to the module’s object code stating where the module’s initialization function is to be found. Without this definition, your initialization function is never called.

So the module_init macro records, in the compiled object, where the module's initialization function is located. Similarly, for module_exit:

Once again, the module_exit declaration is necessary to enable to kernel to find your cleanup function.

Here strace is used to trace which system calls are made while insmod fibdrv.ko runs:

$ sudo strace insmod fibdrv.ko

execve("/sbin/insmod", ["insmod", "fibdrv.ko"], 0x7ffeab43f308 /* 25 vars */) = 0
brk(NULL)                               = 0x561084511000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=83948, ...}) = 0
mmap(NULL, 83948, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0621290000
close(3)                                = 0
...
close(3)                                = 0
getcwd("/home/johnnylord/fibdrv", 4096) = 24
stat("/home/johnnylord/fibdrv/fibdrv.ko", {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0
openat(AT_FDCWD, "/home/johnnylord/fibdrv/fibdrv.ko", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0
mmap(NULL, 8288, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f06212a2000
finit_module(3, "", 0)                  = 0
munmap(0x7f06212a2000, 8288)            = 0
close(3)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

Near the end of the trace, finit_module is called. Next, look at how the Linux kernel declares and implements finit_module.

kernel/module.c

SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
{
    // ...    
    return load_module(&info, uargs, flags);
}

The function ends by calling load_module:

/* Allocate and load the module: note that size of section 0 is always
   zero, and we rely on this for optional sections. */
static int load_module(struct load_info *info, const char __user *uargs,
		       int flags)
{
    ...
}

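As the comment indicates, load_module is roughly where the Linux kernel allocates memory for the module and loads its data.

When a kernel module is loaded through insmod, why does the function registered with module_init get executed? What does the Linux kernel do?

First, look at the source code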

static int __init init_fib_dev(void)
{
    // ...
}

static void __exit exit_fib_dev(void)
{
    // ...
}

module_init(init_fib_dev);
module_exit(exit_fib_dev);

Excerpt from include/linux/module.h:

/* Each module must use one module_init(). */
#define module_init(initfn)					\
	static inline initcall_t __maybe_unused __inittest(void)		\
	{ return initfn; }					\
	int init_module(void) __attribute__((alias(#initfn)));

The last line of the macro shows that, after compilation, GCC makes int init_module(void) an alias of initfn.

See the GCC manual for the usage of __attribute__((alias( ..))).
Running gcc -E fibdrv.c -I$TREE/include -I$TREE/arch/x86/include -I$TREE/include/uapi shows the code after preprocessing, where $TREE is the path to the kernel source tree, i.e. /usr/src/linux-headers-$(uname -r)

See Stack Overflow - Kernel module source file after preprocessing

linux/module.h defines module_init in two places:
one for when MODULE is not defined

#ifndef MODULE
#define module_init(x)	__initcall(x);

and one for when MODULE is defined

#else /* MODULE */
...
/* Each module must use one module_init(). */
#define module_init(initfn)					\
	static inline initcall_t __maybe_unused __inittest(void)		\
	{ return initfn; }					\
	int init_module(void) __copy(initfn) __attribute__((alias(#initfn)));

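fibdrv is built as a loadable module, so MODULE is defined and the second definition of module_init applies, which matches the preprocessed output. For code built into the kernel (MODULE not defined), module_init(x) instead expands to __initcall(x).

The __initcall(x) macro is defined in linux/init.h

#define __initcall(fn) device_initcall(fn)

and device_initcall in turn expands to __define_initcall(fn, 6). For fibdrv, the fully expanded macro is: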

static inline __attribute__((unused))
    __attribute__((no_instrument_function)) 
    initcall_t __attribute__((unused)) 
    __inittest(void) { return init_fib_dev; } 
    
int init_module(void) __attribute__((alias("init_fib_dev")));;

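Now to interpret it. The first part, static inline ...., is a small trick that checks at compile time whether the function pointer passed in is valid: the returned init_fib_dev must have the same type as initcall_t, otherwise the compiler complains.

This approach resembles BUG_ON and C++ static assertions.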
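
The type-check trick can be tried in user space. In this sketch `initcall_t` mirrors the kernel typedef, while `CHECK_INITCALL`, `demo_init`, and `demo_inittest` are hypothetical names used only for the demonstration.

```c
typedef int (*initcall_t)(void);

/* Define an otherwise unused function that returns fn as an initcall_t.
 * If fn does not have the type int (*)(void), the return statement
 * triggers a compiler diagnostic about incompatible pointer types. */
#define CHECK_INITCALL(name, fn)                                \
    static inline initcall_t __attribute__((unused)) name(void) \
    {                                                           \
        return fn;                                              \
    }

static int demo_init(void)
{
    return 0;
}

CHECK_INITCALL(demo_inittest, demo_init)
```

Changing demo_init to return void, for instance, makes the compiler flag the return statement inside demo_inittest, even though that function is never called.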

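Next is init_module. Here it is the module's entry symbol (it shares its name with the system call that loads an ELF image into kernel space). The final line, int init_module(void) __attribute__((alias("init_fib_dev"))), makes init_module an alias of init_fib_dev.

This is needed because an earlier part of the header declares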

/* These are either module local, or the kernel's dummy ones. */
extern int init_module(void);

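This declaration says that init_module exists but is implemented elsewhere. Where? In our own init_fib_dev, because init_module has been made an alias of init_fib_dev.

To summarize, the module_init macro does 2 things for us:

- checks that the function passed in has the correct type
- ties init_module to the function passed in. Loading the module with insmod ends up calling init_module, so calling init_module is equivalent to calling the function we wrote.

The following experiment confirms that the alias takes effect: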

#include <stdio.h>
int __func() {
    printf("In __func()\n");
    return 0;
}
int func() __attribute__((alias("__func"))); /* no function body */
int main() {
    func();
    return 0;
}

Compile and run:

$ gcc -o test test.c
$ ./test
In __func()

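So executing init_module() is equivalent to executing the user-defined function initfn.

Back to the question of why init_module gets executed at all:
recall that the system calls finit_module, which then calls load_module.

Excerpt from kernel/module.c: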

/* Allocate and load the module: note that size of section 0 is always
   zero, and we rely on this for optional sections. */
static int load_module(struct load_info *info, const char __user *uargs,
		       int flags)
{
	struct module *mod;
	
        //...

	/* Figure out module layout, and allocate all the memory. */
	mod = layout_and_allocate(info, flags);
	if (IS_ERR(mod)) {
		err = PTR_ERR(mod);
		goto free_copy;
	}

        // ...
	
	return do_init_module(mod);
        // ...
}

As shown, before do_init_module the kernel first calls layout_and_allocate to allocate memory for the module being loaded, and finally initializes the module through do_init_module.

Excerpt from kernel/module.c:

/*
 * This is where the real work happens.
 *
 * Keep it uninlined to provide a reliable breakpoint target, e.g. for the gdb
 * helper command 'lx-symbols'.
 */
static noinline int do_init_module(struct module *mod)
{
        // ...
    
	/* Start the module */
	if (mod->init != NULL)
		ret = do_one_initcall(mod->init);
        
        // ...
}

Excerpt from init/main.c:

int __init_or_module do_one_initcall(initcall_t fn)
{
	// ...
	
	do_trace_initcall_start(fn);
	ret = fn();
	do_trace_initcall_finish(fn, ret);

        // ...
}

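Here fn is the mod->init passed in. The module's init function is invoked by the ret = fn() call in the code above, performing the real initialization work of the kernel module.

Try running $ readelf -a fibdrv.ko, examine how its contents relate to the source code and to modinfo, and, together with the questions above, explain how an ELF file such as fibdrv.ko gets "planted" into the Linux kernel

Executable and Linking Format, ELF for short, can represent an executable binary file or an object file. Since fibdrv.ko in this experiment is not an executable, the discussion here treats ELF from the object-file point of view. An ELF file can be roughly divided into 3 parts:

ELF header
holds information about this object file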
$ readelf -h fibdrv.ko

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              REL (Relocatable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          0 (bytes into file)
  Start of section headers:          6688 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           0 (bytes)
  Number of program headers:         0
  Size of section headers:           64 (bytes)
  Number of section headers:         25
  Section header string table index: 24

Section(s)
There are system-predefined sections such as .text, .data, and .bss, as well as user-defined sections; in this example, .modinfo is one of them.
Section Header(s)
Hold the metadata for each section, for example the size of a given section.

$ readelf -S fibdrv.ko
There are 25 section headers, starting at offset 0x1a20:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .note.gnu.build-i NOTE             0000000000000000  00000040
       0000000000000024  0000000000000000   A       0     0     4
  [ 2] .text             PROGBITS         0000000000000000  00000070
       000000000000015c  0000000000000000  AX       0     0     16
  [ 3] .rela.text        RELA             0000000000000000  00001218
       0000000000000120  0000000000000018   I      22     2     8
  [ 4] .init.text        PROGBITS         0000000000000000  000001cc
       0000000000000153  0000000000000000  AX       0     0     1
  [ 5] .rela.init.text   RELA             0000000000000000  00001338
       00000000000003a8  0000000000000018   I      22     4     8
  [ 6] .exit.text        PROGBITS         0000000000000000  0000031f
       0000000000000040  0000000000000000  AX       0     0     1
  [ 7] .rela.exit.text   RELA             0000000000000000  000016e0
       00000000000000d8  0000000000000018   I      22     6     8
  [ 8] __mcount_loc      PROGBITS         0000000000000000  0000035f
       0000000000000030  0000000000000000   A       0     0     1
  [ 9] .rela__mcount_loc RELA             0000000000000000  000017b8
       0000000000000090  0000000000000018   I      22     8     8
  [10] .rodata.str1.1    PROGBITS         0000000000000000  0000038f
       000000000000006e  0000000000000001 AMS       0     0     1
  [11] .rodata.str1.8    PROGBITS         0000000000000000  00000400
       0000000000000058  0000000000000001 AMS       0     0     8
  [12] .rodata           PROGBITS         0000000000000000  00000460
       0000000000000100  0000000000000000   A       0     0     32
  [13] .rela.rodata      RELA             0000000000000000  00001848
       0000000000000090  0000000000000018   I      22    12     8
  [14] .modinfo          PROGBITS         0000000000000000  00000560
       00000000000000ec  0000000000000000   A       0     0     8
  [15] .data             PROGBITS         0000000000000000  00000660
       0000000000000020  0000000000000000  WA       0     0     32
  [16] .rela.data        RELA             0000000000000000  000018d8
  
   ...
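The ELF header fields printed by readelf -h can also be checked programmatically. The sketch below assumes a Linux system whose C library ships <elf.h>, and assumes the file uses the host's byte order; looks_like_kernel_object is a hypothetical helper name. It tests the properties visible in the output above: 64-bit class, type REL, and no program headers.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if buf starts with a 64-bit ELF header describing a
 * relocatable object that has no program headers, else 0.
 * Assumes the header is stored in the host's byte order.
 */
static int looks_like_kernel_object(const void *buf, size_t len)
{
    Elf64_Ehdr eh;

    assert(buf != NULL);
    if (len < sizeof(eh))
        return 0;
    memcpy(&eh, buf, sizeof(eh)); /* avoid unaligned access */

    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0)
        return 0; /* magic is not 7f 45 4c 46 */
    if (eh.e_ident[EI_CLASS] != ELFCLASS64)
        return 0;
    return eh.e_type == ET_REL && eh.e_phnum == 0;
}
```

Reading the first 64 bytes of fibdrv.ko into a buffer and passing it to this function should return 1 on the machine that produced the output above, whereas a linked executable would fail the e_type and e_phnum checks.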

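Next, consider how the modinfo program relates to fibdrv.ko. As inferred earlier, macros such as MODULE_XXX place extra information about the module into the .modinfo section of fibdrv.ko, so the modinfo program presumably reads the .modinfo section of fibdrv.ko and displays its contents. The following is the description of modinfo from man modinfo.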

DESCRIPTION
modinfo extracts information from the Linux Kernel modules given on the command line. If the module name is not a filename, then the /lib/modules/version directory is
searched, as is also done by modprobe(8) when loading kernel modules.
modinfo by default lists each attribute of the module in form fieldname : value, for
easy reading. The filename is listed the same way (although it's not really an
attribute).
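The .modinfo section stores its entries as consecutive NUL-terminated key=value strings, so looking up a field amounts to a linear scan. The following sketch assumes the section bytes have already been read into memory; modinfo_get is a hypothetical helper and is not how the modinfo program itself is written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Scan a buffer of NUL-terminated "key=value" strings and return a
 * pointer to the value of the first entry whose key matches, or NULL.
 */
static const char *modinfo_get(const char *sec, size_t len, const char *key)
{
    size_t klen, pos = 0;

    assert(sec != NULL && key != NULL);
    klen = strlen(key);
    while (pos < len) {
        const char *entry = sec + pos;
        const char *nul = memchr(entry, '\0', len - pos);
        size_t elen;

        if (nul == NULL)
            return NULL; /* unterminated trailing entry */
        elen = (size_t) (nul - entry);
        if (elen > klen && entry[klen] == '=' &&
            strncmp(entry, key, klen) == 0)
            return entry + klen + 1;
        pos += elen + 1; /* skip the entry and its terminating NUL */
    }
    return NULL;
}
```

With a made-up buffer such as "license=GPL\0author=someone\0", modinfo_get(buf, sizeof buf, "author") returns a pointer to "someone", and a key that is absent yields NULL.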

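Explain how an ELF executable file such as fibdrv.ko is "inserted" into the Linux kernel

fibdrv.ko is not an executable that can be invoked from the shell; it is only an object file in ELF format. If fibdrv.ko were an executable, it would contain program headers, but the ELF header shows that it has none.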

$ readelf -h fibdrv.ko
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              REL (Relocatable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          0 (bytes into file)
  Start of section headers:          6688 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           0 (bytes)
  Number of program headers:         0
  Size of section headers:           64 (bytes)
  Number of section headers:         25
  Section header string table index: 24

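We therefore need the insmod program (an executable) to insert fibdrv.ko into the kernel. A kernel module runs in kernel space, whereas insmod fibdrv.ko is a user-space process, so insmod presumably has to invoke the relevant system calls to copy the kernel module's data from user space into kernel space.

Recall from earlier that insmod causes the kernel to execute finit_module.

Excerpt from linux/kernel/module.c: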

SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
{
	struct load_info info = { };
	loff_t size;
	void *hdr;
	int err;

        // ...
    
	err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX,
				       READING_MODULE);
	if (err)
		return err;
	info.hdr = hdr;
	info.len = size;

	return load_module(&info, uargs, flags);
}

Line 10 is where the kernel reads a file, which in this example is fibdrv.ko:

$ sudo strace insmod fibdrv.ko
...
getcwd("/home/johnnylord/fibdrv", 4096) = 24
stat("/home/johnnylord/fibdrv/fibdrv.ko", {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0
openat(AT_FDCWD, "/home/johnnylord/fibdrv/fibdrv.ko", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=8288, ...}) = 0
mmap(NULL, 8288, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f06212a2000
finit_module(3, "", 0)                  = 0
...

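As the output of strace insmod fibdrv.ko above shows, the openat call on line 5 opens fibdrv.ko and obtains file descriptor 3, which line 8 then passes to finit_module.

How sysfs works and how it is implemented

Page 1 of Patrick Mochel's paper The sysfs Filesystem states: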
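The open-then-finit_module sequence seen in the trace can be reproduced with a few lines of C. This is a minimal sketch, assuming a Linux system whose headers define SYS_finit_module; insert_module is a hypothetical helper name, and actually loading a module additionally requires root privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Open a .ko file and hand its file descriptor to finit_module(2),
 * as the strace output shows insmod doing.
 * Returns 0 on success and -1 on failure.
 */
static int insert_module(const char *path, const char *params)
{
    int fd, ret;

    assert(path != NULL && params != NULL);
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1; /* the file could not be opened */

    /* There is no dedicated libc wrapper assumed here; use syscall(). */
    ret = (int) syscall(SYS_finit_module, fd, params, 0);
    close(fd);
    return ret;
}
```

Calling insert_module("fibdrv.ko", "") as root corresponds to the finit_module(3, "", 0) line in the trace; as an unprivileged user, or with a path that does not exist, the function returns -1.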

sysfs is a mechanism for representing kernel objects, their attributes, and their relationships with each other.

The module section on page 5 adds:

The module directory contains subdirectories for each module that is loaded into the kernel. The name of each directory is the name of the module – both the name of the module object file and the internal name of the module.

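/sys/module contains one subdirectory per loaded module, named after the module. According to the sysfs(5) man page, the /sys/module/"module-name" directory holds files that record information about the module, such as the parameter values passed to it. In addition, in module_add_modinfo_attrs() (around line 1703 of kernel/module.c), line 19 calls the sysfs-related function sysfs_create_file: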

static int module_add_modinfo_attrs(struct module *mod)
{
	struct module_attribute *attr;
	struct module_attribute *temp_attr;
	int error = 0;
	int i;

	mod->modinfo_attrs = kzalloc((sizeof(struct module_attribute) *
					(ARRAY_SIZE(modinfo_attrs) + 1)),
					GFP_KERNEL);
	if (!mod->modinfo_attrs)
		return -ENOMEM;

	temp_attr = mod->modinfo_attrs;
	for (i = 0; (attr = modinfo_attrs[i]) && !error; i++) {
		if (!attr->test || attr->test(mod)) {
			memcpy(temp_attr, attr, sizeof(*temp_attr));
			sysfs_attr_init(&temp_attr->attr);
			error = sysfs_create_file(&mod->mkobj.kobj,
					&temp_attr->attr);
			++temp_attr;
		}
	}
	return error;
}

The second argument to sysfs_create_file is &temp_attr->attr, where temp_attr is a pointer to struct module_attribute:

struct module_attribute {
	struct attribute attr;
	ssize_t (*show)(struct module_attribute *, struct module_kobject *,
			char *);
	ssize_t (*store)(struct module_attribute *, struct module_kobject *,
			 const char *, size_t count);
	void (*setup)(struct module *, const char *);
	int (*test)(struct module *);
	void (*free)(struct module *);
};

The first member, attr, has type struct attribute (defined in include/linux/sysfs.h):

struct attribute {
	const char		*name;
	umode_t			mode;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	bool			ignore_lockdep:1;
	struct lock_class_key	*key;
	struct lock_class_key	skey;
#endif
};
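Each file that sysfs_create_file registers shows up under /sys/module/"module-name" and can be read like an ordinary text file. The sketch below is a user-space reader under that assumption; read_sysfs_attr is a hypothetical helper, and the attribute path is supplied by the caller rather than taken from fibdrv.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a sysfs attribute file into buf and strip the
 * trailing newline. Returns 0 on success and -1 on failure.
 */
static int read_sysfs_attr(const char *path, char *buf, size_t size)
{
    FILE *fp;

    assert(path != NULL && buf != NULL && size > 0);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1; /* attribute missing or module not loaded */
    if (fgets(buf, (int) size, fp) == NULL) {
        fclose(fp);
        return -1; /* empty file or read error */
    }
    fclose(fp);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

For a loaded module, one would pass a path of the form /sys/module/"module-name"/"attribute"; which attributes exist depends on the entries in modinfo_attrs and on each entry's test callback, as the loop in module_add_modinfo_attrs shows.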

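Removing extreme values statistically

Assuming the data are approximately normally distributed, about 68%, 95%, and 99.7% of the samples fall within one, two, and three standard deviations of the mean, respectively

Image source: Wikipedia

Python script implementation and results

4ce43c4

datas holds the time measurements for computing a given Fibonacci number; this technique removes the values outside the 95% interval

Here z is the number of standard deviations by which a sample deviates from the mean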

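import numpy as np

def outlier_filter(datas, threshold=2):
    # Keep only the samples whose z-score (distance from the mean in
    # standard deviations) is below the threshold.
    datas = np.array(datas)
    z = np.abs((datas - datas.mean()) / datas.std())
    return datas[z < threshold]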

Process the time measurements for each Fibonacci number and return the processed data (remove outliers, then take the mean)

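def data_processing(data_set, n):
    # data_set holds n runs; each run is a 2-D array indexed [category][sample].
    categories = data_set[0].shape[0]
    samples = data_set[0].shape[1]
    final = np.zeros((categories, samples))

    for c in range(categories):
        for s in range(samples):
            # Average the n measurements of this cell after dropping outliers.
            final[c][s] = outlier_filter(
                [data_set[i][c][s] for i in range(n)]).mean()
    return final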
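The same filtering can be done directly in a C client if the measurements never leave the test program. The sketch below is an assumed translation of the idea, not part of the referenced commit: filtered_mean is a hypothetical name, it uses the population variance (as numpy's std() does by default), and it compares squared distances so that no square root is needed. Unlike the Python version, it returns the plain mean when all samples are identical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Mean of the samples that lie strictly within `threshold` standard
 * deviations of the overall mean.
 */
static double filtered_mean(const double *x, size_t n, double threshold)
{
    double mean = 0.0, var = 0.0, sum = 0.0;
    size_t kept = 0;

    assert(x != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        mean += x[i];
    mean /= (double) n;

    for (size_t i = 0; i < n; i++)
        var += (x[i] - mean) * (x[i] - mean);
    var /= (double) n; /* population variance */

    if (var == 0.0)
        return mean; /* all samples identical, nothing to filter */

    for (size_t i = 0; i < n; i++) {
        double d = x[i] - mean;

        /* |z| < threshold  is equivalent to  d^2 < threshold^2 * var */
        if (d * d < threshold * threshold * var) {
            sum += x[i];
            kept++;
        }
    }
    return kept ? sum / (double) kept : mean;
}
```

For nine samples of 10 and one sample of 100, the mean is 19 and the standard deviation is 27, so the sample at 100 has z = 3 and is dropped with threshold 2, leaving a filtered mean of 10.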

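The figures below show the results of computing the 100th and the 1000th Fibonacci number, respectively. Viewed alone, each plot may still appear to fluctuate considerably, but the difference becomes clear once the control group is plotted alongside

Each data point is sampled 50 times; the control group is the plain average

The plots of the processed data show more clearly that the execution time fluctuates periodically.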

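Self-check list

Study the notes on Linux performance analysis above, run GNU/Linux on your own physical machine, and complete the necessary setup and preparation
$\to$ in doing so, you should also understand why running the experiments in a virtual machine is discouraged;
Study the Fibonacci materials above (including the papers), summarize the key techniques, and consider how instructions such as clz / ctz help compute Fibonacci numbers. List the key code and explain it
Review C numeric systems and bitwise operations, and consider how to reduce the cost of multiplication when implementing fast Fibonacci algorithms;
Study KYG-yaya573142's report and identify the measures it takes to speed up big-number arithmetic and reduce the cost of memory operations
The output of lsmod has a column named Used by, described as "each module's use count and a list of referring modules". How is it implemented? How does the Linux kernel track dependencies between modules and their actual use counts (reference counting)?
fibdrv.c contains DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy, and so on. In what scenarios are they needed? Write a multithreaded userspace program to test this and observe what actually goes wrong if the Linux kernel module does not use a mutex. Try writing code that uses POSIX Threads to confirm.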
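A starting point for the POSIX Threads experiment in the last item might look like the sketch below. It assumes the driver's open handler rejects a second concurrent open with EBUSY, which the mutex_trylock in fibdrv.c suggests but which should be confirmed against the source. struct open_stats, open_worker, and run_open_test are hypothetical names, and the sketch only counts open results; observing data races with the mutex removed needs further reads and writes on the device.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

#define MAX_THREADS 16

struct open_stats {
    const char *path;      /* device file each thread tries to open */
    pthread_mutex_t lock;  /* protects the three counters below */
    int opened, busy, other;
};

/* Each thread opens the device once and records how the open ended. */
static void *open_worker(void *arg)
{
    struct open_stats *st = arg;
    int fd = open(st->path, O_RDWR);
    int err = errno;

    pthread_mutex_lock(&st->lock);
    if (fd >= 0)
        st->opened++;
    else if (err == EBUSY)
        st->busy++;
    else
        st->other++;
    pthread_mutex_unlock(&st->lock);

    if (fd >= 0) {
        usleep(1000); /* hold the device briefly to provoke contention */
        close(fd);
    }
    return NULL;
}

/* Start nthreads workers on path; returns how many threads were started. */
static int run_open_test(struct open_stats *st, const char *path, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    int started = 0;

    assert(st != NULL && path != NULL);
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    st->path = path;
    st->opened = st->busy = st->other = 0;
    pthread_mutex_init(&st->lock, NULL);

    for (int i = 0; i < nthreads; i++)
        if (pthread_create(&tid[started], NULL, open_worker, st) == 0)
            started++;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    pthread_mutex_destroy(&st->lock);
    return started;
}
```

Running run_open_test(&st, "/dev/fibonacci", 8) with the module loaded and then comparing st.opened with st.busy gives a first impression of how the driver serializes access; older toolchains may need -pthread when compiling.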

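Assignment requirements

Answer every question in the self-check list above, attaching the corresponding references and necessary code; first-hand material (including experiments of your own design) is preferred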
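If you forked sysprog21/fibdrv on GitHub before March 9, 2021, follow the article Alternatives to forking into the same account to deal with the old repository, then fork again
Fork fibdrv on GitHub. The goal is to fix the correctness of the Fibonacci computation (the existing implementation is flawed; point out the flaws) and then improve the efficiency with which fibdrv computes the Fibonacci sequence. Execution time must be quantified along the way, measured both inside the Linux kernel module and at user level
- The original code can only list values up to Fibonacci(100), and some of the output is wrong. Modify the code to list more of the subsequent values (note: check correctness and the use of the numeric system)
- Be sure to study the notes on Linux performance analysis above to reduce interference during performance analysis;
- Inside the Linux kernel module, use the ktime family of APIs;
- In userspace, use clock_gettime and related APIs;
- Make good use of statistical models to remove extreme values, and describe your method in detail
- Plot with gnuplot and analyze the time cost of computing the Fibonacci sequence in the kernel and of transferring the result to userspace; use us or ns as the unit (at your discretion)
- Try to interpret the time distribution from the experiments above, especially the effect on the Linux kernel as the Fibonacci numbers grow. You may modify the VFS functions so that writing to the /dev/fibonacci device file switches between different implementations of the Fibonacci computation.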
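printk is convenient, but its execution time is unstable because it is affected by many factors. Besides observing it statistically, you can switch to the kernel's sysfs interface, which lets us declare our own show and store interfaces to provide read and write operations; see Sample kobject implementation (note: switch to the matching Linux kernel version).
Optimize the Fibonacci execution time step by step, introducing the strategies mentioned in the Fibonacci sequence analysis and making good use of instructions such as clz / ffs (the Linux kernel provides wrapper functions for them); quantify thoroughly at every step
Try studying the implementation of bignum (the fibdrv branch), understand its techniques, compare its performance with your own implementation, and discuss what is worth learning from it.
- The principles and analysis are in KYG-yaya573142's report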
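As a user-space illustration of two of the requirements above, the sketch below combines fast doubling driven by a clz instruction with a clock_gettime measurement. It is one possible starting point and not the reference solution: it relies on the GCC/Clang builtin __builtin_clz, assumes a 32-bit unsigned int, and is limited to n <= 92 so that every intermediate value fits in 64 bits. The function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* F(n) by fast doubling; valid for n <= 92 (64-bit arithmetic only). */
static uint64_t fib_fast_doubling(unsigned int n)
{
    uint64_t a = 0, b = 1; /* a = F(k), b = F(k + 1), starting at k = 0 */

    assert(n <= 92);
    if (n == 0)
        return 0;

    /* Start at the highest set bit of n; clz lets us skip leading zeros. */
    for (int i = 31 - __builtin_clz(n); i >= 0; i--) {
        uint64_t f2k = a * (2 * b - a); /* F(2k)   = F(k) (2F(k+1) - F(k)) */
        uint64_t f2k1 = a * a + b * b;  /* F(2k+1) = F(k)^2 + F(k+1)^2     */

        a = f2k;
        b = f2k1;
        if ((n >> i) & 1U) { /* bit set: advance from 2k to 2k + 1 */
            uint64_t next = a + b;

            a = b;
            b = next;
        }
    }
    return a;
}

/* Time one call with CLOCK_MONOTONIC; returns elapsed nanoseconds. */
static int64_t time_fib_ns(unsigned int n, uint64_t *result)
{
    struct timespec t1, t2;

    assert(result != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *result = fib_fast_doubling(n);
    clock_gettime(CLOCK_MONOTONIC, &t2);
    return (int64_t) (t2.tv_sec - t1.tv_sec) * 1000000000LL +
           (int64_t) (t2.tv_nsec - t1.tv_nsec);
}
```

The loop runs once per significant bit of n, so computing F(92) takes 7 iterations instead of 92 additions. A single measurement from time_fib_ns is noisy, which is why repeated sampling and outlier removal are still needed before plotting.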

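Submission

Edit the Homework3 shared notes and record your observations, the explanations required above, the discussion of use cases, and each stage of your performance improvements in a newly created note, making good use of gnuplot plots

Deadline

On or before Mar 22, 2021
The earlier you show activity on GitHub and the earlier you accept code review, the higher the score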
