2023q1 Homework3 (fibdrv)

contributed by < linhoward0522 >

開發環境

$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               GenuineIntel
  Model name:            12th Gen Intel(R) Core(TM) i5-12400
    CPU family:          6
    Model:               151
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            2
    CPU max MHz:         5600.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4992.00

前期準備

fibdrv

檢查 Linux 核心版本

$ uname -r

5.15.0-60-generic

安裝linux-headers 套件

$ sudo apt install linux-headers-`uname -r`

確認 linux-headers 套件已正確安裝於開發環境

$ dpkg -L linux-headers-5.15.0-60-generic | grep "/lib/modules"

/lib/modules/5.15.0-60-generic/build

檢驗目前的使用者身份

$ whoami
howard
$ sudo whoami
root

安裝後續會用得到的工具

$ sudo apt install util-linux strace gnuplot-nox

取得原始程式碼

$ git clone https://github.com/linhoward0522/fibdrv
$ cd fibdrv

編譯並測試
```
$ make check
```
觀察產生的 fibdrv.ko 核心模組
```
$ modinfo fibdrv.ko
```

觀察 fibdrv.ko 核心模組在 Linux 核心掛載後的行為

$ sudo insmod fibdrv.ko

輸出

insmod: ERROR: could not insert module fibdrv.ko: Operation not permitted

發現是沒有 disable Secure Boot in UEFI (BIOS) settings

$ sudo apt install mokutil
$ sudo mokutil --disable-validation

接著會要求你設定一組密碼，之後 reboot，選擇 change secure boot state 後輸入剛剛設定的密碼後選擇 Yes

$ ls -l /dev/fibonacci
crw------- 1 root root 234, 0  三   3 11:30 /dev/fibonacci
$ cat /sys/class/fibonacci/fibonacci/dev
234:0

$ cat /sys/module/fibdrv/version 
0.1

$ lsmod | grep fibdrv
$ cat /sys/module/fibdrv/refcnt

這兩道命令的輸出都是 0

Linux 效能分析的提示

研讀 Linux 效能分析的提示描述，在自己的實體電腦運作 GNU/Linux，做好必要的設定和準備工作

\to

從中也該理解為何不希望在虛擬機器中進行實驗;

查看行程的 CPU affinity

查看特定行程的處理器親和性
```
$ taskset -p 1
pid 1's current affinity mask: fff
```
代表該行程可以在第 0 到第 11 個 CPU 核中執行
```
$ taskset -cp 1
pid 1's current affinity list: 0-11
```
加上 -c 參數，讓 taskset 直接輸出 CPU 核的列表：

將行程固定在特定的 CPU 中執行

$ taskset -cp 1 1
pid 1's current affinity list: 0-11
taskset: failed to set pid 1's affinity: Operation not permitted

這邊一直解決不了，後來參考其他學員 willwillhi1 的做法

修改 /etc/default/grub 內的 GRUB_CMDLINE_LINUX_DEFAULT，加入 isolcpus=7 來指定編號 7 的核心不被排程器賦予任務，修改完成後需 sudo update-grub 來更新設定，重開機後即生效

$ sudo vi /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=7"
$ sudo update-grub

重新開機後檢查 CPU affinity 結果如下

$ taskset -cp 1
pid 1's current affinity list: 0-6,8-11
$ cat /sys/devices/system/cpu/isolated 
7

固定在特定的 `CPU` 中執行

$ sudo taskset -c 7 ./client

排除干擾效能分析的因素

抑制 address space layout randomization (ASLR)

$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"

設定 scaling_governor 為 performance ，準備以下 shell script，檔名為 performance.sh:

$ vi performance.sh
for i in /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
do
    echo performance > ${i}
done
$ sudo sh performance.sh

針對 Intel 處理器，關閉 turbo mode

$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

觀察 `fibdrv` 核心模組

include/linux/fs.h
fibdrv 屬於 character device，所以資料結構為 struct cdev

struct inode {
    ...    
    union {
        struct pipe_inode_info	*i_pipe;
        struct block_device	*i_bdev;
        struct cdev		*i_cdev;
        char			*i_link;
        unsigned		i_dir_seq;
    };
    ...
} __randomize_layout;

include/linux/cdev.h
其中可以看到 const struct file_operations ，存放關於此 character device 的相關操作

struct cdev {
    struct kobject kobj;
    struct module *owner;
    const struct file_operations *ops;
    struct list_head list;
    dev_t dev;
    unsigned int count;
} __randomize_layout;

fib_sequence 用來計算費氏數列

static long long fib_sequence(long long k)

fibdrv 設計為一個 character device，可以藉由檔案操作介面進行操作(open, read, write, mmap 等等)，定義如下：

const struct file_operations fib_fops = {
    .owner = THIS_MODULE,
    .read = fib_read,
    .write = fib_write,
    .open = fib_open,
    .release = fib_release,
    .llseek = fib_device_lseek,
};

fib_read 用來取得費氏數列的計算結果

/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    return (ssize_t) fib_sequence(*offset);
}

其中 offset 是用來控制要算到第幾個費氏數列

可藉由 fib_device_lseek 更改 offset

static loff_t fib_device_lseek(struct file *file, loff_t offset, int orig)

fib_write 目前沒有功能

先宣告一個 mutex ，用來控制使用此模組的行程數量，fib_open 一次只允許一個行程做使用

#include <linux/mutex.h>

static DEFINE_MUTEX(fib_mutex);
static int fib_open(struct inode *inode, struct file *file)
{
    if (!mutex_trylock(&fib_mutex)) {
        printk(KERN_ALERT "fibdrv is in use");
        return -EBUSY;
    }
    return 0;
}
static int fib_release(struct inode *inode, struct file *file)
{
    mutex_unlock(&fib_mutex);
    return 0;
}

init_fib_dev
透過 cdev_alloc 先配置一個 cdev structure ，接著需要使用 cdev_init 來初始化，最後用 cdev_add 才能被加入 Linux 核心

static int __init init_fib_dev(void)
{
    fib_cdev = cdev_alloc();
    if (fib_cdev == NULL) {
        printk(KERN_ALERT "Failed to alloc cdev");
        rc = -1;
        goto failed_cdev;
    }
    cdev_init(fib_cdev, &fib_fops);
    rc = cdev_add(fib_cdev, fib_dev, 1);
}

費氏數列

研讀上述費氏數列相關材料 (包含論文)，摘錄關鍵手法，並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說

迭代法

static long long fib_sequence(long long k)
{
    /* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */
    long long f[k + 2];

    f[0] = 0;
    f[1] = 1;

    for (int i = 2; i <= k; i++) {
        f[i] = f[i - 1] + f[i - 2];
    }

    return f[k];
}

Variable-length array 允許執行時期再決定陣列佔用的空間，而不是編譯時期時決定。可能有以下問題

因為執行時期才決定佔用的空間，對於執行時間來說可能會有影響
重要的是會對 Linux 核心堆疊有安全疑慮 (security implication)

The Linux Kernel Is Now VLA-Free

因為 variable-length array 會生成更多的機器碼，更慢的代碼，而且相對而言更加地不可維護，所以 Linux kernel 專案從頭到尾都沒有使用過 VLA

is an array data structure whose length is determined at run time (instead of at compile time).

Linus Torvalds has expressed his displeasure in the past over VLA usage for arrays with predetermined small sizes because it generates lower quality assembly code.With the Linux kernel, the Linux kernel is effectively VLA-free.

static long long fib_sequence(long long k)
{
    /* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */
    if (k < 2)
        return k;

    long long a = 0, b = 1;

    for (int i = 2; i <= k; i++) {
        long long c = a + b;
        a = b;
        b = c;
    }
    return b;
}

改變實作將程式碼改成不用陣列來儲存計算出來的值

Fast Doubling

複習 C 語言數值系統和 bitwise operation，思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本

使用迭代法來實作，時間複雜度為
$O (N)$
若使用 Fast doubling 來實作，時間複雜度可降為
$O (l o g N)$

透過 Q-Matrix 以及 Fast Doubling 之數學推論，我們最後可得兩公式可供使用

\begin{aligned} F (2 k) & = F (k) F (k + 1) + F (k - 1) F (k) \\ = F (k) [F (k + 1) + F (k - 1)] \\ = F (k) [2 F (k + 1) - F (k)] \\ F (2 k + 1) & = F (k + 1)^{2} + F (k)^{2} \end{aligned}

透過這兩個公式可以讓我們計算費波那契數列的時間複雜度為

O (l o g N)

exponentiation by squaring

其中 Fast Doubling 文章當中有提及幾種作法

RECURSIVE (TOP-DOWN) APPROACH

此方法使用上述兩個公式來實作，但會有一些問題
假設我們要找
$F i b (6)$ ，很明顯
$F i b (3)$ 會執行兩次
所以當 n 越大，就會有更多子樹被重複執行

retrieving data from array
```
uint64_t MEM[SIZE] = { [0 ... SIZE-1] = INIT };
```
- 為了解決上述問題，我們新增一個 array 來儲存已經被計算過的數值，可以先確認是否被計算過，若有則直接取值
- 但此方案會隨著 n 越大而增加記憶體的使用
use a two-elements array
- 為了解決上述問題，我們使用 two-elements array 即不會隨著 K 越大而增加記憶體的使用
  
  當我們要計算
  $F_{6}$ 時，如果我們只有
  $F_{2}$ ,
  $F_{3}$ ，要如何計算
  $F_{5}$ ?
  根據公式得知我們可以使用
  $F_{2}$ ,
  $F_{3}$ 來去計算出
  $F_{4}$ ,
  $F_{5}$
  最後我們即可得
  $F_{6} = F_{4} + F_{5}$
- 我們可以使用 two-elements array 來計算
  $F_{10}$
  - 當 n 是偶數，我們可以找到
    $F_{k}$ ，
    $k = \frac{n}{2}$
    - 我們就可以使用
      $(\begin{array}{ccc} F_{k} \\ F_{k + 1} \end{array}) = (\begin{array}{ccc} F_{\frac{n}{2}} \\ F_{\frac{n}{2} + 1} \end{array})$ 來計算
      $(\begin{array}{ccc} F_{2 k} \\ F_{2 k + 1} \end{array}) = (\begin{array}{ccc} F_{n} \\ F_{n + 1} \end{array})$
  - 當 n 是奇數，我們可以找到
    $F_{k}$ ，
    $k = \frac{n - 1}{2}$
    - 我們就可以使用
      $(\begin{array}{ccc} F_{k} \\ F_{k + 1} \end{array}) = (\begin{array}{ccc} F_{\frac{n - 1}{2}} \\ F_{\frac{n - 1}{2} + 1} \end{array})$ 來計算
      $(\begin{array}{ccc} F_{2 k} \\ F_{2 k + 1} \end{array}) = (\begin{array}{ccc} F_{n - 1} \\ F_{n} \end{array})$
    - 並且
      $(\begin{array}{ccc} F_{n} \\ F_{n - 1} + F_{n} \end{array}) = (\begin{array}{ccc} F_{n} \\ F_{n + 1} \end{array})$
  $k_{i}$
  $k_{1}$
  $k_{2}$
  $k_{3}$
  $k_{4}$
  $k_{5}$
  
  $n (= k_{i})$ 10 5 2 1 0
  
  $n i s o d d$ v v
  
  $F_{2 k_{i + 1}}$
  $F_{4}$
  $F_{0}$
  
  $F_{n}$
  $F_{10}$
  $F_{5}$
  $F_{2}$
  $F_{1}$
  $F_{0}$
  
  $F_{n + 1}$
  $F_{6}$
  $F_{3}$
  $F_{2}$
  $F_{1}$
  - 使用這種 bottom-up 方法，我們可以用
    $F_{0}$ ,
    $F_{1}$ 計算出
    $F_{0}$ ,
    $F_{1}$ ,
    $F_{2}$ ，然後用
    $F_{1}$ ,
    $F_{2}$ 計算出
    $F_{2}$ ,
    $F_{3}$ ，用
    $F_{2}$ ,
    $F_{3}$ 計算出
    $F_{4}$ ,
    $F_{5}$ ,
    $F_{6}$ ，用
    $F_{5}$ ,
    $F_{6}$ 計算出
    $F_{10}$

$k_{i}$	$k_{1}$	$k_{2}$	$k_{3}$	$k_{4}$	$k_{5}$
$n (= k_{i})$	10	5	2	1	0
$n i s o d d$		v		v
$F_{2 k_{i + 1}}$		$F_{4}$		$F_{0}$
$F_{n}$	$F_{10}$	$F_{5}$	$F_{2}$	$F_{1}$	$F_{0}$
$F_{n + 1}$		$F_{6}$	$F_{3}$	$F_{2}$	$F_{1}$

BIT-MASK

由於除以 2 也就是進行右移，所以我們只要找出 highest 1-bit 即可知道要迭代幾次
接著搭配上述 bottom-up 的方法
最後利用 BIT-MASK 來改善上述實作
- 用 & 操作從最高位元做到最低位元

static long long fib_fast_doubling_sequence(long long k)
{
    /* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */
    // The position of the highest bit of n.
    // So we need to loop `h` times to get the answer.
    // Example: n = (Dec)50 = (Bin)00110010, then h = 6.
    //                               ^ 6th bit from right side
    if (k < 2)
        return k;

    unsigned int h = 0;
    for (unsigned int i = k; i; ++h, i >>= 1)
        ;

    long long a = 0, b = 1;

    for (long long mask = 1 << (h - 1); mask; mask >>= 1) {
        // Let j = h-i (looping from i = 1 to i = h), n_j = floor(n / 2^j) = n
        // >> j (n_j = n when j = 0), k = floor(n_j / 2), then a = F(k), b =
        // F(k+1) now.
        long long c = a * (2 * b - a);  // F(2k) = F(k) * [ 2 * F(k+1) – F(k) ]
        long long d = a * a + b * b;    // F(2k+1) = F(k)^2 + F(k+1)^2

        if (mask & k) {  // n_j is odd: k = (n_j-1)/2 => n_j = 2k + 1
            a = d;       //   F(n_j) = F(2k + 1)
            b = c + d;   //   F(n_j + 1) = F(2k + 2) = F(2k) + F(2k + 1)
        } else {         // n_j is even: k = n_j/2 => n_j = 2k
            a = c;       //   F(n_j) = F(2k)
            b = d;       //   F(n_j + 1) = F(2k + 1)
        }
    }

    return a;
}

考慮到硬體加速
$F (n)$ 的手法

許多現代處理器提供 clz / ctz 一類的指令，可搭配上述 Fast Doubling 手法運用:

unsigned int h = 0;
for (unsigned int i = k; i; ++h, i >>= 1)
    ;

使用 __builtin_clzll 改寫將上述程式碼改成

int __builtin_clzll (unsigned int x)
Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.
Similar to __builtin_clz, except the argument type is unsigned long long.

unsigned int h = 64 - __builtin_clzll(k);

時間測量

kernel space

使用 ktime 相關的 API 並根據作業相關資料來修改

首先在 fibdrv.c 增加
```
#include <linux/ktime.h>
```

參考作業相關資料，因為 fib_write 目前沒有功能，於是用來輸出 fib 的執行時間

static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    ktime_t kt;
    kt = ktime_get();
    fib_sequence(*offset);
    kt = ktime_sub(ktime_get(), kt);
    return (ssize_t) ktime_to_ns(kt);
}

接著我想同時測量三種實作費氏數列的執行時間，觀察 fib_write 後發現可以利用 size 來指定要使用哪種方法，將上面程式碼做修正

ktime_t kt;
    switch (size) {
    case 0:
        kt = ktime_get();
        fib_sequence(*offset);
        kt = ktime_sub(ktime_get(), kt);
        break;
    case 1:
        kt = ktime_get();
        fib_fast_doubling_sequence(*offset);
        kt = ktime_sub(ktime_get(), kt);
        break;
    case 2:
        kt = ktime_get();
        fib_fast_doubling_clz_sequence(*offset);
        kt = ktime_sub(ktime_get(), kt);
        break;
    default:
        return 0;
    }
    return (ssize_t) ktime_to_ns(kt);

並且新增測試程式 client_test.c

for (int i = 0; i <= offset; i++) {
    lseek(fd, i, SEEK_SET);
    /*iterative*/
    t1 = write(fd, write_buf, 0);
    /*fast_doubling*/
    t2 = write(fd, write_buf, 1);
    /*fast_doubling_clz*/
    t1 = write(fd, write_buf, 2);

    printf("%lld,%lld,%lld",t1, t2, t3);
}

修改 Makefile，即可在命令列輸入 make test 來執行

all: $(GIT_HOOKS) client client_test
    $(MAKE) -C $(KDIR) M=$(PWD) modules

client_test: client_test.c
    $(CC) -o $@ $^

test: all
    $(MAKE) unload
    $(MAKE) load
    sudo taskset -c 7 ./client_test > test
    $(MAKE) unload
    gnuplot scripts/test.gp

先將先前載入過的模組釋放掉，再將 fibdrv.ko 模組載入，接著將行程固定在特定的 CPU 中並執行 client_test ，將結果存到 test ，最後將模組釋放掉後並使用 gnuplot 來畫出運行結果

運行結果

從上圖可以看出 fast doubling 方法整體效率比 interative 方法還要好，但是使用了 clz 的 fast doubling 方法並沒有比較快，甚至還比 interative 方法還慢，有點不太合理

這邊目前找不出原因

根據與老師的一對一討論後，老師建議應當要考慮 cache 的影響，所以我在實驗過程中會計算兩次費氏數列，確保 cache 裡的資料已被更新，同時執行第二次費氏數列的同時才會紀錄時間，以降低 cache 的影響

uesr space

使用 clock_gettime 相關 API ，首先在 client_test.c 中加入 #include <time.h>

clock_gettime 函數
可以取得 wall-clock time 或程式的 CPU time，其所傳回的時間是用 timespec 這個結構所儲存的 :
```
struct timespec {
    time_t   tv_sec;        /* seconds */
    long     tv_nsec;       /* nanoseconds */
};
```
其精準度最高可達 nanosecond，若要查詢實際的精確度，可以使用 clock_getres 函數
- CLOCK_MONOTONIC : 單調遞增時間（monotonic time），這個時間會非常穩定的持續遞增，不會因為系統時間改變而有變動，適合用於測量程式執行效能

修改 client_test.c 程式碼，使用引數來決定要測量哪種方法

for (int i = 0; i <= offset; i++) {
    long long kernel_time, user_time;
    int x = atoi(argv[1]);
    struct timespec t1, t2;

    lseek(fd, i, SEEK_SET);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    kernel_time = write(fd, write_buf, x);
    clock_gettime(CLOCK_MONOTONIC, &t2);

    user_time = (long long) (t2.tv_sec * 1000000000 + t2.tv_nsec) -
                    (t1.tv_sec * 1000000000 + t1.tv_nsec);

    printf("%lld %lld %lld\n", user_time, kernel_time,
           user_time - kernel_time);
}

C99 [7.20.1.2] The atoi, atol, and atoll functions

The atoi, atol, and atoll functions convert the initial portion of the string pointed to by nptr to int, long int, and long long int representation, respectively.

修改 Makefile

test: all
    $(MAKE) unload
    $(MAKE) load
    sudo taskset -c 7 ./client_test 4 > test
    gnuplot scripts/test.gp
    sudo taskset -c 7 ./client_test 1 > test_time
    gnuplot scripts/test_time.gp
    $(MAKE) unload

iterative
fast doubling
fast doubling with clz

以上三種方法共同的現象是前幾個 F(n) 的計算中所花的時間上都會非常的高。
而且 fast doubling with clz 振動的幅度較大

大數運算

學習指出針對大數運算的加速運算和縮減記憶體操作成本的舉措，紀錄你的認知和疑惑

研讀並參考作業規範 : 初步支援大數運算

struct bn

將其定義在 bn.h 中

typedef struct _bn {
    unsigned int *number;
    unsigned int size;
    int sign;
} bn;

number : 指向儲存的數值，之後會以 array 的形式來取用
size : 配置的記憶體大小，單位為 sizeof(unsigned int)
sign : 判斷為正數或負數， 0 為正數、1 為負數

bn_to_string

由於大數沒辦法直接以數值的形式列出，所以我們使用字串陣列將大數數值轉換成十進位的 ASCII 表示法

char *bn_to_string(const bn *src)
{
    // log10(x) = log2(x) / log2(10) ~= log2(x) / 3.322
    size_t len = (8 * sizeof(int) * src->size) / 3 + 2 + src->sign;
    char *s = kmalloc(len, GFP_KERNEL);
    char *p = s;

    memset(s, '0', len - 1);
    s[len - 1] = '\0';

這裡使用將對數轉換為 10 為底的方法，來估算大數所需的十進位位數
首先宣告一個長度為 len ，作用是計算出轉換成字串後所需要的總長度，並初始化為 '0' 的字串陣列
GFP_KERNEL 表示要求分配的記憶體是可交換的（swappable）

/* src.number[0] contains least significant bits */
for (int i = src->size - 1; i >= 0; i--) {
    /* walk through every bit of bn */
    for (unsigned int d = 1U << 31; d; d >>= 1) {
        /* binary -> decimal string */
        int carry = !!(d & src->number[i]);
        for (int j = len - 2; j >= 0; j--) {
            s[j] += s[j] - '0' + carry;
            carry = (s[j] > '9');
            if (carry)
                s[j] -= 10;
        }
    }
}

外層迴圈用來迭代 src 中的每一個整數，從最高位元到最低位元
內層迴圈用來迭代 src->number[i] 中的每一個二進位，一樣從最高位元到最低位元
接著對於每一個位元 d ，判斷 src->number[i] 中的該位元是否為 1，如果為 1，則 carry 設為 1，否則設為 0
最內層迴圈用來將 s[j] 的 ASCII 碼值減去 '0' 的 ASCII 值，就可以得到 s[j] 代表的數值，然後將該數值加上 carry 。如果相加的結果大於 9 ，則需要進位，進位後則要將該位的值減去 10

// skip leading zero
while (p[0] == '0' && p[1] != '\0') {
    p++;
}
if (src->sign)
    *(--p) = '-';
memmove(s, p, strlen(p) + 1);
return s;

接著跳過字串開頭的所有 '0'，因為這些 '0' 不會影響轉換後的數值
如果 src->sign 為 1，則表示該大數為負數，則在字串開頭插入一個負號

bn_lshift/bn_rshift

此函式用來將 bn 數字向左移動

void bn_lshift(bn *src, size_t shift)
{
    size_t z = bn_clz(src);
        shift %= 32;  // only handle shift within 32 bits atm
        if (!shift)
            return;

        if (shift > z)
            bn_resize(src, src->size + 1);

首先計算 bn 中最高位元的位數，為了在左位移時檢查是否需要增加 bn 數字的大小

/* bit shift */
for (int i = src->size - 1; i > 0; i--)
    src->number[i] =
        src->number[i] << shift | src->number[i - 1] >> (32 - shift);
src->number[0] <<= shift;

接著從最高位元向最低位元迭代，將每個 bn 數字進行左移操作
先將數字左移 shift 位，再將 src->number[i - 1] 位置的數字右移 32-shift 位，並使用 | 合併，即可完成左移操作
最後，將 bn 數字最低位元的左移 shift 位

bn_add / bn_sub

加法與減法由於需要考慮數值的正負號，先由 bn_add 與 bn_sub 判斷結果的正負號，再使用函式bn_do_add 與 bn_do_add 進行無號整數的計算

bn_add 負責所有正負號的判斷，所以 bn_sub 只是改變 b 的正負號後，再直接交給 bn_add 判斷
bn_cmp 負責比對兩個 bn 物件絕對值後的大小

bn_do_add

/* |c| = |a| + |b| */
static void bn_do_add(const bn *a, const bn *b, bn *c)
{
    // max digits = max(sizeof(a) + sizeof(b)) + 1
    int d = MAX(bn_msb(a), bn_msb(b)) + 1;
    d = DIV_ROUNDUP(d, 32) + !d;
    bn_resize(c, d);  // round up, min size = 1

    unsigned long long int carry = 0;
    for (int i = 0; i < c->size; i++) {
        unsigned int tmp1 = (i < a->size) ? a->number[i] : 0;
        unsigned int tmp2 = (i < b->size) ? b->number[i] : 0;
        carry += (unsigned long long int) tmp1 + tmp2;
        c->number[i] = carry;
        carry >>= 32;
    }

    if (!c->number[c->size - 1] && c->size > 1)
        bn_resize(c, c->size - 1);
}

首先計算加法後的位數大小，並重新配置 c 的大小
進入 for 迴圈後，對於 c 中的第 i 個位元，分別從 a 和 b 中取出第 i 個位元的值，將其相加後並加上 carry 的值，即可得 c 中第 i 個位元的值
將 carry 右移 32 位元，以便下一次迴圈可以加上進位

bn_do_sub

for (int i = 0; i < c->size; i++) {
    unsigned int tmp1 = (i < a->size) ? a->number[i] : 0;
    unsigned int tmp2 = (i < b->size) ? b->number[i] : 0;

    carry = (long long int) tmp1 - tmp2 - carry;
    if (carry < 0) {
        c->number[i] = carry + (1LL << 32);
        carry = 1;
    } else {
        c->number[i] = carry;
        carry = 0;
    }
}

carry 用於保存上一位減法操作的借位
若相減小於 0，則需先對調 a 與 b，計算完成後再補上負號

正確計算
$F (92)$ 以後的數值

參考作業規範中迭代以及 fast doubling 的程式進行測試

調整 fibdrv.c 中 MAX_LENGTH 的數值

#define MAX_LENGTH 1000

調整 client.c 中 offset 的數值

int offset = 1000; /* TODO: try test something bigger than the limit */

使用命令 make check 時會將 client.c 的結果輸出到 out ，並且跟 scripts/expected.txt 做比對，但是 expected.txt 從

F (92)

以後的數值是錯誤的

先修改命令 make check







check: all
	$(MAKE) unload
	$(MAKE) load
	sudo ./client > out
	$(MAKE) unload
	#@diff -u out scripts/expected.txt && $(call pass)
	@scripts/verify.py && $(call pass)

把第六行註解，並在第七行加上 && $(call pass) ，若是結果正確就會出現 Passed [-]
接著要用 verify.py 做測試，所以做一點修改

@@ -5,7 +5,7 @@
 result_split = []
 dics = []
 
-for i in range(2, 101):
+for i in range(2, 1001):
     expect.append(expect[i - 1] + expect[i - 2])
 with open('out', 'r') as f:
     tmp = f.readline()
@@ -25,4 +25,4 @@
         print('f(%s) fail' % str(i[0]))
         print('input: %s' %(fib))
         print('expected: %s' %(expect[i[0]]))
-        exit()
+        exit(1)

其中 exit(1) 是有錯誤即退出，所以若結果是錯誤的就不會在終端機出現 Passed [-]

開發問題紀錄

$ git commit -a

Following files need to be cleaned up:
fibdrv.c
/usr/bin/sha1sum: scripts/checksums: No such file or directory
[!] You are not allowed to change the header file queue.h or list.h

因為我是先 fork ，再從自己的 repo git clone 下來，所以只好參考 commit : a6b1320 在本地端做修改後才能正常 git commit

提交一份 pull request 以改進原本的檢查。
:notes: jserv

一對一討論問題回答

為何 `Linux` 核心不傾向使用 `VLA`

Variable-length array 允許執行時期再決定陣列佔用的空間，而不是編譯時期時決定。可能有以下問題

因為執行時期才決定佔用的空間，對於執行時間來說可能會有影響
重要的是會對 Linux 核心堆疊有安全疑慮 (security implication)

The Linux Kernel Is Now VLA-Free

為何要實作 `fast doubling` ?

使用迭代法來實作，時間複雜度為
$O (N)$
若使用 Fast doubling 來實作，時間複雜度可降為
$O (l o g N)$

exponentiation by squaring

何者在 `Linux` 核心可作，但在 `user space` 無法？

存取硬體資源：例如記憶體、磁碟
執行驅動程式：例如裝置驅動程式、檔案系統驅動程式
執行系統呼叫等等

Writing a Linux Kernel Module Part 1: Introduction

現有的大數運算，可配置多大的記憶體空間？ (`kmalloc`)

kmalloc() will return a memory chunk with size of power of 2 that matches or exceeds len and will return NULL upon failure.
The maximum size allocatable by kmalloc() is 1024 pages, or 4MB on x86.

1 Memory management

測量程式碼佔用的記憶體空間

使用 valgrind 測量大數運算佔用的記憶體空間

$ sudo taskset -c 7 valgrind --tool=massif --stacks=yes ./client

澄清 `GFP_KERNEL`

Using GFP_KERNEL means that kmalloc can put the current process to sleep waiting for
a page when called in low-memory situations.

Allocating Memory

This is a normal allocation and might block. This is the flag to use in process context code when it is safe to sleep. The kernel will do whatever it has to in order to obtain the memory requested by the caller. This flag should be your first choice.

kmalloc()

2023q1 Homework3 (fibdrv)

開發環境

前期準備

Linux 效能分析的提示

查看行程的 CPU affinity

將行程固定在特定的 CPU 中執行

固定在特定的 CPU 中執行

排除干擾效能分析的因素

觀察 fibdrv 核心模組

費氏數列

迭代法

Fast Doubling

考慮到硬體加速 F(n) 的手法

時間測量

kernel space

uesr space

大數運算

struct bn

bn_to_string

bn_lshift/bn_rshift

bn_add / bn_sub

bn_do_add

bn_do_sub

正確計算 F(92) 以後的數值

開發問題紀錄

一對一討論問題回答

為何 Linux 核心不傾向使用 VLA

為何要實作 fast doubling ?

何者在 Linux 核心可作，但在 user space 無法？

現有的大數運算，可配置多大的記憶體空間？ (kmalloc)

測量程式碼佔用的記憶體空間

澄清 GFP_KERNEL

Read more

2023q1 Homework1 (lab0)

Demystifying the Linux CPU Scheduler 閱讀筆記

關於 BORE 及 schedgraph 的描述

2023q1 Homework7 (ktcp)