2022q1 Homework3 (fibdrv)

contributed by < NOVBobLee >

作業要求

環境設置

Linux 核心版本（Ubuntu）：

$ uname -r
5.13.0-35-generic

編譯器版本：

$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0

CPU 架構和規格：

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           94
Model name:                      Intel(R) Xeon(R) CPU E3-1230 v5 @ 3.40GHz
Stepping:                        3
CPU MHz:                         3400.000
CPU max MHz:                     3800.0000
CPU min MHz:                     800.0000
BogoMIPS:                        6799.81
Virtualization:                  VT-x
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        8 MiB
NUMA node0 CPU(s):               0-7
...

確認使用者身份不是 root ：

$ whoami
zz4t
$ sudo whoami
root

確認 UEFI secure boot 已關閉：

$ dmesg | grep -i secure
[    0.000000] secureboot: Secure boot disabled

確認 linux-headers 套件已安裝：

$ dpkg -L linux-headers-5.13.0-35-generic | grep /lib/modules
/lib/modules
/lib/modules/5.13.0-35-generic
/lib/modules/5.13.0-35-generic/build

檢查專案，確認可以執行（出現綠色 Passed [-] ）：

$ git clone git@github.com:NOVBobLee/fibdrv.git
$ cd fibdrv
$ make check
cc -o client client.c
...
 Passed [-]
f(93) fail
input: 7540113804746346429
expected: 12200160415121876738

獨立實驗環境，排除干擾

參照這次 K04: fibdrv 實驗說明和 KYG 學長的共筆，將干擾實驗的因素排除。

保留一顆 CPU 核心不被排其他行程

## Add isolcpus=7 in /etc/default/grub (0-based index)
## GRUB_CMDLINE_LINUX_DEFAULT="quiet splash ipv6.disable=1 isolcpus=7"
$ sudo update-grub
$ shutdown -r now
$ cat /sys/devices/system/cpu/isolated
7
## CPU 7 is not in pid 1 init's affinity list
$ taskset -cp 1
pid 1's current affinity list: 0-6

使用 taskset 指定行程執行在保留的 CPU 上

$ sudo taskset -c 7 ./client

關閉 ASLR (Address Space Layout Randomization)

關於 sudo sh -c

參考 sh(1) 和 What does sudo sh -c means in linux?

$ sudo echo 3 > test.plain
$ sudo sh -c "echo 3 > test.sudoshc"
$ ls -l test.*
-rw-rw-r-- 1 zz4t zz4t 2 Mar 22 14:30 test.plain
-rw-r--r-- 1 root root 2 Mar 22 14:30 test.sudoshc

$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"

關閉 Intel 處理器的 Turbo mode

$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

使用 CPU 的 performace mode

$ sudo sh -c "echo performace > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor"

修改 SMP IRQ Affinity ，避免保留的 CPU 去處理 IRQ

#!/bin/sh
for file in `find /proc/irq -name "smp_affinity"`
do
    mask=0x`cat ${file}`
    new_mask=$((${mask} & ~0x80))
    new_mask=`printf '%02x' ${new_mask}`
    sudo sh -c "echo ${new_mask} > ${file}"
done

mask=0x`cat /proc/irq/default_smp_affinity`
new_mask=$((${mask} & ~0x80))
new_mask=`printf '%02x' ${new_mask}`
sudo sh -c "echo ${new_mask} > /proc/irq/default_smp_affinity"

有些 SMP IRQ Affinity 是無法從 User Space 更改的，比方說 IRQ0 - timer interrupt ，不過從 /proc/interrupts 觀察，剛好保留的 CPU 沒處理過這個 IRQ ，發生次數也不多。

$ cat /proc/interrupts 
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
   0:          7          0          0          0          0          0          0          0   IO-APIC    2-edge      timer
   1:          0          4          0          0          0          0          0          0   IO-APIC    1-edge      i8042
   ...

除了 IRQ0 外，另外還有 IRQ2 - cascaded interrupt 的也無法更改。不過 IRQ2 尚未發生過，在更改過 default_smp_affinity 後，若 IRQ2 發生，也不會使用保留的 CPU 。

## error msg from modify irq0 affinity
$ sudo sh -c "echo 7f > /proc/irq/0/smp_affinity"
sh: 1: echo: echo: I/O error
## some affinities cannot be changed
$ cat /proc/irq/0/effective_affinity
00

本共筆測試結果皆參照用統計手法去除極端值取樣 1000 次，取 95% 信賴區間之後再計算其平均，可參考程式碼 expt01_userkernel.c 。

測試效果

在原環境下：

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

在原環境下，每次測試變化大，不易觀察趨勢。

排除以上干擾後：

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

排除以上干擾後，每次測試的變化變小，趨勢比較容易觀察，穩定度真的上升很多，不過還是會有突波出現。

概觀 `fibdrv` 模組

編譯 fibdrv 模組：

$ make
$ ls -l fibdrv.*
-rw-rw-r-- 1 zz4t zz4t   4046 Mar  9 23:26 fibdrv.c
-rw-rw-r-- 1 zz4t zz4t 327920 Mar 21 14:09 fibdrv.ko
-rw-rw-r-- 1 zz4t zz4t     42 Mar 21 14:09 fibdrv.mod
-rw-rw-r-- 1 zz4t zz4t   1265 Mar 21 14:09 fibdrv.mod.c
-rw-rw-r-- 1 zz4t zz4t 106488 Mar 21 14:09 fibdrv.mod.o
-rw-rw-r-- 1 zz4t zz4t 223024 Mar 21 14:09 fibdrv.o

fibdrv 模組資訊：

$ modinfo fibdrv.ko
filename:       /home/zz4t/study/sysprog/fibdrv/fibdrv.ko
version:        0.1
description:    Fibonacci engine driver
author:         National Cheng Kung University, Taiwan
license:        Dual MIT/GPL
srcversion:     9A01E3671A116ADA9F2BB0A
depends:        
retpoline:      Y
name:           fibdrv
vermagic:       5.13.0-35-generic SMP mod_unload modversions

裝載 fibdrv 模組後，可以看到他以 character deivce 的形式出現：

$ sudo insmod fibdrv.ko
## Module              Size   Used by
$ lsmod | grep fibdrv
fibdrv                 16384  0
$ ls -l /dev/fibonacci 
crw------- 1 root root 507, 0 Mar 21 14:11 /dev/fibonacci
## (Device Major Number):(Minor number)
$ cat /sys/class/fibonacci/fibonacci/dev
507:0
$ cat /sys/module/fibdrv/version
0.1
## No one uses it
$ cat /sys/module/fibdrv/refcnt
0

模組主程式 `fibdrv.c` ：

由 fib_sequence 計算費氏數列。

static long long fib_sequence(long long k)
{
    /* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */
    long long f[k + 2];

    f[0] = 0;
    f[1] = 1;

    for (int i = 2; i <= k; i++) {
        f[i] = f[i - 1] + f[i - 2];
    }

    return f[k];
}

由於此模組於 VFS 中有產生 character device file ，所以可以藉由檔案操作介面進行操作，有定義的方法如下：

const struct file_operations fib_fops = {
    .owner = THIS_MODULE,
    .read = fib_read,
    .write = fib_write,
    .open = fib_open,
    .release = fib_release,
    .llseek = fib_device_lseek,
};

使用 read 讀取費氏數列的計算結果。

/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    return (ssize_t) fib_sequence(*offset);
}

計算費氏數列到第幾位是由 offset 控制，可由 lseek 更改 offset 。

static loff_t fib_device_lseek(struct file *file, loff_t offset, int orig)
{
    loff_t new_pos = 0;
    switch (orig) {
    case 0: /* SEEK_SET: */
        new_pos = offset;
        break;
    case 1: /* SEEK_CUR: */
        new_pos = file->f_pos + offset;
        break;
    case 2: /* SEEK_END: */
        new_pos = MAX_LENGTH - offset;
        break;
    }

    if (new_pos > MAX_LENGTH)
        new_pos = MAX_LENGTH;  // max case
    if (new_pos < 0)
        new_pos = 0;        // min case
    file->f_pos = new_pos;  // This is what we'll use now
    return new_pos;
}

藉由 mutex 控制使用此模組的行程數量，從 fib_open 可以看到一次只允許一個行程使用。

static DEFINE_MUTEX(fib_mutex);
static int fib_open(struct inode *inode, struct file *file)
{
    if (!mutex_trylock(&fib_mutex)) {
        printk(KERN_ALERT "fibdrv is in use");
        return -EBUSY;
    }
    return 0;
}
static int fib_release(struct inode *inode, struct file *file)
{
    mutex_unlock(&fib_mutex);
    return 0;
}

write 目前沒有實際功能，之後會拿來改寫成讀取計算時間用。

/* write operation is skipped */
static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    return 1;
}

加入計時器

這裡在 kernel space 使用 hrtimer 計時，而 user space 使用的是 clock_gettime ，最後可以用兩者的差計算出 system call 所佔用的時間。

改寫 write ，加入計時器，測量在 kernel space 裡計算費氏數列的時間。

static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    ktime_t kt;
    kt = ktime_get();
    fib_sequence(*offset);
    kt = ktime_sub(ktime_get(), kt);
    return (ssize_t) ktime_to_ns(kt);
}

使用 clock_gettime 計算在 user space 裡，計算費氏數列的總耗費時間。

#include <time.h>
int main(void)
{
    ...
    struct timespec t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    times[KERNEL][n] = write(fd_fib, buf, 1);
    clock_gettime(CLOCK_MONOTONIC, &t2);
    times[USER][n] = (double) (t2.tv_sec * 1e9 + t2.tv_nsec -
                               t1.tv_sec * 1e9 - t1.tv_nsec);
    ...
}

完整程式碼 expt01_userkernel.c

下圖為測試結果，只有計算到第 92 位費氏數，使用的方法為方法一（費氏數列定義），可以看到計算時間正比於費氏數列的位數，符合預期的 \(O(n)\) 時間複雜度。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

中間的藍色線為 kernel space 與 user space 的差，幾乎為水平線，為固定值。再做進一步測試，固定計算第零位費氏數，只改變輸入的 size 大小。在實作上是沒有使用到 copy_from_user ，所以應該不會因 size 大小而變動，藍色線應該還會是水平線。

diff --git a/expt_userkernel01.c b/expt_userkernel01.c
index 90a3f61..0d054a6 100644
--- a/expt_userkernel01.c
+++ b/expt_userkernel01.c
@@ -26,7 +26,7 @@ enum { KERNEL, USER };
 
 int main(void)
 {
-    char buf[1] = {'\0'};
+    char buf[93] = { [0 ... 91] = 'a', [92] = '\0' };
 
     int fd_fib = open(FIB_DEV, O_RDWR);
     if (fd_fib < 0) {
@@ -42,7 +42,7 @@ int main(void)
 
     /* start testing time */
     for (int i = 0; i <= NFIB; ++i) {
-        lseek(fd_fib, i, SEEK_SET);
+        int bufsize = i + 1;
         double times[NSPACE][NSAMPLE] = {0.};
         double mean[NSPACE] = {0.}, sd[NSPACE] = {0.};
 
@@ -50,7 +50,7 @@ int main(void)
         for (int n = 0; n < NSAMPLE; ++n) {
             struct timespec t1, t2;
             clock_gettime(CLOCK_MONOTONIC, &t1);
-            times[KERNEL][n] = write(fd_fib, buf, 1);
+            times[KERNEL][n] = write(fd_fib, buf, bufsize);
             clock_gettime(CLOCK_MONOTONIC, &t2);
             times[USER][n] = (double) (t2.tv_sec * 1e9 + t2.tv_nsec -
                                        t1.tv_sec * 1e9 - t1.tv_nsec);

結果符合推測，這結果在之後修改 write 時，引數 size 可以拿來傳遞其他參數。不過可以看到 kernel space 與 user space 之間的差距非常大，差不多是 500 納秒，與在 kernel space 計算所需的時間相比，成本是很高的。其原因是使用 write 這個 system call ，在 kernel space 與 user space 之間做轉換，以下為 perf 的執行結果，程式是執行 write 計算第 92 位費氏數 2,000,000 次，所以總佔時間為 95.84% ，這邊我們只看 write 裡面的成份。

            ---_start
               __libc_start_main
               main
               |          
                --95.84%--__GI___libc_write
                          |          
                          |--69.43%--entry_SYSCALL_64_after_hwframe
                          |          |          
                          |           --67.97%--do_syscall_64
                          |                     |          
                          |                     |--36.37%--syscall_exit_to_user_mode
                          |                     |          |          
                          |                     |           --1.62%--exit_to_user_mode_prepare
                          |                     |          
                          |                      --30.32%--__x64_sys_write
                          |                                |          
                          |                                 --30.15%--ksys_write
                          |                                           |          
                          |                                           |--25.62%--vfs_write
                          |                                           |          |          
                          |                                           |          |--4.03%--0xffffffffc302c23a
                          |                                           |          |          |          
                          |                                           |          |          |--2.71%--read_tsc
                          |                                           |          |          |          
                          |                                           |          |           --1.23%--ktime_get
                          |                                           |          |          
                          |                                           |          |--2.87%--rw_verify_area
                          |                                           |          |          |          
                          |                                           |          |           --2.58%--security_file_permission
                          |                                           |          |                     |          
                          |                                           |          |                     |--1.51%--apparmor_file_permission
                          |                                           |          |                     |          |          
                          |                                           |          |                     |           --1.49%--common_file_perm
                          |                                           |          |                     |          
                          |                                           |          |                      --0.72%--common_file_perm
                          |                                           |          |          
                          |                                           |          |--2.49%--0xffffffffc302c2af
                          |                                           |          |          |          
                          |                                           |          |          |--1.64%--read_tsc
                          |                                           |          |          |          
                          |                                           |          |           --0.66%--ktime_get
                          |                                           |          |          
                          |                                           |          |--2.30%--0xffffffffc302c295
                          |                                           |          |          
                          |                                           |          |--2.13%--0xffffffffc302c290
                          |                                           |          |          
                          |                                           |          |--2.13%--0xffffffffc302c28a
                          |                                           |          |          
                          |                                           |          |--2.00%--0xffffffffc302c29c
                          |                                           |          |          
                          |                                           |          |--1.82%--0xffffffffc302c2a2
                          |                                           |          |          
                          |                                           |          |--0.68%--__fsnotify_parent
                          |                                           |          |          
                          |                                           |          |--0.53%--0xffffffffc302c298
                          |                                           |          |          
                          |                                           |           --0.51%--0xffffffffc302c28c
                          |                                           |          
                          |                                           |--1.91%--__fdget_pos
                          |                                           |          |          
                          |                                           |           --1.70%--__fget_light
                          |                                           |          
                          |                                            --0.53%--0xffffffffc302c155
                          |          
                          |--12.22%--__entry_text_start
                          |          
                           --11.97%--syscall_return_via_sysret

費氏數列實作

方法一（費氏數列定義）

費氏數列的定義為 \(F(n+1)=F(n)+F(n-1)\) ，初始值不只一種，這裡使用的是 \(F(0)=0,\, F(1)=1\) 。

最初在模組裡所用的方法即為費氏數列的定義，先配置一個陣列，大小可容納我們指定的費氏數列所有數量，依費氏數列定義開始計算從小到大的費氏數，最終的計算即為所求。

static long long fib_sequence(long long k)
{
    /* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */
    long long f[k + 2];

    f[0] = 0;
    f[1] = 1;

    for (int i = 2; i <= k; i++) {
        f[i] = f[i - 1] + f[i - 2];
    }

    return f[k];
}

fibdrv.c 在編譯時會跳出 -Wvla 的警告 warning: ISO C90 forbids variable length array ‘f’ [-Wvla] ，而程式碼裡也有寫 FIXME ，指出 VLA 是不允許在 Linux kernel 中使用的，想到可直接替換的方法包括改用 kmalloc 等動態記憶體配置或是只儲存必要的兩個數。

下面程式碼使用 kmalloc_array 動態配置記憶體，跟原本的程式相比，差異只有 VLA 的部份換成使用動態記憶體配置，計算的部份沒有改變。

#include <linux/slab.h>
static long long fib_seq_kmalloc(long long k)
{
    long long result, *f = kmalloc_array(k + 2, sizeof(long long), GFP_KERNEL);
    if (unlikely(!f))
        return -1;

    f[0] = 0;
    f[1] = 1;
    for (int i = 2; i <= k; ++i)
        f[i] = f[i - 1] + f[i - 2];

    result = f[k];
    kfree(f);
    return result;
}

這個程式碼只使用必要的兩個數，每次做讀取和寫入的位置剛好與奇偶相關，所以每次陣列位置都需要做判斷奇偶的運算，不過由於陣列只有兩個數，足夠放在 cache 裡，甚至是 register 裡，減少讀取記憶體的時間。

static long long fib_seq_fixedla(long long k)
{
    long long f[2] = {0, 1};

    for (int i = 2; i <= k; ++i)
        f[i & 0x1] += f[(i - 1) & 0x1];

    return f[k & 0x1];
}

下圖為 kernel space 裡的各方法所需計算時間，可以看出長度為 2 的固定長度陣列（以下用圖中代號 fixed-la 稱呼）所需時間最少，再來是 VLA ，最後是差距很大的 kmalloc ，猜測 fixed-la 比較快是因為不用反覆去記憶體讀寫，而剩下兩個算法相同，都是用連續記憶體，那差距只在動態配置記憶體呼叫上。

方法二（ Exact solution method ）

由費氏數列的定義 \(F_{n+1}=F_n+F_{n-1}\) 與 \(F_n=F_n\) ，可以得出一個矩陣關係式：

\(\begin{pmatrix} F_{n+1} \\ F_n \end{pmatrix} =\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} F_n \\ F_{n-1} \end{pmatrix}\)

而這個關係式可以進一步推得

\(\begin{pmatrix} F_{n+k+1} \\ F_{n+k} \end{pmatrix} =\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^k \begin{pmatrix} F_{n+1} \\ F_{n} \end{pmatrix} \text{ 和 } \begin{pmatrix} F_{n+1} \\ F_n \end{pmatrix} =\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^n \begin{pmatrix} F_1 \\ F_0 \end{pmatrix}\)

這樣可以看出若要算出 \(F_n\) ，需要計算出中間矩陣的 \(n\) 次方，由於該矩陣是對稱矩陣，這裡可以使用對角化（ \(\phi_+\) 為黃金比例， \(\phi_-\) 為其另一個特徵值）。

\(\begin{split} \begin{pmatrix} F_{n+1} \\ F_n \end{pmatrix} &= \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^n \begin{pmatrix} 1 \\ 0 \end{pmatrix} \\ &= \dfrac{1}{\phi_+ - \phi_-} \begin{pmatrix} \phi_+ & \phi_-\\ 1 & 1 \end{pmatrix} \begin{pmatrix} \phi_+ & \\ & \phi_- \end{pmatrix}^n \begin{pmatrix} 1 & -\phi_- \\ -1 & \phi_+ \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} \\ &= \dfrac{1}{\sqrt{5}} \begin{pmatrix} \phi_+ & \phi_-\\ 1 & 1 \end{pmatrix} \begin{pmatrix} \phi_+^n & \\ & \phi_-^n \end{pmatrix} \begin{pmatrix} 1 \\ -1 \end{pmatrix} \\ &= \dfrac{1}{\sqrt{5}} \begin{pmatrix} \phi_+ & \phi_-\\ 1 & 1 \end{pmatrix} \begin{pmatrix} \phi_+^n \\ -\phi_-^n \end{pmatrix} \\ &= \dfrac{1}{\sqrt{5}} \begin{pmatrix} \phi_+^{n+1} - \phi_-^{n+1} \\ \phi_+^n - \phi_-^n \end{pmatrix} \end{split}\)

所以最後得到 \(F_n=\dfrac{1}{\sqrt{5}}(\phi_+^n - \phi_-^n)\) ，只需要計算冪即可，比較有趣的是 \(\phi_-\) 是個負數，針對他的次方 \(n\) 的奇偶性，會對前面的 \(\phi_+^n\) 一下往上修正，一下往下修正。不過可惜的是 \(\sqrt{5}\) 是無理數（ \(\phi_+\) 和 \(\phi_-\) 都有 \(\sqrt{5}\) ），在資料型態和記憶體大小的限制下，計算一定會產生誤差，而且 \(n\) 越大越明顯。

#include <math.h>
#define phi_p 1.6180339887498948L
#define phi_n -0.6180339887498949L
#define inv_sqrt5 0.4472135954999579L
long long fib_exact(int k)
{
    double result = inv_sqrt5 * (pow(phi_p, k) - pow(phi_n, k));
    return (long long) result;
}

這裡使用 double 型態，直到 \(F_{71}\) 都沒有問題，然而從下一個結果開始，誤差產生，然後擴大，可能的對應方式是增加計算精準度，針對不同位數的費氏數改變精準度。

	\(F_{72}\)	\(F_{92}\)
Fibonacci number	498454011879264	7540113804746346429
Exact-solution method	498454011879265	7540113804746369024
Error	+1	+22595

既然 \(\sqrt{5}\) 是問題根源，是否可以繞過他作計算，觀察 \(\phi_+ = \dfrac{1+\sqrt{5}}{2}\) 還有 \(\phi_- = \dfrac{1-\sqrt{5}}{2}\) ，每個費氏數都是整數，所以計算後的分子應該可以跟前面的分母 \(\sqrt{5}\) 相消，由此可知 \(F_n\) 即 \(\phi_+^n\) 的 \(\sqrt{5}\) 前面係數。

\(\begin{split} F_1 &= \dfrac{1}{\sqrt{5}}(\dfrac{1+\sqrt{5}}{2} - \dfrac{1-\sqrt{5}}{2}) \\ &= \dfrac{1}{\sqrt{5}}\cdot\dfrac{2\sqrt{5}}{2} \\ &= 1 \end{split}\)

從上面推導可以知道，實際上可以不用計算 \(\sqrt{5}\) 。又觀察 \(\dfrac{1+\sqrt{5}}{2}\) 的冪可以發現，最終都可以變成 \(\dfrac{a+b\sqrt{5}}{2}\) 的形式，所以可以得到：

\(\begin{split} (\dfrac{1+\sqrt{5}}{2})^{k-1}(\dfrac{1+\sqrt{5}}{2}) &= \dfrac{a+b\sqrt{5}}{2}\cdot\dfrac{1+\sqrt{5}}{2} \\ &= \dfrac{(a+5b)+(a+b)\sqrt{5}}{4} \\ &= \dfrac{(\dfrac{a+5b}{2})+(\dfrac{a+b}{2})\sqrt{5}}{2} \end{split}\)

以這個方式重複計算到 \(n\) 次方，然後最終 \(\dfrac{a+b}{2}\) 即為解。

static long long fib_seq_exactsol2(long long k)
{
    if (unlikely(k < 2))
        return k;

    long long a = 1, b = 1;
    for (int i = 2; i <= k; ++i) {
        long long tmp_a = (a + 5 * b) >> 1;
        b = (a + b) >> 1;
        a = tmp_a;
    }
    return b;
}

因為在運算中用到 (a + 5 * b) ，結果比之前的方法更早發生 overflow ，在計算 \(F_{91}\) 時發生溢位（前面的方法溢位在 \(F_{93}\) ），那麼就試著避免先相加或相乘的情況，利用在 quiz2 裡的測驗一學到的技巧。

\(\big(a,\, b\big) \rightarrow \big(\dfrac{a+5b}{2},\, \dfrac{a+b}{2}\big) = \big(\dfrac{a+b}{2}+2b,\, \dfrac{a+b}{2}\big)\)

static long long fib_seq_exactsol3(long long k)
{
    if (unlikely(k < 2))
        return k;

    long long a = 1, b = 1;
    for (int i = 2; i <= k; ++i) {
        long long old_b = b;
        b = (a & b) + ((a ^ b) >> 1);
        a = (old_b << 1) + b;
    }

    return b;
}

這樣溢位延後到 \(F_{92}\) 發生，而在計算時間上，第一改寫結果的計算時間比 fixed-la 高，比 vla 少；而第二次的改寫結果降到跟 fixed-la 一樣，猜測第一個改寫結果所需時間較高，可能跟乘法運算有關係。

方法三（ Fast doubling method ）

以上的時間複雜度都是 \(O(n)\) ，而接下來的第三個方法 Fast doubling method 可以降到 \(O(log\,n)\) 。

由 \(\begin{pmatrix} F_{n+k+1} \\ F_{n+k} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^{k} \begin{pmatrix} F_{n+1} \\ F_{n} \end{pmatrix}\) 可以推導出：

\(\begin{split} \begin{pmatrix} F_{2n+1} \\ F_{2n} \end{pmatrix} &= \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^{2n} \begin{pmatrix} 1 \\ 0 \end{pmatrix} \\ &=\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^{n-1} \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^{n-1} \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} \\ &=\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^{n-1} \begin{pmatrix} F_2 & F_1 \\ F_1 & F_0 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^{n-1} \begin{pmatrix} F_2 & F_1 \\ F_1 & F_0 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} \\ &=\begin{pmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{pmatrix} \begin{pmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} \\ &=\begin{pmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{pmatrix} \begin{pmatrix} F_{n+1} \\ F_n \end{pmatrix} \\ &= \begin{pmatrix} F_{n+1}^2 + F_n^2 \\ F_n(F_{n+1} + F_{n-1}) \end{pmatrix} \\ &=\begin{pmatrix} F_{n+1}^2 + F_n^2 \\ F_n(2F_{n+1} - F_{n}) \end{pmatrix} \end{split}\)

從推導出的關係式可以看出位數 \(n\) 可以成兩倍跳，但是單只用此關係式只能算出位數 \(n\) 為 \(2\) 的冪左右的費氏數，還需要作一些應變。現在觀察一個數的二進位表示，其組成跟 \(2\) 的冪有關，這裡用 \(13\) 舉例。

最高位數	十進位表示	二進位表示	目標的二進位餘下部份	動作需求
目標	13	1101
0	0	0	1101
\(2^0\)	1	1	101	\(\times2+1\)
\(2^1\)	3	11	01	\(\times2+1\)
\(2^2\)	6	110	1	\(\times2\)
\(2^3\)	13	1101		\(\times2+1\)

由上表看到 \(13\) 的二進位組成方式，剛開始由 \(0\) 開始，若下一個位數是 1 ，則需要 \(\times2+1\) ，所以 \(0\) 才會變成 \(1\) ；反之，下一個位數為 0 ，則只需要 \(\times 2\) ，所以 \(3\) 會變為 \(6\) ，原理跟短除法一樣。

\(\begin{matrix} F_0 \\ F_1 \end{matrix} \xrightarrow[]{\times2+1} \begin{matrix} F_1 \\ F_2 \end{matrix} \xrightarrow[]{\times2+1} \begin{matrix} F_3 \\ F_4 \end{matrix} \xrightarrow[]{\times2} \begin{matrix} F_6 \\ F_7 \end{matrix} \xrightarrow[]{\times2+1} \begin{matrix} F_{13} \\ F_{14} \end{matrix}\)

可以看出每一步都會用到 \(2\) 倍的關係式，而加 1 的部份可以用定義達成，由以上的說明可以寫出一個 bottom-up 的 fast doubling method ：

static long long fibseq_fastdoubling(long long k)
{
    if (unlikely(k < 2))
        return k;

    /* find the left-most bit */
    unsigned long long mask = 1ULL << 62;
    while (!(k & mask))
        mask >>= 1;

    /* fast doubling */
    long long a = 0, b = 1;
    while (mask) {
        /* times 2 */
        long long tmp = a;
        a = a * ((b << 1) - a);
        b = tmp * tmp + b * b;

        /* plus 1 */
        if (k & mask) {
            tmp = b;
            b += a;
            a = tmp;
        }
        mask >>= 1;
    }
    return a;
}

從測試結果可以看到 fast doubling method 的時間變化疑似水平線，跟之前的方法相比，在計算較大位數的費氏數時，都是 fast doubling method 以更少的時間完成計算。

fast doubling method 迴圈有二，第一個要找最高位數的 bit ，此法是從 long long 的第 62 位開始往右尋找；第二的迴圈則是 fast doubling method 的本體，迭代次數是看總共由幾個位數組成，舉例 13 是由 4 個位數組成，所以會迭代四次。

另外， mask 目前不需要使用到 unsigned long long 的長度，使用 unsigned int 長度即可應付到 \(F_{4294967295}\) （如果記憶體大到足夠使用的話）。

從上面的實驗結果可以看到，深藍色線從 \(F_3\) 開始會進入 fast doubling method 計算，包括第一個迴圈和第二個迴圈。而在第二個迴圈的迭代次數是看組成位數， 3 的組成位數為 2 所以會迭代兩次，然後到最後計算的 \(F_{92}\) ， 92 的組成位數是 6 ，可以明顯看出 fast doubling method 的結果大部分都是在跑第一個迴圈，所以現在針對第一個迴圈去做最佳化。

上面是測試最高位數 bit 的起始點，括號中為起始位數，除了 62 使用 unsigned long long ，其餘都使用 unsigned int 長度。現在是算到 \(F_{92}\) ，所以就測到起始位數 6 ，而這個改變至少可以使計算時間減少一半。

/* find the left-most bit */
unsigned mask = 1U << (fls((unsigned) k) - 1);

而找尋最高位數 bit 的方法還有其他方式，接下來測試 clz/ffs (count leading zeros/find first set) ，這裡使用 GCC builting-function __builting_clz 、 Linux 核心使用的 fls （方向跟 wiki 使用的相反，方向是從 LSB 往 MSB ，回傳的是位置 (1-based) ），其實作方式有三種（見相關連結），可以用 objdump -dS fibdrv.ko 看是使用哪一種，這裡使用的是 x86 硬體的 bsr 指令；同樣從反組譯的輸出看， __builtin_clz 也是用硬體的 bsr 指令。

; fast doubling method using fls (bsr)
0000000000000150 <fibseq_fastdoubling_fls>:
{
 150:   e8 00 00 00 00          callq  155 <fibseq_fastdoubling_fls+0x5>
 155:   55                      push   %rbp
 156:   48 89 e5                mov    %rsp,%rbp
    if (unlikely(k < 2))
 159:   48 83 ff 01             cmp    $0x1,%rdi
 15d:   7e 4b                   jle    1aa <fibseq_fastdoubling_fls+0x5a>
    unsigned mask = 1U << (fls((unsigned) k) - 1);
 15f:   ba 01 00 00 00          mov    $0x1,%edx
         * top 32 bits will be cleared.
         *
         * We cannot do this on 32 bits because at the very least some
         * 486 CPUs did not behave this way.
         */
        asm("bsrl %1,%0"
 164:   b9 ff ff ff ff          mov    $0xffffffff,%ecx
 169:   0f bd cf                bsr    %edi,%ecx
 16c:   d3 e2                   shl    %cl,%edx
    while (mask) {
        
; fast doubling method using __builtin_clz (bsr)
00000000000001c0 <fibseq_fastdoubling_clz>:
{
 1c0:   e8 00 00 00 00          callq  1c5 <fibseq_fastdoubling_clz+0x5>
 1c5:   55                      push   %rbp
 1c6:   48 89 e5                mov    %rsp,%rbp
    if (unlikely(k < 2))
 1c9:   48 83 ff 01             cmp    $0x1,%rdi
 1cd:   7e 49                   jle    218 <fibseq_fastdoubling_clz+0x58>
    unsigned mask = 1U << (31 - __builtin_clz((unsigned) k));
 1cf:   0f bd cf                bsr    %edi,%ecx
 1d2:   ba 00 00 00 80          mov    $0x80000000,%edx
 1d7:   83 f1 1f                xor    $0x1f,%ecx
 1da:   d3 ea                   shr    %cl,%edx
    while (mask) {

Linux 核心中的 __ffs

Linux kernel 裡 (x86) 的 __ffs 雖然名稱跟 wiki 的一樣，不過他的 first 指的是從 LSB 開始往 MSB 的方向，換句話說是在做 ctz (count tailing zeros) ，而他的實作有三種，一種是使用 GCC 的 builtin-function __builtin_ctzl ，一種是使用 x86 指令 bsf ，而最後一種使用二元搜尋法。

在第一次測試中，直接呼叫 __ffs ，結果導致 soft lockup ，最終是用 Magic SysRq key 強制關機，跟核心有關的東西還是小心為上。

相關連結：
Softlockup detector and hardlockup detector (aka nmi_watchdog)
__ffs 使用 bsf 位置： arch/x86/include/asm/bitiops.h
__ffs 使用二元搜尋法位置： include/asm-generic/bitops/__ffs.h
__ffs 使用 __builtin_ctzl 位置： include/asm-generic/bitops/builtin-__ffs.h

從測試可以看到 __builtin_clz 和 fls 跟起始位數 6 的差不多快，剛開始在 \(F_{20}\) 前是 __builtin_clz 與 fls 較快，然後到 \(F_{31}\) 的過程中三者不相上下，之後變成起始位數 6 較快。原因是使用迴圈的次數減少，當在找 \(F_{32}\) 最高位數時只需要右移一次，甚至在 \(F_{64}\) 之後不需要右移。但是我們不會每次都知道該使用什麼起始位數，在計算多種位數的費氏數時，使用 __builtin_clz 或 fls 會更佳。

相關連結：
Calculating Fibonacci Numbers by Fast Doubling
fls 使用 bsr 的位置： arch/x86/include/asm/bitops.h
fls 使用二元搜尋法的位置： include/asm-generic/bitops/fls.h
fls 使用 __builtin_clz 的位置： include/asm-generic/bitops/builtin-fls.h

在 \(F_{92}\) 之前就用上面這個測試當一個小結尾，下一個目標是計算 \(F_{93}\) 以上的費氏數。

計算 \(F_{93}\) 之後的費氏數列 - 大數計算

在計算 \(F_{93}\) 之前要先把 fibdrv.c 的限制提高，這是由於之前 long long 的最高上限只放得下 \(F_{92}\) ，所以在 lseek 的時候會檢查 offset ，如果輸入的參數 offset 大於 92 就會被限制成 92 。

diff --git a/fibdrv.c b/fibdrv.c
index d838dff..c6db028 100644
--- a/fibdrv.c
+++ b/fibdrv.c
@@ -18,7 +18,8 @@ MODULE_VERSION("0.1");
 /* MAX_LENGTH is set to 92 because
  * ssize_t can't fit the number > 92
  */
-#define MAX_LENGTH 92
+//#define MAX_LENGTH 92
+#define MAX_LENGTH 100

在之前的測試都只計算到 \(F_{92}\) ，是因為計算 \(F_{93}\) 會發生溢位，所以就不放在測試範圍內了。而溢位的原因是型態的限制， Linux 是使用 LP64 ，所以 long long 跟 long 位元是一樣長的，都是 64 bits ，從下表可以看到 long long 跟 unsigned long long 可以存下費氏數的極限，分別是 \(F_{92}\) 跟 \(F_{93}\) ，再多就存不下了，之後就必須靠 GCC 擴充或是大數計算了。

	\(F_{92}\)	\(F_{93}\)	\(F_{94}\)
Fibonacci	7540113804746346429	12200160415121876738	19740274219868223167
`LONG_MAX`	9223372036854775807	9223372036854775807	9223372036854775807
`ULONG_MAX`	18446744073709551615	18446744073709551615	18446744073709551615

參考 KYG-yaya573142 學長的報告，這裡大數計算使用類似的策略，用 uint32_t 儲存資料，在使用加法、乘法、左移運算等會溢位的運算時，轉換成 uint64_t 來計算，最後在依溢位的情況（第 32 個至第 63 個位元）決定是否改變 uint32_t 的數量和儲存方式。

結構 `fbn`

在 KYG-yaya573142 的實作中，很多方法都是適用於 general case 的，比方說可以計算負數，所以在結構裡會有 sign 這個成員，但在費氏數列中，以我們列舉的方法裡是不會遇到負數的（除了 exact solution method ），所以說有些部份我們可以針對費氏數的計算作特化。因此，這裡使用的結構為

/*
 * [num] points to an array, every elements are 4-byte,
 *       so storing a big number larger than 4 bytes will
 *       be like bellow:
 *
 * 0xfedc'ba98'7654'3210 => | 76543210 | fedcba98 | ... |
 *                               ^          ^
 *                             num[0]     num[1]
 *
 * [len] is the length of the array
 */
typedef struct {
    u32 *num;
    int len;
} fbn;

其儲存方式可以用下圖表示，這裡左側為以十六進位表示的目標數值，總共需要 6 bytes 來儲存，可見只用單個 uint32_t 或 uint64_t 都不夠，所以使用多個 uint32_t 來儲存，一個 uint32_t 可以儲存 4 bytes ，即圖右邊每個格子儲存 8 個十六進位的數字。而陣列方向則是順著 x86 的 little endian ，每次新的元素是產生在右邊。

其數學式可以由 \(\text{Big Number} = a_0 + a_1 2^{32} + a_2 2^{2\cdot32}+ a_3 2^{3\cdot32} + \dots + a_n 2^{n\cdot32}\text{ , }\forall\,a_k\in[\,0,\,2^{32} - 1]\) 來表示， \(a_0\) 即陣列第零個元素，以此類推。

加法 `fbn_add`

而實際計算方式可以從加法開始，而加法也是費氏數列的主要運算之一，先介紹一些用到的函式：

fbn_swap 是交換兩個 fbn 指標

/* Swap two fbn pointers */
#define fbn_swap(a, b)  \
    ({                  \
        fbn *tmp = a;   \
        a = b;          \
        b = tmp;        \
    })

fbn_resize 是調整 fbn 裡的 num 陣列長度

/*
 * Resize fbn.
 * @obj: fbn object
 * @len: new length to @obj's num
 * Return 0 on success and -1 on failure.
 */
static int fbn_resize(fbn *obj, int len)
{
    if (obj->len == len)
        return 0;
    obj->num = krealloc_array(obj->num, len, sizeof(u32), GFP_KERNEL);
    if (unlikely(!obj->num))
        return -1; /* fail to realloc */
    if (len > obj->len)
        memset(obj->num + obj->len, 0, sizeof(u32) * (len - obj->len));
    obj->len = len;
    return 0;
}

fbn_swap_content 是交換兩 fbn 所持有的內容，目的是取代比較費工的 fbn_copy 。在許多大數運算上會指定回傳的 fbn ，在運算過程中計算的結果可能會在 fbn 的暫時變數（運算完即被釋放）中，此時使用 fbn_copy 比較浪費，因為不需要兩個內容都相同，只需要將正確結果換進指定回傳的 fbn 內即可。

/* Swap two fbn contents. */
static void fbn_swap_content(fbn *a, fbn *b)
{
    u32 *num = a->num;
    a->num = b->num;
    b->num = num;
    int len = a->len;
    a->len = b->len;
    b->len = len;
}

/*
 * Copy fbn to another fbn.
 * @des: the copy destination of fbn
 * @src: the copy source of fbn
 * Return 0 on success and -1 on failure.
 */
int fbn_copy(fbn *des, fbn *src)
{
    int res = fbn_resize(des, src->len);
    if (unlikely(res < 0))
        return -1;
    memcpy(des->num, src->num, sizeof(u32) * src->len);
    return 0;
}

再來解釋加法 fbn_add ，剛開始挑選讓 a 為擁有陣列比較長的 fbn ，以利之後的程式碼簡潔，然後調整回傳結果的 c 的陣列長度，保持跟 a 是一樣長的。

/* fbn last element */
#define fbn_lastelmt(x) ((x)->num[(x)->len - 1])

/* c = a + b, addition assignment (a += b) is also acceptable */
void fbn_add(fbn *c, fbn *a, fbn *b)
{
    /* a->num is always the longest one */
    if (a->len < b->len)
        fbn_swap(a, b);
    fbn_resize(c, a->len);

    /* addition operation (same length part) */
    int i;
    u64 bcabinet = 0;
    for (i = 0; i < b->len; ++i) {
        bcabinet += (u64) a->num[i] + b->num[i];
        c->num[i] = bcabinet;
        bcabinet >>= 32;
    }
    /* addition operation (remaining part) */
    for (; i < a->len; ++i) {
        bcabinet += (u64) a->num[i];
        c->num[i] = bcabinet;
        bcabinet >>= 32;
    }
    /* if the carry is still remained */
    if (unlikely(bcabinet)) {
        fbn_resize(c, a->len + 1);
        fbn_lastelmt(c) = bcabinet; /* bcabinet = 1 */
    }
}

第二步開始進行加法，可以對照下面的數學表示，第一階段是 a, b 都擁有相對應的元素，所以在 bcabinet 的加法裡兩者都會參與，而計算上因為要保留溢位的部份，所以提昇成 u64 的型態來相加，加完以後 bcabinet 可以分為高位元和低位元兩個部份，低位元的部份可以直接存進 c 裡，而 bcabinet 的高位元部份代表現在陣列元素 u32 裡相加溢位的部份，換句話說是進位，影響的部份在下一個陣列元素，所以將之右移 32 位，準備跟下一次迴圈裡的陣列元素相加。

\(A = a_0 + a_1 2^{32} + a_2 2^{2\cdot32} + \dots \\ B = b_0 + b_1 2^{32} + b_2 2^{2\cdot32} + \dots \\ \text{If } a_0 + b_0 > (2^{32}-1)\text{ , then } a_0 + b_0 = (2^{32}-1) + K\text{ , where }K\in[\,1,\,2^{32}-1] \\ \text{i.e. }a_0 + b_0 = 2^{32} + (K-1) \\ \text{So, there's a carry bit from }2^{32}. \\ (a_1 + b_1 + 1)\cdot2^{32} = \dots\)

加法的第二階段跟第一階段相同，不過 b 已經沒有相對應的元素，所以 bcabinet 只需要跟 a 做相加。在最後 a 陣列對應的元素都相加完畢，如果 bcabinet 還有餘數，代表說還需要進位，但 c 的陣列已經到盡頭了，所以還要再加長 c 的陣列長度，將剩餘的 bcabinet 存入，這樣加法就完成了。

在加法的過程中，若 a 跟 c 為同一個指標也是沒有問題的，換句話說可以拿來做 addition assignment 運算（ a += b ），這是由於寫進 a 的某陣列元素後，代表該位數的相加已經結束，不會再回去使用該元素。順代一提， bcabinet 是來自 Biosafety Cabinet 給的印象，由於操作上會有生化危險，所以必須在一個櫃子裡操作。現在生活中充斥著新冠肺炎病毒 Covid-19 ，這裡變數名稱就使用跟病毒有關的 P4 實驗室器具。

大數計算版本的費氏數列 (1) `fbn_fib_defi`

完成加法後就可以寫出基本的費氏數列計算（使用定義的迭代方式）：

/*
 * Calculate the nth Fibonacci number with definition.
 * @des: fbn object to store @n-th Fibonacci number
 * @n: @n-th Fibonacci number
 */
void fbn_fib_defi(fbn *des, int n)
{
    /* trivial case */
    if (unlikely(n < 2)) {
        fbn_resize(des, 1);
        des->num[0] = n;
        return;
    }

    /* Fibonacci definition */
    fbn *arr[2];
    arr[0] = fbn_alloc(1); /* initial value is 0 */
    arr[1] = fbn_alloc(1);
    arr[1]->num[0] = 1;
    for (int i = 2; i <= n; ++i)
        fbn_add(arr[i & 1], arr[i & 1], arr[(i - 1) & 1]);

    fbn_swap_content(des, arr[n & 1]);
    fbn_free(arr[0]);
    fbn_free(arr[1]);
}

十進位字串輸出 `fbn_print`

此函式引入 KYG-yaya573142 的實作，原理跟上面 fast doubling method 說明 13 的二進位組成一樣。

/* Print fbn into a string (decimal), need kfree to free this string */
char *fbn_print(const fbn *obj)
{
    size_t slen = (sizeof(int) * 8 * obj->len) / 3 + 2;
    char *str = kmalloc(slen, GFP_KERNEL), *p = str;
    memset(str, '0', slen - 1);
    str[slen - 1] = '\0';

    for (int i = obj->len - 1; i >= 0; --i) {
        for (unsigned mask = 1U << 31; mask; mask >>= 1) {
            int carry = !!(mask & obj->num[i]);
            for (int j = slen - 2; j >= 0; --j) {
                /* bit * 2 + 1 */
                str[j] += str[j] - '0' + carry;
                carry = (str[j] > '9');
                if (carry)
                    str[j] -= 10;
            }
        }
    }

    while (p[0] == '0' && p[1] != '\0')
        ++p;
    memmove(str, p, strlen(p) + 1);
    return str;
}

改寫 read ，測試大數計算版本的費氏數列計算結果：

/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    fbn *fib = fbn_alloc(1);
    fbn_fib_defi(fib, *offset);
    char *str = fbn_print(fib);
    size_t left = copy_to_user(buf, str, strlen(str) + 1);
    kfree(str);
    fbn_free(fib);
    return left;
}

與 The first 500 Fibonacci numbers 對照 \(F_{92}\) 之後的結果，確認無誤。

...
92 7540113804746346429
93 12200160415121876738
94 19740274219868223167
95 31940434634990099905
96 51680708854858323072
97 83621143489848422977
98 135301852344706746049
99 218922995834555169026
100 354224848179261915075

左移運算 `fbn_lshift`

再來是準備寫 fast doubling method 的大數計算版本，這需要實作左移運算、減法和乘法，那麼先從左移運算開始。

左移運算有實作兩個版本，一個是移動 32 位元以內（超過會取 modulus ）的版本，另一個是通用版本，不過在 fast doubling method 中只會用到左移一位的運算，所以這邊就不放上通用版本了。

/* Modulo 32 */
#define MOD32(x) ((x) & ((1U << 5) - 1))
/* Round-Up 32 bits */
#define ROUNDUP32(x) (((x) + (1U << 5) - 1) & ~((1U << 5) - 1))
/* Check fbn number is zero or not */
#define fbn_iszero(x) (((x)->len == 1) && (!(x)->num[0]))

/*
 * Left-shift under 32 bits: b = a << k. a <<= k is also acceptable.
 * @b: fbn object to store the result
 * @a: fbn object to be shifted
 * @k: shift @k bits, k <= 32, take modulus 32
 */
void fbn_lshift32(fbn *b, fbn *a, int k)
{
    /* shift 0 bit or a is zero fbn */
    if (unlikely(!k || fbn_iszero(a))) {
        fbn_copy(b, a);
        return;
    }
    /* take modulus (1 <= k <= 32) and resize b */
    u32 kmod32 = MOD32(k);
    k = kmod32 + (!kmod32 << 5);
    int new_len = a->len - 1 + (ROUNDUP32(fls(fbn_lastelmt(a)) + k) >> 5);
    fbn_resize(b, new_len);

    /* shift and combine carry bits */
    u64 bcabinet = 0;
    for (int i = 0; i < a->len; ++i) {
        bcabinet = (u64) a->num[i] << k | bcabinet;
        b->num[i] = bcabinet;
        bcabinet >>= 32;
    }
    /* remaining part */
    if (bcabinet)
        fbn_lastelmt(b) = bcabinet;
}

減法 `fbn_sub`

減法運算跟加法 fbn_add 原理一樣，由於沒有使用 sign 這個成員，所以不能使用轉換正負號的方法。最後會將 num 的高位數元素為零的部份截掉，如果最終運算結果是零，則 fbn 的長度還是需要維持為 1 ，不過這在計算費氏數列時不會遇到。

/* c = a - b, where a >= b. a -= b is also acceptable */
void fbn_sub(fbn *c, fbn *a, fbn *b)
{
    fbn_resize(c, a->len);

    int i;
    u32 borrow = 0;
    u64 subtrahend;
    for (i = 0; i < b->len; ++i) {
        subtrahend = (u64) b->num[i] + borrow;
        borrow = subtrahend > a->num[i];
        c->num[i] = a->num[i] - (u32) subtrahend;
    }
    for (; i < a->len; ++i) {
        subtrahend = (u64) borrow;
        borrow = subtrahend > a->num[i];
        c->num[i] = a->num[i] - (u32) subtrahend;
    }
    /* truncate the leading zero elements */
    for (i = c->len - 1; i >= 0 && !c->num[i]; --i)
        ;
    fbn_resize(c, i + 1 + (i == -1)); /* avoid i == -1, happens when num
                                         is 0, although it won't happen
                                         in Fibonacci computing*/
}

乘法 `fbn_mul`

目前使用比較直觀的 long multiplication ，跟 fbn_lshift 一樣，因為有使用 fls ，所以要避免遇到 num 高位數元素為零的情況。

/* c = a * b (long multiplication). a *= b is also acceptable */
void fbn_mul(fbn *c, fbn *a, fbn *b)
{
    if (unlikely(fbn_iszero(a) || fbn_iszero(b))) {
        fbn_resize(c, 1);
        c->num[0] = 0;
        return;
    }

    /* Be careful, num[obj->len - 1] cannot be zero. If met, it means
     * there's zero elements in the high position of num didn't truncated */
    int new_len = a->len + b->len - 2 +
                  (ROUNDUP32(fls(fbn_lastelmt(a)) + fls(fbn_lastelmt(b))) >> 5);
    fbn *pseudo_c = fbn_alloc(new_len); /* need an all zero array */

    /* long multiplication */
    for (int offset = 0; offset < b->len; ++offset) {
        int pc_idx = 0; /* 0 for suppressing cppcheck */
        u64 bcabinet = 0;
        /* c += a * (b->num[offset]) */
        for (int i = 0; i < a->len; ++i) {
            pc_idx = i + offset;
            bcabinet +=
                (u64) a->num[i] * b->num[offset] + pseudo_c->num[pc_idx];
            pseudo_c->num[pc_idx] = bcabinet;
            bcabinet >>= 32;
        }
        pseudo_c->num[pc_idx + 1] = bcabinet; /* maybe it's 0 */
    }
    /* truncate the leading zero element */
    if (!fbn_lastelmt(pseudo_c))
        fbn_resize(pseudo_c, new_len - 1);

    /* pass the content to c */
    fbn_swap_content(pseudo_c, c);
    fbn_free(pseudo_c);
}

過程中 bcabinet 使用 u64 型態解決溢位問題，當每個值處於 u32 最大極限的狀態下， u64 依然可以容納其結果，請見下面數學關係式：

\(\text{Suppose that }a = b = c = bcabinet = 2^{32}-1 \\ \begin{split} \text{Then, } a\cdot b + c + bcabinet &= (2^{64} - 2\cdot 2^{32} + 1) + 2\cdot(2^{32} - 1) \\ &= 2^{64} - 1 \leq (2^{64} - 1) \end{split}\)

大數計算版本的費氏數列 (2) fbn_fib_fastdoubling

有了以上的運算，現在可以實作 fast doubling method 的大數計算版本。

/*
 * Calculate the nth Fibonacci number with fast doubling method.
 * @des: fbn object to store @n-th Fibonacci number
 * @n: @n-th Fibonacci number
 */
void fbn_fib_fastdoubling(fbn *des, int n)
{
    fbn_resize(des, 1);
    /* trivial case */
    if (unlikely(n < 2)) {
        des->num[0] = n;
        return;
    }

    /* fast doubling method */
    u32 mask = 1U << (fls((u32) n) - 1);
    fbn *a = des; /* a will be the result */
    fbn *b = fbn_alloc(1);
    fbn *tmp = fbn_alloc(1);
    a->num[0] = 0; /* a = 0 */
    b->num[0] = 1; /* b = 1 */
    while (mask) {
        /* times 2 */
        fbn_lshift32(tmp, b, 1);  /* tmp = ((b << 1) */
        fbn_sub(tmp, tmp, a);     /*          - a) */
        fbn_mul(tmp, tmp, a);     /*          * a */
        fbn_mul(a, a, a);         /* a^2 */
        fbn_mul(b, b, b);         /* b^2 */
        fbn_add(b, b, a);         /* b = a^2 + b^2 */
        fbn_swap_content(a, tmp); /* a <-> tmp */

        /* plus 1 */
        if (mask & n) {
            fbn_swap_content(a, b); /* a <-> b */
            fbn_add(b, b, a);       /* b += a */
        }
        mask >>= 1;
    }

    fbn_free(b);
    fbn_free(tmp);
}

效能測試

下圖為測試兩種方法 fbn_fib_defi （左）和 fbn_fib_fastdoubling （右），分別在 user space （綠線）與 kernel space （紫線）中所花費的時間，而靠在水平線上的是 copy_to_user 加上 kernel/user 空間轉換的時間（藍線）。

兩種方法相比， fast doubling method 稍快（ kernel 是包括 fbn_fib_fastdoubling 和 fbn_print ），但看不出 \(O(log\,n)\) 的趨勢，反而跟左圖趨勢相似，推測主要時間都被 fbn_print 給主導，以下測試將 fbn_print 移到藍線上，從測試結果可以看出推測是正確的。

從以上測試中得知， fbn_print 影響過大。然後在計算 \(F_{1000}\) 以下的費氏數時，大部分時間可以由 kernel space 主導，在之後的測試中，將以測試 kernel space 時間為主。下圖為三個函式在 kernel space 中的執行時間。

從以上測試中，計畫之後的效能改善分為兩個部份：

fbn_print ：此函式目的為輸出 fbn 結構持有數字所代表的十進位表示，影響總體花費時間最大。
以 fbn_fib_fastdoubling 為主的大數運算：在初步實作大數運算的過程中，有想到不少可以改善的空間，之後將以此函式相關部份最佳化。

改善效能

Radix conversion by performing short division

在 fbn_print 中主要計算在裡面的雙迴圈，分別對 bits 和 digits （十進位）走迴圈，其目的是 radix conversion ，從二進位轉換成十進位，原理是二進位的組成方式（見 13 的二進位組成方式）。而另外還有其他種轉換方式，比方說使用短除法，藉由不斷從除法得到的餘數，剛好會是轉換目標十進位表示的每一位數字（順序由低位到高位），所以首先需要一個有效率的大數除法。

舉個例子，將 0x80 除以 10 。一個十六進位數字可以存 4 個位元，所以上限為 0xf ，那麼其最高可以容納一個 10 (dec) 在裡面，如此我們可以以一個十六進位數字為單位進行除法，如下面數學所表示：

\(\begin{split} 0\text{x}80 &= 8\cdot 2^4 + 0\cdot 2^0 \\ &= (0\cdot 10 + 8)\cdot 2^4 + 0\cdot 2^0 \\ &= 0\cdot 10\cdot 2^4 + (8\cdot 2^4 + 0)\cdot 2^0 \\ &= 0\cdot 10\cdot 2^4 + (12\cdot 10 + 8)\cdot 2^0 \\ &= 0\cdot 10\cdot 2^4 + 12\cdot 10\cdot 2^0 + 8\cdot 2^0 \\ &= (0\cdot 2^4 + 12\cdot 2^0)\cdot 10 + 8 \\ &= 0\text{xc}\cdot 10 + 8\ \end{split}\)

剛開始先對高位數的 \(8\) 進行除法，得到商 \(0\) 餘數 \(8\) ，但由於是高位數，所以該餘數的實際值還要乘上他的底 \(2^4\) ，即實際值為 \(8\cdot 2^4\) ，也就是說還有 10 在裡面，所以接下來將他加進比較低的位數（此時兩者基底相同，皆為 \(2^0\) ），變成 \((8\cdot 2^4+0)\) ，繼續進行除法，得到商 \(12\) 餘數 \(8\) ，此時沒有更低的位數，餘數也比 10 小，所以除法已經完成。如此一來，最終可以算出商 0xc （依舊可以由十六進位表示）與餘數 8 。

此除法可以套用在大數計算上，不過除數有個限制，就是必須可以被 u32 給容納（ num 元素使用 u32 型態）。另外，大數除法需要走過陣列每一個元素，將會耗費不少時間，所以能做越少次越好，參考 sysprog21/bignum 的作法，可以將除數拉高到 \(10^9\) ，這樣總體流程變為

大數除法除以 \(10^9\)
得到的餘數再做短除法，輸出 9 個十進位數字
回到 1.

#define divl(high, low, d, q, r) \
    __asm__("divl %4" : "=a"(q), "=d"(r) : "0"(low), "1"(high), "rm"(d))

而高位加低位的除法剛好在 x86-64 裡有指令 div 可以使用，其中 l 是代表 doub(l)eword (4 bytes) ，每個變數皆為 32 bits ， d 是除數， q 為商， r 為餘數，比較特別的是被除數是 64 bits ，由 high, low 組成，所做的事情以下面數學表示：

\(high\cdot2^{32}+low = q\cdot d + r\text{ , where }0\leq q\leq(2^{32}-1)\text{ and }0\leq r<d\)

從上面數學式可以看出這個指令有 integer overflow 的可能，比方說 high = 1, d = 1 的情況，此時 q 會存不下超過 32 bits 的商，不過在這使用的演算法中不會遇到除法溢位，由於 high 是高位數的餘數，上限為 \(10^9-1\) ，此時除法不會遇到溢位問題，請見下面數學：

\(\text{Suppose that }high = 10^9-1,\,low = 2^{32}-1,\,d = 10^9 \\ \begin{split} q &= \lfloor\dfrac{high\cdot2^{32}+low}{d}\rfloor \\ &= \lfloor\dfrac{(10^9-1)\cdot2^{32}+(2^{32}-1)}{10^9}\rfloor \\ &= \lfloor\dfrac{2^{32}\cdot10^9-1}{10^9}\rfloor \\ &= \lfloor\dfrac{(2^{32}-1)\cdot10^9+(10^9-1)}{10^9}\rfloor \\ &= 2^{32}-1 \leq (2^{32}-1) \end{split}\)

以下為步驟 1. 的程式碼，由於想減少呼叫 krealloc ，所以沒有使用 fbn_resize ， obj 不會因為出現零元素就調整長度，這樣的話，需要另外使用 nonzero_len 紀錄現在實際值所需要的長度：

/*
 * Divide obj by 10^9.
 * @obj: fbn object which is dividend in the beginning and quotient in the end
 * @nonzero_len: obj->len - #(leading zero elements)
 * Return the remainder.
 */
static u32 fbn_divten9(fbn *obj, int nonzero_len)
{
    u32 high_r = 0, divisor = 1000000000U;
    /* start from the leading non-zero element */
    u32 *nump = obj->num + nonzero_len - 1;

    for (int i = nonzero_len - 1; i >= 0; --i) {
        u32 cur = *nump, q, r;
        divl(high_r, cur, divisor, q, r);
        *nump = q; /* store the quotient */

        high_r = r; /* update the remainder */
        --nump;     /* move to the lower num */
    }
    return high_r;
}

而在 Linux 核心中當然也有 radix conversion 的需要，比方說 printf ，方法也是使用短除法取餘數，不過並不會使用除法和模，取而代之使用定點數，舉個例子， r 除以 10 的商可以轉換成：

\(\begin{split} \dfrac{r}{10} &= \dfrac{r}{10}\cdot\dfrac{2^{15}}{2^{15}} \\ &= (r\cdot\dfrac{2^{15}}{10})\cdot\dfrac{1}{2^{15}} \\ &\approx (r\cdot0\text{xccd})\cdot\dfrac{1}{2^{15}} \end{split}\)

可以看到他將除法轉換成乘法，由於是取近似值，所以還是會有些誤差存在，不過他保證在 \(r<16389\) 的情形下輸出是正確的。這邊將 Linux 核心中的 put_dec 改寫成適合這裡的版本。

static unsigned put_dec_helper4(char *end, unsigned x)
{
    /* q = x / 10^4
     *   = (x * (2^43 / 10^4)) * 2^(-43)
     * require: x < 1,128,869,999 */
    unsigned q = (x * 0x346DC5D7ULL) >> 43;
    unsigned r = x - q * 10000;

    for (int i = 0; i < 3; ++i) {
        /* q2 = r / 10
         *    = (r * (2^15 / 10)) * 2^(-15)
         * require: r < 16,389 */
        unsigned q2 = (r * 0xccd) >> 15;
        *--end = '0' + (r - q2 * 10);
        r = q2;
    }
    *--end = '0' + r;

    return q;
}

程序大概是這樣，先除以 \(10^4\) ，然後除以 \(10\) 四次，分別從餘數得到 4 個 digits （基底為 \(10^0\) ），然後一樣步驟再算出 4 個 digits （基底為 \(10^4\) ），再來因為 put_dec 是輸入除以 \(10^9\) 之後的餘數，所以現在只剩一個 digit （基底為 \(10^8\) ），直接轉換成字元即可。

/* modified from put_dec() in drivers/firmware/efi/libstub/vsprintf.c */
static char *put_dec(char *end, unsigned n)
{
    unsigned high, q;
    char *p = end;
    high = n >> 16; /* low = (n & 0xffff) */

    /* n = high * 2^16 + low
     *   = 65536 * high + low
     *   = (6 * 10^4 + 5536) * high + low
     *   = (6 * high) * 10^4 + (5536 * high + low) */

    /* 10^0 part */
    q = 5536 * high + (n & 0xffff);
    q = put_dec_helper4(p, q);
    p -= 4; /* 4 digits */

    /* 10^4 part */
    q += 6 * high;
    q = put_dec_helper4(p, q);
    p -= 4; /* 4 digits */

    /* 10^8 part, q < 10 */
    *--p = '0' + q; /* the 9-th digit */

    return p;
}

再來將兩個函式組合起來，先是用 fbn_divten9 做 \(10^9\) 的大數除法，然後使用 put_dec 從餘數算出 9 個 digits ，輸出字串，重複這兩個步驟直到整個的短除法結束。

/* Print fbn into string (version 1), need kfree to free the string */
char *fbn_printv1(const fbn *obj)
{
    if (unlikely(fbn_iszero(obj))) {
        char *str = kmalloc(2, GFP_KERNEL);
        str[0] = '0';
        str[1] = '\0';
        return str;
    }

    ...
    int res = fbn_copy(obj2, obj); /* copy fbn */
    ...
    char *str_end = str + str_len - 1, *head = str_end - 1;

    /* short division, print decimal string */
    int nonzero_len = obj2->len;
    do {
        /* divided by 10^9, obj2 will become the quotient */
        u32 r_ten9 = fbn_divten9(obj2, nonzero_len);
        /* print r_ten9 in str (9 digits) */
        head = put_dec(head, r_ten9);

        /* decrease when a new leading zero element appears */
        nonzero_len -= !obj2->num[nonzero_len - 1];
    } while (nonzero_len);

    /* strip off the leading 0's */
    while (head < str_end && *head == '0')
        ++head;
    memmove(str, head, strlen(head) + 1);

    fbn_free(obj2);
    return str;
    ...
}

測試結果真的非常令人滿意，原因是 v0 是疊加二進位的 bit ，每次計算簡單，一個加法和一個 if 裡的減法，但總共計算次數要 bits 數量乘上 digits （十進位）數量（準確說是 kmalloc 的字串長度），這在費氏數越大時，其計算量非常可觀。而 v1 是使用短除法直接計算 digit ，每次計算雖然是比較複雜，除法、乘法和加法（ fbn_divten9 接 put_dec ），不過次數較少， fbn 長度乘上 digits 數量，總共計算量是少上許多。現在使用 v1 執行下面的實驗（ \(F_0\) 算到 \(F_{1000}\) 每次 1000 次），大概 5 秒內可以結束，這在之後開發與測試的效率將可以大幅提昇。

在 kernel space 中三個函式執行時間比較，現在 fbn_printv1 時間比 fbn_fib_fastdoubling 少，不過從趨勢看，之後 fbn_printv1 將會超過 fbn_fib_fastdoubling ，然後 copy_to_user 則是相較之下比較微小的漸增趨勢。

相關連結：
大數除法： sysprog21/bignum
使用定點數做 radix conversion ： put_dec
Reciprocal Multiplication, a tutorial
Ratio of Bits to Decimal Digits

Fast doubling method without subtraction (v0 -> v1)

原本的 fast doubling method 是 \(\begin{pmatrix}F_{n+1} \\ F_n\end{pmatrix}\rightarrow\begin{pmatrix}F_{2n+1} \\F_{2n}\end{pmatrix}\) ，這其中還有幾個地方還能改進，第一個：由於是從 \(\begin{pmatrix}F_1 \\ F_0\end{pmatrix}\) 開始，這在原本的演算法中，在第一次迴圈乘以二的部份會是做白工（ \(\begin{pmatrix}F_1 \\ F_0\end{pmatrix}\xrightarrow[\times2]{}\begin{pmatrix}F_1 \\F_0\end{pmatrix}\) ），只有加一的部份是有效的（ \(\begin{pmatrix}F_1 \\ F_0\end{pmatrix}\xrightarrow[+1]{}\begin{pmatrix}F_2 \\F_1\end{pmatrix}\) ）。第二個：有使用減法，在大數運算的減法中，會將沒有必要的零元素給去除，持續修正 fbn 的陣列長度，這在演算法中會增加呼叫 krealloc 的機會。第三個：若目標是 \(F_k\) ，則每次最終都會多計算一個附加的 \(F_{k+1}\) 。

針對這幾點，這裡使用稍微不同的數學等式，一樣利用 \(\begin{pmatrix} F_{n+2} \\ F_{n+1} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} F_{n+1} \\ F_{n} \end{pmatrix}\) 推導，不同在於這次是使用 \(2n-1\) 次方：

\(\begin{split} \begin{pmatrix} F_{2n} \\ F_{2n-1} \end{pmatrix} &= \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^{2n-1} \begin{pmatrix} F_1 \\ F_0 \end{pmatrix} \\ &=\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^{n-1} \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^{n-1} \begin{pmatrix} F_1 \\ F_0 \end{pmatrix} \\ &=\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^{n-1} \begin{pmatrix} F_2 & F_1 \\ F_1 & F_0 \end{pmatrix} \begin{pmatrix} F_n \\ F_{n-1} \end{pmatrix} \\ &=\begin{pmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{pmatrix} \begin{pmatrix} F_n \\ F_{n-1} \end{pmatrix} \\ &=\begin{pmatrix} F_n(F_{n+1}+F_{n-1}) \\ F_n^2 + F_{n-1}^2 \end{pmatrix} \\ &= \begin{pmatrix} F_n(2F_{n-1}+F_n) \\ F_n^2 + F_{n-1}^2 \end{pmatrix} \end{split}\)

最終變成 \(\begin{pmatrix}F_n \\ F_{n-1}\end{pmatrix}\rightarrow\begin{pmatrix}F_{2n} \\F_{2n-1}\end{pmatrix}\) ，來比較一下兩種等式的不同之處，以目標 \(F_{13}\) 為例：

原：\(\begin{matrix} F_1 \\ F_0 \end{matrix} \xrightarrow[\times2+1]{} \begin{matrix} F_2 \\ F_1 \end{matrix} \xrightarrow[\times2+1]{} \begin{matrix} F_4 \\ F_3 \end{matrix} \xrightarrow[\times2]{} \begin{matrix} F_7 \\ F_6 \end{matrix} \xrightarrow[\times2+1]{} \begin{matrix} F_{14} \\ F_{13} \end{matrix}\)

新：\(\begin{matrix} F_1 \\ F_0 \end{matrix} \xrightarrow[]{\times2+1} \begin{matrix} F_3 \\ F_2 \end{matrix} \xrightarrow[]{\times2} \begin{matrix} F_6 \\ F_5 \end{matrix} \xrightarrow[]{\times2+1} \begin{matrix} F_{13} \\ F_{12} \end{matrix}\)

可以看出使用新的等式，總體少了一次計算，是第一次迴圈的部份，之後的計算則是跟原本的相同。第二，新的等式最終會計算到 \(F_{13}\) ，不會計算出多餘的 \(F_{14}\) 。第三，減法的部份變成使用加法。

void fbn_fib_fastdoublingv1(fbn *des, int n)
{
    fbn_resize(des, 1);
    /* trivial case */
    if (unlikely(n < 2)) {
        des->num[0] = n;
        return;
    }

    /* fast doubling method */
    u32 mask = 1U << (fls((u32) n) - 1 - 1);
    fbn *a = fbn_alloc(1);
    fbn *b = des; /* b will be the result */
    fbn *tmp = fbn_alloc(1);
    a->num[0] = 0; /* a = 0 */
    b->num[0] = 1; /* b = 1 */
    while (mask) {
        /* times 2 */
        fbn_lshift32(tmp, a, 1);  /* tmp = ((a << 1) */
        fbn_add(tmp, tmp, b);     /*        + b) */
        fbn_mul(tmp, tmp, b);     /*        * b */
        fbn_mul(a, a, a);         /* a^2 */
        fbn_mul(b, b, b);         /* b^2 */
        fbn_add(a, a, b);         /* b = a^2 + b^2 */
        fbn_swap_content(b, tmp); /* a <-> tmp */

        /* plus 1 */
        if (mask & n) {
            fbn_swap_content(a, b); /* a <-> b */
            fbn_add(b, b, a);       /* b += a */
        }
        mask >>= 1;
    }

    fbn_free(a);
    fbn_free(tmp);
}

兩種方法進行比較測試，平均減少 260 ns 左右，計算 \(F_{1000}\) 時間從 4112.2986 (v0) 減至 3853.0461 ns (v1) ，減少幅度約為 6% 。

Reduce unnecessary computations (v1 -> v2)

DIV_ROUNDUP32 ：結合 ROUNDUP32 巨集與除以 32 的計算，減少不必要的計算，影響的函數有 fbn_lshift32 和 fbn_mul 。

 /* Modulo 32 */
 #define MOD32(x) ((x) & ((1U << 5) - 1))
+/* Divide 32 */
+#define DIV32(x) ((x) >> 5)
 /* Round up 32 bits */
 #define ROUNDUP32(x) (((x) + (1U << 5) - 1) & ~((1U << 5) - 1))
+/* Round up 32 bits and then divide 32 */
+#define DIV_ROUNDUP32(x) DIV32((x) + (1U << 5) - 1)

修改 fbn_lshift32 ：原函式可以左移 32 位元（含）以下，改為 31 位元（含）以下。

-void fbn_lshift32(fbn *b, fbn *a, int k)
+void fbn_lshift31(fbn *b, fbn *a, int k)
 {
     /* shift 0 bit or a is zero fbn */
     if (unlikely(!k || fbn_iszero(a))) {
         fbn_copy(b, a);
         return;
     }
-    /* take modulus (1 <= k <= 32) and resize b */
-    u32 kmod32 = MOD32(k);
-    k = kmod32 + (!kmod32 << 5);
-    int new_len = a->len - 1 + (ROUNDUP32(fls(fbn_lastelmt(a)) + k) >> 5);
+    /* take modulus 32 and resize b */
+    int new_len = a->len - 1 + DIV_ROUNDUP32(fls(fbn_lastelmt(a)) + MOD32(k)); 
     fbn_resize(b, new_len);

測試結果如下，由於計算費氏數越到後面，使用 fbn_lshift 與 fbn_mul 的次數就越多，所以時間的差距會拉大。計算 \(F_{1000}\) 時間從 3853.0461 (v1) 減少至 3653.0772 ns (v2 - divroundup32) ，減少幅度約 5% 。

Lazy `krealloc` (memory pool)

類似 Lazy allocation ，只有在需要的時候才會呼叫配置記憶體的函式，目的是減少呼叫 krealloc 的次數。這裡將配置陣列元素的單位增加至 4 個 u32 元素，只有當配置的陣列元素不足時才會呼叫 krealloc ，如此配置的陣列元素只增不減，可以減少呼叫 krealloc 的需求。

首先要修改 fbn 結構，新的成員 cap 是配置的總陣列長度；而新的 len 成員表示有效元素長度，指的是在該長度內元素儲存的值是有效的，因為將配置的單位加大成 4 ，有些元素是沒有意義的零，而 len 就是區分元素是否有效的一個成員，不過要注意若 fbn 要表示零，是使用 len 長度為零來表示：

/*
 * [num] points to an array, every elements are 4-byte,
 *       so storing a big number larger than 4 bytes will
 *       be like bellow:
 *
 * 0xba98'7654'3210 => | 76543210 | 0000ba98 | ... |
 *                           ^          ^
 *                         num[0]     num[1]
 *
 * [len] is the length of array with valid value elements
 *       i.e. allocated array length - #(leading zero elements)
 * [cap] is the allocated array length
 */
typedef struct {
    u32 *num;
    int len;
    int cap;
} fbn;

/* Check fbn number is zero or not */
#define fbn_iszero(x) (!(x)->len)

而 lazy krealloc 的主角 fbn_resize ，將配置的單位加大至 4 個元素，且只在配置的陣列長度不足時才會呼叫 krealloc 。

/*
 * Resize fbn, realloc if needed (lazy alloc).
 * @obj: fbn object
 * @len: new length to @obj's num
 * Return 0 on success and -1 on failure.
 */
static int fbn_resize(fbn *obj, int len)
{
    obj->len = len;
    if (likely(len <= obj->cap))
        return 0;
    int new_cap = ROUNDUP4(len);
    obj->num = krealloc_array(obj->num, new_cap, sizeof(u32), GFP_KERNEL);
    if (unlikely(!obj->num))
        return -1; /* fail to realloc */
    if (new_cap > obj->cap)
        memset(obj->num + obj->cap, 0, sizeof(u32) * (new_cap - obj->cap));
    obj->cap = new_cap;
    return 0;
}

一樣是測試 fast doubling method ，比較的對象是上一次的 v2 ，計算的費氏數越大，效益也越大。在計算 \(F_{1000}\) 時，時間從 3653.0772 (v2) 減少至 3603.3485 ns (v3) ，幅度約為 1% 。（若有突起，下降比紫線快，綠線總體相較比較平滑， why? ）

另一個方案是先計算最終費氏數需要多少陣列元素，只在最初的時候配置所有陣列元素，所以途中完全不需要使用 krealloc ，而如何計算多少陣列元素，可以參考此式 binary digits = \(\lfloor log_2(F_n)\rfloor + 1\) ，利用總二進位位數換算成 u32 的元素個數，但有兩個問題，一是需要浮點數計算，而 \(n\) 越大，誤差也越大，二是需要先計算出 \(F_n\) ，亦即我們的最終計算目標。

使用 `perf` 分析

使用 perf 觀察 fbn_fib_fastdoubling 的執行效能，將 fib_read 改成以下樣子，重複計算 fbn_fib_fastdoubling 200000 次，讓該函式變成時間主導項。由於有 lazy realloc 的關係，所以每次迴圈必須重新初始化 fib 這個變數，這裡可以用 caller-callee 關係知道 fbn_alloc 和 fbn_free 是哪一部份的程式，進一步排除該因素。

static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t method,
                        loff_t *offset)
{
    ...
#elif defined(_PERF_EVT)
#define REPEAT (200 * 1000)
    for (int i = 0; i < REPEAT; ++i) {
        fbn *fib = fbn_alloc(1);
        bn_fibonacci_seq[method](fib, *offset);
        fbn_free(fib);
    }
    return 0;
    ...
}

由 perf 輸出可以知道 fbn_fib_fastdoubling 每部份計算的時間佔比，其中 fbn_mul 佔最多，佔了 84.77% / 95.41% ，其他依序為 fbn_add, fbn_alloc, fbn_free, fbn_lshift31 ，佔比都相對低很多，很清楚可以知道，對 fbn_mul 最佳化效果會是最明顯的。

$ sudo perf record -g ./expt07bn_perf

## load kernel module symbols
$ sudo mkdir -p /lib/modules/$(uname -r)/extra
$ sudo cp fibdrv_bn.ko /lib/modules/$(uname -r)/extra
$ sudo depmod -a

$ sudo perf report --stdio --children -g graph,.3,caller
...
    96.83%     1.42%  expt07bn_perf  [fibdrv_bn]        [k] fbn_fib_fastdoublingv1
            |          
            |--95.41%--fbn_fib_fastdoublingv1
            |          |          
            |          |--84.77%--fbn_mul
            |          |          |          
            |          |          |--33.16%--fbn_alloc
            |          |          |          |          
            |          |          |          |--13.21%--__kmalloc
            |          |          |          |          |          
            |          |          |          |          |--2.13%--kmalloc_slab
            |          |          |          |          |          
            |          |          |          |           --0.49%--__cond_resched
            |          |          |          |          
            |          |          |          |--9.14%--memset_erms
            |          |          |          |          
            |          |          |          |--7.51%--kmem_cache_alloc_trace
            |          |          |          |          |          
            |          |          |          |           --0.45%--rcu_all_qs
            |          |          |          |          
            |          |          |          |--0.41%--__cond_resched
            |          |          |          |          
            |          |          |           --0.37%--kmalloc_slab
            |          |          |          
            |          |          |--22.27%--fbn_free
            |          |          |          |          
            |          |          |          |--20.62%--kfree
            |          |          |          |          |          
            |          |          |          |           --3.97%--memcg_slab_free_hook
            |          |          |          |          
            |          |          |           --1.27%--memcg_slab_free_hook
            |          |          |          
            |          |           --0.56%--kfree
            |          |          
            |          |--4.05%--fbn_add
            |          |          
            |          |--2.80%--fbn_alloc
            |          |          |          
            |          |          |--0.82%--__kmalloc
            |          |          |          
            |          |          |--0.82%--kmem_cache_alloc_trace
            |          |          |          
            |          |           --0.79%--memset_erms
            |          |          
            |          |--1.91%--fbn_free
            |          |          |          
            |          |           --1.50%--kfree
            |          |                     |          
            |          |                      --0.37%--memcg_slab_free_hook
            |          |          
            |           --1.39%--fbn_lshift31
...

$ sudo perf report -g graph,.3,caller
+   99.91%     0.00%  expt07bn_perf  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+   99.91%     0.00%  expt07bn_perf  [kernel.kallsyms]  [k] do_syscall_64
+   99.89%     0.00%  expt07bn_perf  libc-2.31.so       [.] __libc_start_main
+   99.89%     0.00%  expt07bn_perf  libc-2.31.so       [.] read
+   99.89%     0.00%  expt07bn_perf  [kernel.kallsyms]  [k] __x64_sys_read
+   99.89%     0.00%  expt07bn_perf  [kernel.kallsyms]  [k] ksys_read
+   99.89%     0.00%  expt07bn_perf  [kernel.kallsyms]  [k] vfs_read
+   99.63%     0.07%  expt07bn_perf  [fibdrv_bn]        [k] fib_read
+   96.83%     1.42%  expt07bn_perf  [fibdrv_bn]        [k] fbn_fib_fastdoublingv1
+   85.26%    28.79%  expt07bn_perf  [fibdrv_bn]        [k] fbn_mul
+   37.68%     2.32%  expt07bn_perf  [fibdrv_bn]        [k] fbn_alloc
+   25.40%     0.90%  expt07bn_perf  [fibdrv_bn]        [k] fbn_free
+   24.32%    19.42%  expt07bn_perf  [kernel.kallsyms]  [k] kfree
+   15.83%    12.09%  expt07bn_perf  [kernel.kallsyms]  [k] __kmalloc
+   10.41%     9.85%  expt07bn_perf  [kernel.kallsyms]  [k] memset_erms
+    9.53%     8.22%  expt07bn_perf  [kernel.kallsyms]  [k] kmem_cache_alloc_trace
+    6.13%     5.34%  expt07bn_perf  [kernel.kallsyms]  [k] memcg_slab_free_hook
+    4.39%     3.94%  expt07bn_perf  [fibdrv_bn]        [k] fbn_add
...

Karatsuba multiplication

\(U = U_0 + U_1 2^N \\ V = V_0 + V_1 2^N \\ \text{Let }z_0 = U_0V_0,\,z_1 = (U_0 + U_1)(V_0 + V_1),\,z_2 = U_1V_1 \\ \text{Then, }UV = z_0 + (z_1 - z_0 - z_2)2^N + z_1 2^{2N} \\ \text{Let }z_1' = (U_0 - U_1)(V_1 - V_0) = (U_1 - U_0)(V_0 - V_1) \\ \text{Then, }UV = z_0 + (z_1' + z_0 + z_2)2^N + z_1 2^{2N}\)

Karatsuba algorithm

square instead of mul

(save almost half of multiplications)

                a    b    c    d
x)              a    b    c    d
----------------------------------
               ad   bd   cd   dd
          ac   bc   cc   cd
     ab   bb   bc   bd
aa   ab   ac   ad

use gcc extension `unsigned __int128`

128-bit intergers

2022q1 Homework3 (fibdrv)

環境設置

獨立實驗環境，排除干擾

測試效果

概觀 fibdrv 模組

模組主程式 fibdrv.c ：

加入計時器

費氏數列實作

方法一（費氏數列定義）

方法二（ Exact solution method ）

方法三（ Fast doubling method ）

計算 \(F_{93}\) 之後的費氏數列 - 大數計算

結構 fbn

加法 fbn_add

大數計算版本的費氏數列 (1) fbn_fib_defi

十進位字串輸出 fbn_print

左移運算 fbn_lshift

減法 fbn_sub

乘法 fbn_mul

大數計算版本的費氏數列 (2) fbn_fib_fastdoubling

效能測試

改善效能

Radix conversion by performing short division

Fast doubling method without subtraction (v0 -> v1)

Reduce unnecessary computations (v1 -> v2)

Lazy krealloc (memory pool)

使用 perf 分析

Karatsuba multiplication

square instead of mul

use gcc extension unsigned __int128

Read more

SSH with Pem File

Logitech Gaming Mouse - G102 LightSync

Acer EFI Booting Problem

linux2023 - Homework2

概觀 `fibdrv` 模組

模組主程式 `fibdrv.c` ：

結構 `fbn`

加法 `fbn_add`

大數計算版本的費氏數列 (1) `fbn_fib_defi`

十進位字串輸出 `fbn_print`

左移運算 `fbn_lshift`

減法 `fbn_sub`

乘法 `fbn_mul`

Lazy `krealloc` (memory pool)

使用 `perf` 分析

use gcc extension `unsigned __int128`