2022q1 Homework3 (fibdrv)

contributed by < tinyynoob >

作業要求

最近的方向可以跳到 June rework 開始看

因為一篇筆記放不下了，再開一篇：fibdrv 2

研讀資料

Fibonacci 相關性質

https://hackmd.io/@sysprog/fibonacci-number

Algorithms for Computing Fibonacci Numbers Quickly

Fast doubling 算法

\begin{aligned} F_{2 n + 1} & = F_{n + 1}^{2} + F_{n}^{2} \\ F_{2 n} & = F_{n} \cdot (2 F_{n + 1} - F_{n}) \end{aligned}

作業說明中有給出 pseudocode :

Fast_Fib(n)
    a = 0; b = 1;       // m = 0
    for i = (number of binary digit in n) to 1
        t1 = a*(2*b - a);
        t2 = b^2 + a^2;
        a = t1; b = t2; // m *= 2
        if (current binary digit == 1)
            t1 = a + b; // m++
            a = b; b = t1;
    return a;

注意到這個 ^ 是指 power 而非 xor 。

目前看不出 clz() 能對計算 Fib 做到什麼強力優化，似乎只能加快 initialize i 而已，需要再研究研究。

待

Follow 《The Linux Kernel Module Programming Guide》

將自己研讀的摘要整理到另一篇筆記： https://hackmd.io/@tinyynoob/Hk89qPw-9

本地版本：

$ apt-cache search linux-headers-`uname -r`
linux-headers-5.13.0-30-generic - Linux kernel headers for version 5.13.0 on 64 bit x86 SMP

根據書中範例嘗試建立 hello-1.c 及對應的 makefile 。

make 時馬上出現問題：

log

$ make -p | grep PWD
scripts/Makefile.build:44: /home/tinyynoob/code/kernel/hello-1/Makefile: 沒有此一檔案或目錄
make[2]: *** 沒有規則可製作目標「/home/tinyynoob/code/kernel/hello-1/Makefile」。 停止。
OLDPWD = /home/tinyynoob/code/kernel
PWD = /home/tinyynoob/code/kernel/hello-1
CONFIG_HPWDT_NMI_DECODING = y
make[1]: *** [Makefile:1879：/home/tinyynoob/code/kernel/hello-1] 錯誤 2
OLDPWD = /home/tinyynoob/code/kernel
PWD = /home/tinyynoob/code/kernel/hello-1
CONFIG_HPWDT_NMI_DECODING = y
        @echo  '  Syntax: make -C path/to/kernel/src M=$$PWD target'
make: *** [makefile:7：all] 錯誤 2
PWD := /home/tinyynoob/code/kernel/hello-1
OLDPWD = /home/tinyynoob/code/kernel
        echo $((PWD)
        make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules 
        make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

發現原來檔名把 Makefile 寫成 makefile 也不行，更改後得到：

log

$ make
make -C /lib/modules/5.13.0-30-generic/build M=/home/tinyynoob/code/kernel/hello-1       modules 
make[1]: 進入目錄「/usr/src/linux-headers-5.13.0-30-generic」
  CC [M]  /home/tinyynoob/code/kernel/hello-1/hello-1.o
  MODPOST /home/tinyynoob/code/kernel/hello-1/Module.symvers
  CC [M]  /home/tinyynoob/code/kernel/hello-1/hello-1.mod.o
  LD [M]  /home/tinyynoob/code/kernel/hello-1/hello-1.ko
  BTF [M] /home/tinyynoob/code/kernel/hello-1/hello-1.ko
Skipping BTF generation for /home/tinyynoob/code/kernel/hello-1/hello-1.ko due to unavailability of vmlinux
make[1]: 離開目錄「/usr/src/linux-headers-5.13.0-30-generic」

看似往正確的方向前進了，也有產出 hello-1.ko ，但沒有完全成功，它在 vmlinux 遇到問題

vmlinux 問題尚未解決

根據 modinfo 可以發現在 hello-1.ko 的 vermagic 欄位有 modversions 。回去閱讀書中提到 Modversioning 的部份，得知我可能需要關掉它。

參考此，試著進到 /boot/config-5.13.0-30-generic 關掉 CONFIG_MODVERSIONS 。

關掉 modversions 但尚未 recompile kernel

把 hello-1.c 裡的 return 值改成負數會導致：

insmod: ERROR: could not insert module hello-1.ko: Operation not permitted

研讀數字字串運算範例程式

source code

數字字串在運算時可以因應字串 size 選擇不同的策略。

範例程式使用的字串使用 union 定義如下：




























typedef union {
    /* allow strings up to 15 bytes to stay on the stack
     * use the last byte as a null terminator and to store flags
     * much like fbstring:
     * https://github.com/facebook/folly/blob/master/folly/docs/FBString.md
     */
    char data[16];

    struct {
        uint8_t filler[15],
            /* how many free bytes in this stack allocated string
             * same idea as fbstring
             */
            space_left : 4,
            /* if it is on heap, set to 1 */
            is_ptr : 1, is_large_string : 1, flag2 : 1, flag3 : 1;
    };

    /* heap allocated */
    struct {
        char *ptr;
        /* supports strings up to 2^MAX_STR_LEN_BITS - 1 bytes */
        size_t size : MAX_STR_LEN_BITS,
                      /* capacity is always a power of 2 (unsigned)-1 */
                      capacity : 6;
        /* the last 4 bits are important flags */
    };
} xs;

在短碼的情況會使用第 9 行的 struct ，直接將資料字串存在 filler 並把一些 flag 資訊放在最後一個 byte
在長碼的情況會使用第 20 行的 struct ，使用一個 ptr 指標並配置空間在 heap ，後面的 8 byte 拿來存放一些 flag

還沒完全理解在長碼情況下，

如何繼續使用 is_ptr flag
capacity 使用的詳細狀況

範例程式在 fib_sequence() 計算 Fib ，算法使用逐項相加：

    for (i = 2; i <= k; i++)
        string_number_add(&f[i - 1], &f[i - 2], &f[i]);

因此若是我們能改善乘法及平方的複雜度，則可以使用 fast doubling 的技巧來改善目前算法。

Linux 效能分析

嘗試查看 CPU affinity ：

$ taskset -p 8528
pid 8528 目前的近似者遮罩：ff

ff 的二進位為 11111111 ，代表這個 process 可以在 7~0 任意一個 CPU 上執行。試著更改：

$ taskset -p FA 8528
pid 8528 目前的近似者遮罩：ff
pid 8528 新的近似者遮罩：fa

根據網路上的教學，在 /boot/grub/grub.cfg 中加入 isolcpus=0,1 。

最後，完成排除干擾效能分析的因素中的所有設定。

reboot 之後查看 process 發現 core mask 變成了 fc ：

$ taskset -p 1
pid 1 目前的近似者遮罩：fc

雖然有些已經變成 fc 但有些仍顯示 ff ，因此重新依據參考共筆進行設定。結果仍失敗，目前的在設定 smp_affinity 上因腳本錯誤遇到困難。

Fork

嘗試掛載模組之後輸入 sudo ./client 可以得到

結果

Reading from /dev/fibonacci at offset 0, returned the sequence 0.
Reading from /dev/fibonacci at offset 1, returned the sequence 1.
Reading from /dev/fibonacci at offset 2, returned the sequence 1.
Reading from /dev/fibonacci at offset 3, returned the sequence 2.
Reading from /dev/fibonacci at offset 4, returned the sequence 3.
Reading from /dev/fibonacci at offset 5, returned the sequence 5.
Reading from /dev/fibonacci at offset 6, returned the sequence 8.
Reading from /dev/fibonacci at offset 7, returned the sequence 13.
Reading from /dev/fibonacci at offset 8, returned the sequence 21.
Reading from /dev/fibonacci at offset 9, returned the sequence 34.
Reading from /dev/fibonacci at offset 10, returned the sequence 55.
Reading from /dev/fibonacci at offset 11, returned the sequence 89.
Reading from /dev/fibonacci at offset 12, returned the sequence 144.
Reading from /dev/fibonacci at offset 13, returned the sequence 233.
Reading from /dev/fibonacci at offset 14, returned the sequence 377.
Reading from /dev/fibonacci at offset 15, returned the sequence 610.
Reading from /dev/fibonacci at offset 16, returned the sequence 987.
Reading from /dev/fibonacci at offset 17, returned the sequence 1597.
Reading from /dev/fibonacci at offset 18, returned the sequence 2584.
Reading from /dev/fibonacci at offset 19, returned the sequence 4181.
Reading from /dev/fibonacci at offset 20, returned the sequence 6765.
Reading from /dev/fibonacci at offset 21, returned the sequence 10946.
Reading from /dev/fibonacci at offset 22, returned the sequence 17711.
Reading from /dev/fibonacci at offset 23, returned the sequence 28657.
Reading from /dev/fibonacci at offset 24, returned the sequence 46368.
Reading from /dev/fibonacci at offset 25, returned the sequence 75025.
Reading from /dev/fibonacci at offset 26, returned the sequence 121393.
Reading from /dev/fibonacci at offset 27, returned the sequence 196418.
Reading from /dev/fibonacci at offset 28, returned the sequence 317811.
Reading from /dev/fibonacci at offset 29, returned the sequence 514229.
Reading from /dev/fibonacci at offset 30, returned the sequence 832040.
Reading from /dev/fibonacci at offset 31, returned the sequence 1346269.
Reading from /dev/fibonacci at offset 32, returned the sequence 2178309.
Reading from /dev/fibonacci at offset 33, returned the sequence 3524578.
Reading from /dev/fibonacci at offset 34, returned the sequence 5702887.
Reading from /dev/fibonacci at offset 35, returned the sequence 9227465.
Reading from /dev/fibonacci at offset 36, returned the sequence 14930352.
Reading from /dev/fibonacci at offset 37, returned the sequence 24157817.
Reading from /dev/fibonacci at offset 38, returned the sequence 39088169.
Reading from /dev/fibonacci at offset 39, returned the sequence 63245986.
Reading from /dev/fibonacci at offset 40, returned the sequence 102334155.
Reading from /dev/fibonacci at offset 41, returned the sequence 165580141.
Reading from /dev/fibonacci at offset 42, returned the sequence 267914296.
Reading from /dev/fibonacci at offset 43, returned the sequence 433494437.
Reading from /dev/fibonacci at offset 44, returned the sequence 701408733.
Reading from /dev/fibonacci at offset 45, returned the sequence 1134903170.
Reading from /dev/fibonacci at offset 46, returned the sequence 1836311903.
Reading from /dev/fibonacci at offset 47, returned the sequence 2971215073.
Reading from /dev/fibonacci at offset 48, returned the sequence 4807526976.
Reading from /dev/fibonacci at offset 49, returned the sequence 7778742049.
Reading from /dev/fibonacci at offset 50, returned the sequence 12586269025.
Reading from /dev/fibonacci at offset 51, returned the sequence 20365011074.
Reading from /dev/fibonacci at offset 52, returned the sequence 32951280099.
Reading from /dev/fibonacci at offset 53, returned the sequence 53316291173.
Reading from /dev/fibonacci at offset 54, returned the sequence 86267571272.
Reading from /dev/fibonacci at offset 55, returned the sequence 139583862445.
Reading from /dev/fibonacci at offset 56, returned the sequence 225851433717.
Reading from /dev/fibonacci at offset 57, returned the sequence 365435296162.
Reading from /dev/fibonacci at offset 58, returned the sequence 591286729879.
Reading from /dev/fibonacci at offset 59, returned the sequence 956722026041.
Reading from /dev/fibonacci at offset 60, returned the sequence 1548008755920.
Reading from /dev/fibonacci at offset 61, returned the sequence 2504730781961.
Reading from /dev/fibonacci at offset 62, returned the sequence 4052739537881.
Reading from /dev/fibonacci at offset 63, returned the sequence 6557470319842.
Reading from /dev/fibonacci at offset 64, returned the sequence 10610209857723.
Reading from /dev/fibonacci at offset 65, returned the sequence 17167680177565.
Reading from /dev/fibonacci at offset 66, returned the sequence 27777890035288.
Reading from /dev/fibonacci at offset 67, returned the sequence 44945570212853.
Reading from /dev/fibonacci at offset 68, returned the sequence 72723460248141.
Reading from /dev/fibonacci at offset 69, returned the sequence 117669030460994.
Reading from /dev/fibonacci at offset 70, returned the sequence 190392490709135.
Reading from /dev/fibonacci at offset 71, returned the sequence 308061521170129.
Reading from /dev/fibonacci at offset 72, returned the sequence 498454011879264.
Reading from /dev/fibonacci at offset 73, returned the sequence 806515533049393.
Reading from /dev/fibonacci at offset 74, returned the sequence 1304969544928657.
Reading from /dev/fibonacci at offset 75, returned the sequence 2111485077978050.
Reading from /dev/fibonacci at offset 76, returned the sequence 3416454622906707.
Reading from /dev/fibonacci at offset 77, returned the sequence 5527939700884757.
Reading from /dev/fibonacci at offset 78, returned the sequence 8944394323791464.
Reading from /dev/fibonacci at offset 79, returned the sequence 14472334024676221.
Reading from /dev/fibonacci at offset 80, returned the sequence 23416728348467685.
Reading from /dev/fibonacci at offset 81, returned the sequence 37889062373143906.
Reading from /dev/fibonacci at offset 82, returned the sequence 61305790721611591.
Reading from /dev/fibonacci at offset 83, returned the sequence 99194853094755497.
Reading from /dev/fibonacci at offset 84, returned the sequence 160500643816367088.
Reading from /dev/fibonacci at offset 85, returned the sequence 259695496911122585.
Reading from /dev/fibonacci at offset 86, returned the sequence 420196140727489673.
Reading from /dev/fibonacci at offset 87, returned the sequence 679891637638612258.
Reading from /dev/fibonacci at offset 88, returned the sequence 1100087778366101931.
Reading from /dev/fibonacci at offset 89, returned the sequence 1779979416004714189.
Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120.
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 93, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 94, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 95, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 96, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 97, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 98, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 98, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 97, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 96, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 95, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 94, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 93, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120.
Reading from /dev/fibonacci at offset 89, returned the sequence 1779979416004714189.
Reading from /dev/fibonacci at offset 88, returned the sequence 1100087778366101931.
Reading from /dev/fibonacci at offset 87, returned the sequence 679891637638612258.
Reading from /dev/fibonacci at offset 86, returned the sequence 420196140727489673.
Reading from /dev/fibonacci at offset 85, returned the sequence 259695496911122585.
Reading from /dev/fibonacci at offset 84, returned the sequence 160500643816367088.
Reading from /dev/fibonacci at offset 83, returned the sequence 99194853094755497.
Reading from /dev/fibonacci at offset 82, returned the sequence 61305790721611591.
Reading from /dev/fibonacci at offset 81, returned the sequence 37889062373143906.
Reading from /dev/fibonacci at offset 80, returned the sequence 23416728348467685.
Reading from /dev/fibonacci at offset 79, returned the sequence 14472334024676221.
Reading from /dev/fibonacci at offset 78, returned the sequence 8944394323791464.
Reading from /dev/fibonacci at offset 77, returned the sequence 5527939700884757.
Reading from /dev/fibonacci at offset 76, returned the sequence 3416454622906707.
Reading from /dev/fibonacci at offset 75, returned the sequence 2111485077978050.
Reading from /dev/fibonacci at offset 74, returned the sequence 1304969544928657.
Reading from /dev/fibonacci at offset 73, returned the sequence 806515533049393.
Reading from /dev/fibonacci at offset 72, returned the sequence 498454011879264.
Reading from /dev/fibonacci at offset 71, returned the sequence 308061521170129.
Reading from /dev/fibonacci at offset 70, returned the sequence 190392490709135.
Reading from /dev/fibonacci at offset 69, returned the sequence 117669030460994.
Reading from /dev/fibonacci at offset 68, returned the sequence 72723460248141.
Reading from /dev/fibonacci at offset 67, returned the sequence 44945570212853.
Reading from /dev/fibonacci at offset 66, returned the sequence 27777890035288.
Reading from /dev/fibonacci at offset 65, returned the sequence 17167680177565.
Reading from /dev/fibonacci at offset 64, returned the sequence 10610209857723.
Reading from /dev/fibonacci at offset 63, returned the sequence 6557470319842.
Reading from /dev/fibonacci at offset 62, returned the sequence 4052739537881.
Reading from /dev/fibonacci at offset 61, returned the sequence 2504730781961.
Reading from /dev/fibonacci at offset 60, returned the sequence 1548008755920.
Reading from /dev/fibonacci at offset 59, returned the sequence 956722026041.
Reading from /dev/fibonacci at offset 58, returned the sequence 591286729879.
Reading from /dev/fibonacci at offset 57, returned the sequence 365435296162.
Reading from /dev/fibonacci at offset 56, returned the sequence 225851433717.
Reading from /dev/fibonacci at offset 55, returned the sequence 139583862445.
Reading from /dev/fibonacci at offset 54, returned the sequence 86267571272.
Reading from /dev/fibonacci at offset 53, returned the sequence 53316291173.
Reading from /dev/fibonacci at offset 52, returned the sequence 32951280099.
Reading from /dev/fibonacci at offset 51, returned the sequence 20365011074.
Reading from /dev/fibonacci at offset 50, returned the sequence 12586269025.
Reading from /dev/fibonacci at offset 49, returned the sequence 7778742049.
Reading from /dev/fibonacci at offset 48, returned the sequence 4807526976.
Reading from /dev/fibonacci at offset 47, returned the sequence 2971215073.
Reading from /dev/fibonacci at offset 46, returned the sequence 1836311903.
Reading from /dev/fibonacci at offset 45, returned the sequence 1134903170.
Reading from /dev/fibonacci at offset 44, returned the sequence 701408733.
Reading from /dev/fibonacci at offset 43, returned the sequence 433494437.
Reading from /dev/fibonacci at offset 42, returned the sequence 267914296.
Reading from /dev/fibonacci at offset 41, returned the sequence 165580141.
Reading from /dev/fibonacci at offset 40, returned the sequence 102334155.
Reading from /dev/fibonacci at offset 39, returned the sequence 63245986.
Reading from /dev/fibonacci at offset 38, returned the sequence 39088169.
Reading from /dev/fibonacci at offset 37, returned the sequence 24157817.
Reading from /dev/fibonacci at offset 36, returned the sequence 14930352.
Reading from /dev/fibonacci at offset 35, returned the sequence 9227465.
Reading from /dev/fibonacci at offset 34, returned the sequence 5702887.
Reading from /dev/fibonacci at offset 33, returned the sequence 3524578.
Reading from /dev/fibonacci at offset 32, returned the sequence 2178309.
Reading from /dev/fibonacci at offset 31, returned the sequence 1346269.
Reading from /dev/fibonacci at offset 30, returned the sequence 832040.
Reading from /dev/fibonacci at offset 29, returned the sequence 514229.
Reading from /dev/fibonacci at offset 28, returned the sequence 317811.
Reading from /dev/fibonacci at offset 27, returned the sequence 196418.
Reading from /dev/fibonacci at offset 26, returned the sequence 121393.
Reading from /dev/fibonacci at offset 25, returned the sequence 75025.
Reading from /dev/fibonacci at offset 24, returned the sequence 46368.
Reading from /dev/fibonacci at offset 23, returned the sequence 28657.
Reading from /dev/fibonacci at offset 22, returned the sequence 17711.
Reading from /dev/fibonacci at offset 21, returned the sequence 10946.
Reading from /dev/fibonacci at offset 20, returned the sequence 6765.
Reading from /dev/fibonacci at offset 19, returned the sequence 4181.
Reading from /dev/fibonacci at offset 18, returned the sequence 2584.
Reading from /dev/fibonacci at offset 17, returned the sequence 1597.
Reading from /dev/fibonacci at offset 16, returned the sequence 987.
Reading from /dev/fibonacci at offset 15, returned the sequence 610.
Reading from /dev/fibonacci at offset 14, returned the sequence 377.
Reading from /dev/fibonacci at offset 13, returned the sequence 233.
Reading from /dev/fibonacci at offset 12, returned the sequence 144.
Reading from /dev/fibonacci at offset 11, returned the sequence 89.
Reading from /dev/fibonacci at offset 10, returned the sequence 55.
Reading from /dev/fibonacci at offset 9, returned the sequence 34.
Reading from /dev/fibonacci at offset 8, returned the sequence 21.
Reading from /dev/fibonacci at offset 7, returned the sequence 13.
Reading from /dev/fibonacci at offset 6, returned the sequence 8.
Reading from /dev/fibonacci at offset 5, returned the sequence 5.
Reading from /dev/fibonacci at offset 4, returned the sequence 3.
Reading from /dev/fibonacci at offset 3, returned the sequence 2.
Reading from /dev/fibonacci at offset 2, returned the sequence 1.
Reading from /dev/fibonacci at offset 1, returned the sequence 1.
Reading from /dev/fibonacci at offset 0, returned the sequence 0.

可以發現到 92 項以後數值就不再改變，代表這裡存在嚴重缺陷，對於超過 64 位元的數值無能為力。

另外在輸入指令

$ ls -l /dev/fibonacci

後，不是得到 235 也不是得到 236 而是 505

crw------- 1 root root 505, 0  3月 11 13:10 /dev/fibonacci

ktime doc

看參考共筆依樣畫葫蘆，由於目前 fibdrv.c 的 fib_write() 沒有功能，所以在其中塞入測量時間的程式碼：

static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    ktime_t kt;
    kt = ktime_get();
    fib_sequence(*offset);
    kt = ktime_sub(ktime_get(), kt);
    return (ssize_t) ktime_to_ns(kt);
}

然而輸出的數值看起來都差不多，回頭想找找看這個 offset 到底是怎麼設的。我發現在 client.c 呼叫時傳入的參數並不包含 offset ，不過在呼叫 write() 跟 read() 時有個不同是 read() 多了一行：

        lseek(fd, i, SEEK_SET);

其中 SEEK_SET 將被展開為 0 。觀察發現呼叫 lseek() 會連到 fibdrv.c 中的：

static loff_t fib_device_lseek(struct file *file, loff_t offset, int orig)
{
    loff_t new_pos = 0;
    switch (orig) {
    case 0: /* SEEK_SET: */
        new_pos = offset;
        break;
    case 1: /* SEEK_CUR: */
        new_pos = file->f_pos + offset;
        break;
    case 2: /* SEEK_END: */
        new_pos = MAX_LENGTH - offset;
        break;
    }

    if (new_pos > MAX_LENGTH)
        new_pos = MAX_LENGTH;  // max case
    if (new_pos < 0)
        new_pos = 0;        // min case
    file->f_pos = new_pos;  // This is what we'll use now
    return new_pos;
}

雖然可以猜測這個函式的回傳值會更改 offset ，但是目前不知道這到底是如何達成的，我甚至不曉得 offset 到底存在哪裡。

TODO:

釐清 offset

答：
我認為就是存在 file 結構的 f_pos 成員中

研讀參考共筆中的大數運算

https://hackmd.io/@KYWeng/rkGdultSU#如何減少大數運算的成本

這份共筆提出多種實作上改進計算 Fib 速度的技巧，摘要如下：

調整 Fast Fib 算法

將原先的算法由

\begin{aligned} F_{2 n + 1} & = F_{n + 1}^{2} + F_{n}^{2} \\ F_{2 n} & = F_{n} \cdot (2 F_{n + 1} - F_{n}) \end{aligned}

改良為：

\begin{aligned} F_{2 n - 1} & = F_{n}^{2} + F_{n - 1}^{2} \\ F_{2 n} & = F_{n} \cdot (2 F_{n - 1} + F_{n}) \end{aligned}

以迴避所有減法運算，並且不會再出現任何負數，可以全部用 unsigned 型態儲存。

使用 memory pool

在大數結構中引入 memory pool 的概念來避免頻繁的 realloc() 。

使用內嵌組合語言加速乘法

做乘法運算時直接指定高低位的位置，而不是使用擴展整數型別。

以參考共筆中的語法為例：

__asm__("mulq %3" : "=a"(low), "=d"(high) : "%0"(a[i]), "rm"(k));

        assembly language : output :  input

其中 instruction 顯然為 mulq ，而

= 是 write only 的意思
a, d 分別表示 $rax, $rdx
( ) 內使用 C 變數
rm 表示由 compiler 自己選擇存放位置
%0 在此等同 a 暫存器
%3 代表 rm

參考：

https://ithelp.ithome.com.tw/articles/10213500
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

引入 Karatsuba algorithm 算法

核心思想為在做乘法時，將一些乘法運算改為加減，考慮：

\begin{aligned} (a_{1} 2^{n} + a_{0}) \times (b_{1} 2^{n} + b_{0}) & = c_{2} 2^{2 n} + c_{1} 2^{n} + c_{0} \\ = (a_{1} b_{1}) 2^{2 n} + (a_{1} b_{0} + a_{0} b_{1}) 2^{n} + a_{0} b_{0} \end{aligned}

原先需要做 4 次乘法運算，經過一些設計可以變成：

(2^{2 n} + 2^{n}) (a_{1} b_{1}) + 2^{n} (a_{1} - a_{0}) (b_{0} - b_{1}) + (2^{n} + 1) (a_{0} b_{0})

便能少做一次乘法。

改進 fibdrv

實作程式碼 github

大數運算

在為 fibdrv 改進效能之前，我們需要先確保它的正確性，為此我們首先引入大數運算的機制。對於存放大數的方式前面提出了兩種設計：

單純使用結構
使用 union 達成 small string optimization

由於在計算 Fib 的情境下不會一直需要宣告新的變數，也不會有頻繁居於小數計算的情況，因此我們選擇使用結構的方式，如此也較便利於 maintain resize 的功能。

因為我目前不太熟悉 kernel programming 的各種 function 用法，因此我打算先在 user space 開發，並 debug 確認功能沒問題之後再移植到 kernel space 。

我想借助 macro 來完成此工作，例如這樣：

#if DEBUG
    N = (ubn *) malloc(sizeof(ubn));
#else
    N = (ubn *) kmalloc(sizeof(ubn), GFP_KERNEL);
#endif

由於在計算 Fib 的過程中不會遇到任何負數，因此我的大數運算假設不存在負數，相關的運算也都沒有提供。

引入共筆用的 macro ：

#if defined(__LP64__) || defined(__x86_64__) || defined(__amd64__) || \
    defined(__aarch64__)
typedef uint64_t ubn_unit;
// typedef unsigned __int128 bn_data_tmp; // gcc support __int128
#define ubn_unit_bit 64
#else
typedef uint32_t ubn_unit;
// typedef uint64_t bn_data_tmp;
#define ubn_unit_bit 32
#endif

目前還未使用到加倍的型別，暫時先註解掉。

基於沒有負數的假設，我們得以直接移除共筆結構中的 sign 成員，訂出：

/* unsigned big number
 * @data: MSB:[size-1], LSB:[0]
 * @size: sizeof(ubn_unit)
 * @capacity: allocated size
 */
typedef struct {
    ubn_unit *data;
    int size;
    int capacity;
} ubn;

預期 (無號) 大數運算的界面提供以下功能：

ubignum_init()
ubignum_free()
ubignum_resize()
ubignum_add()
ubignum_mult()
ubignum_assign()

目前比較沒有頭緒的是 assign() 的作法

ubignum_init()

程式碼

bool ubignum_init(ubn *N)
{
#if DEBUG
    N = (ubn *) malloc(sizeof(ubn));
#else
    N = (ubn *) kmalloc(sizeof(ubn), GFP_KERNEL);
#endif
    if (!N)
        goto struct_aloc_failed;

#if DEBUG
    N->data = (ubn_unit *) calloc(sizeof(ubn_unit),
                                  DEFAULT_CAPACITY);  // including mem reset
#else
    N->data = (ubn_unit *) kcalloc(sizeof(ubn_unit), DEFAULT_CAPACITY,
                                   GFP_KERNEL);  // including mem reset
#endif
    if (!N->data)
        goto data_aloc_failed;
    N->size = 0;
    N->capacity = DEFAULT_CAPACITY;
    return true;

data_aloc_failed:
    free(N);
    N = NULL;
struct_aloc_failed:
    return false;
}

其中 DEFAULT_CAPACITY 目前訂為 16 。

靜態分析出現一堆問題，這才發現我原本的寫法根本不會正確把初始化後的 N 回傳，因此修改程式碼：

ubignum_resize()

程式碼

bool ubignum_resize(ubn *N, int new_size)
{
    if (new_size < 0)
        return false;
#if DEBUG
    ubn_unit *new = (ubn_unit *) realloc(N->data, sizeof(ubn_unit) * new_size);
#else
    ubn_unit *new =
        (ubn_unit *) krealloc(N->data, sizeof(ubn_unit) * new_size, GFP_KERNEL);
#endif
    if (!new)
        return false;
    N = new;
    N->capacity = new_size;
    return true;
}

目前不確定允許 new_size 傳入 0 是否是個好主意。

ubignum_add()

bool ubignum_add(const ubn *a, const ubn *b, ubn *out)

目前設計讓使用者傳入存放 sum 的變數，但這麼做就需要解決 pointer aliasing 的問題。

由低到高做加法的流程大致可以分成兩段：

兩數相加
一數加上 1. 的 carry out

首先開發 1. 的部份：














    ubn_unit carry = 0;
    int i = 0;
    for (i = 0; i < min(a->size, b->size); i++) {
        if (__builtin_expect(i >= out->capacity, 0)) {
            if (!resize(out, out->capacity * 2)) {
                return false;
            }
        }
        out->data[i] = a->data[i] + b->data[i] + carry;
        // The following consideration does not include carry
        // fix it!
        carry = ((a->data[i] ^ b->data[i]) >> 1 ^ (a->data[i] & b->data[i])) &
                (1u << (ubn_unit_bit - 1));
    }

若超出 out 範圍則對它進行 resize() ，其中加入 builtin_expect() 協助 compiler 進行 branch prediction 。

撰寫過程馬上讓我遭遇了幾個問題：

第 11 行，我嘗試引入 quiz2 第一題計算平均的方式再取 MSB 來得出新的 carry ，但我發覺這樣的話舊的 carry 沒有被加進來，所以在例如 (a, b, c) = (UINT_MAX, 0, 1) 的情況下會算出錯誤結果。
我不熟悉 gcc 對於 unsigned 型態的右移會如何處理

根據 gcc 的 mail list ：

For right shifts, if the type is signed, then we use an arithmetic shift;
if the type is unsigned then we use a logical.

因此對於 unsigned 右移的行為符合我的預期。

1. 待解決

\to

後面完全更換作法就不再有這個問題

為了應對 pointer aliasing 的問題，我們應該不能直接拿 a->data 和 b->data 去運算，因為它們在第 9 行可能會被改變 (如果他們跟 out 指到同樣位置) ，那麼第 12 行的運算將會錯誤。因此需要先將它們在第 9 行的值存起來，將程式碼改成：













    for (i = 0; i < min(a->size, b->size); i++) {
        if (__builtin_expect(i >= out->capacity, 0)) {
            if (!resize(out, out->capacity * 2)) {
                goto realoc_failed;
            }
        }
        ubn_unit an = a->data[i];
        ubn_unit bn = b->data[i];
        out->data[i] = an + bn + carry;
        // The following consideration does not include carry
        // fix it!
        carry = ((an ^ bn) >> 1 + (an & bn)) & (1u << (ubn_unit_bit - 1));
    }

其中第 4 行跳到最後去把 a 跟 b 還原成原本的樣子，這也是因為 aliasing 才有的問題。

第二段程式碼：





















    ubn *remain = NULL;
    if (i == a->size)
        remain = b;
    else if (i == b->size)
        remain = a;
    for (; carry || i < remain->size; i++) {
        if (__builtin_expect(i >= out->capacity, 0)) {
            if (!resize(out, out->capacity * 2)) {
                return false;
            }
        }
        if (__builtin_expect(i < remain->size, 1)) {
            out->data[i] = remain->data[i] + carry;
            carry =
                ((remain->data[i] ^ carry) >> 1 ^ (remain->data[i] & carry)) &
                (1u << (ubn_unit_bit - 1));
        } else {  // the last carry out carries to a new ubn_unit
            out->data[i] = carry;  // = 1
            carry = 0;
        }
    }

先使用 remain 來記住哪一個加數還有剩餘的位數沒加，接著就進入迴圈繼續加完，大部分的情況都會走 12 行，唯一會走到 17 行的情況是最高位有 carry out 而且剛好 carry 到一個新的 ubn_unit chunk 。

同樣處理 pointer aliasing 的問題並改進寫法得：

    ubn *remain = (i == a->size) ? b : a;
    for (; carry || i < remain->size; i++) {
        if (__builtin_expect(i >= out->capacity, 0)) {
            if (!resize(out, out->capacity * 2)) {
                goto realoc_failed;
            }
        }
        if (__builtin_expect(i < remain->size, 1)) {
            ubn_unit rn = remain->data[i];
            out->data[i] = rn + carry;
            carry =
                ((rn ^ carry) >> 1 ^ (rn & carry)) & (1u << (ubn_unit_bit - 1));
        } else {  // the last carry out carries to a new ubn_unit
            out->data[i] = carry;  // = 1
            carry = 0;
        }
    }

處理 aliasing 的程式碼：

    ubn *store = NULL;
    int alias = 0;
    if (a == out)  // pointer aliasing
        alias ^= 1;
    if (b == out)  // pointer aliasing
        alias ^= 2;
    if (alias) {
        store = ubignum_duplicate(store, out);
        if (!store)
            return false;
    }

直接使用一個變數 alias 中的 bit 來觀察，若有 aliasing 則配置空間給保留原值。

realoc_failed:
    switch (alias) {
    case 1:
        swapptr(&a, &store);
        break;
    case 2:
    case 3:
        swapptr(&b, &store);
    case 0:
    }
    return false;

注意到這裡 2, 3 共用的原因是，如果 a 和 b 指到同一個位置，那麼恢復 b 也同時恢復了 a 。

aliasing 的問題比我預期的還更難處理，我直接重改邏輯，變成如果有 aliasing 則配置一個新空間存放計算結果，算完再放到 out ，改寫過程中也修復了一些 bug 。

更改後的 ubignum_add() 程式碼



























































/* out = a + b
 * Aliasing arguments are acceptable.
 * If it return true, the result is put at @out.
 * If return false with alias input, the @out would be unchanged.
 * If return false without alias input, the @out would return neither answer nor
 * original out.
 */
bool ubignum_add(const ubn *a, const ubn *b, ubn **out)
{
    ubn *ans = *out;
    int alias = 0;
    if (a == *out)  // pointer aliasing
        alias ^= 1;
    if (b == *out)  // pointer aliasing
        alias ^= 2;
    if (alias) {  // if so, allocate space to store the result
        if (!ubignum_init(&ans))
            return false;
    }

    ubn_unit carry = 0;
    int i = 0;
    for (i = 0; i < min(a->size, b->size); i++) {
        if (i >= ans->size) {
            if (__builtin_expect(i >= ans->capacity, 0))
                if (!ubignum_resize(ans, ans->capacity * 2))
                    goto realoc_failed;
            ans->size++;
        }
        const ubn_unit_extend sum = a->data[i] + b->data[i] + carry;
        ans->data[i] = sum;
        carry = sum >> ubn_unit_bit;
    }

    const ubn *remain = (i == a->size) ? b : a;
    for (; carry || i < remain->size; i++) {
        if (i >= ans->size) {
            if (__builtin_expect(i >= ans->capacity, 0))
                if (!ubignum_resize(ans, ans->capacity * 2))
                    goto realoc_failed;
            ans->size++;
        }
        if (__builtin_expect(i < remain->size, 1)) {
            const ubn_unit_extend sum = remain->data[i] + carry;
            ans->data[i] = sum;
            carry = sum >> ubn_unit_bit;
        } else {  // the last carry out carries to a new ubn_unit
            ans->data[i] = carry;  // = 1
            carry = 0;
        }
    }
    *out = ans;  // no condition needed
    return true;

realoc_failed:
    if (alias)
        ubignum_free(ans);
    return false;
}

其中 ubn_unit_extend 在 x64 定義為 unsigned __int128 ，目前嘗試在 user space 配合 gdb 工具進行測試，得到錯誤的結果， carry 沒有被正確計算

carry = sum >> ubn_unit_bit;

疑似 uint128 的 shift operation 有問題，我嘗試去找 gcc doc ，但是內容只有三行，沒有提到是否支援 shift operation 。

我在 assignment 之前加入強制轉型就有我預期的結果了

        const ubn_unit_extend sum = (ubn_unit_extend) a->data[i] + b->data[i] + carry;

我好像已經吃了很多次這種虧，卻都沒有記住，我一直誤以為計算的時候就會轉成左邊的型態，這次特別寫進 commit message ，希望能確實記得。

ubignum_mult()

雖然前面有提及加速的算法，但是因為我目前的大數機制沒有減法，所以暫時先略過此項。

在開發時發現 ubn_unit 的加法一直重複遇到，所以將其模組化

/* no checking, sum and cout input should be guarantee */
static inline void ubn_unit_add(const ubn_unit a,
                                const ubn_unit b,
                                const int cin,
                                ubn_unit *sum,
                                int *cout)
{
    const ubn_unit_extend SUM = (ubn_unit_extend) a + b + cin;
    *sum = SUM;
    *cout = SUM >> ubn_unit_bit;
}

長得有點像硬體加法器的樣子。 cin 和 cout 都只會用到一個 bit ，用什麼型態似乎都蠻浪費的，最後選用了最常見的 int 。

mult() 程式碼






































































bool ubignum_mult(const ubn *a, const ubn *b, ubn **out)
{
    if (!a || !b || !out || !*out)
        return false;
    ubn *ans = *out;
    int alias = 0;
    if (a == *out)  // pointer aliasing
        alias ^= 1;
    if (b == *out)  // pointer aliasing
        alias ^= 2;
    if (alias) {  // if alias, allocate space to store the result
        if (!ubignum_init(&ans))
            return false;
    }
    /* keep mcand longer then mplier */
    const ubn *mcand = a->size > b->size ? a : b;
    const ubn *mplier = a->size > b->size ? b : a;
    ubn *prod;
    if (!ubignum_init(&prod))
        goto prod_aloc_failed;
    if (prod->capacity < mcand->size + 1)
        if (!ubignum_resize(&prod, mcand->size + 1))
            goto prod_resize_failed;

    /* The outer loop */
    for (int i = 0; i < mplier->size; i++) {
        /* The inner loop explanation:
         *       a b c  = mcand->data[2, 1, 0]
         *     *     d  = mplier->data[0]
         *     --------
         *         c c  = (high, low) denotes result of c * d
         *       b b
         *   + a a
         *   ----------
         *     x y ? c  sum of bb and cc, it may be carry out to x
         *   + a a
         *
         * let x be recorded in @carry and y be @overlap
         * no realloc needed for @prod
         */
        int carry = 0;
        ubn_unit overlap = 0;
        for (int j = 0; j < mcand->size; j++) {
            ubn_unit low, high;
            // need macro to select appropriate assembly
            __asm__("mulq %3"
                    : "=a"(low), "=d"(high)
                    : "%0"(mcand->data[j]), "rm"(mplier->data[i]));
            int cout = 0;
            ubn_unit_add(low, overlap, carry & 2, &prod->data[j], &cout);
            carry = (carry << 1) | cout;  // update carry where cout is 0 or 1
            overlap = high;
        }
        prod->data[mcand->size] =
            overlap + (ubn_unit) carry & 2;  // no carry out would be generated

        if (!ubignum_mult_add(prod, i, &ans))
            goto multadd_failed;
    }

    *out = ans;
    return true;
multadd_failed:
prod_resize_failed:
    ubignum_free(prod);
prod_aloc_failed:
    if (alias)
        ubignum_free(ans);
    return false;
}

目前的 bug 是 inline asm 沒有正確作用，看起來像是 low 跟 high 相反了，也可以看出確實是把 $rdx, $rax 分別接到 high 跟 low 。

Breakpoint 2, ubignum_mult (a=0x5555555592a0, b=0x555555559350, out=0x7fffffffdc88) at bignum.h:336
336                 ubn_unit_add(low, overlap, carry & 2, &prod->data[j], &cout);
(gdb) p mcand->data[j]
$1 = 18446744073709551615
(gdb) p i
$2 = 0
(gdb) p j
$3 = 0
(gdb) p mplier->data[i]
$4 = 18446744073709551615
(gdb) p low
$5 = 1
(gdb) p high
$6 = 18446744073709551614
(gdb) info registers rax
rax            0x1                 1
(gdb) info registers rdx
rdx            0xfffffffffffffffe  -2

難道說它執行 mulq instruction 時的資料放置方式是跟原本相反嗎？

我嘗試去查看組合語言的編譯結果，看到：

	mulq %rdx

雖然它把 "rm" 選成 $rdx ，但這應該也不會影響指令執行結果才對。

我發現其實它沒有算錯，這個結果是正確的，但因為我對乘法的直覺還不夠，誤用加法的邏輯來思考乘法才誤判結果。

在 pointer aliasing 的情況下無法做到 in place 的算法，否則錯誤處理會出問題。

大數運算第二版

在開發的過程中，因為大數運算難以 debug 只能經由 gdb 慢慢嘗試，因此修修改改花了很多時間，改了又錯又重改的情況多次發生，難以一一列舉，在此大略敘述過程遇到的狀況。

pointer aliasing issues

首先是前面提及 pointer aliasing 的問題，這個狀況是我在第一個版本遇到最大的問題，第一次是在開發加法時遇到。例如當想計算 a = a + b 時，若加出更大的位數則需要 reallocation ，但中途的 reallocation 失敗會導致傳入值 a 被破壞而正確的新 a 也沒有算出來，進退兩難的情況。

起初為了應對這個問題，我每次計算都先查看是否有 aliasing ，若有則進行複製， maintain 一份副本以利在 reallocation 失敗時能將數值復原。然而這個方法顯然導致很糟糕的效能，雖然之後為了節省交換的操作而直接計算在新空間，但仍然用了 3 個數。
使用者在使用界面時若指定 a = a + b ，那通常直覺也會認為自己使用了 in place 的算法，可是我在實作上卻無論如何都需要 3 個數的空間，違反使用直覺。

原程式碼 (add)

/* (*out) = a + b
 * Aliasing arguments are acceptable.
 * If it return true, the result is put at @out.
 * If return false with alias input, the @out would be unchanged.
 * If return false without alias input, the @out would return neither answer nor
 * original out.
 */
bool ubignum_add(const ubn *a, const ubn *b, ubn **out)
{
    if (!a || !b || !out || !*out)
        return false;
    ubn *ans = *out;
    int alias = 0;
    if (a == *out)  // pointer aliasing
        alias ^= 1;
    if (b == *out)  // pointer aliasing
        alias ^= 2;
    if (alias) {  // if alias, allocate space to store the result
        if (!ubignum_init(&ans))
            return false;
    }

    int i = 0, carry = 0;
    for (i = 0; i < min(a->size, b->size); i++) {
        if (i >= ans->size) {
            if (__builtin_expect(i >= ans->capacity, 0))
                if (!ubignum_resize(&ans, ans->capacity * 2))
                    goto realoc_failed;
            ans->size++;
        }
        ubn_unit_add(a->data[i], b->data[i], carry, &ans->data[i], &carry);
    }

    const ubn *remain = (i == a->size) ? b : a;
    for (; carry || i < remain->size; i++) {
        if (i >= ans->size) {
            if (__builtin_expect(i >= ans->capacity, 0))
                if (!ubignum_resize(&ans, ans->capacity * 2))
                    goto realoc_failed;
            ans->size++;
        }
        ubn_unit_add(remain->data[i], 0, carry, &ans->data[i], &carry);
    }
    if (alias)
        ubignum_free(*out);
    *out = ans;
    return true;

realoc_failed:
    if (alias)
        ubignum_free(ans);
    return false;
}

對於每個

n

測量 1000 次

fib (n)

計算後印出平均消耗時間。
其糟糕效能如下：

為此我後來發現若是能提早知道 reallocation 的大小，或至少知道該次計算的上界，那麼即可在開始計算之前就進行 reallocation 。如此一來即使配置失敗，那麼直接回傳 false 原先的值也不會被破壞。因此在計算 a = a + b 時便可以達成 in place 的目標，使用 2 個數的空間完成加法，也因為每次 ubignum_add() 都只會做一次 reallocation ，大幅降低了 memory allocation 的花費。此外也不再需要確認 aliasing ，移除判斷條件後也使程式碼簡潔許多。

改良 (add)

/* (*out) = a + b
 * Aliasing arguments are acceptable.
 * If false is returned, the values of (*out) remains unchanged.
 */
bool ubignum_add(const ubn *a, const ubn *b, ubn **out)
{
    if (!a || !b || !out || !*out)
        return false;
    while (__builtin_expect((*out)->capacity < max(a->size, b->size) + 1, 0))
        if (!ubignum_resize(out, (*out)->capacity * 2))
            return false;
    if (*out != a && *out != b)  // if no pointer aliasing
        ubignum_zero(*out);
    int i = 0, carry = 0;
    for (i = 0; i < min(a->size, b->size); i++)
        ubn_unit_add(a->data[i], b->data[i], carry, &(*out)->data[i], &carry);
    const ubn *const remain = (i == a->size) ? b : a;
    for (; i < remain->size; i++)
        ubn_unit_add(remain->data[i], 0, carry, &(*out)->data[i], &carry);
    (*out)->size = remain->size;
    if (carry) {
        (*out)->data[i] = 1;
        (*out)->size++;
    }
    return true;
}

效能：

相比原先的版本加速將近十倍之多。

列印

第二個困難出在列印，雖然我的 ubignum 已能進行數種運算，也正確地將數值儲存在檔案中，但它們就是一串由上百、上千個 0, 1 組成的資料，難以為人類所理解。

為了將很長的二進位轉成十進制，我左思右想許久，最後認為終究是離不開除法的命運，還是得乖乖寫一個除法。我想起前幾周的測驗題中，恰好有計算除法的內容，於是我就嘗試套用這個模式，意圖以位移和減法來達成除法運算。此外因為目前沒有其它除法的應用場景，所以可以只考量「除以 10 」的情況並進行特化。

為了實作出 ubignum_divby_ten() ，首先需要 ubignum_left_shift() 及 ubignum_sub() 。

ubignum_sub()

由於我的大數體系原先完全不存在任何負的概念，因此想引入減法也需要做一些權衡。

不過因為除法中使用的減法會保證結果為非負 (餘數) ，如此一來就不必引入負數，減法直接使用二補數加法進行，並把最高的進位拋棄。基於直接假設減法都是大減小，原本的 ubignum 體系可以維持不變。

以下程式碼都展示解決前述 pointer aliasing 問題後的版本。

sub() 程式碼

bool ubignum_sub(const ubn *a, const ubn *b, ubn **out)
{
    if (!a || !b || !out || !*out)
        return false;
    /* ones' complement of b */
    ubn *cmt;
    if (!ubignum_init(&cmt))
        goto cmt_failed;
    if (__builtin_expect(cmt->capacity < a->size, 0))
        if (!ubignum_resize(&cmt, a->size))
            goto cmt_realoc_failed;
    cmt->size = a->size;
    for (int i = 0; i < b->size; i++)
        cmt->data[i] = ~b->data[i];
    for (int i = b->size; i < a->size; i++)
        cmt->data[i] = CPU_64 ? UINT64_MAX : UINT32_MAX;

    int carry = 1;
    for (int i = 0; i < a->size; i++)  // compute result and store in cmt
        ubn_unit_add(a->data[i], cmt->data[i], carry, &cmt->data[i], &carry);
    // the last carry is discarded
    for (int i = a->size - 1; i >= 0; i--)
        if (!cmt->data[i])
            cmt->size--;
    ubignum_free(*out);
    *out = cmt;
    return true;
cmt_realoc_failed:
    ubignum_free(cmt);
cmt_failed:
    return false;
}

ubignum_left_shift()

left_shift() 程式碼














































/* left shift a->data by d bit */
bool ubignum_left_shift(const ubn *a, int d, ubn **out)
{
    if (!a || d < 0 || !out || !*out)
        return false;
    else if (d == 0) {
        if (a == *out)
            return true;
        if (__builtin_expect((*out)->capacity < a->size, 0))
            if (!ubignum_resize(out, a->size))
                return false;
        for (int i = 0; i < a->size; i++)
            (*out)->data[i] = a->data[i];
        (*out)->size = a->size;
        return true;
    }
    const int chunk_shift = d / ubn_unit_bit;
    const int shift = d % ubn_unit_bit;
    const int new_size = a->size + chunk_shift + 1;
    if (__builtin_expect((*out)->capacity < new_size, 0))
        if (!ubignum_resize(out, new_size))
            return false;

    int ai = new_size - chunk_shift - 1;  // note a->size may be changed due to
                                          // aliasing, should not use a->size
    int oi = new_size - 1;
    /* copy data from a to (*out) */
    if (shift) {
        (*out)->data[oi--] = a->data[ai - 1] >> (ubn_unit_bit - shift);
        for (ai--; ai > 0; ai--)
            (*out)->data[oi--] = (a->data[ai] << shift) |
                                 (a->data[ai - 1] >> (ubn_unit_bit - shift));
        (*out)->data[oi--] = a->data[ai] << shift;  // ai is now 0
    } else {
        while (ai >= 0)
            (*out)->data[oi--] = a->data[ai--];
    }
    /* end copy */
    while (oi >= 0)  // set remaining part of (*out) to 0
        (*out)->data[oi--] = 0;

    (*out)->size = new_size;
    if ((*out)->data[(*out)->size - 1] == 0)  // if MS chunk is 0
        (*out)->size--;
    return true;
}

大致上就是一些搬移的操作，在此解釋 29 至 33 行的操作，假設我們要將 a ：

00000000 11111111 01001111 00000000

左移 3 位。

注意到 chunk 編號是
$(3, 2, 1, 0)$ ，左邊為 most significant 。

每次我們需要目前 chunk 的最低 5 位放到高 (置左) 與其右方 chunk 的最高 3 位放到低 (置右) 進行合併，以編號

2

為例即是：

    11111...    from [2]
or  .....010    from [1]
   ----------
    11111010

取最低 5 位放到高：左移
$3$
取最高 3 位放到低：右移
$(8 - 3)$

最高跟最低的 chunk 都只有分別一邊所以需要特殊處理，最終得：

00000111 11111010 01111000 00000000

由於我們每次都不會去取目標左方的新 chunk ，因此 pointer aliasing 在此並不會造成問題。

另外，考量到當 shift 為 0 時，原本的程式碼會導致數值右移超過自己的 size ，產生 undefined behavior ，因此需要第 28 行的 if..else 來進行 branch 。

ubignum_divby_ten()

概念源於 quiz3 的算法。

在撰寫的過程中，再多寫了一個 ubignum_compare() 函式以利比較。

divby_ten() 程式碼

/* a / 10 = (*quo)...rmd */
bool ubignum_divby_ten(const ubn *a, ubn **quo, int *rmd)
{
    if (!a || !a->size || !quo || !*quo || !rmd)
        return false;
    ubn *ans, *dvd, *suber, *ten;
    if (!ubignum_init(&ans))
        return false;
    if (!ubignum_resize(&ans, a->size))
        goto cleanup_ans;
    if (!ubignum_init(&dvd))
        goto cleanup_ans;
    if (!ubignum_resize(&dvd, a->size))
        goto cleanup_dvd;
    if (!ubignum_init(&suber))
        goto cleanup_dvd;
    if (!ubignum_resize(&suber, dvd->capacity + 1))
        goto cleanup_suber;
    if (!ubignum_init(&ten))
        goto cleanup_suber;
    ubignum_uint(ten, 10);

    for (int i = 0; i < a->size; i++)
        dvd->data[i] = a->data[i];
    dvd->size = a->size;
    int shift = dvd->size * ubn_unit_bit - 4;  // can be improved by clz()
    ubignum_left_shift(ten, shift, &suber);
    while (ubignum_compare(dvd, ten) >= 0) {  // if dvd >= 10
        while (ubignum_compare(dvd, suber) < 0) {
            shift--;
            ubignum_left_shift(ten, shift, &suber);
        }
        ans->data[shift / ubn_unit_bit] |= (ubn_unit) 1
                                           << (shift % ubn_unit_bit);
        ubignum_sub(dvd, suber, &dvd);
    }
    ans->size = a->size;
    if (ans->data[a->size - 1] == 0)
        ans->size--;
    *rmd = (int) dvd->data[0];
    ubignum_free(ten);
    ubignum_free(suber);
    ubignum_free(dvd);
    ubignum_free(*quo);
    *quo = ans;
    return true;
cleanup_suber:
    ubignum_free(suber);
cleanup_dvd:
    ubignum_free(dvd);
cleanup_ans:
    ubignum_free(ans);
    return false;
}

之後同樣如 quiz3 引入 builtin_clz() 並重新包裝成 ubignum_clz() 進行改進，關鍵段落變化：

    for (int i = 0; i < a->size; i++)
        dvd->data[i] = a->data[i];
    dvd->size = a->size;
-   int shift = dvd->size * ubn_unit_bit - 4;
-   ubignum_left_shift(ten, shift, &suber);
    while (ubignum_compare(dvd, ten) >= 0) {  // if dvd >= 10
-       while (ubignum_compare(dvd, suber) < 0) {
-           shift--;
-           ubignum_left_shift(ten, shift, &suber);
-       }
+       int shift = dvd->size * ubn_unit_bit - ubignum_clz(dvd) - 4;
+       ubignum_left_shift(ten, shift, &suber);
+       if (ubignum_compare(dvd, suber) < 0)
+           ubignum_left_shift(ten, --shift, &suber);
        ans->data[shift / ubn_unit_bit] |= (ubn_unit) 1
                                           << (shift % ubn_unit_bit);
        ubignum_sub(dvd, suber, &dvd);
    }
    ans->size = a->size;  // FIX: ans = 0 should be considered
    if (ans->data[a->size - 1] == 0)
        ans->size--;
    *rmd = (int) dvd->data[0];

ubignum_2decimal()

在實作完成除以 10 之後，我們即可透過不斷除以 10 來將二進位數轉成字串印出。

2decimal() 程式碼

/* convert the ubn to ascii string */
char *ubignum_2decimal(const ubn *N)
{
    if (!N)
        return NULL;
    if (ubignum_iszero(N)) {
#if DEBUG
        char *ans = (char *) calloc(sizeof(char), 2);
#else
        char *ans = (char *) kcalloc(sizeof(char), 2, GFP_KERNEL);
#endif
        if (!ans)
            return NULL;
        ans[0] = '0';
        ans[1] = 0;
        return ans;
    }
    ubn *dvd = NULL;
    if (!ubignum_init(&dvd))
        return NULL;
    if (!ubignum_resize(&dvd, N->size))
        goto cleanup_dvd;
    for (int i = 0; i < N->size; i++)
        dvd->data[i] = N->data[i];
    dvd->size = N->size;
    /* Let n be the number.
     * digit = 1 + log_10(n) = 1 + \frac{log_2(n)}{log_2(10)}
     * log_2(10) \approx 3.3219 \approx 7/2,  we simply choose 3
     */
    unsigned digit = (ubn_unit_bit * N->size / 3) + 1;
#if DEBUG
    char *ans = (char *) calloc(sizeof(char), digit);
#else
    char *ans = (char *) kcalloc(sizeof(char), digit, GFP_KERNEL);
#endif
    if (!ans)
        goto cleanup_dvd;
    /* convert 2-base to 10-base */
    int index = 0;
    while (dvd->size) {
        int rmd;
        if (__builtin_expect(!ubignum_divby_ten(dvd, &dvd, &rmd), 0))
            goto cleanup_ans;
        ans[index++] = rmd | '0';
    }
    int len = index;
    /* reverse string */
    index = len - 1;
    for (int i = 0; i < index; i++, index--) {
        char tmp = ans[i];
        ans[i] = ans[index];
        ans[index] = tmp;
    }
    ubignum_free(dvd);
    return ans;
cleanup_ans:
#if DEBUG
    free(ans);
#else
    kfree(ans);
#endif
cleanup_dvd:
    ubignum_free(dvd);
    return NULL;
}

目前我認為字串反轉的方式還可以再改良，但因為這個部份並非 fib 運算的主軸，因此暫時就先如此。

fibdrv 效能分析與改進

原先想與範例程式碼比較，然而稍微更改後在自己電腦實驗的結果卻非常糟：

因此目前先以自己改進為方向，

由於速度的提昇，我們現在將範圍拉到

n = 1000

來測試。

對於目前 fib sequence 逐項往上加的算法

static inline void ubn_unit_add(const ubn_unit a,
                                const ubn_unit b,
                                const int cin,
                                ubn_unit *sum,
                                int *cout)
{
    const ubn_unit_extend SUM = (ubn_unit_extend) a + b + cin;
    *sum = SUM;
    *cout = SUM >> ubn_unit_bit;
}

我認為仍有可能改進的部份：

使用 inline asm 來做加法，如此 (在 64 位元情況) 可以避免用到 uint128_t

今天 (3/24) 上課的時候，聽到老師介紹 gcc 有 builtin 的 overflow 函式，藉由這點我們即可改寫原本的 ubn_unit_add()，不需要低到 asm 的層級，相對 asm 的優點在於 architecture independence。然而這裡是需要 3 個數相加，因此需要呼叫兩次 builtin_uadd_overflow()。

static inline void ubn_unit_add(const ubn_unit a,
                                const ubn_unit b,
                                const int cin,
                                ubn_unit *sum,
                                int *cout)
{
    *cout = __builtin_uaddll_overflow(a, cin, sum);
    *cout |= __builtin_uaddll_overflow(*sum, b, sum);
}

測試後可以發現在較大的 case 上，使用 builtin_uadd_overflow() 可以得到略佳的效果。再測到更大一些得：

當我嘗試將

n

提高到 3000 時，卻觀察到一個奇怪的現象，就是在 1000 之後，計算時間就不再有提昇的趨勢，如圖：

我覺得這非常不合理，因為以我的算法對於越大的數肯定有越高的計算量。目前想到的可能出在，因數值太大而產生 overflow ，但 20000

\times

1000 也仍不到 32-bit int 的上界，應不可能超出 long long ，且可以從圖中見得有些數字遠遠超過水平線卻仍被印了出來。或許需要更理解 ktime_t 相關的知識。

On 64-bit systems, a ktime_t is really just a 64-bit integer value in nanoseconds.
–- The high-resolution timer API

以這個資料來說，顯然更不是時間變數資料型態的問題了。

經過數天之後我終於發現問題，是 fibdrv.c 中的 MAX_LENGTH 。之前改過一次之後我就忘記有這個參數，目前直接設定成 1000000 以利後續實驗。

更正後試跑已可正常進行統計：

嘗試引入 fast doubling 算法加速

效能改善的情況高於我的預期，我原先以為 fast doubling 要到很大的 case 才會有顯著的改善。但從實驗結果可以發現，在

n = 400

左右兩種算法即出現執行時間的交叉。

試著仿造使用 perf stat 測量我寫的 fast doubling 算法到

n = 3000

：

$ sudo perf stat -e cycles,instructions,cache-misses,cache-references,branch-instructions,branch-misses ./exp 1 > /dev/null

 Performance counter stats for './exp 1':

   118,264,893,816      cycles                                                      
   264,687,477,706      instructions              #    2.24  insn per cycle         
        15,142,354      cache-misses              #   15.801 % of all cache refs    
        95,829,386      cache-references                                            
    38,631,044,731      branch-instructions                                         
        69,373,051      branch-misses             #    0.18% of all branches        

      35.744624633 seconds time elapsed

       0.931959000 seconds user
      34.806501000 seconds sys

可以看到我的實作程式碼有高達 15% 的 cache miss 。

我想利用 cachegrind 來測試是哪個函式發生最多 cache miss ，結果如下：

$ sudo valgrind --tool=cachegrind ./exp 1 > /dev/null
==37329== Cachegrind, a cache and branch-prediction profiler
==37329== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==37329== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==37329== Command: ./exp 1
==37329== 
--37329-- warning: L3 cache found, using its data for the LL simulation.
==37329== 
==37329== I   refs:      69,563,678
==37329== I1  misses:         1,159
==37329== LLi misses:         1,128
==37329== I1  miss rate:       0.00%
==37329== LLi miss rate:       0.00%
==37329== 
==37329== D   refs:      28,270,312  (24,764,894 rd   + 3,505,418 wr)
==37329== D1  misses:         3,323  (     2,585 rd   +       738 wr)
==37329== LLd misses:         2,733  (     2,070 rd   +       663 wr)
==37329== D1  miss rate:        0.0% (       0.0%     +       0.0%  )
==37329== LLd miss rate:        0.0% (       0.0%     +       0.0%  )
==37329== 
==37329== LL refs:            4,482  (     3,744 rd   +       738 wr)
==37329== LL misses:          3,861  (     3,198 rd   +       663 wr)
==37329== LL miss rate:         0.0% (       0.0%     +       0.0%  )

$ sudo cg_annotate cachegrind.out.37329 
--------------------------------------------------------------------------------
I1 cache:         32768 B, 64 B, 8-way associative
D1 cache:         32768 B, 64 B, 8-way associative
LL cache:         6291456 B, 64 B, 12-way associative
Command:          ./exp 1
Data file:        cachegrind.out.37329
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:     
User annotated:   
Auto-annotation:  off

--------------------------------------------------------------------------------
Ir         I1mr  ILmr  Dr         D1mr  DLmr  Dw        D1mw DLmw 
--------------------------------------------------------------------------------
69,563,678 1,159 1,128 24,764,894 2,585 2,070 3,505,418  738  663  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir         I1mr ILmr Dr         D1mr  DLmr Dw        D1mw DLmw  file:function
--------------------------------------------------------------------------------
33,078,041    4    4 15,024,011     1    0 3,012,013    0    0  ???:main
27,000,072    1    1  6,000,016     0    0         0    0    0  /build/glibc-sMfBJT/glibc-2.31/io/../sysdeps/unix/sysv/linux/write.c:write
 6,054,221   27   24  3,027,082    11    0        15    1    1  ???:???
 1,493,991   48   47    398,997    19    6   261,001    6    2  /build/glibc-sMfBJT/glibc-2.31/stdio-common/vfprintf-internal.c:__vfprintf_internal
   648,170    4    4    141,032     4    1   114,012    1    0  /build/glibc-sMfBJT/glibc-2.31/libio/fileops.c:_IO_file_xsputn@@GLIBC_2.2.5
   418,262    3    3     30,733     2    1    24,733    0    0  /build/glibc-sMfBJT/glibc-2.31/stdio-common/_itoa.c:_itoa_word
   228,000    5    5     21,000     3    2         0    0    0  /build/glibc-sMfBJT/glibc-2.31/string/../sysdeps/x86_64/multiarch/strchr-avx2.S:__strchrnul_avx2
   163,977    2    2     29,983     0    0    17,985   63   63  /build/glibc-sMfBJT/glibc-2.31/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms
    90,000    4    4     18,000     3    0    33,000    3    2  /build/glibc-sMfBJT/glibc-2.31/stdio-common/printf.c:printf
    78,000    1    1     24,000     0    0     6,000    0    0  /build/glibc-sMfBJT/glibc-2.31/stdio-common/../libio/libioP.h:__vfprintf_internal
    71,545   10   10     14,316 1,080  893        15    2    0  /build/glibc-sMfBJT/glibc-2.31/elf/dl-addr.c:_dl_addr

cachegrind 似乎無法配合 taskset 使用

改用

sudo taskset -c 1 valgrind --tool=cachegrind ./exp 1 > /dev/null

即可正常使用 cachegrind ，推測為 Valgrind 實作方式的影響

測試結果竟顯示 data cache miss 為 0% ，與從 perf stat 收到的資訊天差地遠。

可以觀察到 cache reference 在 perf 顯示 95,829,386 ；而在 cachegrind 顯示為 28,270,312 。

進行更多次 perf 測試，發現每次的 cache references 竟有巨大落差：

 Performance counter stats for './exp 1':

   112,781,466,006      cycles                                                      
   261,497,199,771      instructions              #    2.32  insn per cycle         
         6,962,452      cache-misses              #   11.023 % of all cache refs    
        63,160,276      cache-references                                            
    38,179,989,631      branch-instructions                                         
        67,070,255      branch-misses             #    0.18% of all branches        

      33.386655206 seconds time elapsed

       0.975406000 seconds user
      32.384290000 seconds sys


 Performance counter stats for './exp 1':

   108,714,836,439      cycles                                                      
   264,555,729,876      instructions              #    2.43  insn per cycle         
         1,520,786      cache-misses              #   14.088 % of all cache refs    
        10,795,259      cache-references                                            
    38,609,436,202      branch-instructions                                         
        63,560,865      branch-misses             #    0.16% of all branches        

      32.059672731 seconds time elapsed

       0.847985000 seconds user
      31.207482000 seconds sys

我不曉得為什麼會出現這樣的情況。

後續發現是測試次數過少而導致的數值不穩定，後面會補充結果。

回去觀摩參考共筆，發現因為我先前閱讀的重點都在運算加速，沒注意到資料搬移優化的部份，嘗試將其納入我的實作：

    for (int currbit = 1 << (32 - __builtin_clzll(k) - 1 - 1); currbit;
         currbit = currbit >> 1) {
        /* compute 2n-1 */
        ubignum_mult(fast[1], fast[1], &fast[0]);
        ubignum_mult(fast[2], fast[2], &fast[3]);
        ubignum_add(fast[0], fast[3], &fast[3]);
        /* compute 2n */
        ubignum_left_shift(fast[1], 1, &fast[4]);
        ubignum_add(fast[4], fast[2], &fast[4]);
        ubignum_mult(fast[4], fast[2], &fast[4]);
        n *= 2;
        if (k & currbit) {
            ubignum_add(fast[3], fast[4], &fast[0]);
            n++;
-           ubignum_copy(fast[2], fast[0]);
-           ubignum_copy(fast[1], fast[4]);
+           ubignum_swapptr(&fast[2], &fast[0]);
+           ubignum_swapptr(&fast[1], &fast[4]);
        } else {
-           ubignum_copy(fast[2], fast[4]);
-           ubignum_copy(fast[1], fast[3]);
+           ubignum_swapptr(&fast[2], &fast[4]);
+           ubignum_swapptr(&fast[1], &fast[3]);
        }
    }

效能差異：

我測出來的結果和參考共筆的結論不符合，並沒有出現效能大幅改善的情形。

perf 測起來仍然非常不穩：

perf 測試

$ sudo perf stat -e cycles,instructions,cache-misses,cache-references,branch-instructions,branch-misses ./exp 1 > /dev/null

 Performance counter stats for './exp 1':

    99,537,523,424      cycles                                                      
   241,725,789,043      instructions              #    2.43  insn per cycle         
         1,579,228      cache-misses              #   53.451 % of all cache refs    
         2,954,538      cache-references                                            
    34,729,788,080      branch-instructions                                         
        72,001,270      branch-misses             #    0.21% of all branches        

      29.372597211 seconds time elapsed

       0.963940000 seconds user
      28.402259000 seconds sys

Performance counter stats for './exp 1':

   110,100,048,298      cycles                                                      
   251,991,005,629      instructions              #    2.29  insn per cycle         
        11,886,534      cache-misses              #   14.939 % of all cache refs    
        79,566,707      cache-references                                            
    36,196,420,296      branch-instructions                                         
        71,695,967      branch-misses             #    0.20% of all branches        

      32.880962908 seconds time elapsed

       0.939414000 seconds user
      31.908111000 seconds sys

嘗試在 perf 命令設置參數 -r 10 來降低測量數值的波動，為避免時間過長，改為測試 fib 到

n = 1000

。
經過這個設定後，每次測試間得到的數值就不再有顯著差異，至少每個 events 的前兩個數字會保持不變。

$ bash measure
...

 Performance counter stats for 'taskset -c 1 ./exp 1' (10 runs):

    23,018,839,194      cycles                                                        ( +-  0.27% )
    49,622,366,443      instructions              #    2.16  insn per cycle           ( +-  0.00% )
         1,634,045      cache-misses              #   20.062 % of all cache refs      ( +-  6.49% )
         8,145,159      cache-references                                              ( +-  5.01% )
     8,319,541,416      branch-instructions                                           ( +-  0.00% )
        13,417,461      branch-misses             #    0.16% of all branches          ( +-  0.18% )

           14.4259 +- 0.0388 seconds time elapsed  ( +-  0.27% )

確立測試方法之後再回去重新測量使用 ubignum_copy() 的版本，得：

$ bash measure
...

 Performance counter stats for 'taskset -c 1 ./exp 1' (10 runs):

    25,215,127,347      cycles                                                        ( +-  2.31% )
    51,971,604,860      instructions              #    2.06  insn per cycle           ( +-  0.00% )
         2,423,932      cache-misses              #   13.441 % of all cache refs      ( +- 18.29% )
        18,033,276      cache-references                                              ( +- 24.11% )
     8,811,912,111      branch-instructions                                           ( +-  0.00% )
        13,058,167      branch-misses             #    0.15% of all branches          ( +-  1.10% )

            15.804 +- 0.367 seconds time elapsed  ( +-  2.32% )

對照之下，從舊版到新版雖然 cache-miss-rate 由 13% 增長到 20% ，但我們可以注意到 cache-references 總數大約降低了一半，我推測這個差距來自於 copy 的過程對資料的 access 。
因此就 data cache 而言，因為省下這些操作，新版仍是更佳的實作方式。

20% 的 cache miss 非常驚人，因此目前最直接的改進目標就是要想辦法降低 miss rate 。

引入 ubignum_square()

 Performance counter stats for 'taskset -c 1 ./exp 1' (10 runs):

    25,405,898,533      cycles                                                        ( +-  1.21% )
    53,288,279,701      instructions              #    2.10  insn per cycle           ( +-  0.00% )
         1,370,567      cache-misses              #   20.309 % of all cache refs      ( +- 34.02% )
         6,748,682      cache-references                                              ( +- 32.89% )
     9,033,980,854      branch-instructions                                           ( +-  0.00% )
        12,907,672      branch-misses             #    0.14% of all branches          ( +-  0.62% )

            15.921 +- 0.193 seconds time elapsed  ( +-  1.21% )

差異實在也不是很明顯，甚至是落後於全數使用 ubignum_mult() 的版本。

雖然我想要嘗試測到大一點的 case ，

執行測試有時會出現頻繁的失敗，不確定原因是什麼，但與引入 ubignum_square() 應該沒有關聯，因為即使改回 mult 的版本也沒有恢復。其錯誤情景如下：

$ sudo perf stat -r 10 -e cycles,instructions,cache-misses,cache-references,branch-instructions,branch-misses taskset -c 0 ./exp 1 > /dev/null
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束

我想看看是否是 fib_fast() 的計算過程中的問題，例如 reallocation 失敗等等，因此嘗試引入 flag 變數來紀錄：

fib_fast()

static ubn *fib_fast(long long k)
{
    ubn *fast[5] = {NULL};
    bool flag = true;
    if (k == 0) {
        flag &= ubignum_init(&fast[0]);
        ubignum_zero(fast[0]);
        return fast[0];
    } else if (k == 1) {
        flag &= ubignum_init(&fast[0]);
        ubignum_uint(fast[0], 1);
        return fast[0];
    }

    for (int i = 0; i < 5; i++)
        flag &= ubignum_init(&fast[i]);
    ubignum_zero(fast[1]);
    ubignum_uint(fast[2], 1);
    int n = 1;
    for (int currbit = 1 << (32 - __builtin_clzll(k) - 1 - 1); currbit;
         currbit = currbit >> 1) {
        /* compute 2n-1 */
        flag &= ubignum_square(fast[1], &fast[0]);
        flag &= ubignum_square(fast[2], &fast[3]);
        //flag &= ubignum_mult(fast[1], fast[1], &fast[0]);
        //flag &= ubignum_mult(fast[2], fast[2], &fast[3]);
        flag &= ubignum_add(fast[0], fast[3], &fast[3]);
        /* compute 2n */
        flag &= ubignum_left_shift(fast[1], 1, &fast[4]);
        flag &= ubignum_add(fast[4], fast[2], &fast[4]);
        flag &= ubignum_mult(fast[4], fast[2], &fast[4]);
        n *= 2;
        if (k & currbit) {
            flag &= ubignum_add(fast[3], fast[4], &fast[0]);
            n++;
            ubignum_swapptr(&fast[2], &fast[0]);
            ubignum_swapptr(&fast[1], &fast[4]);
        } else {
            ubignum_swapptr(&fast[2], &fast[4]);
            ubignum_swapptr(&fast[1], &fast[3]);
        }
    }
    ubignum_free(fast[0]);
    ubignum_free(fast[1]);
    ubignum_free(fast[3]);
    ubignum_free(fast[4]);
    if(!flag)
        printk(KERN_INFO "flag is %d\n", flag);
    return fast[2];
}

執行後於 dmesg 中並未看到任何 flag 相關的報錯，不過出現了一段紅底黑字：

[  990.777318] BUG: unable to handle page fault for address: ffff96d435beb220
[  990.777322] #PF: supervisor read access in kernel mode
[  990.777324] #PF: error_code(0x0000) - not-present page

June Rework

隔一段時間沒有著手於這份專案，回來先進行一些實驗重做等等，並補上一些之前忽略的內容。

由於第一次做的時候認為需要專注於提昇運算 Fibonacci number 的速度，而只測量計算消耗的時間。然而這類 character device 在使用上總是要 read，因此應該也進行實驗量測 user space 到 kernel space 的時間、以及資料複製的時間等等納入成本計算，進行效能的全盤考量，完整地進行實驗才能正確認知到效能瓶頸何在。

user-kernel transition time

時間量測

首先實驗量測由 user mode 進出 kernel mode 的耗時

code

    for (int i = 0; i <= offset; i++) {
        lseek(fd, i, SEEK_SET);
        unsigned long long ucons = 0, kcons = 0;
        for (int j = 0; j < TEST_NUM; j++) {
            struct timespec t1, t2;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            kcons += write(fd, NULL, select);
            clock_gettime(CLOCK_MONOTONIC, &t2);
            ucons += (unsigned long long) (t2.tv_sec * 1e9 + t2.tv_nsec) -
                     (t1.tv_sec * 1e9 + t1.tv_nsec);
        }
        printf("%d,%llu,%llu,%llu\n", i, kcons / TEST_NUM, ucons / TEST_NUM, (ucons - kcons)/TEST_NUM);
    }

/* write operation is skipped */
/* try to measure time within this function */
static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    ubn *N = NULL;  // do not initialize
    ktime_t kt = ktime_get();
    switch (size) {
    case 0:
        N = fib_sequence(*offset);
        break;
    case 1:
        N = fib_fast(*offset);
        break;
    default:
        return 0;
    }
    ubignum_free(N);
    kt = ktime_sub(ktime_get(), kt);
    return (ssize_t) ktime_to_ns(kt);
}

以 iteration 算法測試，並使用較小的 offset 來突顯該數值，兩個時間相減大致可得 (user

\to

kernel)

+

(kernel

\to

user) 的時間。

即使用上了作業說明和參考共筆提及的所有方式，仍會產生這種巨大的 peak。從 csv 檔觀察，發現其位於

n = 13, 43, 44, 63, 64

fib(13) = 233
fib(43) = 433494437
fib(44) = 701408733
fib(63) = 6557470319842
fib(64) = 10610209857723

第一猜想是使用 ubignum_resize() 導致的消耗，但結果這幾個值看起來也並非是 UINT32_MAX 或 UINT64_MAX，從此 value 上看不出有特別意義。接著又測試幾次，發現每次 peak 都產生在不同地方。

試著排除 peak 點再次作圖，結果發現在其它情況我的量測也不太穩定。

換成另一種叫簡易的方式來測量：將 fib_write() 函式清空並直接把 user space 下的來回時間當成是 transition time。

注意到這時 x 軸

n

不再有意義

無論怎麼做都一直量到 peak，但可以看出這個來回時間幾乎都非常貼近 1000 ns。

為了考察這個問題，我試著把每次測試中過大的的耗時數值印出來，結果發現偶爾會出現 case 有十倍左右的時間消耗。於是我了解到排除統計上的離群值有其必要性，果然是不該偷吃步。

排除統計離群值

在統計的過程中，對於某些不特定的

n

會出現標準差大於平均值的奇怪現象，於是我同樣把過程中的資料全部存下來，再去奇怪的

n

看看出了什麼問題，借助 sprintf() 的小技巧來把資料全部存起來：

        char name[128];
        sprintf(name, "data/%d.dat", i);
        FILE *f = fopen(name, "w");

我這裡跑完時間測量後，看到

n = 67

時

μ = 1073, σ = 1322

，於是查詢 data/67.dat，原來其中一次的耗時測量到 42656。然而經過離群值的排除，調整後的平均值正常地落在了 1085。

outlier elimination

/* Eliminate 5% outliers and compute the average.
 */
uint64_t average(int64_t *nums, int numsSize)
{
    int64_t mu = 0;
    for (int i = 0; i < numsSize; i++)
        mu += nums[i];
    mu /= numsSize;
    int64_t deviation = 0;
    for (int i = 0; i < numsSize; i++)
        deviation += (nums[i] - mu) * (nums[i] - mu) / numsSize;
    deviation = sqrt(deviation);
    for (int i = 0; i < numsSize; i++)
        if (nums[i] >= mu + 2 * deviation || nums[i] <= mu - 2 * deviation)
            nums[i] = -1;
    // printf("%ld, %ld\t", mu, deviation);
    uint64_t ans = 0;
    for (int i = 0; i < numsSize; i++)
        if (nums[i] != -1)
            ans += nums[i];
    return ans / (numsSize / 20 * 19);
}

其中 sqrt() 為我自己實作的整數開根號函式。

至此我們成功排除掉了 peak 的問題，於是耗時的細微變化便可在圖表上呈現出來。

圖表的刻度變小之後，可以看出測量上有 100 ～ 300 ns 的起伏，仍無法達到如 KYG 同學的平滑結果。這裡由於任何

n

的輸入都是直接 return 沒有計算，因此最理想的狀況應該要是一條水平線。

打算做一個 commit 時收到警告訊息：

Dangerous function detected in /home/tinyynoob/code/fibdrv/exp.c
54:        sprintf(name, "data/%d.dat", i);

於是改為使用 snprint() 並順利完成 commit。

於 desktop 量測

由於怎麼測總是拿到漂浮的數據，我想 laptop 或許天生就較為不穩定，因此我想從 laptop 改成用 desktop 跑跑看是否會出現相同的結果。

cpu info

架構：                           x86_64
CPU 作業模式：                   32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
每核心執行緒數：                 1
每通訊端核心數：                 4
Socket(s):                       1
NUMA 節點：                      1
供應商識別號：                   GenuineIntel
CPU 家族：                       6
型號：                           94
Model name:                      Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
製程：                           3
CPU MHz：                        2511.819
CPU max MHz:                     3600.0000
CPU min MHz:                     800.0000
BogoMIPS:                        6399.96
虛擬：                           VT-x
L1d 快取：                       128 KiB
L1i 快取：                       128 KiB
L2 快取：                        1 MiB
L3 快取：                        6 MiB
NUMA node0 CPU(s)：              0-3
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cach
                                 e flushes, SMT disabled
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled v
                                 ia prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user
                                  pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_
                                 FW, STIBP disabled, RSB filling
Vulnerability Srbds:             Mitigation; Microcode
Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtr
                                 r pge mca cmov pat pse36 clflush dts acpi mmx f
                                 xsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rd
                                 tscp lm constant_tsc art arch_perfmon pebs bts 
                                 rep_good nopl xtopology nonstop_tsc cpuid aperf
                                 mperf pni pclmulqdq dtes64 monitor ds_cpl vmx s
                                 mx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid s
                                 se4_1 sse4_2 x2apic movbe popcnt tsc_deadline_t
                                 imer aes xsave avx f16c rdrand lahf_lm abm 3dno
                                 wprefetch cpuid_fault invpcid_single pti ssbd i
                                 brs ibpb stibp tpr_shadow vnmi flexpriority ept
                                  vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 
                                 smep bmi2 erms invpcid rtm mpx rdseed adx smap 
                                 clflushopt intel_pt xsaveopt xsavec xgetbv1 xsa
                                 ves dtherm ida arat pln pts hwp hwp_notify hwp_
                                 act_window hwp_epp md_clear flush_l1d

然而同樣的程式碼和 scripts 移過來執行竟然出現 segmentation fault。

我使用 gdb 工具查詢，發現出錯在 fprintf() 處，我的 f 意圖打開目錄 data 下的檔案，然而不存在這個 directory 因此產生記憶體區段錯誤，mkdir data 後解決。

跑出來的圖表：

雖然看似仍有很大的浮動，但仔細觀察可以注意到 y 軸的刻度由 50 降至 10，確實穩定了不少。從比較大的 scale 看是像這樣子：

雖然還沒辦法變成完全的水平線，但已從最初上千 ns 的 peak 改善許多。

版本 commit id：b80d536b08bd1a3c3325b53d684f97e39ac734f9

接著再把之前從 fib_write() 清空的內容補回進行測試：

可以看出無論

n

的變化，user space 到 kernel space 的來回時間穩定落在 600 ns 左右。

即使一路算到

n = 500

也是如此：

版本 id：766a4b80f6fbcf8a0f5ee0931113b16e6ea629ec

運算時間

繼續使用 desktop 來重新量測兩種 Fibonacci number 算法在 kernel space 的耗時：

像之前一樣看到我實作的 fast fib 算法會出現這種階梯形的跳躍現象，這次我想要好好研究這些階梯的成因。

先從小一點的 case 開始探討：

從 fast.csv 查看發現跳點出現在

n = 2, 4, 8, 16, 32, 64

，每次往上跳 500 或 1000 ns。每個跳躍竟然都恰好落在 2 的冪，這肯定有很好的理由可以解釋。

仔細想想之後也很直觀，因為 fast fib 算法每次迴圈都會將

n

加倍，因此需要

\log (n)

次 iteration 以算完要求結果。而這麼解釋就代表每個跳躍便是 fast doubling 算法 each iteration 的耗時，從 fast_fib() 程式碼也可以看出有個迴圈：

    int n = 1;
    for (int currbit = 1 << (32 - __builtin_clzll(k) - 1 - 1); currbit;
         currbit = currbit >> 1) {
        /* compute 2n-1 */
        flag &= ubignum_square(fast[1], &fast[0]);
        flag &= ubignum_square(fast[2], &fast[3]);
        flag &= ubignum_add(fast[0], fast[3], &fast[3]);
        /* compute 2n */
        flag &= ubignum_left_shift(fast[1], 1, &fast[4]);
        flag &= ubignum_add(fast[4], fast[2], &fast[4]);
        flag &= ubignum_mult(fast[4], fast[2], &fast[4]);
        n *= 2;
        if (k & currbit) {
            flag &= ubignum_add(fast[3], fast[4], &fast[0]);
            n++;
            ubignum_swapptr(&fast[2], &fast[0]);
            ubignum_swapptr(&fast[1], &fast[4]);
        } else {
            ubignum_swapptr(&fast[2], &fast[4]);
            ubignum_swapptr(&fast[1], &fast[3]);
        }
    }

在這裡可以簡單比較兩種算法的特性如下：

	sequencial	fast doubling
迭代次數	$n$	$⌊ \log (n) ⌋$
每輪迭代的複雜度	低，約 16 ns	高，約 834 ns

該數字使用
$n = 0$ 到
$n = 500$ 的測量結果算得

比較

KYG 同學

為了確認自己實作的水準，clone KYG 同學的 fibdrv 進行對比，使用 optimize-bn 分支在我的電腦上執行 make plot，結果超乎我的預期：

耗時僅僅我的

1 / 10

，顯然我的程式碼有不只一點改進空間。此外，為何使用相同的設定，它的程式就能有非常平滑的結果，而我卻會一直抖動，這也需要再好好研究。

但仔細去看程式碼發現其測量有誤，我 clone 到的版本並非 bignum 的實作。接著繼續閱讀之後，發現同學並未把自己的最新版本推上 github，因此我無法在自己的電腦上與其進行耗時比較。

需要找找看哪個同學的程式碼適合拿來做效能對比

研讀 0xff07 同學的實作

研讀同學的實作，看看是否有值得借鏡的地方。

apm_internal.h

此 apm 取的是 Arbitrary-Precision arithmetic 的意思

首先我看到同學將 inline assembly 的部份包裝成 macro：

#if defined(__amd64__) || defined(__x86_64__)
#define digit_mul(u, v, hi, lo) \
    __asm__("mulq %3" : "=a"(lo), "=d"(hi) : "%0"(u), "rm"(v))
#define digit_div(n1, n0, d, q, r) \
    __asm__("divq %4" : "=a"(q), "=d"(r) : "0"(n0), "1"(n1), "rm"(d))
#endif

我認為這是非常好的想法，由於 inline assembly 的語法邏輯跟 C 的邏輯有些不同，因此包裝起來可以使人閱讀時統一依 C 的邏輯理解，較為易讀，同時也方便重複使用。

同時他在無法使用 x86 assembly 的情況下仍提供了對應實作：

#ifndef digit_mul
#define digit_mul(u, v, hi, lo)                    \
    do {                                           \
        apm_digit __u0 = (u);                      \
        apm_digit __u1 = __u0 >> APM_DIGIT_HSHIFT; \
        __u0 &= APM_DIGIT_LMASK;                   \
        apm_digit __v0 = (v);                      \
        apm_digit __v1 = __v0 >> APM_DIGIT_HSHIFT; \
        __v0 &= APM_DIGIT_LMASK;                   \
        apm_digit __lo = __u0 * __v0;              \
        apm_digit __hi = __u1 * __v1;              \
        __u1 *= __v0;                              \
        __v0 = __u1 << APM_DIGIT_HSHIFT;           \
        __u1 >>= APM_DIGIT_HSHIFT;                 \
        __hi += __u1 + ((__lo += __v0) < __v0);    \
        __v1 *= __u0;                              \
        __u0 = __v1 << APM_DIGIT_HSHIFT;           \
        __v1 >>= APM_DIGIT_HSHIFT;                 \
        __hi += __v1 + ((__lo += __u0) < __u0);    \
        (lo) = __lo;                               \
        (hi) = __hi;                               \
    } while (0)
#endif

apm.c apm.h

這部份提供許多大數的 operations 包括：

increment/decrement
add
sub
dmul，計算 partial product 的函式
shift
cmp

預設是 u 和 v 運算後將結果放到 w，其中沒看到計算 apm 乘法的函式，僅有計算 partial product 的 dmul()。

apm 乘法被獨立寫在 mul.c

帶 i postfix 的 operations 似乎是代表 in-place，都是 u, v 運算後放到 u
帶 n postfix 的 operations 不確定語意上是代表哪個英文單字，不過其共通點為 u, v 兩數的初始 size 相同

mul.c

在其乘法實作當中混用了兩種乘法

base，最原始的直式乘法方式
Karatsuba multiplication

短碼時使用直式乘法，長碼時使用 Karatsuba，使兩者各展所長。

Karatsuba multiplication 解說

假設我們要作

x \times y

，那麼藉由分解較大的

x

及

y

，可以達到：

\begin{aligned} x \times y & = (x_{1} \times 2^{b} + x_{0}) \times (y_{1} \times 2^{b} + y_{0}) \\ = (x_{1} y_{1}) 2^{2 b} + (x_{1} y_{0} + x_{0} y_{1}) 2^{b} + (x_{0} y_{0}) \end{aligned}

因

(x_{1} y_{0} + x_{0} y_{1}) = (x_{1} + x_{0}) \times (y_{1} + y_{0}) - x_{1} y_{1} - x_{0} y_{0}

，可將乘法次數由 4 次降到 3 次。

此外每個

x_{i} y_{i}

又呈現

x \times y

的形式，因此可以再次遞迴下去，總結而言這是個 divide and conquer 的算法。

bn.h

此部份將前面大數相關的 type 使用 struct 包起來：

typedef struct {
    apm_digit *digits; /* Digits of number. */
    apm_size size;     /* Length of number. */
    apm_size alloc;    /* Size of allocation. */
    unsigned sign : 1; /* Sign bit. */
} bn, bn_t[1];

並使用 apm 的介面提供一系列的 arithmetic operations，再次包裝成 bn 介面。