Try   HackMD

2022q1 Homework3 (fibdrv)

contributed by < tinyynoob >

作業要求

最近的方向可以跳到 June rework 開始看

因為一篇筆記放不下了,再開一篇:fibdrv 2

研讀資料

Fibonacci 相關性質

https://hackmd.io/@sysprog/fibonacci-number

Algorithms for Computing Fibonacci Numbers Quickly

Fast doubling 算法

F2n+1=Fn+12+Fn2F2n=Fn(2Fn+1Fn)

作業說明中有給出 pseudocode :

Fast_Fib(n)
    a = 0; b = 1;       // m = 0
    for i = (number of binary digit in n) to 1
        t1 = a*(2*b - a);
        t2 = b^2 + a^2;
        a = t1; b = t2; // m *= 2
        if (current binary digit == 1)
            t1 = a + b; // m++
            a = b; b = t1;
    return a;

注意到這個 ^ 是指 power 而非 xor 。

目前看不出 clz() 能對計算 Fib 做到什麼強力優化,似乎只能加快 initialize i 而已,需要再研究研究。

Follow 《The Linux Kernel Module Programming Guide》

將自己研讀的摘要整理到另一篇筆記: https://hackmd.io/@tinyynoob/Hk89qPw-9

本地版本:

$ apt-cache search linux-headers-`uname -r`
linux-headers-5.13.0-30-generic - Linux kernel headers for version 5.13.0 on 64 bit x86 SMP

根據書中範例嘗試建立 hello-1.c 及對應的 makefile

make 時馬上出現問題:

log
$ make -p | grep PWD
scripts/Makefile.build:44: /home/tinyynoob/code/kernel/hello-1/Makefile: 沒有此一檔案或目錄
make[2]: *** 沒有規則可製作目標「/home/tinyynoob/code/kernel/hello-1/Makefile」。 停止。
OLDPWD = /home/tinyynoob/code/kernel
PWD = /home/tinyynoob/code/kernel/hello-1
CONFIG_HPWDT_NMI_DECODING = y
make[1]: *** [Makefile:1879:/home/tinyynoob/code/kernel/hello-1] 錯誤 2
OLDPWD = /home/tinyynoob/code/kernel
PWD = /home/tinyynoob/code/kernel/hello-1
CONFIG_HPWDT_NMI_DECODING = y
        @echo  '  Syntax: make -C path/to/kernel/src M=$$PWD target'
make: *** [makefile:7:all] 錯誤 2
PWD := /home/tinyynoob/code/kernel/hello-1
OLDPWD = /home/tinyynoob/code/kernel
        echo $((PWD)
        make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules 
        make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean


發現原來檔名把 Makefile 寫成 makefile 也不行,更改後得到:

log
$ make
make -C /lib/modules/5.13.0-30-generic/build M=/home/tinyynoob/code/kernel/hello-1       modules 
make[1]: 進入目錄「/usr/src/linux-headers-5.13.0-30-generic」
  CC [M]  /home/tinyynoob/code/kernel/hello-1/hello-1.o
  MODPOST /home/tinyynoob/code/kernel/hello-1/Module.symvers
  CC [M]  /home/tinyynoob/code/kernel/hello-1/hello-1.mod.o
  LD [M]  /home/tinyynoob/code/kernel/hello-1/hello-1.ko
  BTF [M] /home/tinyynoob/code/kernel/hello-1/hello-1.ko
Skipping BTF generation for /home/tinyynoob/code/kernel/hello-1/hello-1.ko due to unavailability of vmlinux
make[1]: 離開目錄「/usr/src/linux-headers-5.13.0-30-generic」


看似往正確的方向前進了,也有產出 hello-1.ko ,但沒有完全成功,它在 vmlinux 遇到問題

vmlinux 問題尚未解決

根據 modinfo 可以發現在 hello-1.ko 的 vermagic 欄位有 modversions 。回去閱讀書中提到 Modversioning 的部份,得知我可能需要關掉它。

參考,試著進到 /boot/config-5.13.0-30-generic 關掉 CONFIG_MODVERSIONS

關掉 modversions 但尚未 recompile kernel

hello-1.c 裡的 return 值改成負數會導致:

insmod: ERROR: could not insert module hello-1.ko: Operation not permitted

研讀數字字串運算範例程式

source code

數字字串在運算時可以因應字串 size 選擇不同的策略。

範例程式使用的字串使用 union 定義如下:

typedef union { /* allow strings up to 15 bytes to stay on the stack * use the last byte as a null terminator and to store flags * much like fbstring: * https://github.com/facebook/folly/blob/master/folly/docs/FBString.md */ char data[16]; struct { uint8_t filler[15], /* how many free bytes in this stack allocated string * same idea as fbstring */ space_left : 4, /* if it is on heap, set to 1 */ is_ptr : 1, is_large_string : 1, flag2 : 1, flag3 : 1; }; /* heap allocated */ struct { char *ptr; /* supports strings up to 2^MAX_STR_LEN_BITS - 1 bytes */ size_t size : MAX_STR_LEN_BITS, /* capacity is always a power of 2 (unsigned)-1 */ capacity : 6; /* the last 4 bits are important flags */ }; } xs;
  • 在短碼的情況會使用第 9 行的 struct ,直接將資料字串存在 filler 並把一些 flag 資訊放在最後一個 byte
  • 在長碼的情況會使用第 20 行的 struct ,使用一個 ptr 指標並配置空間在 heap ,後面的 8 byte 拿來存放一些 flag

還沒完全理解在長碼情況下,

  1. 如何繼續使用 is_ptr flag
  2. capacity 使用的詳細狀況

範例程式在 fib_sequence() 計算 Fib ,算法使用逐項相加:

    for (i = 2; i <= k; i++)
        string_number_add(&f[i - 1], &f[i - 2], &f[i]);

因此若是我們能改善乘法及平方的複雜度,則可以使用 fast doubling 的技巧來改善目前算法。

Linux 效能分析

嘗試查看 CPU affinity :

$ taskset -p 8528
pid 8528 目前的近似者遮罩:ff

ff 的二進位為 11111111 ,代表這個 process 可以在 7~0 任意一個 CPU 上執行。試著更改:

$ taskset -p FA 8528
pid 8528 目前的近似者遮罩:ff
pid 8528 新的近似者遮罩:fa

根據網路上的教學,在 /boot/grub/grub.cfg 中加入 isolcpus=0,1

最後,完成 排除干擾效能分析的因素 中的所有設定。

reboot 之後查看 process 發現 core mask 變成了 fc :

$ taskset -p 1
pid 1 目前的近似者遮罩:fc

雖然有些已經變成 fc 但有些仍顯示 ff ,因此重新依據參考共筆進行設定。結果仍失敗,目前的在設定 smp_affinity 上因腳本錯誤遇到困難。


Fork

嘗試掛載模組之後輸入 sudo ./client 可以得到

結果
Reading from /dev/fibonacci at offset 0, returned the sequence 0.
Reading from /dev/fibonacci at offset 1, returned the sequence 1.
Reading from /dev/fibonacci at offset 2, returned the sequence 1.
Reading from /dev/fibonacci at offset 3, returned the sequence 2.
Reading from /dev/fibonacci at offset 4, returned the sequence 3.
Reading from /dev/fibonacci at offset 5, returned the sequence 5.
Reading from /dev/fibonacci at offset 6, returned the sequence 8.
Reading from /dev/fibonacci at offset 7, returned the sequence 13.
Reading from /dev/fibonacci at offset 8, returned the sequence 21.
Reading from /dev/fibonacci at offset 9, returned the sequence 34.
Reading from /dev/fibonacci at offset 10, returned the sequence 55.
Reading from /dev/fibonacci at offset 11, returned the sequence 89.
Reading from /dev/fibonacci at offset 12, returned the sequence 144.
Reading from /dev/fibonacci at offset 13, returned the sequence 233.
Reading from /dev/fibonacci at offset 14, returned the sequence 377.
Reading from /dev/fibonacci at offset 15, returned the sequence 610.
Reading from /dev/fibonacci at offset 16, returned the sequence 987.
Reading from /dev/fibonacci at offset 17, returned the sequence 1597.
Reading from /dev/fibonacci at offset 18, returned the sequence 2584.
Reading from /dev/fibonacci at offset 19, returned the sequence 4181.
Reading from /dev/fibonacci at offset 20, returned the sequence 6765.
Reading from /dev/fibonacci at offset 21, returned the sequence 10946.
Reading from /dev/fibonacci at offset 22, returned the sequence 17711.
Reading from /dev/fibonacci at offset 23, returned the sequence 28657.
Reading from /dev/fibonacci at offset 24, returned the sequence 46368.
Reading from /dev/fibonacci at offset 25, returned the sequence 75025.
Reading from /dev/fibonacci at offset 26, returned the sequence 121393.
Reading from /dev/fibonacci at offset 27, returned the sequence 196418.
Reading from /dev/fibonacci at offset 28, returned the sequence 317811.
Reading from /dev/fibonacci at offset 29, returned the sequence 514229.
Reading from /dev/fibonacci at offset 30, returned the sequence 832040.
Reading from /dev/fibonacci at offset 31, returned the sequence 1346269.
Reading from /dev/fibonacci at offset 32, returned the sequence 2178309.
Reading from /dev/fibonacci at offset 33, returned the sequence 3524578.
Reading from /dev/fibonacci at offset 34, returned the sequence 5702887.
Reading from /dev/fibonacci at offset 35, returned the sequence 9227465.
Reading from /dev/fibonacci at offset 36, returned the sequence 14930352.
Reading from /dev/fibonacci at offset 37, returned the sequence 24157817.
Reading from /dev/fibonacci at offset 38, returned the sequence 39088169.
Reading from /dev/fibonacci at offset 39, returned the sequence 63245986.
Reading from /dev/fibonacci at offset 40, returned the sequence 102334155.
Reading from /dev/fibonacci at offset 41, returned the sequence 165580141.
Reading from /dev/fibonacci at offset 42, returned the sequence 267914296.
Reading from /dev/fibonacci at offset 43, returned the sequence 433494437.
Reading from /dev/fibonacci at offset 44, returned the sequence 701408733.
Reading from /dev/fibonacci at offset 45, returned the sequence 1134903170.
Reading from /dev/fibonacci at offset 46, returned the sequence 1836311903.
Reading from /dev/fibonacci at offset 47, returned the sequence 2971215073.
Reading from /dev/fibonacci at offset 48, returned the sequence 4807526976.
Reading from /dev/fibonacci at offset 49, returned the sequence 7778742049.
Reading from /dev/fibonacci at offset 50, returned the sequence 12586269025.
Reading from /dev/fibonacci at offset 51, returned the sequence 20365011074.
Reading from /dev/fibonacci at offset 52, returned the sequence 32951280099.
Reading from /dev/fibonacci at offset 53, returned the sequence 53316291173.
Reading from /dev/fibonacci at offset 54, returned the sequence 86267571272.
Reading from /dev/fibonacci at offset 55, returned the sequence 139583862445.
Reading from /dev/fibonacci at offset 56, returned the sequence 225851433717.
Reading from /dev/fibonacci at offset 57, returned the sequence 365435296162.
Reading from /dev/fibonacci at offset 58, returned the sequence 591286729879.
Reading from /dev/fibonacci at offset 59, returned the sequence 956722026041.
Reading from /dev/fibonacci at offset 60, returned the sequence 1548008755920.
Reading from /dev/fibonacci at offset 61, returned the sequence 2504730781961.
Reading from /dev/fibonacci at offset 62, returned the sequence 4052739537881.
Reading from /dev/fibonacci at offset 63, returned the sequence 6557470319842.
Reading from /dev/fibonacci at offset 64, returned the sequence 10610209857723.
Reading from /dev/fibonacci at offset 65, returned the sequence 17167680177565.
Reading from /dev/fibonacci at offset 66, returned the sequence 27777890035288.
Reading from /dev/fibonacci at offset 67, returned the sequence 44945570212853.
Reading from /dev/fibonacci at offset 68, returned the sequence 72723460248141.
Reading from /dev/fibonacci at offset 69, returned the sequence 117669030460994.
Reading from /dev/fibonacci at offset 70, returned the sequence 190392490709135.
Reading from /dev/fibonacci at offset 71, returned the sequence 308061521170129.
Reading from /dev/fibonacci at offset 72, returned the sequence 498454011879264.
Reading from /dev/fibonacci at offset 73, returned the sequence 806515533049393.
Reading from /dev/fibonacci at offset 74, returned the sequence 1304969544928657.
Reading from /dev/fibonacci at offset 75, returned the sequence 2111485077978050.
Reading from /dev/fibonacci at offset 76, returned the sequence 3416454622906707.
Reading from /dev/fibonacci at offset 77, returned the sequence 5527939700884757.
Reading from /dev/fibonacci at offset 78, returned the sequence 8944394323791464.
Reading from /dev/fibonacci at offset 79, returned the sequence 14472334024676221.
Reading from /dev/fibonacci at offset 80, returned the sequence 23416728348467685.
Reading from /dev/fibonacci at offset 81, returned the sequence 37889062373143906.
Reading from /dev/fibonacci at offset 82, returned the sequence 61305790721611591.
Reading from /dev/fibonacci at offset 83, returned the sequence 99194853094755497.
Reading from /dev/fibonacci at offset 84, returned the sequence 160500643816367088.
Reading from /dev/fibonacci at offset 85, returned the sequence 259695496911122585.
Reading from /dev/fibonacci at offset 86, returned the sequence 420196140727489673.
Reading from /dev/fibonacci at offset 87, returned the sequence 679891637638612258.
Reading from /dev/fibonacci at offset 88, returned the sequence 1100087778366101931.
Reading from /dev/fibonacci at offset 89, returned the sequence 1779979416004714189.
Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120.
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 93, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 94, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 95, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 96, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 97, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 98, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 98, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 97, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 96, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 95, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 94, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 93, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120.
Reading from /dev/fibonacci at offset 89, returned the sequence 1779979416004714189.
Reading from /dev/fibonacci at offset 88, returned the sequence 1100087778366101931.
Reading from /dev/fibonacci at offset 87, returned the sequence 679891637638612258.
Reading from /dev/fibonacci at offset 86, returned the sequence 420196140727489673.
Reading from /dev/fibonacci at offset 85, returned the sequence 259695496911122585.
Reading from /dev/fibonacci at offset 84, returned the sequence 160500643816367088.
Reading from /dev/fibonacci at offset 83, returned the sequence 99194853094755497.
Reading from /dev/fibonacci at offset 82, returned the sequence 61305790721611591.
Reading from /dev/fibonacci at offset 81, returned the sequence 37889062373143906.
Reading from /dev/fibonacci at offset 80, returned the sequence 23416728348467685.
Reading from /dev/fibonacci at offset 79, returned the sequence 14472334024676221.
Reading from /dev/fibonacci at offset 78, returned the sequence 8944394323791464.
Reading from /dev/fibonacci at offset 77, returned the sequence 5527939700884757.
Reading from /dev/fibonacci at offset 76, returned the sequence 3416454622906707.
Reading from /dev/fibonacci at offset 75, returned the sequence 2111485077978050.
Reading from /dev/fibonacci at offset 74, returned the sequence 1304969544928657.
Reading from /dev/fibonacci at offset 73, returned the sequence 806515533049393.
Reading from /dev/fibonacci at offset 72, returned the sequence 498454011879264.
Reading from /dev/fibonacci at offset 71, returned the sequence 308061521170129.
Reading from /dev/fibonacci at offset 70, returned the sequence 190392490709135.
Reading from /dev/fibonacci at offset 69, returned the sequence 117669030460994.
Reading from /dev/fibonacci at offset 68, returned the sequence 72723460248141.
Reading from /dev/fibonacci at offset 67, returned the sequence 44945570212853.
Reading from /dev/fibonacci at offset 66, returned the sequence 27777890035288.
Reading from /dev/fibonacci at offset 65, returned the sequence 17167680177565.
Reading from /dev/fibonacci at offset 64, returned the sequence 10610209857723.
Reading from /dev/fibonacci at offset 63, returned the sequence 6557470319842.
Reading from /dev/fibonacci at offset 62, returned the sequence 4052739537881.
Reading from /dev/fibonacci at offset 61, returned the sequence 2504730781961.
Reading from /dev/fibonacci at offset 60, returned the sequence 1548008755920.
Reading from /dev/fibonacci at offset 59, returned the sequence 956722026041.
Reading from /dev/fibonacci at offset 58, returned the sequence 591286729879.
Reading from /dev/fibonacci at offset 57, returned the sequence 365435296162.
Reading from /dev/fibonacci at offset 56, returned the sequence 225851433717.
Reading from /dev/fibonacci at offset 55, returned the sequence 139583862445.
Reading from /dev/fibonacci at offset 54, returned the sequence 86267571272.
Reading from /dev/fibonacci at offset 53, returned the sequence 53316291173.
Reading from /dev/fibonacci at offset 52, returned the sequence 32951280099.
Reading from /dev/fibonacci at offset 51, returned the sequence 20365011074.
Reading from /dev/fibonacci at offset 50, returned the sequence 12586269025.
Reading from /dev/fibonacci at offset 49, returned the sequence 7778742049.
Reading from /dev/fibonacci at offset 48, returned the sequence 4807526976.
Reading from /dev/fibonacci at offset 47, returned the sequence 2971215073.
Reading from /dev/fibonacci at offset 46, returned the sequence 1836311903.
Reading from /dev/fibonacci at offset 45, returned the sequence 1134903170.
Reading from /dev/fibonacci at offset 44, returned the sequence 701408733.
Reading from /dev/fibonacci at offset 43, returned the sequence 433494437.
Reading from /dev/fibonacci at offset 42, returned the sequence 267914296.
Reading from /dev/fibonacci at offset 41, returned the sequence 165580141.
Reading from /dev/fibonacci at offset 40, returned the sequence 102334155.
Reading from /dev/fibonacci at offset 39, returned the sequence 63245986.
Reading from /dev/fibonacci at offset 38, returned the sequence 39088169.
Reading from /dev/fibonacci at offset 37, returned the sequence 24157817.
Reading from /dev/fibonacci at offset 36, returned the sequence 14930352.
Reading from /dev/fibonacci at offset 35, returned the sequence 9227465.
Reading from /dev/fibonacci at offset 34, returned the sequence 5702887.
Reading from /dev/fibonacci at offset 33, returned the sequence 3524578.
Reading from /dev/fibonacci at offset 32, returned the sequence 2178309.
Reading from /dev/fibonacci at offset 31, returned the sequence 1346269.
Reading from /dev/fibonacci at offset 30, returned the sequence 832040.
Reading from /dev/fibonacci at offset 29, returned the sequence 514229.
Reading from /dev/fibonacci at offset 28, returned the sequence 317811.
Reading from /dev/fibonacci at offset 27, returned the sequence 196418.
Reading from /dev/fibonacci at offset 26, returned the sequence 121393.
Reading from /dev/fibonacci at offset 25, returned the sequence 75025.
Reading from /dev/fibonacci at offset 24, returned the sequence 46368.
Reading from /dev/fibonacci at offset 23, returned the sequence 28657.
Reading from /dev/fibonacci at offset 22, returned the sequence 17711.
Reading from /dev/fibonacci at offset 21, returned the sequence 10946.
Reading from /dev/fibonacci at offset 20, returned the sequence 6765.
Reading from /dev/fibonacci at offset 19, returned the sequence 4181.
Reading from /dev/fibonacci at offset 18, returned the sequence 2584.
Reading from /dev/fibonacci at offset 17, returned the sequence 1597.
Reading from /dev/fibonacci at offset 16, returned the sequence 987.
Reading from /dev/fibonacci at offset 15, returned the sequence 610.
Reading from /dev/fibonacci at offset 14, returned the sequence 377.
Reading from /dev/fibonacci at offset 13, returned the sequence 233.
Reading from /dev/fibonacci at offset 12, returned the sequence 144.
Reading from /dev/fibonacci at offset 11, returned the sequence 89.
Reading from /dev/fibonacci at offset 10, returned the sequence 55.
Reading from /dev/fibonacci at offset 9, returned the sequence 34.
Reading from /dev/fibonacci at offset 8, returned the sequence 21.
Reading from /dev/fibonacci at offset 7, returned the sequence 13.
Reading from /dev/fibonacci at offset 6, returned the sequence 8.
Reading from /dev/fibonacci at offset 5, returned the sequence 5.
Reading from /dev/fibonacci at offset 4, returned the sequence 3.
Reading from /dev/fibonacci at offset 3, returned the sequence 2.
Reading from /dev/fibonacci at offset 2, returned the sequence 1.
Reading from /dev/fibonacci at offset 1, returned the sequence 1.
Reading from /dev/fibonacci at offset 0, returned the sequence 0.


可以發現到 92 項以後數值就不再改變,代表這裡存在嚴重缺陷,對於超過 64 位元的數值無能為力。

另外在輸入指令

$ ls -l /dev/fibonacci 

後,不是得到 235 也不是得到 236 而是 505

crw------- 1 root root 505, 0  3月 11 13:10 /dev/fibonacci

ktime doc

參考共筆依樣畫葫蘆,由於目前 fibdrv.cfib_write() 沒有功能,所以在其中塞入測量時間的程式碼:

static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    ktime_t kt;
    kt = ktime_get();
    fib_sequence(*offset);
    kt = ktime_sub(ktime_get(), kt);
    return (ssize_t) ktime_to_ns(kt);
}

然而輸出的數值看起來都差不多,回頭想找找看這個 offset 到底是怎麼設的。我發現在 client.c 呼叫時傳入的參數並不包含 offset ,不過在呼叫 write()read() 時有個不同是 read() 多了一行:

        lseek(fd, i, SEEK_SET);

其中 SEEK_SET 將被展開為 0 。觀察發現呼叫 lseek() 會連到 fibdrv.c 中的:

static loff_t fib_device_lseek(struct file *file, loff_t offset, int orig)
{
    loff_t new_pos = 0;
    switch (orig) {
    case 0: /* SEEK_SET: */
        new_pos = offset;
        break;
    case 1: /* SEEK_CUR: */
        new_pos = file->f_pos + offset;
        break;
    case 2: /* SEEK_END: */
        new_pos = MAX_LENGTH - offset;
        break;
    }

    if (new_pos > MAX_LENGTH)
        new_pos = MAX_LENGTH;  // max case
    if (new_pos < 0)
        new_pos = 0;        // min case
    file->f_pos = new_pos;  // This is what we'll use now
    return new_pos;
}

雖然可以猜測這個函式的回傳值會更改 offset ,但是目前不知道這到底是如何達成的,我甚至不曉得 offset 到底存在哪裡。

TODO:

  • 釐清 offset

答:
我認為就是存在 file 結構的 f_pos 成員中

研讀參考共筆中的大數運算

https://hackmd.io/@KYWeng/rkGdultSU#如何減少大數運算的成本

這份共筆提出多種實作上改進計算 Fib 速度的技巧,摘要如下:

調整 Fast Fib 算法

將原先的算法由

F2n+1=Fn+12+Fn2F2n=Fn(2Fn+1Fn)

改良為:

F2n1=Fn2+Fn12F2n=Fn(2Fn1+Fn)

以迴避所有減法運算,並且不會再出現任何負數,可以全部用 unsigned 型態儲存。

使用 memory pool

在大數結構中引入 memory pool 的概念來避免頻繁的 realloc()

使用內嵌組合語言加速乘法

做乘法運算時直接指定高低位的位置,而不是使用擴展整數型別。

以參考共筆中的語法為例:

__asm__("mulq %3" : "=a"(low), "=d"(high) : "%0"(a[i]), "rm"(k));

        assembly language : output :  input

其中 instruction 顯然為 mulq ,而

  1. = 是 write only 的意思
  2. a, d 分別表示 $rax, $rdx
  3. ( ) 內使用 C 變數
  4. rm 表示由 compiler 自己選擇存放位置
  5. %0 在此等同 a 暫存器
  6. %3 代表 rm

參考:

https://ithelp.ithome.com.tw/articles/10213500
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

引入 Karatsuba algorithm 算法

核心思想為在做乘法時,將一些乘法運算改為加減,考慮:

(a12n+a0)×(b12n+b0)=c222n+c12n+c0=(a1b1)22n+(a1b0+a0b1)2n+a0b0

原先需要做 4 次乘法運算,經過一些設計可以變成:

(22n+2n)(a1b1)+2n(a1a0)(b0b1)+(2n+1)(a0b0)

便能少做一次乘法。

改進 fibdrv

實作程式碼 github

大數運算

在為 fibdrv 改進效能之前,我們需要先確保它的正確性,為此我們首先引入大數運算的機制。對於存放大數的方式前面提出了兩種設計:

  • 單純使用結構
  • 使用 union 達成 small string optimization

由於在計算 Fib 的情境下不會一直需要宣告新的變數,也不會有頻繁居於小數計算的情況,因此我們選擇使用結構的方式,如此也較便利於 maintain resize 的功能。

因為我目前不太熟悉 kernel programming 的各種 function 用法,因此我打算先在 user space 開發,並 debug 確認功能沒問題之後再移植到 kernel space 。

我想借助 macro 來完成此工作,例如這樣:

#if DEBUG
    N = (ubn *) malloc(sizeof(ubn));
#else
    N = (ubn *) kmalloc(sizeof(ubn), GFP_KERNEL);
#endif

由於在計算 Fib 的過程中不會遇到任何負數,因此我的大數運算假設不存在負數,相關的運算也都沒有提供。

引入共筆用的 macro

#if defined(__LP64__) || defined(__x86_64__) || defined(__amd64__) || \
    defined(__aarch64__)
typedef uint64_t ubn_unit;
// typedef unsigned __int128 bn_data_tmp; // gcc support __int128
#define ubn_unit_bit 64
#else
typedef uint32_t ubn_unit;
// typedef uint64_t bn_data_tmp;
#define ubn_unit_bit 32
#endif

目前還未使用到加倍的型別,暫時先註解掉。

基於沒有負數的假設,我們得以直接移除共筆結構中的 sign 成員,訂出:

/* unsigned big number
 * @data: MSB:[size-1], LSB:[0]
 * @size: sizeof(ubn_unit)
 * @capacity: allocated size
 */
typedef struct {
    ubn_unit *data;
    int size;
    int capacity;
} ubn;

預期 (無號) 大數運算的界面提供以下功能:

  • ubignum_init()
  • ubignum_free()
  • ubignum_resize()
  • ubignum_add()
  • ubignum_mult()
  • ubignum_assign()

目前比較沒有頭緒的是 assign() 的作法

ubignum_init()

程式碼
bool ubignum_init(ubn *N)
{
#if DEBUG
    N = (ubn *) malloc(sizeof(ubn));
#else
    N = (ubn *) kmalloc(sizeof(ubn), GFP_KERNEL);
#endif
    if (!N)
        goto struct_aloc_failed;

#if DEBUG
    N->data = (ubn_unit *) calloc(sizeof(ubn_unit),
                                  DEFAULT_CAPACITY);  // including mem reset
#else
    N->data = (ubn_unit *) kcalloc(sizeof(ubn_unit), DEFAULT_CAPACITY,
                                   GFP_KERNEL);  // including mem reset
#endif
    if (!N->data)
        goto data_aloc_failed;
    N->size = 0;
    N->capacity = DEFAULT_CAPACITY;
    return true;

data_aloc_failed:
    free(N);
    N = NULL;
struct_aloc_failed:
    return false;
}

其中 DEFAULT_CAPACITY 目前訂為 16 。

靜態分析出現一堆問題,這才發現我原本的寫法根本不會正確把初始化後的 N 回傳,因此修改程式碼:

ubignum_resize()

程式碼
bool ubignum_resize(ubn *N, int new_size)
{
    if (new_size < 0)
        return false;
#if DEBUG
    ubn_unit *new = (ubn_unit *) realloc(N->data, sizeof(ubn_unit) * new_size);
#else
    ubn_unit *new =
        (ubn_unit *) krealloc(N->data, sizeof(ubn_unit) * new_size, GFP_KERNEL);
#endif
    if (!new)
        return false;
    N = new;
    N->capacity = new_size;
    return true;
}

目前不確定允許 new_size 傳入 0 是否是個好主意。

ubignum_add()

bool ubignum_add(const ubn *a, const ubn *b, ubn *out)

目前設計讓使用者傳入存放 sum 的變數,但這麼做就需要解決 pointer aliasing 的問題。

由低到高做加法的流程大致可以分成兩段:

  1. 兩數相加
  2. 一數加上 1. 的 carry out

首先開發 1. 的部份:

ubn_unit carry = 0; int i = 0; for (i = 0; i < min(a->size, b->size); i++) { if (__builtin_expect(i >= out->capacity, 0)) { if (!resize(out, out->capacity * 2)) { return false; } } out->data[i] = a->data[i] + b->data[i] + carry; // The following consideration does not include carry // fix it! carry = ((a->data[i] ^ b->data[i]) >> 1 ^ (a->data[i] & b->data[i])) & (1u << (ubn_unit_bit - 1)); }

若超出 out 範圍則對它進行 resize() ,其中加入 builtin_expect() 協助 compiler 進行 branch prediction 。

撰寫過程馬上讓我遭遇了幾個問題:

  1. 第 11 行,我嘗試引入 quiz2 第一題 計算平均的方式再取 MSB 來得出新的 carry ,但我發覺這樣的話舊的 carry 沒有被加進來,所以在例如 (a, b, c) = (UINT_MAX, 0, 1) 的情況下會算出錯誤結果。
  2. 我不熟悉 gcc 對於 unsigned 型態的右移會如何處理

根據 gcc 的 mail list

For right shifts, if the type is signed, then we use an arithmetic shift;
if the type is unsigned then we use a logical.

因此對於 unsigned 右移的行為符合我的預期。

1. 待解決

後面完全更換作法就不再有這個問題

為了應對 pointer aliasing 的問題,我們應該不能直接拿 a->datab->data 去運算,因為它們在第 9 行可能會被改變 (如果他們跟 out 指到同樣位置) ,那麼第 12 行的運算將會錯誤。因此需要先將它們在第 9 行的值存起來,將程式碼改成:

for (i = 0; i < min(a->size, b->size); i++) { if (__builtin_expect(i >= out->capacity, 0)) { if (!resize(out, out->capacity * 2)) { goto realoc_failed; } } ubn_unit an = a->data[i]; ubn_unit bn = b->data[i]; out->data[i] = an + bn + carry; // The following consideration does not include carry // fix it! carry = ((an ^ bn) >> 1 + (an & bn)) & (1u << (ubn_unit_bit - 1)); }

其中第 4 行跳到最後去把 ab 還原成原本的樣子,這也是因為 aliasing 才有的問題。

第二段程式碼:

ubn *remain = NULL; if (i == a->size) remain = b; else if (i == b->size) remain = a; for (; carry || i < remain->size; i++) { if (__builtin_expect(i >= out->capacity, 0)) { if (!resize(out, out->capacity * 2)) { return false; } } if (__builtin_expect(i < remain->size, 1)) { out->data[i] = remain->data[i] + carry; carry = ((remain->data[i] ^ carry) >> 1 ^ (remain->data[i] & carry)) & (1u << (ubn_unit_bit - 1)); } else { // the last carry out carries to a new ubn_unit out->data[i] = carry; // = 1 carry = 0; } }

先使用 remain 來記住哪一個加數還有剩餘的位數沒加,接著就進入迴圈繼續加完,大部分的情況都會走 12 行,唯一會走到 17 行的情況是最高位有 carry out 而且剛好 carry 到一個新的 ubn_unit chunk 。

同樣處理 pointer aliasing 的問題並改進寫法得:

    ubn *remain = (i == a->size) ? b : a;
    for (; carry || i < remain->size; i++) {
        if (__builtin_expect(i >= out->capacity, 0)) {
            if (!resize(out, out->capacity * 2)) {
                goto realoc_failed;
            }
        }
        if (__builtin_expect(i < remain->size, 1)) {
            ubn_unit rn = remain->data[i];
            out->data[i] = rn + carry;
            carry =
                ((rn ^ carry) >> 1 ^ (rn & carry)) & (1u << (ubn_unit_bit - 1));
        } else {  // the last carry out carries to a new ubn_unit
            out->data[i] = carry;  // = 1
            carry = 0;
        }
    }

處理 aliasing 的程式碼:

    ubn *store = NULL;
    int alias = 0;
    if (a == out)  // pointer aliasing
        alias ^= 1;
    if (b == out)  // pointer aliasing
        alias ^= 2;
    if (alias) {
        store = ubignum_duplicate(store, out);
        if (!store)
            return false;
    }

直接使用一個變數 alias 中的 bit 來觀察,若有 aliasing 則配置空間給保留原值。

realoc_failed:
    switch (alias) {
    case 1:
        swapptr(&a, &store);
        break;
    case 2:
    case 3:
        swapptr(&b, &store);
    case 0:
    }
    return false;

注意到這裡 2, 3 共用的原因是,如果 ab 指到同一個位置,那麼恢復 b 也同時恢復了 a


aliasing 的問題比我預期的還更難處理,我直接重改邏輯,變成如果有 aliasing 則配置一個新空間存放計算結果,算完再放到 out ,改寫過程中也修復了一些 bug 。

更改後的 ubignum_add() 程式碼
/* out = a + b * Aliasing arguments are acceptable. * If it return true, the result is put at @out. * If return false with alias input, the @out would be unchanged. * If return false without alias input, the @out would return neither answer nor * original out. */ bool ubignum_add(const ubn *a, const ubn *b, ubn **out) { ubn *ans = *out; int alias = 0; if (a == *out) // pointer aliasing alias ^= 1; if (b == *out) // pointer aliasing alias ^= 2; if (alias) { // if so, allocate space to store the result if (!ubignum_init(&ans)) return false; } ubn_unit carry = 0; int i = 0; for (i = 0; i < min(a->size, b->size); i++) { if (i >= ans->size) { if (__builtin_expect(i >= ans->capacity, 0)) if (!ubignum_resize(ans, ans->capacity * 2)) goto realoc_failed; ans->size++; } const ubn_unit_extend sum = a->data[i] + b->data[i] + carry; ans->data[i] = sum; carry = sum >> ubn_unit_bit; } const ubn *remain = (i == a->size) ? b : a; for (; carry || i < remain->size; i++) { if (i >= ans->size) { if (__builtin_expect(i >= ans->capacity, 0)) if (!ubignum_resize(ans, ans->capacity * 2)) goto realoc_failed; ans->size++; } if (__builtin_expect(i < remain->size, 1)) { const ubn_unit_extend sum = remain->data[i] + carry; ans->data[i] = sum; carry = sum >> ubn_unit_bit; } else { // the last carry out carries to a new ubn_unit ans->data[i] = carry; // = 1 carry = 0; } } *out = ans; // no condition needed return true; realoc_failed: if (alias) ubignum_free(ans); return false; }


其中 ubn_unit_extend 在 x64 定義為 unsigned __int128 ,目前嘗試在 user space 配合 gdb 工具進行測試,得到錯誤的結果, carry 沒有被正確計算

carry = sum >> ubn_unit_bit;

疑似 uint128 的 shift operation 有問題,我嘗試去找 gcc doc ,但是內容只有三行,沒有提到是否支援 shift operation 。

我在 assignment 之前加入強制轉型就有我預期的結果了

        const ubn_unit_extend sum = (ubn_unit_extend) a->data[i] + b->data[i] + carry;

我好像已經吃了很多次這種虧,卻都沒有記住,我一直誤以為計算的時候就會轉成左邊的型態,這次特別寫進 commit message ,希望能確實記得。

ubignum_mult()

雖然前面有提及加速的算法,但是因為我目前的大數機制沒有減法,所以暫時先略過此項。

在開發時發現 ubn_unit 的加法一直重複遇到,所以將其模組化

/* no checking, sum and cout input should be guarantee */
static inline void ubn_unit_add(const ubn_unit a,
                                const ubn_unit b,
                                const int cin,
                                ubn_unit *sum,
                                int *cout)
{
    const ubn_unit_extend SUM = (ubn_unit_extend) a + b + cin;
    *sum = SUM;
    *cout = SUM >> ubn_unit_bit;
}

長得有點像硬體加法器的樣子。 cincout 都只會用到一個 bit ,用什麼型態似乎都蠻浪費的,最後選用了最常見的 int 。

mult() 程式碼
bool ubignum_mult(const ubn *a, const ubn *b, ubn **out) { if (!a || !b || !out || !*out) return false; ubn *ans = *out; int alias = 0; if (a == *out) // pointer aliasing alias ^= 1; if (b == *out) // pointer aliasing alias ^= 2; if (alias) { // if alias, allocate space to store the result if (!ubignum_init(&ans)) return false; } /* keep mcand longer then mplier */ const ubn *mcand = a->size > b->size ? a : b; const ubn *mplier = a->size > b->size ? b : a; ubn *prod; if (!ubignum_init(&prod)) goto prod_aloc_failed; if (prod->capacity < mcand->size + 1) if (!ubignum_resize(&prod, mcand->size + 1)) goto prod_resize_failed; /* The outer loop */ for (int i = 0; i < mplier->size; i++) { /* The inner loop explanation: * a b c = mcand->data[2, 1, 0] * * d = mplier->data[0] * -------- * c c = (high, low) denotes result of c * d * b b * + a a * ---------- * x y ? c sum of bb and cc, it may be carry out to x * + a a * * let x be recorded in @carry and y be @overlap * no realloc needed for @prod */ int carry = 0; ubn_unit overlap = 0; for (int j = 0; j < mcand->size; j++) { ubn_unit low, high; // need macro to select appropriate assembly __asm__("mulq %3" : "=a"(low), "=d"(high) : "%0"(mcand->data[j]), "rm"(mplier->data[i])); int cout = 0; ubn_unit_add(low, overlap, carry & 2, &prod->data[j], &cout); carry = (carry << 1) | cout; // update carry where cout is 0 or 1 overlap = high; } prod->data[mcand->size] = overlap + (ubn_unit) carry & 2; // no carry out would be generated if (!ubignum_mult_add(prod, i, &ans)) goto multadd_failed; } *out = ans; return true; multadd_failed: prod_resize_failed: ubignum_free(prod); prod_aloc_failed: if (alias) ubignum_free(ans); return false; }


目前的 bug 是 inline asm 沒有正確作用,看起來像是 lowhigh 相反了,也可以看出確實是把 $rdx, $rax 分別接到 highlow

Breakpoint 2, ubignum_mult (a=0x5555555592a0, b=0x555555559350, out=0x7fffffffdc88) at bignum.h:336
336                 ubn_unit_add(low, overlap, carry & 2, &prod->data[j], &cout);
(gdb) p mcand->data[j]
$1 = 18446744073709551615
(gdb) p i
$2 = 0
(gdb) p j
$3 = 0
(gdb) p mplier->data[i]
$4 = 18446744073709551615
(gdb) p low
$5 = 1
(gdb) p high
$6 = 18446744073709551614
(gdb) info registers rax
rax            0x1                 1
(gdb) info registers rdx
rdx            0xfffffffffffffffe  -2

難道說它執行 mulq instruction 時的資料放置方式是跟原本相反嗎?

我嘗試去查看組合語言的編譯結果,看到:

	mulq %rdx

雖然它把 "rm" 選成 $rdx ,但這應該也不會影響指令執行結果才對。

我發現其實它沒有算錯,這個結果是正確的,但因為我對乘法的直覺還不夠,誤用加法的邏輯來思考乘法才誤判結果。

在 pointer aliasing 的情況下無法做到 in place 的算法,否則錯誤處理會出問題。


大數運算 第二版

在開發的過程中,因為大數運算難以 debug 只能經由 gdb 慢慢嘗試,因此修修改改花了很多時間,改了又錯又重改的情況多次發生,難以一一列舉,在此大略敘述過程遇到的狀況。

pointer aliasing issues

首先是前面提及 pointer aliasing 的問題,這個狀況是我在第一個版本遇到最大的問題,第一次是在開發加法時遇到。例如當想計算 a = a + b 時,若加出更大的位數則需要 reallocation ,但中途的 reallocation 失敗會導致傳入值 a 被破壞而正確的新 a 也沒有算出來,進退兩難的情況。

起初為了應對這個問題,我每次計算都先查看是否有 aliasing ,若有則進行複製, maintain 一份副本以利在 reallocation 失敗時能將數值復原。然而這個方法顯然導致很糟糕的效能,雖然之後為了節省交換的操作而直接計算在新空間,但仍然用了 3 個數。
使用者在使用界面時若指定 a = a + b ,那通常直覺也會認為自己使用了 in place 的算法,可是我在實作上卻無論如何都需要 3 個數的空間,違反使用直覺。

原程式碼 (add)
/* (*out) = a + b
 * Aliasing arguments are acceptable.
 * If it return true, the result is put at @out.
 * If return false with alias input, the @out would be unchanged.
 * If return false without alias input, the @out would return neither answer nor
 * original out.
 */
bool ubignum_add(const ubn *a, const ubn *b, ubn **out)
{
    if (!a || !b || !out || !*out)
        return false;
    ubn *ans = *out;
    int alias = 0;
    if (a == *out)  // pointer aliasing
        alias ^= 1;
    if (b == *out)  // pointer aliasing
        alias ^= 2;
    if (alias) {  // if alias, allocate space to store the result
        if (!ubignum_init(&ans))
            return false;
    }

    int i = 0, carry = 0;
    for (i = 0; i < min(a->size, b->size); i++) {
        if (i >= ans->size) {
            if (__builtin_expect(i >= ans->capacity, 0))
                if (!ubignum_resize(&ans, ans->capacity * 2))
                    goto realoc_failed;
            ans->size++;
        }
        ubn_unit_add(a->data[i], b->data[i], carry, &ans->data[i], &carry);
    }

    const ubn *remain = (i == a->size) ? b : a;
    for (; carry || i < remain->size; i++) {
        if (i >= ans->size) {
            if (__builtin_expect(i >= ans->capacity, 0))
                if (!ubignum_resize(&ans, ans->capacity * 2))
                    goto realoc_failed;
            ans->size++;
        }
        ubn_unit_add(remain->data[i], 0, carry, &ans->data[i], &carry);
    }
    if (alias)
        ubignum_free(*out);
    *out = ans;
    return true;

realoc_failed:
    if (alias)
        ubignum_free(ans);
    return false;
}


對於每個

n 測量 1000 次
fib(n)
計算後印出平均消耗時間。
其糟糕效能如下:

為此我後來發現若是能提早知道 reallocation 的大小,或至少知道該次計算的上界,那麼即可在開始計算之前就進行 reallocation 。如此一來即使配置失敗,那麼直接回傳 false 原先的值也不會被破壞。因此在計算 a = a + b 時便可以達成 in place 的目標,使用 2 個數的空間完成加法,也因為每次 ubignum_add() 都只會做一次 reallocation ,大幅降低了 memory allocation 的花費。此外也不再需要確認 aliasing ,移除判斷條件後也使程式碼簡潔許多。

改良 (add)
/* (*out) = a + b
 * Aliasing arguments are acceptable.
 * If false is returned, the values of (*out) remains unchanged.
 */
bool ubignum_add(const ubn *a, const ubn *b, ubn **out)
{
    if (!a || !b || !out || !*out)
        return false;
    while (__builtin_expect((*out)->capacity < max(a->size, b->size) + 1, 0))
        if (!ubignum_resize(out, (*out)->capacity * 2))
            return false;
    if (*out != a && *out != b)  // if no pointer aliasing
        ubignum_zero(*out);
    int i = 0, carry = 0;
    for (i = 0; i < min(a->size, b->size); i++)
        ubn_unit_add(a->data[i], b->data[i], carry, &(*out)->data[i], &carry);
    const ubn *const remain = (i == a->size) ? b : a;
    for (; i < remain->size; i++)
        ubn_unit_add(remain->data[i], 0, carry, &(*out)->data[i], &carry);
    (*out)->size = remain->size;
    if (carry) {
        (*out)->data[i] = 1;
        (*out)->size++;
    }
    return true;
}


效能:

相比原先的版本加速將近十倍之多。

列印

第二個困難出在列印,雖然我的 ubignum 已能進行數種運算,也正確地將數值儲存在檔案中,但它們就是一串由上百、上千個 0, 1 組成的資料,難以為人類所理解。

為了將很長的二進位轉成十進制,我左思右想許久,最後認為終究是離不開除法的命運,還是得乖乖寫一個除法。我想起前幾周的測驗題中,恰好有計算除法的內容,於是我就嘗試套用這個模式,意圖以位移和減法來達成除法運算。此外因為目前沒有其它除法的應用場景,所以可以只考量「除以 10 」的情況並進行特化。

為了實作出 ubignum_divby_ten() ,首先需要 ubignum_left_shift()ubignum_sub()

ubignum_sub()

由於我的大數體系原先完全不存在任何負的概念,因此想引入減法也需要做一些權衡。

不過因為除法中使用的減法會保證結果為非負 (餘數) ,如此一來就不必引入負數,減法直接使用二補數加法進行,並把最高的進位拋棄。基於直接假設減法都是大減小,原本的 ubignum 體系可以維持不變。

以下程式碼都展示解決前述 pointer aliasing 問題後的版本。

sub() 程式碼
bool ubignum_sub(const ubn *a, const ubn *b, ubn **out)
{
    if (!a || !b || !out || !*out)
        return false;
    /* ones' complement of b */
    ubn *cmt;
    if (!ubignum_init(&cmt))
        goto cmt_failed;
    if (__builtin_expect(cmt->capacity < a->size, 0))
        if (!ubignum_resize(&cmt, a->size))
            goto cmt_realoc_failed;
    cmt->size = a->size;
    for (int i = 0; i < b->size; i++)
        cmt->data[i] = ~b->data[i];
    for (int i = b->size; i < a->size; i++)
        cmt->data[i] = CPU_64 ? UINT64_MAX : UINT32_MAX;

    int carry = 1;
    for (int i = 0; i < a->size; i++)  // compute result and store in cmt
        ubn_unit_add(a->data[i], cmt->data[i], carry, &cmt->data[i], &carry);
    // the last carry is discarded
    for (int i = a->size - 1; i >= 0; i--)
        if (!cmt->data[i])
            cmt->size--;
    ubignum_free(*out);
    *out = cmt;
    return true;
cmt_realoc_failed:
    ubignum_free(cmt);
cmt_failed:
    return false;
}
ubignum_left_shift()
left_shift() 程式碼
/* left shift a->data by d bit */ bool ubignum_left_shift(const ubn *a, int d, ubn **out) { if (!a || d < 0 || !out || !*out) return false; else if (d == 0) { if (a == *out) return true; if (__builtin_expect((*out)->capacity < a->size, 0)) if (!ubignum_resize(out, a->size)) return false; for (int i = 0; i < a->size; i++) (*out)->data[i] = a->data[i]; (*out)->size = a->size; return true; } const int chunk_shift = d / ubn_unit_bit; const int shift = d % ubn_unit_bit; const int new_size = a->size + chunk_shift + 1; if (__builtin_expect((*out)->capacity < new_size, 0)) if (!ubignum_resize(out, new_size)) return false; int ai = new_size - chunk_shift - 1; // note a->size may be changed due to // aliasing, should not use a->size int oi = new_size - 1; /* copy data from a to (*out) */ if (shift) { (*out)->data[oi--] = a->data[ai - 1] >> (ubn_unit_bit - shift); for (ai--; ai > 0; ai--) (*out)->data[oi--] = (a->data[ai] << shift) | (a->data[ai - 1] >> (ubn_unit_bit - shift)); (*out)->data[oi--] = a->data[ai] << shift; // ai is now 0 } else { while (ai >= 0) (*out)->data[oi--] = a->data[ai--]; } /* end copy */ while (oi >= 0) // set remaining part of (*out) to 0 (*out)->data[oi--] = 0; (*out)->size = new_size; if ((*out)->data[(*out)->size - 1] == 0) // if MS chunk is 0 (*out)->size--; return true; }


大致上就是一些搬移的操作,在此解釋 29 至 33 行的操作,假設我們要將 a

00000000 11111111 01001111 00000000

左移 3 位。

  • 注意到 chunk 編號是
    (3,2,1,0)
    ,左邊為 most significant 。

每次我們需要目前 chunk 的最低 5 位放到高 (置左) 與其右方 chunk 的最高 3 位放到低 (置右) 進行合併,以編號

2 為例即是:

    11111...    from [2]
or  .....010    from [1]
   ----------
    11111010
  • 取最低 5 位放到高:左移
    3
  • 取最高 3 位放到低:右移
    (83)

最高跟最低的 chunk 都只有分別一邊所以需要特殊處理,最終得:

00000111 11111010 01111000 00000000

由於我們每次都不會去取目標左方的新 chunk ,因此 pointer aliasing 在此並不會造成問題。

另外,考量到當 shift 為 0 時,原本的程式碼會導致數值右移超過自己的 size ,產生 undefined behavior ,因此需要第 28 行的 if..else 來進行 branch 。

ubignum_divby_ten()

概念源於 quiz3 的算法。

在撰寫的過程中,再多寫了一個 ubignum_compare() 函式以利比較。

divby_ten() 程式碼
/* a / 10 = (*quo)...rmd */
bool ubignum_divby_ten(const ubn *a, ubn **quo, int *rmd)
{
    if (!a || !a->size || !quo || !*quo || !rmd)
        return false;
    ubn *ans, *dvd, *suber, *ten;
    if (!ubignum_init(&ans))
        return false;
    if (!ubignum_resize(&ans, a->size))
        goto cleanup_ans;
    if (!ubignum_init(&dvd))
        goto cleanup_ans;
    if (!ubignum_resize(&dvd, a->size))
        goto cleanup_dvd;
    if (!ubignum_init(&suber))
        goto cleanup_dvd;
    if (!ubignum_resize(&suber, dvd->capacity + 1))
        goto cleanup_suber;
    if (!ubignum_init(&ten))
        goto cleanup_suber;
    ubignum_uint(ten, 10);

    for (int i = 0; i < a->size; i++)
        dvd->data[i] = a->data[i];
    dvd->size = a->size;
    int shift = dvd->size * ubn_unit_bit - 4;  // can be improved by clz()
    ubignum_left_shift(ten, shift, &suber);
    while (ubignum_compare(dvd, ten) >= 0) {  // if dvd >= 10
        while (ubignum_compare(dvd, suber) < 0) {
            shift--;
            ubignum_left_shift(ten, shift, &suber);
        }
        ans->data[shift / ubn_unit_bit] |= (ubn_unit) 1
                                           << (shift % ubn_unit_bit);
        ubignum_sub(dvd, suber, &dvd);
    }
    ans->size = a->size;
    if (ans->data[a->size - 1] == 0)
        ans->size--;
    *rmd = (int) dvd->data[0];
    ubignum_free(ten);
    ubignum_free(suber);
    ubignum_free(dvd);
    ubignum_free(*quo);
    *quo = ans;
    return true;
cleanup_suber:
    ubignum_free(suber);
cleanup_dvd:
    ubignum_free(dvd);
cleanup_ans:
    ubignum_free(ans);
    return false;
}


之後同樣如 quiz3 引入 builtin_clz() 並重新包裝成 ubignum_clz() 進行改進,關鍵段落變化:

    for (int i = 0; i < a->size; i++)
        dvd->data[i] = a->data[i];
    dvd->size = a->size;
-   int shift = dvd->size * ubn_unit_bit - 4;
-   ubignum_left_shift(ten, shift, &suber);
    while (ubignum_compare(dvd, ten) >= 0) {  // if dvd >= 10
-       while (ubignum_compare(dvd, suber) < 0) {
-           shift--;
-           ubignum_left_shift(ten, shift, &suber);
-       }
+       int shift = dvd->size * ubn_unit_bit - ubignum_clz(dvd) - 4;
+       ubignum_left_shift(ten, shift, &suber);
+       if (ubignum_compare(dvd, suber) < 0)
+           ubignum_left_shift(ten, --shift, &suber);
        ans->data[shift / ubn_unit_bit] |= (ubn_unit) 1
                                           << (shift % ubn_unit_bit);
        ubignum_sub(dvd, suber, &dvd);
    }
    ans->size = a->size;  // FIX: ans = 0 should be considered
    if (ans->data[a->size - 1] == 0)
        ans->size--;
    *rmd = (int) dvd->data[0];
ubignum_2decimal()

在實作完成除以 10 之後,我們即可透過不斷除以 10 來將二進位數轉成字串印出。

2decimal() 程式碼
/* convert the ubn to ascii string */
char *ubignum_2decimal(const ubn *N)
{
    if (!N)
        return NULL;
    if (ubignum_iszero(N)) {
#if DEBUG
        char *ans = (char *) calloc(sizeof(char), 2);
#else
        char *ans = (char *) kcalloc(sizeof(char), 2, GFP_KERNEL);
#endif
        if (!ans)
            return NULL;
        ans[0] = '0';
        ans[1] = 0;
        return ans;
    }
    ubn *dvd = NULL;
    if (!ubignum_init(&dvd))
        return NULL;
    if (!ubignum_resize(&dvd, N->size))
        goto cleanup_dvd;
    for (int i = 0; i < N->size; i++)
        dvd->data[i] = N->data[i];
    dvd->size = N->size;
    /* Let n be the number.
     * digit = 1 + log_10(n) = 1 + \frac{log_2(n)}{log_2(10)}
     * log_2(10) \approx 3.3219 \approx 7/2,  we simply choose 3
     */
    unsigned digit = (ubn_unit_bit * N->size / 3) + 1;
#if DEBUG
    char *ans = (char *) calloc(sizeof(char), digit);
#else
    char *ans = (char *) kcalloc(sizeof(char), digit, GFP_KERNEL);
#endif
    if (!ans)
        goto cleanup_dvd;
    /* convert 2-base to 10-base */
    int index = 0;
    while (dvd->size) {
        int rmd;
        if (__builtin_expect(!ubignum_divby_ten(dvd, &dvd, &rmd), 0))
            goto cleanup_ans;
        ans[index++] = rmd | '0';
    }
    int len = index;
    /* reverse string */
    index = len - 1;
    for (int i = 0; i < index; i++, index--) {
        char tmp = ans[i];
        ans[i] = ans[index];
        ans[index] = tmp;
    }
    ubignum_free(dvd);
    return ans;
cleanup_ans:
#if DEBUG
    free(ans);
#else
    kfree(ans);
#endif
cleanup_dvd:
    ubignum_free(dvd);
    return NULL;
}


目前我認為字串反轉的方式還可以再改良,但因為這個部份並非 fib 運算的主軸,因此暫時就先如此。


fibdrv 效能分析與改進

原先想與範例程式碼比較,然而稍微更改後在自己電腦實驗的結果卻非常糟:

因此目前先以自己改進為方向,

由於速度的提昇,我們現在將範圍拉到

n=1000 來測試。

對於目前 fib sequence 逐項往上加的算法

static inline void ubn_unit_add(const ubn_unit a,
                                const ubn_unit b,
                                const int cin,
                                ubn_unit *sum,
                                int *cout)
{
    const ubn_unit_extend SUM = (ubn_unit_extend) a + b + cin;
    *sum = SUM;
    *cout = SUM >> ubn_unit_bit;
}

我認為仍有可能改進的部份:

  1. 使用 inline asm 來做加法,如此 (在 64 位元情況) 可以避免用到 uint128_t

今天 (3/24) 上課的時候,聽到老師介紹 gcc 有 builtin 的 overflow 函式,藉由這點我們即可改寫原本的 ubn_unit_add(),不需要低到 asm 的層級,相對 asm 的優點在於 architecture independence。然而這裡是需要 3 個數相加,因此需要呼叫兩次 builtin_uadd_overflow()

static inline void ubn_unit_add(const ubn_unit a,
                                const ubn_unit b,
                                const int cin,
                                ubn_unit *sum,
                                int *cout)
{
    *cout = __builtin_uaddll_overflow(a, cin, sum);
    *cout |= __builtin_uaddll_overflow(*sum, b, sum);
}

測試後可以發現在較大的 case 上,使用 builtin_uadd_overflow() 可以得到略佳的效果。再測到更大一些得:


當我嘗試將

n 提高到 3000 時,卻觀察到一個奇怪的現象,就是在 1000 之後,計算時間就不再有提昇的趨勢,如圖:

我覺得這非常不合理,因為以我的算法對於越大的數肯定有越高的計算量。目前想到的可能出在,因數值太大而產生 overflow ,但 20000

× 1000 也仍不到 32-bit int 的上界,應不可能超出 long long ,且可以從圖中見得有些數字遠遠超過水平線卻仍被印了出來。或許需要更理解 ktime_t 相關的知識。

On 64-bit systems, a ktime_t is really just a 64-bit integer value in nanoseconds.
- The high-resolution timer API

以這個資料來說,顯然更不是時間變數資料型態的問題了。

經過數天之後我終於發現問題,是 fibdrv.c 中的 MAX_LENGTH 。之前改過一次之後我就忘記有這個參數,目前直接設定成 1000000 以利後續實驗。

更正後試跑已可正常進行統計:


嘗試引入 fast doubling 算法加速

效能改善的情況高於我的預期,我原先以為 fast doubling 要到很大的 case 才會有顯著的改善。但從實驗結果可以發現,在

n=400 左右兩種算法即出現執行時間的交叉。

試著仿造使用 perf stat 測量我寫的 fast doubling 算法到

n=3000

$ sudo perf stat -e cycles,instructions,cache-misses,cache-references,branch-instructions,branch-misses ./exp 1 > /dev/null

 Performance counter stats for './exp 1':

   118,264,893,816      cycles                                                      
   264,687,477,706      instructions              #    2.24  insn per cycle         
        15,142,354      cache-misses              #   15.801 % of all cache refs    
        95,829,386      cache-references                                            
    38,631,044,731      branch-instructions                                         
        69,373,051      branch-misses             #    0.18% of all branches        

      35.744624633 seconds time elapsed

       0.931959000 seconds user
      34.806501000 seconds sys

可以看到我的實作程式碼有高達 15% 的 cache miss 。

我想利用 cachegrind 來測試是哪個函式發生最多 cache miss ,結果如下:

$ sudo valgrind --tool=cachegrind ./exp 1 > /dev/null
==37329== Cachegrind, a cache and branch-prediction profiler
==37329== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==37329== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==37329== Command: ./exp 1
==37329== 
--37329-- warning: L3 cache found, using its data for the LL simulation.
==37329== 
==37329== I   refs:      69,563,678
==37329== I1  misses:         1,159
==37329== LLi misses:         1,128
==37329== I1  miss rate:       0.00%
==37329== LLi miss rate:       0.00%
==37329== 
==37329== D   refs:      28,270,312  (24,764,894 rd   + 3,505,418 wr)
==37329== D1  misses:         3,323  (     2,585 rd   +       738 wr)
==37329== LLd misses:         2,733  (     2,070 rd   +       663 wr)
==37329== D1  miss rate:        0.0% (       0.0%     +       0.0%  )
==37329== LLd miss rate:        0.0% (       0.0%     +       0.0%  )
==37329== 
==37329== LL refs:            4,482  (     3,744 rd   +       738 wr)
==37329== LL misses:          3,861  (     3,198 rd   +       663 wr)
==37329== LL miss rate:         0.0% (       0.0%     +       0.0%  )

$ sudo cg_annotate cachegrind.out.37329 
--------------------------------------------------------------------------------
I1 cache:         32768 B, 64 B, 8-way associative
D1 cache:         32768 B, 64 B, 8-way associative
LL cache:         6291456 B, 64 B, 12-way associative
Command:          ./exp 1
Data file:        cachegrind.out.37329
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:     
User annotated:   
Auto-annotation:  off

--------------------------------------------------------------------------------
Ir         I1mr  ILmr  Dr         D1mr  DLmr  Dw        D1mw DLmw 
--------------------------------------------------------------------------------
69,563,678 1,159 1,128 24,764,894 2,585 2,070 3,505,418  738  663  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir         I1mr ILmr Dr         D1mr  DLmr Dw        D1mw DLmw  file:function
--------------------------------------------------------------------------------
33,078,041    4    4 15,024,011     1    0 3,012,013    0    0  ???:main
27,000,072    1    1  6,000,016     0    0         0    0    0  /build/glibc-sMfBJT/glibc-2.31/io/../sysdeps/unix/sysv/linux/write.c:write
 6,054,221   27   24  3,027,082    11    0        15    1    1  ???:???
 1,493,991   48   47    398,997    19    6   261,001    6    2  /build/glibc-sMfBJT/glibc-2.31/stdio-common/vfprintf-internal.c:__vfprintf_internal
   648,170    4    4    141,032     4    1   114,012    1    0  /build/glibc-sMfBJT/glibc-2.31/libio/fileops.c:_IO_file_xsputn@@GLIBC_2.2.5
   418,262    3    3     30,733     2    1    24,733    0    0  /build/glibc-sMfBJT/glibc-2.31/stdio-common/_itoa.c:_itoa_word
   228,000    5    5     21,000     3    2         0    0    0  /build/glibc-sMfBJT/glibc-2.31/string/../sysdeps/x86_64/multiarch/strchr-avx2.S:__strchrnul_avx2
   163,977    2    2     29,983     0    0    17,985   63   63  /build/glibc-sMfBJT/glibc-2.31/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms
    90,000    4    4     18,000     3    0    33,000    3    2  /build/glibc-sMfBJT/glibc-2.31/stdio-common/printf.c:printf
    78,000    1    1     24,000     0    0     6,000    0    0  /build/glibc-sMfBJT/glibc-2.31/stdio-common/../libio/libioP.h:__vfprintf_internal
    71,545   10   10     14,316 1,080  893        15    2    0  /build/glibc-sMfBJT/glibc-2.31/elf/dl-addr.c:_dl_addr

cachegrind 似乎無法配合 taskset 使用

改用

sudo taskset -c 1 valgrind --tool=cachegrind ./exp 1 > /dev/null

即可正常使用 cachegrind ,推測為 Valgrind 實作方式的影響

測試結果竟顯示 data cache miss 為 0% ,與從 perf stat 收到的資訊天差地遠。

可以觀察到 cache reference 在 perf 顯示 95,829,386 ;而在 cachegrind 顯示為 28,270,312 。

進行更多次 perf 測試,發現每次的 cache references 竟有巨大落差:
 Performance counter stats for './exp 1':

   112,781,466,006      cycles                                                      
   261,497,199,771      instructions              #    2.32  insn per cycle         
         6,962,452      cache-misses              #   11.023 % of all cache refs    
        63,160,276      cache-references                                            
    38,179,989,631      branch-instructions                                         
        67,070,255      branch-misses             #    0.18% of all branches        

      33.386655206 seconds time elapsed

       0.975406000 seconds user
      32.384290000 seconds sys

 Performance counter stats for './exp 1':

   108,714,836,439      cycles                                                      
   264,555,729,876      instructions              #    2.43  insn per cycle         
         1,520,786      cache-misses              #   14.088 % of all cache refs    
        10,795,259      cache-references                                            
    38,609,436,202      branch-instructions                                         
        63,560,865      branch-misses             #    0.16% of all branches        

      32.059672731 seconds time elapsed

       0.847985000 seconds user
      31.207482000 seconds sys


我不曉得為什麼會出現這樣的情況。

後續發現是測試次數過少而導致的數值不穩定,後面會補充結果。


回去觀摩參考共筆,發現因為我先前閱讀的重點都在運算加速,沒注意到資料搬移優化的部份,嘗試將其納入我的實作:

    for (int currbit = 1 << (32 - __builtin_clzll(k) - 1 - 1); currbit;
         currbit = currbit >> 1) {
        /* compute 2n-1 */
        ubignum_mult(fast[1], fast[1], &fast[0]);
        ubignum_mult(fast[2], fast[2], &fast[3]);
        ubignum_add(fast[0], fast[3], &fast[3]);
        /* compute 2n */
        ubignum_left_shift(fast[1], 1, &fast[4]);
        ubignum_add(fast[4], fast[2], &fast[4]);
        ubignum_mult(fast[4], fast[2], &fast[4]);
        n *= 2;
        if (k & currbit) {
            ubignum_add(fast[3], fast[4], &fast[0]);
            n++;
-           ubignum_copy(fast[2], fast[0]);
-           ubignum_copy(fast[1], fast[4]);
+           ubignum_swapptr(&fast[2], &fast[0]);
+           ubignum_swapptr(&fast[1], &fast[4]);
        } else {
-           ubignum_copy(fast[2], fast[4]);
-           ubignum_copy(fast[1], fast[3]);
+           ubignum_swapptr(&fast[2], &fast[4]);
+           ubignum_swapptr(&fast[1], &fast[3]);
        }
    }

效能差異:

我測出來的結果和參考共筆的結論不符合,並沒有出現效能大幅改善的情形。

perf 測起來仍然非常不穩:

perf 測試
$ sudo perf stat -e cycles,instructions,cache-misses,cache-references,branch-instructions,branch-misses ./exp 1 > /dev/null

 Performance counter stats for './exp 1':

    99,537,523,424      cycles                                                      
   241,725,789,043      instructions              #    2.43  insn per cycle         
         1,579,228      cache-misses              #   53.451 % of all cache refs    
         2,954,538      cache-references                                            
    34,729,788,080      branch-instructions                                         
        72,001,270      branch-misses             #    0.21% of all branches        

      29.372597211 seconds time elapsed

       0.963940000 seconds user
      28.402259000 seconds sys
Performance counter stats for './exp 1':

   110,100,048,298      cycles                                                      
   251,991,005,629      instructions              #    2.29  insn per cycle         
        11,886,534      cache-misses              #   14.939 % of all cache refs    
        79,566,707      cache-references                                            
    36,196,420,296      branch-instructions                                         
        71,695,967      branch-misses             #    0.20% of all branches        

      32.880962908 seconds time elapsed

       0.939414000 seconds user
      31.908111000 seconds sys


嘗試在 perf 命令設置參數 -r 10 來降低測量數值的波動,為避免時間過長,改為測試 fib 到

n=1000
經過這個設定後,每次測試間得到的數值就不再有顯著差異,至少每個 events 的前兩個數字會保持不變。

$ bash measure
...

 Performance counter stats for 'taskset -c 1 ./exp 1' (10 runs):

    23,018,839,194      cycles                                                        ( +-  0.27% )
    49,622,366,443      instructions              #    2.16  insn per cycle           ( +-  0.00% )
         1,634,045      cache-misses              #   20.062 % of all cache refs      ( +-  6.49% )
         8,145,159      cache-references                                              ( +-  5.01% )
     8,319,541,416      branch-instructions                                           ( +-  0.00% )
        13,417,461      branch-misses             #    0.16% of all branches          ( +-  0.18% )

           14.4259 +- 0.0388 seconds time elapsed  ( +-  0.27% )

確立測試方法之後再回去重新測量使用 ubignum_copy() 的版本,得:

$ bash measure
...

 Performance counter stats for 'taskset -c 1 ./exp 1' (10 runs):

    25,215,127,347      cycles                                                        ( +-  2.31% )
    51,971,604,860      instructions              #    2.06  insn per cycle           ( +-  0.00% )
         2,423,932      cache-misses              #   13.441 % of all cache refs      ( +- 18.29% )
        18,033,276      cache-references                                              ( +- 24.11% )
     8,811,912,111      branch-instructions                                           ( +-  0.00% )
        13,058,167      branch-misses             #    0.15% of all branches          ( +-  1.10% )

            15.804 +- 0.367 seconds time elapsed  ( +-  2.32% )

對照之下,從舊版到新版雖然 cache-miss-rate 由 13% 增長到 20% ,但我們可以注意到 cache-references 總數大約降低了一半,我推測這個差距來自於 copy 的過程對資料的 access 。
因此就 data cache 而言,因為省下這些操作,新版仍是更佳的實作方式。

20% 的 cache miss 非常驚人,因此目前最直接的改進目標就是要想辦法降低 miss rate 。


引入 ubignum_square()

 Performance counter stats for 'taskset -c 1 ./exp 1' (10 runs):

    25,405,898,533      cycles                                                        ( +-  1.21% )
    53,288,279,701      instructions              #    2.10  insn per cycle           ( +-  0.00% )
         1,370,567      cache-misses              #   20.309 % of all cache refs      ( +- 34.02% )
         6,748,682      cache-references                                              ( +- 32.89% )
     9,033,980,854      branch-instructions                                           ( +-  0.00% )
        12,907,672      branch-misses             #    0.14% of all branches          ( +-  0.62% )

            15.921 +- 0.193 seconds time elapsed  ( +-  1.21% )

差異實在也不是很明顯,甚至是落後於全數使用 ubignum_mult() 的版本。

雖然我想要嘗試測到大一點的 case ,


執行測試有時會出現頻繁的失敗,不確定原因是什麼,但與引入 ubignum_square() 應該沒有關聯,因為即使改回 mult 的版本也沒有恢復。其錯誤情景如下:

$ sudo perf stat -r 10 -e cycles,instructions,cache-misses,cache-references,branch-instructions,branch-misses taskset -c 0 ./exp 1 > /dev/null
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束

我想看看是否是 fib_fast() 的計算過程中的問題,例如 reallocation 失敗等等,因此嘗試引入 flag 變數來紀錄:

fib_fast()
static ubn *fib_fast(long long k)
{
    ubn *fast[5] = {NULL};
    bool flag = true;
    if (k == 0) {
        flag &= ubignum_init(&fast[0]);
        ubignum_zero(fast[0]);
        return fast[0];
    } else if (k == 1) {
        flag &= ubignum_init(&fast[0]);
        ubignum_uint(fast[0], 1);
        return fast[0];
    }

    for (int i = 0; i < 5; i++)
        flag &= ubignum_init(&fast[i]);
    ubignum_zero(fast[1]);
    ubignum_uint(fast[2], 1);
    int n = 1;
    for (int currbit = 1 << (32 - __builtin_clzll(k) - 1 - 1); currbit;
         currbit = currbit >> 1) {
        /* compute 2n-1 */
        flag &= ubignum_square(fast[1], &fast[0]);
        flag &= ubignum_square(fast[2], &fast[3]);
        //flag &= ubignum_mult(fast[1], fast[1], &fast[0]);
        //flag &= ubignum_mult(fast[2], fast[2], &fast[3]);
        flag &= ubignum_add(fast[0], fast[3], &fast[3]);
        /* compute 2n */
        flag &= ubignum_left_shift(fast[1], 1, &fast[4]);
        flag &= ubignum_add(fast[4], fast[2], &fast[4]);
        flag &= ubignum_mult(fast[4], fast[2], &fast[4]);
        n *= 2;
        if (k & currbit) {
            flag &= ubignum_add(fast[3], fast[4], &fast[0]);
            n++;
            ubignum_swapptr(&fast[2], &fast[0]);
            ubignum_swapptr(&fast[1], &fast[4]);
        } else {
            ubignum_swapptr(&fast[2], &fast[4]);
            ubignum_swapptr(&fast[1], &fast[3]);
        }
    }
    ubignum_free(fast[0]);
    ubignum_free(fast[1]);
    ubignum_free(fast[3]);
    ubignum_free(fast[4]);
    if(!flag)
        printk(KERN_INFO "flag is %d\n", flag);
    return fast[2];
}


執行後於 dmesg 中並未看到任何 flag 相關的報錯,不過出現了一段紅底黑字:

[  990.777318] BUG: unable to handle page fault for address: ffff96d435beb220
[  990.777322] #PF: supervisor read access in kernel mode
[  990.777324] #PF: error_code(0x0000) - not-present page

June Rework

隔一段時間沒有著手於這份專案,回來先進行一些實驗重做等等,並補上一些之前忽略的內容。

由於第一次做的時候認為需要專注於提昇運算 Fibonacci number 的速度,而只測量計算消耗的時間。然而這類 character device 在使用上總是要 read,因此應該也進行實驗量測 user space 到 kernel space 的時間、以及資料複製的時間等等納入成本計算,進行效能的全盤考量,完整地進行實驗才能正確認知到效能瓶頸何在。

user-kernel transition time

時間量測

首先實驗量測由 user mode 進出 kernel mode 的耗時

code
    for (int i = 0; i <= offset; i++) {
        lseek(fd, i, SEEK_SET);
        unsigned long long ucons = 0, kcons = 0;
        for (int j = 0; j < TEST_NUM; j++) {
            struct timespec t1, t2;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            kcons += write(fd, NULL, select);
            clock_gettime(CLOCK_MONOTONIC, &t2);
            ucons += (unsigned long long) (t2.tv_sec * 1e9 + t2.tv_nsec) -
                     (t1.tv_sec * 1e9 + t1.tv_nsec);
        }
        printf("%d,%llu,%llu,%llu\n", i, kcons / TEST_NUM, ucons / TEST_NUM, (ucons - kcons)/TEST_NUM);
    }
/* write operation is skipped */
/* try to measure time within this function */
static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    ubn *N = NULL;  // do not initialize
    ktime_t kt = ktime_get();
    switch (size) {
    case 0:
        N = fib_sequence(*offset);
        break;
    case 1:
        N = fib_fast(*offset);
        break;
    default:
        return 0;
    }
    ubignum_free(N);
    kt = ktime_sub(ktime_get(), kt);
    return (ssize_t) ktime_to_ns(kt);
}


以 iteration 算法測試,並使用較小的 offset 來突顯該數值,兩個時間相減大致可得 (user

kernel)
+
(kernel
user) 的時間。

即使用上了作業說明參考共筆提及的所有方式,仍會產生這種巨大的 peak。從 csv 檔觀察,發現其位於

n=13,43,44,63,64

  • fib(13) = 233
  • fib(43) = 433494437
  • fib(44) = 701408733
  • fib(63) = 6557470319842
  • fib(64) = 10610209857723

第一猜想是使用 ubignum_resize() 導致的消耗,但結果這幾個值看起來也並非是 UINT32_MAXUINT64_MAX,從此 value 上看不出有特別意義。接著又測試幾次,發現每次 peak 都產生在不同地方。

試著排除 peak 點再次作圖,結果發現在其它情況我的量測也不太穩定。

換成另一種叫簡易的方式來測量:將 fib_write() 函式清空並直接把 user space 下的來回時間當成是 transition time。

注意到這時 x 軸

n 不再有意義

無論怎麼做都一直量到 peak,但可以看出這個來回時間幾乎都非常貼近 1000 ns。

為了考察這個問題,我試著把每次測試中過大的的耗時數值印出來,結果發現偶爾會出現 case 有十倍左右的時間消耗。於是我了解到排除統計上的離群值有其必要性,果然是不該偷吃步。

排除統計離群值

在統計的過程中,對於某些不特定的

n 會出現標準差大於平均值的奇怪現象,於是我同樣把過程中的資料全部存下來,再去奇怪的
n
看看出了什麼問題,借助 sprintf() 的小技巧來把資料全部存起來:

        char name[128];
        sprintf(name, "data/%d.dat", i);
        FILE *f = fopen(name, "w");

我這裡跑完時間測量後,看到

n=67
μ=1073,σ=1322
,於是查詢 data/67.dat,原來其中一次的耗時測量到 42656。然而經過離群值的排除,調整後的平均值正常地落在了 1085。

outlier elimination
/* Eliminate 5% outliers and compute the average.
 */
uint64_t average(int64_t *nums, int numsSize)
{
    int64_t mu = 0;
    for (int i = 0; i < numsSize; i++)
        mu += nums[i];
    mu /= numsSize;
    int64_t deviation = 0;
    for (int i = 0; i < numsSize; i++)
        deviation += (nums[i] - mu) * (nums[i] - mu) / numsSize;
    deviation = sqrt(deviation);
    for (int i = 0; i < numsSize; i++)
        if (nums[i] >= mu + 2 * deviation || nums[i] <= mu - 2 * deviation)
            nums[i] = -1;
    // printf("%ld, %ld\t", mu, deviation);
    uint64_t ans = 0;
    for (int i = 0; i < numsSize; i++)
        if (nums[i] != -1)
            ans += nums[i];
    return ans / (numsSize / 20 * 19);
}

其中 sqrt() 為我自己實作的整數開根號函式。


至此我們成功排除掉了 peak 的問題,於是耗時的細微變化便可在圖表上呈現出來。

圖表的刻度變小之後,可以看出測量上有 100 ~ 300 ns 的起伏,仍無法達到如 KYG 同學的平滑結果。這裡由於任何

n 的輸入都是直接 return 沒有計算,因此最理想的狀況應該要是一條水平線。

打算做一個 commit 時收到警告訊息:

Dangerous function detected in /home/tinyynoob/code/fibdrv/exp.c
54:        sprintf(name, "data/%d.dat", i);

於是改為使用 snprint() 並順利完成 commit。

於 desktop 量測

由於怎麼測總是拿到漂浮的數據,我想 laptop 或許天生就較為不穩定,因此我想從 laptop 改成用 desktop 跑跑看是否會出現相同的結果。

cpu info
架構:                           x86_64
CPU 作業模式:                   32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
每核心執行緒數:                 1
每通訊端核心數:                 4
Socket(s):                       1
NUMA 節點:                      1
供應商識別號:                   GenuineIntel
CPU 家族:                       6
型號:                           94
Model name:                      Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
製程:                           3
CPU MHz:                        2511.819
CPU max MHz:                     3600.0000
CPU min MHz:                     800.0000
BogoMIPS:                        6399.96
虛擬:                           VT-x
L1d 快取:                       128 KiB
L1i 快取:                       128 KiB
L2 快取:                        1 MiB
L3 快取:                        6 MiB
NUMA node0 CPU(s):              0-3
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cach
                                 e flushes, SMT disabled
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled v
                                 ia prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user
                                  pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_
                                 FW, STIBP disabled, RSB filling
Vulnerability Srbds:             Mitigation; Microcode
Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtr
                                 r pge mca cmov pat pse36 clflush dts acpi mmx f
                                 xsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rd
                                 tscp lm constant_tsc art arch_perfmon pebs bts 
                                 rep_good nopl xtopology nonstop_tsc cpuid aperf
                                 mperf pni pclmulqdq dtes64 monitor ds_cpl vmx s
                                 mx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid s
                                 se4_1 sse4_2 x2apic movbe popcnt tsc_deadline_t
                                 imer aes xsave avx f16c rdrand lahf_lm abm 3dno
                                 wprefetch cpuid_fault invpcid_single pti ssbd i
                                 brs ibpb stibp tpr_shadow vnmi flexpriority ept
                                  vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 
                                 smep bmi2 erms invpcid rtm mpx rdseed adx smap 
                                 clflushopt intel_pt xsaveopt xsavec xgetbv1 xsa
                                 ves dtherm ida arat pln pts hwp hwp_notify hwp_
                                 act_window hwp_epp md_clear flush_l1d


然而同樣的程式碼和 scripts 移過來執行竟然出現 segmentation fault。

我使用 gdb 工具查詢,發現出錯在 fprintf() 處,我的 f 意圖打開目錄 data 下的檔案,然而不存在這個 directory 因此產生記憶體區段錯誤,mkdir data 後解決。

跑出來的圖表:

雖然看似仍有很大的浮動,但仔細觀察可以注意到 y 軸的刻度由 50 降至 10,確實穩定了不少。從比較大的 scale 看是像這樣子:

雖然還沒辦法變成完全的水平線,但已從最初上千 ns 的 peak 改善許多。

版本 commit id:b80d536b08bd1a3c3325b53d684f97e39ac734f9

接著再把之前從 fib_write() 清空的內容補回進行測試:

可以看出無論

n 的變化,user space 到 kernel space 的來回時間穩定落在 600 ns 左右。

即使一路算到

n=500 也是如此:

版本 id:766a4b80f6fbcf8a0f5ee0931113b16e6ea629ec

運算時間

繼續使用 desktop 來重新量測兩種 Fibonacci number 算法在 kernel space 的耗時:

像之前一樣看到我實作的 fast fib 算法會出現這種階梯形的跳躍現象,這次我想要好好研究這些階梯的成因。

先從小一點的 case 開始探討:

fast.csv 查看發現跳點出現在

n=2,4,8,16,32,64,每次往上跳 500 或 1000 ns。每個跳躍竟然都恰好落在 2 的冪,這肯定有很好的理由可以解釋。

仔細想想之後也很直觀,因為 fast fib 算法每次迴圈都會將

n 加倍,因此需要
log(n)
次 iteration 以算完要求結果。而這麼解釋就代表每個跳躍便是 fast doubling 算法 each iteration 的耗時,從 fast_fib() 程式碼也可以看出有個迴圈:

    int n = 1;
    for (int currbit = 1 << (32 - __builtin_clzll(k) - 1 - 1); currbit;
         currbit = currbit >> 1) {
        /* compute 2n-1 */
        flag &= ubignum_square(fast[1], &fast[0]);
        flag &= ubignum_square(fast[2], &fast[3]);
        flag &= ubignum_add(fast[0], fast[3], &fast[3]);
        /* compute 2n */
        flag &= ubignum_left_shift(fast[1], 1, &fast[4]);
        flag &= ubignum_add(fast[4], fast[2], &fast[4]);
        flag &= ubignum_mult(fast[4], fast[2], &fast[4]);
        n *= 2;
        if (k & currbit) {
            flag &= ubignum_add(fast[3], fast[4], &fast[0]);
            n++;
            ubignum_swapptr(&fast[2], &fast[0]);
            ubignum_swapptr(&fast[1], &fast[4]);
        } else {
            ubignum_swapptr(&fast[2], &fast[4]);
            ubignum_swapptr(&fast[1], &fast[3]);
        }
    }

在這裡可以簡單比較兩種算法的特性如下:

sequencial fast doubling
迭代次數
n
log(n)
每輪迭代的複雜度 低,約 16 ns 高,約 834 ns

該數字使用

n=0
n=500
的測量結果算得

比較

KYG 同學

為了確認自己實作的水準,clone KYG 同學的 fibdrv 進行對比,使用 optimize-bn 分支在我的電腦上執行 make plot,結果超乎我的預期:

耗時僅僅我的

1/10,顯然我的程式碼有不只一點改進空間。此外,為何使用相同的設定,它的程式就能有非常平滑的結果,而我卻會一直抖動,這也需要再好好研究。

但仔細去看程式碼發現其測量有誤,我 clone 到的版本並非 bignum 的實作。接著繼續閱讀之後,發現同學並未把自己的最新版本推上 github,因此我無法在自己的電腦上與其進行耗時比較。

需要找找看哪個同學的程式碼適合拿來做效能對比

研讀 0xff07 同學的實作

研讀同學的實作,看看是否有值得借鏡的地方。

apm_internal.h

此 apm 取的是 Arbitrary-Precision arithmetic 的意思

首先我看到同學將 inline assembly 的部份包裝成 macro:

#if defined(__amd64__) || defined(__x86_64__)
#define digit_mul(u, v, hi, lo) \
    __asm__("mulq %3" : "=a"(lo), "=d"(hi) : "%0"(u), "rm"(v))
#define digit_div(n1, n0, d, q, r) \
    __asm__("divq %4" : "=a"(q), "=d"(r) : "0"(n0), "1"(n1), "rm"(d))
#endif

我認為這是非常好的想法,由於 inline assembly 的語法邏輯跟 C 的邏輯有些不同,因此包裝起來可以使人閱讀時統一依 C 的邏輯理解,較為易讀,同時也方便重複使用。

同時他在無法使用 x86 assembly 的情況下仍提供了對應實作:

#ifndef digit_mul
#define digit_mul(u, v, hi, lo)                    \
    do {                                           \
        apm_digit __u0 = (u);                      \
        apm_digit __u1 = __u0 >> APM_DIGIT_HSHIFT; \
        __u0 &= APM_DIGIT_LMASK;                   \
        apm_digit __v0 = (v);                      \
        apm_digit __v1 = __v0 >> APM_DIGIT_HSHIFT; \
        __v0 &= APM_DIGIT_LMASK;                   \
        apm_digit __lo = __u0 * __v0;              \
        apm_digit __hi = __u1 * __v1;              \
        __u1 *= __v0;                              \
        __v0 = __u1 << APM_DIGIT_HSHIFT;           \
        __u1 >>= APM_DIGIT_HSHIFT;                 \
        __hi += __u1 + ((__lo += __v0) < __v0);    \
        __v1 *= __u0;                              \
        __u0 = __v1 << APM_DIGIT_HSHIFT;           \
        __v1 >>= APM_DIGIT_HSHIFT;                 \
        __hi += __v1 + ((__lo += __u0) < __u0);    \
        (lo) = __lo;                               \
        (hi) = __hi;                               \
    } while (0)
#endif

apm.c apm.h

這部份提供許多大數的 operations 包括:

  • increment/decrement
  • add
  • sub
  • dmul,計算 partial product 的函式
  • shift
  • cmp

預設是 uv 運算後將結果放到 w,其中沒看到計算 apm 乘法的函式,僅有計算 partial product 的 dmul()

apm 乘法被獨立寫在 mul.c

  • i postfix 的 operations 似乎是代表 in-place,都是 u, v 運算後放到 u
  • n postfix 的 operations 不確定語意上是代表哪個英文單字,不過其共通點為 u, v 兩數的初始 size 相同

mul.c

在其乘法實作當中混用了兩種乘法

短碼時使用直式乘法,長碼時使用 Karatsuba,使兩者各展所長。

Karatsuba multiplication 解說

假設我們要作

x×y,那麼藉由分解較大的
x
y
,可以達到:

x×y=(x1×2b+x0)×(y1×2b+y0)=(x1y1)22b+(x1y0+x0y1)2b+(x0y0)

(x1y0+x0y1)=(x1+x0)×(y1+y0)x1y1x0y0,可將乘法次數由 4 次降到 3 次。

此外每個

xiyi 又呈現
x×y
的形式,因此可以再次遞迴下去,總結而言這是個 divide and conquer 的算法。

bn.h

此部份將前面大數相關的 type 使用 struct 包起來:

typedef struct {
    apm_digit *digits; /* Digits of number. */
    apm_size size;     /* Length of number. */
    apm_size alloc;    /* Size of allocation. */
    unsigned sign : 1; /* Sign bit. */
} bn, bn_t[1];

並使用 apm 的介面提供一系列的 arithmetic operations,再次包裝成 bn 介面。