# 2022q1 Homework3 (fibdrv)
contributed by < [`tinyynoob`](https://github.com/tinyynoob) >
> [作業要求](https://hackmd.io/@sysprog/linux2022-fibdrv)
> 最近的方向可以跳到 [June rework](https://hackmd.io/MJ1kMQBLRBCuo08yEmhDeQ?view#June-Rework) 開始看
> 因為一篇筆記放不下了,再開一篇:[fibdrv 2](https://hackmd.io/@tinyynoob/linux2022q1-fibdrv2)
## 研讀資料
### Fibonacci 相關性質
> https://hackmd.io/@sysprog/fibonacci-number
> [Algorithms for Computing Fibonacci Numbers Quickly](https://ir.library.oregonstate.edu/downloads/t435gg51w)
[Fast doubling 算法](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling)
$$\begin{aligned}
F_{2n+1} &= F_{n+1}^2 + F_n^2 \\
F_{2n} &= F_n\cdot(2F_{n+1}-F_n)
\end{aligned}
$$
作業說明中有給出 pseudocode :
```c
Fast_Fib(n)
a = 0; b = 1; // m = 0
for i = (number of binary digit in n) to 1
t1 = a*(2*b - a);
t2 = b^2 + a^2;
a = t1; b = t2; // m *= 2
if (current binary digit == 1)
t1 = a + b; // m++
a = b; b = t1;
return a;
```
注意到這個 `^` 是指 power 而非 xor 。
目前看不出 `clz()` 能對計算 Fib 做到什麼強力優化,似乎只能加快 initialize `i` 而已,需要再研究研究。
:::info
待
:::
### Follow 《The Linux Kernel Module Programming Guide》
> 將自己研讀的摘要整理到另一篇筆記: https://hackmd.io/@tinyynoob/Hk89qPw-9
本地版本:
```shell
$ apt-cache search linux-headers-`uname -r`
linux-headers-5.13.0-30-generic - Linux kernel headers for version 5.13.0 on 64 bit x86 SMP
```
根據書中範例嘗試建立 **hello-1.c** 及對應的 **makefile** 。
make 時馬上出現問題:
:::spoiler log
```shell
$ make -p | grep PWD
scripts/Makefile.build:44: /home/tinyynoob/code/kernel/hello-1/Makefile: 沒有此一檔案或目錄
make[2]: *** 沒有規則可製作目標「/home/tinyynoob/code/kernel/hello-1/Makefile」。 停止。
OLDPWD = /home/tinyynoob/code/kernel
PWD = /home/tinyynoob/code/kernel/hello-1
CONFIG_HPWDT_NMI_DECODING = y
make[1]: *** [Makefile:1879:/home/tinyynoob/code/kernel/hello-1] 錯誤 2
OLDPWD = /home/tinyynoob/code/kernel
PWD = /home/tinyynoob/code/kernel/hello-1
CONFIG_HPWDT_NMI_DECODING = y
@echo ' Syntax: make -C path/to/kernel/src M=$$PWD target'
make: *** [makefile:7:all] 錯誤 2
PWD := /home/tinyynoob/code/kernel/hello-1
OLDPWD = /home/tinyynoob/code/kernel
echo $((PWD)
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
```
:::
\
發現原來檔名把 `Makefile` 寫成 `makefile` 也不行,更改後得到:
:::spoiler log
```shell
$ make
make -C /lib/modules/5.13.0-30-generic/build M=/home/tinyynoob/code/kernel/hello-1 modules
make[1]: 進入目錄「/usr/src/linux-headers-5.13.0-30-generic」
CC [M] /home/tinyynoob/code/kernel/hello-1/hello-1.o
MODPOST /home/tinyynoob/code/kernel/hello-1/Module.symvers
CC [M] /home/tinyynoob/code/kernel/hello-1/hello-1.mod.o
LD [M] /home/tinyynoob/code/kernel/hello-1/hello-1.ko
BTF [M] /home/tinyynoob/code/kernel/hello-1/hello-1.ko
Skipping BTF generation for /home/tinyynoob/code/kernel/hello-1/hello-1.ko due to unavailability of vmlinux
make[1]: 離開目錄「/usr/src/linux-headers-5.13.0-30-generic」
```
:::
\
看似往正確的方向前進了,也有產出 **hello-1.ko** ,但沒有完全成功,它在 `vmlinux` 遇到問題
:::info
`vmlinux` 問題尚未解決
:::
根據 `modinfo` 可以發現在 **hello-1.ko** 的 vermagic 欄位有 `modversions` 。回去閱讀書中提到 Modversioning 的部份,得知我可能需要關掉它。
參考[此](https://bugs.launchpad.net/ubuntu/+source/linux-raspi2/+bug/1863245),試著進到 /boot/config-5.13.0-30-generic 關掉 `CONFIG_MODVERSIONS` 。
:::info
關掉 modversions 但尚未 recompile kernel
:::
把 **hello-1.c** 裡的 return 值改成負數會導致:
```shell
insmod: ERROR: could not insert module hello-1.ko: Operation not permitted
```
### 研讀數字字串運算範例程式
> [source code](https://github.com/AdrianHuang/fibdrv/tree/big_number)
數字字串在運算時可以因應字串 size 選擇不同的策略。
範例程式使用的字串使用 union 定義如下:
```c=
typedef union {
/* allow strings up to 15 bytes to stay on the stack
* use the last byte as a null terminator and to store flags
* much like fbstring:
* https://github.com/facebook/folly/blob/master/folly/docs/FBString.md
*/
char data[16];
struct {
uint8_t filler[15],
/* how many free bytes in this stack allocated string
* same idea as fbstring
*/
space_left : 4,
/* if it is on heap, set to 1 */
is_ptr : 1, is_large_string : 1, flag2 : 1, flag3 : 1;
};
/* heap allocated */
struct {
char *ptr;
/* supports strings up to 2^MAX_STR_LEN_BITS - 1 bytes */
size_t size : MAX_STR_LEN_BITS,
/* capacity is always a power of 2 (unsigned)-1 */
capacity : 6;
/* the last 4 bits are important flags */
};
} xs;
```
- 在短碼的情況會使用第 9 行的 struct ,直接將資料字串存在 `filler` 並把一些 flag 資訊放在最後一個 byte
- 在長碼的情況會使用第 20 行的 struct ,使用一個 `ptr` 指標並配置空間在 heap ,後面的 8 byte 拿來存放一些 flag
:::warning
還沒完全理解在長碼情況下,
1. 如何繼續使用 `is_ptr` flag
2. `capacity` 使用的詳細狀況
:::
範例程式在 `fib_sequence()` 計算 Fib ,算法使用逐項相加:
```c
for (i = 2; i <= k; i++)
string_number_add(&f[i - 1], &f[i - 2], &f[i]);
```
因此若是我們能改善乘法及平方的複雜度,則可以使用 fast doubling 的技巧來改善目前算法。
### Linux 效能分析
嘗試查看 CPU affinity :
```shell
$ taskset -p 8528
pid 8528 目前的近似者遮罩:ff
```
ff 的二進位為 11111111 ,代表這個 process 可以在 7~0 任意一個 CPU 上執行。試著更改:
```shell
$ taskset -p FA 8528
pid 8528 目前的近似者遮罩:ff
pid 8528 新的近似者遮罩:fa
```
根據[網路上的教學](https://charleslin74.pixnet.net/blog/post/405173965),在 */boot/grub/grub.cfg* 中加入 `isolcpus=0,1` 。
最後,完成 [排除干擾效能分析的因素](https://hackmd.io/PMozqaxqRA-jF2KmgTZcPQ?view#%E6%8E%92%E9%99%A4%E5%B9%B2%E6%93%BE%E6%95%88%E8%83%BD%E5%88%86%E6%9E%90%E7%9A%84%E5%9B%A0%E7%B4%A0) 中的所有設定。
reboot 之後查看 process 發現 core mask 變成了 fc :
```shell
$ taskset -p 1
pid 1 目前的近似者遮罩:fc
```
雖然有些已經變成 fc 但有些仍顯示 ff ,因此重新依據[參考共筆](https://hackmd.io/@KYWeng/rkGdultSU)進行設定。結果仍失敗,目前的在設定 smp_affinity 上因腳本錯誤遇到困難。
---
Fork
嘗試掛載模組之後輸入 `sudo ./client` 可以得到
:::spoiler 結果
```shell
Reading from /dev/fibonacci at offset 0, returned the sequence 0.
Reading from /dev/fibonacci at offset 1, returned the sequence 1.
Reading from /dev/fibonacci at offset 2, returned the sequence 1.
Reading from /dev/fibonacci at offset 3, returned the sequence 2.
Reading from /dev/fibonacci at offset 4, returned the sequence 3.
Reading from /dev/fibonacci at offset 5, returned the sequence 5.
Reading from /dev/fibonacci at offset 6, returned the sequence 8.
Reading from /dev/fibonacci at offset 7, returned the sequence 13.
Reading from /dev/fibonacci at offset 8, returned the sequence 21.
Reading from /dev/fibonacci at offset 9, returned the sequence 34.
Reading from /dev/fibonacci at offset 10, returned the sequence 55.
Reading from /dev/fibonacci at offset 11, returned the sequence 89.
Reading from /dev/fibonacci at offset 12, returned the sequence 144.
Reading from /dev/fibonacci at offset 13, returned the sequence 233.
Reading from /dev/fibonacci at offset 14, returned the sequence 377.
Reading from /dev/fibonacci at offset 15, returned the sequence 610.
Reading from /dev/fibonacci at offset 16, returned the sequence 987.
Reading from /dev/fibonacci at offset 17, returned the sequence 1597.
Reading from /dev/fibonacci at offset 18, returned the sequence 2584.
Reading from /dev/fibonacci at offset 19, returned the sequence 4181.
Reading from /dev/fibonacci at offset 20, returned the sequence 6765.
Reading from /dev/fibonacci at offset 21, returned the sequence 10946.
Reading from /dev/fibonacci at offset 22, returned the sequence 17711.
Reading from /dev/fibonacci at offset 23, returned the sequence 28657.
Reading from /dev/fibonacci at offset 24, returned the sequence 46368.
Reading from /dev/fibonacci at offset 25, returned the sequence 75025.
Reading from /dev/fibonacci at offset 26, returned the sequence 121393.
Reading from /dev/fibonacci at offset 27, returned the sequence 196418.
Reading from /dev/fibonacci at offset 28, returned the sequence 317811.
Reading from /dev/fibonacci at offset 29, returned the sequence 514229.
Reading from /dev/fibonacci at offset 30, returned the sequence 832040.
Reading from /dev/fibonacci at offset 31, returned the sequence 1346269.
Reading from /dev/fibonacci at offset 32, returned the sequence 2178309.
Reading from /dev/fibonacci at offset 33, returned the sequence 3524578.
Reading from /dev/fibonacci at offset 34, returned the sequence 5702887.
Reading from /dev/fibonacci at offset 35, returned the sequence 9227465.
Reading from /dev/fibonacci at offset 36, returned the sequence 14930352.
Reading from /dev/fibonacci at offset 37, returned the sequence 24157817.
Reading from /dev/fibonacci at offset 38, returned the sequence 39088169.
Reading from /dev/fibonacci at offset 39, returned the sequence 63245986.
Reading from /dev/fibonacci at offset 40, returned the sequence 102334155.
Reading from /dev/fibonacci at offset 41, returned the sequence 165580141.
Reading from /dev/fibonacci at offset 42, returned the sequence 267914296.
Reading from /dev/fibonacci at offset 43, returned the sequence 433494437.
Reading from /dev/fibonacci at offset 44, returned the sequence 701408733.
Reading from /dev/fibonacci at offset 45, returned the sequence 1134903170.
Reading from /dev/fibonacci at offset 46, returned the sequence 1836311903.
Reading from /dev/fibonacci at offset 47, returned the sequence 2971215073.
Reading from /dev/fibonacci at offset 48, returned the sequence 4807526976.
Reading from /dev/fibonacci at offset 49, returned the sequence 7778742049.
Reading from /dev/fibonacci at offset 50, returned the sequence 12586269025.
Reading from /dev/fibonacci at offset 51, returned the sequence 20365011074.
Reading from /dev/fibonacci at offset 52, returned the sequence 32951280099.
Reading from /dev/fibonacci at offset 53, returned the sequence 53316291173.
Reading from /dev/fibonacci at offset 54, returned the sequence 86267571272.
Reading from /dev/fibonacci at offset 55, returned the sequence 139583862445.
Reading from /dev/fibonacci at offset 56, returned the sequence 225851433717.
Reading from /dev/fibonacci at offset 57, returned the sequence 365435296162.
Reading from /dev/fibonacci at offset 58, returned the sequence 591286729879.
Reading from /dev/fibonacci at offset 59, returned the sequence 956722026041.
Reading from /dev/fibonacci at offset 60, returned the sequence 1548008755920.
Reading from /dev/fibonacci at offset 61, returned the sequence 2504730781961.
Reading from /dev/fibonacci at offset 62, returned the sequence 4052739537881.
Reading from /dev/fibonacci at offset 63, returned the sequence 6557470319842.
Reading from /dev/fibonacci at offset 64, returned the sequence 10610209857723.
Reading from /dev/fibonacci at offset 65, returned the sequence 17167680177565.
Reading from /dev/fibonacci at offset 66, returned the sequence 27777890035288.
Reading from /dev/fibonacci at offset 67, returned the sequence 44945570212853.
Reading from /dev/fibonacci at offset 68, returned the sequence 72723460248141.
Reading from /dev/fibonacci at offset 69, returned the sequence 117669030460994.
Reading from /dev/fibonacci at offset 70, returned the sequence 190392490709135.
Reading from /dev/fibonacci at offset 71, returned the sequence 308061521170129.
Reading from /dev/fibonacci at offset 72, returned the sequence 498454011879264.
Reading from /dev/fibonacci at offset 73, returned the sequence 806515533049393.
Reading from /dev/fibonacci at offset 74, returned the sequence 1304969544928657.
Reading from /dev/fibonacci at offset 75, returned the sequence 2111485077978050.
Reading from /dev/fibonacci at offset 76, returned the sequence 3416454622906707.
Reading from /dev/fibonacci at offset 77, returned the sequence 5527939700884757.
Reading from /dev/fibonacci at offset 78, returned the sequence 8944394323791464.
Reading from /dev/fibonacci at offset 79, returned the sequence 14472334024676221.
Reading from /dev/fibonacci at offset 80, returned the sequence 23416728348467685.
Reading from /dev/fibonacci at offset 81, returned the sequence 37889062373143906.
Reading from /dev/fibonacci at offset 82, returned the sequence 61305790721611591.
Reading from /dev/fibonacci at offset 83, returned the sequence 99194853094755497.
Reading from /dev/fibonacci at offset 84, returned the sequence 160500643816367088.
Reading from /dev/fibonacci at offset 85, returned the sequence 259695496911122585.
Reading from /dev/fibonacci at offset 86, returned the sequence 420196140727489673.
Reading from /dev/fibonacci at offset 87, returned the sequence 679891637638612258.
Reading from /dev/fibonacci at offset 88, returned the sequence 1100087778366101931.
Reading from /dev/fibonacci at offset 89, returned the sequence 1779979416004714189.
Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120.
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 93, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 94, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 95, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 96, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 97, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 98, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 98, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 97, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 96, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 95, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 94, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 93, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120.
Reading from /dev/fibonacci at offset 89, returned the sequence 1779979416004714189.
Reading from /dev/fibonacci at offset 88, returned the sequence 1100087778366101931.
Reading from /dev/fibonacci at offset 87, returned the sequence 679891637638612258.
Reading from /dev/fibonacci at offset 86, returned the sequence 420196140727489673.
Reading from /dev/fibonacci at offset 85, returned the sequence 259695496911122585.
Reading from /dev/fibonacci at offset 84, returned the sequence 160500643816367088.
Reading from /dev/fibonacci at offset 83, returned the sequence 99194853094755497.
Reading from /dev/fibonacci at offset 82, returned the sequence 61305790721611591.
Reading from /dev/fibonacci at offset 81, returned the sequence 37889062373143906.
Reading from /dev/fibonacci at offset 80, returned the sequence 23416728348467685.
Reading from /dev/fibonacci at offset 79, returned the sequence 14472334024676221.
Reading from /dev/fibonacci at offset 78, returned the sequence 8944394323791464.
Reading from /dev/fibonacci at offset 77, returned the sequence 5527939700884757.
Reading from /dev/fibonacci at offset 76, returned the sequence 3416454622906707.
Reading from /dev/fibonacci at offset 75, returned the sequence 2111485077978050.
Reading from /dev/fibonacci at offset 74, returned the sequence 1304969544928657.
Reading from /dev/fibonacci at offset 73, returned the sequence 806515533049393.
Reading from /dev/fibonacci at offset 72, returned the sequence 498454011879264.
Reading from /dev/fibonacci at offset 71, returned the sequence 308061521170129.
Reading from /dev/fibonacci at offset 70, returned the sequence 190392490709135.
Reading from /dev/fibonacci at offset 69, returned the sequence 117669030460994.
Reading from /dev/fibonacci at offset 68, returned the sequence 72723460248141.
Reading from /dev/fibonacci at offset 67, returned the sequence 44945570212853.
Reading from /dev/fibonacci at offset 66, returned the sequence 27777890035288.
Reading from /dev/fibonacci at offset 65, returned the sequence 17167680177565.
Reading from /dev/fibonacci at offset 64, returned the sequence 10610209857723.
Reading from /dev/fibonacci at offset 63, returned the sequence 6557470319842.
Reading from /dev/fibonacci at offset 62, returned the sequence 4052739537881.
Reading from /dev/fibonacci at offset 61, returned the sequence 2504730781961.
Reading from /dev/fibonacci at offset 60, returned the sequence 1548008755920.
Reading from /dev/fibonacci at offset 59, returned the sequence 956722026041.
Reading from /dev/fibonacci at offset 58, returned the sequence 591286729879.
Reading from /dev/fibonacci at offset 57, returned the sequence 365435296162.
Reading from /dev/fibonacci at offset 56, returned the sequence 225851433717.
Reading from /dev/fibonacci at offset 55, returned the sequence 139583862445.
Reading from /dev/fibonacci at offset 54, returned the sequence 86267571272.
Reading from /dev/fibonacci at offset 53, returned the sequence 53316291173.
Reading from /dev/fibonacci at offset 52, returned the sequence 32951280099.
Reading from /dev/fibonacci at offset 51, returned the sequence 20365011074.
Reading from /dev/fibonacci at offset 50, returned the sequence 12586269025.
Reading from /dev/fibonacci at offset 49, returned the sequence 7778742049.
Reading from /dev/fibonacci at offset 48, returned the sequence 4807526976.
Reading from /dev/fibonacci at offset 47, returned the sequence 2971215073.
Reading from /dev/fibonacci at offset 46, returned the sequence 1836311903.
Reading from /dev/fibonacci at offset 45, returned the sequence 1134903170.
Reading from /dev/fibonacci at offset 44, returned the sequence 701408733.
Reading from /dev/fibonacci at offset 43, returned the sequence 433494437.
Reading from /dev/fibonacci at offset 42, returned the sequence 267914296.
Reading from /dev/fibonacci at offset 41, returned the sequence 165580141.
Reading from /dev/fibonacci at offset 40, returned the sequence 102334155.
Reading from /dev/fibonacci at offset 39, returned the sequence 63245986.
Reading from /dev/fibonacci at offset 38, returned the sequence 39088169.
Reading from /dev/fibonacci at offset 37, returned the sequence 24157817.
Reading from /dev/fibonacci at offset 36, returned the sequence 14930352.
Reading from /dev/fibonacci at offset 35, returned the sequence 9227465.
Reading from /dev/fibonacci at offset 34, returned the sequence 5702887.
Reading from /dev/fibonacci at offset 33, returned the sequence 3524578.
Reading from /dev/fibonacci at offset 32, returned the sequence 2178309.
Reading from /dev/fibonacci at offset 31, returned the sequence 1346269.
Reading from /dev/fibonacci at offset 30, returned the sequence 832040.
Reading from /dev/fibonacci at offset 29, returned the sequence 514229.
Reading from /dev/fibonacci at offset 28, returned the sequence 317811.
Reading from /dev/fibonacci at offset 27, returned the sequence 196418.
Reading from /dev/fibonacci at offset 26, returned the sequence 121393.
Reading from /dev/fibonacci at offset 25, returned the sequence 75025.
Reading from /dev/fibonacci at offset 24, returned the sequence 46368.
Reading from /dev/fibonacci at offset 23, returned the sequence 28657.
Reading from /dev/fibonacci at offset 22, returned the sequence 17711.
Reading from /dev/fibonacci at offset 21, returned the sequence 10946.
Reading from /dev/fibonacci at offset 20, returned the sequence 6765.
Reading from /dev/fibonacci at offset 19, returned the sequence 4181.
Reading from /dev/fibonacci at offset 18, returned the sequence 2584.
Reading from /dev/fibonacci at offset 17, returned the sequence 1597.
Reading from /dev/fibonacci at offset 16, returned the sequence 987.
Reading from /dev/fibonacci at offset 15, returned the sequence 610.
Reading from /dev/fibonacci at offset 14, returned the sequence 377.
Reading from /dev/fibonacci at offset 13, returned the sequence 233.
Reading from /dev/fibonacci at offset 12, returned the sequence 144.
Reading from /dev/fibonacci at offset 11, returned the sequence 89.
Reading from /dev/fibonacci at offset 10, returned the sequence 55.
Reading from /dev/fibonacci at offset 9, returned the sequence 34.
Reading from /dev/fibonacci at offset 8, returned the sequence 21.
Reading from /dev/fibonacci at offset 7, returned the sequence 13.
Reading from /dev/fibonacci at offset 6, returned the sequence 8.
Reading from /dev/fibonacci at offset 5, returned the sequence 5.
Reading from /dev/fibonacci at offset 4, returned the sequence 3.
Reading from /dev/fibonacci at offset 3, returned the sequence 2.
Reading from /dev/fibonacci at offset 2, returned the sequence 1.
Reading from /dev/fibonacci at offset 1, returned the sequence 1.
Reading from /dev/fibonacci at offset 0, returned the sequence 0.
```
:::
\
可以發現到 92 項以後數值就不再改變,代表這裡存在嚴重缺陷,對於超過 64 位元的數值無能為力。
另外在輸入指令
```shell
$ ls -l /dev/fibonacci
```
後,不是得到 235 也不是得到 236 而是 505
```shell
crw------- 1 root root 505, 0 3月 11 13:10 /dev/fibonacci
```
> [ktime doc](https://www.kernel.org/doc/html/latest/core-api/timekeeping.html)
看[參考共筆](https://hackmd.io/@KYWeng/rkGdultSU)依樣畫葫蘆,由於目前 **fibdrv.c** 的 `fib_write()` 沒有功能,所以在其中塞入測量時間的程式碼:
```c
static ssize_t fib_write(struct file *file,
const char *buf,
size_t size,
loff_t *offset)
{
ktime_t kt;
kt = ktime_get();
fib_sequence(*offset);
kt = ktime_sub(ktime_get(), kt);
return (ssize_t) ktime_to_ns(kt);
}
```
然而輸出的數值看起來都差不多,回頭想找找看這個 `offset` 到底是怎麼設的。我發現在 **client.c** 呼叫時傳入的參數並不包含 `offset` ,不過在呼叫 `write()` 跟 `read()` 時有個不同是 `read()` 多了一行:
```c
lseek(fd, i, SEEK_SET);
```
其中 `SEEK_SET` 將被展開為 0 。觀察發現呼叫 `lseek()` 會連到 **fibdrv.c** 中的:
```c
static loff_t fib_device_lseek(struct file *file, loff_t offset, int orig)
{
loff_t new_pos = 0;
switch (orig) {
case 0: /* SEEK_SET: */
new_pos = offset;
break;
case 1: /* SEEK_CUR: */
new_pos = file->f_pos + offset;
break;
case 2: /* SEEK_END: */
new_pos = MAX_LENGTH - offset;
break;
}
if (new_pos > MAX_LENGTH)
new_pos = MAX_LENGTH; // max case
if (new_pos < 0)
new_pos = 0; // min case
file->f_pos = new_pos; // This is what we'll use now
return new_pos;
}
```
雖然可以猜測這個函式的回傳值會更改 `offset` ,但是目前不知道這到底是如何達成的,我甚至不曉得 `offset` 到底存在哪裡。
:::info
TODO:
- 釐清 `offset`
答:
我認為就是存在 file 結構的 `f_pos` 成員中
:::
### 研讀參考共筆中的大數運算
> https://hackmd.io/@KYWeng/rkGdultSU#%E5%A6%82%E4%BD%95%E6%B8%9B%E5%B0%91%E5%A4%A7%E6%95%B8%E9%81%8B%E7%AE%97%E7%9A%84%E6%88%90%E6%9C%AC
這份共筆提出多種實作上改進計算 Fib 速度的技巧,摘要如下:
#### 調整 Fast Fib 算法
將原先的算法由
$$\begin{aligned}
F_{2n+1} &= F_{n+1}^2 + F_n^2 \\
F_{2n} &= F_n\cdot(2F_{n+1}-F_n)
\end{aligned}
$$
改良為:
$$\begin{aligned}
F_{2n-1} &= F_n^2 + F_{n-1}^2 \\
F_{2n} &= F_n\cdot(2F_{n-1}+F_n)
\end{aligned}
$$
以迴避所有減法運算,並且不會再出現任何負數,可以全部用 unsigned 型態儲存。
#### 使用 memory pool
在大數結構中引入 memory pool 的概念來避免頻繁的 `realloc()` 。
#### 使用內嵌組合語言加速乘法
做乘法運算時直接指定高低位的位置,而不是使用擴展整數型別。
以參考共筆中的語法為例:
```
__asm__("mulq %3" : "=a"(low), "=d"(high) : "%0"(a[i]), "rm"(k));
assembly language : output : input
```
其中 instruction 顯然為 `mulq` ,而
1. `=` 是 write only 的意思
2. `a`, `d` 分別表示 `$rax`, `$rdx`
3. `( )` 內使用 C 變數
4. `rm` 表示由 compiler 自己選擇存放位置
5. `%0` 在此等同 `a` 暫存器
6. `%3` 代表 `rm`
參考:
> https://ithelp.ithome.com.tw/articles/10213500
> https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
#### 引入 [Karatsuba algorithm](https://en.wikipedia.org/wiki/Karatsuba_algorithm) 算法
核心思想為在做乘法時,將一些乘法運算改為加減,考慮:
$$\begin{aligned}
(a_12^n+a_0)\times(b_12^n+b_0) &= c_22^{2n} + c_12^n +c_0 \\
&= (a_1b_1)2^{2n} + (a_1b_0+a_0b_1)2^n + a_0b_0
\end{aligned}
$$
原先需要做 4 次乘法運算,經過一些設計可以變成:
$$
(2^{2n}+2^n)(a_1b_1) + 2^n(a_1-a_0)(b_0-b_1) + (2^n+1)(a_0b_0)
$$
便能少做一次乘法。
## 改進 fibdrv
> [實作程式碼 github](https://github.com/tinyynoob/fibdrv/tree/master)
### 大數運算
在為 fibdrv 改進效能之前,我們需要先確保它的正確性,為此我們首先引入大數運算的機制。對於存放大數的方式前面提出了兩種設計:
- 單純使用結構
- 使用 union 達成 small string optimization
由於在計算 Fib 的情境下不會一直需要宣告新的變數,也不會有頻繁居於小數計算的情況,因此我們選擇使用結構的方式,如此也較便利於 maintain resize 的功能。
因為我目前不太熟悉 kernel programming 的各種 function 用法,因此我打算先在 user space 開發,並 debug 確認功能沒問題之後再移植到 kernel space 。
我想借助 macro 來完成此工作,例如這樣:
```c
#if DEBUG
N = (ubn *) malloc(sizeof(ubn));
#else
N = (ubn *) kmalloc(sizeof(ubn), GFP_KERNEL);
#endif
```
由於在計算 Fib 的過程中不會遇到任何負數,因此我的大數運算假設不存在負數,相關的運算也都沒有提供。
引入[共筆用的 macro](https://hackmd.io/@KYWeng/rkGdultSU#%E6%94%B9%E5%96%84%E6%96%B9%E6%A1%88-4---%E5%96%84%E7%94%A8-64-bit-CPU-%E7%9A%84%E5%84%AA%E5%8B%A2) :
```c
#if defined(__LP64__) || defined(__x86_64__) || defined(__amd64__) || \
defined(__aarch64__)
typedef uint64_t ubn_unit;
// typedef unsigned __int128 bn_data_tmp; // gcc support __int128
#define ubn_unit_bit 64
#else
typedef uint32_t ubn_unit;
// typedef uint64_t bn_data_tmp;
#define ubn_unit_bit 32
#endif
```
目前還未使用到加倍的型別,暫時先註解掉。
基於沒有負數的假設,我們得以直接移除共筆結構中的 `sign` 成員,訂出:
```c
/* unsigned big number
* @data: MSB:[size-1], LSB:[0]
* @size: sizeof(ubn_unit)
* @capacity: allocated size
*/
typedef struct {
ubn_unit *data;
int size;
int capacity;
} ubn;
```
預期 (無號) 大數運算的界面提供以下功能:
- `ubignum_init()`
- `ubignum_free()`
- `ubignum_resize()`
- `ubignum_add()`
- `ubignum_mult()`
- `ubignum_assign()`
目前比較沒有頭緒的是 `assign()` 的作法
#### ubignum_init()
:::spoiler 程式碼
```c
bool ubignum_init(ubn *N)
{
#if DEBUG
N = (ubn *) malloc(sizeof(ubn));
#else
N = (ubn *) kmalloc(sizeof(ubn), GFP_KERNEL);
#endif
if (!N)
goto struct_aloc_failed;
#if DEBUG
N->data = (ubn_unit *) calloc(sizeof(ubn_unit),
DEFAULT_CAPACITY); // including mem reset
#else
N->data = (ubn_unit *) kcalloc(sizeof(ubn_unit), DEFAULT_CAPACITY,
GFP_KERNEL); // including mem reset
#endif
if (!N->data)
goto data_aloc_failed;
N->size = 0;
N->capacity = DEFAULT_CAPACITY;
return true;
data_aloc_failed:
free(N);
N = NULL;
struct_aloc_failed:
return false;
}
```
:::
其中 `DEFAULT_CAPACITY` 目前訂為 16 。
靜態分析出現一堆問題,這才發現我原本的寫法根本不會正確把初始化後的 `N` 回傳,因此修改程式碼:
#### ubignum_resize()
:::spoiler 程式碼
```c
bool ubignum_resize(ubn *N, int new_size)
{
if (new_size < 0)
return false;
#if DEBUG
ubn_unit *new = (ubn_unit *) realloc(N->data, sizeof(ubn_unit) * new_size);
#else
ubn_unit *new =
(ubn_unit *) krealloc(N->data, sizeof(ubn_unit) * new_size, GFP_KERNEL);
#endif
if (!new)
return false;
N = new;
N->capacity = new_size;
return true;
}
```
:::
目前不確定允許 `new_size` 傳入 0 是否是個好主意。
#### ubignum_add()
```c
bool ubignum_add(const ubn *a, const ubn *b, ubn *out)
```
目前設計讓使用者傳入存放 sum 的變數,但這麼做就需要解決 pointer aliasing 的問題。
由低到高做加法的流程大致可以分成兩段:
1. 兩數相加
2. 一數加上 1. 的 carry out
首先開發 1. 的部份:
```c=
ubn_unit carry = 0;
int i = 0;
for (i = 0; i < min(a->size, b->size); i++) {
if (__builtin_expect(i >= out->capacity, 0)) {
if (!resize(out, out->capacity * 2)) {
return false;
}
}
out->data[i] = a->data[i] + b->data[i] + carry;
// The following consideration does not include carry
// fix it!
carry = ((a->data[i] ^ b->data[i]) >> 1 ^ (a->data[i] & b->data[i])) &
(1u << (ubn_unit_bit - 1));
}
```
若超出 `out` 範圍則對它進行 `resize()` ,其中加入 `builtin_expect()` 協助 compiler 進行 branch prediction 。
撰寫過程馬上讓我遭遇了幾個問題:
1. 第 11 行,我嘗試引入 [quiz2 第一題](https://hackmd.io/cmwilKeBSsStEwvKUgkfMA#%E6%B8%AC%E9%A9%97-1) 計算平均的方式再取 MSB 來得出新的 `carry` ,但我發覺這樣的話舊的 `carry` 沒有被加進來,所以在例如 `(a, b, c) = (UINT_MAX, 0, 1)` 的情況下會算出錯誤結果。
2. 我不熟悉 gcc 對於 unsigned 型態的右移會如何處理
根據 [gcc 的 mail list](https://gcc.gnu.org/legacy-ml/gcc/2000-04/msg00152.html) :
> For right shifts, if the type is signed, then we use an arithmetic shift;
if the type is unsigned then we use a logical.
因此對於 unsigned 右移的行為符合我的預期。
:::info
1\. 待解決
$\to$ 後面完全更換作法就不再有這個問題
:::
為了應對 pointer aliasing 的問題,我們應該不能直接拿 `a->data` 和 `b->data` 去運算,因為它們在第 9 行可能會被改變 (如果他們跟 `out` 指到同樣位置) ,那麼第 12 行的運算將會錯誤。因此需要先將它們在第 9 行的值存起來,將程式碼改成:
```c=
for (i = 0; i < min(a->size, b->size); i++) {
if (__builtin_expect(i >= out->capacity, 0)) {
if (!resize(out, out->capacity * 2)) {
goto realoc_failed;
}
}
ubn_unit an = a->data[i];
ubn_unit bn = b->data[i];
out->data[i] = an + bn + carry;
// The following consideration does not include carry
// fix it!
carry = ((an ^ bn) >> 1 + (an & bn)) & (1u << (ubn_unit_bit - 1));
}
```
其中第 4 行跳到最後去把 `a` 跟 `b` 還原成原本的樣子,這也是因為 aliasing 才有的問題。
第二段程式碼:
```c=
ubn *remain = NULL;
if (i == a->size)
remain = b;
else if (i == b->size)
remain = a;
for (; carry || i < remain->size; i++) {
if (__builtin_expect(i >= out->capacity, 0)) {
if (!resize(out, out->capacity * 2)) {
return false;
}
}
if (__builtin_expect(i < remain->size, 1)) {
out->data[i] = remain->data[i] + carry;
carry =
((remain->data[i] ^ carry) >> 1 ^ (remain->data[i] & carry)) &
(1u << (ubn_unit_bit - 1));
} else { // the last carry out carries to a new ubn_unit
out->data[i] = carry; // = 1
carry = 0;
}
}
```
先使用 `remain` 來記住哪一個加數還有剩餘的位數沒加,接著就進入迴圈繼續加完,大部分的情況都會走 12 行,唯一會走到 17 行的情況是最高位有 carry out 而且剛好 carry 到一個新的 `ubn_unit` chunk 。
同樣處理 pointer aliasing 的問題並改進寫法得:
```c
ubn *remain = (i == a->size) ? b : a;
for (; carry || i < remain->size; i++) {
if (__builtin_expect(i >= out->capacity, 0)) {
if (!resize(out, out->capacity * 2)) {
goto realoc_failed;
}
}
if (__builtin_expect(i < remain->size, 1)) {
ubn_unit rn = remain->data[i];
out->data[i] = rn + carry;
carry =
((rn ^ carry) >> 1 ^ (rn & carry)) & (1u << (ubn_unit_bit - 1));
} else { // the last carry out carries to a new ubn_unit
out->data[i] = carry; // = 1
carry = 0;
}
}
```
處理 aliasing 的程式碼:
```c
ubn *store = NULL;
int alias = 0;
if (a == out) // pointer aliasing
alias ^= 1;
if (b == out) // pointer aliasing
alias ^= 2;
if (alias) {
store = ubignum_duplicate(store, out);
if (!store)
return false;
}
```
直接使用一個變數 `alias` 中的 bit 來觀察,若有 aliasing 則配置空間給保留原值。
```c
realoc_failed:
switch (alias) {
case 1:
swapptr(&a, &store);
break;
case 2:
case 3:
swapptr(&b, &store);
case 0:
}
return false;
```
注意到這裡 2, 3 共用的原因是,如果 `a` 和 `b` 指到同一個位置,那麼恢復 `b` 也同時恢復了 `a` 。
---
aliasing 的問題比我預期的還更難處理,我直接重改邏輯,變成如果有 aliasing 則配置一個新空間存放計算結果,算完再放到 `out` ,改寫過程中也修復了一些 bug 。
::: spoiler 更改後的 `ubignum_add()` 程式碼
```c=
/* out = a + b
* Aliasing arguments are acceptable.
* If it return true, the result is put at @out.
* If return false with alias input, the @out would be unchanged.
* If return false without alias input, the @out would return neither answer nor
* original out.
*/
bool ubignum_add(const ubn *a, const ubn *b, ubn **out)
{
ubn *ans = *out;
int alias = 0;
if (a == *out) // pointer aliasing
alias ^= 1;
if (b == *out) // pointer aliasing
alias ^= 2;
if (alias) { // if so, allocate space to store the result
if (!ubignum_init(&ans))
return false;
}
ubn_unit carry = 0;
int i = 0;
for (i = 0; i < min(a->size, b->size); i++) {
if (i >= ans->size) {
if (__builtin_expect(i >= ans->capacity, 0))
if (!ubignum_resize(ans, ans->capacity * 2))
goto realoc_failed;
ans->size++;
}
const ubn_unit_extend sum = a->data[i] + b->data[i] + carry;
ans->data[i] = sum;
carry = sum >> ubn_unit_bit;
}
const ubn *remain = (i == a->size) ? b : a;
for (; carry || i < remain->size; i++) {
if (i >= ans->size) {
if (__builtin_expect(i >= ans->capacity, 0))
if (!ubignum_resize(ans, ans->capacity * 2))
goto realoc_failed;
ans->size++;
}
if (__builtin_expect(i < remain->size, 1)) {
const ubn_unit_extend sum = remain->data[i] + carry;
ans->data[i] = sum;
carry = sum >> ubn_unit_bit;
} else { // the last carry out carries to a new ubn_unit
ans->data[i] = carry; // = 1
carry = 0;
}
}
*out = ans; // no condition needed
return true;
realoc_failed:
if (alias)
ubignum_free(ans);
return false;
}
```
:::
\
其中 `ubn_unit_extend` 在 x64 定義為 `unsigned __int128` ,目前嘗試在 user space 配合 gdb 工具進行測試,得到錯誤的結果, `carry` 沒有被正確計算
```c
carry = sum >> ubn_unit_bit;
```
疑似 uint128 的 shift operation 有問題,我嘗試去找 [gcc doc](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fint128.html) ,但是內容只有三行,沒有提到是否支援 shift operation 。
我在 assignment 之前加入強制轉型就有我預期的結果了
```c
const ubn_unit_extend sum = (ubn_unit_extend) a->data[i] + b->data[i] + carry;
```
我好像已經吃了很多次這種虧,卻都沒有記住,我一直誤以為計算的時候就會轉成左邊的型態,這次特別寫進 commit message ,希望能確實記得。
#### ubignum_mult()
雖然前面有提及[加速的算法](https://hackmd.io/MJ1kMQBLRBCuo08yEmhDeQ?both#%E5%BC%95%E5%85%A5-Karatsuba-algorithm-%E7%AE%97%E6%B3%95),但是因為我目前的大數機制沒有減法,所以暫時先略過此項。
在開發時發現 `ubn_unit` 的加法一直重複遇到,所以將其模組化
```c
/* no checking, sum and cout input should be guarantee */
static inline void ubn_unit_add(const ubn_unit a,
const ubn_unit b,
const int cin,
ubn_unit *sum,
int *cout)
{
const ubn_unit_extend SUM = (ubn_unit_extend) a + b + cin;
*sum = SUM;
*cout = SUM >> ubn_unit_bit;
}
```
長得有點像硬體加法器的樣子。 `cin` 和 `cout` 都只會用到一個 bit ,用什麼型態似乎都蠻浪費的,最後選用了最常見的 int 。
:::spoiler `mult()` 程式碼
```c=
bool ubignum_mult(const ubn *a, const ubn *b, ubn **out)
{
if (!a || !b || !out || !*out)
return false;
ubn *ans = *out;
int alias = 0;
if (a == *out) // pointer aliasing
alias ^= 1;
if (b == *out) // pointer aliasing
alias ^= 2;
if (alias) { // if alias, allocate space to store the result
if (!ubignum_init(&ans))
return false;
}
/* keep mcand longer then mplier */
const ubn *mcand = a->size > b->size ? a : b;
const ubn *mplier = a->size > b->size ? b : a;
ubn *prod;
if (!ubignum_init(&prod))
goto prod_aloc_failed;
if (prod->capacity < mcand->size + 1)
if (!ubignum_resize(&prod, mcand->size + 1))
goto prod_resize_failed;
/* The outer loop */
for (int i = 0; i < mplier->size; i++) {
/* The inner loop explanation:
* a b c = mcand->data[2, 1, 0]
* * d = mplier->data[0]
* --------
* c c = (high, low) denotes result of c * d
* b b
* + a a
* ----------
* x y ? c sum of bb and cc, it may be carry out to x
* + a a
*
* let x be recorded in @carry and y be @overlap
* no realloc needed for @prod
*/
int carry = 0;
ubn_unit overlap = 0;
for (int j = 0; j < mcand->size; j++) {
ubn_unit low, high;
// need macro to select appropriate assembly
__asm__("mulq %3"
: "=a"(low), "=d"(high)
: "%0"(mcand->data[j]), "rm"(mplier->data[i]));
int cout = 0;
ubn_unit_add(low, overlap, carry & 2, &prod->data[j], &cout);
carry = (carry << 1) | cout; // update carry where cout is 0 or 1
overlap = high;
}
prod->data[mcand->size] =
overlap + (ubn_unit) carry & 2; // no carry out would be generated
if (!ubignum_mult_add(prod, i, &ans))
goto multadd_failed;
}
*out = ans;
return true;
multadd_failed:
prod_resize_failed:
ubignum_free(prod);
prod_aloc_failed:
if (alias)
ubignum_free(ans);
return false;
}
```
:::
\
目前的 bug 是 inline asm 沒有正確作用,看起來像是 `low` 跟 `high` 相反了,也可以看出確實是把 `$rdx`, `$rax` 分別接到 `high` 跟 `low` 。
```shell
Breakpoint 2, ubignum_mult (a=0x5555555592a0, b=0x555555559350, out=0x7fffffffdc88) at bignum.h:336
336 ubn_unit_add(low, overlap, carry & 2, &prod->data[j], &cout);
(gdb) p mcand->data[j]
$1 = 18446744073709551615
(gdb) p i
$2 = 0
(gdb) p j
$3 = 0
(gdb) p mplier->data[i]
$4 = 18446744073709551615
(gdb) p low
$5 = 1
(gdb) p high
$6 = 18446744073709551614
(gdb) info registers rax
rax 0x1 1
(gdb) info registers rdx
rdx 0xfffffffffffffffe -2
```
難道說它執行 `mulq` instruction 時的資料放置方式是跟原本相反嗎?
我嘗試去查看組合語言的編譯結果,看到:
```
mulq %rdx
```
雖然它把 `"rm"` 選成 `$rdx` ,但這應該也不會影響指令執行結果才對。
我發現其實它沒有算錯,這個結果是正確的,但因為我對乘法的直覺還不夠,誤用加法的邏輯來思考乘法才誤判結果。
在 pointer aliasing 的情況下無法做到 in place 的算法,否則錯誤處理會出問題。
---
### 大數運算 第二版
在開發的過程中,因為大數運算難以 debug 只能經由 gdb 慢慢嘗試,因此修修改改花了很多時間,改了又錯又重改的情況多次發生,難以一一列舉,在此大略敘述過程遇到的狀況。
#### pointer aliasing issues
首先是前面提及 pointer aliasing 的問題,這個狀況是我在第一個版本遇到最大的問題,第一次是在開發加法時遇到。例如當想計算 `a = a + b` 時,若加出更大的位數則需要 reallocation ,但中途的 reallocation 失敗會導致傳入值 `a` 被破壞而正確的新 `a` 也沒有算出來,進退兩難的情況。
起初為了應對這個問題,我每次計算都先查看是否有 aliasing ,若有則進行複製, maintain 一份副本以利在 reallocation 失敗時能將數值復原。然而這個方法顯然導致很糟糕的效能,雖然之後為了節省交換的操作而直接計算在新空間,但仍然用了 3 個數。
使用者在使用界面時若指定 `a = a + b` ,那通常直覺也會認為自己使用了 in place 的算法,可是我在實作上卻無論如何都需要 3 個數的空間,違反使用直覺。
:::spoiler 原程式碼 (add)
```c
/* (*out) = a + b
* Aliasing arguments are acceptable.
* If it return true, the result is put at @out.
* If return false with alias input, the @out would be unchanged.
* If return false without alias input, the @out would return neither answer nor
* original out.
*/
bool ubignum_add(const ubn *a, const ubn *b, ubn **out)
{
if (!a || !b || !out || !*out)
return false;
ubn *ans = *out;
int alias = 0;
if (a == *out) // pointer aliasing
alias ^= 1;
if (b == *out) // pointer aliasing
alias ^= 2;
if (alias) { // if alias, allocate space to store the result
if (!ubignum_init(&ans))
return false;
}
int i = 0, carry = 0;
for (i = 0; i < min(a->size, b->size); i++) {
if (i >= ans->size) {
if (__builtin_expect(i >= ans->capacity, 0))
if (!ubignum_resize(&ans, ans->capacity * 2))
goto realoc_failed;
ans->size++;
}
ubn_unit_add(a->data[i], b->data[i], carry, &ans->data[i], &carry);
}
const ubn *remain = (i == a->size) ? b : a;
for (; carry || i < remain->size; i++) {
if (i >= ans->size) {
if (__builtin_expect(i >= ans->capacity, 0))
if (!ubignum_resize(&ans, ans->capacity * 2))
goto realoc_failed;
ans->size++;
}
ubn_unit_add(remain->data[i], 0, carry, &ans->data[i], &carry);
}
if (alias)
ubignum_free(*out);
*out = ans;
return true;
realoc_failed:
if (alias)
ubignum_free(ans);
return false;
}
```
:::
\
對於每個 $n$ 測量 1000 次 $\mathrm{fib}(n)$ 計算後印出平均消耗時間。
其糟糕效能如下:

為此我後來發現若是能提早知道 reallocation 的大小,或至少知道該次計算的上界,那麼即可在開始計算之前就進行 reallocation 。如此一來即使配置失敗,那麼直接回傳 false 原先的值也不會被破壞。因此在計算 `a = a + b` 時便可以達成 in place 的目標,使用 2 個數的空間完成加法,也因為每次 `ubignum_add()` 都只會做一次 reallocation ,大幅降低了 memory allocation 的花費。此外也不再需要確認 aliasing ,移除判斷條件後也使程式碼簡潔許多。
:::spoiler 改良 (add)
```c
/* (*out) = a + b
* Aliasing arguments are acceptable.
* If false is returned, the values of (*out) remains unchanged.
*/
bool ubignum_add(const ubn *a, const ubn *b, ubn **out)
{
if (!a || !b || !out || !*out)
return false;
while (__builtin_expect((*out)->capacity < max(a->size, b->size) + 1, 0))
if (!ubignum_resize(out, (*out)->capacity * 2))
return false;
if (*out != a && *out != b) // if no pointer aliasing
ubignum_zero(*out);
int i = 0, carry = 0;
for (i = 0; i < min(a->size, b->size); i++)
ubn_unit_add(a->data[i], b->data[i], carry, &(*out)->data[i], &carry);
const ubn *const remain = (i == a->size) ? b : a;
for (; i < remain->size; i++)
ubn_unit_add(remain->data[i], 0, carry, &(*out)->data[i], &carry);
(*out)->size = remain->size;
if (carry) {
(*out)->data[i] = 1;
(*out)->size++;
}
return true;
}
```
:::
\
效能:

相比原先的版本加速將近十倍之多。
#### 列印
第二個困難出在列印,雖然我的 `ubignum` 已能進行數種運算,也正確地將數值儲存在檔案中,但它們就是一串由上百、上千個 0, 1 組成的資料,難以為人類所理解。
為了將很長的二進位轉成十進制,我左思右想許久,最後認為終究是離不開除法的命運,還是得乖乖寫一個除法。我想起前幾周的測驗題中,恰好有[計算除法的內容](https://hackmd.io/u4lvdDcAQeC8UoQ1dT7LUg?view#%E6%B8%AC%E9%A9%97-5),於是我就嘗試套用這個模式,意圖以位移和減法來達成除法運算。此外因為目前沒有其它除法的應用場景,所以可以只考量「除以 10 」的情況並進行特化。
為了實作出 `ubignum_divby_ten()` ,首先需要 `ubignum_left_shift()` 及 `ubignum_sub()` 。
##### ubignum_sub()
由於我的大數體系原先完全不存在任何負的概念,因此想引入減法也需要做一些權衡。
不過因為除法中使用的減法會保證結果為非負 (餘數) ,如此一來就不必引入負數,減法直接使用二補數加法進行,並把最高的進位拋棄。基於直接假設減法都是大減小,原本的 ubignum 體系可以維持不變。
以下程式碼都展示解決前述 pointer aliasing 問題後的版本。
:::spoiler `sub()` 程式碼
```c
bool ubignum_sub(const ubn *a, const ubn *b, ubn **out)
{
if (!a || !b || !out || !*out)
return false;
/* ones' complement of b */
ubn *cmt;
if (!ubignum_init(&cmt))
goto cmt_failed;
if (__builtin_expect(cmt->capacity < a->size, 0))
if (!ubignum_resize(&cmt, a->size))
goto cmt_realoc_failed;
cmt->size = a->size;
for (int i = 0; i < b->size; i++)
cmt->data[i] = ~b->data[i];
for (int i = b->size; i < a->size; i++)
cmt->data[i] = CPU_64 ? UINT64_MAX : UINT32_MAX;
int carry = 1;
for (int i = 0; i < a->size; i++) // compute result and store in cmt
ubn_unit_add(a->data[i], cmt->data[i], carry, &cmt->data[i], &carry);
// the last carry is discarded
for (int i = a->size - 1; i >= 0; i--)
if (!cmt->data[i])
cmt->size--;
ubignum_free(*out);
*out = cmt;
return true;
cmt_realoc_failed:
ubignum_free(cmt);
cmt_failed:
return false;
}
```
:::
##### ubignum_left_shift()
:::spoiler `left_shift()` 程式碼
```c=
/* left shift a->data by d bit */
bool ubignum_left_shift(const ubn *a, int d, ubn **out)
{
if (!a || d < 0 || !out || !*out)
return false;
else if (d == 0) {
if (a == *out)
return true;
if (__builtin_expect((*out)->capacity < a->size, 0))
if (!ubignum_resize(out, a->size))
return false;
for (int i = 0; i < a->size; i++)
(*out)->data[i] = a->data[i];
(*out)->size = a->size;
return true;
}
const int chunk_shift = d / ubn_unit_bit;
const int shift = d % ubn_unit_bit;
const int new_size = a->size + chunk_shift + 1;
if (__builtin_expect((*out)->capacity < new_size, 0))
if (!ubignum_resize(out, new_size))
return false;
int ai = new_size - chunk_shift - 1; // note a->size may be changed due to
// aliasing, should not use a->size
int oi = new_size - 1;
/* copy data from a to (*out) */
if (shift) {
(*out)->data[oi--] = a->data[ai - 1] >> (ubn_unit_bit - shift);
for (ai--; ai > 0; ai--)
(*out)->data[oi--] = (a->data[ai] << shift) |
(a->data[ai - 1] >> (ubn_unit_bit - shift));
(*out)->data[oi--] = a->data[ai] << shift; // ai is now 0
} else {
while (ai >= 0)
(*out)->data[oi--] = a->data[ai--];
}
/* end copy */
while (oi >= 0) // set remaining part of (*out) to 0
(*out)->data[oi--] = 0;
(*out)->size = new_size;
if ((*out)->data[(*out)->size - 1] == 0) // if MS chunk is 0
(*out)->size--;
return true;
}
```
:::
\
大致上就是一些搬移的操作,在此解釋 29 至 33 行的操作,假設我們要將 `a` :
```
00000000 11111111 01001111 00000000
```
左移 3 位。
- 注意到 chunk 編號是 $(3, 2, 1, 0)$ ,左邊為 most significant 。
每次我們需要目前 chunk 的最低 5 位放到高 (置左) 與其右方 chunk 的最高 3 位放到低 (置右) 進行合併,以編號 $2$ 為例即是:
```
11111... from [2]
or .....010 from [1]
----------
11111010
```
- 取最低 5 位放到高:左移 $3$
- 取最高 3 位放到低:右移 $(8-3)$
最高跟最低的 chunk 都只有分別一邊所以需要特殊處理,最終得:
```
00000111 11111010 01111000 00000000
```
由於我們每次都不會去取目標左方的新 chunk ,因此 pointer aliasing 在此並不會造成問題。
另外,考量到當 `shift` 為 0 時,原本的程式碼會導致數值右移超過自己的 size ,產生 undefined behavior ,因此需要第 28 行的 if..else 來進行 branch 。
##### ubignum_divby_ten()
概念源於 [quiz3](https://hackmd.io/u4lvdDcAQeC8UoQ1dT7LUg?view#%E6%B8%AC%E9%A9%97-5) 的算法。
在撰寫的過程中,再多寫了一個 `ubignum_compare()` 函式以利比較。
:::spoiler `divby_ten()` 程式碼
```c
/* a / 10 = (*quo)...rmd */
bool ubignum_divby_ten(const ubn *a, ubn **quo, int *rmd)
{
if (!a || !a->size || !quo || !*quo || !rmd)
return false;
ubn *ans, *dvd, *suber, *ten;
if (!ubignum_init(&ans))
return false;
if (!ubignum_resize(&ans, a->size))
goto cleanup_ans;
if (!ubignum_init(&dvd))
goto cleanup_ans;
if (!ubignum_resize(&dvd, a->size))
goto cleanup_dvd;
if (!ubignum_init(&suber))
goto cleanup_dvd;
if (!ubignum_resize(&suber, dvd->capacity + 1))
goto cleanup_suber;
if (!ubignum_init(&ten))
goto cleanup_suber;
ubignum_uint(ten, 10);
for (int i = 0; i < a->size; i++)
dvd->data[i] = a->data[i];
dvd->size = a->size;
int shift = dvd->size * ubn_unit_bit - 4; // can be improved by clz()
ubignum_left_shift(ten, shift, &suber);
while (ubignum_compare(dvd, ten) >= 0) { // if dvd >= 10
while (ubignum_compare(dvd, suber) < 0) {
shift--;
ubignum_left_shift(ten, shift, &suber);
}
ans->data[shift / ubn_unit_bit] |= (ubn_unit) 1
<< (shift % ubn_unit_bit);
ubignum_sub(dvd, suber, &dvd);
}
ans->size = a->size;
if (ans->data[a->size - 1] == 0)
ans->size--;
*rmd = (int) dvd->data[0];
ubignum_free(ten);
ubignum_free(suber);
ubignum_free(dvd);
ubignum_free(*quo);
*quo = ans;
return true;
cleanup_suber:
ubignum_free(suber);
cleanup_dvd:
ubignum_free(dvd);
cleanup_ans:
ubignum_free(ans);
return false;
}
```
:::
\
之後同樣如 quiz3 引入 `builtin_clz()` 並重新包裝成 `ubignum_clz()` 進行改進,關鍵段落變化:
```diff
for (int i = 0; i < a->size; i++)
dvd->data[i] = a->data[i];
dvd->size = a->size;
- int shift = dvd->size * ubn_unit_bit - 4;
- ubignum_left_shift(ten, shift, &suber);
while (ubignum_compare(dvd, ten) >= 0) { // if dvd >= 10
- while (ubignum_compare(dvd, suber) < 0) {
- shift--;
- ubignum_left_shift(ten, shift, &suber);
- }
+ int shift = dvd->size * ubn_unit_bit - ubignum_clz(dvd) - 4;
+ ubignum_left_shift(ten, shift, &suber);
+ if (ubignum_compare(dvd, suber) < 0)
+ ubignum_left_shift(ten, --shift, &suber);
ans->data[shift / ubn_unit_bit] |= (ubn_unit) 1
<< (shift % ubn_unit_bit);
ubignum_sub(dvd, suber, &dvd);
}
ans->size = a->size; // FIX: ans = 0 should be considered
if (ans->data[a->size - 1] == 0)
ans->size--;
*rmd = (int) dvd->data[0];
```
##### ubignum_2decimal()
在實作完成除以 10 之後,我們即可透過不斷除以 10 來將二進位數轉成字串印出。
:::spoiler `2decimal()` 程式碼
```c
/* convert the ubn to ascii string */
char *ubignum_2decimal(const ubn *N)
{
if (!N)
return NULL;
if (ubignum_iszero(N)) {
#if DEBUG
char *ans = (char *) calloc(sizeof(char), 2);
#else
char *ans = (char *) kcalloc(sizeof(char), 2, GFP_KERNEL);
#endif
if (!ans)
return NULL;
ans[0] = '0';
ans[1] = 0;
return ans;
}
ubn *dvd = NULL;
if (!ubignum_init(&dvd))
return NULL;
if (!ubignum_resize(&dvd, N->size))
goto cleanup_dvd;
for (int i = 0; i < N->size; i++)
dvd->data[i] = N->data[i];
dvd->size = N->size;
/* Let n be the number.
* digit = 1 + log_10(n) = 1 + \frac{log_2(n)}{log_2(10)}
* log_2(10) \approx 3.3219 \approx 7/2, we simply choose 3
*/
unsigned digit = (ubn_unit_bit * N->size / 3) + 1;
#if DEBUG
char *ans = (char *) calloc(sizeof(char), digit);
#else
char *ans = (char *) kcalloc(sizeof(char), digit, GFP_KERNEL);
#endif
if (!ans)
goto cleanup_dvd;
/* convert 2-base to 10-base */
int index = 0;
while (dvd->size) {
int rmd;
if (__builtin_expect(!ubignum_divby_ten(dvd, &dvd, &rmd), 0))
goto cleanup_ans;
ans[index++] = rmd | '0';
}
int len = index;
/* reverse string */
index = len - 1;
for (int i = 0; i < index; i++, index--) {
char tmp = ans[i];
ans[i] = ans[index];
ans[index] = tmp;
}
ubignum_free(dvd);
return ans;
cleanup_ans:
#if DEBUG
free(ans);
#else
kfree(ans);
#endif
cleanup_dvd:
ubignum_free(dvd);
return NULL;
}
```
:::
\
目前我認為字串反轉的方式還可以再改良,但因為這個部份並非 fib 運算的主軸,因此暫時就先如此。
---
### fibdrv 效能分析與改進
原先想與[範例程式碼](https://github.com/AdrianHuang/fibdrv/tree/big_number)比較,然而稍微更改後在自己電腦實驗的結果卻非常糟:


因此目前先以自己改進為方向,
由於速度的提昇,我們現在將範圍拉到 $n=1000$ 來測試。

對於目前 fib sequence 逐項往上加的算法
```c
static inline void ubn_unit_add(const ubn_unit a,
const ubn_unit b,
const int cin,
ubn_unit *sum,
int *cout)
{
const ubn_unit_extend SUM = (ubn_unit_extend) a + b + cin;
*sum = SUM;
*cout = SUM >> ubn_unit_bit;
}
```
我認為仍有可能改進的部份:
1. 使用 inline asm 來做加法,如此 (在 64 位元情況) 可以避免用到 `uint128_t`
今天 (3/24) 上課的時候,聽到老師介紹 gcc 有 builtin 的 overflow 函式,藉由這點我們即可改寫原本的 `ubn_unit_add()`,不需要低到 asm 的層級,相對 asm 的優點在於 ==architecture independence==。然而這裡是需要 3 個數相加,因此需要呼叫兩次 `builtin_uadd_overflow()`。
```c
static inline void ubn_unit_add(const ubn_unit a,
const ubn_unit b,
const int cin,
ubn_unit *sum,
int *cout)
{
*cout = __builtin_uaddll_overflow(a, cin, sum);
*cout |= __builtin_uaddll_overflow(*sum, b, sum);
}
```

測試後可以發現在較大的 case 上,使用 `builtin_uadd_overflow()` 可以得到略佳的效果。再測到更大一些得:

---
當我嘗試將 $n$ 提高到 3000 時,卻觀察到一個奇怪的現象,就是在 1000 之後,計算時間就不再有提昇的趨勢,如圖:

我覺得這非常不合理,因為以我的算法對於越大的數肯定有越高的計算量。目前想到的可能出在,因數值太大而產生 overflow ,但 20000 $\times$ 1000 也仍不到 32-bit int 的上界,應不可能超出 long long ,且可以從圖中見得有些數字遠遠超過水平線卻仍被印了出來。或許需要更理解 ktime_t 相關的知識。
> On 64-bit systems, a ktime_t is really just a 64-bit integer value in nanoseconds.
> --- [The high-resolution timer API](https://lwn.net/Articles/167897/)
以這個資料來說,顯然更不是時間變數資料型態的問題了。
:::success
經過數天之後我終於發現問題,是 **fibdrv.c** 中的 `MAX_LENGTH` 。之前改過一次之後我就忘記有這個參數,目前直接設定成 1000000 以利後續實驗。
更正後試跑已可正常進行統計:

:::
---
嘗試引入 fast doubling 算法加速

效能改善的情況高於我的預期,我原先以為 fast doubling 要到很大的 case 才會有顯著的改善。但從實驗結果可以發現,在 $n=400$ 左右兩種算法即出現執行時間的交叉。
試著仿造使用 `perf stat` 測量我寫的 fast doubling 算法到 $n=3000$ :
```shell
$ sudo perf stat -e cycles,instructions,cache-misses,cache-references,branch-instructions,branch-misses ./exp 1 > /dev/null
Performance counter stats for './exp 1':
118,264,893,816 cycles
264,687,477,706 instructions # 2.24 insn per cycle
15,142,354 cache-misses # 15.801 % of all cache refs
95,829,386 cache-references
38,631,044,731 branch-instructions
69,373,051 branch-misses # 0.18% of all branches
35.744624633 seconds time elapsed
0.931959000 seconds user
34.806501000 seconds sys
```
可以看到我的實作程式碼有高達 15% 的 cache miss 。
我想利用 cachegrind 來測試是哪個函式發生最多 cache miss ,結果如下:
```shell
$ sudo valgrind --tool=cachegrind ./exp 1 > /dev/null
==37329== Cachegrind, a cache and branch-prediction profiler
==37329== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==37329== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==37329== Command: ./exp 1
==37329==
--37329-- warning: L3 cache found, using its data for the LL simulation.
==37329==
==37329== I refs: 69,563,678
==37329== I1 misses: 1,159
==37329== LLi misses: 1,128
==37329== I1 miss rate: 0.00%
==37329== LLi miss rate: 0.00%
==37329==
==37329== D refs: 28,270,312 (24,764,894 rd + 3,505,418 wr)
==37329== D1 misses: 3,323 ( 2,585 rd + 738 wr)
==37329== LLd misses: 2,733 ( 2,070 rd + 663 wr)
==37329== D1 miss rate: 0.0% ( 0.0% + 0.0% )
==37329== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==37329==
==37329== LL refs: 4,482 ( 3,744 rd + 738 wr)
==37329== LL misses: 3,861 ( 3,198 rd + 663 wr)
==37329== LL miss rate: 0.0% ( 0.0% + 0.0% )
$ sudo cg_annotate cachegrind.out.37329
--------------------------------------------------------------------------------
I1 cache: 32768 B, 64 B, 8-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 6291456 B, 64 B, 12-way associative
Command: ./exp 1
Data file: cachegrind.out.37329
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation: off
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
69,563,678 1,159 1,128 24,764,894 2,585 2,070 3,505,418 738 663 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
33,078,041 4 4 15,024,011 1 0 3,012,013 0 0 ???:main
27,000,072 1 1 6,000,016 0 0 0 0 0 /build/glibc-sMfBJT/glibc-2.31/io/../sysdeps/unix/sysv/linux/write.c:write
6,054,221 27 24 3,027,082 11 0 15 1 1 ???:???
1,493,991 48 47 398,997 19 6 261,001 6 2 /build/glibc-sMfBJT/glibc-2.31/stdio-common/vfprintf-internal.c:__vfprintf_internal
648,170 4 4 141,032 4 1 114,012 1 0 /build/glibc-sMfBJT/glibc-2.31/libio/fileops.c:_IO_file_xsputn@@GLIBC_2.2.5
418,262 3 3 30,733 2 1 24,733 0 0 /build/glibc-sMfBJT/glibc-2.31/stdio-common/_itoa.c:_itoa_word
228,000 5 5 21,000 3 2 0 0 0 /build/glibc-sMfBJT/glibc-2.31/string/../sysdeps/x86_64/multiarch/strchr-avx2.S:__strchrnul_avx2
163,977 2 2 29,983 0 0 17,985 63 63 /build/glibc-sMfBJT/glibc-2.31/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms
90,000 4 4 18,000 3 0 33,000 3 2 /build/glibc-sMfBJT/glibc-2.31/stdio-common/printf.c:printf
78,000 1 1 24,000 0 0 6,000 0 0 /build/glibc-sMfBJT/glibc-2.31/stdio-common/../libio/libioP.h:__vfprintf_internal
71,545 10 10 14,316 1,080 893 15 2 0 /build/glibc-sMfBJT/glibc-2.31/elf/dl-addr.c:_dl_addr
```
:::warning
cachegrind 似乎無法配合 taskset 使用
:::
:::success
改用
```shell
sudo taskset -c 1 valgrind --tool=cachegrind ./exp 1 > /dev/null
```
即可正常使用 cachegrind ,推測為 Valgrind 實作方式的影響
:::
測試結果竟顯示 data cache miss 為 0% ,與從 perf stat 收到的資訊天差地遠。
可以觀察到 cache reference 在 perf 顯示 95,829,386 ;而在 cachegrind 顯示為 28,270,312 。
:::spoiler 進行更多次 perf 測試,發現每次的 cache references 竟有巨大落差:
```shell
Performance counter stats for './exp 1':
112,781,466,006 cycles
261,497,199,771 instructions # 2.32 insn per cycle
6,962,452 cache-misses # 11.023 % of all cache refs
63,160,276 cache-references
38,179,989,631 branch-instructions
67,070,255 branch-misses # 0.18% of all branches
33.386655206 seconds time elapsed
0.975406000 seconds user
32.384290000 seconds sys
```
```shell
Performance counter stats for './exp 1':
108,714,836,439 cycles
264,555,729,876 instructions # 2.43 insn per cycle
1,520,786 cache-misses # 14.088 % of all cache refs
10,795,259 cache-references
38,609,436,202 branch-instructions
63,560,865 branch-misses # 0.16% of all branches
32.059672731 seconds time elapsed
0.847985000 seconds user
31.207482000 seconds sys
```
:::
\
我不曉得為什麼會出現這樣的情況。
:::success
後續發現是測試次數過少而導致的數值不穩定,後面會補充結果。
:::
---
回去觀摩[參考共筆](https://hackmd.io/@KYWeng/rkGdultSU#%E6%94%B9%E5%96%84%E6%96%B9%E6%A1%88-1---%E6%94%B9%E5%AF%AB-bn_fib_fdoubling-%E5%AF%A6%E4%BD%9C%E7%9A%84%E6%96%B9%E5%BC%8F),發現因為我先前閱讀的重點都在運算加速,沒注意到資料搬移優化的部份,嘗試將其納入我的實作:
```diff
for (int currbit = 1 << (32 - __builtin_clzll(k) - 1 - 1); currbit;
currbit = currbit >> 1) {
/* compute 2n-1 */
ubignum_mult(fast[1], fast[1], &fast[0]);
ubignum_mult(fast[2], fast[2], &fast[3]);
ubignum_add(fast[0], fast[3], &fast[3]);
/* compute 2n */
ubignum_left_shift(fast[1], 1, &fast[4]);
ubignum_add(fast[4], fast[2], &fast[4]);
ubignum_mult(fast[4], fast[2], &fast[4]);
n *= 2;
if (k & currbit) {
ubignum_add(fast[3], fast[4], &fast[0]);
n++;
- ubignum_copy(fast[2], fast[0]);
- ubignum_copy(fast[1], fast[4]);
+ ubignum_swapptr(&fast[2], &fast[0]);
+ ubignum_swapptr(&fast[1], &fast[4]);
} else {
- ubignum_copy(fast[2], fast[4]);
- ubignum_copy(fast[1], fast[3]);
+ ubignum_swapptr(&fast[2], &fast[4]);
+ ubignum_swapptr(&fast[1], &fast[3]);
}
}
```
效能差異:

我測出來的結果和參考共筆的結論不符合,並沒有出現效能大幅改善的情形。
perf 測起來仍然非常不穩:
:::spoiler perf 測試
```shell
$ sudo perf stat -e cycles,instructions,cache-misses,cache-references,branch-instructions,branch-misses ./exp 1 > /dev/null
Performance counter stats for './exp 1':
99,537,523,424 cycles
241,725,789,043 instructions # 2.43 insn per cycle
1,579,228 cache-misses # 53.451 % of all cache refs
2,954,538 cache-references
34,729,788,080 branch-instructions
72,001,270 branch-misses # 0.21% of all branches
29.372597211 seconds time elapsed
0.963940000 seconds user
28.402259000 seconds sys
```
```shell
Performance counter stats for './exp 1':
110,100,048,298 cycles
251,991,005,629 instructions # 2.29 insn per cycle
11,886,534 cache-misses # 14.939 % of all cache refs
79,566,707 cache-references
36,196,420,296 branch-instructions
71,695,967 branch-misses # 0.20% of all branches
32.880962908 seconds time elapsed
0.939414000 seconds user
31.908111000 seconds sys
```
:::
\
嘗試在 perf 命令設置參數 `-r 10` 來降低測量數值的波動,為避免時間過長,改為測試 fib 到 $n=1000$ 。
經過這個設定後,每次測試間得到的數值就不再有顯著差異,至少每個 events 的前兩個數字會保持不變。
```shell
$ bash measure
...
Performance counter stats for 'taskset -c 1 ./exp 1' (10 runs):
23,018,839,194 cycles ( +- 0.27% )
49,622,366,443 instructions # 2.16 insn per cycle ( +- 0.00% )
1,634,045 cache-misses # 20.062 % of all cache refs ( +- 6.49% )
8,145,159 cache-references ( +- 5.01% )
8,319,541,416 branch-instructions ( +- 0.00% )
13,417,461 branch-misses # 0.16% of all branches ( +- 0.18% )
14.4259 +- 0.0388 seconds time elapsed ( +- 0.27% )
```
確立測試方法之後再回去重新測量使用 `ubignum_copy()` 的版本,得:
```shell
$ bash measure
...
Performance counter stats for 'taskset -c 1 ./exp 1' (10 runs):
25,215,127,347 cycles ( +- 2.31% )
51,971,604,860 instructions # 2.06 insn per cycle ( +- 0.00% )
2,423,932 cache-misses # 13.441 % of all cache refs ( +- 18.29% )
18,033,276 cache-references ( +- 24.11% )
8,811,912,111 branch-instructions ( +- 0.00% )
13,058,167 branch-misses # 0.15% of all branches ( +- 1.10% )
15.804 +- 0.367 seconds time elapsed ( +- 2.32% )
```
對照之下,從舊版到新版雖然 cache-miss-rate 由 13% 增長到 20% ,但我們可以注意到 cache-references 總數大約降低了一半,我推測這個差距來自於 copy 的過程對資料的 access 。
因此就 data cache 而言,因為省下這些操作,新版仍是更佳的實作方式。
20% 的 cache miss 非常驚人,因此目前最直接的改進目標就是要想辦法降低 miss rate 。
---
引入 `ubignum_square()`
```shell
Performance counter stats for 'taskset -c 1 ./exp 1' (10 runs):
25,405,898,533 cycles ( +- 1.21% )
53,288,279,701 instructions # 2.10 insn per cycle ( +- 0.00% )
1,370,567 cache-misses # 20.309 % of all cache refs ( +- 34.02% )
6,748,682 cache-references ( +- 32.89% )
9,033,980,854 branch-instructions ( +- 0.00% )
12,907,672 branch-misses # 0.14% of all branches ( +- 0.62% )
15.921 +- 0.193 seconds time elapsed ( +- 1.21% )
```
差異實在也不是很明顯,甚至是落後於全數使用 `ubignum_mult()` 的版本。
雖然我想要嘗試測到大一點的 case ,
---
執行測試有時會出現頻繁的失敗,不確定原因是什麼,但與引入 `ubignum_square()` 應該沒有關聯,因為即使改回 mult 的版本也沒有恢復。其錯誤情景如下:
```shell
$ sudo perf stat -r 10 -e cycles,instructions,cache-misses,cache-references,branch-instructions,branch-misses taskset -c 0 ./exp 1 > /dev/null
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
taskset: 強制結束
```
我想看看是否是 `fib_fast()` 的計算過程中的問題,例如 reallocation 失敗等等,因此嘗試引入 `flag` 變數來紀錄:
:::spoiler `fib_fast()`
```c
static ubn *fib_fast(long long k)
{
ubn *fast[5] = {NULL};
bool flag = true;
if (k == 0) {
flag &= ubignum_init(&fast[0]);
ubignum_zero(fast[0]);
return fast[0];
} else if (k == 1) {
flag &= ubignum_init(&fast[0]);
ubignum_uint(fast[0], 1);
return fast[0];
}
for (int i = 0; i < 5; i++)
flag &= ubignum_init(&fast[i]);
ubignum_zero(fast[1]);
ubignum_uint(fast[2], 1);
int n = 1;
for (int currbit = 1 << (32 - __builtin_clzll(k) - 1 - 1); currbit;
currbit = currbit >> 1) {
/* compute 2n-1 */
flag &= ubignum_square(fast[1], &fast[0]);
flag &= ubignum_square(fast[2], &fast[3]);
//flag &= ubignum_mult(fast[1], fast[1], &fast[0]);
//flag &= ubignum_mult(fast[2], fast[2], &fast[3]);
flag &= ubignum_add(fast[0], fast[3], &fast[3]);
/* compute 2n */
flag &= ubignum_left_shift(fast[1], 1, &fast[4]);
flag &= ubignum_add(fast[4], fast[2], &fast[4]);
flag &= ubignum_mult(fast[4], fast[2], &fast[4]);
n *= 2;
if (k & currbit) {
flag &= ubignum_add(fast[3], fast[4], &fast[0]);
n++;
ubignum_swapptr(&fast[2], &fast[0]);
ubignum_swapptr(&fast[1], &fast[4]);
} else {
ubignum_swapptr(&fast[2], &fast[4]);
ubignum_swapptr(&fast[1], &fast[3]);
}
}
ubignum_free(fast[0]);
ubignum_free(fast[1]);
ubignum_free(fast[3]);
ubignum_free(fast[4]);
if(!flag)
printk(KERN_INFO "flag is %d\n", flag);
return fast[2];
}
```
:::
\
執行後於 `dmesg` 中並未看到任何 `flag` 相關的報錯,不過出現了一段紅底黑字:
```
[ 990.777318] BUG: unable to handle page fault for address: ffff96d435beb220
[ 990.777322] #PF: supervisor read access in kernel mode
[ 990.777324] #PF: error_code(0x0000) - not-present page
```
---
## June Rework
隔一段時間沒有著手於這份專案,回來先進行一些實驗重做等等,並補上一些之前忽略的內容。
由於第一次做的時候認為需要專注於提昇運算 Fibonacci number 的速度,而只測量計算消耗的時間。然而這類 character device 在使用上總是要 read,因此應該也進行實驗量測 user space 到 kernel space 的時間、以及資料複製的時間等等納入成本計算,進行效能的全盤考量,完整地進行實驗才能正確認知到效能瓶頸何在。
### user-kernel transition time
#### 時間量測
首先實驗量測由 user mode 進出 kernel mode 的耗時
:::spoiler code
```c
for (int i = 0; i <= offset; i++) {
lseek(fd, i, SEEK_SET);
unsigned long long ucons = 0, kcons = 0;
for (int j = 0; j < TEST_NUM; j++) {
struct timespec t1, t2;
clock_gettime(CLOCK_MONOTONIC, &t1);
kcons += write(fd, NULL, select);
clock_gettime(CLOCK_MONOTONIC, &t2);
ucons += (unsigned long long) (t2.tv_sec * 1e9 + t2.tv_nsec) -
(t1.tv_sec * 1e9 + t1.tv_nsec);
}
printf("%d,%llu,%llu,%llu\n", i, kcons / TEST_NUM, ucons / TEST_NUM, (ucons - kcons)/TEST_NUM);
}
```
```c
/* write operation is skipped */
/* try to measure time within this function */
static ssize_t fib_write(struct file *file,
const char *buf,
size_t size,
loff_t *offset)
{
ubn *N = NULL; // do not initialize
ktime_t kt = ktime_get();
switch (size) {
case 0:
N = fib_sequence(*offset);
break;
case 1:
N = fib_fast(*offset);
break;
default:
return 0;
}
ubignum_free(N);
kt = ktime_sub(ktime_get(), kt);
return (ssize_t) ktime_to_ns(kt);
}
```
:::
\
以 iteration 算法測試,並使用較小的 offset 來突顯該數值,兩個時間相減大致可得 (user $\to$ kernel) $+$ (kernel $\to$ user) 的時間。

即使用上了[作業說明](https://hackmd.io/@sysprog/linux2022-fibdrv)和[參考共筆](https://hackmd.io/@KYWeng/rkGdultSU)提及的所有方式,仍會產生這種巨大的 peak。從 csv 檔觀察,發現其位於 $n = 13,43,44,63,64$
- fib(13) = 233
- fib(43) = 433494437
- fib(44) = 701408733
- fib(63) = 6557470319842
- fib(64) = 10610209857723
第一猜想是使用 `ubignum_resize()` 導致的消耗,但結果這幾個值看起來也並非是 `UINT32_MAX` 或 `UINT64_MAX`,從此 value 上看不出有特別意義。接著又測試幾次,發現每次 peak 都產生在不同地方。
試著排除 peak 點再次作圖,結果發現在其它情況我的量測也不太穩定。

換成另一種叫簡易的方式來測量:將 `fib_write()` 函式清空並直接把 user space 下的來回時間當成是 transition time。
:::success
注意到這時 x 軸 $n$ 不再有意義
:::

無論怎麼做都一直量到 peak,但可以看出這個來回時間幾乎都非常貼近 1000 ns。
為了考察這個問題,我試著把每次測試中過大的的耗時數值印出來,結果發現偶爾會出現 case 有十倍左右的時間消耗。於是我了解到[排除統計上的離群值](https://hackmd.io/@sysprog/linux2022-fibdrv#%E7%94%A8%E7%B5%B1%E8%A8%88%E6%89%8B%E6%B3%95%E5%8E%BB%E9%99%A4%E6%A5%B5%E7%AB%AF%E5%80%BC)有其必要性,果然是不該偷吃步。
#### 排除統計離群值
在統計的過程中,對於某些不特定的 $n$ 會出現標準差大於平均值的奇怪現象,於是我同樣把過程中的資料全部存下來,再去奇怪的 $n$ 看看出了什麼問題,借助 `sprintf()` 的小技巧來把資料全部存起來:
```c
char name[128];
sprintf(name, "data/%d.dat", i);
FILE *f = fopen(name, "w");
```
我這裡跑完時間測量後,看到 $n=67$ 時 $\mu=1073, \; \sigma=1322$,於是查詢 **data/67.dat**,原來其中一次的耗時測量到 42656。然而經過離群值的排除,調整後的平均值正常地落在了 1085。
:::spoiler outlier elimination
```c
/* Eliminate 5% outliers and compute the average.
*/
uint64_t average(int64_t *nums, int numsSize)
{
int64_t mu = 0;
for (int i = 0; i < numsSize; i++)
mu += nums[i];
mu /= numsSize;
int64_t deviation = 0;
for (int i = 0; i < numsSize; i++)
deviation += (nums[i] - mu) * (nums[i] - mu) / numsSize;
deviation = sqrt(deviation);
for (int i = 0; i < numsSize; i++)
if (nums[i] >= mu + 2 * deviation || nums[i] <= mu - 2 * deviation)
nums[i] = -1;
// printf("%ld, %ld\t", mu, deviation);
uint64_t ans = 0;
for (int i = 0; i < numsSize; i++)
if (nums[i] != -1)
ans += nums[i];
return ans / (numsSize / 20 * 19);
}
```
其中 `sqrt()` 為我自己實作的整數開根號函式。
:::
\
至此我們成功排除掉了 peak 的問題,於是耗時的細微變化便可在圖表上呈現出來。

圖表的刻度變小之後,可以看出測量上有 100 ~ 300 ns 的起伏,仍無法達到如[ KYG 同學的平滑結果](https://hackmd.io/@KYWeng/rkGdultSU#user-space-%E8%88%87-kernel-space-%E7%9A%84%E5%82%B3%E9%81%9E%E6%99%82%E9%96%93)。這裡由於任何 $n$ 的輸入都是直接 return 沒有計算,因此最理想的狀況應該要是一條水平線。
打算做一個 commit 時收到警告訊息:
```
Dangerous function detected in /home/tinyynoob/code/fibdrv/exp.c
54: sprintf(name, "data/%d.dat", i);
```
於是改為使用 `snprint()` 並順利完成 commit。
#### 於 desktop 量測
由於怎麼測總是拿到漂浮的數據,我想 laptop 或許天生就較為不穩定,因此我想從 laptop 改成用 desktop 跑跑看是否會出現相同的結果。
:::spoiler cpu info
```
架構: x86_64
CPU 作業模式: 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
每核心執行緒數: 1
每通訊端核心數: 4
Socket(s): 1
NUMA 節點: 1
供應商識別號: GenuineIntel
CPU 家族: 6
型號: 94
Model name: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
製程: 3
CPU MHz: 2511.819
CPU max MHz: 3600.0000
CPU min MHz: 800.0000
BogoMIPS: 6399.96
虛擬: VT-x
L1d 快取: 128 KiB
L1i 快取: 128 KiB
L2 快取: 1 MiB
L3 快取: 6 MiB
NUMA node0 CPU(s): 0-3
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cach
e flushes, SMT disabled
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled v
ia prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user
pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_
FW, STIBP disabled, RSB filling
Vulnerability Srbds: Mitigation; Microcode
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT disabled
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtr
r pge mca cmov pat pse36 clflush dts acpi mmx f
xsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rd
tscp lm constant_tsc art arch_perfmon pebs bts
rep_good nopl xtopology nonstop_tsc cpuid aperf
mperf pni pclmulqdq dtes64 monitor ds_cpl vmx s
mx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid s
se4_1 sse4_2 x2apic movbe popcnt tsc_deadline_t
imer aes xsave avx f16c rdrand lahf_lm abm 3dno
wprefetch cpuid_fault invpcid_single pti ssbd i
brs ibpb stibp tpr_shadow vnmi flexpriority ept
vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2
smep bmi2 erms invpcid rtm mpx rdseed adx smap
clflushopt intel_pt xsaveopt xsavec xgetbv1 xsa
ves dtherm ida arat pln pts hwp hwp_notify hwp_
act_window hwp_epp md_clear flush_l1d
```
:::
\
然而同樣的程式碼和 scripts 移過來執行竟然出現 segmentation fault。
我使用 gdb 工具查詢,發現出錯在 `fprintf()` 處,我的 `f` 意圖打開目錄 **data** 下的檔案,然而不存在這個 directory 因此產生記憶體區段錯誤,`mkdir data` 後解決。
跑出來的圖表:

雖然看似仍有很大的浮動,但仔細觀察可以注意到 y 軸的刻度由 50 降至 10,確實穩定了不少。從比較大的 scale 看是像這樣子:

雖然還沒辦法變成完全的水平線,但已從最初上千 ns 的 peak 改善許多。
> 版本 commit id:b80d536b08bd1a3c3325b53d684f97e39ac734f9
接著再把之前從 `fib_write()` 清空的內容補回進行測試:

可以看出無論 $n$ 的變化,user space 到 kernel space 的來回時間穩定落在 600 ns 左右。
即使一路算到 $n=500$ 也是如此:

> 版本 id:766a4b80f6fbcf8a0f5ee0931113b16e6ea629ec
### 運算時間
繼續使用 desktop 來重新量測兩種 Fibonacci number 算法在 kernel space 的耗時:

像之前一樣看到我實作的 fast fib 算法會出現這種階梯形的跳躍現象,這次我想要好好研究這些階梯的成因。
先從小一點的 case 開始探討:

從 **fast.csv** 查看發現跳點出現在 $n=2,4,8,16,32,64$,每次往上跳 500 或 1000 ns。每個跳躍竟然都恰好落在 2 的冪,這肯定有很好的理由可以解釋。
仔細想想之後也很直觀,因為 fast fib 算法每次迴圈都會將 $n$ 加倍,因此需要 $\log(n)$ 次 iteration 以算完要求結果。而這麼解釋就代表每個跳躍便是 fast doubling 算法 each iteration 的耗時,從 `fast_fib()` 程式碼也可以看出有個迴圈:
```c
int n = 1;
for (int currbit = 1 << (32 - __builtin_clzll(k) - 1 - 1); currbit;
currbit = currbit >> 1) {
/* compute 2n-1 */
flag &= ubignum_square(fast[1], &fast[0]);
flag &= ubignum_square(fast[2], &fast[3]);
flag &= ubignum_add(fast[0], fast[3], &fast[3]);
/* compute 2n */
flag &= ubignum_left_shift(fast[1], 1, &fast[4]);
flag &= ubignum_add(fast[4], fast[2], &fast[4]);
flag &= ubignum_mult(fast[4], fast[2], &fast[4]);
n *= 2;
if (k & currbit) {
flag &= ubignum_add(fast[3], fast[4], &fast[0]);
n++;
ubignum_swapptr(&fast[2], &fast[0]);
ubignum_swapptr(&fast[1], &fast[4]);
} else {
ubignum_swapptr(&fast[2], &fast[4]);
ubignum_swapptr(&fast[1], &fast[3]);
}
}
```
在這裡可以簡單比較兩種算法的特性如下:
||sequencial|fast doubling|
|:-:|:-:|:-:|
|迭代次數|$n$|$\lfloor\log(n)\rfloor$|
|每輪迭代的複雜度|低,約 ==16== ns|高,約 ==834== ns|
> 該數字使用 $n=0$ 到 $n=500$ 的測量結果算得
### 比較
#### KYG 同學
為了確認自己實作的水準,clone [KYG 同學的 fibdrv](https://github.com/KYG-yaya573142/fibdrv) 進行對比,使用 optimize-bn 分支在我的電腦上執行 `make plot`,結果超乎我的預期:

耗時僅僅我的 $1/10$,顯然我的程式碼有不只一點改進空間。此外,為何使用相同的設定,它的程式就能有非常平滑的結果,而我卻會一直抖動,這也需要再好好研究。
但仔細去看程式碼發現其測量有誤,我 clone 到的版本並非 bignum 的實作。接著繼續閱讀之後,發現同學並未把自己的最新版本推上 github,因此我無法在自己的電腦上與其進行耗時比較。
:::info
需要找找看哪個同學的程式碼適合拿來做效能對比
:::
### 研讀 [0xff07 同學的實作](https://github.com/0xff07/bignum/tree/fibdrv)
研讀同學的實作,看看是否有值得借鏡的地方。
#### apm_internal.h
> 此 apm 取的是 Arbitrary-Precision arithmetic 的意思
首先我看到同學將 inline assembly 的部份包裝成 macro:
```c
#if defined(__amd64__) || defined(__x86_64__)
#define digit_mul(u, v, hi, lo) \
__asm__("mulq %3" : "=a"(lo), "=d"(hi) : "%0"(u), "rm"(v))
#define digit_div(n1, n0, d, q, r) \
__asm__("divq %4" : "=a"(q), "=d"(r) : "0"(n0), "1"(n1), "rm"(d))
#endif
```
我認為這是非常好的想法,由於 inline assembly 的語法邏輯跟 C 的邏輯有些不同,因此包裝起來可以使人閱讀時統一依 C 的邏輯理解,較為易讀,同時也方便重複使用。
同時他在無法使用 x86 assembly 的情況下仍提供了對應實作:
```c
#ifndef digit_mul
#define digit_mul(u, v, hi, lo) \
do { \
apm_digit __u0 = (u); \
apm_digit __u1 = __u0 >> APM_DIGIT_HSHIFT; \
__u0 &= APM_DIGIT_LMASK; \
apm_digit __v0 = (v); \
apm_digit __v1 = __v0 >> APM_DIGIT_HSHIFT; \
__v0 &= APM_DIGIT_LMASK; \
apm_digit __lo = __u0 * __v0; \
apm_digit __hi = __u1 * __v1; \
__u1 *= __v0; \
__v0 = __u1 << APM_DIGIT_HSHIFT; \
__u1 >>= APM_DIGIT_HSHIFT; \
__hi += __u1 + ((__lo += __v0) < __v0); \
__v1 *= __u0; \
__u0 = __v1 << APM_DIGIT_HSHIFT; \
__v1 >>= APM_DIGIT_HSHIFT; \
__hi += __v1 + ((__lo += __u0) < __u0); \
(lo) = __lo; \
(hi) = __hi; \
} while (0)
#endif
```
#### apm.c apm.h
這部份提供許多大數的 operations 包括:
- increment/decrement
- add
- sub
- dmul,計算 partial product 的函式
- shift
- cmp
預設是 `u` 和 `v` 運算後將結果放到 `w`,其中沒看到計算 apm 乘法的函式,僅有計算 partial product 的 `dmul()`。
:::success
apm 乘法被獨立寫在 **mul.c**
:::
- 帶 *i* postfix 的 operations 似乎是代表 in-place,都是 `u`, `v` 運算後放到 `u`
- 帶 *n* postfix 的 operations 不確定語意上是代表哪個英文單字,不過其共通點為 `u`, `v` 兩數的初始 size 相同
#### mul.c
在其乘法實作當中混用了兩種乘法
- base,最原始的直式乘法方式
- [Karatsuba multiplication](https://en.wikipedia.org/wiki/Karatsuba_algorithm)
短碼時使用直式乘法,長碼時使用 Karatsuba,使兩者各展所長。
##### Karatsuba multiplication 解說
假設我們要作 $x\times y$,那麼藉由分解較大的 $x$ 及 $y$,可以達到:
$$
\begin{aligned}
x\times y &= (x_1\times 2^b+x_0)\times(y_1\times 2^b+y_0) \\
&= (x_1 y_1)2^{2b} + (x_1 y_0+ x_0 y_1)2^b + (x_0 y_0)
\end{aligned}
$$
因 $(x_1 y_0+ x_0 y_1) = (x_1+x_0)\times(y_1+y_0) - x_1y_1 - x_0y_0$,可將乘法次數由 4 次降到 3 次。
此外每個 $x_iy_i$ 又呈現 $x\times y$ 的形式,因此可以再次遞迴下去,總結而言這是個 divide and conquer 的算法。
#### bn.h
此部份將前面大數相關的 type 使用 struct 包起來:
```c
typedef struct {
apm_digit *digits; /* Digits of number. */
apm_size size; /* Length of number. */
apm_size alloc; /* Size of allocation. */
unsigned sign : 1; /* Sign bit. */
} bn, bn_t[1];
```
並使用 apm 的介面提供一系列的 arithmetic operations,再次包裝成 bn 介面。