# 2022q1 Homework3 (fibdrv) contributed by < [`tinyynoob`](https://github.com/tinyynoob) > > [作業要求](https://hackmd.io/@sysprog/linux2022-fibdrv) > 最近的方向可以跳到 [June rework](https://hackmd.io/MJ1kMQBLRBCuo08yEmhDeQ?view#June-Rework) 開始看 > 因為一篇筆記放不下了,再開一篇:[fibdrv 2](https://hackmd.io/@tinyynoob/linux2022q1-fibdrv2) ## 研讀資料 ### Fibonacci 相關性質 > https://hackmd.io/@sysprog/fibonacci-number > [Algorithms for Computing Fibonacci Numbers Quickly](https://ir.library.oregonstate.edu/downloads/t435gg51w) [Fast doubling 算法](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) $$\begin{aligned} F_{2n+1} &= F_{n+1}^2 + F_n^2 \\ F_{2n} &= F_n\cdot(2F_{n+1}-F_n) \end{aligned} $$ 作業說明中有給出 pseudocode : ```c Fast_Fib(n) a = 0; b = 1; // m = 0 for i = (number of binary digit in n) to 1 t1 = a*(2*b - a); t2 = b^2 + a^2; a = t1; b = t2; // m *= 2 if (current binary digit == 1) t1 = a + b; // m++ a = b; b = t1; return a; ``` 注意到這個 `^` 是指 power 而非 xor 。 目前看不出 `clz()` 能對計算 Fib 做到什麼強力優化,似乎只能加快 initialize `i` 而已,需要再研究研究。 :::info 待 ::: ### Follow 《The Linux Kernel Module Programming Guide》 > 將自己研讀的摘要整理到另一篇筆記: https://hackmd.io/@tinyynoob/Hk89qPw-9 本地版本: ```shell $ apt-cache search linux-headers-`uname -r` linux-headers-5.13.0-30-generic - Linux kernel headers for version 5.13.0 on 64 bit x86 SMP ``` 根據書中範例嘗試建立 **hello-1.c** 及對應的 **makefile** 。 make 時馬上出現問題: :::spoiler log ```shell $ make -p | grep PWD scripts/Makefile.build:44: /home/tinyynoob/code/kernel/hello-1/Makefile: 沒有此一檔案或目錄 make[2]: *** 沒有規則可製作目標「/home/tinyynoob/code/kernel/hello-1/Makefile」。 停止。 OLDPWD = /home/tinyynoob/code/kernel PWD = /home/tinyynoob/code/kernel/hello-1 CONFIG_HPWDT_NMI_DECODING = y make[1]: *** [Makefile:1879:/home/tinyynoob/code/kernel/hello-1] 錯誤 2 OLDPWD = /home/tinyynoob/code/kernel PWD = /home/tinyynoob/code/kernel/hello-1 CONFIG_HPWDT_NMI_DECODING = y @echo ' Syntax: make -C path/to/kernel/src M=$$PWD target' make: *** [makefile:7:all] 錯誤 2 PWD := /home/tinyynoob/code/kernel/hello-1 OLDPWD = /home/tinyynoob/code/kernel echo $((PWD) make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean ``` ::: \ 發現原來檔名把 `Makefile` 寫成 `makefile` 也不行,更改後得到: :::spoiler log ```shell $ make make -C /lib/modules/5.13.0-30-generic/build M=/home/tinyynoob/code/kernel/hello-1 modules make[1]: 進入目錄「/usr/src/linux-headers-5.13.0-30-generic」 CC [M] /home/tinyynoob/code/kernel/hello-1/hello-1.o MODPOST /home/tinyynoob/code/kernel/hello-1/Module.symvers CC [M] /home/tinyynoob/code/kernel/hello-1/hello-1.mod.o LD [M] /home/tinyynoob/code/kernel/hello-1/hello-1.ko BTF [M] /home/tinyynoob/code/kernel/hello-1/hello-1.ko Skipping BTF generation for /home/tinyynoob/code/kernel/hello-1/hello-1.ko due to unavailability of vmlinux make[1]: 離開目錄「/usr/src/linux-headers-5.13.0-30-generic」 ``` ::: \ 看似往正確的方向前進了,也有產出 **hello-1.ko** ,但沒有完全成功,它在 `vmlinux` 遇到問題 :::info `vmlinux` 問題尚未解決 ::: 根據 `modinfo` 可以發現在 **hello-1.ko** 的 vermagic 欄位有 `modversions` 。回去閱讀書中提到 Modversioning 的部份,得知我可能需要關掉它。 參考[此](https://bugs.launchpad.net/ubuntu/+source/linux-raspi2/+bug/1863245),試著進到 /boot/config-5.13.0-30-generic 關掉 `CONFIG_MODVERSIONS` 。 :::info 關掉 modversions 但尚未 recompile kernel ::: 把 **hello-1.c** 裡的 return 值改成負數會導致: ```shell insmod: ERROR: could not insert module hello-1.ko: Operation not permitted ``` ### 研讀數字字串運算範例程式 > [source code](https://github.com/AdrianHuang/fibdrv/tree/big_number) 數字字串在運算時可以因應字串 size 選擇不同的策略。 範例程式使用的字串使用 union 定義如下: ```c= typedef union { /* allow strings up to 15 
bytes to stay on the stack * use the last byte as a null terminator and to store flags * much like fbstring: * https://github.com/facebook/folly/blob/master/folly/docs/FBString.md */ char data[16]; struct { uint8_t filler[15], /* how many free bytes in this stack allocated string * same idea as fbstring */ space_left : 4, /* if it is on heap, set to 1 */ is_ptr : 1, is_large_string : 1, flag2 : 1, flag3 : 1; }; /* heap allocated */ struct { char *ptr; /* supports strings up to 2^MAX_STR_LEN_BITS - 1 bytes */ size_t size : MAX_STR_LEN_BITS, /* capacity is always a power of 2 (unsigned)-1 */ capacity : 6; /* the last 4 bits are important flags */ }; } xs; ``` - 在短碼的情況會使用第 9 行的 struct ,直接將資料字串存在 `filler` 並把一些 flag 資訊放在最後一個 byte - 在長碼的情況會使用第 20 行的 struct ,使用一個 `ptr` 指標並配置空間在 heap ,後面的 8 byte 拿來存放一些 flag :::warning 還沒完全理解在長碼情況下, 1. 如何繼續使用 `is_ptr` flag 2. `capacity` 使用的詳細狀況 ::: 範例程式在 `fib_sequence()` 計算 Fib ,算法使用逐項相加: ```c for (i = 2; i <= k; i++) string_number_add(&f[i - 1], &f[i - 2], &f[i]); ``` 因此若是我們能改善乘法及平方的複雜度,則可以使用 fast doubling 的技巧來改善目前算法。 ### Linux 效能分析 嘗試查看 CPU affinity : ```shell $ taskset -p 8528 pid 8528 目前的近似者遮罩:ff ``` ff 的二進位為 11111111 ,代表這個 process 可以在 7~0 任意一個 CPU 上執行。試著更改: ```shell $ taskset -p FA 8528 pid 8528 目前的近似者遮罩:ff pid 8528 新的近似者遮罩:fa ``` 根據[網路上的教學](https://charleslin74.pixnet.net/blog/post/405173965),在 */boot/grub/grub.cfg* 中加入 `isolcpus=0,1` 。 最後,完成 [排除干擾效能分析的因素](https://hackmd.io/PMozqaxqRA-jF2KmgTZcPQ?view#%E6%8E%92%E9%99%A4%E5%B9%B2%E6%93%BE%E6%95%88%E8%83%BD%E5%88%86%E6%9E%90%E7%9A%84%E5%9B%A0%E7%B4%A0) 中的所有設定。 reboot 之後查看 process 發現 core mask 變成了 fc : ```shell $ taskset -p 1 pid 1 目前的近似者遮罩:fc ``` 雖然有些已經變成 fc 但有些仍顯示 ff ,因此重新依據[參考共筆](https://hackmd.io/@KYWeng/rkGdultSU)進行設定。結果仍失敗,目前的在設定 smp_affinity 上因腳本錯誤遇到困難。 --- Fork 嘗試掛載模組之後輸入 `sudo ./client` 可以得到 :::spoiler 結果 ```shell Reading from /dev/fibonacci at offset 0, returned the sequence 0. Reading from /dev/fibonacci at offset 1, returned the sequence 1. Reading from /dev/fibonacci at offset 2, returned the sequence 1. Reading from /dev/fibonacci at offset 3, returned the sequence 2. Reading from /dev/fibonacci at offset 4, returned the sequence 3. Reading from /dev/fibonacci at offset 5, returned the sequence 5. Reading from /dev/fibonacci at offset 6, returned the sequence 8. Reading from /dev/fibonacci at offset 7, returned the sequence 13. Reading from /dev/fibonacci at offset 8, returned the sequence 21. Reading from /dev/fibonacci at offset 9, returned the sequence 34. Reading from /dev/fibonacci at offset 10, returned the sequence 55. Reading from /dev/fibonacci at offset 11, returned the sequence 89. Reading from /dev/fibonacci at offset 12, returned the sequence 144. Reading from /dev/fibonacci at offset 13, returned the sequence 233. Reading from /dev/fibonacci at offset 14, returned the sequence 377. Reading from /dev/fibonacci at offset 15, returned the sequence 610. Reading from /dev/fibonacci at offset 16, returned the sequence 987. Reading from /dev/fibonacci at offset 17, returned the sequence 1597. Reading from /dev/fibonacci at offset 18, returned the sequence 2584. Reading from /dev/fibonacci at offset 19, returned the sequence 4181. Reading from /dev/fibonacci at offset 20, returned the sequence 6765. Reading from /dev/fibonacci at offset 21, returned the sequence 10946. Reading from /dev/fibonacci at offset 22, returned the sequence 17711. Reading from /dev/fibonacci at offset 23, returned the sequence 28657. Reading from /dev/fibonacci at offset 24, returned the sequence 46368. 
Reading from /dev/fibonacci at offset 25, returned the sequence 75025. Reading from /dev/fibonacci at offset 26, returned the sequence 121393. Reading from /dev/fibonacci at offset 27, returned the sequence 196418. Reading from /dev/fibonacci at offset 28, returned the sequence 317811. Reading from /dev/fibonacci at offset 29, returned the sequence 514229. Reading from /dev/fibonacci at offset 30, returned the sequence 832040. Reading from /dev/fibonacci at offset 31, returned the sequence 1346269. Reading from /dev/fibonacci at offset 32, returned the sequence 2178309. Reading from /dev/fibonacci at offset 33, returned the sequence 3524578. Reading from /dev/fibonacci at offset 34, returned the sequence 5702887. Reading from /dev/fibonacci at offset 35, returned the sequence 9227465. Reading from /dev/fibonacci at offset 36, returned the sequence 14930352. Reading from /dev/fibonacci at offset 37, returned the sequence 24157817. Reading from /dev/fibonacci at offset 38, returned the sequence 39088169. Reading from /dev/fibonacci at offset 39, returned the sequence 63245986. Reading from /dev/fibonacci at offset 40, returned the sequence 102334155. Reading from /dev/fibonacci at offset 41, returned the sequence 165580141. Reading from /dev/fibonacci at offset 42, returned the sequence 267914296. Reading from /dev/fibonacci at offset 43, returned the sequence 433494437. Reading from /dev/fibonacci at offset 44, returned the sequence 701408733. Reading from /dev/fibonacci at offset 45, returned the sequence 1134903170. Reading from /dev/fibonacci at offset 46, returned the sequence 1836311903. Reading from /dev/fibonacci at offset 47, returned the sequence 2971215073. Reading from /dev/fibonacci at offset 48, returned the sequence 4807526976. Reading from /dev/fibonacci at offset 49, returned the sequence 7778742049. Reading from /dev/fibonacci at offset 50, returned the sequence 12586269025. Reading from /dev/fibonacci at offset 51, returned the sequence 20365011074. Reading from /dev/fibonacci at offset 52, returned the sequence 32951280099. Reading from /dev/fibonacci at offset 53, returned the sequence 53316291173. Reading from /dev/fibonacci at offset 54, returned the sequence 86267571272. Reading from /dev/fibonacci at offset 55, returned the sequence 139583862445. Reading from /dev/fibonacci at offset 56, returned the sequence 225851433717. Reading from /dev/fibonacci at offset 57, returned the sequence 365435296162. Reading from /dev/fibonacci at offset 58, returned the sequence 591286729879. Reading from /dev/fibonacci at offset 59, returned the sequence 956722026041. Reading from /dev/fibonacci at offset 60, returned the sequence 1548008755920. Reading from /dev/fibonacci at offset 61, returned the sequence 2504730781961. Reading from /dev/fibonacci at offset 62, returned the sequence 4052739537881. Reading from /dev/fibonacci at offset 63, returned the sequence 6557470319842. Reading from /dev/fibonacci at offset 64, returned the sequence 10610209857723. Reading from /dev/fibonacci at offset 65, returned the sequence 17167680177565. Reading from /dev/fibonacci at offset 66, returned the sequence 27777890035288. Reading from /dev/fibonacci at offset 67, returned the sequence 44945570212853. Reading from /dev/fibonacci at offset 68, returned the sequence 72723460248141. Reading from /dev/fibonacci at offset 69, returned the sequence 117669030460994. Reading from /dev/fibonacci at offset 70, returned the sequence 190392490709135. 
Reading from /dev/fibonacci at offset 71, returned the sequence 308061521170129. Reading from /dev/fibonacci at offset 72, returned the sequence 498454011879264. Reading from /dev/fibonacci at offset 73, returned the sequence 806515533049393. Reading from /dev/fibonacci at offset 74, returned the sequence 1304969544928657. Reading from /dev/fibonacci at offset 75, returned the sequence 2111485077978050. Reading from /dev/fibonacci at offset 76, returned the sequence 3416454622906707. Reading from /dev/fibonacci at offset 77, returned the sequence 5527939700884757. Reading from /dev/fibonacci at offset 78, returned the sequence 8944394323791464. Reading from /dev/fibonacci at offset 79, returned the sequence 14472334024676221. Reading from /dev/fibonacci at offset 80, returned the sequence 23416728348467685. Reading from /dev/fibonacci at offset 81, returned the sequence 37889062373143906. Reading from /dev/fibonacci at offset 82, returned the sequence 61305790721611591. Reading from /dev/fibonacci at offset 83, returned the sequence 99194853094755497. Reading from /dev/fibonacci at offset 84, returned the sequence 160500643816367088. Reading from /dev/fibonacci at offset 85, returned the sequence 259695496911122585. Reading from /dev/fibonacci at offset 86, returned the sequence 420196140727489673. Reading from /dev/fibonacci at offset 87, returned the sequence 679891637638612258. Reading from /dev/fibonacci at offset 88, returned the sequence 1100087778366101931. Reading from /dev/fibonacci at offset 89, returned the sequence 1779979416004714189. Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120. Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309. Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 93, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 94, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 95, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 96, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 97, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 98, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 100, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 99, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 98, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 97, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 96, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 95, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 94, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 93, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429. Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309. Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120. Reading from /dev/fibonacci at offset 89, returned the sequence 1779979416004714189. 
Reading from /dev/fibonacci at offset 88, returned the sequence 1100087778366101931. Reading from /dev/fibonacci at offset 87, returned the sequence 679891637638612258. Reading from /dev/fibonacci at offset 86, returned the sequence 420196140727489673. Reading from /dev/fibonacci at offset 85, returned the sequence 259695496911122585. Reading from /dev/fibonacci at offset 84, returned the sequence 160500643816367088. Reading from /dev/fibonacci at offset 83, returned the sequence 99194853094755497. Reading from /dev/fibonacci at offset 82, returned the sequence 61305790721611591. Reading from /dev/fibonacci at offset 81, returned the sequence 37889062373143906. Reading from /dev/fibonacci at offset 80, returned the sequence 23416728348467685. Reading from /dev/fibonacci at offset 79, returned the sequence 14472334024676221. Reading from /dev/fibonacci at offset 78, returned the sequence 8944394323791464. Reading from /dev/fibonacci at offset 77, returned the sequence 5527939700884757. Reading from /dev/fibonacci at offset 76, returned the sequence 3416454622906707. Reading from /dev/fibonacci at offset 75, returned the sequence 2111485077978050. Reading from /dev/fibonacci at offset 74, returned the sequence 1304969544928657. Reading from /dev/fibonacci at offset 73, returned the sequence 806515533049393. Reading from /dev/fibonacci at offset 72, returned the sequence 498454011879264. Reading from /dev/fibonacci at offset 71, returned the sequence 308061521170129. Reading from /dev/fibonacci at offset 70, returned the sequence 190392490709135. Reading from /dev/fibonacci at offset 69, returned the sequence 117669030460994. Reading from /dev/fibonacci at offset 68, returned the sequence 72723460248141. Reading from /dev/fibonacci at offset 67, returned the sequence 44945570212853. Reading from /dev/fibonacci at offset 66, returned the sequence 27777890035288. Reading from /dev/fibonacci at offset 65, returned the sequence 17167680177565. Reading from /dev/fibonacci at offset 64, returned the sequence 10610209857723. Reading from /dev/fibonacci at offset 63, returned the sequence 6557470319842. Reading from /dev/fibonacci at offset 62, returned the sequence 4052739537881. Reading from /dev/fibonacci at offset 61, returned the sequence 2504730781961. Reading from /dev/fibonacci at offset 60, returned the sequence 1548008755920. Reading from /dev/fibonacci at offset 59, returned the sequence 956722026041. Reading from /dev/fibonacci at offset 58, returned the sequence 591286729879. Reading from /dev/fibonacci at offset 57, returned the sequence 365435296162. Reading from /dev/fibonacci at offset 56, returned the sequence 225851433717. Reading from /dev/fibonacci at offset 55, returned the sequence 139583862445. Reading from /dev/fibonacci at offset 54, returned the sequence 86267571272. Reading from /dev/fibonacci at offset 53, returned the sequence 53316291173. Reading from /dev/fibonacci at offset 52, returned the sequence 32951280099. Reading from /dev/fibonacci at offset 51, returned the sequence 20365011074. Reading from /dev/fibonacci at offset 50, returned the sequence 12586269025. Reading from /dev/fibonacci at offset 49, returned the sequence 7778742049. Reading from /dev/fibonacci at offset 48, returned the sequence 4807526976. Reading from /dev/fibonacci at offset 47, returned the sequence 2971215073. Reading from /dev/fibonacci at offset 46, returned the sequence 1836311903. Reading from /dev/fibonacci at offset 45, returned the sequence 1134903170. 
Reading from /dev/fibonacci at offset 44, returned the sequence 701408733. Reading from /dev/fibonacci at offset 43, returned the sequence 433494437. Reading from /dev/fibonacci at offset 42, returned the sequence 267914296. Reading from /dev/fibonacci at offset 41, returned the sequence 165580141. Reading from /dev/fibonacci at offset 40, returned the sequence 102334155. Reading from /dev/fibonacci at offset 39, returned the sequence 63245986. Reading from /dev/fibonacci at offset 38, returned the sequence 39088169. Reading from /dev/fibonacci at offset 37, returned the sequence 24157817. Reading from /dev/fibonacci at offset 36, returned the sequence 14930352. Reading from /dev/fibonacci at offset 35, returned the sequence 9227465. Reading from /dev/fibonacci at offset 34, returned the sequence 5702887. Reading from /dev/fibonacci at offset 33, returned the sequence 3524578. Reading from /dev/fibonacci at offset 32, returned the sequence 2178309. Reading from /dev/fibonacci at offset 31, returned the sequence 1346269. Reading from /dev/fibonacci at offset 30, returned the sequence 832040. Reading from /dev/fibonacci at offset 29, returned the sequence 514229. Reading from /dev/fibonacci at offset 28, returned the sequence 317811. Reading from /dev/fibonacci at offset 27, returned the sequence 196418. Reading from /dev/fibonacci at offset 26, returned the sequence 121393. Reading from /dev/fibonacci at offset 25, returned the sequence 75025. Reading from /dev/fibonacci at offset 24, returned the sequence 46368. Reading from /dev/fibonacci at offset 23, returned the sequence 28657. Reading from /dev/fibonacci at offset 22, returned the sequence 17711. Reading from /dev/fibonacci at offset 21, returned the sequence 10946. Reading from /dev/fibonacci at offset 20, returned the sequence 6765. Reading from /dev/fibonacci at offset 19, returned the sequence 4181. Reading from /dev/fibonacci at offset 18, returned the sequence 2584. Reading from /dev/fibonacci at offset 17, returned the sequence 1597. Reading from /dev/fibonacci at offset 16, returned the sequence 987. Reading from /dev/fibonacci at offset 15, returned the sequence 610. Reading from /dev/fibonacci at offset 14, returned the sequence 377. Reading from /dev/fibonacci at offset 13, returned the sequence 233. Reading from /dev/fibonacci at offset 12, returned the sequence 144. Reading from /dev/fibonacci at offset 11, returned the sequence 89. Reading from /dev/fibonacci at offset 10, returned the sequence 55. Reading from /dev/fibonacci at offset 9, returned the sequence 34. Reading from /dev/fibonacci at offset 8, returned the sequence 21. Reading from /dev/fibonacci at offset 7, returned the sequence 13. Reading from /dev/fibonacci at offset 6, returned the sequence 8. Reading from /dev/fibonacci at offset 5, returned the sequence 5. Reading from /dev/fibonacci at offset 4, returned the sequence 3. Reading from /dev/fibonacci at offset 3, returned the sequence 2. Reading from /dev/fibonacci at offset 2, returned the sequence 1. Reading from /dev/fibonacci at offset 1, returned the sequence 1. Reading from /dev/fibonacci at offset 0, returned the sequence 0. 
``` ::: \ 可以發現到 92 項以後數值就不再改變,代表這裡存在嚴重缺陷,對於超過 64 位元的數值無能為力。 另外在輸入指令 ```shell $ ls -l /dev/fibonacci ``` 後,不是得到 235 也不是得到 236 而是 505 ```shell crw------- 1 root root 505, 0 3月 11 13:10 /dev/fibonacci ``` > [ktime doc](https://www.kernel.org/doc/html/latest/core-api/timekeeping.html) 看[參考共筆](https://hackmd.io/@KYWeng/rkGdultSU)依樣畫葫蘆,由於目前 **fibdrv.c** 的 `fib_write()` 沒有功能,所以在其中塞入測量時間的程式碼: ```c static ssize_t fib_write(struct file *file, const char *buf, size_t size, loff_t *offset) { ktime_t kt; kt = ktime_get(); fib_sequence(*offset); kt = ktime_sub(ktime_get(), kt); return (ssize_t) ktime_to_ns(kt); } ``` 然而輸出的數值看起來都差不多,回頭想找找看這個 `offset` 到底是怎麼設的。我發現在 **client.c** 呼叫時傳入的參數並不包含 `offset` ,不過在呼叫 `write()` 跟 `read()` 時有個不同是 `read()` 多了一行: ```c lseek(fd, i, SEEK_SET); ``` 其中 `SEEK_SET` 將被展開為 0 。觀察發現呼叫 `lseek()` 會連到 **fibdrv.c** 中的: ```c static loff_t fib_device_lseek(struct file *file, loff_t offset, int orig) { loff_t new_pos = 0; switch (orig) { case 0: /* SEEK_SET: */ new_pos = offset; break; case 1: /* SEEK_CUR: */ new_pos = file->f_pos + offset; break; case 2: /* SEEK_END: */ new_pos = MAX_LENGTH - offset; break; } if (new_pos > MAX_LENGTH) new_pos = MAX_LENGTH; // max case if (new_pos < 0) new_pos = 0; // min case file->f_pos = new_pos; // This is what we'll use now return new_pos; } ``` 雖然可以猜測這個函式的回傳值會更改 `offset` ,但是目前不知道這到底是如何達成的,我甚至不曉得 `offset` 到底存在哪裡。 :::info TODO: - 釐清 `offset` 答: 我認為就是存在 file 結構的 `f_pos` 成員中 ::: ### 研讀參考共筆中的大數運算 > https://hackmd.io/@KYWeng/rkGdultSU#%E5%A6%82%E4%BD%95%E6%B8%9B%E5%B0%91%E5%A4%A7%E6%95%B8%E9%81%8B%E7%AE%97%E7%9A%84%E6%88%90%E6%9C%AC 這份共筆提出多種實作上改進計算 Fib 速度的技巧,摘要如下: #### 調整 Fast Fib 算法 將原先的算法由 $$\begin{aligned} F_{2n+1} &= F_{n+1}^2 + F_n^2 \\ F_{2n} &= F_n\cdot(2F_{n+1}-F_n) \end{aligned} $$ 改良為: $$\begin{aligned} F_{2n-1} &= F_n^2 + F_{n-1}^2 \\ F_{2n} &= F_n\cdot(2F_{n-1}+F_n) \end{aligned} $$ 以迴避所有減法運算,並且不會再出現任何負數,可以全部用 unsigned 型態儲存。 #### 使用 memory pool 在大數結構中引入 memory pool 的概念來避免頻繁的 `realloc()` 。 #### 使用內嵌組合語言加速乘法 做乘法運算時直接指定高低位的位置,而不是使用擴展整數型別。 以參考共筆中的語法為例: ``` __asm__("mulq %3" : "=a"(low), "=d"(high) : "%0"(a[i]), "rm"(k)); assembly language : output : input ``` 其中 instruction 顯然為 `mulq` ,而 1. `=` 是 write only 的意思 2. `a`, `d` 分別表示 `$rax`, `$rdx` 3. `( )` 內使用 C 變數 4. `rm` 表示由 compiler 自己選擇存放位置 5. `%0` 在此等同 `a` 暫存器 6. 
`%3` 代表 `rm` 參考: > https://ithelp.ithome.com.tw/articles/10213500 > https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html #### 引入 [Karatsuba algorithm](https://en.wikipedia.org/wiki/Karatsuba_algorithm) 算法 核心思想為在做乘法時,將一些乘法運算改為加減,考慮: $$\begin{aligned} (a_12^n+a_0)\times(b_12^n+b_0) &= c_22^{2n} + c_12^n +c_0 \\ &= (a_1b_1)2^{2n} + (a_1b_0+a_0b_1)2^n + a_0b_0 \end{aligned} $$ 原先需要做 4 次乘法運算,經過一些設計可以變成: $$ (2^{2n}+2^n)(a_1b_1) + 2^n(a_1-a_0)(b_0-b_1) + (2^n+1)(a_0b_0) $$ 便能少做一次乘法。 ## 改進 fibdrv > [實作程式碼 github](https://github.com/tinyynoob/fibdrv/tree/master) ### 大數運算 在為 fibdrv 改進效能之前,我們需要先確保它的正確性,為此我們首先引入大數運算的機制。對於存放大數的方式前面提出了兩種設計: - 單純使用結構 - 使用 union 達成 small string optimization 由於在計算 Fib 的情境下不會一直需要宣告新的變數,也不會有頻繁居於小數計算的情況,因此我們選擇使用結構的方式,如此也較便利於 maintain resize 的功能。 因為我目前不太熟悉 kernel programming 的各種 function 用法,因此我打算先在 user space 開發,並 debug 確認功能沒問題之後再移植到 kernel space 。 我想借助 macro 來完成此工作,例如這樣: ```c #if DEBUG N = (ubn *) malloc(sizeof(ubn)); #else N = (ubn *) kmalloc(sizeof(ubn), GFP_KERNEL); #endif ``` 由於在計算 Fib 的過程中不會遇到任何負數,因此我的大數運算假設不存在負數,相關的運算也都沒有提供。 引入[共筆用的 macro](https://hackmd.io/@KYWeng/rkGdultSU#%E6%94%B9%E5%96%84%E6%96%B9%E6%A1%88-4---%E5%96%84%E7%94%A8-64-bit-CPU-%E7%9A%84%E5%84%AA%E5%8B%A2) : ```c #if defined(__LP64__) || defined(__x86_64__) || defined(__amd64__) || \ defined(__aarch64__) typedef uint64_t ubn_unit; // typedef unsigned __int128 bn_data_tmp; // gcc support __int128 #define ubn_unit_bit 64 #else typedef uint32_t ubn_unit; // typedef uint64_t bn_data_tmp; #define ubn_unit_bit 32 #endif ``` 目前還未使用到加倍的型別,暫時先註解掉。 基於沒有負數的假設,我們得以直接移除共筆結構中的 `sign` 成員,訂出: ```c /* unsigned big number * @data: MSB:[size-1], LSB:[0] * @size: sizeof(ubn_unit) * @capacity: allocated size */ typedef struct { ubn_unit *data; int size; int capacity; } ubn; ``` 預期 (無號) 大數運算的界面提供以下功能: - `ubignum_init()` - `ubignum_free()` - `ubignum_resize()` - `ubignum_add()` - `ubignum_mult()` - `ubignum_assign()` 目前比較沒有頭緒的是 `assign()` 的作法 #### ubignum_init() :::spoiler 程式碼 ```c bool ubignum_init(ubn *N) { #if DEBUG N = (ubn *) malloc(sizeof(ubn)); #else N = (ubn *) kmalloc(sizeof(ubn), GFP_KERNEL); #endif if (!N) goto struct_aloc_failed; #if DEBUG N->data = (ubn_unit *) calloc(sizeof(ubn_unit), DEFAULT_CAPACITY); // including mem reset #else N->data = (ubn_unit *) kcalloc(sizeof(ubn_unit), DEFAULT_CAPACITY, GFP_KERNEL); // including mem reset #endif if (!N->data) goto data_aloc_failed; N->size = 0; N->capacity = DEFAULT_CAPACITY; return true; data_aloc_failed: free(N); N = NULL; struct_aloc_failed: return false; } ``` ::: 其中 `DEFAULT_CAPACITY` 目前訂為 16 。 靜態分析出現一堆問題,這才發現我原本的寫法根本不會正確把初始化後的 `N` 回傳,因此修改程式碼: #### ubignum_resize() :::spoiler 程式碼 ```c bool ubignum_resize(ubn *N, int new_size) { if (new_size < 0) return false; #if DEBUG ubn_unit *new = (ubn_unit *) realloc(N->data, sizeof(ubn_unit) * new_size); #else ubn_unit *new = (ubn_unit *) krealloc(N->data, sizeof(ubn_unit) * new_size, GFP_KERNEL); #endif if (!new) return false; N = new; N->capacity = new_size; return true; } ``` ::: 目前不確定允許 `new_size` 傳入 0 是否是個好主意。 #### ubignum_add() ```c bool ubignum_add(const ubn *a, const ubn *b, ubn *out) ``` 目前設計讓使用者傳入存放 sum 的變數,但這麼做就需要解決 pointer aliasing 的問題。 由低到高做加法的流程大致可以分成兩段: 1. 兩數相加 2. 一數加上 1. 的 carry out 首先開發 1. 的部份: ```c= ubn_unit carry = 0; int i = 0; for (i = 0; i < min(a->size, b->size); i++) { if (__builtin_expect(i >= out->capacity, 0)) { if (!resize(out, out->capacity * 2)) { return false; } } out->data[i] = a->data[i] + b->data[i] + carry; // The following consideration does not include carry // fix it! 
carry = ((a->data[i] ^ b->data[i]) >> 1 ^ (a->data[i] & b->data[i])) & (1u << (ubn_unit_bit - 1)); } ``` 若超出 `out` 範圍則對它進行 `resize()` ,其中加入 `builtin_expect()` 協助 compiler 進行 branch prediction 。 撰寫過程馬上讓我遭遇了幾個問題: 1. 第 11 行,我嘗試引入 [quiz2 第一題](https://hackmd.io/cmwilKeBSsStEwvKUgkfMA#%E6%B8%AC%E9%A9%97-1) 計算平均的方式再取 MSB 來得出新的 `carry` ,但我發覺這樣的話舊的 `carry` 沒有被加進來,所以在例如 `(a, b, c) = (UINT_MAX, 0, 1)` 的情況下會算出錯誤結果。 2. 我不熟悉 gcc 對於 unsigned 型態的右移會如何處理 根據 [gcc 的 mail list](https://gcc.gnu.org/legacy-ml/gcc/2000-04/msg00152.html) : > For right shifts, if the type is signed, then we use an arithmetic shift; if the type is unsigned then we use a logical. 因此對於 unsigned 右移的行為符合我的預期。 :::info 1\. 待解決 $\to$ 後面完全更換作法就不再有這個問題 ::: 為了應對 pointer aliasing 的問題,我們應該不能直接拿 `a->data` 和 `b->data` 去運算,因為它們在第 9 行可能會被改變 (如果他們跟 `out` 指到同樣位置) ,那麼第 12 行的運算將會錯誤。因此需要先將它們在第 9 行的值存起來,將程式碼改成: ```c= for (i = 0; i < min(a->size, b->size); i++) { if (__builtin_expect(i >= out->capacity, 0)) { if (!resize(out, out->capacity * 2)) { goto realoc_failed; } } ubn_unit an = a->data[i]; ubn_unit bn = b->data[i]; out->data[i] = an + bn + carry; // The following consideration does not include carry // fix it! carry = ((an ^ bn) >> 1 + (an & bn)) & (1u << (ubn_unit_bit - 1)); } ``` 其中第 4 行跳到最後去把 `a` 跟 `b` 還原成原本的樣子,這也是因為 aliasing 才有的問題。 第二段程式碼: ```c= ubn *remain = NULL; if (i == a->size) remain = b; else if (i == b->size) remain = a; for (; carry || i < remain->size; i++) { if (__builtin_expect(i >= out->capacity, 0)) { if (!resize(out, out->capacity * 2)) { return false; } } if (__builtin_expect(i < remain->size, 1)) { out->data[i] = remain->data[i] + carry; carry = ((remain->data[i] ^ carry) >> 1 ^ (remain->data[i] & carry)) & (1u << (ubn_unit_bit - 1)); } else { // the last carry out carries to a new ubn_unit out->data[i] = carry; // = 1 carry = 0; } } ``` 先使用 `remain` 來記住哪一個加數還有剩餘的位數沒加,接著就進入迴圈繼續加完,大部分的情況都會走 12 行,唯一會走到 17 行的情況是最高位有 carry out 而且剛好 carry 到一個新的 `ubn_unit` chunk 。 同樣處理 pointer aliasing 的問題並改進寫法得: ```c ubn *remain = (i == a->size) ? b : a; for (; carry || i < remain->size; i++) { if (__builtin_expect(i >= out->capacity, 0)) { if (!resize(out, out->capacity * 2)) { goto realoc_failed; } } if (__builtin_expect(i < remain->size, 1)) { ubn_unit rn = remain->data[i]; out->data[i] = rn + carry; carry = ((rn ^ carry) >> 1 ^ (rn & carry)) & (1u << (ubn_unit_bit - 1)); } else { // the last carry out carries to a new ubn_unit out->data[i] = carry; // = 1 carry = 0; } } ``` 處理 aliasing 的程式碼: ```c ubn *store = NULL; int alias = 0; if (a == out) // pointer aliasing alias ^= 1; if (b == out) // pointer aliasing alias ^= 2; if (alias) { store = ubignum_duplicate(store, out); if (!store) return false; } ``` 直接使用一個變數 `alias` 中的 bit 來觀察,若有 aliasing 則配置空間給保留原值。 ```c realoc_failed: switch (alias) { case 1: swapptr(&a, &store); break; case 2: case 3: swapptr(&b, &store); case 0: } return false; ``` 注意到這裡 2, 3 共用的原因是,如果 `a` 和 `b` 指到同一個位置,那麼恢復 `b` 也同時恢復了 `a` 。 --- aliasing 的問題比我預期的還更難處理,我直接重改邏輯,變成如果有 aliasing 則配置一個新空間存放計算結果,算完再放到 `out` ,改寫過程中也修復了一些 bug 。 ::: spoiler 更改後的 `ubignum_add()` 程式碼 ```c= /* out = a + b * Aliasing arguments are acceptable. * If it return true, the result is put at @out. * If return false with alias input, the @out would be unchanged. * If return false without alias input, the @out would return neither answer nor * original out. 
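 * ("alias" means a == *out or b == *out, e.g. a call like ubignum_add(x, y, &x);
 *  in that case the sum is accumulated in a freshly initialized ubn and *out is
 *  only redirected to it on success, so a failed reallocation leaves the
 *  caller's operands untouched -- x and y here are just illustrative names)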
*/ bool ubignum_add(const ubn *a, const ubn *b, ubn **out) { ubn *ans = *out; int alias = 0; if (a == *out) // pointer aliasing alias ^= 1; if (b == *out) // pointer aliasing alias ^= 2; if (alias) { // if so, allocate space to store the result if (!ubignum_init(&ans)) return false; } ubn_unit carry = 0; int i = 0; for (i = 0; i < min(a->size, b->size); i++) { if (i >= ans->size) { if (__builtin_expect(i >= ans->capacity, 0)) if (!ubignum_resize(ans, ans->capacity * 2)) goto realoc_failed; ans->size++; } const ubn_unit_extend sum = a->data[i] + b->data[i] + carry; ans->data[i] = sum; carry = sum >> ubn_unit_bit; } const ubn *remain = (i == a->size) ? b : a; for (; carry || i < remain->size; i++) { if (i >= ans->size) { if (__builtin_expect(i >= ans->capacity, 0)) if (!ubignum_resize(ans, ans->capacity * 2)) goto realoc_failed; ans->size++; } if (__builtin_expect(i < remain->size, 1)) { const ubn_unit_extend sum = remain->data[i] + carry; ans->data[i] = sum; carry = sum >> ubn_unit_bit; } else { // the last carry out carries to a new ubn_unit ans->data[i] = carry; // = 1 carry = 0; } } *out = ans; // no condition needed return true; realoc_failed: if (alias) ubignum_free(ans); return false; } ``` ::: \ 其中 `ubn_unit_extend` 在 x64 定義為 `unsigned __int128` ,目前嘗試在 user space 配合 gdb 工具進行測試,得到錯誤的結果, `carry` 沒有被正確計算 ```c carry = sum >> ubn_unit_bit; ``` 疑似 uint128 的 shift operation 有問題,我嘗試去找 [gcc doc](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fint128.html) ,但是內容只有三行,沒有提到是否支援 shift operation 。 我在 assignment 之前加入強制轉型就有我預期的結果了 ```c const ubn_unit_extend sum = (ubn_unit_extend) a->data[i] + b->data[i] + carry; ``` 我好像已經吃了很多次這種虧,卻都沒有記住,我一直誤以為計算的時候就會轉成左邊的型態,這次特別寫進 commit message ,希望能確實記得。 #### ubignum_mult() 雖然前面有提及[加速的算法](https://hackmd.io/MJ1kMQBLRBCuo08yEmhDeQ?both#%E5%BC%95%E5%85%A5-Karatsuba-algorithm-%E7%AE%97%E6%B3%95),但是因為我目前的大數機制沒有減法,所以暫時先略過此項。 在開發時發現 `ubn_unit` 的加法一直重複遇到,所以將其模組化 ```c /* no checking, sum and cout input should be guarantee */ static inline void ubn_unit_add(const ubn_unit a, const ubn_unit b, const int cin, ubn_unit *sum, int *cout) { const ubn_unit_extend SUM = (ubn_unit_extend) a + b + cin; *sum = SUM; *cout = SUM >> ubn_unit_bit; } ``` 長得有點像硬體加法器的樣子。 `cin` 和 `cout` 都只會用到一個 bit ,用什麼型態似乎都蠻浪費的,最後選用了最常見的 int 。 :::spoiler `mult()` 程式碼 ```c= bool ubignum_mult(const ubn *a, const ubn *b, ubn **out) { if (!a || !b || !out || !*out) return false; ubn *ans = *out; int alias = 0; if (a == *out) // pointer aliasing alias ^= 1; if (b == *out) // pointer aliasing alias ^= 2; if (alias) { // if alias, allocate space to store the result if (!ubignum_init(&ans)) return false; } /* keep mcand longer then mplier */ const ubn *mcand = a->size > b->size ? a : b; const ubn *mplier = a->size > b->size ? b : a; ubn *prod; if (!ubignum_init(&prod)) goto prod_aloc_failed; if (prod->capacity < mcand->size + 1) if (!ubignum_resize(&prod, mcand->size + 1)) goto prod_resize_failed; /* The outer loop */ for (int i = 0; i < mplier->size; i++) { /* The inner loop explanation: * a b c = mcand->data[2, 1, 0] * * d = mplier->data[0] * -------- * c c = (high, low) denotes result of c * d * b b * + a a * ---------- * x y ? 
c sum of bb and cc, it may be carry out to x * + a a * * let x be recorded in @carry and y be @overlap * no realloc needed for @prod */ int carry = 0; ubn_unit overlap = 0; for (int j = 0; j < mcand->size; j++) { ubn_unit low, high; // need macro to select appropriate assembly __asm__("mulq %3" : "=a"(low), "=d"(high) : "%0"(mcand->data[j]), "rm"(mplier->data[i])); int cout = 0; ubn_unit_add(low, overlap, carry & 2, &prod->data[j], &cout); carry = (carry << 1) | cout; // update carry where cout is 0 or 1 overlap = high; } prod->data[mcand->size] = overlap + (ubn_unit) carry & 2; // no carry out would be generated if (!ubignum_mult_add(prod, i, &ans)) goto multadd_failed; } *out = ans; return true; multadd_failed: prod_resize_failed: ubignum_free(prod); prod_aloc_failed: if (alias) ubignum_free(ans); return false; } ``` ::: \ 目前的 bug 是 inline asm 沒有正確作用,看起來像是 `low` 跟 `high` 相反了,也可以看出確實是把 `$rdx`, `$rax` 分別接到 `high` 跟 `low` 。 ```shell Breakpoint 2, ubignum_mult (a=0x5555555592a0, b=0x555555559350, out=0x7fffffffdc88) at bignum.h:336 336 ubn_unit_add(low, overlap, carry & 2, &prod->data[j], &cout); (gdb) p mcand->data[j] $1 = 18446744073709551615 (gdb) p i $2 = 0 (gdb) p j $3 = 0 (gdb) p mplier->data[i] $4 = 18446744073709551615 (gdb) p low $5 = 1 (gdb) p high $6 = 18446744073709551614 (gdb) info registers rax rax 0x1 1 (gdb) info registers rdx rdx 0xfffffffffffffffe -2 ``` 難道說它執行 `mulq` instruction 時的資料放置方式是跟原本相反嗎? 我嘗試去查看組合語言的編譯結果,看到: ``` mulq %rdx ``` 雖然它把 `"rm"` 選成 `$rdx` ,但這應該也不會影響指令執行結果才對。 我發現其實它沒有算錯,這個結果是正確的,但因為我對乘法的直覺還不夠,誤用加法的邏輯來思考乘法才誤判結果。 在 pointer aliasing 的情況下無法做到 in place 的算法,否則錯誤處理會出問題。 --- ### 大數運算 第二版 在開發的過程中,因為大數運算難以 debug 只能經由 gdb 慢慢嘗試,因此修修改改花了很多時間,改了又錯又重改的情況多次發生,難以一一列舉,在此大略敘述過程遇到的狀況。 #### pointer aliasing issues 首先是前面提及 pointer aliasing 的問題,這個狀況是我在第一個版本遇到最大的問題,第一次是在開發加法時遇到。例如當想計算 `a = a + b` 時,若加出更大的位數則需要 reallocation ,但中途的 reallocation 失敗會導致傳入值 `a` 被破壞而正確的新 `a` 也沒有算出來,進退兩難的情況。 起初為了應對這個問題,我每次計算都先查看是否有 aliasing ,若有則進行複製, maintain 一份副本以利在 reallocation 失敗時能將數值復原。然而這個方法顯然導致很糟糕的效能,雖然之後為了節省交換的操作而直接計算在新空間,但仍然用了 3 個數。 使用者在使用界面時若指定 `a = a + b` ,那通常直覺也會認為自己使用了 in place 的算法,可是我在實作上卻無論如何都需要 3 個數的空間,違反使用直覺。 :::spoiler 原程式碼 (add) ```c /* (*out) = a + b * Aliasing arguments are acceptable. * If it return true, the result is put at @out. * If return false with alias input, the @out would be unchanged. * If return false without alias input, the @out would return neither answer nor * original out. */ bool ubignum_add(const ubn *a, const ubn *b, ubn **out) { if (!a || !b || !out || !*out) return false; ubn *ans = *out; int alias = 0; if (a == *out) // pointer aliasing alias ^= 1; if (b == *out) // pointer aliasing alias ^= 2; if (alias) { // if alias, allocate space to store the result if (!ubignum_init(&ans)) return false; } int i = 0, carry = 0; for (i = 0; i < min(a->size, b->size); i++) { if (i >= ans->size) { if (__builtin_expect(i >= ans->capacity, 0)) if (!ubignum_resize(&ans, ans->capacity * 2)) goto realoc_failed; ans->size++; } ubn_unit_add(a->data[i], b->data[i], carry, &ans->data[i], &carry); } const ubn *remain = (i == a->size) ? 
b : a; for (; carry || i < remain->size; i++) { if (i >= ans->size) { if (__builtin_expect(i >= ans->capacity, 0)) if (!ubignum_resize(&ans, ans->capacity * 2)) goto realoc_failed; ans->size++; } ubn_unit_add(remain->data[i], 0, carry, &ans->data[i], &carry); } if (alias) ubignum_free(*out); *out = ans; return true; realoc_failed: if (alias) ubignum_free(ans); return false; } ``` ::: \ 對於每個 $n$ 測量 1000 次 $\mathrm{fib}(n)$ 計算後印出平均消耗時間。 其糟糕效能如下: ![](https://i.imgur.com/xFs08l4.png) 為此我後來發現若是能提早知道 reallocation 的大小,或至少知道該次計算的上界,那麼即可在開始計算之前就進行 reallocation 。如此一來即使配置失敗,那麼直接回傳 false 原先的值也不會被破壞。因此在計算 `a = a + b` 時便可以達成 in place 的目標,使用 2 個數的空間完成加法,也因為每次 `ubignum_add()` 都只會做一次 reallocation ,大幅降低了 memory allocation 的花費。此外也不再需要確認 aliasing ,移除判斷條件後也使程式碼簡潔許多。 :::spoiler 改良 (add) ```c /* (*out) = a + b * Aliasing arguments are acceptable. * If false is returned, the values of (*out) remains unchanged. */ bool ubignum_add(const ubn *a, const ubn *b, ubn **out) { if (!a || !b || !out || !*out) return false; while (__builtin_expect((*out)->capacity < max(a->size, b->size) + 1, 0)) if (!ubignum_resize(out, (*out)->capacity * 2)) return false; if (*out != a && *out != b) // if no pointer aliasing ubignum_zero(*out); int i = 0, carry = 0; for (i = 0; i < min(a->size, b->size); i++) ubn_unit_add(a->data[i], b->data[i], carry, &(*out)->data[i], &carry); const ubn *const remain = (i == a->size) ? b : a; for (; i < remain->size; i++) ubn_unit_add(remain->data[i], 0, carry, &(*out)->data[i], &carry); (*out)->size = remain->size; if (carry) { (*out)->data[i] = 1; (*out)->size++; } return true; } ``` ::: \ 效能: ![](https://i.imgur.com/jFQsPRi.png) 相比原先的版本加速將近十倍之多。 #### 列印 第二個困難出在列印,雖然我的 `ubignum` 已能進行數種運算,也正確地將數值儲存在檔案中,但它們就是一串由上百、上千個 0, 1 組成的資料,難以為人類所理解。 為了將很長的二進位轉成十進制,我左思右想許久,最後認為終究是離不開除法的命運,還是得乖乖寫一個除法。我想起前幾周的測驗題中,恰好有[計算除法的內容](https://hackmd.io/u4lvdDcAQeC8UoQ1dT7LUg?view#%E6%B8%AC%E9%A9%97-5),於是我就嘗試套用這個模式,意圖以位移和減法來達成除法運算。此外因為目前沒有其它除法的應用場景,所以可以只考量「除以 10 」的情況並進行特化。 為了實作出 `ubignum_divby_ten()` ,首先需要 `ubignum_left_shift()` 及 `ubignum_sub()` 。 ##### ubignum_sub() 由於我的大數體系原先完全不存在任何負的概念,因此想引入減法也需要做一些權衡。 不過因為除法中使用的減法會保證結果為非負 (餘數) ,如此一來就不必引入負數,減法直接使用二補數加法進行,並把最高的進位拋棄。基於直接假設減法都是大減小,原本的 ubignum 體系可以維持不變。 以下程式碼都展示解決前述 pointer aliasing 問題後的版本。 :::spoiler `sub()` 程式碼 ```c bool ubignum_sub(const ubn *a, const ubn *b, ubn **out) { if (!a || !b || !out || !*out) return false; /* ones' complement of b */ ubn *cmt; if (!ubignum_init(&cmt)) goto cmt_failed; if (__builtin_expect(cmt->capacity < a->size, 0)) if (!ubignum_resize(&cmt, a->size)) goto cmt_realoc_failed; cmt->size = a->size; for (int i = 0; i < b->size; i++) cmt->data[i] = ~b->data[i]; for (int i = b->size; i < a->size; i++) cmt->data[i] = CPU_64 ? 
UINT64_MAX : UINT32_MAX; int carry = 1; for (int i = 0; i < a->size; i++) // compute result and store in cmt ubn_unit_add(a->data[i], cmt->data[i], carry, &cmt->data[i], &carry); // the last carry is discarded for (int i = a->size - 1; i >= 0; i--) if (!cmt->data[i]) cmt->size--; ubignum_free(*out); *out = cmt; return true; cmt_realoc_failed: ubignum_free(cmt); cmt_failed: return false; } ``` ::: ##### ubignum_left_shift() :::spoiler `left_shift()` 程式碼 ```c= /* left shift a->data by d bit */ bool ubignum_left_shift(const ubn *a, int d, ubn **out) { if (!a || d < 0 || !out || !*out) return false; else if (d == 0) { if (a == *out) return true; if (__builtin_expect((*out)->capacity < a->size, 0)) if (!ubignum_resize(out, a->size)) return false; for (int i = 0; i < a->size; i++) (*out)->data[i] = a->data[i]; (*out)->size = a->size; return true; } const int chunk_shift = d / ubn_unit_bit; const int shift = d % ubn_unit_bit; const int new_size = a->size + chunk_shift + 1; if (__builtin_expect((*out)->capacity < new_size, 0)) if (!ubignum_resize(out, new_size)) return false; int ai = new_size - chunk_shift - 1; // note a->size may be changed due to // aliasing, should not use a->size int oi = new_size - 1; /* copy data from a to (*out) */ if (shift) { (*out)->data[oi--] = a->data[ai - 1] >> (ubn_unit_bit - shift); for (ai--; ai > 0; ai--) (*out)->data[oi--] = (a->data[ai] << shift) | (a->data[ai - 1] >> (ubn_unit_bit - shift)); (*out)->data[oi--] = a->data[ai] << shift; // ai is now 0 } else { while (ai >= 0) (*out)->data[oi--] = a->data[ai--]; } /* end copy */ while (oi >= 0) // set remaining part of (*out) to 0 (*out)->data[oi--] = 0; (*out)->size = new_size; if ((*out)->data[(*out)->size - 1] == 0) // if MS chunk is 0 (*out)->size--; return true; } ``` ::: \ 大致上就是一些搬移的操作,在此解釋 29 至 33 行的操作,假設我們要將 `a` : ``` 00000000 11111111 01001111 00000000 ``` 左移 3 位。 - 注意到 chunk 編號是 $(3, 2, 1, 0)$ ,左邊為 most significant 。 每次我們需要目前 chunk 的最低 5 位放到高 (置左) 與其右方 chunk 的最高 3 位放到低 (置右) 進行合併,以編號 $2$ 為例即是: ``` 11111... 
from [2] or .....010 from [1] ---------- 11111010 ``` - 取最低 5 位放到高:左移 $3$ - 取最高 3 位放到低:右移 $(8-3)$ 最高跟最低的 chunk 都只有分別一邊所以需要特殊處理,最終得: ``` 00000111 11111010 01111000 00000000 ``` 由於我們每次都不會去取目標左方的新 chunk ,因此 pointer aliasing 在此並不會造成問題。 另外,考量到當 `shift` 為 0 時,原本的程式碼會導致數值右移超過自己的 size ,產生 undefined behavior ,因此需要第 28 行的 if..else 來進行 branch 。 ##### ubignum_divby_ten() 概念源於 [quiz3](https://hackmd.io/u4lvdDcAQeC8UoQ1dT7LUg?view#%E6%B8%AC%E9%A9%97-5) 的算法。 在撰寫的過程中,再多寫了一個 `ubignum_compare()` 函式以利比較。 :::spoiler `divby_ten()` 程式碼 ```c /* a / 10 = (*quo)...rmd */ bool ubignum_divby_ten(const ubn *a, ubn **quo, int *rmd) { if (!a || !a->size || !quo || !*quo || !rmd) return false; ubn *ans, *dvd, *suber, *ten; if (!ubignum_init(&ans)) return false; if (!ubignum_resize(&ans, a->size)) goto cleanup_ans; if (!ubignum_init(&dvd)) goto cleanup_ans; if (!ubignum_resize(&dvd, a->size)) goto cleanup_dvd; if (!ubignum_init(&suber)) goto cleanup_dvd; if (!ubignum_resize(&suber, dvd->capacity + 1)) goto cleanup_suber; if (!ubignum_init(&ten)) goto cleanup_suber; ubignum_uint(ten, 10); for (int i = 0; i < a->size; i++) dvd->data[i] = a->data[i]; dvd->size = a->size; int shift = dvd->size * ubn_unit_bit - 4; // can be improved by clz() ubignum_left_shift(ten, shift, &suber); while (ubignum_compare(dvd, ten) >= 0) { // if dvd >= 10 while (ubignum_compare(dvd, suber) < 0) { shift--; ubignum_left_shift(ten, shift, &suber); } ans->data[shift / ubn_unit_bit] |= (ubn_unit) 1 << (shift % ubn_unit_bit); ubignum_sub(dvd, suber, &dvd); } ans->size = a->size; if (ans->data[a->size - 1] == 0) ans->size--; *rmd = (int) dvd->data[0]; ubignum_free(ten); ubignum_free(suber); ubignum_free(dvd); ubignum_free(*quo); *quo = ans; return true; cleanup_suber: ubignum_free(suber); cleanup_dvd: ubignum_free(dvd); cleanup_ans: ubignum_free(ans); return false; } ``` ::: \ 之後同樣如 quiz3 引入 `builtin_clz()` 並重新包裝成 `ubignum_clz()` 進行改進,關鍵段落變化: ```diff for (int i = 0; i < a->size; i++) dvd->data[i] = a->data[i]; dvd->size = a->size; - int shift = dvd->size * ubn_unit_bit - 4; - ubignum_left_shift(ten, shift, &suber); while (ubignum_compare(dvd, ten) >= 0) { // if dvd >= 10 - while (ubignum_compare(dvd, suber) < 0) { - shift--; - ubignum_left_shift(ten, shift, &suber); - } + int shift = dvd->size * ubn_unit_bit - ubignum_clz(dvd) - 4; + ubignum_left_shift(ten, shift, &suber); + if (ubignum_compare(dvd, suber) < 0) + ubignum_left_shift(ten, --shift, &suber); ans->data[shift / ubn_unit_bit] |= (ubn_unit) 1 << (shift % ubn_unit_bit); ubignum_sub(dvd, suber, &dvd); } ans->size = a->size; // FIX: ans = 0 should be considered if (ans->data[a->size - 1] == 0) ans->size--; *rmd = (int) dvd->data[0]; ``` ##### ubignum_2decimal() 在實作完成除以 10 之後,我們即可透過不斷除以 10 來將二進位數轉成字串印出。 :::spoiler `2decimal()` 程式碼 ```c /* convert the ubn to ascii string */ char *ubignum_2decimal(const ubn *N) { if (!N) return NULL; if (ubignum_iszero(N)) { #if DEBUG char *ans = (char *) calloc(sizeof(char), 2); #else char *ans = (char *) kcalloc(sizeof(char), 2, GFP_KERNEL); #endif if (!ans) return NULL; ans[0] = '0'; ans[1] = 0; return ans; } ubn *dvd = NULL; if (!ubignum_init(&dvd)) return NULL; if (!ubignum_resize(&dvd, N->size)) goto cleanup_dvd; for (int i = 0; i < N->size; i++) dvd->data[i] = N->data[i]; dvd->size = N->size; /* Let n be the number. 
* digit = 1 + log_10(n) = 1 + \frac{log_2(n)}{log_2(10)} * log_2(10) \approx 3.3219 \approx 7/2, we simply choose 3 */ unsigned digit = (ubn_unit_bit * N->size / 3) + 1; #if DEBUG char *ans = (char *) calloc(sizeof(char), digit); #else char *ans = (char *) kcalloc(sizeof(char), digit, GFP_KERNEL); #endif if (!ans) goto cleanup_dvd; /* convert 2-base to 10-base */ int index = 0; while (dvd->size) { int rmd; if (__builtin_expect(!ubignum_divby_ten(dvd, &dvd, &rmd), 0)) goto cleanup_ans; ans[index++] = rmd | '0'; } int len = index; /* reverse string */ index = len - 1; for (int i = 0; i < index; i++, index--) { char tmp = ans[i]; ans[i] = ans[index]; ans[index] = tmp; } ubignum_free(dvd); return ans; cleanup_ans: #if DEBUG free(ans); #else kfree(ans); #endif cleanup_dvd: ubignum_free(dvd); return NULL; } ``` ::: \ 目前我認為字串反轉的方式還可以再改良,但因為這個部份並非 fib 運算的主軸,因此暫時就先如此。 --- ### fibdrv 效能分析與改進 原先想與[範例程式碼](https://github.com/AdrianHuang/fibdrv/tree/big_number)比較,然而稍微更改後在自己電腦實驗的結果卻非常糟: ![](https://i.imgur.com/BDVkuDe.png) ![](https://i.imgur.com/Kd0V1c9.png) 因此目前先以自己改進為方向, 由於速度的提昇,我們現在將範圍拉到 $n=1000$ 來測試。 ![](https://i.imgur.com/xE4ejiO.png) 對於目前 fib sequence 逐項往上加的算法 ```c static inline void ubn_unit_add(const ubn_unit a, const ubn_unit b, const int cin, ubn_unit *sum, int *cout) { const ubn_unit_extend SUM = (ubn_unit_extend) a + b + cin; *sum = SUM; *cout = SUM >> ubn_unit_bit; } ``` 我認為仍有可能改進的部份: 1. 使用 inline asm 來做加法,如此 (在 64 位元情況) 可以避免用到 `uint128_t` 今天 (3/24) 上課的時候,聽到老師介紹 gcc 有 builtin 的 overflow 函式,藉由這點我們即可改寫原本的 `ubn_unit_add()`,不需要低到 asm 的層級,相對 asm 的優點在於 ==architecture independence==。然而這裡是需要 3 個數相加,因此需要呼叫兩次 `builtin_uadd_overflow()`。 ```c static inline void ubn_unit_add(const ubn_unit a, const ubn_unit b, const int cin, ubn_unit *sum, int *cout) { *cout = __builtin_uaddll_overflow(a, cin, sum); *cout |= __builtin_uaddll_overflow(*sum, b, sum); } ``` ![](https://i.imgur.com/uD7JceJ.png) 測試後可以發現在較大的 case 上,使用 `builtin_uadd_overflow()` 可以得到略佳的效果。再測到更大一些得: ![](https://i.imgur.com/BUdjRc1.png) --- 當我嘗試將 $n$ 提高到 3000 時,卻觀察到一個奇怪的現象,就是在 1000 之後,計算時間就不再有提昇的趨勢,如圖: ![](https://i.imgur.com/FhKNPup.png) 我覺得這非常不合理,因為以我的算法對於越大的數肯定有越高的計算量。目前想到的可能出在,因數值太大而產生 overflow ,但 20000 $\times$ 1000 也仍不到 32-bit int 的上界,應不可能超出 long long ,且可以從圖中見得有些數字遠遠超過水平線卻仍被印了出來。或許需要更理解 ktime_t 相關的知識。 > On 64-bit systems, a ktime_t is really just a 64-bit integer value in nanoseconds. 
> --- [The high-resolution timer API](https://lwn.net/Articles/167897/) 以這個資料來說,顯然更不是時間變數資料型態的問題了。 :::success 經過數天之後我終於發現問題,是 **fibdrv.c** 中的 `MAX_LENGTH` 。之前改過一次之後我就忘記有這個參數,目前直接設定成 1000000 以利後續實驗。 更正後試跑已可正常進行統計: ![](https://i.imgur.com/TJMiICj.png) ::: --- 嘗試引入 fast doubling 算法加速 ![](https://i.imgur.com/EBIoEDE.png) 效能改善的情況高於我的預期,我原先以為 fast doubling 要到很大的 case 才會有顯著的改善。但從實驗結果可以發現,在 $n=400$ 左右兩種算法即出現執行時間的交叉。 試著仿造使用 `perf stat` 測量我寫的 fast doubling 算法到 $n=3000$ : ```shell $ sudo perf stat -e cycles,instructions,cache-misses,cache-references,branch-instructions,branch-misses ./exp 1 > /dev/null Performance counter stats for './exp 1': 118,264,893,816 cycles 264,687,477,706 instructions # 2.24 insn per cycle 15,142,354 cache-misses # 15.801 % of all cache refs 95,829,386 cache-references 38,631,044,731 branch-instructions 69,373,051 branch-misses # 0.18% of all branches 35.744624633 seconds time elapsed 0.931959000 seconds user 34.806501000 seconds sys ``` 可以看到我的實作程式碼有高達 15% 的 cache miss 。 我想利用 cachegrind 來測試是哪個函式發生最多 cache miss ,結果如下: ```shell $ sudo valgrind --tool=cachegrind ./exp 1 > /dev/null ==37329== Cachegrind, a cache and branch-prediction profiler ==37329== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al. ==37329== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info ==37329== Command: ./exp 1 ==37329== --37329-- warning: L3 cache found, using its data for the LL simulation. ==37329== ==37329== I refs: 69,563,678 ==37329== I1 misses: 1,159 ==37329== LLi misses: 1,128 ==37329== I1 miss rate: 0.00% ==37329== LLi miss rate: 0.00% ==37329== ==37329== D refs: 28,270,312 (24,764,894 rd + 3,505,418 wr) ==37329== D1 misses: 3,323 ( 2,585 rd + 738 wr) ==37329== LLd misses: 2,733 ( 2,070 rd + 663 wr) ==37329== D1 miss rate: 0.0% ( 0.0% + 0.0% ) ==37329== LLd miss rate: 0.0% ( 0.0% + 0.0% ) ==37329== ==37329== LL refs: 4,482 ( 3,744 rd + 738 wr) ==37329== LL misses: 3,861 ( 3,198 rd + 663 wr) ==37329== LL miss rate: 0.0% ( 0.0% + 0.0% ) $ sudo cg_annotate cachegrind.out.37329 -------------------------------------------------------------------------------- I1 cache: 32768 B, 64 B, 8-way associative D1 cache: 32768 B, 64 B, 8-way associative LL cache: 6291456 B, 64 B, 12-way associative Command: ./exp 1 Data file: cachegrind.out.37329 Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Thresholds: 0.1 100 100 100 100 100 100 100 100 Include dirs: User annotated: Auto-annotation: off -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw -------------------------------------------------------------------------------- 69,563,678 1,159 1,128 24,764,894 2,585 2,070 3,505,418 738 663 PROGRAM TOTALS -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function -------------------------------------------------------------------------------- 33,078,041 4 4 15,024,011 1 0 3,012,013 0 0 ???:main 27,000,072 1 1 6,000,016 0 0 0 0 0 /build/glibc-sMfBJT/glibc-2.31/io/../sysdeps/unix/sysv/linux/write.c:write 6,054,221 27 24 3,027,082 11 0 15 1 1 ???:??? 
1,493,991 48 47 398,997 19 6 261,001 6 2 /build/glibc-sMfBJT/glibc-2.31/stdio-common/vfprintf-internal.c:__vfprintf_internal 648,170 4 4 141,032 4 1 114,012 1 0 /build/glibc-sMfBJT/glibc-2.31/libio/fileops.c:_IO_file_xsputn@@GLIBC_2.2.5 418,262 3 3 30,733 2 1 24,733 0 0 /build/glibc-sMfBJT/glibc-2.31/stdio-common/_itoa.c:_itoa_word 228,000 5 5 21,000 3 2 0 0 0 /build/glibc-sMfBJT/glibc-2.31/string/../sysdeps/x86_64/multiarch/strchr-avx2.S:__strchrnul_avx2 163,977 2 2 29,983 0 0 17,985 63 63 /build/glibc-sMfBJT/glibc-2.31/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms 90,000 4 4 18,000 3 0 33,000 3 2 /build/glibc-sMfBJT/glibc-2.31/stdio-common/printf.c:printf 78,000 1 1 24,000 0 0 6,000 0 0 /build/glibc-sMfBJT/glibc-2.31/stdio-common/../libio/libioP.h:__vfprintf_internal 71,545 10 10 14,316 1,080 893 15 2 0 /build/glibc-sMfBJT/glibc-2.31/elf/dl-addr.c:_dl_addr ``` :::warning cachegrind 似乎無法配合 taskset 使用 ::: :::success 改用 ```shell sudo taskset -c 1 valgrind --tool=cachegrind ./exp 1 > /dev/null ``` 即可正常使用 cachegrind ,推測為 Valgrind 實作方式的影響 ::: 測試結果竟顯示 data cache miss 為 0% ,與從 perf stat 收到的資訊天差地遠。 可以觀察到 cache reference 在 perf 顯示 95,829,386 ;而在 cachegrind 顯示為 28,270,312 。 :::spoiler 進行更多次 perf 測試,發現每次的 cache references 竟有巨大落差: ```shell Performance counter stats for './exp 1': 112,781,466,006 cycles 261,497,199,771 instructions # 2.32 insn per cycle 6,962,452 cache-misses # 11.023 % of all cache refs 63,160,276 cache-references 38,179,989,631 branch-instructions 67,070,255 branch-misses # 0.18% of all branches 33.386655206 seconds time elapsed 0.975406000 seconds user 32.384290000 seconds sys ``` ```shell Performance counter stats for './exp 1': 108,714,836,439 cycles 264,555,729,876 instructions # 2.43 insn per cycle 1,520,786 cache-misses # 14.088 % of all cache refs 10,795,259 cache-references 38,609,436,202 branch-instructions 63,560,865 branch-misses # 0.16% of all branches 32.059672731 seconds time elapsed 0.847985000 seconds user 31.207482000 seconds sys ``` ::: \ 我不曉得為什麼會出現這樣的情況。 :::success 後續發現是測試次數過少而導致的數值不穩定,後面會補充結果。 ::: --- 回去觀摩[參考共筆](https://hackmd.io/@KYWeng/rkGdultSU#%E6%94%B9%E5%96%84%E6%96%B9%E6%A1%88-1---%E6%94%B9%E5%AF%AB-bn_fib_fdoubling-%E5%AF%A6%E4%BD%9C%E7%9A%84%E6%96%B9%E5%BC%8F),發現因為我先前閱讀的重點都在運算加速,沒注意到資料搬移優化的部份,嘗試將其納入我的實作: ```diff for (int currbit = 1 << (32 - __builtin_clzll(k) - 1 - 1); currbit; currbit = currbit >> 1) { /* compute 2n-1 */ ubignum_mult(fast[1], fast[1], &fast[0]); ubignum_mult(fast[2], fast[2], &fast[3]); ubignum_add(fast[0], fast[3], &fast[3]); /* compute 2n */ ubignum_left_shift(fast[1], 1, &fast[4]); ubignum_add(fast[4], fast[2], &fast[4]); ubignum_mult(fast[4], fast[2], &fast[4]); n *= 2; if (k & currbit) { ubignum_add(fast[3], fast[4], &fast[0]); n++; - ubignum_copy(fast[2], fast[0]); - ubignum_copy(fast[1], fast[4]); + ubignum_swapptr(&fast[2], &fast[0]); + ubignum_swapptr(&fast[1], &fast[4]); } else { - ubignum_copy(fast[2], fast[4]); - ubignum_copy(fast[1], fast[3]); + ubignum_swapptr(&fast[2], &fast[4]); + ubignum_swapptr(&fast[1], &fast[3]); } } ``` 效能差異: ![](https://i.imgur.com/vchjKJS.png) 我測出來的結果和參考共筆的結論不符合,並沒有出現效能大幅改善的情形。 perf 測起來仍然非常不穩: :::spoiler perf 測試 ```shell $ sudo perf stat -e cycles,instructions,cache-misses,cache-references,branch-instructions,branch-misses ./exp 1 > /dev/null Performance counter stats for './exp 1': 99,537,523,424 cycles 241,725,789,043 instructions # 2.43 insn per cycle 1,579,228 cache-misses # 53.451 % of all cache refs 2,954,538 cache-references 34,729,788,080 
branch-instructions 72,001,270 branch-misses # 0.21% of all branches 29.372597211 seconds time elapsed 0.963940000 seconds user 28.402259000 seconds sys ``` ```shell Performance counter stats for './exp 1': 110,100,048,298 cycles 251,991,005,629 instructions # 2.29 insn per cycle 11,886,534 cache-misses # 14.939 % of all cache refs 79,566,707 cache-references 36,196,420,296 branch-instructions 71,695,967 branch-misses # 0.20% of all branches 32.880962908 seconds time elapsed 0.939414000 seconds user 31.908111000 seconds sys ``` ::: \ 嘗試在 perf 命令設置參數 `-r 10` 來降低測量數值的波動,為避免時間過長,改為測試 fib 到 $n=1000$ 。 經過這個設定後,每次測試間得到的數值就不再有顯著差異,至少每個 events 的前兩個數字會保持不變。 ```shell $ bash measure ... Performance counter stats for 'taskset -c 1 ./exp 1' (10 runs): 23,018,839,194 cycles ( +- 0.27% ) 49,622,366,443 instructions # 2.16 insn per cycle ( +- 0.00% ) 1,634,045 cache-misses # 20.062 % of all cache refs ( +- 6.49% ) 8,145,159 cache-references ( +- 5.01% ) 8,319,541,416 branch-instructions ( +- 0.00% ) 13,417,461 branch-misses # 0.16% of all branches ( +- 0.18% ) 14.4259 +- 0.0388 seconds time elapsed ( +- 0.27% ) ``` 確立測試方法之後再回去重新測量使用 `ubignum_copy()` 的版本,得: ```shell $ bash measure ... Performance counter stats for 'taskset -c 1 ./exp 1' (10 runs): 25,215,127,347 cycles ( +- 2.31% ) 51,971,604,860 instructions # 2.06 insn per cycle ( +- 0.00% ) 2,423,932 cache-misses # 13.441 % of all cache refs ( +- 18.29% ) 18,033,276 cache-references ( +- 24.11% ) 8,811,912,111 branch-instructions ( +- 0.00% ) 13,058,167 branch-misses # 0.15% of all branches ( +- 1.10% ) 15.804 +- 0.367 seconds time elapsed ( +- 2.32% ) ``` 對照之下,從舊版到新版雖然 cache-miss-rate 由 13% 增長到 20% ,但我們可以注意到 cache-references 總數大約降低了一半,我推測這個差距來自於 copy 的過程對資料的 access 。 因此就 data cache 而言,因為省下這些操作,新版仍是更佳的實作方式。 20% 的 cache miss 非常驚人,因此目前最直接的改進目標就是要想辦法降低 miss rate 。 --- 引入 `ubignum_square()` ```shell Performance counter stats for 'taskset -c 1 ./exp 1' (10 runs): 25,405,898,533 cycles ( +- 1.21% ) 53,288,279,701 instructions # 2.10 insn per cycle ( +- 0.00% ) 1,370,567 cache-misses # 20.309 % of all cache refs ( +- 34.02% ) 6,748,682 cache-references ( +- 32.89% ) 9,033,980,854 branch-instructions ( +- 0.00% ) 12,907,672 branch-misses # 0.14% of all branches ( +- 0.62% ) 15.921 +- 0.193 seconds time elapsed ( +- 1.21% ) ``` 差異實在也不是很明顯,甚至是落後於全數使用 `ubignum_mult()` 的版本。 雖然我想要嘗試測到大一點的 case , --- 執行測試有時會出現頻繁的失敗,不確定原因是什麼,但與引入 `ubignum_square()` 應該沒有關聯,因為即使改回 mult 的版本也沒有恢復。其錯誤情景如下: ```shell $ sudo perf stat -r 10 -e cycles,instructions,cache-misses,cache-references,branch-instructions,branch-misses taskset -c 0 ./exp 1 > /dev/null taskset: 強制結束 taskset: 強制結束 taskset: 強制結束 taskset: 強制結束 taskset: 強制結束 taskset: 強制結束 taskset: 強制結束 taskset: 強制結束 taskset: 強制結束 taskset: 強制結束 ``` 我想看看是否是 `fib_fast()` 的計算過程中的問題,例如 reallocation 失敗等等,因此嘗試引入 `flag` 變數來紀錄: :::spoiler `fib_fast()` ```c static ubn *fib_fast(long long k) { ubn *fast[5] = {NULL}; bool flag = true; if (k == 0) { flag &= ubignum_init(&fast[0]); ubignum_zero(fast[0]); return fast[0]; } else if (k == 1) { flag &= ubignum_init(&fast[0]); ubignum_uint(fast[0], 1); return fast[0]; } for (int i = 0; i < 5; i++) flag &= ubignum_init(&fast[i]); ubignum_zero(fast[1]); ubignum_uint(fast[2], 1); int n = 1; for (int currbit = 1 << (32 - __builtin_clzll(k) - 1 - 1); currbit; currbit = currbit >> 1) { /* compute 2n-1 */ flag &= ubignum_square(fast[1], &fast[0]); flag &= ubignum_square(fast[2], &fast[3]); //flag &= ubignum_mult(fast[1], fast[1], &fast[0]); //flag &= ubignum_mult(fast[2], fast[2], &fast[3]); flag &= 
Running the `flag`-instrumented build, `dmesg` shows no `flag`-related message at all, but an error highlighted in red does appear:

```
[  990.777318] BUG: unable to handle page fault for address: ffff96d435beb220
[  990.777322] #PF: supervisor read access in kernel mode
[  990.777324] #PF: error_code(0x0000) - not-present page
```

---

## June Rework

After some time away from this project, I am coming back to redo a few experiments and fill in content that was previously overlooked.

The first time around I assumed the job was to make the Fibonacci computation itself fast, so I only measured the computation time. However, this kind of character device is always accessed through `read`, so the cost of crossing between user space and kernel space and of copying data back should be measured and included as well. Only a complete set of experiments gives a correct picture of where the performance bottleneck really is.

### user-kernel transition time

#### Time measurement

First, measure the cost of entering and leaving kernel mode from user mode.

:::spoiler code
```c
for (int i = 0; i <= offset; i++) {
    lseek(fd, i, SEEK_SET);
    unsigned long long ucons = 0, kcons = 0;
    for (int j = 0; j < TEST_NUM; j++) {
        struct timespec t1, t2;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        kcons += write(fd, NULL, select);
        clock_gettime(CLOCK_MONOTONIC, &t2);
        ucons += (unsigned long long) (t2.tv_sec * 1e9 + t2.tv_nsec) -
                 (t1.tv_sec * 1e9 + t1.tv_nsec);
    }
    printf("%d,%llu,%llu,%llu\n", i, kcons / TEST_NUM, ucons / TEST_NUM,
           (ucons - kcons) / TEST_NUM);
}
```
```c
/* write operation is skipped */
/* try to measure time within this function */
static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    ubn *N = NULL;  // do not initialize
    ktime_t kt = ktime_get();
    switch (size) {
    case 0:
        N = fib_sequence(*offset);
        break;
    case 1:
        N = fib_fast(*offset);
        break;
    default:
        return 0;
    }
    ubignum_free(N);
    kt = ktime_sub(ktime_get(), kt);
    return (ssize_t) ktime_to_ns(kt);
}
```
:::
\
The test uses the iterative algorithm with small offsets so that the transition cost dominates; subtracting the two measured times gives roughly the (user $\to$ kernel) $+$ (kernel $\to$ user) time.

![](https://i.imgur.com/XbQbwnF.png)

Even after applying every technique mentioned in the [assignment description](https://hackmd.io/@sysprog/linux2022-fibdrv) and the [reference note](https://hackmd.io/@KYWeng/rkGdultSU), huge peaks like these still appear. From the csv file they occur at $n = 13, 43, 44, 63, 64$:

- fib(13) = 233
- fib(43) = 433494437
- fib(44) = 701408733
- fib(63) = 6557470319842
- fib(64) = 10610209857723

My first guess was overhead from `ubignum_resize()`, but these values are nowhere near `UINT32_MAX` or `UINT64_MAX`, so nothing special can be read from them. Repeating the test a few more times, the peaks show up at different positions every run.

Excluding the peak points and plotting again shows that the measurements are not particularly stable in the other cases either.

![](https://i.imgur.com/bdrU2Lr.png)

I then switched to a simpler way to measure: empty out `fib_write()` entirely and treat the user-space round-trip time itself as the transition time.

:::success
Note that the x-axis $n$ no longer carries any meaning in this setup.
:::

![](https://i.imgur.com/BpktAXO.png)

Peaks keep appearing no matter what I do, but the round-trip time clearly clusters very close to 1000 ns. To investigate, I printed every abnormally large sample and found that occasionally a run takes roughly ten times longer than usual. This made it clear that [removing statistical outliers](https://hackmd.io/@sysprog/linux2022-fibdrv#%E7%94%A8%E7%B5%B1%E8%A8%88%E6%89%8B%E6%B3%95%E5%8E%BB%E9%99%A4%E6%A5%B5%E7%AB%AF%E5%80%BC) is indeed necessary; it is not a step to skip.

#### Removing statistical outliers

While collecting statistics, some particular values of $n$ showed the odd phenomenon of a standard deviation larger than the mean, so I again dumped all raw samples and went back to inspect those $n$. A small `sprintf()` trick stores the samples of each $n$ in its own file:

```c
char name[128];
sprintf(name, "data/%d.dat", i);
FILE *f = fopen(name, "w");
```

After running the time measurement, $n = 67$ showed $\mu = 1073$ and $\sigma = 1322$; checking **data/67.dat** revealed that one of its samples took 42656. After eliminating outliers, the adjusted mean falls back to a normal 1085.

:::spoiler outlier elimination
```c
/* Eliminate 5% outliers and compute the average. */
uint64_t average(int64_t *nums, int numsSize)
{
    int64_t mu = 0;
    for (int i = 0; i < numsSize; i++)
        mu += nums[i];
    mu /= numsSize;
    int64_t deviation = 0;
    for (int i = 0; i < numsSize; i++)
        deviation += (nums[i] - mu) * (nums[i] - mu) / numsSize;
    deviation = sqrt(deviation);
    for (int i = 0; i < numsSize; i++)
        if (nums[i] >= mu + 2 * deviation || nums[i] <= mu - 2 * deviation)
            nums[i] = -1;
    // printf("%ld, %ld\t", mu, deviation);
    uint64_t ans = 0;
    for (int i = 0; i < numsSize; i++)
        if (nums[i] != -1)
            ans += nums[i];
    return ans / (numsSize / 20 * 19);
}
```
`sqrt()` here is my own integer square-root implementation.
:::
\
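One caveat about `average()` above: the $\pm 2\sigma$ filter does not necessarily drop exactly 5% of the samples, yet the final division always uses `numsSize / 20 * 19`. A variant that divides by the number of samples actually kept avoids that bias; the sketch below assumes the same input format, with a trivial integer square root standing in for the hand-rolled `sqrt()`.

```c
#include <stdint.h>

/* Simple integer square root, good enough for these magnitudes. */
static int64_t isqrt(int64_t x)
{
    int64_t r = 0;
    while ((r + 1) * (r + 1) <= x)
        r++;
    return r;
}

/* Same +-2 sigma filter as average() above, but the mean is taken over the
 * samples that actually survive the filter.
 */
uint64_t average_kept(int64_t *nums, int numsSize)
{
    int64_t mu = 0;
    for (int i = 0; i < numsSize; i++)
        mu += nums[i];
    mu /= numsSize;

    int64_t var = 0;
    for (int i = 0; i < numsSize; i++)
        var += (nums[i] - mu) * (nums[i] - mu) / numsSize;
    int64_t sigma = isqrt(var);

    uint64_t sum = 0;
    int kept = 0;
    for (int i = 0; i < numsSize; i++) {
        if (nums[i] >= mu + 2 * sigma || nums[i] <= mu - 2 * sigma)
            continue; /* drop outliers beyond 2 sigma */
        sum += nums[i];
        kept++;
    }
    return kept ? sum / kept : (uint64_t) mu;
}
```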
With the peaks removed, the finer variations in elapsed time finally become visible on the chart.

![](https://i.imgur.com/32bLFD9.png)

With the smaller y-axis scale, the measurements still fluctuate by about 100 to 300 ns, far from the smooth curve in [KYG's results](https://hackmd.io/@KYWeng/rkGdultSU#user-space-%E8%88%87-kernel-space-%E7%9A%84%E5%82%B3%E9%81%9E%E6%99%82%E9%96%93). Since every input $n$ returns immediately without doing any computation here, the ideal result would be a horizontal line.

When I tried to make a commit, a warning appeared:

```
Dangerous function detected in /home/tinyynoob/code/fibdrv/exp.c
54: sprintf(name, "data/%d.dat", i);
```

so I switched to `snprintf()` and the commit went through.

#### Measuring on a desktop

Since the numbers kept drifting no matter how I measured, I suspected the laptop itself might simply be less stable, so I moved from the laptop to a desktop to see whether the same behaviour shows up.

:::spoiler cpu info
```
架構:                           x86_64
CPU 作業模式:                   32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
每核心執行緒數:                 1
每通訊端核心數:                 4
Socket(s):                       1
NUMA 節點:                      1
供應商識別號:                   GenuineIntel
CPU 家族:                       6
型號:                           94
Model name:                      Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
製程:                           3
CPU MHz:                         2511.819
CPU max MHz:                     3600.0000
CPU min MHz:                     800.0000
BogoMIPS:                        6399.96
虛擬:                           VT-x
L1d 快取:                       128 KiB
L1i 快取:                       128 KiB
L2 快取:                        1 MiB
L3 快取:                        6 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Srbds:             Mitigation; Microcode
Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
```
:::
\
However, running exactly the same code and scripts on the desktop produced a segmentation fault. Inspecting it with gdb, the crash happens at the `fprintf()` call: `f` tries to open a file under the **data** directory, which does not exist on this machine, so `fopen()` fails and writing through the NULL `FILE *` faults. `mkdir data` fixed it.
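Since the crash came from writing through a NULL `FILE *`, a small guard around the file creation removes the dependency on the directory already existing. This is only a sketch: the helper name `open_data_file()` is made up here, while the `data/%d.dat` naming follows the experiment above.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/* Create data/ if it is missing and check fopen() before use, so a missing
 * directory produces a readable error message instead of a segmentation
 * fault inside fprintf().
 */
static FILE *open_data_file(int n)
{
    if (mkdir("data", 0755) == -1 && errno != EEXIST) {
        perror("mkdir data");
        return NULL;
    }
    char name[128];
    snprintf(name, sizeof(name), "data/%d.dat", n);
    FILE *f = fopen(name, "w");
    if (!f)
        perror(name);
    return f;
}
```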
The chart measured on the desktop:

![](https://i.imgur.com/UlmxlW0.png)

It still looks like the data floats around, but note that the y-axis scale has dropped from 50 down to 10, so the measurement really is considerably more stable. On a larger scale it looks like this:

![](https://i.imgur.com/my86CIE.png)

It is not a perfectly horizontal line yet, but it is a big improvement over the initial peaks of over a thousand ns.

> version commit id: b80d536b08bd1a3c3325b53d684f97e39ac734f9

Next, the code previously stripped out of `fib_write()` is restored and the test repeated:

![](https://i.imgur.com/uWsgTOd.png)

Regardless of $n$, the round trip between user space and kernel space sits stably around 600 ns, even when computing all the way up to $n=500$:

![](https://i.imgur.com/GHOdAgH.png)

> version id: 766a4b80f6fbcf8a0f5ee0931113b16e6ea629ec

### Computation time

Still on the desktop, I re-measured the kernel-space time of the two Fibonacci algorithms:

![](https://i.imgur.com/P2ik2ZG.png)

As before, my fast-doubling implementation shows this staircase-like jumping pattern; this time I want to understand where the steps come from. Starting from smaller cases:

![](https://i.imgur.com/6kLMTPL.png)

**fast.csv** shows that the jumps occur at $n = 2, 4, 8, 16, 32, 64$, each adding about 500 or 1000 ns. Every jump lands exactly on a power of two, which surely has a good explanation. On reflection it is quite intuitive: each loop iteration of fast doubling doubles $n$, so the result takes $\lfloor\log_2 n\rfloor$ iterations, and each step in the chart is simply the cost of one more iteration. The loop is also evident in the `fib_fast()` code:

```c
int n = 1;
for (int currbit = 1 << (32 - __builtin_clzll(k) - 1 - 1); currbit;
     currbit = currbit >> 1) {
    /* compute 2n-1 */
    flag &= ubignum_square(fast[1], &fast[0]);
    flag &= ubignum_square(fast[2], &fast[3]);
    flag &= ubignum_add(fast[0], fast[3], &fast[3]);
    /* compute 2n */
    flag &= ubignum_left_shift(fast[1], 1, &fast[4]);
    flag &= ubignum_add(fast[4], fast[2], &fast[4]);
    flag &= ubignum_mult(fast[4], fast[2], &fast[4]);
    n *= 2;
    if (k & currbit) {
        flag &= ubignum_add(fast[3], fast[4], &fast[0]);
        n++;
        ubignum_swapptr(&fast[2], &fast[0]);
        ubignum_swapptr(&fast[1], &fast[4]);
    } else {
        ubignum_swapptr(&fast[2], &fast[4]);
        ubignum_swapptr(&fast[1], &fast[3]);
    }
}
```

A quick comparison of the two algorithms:

| | sequential | fast doubling |
|:-:|:-:|:-:|
| iterations | $n$ | $\lfloor\log_2 n\rfloor$ |
| cost per iteration | low, about ==16== ns | high, about ==834== ns |

> The numbers are derived from the measurements for $n=0$ to $n=500$.

### Comparison

#### KYG

To gauge the quality of my implementation, I cloned [KYG's fibdrv](https://github.com/KYG-yaya573142/fibdrv), checked out the optimize-bn branch and ran `make plot` on my machine. The result exceeded my expectations:

![](https://i.imgur.com/wVMqyny.png)

It takes only about $1/10$ of my time, so my code clearly has more than a little room for improvement. Moreover, with the same settings, why is that program's curve so smooth while mine keeps jittering? That also deserves a closer look.

However, reading the code more carefully, this comparison is misleading: the version I cloned is not the bignum implementation. Reading further, it turns out the author never pushed the latest version to GitHub, so I cannot compare elapsed time against it on my own machine.

:::info
Need to find another classmate's code that is suitable for a performance comparison.
:::

### Studying [0xff07's implementation](https://github.com/0xff07/bignum/tree/fibdrv)

Reading through a classmate's implementation to see what is worth borrowing.

#### apm_internal.h

> Here "apm" stands for arbitrary-precision arithmetic.

First, the inline assembly parts are wrapped into macros:

```c
#if defined(__amd64__) || defined(__x86_64__)
#define digit_mul(u, v, hi, lo) \
    __asm__("mulq %3" : "=a"(lo), "=d"(hi) : "%0"(u), "rm"(v))

#define digit_div(n1, n0, d, q, r) \
    __asm__("divq %4" : "=a"(q), "=d"(r) : "0"(n0), "1"(n1), "rm"(d))
#endif
```

I think this is a very good idea: the syntax and logic of inline assembly differ from C, so wrapping it up lets the reader follow everything with C semantics, which is easier to read and also convenient to reuse.

A fallback implementation is also provided for targets where the x86 assembly is unavailable:

```c
#ifndef digit_mul
#define digit_mul(u, v, hi, lo)                    \
    do {                                           \
        apm_digit __u0 = (u);                      \
        apm_digit __u1 = __u0 >> APM_DIGIT_HSHIFT; \
        __u0 &= APM_DIGIT_LMASK;                   \
        apm_digit __v0 = (v);                      \
        apm_digit __v1 = __v0 >> APM_DIGIT_HSHIFT; \
        __v0 &= APM_DIGIT_LMASK;                   \
        apm_digit __lo = __u0 * __v0;              \
        apm_digit __hi = __u1 * __v1;              \
        __u1 *= __v0;                              \
        __v0 = __u1 << APM_DIGIT_HSHIFT;           \
        __u1 >>= APM_DIGIT_HSHIFT;                 \
        __hi += __u1 + ((__lo += __v0) < __v0);    \
        __v1 *= __u0;                              \
        __u0 = __v1 << APM_DIGIT_HSHIFT;           \
        __v1 >>= APM_DIGIT_HSHIFT;                 \
        __hi += __v1 + ((__lo += __u0) < __u0);    \
        (lo) = __lo;                               \
        (hi) = __hi;                               \
    } while (0)
#endif
```
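As a side note, on compilers that support `unsigned __int128` (GCC and Clang on 64-bit targets) the same full 64-bit by 64-bit product can be obtained without inline assembly and without splitting into half digits. The macro below is a hypothetical third variant, not part of the repository, and assumes the digit type is 64 bits wide.

```c
#include <stdint.h>

/* Let the compiler emit the widening multiply: the 128-bit product is split
 * back into its high and low 64-bit halves.
 */
#define digit_mul_u128(u, v, hi, lo)                              \
    do {                                                          \
        unsigned __int128 __p = (unsigned __int128) (u) * (v);    \
        (hi) = (uint64_t) (__p >> 64);                            \
        (lo) = (uint64_t) __p;                                    \
    } while (0)
```

On x86-64 this typically compiles down to the same single `mulq` instruction as the hand-written assembly version.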
#### apm.c apm.h

This part provides many big-number operations, including:

- increment/decrement
- add
- sub
- dmul, which computes a partial product
- shift
- cmp

The default convention is that `u` and `v` are the operands and the result goes into `w`. There is no function here that computes a full apm multiplication, only `dmul()` for partial products.

:::success
The apm multiplication lives separately in **mul.c**.
:::

- Operations with an *i* postfix appear to mean in-place: `u` and `v` are combined and the result is stored back into `u`.
- Operations with an *n* postfix: I am not sure which English word the *n* stands for, but they share the property that `u` and `v` start with the same size.

#### mul.c

The multiplication implementation mixes two algorithms:

- base, the plain schoolbook long multiplication
- [Karatsuba multiplication](https://en.wikipedia.org/wiki/Karatsuba_algorithm)

Schoolbook multiplication is used for short operands and Karatsuba for long ones, so each is used where it is strongest.

##### Karatsuba multiplication

To compute $x\times y$, split $x$ and $y$ around bit position $b$:

$$
\begin{aligned}
x\times y &= (x_1\cdot 2^b+x_0)\times(y_1\cdot 2^b+y_0) \\
&= (x_1 y_1)2^{2b} + (x_1 y_0+ x_0 y_1)2^b + (x_0 y_0)
\end{aligned}
$$

Since $(x_1 y_0+ x_0 y_1) = (x_1+x_0)\times(y_1+y_0) - x_1y_1 - x_0y_0$, the number of multiplications drops from 4 to 3. Moreover, each sub-product has the same shape as $x\times y$, so the scheme recurses; overall this is a divide-and-conquer algorithm. The resulting recurrence $T(n)=3\,T(n/2)+O(n)$ gives $O(n^{\log_2 3})\approx O(n^{1.585})$, compared with $O(n^2)$ for the schoolbook method, which is why it pays off only for long operands.

#### bn.h

This part wraps the apm types into a struct:

```c
typedef struct {
    apm_digit *digits; /* Digits of number. */
    apm_size size;     /* Length of number. */
    apm_size alloc;    /* Size of allocation. */
    unsigned sign : 1; /* Sign bit. */
} bn, bn_t[1];
```

and, on top of the apm interface, re-exports a series of arithmetic operations as the bn interface.
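The second name in the typedef, `bn_t[1]`, is the same idiom GMP uses for `mpz_t`: a one-element array type decays to a pointer when passed to a function, so callers get pass-by-reference semantics without writing `&` at every call site. A minimal standalone illustration follows; the `foo` type here is hypothetical and not taken from the repository.

```c
#include <stdio.h>

/* Miniature of the bn/bn_t pattern: foo_t is a one-element array, so a
 * foo_t argument decays to foo * and the callee can modify the caller's
 * object without an explicit '&'.
 */
typedef struct {
    int value;
} foo, foo_t[1];

static void foo_set(foo_t x, int v) /* parameter adjusts to foo * */
{
    x->value = v;
}

int main(void)
{
    foo_t a;        /* a is an array of one foo */
    foo_set(a, 42); /* no & needed, yet a is modified in place */
    printf("%d\n", a->value);
    return 0;
}
```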