# 2020q1 Homework2 (fibdrv)
contributed by < `rwe0214` >
###### tags: `Linux`, `rwe0214`, `NCKU`
+ [作業說明](https://hackmd.io/@sysprog/linux2020-fibdrv)
## 自我檢查清單
- [x] 研讀上述 ==Linux 效能分析的提示== 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗;
- [x] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
- [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise),思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
- [ ] `lsmod` 的輸出結果有一欄名為 `Used by`,這是 "each module's use count and a list of referring modules",但如何實作出來呢?模組間的相依性和實際使用次數 ([reference counting](https://en.wikipedia.org/wiki/Reference_counting)) 在 Linux 核心如何追蹤呢?
- [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題
### 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
Fibonacci 數定義為
$$
F(0) = 0, F(1) = 1 \\
F(n) = F(n-1) + F(n-2),\ where \ n > 1
$$
但是此種遞迴計算的成本太高,而 [Nikolai. N. Vorobev](https://www.amazon.com/Fibonacci-Numbers-Dover-Books-Mathematics/dp/048648386X) 揭露了 Fibonacci 擁有以下性質 (可由數學歸納法證明之)
$$
F(n+k) = F(n-1) \times F(k) + F(n) \times F(k+1)
$$
若將 $n=n+1,\ k=n$ 和 $n=n+1,\ k=n+1$ 代入上式可得
$$
\begin{align}
F(2n+1) & = F(n)^2 + F(n+1)^2 \\
F(2n+2) & = F(n)\times F(n+1) + F(n+1)\times F(n+2) \notag\\
& =F(n)\times F(n+1) + F(n+1)\times [F(n) + F(n+1)] \notag\\
& =2F(n)\times F(n+1) + F(n+1)^2
\end{align}
$$
再將 $F(2n+2)-F(2n+1)$ 可得
$$
F(2n) = 2F(n)\times F(n+1) - F(n)^2
$$
如此一來可以整理出當 $n$ 分別是偶數和奇數的演算法:
$$
\begin{align}
&F(0) = 0,\ F(1) = 1,\ F(2) = 1 \\
&F(2n) = 2F(n)\times F(n+1) - F(n)^2 \\
&F(2n+1) = F(n)^2 + F(n+1)^2
\end{align}
$$
不過此種遞迴方式會比較快嗎?以 $F(6)$ 來說,
* 1st recurrsion
```graphviz
strict digraph G
{
1[label="F(6)"]
2[label="F(4)"]
3[label="F(5)"]
4[label="F(2)"]
5[label="F(3)"]
6[label="F(3)"]
7[label="F(4)"]
8[label="F(0)", style=filled]
9[label="F(1)", style=filled]
10[label="F(1)", style=filled]
11[label="F(2)"]
12[label="F(1)", style=filled]
13[label="F(2)"]
14[label="F(2)"]
15[label="F(3)"]
16[label="F(0)", style=filled]
17[label="F(1)", style=filled]
18[label="F(0)", style=filled]
19[label="F(1)", style=filled]
20[label="F(0)", style=filled]
21[label="F(1)", style=filled]
22[label="F(1)", style=filled]
23[label="F(2)", style=filled]
24[label="F(0)", style=filled]
25[label="F(1)", style=filled]
1 -> {2, 3}
2 -> {4, 5}
3 -> {6, 7}
4 -> {8, 9}
5 -> {10, 11}
6 -> {12, 13}
7 -> {14, 15}
11 -> {16, 17}
13 -> {18, 19}
14 -> {20, 21}
15 -> {22, 23}
23 -> {24, 25}
}
```
* 2st recurrsion
```graphviz
strict digraph G
{
1[label="F(6)"]
2[label="F(3)"]
3[label="F(4)"]
4[label="F(1)", style=filled]
5[label="F(2)", style=filled]
6[label="F(2)", style=filled]
7[label="F(3)"]
8[label="F(1)", style=filled]
9[label="F(2)", style=filled]
1 -> {2, 3}
2 -> {4, 5}
3 -> {6, 7}
7 -> {8, 9}
}
```
可以看出遞迴的次數減少非常多,那還能再更快嗎?
再觀察到 $F(6)$ 被分成 $F(3)$ 和 $F(4)$ 兩個部分,其中 $F(4) = F(2)+F(3)$ ,可以利用 $F(3)$ 和遞迴 $F(3)$ 時所得到的 $F(2)$ 去計算 $F(4)$,這樣可以再次降低運算的次數,如下:
* 2st recurrsion
```graphviz
strict digraph G
{
1[label="F(6)"]
2[label="F(3)"]
3[label="F(4)"]
4[label="F(1)", style=filled]
5[label="F(2)", style=filled]
6[label=" " ,color=white]
7[label=" " ,color=white]
{rank = same; 2;3;}
{rank = same; 4;5;}
1 -> {2, 3}
2 -> {4, 5}
2 -> 3 [style=dashed; arrowhead=vee]
5 -> 3 [style=dashed; arrowhead=vee]
3 -> {6, 7} [color=white]
}
```
接著再將這兩種想法以 bottom-up 的實作方式來比較,可以看到相當大的差距。
由於受限 `unsigned long long` 只有 64 bits 的表達範圍 ( i.e., Max value = $2^{64}-1$ ),且 $Fib(93)$ 的值大於 $2^{64}-1$,在尚未實作大數運算前,只設計了數值範圍到 $Fib(92)$ 的實驗。
程式碼如下:
```cpp
unsigned long long Adding(int k) {
unsigned long long f[92] = {0, 1};
for (int i = 2; i <= k; i++)
f[i] = f[i - 1] + f[i - 2];
return f[k];
}
```
```cpp
unsigned long long Fast_doubling(int k) {
int bs = 0, saved = k;
while (k) {
bs++;
k /= 2;
}
k = saved;
unsigned long long t1, t2, a = 0, b = 1;
for (int i = bs; i > 0; i--) {
t1 = a * (2 * b - a);
t2 = b * b + a * a;
a = t1;
b = t2;
if (k & (1 << (i - 1))) {
t1 = a + b;
a = b;
b = t1;
k &= ~(1 << (i - 1));
}
}
return a;
}
```
實驗結果:

明顯看到 `Fast doubling` 明顯優於 `Adding` 方法。
接著比較在 `fast doubling` 上使用硬體指令 `clz` 的差別,
```cpp
//Check the i_th bit
if (k & (1 << (i - 1)) && k > 0) {
```
和
```cpp
//Check the i_th bit
if ((32 - __builtin_clz(k)) == i && k > 0) {
```
執行[實驗程式碼](https://gist.github.com/rwe0214/5dc1f3964ca2bd738ece68dbb200e103)
```shell
$ gcc -o fibonacci fibonacci.c -lm
$ taskset 0x1 ./fibonacci #限定在相同的處理器上執行
$ gnuplot runtime.gp
$ eog runtime.png
```
每個 $Fib(n)$ 取樣 $10^5$ 次取 95% 的信賴區間來統計結果:

可以看到有無使用 `clz` 的結果非常相近, 而 `clz` 在 n 值越大的時候表現略優 ( 紫色線 ),推測是因為硬體加速在數字越大 ( i.e., bits 數越多 ) 效果越顯著。
但是這個 `clz` 是使用 builtin 的編譯器優化,並不是使用 Linux 核心 API,結果可能會有所誤差。
接著進行使用 memorization 的方法,將之前計算過得 $Fib(n)$ 保存在 LUT 中,觀察是否有獲得更佳的效能,因為範圍只到 $Fin(92)$ 所以只建立的 $Fib(0)$ 至 $Fib(49)$ 來查表,實驗結果如下:

可以看到當 n 越大時,使用 LUT 的速度越快,但是同時也損失了 memory 空間去儲存 LUT。
### 印出 Fib(92) 以後的數值
考慮需要更多的位元空間來儲存 $Fib(92)$ 以後的數值,定義了一個結構,
```cpp
typedef struct uint128_t {
unsigned long long msb;
unsigned long long lsb;
} my_uint128
```
:::warning
為何不採用變動長度的數值表示法?一如 quiz2 採用的策略
:notes: jserv
:::
> 目前這個部分仍在思考,只是目前還沒有完整的想法,所以先實作固定長度,之後再做修改
> [name=rwe0214][time=Fri, Mar 6, 2020 11:08 AM]
在比較完計算 $Fib$ 的方法後,決定採用 `fast_doubling` 的方式來實作。
因為在 `fast doubling` 中需乘法、加法和減法的操作,其中加減法方法較為簡單,乘法則是模擬計算機的乘法器的行為,方式如下
* 乘法
```cpp
#define MAX_INT64 0xFFFFFFFFFFFFFFFF
my_uint128 RESHI, RESLO;
void rshift_sl(my_uint128 *h, my_uint128 *l){
l->lsb >>= 1;
if(l->msb & 0x1)
l->lsb |= 0x8000000000000000;
l->msb >>= 1;
if(h->lsb & 0x1)
l->msb |= 0x8000000000000000;
h->lsb >>= 1;
if(h->msb & 0x1)
h->lsb |= 0x8000000000000000;
h->msb >>= 1;
return;
}
void mul_ql_bin(my_uint128 a, my_uint128 b){
init_mul();
RESLO = b;
int count = 128;
for(int i = 128; i>0; i--){
if(RESLO.lsb & 0x1)
RESHI = add_ql_bin(RESHI, a);
rshift_sl(&RESHI, &RESLO);
}
return;
}
```
* 加法和減法
```cpp
my_uint128 add_ql_bin(my_uint128 a, my_uint128 b){
my_uint128 c;
c.msb = a.msb + b.msb;
if(MAX_INT64 - a.lsb < b.lsb){
c.msb++;
c.lsb = b.lsb - (MAX_INT64 - a.lsb) - 1;
}
else
c.lsb = a.lsb + b.lsb;
return c;
}
//This function did not handle with the situation a < b
my_uint128 sub_ql_bin(my_uint128 a, my_uint128 b){
//c = a - b
my_uint128 c;
if(a.lsb < b.lsb){
a.msb--;
c.lsb = MAX_INT64 - (b.lsb - a.lsb - 1);
}
else
c.lsb = a.lsb - b.lsb;
return c;
}
```
將結果印出並和 [Fibonacci numbers](http://www.protocol5.com/Fibonacci/100.htm) 做確認相同
```shell
$ ./fibonacci
...
f(93) = 0 a94fad42221f2702
f(94) = 1 11f38ad0840bf6bf
f(95) = 1 bb433812a62b1dc1
f(96) = 2 cd36c2e32a371480
f(97) = 4 8879faf5d0623241
f(98) = 7 55b0bdd8fa9946c1
f(99) = b de2ab8cecafb7902
f(100) = 13 33db76a7c594bfc3
```
試著利用 quiz2 的方法實作可變長度的大數,定義以下的結構 `ubgi`,大小為 128 bits ( i.e., 16 bytes ),
```cpp
typedef union UBIGNUMI {
unsigned long long val;
struct {
unsigned long long filled : 63;
unsigned long long left;
/*Indicate that whether ubgi is a large number or not*/
uint8_t is_ptr : 1;
};
struct {
/*[lsb][...]...[...][msb]*/
unsigned long long *ptr;
/*Length of ptr array*/
unsigned long long size : 63;
};
} ubgi;
```
接著利用 `new_ubgi` 函式定義一個新的大數,因為是以十進位的字串輸入數值,而一個 63 bits 的 `unsigned long long` 的最大值在十進位中恰為 19 位數,所以以 19 當作數值位數的分界。
```cpp
ubgi new_ubgi(char *val) {
ubgi new;
unsigned long long tmp = 0;
int flag = 0;
int size_dec = strlen(val);
if (size_dec > 19) {
/*Actual size of ubgi value, it is up to 2^(size-1)*/
int size = 1024;
char binary[size];
dec2bin(binary, val);
new.is_ptr = 1;
new.size = (strlen(binary) % 64) ? strlen(binary) / 64 + 1
: strlen(binary) / 64;
new.ptr = malloc(new.size * sizeof(unsigned long long));
for (int i = 0; i < strlen(binary); i++) {
if (binary[i] - '0') {
new.ptr[i / 64] |= ((unsigned long long)1 << (i % 64));
}
}
} else {
for (int i = size_dec - 1; i >= 0; i--) {
tmp += (val[i] - '0') * power(10, size_dec - 1 - i);
}
if (tmp & 0x8000000000000000 && size_dec == 19) {
new.is_ptr = 1;
new.size = 1;
new.ptr = malloc(sizeof(unsigned long long));
*(new.ptr) = tmp;
} else {
new.is_ptr = 0;
new.filled = tmp;
new.left = MAX_INT63 - tmp;
}
}
return new;
};
```
完成了數值的新增後,雖然數值可以儲存的非常大,而且可以達到作業要求,比對 [Fibonacci numbers 300](http://www.protocol5.com/Fibonacci/300.htm) 後結果為正確的。
```console.log
f(97) = 8 3621143489848422977
f(98) = 13 5301852344706746049
f(99) = 21 8922995834555169026
f(100) = 35 4224848179261915075
...
f(299) = 137347 0805771631154320257 7171027913184570027 5212767467264610201
f(300) = 222232 2446294204455297398 9346190996720666693 9096499764990979600
```
但是在計算上總覺得非常沒有效率,以 `c = a + b` 加法來說,需要判斷 a 和 b 的種類,共有四種狀況,如此會在程式碼中有著大量的條件判斷式如下,
```cpp
ubgi add(ubgi a, ubgi b) {
ubgi c;
if (a.is_ptr)
if (b.is_ptr)
addll(&c, a, b);
else
addl(&c, a, b);
else
if (b.is_ptr)
addl(&c, b, a);
else
addnl(&c, a, b);
return c;
}
/*Add two large number*/
void addll(ubgi *c, ubgi a, ubgi b) {
unsigned long long carry = 0;
c->is_ptr = 1;
c->size = max(a.size, b.size) + 1;
c->ptr = malloc(c->size * sizeof(unsigned long long));
for (int i = 0; i < a.size && i < b.size; i++) {
if (MAX_INT64 - a.ptr[i] < b.ptr[i]) {
c->ptr[i] = b.ptr[i] - (MAX_INT64 - a.ptr[i]) - 1 + carry;
carry = 1;
} else {
if (MAX_INT64 - a.ptr[i] - b.ptr[i] < carry) {
c->ptr[i] = carry - (MAX_INT64_DEC - a.ptr[i] - b.ptr[i]) - 1;
carry = 1;
} else {
c->ptr[i] = carry + a.ptr[i] + b.ptr[i];
carry = 0;
}
}
}
if (a.size < b.size) {
for (int i = a.size; i < b.size; i++) {
if (MAX_INT64 - b.ptr[i] < carry) {
c->ptr[i] = carry - (MAX_INT64_DEC - b.ptr[i]) - 1;
carry = 1;
} else {
c->ptr[i] = carry + b.ptr[i];
carry = 0;
}
}
}
if ((b.size < a.size)) {
for (int i = b.size; i < a.size; i++) {
if (MAX_INT64 - a.ptr[i] < carry) {
c->ptr[i] = carry - (MAX_INT64 - a.ptr[i]) - 1;
carry = 1;
} else {
c->ptr[i] = carry + a.ptr[i];
carry = 0;
}
}
}
if (carry == 1)
c->ptr[c->size - 1] = carry;
else {
c->size--;
c->ptr = realloc(c->ptr, c->size * sizeof(unsigned long long));
}
}
/*Only a is large number*/
void addl(ubgi *c, ubgi a, ubgi b) {
unsigned long long carry = 0;
c->is_ptr = 1;
c->size = a.size + 1;
c->ptr = malloc(c->size * sizeof(unsigned long long));
if (MAX_INT64 - a.ptr[0] < b.val) {
c->ptr[0] = b.val - (MAX_INT64 - a.ptr[0]) - 1 + carry;
carry = 1;
} else {
c->ptr[0] = b.val + a.ptr[0];
}
for (int i = 1; i < a.size; i++) {
if (MAX_INT64 - a.ptr[i] < carry) {
c->ptr[i] = carry - (MAX_INT64 - a.ptr[i]) - 1;
carry = 1;
} else {
c->ptr[i] = carry + a.ptr[i];
carry = 0;
}
}
if (carry == 1)
c->ptr[c->size - 1] = carry;
else {
c->size--;
c->ptr = realloc(c->ptr, c->size * sizeof(unsigned long long));
}
}
/*Neither a and b are large number*/
void addnl(ubgi *c, ubgi a, ubgi b) {
if (MAX_INT64 - a.val < b.val) {
c->is_ptr = 1;
c->size = 2;
c->ptr = malloc(c->size * sizeof(unsigned long long));
c->ptr[0] = a.val - (MAX_INT64 - b.val) - 1;
c->ptr[1] += 1;
} else {
c->is_ptr = 0;
c->val = a.val + b.val;
}
}
```
接著是減法、乘法,所以目前還想嘗試思考有無其他做法,或是改變結構體來達到精簡化。
後來將結構體改為
```cpp
typedef struct UBIGNUM {
uint64_t *val;
int len;
} big;
```
加法在實作上仍然採用傳統的直式加法,而乘法則以硬體乘法器的模型去完成,

只是我將 product 中的位數個數以乘數及被乘數的位數來決定,並把乘數放入低位,判斷 LSB 決定是否將乘數用大數加法累加進 product 的高位。
```cpp
/*
* big c {
* len = a.len + b.len
* *val = [a.len-1]...[b.len][b.len-1]...[0]
* |------RESHI------||----RESLO----|
* }
*/
big mul_big(big a, big b)
{
big c;
if(/*handle with a=0 or a=1 or b=0 or b=1*/){
...
return c;
}
c.len = a.len + b.len;
c.val = calloc(c.len, sizeof(uint64_t));
for (int i = 0; i < b.len; i++)
c.val[i] = b.val[i];
big RESLO, RESHI;
RESLO.len = b.len;
RESLO.val = &c.val[0];
RESHI.len = a.len;
RESHI.val = &c.val[b.len];
for (int i = 0; i < b.len << 6; i++) {
if (RESLO.val[0] & (uint64_t) 1)
RESHI = add_big(RESHI, a);
/*right shift RES*/
uint64_t shifted = 0;
for (int j = 0; j < RESLO.len - 1; j++) {
shifted = RESLO.val[j + 1] & (uint64_t) 1;
RESLO.val[j] >>= 1;
if (shifted)
RESLO.val[j] |= ((uint64_t) 1 << 63);
}
shifted = RESHI.val[0] & (uint64_t) 1;
RESLO.val[RESLO.len - 1] >>= 1;
if (shifted)
RESLO.val[RESLO.len - 1] |= ((uint64_t) 1 << 63);
for (int j = 0; j < RESHI.len - 1; j++) {
shifted = RESHI.val[j + 1] & (uint64_t) 1;
RESHI.val[j] >>= 1;
if (shifted)
RESHI.val[j] |= ((uint64_t) 1 << 63);
}
RESHI.val[RESHI.len - 1] >>= 1;
}
for (int i = 0; i < RESLO.len; i++)
c.val[i] = RESLO.val[i];
for (int i = 0; i < RESHI.len; i++)
c.val[RESLO.len + i] = RESHI.val[i];
for (int i = c.len - 1; i >= 0; i--)
if (c.val[i] != (uint64_t) 0x0) {
c.len = i + 1;
c.val = realloc(c.val, (i + 1) * sizeof(uint64_t));
break;
}
return c;
}
```
上面的儲存數值的方法是單純的二進位制,但是如果要將二進位制的大數以十進位的方式印出,不能單除用數學的方法將數值累加,我參考了 [Double dabble(1)](https://www.youtube.com/watch?v=eXIfZ1yKFlA) 和 [Double dabble(2)](https://en.wikipedia.org/wiki/Double_dabble) 的方法轉換二進位成十進制。
假設數字是 `01101011`,其十進位是 `107` 以下是 double dabble 進行的步驟,
```
100 10 1 | Binary
===============|==========
0000 0000 0000 | 01101011 1th shift left
0000 0000 0000 | 11010110 2nd shift left
0000 0000 0001 | 10101100 3rd shift left
0000 0000 0011 | 01011000 4th shift left
0000 0000 0110 | 10110000 '1' digit is greater than or equal to 5,add 3
0000 0000 1001 | 10110000 5th shift left
0000 0001 0011 | 01100000 6th shift left
0000 0010 0110 | 11000000 '1' digit is greater than or equal to 5,add 3
0000 0010 1001 | 11000000 7th shift left
0000 0101 0011 | 10000000 '10' digit is greater than or equal to 5,add 3
0000 1000 0011 | 10000000 8th shift left
0001 0000 0111 | 00000000
---------------|----------
1 0 7
```
以上面的手法修改後,在記憶體運許的狀況下,可以將 `Fib` 數印出,為了驗證正確性,寫了一個 [verify.py](https://github.com/rwe0214/fibdrv/blob/master/big/helper/verify.py) 的小程式去驗證 `Fib` 的計算
```console.log
$ time ./test 3000
f(3000) = 410615886307971260333568378719267105220125108637369252408885430926905584274113403731330491660850044560830036835706942274588569362145476502674373045446852160486606292497360503469773453733196887405847255290082049086907512622059054542195889758031109222670849274793859539133318371244795543147611073276240066737934085191731810993201706776838934766764778739502174470268627820918553842225858306408301661862900358266857238210235802504351951472997919676524004784236376453347268364152648346245840573214241419937917242918602639810097866942392015404620153818671425739835074851396421139982713640679581178458198658692285968043243656709796000
./test 3000 0.01s user 0.00s system 85% cpu 0.013 total
```
```console.log
$ make check 3000
./test 3000
f(3000) = 410615886307971260333568378719267105220125108637369252408885430926905584274113403731330491660850044560830036835706942274588569362145476502674373045446852160486606292497360503469773453733196887405847255290082049086907512622059054542195889758031109222670849274793859539133318371244795543147611073276240066737934085191731810993201706776838934766764778739502174470268627820918553842225858306408301661862900358266857238210235802504351951472997919676524004784236376453347268364152648346245840573214241419937917242918602639810097866942392015404620153818671425739835074851396421139982713640679581178458198658692285968043243656709796000
python3 helper/verify.py
.
----------------------------------------------------------------------
Ran 1 test in 0.001s
OK
```