# 2020q1 Homework2 (fibdrv) contributed by < `rwe0214` > ###### tags: `Linux`, `rwe0214`, `NCKU` + [作業說明](https://hackmd.io/@sysprog/linux2020-fibdrv) ## 自我檢查清單 - [x] 研讀上述 ==Linux 效能分析的提示== 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗; - [x] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 - [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise),思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本; - [ ] `lsmod` 的輸出結果有一欄名為 `Used by`,這是 "each module's use count and a list of referring modules",但如何實作出來呢?模組間的相依性和實際使用次數 ([reference counting](https://en.wikipedia.org/wiki/Reference_counting)) 在 Linux 核心如何追蹤呢? - [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題 ### 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 Fibonacci 數定義為 $$ F(0) = 0, F(1) = 1 \\ F(n) = F(n-1) + F(n-2),\ where \ n > 1 $$ 但是此種遞迴計算的成本太高,而 [Nikolai. N. Vorobev](https://www.amazon.com/Fibonacci-Numbers-Dover-Books-Mathematics/dp/048648386X) 揭露了 Fibonacci 擁有以下性質 (可由數學歸納法證明之) $$ F(n+k) = F(n-1) \times F(k) + F(n) \times F(k+1) $$ 若將 $n=n+1,\ k=n$ 和 $n=n+1,\ k=n+1$ 代入上式可得 $$ \begin{align} F(2n+1) & = F(n)^2 + F(n+1)^2 \\ F(2n+2) & = F(n)\times F(n+1) + F(n+1)\times F(n+2) \notag\\ & =F(n)\times F(n+1) + F(n+1)\times [F(n) + F(n+1)] \notag\\ & =2F(n)\times F(n+1) + F(n+1)^2 \end{align} $$ 再將 $F(2n+2)-F(2n+1)$ 可得 $$ F(2n) = 2F(n)\times F(n+1) - F(n)^2 $$ 如此一來可以整理出當 $n$ 分別是偶數和奇數的演算法: $$ \begin{align} &F(0) = 0,\ F(1) = 1,\ F(2) = 1 \\ &F(2n) = 2F(n)\times F(n+1) - F(n)^2 \\ &F(2n+1) = F(n)^2 + F(n+1)^2 \end{align} $$ 不過此種遞迴方式會比較快嗎?以 $F(6)$ 來說, * 1st recurrsion ```graphviz strict digraph G { 1[label="F(6)"] 2[label="F(4)"] 3[label="F(5)"] 4[label="F(2)"] 5[label="F(3)"] 6[label="F(3)"] 7[label="F(4)"] 8[label="F(0)", style=filled] 9[label="F(1)", style=filled] 10[label="F(1)", style=filled] 11[label="F(2)"] 12[label="F(1)", style=filled] 13[label="F(2)"] 14[label="F(2)"] 15[label="F(3)"] 16[label="F(0)", style=filled] 17[label="F(1)", style=filled] 18[label="F(0)", style=filled] 19[label="F(1)", style=filled] 20[label="F(0)", style=filled] 21[label="F(1)", style=filled] 22[label="F(1)", style=filled] 23[label="F(2)", style=filled] 24[label="F(0)", style=filled] 25[label="F(1)", style=filled] 1 -> {2, 3} 2 -> {4, 5} 3 -> {6, 7} 4 -> {8, 9} 5 -> {10, 11} 6 -> {12, 13} 7 -> {14, 15} 11 -> {16, 17} 13 -> {18, 19} 14 -> {20, 21} 15 -> {22, 23} 23 -> {24, 25} } ``` * 2st recurrsion ```graphviz strict digraph G { 1[label="F(6)"] 2[label="F(3)"] 3[label="F(4)"] 4[label="F(1)", style=filled] 5[label="F(2)", style=filled] 6[label="F(2)", style=filled] 7[label="F(3)"] 8[label="F(1)", style=filled] 9[label="F(2)", style=filled] 1 -> {2, 3} 2 -> {4, 5} 3 -> {6, 7} 7 -> {8, 9} } ``` 可以看出遞迴的次數減少非常多,那還能再更快嗎? 再觀察到 $F(6)$ 被分成 $F(3)$ 和 $F(4)$ 兩個部分,其中 $F(4) = F(2)+F(3)$ ,可以利用 $F(3)$ 和遞迴 $F(3)$ 時所得到的 $F(2)$ 去計算 $F(4)$,這樣可以再次降低運算的次數,如下: * 2st recurrsion ```graphviz strict digraph G { 1[label="F(6)"] 2[label="F(3)"] 3[label="F(4)"] 4[label="F(1)", style=filled] 5[label="F(2)", style=filled] 6[label=" " ,color=white] 7[label=" " ,color=white] {rank = same; 2;3;} {rank = same; 4;5;} 1 -> {2, 3} 2 -> {4, 5} 2 -> 3 [style=dashed; arrowhead=vee] 5 -> 3 [style=dashed; arrowhead=vee] 3 -> {6, 7} [color=white] } ``` 接著再將這兩種想法以 bottom-up 的實作方式來比較,可以看到相當大的差距。 由於受限 `unsigned long long` 只有 64 bits 的表達範圍 ( i.e., Max value = $2^{64}-1$ ),且 $Fib(93)$ 的值大於 $2^{64}-1$,在尚未實作大數運算前,只設計了數值範圍到 $Fib(92)$ 的實驗。 程式碼如下: ```cpp unsigned long long Adding(int k) { unsigned long long f[92] = {0, 1}; for (int i = 2; i <= k; i++) f[i] = f[i - 1] + f[i - 2]; return f[k]; } ``` ```cpp unsigned long long Fast_doubling(int k) { int bs = 0, saved = k; while (k) { bs++; k /= 2; } k = saved; unsigned long long t1, t2, a = 0, b = 1; for (int i = bs; i > 0; i--) { t1 = a * (2 * b - a); t2 = b * b + a * a; a = t1; b = t2; if (k & (1 << (i - 1))) { t1 = a + b; a = b; b = t1; k &= ~(1 << (i - 1)); } } return a; } ``` 實驗結果: ![實驗結果](https://i.imgur.com/M9tCvOG.png) 明顯看到 `Fast doubling` 明顯優於 `Adding` 方法。 接著比較在 `fast doubling` 上使用硬體指令 `clz` 的差別, ```cpp //Check the i_th bit if (k & (1 << (i - 1)) && k > 0) { ``` 和 ```cpp //Check the i_th bit if ((32 - __builtin_clz(k)) == i && k > 0) { ``` 執行[實驗程式碼](https://gist.github.com/rwe0214/5dc1f3964ca2bd738ece68dbb200e103) ```shell $ gcc -o fibonacci fibonacci.c -lm $ taskset 0x1 ./fibonacci #限定在相同的處理器上執行 $ gnuplot runtime.gp $ eog runtime.png ``` 每個 $Fib(n)$ 取樣 $10^5$ 次取 95% 的信賴區間來統計結果: ![](https://i.imgur.com/deGnLi0.png) 可以看到有無使用 `clz` 的結果非常相近, 而 `clz` 在 n 值越大的時候表現略優 ( 紫色線 ),推測是因為硬體加速在數字越大 ( i.e., bits 數越多 ) 效果越顯著。 但是這個 `clz` 是使用 builtin 的編譯器優化,並不是使用 Linux 核心 API,結果可能會有所誤差。 接著進行使用 memorization 的方法,將之前計算過得 $Fib(n)$ 保存在 LUT 中,觀察是否有獲得更佳的效能,因為範圍只到 $Fin(92)$ 所以只建立的 $Fib(0)$ 至 $Fib(49)$ 來查表,實驗結果如下: ![](https://i.imgur.com/caZEsWv.png) 可以看到當 n 越大時,使用 LUT 的速度越快,但是同時也損失了 memory 空間去儲存 LUT。 ### 印出 Fib(92) 以後的數值 考慮需要更多的位元空間來儲存 $Fib(92)$ 以後的數值,定義了一個結構, ```cpp typedef struct uint128_t { unsigned long long msb; unsigned long long lsb; } my_uint128 ``` :::warning 為何不採用變動長度的數值表示法?一如 quiz2 採用的策略 :notes: jserv ::: > 目前這個部分仍在思考,只是目前還沒有完整的想法,所以先實作固定長度,之後再做修改 > [name=rwe0214][time=Fri, Mar 6, 2020 11:08 AM] 在比較完計算 $Fib$ 的方法後,決定採用 `fast_doubling` 的方式來實作。 因為在 `fast doubling` 中需乘法、加法和減法的操作,其中加減法方法較為簡單,乘法則是模擬計算機的乘法器的行為,方式如下 * 乘法 ```cpp #define MAX_INT64 0xFFFFFFFFFFFFFFFF my_uint128 RESHI, RESLO; void rshift_sl(my_uint128 *h, my_uint128 *l){ l->lsb >>= 1; if(l->msb & 0x1) l->lsb |= 0x8000000000000000; l->msb >>= 1; if(h->lsb & 0x1) l->msb |= 0x8000000000000000; h->lsb >>= 1; if(h->msb & 0x1) h->lsb |= 0x8000000000000000; h->msb >>= 1; return; } void mul_ql_bin(my_uint128 a, my_uint128 b){ init_mul(); RESLO = b; int count = 128; for(int i = 128; i>0; i--){ if(RESLO.lsb & 0x1) RESHI = add_ql_bin(RESHI, a); rshift_sl(&RESHI, &RESLO); } return; } ``` * 加法和減法 ```cpp my_uint128 add_ql_bin(my_uint128 a, my_uint128 b){ my_uint128 c; c.msb = a.msb + b.msb; if(MAX_INT64 - a.lsb < b.lsb){ c.msb++; c.lsb = b.lsb - (MAX_INT64 - a.lsb) - 1; } else c.lsb = a.lsb + b.lsb; return c; } //This function did not handle with the situation a < b my_uint128 sub_ql_bin(my_uint128 a, my_uint128 b){ //c = a - b my_uint128 c; if(a.lsb < b.lsb){ a.msb--; c.lsb = MAX_INT64 - (b.lsb - a.lsb - 1); } else c.lsb = a.lsb - b.lsb; return c; } ``` 將結果印出並和 [Fibonacci numbers](http://www.protocol5.com/Fibonacci/100.htm) 做確認相同 ```shell $ ./fibonacci ... f(93) = 0 a94fad42221f2702 f(94) = 1 11f38ad0840bf6bf f(95) = 1 bb433812a62b1dc1 f(96) = 2 cd36c2e32a371480 f(97) = 4 8879faf5d0623241 f(98) = 7 55b0bdd8fa9946c1 f(99) = b de2ab8cecafb7902 f(100) = 13 33db76a7c594bfc3 ``` 試著利用 quiz2 的方法實作可變長度的大數,定義以下的結構 `ubgi`,大小為 128 bits ( i.e., 16 bytes ), ```cpp typedef union UBIGNUMI { unsigned long long val; struct { unsigned long long filled : 63; unsigned long long left; /*Indicate that whether ubgi is a large number or not*/ uint8_t is_ptr : 1; }; struct { /*[lsb][...]...[...][msb]*/ unsigned long long *ptr; /*Length of ptr array*/ unsigned long long size : 63; }; } ubgi; ``` 接著利用 `new_ubgi` 函式定義一個新的大數,因為是以十進位的字串輸入數值,而一個 63 bits 的 `unsigned long long` 的最大值在十進位中恰為 19 位數,所以以 19 當作數值位數的分界。 ```cpp ubgi new_ubgi(char *val) { ubgi new; unsigned long long tmp = 0; int flag = 0; int size_dec = strlen(val); if (size_dec > 19) { /*Actual size of ubgi value, it is up to 2^(size-1)*/ int size = 1024; char binary[size]; dec2bin(binary, val); new.is_ptr = 1; new.size = (strlen(binary) % 64) ? strlen(binary) / 64 + 1 : strlen(binary) / 64; new.ptr = malloc(new.size * sizeof(unsigned long long)); for (int i = 0; i < strlen(binary); i++) { if (binary[i] - '0') { new.ptr[i / 64] |= ((unsigned long long)1 << (i % 64)); } } } else { for (int i = size_dec - 1; i >= 0; i--) { tmp += (val[i] - '0') * power(10, size_dec - 1 - i); } if (tmp & 0x8000000000000000 && size_dec == 19) { new.is_ptr = 1; new.size = 1; new.ptr = malloc(sizeof(unsigned long long)); *(new.ptr) = tmp; } else { new.is_ptr = 0; new.filled = tmp; new.left = MAX_INT63 - tmp; } } return new; }; ``` 完成了數值的新增後,雖然數值可以儲存的非常大,而且可以達到作業要求,比對 [Fibonacci numbers 300](http://www.protocol5.com/Fibonacci/300.htm) 後結果為正確的。 ```console.log f(97) = 8 3621143489848422977 f(98) = 13 5301852344706746049 f(99) = 21 8922995834555169026 f(100) = 35 4224848179261915075 ... f(299) = 137347 0805771631154320257 7171027913184570027 5212767467264610201 f(300) = 222232 2446294204455297398 9346190996720666693 9096499764990979600 ``` 但是在計算上總覺得非常沒有效率,以 `c = a + b` 加法來說,需要判斷 a 和 b 的種類,共有四種狀況,如此會在程式碼中有著大量的條件判斷式如下, ```cpp ubgi add(ubgi a, ubgi b) { ubgi c; if (a.is_ptr) if (b.is_ptr) addll(&c, a, b); else addl(&c, a, b); else if (b.is_ptr) addl(&c, b, a); else addnl(&c, a, b); return c; } /*Add two large number*/ void addll(ubgi *c, ubgi a, ubgi b) { unsigned long long carry = 0; c->is_ptr = 1; c->size = max(a.size, b.size) + 1; c->ptr = malloc(c->size * sizeof(unsigned long long)); for (int i = 0; i < a.size && i < b.size; i++) { if (MAX_INT64 - a.ptr[i] < b.ptr[i]) { c->ptr[i] = b.ptr[i] - (MAX_INT64 - a.ptr[i]) - 1 + carry; carry = 1; } else { if (MAX_INT64 - a.ptr[i] - b.ptr[i] < carry) { c->ptr[i] = carry - (MAX_INT64_DEC - a.ptr[i] - b.ptr[i]) - 1; carry = 1; } else { c->ptr[i] = carry + a.ptr[i] + b.ptr[i]; carry = 0; } } } if (a.size < b.size) { for (int i = a.size; i < b.size; i++) { if (MAX_INT64 - b.ptr[i] < carry) { c->ptr[i] = carry - (MAX_INT64_DEC - b.ptr[i]) - 1; carry = 1; } else { c->ptr[i] = carry + b.ptr[i]; carry = 0; } } } if ((b.size < a.size)) { for (int i = b.size; i < a.size; i++) { if (MAX_INT64 - a.ptr[i] < carry) { c->ptr[i] = carry - (MAX_INT64 - a.ptr[i]) - 1; carry = 1; } else { c->ptr[i] = carry + a.ptr[i]; carry = 0; } } } if (carry == 1) c->ptr[c->size - 1] = carry; else { c->size--; c->ptr = realloc(c->ptr, c->size * sizeof(unsigned long long)); } } /*Only a is large number*/ void addl(ubgi *c, ubgi a, ubgi b) { unsigned long long carry = 0; c->is_ptr = 1; c->size = a.size + 1; c->ptr = malloc(c->size * sizeof(unsigned long long)); if (MAX_INT64 - a.ptr[0] < b.val) { c->ptr[0] = b.val - (MAX_INT64 - a.ptr[0]) - 1 + carry; carry = 1; } else { c->ptr[0] = b.val + a.ptr[0]; } for (int i = 1; i < a.size; i++) { if (MAX_INT64 - a.ptr[i] < carry) { c->ptr[i] = carry - (MAX_INT64 - a.ptr[i]) - 1; carry = 1; } else { c->ptr[i] = carry + a.ptr[i]; carry = 0; } } if (carry == 1) c->ptr[c->size - 1] = carry; else { c->size--; c->ptr = realloc(c->ptr, c->size * sizeof(unsigned long long)); } } /*Neither a and b are large number*/ void addnl(ubgi *c, ubgi a, ubgi b) { if (MAX_INT64 - a.val < b.val) { c->is_ptr = 1; c->size = 2; c->ptr = malloc(c->size * sizeof(unsigned long long)); c->ptr[0] = a.val - (MAX_INT64 - b.val) - 1; c->ptr[1] += 1; } else { c->is_ptr = 0; c->val = a.val + b.val; } } ``` 接著是減法、乘法,所以目前還想嘗試思考有無其他做法,或是改變結構體來達到精簡化。 後來將結構體改為 ```cpp typedef struct UBIGNUM { uint64_t *val; int len; } big; ``` 加法在實作上仍然採用傳統的直式加法,而乘法則以硬體乘法器的模型去完成, ![](https://3.bp.blogspot.com/-YhLJfA1IRuI/VL_YF0ewScI/AAAAAAAAlmA/AyNb4-PSxd0/s1600/%E8%9E%A2%E5%B9%95%E5%BF%AB%E7%85%A7%2B2015-01-21%2B%E4%B8%8B%E5%8D%8811.24.45.png) 只是我將 product 中的位數個數以乘數及被乘數的位數來決定,並把乘數放入低位,判斷 LSB 決定是否將乘數用大數加法累加進 product 的高位。 ```cpp /* * big c { * len = a.len + b.len * *val = [a.len-1]...[b.len][b.len-1]...[0] * |------RESHI------||----RESLO----| * } */ big mul_big(big a, big b) { big c; if(/*handle with a=0 or a=1 or b=0 or b=1*/){ ... return c; } c.len = a.len + b.len; c.val = calloc(c.len, sizeof(uint64_t)); for (int i = 0; i < b.len; i++) c.val[i] = b.val[i]; big RESLO, RESHI; RESLO.len = b.len; RESLO.val = &c.val[0]; RESHI.len = a.len; RESHI.val = &c.val[b.len]; for (int i = 0; i < b.len << 6; i++) { if (RESLO.val[0] & (uint64_t) 1) RESHI = add_big(RESHI, a); /*right shift RES*/ uint64_t shifted = 0; for (int j = 0; j < RESLO.len - 1; j++) { shifted = RESLO.val[j + 1] & (uint64_t) 1; RESLO.val[j] >>= 1; if (shifted) RESLO.val[j] |= ((uint64_t) 1 << 63); } shifted = RESHI.val[0] & (uint64_t) 1; RESLO.val[RESLO.len - 1] >>= 1; if (shifted) RESLO.val[RESLO.len - 1] |= ((uint64_t) 1 << 63); for (int j = 0; j < RESHI.len - 1; j++) { shifted = RESHI.val[j + 1] & (uint64_t) 1; RESHI.val[j] >>= 1; if (shifted) RESHI.val[j] |= ((uint64_t) 1 << 63); } RESHI.val[RESHI.len - 1] >>= 1; } for (int i = 0; i < RESLO.len; i++) c.val[i] = RESLO.val[i]; for (int i = 0; i < RESHI.len; i++) c.val[RESLO.len + i] = RESHI.val[i]; for (int i = c.len - 1; i >= 0; i--) if (c.val[i] != (uint64_t) 0x0) { c.len = i + 1; c.val = realloc(c.val, (i + 1) * sizeof(uint64_t)); break; } return c; } ``` 上面的儲存數值的方法是單純的二進位制,但是如果要將二進位制的大數以十進位的方式印出,不能單除用數學的方法將數值累加,我參考了 [Double dabble(1)](https://www.youtube.com/watch?v=eXIfZ1yKFlA) 和 [Double dabble(2)](https://en.wikipedia.org/wiki/Double_dabble) 的方法轉換二進位成十進制。 假設數字是 `01101011`,其十進位是 `107` 以下是 double dabble 進行的步驟, ``` 100 10 1 | Binary ===============|========== 0000 0000 0000 | 01101011 1th shift left 0000 0000 0000 | 11010110 2nd shift left 0000 0000 0001 | 10101100 3rd shift left 0000 0000 0011 | 01011000 4th shift left 0000 0000 0110 | 10110000 '1' digit is greater than or equal to 5,add 3 0000 0000 1001 | 10110000 5th shift left 0000 0001 0011 | 01100000 6th shift left 0000 0010 0110 | 11000000 '1' digit is greater than or equal to 5,add 3 0000 0010 1001 | 11000000 7th shift left 0000 0101 0011 | 10000000 '10' digit is greater than or equal to 5,add 3 0000 1000 0011 | 10000000 8th shift left 0001 0000 0111 | 00000000 ---------------|---------- 1 0 7 ``` 以上面的手法修改後,在記憶體運許的狀況下,可以將 `Fib` 數印出,為了驗證正確性,寫了一個 [verify.py](https://github.com/rwe0214/fibdrv/blob/master/big/helper/verify.py) 的小程式去驗證 `Fib` 的計算 ```console.log $ time ./test 3000 f(3000) = 410615886307971260333568378719267105220125108637369252408885430926905584274113403731330491660850044560830036835706942274588569362145476502674373045446852160486606292497360503469773453733196887405847255290082049086907512622059054542195889758031109222670849274793859539133318371244795543147611073276240066737934085191731810993201706776838934766764778739502174470268627820918553842225858306408301661862900358266857238210235802504351951472997919676524004784236376453347268364152648346245840573214241419937917242918602639810097866942392015404620153818671425739835074851396421139982713640679581178458198658692285968043243656709796000 ./test 3000 0.01s user 0.00s system 85% cpu 0.013 total ``` ```console.log $ make check 3000 ./test 3000 f(3000) = 410615886307971260333568378719267105220125108637369252408885430926905584274113403731330491660850044560830036835706942274588569362145476502674373045446852160486606292497360503469773453733196887405847255290082049086907512622059054542195889758031109222670849274793859539133318371244795543147611073276240066737934085191731810993201706776838934766764778739502174470268627820918553842225858306408301661862900358266857238210235802504351951472997919676524004784236376453347268364152648346245840573214241419937917242918602639810097866942392015404620153818671425739835074851396421139982713640679581178458198658692285968043243656709796000 python3 helper/verify.py . ---------------------------------------------------------------------- Ran 1 test in 0.001s OK ```