# 2020q1 Homework2 (fibdrv)
contributed by < `jeeeff` >
## 實驗環境
作業系統
```shell
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic
```
編譯器
```shell
$ gcc --version
gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
```
## 作業要求
- [ ] 撰寫適用於 Linux 核心層級的程式
* 學習 ktimer, copy_to_user 一類的核心 API
- [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise)
- [ ] 初探 Linux VFS
- [ ] 自動測試機制
- [x] 透過工具進行效能分析
- [x] 研讀上述 ==Linux 效能分析的提示== 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗;
- [ ] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
- [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise),思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
- [ ] `lsmod` 的輸出結果有一欄名為 `Used by`,這是 "each module's use count and a list of referring modules",但如何實作出來呢?模組間的相依性和實際使用次數 ([reference counting](https://en.wikipedia.org/wiki/Reference_counting)) 在 Linux 核心如何追蹤呢?
- [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題
## 實做過程
### client.c
```cpp
int main()
{
long long sz;
char buf[100];
char write_buf[] = "testing writing";
int offset = 100; /* TODO: try test something bigger than the limit */
int fd = open(FIB_DEV, O_RDWR);
if (fd < 0) {
perror("Failed to open character device");
exit(1);
}
for (int i = 0; i <= offset; i++) {
sz = write(fd, write_buf, strlen(write_buf));
printf("Writing to " FIB_DEV ", returned the sequence %lld\n", sz);
}
for (int i = 0; i <= offset; i++) {
lseek(fd, i, SEEK_SET);
sz = read(fd, buf, 1);
big_fibnum(buf, sz);
printf("Reading from " FIB_DEV
" at offset %d, returned the sequence "
"%s.\n",
i, buf);
}
for (int i = offset; i >= 0; i--) {
lseek(fd, i, SEEK_SET);
sz = read(fd, buf, 1);
big_fibnum(buf, sz);
printf("Reading from " FIB_DEV
" at offset %d, returned the sequence "
"%s.\n",
i, buf);
}
close(fd);
return 0;
}
```
### 基本工具準備
* 老師都幫我們準備好了基本工具,只要貼上就好,而且也是準備做大數的形式
:::danger
請依循 Linux 核心原始碼風格,避免 [camelCase](https://en.wikipedia.org/wiki/Camel_case),變數或韓式名稱用簡潔的全小寫。
:notes: jserv
:::
```cpp
struct BigN {
unsigned long long lower, upper;
};
static inline void addBigN(struct BigN *output, struct BigN x, struct BigN y)
{
output->upper = x.upper + y.upper;
if (y.lower > ~x.lower)
output->upper++;
output->lower = x.lower + y.lower;
}
```
### addBigN()
* 做大數加法的關鍵,這個工具老師也幫我們準備好
```cpp
static inline void addBigN(struct BigN *output, struct BigN x, struct BigN y)
{
output->upper = x.upper + y.upper;
if (y.lower > ~x.lower)
output->upper++;
output->lower = x.lower + y.lower;
}
```
### fib_sequence()
* 工具都準備好了,用數學課遞迴的方式做 fibonacci sequence 運算
```cpp
static struct BigN fib_sequence(long long k)
{
/* FIXME: use clz/ctz and fast algorithms to speed up */
struct BigN f[k + 2];
f[0].upper = 0;
f[0].lower = 0;
f[1].upper = 0;
f[1].lower = 1;
for (int i = 2; i <= k; i++) {
addBigN(&f[i], f[i - 1], f[i - 2]);
}
return f[k];
}
```
* gnuplot 部份是參考 < `pingsutw` > 同學的 analysis
* 這邊值得一提的是因為 k+2 這麼長的陣列比起用 3 個變數做相對浪費記憶體空間,但不確定效能是是否有明顯影響,所以拿 fib_sequence 和下面這個 code 做比較
```cpp
static struct BigN fib_3(int k)
{
struct BigN t1,t2,t3;
t1.upper = 0;
t1.lower = 1;
t2.upper = 0;
t2.lower = 0;
for(int i=2;i<=k;i++) {
addBigN(&t3,&t1,&t2);
t2 = t1;
t1 = t3;
}
return t3;
}
```
* 做一萬次取平均,先貼 code 再貼圖
:::danger
避免用「平均」,改用統計手法去除 [outlier](https://en.wikipedia.org/wiki/Outlier)
:notes: jserv
:::
```cpp
#define count 10000
int main()
{
FILE *fp = fopen("./performance", "w");
char time_buf[100] = {0};
double t1_rs, t3_rs;
for (int i = 0; i <= MAX_LENGTH; i++) {
struct timespec start, stop;
t1_rs = 0;
t3_rs = 0;
/* fib_sequence */
for(int j = 0; j < count; j++) {
clock_gettime(CLOCK_MONOTONIC, &start);
struct BigN res1 = fib_sequence(i);
clock_gettime(CLOCK_MONOTONIC, &stop);
//if(j == 0)
// display_big_fibnum(i, res1);
double t1 = diff_in_ns(start, stop);
t1_rs += t1;
}
/* fib_3 */
for(int j = 0; j < count; j++) {
clock_gettime(CLOCK_MONOTONIC, &start);
struct BigN res3 = fib_3(i);
clock_gettime(CLOCK_MONOTONIC, &stop);
//if(j == 0)
// display_big_fibnum(i, res3);
double t3 = diff_in_ns(start, stop);
t3_rs += t3;
}
t1_rs /= count;
t3_rs /= count;
snprintf(time_buf, sizeof(time_buf), "%d %.10lf %.10lf\n", i, t1_rs, t3_rs);
fputs(time_buf, fp);
}
return 0;
}
```

* 因為把 display_big_fibnum 註解掉所以時間有再降更低,不過若要看答案是否正確,要刪掉註解
* 結果與猜的相反!原本以為 K + 2 的不需要做 t1, t2, t3 交換的動作會比較快,結果證實 3-variable 較快,可能是因為耗費 memory 很少,讓 CPU 不需要做 paging
```
92 = 7540113804746346429
92 = 7540113804746346429
93 = 12200160415121876738
93 = 12200160415121876738
94 = 19740274219868223167
94 = 19740274219868223167
...
132 = 1725375039079340637797070384
132 = 1725375039079340637797070384
133 = 2791715456571051233611642553
133 = 2791715456571051233611642553
134 = 45170904956460969042((712937
134 = 45170904956460969042((712937
gnuplot analysis.gp
eog performance.png
```
* 對照 [The first 300 Fibonacci numbers](http://www.maths.surrey.ac.uk/hosted-sites/R.Knott/Fibonacci/fibtable.html) 答案是對的,可以做到 fib(133) ,從 fib(134) 開始會出現一些符號
```
92 : 7540113804746346429 = 3 x 139 x 461 x 4969 x 28657 x 275449
93 : 12200160415121876738 = 2 x 557 x 2417 x 4531100550901
94 : 19740274219868223167 = 2971215073 x 6643838879
...
132 : 1725375039079340637797070384 = 24 x 32 x 43 x 89 x 199 x 307 x 9901 x 19801 x 261399601
133 : 2791715456571051233611642553 = 13 x 37 x 113 x 3457 x 42293 x 351301301942501
134 : 4517090495650391871408712937 = 269 x 4021 x 116849 x 1429913 x 24994118449
```
* fib_sequence (k + 2), fib_3 和 fib_ctz 都各自跑一萬次取平均值的結果
* 之前跟同學寫作業的時候,fast doubling 就一直都比 recursion 累加的方式慢
### fast_doubling
* 這邊寫的 fib_fast_logn 就是 fast_doubling ,單純因為[來源網站](https://www.hackerearth.com/zh/practice/notes/fast-doubling-method-to-find-nth-fibonacci-number/),只是速度仍然比 fib_sequence 還慢,費氏數列一樣可以算到 n = 133 還是對的,從 n = 134 開始出現一些符號
```cpp
void fib_fast_doubling(int k, uint128_t *ans0, uint128_t *ans1)
{
if (k == 0) {
ans0->upper = 0llu;
ans0->lower = 0llu;
ans1->upper = 0llu;
ans1->lower = 1llu;
return;
}
fib_fast_doubling((k/2), ans0, ans1);
uint128_t a = *ans0;
uint128_t b = *ans1;
uint128_t b2;
uint128_t c = {.upper = 0llu, .lower = 0llu};
uint128_t d = {.upper = 0llu, .lower = 0llu};
add_t(&b2,b,b);
sub_t(&c,b2,a);
uint128_t cpa = {.upper = 0llu, .lower = 0llu};
mul_t(&cpa,a,c);
mul_t(&a,a,a);
mul_t(&b,b,b);
add_t(&d,a,b);
if (k%2) {
*ans0 = d;
add_t(ans1,cpa,d);
} else {
*ans0 = cpa;
*ans1 = d;
}
}
```

```
92 = 7540113804746346429
92 = 7540113804746346429
92 = 7540113804746346429
93 = 12200160415121876738
93 = 12200160415121876738
93 = 12200160415121876738
94 = 19740274219868223167
94 = 19740274219868223167
94 = 19740274219868223167
...
132 = 1725375039079340637797070384
132 = 1725375039079340637797070384
132 = 1725375039079340637797070384
133 = 2791715456571051233611642553
133 = 2791715456571051233611642553
133 = 2791715456571051233611642553
134 = 45170904956460969042((712937
134 = 45170904956460969042((712937
134 = 45170904956460969042((712937
```
* 因為抖動蠻大的,所以 < `pingsutw` > 同學建議我跑一百萬次,跑出來的圖在下面:

* 程式在單核上執行,過程中電腦有自己切換 CPU ,下圖是從 CPU1 切到 CPU3 的截圖

### clz 和 ctz
* 試驗有用 clz, ctz 的演算法和沒有用 clz, ctz 的,分別是 fib_clzctz() 和 fib_fast_doubling()
#### fib_clzctz
* 這邊借鑒 < `AndybnACT` > 同學的演算法,然後前面 fast_doubling 的減法器和乘法器也是借鑒 < `AndybnACT` > 同學的,所以可以確保實驗變數只有一個(是否使用 clz 和 ctz)
```cpp
static uint128_t fib_clzctz(int k)
{
uint128_t a = {.lower = 0, .upper = 0};
uint128_t b = {.lower = 1, .upper = 0};
if (k <= 1) {
if(k == 0)
a = a;
else
a = b;
return a;
}
int clz = __builtin_clzll(k);
for (int i = (64 - clz); i > 0; i--) {
uint128_t t1, t2, tmp;
ls_t(&tmp, b, 1); // tmp = 2b
sub_t(&tmp, tmp, a); // tmp = tmp -a
mul_t(&t1, a, tmp); // t1 = a*tmp
mul_t(&a, a, a); // a = a^2
mul_t(&b, b, b); // b = b^2
add_t(&t2, a, b); // t2 = a^2 + b^2
a = t1;
b = t2;
if (k & (1ull << (i - 1))) { // current bit == 1
add_t(&t1, a, b);
a = b;
b = t1;
}
}
return a;
}
```
* 跑一萬次取平均

* 跑一百萬次取平均 (這裡的 fib_fast_logn 就是 fib_fast_doubling)

* 雖然 n > 133 之後答案都是錯的,但可以看出有用 clz, ctz 的演算法略快一些(綠色線大部份是貼在紫色線上面的)
## 參考資料
* [[ gcc Built-in Functions for binary ] gcc 內建處理二進位函數](http://sunmoon-template.blogspot.com/2017/04/gcc-built-in-functions-for-binary-gcc.html)
* [The first 300 Fibonacci numbers](http://www.maths.surrey.ac.uk/hosted-sites/R.Knott/Fibonacci/fibtable.html)
* [uint128_t does not name a type](https://stackoverflow.com/questions/34588650/uint128-t-does-not-name-a-type)
* [Fast Doubling method to find nth Fibonacci number](https://www.hackerearth.com/zh/practice/notes/fast-doubling-method-to-find-nth-fibonacci-number/)