# 2020q1 Homework2 (fibdrv)
contributed by < `jwang0306` >
> [作業要求](https://hackmd.io/@sysprog/linux2020-fibdrv#-%E4%BD%9C%E6%A5%AD%E8%A6%81%E6%B1%82)
> [GitHub](https://github.com/jwang0306/fibdrv)
---
## 自我檢查清單
- [ ] 研讀上述 Linux 效能分析的提示描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作 $\rightarrow$ 從中也該理解為何不希望在虛擬機器中進行實驗;
- [x] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
* 快速的 fibonacci 公式與計算方法可歸納如下:
$$F(2k) = F(k)[2F(k+1) - F(k)] \\ F(2k+1) = F(k+1)^2+F(k)^2$$
* 遇到 `0` $\rightarrow$ 求 $F(2n)$ 和 $F(2n+1)$
* 遇到 `1` $\rightarrow$ 求 $F(2n)$ 和 $F(2n+1)$,再求 $F(2n+2)$
* 一個 32 bit 的數字,只要算它所使用到的就好,也就是從左邊數來第一個非 0 開始,所以可以把 leading zero 扣除,透過內建的 `__builtin_clz()` 得到 leading zero 的數量,再用 32 減掉,即為第一個非 0 的起始位置
- [ ] 複習 C 語言 數值系統 和 bitwise operation,思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
- [ ] `lsmod` 的輸出結果有一欄名為 `Used by`,這是 “each module’s use count and a list of referring modules”,但如何實作出來呢?模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢?
- [x] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題
* 使用 lock 場景:在多個 thread 需要使用共同資源的時候,為了確保執行先後順序的正確性,需要使用 mutex lock 來保證一次只能有一個執行緒進入 critical section 。
* 撰寫多執行緒的 userspace 程式
```cpp
void *execute(void *arg)
{
char buf[100];
char write_buf[] = "testing writing";
char method[1];
int offset = 100;
long long sz;
int fd = open(FIB_DEV, O_RDWR);
if (fd < 0) {
perror("Failed to open character device");
}
method[0] = FIB_DOUB_CLZ;
sz = write(fd, method, 1);
for (int i = 0; i <= offset; i++) {
lseek(fd, i, SEEK_SET);
sz = read(fd, buf, i);
printf("Reading from " FIB_DEV
" at offset %d by thread #%ld, returned the sequence "
"%s.\n",
i, (long)arg, buf);
}
close(fd);
}
int main()
{
pthread_t tid[NUM_THREAD];
for (int i = 0; i < NUM_THREAD; ++i) {
pthread_create(&(tid[i]), NULL, &execute, (void *)(long)(i+1));
}
for (int i = 0; i < NUM_THREAD; ++i) {
pthread_join(tid[i], NULL);
}
}
```
沒有 lock
- 若是把 `fibdrv.c` 的 `mutex_trylock` 拿掉,可以看到 thread 3 和 4 的順序是亂的
- 但是算得的結果是正確的,因為只是對 device 進行讀取,而非寫入
```shell
...
Reading from /dev/fibonacci at offset 72 by thread #4, returned the sequence 498454011879264.
Reading from /dev/fibonacci at offset 73 by thread #4, returned the sequence 806515533049393.
Reading from /dev/fibonacci at offset 43 by thread #3, returned the sequence 433494437.
Reading from /dev/fibonacci at offset 74 by thread #4, returned the sequence 1304969544928657.
Reading from /dev/fibonacci at offset 44 by thread #3, returned the sequence 701408733.
Reading from /dev/fibonacci at offset 45 by thread #3, returned the sequence 1134903170.
Reading from /dev/fibonacci at offset 75 by thread #4, returned the sequence 2111485077978050.
Reading from /dev/fibonacci at offset 76 by thread #4, returned the sequence 3416454622906707.
Reading from /dev/fibonacci at offset 77 by thread #4, returned the sequence 5527939700884757.
Reading from /dev/fibonacci at offset 46 by thread #3, returned the sequence 1836311903.
Reading from /dev/fibonacci at offset 47 by thread #3, returned the sequence 2971215073.
...
```
有 `mutex_trylock()`
```cpp
static int fib_open(struct inode *inode, struct file *file)
{
if (!mutex_trylock(&fib_mutex)) {
printk(KERN_ALERT "fibdrv is in use");
return -EBUSY;
}
return 0;
}
```
- `mutex_trylock` 顧名思義會嘗試去取得 lock ,但是他不會等待,因此沒有取得 lock 的話該執行緒不會睡在那邊等,會直接回傳成功與否,然後接著下去執行了
- 因此按照原本的程式碼,它會回傳一個負的值,導致讀取 `fibdrv` 失敗,輸出也就會留白:
```shell
...
Reading from /dev/fibonacci at offset 98 by thread #5, returned the sequence 135301852344706746049.
Reading from /dev/fibonacci at offset 99 by thread #5, returned the sequence 218922995834555169026.
Reading from /dev/fibonacci at offset 100 by thread #5, returned the sequence 354224848179261915075.
thread #3 Failed to open character device
Reading from /dev/fibonacci at offset 0 by thread #3, returned the sequence .
Reading from /dev/fibonacci at offset 1 by thread #3, returned the sequence .
Reading from /dev/fibonacci at offset 2 by thread #3, returned the sequence .
Reading from /dev/fibonacci at offset 3 by thread #3, returned the sequence .
Reading from /dev/fibonacci at offset 4 by thread #3, returned the sequence .
Reading from /dev/fibonacci at offset 5 by thread #3, returned the sequence .
...
```
有 `mutex_lock()`
```cpp
static int fib_open(struct inode *inode, struct file *file)
{
mutex_lock(&fib_mutex);
return 0;
}
```
- 我進行以上修改,將其換成 `mutex_lock` ,如此一來若是沒有取得 lock ,該執行緒就會睡到可以取得為止
- 因此輸出的順序是正確的(只有放上部份,全部貼上來太長)
```shell
Reading from /dev/fibonacci at offset 0 by thread #5, returned the sequence 0.
Reading from /dev/fibonacci at offset 1 by thread #5, returned the sequence 1.
Reading from /dev/fibonacci at offset 2 by thread #5, returned the sequence 1.
Reading from /dev/fibonacci at offset 3 by thread #5, returned the sequence 2.
Reading from /dev/fibonacci at offset 4 by thread #5, returned the sequence 3.
Reading from /dev/fibonacci at offset 5 by thread #5, returned the sequence 5.
Reading from /dev/fibonacci at offset 6 by thread #5,
...
Reading from /dev/fibonacci at offset 96 by thread #1, returned the sequence 51680708854858323072.
Reading from /dev/fibonacci at offset 97 by thread #1, returned the sequence 83621143489848422977.
Reading from /dev/fibonacci at offset 98 by thread #1, returned the sequence 135301852344706746049.
Reading from /dev/fibonacci at offset 99 by thread #1, returned the sequence 218922995834555169026.
Reading from /dev/fibonacci at offset 100 by thread #1, returned the sequence 354224848179261915075.
```
等待 lock 所花時間
- 由於我們無法單純由 `mutex_lock` 直接得知 lock 到底是否正被佔著,因此我迂迴地進行了測試,如果 `mutex_trylock` 沒有搶到 lock ,那麼我就利用 `mutex_lock` 再去搶一次,目的是為了讓該 thread 「卡」在那邊,直到其他 thread 將 lock 釋出
- 後面緊接著要 `mutex_unlock` ,否則程式會整個當掉,因為後面的執行緒永遠等不到 lock
```cpp
static int fib_open(struct inode *inode, struct file *file)
{
if (!mutex_trylock(&fib_mutex)) {
printk(KERN_ALERT "fibdrv is in use");
mutex_lock(&fib_mutex);
printk(KERN_ALERT "fibdrv is now available");
mutex_unlock(&fib_mutex);
return -EBUSY;
}
return 0;
}
```
- 由於我無法在 kernel space 得知此刻到底是哪個執行緒在跑,因此我選擇在 user space 進行時間測量
- 當 thread 沒搶到 lock 的時候, `fd` 會小於 0 ,再把所花時間印出來,即為該 thread 的等待時間
```cpp
void *execute(void *arg)
{
...
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
int fd = open(FIB_DEV, O_RDWR);
clock_gettime(CLOCK_MONOTONIC, &end);
long long spent = (long long)(end.tv_sec * 1e9 + end.tv_nsec) - (long long)(start.tv_sec * 1e9 + start.tv_nsec);
if (fd < 0)
printf("Thread #%ld has slept for %lld (ns).\n", (long)arg, spent);
...
}
```
- 範例輸出
```shell
Thread #3 has slept for 139201 (ns).
```
經過多次實驗,有五個執行緒的情況下,萬一某執行緒沒有搶到 lock ,其等待的時間約為好幾萬、甚至是幾十萬 nsec 。因此在我們這個 case 底下,若是有多個執行緒,且只是單純對檔案進行讀取、沒有對共享資源進行操作的話,不使用 lock 的整體速度表現上會比較好(假設對輸出順序不在意,畢竟數字的計算仍然正確)。
---
## Kernel space 與 User space 之間的傳遞時間
我參考了 [ambersun](https://hackmd.io/@ambersun1234/fibdrv) 的作法。我透過暫時修改 `write` 函式來計時:
```cpp
static ssize_t fib_write(struct file *file,
const char *buf,
size_t size,
loff_t *offset)
{
return ktime_to_ns(ktime_get());
}
```
回傳到 user space 時再透過與 monotonic time 之間的差異來得到 kernel to user 與 user to kernel 的時間:
```cpp
for (int i = 0; i <= offset; i++) {
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
sz = write(fd, write_buf, strlen(write_buf));
clock_gettime(CLOCK_MONOTONIC, &end);
long long u2k = sz - (long long)(start.tv_sec * 1e9 + start.tv_nsec);
long long k2u = (long long)(end.tv_sec * 1e9 + end.tv_nsec) - sz;
}
```
如果作法是將 user space 測得的 fibonacci 執行時間與 kernel space 測得的 fibonacci 執行時間相減,那麼嚴格說起來是 **kernel to user 與 user to kernel 時間的總和**(藍色的線),並不能代表任何一方:
```cpp
clock_gettime(CLOCK_REALTIME, &start);
sz = read(fd, buf, 1);
clock_gettime(CLOCK_REALTIME, &end);
unsigned int t = diff_in_ns(start, end);
long long sum = t - sz;
```
實驗結果顯示, user to kernel 的傳遞時間要多上一些。
![](https://i.imgur.com/OtyiMBk.png)
---
## 執行時間的優化
首先進行[干擾因素的排除](https://hackmd.io/@sysprog/linux2020-fibdrv#%E6%8E%92%E9%99%A4%E5%B9%B2%E6%93%BE%E6%95%88%E8%83%BD%E5%88%86%E6%9E%90%E7%9A%84%E5%9B%A0%E7%B4%A0),並透過 ktimer 於 kernel space 中測量執行不同 fibonacci 演算法所花的時間,回傳到 user space 再輸出並畫圖。全程透過指定 `taskset 0x1` 固定在同一個 CPU 上執行。
我們希望從 client 端去選擇想要執行的 fibonacci 方法,老師建議可以利用修改 sysfs 來完成。我原先透過 `fib_write` 來控制一個全域 integer 變數,然後根據這個 flag 來決定 `fib_read` 要執行的方法:
```cpp
int flag;
static ssize_t fib_write(struct file *file,
const char *buf,
size_t size,
loff_t *offset)
{
char tmp_buf[1];
copy_from_user(tmp_buf, buf, 1);
flag = tmp_buf[0];
return 1;
}
static long long fib_time_proxy(long long k)
{
ktime_t kt = ktime_get();
long long result;
if (flag == 0)
result = fib_dp(k);
else if (flag == 1)
result = fib_double(k);
else if (flag == 2)
result = fib_double_clz(k);
unsigned int ns = ktime_to_ns(ktime_sub(ktime_get(), kt));
return ns;
}
```
後來看到了 [AndybnACT](https://hackmd.io/@AndybnA/fibdrv) 的方法,我覺得比我的優雅多了,<s>直接把 function 當成變數</s>,就減少了我原先需要一直執行的 if-else block :
:::danger
不是「直接把 function 當成變數」,而該說「傳遞 function pointer」,請複習 [你所不知道的C語言:指標篇](https://hackmd.io/@sysprog/c-pointer) 和 [你所不知道的 C 語言:物件導向程式設計篇](https://hackmd.io/@sysprog/c-oop),以得知 function pointer 使用案例。
:notes: jserv
:::
> [name=jwang0306]謝謝老師,已修正我錯誤的觀念以及說法
```cpp
long long (*fib_sequence)(long long);
static ssize_t fib_write(struct file *file,
const char *buf,
size_t size,
loff_t *offset)
{
char tmp_buf[1];
copy_from_user(tmp_buf, buf, 1);
switch (tmp_buf[0]) {
case 0: /* FIB_DP */
fib_sequence = fib_dp;
break;
case 1: /* FIB_DOUB */
fib_sequence = fib_double;
break;
case 2: /* FIB_DOUB_CLZ */
fib_sequence = fib_double_clz;
break;
}
return 1;
}
static long long fib_time_proxy(long long k)
{
ktime_t kt = ktime_get();
long long result = fib_sequence(k);
unsigned int ns = ktime_to_ns(ktime_sub(ktime_get(), kt));
return ns;
}
```
### fast algorithm
```cpp
long long fib_double(long long k)
{
if (k <= 1)
return k;
long long a = 0, b = 1;
for (int i = 31; i >= 0; --i) {
long long t1 = a * (b * 2 - a);
long long t2 = a * a + b * b;
a = t1;
b = t2;
if (k & (1ull << i)) {
t1 = a + b;
a = b;
b = t1;
}
}
return a;
}
```
<!-- ![](https://i.imgur.com/6rhXiJF.png) -->
### clz/ctz
```cpp
int i = 31 - __builtin_clz(k);
```
利用 `__builtin_clz` 找出 leading zero 從運算過程中排除,但與原本的 dynamic programming 版本放在一起比較時,看不出 fast double 演算法在 clz 的有無之下所造成的差異:
![](https://i.imgur.com/m0x9iJX.png)
:::danger
調整時間刻度再來觀察,關鍵是執行時間的分佈 (複習統計學)
:notes: jserv
:::
> [name=jwang0306]已補上實做,感謝老師提示
因此將有無 clz 與否的時間測量單獨拉出來比較,於每個 offset 採樣五次,畫出分佈圖,加入 clz 後於執行時間的分佈明顯低於沒有使用 clz 的版本,這時就能明顯看出加速了:
```cpp
FILE *fp = fopen("./FIB_DOUB.txt", "w");
char data[50];
method[0] = FIB_DOUB;
sz = write(fd, method, 1);
for (int i = 0; i <= offset; ++i) {
lseek(fd, i, SEEK_SET);
for (int j = 0; j < 5; ++j) {
sz = read(fd, buf, 1);
sprintf(data, "%d %lld\n", i, sz);
fputs(data, fp);
}
}
```
![](https://i.imgur.com/JYhhqA4.png)
---
## 大數運算
原本用 64 位元去存取,超過 $fib(92)$ 之後就會遇到 overflow,也就是超出 64 位元無號整數所能表達的有效數值:
```shell
f(93) fail
input: 7540113804746346429
expected: 12200160415121876738
```
因此想法就是做一個大數運算,嘗試並評估後決定使用十進位來實做,因為二進位的轉換太難處理。
### 結構
- 簡單地用一個陣列來存放每一個十進位數字
- 紀錄總共有幾個十進位數字
- 存放的順序是顛倒的,所以會像: `12345` $\rightarrow$ `arr[5,4,3,2,1,0,0,0,....]` ,這樣迴圈可以從 0 開始算
- 直接使用 integer 的陣列的缺點就是,不能開太大,因為每個位置都是 4 bytes ,否則會爆掉
```cpp
typedef struct bignum {
int num_digits;
int digits[MAX_DIGITS]; // 32
} bn_t;
```
### 加法
- 首先將目標陣列初始化為 0
- 掃過每一個 digit ,並逐次進位
- 最後數數有幾個 digit ,並保存
```cpp
/* C = A + B */
void bn_add(bn_t *c, bn_t *a, bn_t *b)
{
for (int i = 0; i < MAX_DIGITS; ++i)
c->digits[i] = 0;
int carry = 0;
int maxlen = MAX(a->num_digits, b->num_digits) + 1;
for (int i = 0; i < maxlen; ++i) {
c->digits[i] = a->digits[i] + b->digits[i] + carry;
carry = c->digits[i] / 10;
c->digits[i] %= 10;
}
c->num_digits = count_digits(c);
}
```
### 減法
- 與加法相似,逐個借位
- 最後數 digit 數量
```cpp
/* C = A - B */
void bn_sub(bn_t *c, bn_t *a, bn_t *b)
{
for (int i = 0; i < MAX_DIGITS; ++i)
c->digits[i] = 0;
int borrow = 0;
int maxlen = MAX(a->num_digits, b->num_digits) + 1;
for (int i = 0; i < maxlen; ++i) {
c->digits[i] = a->digits[i] - b->digits[i] - borrow;
borrow = 0;
if (c->digits[i] < 0) {
borrow = 1;
c->digits[i] += 10;
}
}
c->num_digits = count_digits(c);
}
```
:::danger
十進位運算麻煩之處:借位帶來的時間開銷和運算量。
:notes: jserv
:::
> [name=jwang0306]正在參考[sysprog21/bignum](https://github.com/sysprog21/bignum)的二進位實做手法
### 乘法
:::danger
翻閱辭典,查閱「雙」的意義,不要將「二」和「雙」的適用情境搞混了!
:notes: jserv
:::
> [name=jwang0306]收到,附上辭典![](https://i.imgur.com/xqstiRp.png)
- 乘法利用<s>雙層</s> 二層迴圈,將每一個 digit 對齊相乘,如果遇到 0 就跳過。過程中超過 10 先放著不管
- 等全部乘完後,陣列的每個元素逐個檢查並進位
- 最後數 digit 數量
```cpp
/* C = A x B */
void bn_mul(bn_t *c, bn_t *a, bn_t *b)
{
for (int i = 0; i < MAX_DIGITS; ++i)
c->digits[i] = 0;
for (int i = 0; i < a->num_digits; ++i) {
if (a->digits[i] == 0)
continue;
for (int j = 0; j < b->num_digits; ++j) {
if (b->digits[j] == 0)
continue;
c->digits[i + j] += (a->digits[i]) * (b->digits[j]);
}
}
for (int i = 0; i < MAX_DIGITS - 1; ++i) {
int carry = 0;
carry = c->digits[i] / 10;
c->digits[i + 1] += carry;
c->digits[i] %= 10;
}
c->num_digits = count_digits(c);
}
```
### 數 digit
- 數字會在陣列的前面,`12345` $\to$ `arr[5,4,3,2,1,0,0,0,....]` ,因此從最後面數過來, 0 的話就減一,遇到非 0 時的 counter index 就是我要的 digit 數
```cpp
int count_digits(bn_t *n)
{
int i;
for (i = MAX_DIGITS; i > 0 && n->digits[i - 1] == 0; --i)
;
return (i == 0) ? 1 : i;
}
```
### 放進 buffer
- 回傳到 user space 必須透過 `char *` 型態的 buffer ,因此要將 integer 型態的一個一個數字放,要注意也要反過來放
- `sprintf` 在 Cppcheck 會跳出警告不給 commit ,因此改用 `snprintf`
- 一定得透過 `copy_tp_user` 回傳給 user space ,因為同樣叫做 `buf` 但是在 user space 和 kernel space 所指向的地址可能值會不同,因此不能直接透過 `sprintf` 寫進 `buf` 參數,否則整個 module 會直接被 Killed
- `long copy_to_user( void __user *to, const void * from, unsigned long n );`
```cpp
for (int i = 0, index = 0; i < result.num_digits; ++i)
index += snprintf(&tmp[index], MAX_DIGITS - index, "%d",
result.digits[result.num_digits - i - 1]);
copy_to_user(buf, tmp, MAX_DIGITS);
```
### 測試
- 比對 [The Fibonacci numbers](http://www.maths.surrey.ac.uk/hosted-sites/R.Knott/Fibonacci/fibtable.html) 後,確認結果符合預期:
```shell
...
Reading from /dev/fibonacci at offset 93, returned the sequence 12200160415121876738.
Reading from /dev/fibonacci at offset 94, returned the sequence 19740274219868223167.
Reading from /dev/fibonacci at offset 95, returned the sequence 31940434634990099905.
Reading from /dev/fibonacci at offset 96, returned the sequence 51680708854858323072.
Reading from /dev/fibonacci at offset 97, returned the sequence 83621143489848422977.
Reading from /dev/fibonacci at offset 98, returned the sequence 135301852344706746049.
Reading from /dev/fibonacci at offset 99, returned the sequence 218922995834555169026.
Reading from /dev/fibonacci at offset 100, returned the sequence 354224848179261915075.
```
- 再次測時間:
- 單純比較有無 clz 硬體加速的話,有 clz 仍然較快
- 與 dynamic programming 版本比較的話就顯得慢了,因為乘法的成本太高,在有硬體加速、輸入數字又大的情況下才拼得過
![](https://i.imgur.com/biAien8.png)
---
## TODO
### 大數運算加速
參考 [sysprog21/bignum](https://github.com/sysprog21/bignum),全部用二進位就可透過 bitwise operation 加速
### add plot to makefile
```shell
plot: time.gp time
@gnuplot $<
@eog time.png
```
```shell
$ taskset 0x1 [program]
$ gnuplot [script].gp
$ eog [image].png
```