---
tags: linux2022
---
# 2022q1 Homework3 (fibdrv)
contributed by < [`cwl0429`](https://github.com/cwl0429) >
```shell 
$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           142
Model name:                      Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz
Stepping:                        11
CPU MHz:                         1800.000
CPU max MHz:                     3900.0000
CPU min MHz:                     400.0000
BogoMIPS:                        3600.00
Virtualization:                  VT-x
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        6 MiB
NUMA node0 CPU(s):               0-7
```
## Self-check list
- [x] Read the ==hints on Linux performance analysis== above, run GNU/Linux on a physical machine, and complete the required setup $\to$ this also explains why experiments should not be run inside a virtual machine
- [x] Read the Fibonacci-related material above (including the papers), summarize the key techniques, and consider how instructions such as [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) help with Fibonacci computation; list and explain the key code
- [x] Review C [numerics](https://hackmd.io/@sysprog/c-numerics) and [bitwise operation](https://hackmd.io/@sysprog/c-bitwise), and consider how implementations of fast Fibonacci algorithms can reduce the cost of multiplication
- [ ] Read [KYG-yaya573142's report](https://hackmd.io/@KYWeng/rkGdultSU) and identify the measures it takes to speed up big-number arithmetic and reduce the cost of memory operations
- [ ] `lsmod` output has a column named `Used by`, described as "each module's use count and a list of referring modules" — but how is this implemented? How does the Linux kernel track inter-module dependencies and actual use counts ([reference counting](https://en.wikipedia.org/wiki/Reference_counting))?
    > Read together with [The Linux driver implementer’s API guide » Driver Basics](https://www.kernel.org/doc/html/latest/driver-api/basics.html)
- [ ] Note that `fibdrv.c` contains `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, and `mutex_destroy` — in what scenarios are these needed? Write a multithreaded userspace program to observe what actually goes wrong when a kernel module omits the mutex, and confirm it with code using [POSIX Thread](https://en.wikipedia.org/wiki/POSIX_Threads)s.
## fibdrv implementation
### Storing big numbers as char
Following [AdrianHuang/fibdrv](https://github.com/AdrianHuang/fibdrv/tree/big_number), numbers are stored as `char` strings:
```c 
typedef union {
    /* allow strings up to 15 bytes to stay on the stack
     * use the last byte as a null terminator and to store flags
     * much like fbstring:
     * https://github.com/facebook/folly/blob/master/folly/docs/FBString.md
     */
    char data[16];
    struct {
        uint8_t filler[15],
            /* how many free bytes in this stack allocated string
             * same idea as fbstring
             */
            space_left : 4,
            /* if it is on heap, set to 1 */
            is_ptr : 1, is_large_string : 1, flag2 : 1, flag3 : 1;
    };
    /* heap allocated */
    struct {
        char *ptr;
        /* supports strings up to 2^MAX_STR_LEN_BITS - 1 bytes */
        size_t size : MAX_STR_LEN_BITS,
                      /* capacity is always a power of 2 (unsigned)-1 */
                      capacity : 6;
        /* the last 4 bits are important flags */
    };
} xs;
```
Next, how are two big numbers stored as `char` strings added?
In the code below, both strings are reversed before the addition: reversing aligns the least significant digits of `a` and `b`. Suppose a is 1234 and b is 534.
Before reversing:
```graphviz
digraph{
node[shape = "record"]
n1[label = "1|2|3|4"]
n2[label = "5|3|4"]
}
```
After reversing:
```graphviz
digraph{
node[shape = "record"]
n1[label = "4|3|2|1"]
n2[label = "4|3|5"]
}
```
Now the first `char` of each array holds the ones digit.
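`reverse_str` is called in `string_number_add` below but not shown in the excerpt; a minimal userspace sketch of an in-place reversal with the same (buffer, length) interface might look like:

```c
#include <stddef.h>

/* Reverse the first n bytes of s in place by swapping from both ends. */
static void reverse_str(char *s, size_t n)
{
    if (n < 2)
        return;
    for (size_t i = 0, j = n - 1; i < j; i++, j--) {
        char tmp = s[i];
        s[i] = s[j];
        s[j] = tmp;
    }
}
```

For example, reversing `"1234"` in place yields `"4321"`, placing the ones digit at the front of the buffer.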
```c 
static void string_number_add(xs *a, xs *b, xs *out)
{
    char *data_a, *data_b;
    size_t size_a, size_b;
    int i, carry = 0;
    int sum;
    /*
     * Make sure the string length of 'a' is always greater than
     * the one of 'b'.
     */
    if (xs_size(a) < xs_size(b))
        __swap((void *) &a, (void *) &b, sizeof(void *));
    data_a = xs_data(a);
    data_b = xs_data(b);
    size_a = xs_size(a);
    size_b = xs_size(b);
    reverse_str(data_a, size_a);
    reverse_str(data_b, size_b);
    char *buf = kmalloc(sizeof(char) * (size_a + 2), GFP_KERNEL);
    for (i = 0; i < size_b; i++) {
        sum = (data_a[i] - '0') + (data_b[i] - '0') + carry;
        buf[i] = '0' + sum % 10;
        carry = sum / 10;
    }
    for (i = size_b; i < size_a; i++) {
        sum = (data_a[i] - '0') + carry;
        buf[i] = '0' + sum % 10;
        carry = sum / 10;
    }
    if (carry)
        buf[i++] = '0' + carry;
    buf[i] = 0;
    reverse_str(buf, i);
    /* Restore the original string */
    reverse_str(data_a, size_a);
    reverse_str(data_b, size_b);
    if (out)
        *out = *xs_tmp(buf);
}
```
After discussing with the instructor, several problems with storing digits as `char` emerged.
__It wastes a large amount of memory__
A `char` is 1 byte (8 bits) and can represent values from `0` to `255`, yet each `char` here only ever holds `'0'` to `'9'`, a range that fits in 4 bits — so at least half of every byte is wasted.
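To make the waste concrete: since a decimal digit needs only 4 bits, two digits fit in one byte (packed BCD). This is only an illustration of the space argument, not code from fibdrv:

```c
#include <stdint.h>

/* Pack two decimal digits (0-9) into a single byte, 4 bits each. */
static uint8_t pack_digits(unsigned hi, unsigned lo)
{
    return (uint8_t) ((hi << 4) | lo);
}

static unsigned unpack_hi(uint8_t b) { return b >> 4; }
static unsigned unpack_lo(uint8_t b) { return b & 0xF; }
```

Storing "534" as chars takes 3 bytes; packed this way, the same digits fit in 2 bytes with one nibble to spare.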
For this reason, I reimplemented the big number on top of an `unsigned int *` array, following the data structure in [KYG-yaya573142's report](https://hackmd.io/@KYWeng/rkGdultSU):
```c 
/**
 * @brief store bignumber with dynamic size
 * number[size - 1] is the msb
 */
typedef struct _BigN {
    unsigned int *number;
    unsigned int size;
    /* sign is not used yet */
    int sign;
} bign;
```
The dynamically allocated `number` array stores the magnitude, `size` records the length of `number` in words, and `sign` is meant to record the sign of the number, though the current implementation does not use it yet.
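`bign_resize` is used throughout the code below, but its body is not shown. A userspace sketch of what it presumably does — grow or shrink `number` and zero any newly added words (the kernel version would use `krealloc` with `GFP_KERNEL` instead of `realloc`):

```c
#include <stdlib.h>
#include <string.h>

typedef struct _BigN {
    unsigned int *number; /* number[size - 1] is the msb word */
    unsigned int size;
    int sign;
} bign;

/* Userspace sketch: resize number[] to hold `size` words, zeroing
 * any words added beyond the old length. Returns 0 on success. */
static int bign_resize(bign *a, unsigned int size)
{
    unsigned int *p = realloc(a->number, size * sizeof(unsigned int));
    if (!p)
        return -1;
    if (size > a->size)
        memset(p + a->size, 0, (size - a->size) * sizeof(unsigned int));
    a->number = p;
    a->size = size;
    return 0;
}
```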
To implement `fib` and `fib_fast_doubling`, I implemented the following functions.
#### `bign_to_string`
Following the method in [KYG-yaya573142's report](https://hackmd.io/@KYWeng/rkGdultSU), this function turns a finished Fibonacci number into a decimal string so it can be copied from kernel space to user space. The process is illustrated below.
Suppose `a->number[0]` is `0...001011` and `a->size` is `1`; the corresponding `string` has length `12`.
> For readability, the figures omit the leading 28 zero-valued bits of `number`.
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|0|0|\\0"]
number[shape = plaintext]
string[shape = plaintext]
number -> n1
string -> s
}
```
Scan `number` from the msb rightward. The current bit becomes the incoming `carry`, and the whole `string` is updated by applying `c[j] += c[j] - '0' + carry;` to every decimal digit (i.e., digit = 2 × digit + carry). Whenever `c[j]` exceeds `'9'`, subtract 10 and set `carry` to `1` so it propagates into `c[j-1]`; otherwise set it to `0`.
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|0|<cur>0 + 0 + 1|'\\0'"]
current_bit[shape = plaintext]
j[shape = plaintext]
current_bit -> n1:1
j -> s:cur
}
```
Repeat the same step:
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|0|<cur>1 + 1 + 0|'\\0'"]
current_bit[shape = plaintext]
j[shape = plaintext]
current_bit -> n1:2
j -> s:cur
}
```
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|0|<cur>2 + 2 + 1|'\\0'"]
current_bit[shape = plaintext]
j[shape = plaintext]
current_bit -> n1:3
j -> s:cur
}
```
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|0|<cur>5 + 5 + 1|'\\0'"]
s_2[label ="0|0|0|0|0|0|0|0|0|<cur>0 + 0 + 1|1|'\\0'"]
current_bit[shape = plaintext]
j[shape = plaintext]
current_bit -> n1:4
j -> s:cur
s -> s_2
j_nxt -> s_2:cur
}
```
After the scan finishes, the final result is `11`:
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|1|1|\\0"]
number[shape = plaintext]
string[shape = plaintext]
number -> n1
string -> s
}
```
```c 
static inline char *bign_to_string(bign *a)
{
    /* change of base formula: log10(x) = log2(x) / log2(10) */
    size_t len = (8 * sizeof(int) * a->size) / 3 + 2;
    char *c = kmalloc(len, GFP_KERNEL);
    char *p = c;
    memset(c, '0', len - 1);
    c[len - 1] = '\0';
    /* transform bits into decimal string */
    for (int i = a->size - 1; i >= 0; i--) {
        for (unsigned int d = 1U << 31; d; d >>= 1) {
            int carry = !!(d & a->number[i]);
            for (int j = len - 2; j >= 0; j--) {
                c[j] += c[j] - '0' + carry;
                carry = (c[j] > '9');
                if (carry)
                    c[j] -= 10;
            }
        }
    }
    while (*p == '0' && *(p + 1) != '\0')
        p++;
    if (a->sign)
        *(--p) = '-';
    memmove(c, p, strlen(p) + 1);
    return c;
}
```
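The same bit-by-bit conversion can be exercised in userspace for a single 32-bit word; a sketch (the helper name and buffer handling are mine, not from fibdrv):

```c
#include <string.h>

/* Convert v to a decimal string in buf (len >= 12): for each bit from the
 * MSB down, double every decimal digit and add the incoming bit as carry.
 * Returns a pointer past the leading zeros. */
static char *bits_to_decimal(unsigned int v, char *buf, size_t len)
{
    memset(buf, '0', len - 1);
    buf[len - 1] = '\0';
    for (unsigned int d = 1U << 31; d; d >>= 1) {
        int carry = !!(d & v);
        for (int j = (int) len - 2; j >= 0; j--) {
            buf[j] += buf[j] - '0' + carry; /* digit = 2 * digit + carry */
            carry = buf[j] > '9';
            if (carry)
                buf[j] -= 10;
        }
    }
    char *p = buf;
    while (*p == '0' && p[1] != '\0')
        p++;
    return p;
}
```

Converting `0b1011` yields `"11"`, matching the walkthrough above.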
#### `bign_add` and `bign_sub`
- `bign_add`
    - Starting from the lsb word of `number`, add `a->number[i]`, `b->number[i]`, and `carry` into `c->number[i]`, using the `DETECT_OVERFLOW` macro to detect wraparound along the way; when it occurs, `carry` is set to `1`, otherwise `0`.
- `bign_sub`
    - Similar in spirit to `bign_add`; the main difference is how the borrow is detected.
```c 
static inline void bign_add(bign *a, bign *b, bign *c)
{
    bign_resize(c, a->size + 1);
    unsigned int carry = 0;
    for (int i = 0; i < c->size; i++) {
        unsigned int tmp_a = (i < a->size) ? a->number[i] : 0;
        unsigned int tmp_b = (i < b->size) ? b->number[i] : 0;
        c->number[i] = tmp_a + tmp_b + carry;
        carry = DETECT_OVERFLOW(tmp_a, tmp_b + carry);
    }
    if (!c->number[c->size - 1] && c->size > 1)
        bign_resize(c, c->size - 1);
}
static inline void bign_sub(bign *a, bign *b, bign *c)
{
    bign_resize(c, a->size);
    unsigned int carry = 0;
    for (int i = 0; i < c->size; i++) {
        long tmp_a = (i < a->size) ? (long) a->number[i] : 0;
        long tmp_b = (i < b->size) ? (long) b->number[i] : 0;
        c->number[i] = tmp_a - tmp_b - carry;
        carry = DETECT_OVERFLOW_SUB(tmp_a - carry, tmp_b);
    }
    if (!c->number[c->size - 1] && c->size > 1)
        bign_resize(c, c->size - 1);
}
```
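`DETECT_OVERFLOW` and `DETECT_OVERFLOW_SUB` are referenced above but not shown in the excerpt. Plausible definitions for unsigned 32-bit words might be the following — these are my guesses, not the actual macros:

```c
#include <stdint.h>

/* a + b wraps around 2^32 iff the truncated sum is smaller than a.
 * Caveat: the call site passes (tmp_b + carry) as the second argument,
 * which can itself wrap when tmp_b is UINT_MAX and carry is 1. */
#define DETECT_OVERFLOW(a, b) ((uint32_t) ((a) + (b)) < (uint32_t) (a))

/* a - b borrows iff b is larger than a. */
#define DETECT_OVERFLOW_SUB(a, b) ((a) < (b))
```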
#### `bign_mul`
Let `a` be the multiplicand and `b` the multiplier; the product is then computed as in long multiplication.
One detail worth noting: `bign_mul` uses [`__fls`](https://github.com/torvalds/linux/blob/master/include/asm-generic/bitops/__fls.h) to skip the leading zeros of each multiplier word and avoid unnecessary work.
> __fls - find last (most-significant) set bit in a long word 
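Outside the kernel, the effect of `__fls` on a nonzero 32-bit word can be reproduced with the GCC/Clang builtin `__builtin_clz`, which makes it easy to check the bit indices used below:

```c
/* 0-based index of the most significant set bit of a nonzero 32-bit x,
 * mirroring the kernel's __fls for unsigned int. Undefined for x == 0,
 * just like __builtin_clz(0). */
static unsigned int fls_index(unsigned int x)
{
    return 31u - (unsigned int) __builtin_clz(x);
}
```

For instance, `fls_index(0b1011)` is 3, so bits 4 through 31 of that word are leading zeros the multiplication loop can skip.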
```c 
static inline void bign_mul(bign *a, bign *b, bign *c)
{
    bign *a_tmp = bign_alloc(1);
    bign_cpy(a, a_tmp);
    if ((!a->number[0] && (a->size == 1)) ||
        (!b->number[0] && (b->size == 1))) {
        /* multiplier or multiplicand is zero */
        bign_resize(c, 1);
        c->number[0] = 0;
        bign_free(a_tmp); /* avoid leaking the working copy */
        return;
    }
    bign_resize(c, 1);
    c->number[0] = 0;
    /* then resize to limit range of a * b */
    bign_resize(c, a->size + b->size);
    for (int i = 0; i < b->size; i++) {
        if (!b->number[i])
            continue;
        unsigned int clz = __fls(b->number[i]) + 1;
        unsigned int shift_cnt = 32 - clz;
        for (unsigned int d = 1U; clz; d <<= 1) {
            if (!!(d & b->number[i])) {
                bign_add(c, a_tmp, c);
            }
            bign_shl(a_tmp, 1);
            clz--;
        }
        bign_shl(a_tmp, shift_cnt);
    }
    bign_free(a_tmp); /* release the working copy */
    int c_size = 0;
    for (c_size = c->size - 1; !c->number[c_size];)
        c_size--;
    if (c->size > 1)
        bign_resize(c, c_size + 1);
}
```
__Updating `bign_mul`__
- Assume three bign values a, b, c
- c = a * b
- c stores the final result
- Let A be a->size and B be b->size

Before:
Scan the multiplier's bits from the lsb toward the msb; whenever a bit is 1, add the multiplicand into c, and shift the multiplicand left by one position at the end of each round.
Cost (the two versions differ mainly inside the nested loops, so only the loop bodies are counted):
- $B \times A \times 32 \times (t_{\text{bign\_add}} + t_{\text{bign\_shl}}) = 32AB \, (t_{\text{bign\_add}} + t_{\text{bign\_shl}})$

where the running time of `bign_add` itself grows with the size of the operands.

After:
Following the `mul` implementation in [KYG-yaya573142's report](https://hackmd.io/@KYWeng/rkGdultSU), every word of the multiplicand `a->number[0..size-1]` is multiplied pairwise with every word of the multiplier `b->number[0..size-1]`.
Cost:
- $B \times A \times t_{\text{mul32}}$

where $t_{\text{mul32}}$, the cost of one `unsigned int` by `unsigned int` multiply, is constant and does not grow with the operand size.
The later implementation performs better because it avoids the repeated `bign_add` calls.
```c 
static inline void bign_mul(bign *a, bign *b, bign *c)
{   
    /* prevent a == c or b == c */
    bign *a_tmp = bign_alloc(1);
    bign_cpy(a, a_tmp);
    /* handle zero multiplier or multiplicand */
    if ((!a->number[0] && (a->size == 1)) ||
        (!b->number[0] && (b->size == 1))) {
        bign_resize(c, 1);
        c->number[0] = 0;
        bign_free(a_tmp); /* avoid leaking the working copy */
        return;
    }
    bign_resize(c, 1);
    c->number[0] = 0;
    /* then resize to limit range of a * b */
    bign_resize(c, a->size + b->size);
    for (int i = 0; i < b->size; i++) {
        if (!b->number[i])
            continue;
        for (int j = 0; j < a_tmp->size; j++) {
            unsigned long long carry;
            carry = (unsigned long long) a_tmp->number[j] * b->number[i];
            bign_mul_add(carry, i + j, c);
        }
    }
    bign_free(a_tmp); /* release the working copy */
    int c_size = 0;
    for (c_size = c->size - 1; !c->number[c_size];)
        c_size--;
    if (c->size > 1)
        bign_resize(c, c_size + 1);
}
```
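`bign_mul_add` is referenced above but not listed; the idea is to add the 64-bit partial product into the result at word offset `i + j` and ripple the carry upward. A userspace sketch over a raw word array (the name and interface are my own):

```c
#include <stdint.h>

/* Add the 64-bit partial product into the 32-bit word array c[] starting
 * at word `offset`, rippling the carry upward. n is the array length. */
static void mul_add_64(uint32_t *c, int n, int offset, uint64_t product)
{
    uint64_t carry = product;
    for (int i = offset; i < n && carry; i++) {
        carry += c[i];           /* add the existing word */
        c[i] = (uint32_t) carry; /* keep the low 32 bits */
        carry >>= 32;            /* propagate the rest upward */
    }
}
```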
Finally, implement `bign_fib` and `bign_fib_fast`:
```c 
static inline void bign_fib(bign *dest, int fn)
{
    bign_resize(dest, 1);
    if (fn < 2) {
        dest->number[0] = fn;
        return;
    }
    bign *a = bign_alloc(1);
    bign *b = bign_alloc(1);
    a->number[0] = 0;
    b->number[0] = 1;
    for (int i = 2; i <= fn; i++) {
        bign_add(a, b, dest);
        bign_cpy(dest, a);
        bign_swap(a, b);
    }
    bign_free(a);
    bign_free(b);
}
static inline void bign_fib_fast(bign *dest, const int fn)
{
    bign_resize(dest, 1);
    if (fn < 2) {
        dest->number[0] = fn;
        return;
    }
    bign *a = bign_alloc(1);
    bign *b = bign_alloc(1);
    bign *tmp = bign_alloc(1);
    a->number[0] = 0;
    b->number[0] = 1;
    /* find leftmost bit position equal to 1*/
    int fn_fls = __fls(fn);
    for (unsigned int d = 1U << fn_fls; d > 0; d >>= 1) {        
        /* F(2k) */
        bign_cpy(b, dest);
        bign_shl(dest, 1);
        bign_sub(dest, a, dest);
        bign_mul(dest, a, dest);
        /* F(2k + 1) */
        bign_mul(a, a, tmp);
        bign_mul(b, b, a);
        bign_add(a, tmp, tmp);
        bign_cpy(dest, a);
        bign_cpy(tmp, b);
        if (!!(d & fn)) {
            bign_add(a, b, a);
            bign_swap(a, b);
            bign_cpy(a, dest);
        }
    }
    
    bign_free(a);
    bign_free(b);
    bign_free(tmp);
}
```
:::warning
Revisit the section "keep variable and function names concise and precise" in the [lab0](https://hackmd.io/@sysprog/linux2022-lab0) assignment notes — the code above mixes upper and lower case in its function names, which does not help readability.
:notes: jserv
:::
> The coding style has been corrected.
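Before debugging the bign version, it helps to confirm the doubling identities $F(2k) = F(k)\,(2F(k+1) - F(k))$ and $F(2k+1) = F(k)^2 + F(k+1)^2$ with plain 64-bit integers (valid up to F(93)); a self-contained sketch, not the driver code itself:

```c
#include <stdint.h>

/* Fast-doubling Fibonacci for n <= 93 (F(93) still fits in 64 bits). */
static uint64_t fib_fast(unsigned int n)
{
    uint64_t a = 0, b = 1; /* a = F(0), b = F(1) */
    for (int i = 31; i >= 0; i--) {
        uint64_t t1 = a * (2 * b - a); /* F(2k)   */
        uint64_t t2 = a * a + b * b;   /* F(2k+1) */
        a = t1;
        b = t2;
        if (n & (1u << i)) { /* shift in the next bit of n */
            t1 = a + b;      /* advance one step to F(2k+2) */
            a = b;
            b = t1;
        }
    }
    return a;
}
```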
## Measuring performance
Set up the measurement environment following [fibdrv / hints on Linux performance analysis](https://hackmd.io/@sysprog/linux2022-fibdrv):
- Suppress [ASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization)
```shell
$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"
```
- Disable turbo mode
```shell
$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
```
- Set the scaling governor
```shell 
$ sudo bash -c "echo performance > /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor"
```
Performance is measured with `ktime`. Below are the results of running the naive and fast doubling versions on my bign implementation.

Even setting measurement noise aside, fast doubling is noticeably slower than the naive version, which shows that my bign implementation has a performance problem.
Further `ktime` measurements revealed that `bign_mul` was too slow and dragged down `bign_fib_fast`.
The figure below shows the results after optimizing `bign_mul`:

With the improved `bign_mul`, the measurements now match expectations.
## Measuring performance with statistics
Measure the performance of fib(0) to fib(1000).
I wrote a separate [`performance_statistic.c`](https://github.com/cwl0429/fibdrv/blob/master/performance_statistic.c):
```c 
int main()
{
    char write_buf[] = "testing writing";
    int offset = 1000;
    int test_cnt = 1000;
    int fib_tmp[TEST_CASE_CNT];
    int fib_fast_tmp[TEST_CASE_CNT];
    int fd = open(FIB_DEV, O_RDWR);
    if (fd < 0) {
        perror("Failed to open character device");
        exit(1);
    }
    for (int i = 0; i <= offset; i++) {
        for (int j = 0; j <= test_cnt; j++) {
            lseek(fd, i, SEEK_SET);
            int f0, f1;
            f0 = write(fd, write_buf, FIB_NAIVE);
            f1 = write(fd, write_buf, FIB_FAST_DOUBLING);
            fib_tmp[j] = f0;
            fib_fast_tmp[j] = f1;
        }
        
        float fib_sd = 0, fib_fast_sd = 0;
        int fib_mean = 0, fib_fast_mean = 0;
        /* calculate mean */
        for (int j = 0; j <= test_cnt; j++) {
            fib_mean += fib_tmp[j];
            fib_fast_mean += fib_fast_tmp[j];
        }
        fib_mean /= test_cnt;
        fib_fast_mean /= test_cnt;
        /* calculate sd: sum the squared deviations,
         * then take the root of the average */
        for (int j = 0; j <= test_cnt; j++) {
            fib_sd += (fib_tmp[j] - fib_mean) * (fib_tmp[j] - fib_mean);
            fib_fast_sd +=
                (fib_fast_tmp[j] - fib_fast_mean) * (fib_fast_tmp[j] - fib_fast_mean);
        }
        fib_sd = sqrt(fib_sd / test_cnt);
        fib_fast_sd = sqrt(fib_fast_sd / test_cnt);
        /* remove outliers */
        for (int j = 0; j <= test_cnt; j++) {
            if ((fib_tmp[j] > (fib_mean + 3 * fib_sd)) ||
                (fib_tmp[j] < (fib_mean - 3 * fib_sd))) {
                fib_tmp[j] = 0;
            }
            if ((fib_fast_tmp[j] > (fib_fast_mean + 3 * fib_fast_sd)) ||
                (fib_fast_tmp[j] < (fib_fast_mean - 3 * fib_fast_sd))) {
                fib_fast_tmp[j] = 0;
            }
        }
        /* calculate mean*/
        int fib_cnt = 0, fib_fast_cnt = 0;
        fib_mean = 0;
        fib_fast_mean = 0;
        for (int j = 0; j <= test_cnt; j++) {
            if (fib_tmp[j] != 0) {
                fib_mean += fib_tmp[j];
                fib_cnt++;
            }
            if (fib_fast_tmp[j] != 0) {
                fib_fast_mean += fib_fast_tmp[j];
                fib_fast_cnt++;
            }
        }
        fib_mean /= fib_cnt;
        fib_fast_mean /= fib_fast_cnt;
        printf("%d %d %d\n", i, fib_mean, fib_fast_mean);
    }
    close(fd);
    return 0;
}
```
The idea: run each `fib(n)` and `fib_fast(n)` 1000 times, compute the standard deviation of the samples, and drop values more than three standard deviations from the mean — i.e., keep roughly 99.7% of the data.

Finally, plot the mean of the remaining samples; the result is shown below:

Thanks to outlier removal, the jitter in the data is greatly reduced, but fluctuations still appear in the first 100 Fibonacci numbers; a likely cause is