---
tags: linux2022
---
# 2022q1 Homework3 (fibdrv)
contributed by < [`cwl0429`](https://github.com/cwl0429) >
```shell
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 142
Model name: Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz
Stepping: 11
CPU MHz: 1800.000
CPU max MHz: 3900.0000
CPU min MHz: 400.0000
BogoMIPS: 3600.00
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-7
```
## 自我檢查清單
- [x] 研讀上述 ==Linux 效能分析的提示== 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗
- [x] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
- [x] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise),思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
- [ ] 研讀 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU),指出針對大數運算,有哪些加速運算和縮減記憶體操作成本的舉措?
- [ ] `lsmod` 的輸出結果有一欄名為 `Used by`,這是 "each module's use count and a list of referring modules",但如何實作出來呢?模組間的相依性和實際使用次數 ([reference counting](https://en.wikipedia.org/wiki/Reference_counting)) 在 Linux 核心如何追蹤呢?
> 搭配閱讀 [The Linux driver implementer’s API guide » Driver Basics](https://www.kernel.org/doc/html/latest/driver-api/basics.html)
- [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 [POSIX Thread](https://en.wikipedia.org/wiki/POSIX_Threads) 的程式碼來確認。
## fibdrv 實作
### 使用 char 存放 Big Number
參考 [AdrianHuang/fibdrv](https://github.com/AdrianHuang/fibdrv/tree/big_number),讓數字以 char 的型態儲存
```c
typedef union {
/* allow strings up to 15 bytes to stay on the stack
* use the last byte as a null terminator and to store flags
* much like fbstring:
* https://github.com/facebook/folly/blob/master/folly/docs/FBString.md
*/
char data[16];
struct {
uint8_t filler[15],
/* how many free bytes in this stack allocated string
* same idea as fbstring
*/
space_left : 4,
/* if it is on heap, set to 1 */
is_ptr : 1, is_large_string : 1, flag2 : 1, flag3 : 1;
};
/* heap allocated */
struct {
char *ptr;
/* supports strings up to 2^MAX_STR_LEN_BITS - 1 bytes */
size_t size : MAX_STR_LEN_BITS,
/* capacity is always a power of 2 (unsigned)-1 */
capacity : 6;
/* the last 4 bits are important flags */
};
} xs;
```
接著研究一下如何將 char 型態的大數相加
觀察以下程式碼,發現在相加之前會需將 string 反轉,原因是 string 在反轉後 大數 `a` 及 `b` 的最低位數便可對齊,以下假設 a 為 1234 ; b 為 534
反轉前:
```graphviz
digraph{
node[shape = "record"]
n1[label = "1|2|3|4"]
n2[label = "5|3|4"]
}
```
反轉後:
```graphviz
digraph{
node[shape = "record"]
n1[label = "4|3|2|1"]
n2[label = "4|3|5"]
}
```
此時 a 及 b 陣列開頭存放的 char 就會是個位數
```c
static void string_number_add(xs *a, xs *b, xs *out)
{
char *data_a, *data_b;
size_t size_a, size_b;
int i, carry = 0;
int sum;
/*
* Make sure the string length of 'a' is always greater than
* the one of 'b'.
*/
if (xs_size(a) < xs_size(b))
__swap((void *) &a, (void *) &b, sizeof(void *));
data_a = xs_data(a);
data_b = xs_data(b);
size_a = xs_size(a);
size_b = xs_size(b);
reverse_str(data_a, size_a);
reverse_str(data_b, size_b);
char *buf = kmalloc(sizeof(char) * (size_a + 2), GFP_KERNEL);
for (i = 0; i < size_b; i++) {
sum = (data_a[i] - '0') + (data_b[i] - '0') + carry;
buf[i] = '0' + sum % 10;
carry = sum / 10;
}
for (i = size_b; i < size_a; i++) {
sum = (data_a[i] - '0') + carry;
buf[i] = '0' + sum % 10;
carry = sum / 10;
}
if (carry)
buf[i++] = '0' + carry;
buf[i] = 0;
reverse_str(buf, i);
/* Restore the original string */
reverse_str(data_a, size_a);
reverse_str(data_b, size_b);
if (out)
*out = *xs_tmp(buf);
}
```
在與老師討論後,發現以 char 存放數字的方式會有幾點問題
__浪費大量記憶體空間__
一個 char 的大小為 1 byte 也就是 8 個 bits,能表示的數字範圍為 `0 ~ 255`,而 char 能表示的數字範圍為 ``'0' ~ '9'`` 而此範圍能用 4 個 bits 表示,所以在每個 char 中都至少浪費一半的 bits
基於以上理由,我嘗試實作以 `int *` 的方式存放的大數
參考 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU) 的大數資料結構
```c
/**
* @brief store bignumber with dynamic size
* number[size - 1] is the msb
*/
typedef struct _BigN {
unsigned int *number;
unsigned int size;
/* sign is not use now */
int sign;
} BigN;
```
利用可動態配置的 `number` 存放大數 `size` 紀錄 `number` 的長度,而 `sign` 會紀錄大數的正負號但目前實作還沒用上。
為了實作 `fib` 及 `fib_fast_doubling`,我實作了以下函式
#### `bign_to_string`
參考 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU) 的方法,利用此函式將計算完畢的 fib number 從 kernel 端複製到 user 端,過程如下圖:
先假設 bign a 的 `a->number[0]` 為 `0...001011`, `a->size` 為 `1` , 對應的 `string` 長度為 `12`
> 以下圖示為了作圖方便,省略 number 的前 28 個數值為 0 的 bits
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|0|0|\\0"]
number[shape = plaintext]
string[shape = plaintext]
number -> n1
string -> s
}
```
從 `number` 的 msb 開始往右掃描,若當前位元為 `1` 則 `carry` 為 `1`,並更新整個 `string`,對每個位元做 `c[j] += c[j] - '0' + carry;` 若過程中發現 `c[j]` 會大於 `9` 則設置 `carry` 為 `1` 用以進位到 `c[j-1]` 否則設置為 `0`
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|0|<cur>0 + 0 + 1|'\\0'"]
current_bit[shape = plaintext]
j[shape = plaintext]
current_bit -> n1:1
j -> s:cur
}
```
接著重複執行上述動作
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|0|<cur>1 + 1 + 0|'\\0'"]
current_bit[shape = plaintext]
j[shape = plaintext]
current_bit -> n1:2
j -> s:cur
}
```
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|0|<cur>2 + 2 + 1|'\\0'"]
current_bit[shape = plaintext]
j[shape = plaintext]
current_bit -> n1:3
j -> s:cur
}
```
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|0|<cur>5 + 5 + 1|'\\0'"]
s_2[label ="0|0|0|0|0|0|0|0|0|<cur>0 + 0 + 1|1|'\\0'"]
current_bit[shape = plaintext]
j[shape = plaintext]
current_bit -> n1:4
j -> s:cur
s -> s_2
j_nxt -> s_2:cur
}
```
掃描完後,得到最後結果 `11`
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|1|1|\\0"]
number[shape = plaintext]
string[shape = plaintext]
number -> n1
string -> s
}
```
```c
static inline char *bign_to_string(bign *a)
{
/* change of base formula: log10(x) = log2(x) / log2(10) */
size_t len = (8 * sizeof(int) * a->size) / 3 + 2;
char *c = kmalloc(len, GFP_KERNEL);
char *p = c;
memset(c, '0', len - 1);
c[len - 1] = '\0';
/* transform bits into decimal string */
for (int i = a->size - 1; i >= 0; i--) {
for (unsigned int d = 1U << 31; d; d >>= 1) {
int carry = !!(d & a->number[i]);
for (int j = len - 2; j >= 0; j--) {
c[j] += c[j] - '0' + carry;
carry = (c[j] > '9');
if (carry)
c[j] -= 10;
}
}
}
while (*p == '0' && *(p + 1) != '\0')
p++;
if (a->sign)
*(--p) = '-';
memmove(c, p, strlen(p) + 1);
return c;
}
```
#### `bign_add` 及 `bign_sub`
- `add_BigN`
- 從 `number[i]` 的 lsb 開始計算,將 `a->number[i], b->number[i]` 及 `carry` 加至 `c->number[i]` ,並在過程利用巨集 `DETECT_OVERFLOW` 偵測是否有溢位發生,若是溢位便將 `carry` 設置為 `1` 否則為 `0`
- `sub_BigN`
- `sub_BigN` 和 `add_BigN` 想法類似,主要差別為偵測溢位的方式
```c
static inline void bign_add(bign *a, bign *b, bign *c)
{
bign_resize(c, a->size + 1);
unsigned int carry = 0;
for (int i = 0; i < c->size; i++) {
unsigned int tmp_a = (i < a->size) ? a->number[i] : 0;
unsigned int tmp_b = (i < b->size) ? b->number[i] : 0;
c->number[i] = tmp_a + tmp_b + carry;
carry = DETECT_OVERFLOW(tmp_a, tmp_b + carry);
}
if (!c->number[c->size - 1] && c->size > 1)
bign_resize(c, c->size - 1);
}
static inline void bign_sub(bign *a, bign *b, bign *c)
{
bign_resize(c, a->size);
unsigned int carry = 0;
for (int i = 0; i < c->size; i++) {
long tmp_a = (i < a->size) ? (long) a->number[i] : 0;
long tmp_b = (i < b->size) ? (long) b->number[i] : 0;
c->number[i] = tmp_a - tmp_b - carry;
carry = DETECT_OVERFLOW_SUB(tmp_a - carry, tmp_b);
}
if (!c->number[c->size - 1] && c->size > 1)
bign_resize(c, c->size - 1);
}
```
#### `bign_mul`
令 `a` 為被乘數 `b` 為乘數,之後便用乘法直式的過程計算
需要注意的地方,在 `mul_BigN` 中我使用了 [`__fls`](https://github.com/torvalds/linux/blob/master/include/asm-generic/bitops/__fls.h) 來省略乘數中的 leading zeros ,避免不必要的計算
> __fls - find last (most-significant) set bit in a long word
```c
static inline void bign_mul(bign *a, bign *b, bign *c)
{
bign *a_tmp = bign_alloc(1);
bign_cpy(a, a_tmp);
if ((!a->number[0] && (a->size == 1)) ||
(!b->number[0] && (b->size == 1))) {
/* prevent a == c or b == c */
bign_resize(c, 1);
c->number[0] = 0;
// printk("0 appear");
return;
}
bign_resize(c, 1);
c->number[0] = 0;
/* then resize to limit range of a * b */
bign_resize(c, a->size + b->size);
for (int i = 0; i < b->size; i++) {
if (!b->number[i])
continue;
unsigned int clz = __fls(b->number[i]) + 1;
unsigned int shift_cnt = 32 - clz;
for (unsigned int d = 1U; clz; d <<= 1) {
if (!!(d & b->number[i])) {
bign_add(c, a_tmp, c);
}
bign_shl(a_tmp, 1);
clz--;
}
bign_shl(a_tmp, shift_cnt);
}
int c_size = 0;
for (c_size = c->size - 1; !c->number[c_size];)
c_size--;
if (c->size > 1)
bign_resize(c, c_size + 1);
}
```
__更新 `bign_mul`__
- 假設有三個 bign a, b, c
- c = a * b
- c 為最終儲存的 bign
- 令 a->size 為 A, b->size 為 B
原先:
從乘數的 lsb 開始往左掃描 bits,若 bit 為 1 則將被乘數加入 c 並在每回合結束時將乘數左移一格
需花費 (由於兩版本實作主要差別為兩層 for 內部,故只計算迴圈內的執行時間)
- $B * (A * 32 * (bign\_add\_time + bign\_shl\_time)) = 32A * B (bign\_add\_time + bign\_shl\_time)$
其中 `bign_add` 會隨著資料大小而改變運算時間
後來:
參考 [KYG-yaya573142](https://hackmd.io/@KYWeng/rkGdultSU) 的 `mul` 實作方式
將乘數的 `a->number[0~size]` 分別和被乘數的 `b->number[0~size]` 兩兩相乘即可
需花費
- $B * (A * (unsigned int * unsigned int\_exc\_time))$
其中 unsigned int * unsigned int 可視為常數時間,不會隨著資料大小而改變運算時間
可以發現後來的實作方式因避免使用 `bign_add` 而改善效能
```c
static inline void bign_mul(bign *a, bign *b, bign *c)
{
/* prevent a == c or b == c */
bign *a_tmp = bign_alloc(1);
bign_cpy(a, a_tmp);
/* handle multiplier or multiplicand are zero */
if ((!a->number[0] && (a->size == 1)) ||
(!b->number[0] && (b->size == 1))) {
bign_resize(c, 1);
c->number[0] = 0;
return;
}
bign_resize(c, 1);
c->number[0] = 0;
/* then resize to limit range of a * b */
bign_resize(c, a->size + b->size);
for (int i = 0; i < b->size; i++) {
if (!b->number[i])
continue;
for (int j = 0; j < a_tmp->size; j++) {
unsigned long long carry;
carry = (unsigned long long) a_tmp->number[j] * b->number[i];
bign_mul_add(carry, i + j, c);
}
}
int c_size = 0;
for (c_size = c->size - 1; !c->number[c_size];)
c_size--;
if (c->size > 1)
bign_resize(c, c_size + 1);
}
```
最後將 `bign_fib` 及 `bign_fib_fast` 實作
```c
static inline void bign_fib(bign *dest, int fn)
{
bign_resize(dest, 1);
if (fn < 2) {
dest->number[0] = fn;
return;
}
bign *a = bign_alloc(1);
bign *b = bign_alloc(1);
a->number[0] = 0;
b->number[0] = 1;
for (int i = 2; i <= fn; i++) {
bign_add(a, b, dest);
bign_cpy(dest, a);
bign_swap(a, b);
}
bign_free(a);
bign_free(b);
}
static inline void bign_fib_fast(bign *dest, const int fn)
{
bign_resize(dest, 1);
if (fn < 2) {
dest->number[0] = fn;
return;
}
bign *a = bign_alloc(1);
bign *b = bign_alloc(1);
bign *tmp = bign_alloc(1);
a->number[0] = 0;
b->number[0] = 1;
/* find leftmost bit position equal to 1*/
int fn_fls = __fls(fn);
for (unsigned int d = 1U << fn_fls; d > 0; d >>= 1) {
/* F(2k) */
bign_cpy(b, dest);
bign_shl(dest, 1);
bign_sub(dest, a, dest);
bign_mul(dest, a, dest);
/* F(2k + 1) */
bign_mul(a, a, tmp);
bign_mul(b, b, a);
bign_add(a, tmp, tmp);
bign_cpy(dest, a);
bign_cpy(tmp, b);
if (!!(d & fn)) {
bign_add(a, b, a);
bign_swap(a, b);
bign_cpy(a, dest);
}
}
bign_free(a);
bign_free(b);
bign_free(tmp);
}
```
:::warning
回去翻閱 [lab0](https://hackmd.io/@sysprog/linux2022-lab0) 作業說明的「變數和函式命名力求簡潔且精準」,上述程式碼在函式名稱中混用大小寫,對理解行為沒有幫助。
:notes: jserv
:::
> 已改正 coding style
## 測量效能
根據 [fibdrv/linux 效能分析的提示](https://hackmd.io/@sysprog/linux2022-fibdrv) 設定測量環境
- 抑制 [ASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization)
```shell
$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"
```
- 關閉 turbo mode
```shell
$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
```
- 設定 scaling_governor
```shell
sudo bash -c "echo performance > /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor"
```
利用 `ktime` 來測量效能,以下為使用自己實作的 bign 執行 naive 及 fast doubing 的結果
![](https://i.imgur.com/uM4SlOW.png)
先撇除影響效能測量的因素,可以發現 fast doubing 比起 naive 要慢上不少,顯示了我實作的 bign 有效能問題
經由 `ktime` 測量後,發現是 `bign_mul` 執行速度過慢,因此拖垮了 `bign_fib_fast` 的執行速度
下圖為優化了 `bign_mul` 的效能後的結果
![](https://i.imgur.com/n24clAm.png)
可以發現在改善 `bign_mul` 效能後,有正確產生預期結果
## 使用統計手法測量效能
測量 fib(0)~fib(1000) 效能
額外寫一個 [`performance_stattistic.c`](https://github.com/cwl0429/fibdrv/blob/master/performance_statistic.c)
```c
int main()
{
char write_buf[] = "testing writing";
int offset = 1000;
int test_cnt = 1000;
int fib_tmp[TEST_CASE_CNT];
int fib_fast_tmp[TEST_CASE_CNT];
int fd = open(FIB_DEV, O_RDWR);
if (fd < 0) {
perror("Failed to open character device");
exit(1);
}
for (int i = 0; i <= offset; i++) {
for (int j = 0; j <= test_cnt; j++) {
lseek(fd, i, SEEK_SET);
int f0, f1;
f0 = write(fd, write_buf, FIB_NAIVE);
f1 = write(fd, write_buf, FIB_FAST_DOUBLING);
fib_tmp[j] = f0;
fib_fast_tmp[j] = f1;
}
float fib_sd = 0, fib_fast_sd = 0;
int fib_mean = 0, fib_fast_mean = 0;
/* calculate mean */
for (int j = 0; j <= test_cnt; j++) {
fib_mean += fib_tmp[j];
fib_fast_mean += fib_fast_tmp[j];
}
fib_mean /= test_cnt;
fib_fast_mean /= test_cnt;
/* calculate sd */
for (int j = 0; j <= test_cnt; j++) {
fib_sd += (fib_tmp[j] - fib_mean);
fib_fast_sd += (fib_fast_tmp[j] - fib_fast_mean);
}
fib_sd = sqrt(fib_sd);
fib_fast_sd = sqrt(fib_fast_sd);
/* remove outliner */
for (int j = 0; j <= test_cnt; j++) {
if ((fib_tmp[j] > (fib_mean + 3 * fib_sd)) ||
(fib_tmp[j] < (fib_mean - 3 * fib_sd))) {
fib_tmp[j] = 0;
}
if ((fib_fast_tmp[j] > (fib_fast_mean + 3 * fib_fast_sd)) ||
(fib_fast_tmp[j] < (fib_fast_mean - 3 * fib_fast_sd))) {
fib_fast_tmp[j] = 0;
}
}
/* calculate mean*/
int fib_cnt = 0, fib_fast_cnt = 0;
fib_mean = 0;
fib_fast_mean = 0;
for (int j = 0; j <= test_cnt; j++) {
if (fib_tmp[j] != 0) {
fib_mean += fib_tmp[j];
fib_cnt++;
}
if (fib_fast_tmp[j] != 0) {
fib_fast_mean += fib_fast_tmp[j];
fib_fast_cnt++;
}
}
fib_mean /= fib_cnt;
fib_fast_mean /= fib_fast_cnt;
printf("%d %d %d\n", i, fib_mean, fib_fast_mean);
}
close(fd);
return 0;
}
```
想法是,將每個 `fib(n)` 及 `fib_fast(n)` 分別執行 1000 次,得到數據後,算出其標準差,將超過三個標準差的離群值去除,也就是取 99.7% 的資料
![](https://i.imgur.com/EJ3pVDs.png)
最後取去掉離群值後的平均值作圖,結果如下圖:
![](https://i.imgur.com/5xBGXnY.png)
因為去除離群值的緣故,可以看出數據抖動的狀況大幅減少,但是在前 100 個費氏數仍有抖動的情況發生,推測原因是