---
tags: linux2022
---
# 2022q1 Homework3 (fibdrv)
contributed by < [`cwl0429`](https://github.com/cwl0429) >
```shell 
$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           142
Model name:                      Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz
Stepping:                        11
CPU MHz:                         1800.000
CPU max MHz:                     3900.0000
CPU min MHz:                     400.0000
BogoMIPS:                        3600.00
Virtualization:                  VT-x
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        6 MiB
NUMA node0 CPU(s):               0-7
```
## Self-check list
- [x] Read the ==hints on Linux performance analysis== above, run GNU/Linux on a physical machine, and complete the required setup $\to$ this also explains why experiments should not be run inside a virtual machine
- [x] Read the Fibonacci-related material above (including the papers), summarize the key techniques, and consider how instructions such as [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) help with Fibonacci computation; list and explain the key code
- [x] Review C [numerics](https://hackmd.io/@sysprog/c-numerics) and [bitwise operation](https://hackmd.io/@sysprog/c-bitwise), and consider how implementations of fast Fibonacci algorithms can reduce the cost of multiplication
- [ ] Read [KYG-yaya573142's report](https://hackmd.io/@KYWeng/rkGdultSU) and identify the measures it takes to speed up big-number arithmetic and reduce the cost of memory operations
- [ ] `lsmod` output has a column named `Used by`, described as "each module's use count and a list of referring modules" — but how is this implemented? How does the Linux kernel track inter-module dependencies and actual use counts ([reference counting](https://en.wikipedia.org/wiki/Reference_counting))?
    > Read together with [The Linux driver implementer’s API guide » Driver Basics](https://www.kernel.org/doc/html/latest/driver-api/basics.html)
- [ ] Note that `fibdrv.c` contains `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, and `mutex_destroy` — in what scenarios are these needed? Write a multithreaded userspace program to observe what actually goes wrong when a kernel module omits the mutex, and confirm it with code using [POSIX Thread](https://en.wikipedia.org/wiki/POSIX_Threads)s.
## fibdrv implementation
### Storing big numbers as char
Following [AdrianHuang/fibdrv](https://github.com/AdrianHuang/fibdrv/tree/big_number), numbers are stored as `char` strings:
```c 
typedef union {
    /* allow strings up to 15 bytes to stay on the stack
     * use the last byte as a null terminator and to store flags
     * much like fbstring:
     * https://github.com/facebook/folly/blob/master/folly/docs/FBString.md
     */
    char data[16];
    struct {
        uint8_t filler[15],
            /* how many free bytes in this stack allocated string
             * same idea as fbstring
             */
            space_left : 4,
            /* if it is on heap, set to 1 */
            is_ptr : 1, is_large_string : 1, flag2 : 1, flag3 : 1;
    };
    /* heap allocated */
    struct {
        char *ptr;
        /* supports strings up to 2^MAX_STR_LEN_BITS - 1 bytes */
        size_t size : MAX_STR_LEN_BITS,
                      /* capacity is always a power of 2 (unsigned)-1 */
                      capacity : 6;
        /* the last 4 bits are important flags */
    };
} xs;
```
Next, how are two big numbers stored as `char` strings added?
In the code below, both strings are reversed before the addition: reversing aligns the least significant digits of `a` and `b`. Suppose a is 1234 and b is 534.
Before reversing:
```graphviz
digraph{
node[shape = "record"]
n1[label = "1|2|3|4"]
n2[label = "5|3|4"]
}
```
After reversing:
```graphviz
digraph{
node[shape = "record"]
n1[label = "4|3|2|1"]
n2[label = "4|3|5"]
}
```
Now the first `char` of each array holds the ones digit.
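`reverse_str` is called in `string_number_add` below but not shown in the excerpt; a minimal userspace sketch of an in-place reversal with the same (buffer, length) interface might look like:

```c
#include <stddef.h>

/* Reverse the first n bytes of s in place by swapping from both ends. */
static void reverse_str(char *s, size_t n)
{
    if (n < 2)
        return;
    for (size_t i = 0, j = n - 1; i < j; i++, j--) {
        char tmp = s[i];
        s[i] = s[j];
        s[j] = tmp;
    }
}
```

For example, reversing `"1234"` in place yields `"4321"`, placing the ones digit at the front of the buffer.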
```c 
static void string_number_add(xs *a, xs *b, xs *out)
{
    char *data_a, *data_b;
    size_t size_a, size_b;
    int i, carry = 0;
    int sum;
    /*
     * Make sure the string length of 'a' is always greater than
     * the one of 'b'.
     */
    if (xs_size(a) < xs_size(b))
        __swap((void *) &a, (void *) &b, sizeof(void *));
    data_a = xs_data(a);
    data_b = xs_data(b);
    size_a = xs_size(a);
    size_b = xs_size(b);
    reverse_str(data_a, size_a);
    reverse_str(data_b, size_b);
    char *buf = kmalloc(sizeof(char) * (size_a + 2), GFP_KERNEL);
    for (i = 0; i < size_b; i++) {
        sum = (data_a[i] - '0') + (data_b[i] - '0') + carry;
        buf[i] = '0' + sum % 10;
        carry = sum / 10;
    }
    for (i = size_b; i < size_a; i++) {
        sum = (data_a[i] - '0') + carry;
        buf[i] = '0' + sum % 10;
        carry = sum / 10;
    }
    if (carry)
        buf[i++] = '0' + carry;
    buf[i] = 0;
    reverse_str(buf, i);
    /* Restore the original string */
    reverse_str(data_a, size_a);
    reverse_str(data_b, size_b);
    if (out)
        *out = *xs_tmp(buf);
}
```
After discussing with the instructor, several problems with storing digits as `char` emerged.
__It wastes a large amount of memory__
A `char` is 1 byte (8 bits) and can represent values from `0` to `255`, yet each `char` here only ever holds `'0'` to `'9'`, a range that fits in 4 bits — so at least half of every byte is wasted.
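To make the waste concrete: since a decimal digit needs only 4 bits, two digits fit in one byte (packed BCD). This is only an illustration of the space argument, not code from fibdrv:

```c
#include <stdint.h>

/* Pack two decimal digits (0-9) into a single byte, 4 bits each. */
static uint8_t pack_digits(unsigned hi, unsigned lo)
{
    return (uint8_t) ((hi << 4) | lo);
}

static unsigned unpack_hi(uint8_t b) { return b >> 4; }
static unsigned unpack_lo(uint8_t b) { return b & 0xF; }
```

Storing "534" as chars takes 3 bytes; packed this way, the same digits fit in 2 bytes with one nibble to spare.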
For this reason, I reimplemented the big number on top of an `unsigned int *` array, following the data structure in [KYG-yaya573142's report](https://hackmd.io/@KYWeng/rkGdultSU):
```c 
/**
 * @brief store bignumber with dynamic size
 * number[size - 1] is the msb
 */
typedef struct _BigN {
    unsigned int *number;
    unsigned int size;
    /* sign is not used yet */
    int sign;
} bign;
```
The dynamically allocated `number` array stores the magnitude, `size` records the length of `number` in words, and `sign` is meant to record the sign of the number, though the current implementation does not use it yet.
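`bign_resize` is used throughout the code below, but its body is not shown. A userspace sketch of what it presumably does — grow or shrink `number` and zero any newly added words (the kernel version would use `krealloc` with `GFP_KERNEL` instead of `realloc`):

```c
#include <stdlib.h>
#include <string.h>

typedef struct _BigN {
    unsigned int *number; /* number[size - 1] is the msb word */
    unsigned int size;
    int sign;
} bign;

/* Userspace sketch: resize number[] to hold `size` words, zeroing
 * any words added beyond the old length. Returns 0 on success. */
static int bign_resize(bign *a, unsigned int size)
{
    unsigned int *p = realloc(a->number, size * sizeof(unsigned int));
    if (!p)
        return -1;
    if (size > a->size)
        memset(p + a->size, 0, (size - a->size) * sizeof(unsigned int));
    a->number = p;
    a->size = size;
    return 0;
}
```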
To implement `fib` and `fib_fast_doubling`, I implemented the following functions.
#### `bign_to_string`
Following the method in [KYG-yaya573142's report](https://hackmd.io/@KYWeng/rkGdultSU), this function turns a finished Fibonacci number into a decimal string so it can be copied from kernel space to user space. The process is illustrated below.
Suppose `a->number[0]` is `0...001011` and `a->size` is `1`; the corresponding `string` has length `12`.
> For readability, the figures omit the leading 28 zero-valued bits of `number`.
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|0|0|\\0"]
number[shape = plaintext]
string[shape = plaintext]
number -> n1
string -> s
}
```
Scan `number` from the msb rightward. The current bit becomes the incoming `carry`, and the whole `string` is updated by applying `c[j] += c[j] - '0' + carry;` to every decimal digit (i.e., digit = 2 × digit + carry). Whenever `c[j]` exceeds `'9'`, subtract 10 and set `carry` to `1` so it propagates into `c[j-1]`; otherwise set it to `0`.
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|0|<cur>0 + 0 + 1|'\\0'"]
current_bit[shape = plaintext]
j[shape = plaintext]
current_bit -> n1:1
j -> s:cur
}
```
Repeat the same step:
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|0|<cur>1 + 1 + 0|'\\0'"]
current_bit[shape = plaintext]
j[shape = plaintext]
current_bit -> n1:2
j -> s:cur
}
```
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|0|<cur>2 + 2 + 1|'\\0'"]
current_bit[shape = plaintext]
j[shape = plaintext]
current_bit -> n1:3
j -> s:cur
}
```
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|0|<cur>5 + 5 + 1|'\\0'"]
s_2[label ="0|0|0|0|0|0|0|0|0|<cur>0 + 0 + 1|1|'\\0'"]
current_bit[shape = plaintext]
j[shape = plaintext]
current_bit -> n1:4
j -> s:cur
s -> s_2
j_nxt -> s_2:cur
}
```
After the scan finishes, the final result is `11`:
```graphviz
digraph{
node[shape = "record"]
n1[label = "<1>1|<2>0|<3>1|<4>1"]
s[label ="0|0|0|0|0|0|0|0|0|1|1|\\0"]
number[shape = plaintext]
string[shape = plaintext]
number -> n1
string -> s
}
```
```c 
static inline char *bign_to_string(bign *a)
{
    /* change of base formula: log10(x) = log2(x) / log2(10) */
    size_t len = (8 * sizeof(int) * a->size) / 3 + 2;
    char *c = kmalloc(len, GFP_KERNEL);
    char *p = c;
    memset(c, '0', len - 1);
    c[len - 1] = '\0';
    /* transform bits into decimal string */
    for (int i = a->size - 1; i >= 0; i--) {
        for (unsigned int d = 1U << 31; d; d >>= 1) {
            int carry = !!(d & a->number[i]);
            for (int j = len - 2; j >= 0; j--) {
                c[j] += c[j] - '0' + carry;
                carry = (c[j] > '9');
                if (carry)
                    c[j] -= 10;
            }
        }
    }
    while (*p == '0' && *(p + 1) != '\0')
        p++;
    if (a->sign)
        *(--p) = '-';
    memmove(c, p, strlen(p) + 1);
    return c;
}
```
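The same bit-by-bit conversion can be exercised in userspace for a single 32-bit word; a sketch (the helper name and buffer handling are mine, not from fibdrv):

```c
#include <string.h>

/* Convert v to a decimal string in buf (len >= 12): for each bit from the
 * MSB down, double every decimal digit and add the incoming bit as carry.
 * Returns a pointer past the leading zeros. */
static char *bits_to_decimal(unsigned int v, char *buf, size_t len)
{
    memset(buf, '0', len - 1);
    buf[len - 1] = '\0';
    for (unsigned int d = 1U << 31; d; d >>= 1) {
        int carry = !!(d & v);
        for (int j = (int) len - 2; j >= 0; j--) {
            buf[j] += buf[j] - '0' + carry; /* digit = 2 * digit + carry */
            carry = buf[j] > '9';
            if (carry)
                buf[j] -= 10;
        }
    }
    char *p = buf;
    while (*p == '0' && p[1] != '\0')
        p++;
    return p;
}
```

Converting `0b1011` yields `"11"`, matching the walkthrough above.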
#### `bign_add` and `bign_sub`
- `bign_add`
    - Starting from the lsb word of `number`, add `a->number[i]`, `b->number[i]`, and `carry` into `c->number[i]`, using the `DETECT_OVERFLOW` macro to detect wraparound along the way; when it occurs, `carry` is set to `1`, otherwise `0`.
- `bign_sub`
    - Similar in spirit to `bign_add`; the main difference is how the borrow is detected.
```c 
static inline void bign_add(bign *a, bign *b, bign *c)
{
    bign_resize(c, a->size + 1);
    unsigned int carry = 0;
    for (int i = 0; i < c->size; i++) {
        unsigned int tmp_a = (i < a->size) ? a->number[i] : 0;
        unsigned int tmp_b = (i < b->size) ? b->number[i] : 0;
        c->number[i] = tmp_a + tmp_b + carry;
        carry = DETECT_OVERFLOW(tmp_a, tmp_b + carry);
    }
    if (!c->number[c->size - 1] && c->size > 1)
        bign_resize(c, c->size - 1);
}
static inline void bign_sub(bign *a, bign *b, bign *c)
{
    bign_resize(c, a->size);
    unsigned int carry = 0;
    for (int i = 0; i < c->size; i++) {
        long tmp_a = (i < a->size) ? (long) a->number[i] : 0;
        long tmp_b = (i < b->size) ? (long) b->number[i] : 0;
        c->number[i] = tmp_a - tmp_b - carry;
        carry = DETECT_OVERFLOW_SUB(tmp_a - carry, tmp_b);
    }
    if (!c->number[c->size - 1] && c->size > 1)
        bign_resize(c, c->size - 1);
}
```
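`DETECT_OVERFLOW` and `DETECT_OVERFLOW_SUB` are referenced above but not shown in the excerpt. Plausible definitions for unsigned 32-bit words might be the following — these are my guesses, not the actual macros:

```c
#include <stdint.h>

/* a + b wraps around 2^32 iff the truncated sum is smaller than a.
 * Caveat: the call site passes (tmp_b + carry) as the second argument,
 * which can itself wrap when tmp_b is UINT_MAX and carry is 1. */
#define DETECT_OVERFLOW(a, b) ((uint32_t) ((a) + (b)) < (uint32_t) (a))

/* a - b borrows iff b is larger than a. */
#define DETECT_OVERFLOW_SUB(a, b) ((a) < (b))
```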
#### `bign_mul`
Let `a` be the multiplicand and `b` the multiplier; the product is then computed as in long multiplication.
One detail worth noting: `bign_mul` uses [`__fls`](https://github.com/torvalds/linux/blob/master/include/asm-generic/bitops/__fls.h) to skip the leading zeros of each multiplier word and avoid unnecessary work.
> __fls - find last (most-significant) set bit in a long word 
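Outside the kernel, the effect of `__fls` on a nonzero 32-bit word can be reproduced with the GCC/Clang builtin `__builtin_clz`, which makes it easy to check the bit indices used below:

```c
/* 0-based index of the most significant set bit of a nonzero 32-bit x,
 * mirroring the kernel's __fls for unsigned int. Undefined for x == 0,
 * just like __builtin_clz(0). */
static unsigned int fls_index(unsigned int x)
{
    return 31u - (unsigned int) __builtin_clz(x);
}
```

For instance, `fls_index(0b1011)` is 3, so bits 4 through 31 of that word are leading zeros the multiplication loop can skip.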
```c 
static inline void bign_mul(bign *a, bign *b, bign *c)
{
    bign *a_tmp = bign_alloc(1);
    bign_cpy(a, a_tmp);
    if ((!a->number[0] && (a->size == 1)) ||
        (!b->number[0] && (b->size == 1))) {
        /* multiplier or multiplicand is zero */
        bign_resize(c, 1);
        c->number[0] = 0;
        bign_free(a_tmp); /* avoid leaking the working copy */
        return;
    }
    bign_resize(c, 1);
    c->number[0] = 0;
    /* then resize to limit range of a * b */
    bign_resize(c, a->size + b->size);
    for (int i = 0; i < b->size; i++) {
        if (!b->number[i])
            continue;
        unsigned int clz = __fls(b->number[i]) + 1;
        unsigned int shift_cnt = 32 - clz;
        for (unsigned int d = 1U; clz; d <<= 1) {
            if (!!(d & b->number[i])) {
                bign_add(c, a_tmp, c);
            }
            bign_shl(a_tmp, 1);
            clz--;
        }
        bign_shl(a_tmp, shift_cnt);
    }
    bign_free(a_tmp); /* release the working copy */
    int c_size = 0;
    for (c_size = c->size - 1; !c->number[c_size];)
        c_size--;
    if (c->size > 1)
        bign_resize(c, c_size + 1);
}
```
__Updating `bign_mul`__
- Assume three bign values a, b, c
- c = a * b
- c stores the final result
- Let A be a->size and B be b->size

Before:
Scan the multiplier's bits from the lsb toward the msb; whenever a bit is 1, add the multiplicand into c, and shift the multiplicand left by one position at the end of each round.
Cost (the two versions differ mainly inside the nested loops, so only the loop bodies are counted):
- $B \times A \times 32 \times (t_{\text{bign\_add}} + t_{\text{bign\_shl}}) = 32AB \, (t_{\text{bign\_add}} + t_{\text{bign\_shl}})$

where the running time of `bign_add` itself grows with the size of the operands.

After:
Following the `mul` implementation in [KYG-yaya573142's report](https://hackmd.io/@KYWeng/rkGdultSU), every word of the multiplicand `a->number[0..size-1]` is multiplied pairwise with every word of the multiplier `b->number[0..size-1]`.
Cost:
- $B \times A \times t_{\text{mul32}}$

where $t_{\text{mul32}}$, the cost of one `unsigned int` by `unsigned int` multiply, is constant and does not grow with the operand size.
The later implementation performs better because it avoids the repeated `bign_add` calls.
```c 
static inline void bign_mul(bign *a, bign *b, bign *c)
{   
    /* prevent a == c or b == c */
    bign *a_tmp = bign_alloc(1);
    bign_cpy(a, a_tmp);
    /* handle zero multiplier or multiplicand */
    if ((!a->number[0] && (a->size == 1)) ||
        (!b->number[0] && (b->size == 1))) {
        bign_resize(c, 1);
        c->number[0] = 0;
        bign_free(a_tmp); /* avoid leaking the working copy */
        return;
    }
    bign_resize(c, 1);
    c->number[0] = 0;
    /* then resize to limit range of a * b */
    bign_resize(c, a->size + b->size);
    for (int i = 0; i < b->size; i++) {
        if (!b->number[i])
            continue;
        for (int j = 0; j < a_tmp->size; j++) {
            unsigned long long carry;
            carry = (unsigned long long) a_tmp->number[j] * b->number[i];
            bign_mul_add(carry, i + j, c);
        }
    }
    bign_free(a_tmp); /* release the working copy */
    int c_size = 0;
    for (c_size = c->size - 1; !c->number[c_size];)
        c_size--;
    if (c->size > 1)
        bign_resize(c, c_size + 1);
}
```
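`bign_mul_add` is referenced above but not listed; the idea is to add the 64-bit partial product into the result at word offset `i + j` and ripple the carry upward. A userspace sketch over a raw word array (the name and interface are my own):

```c
#include <stdint.h>

/* Add the 64-bit partial product into the 32-bit word array c[] starting
 * at word `offset`, rippling the carry upward. n is the array length. */
static void mul_add_64(uint32_t *c, int n, int offset, uint64_t product)
{
    uint64_t carry = product;
    for (int i = offset; i < n && carry; i++) {
        carry += c[i];           /* add the existing word */
        c[i] = (uint32_t) carry; /* keep the low 32 bits */
        carry >>= 32;            /* propagate the rest upward */
    }
}
```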
Finally, implement `bign_fib` and `bign_fib_fast`:
```c 
static inline void bign_fib(bign *dest, int fn)
{
    bign_resize(dest, 1);
    if (fn < 2) {
        dest->number[0] = fn;
        return;
    }
    bign *a = bign_alloc(1);
    bign *b = bign_alloc(1);
    a->number[0] = 0;
    b->number[0] = 1;
    for (int i = 2; i <= fn; i++) {
        bign_add(a, b, dest);
        bign_cpy(dest, a);
        bign_swap(a, b);
    }
    bign_free(a);
    bign_free(b);
}
static inline void bign_fib_fast(bign *dest, const int fn)
{
    bign_resize(dest, 1);
    if (fn < 2) {
        dest->number[0] = fn;
        return;
    }
    bign *a = bign_alloc(1);
    bign *b = bign_alloc(1);
    bign *tmp = bign_alloc(1);
    a->number[0] = 0;
    b->number[0] = 1;
    /* find leftmost bit position equal to 1*/
    int fn_fls = __fls(fn);
    for (unsigned int d = 1U << fn_fls; d > 0; d >>= 1) {        
        /* F(2k) */
        bign_cpy(b, dest);
        bign_shl(dest, 1);
        bign_sub(dest, a, dest);
        bign_mul(dest, a, dest);
        /* F(2k + 1) */
        bign_mul(a, a, tmp);
        bign_mul(b, b, a);
        bign_add(a, tmp, tmp);
        bign_cpy(dest, a);
        bign_cpy(tmp, b);
        if (!!(d & fn)) {
            bign_add(a, b, a);
            bign_swap(a, b);
            bign_cpy(a, dest);
        }
    }
    
    bign_free(a);
    bign_free(b);
    bign_free(tmp);
}
```
:::warning
Revisit the section "keep variable and function names concise and precise" in the [lab0](https://hackmd.io/@sysprog/linux2022-lab0) assignment notes — the code above mixes upper and lower case in its function names, which does not help readability.
:notes: jserv
:::
> The coding style has been corrected.
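Before debugging the bign version, it helps to confirm the doubling identities $F(2k) = F(k)\,(2F(k+1) - F(k))$ and $F(2k+1) = F(k)^2 + F(k+1)^2$ with plain 64-bit integers (valid up to F(93)); a self-contained sketch, not the driver code itself:

```c
#include <stdint.h>

/* Fast-doubling Fibonacci for n <= 93 (F(93) still fits in 64 bits). */
static uint64_t fib_fast(unsigned int n)
{
    uint64_t a = 0, b = 1; /* a = F(0), b = F(1) */
    for (int i = 31; i >= 0; i--) {
        uint64_t t1 = a * (2 * b - a); /* F(2k)   */
        uint64_t t2 = a * a + b * b;   /* F(2k+1) */
        a = t1;
        b = t2;
        if (n & (1u << i)) { /* shift in the next bit of n */
            t1 = a + b;      /* advance one step to F(2k+2) */
            a = b;
            b = t1;
        }
    }
    return a;
}
```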
## Measuring performance
Set up the measurement environment following [fibdrv / hints on Linux performance analysis](https://hackmd.io/@sysprog/linux2022-fibdrv):
- Suppress [ASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization)
```shell
$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"
```
- Disable turbo mode
```shell
$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
```
- Set the scaling governor
```shell 
$ sudo bash -c "echo performance > /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor"
```
Performance is measured with `ktime`. Below are the results of running the naive and fast doubling versions on my bign implementation.

Even setting measurement noise aside, fast doubling is noticeably slower than the naive version, which shows that my bign implementation has a performance problem.
Further `ktime` measurements revealed that `bign_mul` was too slow and dragged down `bign_fib_fast`.
The figure below shows the results after optimizing `bign_mul`:

With the improved `bign_mul`, the measurements now match expectations.
## Measuring performance with statistics
Measure the performance of fib(0) to fib(1000).
I wrote a separate [`performance_statistic.c`](https://github.com/cwl0429/fibdrv/blob/master/performance_statistic.c):
```c 
int main()
{
    char write_buf[] = "testing writing";
    int offset = 1000;
    int test_cnt = 1000;
    int fib_tmp[TEST_CASE_CNT];
    int fib_fast_tmp[TEST_CASE_CNT];
    int fd = open(FIB_DEV, O_RDWR);
    if (fd < 0) {
        perror("Failed to open character device");
        exit(1);
    }
    for (int i = 0; i <= offset; i++) {
        for (int j = 0; j <= test_cnt; j++) {
            lseek(fd, i, SEEK_SET);
            int f0, f1;
            f0 = write(fd, write_buf, FIB_NAIVE);
            f1 = write(fd, write_buf, FIB_FAST_DOUBLING);
            fib_tmp[j] = f0;
            fib_fast_tmp[j] = f1;
        }
        
        float fib_sd = 0, fib_fast_sd = 0;
        int fib_mean = 0, fib_fast_mean = 0;
        /* calculate mean */
        for (int j = 0; j <= test_cnt; j++) {
            fib_mean += fib_tmp[j];
            fib_fast_mean += fib_fast_tmp[j];
        }
        fib_mean /= test_cnt;
        fib_fast_mean /= test_cnt;
        /* calculate sd: sum the squared deviations,
         * then take the root of the average */
        for (int j = 0; j <= test_cnt; j++) {
            fib_sd += (fib_tmp[j] - fib_mean) * (fib_tmp[j] - fib_mean);
            fib_fast_sd +=
                (fib_fast_tmp[j] - fib_fast_mean) * (fib_fast_tmp[j] - fib_fast_mean);
        }
        fib_sd = sqrt(fib_sd / test_cnt);
        fib_fast_sd = sqrt(fib_fast_sd / test_cnt);
        /* remove outliers */
        for (int j = 0; j <= test_cnt; j++) {
            if ((fib_tmp[j] > (fib_mean + 3 * fib_sd)) ||
                (fib_tmp[j] < (fib_mean - 3 * fib_sd))) {
                fib_tmp[j] = 0;
            }
            if ((fib_fast_tmp[j] > (fib_fast_mean + 3 * fib_fast_sd)) ||
                (fib_fast_tmp[j] < (fib_fast_mean - 3 * fib_fast_sd))) {
                fib_fast_tmp[j] = 0;
            }
        }
        /* calculate mean*/
        int fib_cnt = 0, fib_fast_cnt = 0;
        fib_mean = 0;
        fib_fast_mean = 0;
        for (int j = 0; j <= test_cnt; j++) {
            if (fib_tmp[j] != 0) {
                fib_mean += fib_tmp[j];
                fib_cnt++;
            }
            if (fib_fast_tmp[j] != 0) {
                fib_fast_mean += fib_fast_tmp[j];
                fib_fast_cnt++;
            }
        }
        fib_mean /= fib_cnt;
        fib_fast_mean /= fib_fast_cnt;
        printf("%d %d %d\n", i, fib_mean, fib_fast_mean);
    }
    close(fd);
    return 0;
}
```
The idea: run each `fib(n)` and `fib_fast(n)` 1000 times, compute the standard deviation of the samples, and drop values more than three standard deviations from the mean — i.e., keep roughly 99.7% of the data.

Finally, plot the mean of the remaining samples; the result is shown below:

Thanks to outlier removal, the jitter in the data is greatly reduced, but fluctuations still appear in the first 100 Fibonacci numbers; a likely cause is