--- tags: linux2022 --- # 2022q1 Homework3 (fibdrv) contributed by < [`cwl0429`](https://github.com/cwl0429) > ```shell $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 142 Model name: Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz Stepping: 11 CPU MHz: 1800.000 CPU max MHz: 3900.0000 CPU min MHz: 400.0000 BogoMIPS: 3600.00 Virtualization: VT-x L1d cache: 128 KiB L1i cache: 128 KiB L2 cache: 1 MiB L3 cache: 6 MiB NUMA node0 CPU(s): 0-7 ``` ## 自我檢查清單 - [x] 研讀上述 ==Linux 效能分析的提示== 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗 - [x] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 - [x] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise),思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本; - [ ] 研讀 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU),指出針對大數運算,有哪些加速運算和縮減記憶體操作成本的舉措? - [ ] `lsmod` 的輸出結果有一欄名為 `Used by`,這是 "each module's use count and a list of referring modules",但如何實作出來呢?模組間的相依性和實際使用次數 ([reference counting](https://en.wikipedia.org/wiki/Reference_counting)) 在 Linux 核心如何追蹤呢? > 搭配閱讀 [The Linux driver implementer’s API guide » Driver Basics](https://www.kernel.org/doc/html/latest/driver-api/basics.html) - [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 [POSIX Thread](https://en.wikipedia.org/wiki/POSIX_Threads) 的程式碼來確認。 ## fibdrv 實作 ### 使用 char 存放 Big Number 參考 [AdrianHuang/fibdrv](https://github.com/AdrianHuang/fibdrv/tree/big_number),讓數字以 char 的型態儲存 ```c typedef union { /* allow strings up to 15 bytes to stay on the stack * use the last byte as a null terminator and to store flags * much like fbstring: * https://github.com/facebook/folly/blob/master/folly/docs/FBString.md */ char data[16]; struct { uint8_t filler[15], /* how many free bytes in this stack allocated string * same idea as fbstring */ space_left : 4, /* if it is on heap, set to 1 */ is_ptr : 1, is_large_string : 1, flag2 : 1, flag3 : 1; }; /* heap allocated */ struct { char *ptr; /* supports strings up to 2^MAX_STR_LEN_BITS - 1 bytes */ size_t size : MAX_STR_LEN_BITS, /* capacity is always a power of 2 (unsigned)-1 */ capacity : 6; /* the last 4 bits are important flags */ }; } xs; ``` 接著研究一下如何將 char 型態的大數相加 觀察以下程式碼,發現在相加之前會需將 string 反轉,原因是 string 在反轉後 大數 `a` 及 `b` 的最低位數便可對齊,以下假設 a 為 1234 ; b 為 534 反轉前: ```graphviz digraph{ node[shape = "record"] n1[label = "1|2|3|4"] n2[label = "5|3|4"] } ``` 反轉後: ```graphviz digraph{ node[shape = "record"] n1[label = "4|3|2|1"] n2[label = "4|3|5"] } ``` 此時 a 及 b 陣列開頭存放的 char 就會是個位數 ```c static void string_number_add(xs *a, xs *b, xs *out) { char *data_a, *data_b; size_t size_a, size_b; int i, carry = 0; int sum; /* * Make sure the string length of 'a' is always greater than * the one of 'b'. */ if (xs_size(a) < xs_size(b)) __swap((void *) &a, (void *) &b, sizeof(void *)); data_a = xs_data(a); data_b = xs_data(b); size_a = xs_size(a); size_b = xs_size(b); reverse_str(data_a, size_a); reverse_str(data_b, size_b); char *buf = kmalloc(sizeof(char) * (size_a + 2), GFP_KERNEL); for (i = 0; i < size_b; i++) { sum = (data_a[i] - '0') + (data_b[i] - '0') + carry; buf[i] = '0' + sum % 10; carry = sum / 10; } for (i = size_b; i < size_a; i++) { sum = (data_a[i] - '0') + carry; buf[i] = '0' + sum % 10; carry = sum / 10; } if (carry) buf[i++] = '0' + carry; buf[i] = 0; reverse_str(buf, i); /* Restore the original string */ reverse_str(data_a, size_a); reverse_str(data_b, size_b); if (out) *out = *xs_tmp(buf); } ``` 在與老師討論後,發現以 char 存放數字的方式會有幾點問題 __浪費大量記憶體空間__ 一個 char 的大小為 1 byte 也就是 8 個 bits,能表示的數字範圍為 `0 ~ 255`,而 char 能表示的數字範圍為 ``'0' ~ '9'`` 而此範圍能用 4 個 bits 表示,所以在每個 char 中都至少浪費一半的 bits 基於以上理由,我嘗試實作以 `int *` 的方式存放的大數 參考 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU) 的大數資料結構 ```c /** * @brief store bignumber with dynamic size * number[size - 1] is the msb */ typedef struct _BigN { unsigned int *number; unsigned int size; /* sign is not use now */ int sign; } BigN; ``` 利用可動態配置的 `number` 存放大數 `size` 紀錄 `number` 的長度,而 `sign` 會紀錄大數的正負號但目前實作還沒用上。 為了實作 `fib` 及 `fib_fast_doubling`,我實作了以下函式 #### `bign_to_string` 參考 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU) 的方法,利用此函式將計算完畢的 fib number 從 kernel 端複製到 user 端,過程如下圖: 先假設 bign a 的 `a->number[0]` 為 `0...001011`, `a->size` 為 `1` , 對應的 `string` 長度為 `12` > 以下圖示為了作圖方便,省略 number 的前 28 個數值為 0 的 bits ```graphviz digraph{ node[shape = "record"] n1[label = "<1>1|<2>0|<3>1|<4>1"] s[label ="0|0|0|0|0|0|0|0|0|0|0|\\0"] number[shape = plaintext] string[shape = plaintext] number -> n1 string -> s } ``` 從 `number` 的 msb 開始往右掃描,若當前位元為 `1` 則 `carry` 為 `1`,並更新整個 `string`,對每個位元做 `c[j] += c[j] - '0' + carry;` 若過程中發現 `c[j]` 會大於 `9` 則設置 `carry` 為 `1` 用以進位到 `c[j-1]` 否則設置為 `0` ```graphviz digraph{ node[shape = "record"] n1[label = "<1>1|<2>0|<3>1|<4>1"] s[label ="0|0|0|0|0|0|0|0|0|0|<cur>0 + 0 + 1|'\\0'"] current_bit[shape = plaintext] j[shape = plaintext] current_bit -> n1:1 j -> s:cur } ``` 接著重複執行上述動作 ```graphviz digraph{ node[shape = "record"] n1[label = "<1>1|<2>0|<3>1|<4>1"] s[label ="0|0|0|0|0|0|0|0|0|0|<cur>1 + 1 + 0|'\\0'"] current_bit[shape = plaintext] j[shape = plaintext] current_bit -> n1:2 j -> s:cur } ``` ```graphviz digraph{ node[shape = "record"] n1[label = "<1>1|<2>0|<3>1|<4>1"] s[label ="0|0|0|0|0|0|0|0|0|0|<cur>2 + 2 + 1|'\\0'"] current_bit[shape = plaintext] j[shape = plaintext] current_bit -> n1:3 j -> s:cur } ``` ```graphviz digraph{ node[shape = "record"] n1[label = "<1>1|<2>0|<3>1|<4>1"] s[label ="0|0|0|0|0|0|0|0|0|0|<cur>5 + 5 + 1|'\\0'"] s_2[label ="0|0|0|0|0|0|0|0|0|<cur>0 + 0 + 1|1|'\\0'"] current_bit[shape = plaintext] j[shape = plaintext] current_bit -> n1:4 j -> s:cur s -> s_2 j_nxt -> s_2:cur } ``` 掃描完後,得到最後結果 `11` ```graphviz digraph{ node[shape = "record"] n1[label = "<1>1|<2>0|<3>1|<4>1"] s[label ="0|0|0|0|0|0|0|0|0|1|1|\\0"] number[shape = plaintext] string[shape = plaintext] number -> n1 string -> s } ``` ```c static inline char *bign_to_string(bign *a) { /* change of base formula: log10(x) = log2(x) / log2(10) */ size_t len = (8 * sizeof(int) * a->size) / 3 + 2; char *c = kmalloc(len, GFP_KERNEL); char *p = c; memset(c, '0', len - 1); c[len - 1] = '\0'; /* transform bits into decimal string */ for (int i = a->size - 1; i >= 0; i--) { for (unsigned int d = 1U << 31; d; d >>= 1) { int carry = !!(d & a->number[i]); for (int j = len - 2; j >= 0; j--) { c[j] += c[j] - '0' + carry; carry = (c[j] > '9'); if (carry) c[j] -= 10; } } } while (*p == '0' && *(p + 1) != '\0') p++; if (a->sign) *(--p) = '-'; memmove(c, p, strlen(p) + 1); return c; } ``` #### `bign_add` 及 `bign_sub` - `add_BigN` - 從 `number[i]` 的 lsb 開始計算,將 `a->number[i], b->number[i]` 及 `carry` 加至 `c->number[i]` ,並在過程利用巨集 `DETECT_OVERFLOW` 偵測是否有溢位發生,若是溢位便將 `carry` 設置為 `1` 否則為 `0` - `sub_BigN` - `sub_BigN` 和 `add_BigN` 想法類似,主要差別為偵測溢位的方式 ```c static inline void bign_add(bign *a, bign *b, bign *c) { bign_resize(c, a->size + 1); unsigned int carry = 0; for (int i = 0; i < c->size; i++) { unsigned int tmp_a = (i < a->size) ? a->number[i] : 0; unsigned int tmp_b = (i < b->size) ? b->number[i] : 0; c->number[i] = tmp_a + tmp_b + carry; carry = DETECT_OVERFLOW(tmp_a, tmp_b + carry); } if (!c->number[c->size - 1] && c->size > 1) bign_resize(c, c->size - 1); } static inline void bign_sub(bign *a, bign *b, bign *c) { bign_resize(c, a->size); unsigned int carry = 0; for (int i = 0; i < c->size; i++) { long tmp_a = (i < a->size) ? (long) a->number[i] : 0; long tmp_b = (i < b->size) ? (long) b->number[i] : 0; c->number[i] = tmp_a - tmp_b - carry; carry = DETECT_OVERFLOW_SUB(tmp_a - carry, tmp_b); } if (!c->number[c->size - 1] && c->size > 1) bign_resize(c, c->size - 1); } ``` #### `bign_mul` 令 `a` 為被乘數 `b` 為乘數,之後便用乘法直式的過程計算 需要注意的地方,在 `mul_BigN` 中我使用了 [`__fls`](https://github.com/torvalds/linux/blob/master/include/asm-generic/bitops/__fls.h) 來省略乘數中的 leading zeros ,避免不必要的計算 > __fls - find last (most-significant) set bit in a long word ```c static inline void bign_mul(bign *a, bign *b, bign *c) { bign *a_tmp = bign_alloc(1); bign_cpy(a, a_tmp); if ((!a->number[0] && (a->size == 1)) || (!b->number[0] && (b->size == 1))) { /* prevent a == c or b == c */ bign_resize(c, 1); c->number[0] = 0; // printk("0 appear"); return; } bign_resize(c, 1); c->number[0] = 0; /* then resize to limit range of a * b */ bign_resize(c, a->size + b->size); for (int i = 0; i < b->size; i++) { if (!b->number[i]) continue; unsigned int clz = __fls(b->number[i]) + 1; unsigned int shift_cnt = 32 - clz; for (unsigned int d = 1U; clz; d <<= 1) { if (!!(d & b->number[i])) { bign_add(c, a_tmp, c); } bign_shl(a_tmp, 1); clz--; } bign_shl(a_tmp, shift_cnt); } int c_size = 0; for (c_size = c->size - 1; !c->number[c_size];) c_size--; if (c->size > 1) bign_resize(c, c_size + 1); } ``` __更新 `bign_mul`__ - 假設有三個 bign a, b, c - c = a * b - c 為最終儲存的 bign - 令 a->size 為 A, b->size 為 B 原先: 從乘數的 lsb 開始往左掃描 bits,若 bit 為 1 則將被乘數加入 c 並在每回合結束時將乘數左移一格 需花費 (由於兩版本實作主要差別為兩層 for 內部,故只計算迴圈內的執行時間) - $B * (A * 32 * (bign\_add\_time + bign\_shl\_time)) = 32A * B (bign\_add\_time + bign\_shl\_time)$ 其中 `bign_add` 會隨著資料大小而改變運算時間 後來: 參考 [KYG-yaya573142](https://hackmd.io/@KYWeng/rkGdultSU) 的 `mul` 實作方式 將乘數的 `a->number[0~size]` 分別和被乘數的 `b->number[0~size]` 兩兩相乘即可 需花費 - $B * (A * (unsigned int * unsigned int\_exc\_time))$ 其中 unsigned int * unsigned int 可視為常數時間,不會隨著資料大小而改變運算時間 可以發現後來的實作方式因避免使用 `bign_add` 而改善效能 ```c static inline void bign_mul(bign *a, bign *b, bign *c) { /* prevent a == c or b == c */ bign *a_tmp = bign_alloc(1); bign_cpy(a, a_tmp); /* handle multiplier or multiplicand are zero */ if ((!a->number[0] && (a->size == 1)) || (!b->number[0] && (b->size == 1))) { bign_resize(c, 1); c->number[0] = 0; return; } bign_resize(c, 1); c->number[0] = 0; /* then resize to limit range of a * b */ bign_resize(c, a->size + b->size); for (int i = 0; i < b->size; i++) { if (!b->number[i]) continue; for (int j = 0; j < a_tmp->size; j++) { unsigned long long carry; carry = (unsigned long long) a_tmp->number[j] * b->number[i]; bign_mul_add(carry, i + j, c); } } int c_size = 0; for (c_size = c->size - 1; !c->number[c_size];) c_size--; if (c->size > 1) bign_resize(c, c_size + 1); } ``` 最後將 `bign_fib` 及 `bign_fib_fast` 實作 ```c static inline void bign_fib(bign *dest, int fn) { bign_resize(dest, 1); if (fn < 2) { dest->number[0] = fn; return; } bign *a = bign_alloc(1); bign *b = bign_alloc(1); a->number[0] = 0; b->number[0] = 1; for (int i = 2; i <= fn; i++) { bign_add(a, b, dest); bign_cpy(dest, a); bign_swap(a, b); } bign_free(a); bign_free(b); } static inline void bign_fib_fast(bign *dest, const int fn) { bign_resize(dest, 1); if (fn < 2) { dest->number[0] = fn; return; } bign *a = bign_alloc(1); bign *b = bign_alloc(1); bign *tmp = bign_alloc(1); a->number[0] = 0; b->number[0] = 1; /* find leftmost bit position equal to 1*/ int fn_fls = __fls(fn); for (unsigned int d = 1U << fn_fls; d > 0; d >>= 1) { /* F(2k) */ bign_cpy(b, dest); bign_shl(dest, 1); bign_sub(dest, a, dest); bign_mul(dest, a, dest); /* F(2k + 1) */ bign_mul(a, a, tmp); bign_mul(b, b, a); bign_add(a, tmp, tmp); bign_cpy(dest, a); bign_cpy(tmp, b); if (!!(d & fn)) { bign_add(a, b, a); bign_swap(a, b); bign_cpy(a, dest); } } bign_free(a); bign_free(b); bign_free(tmp); } ``` :::warning 回去翻閱 [lab0](https://hackmd.io/@sysprog/linux2022-lab0) 作業說明的「變數和函式命名力求簡潔且精準」,上述程式碼在函式名稱中混用大小寫,對理解行為沒有幫助。 :notes: jserv ::: > 已改正 coding style ## 測量效能 根據 [fibdrv/linux 效能分析的提示](https://hackmd.io/@sysprog/linux2022-fibdrv) 設定測量環境 - 抑制 [ASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization) ```shell $ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space" ``` - 關閉 turbo mode ```shell $ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" ``` - 設定 scaling_governor ```shell sudo bash -c "echo performance > /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor" ``` 利用 `ktime` 來測量效能,以下為使用自己實作的 bign 執行 naive 及 fast doubing 的結果 ![](https://i.imgur.com/uM4SlOW.png) 先撇除影響效能測量的因素,可以發現 fast doubing 比起 naive 要慢上不少,顯示了我實作的 bign 有效能問題 經由 `ktime` 測量後,發現是 `bign_mul` 執行速度過慢,因此拖垮了 `bign_fib_fast` 的執行速度 下圖為優化了 `bign_mul` 的效能後的結果 ![](https://i.imgur.com/n24clAm.png) 可以發現在改善 `bign_mul` 效能後,有正確產生預期結果 ## 使用統計手法測量效能 測量 fib(0)~fib(1000) 效能 額外寫一個 [`performance_stattistic.c`](https://github.com/cwl0429/fibdrv/blob/master/performance_statistic.c) ```c int main() { char write_buf[] = "testing writing"; int offset = 1000; int test_cnt = 1000; int fib_tmp[TEST_CASE_CNT]; int fib_fast_tmp[TEST_CASE_CNT]; int fd = open(FIB_DEV, O_RDWR); if (fd < 0) { perror("Failed to open character device"); exit(1); } for (int i = 0; i <= offset; i++) { for (int j = 0; j <= test_cnt; j++) { lseek(fd, i, SEEK_SET); int f0, f1; f0 = write(fd, write_buf, FIB_NAIVE); f1 = write(fd, write_buf, FIB_FAST_DOUBLING); fib_tmp[j] = f0; fib_fast_tmp[j] = f1; } float fib_sd = 0, fib_fast_sd = 0; int fib_mean = 0, fib_fast_mean = 0; /* calculate mean */ for (int j = 0; j <= test_cnt; j++) { fib_mean += fib_tmp[j]; fib_fast_mean += fib_fast_tmp[j]; } fib_mean /= test_cnt; fib_fast_mean /= test_cnt; /* calculate sd */ for (int j = 0; j <= test_cnt; j++) { fib_sd += (fib_tmp[j] - fib_mean); fib_fast_sd += (fib_fast_tmp[j] - fib_fast_mean); } fib_sd = sqrt(fib_sd); fib_fast_sd = sqrt(fib_fast_sd); /* remove outliner */ for (int j = 0; j <= test_cnt; j++) { if ((fib_tmp[j] > (fib_mean + 3 * fib_sd)) || (fib_tmp[j] < (fib_mean - 3 * fib_sd))) { fib_tmp[j] = 0; } if ((fib_fast_tmp[j] > (fib_fast_mean + 3 * fib_fast_sd)) || (fib_fast_tmp[j] < (fib_fast_mean - 3 * fib_fast_sd))) { fib_fast_tmp[j] = 0; } } /* calculate mean*/ int fib_cnt = 0, fib_fast_cnt = 0; fib_mean = 0; fib_fast_mean = 0; for (int j = 0; j <= test_cnt; j++) { if (fib_tmp[j] != 0) { fib_mean += fib_tmp[j]; fib_cnt++; } if (fib_fast_tmp[j] != 0) { fib_fast_mean += fib_fast_tmp[j]; fib_fast_cnt++; } } fib_mean /= fib_cnt; fib_fast_mean /= fib_fast_cnt; printf("%d %d %d\n", i, fib_mean, fib_fast_mean); } close(fd); return 0; } ``` 想法是,將每個 `fib(n)` 及 `fib_fast(n)` 分別執行 1000 次,得到數據後,算出其標準差,將超過三個標準差的離群值去除,也就是取 99.7% 的資料 ![](https://i.imgur.com/EJ3pVDs.png) 最後取去掉離群值後的平均值作圖,結果如下圖: ![](https://i.imgur.com/5xBGXnY.png) 因為去除離群值的緣故,可以看出數據抖動的狀況大幅減少,但是在前 100 個費氏數仍有抖動的情況發生,推測原因是