2020q1 Homework2 (fibdrv)

# 2020q1 Homework2 (fibdrv) contributed by < `jwang0306` > > [作業要求](https://hackmd.io/@sysprog/linux2020-fibdrv#-%E4%BD%9C%E6%A5%AD%E8%A6%81%E6%B1%82) > [GitHub](https://github.com/jwang0306/fibdrv) --- ## 自我檢查清單 - [ ] 研讀上述 Linux 效能分析的提示描述，在自己的實體電腦運作 GNU/Linux，做好必要的設定和準備工作 $\rightarrow$ 從中也該理解為何不希望在虛擬機器中進行實驗; - [x] 研讀上述費氏數列相關材料 (包含論文)，摘錄關鍵手法，並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 * 快速的 fibonacci 公式與計算方法可歸納如下： $$F(2k) = F(k)[2F(k+1) - F(k)] \\ F(2k+1) = F(k+1)^2+F(k)^2$$ * 遇到 `0` $\rightarrow$ 求 $F(2n)$ 和 $F(2n+1)$ * 遇到 `1` $\rightarrow$ 求 $F(2n)$ 和 $F(2n+1)$，再求 $F(2n+2)$ * 一個 32 bit 的數字，只要算它所使用到的就好，也就是從左邊數來第一個非 0 開始，所以可以把 leading zero 扣除，透過內建的 `__builtin_clz()` 得到 leading zero 的數量，再用 32 減掉，即為第一個非 0 的起始位置 - [ ] 複習 C 語言數值系統和 bitwise operation，思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本; - [ ] `lsmod` 的輸出結果有一欄名為 `Used by`，這是 “each module’s use count and a list of referring modules”，但如何實作出來呢？模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢？ - [x] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣，什麼場景中會需要呢？撰寫多執行緒的 userspace 程式來測試，觀察 Linux 核心模組若沒用到 mutex，到底會發生什麼問題 * 使用 lock 場景：在多個 thread 需要使用共同資源的時候，為了確保執行先後順序的正確性，需要使用 mutex lock 來保證一次只能有一個執行緒進入 critical section 。 * 撰寫多執行緒的 userspace 程式 ```cpp void *execute(void *arg) { char buf[100]; char write_buf[] = "testing writing"; char method[1]; int offset = 100; long long sz; int fd = open(FIB_DEV, O_RDWR); if (fd < 0) { perror("Failed to open character device"); } method[0] = FIB_DOUB_CLZ; sz = write(fd, method, 1); for (int i = 0; i <= offset; i++) { lseek(fd, i, SEEK_SET); sz = read(fd, buf, i); printf("Reading from " FIB_DEV " at offset %d by thread #%ld, returned the sequence " "%s.\n", i, (long)arg, buf); } close(fd); } int main() { pthread_t tid[NUM_THREAD]; for (int i = 0; i < NUM_THREAD; ++i) { pthread_create(&(tid[i]), NULL, &execute, (void *)(long)(i+1)); } for (int i = 0; i < NUM_THREAD; ++i) { pthread_join(tid[i], NULL); } } ``` 沒有 lock - 若是把 `fibdrv.c` 的 `mutex_trylock` 拿掉，可以看到 thread 3 和 4 的順序是亂的 - 但是算得的結果是正確的，因為只是對 device 進行讀取，而非寫入 ```shell ... Reading from /dev/fibonacci at offset 72 by thread #4, returned the sequence 498454011879264. Reading from /dev/fibonacci at offset 73 by thread #4, returned the sequence 806515533049393. Reading from /dev/fibonacci at offset 43 by thread #3, returned the sequence 433494437. Reading from /dev/fibonacci at offset 74 by thread #4, returned the sequence 1304969544928657. Reading from /dev/fibonacci at offset 44 by thread #3, returned the sequence 701408733. Reading from /dev/fibonacci at offset 45 by thread #3, returned the sequence 1134903170. Reading from /dev/fibonacci at offset 75 by thread #4, returned the sequence 2111485077978050. Reading from /dev/fibonacci at offset 76 by thread #4, returned the sequence 3416454622906707. Reading from /dev/fibonacci at offset 77 by thread #4, returned the sequence 5527939700884757. Reading from /dev/fibonacci at offset 46 by thread #3, returned the sequence 1836311903. Reading from /dev/fibonacci at offset 47 by thread #3, returned the sequence 2971215073. ... ``` 有 `mutex_trylock()` ```cpp static int fib_open(struct inode *inode, struct file *file) { if (!mutex_trylock(&fib_mutex)) { printk(KERN_ALERT "fibdrv is in use"); return -EBUSY; } return 0; } ``` - `mutex_trylock` 顧名思義會嘗試去取得 lock ，但是他不會等待，因此沒有取得 lock 的話該執行緒不會睡在那邊等，會直接回傳成功與否，然後接著下去執行了 - 因此按照原本的程式碼，它會回傳一個負的值，導致讀取 `fibdrv` 失敗，輸出也就會留白： ```shell ... Reading from /dev/fibonacci at offset 98 by thread #5, returned the sequence 135301852344706746049. Reading from /dev/fibonacci at offset 99 by thread #5, returned the sequence 218922995834555169026. Reading from /dev/fibonacci at offset 100 by thread #5, returned the sequence 354224848179261915075. thread #3 Failed to open character device Reading from /dev/fibonacci at offset 0 by thread #3, returned the sequence . Reading from /dev/fibonacci at offset 1 by thread #3, returned the sequence . Reading from /dev/fibonacci at offset 2 by thread #3, returned the sequence . Reading from /dev/fibonacci at offset 3 by thread #3, returned the sequence . Reading from /dev/fibonacci at offset 4 by thread #3, returned the sequence . Reading from /dev/fibonacci at offset 5 by thread #3, returned the sequence . ... ``` 有 `mutex_lock()` ```cpp static int fib_open(struct inode *inode, struct file *file) { mutex_lock(&fib_mutex); return 0; } ``` - 我進行以上修改，將其換成 `mutex_lock` ，如此一來若是沒有取得 lock ，該執行緒就會睡到可以取得為止 - 因此輸出的順序是正確的（只有放上部份，全部貼上來太長） ```shell Reading from /dev/fibonacci at offset 0 by thread #5, returned the sequence 0. Reading from /dev/fibonacci at offset 1 by thread #5, returned the sequence 1. Reading from /dev/fibonacci at offset 2 by thread #5, returned the sequence 1. Reading from /dev/fibonacci at offset 3 by thread #5, returned the sequence 2. Reading from /dev/fibonacci at offset 4 by thread #5, returned the sequence 3. Reading from /dev/fibonacci at offset 5 by thread #5, returned the sequence 5. Reading from /dev/fibonacci at offset 6 by thread #5, ... Reading from /dev/fibonacci at offset 96 by thread #1, returned the sequence 51680708854858323072. Reading from /dev/fibonacci at offset 97 by thread #1, returned the sequence 83621143489848422977. Reading from /dev/fibonacci at offset 98 by thread #1, returned the sequence 135301852344706746049. Reading from /dev/fibonacci at offset 99 by thread #1, returned the sequence 218922995834555169026. Reading from /dev/fibonacci at offset 100 by thread #1, returned the sequence 354224848179261915075. ``` 等待 lock 所花時間 - 由於我們無法單純由 `mutex_lock` 直接得知 lock 到底是否正被佔著，因此我迂迴地進行了測試，如果 `mutex_trylock` 沒有搶到 lock ，那麼我就利用 `mutex_lock` 再去搶一次，目的是為了讓該 thread 「卡」在那邊，直到其他 thread 將 lock 釋出 - 後面緊接著要 `mutex_unlock` ，否則程式會整個當掉，因為後面的執行緒永遠等不到 lock ```cpp static int fib_open(struct inode *inode, struct file *file) { if (!mutex_trylock(&fib_mutex)) { printk(KERN_ALERT "fibdrv is in use"); mutex_lock(&fib_mutex); printk(KERN_ALERT "fibdrv is now available"); mutex_unlock(&fib_mutex); return -EBUSY; } return 0; } ``` - 由於我無法在 kernel space 得知此刻到底是哪個執行緒在跑，因此我選擇在 user space 進行時間測量 - 當 thread 沒搶到 lock 的時候， `fd` 會小於 0 ，再把所花時間印出來，即為該 thread 的等待時間 ```cpp void *execute(void *arg) { ... struct timespec start, end; clock_gettime(CLOCK_MONOTONIC, &start); int fd = open(FIB_DEV, O_RDWR); clock_gettime(CLOCK_MONOTONIC, &end); long long spent = (long long)(end.tv_sec * 1e9 + end.tv_nsec) - (long long)(start.tv_sec * 1e9 + start.tv_nsec); if (fd < 0) printf("Thread #%ld has slept for %lld (ns).\n", (long)arg, spent); ... } ``` - 範例輸出 ```shell Thread #3 has slept for 139201 (ns). ``` 經過多次實驗，有五個執行緒的情況下，萬一某執行緒沒有搶到 lock ，其等待的時間約為好幾萬、甚至是幾十萬 nsec 。因此在我們這個 case 底下，若是有多個執行緒，且只是單純對檔案進行讀取、沒有對共享資源進行操作的話，不使用 lock 的整體速度表現上會比較好（假設對輸出順序不在意，畢竟數字的計算仍然正確）。 --- ## Kernel space 與 User space 之間的傳遞時間我參考了 [ambersun](https://hackmd.io/@ambersun1234/fibdrv) 的作法。我透過暫時修改 `write` 函式來計時： ```cpp static ssize_t fib_write(struct file *file, const char *buf, size_t size, loff_t *offset) { return ktime_to_ns(ktime_get()); } ``` 回傳到 user space 時再透過與 monotonic time 之間的差異來得到 kernel to user 與 user to kernel 的時間： ```cpp for (int i = 0; i <= offset; i++) { struct timespec start, end; clock_gettime(CLOCK_MONOTONIC, &start); sz = write(fd, write_buf, strlen(write_buf)); clock_gettime(CLOCK_MONOTONIC, &end); long long u2k = sz - (long long)(start.tv_sec * 1e9 + start.tv_nsec); long long k2u = (long long)(end.tv_sec * 1e9 + end.tv_nsec) - sz; } ``` 如果作法是將 user space 測得的 fibonacci 執行時間與 kernel space 測得的 fibonacci 執行時間相減，那麼嚴格說起來是 **kernel to user 與 user to kernel 時間的總和**（藍色的線），並不能代表任何一方： ```cpp clock_gettime(CLOCK_REALTIME, &start); sz = read(fd, buf, 1); clock_gettime(CLOCK_REALTIME, &end); unsigned int t = diff_in_ns(start, end); long long sum = t - sz; ``` 實驗結果顯示， user to kernel 的傳遞時間要多上一些。 ![](https://i.imgur.com/OtyiMBk.png) --- ## 執行時間的優化首先進行[干擾因素的排除](https://hackmd.io/@sysprog/linux2020-fibdrv#%E6%8E%92%E9%99%A4%E5%B9%B2%E6%93%BE%E6%95%88%E8%83%BD%E5%88%86%E6%9E%90%E7%9A%84%E5%9B%A0%E7%B4%A0)，並透過 ktimer 於 kernel space 中測量執行不同 fibonacci 演算法所花的時間，回傳到 user space 再輸出並畫圖。全程透過指定 `taskset 0x1` 固定在同一個 CPU 上執行。我們希望從 client 端去選擇想要執行的 fibonacci 方法，老師建議可以利用修改 sysfs 來完成。我原先透過 `fib_write` 來控制一個全域 integer 變數，然後根據這個 flag 來決定 `fib_read` 要執行的方法： ```cpp int flag; static ssize_t fib_write(struct file *file, const char *buf, size_t size, loff_t *offset) { char tmp_buf[1]; copy_from_user(tmp_buf, buf, 1); flag = tmp_buf[0]; return 1; } static long long fib_time_proxy(long long k) { ktime_t kt = ktime_get(); long long result; if (flag == 0) result = fib_dp(k); else if (flag == 1) result = fib_double(k); else if (flag == 2) result = fib_double_clz(k); unsigned int ns = ktime_to_ns(ktime_sub(ktime_get(), kt)); return ns; } ``` 後來看到了 [AndybnACT](https://hackmd.io/@AndybnA/fibdrv) 的方法，我覺得比我的優雅多了，<s>直接把 function 當成變數</s>，就減少了我原先需要一直執行的 if-else block ： :::danger 不是「直接把 function 當成變數」，而該說「傳遞 function pointer」，請複習 [你所不知道的C語言：指標篇](https://hackmd.io/@sysprog/c-pointer) 和 [你所不知道的 C 語言：物件導向程式設計篇](https://hackmd.io/@sysprog/c-oop)，以得知 function pointer 使用案例。 :notes: jserv ::: > [name=jwang0306]謝謝老師，已修正我錯誤的觀念以及說法 ```cpp long long (*fib_sequence)(long long); static ssize_t fib_write(struct file *file, const char *buf, size_t size, loff_t *offset) { char tmp_buf[1]; copy_from_user(tmp_buf, buf, 1); switch (tmp_buf[0]) { case 0: /* FIB_DP */ fib_sequence = fib_dp; break; case 1: /* FIB_DOUB */ fib_sequence = fib_double; break; case 2: /* FIB_DOUB_CLZ */ fib_sequence = fib_double_clz; break; } return 1; } static long long fib_time_proxy(long long k) { ktime_t kt = ktime_get(); long long result = fib_sequence(k); unsigned int ns = ktime_to_ns(ktime_sub(ktime_get(), kt)); return ns; } ``` ### fast algorithm ```cpp long long fib_double(long long k) { if (k <= 1) return k; long long a = 0, b = 1; for (int i = 31; i >= 0; --i) { long long t1 = a * (b * 2 - a); long long t2 = a * a + b * b; a = t1; b = t2; if (k & (1ull << i)) { t1 = a + b; a = b; b = t1; } } return a; } ```  ### clz/ctz ```cpp int i = 31 - __builtin_clz(k); ``` 利用 `__builtin_clz` 找出 leading zero 從運算過程中排除，但與原本的 dynamic programming 版本放在一起比較時，看不出 fast double 演算法在 clz 的有無之下所造成的差異： ![](https://i.imgur.com/m0x9iJX.png) :::danger 調整時間刻度再來觀察，關鍵是執行時間的分佈 (複習統計學) :notes: jserv ::: > [name=jwang0306]已補上實做，感謝老師提示因此將有無 clz 與否的時間測量單獨拉出來比較，於每個 offset 採樣五次，畫出分佈圖，加入 clz 後於執行時間的分佈明顯低於沒有使用 clz 的版本，這時就能明顯看出加速了： ```cpp FILE *fp = fopen("./FIB_DOUB.txt", "w"); char data[50]; method[0] = FIB_DOUB; sz = write(fd, method, 1); for (int i = 0; i <= offset; ++i) { lseek(fd, i, SEEK_SET); for (int j = 0; j < 5; ++j) { sz = read(fd, buf, 1); sprintf(data, "%d %lld\n", i, sz); fputs(data, fp); } } ``` ![](https://i.imgur.com/JYhhqA4.png) --- ## 大數運算原本用 64 位元去存取，超過 $fib(92)$ 之後就會遇到 overflow，也就是超出 64 位元無號整數所能表達的有效數值： ```shell f(93) fail input: 7540113804746346429 expected: 12200160415121876738 ``` 因此想法就是做一個大數運算，嘗試並評估後決定使用十進位來實做，因為二進位的轉換太難處理。 ### 結構 - 簡單地用一個陣列來存放每一個十進位數字 - 紀錄總共有幾個十進位數字 - 存放的順序是顛倒的，所以會像： `12345` $\rightarrow$ `arr[5,4,3,2,1,0,0,0,....]` ，這樣迴圈可以從 0 開始算 - 直接使用 integer 的陣列的缺點就是，不能開太大，因為每個位置都是 4 bytes ，否則會爆掉 ```cpp typedef struct bignum { int num_digits; int digits[MAX_DIGITS]; // 32 } bn_t; ``` ### 加法 - 首先將目標陣列初始化為 0 - 掃過每一個 digit ，並逐次進位 - 最後數數有幾個 digit ，並保存 ```cpp /* C = A + B */ void bn_add(bn_t *c, bn_t *a, bn_t *b) { for (int i = 0; i < MAX_DIGITS; ++i) c->digits[i] = 0; int carry = 0; int maxlen = MAX(a->num_digits, b->num_digits) + 1; for (int i = 0; i < maxlen; ++i) { c->digits[i] = a->digits[i] + b->digits[i] + carry; carry = c->digits[i] / 10; c->digits[i] %= 10; } c->num_digits = count_digits(c); } ``` ### 減法 - 與加法相似，逐個借位 - 最後數 digit 數量 ```cpp /* C = A - B */ void bn_sub(bn_t *c, bn_t *a, bn_t *b) { for (int i = 0; i < MAX_DIGITS; ++i) c->digits[i] = 0; int borrow = 0; int maxlen = MAX(a->num_digits, b->num_digits) + 1; for (int i = 0; i < maxlen; ++i) { c->digits[i] = a->digits[i] - b->digits[i] - borrow; borrow = 0; if (c->digits[i] < 0) { borrow = 1; c->digits[i] += 10; } } c->num_digits = count_digits(c); } ``` :::danger 十進位運算麻煩之處：借位帶來的時間開銷和運算量。 :notes: jserv ::: > [name=jwang0306]正在參考[sysprog21/bignum](https://github.com/sysprog21/bignum)的二進位實做手法 ### 乘法 :::danger 翻閱辭典，查閱「雙」的意義，不要將「二」和「雙」的適用情境搞混了！ :notes: jserv ::: > [name=jwang0306]收到，附上辭典![](https://i.imgur.com/xqstiRp.png) - 乘法利用<s>雙層</s> 二層迴圈，將每一個 digit 對齊相乘，如果遇到 0 就跳過。過程中超過 10 先放著不管 - 等全部乘完後，陣列的每個元素逐個檢查並進位 - 最後數 digit 數量 ```cpp /* C = A x B */ void bn_mul(bn_t *c, bn_t *a, bn_t *b) { for (int i = 0; i < MAX_DIGITS; ++i) c->digits[i] = 0; for (int i = 0; i < a->num_digits; ++i) { if (a->digits[i] == 0) continue; for (int j = 0; j < b->num_digits; ++j) { if (b->digits[j] == 0) continue; c->digits[i + j] += (a->digits[i]) * (b->digits[j]); } } for (int i = 0; i < MAX_DIGITS - 1; ++i) { int carry = 0; carry = c->digits[i] / 10; c->digits[i + 1] += carry; c->digits[i] %= 10; } c->num_digits = count_digits(c); } ``` ### 數 digit - 數字會在陣列的前面，`12345` $\to$ `arr[5,4,3,2,1,0,0,0,....]` ，因此從最後面數過來， 0 的話就減一，遇到非 0 時的 counter index 就是我要的 digit 數 ```cpp int count_digits(bn_t *n) { int i; for (i = MAX_DIGITS; i > 0 && n->digits[i - 1] == 0; --i) ; return (i == 0) ? 1 : i; } ``` ### 放進 buffer - 回傳到 user space 必須透過 `char *` 型態的 buffer ，因此要將 integer 型態的一個一個數字放，要注意也要反過來放 - `sprintf` 在 Cppcheck 會跳出警告不給 commit ，因此改用 `snprintf` - 一定得透過 `copy_tp_user` 回傳給 user space ，因為同樣叫做 `buf` 但是在 user space 和 kernel space 所指向的地址可能值會不同，因此不能直接透過 `sprintf` 寫進 `buf` 參數，否則整個 module 會直接被 Killed - `long copy_to_user( void __user *to, const void * from, unsigned long n );` ```cpp for (int i = 0, index = 0; i < result.num_digits; ++i) index += snprintf(&tmp[index], MAX_DIGITS - index, "%d", result.digits[result.num_digits - i - 1]); copy_to_user(buf, tmp, MAX_DIGITS); ``` ### 測試 - 比對 [The Fibonacci numbers](http://www.maths.surrey.ac.uk/hosted-sites/R.Knott/Fibonacci/fibtable.html) 後，確認結果符合預期： ```shell ... Reading from /dev/fibonacci at offset 93, returned the sequence 12200160415121876738. Reading from /dev/fibonacci at offset 94, returned the sequence 19740274219868223167. Reading from /dev/fibonacci at offset 95, returned the sequence 31940434634990099905. Reading from /dev/fibonacci at offset 96, returned the sequence 51680708854858323072. Reading from /dev/fibonacci at offset 97, returned the sequence 83621143489848422977. Reading from /dev/fibonacci at offset 98, returned the sequence 135301852344706746049. Reading from /dev/fibonacci at offset 99, returned the sequence 218922995834555169026. Reading from /dev/fibonacci at offset 100, returned the sequence 354224848179261915075. ``` - 再次測時間： - 單純比較有無 clz 硬體加速的話，有 clz 仍然較快 - 與 dynamic programming 版本比較的話就顯得慢了，因為乘法的成本太高，在有硬體加速、輸入數字又大的情況下才拼得過 ![](https://i.imgur.com/biAien8.png) --- ## TODO ### 大數運算加速參考 [sysprog21/bignum](https://github.com/sysprog21/bignum)，全部用二進位就可透過 bitwise operation 加速 ### add plot to makefile ```shell plot: time.gp time @gnuplot $< @eog time.png ``` ```shell $ taskset 0x1 [program] $ gnuplot [script].gp $ eog [image].png ```