2020q1 Homework2 (fibdrv)

# 2020q1 Homework2 (fibdrv) contributed by < `Yu-Wei-Chang` > ## 實驗環境 ```shell $ uname -a Linux ywc-ThinkPad-X220 5.3.0-46-generic #38~18.04.1-Ubuntu SMP Tue Mar 31 04:17:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux $ gcc --version gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 ``` ## 作業要求 * 回答以下所有問題，需要附上對應的參考資料和必要的程式碼，以第一手材料 (包含自己設計的實驗) 為佳 * [x] 研讀上述 ==Linux 效能分析的提示== 描述，在自己的實體電腦運作 GNU/Linux，做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗; * [ ] 研讀上述費氏數列相關材料 (包含論文)，摘錄關鍵手法，並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 * [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise)，思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本; * [ ] `lsmod` 的輸出結果有一欄名為 `Used by`，這是 "each module's use count and a list of referring modules"，但如何實作出來呢？模組間的相依性和實際使用次數 ([reference counting](https://en.wikipedia.org/wiki/Reference_counting)) 在 Linux 核心如何追蹤呢？ * [x] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣，什麼場景中會需要呢？撰寫多執行緒的 userspace 程式來測試，觀察 Linux 核心模組若沒用到 mutex，到底會發生什麼問題 * 在 GitHub 上 fork [fibdrv](https://github.com/sysprog21/fibdrv)，目標是修正 Fibonacci 數計算的正確性 (現有實作存在缺陷，請指出)，隨後改善 `fibdrv` 計算 Fibinacci 數列的執行效率，過程中需要量化執行時間，可在 Linux 核心模組和使用層級去測量 * 原本的程式碼只能列出到 Fibonacci(100) 而且==部分輸出是錯誤的數值==，請修改程式碼，列出後方更多數值 (注意: 檢查正確性和數值系統的使用) * 務必研讀上述 ==Linux 效能分析的提示== 的描述，降低效能分析過程中的干擾; * 在 Linux 核心模組中，可用 ktime 系列的 API; * 在 userspace 可用 [clock_gettime](https://linux.die.net/man/2/clock_gettime) 相關 API; * 分別[用 gnuplot 製圖](https://hackmd.io/@sysprog/Skwp-alOg)，分析 Fibonacci 數列在核心計算和傳遞到 userspace 的時間開銷，單位需要用 us 或 ns (自行斟酌) * 嘗試解讀上述實驗的時間分佈，特別是隨著 Fibonacci 數列增長後，對於 Linux 核心的影響。可修改 VFS 函式，允許透過寫入 `/dev/fibonacci` 裝置檔案來變更不同的 Fibonacci 求值的實作程式碼。 * 用 printk 固然方便，但其執行時間易受各種因素而不穩定，除了透過統計來觀察，也可改用核心的 sysfs 介面，後者可讓我們自行宣告 show 和 store 介面來提供讀寫操作，可參見 [Sample kobject implementation](https://elixir.bootlin.com/linux/latest/source/samples/kobject/kobject-example.c) (注意: 切換到對應的 Linux 核心版本)。 * 逐步最佳化 Fibonacci 的執行時間，引入[費氏數列分析](https://hackmd.io/@sysprog/fibonacci-number) 提到的策略，並善用 [clz / ffs](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令 (Linux 核心有對應的包裝函式)，過程中都要充分量化 ## 撰寫多執行緒的 userspace 程式來測試，觀察 Linux 核心模組若沒用到 mutex，到底會發生什麼問題 * 實驗預期行為說明 * `client` 程式在開啟 /dev/fibonaccifibonacci device file 後，反覆對其使用 `lseek()` 系統呼叫修改 file position，接著使用 `read()` 系統呼叫，此時`fibdrv` 會將目前的 file position 當為項數去求出 Fibonacci 數列。 (0~100 然後再 100~0) * 若 `fibdrv` 在 `open()` 實作上沒有使用 mutex 的情況下，~~假設多個 `client` 程式的執行緒同時對其修改 file position，以及讀取 Fibonacci 數列時可能會互向干擾，例：A 執行緒目前讀取到項數 80，但同時 B 執行緒把 file position 改為 10，導致 A 執行緒讀出來的 Fibonacci 數列是錯誤的。~~ [==錯誤想法，更新如下==] * 實驗修改 * `client` 使用 POSIX thread (pthread.h) 提供的函式來建立三個執行緒同時進行和 main process 同樣的動作。 * `fibdrv` 在 open() 和 release() 實作上都不使用 mutex。 * makefile 中編譯 `client` 的部份加入 pthread 函式庫的 link 參數。 * 觀察結果：==反覆跑幾次沒有看到預期互相干擾的情形==，~~推測是執行緒在做 lseek() 和 read() 中間沒有被其他執行緒的 lseek() 干擾到，原因待查。~~ [錯誤想法] * [實驗用的程式碼](https://github.com/Yu-Wei-Chang/fibdrv/blob/master/client_multithread.c) * [更新] 根據核心的文件 [vfs.txt](https://www.kernel.org/doc/Documentation/filesystems/vfs.txt) 中的說明，其中一段對 file object 的說明如下: 開啟的檔案 (file descriptor) 對應到核心中的 file struct，而 file position 就是記錄在 file struct 之中 `file->f_pos`，所以各個 fd 都有自己的 file position。 > Opening a file requires another operation: allocation of a file structure (this is the kernel-side implementation of file descriptors). The freshly allocated file structure is initialized with a pointer to the dentry and a set of file operation member functions. These are taken from the inode data. The open() file method is then called so the specific filesystem implementation can do its work. You can see that this is another switch performed by the VFS. The file structure is placed into the file descriptor table for the process. > Reading, writing and closing files (and other assorted VFS operations) is done by using the userspace file descriptor to grab the appropriate file structure, and then calling the required file structure method to do whatever is required. For as long as the file is open, it keeps the dentry in use, which in turn means that the VFS inode is still in use. ## 修正 Fibonacci 數計算的正確性 * f(93) 的 fibonacci 數列會超過 long long 的64位元的最大儲存大小，採用作業說明的 struct BigN 型態來儲存 fibonacci 數列。 ```cpp struct BigN { unsigned long long lower, upper; }; static inline void addBigN(struct BigN *output, struct BigN x, struct BigN y) { output->upper = x.upper + y.upper; if (y.lower > ~x.lower) output->upper++; output->lower = x.lower + y.lower; } static struct BigN fib_sequence(long long k) { /* FIXME: use clz/ctz and fast algorithms to speed up */ struct BigN f[k + 2]; f[0].lower = 0; /* f[0] = 0; */ f[0].upper = 0; f[1].lower = 1; /* f[1] = 1; */ f[1].upper = 0; for (int i = 2; i <= k; i++) { addBigN(&f[i], f[i - 1], f[i - 2]); } // pr_warn("offset [%llu], upper [%llu], lower [%llu]\n", k, f[k].upper, // f[k].lower); return f[k]; } ``` * 直接將整個 `struct BigN` 傳到位於 user mode 的 `client` 應用程式運行的 address space 去 (使用 linux/uaccess.h 提供的 copy_to_user() 函式) ```cpp /* calculate the fibonacci number at given offset */ static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { struct BigN fib_seq; if ((!buf) || (size < sizeof(fib_seq))) { return -EINVAL; } fib_seq = fib_sequence(*offset); if (copy_to_user(buf, &fib_seq, sizeof(fib_seq))) { return -EFAULT; } return (ssize_t) sizeof(fib_seq); } ``` * 回到 `client` 這個 user mode 的應用程式，採用字串的方式來表現超過 long long 64位元的大數值。這邊實作了 `print_BigN_string()` 函式來將 struct BigN 的數值轉成字串，以及數字字串的加法和乘法運算函式。 ```cpp /* Max. digits of BigN in decimal is 39. */ #define LEN_BN_STR 40 struct BigN { unsigned long long lower, upper; }; /* Reverse given string */ static void reverse_string(char *str) { if (!str) return; for (int i = 0; i < (strlen(str) / 2); i++) { char c = str[i]; str[i] = str[strlen(str) - 1 - i]; str[strlen(str) - 1 - i] = c; } } /* Add str_num and str_addend, and store result in str_sum. * digit which exceed value of "digits" will be truncated. * return -1 if any failed, or string length of result. */ static int string_of_number_add(char *str_sum, char *str_addend, unsigned int digits) { int length; int idx = 0, carry = 0; char str_tmp[LEN_BN_STR] = ""; if (!str_sum || !str_addend || (digits == 0)) return -1; /* Reverse string because we process it from lowest digit. */ reverse_string(str_sum); reverse_string(str_addend); /* Take the longer length, because we need to go through all digits. */ length = (strlen(str_sum) > strlen(str_addend)) ? strlen(str_sum) : strlen(str_addend); /* Keep calculate if both str_sum and str_addend does not meet '\0'. */ for (int i = 0, j = 0; (i < length) && (j < length);) { int sum = 0; if (str_sum[i] != '\0') { sum += (str_sum[i++] - '0'); } if (str_addend[j] != '\0') { sum += (str_addend[j++] - '0'); } if (idx < digits) str_tmp[idx++] = ((sum + carry) % 10) + '0'; carry = (sum + carry) / 10; } if ((idx < digits) && carry) str_tmp[idx++] = carry + '0'; str_tmp[idx] = '\0'; /* Reverse string back */ reverse_string(str_tmp); reverse_string(str_addend); snprintf(str_sum, digits, "%s", str_tmp); return (int) strlen(str_sum); } /* Multiply multiplicand with miltiplier, and store result in multiplicand. * digit which exceed value of "digits" will be truncated. * return -1 if any failed, or string length of result. */ static int string_of_number_multiply(char *multiplicand, char *multiplier, unsigned int digits) { char str_tmp[LEN_BN_STR] = ""; char str_product[LEN_BN_STR] = ""; int carry = 0; if (!multiplicand || !multiplier || (digits == 0) || (strlen(multiplicand) == 0) || (strlen(multiplier) == 0)) return -1; /* Sepcial case of "0" */ if ((strlen(multiplier) == 1 && multiplier[0] == '0') || (strlen(multiplicand) == 1 && multiplicand[0] == '0')) { snprintf(multiplicand, digits, "%s", "0"); return (int) strlen(multiplicand); } /* Reverse string because we process it from lowest digit. */ reverse_string(multiplicand); reverse_string(multiplier); for (int i = 0; i < strlen(multiplier); i++) { int idx = 0; for (int decuple = i; (idx < digits) && (decuple > 0); decuple--) str_tmp[idx++] = '0'; for (int j = 0; j < strlen(multiplicand); j++) { int product = (multiplicand[j] - '0') * (multiplier[i] - '0'); if (idx < digits) str_tmp[idx++] = ((product + carry) % 10) + '0'; carry = (product + carry) / 10; } if ((idx < digits) && carry) { str_tmp[idx++] = carry + '0'; carry = 0; } str_tmp[idx] = '\0'; reverse_string(str_tmp); string_of_number_add(str_product, str_tmp, sizeof(str_product) - 1); } /* Reverse back */ reverse_string(multiplier); snprintf(multiplicand, digits, "%s", str_product); return (int) strlen(multiplicand); } /* Convert BigN to string which represent in decimal */ static void print_BigN_string(struct BigN fib_seq, char *bn_out, unsigned int digits) { char bn_scale[LEN_BN_STR] = "18446744073709551616"; char bn_lower[LEN_BN_STR] = ""; char bn_upper[LEN_BN_STR] = ""; if (!bn_out) return; snprintf(bn_out, digits, "%d", 0); snprintf(bn_lower, sizeof(bn_lower), "%llu", fib_seq.lower); snprintf(bn_upper, sizeof(bn_upper), "%llu", fib_seq.upper); string_of_number_multiply(bn_scale, bn_upper, sizeof(bn_scale)); string_of_number_add(bn_out, bn_scale, digits); string_of_number_add(bn_out, bn_lower, digits); } ``` * 檢查到 f(183) 數值都還正確 ```shell Reading from /dev/fibonacci at offset 180, returned the sequence 18547707689471986212190138521399707760. Reading from /dev/fibonacci at offset 181, returned the sequence 30010821454963453907530667147829489881. Reading from /dev/fibonacci at offset 182, returned the sequence 48558529144435440119720805669229197641. Reading from /dev/fibonacci at offset 183, returned the sequence 78569350599398894027251472817058687522. Reading from /dev/fibonacci at offset 183, returned the sequence 78569350599398894027251472817058687522. Reading from /dev/fibonacci at offset 182, returned the sequence 48558529144435440119720805669229197641. Reading from /dev/fibonacci at offset 181, returned the sequence 30010821454963453907530667147829489881. Reading from /dev/fibonacci at offset 180, returned the sequence 18547707689471986212190138521399707760. ``` ## 分析 Fibonacci 數列在核心計算和傳遞到 userspace 的時間開銷 ### Fibonacci 數列在核心中的計算開銷時間 * 核心的開銷時間使用 `ktime_t` 型態以及`ktime_get()` 等 ktime.h 提供的 API 來計算。開銷時間將和 Fibonacci 數列一起儲存在 struct BigN 中傳上去給 user space。 * 單位 nano seconds。 ```cpp /* calculate the fibonacci number at given offset */ static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { struct BigN fib_seq; if ((!buf) || (size < sizeof(fib_seq))) { return -EINVAL; } kt = ktime_get(); fib_seq = fib_sequence(*offset); kt = ktime_sub(ktime_get(), kt); fib_seq.fib_cost_time_ns = ktime_to_ns(kt); if (copy_to_user(buf, &fib_seq, sizeof(fib_seq))) { return -EFAULT; } return (ssize_t) sizeof(fib_seq); } ``` ### Fibonacci 數列在 user-level 應用程式中的取得開銷時間 * 使用標準函式庫 time.h 提供的 `clock_gettime()`來取得對 fibonacci device file 做 read 系統呼叫前後的時間，然後相減就可以得到 read 系統呼叫花了多少時間開銷。 * 單位 nano seconds。 * 將此段開銷時間扣除上述在核心中的運算時間，即可得到應用程式呼叫 read 系統呼叫 (client -> kernel)，以及核心透過 copy_to_user() 將資料傳回給應用程式 (kernel -> client) 所花費的時間。 * 將上述計算出來的開銷時間寫入檔案，使用 gnuplot 將圖表繪製出來。 ```cpp int main() { long long sz; struct BigN bn_buf; ... k_fptr = fopen(KERNEL_OUTPUT_FILE, "w+"); u_fptr = fopen(USER_OUTPUT_FILE, "w+"); for (int i = 0; i <= offset; i++) { struct timespec t1, t2; lseek(fd, i, SEEK_SET); clock_gettime(CLOCK_REALTIME, &t1); if ((sz = read(fd, &bn_buf, sizeof(bn_buf))) < 0) { printf("Reading failed at offset %d\n", i); } else { clock_gettime(CLOCK_REALTIME, &t2); if (u_fptr) fprintf(u_fptr, "%d %ld %lld\n", i, timespec_diff(&t1, &t2), timespec_diff(&t1, &t2) - bn_buf.fib_cost_time_ns); if (k_fptr) fprintf(k_fptr, "%d %lld\n", i, bn_buf.fib_cost_time_ns); print_BigN_string(bn_buf, fib_str_buf, sizeof(fib_str_buf)); printf("Reading from " FIB_DEV " at offset %d, returned the sequence " "%s.\n", i, fib_str_buf); } } fclose(k_fptr); fclose(u_fptr); ... return 0; } ``` ### 開銷時間 * user <-> kernel 之間的傳遞開銷時間還算固定，大約在 1000 ns 左右。 * Fibonacci 的運算時間隨著 Fibonacci sequence number 的增加而變多。 ![](https://i.imgur.com/rXmUmlu.png) ## 逐步最佳化 Fibonacci 的執行時間，引入費氏數列分析提到的策略，採用 [Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) 的方式計算 Fibonacci sequence * 依照以下公式實作 fast doubling 函式: $$ \begin{split} F(2k) &= F(k)[2F(k+1) - F(k)] \\ F(2k+1) &= F(k+1)^2+F(k)^2 \end{split} $$ * 因為需要對 BigN 結構做乘法和減法運算，因此需要新增實作。 * 減法 (==此實作不是完整的實作，只能用在大數減小數上，在我們的例子可以使用是因為我們只會用到 f(k+1) - f(k)==) ```cpp static inline void subBigN(struct BigN *output, struct BigN x, struct BigN y) { if ((x.upper >= y.upper) && (x.lower >= y.lower)) { output->upper = x.upper - y.upper; output->lower = x.lower - y.lower; } else { output->upper = x.upper - y.upper - 1; output->lower = ULLONG_MAX + x.lower - y.lower + 1; } } ``` * 乘法 (參考[pingsutw](https://hackmd.io/@pingsutw/Hyv4PFGvE) 同學的實作) ```cpp static inline void multiBigN(struct BigN *output, struct BigN x, struct BigN y) { size_t w = 8 * sizeof(unsigned long long); struct BigN product = {.upper = 0, .lower = 0}; for (size_t i = 0; i < w; i++) { if ((y.lower >> i) & 0x1) { struct BigN tmp; product.upper += x.upper << i; tmp.lower = (x.lower << i); tmp.upper = i == 0 ? 0 : (x.lower >> (w - i)); addBigN(&product, product, tmp); } } for (size_t i = 0; i < w; i++) { if ((y.upper >> i) & 0x1) { product.upper += (x.lower << i); } } output->upper = product.upper; output->lower = product.lower; } ``` * Fast-doubling apporach ```cpp static struct BigN fib_sequence_fd(long long k) { /* The position of the highest bit of k. */ /* So we need to loop `rounds` times to get the answer. */ int rounds = 0; for (int i = k; i; ++rounds, i >>= 1) ; struct BigN t1 = {.upper = 0, .lower = 0}, t2 = {.upper = 0, .lower = 0}; struct BigN a = {.upper = 0, .lower = 0}, b = {.upper = 0, .lower = 1}; struct BigN multi_two = {.upper = 0, .lower = 2}, tmp = {.upper = 0, .lower = 0}; for (int i = rounds; i > 0; i--) { // t1 = a * (2 * b - a); /* F(2k) = F(k)[2F(k+1) − F(k)] */ multiBigN(&t1, b, multi_two); subBigN(&t1, t1, a); multiBigN(&t1, a, t1); // t2 = b * b + a * a; /* F(2k+1) = F(k+1)^2 + F(k)^2 */ multiBigN(&t2, b, b); multiBigN(&tmp, a, a); addBigN(&t2, t2, tmp); if ((k >> (i - 1)) & 1) { a = t2; /* Threat F(2k+1) as F(k) next round. */ addBigN(&b, t1, t2); /* Threat F(2k) + F(2k+1) as F(k+1) next round. */ } else { a = t1; /* Threat F(2k) as F(k) next round. */ b = t2; /* Threat F(2k+1) as F(k+1) next round. */ } } return a; } ``` * 結果開銷時間遠遠超過原本的算法 (如下圖)，從數據上可以發現當 Fibonacci sequence 的項數碰到 $2^n$ 時開銷時間就會有明顯增加，推測是因為運算時又多用了 4 次的 `multiBigN()` 函式，接著找出原因並改善效能。 (WIP) ![](https://i.imgur.com/oSIspeg.png) ###### tags: `Linux核心課程筆記 - Homework`