2021q1 Homework3 (fibdrv)

# 2021q1 Homework3 (fibdrv) contributed by < [shawn5141](https://github.com/Shawn5141/fibdrv) > ## 指定閱讀 [作業解說](https://hackmd.io/@sysprog/linux2022-fibdrv) [Youtube講解](https://www.youtube.com/watch?v=IfsldrCuX28) [費氏數列分析](https://hackmd.io/@sysprog/fibonacci-number) [Linux 核心設計: 檔案系統概念及實作手法](https://hackmd.io/@sysprog/linux-file-system)。 [Linux 核心設計: 中斷處理和現代架構考量](https://hackmd.io/@sysprog/linux-interrupt)。 [Linux 核心設計: Timer 及其管理機制](https://hackmd.io/@sysprog/linux-timer)。 [Linux Virtual File System](https://www.win.tue.nl/~aeb/linux/lk/lk-8.html) ## 預期目標 - [x] 撰寫適用於 Linux 核心層級的程式 - [x] 學習 ktimer, copy_to_user 一類的核心 API - [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise](https://hackmd.io/@sysprog/c-bitwise) 操作 - [ ] 數值分析和運算改進策略 - [ ] 初探 [Linux VFS](https://www.win.tue.nl/~aeb/linux/lk/lk-8.html) - [ ] 自動測試機制 - [ ] 透過工具進行效能分析 ## 自我檢查清單 - [x] 研讀上述 `Linux 效能分析的提示` 描述，在自己的實體電腦運作 GNU/Linux，做好必要的設定和準備工作 →從中也該理解為何不希望在虛擬機器中進行實驗; - [ ] 研讀上述費氏數列相關材料 (包含論文)，摘錄關鍵手法，並思考 [clz/ ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 - [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/linux2022-fibdrv#:~:text=%E8%AA%9E%E8%A8%80%20%E6%95%B8%E5%80%BC%E7%B3%BB%E7%B5%B1%20%E5%92%8C-,bitwise%20operation,-%EF%BC%8C%E6%80%9D%E8%80%83%20Fibonacci%20%E6%95%B8)，思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本; - [ ] 研讀 [KYG-yaya573142](https://hackmd.io/@sysprog/linux2022-fibdrv#:~:text=%E7%A0%94%E8%AE%80-,KYG%2Dyaya573142%20%E7%9A%84%E5%A0%B1%E5%91%8A,-%EF%BC%8C%E6%8C%87%E5%87%BA%E9%87%9D%E5%B0%8D%E5%A4%A7) 的報告，指出針對大數運算，有哪些加速運算和縮減記憶體操作成本的舉措？ - [ ] lsmod 的輸出結果有一欄名為 `Used by`，這是 “each module’s use count and a list of referring modules”，但如何實作出來呢？模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢？ >搭配閱讀 [The Linux driver implementer’s API guide » Driver Basics](https://www.kernel.org/doc/html/latest/driver-api/basics.html) - [ ] 注意到 fibdrv.c 存在著 DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy 等字樣，什麼場景中會需要呢？撰寫多執行緒的 userspace 程式來測試，觀察 Linux 核心模組若沒用到 mutex，到底會發生什麼問題。嘗試撰寫使用 POSIX Thread 的程式碼來確認。 - [ ] 修正 Fibonacci 數計算的正確性，隨後改善 fibdrv 計算 Fibinacci 數列的執行效率，在 Linux 核心模組和使用層級去測量執行時間。 - 正確性和數值系統的使用 - 務必研讀上述 Linux 效能分析的提示的描述，降低效能分析過程中的干擾; - 在 Linux 核心模組中，可用 ktime 系列的 API; - 在 userspace 可用 clock_gettime 相關 API; - 善用統計模型，除去極端數值，過程中應詳述你的手法 - 分別用 gnuplot 製圖，分析 Fibonacci 數列在核心計算和傳遞到 userspace 的時間開銷，單位需要用 us 或 ns (自行斟酌) - 嘗試解讀上述實驗的時間分佈，特別是隨著 Fibonacci 數列增長後，對於 Linux 核心的影響。可修改 VFS 函式，允許透過寫入 /dev/fibonacci 裝置檔案來變更不同的 Fibonacci 求值的實作程式碼。 ## 建立效能分析實驗 > 以下操作參考[k04 fibdiv](https://hackmd.io/@sysprog/linux2021-fibdrv#%E6%8E%92%E9%99%A4%E5%B9%B2%E6%93%BE%E6%95%88%E8%83%BD%E5%88%86%E6%9E%90%E7%9A%84%E5%9B%A0%E7%B4%A0), [KYG-yaya573142 2020q1 Homework2 (fibdrv)](https://hackmd.io/@KYWeng/rkGdultSU#:~:text=%E9%80%8F%E9%81%8E%E5%B7%A5%E5%85%B7%E9%80%B2%E8%A1%8C,%E5%88%86%E6%9E%90%E7%9A%84%E5%9B%A0%E7%B4%A0) <details> <summary>學習如何使用taskset</summary> #### [taskset](https://linux.die.net/man/1/taskset)來查看行程的 CPU affinity - ##### 什麼是 Affinity mask: > The CPU affinity is represented as a bitmask, with the lowest order bit corresponding to the first logical CPU and the highest order bit corresponding to the last logical CPU ```cpp= 0x00000001 is processor #0 0x00000003 is processors #0 and #1 0xFFFFFFFF is all processors (#0 through #31) ``` - ##### 使用flag `cp`: (p is pid, c is core) 查詢pid 可使用的cpu。 ```cpp= $taskset -cp 1 pid 1's current affinity list: 0-7 ``` - ##### 將行程固定在特定的 CPU 中執行: ```bash= taskset -p COREMASK PID //其中 COREMASK 就是指定的十六進位 core mask askset -cp CORELIST PID //CORELIST 為 CPU 核心列表，以逗點分隔各個核心的編號或是使用連字號指定連續的區間，例如: 0,2,5,7-10。 ``` > 在 Linux 中的使用者必須有開啟 [CAP_SYS_NICE](https://man7.org/linux/man-pages/man7/capabilities.7.html) 這個權限，才能更動行程的處理器親和性 - ##### 以特定的 CPU 執行程式: 使用 taskset 指定 CPU 核心來執行一個新的程式：例如：以第 0 個 CPU 核心執行 vlc 則使用 ```cpp taskset 0x1 vlc // `taskset COREMASK EXECUTABLE` ``` - ##### 限定 CPU 給特定的程式使用光是使用上述的方法，指定的CPU依然可能被其人Process使用。可以使用 `isolcpus` 這個 Linux 核心起始參數，這個參數可以讓特定的 CPU 核心在開機時就被保留下來。 > 設定的方式有兩種，一個是在開機時使用 boot loader 所提供的自訂開機參數功能，手動加入 isolcpus=cpu_id 參數，或是直接加在 [GRUB 的設定檔](http://doc.dpdk.org/spp-18.02/setup/performance_opt.html#reduce-context-switches)中，這樣 Linux 的排程器在預設的狀況下就不會將任何一般行程放在這個被保留的 CPU 核心中執行，只有那些被 taskset 特別指定的行程可使用這些 CPU 核心。 </details> ### 排除干擾效能分析的因素 1. Isolate cpu and assign task to this cpu 修改 `/etc/default/grub` 內的 `GRUB_CMDLINE_LINUX_DEFAULT`，加入 `isolcpus=7` 來指定編號 7 的核心不被排程器賦予任務，修改完成後需 `sudo update-grub` 來更新設定，重開機後即生效 (可從 `/sys/devices/system/cpu/isolated` 確認是否生效) ```shell // sudo vim /etc/default/grub GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=7" ``` 可透過以下命令確認是否設定成功 ```shell $ cat /sys/devices/system/cpu/isolated 7 ``` 接著以 [taskset](http://man7.org/linux/man-pages/man1/taskset.1.html) 指定 CPU 核心來執行程式： ```shell $ taskset -c 7 EXECUTABLE ``` 1. 暫時抑制 [address space layout randomization](https://docs.oracle.com/cd/E37670_01/E36387/html/ol_aslr_sec.html) (ASLR) ```shell $ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space" ``` 2. [設定 scaling_governor 為 performance](https://community.mellanox.com/s/article/how-to-set-cpu-scaling-governor-to-max-performance--scaling-governor-x)。準備以下 shell script，檔名為 `performance.sh`: ```shell for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor do echo performance > ${i} done ``` 之後再用 `$ sudo sh performance.sh` 執行 3. 針對 Intel 處理器，關閉 turbo mode ```shell $ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" ``` > Good refs: https://askubuntu.com/questions/37618/is-turbo-boost-working ## 量測時間的方式 ### user space 使用 clock_gettime 下的 CLOCK_MONOTONIC 來計時，CLOCK_MONOTONIC 是從特定起始時間單調遞增的計時方式。 ### kernel space 使用[htimer](https://lwn.net/Articles/167897/)，放置在device read前後。參考[隔壁同學bakudr18](https://hackmd.io/@MR-9Qa9wQLWglSAUyd6UhQ/HJ363p_Qd#%E6%B8%AC%E9%87%8F-user-space-%E8%88%87-kernel-space-%E6%99%82%E9%96%93)建立device_attribute的做法： >為了方便開啟 kernel space 計時功能，這裡在 [sysfs](https://www.kernel.org/doc/Documentation/filesystems/sysfs.txt) 底下註冊了一個 `ktime_measure` 變數來開關計時功能。根據 [sysfs](https://www.kernel.org/doc/Documentation/filesystems/sysfs.txt) 文件 :::success sysfs is a ram-based filesystem initially based on ramfs. It provides a means to export kernel data structures, their attributes, and the linkages between them to userspace. ::: show 是一個callback，會在讀文件時會觸發，並讀出回傳值。store則是把值寫入。透過寫入 "0" 或 "1" 來調整 ```cpp= static ssize_t ktime_measure_show(struct device *dev, struct device_attribute *attr, char *buf) { return scnprintf(buf, PAGE_SIZE, "%d\n", ktime_enable); } static ssize_t ktime_measure_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { ktime_enable = (bool) (buf[0] - '0'); fib_exec = ktime_enable ? &fib_time_proxy_p : &fib_p; return count; } ``` 同時我們也建立另一個sysfs，來選擇不同的fib_method。 ```cpp= // fib_method related utils function static ssize_t fib_method_show(struct device *dev, struct device_attribute *attr, char *buf) { return scnprintf(buf, PAGE_SIZE, "%d\n", ktime_enable); } static ssize_t fib_method_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { unsigned int fib_method_index = (unsigned int) (buf[0] - '0'); fib_method_set(fib_method_index); return count; } ``` 在client.c 當中設定ktimer 和 fib_method。 ```cpp= #define KTIME_ENABLE "/sys/class/fibonacci/fibonacci/ktime_measure" #define FIB_METHOD "/sys/class/fibonacci/fibonacci/fib_method" int main(){ int fd_kt = open(KTIME_ENABLE, O_RDWR); if (fd_kt < 0) { perror("Failed to open sysfs"); err = 1; fd = fd_kt; goto close_fib; } int fd_fib = open(FIB_METHOD, O_RDWR); if (fd_fib < 0) { perror("Failed to open sysfs"); err = 1; fd = fd_fib; goto close_fib; } // Set ktimer write(fd_kt, "1", 2); close(fd_kt); // Set fib method write(fd_fib, "0", 2); close(fd_fib); } for (int i = 0; i <= offset; i++) { lseek(fd, i, SEEK_SET); long long start = get_nanotime(); sz = read(fd, buf, 1); long long utime = get_nanotime() - start; ssize_t kt = write(fd, NULL, 0); printf("Reading from " FIB_DEV " at offset %d, returned the sequence " "%lld. utime %lld, ktime %ld\n", i, sz, utime, kt); } ``` 可以得到下圖 ![](https://i.imgur.com/un9D2rV.png) ### copy_{from/to}_user ## Fibonacci 實作使用不同的策略，並計算時間 #### [Vorobev 等式](https://hackmd.io/@sysprog/fibonacci-number#Vorobev-%E7%AD%89%E5%BC%8F) \begin{split} F(2k) &= F(k)[2F(k+1) - F(k)] \\ F(2k+1) &= F(k+1)^2+F(k)^2 \end{split} 參考 [pusedo code](https://hackmd.io/@sysprog/linux2021-fibdrv#:~:text=Fast_Fib(n)%0A%20%20%20%20a%20%3D%200%3B%20b%20%3D%201%3B%20%20%20%20%20%20%20//%20m%20%3D%200%0A%20%20%20%20for%20i%20%3D%20(number%20of%20binary%20digit%20in%20n)%20to%201%0A%20%20%20%20%20%20%20%20t1%20%3D%20a*(2*b%20%2D%20a)%3B%0A%20%20%20%20%20%20%20%20t2%20%3D%20b%5E2%20%2B%20a%5E2%3B%0A%20%20%20%20%20%20%20%20a%20%3D%20t1%3B%20b%20%3D%20t2%3B%20//%20m%20*%3D%202%0A%20%20%20%20%20%20%20%20if%20(current%20binary%20digit%20%3D%3D%201)%0A%20%20%20%20%20%20%20%20%20%20%20%20t1%20%3D%20a%20%2B%20b%3B%20//%20m%2B%2B%0A%20%20%20%20%20%20%20%20%20%20%20%20a%20%3D%20b%3B%20b%20%3D%20t1%3B%0A%20%20%20%20return%20a%3B) 可以產生以下的程式碼。 ```cpp= tatic long long fib_sequence_fdouble(unsigned int k) { long long h = 0; for (long long i = k; i; ++h, i >>= 1) ; long long a = 0; // F(0) = 0 long long b = 1; // F(1) = 1 for (long long mask = 1 << (h - 1); mask; mask >>= 1) { // Run h times! // Let j = h-i (looping from i = 1 to i = h), n_j = floor(n / 2^j) = n // >> j (n_j = n when j = 0), k = floor(n_j / 2), then a = F(k), b = // F(k+1) now. long long c = a * (2 * b - a); // F(2k) = F(k) * [ 2 * F(k+1) – F(k) ] long long d = a * a + b * b; // F(2k+1) = F(k)^2 + F(k+1)^2 if (mask & k) { // n_j is odd: k = (n_j-1)/2 => n_j = 2k + 1 a = d; // F(n_j) = F(2k + 1) b = c + d; // F(n_j + 1) = F(2k + 2) = F(2k) + F(2k + 1) } else { // n_j is even: k = n_j/2 => n_j = 2k a = c; // F(n_j) = F(2k) b = d; // F(n_j + 1) = F(2k + 1) } } return a; } ``` 再來可以使用硬體加速，直接去計算leading zero的數量。這邊讓`h=sizeof(long long) * 8 - 1 - clz(k)`，多減一是因為是要找第一個set bit。 ```cpp= static long long fib_sequence_fdouble_clz(unsigned int k) { long long h = 0; long long a = 0; // F(0) = 0 long long b = 1; // F(1) = 1 for (long long mask = 1 << (sizeof(long long) * 8 - 1 - clz(k) - 1); mask; mask >>= 1) { // Run h times! // Let j = h-i (looping from i = 1 to i = h), n_j = floor(n / 2^j) = n // >> j (n_j = n when j = 0), k = floor(n_j / 2), then a = F(k), b = // F(k+1) now. long long c = a * (2 * b - a); // F(2k) = F(k) * [ 2 * F(k+1) – F(k) ] long long d = a * a + b * b; // F(2k+1) = F(k)^2 + F(k+1)^2 if (mask & k) { // n_j is odd: k = (n_j-1)/2 => n_j = 2k + 1 a = d; // F(n_j) = F(2k + 1) b = c + d; // F(n_j + 1) = F(2k + 2) = F(2k) + F(2k + 1) } else { // n_j is even: k = n_j/2 => n_j = 2k a = c; // F(n_j) = F(2k) b = d; // F(n_j + 1) = F(2k + 1) } } return a; } ``` user space comparsion ![](https://i.imgur.com/J8jkiYt.png) ![](https://i.imgur.com/lfNdzP1.png) ![](https://i.imgur.com/3GSTdme.png) 可以發現使用clz明顯會讓kernel space的運算速度提升。 ### Page faults 不論是哪個fib method，都可已發現前幾次的runtime特別高，這應該是Page faults造成的。參考同學<>的[做法](https://hackmd.io/@MR-9Qa9wQLWglSAUyd6UhQ/HJ363p_Qd#Fibonacci-%E5%AF%A6%E4%BD%9C)，可以使用[mlock](https://linux.die.net/man/2/mlockall) > lock part or all of the calling process's virtual address space into RAM, preventing that memory from being paged to the swap area ## 大數運算 ```cpp= typedef struct _bn { unsigned long long lower, upper; } bn; ``` 嘗試使用上面的結構，讓big num可以存到128 digit，需要思考怎麼變回character。 ```cpp= static inline void bn_add(bn *output, bn x, bn y) { output->upper = x.upper + y.upper; if (y.lower > ~x.lower) output->upper++; output->lower = x.lower + y.lower; } ``` ## Mutex