2022q1 Homework3 (fibdrv)

--- tags: linux2022 --- # 2022q1 Homework3 (fibdrv) contributed by < `Xx-oX` > ## 實驗環境 ```shell $ gcc --version gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 43 bits physical, 48 bits virtual CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: AuthenticAMD CPU family: 23 Model: 17 Model name: AMD Ryzen 7 2700U with Radeon Vega Mobile Gfx Stepping: 0 Frequency boost: enabled CPU MHz: 1700.000 CPU max MHz: 2200.0000 CPU min MHz: 1600.0000 BogoMIPS: 4391.49 Virtualization: AMD-V L1d cache: 128 KiB L1i cache: 256 KiB L2 cache: 2 MiB L3 cache: 4 MiB NUMA node0 CPU(s): 0-7 ``` ## 回答==自我檢查清單== - [x] 研讀上述 ==Linux 效能分析==的提示描述，在自己的實體電腦運作 GNU/Linux，做好必要的設定和準備工作 - [x] 研讀上述費氏數列相關材料 (包含論文)，摘錄關鍵手法，並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 - [x] 複習 C 語言數值系統和 bitwise operation，思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本 - [ ] 研讀 KYG-yaya573142 的報告，指出針對大數運算，有哪些加速運算和縮減記憶體操作成本的舉措？ - [ ] lsmod 的輸出結果有一欄名為 Used by，這是 “each module’s use count and a list of referring modules”，但如何實作出來呢？模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢？ > 搭配閱讀 The Linux driver implementer’s API guide » Driver Basics - [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣，什麼場景中會需要呢？撰寫多執行緒的 userspace 程式來測試，觀察 Linux 核心模組若沒用到 mutex，到底會發生什麼問題。嘗試撰寫使用 POSIX Thread 的程式碼來確認。 ## 效能分析設定 ### 實驗環境設定 > 參考 [Linux 效能分析的提示](https://hackmd.io/@sysprog/linux2022-fibdrv#-Linux-%E6%95%88%E8%83%BD%E5%88%86%E6%9E%90%E7%9A%84%E6%8F%90%E7%A4%BA)以及 [KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU#%E6%8E%92%E9%99%A4%E5%B9%B2%E6%93%BE%E6%95%88%E8%83%BD%E5%88%86%E6%9E%90%E7%9A%84%E5%9B%A0%E7%B4%A0) #### 空出編號 6, 7 的兩個核心 > 為何要空出兩個 => 一來預備給未來跑多執行緒或者其他狀況使用，二來不想跟別人一樣XD > 從實驗環境可以看到這顆 cpu 只有一個 NUMA node => 不用考慮共享記憶體的問題修改 `/etc/default/grub` 檔案內的 `GRUB_CMDLINE_LINUX_DEFAULT` ```bash GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=6,7" ``` 然後執行 ```shell $ sudo update-grub ``` 更新設定，接著重新啟動從 /sys/devices/system/cpu/isolated 中確認是否生效 ```shell $ cat /sys/devices/system/cpu/isolated 6-7 ``` #### 將測試程式固定在特定 cpu 上執行 > 這邊使用預留的 6 號 cpu 利用 [taskset](https://man7.org/linux/man-pages/man1/taskset.1.html) 指定特定行程的處理器親和性 (processor affinity) 將行程固定在特定 cpu 上執行 ```shell # 0x40 -> 0100 0000 -> 表示只使用第 6 個 cpu (從左而右為編號 7-0) # 啟動時指定 $ taskset 0x40 <EXECUTABLE> # 更改已經啟動的行程 $ taskset -p 0x40 <PID> # 查看特定行程的處理器親和性 $ taskset -p <PID> ``` #### 排出干擾分析的因素 * 抑制 [address space layout randomization (ASLR)](https://en.wikipedia.org/wiki/Address_space_layout_randomization) ```shell $ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space" ``` * 設定 [scaling_governor](https://www.kernel.org/doc/html/latest/admin-guide/pm/cpufreq.html) 為 performance > 設定 CPU 以最高頻率執行所有 process * 只對 6 號 cpu 進行設定 ```shell $ sudo sh -c "echo performance > /sys/devices/system/cpu/cpu6/cpufreq/scaling_governor" ``` * 設定全部 cpu 準備以下 shell script ```bash # performance.sh for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor do echo performance > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor done ``` ```shell $ sudo sh performance.sh ``` * 關閉 Intel 處理器的 turbo mode 使用之 AMD 處理器沒有這個東西 ### 記錄時間與繪圖 #### Kernel space 在 `fib_read` 中使用 [ktime API](https://www.kernel.org/doc/html/latest/core-api/timekeeping.html) 進行計時，並將結果輸出到 `kernel ring buffer` 中 ```c static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { ktime_t kt_start = ktime_get(); ssize_t result = fib_sequence(*offset); ktime_t duration = ktime_get() - kt_start; printk(KERN_DEBUG "%lld %lld", *offset, ktime_to_ns(duration)); return result; } ``` 使用 [dmesg](https://man7.org/linux/man-pages/man1/dmesg.1.html) 命令可以看到結果 > `sudo dmesg -c` 可以在顯示結果後清空 `kernel ring buffer` #### User space > 記錄之總時間 T = T(user space) + T(kernel space) 在 `client.c` 中使用 [clock_gettime](https://man7.org/linux/man-pages/man2/clock_getres.2.html) 進行計時，並將結果記錄以便繪圖 ```c FILE *fp = fopen(PATH, "w+"); if (!fp) { printf("Time record error!\n"); } for (int i = 0; i <= offset; i++) { clock_gettime(CLOCK_REALTIME, &ts_start); lseek(fd, i, SEEK_SET); sz = read(fd, buf, 1); clock_gettime(CLOCK_REALTIME, &ts_end); printf("Reading from " FIB_DEV " at offset %d, returned the sequence " "%lld.\n", i, sz); char tbuf[16]; snprintf(tbuf, 16, "%d %ld\n", i, ts_end.tv_nsec - ts_start.tv_nsec); fwrite(tbuf, sizeof(char), strlen(tbuf), fp); } fclose(fp); ``` #### 繪圖使用 gnuplot 進行繪圖 ```bash reset set title 'Runtime comparison' set xlabel 'F(n)' set ylabel 'time(nsec)' set term png enhanced font 'Verdana,10' set output 'timeplot.png' set grid plot [0:][0:] \ './original' using 1:2 with linespoints linewidth 2 title 'iterative', \ './fast_doubling' using 1:2 with linespoints linewidth 2 title 'fast doubling' ``` ## 費式數列 > 改寫自[作業要求](https://hackmd.io/@sysprog/linux2022-fibdrv#-%E8%B2%BB%E6%B0%8F%E6%95%B8%E5%88%97) ### Scenario * 成年 (adult) 兔每個月生一隻小兔兔 * 小兔兔 (juvenile) 生長一個月後即成年 * 起初只有一隻小兔兔 * 兔子不會死畫成表格 | Month | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | -------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | Juvenile | 1 | 0 | 1 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | | Adult | 0 | 1 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 | | Total | 1 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 | 144 | 用 $a_n$ 表示成年兔， $j_n$ 表示小兔兔 $F_n$ 表示兔子總量，即 $a_n + j_n$ 又$F_n = a_{n+1}$ = $j_{n+2}$ 整理可得 $$ \left\{ \begin{array}\\ a_{n+1} = a_n + j_n\\ j_{n+1} = a_n\\ \end{array}\\ \right. $$ 寫成矩陣 $$ \begin{pmatrix} a_{n+1} \\ j_{n+1} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} a_n \\ j_n \end{pmatrix} $$ 即 $$ \begin{pmatrix} F_n \\ F_{n-1} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} F_{n-1} \\ F_{n-2} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}^{n-1} \begin{pmatrix} F_1 \\ F_0 \end{pmatrix} $$ 其中 $\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}$ 又稱為 Q-matrix $$ Q = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} F_2 & F_1 \\ F_1 & F_0 \end{pmatrix} \\ Q^n = \begin{pmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{pmatrix} $$ ### Fast Doubling 帶入 $n:= 2n + 1$ $$ \begin{split} \begin{pmatrix} F_{2n+1} \\ F_{2n} \end{pmatrix} &= \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^{2n} \begin{pmatrix} F_1 \\ F_0 \end{pmatrix}\\ \\ &= \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^n \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^n \begin{pmatrix} F_1 \\ F_0 \end{pmatrix}\\ \\ &= \begin{pmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{pmatrix} \begin{pmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix}\\ \\ &= \begin{pmatrix} F_{n+1}^2 + F_n^2\\ F_nF_{n+1} + F_{n-1}F_n \end{pmatrix} \end{split} $$ 因此可得: $$ \begin{aligned} F_{2k} &= F_k ( 2F_{k+1} - F_k ) \\ F_{2k+1} &= F_{k+1}^2+F_k^2 \end{aligned} $$ 其中第一式即 Vorobev 等式 ### 分析 #### 利用定義計算 $F_n = F_{n-1} + F_{n-2}$ 以 $F(6)$ 為例: ```graphviz strict digraph G { 1[label="F(6)"] 2[label="F(4)"] 3[label="F(5)"] 4[label="F(2)"] 5[label="F(3)"] 6[label="F(3)"] 7[label="F(4)"] 8[label="F(0)", style=filled] 9[label="F(1)", style=filled] 10[label="F(1)", style=filled] 11[label="F(2)"] 12[label="F(1)", style=filled] 13[label="F(2)"] 14[label="F(2)"] 15[label="F(3)"] 16[label="F(0)", style=filled] 17[label="F(1)", style=filled] 18[label="F(0)", style=filled] 19[label="F(1)", style=filled] 20[label="F(0)", style=filled] 21[label="F(1)", style=filled] 22[label="F(1)", style=filled] 23[label="F(2)", style=filled] 24[label="F(0)", style=filled] 25[label="F(1)", style=filled] 1 -> {2, 3} 2 -> {4, 5} 3 -> {6, 7} 4 -> {8, 9} 5 -> {10, 11} 6 -> {12, 13} 7 -> {14, 15} 11 -> {16, 17} 13 -> {18, 19} 14 -> {20, 21} 15 -> {22, 23} 23 -> {24, 25} } ``` #### 利用 Fast Doubling 計算一樣以 $F(6)$ 為例: ```graphviz strict digraph G { 1[label="F(6)"] 2[label="F(3)"] 3[label="F(4)"] 4[label="F(1)", style=filled] 5[label="F(2)", style=filled] 6[label="F(2)", style=filled] 7[label="F(3)"] 8[label="F(1)", style=filled] 9[label="F(2)", style=filled] 1 -> {2, 3} 2 -> {4, 5} 3 -> {6, 7} 7 -> {8, 9} } ``` 利用 $F_0 = 0, F_1 = 1, F_2 = 1$ 即可推導出隨後的數值，而且明顯比前面的遞迴定義要有效率。 ## 實作 `fibdrv` ### 使用 Fast Doubling 改進計算效率將上述之 Fast Doubling 手法實作成程式碼 > 參考並使用 `__builtin_clz` 改寫 [Calculating Fibonacci Numbers by Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) 一文中提及的實作 > int __builtin_clz (unsigned int x) > Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined. 加入 `unlikely` 巨集，使編譯器產生對分支預測友善的組合語言 >long __builtin_expect (long exp, long c) > You may use __builtin_expect to provide the compiler with branch prediction information. ... > The return value is the value of exp, which should be an integral expression. ```c #define unlikely(x) __builtin_expect(!!(x), 0) static long long fib_fast_doubling(long long k) { /* use fast doubling method */ // F(0) = 0, F(1) = 1, F(2) = 2 if (unlikely(k < 2)) return k; unsigned long long a = 0, b = 1; unsigned int i = 1u << (31 - __builtin_clz(k)); // find the highest set digit of k for (; i; i >>= 1) { unsigned long long c = a * (2 * b - a); // F(2n) = F(n) * [ 2 * F(n+1) – F(n) ] unsigned long long d = a * a + b * b; // F(2n+1) = F(n)^2 + F(n+1)^2 if (i & k) { // n_j is odd: n = (k_j-1)/2 => k_j = 2n + 1 a = d; // F(k_j) = F(2n+1) b = c + d; // F(k_j + 1) = F(2n + 2) = F(2n) + F(2n+1) } else { // k_j is even: n = k_j/2 => k_j = 2n a = c; // F(k_j) = F(2n) b = d; // F(k_j + 1) = F(2n + 1) } } return a; } ``` #### 實驗並比較計算效率使用前述效能分析設定中的方法，使用 6 號 cpu 進行實驗利用前述記錄時間的方法記錄採用 iterative 與 fast doubling 方法的時間 * Kernel time ![](https://i.imgur.com/rUXoDx8.png) 可以看出在 F(40) 之後兩種方法的時間差持續擴大雖然只有算前 92 項，但大略可以看出 iterative 的時間複雜度約成 O(n)，而 fast doubling 則比較偏向 O(1) * total (user + kernel) time ![](https://i.imgur.com/0LURyBl.png) 去掉過大的資料之後變成 ![](https://i.imgur.com/5kQOyUc.png) 比較 total time 與 kernel time 發現超過 50 項之後 total time 並沒有像 kernel time 所展現的兩種方法的時間差距越來越大預計多作幾次實驗並採平均來降低單一個案的誤差 ### 計算第 93 項（含）之後的 Fibonacci 數 > Unfinished #### SSO (Small String Optimization) [參考資料](http://wdv4758h.github.io/notes/cpp/string-optimization.html) 實作資料結構如下數字小的時候直接使用前 7 byte 來儲存數字大的時候使用陣列，並記錄起始位置及長度 ```c /* * byte map: |----|----|----|----|----|----|----|----| * long ver: | ptr | length | * short ver: | number |flag| */ struct _long { unsigned long long *ptr; unsigned int length; }; struct _short { unsigned char number[sizeof(struct _long) - 1]; unsigned char flag; // 0x0 }; typedef union { struct _long; struct _short; } bn_t; ``` 運用於大數運算: 二元運算時須先判斷兩者的類型，過於麻煩，提昇之效率不符合比例原則。 #### 實作採用類似字元陣列的方式儲存十進位的每位數字，只是將本來一個十進位數字 1 byte 的空間縮減成 1/2 byte > 1/2 byte => 4 bit 可表示 1~15 ，已經滿足需求 #### 資料結構 [測試 code](https://github.com/Xx-oX/BigNumber) 目前資料結構設計 ```c /* * Hadron: basic chunk for storing decimal digits * size: 4 bytes * |xx|xx|xx|xx| * |00|bt|sc|du| (little endian) * @u, d, c, s, t, b: for storing a decimal digit (0~9) * @p: last 8bits of next Hardon chunk's address (00 for null) * "Hadron": composite of "Quark" * Naming by particles of "Quark". */ typedef struct { unsigned char u : 4; //up unsigned char d : 4; //down unsigned char c : 4; //charm unsigned char s : 4; //strange unsigned char t : 4; //top unsigned char b : 4; //bottom unsigned char p; // pointer[-1: -4] } Hadron; /* * buint_t: variable size unsigned big integer * size: 16 bytes * @ptr: list of Hardons * @len: size of list(0~65535) * @addrm: address mask for concating Hadron.p * |xx|xx|xx|xx|xx|xx|xx|xx|xx|xx|xx|xx|xx|xx|xx|00| * | ptr | len | addrm | */ typedef struct{ Hadron *ptr; //start of Hadron list uint16_t len; //size of ptr[] uint64_t addrm : 48; //virtual memory address mask } buint_t; ``` > [Hadron](https://en.wikipedia.org/wiki/Hadron): 強子 > [Quark](https://en.wikipedia.org/wiki/Quark): 夸克，共有六種(即 u, d, c, s, t, b) > 剩下 1 byte padding 作為紀錄下一個區塊記憶體位置結尾 1 byte 利用 bitfield 創造 1/2 byte 的空間，用來儲存 0~9 的數字每 6 個數字打包成一組並使用結合 linked list 與區塊鏈的方式，每個區塊末端儲存下一個區塊的記憶體位置(結尾 1 byte) > Pros: 不用白不用，最後的 1 byte 也會因為 struct alignment 而被創造、計算記憶體使用簡單的 bitwise 運算 > Cons: 需要確保同個 buint_t 下的區塊記憶體位置的前 48 - 8 bits 必須一樣 => 自行實作記憶體管理 > 運行加法、乘法繁瑣 => 遲遲無法完成 #### 加法 * 運用直式加法的手法實作 #### 乘法 * 計畫使用 [Schönhage–Strassen 演算法](https://zh.wikipedia.org/wiki/Sch%C3%B6nhage-Strassen%E6%BC%94%E7%AE%97%E6%B3%95) * 但因為 [FFT](https://en.wikipedia.org/wiki/Fast_Fourier_transform) 實作及理解不易，預計拿掉使用 FFT 的部份，演變成使用線性卷積統一處理進位的簡易版本 ``` e.g. 123 * 456 = 56088 1 2 3 x) 4 5 6 --------------- 6 12 18 5 10 15 +) 4 8 12 -------------------- 4 13 28 27 18 計算卷積 18 27- 28-- 13--- +) 4---- --------- 56088...即為所求 ```