# 2020q1 Homework2 (fibdrv)
contributed by < `ignite1771` >
## Development environment
Following [Linux 效能分析的提示: 排除干擾效能分析的因素](https://hackmd.io/CTwO5ldOSgKxP-N4YNRVOA?view#排除干擾效能分析的因素), I first tried running experiments under QEMU/KVM with the following script:
```shell
qemu-system-x86_64 \
-m 2048 \
...(omitted)... \
-accel hvf \
-cpu host -smp 2 \ # Specify 2 cores Guest can use
...(omitted)...
```
I then tried to set scaling_governor inside the guest, but found that the directory `/sys/devices/system/cpu/cpu*/cpufreq` does not exist (while the directory `/sys/devices/system/cpu/cpufreq` does):
:::danger
The English "directory" translates to 「目錄」, never "folder"; the latter translates to 「資料夾」.
:notes: jserv
:::
- **Assessment**: although CPUs are assigned to the VM, the CPUs seen inside the VM (virtual CPUs, or vCPUs) are not real physical CPUs but merely a mapping, so a vCPU cannot perform [CPU frequency scaling](https://wiki.archlinux.org/index.php/CPU_frequency_scaling)
In a thread on the [QEMU mailing list](https://lists.gnu.org/archive/html/qemu-devel/2013-09/msg02807.html), Gleb Natapov (one of QEMU's many contributors) noted:
> ... take guest cpu count, host cpu model and frequency, pcpu/vcpu over commit (if any), guest/host memory overcommit (if any) and estimate performance based on this. For pure computational
> performance guest core performance should be close to host core
> performance if there is not cpu/memory overcommit.
- **Conclusion**: the development environment used in [2020q1 Homework1 (lab0)](https://hackmd.io/@ignite1771/HksYncqEI#開發環境) must be abandoned in order to obtain accurate performance measurements:
:::warning
Good, you have faced the facts. Use GNU/Linux every day and you will grow stronger.
:notes: jserv
:::
### :new: New development environment
#### OS
```shell
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04 LTS (Focal Fossa)"
```
#### Linux kernel version
```shell
$ uname -a
Linux ben-m51ad-ubuntu 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23 00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
```
#### CPU info (4 cores)
```shell
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
Stepping: 3
CPU MHz: 1008.870
CPU max MHz: 3400.0000
CPU min MHz: 800.0000
BogoMIPS: 6385.58
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-3
```
#### GCC version
```shell
$ gcc --version
gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0
```
## Self-check list
- [x] Read the [Linux 效能分析的提示](https://hackmd.io/CTwO5ldOSgKxP-N4YNRVOA?view#-Linux-效能分析的提示) notes above, run GNU/Linux on a physical machine, and complete the necessary setup $\to$ this also explains why experiments should not be run inside a virtual machine;
- [x] Read the Fibonacci-related materials above (including the papers), extract the key techniques, and consider how instructions like [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) help Fibonacci computation. List and explain the key code
- [ ] Review C [numerics](https://hackmd.io/@sysprog/c-numerics) and [bitwise operations](https://hackmd.io/@sysprog/c-bitwise), and consider how implementations of fast Fibonacci algorithms can reduce the cost of multiplication;
- [ ] The `lsmod` output has a column named `Used by`, which is "each module's use count and a list of referring modules" -- how is this implemented? How does the Linux kernel track inter-module dependencies and the actual use count ([reference counting](https://en.wikipedia.org/wiki/Reference_counting))?
- [x] Notice that `fibdrv.c` contains `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy`, etc. In what scenarios are these needed? Write a multithreaded userspace program to test, and observe what goes wrong if a Linux kernel module does not use a mutex
## Analysis of advanced Fibonacci number algorithms
### Tail-Recursion
Pseudo-code:
```
Tail_recursion_fib(n, a, b)
    if (n == 0)
        return a
    return Tail_recursion_fib(n - 1, b, a + b)
```
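A minimal C sketch of this pseudocode, using a hypothetical helper name `fib_tail` and 64-bit values (valid only up to Fib(93)); with `-O2`, gcc can turn the tail call into a loop:

```c
#include <stdint.h>

/* fib_tail(n, 0, 1) returns F(n): a and b carry [F(i), F(i+1)]
 * while n counts down to 0. */
static uint64_t fib_tail(unsigned int n, uint64_t a, uint64_t b)
{
    if (n == 0)
        return a;
    return fib_tail(n - 1, b, a + b);
}
```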
### Fast Doubling
Pseudo-code:
```
Fast_Fib(n)
    a = 0; b = 1;                       // m = 0
    for i = (number of binary digits in n) to 1
        t1 = a*(2*b - a);
        t2 = b^2 + a^2;
        a = t1; b = t2;                 // m *= 2
        if (current binary digit == 1)
            t1 = a + b;                 // m++
            a = b; b = t1;
    return a;
```
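A C sketch of this pseudocode under the assumption of 64-bit values (so it is only valid up to Fib(93)); the function name `fast_fib` is illustrative:

```c
#include <stdint.h>

/* Walk the bits of n from most significant to least significant,
 * maintaining [a, b] = [F(m), F(m+1)]. Doubling m each round and
 * incrementing it when the current bit is set rebuilds m == n. */
static uint64_t fast_fib(unsigned int n)
{
    uint64_t a = 0, b = 1; /* [F(0), F(1)] */
    for (int i = 31; i >= 0; i--) {
        uint64_t t1 = a * (2 * b - a); /* F(2m)   */
        uint64_t t2 = a * a + b * b;   /* F(2m+1) */
        a = t1;
        b = t2;
        if (n & (1u << i)) { /* current binary digit == 1 */
            t1 = a + b;
            a = b;
            b = t1;
        }
    }
    return a;
}
```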
Discussion 1: why does the `if (current binary digit == 1)` branch turn $[F_{2n}, F_{2n+1}]$ into $[F_{2n+1}, F_{2n+2}]$?
> Explanation: quoting the code from [Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling):

Discussion 2: while reading the code above, a question came up: why does it scan from the **highest bit** of n downward?
- Initial thought: the first pass should repeatedly apply $n\gets\lfloor n/2 \rfloor$ until $n=1$, which would mean deciding whether n is odd or even starting from the lowest bit of n?
- As the *ITERATIVE (BOTTOM-UP) APPROACH* in [Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) mentions, this can be implemented with a stack.
> ...
> The reason we need the stack is to get the track for each $n$, where $n\gets\lfloor n/2 \rfloor$ until $n=1$. And the track is used to determine what state we should update from $[F_{n},F_{n+1}]$, to $[F_{2n},F_{2n+1}]$ or $[F_{2n+1},F_{2n+2}]$, by the given $n$ is even or odd.
- Since a stack-based implementation costs extra memory, consider the alternative described in the *NON-STACK APPROACH*:
- > ...
  > In the above implementation, we put the $n_0=n, n_1, n_2, \dots, n_j, \dots, n_{t-1}, n_t=1$, where $n_j=\lfloor n/2^j \rfloor$ denotes $n$ is right shifted by $j$ bits ($n_j$ = `n >> j`) and $j \ge 1$
- In other words, `n >> j` with $j$ decreasing (in the first round, with $j$ equal to the bit length of n minus 1, only the **highest bit** of n remains)
- makes the first pass equivalent to repeatedly applying $n\gets\lfloor n/2 \rfloor$ until $n=1$.
- Question resolved.
## Reading the Fibonacci materials above (including papers): key techniques, and how clz / ctz-style instructions help Fibonacci computation
### Comparing the execution time of the algorithms (with performance-measurement interference excluded)
#### 1. Where to measure
- Measure in kernel space
- Use the `ktime` API in `fib_read` to record the execution time of each algorithm
- Print the results with `printk()`
```clike
static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    /* record execution time in kernel space */
    ktime_t kt = ktime_get();
    fib_sequence(*offset);
    kt = ktime_sub(ktime_get(), kt);

    ktime_t kt2 = ktime_get();
    ssize_t result = fib_tail_recursive(*offset, 0, 1);
    kt2 = ktime_sub(ktime_get(), kt2);

    ktime_t kt3 = ktime_get();
    fib_fast_doubling(*offset);
    kt3 = ktime_sub(ktime_get(), kt3);

    printk(KERN_INFO "%lld %lld %lld", ktime_to_ns(kt),
           ktime_to_ns(kt2), ktime_to_ns(kt3));
    return result;
}
```
#### 2. Pin the process's CPU affinity
- The CPU affinity mask and list of a process in the native development environment:
```
$ taskset -p <pid>
pid <pid>'s current affinity mask: f
$ taskset -cp <pid>
pid <pid>'s current affinity list: 0-3
```
- Without using `taskset` to bind the process to a specific CPU:

- After binding to CPU 0 with `taskset 0x1 sudo ./client`:
:::info
`0x1` is `0001` in binary; the lowest (rightmost, LSB) bit stands for CPU core 0, the next bit for core 1, and so on.
:::

- Isolate specific CPU core(s) at boot:
    - Following this [tutorial](https://charleslin74.pixnet.net/blog/post/405173965), add the parameter `isolcpus=<cpu_id0, cpu_id1...>` after the `linux` boot command in the boot loader configuration `/boot/grub/grub.cfg`, so that no process runs on the specified CPU(s) unless explicitly bound with `taskset`
:::info
Run `top` and then press `1` to see per-core CPU usage; for details see [Linux "top" command: What are us, sy, ni, id, wa, hi, si and st (for CPU usage)?](https://unix.stackexchange.com/questions/18918/linux-top-command-what-are-us-sy-ni-id-wa-hi-si-and-st-for-cpu-usage)
:::

#### 3. Eliminating other sources of CPU performance interference
- Following [排除干擾效能分析的因素](https://hackmd.io/CTwO5ldOSgKxP-N4YNRVOA?view#排除干擾效能分析的因素): (1) suppress address space layout randomization (ASLR, which loads program data at randomized addresses to mitigate buffer overflow attacks), (2) set the cpufreq governor to `performance` (pin the CPU at its highest supported frequency instead of scaling it dynamically), (3) disable turbo mode on Intel processors
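The three adjustments above can be scripted; a sketch assuming an Ubuntu system with per-core sysfs cpufreq entries and, for step (3), the intel_pstate driver:

```shell
# (1) disable ASLR (restore the default with value 2)
sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"
# (2) pin every core's governor to performance
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    sudo sh -c "echo performance > $g"
done
# (3) disable turbo mode (intel_pstate driver)
sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
```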

### Automation: following [OscarShiang](https://hackmd.io/@oscarshiang/S1mexZhEL), run the computation $10^3$ times with a Python script so that no single extreme timing value skews the experiment
- Use `dmesg` to read the `printk()` output, and `sudo dmesg -c` to clear the buffer.
```python
#!/usr/bin/env python3
import os
import pandas as pd

PATH = '/home/ben/Desktop/statistic'
times = int(1e3)
sequence = pd.DataFrame(columns=range(times))
trecursion = pd.DataFrame(columns=range(times))
doubling = pd.DataFrame(columns=range(times))

# executing experiment
for i in range(times):
    os.system('taskset 0x1 sudo ./client')
    os.system('dmesg > ' + PATH + '/tmp/data' + str(i))
    os.system('sudo dmesg -c')

for i in range(times):
    f = open(PATH + '/tmp/data' + str(i), 'r')
    content = f.read()
    raw = content.split('\n')
    raw.remove('')
    seq = []
    tr = []
    double = []
    for j in range(len(raw)):
        numbers = raw[j].split(' ')
        # numbers[:2] is the timestamp of the printed message
        seq.append(int(numbers[2]))
        tr.append(int(numbers[3]))
        double.append(int(numbers[4]))
    sequence[i] = seq
    trecursion[i] = tr
    doubling[i] = double
    print('data' + str(i) + ' completed')

print(sequence)
print('-' * 30)
print(trecursion)
print('-' * 30)
print(doubling)

# processing the data: average only samples within 2 standard deviations
mean_seq, std_seq = sequence.mean(axis=1), sequence.std(axis=1)
mean_tr, std_tr = trecursion.mean(axis=1), trecursion.std(axis=1)
mean_double, std_double = doubling.mean(axis=1), doubling.std(axis=1)
seq = []
tr = []
double = []
print('processing sequence data...')
for i in range(len(sequence.index)):
    total = 0
    cnt = 0
    for j in sequence.iloc[i]:
        if abs(j - mean_seq[i]) < 2 * std_seq[i]:
            total += j
            cnt += 1
    seq.append(total / cnt)
print('processing trecursion data...')
for i in range(len(trecursion.index)):
    total = 0
    cnt = 0
    for j in trecursion.iloc[i]:
        if abs(j - mean_tr[i]) < 2 * std_tr[i]:
            total += j
            cnt += 1
    tr.append(total / cnt)
print('processing doubling data...')
for i in range(len(doubling.index)):
    total = 0
    cnt = 0
    for j in doubling.iloc[i]:
        if abs(j - mean_double[i]) < 2 * std_double[i]:
            total += j
            cnt += 1
    double.append(total / cnt)

out = []
for i, j, k, l in zip(range(len(tr)), seq, tr, double):
    out.append('{} {} {} {}'.format(i + 1, j, k, l))

# output the result
f = open(PATH + '/runtime_result', 'w')
print('file write: {}'.format(f.write('\n'.join(out))))
f.close()
# move result to project directory
os.system('mv {}/runtime_result ./runtime_result'.format(PATH))
# TODO: remove this after implementation of calculating the 93rd Fibonacci number
os.system("sed -i '93,$ d' runtime_result")
# use gnuplot to make a result graph
os.system("gnuplot runtime.gp")
```

### Hardware acceleration with clz/ctz: count the unneeded leading zeros and skip them
- gcc provides the built-in [`__builtin_clz(x)`](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html):
> Returns the number of leading 0-bits in x, starting at the most significant bit position. **If x is 0, the result is undefined.**
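A sketch of applying this to fast doubling: instead of always looping from bit 31, `__builtin_clz` locates the highest set bit so the leading zeros are skipped (hypothetical name `fib_clz`; 64-bit values, and `n == 0` must be special-cased because `__builtin_clz(0)` is undefined):

```c
#include <stdint.h>

static uint64_t fib_clz(unsigned int n)
{
    if (n == 0)
        return 0; /* __builtin_clz(0) is undefined */
    uint64_t a = 0, b = 1; /* [F(0), F(1)] */
    /* start right at the highest set bit of n */
    for (int i = 31 - __builtin_clz(n); i >= 0; i--) {
        uint64_t t1 = a * (2 * b - a); /* F(2m)   */
        uint64_t t2 = a * a + b * b;   /* F(2m+1) */
        a = t1;
        b = t2;
        if (n & (1u << i)) {
            t1 = a + b;
            a = b;
            b = t1;
        }
    }
    return a;
}
```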

### Using the `unlikely` macro (defined in [linux/compiler.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler.h)) to hint the compiler into emitting branch-predictor-friendly code
- Key code:
```diff
+ #define unlikely(x) __builtin_expect(!!(x), 0)
+ if (unlikely(!k))
- if (k == 0)
      return 0;
```

### Comparison of Fibonacci sequence algorithms
Reference: [你所不知道的C語言:遞迴呼叫篇](https://hackmd.io/@sysprog/c-recursion?type=view#Fibonacci-sequence)
| Method | Time complexity |
| -------- | -------- |
| Iterative | $O(n)$ |
| Tail Recursion | $O(n)$ |
| Fast Doubling (each step $k = \frac{n - 1}{2}$ or $\frac{n}{2}$) | $O(\log n)$ |
## Big-number arithmetic: computing up to Fib(100)
Reproducing the implementation by [AndybnACT](https://hackmd.io/@AndybnA/fibdrv), design `bigint.h`:
1. A 128-bit unsigned integer struct:
```cpp
typedef struct uint128 {
    unsigned long long upper;
    unsigned long long lower;
} uint128_t;
```
2. `static inline` helper functions for 128-bit unsigned arithmetic:
- `void add128(uint128_t *out, uint128_t op1, uint128_t op2)`
- `void sub128(uint128_t *out, uint128_t op1, uint128_t op2)`
- `void lsft128(uint128_t *out, uint128_t op, uint8_t shift)`
- `void rsft128(uint128_t *out, uint128_t op, uint8_t shift)`
- `void mul128(uint128_t *out, uint128_t op1, uint128_t op2)`
With these, correctness can be verified up to Fib(100).
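A sketch of how `add128` can be written under the struct above: add the two 64-bit halves and detect the carry via unsigned wraparound (one plausible implementation, not necessarily AndybnACT's exact code):

```c
#include <stdint.h>

typedef struct uint128 {
    unsigned long long upper;
    unsigned long long lower;
} uint128_t;

static inline void add128(uint128_t *out, uint128_t op1, uint128_t op2)
{
    out->lower = op1.lower + op2.lower;
    out->upper = op1.upper + op2.upper;
    if (out->lower < op1.lower) /* unsigned wraparound => carry into upper */
        out->upper++;
}
```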

However, with big-number arithmetic, the Fast Doubling algorithm performs far worse than the addition-only Iterative and Tail Recursion algorithms because it relies on the custom multiplication; presumably this is because the CPU accelerates native multiplication in hardware.
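To see why the custom multiply is expensive, here is a sketch of a schoolbook `mul128` (result truncated to 128 bits): the low halves are split into 32-bit limbs so that every partial product fits in 64 bits, turning one logical multiply into several multiplies and adds:

```c
#include <stdint.h>

typedef struct uint128 {
    unsigned long long upper;
    unsigned long long lower;
} uint128_t;

static inline void mul128(uint128_t *out, uint128_t op1, uint128_t op2)
{
    /* split the 64-bit low halves into 32-bit limbs */
    uint64_t a_lo = (uint32_t) op1.lower, a_hi = op1.lower >> 32;
    uint64_t b_lo = (uint32_t) op2.lower, b_hi = op2.lower >> 32;

    uint64_t p0 = a_lo * b_lo; /* bits 0..63   */
    uint64_t p1 = a_lo * b_hi; /* bits 32..95  */
    uint64_t p2 = a_hi * b_lo; /* bits 32..95  */
    uint64_t p3 = a_hi * b_hi; /* bits 64..127 */

    /* combine the middle partial products, tracking the carry */
    uint64_t mid = (p0 >> 32) + (uint32_t) p1 + (uint32_t) p2;
    out->lower = (mid << 32) | (uint32_t) p0;
    out->upper = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32)
                 + op1.lower * op2.upper + op1.upper * op2.lower;
}
```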
## How is reference counting implemented?
References:
- [Reference Counts](http://books.gigatux.nl/mirror/kerneldevelopment/0672327201/ch17lev1sec7.html)
- Paper: [kobjects and krefs](http://www.kroah.com/linux/talks/ols_2004_kref_paper/Reprint-Kroah-Hartman-OLS2004.pdf)
- Paper: [Overview of Linux-Kernel Reference Counting](http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2167.pdf)
## The importance of mutexes in Linux kernel modules
The `DEFINE_MUTEX` macro defined at `linux/mutex.h:128` shows that it declares and initializes a mutex.
```cpp
#define DEFINE_MUTEX(mutexname) \
struct mutex mutexname = __MUTEX_INITIALIZER(mutexname)
```
1. First, define a mutex `fib_mutex`:
```cpp
static DEFINE_MUTEX(fib_mutex);
```
2. `mutex_init`: in the `init_fib_dev` function:
```cpp
static int __init init_fib_dev(void)
{
    int rc = 0;
    mutex_init(&fib_mutex);
    //...
```
3. `mutex_trylock`: in the `fib_open` function:
```cpp
static int fib_open(struct inode *inode, struct file *file)
{
    if (!mutex_trylock(&fib_mutex)) {
        printk(KERN_ALERT "fibdrv is in use");
        return -EBUSY;
    }
    return 0;
}
```
Here `mutex_trylock` ensures that only one thread at a time can open this character device driver, preventing concurrent access by multiple threads.
4. `mutex_unlock`: in the `fib_release` function:
```cpp
static int fib_release(struct inode *inode, struct file *file)
{
    mutex_unlock(&fib_mutex);
    return 0;
}
```
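The "only one opener at a time" behavior can be mimicked in userspace with POSIX threads: `pthread_mutex_trylock` returns non-zero immediately when the lock is already held, just as `mutex_trylock` fails and `fib_open` returns `-EBUSY`. A minimal sketch (the names `try_open` and `do_release` are illustrative):

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* analogous to fib_open: fail immediately instead of blocking */
static int try_open(void)
{
    if (pthread_mutex_trylock(&lock) != 0)
        return -1; /* analogous to returning -EBUSY */
    return 0;
}

/* analogous to fib_release */
static void do_release(void)
{
    pthread_mutex_unlock(&lock);
}
```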
## Miscellaneous
- The purpose of tracking a module's reference count is to enable garbage collection (unloading the module only when nothing still uses it).