2020q1 Homework2 (fibdrv)

# 2020q1 Homework2 (fibdrv) contributed by < `0xff07` > ## 實驗環境 * `/etc/lsb-release` ``` DISTRIB_ID=Ubuntu DISTRIB_RELEASE=19.10 DISTRIB_CODENAME=eoan DISTRIB_DESCRIPTION="Ubuntu 19.10" ``` * `lscpu` ``` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 70 Model name: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz Stepping: 1 CPU MHz: 1292.163 CPU max MHz: 3700.0000 CPU min MHz: 800.0000 BogoMIPS: 4988.76 Virtualization: VT-x L1d cache: 128 KiB L1i cache: 128 KiB L2 cache: 1 MiB L3 cache: 6 MiB L4 cache: 128 MiB ``` * `uname` ``` Linux 5.3.0-26-generic #28-Ubuntu SMP x86_64 GNU/Linux ``` * gcc ``` gcc (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008 ``` ## 統計執行時間下面會先討論如何對原先的費氏數列實作計時。對於核心空間的函數，使用 eBPF 作為計時工具，計算進入與離開 `fib_read` 所觸發的 kprobe 與 kretprobe 之間的時間差。userspace 中暫時使用 `clock_gettime`，預期接下來使用 uprobe 作為計時工具。 ### 核心：追蹤`fib_read` 這部份的程式是仿照 BCC 中 [vfsreadlat](https://github.com/iovisor/bcc/blob/master/examples/tracing/vfsreadlat.c) 工具的範例，利用他統計 `fib_read` 這個函數： ```python= from bcc import BPF prog = """ #include <uapi/linux/ptrace.h> #include <linux/fs.h> BPF_HASH(start, u64, u64); BPF_HASH(args, u64, loff_t); int probe_handler(struct pt_regs *ctx, struct file *file, char *buf, size_t size, loff_t *offset) { u64 ts = bpf_ktime_get_ns(); u64 pid = bpf_get_current_pid_tgid(); loff_t arg = *offset; start.update(&pid, &ts); args.update(&pid, &arg); return 0; } int ret_handler(struct pt_regs *ctx) { u64 ts = bpf_ktime_get_ns(); u64 pid = bpf_get_current_pid_tgid(); u64 *tsp = start.lookup(&pid); loff_t *offset = args.lookup(&pid); if (tsp != 0 && offset != 0) { bpf_trace_printk("%lld, %llu\\n", *offset, ts - *tsp); start.delete(&pid); args.delete(&pid); } return 0; } """ b = BPF(text=prog) b.attach_kprobe(event="fib_read", fn_name="probe_handler") b.attach_kretprobe(event="fib_read", fn_name="ret_handler") while 1: try: res = b.trace_fields() except ValueError: continue print(res[5].decode("UTF-8")) ``` 這段程式具體的解釋寫在[這裡](https://hackmd.io/@0xff07/r1f4B8aGI)。 > 這裡發現一件事：使用 BPF 進行實驗時，對於那些 `lseek` 到 92 以上的求值，BPF 顯示`fib_ead` 的輸入卻都是 92，而不是 `lseek` 所預期的，比 92 更大的值。 ### userspace : 使用 `clock_gettime` 就是明確地使用 `time.h` 中的函數，並且使用條件編譯的方式決定是否啟動： ```c= [...] #ifdef USERSPACE_TIMER for (int i = 0; i <= offset; i++) { struct timespec start, end; lseek(fd, i, SEEK_SET); clock_gettime(CLOCK_MONOTONIC, &start); sz = read(fd, buf, 1); clock_gettime(CLOCK_MONOTONIC, &end); unsigned long duration = 1000000000 * (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec); printf("%d, %lu\n", i, duration); } #else for (int i = 0; i <= offset; i++) { lseek(fd, i, SEEK_SET); sz = read(fd, buf, 1); printf("Reading from " FIB_DEV " at offset %d, returned the sequence " "%lld.\n", i, sz); } for (int i = offset; i >= 0; i--) { lseek(fd, i, SEEK_SET); sz = read(fd, buf, 1); printf("Reading from " FIB_DEV " at offset %d, returned the sequence " "%lld.\n", i, sz); } #endif [...] ``` 在未 `taskset` 之前，將兩者輸出資料一併做圖： ![](https://i.imgur.com/8vbwLo7.png) 剔除標準差過大的資料後，大致得到下圖的結果： ![](https://i.imgur.com/5lcSd8P.png) ### userspace : uprobe > 這其實是原先預計的作法，不過正在解決一些小問題（下面描述) 因為整個 `client.c` 中只有對 `fobdrv` 進行 `read` ，所以感覺好像可以從 `read` 這個系統呼叫的 tracepoint 下手。不過如果用 `strace` 一看，會發現 `read` 的呼叫不只這些，所以這樣會統計到讀取 `fibdrv` 以外的那些 `read` 呼叫。這時想到可以仿照 lab0-c 的作法，先 `#undef read` 再重新把某個 wrapper `#define` 成 `read`： ```c= ssize_t read_wrapper(int fd, void *buf, size_t count) { return read(fd, buf, count); } #undef read #define read read_wrapper int main() { /* ... */ } ``` 這時如果檢測 `client` 這個可執行檔中的 `uprobe`： ```shell= $ sudo bpftrace -l 'uprobe:/tmp/fibdrv/client:*' uprobe:/tmp/fibdrv/client:__do_global_dtors_aux uprobe:/tmp/fibdrv/client:__libc_csu_fini uprobe:/tmp/fibdrv/client:__libc_csu_init uprobe:/tmp/fibdrv/client:_fini uprobe:/tmp/fibdrv/client:_init uprobe:/tmp/fibdrv/client:_start uprobe:/tmp/fibdrv/client:deregister_tm_clones uprobe:/tmp/fibdrv/client:frame_dummy uprobe:/tmp/fibdrv/client:main uprobe:/tmp/fibdrv/client:read_wrapper uprobe:/tmp/fibdrv/client:register_tm_clones ``` 會發現 `read_wrapper` 列在其中。因此，只要對 `read_wrapper` 這個函數進入點與回傳時所觸發的 uprobe 及 uretprobe 進行時間差的計算，就可以作為一個 user space 中的計時依據。不過，與前面 kprobe 不一樣的狀況是：指定要讀取第幾個費氏數列的值時，是使用 `lseek` 指定。因此在 `read` 呼叫自身並不會知道目前正在讀取第幾個費氏數列。 ## 處理溢位：大數運算在第 93 時就因為溢位錯掉了： ```shell= $ make check [...] f(93) fail input: 7540113804746346429 expected: 12200160415121876738 ``` 既然是溢位，就做一個大數加法給他。這邊大數所使用的資料結構是 `uint64_t` 陣列。假定有這樣一個陣列 `a`，則 `a[i]` 的第 `j` 個位元表示這個大數的第 $(64 \cdot i + j)$ 個位元。而加法的實作原理，則類似對陣列逐原個元素進行直式計算： ```c= int biguint_add (uint64_t *dst, uint64_t *src1, uint64_t *src2, int nints) { uint64_t mask = ((uint64_t)1 << 63) - 1; uint64_t carry = 0; for (int i = 0; i < nints; i++) { uint64_t tmp = (src1[i] & mask) + (src2[i] & mask) + carry; carry = (src1[i] >> 63) + (src2[i] >> 63) + (tmp >> 63); dst[i] = (tmp & mask) + ((carry & 1) << 63); carry >>= 1; } return carry; } ``` `biguint_add` 做的是 `dst = src1 + src2`，其中 `dst`, `src1`, `src2` 是至少有 `nbits` 個 `uint64_t` 的陣列。計算方式是把每一個 `uint64_t` 的最高位與低 63 位分離出來各自加，並且由最高位的結果判斷是否需要進位。變數 `carry` 在每一個迴圈頭時，會存前一個元素來的進位。所以如果迴圈執行結束還有進位，就表示大數加法有溢位。先在 userspace 寫一個程式計算費氏數列並測試： ```c= void biguint_dump (uint64_t *src, int nints) { printf("0x"); for (int i = nints - 1; i >= 0; i--) { printf("%016lx", src[i]); } printf("\n"); } #define UINT64_PER_BINT 17 int main() { uint64_t src1[UINT64_PER_BINT] = {0}; uint64_t src2[UINT64_PER_BINT] = {1}; uint64_t src3[UINT64_PER_BINT]; uint64_t *a1 = &src1[0]; uint64_t *a2 = &src2[0]; uint64_t *a3 = &src3[0]; for (int i = 0; i < 1500; i++) { int ret = biguint_add(a3, a1, a2, UINT64_PER_BINT); biguint_dump(a3, UINT64_PER_BINT); if (ret) { printf("WARNING: bigint addition overflow at %d!\n", i); } uint64_t *tmp = a1; a1 = a2; a2 = a3; a3 = tmp; } } ``` 在沒有溢位發生的狀況下，上面的程式會逐一輸出費氏數列的 16 進位表示法： ```= 0x0000000000000000000000000000000000000000000000000000000000000001 0x0000000000000000000000000000000000000000000000000000000000000002 0x0000000000000000000000000000000000000000000000000000000000000003 0x0000000000000000000000000000000000000000000000000000000000000005 0x0000000000000000000000000000000000000000000000000000000000000008 0x000000000000000000000000000000000000000000000000000000000000000d 0x0000000000000000000000000000000000000000000000000000000000000015 0x0000000000000000000000000000000000000000000000000000000000000022 [...] ``` 把輸出存成 `fib.data`，並用簡單的 python 程式驗證正確性： ```python3= #! /bin/python3 import sys a1 = 0 a2 = 1 with open('fib.data') as f: for line in f: a3 = a2 + a1 assert(a3 == int(line, 16)) a1 = a2 a2 = a3 ``` 當使用 16 個 `uint64_t` (有 1024 bit) 作為紀錄時，會在第 1475 個費氏數列溢位。這個結果跟 [](https://hackmd.io/@sysprog/fibonacci-number) 中，估計費氏數列所需的位元數為 $n \cdot 0.69424$ 的結論相近。新增一個 `read` 的實作： ```c= #define FIB_BUFSZ 16 static ssize_t fib_read_large(struct file *file, char *buf, size_t size, loff_t *offset) { uint64_t src1[FIB_BUFSZ] = {1}; uint64_t src2[FIB_BUFSZ] = {0}; uint64_t src3[FIB_BUFSZ] = {0}; uint64_t *a1 = &src1[0]; uint64_t *a2 = &src2[0]; uint64_t *a3 = &src3[0]; int carry = 0; for (int i = 1; i <= (*offset); i++) { carry = biguint_add(a3, a1, a2, FIB_BUFSZ); uint64_t *tmp = a1; a1 = a2; a2 = a3; a3 = tmp; } if (carry) goto err_overflow; unsigned long ret = copy_to_user(buf, (char*)a2, size); return (size - ret); err_overflow: return -1; } ``` 這個函式會依照 `lseek` 所給定的 offset 來計算數值。除此之外，在 `fibdrv.c` 最前面有一段： ```c= [...] #define DEV_FIBONACCI_NAME "fibonacci" /* MAX_LENGTH is set to 92 because * ssize_t can't fit the number > 92 */ #define MAX_LENGTH 92 [...] ``` 並且在原本的 `lseek` 中，有： ```c= static loff_t fib_device_lseek(struct file *file, loff_t offset, int orig) { loff_t new_pos = 0; switch (orig) { case 0: /* SEEK_SET: */ new_pos = offset; break; case 1: /* SEEK_CUR: */ new_pos = file->f_pos + offset; break; case 2: /* SEEK_END: */ new_pos = MAX_LENGTH - offset; break; } if (new_pos > MAX_LENGTH) new_pos = MAX_LENGTH; // max case if (new_pos < 0) new_pos = 0; // min case file->f_pos = new_pos; // This is what we'll use now return new_pos; } ``` 也就是說：`lseek` 的實作會強制 offset 最大到 92 (這也是前面 `fib_read` 中，比 92 大的輸入最後都會變成 92 的理由)。暫時把這個 `MAX_LENGTH` 改成 1000： ```c= [...] #define DEV_FIBONACCI_NAME "fibonacci" /* MAX_LENGTH is set to 92 because * ssize_t can't fit the number > 92 */ #define MAX_LENGTH 1000 [...] ``` :::warning 現在 fibdrv 對於 VFS 介面的操作較為取巧 (tricky)，一般的 character device driver 不會這樣寫，當初的手段就是儘量壓低程式碼行數。不過你可以想，如何調整 VFS 的介面，讓更大的數值得以傳遞。 :notes: jserv :::