# 2019q1 Homework2 (fibdrv) ## Self-Checking Items * 檔案 fibdrv.c 裡頭的 MODULE_LICENSE, MODULE_AUTHOR, MODULE_DESCRIPTION, MODULE_VERSION 等巨集做了什麼事,可以讓核心知曉呢? insmod 這命令背後,對應 Linux 核心內部有什麼操作呢?請舉出相關 Linux 核心原始碼並解讀 * 當我們透過 insmod 去載入一個核心模組時,為何 module_init 所設定的函式得以執行呢?Linux 核心做了什麼事呢? Seeing the linux source code: ```clike= #define MODULE_LICENSE(_license) MODULE_INFO(license, _license) /* * Author(s), use "Name <email>" or just "Name", for multiple * authors use multiple MODULE_AUTHOR() statements/lines. */ #define MODULE_AUTHOR(_author) MODULE_INFO(author, _author) /* What your module does. */ #define MODULE_DESCRIPTION(_description) MODULE_INFO(description, _description) ``` Refer to MODULE_INFO: ```clike= /* Generic info of form tag = "info" */ #define MODULE_INFO(tag, info) __MODULE_INFO(tag, tag, info) ``` And the refer to MODULE_INFO: ```clike= #ifdef MODULE #define __MODULE_INFO(tag, name, info) \ static const char __UNIQUE_ID(name)[] \ __used __attribute__((section(".modinfo"), unused, aligned(1))) \ = __stringify(tag) "=" info #else /* !MODULE */ /* This struct is here for syntactic coherency, it is not used */ #define __MODULE_INFO(tag, name, info) \ struct __UNIQUE_ID(name) {} #endif ``` It can be seen that if it defines `MODULE`, then it declares a const char array which is put into section `.modinfo` and its value is `__stringify(tag) "=" info`. e.g. `MODULE_LICENSE("Dual MIT/GPL")` will be `license = Dual MIT/GPL` ```clike= #define __stringify_1(x...) #x #define __stringify(x...) __stringify_1(x) ``` When linking, kernel will know `.modinfo` section address and then can know its content further. Each module will register init and exit functions through `module_init` and `module_exit`. At that time, it will declare a `init_module`/`cleanup_module` function whose **alias** name is module's init function and exit function. And when `insmod`, it will call `init_module`, namely module's pre-registered functions. :::info The following source code refering to Linux Kernel v.4.18 ::: ```clike= /* Each module must use one module_init(). */ #define module_init(initfn) \ static inline initcall_t __maybe_unused __inittest(void) \ { return initfn; } \ int init_module(void) __copy(initfn) __attribute__((alias(#initfn))); /* This is only required if you want to be unloadable. */ #define module_exit(exitfn) \ static inline exitcall_t __maybe_unused __exittest(void) \ { return exitfn; } \ void cleanup_module(void) __copy(exitfn) __attribute__((alias(#exitfn))); #endif ``` I found that there are some differences between loadable modules and built-in modules, which will make `module_init` macro expand to different forms. I think here is loadable module and when we see symbol table in fibdrv.ko, there is a `init_module`: ``` lineagech@~/Linux_Kernel_Design/fibdrv$ readelf -s fibdrv.ko Symbol table '.symtab' contains 59 entries: Num: Value Size Type Bind Vis DEFAULT 4 init_module ...... 47: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND cdev_add 48: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND device_create 49: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND printk 50: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND __mutex_init 51: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND mutex_trylock 52: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND unregister_chrdev_region 53: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND mutex_unlock 54: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND device_destroy 55: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND ktime_get 56: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND cdev_init 57: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND cdev_del 58: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND cdev_alloc ``` Using `strace` to see what system calls are called: ``` sudo strace insmod fibdrv.ko ...... finit_module(3, "", 0) = 0 munmap(0x7fe19680d000, 8416) = 0 close(3) = 0 exit_group(0) = ? +++ exited with 0 +++ ``` There is a `finit_module`: ```clike= SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags) { struct load_info info = { }; loff_t size; void *hdr; int err; err = may_init_module(); if (err) return err; pr_debug("finit_module: fd=%d, uargs=%p, flags=%i\n", fd, uargs, flags); if (flags & ~(MODULE_INIT_IGNORE_MODVERSIONS |MODULE_INIT_IGNORE_VERMAGIC)) return -EINVAL; err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX, READING_MODULE); if (err) return err; info.hdr = hdr; info.len = size; return load_module(&info, uargs, flags); } ``` the call `load_module`: ```clike= static int load_module(struct load_info *info, const char __user *uargs, int flags) { struct module *mod; long err; char *after_dashes; ...... /* Now we've got everything in the final locations, we can * find optional sections. */ err = find_module_sections(mod, info); if (err) goto free_unload; err = check_module_license_and_versions(mod); if (err) goto free_unload; /* Set up MODINFO_ATTR fields */ setup_modinfo(mod, info); /* Fix up syms, so that st_value is a pointer to location. */ err = simplify_symbols(mod, info); if (err < 0) goto free_modinfo; err = apply_relocations(mod, info); if (err < 0) goto free_modinfo; err = post_relocation(mod, info); if (err < 0) goto free_modinfo; flush_module_icache(mod); /* Now copy in args */ mod->args = strndup_user(uargs, ~0UL >> 1); if (IS_ERR(mod->args)) { err = PTR_ERR(mod->args); goto free_arch_cleanup; } ...... /* Link in to sysfs. */ err = mod_sysfs_setup(mod, info, mod->kp, mod->num_kp); if (err < 0) goto coming_cleanup; if (is_livepatch_module(mod)) { err = copy_module_elf(mod, info); if (err < 0) goto sysfs_cleanup; } /* Get rid of temporary copy. */ free_copy(info); /* Done! */ trace_module_load(mod); return do_init_module(mod); ``` Finally calling `do_init_module` before returning, `mod->init` is the registered init function: ```clike= static noinline int do_init_module(struct module *mod) { ...... if (mod->init != NULL) ret = do_one_initcall(mod->init); ``` * 試著執行 $ readelf -a fibdrv.ko, 觀察裡頭的資訊和原始程式碼及 modinfo 的關聯,搭配上述提問,解釋像 fibdrv.ko 這樣的 ELF 執行檔案是如何「植入」到 Linux 核心 See `modinfo fibdrv.ko` information: ```sh lineagech@~/Linux_Kernel_Design/fibdrv$ modinfo fibdrv.ko filename: /home/lineagech/Linux_Kernel_Design/fibdrv/fibdrv.ko version: 0.1 description: Fibonacci engine driver author: National Cheng Kung University, Taiwan license: Dual MIT/GPL srcversion: 2761C1DBC64BBDC99E866F4 depends: retpoline: Y name: fibdrv vermagic: 4.18.0-15-generic SMP mod_unload ``` See `readelf -a fibdrv.ko`, we can find a section named `.modinfo`: ``` Section Headers: [Nr] Name Type Address Offset Size EntSize Flags Link Info Align [ 0] NULL 0000000000000000 00000000 0000000000000000 0000000000000000 0 0 0 [ 1] .note.gnu.build-i NOTE 0000000000000000 00000040 0000000000000024 0000000000000000 A 0 0 4 [ 2] .text PROGBITS 0000000000000000 00000070 0000000000000229 0000000000000000 AX 0 0 16 [ 3] .rela.text RELA 0000000000000000 00001338 0000000000000180 0000000000000018 I 22 2 8 [ 4] .init.text PROGBITS 0000000000000000 00000299 0000000000000153 0000000000000000 AX 0 0 1 [ 5] .rela.init.text RELA 0000000000000000 000014b8 00000000000003a8 0000000000000018 I 22 4 8 [ 6] .exit.text PROGBITS 0000000000000000 000003ec 0000000000000040 0000000000000000 AX 0 0 1 [ 7] .rela.exit.text RELA 0000000000000000 00001860 00000000000000d8 0000000000000018 I 22 6 8 [ 8] __mcount_loc PROGBITS 0000000000000000 0000042c 0000000000000030 0000000000000000 A 0 0 1 [ 9] .rela__mcount_loc RELA 0000000000000000 00001938 0000000000000090 0000000000000018 I 22 8 8 [10] .rodata.str1.1 PROGBITS 0000000000000000 0000045c 000000000000006e 0000000000000001 AMS 0 0 1 [11] .rodata.str1.8 PROGBITS 0000000000000000 000004d0 0000000000000080 0000000000000001 AMS 0 0 8 [12] .rodata PROGBITS 0000000000000000 00000560 0000000000000100 0000000000000000 A 0 0 32 [13] .rela.rodata RELA 0000000000000000 000019c8 0000000000000090 0000000000000018 I 22 12 8 [14] .modinfo PROGBITS 0000000000000000 00000660 00000000000000ec 0000000000000000 A 0 0 8 [15] .data PROGBITS 0000000000000000 00000760 0000000000000020 0000000000000000 WA 0 0 32 [16] .rela.data RELA 0000000000000000 00001a58 0000000000000030 0000000000000018 I 22 15 8 [17] .gnu.linkonce.thi PROGBITS 0000000000000000 00000780 0000000000000340 0000000000000000 WA 0 0 64 [18] .rela.gnu.linkonc RELA 0000000000000000 00001a88 0000000000000030 0000000000000018 I 22 17 8 [19] .bss NOBITS 0000000000000000 00000ac0 0000000000000014 0000000000000000 WA 0 0 8 [20] .comment PROGBITS 0000000000000000 00000ac0 0000000000000056 0000000000000001 MS 0 0 1 [21] .note.GNU-stack PROGBITS 0000000000000000 00000b16 0000000000000000 0000000000000000 0 0 1 [22] .symtab SYMTAB 0000000000000000 00000b18 00000000000005a0 0000000000000018 23 39 8 [23] .strtab STRTAB 0000000000000000 000010b8 000000000000027b 0000000000000000 0 0 1 [24] .shstrtab STRTAB 0000000000000000 00001ab8 00000000000000e7 0000000000000000 0 0 1 ``` * 這個 fibdrv 名稱取自 Fibonacci driver 的簡稱,儘管在這裡顯然是為了展示和教學用途而存在,但針對若干關鍵的應用場景,特別去撰寫 Linux 核心模組,仍有其意義,請找出 Linux 核心的案例並解讀。提示: 可參閱 Random numbers from CPU execution time jitter The article mentions that how to use CPU execution jitters as entropy, and /dev/random as well. /dev/random can be used for key-gen: ``` * The two other interfaces are two character devices /dev/random and * /dev/urandom. /dev/random is suitable for use when very high * quality randomness is desired (for example, for key generation or * one-time pads), as it will only return a maximum of the number of * bits of randomness (as estimated by the random number generator) * contained in the entropy pool. ``` source code(linux v4.18): ```clike= const struct file_operations random_fops = { .read = random_read, .write = random_write, .poll = random_poll, .unlocked_ioctl = random_ioctl, .fasync = random_fasync, .llseek = noop_llseek, }; ``` User can get random bytes through reading /dev/random, also can re-seed generator through ioctl. Read operation: ```clike= static ssize_t random_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos) { return _random_read(file->f_flags & O_NONBLOCK, buf, nbytes); } static ssize_t _random_read(int nonblock, char __user *buf, size_t nbytes) { ssize_t n; if (nbytes == 0) return 0; nbytes = min_t(size_t, nbytes, SEC_XFER_SIZE); while (1) { n = extract_entropy_user(&blocking_pool, buf, nbytes); if (n < 0) return n; trace_random_read(n*8, (nbytes-n)*8, ENTROPY_BITS(&blocking_pool), ENTROPY_BITS(&input_pool)); if (n > 0) return n; /* Pool is (near) empty. Maybe wait and retry. */ if (nonblock) return -EAGAIN; wait_event_interruptible(random_read_wait, ENTROPY_BITS(&input_pool) >= random_read_wakeup_bits); if (signal_pending(current)) return -ERESTARTSYS; } } ``` * 查閱 ktime 相關的 API,並找出使用案例 (需要有核心模組和簡化的程式碼來解說) ```clike= /* Nanosecond scalar representation for kernel time values */ typedef s64 ktime_t; ``` ```clike= typedef int64_t s64; ``` There are several basic ktime interface: ```clike= ktime_t ktime_get(void) /*CLOCK_MONOTONIC*/ ktime_t ktime_get_boottime(void) /*CLOCK_BOOTTIME*/ ktime_t ktime_get_real(void) /*CLOCK_REALTIME*/ ktime_t ktime_get_clocktai(void) /*CLOCK_TAI, International Atomic Time (TAI)*/ ktime_t ktime_get_raw(void) /*CLOCK_MONOTONIC_RAW, without NTP adjustments f or clock drift*/ ``` Example is like sleep function following, using ktime_add_us to get target time and waiting until it is expired. ``` /** * usleep_range - Sleep for an approximate time * @min: Minimum time in usecs to sleep * @max: Maximum time in usecs to sleep * * In non-atomic context where the exact wakeup time is flexible, use * usleep_range() instead of udelay(). The sleep improves responsiveness * by avoiding the CPU-hogging busy-wait of udelay(), and the range reduces * power usage by allowing hrtimers to take advantage of an already- * scheduled interrupt instead of scheduling a new one just for this sleep. */ void __sched usleep_range(unsigned long min, unsigned long max) { ktime_t exp = ktime_add_us(ktime_get(), min); u64 delta = (u64)(max - min) * NSEC_PER_USEC; for (;;) { __set_current_state(TASK_UNINTERRUPTIBLE); /* Do not return before the requested sleep time has elapsed */ if (!schedule_hrtimeout_range(&exp, delta, HRTIMER_MODE_ABS)) break; } } ``` * clock_gettime 和 High Resolution TImers (HRT) 的關聯為何?請參閱 POSIX 文件並搭配程式碼解說 According to [High Resolution TImers (HRT)](https://elinux.org/High_Resolution_Timers), I check my system timers to see if they are supporting high-resolution timers. :::info If your clock supports high resolution, it will have a .resolution value of 1 nsecs. If it does not, then it will have a .resolution value that equals the number of nanoseconds in a jiffy (usually 10000 nsecs, on embedded platforms): ::: ```sh= lineagech@/tmp$ sudo cat /proc/timer_list ...... active timers: clock 4: .base: ffff95512f39cd40 .index: 4 .resolution: 1 nsecs .get_time: ktime_get .offset: 0 nsecs active timers: clock 5: .base: ffff95512f39cd80 .index: 5 .resolution: 1 nsecs .get_time: ktime_get_real .offset: 1552500249691559463 nsecs ...... active timers: .expires_next : 8312236000000 nsecs .hres_active : 1 ...... ``` Write a test program to check if clock_gettime uses HRT: ```clike= #include <stdio.h> #include <stdlib.h> #include <time.h> int main (int argc, char **argv) { struct timespec res; int re = clock_getres(CLOCK_MONOTONIC, &res); printf("%lu sec %lu nsec\n", res.tv_sec, res.tv_nsec); return 0; } ``` ```sh= lineagech@/tmp$ ./a.out 0 sec 1 nsec ``` * fibdrv 如何透過 Linux Virtual File System 介面,讓計算出來的 Fibonacci 數列得以讓 userspace (使用者層級) 程式 (本例就是 client.c 程式) 得以存取呢?解釋原理,並撰寫或找出相似的 Linux 核心模組範例 Linux provides a unified Virtual File System supporting several operations, like read, write, stat etc. Here kernel modules need to hand its own `file_operation` to VFS through `cdev_add` ([Character Device Files - Creation & Operations](https://sysplay.github.io/books/LinuxDrivers/book/Content/Part05.html)). When `insmod`, kerenel modules will be linked to sysfs. `man sysfs`: :::info The sysfs filesystem is a pseudo-filesystem which provides an interface to kernel data structures. (More precisely, the files and directories in sysfs provide a view of the kobject structures defined internally within the kernel.) The files under sysfs provide information about devices, kernel modules, filesystems, and other kernel components. ::: User provides file descriptor and user-space buffer and uses `read` system call to find provided inode and file struct as arguments and call registered read API in kernel modules, here is fib_read. Kernel module can just do something and return result or put results into user's buffer through `copy_to_user`. ![](https://upload.wikimedia.org/wikipedia/commons/thumb/6/65/Simplified_Structure_of_the_Linux_Kernel.svg/1920px-Simplified_Structure_of_the_Linux_Kernel.svg.png) Linux Virtual File System: >* Manage kernel level file abstractions in one format for all file systems >* Receives system call requests from user level(e.g. write, open, stat, link) >* Interact with a specific file system based on mount point traversal >* Receive requests from other part of the kernel, mostly from memory management >* VFS Data structure - inode: VFS abstraction for the file - file - superblock: Handle metadata(attributes), like device, blocksize, dirty flags, list of dirty inodes - dentry: a name to inode translation structure * 注意到 fibdrv.c 存在著 DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題 When existing multi-porcesses or multi-threads, they might access the kernel module simultaneously, possible to have race conditions. * 許多現代處理器提供了 clz / ctz 一類的指令,你知道如何透過演算法的調整,去加速 費氏數列 運算嗎?請列出關鍵程式碼並解說 My implementation is using doubling method: :::info $F_{2n}=F_n*2F_{n+1}-F_n^2$ $F_{2n+1}=F_n^2+F_{n+1}^2$ ::: Check bit starting from the first bit 1 which is found from MSB. Initial state is either $F_0=0$ or $F_1=1$, depending on the first bit is 0 or 1. And we can use `__builtin_clz` to find the first position of bit 1 and start to calculate next $F_n$. The process is traversing the first bit 1 from MSB to LSB, so when move to next bit, it is like multiplying 2 for current $F_n$, that is $F_{2n}$. What if next bit is 1, then current value becomes $F_{2n+1}$. Until moving to LSB, we can get the result. ```clike= long long curr; long long a = 0; long long b = 1; long long next_a, next_b; int i; int lz = __builtin_clz(k); int tz = __builtin_ctz(k); int count = 1, target = ((sizeof(long long) << 3) - lz); if (k == 0) return 0; curr = k >> ((sizeof(long long) << 3) - lz - count); while (count <= target) { next_a = a * ((b << 1) - a); next_b = a * a + b * b; if ((curr & 0x1) == 1) { a = next_b; b = next_a + next_b; } else { a = next_a; b = next_b; } curr = k >> ((sizeof(long long) << 3) - lz - (++count)); } return a; ``` ## Fibonacci * Peformance evaluation First we see the execution time in kernel space and user space individually: ![](https://i.imgur.com/OmCyhZ1.png) And if using fast-fibonacci algorithm: ![](https://i.imgur.com/tojOtFM.png)