# 2019q1 Homework2 (fibdrv)
## Self-Checking Items
* 檔案 fibdrv.c 裡頭的 MODULE_LICENSE, MODULE_AUTHOR, MODULE_DESCRIPTION, MODULE_VERSION 等巨集做了什麼事,可以讓核心知曉呢? insmod 這命令背後,對應 Linux 核心內部有什麼操作呢?請舉出相關 Linux 核心原始碼並解讀
* 當我們透過 insmod 去載入一個核心模組時,為何 module_init 所設定的函式得以執行呢?Linux 核心做了什麼事呢?
Seeing the linux source code:
```clike=
#define MODULE_LICENSE(_license) MODULE_INFO(license, _license)
/*
* Author(s), use "Name <email>" or just "Name", for multiple
* authors use multiple MODULE_AUTHOR() statements/lines.
*/
#define MODULE_AUTHOR(_author) MODULE_INFO(author, _author)
/* What your module does. */
#define MODULE_DESCRIPTION(_description) MODULE_INFO(description, _description)
```
Refer to MODULE_INFO:
```clike=
/* Generic info of form tag = "info" */
#define MODULE_INFO(tag, info) __MODULE_INFO(tag, tag, info)
```
And the refer to MODULE_INFO:
```clike=
#ifdef MODULE
#define __MODULE_INFO(tag, name, info) \
static const char __UNIQUE_ID(name)[] \
__used __attribute__((section(".modinfo"), unused, aligned(1))) \
= __stringify(tag) "=" info
#else /* !MODULE */
/* This struct is here for syntactic coherency, it is not used */
#define __MODULE_INFO(tag, name, info) \
struct __UNIQUE_ID(name) {}
#endif
```
It can be seen that if it defines `MODULE`, then it declares a const char array which is put into section `.modinfo` and its value is `__stringify(tag) "=" info`.
e.g. `MODULE_LICENSE("Dual MIT/GPL")` will be `license = Dual MIT/GPL`
```clike=
#define __stringify_1(x...) #x
#define __stringify(x...) __stringify_1(x)
```
When linking, kernel will know `.modinfo` section address and then can know its content further.
Each module will register init and exit functions through `module_init` and `module_exit`. At that time, it will declare a `init_module`/`cleanup_module` function whose **alias** name is module's init function and exit function. And when `insmod`, it will call `init_module`, namely module's pre-registered functions.
:::info
The following source code refering to Linux Kernel v.4.18
:::
```clike=
/* Each module must use one module_init(). */
#define module_init(initfn) \
static inline initcall_t __maybe_unused __inittest(void) \
{ return initfn; } \
int init_module(void) __copy(initfn) __attribute__((alias(#initfn)));
/* This is only required if you want to be unloadable. */
#define module_exit(exitfn) \
static inline exitcall_t __maybe_unused __exittest(void) \
{ return exitfn; } \
void cleanup_module(void) __copy(exitfn) __attribute__((alias(#exitfn)));
#endif
```
I found that there are some differences between loadable modules and built-in modules, which will make `module_init` macro expand to different forms. I think here is loadable module and when we see symbol table in fibdrv.ko, there is a `init_module`:
```
lineagech@~/Linux_Kernel_Design/fibdrv$ readelf -s fibdrv.ko
Symbol table '.symtab' contains 59 entries:
Num: Value Size Type Bind Vis DEFAULT 4 init_module
......
47: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND cdev_add
48: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND device_create
49: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND printk
50: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND __mutex_init
51: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND mutex_trylock
52: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND unregister_chrdev_region
53: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND mutex_unlock
54: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND device_destroy
55: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND ktime_get
56: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND cdev_init
57: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND cdev_del
58: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND cdev_alloc
```
Using `strace` to see what system calls are called:
```
sudo strace insmod fibdrv.ko
......
finit_module(3, "", 0) = 0
munmap(0x7fe19680d000, 8416) = 0
close(3) = 0
exit_group(0) = ?
+++ exited with 0 +++
```
There is a `finit_module`:
```clike=
SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
{
struct load_info info = { };
loff_t size;
void *hdr;
int err;
err = may_init_module();
if (err)
return err;
pr_debug("finit_module: fd=%d, uargs=%p, flags=%i\n", fd, uargs, flags);
if (flags & ~(MODULE_INIT_IGNORE_MODVERSIONS
|MODULE_INIT_IGNORE_VERMAGIC))
return -EINVAL;
err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX,
READING_MODULE);
if (err)
return err;
info.hdr = hdr;
info.len = size;
return load_module(&info, uargs, flags);
}
```
the call `load_module`:
```clike=
static int load_module(struct load_info *info, const char __user *uargs,
int flags)
{
struct module *mod;
long err;
char *after_dashes;
......
/* Now we've got everything in the final locations, we can
* find optional sections. */
err = find_module_sections(mod, info);
if (err)
goto free_unload;
err = check_module_license_and_versions(mod);
if (err)
goto free_unload;
/* Set up MODINFO_ATTR fields */
setup_modinfo(mod, info);
/* Fix up syms, so that st_value is a pointer to location. */
err = simplify_symbols(mod, info);
if (err < 0)
goto free_modinfo;
err = apply_relocations(mod, info);
if (err < 0)
goto free_modinfo;
err = post_relocation(mod, info);
if (err < 0)
goto free_modinfo;
flush_module_icache(mod);
/* Now copy in args */
mod->args = strndup_user(uargs, ~0UL >> 1);
if (IS_ERR(mod->args)) {
err = PTR_ERR(mod->args);
goto free_arch_cleanup;
}
......
/* Link in to sysfs. */
err = mod_sysfs_setup(mod, info, mod->kp, mod->num_kp);
if (err < 0)
goto coming_cleanup;
if (is_livepatch_module(mod)) {
err = copy_module_elf(mod, info);
if (err < 0)
goto sysfs_cleanup;
}
/* Get rid of temporary copy. */
free_copy(info);
/* Done! */
trace_module_load(mod);
return do_init_module(mod);
```
Finally calling `do_init_module` before returning, `mod->init` is the registered init function:
```clike=
static noinline int do_init_module(struct module *mod)
{
......
if (mod->init != NULL)
ret = do_one_initcall(mod->init);
```
* 試著執行 $ readelf -a fibdrv.ko, 觀察裡頭的資訊和原始程式碼及 modinfo 的關聯,搭配上述提問,解釋像 fibdrv.ko 這樣的 ELF 執行檔案是如何「植入」到 Linux 核心
See `modinfo fibdrv.ko` information:
```sh
lineagech@~/Linux_Kernel_Design/fibdrv$ modinfo fibdrv.ko
filename: /home/lineagech/Linux_Kernel_Design/fibdrv/fibdrv.ko
version: 0.1
description: Fibonacci engine driver
author: National Cheng Kung University, Taiwan
license: Dual MIT/GPL
srcversion: 2761C1DBC64BBDC99E866F4
depends:
retpoline: Y
name: fibdrv
vermagic: 4.18.0-15-generic SMP mod_unload
```
See `readelf -a fibdrv.ko`, we can find a section named `.modinfo`:
```
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .note.gnu.build-i NOTE 0000000000000000 00000040
0000000000000024 0000000000000000 A 0 0 4
[ 2] .text PROGBITS 0000000000000000 00000070
0000000000000229 0000000000000000 AX 0 0 16
[ 3] .rela.text RELA 0000000000000000 00001338
0000000000000180 0000000000000018 I 22 2 8
[ 4] .init.text PROGBITS 0000000000000000 00000299
0000000000000153 0000000000000000 AX 0 0 1
[ 5] .rela.init.text RELA 0000000000000000 000014b8
00000000000003a8 0000000000000018 I 22 4 8
[ 6] .exit.text PROGBITS 0000000000000000 000003ec
0000000000000040 0000000000000000 AX 0 0 1
[ 7] .rela.exit.text RELA 0000000000000000 00001860
00000000000000d8 0000000000000018 I 22 6 8
[ 8] __mcount_loc PROGBITS 0000000000000000 0000042c
0000000000000030 0000000000000000 A 0 0 1
[ 9] .rela__mcount_loc RELA 0000000000000000 00001938
0000000000000090 0000000000000018 I 22 8 8
[10] .rodata.str1.1 PROGBITS 0000000000000000 0000045c
000000000000006e 0000000000000001 AMS 0 0 1
[11] .rodata.str1.8 PROGBITS 0000000000000000 000004d0
0000000000000080 0000000000000001 AMS 0 0 8
[12] .rodata PROGBITS 0000000000000000 00000560
0000000000000100 0000000000000000 A 0 0 32
[13] .rela.rodata RELA 0000000000000000 000019c8
0000000000000090 0000000000000018 I 22 12 8
[14] .modinfo PROGBITS 0000000000000000 00000660
00000000000000ec 0000000000000000 A 0 0 8
[15] .data PROGBITS 0000000000000000 00000760
0000000000000020 0000000000000000 WA 0 0 32
[16] .rela.data RELA 0000000000000000 00001a58
0000000000000030 0000000000000018 I 22 15 8
[17] .gnu.linkonce.thi PROGBITS 0000000000000000 00000780
0000000000000340 0000000000000000 WA 0 0 64
[18] .rela.gnu.linkonc RELA 0000000000000000 00001a88
0000000000000030 0000000000000018 I 22 17 8
[19] .bss NOBITS 0000000000000000 00000ac0
0000000000000014 0000000000000000 WA 0 0 8
[20] .comment PROGBITS 0000000000000000 00000ac0
0000000000000056 0000000000000001 MS 0 0 1
[21] .note.GNU-stack PROGBITS 0000000000000000 00000b16
0000000000000000 0000000000000000 0 0 1
[22] .symtab SYMTAB 0000000000000000 00000b18
00000000000005a0 0000000000000018 23 39 8
[23] .strtab STRTAB 0000000000000000 000010b8
000000000000027b 0000000000000000 0 0 1
[24] .shstrtab STRTAB 0000000000000000 00001ab8
00000000000000e7 0000000000000000 0 0 1
```
* 這個 fibdrv 名稱取自 Fibonacci driver 的簡稱,儘管在這裡顯然是為了展示和教學用途而存在,但針對若干關鍵的應用場景,特別去撰寫 Linux 核心模組,仍有其意義,請找出 Linux 核心的案例並解讀。提示: 可參閱 Random numbers from CPU execution time jitter
The article mentions that how to use CPU execution jitters as entropy, and /dev/random as well. /dev/random can be used for key-gen:
```
* The two other interfaces are two character devices /dev/random and
* /dev/urandom. /dev/random is suitable for use when very high
* quality randomness is desired (for example, for key generation or
* one-time pads), as it will only return a maximum of the number of
* bits of randomness (as estimated by the random number generator)
* contained in the entropy pool.
```
source code(linux v4.18):
```clike=
const struct file_operations random_fops = {
.read = random_read,
.write = random_write,
.poll = random_poll,
.unlocked_ioctl = random_ioctl,
.fasync = random_fasync,
.llseek = noop_llseek,
};
```
User can get random bytes through reading /dev/random, also can re-seed generator through ioctl. Read operation:
```clike=
static ssize_t
random_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
{
return _random_read(file->f_flags & O_NONBLOCK, buf, nbytes);
}
static ssize_t _random_read(int nonblock, char __user *buf, size_t nbytes)
{
ssize_t n;
if (nbytes == 0)
return 0;
nbytes = min_t(size_t, nbytes, SEC_XFER_SIZE);
while (1) {
n = extract_entropy_user(&blocking_pool, buf, nbytes);
if (n < 0)
return n;
trace_random_read(n*8, (nbytes-n)*8,
ENTROPY_BITS(&blocking_pool),
ENTROPY_BITS(&input_pool));
if (n > 0)
return n;
/* Pool is (near) empty. Maybe wait and retry. */
if (nonblock)
return -EAGAIN;
wait_event_interruptible(random_read_wait,
ENTROPY_BITS(&input_pool) >=
random_read_wakeup_bits);
if (signal_pending(current))
return -ERESTARTSYS;
}
}
```
* 查閱 ktime 相關的 API,並找出使用案例 (需要有核心模組和簡化的程式碼來解說)
```clike=
/* Nanosecond scalar representation for kernel time values */
typedef s64 ktime_t;
```
```clike=
typedef int64_t s64;
```
There are several basic ktime interface:
```clike=
ktime_t ktime_get(void) /*CLOCK_MONOTONIC*/
ktime_t ktime_get_boottime(void) /*CLOCK_BOOTTIME*/
ktime_t ktime_get_real(void) /*CLOCK_REALTIME*/
ktime_t ktime_get_clocktai(void) /*CLOCK_TAI, International Atomic Time (TAI)*/
ktime_t ktime_get_raw(void) /*CLOCK_MONOTONIC_RAW, without NTP adjustments f or clock drift*/
```
Example is like sleep function following, using ktime_add_us to get target time and waiting until it is expired.
```
/**
* usleep_range - Sleep for an approximate time
* @min: Minimum time in usecs to sleep
* @max: Maximum time in usecs to sleep
*
* In non-atomic context where the exact wakeup time is flexible, use
* usleep_range() instead of udelay(). The sleep improves responsiveness
* by avoiding the CPU-hogging busy-wait of udelay(), and the range reduces
* power usage by allowing hrtimers to take advantage of an already-
* scheduled interrupt instead of scheduling a new one just for this sleep.
*/
void __sched usleep_range(unsigned long min, unsigned long max)
{
ktime_t exp = ktime_add_us(ktime_get(), min);
u64 delta = (u64)(max - min) * NSEC_PER_USEC;
for (;;) {
__set_current_state(TASK_UNINTERRUPTIBLE);
/* Do not return before the requested sleep time has elapsed */
if (!schedule_hrtimeout_range(&exp, delta, HRTIMER_MODE_ABS))
break;
}
}
```
* clock_gettime 和 High Resolution TImers (HRT) 的關聯為何?請參閱 POSIX 文件並搭配程式碼解說
According to [High Resolution TImers (HRT)](https://elinux.org/High_Resolution_Timers), I check my system timers to see if they are supporting high-resolution timers.
:::info
If your clock supports high resolution, it will have a .resolution value of 1 nsecs. If it does not, then it will have a .resolution value that equals the number of nanoseconds in a jiffy (usually 10000 nsecs, on embedded platforms):
:::
```sh=
lineagech@/tmp$ sudo cat /proc/timer_list
......
active timers:
clock 4:
.base: ffff95512f39cd40
.index: 4
.resolution: 1 nsecs
.get_time: ktime_get
.offset: 0 nsecs
active timers:
clock 5:
.base: ffff95512f39cd80
.index: 5
.resolution: 1 nsecs
.get_time: ktime_get_real
.offset: 1552500249691559463 nsecs
......
active timers:
.expires_next : 8312236000000 nsecs
.hres_active : 1
......
```
Write a test program to check if clock_gettime uses HRT:
```clike=
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main (int argc, char **argv)
{
struct timespec res;
int re = clock_getres(CLOCK_MONOTONIC, &res);
printf("%lu sec %lu nsec\n", res.tv_sec, res.tv_nsec);
return 0;
}
```
```sh=
lineagech@/tmp$ ./a.out
0 sec 1 nsec
```
* fibdrv 如何透過 Linux Virtual File System 介面,讓計算出來的 Fibonacci 數列得以讓 userspace (使用者層級) 程式 (本例就是 client.c 程式) 得以存取呢?解釋原理,並撰寫或找出相似的 Linux 核心模組範例
Linux provides a unified Virtual File System supporting several operations, like read, write, stat etc. Here kernel modules need to hand its own `file_operation` to VFS through `cdev_add` ([Character Device Files - Creation & Operations](https://sysplay.github.io/books/LinuxDrivers/book/Content/Part05.html)).
When `insmod`, kerenel modules will be linked to sysfs. `man sysfs`:
:::info
The sysfs filesystem is a pseudo-filesystem which provides an interface to kernel data structures. (More precisely, the files and directories in sysfs provide a view of the kobject structures defined internally within the kernel.) The files under sysfs provide information about devices, kernel modules, filesystems, and other kernel components.
:::
User provides file descriptor and user-space buffer and uses `read` system call to find provided inode and file struct as arguments and call registered read API in kernel modules, here is fib_read. Kernel module can just do something and return result or put results into user's buffer through `copy_to_user`.

Linux Virtual File System:
>* Manage kernel level file abstractions in one format for all file systems
>* Receives system call requests from user level(e.g. write, open, stat, link)
>* Interact with a specific file system based on mount point traversal
>* Receive requests from other part of the kernel, mostly from memory management
>* VFS Data structure
- inode: VFS abstraction for the file
- file
- superblock: Handle metadata(attributes), like device, blocksize, dirty flags, list of dirty inodes
- dentry: a name to inode translation structure
* 注意到 fibdrv.c 存在著 DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題
When existing multi-porcesses or multi-threads, they might access the kernel module simultaneously, possible to have race conditions.
* 許多現代處理器提供了 clz / ctz 一類的指令,你知道如何透過演算法的調整,去加速 費氏數列 運算嗎?請列出關鍵程式碼並解說
My implementation is using doubling method:
:::info
$F_{2n}=F_n*2F_{n+1}-F_n^2$
$F_{2n+1}=F_n^2+F_{n+1}^2$
:::
Check bit starting from the first bit 1 which is found from MSB. Initial state is either $F_0=0$ or $F_1=1$, depending on the first bit is 0 or 1. And we can use `__builtin_clz` to find the first position of bit 1 and start to calculate next $F_n$.
The process is traversing the first bit 1 from MSB to LSB, so when move to next bit, it is like multiplying 2 for current $F_n$, that is $F_{2n}$. What if next bit is 1, then current value becomes $F_{2n+1}$. Until moving to LSB, we can get the result.
```clike=
long long curr;
long long a = 0;
long long b = 1;
long long next_a, next_b;
int i;
int lz = __builtin_clz(k);
int tz = __builtin_ctz(k);
int count = 1, target = ((sizeof(long long) << 3) - lz);
if (k == 0)
return 0;
curr = k >> ((sizeof(long long) << 3) - lz - count);
while (count <= target) {
next_a = a * ((b << 1) - a);
next_b = a * a + b * b;
if ((curr & 0x1) == 1) {
a = next_b;
b = next_a + next_b;
} else {
a = next_a;
b = next_b;
}
curr = k >> ((sizeof(long long) << 3) - lz - (++count));
}
return a;
```
## Fibonacci
* Peformance evaluation
First we see the execution time in kernel space and user space individually:

And if using fast-fibonacci algorithm:
