Learning the Linux Kernel with tracing

# Learning the Linux Kernel with tracing source: [Steven Rostedt - Learning the Linux Kernel with tracing](https://www.youtube.com/watch?v=JRyrhsx-L5Y) --- ## Hello World - Source code ```clike #include <stdio.h> int main() {printf("Hello World %p!\n", main);} ``` - Compile it ```bash gcc main.c -o main ``` - Read the output file ```bash objdump -D main | less ``` - Execute it ```bash ./main ``` ### `puts`? The `objdump` output indicates that the `main` function is calling the `puts` method. In reality, however, the source code calls the `printf` method rather than `puts`. ``` 0000000000001149 <main>: 1149: f3 0f 1e fa endbr64 114d: 55 push %rbp 114e: 48 89 e5 mov %rsp,%rbp 1151: 48 8d 05 ac 0e 00 00 lea 0xeac(%rip),%rax # 2004 <_IO_stdin_used+0x4> 1158: 48 89 c7 mov %rax,%rdi 115b: e8 f0 fe ff ff call 1050 <puts@plt> 1160: 90 nop 1161: 5d pop %rbp 1162: c3 ret ``` This is due to the `gcc` compiler optimizing our source code. If we prefer to prevent the compiler from performing such optimizations, we can pass the `-O0` flag to disable them. ```bash gcc -O0 main.c -o main ``` After disabling optimization with the `-O0` flag, the `printf` function call is present in our `main` function as expected. ``` 0000000000001149 <main>: 1149: f3 0f 1e fa endbr64 114d: 55 push %rbp 114e: 48 89 e5 mov %rsp,%rbp 1151: 48 8d 05 f1 ff ff ff lea -0xf(%rip),%rax # 1149 <main> 1158: 48 89 c6 mov %rax,%rsi 115b: 48 8d 05 a2 0e 00 00 lea 0xea2(%rip),%rax # 2004 <_IO_stdin_used+0x4> 1162: 48 89 c7 mov %rax,%rdi 1165: b8 00 00 00 00 mov $0x0,%eax 116a: e8 e1 fe ff ff call 1050 <printf@plt> 116f: 90 nop 1170: 5d pop %rbp 1171: c3 ret ``` ## Why the outputs are different? Everytime we execute the `main` file, the outputs are different. ```bash matt_jan@tw-mattjan:/tmp$ ./a.out Hello World 0x560799d6e149! matt_jan@tw-mattjan:/tmp$ ./a.out Hello World 0x55e782c03149! matt_jan@tw-mattjan:/tmp$ ./a.out Hello World 0x55710621d149! matt_jan@tw-mattjan:/tmp$ ./a.out Hello World 0x5650608c0149! matt_jan@tw-mattjan:/tmp$ ./a.out Hello World 0x556077e8e149! matt_jan@tw-mattjan:/tmp$ ./a.out Hello World 0x5587b9b91149! matt_jan@tw-mattjan:/tmp$ ./a.out Hello World 0x557a80c84149! matt_jan@tw-mattjan:/tmp$ ./a.out Hello World 0x55fa66a19149! ``` This is cause by the Address Space Layout Randomization(ASLR) in linux kernel. We can disable it by the following command. ```bash sudo echo 0 > /proc/sys/kernel/randomize_va_space ``` After disabled ASLR, the `main` file always output same address. ``` matt_jan@tw-mattjan:/tmp$ ./a.out Hello World 0x555555555149! matt_jan@tw-mattjan:/tmp$ ./a.out Hello World 0x555555555149! matt_jan@tw-mattjan:/tmp$ ./a.out Hello World 0x555555555149! matt_jan@tw-mattjan:/tmp$ ./a.out Hello World 0x555555555149! ``` :::warning :notes: You must restore the value of `randomize_va_space` after the experiment, as it is a security feature. ::: ## Dig into `0x560799d6e149` ![virtual address mapping](https://hackmd.io/_uploads/SJWo6m0mR.png) The virtual address is structured in a format comprising "9 bits-9 bits-9 bits-9 bits-12 bits," which together make up a 48-bit address space. We observe that the kernel attempts to determine whether the PGD contains `0xac`, the PUD contains `0x117`, the PMD contains `0xf1`, and the PTE contains `0x194`. Ultimately, the final 12 bits represent the offset within the address pointed by the PTE. - PGD: Page Global Directory - PUD: Page Upper Directory - PMD: Page Middle Directory - PTE: Page Table Entry If the address does not exist in the Page Directory, the kernel will load it from memory and update the corresponding Page Directory. Since each application has its own virtual address space, each can have the same virtual address that points to a different physical address. ![image](https://hackmd.io/_uploads/B1hLbNAXR.png) ## What's in kernel? The application has its own process space, which contains sections such as data, text, stack, heap, and others. However, what resides in the kernel space? ![image](https://hackmd.io/_uploads/Hkg0Rb4RQ0.png) We can determine the system calls invoked in user space by using `strace`, which reveals each call made by the `main` program. On the other hand, to understand the activities within the kernel space, we can employ `ftrace`. ### Strace Strace stands for System Call Trace. You can figure out each system call the `main` file really called by utilies `strace` ```bash strace main ``` Here is the output. ``` execve("./a.out", ["./a.out"], 0x7ffecddb6af0 /* 38 vars */) = 0 brk(NULL) = 0x55e101412000 arch_prctl(0x3001 /* ARCH_??? */, 0x7fffa4c139e0) = -1 EINVAL (Invalid argument) mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa59f732000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=61443, ...}, AT_EMPTY_PATH) = 0 mmap(NULL, 61443, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa59f722000 close(3) = 0 openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\237\2\0\0\0\0\0"..., 832) = 832 pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784 pread64(3, "\4\0\0\0 \0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0"..., 48, 848) = 48 pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\226 \25\252\235\23<l\274\3731\3540\5\226\327"..., 68, 896) = 68 newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=2220400, ...}, AT_EMPTY_PATH) = 0 pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784 mmap(NULL, 2264656, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fa59f4f9000 mprotect(0x7fa59f521000, 2023424, PROT_NONE) = 0 mmap(0x7fa59f521000, 1658880, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x28000) = 0x7fa59f521000 mmap(0x7fa59f6b6000, 360448, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bd000) = 0x7fa59f6b6000 mmap(0x7fa59f70f000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x215000) = 0x7fa59f70f000 mmap(0x7fa59f715000, 52816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fa59f715000 close(3) = 0 mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa59f4f6000 arch_prctl(ARCH_SET_FS, 0x7fa59f4f6740) = 0 set_tid_address(0x7fa59f4f6a10) = 96198 set_robust_list(0x7fa59f4f6a20, 24) = 0 rseq(0x7fa59f4f70e0, 0x20, 0, 0x53053053) = 0 mprotect(0x7fa59f70f000, 16384, PROT_READ) = 0 mprotect(0x55e0ffdf4000, 4096, PROT_READ) = 0 mprotect(0x7fa59f76c000, 8192, PROT_READ) = 0 prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0 munmap(0x7fa59f722000, 61443) = 0 newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}, AT_EMPTY_PATH) = 0 getrandom("\x71\xf9\xb8\x86\x89\xfe\x93\xe9", 8, GRND_NONBLOCK) = 8 brk(NULL) = 0x55e101412000 brk(0x55e101433000) = 0x55e101433000 write(1, "Hello World 0x55e0ffdf2149!\n", 28Hello World 0x55e0ffdf2149! ) = 28 exit_group(28) = ? +++ exited with 28 +++ ``` Here is what these system calls actually do, as generated by ChatGPT. - **execve**: The program starts with the `execve` system call, which is used to execute a program pointed to by the filename. This indicates the start of the `a.out` program. - **brk**: This system call is used to control the amount of memory allocated to the data segment of the process. Initially, it is called with `NULL` to find the current location of the program break. - **arch_prctl**: This is an architecture-specific system call for setting certain thread state information. The `EINVAL` (Invalid argument) error suggests that the call was made with an incorrect or unsupported code. - **mmap**: This system call is used to map files or devices into memory. It is used here for allocating memory, probably for the program's runtime use. - **access**: Checks if the calling process can access the file pathname. The error `ENOENT` (No such file or directory) suggests that the file `/etc/ld.so.preload` does not exist. - **openat**, **newfstatat**, **read**, **pread64**, **mmap**, **mprotect**: These calls are part of the dynamic linking process, where the program is loading and preparing the `libc.so.6` library, which is the standard C library, into memory. - **close**: The file descriptor opened for `/etc/ld.so.cache` and the `libc.so.6` library is closed after being read and mapped into memory. - **arch_prctl(ARCH_SET_FS, ...)**: Sets the base address of the FS segment, which is often used to manage thread-local storage. - **set_tid_address**: Sets the clear_child_tid value in the thread control block, which is used for thread synchronization when exiting. - **set_robust_list**: Sets up a list of locks to be released if the thread dies. - **rseq**: This system call is intended for implementing restartable sequences in user space. - **mprotect**: Changes the access protections for the calling process's memory pages. - **prlimit64**: This system call is used to retrieve and/or set the resource limits of a process. - **munmap**: Unmaps a mapped region of memory. - **newfstatat**: Retrieves the status of a file descriptor, in this case, stdout. - **getrandom**: Generates random numbers. - **write**: Writes the string "Hello World 0x55e0ffdf2149!\n" to stdout, which is the file descriptor 1. - **exit_group**: Exits all threads in a process. The `+++ exited with 28 +++` line indicates that the program has finished executing and has exited with the status code 28. ### `ftrace` - The official tracer of the Linux Kernel - Probably already enabled in the Linux Kernel you are running - `/sys/kernel/tracing` - `/sys/kernel/debug/tracing` (for backward compatibility) - Only need echo and cat #### How to use it? First of all, please read the README in `/sys/kernel/tracing`. ##### Functions Let's trace all functions in the kernel. ```bash= cd /sys/kernel/tracing echo function > current_tracer cat trace ``` Here is the sample output. ``` # tracer: function # # entries-in-buffer/entries-written: 725629/78268149 #P:16 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / _-=> migrate-disable # |||| / delay # TASK-PID CPU# ||||| TIMESTAMP FUNCTION # | | | ||||| | | <idle>-0 [015] d.... 33920.474733: rcu_needs_cpu <-tick_nohz_next_event <idle>-0 [015] d.... 33920.474733: get_next_timer_interrupt <-tick_nohz_next_event ``` You may have observed that the file contains an extensive amount of content and seems to be never-ending. This occurs because we are continuously tracing the function. If we wish to halt the tracing process, we need to write `nop` into the `current_tracer`. ```bash= echo nop > current_tracer ``` The primary reason we can stop the tracing using the aforementioned method is because the `trace` employs a non-consuming read approach. Additionally, there is an alternative method for observing the current tracing activity, which is through the `trace_pipe`. However, when monitoring the tracing function via this file, we cannot terminate the tracing in the same manner. The `trace_pipe` prevents the user from writing to the `current_tracer`. As for `trace_pipe`, it continuously provides the latest function calls because we use the `cat` command to read from it, which inadvertently triggers events in its own trace. Consequently, it ends up reading from itself in a loop. #### trace-cmd The `trace` command is powerful but not particularly user-friendly. You might not want to go through the aforementioned steps each time. Therefore, Steven has implemented `trace-cmd` for users like you. - A command line tool taht does the work for you - Must run as root or sudo - Will mount `/sys/kernel/tracing` for you - Can save the trace offline instead in the trace file ##### How to use? Using the following command with `trace-cmd`, we can obtain the same output as before without the need to modify or mount any files. ```bash trace-cmd start -p function trace-cmd show ``` Furthermore, we can capture system calls using the `record` command and retrieve them with the `report` command. ```bash trace-cmd record -e syscalls -F ./Hello trace-cmd report ``` ## What does the `write` function do - `printf("Hello world\n")` does a system call - But thats all we know - How does it ends up on our console? Steven had traced our "hello" program by the following commands. ```bash trace-cmd record -p function_graph --max-graph-depth 1 -e syscalls -F ./Hello trace-cmd report ``` The `function_graph` tracer, similar to the `function` tracer, traces not only all kernel function calls but also their entry and exit points. This enables the creation of a graphical representation of the call hierarchy that closely reflects the structure of the source code in C. More detail in [The Linux Kernel](https://www.kernel.org/doc/html/v4.18/trace/ftrace.html#the-tracers). - `-F`: This will filter only the executable that is given on the command line. If no command is given, then it will filter itself (pretty pointless). Using `-F` will let you trace only events that are caused by the given command. - `--max-graph-depth`: `--max-graph-depth` option specifies the maximum depth to which the `function_graph` tracer will record function calls. - `-g <function name>`: This option is for the `function_graph` plugin. It will graph the given function. That is, it will only trace the function and all functions that it calls. You can have more than one `-g` on the command line. - `-l <function name>`: This will limit the `function` and `function_graph` tracers to only trace the given function name. More than one `-l` may be specified on the command line to trace more than one function. This supports both full regex(3) parsing, or basic glob parsing. If the filter has only alphanumeric, '_', '*', '?' and '.' characters, then it will be parsed as a basic glob. to force it to be a regex, prefix the filter with '^' or append it with '$' and it will then be parsed as a regex. - `-n <function name>`: This has the opposite effect of `-l`. The function given with the `-n` option will not be traced. This takes precedence, that is, if you include the same function for both `-n` and `-l`, it will not be traced. - `-O <option>`: Ftrace has various options that can be enabled or disabled. This allows you to set them. Appending the text 'no' to an option disables it. For example: `-O nograph-time` will disable the "graph-time" Ftrace option. [Here](https://www.kernel.org/doc/html/v4.18/trace/ftrace.html#trace-options) is more detail about options :::danger Too much information!! - trace-cmd can record a lot of information - Some times, too much information is worse than too little - Information overload - It takes an expert to be able to decipher what is happening - Would be much better to visualize it! ::: ## KernelShark - KernelShark can read trace.dat files - trace.da is produced by trace-cmd record - It gives a visualization of what is happening - Much easier to parse large amounts of data