Try   HackMD

Learning the Linux Kernel with tracing

source: Steven Rostedt - Learning the Linux Kernel with tracing


Hello World

  • Source code
#include <stdio.h>

int main() {printf("Hello World %p!\n", main);}
  • Compile it
gcc main.c -o main
  • Read the output file
objdump -D main | less
  • Execute it
./main

puts?

The objdump output indicates that the main function is calling the puts method. In reality, however, the source code calls the printf method rather than puts.

0000000000001149 <main>:
    1149:       f3 0f 1e fa             endbr64
    114d:       55                      push   %rbp
    114e:       48 89 e5                mov    %rsp,%rbp
    1151:       48 8d 05 ac 0e 00 00    lea    0xeac(%rip),%rax        # 2004 <_IO_stdin_used+0x4>
    1158:       48 89 c7                mov    %rax,%rdi
    115b:       e8 f0 fe ff ff          call   1050 <puts@plt>
    1160:       90                      nop
    1161:       5d                      pop    %rbp
    1162:       c3                      ret

This is due to the gcc compiler optimizing our source code. If we prefer to prevent the compiler from performing such optimizations, we can pass the -O0 flag to disable them.

gcc -O0 main.c -o main

After disabling optimization with the -O0 flag, the printf function call is present in our main function as expected.

0000000000001149 <main>:
    1149:       f3 0f 1e fa             endbr64
    114d:       55                      push   %rbp
    114e:       48 89 e5                mov    %rsp,%rbp
    1151:       48 8d 05 f1 ff ff ff    lea    -0xf(%rip),%rax        # 1149 <main>
    1158:       48 89 c6                mov    %rax,%rsi
    115b:       48 8d 05 a2 0e 00 00    lea    0xea2(%rip),%rax        # 2004 <_IO_stdin_used+0x4>
    1162:       48 89 c7                mov    %rax,%rdi
    1165:       b8 00 00 00 00          mov    $0x0,%eax
    116a:       e8 e1 fe ff ff          call   1050 <printf@plt>
    116f:       90                      nop
    1170:       5d                      pop    %rbp
    1171:       c3                      ret

Why the outputs are different?

Everytime we execute the main file, the outputs are different.

matt_jan@tw-mattjan:/tmp$ ./a.out
Hello World 0x560799d6e149!
matt_jan@tw-mattjan:/tmp$ ./a.out
Hello World 0x55e782c03149!
matt_jan@tw-mattjan:/tmp$ ./a.out
Hello World 0x55710621d149!
matt_jan@tw-mattjan:/tmp$ ./a.out
Hello World 0x5650608c0149!
matt_jan@tw-mattjan:/tmp$ ./a.out
Hello World 0x556077e8e149!
matt_jan@tw-mattjan:/tmp$ ./a.out
Hello World 0x5587b9b91149!
matt_jan@tw-mattjan:/tmp$ ./a.out
Hello World 0x557a80c84149!
matt_jan@tw-mattjan:/tmp$ ./a.out
Hello World 0x55fa66a19149!

This is cause by the Address Space Layout Randomization(ASLR) in linux kernel. We can disable it by the following command.

sudo echo 0 > /proc/sys/kernel/randomize_va_space

After disabled ASLR, the main file always output same address.

matt_jan@tw-mattjan:/tmp$ ./a.out
Hello World 0x555555555149!
matt_jan@tw-mattjan:/tmp$ ./a.out
Hello World 0x555555555149!
matt_jan@tw-mattjan:/tmp$ ./a.out
Hello World 0x555555555149!
matt_jan@tw-mattjan:/tmp$ ./a.out
Hello World 0x555555555149!

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
You must restore the value of randomize_va_space after the experiment, as it is a security feature.

Dig into 0x560799d6e149

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

The virtual address is structured in a format comprising "9 bits-9 bits-9 bits-9 bits-12 bits," which together make up a 48-bit address space. We observe that the kernel attempts to determine whether the PGD contains 0xac, the PUD contains 0x117, the PMD contains 0xf1, and the PTE contains 0x194. Ultimately, the final 12 bits represent the offset within the address pointed by the PTE.

  • PGD: Page Global Directory
  • PUD: Page Upper Directory
  • PMD: Page Middle Directory
  • PTE: Page Table Entry

If the address does not exist in the Page Directory, the kernel will load it from memory and update the corresponding Page Directory.

Since each application has its own virtual address space, each can have the same virtual address that points to a different physical address.

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

What's in kernel?

The application has its own process space, which contains sections such as data, text, stack, heap, and others. However, what resides in the kernel space?

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

We can determine the system calls invoked in user space by using strace, which reveals each call made by the main program. On the other hand, to understand the activities within the kernel space, we can employ ftrace.

Strace

Strace stands for System Call Trace. You can figure out each system call the main file really called by utilies strace

strace main

Here is the output.

execve("./a.out", ["./a.out"], 0x7ffecddb6af0 /* 38 vars */) = 0
brk(NULL)                               = 0x55e101412000
arch_prctl(0x3001 /* ARCH_??? */, 0x7fffa4c139e0) = -1 EINVAL (Invalid argument)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa59f732000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=61443, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 61443, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa59f722000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\237\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0 \0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0"..., 48, 848) = 48
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\226 \25\252\235\23<l\274\3731\3540\5\226\327"..., 68, 896) = 68
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=2220400, ...}, AT_EMPTY_PATH) = 0
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 2264656, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fa59f4f9000
mprotect(0x7fa59f521000, 2023424, PROT_NONE) = 0
mmap(0x7fa59f521000, 1658880, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x28000) = 0x7fa59f521000
mmap(0x7fa59f6b6000, 360448, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bd000) = 0x7fa59f6b6000
mmap(0x7fa59f70f000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x215000) = 0x7fa59f70f000
mmap(0x7fa59f715000, 52816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fa59f715000
close(3)                                = 0
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa59f4f6000
arch_prctl(ARCH_SET_FS, 0x7fa59f4f6740) = 0
set_tid_address(0x7fa59f4f6a10)         = 96198
set_robust_list(0x7fa59f4f6a20, 24)     = 0
rseq(0x7fa59f4f70e0, 0x20, 0, 0x53053053) = 0
mprotect(0x7fa59f70f000, 16384, PROT_READ) = 0
mprotect(0x55e0ffdf4000, 4096, PROT_READ) = 0
mprotect(0x7fa59f76c000, 8192, PROT_READ) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
munmap(0x7fa59f722000, 61443)           = 0
newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}, AT_EMPTY_PATH) = 0
getrandom("\x71\xf9\xb8\x86\x89\xfe\x93\xe9", 8, GRND_NONBLOCK) = 8
brk(NULL)                               = 0x55e101412000
brk(0x55e101433000)                     = 0x55e101433000
write(1, "Hello World 0x55e0ffdf2149!\n", 28Hello World 0x55e0ffdf2149!
) = 28
exit_group(28)                          = ?
+++ exited with 28 +++

Here is what these system calls actually do, as generated by ChatGPT.

  • execve: The program starts with the execve system call, which is used to execute a program pointed to by the filename. This indicates the start of the a.out program.
  • brk: This system call is used to control the amount of memory allocated to the data segment of the process. Initially, it is called with NULL to find the current location of the program break.
  • arch_prctl: This is an architecture-specific system call for setting certain thread state information. The EINVAL (Invalid argument) error suggests that the call was made with an incorrect or unsupported code.
  • mmap: This system call is used to map files or devices into memory. It is used here for allocating memory, probably for the program's runtime use.
  • access: Checks if the calling process can access the file pathname. The error ENOENT (No such file or directory) suggests that the file /etc/ld.so.preload does not exist.
  • openat, newfstatat, read, pread64, mmap, mprotect: These calls are part of the dynamic linking process, where the program is loading and preparing the libc.so.6 library, which is the standard C library, into memory.
  • close: The file descriptor opened for /etc/ld.so.cache and the libc.so.6 library is closed after being read and mapped into memory.
  • arch_prctl(ARCH_SET_FS, ): Sets the base address of the FS segment, which is often used to manage thread-local storage.
  • set_tid_address: Sets the clear_child_tid value in the thread control block, which is used for thread synchronization when exiting.
  • set_robust_list: Sets up a list of locks to be released if the thread dies.
  • rseq: This system call is intended for implementing restartable sequences in user space.
  • mprotect: Changes the access protections for the calling process's memory pages.
  • prlimit64: This system call is used to retrieve and/or set the resource limits of a process.
  • munmap: Unmaps a mapped region of memory.
  • newfstatat: Retrieves the status of a file descriptor, in this case, stdout.
  • getrandom: Generates random numbers.
  • write: Writes the string "Hello World 0x55e0ffdf2149!\n" to stdout, which is the file descriptor 1.
  • exit_group: Exits all threads in a process.

The +++ exited with 28 +++ line indicates that the program has finished executing and has exited with the status code 28.

ftrace

  • The official tracer of the Linux Kernel
  • Probably already enabled in the Linux Kernel you are running
  • /sys/kernel/tracing
  • /sys/kernel/debug/tracing (for backward compatibility)
  • Only need echo and cat

How to use it?

First of all, please read the README in /sys/kernel/tracing.

Functions

Let's trace all functions in the kernel.

cd /sys/kernel/tracing echo function > current_tracer cat trace

Here is the sample output.

# tracer: function
#
# entries-in-buffer/entries-written: 725629/78268149   #P:16
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
          <idle>-0       [015] d.... 33920.474733: rcu_needs_cpu <-tick_nohz_next_event
          <idle>-0       [015] d.... 33920.474733: get_next_timer_interrupt <-tick_nohz_next_event

You may have observed that the file contains an extensive amount of content and seems to be never-ending. This occurs because we are continuously tracing the function. If we wish to halt the tracing process, we need to write nop into the current_tracer.

echo nop > current_tracer

The primary reason we can stop the tracing using the aforementioned method is because the trace employs a non-consuming read approach.

Additionally, there is an alternative method for observing the current tracing activity, which is through the trace_pipe. However, when monitoring the tracing function via this file, we cannot terminate the tracing in the same manner. The trace_pipe prevents the user from writing to the current_tracer.

As for trace_pipe, it continuously provides the latest function calls because we use the cat command to read from it, which inadvertently triggers events in its own trace. Consequently, it ends up reading from itself in a loop.

trace-cmd

The trace command is powerful but not particularly user-friendly. You might not want to go through the aforementioned steps each time. Therefore, Steven has implemented trace-cmd for users like you.

  • A command line tool taht does the work for you
  • Must run as root or sudo
  • Will mount /sys/kernel/tracing for you
  • Can save the trace offline instead in the trace file
How to use?

Using the following command with trace-cmd, we can obtain the same output as before without the need to modify or mount any files.

trace-cmd start -p function
trace-cmd show

Furthermore, we can capture system calls using the record command and retrieve them with the report command.

trace-cmd record -e syscalls -F ./Hello
trace-cmd report

What does the write function do

  • printf("Hello world\n") does a system call
  • But thats all we know
  • How does it ends up on our console?

Steven had traced our "hello" program by the following commands.

trace-cmd record -p function_graph --max-graph-depth 1 -e syscalls -F ./Hello
trace-cmd report

The function_graph tracer, similar to the function tracer, traces not only all kernel function calls but also their entry and exit points. This enables the creation of a graphical representation of the call hierarchy that closely reflects the structure of the source code in C. More detail in The Linux Kernel.

  • -F: This will filter only the executable that is given on the command line. If no command is given, then it will filter itself (pretty pointless). Using -F will let you trace only events that are caused by the given command.
  • --max-graph-depth: --max-graph-depth option specifies the maximum depth to which the function_graph tracer will record function calls.
  • -g <function name>: This option is for the function_graph plugin. It will graph the given function. That is, it will only trace the function and all functions that it calls. You can have more than one -g on the command line.
  • -l <function name>: This will limit the function and function_graph tracers to only trace the given function name. More than one -l may be specified on the command line to trace more than one function. This supports both full regex(3) parsing, or basic glob parsing. If the filter has only alphanumeric, '_', '*', '?' and '.' characters, then it will be parsed as a basic glob. to force it to be a regex, prefix the filter with '^' or append it with '$'
    and it will then be parsed as a regex.
  • -n <function name>: This has the opposite effect of -l. The function given with the -n option will not be traced. This takes precedence, that is, if you include the same function for both -n and -l, it will not be traced.
  • -O <option>: Ftrace has various options that can be enabled or disabled. This allows you to set them. Appending the text 'no' to an option disables it. For example: -O nograph-time will disable the "graph-time" Ftrace option. Here is more detail about options

Too much information!!

  • trace-cmd can record a lot of information
  • Some times, too much information is worse than too little - Information overload
  • It takes an expert to be able to decipher what is happening
  • Would be much better to visualize it!

KernelShark

  • KernelShark can read trace.dat files
    • trace.da is produced by trace-cmd record
  • It gives a visualization of what is happening
  • Much easier to parse large amounts of data