Debugging always makes developers sad, especially in a big system. How do we debug the Linux kernel?
Since early kernel versions, we have had tracing tools such as strace and gdb. Both are based on the old tracing system call: ptrace.
gdb -q test.o
strace ./test.o
Man page:
#include <sys/ptrace.h>
long ptrace(enum __ptrace_request request, pid_t pid,
void *addr, void *data);
@request: the tracing command to perform
(enum __ptrace_request is defined in /usr/include/sys/ptrace.h)
@pid: thread ID of the tracee
@addr: address in the tracee's memory
@data: address of the input or output data
When a signal is delivered to a task that is being traced by GDB or another tracer (the tracee), the task stops and its state is set to t. In this state, the task ignores every signal except SIGKILL and waits for instructions from the tracer. The tracer is then notified at its next call to waitpid(2) (or one of the related "wait" system calls).
ps man page:
PROCESS STATE CODES
Here are the different values that the s, stat and state output specifiers (header "STAT" or "S") will display to describe the state of a process:
D    uninterruptible sleep (usually IO)
I    Idle kernel thread
R    running or runnable (on run queue)
S    interruptible sleep (waiting for an event to complete)
T    stopped by job control signal
-> t    stopped by debugger during the tracing
W    paging (not valid since the 2.6.xx kernel)
X    dead (should never be seen)
Z    defunct ("zombie") process, terminated but not reaped by its parent
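To check from a script whether a tracee is really in state t, one option is to read the State field of /proc/<pid>/status, which carries the same one-letter code that ps prints. This is only a minimal sketch assuming a Linux /proc file system; task_state is a name made up for illustration, and the sample text is hand-written rather than captured from a real process.

```python
def task_state(status_text):
    """Return the one-letter state code from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # The field looks like "State:\tt (tracing stop)".
            return line.split()[1]
    raise ValueError("no State: line found")


if __name__ == "__main__":
    # Hand-written sample of what a stopped tracee's status file contains.
    sample = "Name:\thello\nState:\tt (tracing stop)\nTgid:\t1234\n"
    print(task_state(sample))
```

On a real system the argument would come from something like open("/proc/1234/status").read(), where 1234 stands for the PID of the tracee.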
After the tracer (GDB) sets a breakpoint in the tracee, it lets the tracee run. When the tracee reaches the breakpoint, SIGTRAP is raised, the tracee stops, and the tracer is notified.
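What the tracer receives from waitpid(2) is a status word that it has to decode. As a rough sketch, the Python standard library's os.WIFSTOPPED and os.WSTOPSIG can do this. describe_wait_status is a hypothetical helper name, and the sample value assumes the Linux encoding of a stop status (stop signal in the second byte, 0x7f in the low byte).

```python
import os
import signal


def describe_wait_status(status):
    """Classify a raw status word as returned by os.waitpid()."""
    if os.WIFSTOPPED(status):
        # A stopped tracee: report which signal stopped it.
        return ("stopped", os.WSTOPSIG(status))
    if os.WIFSIGNALED(status):
        return ("killed", os.WTERMSIG(status))
    if os.WIFEXITED(status):
        return ("exited", os.WEXITSTATUS(status))
    return ("unknown", status)


if __name__ == "__main__":
    # Assumed Linux encoding of "stopped by SIGTRAP".
    print(describe_wait_status((signal.SIGTRAP << 8) | 0x7F))
```

A tracer would call this on the second element of the tuple returned by os.waitpid(pid, 0) and continue the tracee only after seeing a stop caused by SIGTRAP.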
However, in a large real-world system, ptrace is very hard to use for finding bugs. It stops the traced task, so the real timing behavior of the system is lost.
Steven Rostedt, the creator of Ftrace, gave a talk at OpenFest Bulgaria in 2018.
Steven Rostedt - Learning the Linux Kernel with tracing
This video shows how to trace Hello World with Ftrace.
For Ftrace:
~$ sudo mount -t tracefs nodev /sys/kernel/tracing
~$ sudo ls /sys/kernel/tracing/
available_events current_tracer free_buffer max_graph_depth saved_cmdlines_size set_ftrace_notrace stack_max_size trace_clock tracing_cpumask
available_filter_functions dynamic_events function_profile_enabled options saved_tgids set_ftrace_notrace_pid stack_trace trace_marker tracing_max_latency
available_tracers dyn_ftrace_total_info hwlat_detector per_cpu set_event set_ftrace_pid stack_trace_filter trace_marker_raw tracing_on
buffer_percent enabled_functions instances printk_formats set_event_notrace_pid set_graph_function synthetic_events trace_options tracing_thresh
buffer_size_kb error_log kprobe_events README set_event_pid set_graph_notrace timestamp_mode trace_pipe uprobe_events
buffer_total_size_kb events kprobe_profile saved_cmdlines set_ftrace_filter snapshot trace trace_stat uprobe_profile
There is even a README file in the virtual file system!
~$ sudo cat /sys/kernel/tracing/README
tracing mini-HOWTO:
# echo 0 > tracing_on : quick way to disable tracing
# echo 1 > tracing_on : quick way to re-enable tracing
Important files:
trace - The static contents of the buffer
To clear the buffer write into this file: echo > trace
trace_pipe - A consuming read to see the contents of the buffer
current_tracer - function and latency tracers
available_tracers - list of configured tracers for current_tracer
error_log - error log for failed commands (that support it)
buffer_size_kb - view and modify size of per cpu buffer
buffer_total_size_kb - view total size of all cpu buffers
...
The trace file shows a lot of information, as below:
~$ sudo cat /sys/kernel/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 0/0 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
$ echo function | sudo tee /sys/kernel/tracing/current_tracer
function
$ sudo cat /sys/kernel/tracing/trace
...
<idle>-0 [007] dN.. 49264.400969: rcu_qs <-rcu_note_context_switch
<idle>-0 [007] dN.. 49264.400969: _raw_spin_lock <-__schedule
<idle>-0 [007] dN.. 49264.400969: update_rq_clock <-__schedule
<idle>-0 [007] dN.. 49264.400969: pick_next_task_fair <-__schedule
<idle>-0 [007] dN.. 49264.400969: put_prev_task_idle <-pick_next_task_fair
<idle>-0 [007] dN.. 49264.400970: pick_next_entity <-pick_next_task_fair
<idle>-0 [007] dN.. 49264.400970: clear_buddies <-pick_next_entity
<idle>-0 [007] dN.. 49264.400970: set_next_entity <-pick_next_task_fair
<idle>-0 [007] dN.. 49264.400971: __update_load_avg_se <-update_load_avg
<idle>-0 [007] dN.. 49264.400971: __update_load_avg_cfs_rq <-update_load_avg
<idle>-0 [007] d... 49264.400972: psi_task_switch <-__schedule
<idle>-0 [007] d... 49264.400972: psi_flags_change <-psi_task_switch
<idle>-0 [007] d... 49264.400972: psi_group_change <-psi_task_switch
<idle>-0 [007] d... 49264.400974: enter_lazy_tlb <-__schedule
kworker/u17:0-580536 [007] d... 49264.400975: finish_task_switch <-__schedule
kworker/u17:0-580536 [007] .... 49264.400976: wq_worker_running <-schedule
kworker/u17:0-580536 [007] .... 49264.400976: kthread_data <-wq_worker_running
kworker/u17:0-580536 [007] .... 49264.400977: _raw_spin_lock_irq <-worker_thread
kworker/u17:0-580536 [007] d... 49264.400977: process_one_work <-worker_thread
kworker/u17:0-580536 [007] .... 49264.400978: intel_atomic_commit_work <-process_one_work
kworker/u17:0-580536 [007] .... 49264.400979: intel_atomic_commit_tail <-intel_atomic_commit_work
kworker/u17:0-580536 [007] .... 49264.400979: intel_atomic_commit_fence_wait <-intel_atomic_commit_tail
kworker/u17:0-580536 [007] .... 49264.400979: init_wait_entry <-intel_atomic_commit_fence_wait
...
We operate Ftrace by using echo and cat on files in the kernel's virtual file systems: /sys/kernel/tracing/ and /sys/kernel/debug/. In this way we tell the kernel: please notify me when the events I care about happen.
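Because the control interface is just files, the echo/cat steps are easy to script. The following is a minimal sketch, not part of Ftrace itself: read_list and set_tracer are names made up for illustration, and the tracefs directory is passed in as a parameter so the code does not assume where it is mounted.

```python
import os


def read_list(tracefs, name):
    """Return the whitespace-separated words stored in a tracefs control file."""
    with open(os.path.join(tracefs, name)) as f:
        return f.read().split()


def set_tracer(tracefs, tracer):
    """Select a tracer, refusing names not listed in available_tracers."""
    if tracer not in read_list(tracefs, "available_tracers"):
        raise ValueError("tracer not available: " + tracer)
    # Equivalent to: echo <tracer> > current_tracer
    with open(os.path.join(tracefs, "current_tracer"), "w") as f:
        f.write(tracer + "\n")
```

Run as root, set_tracer("/sys/kernel/tracing", "function") would do the same as the echo command shown earlier, and set_tracer("/sys/kernel/tracing", "nop") would switch tracing back off.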
trace-cmd is a front end that makes Ftrace easier to use.
$ sudo apt install trace-cmd
$ sudo trace-cmd record -e syscalls -F ./hello
Hello world
CPU0 data recorded at offset=0x92b000
0 bytes in size
CPU1 data recorded at offset=0x92b000
4096 bytes in size
CPU2 data recorded at offset=0x92c000
0 bytes in size
CPU3 data recorded at offset=0x92c000
0 bytes in size
CPU4 data recorded at offset=0x92c000
0 bytes in size
CPU5 data recorded at offset=0x92c000
0 bytes in size
CPU6 data recorded at offset=0x92c000
0 bytes in size
CPU7 data recorded at offset=0x92c000
0 bytes in size
$ sudo trace-cmd report
trace-cmd: No such file or directory
Error: expected type 4 but read 5
CPU 0 is empty
CPU 2 is empty
CPU 3 is empty
CPU 4 is empty
CPU 5 is empty
CPU 6 is empty
CPU 7 is empty
cpus=8
hello-595328 [001] 52021.850476: sys_exit_write: 0x1
hello-595328 [001] 52021.850483: sys_enter_execve: filename: 0x7ffe7a40d853, argv: 0x7ffe7a40ce10, envp: 0x7ffe7a40ce20
hello-595328 [001] 52021.851088: sys_exit_execve: 0x0
hello-595328 [001] 52021.851163: sys_enter_brk: brk: 0x00000000
hello-595328 [001] 52021.851165: sys_exit_brk: 0x562413abc000
hello-595328 [001] 52021.851175: sys_enter_arch_prctl: option: 0x00003001, arg2: 0x7ffed32afc80
hello-595328 [001] 52021.851177: sys_exit_arch_prctl: 0xffffffffffffffea
hello-595328 [001] 52021.851237: sys_enter_access: filename: 0x7f3c281759e0, mode: 0x00000004
hello-595328 [001] 52021.851252: sys_exit_access: 0xfffffffffffffffe
hello-595328 [001] 52021.851266: sys_enter_openat: dfd: 0xffffff9c, filename: 0x7f3c28172b80, flags: 0x00080000, mode: 0x00000000
hello-595328 [001] 52021.851278: sys_exit_openat: 0x3
hello-595328 [001] 52021.851279: sys_enter_newfstat: fd: 0x00000003, statbuf: 0x7ffed32aee80
hello-595328 [001] 52021.851284: sys_exit_newfstat: 0x0
hello-595328 [001] 52021.851286: sys_enter_mmap: addr: 0x00000000, len: 0x000161f4, prot: 0x00000001, flags: 0x00000002, fd: 0x00000003, off: 0x00000000
hello-595328 [001] 52021.851294: sys_exit_mmap: 0x7f3c28136000
hello-595328 [001] 52021.851295: sys_enter_close: fd: 0x00000003
hello-595328 [001] 52021.851296: sys_exit_close: 0x0
hello-595328 [001] 52021.851323: sys_enter_openat: dfd: 0xffffff9c, filename: 0x7f3c2817ce10, flags: 0x00080000, mode: 0x00000000
hello-595328 [001] 52021.851333: sys_exit_openat: 0x3
hello-595328 [001] 52021.851334: sys_enter_read: fd: 0x00000003, buf: 0x7ffed32af028, count: 0x00000340
hello-595328 [001] 52021.851340: sys_exit_read: 0x340
hello-595328 [001] 52021.851341: sys_enter_pread64: fd: 0x00000003, buf: 0x7ffed32aec40, count: 0x00000310, pos: 0x00000040
hello-595328 [001] 52021.851344: sys_exit_pread64: 0x310
hello-595328 [001] 52021.851345: sys_enter_pread64: fd: 0x00000003, buf: 0x7ffed32aec10, count: 0x00000020, pos: 0x00000350
hello-595328 [001] 52021.851347: sys_exit_pread64: 0x20
hello-595328 [001] 52021.851348: sys_enter_pread64: fd: 0x00000003, buf: 0x7ffed32aebc0, count: 0x00000044, pos: 0x00000370
hello-595328 [001] 52021.851349: sys_exit_pread64: 0x44
hello-595328 [001] 52021.851351: sys_enter_newfstat: fd: 0x00000003, statbuf: 0x7ffed32aeed0
hello-595328 [001] 52021.851353: sys_exit_newfstat: 0x0
hello-595328 [001] 52021.851354: sys_enter_mmap: addr: 0x00000000, len: 0x00002000, prot: 0x00000003, flags: 0x00000022, fd: 0xffffffff, off: 0x00000000
hello-595328 [001] 52021.851361: sys_exit_mmap: 0x7f3c28134000
hello-595328 [001] 52021.851370: sys_enter_pread64: fd: 0x00000003, buf: 0x7ffed32aeb20, count: 0x00000310, pos: 0x00000040
hello-595328 [001] 52021.851372: sys_exit_pread64: 0x310
hello-595328 [001] 52021.851374: sys_enter_pread64: fd: 0x00000003, buf: 0x7ffed32ae800, count: 0x00000020, pos: 0x00000350
hello-595328 [001] 52021.851376: sys_exit_pread64: 0x20
hello-595328 [001] 52021.851377: sys_enter_pread64: fd: 0x00000003, buf: 0x7ffed32ae7e0, count: 0x00000044, pos: 0x00000370
hello-595328 [001] 52021.851379: sys_exit_pread64: 0x44
hello-595328 [001] 52021.851380: sys_enter_mmap: addr: 0x00000000, len: 0x001f1660, prot: 0x00000001, flags: 0x00000802, fd: 0x00000003, off: 0x00000000
hello-595328 [001] 52021.851391: sys_exit_mmap: 0x7f3c27f42000
hello-595328 [001] 52021.851392: sys_enter_mmap: addr: 0x7f3c27f64000, len: 0x00178000, prot: 0x00000005, flags: 0x00000812, fd: 0x00000003, off: 0x00022000
hello-595328 [001] 52021.851422: sys_exit_mmap: 0x7f3c27f64000
hello-595328 [001] 52021.851423: sys_enter_mmap: addr: 0x7f3c280dc000, len: 0x0004e000, prot: 0x00000001, flags: 0x00000812, fd: 0x00000003, off: 0x0019a000
hello-595328 [001] 52021.851438: sys_exit_mmap: 0x7f3c280dc000
hello-595328 [001] 52021.851438: sys_enter_mmap: addr: 0x7f3c2812a000, len: 0x00006000, prot: 0x00000003, flags: 0x00000812, fd: 0x00000003, off: 0x001e7000
hello-595328 [001] 52021.851453: sys_exit_mmap: 0x7f3c2812a000
hello-595328 [001] 52021.851465: sys_enter_mmap: addr: 0x7f3c28130000, len: 0x00003660, prot: 0x00000003, flags: 0x00000032, fd: 0xffffffff, off: 0x00000000
hello-595328 [001] 52021.851474: sys_exit_mmap: 0x7f3c28130000
hello-595328 [001] 52021.851494: sys_enter_close: fd: 0x00000003
hello-595328 [001] 52021.851495: sys_exit_close: 0x0
hello-595328 [001] 52021.851541: sys_enter_arch_prctl: option: 0x00001002, arg2: 0x7f3c28135540
hello-595328 [001] 52021.851542: sys_exit_arch_prctl: 0x0
hello-595328 [001] 52021.851674: sys_enter_mprotect: start: 0x7f3c2812a000, len: 0x00004000, prot: 0x00000001
hello-595328 [001] 52021.851690: sys_exit_mprotect: 0x0
hello-595328 [001] 52021.851697: sys_enter_mprotect: start: 0x5624121b4000, len: 0x00001000, prot: 0x00000001
hello-595328 [001] 52021.851704: sys_exit_mprotect: 0x0
hello-595328 [001] 52021.851714: sys_enter_mprotect: start: 0x7f3c2817a000, len: 0x00001000, prot: 0x00000001
hello-595328 [001] 52021.851725: sys_exit_mprotect: 0x0
hello-595328 [001] 52021.851727: sys_enter_munmap: addr: 0x7f3c28136000, len: 0x000161f4
hello-595328 [001] 52021.851760: sys_exit_munmap: 0x0
hello-595328 [001] 52021.851899: sys_enter_newfstat: fd: 0x00000001, statbuf: 0x7ffed32afae0
hello-595328 [001] 52021.851902: sys_exit_newfstat: 0x0
hello-595328 [001] 52021.852057: sys_enter_brk: brk: 0x00000000
hello-595328 [001] 52021.852058: sys_exit_brk: 0x562413abc000
hello-595328 [001] 52021.852059: sys_enter_brk: brk: 0x562413add000
hello-595328 [001] 52021.852065: sys_exit_brk: 0x562413add000
hello-595328 [001] 52021.852081: sys_enter_write: fd: 0x00000001, buf: 0x562413abc2a0, count: 0x0000000c
hello-595328 [001] 52021.852113: sys_exit_write: 0xc
hello-595328 [001] 52021.852134: sys_enter_exit_group: error_code: 0x00000000
sys_enter_write and sys_exit_write mark where the puts function enters and leaves the kernel. This output is too messy to understand what the kernel does for hello.c. Ftrace provides many options to select which part we want to follow.
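One way to make the report easier to read is to pair each sys_enter event with its sys_exit event and look at how long every system call took. This is a minimal sketch that assumes the line layout shown in the report above; syscall_durations is a name made up for illustration, and the two sample lines are the write events copied from that report.

```python
import re

# One line of `trace-cmd report` output for a syscall event.
EVENT_RE = re.compile(
    r"^\s*(?P<task>\S+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<sec>\d+)\.(?P<usec>\d{6}):\s+sys_(?P<kind>enter|exit)_(?P<name>\w+):"
)


def syscall_durations(lines):
    """Pair sys_enter_X with the next sys_exit_X of the same PID.

    Returns a list of (syscall name, duration in microseconds).
    """
    pending = {}  # pid -> (name, enter timestamp in microseconds)
    result = []
    for line in lines:
        m = EVENT_RE.match(line)
        if not m:
            continue  # header lines, "CPU n is empty", and so on
        # Integer microseconds avoid floating-point rounding.
        ts = int(m["sec"]) * 1_000_000 + int(m["usec"])
        pid, name = m["pid"], m["name"]
        if m["kind"] == "enter":
            pending[pid] = (name, ts)
        elif pid in pending and pending[pid][0] == name:
            result.append((name, ts - pending.pop(pid)[1]))
    return result


if __name__ == "__main__":
    sample = [
        "hello-595328 [001] 52021.852081: sys_enter_write: fd: 0x00000001, "
        "buf: 0x562413abc2a0, count: 0x0000000c",
        "hello-595328 [001] 52021.852113: sys_exit_write: 0xc",
    ]
    print(syscall_durations(sample))
```

For the write issued by puts, the two timestamps are 32 microseconds apart. Exit events without a matching enter, such as the first sys_exit_write line of the report, are skipped.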
For example:
$ sudo trace-cmd record -p function_graph -g __x64_sys_write -F ./hello
plugin 'function_graph'
Hello world
CPU0 data recorded at offset=0x8d1000
0 bytes in size
CPU1 data recorded at offset=0x8d1000
0 bytes in size
CPU2 data recorded at offset=0x8d1000
0 bytes in size
CPU3 data recorded at offset=0x8d1000
0 bytes in size
CPU4 data recorded at offset=0x8d1000
0 bytes in size
CPU5 data recorded at offset=0x8d1000
20480 bytes in size
CPU6 data recorded at offset=0x8d6000
0 bytes in size
CPU7 data recorded at offset=0x8d6000
0 bytes in size
$ sudo trace-cmd report | cut -d "|" -f 2
trace-cmd: No such file or directory
Error: expected type 4 but read 5
CPU 0 is empty
CPU 1 is empty
CPU 2 is empty
CPU 3 is empty
CPU 4 is empty
CPU 6 is empty
CPU 7 is empty
cpus=8
mutex_unlock();
__x64_sys_write() {
ksys_write() {
__fdget_pos() {
__fget_light();
}
vfs_write() {
rw_verify_area() {
security_file_permission() {
apparmor_file_permission() {
common_file_perm() {
aa_file_perm() {
rcu_read_unlock_strict();
}
}
}
}
}
new_sync_write() {
tty_write() {
file_tty_write.isra.0() {
tty_paranoia_check();
tty_ldisc_ref_wait() {
ldsem_down_read() {
__cond_resched() {
rcu_all_qs();
}
}
}
tty_write_lock() {
mutex_trylock();
}
__check_object_size() {
check_stack_object();
__virt_addr_valid();
__check_heap_object();
}
n_tty_write() {
down_read() {
__cond_resched() {
rcu_all_qs();
}
}
process_echoes();
add_wait_queue() {
_raw_spin_lock_irqsave();
__lock_text_start();
}
tty_hung_up_p();
mutex_lock() {
__cond_resched() {
rcu_all_qs();
}
}
tty_write_room() {
pty_write_room() {
tty_buffer_space_avail();
}
}
pty_write() {
_raw_spin_lock_irqsave();
tty_insert_flip_string_fixed_flag() {
__tty_buffer_request_room();
}
__lock_text_start();
tty_flip_buffer_push() {
queue_work_on() {
__queue_work() {
get_work_pool();
_raw_spin_lock();
insert_work() {
wake_up_process() {
try_to_wake_up() {
_raw_spin_lock_irqsave();
__traceiter_sched_waking();
select_task_rq_fair() {
available_idle_cpu();
available_idle_cpu();
select_idle_sibling() {
available_idle_cpu();
}
rcu_read_unlock_strict();
}
kthread_is_per_cpu();
ttwu_queue_wakelist();
_raw_spin_lock();
update_rq_clock();
ttwu_do_activate() {
psi_task_change() {
psi_flags_change();
psi_group_change();
}
enqueue_task_fair() {
enqueue_entity() {
update_curr();
__update_load_avg_se();
__update_load_avg_cfs_rq();
update_cfs_group();
__enqueue_entity();
}
hrtick_update();
}
ttwu_do_wakeup() {
check_preempt_curr() {
resched_curr();
}
__traceiter_sched_wakeup();
}
}
__lock_text_start();
}
}
}
rcu_read_unlock_strict();
}
irq_enter_rcu();
__sysvec_irq_work() {
__wake_up() {
__wake_up_common_lock() {
_raw_spin_lock_irqsave();
__wake_up_common();
__lock_text_start();
}
}
__wake_up() {
__wake_up_common_lock() {
_raw_spin_lock_irqsave();
__wake_up_common() {
autoremove_wake_function() {
default_wake_function() {
try_to_wake_up() {
_raw_spin_lock_irqsave();
__traceiter_sched_waking();
select_task_rq_fair() {
select_idle_sibling() {
available_idle_cpu();
cpus_share_cache();
available_idle_cpu();
select_idle_cpu() {
available_idle_cpu();
available_idle_cpu();
available_idle_cpu();
available_idle_cpu();
}
}
rcu_read_unlock_strict();
}
set_task_cpu() {
migrate_task_rq_fair() {
remove_entity_load_avg() {
__update_load_avg_blocked_se();
_raw_spin_lock_irqsave();
__lock_text_start();
}
}
set_task_rq_fair();
}
ttwu_queue_wakelist();
_raw_spin_lock();
update_rq_clock();
ttwu_do_activate() {
psi_task_change() {
psi_flags_change();
psi_group_change();
psi_group_change();
psi_group_change();
psi_group_change();
psi_group_change();
psi_group_change();
psi_group_change();
}
enqueue_task_fair() {
enqueue_entity() {
update_curr();
__update_load_avg_cfs_rq();
attach_entity_load_avg();
update_cfs_group();
__enqueue_entity();
}
enqueue_entity() {
update_curr();
__update_load_avg_se();
__update_load_avg_cfs_rq();
update_cfs_group() {
reweight_entity();
}
__enqueue_entity();
}
hrtick_update();
}
ttwu_do_wakeup() {
check_preempt_curr() {
resched_curr();
}
__traceiter_sched_wakeup();
}
}
__lock_text_start();
}
}
}
}
__lock_text_start();
}
}
}
irq_exit_rcu() {
idle_cpu();
}
}
}
}
mutex_unlock();
mutex_lock() {
__cond_resched() {
rcu_all_qs();
}
}
tty_write_room() {
pty_write_room() {
tty_buffer_space_avail();
}
}
do_output_char() {
pty_write() {
_raw_spin_lock_irqsave();
tty_insert_flip_string_fixed_flag() {
__tty_buffer_request_room();
}
__lock_text_start() {
irq_enter_rcu();
__sysvec_irq_work() {
__wake_up() {
__wake_up_common_lock() {
_raw_spin_lock_irqsave();
__wake_up_common();
__lock_text_start();
}
}
__wake_up() {
__wake_up_common_lock() {
_raw_spin_lock_irqsave();
__wake_up_common() {
autoremove_wake_function() {
default_wake_function() {
try_to_wake_up() {
_raw_spin_lock_irqsave();
__traceiter_sched_waking();
select_task_rq_fair() {
available_idle_cpu();
available_idle_cpu();
select_idle_sibling() {
available_idle_cpu();
}
rcu_read_unlock_strict();
}
ttwu_queue_wakelist();
_raw_spin_lock();
update_rq_clock();
ttwu_do_activate() {
psi_task_change() {
psi_flags_change();
psi_group_change();
psi_group_change();
psi_group_change();
psi_group_change();
psi_group_change();
psi_group_change();
psi_group_change();
}
enqueue_task_fair() {
enqueue_entity() {
update_curr();
__update_load_avg_se();
__update_load_avg_cfs_rq();
update_cfs_group();
__enqueue_entity();
}
enqueue_entity() {
update_curr();
__update_load_avg_se();
__update_load_avg_cfs_rq();
update_cfs_group() {
reweight_entity();
}
__enqueue_entity();
}
hrtick_update();
}
ttwu_do_wakeup() {
check_preempt_curr() {
resched_curr();
}
__traceiter_sched_wakeup();
}
}
__lock_text_start();
}
}
}
}
__lock_text_start();
}
}
}
irq_exit_rcu() {
idle_cpu();
}
}
tty_flip_buffer_push() {
queue_work_on() {
__queue_work() {
get_work_pool();
_raw_spin_lock();
insert_work() {
wake_up_process() {
try_to_wake_up() {
_raw_spin_lock_irqsave();
__traceiter_sched_waking();
select_task_rq_fair() {
available_idle_cpu();
available_idle_cpu();
select_idle_sibling() {
available_idle_cpu();
}
rcu_read_unlock_strict();
}
kthread_is_per_cpu();
ttwu_queue_wakelist();
_raw_spin_lock();
update_rq_clock();
ttwu_do_activate() {
psi_task_change() {
psi_flags_change();
psi_group_change();
}
enqueue_task_fair() {
enqueue_entity() {
update_curr();
__update_load_avg_se();
__update_load_avg_cfs_rq();
update_cfs_group();
__enqueue_entity();
}
hrtick_update();
}
ttwu_do_wakeup() {
check_preempt_curr() {
resched_curr();
}
__traceiter_sched_wakeup();
}
}
__lock_text_start();
}
}
}
rcu_read_unlock_strict();
}
}
}
}
}
mutex_unlock();
remove_wait_queue() {
_raw_spin_lock_irqsave();
__lock_text_start();
}
up_read();
}
ktime_get_real_seconds();
tty_write_unlock() {
mutex_unlock();
__wake_up() {
__wake_up_common_lock() {
_raw_spin_lock_irqsave();
__wake_up_common();
__lock_text_start();
}
}
}
tty_ldisc_deref() {
ldsem_up_read();
}
}
}
}
__fsnotify_parent();
}
}
}
By referring to the kernel source code, we can follow what the puts function does inside the kernel.
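The call graph above is shown without indentation, which makes the nesting hard to follow, but the depth of every call can be recovered from the braces alone. The sketch below does that; indent_call_graph is a hypothetical helper, and it assumes each line is either "name() {", "name();" or "}", as in the listing.

```python
def indent_call_graph(lines, width=2):
    """Re-indent function_graph output using only its braces for nesting."""
    depth = 0
    out = []
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        if text.startswith("}"):
            depth = max(depth - 1, 0)  # a closing brace ends the current call
        out.append(" " * (width * depth) + text)
        if text.endswith("{"):
            depth += 1  # "func() {" opens a nested call
    return out


if __name__ == "__main__":
    sample = [
        "__x64_sys_write() {",
        "ksys_write() {",
        "__fdget_pos() {",
        "__fget_light();",
        "}",
        "}",
        "}",
    ]
    print("\n".join(indent_call_graph(sample)))
```

Feeding the whole listing through this function shows, for example, that tty_write and everything below it run inside vfs_write.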
eBPF is an in-kernel virtual machine with a JIT compiler. Through the bpf(2) system call, a user can load code written in the eBPF instruction set into the kernel. This code starts to run when it is triggered by a hook such as a kprobe or a tracepoint.
Kernel Analysis Using eBPF - Daniel Thompson, Linaro
https://www.kernel.org/doc/html/v5.17/bpf/classic_vs_extended.html
BPF (Berkeley Packet Filter) is a filter for network packets, and some network tools, e.g. tcpdump and host…, are built on BPF.
However, eBPF (Extended Berkeley Packet Filter) is hugely different from BPF:
We will start from BPF (or, more precisely, cBPF). It needs to be built into the kernel by enabling the Kconfig option CONFIG_BPF_SYSCALL.
Prepare the BPF code in user space; it is compiled to BPF bytecode and loaded into the kernel's sandbox.
When the bytecode is loaded from user space, the kernel first verifies that it is safe.
After the BPF code passes the verifier, it is attached to one of the following event sources:
a. kprobes: kernel dynamic tracing.
b. uprobes: user level dynamic tracing.
c. tracepoints: kernel static tracing.
d. perf_events: timed sampling and PMCs (performance monitoring counters).
The results can be delivered in two ways:
a. perf output: each event record is sent to user space
b. maps: data is aggregated in kernel space and the statistics are read from user space
BPF can be considered an in-kernel virtual machine.
tcpdump (through the libpcap library) uses it to filter network packets.
tcpdump(8) - Linux man page
Tcpdump prints out a description of the contents of packets on a network interface that match the boolean expression.
The kernel reserves a region of kernel space in which the BPF program runs.
The kernel runs the BPF bytecode on each packet and sends the result (packet information) back to user space.
$ sudo tcpdump -p -ni wlp58s0 "arp"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on wlp58s0, link-type EN10MB (Ethernet), capture size 262144 bytes
01:32:09.903540 ARP, Request who-has 192.168.1.100 tell 192.168.1.1, length 28
01:32:19.938171 ARP, Request who-has 192.168.1.101 tell 192.168.1.1, length 28
01:32:21.884529 ARP, Request who-has 192.168.1.101 tell 192.168.1.1, length 28
01:32:21.884985 ARP, Request who-has 192.168.1.100 tell 192.168.1.1, length 28
The BPF machine abstraction consists of an accumulator, an index register (x), a scratch memory store, and an implicit program counter. The operations on these elements can be categorized into load instructions, store instructions, ALU instructions, branch instructions and miscellaneous instructions.
Here is an example of how the packet filter works in BPF.
-d Dump the compiled packet-matching code in a human readable form to standard output and stop.
$ sudo tcpdump -p -d -ni wlp58s0 "arp"
(000) ldh [12]
(001) jeq #0x806 jt 2 jf 3
(002) ret #262144
(003) ret #0
-dd Dump packet-matching code as a C program fragment.
jt: jump target if true; jf: jump target if false
$ sudo tcpdump -p -dd -ni wlp58s0 "arp"
{ 0x28, 0, 0, 0x0000000c },
{ 0x15, 0, 1, 0x00000806 },
{ 0x6, 0, 0, 0x00040000 },
{ 0x6, 0, 0, 0x00000000 },
(001) -> jump to (002) if the EtherType of the Ethernet frame is 0x806 (ARP), otherwise jump to (003)
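The first number of each row in the -dd output is the opcode, and its bits encode the instruction class, operand size and addressing mode. The sketch below splits an opcode into those fields. The bit values are the classic BPF ones from <linux/bpf_common.h> and <linux/filter.h>, while the table and function names are made up for illustration.

```python
# Bit fields of a classic BPF opcode.
CLASSES = {0x00: "ld", 0x01: "ldx", 0x02: "st", 0x03: "stx",
           0x04: "alu", 0x05: "jmp", 0x06: "ret", 0x07: "misc"}
SIZES = {0x00: "w", 0x08: "h", 0x10: "b"}      # word, half-word, byte
MODES = {0x00: "imm", 0x20: "abs", 0x40: "ind",
         0x60: "mem", 0x80: "len", 0xA0: "msh"}
JUMPS = {0x00: "ja", 0x10: "jeq", 0x20: "jgt", 0x30: "jge", 0x40: "jset"}
OPERANDS = {0x00: "k", 0x08: "x", 0x10: "a"}   # constant, X register, A


def decode(code):
    """Split a cBPF opcode into the mnemonic parts of its bit fields."""
    cls = CLASSES[code & 0x07]
    if cls in ("ld", "ldx"):
        return (cls, SIZES[code & 0x18], MODES[code & 0xE0])
    if cls == "jmp":
        return (cls, JUMPS[code & 0xF0], OPERANDS[code & 0x08])
    if cls == "ret":
        return (cls, OPERANDS[code & 0x18])
    return (cls,)


if __name__ == "__main__":
    # The three distinct opcodes used by the "arp" filter.
    for code in (0x28, 0x15, 0x06):
        print(hex(code), decode(code))
```

So 0x28 is a half-word load from an absolute packet offset (ldh), 0x15 is a jump-if-equal against a constant (jeq), and 0x6 returns a constant (ret), which matches the -d listing.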
Ethernet II
Ethernet II framing (also known as DIX Ethernet, named after DEC, Intel and Xerox, the major participants in its design), defines the two-octet EtherType field in an Ethernet frame, preceded by destination and source MAC addresses, that identifies an upper layer protocol encapsulated by the frame data. For example, an EtherType value of 0x0800 signals that the frame contains an IPv4 datagram. Likewise, an EtherType of 0x0806 indicates an ARP frame, 0x86DD indicates an IPv6 frame and 0x8100 indicates the presence of an IEEE 802.1Q tag.
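To see the BPF machine at work, the four instructions printed by tcpdump -dd above can be executed by a toy interpreter. This is only a user-space sketch of the semantics, not the kernel's implementation: it supports just the three opcodes this filter uses, and run_filter and ethernet_frame are hypothetical names.

```python
import struct

# The "arp" filter exactly as dumped by tcpdump -dd: (code, jt, jf, k).
ARP_FILTER = [
    (0x28, 0, 0, 0x0000000C),  # ldh [12]
    (0x15, 0, 1, 0x00000806),  # jeq #0x806, jt 2, jf 3
    (0x06, 0, 0, 0x00040000),  # ret #262144
    (0x06, 0, 0, 0x00000000),  # ret #0
]


def run_filter(program, packet):
    """Interpret the three cBPF opcodes used by the ARP filter.

    Returns the number of packet bytes to keep (0 means drop).
    """
    acc = 0  # the accumulator register A
    pc = 0   # the implicit program counter
    while pc < len(program):
        code, jt, jf, k = program[pc]
        pc += 1
        if code == 0x28:    # ldh [k]: load a big-endian half-word
            if k + 2 > len(packet):
                return 0    # an out-of-bounds load rejects the packet
            (acc,) = struct.unpack_from("!H", packet, k)
        elif code == 0x15:  # jeq #k: offsets count from the next instruction
            pc += jt if acc == k else jf
        elif code == 0x06:  # ret #k
            return k
        else:
            raise NotImplementedError(hex(code))
    return 0


def ethernet_frame(ethertype, payload=b""):
    """Build a frame with zeroed MAC addresses and the given EtherType."""
    return bytes(6) + bytes(6) + struct.pack("!H", ethertype) + payload


if __name__ == "__main__":
    print(run_filter(ARP_FILTER, ethernet_frame(0x0806)))  # ARP frame
    print(run_filter(ARP_FILTER, ethernet_frame(0x0800)))  # IPv4 frame
```

The ARP frame yields 262144, the capture size tcpdump reported, and the IPv4 frame yields 0, so it is dropped. Note that jt and jf in the -dd form are offsets relative to the next instruction, while the -d form prints them as absolute instruction numbers.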
For more BPF bytecode examples: https://www.kernel.org/doc/Documentation/networking/filter.txt
We can write an eBPF program and load it into the kernel's sandbox, where the JIT compiler translates it to native code.
eBPF programs are triggered by events in the kernel. When execution reaches a hook, the eBPF program attached to it runs to capture and manipulate the data. The variety of hook points is one of the strengths of eBPF.
Two features are central to how an eBPF program works:
Helper functions
Helper functions are called by the eBPF program when it is triggered. These special functions give eBPF a rich set of operations, such as accessing memory.
All helper functions are defined by the kernel as a whitelist.
https://man7.org/linux/man-pages/man7/bpf-helpers.7.html
Maps
To store and share data between eBPF programs and the kernel and user space, eBPF requires the use of Maps.
The following are some of the supported map types, to give an idea of the diversity of data structures. For various map types, both a shared and a per-CPU variation are available.
- Hash tables, Arrays
- LRU (Least Recently Used)
- Ring Buffer
- Stack Trace
- LPM (Longest Prefix match)
- Todo: XDP (eXpress Data Path)
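To give a feeling for what one of these map types does, here is a user-space model of a longest-prefix-match lookup. It only illustrates the lookup semantics and is not the kernel's map API or its implementation; lpm_lookup and the sample table are invented for this example.

```python
import ipaddress


def lpm_lookup(table, address):
    """Return the value stored under the longest prefix containing address."""
    addr = ipaddress.ip_address(address)
    best = None  # (prefix length, value) of the best match so far
    for prefix, value in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, value)
    return None if best is None else best[1]


if __name__ == "__main__":
    # Hypothetical table mapping prefixes to labels.
    routes = {
        "0.0.0.0/0": "default",
        "192.168.0.0/16": "lan",
        "192.168.1.0/24": "wifi",
    }
    for ip in ("192.168.1.100", "192.168.2.1", "8.8.8.8"):
        print(ip, lpm_lookup(routes, ip))
```

The address 192.168.1.100 matches all three prefixes, and the /24 entry wins because it is the most specific one. A linear scan like this is only for illustration.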
$ vim who-open.bt
tracepoint:syscalls:sys_enter_openat
{
printf("%s %s\n", comm, str(args->filename));
}
$ sudo bpftrace who-open.bt
Attaching 1 probe...
TaniumClient /opt/Tanium/TaniumClient/extensions/core/mailbox/inbox
minicom /var/lock/LCK..ttyUSB0
minicom /var/lock/LCK..ttyUSB0
minicom /dev/ttyUSB0
TaniumClient /opt/Tanium/TaniumClient/extensions/core/mailbox/inbox
sudo /usr/lib/sudo/tls/haswell/x86_64/libaudit.so.1
sudo /usr/lib/sudo/tls/haswell/libaudit.so.1
sudo /usr/lib/sudo/tls/x86_64/libaudit.so.1
sudo /usr/lib/sudo/tls/libaudit.so.1
sudo /usr/lib/sudo/haswell/x86_64/libaudit.so.1
sudo /usr/lib/sudo/haswell/libaudit.so.1
sudo /usr/lib/sudo/x86_64/libaudit.so.1
sudo /usr/lib/sudo/libaudit.so.1
...
Now we can open another terminal and run "ls" so that it opens some files. But there is still too much information.
We can add a filter to the bpftrace script:
tracepoint:syscalls:sys_enter_openat
/comm == "ls"/
{
printf("%s %s\n", comm, str(args->filename));
}
$ sudo bpftrace who-open.bt
Attaching 1 probe...
ls tls/haswell/x86_64/libselinux.so.1
ls tls/haswell/libselinux.so.1
ls tls/x86_64/libselinux.so.1
ls tls/libselinux.so.1
ls haswell/x86_64/libselinux.so.1
ls haswell/libselinux.so.1
ls x86_64/libselinux.so.1
ls libselinux.so.1
ls /home/ANT.AMAZON.COM/cheyenyu/Cocoa_red_tools/tls/haswell/x86_6
ls /home/ANT.AMAZON.COM/cheyenyu/Cocoa_red_tools/tls/haswell/libse
ls /home/ANT.AMAZON.COM/cheyenyu/Cocoa_red_tools/tls/x86_64/libsel
ls /home/ANT.AMAZON.COM/cheyenyu/Cocoa_red_tools/tls/libselinux.so
ls /home/ANT.AMAZON.COM/cheyenyu/Cocoa_red_tools/haswell/x86_64/li
ls /home/ANT.AMAZON.COM/cheyenyu/Cocoa_red_tools/haswell/libselinu
ls /home/ANT.AMAZON.COM/cheyenyu/Cocoa_red_tools/x86_64/libselinux
ls /home/ANT.AMAZON.COM/cheyenyu/Cocoa_red_tools/libselinux.so.1
ls ./red-tools/linux64/tls/haswell/x86_64/libselinux.so.1
ls ./red-tools/linux64/tls/haswell/libselinux.so.1
ls ./red-tools/linux64/tls/x86_64/libselinux.so.1
ls ./red-tools/linux64/tls/libselinux.so.1
ls ./red-tools/linux64/haswell/x86_64/libselinux.so.1
This is how eBPF can do something like "strace", but there are many more things we can do with eBPF.
$ sudo bpftrace -l
tracepoint:xen:xen_mmu_alloc_ptpage
tracepoint:xen:xen_mmu_release_ptpage
tracepoint:xen:xen_mmu_pgd_pin
tracepoint:xen:xen_mmu_pgd_unpin
tracepoint:xen:xen_mmu_flush_tlb_one_user
tracepoint:xen:xen_mmu_flush_tlb_multi
tracepoint:xen:xen_mmu_write_cr3
tracepoint:xen:xen_cpu_write_ldt_entry
tracepoint:xen:xen_cpu_write_idt_entry
tracepoint:xen:xen_cpu_load_idt
tracepoint:xen:xen_cpu_write_gdt_entry
tracepoint:xen:xen_cpu_set_ldt
tracepoint:vsyscall:emulate_vsyscall
tracepoint:initcall:initcall_level
tracepoint:initcall:initcall_start
...
kprobe:iwl_pcie_irq_msix_handler
kprobe:iwl_pcie_txq_inc_wr_ptr
kprobe:iwl_pcie_txq_build_tfd
kprobe:iwl_fill_data_tbs_amsdu.constprop.0
kprobe:iwl_fill_data_tbs
kprobe:iwl_pcie_clear_cmd_in_flight
kprobe:iwl_pcie_txq_unmap
kprobe:iwl_pcie_alloc_dma_ptr
kprobe:iwl_pcie_free_dma_ptr
kprobe:iwl_pcie_txq_check_wrptrs
$ sudo bpftrace -lv tracepoint:xen:xen_cpu_load_idt
tracepoint:xen:xen_cpu_load_idt
unsigned long addr;
$ sudo bpftrace -lv tracepoint:initcall:initcall_finish
tracepoint:initcall:initcall_finish
initcall_t func;
int ret;
https://blog.sofiane.cc/eBPFunderstanding/
https://github.com/byoman/intro-ebpf/blob/master/notes.md
https://hackmd.io/@sysprog/linux-ebpf