###### tags: `Profiling` # kprobes ## history - Been around since 2004 (developed by IBM) - basic infrastructure - kprobe events (act like trace event) - Introduced in 2009 (by Masami Hiramatsu) - Allows to create dynamic events - Automatically check instruction boundary - BPF attached to kprobes since kernel 4.1 ## kprobes implementation kprobe 會將目標指令的第一個 byte 替換成 breakpoint 指令 (x86 上為 int3),並將原始指令存入 kprobe 相關的資料結構,當執行到目標指令就會觸發 breakpoint exception,執行註冊的 kprobe pre-handler,然後設定 TF = 1 (Trap flag) 並單步執行原始的指令後,觸發一個 debug exception,最後執行註冊的 post-handler。 以下圖為例: ![](https://i.imgur.com/mTh3XTe.png) ```c struct kprobe { struct hlist_node hlist; /* Internal */ kprobe_opcode_t addr; /* Address of probe */ kprobe_pre_handler_t pre_handler; /* Address of pre-handler */ kprobe_post_handler_t post_handler; /* Address of post-handler */ kprobe_fault_handler_t fault_handler; /* Address of fault handler */ kprobe_break_handler_t break_handler; /* Internal */ kprobe_opcode_t opcode; /* Internal */ kprobe_opcode_t insn[MAX_INSN_SIZE]; /* Internal */ }; ``` `pre_handler` 和 `post_handler` 是 function pointer 指向使用者自行定義的函式,`opcode` 則是被 breakpoint 指令替換前原始的指令。 範例: ```c /* * NOTE: This example is works on x86 and powerpc. * Here's a sample kernel module showing the use of kprobes to dump a * stack trace and selected registers when do_fork() is called. * * For more information on theory of operation of kprobes, see * Documentation/kprobes.txt * * You will see the trace data in /var/log/messages and on the console * whenever do_fork() is invoked to create a new process. */ #include <linux/kernel.h> #include <linux/module.h> #include <linux/kprobes.h> /* For each probe you need to allocate a kprobe structure */ static struct kprobe kp = { .symbol_name = "_do_fork", }; /* kprobe pre_handler: called just before the probed instruction is executed */ static int handler_pre(struct kprobe *p, struct pt_regs *regs) { printk(KERN_INFO "pre_handler: p->addr = 0x%p, ip = %lx," " flags = 0x%lx\n", p->addr, regs->ip, regs->flags); dump_stack(); return 0; } /* kprobe post_handler: called after the probed instruction is executed */ static void handler_post(struct kprobe *p, struct pt_regs *regs, unsigned long flags) { printk(KERN_INFO "post_handler: p->addr = 0x%p, flags = 0x%lx\n", p->addr, regs->flags); } /* * fault_handler: this is called if an exception is generated for any * instruction within the pre- or post-handler, or when Kprobes * single-steps the probed instruction. */ static int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr) { printk(KERN_INFO "fault_handler: p->addr = 0x%p, trap #%dn", p->addr, trapnr); /* Return 0 because we don't handle the fault. */ return 0; } static int __init kprobe_init(void) { int ret; kp.pre_handler = handler_pre; kp.post_handler = handler_post; kp.fault_handler = handler_fault; ret = register_kprobe(&kp); if (ret < 0) { printk(KERN_INFO "register_kprobe failed, returned %d\n", ret); return ret; } printk(KERN_INFO "Planted kprobe at %p\n", kp.addr); return 0; } static void __exit kprobe_exit(void) { unregister_kprobe(&kp); printk(KERN_INFO "kprobe at %p unregistered\n", kp.addr); } module_init(kprobe_init) module_exit(kprobe_exit) MODULE_LICENSE("GPL"); ``` ## kprobe_events CONFIG_KPROBE_EVENTS=y https://www.kernel.org/doc/html/latest/trace/kprobetrace.html ``` p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS] : Set a probe r[MAXACTIVE][:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS] : Set a return probe -:[GRP/]EVENT : Clear a probe GRP : Group name. If omitted, use "kprobes" for it. EVENT : Event name. If omitted, the event name is generated based on SYM+offs or MEMADDR. MOD : Module name which has given SYM. SYM[+offs] : Symbol+offset where the probe is inserted. MEMADDR : Address where the probe is inserted. MAXACTIVE : Maximum number of instances of the specified function that can be probed simultaneously, or 0 for the default value as defined in Documentation/kprobes.txt section 1.3.1. FETCHARGS : Arguments. Each probe can have up to 128 args. %REG : Fetch register REG @ADDR : Fetch memory at ADDR (ADDR should be in kernel) @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol) $stackN : Fetch Nth entry of stack (N >= 0) $stack : Fetch stack address. $argN : Fetch the Nth function argument. (N >= 1) (\*1) $retval : Fetch return value.(\*2) $comm : Fetch current task comm. +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4) \IMM : Store an immediate value to the argument. NAME=FETCHARG : Set NAME as the argument name of FETCHARG. FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types (x8/x16/x32/x64), "string", "ustring" and bitfield are supported. (\*1) only for the probe on function entry (offs == 0). (\*2) only for return probe. (\*3) this is useful for fetching a field of data structures. (\*4) "u" means user-space dereference. See :ref:`user_mem_access`. ``` ### some important files - enable kprobe: ```/sys/kernel/debug/tracing/events/kprobes/<event name>/enable``` - register/unregister kprobe: ```/sys/kernel/debug/tracing/kprobe_events``` - list kprobe: `/sys/kernel/debug/kprobes/list` - trace ouput: `/sys/kernel/debug/tracing/trace` trace open system call `open()`: ```c SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode) { if (force_o_largefile()) flags |= O_LARGEFILE; return do_sys_open(AT_FDCWD, filename, flags, mode); } ``` ### kprobe usage: add a probe to `do_sys_open()` x86_64 calling convention ``` x86 function call convention, 64-bit: ------------------------------------- arguments | callee-saved | extra caller-saved | return [callee-clobbered] | | [callee-clobbered] | --------------------------------------------------------------------------- rdi rsi rdx rcx r8-9 | rbx rbp [*] r12-15 | r10-11 | rax, rdx [**] ``` get the address of `filename` character array ```bash echo 'p:myprobe do_sys_open file=%si' > /sys/kernel/debug/tracing/kprobe_events ``` print the `filename` string ```bash echo 'p:myprobe do_sys_open file=+0(%si):string' > /sys/kernel/debug/tracing/kprobe_events echo 'p:myprobe do_sys_open file=+0($arg2):string' > /sys/kernel/debug/tracing/kprobe_events ``` ### Example: trace `do_sys_open()` ```sh # echo 'p:myprobe do_sys_open file=+0(%si):string' > /sys/kernel/debug/tracing/kprobe_events # echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable # cat /sys/kernel/debug/tracing/trace # tracer: nop # # entries-in-buffer/entries-written: 115/115 #P:4 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | ... sensors-20360 [001] .... 265527.761064: myprobe: (do_sys_open+0x0/0x2d0) file="/sys/class/hwmon/hwmon1/temp3_label" sensors-20360 [001] .... 265527.761078: myprobe: (do_sys_open+0x0/0x2d0) file="/sys/class/hwmon/hwmon1/temp3_input" sensors-20360 [001] .... 265527.761092: myprobe: (do_sys_open+0x0/0x2d0) file="/sys/class/hwmon/hwmon1/temp3_input" sensors-20360 [001] .... 265527.761105: myprobe: (do_sys_open+0x0/0x2d0) file="/sys/class/hwmon/hwmon1/temp3_crit_alarm" sensors-20360 [001] .... 265527.761119: myprobe: (do_sys_open+0x0/0x2d0) file="/sys/class/hwmon/hwmon1/temp3_max" sensors-20360 [001] .... 265527.761131: myprobe: (do_sys_open+0x0/0x2d0) file="/sys/class/hwmon/hwmon1/temp3_crit" code-17029 [003] .... 265527.799119: myprobe: (do_sys_open+0x0/0x2d0) file="/proc/17620/cmdline" code-7612 [002] .... 265527.800211: myprobe: (do_sys_open+0x0/0x2d0) file="/proc/9371/cmdline" code-7612 [002] .... 265527.800384: myprobe: (do_sys_open+0x0/0x2d0) file="/proc/9394/cmdline" code-7612 [002] .... 265527.800543: myprobe: (do_sys_open+0x0/0x2d0) file="/proc/9381/cmdline" cat-20361 [002] .... 265528.280524: myprobe: (do_sys_open+0x0/0x2d0) file="/etc/ld.so.cache" cat-20361 [002] .... 265528.280620: myprobe: (do_sys_open+0x0/0x2d0) file="/lib/x86_64-linux-gnu/libc.so.6" cat-20361 [002] .... 265528.281240: myprobe: (do_sys_open+0x0/0x2d0) file="/usr/lib/locale/locale-archive" cat-20361 [002] .... 265528.281340: myprobe: (do_sys_open+0x0/0x2d0) file="/sys/kernel/debug/tracing/trace" ``` ## BPF attached to kprobes ```python #!/usr/bin/python from bcc import BPF from time import sleep import operator prog=""" #include <uapi/linux/ptrace.h> #include <linux/sched.h> struct key_t { u32 prev_pid; u32 curr_pid; }; BPF_HASH(stats, struct key_t, u64, 1024); int count_sched(struct pt_regs *ctx, struct task_struct *prev) { struct key_t key = {}; u64 *val; key.curr_pid = bpf_get_current_pid_tgid(); key.prev_pid = prev->pid; stats.increment(key); return 0; } """ b = BPF(text=prog) b.attach_kprobe(event="finish_task_switch", fn_name="count_sched") # generate many schedule events for i in range(0, 100): sleep(0.01) for k, v in sorted(b["stats"].items(), key=lambda (k, v): v.value): print("task_switch[%5d->%5d]=%u" % (k.prev_pid, k.curr_pid, v.value)) ``` --- References: - https://lwn.net/Articles/132196/ - https://stackoverflow.com/questions/33096225/how-to-use-registers-in-kprobe - http://www.brendangregg.com/offcpuanalysis.html - Fun with Dynamic Kernel Tracing Events: https://events19.linuxfoundation.org/wp-content/uploads/2017/12/oss-eu-2018-fun-with-dynamic-trace-events_steven-rostedt.pdf