###### tags: `Profiling`
# kprobes
## history
- Been around since 2004 (developed by IBM)
- basic infrastructure
- kprobe events (act like trace event)
- Introduced in 2009 (by Masami Hiramatsu)
- Allows to create dynamic events
- Automatically check instruction boundary
- BPF attached to kprobes since kernel 4.1
## kprobes implementation
kprobe 會將目標指令的第一個 byte 替換成 breakpoint 指令 (x86 上為 int3),並將原始指令存入 kprobe 相關的資料結構,當執行到目標指令就會觸發 breakpoint exception,執行註冊的 kprobe pre-handler,然後設定 TF = 1 (Trap flag) 並單步執行原始的指令後,觸發一個 debug exception,最後執行註冊的 post-handler。
以下圖為例:

```c
struct kprobe {
struct hlist_node hlist; /* Internal */
kprobe_opcode_t addr; /* Address of probe */
kprobe_pre_handler_t pre_handler; /* Address of pre-handler */
kprobe_post_handler_t post_handler; /* Address of post-handler */
kprobe_fault_handler_t fault_handler; /* Address of fault handler */
kprobe_break_handler_t break_handler; /* Internal */
kprobe_opcode_t opcode; /* Internal */
kprobe_opcode_t insn[MAX_INSN_SIZE]; /* Internal */
};
```
`pre_handler` 和 `post_handler` 是 function pointer 指向使用者自行定義的函式,`opcode` 則是被 breakpoint 指令替換前原始的指令。
範例:
```c
/*
* NOTE: This example is works on x86 and powerpc.
* Here's a sample kernel module showing the use of kprobes to dump a
* stack trace and selected registers when do_fork() is called.
*
* For more information on theory of operation of kprobes, see
* Documentation/kprobes.txt
*
* You will see the trace data in /var/log/messages and on the console
* whenever do_fork() is invoked to create a new process.
*/
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
/* For each probe you need to allocate a kprobe structure */
static struct kprobe kp = {
.symbol_name = "_do_fork",
};
/* kprobe pre_handler: called just before the probed instruction is executed */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
printk(KERN_INFO "pre_handler: p->addr = 0x%p, ip = %lx,"
" flags = 0x%lx\n", p->addr, regs->ip, regs->flags);
dump_stack();
return 0;
}
/* kprobe post_handler: called after the probed instruction is executed */
static void handler_post(struct kprobe *p, struct pt_regs *regs,
unsigned long flags)
{
printk(KERN_INFO "post_handler: p->addr = 0x%p, flags = 0x%lx\n",
p->addr, regs->flags);
}
/*
* fault_handler: this is called if an exception is generated for any
* instruction within the pre- or post-handler, or when Kprobes
* single-steps the probed instruction.
*/
static int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
{
printk(KERN_INFO "fault_handler: p->addr = 0x%p, trap #%dn",
p->addr, trapnr);
/* Return 0 because we don't handle the fault. */
return 0;
}
static int __init kprobe_init(void)
{
int ret;
kp.pre_handler = handler_pre;
kp.post_handler = handler_post;
kp.fault_handler = handler_fault;
ret = register_kprobe(&kp);
if (ret < 0) {
printk(KERN_INFO "register_kprobe failed, returned %d\n", ret);
return ret;
}
printk(KERN_INFO "Planted kprobe at %p\n", kp.addr);
return 0;
}
static void __exit kprobe_exit(void)
{
unregister_kprobe(&kp);
printk(KERN_INFO "kprobe at %p unregistered\n", kp.addr);
}
module_init(kprobe_init)
module_exit(kprobe_exit)
MODULE_LICENSE("GPL");
```
## kprobe_events
CONFIG_KPROBE_EVENTS=y
https://www.kernel.org/doc/html/latest/trace/kprobetrace.html
```
p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS] : Set a probe
r[MAXACTIVE][:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS] : Set a return probe
-:[GRP/]EVENT : Clear a probe
GRP : Group name. If omitted, use "kprobes" for it.
EVENT : Event name. If omitted, the event name is generated
based on SYM+offs or MEMADDR.
MOD : Module name which has given SYM.
SYM[+offs] : Symbol+offset where the probe is inserted.
MEMADDR : Address where the probe is inserted.
MAXACTIVE : Maximum number of instances of the specified function that
can be probed simultaneously, or 0 for the default value
as defined in Documentation/kprobes.txt section 1.3.1.
FETCHARGS : Arguments. Each probe can have up to 128 args.
%REG : Fetch register REG
@ADDR : Fetch memory at ADDR (ADDR should be in kernel)
@SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
$stackN : Fetch Nth entry of stack (N >= 0)
$stack : Fetch stack address.
$argN : Fetch the Nth function argument. (N >= 1) (\*1)
$retval : Fetch return value.(\*2)
$comm : Fetch current task comm.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
(x8/x16/x32/x64), "string", "ustring" and bitfield
are supported.
(\*1) only for the probe on function entry (offs == 0).
(\*2) only for return probe.
(\*3) this is useful for fetching a field of data structures.
(\*4) "u" means user-space dereference. See :ref:`user_mem_access`.
```
### some important files
- enable kprobe: ```/sys/kernel/debug/tracing/events/kprobes/<event name>/enable```
- register/unregister kprobe: ```/sys/kernel/debug/tracing/kprobe_events```
- list kprobe: `/sys/kernel/debug/kprobes/list`
- trace ouput: `/sys/kernel/debug/tracing/trace`
trace open system call `open()`:
```c
SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
if (force_o_largefile())
flags |= O_LARGEFILE;
return do_sys_open(AT_FDCWD, filename, flags, mode);
}
```
### kprobe usage: add a probe to `do_sys_open()`
x86_64 calling convention
```
x86 function call convention, 64-bit:
-------------------------------------
arguments | callee-saved | extra caller-saved | return
[callee-clobbered] | | [callee-clobbered] |
---------------------------------------------------------------------------
rdi rsi rdx rcx r8-9 | rbx rbp [*] r12-15 | r10-11 | rax, rdx [**]
```
get the address of `filename` character array
```bash
echo 'p:myprobe do_sys_open file=%si' > /sys/kernel/debug/tracing/kprobe_events
```
print the `filename` string
```bash
echo 'p:myprobe do_sys_open file=+0(%si):string' > /sys/kernel/debug/tracing/kprobe_events
echo 'p:myprobe do_sys_open file=+0($arg2):string' > /sys/kernel/debug/tracing/kprobe_events
```
### Example: trace `do_sys_open()`
```sh
# echo 'p:myprobe do_sys_open file=+0(%si):string' > /sys/kernel/debug/tracing/kprobe_events
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 115/115 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
...
sensors-20360 [001] .... 265527.761064: myprobe: (do_sys_open+0x0/0x2d0) file="/sys/class/hwmon/hwmon1/temp3_label"
sensors-20360 [001] .... 265527.761078: myprobe: (do_sys_open+0x0/0x2d0) file="/sys/class/hwmon/hwmon1/temp3_input"
sensors-20360 [001] .... 265527.761092: myprobe: (do_sys_open+0x0/0x2d0) file="/sys/class/hwmon/hwmon1/temp3_input"
sensors-20360 [001] .... 265527.761105: myprobe: (do_sys_open+0x0/0x2d0) file="/sys/class/hwmon/hwmon1/temp3_crit_alarm"
sensors-20360 [001] .... 265527.761119: myprobe: (do_sys_open+0x0/0x2d0) file="/sys/class/hwmon/hwmon1/temp3_max"
sensors-20360 [001] .... 265527.761131: myprobe: (do_sys_open+0x0/0x2d0) file="/sys/class/hwmon/hwmon1/temp3_crit"
code-17029 [003] .... 265527.799119: myprobe: (do_sys_open+0x0/0x2d0) file="/proc/17620/cmdline"
code-7612 [002] .... 265527.800211: myprobe: (do_sys_open+0x0/0x2d0) file="/proc/9371/cmdline"
code-7612 [002] .... 265527.800384: myprobe: (do_sys_open+0x0/0x2d0) file="/proc/9394/cmdline"
code-7612 [002] .... 265527.800543: myprobe: (do_sys_open+0x0/0x2d0) file="/proc/9381/cmdline"
cat-20361 [002] .... 265528.280524: myprobe: (do_sys_open+0x0/0x2d0) file="/etc/ld.so.cache"
cat-20361 [002] .... 265528.280620: myprobe: (do_sys_open+0x0/0x2d0) file="/lib/x86_64-linux-gnu/libc.so.6"
cat-20361 [002] .... 265528.281240: myprobe: (do_sys_open+0x0/0x2d0) file="/usr/lib/locale/locale-archive"
cat-20361 [002] .... 265528.281340: myprobe: (do_sys_open+0x0/0x2d0) file="/sys/kernel/debug/tracing/trace"
```
## BPF attached to kprobes
```python
#!/usr/bin/python
from bcc import BPF
from time import sleep
import operator
prog="""
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
struct key_t {
u32 prev_pid;
u32 curr_pid;
};
BPF_HASH(stats, struct key_t, u64, 1024);
int count_sched(struct pt_regs *ctx, struct task_struct *prev) {
struct key_t key = {};
u64 *val;
key.curr_pid = bpf_get_current_pid_tgid();
key.prev_pid = prev->pid;
stats.increment(key);
return 0;
}
"""
b = BPF(text=prog)
b.attach_kprobe(event="finish_task_switch", fn_name="count_sched")
# generate many schedule events
for i in range(0, 100): sleep(0.01)
for k, v in sorted(b["stats"].items(), key=lambda (k, v): v.value):
print("task_switch[%5d->%5d]=%u" % (k.prev_pid, k.curr_pid, v.value))
```
---
References:
- https://lwn.net/Articles/132196/
- https://stackoverflow.com/questions/33096225/how-to-use-registers-in-kprobe
- http://www.brendangregg.com/offcpuanalysis.html
- Fun with Dynamic Kernel Tracing Events: https://events19.linuxfoundation.org/wp-content/uploads/2017/12/oss-eu-2018-fun-with-dynamic-trace-events_steven-rostedt.pdf