# eBPF

> reference: 0108 [[Day 4] 正式踏入 Cilium 前，先搞懂 eBPF - 概念篇](https://ithelp.ithome.com.tw/m/articles/10382042)

## [tutorial 1](https://eunomia.dev/tutorials/1-helloworld/)

book: What Is eBPF?

> Once you have an eBPF program loaded into the kernel, it has to be attached to an event. Whenever the event happens, the associated eBPF program(s) are run.

Some events that you can attach programs to:

1. Entry to/Exit from Functions
   > You can attach an eBPF program to be triggered whenever a kernel function is entered or exited. Many of today’s eBPF examples use the mechanism of kprobes (attached to a kernel function entry point) and kretprobes (function exit). In more recent kernel versions, there is a more efficient alternative called fentry/fexit.
   >
   > You can also attach eBPF programs to user space functions with uprobes and uretprobes.
2. Tracepoints
   > You can also attach eBPF programs to [tracepoints](https://blogs.oracle.com/linux/post/taming-tracepoints-in-the-linux-kernel) defined within the kernel. Find the events on your machine by looking under /sys/kernel/debug/tracing/events.
3. Network Interfaces: eXpress Data Path
   > eXpress Data Path (XDP) allows attaching an eBPF program to a network interface, so that it is triggered whenever a packet is received. It can inspect or even modify the packet, and the program’s exit code can tell the kernel what to do with that packet: pass it on, drop it, or redirect it. This can form the basis of some very efficient networking functionality.
4. Sockets and Other Networking Hooks
   > You can attach eBPF programs to run when applications open or perform other operations on a network socket, as well as when messages are sent or received. There are also hooks called traffic control or tc within the kernel’s network stack where eBPF programs can run after initial packet processing.

> Some features can be implemented with an eBPF program alone, but in many cases we want the eBPF code to receive information from, or pass data to, a user space application. The mechanism that allows data to pass between eBPF programs and user space, or between different eBPF programs, is called ==maps==.

## eBPF Maps

> Maps are data structures that are defined alongside eBPF programs. There are a variety of different types of maps, but ==they are all essentially key–value stores==. eBPF programs can read and write to them, as can user space code. Common uses for maps include:
> - An eBPF program writing metrics and other data about an event, for user space code to later retrieve
> - User space code writing configuration information, for an eBPF program to read and behave accordingly
> - An eBPF program writing data into a map, for later retrieval by another eBPF program, allowing the coordination of information across multiple kernel events
>
> If both the kernel and user space code will access the same map, they will need a common understanding of the data structures stored in that map.
> This can be done by including header files that define those data structures in both the user space and kernel code, but if these aren’t written in the same language, the author(s) will need to carefully create structure definitions that are byte-for-byte compatible.

> https://eunomia.dev/tutorials/1-helloworld/

Download the ecli tool for running eBPF programs:

```shell
$ wget https://aka.pw/bpf-ecli -O ecli && chmod +x ./ecli
```

Download the compiler toolchain for compiling eBPF kernel code into config files or WASM modules:

```shell
$ wget https://github.com/eunomia-bpf/eunomia-bpf/releases/latest/download/ecc && chmod +x ./ecc
```

```clike
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
#define BPF_NO_GLOBAL_DATA
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

typedef unsigned int u32;
typedef int pid_t;
/* 0 means "do not filter": events from every PID are reported */
const pid_t pid_filter = 0;

char LICENSE[] SEC("license") = "Dual BSD/GPL";

SEC("tp/syscalls/sys_enter_write")
int handle_tp(void *ctx)
{
    /* upper 32 bits of the helper's return value hold the PID */
    pid_t pid = bpf_get_current_pid_tgid() >> 32;
    if (pid_filter && pid != pid_filter)
        return 0;
    bpf_printk("BPF triggered sys_enter_write from PID %d.\n", pid);
    return 0;
}
```

```shell
$ ./ecc minimal.bpf.c minimal.bpf.o minimal.skel.json package.json
```

terminal1

```shell
$ sudo ./ecli run package.json
INFO [faerie::elf] strtab: 0x259 symtab 0x298 relocs 0x2e0 sh_offset 0x2e0
INFO [bpf_loader_lib::skeleton::preload::section_loader] User didn't specify custom value for variable pid_filter, use the default one in ELF
INFO [bpf_loader_lib::skeleton::poller] Running ebpf program...
```

terminal2

```shell
$ sudo cat /sys/kernel/debug/tracing/trace_pipe | grep "BPF triggered sys_enter_write"
ter_write from PID 106406.
grep-106406 [002] ...21 1898517.381908: bpf_trace_printk: BPF triggered sys_enter_write from PID 106406.
grep-106406 [002] ...21 1898517.381912: bpf_trace_printk: BPF triggered sys_enter_write from PID 106406.
KMS thread-2950 [006] ...21 1898517.385664: bpf_trace_printk: BPF triggered sys_enter_write from PID 2935.
KMS thread-2950 [006] ...21 1898517.385676: bpf_trace_printk: BPF triggered sys_enter_write from PID 2935.
KMS thread-2950 [006] ...21 1898517.385678: bpf_trace_printk: BPF triggered sys_enter_write from PID 2935.
KMS thread-2950 [006] ...21 1898517.385681: bpf_trace_printk: BPF triggered sys_enter_write from PID 2935.
pw-data-loop-2699 [001] ...21 1898517.385932: bpf_trace_printk: BPF triggered sys_enter_write from PID 2659.
ld-linux-x86-64-106318 [006] ...21 1898517.386135: bpf_trace_printk: BPF triggered sys_enter_write from PID 106303.
```

If you press Ctrl+C in terminal1, terminal2 stops as well.

### CODE EXPLAIN

```clike
SEC("tp/syscalls/sys_enter_write")
```

This attaches the program to the tracepoint syscalls/sys_enter_write, so the program runs once every time any process invokes the write() system call.

```clike
pid_t pid = bpf_get_current_pid_tgid() >> 32;
```

bpf_get_current_pid_tgid() returns a 64-bit value. The upper 32 bits hold the TGID (what user space calls the PID) and the lower 32 bits hold the kernel-level PID (what user space calls the thread ID, TID). Shifting right by 32 extracts the PID.

```clike
bpf_printk("BPF triggered sys_enter_write from PID %d.\n", pid);
```

bpf_printk() writes the message to the kernel's trace_pipe (at /sys/kernel/debug/tracing/trace_pipe). User space can read the message from there; this is usually used for debugging.

:::info
1. Why use bpf_get_current_pid_tgid() instead of getpid()? An eBPF program cannot call the standard C library (for example getpid()) directly. It may only call the dedicated BPF helper functions, not ordinary C library functions or system calls.
2. What is trace_pipe? trace_pipe is a debug message pipe in the kernel tracing subsystem, located at `/sys/kernel/debug/tracing/trace_pipe`. When an eBPF program calls bpf_printk(), the message is not sent to user space directly. It is written to trace_pipe, and user space can read it with cat (or less): `sudo cat /sys/kernel/debug/tracing/trace_pipe`
3. Why return 0? https://github.com/iovisor/bcc/issues/139 explains it, but I have not understood it yet.
:::

### Basic Framework of eBPF Program

>As mentioned above, the basic framework of an eBPF program includes:
>
>- Including header files: You need to include `<linux/bpf.h>` and `<bpf/bpf_helpers.h>` header files, among others.
>- Defining a license: You need to define a license, typically using "Dual BSD/GPL".
>- Defining a BPF function: You need to define a BPF function, for example, named handle_tp, which takes void *ctx as a parameter and returns int. This is usually written in the C language.
>- Using BPF helper functions: In the BPF function, you can use BPF helper functions such as bpf_get_current_pid_tgid() and bpf_printk().
>- Return value

## [tutorial 2](https://eunomia.dev/tutorials/2-kprobe-unlink/)

>By using the kprobes technology, users can define their own callback functions and dynamically insert probes into almost all functions in the kernel or modules (some functions cannot be probed, such as the kprobes' own implementation functions, which will be explained in detail later). When the kernel execution flow reaches the specified probe function, it will invoke the callback function, allowing the user to collect the desired information. The kernel will then return to the normal execution flow. If the user has collected sufficient information and no longer needs to continue probing, the probes can be dynamically removed. Therefore, the kprobes technology has the advantages of minimal impact on the kernel execution flow and easy operation.
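
Tutorial 1 obtains the PID with `bpf_get_current_pid_tgid() >> 32`, and the kprobe program in this tutorial does the same. The bit layout can be tried out in ordinary user-space C. This is only a sketch of the arithmetic, not eBPF code: `tgid_of`, `tid_of`, and `pack_pid_tgid` are names made up for this note, and the packed value is built by hand instead of coming from the helper.

```c
#include <stdint.h>

/* Upper 32 bits: the TGID, which user space calls the PID. */
static inline uint32_t tgid_of(uint64_t pid_tgid)
{
    return (uint32_t)(pid_tgid >> 32);
}

/* Lower 32 bits: the kernel-level PID, which user space calls the TID. */
static inline uint32_t tid_of(uint64_t pid_tgid)
{
    return (uint32_t)pid_tgid;
}

/* Hypothetical helper for this sketch only: pack two IDs into one
 * 64-bit value using the same layout the BPF helper returns. */
static inline uint64_t pack_pid_tgid(uint32_t tgid, uint32_t tid)
{
    return ((uint64_t)tgid << 32) | tid;
}
```

For example, `tgid_of(pack_pid_tgid(120410, 120411))` yields 120410 and `tid_of` of the same value yields 120411.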
```clike
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

char LICENSE[] SEC("license") = "Dual BSD/GPL";

/* runs on entry to the kernel function do_unlinkat() */
SEC("kprobe/do_unlinkat")
int BPF_KPROBE(do_unlinkat, int dfd, struct filename *name)
{
    pid_t pid;
    const char *filename;

    pid = bpf_get_current_pid_tgid() >> 32;
    filename = BPF_CORE_READ(name, name);
    bpf_printk("KPROBE ENTRY pid = %d, filename = %s\n", pid, filename);
    return 0;
}

/* runs when do_unlinkat() returns; ret is its return value */
SEC("kretprobe/do_unlinkat")
int BPF_KRETPROBE(do_unlinkat_exit, long ret)
{
    pid_t pid;

    pid = bpf_get_current_pid_tgid() >> 32;
    bpf_printk("KPROBE EXIT: pid = %d, ret = %ld\n", pid, ret);
    return 0;
}
```

Declaring a GPL-compatible license such as "Dual BSD/GPL" allows the program to use all eBPF helper functions, including those restricted to GPL-compatible programs.

Deleting a file goes through the unlink() system call, which is why the probes are placed on do_unlinkat.

```shell
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
rm-120410 [006] ...21 1963728.151187: bpf_trace_printk: KPROBE ENTRY pid = 120410, filename = test1
rm-120410 [006] ...21 1963728.151273: bpf_trace_printk: KPROBE EXIT: pid = 120410, ret = 0
rm-120412 [006] ...21 1963738.886388: bpf_trace_printk: KPROBE ENTRY pid = 120412, filename = test2
rm-120412 [006] ...21 1963738.886466: bpf_trace_printk: KPROBE EXIT: pid = 120412, ret = 0
```

## [tutorial 11: Develop User-Space Programs with libbpf and Trace exec() and exit()](https://eunomia.dev/tutorials/11-bootstrap/)

### BTF (BPF Type Format)

>BTF is a metadata format used to describe type information in eBPF programs. The primary purpose of BTF is to provide a structured way to describe data structures in the kernel so that eBPF programs can access and manipulate them more easily.
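
A rough user-space illustration of why this type information matters: the same field can sit at different byte offsets in different versions of a struct, so a hard-coded offset only works for one layout. `BPF_CORE_READ`, used by the programs in these tutorials, relies on BTF so that field offsets match the running kernel. The two structs and `read_int_at` below are hypothetical and exist only for this sketch; they are not real kernel definitions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical "old" layout of a kernel struct. */
struct task_v1 {
    int state;
    int tgid;
};

/* Hypothetical "new" layout: an added field shifts tgid to a later offset. */
struct task_v2 {
    int state;
    long new_field;
    int tgid;
};

/* Read an int located `off` bytes from `base`, the way code that only
 * knows a raw offset (and not the type) would have to do it. */
static int read_int_at(const void *base, size_t off)
{
    int v;
    memcpy(&v, (const char *)base + off, sizeof(v));
    return v;
}
```

Reading `tgid` from a `struct task_v2` object only gives the right value when the offset comes from `offsetof(struct task_v2, tgid)`. The offset that was correct for `task_v1` points somewhere else.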

Kernel-space eBPF Program bootstrap.bpf.c

```clike
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* Copyright (c) 2020 Facebook */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include "bootstrap.h"

char LICENSE[] SEC("license") = "Dual BSD/GPL";

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, pid_t);
    __type(value, u64);
} exec_start SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} rb SEC(".maps");

const volatile unsigned long long min_duration_ns = 0;

SEC("tp/sched/sched_process_exec")
int handle_exec(struct trace_event_raw_sched_process_exec *ctx)
{
    struct task_struct *task;
    unsigned fname_off;
    struct event *e;
    pid_t pid;
    u64 ts;

    /* remember time exec() was executed for this PID */
    pid = bpf_get_current_pid_tgid() >> 32;
    ts = bpf_ktime_get_ns();
    bpf_map_update_elem(&exec_start, &pid, &ts, BPF_ANY);

    /* don't emit exec events when minimum duration is specified */
    if (min_duration_ns)
        return 0;

    /* reserve sample from BPF ringbuf */
    e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e)
        return 0;

    /* fill out the sample with data */
    task = (struct task_struct *)bpf_get_current_task();

    e->exit_event = false;
    e->pid = pid;
    e->ppid = BPF_CORE_READ(task, real_parent, tgid);
    bpf_get_current_comm(&e->comm, sizeof(e->comm));

    fname_off = ctx->__data_loc_filename & 0xFFFF;
    bpf_probe_read_str(&e->filename, sizeof(e->filename), (void *)ctx + fname_off);

    /* successfully submit it to user-space for post-processing */
    bpf_ringbuf_submit(e, 0);
    return 0;
}

SEC("tp/sched/sched_process_exit")
int handle_exit(struct trace_event_raw_sched_process_template* ctx)
{
    struct task_struct *task;
    struct event *e;
    pid_t pid, tid;
    u64 id, ts, *start_ts, duration_ns = 0;

    /* get PID and TID of exiting thread/process */
    id = bpf_get_current_pid_tgid();
    pid = id >> 32;
    tid = (u32)id;

    /* ignore thread exits */
    if (pid != tid)
        return 0;

    /* if we recorded start of the process, calculate lifetime duration */
    start_ts = bpf_map_lookup_elem(&exec_start, &pid);
    if (start_ts)
        duration_ns = bpf_ktime_get_ns() - *start_ts;
    else if (min_duration_ns)
        return 0;
    bpf_map_delete_elem(&exec_start, &pid);

    /* if process didn't live long enough, return early */
    if (min_duration_ns && duration_ns < min_duration_ns)
        return 0;

    /* reserve sample from BPF ringbuf */
    e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e)
        return 0;

    /* fill out the sample with data */
    task = (struct task_struct *)bpf_get_current_task();

    e->exit_event = true;
    e->duration_ns = duration_ns;
    e->pid = pid;
    e->ppid = BPF_CORE_READ(task, real_parent, tgid);
    e->exit_code = (BPF_CORE_READ(task, exit_code) >> 8) & 0xff;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));

    /* send data to user-space for post-processing */
    bpf_ringbuf_submit(e, 0);
    return 0;
}
```

```clike
const volatile unsigned long long min_duration_ns = 0;
```

:::info
1. why use `const volatile` ?
:::

```clike
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, pid_t);
    __type(value, u64);
} exec_start SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} rb SEC(".maps");
```

`exec_start` is a hash type eBPF map used to store the timestamp when a process starts executing. `rb` is a ring buffer type eBPF map used to store captured event data and send it to the user-space program.

:::info
1. Why use the two map types BPF_MAP_TYPE_HASH and BPF_MAP_TYPE_RINGBUF?

   BPF_MAP_TYPE_HASH:
   - Stores data as key-value pairs, which suits mappings such as PID ➔ start_time. It supports fast lookup, insertion, and deletion of key-value pairs.

   BPF_MAP_TYPE_RINGBUF:
   - Provides a first-in, first-out (FIFO) data structure, suited to streaming event data in real time.
   - Lets user-space programs listen for and process the data through libbpf.
   - Suits high-throughput scenarios because it reduces the copying overhead between kernel and user space. Once data has been submitted to the ring buffer, it can no longer be modified in the kernel.
:::

```clike
SEC("tp/sched/sched_process_exit")
int handle_exit(struct trace_event_raw_sched_process_template* ctx)
{
    struct task_struct *task;
    struct event *e;
    pid_t pid, tid;
    u64 id, ts, *start_ts, duration_ns = 0;
```

[source code](https://github.com/libbpf/libbpf-bootstrap/blob/master/examples/c/bootstrap.h)

```clike
struct event {
    int pid;
    int ppid;
    unsigned exit_code;
    unsigned long long duration_ns;
    char comm[TASK_COMM_LEN];
    char filename[MAX_FILENAME_LEN];
    bool exit_event;
};
```

```clike
/* don't emit exec events when minimum duration is specified */
if (min_duration_ns)
    return 0;
```

If min_duration_ns is set (the user only wants to see processes that ran longer than a given duration), no event is emitted at exec time.

```clike
/* reserve sample from BPF ringbuf */
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
```

bpf_ringbuf_reserve:
- sizeof(*e): the size of the buffer to reserve, which is the size of the event struct. For example, if struct event is 128 bytes, 128 bytes are reserved in the ring buffer. The call reserves a large enough region of the BPF ring buffer to hold the event data, which is later sent to user space.

```clike
if (!e)
    return 0;
```

If e is NULL, the ring buffer does not have enough free space, so this event is skipped and no data is sent to user space. This avoids overflowing the ring buffer. The verifier also requires this NULL check before e is dereferenced.

```clike
fname_off = ctx->__data_loc_filename & 0xFFFF;
```

Computes the offset of filename within the tracepoint record that ctx points to (not within struct event). The lower 16 bits of __data_loc_filename hold that offset.

```clike
bpf_probe_read_str(&e->filename, sizeof(e->filename), (void *)ctx + fname_off);
```

Using that offset, reads the filename string (the path of the executed program) from ctx and writes it into the e->filename field of the event struct.
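
The `& 0xFFFF` step can be tried out in plain user-space C. In a tracepoint `__data_loc` field, the low 16 bits are the byte offset from the start of the record and the high 16 bits are the length of the data. The sketch below is not eBPF code: the function names are invented for this note, and the record is an ordinary byte buffer standing in for the memory that ctx points to.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Low 16 bits: byte offset of the data from the start of the record. */
static inline uint16_t data_loc_offset(uint32_t data_loc)
{
    return (uint16_t)(data_loc & 0xFFFF);
}

/* High 16 bits: length of the data in bytes. */
static inline uint16_t data_loc_len(uint32_t data_loc)
{
    return (uint16_t)(data_loc >> 16);
}

/* Hypothetical helper: copy the string described by data_loc out of
 * `record` into `dst`, truncating to fit and always NUL-terminating.
 * Returns the length of the resulting string. */
static size_t copy_data_loc_str(char *dst, size_t dst_size,
                                const void *record, uint32_t data_loc)
{
    const char *src = (const char *)record + data_loc_offset(data_loc);
    size_t n = data_loc_len(data_loc);

    if (dst_size == 0)
        return 0;
    if (n >= dst_size)
        n = dst_size - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return strlen(dst);
}
```

The eBPF program only needs the offset because `bpf_probe_read_str` stops at the terminating NUL or at the destination size on its own. The length half is shown here for completeness.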