# Abstract of "Learning eBPF"

Content is from "Learning eBPF" by Liz Rice.

## Chapter 1: What Is eBPF, and Why Is It Important?

In one sentence, eBPF is a kernel technology that enables a new generation of highly performant tools for networking, observability, and security.

### The Evolution of eBPF in Production Systems

- Kprobes (kernel probes) have existed in the Linux kernel since 2005, and eBPF programs gained the ability to attach to kprobes in 2015, marking the start of a revolution in tracing Linux systems.
- By 2016, eBPF-based tools were being used in production environments. Brendan Gregg's work on tracing at Netflix became widely recognized in the infrastructure domain.
- Every single packet sent to Facebook.com since 2017 has passed through eBPF/XDP.

### Goals and Features of eBPF

Goal: To add new functionality—such as networking, tracing, and security features—to the Linux kernel.

Problem: Traditionally, there have been two ways to modify the kernel: committing changes upstream or distributing a kernel module. The former faces technical challenges and can be very time-consuming, while the latter raises concerns among users about potential vulnerabilities and exploitable flaws.

Features of eBPF:
- Verifier: Ensures that a loaded eBPF program cannot crash the kernel or get stuck in an infinite loop, and that it cannot access memory it is not entitled to.
- Dynamic Loading: eBPF programs can be loaded into and removed from a live kernel without a reboot, so machines can effectively be live patched while the interface to applications remains unchanged. Because the programs run in the kernel, they gain visibility over all processes on the machine, including those running inside containers as well as those on the host.
- High Performance Potential: Events are handled inside the kernel, without context switching between kernel and user space. For example, a 2018 paper shows that routing in XDP improves performance by 2.5 times compared to the regular Linux kernel implementation, and that XDP-based load balancing is 4.3 times faster than IPVS.

### Example: eBPF in Cloud-Native Environments

The sidecar model can be used to add functionality like logging, tracing, and security to an application. The sidecar container is injected by applying modified YAML to a pod. However, this approach has several downsides:

- The pod must be restarted for the sidecar to take effect, and if many services are running, waiting for them all to restart can be time-consuming.
- Networking functionality such as a service mesh redirects packets through a proxy, increasing latency.
- A sidecar can only observe the pod it is injected into, so this kind of passive observation cannot detect workloads (for example, maliciously implanted applications) that bypass it.

## Chapter 2: eBPF's "Hello World"

### Running "Hello World"

System block diagram:

![System Block Diagram](https://hackmd.io/_uploads/ByPo075gJe.png)

The following example uses the BCC toolchain to create an eBPF program, which consists of two parts: the kernel part and the user space part.

```python
#!/usr/bin/python
from bcc import BPF

# The kernel part is defined as a string.
# BCC will compile the program before loading it.
program = r"""
int hello(void *ctx) {
    bpf_trace_printk("Hello World!");
    return 0;
}
"""

# Start of the user space part
b = BPF(text=program)
# Attach the program to the "execve" syscall
syscall = b.get_syscall_fnname("execve")
b.attach_kprobe(event=syscall, fn_name="hello")

# Block and display the trace log
b.trace_print()
```

The log is written to a pseudo-file located at */sys/kernel/debug/tracing/trace_pipe*. Executing the program in the console displays output similar to the following, indicating that the process triggering the event is `bash` with PID 5412.

```shell
$ hello.py
b' bash-5412 [001] .... 90432.904952: 0: bpf_trace_printk: Hello World'
```

Running eBPF programs requires privileges such as `CAP_BPF`, `CAP_PERFMON` (for tracing), and `CAP_NET_ADMIN` (for networking). The *trace_pipe* pseudo-file is shared among all eBPF programs and supports only string output.

### BPF Maps

BPF maps enable:
- Sharing data among multiple eBPF programs.
- Communicating between user space applications and eBPF code running in the kernel.

Typical use cases include:
- User configuration.
- Sharing state between multiple eBPF programs.
- Communicating between user space and the kernel.

Various types of maps are available, including:
- Array
- Hash table
- Perf and ring buffer
- Stack and queue
- ...
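The examples below use BCC's map macros. For comparison, here is a minimal sketch (not from the book) of how an equivalent hash map would be declared in plain C using libbpf's BTF-style map syntax; the map name `counter_table` mirrors the BCC example that follows, and the capacity is an arbitrary illustrative choice:

```clike
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// libbpf-style (non-BCC) declaration of a hash map, roughly equivalent
// to BCC's BPF_HASH(counter_table) used in the next example.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);   // arbitrary illustrative capacity
    __type(key, __u64);           // user ID
    __type(value, __u64);         // number of execve() calls
} counter_table SEC(".maps");
```

eBPF C code would then access this map through helpers such as `bpf_map_lookup_elem()` and `bpf_map_update_elem()`, rather than BCC's method-style syntax.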
### Hash Table Map

In this example, the kernel part increments a per-UID counter in the map each time `execve()` is called, while the user space part periodically reads the map and prints it to the console.

```python
#!/usr/bin/python3
from bcc import BPF
from time import sleep

program = r"""
BPF_HASH(counter_table);

int hello(void *ctx) {
    u64 uid;
    u64 counter = 0;
    u64 *p;

    uid = bpf_get_current_uid_gid() & 0xFFFFFFFF;
    p = counter_table.lookup(&uid);
    if (p != 0) {
        counter = *p;
    }
    counter++;
    counter_table.update(&uid, &counter);
    return 0;
}
"""

b = BPF(text=program)
syscall = b.get_syscall_fnname("execve")
b.attach_kprobe(event=syscall, fn_name="hello")

# Attach to a tracepoint that gets hit for all syscalls
# b.attach_raw_tracepoint(tp="sys_enter", fn_name="hello")

while True:
    sleep(2)
    s = ""
    for k, v in b["counter_table"].items():
        s += f"ID {k.value}: {v.value}\t"
    print(s)
```

`BPF_HASH(counter_table)` and `counter_table.lookup(&uid)` are BCC-specific conveniences that cannot be used in a plain C program. Executing the program gives output like the following (the IDs and counts here are illustrative), where each entry maps a UID to the number of `execve()` calls it has made:

```shell
$ hello.py
ID 0: 1	ID 501: 3
ID 0: 1	ID 501: 4
```

### Perf and Ring Buffer Maps

This example shows how an eBPF program can deliver structured event data to a user space program through a perf buffer.

```python
#!/usr/bin/python3
from bcc import BPF

program = r"""
BPF_PERF_OUTPUT(output);

struct data_t {
    int pid;
    int uid;
    char command[16];
    char message[12];
};

int hello(void *ctx) {
    struct data_t data = {};
    char message[12] = "Hello World";

    data.pid = bpf_get_current_pid_tgid() >> 32;
    data.uid = bpf_get_current_uid_gid() & 0xFFFFFFFF;

    bpf_get_current_comm(&data.command, sizeof(data.command));
    bpf_probe_read_kernel(&data.message, sizeof(data.message), message);

    output.perf_submit(ctx, &data, sizeof(data));
    return 0;
}
"""

b = BPF(text=program)
syscall = b.get_syscall_fnname("execve")
b.attach_kprobe(event=syscall, fn_name="hello")

def print_event(cpu, data, size):
    data = b["output"].event(data)
    print(f"{data.pid} {data.uid} {data.command.decode()} {data.message.decode()}")

b["output"].open_perf_buffer(print_event)
while True:
    b.perf_buffer_poll()
```

- `bpf_get_current_pid_tgid()`, `bpf_get_current_uid_gid()`, and `bpf_get_current_comm()` gather information on the kernel side.
- `perf_submit()` writes data to the map on the kernel side, and `perf_buffer_poll()` reads data from the map on the user space side.

:::warning
eBPF programs are constrained to a maximum of 1 million verified instructions and a 512-byte stack.
:::

## Chapter 3: Anatomy of an eBPF Program

eBPF programs are typically written in C or Rust and compiled into eBPF bytecode, which runs within an eBPF virtual machine (VM) inside the kernel. The bytecode can be either JIT-compiled to machine code or interpreted.

![How to Generate Machine Code](https://hackmd.io/_uploads/H1zByvxsC.png)

### The eBPF Virtual Machine and Verifier

The eBPF verifier performs its checks on the bytecode instructions generated and optimized by the compiler. For instance, before eBPF supported function calls, marking functions with `__always_inline` was necessary so that the compiler would fold them into the calling program.
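As a minimal sketch (not from the book, with hypothetical function and program names), an inlined helper looks like this; the compiler folds `double_it()` into the body of `inline_demo`, so the emitted bytecode contains no function call:

```clike
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// Hypothetical helper: __always_inline asks the compiler to fold the
// function body into each call site, so the emitted bytecode contains
// no function call.
static __always_inline int double_it(int x)
{
    return x * 2;
}

SEC("xdp")
int inline_demo(struct xdp_md *ctx)
{
    bpf_printk("doubled: %d", double_it(21));
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";
```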
### eBPF "Hello World" for a Network Interface

eBPF programs can be loaded with command-line tools, much like kernel modules, so only the kernel part needs to be programmed. An XDP event is triggered when a packet arrives at a network interface, allowing the eBPF program to inspect, modify, redirect, or drop the packet.

```clike
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

int counter = 0;

SEC("xdp")
int hello(void *ctx) {
    bpf_printk("Hello World %d", counter);
    counter++;
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";
```

- The type of an eBPF program needs to be specified, and `SEC("xdp")` suffices.
- `return XDP_PASS` tells the kernel to continue processing the packet as usual. eBPF programs influence kernel behavior through their return codes.

#### Compiling an eBPF Object File

Clang is preferred over GCC for compiling eBPF programs; add the option `-target bpf`:

```makefile
hello.bpf.o: hello.bpf.c
	clang \
	    -target bpf \
	    -I/usr/include/$(shell uname -m)-linux-gnu \
	    -g \
	    -O2 -c $< -o $@
```

#### Loading the Program into the Kernel

`$ bpftool prog load hello.bpf.o /sys/fs/bpf/hello`

Here, `bpftool prog load` loads the compiled eBPF program and pins it to the specified path, in this case */sys/fs/bpf/hello*.

#### Inspecting the Loaded Program

```shell
$ ls /sys/fs/bpf
hello
```

```shell
$ bpftool prog list
...
540: xdp name hello tag d35b94b4c0c10efb gpl
      loaded_at 2022-08-02T17:39:47+0000 uid 0
      xlated 96B jited 148B memlock 4096B map_ids 165,166
      btf_id 254
```

The output shows the program with id 540, type xdp, and name hello, along with its associated maps with ids 165 and 166. These fields can be used to pinpoint specific eBPF programs or maps with bpftool. For example:
- `bpftool prog show id 540`
- `bpftool map show id 165`

#### Attaching to an Event

The program type must match the type of event it is attached to. For instance, you can attach an XDP program to an XDP event on a specific network interface like this:

`$ bpftool net attach xdp id 540 dev eth0`

To view all network-attached eBPF programs:

```shell
$ bpftool net list
xdp:
eth0(2) driver id 540

tc:

flow_dissector:
```

#### Global Variables

Global variables in eBPF are implemented using maps.

```shell
$ bpftool map list
165: array name hello.bss flags 0x400
      key 4B value 4B max_entries 1 memlock 4096B
      btf_id 254
166: array name hello.rodata flags 0x80
      key 4B value 15B max_entries 1 memlock 4096B
      btf_id 254 frozen
```

In this case, the `counter` variable is stored in the `hello.bss` map.

#### Detaching and Unloading the Program

To detach the program:

`$ bpftool net detach xdp dev eth0`

```shell
$ bpftool net list
xdp:

tc:

flow_dissector:
```

To unload the program, remove its pin; afterward, `bpftool prog show name hello` returns nothing:

`$ rm /sys/fs/bpf/hello`
`$ bpftool prog show name hello`

## Chapter 4: The bpf() System Call

### BPF Program and Map References

The life cycle of eBPF objects (programs or maps) is managed by reference counting. Pinning an eBPF object, a program owning maps, creating links, and attaching to hooks all increase the reference count. Pinning creates a pseudofile (which does not persist across reboots) through which user space programs can access the object. A link represents the attachment between a program and its hook.

:::info
Some hooks used for tracing (e.g., kprobes and tracepoints) hold references tied to the user process that set them up, while hooks like cgroups or the network stack do not. After the user process exits, only attachments of the latter kind may persist.
:::
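To make the reference counting concrete, here is a minimal user space sketch using libbpf (my choice of library; the book demonstrates this chapter with bpftool). The object file name `hello.bpf.o`, program name `hello`, interface `eth0`, and pin path are illustrative:

```clike
#include <net/if.h>
#include <bpf/libbpf.h>

int main(void)
{
    // Loading the object into the kernel takes a reference on the
    // program and on any maps it uses.
    struct bpf_object *obj = bpf_object__open_file("hello.bpf.o", NULL);
    if (!obj || bpf_object__load(obj))
        return 1;

    struct bpf_program *prog = bpf_object__find_program_by_name(obj, "hello");

    // Attaching creates a BPF link, which holds its own reference.
    struct bpf_link *link =
        bpf_program__attach_xdp(prog, if_nametoindex("eth0"));
    if (!link)
        return 1;

    // Pinning the link adds a reference that outlives this process,
    // so the program stays attached after we exit.
    bpf_link__pin(link, "/sys/fs/bpf/hello_link");
    return 0;
}
```

Removing the pin later (e.g., `rm /sys/fs/bpf/hello_link`) drops that reference; once no references remain, the link is destroyed and the program is detached and freed.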
## Chapter 7: eBPF Program and Attachment Types

The type of an eBPF program defines:
- The meaning of the program's arguments and return code.
- The set of helper functions and kfuncs the program can call.

The verifier enforces the rules for these features. Helper functions form a stable external interface of the Linux kernel, while kfuncs carry no such stability guarantee. Program types broadly fall into two categories: tracing and networking.

### Tracing

Tracing-type programs are attached to kprobes, tracepoints, raw tracepoints, fentry/fexit probes, and perf events.

#### Kprobes and Kretprobes

Kprobes are triggered when entering a function, while kretprobes are triggered when exiting it. It's possible to attach a kprobe to almost any instruction, but attaching to kernel functions and syscalls is more stable. An example of attaching to a kernel function is shown below:

```clike
SEC("kprobe/do_execve")
int BPF_KPROBE(kprobe_do_execve, struct filename *filename)
```

Trailing arguments of the probed function can simply be omitted:

```clike
int do_execve(struct filename *filename,
    const char __user *const __user *__argv,
    const char __user *const __user *__envp)
```

#### Fentry/Fexit

Fentry/fexit probes are similar, but an fexit program can access not only the return value but also the input arguments. An fexit program has a signature like:

```clike
SEC("fexit/do_unlinkat")
int BPF_PROG(do_unlinkat_exit, int dfd, struct filename *name, long ret)
```

Compare this to the kretprobe version, which sees only the return value:

```clike
SEC("kretprobe/do_unlinkat")
int BPF_KRETPROBE(do_unlinkat_exit, long ret)
```

#### Tracepoints

Tracepoints are stable interfaces marked in kernel code. To use a tracepoint (without BTF), its argument structure must be defined manually. For instance, here's the format of the `execve()` entry tracepoint:

```shell
$ cat /sys/kernel/tracing/events/syscalls/sys_enter_execve/format
name: sys_enter_execve
ID: 622
format:
    field:unsigned short common_type; offset:0; size:2; signed:0;
    field:unsigned char common_flags; offset:2; size:1; signed:0;
    field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
    field:int common_pid; offset:4; size:4; signed:1;

    field:int __syscall_nr; offset:8; size:4; signed:1;
    field:const char * filename; offset:16; size:8; signed:0;
    field:const char *const * argv; offset:24; size:8; signed:0;
    field:const char *const * envp; offset:32; size:8; signed:0;
```

The corresponding structure is defined below; the first four (common) fields are not allowed to be accessed.

```clike
struct my_syscalls_enter_execve {
    unsigned short common_type;
    unsigned char common_flags;
    unsigned char common_preempt_count;
    int common_pid;

    long syscall_nr;
    long filename_ptr;
    long argv_ptr;
    long envp_ptr;
};

SEC("tp/syscalls/sys_enter_execve")
int tp_sys_enter_execve(struct my_syscalls_enter_execve *ctx)
```

#### BTF-Enabled Tracepoints

With BTF support, a matching structure is defined in *vmlinux.h*. The program should use the section definition `SEC("tp_btf/<name>")`, and the context structure is `trace_event_raw_<name>`.

```clike
SEC("tp_btf/sched_process_exec")
int handle_exec(struct trace_event_raw_sched_process_exec *ctx)
```

#### User Space Attachments

Uprobes/uretprobes and user statically defined tracepoints (USDT) are the counterparts for user space programs. To use a uprobe/uretprobe, the attachment point is specified by the path to the shared library or executable plus the symbol name:

`SEC("uprobe/usr/lib/aarch64-linux-gnu/libssl.so.3/SSL_write")`
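A minimal sketch (not from the book) of a uprobe body for that attachment point might look like the following; it reads a few bytes of the buffer passed to OpenSSL's `SSL_write()` before encryption. The program and variable names are illustrative, and it assumes a bpftool-generated *vmlinux.h* and the usual `-D__TARGET_ARCH_...` define for `BPF_KPROBE` argument handling:

```clike
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

// Hypothetical sketch: capture plaintext on entry to OpenSSL's
// SSL_write(SSL *ssl, const void *buf, int num), before encryption.
SEC("uprobe/usr/lib/aarch64-linux-gnu/libssl.so.3/SSL_write")
int BPF_KPROBE(ssl_write_entry, void *ssl, const void *buf, int num)
{
    char preview[16] = {};

    // buf is a user space pointer, so use the _user variant of the
    // probe-read helpers.
    bpf_probe_read_user(&preview, sizeof(preview) - 1, buf);
    bpf_printk("SSL_write %d bytes: %s", num, preview);
    return 0;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";
```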
#### Linux Security Module (LSM)

LSM provides a stable interface within the kernel, originally intended for kernel modules to enforce security policies; eBPF programs can attach to the same hooks. The return value affects how the kernel behaves: a nonzero return value indicates that the security check was not passed.

### Networking

Networking program types have two main characteristics:
- They use return codes to tell the kernel what action to take.
- They can modify packets or socket configuration parameters.

#### Sockets

- **BPF_PROG_TYPE_SOCKET_FILTER**: Filters a copy of socket packets for observation; the original packets are not affected.
- **BPF_PROG_TYPE_SOCK_OPS**: Intercepts various operations on a socket and sets socket configuration parameters, such as TCP timeouts.
- **BPF_PROG_TYPE_SK_SKB**: Redirects traffic at the socket layer.

#### Traffic Control

Traffic control supports custom eBPF filters and classifiers for both ingress and egress traffic.

#### Flow Dissector

- **BPF_PROG_TYPE_FLOW_DISSECTOR**: Extracts details from packet headers.

#### Cgroups

Cgroup-attached programs can enforce behavior specific to certain cgroups.
- **BPF_PROG_TYPE_CGROUP_SOCK** and **BPF_PROG_TYPE_CGROUP_SKB**: Determine whether a given cgroup is permitted to perform socket operations or data transmission.

#### Some Others

XDP, lightweight tunnels, and more.

## Chapter 8: eBPF for Networking

### Packet Drops

Several network security features involve dropping certain packets, including firewalling, DDoS protection, and mitigating packet-of-death vulnerabilities.

#### XDP Program Return Codes

- **XDP_PASS**: Processes the packet in the normal way.
- **XDP_DROP**: Discards the packet.
- **XDP_TX**: Sends the packet back out of the interface it arrived on.
- **XDP_REDIRECT**: Forwards the packet to another interface.
- **XDP_ABORTED**: Discards the packet and raises a warning.

#### XDP Packet Parsing

The parameter of an XDP program is a `struct xdp_md *`, which holds metadata about the packet.

```clike
struct xdp_md {
    __u32 data;
    __u32 data_end;
    __u32 data_meta;
    ...
}
```

- **data, data_end**: The packet contents are stored in the range [data, data_end).
- **data_meta**: The range [data_meta, data) is a metadata area for eBPF's own use.

A basic XDP program that drops *ping* packets could look like this:

```clike
SEC("xdp")
int ping(struct xdp_md *ctx) {
    long protocol = lookup_protocol(ctx);
    if (protocol == 1) // ICMP
    {
        bpf_printk("Hello ping");
        return XDP_DROP;
    }
    return XDP_PASS;
}
```

:::spoiler lookup_protocol()
```clike
unsigned char lookup_protocol(struct xdp_md *ctx)
{
    unsigned char protocol = 0;
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    if (data + sizeof(struct ethhdr) > data_end)
        return 0;

    // Check that it's an IP packet
    if (bpf_ntohs(eth->h_proto) == ETH_P_IP)
    {
        // Return the protocol of this packet
        // 1 = ICMP
        // 6 = TCP
        // 17 = UDP
        struct iphdr *iph = data + sizeof(struct ethhdr);
        if (data + sizeof(struct ethhdr) + sizeof(struct iphdr) <= data_end)
            protocol = iph->protocol;
    }
    return protocol;
}
```
:::

### Load Balancing and Forwarding

Another style of XDP program directly modifies packet contents; a load balancer forwarding packets to a number of backends can be implemented this way.

![Load Balancer Diagram](https://hackmd.io/_uploads/SJMhFhPpC.png)

As shown in the figure, the load balancer and the backends run in their own containers, and the programs are attached with the following command:

`bpftool net attach xdpgeneric pinned /sys/fs/bpf/$(TARGET) dev eth0`

The programs all load into the one shared kernel, while each attachment point is a network interface inside a particular container's network namespace.

#### XDP Offloading

The three places an XDP program can be triggered, in order of execution, are:
- Offloaded to the NIC: executed directly on the network interface card, using no host CPU cycles.
- In the NIC driver: executed in the driver, minimizing data copying within the kernel.
- Generic, in the kernel network stack: executed after the packet has entered the kernel networking subsystem.

### Traffic Control (TC)

The TC subsystem regulates how network traffic is scheduled, and traffic control is split into classifiers and separate actions. eBPF programs are attached as classifiers, indicating actions by their return codes:

- **TC_ACT_SHOT**: Drops the packet.
- **TC_ACT_UNSPEC**: Passes the packet to the next classifier.
- **TC_ACT_OK**: Passes the packet to the next layer in the stack.
- **TC_ACT_REDIRECT**: Sends the packet to the ingress/egress of another device.

eBPF programs can be attached to either ingress or egress. Unlike XDP, multiple programs can be attached and processed in sequence.
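As a minimal sketch (not from the book), a TC classifier that drops ICMP traffic and passes everything else might look like this:

```clike
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

// Hypothetical TC classifier: drop ICMP packets, pass everything else
// up the stack.
SEC("tc")
int tc_drop_ping(struct __sk_buff *skb)
{
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if (data + sizeof(*eth) > data_end)
        return TC_ACT_OK;
    if (bpf_ntohs(eth->h_proto) != ETH_P_IP)
        return TC_ACT_OK;

    struct iphdr *iph = data + sizeof(*eth);
    if ((void *)(iph + 1) > data_end)
        return TC_ACT_OK;

    if (iph->protocol == IPPROTO_ICMP)
        return TC_ACT_SHOT;   // drop the packet

    return TC_ACT_OK;         // pass to the next layer
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";
```

Assuming the compiled object is named `tc_drop.o`, it could be attached to ingress with `tc qdisc add dev eth0 clsact` followed by `tc filter add dev eth0 ingress bpf da obj tc_drop.o sec tc`; the `da` (direct-action) flag lets the classifier's return code serve directly as the action.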
### Packet Encryption and Decryption

User space programs call `SSL_write()` to encrypt data before it is sent and `SSL_read()` to obtain received data after decryption. An eBPF uprobe is suitable for capturing the content just before it is encrypted, while a uretprobe captures it just after it has been decrypted on the read path. Such programs should attach to the common TLS libraries: OpenSSL, GnuTLS, and NSS.

### eBPF and Kubernetes Networking

Naive approach:

<img src="https://hackmd.io/_uploads/By4xAI_60.png" width=450 style="display: block; margin: auto;">

After modification:

<img src="https://hackmd.io/_uploads/SkdlRUOaC.png" width=450 style="display: block; margin: auto;">

The more code a network packet has to pass through, the higher the latency. eBPF provides a flexible solution that can replace iptables and conntrack.

#### Avoiding iptables

The drawbacks of iptables in this setting include:
- Updating iptables rules in large-scale deployments is expensive.
- Rule lookup in iptables is O(n) in the number of rules.

Cilium replaces iptables with eBPF hash table maps to store network rules, improving lookup performance.

#### Coordinated Network Programs

To intercept traffic as early as possible and implement realistically complex policies, multiple eBPF programs, attached at different points in the kernel and network stack, must be coordinated.

#### Network Policy Enforcement

The limitation of traditional firewalls in Kubernetes is that IP addresses are not reliable identifiers, as they are dynamically assigned and can be reused by different applications over time. With eBPF's capabilities, Cilium can enforce rules based on cloud-native attributes, such as pod labels.

#### Encrypted Connections

Encrypting traffic between applications typically requires each application to establish its own secure connections. eBPF can offload the encryption work from the application to the kernel, requiring no changes to the applications themselves.