# Hands-On with sched_ext: Building Custom eBPF CPU Schedulers

## Introduction

In part 1, [Exploring sched_ext: BPF-Powered CPU Schedulers in the Linux Kernel](https://free5gc.org/blog/20250305/20250305/), we explored the architecture and concepts behind `sched_ext`, Linux's framework for implementing custom CPU schedulers using eBPF. We examined how this revolutionary approach allows dynamically loading schedulers without kernel recompilation, and we compared different implementation styles, including C-based (`scx_simple`), Rust-based (`scx_bpfland`, `scx_rustland`), and even a potential Go implementation (`scx_goland_core`).

Now it's time to get our hands dirty. In this second part, we'll move from theory to practice by building our own custom scheduler on the `sched_ext` framework. We'll start with a minimal implementation to understand the basics, then gradually add more sophisticated scheduling policies to handle real-world scenarios, such as network packet processing, where the scheduler is tuned for packet-processing-intensive workloads. By prioritizing packet-handling threads and optimizing CPU allocation, we can reduce latency and improve throughput in these critical systems.

By the end of this hands-on guide, you'll have:

* Set up a proper development environment for `sched_ext`
* Implemented and loaded your own custom BPF scheduler
* Explored different scheduling policies and their effects on performance
* Learned how to debug and test your scheduler under various workloads

Let's dive into the practical world of CPU scheduling with eBPF!

## Environment Setup

To start building custom schedulers with `sched_ext`, we need to set up a proper development environment. Let's go through this process step by step so you can follow along on your own system.

### Kernel Requirements (6.12+)

`sched_ext` requires Linux kernel 6.12 or newer. We can use the `mainline` utility to easily install a newer kernel:

```bash
sudo add-apt-repository ppa:cappelikan/ppa
sudo apt update
sudo apt install -y mainline
```

![image](https://hackmd.io/_uploads/HykQSZ5lee.png)

### Cloning and Building the sched_ext Project

```bash
git clone https://github.com/sched-ext/scx.git
cd scx
```

```bash
# Install BPF development tools:
sudo apt install libbpf-dev clang llvm libelf-dev

# Build the schedulers using meson (you also need Rust on your system)
cd ~/work/scx
meson setup build --prefix ~
meson compile -C build
meson install -C build
```

```bash
# The sched_ext framework requires these configurations to work properly; check them:
for config in BPF SCHED_CLASS_EXT BPF_SYSCALL BPF_JIT BPF_JIT_ALWAYS_ON BPF_JIT_DEFAULT_ON PAHOLE_HAS_BTF_TAG DEBUG_INFO_BTF SCHED_DEBUG DEBUG_INFO DEBUG_INFO_DWARF5 DEBUG_INFO_BTF_MODULES; do
    grep -w CONFIG_$config /boot/config-$(uname -r)
done

# You should see:
CONFIG_BPF=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_DWARF5=y
CONFIG_DEBUG_INFO_BTF_MODULES=y
```

(`CONFIG_PAHOLE_HAS_BTF_TAG` did not appear on our test kernel.)

## Implementation of an Even-CPU-Only Scheduler with SCX

### What We Did and Expected

We built a custom scheduler, `scx_packet`, that distributes tasks exclusively to even-numbered CPUs (0, 2, 4, 6, 8) while keeping odd-numbered CPUs idle. This demonstrates fine-grained CPU control within the sched_ext framework.

### Creating the Initial Scheduler

Our custom scheduler mostly reuses the same functions as `scx_simple`; we call it `scx_packet`.
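Since `scx_packet` starts life as a copy of `scx_simple`, the surrounding boilerplate (license, stats map, vtime bookkeeping, DSQ creation, and the `struct_ops` definition) carries over almost unchanged. Here's a condensed sketch of that scaffolding, adapted from `scx_simple` with everything renamed to `packet_*` — treat it as a template for orientation rather than a verbatim listing of the final file:

```c
#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

const volatile bool fifo_sched;	/* toggled from user space: FIFO vs. weighted vtime */

static u64 vtime_now;		/* global virtual time, advanced as tasks run */
UEI_DEFINE(uei);		/* exit-info bookkeeping for the user-space loader */

#define SHARED_DSQ 0		/* ID of our single shared dispatch queue */

/* Two counters (index 0: local queueing, index 1: global queueing) */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(key_size, sizeof(u32));
	__uint(value_size, sizeof(u64));
	__uint(max_entries, 2);
} stats SEC(".maps");

static void stat_inc(u32 idx)
{
	u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx);

	if (cnt_p)
		(*cnt_p)++;
}

void BPF_STRUCT_OPS(packet_running, struct task_struct *p)
{
	if (fifo_sched)
		return;

	/* Advance the global vtime high-water mark as tasks start running */
	if (time_before(vtime_now, p->scx.dsq_vtime))
		vtime_now = p->scx.dsq_vtime;
}

void BPF_STRUCT_OPS(packet_stopping, struct task_struct *p, bool runnable)
{
	if (fifo_sched)
		return;

	/* Charge the consumed slice, scaled inversely by task weight */
	p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
}

void BPF_STRUCT_OPS(packet_enable, struct task_struct *p)
{
	/* Start newly enabled tasks at the current global vtime */
	p->scx.dsq_vtime = vtime_now;
}

s32 BPF_STRUCT_OPS_SLEEPABLE(packet_init)
{
	/* Create the shared DSQ that our enqueue/dispatch paths operate on */
	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}

void BPF_STRUCT_OPS(packet_exit, struct scx_exit_info *ei)
{
	UEI_RECORD(uei, ei);
}

/* In the real file this goes at the bottom, after all the callbacks are defined */
SCX_OPS_DEFINE(packet_ops,
	       .select_cpu	= (void *)packet_select_cpu,
	       .enqueue		= (void *)packet_enqueue,
	       .dispatch	= (void *)packet_dispatch,
	       .running		= (void *)packet_running,
	       .stopping	= (void *)packet_stopping,
	       .enable		= (void *)packet_enable,
	       .init		= (void *)packet_init,
	       .exit		= (void *)packet_exit,
	       .name		= "packet");
```

The `stat_inc()` helper and `vtime_now` defined here are used by the enqueue path below.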
Now, let's modify our `scx_packet.bpf.c` file to implement our strategy of only using even-numbered CPUs. Here's our initial implementation:

```c
/* Main dispatch function that decides which tasks run on which CPUs */
void BPF_STRUCT_OPS(packet_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Only dispatch tasks to even-numbered CPUs */
	if ((cpu & 1) == 0) {
		scx_bpf_dsq_move_to_local(SHARED_DSQ);
	}
	/* Odd-numbered CPUs remain idle as we don't dispatch tasks to them */
}
```

But it's not quite that easy!

### Encountering and Fixing Stalls

When we tested this initial implementation, we ran into a critical issue:

```
kworker/u48:3[154254] triggered exit kind 1026: runnable task stall (kworker/0:1[141497] failed to run for 30.357s)
```

This means that tasks that can only run on odd-numbered CPUs get stuck in a "runnable" state but are never scheduled. So we need to modify our implementation as follows:

* In `packet_select_cpu`:
    - Simplified to use the default selection, since the real control happens in enqueue and dispatch
* In `packet_enqueue`:
    - Special-cases kernel threads with single-CPU affinity to respect their requirements
    - Uses `scx_bpf_dsq_insert()` to place tasks directly into the appropriate DSQ, giving us better control
    - Dispatches all other tasks to the shared queue
    - Actively kicks an even CPU (0, 2, 4, etc.) to process tasks from the queue
* In `packet_dispatch`:
    - Only allows even CPUs to consume from the shared queue
    - Odd CPUs will only run tasks that were specifically dispatched to them (kernel threads)

### The Comprehensive Solution

Let's build a comprehensive solution that respects system requirements while still implementing our even-CPU policy:

```c
void BPF_STRUCT_OPS(packet_enqueue, struct task_struct *p, u64 enq_flags)
{
	stat_inc(1);	/* count global queueing */

	/* Handle kernel threads with restricted CPU affinity */
	if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
		return;
	}

	/* For all other tasks, use the shared queue for later dispatch to even CPUs */
	if (fifo_sched) {
		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
	} else {
		u64 vtime = p->scx.dsq_vtime;

		/*
		 * Limit the amount of budget that an idling task can accumulate
		 * to one slice.
		 */
		if (time_before(vtime, vtime_now - SCX_SLICE_DFL))
			vtime = vtime_now - SCX_SLICE_DFL;

		scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
					 enq_flags);
	}

	/*
	 * We queued to the shared DSQ, so kick an even CPU to pick the task
	 * up. Rotating through CPUs 0, 2, 4, 6, 8 spreads the kicks (this
	 * assumes 10 CPUs; adjust for your machine). The counter update can
	 * race across CPUs, but for a wakeup hint that is harmless.
	 */
	static u32 kick_seq;
	s32 target_cpu = (s32)((kick_seq++ % 5) * 2);

	scx_bpf_kick_cpu(target_cpu, SCX_KICK_PREEMPT);
}
```

* Task Identification: We use `(p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1` to identify kernel threads with strict CPU affinity requirements.
* CPU Selection: The bitwise operation `(cpu & 1) == 0` efficiently determines whether a CPU is even-numbered (0, 2, 4, ...).
* Dispatch Strategy (see the sketch after this list for the dispatch side):
    * Regular tasks go to the shared queue
    * CPU-specific kernel threads go directly to their required local CPU queue
    * Only even CPUs pull from the shared queue in the dispatch function
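The comprehensive solution above only shows `packet_enqueue`. To round out the picture, here's a sketch of the matching `packet_select_cpu` and final `packet_dispatch` under this policy — our reconstruction of the behavior described in the bullet points, not verbatim source:

```c
s32 BPF_STRUCT_OPS(packet_select_cpu, struct task_struct *p, s32 prev_cpu,
		   u64 wake_flags)
{
	bool is_idle = false;

	/* Use the default idle-CPU selection only as a wakeup hint. We
	 * deliberately skip the usual direct-dispatch-on-idle shortcut so
	 * that every task flows through packet_enqueue, where the even-CPU
	 * policy is enforced. */
	return scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
}

void BPF_STRUCT_OPS(packet_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Only even-numbered CPUs consume from the shared queue. Odd CPUs
	 * run nothing here; they only execute tasks that were inserted
	 * directly into their local DSQ (the pinned kernel threads handled
	 * in packet_enqueue). */
	if ((cpu & 1) == 0)
		scx_bpf_dsq_move_to_local(SHARED_DSQ);
}
```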
### Rebuild and Run

In `scx/scheds/c/meson.build`, add our custom scheduler:

```
c_scheds = ['scx_simple', 'scx_qmap', 'scx_central', 'scx_userland', 'scx_nest',
            'scx_flatcg', 'scx_pair', 'scx_prev', 'scx_packet']
```

```bash
meson setup build --reconfigure
meson compile -C build scx_packet

# If successful, the binary will be available at build/scheds/c/scx_packet.
sudo ./build/scheds/c/scx_packet

# Verify that our scheduler is active:
vboxuser@sch:~/work/scx$ cat /sys/kernel/sched_ext/state /sys/kernel/sched_ext/*/ops 2>/dev/null
enabled
packet
```

## Testing Our Custom Scheduler

Now that we've implemented our even-CPU-only scheduler, it's time to put it to the test. We'll use two different types of workloads to verify its behavior:

* A CPU-intensive workload using stress-ng to see how tasks are distributed
* A graphics application (glxgears) to observe how our scheduler affects rendering performance

### Installing the Testing Tools

First, let's install stress-ng for CPU stress testing, plus MangoHud and mesa-utils for the graphics test:

```
sudo apt update
sudo apt install -y stress-ng
sudo apt install -y mangohud
sudo apt install -y mesa-utils
```

### Test 1: CPU Stress Testing

The following creates 5 worker processes performing intensive matrix multiplication:

```bash
sudo stress-ng --cpu 5 --cpu-method matrixprod --timeout 15s
```

`htop` should show activity primarily on CPUs 0, 2, 4, 6, 8.

![image](https://hackmd.io/_uploads/HJ-NP69exl.png)

But what if we use `--cpu 10`? Still only 5 CPUs running!

![image](https://hackmd.io/_uploads/Byu9z1sxlx.png)

### Test 2: Graphics Performance Testing

```
# Run glxgears with the MangoHud overlay
MANGOHUD=1 mangohud --dlsym glxgears -info
```

This displays an FPS counter overlaid on the rotating gears. Note the FPS values and CPU utilization shown in the MangoHud overlay.

With `scx_simple`:

![image](https://hackmd.io/_uploads/B1a0r1sggl.png)

With `scx_packet`:

![image](https://hackmd.io/_uploads/r1615aclxe.png)

### Performance Metrics Analysis

The stress-ng metrics for the two schedulers align with the logical expectations: with only half the CPUs available to it, `scx_packet` reaches roughly two-thirds of `scx_simple`'s aggregate throughput (8,479,124 / 5,603,885 ≈ 1.51):

| Metric | scx_packet (Even CPUs) | scx_simple (All CPUs) | Difference |
|--------|------------------------|-----------------------|------------|
| Bogo-ops | 5,603,885 | 8,479,124 | ~51% higher with all CPUs |
| Bogo-ops-per-second | 546,351 | 830,616 | ~52% higher with all CPUs |
| CPU usage per instance | 103.92% | 134.21% | ~29% higher with all CPUs |
| MB received per second | 5.59 | 8.46 | ~51% higher with all CPUs |

## Conclusion and Future Work

Our even-CPU scheduler demonstrates the basic principles of CPU control with `sched_ext`, but real-world applications like free5GC or other networking stacks present more complex scheduling challenges. Let's explore how we might adapt our scheduler for packet-processing workloads.
#### Optimizing for Network Packet Processing

Packet-processing workloads have unique characteristics that require specialized scheduling approaches:

* I/O bound: Network interfaces generate interrupts when packets arrive, making some tasks I/O bound as they wait for new packets
* CPU bound: Once packets arrive, processing them (parsing, encrypting/decrypting, routing) can be CPU intensive

#### Task Classification

One simple approach is to classify tasks by their command name (`p->comm`), matching against a list of common network-process name prefixes:

```c
/* Check whether a task is a network-related process by comparing its
 * command name against a list of common network-process name prefixes. */
static inline bool is_network_task(struct task_struct *p)
{
	static const char network_processes[][8] = {"upf", "dpdk", "ovs", "xdp"};

	for (int i = 0; i < sizeof(network_processes) / sizeof(network_processes[0]); i++) {
		bool match = true;

		/* Bounded prefix comparison keeps the BPF verifier happy */
		for (int j = 0; network_processes[i][j]; j++) {
			if (p->comm[j] != network_processes[i][j]) {
				match = false;
				break;
			}
		}
		if (match)
			return true;
	}
	return false;
}
```

#### Prioritizing Packet Tasks

We can also modify our enqueue function to give packet-processing tasks higher priority:

```c
void BPF_STRUCT_OPS(packet_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Prioritize network packet-processing tasks */
	if (is_network_task(p)) {
		/* Use a shorter time slice for responsiveness */
		u64 short_slice = SCX_SLICE_DFL / 2;
		/* Back-date the task's vtime so it sorts ahead of ordinary tasks */
		u64 priority_vtime = vtime_now - (SCX_SLICE_DFL * 2);

		scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, short_slice,
					 priority_vtime, enq_flags);
		return;
	}

	/* ... handle all other tasks exactly as before ... */
}
```

#### Next Steps

* Integrating with DPDK or XDP for zero-copy packet processing
* Benchmarking against standard schedulers with realistic network traffic

## References

* [sched-ext Tutorial](https://wiki.cachyos.org/configuration/sched-ext/)
* [内核调度客制化利器:SCHED_EXT](https://blog.csdn.net/feelabclihu/article/details/139364772)
* [BPF 赋能调度器:万字详解 sched_ext 实现机制与工作流程](https://mp.weixin.qq.com/s?__biz=MzA3NzUzNTM4NA==&mid=2649615315&idx=1&sn=ab7adba4ac6b53c543f399d0d3c8c4c3&chksm=8749ca24b03e4332c9dfdb915ae2012873b97c1be864a539319942449ffe098587de2b5c8745&cur_album_id=3509980851549372416&scene=189#wechat_redirect)
* [Pluggable CPU schedulers](https://en.opensuse.org/Pluggable_CPU_schedulers)
* [sched_ext: scheduler architecture and interfaces](https://blogs.igalia.com/changwoo/sched-ext-scheduler-architecture-and-interfaces-part-2/)
* [eBPF 隨筆(七):sched_ext](https://medium.com/@ianchen0119/ebpf-%E9%9A%A8%E7%AD%86-%E4%B8%83-sched-ext-f7b60ea28976)
* [scx_goland_core](https://github.com/Gthulhu/scx_goland_core)

## About

Hello, I'm William Lin. I'd like to share my excitement about being a member of the free5GC project, which is part of the Linux Foundation. I'm always eager to discuss any aspect of core network development or related technologies.

### Connect with Me

- GitHub: [williamlin0518](https://github.com/williamlin0518)
- LinkedIn: [Cheng Wei Lin](https://www.linkedin.com/in/cheng-wei-lin-b36b7b235/)