# Hands-On with sched_ext: Building Custom eBPF CPU Schedulers
## Introduction
In part 1, [Exploring sched_ext: BPF-Powered CPU Schedulers in the Linux Kernel](https://free5gc.org/blog/20250305/20250305/), we explored the architecture and concepts behind `sched_ext`, Linux's framework for implementing custom CPU schedulers using eBPF. We examined how this revolutionary approach allows schedulers to be loaded dynamically without kernel recompilation, and we compared different implementation styles, including C-based (`scx_simple`), Rust-based (`scx_bpfland`, `scx_rustland`), and even a potential Go implementation, `scx_goland_core`.
Now it's time to get our hands dirty. In this second part, we'll move from theory to practice by building our own custom schedulers using the `sched_ext` framework. We'll start with a minimal implementation to understand the basics, then gradually add more sophisticated scheduling policies to handle real-world scenarios, such as packet-processing-intensive workloads. By prioritizing packet-handling threads and optimizing CPU allocation, we can reduce latency and improve throughput in these critical systems.
By the end of this hands-on guide, you'll have:
* Set up a proper development environment for `sched_ext`
* Implemented and loaded your own custom BPF scheduler
* Explored different scheduling policies and their effects on performance
* Learned how to debug and test your scheduler under various workloads
Let's dive into the practical world of CPU scheduling with eBPF!
## Environment Setup
To start building custom schedulers with `sched_ext`, we need to set up a proper development environment. Let's go through this process step by step so you can follow along on your own system.
### Kernel Requirements (6.12+)
`sched_ext` requires Linux kernel 6.12 or newer. On Ubuntu, we can use the `mainline` utility to install a newer kernel easily:
```bash
sudo add-apt-repository ppa:cappelikan/ppa
sudo apt update
sudo apt install -y mainline
```
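After installing a 6.12+ kernel and rebooting into it, confirm what you're actually running (the sysfs check assumes `CONFIG_SCHED_CLASS_EXT` is enabled, which we verify more thoroughly below):
```bash
# Confirm the running kernel is 6.12 or newer
uname -r
# On a kernel built with CONFIG_SCHED_CLASS_EXT, the sched_ext sysfs interface exists
ls /sys/kernel/sched_ext/
```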

### Cloning and Building the sched_ext Project
```bash
git clone https://github.com/sched-ext/scx.git
cd scx
```
```bash
# Install BPF development tools:
sudo apt install -y libbpf-dev clang llvm libelf-dev
# Build the schedulers with meson (a Rust toolchain is also required for the Rust-based schedulers)
cd ~/work/scx
meson setup build --prefix ~
meson compile -C build
meson install -C build
```
```bash
# The sched_ext framework requires these kernel config options; check that they are all enabled
for config in BPF SCHED_CLASS_EXT BPF_SYSCALL BPF_JIT BPF_JIT_ALWAYS_ON BPF_JIT_DEFAULT_ON PAHOLE_HAS_BTF_TAG DEBUG_INFO_BTF SCHED_DEBUG DEBUG_INFO DEBUG_INFO_DWARF5 DEBUG_INFO_BTF_MODULES; do
grep -w CONFIG_$config /boot/config-$(uname -r)
done
# you should see output like:
CONFIG_BPF=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_DWARF5=y
CONFIG_DEBUG_INFO_BTF_MODULES=y
```
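With the kernel options confirmed and the schedulers built, a quick sanity check is to load the stock `scx_simple` scheduler before writing our own (paths assume the meson build tree set up above):
```bash
# Load the reference scheduler; it keeps running in the foreground
sudo ./build/scheds/c/scx_simple
# In another terminal, confirm sched_ext is active
cat /sys/kernel/sched_ext/state /sys/kernel/sched_ext/*/ops
```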
## Implementation of an Even-CPU-Only Scheduler with SCX
### What We Did and Expected
We want our scheduler, `scx_packet`, to distribute tasks exclusively to even-numbered CPUs (0, 2, 4, 6, 8) while keeping odd-numbered CPUs idle. This demonstrates fine-grained CPU control within the `sched_ext` framework.
### Creating the Initial Scheduler
Our custom scheduler reuses most of the callbacks from `scx_simple`, and I call it `scx_packet`.
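A simple way to bootstrap it (assuming the upstream `scx` source layout; the file names are our own choice) is to copy the `scx_simple` sources and rename the callbacks:
```bash
# Start scx_packet as a copy of scx_simple (run from the scx repo root)
cp scheds/c/scx_simple.bpf.c scheds/c/scx_packet.bpf.c
cp scheds/c/scx_simple.c scheds/c/scx_packet.c
# Inside the copies, rename the simple_* callbacks and the scheduler's
# struct_ops definition to packet_* so it registers under its own name
```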
Now, let's modify our `scx_packet.bpf.c` file to implement our strategy of only using even-numbered CPUs. Here's our initial implementation:
```c
/* Main dispatch function that decides which tasks run on which CPUs */
void BPF_STRUCT_OPS(packet_dispatch, s32 cpu, struct task_struct *prev)
{
    /* Only dispatch tasks to even-numbered CPUs */
    if ((cpu & 1) == 0) {
        scx_bpf_dsq_move_to_local(SHARED_DSQ);
    }
    /* Odd-numbered CPUs remain idle as we don't dispatch tasks to them */
}
```
But it turns out it's not that easy!
### Encountering and Fixing Stalls
When we tested this initial implementation, we ran into a critical issue:
```
kworker/u48:3[154254] triggered exit kind 1026:
runnable task stall (kworker/0:1[141497] failed to run for 30.357s)
```
This means that tasks that can only run on odd-numbered CPUs are stuck in a "runnable" state but never get scheduled to run.
So we need to modify our implementation as follows:
* In `packet_select_cpu`:
    - Simplify it to use the default selection; the real control happens in enqueue and dispatch (a minimal sketch is shown after the solution below)
* In `packet_enqueue`:
    - Special-case kernel threads with single-CPU affinity to respect their requirements
    - Use `scx_bpf_dsq_insert()` instead of queue insertion for better control
    - Dispatch regular tasks to the shared queue
    - Actively kick an even CPU (0, 2, 4, etc.) to process tasks from the queue
* In `packet_dispatch`:
    - Only allow even CPUs to consume from the shared queue
    - Odd CPUs will only run tasks that were specifically dispatched to them (kernel threads)
### The Comprehensive Solution
Let's build a comprehensive solution that respects system requirements while still implementing our even-CPU policy:
```c
void BPF_STRUCT_OPS(packet_enqueue, struct task_struct *p, u64 enq_flags)
{
    stat_inc(1); /* count global queueing */

    /* Handle kernel threads with restricted CPU affinity */
    if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
        scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
        return;
    }

    /* For all other tasks, use the shared queue for later dispatch to even CPUs */
    if (fifo_sched) {
        scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
    } else {
        u64 vtime = p->scx.dsq_vtime;

        /*
         * Limit the amount of budget that an idling task can accumulate
         * to one slice.
         */
        if (time_before(vtime, vtime_now - SCX_SLICE_DFL))
            vtime = vtime_now - SCX_SLICE_DFL;

        scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime, enq_flags);
    }

    /* We dispatched to the shared queue, so kick an even CPU to process it.
     * The bounded loop keeps the BPF verifier happy; with the early break it
     * effectively kicks CPU 0, the first even CPU. */
    for (s32 i = 0; i < 5; i++) {       /* limit to 5 iterations (assume max 10 CPUs) */
        s32 target_cpu = 2 * i;         /* only even CPUs */
        if (target_cpu < 10) {
            scx_bpf_kick_cpu(target_cpu, SCX_KICK_PREEMPT);
            break;
        }
    }
}
```
* Task Identification: We use `(p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1` to identify kernel threads with strict CPU affinity requirements.
* CPU Selection: The bitwise operation `(cpu & 1) == 0` efficiently determines whether a CPU is even-numbered (0, 2, 4...).
* Dispatch Strategy:
    - Regular tasks go to the shared queue
    - CPU-specific kernel threads go directly to their required local CPU queue
    - Only even CPUs pull from the shared queue in the dispatch function
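For completeness, here's a minimal sketch of what the simplified `packet_select_cpu` can look like. It simply defers to the default idle-CPU selection helper `scx_bpf_select_cpu_dfl()`, the same way `scx_simple` does; the even-CPU policy itself is enforced in enqueue and dispatch, not here:
```c
s32 BPF_STRUCT_OPS(packet_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
{
    bool is_idle = false;

    /* Default idle-CPU selection only; even/odd filtering happens later in
     * packet_enqueue() and packet_dispatch(). */
    return scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
}
```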
### Rebuild and Run
In `scx/scheds/c/meson.build`, add our custom scheduler to the list of C schedulers:
```
c_scheds = ['scx_simple', 'scx_qmap', 'scx_central', 'scx_userland', 'scx_nest',
'scx_flatcg', 'scx_pair', 'scx_prev', 'scx_packet']
```
```bash
meson setup build --reconfigure
meson compile -C build scx_packet
# If successful, the binary will be available at build/scheds/c/scx_packet.
sudo ./build/scheds/c/scx_packet
# verify that our scheduler is loaded
vboxuser@sch:~/work/scx$ cat /sys/kernel/sched_ext/state /sys/kernel/sched_ext/*/ops 2>/dev/null
enabled
packet
```
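When you're done experimenting, stop the user-space loader with Ctrl-C; the BPF scheduler is unloaded and the kernel falls back to its built-in scheduler. The sysfs state should reflect that:
```bash
# After stopping scx_packet, the framework should be idle again
cat /sys/kernel/sched_ext/state
# disabled
```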
## Testing Our Custom Scheduler
Now that we've implemented our even-CPU-only scheduler, it's time to put it to the test. We'll use two different types of workloads to verify its behavior:
* A CPU-intensive workload using stress-ng to see how tasks are distributed
* A graphics application (glxgears) to observe how our scheduler affects rendering performance
### Installing the Testing Tools
First, let's install stress-ng for CPU stress testing, plus MangoHud and mesa-utils (which provides glxgears) for the graphics test:
```bash
sudo apt update
sudo apt install -y stress-ng
sudo apt install -y mangohud
sudo apt install -y mesa-utils
```
### Test 1: CPU Stress Testing
This creates 5 worker processes performing intensive matrix multiplication:
```bash
sudo stress-ng --cpu 5 --cpu-method matrixprod --timeout 15s
```
`htop` should show activity primarily on CPUs 0, 2, 4, 6, 8
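If you prefer a textual view over `htop`, per-CPU utilization can also be watched with `mpstat` from the sysstat package (any per-CPU monitor works; this is just one option):
```bash
# Per-CPU utilization, refreshed every second
sudo apt install -y sysstat
mpstat -P ALL 1
```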

But what if we use `--cpu 10`? Still only 5 CPUs are busy! Odd-numbered CPUs never pull from the shared queue, so the extra workers simply share the even CPUs.

### Test 2: Graphics Performance Testing
```bash
# Run glxgears with MangoHud overlay
MANGOHUD=1 mangohud --dlsym glxgears -info
```
This will display the FPS counter overlaid on the rotating gears. Note the FPS values and CPU utilization shown in the MangoHud overlay.
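One caveat: on most desktop setups `glxgears` is capped by vsync, so FPS differences between schedulers may be hidden. On Mesa drivers you can optionally disable vsync via the `vblank_mode` environment variable:
```bash
# Disable vsync (Mesa) so the FPS counter reflects raw rendering throughput
vblank_mode=0 MANGOHUD=1 mangohud --dlsym glxgears -info
```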
In `scx_simple`

In `scx_packet`

Performance metrics analysis: the stress-ng numbers align with what we'd logically expect:
| Metric | scx_packet (Even CPUs) | scx_simple (All CPUs) | Difference |
|--------|-------------------|-------------------|------------|
| Bogo-ops | 5,603,885 | 8,479,124 | ~51% higher for all CPUs |
| Bogo-ops-per-second | 546,351 | 830,616 | ~52% higher for all CPUs |
| CPU usage per instance | 103.92% | 134.21% | ~29% higher for all CPUs |
| MB received per second | 5.59 | 8.46 | ~51% higher for all CPUs |
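Numbers like these can be collected from stress-ng's built-in reporting; for example, a run such as the following under each scheduler prints bogo-ops and bogo-ops-per-second (the exact figures above come from our own runs, so yours will differ):
```bash
# --metrics-brief prints bogo-ops and bogo-ops/s for each stressor class
sudo stress-ng --cpu 5 --cpu-method matrixprod --metrics-brief --timeout 15s
```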
## Conclusion and Future Work
Our even-CPU scheduler demonstrates the basic principles of CPU control with `sched_ext`, but real-world applications like Free5GC or other networking stacks present more complex scheduling challenges. Let's explore how we might adapt our scheduler for packet processing workloads.
### Optimizing for Network Packet Processing
Packet processing workloads have unique characteristics that require specialized scheduling approaches:
* I/O Bound: Network interfaces generate interrupts when packets arrive, making some tasks I/O bound as they wait for new packets
* CPU Bound: Once packets arrive, processing them (parsing, encrypting/decrypting, routing) can be CPU intensive
#### Task Classification
```c
/* Check whether a task is a network-related process by matching its name
 * (p->comm) against a few well-known prefixes. */
static __always_inline bool comm_has_prefix(struct task_struct *p, const char *prefix)
{
    /* TASK_COMM_LEN is 16; the bounded loop keeps the BPF verifier happy */
    for (int i = 0; i < 16 && prefix[i]; i++) {
        if (p->comm[i] != prefix[i])
            return false;
    }
    return true;
}

static __always_inline bool is_network_task(struct task_struct *p)
{
    /* Common network process names to prioritize */
    return comm_has_prefix(p, "upf")  ||
           comm_has_prefix(p, "dpdk") ||
           comm_has_prefix(p, "ovs")  ||
           comm_has_prefix(p, "xdp");
}
```
#### Prioritizing Packet Tasks
We can also modify our enqueue function to give packet-processing tasks higher priority:
```c
void BPF_STRUCT_OPS(packet_enqueue, struct task_struct *p, u64 enq_flags)
{
    /* Prioritize network packet processing tasks */
    if (is_network_task(p)) {
        /* Use a shorter time slice for responsiveness */
        u64 short_slice = SCX_SLICE_DFL / 2;
        /* Give packet tasks an earlier vtime so they are dispatched first */
        u64 priority_vtime = vtime_now - (SCX_SLICE_DFL * 2);

        scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, short_slice, priority_vtime,
                                 enq_flags);
        return;
    }

    /* All other tasks are enqueued the same way as before */
}
```
### Next Steps
* Integrating with DPDK or XDP for Zero-Copy Packet Processing
* Benchmarking against standard schedulers with realistic network traffic
## References
* [sched-ext Tutorial](https://wiki.cachyos.org/configuration/sched-ext/)
* [SCHED_EXT: A Powerful Tool for Kernel Scheduler Customization](https://blog.csdn.net/feelabclihu/article/details/139364772)
* [BPF-Powered Schedulers: An In-Depth Look at sched_ext's Implementation and Workflow](https://mp.weixin.qq.com/s?__biz=MzA3NzUzNTM4NA==&mid=2649615315&idx=1&sn=ab7adba4ac6b53c543f399d0d3c8c4c3&chksm=8749ca24b03e4332c9dfdb915ae2012873b97c1be864a539319942449ffe098587de2b5c8745&cur_album_id=3509980851549372416&scene=189#wechat_redirect)
* [Pluggable CPU schedulers](https://en.opensuse.org/Pluggable_CPU_schedulers#:~:text=,possible%20to%20leverage%20data%20locality)
* [sched_ext: scheduler architecture and interfaces](https://blogs.igalia.com/changwoo/sched-ext-scheduler-architecture-and-interfaces-part-2/)
* [eBPF Notes (7): sched_ext](https://medium.com/@ianchen0119/ebpf-%E9%9A%A8%E7%AD%86-%E4%B8%83-sched-ext-f7b60ea28976)
* [scx_goland_core](https://github.com/Gthulhu/scx_goland_core)
## About
Hello, I'm William Lin. I'd like to share my excitement about being a member of the free5gc project, which is a part of the Linux Foundation. I'm always eager to discuss any aspects of core network development or related technologies.
### Connect with Me
- GitHub: [williamlin0518](https://github.com/williamlin0518)
- LinkedIn: [Cheng Wei Lin](https://www.linkedin.com/in/cheng-wei-lin-b36b7b235/)