The Linux kernel's scheduler is one of its most critical components, determining which tasks run on which CPUs and for how long.
But in today's complex computing environments, the default Linux scheduler doesn't always provide optimal performance for specialized workloads. This is where sched_ext (SCX) comes in – a framework that allows implementing custom CPU schedulers in BPF (Berkeley Packet Filter) and loading them dynamically. In this technical analysis, I'll examine the architecture and implementation of SCX schedulers, with a particular focus on scx_rustland and scx_bpfland, comparing them to traditional schedulers.
Linux kernel 6.12 introduced sched_ext (“extensible scheduler”) as a new scheduling class that allows pluggable CPU schedulers via eBPF. It enables implementing and dynamically loading thread schedulers written in BPF, which means that, unlike traditional scheduler modifications that require kernel recompilation and a reboot, a new policy can be loaded and swapped at runtime. sched_ext defines a set of hook points (operations) that an eBPF-based scheduler can implement (such as picking the next task, enqueuing/dequeuing tasks, etc.).
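For orientation, here is a minimal sketch of how such a scheduler wires its callbacks into these hook points. It is illustrative only: it assumes the common BPF headers shipped with the scx tree, shows just two of the available operations, and helper names (e.g. scx_bpf_dsq_insert() versus the older scx_bpf_dispatch()) vary between kernel versions.

```c
#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

/* Hint which CPU the task should run on; keeping the previous CPU preserves cache locality. */
s32 BPF_STRUCT_OPS(minimal_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
{
	return prev_cpu;
}

/* Queue the runnable task on the built-in global DSQ with the default time slice. */
void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
{
	scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

/* Register the callbacks with the sched_ext core as a struct_ops map. */
SEC(".struct_ops.link")
struct sched_ext_ops minimal_ops = {
	.select_cpu	= (void *)minimal_select_cpu,
	.enqueue	= (void *)minimal_enqueue,
	.name		= "minimal",
};
```

A small user-space loader then opens and attaches this struct_ops program; detaching it (or the loader exiting) returns all tasks to the default scheduler.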
The scx project is a collection of sched_ext schedulers and tools. Schedulers in scx range from simple demonstrative policies to production-oriented ones tailored for specific use cases. Each scheduler implements the required sched_ext hooks (via eBPF programs) and can be selected at runtime; the default Linux scheduler can always be restored if needed.
Source: https://www.ebpf.top/post/bpf_sched_ext_dive_into/
I'll provide a deep dive into the end-to-end task flow in sched_ext, examining how tasks move through the scheduling cycle: how tasks join the SCHED_EXT scheduling class, how dispatch queues (DSQs) work, and how each CPU picks its next task.
Tasks are switched to the new sched_ext scheduling class either by being created with the SCHED_EXT policy or through a global switch. Once a task comes under sched_ext management, it is integrated into the BPF scheduler's queues. The sched_ext core calls the BPF scheduler's ops.init_task() callback for each task joining sched_ext, giving the BPF code a chance to initialize per-task state (e.g. tracking virtual runtime). From then on, the task is managed by sched_ext – it's now subject to the BPF scheduling logic rather than the default CFS rules.

sched_ext provides dispatch queues (DSQs): by default there is one global DSQ (SCX_DSQ_GLOBAL) and one local DSQ per CPU (SCX_DSQ_LOCAL).

When a task becomes runnable, the sched_ext core invokes the BPF scheduler's ops.select_cpu() followed by ops.enqueue() callbacks. In ops.enqueue(), the scheduler can call scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags) to place the task directly into the target CPU's local queue, or dispatch it to the global DSQ (SCX_DSQ_GLOBAL) or a custom DSQ.

The actual decision of which task runs next on each CPU is made by the interplay of the ops.dispatch() and (optionally) ops.consume() callbacks, in conjunction with the dispatch queues. The scheduling cycle on a CPU, from a high-level perspective, is as follows:
1. select_cpu(): This callback allows the scheduler to choose an optimal CPU for the task before it is enqueued. For example, a scheduler might want to put a waking task on the same CPU it ran on last (for cache affinity) or find an idle CPU to improve latency.
2. enqueue(): As described above, the BPF scheduler's .enqueue() either dispatches the task to a DSQ (local, global, or custom) or holds it in an internal queue.
3. Local and global DSQ consumption: When a CPU needs its next task to run, it first checks its local DSQ for waiting tasks; if one is found, it is dequeued and runs immediately. If the local DSQ is empty, the CPU can try to consume from the global shared queue.
4. dispatch(): If, after consuming the global queue, the CPU still has no task, the core invokes the BPF scheduler's ops.dispatch() callback. It can call scx_bpf_dispatch() to dispatch tasks to either the requesting CPU's local DSQ, the global DSQ, or a custom DSQ.

| Function | Purpose | Called By |
|---|---|---|
| select_cpu() | Choose target CPU hint | Kernel on task wakeup |
| enqueue() | Place task in queue or DSQ | Kernel after CPU selection |
| dispatch() | Find next task to run on CPU | Kernel when CPU needs work |
| scx_bpf_dsq_insert() | Add task to FIFO DSQ | BPF scheduler |
| scx_bpf_dsq_insert_vtime() | Add task to priority DSQ | BPF scheduler |
| scx_bpf_dsq_move_to_local() | Move task from DSQ to CPU | BPF in dispatch() |
| scx_bpf_kick_cpu() | Wake up idle CPU | BPF scheduler |
| running() | Track task starting on CPU | Kernel when task runs |
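Tying a few of these helpers together, the sketch below (an illustration, not taken from any particular scx scheduler) shows a minimal enqueue()/dispatch() pair built around one custom shared DSQ; SHARED_DSQ is an id chosen for this example and would have to be created in ops.init() via scx_bpf_create_dsq().

```c
#define SHARED_DSQ 0	/* custom DSQ id, created in ops.init() with scx_bpf_create_dsq() */

void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
{
	s32 cpu;

	/* Park the runnable task in the shared FIFO DSQ. */
	scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);

	/* If an allowed CPU is idle, kick it so it runs dispatch() and picks the task up. */
	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
	if (cpu >= 0)
		scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
}

void BPF_STRUCT_OPS(example_dispatch, s32 cpu, struct task_struct *prev)
{
	/* The local DSQ is empty: pull the next waiting task from the shared DSQ. */
	scx_bpf_dsq_move_to_local(SHARED_DSQ);
}
```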
I follow the workflow from a blog post by Andrea Righi, which outlines a great way to test sched_ext without modifying the existing system.
First, start the virtual environment with the sched_ext kernel:
Once inside the virtual environment, you can run one of the schedulers with the helper function:
scx_simple
scx_simple is a minimal scheduler that functions either as a global weighted vtime scheduler (similar to the Completely Fair Scheduler) or as a FIFO scheduler. It's designed primarily to demonstrate basic scheduling concepts.
In the code below, from scx_simple.bpf.c, the .enqueue callback handles a task that needs to be scheduled. If the scheduler is in FIFO mode (fifo_sched == true), it simply inserts the task into the shared dispatch queue without any priority sorting. If in normal (weighted vtime) mode, it retrieves the task's current virtual time (p->scx.dsq_vtime), adjusts it so that no task gains more than one slice worth of idle credit, and inserts the task into the shared queue with that virtual time as the key.
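The following is a simplified sketch of that callback, closely following the upstream scx_simple sources (stats counters are omitted; fifo_sched, SHARED_DSQ and vtime_now are globals defined elsewhere in that file, and older kernels spell the helpers scx_bpf_dispatch()/scx_bpf_dispatch_vtime()):

```c
void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
	if (fifo_sched) {
		/* FIFO mode: append to the shared DSQ, no priority ordering. */
		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
	} else {
		u64 vtime = p->scx.dsq_vtime;

		/*
		 * Cap how much credit an idling task can accumulate to one
		 * slice (the real code uses a wrap-safe vtime comparison).
		 */
		if (vtime < vtime_now - SCX_SLICE_DFL)
			vtime = vtime_now - SCX_SLICE_DFL;

		/* Weighted-vtime mode: the shared DSQ is ordered by vtime. */
		scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
					 enq_flags);
	}
}
```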
scx_rustland
is designed to prioritize interactive workloads over CPU-intensive background workloads.
In practical terms, this likely means it keeps track of each task’s recent CPU usage or blocking behavior and assigns higher priority (sooner scheduling, more CPU time) to tasks that have shorter CPU bursts.
scx_rustland
uses a hybrid approach that splits functionality between kernel space and user space:
BPF Dispatcher (scx_rustland_core): The BPF part of scx_rustland implements the minimal logic required to interface with the kernel. Its enqueue hook, for example, does not directly pick a dispatch queue as a normal scheduler would. Instead, it may place the task into a BPF queue map that represents “tasks waiting for a user-space decision.”
User-Space Scheduler (Rust): On the user side, the Rust scheduler process uses libraries (such as libbpf or Aya) to interact with the eBPF program and attaches to the maps it exposes. Typically, it uses a ring buffer to receive each task that needs scheduling and then runs its algorithm to decide where and when that task should run.
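As a rough illustration of that split, the BPF-side enqueue path of such a design could look like the sketch below. The struct layout and map name are assumptions made for this example, not the actual scx_rustland_core code:

```c
/* Per-task record handed to the user-space scheduler (illustrative layout). */
struct queued_task {
	s32 pid;
	u64 sum_exec_runtime;
	u64 weight;
};

/* Ring buffer polled by the user-space scheduler. */
struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} queued SEC(".maps");

void BPF_STRUCT_OPS(rustland_like_enqueue, struct task_struct *p, u64 enq_flags)
{
	struct queued_task *t;

	t = bpf_ringbuf_reserve(&queued, sizeof(*t), 0);
	if (!t) {
		/* Ring buffer full: fall back to the global DSQ so the task still runs. */
		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
		return;
	}

	/* Export just enough state for the user-space policy to make its decision. */
	t->pid = p->pid;
	t->sum_exec_runtime = p->se.sum_exec_runtime;
	t->weight = p->scx.weight;
	bpf_ringbuf_submit(t, 0);
}
```

The user-space side then decides the order and target CPU for each queued task and hands the result back through another map, which a later BPF dispatch path turns into actual DSQ insertions.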
scx_rustland thus has two main components, with the scheduling logic split between the kernel (BPF) part and the user-space part.
In user-space, scx_rustland
maintains its own runqueue data structure – specifically a BTreeSet of tasks ordered by weighted vruntime (essentially mimicking CFS in user-space). It also monitors each task’s behavior to detect interactive tasks: if a task consistently voluntarily yields CPU (releasing the CPU before its time slice is fully used), it’s considered interactive.
scx_bpfland implements its scheduling logic almost entirely in BPF code that runs in kernel space. It follows a more traditional approach in which all scheduling decisions happen within the kernel: every scheduling callback is implemented directly in BPF. Its internal logic is very similar in spirit to scx_rustland's algorithm, but it executes entirely within the BPF program (in kernel), effectively merging the two-tier logic into one.
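The fragment below is an illustrative sketch of this style, not the actual scx_bpfland code: the interactivity signal, threshold, and DSQ id are assumptions, and it presumes p->scx.dsq_vtime is kept up to date by the ops.running()/ops.stopping() callbacks as in the scx_simple example above.

```c
#define SHARED_DSQ 0	/* vtime-ordered DSQ created in ops.init() */

void BPF_STRUCT_OPS(bpfland_like_enqueue, struct task_struct *p, u64 enq_flags)
{
	u64 vtime = p->scx.dsq_vtime;

	/*
	 * Rough interactivity signal: tasks with many voluntary context
	 * switches tend to sleep before exhausting their slice, so give
	 * them an earlier position in the priority DSQ.
	 */
	if (p->nvcsw > 100 && vtime >= SCX_SLICE_DFL)
		vtime -= SCX_SLICE_DFL;

	scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime, enq_flags);
}
```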
The sched_ext + eBPF approach (exemplified by SCX schedulers) brings several key advantages over traditional in-kernel schedulers: runtime pluggability, rapid development and iteration, workload-specific optimizations, and crash resilience, since a misbehaving BPF scheduler can simply be unloaded and the system falls back to the default scheduler.
The scx_goland_core
project represents another approach to implementing a user-space scheduler for Linux using the sched_ext
framework. This project follows the architectural patterns established by scx_rustland, adapting them to the Go ecosystem.
The Go implementation consists of several key packages and types:
Core Package:
Memory Management:
In scx_goland_core
, tasks from the kernel’s ring buffers are received as byte slices and then decoded (using binary.Read) into Go structs. This decoding and copying process is relatively expensive when compared to Rust’s approach where you can often work in a more zero‑copy manner. In contrast, scx_rustland
benefits from Rust’s zero‑cost abstractions, minimal runtime overhead, and more direct access to low-level memory without a garbage collector.
For example, in Go each record pulled from the ring buffer has to be copied and decoded field by field into a struct (e.g. via binary.Read), whereas in Rust the same bytes can typically be viewed in place as a typed struct without an extra copy.
Garbage Collection:
Go is a managed language with a garbage collector and its own scheduler for goroutines. Even though Go is very efficient, those additional layers (GC, goroutine scheduling, channel operations) add overhead compared to Rust’s zero‑cost abstractions where most things are determined at compile time.
It shows how Go's concurrency model and ease of use can be applied to system programming tasks that were traditionally the domain of C or Rust.
While it's not as optimized or mature as scx_rustland
, it provides a valuable alternative for developers more comfortable with Go. It also serves as a good example of how the sched_ext framework enables experimentation with different scheduling policies and implementations without requiring deep kernel expertise.
Free5GC is an open-source implementation of the 5G Core network, written largely in Go. One of its components is the UPF (User Plane Function), which handles the data plane – i.e., packet forwarding for user traffic (GTP-U tunneling, routing packets between RAN and data network). The UPF and other network functions in Free5GC are user-space processes that can be CPU-intensive, especially under high load (many subscribers or high packet rates). Ensuring low latency and high throughput in the data plane is critical.
Identifying Target Tasks: The scheduler must reliably identify the Free5GC data-plane threads/processes. This could be done by process name matching, by cgroup (if Free5GC runs in a container or a specific slice, the scheduler can detect that cgroup and apply a policy), or by explicit configuration (the operator could pass PIDs or process names to scx_goland at startup to tell it which tasks to prioritize). A sketch of the name-matching option appears after this list.
Prioritizing specific goroutines: Free5GC could be run in a mode where the UPF uses dedicated, pinned OS threads for its packet RX/TX loops (ensuring those threads only do packet work). scx_goland can then target those threads precisely. One caveat is that the Go runtime's scheduling of Free5GC's goroutines onto OS threads is not directly visible to the OS scheduler.
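As a sketch of the name-matching option mentioned above, a hypothetical scx_goland-style BPF hook could route tasks whose comm matches a configured prefix into a dedicated DSQ that is always drained first. The prefix, DSQ ids, and callback names below are assumptions for illustration, not Free5GC or scx_goland specifics.

```c
#define PRIO_DSQ	0	/* data-plane tasks, served first */
#define SHARED_DSQ	1	/* everything else */

/* Hypothetical tunable filled in by the user-space loader at startup (e.g. "upf"). */
const volatile char prio_comm_prefix[8] = "upf";

static __always_inline bool is_prioritized(const struct task_struct *p)
{
	/* Compare the start of the task name against the configured prefix. */
	for (int i = 0; i < sizeof(prio_comm_prefix) && prio_comm_prefix[i]; i++) {
		if (p->comm[i] != prio_comm_prefix[i])
			return false;
	}
	return true;
}

void BPF_STRUCT_OPS(goland_like_enqueue, struct task_struct *p, u64 enq_flags)
{
	scx_bpf_dsq_insert(p, is_prioritized(p) ? PRIO_DSQ : SHARED_DSQ,
			   SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(goland_like_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Drain the data-plane queue before the shared one. */
	if (scx_bpf_dsq_move_to_local(PRIO_DSQ))
		return;
	scx_bpf_dsq_move_to_local(SHARED_DSQ);
}
```

Strict priority like this can starve background tasks under sustained packet load, so a real policy would also want some form of budget or fallback for the shared queue.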
The advantages of SCX (sched_ext
) – runtime pluggability, rapid development, workload-specific optimizations, and crash resilience – make it very attractive for specialized domains. One such domain is the 5G core network. We explored scx_goland
, a Go-based scheduler concept, illustrating how a custom scheduler could be integrated with Free5GC to optimize its performance.
Hello, I'm William Lin. I'd like to share my excitement about being a member of the free5gc project, which is a part of the Linux Foundation. I'm always eager to discuss any aspects of core network development or related technologies.