The Linux kernel's scheduler is one of its most critical components, determining which tasks run on which CPUs and for how long.
But in today's complex computing environments, the default Linux scheduler doesn't always provide optimal performance for specialized workloads. This is where sched_ext (SCX) comes in – a framework that allows implementing custom CPU schedulers in BPF (Berkeley Packet Filter) and loading them dynamically. In this technical analysis, I'll examine the architecture and implementation of SCX schedulers, with a particular focus on scx_rustland and scx_bpfland, comparing them to traditional schedulers.
Linux kernel 6.12 introduced sched_ext (“extensible scheduler”) as a new scheduling class that allows pluggable CPU schedulers via eBPF. It enables implementing and dynamically loading thread schedulers written in BPF, which means that, unlike traditional scheduler modifications that require kernel recompilation and a reboot, a new policy can be loaded and swapped at runtime. sched_ext defines a set of hook points (operations) that an eBPF-based scheduler can implement (such as picking the next task, enqueuing/dequeuing tasks, etc.).
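For orientation, here is a minimal sketch of how such a scheduler wires its callbacks into these hook points. It is illustrative only: it assumes the common BPF headers shipped with the scx tree, shows just two of the available operations, and helper names (e.g. scx_bpf_dsq_insert() versus the older scx_bpf_dispatch()) vary between kernel versions.

```c
#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

/* Hint which CPU the task should run on; keeping the previous CPU preserves cache locality. */
s32 BPF_STRUCT_OPS(minimal_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
{
	return prev_cpu;
}

/* Queue the runnable task on the built-in global DSQ with the default time slice. */
void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
{
	scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

/* Register the callbacks with the sched_ext core as a struct_ops map. */
SEC(".struct_ops.link")
struct sched_ext_ops minimal_ops = {
	.select_cpu	= (void *)minimal_select_cpu,
	.enqueue	= (void *)minimal_enqueue,
	.name		= "minimal",
};
```

A small user-space loader then opens and attaches this struct_ops program; detaching it (or the loader exiting) returns all tasks to the default scheduler.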
The scx project is a collection of sched_ext schedulers and tools. Schedulers in scx range from simple demonstrative policies to production-oriented ones tailored for specific use cases. Each scheduler implements the required sched_ext hooks (via eBPF programs) and can be selected at runtime; the default Linux scheduler can always be restored if needed.
Source: https://www.ebpf.top/post/bpf_sched_ext_dive_into/
I'll provide a deep dive into the end-to-end task flow in sched_ext, examining how tasks move through the scheduling cycle: how tasks join the SCHED_EXT scheduling class, how dispatch queues (DSQs) work, and how each CPU picks its next task.
Tasks are switched to the new sched_ext scheduling class either by being created with the SCHED_EXT policy or through a global switch. Once a task comes under sched_ext management, it is integrated into the BPF scheduler's queues. The sched_ext core calls the BPF scheduler's ops.init_task() callback for each task joining sched_ext, giving the BPF code a chance to initialize per-task state (e.g. tracking virtual runtime). From then on, the task is managed by sched_ext – it's now subject to the BPF scheduling logic rather than the default CFS rules.

sched_ext provides dispatch queues (DSQs): by default there is one global DSQ (SCX_DSQ_GLOBAL) and one local DSQ per CPU (SCX_DSQ_LOCAL).

When a task becomes runnable, the sched_ext core invokes the BPF scheduler's ops.select_cpu() followed by ops.enqueue() callbacks. In ops.enqueue(), the scheduler can call scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags) to place the task directly into the target CPU's local queue, or dispatch it to the global DSQ (SCX_DSQ_GLOBAL) or a custom DSQ.

The actual decision of which task runs next on each CPU is made by the interplay of the ops.dispatch() and (optionally) ops.consume() callbacks, in conjunction with the dispatch queues. The scheduling cycle on a CPU, from a high-level perspective, is as follows:
1. select_cpu(): This callback allows the scheduler to choose an optimal CPU for the task before it is enqueued. For example, a scheduler might want to put a waking task on the same CPU it ran on last (for cache affinity) or find an idle CPU to improve latency.
2. enqueue(): As described above, the BPF scheduler's .enqueue() either dispatches the task to a DSQ (local, global, or custom) or holds it in an internal queue.
3. Local and global DSQ consumption: When a CPU needs its next task to run, it first checks its local DSQ for waiting tasks; if one is found, it is dequeued and runs immediately. If the local DSQ is empty, the CPU can try to consume from the global shared queue.
4. dispatch(): If, after consuming the global queue, the CPU still has no task, the core invokes the BPF scheduler's ops.dispatch() callback. It can call scx_bpf_dispatch() to dispatch tasks to either the requesting CPU's local DSQ, the global DSQ, or a custom DSQ.

| Function | Purpose | Called By |
|---|---|---|
| select_cpu() | Choose target CPU hint | Kernel on task wakeup |
| enqueue() | Place task in queue or DSQ | Kernel after CPU selection |
| dispatch() | Find next task to run on CPU | Kernel when CPU needs work |
| scx_bpf_dsq_insert() | Add task to FIFO DSQ | BPF scheduler |
| scx_bpf_dsq_insert_vtime() | Add task to priority DSQ | BPF scheduler |
| scx_bpf_dsq_move_to_local() | Move task from DSQ to CPU | BPF in dispatch() |
| scx_bpf_kick_cpu() | Wake up idle CPU | BPF scheduler |
| running() | Track task starting on CPU | Kernel when task runs |
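Tying a few of these helpers together, the sketch below (an illustration, not taken from any particular scx scheduler) shows a minimal enqueue()/dispatch() pair built around one custom shared DSQ; SHARED_DSQ is an id chosen for this example and would have to be created in ops.init() via scx_bpf_create_dsq().

```c
#define SHARED_DSQ 0	/* custom DSQ id, created in ops.init() with scx_bpf_create_dsq() */

void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
{
	s32 cpu;

	/* Park the runnable task in the shared FIFO DSQ. */
	scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);

	/* If an allowed CPU is idle, kick it so it runs dispatch() and picks the task up. */
	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
	if (cpu >= 0)
		scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
}

void BPF_STRUCT_OPS(example_dispatch, s32 cpu, struct task_struct *prev)
{
	/* The local DSQ is empty: pull the next waiting task from the shared DSQ. */
	scx_bpf_dsq_move_to_local(SHARED_DSQ);
}
```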
I follow the workflow from a blog post by Andrea Righi, which outlines a great way to test sched_ext without modifying the existing system.
First, start the virtual environment with the sched_ext kernel:
Once inside the virtual environment, you can run one of the schedulers with the helper function:
scx_simple
scx_simple is a minimal scheduler that functions either as a global weighted vtime scheduler (similar to the Completely Fair Scheduler) or as a FIFO scheduler. It's designed primarily to demonstrate basic scheduling concepts.
In the code below, from scx_simple.bpf.c, the .enqueue callback handles a task that needs to be scheduled. If the scheduler is in FIFO mode (fifo_sched == true), it simply inserts the task into the shared dispatch queue without any priority sorting. If in normal (weighted vtime) mode, it retrieves the task's current virtual time (p->scx.dsq_vtime), adjusts it so that no task gains more than one slice worth of idle credit, and inserts the task into the shared queue with that virtual time as the key.
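The following is a simplified sketch of that callback, closely following the upstream scx_simple sources (stats counters are omitted; fifo_sched, SHARED_DSQ and vtime_now are globals defined elsewhere in that file, and older kernels spell the helpers scx_bpf_dispatch()/scx_bpf_dispatch_vtime()):

```c
void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
	if (fifo_sched) {
		/* FIFO mode: append to the shared DSQ, no priority ordering. */
		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
	} else {
		u64 vtime = p->scx.dsq_vtime;

		/*
		 * Cap how much credit an idling task can accumulate to one
		 * slice (the real code uses a wrap-safe vtime comparison).
		 */
		if (vtime < vtime_now - SCX_SLICE_DFL)
			vtime = vtime_now - SCX_SLICE_DFL;

		/* Weighted-vtime mode: the shared DSQ is ordered by vtime. */
		scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
					 enq_flags);
	}
}
```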
scx_rustland
is designed to prioritize interactive workloads over CPU-intensive background workloads.
In practical terms, this likely means it keeps track of each task’s recent CPU usage or blocking behavior and assigns higher priority (sooner scheduling, more CPU time) to tasks that have shorter CPU bursts.
scx_rustland
uses a hybrid approach that splits functionality between kernel space and user space:
BPF Dispatcher (scx_rustland_core): The BPF part of scx_rustland implements the minimal logic required to interface with the kernel. Its enqueue hook, for example, does not directly pick a dispatch queue as a normal scheduler would. Instead, it may place the task into a BPF queue map that represents “tasks waiting for a user-space decision.”
User-Space Scheduler (Rust): On the user side, the Rust scheduler process uses libraries (such as libbpf or Aya) to interact with the eBPF program and attaches to the maps it exposes. Typically, it uses a ring buffer to receive each task that needs scheduling and then runs its algorithm to decide where and when that task should run.
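As a rough illustration of that split, the BPF-side enqueue path of such a design could look like the sketch below. The struct layout and map name are assumptions made for this example, not the actual scx_rustland_core code:

```c
/* Per-task record handed to the user-space scheduler (illustrative layout). */
struct queued_task {
	s32 pid;
	u64 sum_exec_runtime;
	u64 weight;
};

/* Ring buffer polled by the user-space scheduler. */
struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} queued SEC(".maps");

void BPF_STRUCT_OPS(rustland_like_enqueue, struct task_struct *p, u64 enq_flags)
{
	struct queued_task *t;

	t = bpf_ringbuf_reserve(&queued, sizeof(*t), 0);
	if (!t) {
		/* Ring buffer full: fall back to the global DSQ so the task still runs. */
		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
		return;
	}

	/* Export just enough state for the user-space policy to make its decision. */
	t->pid = p->pid;
	t->sum_exec_runtime = p->se.sum_exec_runtime;
	t->weight = p->scx.weight;
	bpf_ringbuf_submit(t, 0);
}
```

The user-space side then decides the order and target CPU for each queued task and hands the result back through another map, which a later BPF dispatch path turns into actual DSQ insertions.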
scx_rustland thus has two main components, with the scheduling logic split between the kernel (BPF) part and the user-space part.
In user-space, scx_rustland
maintains its own runqueue data structure – specifically a BTreeSet of tasks ordered by weighted vruntime (essentially mimicking CFS in user-space). It also monitors each task’s behavior to detect interactive tasks: if a task consistently voluntarily yields CPU (releasing the CPU before its time slice is fully used), it’s considered interactive.
scx_bpfland implements its scheduling logic almost entirely in BPF code that runs in kernel space. It follows a more traditional approach in which all scheduling decisions happen within the kernel: every scheduling callback is implemented directly in BPF. Its internal logic is very similar in spirit to scx_rustland's algorithm, but it executes entirely within the BPF program (in kernel), effectively merging the two-tier logic into one.
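The fragment below is an illustrative sketch of this style, not the actual scx_bpfland code: the interactivity signal, threshold, and DSQ id are assumptions, and it presumes p->scx.dsq_vtime is kept up to date by the ops.running()/ops.stopping() callbacks as in the scx_simple example above.

```c
#define SHARED_DSQ 0	/* vtime-ordered DSQ created in ops.init() */

void BPF_STRUCT_OPS(bpfland_like_enqueue, struct task_struct *p, u64 enq_flags)
{
	u64 vtime = p->scx.dsq_vtime;

	/*
	 * Rough interactivity signal: tasks with many voluntary context
	 * switches tend to sleep before exhausting their slice, so give
	 * them an earlier position in the priority DSQ.
	 */
	if (p->nvcsw > 100 && vtime >= SCX_SLICE_DFL)
		vtime -= SCX_SLICE_DFL;

	scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime, enq_flags);
}
```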
The sched_ext + eBPF approach (exemplified by SCX schedulers) brings several key advantages over traditional in-kernel schedulers: runtime pluggability, rapid development and iteration, workload-specific optimizations, and crash resilience, since a misbehaving BPF scheduler can simply be unloaded and the system falls back to the default scheduler.
The scx_goland_core
project represents another approach to implementing a user-space scheduler for Linux using the sched_ext
framework. This project follows the architectural patterns established by scx_rustland, adapting them to the Go ecosystem.
The Go implementation consists of several key packages and types:
Core Package:
Memory Management:
In scx_goland_core
, tasks from the kernel’s ring buffers are received as byte slices and then decoded (using binary.Read) into Go structs. This decoding and copying process is relatively expensive when compared to Rust’s approach where you can often work in a more zero‑copy manner. In contrast, scx_rustland
benefits from Rust’s zero‑cost abstractions, minimal runtime overhead, and more direct access to low-level memory without a garbage collector.
For example, in Go each record pulled from the ring buffer has to be copied and decoded field by field into a struct (e.g. via binary.Read), whereas in Rust the same bytes can typically be viewed in place as a typed struct without an extra copy.
Garbage Collection:
Go is a managed language with a garbage collector and its own scheduler for goroutines. Even though Go is very efficient, those additional layers (GC, goroutine scheduling, channel operations) add overhead compared to Rust’s zero‑cost abstractions where most things are determined at compile time.
It shows how Go's concurrency model and ease of use can be applied to system programming tasks that were traditionally the domain of C or Rust.
While it's not as optimized or mature as scx_rustland
, it provides a valuable alternative for developers more comfortable with Go. It also serves as a good example of how the sched_ext framework enables experimentation with different scheduling policies and implementations without requiring deep kernel expertise.
Free5GC is an open-source implementation of the 5G Core network, written largely in Go. One of its components is the UPF (User Plane Function), which handles the data plane – i.e., packet forwarding for user traffic (GTP-U tunneling, routing packets between RAN and data network). The UPF and other network functions in Free5GC are user-space processes that can be CPU-intensive, especially under high load (many subscribers or high packet rates). Ensuring low latency and high throughput in the data plane is critical.
Identifying Target Tasks: The scheduler must reliably identify the Free5GC data-plane threads/processes. This could be done by process name matching, by cgroup (if Free5GC runs in a container or a specific slice, the scheduler can detect that cgroup and apply a policy), or by explicit configuration (the operator could pass PIDs or process names to scx_goland at startup to tell it which tasks to prioritize). A sketch of the name-matching option appears after this list.
Prioritizing specific goroutines: Free5GC could be run in a mode where the UPF uses dedicated, pinned OS threads for its packet RX/TX loops (ensuring those threads only do packet work). scx_goland can then target those threads precisely. One caveat is that the Go runtime's scheduling of Free5GC's goroutines onto OS threads is not directly visible to the OS scheduler.
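As a sketch of the name-matching option mentioned above, a hypothetical scx_goland-style BPF hook could route tasks whose comm matches a configured prefix into a dedicated DSQ that is always drained first. The prefix, DSQ ids, and callback names below are assumptions for illustration, not Free5GC or scx_goland specifics.

```c
#define PRIO_DSQ	0	/* data-plane tasks, served first */
#define SHARED_DSQ	1	/* everything else */

/* Hypothetical tunable filled in by the user-space loader at startup (e.g. "upf"). */
const volatile char prio_comm_prefix[8] = "upf";

static __always_inline bool is_prioritized(const struct task_struct *p)
{
	/* Compare the start of the task name against the configured prefix. */
	for (int i = 0; i < sizeof(prio_comm_prefix) && prio_comm_prefix[i]; i++) {
		if (p->comm[i] != prio_comm_prefix[i])
			return false;
	}
	return true;
}

void BPF_STRUCT_OPS(goland_like_enqueue, struct task_struct *p, u64 enq_flags)
{
	scx_bpf_dsq_insert(p, is_prioritized(p) ? PRIO_DSQ : SHARED_DSQ,
			   SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(goland_like_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Drain the data-plane queue before the shared one. */
	if (scx_bpf_dsq_move_to_local(PRIO_DSQ))
		return;
	scx_bpf_dsq_move_to_local(SHARED_DSQ);
}
```

Strict priority like this can starve background tasks under sustained packet load, so a real policy would also want some form of budget or fallback for the shared queue.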
The advantages of SCX (sched_ext
) – runtime pluggability, rapid development, workload-specific optimizations, and crash resilience – make it very attractive for specialized domains. One such domain is the 5G core network. We explored scx_goland
, a Go-based scheduler concept, illustrating how a custom scheduler could be integrated with Free5GC to optimize its performance.
Hello, I'm William Lin. I'd like to share my excitement about being a member of the free5gc project, which is a part of the Linux Foundation. I'm always eager to discuss any aspects of core network development or related technologies.