# Hybrid Round-Robin (RR)/FIFO scheduler

contributed by < [`charliechiou`](https://github.com/charliechiou) > < [`EricccTaiwan`](https://github.com/EricccTaiwan) >

## Linux Environment

```shell
OS: Ubuntu 25.04 x86_64
Kernel: 6.14.0-16-generic
```

## Implementation

> [GitHub branch](https://github.com/cce-underdogs/scx/tree/otteryc_exp)

We are implementing a **hybrid Round-Robin (RR)/FIFO** scheduler based on `scx_rlfifo`, but we are encountering issues related to `slice_ns`. Our scheduling policy is defined as follows:

* $NICE < 0 \to$ FIFO, and tasks are pinned to a single CPU
* $NICE \ge 0 \to$ RR, and tasks are allowed to switch CPUs

> $NICE = 0$ $\Leftrightarrow$ `task.weight` $= 100$

:::warning
Both the RR and FIFO policies use `dispatched_task.slice_ns = 10_000_000 // 10 ms` (code line 17)
:::

:::spoiler Code
```rust=
fn dispatch_tasks(&mut self) {
    while let Ok(Some(task)) = self.bpf.dequeue_task() {
        let mut dispatched_task = DispatchedTask::new(&task);
        let t_weight = task.weight;

        if t_weight > 100 {
            // Nice < 0 => treat as FIFO:
            // limit task migration to the same CPU
            dispatched_task.cpu = task.cpu;
        } else {
            // Nice >= 0 => treat as RR
            let cpu = self.bpf.select_cpu(task.pid, task.cpu, task.flags);
            dispatched_task.cpu = if cpu >= 0 { cpu } else { task.cpu };
        }

        // Assign a 10 ms slice to both policies
        dispatched_task.slice_ns = 10_000_000;
        self.bpf.dispatch_task(&dispatched_task).unwrap();
    }
    // Put the scheduler to sleep until another task needs to run.
    self.bpf.notify_complete(0);
}
```
:::

## Simulation Test w/ stress-ng

```shell
$ # terminal 1
$ sudo ./bin/scx_rlfifo

$ # terminal 2
$ sudo scxtop trace --trace-ms 20000 --output-file test_10ms

$ # terminal 3
$ sudo watch nice -n -5 stress-ng --cpu 1 --timeout 1s
```

### Exp 1

We can see that the Wall duration is 1 second, indicating that the command is functioning correctly. However, the **Average Wall duration is reported as 55.55 ms**, which is unexpected.
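As a sanity check independent of Perfetto's aggregated "Avg Wall duration", per-run durations can be computed directly from switch-in/switch-out timestamps. This is only a sketch with made-up numbers (not data from the trace above) to show the idea:

```shell
# Toy post-processing sketch: each input line is "<timestamp-in-seconds> <event>"
# for one task on one CPU; print how long each in...out run lasted.
# The timestamps below are fabricated for illustration.
awk '
$2 == "in"  { start = $1 }
$2 == "out" { printf "ran %.0f ms\n", ($1 - start) * 1000 }
' <<'EOF'
0.000 in
0.010 out
0.012 in
0.022 out
EOF
```

With a real trace export, the same idea applies to consecutive `sched_switch` timestamps for the `stress-ng` PID: if the assigned slice is honored and the CPU is contended, each printed run should come out close to 10 ms.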
![image](https://hackmd.io/_uploads/r1dFjdwxgx.png)

### Exp 2

The Average Wall duration is reported as 17 ms.

![image](https://hackmd.io/_uploads/rJwTjOvxex.png)

:::danger
Since we are only running a single `stress-ng` task on a single CPU, context switches rarely occur, allowing `stress-ng` to run continuously for a long time.
$\to$ To observe time slices under Round-Robin scheduling, we should run multiple tasks on the same CPU.
:::

## Question

- Is the issue caused by the testing procedure?
    - There is no context switch between the `stress-ng` runs, so Perfetto may merge consecutive runs of the same task into one block, which makes the reported duration larger than the time slice we assigned.
    - Perfetto may lose very short-lived tasks.
    - We may have to spawn more tasks to observe the behavior.
- We are wondering whether the task slice is working properly.

## Clarification

- :::warning
  In Perfetto, the Avg Wall duration refers to the average time a process runs before being interrupted, rather than the average time slice.
  :::
  > got a :+1: from [name=Daniel Hodges]
- If a process runs for 10 consecutive time slices, each 10 ms long, without being interrupted, Perfetto will show a single Wall duration of 100 ms instead of ten separate 10 ms durations.
  > Perfetto traces `sched_switch()` events, but if the same task keeps running on the same CPU, you can't see the time that you assigned in the trace, because no `sched_switch()` event happened [name=Andrea Righi]

![image](https://hackmd.io/_uploads/HJM4RHOxgx.png)
![image](https://hackmd.io/_uploads/HkMp0Bdxge.png)

## What's next

1. > try this to track the actual time slice used by each task: https://github.com/sched-ext/scx/tree/rustland-core-track-time-slice, start the scheduler as normal and look at the trace in `$ cat /sys/kernel/debug/tracing/trace_pipe` [name=Andrea Righi]
2. 
When using RR scheduling on a single CPU with two `stress-ng` tasks running concurrently and a 10 ms time slice assigned, I expect each task's runtime observed between `sched_switch` events to be at most 10 ms.

## PR (maybe)

1. I think it would be more intuitive if the duration were broken down into individual time slices; however, the Wall duration calculation would stay the same.
2. `scxtop` can't trace very short-lived tasks (as someone mentioned earlier, though I forgot who 😅).
3. Add $NICE$ to the task struct?

## 0513

We have tested the default scheduler `scx_simple -f` and `scx_rlfifo` several times, with three `stress-ng` tasks running on the same CPU.

* In `scx_simple -f`, the time slice behaves as expected.
  ![image](https://hackmd.io/_uploads/B1N-TRlZll.png)
* Whereas in `scx_rlfifo`, it varies.
  ![image](https://hackmd.io/_uploads/HJve60lZll.png)

We couldn't figure out the cause.

## Test with bpf_printk

Remember to enable automatic ssh startup first; otherwise the server will be unreachable after it reboots:

```shell
$ sudo systemctl enable ssh   # start automatically on boot
$ sudo systemctl status ssh   # check the service status
$ ssh localhost               # confirm we can connect to ourselves
```

As an example, isolate **CPU#13** to make testing and observation easier:

```diff
$ sudo vim /etc/default/grub
$ # add the line in the diff below
+ GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=13 nohz_full=13 rcu_nocbs=13"
$ sudo update-grub
$ sudo reboot
```

~~Too brute-force~~ Use [cpusets](https://docs.kernel.org/admin-guide/cgroup-v1/cpusets.html) instead.
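A minimal sketch of the cpuset route, which avoids the reboot that `isolcpus` requires. This assumes cgroup v2 mounted at `/sys/fs/cgroup`; the group name `shield` is made up for illustration:

```shell
# Enable the cpuset controller for child groups, then create a group
# pinned to CPU 13 (cgroup v2 assumed; "shield" is a hypothetical name).
echo +cpuset | sudo tee /sys/fs/cgroup/cgroup.subtree_control
sudo mkdir /sys/fs/cgroup/shield
echo 13 | sudo tee /sys/fs/cgroup/shield/cpuset.cpus

# Move the current shell into the group, then launch the test workload
# from it so the workload inherits the CPU restriction.
echo $$ | sudo tee /sys/fs/cgroup/shield/cgroup.procs
```

Note that this only pins tasks in `shield` onto CPU 13; to keep everything else *off* CPU 13 (the shielding that `isolcpus` gives), the remaining tasks must additionally be placed in a sibling cpuset whose `cpuset.cpus` excludes it, as described in the cpusets documentation linked above.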