# EEVDF Patch Notes

This is a note to record my understanding of EEVDF, for the writing of 《Demystifying The Linux CPU Scheduler》.

## The Difference Between CFS and EEVDF

CFS focuses on fairness, while EEVDF also takes latency into consideration. In other words, some tasks are more "latency-sensitive": they require great interactivity, so they need to get the CPU more frequently than other tasks, even if not for more total time. Practically, we first find the eligible tasks by examining their `vruntime`, then pick among them the one with the smallest (earliest) virtual `deadline`. How far the `deadline` advances each time a task runs is controlled by "latency-nice" and the nice value.

## What Is Newly Implemented in `fair.c`

By comparing `fair.c` between Linux v6.5-rc7 and v6.9-rc7, we can inspect what is new in EEVDF.

### `avg_vruntime` and the eligibility infrastructure

These functions are added to maintain the average virtual runtime, decide whether a task is eligible, and serve the `vlag` computation:

```diff
+static void avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static inline void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+u64 avg_vruntime(struct cfs_rq *cfs_rq)
+static s64 entity_lag(u64 avruntime, struct sched_entity *se)
+static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
+int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
```

Define

$$
lag_i = S_i - s_i = w_i \times (V - v_i)
$$

where $S_i$ is the ideal service time and $V$ is its virtual-time counterpart (returned by `avg_vruntime(cfs_rq)`), so $lag_i$ is the difference between the service $task_i$ deserved and the service it really received, and

$$
V = v_{min} + \frac{\sum{(v_t - v_{min}) \cdot w_t}}{\sum{w_t}}
$$

A task is eligible when

$$
lag_i = w_i \times (V - v_i) \ge 0 \implies V \ge v_i
$$

Substituting the definition of $V$ and multiplying both sides by $\sum{w_t}$, we get

$$
task_i \text{ is eligible if } \sum{(v_t - v_{min}) \cdot w_t} \ge (v_i - v_{min}) \cdot \sum{w_t}
$$

where $v_{min}$ is `cfs_rq->min_vruntime`, $v_i$ is `se->vruntime`, and $\sum{w_t}$ is the sum of the weights of the scheduling entities on the `cfs_rq`. (A runnable sketch combining this test with the deadline pick appears at the end of this section.)

### rbtree operations

In CFS, tasks are ordered by `vruntime`:

```c
static inline bool entity_before(const struct sched_entity *a,
				 const struct sched_entity *b)
{
	return (s64)(a->vruntime - b->vruntime) < 0;
}
```

In EEVDF, `deadline` is now used as the ordering key of the rbtree:

```c
static inline bool entity_before(const struct sched_entity *a,
				 const struct sched_entity *b)
{
	/*
	 * Tiebreak on vruntime seems unnecessary since it can
	 * hardly happen.
	 */
	return (s64)(a->deadline - b->deadline) < 0;
}
```

### `update_curr`

In EEVDF, `update_curr` not only updates the virtual runtime; updating the deadline is also necessary to maintain the order of the tasks in the rbtree:

```diff
 static void update_curr(struct cfs_rq *cfs_rq)
 {
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
+	update_deadline(cfs_rq, curr);
 	update_min_vruntime(cfs_rq);
 }
```

### `pick_next_entity`

In CFS, we had to take care of a lot of cases for `cfs_rq->next` and `cfs_rq->last`; the EEVDF version of `pick_next_entity` is much simpler and more elegant:

```c
static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq)
{
	/*
	 * Enabling NEXT_BUDDY will affect latency but not fairness.
	 */
	if (sched_feat(NEXT_BUDDY) &&
	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
		return cfs_rq->next;

	return pick_eevdf(cfs_rq);
}
```
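To tie these pieces together, here is a self-contained user-space sketch of the whole mechanism: incremental maintenance of $\sum{(v_t - v_{min}) \cdot w_t}$, the division-free eligibility test, and the earliest-eligible-deadline pick. `struct entity`, `struct runqueue` and the array scan are my own simplifications; the kernel keeps entities in an augmented rbtree instead.

```c
#include <stdio.h>
#include <stdint.h>

#define MAX_TASKS 8

struct entity {
    int64_t vruntime;   /* v_i */
    int64_t deadline;   /* virtual deadline */
    int64_t weight;     /* w_i */
};

struct runqueue {
    struct entity tasks[MAX_TASKS];
    int nr;
    int64_t min_vruntime;  /* v_min (kept constant in this toy) */
    int64_t avg_vruntime;  /* sum of (v_i - v_min) * w_i, kept incrementally */
    int64_t avg_load;      /* sum of w_i */
};

/* mirrors the idea of avg_vruntime_add(): account the entity into the sums */
static void enqueue(struct runqueue *rq, struct entity e)
{
    rq->avg_vruntime += (e.vruntime - rq->min_vruntime) * e.weight;
    rq->avg_load     += e.weight;
    rq->tasks[rq->nr++] = e;
}

/*
 * mirrors the idea of vruntime_eligible(): v_i <= V without a division,
 * by checking  sum((v_t - v_min) * w_t) >= (v_i - v_min) * sum(w_t)
 */
static int eligible(const struct runqueue *rq, const struct entity *se)
{
    return rq->avg_vruntime >=
           (se->vruntime - rq->min_vruntime) * rq->avg_load;
}

/* mirrors the idea of pick_eevdf(): earliest deadline among eligible tasks */
static struct entity *pick(struct runqueue *rq)
{
    struct entity *best = NULL;

    for (int i = 0; i < rq->nr; i++) {
        if (!eligible(rq, &rq->tasks[i]))
            continue;
        if (!best || rq->tasks[i].deadline < best->deadline)
            best = &rq->tasks[i];
    }
    return best;
}

int main(void)
{
    struct runqueue rq = { .min_vruntime = 100 };

    enqueue(&rq, (struct entity){ .vruntime = 110, .deadline = 140, .weight = 1024 });
    enqueue(&rq, (struct entity){ .vruntime = 130, .deadline = 125, .weight = 1024 });
    enqueue(&rq, (struct entity){ .vruntime = 105, .deadline = 160, .weight = 1024 });

    /* V = 100 + (10 + 30 + 5) / 3 = 115, so the v = 130 task is not
     * eligible and the earliest *eligible* deadline wins. */
    printf("picked deadline = %lld\n", (long long)pick(&rq)->deadline);
    return 0;
}
```

Compiled with `cc sketch.c`, this prints `picked deadline = 140`: the task with the globally earliest deadline (125) is passed over because its `vruntime` lies above the average V = 115.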
## About Kernel Patches

We strongly recommend reading [PatchPhilosophy](https://kernelnewbies.org/PatchPhilosophy), [PatchTipsAndTricks](https://kernelnewbies.org/PatchTipsAndTricks) and [PatchCulture](https://kernelnewbies.org/PatchCulture) before starting. Because EEVDF is still being worked on as of Linux v6.10, we will introduce several patch series for a better understanding of it.

[Linux 核心設計: Scheduler(5): EEVDF 排程器](https://hackmd.io/@RinHizakura/SyG4t5u1a#PATCH-0615-sched-Commit-to-lag-based-placement) does excellent work explaining the patches about EEVDF. [note](https://github.com/rsy56640/triviality/tree/master/content/sched-eevdf) collects many of the related patches and is continuously updated.

## Main Patches 1 of EEVDF

### Introduce EEVDF

EEVDF starts with [[PATCH 00/10] sched: EEVDF using latency-nice](https://lore.kernel.org/lkml/5b567c5aecabf0a89d92593d99a73bef41bd65da.camel@gmx.de/t/); before this, Peter Zijlstra was investigating **latency-nice** related algorithms, and EEVDF debuts here. This series is simpler than the versions we will mention later, so it is a good place to get a first understanding of EEVDF. The end of the cover letter,

> There's definitely more benchmarking/tweaking to be done (0-day already reported a stress-ng loss), but if we can pull this off we can delete a whole much of icky heuristics code. EEVDF is a much better defined policy than what we currently have.

states the superiority of EEVDF.

**K Prateek Nayak**'s [tests](https://lore.kernel.org/lkml/18199afa-16e2-8e76-c96b-9507f92befdc@amd.com/) on a dual socket Zen3 machine (2 x 64C/128T) show no regression for most of the benchmarks, but on some of them (schbench, tbench) "**an increase in number of context switches and the total wait sum**" is observed.

**Shrikanth Hegde** [replied](https://lore.kernel.org/lkml/2fe2fb1c-345f-adca-d201-ed3ed6f418cc@linux.vnet.ibm.com/) to the patch with experiments showing great improvement on a 480 CPU system, but a "**Smaller system was showing regressing results**" compared to v6.2.

---

### Introduce the first implementations

In the following patches of this series, most of the EEVDF implementation is introduced by Peter Zijlstra. Notice that not all of these changes were merged into the kernel. Let's look over them:

* [[PATCH 01/10] sched: Introduce latency-nice as a per-task attribute]() introduces `latency_nice` in `task_struct`
* [[PATCH 02/10] sched/core: Propagate parent tasks latency requirements to the child task]() sets the default `latency_nice` value for `task_struct`
* [[PATCH 03/10] sched: Allow sched_{get,set}attr to change latency_nice of the task]()
* [[PATCH 04/10] sched/fair: Add latency_offset]()
* [[PATCH 05/10] sched/fair: Add sched group latency support]()
* [[PATCH 06/10] sched/fair: Add avg_vruntime]() adds the `avg_vruntime` related implementations:
  1. `avg_vruntime` and `avg_load`
  2. `entity_key()`, `avg_vruntime_add()` and `avg_vruntime_sub()`
  3. `avg_vruntime()` and `avg_vruntime_update()`
  4. `__update_min_vruntime()`
* [[PATCH 07/10] sched/fair: Remove START_DEBIT]()
* [[PATCH 08/10] sched/fair: Add lag based placement]() The patch says

> With the introduction of avg_vruntime, it is possible to approximate lag (the entire purpose of introducing avg_vruntime in fact).
`avg_vruntime` serves for calculating `lag`, which is used by `sched_feat(PRESERVE_LAG)` and `sched_feat(FAIR_SLEEPERS)`; these are replaced by another `lag` mechanism later on, so we skip them here (see [PATCH 03/15] sched/fair: Add lag based placement). `lag` is actually `vlag` in the following versions.

* [[PATCH 09/10] rbtree: Add rb_add_augmented_cached() helper]() adds `rb_add_augmented_cached()` for enqueuing an entity:

```c
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	avg_vruntime_add(cfs_rq, se);
	se->min_vruntime = se->vruntime;
	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
				__entity_less, &min_vruntime_cb);
}
```

* [[PATCH 10/10] sched/fair: Implement an EEVDF like policy]() applies:
  1. `deadline` and `slice`
  2. `entity_eligible`
  3. deadline update
  4. enqueue/dequeue entity and the rbtree related changes
  5. `pick_next_entity` and `pick_eevdf`
  6. `set_slice`; `sched_slice` is removed
* [Discussion](https://lore.kernel.org/lkml/20230307130800.GD2017917@hirez.programming.kicks-ass.net/#t) between Peter Zijlstra and Vincent Guittot. Peter points out the difference between CFS and EEVDF:

> ..., where WFQ is only well defined in how much time is given to any task (bandwidth), but says nothing about how that is distributed in time.

He also states the purpose of the new infrastructure added in EEVDF:

> The avg_vruntime / placement stuff I did is fundamental to how it controls bandwidth distribution and guarantees the WFQ subset. Specifically, by limiting the pick to that subset of tasks that has positive lag (owed time), it guarantees this fairness -- but that means we need a working measure of lag.

---

## Main Patches 2 of EEVDF

[[PATCH 00/17] sched: EEVDF using latency-nice](https://lore.kernel.org/lkml/20230328092622.062917921@infradead.org/T/)

This patch series can be considered a hot-fix of the previous version; it also adds an adaptive placement strategy for smaller slices.

* [[PATCH 06/17] sched/fair: Add lag based placement]()

  This patch tries to preserve the previous lag, so that a task re-enqueued after sleep/migration can be compensated, since the woken task usually needs the CPU urgently. Quoting Peter:

  > Use this to do lag based placement over sleep+wake. Specifically, the FAIR_SLEEPERS thing places things too far to the left and messes up the deadline aspect of EEVDF.

  and

  > We want to ensure the lag after placement is the lag we got before dequeue.

* [[PATCH 10/17] sched/smp: Use lag to simplify cross-runqueue placement]()

  This involves task migration. The most important change: instead of updating `se->vruntime`, update `se->vlag` when dequeuing or detaching an entity, so the value is preserved until `place_entity` runs on another CPU (migration, sleep+wake). Related to [[PATCH 03/15] sched/fair: Add lag based placement](). A toy illustration of the idea follows this list.

* [[PATCH 14/17] sched/eevdf: Better handle mixed slice length]()

  In this patch, Peter tried to introduce `cfs_rq->avg_slice` and a weighted `se->slice`, but this has not been merged so far. Related to the [[PATCH 03/15] sched/fair: Add lag based placement]() and [[PATCH 11/15] sched/eevdf: Better handle mixed slice length]() discussions.
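To see why carrying `se->vlag` (rather than an absolute `se->vruntime`) works across runqueues, here is a toy sketch; `struct se_sketch`, `dequeue()` and `place()` are hypothetical names of mine, not kernel functions:

```c
#include <stdint.h>
#include <stdio.h>

/*
 * An absolute vruntime is meaningless on another CPU's runqueue, but the
 * *lag* (the distance from that queue's average V) can be carried over.
 */
struct se_sketch {
    int64_t vruntime;
    int64_t vlag;
};

static void dequeue(struct se_sketch *se, int64_t V_old)
{
    se->vlag = V_old - se->vruntime;   /* remember owed / over-served time */
}

static void place(struct se_sketch *se, int64_t V_new)
{
    se->vruntime = V_new - se->vlag;   /* same lag relative to the new queue */
}

int main(void)
{
    struct se_sketch se = { .vruntime = 980 };

    dequeue(&se, 1000);     /* V_old = 1000: the task is owed 20 */
    place(&se, 500000);     /* wakes up on a CPU whose V is 500000 */
    printf("vruntime=%lld vlag=%lld\n",
           (long long)se.vruntime, (long long)se.vlag); /* 499980, 20 */
    return 0;
}
```

In the real `place_entity()` the carried lag also has to be scaled, because inserting the entity shifts the queue's average V itself; that calibration is exactly what the next series below refines.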
---

## Main Patches 3 of EEVDF

There are several versions of the EEVDF patches. This series, [[PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr](https://lore.kernel.org/lkml/20230531115839.089944915@infradead.org/T/), adds the fix for tick-preemption and slice heuristics, and also completes the calculation of `se->vlag` using `se->vruntime`.

* [[PATCH 03/15] sched/fair: Add lag based placement]()

  When a task is woken, if we ignore its `se->vlag`, it will not be served immediately, since its `se->vruntime` will be set equal to `avg_vruntime(cfs_rq)`. If we instead take its previous `se->vlag` into account by placing it with `se->vruntime = vruntime - lag;`, the placement itself drags `cfs_rq->avg_vruntime` backwards, so we need to calibrate its `se->vlag`. Put simply, if every newly placed entity kept a large positive `se->vlag`, its `se->vruntime` would be so small that it would occupy the CPU for a long time.

  > If you go to sleep immediately after doing work and happen to do so near the end of a slice (arguably what you _want_ to have happen overall), then you have to pay that negative lag in wakeup latency later, because it is maintained through any amount of sleep.

  Quote #1: a task with a large negative lag (over-served) is encouraged not to sleep and to keep competing for the CPU to earn back service; otherwise it will suffer from its negative lag forever (since "lag is maintained through any amount of sleep").

  > OTOH it should also not be instantly discarded, doing so will allow a task to game the system by purposefully (micro) sleeping at the end of its time quantum.

  Quote #2: on the other hand, if lag were discarded on sleep, a task with a large negative lag could simply discard it by (micro) sleeping at the end of its quantum. So we should still preserve lag across sleep; the solution is [[RFC][PATCH 08/10] sched/fair: Implement delayed dequeue]().

* [Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr](https://lore.kernel.org/lkml/20231005120557.GA743@noisy.programming.kicks-ass.net/)

  This thread argues about whether the eligibility check regresses EEVDF:

  ```
  CFS:                                  168ms
  EEVDF with eligibility:               206ms (regression from CFS)
  EEVDF *without* eligibility:          143ms (improvement to CFS)
  EEVDF *without* eligibility and with
  a 6ms base_slice_ns (was 1.5ms):      104ms (great improvement)
  ```

* [Re: [PATCH] sched/fair: fix pick_eevdf to always find the correct se](https://lore.kernel.org/lkml/20231011131454.GN14330@noisy.programming.kicks-ass.net/)

  This reply claims the "2-step search" (one pass for eligibility, one for the earliest deadline) can be avoided.

---

## Main Patches 4 of EEVDF

The latest patch series about EEVDF is [[RFC][PATCH 00/10] sched/fair: Complete EEVDF](https://lore.kernel.org/lkml/20240405102754.435410987@infradead.org/T/); even though it is still an RFC, it is highly likely to be merged into the mainline. This series does chores like adding comments and re-organizing the code, but one of its most important patches allows applications to set a suggested request/slice length using `sched_attr::sched_runtime` (a usage sketch follows the patch link below).

* [[RFC][PATCH 10/10] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion]()
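`sched_attr` is the argument of the `sched_setattr(2)` syscall. Below is a hedged user-space sketch of how an application might suggest its own slice if this RFC lands as proposed; the exact acceptance and clamping rules belong to whatever version is finally merged, so treat the semantics as an assumption. glibc provides no wrapper, so the struct and the raw syscall are declared by hand.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* subset of the uapi struct; the size field lets the kernel cope with it */
struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;    /* EEVDF: suggested request/slice, in ns */
    uint64_t sched_deadline;
    uint64_t sched_period;
};

int main(void)
{
    struct sched_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.sched_policy = SCHED_OTHER;        /* a normal fair task */
    attr.sched_runtime = 3 * 1000 * 1000;   /* suggest a 3 ms slice */

    if (syscall(SYS_sched_setattr, 0, &attr, 0))
        perror("sched_setattr");
    return 0;
}
```

A smaller request should translate into a shorter `se->slice` and therefore earlier virtual deadlines, at the price of being preempted more often.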
---

### Other Patches

#### RFC

RFC stands for "Request for Comments". If such patches pass the review and are refined as needed, they can eventually be accepted and merged into the main patches. We will pick several RFC patches about EEVDF to see what may come next.

[[RFC][PATCH 12/15] sched: Introduce latency-nice as a per-task attribute](https://lore.kernel.org/lkml/20230531124604.410388887@infradead.org/) constructs the infrastructure for adaptive time slices with latency-nice.

In [[RFC][PATCH 13/15] sched/fair: Implement latency-nice](https://lore.kernel.org/lkml/20230531124604.477939524@infradead.org/), you can see that `se->slice = sysctl_sched_base_slice;` in `update_deadline` is removed; the slice is now calculated from the weight of the corresponding "latency prio":

```diff
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4464,9 +4470,10 @@ static void __sched_fork(unsigned long c
 	p->se.nr_migrations = 0;
 	p->se.vruntime = 0;
 	p->se.vlag = 0;
-	p->se.slice = sysctl_sched_base_slice;
 	INIT_LIST_HEAD(&p->se.group_node);

+	set_latency_prio(p, p->latency_prio);
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq = NULL;
 #endif
@@ -4718,8 +4725,7 @@ int sched_fork(unsigned long clone_flags
 	p->prio = p->normal_prio = p->static_prio;
 	set_load_weight(p, false);
-
-	p->latency_prio = NICE_TO_PRIO(0);
+	set_latency_prio(p, NICE_TO_PRIO(0));

 	/*
 	 * We don't need the reset flag anymore after the fork. It has
@@ -7507,7 +7513,7 @@ static void __setscheduler_latency(struc
 				   const struct sched_attr *attr)
 {
 	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
-		p->latency_prio = NICE_TO_PRIO(attr->sched_latency_nice);
+		set_latency_prio(p, NICE_TO_PRIO(attr->sched_latency_nice));
 }
```

```diff
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -952,6 +952,21 @@ int sched_update_scaling(void)
 }
 #endif

+void set_latency_fair(struct sched_entity *se, int prio)
+{
+	u32 weight = sched_prio_to_weight[prio];
+	u64 base = sysctl_sched_base_slice;
+
+	/*
+	 * For EEVDF the virtual time slope is determined by w_i (iow.
+	 * nice) while the request time r_i is determined by
+	 * latency-nice.
+	 *
+	 * Smaller request gets better latency.
+	 */
+	se->slice = div_u64(base << SCHED_FIXEDPOINT_SHIFT, weight);
+}
+
```
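To get a feel for the numbers in `set_latency_fair()`: assuming the default 0.75 ms `sysctl_sched_base_slice`, `SCHED_FIXEDPOINT_SHIFT = 10`, and the standard `sched_prio_to_weight` table (88761 for nice -20, 1024 for nice 0, 15 for nice 19), the resulting slices span about four orders of magnitude. A quick check of the formula:

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t base = 750000;   /* 0.75 ms base slice, in ns */
    const uint32_t weights[] = { 88761, 1024, 15 };
    const char *names[] = {
        "latency-nice -20", "latency-nice 0", "latency-nice 19"
    };

    /* se->slice = (base << 10) / weight */
    for (int i = 0; i < 3; i++)
        printf("%-17s slice = %llu ns\n", names[i],
               (unsigned long long)((base << 10) / weights[i]));
    return 0;
}
```

This prints roughly 8652 ns, 750000 ns, and 51200000 ns: a task at latency-nice -20 requests the CPU in ~8.7 µs pieces while one at 19 requests it in 51.2 ms pieces. The nice value still decides the total bandwidth; latency-nice only decides how finely that bandwidth is requested, which is exactly the "Smaller request gets better latency" comment in the patch.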