# EEVDF Patch Notes

This is a note to record my understanding of EEVDF, for the writing of 《Demystifying The Linux CPU Scheduler》.

## The Difference Between CFS and EEVDF

CFS focuses on fairness, while EEVDF also takes latency into consideration. In other words, some tasks are more "latency-sensitive": they require great interactivity, so they need to get the CPU more frequently than other tasks, even if not for more total time. Practically, we first find the eligible tasks by examining their `vruntime`, then pick among them the one with the smallest (earliest) virtual `deadline`. How far the `deadline` advances each time a task runs is controlled by "latency-nice" and the nice value.

## What Is Newly Implemented in `fair.c`

By comparing `fair.c` between Linux v6.5-rc7 and v6.9-rc7, we can inspect what is new in EEVDF.

### `avg_vruntime` and the eligibility infrastructure

These functions are added to maintain the average virtual runtime, decide whether a task is eligible, and serve the `vlag` computation:

```diff
+static void avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static inline void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+u64 avg_vruntime(struct cfs_rq *cfs_rq)
+static s64 entity_lag(u64 avruntime, struct sched_entity *se)
+static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
+int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
```

Define

$$
lag_i = S_i - s_i = w_i \times (V - v_i)
$$

where $S_i$ is the ideal service time and $V$ is its virtual-time counterpart (returned by `avg_vruntime(cfs_rq)`), so $lag_i$ is the difference between the service $task_i$ deserved and the service it really received, and

$$
V = v_{min} + \frac{\sum{(v_t - v_{min}) \cdot w_t}}{\sum{w_t}}
$$

A task is eligible when

$$
lag_i = w_i \times (V - v_i) \ge 0 \implies V \ge v_i
$$

Substituting the definition of $V$ and multiplying both sides by $\sum{w_t}$, we get

$$
task_i \text{ is eligible if } \sum{(v_t - v_{min}) \cdot w_t} \ge (v_i - v_{min}) \cdot \sum{w_t}
$$

where $v_{min}$ is `cfs_rq->min_vruntime`, $v_i$ is `se->vruntime`, and $\sum{w_t}$ is the sum of the weights of the scheduling entities on the `cfs_rq`. (A runnable sketch combining this test with the deadline pick appears at the end of this section.)

### rbtree operations

In CFS, tasks are ordered by `vruntime`:

```c
static inline bool entity_before(const struct sched_entity *a,
				 const struct sched_entity *b)
{
	return (s64)(a->vruntime - b->vruntime) < 0;
}
```

In EEVDF, `deadline` is now used as the ordering key of the rbtree:

```c
static inline bool entity_before(const struct sched_entity *a,
				 const struct sched_entity *b)
{
	/*
	 * Tiebreak on vruntime seems unnecessary since it can
	 * hardly happen.
	 */
	return (s64)(a->deadline - b->deadline) < 0;
}
```

### `update_curr`

In EEVDF, `update_curr` not only updates the virtual runtime; updating the deadline is also necessary to maintain the order of the tasks in the rbtree:

```diff
 static void update_curr(struct cfs_rq *cfs_rq)
 {
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
+	update_deadline(cfs_rq, curr);
 	update_min_vruntime(cfs_rq);
 }
```

### `pick_next_entity`

In CFS, we had to take care of a lot of cases for `cfs_rq->next` and `cfs_rq->last`; the EEVDF version of `pick_next_entity` is much simpler and more elegant:

```c
static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq)
{
	/*
	 * Enabling NEXT_BUDDY will affect latency but not fairness.
	 */
	if (sched_feat(NEXT_BUDDY) &&
	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
		return cfs_rq->next;

	return pick_eevdf(cfs_rq);
}
```
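To tie these pieces together, here is a self-contained user-space sketch of the whole mechanism: incremental maintenance of $\sum{(v_t - v_{min}) \cdot w_t}$, the division-free eligibility test, and the earliest-eligible-deadline pick. `struct entity`, `struct runqueue` and the array scan are my own simplifications; the kernel keeps entities in an augmented rbtree instead.

```c
#include <stdio.h>
#include <stdint.h>

#define MAX_TASKS 8

struct entity {
    int64_t vruntime;   /* v_i */
    int64_t deadline;   /* virtual deadline */
    int64_t weight;     /* w_i */
};

struct runqueue {
    struct entity tasks[MAX_TASKS];
    int nr;
    int64_t min_vruntime;  /* v_min (kept constant in this toy) */
    int64_t avg_vruntime;  /* sum of (v_i - v_min) * w_i, kept incrementally */
    int64_t avg_load;      /* sum of w_i */
};

/* mirrors the idea of avg_vruntime_add(): account the entity into the sums */
static void enqueue(struct runqueue *rq, struct entity e)
{
    rq->avg_vruntime += (e.vruntime - rq->min_vruntime) * e.weight;
    rq->avg_load     += e.weight;
    rq->tasks[rq->nr++] = e;
}

/*
 * mirrors the idea of vruntime_eligible(): v_i <= V without a division,
 * by checking  sum((v_t - v_min) * w_t) >= (v_i - v_min) * sum(w_t)
 */
static int eligible(const struct runqueue *rq, const struct entity *se)
{
    return rq->avg_vruntime >=
           (se->vruntime - rq->min_vruntime) * rq->avg_load;
}

/* mirrors the idea of pick_eevdf(): earliest deadline among eligible tasks */
static struct entity *pick(struct runqueue *rq)
{
    struct entity *best = NULL;

    for (int i = 0; i < rq->nr; i++) {
        if (!eligible(rq, &rq->tasks[i]))
            continue;
        if (!best || rq->tasks[i].deadline < best->deadline)
            best = &rq->tasks[i];
    }
    return best;
}

int main(void)
{
    struct runqueue rq = { .min_vruntime = 100 };

    enqueue(&rq, (struct entity){ .vruntime = 110, .deadline = 140, .weight = 1024 });
    enqueue(&rq, (struct entity){ .vruntime = 130, .deadline = 125, .weight = 1024 });
    enqueue(&rq, (struct entity){ .vruntime = 105, .deadline = 160, .weight = 1024 });

    /* V = 100 + (10 + 30 + 5) / 3 = 115, so the v = 130 task is not
     * eligible and the earliest *eligible* deadline wins. */
    printf("picked deadline = %lld\n", (long long)pick(&rq)->deadline);
    return 0;
}
```

Compiled with `cc sketch.c`, this prints `picked deadline = 140`: the task with the globally earliest deadline (125) is passed over because its `vruntime` lies above the average V = 115.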
## About Kernel Patches

We strongly recommend reading [PatchPhilosophy](https://kernelnewbies.org/PatchPhilosophy), [PatchTipsAndTricks](https://kernelnewbies.org/PatchTipsAndTricks) and [PatchCulture](https://kernelnewbies.org/PatchCulture) before starting. Because EEVDF is still being worked on as of Linux v6.10, we will introduce several patch series for a better understanding of it.

[Linux 核心設計: Scheduler(5): EEVDF 排程器](https://hackmd.io/@RinHizakura/SyG4t5u1a#PATCH-0615-sched-Commit-to-lag-based-placement) does excellent work explaining the patches about EEVDF. [note](https://github.com/rsy56640/triviality/tree/master/content/sched-eevdf) collects many of the related patches and is continuously updated.

## Main Patches 1 of EEVDF

### Introduce EEVDF

EEVDF starts with [[PATCH 00/10] sched: EEVDF using latency-nice](https://lore.kernel.org/lkml/5b567c5aecabf0a89d92593d99a73bef41bd65da.camel@gmx.de/t/); before this, Peter Zijlstra was investigating **latency-nice** related algorithms, and EEVDF debuts here. This series is simpler than the versions we will mention later, so it is a good place to get a first understanding of EEVDF. The end of the cover letter,

> There's definitely more benchmarking/tweaking to be done (0-day already reported a stress-ng loss), but if we can pull this off we can delete a whole much of icky heuristics code. EEVDF is a much better defined policy than what we currently have.

states the superiority of EEVDF.

**K Prateek Nayak**'s [tests](https://lore.kernel.org/lkml/18199afa-16e2-8e76-c96b-9507f92befdc@amd.com/) on a dual socket Zen3 machine (2 x 64C/128T) show no regression for most of the benchmarks, but on some of them (schbench, tbench) "**an increase in number of context switches and the total wait sum**" is observed.

**Shrikanth Hegde** [replied](https://lore.kernel.org/lkml/2fe2fb1c-345f-adca-d201-ed3ed6f418cc@linux.vnet.ibm.com/) to the patch with experiments showing great improvement on a 480 CPU system, but a "**Smaller system was showing regressing results**" compared to v6.2.

---

### Introduce the first implementations

In the following patches of this series, most of the EEVDF implementation is introduced by Peter Zijlstra. Notice that not all of these changes were merged into the kernel. Let's look over them:

* [[PATCH 01/10] sched: Introduce latency-nice as a per-task attribute]() introduces `latency_nice` in `task_struct`
* [[PATCH 02/10] sched/core: Propagate parent tasks latency requirements to the child task]() sets the default `latency_nice` value for `task_struct`
* [[PATCH 03/10] sched: Allow sched_{get,set}attr to change latency_nice of the task]()
* [[PATCH 04/10] sched/fair: Add latency_offset]()
* [[PATCH 05/10] sched/fair: Add sched group latency support]()
* [[PATCH 06/10] sched/fair: Add avg_vruntime]() adds the `avg_vruntime` related implementations:
  1. `avg_vruntime` and `avg_load`
  2. `entity_key()`, `avg_vruntime_add()` and `avg_vruntime_sub()`
  3. `avg_vruntime()` and `avg_vruntime_update()`
  4. `__update_min_vruntime()`
* [[PATCH 07/10] sched/fair: Remove START_DEBIT]()
* [[PATCH 08/10] sched/fair: Add lag based placement]() The patch says

> With the introduction of avg_vruntime, it is possible to approximate lag (the entire purpose of introducing avg_vruntime in fact).
`avg_vruntime` serves for calculating `lag`, which is used by `sched_feat(PRESERVE_LAG)` and `sched_feat(FAIR_SLEEPERS)`; these are replaced by another `lag` mechanism later on, so we skip them here (see [PATCH 03/15] sched/fair: Add lag based placement). `lag` is actually `vlag` in the following versions.

* [[PATCH 09/10] rbtree: Add rb_add_augmented_cached() helper]() adds `rb_add_augmented_cached()` for enqueuing an entity:

```c
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	avg_vruntime_add(cfs_rq, se);
	se->min_vruntime = se->vruntime;
	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
				__entity_less, &min_vruntime_cb);
}
```

* [[PATCH 10/10] sched/fair: Implement an EEVDF like policy]() applies:
  1. `deadline` and `slice`
  2. `entity_eligible`
  3. deadline update
  4. enqueue/dequeue entity and the rbtree related changes
  5. `pick_next_entity` and `pick_eevdf`
  6. `set_slice`; `sched_slice` is removed
* [Discussion](https://lore.kernel.org/lkml/20230307130800.GD2017917@hirez.programming.kicks-ass.net/#t) between Peter Zijlstra and Vincent Guittot. Peter points out the difference between CFS and EEVDF:

> ..., where WFQ is only well defined in how much time is given to any task (bandwidth), but says nothing about how that is distributed in time.

He also states the purpose of the new infrastructure added in EEVDF:

> The avg_vruntime / placement stuff I did is fundamental to how it controls bandwidth distribution and guarantees the WFQ subset. Specifically, by limiting the pick to that subset of tasks that has positive lag (owed time), it guarantees this fairness -- but that means we need a working measure of lag.

---

## Main Patches 2 of EEVDF

[[PATCH 00/17] sched: EEVDF using latency-nice](https://lore.kernel.org/lkml/20230328092622.062917921@infradead.org/T/)

This patch series can be considered a hot-fix of the previous version; it also adds an adaptive placement strategy for smaller slices.

* [[PATCH 06/17] sched/fair: Add lag based placement]()

  This patch tries to preserve the previous lag, so that a task re-enqueued after sleep/migration can be compensated, since the woken task usually needs the CPU urgently. Quoting Peter:

  > Use this to do lag based placement over sleep+wake. Specifically, the FAIR_SLEEPERS thing places things too far to the left and messes up the deadline aspect of EEVDF.

  and

  > We want to ensure the lag after placement is the lag we got before dequeue.

* [[PATCH 10/17] sched/smp: Use lag to simplify cross-runqueue placement]()

  This involves task migration. The most important change: instead of updating `se->vruntime`, update `se->vlag` when dequeuing or detaching an entity, so the value is preserved until `place_entity` runs on another CPU (migration, sleep+wake). Related to [[PATCH 03/15] sched/fair: Add lag based placement](). A toy illustration of the idea follows this list.

* [[PATCH 14/17] sched/eevdf: Better handle mixed slice length]()

  In this patch, Peter tried to introduce `cfs_rq->avg_slice` and a weighted `se->slice`, but this has not been merged so far. Related to the [[PATCH 03/15] sched/fair: Add lag based placement]() and [[PATCH 11/15] sched/eevdf: Better handle mixed slice length]() discussions.
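To see why carrying `se->vlag` (rather than an absolute `se->vruntime`) works across runqueues, here is a toy sketch; `struct se_sketch`, `dequeue()` and `place()` are hypothetical names of mine, not kernel functions:

```c
#include <stdint.h>
#include <stdio.h>

/*
 * An absolute vruntime is meaningless on another CPU's runqueue, but the
 * *lag* (the distance from that queue's average V) can be carried over.
 */
struct se_sketch {
    int64_t vruntime;
    int64_t vlag;
};

static void dequeue(struct se_sketch *se, int64_t V_old)
{
    se->vlag = V_old - se->vruntime;   /* remember owed / over-served time */
}

static void place(struct se_sketch *se, int64_t V_new)
{
    se->vruntime = V_new - se->vlag;   /* same lag relative to the new queue */
}

int main(void)
{
    struct se_sketch se = { .vruntime = 980 };

    dequeue(&se, 1000);     /* V_old = 1000: the task is owed 20 */
    place(&se, 500000);     /* wakes up on a CPU whose V is 500000 */
    printf("vruntime=%lld vlag=%lld\n",
           (long long)se.vruntime, (long long)se.vlag); /* 499980, 20 */
    return 0;
}
```

In the real `place_entity()` the carried lag also has to be scaled, because inserting the entity shifts the queue's average V itself; that calibration is exactly what the next series below refines.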
---

## Main Patches 3 of EEVDF

There are several versions of the EEVDF patches. This series, [[PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr](https://lore.kernel.org/lkml/20230531115839.089944915@infradead.org/T/), adds the fix for tick-preemption and slice heuristics, and also completes the calculation of `se->vlag` using `se->vruntime`.

* [[PATCH 03/15] sched/fair: Add lag based placement]()

  When a task is woken, if we ignore its `se->vlag`, it will not be served immediately, since its `se->vruntime` will be set equal to `avg_vruntime(cfs_rq)`. If we instead take its previous `se->vlag` into account by placing it with `se->vruntime = vruntime - lag;`, the placement itself drags `cfs_rq->avg_vruntime` backwards, so we need to calibrate its `se->vlag`. Put simply, if every newly placed entity kept a large positive `se->vlag`, its `se->vruntime` would be so small that it would occupy the CPU for a long time.

  > If you go to sleep immediately after doing work and happen to do so near the end of a slice (arguably what you _want_ to have happen overall), then you have to pay that negative lag in wakeup latency later, because it is maintained through any amount of sleep.

  Quote #1: a task with a large negative lag (over-served) is encouraged not to sleep and to keep competing for the CPU to earn back service; otherwise it will suffer from its negative lag forever (since "lag is maintained through any amount of sleep").

  > OTOH it should also not be instantly discarded, doing so will allow a task to game the system by purposefully (micro) sleeping at the end of its time quantum.

  Quote #2: on the other hand, if lag were discarded on sleep, a task with a large negative lag could simply discard it by (micro) sleeping at the end of its quantum. So we should still preserve lag across sleep; the solution is [[RFC][PATCH 08/10] sched/fair: Implement delayed dequeue]().

* [Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr](https://lore.kernel.org/lkml/20231005120557.GA743@noisy.programming.kicks-ass.net/)

  This thread argues about whether the eligibility check regresses EEVDF:

  ```
  CFS:                                  168ms
  EEVDF with eligibility:               206ms (regression from CFS)
  EEVDF *without* eligibility:          143ms (improvement to CFS)
  EEVDF *without* eligibility and with
  a 6ms base_slice_ns (was 1.5ms):      104ms (great improvement)
  ```

* [Re: [PATCH] sched/fair: fix pick_eevdf to always find the correct se](https://lore.kernel.org/lkml/20231011131454.GN14330@noisy.programming.kicks-ass.net/)

  This reply claims the "2-step search" (one pass for eligibility, one for the earliest deadline) can be avoided.

---

## Main Patches 4 of EEVDF

The latest patch series about EEVDF is [[RFC][PATCH 00/10] sched/fair: Complete EEVDF](https://lore.kernel.org/lkml/20240405102754.435410987@infradead.org/T/); even though it is still an RFC, it is highly likely to be merged into the mainline. This series does chores like adding comments and re-organizing the code, but one of its most important patches allows applications to set a suggested request/slice length using `sched_attr::sched_runtime` (a usage sketch follows the patch link below).

* [[RFC][PATCH 10/10] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion]()
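`sched_attr` is the argument of the `sched_setattr(2)` syscall. Below is a hedged user-space sketch of how an application might suggest its own slice if this RFC lands as proposed; the exact acceptance and clamping rules belong to whatever version is finally merged, so treat the semantics as an assumption. glibc provides no wrapper, so the struct and the raw syscall are declared by hand.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* subset of the uapi struct; the size field lets the kernel cope with it */
struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;    /* EEVDF: suggested request/slice, in ns */
    uint64_t sched_deadline;
    uint64_t sched_period;
};

int main(void)
{
    struct sched_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.sched_policy = SCHED_OTHER;        /* a normal fair task */
    attr.sched_runtime = 3 * 1000 * 1000;   /* suggest a 3 ms slice */

    if (syscall(SYS_sched_setattr, 0, &attr, 0))
        perror("sched_setattr");
    return 0;
}
```

A smaller request should translate into a shorter `se->slice` and therefore earlier virtual deadlines, at the price of being preempted more often.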
---

### Other Patches

#### RFC

RFC stands for "Request for Comments". If such patches pass the review and are refined as needed, they can eventually be accepted and merged into the main patches. We will pick several RFC patches about EEVDF to see what may come next.

[[RFC][PATCH 12/15] sched: Introduce latency-nice as a per-task attribute](https://lore.kernel.org/lkml/20230531124604.410388887@infradead.org/) constructs the infrastructure for adaptive time slices with latency-nice.

In [[RFC][PATCH 13/15] sched/fair: Implement latency-nice](https://lore.kernel.org/lkml/20230531124604.477939524@infradead.org/), you can see that `se->slice = sysctl_sched_base_slice;` in `update_deadline` is removed; the slice is now calculated from the weight of the corresponding "latency prio":

```diff
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4464,9 +4470,10 @@ static void __sched_fork(unsigned long c
 	p->se.nr_migrations = 0;
 	p->se.vruntime = 0;
 	p->se.vlag = 0;
-	p->se.slice = sysctl_sched_base_slice;
 	INIT_LIST_HEAD(&p->se.group_node);

+	set_latency_prio(p, p->latency_prio);
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq = NULL;
 #endif
@@ -4718,8 +4725,7 @@ int sched_fork(unsigned long clone_flags
 	p->prio = p->normal_prio = p->static_prio;
 	set_load_weight(p, false);
-
-	p->latency_prio = NICE_TO_PRIO(0);
+	set_latency_prio(p, NICE_TO_PRIO(0));

 	/*
 	 * We don't need the reset flag anymore after the fork. It has
@@ -7507,7 +7513,7 @@ static void __setscheduler_latency(struc
 				   const struct sched_attr *attr)
 {
 	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
-		p->latency_prio = NICE_TO_PRIO(attr->sched_latency_nice);
+		set_latency_prio(p, NICE_TO_PRIO(attr->sched_latency_nice));
 }
```

```diff
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -952,6 +952,21 @@ int sched_update_scaling(void)
 }
 #endif

+void set_latency_fair(struct sched_entity *se, int prio)
+{
+	u32 weight = sched_prio_to_weight[prio];
+	u64 base = sysctl_sched_base_slice;
+
+	/*
+	 * For EEVDF the virtual time slope is determined by w_i (iow.
+	 * nice) while the request time r_i is determined by
+	 * latency-nice.
+	 *
+	 * Smaller request gets better latency.
+	 */
+	se->slice = div_u64(base << SCHED_FIXEDPOINT_SHIFT, weight);
+}
+
```
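To get a feel for the numbers in `set_latency_fair()`: assuming the default 0.75 ms `sysctl_sched_base_slice`, `SCHED_FIXEDPOINT_SHIFT = 10`, and the standard `sched_prio_to_weight` table (88761 for nice -20, 1024 for nice 0, 15 for nice 19), the resulting slices span about four orders of magnitude. A quick check of the formula:

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t base = 750000;   /* 0.75 ms base slice, in ns */
    const uint32_t weights[] = { 88761, 1024, 15 };
    const char *names[] = {
        "latency-nice -20", "latency-nice 0", "latency-nice 19"
    };

    /* se->slice = (base << 10) / weight */
    for (int i = 0; i < 3; i++)
        printf("%-17s slice = %llu ns\n", names[i],
               (unsigned long long)((base << 10) / weights[i]));
    return 0;
}
```

This prints roughly 8652 ns, 750000 ns, and 51200000 ns: a task at latency-nice -20 requests the CPU in ~8.7 µs pieces while one at 19 requests it in 51.2 ms pieces. The nice value still decides the total bandwidth; latency-nice only decides how finely that bandwidth is requested, which is exactly the "Smaller request gets better latency" comment in the patch.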