EEVDF Patch Notes

This is a note recording my understanding of EEVDF, written while working on 《Demystifying The Linux CPU Scheduler》.

The Difference Between CFS and EEVDF

CFS focuses on fairness; EEVDF also takes latency into consideration.

In other words, latency-sensitive tasks (those requiring high interactivity) need to get the CPU more frequently than other tasks, even if not for more total time.

Practically, we first find the eligible tasks by examining their vruntime, then pick the one with the smallest virtual deadline.

We can control how much the deadline advances each time a task runs via "latency-nice" and the usual nice values.
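
Concretely, whenever a task exhausts its request it receives a new virtual deadline; the formula (from the comments in fair.c) is

$$vd_i = v_i + \frac{r_i}{w_i}$$

where $v_i$ is the task's vruntime, $r_i$ its request (slice) length, and $w_i$ its weight. Latency-nice shrinks $r_i$ while nice raises $w_i$; either way the deadline moves closer.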

What is newly implemented in fair.c

By diffing fair.c between Linux v6.5-rc7 and v6.9-rc7, we can inspect what is new in EEVDF.

avg_vruntime and eligibility infrastructure

These functions are added to track the average vruntime, maintain vlag, and decide whether a task is eligible:

+static void avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static inline void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+u64 avg_vruntime(struct cfs_rq *cfs_rq)
+static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static s64 entity_lag(u64 avruntime, struct sched_entity *se)
+static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
+int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
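
Underneath, these helpers maintain a weighted sum keyed against min_vruntime so the numbers stay small (condensed from fair.c; the divide-with-remainder details of avg_vruntime() are omitted):

static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	return (s64)(se->vruntime - cfs_rq->min_vruntime);	/* v_i - v_min */
}

static void avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	unsigned long weight = scale_load_down(se->load.weight);
	s64 key = entity_key(cfs_rq, se);

	cfs_rq->avg_vruntime += key * weight;	/* \Sum w_t * (v_t - v_min) */
	cfs_rq->avg_load += weight;		/* \Sum w_t */
}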

Define

$$lag_i = S_i - s_i = w_i \times (V - v_i)$$

where $S_i$ is the ideal service time and $V$ is its virtual time counterpart (cfs_rq->avg_vruntime), so $lag_i$ is the difference between what $task_i$ deserved and what it actually received, and

$$V = \frac{\sum_t w_t (v_t - v_{min})}{\sum_t w_t} + v_{min}$$

A task is eligible when

$$lag_i = w_i \times (V - v_i) \ge 0 \iff V \ge v_i$$

Then we have: $task_i$ is eligible if

$$\sum_t w_t (v_t - v_{min}) \ge (v_i - v_{min}) \sum_t w_t$$

where $v_{min}$ is cfs_rq->min_vruntime, $v_i$ is se->vruntime, and $w_t$ ranges over the weights of the entities on the cfs_rq (so $\sum_t w_t$ is cfs_rq->avg_load).
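
The kernel evaluates this comparison without any division, by cross-multiplying with the cached sums (condensed from v6.9 fair.c; curr is added back in because it is kept out of the rbtree's sums):

static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
{
	struct sched_entity *curr = cfs_rq->curr;
	s64 avg = cfs_rq->avg_vruntime;
	long load = cfs_rq->avg_load;

	if (curr && curr->on_rq) {
		unsigned long weight = scale_load_down(curr->load.weight);

		avg += entity_key(cfs_rq, curr) * weight;
		load += weight;
	}

	/* v_i is eligible iff \Sum w_t*(v_t - v_min) >= (v_i - v_min) * \Sum w_t */
	return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
}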

rbtree operations

In CFS, tasks are ordered in the rbtree by vruntime:

static inline bool entity_before(const struct sched_entity *a,
             const struct sched_entity *b)
{
    return (s64)(a->vruntime - b->vruntime) < 0;
}

In EEVDF, the virtual deadline now determines the rbtree order:

static inline bool entity_before(const struct sched_entity *a,
             const struct sched_entity *b)
{
    /*
     * Tiebreak on vruntime seems unnecessary since it can
     * hardly happen.
     */
    return (s64)(a->deadline - b->deadline) < 0;
}

update_curr

In EEVDF, update_curr() not only updates the virtual runtime; updating the deadline is also necessary to maintain the order of tasks in the rbtree (condensed view):

static void update_curr(struct cfs_rq *cfs_rq)
{
    /* ... delta_exec accounting elided ... */
    curr->vruntime += calc_delta_fair(delta_exec, curr);
+   update_deadline(cfs_rq, curr);
    update_min_vruntime(cfs_rq);
}
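
update_deadline() itself is short; roughly (paraphrasing v6.9 fair.c, not verbatim):

static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/* Nothing to do until the current request is fully consumed. */
	if ((s64)(se->vruntime - se->deadline) < 0)
		return;

	/* The slope is set by w_i (nice); the request r_i by the base slice. */
	se->slice = sysctl_sched_base_slice;

	/* EEVDF: vd_i = ve_i + r_i / w_i */
	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);

	/* The request is consumed; reschedule if anyone else is waiting. */
	if (cfs_rq->nr_running > 1) {
		resched_curr(rq_of(cfs_rq));
		clear_buddies(cfs_rq, se);
	}
}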

pick_next_entity

In CFS, pick_next_entity() had to handle many cases around cfs_rq->next and cfs_rq->last; the EEVDF version is simpler and more elegant:

static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq)
{
    /*
     * Enabling NEXT_BUDDY will affect latency but not fairness.
     */
    if (sched_feat(NEXT_BUDDY) &&
        cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
        return cfs_rq->next;

    return pick_eevdf(cfs_rq);
}
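
Since pick_eevdf() hides the actual policy, here is a hypothetical O(n) illustration of what it computes: among eligible entities, take the earliest virtual deadline. The iterator below is made up for illustration; the real pick_eevdf() gets the same answer in O(log n) by walking an rbtree augmented with each subtree's minimum vruntime.

static struct sched_entity *pick_eevdf_linear(struct cfs_rq *cfs_rq)
{
	struct sched_entity *se, *best = NULL;

	for_each_runnable_entity(cfs_rq, se) {		/* hypothetical iterator */
		if (!entity_eligible(cfs_rq, se))	/* need v_i <= V */
			continue;
		if (!best || entity_before(se, best))	/* earliest deadline */
			best = se;
	}

	return best;
}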

About Kernel Patches

I strongly recommend reading PatchPhilosophy, PatchTipsAndTricks, and PatchCulture before starting.

Because EEVDF is still being worked on as of Linux v6.10, we will walk through several patches to understand it better.

Linux 核心設計: Scheduler(5): EEVDF 排程器 does excellent work explaining the EEVDF patches.

This note collects many of the related patches and continues to be updated.

Main Patches 1 of EEVDF

Introduce EEVDF

The starting point of EEVDF is [PATCH 00/10] sched: EEVDF using latency-nice. Before this, Peter Zijlstra had been investigating latency-nice related algorithms; EEVDF makes its debut here.

This series is simpler than the versions mentioned later, so it is a good place to get a basic understanding of EEVDF.

At the end of this cover letter, Peter writes:

There's definitely more benchmarking/tweaking to be done (0-day already reported a stress-ng loss), but if we can pull this off we can delete a whole much of icky heuristics code. EEVDF is a much better defined policy than what we currently have.

asserting the superiority of EEVDF over the existing heuristics.

K Prateek Nayak tested on a dual-socket Zen3 machine (2 x 64C/128T); most of the tests showed no regression, but for some (schbench, tbench) he observed "an increase in number of context switches and the total wait sum".

Shrikanth Hegde replied to the patch with experiments showing great improvement on a 480-CPU system, but "Smaller system was showing regressing results" compared to v6.2.


Introduce the first implementations

In the following patches of this series, most of the EEVDF implementation is introduced by Peter Zijlstra; note that not all of these changes were merged into the kernel. We look them over in the sections below.


Main Patches 2 of EEVDF

[PATCH 00/17] sched: EEVDF using latency-nice

This patch series can be considered a hot-fix of the previous version; it also adds an adaptive placement strategy for smaller slices.

  • [PATCH 06/17] sched/fair: Add lag based placement
    This patch tries to preserve a task's previous lag, so that a task re-enqueued after sleep/migration can be compensated (a minimal sketch follows this list). Quoting Peter:

    Use this to do lag based placement over sleep+wake. Specifically, the FAIR_SLEEPERS thing places things too far to the left and messes up the deadline aspect of EEVDF.

    (A freshly woken task often needs the CPU urgently.)

    and

    We want to ensure the lag after placement is the lag we got before dequeue.

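In its simplest form, the placement this patch introduces looks like the following sketch (my paraphrase, not verbatim kernel code; it ignores the weight correction derived under Main Patches 3 below):

/* Sketch: re-insert the woken entity so it resumes with its saved lag. */
static void place_entity_sketch(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	u64 vruntime = avg_vruntime(cfs_rq);	/* V: the zero-lag point */
	s64 lag = se->vlag;			/* saved V - v_i from dequeue time */

	se->vruntime = vruntime - lag;
}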


Main Patches 3 of EEVDF

There are several versions of the EEVDF patches. This series, [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr, adds fixes for tick preemption and slice heuristics, and completes the calculation of se->vlag using se->vruntime.

  • [PATCH 03/15] sched/fair: Add lag based placement
    When a task is woken, if we ignore its se->vlag, it will not be served immediately, since its se->vruntime will be set equal to avg_vruntime(cfs_rq) (zero lag).

    If we instead honor its previous se->vlag, setting se->vruntime = vruntime - lag; drags cfs_rq->avg_vruntime backward, so the saved lag has to be calibrated (see the derivation after this list).

    Put simply: if every newly placed entity kept a large positive se->vlag uncorrected, its se->vruntime would end up so small that it could occupy the CPU for a long time.

    If you go to sleep immediately after doing work and happen to do so near the end of a slice (arguably what you want to have happen overall), then you have to pay that negative lag in wakeup latency later, because it is maintained through any amount of sleep.

    If a task has a large negative lag (it is over-served), argument #1 encourages it not to sleep but to keep competing for the CPU to work its lag back up; otherwise it suffers from its negative lag forever (since lag "is maintained through any amount of sleep").

    OTOH it should also not be instantly discarded, doing so will allow a
    task to game the system by purposefully (micro) sleeping at the end of
    its time quantum.

    But the lag should also not simply be discarded: per argument #2, a task with a large negative lag could then shed it just by (micro-)sleeping at the end of its quantum, gaming the system.

    The solution is [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue.

  • Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
    This reply argues about whether the eligibility check regresses EEVDF; the reported numbers:

    CFS: 168ms
    EEVDF with eligibility: 206ms (regression from CFS)
    EEVDF *without* eligibility: 143ms (improvement to CFS)
    EEVDF *without* eligibility and with a 6ms base_slice_ns (was 1.5ms): 104ms (great improvement)
    
  • Re: [PATCH] sched/fair: fix pick_eevdf to always find the correct se
    This reply claims that the two-step search (one pass for eligibility, one for the earliest deadline) can be avoided.
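
For the lag calibration mentioned under PATCH 03/15 above, the derivation (following the comment block in place_entity()) goes as follows. Track the virtual lag $vl_i = V - v_i$ and let $W = \sum_t w_t$. Naively placing the entity at $v_i = V - vl_i$ moves the average itself:

$$V' = \frac{W\,V + w_i (V - vl_i)}{W + w_i} = V - \frac{w_i\, vl_i}{W + w_i}$$

so the lag actually observed after placement shrinks to

$$vl_i' = V' - v_i = vl_i \cdot \frac{W}{W + w_i}$$

Hence the saved lag is pre-scaled by $\frac{W + w_i}{W}$ before the subtraction, so that the post-placement lag equals the lag the task had at dequeue.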


Main Patches 4 of EEVDF

The latest EEVDF patch series is [RFC][PATCH 00/10] sched/fair: Complete EEVDF. Even though it is still an RFC, it is very likely to be merged into mainline.

This series does chores such as adding comments and reorganizing the code, but its most important patch allows applications to set a suggested request/slice length using sched_attr::sched_runtime.
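
As a hypothetical userspace sketch of what that enables (the interpretation of sched_runtime for SCHED_OTHER is per this RFC, so the kernel may clamp or ignore the value depending on version; glibc has no wrapper, hence the raw syscall):

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Minimal sched_attr; the kernel accepts a smaller size than its own. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* for SCHED_OTHER: suggested slice (ns) */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr = {
		.size          = sizeof(attr),
		.sched_policy  = SCHED_OTHER,
		.sched_runtime = 100000,	/* request a ~0.1 ms slice */
	};

	/* pid 0 = current task, flags 0 */
	return syscall(SYS_sched_setattr, 0, &attr, 0);
}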


Other Patches

RFC

RFC stands for "Request for Comments". If such patches pass review and are refined as needed, they can eventually be accepted and merged.

We pick several EEVDF-related RFC patches to see what may come next.

[RFC][PATCH 12/15] sched: Introduce latency-nice as a per-task attribute constructs the infrastructure for adaptive time slices based on latency-nice.

In [RFC][PATCH 13/15] sched/fair: Implement latency-nice, you can see that se->slice = sysctl_sched_base_slice; in update_deadline is removed; the slice is now derived from the weight of the corresponding "latency prio":

 #ifdef CONFIG_UCLAMP_TASK
 /*
  * Serializes updates of utilization clamp values
@@ -4464,9 +4470,10 @@ static void __sched_fork(unsigned long c
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
 	p->se.vlag			= 0;
-	p->se.slice			= sysctl_sched_base_slice;
 	INIT_LIST_HEAD(&p->se.group_node);
 
+	set_latency_prio(p, p->latency_prio);
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq			= NULL;
 #endif
@@ -4718,8 +4725,7 @@ int sched_fork(unsigned long clone_flags
 
 		p->prio = p->normal_prio = p->static_prio;
 		set_load_weight(p, false);
-
-		p->latency_prio = NICE_TO_PRIO(0);
+		set_latency_prio(p, NICE_TO_PRIO(0));
 
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
@@ -7507,7 +7513,7 @@ static void __setscheduler_latency(struc
 				   const struct sched_attr *attr)
 {
 	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
-		p->latency_prio = NICE_TO_PRIO(attr->sched_latency_nice);
+		set_latency_prio(p, NICE_TO_PRIO(attr->sched_latency_nice));
 }
 /*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -952,6 +952,21 @@ int sched_update_scaling(void)
 }
 #endif
 
+void set_latency_fair(struct sched_entity *se, int prio)
+{
+	u32 weight = sched_prio_to_weight[prio];
+	u64 base = sysctl_sched_base_slice;
+
+	/*
+	 * For EEVDF the virtual time slope is determined by w_i (iow.
+	 * nice) while the request time r_i is determined by
+	 * latency-nice.
+	 *
+	 * Smaller request gets better latency.
+	 */
+	se->slice = div_u64(base << SCHED_FIXEDPOINT_SHIFT, weight);
+}
+
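
To get a feel for set_latency_fair(), here is a standalone userspace re-creation of its arithmetic; the 3 ms base slice is an assumed example value, and the weights come from the kernel's sched_prio_to_weight table:

#include <stdio.h>
#include <stdint.h>

#define SCHED_FIXEDPOINT_SHIFT 10

int main(void)
{
	const uint64_t base = 3000000;	/* assumed base slice: 3 ms in ns */
	const struct { int nice; uint32_t weight; } lat[] = {
		{ -20, 88761 },	/* most latency-sensitive: tiny request */
		{   0,  1024 },	/* default: slice == base */
		{  19,    15 },	/* least sensitive: huge request */
	};

	for (int i = 0; i < 3; i++) {
		uint64_t slice = (base << SCHED_FIXEDPOINT_SHIFT) / lat[i].weight;
		printf("latency-nice %3d -> slice %9llu ns\n",
		       lat[i].nice, (unsigned long long)slice);
	}
	return 0;
}

With these numbers, latency-nice -20 shrinks the request to roughly 35 µs while 19 stretches it to about 205 ms, which is exactly the "smaller request gets better latency" behavior the comment describes.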