This note records my understanding of EEVDF, written in preparation for 《Demystifying The Linux CPU Scheduler》.
CFS focuses on fairness, while EEVDF also takes latency into consideration.
In other words, some tasks are more "latency-sensitive": they require great interactivity and need to get the CPU more frequently than others.
Practically, we first find the eligible tasks by examining their vruntime, then pick the one with the smallest deadline among them.
We can control how much the deadline increases each time a task runs through the "latency-nice" and "nice" values.
fair.c
By comparing fair.c between Linux v6.5-rc7 and v6.9-rc7, we can inspect what is new in EEVDF.
avg_vruntime and eligible infrastructure
These functions are added to determine whether a task is eligible and to serve the vlag computation:
+static void avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static inline void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+u64 avg_vruntime(struct cfs_rq *cfs_rq)
+static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static s64 entity_lag(u64 avruntime, struct sched_entity *se)
+static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
+int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
Define

$$V = v_0 + \frac{\sum_i w_i (v_i - v_0)}{\sum_i w_i} = v_0 + \frac{\texttt{cfs\_rq->avg\_vruntime}}{\texttt{cfs\_rq->avg\_load}}$$

where cfs_rq->avg_vruntime keeps $\sum_i w_i (v_i - v_0)$ and cfs_rq->avg_load keeps $\sum_i w_i$ (everything is tracked relative to $v_0$ to avoid overflow), so $V$ is the weighted average vruntime of the queue.
A task is eligible when $v_i \le V$; multiplying both sides by $\sum_i w_i$, then we will have

$$(v_i - v_0) \cdot \sum_i w_i \le \sum_i w_i (v_i - v_0)$$

where $v_0$ is cfs_rq->min_vruntime, $v_i$ is se->vruntime, and $\sum_i w_i$ is cfs_rq->avg_load.
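This inequality is what vruntime_eligible() implements; below is a simplified sketch (the real v6.9 code also folds the currently running entity into both sums before comparing):

/*
 * Simplified sketch of vruntime_eligible(): check v_i <= V with both
 * sides multiplied by \Sum w_i, so no division is needed.
 */
static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
{
	s64 avg = cfs_rq->avg_vruntime;	/* \Sum w_i * (v_i - v_0) */
	long load = cfs_rq->avg_load;	/* \Sum w_i */

	return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
}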
In CFS, tasks in the rbtree are ordered by vruntime:
static inline bool entity_before(const struct sched_entity *a,
const struct sched_entity *b)
{
return (s64)(a->vruntime - b->vruntime) < 0;
}
In EEVDF, deadline is now used to order the rbtree:
static inline bool entity_before(const struct sched_entity *a,
const struct sched_entity *b)
{
/*
* Tiebreak on vruntime seems unnecessary since it can
* hardly happen.
*/
return (s64)(a->deadline - b->deadline) < 0;
}
update_curr
In EEVDF, we not only update the virtual runtime during update_curr; updating the deadline is also necessary to maintain the order of tasks in the rbtree:
static void update_curr(struct cfs_rq *cfs_rq)
{
	/* ... */
	curr->vruntime += calc_delta_fair(delta_exec, curr);
+	update_deadline(cfs_rq, curr);
	update_min_vruntime(cfs_rq);
	/* ... */
}
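For reference, here is a condensed sketch of update_deadline() based on the merged v6.6+ code (buddy clearing omitted): once the entity has consumed its requested slice, a new request is granted and the virtual deadline is pushed out by r_i/w_i.

/*
 * Condensed sketch of update_deadline(). EEVDF: vd_i = ve_i + r_i/w_i,
 * where calc_delta_fair() performs the division by the entity's weight.
 */
static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/* the current request is not yet fully consumed */
	if ((s64)(se->vruntime - se->deadline) < 0)
		return;

	se->slice = sysctl_sched_base_slice;
	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);

	/* the request has been consumed: let someone else run */
	if (cfs_rq->nr_running > 1)
		resched_curr(rq_of(cfs_rq));
}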
pick_next_entity
In CFS, pick_next_entity needs to take care of a lot of cases around cfs_rq->next and cfs_rq->last; the EEVDF version of pick_next_entity is now simpler and more elegant:
static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq)
{
/*
* Enabling NEXT_BUDDY will affect latency but not fairness.
*/
if (sched_feat(NEXT_BUDDY) &&
cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
return cfs_rq->next;
return pick_eevdf(cfs_rq);
}
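pick_eevdf() walks the deadline-ordered rbtree and uses the augmented per-node minima to prune subtrees that contain no eligible entity. A linear-scan equivalent of what it computes is sketched below; for_each_entity_on_rq() is a hypothetical iterator used only for illustration.

/*
 * O(n) equivalent of what pick_eevdf() computes: among the eligible
 * entities, pick the one with the earliest virtual deadline. The real
 * implementation achieves O(log n) via the augmented rbtree.
 */
static struct sched_entity *pick_eevdf_linear(struct cfs_rq *cfs_rq)
{
	struct sched_entity *se, *best = NULL;

	for_each_entity_on_rq(cfs_rq, se) {	/* hypothetical helper */
		if (!entity_eligible(cfs_rq, se))
			continue;
		if (!best || (s64)(se->deadline - best->deadline) < 0)
			best = se;
	}
	return best;
}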
We strongly recommend reading PatchPhilosophy, PatchTipsAndTricks, and PatchCulture before starting.
Because EEVDF is still being worked on as of Linux v6.10, we will introduce several patches to understand it better.
Linux 核心設計: Scheduler(5): EEVDF 排程器 does an excellent job of explaining the EEVDF patches.
That note collects many of the related patches and continues to be updated.
EEVDF begins with [PATCH 00/10] sched: EEVDF using latency-nice; before this, Peter Zijlstra had been investigating latency-nice related algorithms, and EEVDF makes its debut here.
This version is simpler than the ones we will mention later, so it offers a first, basic understanding of EEVDF.
At the end of this cover letter, Peter states the superiority of EEVDF:
There's definitely more benchmarking/tweaking to be done (0-day already reported a stress-ng loss), but if we can pull this off we can delete a whole much of icky heuristics code. EEVDF is a much better defined policy than what we currently have.
K Prateek Nayak ran tests on a dual-socket Zen3 machine (2 x 64C/128T); most of the tests showed no regression, but on some of them (schbench, tbench) he observed "an increase in number of context switches and the total wait sum".
Shrikanth Hegde replied to the patch with experiments showing great improvement on a 480-CPU system, but a "Smaller system was showing regressing results" compared to v6.2.
In the following patches of this series, most of the EEVDF implementation is introduced by Peter Zijlstra. Note that not all of these changes were merged into the kernel. Let's look over them:
[PATCH 01/10] sched: Introduce latency-nice as a per-task attribute
introduces latency_nice in task_struct
[PATCH 02/10] sched/core: Propagate parent tasks latency requirements to the child task
sets the default latency_nice value for task_struct
[PATCH 03/10] sched: Allow sched_{get,set}attr to change latency_nice of the task
[PATCH 06/10] sched/fair: Add avg_vruntime
adds the avg_vruntime-related implementations:
avg_vruntime and avg_load
entity_key(), avg_vruntime_add() and avg_vruntime_sub()
avg_vruntime() and avg_vruntime_update()
__update_min_vruntime()
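The bookkeeping itself is small: every enqueue/dequeue updates two running sums kept relative to min_vruntime. A sketch following the merged code:

/*
 * key = v_i - v_0: tracking everything relative to min_vruntime keeps
 * the sums small and avoids overflow.
 */
static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	return (s64)(se->vruntime - cfs_rq->min_vruntime);
}

static void avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	unsigned long weight = scale_load_down(se->load.weight);

	cfs_rq->avg_vruntime += entity_key(cfs_rq, se) * weight;
	cfs_rq->avg_load += weight;
}

static void avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	unsigned long weight = scale_load_down(se->load.weight);

	cfs_rq->avg_vruntime -= entity_key(cfs_rq, se) * weight;
	cfs_rq->avg_load -= weight;
}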
[PATCH 08/10] sched/fair: Add lag based placement
the patch says:
With the introduction of avg_vruntime, it is possible to approximate lag
(the entire purpose of introducing avg_vruntime in fact).
avg_vruntime serves to calculate lag, which is used by sched_feat(PRESERVE_LAG) and sched_feat(FAIR_SLEEPERS); these are replaced by another lag mechanism later on, so we skip them here (see [PATCH 03/15] sched/fair: Add lag based placement). Note that lag is actually vlag in the following versions.
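In the merged code, vlag is the distance between the queue average and the entity's vruntime, clamped so that no task can carry unbounded credit or debt; a sketch following the v6.6+ update_entity_lag():

/*
 * lag_i = S - s_i = w_i * (V - v_i); in virtual time this reduces to
 * vlag_i = V - v_i.
 */
static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	s64 vlag, limit;

	vlag = avg_vruntime(cfs_rq) - se->vruntime;
	limit = calc_delta_fair(max_t(u64, 2 * se->slice, TICK_NSEC), se);

	se->vlag = clamp(vlag, -limit, limit);
}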
[PATCH 09/10] rbtree: Add rb_add_augmented_cached() helper
adds rb_add_augmented_cached() for enqueuing entities:
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
avg_vruntime_add(cfs_rq, se);
se->min_vruntime = se->vruntime;
rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
__entity_less, &min_vruntime_cb);
}
[PATCH 10/10] sched/fair: Implement an EEVDF like policy
introduces:
deadline and slice
entity_eligible
pick_next_entity and pick_eevdf
set_slice, while sched_slice is removed

Discussion between Peter Zijlstra and Vincent Guittot
Peter points out the difference between CFS and EEVDF:
…, where WFQ is only well defined in how much time is given to any task (bandwidth), but says nothing about how that is distributed in time.
Also, he states the purpose of the new infrastructure added in EEVDF:
The avg_vruntime / placement stuff I did is fundamental to how it controls bandwidth distribution and guarantees the WFQ subset. Specifically, by limiting the pick to that subset of tasks that has positive lag (owed time), it guarantees this fairness – but that means we need a working measure of lag.
[PATCH 00/17] sched: EEVDF using latency-nice
This patch series can be considered a hot-fix of the previous version; it also adds an adaptive placement strategy for smaller slices.
[PATCH 06/17] sched/fair: Add lag based placement
This patch tries to preserve the previous lag, so that a task re-enqueued after sleep/migration can be compensated. Quoting Peter:
Use this to do lag based placement over sleep+wake. Specifically, the FAIR_SLEEPERS thing places things too far to the left and messes up the deadline aspect of EEVDF.
This matters because the re-enqueued task usually needs the CPU urgently.
and
We want to ensure the lag after placement is the lag we got before dequeue.
[PATCH 10/17] sched/smp: Use lag to simplify cross-runqueue placement
This involves task migration. The most important change is that, instead of updating se->vruntime, we now update se->vlag when dequeuing or detaching entities, preserving that value for place_entity() on another CPU (migration, sleep+wake). Related to [PATCH 03/15] sched/fair: Add lag based placement.
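A minimal sketch of the idea (hypothetical helper names, not the exact diff): raw vruntime is meaningless on another runqueue, but the lag relative to the queue average is not, so that is what gets carried across.

/* On dequeue/detach: remember the position relative to this queue. */
static void detach_entity_sketch(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
}

/* On the destination runqueue: rebuild vruntime around its average
 * (the real place_entity() additionally calibrates the lag first). */
static void place_entity_sketch(struct cfs_rq *dst, struct sched_entity *se)
{
	se->vruntime = avg_vruntime(dst) - se->vlag;
}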
[PATCH 14/17] sched/eevdf: Better handle mixed slice length
In this patch, Peter tried to introduce cfs_rq->avg_slice and a weighted se->slice, but these have not been merged so far. Related to [PATCH 03/15] sched/fair: Add lag based placement and the [PATCH 11/15] sched/eevdf: Better handle mixed slice length discussion.
There are several versions of the EEVDF patches. This series, [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr, adds fixes for tick-preemption and slice heuristics, and also completes the calculation of se->vlag using se->vruntime.
[PATCH 03/15] sched/fair: Add lag based placement
When a task is woken, if we ignore its se->vlag, it will not be served immediately, since its se->vruntime will be set equal to avg_vruntime(cfs_rq).
If we instead take its previous se->vlag into account by setting se->vruntime = vruntime - lag;, the placement itself drags cfs_rq->avg_vruntime "backward", so we need to calibrate se->vlag.
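The calibration follows the derivation in the place_entity() comment: adding an entity with weight $w_i$ and lag $vl_i$ shifts the average itself, so the lag actually obtained after placement is

$$vl'_i = vl_i - \frac{w_i}{W + w_i}\,vl_i = \frac{W}{W + w_i}\,vl_i$$

which is strictly smaller in magnitude than $vl_i$. Inverting this relation, the stored lag is inflated before placement so the effective lag comes out right:

$$vl_i = \frac{W + w_i}{W}\,vl'_i$$

where $W$ is the total weight of the runqueue before the entity is added.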
Put simply: if every newly placed entity kept its large se->vlag uncalibrated, its se->vruntime would come out very small and it could occupy the CPU for a long time.
If you go to sleep immediately after doing work and happen to do so near the end of a slice (arguably what you want to have happen overall), then you have to pay that negative lag in wakeup latency later, because it is maintained through any amount of sleep.
If a task has a large negative lag (it has been over-served), #1 encourages it not to sleep and to keep competing for the CPU to work off its debt, or it will suffer from its negative lag forever (since lag "is maintained through any amount of sleep").
OTOH it should also not be instantly discarded, doing so will allow a
task to game the system by purposefully (micro) sleeping at the end of
its time quantum.
Conversely, if lag were instantly discarded on sleep, a task with a large negative lag would simply sleep it off; #2 warns that this lets tasks game the system by (micro) sleeping at the end of each slice.
The solution to this dilemma is [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue.
Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
This thread debates whether the eligibility check regresses EEVDF; the reported numbers are:
CFS: 168ms
EEVDF with eligibility: 206ms (regression from CFS)
EEVDF *without* eligibility: 143ms (improvement to CFS)
EEVDF *without* eligibility and with a 6ms base_slice_ns (was 1.5ms):
104ms (great improvement)
Re: [PATCH] sched/fair: fix pick_eevdf to always find the correct se
This patch claims it is possible to avoid the "2-step search" (one pass for eligibility, one for the earliest deadline).
The latest patch series about EEVDF is [RFC][PATCH 00/10] sched/fair: Complete EEVDF; even though it is still an RFC series, it is highly likely to be merged into mainline.
This series does chores like adding comments and re-organizing the code, but one of its most important patches allows applications to set a suggested request/slice length using sched_attr::sched_runtime.
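As a usage sketch (set_slice_hint() is a hypothetical helper, and this assumes the RFC's sched_attr::sched_runtime semantics for SCHED_NORMAL tasks get merged as proposed), a task could hint a 1 ms request through the existing sched_setattr() syscall:

/*
 * Hint a request/slice length of @nsec nanoseconds for the calling
 * task, e.g. set_slice_hint(1000000ULL) for 1 ms.
 */
#include <linux/sched.h>	/* SCHED_NORMAL */
#include <linux/sched/types.h>	/* struct sched_attr */
#include <sys/syscall.h>
#include <unistd.h>

static int set_slice_hint(unsigned long long nsec)
{
	struct sched_attr attr = {
		.size          = sizeof(attr),
		.sched_policy  = SCHED_NORMAL,
		.sched_runtime = nsec,	/* suggested request length */
	};

	/* glibc has historically had no wrapper; use the raw syscall */
	return syscall(SYS_sched_setattr, 0 /* self */, &attr, 0);
}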
RFC stands for "Request for Comments"; if such patches pass review and are refined as needed, they can eventually be accepted and merged into mainline.
We will pick several RFC patches about EEVDF to see what may come next.
[RFC][PATCH 12/15] sched: Introduce latency-nice as a per-task attribute constructs the infrastructure for an adaptive time slice based on latency-nice.
In [RFC][PATCH 13/15] sched/fair: Implement latency-nice, you can see that se->slice = sysctl_sched_base_slice; in update_deadline is removed; the slice is now calculated from the weight of the corresponding "latency prio":
#ifdef CONFIG_UCLAMP_TASK
/*
* Serializes updates of utilization clamp values
@@ -4464,9 +4470,10 @@ static void __sched_fork(unsigned long c
p->se.nr_migrations = 0;
p->se.vruntime = 0;
p->se.vlag = 0;
- p->se.slice = sysctl_sched_base_slice;
INIT_LIST_HEAD(&p->se.group_node);
+ set_latency_prio(p, p->latency_prio);
+
#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = NULL;
#endif
@@ -4718,8 +4725,7 @@ int sched_fork(unsigned long clone_flags
p->prio = p->normal_prio = p->static_prio;
set_load_weight(p, false);
-
- p->latency_prio = NICE_TO_PRIO(0);
+ set_latency_prio(p, NICE_TO_PRIO(0));
/*
* We don't need the reset flag anymore after the fork. It has
@@ -7507,7 +7513,7 @@ static void __setscheduler_latency(struc
const struct sched_attr *attr)
{
if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
- p->latency_prio = NICE_TO_PRIO(attr->sched_latency_nice);
+ set_latency_prio(p, NICE_TO_PRIO(attr->sched_latency_nice));
}
/*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -952,6 +952,21 @@ int sched_update_scaling(void)
}
#endif
+void set_latency_fair(struct sched_entity *se, int prio)
+{
+ u32 weight = sched_prio_to_weight[prio];
+ u64 base = sysctl_sched_base_slice;
+
+ /*
+ * For EEVDF the virtual time slope is determined by w_i (iow.
+ * nice) while the request time r_i is determined by
+ * latency-nice.
+ *
+ * Smaller request gets better latency.
+ */
+ se->slice = div_u64(base << SCHED_FIXEDPOINT_SHIFT, weight);
+}
+
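As a worked example (assuming the standard sched_prio_to_weight table): latency-nice 0 maps to weight 1024, so se->slice = (base << 10) / 1024 = base, leaving sysctl_sched_base_slice unchanged; latency-nice -20 (weight 88761) shrinks the slice to roughly base/87, while latency-nice 19 (weight 15) stretches it to roughly 68 * base. A smaller request yields an earlier virtual deadline and thus better latency, without changing the bandwidth share determined by the ordinary nice value.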