# Notes on Linux's Completely Fair Scheduler

This note attempts to give context on the structure and implementation of various pieces related to the Completely Fair Scheduler - or CFS for short - in the Linux kernel. It is not a substitute for the official kernel documentation, which can be found in the kernel source tree [^1], but rather a worklog of figuring things out after reading the docs. There are other introductions to CFS, most notably the [Linux Kernel Development book](https://www.amazon.com/Linux-Kernel-Development-Robert-Love/dp/0672329468). All source code links are based on kernel version 5.1, which is what I am using at this time.

## The modular scheduler

The Linux kernel scheduler has a modular design where generic code relies on pluggable scheduler classes to implement specific scheduling algorithms. Each task belongs to one of the supported scheduling classes. CFS is just one of these classes, albeit by far the most complicated and commonly used.

### Scheduler classes

Each scheduler class is implemented as a set of polymorphic ops that are plugged into a common framework: `struct sched_class` [kernel/sched/sched.h:1649](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/sched.h#L1649):

* `enqueue_task` Called when a task enters a runnable state. It puts the scheduling entity (task) into the red-black tree and increments the `nr_running` variable.
* `dequeue_task` When a task is no longer runnable, this function is called to keep the corresponding scheduling entity out of the red-black tree. It decrements the `nr_running` variable.
* `yield_task` Yields the CPU without blocking. Older docs describe this as a dequeue followed by an enqueue (optionally controlled by the long-gone `compat_yield` sysctl); in 5.1 CFS marks the yielding entity with the "skip" buddy hint so it is passed over at the next pick.
* `check_preempt_curr` This function checks if a task that entered the runnable state should preempt the currently running task.
* `pick_next_task` This function chooses the most appropriate task eligible to run next.
* `set_curr_task` This function is called when a task changes its scheduling class or changes its task group.
* `task_tick` This function is mostly called from time tick functions; it might lead to a process switch. This is what drives preemption of a running task.

Scheduler classes have relative priorities: a task in a realtime class (`SCHED_FIFO`/`SCHED_RR`), for instance, always has higher priority than any CFS task (`SCHED_OTHER`).

As an example, this is how the generic `pick_next_task` scheduler function uses scheduler classes ([kernel/sched/core.c:3349](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/core.c#L3349)), omitting unnecessary details:

```
pick_next_task(runqueue):
    for each class in priority descending order:
        if p = class->pick_next_task:
            return p
```
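For concreteness, this is roughly how CFS plugs its implementations into that ops table (abbreviated from [kernel/sched/fair.c](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c); in 5.1 the classes are chained in priority order through the `.next` pointer):

```(C)
const struct sched_class fair_sched_class = {
	.next			= &idle_sched_class,	/* next lower-priority class */
	.enqueue_task		= enqueue_task_fair,
	.dequeue_task		= dequeue_task_fair,
	.yield_task		= yield_task_fair,
	.check_preempt_curr	= check_preempt_wakeup,
	.pick_next_task		= pick_next_task_fair,
	.set_curr_task		= set_curr_task_fair,
	.task_tick		= task_tick_fair,
	/* ... many more ops omitted ... */
};
```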
### Per-cpu run queues

Generic scheduler code maintains a per-cpu [^2] data structure, `struct rq`, or just runqueue: [kernel/sched/sched.h:807](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/sched.h#L807)

`struct rq` is not really a queue in the common sense of the word. Instead it is a generic scheduler context bound to a single cpu [^3]. This generic context includes a lot of things, some of which are:

* `raw_spinlock_t lock` Runqueue lock (duh). A runqueue is usually locked from the CPU it is per-CPU-ed on. Locking a runqueue from a different CPU is permitted and is used when one CPU needs to touch several queues, e.g. when load balancing.
* `struct cfs_rq cfs` Embedded CFS runqueue.
* `struct task_struct __rcu *curr` Currently running task.
* `unsigned long next_balance` The jiffies value at which the next load balancing attempt can be made on this CPU/runqueue.
* `struct sched_domain __rcu *sd` Scheduler domain this runqueue belongs to. This has to do with load balancing and is discussed in detail in its own section.

### Scheduler entry point

The main scheduling decision - picking the next task to run and switching to it - is made in `__schedule` ([kernel/sched/core.c:3428](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/core.c#L3428)). `__schedule` can be called:

* From the periodic scheduler tick, which can be a periodic timer interrupt or the high-resolution HRTICK timer.
* Manually within the kernel, e.g. when a task is put on a wait queue.
* Upon return from hardware interrupts and system calls if the `TIF_NEED_RESCHED` thread flag is set: [include/linux/sched.h:1681](https://elixir.bootlin.com/linux/v5.1/source/include/linux/sched.h#L1681)

`__schedule` is complicated, but in essence it does something similar to the pseudocode below:

```
__schedule:
    lock current runqueue
    update state and accounting information for current task
    if current task is blocked:
        sched_class->dequeue_task(current task)
    clear TIF_NEED_RESCHED
    next = pick_next_task(runqueue)
    if next:
        context_switch(current, next)
    check if we need to do load balancing for this runqueue
```

## CFS class

CFS is the implementation of the `SCHED_OTHER` scheduler class. To quote the official kernel documentation: "80% of CFS's design can be summed up in a single sentence: CFS basically models an "ideal, precise multi-tasking CPU" on real hardware."

### CFS runqueue

A CFS runqueue is a per-cpu extension to the generic scheduler runqueue that holds CFS-related context bound to a particular CPU. Unlike generic runqueues, there can be several CFS runqueues per CPU. In fact they can form a tree under a single generic runqueue (see [task groups](#Task-groups)). The CFS runqueue structure has a lot of fields. Here are some basic ones:

`struct cfs_rq` [kernel/sched/sched.h:488](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/sched.h#L488)

* `struct load_weight load` Current load of this runqueue on its assigned CPU. Used in load balancing calculations.
* `u64 min_vruntime` The least vruntime value across all enqueued entities. This value is added to the vruntime of each entity that migrates onto this queue (or is created and enqueued on it) and subtracted when an entity migrates away. This keeps the entire entity tree normalized while still letting entities take their accumulated vruntime with them when migrating between CPUs.
* `struct rb_root_cached tasks_timeline` Each CFS runqueue tracks a number of sched entities that are to be executed on a given CPU. It does this by putting them into a per-queue red-black tree keyed by the entity's vruntime. Selecting the next entity to run on a runqueue then boils down to finding the leftmost node of the red-black tree, i.e. the entity with the least vruntime.
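Because `tasks_timeline` is an `rb_root_cached`, which caches a pointer to the leftmost node, that lookup is O(1). This is close to the actual helper in 5.1:

```(C)
/* From kernel/sched/fair.c: return the enqueued entity with the least
 * vruntime, i.e. the cached leftmost node of the red-black tree. */
struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
	struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);

	if (!left)
		return NULL;

	return rb_entry(left, struct sched_entity, run_node);
}
```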
### Sched entity

A sched entity is an abstract CFS-specific thing that tracks priority, runtime and load, and can be enqueued on a CFS runqueue. When a sched entity is _enqueued_ it is ready to run on a particular CFS runqueue; otherwise the entity is _dequeued_. When a task blocks waiting for an event or a lock, or is otherwise not ready to run, it gets dequeued. Besides individual tasks, sched entities can also represent entire runqueues, which is used to connect nested task groups (see [task groups](#Task-groups)).

`struct sched_entity`: [include/linux/sched.h:440](https://elixir.bootlin.com/linux/v5.1/source/include/linux/sched.h#L440)

* `struct load_weight load` Entity's weight derived from task and/or group priority. See [Task priority and weight mapping](#Task-priority-and-weight-mapping)
* `u64 vruntime` Accumulated weighted vruntime. See [Weighted vruntime](#Weighted-vruntime).
* `struct cfs_rq *cfs_rq` Runqueue we're enqueued on.
* `struct sched_entity *parent` If the entity is part of a nested task group hierarchy, this points to its parent (see [task groups](#Task-groups))
* `struct cfs_rq *my_q` If the entity is part of a nested task group hierarchy, this points to the CFS runqueue that this entity represents in the nested task group (see [task groups](#Task-groups))

#### Entity migration

Each per-CPU CFS runqueue tracks the minimum vruntime value across all its enqueued entities. This value is updated each time an entity is enqueued, dequeued or has its vruntime recalculated. The min vruntime value is subtracted from an entity's vruntime when the entity migrates out of the queue; when migrating between CPUs, the destination runqueue's min vruntime is then added back.

### Task groups

A task group represents tasks cgroup-ed into a CPU controller group, or a terminal session if the "autogroup" feature is enabled. Task groups can be nested, which is explained in [Group/entity hierarchy](#Groupentity-hierarchy).

`struct task_group` [kernel/sched/sched.h:363](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/sched.h#L363)

* `struct cgroup_subsys_state css` The cgroup context this task group is for.
* `struct sched_entity **se` Array of sched entities that directly belong to this group.
* `struct cfs_rq **cfs_rq` Array of per-CPU CFS runqueues that this group owns and schedules entities on.
* `struct task_group *parent` Pointer to the task group we are nested in.

#### Group/entity hierarchy

Each task group has an array of per-CPU CFS runqueues (`cfs_rq`), one queue for each CPU that entities within the group can execute on. Task groups can be nested. When a task group is nested within another, each of its per-CPU CFS runqueues gets a separate sched entity enqueued into the parent group's runqueue for the same CPU. That entity represents the entire nested runqueue to the parent task group. Here's a picture to illustrate this nested hierarchy:

![](https://i.imgur.com/qCDqkJE.png)

This shows a 2-CPU system with 2 task groups, each group spread across both CPUs. Since there are 2 CPUs, and each task group is not limited to a subset of CPUs, each group has 2 per-CPU CFS runqueues.

The nested task group (bottom one) has several sched entities enqueued in its 2 per-CPU runqueues. These blue entities represent normal tasks.

The root task group (top one) also has 2 per-CPU CFS runqueues, for the same reason the bottom one does. Each of its runqueues has 1 special red entity enqueued in it. These red entities represent the entire per-CPU runqueue of the nested task group for the same CPU. We can call these red entities "representative" entities.

This entity hierarchy allows CFS to compute scheduling slices and weighted vruntimes in the general case. The hierarchy is what the `for_each_sched_entity` macro walks whenever CFS needs to propagate state up through task groups, as in the sketch below.
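A simplified sketch of how enqueueing a task ripples up through the representative entities; the real `enqueue_task_fair` in 5.1 also updates load averages and handles group throttling, all omitted here:

```(C)
/* With CONFIG_FAIR_GROUP_SCHED: walk from an entity up through the
 * representative entities of the enclosing task groups. */
#define for_each_sched_entity(se) \
		for (; se; se = se->parent)

static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	struct sched_entity *se = &p->se;

	for_each_sched_entity(se) {
		/* Stop as soon as an ancestor is already enqueued:
		 * from that level up, the hierarchy is already visible. */
		if (se->on_rq)
			break;
		enqueue_entity(cfs_rq_of(se), se, flags);
	}
}
```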
### Weighted vruntime

Each sched entity, whether it represents a task or a group, tracks a key characteristic called weighted vruntime (or just vruntime). Very roughly, vruntime is a monotonically increasing value that tells CFS how much "time" an entity has spent running, in relation to its priority.

Each time CFS picks a new entity to run, the previously running entity's vruntime is updated in `update_curr` ([kernel/sched/fair.c:815](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L815)), which calls `calc_delta_fair(delta_exec, curr)`, where `delta_exec` is the time, in nanoseconds, that `curr` spent running on a CPU since the last update.

#### Task priority and weight mapping

CFS maps user `nice` values to task weights: the more `nice` a task is, the less weight it should have, and vice versa. A generic calculation of elapsed vruntime is then a function of the time the task spent on a CPU and the task's weight:

`task->vruntime += delta_exec / task->weight`

The higher the task's priority, the bigger its weight, and the smaller the vruntime increment it gets credited with for the same amount of CPU time. That is what `calc_delta_fair` calculates, approximated with fixedpoint arithmetic.

#### Fixedpoint arithmetic

A fixedpoint number is an approximation of a fraction using only integer computations. A fixedpoint number is represented as an integer that has a portion of its bits dedicated to the fractional part; the rest is the integer part. The number of fractional bits is also called the fixedpoint shift. A generic formula to convert a floating point value `v` to a fixedpoint value `f` is: `f = (int)(v * (1 << fixedpoint_shift))`.

Fixedpoint multiplication works like this: `f = (f1 * f2) >> fixedpoint_shift`. Division can be turned into multiplication by a precomputed fixedpoint inverse: `f = f1 / f2 = (f1 * inv(f2)) >> fixedpoint_shift`.

What we will need, though, is fixedpoint scaling, which is just a plain multiplication of a fixedpoint representation `f` by an integer factor `s`: `f *= s`, where `s` is not a fixedpoint number but just a plain integer coefficient.
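A minimal demonstration of shift-10 fixedpoint, the representation CFS uses for weights. This is illustration code, not kernel code; `SHIFT` and `FP` are made-up names:

```(C)
#include <stdio.h>
#include <inttypes.h>

#define SHIFT 10                                   /* fixedpoint shift */
#define FP(v) ((int64_t)((v) * (1 << SHIFT)))      /* float -> fixedpoint */

int main(void)
{
	int64_t a = FP(1.5);                       /* 1536 */
	int64_t b = FP(0.25);                      /* 256  */

	/* Multiplication: the extra SHIFT has to be shifted back out. */
	int64_t mul = (a * b) >> SHIFT;            /* FP(0.375) == 384 */

	/* Scaling by a plain integer needs no shift correction. */
	int64_t scaled = a * 3;                    /* FP(4.5) == 4608 */

	printf("%" PRId64 " %" PRId64 "\n", mul, scaled);
	return 0;
}
```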
#### Vruntime calculation

We're going to look at `calc_delta_fair` [kernel/sched/fair.c:643](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L643), which calculates an entity's weighted vruntime increment for a given amount of CPU time and the entity's weight. If the entity's weight is something called `NICE_0_LOAD` (more on that later), then the increment is effectively just `delta_exec`. Things get more interesting when the task's weight is not `NICE_0_LOAD` and we go into `__calc_delta` [kernel/sched/fair.c:218](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L218)

CFS stores the task weight mapping for user nice values as fixedpoint numbers with fixedpoint shift 10, which means that weight 1.0 is represented in fixedpoint as `1.0 * 1024 = 1024`.

There is an array of fixedpoint numbers mapping `nice` values to weight coefficients, chosen so that each nice level changes a task's CPU share by ~10% (the weights themselves change by ~25% per level):

```(C)
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};
```

There is also a precomputed array of these values inverted relative to 2^32 (`2^32 / v`):

```(C)
const u32 sched_prio_to_wmult[40] = {
 /* -20 */     48388,     59856,     76040,     92818,    118348,
 /* -15 */    147320,    184698,    229616,    287308,    360437,
 /* -10 */    449829,    563644,    704093,    875809,   1099582,
 /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};
```

So the calculation of elapsed vruntime in fixedpoint becomes:

`delta_exec / task->weight = (delta_exec * NICE_0_LOAD * task->inv_weight) >> 32`

where `task->inv_weight` is a value from `sched_prio_to_wmult`, the `NICE_0_LOAD` factor cancels out the shift-10 scaling of the weight itself, and we right-shift by 32 to account for `2^32 / weight`. This is essentially the calculation that happens in `__calc_delta`, except that `__calc_delta` takes the weight factor as a parameter to cover a different load calculation case, when the entity represents a runqueue; when the entity represents a task, the factor is `NICE_0_LOAD`, which is essentially `1.0` in fixedpoint.

We can do some test calculations for a task that spent 1us (1000ns) executing, with `nice` values 0, -20 and 19:

| nice | vruntime delta, ns |
| ---- | ------------------ |
| 0    | 1000               |
| -20  | 11                 |
| 19   | 68266              |

(For nice 0, `calc_delta_fair` short-circuits and the increment is exactly `delta_exec`.) A nice -20 task accumulates vruntime roughly 90 times slower than a nice 0 task, and a nice 19 task roughly 68 times faster. Each one-level nice change scales the weight by ~25%, and those steps compound into the large end-to-end ratios above, while keeping the CPU share difference between adjacent levels at the intended ~10%. Yay.
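The table is easy to reproduce. Below is a sketch of the fixedpoint division done in `__calc_delta`, ignoring the rounding and overflow handling of the real function; `calc_delta_sketch` is a made-up name and the `1024` constant stands in for `NICE_0_LOAD`:

```(C)
#include <stdio.h>
#include <inttypes.h>

/* delta_exec / weight == (delta_exec * NICE_0_LOAD * inv_weight) >> 32,
 * where inv_weight comes from sched_prio_to_wmult. */
static uint64_t calc_delta_sketch(uint64_t delta_exec, uint32_t inv_weight)
{
	return (delta_exec * 1024 * (uint64_t)inv_weight) >> 32;
}

int main(void)
{
	uint64_t delta = 1000;	/* 1us of runtime, in nanoseconds */

	printf("nice -20: %" PRIu64 "\n", calc_delta_sketch(delta, 48388));
	printf("nice   0: %" PRIu64 "\n", calc_delta_sketch(delta, 4194304));
	printf("nice  19: %" PRIu64 "\n", calc_delta_sketch(delta, 286331153));
	return 0;		/* prints 11, 1000, 68266 */
}
```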
### Scheduling period

CFS does not have a classic notion of a fixed task runtime slice. Instead, the amount of time a task gets to run on a CPU is defined by several CFS tunables, which can be set from userspace:

* `sched_latency`: a period of time in nanoseconds during which all runnable tasks should execute at least once, i.e. the targeted scheduler latency.
* `sched_nr_latency`: the number of runnable tasks that should get a chance to execute at least once within the targeted scheduler latency.
* `sched_min_granularity`: the minimum period of time, in nanoseconds, that any task should be allowed to execute.

A scheduling period is calculated in `__sched_period` [kernel/sched/fair.c:659](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L659)

```(C)
static u64 __sched_period(unsigned long nr_running)
{
	if (unlikely(nr_running > sched_nr_latency))
		return nr_running * sysctl_sched_min_granularity;
	else
		return sysctl_sched_latency;
}
```

So the scheduling period is either `sched_latency`, or, if the number of runnable tasks exceeds `sched_nr_latency`, it is extended as much as needed so that each runnable task gets at least `sched_min_granularity` nanoseconds to run.

To get the timeslice for a given sched entity ready to be placed on a CPU, some additional calculation is done in `sched_slice` [kernel/sched/fair.c:673](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L673)

```(C)
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);

	for_each_sched_entity(se) {
		struct load_weight *load;
		struct load_weight lw;

		cfs_rq = cfs_rq_of(se);
		load = &cfs_rq->load;

		if (unlikely(!se->on_rq)) {
			lw = cfs_rq->load;

			update_load_add(&lw, se->load.weight);
			load = &lw;
		}
		slice = __calc_delta(slice, se->load.weight, load);
	}
	return slice;
}
```

First an entire scheduling period is calculated. The `for_each_sched_entity` macro then walks the sched entity hierarchy from child to parent (which is a bit unexpected). We first calculate the lowest-level entity's slice within its runqueue: its share of the period is its weight divided by the total runqueue weight, which is what `__calc_delta(slice, se->load.weight, load)` computes. If this entity is part of a task group, we then go up the hierarchy, and `se` now points to our parent, which is the representative entity for our runqueue in the task group on a specific CPU. This representative sched entity sits on its own runqueue with its own weight and load, and we use those to project the previous slice calculation onto the higher-level runqueue. And so on, until we reach the top.
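A worked example, assuming the 5.1 default tunables before boot-time scaling (`sched_latency` = 6ms, `sched_min_granularity` = 0.75ms, `sched_nr_latency` = 8; at boot these are scaled up by a factor derived from the CPU count) and equal-weight tasks with no task groups:

```(C)
#include <stdio.h>
#include <inttypes.h>

static const uint64_t sysctl_sched_latency         = 6000000ULL;  /* 6 ms */
static const uint64_t sysctl_sched_min_granularity = 750000ULL;   /* 0.75 ms */
static const unsigned long sched_nr_latency        = 8;

static uint64_t __sched_period(unsigned long nr_running)
{
	if (nr_running > sched_nr_latency)
		return nr_running * sysctl_sched_min_granularity;
	return sysctl_sched_latency;
}

int main(void)
{
	/* 4 runnable tasks fit into the 6ms latency target: each
	 * equal-weight task gets 6ms / 4 = 1.5ms. */
	printf("%" PRIu64 " ns\n", __sched_period(4) / 4);    /* 1500000 */

	/* 16 tasks stretch the period to 16 * 0.75ms = 12ms, so no
	 * slice drops below the minimum granularity. */
	printf("%" PRIu64 " ns\n", __sched_period(16) / 16);  /* 750000 */
	return 0;
}
```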
## Load balancing

Load balancing attempts to spread the computational load evenly across CPUs, and is implemented per scheduler class. The CFS class has 3 main types of load balancing: regular, idle and wakeup.

* Regular load balancing is the normal balancing done periodically.
* Idle balancing is performed when a CPU core is about to go idle and can look for work to pull.
* Wakeup balancing can be performed for a specific entity that is being woken up, to place it on the idlest core within the closest shared cache level.

These balancing types use the same underlying algorithm to calculate the desired load spread.

### Sched domains

CFS load balancing is built upon the concept of sched domains. A basic load balancing algorithm would compare the load of all cores and then transfer tasks from the most loaded core to the least loaded core. Unfortunately this would result in threads being migrated across the machine without considering cache locality or NUMA. Instead, the load balancer uses a hierarchical strategy that takes the shared resource topology into account.

Sched domains build a topology that describes how machine cores share physical resources at different levels, starting from the single core level. The picture below shows the sched domain hierarchy for a system with 2 nodes, 2 cores per node and 2 hardware threads per core.

![](https://i.imgur.com/5Nff8rj.png)

Each sched domain has a circular list of sched groups, which is shown by the arrows in the picture above (except for the circular part). CFS load-balances each domain by balancing load across its groups.

`struct sched_group` [kernel/sched/sched.h:1355](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/sched.h#L1355)

* `struct sched_group *next` Next group in the circular list
* `unsigned long cpumask[]` Bitmap of cpu ids that belong to this group

`struct sched_domain` [include/linux/sched/topology.h:77](https://elixir.bootlin.com/linux/v5.1/source/include/linux/sched/topology.h#L77)

* `for_each_domain` Not a field but a related macro: walks domains hierarchically from the current domain up through its parents
* `struct sched_group *groups` List of domain balancing groups

### Load metric

An entity's load is a combination of its weight and its average CPU utilization. CFS keeps track of various types of load averages, which are the centerpiece of CPU load calculation. Load averages are calculated for all entities, whether they are tasks or runqueues. Tracking averages is done using the `sched_avg` structure [include/linux/sched.h:392](https://elixir.bootlin.com/linux/v5.1/source/include/linux/sched.h#L392):

```(C)
struct sched_avg {
	u64			last_update_time;
	u64			load_sum;
	u64			runnable_load_sum;
	u32			util_sum;
	u32			period_contrib;
	unsigned long		load_avg;
	unsigned long		runnable_load_avg;
	unsigned long		util_avg;
	struct util_est		util_est;
};
```

#### Entity load averages

CFS calculates load sums (the `*_sum` fields) as a geometric series over the entity's runnable time within predefined periods of 1024 microseconds:

`u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...`

where `u_i` denotes the fraction of the i-th 1024us period that the entity was runnable (`i = 0` being the latest period) and `y` is a predefined coefficient such that `y^32 = 0.5`. The load sum for a new period can be calculated from the previous sum as `sum(i) = u_0 + y * sum(i-1)`. This lets CFS geometrically decay an entity's past contribution to CPU load as that contribution recedes into the past. Sum calculation is done by `___update_load_sum` [kernel/sched/pelt.c:174](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/pelt.c#L174). After the sums are updated, all averages are updated by `___update_load_avg` [kernel/sched/pelt.c:225](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/pelt.c#L225), which divides the accumulated, decayed sums by the elapsed period.
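The decay property is easy to see numerically. This is illustration code only, not kernel code - the kernel computes the decay with fixedpoint arithmetic and a precomputed lookup table (`decay_load` in pelt.c) rather than floating point:

```(C)
#include <stdio.h>
#include <math.h>	/* link with -lm */

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);   /* chosen so that y^32 == 0.5 */

	/* A fully runnable entity contributes 1024 for the current
	 * 1024us period; that same contribution halves every 32
	 * periods, i.e. roughly every 32ms: */
	for (int periods = 0; periods <= 128; periods += 32)
		printf("after %3d periods: %6.1f\n",
		       periods, 1024.0 * pow(y, periods));
	return 0;                          /* 1024, 512, 256, 128, 64 */
}
```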
#### Runqueue load

CFS runqueue load is calculated as the sum of the load averages of the entities on the runqueue. For once, things can be simple.

#### Domain group load

A sched domain's load is the sum of the runqueue loads of each CPU covered by the domain. Aside from various resolution-adjusting details, this is what `update_sg_lb_stats` [kernel/sched/fair.c:8250](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L8250) calculates. Domain group load is an important metric used by the domain balancing algorithm.

### Load balancing algorithm

Load balancing is run for each scheduling domain, from the bottom of the hierarchy to the top. At each level, one core of the current domain is responsible for balancing the load across the sched groups within its domain, and only its domain. This core is either the first idle core of the scheduling domain, if the domain has idle cores whose free CPU cycles can be used for load balancing, or the first core of the scheduling domain otherwise. By walking the domain hierarchy we eventually balance the entire system.

The domain walk starts at `rebalance_domains` [kernel/sched/fair.c:9442](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L9442). Each individual domain is balanced by `load_balance` [kernel/sched/fair.c:9033](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L9033). `load_balance` is complicated, but the gist of what it does is:

* Define the local group as the one the current CPU is in.
* Walk the domain's sched group list and find the busiest one, i.e. the group with the largest load.
* Within the busiest sched group, find the most overloaded CPU.
* Redistribute load by migrating tasks from that busiest CPU to the current CPU, which is the one running the load balancer code.

#### Rebalance costs

Rebalancing can be expensive and is not always worth doing. For example, idle balancing tracks the average idle time of each CPU and skips idle balancing on a CPU when its average idle time is less than the assumed cost of running the balancer: [kernel/sched/fair.c:9964](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L9964)

[^1]: https://github.com/torvalds/linux/tree/master/Documentation/scheduler
[^2]: Per-cpu data is managed by the kernel as a fake symbol whose address is an offset inside a dedicated binary section. Per-CPU section bases are stored in the `__per_cpu_offset` global array, which is indexed by the smp processor id on an SMP system.
[^3]: A hardware thread on an SMP system.