Evgeny Iakovlev
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note No publishing access yet

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.

      Your account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

      Your team account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

      Explore these features while you wait
      Complete general settings
      Bookmark and like published notes
      Write a few more notes
      Complete general settings
      Write a few more notes
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights New
    • Make a copy
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Note Insights Versions and GitHub Sync Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Make a copy Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note No publishing access yet

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.

    Your account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

    Your team account was recently created. Publishing will be available soon, allowing you to share notes on your public page and in search results.

    Explore these features while you wait
    Complete general settings
    Bookmark and like published notes
    Write a few more notes
    Complete general settings
    Write a few more notes
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       Owned this note    Owned this note      
    Published Linked with GitHub
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    # Notes on Linux's Completely Fair Scheduler This note attempts to give context on structure and implementation of various pieces related to Completely Fair Scheduler - or CFS for short - in Linux kernel. It is not a subsitute for official kernel documentation, which can be found in kernel source tree [^1], but rather a worklog of figuring things out after reading the docs. There are other sources for introduction into CFS, most notably [Linux Kernel Development book](https://www.amazon.com/Linux-Kernel-Development-Robert-Love/dp/0672329468). All source code links are based on kernel version 5.1, which is what i am using at this time. ## The modular scheduler Linux kernel scheduler has a modular design where generic code relies on dynamic scheduler classes to implement specific scheduling algorithms. Each task can belong to any of the supported scheduling classes. CFS is just one of these classes, albeit by far the most complicated and commonly used. ### Scheduler classes Each scheduler class is implemented as a set of polymorphic ops that are plugged into a common framework: ```struct sched_class``` [kernel/sched/sched.h:1649](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/sched.h#L1649): * `enqueue_task` Called when a task enters a runnable state. It puts the scheduling entity (task) into the red-black tree and increments the nr_running variable. * `dequeue_task` When a task is no longer runnable, this function is called to keep the corresponding scheduling entity out of the red-black tree. It decrements the nr_running variable. * `yield_task` This function is basically just a dequeue followed by an enqueue, unless the compat_yield sysctl is turned on; in that case, it places the scheduling entity at the right-most end of the red-black tree. * `check_preempt_curr` This function checks if a task that entered the runnable state should preempt the currently running task. * `pick_next_task` This function chooses the most appropriate task eligible to run next. * `set_curr_task` This function is called when a task changes its scheduling class or changes its task group. * `task_tick` This function is mostly called from time tick functions; it might lead to process switch. This drives the running preemption. Scheduler classes have relative priority. A task with realtime class (`SCHED_FIFO/SCHED_RR`) always has higher priority than any CFS task (`SCHED_OTHER`) for instance. As an example, this is how generic `pick_next_task` scheduler functions uses scheduler classes ([kernel/sched/core.c:3349](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/core.c#L3349)), omitting unnecessary details: ``` pick_next_task(runqueue): for each class in priority descending order: if p = class->pick_next_task: return p ``` ### Per-cpu run queues Generic scheduler code maintains a per-cpu [^2] data structure, `struct rq`, or just runqueue: [kernel/sched/sched.h:807](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/sched.h#L807) `struct rq` is not really a queue in a common sense of the word. Instead it is a generic scheduler context bound to a single cpu [^3]. This generic context includes a lot of things, some of which are: * `raw_spinlock_t lock` Runqueue lock (duh). Runqueue is usually locked from a CPU it is per-CPU-ed on. Locking a runqueue from a different CPU is permitted and is used when one CPU needs to touch several queues, e.g. when load balancing. * `struct cfs_rq cfs` Embedded CFS runqueue. * `struct task_struct __rcu *curr` Currently running task. * `unsigned long next_balance` A jiffies value when next load balancing can be attempted on this CPU/runqueue. * `struct sched_domain __rcu *sd` Scheduler domain this runqueue belongs to. This has to do with load balancing and is discussed in detail in its own section. ### Scheduler entry point Main scheduling decision - picking next task to run and running it - is done in `__schedule` ([kernel/sched/core.c:3428](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/core.c#L3428)). `__schedule` can be called: * From periodic scheduler tick, which can be a periodic timer interrupt or a floating HRTICK clock * Manually within the kernel when task is put in a wait queue. * Upon return from hardware interrupts and system calls if `TIF_NEED_RESHED` thread flag is set: [include/linux/sched.h:1681](https://elixir.bootlin.com/linux/v5.1/source/include/linux/sched.h#L1681) `__schedule` is complicated, but in essence it does something similar to pseudocode below: ``` __schedule: lock current runqueue update state and accounting information for current task if current task is blocked: sched_class->dequeue_task(current tak) clear TIF_NEED_RESCHED next = pick_next_task(runqueue) if next: context_switch(current, next) check if we need to do load balancing for this runqueue ``` ## CFS class The CFS is an implementation of `SCHED_OTHER` scheduler class. To quote official kernel documentation: "80% of CFS's design can be summed up in a single sentence: CFS basically models an "ideal, precise multi-tasking CPU" on real hardware." ### CFS runqueue CFS runqueue is a per-cpu extension to generic scheduler runqueue that holds CFS-related context bound to a particular CPU. Unlike generic runqueues, there can be several CFS runqueues per CPU. In fact they can form a tree under a single generic runqueue (see [task groups](#Task-groups)). CFS runqueue structure has a lot of fields. Here's some basic ones: `struct cfs_rq` [kernel/sched/sched.h:488](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/sched.h#L488) * `struct load_weight load` Current load of this runqueue on its assigned CPU. Used in load balancing calculations. * `u64 min_vruntime` The least vruntime value across all enqueued entities. This value gets added to each entity that has migrated (or created and enqueued) to this queue and subsequently substracted when entity has migrated out. This keeps entire entity tree normalized and at the same time allows entities take their accumulated vruntimes with them when migrating between CPUs. * `struct rb_root_cached tasks_timeline` Each CFS runqueue tracks a number of sched entities that are to be executed on a given CPU. It does this by putting them into a per-queue red-black tree keyed by entity's vruntime. The task of selecting the next entity to run on a runqueue then boils down to finding the leftmost node of a red-black tree, e.g. an entity with the least vruntime. ### Sched entity A sched entity is an abstract CFS-specific thing that tracks priority, runtime, load and can be enqueued on a CFS runqueue. When sched entity is _enqueued_ it is ready to run on a particular CFS runqueue. Otherwise an entity is _dequeued_. When task gets blocked by waiting for an event or a lock or is otherwise not ready to run, it gets dequeued. Besides individual tasks, sched entities can also represent entire runqueues, which is used to connect nested task groups (see [task groups](#Task-groups)). `struct sched_entity`: [include/linux/sched.h:440](https://elixir.bootlin.com/linux/v5.1/source/include/linux/sched.h#L440) * `struct load_weight load` Entity's weight derived from task and/or group priority. See [Task priority and weight mapping](#Task-priority-and-weight-mapping) * `u64 vruntime` Accumulated weighted vruntime. See [Weighted vruntime](#Weighted-vruntime). * `struct cfs_rq *cfs_rq` Runqueue we're enqueued on. * `struct sched_entity *parent` If entity is part of a nested task group hierarchy this points to its parent (see [task groups](#Task-groups)) * `struct cfs_rq *my_q` If entity is part of a nested task group hierarchy this points to a CFS runqueue which is represented by this entity in a nested task group (see [task groups](#Task-groups)) #### Entity migration Each per-CPU CFS runqueue tracks a minimum vruntime value across all its enqueued entities. This value is updated each time an entity is enqueued, dequeued or has its vruntime recalculated. The min vruntime value is substracted when entity migrates out of the queue. That same entity will then get its destination runqueues' min vruntime added if it's migrating between CPUs. ### Task groups A task group represents tasks cgroup-ed into a CPU controller group or a terminal session if "autogroup" feature is enabled. Task groups can be nested, which is explained in [Group/entity hierarchy](#Groupentity-hierarchy). `struct task_group` [kernel/sched/sched.h:363](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/sched.h#L363) * `struct cgroup_subsys_state css` The cgroup context this task group is for. * `struct sched_entity **se` Array of sched entities that directly belong to this group. * `struct cfs_rq **cfs_rq` Array of per-CPU CFS runqueues that this group owns and schedules entities on. * `struct task_group *parent` Pointer to a task group we are nested in. #### Group/entity hierarchy Each task group has an array of per-CPU CFS runqueues (`cfs_rq`), one queue for each CPU that entities within a group can execute on. Task groups can be nested within task groups. When task group is nested withing another task group each nested CFS runqueue will have a separate sched entity enqueued into parent group's runqueue for the same CPU. This entity represents an entire nested runqueue for parent task group. Here's a picture to illustrate this nested hierarchy: ![](https://i.imgur.com/qCDqkJE.png) This shows a 2-CPU system with 2 task groups, each group spread across both CPUs. Since there are 2 CPUs, and each task group is not limited to a subset of CPUs, each group will have 2 per-CPU CFS runqueues. Nested task group (bottom one) has several sched entities enqueued in its 2 per-CPU runqueues. These blue entities represent normal tasks. Root task group (top one) also has 2 per-CPU CFS runqueues for the same reason the bottom one does. Each of its runqueues has 1 special red entity enqueued in it. These red entities represent the entire per-CPU runqueue of a nested task group for the same CPU. We can call these red entities "representative" entities. This entity hierarchy allows CFS to compute scheduling slices and weighted vruntimes for general case. ### Weighted vruntime Each sched entity, whether it represent a task or a group, tracks a key characteristic called weighted vruntime (or just vruntime). Very roughly vruntime is a monotonically increasing function that tells CFS how much "time" an entity has spent running in relation to its priority. Each time CFS picks a new entity to run, previously running entity's vruntime is updated in `update_curr` ([kernel/sched/fair.c:815](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L815)), which calls `calc_delta_fair(delta_exec, curr)`, where `delta_exec` is number of timer ticks `curr` spent running on a CPU. #### Task priority and weight mapping CFS maps user `nice` values to task weights: the more `nice` a task is, the less weight it should have and vice versa. A generic calculation of elapsed vruntime is then a function of timer ticks the task spent on a CPU and task weight: `task->vruntime += delta_exec / task->weight` The higher the task priority, the bigger it's weight, the less vruntime increment it gets credited for for the same number of CPU ticks. That is what `calc_delta_fair` calculates, approximated with fixedpoint arithmetic. #### Fixedpoint arithmetic A fixedpoint number is an approximation of a fraction using only integer computations. Fixedpoint number is represented as an integer which has a portion of its bits dedicated to fractional part and the rest is the integer part. The number of fractional bits is also called a fixedpoint shift. A generic formula to convert a floating point value `v` to fixed point value `f` is: `f = (int)(v * (1 << fixedpoint_shift))`. Fixedpoint multiplication works like this: `f = (f1 * f2) >> fixedpoint_shift`. And division: `f = f1 / f2 = (f1 * 1/f2) >> fixedpoint_shift` What we will need though is fixedpoint scaling, which is just simple multiplication of a fixedpoint integer `f` representation by an integer factor `s`: `f *= s`, where `s` is not a fixedpoint number but just a plain integer coefficient. #### Vruntime calculation We're going to look at `calc_delta_fair` [kernel/sched/fair.c:643](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L643) which calculates entity's weighted vruntime increment for a given number if CPU ticks and its weight. If entity's weight is something called `NICE_0_LOAD` (more on that later), then effectively `vruntime = delta_exec`. Things get more interesting when task's weight is not `NICE_0_LOAD` and we go into `__calc_delta` [kernel/sched/fair.c:218](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L218) CFS stores task weight mapping for user nice values as fixedpoint number with fixedpoint shift 10. Which means that weight 1.0 is represented in fixedpoint as `1.0 * 1024 = 1024`. There is an array of fixedpoint numbers for mapping `nice` values to weight coefficients so that mapped weights would change within ~10% between the levels: ```(C) const int sched_prio_to_weight[40] = { /* -20 */ 88761, 71755, 56483, 46273, 36291, /* -15 */ 29154, 23254, 18705, 14949, 11916, /* -10 */ 9548, 7620, 6100, 4904, 3906, /* -5 */ 3121, 2501, 1991, 1586, 1277, /* 0 */ 1024, 820, 655, 526, 423, /* 5 */ 335, 272, 215, 172, 137, /* 10 */ 110, 87, 70, 56, 45, /* 15 */ 36, 29, 23, 18, 15, }; ``` There is also a precomputed array of these values inverted over 2^32 (2^32/v): ```(C) const u32 sched_prio_to_wmult[40] = { /* -20 */ 48388, 59856, 76040, 92818, 118348, /* -15 */ 147320, 184698, 229616, 287308, 360437, /* -10 */ 449829, 563644, 704093, 875809, 1099582, /* -5 */ 1376151, 1717300, 2157191, 2708050, 3363326, /* 0 */ 4194304, 5237765, 6557202, 8165337, 10153587, /* 5 */ 12820798, 15790321, 19976592, 24970740, 31350126, /* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717, /* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153, }; ``` So the calculation of elapsed vruntime in fixedpoint now becomes: `delta_exec / task->weight = (delta_exec * task->inv_weight) >> 32`, where `task->inv_weight` is a value from `sched_prio_to_wmult` and we need to right-shift by 32 to account for `2^32 / weight`. This is essentially the calculation that happens in `__calc_delta` except that it accepts another weight factor for a different load calculation case when entity represents a runqueue. It is set to `NICE_0_LOAD` for a case when an entity represents a task, which is essentially a `1.0` in fixedpoint. We can do some test calculations for a task that spent 1us executing with `nice` values 0, -20 and 19: | nice | vruntime delta | | -------- | -------- | | 0 | 1024 | | -20 | 11 | | 19 | 68266 | nice -20 is approx. `nice 0 * 0.01` and nice 0 is approx `nice 19 * 0.01`. So each niceness change yields a cumulative level-by-level 10% weight change. Yay. ### Scheduling period CFS does not have a classic notion of a fixed task runtime slice. Instead an amount of time task gets to run on a CPU is defined by several CFS tunables, which can be set from userspace. * `sched_latency`: a period of time in nanoseconds during which all runnable tasks should execute at least once, i.e. a targeted scheduler latency. * `sched_nr_latency`: number of runnable tasks that should have a chance to execute at least once withing targeted scheduler latency * `sched_min_granularity`: a period of time in nanoseconds that any task should always be allowed to execute in the least. A scheduling period is calculated in `__sched_period` [kernel/sched/fair.c:659](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L659) ```(C) static u64 __sched_period(unsigned long nr_running) { if (unlikely(nr_running > sched_nr_latency)) return nr_running * sysctl_sched_min_granularity; else return sysctl_sched_latency; } ``` So, scheduling period is either `sched_latency`, or if number of runnable tasks exceeds `sched_nr_latency`, then it is extended as much as needed so that each runnable task gets at least `sched_min_granularity` nanoseconds to run. To get a timeslice for a given sched entity ready to be placed on a CPU, some additional calculation is done in `sched_slice` [kernel/sched/fair.c:673](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L673) ```(C) static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se) { u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq); for_each_sched_entity(se) { struct load_weight *load; struct load_weight lw; cfs_rq = cfs_rq_of(se); load = &cfs_rq->load; if (unlikely(!se->on_rq)) { lw = cfs_rq->load; update_load_add(&lw, se->load.weight); load = &lw; } slice = __calc_delta(slice, se->load.weight, load); } return slice; } ``` First an entire scheduling period is calculated. The `for_each_sched_entity` macro walks sched entity hierarchy from child to parent (which is a bit unexpected). We first calculate the lower-level entity's slice within its runqueue. If this entity is part of a task group, then we go up the hierarchy and `se` now points to our parent, which is a representative entity for a runqueue in our task group on a specific CPU. This representative sched entity sits on its own runqueue with its own weight and load, and we use that to effectively project previous slice calculation on to a higher-level runqueue. And so on until we reach the top. ## Load balancing Load balancing attempts to spread the computational load evenly across CPUs and is implemented by a scheduler class. The CFS class has 3 main types of load balancing: regular, idle and wakeup. * Regular load balancing is normal balancing done periodically. * Idle balancing is performed when CPU core is about to go idle and can try to look for work. * Wakeup balancing can be performed for a specific entity that is being woken up to place it on the idlest core within the closest shared cache level. These balancing types use the same underlying algorithm to calculate desired load spread. ### Sched domains CFS load balancing is built upon the concept of sched domains. A basic load balancing algorithm would compare the load of all cores and then transfer tasks from the most loaded core to the least loaded core. Unfortunately this would result in threads being migrated across the machine without considering cache locality or NUMA. Instead, the load balancer uses a hierarchical strategy that take shared resource topology into account. Sched domains build a topology that describes how machine cores share physical resources at different levels, starting from a single core level. Picture below shows sched domain hierarchy for a system with 2 nodes, 2 cores per node and 2 hardware threads per core. ![](https://i.imgur.com/5Nff8rj.png) Each sched domain has a circular list of sched groups, which is shown by arrows on a picture above (except for the circular part). CFS load-balances each domain by balancing load across its groups. `struct sched_group` [kernel/sched/sched.h:1355](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/sched.h#L1355) * `struct sched_group *next` Next group in a circular list * `unsigned long cpumask[]` Bitmap of cpu ids that belong to this group `struct sched_domain` [include/linux/sched/topology.h:71](https://elixir.bootlin.com/linux/v5.1/source/include/linux/sched/topology.h#L77) * `for_each_domain` Macro walks domains hierarchically from current domain to its parent * `struct sched_group *groups` List of domain balancing groups ### Load metric An entity's load is combination of the its weight and its average CPU utilization. CFS keeps track of various types of load averages, which are a centerpiece in CPU load calculation. Load averages are calculated for all entities whether they are tasks or runqueues. Tracking averages is done using `sched_avg` structure. [include/linux/sched.h#L392](https://elixir.bootlin.com/linux/v5.1/source/include/linux/sched.h#L392): ```(C) struct sched_avg { u64 last_update_time; u64 load_sum; u64 runnable_load_sum; u32 util_sum; u32 period_contrib; unsigned long load_avg; unsigned long runnable_load_avg; unsigned long util_avg; struct util_est util_est; }; ``` #### Entity load averages CFS calculates load sums (`*_sum` fields) as geometric series of entity's runnable periods of time within predefined intervals of 1024 microseconds: `u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...`, where `u_i` denote the fraction of i-th 1024us time period that the entity was runnable (`i = 0` is the latest period) and `y` is a predefined coefficient such that `y^32 = 0.5`. Load sum for a new period can be calculated from previous sum as `sum(i) = u_0 + y * sum(i-1)`. This allows CFS to geometrically degrade previous contribution of an entity to CPU load as said contribution period gets farther and farther in time. Sum calculation is done by `___update_load_sum` [kernel/sched/pelt.c:174](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/pelt.c#L174). After updating sums all averages are updated by `___update_load_avg` [kernel/sched/pelt.c#L225](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/pelt.c#L225) which divides accumulated and degraded sums over a set period of time. #### Runqueue load CFS runqueue load is calculated as sum of load averages of entities in runqueue. For once things can be simple. #### Domain group load A sched domain load is a sum of all runqueue loads for each CPU that is covered by a domain. Aside from various resolution-adjusting details `update_sg_lb_load` [kernel/sched/fair.c:8250](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L8250) is calculating that. Domain group load is an important metric that will be used by domain balancing algorithm. ### Load balancing algorithm Load balancing is run for each scheduling domain, starting from the bottom to the top. At each level, one core of current domain is responsible for balancing the load across sched groups within its domain and only its domain. This core is either the first idle core of the scheduling domain if the domain has idle cores whose free CPU cycles can be used for load balancing, or the first core of the scheduling domain otherwise. By walking the domain hierarchy we eventually balance the entire system. Domain walk starts at `rebalance_domains` [kernel/sched/fair.c:9442](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L9442). Each individual domain is balanced by `load_balance` [kernel/sched/fair.c:9033](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L9033). `load_balance` is complicated, but the gist of what it does is: * Define local group as the one where current CPU is. * Walk domain's sched group list and find the busiest one, e.g. group with the largest load average. * Within the busiest sched group find the most overloaded CPU. * Redistribute load between local CPU and the most overloaded one in busiest group by migrating tasks from the busiest CPU in the busiest sched group to current CPU which is running the load balancer code. #### Rebalance costs Rebalancing can be expensive and is not always worth doing. For example, idle balancing tracks an average CPU idle time for each CPU and avoids running idle balancing on a CPU when its average idling time is less than assumed balancing runtime cost: [kernel/sched/fair.c:9964](https://elixir.bootlin.com/linux/v5.1/source/kernel/sched/fair.c#L9964) [^1]: https://github.com/torvalds/linux/tree/master/Documentation/scheduler [^2]: per-cpu data is managed by the kernel as a fake symbol with an address which is an offset inside a real binary section. Per-CPU section bases are stored in ```__per_cpu_offset``` global array which is indexed by smp processor id on an SMP system. [^3]: A hardware thread on an SMP system.

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password
    or
    Sign in via Google Sign in via Facebook Sign in via X(Twitter) Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    By signing in, you agree to our terms of service.

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully