# Chapter 4 Process Scheduling
## Multitasking
A multitasking operating system is one that can simultaneously interleave execution of more than one process.
Multitasking operating systems come in two flavors:
1. preemptive multitasking
In preemptive multitasking, the scheduler decides when a process is to cease running and a new process is to begin running. The time a process runs before it is preempted is usually predetermined, and it is called the timeslice of the process.
2. cooperative multitasking
In cooperative multitasking, a process does not stop running until it voluntarily decides to do so. The act of a process voluntarily suspending itself is called yielding.
Shortcomings of cooperative multitasking:
1). The scheduler cannot make global decisions regarding how long processes run;
2). processes can monopolize the processor for longer than the user desires;
3). a hung process that never yields can potentially bring down the entire system.
## Linux’s Process Scheduler
N/A
## Policy
### I/O-Bound Versus Processor-Bound Processes
Processes can be classified as either I/O-bound or processor-bound.
**I/O-bound**
An I/O-bound process spends much of its time submitting and waiting on I/O requests. Most graphical user interface (GUI) applications, for example, are I/O-bound, even if they never read from or write to the disk, because they spend most of their time waiting on user interaction via the keyboard and mouse.
**Processor-bound**
A processor-bound process spends much of its time executing code; examples include programs that perform a lot of mathematical calculations, such as ssh-keygen or MATLAB.
Of course, these classifications are not mutually exclusive. Processes can exhibit both behaviors simultaneously.
### Process Priority
The Linux kernel implements two separate priority ranges.
The first is the nice value, a number from –20 to +19 with a default of 0.
Larger nice values correspond to a lower priority—you are being “nice” to the other processes on the system.
> ps -el
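The `ps -el` command above lists each process's nice value in its NI column. From user-space, a process can also read and adjust its own nice value programmatically with the getpriority() and setpriority() system calls (a minimal sketch; careful code would also clear and check errno, since getpriority() can legitimately return -1):
```
#include <sys/resource.h>
#include <stdio.h>

/* Minimal sketch: read our nice value, then be "nicer" by 5. */
int main(void)
{
	int nice_val = getpriority(PRIO_PROCESS, 0);	/* 0 == calling process */

	printf("current nice value: %d\n", nice_val);
	if (setpriority(PRIO_PROCESS, 0, nice_val + 5) == -1)
		perror("setpriority");
	return 0;
}
```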
The second range is the real-time priority.
The values are configurable, but by default range from 0 to 99, inclusive.
Opposite from nice values, higher real-time priority values correspond to a greater priority.
Linux implements real-time priorities in accordance with the relevant Unix standards, specifically POSIX.1b.
> ps -eo state,uid,pid,ppid,rtprio,time,comm
### Timeslice
The timeslice is the numeric value that represents how long a task can run until it is preempted.
**Too long**
Too long a timeslice causes the system to have poor interactive performance; the system will no longer feel as if applications are concurrently executed.
**Too short**
Too short a timeslice causes significant amounts of processor time to be wasted on the overhead of switching processes.
Furthermore, the conflicting goals of I/O-bound versus processor-bound processes again arise: I/O-bound processes do not need longer timeslices (although they do like to run often), whereas processor-bound processes crave long timeslices (to keep their caches hot).
On Linux, therefore, the amount of processor time that a process receives is a function of the load of the system.
This assigned proportion is further affected by each process’s nice value.
The nice value acts as a weight, changing the proportion of the processor time each process receives.
Processes with higher nice values (a lower priority) receive a deflationary weight, yielding them a smaller proportion of the processor; processes with smaller nice values (a higher priority) receive an inflationary weight, netting them a larger proportion of the processor.
In Linux, under the new CFS scheduler, whether a newly runnable process preempts the current one is a function of how much of a proportion of the processor the newly runnable process has consumed.
If it has consumed a smaller proportion of the processor than the currently executing process, it runs immediately, preempting the current process. If not, it is scheduled to run at a later time.
### The Scheduling Policy in Action
Consider a system with two runnable tasks: a text editor and a video encoder.
The text editor is I/O-bound because it spends nearly all its time waiting for user key presses. Despite this, when the text editor does receive a key press, the user expects the editor to respond immediately.
Conversely, the video encoder is processor-bound. Aside from reading the raw data stream from the disk and later writing the resulting video, the encoder spends all its time applying the video codec to the raw data, easily consuming 100% of the processor.
The video encoder does not have any strong time constraints on when it runs—if it started running now or in half a second, the user could not tell and would not care. Of course, the sooner it finishes the better, but latency is not a primary concern.
On most operating systems, these goals are accomplished (if at all) by giving the text editor a higher priority and larger timeslice than the video encoder. Advanced operating systems do this automatically, by detecting that the text editor is interactive.
Linux achieves these goals too, but by different means. Instead of assigning the text editor a specific priority and timeslice, it guarantees the text editor a specific proportion of the processor. If the video encoder and text editor are the only running processes and both are at the same nice level, this proportion would be 50%—each would be guaranteed half of the processor’s time.
Because the text editor spends most of its time blocked, waiting for user key presses, it does not use anywhere near 50% of the processor. Conversely, the video encoder is free to use more than its allotted 50%, enabling it to finish the encoding quickly.
Specifically, CFS determines that the text editor has run for less time than the video encoder. Attempting to give all processes a fair share of the processor, it then preempts the video encoder and enables the text editor to run. The text editor runs, quickly processes the user’s key press, and again sleeps, waiting for more input.
## The Linux Scheduling Algorithm
### Scheduler Classes
The base scheduler code, which is defined in kernel/sched.c, iterates over each scheduler class in order of priority. The highest priority scheduler class that has a runnable process wins, selecting who runs next.
The Completely Fair Scheduler (CFS) is the registered scheduler class for **normal processes**, called SCHED_NORMAL in Linux (and SCHED_OTHER in POSIX). CFS is defined in kernel/sched_fair.c.
We discuss the scheduler class for **real-time processes** in a later section.
### Process Scheduling in Unix Systems
**How traditional Unix systems schedule processes**
As mentioned in the previous section, modern process schedulers have two common concepts:
1. process priority (the nice value)
2. timeslice
On Unix, the priority is exported to user-space in the form of nice values.
**1st: suboptimal switching behavior**
Consider a direct nice-to-timeslice mapping with two processes, one at nice 0 and the other at nice 20:

| NICE | timeslice (ms) | proportion of processor |
| ---- | -------------- | ----------------------- |
| 0    | 100            | 100/105                 |
| 20   | 5              | 5/105                   |
If instead both processes are at nice 20, each receives 50% of the processor (5/10):

| NICE | timeslice (ms) | proportion of processor |
| ---- | -------------- | ----------------------- |
| 20   | 5              | 5/10                    |
| 20   | 5              | 5/10                    |
But instead of context switching twice every 105 milliseconds, we now context switch twice every 10 milliseconds.
If both processes are instead at nice 0, each again receives 50% of the processor (100/200):

| NICE | timeslice (ms) | proportion of processor |
| ---- | -------------- | ----------------------- |
| 0    | 100            | 100/200                 |
| 0    | 100            | 100/200                 |
But we now context switch twice every 200 milliseconds.
Neither of these timeslice allotments is necessarily ideal; each is simply a byproduct of a given nice-value-to-timeslice mapping coupled with a particular list of runnable processes.
**2nd: relative nice values**
| NICE | timeslice (ms) |
| ---- | -------------- |
| 0    | 100            |
| 1    | 95             |
| 18   | 10             |
| 19   | 5              |

Suppose two processes are at nice values 0 and 1: they receive timeslices of 100 and 95 milliseconds, nearly identical.
Now suppose two processes are at nice values 18 and 19: they receive timeslices of 10 and 5 milliseconds, the former receiving twice the processor time of the latter!
This behavior means that “nicing down a process by one” has wildly different effects depending on the starting nice value.
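To make this concrete, here is a toy linear nice-to-timeslice mapping matching the table above (an assumption for illustration only; no real kernel uses this exact mapping):
```
#include <stdio.h>

/* Toy mapping for illustration: 100 ms at nice 0, minus 5 ms per level. */
static int timeslice_ms(int nice)
{
	return 100 - nice * 5;
}

int main(void)
{
	/* Nicing down by one: nearly no effect at nice 0... */
	printf("nice 0 vs 1:   %d ms vs %d ms\n", timeslice_ms(0), timeslice_ms(1));
	/* ...but a 2x swing at nice 18. */
	printf("nice 18 vs 19: %d ms vs %d ms\n", timeslice_ms(18), timeslice_ms(19));
	return 0;
}
```
Going from nice 0 to 1 changes the allotment by 5%; going from nice 18 to 19 halves it.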
**3rd: absolute timeslices and timer ticks**
A nice-value-to-timeslice mapping requires the ability to assign an absolute timeslice, and this absolute value must be measured in terms the kernel can measure.
In most operating systems, this means the timeslice must be some integer multiple of the timer tick.
Consequently, the minimum timeslice has a floor of one timer tick, and timeslices change as the timer tick changes.
**4th: wake-up boosts in priority-based schedulers**
The fourth problem concerns handling process wake-ups in a priority-based scheduler that wants to optimize for interactive tasks. In such a system, you might want to give freshly woken-up tasks a priority boost by allowing them to run immediately, even if their timeslice had expired.
This improves interactive performance, but it can also grant some processes an unfair amount of processor time. It merely works around the true problem, which is that assigning absolute timeslices yields a constant switching rate but variable fairness.
**CFS**
The approach taken by CFS is a radical (for process schedulers) rethinking of timeslice allotment: Do away with timeslices completely and assign each process a proportion of the processor. CFS thus yields constant fairness but a variable switching rate.
### Fair Scheduling
CFS is based on a simple concept: Model process scheduling as if the system had an ideal, perfectly multitasking processor. In such a system, each process would receive 1/n of the processor’s time, where n is the number of runnable processes, and we’d schedule them for infinitely small durations, so that in any measurable period we’d have run all n processes for the same amount of time.
Of course, such perfect multitasking is impossible in practice. Instead, CFS runs each process for some amount of time, round-robin, selecting next the process that has run the least.
Rather than assign each process a timeslice, CFS calculates how long a process should run as a function of the total number of runnable processes.
**weight**
Instead of using the nice value to calculate a timeslice, CFS uses the nice value to weight the proportion of processor a process is to receive: Higher valued (lower priority) processes receive a fractional weight relative to the default nice value, whereas lower valued (higher priority) processes receive a larger weight.
**targeted latency**
Each process then runs for a “timeslice” proportional to its weight divided by the total weight of all runnable threads. To calculate the actual timeslice, CFS sets a target for its approximation of the “infinitely small” scheduling duration in perfect multitasking.
This target is called the targeted latency.
Let’s assume the targeted latency is 20 milliseconds and we have two runnable tasks at the same priority. Regardless of what that priority is, each will run for 10 milliseconds before preempting in favor of the other. If we have four tasks at the same priority, each will run for 5 milliseconds. If there are 20 tasks, each will run for 1 millisecond.
**minimum granularity**
Note that as the number of runnable tasks approaches infinity, the proportion of allotted processor time and the assigned timeslice approach zero. As this would eventually result in unacceptable switching costs, CFS imposes a floor on the timeslice assigned to each process. This floor is called the minimum granularity. By default it is 1 millisecond.
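As a simplified sketch (an illustration, not the kernel's actual code; the real computation lives in sched_slice() in kernel/sched_fair.c and is more involved), the per-task "timeslice" is the targeted latency scaled by the task's share of the total runnable load, with the minimum granularity as a floor:
```
#include <stdint.h>

#define TARGETED_LATENCY_NS 20000000ULL	/* 20 ms, as in the examples here */
#define MIN_GRANULARITY_NS   1000000ULL	/* 1 ms floor */

/* Simplified sketch: scale the targeted latency by this task's share of
 * the total runnable weight, then apply the minimum-granularity floor. */
static uint64_t slice_ns(uint64_t task_weight, uint64_t total_weight)
{
	uint64_t slice = TARGETED_LATENCY_NS * task_weight / total_weight;

	return slice < MIN_GRANULARITY_NS ? MIN_GRANULARITY_NS : slice;
}
```
With two equal-weight tasks this yields 10 ms each; with twenty runnable tasks it yields the 1 ms floor.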
Assume the targeted latency is again 20 milliseconds.
Case 1: two runnable processes, at nice 0 and nice 5:

| NICE | timeslice (ms) | proportion of processor |
| ---- | -------------- | ----------------------- |
| 0    | 15             | 3/4                     |
| 5    | 5              | 1/4                     |

Case 2: two runnable processes, at nice 10 and nice 15:

| NICE | timeslice (ms) | proportion of processor |
| ---- | -------------- | ----------------------- |
| 10   | 15             | 3/4                     |
| 15   | 5              | 1/4                     |
Put generally, the proportion of processor time that any process receives is determined only by the relative difference in niceness between it and the other runnable processes.
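This can be checked with the nice-to-weight table CFS uses (the kernel's prio_to_weight[] array; nice 0 has weight 1024 and each nice step changes the weight by roughly 1.25x — the specific entries below are quoted from memory and should be treated as illustrative):
```
#include <stdio.h>

/* Selected weights from the kernel's prio_to_weight[] table (nice 0 == 1024). */
static const double w[] = { [0] = 1024, [5] = 335, [10] = 110, [15] = 36 };

int main(void)
{
	/* Both pairs differ by 5 nice levels, so the split is the same. */
	printf("nice 0 vs 5:   %.0f%% / %.0f%%\n",
	       100 * w[0] / (w[0] + w[5]), 100 * w[5] / (w[0] + w[5]));
	printf("nice 10 vs 15: %.0f%% / %.0f%%\n",
	       100 * w[10] / (w[10] + w[15]), 100 * w[15] / (w[10] + w[15]));
	return 0;
}
```
Both cases print roughly 75% / 25%, matching the tables above.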
## The Linux Scheduling Implementation
### Time Accounting
#### The Scheduler Entity Structure
CFS uses the scheduler entity structure, struct sched_entity, defined in <linux/sched.h>
```
struct sched_entity {
	struct load_weight	load;		/* for load-balancing */
	struct rb_node		run_node;
	struct list_head	group_node;
	unsigned int		on_rq;

	u64			exec_start;
	u64			sum_exec_runtime;
	u64			vruntime;
	u64			prev_sum_exec_runtime;

	u64			nr_migrations;

#ifdef CONFIG_SCHEDSTATS
	struct sched_statistics statistics;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
	int			depth;
	struct sched_entity	*parent;
	/* rq on which this entity is (to be) queued: */
	struct cfs_rq		*cfs_rq;
	/* rq "owned" by this entity/group: */
	struct cfs_rq		*my_q;
#endif

#ifdef CONFIG_SMP
	/*
	 * Per entity load average tracking.
	 *
	 * Put into separate cache line so it does not
	 * collide with read-mostly values above.
	 */
	struct sched_avg	avg ____cacheline_aligned_in_smp;
#endif
};
```
The scheduler entity structure is embedded in the process descriptor, struct *task_struct*, as a member variable named *se*.
#### The Virtual Runtime
The vruntime variable stores the virtual runtime of a process, which is the actual runtime (the amount of time spent running) normalized (or weighted) by the number of runnable processes.
The function update_curr(), defined in kernel/sched_fair.c, manages this accounting:
```
/*
 * Update the current task's runtime statistics.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	u64 now = rq_clock_task(rq_of(cfs_rq));
	u64 delta_exec;

	if (unlikely(!curr))
		return;

	delta_exec = now - curr->exec_start;
	if (unlikely((s64)delta_exec <= 0))
		return;

	curr->exec_start = now;

	schedstat_set(curr->statistics.exec_max,
		      max(delta_exec, curr->statistics.exec_max));

	curr->sum_exec_runtime += delta_exec;
	schedstat_add(cfs_rq->exec_clock, delta_exec);

	curr->vruntime += calc_delta_fair(delta_exec, curr);
	update_min_vruntime(cfs_rq);

	if (entity_is_task(curr)) {
		struct task_struct *curtask = task_of(curr);

		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
		cpuacct_charge(curtask, delta_exec);
		account_group_exec_runtime(curtask, delta_exec);
	}

	account_cfs_rq_runtime(cfs_rq, delta_exec);
}
```
update_curr() calculates the execution time of the current process and stores that value in delta_exec. It then weights that runtime by the current process's load—in the version shown above, via calc_delta_fair()—and increments the process's vruntime by the weighted value.
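Conceptually, the weighting applied to delta_exec is simple (a simplified sketch of what calc_delta_fair() computes; the kernel delegates to __calc_delta(), which uses fixed-point arithmetic rather than a plain division):
```
#include <stdint.h>

#define NICE_0_LOAD 1024	/* weight of a nice-0 task */

/* Simplified sketch: scale the elapsed runtime by NICE_0_LOAD / weight.
 * A nice-0 task accrues vruntime at wall-clock rate; heavier (lower-nice)
 * tasks accrue it more slowly, lighter (higher-nice) tasks more quickly. */
static uint64_t weighted_delta(uint64_t delta_exec_ns, uint64_t weight)
{
	return delta_exec_ns * NICE_0_LOAD / weight;
}
```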
### Process Selection
The core of CFS’s scheduling algorithm:
Pick the task with **the smallest vruntime**.That’s it!
CFS uses a **red-black tree** to manage the set of runnable processes, keyed by **vruntime**.
**red-black tree -> the leftmost node -> the smallest vruntime**
#### Picking the Next Task
* CFS’s process selection algorithm is thus summed up as **"run the process represented by the leftmost node in the rbtree"**.
* The function that performs this selection is **__pick_next_entity()**, defined in kernel/sched_fair.c
```
static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
{
	struct rb_node *left = cfs_rq->rb_leftmost;

	if (!left)
		return NULL;

	return rb_entry(left, struct sched_entity, run_node);
}
```
* __pick_next_entity() does not traverse the tree (an O(log n) operation) to find **the leftmost node**; instead, CFS **caches the leftmost node in the variable *rb_leftmost*** and simply reads it here.
* The return value from this function is the process that CFS next runs.
#### Adding Processes to the Tree
* CFS adds processes to the rbtree and caches the leftmost node when
* a process becomes runnable (wakes up)
* is first created via fork().
* Adding processes to the tree is performed by enqueue_entity(), which calls __enqueue_entity().
* The new process is inserted according to the red-black tree insertion algorithm, keyed by the inserted process’s vruntime, as the condensed sketch below shows.
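A condensed sketch of __enqueue_entity() (based on kernel/sched_fair.c of this era, with statistics and corner cases trimmed): the walk goes left for smaller vruntime, and the cached leftmost pointer is updated when the new entity never moves right:
```
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
	struct rb_node *parent = NULL;
	struct sched_entity *entry;
	int leftmost = 1;

	/* Walk the tree, going left for smaller vruntime, right otherwise. */
	while (*link) {
		parent = *link;
		entry = rb_entry(parent, struct sched_entity, run_node);
		if (entity_before(se, entry)) {
			link = &parent->rb_left;
		} else {
			link = &parent->rb_right;
			leftmost = 0;	/* we moved right at least once */
		}
	}

	/* If we never moved right, the new node is the new cached leftmost. */
	if (leftmost)
		cfs_rq->rb_leftmost = &se->run_node;

	rb_link_node(&se->run_node, parent, link);
	rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
}
```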
#### Removing Processes from the Tree
* CFS removes processes from the red-black tree when
* a process blocks (becomes unrunnable)
* terminates.
* dequeue_entity() calls __dequeue_entity()
* Removing a process from the tree likewise follows the red-black tree removal algorithm.
### The Scheduler Entry Point
* The main entry point into the process scheduler is the function **schedule()**, defined in kernel/sched.c.
* schedule() goes through each scheduler class, starting with the highest priority, and selects the highest priority process in the highest priority class
* **schedule()** -> pick_next_task() -> sched_class->pick_next_task() -> pick_next_entity() -> **__pick_next_entity()**
```
/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie)
{
	const struct sched_class *class = &fair_sched_class;
	struct task_struct *p;

	/*
	 * Optimization: we know that if all tasks are in
	 * the fair class we can call that function directly:
	 */
	if (likely(prev->sched_class == class &&
		   rq->nr_running == rq->cfs.h_nr_running)) {
		p = fair_sched_class.pick_next_task(rq, prev, cookie);
		if (unlikely(p == RETRY_TASK))
			goto again;

		/* assumes fair_sched_class->next == idle_sched_class */
		if (unlikely(!p))
			p = idle_sched_class.pick_next_task(rq, prev, cookie);

		return p;
	}

again:
	for_each_class(class) {
		p = class->pick_next_task(rq, prev, cookie);
		if (p) {
			if (unlikely(p == RETRY_TASK))
				goto again;
			return p;
		}
	}

	BUG(); /* the idle class will always have a runnable task */
}
```
### Sleeping and Waking Up
Tasks that are sleeping (blocked) are in a special nonrunnable state.
A task sleeps while it is waiting for some event.
The event can be
* the passing of a specified amount of time,
* more data becoming available from file I/O,
* or another hardware event.
A task can also involuntarily go to sleep when it tries to obtain a contended semaphore in the kernel.
A common example is sleeping for file I/O—e.g., a read() that must wait for data.
Whatever the case, the sleeping behavior is the same:
1. The task marks itself as sleeping,
2. puts itself on a wait queue,
3. removes itself from the red-black tree of runnable processes, and
4. calls schedule() to select a new process to execute.
Waking back up is the inverse:
1. the task is set as runnable,
2. removed from the wait queue, and
3. added back to the red-black tree;
4. if the awakened task has a higher priority than the currently running process, need_resched is set.
Two states are associated with sleeping, TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE.
* Tasks in the TASK_UNINTERRUPTIBLE state ignore signals.
* Tasks in the TASK_INTERRUPTIBLE state wake up prematurely and respond to a signal if one is issued.
Both types of sleeping tasks sit on a wait queue, waiting for an event to occur, and are not runnable.
#### Wait Queues
* Sleeping is handled via wait queues.
* A wait queue is a simple list of processes waiting for an event to occur.
* Processes put themselves on a wait queue and mark themselves not runnable.
* When the event associated with the wait queue occurs, the processes on the queue are awakened.
* Wait queues are represented in the kernel by wait_queue_head_t.
* Wait queue heads are created statically via DECLARE_WAIT_QUEUE_HEAD() or dynamically via init_waitqueue_head().
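The kernel's recommended sleeping idiom looks roughly like this (a sketch using the standard <linux/wait.h> helpers; `q` is a wait_queue_head_t and `condition` stands for the event being waited on):
```
/* q is the wait_queue_head_t we wish to sleep on */
DEFINE_WAIT(wait);

add_wait_queue(&q, &wait);
while (!condition) {	/* condition is the event we are waiting for */
	prepare_to_wait(&q, &wait, TASK_INTERRUPTIBLE);
	if (signal_pending(current)) {
		/* handle signal */
		break;
	}
	schedule();	/* sleep until woken up */
}
finish_wait(&q, &wait);
```
Elsewhere, the code that makes `condition` true calls wake_up(&q) to rouse the sleepers.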
#### Waking Up
Waking is handled via **wake_up()**, which wakes up all the tasks waiting on the given wait queue. It:
1. calls try_to_wake_up(), which sets the task to TASK_RUNNING,
2. calls activate_task() to add the task to a runqueue,
3. sets need_resched if the awakened task's priority is higher than the priority of the current task, and
4. removes the task from the wait queue via __remove_wait_queue().
* The code that causes the event to occur typically calls wake_up() itself.
* For example, when data arrives from the hard disk, the VFS calls wake_up() on the wait queue that holds the processes waiting for the data.

## Preemption and Context Switching
* Context switching, the switching from one runnable task to another, is handled by the context_switch() function defined in kernel/sched.c.
* **context_switch() is called by schedule()** when a new process has been selected to run.
* context_switch() does two basic jobs:
* Calls switch_mm() to switch the virtual memory mapping from the previous process’s to that of the new process.
* Calls switch_to() to switch the processor state from the previous process’s to the current’s. This involves saving and restoring stack information and the processor registers and any other architecture-specific state.
The kernel provides the *need_resched* flag to signify whether a reschedule should be performed.
This flag is set
* by scheduler_tick() when a process should be preempted, and
* by try_to_wake_up() when a process that has a higher priority than the currently running process is awakened.
The kernel checks the flag, sees that it is set, and calls schedule() to switch to a new process.
### User Preemption
User preemption can occur
* When returning to user-space from a system call
* When returning to user-space from an interrupt handler
Whenever the kernel is preparing to return to user-space, the value of *need_resched* is checked.
If it is set, the scheduler is invoked to select a new process to execute.
### Kernel Preemption
The Linux kernel can preempt a task at any point, so long as the kernel is in a state in which it is safe to reschedule.
So when is it safe to reschedule? The kernel can preempt a task running in the kernel so long as it does not hold a lock.
The kernel adds a preemption counter, *preempt_count*, to each process’s thread_info.
* This counter begins at zero
* increments once for each lock that is acquired
* decrements once for each lock that is released.
* When the counter is zero, the kernel is preemptible.
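Kernel code manipulates this counter through preempt_disable() and preempt_enable(); a minimal sketch of their use (per_cpu_value is a hypothetical per-CPU variable, used only for illustration):
```
#include <linux/preempt.h>
#include <linux/percpu.h>

/* Hypothetical per-CPU counter, for illustration only. */
static DEFINE_PER_CPU(int, per_cpu_value);

static void bump_per_cpu_value(void)
{
	preempt_disable();		/* increments preempt_count */
	/* Safe: we cannot be preempted and migrated to another CPU here. */
	__this_cpu_inc(per_cpu_value);
	preempt_enable();		/* decrements preempt_count; may reschedule */
}
```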
Kernel preemption can occur
* When an interrupt handler exits, before returning to kernel-space
* When kernel code becomes preemptible again
* If a task in the kernel explicitly calls schedule()
* If a task in the kernel blocks (which results in a call to schedule())
## Real-Time Scheduling Policies
Linux provides two **real-time scheduling** policies, **SCHED_FIFO** and **SCHED_RR**.
The normal, **non-real-time** scheduling policy is **SCHED_NORMAL**.
**SCHED_FIFO**
* A runnable SCHED_FIFO task is always scheduled over any SCHED_NORMAL tasks.
* implements a simple first-in, first-out scheduling algorithm **without timeslices**.
* continues to run until it blocks or explicitly yields the processor; it has no timeslice and can run indefinitely.
* Only a higher priority SCHED_FIFO or SCHED_RR task can preempt a SCHED_FIFO task.
* Two or more SCHED_FIFO tasks at the same priority run round-robin, but again only yielding the processor when they explicitly choose to do so.
**SCHED_RR**
* SCHED_RR is **SCHED_FIFO with timeslices**
* When a SCHED_RR task exhausts its timeslice, any other real-time processes at its priority are scheduled round-robin.
* The timeslice is used to allow only rescheduling of same-priority processes.
* As with SCHED_FIFO, a higher-priority process always immediately preempts a lower-priority one
**Real-time priorities**
* Real-time priorities range from 0 to (MAX_RT_PRIO - 1).
* By default, MAX_RT_PRIO is 100; therefore, the default real-time priority range is 0 to 99.
* The nice values of SCHED_NORMAL tasks share this priority space, occupying MAX_RT_PRIO to (MAX_RT_PRIO + 40).
* By default, this means the -20 to +19 nice range maps to priority values 100 to 139.
## Scheduler-Related System Calls
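The chief scheduler-related system calls include nice(), sched_setscheduler(), and sched_getscheduler(). For instance, a user-space process can request one of the real-time policies described above with sched_setscheduler() (a minimal sketch; requires CAP_SYS_NICE, typically root):
```
#include <sched.h>
#include <stdio.h>

/* Minimal sketch: switch the calling process to SCHED_FIFO at
 * real-time priority 50. */
int main(void)
{
	struct sched_param sp = { .sched_priority = 50 };

	if (sched_setscheduler(0 /* 0 == calling process */, SCHED_FIFO, &sp) == -1) {
		perror("sched_setscheduler");
		return 1;
	}
	printf("now running under SCHED_FIFO\n");
	return 0;
}
```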

## Conclusion
This chapter ruminated on the theory behind process scheduling and the specific implementation, algorithms, and interfaces used by the current Linux kernel.
# Supplement
Starting around version 5.3, Linux has gradually been gaining support for hard real-time systems.
PREEMPT_RT is the mechanism for moving Linux toward a hard real-time operating system.
https://hackmd.io/@sysprog/preempt-rt