# Operating System Concepts, Ch. 5: CPU Scheduling
## Basic Concepts
- Process execution switches between CPU bursts and I/O bursts.
- Multi-process scheduling can be **cooperative** or **preemptive** (preemption introduces race conditions on shared data).
- **Critical Sections**: short sections of code that access shared data and must not be interrupted or executed concurrently.
- A **dispatcher** needs to (1) perform the context switch, (2) switch to user mode, and (3) jump to the proper location in the user program to resume it (restoring the PC register). The time the dispatcher takes to stop one process and start another is the **dispatch latency**.
- Context-switch statistics are available on Linux (`vmstat`, or per process under `/proc`).
- Context switches are either **voluntary** (the process gives up the CPU, e.g., blocking for I/O) or **involuntary** (it is preempted).
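As a small illustration of the distinction, the sketch below (Linux-specific, assuming the `voluntary_ctxt_switches` / `nonvoluntary_ctxt_switches` fields that modern kernels expose) reads a process's own counters from `/proc`:

```c
#include <stdio.h>
#include <string.h>

/* Print this process's voluntary/involuntary context-switch counters
 * from /proc (Linux-specific). */
int main(void) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, f))
        if (strstr(line, "ctxt_switches"))   /* matches both counters */
            fputs(line, stdout);

    fclose(f);
    return 0;
}
```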
## Scheduling Criteria
There are different criteria for judging the quality of scheduling algorithms:
- **CPU utilization** (↑)
- **Throughput** (↑): #Processes completed per time unit.
- **Turnaround Time** (↓): Time from process submission to completion.
- **Waiting Time** (↓): How long a process waits in the ready queue.
- **Response Time** (↓): Time between request submission and first response.
For an interactive system, one should minimize the variance of response time (rather than the average).
## Scheduling Algorithms
These algorithms deal with **which process to choose in the ready queue**. We first discuss the case of a single CPU core.
### First-Come, First-Served (FCFS)
- A simple, FIFO structure.
- Could result in a **long total waiting time**: when long bursts come first, everything behind them waits (see the worked example after this list).

- **Convoy Effect**: with one CPU-bound process and several I/O-bound processes, the CPU and the I/O devices sit idle alternately. Mitigated if shorter processes are allowed to go first.
- A **non-preemptive** algorithm. Bad for interactive systems.
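A worked example (hypothetical burst lengths of 24, 3, and 3 ms, all arriving at once) shows how much arrival order matters under FCFS:

$$
\text{order } P_1,P_2,P_3:\quad \bar{W}=\frac{0+24+27}{3}=17\text{ ms};\qquad
\text{order } P_2,P_3,P_1:\quad \bar{W}=\frac{0+3+6}{3}=3\text{ ms}.
$$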
### Shortest-Job-First (SJF)
- The process with **shortest incoming CPU burst** is selected (identical length? FCFS).
- **Optimal** (in terms of **average waiting time**) for a given set of processes.
- **Difficulty**: Hard to know the length of the next burst of a process. One workaround is an **exponential average estimate**, $\tau_{n+1}=\alpha t_n+(1-\alpha)\tau_n$, where $t_n$ is the measured length of the $n$th burst and $\tau_n$ the previous estimate. For example, when $\alpha=0.5$, the most recent burst and the accumulated history are weighted equally (see the sketch after this list).
- Can be **preemptive** or **non-preemptive**. A preemptive variant may switch out the currently executing process if one with an even shorter remaining time enters the ready queue; it is then called **shortest-remaining-time-first** scheduling.
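A minimal sketch of the exponential average (the initial guess $\tau_0=10$ and the burst sequence are illustrative values only):

```c
#include <stdio.h>

/* Exponential average of CPU-burst lengths:
 * tau_{n+1} = alpha * t_n + (1 - alpha) * tau_n */
static double next_estimate(double alpha, double t_n, double tau_n) {
    return alpha * t_n + (1.0 - alpha) * tau_n;
}

int main(void) {
    double alpha = 0.5;
    double tau = 10.0;                          /* tau_0: initial guess */
    double bursts[] = {6, 4, 6, 4, 13, 13, 13}; /* observed bursts t_n  */

    for (int n = 0; n < 7; n++) {
        printf("tau_%d = %5.2f, observed t_%d = %2.0f\n",
               n, tau, n, bursts[n]);
        tau = next_estimate(alpha, bursts[n], tau);
    }
    return 0;
}
```

With $\alpha=0.5$ the estimate chases the observed bursts while smoothing out single outliers.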
### Round-Robin (RR)
- An FCFS with preemption.
- When a process runs longer than a predefined time slice, it is preempted.
- Usually a **long average waiting time**.
- The length of a time slice is usually > 10× the context-switch time, so switching overhead stays small.
- The length of time slices can affect **turnaround times**.
- Rule of thumb: 80% of CPU bursts should be shorter than the time slice.
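A minimal round-robin sketch, assuming all processes arrive at $t=0$ (the bursts of 24, 3, and 3 ms and the 4 ms quantum are hypothetical values):

```c
#include <stdio.h>

int main(void) {
    int burst[] = {24, 3, 3};          /* CPU bursts in ms */
    int n = 3, quantum = 4;
    int t = 0, done = 0;
    int remain[3], finish[3];

    for (int i = 0; i < n; i++) remain[i] = burst[i];

    /* Cycle through the processes; with simultaneous arrival this
     * reproduces the RR queue order. */
    while (done < n) {
        for (int i = 0; i < n; i++) {
            if (remain[i] == 0) continue;
            int run = remain[i] < quantum ? remain[i] : quantum;
            t += run;
            remain[i] -= run;
            if (remain[i] == 0) { finish[i] = t; done++; }
        }
    }
    for (int i = 0; i < n; i++)
        printf("P%d: turnaround %2d ms, waiting %2d ms\n",
               i + 1, finish[i], finish[i] - burst[i]);
    return 0;
}
```

The average waiting time here is $(6+4+7)/3\approx 5.67$ ms, versus 17 ms under FCFS with the same arrival order.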
### Priority
- Each process is associated with a **priority** (identical priorities? FCFS). SJF is a special case (priority = inverse of the predicted next CPU burst).
- **Internal Priority Definition**: time limits, memory requirements, #open files, CPU-I/O-burst ratio, etc.
- **External Priority Definition**: process importance, fund-related factors, political factors, etc.
- Can be **preemptive** or **non-preemptive**. In the former case, if a process of higher priority enters the ready queue, the current process is preempted.
- **Starvation**: Some low-priority processes may be indefinitely waiting.
- A solution to starvation is **aging**: the priority of a process increases the longer it waits in the ready queue (sketched below).
- Another option: combine priority and RR (for identical priorities).
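A toy sketch of aging (the priority values are hypothetical; lower number = higher priority):

```c
#include <stdio.h>

/* Aging: a waiting process gains one priority level per second,
 * so even the lowest-priority process eventually runs. */
int main(void) {
    int running_prio = 60;   /* CPU-bound process holding the CPU */
    int waiting_prio = 127;  /* otherwise-starving process        */
    int seconds = 0;

    while (waiting_prio >= running_prio) {
        waiting_prio--;      /* age the waiting process */
        seconds++;
    }
    printf("after %d s, the waiting process (prio %d) preempts "
           "the running one (prio %d)\n",
           seconds, waiting_prio, running_prio);
    return 0;
}
```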
### Multi-Level Queues
- Searching for the process with highest priority can take $O(n)$ time.
- **Multi-level Queues** can be used: one queue per priority.
- This scheme can also divide processes by foreground/background status. Each queue can even use a different scheduling policy.
- Processes permanently stay in their own queue.
- **Option 1**: Queues of lower priorities are allowed to run only when all higher priority ones are empty.
- **Option 2**: Time ratio based. For example, the "real-time" queue is given 80% of the CPU time (to run its own scheduling over processes), and the "background" queue 20%.
### Multi-Level Feedback Queues
- For flexibility's sake, processes in this scheme are allowed to move between levels.
- If a process has overly long CPU bursts, it gets its priority lowered.
- In this way, I/O-bound and interactive processes end up with higher priority.
- **Starvation Prevention**: A process waiting too long gets its priority increased.

- Needs to define (1) the number of queues, (2) the scheduling policy of each queue, (3) the criteria and method for promoting/demoting processes, and (4) the method of deciding which queue a process enters initially.
- The most complex, but also the most general, scheme.
## Thread Scheduling
User threads are ultimately mapped to kernel threads (possibly via lightweight processes, LWPs).
- **Many-to-One and Many-to-Many**: The thread library schedules user-level threads to run on an available LWP; this is known as **process-contention scope (PCS)** scheduling, and is usually done by priority.
- Competition for the CPU happens among the threads of the same process.
- Being scheduled onto an LWP does not mean running on a physical core; that requires the OS to schedule the LWP's kernel thread.
- **System-contention scope (SCS)**: the scheduling of kernel threads, where competition for the CPU takes place among all threads in the system. One-to-one systems use only SCS.
- Pthreads has an interface to set how a thread is scheduled (SCS/PCS); see the sketch below.
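A minimal sketch of that interface (`pthread_attr_setscope`); note that Linux supports only `PTHREAD_SCOPE_SYSTEM`, so requesting PCS fails there:

```c
#include <pthread.h>
#include <stdio.h>

static void *runner(void *arg) {
    (void)arg;
    printf("thread running\n");
    return NULL;
}

int main(void) {
    pthread_attr_t attr;
    pthread_t tid;
    int scope;

    pthread_attr_init(&attr);
    if (pthread_attr_getscope(&attr, &scope) == 0)
        printf("default scope: %s\n",
               scope == PTHREAD_SCOPE_SYSTEM ? "SCS" : "PCS");

    /* Request system contention scope (SCS) for the new thread. */
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

    pthread_create(&tid, &attr, runner, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}
```

Compile with `-pthread`.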
## Multi-Processor Scheduling
### Approaches
1. **Asymmetric Multiprocessing**: a master core handles all scheduling decisions and system activities; the others run only user code. **Pros**: simple, no data sharing. **Cons**: the master core can become a bottleneck.
2. **Symmetric Multiprocessing (SMP)**: all cores are self-scheduling. They can share a common ready queue or have per-core queues. With a common queue there can be race conditions (duplicated scheduling, lost threads, etc.).
### Multicore Processors
- Multiple CPU cores are more efficient in terms of power consumption and performance than multiple single-core CPU chips.
- **Memory Stall**: when accessing memory, a processor spends a long time waiting for the data to become available (processors are much faster than memory, and cache misses make it worse).
- Each core can host **multiple hardware threads**, whose execution is interleaved to hide memory stalls. Also called **Chip Multithreading (CMT)**.
- **Coarse-Grained Multithreading**: a thread executes until a long-latency event happens; switching is a high-overhead operation (the instruction pipeline must be flushed for the old thread and refilled for the new one).
- **Fine-Grained Multithreading**: thread switching happens at a finer granularity, e.g., at instruction-cycle boundaries, and the architecture includes logic for fast switching.
- Physical core resources (caches, pipelines, ...) must be shared among HW threads. Thus, there are two levels of scheduling.

- The first level (choosing which software thread runs on each hardware thread) is done by the OS. The second level (deciding which hardware thread a core executes) can be RR or priority-based. The two levels are not necessarily mutually exclusive: an OS aware of how hardware threads share core resources can make better decisions.
### Load Balancing
- **Load Balancing** attempts to keep workloads evenly distributed.
- **Push Migration**: a specific task periodically checks the load on each processor and fixes imbalances by pushing threads from overloaded processors to idle or less-busy ones.
- **Pull Migration**: an idle processor pulls a waiting task from a busy processor. The two are not necessarily mutually exclusive.
- What "balanced" means varies: the same number of threads per queue? An even distribution of priorities? **Load balancing may even work against the goals of scheduling algorithms** (exercise).
### Processor Affinity
- A thread running for some time can have a **warm cache** with high cache hit rates. When it is moved to another processor (e.g. by load balancing), this advantage is lost. Most OSs try to keep a thread on the same processor.
- Per-processor ready queues imply processor affinity.
- Linux can configure **soft** and **hard** affinity; hard affinity is set with the `sched_setaffinity()` system call (see the sketch below).
- If an OS is **NUMA-aware**, it can allocate a thread's memory close to the CPU the thread runs on.
- Naturally, load balancing works against processor affinity.
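A minimal hard-affinity sketch using Linux's `sched_setaffinity()` (pinning the calling process to CPU 0):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);                 /* allow only CPU 0 */
    if (sched_setaffinity(0, sizeof set, &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }

    /* Read the mask back to verify. */
    CPU_ZERO(&set);
    sched_getaffinity(0, sizeof set, &set);
    printf("allowed CPUs: %d (CPU0 allowed: %s)\n",
           CPU_COUNT(&set), CPU_ISSET(0, &set) ? "yes" : "no");
    return 0;
}
```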
### Heterogeneous Multiprocessing
- Processors in a system can differ from each other in clock speed, power management, etc., especially on mobile systems.
- ARM processors supporting this call it **big.LITTLE**: **big** cores are faster but consume more power, so they run short bursts of demanding work; energy-efficient **LITTLE** cores handle longer-running background work.
## Real-Time CPU Scheduling
- **Soft** and **Hard**: soft real-time only guarantees that critical tasks get preference; for hard real-time OSs, a task must be serviced before its deadline expires.
### Latency
- RTOSs usually wait for real-time (SW/HW) events to occur.
- Different events have different latency requirements (e.g. 5ms for antilock brake systems).
- **Interrupt Latency**: from interrupt arrival to the beginning of its service routine. The OS first has to complete the instruction it is executing, determine the interrupt type, and save the current state before executing the ISR.
- For hard RTOSs, interrupt latency has to be strictly bounded. Updating kernel data structures may require interrupts to be disabled, so those disabled periods need to be regulated.
- **Dispatch Latency**: the amount of time the scheduling dispatcher needs to stop one process and start another. Its **conflict phase** consists of (1) preempting any process running in the kernel and (2) having low-priority processes release resources needed by the high-priority process.
### Priority-Based Scheduling
- Preemptive, priority-based scheduling only guarantees soft real-time functionality.
- For hard RTOSs, deadline requirements have to be met.
- **Admission Control**: a process announces its deadline requirements, and the OS either accepts the request (guaranteeing the deadline) or rejects it as infeasible.
### Rate Monotonic Scheduling
- Each periodic task is assigned a static priority according to its period: the shorter the period, the higher the priority.
- Lower-priority tasks can be preempted by higher-priority ones (those with shorter periods).
- **Optimality**: if a set of processes cannot be scheduled by this algorithm, it cannot be scheduled by any other algorithm that assigns **static priorities**.
- Deadlines are not guaranteed in general; schedulability is only assured up to a CPU-utilization bound (see below).
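The bound referenced above is the classic rate-monotonic schedulability result: a set of $N$ periodic processes is guaranteed schedulable if

$$
U=\sum_{i=1}^{N}\frac{t_i}{p_i}\;\le\;N\left(2^{1/N}-1\right),
$$

where $t_i$ is the burst and $p_i$ the period. For $N=2$ the bound is $2(\sqrt{2}-1)\approx 0.83$; as $N\to\infty$ it falls to $\ln 2\approx 0.69$.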
### Earliest Deadline First (EDF)
- The priority of a process is determined by its deadline (a dynamic priority): the earlier the deadline, the higher the priority.
- **Theoretical Optimality**: every process meets its deadline whenever that is possible at all, and CPU utilization can approach 100%. In practice, context switches and interrupt handling eat into this.
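A worked comparison (hypothetical periods and bursts): $P_1$ with period 50 and burst 25, $P_2$ with period 80 and burst 35. Total utilization is

$$
U=\frac{25}{50}+\frac{35}{80}=0.94\le 1,
$$

so EDF can schedule the set, while 0.94 exceeds the two-process rate-monotonic bound of $\approx 0.83$, so RMS gives no guarantee.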
### Proportional Share Scheduling
- Assigns each process a share of the CPU time (e.g., out of 100 shares in total, A gets 50, B gets 15, and C gets 20).
- Must work with **admission control** for share allocation: a request is granted only if enough unallocated shares remain (see the sketch below).
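A minimal admission-control sketch (the 100-share budget and the requests mirror the example above; all values are illustrative):

```c
#include <stdbool.h>
#include <stdio.h>

#define TOTAL_SHARES 100

static int allocated = 0;

/* Grant a share request only if it still fits in the budget. */
static bool admit(const char *name, int shares) {
    bool ok = allocated + shares <= TOTAL_SHARES;
    if (ok) allocated += shares;
    printf("%s requests %d shares: %s (allocated %d/%d)\n",
           name, shares, ok ? "admitted" : "rejected",
           allocated, TOTAL_SHARES);
    return ok;
}

int main(void) {
    admit("A", 50);
    admit("B", 15);
    admit("C", 20);
    admit("D", 30);   /* 85 + 30 > 100, so D is rejected */
    return 0;
}
```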
## OS Examples
### Linux
- Version 2.6.23 introduced the **Completely Fair Scheduler (CFS)**. Before that, Linux used a non-SMP-aware scheduler, then an *O(1)* scheduler that scaled well on SMP but gave poor response times for interactive processes.
- **Scheduling Classes**: each class is associated with a different priority. The system selects the highest-priority task from the highest-priority class.
- Standard Linux has two classes: the default CFS class and a real-time class.
- CFS assigns a proportion of CPU time to each task based on its **nice value** (from -20 to 19). A *nicer* process is assigned a smaller proportion (as if it yields to others).
- **Targeted Latency**: CFS does not use discrete time slices. Instead it defines an interval during which every runnable task should run at least once (the interval grows when there are many tasks).
- **Virtual Runtime**: each task has a `vruntime`, its actual running time scaled by its priority (see the sketch at the end of this subsection). After running for 200 ms:
  - A process with nice value = 0 accrues 200 ms of `vruntime`.
  - A process with a lower nice value (higher priority) accrues < 200 ms.
  - A process with a higher nice value (lower priority) accrues > 200 ms.
- The scheduler always selects the task with the smallest `vruntime` (the running task is preemptable).
- This lets an I/O-bound process preempt a CPU-bound one whenever it becomes runnable.
- Linux uses priority range 0-99 for real-time tasks and 100-139 for normal tasks; the 100-139 range maps onto nice values -20 to 19.
- CFS comes with NUMA-aware load balancing. The **load** of a task combines its priority and average CPU utilization; CFS attempts to evenly distribute the total load across the processors' queues without needlessly sacrificing memory locality.
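A simplified sketch of the `vruntime` scaling above. The real kernel uses a fixed integer weight table (`sched_prio_to_weight`), which is approximately $1024/1.25^{\text{nice}}$; that approximation is assumed here:

```c
#include <math.h>
#include <stdio.h>

/* Approximate CFS weighting: vruntime advances at rate
 * (weight of nice 0) / (weight of this task) times real time. */
static double vruntime_delta(double actual_ms, int nice) {
    double weight = 1024.0 / pow(1.25, nice);
    return actual_ms * (1024.0 / weight);
}

int main(void) {
    /* 200 ms of actual running time at different nice values. */
    for (int nice = -5; nice <= 5; nice += 5)
        printf("nice %+d: vruntime += %6.1f ms\n",
               nice, vruntime_delta(200.0, nice));
    return 0;
}
```

Output: roughly 65.5 ms at nice -5, exactly 200 ms at nice 0, and about 610 ms at nice +5, matching the three bullets above.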
### Windows
- WIP (5.7.2)
### Solaris
- WIP (5.7.3)
## Algorithm Evaluation
- Algorithm selection involves weighing factors: CPU utilization, throughput, response time etc.
- **Analytic Evaluation**: uses the algorithm and a given system workload to produce a formula or number.
- **Deterministic Modeling**: takes a predetermined workload and computes each algorithm's performance on it exactly.
- **Queueing Models**: describe CPU and I/O bursts by probability distributions, commonly exponential ones (which model the time between events of a Poisson point process), along with an arrival-time distribution. From these two kinds of distributions, average throughput, utilization, waiting time, etc. can be computed.
  - In steady state, the number of processes leaving a queue must equal the number of processes entering it.
  - **Little's Formula**: $n=\lambda\times W$, i.e., $\text{Queue Length}=\text{Arrival Rate}\times\text{Waiting Time}$. E.g., if $\lambda=7$ processes/s arrive and a process waits $W=2$ s on average, the queue holds $n=14$ processes on average.
- **Simulations**: software-driven simulation of the scheduling algorithms, potentially with randomized process workloads or trace data gathered from real systems.
- Programmers who are aware of the scheduling policy can game it, for example to maximize the priority of their own programs.