# Operating System Concepts, Ch. 5: CPU Scheduling
## Basic Concepts
- Process execution switches between CPU bursts and I/O bursts.
- Multi-process scheduling can be **cooperative** or **preemptive** (preemption introduces race conditions on shared data).
- **Critical Sections**: short sections of code that access shared data and must not be interrupted or executed concurrently.
- A **dispatcher** needs to (1) perform the context switch, (2) switch to user mode, and (3) jump to the proper location in the user program to resume it (restoring the PC register). The time the dispatcher takes to stop one process and start another is the **dispatch latency**.
- Context-switch statistics are available on Linux (`vmstat`, or per process under `/proc`).
- Context switches are either **voluntary** (the process gives up the CPU, e.g., blocking for I/O) or **involuntary** (it is preempted).
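As a small illustration of the distinction, the sketch below (Linux-specific, assuming the `voluntary_ctxt_switches` / `nonvoluntary_ctxt_switches` fields that modern kernels expose) reads a process's own counters from `/proc`:

```c
#include <stdio.h>
#include <string.h>

/* Print this process's voluntary/involuntary context-switch counters
 * from /proc (Linux-specific). */
int main(void) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, f))
        if (strstr(line, "ctxt_switches"))   /* matches both counters */
            fputs(line, stdout);

    fclose(f);
    return 0;
}
```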
## Scheduling Criteria
There are different criteria for judging the quality of scheduling algorithms:
- **CPU utilization** (↑)
- **Throughput** (↑): #Processes completed per time unit.
- **Turnaround Time** (↓): Time from process submission to completion.
- **Waiting Time** (↓): How long a process waits in the ready queue.
- **Response Time** (↓): Time between request submission and first response.
For an interactive system, one should minimize the variance of response time (rather than the average).
## Scheduling Algorithms
These algorithms deal with **which process to choose in the ready queue**. We first discuss the case of a single CPU core.
### First-Come, First-Served (FCFS)
- A simple, FIFO structure.
- Could result in a **long total waiting time**: when long bursts come first, everything behind them waits (see the worked example after this list).

- **Convoy Effect**: with one CPU-bound process and several I/O-bound processes, the CPU and the I/O devices sit idle alternately. Mitigated if shorter processes are allowed to go first.
- A **non-preemptive** algorithm. Bad for interactive systems.
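A worked example (hypothetical burst lengths of 24, 3, and 3 ms, all arriving at once) shows how much arrival order matters under FCFS:

$$
\text{order } P_1,P_2,P_3:\quad \bar{W}=\frac{0+24+27}{3}=17\text{ ms};\qquad
\text{order } P_2,P_3,P_1:\quad \bar{W}=\frac{0+3+6}{3}=3\text{ ms}.
$$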
### Shortest-Job-First (SJF)
- The process with **shortest incoming CPU burst** is selected (identical length? FCFS).
- **Optimal** (in terms of **average waiting time**) for a given set of processes.
- **Difficulty**: Hard to know the length of the next burst of a process. One workaround is an **exponential average estimate**, $\tau_{n+1}=\alpha t_n+(1-\alpha)\tau_n$, where $t_n$ is the measured length of the $n$th burst and $\tau_n$ the previous estimate. For example, when $\alpha=0.5$, the most recent burst and the accumulated history are weighted equally (see the sketch after this list).
- Can be **preemptive** or **non-preemptive**. A preemptive variant may switch out the currently executing process if one with an even shorter remaining time enters the ready queue; it is then called **shortest-remaining-time-first** scheduling.
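A minimal sketch of the exponential average (the initial guess $\tau_0=10$ and the burst sequence are illustrative values only):

```c
#include <stdio.h>

/* Exponential average of CPU-burst lengths:
 * tau_{n+1} = alpha * t_n + (1 - alpha) * tau_n */
static double next_estimate(double alpha, double t_n, double tau_n) {
    return alpha * t_n + (1.0 - alpha) * tau_n;
}

int main(void) {
    double alpha = 0.5;
    double tau = 10.0;                          /* tau_0: initial guess */
    double bursts[] = {6, 4, 6, 4, 13, 13, 13}; /* observed bursts t_n  */

    for (int n = 0; n < 7; n++) {
        printf("tau_%d = %5.2f, observed t_%d = %2.0f\n",
               n, tau, n, bursts[n]);
        tau = next_estimate(alpha, bursts[n], tau);
    }
    return 0;
}
```

With $\alpha=0.5$ the estimate chases the observed bursts while smoothing out single outliers.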
### Round-Robin (RR)
- An FCFS with preemption.
- When a process runs longer than a predefined time slice, it is preempted.
- Usually a **long average waiting time**.
- The length of a time slice is usually > 10× the context-switch time, so switching overhead stays small.
- The length of time slices can affect **turnaround times**.
- Rule of thumb: 80% of CPU bursts should be shorter than the time slice.
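A minimal round-robin sketch, assuming all processes arrive at $t=0$ (the bursts of 24, 3, and 3 ms and the 4 ms quantum are hypothetical values):

```c
#include <stdio.h>

int main(void) {
    int burst[] = {24, 3, 3};          /* CPU bursts in ms */
    int n = 3, quantum = 4;
    int t = 0, done = 0;
    int remain[3], finish[3];

    for (int i = 0; i < n; i++) remain[i] = burst[i];

    /* Cycle through the processes; with simultaneous arrival this
     * reproduces the RR queue order. */
    while (done < n) {
        for (int i = 0; i < n; i++) {
            if (remain[i] == 0) continue;
            int run = remain[i] < quantum ? remain[i] : quantum;
            t += run;
            remain[i] -= run;
            if (remain[i] == 0) { finish[i] = t; done++; }
        }
    }
    for (int i = 0; i < n; i++)
        printf("P%d: turnaround %2d ms, waiting %2d ms\n",
               i + 1, finish[i], finish[i] - burst[i]);
    return 0;
}
```

The average waiting time here is $(6+4+7)/3\approx 5.67$ ms, versus 17 ms under FCFS with the same arrival order.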
### Priority
- Each process is associated with a **priority** (identical priorities? FCFS). SJF is a special case (priority = inverse of the predicted next CPU burst).
- **Internal Priority Definition**: time limits, memory requirements, #open files, CPU-I/O-burst ratio, etc.
- **External Priority Definition**: process importance, fund-related factors, political factors, etc.
- Can be **preemptive** or **non-preemptive**. In the former case, if a process of higher priority enters the ready queue, the current process is preempted.
- **Starvation**: Some low-priority processes may be indefinitely waiting.
- A solution to starvation is **aging**: the priority of a process increases the longer it waits in the ready queue (sketched below).
- Another option: combine priority and RR (for identical priorities).
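A toy sketch of aging (the priority values are hypothetical; lower number = higher priority):

```c
#include <stdio.h>

/* Aging: a waiting process gains one priority level per second,
 * so even the lowest-priority process eventually runs. */
int main(void) {
    int running_prio = 60;   /* CPU-bound process holding the CPU */
    int waiting_prio = 127;  /* otherwise-starving process        */
    int seconds = 0;

    while (waiting_prio >= running_prio) {
        waiting_prio--;      /* age the waiting process */
        seconds++;
    }
    printf("after %d s, the waiting process (prio %d) preempts "
           "the running one (prio %d)\n",
           seconds, waiting_prio, running_prio);
    return 0;
}
```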
### Multi-Level Queues
- Searching for the process with highest priority can take $O(n)$ time.
- **Multi-level Queues** can be used: one queue per priority.
- This scheme can also divide processes by foreground/background status. Each queue can even use a different scheduling policy.
- Processes permanently stay in their own queue.
- **Option 1**: Queues of lower priorities are allowed to run only when all higher priority ones are empty.
- **Option 2**: Time ratio based. For example, the "real-time" queue is given 80% of the CPU time (to run its own scheduling over processes), and the "background" queue 20%.
### Multi-Level Feedback Queues
- For flexibility's sake, processes in this scheme are allowed to move between levels.
- If a process has overly long CPU bursts, it gets its priority lowered.
- In this way, I/O-bound and interactive processes end up with higher priority.
- **Starvation Prevention**: A process waiting too long gets its priority increased.

- Needs to define (1) the number of queues, (2) the scheduling policy of each queue, (3) the criteria and method for promoting/demoting processes, and (4) the method of deciding which queue a process enters initially.
- The most complex, but also the most general, scheme.
## Thread Scheduling
User threads are ultimately mapped to kernel threads (possibly via lightweight processes, LWPs).
- **Many-to-One and Many-to-Many**: The thread library schedules user-level threads to run on an available LWP; this is known as **process-contention scope (PCS)** scheduling, and is usually done by priority.
- Competition for the CPU happens among the threads of the same process.
- Being scheduled onto an LWP does not mean running on a physical core; that requires the OS to schedule the LWP's kernel thread.
- **System-contention scope (SCS)**: the scheduling of kernel threads, where competition for the CPU takes place among all threads in the system. One-to-one systems use only SCS.
- Pthreads has an interface to set how a thread is scheduled (SCS/PCS); see the sketch below.
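A minimal sketch of that interface (`pthread_attr_setscope`); note that Linux supports only `PTHREAD_SCOPE_SYSTEM`, so requesting PCS fails there:

```c
#include <pthread.h>
#include <stdio.h>

static void *runner(void *arg) {
    (void)arg;
    printf("thread running\n");
    return NULL;
}

int main(void) {
    pthread_attr_t attr;
    pthread_t tid;
    int scope;

    pthread_attr_init(&attr);
    if (pthread_attr_getscope(&attr, &scope) == 0)
        printf("default scope: %s\n",
               scope == PTHREAD_SCOPE_SYSTEM ? "SCS" : "PCS");

    /* Request system contention scope (SCS) for the new thread. */
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

    pthread_create(&tid, &attr, runner, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}
```

Compile with `-pthread`.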
## Multi-Processor Scheduling
### Approaches
1. **Asymmetric Multiprocessing**: a master core handles all scheduling decisions and system activities; the others run only user code. **Pros**: simple, no data sharing. **Cons**: the master core can become a bottleneck.
2. **Symmetric Multiprocessing (SMP)**: all cores are self-scheduling. They can share a common ready queue or have per-core queues. With a common queue there can be race conditions (duplicated scheduling, lost threads, etc.).
### Multicore Processors
- Multiple CPU cores are more efficient in terms of power consumption and performance than multiple single-core CPU chips.
- **Memory Stall**: when accessing memory, a processor spends a long time waiting for the data to become available (processors are much faster than memory, and cache misses make it worse).
- Each core can host **multiple hardware threads**, whose execution is interleaved to hide memory stalls. Also called **Chip Multithreading (CMT)**.
- **Coarse-Grained Multithreading**: a thread executes until a long-latency event happens; switching is a high-overhead operation (the instruction pipeline must be flushed for the old thread and refilled for the new one).
- **Fine-Grained Multithreading**: thread switching happens at a finer granularity, e.g., at instruction-cycle boundaries, and the architecture includes logic for fast switching.
- Physical core resources (caches, pipelines, ...) must be shared among HW threads. Thus, there are two levels of scheduling.

- The first level (choosing which software thread runs on each hardware thread) is done by the OS. The second level (deciding which hardware thread a core executes) can be RR or priority-based. The two levels are not necessarily mutually exclusive: an OS aware of how hardware threads share core resources can make better decisions.
### Load Balancing
- **Load Balancing** attempts to keep workloads evenly distributed.
- **Push Migration**: a specific task periodically checks the load on each processor and fixes imbalances by pushing threads from overloaded processors to idle or less-busy ones.
- **Pull Migration**: an idle processor pulls a waiting task from a busy processor. The two are not necessarily mutually exclusive.
- What "balanced" means varies: the same number of threads per queue? An even distribution of priorities? **Load balancing may even work against the goals of scheduling algorithms** (exercise).
### Processor Affinity
- A thread running for some time can have a **warm cache** with high cache hit rates. When it is moved to another processor (e.g. by load balancing), this advantage is lost. Most OSs try to keep a thread on the same processor.
- Per-processor ready queues imply processor affinity.
- Linux can configure **soft** and **hard** affinity; hard affinity is set with the `sched_setaffinity()` system call (see the sketch below).
- If an OS is **NUMA-aware**, it can allocate a thread's memory close to the CPU the thread runs on.
- Naturally, load balancing works against processor affinity.
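A minimal hard-affinity sketch using Linux's `sched_setaffinity()` (pinning the calling process to CPU 0):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);                 /* allow only CPU 0 */
    if (sched_setaffinity(0, sizeof set, &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }

    /* Read the mask back to verify. */
    CPU_ZERO(&set);
    sched_getaffinity(0, sizeof set, &set);
    printf("allowed CPUs: %d (CPU0 allowed: %s)\n",
           CPU_COUNT(&set), CPU_ISSET(0, &set) ? "yes" : "no");
    return 0;
}
```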
### Heterogeneous Multiprocessing
- Processors in a system can differ from each other in clock speed, power management, etc., especially on mobile systems.
- ARM processors supporting this call it **big.LITTLE**: **big** cores are faster but consume more power, so they run short bursts of demanding work; energy-efficient **LITTLE** cores handle longer-running background work.
## Real-Time CPU Scheduling
- **Soft** and **Hard**: soft real-time only guarantees that critical tasks get preference; for hard real-time OSs, a task must be serviced before its deadline expires.
### Latency
- RTOSs usually wait for real-time (SW/HW) events to occur.
- Different events have different latency requirements (e.g. 5ms for antilock brake systems).
- **Interrupt Latency**: from interrupt arrival to the beginning of its service routine. The OS first has to complete the instruction it is executing, determine the interrupt type, and save the current state before executing the ISR.
- For hard RTOSs, interrupt latency has to be strictly bounded. Updating kernel data structures may require interrupts to be disabled, so those disabled periods need to be regulated.
- **Dispatch Latency**: the amount of time the scheduling dispatcher needs to stop one process and start another. Its **conflict phase** consists of (1) preempting any process running in the kernel and (2) having low-priority processes release resources needed by the high-priority process.
### Priority-Based Scheduling
- Preemptive, priority-based scheduling only guarantees soft real-time functionality.
- For hard RTOSs, deadline requirements have to be met.
- **Admission Control**: a process announces its deadline requirements, and the OS either accepts the request (guaranteeing the deadline) or rejects it as infeasible.
### Rate Monotonic Scheduling
- Each periodic task is assigned a static priority according to its period: the shorter the period, the higher the priority.
- Lower-priority tasks can be preempted by higher-priority ones (those with shorter periods).
- **Optimality**: if a set of processes cannot be scheduled by this algorithm, it cannot be scheduled by any other algorithm that assigns **static priorities**.
- Deadlines are not guaranteed in general; schedulability is only assured up to a CPU-utilization bound (see below).
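The bound referenced above is the classic rate-monotonic schedulability result: a set of $N$ periodic processes is guaranteed schedulable if

$$
U=\sum_{i=1}^{N}\frac{t_i}{p_i}\;\le\;N\left(2^{1/N}-1\right),
$$

where $t_i$ is the burst and $p_i$ the period. For $N=2$ the bound is $2(\sqrt{2}-1)\approx 0.83$; as $N\to\infty$ it falls to $\ln 2\approx 0.69$.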
### Earliest Deadline First (EDF)
- The priority of a process is determined by its deadline (a dynamic priority): the earlier the deadline, the higher the priority.
- **Theoretical Optimality**: every process meets its deadline whenever that is possible at all, and CPU utilization can approach 100%. In practice, context switches and interrupt handling eat into this.
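A worked comparison (hypothetical periods and bursts): $P_1$ with period 50 and burst 25, $P_2$ with period 80 and burst 35. Total utilization is

$$
U=\frac{25}{50}+\frac{35}{80}=0.94\le 1,
$$

so EDF can schedule the set, while 0.94 exceeds the two-process rate-monotonic bound of $\approx 0.83$, so RMS gives no guarantee.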
### Proportional Share Scheduling
- Assigns each process a share of the CPU time (e.g., out of 100 shares in total, A gets 50, B gets 15, and C gets 20).
- Must work with **admission control** for share allocation: a request is granted only if enough unallocated shares remain (see the sketch below).
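A minimal admission-control sketch (the 100-share budget and the requests mirror the example above; all values are illustrative):

```c
#include <stdbool.h>
#include <stdio.h>

#define TOTAL_SHARES 100

static int allocated = 0;

/* Grant a share request only if it still fits in the budget. */
static bool admit(const char *name, int shares) {
    bool ok = allocated + shares <= TOTAL_SHARES;
    if (ok) allocated += shares;
    printf("%s requests %d shares: %s (allocated %d/%d)\n",
           name, shares, ok ? "admitted" : "rejected",
           allocated, TOTAL_SHARES);
    return ok;
}

int main(void) {
    admit("A", 50);
    admit("B", 15);
    admit("C", 20);
    admit("D", 30);   /* 85 + 30 > 100, so D is rejected */
    return 0;
}
```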
## OS Examples
### Linux
- Version 2.6.23 introduced the **Completely Fair Scheduler (CFS)**. Before that, Linux used a non-SMP-aware scheduler, then an *O(1)* scheduler that scaled well on SMP but gave poor response times for interactive processes.
- **Scheduling Classes**: each class is associated with a different priority. The system selects the highest-priority task from the highest-priority class.
- Standard Linux has two classes: the default CFS class and a real-time class.
- CFS assigns a proportion of CPU time to each task based on its **nice value** (from -20 to 19). A *nicer* process is assigned a smaller proportion (as if it yields to others).
- **Targeted Latency**: CFS does not use discrete time slices. Instead it defines an interval during which every runnable task should run at least once (the interval grows when there are many tasks).
- **Virtual Runtime**: each task has a `vruntime`, its actual running time scaled by its priority (see the sketch at the end of this subsection). After running for 200 ms:
  - A process with nice value = 0 accrues 200 ms of `vruntime`.
  - A process with a lower nice value (higher priority) accrues < 200 ms.
  - A process with a higher nice value (lower priority) accrues > 200 ms.
- The scheduler always selects the task with the smallest `vruntime` (the running task is preemptable).
- This lets an I/O-bound process preempt a CPU-bound one whenever it becomes runnable.
- Linux uses priority range 0-99 for real-time tasks and 100-139 for normal tasks; the 100-139 range maps onto nice values -20 to 19.
- CFS comes with NUMA-aware load balancing. The **load** of a task combines its priority and average CPU utilization; CFS attempts to evenly distribute the total load across the processors' queues without needlessly sacrificing memory locality.
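A simplified sketch of the `vruntime` scaling above. The real kernel uses a fixed integer weight table (`sched_prio_to_weight`), which is approximately $1024/1.25^{\text{nice}}$; that approximation is assumed here:

```c
#include <math.h>
#include <stdio.h>

/* Approximate CFS weighting: vruntime advances at rate
 * (weight of nice 0) / (weight of this task) times real time. */
static double vruntime_delta(double actual_ms, int nice) {
    double weight = 1024.0 / pow(1.25, nice);
    return actual_ms * (1024.0 / weight);
}

int main(void) {
    /* 200 ms of actual running time at different nice values. */
    for (int nice = -5; nice <= 5; nice += 5)
        printf("nice %+d: vruntime += %6.1f ms\n",
               nice, vruntime_delta(200.0, nice));
    return 0;
}
```

Output: roughly 65.5 ms at nice -5, exactly 200 ms at nice 0, and about 610 ms at nice +5, matching the three bullets above.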
### Windows
- WIP (5.7.2)
### Solaris
- WIP (5.7.3)
## Algorithm Evaluation
- Algorithm selection involves weighing factors: CPU utilization, throughput, response time etc.
- **Analytic Evaluation**: uses the algorithm and a given system workload to produce a formula or number.
- **Deterministic Modeling**: takes a predetermined workload and computes each algorithm's performance on it exactly.
- **Queueing Models**: describe CPU and I/O bursts by probability distributions, commonly exponential ones (which model the time between events of a Poisson point process), along with an arrival-time distribution. From these two kinds of distributions, average throughput, utilization, waiting time, etc. can be computed.
  - In steady state, the number of processes leaving a queue must equal the number of processes entering it.
  - **Little's Formula**: $n=\lambda\times W$, i.e., $\text{Queue Length}=\text{Arrival Rate}\times\text{Waiting Time}$. E.g., if $\lambda=7$ processes/s arrive and a process waits $W=2$ s on average, the queue holds $n=14$ processes on average.
- **Simulations**: software-driven simulation of the scheduling algorithms, potentially with randomized process workloads or trace data gathered from real systems.
- Programmers who are aware of the scheduling policy can game it, for example to maximize the priority of their own programs.