# Linux kernel Scheduler
> Presenter: YangYeh-PD
> [video](https://youtu.be/m6ps9Z9htzM)

:::info
My Question List
1. Are scheduling classes specified by the source program or identified by the scheduler?
2. Why does `SCHED_IDLE` not belong to `idle_sched_class`, and what is the difference between them?
3. What does a sleeping process do in the early Linux scheduler, and what does it have to do with ==interactivity==?
:::

## Descriptions
* Reading "Demystifying the Linux CPU Scheduler" and noting any questions or issues encountered.
* Reproducing the experiments in Chapter 7.

## Prerequisites
#### Update Linux to v6.1 or above
`Linhoward0522`'s research on the scheduler contains the related [operational instructions](https://hackmd.io/@sysprog/BJh9FdlS2#%E6%9B%B4%E6%96%B0%E8%87%B3-Linux-v61).
#### Download sched-analyzer
Go to [sched-analyzer](https://github.com/qais-yousef/sched-analyzer) and download it. After generating a trace file, we can load it into [perfetto](https://perfetto.dev/docs/) to visualize the behavior of the scheduler.

## Development Environment
```
$ gcc --version
gcc (Ubuntu 13.2.0-4ubuntu3) 13.2.0

$ uname -mrs
Linux 6.5.0-41-generic x86_64

$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
    CPU family:           6
    Model:                142
    Thread(s) per core:   2
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             10
    CPU(s) scaling MHz:   36%
    CPU max MHz:          3400.0000
    CPU min MHz:          400.0000
    BogoMIPS:             3600.00
Virtualization features:
  Virtualization:         VT-x
Caches (sum of all):
  L1d:                    128 KiB (4 instances)
  L1i:                    128 KiB (4 instances)
  L2:                     1 MiB (4 instances)
  L3:                     6 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-7
```

## TODO: Reading "Demystifying the Linux CPU Scheduler"
### The Importance of the CPU Scheduler
The CPU scheduler plays an important role not only in a **preemptive multi-tasking system**, but also in a **non-preemptive multi-tasking system**.
* In a preemptive multi-tasking system, the scheduler allows the operating system to ==forcibly interrupt== a running process to assign the CPU to another process. This ensures that all processes get a fair share of CPU time and ==improves the system's responsiveness==.
* In a non-preemptive multi-tasking system, a newly created process is placed in the ready queue. The scheduler then **selects** a process from the ready queue (or **runqueue**), and the selected process keeps the CPU until it finishes or voluntarily yields.

```graphviz
digraph shuffle {
    rankdir = "LR"
    node [shape=box, style=filled];

    // Define nodes with different colors
    scheduler1 [label="scheduler", fillcolor="white"];
    head1 [shape=none, label="head", fillcolor=white];
    tail1 [shape=none, label="tail", fillcolor=white];
    node1 [label="task1", fillcolor="#ffbd19", color="black"];
    node2 [label="task2", fillcolor="white", color="black"];
    node3 [label="task3", fillcolor="white", color="black"];
    node4 [label="task4", fillcolor="white", color="black"];
    node5 [label="task5", fillcolor="white", color="black"];

    // Connect the nodes
    scheduler1 -> node1;
    head1 -> node1;
    node1 -> node2;
    node2 -> node3;
    node3 -> node4;
    node4 -> node5;
    tail1 -> node5;
}
```

#### Interactivity vs. Throughput
The CPU scheduler aims to balance **interactivity** and **throughput**. For a system to respond immediately and interact effectively with users, ==preemptive multi-tasking is required==. However, frequent preemption adds context-switch overhead and can leave the CPU **idle** at certain moments, reducing overall CPU efficiency.
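To make the runqueue picture above concrete, the following is a minimal user-space sketch (my own illustration, not kernel code) of the non-preemptive case: the scheduler repeatedly takes the task at the head of a FIFO runqueue and lets it run to completion before picking the next one.

```c
#include <stdio.h>

#define MAX_TASKS 8

struct task {
    const char *name;
    int remaining_work;   /* arbitrary units of CPU work left */
};

/* The ready queue (runqueue) as a simple circular buffer. */
static struct task *runqueue[MAX_TASKS];
static int rq_head, rq_tail, rq_len;

static void enqueue(struct task *t)
{
    runqueue[rq_tail] = t;
    rq_tail = (rq_tail + 1) % MAX_TASKS;
    rq_len++;
}

static struct task *pick_next(void)
{
    if (rq_len == 0)
        return NULL;
    struct task *t = runqueue[rq_head];
    rq_head = (rq_head + 1) % MAX_TASKS;
    rq_len--;
    return t;
}

int main(void)
{
    struct task tasks[] = { { "task1", 3 }, { "task2", 1 }, { "task3", 2 } };

    for (int i = 0; i < 3; i++)
        enqueue(&tasks[i]);

    struct task *t;
    while ((t = pick_next()) != NULL) {
        /* Non-preemptive: the task keeps the CPU until it finishes. */
        while (t->remaining_work-- > 0)
            printf("%s is running\n", t->name);
    }
    return 0;
}
```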
#### Fairness and Liveness
**Fairness** means that tasks with the same priority should receive an ==equal amount of CPU time==. However, this alone is not sufficient. If a high-priority task continuously occupies the CPU, other tasks will be unable to execute. Therefore, the design of **liveness** becomes crucial. Liveness ensures that ==every task gets CPU resources== within a certain timeframe. With liveness, even lower-priority tasks that need to be completed can eventually be executed.

#### Real-Time (RT)
Real-time OSes are divided into two types: soft real-time and hard real-time.
* In a **soft real-time OS**, if a process misses its deadline, the consequences are not severe. The most common example is audio and video desynchronization.
* In a **hard real-time OS**, if a process misses its deadline, it can lead to serious consequences, potentially even causing the system to crash.

#### Power Efficiency
The Linux CPU scheduler employs different power-saving methods depending on the platform. On desktops and laptops, there are two main power-saving strategies:
* ==Prioritizing low-power tasks==, such as idle tasks or lower-priority tasks.
* ==Adjusting the CPU frequency== based on the workload. When the workload is light, the frequency can be reduced to save energy.

On servers, because the system needs to operate continuously, it is essential to strike a balance between throughput and power consumption. The strategy is:
* ==Prioritizing **CPU-bound tasks**== to increase cache efficiency (both temporal and spatial) and reduce inter-processor communication.

### Working Patterns
So how does the scheduler know whether a task needs to be executed first? Instead of inferring it, the Linux kernel requires the source program to ==specify the priority== (and scheduling policy) before handing the task over to the scheduler. The scheduler subsystem of the Linux kernel uses the concept of **scheduling classes** to implement the corresponding scheduling algorithms. There are 5 different scheduling classes in the current kernel version.

```graphviz
digraph shuffle {
    rankdir = "LR"
    node [shape=box, style=filled];

    // Define nodes with different colors
    node1 [label="Stop", fillcolor="#ffbd19", color="black"];
    node2 [label="Deadline", fillcolor="white", color="black"];
    node3 [label="Real-Time", fillcolor="white", color="black"];
    node4 [label="Fair", fillcolor="white", color="black"];
    node5 [label="Idle", fillcolor="grey", color="black"];

    // Connect the nodes
    node1 -> node2;
    node2 -> node3;
    node3 -> node4;
    node4 -> node5;
}
```

| Scheduling classes | Scheduling Policies |
|:------------------:|:---------------------------------------------:|
| `stop_sched_class` |                                                |
| `dl_sched_class`   | `SCHED_DEADLINE`                               |
| `rt_sched_class`   | `SCHED_FIFO` & `SCHED_RR`                      |
| `fair_sched_class` | `SCHED_NORMAL` & `SCHED_BATCH` & `SCHED_IDLE`  |
| `idle_sched_class` |                                                |

* **Stop** is used to schedule the per-CPU stop task, which is only available on symmetric multiprocessing (SMP) systems. It includes a mechanism to stop a kernel thread running on a specific CPU and **migrate it** to another CPU.
* **Deadline** guarantees that the deadline is respected. A task with this policy defines a **deadline**, before which the task has to finish its execution.
* In `rt_sched_class`, there are two scheduling policies, `SCHED_FIFO` and `SCHED_RR`, which are very similar.
* In `SCHED_FIFO`, FIFO means ==First In First Out==, so tasks are executed in the order they arrive.
However, if a higher-priority task needs to be executed, it will be given priority.
* `SCHED_RR` (round-robin) is similar to FIFO, but the difference is that RR has a ==timeslice==. It uses round-robin to ==cycle== through all the tasks with the same priority. Each task can run only up to a maximum fixed timeslice.
![Round-robin](https://hackmd.io/_uploads/H1K88QnL0.png)
* `SCHED_NORMAL` (also known as `SCHED_OTHER`) is the default policy used for ==regular tasks==; such tasks are handled by **CFS (the Completely Fair Scheduler)**. `SCHED_BATCH` is similar to `SCHED_NORMAL`, but it preempts less frequently.
* `SCHED_IDLE` is for tasks with very low priority; almost any task can preempt tasks with this policy.
* **Idle** is used to schedule the per-CPU idle task (also called the **swapper task**), which runs only if no other task is runnable.

### Early CPU Scheduler
In v0.01 of the Linux kernel, all tasks are stored in ==a single array==. This array is not only the list of all tasks but also the runqueue. The length of this array is `NR_TASKS`, which is 64. An empty entry is expressed as `NULL`.

```graphviz
graph G{
    1[shape=none,margin=0,label=<
    <table BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
        <tr>
            <td bgcolor="skyblue"> runnable 10 </td>
            <td bgcolor="skyblue"> runnable 15 </td>
            <td bgcolor="grey"> NULL </td>
            <td bgcolor="skyblue"> runnable 5 </td>
            <td bgcolor="lightcoral"> sleep 12 </td>
        </tr>
    </table>
    >];
}
```

The algorithm is simple.
1. ==Traverse== the runqueue in reverse order and schedule the runnable process that has ==the largest remaining timeslice==.
2. If there is no such process, the scheduler refills every task's timeslice and ==adds half of the leftover timeslice to the sleeping tasks==, so tasks that sleep often (typically interactive ones) are rewarded.

```graphviz
graph G{
    1[shape=none,margin=0,label=<
    <table BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
        <tr>
            <td bgcolor="skyblue"> runnable 10 </td>
            <td bgcolor="skyblue"> <b>runnable 15</b> </td>
            <td bgcolor="grey"> NULL </td>
            <td bgcolor="skyblue"> runnable 5 </td>
            <td bgcolor="lightcoral"> sleep 12 </td>
        </tr>
    </table>
    >];
}
```

```graphviz
graph G{
    1[shape=none,margin=0,label=<
    <table BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
        <tr>
            <td bgcolor="skyblue"> <b>runnable 10</b> </td>
            <td bgcolor="skyblue"> runnable 0 </td>
            <td bgcolor="grey"> NULL </td>
            <td bgcolor="skyblue"> runnable 5 </td>
            <td bgcolor="lightcoral"> sleep 12 </td>
        </tr>
    </table>
    >];
}
```

```graphviz
graph G{
    1[shape=none,margin=0,label=<
    <table BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
        <tr>
            <td bgcolor="skyblue"> runnable 0 </td>
            <td bgcolor="skyblue"> runnable 0 </td>
            <td bgcolor="grey"> NULL </td>
            <td bgcolor="skyblue"> <b>runnable 5</b> </td>
            <td bgcolor="lightcoral"> sleep 12 </td>
        </tr>
    </table>
    >];
}
```

```graphviz
graph G{
    1[shape=none,margin=0,label=<
    <table BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
        <tr>
            <td bgcolor="skyblue"> runnable 0 </td>
            <td bgcolor="skyblue"> runnable 0 </td>
            <td bgcolor="grey"> NULL </td>
            <td bgcolor="skyblue"> runnable 0 </td>
            <td bgcolor="lightcoral"> sleep 12 </td>
        </tr>
    </table>
    >];
}
```

```graphviz
graph G{
    1[shape=none,margin=0,label=<
    <table BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
        <tr>
            <td bgcolor="skyblue"> runnable 15 </td>
            <td bgcolor="skyblue"> runnable 15 </td>
            <td bgcolor="grey"> NULL </td>
            <td bgcolor="skyblue"> runnable 15 </td>
            <td bgcolor="lightcoral"> sleep 20 </td>
        </tr>
    </table>
    >];
}
```

The process continues in this way. When a task finishes, it leaves the array and its entry is set back to `NULL`.
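The core of this algorithm can be sketched in a few lines of C. This is a simplified paraphrase of the v0.01 `schedule()` loop, not the verbatim kernel source; the function name `pick_next()` is mine, `counter` is the remaining timeslice, and `priority` is the refill value.

```c
#define NR_TASKS     64
#define TASK_RUNNING 0   /* "runnable" state in v0.01 */

struct task_struct {
    int state;      /* TASK_RUNNING means the task is runnable */
    int counter;    /* remaining timeslice */
    int priority;   /* base value used when the timeslice is refilled */
};

/* The fixed-size task array that doubles as the runqueue. */
extern struct task_struct *task[NR_TASKS];

int pick_next(void)
{
    for (;;) {
        int best = -1, next = 0;

        /* 1. Scan the array in reverse order and remember the runnable
         *    task with the largest remaining timeslice. */
        for (int i = NR_TASKS - 1; i > 0; i--) {
            struct task_struct *p = task[i];
            if (p && p->state == TASK_RUNNING && p->counter > best) {
                best = p->counter;
                next = i;
            }
        }
        if (best > 0)
            return next;   /* a runnable task still has time left */

        /* 2. Every runnable task is out of time: refill all timeslices.
         *    Sleeping tasks keep half of their leftover counter on top
         *    of their priority, which rewards frequently sleeping
         *    (interactive) tasks. */
        for (int i = NR_TASKS - 1; i > 0; i--)
            if (task[i])
                task[i]->counter = (task[i]->counter >> 1) + task[i]->priority;
    }
}
```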
### O(n) Scheduler
The linear complexity of this algorithm comes from the requirement to ==iterate through every task during scheduling events==. To reduce the imbalance between batch and interactive tasks, the scheduler introduced the **epoch**, which represented the **lifetime (timeslice)** of every task within the system.

```graphviz
digraph shuffle {
    rankdir = "LR"
    node [shape=box, style=filled];
    node0 [shape=none, label="", fillcolor="white", color="black"];

    // Define nodes with different colors
    node1 [label="task1", fillcolor="white", color="black"];
    node2 [label="task2", fillcolor="grey", color="black"];
    node3 [label="task3", fillcolor="white", color="black"];
    node4 [label="task4", fillcolor="white", color="black"];
    node5 [label="task5", fillcolor="white", color="black"];

    // Connect the nodes
    node0 -> node2;
    node1 -> node2;
    node2 -> node3;
    node3 -> node4;
    node4 -> node5;
}
```

At this point, all tasks, regardless of their runnable status, are stored in a **linked list**. When Linux needs to select a task for execution, it engages the O(n) scheduler, which iterates over every task and gives each one a **goodness** score. ==The task with the highest goodness score earns access to the core==. When a task's timeslice is consumed, it is preempted and another task is run.

### O(1) Scheduler
### Completely Fair Scheduler (CFS)
### EEVDF

## TODO: Reproduce the Experiments in Chapter 7
### Same Nice vs. Different Nice
### FIFO vs. RR
The experiments aim to demonstrate the difference between the FIFO and RR policies; both policies belong to the **real-time** scheduling class. To present the two policies in detail, 3 experiments will be covered.

1. Difference between FIFO and RR

We expect that each task under the FIFO policy will only be chosen to execute ==once the previous one terminates==, while each task under the RR policy will get a chance to execute ==once the previous one exceeds the given timeslice==. We will use the following configuration file.

```json
{
    "proc": [
        {
            "comm": "cpu_bound",
            "cpus": "3",
            "name": "fifo",
            "policy": "fifo",
            "policy_detail": "99",
            "spawn_cnt": 3
        },
        {
            "comm": "cpu_bound",
            "cpus": "2",
            "name": "rr",
            "policy": "rr",
            "policy_detail": "99",
            "spawn_cnt": 3
        }
    ]
}
```

6 tasks (3 of each) will be generated in this part. The tasks under the FIFO policy will be scheduled on the third CPU, and those under the RR policy will be scheduled on the second CPU.

2. Priority of FIFO and RR

In this part, we will present how priority affects each policy.

```json
{
    "proc": [
        {
            "comm": "cpu_bound",
            "cpus": "3",
            "name": "low priority (fifo)",
            "policy": "fifo",
            "policy_detail": "97"
        },
        {
            "comm": "cpu_bound",
            "cpus": "3",
            "name": "medium priority (fifo)",
            "policy": "fifo",
            "policy_detail": "98",
            "spawn_cnt": 2
        },
        {
            "comm": "cpu_bound",
            "cpus": "3",
            "name": "high priority (fifo)",
            "policy": "fifo",
            "policy_detail": "99"
        },
        {
            "comm": "cpu_bound",
            "cpus": "2",
            "name": "low priority (rr)",
            "policy": "rr",
            "policy_detail": "97"
        },
        {
            "comm": "cpu_bound",
            "cpus": "2",
            "name": "medium priority (rr)",
            "policy": "rr",
            "policy_detail": "98",
            "spawn_cnt": 2
        },
        {
            "comm": "cpu_bound",
            "cpus": "2",
            "name": "high priority (rr)",
            "policy": "rr",
            "policy_detail": "99"
        }
    ]
}
```

3. Block the FIFO

We expect that another task will be selected to execute once the running one blocks itself under the FIFO policy.
```json
{
    "proc": [
        {
            "comm": "io_block",
            "cpus": "3",
            "name": "io_block",
            "policy": "fifo",
            "policy_detail": "99",
            "spawn_cnt": 2
        },
        {
            "comm": "yield",
            "cpus": "3",
            "name": "yield",
            "policy": "rr",
            "policy_detail": "99",
            "spawn_cnt": 2
        }
    ]
}
```

4 tasks (2 of each) will be generated in this part. All of them are scheduled on the third CPU.

#### Expected Results
1. We expect that in a FIFO scheduler, each task will be completed before moving on to the next one. In a Round-Robin (RR) scheduler, each task is executed in turn, one timeslice at a time.
![IMG_0084](https://hackmd.io/_uploads/rJfBZQTLA.jpg)
2. For tasks with different priorities, it is natural that high-priority tasks are executed first. Among tasks with the same priority, a FIFO scheduler will ==execute the task that arrived first==, completing it entirely before moving on to the next task. In contrast, a Round-Robin (RR) scheduler ==switches between tasks at each timeslice==.
![IMG_0085](https://hackmd.io/_uploads/H1VOZXpUC.jpg)
3. As described above, once a running FIFO task blocks itself, we expect another task to be selected to execute.
![IMG_0086](https://hackmd.io/_uploads/rJqubmT8A.jpg)
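As a reference for what the `cpu_bound` tasks spawned by the configurations above are assumed to do, the following is a hedged C sketch (not sched-analyzer's actual code): it pins the calling task to CPU 3 with `sched_setaffinity()` and switches it to `SCHED_FIFO` priority 99 with `sched_setscheduler()`, matching `"cpus": "3"` and `"policy_detail": "99"`; substituting `SCHED_RR` gives the round-robin variant. Setting a real-time policy requires root (or `CAP_SYS_NICE`).

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Pin this task to CPU 3, like "cpus": "3" in the config above. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    /* Switch to SCHED_FIFO with priority 99 ("policy_detail": "99").
     * Use SCHED_RR here instead to get the round-robin behavior. */
    struct sched_param sp = { .sched_priority = 99 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return EXIT_FAILURE;
    }

    /* Burn CPU so the scheduling behavior is visible in the trace. */
    for (volatile unsigned long i = 0; i < 1000000000UL; i++)
        ;
    return EXIT_SUCCESS;
}
```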