# Linux kernel Scheduler
> Presenter: YangYeh-PD
> [video](https://youtu.be/m6ps9Z9htzM)

:::info
My Question List
1. Are scheduling classes specified by the source program or identified by the scheduler?
2. Why does `SCHED_IDLE` not belong to `idle_sched_class`, and what is the difference between them?
3. What does a sleeping process do in the early Linux scheduler, and what does it have to do with ==interactivity==?
:::

## Descriptions
* Reading "Demystifying the Linux CPU Scheduler" and noting any questions or issues encountered.
* Reproducing the experiments in Chapter 7.

## Prerequisites
#### Update Linux to v6.1 or above
`Linhoward0522`'s research on the scheduler contains the related [operational instructions](https://hackmd.io/@sysprog/BJh9FdlS2#%E6%9B%B4%E6%96%B0%E8%87%B3-Linux-v61).
#### Download sched-analyzer
Go to [sched-analyzer](https://github.com/qais-yousef/sched-analyzer) and download it. After generating a trace file, we can load it into [perfetto](https://perfetto.dev/docs/) to visualize the behavior of the scheduler.

## Development Environment
```
$ gcc --version
gcc (Ubuntu 13.2.0-4ubuntu3) 13.2.0

$ uname -mrs
Linux 6.5.0-41-generic x86_64

$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
    CPU family:           6
    Model:                142
    Thread(s) per core:   2
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             10
    CPU(s) scaling MHz:   36%
    CPU max MHz:          3400.0000
    CPU min MHz:          400.0000
    BogoMIPS:             3600.00
Virtualization features:
  Virtualization:         VT-x
Caches (sum of all):
  L1d:                    128 KiB (4 instances)
  L1i:                    128 KiB (4 instances)
  L2:                     1 MiB (4 instances)
  L3:                     6 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-7
```

## TODO: Reading "Demystifying the Linux CPU Scheduler"
### The Importance of the CPU Scheduler
The CPU scheduler plays an important role not only in a **preemptive multi-tasking system**, but also in a **non-preemptive multi-tasking system**.
* In a preemptive multi-tasking system, the scheduler allows the operating system to ==forcibly interrupt== a running process to assign the CPU to another process. This ensures that all processes get a fair share of CPU time and ==improves the system's responsiveness==.
* In a non-preemptive multi-tasking system, a newly created process is placed in the ready queue. The scheduler then **selects** a process from the ready queue (or **runqueue**), and the selected process keeps the CPU until it finishes or voluntarily yields.

```graphviz
digraph shuffle {
    rankdir = "LR"
    node [shape=box, style=filled];

    // Define nodes with different colors
    scheduler1 [label="scheduler", fillcolor="white"];
    head1 [shape=none, label="head", fillcolor=white];
    tail1 [shape=none, label="tail", fillcolor=white];
    node1 [label="task1", fillcolor="#ffbd19", color="black"];
    node2 [label="task2", fillcolor="white", color="black"];
    node3 [label="task3", fillcolor="white", color="black"];
    node4 [label="task4", fillcolor="white", color="black"];
    node5 [label="task5", fillcolor="white", color="black"];

    // Connect the nodes
    scheduler1 -> node1;
    head1 -> node1;
    node1 -> node2;
    node2 -> node3;
    node3 -> node4;
    node4 -> node5;
    tail1 -> node5;
}
```

#### Interactivity vs. Throughput
The CPU scheduler aims to balance **interactivity** and **throughput**. For a system to respond immediately and interact effectively with users, ==preemptive multi-tasking is required==. However, frequent preemption adds context-switch overhead and can leave the CPU **idle** at certain moments, reducing overall CPU efficiency.
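To make the runqueue picture above concrete, the following is a minimal user-space sketch (my own illustration, not kernel code) of the non-preemptive case: the scheduler repeatedly takes the task at the head of a FIFO runqueue and lets it run to completion before picking the next one.

```c
#include <stdio.h>

#define MAX_TASKS 8

struct task {
    const char *name;
    int remaining_work;   /* arbitrary units of CPU work left */
};

/* The ready queue (runqueue) as a simple circular buffer. */
static struct task *runqueue[MAX_TASKS];
static int rq_head, rq_tail, rq_len;

static void enqueue(struct task *t)
{
    runqueue[rq_tail] = t;
    rq_tail = (rq_tail + 1) % MAX_TASKS;
    rq_len++;
}

static struct task *pick_next(void)
{
    if (rq_len == 0)
        return NULL;
    struct task *t = runqueue[rq_head];
    rq_head = (rq_head + 1) % MAX_TASKS;
    rq_len--;
    return t;
}

int main(void)
{
    struct task tasks[] = { { "task1", 3 }, { "task2", 1 }, { "task3", 2 } };

    for (int i = 0; i < 3; i++)
        enqueue(&tasks[i]);

    struct task *t;
    while ((t = pick_next()) != NULL) {
        /* Non-preemptive: the task keeps the CPU until it finishes. */
        while (t->remaining_work-- > 0)
            printf("%s is running\n", t->name);
    }
    return 0;
}
```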
#### Fairness and Liveness
**Fairness** means that tasks with the same priority should receive an ==equal amount of CPU time==. However, this alone is not sufficient. If a high-priority task continuously occupies the CPU, other tasks will be unable to execute. Therefore, the design of **liveness** becomes crucial. Liveness ensures that ==every task gets CPU resources== within a certain timeframe. With liveness, even lower-priority tasks that need to be completed can eventually be executed.

#### Real-Time (RT)
Real-time OSes are divided into two types: soft real-time and hard real-time.
* In a **soft real-time OS**, if a process misses its deadline, the consequences are not severe. The most common example is audio and video desynchronization.
* In a **hard real-time OS**, if a process misses its deadline, it can lead to serious consequences, potentially even causing the system to crash.

#### Power Efficiency
The Linux CPU scheduler employs different power-saving methods depending on the platform. On desktops and laptops, there are two main power-saving strategies:
* ==Prioritizing low-power tasks==, such as idle tasks or lower-priority tasks.
* ==Adjusting the CPU frequency== based on the workload. When the workload is light, the frequency can be reduced to save energy.

On servers, because the system needs to operate continuously, it is essential to strike a balance between throughput and power consumption. The strategy is:
* ==Prioritizing **CPU-bound tasks**== to increase cache efficiency (both temporal and spatial) and reduce inter-processor communication.

### Working Patterns
So how does the scheduler know whether a task needs to be executed first? Instead of inferring it, the Linux kernel requires the source program to ==specify the priority== (and scheduling policy) before handing the task over to the scheduler. The scheduler subsystem of the Linux kernel uses the concept of **scheduling classes** to implement the corresponding scheduling algorithms. There are 5 different scheduling classes in the current kernel version.

```graphviz
digraph shuffle {
    rankdir = "LR"
    node [shape=box, style=filled];

    // Define nodes with different colors
    node1 [label="Stop", fillcolor="#ffbd19", color="black"];
    node2 [label="Deadline", fillcolor="white", color="black"];
    node3 [label="Real-Time", fillcolor="white", color="black"];
    node4 [label="Fair", fillcolor="white", color="black"];
    node5 [label="Idle", fillcolor="grey", color="black"];

    // Connect the nodes
    node1 -> node2;
    node2 -> node3;
    node3 -> node4;
    node4 -> node5;
}
```

| Scheduling classes | Scheduling Policies |
|:------------------:|:---------------------------------------------:|
| `stop_sched_class` |                                                |
| `dl_sched_class`   | `SCHED_DEADLINE`                               |
| `rt_sched_class`   | `SCHED_FIFO` & `SCHED_RR`                      |
| `fair_sched_class` | `SCHED_NORMAL` & `SCHED_BATCH` & `SCHED_IDLE`  |
| `idle_sched_class` |                                                |

* **Stop** is used to schedule the per-CPU stop task, which is only available on symmetric multiprocessing (SMP) systems. It includes a mechanism to stop a kernel thread running on a specific CPU and **migrate it** to another CPU.
* **Deadline** guarantees that the deadline is respected. A task with this policy defines a **deadline**, before which the task has to finish its execution.
* In `rt_sched_class`, there are two scheduling policies, `SCHED_FIFO` and `SCHED_RR`, which are very similar.
* In `SCHED_FIFO`, FIFO means ==First In First Out==, so tasks are executed in the order they arrive.
However, if a higher-priority task needs to be executed, it will be given priority.
* `SCHED_RR` (round-robin) is similar to FIFO, but the difference is that RR has a ==timeslice==. It uses round-robin to ==cycle== through all the tasks with the same priority. Each task can run only up to a maximum fixed timeslice.
![Round-robin](https://hackmd.io/_uploads/H1K88QnL0.png)
* `SCHED_NORMAL` (also known as `SCHED_OTHER`) is the default policy used for ==regular tasks==; such tasks are handled by **CFS (the Completely Fair Scheduler)**. `SCHED_BATCH` is similar to `SCHED_NORMAL`, but it preempts less frequently.
* `SCHED_IDLE` is for tasks with very low priority; almost any task can preempt tasks with this policy.
* **Idle** is used to schedule the per-CPU idle task (also called the **swapper task**), which runs only if no other task is runnable.

### Early CPU Scheduler
In v0.01 of the Linux kernel, all tasks are stored in ==a single array==. This array is not only the list of all tasks but also the runqueue. The length of this array is `NR_TASKS`, which is 64. An empty entry is expressed as `NULL`.

```graphviz
graph G{
    1[shape=none,margin=0,label=<
    <table BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
        <tr>
            <td bgcolor="skyblue"> runnable 10 </td>
            <td bgcolor="skyblue"> runnable 15 </td>
            <td bgcolor="grey"> NULL </td>
            <td bgcolor="skyblue"> runnable 5 </td>
            <td bgcolor="lightcoral"> sleep 12 </td>
        </tr>
    </table>
    >];
}
```

The algorithm is simple.
1. ==Traverse== the runqueue in reverse order and schedule the runnable process that has ==the largest remaining timeslice==.
2. If there is no such process, the scheduler refills every task's timeslice and ==adds half of the leftover timeslice to the sleeping tasks==, so tasks that sleep often (typically interactive ones) are rewarded.

```graphviz
graph G{
    1[shape=none,margin=0,label=<
    <table BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
        <tr>
            <td bgcolor="skyblue"> runnable 10 </td>
            <td bgcolor="skyblue"> <b>runnable 15</b> </td>
            <td bgcolor="grey"> NULL </td>
            <td bgcolor="skyblue"> runnable 5 </td>
            <td bgcolor="lightcoral"> sleep 12 </td>
        </tr>
    </table>
    >];
}
```

```graphviz
graph G{
    1[shape=none,margin=0,label=<
    <table BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
        <tr>
            <td bgcolor="skyblue"> <b>runnable 10</b> </td>
            <td bgcolor="skyblue"> runnable 0 </td>
            <td bgcolor="grey"> NULL </td>
            <td bgcolor="skyblue"> runnable 5 </td>
            <td bgcolor="lightcoral"> sleep 12 </td>
        </tr>
    </table>
    >];
}
```

```graphviz
graph G{
    1[shape=none,margin=0,label=<
    <table BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
        <tr>
            <td bgcolor="skyblue"> runnable 0 </td>
            <td bgcolor="skyblue"> runnable 0 </td>
            <td bgcolor="grey"> NULL </td>
            <td bgcolor="skyblue"> <b>runnable 5</b> </td>
            <td bgcolor="lightcoral"> sleep 12 </td>
        </tr>
    </table>
    >];
}
```

```graphviz
graph G{
    1[shape=none,margin=0,label=<
    <table BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
        <tr>
            <td bgcolor="skyblue"> runnable 0 </td>
            <td bgcolor="skyblue"> runnable 0 </td>
            <td bgcolor="grey"> NULL </td>
            <td bgcolor="skyblue"> runnable 0 </td>
            <td bgcolor="lightcoral"> sleep 12 </td>
        </tr>
    </table>
    >];
}
```

```graphviz
graph G{
    1[shape=none,margin=0,label=<
    <table BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
        <tr>
            <td bgcolor="skyblue"> runnable 15 </td>
            <td bgcolor="skyblue"> runnable 15 </td>
            <td bgcolor="grey"> NULL </td>
            <td bgcolor="skyblue"> runnable 15 </td>
            <td bgcolor="lightcoral"> sleep 20 </td>
        </tr>
    </table>
    >];
}
```

The process continues in this way. When a task finishes, it leaves the array and its entry is set back to `NULL`.
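The core of this algorithm can be sketched in a few lines of C. This is a simplified paraphrase of the v0.01 `schedule()` loop, not the verbatim kernel source; the function name `pick_next()` is mine, `counter` is the remaining timeslice, and `priority` is the refill value.

```c
#define NR_TASKS     64
#define TASK_RUNNING 0   /* "runnable" state in v0.01 */

struct task_struct {
    int state;      /* TASK_RUNNING means the task is runnable */
    int counter;    /* remaining timeslice */
    int priority;   /* base value used when the timeslice is refilled */
};

/* The fixed-size task array that doubles as the runqueue. */
extern struct task_struct *task[NR_TASKS];

int pick_next(void)
{
    for (;;) {
        int best = -1, next = 0;

        /* 1. Scan the array in reverse order and remember the runnable
         *    task with the largest remaining timeslice. */
        for (int i = NR_TASKS - 1; i > 0; i--) {
            struct task_struct *p = task[i];
            if (p && p->state == TASK_RUNNING && p->counter > best) {
                best = p->counter;
                next = i;
            }
        }
        if (best > 0)
            return next;   /* a runnable task still has time left */

        /* 2. Every runnable task is out of time: refill all timeslices.
         *    Sleeping tasks keep half of their leftover counter on top
         *    of their priority, which rewards frequently sleeping
         *    (interactive) tasks. */
        for (int i = NR_TASKS - 1; i > 0; i--)
            if (task[i])
                task[i]->counter = (task[i]->counter >> 1) + task[i]->priority;
    }
}
```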
### O(n) Scheduler
The linear complexity of this algorithm comes from the requirement to ==iterate through every task during scheduling events==. To reduce the imbalance between batch and interactive tasks, the scheduler introduced the **epoch**, which represented the **lifetime (timeslice)** of every task within the system.

```graphviz
digraph shuffle {
    rankdir = "LR"
    node [shape=box, style=filled];
    node0 [shape=none, label="", fillcolor="white", color="black"];

    // Define nodes with different colors
    node1 [label="task1", fillcolor="white", color="black"];
    node2 [label="task2", fillcolor="grey", color="black"];
    node3 [label="task3", fillcolor="white", color="black"];
    node4 [label="task4", fillcolor="white", color="black"];
    node5 [label="task5", fillcolor="white", color="black"];

    // Connect the nodes
    node0 -> node2;
    node1 -> node2;
    node2 -> node3;
    node3 -> node4;
    node4 -> node5;
}
```

At this point, all tasks, regardless of their runnable status, are stored in a **linked list**. When Linux needs to select a task for execution, it engages the O(n) scheduler, which iterates over every task and gives each one a **goodness** score. ==The task with the highest goodness score earns access to the core==. When a task's timeslice is consumed, it is preempted and another task is run.

### O(1) Scheduler
### Completely Fair Scheduler (CFS)
### EEVDF

## TODO: Reproduce the Experiments in Chapter 7
### Same Nice vs. Different Nice
### FIFO vs. RR
The experiments aim to demonstrate the difference between the FIFO and RR policies; both policies belong to the **real-time** scheduling class. To present the two policies in detail, 3 experiments will be covered.

1. Difference between FIFO and RR

We expect that each task under the FIFO policy will only be chosen to execute ==once the previous one terminates==, while each task under the RR policy will get a chance to execute ==once the previous one exceeds the given timeslice==. We will use the following configuration file.

```json
{
    "proc": [
        {
            "comm": "cpu_bound",
            "cpus": "3",
            "name": "fifo",
            "policy": "fifo",
            "policy_detail": "99",
            "spawn_cnt": 3
        },
        {
            "comm": "cpu_bound",
            "cpus": "2",
            "name": "rr",
            "policy": "rr",
            "policy_detail": "99",
            "spawn_cnt": 3
        }
    ]
}
```

6 tasks (3 of each) will be generated in this part. The tasks under the FIFO policy will be scheduled on the third CPU, and those under the RR policy will be scheduled on the second CPU.

2. Priority of FIFO and RR

In this part, we will present how priority affects each policy.

```json
{
    "proc": [
        {
            "comm": "cpu_bound",
            "cpus": "3",
            "name": "low priority (fifo)",
            "policy": "fifo",
            "policy_detail": "97"
        },
        {
            "comm": "cpu_bound",
            "cpus": "3",
            "name": "medium priority (fifo)",
            "policy": "fifo",
            "policy_detail": "98",
            "spawn_cnt": 2
        },
        {
            "comm": "cpu_bound",
            "cpus": "3",
            "name": "high priority (fifo)",
            "policy": "fifo",
            "policy_detail": "99"
        },
        {
            "comm": "cpu_bound",
            "cpus": "2",
            "name": "low priority (rr)",
            "policy": "rr",
            "policy_detail": "97"
        },
        {
            "comm": "cpu_bound",
            "cpus": "2",
            "name": "medium priority (rr)",
            "policy": "rr",
            "policy_detail": "98",
            "spawn_cnt": 2
        },
        {
            "comm": "cpu_bound",
            "cpus": "2",
            "name": "high priority (rr)",
            "policy": "rr",
            "policy_detail": "99"
        }
    ]
}
```

3. Block the FIFO

We expect that another task will be selected to execute once the running one blocks itself under the FIFO policy.
```json
{
    "proc": [
        {
            "comm": "io_block",
            "cpus": "3",
            "name": "io_block",
            "policy": "fifo",
            "policy_detail": "99",
            "spawn_cnt": 2
        },
        {
            "comm": "yield",
            "cpus": "3",
            "name": "yield",
            "policy": "rr",
            "policy_detail": "99",
            "spawn_cnt": 2
        }
    ]
}
```

4 tasks (2 of each) will be generated in this part. All of them are scheduled on the third CPU.

#### Expected Results
1. We expect that in a FIFO scheduler, each task will be completed before moving on to the next one. In a Round-Robin (RR) scheduler, each task is executed in turn, one timeslice at a time.
![IMG_0084](https://hackmd.io/_uploads/rJfBZQTLA.jpg)
2. For tasks with different priorities, it is natural that high-priority tasks are executed first. Among tasks with the same priority, a FIFO scheduler will ==execute the task that arrived first==, completing it entirely before moving on to the next task. In contrast, a Round-Robin (RR) scheduler ==switches between tasks at each timeslice==.
![IMG_0085](https://hackmd.io/_uploads/H1VOZXpUC.jpg)
3. As described above, once a running FIFO task blocks itself, we expect another task to be selected to execute.
![IMG_0086](https://hackmd.io/_uploads/rJqubmT8A.jpg)
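As a reference for what the `cpu_bound` tasks spawned by the configurations above are assumed to do, the following is a hedged C sketch (not sched-analyzer's actual code): it pins the calling task to CPU 3 with `sched_setaffinity()` and switches it to `SCHED_FIFO` priority 99 with `sched_setscheduler()`, matching `"cpus": "3"` and `"policy_detail": "99"`; substituting `SCHED_RR` gives the round-robin variant. Setting a real-time policy requires root (or `CAP_SYS_NICE`).

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Pin this task to CPU 3, like "cpus": "3" in the config above. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    /* Switch to SCHED_FIFO with priority 99 ("policy_detail": "99").
     * Use SCHED_RR here instead to get the round-robin behavior. */
    struct sched_param sp = { .sched_priority = 99 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return EXIT_FAILURE;
    }

    /* Burn CPU so the scheduling behavior is visible in the trace. */
    for (volatile unsigned long i = 0; i < 1000000000UL; i++)
        ;
    return EXIT_SUCCESS;
}
```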