# Note of kernel-scheduler-internals
###### tags: `linux2021q3`
## monolithic vs micro kernel
> p8
- A microkernel reduces the amount of code that runs in kernel space, pushing most services out to user space.
| | monolithic | micro |
| -------- | -------- | -------- |
| bugs in kernel | more | fewer |
| recovering from bugs | hard (reboot) | easy (re-execute the service) |
| portability | hard | easy |
| maintainability | hard | easy |
| testing a program | hard<br>(recompile the whole system & reboot) | easy (recompile that program only) |
| performance | better | worse (heavy use of IPC)<br>(context switches & system calls) |
| modularity | hard | easy |
---
## process vs thread
> p10
- Threads share the same address space, while processes do not.
- The kernel does not distinguish between them: both are represented by a `task_struct` and scheduled as tasks.
### PID & TGID
> p11
> PID: process ID
> TGID: thread group ID
- single-threaded process
- PID == TGID
- multi-threaded process
- Each thread in one group has the same TGID and a unique PID.
- thread group leader: PID == TGID
- `getpid()`: return ==TGID==
- `gettid()`: return ==PID==
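A minimal user-space check of this (a sketch assuming glibc; `gettid()` only gained a glibc wrapper in 2.30, so the raw syscall is used here; compile with `-pthread`):
```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *worker(void *arg)
{
    /* secondary thread: same TGID (getpid), different PID (gettid) */
    printf("worker: getpid()=%d gettid()=%ld\n",
           getpid(), (long) syscall(SYS_gettid));
    return NULL;
}

int main(void)
{
    /* thread group leader: PID == TGID, so both calls agree */
    printf("leader: getpid()=%d gettid()=%ld\n",
           getpid(), (long) syscall(SYS_gettid));

    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}
```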
---
## container_of
> read-around: [Linux 核心原始程式碼巨集: container_of](https://hackmd.io/@sysprog/linux-macro-containerof)
```c
#define container_of(ptr, type, member)                             \
    ({                                                              \
        void *__mptr = (void *)(ptr); /* address of the member */  \
        /* step back by the member's offset inside the struct to   \
           reach the beginning of the enclosing structure */        \
        ((type *)(__mptr - offsetof(type, member)));                \
    })
// An alias that is used everywhere in the list API
#define list_entry(ptr, type, member) container_of(ptr, type, member)
```
Given a pointer to a member, it returns a pointer to the beginning of the enclosing struct.
So we can recover the structure that contains a list node without storing a back-pointer in the node itself.
==aaaaaaaamazed==
usage:
One data structure can embed many different list_heads. In other words, it can exist in different linked lists at the same time, without consuming extra space for duplicated data (see the sketch after the offsetof breakdown below).
- offsetof
```c
// File include/linux/stddef.h
#define offsetof(TYPE, MEMBER) ((size_t)&((TYPE *)0)->MEMBER)
```
TYPE is the struct we are considering, MEMBER is the name of the field;
what it does is:
1. Take address 0, the first address in the address space of the process
2. Cast it to a TYPE pointer
3. Dereference the pointer and access the MEMBER field
4. Take the address of that field and cast it to size_t; it is now an offset into the struct, no longer an address
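Putting the two macros together, a minimal user-space sketch (the names `task`, `run_list`, `wait_list` are made up for illustration; `char *` arithmetic is used instead of the kernel's `void *` GCC extension):
```c
#include <stddef.h> /* offsetof */
#include <stdio.h>

#define container_of(ptr, type, member) \
    ((type *) ((char *) (ptr) - offsetof(type, member)))

struct list_head { struct list_head *next, *prev; };

/* one object that can sit on two independent lists at the same time */
struct task {
    int pid;
    struct list_head run_list;  /* linked into a runqueue   */
    struct list_head wait_list; /* linked into a wait queue */
};

int main(void)
{
    struct task t = { .pid = 42 };

    /* pretend these pointers came from walking the two lists */
    struct list_head *from_run = &t.run_list;
    struct list_head *from_wait = &t.wait_list;

    /* either embedded node recovers the same enclosing struct */
    struct task *a = container_of(from_run, struct task, run_list);
    struct task *b = container_of(from_wait, struct task, wait_list);

    printf("a->pid=%d b->pid=%d same=%d\n", a->pid, b->pid, a == b);
    printf("offsetof run_list=%zu wait_list=%zu\n",
           offsetof(struct task, run_list),
           offsetof(struct task, wait_list));
    return 0;
}
```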
---
## Task lifetime
> p24
### Zombie Process
A zombie process is a process that has terminated, but whose process descriptor and entry in the PID hash table are still present in memory and visible (for example, to `ps aux`).
A task's resources are not deallocated immediately because the parent process may want to access some of this information, most likely the exit status, or may want to synchronize with the child's termination via wait() or waitpid().
Zombie processes are impossible to kill externally: they no longer execute and cannot receive signals, so a wait by the parent is the only way to free the memory occupied by the zombie's data structures.
If the parent dies first, the orphan is re-parented to the ancestor process (init), which has a routine that periodically waits to reap possible zombie processes.
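A minimal sketch that produces a zombie on purpose (run `ps aux | grep defunct` during the sleep to see the `Z` state):
```c
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0)
        exit(0); /* child terminates immediately */

    /* parent delays the wait(): the child stays a zombie meanwhile */
    sleep(30);

    int status;
    waitpid(pid, &status, 0); /* reaping frees the zombie's entry */
    return 0;
}
```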
### Task States
```cpp
/* Used in tsk->state: */
#define TASK_RUNNING 0x0000
#define TASK_INTERRUPTIBLE 0x0001
#define TASK_UNINTERRUPTIBLE 0x0002
#define __TASK_STOPPED 0x0004
#define __TASK_TRACED 0x0008
/* Used in tsk->exit_state: */
#define EXIT_DEAD 0x0010
#define EXIT_ZOMBIE 0x0020
#define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD)
```
## Response time and throughput
These two goals conflict: optimizing for short response time (low scheduling latency) costs throughput, and vice versa, so a scheduler cannot maximize both at once.
---
## The Linux scheduler
>p30
### Scheduler Concept
**In fact, all scheduling on Linux is preemptive.**

| Scheduling classes | Scheduling policies |
| -------- | -------- |
| stop_sched_class | |
| dl_sched_class | SCHED_DEADLINE |
| rt_sched_class | SCHED_FIFO<br>SCHED_RR |
| fair_sched_class | SCHED_NORMAL<br>SCHED_BATCH<br>SCHED_IDLE |
| idle_sched_class | |
- SCHED_NORMAL
SCHED_NORMAL is the **default policy** that is used for regular tasks and uses **CFS** (the Completely Fair Scheduler, implemented in fair.c)
- SCHED_BATCH
SCHED_BATCH is similar to SCHED_NORMAL but it will **preempt less frequently**, so every process will run longer. For this reason, it is **more suited for non-interactive workloads**, typically on **servers**.
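A sketch of switching the calling process between these policies via `sched_setscheduler(2)` (`SCHED_BATCH` is Linux-specific, hence `_GNU_SOURCE`; moving to a real-time policy would additionally require privileges):
```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* sched_priority must be 0 for the non-real-time policies */
    struct sched_param sp = { .sched_priority = 0 };

    /* demote ourselves from SCHED_NORMAL (SCHED_OTHER in user space)
     * to SCHED_BATCH: longer slices, fewer preemptions */
    if (sched_setscheduler(0, SCHED_BATCH, &sp))
        perror("sched_setscheduler");

    printf("policy=%d (SCHED_BATCH=%d)\n",
           sched_getscheduler(0), SCHED_BATCH);
    return 0;
}
```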
### Terminology
- affinity: the set of CPUs a task is allowed to run on; it can be inspected and changed with `sched_getaffinity()`/`sched_setaffinity()` or the `taskset` command.
### Commands
#### Process Priority
```shell
ps -eo pid,rtprio,time,comm
ps -el
```
Report a snapshot of the current processes; the `rtprio` column shows the real-time priority, which ranges from 0 to 99 (`-` for non-real-time tasks).
- **nice**: runs a command with an adjusted niceness; the value ranges from -20 (highest priority) to 19 (lowest), and `renice` changes it for an already-running process.
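The programmatic equivalent, a small sketch using `nice(2)`/`getpriority(2)` (raising the value needs no privilege; lowering it requires `CAP_SYS_NICE`):
```c
#include <errno.h>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
    /* niceness: -20 (highest priority) .. 19 (lowest priority) */
    printf("before: nice=%d\n", getpriority(PRIO_PROCESS, 0));

    errno = 0;
    if (nice(5) == -1 && errno) /* -1 is also a valid return value */
        perror("nice");

    printf("after:  nice=%d\n", getpriority(PRIO_PROCESS, 0));
    return 0;
}
```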
---
### O(1) Scheduler

If a process does not complete its full timeslice before it is preempted, then it goes back in the ready queue. If it does run to the end of the timeslice, it is placed in the expired queue instead. All scheduling takes place from the active queues. The highest priority queue is chosen; if there are multiple tasks in that queue, they are scheduled in round-robin fashion. This continues until the active queue structure is empty. When that happens, the active and expired queues change places, and execution (scheduling) continues.
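A toy sketch of that active/expired swap (not the kernel's actual code; the real O(1) scheduler finds the highest occupied priority with a bitmap and a find-first-set instruction, which is what keeps the lookup constant-time):
```c
#include <stddef.h>

#define MAX_PRIO 140 /* 0..99 real-time, 100..139 normal */

struct task { struct task *next; }; /* toy FIFO link */

struct prio_array { struct task *queue[MAX_PRIO]; };

struct runqueue {
    struct prio_array arrays[2];
    struct prio_array *active, *expired;
};

static struct task *highest(struct prio_array *a)
{
    /* stand-in for the bitmap + find-first-set lookup */
    for (int prio = 0; prio < MAX_PRIO; prio++)
        if (a->queue[prio])
            return a->queue[prio];
    return NULL;
}

static struct task *pick_next(struct runqueue *rq)
{
    struct task *t = highest(rq->active);
    if (t)
        return t;

    /* active array drained: swap active and expired, then retry */
    struct prio_array *tmp = rq->active;
    rq->active = rq->expired;
    rq->expired = tmp;
    return highest(rq->active);
}
```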
#### Pros
- O(1)
every scheduling operation executes in constant time (O(1)) under all circumstances, regardless of the number of runnable tasks.
- work stealing
the O(1) scheduler keeps per-CPU runqueues plus a global rebalancing mechanism; if a CPU is idle, the O(1) scheduler takes (steals) a process from another CPU's queue.
#### Cons
- inconsistent effect of the nice value
With the O(1) scheduler, calling nice() to increment the nice level of a task has different effects depending on the initial value, because timeslices were mapped to nice levels non-uniformly.
- poor interactivity under real-time load
This approach caused another major problem because SCHED_FIFO, as we stated earlier, **is not starvation proof**. **Tasks in the O(1) scheduler had to wait until all of the other tasks, in all active runqueues at all higher priority levels, exhausted their timeslices.**
A task was also marked as interactive by heuristics based on its dynamic priority and its nice value, and these heuristics were fragile.
### Rotating Staircase Deadline (RSDL)
The **”multi-level”** queues correspond to **different levels of time quanta** on the CPU. The **highest queue has the shortest quanta** on the CPU, with each subsequent queue having longer quanta. With several queues, processes placed onto these queues will run for the timeslice specified by its queue. **The number of queues and the range of time quanta will vary**, but this structure allows for processes to be classified into groups based on their needs. Using this model, we can **prioritize interactive and I/O bound processes by introducing priorities, preemption, and feedback across the queues.**
### Completely Fair Scheduler (CFS)
#### virtual runtime
`vruntime += delta_exec` × `NICE_0_LOAD` / `weight` (the weight is derived from the priority, i.e. the nice value; note that the weight divides, it does not multiply)
a monotonically increasing value
a nice-0 task accumulates 1 ms of virtual runtime for each elapsed millisecond during which it runs; heavier (lower-nice) tasks accumulate it more slowly
CFS will choose the task with the lowest virtual runtime, for fairness.
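A simplified version of that scaling, loosely modeled on `calc_delta_fair()` in `kernel/sched/fair.c` (the real code uses fixed-point inverse weights):
```c
#include <stdint.h>

#define NICE_0_LOAD 1024 /* weight of a nice-0 task */

/* virtual runtime advances at wall-clock speed for a nice-0 task,
 * slower for heavier tasks and faster for lighter ones */
static uint64_t delta_vruntime(uint64_t delta_exec_ns, uint64_t weight)
{
    return delta_exec_ns * NICE_0_LOAD / weight;
}
/* e.g. a task of weight 2048 that runs for 1 ms accrues only 0.5 ms
 * of vruntime, so it keeps the smallest vruntime longer and runs more */
```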
#### target time
CFS starts with a target time for **how long it should take to make one complete round-robin** through the runnable threads. The value of 6 ms used in the examples is the default for uniprocessor systems. Each task then gets a slice of that target proportional to its weight: with three runnable tasks of equal weight and a 6 ms target, each runs for about 2 ms per round.
#### Running the next task
Being as fair as possible with all the tasks means keeping **all the tasks’ vruntimes as close as possible** to each other. Following this logic, the **task** that **deserves** more than anyone to be **executed** next is the one **with the smallest vruntime**.
#### Runqueue & minimum virtual runtime
the **minimum virtual runtime** is **the virtual runtime of the most deserving active task** (in the runqueue). Every time a task is chosen for execution, the minimum is updated.
**When new tasks are added** to the runqueue, their virtual runtime is set close to the current minimum, to keep things fair.
**When a sleeping task is woken up,** the scheduler checks that its virtual runtime is at least (roughly) equal to the current minimum, so time spent sleeping cannot be hoarded as CPU credit.
When a new task is created via **fork()** and inserted into the runqueue, it **inherits the virtual runtime from the parent**: this **prevents** the exploit where a task could **take control of the CPU by continuously forking itself**. (A sketch of these placement rules follows.)
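A sketch of those placement rules, a simplification of `place_entity()` in `fair.c` (the real code additionally grants wakeups a small bonus below `min_vruntime`):
```c
#include <stdint.h>

struct cfs_rq { uint64_t min_vruntime; };
struct sched_entity { uint64_t vruntime; };

/* woken task: clamp vruntime so time spent sleeping cannot be
 * cashed in as unlimited CPU credit */
static void place_on_wakeup(struct cfs_rq *rq, struct sched_entity *se)
{
    if (se->vruntime < rq->min_vruntime)
        se->vruntime = rq->min_vruntime;
}

/* forked child: inherit the parent's vruntime instead of starting at
 * the minimum, so fork()-bombing gives no scheduling advantage */
static void place_on_fork(struct sched_entity *child,
                          const struct sched_entity *parent)
{
    child->vruntime = parent->vruntime;
}
```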
#### Pros
- Nice value
CFS handles nice values better than the previous scheduler:
increasing the nice value by one has the same effect regardless of the starting value;
the nice value maps to a weight, so the CPU proportion is determined only by the relative difference in nice values (a worked example follows this list).
- the preemption time is no longer fixed like in the O(1) scheduler, but variable (it depends on the target latency and on how many tasks are runnable)
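A worked example using an excerpt of the kernel's `sched_prio_to_weight[]` table (`kernel/sched/core.c`), where each nice step scales the weight by roughly 1.25:
```c
#include <stdio.h>

/* weights for nice 0..4 from sched_prio_to_weight[] */
static const double w[] = { 1024, 820, 655, 526, 423 };

int main(void)
{
    /* two runnable tasks, nice 0 vs nice 1: the CPU shares depend
     * only on the weight ratio, i.e. the relative nice difference */
    double sum = w[0] + w[1];
    printf("nice 0: %.1f%%  nice 1: %.1f%%\n",
           100 * w[0] / sum, 100 * w[1] / sum); /* ~55.5% vs ~44.5% */
    return 0;
}
```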
#### Cons
Being very fair means preempting often; too many context switches add overhead.
### Multiprocessing
#### Load balancing
**The load of a task** becomes **a combination of its weight and its average CPU utilization**, and the load of the core is the sum of the loads of its tasks. Since the CPU utilization of a task can vary, its load is **constantly updated**.
#### Migration
*Non-uniform memory access* (NUMA) makes the cost of migrating a task depend on the destination CPU: moving across a NUMA node means losing cache contents and local-memory locality.
### Energy-Aware Scheduling (EAS)
**CFS will always put a new task on an idle CPU if available (promoting throughput)**. However, this is not always the most energy-efficient decision.
Because evaluating all possible options would impact performance, **EAS simply evaluates the CPU the task last ran on and the CPU chosen by a simple heuristic** which finds where the task best fits.
#### Arm big.LITTLE architecture
The name “big.LITTLE” refers to the two CPU/core types, big and LITTLE. The **big cores are more powerful but consume more power**, whereas the **LITTLE cores are less powerful but also consume less power**.
*asymmetric multi-core* (AMC) adds complexity to the scheduling problem.
#### Dynamic Voltage and Frequency Scaling (DVFS)
It became possible to change a CPU’s frequency dynamically, either through the BIOS or the operating system.
e.g. DVFS techniques can adjust the frequency of the LITTLE cores from 200 MHz up to 1.4 GHz.
DVFS adds complexity to the scheduling problem.
#### Summary
In order to achieve both good performance and power efficiency, a scheduler could take into account:
- What type of core the task(s) run on (big or LITTLE),
- whether it is worth migrating a task between cores and between core types,
- and whether it is worth running the cores at full speed and voltage or if some cores could be throttled to save power at a slight performance hit.
The existing CFS policy is throughput-based; it does not take energy usage into account.
---
## Ftrace
Ftrace uses a **ring buffer** to store all the events that happen at runtime.
### Function tracing vs. event tracing
#### Function tracing
- uses the code instrumentation mechanism of gcc, enabled by compiling with the `-pg` flag
- **dynamic tracing**: toggled at runtime in the binary executable, without the need to recompile the code
- gcc adds extra NOP ("No OPeration") assembly instructions at the beginning of every function, so that these NOPs can be changed into something else (a call into the tracer), if needed
- advantages:
    - tracing at runtime, **zero overhead** when it is disabled
    - **filter what is being traced:** we can dynamically activate tracing only on functions from a single subsystem, or on one function alone (a user-space sketch of driving the tracer follows)
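A hedged sketch of driving this from user space by writing the tracefs control files (mounted at /sys/kernel/tracing, or /sys/kernel/debug/tracing on older kernels; requires root):
```c
#include <stdio.h>

static int write_file(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fputs(val, f);
    return fclose(f);
}

int main(void)
{
    /* select the function tracer and start recording; the result can
     * then be read as plain text from /sys/kernel/tracing/trace */
    write_file("/sys/kernel/tracing/current_tracer", "function");
    write_file("/sys/kernel/tracing/tracing_on", "1");
    return 0;
}
```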
#### Event tracing
- less efficient and less flexible than function tracing: it uses tracepoints placed directly in the C code, which makes the mechanism static
- since the mechanism is static, **adding or changing tracepoints requires recompiling the whole kernel** (tracepoints that are already compiled in can still be enabled and disabled at runtime)
### Interfacing
This creates **a sort of shared memory between user space and the kernel**, exposed through pseudo-filesystems: ftrace's control files live in tracefs (under /sys/kernel/tracing, or /sys/kernel/debug/tracing on older kernels), while process information lives in the **procfs** filesystem, which is found in **/proc**.
The approach used by Linux is straightforward: the information is (mostly) in human-readable form, **so you simply read the files in /proc and parse the results as strings.** By doing so, **no dedicated syscall is needed**, except, of course, open() and read() to interact with the filesystem. **On Linux, when commands such as ps, top or pgrep are invoked, they internally query the procfs filesystem.** You could always do the same operation manually with something like `cat /proc/1337/info_that_you_need | grep specific_info`, but it would be tedious: this is why utilities like ps are convenient front-ends for the user.
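A minimal sketch of what such tools do under the hood, reading `/proc/self/status` and parsing it as plain text (no syscalls beyond open()/read() via stdio):
```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }

    char line[256];
    while (fgets(line, sizeof(line), f))
        if (!strncmp(line, "Name:", 5) || !strncmp(line, "State:", 6))
            fputs(line, stdout); /* e.g. "State:\tR (running)" */

    fclose(f);
    return 0;
}
```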
---
## Questions
- quiz15
1. When I fill in my answers and run the program, the "No second ELF header found.\n" message from line 65 always appears. I believe this is because the program cannot find the so-called real ELF, i.e. the payload embedded via cat. I am not sure whether I ran it incorrectly or my environment is at fault (kernel version 5.4.0-73), but this part does not seem to be affected by the answer choices; I also tried objcopy, without success so far.
2. Regarding option AAA, I think a type cast on newelf is needed when doing the arithmetic, just like `size - (intptr_t) newelf - 6` on line 63. I am still investigating why addition only produces a warning while subtraction produces an error; should I consult the C language specification for this?