# 2024q1 Homework5 (assessment)
contributed by < `devarajabc` >
## Takeaways from 〈因為自動飲料機而延畢的那一年〉
The essay exposed many of my blind spots.
## Reading log
Reflections:
Over the past few weeks I read chapters 8 and 12 of CS:APP, but even after three passes I could not make sense of them; the terminology and concepts trampled me like Mongol cavalry. Lying there watching the days go by, with my homework still stuck in the mud, I set CS:APP aside and read [Linux 核心設計: 不僅是個執行單元的 Process](https://hackmd.io/@sysprog/linux-process) and [UNIX 作業系統 fork/exec 系統呼叫的前世今生](https://hackmd.io/@sysprog/unix-fork-exec) instead. Even after finishing them I only half understood the concepts; the cavalry trampled me just as thoroughly in Chinese, and I grew very anxious. After some soul-searching I realized I had been chasing quick wins, always hoping to leap to the top in one step, and ended up achieving nothing. I resolved to change my ways: I am now working through Prof. Shiao-Li Tsao's (NYCU CS) online course on operating system design and implementation while reading Demystifying The Linux CPU Scheduler, slowly rebuilding my foundations.
See my [reading notes](https://hackmd.io/@devaraja/Sk3VST4WR) for details.
### Reading OSTEP
Chapter 30: Condition Variables
The text says:
>In particular, there are many cases where a thread wishes to check
whether a condition is true before continuing its execution. For example,
a parent thread might wish to check whether a child thread has completed
before continuing
Why not just use `wait()`?
### Studying weeks 1–6 of the course materials and CS:APP 3/e
quiz 8, 9, 10
8.4.5
I do not understand `unsetenv`.
> The execve function loads and runs executable object file with the argument and environment list
Where are the argument and environment lists stored?
8.5
The text says:
>However, if the set is nonempty, then the kernel chooses some signal k in the set (typically the smallest k) and forces p to receive signal k. The receipt of the signal triggers some action by the process. Once the process completes the action, then control passes back to the next instruction (Inext) in the logical control flow of p.
This raises a few questions for me:
1. Why pick the smallest k ("typically the smallest k")?
2. Is only the smallest one handled? What happens to the other signals?
3. Where is the set of unblocked pending signals stored?
I also do not understand why `handler2`, the SIGCHLD handler in Figure 8.36, is designed the way it is:
>kernel sends a SIGCHLD signal to the parent whenever one of its children terminates or stops.
If that is so, why is `waitpid` still needed? The child has already terminated (we received SIGCHLD), so why call `waitpid` at all?
```c
void handler2(int sig) {
int olderrno = errno;
while (waitpid(-1, NULL, 0) > 0) {
Sio_puts("Handler reaped child\n");
}
if (errno != ECHILD)
Sio_error("waitpid error");
Sleep(1);
errno = olderrno;
}
```
## The project I want to work on
First I want to catch up on the assignments I owe:
1. lab0
2. quiz (1–9)
---
## TODO: Read Demystifying The Linux CPU Scheduler (through Chapter 3) and record questions
### 1.3.1 Processes and threads
a. On the book's description of a thread:
> A thread, defined as a single flow of execution, is linked with a stack and a set of CPU registers, with the stack pointer and program counter being the most significant.
The most significant what? Is a word missing, or does it mean "the most significant registers"?
b. Another description of threads:
>Each thread consists of a unique copy of the CPU registers and has a
dedicated stack, everything else is shared with its process, which is why threads
are often referred as lightweight processes or LWP.
Is the "has" redundant here?
Was the intended sentence "Each thread consists of a unique copy of the CPU registers and a dedicated stack."?
c. On the description of `clone()`:
>The distinction between
a clone and a copy is subtle but important: the former means that the tasks share the same object, the latter means that each task possess a distinct copy of the object.
Does the "copy" here refer to `fork()`?
d.
Since `clone()` lets the caller decide which resources are copied and which are shared with the parent,
a process could use `clone()` directly to spawn a thread, right?
Then why do we still need `pthread_create()`?
```c
// spawns a new thread (shares address space, fs info, files, signal handlers)
clone(&do_something, stackTop, CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
// spawns a new process; this is the same as using fork()
clone(&do_something, stackTop, SIGCHLD, 0);
```
e. On the description of volatile:
>One thing to notice about this task_struct is the usage of the keyword volatile, which suppresses the compiler from optimizing the data access away.
Hence, every time a volatile variable is accessed, a load is guaranteed to occur. The fact that the task state is volatile makes sense because it could be
unpredictably modified by interrupts: it is possible that the variable is changed outside of the current execution context, so loading the variable on every access is required.
What interrupts does "unpredictably modified by interrupts" refer to? And why interrupts rather than the compiler? The text itself says
"which suppresses the compiler from optimizing the data access away" — doesn't that mean the variable is modified by the compiler?
f. On how the OS stores processes:
>There are actually four tables, one for each PID type.
Why maintain four hash tables? Couldn't the TGID class described in the text yield the data needed for the other three types (PID, PGID, SID)?
### 1.3.3 Scheduling
a. On the introduction of time sharing:
>The idea is that several processes run
in “time multiplexing” because the CPU core is divided into slices, one for each runnable process.
But later the text says: "A CPU core, also called a core or a hardware thread, is the smallest computing unit found in a processor chip."
If it is already the smallest computing unit, how can it be divided into slices?
### 1.3.4 Task lifetime
a. On the introduction of __TASK_STOPPED:
>A task in __TASK_STOPPED is not running and cannot be scheduled: this happens upon stop signals or any signal from a tracing process.
If a task is "stopped" and "cannot be scheduled", how is it kept track of? Doesn't it just become a zombie?
ANS: When the stopped task needs to resume, it can be sent a SIGCONT signal; the task then moves from the stopped state back to runnable (TASK_RUNNING) and gets rescheduled by the scheduler.
b. On the description of a timeslice:
>A task is given a timeslice, or the maximum amount of CPU time it is allowed to consume, when it is going to transition from “running — wants a CPU core” to “running — on a CPU core”.
Isn't the transition from "wants a CPU core" to "on a CPU core" about waiting time? What does it have to do with the timeslice?
### 2.1.2 Handling various workload patterns
> Read alongside [CPU Scheduling of the Linux Kernel](https://docs.google.com/presentation/d/1qDSFOfhGF-AX3O8CNzoAkQr_5ohr0nMVDlNfPnM9hOI/edit#slide=id.g11990eefd4b_0_0) and [Linux 核心設計: 不只挑選任務的排程器](https://hackmd.io/@sysprog/linux-scheduler)
a. On the introduction of SCHED_RR and SCHED_FIFO:
>SCHED_RR and SCHED_FIFO use the same priority scale, so processes of both policies share the same runqueues.
So different scheduling classes have different priority scales, and different priority scales get different runqueues? Does that mean deadline, real-time, and fair each have their own runqueue?
b. On the description of SCHED_DEADLINE:
>SCHED_DEADLINE is the highest priority policy. A task with this policy defines a deadline, before which the task has to finish its execution.
>The scheduler guarantees that the deadline is respected.
>To do that, when the policy or the attributes are changed, the scheduler checks the feasibility of the change,and if the CPU is too busy then it returns an error.
"policy or the attributes are changed"? Why would the policy change? Because a higher-priority task became runnable? But the text says "SCHED_DEADLINE is the highest priority policy", so how could a higher-priority task exist? And what exactly are the "attributes"?
c. On the description of the scheduling classes:
>`deadline class` implements the Earliest Deadline First scheduling policy;
>`real-time` class follows the POSIX specification for real-time process scheduling . POSIX is a family of standards, which includes specifications of real-time interfaces for scheduling and resource sharing. Real-time tasks,in POSIX, can run at a minimum of 32 fixed-priorities until they are preempted or yielded.
But isn't the very definition of real-time: "Real-time (RT) tasks, which require processing within stringent time constraints or deadlines, are integral to a system's operational correctness."? How does that differ from the deadline class? Also, do hard real-time and soft real-time both count as real-time tasks?
d. The description of SCHED_NORMAL is missing:
The text introduces SCHED_BATCH
>SCHED_BATCH is similar to SCHED_NORMAL but it will preempt less frequently, thereby allowing tasks to run longer and make better use of caches but at the cost of interactivity.
and SCHED_IDLE
>SCHED_IDLE is for tasks with very low priority, almost any task can preempt tasks with this policy.
>In other words, it can be described as a SCHED_NORMAL with an even weaker priority.
Both descriptions are phrased relative to SCHED_NORMAL, yet the text never introduces SCHED_NORMAL itself.
e. On the priority of subprocesses, the text only says:
>All forked subprocesses will inherit the normal scheduling policy that is assigned to newly created processes.
but says nothing about real-time tasks. Could the sentence simply read:
>All forked subprocesses will inherit the scheduling policy that is assigned to newly created processes.
instead?
### 2.2 Prior to CFS
### 2.2.3 $O(1)$ Scheduler
a. Perhaps my English is failing me, but I cannot quite parse the following:
>Using the same scheduling for all CPUs, O(n) scheduler worked well with a small number of tasks, but there were performance issues when working with many tasks, since it required O(n) time to schedule tasks, where n is the number of runnable tasks. It also had significant difficulty with SMP systems due to its design, notably:
>1. A single runqueue;
>2. A single runqueue lock; and
>3. An inability to preempt running processes.
Could it be rewritten as follows?
>Using the same scheduling approach for all CPUs, an O(n) scheduler performs well with a small number of tasks. However, it encounters performance issues when handling many tasks, as it requires O(n) time to schedule each task, where n is the number of runnable tasks.
>This scheduler also faces significant challenges in SMP systems due to its design, notably:
>1. A single runqueue;
>2. A single runqueue lock; and
>3. An inability to preempt running processes.
b. On the description of processor affinity:
>Since each core has its own set of registers and cache, this means that it is simpler to restore a task to the core it previously ran on, compared to transferring all the task’s information to a different core, invalidating the information on the old core, and then restoring the layout on another core.
The text says "it is simpler to restore a task to the core it previously ran on, compared to transferring all the task's information to a different core", and this is where I got lost. Why is it simpler? Doesn't a context switch wipe all the task's state off the CPU anyway? Then what difference does the core make? Or is the context kept in a per-CPU cache?
c. On the description of the bitmap:
>To make the process to choose the task on the highest priority list more efficient, a bitmap is used to define when tasks are on a given priority list.
Could it be rewritten as:
>To make the task selection from the highest priority list more efficient, a bitmap is used to indicate when tasks are present on a given priority list
because "process" in the original is easily confused with a process as an execution unit.
d. Also about the bitmap:
>A bitmap records which priority levels have waiting tasks, and true indicates that queue contains a task.
Could it be rewritten as:
>A bitmap records which priority levels have waiting tasks; a true value in the bitmap indicates that the corresponding queue contains at least one task.
This makes it clearer that "true" refers to a single bit in the bitmap, avoiding confusion.
Besides, a queue may well hold more than one task, hence "at least one task".
e. On the description of BITMAP_SIZE:
> As 140÷32 = 4.375, so with 140 levels and 32 bit words, BITMAP SIZE is 5.
Could it be rewritten as follows for easier reading? The original version feels too abrupt:
>Since 140 ÷ 32 = 4.375, we need 5 words to cover 140 priority levels using 32-bit words. Therefore, the BITMAP_SIZE is 5.
### 2.3 Completely Fair Scheduler (CFS)
a. The text says every task uniformly receives $\frac{1}{n}$ of the CPU time:
> Assume that we have an infinitely small
timeslice, and ideally, every task uniformly receives $\frac{1}{n}$ of the processor’s time, with n being the number of runnable tasks, as shown in Figure 2.11. In other words, CPU usage is divided equally among every task. Each task runs in parallel and consumes equal CPU share.
But I do not understand why an infinitely small timeslice must be assumed here ("Assume that we have an infinitely small timeslice"); the text never builds any later reasoning on this assumption.
b. On preemption time and nice values:
>Moreover, starting from Linux v2.6.23, priorities of all processes are static,with one exception: the priority of a regular process can be temporally boosted to the real-time priority, when the process invokes a system call using the RTmutex which prevents the priority inversion problem .
This results in simpler code that can handle nice values better than the previous scheduler; the preemption time is no longer fixed like in the O(1) scheduler, but it is variable.
1. Does "simpler code" refer to CFS?
2. The text says the priority of a regular process is temporarily boosted to avoid priority inversion, then claims "This results in simpler code that can handle nice values better than the previous scheduler". I cannot follow the causal link, nor does the text explain why nice values should therefore be used.
c. The description of `vruntime` says "it is the actual execution time normalized by the number of runnable tasks":
>Instead, for each runnable process, CFS assigns it a virtual runtime (vruntime),which represents how much time the task has run on the ideal multitasking processor.
>Practically, it is the actual execution time normalized by the number of runnable tasks.
But why "the number of runnable tasks"?
The text later explains how `vruntime` is computed: "Then, it increases the virtual runtime by t × weight (based on priority)".
How is that "normalized by the number of runnable tasks"?
d. On the description of fairness:
>Suppose that a task requires 10 minutes to complete its work.
>If 5 such tasks are simultaneously present on a perfect CPU, each will get 20 percent of the computational power, which means that it will be running for 50 instead of 10 minutes.
>However, all 5 tasks will finish their job after exactly this time span, and none of them will have ever been inactive.
So does "this time span" refer to 50 minutes? If so, why "However"? The example does not introduce an opposing point; it simply continues to explain that the 5 tasks finish within 50 minutes.
I also cannot see what this example is meant to demonstrate. Is it that "identical tasks get identical CPU time"?
e. On the description of unfairness:
>If multitasking is simulated by running one process after another, then the process that is currently running is favored over those waiting to be picked
by the scheduler — the poor waiting processes are being treated unfairly.
The unfairness is directly proportional to the waiting time.
So is fairness premised on the waiting times being roughly equal? Doesn't that mean non-concurrent execution is doomed to be unfair?
### 2.3.1 Proportional-Share Scheduling
a. On how a task is selected:
> However, if CFS always devoted the CPU to the task that was furthest behind, it would be constantly switching back and forth between the tasks.
> Instead, the scheduler sticks with the current task until its timeslice runs out or it is preempted by a waking task.
Does this mean a task can be preempted by a higher-priority task, but will not be displaced by another task with a lower vruntime?
b. On when task switches happen:
>The scheduler performs these task switches under two circumstances.
>One is the expiration of a timeslice.
>The other is when a new task enters the runqueue, provided that the currently running task has not just recently started running.
Does "running task has not just recently started running" mean the task has been running too long? Then how does it differ from "expiration of a timeslice"?
### 2.3.3 Assigned time and virtual runtime
While reading the introduction to fixed-point numbers I realized I understand nothing about them, so I went back to review the basics.
### 2.4 EEVDF Scheduler
### 2.5.2 Load balancing
z. The text says load balancing is implemented by periodically checking each CPU's load and, after weighing other factors (e.g. task behavior), deciding whether to migrate tasks; but this consumes quite a few resources.
Is there a way to avoid the imbalance at the moment tasks are first assigned? Is it because the number of tasks in the per-core runqueues cannot be predicted?
a. The text says managing tasks with a global list is intuitive but raises synchronization problems, so operating systems prefer to keep resources each CPU accesses frequently, "such as the list of runnable tasks", local to that CPU. Does this mean the per-core runqueues? Does the "such as" imply other information is kept local as well?
>Therefore, operating system kernels prefer to keep frequently accessed resources, such as the list of runnable tasks, local to each core.
b. The text says the CFS class has three main types of load balancing:
> In Linux, the CFS class has three main types of load balancing:
>1. **Regular** balancing is the standard balancing performed periodically.
>2. **Idle** balancing occurs when a CPU core is about to become idle and can
search for additional work.
>3. **Wakeup** balancing may be executed for a specific entity that is being awakened, placing it on the least occupied core within the nearest shared cache level.
But it later says all three types use the same underlying algorithm; why classify them at all, then?
>These balancing types utilize the same underlying algorithm to calculate the desired load distribution.
c. On how tasks migrate between domains:
>To enable tasks to migrate between domains, a higher-level domain encompasses all the CPUs.
>This domain is divided into two CPU groups, one for each pair.
>Although the two groups are balanced, the scheduler does not attempt to balance the load within a group. This domain seeks to balance the load less frequently and is more tolerant of load imbalances.
So the idea is one big top-level domain encompassing all CPUs, which is then divided into two CPU groups? Does "one for each pair" mean the two groups are disjoint sets?
The next sentence is the interesting one: "Although the two groups are balanced, the scheduler does not attempt to balance the load within a group." The two groups are balanced, yet the CPUs inside a group are not balanced against each other?? And then this top-level domain is said to be more tolerant of imbalance??
I still cannot see how tasks "migrate between domains".
d. On domains:
Is a domain just an abstraction rather than a specific physical entity?
Can it represent a NUMA node? On a UMA system, can it represent CPU groups?
e. On CFS load balancing, the text says that within each domain one core is responsible for balancing the load across the sched groups:
>At every level, one core within the current domain assumes responsibility for equitably distributing the workload across sched groups exclusively within its respective domain.
Are these "sched groups" the same thing as the CPU groups above?
f. On the steps of load balancing:
>1. Designate the local group as the one encompassing the current CPU
What is the "current CPU"? Is it the core that runs the load balancer? Then what would a non-current CPU be?
Finally, once the busiest CPU has been found, its workload is redistributed with the "local CPU":
>Redistribute the workload between the local CPU and the most overloaded CPU in the busiest group by migrating tasks from the latter to the former,which is executing the load balancer code.
But which CPU is the "local CPU" here? The current CPU from above?
Why not split the work with the least loaded, or even idle, CPU instead?
ans: Tasks cannot naively be handed to the idlest CPU:
>One issue to address is that load balancing should not merely involve moving tasks from the busiest core to the least loaded one, as this oversimplification disregards cache locality and NUMA considerations.
### 2.5.3 CPU Isolation
a. The text says throughput and real-time performance can be improved by migrating kernel-induced interrupts away:
>It is possible to enhance throughput and real-time performance by reducing the overhead of interrupt handling.
But it then says this also helps applications using high-throughput, frequently interrupting devices? Is the point to avoid being disturbed?
>This is also good for the application using very high throughput and frequent interrupt device
b. The text explains the importance and use cases of CPU isolation, and the harm load balancing can do (e.g. migrating a task to an unsuitable CPU).
Then a sentence appears that leaves me completely puzzled:
>The other problem is the actual work of migrating threads.
>This can be easily solved by disabling load balancing on the CPUs that should be isolated.
I do not understand what problem "the actual work of migrating threads" poses.
### 2.5.4 Unintended Side Effects
>Consider a scenario with four CPUs in use and one task waiting for processing on CPU0.
If the task on CPU3 terminates, it might take some time for the waiting task on CPU0 to be moved, as the scheduler must put in some ‘effort’ (detect the situation, decide to act, and execute the move), potentially causing additional cache misses.
While it may be better to ignore it, excessively long wait times (more than a few hundred milliseconds) are problematic.
> A busy core that notices a nearby idle core will wake it up and instruct it to perform load balancing
TODO: keep reading and develop insights
TODO: summarize the key points
TODO: quiz 9
TODO: read the EEVDF paper
## TODO: Since Linux v6.6, the default CPU scheduler is EEVDF.