Materials to read, part 2

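Linux Kernel Design: Interrupt Handling and Modern Architecture Considerations (Linux 核心設計: 中斷處理和現代架構考量)

Why do we need interrupts?

An interrupt is a mechanism that forcibly changes the processor's execution flow, so the CPU can handle important events promptly without constantly polling peripheral devices.

Updated 4/15

Especially when the processor is fast but the peripheral is slow (e.g. I/O devices), an interrupt lets the processor be notified only when needed, avoiding wasted CPU time.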

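How an interrupt happens

Hardware interrupt:
First, the hardware raises a signal
The PIC (Programmable Interrupt Controller) / APIC (Advanced Programmable Interrupt Controller) translates the IRQ into an interrupt vector
A vector is simply a number that identifies the interrupt
The CPU receives the interrupt and looks up the corresponding handler in the IDT
It saves the current context and enters interrupt context
It runs the ISR (Interrupt Service Routine)
When the ISR is done it returns, possibly followed by a context switch

ARM processors treat an interrupt as a special kind of exception
interrupt occurs -> IRQ mode
The interrupt return has to be emulated
How the context is saved depends on the profile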

Software interrupt:

Interrupt number 128 (=0x80) is released. In Linux, it corresponds to a
system call interrupt.

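Triggered explicitly by a program instruction
Does not go through the PIC / APIC; the interrupt vector is specified directly
Does it not enter interrupt context ???
For example, a system call
See Demystifying the Linux CPU Scheduler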

Performing a system call in assembly code involves filling right CPU registers with the syscall arguments and then using a special assembly instruction.
… invoke the interrupt 128 (by the instruction int 0x80).
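
The quote describes the legacy 32-bit x86 entry path (int 0x80); on x86-64 the syscall instruction is normally used instead. Either way, the idea of "fill registers, then trap into the kernel" can be tried from C through the libc syscall() wrapper. The sketch below is only an illustration, assuming Linux with glibc; raw_getpid and raw_matches_libc are hypothetical helper names, not something from the book.

```c
#define _GNU_SOURCE          /* assumption: glibc, which declares syscall() */
#include <assert.h>
#include <sys/syscall.h>     /* SYS_getpid */
#include <sys/types.h>
#include <unistd.h>          /* syscall(), getpid() */

/* Hypothetical helper: issue getpid as a raw system call by number.
 * libc places the number and arguments in the registers the kernel
 * expects and executes the trap instruction for the current ABI. */
static long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/* Hypothetical self-check: the raw call and the libc wrapper must agree. */
static int raw_matches_libc(void)
{
    long pid = raw_getpid();

    assert(pid > 0);         /* getpid cannot fail */
    return pid == (long)getpid();
}
```

Calling raw_matches_libc() from a test program should return 1, since both paths end up in the same kernel handler.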

Question:
Does a system call count as a software interrupt?
The syscall chapter of the book does describe it as an interrupt

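PIC (Programmable Interrupt Controller)

Interrupt priorities are configurable; when several interrupts arrive at the same time, the PIC picks the one to service first

APIC (Advanced Programmable Interrupt Controller)

Developed to support multi-core systems
The interrupt number is sent over the data bus, and the APIC (the I/O APIC together with the per-CPU local APICs) decides which processor it is delivered to (mask) (routing)

masking

Decides which IRQ lines are let through
Either global or per-IRQ: all lines or specific ones (bitwise operation)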
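
The "bitwise operation" part of masking can be sketched in user-space C. This is a toy model, not kernel code: struct irq_ctrl, the 16-line limit (modeled on a cascaded pair of 8259 PICs), and all function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_IRQ_LINES 16u   /* assumption: 16 lines, as on a cascaded 8259 pair */

/* Hypothetical controller state: bit n set means IRQ line n is masked. */
struct irq_ctrl {
    uint16_t line_mask;     /* per-IRQ mask bits */
    bool global_disable;    /* models the global "all interrupts off" switch */
};

/* Per-IRQ masking: set one bit. */
static void mask_irq(struct irq_ctrl *c, unsigned irq)
{
    assert(irq < NR_IRQ_LINES);
    c->line_mask |= (uint16_t)(1u << irq);
}

/* Per-IRQ unmasking: clear one bit. */
static void unmask_irq(struct irq_ctrl *c, unsigned irq)
{
    assert(irq < NR_IRQ_LINES);
    c->line_mask &= (uint16_t)~(1u << irq);
}

/* An IRQ gets through only if neither the global switch
 * nor its own mask bit blocks it. */
static bool irq_deliverable(const struct irq_ctrl *c, unsigned irq)
{
    assert(irq < NR_IRQ_LINES);
    return !c->global_disable && !(c->line_mask & (1u << irq));
}
```

Masking IRQ 3 leaves IRQ 4 deliverable, while setting global_disable blocks every line regardless of the per-IRQ bits.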

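SMP IRQ affinity

SMP (Symmetric multiprocessing)

A system with multiple processors that share a common bus and central memory.

IPI (inter-processor interrupt)

Processors notify each other through IPIs

TPR (task priority register)

IDT (interrupt descriptor table)

Stores the handler addresses
Indexed by the rule IRQ# + 32
Entries 0-31 are reserved for non-maskable interrupts / exceptions
On early ARM, the corresponding table stores instructions rather than addresses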
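
The IRQ# + 32 indexing rule and the table lookup can be sketched as follows. This is a simplified user-space model: a real x86 IDT holds gate descriptors rather than plain function pointers, and the names idt, register_irq, dispatch and timer_handler are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NR_VECTORS       256  /* x86 has 256 interrupt vectors */
#define FIRST_IRQ_VECTOR  32  /* vectors 0-31: exceptions / NMI */

typedef void (*handler_fn)(void);

/* Hypothetical stand-in for the IDT: one handler pointer per vector. */
static handler_fn idt[NR_VECTORS];
static int timer_ticks;       /* side effect of the demo handler */

static void timer_handler(void)
{
    timer_ticks++;
}

/* The indexing rule from the notes: vector = IRQ# + 32. */
static int irq_to_vector(int irq)
{
    assert(irq >= 0 && irq < NR_VECTORS - FIRST_IRQ_VECTOR);
    return irq + FIRST_IRQ_VECTOR;
}

static void register_irq(int irq, handler_fn fn)
{
    idt[irq_to_vector(irq)] = fn;
}

/* Conceptually what the CPU does: index the table by vector and jump.
 * Returns 0 if a handler ran, -1 if the slot is empty. */
static int dispatch(int vector)
{
    assert(vector >= 0 && vector < NR_VECTORS);
    if (idt[vector] == NULL)
        return -1;
    idt[vector]();
    return 0;
}
```

After register_irq(0, timer_handler), dispatch(32) runs the handler, and dispatch(33) reports an empty slot.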

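Nested interrupt

Another interrupt arrives while one is still being handled
The usual approach is to disable interrupts temporarily and re-enable them when done
But the Linux kernel tries to minimize this situation to avoid system latency
A page fault may also occur -> even longer latency

But not every kind of page fault is serious:

minor: the memory being accessed has not yet been marked valid by the MMU (Memory Management Unit), i.e. the page is present and only the mapping is missing; the impact is small
major: the page cannot be found in memory, so the operating system has to read it from swap; the impact is larger
invalid: segmentation fault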

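How preemptive context switching relates to interrupts

An interrupt is preemptive by nature; as noted earlier, it forcibly changes the CPU's execution flow
interrupt occurs -> current execution is suspended -> processor state changes -> the ISR runs
on interrupt return -> a reschedule may happen
The reschedule on return is what implements preemptive multitasking. The system can pick the next program to run based on priority or another scheduling policy, rather than simply returning to the interrupted program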

Linux kernel preemption

cooperative multitasking vs. preemptive multitasking

refer to 1.3.3

The difference lies in how control of the current processor is given up?
Whether the operating system steps in ???
preemptive multitasking:
the operating system initiates it by actively pausing the running process.

cooperative multitasking:
the former process voluntarily yields control of the processor by explicitly telling the operating system to give someone else a turn.
Drawback:
Unsafe; there is no guarantee that a process will actually give up the processor
The timer interrupt is what interrupts into the kernel and triggers the scheduler

Deferring interrupt handling

top: hard IRQ
Top halves do not interleave; when an interrupt arrives, handle it quickly
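bottom: soft IRQ

What is a softirq?
An interrupt request raised by software; it can be scheduled and therefore deferred

Whether something is a top or bottom half depends on whether software schedules it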

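interrupt context

Describes how far the interrupt handling has progressed
How is it described? Through an interrupt stack

Why distinguish interrupt context from process context?
First, what is a context?
jserv's analogy: the "previously on..." recap of 還珠格格 (My Fair Princess)
If something is called a context, it can be saved
If it can be saved, it can be swapped, i.e. switched

What exactly needs to be saved?
The contents of an interrupt context and a process context are not interchangeable
They sit at different levels; interrupt context ranks higher

Is kxo entirely in the bottom half?

jserv: Yes

Does the timer count as a top half or not?
Since it runs in a softirq, it is emulated at the software level, so strictly speaking it is not a top half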

Side note: what is a kernel thread?
Kernel threads perform a specific system task. They are created by the kernel, live exclusively in kernel space, and never switched to user mode. Their address space is the whole kernel space and they can use it whenever they want. Besides this, they are fully schedulable and preemptible entities just like the normal processes.

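critical top-half
Interrupts are disabled, which means the work is critical and must be handled immediately
non-critical top-half
bottom-half
do it later; can be rescheduled

How to "do interrupt later", i.e. defer interrupt handling?
All three of the following mechanisms are bottom halves

softirq && tasklet

softirqs
Can reschedule itself
This matters for user programs, which may be starved as a result
ksoftirqd (a kernel thread) is used for scheduling
Per-CPU and can run concurrently, so handlers must be reentrant functions

In the 4/17 lecture it was said that ksoftirqd cannot schedule ???
Can it be scheduled or not?
Can it schedule others or not?
I honestly don't get it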

Updated 4/19
Thanks to 邱繼寬 for clearing this up
Also a senior student's notes: Linux 核心設計: Interrupt

             interrupt occurs
                   |
                   v
        raise_softirq() is called
                   |
    if the interrupt context allows it → call do_softirq()
                   |
    otherwise (e.g. it takes too long)
                   v
        set the softirq pending mask
                   |
                   v
        ksoftirqd/<N> is woken up
                   |
                   v
              do_softirq()

Each CPU has its own ksoftirqd, which is a dedicated kernel thread. N is the CPU id.

$ systemd-cgls -k | grep ksoft
├─   3 [ksoftirqd/0]
├─  13 [ksoftirqd/1]
├─  18 [ksoftirqd/2]
├─  23 [ksoftirqd/3]
├─  28 [ksoftirqd/4]
├─  33 [ksoftirqd/5]
├─  38 [ksoftirqd/6]
├─  43 [ksoftirqd/7]

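Each ksoftirqd is tied to one CPU. The kernel keeps a softirq_vec array with one entry per softirq type (for example HI_SOFTIRQ and TASKLET_SOFTIRQ in the figure, which handle tasklets), while the pending state is tracked separately for each CPU. Each entry holds the handler for that softirq type; for tasklets, the handler works through a per-CPU linked list of queued work.

You can think of each softirq type as a separate todo list. The lists themselves cannot be reordered: once a softirq type is picked for processing, its tasks (the nodes of the linked list) are walked in a loop and run in order (FIFO), with no finer-grained rescheduling.

If softirqs cannot be finished quickly, the remaining work is deferred to the ksoftirqd kernel thread and processed later on that CPU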

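/* Names shown in /proc/softirqs, indexed by softirq number.
 * Older kernels call the sixth entry "BLOCK_IOPOLL"; newer ones use
 * "IRQ_POLL", which matches the sample output below. */
const char * const softirq_to_name[NR_SOFTIRQS] = {
        "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "IRQ_POLL",
        "TASKLET", "SCHED", "HRTIMER", "RCU"
};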

Watch softirq activity on the system with the following command, refreshing every 0.5 seconds:

$ watch -n 0.5 cat /proc/softirqs

Sample output:

Every 0.5s: cat /proc/softirqs                                        salmon-ASUS-TUF-Dash-F15-FX516PR: Sun Apr 20 09:52:12 2025

                    CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
          HI:      13620      14888      15456       8635      21584      16341      16559     192842
       TIMER:      31123      35333      27050      22988      27716      23573      25453      43723
      NET_TX:          1          2          2          0          0          0          1          2
      NET_RX:        615        654        841        769      10915        919        540       2295
       BLOCK:        809       1149       1847       1003        982        913       1764       1281
    IRQ_POLL:          0          0          0          0          0          0          0          0
     TASKLET:         46       1382        293         43       6762        260         37        165
       SCHED:     223430     120754      87280      79769      79099      72653      75886      78875
     HRTIMER:          0          0          0          0         27          0          0          0
         RCU:      53206      49595      49238      48867      49875      48368      49420      52369

Side note: why can we simply use cat? It comes from how Linux file systems work:
Demystifying the Linux CPU Scheduler, 1.2.2
It provides information in a (mostly) human-readable form by reading the files in /proc and parsing the results as strings. In this way, no system call is needed except for those interacting with the filesystem.

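So where does "scheduling" come in?

There is a processing order among softirq types; for example HI_SOFTIRQ is handled before TASKLET_SOFTIRQ. This order is built into the kernel's processing logic; it is not a dynamically adjustable priority.
Once a softirq is marked active, the kernel handles it at a suitable moment
- before returning from the interrupt
- or by handing it to ksoftirqd, which then goes through the marked softirqs one by one in the fixed order of softirq types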
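
The "mark as pending, then process in a fixed type order" idea can be sketched in user-space C. This is a toy model of a single CPU, not the kernel implementation: the demo_ prefixed functions only mimic the roles of open_softirq, raise_softirq and do_softirq, and the four-entry enum is an arbitrary subset chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Demo subset of softirq types; a lower number is processed first. */
enum { DEMO_HI, DEMO_TIMER, DEMO_NET_TX, DEMO_NET_RX, DEMO_NR_SOFTIRQS };

typedef void (*softirq_fn)(void);

static softirq_fn demo_vec[DEMO_NR_SOFTIRQS]; /* one handler per type */
static uint32_t demo_pending;                 /* models one CPU's pending bitmask */
static int run_order[8];                      /* order in which handlers ran */
static int n_run;

/* Registration: remember the handler for a type. */
static void demo_open_softirq(int nr, softirq_fn fn)
{
    assert(nr >= 0 && nr < DEMO_NR_SOFTIRQS);
    demo_vec[nr] = fn;
}

/* Activation: only sets a bit; nothing runs yet. */
static void demo_raise_softirq(int nr)
{
    assert(nr >= 0 && nr < DEMO_NR_SOFTIRQS);
    demo_pending |= 1u << nr;
}

/* Execution: snapshot and clear the mask, then scan bits from low to high,
 * so the order depends on the type number, not on when it was raised. */
static void demo_do_softirq(void)
{
    uint32_t snapshot = demo_pending;

    demo_pending = 0;
    for (int nr = 0; nr < DEMO_NR_SOFTIRQS; nr++) {
        if ((snapshot & (1u << nr)) && demo_vec[nr])
            demo_vec[nr]();
    }
}

static void record(int nr)
{
    assert(n_run < (int)(sizeof run_order / sizeof run_order[0]));
    run_order[n_run++] = nr;
}

static void hi_action(void)     { record(DEMO_HI); }
static void net_rx_action(void) { record(DEMO_NET_RX); }
```

Raising DEMO_NET_RX before DEMO_HI still runs the HI handler first, which is the "fixed order, not adjustable priority" point made above.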

To summarize, each softirq goes through the following stages:
Registration of a softirq with the open_softirq function.
Activation of a softirq by marking it as deferred with the raise_softirq function.
After this, all marked softirqs will be triggered in the next time the Linux kernel schedules a round of executions of deferrable functions.
And execution of the deferred functions that have the same type.

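The following is excerpted from the 4/15 class discussion

If a tasklet runs in the context of a process such as systemd-journal, the softirq was not deferred; it was run "on the way out", just before control returned to that process. Without this concept, one will misjudge the process logic and the scheduling model.

Execution mode	Description
direct softirq execution	Runs immediately in the caller's context; timing is hard to predict
deferred by ksoftirqd	Deferred to ksoftirqd; independent context, predictable behavior

Is ksoftirqd preemptible or not?
No: a kernel thread bound to a CPU -> atomic context
Yes: it is just a container and can be preempted
-> so the first step is to figure out which mode applies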

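tasklets
Can be created or removed dynamically
Run one at a time and are short, so they need not be reentrant
Multiple tasklets may require locking
atomic

workqueue
Can sleep, which means context switches happen
For longer-running work
Does not need to be atomic
Needs a wake-up mechanism to maintain state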

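Threaded IRQ

Linux has supported threaded IRQs since 2.6.30.
The hardware interrupt handler runs in a dedicated kernel thread (which can sleep and be scheduled), avoiding long stays in interrupt context.
This moves most of the interrupt handling logic out of the interrupt and into a kernel thread
task_struct
Runs in process context

A bit like both ksoftirqd and workqueue?
todo

I realized my grasp of scheduling is very abstract, so I decided to go and study properly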

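Linux Kernel Design: File System Concepts and Implementation Techniques (Linux 核心設計: 檔案系統概念及實作手法)

The "everything is a file" idea and how to read it

Linux has many enhancements, such as the buffer cache and device drivers
and it supports a great many file systems

With FUSE, a file system can be implemented in user space
Advantage: a bug in the file system does not bring down the whole kernel
Extends existing functionality, e.g. sshfs

Everything is a file -> "Everything is a file descriptor" is more precise
Generalizes I/O
open(), read(), write(), and close() -> present since the first edition of Unix
These system calls can all be used for I/O, not necessarily on files; the target may also be memory and so on

The device driver's job is to hook into the generic interface above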

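A file descriptor is a number
宅: a queue ticket number

File descriptors and open files

Everything actually operates on fds
A file can be opened by multiple processes (multi-user, multitasking)
(separate processes -> separate address spaces, as mentioned in the notes above)
Two different programs can open the same file
Internally the file system only cares about the inode (index node)
Access through the inode needs synchronization

The figure above shows that a file descriptor is unique only within a process; different processes can have the same fd value referring to different files

The Open File Table (open file descriptions) records the state of every open file, including a pointer to the corresponding inode in the inode table

On other operating systems the Open File Table is apparently not system-wide?
Apparently each process has its own Open File Table

On fork(), what gets duplicated is the file descriptor table; no new inode is created

Why are the duplicate family of system calls (dup) needed?
fork starts the child with a copy of the parent's file descriptor table
After fork, the original process and the new process have two separate address spaces. Each has its own fd table, but corresponding fds refer to the same open file table entry, and therefore to the same inode and the same file offset.
So it matters which process an fd belongs to
An fd number is only meaningful inside that process. Within one process, dup creates another fd that refers to the same open file; across processes, the open file is shared through fork inheritance.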

struct task_struct {
    /* ... */
    struct files_struct *files;   /* this process's open file descriptor table */
    /* ... */
};
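
One way to check what dup() and fork() actually share is to look at the file offset, which lives in the open file description rather than in the fd itself. The sketch below assumes a POSIX system with a writable /tmp; open_scratch, offset_after_dup, offset_after_fork and check_fd_sharing are hypothetical helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>      /* mkstemp */
#include <sys/types.h>
#include <sys/wait.h>    /* waitpid */
#include <unistd.h>      /* close, dup, fork, lseek, read, unlink, write */

/* Create a scratch file holding "abcdef"; return an fd at offset 0, or -1. */
static int open_scratch(void)
{
    char path[] = "/tmp/fd-demo-XXXXXX";   /* assumption: /tmp is writable */
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                          /* file lives until the last fd closes */
    if (write(fd, "abcdef", 6) != 6 || lseek(fd, 0, SEEK_SET) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Read 2 bytes via the original fd, then report the offset seen via the dup. */
static long offset_after_dup(void)
{
    char buf[2];
    long off = -1;
    int fd = open_scratch();
    int fd2;

    if (fd < 0)
        return -1;
    fd2 = dup(fd);
    if (fd2 >= 0) {
        if (read(fd, buf, sizeof buf) == 2)
            off = (long)lseek(fd2, 0, SEEK_CUR);
        close(fd2);
    }
    close(fd);
    return off;
}

/* The child reads 3 bytes via its inherited fd; report the parent's offset. */
static long offset_after_fork(void)
{
    char buf[3];
    long off = -1;
    int status;
    int fd = open_scratch();
    pid_t pid;

    if (fd < 0)
        return -1;
    pid = fork();
    if (pid == 0)
        _exit(read(fd, buf, sizeof buf) == 3 ? 0 : 1);
    if (pid > 0 && waitpid(pid, &status, 0) == pid)
        off = (long)lseek(fd, 0, SEEK_CUR);
    close(fd);
    return off;
}

static void check_fd_sharing(void)
{
    long via_dup = offset_after_dup();
    long via_fork = offset_after_fork();

    assert(via_dup == 2);    /* dup(): both fds share one open file description */
    assert(via_fork == 3);   /* fork(): the child's fd shares it with the parent */
    (void)via_dup;
    (void)via_fork;
}
```

In both cases the offset moved even though the read went through a different fd, which shows that the fd tables are separate but the underlying open file entry is shared.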

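Blocking I/O Model

Linux was designed from the start for multiple users and multitasking, an environment where frequent I/O is expected.

Traditional I/O calls such as read() are blocking: if no data is available, the calling process stops and waits for data.

select provides synchronous I/O multiplexing
It gets an effect similar to multitasking out of a single task

Limitation: it can monitor at most 1024 fds at once (the default FD_SETSIZE)

The advantage of select() + read() over a lone read() is that select() can wait on multiple file descriptors (fds) at the same time

Everything is a file descriptor -> everything can be monitored through an I/O multiplexer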
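
Waiting on several fds at once with select() might look like the sketch below. It assumes a POSIX system; wait_readable and demo_select are hypothetical helper names, and two pipes stand in for whatever devices or sockets would be monitored in practice.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>   /* select, fd_set, FD_* macros */
#include <sys/time.h>     /* struct timeval */
#include <unistd.h>       /* pipe, write, close */

/* Block until fd_a or fd_b is readable.
 * Returns the ready fd, or -1 on timeout or error. */
static int wait_readable(int fd_a, int fd_b, int timeout_sec)
{
    fd_set rfds;
    struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };
    int maxfd = fd_a > fd_b ? fd_a : fd_b;

    /* select() can only track fds below FD_SETSIZE (typically 1024). */
    assert(fd_a >= 0 && fd_a < FD_SETSIZE);
    assert(fd_b >= 0 && fd_b < FD_SETSIZE);

    FD_ZERO(&rfds);
    FD_SET(fd_a, &rfds);
    FD_SET(fd_b, &rfds);

    if (select(maxfd + 1, &rfds, NULL, NULL, &tv) <= 0)
        return -1;
    return FD_ISSET(fd_a, &rfds) ? fd_a : fd_b;
}

/* Demo: two pipes, data written only to the second one.
 * Returns 1 if select() reported the second pipe, 0 otherwise. */
static int demo_select(void)
{
    int a[2], b[2];
    int ok = 0;

    if (pipe(a) != 0)
        return 0;
    if (pipe(b) != 0) {
        close(a[0]);
        close(a[1]);
        return 0;
    }
    if (write(b[1], "x", 1) == 1)
        ok = (wait_readable(a[0], b[0], 1) == b[0]);
    close(a[0]);
    close(a[1]);
    close(b[0]);
    close(b[1]);
    return ok;
}
```

A lone blocking read() on the first pipe would hang here, since nothing is ever written to it; select() lets the single thread notice that the second pipe is the one with data.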

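Why not use multiple threads?
Context switches are unavoidable and things get more complicated
And for simple but slow I/O work, using many threads just means looking busy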