---
tags: NCKU Linux Kernel Internals, 作業系統
---

# Linux Kernel Design: Interrupt

[Linux 核心設計: 中斷處理和現代架構考量](https://hackmd.io/@sysprog/linux-interrupt)

:::danger
The lecture covers interrupt-related topics from a very broad perspective, more than I could fully digest in one pass. These notes only supplement a few basic topics and are not yet complete; see the course recording for details.
:::

## What is an interrupt?

Some material here refers to [CSE 438/598 Embedded Systems Programming](http://rts.lab.asu.edu/web_438_Fall_2014/CSE438_Fall2014_Main_page.htm): [Linux Interrupt Processing and Kernel Thread](http://rts.lab.asu.edu/web_438/CSE438_598_slides_yhlee/438_7_Linux_ISR.pdf)

In short, an interrupt is a mechanism that notifies the CPU that an event has occurred and forces the CPU to respond, whether or not it is busy. When an interrupt occurs, something similar to a context switch happens (only similar; the two are fundamentally different). The hardware saves the state of the current process (usually less state than a full context switch needs), the CPU moves from process context to interrupt context, and once the interrupt type has been identified the corresponding interrupt handler is run.

### Preemptive Context Switching

![](https://i.imgur.com/j7tNtQQ.png)

Context switching / multitasking is either [cooperative](https://en.wikipedia.org/wiki/Cooperative_multitasking) or [preemptive](https://en.wikipedia.org/wiki/Preemption_(computing)). In the former, a thread decides for itself when to yield the CPU to other threads (for example by calling [schedule()](https://elixir.bootlin.com/linux/latest/source/kernel/sched/core.c#L4375)). The latter relies on interrupts: each time the kernel leaves interrupt context it can perform a context switch and hand the CPU to the thread that currently has the highest priority.

### Interrupt Handling

Interrupts come in several types, for example:

* I/O interrupt
* Timer interrupt
* Interprocessor interrupt

![](https://i.imgur.com/o76BTom.png)

When an external hardware device raises a signal, the [PIC](https://en.wikipedia.org/wiki/Programmable_interrupt_controller) receives the interrupt. The PIC translates the signal into a vector, which is used to look up the [IDT](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) and find the start address of the corresponding [ISR / Interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler). Each PIC can handle only a limited number of interrupt lines. Feeding the output of another PIC into one of those lines (cascading) extends the total number of interrupts that can be handled.

> Further reading: [PIC中斷控制器介紹](http://stenlyho.blogspot.com/2008/08/pic.html)

In modern operating systems, the ISR is split into a top half and a bottom half in order to reduce latency. Nested interrupts make interrupt handling more complicated (ISR reentry, mutual exclusion on shared resources, and so on), and the simplest way to avoid them is to disable interrupts while in interrupt context. If interrupts stay disabled for too long, however, the system responds to I/O more slowly and may miss an interrupt.

Splitting the handler into a top half and a bottom half lets the system defer part of the work. The top half runs with interrupts disabled, does only the minimal essential work (for example, marking the type of interrupt that occurred as pending), and then re-enables interrupts. If no further interrupt arrives, the bottom half is processed next. This lowers the latency caused by interrupt handling. In Linux, there are three main mechanisms for deferring interrupt handling:

* softirqs
* tasklets
* workqueues

### Softirq

Softirqs are defined statically when the kernel is compiled, and their handlers are installed by [open_softirq](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/softirq.c#L447).

```c
void open_softirq(int nr, void (*action)(struct softirq_action *))
{
    softirq_vec[nr].action = action;
}
```

Here `nr` indexes `softirq_vec`, and the selected slot is set to the function pointer that handles that softirq.

```c
struct softirq_action
{
    void (*action)(struct softirq_action *);
};

static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;

const char * const softirq_to_name[NR_SOFTIRQS] = {
    "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
    "TASKLET", "SCHED", "HRTIMER", "RCU"
};
```

`softirq_vec` is an array of `softirq_action`, a structure that holds only a pointer to the action function. In the kernel version quoted here, `softirq_vec` has NR_SOFTIRQS (= 10) registered softirqs:

* two for tasklet handling (HI, TASKLET)
* two for networking (NET_TX, NET_RX)
* two for block devices (BLOCK, BLOCK_IOPOLL)
* two for timers (TIMER, HRTIMER)
* one for the scheduler (SCHED)
* one for read-copy-update (RCU)

The same information is also available from `cat /proc/softirqs`.

```c
void raise_softirq(unsigned int nr)
{
    unsigned long flags;

    local_irq_save(flags);
    raise_softirq_irqoff(nr);
    local_irq_restore(flags);
}
```

`raise_softirq` marks a softirq for processing. `local_irq_save` first saves the current [Interrupt flag](https://en.wikipedia.org/wiki/Interrupt_flag) state into `flags` and disables interrupts. `local_irq_restore` does the reverse: it writes the saved flag back, returning to the state before `local_irq_save` (interrupts may end up on or off, depending on the state that was saved). Why are interrupts disabled here?
This is because `raise_softirq_irqoff` sets a bit flag in the pending bitmask (it ORs a 1 into one bit; see `or_softirq_pending`). That bitmask, `__softirq_pending`, is kept per CPU but is shared with interrupt handlers running on the same CPU. If interrupts were left enabled, a handler could run in the middle of the update and raise a softirq itself, causing a race on the variable. Disabling interrupts keeps the update from being interleaved, so no pending bit is lost or corrupted.

```c
inline void raise_softirq_irqoff(unsigned int nr)
{
    __raise_softirq_irqoff(nr);

    if (!in_interrupt())
        wakeup_softirqd();
}
```

`raise_softirq_irqoff` uses `__raise_softirq_irqoff(nr)` to set the bit for `nr` in the softirq bitmask `__softirq_pending`, marking which type of interrupt work is to be deferred. Before returning, it checks whether the CPU is in interrupt context or process context. In interrupt context nothing more is needed: once the interrupt flag is restored and the interrupt returns, the softirq bottom half is processed on the way out. In process context, `wakeup_softirqd` must be called to wake the kernel thread daemon `ksoftirqd`.

```c
asmlinkage __visible void __softirq_entry __do_softirq(void)
{
    unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
    ...
restart:
    while ((softirq_bit = ffs(pending))) {
        ...
        h->action(h);
        ...
    }
    ...
    pending = local_softirq_pending();
    if (pending) {
        if (time_before(jiffies, end) && !need_resched() &&
            --max_restart)
            goto restart;
    }
    ...
}
```

`ksoftirqd` uses `run_ksoftirqd` to check whether any interrupt work has been deferred and calls `__do_softirq` to process it. The `__softirq_pending` bitmask tells which softirqs are waiting. While the deferred work is being processed, new softirqs may keep arriving. Servicing them indefinitely could keep userspace threads from being scheduled, which is why the code sets a time budget (`MAX_SOFTIRQ_TIME`) and a restart limit (`max_restart`).

Checks for pending softirqs are placed at several points in the kernel so that they run regularly. The main check point is in [`do_IRQ`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/irq.c#L218), the place where an actual interrupt is handled. Before `do_IRQ` finishes it calls `exiting_irq()`, which in turn calls `irq_exit()`.

```c
void irq_exit(void)
{
    ...
    if (!in_interrupt() && local_softirq_pending())
        invoke_softirq();
    ...
}
```

`irq_exit` checks whether any softirq is pending. The `invoke_softirq` it calls also ends up in `__do_softirq`, which processes the bottom half.

### Tasklet

Softirqs are performance-oriented: the same softirq can run concurrently on different CPUs, so its handler must be reentrant, which makes it harder to write. Another drawback is that softirq handlers are fixed at compile time and cannot be registered or removed dynamically, which is clearly unfriendly to kernel modules. Tasklets are designed to solve these problems.

```c
void __init softirq_init(void)
{
    int cpu;

    for_each_possible_cpu(cpu) {
        per_cpu(tasklet_vec, cpu).tail =
            &per_cpu(tasklet_vec, cpu).head;
        per_cpu(tasklet_hi_vec, cpu).tail =
            &per_cpu(tasklet_hi_vec, cpu).head;
    }

    open_softirq(TASKLET_SOFTIRQ, tasklet_action);
    open_softirq(HI_SOFTIRQ, tasklet_hi_action);
}
```

During initialization, the code walks over all possible processors (presumably including processors that support hot-plugging?) and initializes the [per_cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) `tasklet_vec` and `tasklet_hi_vec`.

```c
struct tasklet_head {
    struct tasklet_struct *head;
    struct tasklet_struct **tail;
};

static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec);
static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec);
```

Each CPU maintains its own linked lists of tasklets. HI_SOFTIRQ serves high-priority tasklets and TASKLET_SOFTIRQ serves normal ones. At the end of `softirq_init`, the `open_softirq` function discussed earlier is called to register the two tasklet-related softirqs.

```c
void tasklet_init(struct tasklet_struct *t,
                  void (*func)(unsigned long), unsigned long data)
{
    t->next = NULL;
    t->state = 0;
    atomic_set(&t->count, 0);
    t->func = func;
    t->data = data;
}
```

Tasklets are then operated through the API that the Linux kernel provides. One example is `tasklet_init`, which initializes a `tasklet_struct` dynamically.

```c
DECLARE_TASKLET(name, func, data);
DECLARE_TASKLET_DISABLED(name, func, data);
```

The two macros above define a tasklet statically.

```c
void tasklet_schedule(struct tasklet_struct *t);
void tasklet_hi_schedule(struct tasklet_struct *t);
void tasklet_hi_schedule_first(struct tasklet_struct *t);

static inline void tasklet_schedule(struct tasklet_struct *t)
{
    /* Queue the tasklet only if it was not already scheduled. */
    if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
        __tasklet_schedule(t);
}

void __tasklet_schedule(struct tasklet_struct *t)
{
    unsigned long flags;

    local_irq_save(flags);
    t->next = NULL;
    /* Append t to the tail of this CPU's tasklet list. */
    *__this_cpu_read(tasklet_vec.tail) = t;
    __this_cpu_write(tasklet_vec.tail, &(t->next));
    raise_softirq_irqoff(TASKLET_SOFTIRQ);
    local_irq_restore(flags);
}
```

The API above marks a tasklet as ready to run (a different function is used depending on the required priority). Taking `tasklet_schedule` as an example, it sets the `TASKLET_STATE_SCHED` bit in the tasklet struct's state and, if the bit was not already set, calls `__tasklet_schedule`. `__tasklet_schedule` works much like the `raise_softirq` described earlier: it saves the interrupt flag and disables interrupts, updates `tasklet_vec`, and calls `raise_softirq_irqoff` to mark the softirq as pending. When the kernel later processes the bottom half, the softirq action registered earlier, `tasklet_action`, is called.

```c
static void tasklet_action(struct softirq_action *a)
{
    /* Detach this CPU's list with interrupts off, then reset it to empty. */
    local_irq_disable();
    list = __this_cpu_read(tasklet_vec.head);
    __this_cpu_write(tasklet_vec.head, NULL);
    __this_cpu_write(tasklet_vec.tail, this_cpu_ptr(&tasklet_vec.head));
    local_irq_enable();

    while (list) {
        if (tasklet_trylock(t)) {
            t->func(t->data);
            tasklet_unlock(t);
        }
        ...
    }
}
```

In `tasklet_action`, local interrupts are disabled first. The local CPU's tasklet linked list is then moved into a temporary variable and the per-CPU list is reset to NULL. After that, interrupts are re-enabled and the detached list is walked.

```c
static inline int tasklet_trylock(struct tasklet_struct *t)
{
    return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state);
}
```

`tasklet_trylock` tries to set the `TASKLET_STATE_RUN` bit in the state. If it succeeds, the function registered through `tasklet_init` is executed, and `tasklet_unlock` clears the bit afterwards.

Note that softirqs and tasklets both run in interrupt context (software irq context), so they may not sleep, be preempted, or trigger a context switch, and they may not access userspace data. In addition, the same tasklet is never processed on several CPUs in parallel: each tasklet runs only on the CPU that scheduled it, which improves cache usage. The flip side is that this design can be suboptimal, because other CPUs that happen to be idle cannot be used to run the tasklet.

> * [why tasklet cant sleep](https://lists.kernelnewbies.org/pipermail/kernelnewbies/2011-November/003812.html)
> * [Why kernel code/thread executing in interrupt context cannot sleep?](https://stackoverflow.com/questions/1053572/why-kernel-code-thread-executing-in-interrupt-context-cannot-sleep/1056710#1056710)

### Work queue

A workqueue is another way to process the bottom half. Its defining feature is that the work runs in the process context of a kernel thread rather than in interrupt context.

```c
struct work_struct {
    atomic_long_t data;
    struct list_head entry;
    work_func_t func;
#ifdef CONFIG_LOCKDEP
    struct lockdep_map lockdep_map;
#endif
};
```

The core idea of workqueues is to hand deferred interrupt work to per-CPU kernel threads. The basic unit is described by a [`work_struct`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/workqueue.h#L100). `func` is the function that the scheduled work executes. `data` is used internally by the workqueue code to hold state flags and a reference to the owning pool, so it is not a payload argument for `func`.

```c
#define DECLARE_WORK(n, f) \
    struct work_struct n = __WORK_INITIALIZER(n, f)
```

`DECLARE_WORK` defines a work item statically.

```c
#define INIT_WORK(_work, _func) \
    __INIT_WORK((_work), (_func), 0)

#define __INIT_WORK(_work, _func, _onstack)                     \
    do {                                                        \
        __init_work((_work), _onstack);                         \
        (_work)->data = (atomic_long_t) WORK_DATA_INIT();       \
        INIT_LIST_HEAD(&(_work)->entry);                        \
        (_work)->func = (_func);                                \
    } while (0)
```

Alternatively, `INIT_WORK` initializes one dynamically.

```c
static inline bool queue_work(struct workqueue_struct *wq,
                              struct work_struct *work)
{
    return queue_work_on(WORK_CPU_UNBOUND, wq, work);
}
```

Once a `work_struct` has been set up, `queue_work` adds it to a workqueue. It calls `queue_work_on` with `WORK_CPU_UNBOUND`, which means the work is not tied to any particular CPU.

### Reference

* [Introduction to deferred interrupts (Softirq, Tasklets and Workqueues)](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html)
* [linux kernel的中断子系统之（八）：softirq](http://www.wowotech.net/irq_subsystem/soft-irq.html)
* [linux kernel的中断子系统之（九）：tasklet](http://www.wowotech.net/irq_subsystem/tasklet.html)
* [softirq, tasklet和workqueue的区别](https://blog.csdn.net/jusang486/article/details/51155277)

## TODO

- [ ] Read the softirq, tasklet, and work queue source code myself, and use experiments to fill in details that secondhand articles may have missed
- [ ] Study the additional considerations for interrupts on multicore systems
- [ ] Study how virtualization frameworks are implemented (how do they work?) and how they affect interrupt handling in the operating system
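
As a starting point for the experiments mentioned in the TODO list, the pending-bitmask idea from the Softirq section can be tried out in plain userspace C. This is only a rough sketch under stated assumptions, not kernel code. Every name ending in `_sim`, the `record` / `run_log` helpers, and the reduced four-entry softirq list are made up for illustration. There is no interrupt disabling, no per-CPU state, and no time budget; only the "set a bit now, dispatch in bit order later, with a restart limit" behaviour is modelled.

```c
#include <assert.h>
#include <strings.h> /* ffs(): POSIX, 1-based index of the lowest set bit */

/* Reduced, hypothetical softirq list; a lower number is served first. */
enum { HI_SOFTIRQ, TIMER_SOFTIRQ, NET_TX_SOFTIRQ, NET_RX_SOFTIRQ, NR_SOFTIRQS };

typedef void (*softirq_fn)(void);

static softirq_fn softirq_vec[NR_SOFTIRQS]; /* handler table */
static unsigned int softirq_pending;        /* one pending bit per softirq */

/* Hypothetical log so the dispatch order can be inspected afterwards. */
#define RUN_LOG_MAX 16
static int run_log[RUN_LOG_MAX];
static int run_count;

void record(int nr)
{
    if (run_count < RUN_LOG_MAX)
        run_log[run_count++] = nr;
}

/* Toy handlers: each one only records that it ran. */
void hi_action(void)         { record(HI_SOFTIRQ); }
void timer_action(void)      { record(TIMER_SOFTIRQ); }
void net_rx_action_sim(void) { record(NET_RX_SOFTIRQ); }

/* Install a handler, in the spirit of open_softirq(). */
void open_softirq_sim(int nr, softirq_fn action)
{
    assert(nr >= 0 && nr < NR_SOFTIRQS);
    softirq_vec[nr] = action;
}

/* Mark a softirq as pending: only a bit is set, nothing runs yet. */
void raise_softirq_sim(int nr)
{
    assert(nr >= 0 && nr < NR_SOFTIRQS);
    softirq_pending |= 1u << nr;
}

/*
 * Dispatch pending softirqs, lowest bit first. If handlers raise new
 * softirqs while running, loop again, at most max_restart passes in
 * total. Returns the number of pending bits consumed; bits left over
 * when the limit is hit stay set in softirq_pending.
 */
int do_softirq_sim(int max_restart)
{
    int handled = 0;
    unsigned int pending = softirq_pending;

    while (pending && max_restart-- > 0) {
        int bit;

        softirq_pending = 0; /* snapshot taken, start collecting anew */
        while ((bit = ffs((int)pending)) != 0) {
            int nr = bit - 1;

            if (softirq_vec[nr])
                softirq_vec[nr]();
            handled++;
            pending &= ~(1u << nr);
        }
        pending = softirq_pending; /* anything raised by the handlers? */
    }
    return handled;
}
```

If NET_RX is raised twice and HI once before `do_softirq_sim(10)` is called, the HI handler runs first and the NET_RX handler runs only once, because raising merely sets a bit and does not count occurrences.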
