---
tags: Linux Kernel Internals, 作業系統
---

# Linux Kernel Design: Scheduler (10): PREEMPT_RT

## Overview of PREEMPT_RT

> [Linux 核心設計: PREEMPT_RT 作為邁向硬即時作業系統的機制](https://hackmd.io/@sysprog/preempt-rt)

The previous chapter, [Deadline Scheduling](https://hackmd.io/@RinHizakura/Bk4D0BhaA), should have given readers a basic understanding of how a real-time system is defined. That chapter focused mainly on soft real-time, however, and in practice hard real-time attracts more attention in system software.

That attention reflects real demand. For some tasks, finishing within a specified time limit matters more than finishing faster most of the time. For example, once a self-driving car detects an obstacle, it must stop within the specified time **every time**. Such timing guarantees are critical in medical, aviation, and other industrial fields, where a slight deviation can mean substantial financial loss or even harm to human life.

Linux was originally designed as a time-sharing operating system. On the road to making it satisfy hard real-time requirements, the key is to reduce the amount of kernel code that cannot be preempted, which is a major goal of the [`PREEMPT_RT`](https://en.wikipedia.org/wiki/PREEMPT_RT) series of changes. For example, under `PREEMPT_RT` interrupt handlers and softirqs are moved into kernel threads, so each handler runs in its own schedulable context and can itself be preempted. Likewise, an ordinary spinlock becomes a sleeping lock, so a task holding the lock can be preempted and a task waiting for it sleeps instead of spinning, which allows other real-time tasks to be scheduled. This article looks at the concepts and tools related to `PREEMPT_RT`.

## Locks on PREEMPT_RT and non-PREEMPT_RT

### Lock Type Overview

The Linux kernel provides a variety of locking primitives, which fall into three categories:

* Sleeping locks
* CPU local locks
* Spinning locks

#### Sleeping Lock

As the name suggests, a task that contends for a sleeping lock may go from the running state to the sleeping state, so these locks can only be acquired in preemptible task context.

:::info
In practice, the kernel implements `try_lock()` so that a sleeping lock can be acquired from other contexts. Unless it is truly necessary, though, locks should be acquired in a preemptible context.
:::

The following locks are sleeping locks:

* mutex
* rt_mutex
* semaphore
* rw_semaphore
* ww_mutex
* percpu_rw_semaphore

Under `PREEMPT_RT`, the following locks also become sleeping locks:

* local_lock
* spinlock_t
* rwlock_t

#### CPU local locks

On non-PREEMPT_RT kernels, local_lock is a wrapper around the primitives that disable preemption and interrupts. It gives the protected scope explicit semantics, which benefits static analysis and the kernel's runtime deadlock detector ([lockdep](https://docs.kernel.org/locking/lockdep-design.html)), and the interface also makes PREEMPT_RT easier to integrate.

> [Local locks in the kernel](https://lwn.net/Articles/828477/)

#### Spinning locks

Linux has two kinds of spinlocks:

* raw_spinlock_t
* bit spinlocks

On a non-RT kernel, the following two are also spinlocks:

* spinlock_t
* rwlock_t

A spinlock implicitly disables preemption, and the lock/unlock functions offer further protection depending on the suffix in their names:

| Postfix | Description |
| ------------------ |:---------------------------------------- |
| `_bh()` | Disable/enable bottom halves (softirqs) |
| `_irq()` | Disable/enable interrupts |
| `_irqsave/_irqrestore()` | Save and disable / restore the interrupt-disabled state |

All of the locks above except semaphores have strict owner semantics: the context (task) that acquired the lock must be the one to release it. This strictness pays off for the [priority inversion](https://en.wikipedia.org/wiki/Priority_inversion) problem, because it lets priority inversion be resolved through [priority inheritance (PI)](https://en.wikipedia.org/wiki/Priority_inheritance). An ordinary semaphore has no notion of an owner, so it unavoidably leaves room for priority inversion.

rw_semaphore is a special case. Under its semantics, a writer cannot grant its priority to multiple readers, so a preempted low-priority reader keeps holding the lock and can cause even a high-priority writer to suffer [starvation](https://en.wikipedia.org/wiki/Starvation_(computer_science)). Conversely, because readers can grant their priority to a writer, a preempted low-priority writer has its priority boosted until it releases the lock, which prevents that writer from starving the readers.

For more details on how each lock behaves and should be used on PREEMPT_RT and non-PREEMPT_RT kernels, see [Lock types and their rules](https://docs.kernel.org/locking/locktypes.html).

## RTLA

### Introduction

An earlier chapter introduced [Cyclictest](https://hackmd.io/Rq-sfre_R9aFmG6vaS1z5Q?view#Cyclictest), a tool for measuring system latency. That measurement is a black box, however: it yields a single opaque latency value and offers no easy way to see what the value is made of.

![image](https://hackmd.io/_uploads/rkQaUAlD1l.png)

> Ref: [Linux scheduling latency debug and analysis](https://bristot.me/linux-scheduling-latency-debug-and-analysis/)

To address this, the Linux kernel introduced the Timerlat and Osnoise tracers. [Timerlat](https://docs.kernel.org/trace/timerlat-tracer.html) traces wakeup latency in both IRQ and thread context, while [Osnoise](https://docs.kernel.org/trace/osnoise-tracer.html) reports counts and statistics of the interference that events such as NMIs, IRQs, and softirqs impose on the system. Experienced developers may have little trouble using these tracers and analyzing their output, but the barrier is rather high for anyone unfamiliar with them. Fortunately, [RTLA](https://docs.kernel.org/tools/rtla/rtla.html) integrates measurement and tracing: it acts as a front end that summarizes the results of these tracers and produces an analysis that is easier to read.

### rtla timerlat

rtla timerlat is a userspace tool that consumes the output of the timerlat tracer and collects the data. It can optionally create a userspace workload as well. Its basic output looks like this:

```
$ sudo rtla timerlat top
  0 00:00:03   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max
  0 #2529      |       43        14        53      1747 |       47        18        58      1758
  1 #2528      |       60        20        58      2079 |       64        26        63      2091
  2 #2526      |       64        14        58      2322 |       68        18        63      2372
  3 #2528      |       46        20        53      2199 |       50        23        57      2209
  4 #2528      |       36        10        53      2168 |       38        17        57      2179
  5 #2526      |       65        21        55      2134 |       68        25        60      2136
  6 #2525      |       46        22        53      2199 |       50        26        57      2239
  7 #2528      |       46        29        51      1009 |       50        31        55      1019
  8 #2528      |       61        37        55      1825 |       65        42        61      1834
  9 #2525      |       46        27        52      1879 |       50        37        57      2806
 10 #2529      |       52        27        57      1738 |       75        40        64      1751
 11 #2530      |       46        18        53       946 |       50        21        57       950
 12 #2527      |       71        42        59      2113 |       76        44        65      2132
 13 #2527      |       55        38        54      2146 |       58        41        58      2154
 14 #2528      |       53        35        55      1957 |       57        38        59      1968
 15 #2528      |       46        22        55      2228 |       50        25        59      2238
 16 #2526      |       58        31        55      2188 |       62        36        59      2198
 17 #2528      |       61        37        56      1941 |       65        40        60      1949
 18 #2527      |       68         3        57      2172 |       72         6        61      2189
 19 #2527      |       42         9        58      2071 |       56        19        63      2081
---------------|----------------------------------------|---------------------------------------
ALL #50548 e0  |                  3        55      2322 |                  6        60      2806
```

This shows the biggest difference between timerlat and cyclictest: timerlat reports two latency values, one for IRQ context and one for thread context. This is possible because the timerlat tracer adds a dedicated timer IRQ handler, and the IRQ latency is recorded as soon as that handler starts running. The IRQ handler then wakes the thread that measures the thread latency, which is how the thread latency is calculated, as shown below:

![image](https://hackmd.io/_uploads/rJ7I3AePkx.png)

One particularly useful feature of rtla timerlat is auto analysis, enabled with the `-a <threshold>` option. With it, rtla stops tracing once the latency exceeds the given threshold, and rtla timerlat then parses the trace and automatically analyzes what caused the latency. For example:

```
$ sudo timerlat -a 30
                                     Timer Latency
  0 00:00:01   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max
  0 #763       |        1         0         1         9 |        8         4         8        18
  1 #763       |        1         0         1         8 |       12         4        12        21
  2 #763       |        1         0         1         5 |       13         4        15        23
  3 #763       |        1         0         1         8 |       16         4        14        21
  4 #763       |       12         0         1        16 |       28         5        12        28
  5 #763       |        1         0         1         8 |       12         4        11        22
  6 #763       |       32         0         1        32 |       52         5        13        52
  7 #763       |        0         0         1        11 |        7         5        12        20
rtla timerlat hit stop tracing
## CPU 6 hit stop tracing, analyzing it ##
  IRQ handler delay:                                 31.00 us (59.56 %)
  IRQ latency:                                       32.17 us
  Timerlat IRQ duration:                              9.57 us (18.38 %)
  Blocking thread:                                    8.77 us (16.84 %)
                  objtool:1164402                     8.77 us
    Blocking thread stack trace
                -> timerlat_irq
                -> __hrtimer_run_queues
                -> hrtimer_interrupt
                -> __sysvec_apic_timer_interrupt
                -> sysvec_apic_timer_interrupt
                -> asm_sysvec_apic_timer_interrupt
                -> _raw_spin_unlock_irqrestore
                -> cgroup_rstat_flush_locked
                -> cgroup_rstat_flush_irqsafe
                -> mem_cgroup_flush_stats
                -> mem_cgroup_wb_stats
                -> balance_dirty_pages
                -> balance_dirty_pages_ratelimited_flags
                -> btrfs_buffered_write
                -> btrfs_do_write_iter
                -> vfs_write
                -> __x64_sys_pwrite64
                -> do_syscall_64
                -> entry_SYSCALL_64_after_hwframe
------------------------------------------------------------------------
  Thread latency:                                    52.05 us (100%)

Saving trace to timerlat_trace.txt
```

The timing relationships in this auto-analysis result can be depicted with the following figure.

![image](https://hackmd.io/_uploads/HkjEvJZvkg.png)

### Cause of Latency

The causes of latency can be classified as follows:

#### Interference

Interference occurs when a higher-priority task preempts a lower-priority one. In Linux, the contexts rank from highest to lowest priority as follows:

* NMI: non-maskable interrupts can preempt any other context
* IRQ: can preempt everything except NMIs
* Softirq: can preempt only thread context (note that on PREEMPT_RT, softirqs are equivalent to ordinary threads)
* Threads: a thread can preempt only other threads

#### Blocking

Blocking occurs when a lower-priority task delays a higher-priority one. It is usually related to synchronization. For example, a lower-priority thread that holds a lock with preemption disabled delays a higher-priority thread, and any thread that disables interrupts can hold back IRQ context.

#### Release Jitter

Release jitter is a delay of the timer IRQ caused by external events such as hardware. For example, an IRQ handler starts later on an idle CPU because it has to wait for the CPU to return to the active state.

#### Execution time

Execution time is the work required to reach the goal, which in this case means waking up the target thread.

### IRQ latency

By separating IRQ latency from thread latency, the timerlat tracer makes it easier to analyze what is causing a problem.

An IRQ can face **interference** from NMIs and from other IRQs. The interrupt controller of each platform decides the order of IRQ handlers according to its defined priorities, so the timer IRQ may end up running after another, higher-priority IRQ.

An IRQ can also face **blocking** by lower-priority IRQs, because an IRQ handler may run with IRQs disabled. Threads and softirqs can likewise disable IRQs and thereby block other IRQs.

The **execution time** of the timer IRQ also adds latency, but in most cases it only affects the thread latency, because the IRQ latency is measured when the handler starts.

In the previous case, the IRQ handler faced a latency of 32.17 us. There was no **interference**, and most of the latency came from 31 us of **blocking**. The culprit was the lower-priority `objtool:1164402`: the stack trace shows a cgroup operation that had IRQs disabled around a `raw_spin_lock` critical section, which caused the delay. The **execution time** contributed 9.57 us.

The timer IRQ can also be delayed by factors outside the system. The output below shows another case, in which the IRQ handler took 39.01 us to start. Because the blocking thread is swapper, that is, the idle thread, this is not blocking but **release jitter**. Whenever IRQ latency occurs while IRQs are not disabled, release jitter is the likely cause, for example when the blocking thread is running in userspace.

```
## CPU 9 hit stop tracing, analyzing it ##
  IRQ handler delay:            (exit from idle)     39.01 us (76.59 %)
  IRQ latency:                                       40.49 us
  Timerlat IRQ duration:                              5.85 us (11.49 %)
  Blocking thread:                                    3.99 us (7.83 %)
                  swapper/9:0                         3.99 us
    Blocking thread stack trace
                -> timerlat_irq
                -> __hrtimer_run_queues
                -> hrtimer_interrupt
                -> __sysvec_apic_timer_interrupt
                -> sysvec_apic_timer_interrupt
                -> asm_sysvec_apic_timer_interrupt
                -> pv_native_safe_halt
                -> default_idle
                -> default_idle_call
                -> do_idle
                -> cpu_startup_entry
                -> start_secondary
                -> __pfx_verify_cpu
------------------------------------------------------------------------
  Thread latency:                                    50.93 us (100%)

Max timerlat IRQ latency from idle: 40.49 us in cpu 9
```

In rtla timerlat, `--dma-latency <latency>` can be used to reduce the latency of exiting the idle state. The most common value is 0, which is also what cyclictest uses. If setting `--dma-latency` is not enough, the related hardware has to be tuned.

### Thread latency

A thread can suffer **interference** from NMIs, IRQs, softirqs, and even other threads. On a regular kernel, softirqs have their own context and can preempt threads. On PREEMPT_RT, softirqs run in the same kind of context as threads, and the scheduler treats them the same as threads.

Other threads can cause either **interference** or **blocking**, depending on their priority. A thread with a higher priority than the timerlat thread causes **interference**. A thread with a lower priority causes **blocking** of the timerlat thread when it runs with interrupts or preemption disabled, or through the cost of scheduling.

The following case shows this scenario on a non-PREEMPT_RT kernel. Here the thread latency comes from activity in the btrfs file system. Because that operation cannot be preempted, it results in blocking.

```
## CPU 18 hit stop tracing, analyzing it ##
  IRQ handler delay:                                  0.00 us (0.00 %)
  IRQ latency:                                        1.64 us
  Timerlat IRQ duration:                              9.52 us (1.80 %)
  Blocking thread:                                  501.68 us (94.96 %)
             kworker/u40:0:306130                   501.68 us
    Blocking thread stack trace
                -> timerlat_irq
                -> __hrtimer_run_queues
                -> hrtimer_interrupt
                -> __sysvec_apic_timer_interrupt
                -> sysvec_apic_timer_interrupt
                -> asm_sysvec_apic_timer_interrupt
                -> ZSTD_compressBlock_fast
                -> ZSTD_buildSeqStore
                -> ZSTD_compressBlock_internal
                -> ZSTD_compressContinue_internal
                -> ZSTD_compressEnd
                -> ZSTD_compressStream2
                -> ZSTD_endStream
                -> zstd_compress_pages
                -> btrfs_compress_pages
                -> compress_file_range
                -> async_cow_start
                -> btrfs_work_helper
                -> process_one_work
                -> worker_thread
                -> kthread
                -> ret_from_fork
  IRQ interference                                    3.68 us (0.70 %)
                  local_timer:236                     3.68 us
  Softirq interference                                4.21 us (0.80 %)
                  TIMER:1                             3.71 us
                  RCU:9                               0.49 us
  Thread interference                                 6.21 us (1.17 %)
                  migration/18:125                    6.21 us
------------------------------------------------------------------------
  Thread latency:                                   528.31 us (100%)

Max timerlat IRQ latency from idle: 10.34 us in cpu 12
Saving trace to timerlat_trace.txt
```

### timerlat hist

By default, rtla timerlat runs in `top` mode, as in the earlier examples. The tool also supports a `hist` mode, in which the results are not updated live. Instead, a histogram of the latencies is printed when the test ends, as in the following example:

```
$ sudo rtla timerlat hist -c 0-1 -d 30
# RTLA timerlat histogram
# Time unit is microseconds (us)
# Duration:   0 00:00:30
Index   IRQ-000   Thr-000   Usr-000   IRQ-001   Thr-001   Usr-001
15            1         0         0         0         0         0
16            1         0         0         0         0         0
17            1         0         0         0         0         0
(omitted...)
253          11        10        20         8        13        13
254          12         6        11        13         9         9
255          17        11        15         4        14        17
over:       433       655       728       410       588       675
count:    30032     30032     30032     30033     30033     30033
min:         15        18        19        20        24        25
avg:         98       107       110        88        96        99
max:       2008      2015      2018      2145      2153      2158
ALL:        IRQ       Thr       Usr
count:    60065     60065     60065
min:         15        18        19
avg:         93       101       105
max:       2145      2153      2158
```

The output shows, for each latency value (Index), how many samples fell on it in each CPU's IRQ and thread latency columns.

## Reference

* [Linux scheduling latency debug and analysis](https://bristot.me/linux-scheduling-latency-debug-and-analysis/)
* [Demystifying the Real-time Linux Scheduling Latency(Video)](https://youtu.be/_vSkAbfVprA?si=8zBEIKQ-z6iyI8eY)
* [Demystifying the Real-Time Linux Scheduling Latency](https://bristot.me/demystifying-the-real-time-linux-latency/)
* [realtime:start [Wiki]](https://wiki.linuxfoundation.org/realtime/start)
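
## Appendix: Auto-analysis arithmetic

The percentages in the three auto-analysis reports (CPU 6, CPU 9, and CPU 18) are each component's share of the total thread latency, and the IRQ latency section reads a swapper blocking thread as release jitter rather than blocking. The following is a minimal sketch of that reading, not part of rtla: `AutoAnalysis`, `share`, and `classify_irq_delay` are hypothetical names, the numbers are copied by hand from the reports in this article, and the swapper rule is only the rule of thumb from the text. Because the inputs are already rounded to 0.01 us, the computed percentages may differ from rtla's output in the last digit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoAnalysis:
    """One auto-analysis report; values copied by hand from this article (us)."""
    cpu: int
    irq_handler_delay_us: float
    timerlat_irq_duration_us: float
    blocking_thread_us: float
    blocking_thread: str  # "comm:pid" as printed in the report
    thread_latency_us: float


def share(part_us: float, total_us: float) -> float:
    """Express one component as a percentage of the total thread latency."""
    return 100.0 * part_us / total_us


def classify_irq_delay(report: AutoAnalysis) -> str:
    """Hypothetical helper: apply the rule of thumb from the IRQ latency section."""
    if report.irq_handler_delay_us == 0.0:
        return "none"
    # The PID follows the last ':'; the comm itself may contain ':' (kworker/u40:0).
    comm = report.blocking_thread.rsplit(":", 1)[0]
    # swapper/N is the idle thread, so the delay is the cost of leaving idle.
    return "release jitter" if comm.startswith("swapper/") else "blocking"


REPORTS = [
    AutoAnalysis(6, 31.00, 9.57, 8.77, "objtool:1164402", 52.05),
    AutoAnalysis(9, 39.01, 5.85, 3.99, "swapper/9:0", 50.93),
    AutoAnalysis(18, 0.00, 9.52, 501.68, "kworker/u40:0:306130", 528.31),
]

for r in REPORTS:
    total = r.thread_latency_us
    print(
        f"CPU {r.cpu:>2}: "
        f"IRQ handler delay {share(r.irq_handler_delay_us, total):6.2f} % "
        f"({classify_irq_delay(r)}), "
        f"IRQ duration {share(r.timerlat_irq_duration_us, total):6.2f} %, "
        f"blocking thread {share(r.blocking_thread_us, total):6.2f} %"
    )
```

Laying the three reports side by side this way makes the contrast visible: on CPU 6 and CPU 9 the IRQ handler delay dominates the thread latency, whereas on CPU 18 almost all of it is the blocking thread.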