Linux Kernel Design: Scheduler(10): PREEMPT_RT

Overview of PREEMPT_RT

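Linux Kernel Design: PREEMPT_RT as a mechanism toward a hard real-time operating system

The previous chapter, Deadline Scheduling, should have given readers a basic understanding of how a real-time system is defined. That chapter, however, focused mainly on soft real-time; in practice, hard real-time draws more attention in system software.

The attention hard real-time receives reflects real-world needs. For some tasks, finishing within a specified time limit matters more than finishing faster most of the time. For example, once a self-driving car detects an obstacle, it must stop within the specified time every single time. Such timing guarantees are critical in medical, aerospace, and other industrial domains, where even a slight deviation can mean substantial financial loss or even harm to human life.

Linux was originally designed as a time-sharing operating system. On the road to making it meet hard real-time requirements, the key is to reduce the amount of kernel code that cannot be preempted, which is the main goal of the PREEMPT_RT patch series. For example, under PREEMPT_RT, interrupt handlers and softirqs run in kernel threads, so each handler has its own schedulable context and is no longer exempt from preemption. Likewise, an ordinary spinlock is turned into a sleeping lock built on rt_mutex, so a task holding the lock can be preempted, allowing other real-time tasks to be scheduled.

This article covers the concepts and tools related to PREEMPT_RT.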

Locks in PREEMPT_RT and non-PREEMPT_RT Kernels

Lock Type Overview

The Linux kernel provides a variety of locking primitives, which fall into three main categories:

Sleeping locks
CPU local locks
Spinning locks

Sleeping Lock

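As the name suggests, a task that contends for a sleeping lock may go from running to sleeping while it waits, so these locks can only be acquired in preemptible task context.

In practice, the kernel implements try_lock() so that a sleeping lock can be acquired from other contexts. Unless strictly necessary, however, locks should be acquired in preemptible context.

The following lock types are sleeping locks: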

mutex
rt_mutex
semaphore
rw_semaphore
ww_mutex
percpu_rw_semaphore

Under PREEMPT_RT, the following lock types also become sleeping locks:

local_lock
spinlock_t
rwlock_t

CPU local locks

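On non-PREEMPT_RT kernels, local_lock is a wrapper around the primitives that disable preemption and interrupts, with clearer semantics that make explicit what the lock protects. This benefits static analysis and dynamic deadlock detection (lockdep) in the kernel, and the interface also makes PREEMPT_RT integration easier.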

Local locks in the kernel

Spinning locks

Linux has two kinds of spinlocks:

raw_spinlock_t
bit spinlocks

On non-RT kernels, the following two are also spinlocks:

spinlock_t
rwlock_t

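A spinlock implicitly disables preemption, and the lock/unlock functions can provide further protection depending on the suffix in their names:

Suffix	Description
`_bh()`	Disable/enable bottom halves (softirq)
`_irq()`	Disable/enable interrupts
`_irqsave/restore()`	Save and disable / restore the interrupt disabled state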

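All of the locks above except semaphore have strict owner semantics: the context (task) that acquired the lock must be the one to release it. This strictness helps with priority inversion, because it allows the problem to be solved through priority inheritance (PI).

An ordinary semaphore has no notion of an owner, so it can inevitably cause priority inversion. rw_semaphore is a special case. Under its semantics a writer cannot grant its priority to multiple readers, so a preempted low-priority reader keeps holding the lock and can starve even a high-priority writer. Conversely, because readers can grant their priority to a writer, a preempted low-priority writer has its priority boosted until it releases the lock, which prevents that writer from starving readers.

For more details on how each lock behaves and is used on PREEMPT_RT and non-PREEMPT_RT kernels, see Lock types and their rules.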
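
To make the effect of priority inheritance concrete, here is a minimal sketch in Python. It is a toy single-CPU simulation, not kernel code, and the task names (L, M, H), their priorities, and the tick counts are all assumptions chosen for illustration: L holds a lock, H arrives and waits for that lock, and M is CPU-bound work that never touches the lock.

```python
def ticks_until_lock_released(priority_inheritance):
    """Simulate one CPU; return the tick at which L releases the lock H waits for."""
    # Hypothetical tasks; a larger number means a higher priority.
    priority = {"L": 1, "M": 2, "H": 3}
    arrival = {"L": 0, "M": 1, "H": 1}   # tick at which each task becomes ready
    # L: ticks left inside its critical section; M: ticks of CPU-bound work.
    remaining = {"L": 3, "M": 5}

    tick = 0
    while remaining["L"] > 0:
        h_is_waiting = arrival["H"] <= tick   # H blocks on the lock held by L

        def effective_priority(task):
            # With PI, the lock owner inherits the priority of its top waiter.
            if priority_inheritance and task == "L" and h_is_waiting:
                return max(priority["L"], priority["H"])
            return priority[task]

        runnable = [t for t in remaining
                    if remaining[t] > 0 and arrival[t] <= tick]
        running = max(runnable, key=effective_priority)
        remaining[running] -= 1
        tick += 1
    return tick


print("without PI:", ticks_until_lock_released(False))  # 8
print("with PI:   ", ticks_until_lock_released(True))   # 3
```

Without inheritance, the medium-priority task M keeps L off the CPU, so H is indirectly delayed by M (the inversion). With inheritance, L temporarily runs at H's priority and leaves its critical section before M gets to run.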

RTLA

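Introduction

An earlier chapter introduced Cyclictest, a tool for measuring system latency. That measurement is a black box, though: we get a single opaque latency value and cannot easily tell what it is made of.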

Ref: Linux scheduling latency debug and analysis
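
The black-box nature of this kind of measurement can be seen in a rough Python sketch of a cyclictest-style loop. This is only an illustration under simplifying assumptions, not how Cyclictest itself is implemented (the real tool runs real-time priority threads around clock_nanosleep); the function name and parameters are hypothetical.

```python
import time


def wakeup_latencies_us(period_us=1000, samples=100):
    """Sleep until periodic absolute deadlines and record how late each wakeup is."""
    period_ns = period_us * 1000
    latencies = []
    next_wakeup = time.monotonic_ns() + period_ns
    for _ in range(samples):
        delay_ns = next_wakeup - time.monotonic_ns()
        if delay_ns > 0:
            time.sleep(delay_ns / 1e9)
        # One number per wakeup: the distance between the intended and the
        # actual wakeup time, with no hint about what caused it.
        latencies.append((time.monotonic_ns() - next_wakeup) / 1000)
        next_wakeup += period_ns
    return latencies


lat = wakeup_latencies_us()
print(f"min {min(lat):.1f} us, avg {sum(lat) / len(lat):.1f} us, max {max(lat):.1f} us")
```

Each sample lumps together timer, interrupt, scheduling, and interpreter overhead, which is exactly the kind of opaque value described above.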

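To address this, the Linux kernel introduced the Timerlat and Osnoise tracers. Timerlat traces wakeup latency in IRQ and thread context, while Osnoise reports counts and statistics for the interference that sources such as NMIs, IRQs, and softirqs impose on the system. Experienced developers may have little trouble using these tracers and analyzing their output, but the barrier can be high for anyone unfamiliar with them. Fortunately, RTLA combines measurement and tracing: it acts as a front end that summarizes the tracers' results and produces analysis that is easier to read.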

rtla timerlat

rtla timerlat is a userspace tool that consumes the output of the timerlat tracer, collects the data, and can optionally create a userspace workload. Its basic output looks like this:

$ sudo rtla timerlat top
  0 00:00:03   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max
  0 #2529      |       43        14        53      1747 |       47        18        58      1758
  1 #2528      |       60        20        58      2079 |       64        26        63      2091
  2 #2526      |       64        14        58      2322 |       68        18        63      2372
  3 #2528      |       46        20        53      2199 |       50        23        57      2209
  4 #2528      |       36        10        53      2168 |       38        17        57      2179
  5 #2526      |       65        21        55      2134 |       68        25        60      2136
  6 #2525      |       46        22        53      2199 |       50        26        57      2239
  7 #2528      |       46        29        51      1009 |       50        31        55      1019
  8 #2528      |       61        37        55      1825 |       65        42        61      1834
  9 #2525      |       46        27        52      1879 |       50        37        57      2806
 10 #2529      |       52        27        57      1738 |       75        40        64      1751
 11 #2530      |       46        18        53       946 |       50        21        57       950
 12 #2527      |       71        42        59      2113 |       76        44        65      2132
 13 #2527      |       55        38        54      2146 |       58        41        58      2154
 14 #2528      |       53        35        55      1957 |       57        38        59      1968
 15 #2528      |       46        22        55      2228 |       50        25        59      2238
 16 #2526      |       58        31        55      2188 |       62        36        59      2198
 17 #2528      |       61        37        56      1941 |       65        40        60      1949
 18 #2527      |       68         3        57      2172 |       72         6        61      2189
 19 #2527      |       42         9        58      2071 |       56        19        63      2081
---------------|----------------------------------------|---------------------------------------
ALL #50548  e0 |                  3        55      2322 |                  6        60      2806

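This shows the biggest difference between timerlat and cyclictest: timerlat reports two latency values, one for IRQ context and one for thread context. This is possible because the timerlat tracer installs a dedicated timer IRQ handler; once the handler starts running, it records the IRQ latency. The IRQ handler then wakes up the measurement thread, and the time until that thread runs gives the thread latency, as illustrated in the figure below: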

[Figure: timeline of the timerlat IRQ latency and thread latency (image not available)]

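A particularly useful feature of rtla timerlat is auto analysis, enabled with the -a <threshold> option. With it, rtla stops tracing once a latency exceeds the given threshold (in microseconds), then parses the trace and automatically reports what contributed to the latency. For example: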

$ sudo timerlat -a 30 
                                     Timer Latency                                              
  0 00:00:01   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max
  0 #763       |        1         0         1         9 |        8         4         8        18
  1 #763       |        1         0         1         8 |       12         4        12        21
  2 #763       |        1         0         1         5 |       13         4        15        23
  3 #763       |        1         0         1         8 |       16         4        14        21
  4 #763       |       12         0         1        16 |       28         5        12        28
  5 #763       |        1         0         1         8 |       12         4        11        22
  6 #763       |       32         0         1        32 |       52         5        13        52
  7 #763       |        0         0         1        11 |        7         5        12        20
rtla timerlat hit stop tracing
## CPU 6 hit stop tracing, analyzing it ##
  IRQ handler delay:                     		31.00 us (59.56 %)
  IRQ latency:                                   	32.17 us
  Timerlat IRQ duration:         			9.57 us (18.38 %)
  Blocking thread:                        		8.77 us (16.84 %)
	                 objtool:1164402         	8.77 us
    Blocking thread stack trace
		-> timerlat_irq
		-> __hrtimer_run_queues
		-> hrtimer_interrupt
		-> __sysvec_apic_timer_interrupt
		-> sysvec_apic_timer_interrupt
		-> asm_sysvec_apic_timer_interrupt
		-> _raw_spin_unlock_irqrestore
		-> cgroup_rstat_flush_locked
		-> cgroup_rstat_flush_irqsafe
		-> mem_cgroup_flush_stats
		-> mem_cgroup_wb_stats
		-> balance_dirty_pages
		-> balance_dirty_pages_ratelimited_flags
		-> btrfs_buffered_write
		-> btrfs_do_write_iter
		-> vfs_write
		-> __x64_sys_pwrite64
		-> do_syscall_64
		-> entry_SYSCALL_64_after_hwframe
------------------------------------------------------------------------
  Thread latency:					52.05 us (100%)

Saving trace to timerlat_trace.txt

The timing relationships in this auto-analysis result can be depicted in the following figure.

[Figure: timeline of the auto-analysis result above (image not available)]

Cause of Latency

The causes of latency fall into the following categories:

Interference

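Interference occurs when a higher-priority task preempts a lower-priority one. In Linux, the contexts rank from highest to lowest priority as follows:
* NMI: non-maskable interrupts can preempt every other context
* IRQ: can preempt everything except NMIs
* Softirq: can only preempt thread context (note that on PREEMPT_RT, softirqs are equivalent to ordinary threads)
* Threads: can only preempt other threads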
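
The ordering above can be written down as a small predicate. The sketch below is only a model of the list, with hypothetical names (`Context`, `can_preempt`); whether one thread actually preempts another additionally depends on scheduling policy and priority, which this model does not cover.

```python
from enum import IntEnum


class Context(IntEnum):
    """Execution contexts, ordered from lowest to highest priority."""
    THREAD = 0
    SOFTIRQ = 1
    IRQ = 2
    NMI = 3


def can_preempt(preemptor, victim, preempt_rt=False):
    """Return True if `preemptor` context may preempt `victim` context."""
    def level(ctx):
        # On PREEMPT_RT, softirqs run in thread context.
        if preempt_rt and ctx is Context.SOFTIRQ:
            return Context.THREAD
        return ctx

    p, v = level(preemptor), level(victim)
    if p is Context.THREAD:
        # Threads only preempt other threads (subject to the scheduler).
        return v is Context.THREAD
    return p > v


print(can_preempt(Context.IRQ, Context.SOFTIRQ))
print(can_preempt(Context.THREAD, Context.SOFTIRQ, preempt_rt=True))
```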

Blocking

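Blocking occurs when a lower-priority task delays a higher-priority one. It is usually related to synchronization. For example, a lower-priority thread that holds a lock with preemption disabled delays higher-priority threads. Likewise, any thread that disables interrupts can hold off IRQ context.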

Release Jitter

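Release jitter is a delay of the timer IRQ caused by external events such as hardware. For example, an IRQ handler starts later on an idle CPU because it has to wait for the CPU to return to an active state.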

Execution time

Execution time is the work needed to accomplish the goal, which in this case is waking up the target thread.

IRQ latency

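By separating IRQ latency from thread latency, the timerlat tracer makes it easier to analyze what causes a problem.

An IRQ can suffer interference from NMIs and from other IRQs. Each platform's interrupt controller decides the order of IRQ handlers according to its defined priorities, so the timer IRQ may end up running after another, higher-priority IRQ.

An IRQ can also be blocked by lower-priority IRQs, because IRQ handlers may run with IRQs disabled. Threads and softirqs can also disable IRQs and thereby block other IRQs.

The execution time of the timer IRQ also adds latency, but in most cases it only affects the thread latency, because the IRQ latency is measured when the handler starts.

In the earlier example, the IRQ handler saw a latency of 32.17 us with no interference; most of it came from 31 us of blocking. The culprit is the lower-priority objtool:1164402, and the stack trace shows that a cgroup operation had IRQs disabled under a raw spinlock (the _raw_spin_unlock_irqrestore frame), which caused the delay. The timer IRQ's execution time added another 9.57 us.

The timer IRQ can also be delayed by factors external to the system. The example below shows such a case, in which the IRQ handler took 39.01 us to start. The blocking thread is swapper, that is, the idle thread, so this is release jitter rather than blocking. In general, when an IRQ is delayed even though IRQs were not disabled, for example while the blocking thread was running in userspace, release jitter is the likely cause.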
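
As a quick check on how to read the percentages in that report, the sketch below recomputes each component's share of the 52.05 us thread latency from the values printed for CPU 6. The helper name is hypothetical, and the last digit can differ slightly from what rtla prints, presumably because rtla works from unrounded values.

```python
# Values copied from the "CPU 6" auto-analysis output, in microseconds.
thread_latency_us = 52.05
components_us = {
    "IRQ handler delay": 31.00,
    "Timerlat IRQ duration": 9.57,
    "Blocking thread": 8.77,
}


def share_of_thread_latency(component_us, total_us):
    """Return the component's share of the total latency, in percent."""
    return 100.0 * component_us / total_us


shares = {
    name: share_of_thread_latency(value, thread_latency_us)
    for name, value in components_us.items()
}
for name, pct in shares.items():
    print(f"{name:<24}{components_us[name]:6.2f} us  {pct:5.1f} %")
```

Read this way, the percentages in the report are relative to the final thread latency, not to the IRQ latency.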

## CPU 9 hit stop tracing, analyzing it ##
  IRQ handler delay:		(exit from idle)	    39.01 us (76.59 %)
  IRQ latency:						    40.49 us
  Timerlat IRQ duration:				     5.85 us (11.49 %)
  Blocking thread:					     3.99 us (7.83 %)
	               swapper/9:0        		     3.99 us
    Blocking thread stack trace
		-> timerlat_irq
		-> __hrtimer_run_queues
		-> hrtimer_interrupt
		-> __sysvec_apic_timer_interrupt
		-> sysvec_apic_timer_interrupt
		-> asm_sysvec_apic_timer_interrupt
		-> pv_native_safe_halt
		-> default_idle
		-> default_idle_call
		-> do_idle
		-> cpu_startup_entry
		-> start_secondary
		-> __pfx_verify_cpu
------------------------------------------------------------------------
  Thread latency:					    50.93 us (100%)

Max timerlat IRQ latency from idle: 40.49 us in cpu 9

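rtla timerlat can reduce the exit-from-idle latency with --dma-latency <latency>. The most common value is 0 (cyclictest, for example, uses it). If setting --dma-latency is not enough, the relevant hardware has to be tuned.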
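
For context, this kind of request goes through the kernel's PM QoS interface, /dev/cpu_dma_latency. The sketch below shows the idea from userspace; the function name is hypothetical, and the `path` parameter exists only so the code can be tried against an ordinary file without root.

```python
import os
import struct


def request_cpu_dma_latency(latency_us=0, path="/dev/cpu_dma_latency"):
    """Ask PM QoS to keep CPU wakeup latency at or below `latency_us`.

    The request stays active only while the returned file descriptor is open.
    """
    fd = os.open(path, os.O_WRONLY)
    # The device accepts the target latency as a binary signed 32-bit integer.
    os.write(fd, struct.pack("i", latency_us))
    return fd


# Usage (needs root):
#   fd = request_cpu_dma_latency(0)
#   ... run the latency-sensitive workload ...
#   os.close(fd)   # closing the descriptor drops the request
```

Keeping the descriptor open for the whole measurement matters: once it is closed, the CPU is again free to enter deep idle states.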

Thread latency

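A thread can suffer interference from NMIs, IRQs, softirqs, and even other threads.

On a regular kernel, softirqs have their own context and can preempt threads. On PREEMPT_RT, softirqs run in thread context, and the scheduler treats them the same as threads.

Other threads can cause either interference or blocking, depending on their priority. A thread with higher priority than the timerlat thread causes interference. A thread with lower priority blocks the timerlat thread when it runs with interrupts or preemption disabled, or through the cost of scheduling.

The following example demonstrates this scenario on a non-PREEMPT_RT kernel. Here the thread latency comes from work done by the btrfs file system: because that work cannot be preempted, it causes blocking.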

## CPU 18 hit stop tracing, analyzing it ##
  IRQ handler delay:		                	              0.00 us (0.00 %)
  IRQ latency:							     1.64 us
  Timerlat IRQ duration:					     9.52 us (1.80 %)
  Blocking thread:						   501.68 us (94.96 %)
	           kworker/u40:0:306130   			   501.68 us
    Blocking thread stack trace
		-> timerlat_irq
		-> __hrtimer_run_queues
		-> hrtimer_interrupt
		-> __sysvec_apic_timer_interrupt
		-> sysvec_apic_timer_interrupt
		-> asm_sysvec_apic_timer_interrupt
		-> ZSTD_compressBlock_fast
		-> ZSTD_buildSeqStore
		-> ZSTD_compressBlock_internal
		-> ZSTD_compressContinue_internal
		-> ZSTD_compressEnd
		-> ZSTD_compressStream2
		-> ZSTD_endStream
		-> zstd_compress_pages
		-> btrfs_compress_pages
		-> compress_file_range
		-> async_cow_start
		-> btrfs_work_helper
		-> process_one_work
		-> worker_thread
		-> kthread
		-> ret_from_fork
  IRQ interference					     3.68 us (0.70 %)
	             local_timer:236			     3.68 us
  Softirq interference					     4.21 us (0.80 %)
	                   TIMER:1  			     3.71 us
	                     RCU:9  			     0.49 us
  Thread interference					     6.21 us (1.17 %)
	            migration/18:125			     6.21 us
------------------------------------------------------------------------
  Thread latency:					   528.31 us (100%)

Max timerlat IRQ latency from idle: 10.34 us in cpu 12
  Saving trace to timerlat_trace.txt

timerlat hist

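By default rtla timerlat runs in top mode, as in the earlier examples. The tool also supports a hist mode, in which results are not updated live; instead, histograms of the latencies are printed when the test ends, as in the following example:

$ sudo rtla timerlat hist -c 0-1 -d 30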
# RTLA timerlat histogram
# Time unit is microseconds (us)
# Duration:   0 00:00:30
Index   IRQ-000   Thr-000   Usr-000   IRQ-001   Thr-001   Usr-001
15            1         0         0         0         0         0
16            1         0         0         0         0         0
17            1         0         0         0         0         0

(omitted...)

253          11        10        20         8        13        13
254          12         6        11        13         9         9
255          17        11        15         4        14        17
over:       433       655       728       410       588       675
count:    30032     30032     30032     30033     30033     30033
min:         15        18        19        20        24        25
avg:         98       107       110        88        96        99
max:       2008      2015      2018      2145      2153      2158
ALL:        IRQ       Thr       Usr
count:    60065     60065     60065
min:         15        18        19
avg:         93       101       105
max:       2145      2153      2158

The output shows, for each latency value (Index), how the IRQ and thread latencies are distributed on each CPU.
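
The way samples end up in the Index rows and in the "over" line can be sketched as follows. This is an illustrative reimplementation, not rtla's code; the function name is hypothetical, and the defaults assume 256 entries with a bucket size of 1 us, which matches the example where the last row is 255.

```python
def latency_histogram(samples_us, entries=256, bucket_size=1):
    """Bucket latency samples; anything past the last entry is counted as 'over'."""
    hist = [0] * entries
    over = 0
    for sample in samples_us:
        index = int(sample // bucket_size)
        if index < entries:
            hist[index] += 1
        else:
            over += 1
    return hist, over


# Hypothetical samples, in microseconds.
samples = [15, 15, 98, 255, 256, 2008]
hist, over = latency_histogram(samples)
print("index 15:", hist[15], "| index 255:", hist[255], "| over:", over)
print("count:", len(samples), "min:", min(samples),
      "avg:", sum(samples) // len(samples), "max:", max(samples))
```

A large "over" count, as in the example output, means many samples fell outside the histogram range, so the min/avg/max summary lines are needed to see how far the tail extends.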
