Linux Kernel Power Management

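# Reading Power Management in the Linux Kernel

## Reference

- [Power Management](https://docs.kernel.org/admin-guide/pm/index.html)
- [Runtime Power Management Framework for I/O Devices](https://docs.kernel.org/power/runtime_pm.html)
- [Power Management - Deeper view](https://docs.kernel.org/power/index.html)
- [Device Power Management Basics](https://www.kernel.org/doc/html/v4.19/driver-api/pm/devices.html)
- [Introduction to Kernel Power Management - presentation](http://events17.linuxfoundation.org/sites/events/files/slides/Intro_Kernel_PM.pdf)
- [Power Management In The Linux Kernel Current Status And Future - presentation](http://events17.linuxfoundation.org/sites/events/files/slides/kernel_PM_plain.pdf)

## 1. Overview - Power Management Strategies

The Linux kernel supports two main power management (PM) strategies:

1. **system-wide power management** (so-called static PM): user space requests a single, coordinated state change for the whole system, which enters a low-power sleep state and stays there until a specific wakeup signal brings it back to the working state.
    - This kind of PM is divided into several categories according to how deep the sleep is.
    - It is called system-wide because one state change affects the entire system.
2. **working-state power management** (so-called dynamic PM): while the system is in the working state, power is adjusted dynamically according to the state of individual hardware components. An active (in use) component must stay accessible to software, while an inactive (idle) component can be put into a low-power state in which it may not be accessible.

## 2. system-wide power management

> - [System Sleep States](https://docs.kernel.org/admin-guide/pm/sleep-states.html)
> - [System Suspend Code Flows](https://docs.kernel.org/admin-guide/pm/suspend-flows.html)

Depending on its configuration and the platform, the Linux kernel can support up to four system sleep states. From shallowest to deepest: suspend-to-idle, standby, suspend-to-RAM, and hibernation.

### 2.1 suspend-to-Idle

This is the most generic, lightweight, purely software-level suspend variant (also called freeze, S2I, or S2Idle). It saves more energy than runtime idle by freezing user space, suspending timekeeping, and putting I/O devices into low-power states. Because the whole system is suspended, the CPUs can stay in their deepest idle state. Unlike standby and suspend-to-RAM described below, it does not depend on platform support: it is always available when the kernel is built with `CONFIG_SUSPEND`.

#### suspend flow:

1. Invoking system-wide suspend notifiers
    - Callbacks tell each subsystem to prepare for the state transition.
2. Freezing tasks
    - User space tasks are frozen first.
    - Kernel threads come next. Unlike user space tasks, kernel threads are not intercepted from outside. A freezable kernel thread has to check periodically whether it should freeze and then put itself into an uninterruptible sleep.
3. Suspending devices and reconfiguring IRQs
    - Device suspend has four phases: prepare, suspend, late suspend, and noirq suspend.
    - The IRQs (interrupt requests) associated with system wakeup devices are armed so that a wakeup signal can be received at any time.
4. Freezing the scheduler tick and suspending timekeeping
    - At this point the CPUs can finally enter their deepest idle state. Since the scheduler tick and timekeeping are frozen, the CPUs can only be woken up by a non-timer hardware interrupt.

#### resume flow:

Resume is essentially the same sequence in reverse order:

1. Resuming timekeeping and unfreezing the scheduler tick.
2. Resuming devices and restoring the working-state configuration of IRQs.
3. Thawing tasks.
4. Invoking system-wide resume notifiers.

![Screenshot 2024-07-01 at 11.40.47 PM](https://hackmd.io/_uploads/SkCWAHeDA.png)

### 2.2 platform-dependent suspend - standby and suspend-to-RAM

Standby and suspend-to-RAM are also called **platform-dependent suspend**, because whether they are available depends on the platform.

#### standby

In addition to freezing user space, suspending timekeeping, and putting I/O devices into low-power states, standby also takes the nonboot CPUs offline (the boot CPU is usually CPU0) and suspends more low-level system functions. It is available when `CONFIG_SUSPEND` is set and the platform registers support for it.

#### suspend-to-RAM

In this state (STR or S2RAM) almost everything in the system is put into a low-power state except memory (RAM), which stays powered so that the system and device state saved in it is retained. In many cases the peripheral buses lose power as well, so devices must be able to handle the transition back to the "on" state. It is available when `CONFIG_SUSPEND` is set and the platform registers support for it.

#### suspend and resume flow:

The flow starts with the same steps as S2Idle and then adds two more: taking the nonboot CPUs offline and taking the system core offline. For STR, the platform then moves the remaining non-essential components into low-power mode. Resume runs the same steps in reverse.

![Screenshot 2024-06-30 at 10.35.52 PM](https://hackmd.io/_uploads/SkYL6yyDR.png)

### 2.3 Hibernation

This state (also called Suspend-to-Disk or STD) offers the greatest energy savings and can be used even when the platform provides no low-level support for system suspend. It does, however, require low-level code for resuming the system on the underlying CPU architecture.

#### suspend flow

Suspend in this state is implemented very differently from the three states above. It takes three system state changes, where the others take only one. First, the system stops all activity and the kernel creates a snapshot image of memory that can be kept in non-volatile disk storage after power is cut. Next, part of the system is brought back up just so that the image can be written to disk. Finally, the whole system is powered off, except for a limited set of wakeup devices that can signal a restart (for example the keyboard, the laptop lid, or the power button).

![Screenshot 2024-06-30 at 10.13.33 PM](https://hackmd.io/_uploads/ByxQOk1PR.png)

#### resume flow

Resume takes two state changes (the states above need only one). First, the bootloader boots a fresh kernel instance, the restore kernel, which looks for a hibernation image and loads it into memory. Second, the whole system is restored from the state recorded in the image and continues where it left off.

![Screenshot 2024-06-30 at 10.14.45 PM](https://hackmd.io/_uploads/rJOwOkJwA.png)

### 2.4 Configuration in linux kernel

The PM subsystem exposes a unified `sysfs` interface under `/sys/power`.

- `state`: [`disk`, `freeze`, `standby`, `mem`] select hibernation, S2Idle, and standby respectively; the meaning of `mem` is controlled by `mem_sleep`.
- `mem_sleep`: [`s2idle`, `shallow`, `deep`] select S2Idle, standby, and STR respectively.
- `disk`: tells the system what to do once the hibernation image has been saved.
    - `platform`: enter a special low-power state (requires platform support).
    - `shutdown`: power off.
    - `suspend`: suspend using the mode selected in `mem_sleep`. If the system wakes up successfully, the image is discarded; otherwise the image is used to restore the system.
    - `reboot`, `test_resume`: mainly useful for testing.

## 3. Working-State Power Management

- Idle PM: CPU idle, Device idle (Runtime PM)
- Active PM: CPUFreq

### 3.1 [CPU Idle Time Management](https://docs.kernel.org/admin-guide/pm/cpuidle.html)

The operating system has a special idle task. When an execution unit (CPU, core) has nothing to run, the scheduler switches it to the idle task, and when other work appears it switches to that work instead. Linux has a subsystem, CPU idle time management, that manages the processor idle states (C-states in ACPI terminology). When the idle task runs, the platform design determines which low-power state the processor enters and for how long. (If the platform provides no idle states, the idle task simply loops like any other task until the scheduler preempts it.) Which low-power setting makes sense also depends on the hardware. A single-core CPU can simply be put into a low-power state; with multiple cores or more complex execution units, resources shared between cores (caches, for example) have to be taken into account.

#### Idle Loop

Each iteration of the idle loop does two main things:

1. A `governor` decides which idle state the processor (hardware) should enter.
2. A `driver` actually asks the processor to enter that idle state.

Each idle state is described by two parameters:

1. target residency: the minimum time the processor has to stay in the state for entering it to be worthwhile.
2. exit latency: the maximum time the processor needs to leave the state (the worst case is recorded).

The `governor` has to estimate how long the processor will stay idle. Timer events are predictable, because it is known when they will fire, so the available sleep time is easy to derive from them. Non-timer events cannot be predicted, so the `governor` can only estimate the next idle duration from past behavior. Several `governor` implementations with different algorithms have been built on this basis. See `available_governors`, `current_governor`, and `current_driver` in `/sys/devices/system/cpu/cpuidle/`.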
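The governor's decision described above can be sketched in a few lines of plain C. This is only an illustration under stated assumptions, not kernel code: `struct idle_state`, `example_states`, `pick_idle_state()`, and the microsecond values are all hypothetical, and the real governors (menu, TEO) use considerably more elaborate predictions. The sketch picks the deepest state whose target residency fits into the predicted idle time and whose exit latency stays within a caller-supplied latency limit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one idle state; the field names are illustrative. */
struct idle_state {
    const char *name;
    unsigned int target_residency_us; /* minimum stay that makes the state worthwhile */
    unsigned int exit_latency_us;     /* worst-case time needed to leave the state */
};

/* Made-up example table, ordered from shallowest (index 0) to deepest. */
static const struct idle_state example_states[] = {
    { "shallow", 1,   1   },
    { "medium",  20,  10  },
    { "deep",    200, 100 },
};

/*
 * Return the index of the deepest state whose target residency does not
 * exceed the predicted idle time and whose exit latency does not exceed
 * the latency limit. Index 0 is the fallback when nothing deeper fits.
 */
static size_t pick_idle_state(const struct idle_state *states, size_t count,
                              unsigned int predicted_idle_us,
                              unsigned int latency_limit_us)
{
    size_t best = 0;

    assert(states != NULL && count > 0);
    for (size_t i = 1; i < count; i++) {
        if (states[i].target_residency_us > predicted_idle_us)
            continue; /* the CPU would wake up before the state pays off */
        if (states[i].exit_latency_us > latency_limit_us)
            continue; /* waking up from this state would take too long */
        best = i;
    }
    return best;
}
```

With the example table, `pick_idle_state(example_states, 3, 50, 1000)` selects index 1 (`medium`): the deep state is skipped because its target residency of 200 us exceeds the predicted 50 us of idle time.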
#### Idle CPUs and The Scheduler Tick

The scheduler tick is a timer the CPU scheduler uses to share CPU time fairly among running tasks: it fires periodically, and each time it fires the scheduler reconsiders which task should run. Letting the tick fire on an idle CPU makes little sense, because the CPU would be woken up every tick period and could never stay in a deep idle state. For this reason the governor can have the scheduler tick stopped on an idle CPU, so that the processor gets a long, undisturbed sleep. However, when a wakeup is expected within the tick period anyway (a non-tick timer that is about to fire, or a predicted non-timer wakeup), the governor leaves the tick running.

#### The menu Governor

#### The Timer Events Oriented (TEO) Governor

### 3.2 Device idle: Runtime PM

> - [Runtime Power Management Framework for I/O Devices](https://docs.kernel.org/power/runtime_pm.html)
> - [Linux 核心設計: Power Management(2): Runtime Power Management model](https://hackmd.io/@RinHizakura/SkziZRa9a)

- per-device idle
- single device at a time

```c
/* Excerpt: only the runtime PM callbacks of struct dev_pm_ops are shown. */
struct dev_pm_ops {
    int (*runtime_suspend)(struct device *dev);
    int (*runtime_resume)(struct device *dev);
    int (*runtime_idle)(struct device *dev);
};
```

The PM core decides whether and when these callbacks run. A callback may be defined in one of the following places, checked in this order:

1. The device belongs to a PM domain (`dev->pm_domain`): the domain's callback takes priority.
2. The device has a `dev->type` and `dev->type->pm` is defined.
3. The device has a `dev->class` and `dev->class->pm` is defined.
4. The device has a `dev->bus` and `dev->bus->pm` is defined.
5. If no subsystem defines or implements the callback, `dev->driver->pm` is checked and used.

#### Usage scenarios:

- Tell PM core the device is in use:
    1. `pm_runtime_get()`, `pm_runtime_get_sync()`
    2. Increment use count
    3. If the device was suspended, `pm_runtime_resume()` runs first to bring it back.
- Tell PM core the device is not in use:
    1. `pm_runtime_put()`, `pm_runtime_put_sync()`
    2. Decrement use count
- If the count goes 1->0:
    1. Check whether the device usage counter and the device's 'active' children counter are both 0.
    2. If so, `pm_runtime_idle()` can be run.
- What `pm_runtime_idle()` does depends on the subsystem or device, but its overall purpose is to check whether the device meets the conditions for suspend. If it does, `suspend` is carried out.
- `pm_runtime_suspend()`
    - prepare for low-power state
    - ensure wakeups enabled
    - save context
- On `resume`, the saved context is restored.

The callbacks are subject to the following conditions:

- `runtime_resume/idle/suspend` are mutually exclusive.
- `idle/suspend` can only be run for a device that is currently active.
- `idle/suspend` can only be run for a device whose two counters mentioned above are both 0.
- `resume` can only be run for a suspended device.

#### Autosuspend

```c
pm_runtime_set_autosuspend_delay(dev, delay_ms); /* inactivity delay in milliseconds */
pm_runtime_mark_last_busy(dev);                  /* record the time of the last activity */
pm_runtime_put_autosuspend(dev);                 /* drop the usage count; suspend once the delay expires */
```

#### Grouping PM Domains (genpd)

Devices are often grouped into domains that are power gated as a group.

Linux PM domains:

- Override ops: where a PM domain is present, the PM core uses the domain callbacks.

```c
struct dev_pm_domain {
    struct dev_pm_ops ops;
    ...
};
```

- The implementation is based on runtime PM.
- When all devices in the domain are runtime suspended: `genpd->power_off()`
- When the first device in the domain is runtime resumed: `genpd->power_on()`

Because powering a domain on or off also takes some planning, genpd governors are required.

### 3.3 [CPU Performance Scaling: CPU Freq](https://docs.kernel.org/admin-guide/pm/cpufreq.html)

Modern processors can run at several different clock frequency and voltage configurations (known as Operating Performance Points (OPPs), or P-states in ACPI terminology). A higher clock frequency and voltage let the CPU execute more instructions per unit of time, but also cost more energy. Adjusting the CPU frequency (switching P-states) is therefore another way to regulate power consumption.

#### `CPUFreq`

The `CPUFreq` subsystem consists of the `core`, `scaling governors`, and `scaling drivers`.

- `core`: provides the common infrastructure and interfaces, so that any component supporting CPU performance scaling can plug in its own implementation.
- `governors`: algorithms that estimate how much CPU capacity is needed.
- `drivers`: the drivers that actually adjust the hardware.
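The split between a governor (which estimates the required capacity) and a driver (which applies the setting) can be illustrated with a small user-space sketch. Everything here is an assumption made for illustration: `pick_frequency_khz()`, `example_freqs_khz`, and the frequency values are hypothetical and not part of the `CPUFreq` API. The function plays the governor's role by choosing the lowest available frequency that still covers a given utilization, measured as a percentage of the capacity at the highest frequency; applying the result to the hardware would be the driver's job.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up table of available frequencies in kHz, sorted in ascending order. */
static const unsigned int example_freqs_khz[] = {
    600000, 1200000, 1800000, 2400000,
};

/*
 * Return the lowest frequency in the table that can cover util_pct,
 * where util_pct is the load relative to the highest frequency.
 * Falls back to the highest frequency if nothing smaller is enough.
 */
static unsigned int pick_frequency_khz(const unsigned int *freqs_khz,
                                       size_t count, unsigned int util_pct)
{
    unsigned long long needed_khz;

    assert(freqs_khz != NULL && count > 0 && util_pct <= 100);

    /* Capacity needed, expressed as a frequency. */
    needed_khz = (unsigned long long)freqs_khz[count - 1] * util_pct / 100;

    for (size_t i = 0; i < count; i++) {
        if (freqs_khz[i] >= needed_khz)
            return freqs_khz[i];
    }
    return freqs_khz[count - 1];
}
```

With the example table, `pick_frequency_khz(example_freqs_khz, 4, 60)` returns 1800000: 60% of 2.4 GHz is 1.44 GHz, and 1.8 GHz is the lowest entry that covers it. Choosing the lowest sufficient frequency reflects the trade-off described above between executing more instructions and consuming more energy.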