# Linux Suspend/Resume Experiment (Part 2)

## Tracing the suspend phase

To optimize the system's suspend and resume flow, the first thing to find out is which functions Linux actually calls internally when I tell the machine to sleep. Only with the whole path in view is it clear where optimization should start. I begin with the simplest suspend mode: freeze (Suspend-to-Idle). After running the following command:

```shell
echo "freeze" | sudo tee /sys/power/state
```

the system starts the freeze flow. It first calls `state_store()` in [linux/kernel/power/main.c](https://github.com/torvalds/linux/blob/master/kernel/power/main.c). Based on the string the user wrote (for example `echo mem > /sys/power/state`), this function decides which suspend mode the system should enter, or whether it should hibernate, and calls the corresponding kernel API.

```c
static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr,
			   const char *buf, size_t n)
```

If the requested state is freeze or mem, the kernel calls `pm_suspend()` to enter the corresponding suspend mode; if it is disk, the kernel calls `hibernate()` instead. While `pm_suspend()` runs, it in turn calls `enter_state()`, which drives the whole suspend flow: freezing user processes, telling drivers to enter a low-power mode, and performing the platform-specific suspend operations. After writing freeze, the overall flow looks like this:

```
state_store
  pm_suspend
    enter_state
      s2idle_begin
      ksys_sync_helper
      suspend_prepare
      suspend_devices_and_enter
      suspend_finish
```

Observation and analysis show that `ksys_sync_helper` is the main reason the `suspend_enter` phase of the trace takes so long.

![Screenshot from 2025-05-25 15-11-08](https://hackmd.io/_uploads/rJ0cfrgGlg.png)

This function performs a full system sync, making sure all modified data is safely written to disk:

>Perform a full system sync to ensure all modified data is safely written to disk.

Its steps are:

1. Wake up all flusher threads so that each device's data is written back in parallel.
2. Sync all inodes, flushing file data to disk and making sure the flusher threads have finished.
3. Call the ->sync_fs() method on every mounted filesystem, first asynchronously and then synchronously, to make sure metadata (data that describes data) really reaches the storage device.
4. Additionally write the page cache of every block device from memory to disk, because filesystems such as ext2 may only update metadata (for example inodes or bitmaps) in the block device's page cache and do not write it out themselves in ->sync_fs().
5. If laptop_mode (the laptop power-saving mode) is enabled, trigger the extra sync handling.

**Flusher threads** are the kernel background threads responsible for writing dirty pages from the page cache back to the storage device. They can be observed with the following command:

```shell
ps -eo pid,comm | grep flush
```

The output shows that this work is carried out by kworker threads:

```
77646 kworker/u32:3-flush-259:5
87512 kworker/u32:0-flush-259:5
```

Testing shows that every level of suspend runs `sync filesystem` during the `suspend_enter` phase and tries to write the `dirty page`s back to disk. However, according to [System Sleep States](https://www.kernel.org/doc/html/next/admin-guide/pm/sleep-states.html), memory only loses power under Hibernation, so memory contents survive in every other state. Does that mean that, as long as Hibernation is not used, `sync filesystem` can be skipped and the `dirty page`s need not be written back to disk?

>This state (also referred to as STR or S2RAM), if supported, offers significant energy savings as everything in the system is put into a low-power state, except for memory.

Next come the three remaining main suspend functions:

* suspend_prepare
* suspend_devices_and_enter
* suspend_finish

Of these, `suspend_prepare` mainly freezes user-space processes and the freezable kernel threads, so that nothing else keeps running and interferes during suspend. It does this by calling `freeze_processes()` and `freeze_kernel_threads()` in [linux/kernel/power/process.c](https://github.com/torvalds/linux/blob/master/kernel/power/process.c#L28). `suspend_finish` runs after the system wakes up and resumes all the processes and threads that were frozen earlier, so that the system returns to normal operation. It calls `thaw_processes()`.

The most important one is `suspend_devices_and_enter`. It is the core of the whole suspend flow and handles:

* the device suspend flow (including late suspend and noirq suspend)
* calling the function that actually enters the low-power state
* checking for wakeup events and performing the necessary resume operations

The following shows how `suspend_devices_and_enter()` runs after writing freeze:

```
suspend_devices_and_enter()
├── check whether the requested state is supported
├── record the requested state
├── pm_set_suspend_no_platform()
├── platform_suspend_begin()
├── console_suspend_all()
├── dpm_suspend_start()
├── suspend_enter()
├── dpm_resume_end()
├── console_resume_all()
└── platform_resume_end()
```

Within it, [`dpm_suspend_start()`](https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L2009) calls `dpm_prepare` and `dpm_suspend`. `dpm_prepare()` invokes the registered ->prepare() callback of every non-sysdev device (that is, every device managed through struct device) to do the preparation work before suspend, such as locking state or checking whether the device can be suspended. This phase does not actually change how the device operates; it only sets up the conditions for the suspend that follows. `dpm_suspend()` then invokes the ->suspend() callback of these devices, which is the core step in which a driver really puts its device into a low-power state.

```
suspend_devices_and_enter()
└─> dpm_suspend_start()
    ├─> dpm_prepare()          // preparation phase
    │   └─> device_prepare()   // each device's prepare callback
    └─> dpm_suspend()          // actual suspend
        └─> device_suspend()   // each device's suspend callback
```

![Screenshot from 2025-05-25 21-04-02](https://hackmd.io/_uploads/SkmISqxGex.png)

Next is the execution flow of `suspend_enter`:

```
[suspend_enter()]
        ↓
[platform_suspend_prepare]
        ↓
[dpm_suspend_late]
        ↓
[platform_suspend_prepare_late]
        ↓
[dpm_suspend_noirq]
        ↓
[platform_suspend_prepare_noirq]
        ↓
┌─────────────────────────────────────────┐
│ if (state == S2IDLE)                    │
│     → [s2idle_loop()]                   │
│ else                                    │
│     → [pm_sleep_disable_secondary_cpus] │
│         ↓                               │
│       [arch_suspend_disable_irqs]       │
│         ↓                               │
│       [syscore_suspend]                 │
│         ↓                               │
│       [suspend_ops->enter()]            │
│         ↓                               │
│       [syscore_resume]                  │
│         ↓                               │
│       [arch_suspend_enable_irqs]        │
│         ↓                               │
│       [pm_sleep_enable_secondary_cpus]  │
└─────────────────────────────────────────┘
        ↓
[platform_resume_noirq] → [dpm_resume_noirq]
        ↓
[platform_resume_early] → [dpm_resume_early]
        ↓
[platform_resume_finish]
```

![Screenshot from 2025-05-25 21-21-02](https://hackmd.io/_uploads/BklwF5lMgg.png)

`s2idle_loop()` runs next; it is the main control loop in which the system enters suspend-to-idle. At this point all user processes are frozen and all devices are suspended, and only the processors remain, idling while they wait for a wakeup event. According to the comment in the source:

>Suspend-to-idle equals: frozen processes + suspended devices + idle processors.
`s2idle_loop()` repeatedly checks whether a valid wakeup event has occurred. If the platform provides `s2idle_ops->wake()`, that function is used for the check; otherwise the loop falls back to Linux's generic `pm_wakeup_pending()`. If there is no wakeup condition, `s2idle_enter()` puts all CPUs, including the one currently running, into the idle state together, where they wait for a wakeup event. As soon as any source raises a valid wakeup event, the wait is interrupted, the whole system wakes up, and the suspend-to-idle state ends.

![Screenshot from 2025-05-26 13-49-47](https://hackmd.io/_uploads/rJJGWtbflg.png)

## Tracing the resume phase

From the `suspend_enter()` flow, the system is currently sitting in `s2idle_loop()`, idle and waiting for a wakeup event:

```
[suspend_enter()]
        ↓
[platform_suspend_prepare]
        ↓
[dpm_suspend_late]
        ↓
[platform_suspend_prepare_late]
        ↓
[dpm_suspend_noirq]
        ↓
[platform_suspend_prepare_noirq]
        ↓
┌─────────────────────────────────────────┐
│ if (state == S2IDLE)                    │
│     → [s2idle_loop()]                   │
│ else                                    │
│     → [pm_sleep_disable_secondary_cpus] │
│         ↓                               │
│       [arch_suspend_disable_irqs]       │
│         ↓                               │
│       [syscore_suspend]                 │
│         ↓                               │
│       [suspend_ops->enter()]            │
│         ↓                               │
│       [syscore_resume]                  │
│         ↓                               │
│       [arch_suspend_enable_irqs]        │
│         ↓                               │
│       [pm_sleep_enable_secondary_cpus]  │
└─────────────────────────────────────────┘
        ↓
[platform_resume_noirq] → [dpm_resume_noirq]
        ↓
[platform_resume_early] → [dpm_resume_early]
        ↓
[platform_resume_finish]
```

The source code shows that, when the sleep state is Suspend-to-Idle, the resume flow starts from `platform_resume_noirq()` once the system wakes up. It then runs `dpm_resume_noirq()`, `platform_resume_early()`, and `dpm_resume_early()` in order to complete the device and platform resume operations, and finally returns to `suspend_devices_and_enter()` to run `dpm_resume_end()`. That function simply calls two other functions, `dpm_resume()` and `dpm_complete()`:

```c
void dpm_resume_end(pm_message_t state)
{
	dpm_resume(state);
	dpm_complete(state);
}
EXPORT_SYMBOL_GPL(dpm_resume_end);
```

![Screenshot from 2025-05-26 14-13-34](https://hackmd.io/_uploads/Sypc8KWGxg.png)

![Screenshot from 2025-05-26 14-17-45](https://hackmd.io/_uploads/SkT9PY-Mlx.png)

## Analyzing the slowest phases

The two figures below show that the most time is spent in `sync_filesystems` during the `suspend_enter` phase and in the `dpm_resume` phase during resume, so these are the two phases worth trying to speed up:

![Screenshot from 2025-05-26 14-28-14](https://hackmd.io/_uploads/rkxEqKZGxg.png)

![Screenshot from 2025-05-26 14-29-27](https://hackmd.io/_uploads/r1dL9tWfle.png)
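As a possible starting point for the `sync_filesystems` part, the sketch below checks whether the kernel is configured to sync before suspend and how much dirty data is currently waiting to be written. It assumes the test machine runs a kernel recent enough to expose the `/sys/power/sync_on_suspend` knob; the helper names `sync_on_suspend_status` and `dirty_kb` are made up for this example and are not part of any tool used above.

```shell
#!/bin/sh
# Hypothetical helpers for this write-up, not part of any existing tool.

# Report whether the kernel syncs filesystems before suspend.
# The knob path can be overridden so the function can be tried on a plain
# file; by default it reads the real sysfs node (assumed to exist).
sync_on_suspend_status() {
    knob="${1:-/sys/power/sync_on_suspend}"
    if [ ! -r "$knob" ]; then
        echo "unsupported"
        return 0
    fi
    case "$(cat "$knob")" in
        1) echo "enabled" ;;
        0) echo "disabled" ;;
        *) echo "unknown" ;;
    esac
}

# Print the Dirty and Writeback counters (in kB) from a meminfo-style file.
dirty_kb() {
    meminfo="${1:-/proc/meminfo}"
    [ -r "$meminfo" ] || return 0
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2 }' "$meminfo"
}

sync_on_suspend_status
dirty_kb
```

If the knob is present, writing `0` to it as root should make `enter_state()` skip the sync step for suspend. The trade-off is that dirty pages then exist only in RAM while the machine sleeps, so they would be lost if power ran out before resume; whether that risk is acceptable depends on the use case.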