A Detailed Introduction to Android Low Memory Killer

###### tags: `Android Performance Tuning`

# A Detailed Introduction to Android Low Memory Killer

> For questions related to embedded software engineering, feel free to contact me on LinkedIn: [**Linkedin**](https://www.linkedin.com/in/chao-shun/)

## Introduction

> The [Low memory killer daemon](https://source.android.com/docs/core/perf/lmkd) is described in the official AOSP (Android Open Source Project) documentation, but that document has not kept pace with the latest code. This article therefore covers the current mechanism in [Android 14](https://developer.android.com/about/versions/14) paired with **Linux 5.15**.

As its name suggests, lmkd responds to low-watermark alerts on system memory by **killing** app processes to free memory and keep the user experience smooth. This makes the mechanism especially important on a ***Low RAM Device***: with well-chosen parameters, memory stays stable and the user experience stays good.

## [PSI (Pressure Stall Information)](https://docs.kernel.org/accounting/psi.html)

> Since [**GKI (Generic Kernel Image)**](https://source.android.com/docs/core/architecture/kernel/generic-kernel-image) was introduced, [`CONFIG_PSI`](https://android.googlesource.com/kernel/common/+/refs/heads/android-gs-bluejay-5.10-android14/arch/arm64/configs/gki_defconfig#8) is forcibly enabled, and PSI has become lmkd's default low-memory alarm. PSI can monitor CPU and IO as well as memory, but lmkd only looks at memory, so only memory is covered below.

### Memory Stall Section

First, what counts as a **Memory Stall** in the kernel? The kernel PSI source code describes this clearly: any section wrapped by `psi_memstall_enter` and `psi_memstall_leave` is a Memory Stall Section. Finding every place in the kernel source that calls these APIs therefore tells you where PSI memory stalls can come from.

[`kernel/sched/psi.c`](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/sched/psi.c?h=v5.15.177#n903)

```c
/**
 * psi_memstall_enter - mark the beginning of a memory stall section
 * @flags: flags to handle nested sections
 *
 * Marks the calling task as being stalled due to a lack of memory,
 * such as waiting for a refault or performing reclaim.
 */
void psi_memstall_enter(unsigned long *flags)

/**
 * psi_memstall_leave - mark the end of an memory stall section
 * @flags: flags to handle nested memdelay sections
 *
 * Marks the calling task as no longer stalled due to lack of memory.
 */
void psi_memstall_leave(unsigned long *flags)
```

The `cscope` results below show that Memory Stall Sections cover not only **Direct Reclaim** but also time spent waiting on disk IO, and even time spent waiting to read a swap page. You therefore need to know which kind of Memory Stall Section is the main contributor on your platform before deciding on the next step.

:::warning
Before doing any performance tuning, collect enough evidence first. Otherwise you may work for a long time only to discover that you went in entirely the wrong direction. Before every change, ask yourself: is the adjustment I am about to make **an inference based on facts, or a blind guess**?
:::

`cscope for psi_memstall_enter`

```clike
Cscope tag: psi_memstall_enter
   #   line  filename / context / line
   1   1698  block/blk-cgroup.c <<blkcg_maybe_throttle_blkg>>
             psi_memstall_enter(&pflags);
   2   1088  block/blk-core.c <<submit_bio>>
             psi_memstall_enter(&pflags);
   3   3020  mm/compaction.c <<kcompactd>>
             psi_memstall_enter(&pflags);
   4   1294  mm/filemap.c <<wait_on_page_bit_common>>
             psi_memstall_enter(&pflags);
   5   2365  mm/memcontrol.c <<reclaim_high>>
             psi_memstall_enter(&pflags);
   6   2596  mm/memcontrol.c <<mem_cgroup_handle_over_high>>
             psi_memstall_enter(&pflags);
   7   2665  mm/memcontrol.c <<try_charge_memcg>>
             psi_memstall_enter(&pflags);
   8   4674  mm/page_alloc.c <<__alloc_pages_direct_compact>>
             psi_memstall_enter(&pflags);
   9   4957  mm/page_alloc.c <<__alloc_pages_direct_reclaim>>
             psi_memstall_enter(&pflags);
  10    325  mm/page_io.c <<swap_readpage>>
             psi_memstall_enter(&pflags);
  11   7045  mm/vmscan.c <<balance_pgdat>>
             psi_memstall_enter(&pflags);
  12   7676  mm/vmscan.c <<__node_reclaim>>
             psi_memstall_enter(&pflags);
```

Developers with some experience may at this point reach straight for **[BCC (BPF Compiler Collection)](https://github.com/iovisor/bcc)** to observe dynamically which call traces enter a Memory Stall Section. This is considerably more efficient than a static `grep`, because it shows which Memory Stall Sections your platform actually enters. You can even write your own **[bpftrace](https://github.com/bpftrace/bpftrace)** script to see directly which section is the main contributor.

[`bcc stackcount psi_memstall_enter`](https://github.com/iovisor/bcc/blob/master/tools/stackcount.py)

```bash
(bcc)root@localhost:/# stackcount psi_memstall_enter
Tracing 1 functions for "psi_memstall_enter"... Hit Ctrl-C to end.

  psi_memstall_enter
  wait_on_page_bit_common
  filemap_fault
  __do_fault
  do_handle_mm_fault
  do_page_fault
  do_DataAbort
  __dabt_usr
    122

  psi_memstall_enter
  swap_readpage
  swapin_readahead
  do_swap_page
  do_handle_mm_fault
  do_page_fault
  do_DataAbort
  __dabt_usr
  [unknown]
    246
```

### PSI State for Memory

With Memory Stall Sections understood, the next topic is PSI's trigger mechanism. PSI has two indicators, `SOME` and `FULL`:

* **`SOME`** - Any task on the run queue that has entered a Memory Stall Section counts toward `SOME`. See the `PSI_MEM_SOME` case (line 9) of the `test_state` listing below.
* **`FULL`** - In addition to there being tasks that have entered a Memory Stall Section, no task other than reclaimers (such as `kswapd`) is on the run queue. Only then does the time count toward `FULL`. See the `PSI_MEM_FULL` case (lines 11-12) of the `test_state` listing below, which checks whether any non-reclaimer task is on the run queue.

[`kernel/sched/psi.c`](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/sched/psi.c?h=v5.15.177#n21)

```
 * The time in which a task can execute on a CPU is our baseline for
 * productivity. Pressure expresses the amount of time in which this
 * potential cannot be realized due to resource contention.
 *
 * This concept of productivity has two components: the workload and
 * the CPU. To measure the impact of pressure on both, we define two
 * contention states for a resource: SOME and FULL.
 * SOME = nr_delayed_tasks != 0
 * FULL = nr_delayed_tasks != 0 && nr_productive_tasks == 0
 * For each runqueue, we track:
 *
 *    tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0)
 *    tFULL[cpu] = time(nr_delayed_tasks[cpu] && !nr_productive_tasks[cpu])
 * tNONIDLE[cpu] = time(nr_nonidle_tasks[cpu] != 0)
 *
 * and then periodically aggregate:
 *
 * tNONIDLE = sum(tNONIDLE[i])
 *
 *    tSOME = sum(tSOME[i] * tNONIDLE[i]) / tNONIDLE
 *    tFULL = sum(tFULL[i] * tNONIDLE[i]) / tNONIDLE
 *
 *    %SOME = tSOME / period
 *    %FULL = tFULL / period
```

:::info
Kernel documentation is sometimes updated slowly, while the source code often carries very detailed explanatory comments. When reading a document, it is worth also checking whether the source has a fuller explanation. It is sometimes puzzling that maintainers do not update the documentation at the same time.
:::

[`kernel/sched/psi.c`](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/sched/psi.c?h=v5.15.177#n232)

```c=1
static bool test_state(unsigned int *tasks, enum psi_states state)
{
    switch (state) {
    case PSI_IO_SOME:
        return unlikely(tasks[NR_IOWAIT]);
    case PSI_IO_FULL:
        return unlikely(tasks[NR_IOWAIT] && !tasks[NR_RUNNING]);
    case PSI_MEM_SOME:
        return unlikely(tasks[NR_MEMSTALL]);
    case PSI_MEM_FULL:
        return unlikely(tasks[NR_MEMSTALL] &&
            tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);
    case PSI_CPU_SOME:
        return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
    case PSI_CPU_FULL:
        return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
    case PSI_NONIDLE:
        return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
            tasks[NR_RUNNING];
    default:
        return false;
    }
}
```

### User Space Interface

Next is how to use the interface PSI exposes to user space, which the [PSI Document](https://docs.kernel.org/accounting/psi.html#monitoring-for-pressure-thresholds) describes in detail. A user can monitor PSI with `poll()`, and an event is delivered to user space whenever the trigger condition is met. The condition is user-defined: you choose `SOME` or `FULL`, and how much `stall time` within what `window time` fires the event.

> To register a trigger user has to open psi interface file under /proc/pressure/ representing the resource to be monitored and write the desired threshold and time window. The open file descriptor should be used to wait for trigger events using select(), poll() or epoll().

```bash
<some|full> <stall amount in us> <time window in us>
```

Now look at how Android 14 registers with the PSI interface. In `init_psi_monitor`, the `snprintf` call (line 22) formats `stall_type_name`, `threshold_us`, and `window_us`. All of these are controlled by lmkd parameters, which are covered later. In `register_psi_monitor` you can see that lmkd does use `epoll()` to monitor whether the configured condition has been triggered. When it has, the corresponding event handler runs.

`system/memory/lmkd/libpsi/psi.cpp`

```cpp=1
int init_psi_monitor(enum psi_stall_type stall_type, int threshold_us, int window_us,
                     enum psi_resource resource) {
    if (resource < PSI_MEMORY || resource >= PSI_RESOURCE_COUNT) {
        ALOGE("Invalid psi resource type: %d", resource);
        errno = EINVAL;
        return -1;
    }

    int fd;
    int res;
    char buf[256];

    fd = TEMP_FAILURE_RETRY(open(psi_resource_file[resource], O_WRONLY | O_CLOEXEC));
    if (fd < 0) {
        ALOGE("No kernel psi monitor support (errno=%d)", errno);
        return -1;
    }

    switch (stall_type) {
    case (PSI_SOME):
    case (PSI_FULL):
        res = snprintf(buf, sizeof(buf), "%s %d %d",
            stall_type_name[stall_type], threshold_us, window_us);
        break;
    default:
        ALOGE("Invalid psi stall type: %d", stall_type);
        errno = EINVAL;
        goto err;
    }

    if (res >= (ssize_t)sizeof(buf)) {
        ALOGE("%s line overflow for psi stall type '%s'", psi_resource_file[resource],
              stall_type_name[stall_type]);
        errno = EINVAL;
        goto err;
    }

    res = TEMP_FAILURE_RETRY(write(fd, buf, strlen(buf) + 1));
    if (res < 0) {
        ALOGE("%s write failed for psi stall type '%s'; errno=%d", psi_resource_file[resource],
              stall_type_name[stall_type], errno);
        goto err;
    }

    return fd;

err:
    close(fd);
    return -1;
}

int register_psi_monitor(int epollfd, int fd, void* data) {
    int res;
    struct epoll_event epev;

    epev.events = EPOLLPRI;
    epev.data.ptr = data;
    res = epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &epev);
    if (res < 0) {
        ALOGE("epoll_ctl for psi monitor failed; errno=%d", errno);
    }
    return res;
}
```

## System Memory Information

lmkd reads the System Memory Information exposed by the kernel to learn the current state of system memory. From it, lmkd computes the current memory watermark level, whether [thrashing](https://en.wikipedia.org/wiki/Thrashing_(computer_science)) is severe, whether [Zram Swap](https://docs.kernel.org/admin-guide/blockdev/zram.html) usage is approaching its limit, and what each task's `oom_score_adj` is. Knowing these inputs first makes the calculations in lmkd's code flow much easier to follow.

### [`/proc/meminfo`](https://man7.org/linux/man-pages/man5/proc_meminfo.5.html)

> This file reports statistics about memory usage on the system.

This file gives a system-wide view of memory usage, from which the current watermark level can be computed. [***Understanding the Linux Virtual Memory Manager***](https://www.kernel.org/doc/gorman/) has a classic watermark diagram:

* When free memory drops below Low, `kswapd` is woken to reclaim pages, and it keeps reclaiming until the level returns to High.
* When `kswapd` reclaims pages more slowly than pages are allocated, the level keeps falling. Once it reaches Min, Direct Reclaim is triggered.

`Figure 2.2. Zone Watermarks`

![image](https://hackmd.io/_uploads/HJomo2hO1g.png)

:::info
[***Understanding the Linux Virtual Memory Manager***](https://www.kernel.org/doc/gorman/) is a classic book on Linux kernel memory management. It is based on Linux 2.6, yet many of its fundamental concepts still apply in Linux 6.13. It is highly recommended for developers who want to understand Linux kernel memory.
:::

### [`/proc/vmstat`](https://man7.org/linux/man-pages/man5/proc_vmstat.5.html)

> This file displays various virtual memory statistics.

This file tracks the usage of all pages, counted in units of pages. The most important field for lmkd is `workingset_refault_file`, from which it computes thrashing. For how the kernel defines thrashing, see the explanation in the source code. I have not read it in detail, so I will not risk misleading anyone.

`mm/workingset.c`

```c
 *		Double CLOCK lists
 *
 * Per node, two clock lists are maintained for file pages: the
 * inactive and the active list.  Freshly faulted pages start out at
 * the head of the inactive list and page reclaim scans pages from the
 * tail.  Pages that are accessed multiple times on the inactive list
 * are promoted to the active list, to protect them from reclaim,
 * whereas active pages are demoted to the inactive list when the
 * active list grows too big.
 *
 *   fault ------------------------+
 *                                 |
 *              +--------------+   |            +-------------+
 *   reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
 *              +--------------+                +-------------+    |
 *                     |                                           |
 *                     +-------------- promotion ------------------+
 *
 *
 *		Access frequency and refault distance
 *
 * A workload is thrashing when its pages are frequently used but they
 * are evicted from the inactive list every time before another access
 * would have promoted them to the active list.
```

### [`/proc/$(pid)/oom_score_adj`](https://man7.org/linux/man-pages/man5/proc_pid_oom_score_adj.5.html)

> This file can be used to adjust the badness heuristic.

When lmkd decides to kill a task, it starts from the tasks with the highest `oom_score_adj`. AOSP documents what each `oom_score_adj` value means. For example, 900 and above marks a process that can be killed freely without any impact on the system, whereas 0 marks the task the user is currently using, and killing it would affect the user severely.

`services/core/java/com/android/server/am/ProcessList.java`

```java
    // This is a process only hosting activities that are not visible,
    // so it can be killed without any disruption.
    public static final int CACHED_APP_MAX_ADJ = 999;
    public static final int CACHED_APP_MIN_ADJ = 900;

    // This is the oom_adj level that we allow to die first. This cannot be equal to
    // CACHED_APP_MAX_ADJ unless processes are actively being assigned an oom_score_adj of
    // CACHED_APP_MAX_ADJ.
    public static final int CACHED_APP_LMK_FIRST_ADJ = 950;

    // Number of levels we have available for different service connection group importance
    // levels.
    public static final int CACHED_APP_IMPORTANCE_LEVELS = 5;

    // The B list of SERVICE_ADJ -- these are the old and decrepit
    // services that aren't as shiny and interesting as the ones in the A list.
    public static final int SERVICE_B_ADJ = 800;

    // This is the process of the previous application that the user was in.
    // This process is kept above other things, because it is very common to
    // switch back to the previous app.  This is important both for recent
    // task switch (toggling between the two top recent apps) as well as normal
    // UI flow such as clicking on a URI in the e-mail app to view in the browser,
    // and then pressing back to return to e-mail.
    public static final int PREVIOUS_APP_ADJ = 700;

    // This is a process holding the home application -- we want to try
    // avoiding killing it, even if it would normally be in the background,
    // because the user interacts with it so much.
    public static final int HOME_APP_ADJ = 600;

    // This is a process holding an application service -- killing it will not
    // have much of an impact as far as the user is concerned.
    public static final int SERVICE_ADJ = 500;

    // This is a process with a heavy-weight application.  It is in the
    // background, but we want to try to avoid killing it.  Value set in
    // system/rootdir/init.rc on startup.
    public static final int HEAVY_WEIGHT_APP_ADJ = 400;

    // This is a process currently hosting a backup operation.  Killing it
    // is not entirely fatal but is generally a bad idea.
    public static final int BACKUP_APP_ADJ = 300;

    // This is a process bound by the system (or other app) that's more important than services but
    // not so perceptible that it affects the user immediately if killed.
    public static final int PERCEPTIBLE_LOW_APP_ADJ = 250;

    // This is a process hosting services that are not perceptible to the user but the
    // client (system) binding to it requested to treat it as if it is perceptible and avoid killing
    // it if possible.
    public static final int PERCEPTIBLE_MEDIUM_APP_ADJ = 225;

    // This is a process only hosting components that are perceptible to the
    // user, and we really want to avoid killing them, but they are not
    // immediately visible. An example is background music playback.
    public static final int PERCEPTIBLE_APP_ADJ = 200;

    // This is a process only hosting activities that are visible to the
    // user, so we'd prefer they don't disappear.
    public static final int VISIBLE_APP_ADJ = 100;
    static final int VISIBLE_APP_LAYER_MAX = PERCEPTIBLE_APP_ADJ - VISIBLE_APP_ADJ - 1;

    // This is a process that was recently TOP and moved to FGS. Continue to treat it almost
    // like a foreground app for a while.
    // @see TOP_TO_FGS_GRACE_PERIOD
    public static final int PERCEPTIBLE_RECENT_FOREGROUND_APP_ADJ = 50;

    // This is the process running the current foreground app.  We'd really
    // rather not kill it!
    public static final int FOREGROUND_APP_ADJ = 0;
```

## Low Memory Killer Daemon

The most important function in lmkd is `mp_event_psi`, the PSI event handler. It computes the current memory state of the system, checks whether a **Kill Condition** is met, and finally decides which app process to kill. Once you know this function well, you have a grip on the core of lmkd.

### Code Flow

```graphviz
digraph lmkdFlow {
    graph [bgcolor="transparent"]
    rankdir=LR;
    {rank=same A B C D E}
    {rank=same F G H}

    A[label = "PSI Event Trigger", shape=parallelogram]
    B[label = "PSI Event Handler", shape=box]
    C[label = "Kill Condition is True", shape=diamond]
    D[label = "Find a Task", shape=diamond]
    E[label = "Kill the Task", shape=box]
    F[label = "Finished", shape=parallelogram]
    G[label = "Finished", shape=parallelogram]
    H[label = "Finished", shape=parallelogram]

    A->B->C
    C->D[label="Y"]
    D->E[label="Y"]
    E->F
    C->G[label="N"]
    D->H[label="N"]
}
```

### Parameters

* **kill_heaviest_task**
  Whether, among tasks with the same `oom_score_adj`, the one using the most memory is killed first.
* **kill_timeout_ms**
  When a PSI event arrives, lmkd first checks whether the previous `kill` has finished. If it is still in progress and has been running for less than `kill_timeout_ms`, the event is ignored.
* **swap_free_low_percentage**
  When the share of free swap falls below `swap_free_low_percentage`, the device enters the state called `swap_is_low`. For example, with a total swap size of 500 MB and `swap_free_low_percentage` set to 10%, the device enters `swap_is_low` as soon as free swap drops below 50 MB.
* **psi_partial_stall_ms**
  The **stall time in ms** for the PSI `SOME` trigger. The window is 1000 ms.
* **psi_complete_stall_ms**
  The **stall time in ms** for the PSI `FULL` trigger. The window is 1000 ms.
* **thrashing_limit**
  When the thrashing percentage exceeds this value, lmkd's thrashing condition holds.
* **lowmem_min_oom_score**
  When free memory has dropped below the Low watermark, lmkd kills tasks whose `oom_score_adj` is higher than `lowmem_min_oom_score`.

### Kill Condition

* **min watermark is breached even after kill**
  An app has already been killed, yet free memory is still below the Min watermark.
* **device is not responding**
  The event was triggered by PSI `FULL`.
* **device is low on swap and thrashing**
  The device is in `swap_is_low` and `thrashing` is above the configured `thrashing_limit`.
* **watermark is breached and swap is low**
  Free memory is below the Low watermark and the device is in `swap_is_low`.
* **watermark is breached and thrashing**
  Free memory is below the Low watermark and `thrashing` is above the configured `thrashing_limit`.
* **device is in direct reclaim and thrashing**
  Direct Reclaim is occurring and `thrashing` is above the configured `thrashing_limit`.
* **watermark is breached**
  Free memory is below the Low watermark and some task has an `oom_score_adj` higher than `lowmem_min_oom_score`.

## Performance Tuning Strategy

As the sections above show, tuning lmkd involves more than lmkd's own parameters. Linux kernel virtual memory settings such as `swappiness`, `watermark_scale_factor`, and `min_free_kbytes` are also involved, not to mention the many Android-side settings. Before adjusting anything, understand how these parameters interact, observe which bottleneck the platform is actually hitting, and collect data to find the key factor. Only then start tuning. Do not dive straight into changing parameters, because that rarely leads to a good result.

If you run into related problems, feel free to get in touch on [**Linkedin**](https://www.linkedin.com/in/chao-shun/).
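One way to start gathering the evidence recommended above is to sample `/proc/pressure/memory` before and after a change. The following is a minimal sketch in C, not code from lmkd or the kernel: `struct psi_line`, `parse_psi_line`, and `stall_percent` are hypothetical names introduced for illustration. It assumes the line format described in the kernel PSI documentation (`some avg10=... avg60=... avg300=... total=...`, with `total` in microseconds).

```c
#include <stdio.h>

/* Hypothetical helper type for this sketch; not part of lmkd or the kernel. */
struct psi_line {
    char kind[8];                /* "some" or "full" */
    double avg10;                /* % of time stalled over the last 10 s */
    double avg60;                /* % of time stalled over the last 60 s */
    double avg300;               /* % of time stalled over the last 300 s */
    unsigned long long total_us; /* cumulative stall time in microseconds */
};

/*
 * Parse one line of /proc/pressure/memory, for example:
 *   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
 * Returns 0 on success, -1 if the line does not match that format.
 */
int parse_psi_line(const char *line, struct psi_line *out)
{
    int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   out->kind, &out->avg10, &out->avg60, &out->avg300,
                   &out->total_us);
    return n == 5 ? 0 : -1;
}

/*
 * Percentage of an interval spent stalled, computed from two total= samples
 * taken interval_us apart. Returns 0.0 for an empty interval or for a
 * counter that appears to have gone backwards.
 */
double stall_percent(unsigned long long before_us,
                     unsigned long long after_us,
                     unsigned long long interval_us)
{
    if (interval_us == 0 || after_us < before_us)
        return 0.0;
    return 100.0 * (double)(after_us - before_us) / (double)interval_us;
}
```

Reading the file twice, for instance one second apart, and passing both `total` values to `stall_percent` gives the stall share for that second. That figure can be compared against the `psi_partial_stall_ms` and `psi_complete_stall_ms` thresholds, whose window is 1000 ms.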