tags: Android Performance Tuning

A Detailed Introduction to the Android Low Memory Killer

For questions related to embedded software engineering, feel free to contact me on LinkedIn.

Introduction

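The low memory killer daemon (lmkd) is covered in the official AOSP (Android Open Source Project) documentation, but that documentation has not kept pace with the current code. This article therefore describes the mechanism as it currently works on Android 14 with Linux 5.15.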

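As its name suggests, lmkd frees memory by killing app processes when system memory raises a low-watermark alarm, so that the user experience stays smooth. This makes the mechanism especially important on low-RAM devices: with well-chosen parameters, memory stays in a stable state and the user experience is very good.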

PSI (Pressure Stall Information)

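Since the introduction of GKI (Generic Kernel Image), CONFIG_PSI is mandatorily enabled, and PSI has become lmkd's default low-watermark alarm. PSI does not monitor memory only; it can monitor CPU and IO as well. lmkd only looks at memory, however, so the sections below cover memory only.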

Memory Stall Section

First, let's look at what counts as a memory stall in the kernel. The kernel's PSI source code describes this quite clearly: any section wrapped by psi_memstall_enter and psi_memstall_leave is called a memory stall section. Finding every place in the kernel source that uses these APIs therefore tells you where PSI memory stalls can occur.

kernel/sched/psi.c

/**
 * psi_memstall_enter - mark the beginning of a memory stall section
 * @flags: flags to handle nested sections
 *
 * Marks the calling task as being stalled due to a lack of memory,
 * such as waiting for a refault or performing reclaim.
 */
void psi_memstall_enter(unsigned long *flags)

/**
 * psi_memstall_leave - mark the end of an memory stall section
 * @flags: flags to handle nested memdelay sections
 *
 * Marks the calling task as no longer stalled due to lack of memory.
 */
void psi_memstall_leave(unsigned long *flags)

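The cscope search results are shown below. Besides direct reclaim, memory stall sections also include time spent waiting for disk IO, and even time spent waiting to read swap pages. You therefore need to know which kind of memory stall section is the main contributor on your platform before deciding on the next step.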

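Before doing any performance tuning, gather enough evidence first. Otherwise you may work for a long time only to discover that you went in completely the wrong direction. So before taking any action, ask yourself one question: is the adjustment I am about to make an inference based on facts, or a blind guess?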

cscope for psi_memstall_enter

Cscope tag: psi_memstall_enter
   #   line  filename / context / line
   1   1698  block/blk-cgroup.c <<blkcg_maybe_throttle_blkg>>
             psi_memstall_enter(&pflags);
   2   1088  block/blk-core.c <<submit_bio>>
             psi_memstall_enter(&pflags);
   3   3020  mm/compaction.c <<kcompactd>>
             psi_memstall_enter(&pflags);
   4   1294  mm/filemap.c <<wait_on_page_bit_common>>
             psi_memstall_enter(&pflags);
   5   2365  mm/memcontrol.c <<reclaim_high>>
             psi_memstall_enter(&pflags);
   6   2596  mm/memcontrol.c <<mem_cgroup_handle_over_high>>
             psi_memstall_enter(&pflags);
   7   2665  mm/memcontrol.c <<try_charge_memcg>>
             psi_memstall_enter(&pflags);
   8   4674  mm/page_alloc.c <<__alloc_pages_direct_compact>>
             psi_memstall_enter(&pflags);
   9   4957  mm/page_alloc.c <<__alloc_pages_direct_reclaim>>
             psi_memstall_enter(&pflags);
  10    325  mm/page_io.c <<swap_readpage>>
             psi_memstall_enter(&pflags);
  11   7045  mm/vmscan.c <<balance_pgdat>>
             psi_memstall_enter(&pflags);
  12   7676  mm/vmscan.c <<__node_reclaim>>
             psi_memstall_enter(&pflags);

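Developers with some experience may at this point reach directly for BCC (BPF Compiler Collection) to observe dynamically which call traces enter a memory stall section. This is considerably more efficient than grepping the source, because it shows which memory stall sections your platform actually enters. You can even write your own bpftrace script to see directly which section is the main contributor.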

bcc stackcount psi_memstall_enter

(bcc)root@localhost:/# stackcount psi_memstall_enter
Tracing 1 functions for "psi_memstall_enter"... Hit Ctrl-C to end.

  psi_memstall_enter
  wait_on_page_bit_common
  filemap_fault
  __do_fault
  do_handle_mm_fault
  do_page_fault
  do_DataAbort
  __dabt_usr
    122
    
  psi_memstall_enter
  swap_readpage
  swapin_readahead
  do_swap_page
  do_handle_mm_fault
  do_page_fault
  do_DataAbort
  __dabt_usr
  [unknown]
    246

PSI State for Memory

With memory stall sections understood, the next step is PSI's trigger mechanism. PSI has two indicators, SOME and FULL.

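  • SOME - counted whenever at least one task on the run queue is inside a memory stall section; see the PSI_MEM_SOME case of test_state in the code below
  • FULL - in addition to a task being inside a memory stall section, the run queue holds no task other than reclaimers (such as kswapd); see the PSI_MEM_FULL case of test_state, which checks whether any non-reclaimer task is on the run queue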

kernel/sched/psi.c

* The time in which a task can execute on a CPU is our baseline for
* productivity. Pressure expresses the amount of time in which this
* potential cannot be realized due to resource contention.
*
* This concept of productivity has two components: the workload and
* the CPU. To measure the impact of pressure on both, we define two
* contention states for a resource: SOME and FULL.

*  SOME = nr_delayed_tasks != 0
*  FULL = nr_delayed_tasks != 0 && nr_productive_tasks == 0

* For each runqueue, we track:
*
*     tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0)
*     tFULL[cpu] = time(nr_delayed_tasks[cpu] && !nr_productive_tasks[cpu])
*  tNONIDLE[cpu] = time(nr_nonidle_tasks[cpu] != 0)
*
* and then periodically aggregate:
*
*  tNONIDLE = sum(tNONIDLE[i])
*
*     tSOME = sum(tSOME[i] * tNONIDLE[i]) / tNONIDLE
*     tFULL = sum(tFULL[i] * tNONIDLE[i]) / tNONIDLE
*
*     %SOME = tSOME / period
*     %FULL = tFULL / period
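The aggregation formulas in this comment can be worked through in user space. The following is a minimal sketch, not kernel code: the struct and function names are hypothetical, and the per-CPU times are assumed to be given in one common unit.

```c
#include <assert.h>
#include <stddef.h>

/* Per-CPU times for one sampling period, all in the same unit (hypothetical type). */
struct cpu_sample {
    double t_some;    /* time with at least one delayed task */
    double t_full;    /* time with delayed tasks and no productive task */
    double t_nonidle; /* time with at least one non-idle task */
};

/*
 * tSOME (or tFULL when full != 0): each CPU's stall time is weighted by
 * its non-idle time and divided by the summed non-idle time.
 * Returns 0 when every CPU was idle.
 */
double aggregate_stall(const struct cpu_sample *cpus, size_t n, int full)
{
    double nonidle_total = 0.0;
    double weighted = 0.0;

    for (size_t i = 0; i < n; i++) {
        double t = full ? cpus[i].t_full : cpus[i].t_some;
        weighted += t * cpus[i].t_nonidle;
        nonidle_total += cpus[i].t_nonidle;
    }
    return nonidle_total > 0.0 ? weighted / nonidle_total : 0.0;
}

/* %SOME or %FULL for the period, expressed as a percentage. */
double pressure_pct(const struct cpu_sample *cpus, size_t n, int full, double period)
{
    assert(period > 0.0);
    return 100.0 * aggregate_stall(cpus, n, full) / period;
}
```

For example, with two CPUs that are both non-idle for a whole 1000 ms period and spend 300 ms and 100 ms in the SOME state, tSOME is (300 * 1000 + 100 * 1000) / 2000 = 200 ms, so %SOME is 20%.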

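Kernel documentation is sometimes slow to be updated, while the source code often carries very detailed explanatory comments. When reading the documentation, it is worth checking the source code as well for a more detailed explanation. One sometimes wonders why the maintainers do not update the documentation at the same time.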

kernel/sched/psi.c

static bool test_state(unsigned int *tasks, enum psi_states state)
{
	switch (state) {
	case PSI_IO_SOME:
		return unlikely(tasks[NR_IOWAIT]);
	case PSI_IO_FULL:
		return unlikely(tasks[NR_IOWAIT] && !tasks[NR_RUNNING]);
	case PSI_MEM_SOME:
		return unlikely(tasks[NR_MEMSTALL]);
	case PSI_MEM_FULL:
		return unlikely(tasks[NR_MEMSTALL] &&
			tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);
	case PSI_CPU_SOME:
		return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
	case PSI_CPU_FULL:
		return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
	case PSI_NONIDLE:
		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
			tasks[NR_RUNNING];
	default:
		return false;
	}
}

User Space Interface

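Next, let's look at how to use the interface that PSI exposes to user space, which the PSI documentation already describes in detail. A user can monitor PSI with poll(); whenever the trigger condition is met, an event is delivered to user space. The condition is user-defined: you choose SOME or FULL, and how much stall time within how long a time window triggers the event.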

To register a trigger user has to open psi interface file under /proc/pressure/ representing the resource to be monitored and write the desired threshold and time window. The open file descriptor should be used to wait for trigger events using select(), poll() or epoll().

<some|full> <stall amount in us> <time window in us>

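Now back to how Android 14 registers with the PSI interface. The snprintf call in init_psi_monitor builds the trigger from stall_type_name, threshold_us, and window_us, all of which are determined by lmkd parameters covered later in this article. In register_psi_monitor you can also see that lmkd does use epoll() to monitor whether the configured condition has been triggered; when it is, the corresponding event handler runs.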

system/memory/lmkd/libpsi/psi.cpp

int init_psi_monitor(enum psi_stall_type stall_type, int threshold_us, int window_us,
                     enum psi_resource resource) {
    if (resource < PSI_MEMORY || resource >= PSI_RESOURCE_COUNT) {
        ALOGE("Invalid psi resource type: %d", resource);
        errno = EINVAL;
        return -1;
    }

    int fd;
    int res;
    char buf[256];

    fd = TEMP_FAILURE_RETRY(open(psi_resource_file[resource], O_WRONLY | O_CLOEXEC));
    if (fd < 0) {
        ALOGE("No kernel psi monitor support (errno=%d)", errno);
        return -1;
    }

    switch (stall_type) {
    case (PSI_SOME):
    case (PSI_FULL):
        res = snprintf(buf, sizeof(buf), "%s %d %d",
            stall_type_name[stall_type], threshold_us, window_us);
        break;
    default:
        ALOGE("Invalid psi stall type: %d", stall_type);
        errno = EINVAL;
        goto err;
    }

    if (res >= (ssize_t)sizeof(buf)) {
        ALOGE("%s line overflow for psi stall type '%s'", psi_resource_file[resource],
              stall_type_name[stall_type]);
        errno = EINVAL;
        goto err;
    }

    res = TEMP_FAILURE_RETRY(write(fd, buf, strlen(buf) + 1));
    if (res < 0) {
        ALOGE("%s write failed for psi stall type '%s'; errno=%d", psi_resource_file[resource],
              stall_type_name[stall_type], errno);
        goto err;
    }

    return fd;

err:
    close(fd);
    return -1;
}

int register_psi_monitor(int epollfd, int fd, void* data) {
    int res;
    struct epoll_event epev;

    epev.events = EPOLLPRI;
    epev.data.ptr = data;
    res = epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &epev);
    if (res < 0) {
        ALOGE("epoll_ctl for psi monitor failed; errno=%d", errno);
    }
    return res;
}
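For experimenting outside lmkd, the same trigger format can be used with plain poll(). The following is a minimal sketch along the lines of the PSI documentation, not lmkd's implementation. The three function names are hypothetical, and it assumes the kernel has CONFIG_PSI enabled and that the caller is allowed to write to /proc/pressure/memory.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Build "<some|full> <stall us> <window us>". Returns 0, or -1 if it does not fit. */
int format_psi_trigger(char *buf, size_t len, int full,
                       unsigned stall_us, unsigned window_us)
{
    assert(buf != NULL && len > 0);
    int n = snprintf(buf, len, "%s %u %u", full ? "full" : "some",
                     stall_us, window_us);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Register the trigger on the memory pressure file. Returns the fd, or -1. */
int open_memory_trigger(const char *trigger)
{
    int fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
    if (fd < 0)
        return -1;
    /* The terminating NUL is written too, as lmkd does above. */
    if (write(fd, trigger, strlen(trigger) + 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Block until one event. Returns 1 on a trigger, 0 if the source is gone, -1 on error. */
int wait_for_trigger(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLPRI };

    if (poll(&pfd, 1, -1) < 0)
        return -1;
    if (pfd.revents & POLLERR)
        return 0;
    return (pfd.revents & POLLPRI) ? 1 : -1;
}
```

For example, format_psi_trigger(buf, sizeof(buf), 0, 150000, 1000000) produces "some 150000 1000000", which asks for an event when tasks are partially stalled for 150 ms within a 1 s window.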

System Memory Information

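lmkd reads the system memory information that the kernel exposes to learn the current memory state of the system. From it, lmkd computes the current memory watermark level, how severe thrashing is, whether zram swap usage is close to its limit, and the oom_score_adj of each task. Knowing these inputs first makes the calculations in lmkd's code flow much easier to follow.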

/proc/meminfo

This file reports statistics about memory usage on the system.

This file gives a system-wide view of memory usage, from which the current watermark level can be computed. Understanding the Linux Virtual Memory Manager contains a classic diagram of the watermarks.

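  • When the level drops below Low, kswapd is woken to reclaim pages, and it keeps reclaiming until the level is back at High
  • If kswapd reclaims pages more slowly than pages are allocated, the level keeps falling; once it reaches Min, direct reclaim is triggered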
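The two rules above can be written down as a small model. This is a simplified sketch for illustration only: the type and function names are hypothetical, the watermarks are assumed to be given in pages for a single zone, and the real kernel logic has many more inputs.

```c
#include <assert.h>

/* Lowest watermark that free memory has fallen below. */
enum wmark_level { WMARK_MIN, WMARK_LOW, WMARK_HIGH, WMARK_NONE };

/* Zone watermarks in pages (hypothetical container type). */
struct watermarks {
    long min;
    long low;
    long high;
};

enum wmark_level lowest_breached(long free_pages, const struct watermarks *w)
{
    assert(w->min <= w->low && w->low <= w->high);
    if (free_pages < w->min)
        return WMARK_MIN;
    if (free_pages < w->low)
        return WMARK_LOW;
    if (free_pages < w->high)
        return WMARK_HIGH;
    return WMARK_NONE;
}

/* kswapd starts below Low and, once running, continues until High is reached. */
int kswapd_should_run(long free_pages, const struct watermarks *w, int already_running)
{
    if (free_pages < w->low)
        return 1;
    return already_running && free_pages < w->high;
}

/* Allocating tasks reclaim on their own once free pages fall below Min. */
int needs_direct_reclaim(long free_pages, const struct watermarks *w)
{
    return free_pages < w->min;
}
```

The already_running argument captures the hysteresis in the first rule: between Low and High, kswapd keeps going only if it was already awake.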

Figure 2.2. Zone Watermarks


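Understanding the Linux Virtual Memory Manager is a classic book on Linux kernel memory management. Although it was written for Linux 2.6, many of its fundamental concepts still apply as of Linux 6.13. It is highly recommended for developers who want to understand Linux kernel memory.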

/proc/vmstat

This file displays various virtual memory statistics.

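This file holds statistics on how all pages are currently being used, counted in units of pages. The most important field for lmkd is workingset_refault_file, from which it computes thrashing. For how the kernel defines thrashing, see the explanation in the source code; I have not studied it in detail myself, so I will not risk misleading readers here.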

mm/workingset.c

 *      Double CLOCK lists
 *
 * Per node, two clock lists are maintained for file pages: the
 * inactive and the active list.  Freshly faulted pages start out at
 * the head of the inactive list and page reclaim scans pages from the
 * tail.  Pages that are accessed multiple times on the inactive list
 * are promoted to the active list, to protect them from reclaim,
 * whereas active pages are demoted to the inactive list when the
 * active list grows too big.
 *
 *   fault ------------------------+
 *                                 |
 *              +--------------+   |            +-------------+
 *   reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
 *              +--------------+                +-------------+    |
 *                     |                                           |
 *                     +-------------- promotion ------------------+
 *
 *
 *      Access frequency and refault distance
 *
 * A workload is thrashing when its pages are frequently used but they
 * are evicted from the inactive list every time before another access
 * would have promoted them to the active list.
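To make the link between workingset_refault_file and a thrashing figure concrete, here is a minimal sketch. It assumes thrashing is expressed as the growth of workingset_refault_file since a baseline, as a percentage of the file-backed LRU size at that baseline. That ratio is my reading of lmkd and should be treated as an assumption; the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up a counter in text formatted like /proc/vmstat
 * (one "name value" pair per line). Returns 0 on success, -1 if missing.
 */
int vmstat_lookup(const char *text, const char *key, long long *out)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* Require a space after the key so "nr_active" does not match "nr_active_file". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ' ')
            return sscanf(line + klen, "%lld", out) == 1 ? 0 : -1;
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/*
 * Refaults since the baseline as a percentage of the baseline file LRU size.
 * The +1 in the divisor avoids a division by zero (assumed formula).
 */
long long thrashing_pct(long long refault_now, long long refault_base,
                        long long base_file_lru)
{
    assert(refault_now >= refault_base && base_file_lru >= 0);
    return (refault_now - refault_base) * 100 / (base_file_lru + 1);
}
```

With a baseline of 1000 refaults and 999 file LRU pages, a later reading of 1500 refaults gives (1500 - 1000) * 100 / 1000 = 50, which would then be compared against a limit such as thrashing_limit.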

/proc/$(pid)/oom_score_adj

This file can be used to adjust the badness heuristic.

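When lmkd decides to kill a task, it starts with the tasks that have the highest oom_score_adj. AOSP explains clearly what each oom_score_adj value means. For example, 900 and above means the process can be killed freely without any impact on the system, whereas 0 means the task the user is currently using, and killing it would affect the user severely.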

services/core/java/com/android/server/am/ProcessList.java

    // This is a process only hosting activities that are not visible,
    // so it can be killed without any disruption.
    public static final int CACHED_APP_MAX_ADJ = 999;
    public static final int CACHED_APP_MIN_ADJ = 900;
    // This is the oom_adj level that we allow to die first. This cannot be equal to
    // CACHED_APP_MAX_ADJ unless processes are actively being assigned an oom_score_adj of
    // CACHED_APP_MAX_ADJ.
    public static final int CACHED_APP_LMK_FIRST_ADJ = 950;
    // Number of levels we have available for different service connection group importance
    // levels.
    public static final int CACHED_APP_IMPORTANCE_LEVELS = 5;
    // The B list of SERVICE_ADJ -- these are the old and decrepit
    // services that aren't as shiny and interesting as the ones in the A list.
    public static final int SERVICE_B_ADJ = 800;
    // This is the process of the previous application that the user was in.
    // This process is kept above other things, because it is very common to
    // switch back to the previous app.  This is important both for recent
    // task switch (toggling between the two top recent apps) as well as normal
    // UI flow such as clicking on a URI in the e-mail app to view in the browser,
    // and then pressing back to return to e-mail.
    public static final int PREVIOUS_APP_ADJ = 700;
    // This is a process holding the home application -- we want to try
    // avoiding killing it, even if it would normally be in the background,
    // because the user interacts with it so much.
    public static final int HOME_APP_ADJ = 600;
    // This is a process holding an application service -- killing it will not
    // have much of an impact as far as the user is concerned.
    public static final int SERVICE_ADJ = 500;
    // This is a process with a heavy-weight application.  It is in the
    // background, but we want to try to avoid killing it.  Value set in
    // system/rootdir/init.rc on startup.
    public static final int HEAVY_WEIGHT_APP_ADJ = 400;
    // This is a process currently hosting a backup operation.  Killing it
    // is not entirely fatal but is generally a bad idea.
    public static final int BACKUP_APP_ADJ = 300;
    // This is a process bound by the system (or other app) that's more important than services but
    // not so perceptible that it affects the user immediately if killed.
    public static final int PERCEPTIBLE_LOW_APP_ADJ = 250;
    // This is a process hosting services that are not perceptible to the user but the
    // client (system) binding to it requested to treat it as if it is perceptible and avoid killing
    // it if possible.
    public static final int PERCEPTIBLE_MEDIUM_APP_ADJ = 225;
    // This is a process only hosting components that are perceptible to the
    // user, and we really want to avoid killing them, but they are not
    // immediately visible. An example is background music playback.
    public static final int PERCEPTIBLE_APP_ADJ = 200;
    // This is a process only hosting activities that are visible to the
    // user, so we'd prefer they don't disappear.
    public static final int VISIBLE_APP_ADJ = 100;
    static final int VISIBLE_APP_LAYER_MAX = PERCEPTIBLE_APP_ADJ - VISIBLE_APP_ADJ - 1;
    // This is a process that was recently TOP and moved to FGS. Continue to treat it almost
    // like a foreground app for a while.
    // @see TOP_TO_FGS_GRACE_PERIOD
    public static final int PERCEPTIBLE_RECENT_FOREGROUND_APP_ADJ = 50;
    // This is the process running the current foreground app.  We'd really
    // rather not kill it!
    public static final int FOREGROUND_APP_ADJ = 0;
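When reading logs or dumps, it helps to map a raw oom_score_adj back to the constant it belongs to. The sketch below copies the values from the ProcessList.java excerpt above. The helper names are hypothetical, and treating every value between two constants as belonging to the lower one is a simplifying assumption, as is the label used for negative values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct adj_band {
    int min_adj;
    const char *name;
};

/* Ordered from highest to lowest; values taken from the excerpt above. */
static const struct adj_band bands[] = {
    { 900, "CACHED_APP" },
    { 800, "SERVICE_B" },
    { 700, "PREVIOUS_APP" },
    { 600, "HOME_APP" },
    { 500, "SERVICE" },
    { 400, "HEAVY_WEIGHT_APP" },
    { 300, "BACKUP_APP" },
    { 250, "PERCEPTIBLE_LOW_APP" },
    { 225, "PERCEPTIBLE_MEDIUM_APP" },
    { 200, "PERCEPTIBLE_APP" },
    { 100, "VISIBLE_APP" },
    { 50, "PERCEPTIBLE_RECENT_FOREGROUND_APP" },
    { 0, "FOREGROUND_APP" },
};

/* Name of the highest band whose threshold is at or below adj. */
const char *adj_band_name(int adj)
{
    for (size_t i = 0; i < sizeof(bands) / sizeof(bands[0]); i++) {
        if (adj >= bands[i].min_adj)
            return bands[i].name;
    }
    return "BELOW_FOREGROUND"; /* negative values are not covered by the excerpt */
}

/* Read /proc/<pid>/oom_score_adj. Returns 0 on success, -1 on failure. */
int read_oom_score_adj(int pid, int *adj)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", pid);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = fscanf(f, "%d", adj) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

Since lmkd kills from the highest oom_score_adj downwards, tasks in the bands at the top of this table are the first candidates.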

Low Memory Killer Daemon

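The most important function in lmkd is mp_event_psi, the PSI event handler. It computes the current memory state of the system, checks whether a kill condition is met, and finally decides which app process to kill. Once you know this function well, you have grasped the core of lmkd.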

Code Flow







lmkdFlow (flowchart)

  • PSI Event Trigger -> PSI Event Handler
  • PSI Event Handler -> Kill Condition is True?
  • Kill Condition is True? Y -> Find a Task, N -> Finished
  • Find a Task? Y -> Kill the Task, N -> Finished
  • Kill the Task -> Finished





Parameters

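  • kill_heaviest_task
    Whether, among tasks with the same oom_score_adj, the task using the most memory is killed first
  • kill_timeout_ms
    When a PSI event arrives, lmkd first checks whether the previous kill has finished. If it is still in progress and has been running for less than kill_timeout_ms, the event is ignored
  • swap_free_low_percentage
    When the proportion of free swap falls below swap_free_low_percentage, the state called swap_is_low is entered. For example, with a total swap size of 500 MB and swap_free_low_percentage set to 10%, swap_is_low is entered as soon as free swap drops below 50 MB
  • psi_partial_stall_ms
    The stall threshold in ms for the PSI SOME trigger; the window is 1000 ms
  • psi_complete_stall_ms
    The stall threshold in ms for the PSI FULL trigger; the window is 1000 ms
  • thrashing_limit
    When the thrashing percentage is higher than this parameter, lmkd's thrashing condition is met
  • lowmem_min_oom_score
    When the memory watermark is below High, lmkd kills tasks whose oom_score_adj is greater than lowmem_min_oom_score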
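Two of these parameters reduce to simple arithmetic. The following sketch illustrates the swap_is_low check and the kill_timeout_ms check as described in the list above. The function names and the choice of kB and ms as units are assumptions for illustration, not lmkd's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* swap_is_low: free swap is below swap_free_low_percentage of total swap. */
int swap_is_low(int64_t free_swap_kb, int64_t total_swap_kb,
                int swap_free_low_percentage)
{
    assert(swap_free_low_percentage >= 0 && swap_free_low_percentage <= 100);
    int64_t threshold_kb = total_swap_kb * swap_free_low_percentage / 100;
    return free_swap_kb < threshold_kb;
}

/*
 * A PSI event is ignored while the previous kill is still in progress
 * and has been running for less than kill_timeout_ms.
 */
int should_skip_event(int kill_pending, int64_t ms_since_kill_started,
                      int64_t kill_timeout_ms)
{
    return kill_pending && ms_since_kill_started < kill_timeout_ms;
}
```

With 500 MB of total swap and swap_free_low_percentage set to 10, the threshold is 50 MB, so swap_is_low returns true at 49 MB free and false at 50 MB free.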

Kill Condition

  • min watermark is breached even after kill
    An app has already been killed, yet the memory watermark is still below Low
  • device is not responding
    This event was triggered by PSI FULL
  • device is low on swap and thrashing
    The device is in swap_is_low and thrashing is higher than the configured thrashing_limit
  • watermark is breached and swap is low
    The memory watermark is below High and swap_is_low holds
  • watermark is breached and thrashing
    The memory watermark is below High and thrashing is higher than the configured thrashing_limit
  • device is in direct reclaim and thrashing
    Direct reclaim is happening and thrashing is higher than the configured thrashing_limit
  • watermark is breached
    The memory watermark is below High and some task has an oom_score_adj higher than lowmem_min_oom_score
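The conditions above can be read as a chain of checks. The sketch below encodes them in the order listed, assuming the first matching condition wins. The struct, its field names, and the enum ordering (a lower value meaning a lower watermark has been breached) are hypothetical and chosen for illustration, so this is not lmkd's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Lowest watermark that free memory is currently below (assumed ordering). */
enum wmark_level { WMARK_MIN, WMARK_LOW, WMARK_HIGH, WMARK_NONE };

struct mem_state {
    int cycle_after_kill;   /* a kill already happened in this pressure cycle */
    int psi_full_event;     /* this event was raised by the PSI FULL trigger */
    int swap_is_low;        /* free swap is under swap_free_low_percentage */
    int in_direct_reclaim;  /* direct reclaim was observed */
    int thrashing_pct;      /* current thrashing percentage */
    int has_killable_task;  /* a task has oom_score_adj above lowmem_min_oom_score */
    enum wmark_level wmark;
};

/* Returns the matching reason from the list above, or NULL when no kill is needed. */
const char *kill_reason(const struct mem_state *s, int thrashing_limit)
{
    assert(s != NULL);
    int thrashing = s->thrashing_pct > thrashing_limit;

    if (s->cycle_after_kill && s->wmark < WMARK_LOW)
        return "min watermark is breached even after kill";
    if (s->psi_full_event)
        return "device is not responding";
    if (s->swap_is_low && thrashing)
        return "device is low on swap and thrashing";
    if (s->wmark < WMARK_HIGH && s->swap_is_low)
        return "watermark is breached and swap is low";
    if (s->wmark < WMARK_HIGH && thrashing)
        return "watermark is breached and thrashing";
    if (s->in_direct_reclaim && thrashing)
        return "device is in direct reclaim and thrashing";
    if (s->wmark < WMARK_HIGH && s->has_killable_task)
        return "watermark is breached";
    return NULL;
}
```

Returning the reason string rather than a boolean mirrors how the condition names appear in the list, which makes it easy to match a log line to the check that produced it.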

Performance Tuning Strategy

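As the sections above show, tuning lmkd is not only a matter of lmkd's own parameters. Linux kernel virtual memory settings such as swappiness, watermark_scale_factor, and min_free_kbytes are involved as well, not to mention the many Android-side settings. Before changing any parameter, understand how these parameters interact, observe what bottleneck the platform is actually hitting, and collect data to find the key factor; only then start adjusting. Never dive straight into tweaking parameters, because that rarely leads to a good result.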
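A first step in collecting that data is recording the current kernel settings before touching them. The following is a minimal sketch that reads the three VM knobs named above from /proc/sys/vm. The helper names are hypothetical, and it assumes the files are readable by the calling process.

```c
#include <assert.h>
#include <stdio.h>

/* Read a single integer from a file such as /proc/sys/vm/swappiness. Returns 0 on success. */
int read_long_file(const char *path, long *out)
{
    assert(path != NULL && out != NULL);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = fscanf(f, "%ld", out) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Print the current value of each knob; knobs that cannot be read are skipped. */
void dump_vm_knobs(void)
{
    static const char *const knobs[] = {
        "/proc/sys/vm/swappiness",
        "/proc/sys/vm/watermark_scale_factor",
        "/proc/sys/vm/min_free_kbytes",
    };

    for (size_t i = 0; i < sizeof(knobs) / sizeof(knobs[0]); i++) {
        long value;
        if (read_long_file(knobs[i], &value) == 0)
            printf("%s = %ld\n", knobs[i], value);
    }
}
```

Saving this output alongside each measurement makes it possible to tell later which change actually moved the numbers.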

If you run into related problems, feel free to reach out on LinkedIn to discuss them.