Linux 4.9.140 Memory Management
===
# Concepts
Reference: [Concepts overview](https://www.kernel.org/doc/html/v4.18/admin-guide/mm/concepts.html)
## Huge Pages
> Usage of huge pages significantly reduces pressure on TLB, improves TLB hit-rate and thus improves overall system performance.
>
> There are two mechanisms in Linux that enable mapping of the physical memory with the huge pages. The first one is HugeTLB filesystem, or hugetlbfs.
> Another, more recent, mechanism that enables use of the huge pages is called Transparent HugePages, or THP.
>
Huge pages raise the TLB hit rate and thus improve overall system performance.
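Both mechanisms can be exercised from user space. Below is a minimal sketch of my own (not from the kernel tree), assuming 2 MiB huge pages and that huge pages have been reserved (e.g. via `/proc/sys/vm/nr_hugepages`) for the `MAP_HUGETLB` path:
```c=
/* Hedged sketch: explicit hugetlbfs-backed mapping via MAP_HUGETLB, and a
 * THP hint via madvise(MADV_HUGEPAGE).  The 2 MiB size is an assumption
 * (common x86-64 huge page size); MAP_HUGETLB fails at runtime unless huge
 * pages have been reserved beforehand. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL * 1024 * 1024;	/* assume 2 MiB huge pages */

	/* hugetlbfs path: backed by pre-reserved huge pages */
	void *huge = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (huge == MAP_FAILED)
		perror("mmap(MAP_HUGETLB)");	/* likely: no huge pages reserved */
	else
		munmap(huge, len);

	/* THP path: a normal anonymous mapping plus a hint that the kernel
	 * may back it with transparent huge pages */
	void *thp = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (thp == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (madvise(thp, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");
	memset(thp, 0, len);	/* touching the range lets THP actually kick in */
	munmap(thp, len);
	return 0;
}
```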
## Zones
> Linux groups memory pages into zones according to their possible usage.
>
> The actual layout of the memory zones is hardware dependent as not all architectures define all zones, and requirements for DMA are different for different platforms.
>
- ZONE_DMA
- will contain memory that can be used by devices for DMA
- ZONE_HIGHMEM
- will contain memory that is not permanently mapped into kernel’s address space
- ZONE_NORMAL
- will contain normally addressed pages
## Nodes
> Many multi-processor machines are NUMA - Non-Uniform Memory Access - systems.
>
Each node has its own zones, lists of free and used pages, and various statistics counters.
## Page cache
When a file is read from disk, the data is first placed in the page cache; later reads of the same file are served directly from the page cache instead of issuing another, much slower, request to the disk.
Pages in the page cache that have been written to are marked dirty and are synchronized with the file on disk when necessary.
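A tiny user-space sketch of that behaviour (my own illustration; the file path is an arbitrary choice): `write()` only dirties pages in the page cache, and `fsync()` forces them to be written back to disk:
```c=
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/tmp/pagecache-demo", O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	const char buf[] = "hello page cache\n";
	/* The data lands in the page cache; those pages are now dirty. */
	if (write(fd, buf, sizeof(buf) - 1) < 0)
		perror("write");

	/* Force writeback: the dirty pages are synchronized with the disk,
	 * which the kernel would otherwise do asynchronously later. */
	if (fsync(fd))
		perror("fsync");

	close(fd);
	return 0;
}
```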
## Anonymous Memory
Memory that is not mapped to any file in a filesystem is called anonymous memory.
A process's stack and heap are anonymous memory.
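A small sketch of my own that makes this visible: create an anonymous mapping and dump `/proc/self/maps`; the new region has no file name in the last column, and the `[heap]`/`[stack]` entries are likewise not backed by any filesystem file:
```c=
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4096 * 16;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("anonymous mapping at %p\n", p);

	/* Print the process memory map; the mapping above has an empty
	 * pathname column, unlike file-backed mappings. */
	FILE *maps = fopen("/proc/self/maps", "r");
	if (maps) {
		char line[256];
		while (fgets(line, sizeof(line), maps))
			fputs(line, stdout);
		fclose(maps);
	}
	munmap(p, len);
	return 0;
}
```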
## Reclaim
Pages that can be freed are called reclaimable; page cache and anonymous memory are both reclaimable.
Pages holding kernel data, or pages used as DMA (Direct Memory Access) buffers, usually cannot be freed; such pages are unreclaimable.
> However, in certain circumstances, even pages occupied with kernel data structures can be reclaimed. For instance, in-memory caches of filesystem metadata can be re-read from the storage device and therefore it is possible to discard them from the main memory when system is under memory pressure.
>
The process of freeing reclaimable pages and repurposing them is called reclaim in Linux.
As system load increases, the number of free pages drops; when it falls to a certain threshold (called the high watermark), an allocation request wakes up the kswapd daemon, which either simply frees pages whose data is available elsewhere, or evicts dirty pages to the backing storage device.
If the load increases even further and the number of free pages drops to another threshold (called the min watermark), an allocation triggers direct reclaim, and the allocation stalls until enough memory has been reclaimed.
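The per-zone watermarks mentioned here can be inspected from user space. A minimal sketch of my own that filters the `min`/`low`/`high` lines out of `/proc/zoneinfo` (the values are in pages):
```c=
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/zoneinfo", "r");
	if (!f) {
		perror("fopen");
		return 1;
	}
	char line[256], key[64];
	while (fgets(line, sizeof(line), f)) {
		/* Keep the "Node N, zone XXX" headers ... */
		if (strncmp(line, "Node", 4) == 0) {
			fputs(line, stdout);
			continue;
		}
		/* ... and the three watermark lines of each zone. */
		if (sscanf(line, " %63s", key) == 1 &&
		    (!strcmp(key, "min") || !strcmp(key, "low") ||
		     !strcmp(key, "high")))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}
```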
## Compaction
Allocating and freeing memory over time leaves it fragmented. Virtual memory still lets user space see a contiguous range, but sometimes contiguous physical memory is genuinely needed (for example when a device requests a large buffer for DMA, or when THP needs a huge page).
The kcompactd daemon performs this compaction.
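Besides kcompactd, compaction can also be triggered manually: writing `1` to `/proc/sys/vm/compact_memory` (requires `CONFIG_COMPACTION` and root) compacts all zones. A small sketch of that knob:
```c=
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/proc/sys/vm/compact_memory", O_WRONLY);
	if (fd < 0) {
		perror("open /proc/sys/vm/compact_memory");
		return 1;
	}
	/* Writing 1 asks the kernel to compact all zones on all nodes. */
	if (write(fd, "1", 1) != 1)
		perror("write");
	close(fd);
	return 0;
}
```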
## OOM Killer
OOM stands for Out of Memory.
When the system truly runs out of memory, the kernel triggers the OOM killer.
The OOM killer's job is simple: kill one task so that enough memory becomes available again.
Poor task, QQ.
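Victim selection can be biased from user space through `/proc/<pid>/oom_score_adj` (range -1000 to 1000; -1000 effectively exempts a task). A minimal sketch of my own that makes the current process a preferred victim:
```c=
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/proc/self/oom_score_adj", O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* 1000 = "kill me first"; -1000 would mean "never kill me". */
	if (write(fd, "1000", 4) != 4)
		perror("write");
	close(fd);
	return 0;
}
```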
# __alloc_pages_nodemask
[__alloc_pages_nodemask](https://elixir.bootlin.com/linux/v4.9.140/source/mm/page_alloc.c#L3770)
```c=3766
/*
* This is the 'heart' of the zoned buddy allocator.
*/
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, nodemask_t *nodemask)
{
struct page *page;
unsigned int alloc_flags = ALLOC_WMARK_LOW;
gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for allocation */
/* nodemask, migratetype and high_zoneidx are initialized here and never changed afterwards */
struct alloc_context ac = {
.high_zoneidx = gfp_zone(gfp_mask),
.zonelist = zonelist,
.nodemask = nodemask,
.migratetype = gfpflags_to_migratetype(gfp_mask),
};
/*
* Without CONFIG_CPUSETS, cpusets_enabled() returns false;
* with it, it returns static_branch_unlikely(&cpusets_enabled_key)
*/
if (cpusets_enabled()) {
alloc_mask |= __GFP_HARDWALL;
alloc_flags |= ALLOC_CPUSET;
if (!ac.nodemask)
ac.nodemask = &cpuset_current_mems_allowed;
}
/* Filter out flags that are not in gfp_allowed_mask */
gfp_mask &= gfp_allowed_mask;
lockdep_trace_alloc(gfp_mask);
/* May sleep if __GFP_DIRECT_RECLAIM is set in gfp_mask */
might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
/*
* Without CONFIG_FAIL_PAGE_ALLOC this if is never taken.
* With it, the if is still skipped when
*   order < fail_page_alloc.min_order, or
*   gfp_mask has __GFP_NOFAIL set, or
*   fail_page_alloc.ignore_gfp_highmem is true and gfp_mask has __GFP_HIGHMEM, or
*   fail_page_alloc.ignore_gfp_reclaim is true and gfp_mask has __GFP_DIRECT_RECLAIM.
* Otherwise it returns should_fail(&fail_page_alloc.attr, 1 << order);
*/
if (should_fail_alloc_page(gfp_mask, order))
return NULL;
/*
* Check the zones suitable for the gfp_mask contain at least one
* valid zone. It's possible to have an empty zonelist as a result
* of __GFP_THISNODE and a memoryless node
* i.e. make sure the zonelist contains at least one valid zone
*/
if (unlikely(!zonelist->_zonerefs->zone))
return NULL;
/*
* The if is taken when CONFIG_CMA is set to 'y' or 'm'
* and ac.migratetype is MIGRATE_MOVABLE
*/
if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;
/*
* Dirty zone balancing only done in the fast path
* ac.spread_dirty_pages records whether __GFP_WRITE is set in gfp_mask
*/
ac.spread_dirty_pages = (gfp_mask & __GFP_WRITE);
/*
* The preferred zone is used for statistics but crucially it is
* also used as the starting point for the zonelist iterator. It
* may get reset for allocations that ignore memory policies.
* Returns the first zone in ac.zonelist, starting from ac.high_zoneidx, that matches ac.nodemask
*/
ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
ac.high_zoneidx, ac.nodemask);
/* No such zone exists */
if (!ac.preferred_zoneref->zone) {
page = NULL;
/*
* This might be due to race with cpuset_current_mems_allowed
* update, so make sure we retry with original nodemask in the
* slow path.
*/
goto no_zone;
}
/*
* First allocation attempt
* i.e. allocate a page from the zonelist
*/
page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
if (likely(page))
goto out;
no_zone:
/*
* Runtime PM, block IO and its error handling path can deadlock
* because I/O on the device might not complete.
* In memalloc_noio_flags:
* if current->flags has PF_MEMALLOC_NOIO set, it returns gfp_mask & ~(__GFP_IO | __GFP_FS)
*/
alloc_mask = memalloc_noio_flags(gfp_mask);
/* Clear ac.spread_dirty_pages; dirty balancing is only done in the fast path */
ac.spread_dirty_pages = false;
/*
* Restore the original nodemask if it was potentially replaced with
* &cpuset_current_mems_allowed to optimize the fast-path attempt.
* i.e. restore ac.nodemask to the original nodemask
*/
if (unlikely(ac.nodemask != nodemask))
ac.nodemask = nodemask;
/* Retry with __alloc_pages_slowpath */
page = __alloc_pages_slowpath(alloc_mask, order, &ac);
out:
/*
* Without CONFIG_MEMCG, or with CONFIG_SLOB, memcg_kmem_enabled()
* always returns false; otherwise it returns
* static_branch_unlikely(&memcg_kmem_enabled_key).
* The if is taken when memcg_kmem_enabled() is non-zero,
* gfp_mask has __GFP_ACCOUNT set, page is non-NULL,
* and memcg_kmem_charge(page, gfp_mask, order) fails.
*/
if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
/*
* If the refcount drops to 0 (no one is using this page any more),
* the page is freed according to order:
* order 0 goes through free_hot_cold_page(page, false),
* otherwise __free_pages_ok(page, order) is used.
*/
__free_pages(page, order);
page = NULL;
}
/*
* Taken when kmemcheck_enabled is non-zero and page is non-NULL
*/
if (kmemcheck_enabled && page)
kmemcheck_pagealloc_alloc(page, order, gfp_mask);
trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);
return page;
}
```
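For context, a typical in-kernel caller reaches this function through `alloc_pages()` → `alloc_pages_node()` → `__alloc_pages()` → `__alloc_pages_nodemask()`. A minimal module sketch of such a caller, my own and not in the tree (the order-2 request is an arbitrary example):
```c=
#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *demo_page;

static int __init alloc_demo_init(void)
{
	demo_page = alloc_pages(GFP_KERNEL, 2);	/* 2^2 = 4 contiguous pages */
	if (!demo_page)
		return -ENOMEM;
	pr_info("alloc_demo: got pages at pfn %lu, virt %p\n",
		page_to_pfn(demo_page), page_address(demo_page));
	return 0;
}

static void __exit alloc_demo_exit(void)
{
	/* Give the pages back to the buddy allocator. */
	__free_pages(demo_page, 2);
}

module_init(alloc_demo_init);
module_exit(alloc_demo_exit);
MODULE_LICENSE("GPL");
```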
## get_page_from_freelist
[get_page_from_freelist](https://elixir.bootlin.com/linux/v4.9.140/source/mm/page_alloc.c#L2905)
```c=2900
/*
* get_page_from_freelist goes through the zonelist trying to allocate
* a page.
*/
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
const struct alloc_context *ac)
{
struct zoneref *z = ac->preferred_zoneref;
struct zone *zone;
struct pglist_data *last_pgdat_dirty_limit = NULL;
/*
* Scan zonelist, looking for a zone with enough free.
* See also __cpuset_node_allowed() comment in kernel/cpuset.c.
*/
for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
ac->nodemask) {
struct page *page;
unsigned long mark;
/*
* cpusets_enabled():
* without CONFIG_CPUSETS it returns false;
* with it, it returns static_branch_unlikely(&cpusets_enabled_key)
*
* __cpuset_zone_allowed():
* without CONFIG_CPUSETS it returns true;
* with it, it returns __cpuset_node_allowed(zone_to_nid(zone), gfp_mask)
*
* __cpuset_node_allowed() returns true, i.e. allocating from this
* memory node is allowed, when we are in interrupt context,
* or the node is set in current's mems_allowed,
* or gfp_mask does not have __GFP_HARDWALL set,
* or current may use any memory because of TIF_MEMDIE;
* otherwise it returns false
*/
if (cpusets_enabled() &&
(alloc_flags & ALLOC_CPUSET) &&
!__cpuset_zone_allowed(zone, gfp_mask))
continue;
/*
* When allocating a page cache page for writing, we
* want to get it from a node that is within its dirty
* limit, such that no single node holds more than its
* proportional share of globally allowed dirty pages.
* The dirty limits take into account the node's
* lowmem reserves and high watermark so that kswapd
* should be able to balance it without having to
* write pages from its LRU list.
*
* XXX: For now, allow allocations to potentially
* exceed the per-node dirty limit in the slowpath
* (spread_dirty_pages unset) before going into reclaim,
* which is important when on a NUMA setup the allowed
* nodes are together not big enough to reach the
* global limit. The proper fix for these situations
* will require awareness of nodes in the
* dirty-throttling and the flusher threads.
*/
if (ac->spread_dirty_pages) {
if (last_pgdat_dirty_limit == zone->zone_pgdat)
continue;
if (!node_dirty_ok(zone->zone_pgdat)) {
last_pgdat_dirty_limit = zone->zone_pgdat;
continue;
}
}
mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
if (!zone_watermark_fast(zone, order, mark,
ac_classzone_idx(ac), alloc_flags)) {
int ret;
/* Checked here to keep the fast path fast */
BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (alloc_flags & ALLOC_NO_WATERMARKS)
goto try_this_zone;
if (node_reclaim_mode == 0 ||
!zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
continue;
ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
switch (ret) {
case NODE_RECLAIM_NOSCAN:
/* did not scan */
continue;
case NODE_RECLAIM_FULL:
/* scanned but unreclaimable */
continue;
default:
/* did we reclaim enough */
if (zone_watermark_ok(zone, order, mark,
ac_classzone_idx(ac), alloc_flags))
goto try_this_zone;
continue;
}
}
try_this_zone:
/* Allocate a page from this zone */
page = buffered_rmqueue(ac->preferred_zoneref->zone, zone, order,
gfp_mask, alloc_flags, ac->migratetype);
if (page) {
prep_new_page(page, order, gfp_mask, alloc_flags);
/*
* If this is a high-order atomic allocation then check
* if the pageblock should be reserved for the future
*/
if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
reserve_highatomic_pageblock(page, zone, order);
return page;
}
}
return NULL;
}
```
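The core idea of the watermark check above can be illustrated with a heavily simplified user-space sketch of my own (the real `__zone_watermark_ok()` additionally subtracts the high-atomic reserve, lowers the mark for `ALLOC_HIGH`/`ALLOC_HARDER`, and for order > 0 verifies that a large enough free block actually exists):
```c=
#include <stdbool.h>
#include <stdio.h>

/* Simplified: after taking 2^order pages, the remaining free pages must
 * still exceed the chosen watermark plus the lowmem reserve kept back for
 * allocations that could have used a lower zone. */
static bool watermark_ok(long free_pages, unsigned int order,
			 long mark, long lowmem_reserve)
{
	free_pages -= 1L << order;
	return free_pages > mark + lowmem_reserve;
}

int main(void)
{
	/* Hypothetical zone state: 1024 free pages, watermark 256,
	 * 128 pages reserved for lower zones. */
	printf("order-0  ok: %d\n", watermark_ok(1024, 0, 256, 128));  /* 1 */
	printf("order-9  ok: %d\n", watermark_ok(1024, 9, 256, 128));  /* 1 */
	printf("order-10 ok: %d\n", watermark_ok(1024, 10, 256, 128)); /* 0 */
	return 0;
}
```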
## buffered_rmqueue
[buffered_rmqueue](https://elixir.bootlin.com/linux/v4.9.140/source/mm/page_alloc.c#L2619)
```c=2615
/*
* Allocate a page from the given zone. Use pcplists for order-0 allocations.
*/
static inline
struct page *buffered_rmqueue(struct zone *preferred_zone,
struct zone *zone, unsigned int order,
gfp_t gfp_flags, unsigned int alloc_flags,
int migratetype)
{
unsigned long flags;
struct page *page;
bool cold = ((gfp_flags & __GFP_COLD) != 0);
if (likely(order == 0)) {
struct per_cpu_pages *pcp;
struct list_head *list;
/* Save the CPU flags into the flags variable, then disable local interrupts */
local_irq_save(flags);
do {
pcp = &this_cpu_ptr(zone->pageset)->pcp;
list = &pcp->lists[migratetype];
if (list_empty(list)) {
pcp->count += rmqueue_bulk(zone, 0,
pcp->batch, list,
migratetype, cold);
if (unlikely(list_empty(list)))
goto failed;
}
if (cold)
page = list_last_entry(list, struct page, lru);
else
page = list_first_entry(list, struct page, lru);
list_del(&page->lru);
pcp->count--;
} while (check_new_pcp(page));
} else {
/*
* We most definitely don't want callers attempting to
* allocate greater than order-1 page units with __GFP_NOFAIL.
*/
WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
/* Enter the critical section; save the CPU flags first */
spin_lock_irqsave(&zone->lock, flags);
do {
page = NULL;
if (alloc_flags & ALLOC_HARDER) {
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
if (page)
trace_mm_page_alloc_zone_locked(page, order, migratetype);
}
if (!page)
/* Remove an element from the buddy allocator; zone->lock must be held before calling this */
page = __rmqueue(zone, order, migratetype);
} while (page && check_new_pages(page, order));
/* Leave the critical section */
spin_unlock(&zone->lock);
if (!page)
goto failed;
__mod_zone_freepage_state(zone, -(1 << order),
get_pcppage_migratetype(page));
}
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
zone_statistics(preferred_zone, zone, gfp_flags);
/* Restore the saved CPU flags */
local_irq_restore(flags);
VM_BUG_ON_PAGE(bad_range(zone, page), page);
return page;
failed:
local_irq_restore(flags);
return NULL;
}
```
## __rmqueue
[__rmqueue](https://elixir.bootlin.com/linux/v4.9.140/source/mm/page_alloc.c#L2204)
```c=2200
/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
static struct page *__rmqueue(struct zone *zone, unsigned int order,
int migratetype)
{
struct page *page;
/* Walk the free lists for this migratetype and remove the smallest available page */
page = __rmqueue_smallest(zone, order, migratetype);
if (unlikely(!page)) {
/* MOVABLE allocation failed: try to fall back to CMA first */
if (migratetype == MIGRATE_MOVABLE)
page = __rmqueue_cma_fallback(zone, order);
/* If that also fails, steal pages from another migrate type */
if (!page)
page = __rmqueue_fallback(zone, order, migratetype);
}
trace_mm_page_alloc_zone_locked(page, order, migratetype);
return page;
}
```
## __rmqueue_smallest
[__rmqueue_smallest](https://elixir.bootlin.com/linux/v4.9.140/source/mm/page_alloc.c#L1813)
```c=1808
/*
* Go through the free lists for the given migratetype and remove
* the smallest available page from the freelists
* Search the free_area arrays upward starting from order and take a page off the list
*/
static inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
int migratetype)
{
unsigned int current_order;
struct free_area *area;
struct page *page;
/* Find a page of the appropriate size in the preferred list */
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
area = &(zone->free_area[current_order]);
page = list_first_entry_or_null(&area->free_list[migratetype],
struct page, lru);
/* Nothing at this order; try a larger block */
if (!page)
continue;
list_del(&page->lru);
rmv_page_order(page);
/* Decrement the nr_free counter */
area->nr_free--;
/* Having found a suitable block bottom-up, split it back down top-down */
expand(zone, page, order, current_order, area, migratetype);
/* Record the page's migratetype */
set_pcppage_migratetype(page, migratetype);
return page;
}
return NULL;
}
```
## expand
[expand](https://elixir.bootlin.com/linux/v4.9.140/source/mm/page_alloc.c#L1653)
```c=1639
/*
* The order of subdivision here is critical for the IO subsystem.
* Please do not alter this order without good reasons and regression
* testing. Specifically, as large blocks of memory are subdivided,
* the order in which smaller blocks are delivered depends on the order
* they're subdivided in this function. This is the primary factor
* influencing the order in which pages are delivered to the IO
* subsystem according to empirical testing, and this is also justified
* by considering the behavior of a buddy system containing a single
* large block of memory acted on by a series of small allocations.
* This behavior is a critical factor in sglist merging's success.
*
* -- nyc
* At each step, return half of the block to the next lower free_area and keep splitting the other half
*/
static inline void expand(struct zone *zone, struct page *page,
int low, int high, struct free_area *area,
int migratetype)
{
unsigned long size = 1 << high;
/* Walk from the higher order down to the lower order */
while (high > low) {
area--;
high--;
size >>= 1;
VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);
/*
* Mark as guard pages (or page), that will allow to
* merge back to allocator when buddy will be freed.
* Corresponding page table entries will not be touched,
* pages will stay not present in virtual address space
*/
if (set_page_guard(zone, &page[size], high, migratetype))
continue;
/* Put the upper half of the split block back on the free_list */
list_add(&page[size].lru, &area->free_list[migratetype]);
/* Increment the nr_free counter */
area->nr_free++;
/* page->private = high, marking this page as the head of an order-high block */
set_page_order(&page[size], high);
}
}
```
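To make the split concrete, here is a small user-space walk-through of my own (not kernel code) of what `expand()` does when `__rmqueue_smallest()` finds an order-4 block for an order-1 request:
```c=
#include <stdio.h>

int main(void)
{
	unsigned int low = 1;		/* requested order */
	unsigned int high = 4;		/* order of the block actually found */
	unsigned long size = 1UL << high;

	printf("found order-%u block: pages [0..%lu]\n", high, size - 1);
	while (high > low) {
		high--;
		size >>= 1;
		/* page[size .. 2*size-1] goes back to free_area[high] */
		printf("put pages [%lu..%lu] back on the order-%u free_list\n",
		       size, 2 * size - 1, high);
	}
	printf("hand pages [0..%lu] (order %u) to the caller\n", size - 1, low);
	return 0;
}
```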
# Page Replacement
As mentioned earlier, the kswapd daemon either simply frees pages whose data is available elsewhere, or evicts dirty pages to the backing storage device.
While doing so, kswapd eventually ends up calling shrink_list.
## enum lru_list
[enum lru_list](https://elixir.bootlin.com/linux/v4.9.140/source/include/linux/mmzone.h#L189)
```c=
enum lru_list {
LRU_INACTIVE_ANON = LRU_BASE, // base index; the inactive list of anonymous pages
LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE, // the active list of anonymous pages
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE, // the inactive list of file cache pages
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE, // the active list of file cache pages
LRU_UNEVICTABLE, // the list of unevictable pages
NR_LRU_LISTS
};
```
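The composition of these indices, assuming the constants defined in mmzone.h (`LRU_BASE` = 0, `LRU_ACTIVE` = 1, `LRU_FILE` = 2), can be checked with a tiny user-space sketch:
```c=
#include <stdio.h>

#define LRU_BASE   0
#define LRU_ACTIVE 1
#define LRU_FILE   2

enum lru_list {
	LRU_INACTIVE_ANON = LRU_BASE,				/* 0 */
	LRU_ACTIVE_ANON   = LRU_BASE + LRU_ACTIVE,		/* 1 */
	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,		/* 2 */
	LRU_ACTIVE_FILE   = LRU_BASE + LRU_FILE + LRU_ACTIVE,	/* 3 */
	LRU_UNEVICTABLE,					/* 4 */
	NR_LRU_LISTS						/* 5 */
};

int main(void)
{
	printf("inactive anon=%d active anon=%d inactive file=%d "
	       "active file=%d unevictable=%d total=%d\n",
	       LRU_INACTIVE_ANON, LRU_ACTIVE_ANON, LRU_INACTIVE_FILE,
	       LRU_ACTIVE_FILE, LRU_UNEVICTABLE, NR_LRU_LISTS);
	return 0;
}
```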
## Important defines
- [for_each_lru(lru)](https://elixir.bootlin.com/linux/v4.9.140/source/include/linux/mmzone.h#L198)
```c=
// Loop over all LRU lists
#define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
```
- [for_each_evictable_lru(lru)](https://elixir.bootlin.com/linux/v4.9.140/source/include/linux/mmzone.h#L200)
```c=
// Loop over all evictable LRU lists (everything except LRU_UNEVICTABLE)
#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
```
- [is_file_lru(enum lru_list lru)](https://elixir.bootlin.com/linux/v4.9.140/source/include/linux/mmzone.h#L202)
```c=
// Is this a file cache page list?
static inline int is_file_lru(enum lru_list lru)
{
return (lru == LRU_INACTIVE_FILE || lru == LRU_ACTIVE_FILE);
}
```
- [is_active_lru(enum lru_list lru)](https://elixir.bootlin.com/linux/v4.9.140/source/include/linux/mmzone.h#L207)
```c=
// Is this an active page list?
static inline int is_active_lru(enum lru_list lru)
{
return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
}
```
## shrink_list
[shrink_list](https://elixir.bootlin.com/linux/v4.9.140/source/mm/vmscan.c#L2085)
```c=
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct lruvec *lruvec, struct scan_control *sc)
{
if (is_active_lru(lru)) {
// Age active pages down to the inactive list when the inactive list is low
if (inactive_list_is_low(lruvec, is_file_lru(lru), sc))
shrink_active_list(nr_to_scan, lruvec, sc, lru);
return 0;
}
// Swap out inactive pages, or simply free them
return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
}
```
## shrink_active_list
[shrink_active_list](https://elixir.bootlin.com/linux/v4.9.140/source/mm/vmscan.c#L1931)
```c=
static void shrink_active_list(unsigned long nr_to_scan,
struct lruvec *lruvec,
struct scan_control *sc,
enum lru_list lru)
{
unsigned long nr_taken;
unsigned long nr_scanned;
unsigned long vm_flags;
/* Initialize doubly linked list heads */
LIST_HEAD(l_hold); /* The pages which were snipped off */
LIST_HEAD(l_active);
LIST_HEAD(l_inactive);
struct page *page;
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
unsigned long nr_rotated = 0;
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
lru_add_drain();
if (!sc->may_unmap)
isolate_mode |= ISOLATE_UNMAPPED;
if (!sc->may_writepage)
isolate_mode |= ISOLATE_CLEAN;
spin_lock_irq(&pgdat->lru_lock);
/* Returns how many pages were moved onto l_hold */
nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
&nr_scanned, sc, isolate_mode, lru);
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
reclaim_stat->recent_scanned[file] += nr_taken;
if (global_reclaim(sc))
__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
__count_vm_events(PGREFILL, nr_scanned);
spin_unlock_irq(&pgdat->lru_lock);
while (!list_empty(&l_hold)) {
cond_resched();
/* Take one page off the l_hold list */
page = lru_to_page(&l_hold);
list_del(&page->lru);
if (unlikely(!page_evictable(page))) {
/* If the page is unevictable, put it back on its LRU list and continue with the next page */
putback_lru_page(page);
continue;
}
if (unlikely(buffer_heads_over_limit)) {
if (page_has_private(page) && trylock_page(page)) {
if (page_has_private(page))
try_to_release_page(page, 0);
unlock_page(page);
}
}
if (page_referenced(page, 0, sc->target_mem_cgroup,
&vm_flags)) {
nr_rotated += hpage_nr_pages(page);
/*
* Identify referenced, file-backed active pages and
* give them one more trip around the active list. So
* that executable code get better chances to stay in
* memory under moderate memory pressure. Anon pages
* are not likely to be evicted by use-once streaming
* IO, plus JVM can create lots of anon VM_EXEC pages,
* so we ignore them here.
*/
if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
list_add(&page->lru, &l_active);
continue;
}
}
/* Add this page to the inactive list */
ClearPageActive(page); /* we are de-activating */
list_add(&page->lru, &l_inactive);
}
/*
* Move pages back to the lru list.
*/
spin_lock_irq(&pgdat->lru_lock);
/*
* Count referenced pages from currently used mappings as rotated,
* even though only some of them are actually re-activated. This
* helps balance scan pressure between file and anonymous pages in
* get_scan_count.
*/
reclaim_stat->recent_rotated[file] += nr_rotated;
/* Put l_active back on the active list and l_inactive on the inactive list */
move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
spin_unlock_irq(&pgdat->lru_lock);
mem_cgroup_uncharge_list(&l_hold);
free_hot_cold_page_list(&l_hold, true);
}
```
## shrink_inactive_list
[shrink_inactive_list](https://elixir.bootlin.com/linux/v4.9.140/source/mm/vmscan.c#L1726)
```c=
/*
* shrink_inactive_list() is a helper for shrink_node(). It returns the number
* of reclaimed pages
*/
static noinline_for_stack unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
struct scan_control *sc, enum lru_list lru)
{
LIST_HEAD(page_list);
unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
unsigned long nr_taken;
unsigned long nr_dirty = 0;
unsigned long nr_congested = 0;
unsigned long nr_unqueued_dirty = 0;
unsigned long nr_writeback = 0;
unsigned long nr_immediate = 0;
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
if (!inactive_reclaimable_pages(lruvec, sc, lru))
return 0;
while (unlikely(too_many_isolated(pgdat, file, sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
/* We are about to die and free our memory. Return now. */
if (fatal_signal_pending(current))
return SWAP_CLUSTER_MAX;
}
lru_add_drain();
if (!sc->may_unmap)
isolate_mode |= ISOLATE_UNMAPPED;
if (!sc->may_writepage)
isolate_mode |= ISOLATE_CLEAN;
spin_lock_irq(&pgdat->lru_lock);
nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
&nr_scanned, sc, isolate_mode, lru);
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
reclaim_stat->recent_scanned[file] += nr_taken;
if (global_reclaim(sc)) {
__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
if (current_is_kswapd())
__count_vm_events(PGSCAN_KSWAPD, nr_scanned);
else
__count_vm_events(PGSCAN_DIRECT, nr_scanned);
}
spin_unlock_irq(&pgdat->lru_lock);
if (nr_taken == 0)
return 0;
/* Actually reclaim the pages */
nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
&nr_dirty, &nr_unqueued_dirty, &nr_congested,
&nr_writeback, &nr_immediate,
false);
spin_lock_irq(&pgdat->lru_lock);
if (global_reclaim(sc)) {
if (current_is_kswapd())
__count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed);
else
__count_vm_events(PGSTEAL_DIRECT, nr_reclaimed);
}
putback_inactive_pages(lruvec, &page_list);
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
spin_unlock_irq(&pgdat->lru_lock);
mem_cgroup_uncharge_list(&page_list);
free_hot_cold_page_list(&page_list, true);
/*
* If reclaim is isolating dirty pages under writeback, it implies
* that the long-lived page allocation rate is exceeding the page
* laundering rate. Either the global limits are not being effective
* at throttling processes due to the page distribution throughout
* zones or there is heavy usage of a slow backing device. The
* only option is to throttle from reclaim context which is not ideal
* as there is no guarantee the dirtying process is throttled in the
* same way balance_dirty_pages() manages.
*
* Once a zone is flagged ZONE_WRITEBACK, kswapd will count the number
* of pages under pages flagged for immediate reclaim and stall if any
* are encountered in the nr_immediate check below.
*/
if (nr_writeback && nr_writeback == nr_taken)
set_bit(PGDAT_WRITEBACK, &pgdat->flags);
/*
* Legacy memcg will stall in page writeback so avoid forcibly
* stalling here.
*/
if (sane_reclaim(sc)) {
/*
* Tag a zone as congested if all the dirty pages scanned were
* backed by a congested BDI and wait_iff_congested will stall.
*/
if (nr_dirty && nr_dirty == nr_congested)
set_bit(PGDAT_CONGESTED, &pgdat->flags);
/*
* If dirty pages are scanned that are not queued for IO, it
* implies that flushers are not keeping up. In this case, flag
* the pgdat PGDAT_DIRTY and kswapd will start writing pages from
* reclaim context.
*/
if (nr_unqueued_dirty == nr_taken)
set_bit(PGDAT_DIRTY, &pgdat->flags);
/*
* If kswapd scans pages marked marked for immediate
* reclaim and under writeback (nr_immediate), it implies
* that pages are cycling through the LRU faster than
* they are written so also forcibly stall.
*/
if (nr_immediate && current_may_throttle())
congestion_wait(BLK_RW_ASYNC, HZ/10);
}
/*
* Stall direct reclaim for IO completions if underlying BDIs or zone
* is congested. Allow kswapd to continue until it starts encountering
* unqueued dirty pages or cycling through the LRU too quickly.
*/
if (!sc->hibernation_mode && !current_is_kswapd() &&
current_may_throttle())
wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed,
sc->priority, file);
return nr_reclaimed;
}
```
###### tags: `OS` `Memory Management`