Linux 4.9.140 Memory Management
===
# Concepts
Reference: [Concepts overview](https://www.kernel.org/doc/html/v4.18/admin-guide/mm/concepts.html)

## Huge Pages
> Usage of huge pages significantly reduces pressure on TLB, improves TLB hit-rate and thus improves overall system performance.
>
> There are two mechanisms in Linux that enable mapping of the physical memory with the huge pages. The first one is HugeTLB filesystem, or hugetlbfs.
> Another, more recent, mechanism that enables use of the huge pages is called Transparent HugePages, or THP.

Larger pages raise the TLB hit rate and therefore improve overall performance.

## Zones
> Linux groups memory pages into zones according to their possible usage.
>
> The actual layout of the memory zones is hardware dependent as not all architectures define all zones, and requirements for DMA are different for different platforms.

- ZONE_DMA - will contain memory that can be used by devices for DMA
- ZONE_HIGHMEM - will contain memory that is not permanently mapped into kernel’s address space
- ZONE_NORMAL - will contain normally addressed pages

## Nodes
> Many multi-processor machines are NUMA - Non-Uniform Memory Access - systems.

Each node has its own zones, its own lists of free and used pages, and various statistics counters.

## Page cache
When a file is read from disk, the data is first placed in the page cache. Later reads of the same file are served directly from the page cache instead of issuing another, far slower, request to the disk.

Page cache pages that have been written to are marked dirty and are synchronized back to the file on disk when necessary.

## Anonymous Memory
Memory that is not mapped to any file in a filesystem is called anonymous memory; a process's stack and heap are anonymous memory.

## Reclaim
Pages that can be freed are called reclaimable; page cache pages and anonymous memory are both reclaimable.

Pages that hold kernel data, or that are used as DMA (Direct Memory Access) buffers, usually cannot be freed; such pages are unreclaimable.

> However, in certain circumstances, even pages occupied with kernel data structures can be reclaimed. For instance, in-memory caches of filesystem metadata can be re-read from the storage device and therefore it is possible to discard them from the main memory when system is under memory pressure.

Freeing reclaimable pages and repurposing them is what Linux calls reclaim.

As the system load increases, the number of free pages drops. When it falls to a certain threshold (the high watermark), an allocation request wakes the kswapd daemon, which either just frees pages whose contents can be recovered from elsewhere or evicts dirty pages to the backing storage device.

When the load increases even further and the number of free pages falls to another threshold (the min watermark), an allocation triggers direct reclaim: the allocation stalls until enough memory has been reclaimed.

## Compaction
Allocating and freeing memory fragments it over time. Thanks to virtual memory, user space still sees contiguous ranges, but sometimes physically contiguous memory is required (for example a device asking for a large buffer for DMA, or THP asking for a huge page).

The kcompactd daemon performs this compaction.

## OOM Killer
OOM stands for Out of Memory. If the system truly runs out of memory, the kernel invokes the OOM killer.

The OOM killer's job is simple: kill one task so that enough memory becomes available. Poor task.
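The reclaim thresholds described above can be pictured with a small user-space model. This is only a sketch: `struct toy_zone`, `decide_allocation_path()` and the numbers are invented, and the real allocator keeps several watermarks per zone (they show up later as `zone->watermark[...]` in `get_page_from_freelist`).

```c
#include <stdio.h>

/*
 * Hypothetical per-zone state; the real kernel keeps several watermarks
 * per zone in struct zone. Names and numbers here are made up.
 */
struct toy_zone {
	unsigned long free_pages;
	unsigned long wmark_kswapd;	/* below this: wake the background reclaimer */
	unsigned long wmark_min;	/* below this: the allocation reclaims directly */
};

/* Invented helper mirroring the description above. */
static const char *decide_allocation_path(const struct toy_zone *z)
{
	if (z->free_pages <= z->wmark_min)
		return "direct reclaim: allocation stalls until memory is reclaimed";
	if (z->free_pages <= z->wmark_kswapd)
		return "wake kswapd, then allocate from the free lists";
	return "fast path: allocate straight from the free lists";
}

int main(void)
{
	struct toy_zone zone = { .wmark_kswapd = 256, .wmark_min = 128 };
	unsigned long samples[] = { 1000, 200, 100 };

	for (int i = 0; i < 3; i++) {
		zone.free_pages = samples[i];
		printf("free=%4lu -> %s\n", zone.free_pages,
		       decide_allocation_path(&zone));
	}
	return 0;
}
```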
# __alloc_pages_nodemask
[__alloc_pages_nodemask](https://elixir.bootlin.com/linux/v4.9.140/source/mm/page_alloc.c#L3770)
```c=3766
/*
 * This is the 'heart' of the zoned buddy allocator.
 */
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
			struct zonelist *zonelist, nodemask_t *nodemask)
{
	struct page *page;
	unsigned int alloc_flags = ALLOC_WMARK_LOW;
	gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for allocation */
	/* nodemask, migratetype and high_zoneidx are initialized here and never changed afterwards */
	struct alloc_context ac = {
		.high_zoneidx = gfp_zone(gfp_mask),
		.zonelist = zonelist,
		.nodemask = nodemask,
		.migratetype = gfpflags_to_migratetype(gfp_mask),
	};

	/*
	 * Without CONFIG_CPUSETS, cpusets_enabled() returns false;
	 * with it, it returns static_branch_unlikely(&cpusets_enabled_key).
	 */
	if (cpusets_enabled()) {
		alloc_mask |= __GFP_HARDWALL;
		alloc_flags |= ALLOC_CPUSET;
		if (!ac.nodemask)
			ac.nodemask = &cpuset_current_mems_allowed;
	}

	/* Filter out flags that are not allowed */
	gfp_mask &= gfp_allowed_mask;

	lockdep_trace_alloc(gfp_mask);

	/* If gfp_mask has __GFP_DIRECT_RECLAIM set, this allocation may sleep */
	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);

	/*
	 * Without CONFIG_FAIL_PAGE_ALLOC this if is never taken.
	 * With it, the if is also skipped when:
	 *   order < fail_page_alloc.min_order, or
	 *   gfp_mask has __GFP_NOFAIL set, or
	 *   fail_page_alloc.ignore_gfp_highmem is true and gfp_mask has __GFP_HIGHMEM, or
	 *   fail_page_alloc.ignore_gfp_reclaim is true and gfp_mask has __GFP_DIRECT_RECLAIM.
	 * Otherwise it returns should_fail(&fail_page_alloc.attr, 1 << order).
	 */
	if (should_fail_alloc_page(gfp_mask, order))
		return NULL;

	/*
	 * Check the zones suitable for the gfp_mask contain at least one
	 * valid zone. It's possible to have an empty zonelist as a result
	 * of __GFP_THISNODE and a memoryless node
	 * (i.e. make sure the zonelist has at least one valid zone)
	 */
	if (unlikely(!zonelist->_zonerefs->zone))
		return NULL;

	/*
	 * Taken when CONFIG_CMA is 'y' or 'm' and
	 * ac.migratetype is MIGRATE_MOVABLE.
	 */
	if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
		alloc_flags |= ALLOC_CMA;

	/*
	 * Dirty zone balancing only done in the fast path
	 * ac.spread_dirty_pages records whether gfp_mask has __GFP_WRITE set.
	 */
	ac.spread_dirty_pages = (gfp_mask & __GFP_WRITE);

	/*
	 * The preferred zone is used for statistics but crucially it is
	 * also used as the starting point for the zonelist iterator. It
	 * may get reset for allocations that ignore memory policies.
	 * Returns the first zone in ac.zonelist, starting from ac.high_zoneidx,
	 * that matches ac.nodemask.
	 */
	ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
					ac.high_zoneidx, ac.nodemask);
	/* No such zone */
	if (!ac.preferred_zoneref->zone) {
		page = NULL;
		/*
		 * This might be due to race with cpuset_current_mems_allowed
		 * update, so make sure we retry with original nodemask in the
		 * slow path.
		 */
		goto no_zone;
	}

	/*
	 * First allocation attempt
	 * (allocate a page from the zonelist)
	 */
	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
	if (likely(page))
		goto out;

no_zone:
	/*
	 * Runtime PM, block IO and its error handling path can deadlock
	 * because I/O on the device might not complete.
	 * In memalloc_noio_flags(): if current->flags has PF_MEMALLOC_NOIO set,
	 * return gfp_mask & ~(__GFP_IO | __GFP_FS).
	 */
	alloc_mask = memalloc_noio_flags(gfp_mask);

	/* Mark ac.spread_dirty_pages as clean again */
	ac.spread_dirty_pages = false;

	/*
	 * Restore the original nodemask if it was potentially replaced with
	 * &cpuset_current_mems_allowed to optimize the fast-path attempt.
	 * (restore ac.nodemask back to nodemask)
	 */
	if (unlikely(ac.nodemask != nodemask))
		ac.nodemask = nodemask;

	/* Retry with __alloc_pages_slowpath */
	page = __alloc_pages_slowpath(alloc_mask, order, &ac);

out:
	/*
	 * Without CONFIG_MEMCG, or with CONFIG_SLOB,
	 * memcg_kmem_enabled() always returns false;
	 * otherwise it returns static_branch_unlikely(&memcg_kmem_enabled_key).
	 * The if is taken when memcg_kmem_enabled() is non-zero,
	 * gfp_mask has __GFP_ACCOUNT set, page is non-NULL and
	 * memcg_kmem_charge(page, gfp_mask, order) fails.
	 */
	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
	    unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
		/*
		 * __free_pages drops the refcount; once it reaches 0
		 * (no process is using this page), the page is freed
		 * according to order: free_hot_cold_page(page, false)
		 * for order 0, otherwise __free_pages_ok(page, order).
		 */
		__free_pages(page, order);
		page = NULL;
	}

	/* Taken when kmemcheck_enabled is non-zero and page is non-NULL */
	if (kmemcheck_enabled && page)
		kmemcheck_pagealloc_alloc(page, order, gfp_mask);

	trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);

	return page;
}
```
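For context, kernel code normally does not call `__alloc_pages_nodemask()` directly; it goes through wrappers such as `alloc_pages()`, which pass the current node's zonelist down to this function. Below is a minimal sketch of a module that exercises that path; the module name and messages are invented, while `alloc_pages()`, `page_address()` and `__free_pages()` are real APIs.

```c
#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Hypothetical demo module: ask the buddy allocator for a 4-page block
 * (order 2). alloc_pages() eventually reaches __alloc_pages_nodemask().
 */
static struct page *demo_pages;

static int __init alloc_demo_init(void)
{
	demo_pages = alloc_pages(GFP_KERNEL, 2);	/* may sleep: __GFP_DIRECT_RECLAIM is set */
	if (!demo_pages)
		return -ENOMEM;

	pr_info("alloc_demo: got 4 pages at kernel address %p\n",
		page_address(demo_pages));
	return 0;
}

static void __exit alloc_demo_exit(void)
{
	__free_pages(demo_pages, 2);	/* give the block back to the buddy allocator */
}

module_init(alloc_demo_init);
module_exit(alloc_demo_exit);
MODULE_LICENSE("GPL");
```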
## get_page_from_freelist
[get_page_from_freelist](https://elixir.bootlin.com/linux/v4.9.140/source/mm/page_alloc.c#L2905)
```c=2900
/*
 * get_page_from_freelist goes through the zonelist trying to allocate
 * a page.
 */
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
						const struct alloc_context *ac)
{
	struct zoneref *z = ac->preferred_zoneref;
	struct zone *zone;
	struct pglist_data *last_pgdat_dirty_limit = NULL;

	/*
	 * Scan zonelist, looking for a zone with enough free.
	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
	 */
	for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
								ac->nodemask) {
		struct page *page;
		unsigned long mark;

		/*
		 * cpusets_enabled():
		 *   without CONFIG_CPUSETS it returns false;
		 *   with it, it returns static_branch_unlikely(&cpusets_enabled_key).
		 *
		 * __cpuset_zone_allowed():
		 *   without CONFIG_CPUSETS it returns true;
		 *   with it, it returns __cpuset_node_allowed(zone_to_nid(z), gfp_mask).
		 *
		 * __cpuset_node_allowed() returns true (this memory node may be
		 * used for the allocation) when we are in interrupt context,
		 * the zone's node is in mems_allowed, gfp_mask does not have
		 * __GFP_HARDWALL set, or current may use any memory because of
		 * TIF_MEMDIE; otherwise it returns false.
		 */
		if (cpusets_enabled() &&
			(alloc_flags & ALLOC_CPUSET) &&
			!__cpuset_zone_allowed(zone, gfp_mask))
				continue;
		/*
		 * When allocating a page cache page for writing, we
		 * want to get it from a node that is within its dirty
		 * limit, such that no single node holds more than its
		 * proportional share of globally allowed dirty pages.
		 * The dirty limits take into account the node's
		 * lowmem reserves and high watermark so that kswapd
		 * should be able to balance it without having to
		 * write pages from its LRU list.
		 *
		 * XXX: For now, allow allocations to potentially
		 * exceed the per-node dirty limit in the slowpath
		 * (spread_dirty_pages unset) before going into reclaim,
		 * which is important when on a NUMA setup the allowed
		 * nodes are together not big enough to reach the
		 * global limit. The proper fix for these situations
		 * will require awareness of nodes in the
		 * dirty-throttling and the flusher threads.
		 */
		if (ac->spread_dirty_pages) {
			if (last_pgdat_dirty_limit == zone->zone_pgdat)
				continue;

			if (!node_dirty_ok(zone->zone_pgdat)) {
				last_pgdat_dirty_limit = zone->zone_pgdat;
				continue;
			}
		}

		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
		if (!zone_watermark_fast(zone, order, mark,
				       ac_classzone_idx(ac), alloc_flags)) {
			int ret;

			/* Checked here to keep the fast path fast */
			BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
			if (alloc_flags & ALLOC_NO_WATERMARKS)
				goto try_this_zone;

			if (node_reclaim_mode == 0 ||
			    !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
				continue;

			ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
			switch (ret) {
			case NODE_RECLAIM_NOSCAN:
				/* did not scan */
				continue;
			case NODE_RECLAIM_FULL:
				/* scanned but unreclaimable */
				continue;
			default:
				/* did we reclaim enough */
				if (zone_watermark_ok(zone, order, mark,
						ac_classzone_idx(ac), alloc_flags))
					goto try_this_zone;

				continue;
			}
		}

try_this_zone:
		/* Allocate one page from this zone */
		page = buffered_rmqueue(ac->preferred_zoneref->zone, zone, order,
				gfp_mask, alloc_flags, ac->migratetype);
		if (page) {
			prep_new_page(page, order, gfp_mask, alloc_flags);

			/*
			 * If this is a high-order atomic allocation then check
			 * if the pageblock should be reserved for the future
			 */
			if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
				reserve_highatomic_pageblock(page, zone, order);

			return page;
		}
	}

	return NULL;
}
```
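One detail worth spelling out is how `mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK]` picks a watermark: the low bits of `alloc_flags` are simply an index into a per-zone array. The sketch below is a user-space model, not kernel code; the constant names mirror the kernel's `WMARK_*` / `ALLOC_WMARK_*` definitions, but `struct toy_zone`, `toy_watermark_ok()` and the numbers are invented.

```c
#include <stdio.h>

/* Watermark indices; the low bits of alloc_flags select one of them. */
enum toy_wmark { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

#define ALLOC_WMARK_MIN		WMARK_MIN
#define ALLOC_WMARK_LOW		WMARK_LOW
#define ALLOC_WMARK_HIGH	WMARK_HIGH
#define ALLOC_NO_WATERMARKS	0x04	/* don't check watermarks at all */
#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS - 1)

struct toy_zone {
	unsigned long watermark[NR_WMARK];
	unsigned long free_pages;
};

/* Invented check: is the zone above the selected watermark even after
 * handing out a 2^order block? */
static int toy_watermark_ok(const struct toy_zone *z, unsigned int order,
			    unsigned int alloc_flags)
{
	unsigned long mark = z->watermark[alloc_flags & ALLOC_WMARK_MASK];

	if (alloc_flags & ALLOC_NO_WATERMARKS)
		return 1;
	return z->free_pages >= mark + (1UL << order);
}

int main(void)
{
	struct toy_zone zone = {
		.watermark = { [WMARK_MIN] = 128, [WMARK_LOW] = 256, [WMARK_HIGH] = 384 },
		.free_pages = 300,
	};

	/* The fast path checks ALLOC_WMARK_LOW; the slow path drops to ALLOC_WMARK_MIN. */
	printf("order-0 vs low watermark: %d\n", toy_watermark_ok(&zone, 0, ALLOC_WMARK_LOW));
	printf("order-0 vs min watermark: %d\n", toy_watermark_ok(&zone, 0, ALLOC_WMARK_MIN));
	printf("order-9 vs low watermark: %d\n", toy_watermark_ok(&zone, 9, ALLOC_WMARK_LOW));
	return 0;
}
```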
## buffered_rmqueue
[buffered_rmqueue](https://elixir.bootlin.com/linux/v4.9.140/source/mm/page_alloc.c#L2619)
```c=2615
/*
 * Allocate a page from the given zone. Use pcplists for order-0 allocations.
 */
static inline struct page *buffered_rmqueue(struct zone *preferred_zone,
			struct zone *zone, unsigned int order,
			gfp_t gfp_flags, unsigned int alloc_flags,
			int migratetype)
{
	unsigned long flags;
	struct page *page;
	bool cold = ((gfp_flags & __GFP_COLD) != 0);

	if (likely(order == 0)) {
		struct per_cpu_pages *pcp;
		struct list_head *list;

		/* Save the CPU flags into 'flags', then disable local interrupts */
		local_irq_save(flags);
		do {
			pcp = &this_cpu_ptr(zone->pageset)->pcp;
			list = &pcp->lists[migratetype];
			if (list_empty(list)) {
				pcp->count += rmqueue_bulk(zone, 0,
						pcp->batch, list,
						migratetype, cold);
				if (unlikely(list_empty(list)))
					goto failed;
			}

			if (cold)
				page = list_last_entry(list, struct page, lru);
			else
				page = list_first_entry(list, struct page, lru);

			list_del(&page->lru);
			pcp->count--;

		} while (check_new_pcp(page));
	} else {
		/*
		 * We most definitely don't want callers attempting to
		 * allocate greater than order-1 page units with __GFP_NOFAIL.
		 */
		WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
		/* Enter the critical section; save the CPU flags first */
		spin_lock_irqsave(&zone->lock, flags);

		do {
			page = NULL;
			if (alloc_flags & ALLOC_HARDER) {
				page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
				if (page)
					trace_mm_page_alloc_zone_locked(page, order, migratetype);
			}
			if (!page)
				/*
				 * Remove an element from the buddy allocator;
				 * zone->lock must already be held here.
				 */
				page = __rmqueue(zone, order, migratetype);
		} while (page && check_new_pages(page, order));
		/* Leave the critical section */
		spin_unlock(&zone->lock);
		if (!page)
			goto failed;
		__mod_zone_freepage_state(zone, -(1 << order),
					  get_pcppage_migratetype(page));
	}

	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
	zone_statistics(preferred_zone, zone, gfp_flags);
	/* Restore the saved CPU flags */
	local_irq_restore(flags);

	VM_BUG_ON_PAGE(bad_range(zone, page), page);
	return page;

failed:
	local_irq_restore(flags);
	return NULL;
}
```
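The order-0 fast path above never touches `zone->lock`: each CPU keeps small per-migratetype lists of free pages (the pcplists) and only goes back to the buddy free lists, in batches, when its own list runs dry. The following user-space model is only meant to illustrate that caching idea; `struct toy_pcp`, `toy_refill()` and `TOY_BATCH` are all invented. The real code additionally honours `__GFP_COLD` by taking pages from the tail of the list instead of the head.

```c
#include <stdio.h>

#define TOY_BATCH 4	/* refill this many "pages" at once, like pcp->batch */

/* A per-CPU cache of free page frame numbers, standing in for pcp->lists[]. */
struct toy_pcp {
	int count;
	unsigned long pages[TOY_BATCH * 2];
};

static unsigned long next_pfn = 1000;	/* pretend the buddy allocator hands out pfns */

/* Stand-in for rmqueue_bulk(): grab a whole batch from the "buddy allocator". */
static void toy_refill(struct toy_pcp *pcp)
{
	printf("  (refilling %d pages from the buddy lists)\n", TOY_BATCH);
	for (int i = 0; i < TOY_BATCH; i++)
		pcp->pages[pcp->count++] = next_pfn++;
}

/* Order-0 fast path: take from the per-CPU cache, refill only when empty. */
static unsigned long toy_alloc_order0(struct toy_pcp *pcp)
{
	if (pcp->count == 0)
		toy_refill(pcp);
	return pcp->pages[--pcp->count];	/* take from the "hot" end */
}

int main(void)
{
	struct toy_pcp pcp = { 0 };

	for (int i = 0; i < 6; i++)
		printf("allocated pfn %lu\n", toy_alloc_order0(&pcp));
	return 0;
}
```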
## __rmqueue
[__rmqueue](https://elixir.bootlin.com/linux/v4.9.140/source/mm/page_alloc.c#L2204)
```c=2200
/*
 * Do the hard work of removing an element from the buddy allocator.
 * Call me with the zone->lock already held.
 */
static struct page *__rmqueue(struct zone *zone, unsigned int order,
				int migratetype)
{
	struct page *page;

	/* Walk the free lists for this migratetype and remove the smallest available page */
	page = __rmqueue_smallest(zone, order, migratetype);
	if (unlikely(!page)) {
		/* MOVABLE failed: try to take a page from CMA first */
		if (migratetype == MIGRATE_MOVABLE)
			page = __rmqueue_cma_fallback(zone, order);

		/* If that also fails, steal a page from the other migrate types */
		if (!page)
			page = __rmqueue_fallback(zone, order, migratetype);
	}

	trace_mm_page_alloc_zone_locked(page, order, migratetype);
	return page;
}
```

## __rmqueue_smallest
[__rmqueue_smallest](https://elixir.bootlin.com/linux/v4.9.140/source/mm/page_alloc.c#L1813)
```c=1808
/*
 * Go through the free lists for the given migratetype and remove
 * the smallest available page from the freelists
 * (starting at 'order', search the free_area levels upwards and take a
 *  page off the first non-empty list)
 */
static inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
						int migratetype)
{
	unsigned int current_order;
	struct free_area *area;
	struct page *page;

	/* Find a page of the appropriate size in the preferred list */
	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
		area = &(zone->free_area[current_order]);
		page = list_first_entry_or_null(&area->free_list[migratetype],
							struct page, lru);
		/* Nothing at this order; try a larger block */
		if (!page)
			continue;
		list_del(&page->lru);
		rmv_page_order(page);
		/* Decrement the nr_free counter */
		area->nr_free--;
		/* A suitable block was found searching upwards; now split it back down */
		expand(zone, page, order, current_order, area, migratetype);
		/* Record the page's migratetype */
		set_pcppage_migratetype(page, migratetype);
		return page;
	}

	return NULL;
}
```

## expand
[expand](https://elixir.bootlin.com/linux/v4.9.140/source/mm/page_alloc.c#L1653)
```c=1639
/*
 * The order of subdivision here is critical for the IO subsystem.
 * Please do not alter this order without good reasons and regression
 * testing. Specifically, as large blocks of memory are subdivided,
 * the order in which smaller blocks are delivered depends on the order
 * they're subdivided in this function. This is the primary factor
 * influencing the order in which pages are delivered to the IO
 * subsystem according to empirical testing, and this is also justified
 * by considering the behavior of a buddy system containing a single
 * large block of memory acted on by a series of small allocations.
 * This behavior is a critical factor in sglist merging's success.
 *
 * -- nyc
 * (At each lower order, one half of the block is left on that order's
 *  free_area and the other half keeps being split.)
 */
static inline void expand(struct zone *zone, struct page *page,
	int low, int high, struct free_area *area,
	int migratetype)
{
	unsigned long size = 1 << high;

	/* Walk from the higher order down to the lower one */
	while (high > low) {
		area--;
		high--;
		size >>= 1;
		VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);

		/*
		 * Mark as guard pages (or page), that will allow to
		 * merge back to allocator when buddy will be freed.
		 * Corresponding page table entries will not be touched,
		 * pages will stay not present in virtual address space
		 */
		if (set_page_guard(zone, &page[size], high, migratetype))
			continue;

		/* Put the second half of the split block back on the free_list */
		list_add(&page[size].lru, &area->free_list[migratetype]);
		/* Increment the nr_free counter */
		area->nr_free++;
		/* page->private = high, i.e. this page belongs to a block of order 'high' */
		set_page_order(&page[size], high);
	}
}
```
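Taken together, `__rmqueue_smallest()` and `expand()` implement the classic buddy discipline: search upwards for the first non-empty order, then split the block back down, returning one half to each intermediate free list. The following user-space sketch simulates that with plain counters instead of real page lists; `toy_free[]`, `toy_alloc()` and `MAX_ORDER_TOY` are invented for illustration.

```c
#include <stdio.h>

#define MAX_ORDER_TOY 5	/* orders 0..4, i.e. blocks of 1..16 pages */

/* Number of free blocks at each order (stand-in for free_area[].nr_free). */
static int toy_free[MAX_ORDER_TOY] = { 0, 0, 0, 0, 1 };	/* one 16-page block */

/*
 * Search upwards for a block, then split it back down like expand():
 * at each level below 'current_order', one half stays on that level's
 * free list and the other half keeps being split.
 * Returns the requested order, or -1 when nothing is available.
 */
static int toy_alloc(int order)
{
	int current_order;

	for (current_order = order; current_order < MAX_ORDER_TOY; current_order++) {
		if (toy_free[current_order] == 0)
			continue;		/* empty, try a larger block */

		toy_free[current_order]--;	/* take the block off its list */
		while (current_order > order) {
			current_order--;
			toy_free[current_order]++;	/* put one half back */
		}
		return order;
	}
	return -1;
}

static void toy_dump(const char *when)
{
	printf("%s: nr_free per order =", when);
	for (int i = 0; i < MAX_ORDER_TOY; i++)
		printf(" %d", toy_free[i]);
	printf("\n");
}

int main(void)
{
	toy_dump("before");
	if (toy_alloc(0) == 0)	/* request a single page */
		toy_dump("after order-0 allocation");
	return 0;
}
```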
# Page Replacement
As mentioned above, the kswapd daemon either just frees pages whose contents can be recovered from elsewhere or evicts dirty pages to the backing storage device. While running, kswapd ends up calling shrink_list.

## enum lru_list
[enum lru_list](https://elixir.bootlin.com/linux/v4.9.140/source/include/linux/mmzone.h#L189)
```c=
enum lru_list {
	LRU_INACTIVE_ANON = LRU_BASE,                       // the base list: inactive list for anonymous pages
	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,            // active list for anonymous pages
	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,            // inactive list for file cache pages
	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE, // active list for file cache pages
	LRU_UNEVICTABLE,                                    // list for unevictable pages
	NR_LRU_LISTS
};
```

## Important defines
- [for_each_lru(lru)](https://elixir.bootlin.com/linux/v4.9.140/source/include/linux/mmzone.h#L198)
```c=
// loop over every LRU list
#define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
```
- [for_each_evictable_lru(lru)](https://elixir.bootlin.com/linux/v4.9.140/source/include/linux/mmzone.h#L200)
```c=
// loop over every evictable LRU list (everything except LRU_UNEVICTABLE)
#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
```
- [is_file_lru(enum lru_list lru)](https://elixir.bootlin.com/linux/v4.9.140/source/include/linux/mmzone.h#L202)
```c=
// is this a file cache page list?
static inline int is_file_lru(enum lru_list lru)
{
	return (lru == LRU_INACTIVE_FILE || lru == LRU_ACTIVE_FILE);
}
```
- [is_active_lru(enum lru_list lru)](https://elixir.bootlin.com/linux/v4.9.140/source/include/linux/mmzone.h#L207)
```c=
// is this an active page list?
static inline int is_active_lru(enum lru_list lru)
{
	return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
}
```

## shrink_list
[shrink_list](https://elixir.bootlin.com/linux/v4.9.140/source/mm/vmscan.c#L2085)
```c=
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
				 struct lruvec *lruvec, struct scan_control *sc)
{
	if (is_active_lru(lru)) {
		// move active pages to the inactive list
		if (inactive_list_is_low(lruvec, is_file_lru(lru), sc))
			shrink_active_list(nr_to_scan, lruvec, sc, lru);
		return 0;
	}

	// swap inactive pages out, or simply free them
	return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
}
```
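`shrink_list()` is the dispatcher for the two-stage LRU scheme: active lists are only trimmed into the inactive lists when the inactive side has become too small, and actual reclaim only ever happens from the inactive lists. The compact user-space model below mirrors only the shape of that decision; `struct toy_lruvec`, `toy_inactive_is_low()` and the 2:1 ratio are invented.

```c
#include <stdio.h>

/* Per-list page counts, standing in for what an lruvec tracks per LRU list. */
struct toy_lruvec {
	unsigned long nr_active;
	unsigned long nr_inactive;
};

/* Invented stand-in for inactive_list_is_low(): consider the inactive list
 * too small when it holds less than half as many pages as the active list. */
static int toy_inactive_is_low(const struct toy_lruvec *lv)
{
	return lv->nr_inactive * 2 < lv->nr_active;
}

/* Mirrors the shape of shrink_list(): age active pages only when needed,
 * reclaim only from the inactive side. */
static void toy_shrink(struct toy_lruvec *lv, int active, unsigned long nr_to_scan)
{
	if (active) {
		if (toy_inactive_is_low(lv)) {
			/* like shrink_active_list(): deactivate some pages */
			lv->nr_active -= nr_to_scan;
			lv->nr_inactive += nr_to_scan;
			printf("deactivated %lu pages\n", nr_to_scan);
		}
		return;
	}
	/* like shrink_inactive_list(): pages leave memory from here */
	lv->nr_inactive -= nr_to_scan;
	printf("reclaimed %lu pages\n", nr_to_scan);
}

int main(void)
{
	struct toy_lruvec lv = { .nr_active = 1000, .nr_inactive = 200 };

	toy_shrink(&lv, 1, 100);	/* active list: inactive is low, so deactivate */
	toy_shrink(&lv, 0, 100);	/* inactive list: reclaim */
	printf("active=%lu inactive=%lu\n", lv.nr_active, lv.nr_inactive);
	return 0;
}
```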
## shrink_active_list
[shrink_active_list](https://elixir.bootlin.com/linux/v4.9.140/source/mm/vmscan.c#L1931)
```c=
static void shrink_active_list(unsigned long nr_to_scan,
			       struct lruvec *lruvec,
			       struct scan_control *sc,
			       enum lru_list lru)
{
	unsigned long nr_taken;
	unsigned long nr_scanned;
	unsigned long vm_flags;
	/* Initialize the doubly linked lists */
	LIST_HEAD(l_hold);	/* The pages which were snipped off */
	LIST_HEAD(l_active);
	LIST_HEAD(l_inactive);
	struct page *page;
	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
	unsigned long nr_rotated = 0;
	isolate_mode_t isolate_mode = 0;
	int file = is_file_lru(lru);
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);

	lru_add_drain();

	if (!sc->may_unmap)
		isolate_mode |= ISOLATE_UNMAPPED;
	if (!sc->may_writepage)
		isolate_mode |= ISOLATE_CLEAN;

	spin_lock_irq(&pgdat->lru_lock);

	/* Returns the number of pages that were moved onto l_hold */
	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
				     &nr_scanned, sc, isolate_mode, lru);

	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
	reclaim_stat->recent_scanned[file] += nr_taken;

	if (global_reclaim(sc))
		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
	__count_vm_events(PGREFILL, nr_scanned);

	spin_unlock_irq(&pgdat->lru_lock);

	while (!list_empty(&l_hold)) {
		cond_resched();
		/* Take the next page off l_hold */
		page = lru_to_page(&l_hold);
		list_del(&page->lru);

		if (unlikely(!page_evictable(page))) {
			/*
			 * If the page is unevictable, put it back on an LRU
			 * list and continue with the next page.
			 */
			putback_lru_page(page);
			continue;
		}

		if (unlikely(buffer_heads_over_limit)) {
			if (page_has_private(page) && trylock_page(page)) {
				if (page_has_private(page))
					try_to_release_page(page, 0);
				unlock_page(page);
			}
		}

		if (page_referenced(page, 0, sc->target_mem_cgroup,
				    &vm_flags)) {
			nr_rotated += hpage_nr_pages(page);
			/*
			 * Identify referenced, file-backed active pages and
			 * give them one more trip around the active list. So
			 * that executable code get better chances to stay in
			 * memory under moderate memory pressure. Anon pages
			 * are not likely to be evicted by use-once streaming
			 * IO, plus JVM can create lots of anon VM_EXEC pages,
			 * so we ignore them here.
			 */
			if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
				list_add(&page->lru, &l_active);
				continue;
			}
		}

		/* Add this page to the inactive list */
		ClearPageActive(page);	/* we are de-activating */
		list_add(&page->lru, &l_inactive);
	}

	/*
	 * Move pages back to the lru list.
	 */
	spin_lock_irq(&pgdat->lru_lock);
	/*
	 * Count referenced pages from currently used mappings as rotated,
	 * even though only some of them are actually re-activated. This
	 * helps balance scan pressure between file and anonymous pages in
	 * get_scan_count.
	 */
	reclaim_stat->recent_rotated[file] += nr_rotated;

	/*
	 * Put the pages collected on l_active back on the active list, and
	 * the pages collected on l_inactive on the inactive list.
	 */
	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
	spin_unlock_irq(&pgdat->lru_lock);

	mem_cgroup_uncharge_list(&l_hold);
	free_hot_cold_page_list(&l_hold, true);
}
```
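The interesting policy sits in the middle of the while loop: a page that was recently referenced gets another trip around the active list only if it is executable and file-backed; everything else is deactivated. A tiny user-space model of that decision follows; `struct toy_page`, its fields and the sample pages are invented, while the real kernel reads the same information via `page_referenced()`, `VM_EXEC` in `vm_flags` and `page_is_file_cache()`.

```c
#include <stdio.h>
#include <stdbool.h>

/* Invented per-page state for the classification below. */
struct toy_page {
	const char *what;
	bool referenced;
	bool exec;
	bool file_backed;
};

/* Mirrors the decision in shrink_active_list(): only referenced,
 * executable, file-backed pages stay on the active list. */
static const char *toy_classify(const struct toy_page *p)
{
	if (p->referenced && p->exec && p->file_backed)
		return "keep on active list (one more trip)";
	return "move to inactive list";
}

int main(void)
{
	struct toy_page pages[] = {
		{ "libc text page (referenced, exec, file)", true,  true,  true  },
		{ "referenced anonymous heap page",          true,  false, false },
		{ "cold page-cache page",                    false, false, true  },
	};

	for (int i = 0; i < 3; i++)
		printf("%-45s -> %s\n", pages[i].what, toy_classify(&pages[i]));
	return 0;
}
```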
## shrink_inactive_list
[shrink_inactive_list](https://elixir.bootlin.com/linux/v4.9.140/source/mm/vmscan.c#L1726)
```c=
/*
 * shrink_inactive_list() is a helper for shrink_node(). It returns the number
 * of reclaimed pages
 */
static noinline_for_stack unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
		     struct scan_control *sc, enum lru_list lru)
{
	LIST_HEAD(page_list);
	unsigned long nr_scanned;
	unsigned long nr_reclaimed = 0;
	unsigned long nr_taken;
	unsigned long nr_dirty = 0;
	unsigned long nr_congested = 0;
	unsigned long nr_unqueued_dirty = 0;
	unsigned long nr_writeback = 0;
	unsigned long nr_immediate = 0;
	isolate_mode_t isolate_mode = 0;
	int file = is_file_lru(lru);
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

	if (!inactive_reclaimable_pages(lruvec, sc, lru))
		return 0;

	while (unlikely(too_many_isolated(pgdat, file, sc))) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}

	lru_add_drain();

	if (!sc->may_unmap)
		isolate_mode |= ISOLATE_UNMAPPED;
	if (!sc->may_writepage)
		isolate_mode |= ISOLATE_CLEAN;

	spin_lock_irq(&pgdat->lru_lock);

	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
				     &nr_scanned, sc, isolate_mode, lru);

	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
	reclaim_stat->recent_scanned[file] += nr_taken;

	if (global_reclaim(sc)) {
		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
		if (current_is_kswapd())
			__count_vm_events(PGSCAN_KSWAPD, nr_scanned);
		else
			__count_vm_events(PGSCAN_DIRECT, nr_scanned);
	}
	spin_unlock_irq(&pgdat->lru_lock);

	if (nr_taken == 0)
		return 0;

	/* Actually reclaim the pages */
	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
				&nr_dirty, &nr_unqueued_dirty, &nr_congested,
				&nr_writeback, &nr_immediate,
				false);

	spin_lock_irq(&pgdat->lru_lock);

	if (global_reclaim(sc)) {
		if (current_is_kswapd())
			__count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed);
		else
			__count_vm_events(PGSTEAL_DIRECT, nr_reclaimed);
	}

	putback_inactive_pages(lruvec, &page_list);

	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);

	spin_unlock_irq(&pgdat->lru_lock);

	mem_cgroup_uncharge_list(&page_list);
	free_hot_cold_page_list(&page_list, true);

	/*
	 * If reclaim is isolating dirty pages under writeback, it implies
	 * that the long-lived page allocation rate is exceeding the page
	 * laundering rate. Either the global limits are not being effective
	 * at throttling processes due to the page distribution throughout
	 * zones or there is heavy usage of a slow backing device. The
	 * only option is to throttle from reclaim context which is not ideal
	 * as there is no guarantee the dirtying process is throttled in the
	 * same way balance_dirty_pages() manages.
	 *
	 * Once a zone is flagged ZONE_WRITEBACK, kswapd will count the number
	 * of pages under pages flagged for immediate reclaim and stall if any
	 * are encountered in the nr_immediate check below.
	 */
	if (nr_writeback && nr_writeback == nr_taken)
		set_bit(PGDAT_WRITEBACK, &pgdat->flags);

	/*
	 * Legacy memcg will stall in page writeback so avoid forcibly
	 * stalling here.
	 */
	if (sane_reclaim(sc)) {
		/*
		 * Tag a zone as congested if all the dirty pages scanned were
		 * backed by a congested BDI and wait_iff_congested will stall.
		 */
		if (nr_dirty && nr_dirty == nr_congested)
			set_bit(PGDAT_CONGESTED, &pgdat->flags);

		/*
		 * If dirty pages are scanned that are not queued for IO, it
		 * implies that flushers are not keeping up. In this case, flag
		 * the pgdat PGDAT_DIRTY and kswapd will start writing pages from
		 * reclaim context.
		 */
		if (nr_unqueued_dirty == nr_taken)
			set_bit(PGDAT_DIRTY, &pgdat->flags);

		/*
		 * If kswapd scans pages marked marked for immediate
		 * reclaim and under writeback (nr_immediate), it implies
		 * that pages are cycling through the LRU faster than
		 * they are written so also forcibly stall.
		 */
		if (nr_immediate && current_may_throttle())
			congestion_wait(BLK_RW_ASYNC, HZ/10);
	}

	/*
	 * Stall direct reclaim for IO completions if underlying BDIs or zone
	 * is congested. Allow kswapd to continue until it starts encountering
	 * unqueued dirty pages or cycling through the LRU too quickly.
	 */
	if (!sc->hibernation_mode && !current_is_kswapd() &&
	    current_may_throttle())
		wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);

	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
			nr_scanned, nr_reclaimed,
			sc->priority, file);
	return nr_reclaimed;
}
```

###### tags: `OS` `Memory Management`