LINUX KERNEL
The following is based on Linux v4.14.
Each NUMA node gets its own kswapd process: [kswapd0], [kswapd1]
kswapd() is registered as the thread's callback function
Although kswapd is created at system boot, it is only woken up when an allocation fails because memory is running short.
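A loose sketch of how the per-node thread is started (roughly kswapd_run() in mm/vmscan.c, v4.14; error handling simplified):

    /* Loose sketch of kswapd_run() (mm/vmscan.c, v4.14); simplified. */
    int kswapd_run(int nid)
    {
        pg_data_t *pgdat = NODE_DATA(nid);

        if (pgdat->kswapd)
            return 0;

        /* one kernel thread per NUMA node: [kswapd0], [kswapd1], ... */
        pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
        if (IS_ERR(pgdat->kswapd)) {
            int ret = PTR_ERR(pgdat->kswapd);

            pr_err("Failed to start kswapd on node %d\n", nid);
            pgdat->kswapd = NULL;
            return ret;
        }
        return 0;
    }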
struct scan_control
Records the state of one reclaim pass
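An abridged look at its fields (from mm/vmscan.c, v4.14; many members omitted):

    /* Abridged from mm/vmscan.c (v4.14); many members omitted. */
    struct scan_control {
        unsigned long nr_to_reclaim;            /* how many pages this pass should free  */
        gfp_t gfp_mask;                         /* GFP context of the allocation          */
        int order;                              /* order of the triggering allocation     */
        struct mem_cgroup *target_mem_cgroup;   /* set for memcg-targeted reclaim         */
        int priority;                           /* scan (lru_size >> priority) per round  */
        unsigned int may_writepage:1;           /* allowed to write back dirty pages?     */
        unsigned int may_unmap:1;               /* allowed to unmap mapped pages?         */
        unsigned int may_swap:1;                /* allowed to swap anon pages?            */
        unsigned long nr_scanned;               /* pages scanned so far                   */
        unsigned long nr_reclaimed;             /* pages reclaimed so far                 */
    };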
gfp_
https://lwn.net/Articles/23042/
gfp = get free pages
Allocation flags consist of two parts: zone modifiers (which zone the pages may come from) and action modifiers (how the allocator is allowed to behave)
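For illustration (paraphrased from include/linux/gfp.h, so treat the exact values as approximate): a mask like GFP_KERNEL is built purely from action modifiers, while zone modifiers such as __GFP_HIGHMEM or __GFP_DMA select where the pages may come from:

    /* Illustration; the authoritative definitions live in include/linux/gfp.h. */
    #define GFP_KERNEL   (__GFP_RECLAIM | __GFP_IO | __GFP_FS)  /* action modifiers only          */
    #define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)             /* adds a zone modifier (highmem) */
    #define GFP_DMA      __GFP_DMA                              /* zone modifier: DMA zone        */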
Walks over every memory cgroup
The core reclaim function; both kswapd and direct reclaim eventually end up here
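I read this as shrink_node(); roughly, the two paths converge like this in v4.14 (call chain simplified):

    direct reclaim: alloc_pages() -> ... -> try_to_free_pages() -> shrink_node()
    kswapd:         kswapd() -> balance_pgdat() -> kswapd_shrink_node() -> shrink_node()
    then:           shrink_node() -> shrink_node_memcg() (per memcg) -> shrink_list()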
A fairly important function in the above is inactive_list_is_low
It decides whether the inactive list is large enough; if it returns true, the active list gets shrunk
The code comments mention that the kernel uses the refault ratio to decide the size of the inactive list.
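The sizing heuristic itself (taken from the comment above inactive_list_is_low() in v4.14) targets roughly inactive * int_sqrt(10 * memory_in_GB) = active for the file list; a small stand-alone calculation of that target, assuming that formula:

    /* Reproduces the inactive:active target ratio from the comment above
     * inactive_list_is_low() (v4.14): ratio = int_sqrt(10 * GB of LRU memory),
     * and the inactive list counts as "low" once inactive * ratio < active. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        unsigned long sizes_gb[] = { 1, 10, 100, 1024 };

        for (int i = 0; i < 4; i++) {
            unsigned long gb = sizes_gb[i];
            unsigned long ratio = (unsigned long)sqrt(10.0 * gb);

            if (!ratio)
                ratio = 1;
            /* e.g. 1 GB -> ratio 3 -> inactive target ~1/4 of the file LRU */
            printf("%5lu GB -> ratio %3lu -> inactive target ~ 1/%lu of the list\n",
                   gb, ratio, ratio + 1);
        }
        return 0;
    }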
https://zhuanlan.zhihu.com/p/421298579
When new file pages keep being generated and added to the head of the inactive file LRU while the kernel keeps reclaiming from its tail, and enough new file pages arrive to keep the inactive LRU long enough to always satisfy the reclaim target, shrinking of the active LRU never gets triggered. In the extreme case the active LRU becomes isolated: even if it holds many very old pages, they never get a chance to be moved to the inactive LRU. The workingset_activate statistic can account for and help improve this situation.
In shrink_page_list(), page_check_references() is used to decide where each page should go.
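Its possible verdicts (the enum from mm/vmscan.c, v4.14; the comments are my own summary):

    enum page_references {
        PAGEREF_RECLAIM,        /* not referenced: reclaim it                 */
        PAGEREF_RECLAIM_CLEAN,  /* reclaim it only if it is already clean     */
        PAGEREF_KEEP,           /* keep it on the inactive list for now       */
        PAGEREF_ACTIVATE,       /* referenced enough: move it back to active  */
    };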
active -> inactive
Originally there was one set of LRU lists per zone; they are now maintained per NUMA node
High memory (highmem) is used when the size of physical memory approaches or exceeds the maximum size of virtual memory. At that point it becomes impossible for the kernel to keep all of the available physical memory mapped at all times. This means the kernel needs to start using temporary mappings of the pieces of physical memory that it wants to access.
On a 64-bit system you can get the kernel virtual address simply by masking in 0xffff000000000000 (phys -> virt), because physical memory is always smaller than 0xffff000000000000. On a 32-bit system the mask is 0xC0000000, so with 4 GB of RAM the remaining 3 GB cannot be reached this way (the address blows up after masking).
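A minimal user-space sketch of the offset arithmetic (the PAGE_OFFSET value here is only an assumption; the real one depends on architecture and config, see __va()/__pa() in the kernel):

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed direct-map base; illustrative only. */
    #define PAGE_OFFSET 0xffff000000000000ULL

    static uint64_t phys_to_virt(uint64_t pa) { return pa + PAGE_OFFSET; }
    static uint64_t virt_to_phys(uint64_t va) { return va - PAGE_OFFSET; }

    int main(void)
    {
        uint64_t pa = 0x12345000;       /* some physical address */
        uint64_t va = phys_to_virt(pa);

        printf("phys 0x%llx <-> virt 0x%llx\n",
               (unsigned long long)pa, (unsigned long long)va);
        printf("back to phys: 0x%llx\n", (unsigned long long)virt_to_phys(va));
        return 0;
    }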
The kernel sets up the direct mapping for this
https://blog.csdn.net/weixin_42730667/article/details/123438959
page fault path
for anon page
for file page
For an anon page, SetPageActive() is called before it is added to the LRU list; for a file page it is not
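A sketch of the anon side of this, roughly following lru_cache_add_active_or_unevictable() in mm/swap.c (v4.14); the mlock handling is trimmed:

    void lru_cache_add_active_or_unevictable(struct page *page,
                                             struct vm_area_struct *vma)
    {
        /* newly faulted anon pages start out active (unless mlocked) */
        if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
            SetPageActive(page);
        /* ... mlocked pages take a different path ... */
        lru_cache_add(page);
    }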
https://lwn.net/Articles/815342/ based on v5.5
Anon pages are now placed on the inactive list when faulted in, but this has a drawback: if the inactive list is too short, most pages will be evicted before they get a chance to be promoted to the active list.
The kernel handles this case by putting newly faulted, file-backed pages directly onto the inactive list; they will only move to the active list if they are accessed again before being reclaimed.
If every page were added to the active/inactive lists one at a time, contention on the LRU locks would be severe, so the kernel keeps per-CPU LRU caches; only once a cache has accumulated enough pages does it take the lock and move them onto the list in one batch (see the sketch after the list below).
There are six per-CPU caches in total, one for each situation:
__pagevec_lru_add_fn
pagevec_move_tail
lru_deactivate_file_fn, for file pages that are actively released from outside, i.e. drop_caches
lru_deactivate_fn, deactivate a page
lru_lazyfree_fn, make an anon page lazyfree to speed up reclaiming it
__activate_page
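The batching idea for the plain add case, roughly __lru_cache_add() in mm/swap.c (v4.14), slightly simplified:

    static void __lru_cache_add(struct page *page)
    {
        struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

        get_page(page);
        /* pagevec_add() returns 0 once the per-CPU vector is full; only
         * then is the LRU lock taken, inside __pagevec_lru_add() */
        if (!pagevec_add(pvec, page) || PageCompound(page))
            __pagevec_lru_add(pvec);
        put_cpu_var(lru_add_pvec);
    }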
https://lwn.net/Articles/495543/
At this point the kernel only does this (refault tracking) for file pages
The other change addresses the fact that refault tracking, in current kernels, is only done for the file-backed LRU list. Once an anonymous page is reclaimed, the kernel forgets about its history. As it turns out, the previous change (faulting pages into the inactive list) can exacerbate some bad behavior: adding new pages to the inactive list can quickly push out other pages that were just faulted in before they can be accessed a second time and promoted to the active list. If refault tracking were done for the anonymous LRU list, this situation could be detected and dealt with
As a general rule, refaults indicate thrashing, which is not a good thing. The kernel can respond to excessive refaulting by, for example, making the active list larger.
Some pages may be repeatedly faulted in and evicted again, which causes thrashing (since file pages start out on the inactive list, is this problem more severe for file pages?)
When a file page is faulted in, the kernel checks whether a shadow entry exists for it; if it does, the page is brought straight onto the active list
The size of the active file list is treated as the working-set size; if the refault distance is greater than the working-set size nothing is done, since the page would not fit anyway
A subset of refaults: incremented (++) only when the page really is moved onto the active list
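A rough sketch of that decision, loosely following workingset_refault() in mm/workingset.c (v4.14); the shadow-entry unpacking and memcg lookup are omitted, so treat the surrounding details as approximate:

    bool workingset_refault(void *shadow)
    {
        unsigned long refault_distance, refault, eviction, active_file;
        struct lruvec *lruvec;

        /* ... unpack the shadow entry into eviction / lruvec and read the
         *     lruvec's current inactive_age counter into refault ... */

        inc_lruvec_state(lruvec, WORKINGSET_REFAULT);

        refault_distance = (refault - eviction) & EVICTION_MASK;
        active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);

        /* the page would have stayed resident if the inactive list had been
         * larger by the size of the active list -> activate it right away */
        if (refault_distance <= active_file) {
            inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
            return true;
        }
        return false;
    }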