
Linux 核心 Copy On Write - Memory Region

contributed by < linD026 >

tags: Linux kernel COW , linux2021

All source code quoted in this article is taken from Linux kernel v5.10, and architecture-specific code is based on arm.
Previous article: Linux 核心 Copy On Write 實作機制


Chapter 4 Process Address Space
cs.columbia.edu - W4118: Linux memory management







(Figure: structure relationships. struct mm_struct instances are chained together (mm1 → mm2 → mm3). An mm_struct reaches its VMAs through mmap; the struct vm_area_struct entries are chained by vm_next and point back to their owner through vm_mm. A file-backed VMA holds its struct vm_operations_struct in vm_ops and its struct file in vm_file; file->f_dentry leads to the struct dentry, dentry->d_inode to the struct inode, and inode->i_mapping to the struct address_space. The address_space points back to the mapping VMAs through i_mmap, to its struct address_space_operations through a_ops, and every struct page of the file references the address_space through its mapping field.)




4.2 Managing the Address Space

If a region is backed by a file, its vm_file field will be set. By traversing vm_file->f_dentry->d_inode->i_mapping, the associated address_space for the region may be obtained. The address_space has all the filesystem specific information required to perform page-based operations on disk.
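Note that in a 5.10 tree struct file no longer has an f_dentry member; the same address_space is reached through the inode helper instead, roughly:

/* older kernels: vma->vm_file->f_dentry->d_inode->i_mapping
 * 5.10:          use the file_inode() helper from <linux/fs.h> */
struct address_space *mapping = file_inode(vma->vm_file)->i_mapping;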


mm_struct

/include/linux/mm_types.h - struct mm_struct







mm_struct (fields, abridged):

| type | name |
|------|------|
| struct vm_area_struct | *mmap |
| struct rb_root | mm_rb |
| unsigned long | task_size |
| unsigned long | highest_vm_end |
| pgd_t * | pgd |
| atomic_t | mm_users |
| atomic_t | mm_count |
| atomic_t | has_pinned |
| seqcount_t | write_protect_seq |
| int | map_count |
| spinlock_t | page_table_lock |
| struct rw_semaphore | mmap_lock |
| struct list_head | mmlist |
| ... | ... |



mm_struct (annotated version)

| type | name | description |
|------|------|-------------|
| struct vm_area_struct | *mmap | list of VMAs |
| struct rb_root | mm_rb | |
| u64 | vmacache_seqnum | per-thread vmacache |
| unsigned long | task_size | size of task vm space |
| unsigned long | highest_vm_end | highest vma end address |
| atomic_t | mm_users | The number of users including userspace |
| atomic_t | mm_count | The number of references to &struct mm_struct (mm_users count as 1) |
| atomic_t | has_pinned | Whether this mm has pinned any pages |
| seqcount_t | write_protect_seq | Locked when any thread is write protecting pages mapped by this mm to enforce a later COW, for instance during page table copying for fork() |
| int | map_count | number of VMAs |
| spinlock_t | page_table_lock | Protects page tables and some counters |
| struct rw_semaphore | mmap_lock | |
| struct list_head | mmlist | List of maybe swapped mm's |
| unsigned long | hiwater_rss | High watermark of RSS usage |
| unsigned long | hiwater_vm | High-water virtual memory usage |
| unsigned long | total_vm | Total pages mapped |
| unsigned long | locked_vm | Pages that have PG_mlocked set |
| unsigned long | data_vm | VM_WRITE & ~VM_SHARED & ~VM_STACK |
| unsigned long | exec_vm | VM_EXEC & ~VM_WRITE & ~VM_STACK |
| unsigned long | stack_vm | VM_STACK |
| unsigned long | def_flags | |
| atomic64_t | pinned_vm | Refcount permanently increased |
| spinlock_t | arg_lock | protects the start/end fields below |
| unsigned long | start_code, end_code | pointers of the code section |
| unsigned long | start_brk, brk | pointers of the heap region |
| unsigned long | start_stack | pointer of the stack region |
| unsigned long | arg_start, arg_end | pointers of the command line arguments |
| unsigned long | env_start, env_end | pointers of the environment variables |
| unsigned long | saved_auxv[AT_VECTOR_SIZE] | for /proc/PID/auxv |



Linux kernel 5.10.35

In Linux, both processes and threads are described by the task_struct structure, and both are created with the clone system call; the difference lies in how resources are shared.
The process address space of a task_struct is managed by an mm_struct, reached through task_struct->mm. The difference between a process and a thread is that the task_structs of threads point to the same mm_struct.
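A userspace sketch of that difference: passing CLONE_VM makes the child share the caller's mm_struct (thread-like), while omitting it gives the child its own copy (process-like). A minimal, hedged glibc clone(2) example, error handling omitted:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

static int shared_value;

static int child_fn(void *arg)
{
    shared_value = 42;   /* visible to the parent only because CLONE_VM was used */
    return 0;
}

int main(void)
{
    char *stack = malloc(64 * 1024);

    /* thread-like: share the address space (same mm_struct) */
    pid_t pid = clone(child_fn, stack + 64 * 1024, CLONE_VM | SIGCHLD, NULL);
    waitpid(pid, NULL, 0);

    free(stack);
    return 0;
}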

Further reading - syscall / system calls

lwn.net - Anatomy of a system call, part 1 / part 2

In the Linux kernel, system calls are defined with SYSCALL_DEFINEn(). For example:

SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    struct fd f = fdget_pos(fd);
    ssize_t ret = -EBADF;
    /* ... */

This then expands into several definitions, such as SYSCALL_METADATA, which stores information about the system call (name, arguments, and so on), plus the related functions.

asmlinkage long sys_read(unsigned int fd, char __user * buf, size_t count)
    __attribute__((alias(__stringify(SyS_read))));

static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count);
asmlinkage long SyS_read(long int fd, long int buf, long int count);

asmlinkage long SyS_read(long int fd, long int buf, long int count)
{
    long ret = SYSC_read((unsigned int) fd, (char __user *) buf, (size_t) count);
    asmlinkage_protect(3, ret, fd, buf, count);
    return ret;
}

static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count)
{
    struct fd f = fdget_pos(fd);
    ssize_t ret = -EBADF;
    /* ... */

To let userspace invoke a syscall, the kernel maintains sys_call_table and dispatches by index. On x86 the index values are defined in syscall_64.tbl, while most other architectures define them like this:

#define __NR_read 63
__SYSCALL(__NR_read, sys_read)

// dispatching read through sys_call_table then looks roughly like:
syscall_read_ptr = sys_call_table[__NR_read];
/* ... */
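From userspace, the same table index can be exercised through the generic syscall(2) wrapper; a minimal sketch (glibc, error handling omitted):

#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>

int main(void)
{
    char buf[16];

    /* invoke read(2) by its syscall number instead of the libc wrapper */
    long n = syscall(SYS_read, 0, buf, sizeof(buf));
    printf("read %ld bytes\n", n);
    return 0;
}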

For details, see the lwn.net articles listed above, or the official documentation: kernel.org - Adding a New System Call

mm_users / mm_count

mm_struct has two atomic_t counters, mm_users and mm_count, which record how many users the mm_struct has. mm_users counts how many processes use the userspace portion of this mm_struct, i.e. the page tables, file mappings, and so on. mm_count is a reference count of its "anonymous users". The difference is that when mm_users drops to zero, exit_mmap() removes all mappings and tears down the page tables, while mm_count ensures the mm_struct itself is freed only once no references remain.
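A rough sketch of how the two counters are typically taken and released with the helpers from include/linux/sched/mm.h (task here is a placeholder for some struct task_struct *):

/* sketch: holding another task's mm (cf. get_task_mm() in kernel/fork.c) */
struct mm_struct *mm = get_task_mm(task);  /* bumps mm_users if the task has an mm */

if (mm) {
        mmgrab(mm);       /* also take an mm_count reference ("anonymous user") */

        /* ... use the userspace mappings here ... */

        mmput(mm);        /* drop mm_users; at 0, exit_mmap() tears the mappings down */

        /* the struct itself is still valid thanks to mm_count */
        mmdrop(mm);       /* drop mm_count; at 0, the mm_struct is freed */
}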

In mm_init(), the function that initializes an mm_struct, both counters are initialized to one:

  • /kernel/fork.c

    static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
        struct user_namespace *user_ns)
    {
        atomic_set(&mm->mm_users, 1);
        atomic_set(&mm->mm_count, 1);
        /* ... */
    }
    

Linux kernel atomics and C11 atomics
Long before the C11 standard existed, the Linux kernel already needed atomic operations and had defined its own, implemented in architecture-specific code.
In 2014 and again in 2016 the Linux community discussed switching to C11 atomics, but C11 atomics were not yet mature at the time (2014) and the C11 memory model does not match the Linux memory model, so the switch was not made.
Related articles:

Further reading
RCU, a mechanism closely tied to the memory model, is also mentioned in the articles above. For a description of RCU, see Linux 核心設計: RCU 同步機制, or the official document What is RCU? "Read, Copy, Update".

kernel portion

A unique mm_struct is not needed for kernel threads as they will never page fault or access the userspace portion. The only exception is page faulting within the vmalloc space. The page fault handling code treats this as a special case and updates the current page table with information in the master page table. As a mm_struct is not needed for kernel threads, the task_struct->mm field for kernel threads is always NULL. For some tasks such as the boot idle task, the mm_struct is never setup but for kernel threads, a call to daemonize() will call exit_mm() to decrement the usage counter.

Lazy TLB - TLB flush

Flushing TLB in an SMP environment

At context switch, because virtual-to-physical translations differ between tasks, the TLB is flushed so that the incoming task is not affected by stale entries left behind by the previous one. Flushing has a large performance impact, so the idea of lazy TLB was proposed.

As TLB flushes are extremely expensive, especially with architectures such as the PPC, a technique called lazy TLB is employed which avoids unnecessary TLB flushes by processes which do not access the userspace page tables as the kernel portion of the address space is always visible. The call to switch_mm(), which results in a TLB flush, is avoided by “borrowing” the mm_struct used by the previous task and placing it in task_struct→active_mm. This technique has made large improvements to context switches times.

Every process shares the same kernel space while their userspaces differ. With lazy TLB, when the switch is to a task that only runs in the kernel portion, the next task simply borrows the previous task's TLB, under the premise that it never touches userspace, instead of performing an expensive TLB flush.

TLB contents are usually treated as a per-core cache, so when a page table entry cached in a TLB changes, the TLBs must be kept coherent across cores (TLB coherence). On SMP, even a TLB flush performed on a single core must notify the other cores through an IPI (Inter-Processor Interrupt) that the TLB entry has been invalidated, and since IPIs are expensive, performance suffers badly.

Receiving an IPI is as expensive as receiving any other IRQ (e.g. causes a full pipeline flush, followed by IDT and GDT lookups and protection checks, followed by the overhead of the interrupt handler itself). If a CPU doing useful work sends an average of 10 IPIs per second, then 32 CPUs will probably send an average of 320 IPIs per second. If those IPIs are received by all CPUs except one; then for 2 CPUs you'd get 10 IPI received per second, for 4 CPUs you'd get "4 * 3 *10 = 120" IPIs received per second, and for 128 CPUs you'd get "128 * 127 * 10 = 162560" IPIs received per second. It ends up being exponential overhead (a scalability nightmare for large systems). Basically, it's important to avoid unnecessary IPIs.

The lazy TLB approach avoids sending the IPI; instead, when a stale TLB entry is eventually used, the resulting page fault lets the core mark the entry invalid itself.

When entering lazy TLB, the function enter_lazy_tlb() is called to ensure that a mm_struct is not shared between processors in SMP machines, making it a NULL operation on UP machines. The second time use of lazy TLB is during process exit when start_lazy_tlb() is used briefly while the process is waiting to be reaped by the parent.

The lazy-TLB case above, where a task running only in the kernel portion avoids unnecessary TLB flushes by never touching userspace page tables, is a good example of why two user counters are needed: in some situations the page tables and mappings are no longer needed (or have already been torn down), yet the mm_struct itself must stay around.

mm_count is a reference count of the “anonymous users” for the mm_struct initialised at 1 for the “real” user. An anonymous user is one that does not necessarily care about the userspace portion and is just borrowing the mm_struct. Example users are kernel threads which use lazy TLB switching. When this count drops to 0, the mm_struct can be safely destroyed. Both reference counts exist because anonymous users need the mm_struct to exist even if the userspace mappings get destroyed and there is no point delaying the teardown of the page tables.

Further reading

seqcount_t write_protect_seq

/**
 * @write_protect_seq: Locked when any thread is write
 * protecting pages mapped by this mm to enforce a later COW,
 * for instance during page table copying for fork().
 */
seqcount_t write_protect_seq;

According to kernel.org - seqlock:

Sequence counters are a reader-writer consistency mechanism with lockless readers (read-only retry loops), and no writer starvation. They are used for data that’s rarely written to (e.g. system time), where the reader wants a consistent set of information and is willing to retry if that information changes.

In other words, it is a reader-writer mechanism that allows lockless readers.

A data set is consistent when the sequence count at the beginning of the read side critical section is even and the same sequence count value is read again at the end of the critical section. The data in the set must be copied out inside the read side critical section. If the sequence count has changed between the start and the end of the critical section, the reader must retry.

Writers increment the sequence count at the start and the end of their critical section. After starting the critical section the sequence count is odd and indicates to the readers that an update is in progress. At the end of the write side critical section the sequence count becomes even again which lets readers make progress.

A sequence counter write side critical section must never be preempted or interrupted by read side sections. Otherwise the reader will spin for the entire scheduler tick due to the odd sequence count value and the interrupted writer. If that reader belongs to a real-time scheduling class, it can spin forever and the kernel will livelock.

This mechanism cannot be used if the protected data contains pointers, as the writer can invalidate a pointer that the reader is following.

  • initial
/* dynamic */
seqcount_t foo_seqcount;
seqcount_init(&foo_seqcount);

/* static */
static seqcount_t foo_seqcount = SEQCNT_ZERO(foo_seqcount);

/* C99 struct init */
struct {
        .seq   = SEQCNT_ZERO(foo.seq),
} foo;
  • write path
/* Serialized context with disabled preemption */

write_seqcount_begin(&foo_seqcount);

/* ... [[write-side critical section]] ... */

write_seqcount_end(&foo_seqcount);
  • read path
do {
        seq = read_seqcount_begin(&foo_seqcount);

        /* ... [[read-side critical section]] ... */

} while (read_seqcount_retry(&foo_seqcount, seq));
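Putting the three fragments together, a small self-contained example of a seqcount protecting a pair of counters could look like the following; foo_stat, foo_seq, foo_update() and foo_read() are made-up names for illustration:

/* sketch: a seqcount protecting a small, copyable pair of counters */
static struct { unsigned long a, b; } foo_stat;
static seqcount_t foo_seq = SEQCNT_ZERO(foo_seq);

/* writer: runs in a serialized context with preemption disabled */
static void foo_update(unsigned long a, unsigned long b)
{
        write_seqcount_begin(&foo_seq);
        foo_stat.a = a;
        foo_stat.b = b;
        write_seqcount_end(&foo_seq);
}

/* reader: lockless; copies the data out and retries if a writer ran concurrently */
static void foo_read(unsigned long *a, unsigned long *b)
{
        unsigned int seq;

        do {
                seq = read_seqcount_begin(&foo_seq);
                *a = foo_stat.a;
                *b = foo_stat.b;
        } while (read_seqcount_retry(&foo_seq, seq));
}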

The mm_struct also contains locks such as the spinlock_t page_table_lock and the struct rw_semaphore mmap_lock:

mmap_sem This is a long lived lock which protects the VMA list for readers and writers. As users of this lock require it for a long time and may need to sleep, a spinlock is inappropriate. A reader of the list takes this semaphore with down_read(). If they need to write, it is taken with down_write() and the page_table_lock spinlock is later acquired while the VMA linked lists are being updated;

page_table_lock This protects most fields on the mm_struct. As well as the page tables, it protects the RSS (see below) count and the VMA from modification;

mmap_sem was renamed to mmap_lock in kernel 5.8.

The corresponding comment in the 5.7 kernel is:

/*
 * The caller must hold down_write(&current->mm->mmap_sem).
 */

do_mmap() is called via do_mmap_pgoff() (in "include/linux/mm.h") via vm_mmap_pgoff() (in "mm/util.c") via ksys_mmap_pgoff() (in "mm/mmap.c") via the mmap_pgoff() syscall handler (in "mm/mmap.c"). (N.B. From kernel version 5.9 onwards, do_mmap_pgoff() is eliminated and do_mmap() is called directly from vm_mmap_pgoff().) The mmap lock is write-locked in vm_mmap_pgoff().
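In 5.10 the lock is normally taken through the wrappers from include/linux/mmap_lock.h rather than by touching the rw_semaphore directly; a rough usage sketch:

/* reader: walking the VMA list/tree of current->mm */
mmap_read_lock(current->mm);
/* ... find_vma(), follow vma->vm_next, ... */
mmap_read_unlock(current->mm);

/* writer: changing the set of VMAs (mmap/munmap/brk paths) */
mmap_write_lock(current->mm);
/* ... insert or remove VMAs; take page_table_lock while editing page tables ... */
mmap_write_unlock(current->mm);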

Further reading
Linux 核心設計: RCU 同步機制 also covers lock-related material, such as the basic principles of spinlock, rwlock, and seqlock.
For concurrency and RCU more broadly, see Is Parallel Programming Hard, And, If So, What Can You Do About It? by Paul E. McKenney (the developer of the Linux kernel RCU mechanism).

Resident Set Size ( RSS )

Resident Set Size (RSS) is the number of resident pages for this process. It should be noted that the global zero page is not accounted for by RSS;

/*
 * Special counters, in some configurations protected by the
 * page_table_lock, in other configurations by being atomic.
 */
struct mm_rss_stat rss_stat;

randomize layout

lwn.net - Randomizing structure layout

struct mm_struct {
    struct {
        /* ... */
    } __randomize_layout;
    /* ... */
};

c11 - 6.7.2.1 Structure and union specifiers - 13
An unnamed member of structure type with no tag is called an anonymous structure; an unnamed member of union type with no tag is called an anonymous union. The members of an anonymous structure or union are considered to be members of the containing structure or union. This applies recursively if the containing structure or union is also anonymous.

Therefore the following is legal:

struct a {
    struct { int b; };  /* anonymous struct: its members belong to struct a */
};

int main(void)
{
    struct a A;
    A.b = 1;            /* b is accessed as if it were a direct member of struct a */
    return 0;
}

stackoverflow - C: How to access different types of anonymous or unnamed nested structs explains this in more detail.

Further reading - Kernel Address Space Layout Randomization
ASLR raises the bar for attackers by making object addresses random rather than fixed. The Linux kernel has the same mechanism: KASLR is enabled when the kernel boot parameters do not include nokaslr. With KASLR enabled, the symbols in /boot/System.map (static) differ from those in /proc/kallsyms.

/proc/kallsyms (since Linux 2.5.71)
      This holds the kernel exported symbol definitions used by
      the modules(X) tools to dynamically link and bind loadable
      modules.  In Linux 2.5.47 and earlier, a similar file with
      slightly different syntax was named ksyms.
$ grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
$ sudo grep sys_call_table /boot/System.map-$(uname -r)
ffffffff82000300 R sys_call_table
$ sudo grep sys_call_table /proc/kallsyms
ffffffff820013a0 R sys_call_table
# Reboot
$ sudo grep sys_call_table /boot/System.map-$(uname -r)
ffffffff82000300 R sys_call_table
$ sudo grep sys_call_table /proc/kallsyms 
ffffffff86400300 R sys_call_table

Related article: Kernel Address Space Randomization in Linux or how I made Volatility bruteforce the page tables

setup

  • /mm/init-mm.c

    /*
     * For dynamically allocated mm_structs, there is a dynamically sized cpumask
     * at the end of the structure, the size of which depends on the maximum CPU
     * number the system can see. That way we allocate only as much memory for
     * mm_cpumask() as needed for the hundreds, or thousands of processes that
     * a system typically runs.
     *
     * Since there is only one init_mm in the entire system, keep it simple
     * and size this cpu_bitmask to NR_CPUS.
     */
    struct mm_struct init_mm = {
        .mm_rb             = RB_ROOT,
        .pgd               = swapper_pg_dir,
        .mm_users          = ATOMIC_INIT(2),
        .mm_count          = ATOMIC_INIT(1),
        .write_protect_seq = SEQCNT_ZERO(init_mm.write_protect_seq),
        MMAP_LOCK_INITIALIZER(init_mm)
        .page_table_lock   = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
        .arg_lock          = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
        .mmlist            = LIST_HEAD_INIT(init_mm.mmlist),
        .user_ns           = &init_user_ns,
        .cpu_bitmap        = CPU_BITS_NONE,
        INIT_MM_CONTEXT(init_mm)
    };
    
  • /init/main.c - Set up kernel memory allocators

arm - setup_arch

TODO allocator

| Function | Description |
|----------|-------------|
| mm_init() | Initialises a mm_struct by setting starting values for each field, allocating a PGD, initialising spinlocks etc. |
| allocate_mm() | Allocates a mm_struct from the slab allocator |
| mm_alloc() | Allocates a mm_struct using allocate_mm() and calls mm_init() to initialise it |
| exit_mmap() | Walks through a mm_struct and unmaps all VMAs associated with it |
| copy_mm() | Makes an exact copy of the current task's mm_struct for a new task. This is only used during fork |
| free_mm() | Returns the mm_struct to the slab allocator |

Two functions are provided to allocate a mm_struct. To be slightly confusing, they are essentially the same but with small important differences. allocate_mm() is just a preprocessor macro which allocates a mm_struct from the slab allocator. mm_alloc() allocates from slab and then calls mm_init() to initialise it.

While a new user increments the usage count with atomic_inc(&mm->mm_users), it is decremented with a call to mmput(). If the mm_users count reaches zero, all the mapped regions are destroyed with exit_mmap() and the page tables destroyed as there is no longer any users of the userspace portions. The mm_count count is decremented with mmdrop() as all the users of the page tables and VMAs are counted as one mm_struct user. When mm_count reaches zero, the mm_struct will be destroyed.
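Abridged from kernel/fork.c and include/linux/sched/mm.h in 5.10, the two teardown paths look roughly like this:

/* kernel/fork.c (abridged) */
void mmput(struct mm_struct *mm)
{
        might_sleep();

        if (atomic_dec_and_test(&mm->mm_users))
                __mmput(mm);    /* exit_mmap(), then mmdrop() for the base reference */
}

/* include/linux/sched/mm.h (abridged) */
static inline void mmdrop(struct mm_struct *mm)
{
        if (unlikely(atomic_dec_and_test(&mm->mm_count)))
                __mmdrop(mm);   /* free the PGD, destroy the context, free_mm() */
}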

Initialization

  • /kernel/fork.c

    /*
     * Allocate and initialize an mm_struct.
     */
    struct mm_struct *mm_alloc(void)
    {
        struct mm_struct *mm;

        mm = allocate_mm();
        if (!mm)
            return NULL;

        memset(mm, 0, sizeof(*mm));
        return mm_init(mm, current, current_user_ns());
    }
    

Copy

  • /kernel/fork.c

    static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
    {
        struct mm_struct *mm, *oldmm;
        int retval;

        tsk->min_flt = tsk->maj_flt = 0;
        tsk->nvcsw = tsk->nivcsw = 0;
    #ifdef CONFIG_DETECT_HUNG_TASK
        tsk->last_switch_count = tsk->nvcsw + tsk->nivcsw;
        tsk->last_switch_time = 0;
    #endif

        tsk->mm = NULL;
        tsk->active_mm = NULL;

        /*
         * Are we cloning a kernel thread?
         *
         * We need to steal a active VM for that..
         */
        oldmm = current->mm;
        if (!oldmm)
            return 0;

        /* initialize the new vmacache entries */
        vmacache_flush(tsk);

        if (clone_flags & CLONE_VM) {
            mmget(oldmm);
            mm = oldmm;
            goto good_mm;
        }

        retval = -ENOMEM;
        mm = dup_mm(tsk, current->mm);
        if (!mm)
            goto fail_nomem;

    good_mm:
        tsk->mm = mm;
        tsk->active_mm = mm;
        return 0;

    fail_nomem:
        return retval;
    }
    

vm_area_struct

/include/linux/mm_types.h

/*
 * This struct describes a virtual memory area. There is one of these
 * per VM-area/task. A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
	/* The first cache line has the info for VMA tree walking. */

	unsigned long vm_start;		/* Our start address within vm_mm. */
	unsigned long vm_end;		/* The first byte after our end address
					   within vm_mm. */

	/* linked list of VM areas per task, sorted by address */
	struct vm_area_struct *vm_next, *vm_prev;

	struct rb_node vm_rb;

	/*
	 * Largest free memory gap in bytes to the left of this VMA.
	 * Either between this VMA and vma->vm_prev, or between one of the
	 * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
	 * get_unmapped_area find a free area of the right size.
	 */
	unsigned long rb_subtree_gap;

	/* Second cache line starts here. */

	struct mm_struct *vm_mm;	/* The address space we belong to. */

	/*
	 * Access permissions of this VMA.
	 * See vmf_insert_mixed_prot() for discussion.
	 */
	pgprot_t vm_page_prot;
	unsigned long vm_flags;		/* Flags, see mm.h. */

	/*
	 * For areas with an address space and backing store,
	 * linkage into the address_space->i_mmap interval tree.
	 */
	struct {
		struct rb_node rb;
		unsigned long rb_subtree_last;
	} shared;

	/*
	 * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
	 * list, after a COW of one of the file pages.	A MAP_SHARED vma
	 * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
	 * or brk vma (with NULL file) can only be in an anon_vma list.
	 */
	struct list_head anon_vma_chain; /* Serialized by mmap_lock &
					  * page_table_lock */
	struct anon_vma *anon_vma;	/* Serialized by page_table_lock */

	/* Function pointers to deal with this struct. */
	const struct vm_operations_struct *vm_ops;

	/* Information about our backing store: */
	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
					   units */
	struct file * vm_file;		/* File we map to (can be NULL). */
	void * vm_private_data;		/* was vm_pte (shared mem) */

#ifdef CONFIG_SWAP
	atomic_long_t swap_readahead_info;
#endif
#ifndef CONFIG_MMU
	struct vm_region *vm_region;	/* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
#endif
	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
} __randomize_layout;

Flag

/*
 * vm_flags in vm_area_struct, see mm_types.h.
 * When changing, update also include/trace/events/mmflags.h
 */
#define VM_NONE 	0x00000000

#define VM_READ 	0x00000001	/* currently active flags */
#define VM_WRITE	0x00000002
#define VM_EXEC 	0x00000004
#define VM_SHARED	0x00000008

Red Black Tree and Linked list

VMA structures are read very frequently (page faults, mmap, and so on), and when there are many mapped regions, looking up a particular virtual address through a linked list wastes a lot of time on traversal. A data structure with more efficient lookup is therefore needed, such as the red-black tree (rbtree). A red-black tree is a self-balancing tree that stores orderable data and uses its red/black coloring rules to keep the tree height within O(lg n). For the rbtree operations, see Red-black Trees (rbtree) in Linux.

Why not use the AVL tree, which has the same property and is likewise O(lg n)?

In terms of tree height, both the AVL tree and the rbtree are O(lg n), but the AVL bound is tighter, roughly 1.44 lg(n + 2), whereas the rbtree bound is 2 lg(n + 1). So for searches and other operations bounded by the tree height, an AVL tree is slightly faster than an rbtree, but that comes at the cost of insert and delete needing more rotations to maintain the stricter balance (an rbtree needs only a small constant number of rotations per update, while AVL rebalancing can cascade up the tree). For general-purpose use the rbtree is therefore usually chosen, while AVL trees tend to appear in lookup-heavy settings such as databases.

stackoverflow - Red black tree over avl tree

Another argument is that AVL trees do not provide an amortized update-cost bound, while rbtrees do.

SortedSequences - section 7.4

For actual measurements, see Performance Analysis of BSTs in System Software:

The results indicate that when input is expected to be randomly ordered with occasional runs of sorted order, red-black trees are preferred; when insertions often occur in sorted order, AVL trees excel for later random access, whereas splay trees perform best for later sequential or clustered access.

In an address space, all the VMA structures are also linked together in a linear linked list, which allows the whole address space to be traversed.

All the VMAs in an address space are linked together in an address-ordered singly linked list via this field. It is interesting to note that the VMA list is one of the very rare cases where a singly linked list is used in the kernel;
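To make the lookup concrete, here is a simplified sketch of how 5.10's find_vma() (mm/mmap.c) walks mm->mm_rb for the first VMA whose vm_end lies above an address; the per-thread vmacache fast path is omitted:

struct vm_area_struct *find_vma_sketch(struct mm_struct *mm, unsigned long addr)
{
        struct rb_node *node = mm->mm_rb.rb_node;
        struct vm_area_struct *vma = NULL;

        while (node) {
                struct vm_area_struct *tmp =
                        rb_entry(node, struct vm_area_struct, vm_rb);

                if (tmp->vm_end > addr) {
                        vma = tmp;              /* lowest VMA ending above addr so far */
                        if (tmp->vm_start <= addr)
                                break;          /* addr falls inside this VMA */
                        node = node->rb_left;
                } else {
                        node = node->rb_right;
                }
        }
        return vma;                             /* may cover addr or lie just above it */
}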

The RCU-safe Maple Tree that replaces the rbtree

[PATCH v4 00/66] Introducing the Maple Tree

Linux is a virtual-memory system. The address space for each process contains multiple virtual memory areas represented by vm_area_struct structures. Each VMA represents a contiguous block of address space and represents a range of memory of the same type, which may be anonymous memory (not backed by a file), a memory-mapped file, or even device memory. A virtual memory area, as seen from the process, is contiguous, while the supporting physical memory may not be. In addition, the address space contains holes between the VMAs; the kernel fills those empty blocks (leaving space for unmapped "guard" pages to catch buffer overflows) when it needs to add a new mapped area, for example when loading a library or in a response to an mmap() call.

VMAs are currently stored in a modified red-black tree (rbtree) with an additional, doubly-linked list that allows the kernel to traverse all of the VMAs in a process's address space. Kernel developers have been unhappy with this data structure for some time, for a number of reasons: rbtrees do not support ranges well, they are difficult to handle in a lockless manner (the balancing operation of the rbtree affects multiple items at the same time), and rbtree traversal is inefficient, which is why the additional linked list exists.

vma mapping

Types of VMA Mappings

  • File / device backed mappings (mmap):
    • Code pages (binaries), libraries
    • Data files
    • Shared memory
    • Devices
  • Anonymous mappings:
    • Stack
    • Heap
    • COW pages

How The Kernel Manages Your Memory
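A minimal userspace sketch of the two kinds of mapping listed above, file-backed and anonymous, using mmap(2) directly (the file path is just an example and error handling is omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* file-backed mapping: produces a VMA with vm_file set */
    int fd = open("/etc/hostname", O_RDONLY);
    char *file_map = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);

    /* anonymous mapping: no backing file, vm_file is NULL */
    char *anon_map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    memcpy(anon_map, file_map, 16);   /* touching the pages triggers the faults */

    munmap(file_map, 4096);
    munmap(anon_map, 4096);
    close(fd);
    return 0;
}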

vm_region

/*
 * A region containing a mapping of a non-memory backed file under NOMMU
 * conditions.  These are held in a global tree and are pinned by the VMAs that
 * map parts of them.
 */
struct vm_region {
	struct rb_node	vm_rb;		/* link in global region tree */
	vm_flags_t	vm_flags;	/* VMA vm_flags */
	unsigned long	vm_start;	/* start address of region */
	unsigned long	vm_end;		/* region initialised to here */
	unsigned long	vm_top;		/* region allocated to here */
	unsigned long	vm_pgoff;	/* the offset in vm_file corresponding to vm_start */
	struct file	*vm_file;	/* the backing file or NULL */

	int		vm_usage;	/* region usage count (access under nommu_region_sem) */
	bool		vm_icache_flushed : 1; /* true if the icache has been flushed for
						* this region */
};

VMA Operations

/include/linux/mm.h

/*
 * These are the virtual MM functions - opening of an area, closing and
 * unmapping it (needed to keep files on disk up-to-date etc), pointer
 * to the functions called when a no-page or a wp-page exception occurs.
 */
struct vm_operations_struct {
	void (*open)(struct vm_area_struct * area);
	void (*close)(struct vm_area_struct * area);
	int (*split)(struct vm_area_struct * area, unsigned long addr);
	int (*mremap)(struct vm_area_struct * area);
	vm_fault_t (*fault)(struct vm_fault *vmf);
	vm_fault_t (*huge_fault)(struct vm_fault *vmf,
			enum page_entry_size pe_size);
	void (*map_pages)(struct vm_fault *vmf,
			pgoff_t start_pgoff, pgoff_t end_pgoff);
	unsigned long (*pagesize)(struct vm_area_struct * area);

	/* notification that a previously read-only page is about to become
	 * writable, if an error is returned it will cause a SIGBUS */
	vm_fault_t (*page_mkwrite)(struct vm_fault *vmf);

	/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
	vm_fault_t (*pfn_mkwrite)(struct vm_fault *vmf);

	/* called by access_process_vm when get_user_pages() fails, typically
	 * for use by special VMAs that can switch between memory and hardware
	 */
	int (*access)(struct vm_area_struct *vma, unsigned long addr,
		      void *buf, int len, int write);

	/* Called by the /proc/PID/maps code to ask the vma whether it
	 * has a special name.  Returning non-NULL will also cause this
	 * vma to be dumped unconditionally. */
	const char *(*name)(struct vm_area_struct *vma);

#ifdef CONFIG_NUMA
	/*
	 * set_policy() op must add a reference to any non-NULL @new mempolicy
	 * to hold the policy upon return.  Caller should pass NULL @new to
	 * remove a policy and fall back to surrounding context--i.e. do not
	 * install a MPOL_DEFAULT policy, nor the task or system default
	 * mempolicy.
	 */
	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);

	/*
	 * get_policy() op must add reference [mpol_get()] to any policy at
	 * (vma,addr) marked as MPOL_SHARED.  The shared policy infrastructure
	 * in mm/mempolicy.c will do this automatically.
	 * get_policy() must NOT add a ref if the policy at (vma,addr) is not
	 * marked as MPOL_SHARED. vma policies are protected by the mmap_lock.
	 * If no [shared/vma] mempolicy exists at the addr, get_policy() op
	 * must return NULL--i.e., do not "fallback" to task or system default
	 * policy.
	 */
	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
					unsigned long addr);
#endif
	/*
	 * Called by vm_normal_page() for special PTEs to find the
	 * page for @addr.  This is useful if the default behavior
	 * (using pte_page()) would not find the correct page.
	 */
	struct page *(*find_special_page)(struct vm_area_struct *vma,
					  unsigned long addr);
};
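As a hedged illustration of how these hooks get wired up, a character driver that exposes a single kernel page through mmap might look roughly like this; my_page, my_fault() and my_mmap() are made-up names, and vmf_insert_page() is a helper such drivers commonly use:

static struct page *my_page;   /* one kernel page allocated at init time (hypothetical) */

static vm_fault_t my_fault(struct vm_fault *vmf)
{
        /* map the same kernel page at whichever user address faulted */
        return vmf_insert_page(vmf->vma, vmf->address, my_page);
}

static const struct vm_operations_struct my_vm_ops = {
        .fault = my_fault,
};

static int my_mmap(struct file *file, struct vm_area_struct *vma)
{
        vma->vm_ops = &my_vm_ops;   /* page faults inside this VMA are routed to my_fault() */
        return 0;
}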

Cache

mm_struct->vmacache_seqnum

task_struct->vmacache

/* Per-thread vma caching: */
struct vmacache vmacache;

[PATCH 19/28] mm: Remove vmacache

vma cache


struct page

struct folio

pull slab out of struct page

This is an offshoot of the folio work, although it does not depend on any of the outstanding pieces of folio work. One of the more complex parts of the struct page definition is the parts used by the slab allocators. It would be good for the MM in general if struct slab were its own data type, and it also helps to prevent tail pages from slipping in anywhere.


reverse mapping anon_vma

kernel.org - mm Concepts overview
The anonymous memory or anonymous mappings represent memory that is not backed by a filesystem. Such mappings are implicitly created for program’s stack and heap or by explicit calls to mmap(2) system call.
Usually, the anonymous mappings only define virtual memory areas that the program is allowed to access. The read accesses will result in creation of a page table entry that references a special physical page filled with zeroes. When the program performs a write, a regular physical page will be allocated to hold the written data. The page will be marked dirty and if the kernel decides to repurpose it, the dirty page will be swapped out.

linux man page - mmap

/*
 * The copy-on-write semantics of fork mean that an anon_vma
 * can become associated with multiple processes. Furthermore,
 * each child process will have its own anon_vma, where new
 * pages for that process are instantiated.
 *
 * This structure allows us to find the anon_vmas associated
 * with a VMA, or the VMAs associated with an anon_vma.
 * The "same_vma" list contains the anon_vma_chains linking
 * all the anon_vmas associated with this VMA.
 * The "rb" field indexes on an interval tree the anon_vma_chains
 * which link all the VMAs associated with this anon_vma.
 */
struct anon_vma_chain {
	struct vm_area_struct *vma;
	struct anon_vma *anon_vma;
	struct list_head same_vma;   /* locked by mmap_lock & page_table_lock */
	struct rb_node rb;			/* locked by anon_vma->rwsem */
	unsigned long rb_subtree_last;
#ifdef CONFIG_DEBUG_VM_RB
	unsigned long cached_vma_start, cached_vma_last;
#endif
};
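As a hedged sketch of how this reverse mapping is consumed (compare rmap_walk_anon() in mm/rmap.c): once the kernel holds a page's anon_vma, it can visit every VMA that may map the page through the interval tree:

/* sketch: pgoff addresses the interval tree; avc iterates the anon_vma_chains */
pgoff_t pgoff = page_to_pgoff(page);
struct anon_vma_chain *avc;

anon_vma_lock_read(anon_vma);
anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
        struct vm_area_struct *vma = avc->vma;
        unsigned long addr = vma->vm_start +
                ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);

        /* ... unmap, write-protect or age the page in this mapping ... */
}
anon_vma_unlock_read(anon_vma);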

struct address_space

Problem: anon_vma idea is good for limited sharing

  • Memory maps can be shared by large numbers of processes, e.g., libc
  • Linear search for every eviction is slow
  • Also, different processes may map different ranges of a
    memory map into their address space

vma to page
(inode, address_space, i_mapping, page_tree, xarray)

If the region is backed by a file, the struct file is available through the vm_file field which has a pointer to the struct inode. The inode is used to get the struct address_space which has all the private information about the file including a set of pointers to filesystem functions which perform the filesystem specific operations such as reading and writing pages to disk.

priority search tree
priority search tree ppt
kernel.org xarray