contributed by <`linD026`>

Linux kernel COW, linux2021

All source code quoted in this article is from Linux kernel 5.10, and the architecture-specific code shown is primarily for arm.

Previous article: Linux 核心 Copy On Write 實作機制
Chapter 4 Process Address Space
cs.columbia.edu - W4118: Linux memory management
4.2 Managing the Address Space
If a region is backed by a file, its `vm_file` field will be set. By traversing `vm_file->f_dentry->d_inode->i_mapping`, the associated `address_space` for the region may be obtained. The `address_space` has all the filesystem specific information required to perform page-based operations on disk.
Linux kernel 5.10.35
In Linux, both processes and threads are described by the `task_struct` structure, and both are created with the `clone` system call; the difference lies in how resources are managed. The process address space of a `task_struct` is managed by the `mm_struct` structure, accessed as `task_struct->mm`. The difference between a process and its threads is that the threads' `task_struct`s all point to the same `mm_struct`.
Further reading - syscall / system calls

In the Linux kernel, system calls are defined with `SYSCALL_DEFINEn()`. For example:
This then expands into several definitions, such as `SYSCALL_METADATA`, which stores information about the system call (its name, parameters, and so on), along with the related functions.
To let userspace invoke a syscall, the kernel maintains `sys_call_table` and calls through it by index. On x86, for example, the index values are defined in syscall_64.tbl, while most other architectures define them as follows:
詳細說明可見上方列出的 lwn.net 文章,也可以見官方文件:kernel.org - Adding a New System Call。
`mm_struct` has two `atomic_t` counters, `mm_users` and `mm_count`, which record how many users the `mm_struct` has. `mm_users` counts how many processes use the userspace portion of this `mm_struct`, i.e., its page tables, file mappings, and so on. `mm_count` is a reference count of its "anonymous users". The difference is that when `mm_users` drops to zero, `exit_mmap()` removes all mappings and tears down the page tables, while `mm_count` ensures the `mm_struct` itself can be safely freed once no references remain.

In `mm_init()`, the function that initialises an `mm_struct`, both are initialised directly to one:
Linux kernel atomics and C11 atomics

The Linux kernel needed atomic operations long before the C11 standard existed, and had already defined its own, implemented in architecture-specific code.

In 2014 and again in 2016, the Linux community discussed switching to C11 atomics, but at the time (2014) C11 atomics were not yet mature, and the C11 memory model does not match the Linux kernel's memory model.

Related articles:

Further reading

RCU, a mechanism closely related to the memory model, is also mentioned in the articles above; for a description of RCU, see Linux 核心設計: RCU 同步機制 or the official document What is RCU? – "Read, Copy, Update".
kernel portion

A unique `mm_struct` is not needed for kernel threads as they will never page fault or access the userspace portion. The only exception is page faulting within the vmalloc space. The page fault handling code treats this as a special case and updates the current page table with information in the master page table. As a `mm_struct` is not needed for kernel threads, the `task_struct->mm` field for kernel threads is always `NULL`. For some tasks such as the boot idle task, the `mm_struct` is never setup but for kernel threads, a call to `daemonize()` will call `exit_mm()` to decrement the usage counter.
Lazy TLB - TLB flush

On a context switch, because virtual-to-physical translations differ from task to task, the TLB is flushed so that the new task is not affected by the old task's stale entries. This has a large performance impact, however, so the concept of lazy TLB was proposed.
As TLB flushes are extremely expensive, especially with architectures such as the PPC, a technique called lazy TLB is employed which avoids unnecessary TLB flushes by processes which do not access the userspace page tables as the kernel portion of the address space is always visible. The call to
switch_mm()
, which results in a TLB flush, is avoided by “borrowing” the mm_struct used by the previous task and placing it intask_struct→active_mm
. This technique has made large improvements to context switches times.
Every process shares the same kernel space while their userspaces differ, so when execution switches into the kernel portion, lazy TLB lets the next task reuse the previous task's TLB, on the condition that it never touches userspace, instead of performing a costly TLB flush.

TLB contents are usually treated as per-core caches, so when the page table entry backing a TLB entry changes, the TLBs must be kept coherent across cores (TLB coherence). On SMP, even a TLB flush performed on a single core must notify the other cores via an IPI (Inter-Processor Interrupt) that the TLB entry has been invalidated. Since IPIs are themselves expensive, this becomes very inefficient.
Receiving an IPI is as expensive as receiving any other IRQ (e.g. causes a full pipeline flush, followed by IDT and GDT lookups and protection checks, followed by the overhead of the interrupt handler itself). If a CPU doing useful work sends an average of 10 IPIs per second, then 32 CPUs will probably send an average of 320 IPIs per second. If those IPIs are received by all CPUs except one; then for 2 CPUs you'd get 10 IPI received per second, for 4 CPUs you'd get "4 * 3 *10 = 120" IPIs received per second, and for 128 CPUs you'd get "128 * 127 * 10 = 162560" IPIs received per second. It ends up being exponential overhead (a scalability nightmare for large systems). Basically, it's important to avoid unnecessary IPIs.
With lazy TLB, instead of sending an IPI, a core that touches a stale TLB entry takes a page fault and marks the entry invalid itself.
When entering lazy TLB, the function enter_lazy_tlb() is called to ensure that a mm_struct is not shared between processors in SMP machines, making it a NULL operation on UP machines. The second time use of lazy TLB is during process exit when start_lazy_tlb() is used briefly while the process is waiting to be reaped by the parent.
The lazy TLB usage above, avoiding unnecessary TLB flushes in the kernel portion while forbidding any operations on the userspace page tables, serves as an example of why two user counts are needed: in some situations the page tables and mappings are unneeded or already torn down, while the `mm_struct` itself is still required.
mm_count is a reference count of the “anonymous users” for the mm_struct initialised at 1 for the “real” user. An anonymous user is one that does not necessarily care about the userspace portion and is just borrowing the mm_struct. Example users are kernel threads which use lazy TLB switching. When this count drops to 0, the mm_struct can be safely destroyed. Both reference counts exist because anonymous users need the mm_struct to exist even if the userspace mappings get destroyed and there is no point delaying the teardown of the page tables.
Further reading

On lazy TLB being able to crash the system, which led to the proposal of an even lazier mode.
Sequence counters are a reader-writer consistency mechanism with lockless readers (read-only retry loops), and no writer starvation. They are used for data that’s rarely written to (e.g. system time), where the reader wants a consistent set of information and is willing to retry if that information changes.
In other words, it is a reader-writer mechanism that makes the readers lockless.
A data set is consistent when the sequence count at the beginning of the read side critical section is even and the same sequence count value is read again at the end of the critical section. The data in the set must be copied out inside the read side critical section. If the sequence count has changed between the start and the end of the critical section, the reader must retry.
Writers increment the sequence count at the start and the end of their critical section. After starting the critical section the sequence count is odd and indicates to the readers that an update is in progress. At the end of the write side critical section the sequence count becomes even again which lets readers make progress.
A sequence counter write side critical section must never be preempted or interrupted by read side sections. Otherwise the reader will spin for the entire scheduler tick due to the odd sequence count value and the interrupted writer. If that reader belongs to a real-time scheduling class, it can spin forever and the kernel will livelock.
This mechanism cannot be used if the protected data contains pointers, as the writer can invalidate a pointer that the reader is following.
The `mm_struct` also contains locks such as `spinlock_t page_table_lock` and `struct rw_semaphore mmap_lock`:
mmap_sem This is a long lived lock which protects the VMA list for readers and writers. As users of this lock require it for a long time and may need to sleep, a spinlock is inappropriate. A reader of the list takes this semaphore with down_read(). If they need to write, it is taken with down_write() and the page_table_lock spinlock is later acquired while the VMA linked lists are being updated;
page_table_lock This protects most fields on the mm_struct. As well as the page tables, it protects the RSS (see below) count and the VMA from modification;
`mmap_sem` was renamed to `mmap_lock` in version 5.8.
The corresponding comment in the 5.7 kernel is:
`do_mmap()` is called via `do_mmap_pgoff()` (in "include/linux/mm.h") via `vm_mmap_pgoff()` (in "mm/util.c") via `ksys_mmap_pgoff()` (in "mm/mmap.c") via the `mmap_pgoff()` syscall handler (in "mm/mmap.c"). (N.B. From kernel version 5.9 onwards, `do_mmap_pgoff()` is eliminated and `do_mmap()` is called directly from `vm_mmap_pgoff()`.) The mmap lock is write-locked in `vm_mmap_pgoff()`.
Further reading

Linux 核心設計: RCU 同步機制 also covers lock-related information, such as the basic principles of spinlocks, rwlocks, and seqlocks.

On concurrency and RCU, see Is Parallel Programming Hard, And, If So, What Can You Do About It?, written by Paul E. McKenney (the developer of the Linux kernel RCU mechanism).
Resident Set Size (RSS) is the number of resident pages for this process. It should be noted that the global zero page is not accounted for by RSS;
`{ ... } __randomize_layout;`
C11 - 6.7.2.1 Structure and union specifiers - 13
An unnamed member of structure type with no tag is called an anonymous structure; an unnamed member of union type with no tag is called an anonymous union. The members of an anonymous structure or union are considered to be members of the containing structure or union. This applies recursively if the containing structure or union is also anonymous.
Thus the following kind of access is legal:
stackoverflow - C: How to access different types of anonymous or unnamed nested structs has a more detailed explanation.
Further reading - Kernel Address Space Layout Randomization

ASLR improves security by making object addresses random rather than fixed. The Linux kernel has this mechanism as well: KASLR is enabled when the kernel boot parameters do not include `nokaslr`. With it enabled, the symbols in `/boot/System.map` (static) will differ somewhat from those in `/proc/kallsyms`.

Related article: Kernel Address Space Randomization in Linux or how I made Volatility bruteforce the page tables
| Function | Description |
|---|---|
| `mm_init()` | Initialises a `mm_struct` by setting starting values for each field, allocating a PGD, initialising spinlocks etc. |
| `allocate_mm()` | Allocates a `mm_struct()` from the slab allocator |
| `mm_alloc()` | Allocates a `mm_struct` using `allocate_mm()` and calls `mm_init()` to initialise it |
| `exit_mmap()` | Walks through a `mm_struct` and unmaps all VMAs associated with it |
| `copy_mm()` | Makes an exact copy of the current tasks `mm_struct` for a new task. This is only used during fork |
| `free_mm()` | Returns the `mm_struct` to the slab allocator |
Two functions are provided to allocate a mm_struct. To be slightly confusing, they are essentially the same but with small important differences. allocate_mm()
is just a preprocessor macro which allocates a mm_struct
from the slab allocator. mm_alloc()
allocates from slab and then calls mm_init()
to initialise it.
While a new user increments the usage count with atomic_inc(&mm->mm_users)
, it is decremented with a call to mmput()
. If the mm_users
count reaches zero, all the mapped regions are destroyed with exit_mmap()
and the page tables destroyed as there is no longer any users of the userspace portions. The mm_count
count is decremented with mmdrop()
as all the users of the page tables and VMAs are counted as one mm_struct user. When mm_count
reaches zero, the mm_struct
will be destroyed.
The VMA structures are read frequently, by page faults, `mmap`, and similar operations, and when there are many mapped regions a linked list wastes a great deal of time traversing to find a particular virtual address. A data structure with more efficient lookup is therefore needed, such as the red-black tree (rbtree). A red-black tree is a self-balancing tree storing sortable data, using its red/black colouring properties to keep the tree height within $O(\log n)$. For the operations on red-black trees, see Red-black Trees (rbtree) in Linux.

Why not choose the AVL tree, which has the same properties and is likewise $O(\log n)$?

Although both AVL trees and rbtrees have $O(\log n)$ height, the AVL tree's bound is tighter, about $1.44 \log_2 n$, versus about $2 \log_2 n$ for the rbtree. So for searches and other operations bounded by tree height, an AVL tree is somewhat faster than an rbtree, but at the cost of insert and delete needing more rotations to maintain that height (commonly cited as up to three rotations for AVL versus two for an rbtree). For this reason general-purpose implementations still tend to use rbtrees, while databases tend to use AVL trees.

Another argument is that AVL trees do not provide amortized update cost, whereas rbtrees do.

For actual measurements, see: Performance Analysis of BSTs in System Software
The results indicate that when input is expected to be randomly ordered with occasional runs of sorted order, red-black trees are preferred; when insertions often occur in sorted order, AVL trees excel for later random access, whereas splay trees perform best for later sequential or clustered access.
All the VMA structures in an address space are also connected with a linear linked list, which permits traversing the entire address space.

All the VMAs in an address space are linked together in an address-ordered singly linked list via this field. It is interesting to note that the VMA list is one of the very rare cases where a singly linked list is used in the kernel;
The RCU-safe Maple Tree that replaces the rbtree
Linux is a virtual-memory system. The address space for each process contains multiple virtual memory areas represented by vm_area_struct structures. Each VMA represents a contiguous block of address space and represents a range of memory of the same type, which may be anonymous memory (not backed by a file), a memory-mapped file, or even device memory. A virtual memory area, as seen from the process, is contiguous, while the supporting physical memory may not be. In addition, the address space contains holes between the VMAs; the kernel fills those empty blocks (leaving space for unmapped "guard" pages to catch buffer overflows) when it needs to add a new mapped area, for example when loading a library or in a response to an mmap() call.
VMAs are currently stored in a modified red-black tree (rbtree) with an additional, doubly-linked list that allows the kernel to traverse all of the VMAs in a process's address space. Kernel developers have been unhappy with this data structure for some time, for a number of reasons: rbtrees do not support ranges well, they are difficult to handle in a lockless manner (the balancing operation of the rbtree affects multiple items at the same time), and rbtree traversal is inefficient, which is why the additional linked list exists.
Types of VMA Mappings
How The Kernel Manages Your Memory
mm_struct-> vmacache_seqnum
task_struct->vmacache
[PATCH 19/28] mm: Remove vmacache
This is an offshoot of the folio work, although it does not depend on any of the outstanding pieces of folio work. One of the more complex parts of the struct page definition is the parts used by the slab allocators. It would be good for the MM in general if struct slab were its own data type, and it also helps to prevent tail pages from slipping in anywhere.
kernel.org - mm Concepts overview
The anonymous memory or anonymous mappings represent memory that is not backed by a filesystem. Such mappings are implicitly created for program’s stack and heap or by explicit calls to mmap(2) system call.
Usually, the anonymous mappings only define virtual memory areas that the program is allowed to access. The read accesses will result in creation of a page table entry that references a special physical page filled with zeroes. When the program performs a write, a regular physical page will be allocated to hold the written data. The page will be marked dirty and if the kernel decides to repurpose it, the dirty page will be swapped out.
Problem: anon_vma idea is good for limited sharing
- Memory maps can be shared by large numbers of processes, e.g., libc
- Linear search for every eviction is slow
- Also, different processes may map different ranges of a memory map into their address space
vma to page
(inode, address_space, i_mmap, page_tree, xarray)
If the region is backed by a file, the struct file is available through the vm_file field which has a pointer to the struct inode. The inode is used to get the struct address_space which has all the private information about the file including a set of pointers to filesystem functions which perform the filesystem specific operations such as reading and writing pages to disk.
priority search tree
priority search tree ppt
kernel.org xarray