contributed by < linD026 >
Linux kernel COW, linux2021
All source code cited in this article is taken from Linux Kernel 5.10, and the architecture-specific code is mainly for arm.
Previous article: Linux Kernel Copy On Write - Memory Region
There are two situations where a bad reference may occur.
The first is where a process sends an invalid pointer to the kernel via a system call, which the kernel must be able to trap safely because the only check made initially is that the address is below PAGE_OFFSET.
The second is where the kernel uses copy_from_user() or copy_to_user() to read or write data from userspace.
The assembler function startup_32() is responsible for enabling the paging unit in arch/i386/kernel/head.S. While all normal kernel code in vmlinuz is compiled with the base address at PAGE_OFFSET + 1MiB, the kernel is actually loaded beginning at the first megabyte (0x00100000) of memory. The first megabyte is used by some devices for communication with the BIOS and is skipped. The bootstrap code in this file treats 1MiB as its base address by subtracting __PAGE_OFFSET from any address until the paging unit is enabled. Before the paging unit is enabled, therefore, a page table mapping has to be established that translates the 8MiB of physical memory to the virtual address PAGE_OFFSET.
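The address arithmetic in the passage above can be sketched in plain C. This is only an illustration: the value 0xC0000000 for PAGE_OFFSET is an assumption (the common 3G/1G split on 32-bit x86), and the helper names are hypothetical, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value: the usual 3G/1G split on 32-bit x86. */
#define PAGE_OFFSET 0xC0000000u
/* The kernel image is loaded at the 1MiB physical mark. */
#define LOAD_PHYS   0x00100000u

/* Hypothetical helper: turn a link-time (virtual) address into the
 * physical address it occupies, as the bootstrap code must do by hand
 * while the paging unit is still disabled. */
static uint32_t virt_to_phys_early(uint32_t vaddr)
{
    assert(vaddr >= PAGE_OFFSET); /* only kernel addresses make sense here */
    return vaddr - PAGE_OFFSET;
}

/* Hypothetical helper: the inverse mapping, valid once the page tables
 * map physical memory at PAGE_OFFSET. */
static uint32_t phys_to_virt_early(uint32_t paddr)
{
    return paddr + PAGE_OFFSET;
}
```

Under these assumptions, code linked at PAGE_OFFSET + 1MiB (0xC0100000) resolves to the physical load address 0x00100000.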
At compile time, the linker creates an exception table in the __ex_table section of the kernel code segment which starts at __start___ex_table and ends at __stop___ex_table. Each entry is of type exception_table_entry which is a pair consisting of an execution point and a fixup routine. When an exception occurs that the page fault handler cannot manage, it calls search_exception_table() to see if a fixup routine has been provided for an error at the faulting instruction. If module support is compiled, each module's exception table will also be searched.
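The lookup described above can be modelled in userspace as a search over (instruction, fixup) pairs. The sketch below is a simplification, not the kernel's code: the struct layout, the function name find_fixup, and the sample addresses are hypothetical, and it assumes the table is sorted by instruction address so that a binary search applies.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified entry: the address of an instruction that may fault,
 * paired with the address of its fixup routine. */
struct ex_entry {
    unsigned long insn;
    unsigned long fixup;
};

/* Made-up sample table, sorted by insn. */
static const struct ex_entry demo_table[] = {
    { 0x1000UL, 0x9000UL },
    { 0x2000UL, 0x9100UL },
    { 0x3000UL, 0x9200UL },
};

/* Hypothetical lookup: binary search for the faulting instruction.
 * Returns the fixup address, or 0 when no entry matches. */
static unsigned long find_fixup(const struct ex_entry *table, size_t n,
                                unsigned long fault_addr)
{
    size_t lo = 0, hi = n;

    assert(table != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid].insn == fault_addr)
            return table[mid].fixup;
        if (table[mid].insn < fault_addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0; /* no fixup registered for this instruction */
}
```

A return value of 0 corresponds to the case where no fixup exists and the fault cannot be recovered through the table.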
Linux kernel 2.6 - x86
copy_nonpresent_pte - copy one vm_area from one task to the other.
The call graph for this function is shown in Figure 4.17. This function handles the case where a user tries to write to a private page shared among processes, such as what happens after fork(). Basically, a page is allocated, the contents are copied to the new page, and the shared count of the old page is decremented.
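The three steps just listed (allocate, copy, decrement the shared count) can be mimicked with a toy page structure. Everything here is hypothetical: struct toy_page, toy_break_cow and TOY_PAGE_SIZE are illustrative names, and malloc stands in for the kernel's page allocator. The shortcut for a page with a single owner is an assumption of this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Toy stand-in for a physical page: a share count plus its data. */
struct toy_page {
    int count;
    unsigned char data[TOY_PAGE_SIZE];
};

/* Hypothetical "break COW": returns the page the writer should use
 * from now on, or NULL if allocation fails. */
static struct toy_page *toy_break_cow(struct toy_page *old)
{
    struct toy_page *copy;

    assert(old != NULL && old->count >= 1);
    if (old->count == 1)
        return old; /* sole owner: nothing to copy (sketch assumption) */

    copy = malloc(sizeof(*copy)); /* step 1: allocate a new page */
    if (copy == NULL)
        return NULL;
    memcpy(copy->data, old->data, TOY_PAGE_SIZE); /* step 2: copy contents */
    copy->count = 1;
    old->count--; /* step 3: drop the writer's reference to the old page */
    return copy;
}
```

After the call, writes through the returned page no longer affect the other sharers of the old page.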
The PTE is marked un-writeable, but the VMA is marked writeable. The page fault handler notices the difference.
4.6.4 Copy On Write (COW) Pages
During fork, the PTEs of the two processes are made read-only so that when a write occurs there will be a page fault. Linux recognises a COW page because even though the PTE is write protected, the controlling VMA shows the region is writable. It uses the function do_wp_page() to handle it by making a copy of the page and assigning it to the writing process. If necessary, a new swap slot will be reserved for the page. With this method, only the page table entries have to be copied during a fork.
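The recognition rule in this paragraph (write-protected PTE, writable VMA) reduces to a small predicate. The flag names and values below are made up for this sketch and do not match the kernel's definitions; the requirement that the page be present is an added assumption, used to keep COW faults apart from faults on pages that are not mapped at all.

```c
#include <stdbool.h>

/* Toy flags; names and values are hypothetical, not the kernel's. */
#define TOY_PTE_PRESENT 0x1u
#define TOY_PTE_WRITE   0x2u
#define TOY_VMA_WRITE   0x1u

/* A write fault is treated as a COW fault when the page is present,
 * the PTE forbids writing, and the VMA allows writing. */
static bool toy_is_cow_fault(unsigned pte_flags, unsigned vma_flags,
                             bool write_access)
{
    return write_access
        && (pte_flags & TOY_PTE_PRESENT) != 0
        && (pte_flags & TOY_PTE_WRITE) == 0
        && (vma_flags & TOY_VMA_WRITE) != 0;
}
```

A write to a read-only PTE inside a VMA that is itself read-only is not a COW fault in this model; it is an access violation.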
handle_pte_fault
The PTE is marked un-writeable, and the fault on the VMA carries FAULT_FLAG_WRITE.
https://elixir.bootlin.com/linux/v4.6/source/mm/memory.c#L2345
https://elixir.bootlin.com/linux/v4.6/source/include/linux/mm.h#L728
https://elixir.bootlin.com/linux/latest/source/include/linux/rmap.h#L29
https://www.kernel.org/doc/gorman/html/understand/understand007.html
do_translation_fault
do_page_fault
__do_page_fault
handle_mm_fault
__handle_mm_fault
handle_pte_fault
do_wp_page ( break COW )
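From userspace, the visible result of this path is that a write made after fork() stays private to the process that made it. The POSIX sketch below shows only that visible behaviour; it cannot tell whether the kernel copied the page lazily or eagerly, and the function name is hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's view of `value` after the child has written to
 * its own copy, or -1 if fork() or waitpid() fails. */
static int parent_value_after_child_write(void)
{
    volatile int value = 1;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        value = 2; /* the write lands in the child's private copy */
        _exit(0);
    }
    if (waitpid(pid, NULL, 0) < 0)
        return -1;
    return value; /* the parent's page was never modified */
}
```

On a successful fork, the function returns 1, because the child's store never reaches the parent's memory.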
lwn.net
commit - 17839856fd588f4ab6b789f482ed3ffd7c403e1f
patch