contributed by < linD026 >
Linux kernel COW, linux2021
CoW explain
MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
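The MAP_PRIVATE behaviour quoted above can be observed from user space. The following is a minimal sketch, not a definitive test: the helper name `private_mapping_demo` and the temporary file are assumptions made for illustration. It writes through a private mapping and then re-reads the file to show the write was not carried through.

```python
import mmap
import os
import tempfile


def private_mapping_demo():
    """Write through a MAP_PRIVATE mapping; return (bytes seen in mapping, bytes in file)."""
    # Hypothetical scratch file, created only for this demonstration.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"original")
        path = f.name
    try:
        with open(path, "r+b") as f:
            # MAP_PRIVATE: stores land in a private copy of the page, not in the file.
            m = mmap.mmap(f.fileno(), 0, flags=mmap.MAP_PRIVATE,
                          prot=mmap.PROT_READ | mmap.PROT_WRITE)
            m[0:8] = b"modified"
            seen_in_mapping = bytes(m[0:8])
            m.close()
        with open(path, "rb") as f:
            seen_in_file = f.read()
        return seen_in_mapping, seen_in_file
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(private_mapping_demo())
```

The mapping shows the new bytes while the file keeps the old ones, which matches "not carried through to the underlying file".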
Linux kernel memory region
process
do_wp_page (break COW)
follow_page() - vma
Github - davidhcefx/Translate-Virtual-Address-To-Physical-Address-in-Linux-Kernel
Github - lkb: The linux kernel programming guide - Data structure (including process)
page_owner - testing
high memory
cleancache
x86 pti testing
Copy on Write is implemented on top of the page fault, an implicit interrupt generated by the MMU (Memory Management Unit). The two are related but NOT the same: Copy on Write is only one of several reasons a page fault can occur. Some of those reasons are the following.
Invalid Memory access: A page fault occurs when a page requested by a user process is not present in memory. It may occur when a process accesses a virtual address that was never allocated to it (commonly reported as a segmentation fault), or when the page has been swapped out.
Copy on Write: One reason for a page fault is Copy on Write. During a fork() system call the OS maps the same memory into both child and parent and marks it read-only. This avoids a huge copy penalty. Suppose the child calls exec right after fork: without Copy on Write, every page copied for the child would be thrown away during exec. When either the parent or the child tries to write to such a page, a page fault is raised. The OS then allocates a new page, copies the contents into it, and removes the read-only restriction.
Copy on Demand: Another reason for a page fault is Copy on Demand. When a user process asks for a new page in its virtual address range, the OS may hand out a virtual address without allocating a physical page behind it. When the process first accesses that page, a page fault is generated, and the OS then allocates a physical page for the virtual page.
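The "Copy on Demand" case can be watched from user space by counting minor page faults around the first touch of freshly mapped memory. This is a rough sketch under assumptions: the helper name `touch_pages` is made up for illustration, and the exact fault count depends on the kernel, the page size, and whether transparent huge pages are enabled, so no expected number is given.

```python
import mmap
import resource


def touch_pages(num_pages=1024):
    """Map anonymous memory, write to every page, return (pages touched, minor faults)."""
    page = mmap.PAGESIZE
    # Anonymous private mapping: virtual addresses are reserved here, but
    # physical pages are normally assigned only on first access.
    region = mmap.mmap(-1, num_pages * page,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    touched = 0
    for offset in range(0, num_pages * page, page):
        region[offset] = 1  # first write to this page
        touched += 1
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    region.close()
    return touched, after - before


if __name__ == "__main__":
    print(touch_pages())
```

On a typical Linux system without huge pages the fault count tends to be close to the number of pages touched, but treat that as an observation to verify rather than a guarantee.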
So a page fault may NOT need a fresh page to be allocated (for example, when it results from an error). But when a page fault does need a fresh page, that page comes from the same pool of pages that serves Copy on Write.
The malloc implementation is not related to Copy on Write.
NOTE: An operating system can work without Copy on Write and Copy on Demand, although it will not perform well. The page fault mechanism, however, is necessary for an OS to support paging.
How can we prove that fork is CoW?
fork -> exec : optimization : no address space
fork -> normally an address space is allocated
fork -> exec -> wait
system
appeared only with C89 / C99
ebpf
Observation goal: as few CoW memory operations as possible
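One user-space way to gather evidence for CoW fork, without eBPF, is to let the child write to pages it inherited and compare what each side sees. This is a sketch under assumptions, not a proof: the helper name `cow_fork_probe` is hypothetical, and the fault count reported by the child depends on the kernel and on transparent huge pages.

```python
import mmap
import os
import resource
import struct


def cow_fork_probe(num_pages=1024):
    """Return (first byte the parent sees after the child wrote, child's minor faults)."""
    page = mmap.PAGESIZE
    region = mmap.mmap(-1, num_pages * page,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
    for offset in range(0, num_pages * page, page):
        region[offset] = 0x41  # populate every page in the parent before fork

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            os.close(read_fd)
            before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
            for offset in range(0, num_pages * page, page):
                region[offset] = 0x42  # write to each inherited page
            after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
            os.write(write_fd, struct.pack("q", after - before))
            status = 0
        finally:
            os._exit(status)  # never fall through into the parent's code

    os.close(write_fd)
    child_faults = struct.unpack("q", os.read(read_fd, 8))[0]
    os.close(read_fd)
    os.waitpid(pid, 0)
    parent_byte = region[0]  # still the parent's value: the child wrote to its own copy
    region.close()
    return parent_byte, child_faults


if __name__ == "__main__":
    print(cow_fork_probe())
```

If the pages were copied eagerly at fork time, the child's writes would hit pages that are already private and writable, so a fault count that grows with the number of pages written is consistent with copying being deferred until the write.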
[PATCH 2/4] ftrace: Add 'function-fork' trace option
clone and github kernel-testexec ON-DEMAND-FORK: a fast version of fork
=> PTE
fork implementation, life cycle, etc.
==> helps web servers and databases
memory slab:
slabdbg
Kernel tracing and analysis with GDB
Before the clone system call existed, vfork was used to implement threads
vfork() is a special case of clone(2). It is used to create new processes without copying the
page tables of the parent process. It may be useful in performance-sensitive applications
where a child is created which then immediately issues an execve(2).
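The "child is created which then immediately issues an execve(2)" pattern can be sketched as follows. Python does not expose vfork() directly, so this example uses plain fork() for the classic fork -> exec -> wait sequence and adds `os.posix_spawn` as an alternative that leaves the process-creation strategy to the C library (whether that strategy is vfork-like is an assumption about the libc in use, not something this code checks). The helper names `fork_exec_wait` and `spawn_wait` are made up for illustration.

```python
import os
import sys


def _exit_code(status):
    """Translate a waitpid status into an exit code, or -1 if the child was killed."""
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


def fork_exec_wait(argv):
    """Classic fork -> exec -> wait; returns the child's exit code."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execv(argv[0], argv)  # replaces the child's address space
        finally:
            os._exit(127)  # reached only if exec failed
    _, status = os.waitpid(pid, 0)
    return _exit_code(status)


def spawn_wait(argv):
    """Same effect through posix_spawn, which hides the fork/vfork choice in libc."""
    pid = os.posix_spawn(argv[0], argv, dict(os.environ))
    _, status = os.waitpid(pid, 0)
    return _exit_code(status)


if __name__ == "__main__":
    cmd = [sys.executable, "-c", "raise SystemExit(3)"]
    print(fork_exec_wait(cmd), spawn_wait(cmd))
```

Because the child discards its address space at exec, any work spent duplicating the parent's memory or page tables is wasted, which is the cost vfork() was designed to avoid.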
4.3BSD; POSIX.1-2001 (but marked OBSOLETE). POSIX.1-2008 removes the specification of vfork().
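=> NPTL (IBM, Red Hat 1991 ~ 2001)
==> replaced only once a native implementation appeared
ftrace: function trace (most systems have no dynamic tracing)
With dynamic tracing: Linux, macOS
Dynamic tracing on Windows: requires swapping the kernel
Completeness of an operating system
mach microkernel : shipped only when complete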
glibc : GNU Hurd ( OS )
=> Linux
garbage collection => Linux ( reclaim )
snooping
2006: Intel was already preparing for 64-bit.
longterm focuses on: completeness, conflicts between old and new builds, device drivers (e.g. Wi-Fi)
redhat The Linux Virtual Memory System
Datacadamia - os/memory/virtual
anon_vma_fork - find anon_vma non-COW
Memory Layout on AArch64 Linux
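=> page size affects the number of levels
==> depends on the cost of translation and on the execution environment
e.g.:
web server : small data read frequently => small page table
heavy computation (engineering workloads) or huge data sets : page size 64KB => page faults rarely occur (pages are large enough) => translation cost drops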
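The relation between page size and the number of translation levels can be worked out with a little arithmetic. The sketch below assumes each page-table level occupies exactly one page and uses 8-byte entries, as on 64-bit AArch64 and x86-64. The function name `translation_levels` is made up for illustration.

```python
def translation_levels(va_bits, page_size, entry_size=8):
    """Page-table levels needed to translate a va_bits-wide virtual address.

    Assumes one table per page and power-of-two page and entry sizes.
    """
    page_shift = page_size.bit_length() - 1      # log2(page_size): bits of page offset
    entry_shift = entry_size.bit_length() - 1    # log2(entry_size)
    bits_per_level = page_shift - entry_shift    # log2(entries per table)
    remaining = va_bits - page_shift             # bits the tables must resolve
    return -(-remaining // bits_per_level)       # ceiling division


if __name__ == "__main__":
    for va_bits, page_size in [(48, 4096), (48, 65536), (42, 65536), (39, 4096)]:
        print(va_bits, page_size, translation_levels(va_bits, page_size))
```

With 48-bit virtual addresses, 4KB pages need 4 levels while 64KB pages need 3, so each TLB miss walks fewer tables with the larger page size.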
fork.c - search : GFP_
GFP - lwn.net
memory allocation guide
do_futex in fork.c
do_futex source code
Context Switch Latency experiment results
wiki-ncku : arm-linux
Unix getrusage function
CS:APP - 9.8
At any point in time, the swap space bounds the total amount of virtual pages that can be allocated by the currently running processes.
Dirty COW (Dirty copy-on-write) is a computer security vulnerability for the Linux kernel that affected all Linux-based operating systems, including Android devices, that used older versions of the Linux kernel created before 2018. It is a local privilege escalation bug that exploits a race condition in the implementation of the copy-on-write mechanism in the kernel's memory-management subsystem. Computers and devices that still use the older kernels remain vulnerable.
What are some examples from Linux kernel source implementing copy-on-write feature?
chapter 9 - COW
exception handler and process
Sharing pages between mappings
The case of the overly anonymous anon_vma
Anonymous VMA naming patches
Patching until the COWs come home (part 1)
Patching until the COWs come home (part 2)
get_user_page - GUP
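Describes PM (the various kinds of memory, possibly belonging to different devices) => lets struct page map onto it => operations do not need to consider which device the memory actually belongs to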
Understanding the Linux® Virtual Memory Manager
lwn.net - Memory: the flat, the discontiguous, and the sparse
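Need to support more applications, but memory cannot actually be added => compress memory
=> zswap (rewritten from the swap code)
==> swap : PID = 0 (swapper, first edition of Unix)
===> swap itself is also a program; it has a PID and can be scheduled
VM only arrived with BSD
=> non-cacheable => no program needs to process it, so no cache is needed
=> when memory is actually used, the state of the page is known only after the CPU finishes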
==> VA -> IPA -> PA
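intel => writable and executable => buffer overflow attack
=> security considerations
Harvard architecture separates data and code
von Neumann architecture keeps data + code together
=> modern designs mix the two
SMP => cores are independent of each other; normally the TLB needs to be flushed (expensive)
=> cpu_tlbstate <- per-cpu data (a TLB flush is needed when an IPI occurs)
==> load balancing (a process may move between cores; the same holds for userspace)
===> lazy tlb ( TLBSTATE_LAZY )
intel 8086 is 16-bit, but the range it can address is much larger (though each actual access is still the same)
==> to save memory, buffers did not get Read Write Execute protection => problems arise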
===> NX bit
====> shellcode
====> rootkit
When available memory in the system is low, the pageout daemon kswapd is woken up to start freeing pages (see Chapter 10). If the pressure is high, the process will free up memory synchronously, sometimes referred to as the direct-reclaim path. The parameters affecting pageout behaviour are similar to those by FreeBSD [McK96] and Solaris [MM01].
StackExchange - kswapd0 is taking a lot of cpu
The Kernel Swap Daemon (kswapd)
The name swap daemon is a bit of a misnomer as the daemon does more than just swap modified pages out to the swap file. Its task is to keep the memory management system operating efficiently. The Kernel swap daemon (kswapd) is started by the kernel init process at startup time and sits waiting for the kernel swap timer to periodically expire. Every time the timer expires, the swap daemon looks to see if the number of free pages in the system is getting too low. Free pages in the system are too low if:
Memory management - Page cache / Page frame / reclaiming Swapping / Swap cache