
Linux kernel COW exploration notes

contributed by < linD026 >

tags: Linux kernel COW , linux2021

Contents

  1. CoW explain

    • Concept
      • Brief overview
      • process
    • virtual memory
      • page table
        • each level pt is ppf. (need example)
        • walk to pte (maybe can write in 2. 3.)
      • memory mapping

        MAP_PRIVATE
        Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.

    • Physical memory
      • memory model
        • polish and verify correctness
  2. Linux kernel memory region

    • mm_struct
    • vma
    • struct page
    • anon_vma
      • reverse mapping ( sharing )
    • address_space
  3. process

    • page fault
    • function fork and clone
      • ftrace do_wp_page (break COW)
    • uclinux
    • vfork
    • cow (maybe 4.)
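
The MAP_PRIVATE description quoted in the outline can be checked with a small experiment. The following is a minimal sketch, not a definitive test: the function name `private_mapping_file_byte` and the temp-file template are assumptions made for illustration. It writes "hello" to a temporary file, maps it with MAP_PRIVATE, overwrites the first byte through the mapping, and re-reads the file.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns the first byte of the file as seen through pread() after the
 * private mapping has been modified, or -1 on error.
 * The path template below is an illustrative assumption.
 */
int private_mapping_file_byte(void)
{
    char path[] = "/tmp/cow-demo-XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                      /* file goes away once fd is closed */

    if (write(fd, "hello", 5) != 5) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, 5, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    map[0] = 'J';                      /* write fault: the page is copied */

    char in_file = 0;
    ssize_t n = pread(fd, &in_file, 1, 0);
    int mapped_ok = (map[0] == 'J');   /* the private copy holds the new byte */

    munmap(map, 5);
    close(fd);
    return (n == 1 && mapped_ok) ? in_file : -1;
}
```

If the write stays private as the man page describes, the function returns 'h' even though the mapping itself reads back 'J'.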

Linux kernel /mm

page and github

follow_page() - vma
Github - davidhcefx/Translate-Virtual-Address-To-Physical-Address-in-Linux-Kernel
Github - lkb: The linux kernel programming guide - Data structure (including process)

doc vm

page_owner - testing
high memory
cleancache
x86 pti testing

struct page and struct vma

quiz3

share memory

eBPF

fork/exec

process

POSIX share memory

/dev/mem

linux file system

stackoverflow CoW based on page fault

Copy on Write is implemented on top of the implicit interrupt generated by the MMU (Memory Management Unit). Example reasons for a page fault are as follows.

A page fault is also an implicit interrupt generated by the MMU, but the two are NOT the same. Some reasons for a page fault are as follows.

Invalid Memory access: A page fault occurs when a page desired by a user process is not present in memory. A page fault may occur if a process wants to access a virtual address that is not allocated to it (commonly known as a segmentation fault). Or it may occur if a page is swapped out.

Copy on Write: One reason for a page fault is copy on write. During a fork() system call the OS maps the same memory for both child and parent and marks it read-only. This avoids a huge copy penalty. Assume the child calls exec just after fork: if copy on write were not employed, all the copied pages would be discarded during exec. When either the parent or the child tries to write to such a page, a page fault is raised. The OS then allocates a new page, copies the contents, and removes the read-only restriction.

Copy on Demand: Another reason for a page fault is copy on demand. When a user process asks for a new page in its virtual address range, the OS may allocate a virtual address without allocating a physical page for it. When the process tries to access that page, a page fault is generated. The OS then allocates a physical page corresponding to the virtual page.

So a page fault may NOT need a fresh page to be allocated (for example when it is generated by an error). But if a page fault does need a fresh page, that page comes from the same pool of pages that serves copy on write.

The malloc implementation is not related to copy on write.

NOTE: An operating system can work without Copy on Write and Copy on Demand, although it will not perform well. The page fault mechanism, however, is necessary for an OS to support paging.
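
The "copy on demand" case above can be observed from user space. The sketch below is a rough illustration under assumptions: the function names are made up for this note, and it relies on mincore() reporting untouched anonymous pages as non-resident, which is the usual Linux behaviour but may differ in sandboxed environments.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the pages of [addr, addr + len) that mincore() reports as resident. */
static long resident_pages(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (len + (size_t)page - 1) / (size_t)page;
    unsigned char vec[64];

    if (npages > sizeof(vec) || mincore(addr, len, vec) != 0)
        return -1;

    long count = 0;
    for (size_t i = 0; i < npages; i++)
        count += vec[i] & 1;           /* bit 0 set: page is in memory */
    return count;
}

/*
 * Map 16 anonymous pages, then write one byte into the first `touch` pages.
 * *before and *after receive the resident-page counts around the writes.
 * Returns 0 on success, -1 on error.
 */
int demand_paging_demo(int touch, long *before, long *after)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = 16 * (size_t)page;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return -1;

    *before = resident_pages(p, len);
    for (int i = 0; i < touch && i < 16; i++)
        p[(size_t)i * (size_t)page] = 1;   /* first write faults the page in */
    *after = resident_pages(p, len);

    munmap(p, len);
    return (*before < 0 || *after < 0) ? -1 : 0;
}
```

The mapping only reserves virtual addresses; physical pages are expected to show up in the `after` count only for the pages that were actually written.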

Evolutionary memory management

Linux Intro memory management 101

Experiments, fork, and system

How can we demonstrate that fork uses CoW?

fork -> exec : optimization : no address space

fork -> normally an address space is allocated

fork -> exec -> wait

system(): appeared only with C89 / C99
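
The fork -> exec -> wait sequence noted above is roughly what system() does on behalf of the caller. A minimal sketch, where `run_and_wait` is a hypothetical helper name and not a library function:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run argv[0] (looked up via PATH) in a child process and wait for it.
 * Returns the child's exit status (127 if exec failed), or -1 if fork or
 * wait failed or the child did not exit normally.
 */
int run_and_wait(char *const argv[])
{
    pid_t pid = fork();                /* child shares pages copy-on-write */

    if (pid < 0)
        return -1;

    if (pid == 0) {
        execvp(argv[0], argv);         /* replaces the child's address space */
        _exit(127);                    /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the child calls exec almost immediately, very few of the pages shared at fork time are ever written, which is the case CoW is meant to make cheap.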

ebpf

Observation target: keep CoW memory operations to a minimum

copy_process
  cgroup_fork
  sched_fork
  dup_mm
  mm_init
  anon_vma_fork
$ cat /proc/kallsyms | grep fork
cgroup_fork
_do_fork
ftrace_pid_follow_sched_process_fork
ftrace_pid_follow_fork
event_enter__fork
trace_event_fields_sched_process_fork
trace_event_type_funcs_sched_process_fork
fork_init
anon_vma_fork
$ cat /proc/kallsyms | grep dup_mm
dup_mm
dup_mmap_sem

kernel/cgroup/cgroup.c

/**
 * cgroup_fork - initialize cgroup related fields during copy_process()
 * @child: pointer to task_struct of forking parent process.
 *
 * A task is associated with the init_css_set until cgroup_post_fork()
 * attaches it to the target css_set.
 */
void cgroup_fork(struct task_struct *child)
{
	RCU_INIT_POINTER(child->cgroups, &init_css_set);
	INIT_LIST_HEAD(&child->cg_list);
}

[PATCH 2/4] ftrace: Add 'function-fork' trace option

clone

and

/kernel/fork.c

/*
 *  'fork.c' contains the help-routines for the 'fork' system call
 * (see also entry.S and others).
 * Fork is rather simple, once you get the hang of it, but the memory
 * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
 */

github kernel-testexec ON-DEMAND-FORK

github kernel-testexec ON-DEMAND-FORK: a faster variant of fork
=> PTE
fork implementation, life cycle, etc.
==> helpful for web servers and databases

memory slab:
slabdbg
Kernel tracing and analysis with GDB

vfork

Before the clone system call existed, vfork was used to implement threads

vfork() is a special case of clone(2). It is used to create new processes without copying the
page tables of the parent process. It may be useful in performance-sensitive applications
where a child is created which then immediately issues an execve(2).
4.3BSD; POSIX.1-2001 (but marked OBSOLETE). POSIX.1-2008 removes the specification of vfork().
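
The "create a child which then immediately issues an execve(2)" pattern from the man page excerpt can be sketched as follows. `vfork_run` is a hypothetical helper name; the child does nothing except exec or _exit, since anything else is unsafe after vfork().

```c
#define _DEFAULT_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run the program at `path` with `argv` using vfork().
 * Returns the child's exit status (127 if exec failed), or -1 if vfork or
 * wait failed or the child did not exit normally.
 */
int vfork_run(const char *path, char *const argv[])
{
    pid_t pid = vfork();   /* parent is suspended until the child execs or exits */

    if (pid < 0)
        return -1;

    if (pid == 0) {
        execv(path, argv);
        _exit(127);        /* exec failed; _exit() avoids touching parent state */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```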

=> NPTL (IBM, Red Hat 1991 ~ 2001)
==> replaced only once a native implementation appeared

ftrace

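ftrace: function trace (most systems lack dynamic tracing)
Systems with dynamic tracing: Linux, macOS
Dynamic tracing on Windows: requires swapping the kernel

Operating system completeness
Mach microkernel: shipped only once it was complete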

glibc alloc

glibc : GNU Hurd ( OS )
=> Linux

garbage collection => Linux ( reclaim )

snooping

alloc TLSFXen

By 2006 Intel was already preparing for 64-bit.

longterm

longterm focuses on: completeness, conflicts between old and new compilers, device drivers (e.g. Wi-Fi)

redhat The Linux Virtual Memory System

Linux Kernel Design: Memory Management (Linux 核心設計: 記憶體管理)

rtenv+


Virtual Memory - Datacadamia

Datacadamia - os/memory/virtual

Linux kernel source code - rmap

anon_vma_fork - find anon_vma non-COW

k-level page table

Memory Layout on AArch64 Linux
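=> page size affects the number of levels
==> depends on the cost of translation and on the execution environment
ex:
web server : frequent reads of small data => small page table
heavy computation (engineering workloads) or massive data : page size 64KB => page faults rarely occur (pages are large enough) => translation cost drops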
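
How the page size determines the number of levels can be worked out with a little arithmetic. The sketch below assumes 8-byte descriptors, so one table page holds 2^(page_shift - 3) entries; the function name `pgtable_levels` is made up for this note.

```c
/*
 * Number of translation levels for a given page size (as a shift) and
 * virtual-address width, assuming 8-byte table entries.
 */
unsigned int pgtable_levels(unsigned int page_shift, unsigned int va_bits)
{
    unsigned int bits_per_level = page_shift - 3;     /* index bits per table */
    unsigned int index_bits = va_bits - page_shift;   /* bits left to translate */

    /* ceil(index_bits / bits_per_level) */
    return (index_bits + bits_per_level - 1) / bits_per_level;
}
```

With these assumptions, 4KB pages and a 48-bit address space need 4 levels, 64KB pages with 48 bits need 3, and 64KB pages with 42 bits need only 2, which is why a larger page size shortens the page table walk.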

Get free page (GFP) - fork.c

fork.c - search : GFP_
GFP - lwn.net
memory allocation guide

do_futex in fork.c
do_futex source code

ptrace

manual

Cost of creating a process - effect of the measurement method

Context Switch Latency experiment results
wiki-ncku : arm-linux

counting page faults

Unix getrusage function
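
One possible way to use getrusage() for the CoW question is to count the minor faults a child takes when it writes to memory inherited from its parent. This is a minimal sketch under assumptions: the helper names are invented for this note, and the exact count depends on the kernel and on transparent huge pages, so no particular number is claimed.

```c
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Minor (no-I/O) page faults taken so far by the calling process. */
static long minor_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/* Write one byte per page; volatile keeps the stores from being optimized out. */
static void touch_pages(volatile char *buf, size_t bytes, size_t page, char v)
{
    for (size_t off = 0; off < bytes; off += page)
        buf[off] = v;
}

/*
 * Touch `bytes` of heap memory in the parent, fork, and let the child write
 * to every page. Returns the number of minor faults the child observed
 * while writing, or -1 on error.
 */
long cow_faults_after_fork(size_t bytes)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    long delta = -1;
    int pipefd[2];
    char *buf = malloc(bytes);

    if (!buf)
        return -1;
    if (pipe(pipefd) != 0) {
        free(buf);
        return -1;
    }
    touch_pages(buf, bytes, page, 1);      /* make pages resident before fork */

    pid_t pid = fork();
    if (pid == 0) {
        long before = minor_faults();

        touch_pages(buf, bytes, page, 2);  /* writes to shared pages: CoW faults */
        delta = minor_faults() - before;
        if (write(pipefd[1], &delta, sizeof(delta)) != (ssize_t)sizeof(delta))
            _exit(1);
        _exit(0);
    }

    close(pipefd[1]);
    if (pid > 0) {
        if (read(pipefd[0], &delta, sizeof(delta)) != (ssize_t)sizeof(delta))
            delta = -1;
        waitpid(pid, NULL, 0);
    }
    close(pipefd[0]);
    free(buf);
    return delta;
}
```

Calling `cow_faults_after_fork(4 << 20)` and comparing the result with the number of pages in 4 MiB gives a rough indication of whether each first write in the child triggered its own fault.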

swap area ( swap file )

CS:APP - 9.8
At any point in time, the swap space bounds the total amount of virtual pages that can be allocated by the currently running processes.

Chapter 11 Swap Management

dirty CoW

Dirty COW (Dirty copy-on-write) is a computer security vulnerability for the Linux kernel that affected all Linux-based operating systems, including Android devices, that used older versions of the Linux kernel created before 2018. It is a local privilege escalation bug that exploits a race condition in the implementation of the copy-on-write mechanism in the kernel's memory-management subsystem. Computers and devices that still use the older kernels remain vulnerable.

wikipedia
IEEE - paper

Post-init read-only memory

lwn.net

btrfs - cow

What are some examples from Linux kernel source implementing copy-on-write feature?

Kernel same-page merging

wikipedia

OSTEP process API

cpu-api

MMU

How TRACE32® handles MMU

Virtual file system

wikipedia
tldp - intro vfs

CMU - vm system

chapter 9 - COW
exception handler and process

lwn.net

Sharing pages between mappings
The case of the overly anonymous anon_vma
Anonymous VMA naming patches

Patching until the COWs come home (part 1)
Patching until the COWs come home (part 2)
get_user_page - GUP

ZONE - 2.6

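Describes physical memory (various kinds of memory, possibly belonging to different devices) => lets struct page map onto it => operations need not care which device the memory actually belongs to
Understanding the Linux® Virtual Memory Manager

Physical Memory Model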

  • flat - linear
  • discont - nonlinear
  • sparse :
    • hotplug (the total amount grows; mmap is about usage, etc.)
    • NUMA - memory across different nodes
    • pfn <=> struct page
    • selected at kernel build time (config)
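
The pfn <=> struct page conversion is simplest in the flat model, where one array of descriptors covers all of physical memory. The following is a user-space toy, not kernel code: `struct toy_page`, the array size, and the offset value are all made up for illustration, with `TOY_PFN_OFFSET` playing the role of the kernel's `ARCH_PFN_OFFSET`.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct page; the fields are illustrative only. */
struct toy_page {
    unsigned long flags;
    int refcount;
};

#define TOY_NR_PAGES   1024UL
#define TOY_PFN_OFFSET 16UL    /* first valid pfn; the value is made up */

/* Flat model: a single descriptor array, one entry per page frame. */
static struct toy_page toy_mem_map[TOY_NR_PAGES];

/* pfn -> descriptor: plain array indexing, NULL if the pfn is out of range. */
struct toy_page *toy_pfn_to_page(unsigned long pfn)
{
    if (pfn < TOY_PFN_OFFSET || pfn - TOY_PFN_OFFSET >= TOY_NR_PAGES)
        return NULL;
    return &toy_mem_map[pfn - TOY_PFN_OFFSET];
}

/* descriptor -> pfn: pointer subtraction recovers the array index. */
unsigned long toy_page_to_pfn(const struct toy_page *page)
{
    return (unsigned long)(page - toy_mem_map) + TOY_PFN_OFFSET;
}
```

The sparse model keeps the same two-way mapping but splits the array into sections, so holes in physical memory and hot-plugged ranges do not need descriptors up front.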

lwn.net - Memory: the flat, the discontiguous, and the sparse

ZRAM

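Need to support more applications without being able to add physical memory => compress memory
=> zswap (rewritten from the swap code)
==> swap : PID = 0 (swapper, first edition of Unix)
===> swap itself is also a program; it has a PID and can be scheduled

VM only arrived with BSD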

DMA

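=> non-cacheable => no program has to process the data, so no cache is needed
=> memory is actually used; the CPU only learns the state of the page once it has finished
==> VA -> IPA -> PA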

NX bit

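Intel => writable and executable => buffer overflow attack
=> security considerations
Harvard architecture separates data and code
von Neumann architecture keeps data + code together
=> modern designs mix both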

LAZY TLB

SMP => cores are independent of one another; the TLB normally needs to be flushed (expensive)
=> cpu_tlbstate <- per-cpu data (an IPI triggers a TLB flush)
==> load balancing (a process may migrate between cores; the same applies to userspace)
===> lazy tlb ( TLBSTATE_LAZY )

PAE

Intel 8086 is 16-bit, but the addressable range can be much larger (the actual access stays the same)
==> to save memory, buffers had no Read Write Execute protection => problems
===> NX bit
====> shellcode
====> rootkit

kswapd

zone watermarks

When available memory in the system is low, the pageout daemon kswapd is woken up to start freeing pages (see Chapter 10). If the pressure is high, the process will free up memory synchronously, sometimes referred to as the direct-reclaim path. The parameters affecting pageout behaviour are similar to those by FreeBSD [McK96] and Solaris [MM01].

StackExchange - kswapd0 is taking a lot of cpu

The Kernel Swap Daemon (kswapd)

The name swap daemon is a bit of a misnomer as the daemon does more than just swap modified pages out to the swap file. Its task is to keep the memory management system operating efficiently. The Kernel swap daemon (kswapd) is started by the kernel init process at startup time and sits waiting for the kernel swap timer to periodically expire. Every time the timer expires, the swap daemon looks to see if the number of free pages in the system is getting too low. Free pages in the system are too low if:

page cache

Memory management - Page cache / Page frame / reclaiming Swapping / Swap cache