# Notes on Linux kernel COW
contributed by < [`linD026`](https://github.com/linD026) >
###### tags: `Linux kernel COW` , `linux2021`
---
## Contents
1. CoW explain
- [x] Concept
- [x] Brief explanation
- [x] process
- [ ] virtual memory
- [ ] page table
- [x] each level pt is ppf. (need example)
- [ ] walk to pte (maybe can write in 2. 3.)
- [ ] memory mapping
> **MAP_PRIVATE**
> Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
- [ ] Physical memory
- [ ] memory model
- [ ] Polish and check for correctness
2. Linux kernel memory region
- [ ] mm_struct
- [ ] vma
- [ ] struct page
- [ ] anon_vma
- [ ] reverse mapping ( sharing )
- [ ] address_space
3. process
- [ ] page fault
- [ ] function fork and clone
- [ ] ftrace `do_wp_page (break COW)`
- [ ] uclinux
- [ ] vfork
- [ ] cow (maybe 4.)
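
The MAP_PRIVATE behaviour quoted in the outline above can be checked from user space. This is a minimal sketch, assuming a Linux system with a writable `/tmp`; the helper name `private_mapping_leaves_file_untouched` and the temporary file name are illustrative and not taken from any source. It writes through a private mapping, then re-reads the file with `pread()` to see whether the write was carried through to the file.

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map a temporary file with MAP_PRIVATE, overwrite its first byte through
 * the mapping, then re-read the file itself.
 * Returns 1 if the file still holds the original byte while the mapping
 * shows the new one, 0 if the file changed, -1 on any error.
 */
int private_mapping_leaves_file_untouched(void)
{
    char path[] = "/tmp/cow-demo-XXXXXX";   /* hypothetical file name */
    const char original[] = "hello";
    char on_disk = 0;
    int result;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                           /* removed once fd is closed */

    if (write(fd, original, sizeof(original)) != (ssize_t)sizeof(original)) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, sizeof(original), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    map[0] = 'J';                           /* write goes to a private copy */

    if (pread(fd, &on_disk, 1, 0) != 1)
        result = -1;
    else
        result = (on_disk == 'h' && map[0] == 'J') ? 1 : 0;

    munmap(map, sizeof(original));
    close(fd);
    return result;
}
```
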
---
[Linux kernel /mm](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm?h=v5.12)
---
* page table lock
[pagemap / filemap](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?h=v5.12)
* [ /mm/memory.c - cow_page](https://elixir.bootlin.com/linux/v5.10.37/source/mm/memory.c#L4036)
* [/linux/mm.h - struct vm_fault](https://elixir.bootlin.com/linux/v5.10.37/source/include/linux/mm.h#L508)
* [/arch/arm/mm/context.c - check_and_switch_context()](https://elixir.bootlin.com/linux/v5.10.41/source/arch/arm/mm/context.c#L237)
page and github
---
[follow_page() - vma](https://elixir.bootlin.com/linux/v5.10.38/source/mm/gup.c#L756)
[Github - davidhcefx/Translate-Virtual-Address-To-Physical-Address-in-Linux-Kernel](https://github.com/davidhcefx/Translate-Virtual-Address-To-Physical-Address-in-Linux-Kernel)
[Github - lkb: The linux kernel programming guide - Data structure (including process)](https://github.com/arshad512/lkb)
[doc vm](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm?h=v5.12)
---
* [memory model](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/memory-model.rst?h=v5.12)
* [Architecture Page Table Helpers](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/arch_pgtable_helpers.rst?h=v5.12)
[page_owner - testing](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/page_owner.rst?h=v5.12)
[high memory](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/highmem.rst?h=v5.12)
[cleancache](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/cleancache.rst?h=v5.12)
[x86 pti testing](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/x86/pti.rst?h=v5.12)
struct page and struct vma
---
* [Linux kernel source tree /mm.h](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/mm.h?h=v5.12)
> [struct page and struct vma](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/mm_types.h?h=v5.12)
* [mm_type.h - struct page](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/mm_types.h?h=v5.10.35)
* [vma cache](https://zhuanlan.zhihu.com/p/99124666)
quiz3
---
* [linD026 - 2021q1 Homework3 (quiz3)](https://hackmd.io/@linD026/linux2021-quiz3#5-Linux-%E6%A0%B8%E5%BF%83-%E5%85%A7%E9%83%A8%E5%AF%A6%E4%BE%8B)
* [Chapter 7 Non-Contiguous Memory Allocation](https://www.kernel.org/doc/gorman/html/understand/understand010.html#toc48)
shared memory
---
* [An analysis of the Linux shared memory mechanism](https://hackmd.io/@sysprog/linux-shared-memory)
* [memfd_create](https://hackmd.io/@sysprog/linux-shared-memory#memfd_create)
* [dma_buf](https://hackmd.io/@sysprog/linux-shared-memory#dma_buf)
eBPF
---
* [Linux Kernel Design: Observing operating system behavior through eBPF](https://hackmd.io/@sysprog/linux-ebpf?type=view)
fork/exec
---
* [The past and present of the UNIX fork/exec system calls](https://hackmd.io/@sysprog/unix-fork-exec)
process
---
* [Linux Kernel Design: The process, more than just a unit of execution](https://hackmd.io/@sysprog/linux-process?type=view)
* [Chapter 4 Process Address Space](https://www.kernel.org/doc/gorman/html/understand/understand007.html)
POSIX shared memory
---
* [POSIX Shared Memory](http://logan.tw/posts/2018/01/07/posix-shared-memory/)
/dev/mem
---
* [The /dev/mem device in the Linux kernel](https://hackmd.io/@sysprog/linux-mem-device)
linux file system
---
* [Linux Kernel Design: File system concepts and implementation - File descriptors and open files](https://hackmd.io/@sysprog/linux-file-system?type=view#File-Descriptor-%E5%8F%8A%E9%96%8B%E5%95%9F%E7%9A%84%E6%AA%94%E6%A1%88)
> 
stackoverflow CoW based on page fault
---
* [Is copy-on-write not implemented based on page fault?](https://unix.stackexchange.com/questions/475617/is-copy-on-write-not-implemented-based-on-page-fault)
:::spoiler
> Copy on Write is implemented based on implicit interrupt generated by MMU (Memory Management Unit). Example reasons for page fault are as follows.
>
> A page fault is also an implicit interrupt generated by MMU but both are NOT same. Some reasons for a page fault are following.
>
> Invalid Memory access: A page fault occurs when a page desired by a user process is not present in memory. Page fault may occur if a process wants to access a virtual address that is not allocated to it (commonly known as segmentation fault). Or it may occur if a page is swapped out.
>
> Copy on Write: One reason for a page fault is Copy On write. During a fork() system call OS allocate same memory for both child and parent and marks the memory as read-only. This saves huge copy penalty. Assume the child calls an exec just after fork. If copy on write was not employed the entire copied page would be flushed during exec. When either parent or child try to write on that page it creates a page fault. Then OS allocate a new page and remove read-only restrictions.
>
> Copy on Demand: Another reason for a page fault is copy on demand. When a user process asks for a new page in its virtual address range OS may allocate a virtual address without allocating a physical address corresponding to it. When the process tries to access that page it generates a page fault. OS then allocate a physical page corresponding to the virtual page.
>
> So, a page fault may NOT need a fresh page to be allocated (in the case when it's generated from an error). But if a page fault needs a fresh page the page comes from the same pool of pages from where a page comes to server copy on write.
>
> malloc implementation is not related with copy on write.
>
> NOTE An Operating System can work without Copy on Write and Copy on Demand. Although it'll not perform well. But page fault mechanism is necesseary for an OS to support `paging'
:::
Evolutionary memory management
---
* [An Evolutionary Study of Linux Memory Management for Fun and Profit](https://www.usenix.org/system/files/conference/atc16/atc16_paper-huang.pdf)
Linux Intro memory management 101
---
* [Memory Management 101: Introduction to Memory Management in Linux](https://events19.linuxfoundation.org/wp-content/uploads/2017/12/MM-101-Introduction-to-Linux-Memory-Management-Christoph-Lameter-Jump-Trading-LLC-1.pdf)
* [Lecture 7: Memory Management CSE 120: Principles of Operating Systems](https://cseweb.ucsd.edu/classes/su09/cse120/lectures/Lecture7.pdf)
[halolinux copy on write](https://www.halolinux.us/kernel-architecture/copy-on-write.html)
Experiments, fork, and system
---
How can we demonstrate that fork uses CoW?
fork -> exec : optimization : no address space
fork -> normally an address space is allocated
fork -> exec -> wait
`system` only appeared with C89 / C99
eBPF
* **ftrace**
* [kprobe](https://www.kernel.org/doc/Documentation/kprobes.txt)
* jprobe
* hook
Observation target: CoW should perform as few memory operations as possible
```cpp
copy_process
cgroup_fork
sched_fork
dup_mm
mm_init
anon_vma_fork
```
```shell
$ cat /proc/kallsyms | grep fork
cgroup_fork
_do_fork
ftrace_pid_follow_sched_process_fork
ftrace_pid_follow_fork
event_enter__fork
trace_event_fields_sched_process_fork
trace_event_type_funcs_sched_process_fork
fork_init
anon_vma_fork
$ cat /proc/kallsyms | grep dup_mm
dup_mm
dup_mmap_sem
```
[kernel/cgroup/cgroup.c](https://elixir.bootlin.com/linux/v5.10.53/source/kernel/cgroup/cgroup.c#L5889)
```cpp
/**
* cgroup_fork - initialize cgroup related fields during copy_process()
* @child: pointer to task_struct of forking parent process.
*
* A task is associated with the init_css_set until cgroup_post_fork()
* attaches it to the target css_set.
*/
void cgroup_fork(struct task_struct *child)
{
RCU_INIT_POINTER(child->cgroups, &init_css_set);
INIT_LIST_HEAD(&child->cg_list);
}
```
[[PATCH 2/4] ftrace: Add 'function-fork' trace option](https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1364993.html)
clone
and
[/kernel/fork.c](https://elixir.bootlin.com/linux/v5.10.53/source/kernel/fork.c)
```cpp
/*
* 'fork.c' contains the help-routines for the 'fork' system call
* (see also entry.S and others).
* Fork is rather simple, once you get the hang of it, but the memory
* management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
*/
```
github kernel-testexec ON-DEMAND-FORK
---
[github kernel-testexec ON-DEMAND-FORK](https://github.com/magickaiyang/kernel-testexec)
: a faster version of fork
=> PTE
fork implementation, life cycle, etc.
==> helps web servers and databases
memory slab:
[slabdbg](https://github.com/NeatMonster/slabdbg)
[Kernel tracing and analysis with GDB](https://hackmd.io/@sysprog/user-mode-linux-env#%E6%90%AD%E9%85%8D-GDB-%E9%80%B2%E8%A1%8C%E6%A0%B8%E5%BF%83%E8%BF%BD%E8%B9%A4%E5%92%8C%E5%88%86%E6%9E%90)
vfork
---
Before the clone system call existed, vfork was used to implement threads
> vfork() is a special case of clone(2). It is used to create new processes without copying the
page tables of the parent process. It may be useful in performance-sensitive applications
where a child is created which then immediately issues an execve(2).
> 4.3BSD; POSIX.1-2001 (but marked OBSOLETE). POSIX.1-2008 removes the specification of vfork().
=> [NPTL (IBM, Red Hat 1991 ~ 2001)](https://man7.org/linux/man-pages/man7/nptl.7.html)
==> it was replaced only once a native implementation appeared
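
The use case in the man page quote above (a child that immediately issues an exec) can be sketched as follows. This is an illustration only: `run_true_with_vfork` is a hypothetical helper, and it assumes a `true` executable is reachable through `PATH`. After `vfork()` the child does nothing except call `execlp()` or `_exit()`.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Spawn `true` with vfork() + execlp().
 * Returns the child's exit status, or -1 if it could not be spawned
 * or did not exit normally.
 */
int run_true_with_vfork(void)
{
    int status;
    pid_t pid = vfork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: the parent is suspended until we exec or _exit. */
        execlp("true", "true", (char *)NULL);
        _exit(127);                  /* reached only if exec failed */
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```
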
ftrace
---
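ftrace: function tracer (most systems lack dynamic tracing)
Systems with dynamic tracing: Linux, macOS
Dynamic tracing on Windows: requires swapping the kernel
Completeness of the operating system
Mach microkernel: shipped only once complete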
glibc alloc
---
glibc : GNU Hurd ( OS )
=> Linux
garbage collection => Linux ( reclaim )
snooping
alloc [TLSF](http://www.gii.upv.es/tlsf/main/used),[Xen](http://xenbits.xensource.com/)
By 2006 Intel was already preparing for 64-bit.
longterm
---
longterm focuses on: completeness, conflicts between old and new compilers, device drivers (e.g. Wi-Fi)
redhat The Linux Virtual Memory System
---
[redhat The Linux Virtual Memory System](https://people.redhat.com/pladd/NUMA_Linux_VM_NYRHUG.pdf)
Linux Kernel Design: Memory Management
---
[Linux Kernel Design: Memory Management](https://hackmd.io/@sysprog/linux-memory)
rtenv+
---
[rtenv+](http://wiki.csie.ncku.edu.tw/embedded/rtenv)
Virtual Memory - DataCadamia
---
[DataCadamia - os/memory/virtual](https://datacadamia.com/os/memory/virtual)
Linux kernel source code - rmap
---
[anon_vma_fork - find anon_vma non-COW](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/mm/rmap.c?h=v5.10.36#n328)
k-level page table
---
[Memory Layout on AArch64 Linux](https://www.kernel.org/doc/html/latest/arm64/memory.html)
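=> page size affects the number of levels
==> the choice depends on the cost of translation and on the execution environment
ex:
web server: frequent reads of small data => small page table
heavy computation (engineering workloads) or very large data sets: 64KB page size => page faults are rare (pages are large enough) => less translation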
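
The link between page size and the number of levels noted above can be made concrete with a little arithmetic. This sketch assumes 8-byte table entries and that each table fills exactly one page, so every level resolves `page_shift - 3` bits of the virtual address. `pgtable_levels` is a hypothetical helper written for these notes, not a kernel function.

```c
/*
 * Number of translation levels needed to cover va_bits of virtual address
 * space with pages of 2^page_shift bytes.
 * Assumption: 8-byte entries, one page per table.
 */
unsigned pgtable_levels(unsigned va_bits, unsigned page_shift)
{
    unsigned bits_per_level = page_shift - 3;        /* log2(page size / 8) */
    unsigned bits_to_resolve = va_bits - page_shift; /* offset bits need no walk */

    /* Round up: a partially used top level still counts as a level. */
    return (bits_to_resolve + bits_per_level - 1) / bits_per_level;
}

/*
 * pgtable_levels(48, 12) == 4   (4KB pages, 48-bit VA)
 * pgtable_levels(39, 12) == 3   (4KB pages, 39-bit VA)
 * pgtable_levels(48, 16) == 3   (64KB pages, 48-bit VA)
 * pgtable_levels(42, 16) == 2   (64KB pages, 42-bit VA)
 */
```
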
Get free page (GFP) - fork.c
---
[fork.c - search : GFP_](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/fork.c?h=v5.10.36)
[GFP - lwn.net](https://lwn.net/Articles/274971/)
[memory allocation guide](https://www.kernel.org/doc/html/latest/core-api/memory-allocation.html)
[do_futex in fork.c](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/fork.c?h=v5.10.36#n1295)
[do_futex source code](https://elixir.bootlin.com/linux/v4.6/source/kernel/futex.c#L3147)
ptrace
---
[manual](https://man7.org/linux/man-pages/man2/ptrace.2.html)
Cost of creating a process - how the measurement method affects it
---
Context Switch Latency experiment results
[wiki-ncku : arm-linux](http://wiki.csie.ncku.edu.tw/embedded/arm-linux)
counting page faults
---
Unix `getrusage` function
swap area ( swap file )
---
> **CS:APP - 9.8**
> at any point in time, the swap space bounds the total amount of virtual pages that can be allocated by the currently running processes.
[Chapter 11 Swap Management](https://www.kernel.org/doc/gorman/html/understand/understand014.html)
dirty CoW
---
> Dirty COW (Dirty copy-on-write) is a computer security vulnerability for the Linux kernel that affected all Linux-based operating systems, including Android devices, that used older versions of the Linux kernel created before 2018. It is a local privilege escalation bug that exploits a race condition in the implementation of the copy-on-write mechanism in the kernel's memory-management subsystem. Computers and devices that still use the older kernels remain vulnerable.
[wikipedia](https://en.wikipedia.org/wiki/Dirty_COW)
[IEEE - paper](https://ieeexplore.ieee.org/document/8019988)
Post-init read-only memory
---
[lwn.net](https://lwn.net/Articles/666550/)
btrfs - cow
---
[What are some examples from Linux kernel source implementing copy-on-write feature?](https://www.quora.com/What-are-some-examples-from-Linux-kernel-source-implementing-copy-on-write-feature)
Kernel same-page merging
---
[wikipedia](https://en.wikipedia.org/wiki/Kernel_same-page_merging)
OSTEP process API
---
[cpu-api](https://pages.cs.wisc.edu/~remzi/OSTEP/cpu-api.pdf)
MMU
---
[How TRACE32® handles MMU](https://www.lauterbach.com/projects_download/newsletter_fr/mmu.pdf)
Virtual file system
---
[wikipedia](https://en.wikipedia.org/wiki/Virtual_file_system)
[tldp - intro vfs](https://tldp.org/LDP/intro-linux/html/sect_03_01.html)
CMU - vm system
---
[chapter 9 - COW](http://www.cs.cmu.edu/~213/lectures/18-vm-systems.pdf)
[exception handler and process](http://www.cs.cmu.edu/~213/lectures/19-ecf-procs.pdf)
lwn.net
---
[Sharing pages between mappings](https://lwn.net/Articles/717950/)
[The case of the overly anonymous anon_vma](https://lwn.net/Articles/383162/)
[Anonymous VMA naming patches](https://lwn.net/Articles/830218/)
[Patching until the COWs come home (part 1)](https://lwn.net/Articles/849638/)
[Patching until the COWs come home (part 2)](https://lwn.net/Articles/849876/)
[get_user_page - GUP](https://elixir.bootlin.com/linux/latest/source/mm/gup.c#L1873)
ZONE - 2.6
---
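Describes physical memory (the various kinds of memory, possibly belonging to different devices) => lets struct page map onto it => operations need not care which device the memory actually belongs to
[Understanding the Linux® Virtual Memory Manager](http://ptgmedia.pearsoncmg.com/images/0131453483/downloads/gorman_book.pdf)
Physical Memory Model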
---
* flat - linear
* discont - nonlinear
* sparse :
* hotplug (the total amount grows, whereas mmap merely uses it, etc.)
* NUMA - memory across different nodes
* pfn <=> struct page
* chosen at kernel build time (config)
[lwn.net - Memory: the flat, the discontiguous, and the sparse](https://lwn.net/Articles/789304/)
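
For the flat model, the pfn <=> struct page conversion listed above is plain array arithmetic; the kernel's generic version lives in `include/asm-generic/memory_model.h`. The following is a user-space toy sketch of the same idea. `struct page` here is a stand-in with a single field, and the values of `ARCH_PFN_OFFSET` and `NR_PAGES` are made up for illustration.

```c
#include <stddef.h>

/* Toy stand-in; the real struct page is far larger. */
struct page {
    unsigned long flags;
};

#define ARCH_PFN_OFFSET 0x80000UL   /* hypothetical first valid pfn */
#define NR_PAGES        1024        /* hypothetical amount of memory */

/* Flat model: one contiguous array with one entry per page frame. */
static struct page mem_map[NR_PAGES];

/* pfn -> struct page: index into mem_map relative to the first pfn. */
struct page *pfn_to_page(unsigned long pfn)
{
    return mem_map + (pfn - ARCH_PFN_OFFSET);
}

/* struct page -> pfn: pointer difference plus the first pfn. */
unsigned long page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```
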
[ZRAM](https://en.wikipedia.org/wiki/Zram)
---
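To support more applications when physical memory cannot actually be added => compress memory
=> zswap (rewritten from the swap code)
==> swap : PID = 0 (swapper, first edition of Unix)
===> swap itself is also a program; it has a PID and can be scheduled
VM only arrived with BSD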
DMA
---
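=> non-cacheable => no program processes the data, so no cache is needed
=> memory is actually used; the CPU learns the state of the page only after the operation finishes
==> VA -> IPA -> PA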
[NX bit](https://en.wikipedia.org/wiki/NX_bit)
---
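Intel => writable and executable => buffer overflow attack
=> security consideration
Harvard architecture separates data and code
von Neumann architecture keeps data + code together
=> modern designs mix both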
[LAZY TLB](https://www.slideshare.net/brendangregg/what-linux-can-learn-from-solaris-performance-and-viceversa/53-Lazy_TLB_Lazy_TLB_mode)
---
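SMP => cores are independent of each other; the TLB normally needs to be flushed (expensive)
=> cpu_tlbstate <- per-cpu data (a TLB flush is needed when an IPI arrives)
==> load balancing (a process may migrate between cores; the same holds for userspace)
===> lazy tlb ( TLBSTATE_LAZY )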
PAE
---
[Intel 8086](https://en.wikipedia.org/wiki/Intel_8086) is 16-bit, yet its addressable range can be much larger (although actual accesses remain the same)
==> to save memory, buffers get no Read / Write / Execute protection => problems
===> NX bit
====> shellcode
====> rootkit
kswapd
---
[zone watermarks](https://www.kernel.org/doc/gorman/html/understand/understand005.html)
> When available memory in the system is low, the pageout daemon kswapd is woken up to start freeing pages (see Chapter 10). If the pressure is high, the process will free up memory synchronously, sometimes referred to as the direct-reclaim path. The parameters affecting pageout behaviour are similar to those by FreeBSD [McK96] and Solaris [MM01].
[StackExchange - kswapd0 is taking a lot of cpu](https://askubuntu.com/questions/259739/kswapd0-is-taking-a-lot-of-cpu)
[The Kernel Swap Daemon (kswapd)](http://www.science.unitn.it/~fiorella/guidelinux/tlk/node39.html)
> The name swap daemon is a bit of a misnomer as the daemon does more than just swap modified pages out to the swap file. Its task is to keep the memory management system operating efficiently. The Kernel swap daemon (kswapd) is started by the kernel init process at startup time and sits waiting for the kernel swap timer to periodically expire. Every time the timer expires, the swap daemon looks to see if the number of free pages in the system is getting too low. Free pages in the system are too low if:
page cache
---
[Memory management - Page cache / Page frame / reclaiming Swapping / Swap cache](https://students.mimuw.edu.pl/ZSO/Wyklady/10_memory3/pageCache-Reclaiming.pdf)