# [Linux 核心 Copy On Write On-Demand Fork](https://hackmd.io/@linD026/Linux-kernel-COW-content)
contributed by < [`linD026`](https://github.com/linD026) >
###### tags: `Linux kernel COW` , `linux2022`
> 部份程式碼: [Linux 核心 Copy On Write On-Demand Fork 程式碼](https://hackmd.io/@linD026/Linux-kernel-COW-odf-code)
---
## [On-demand fork](https://dl.acm.org/doi/10.1145/3447786.3456258) 機制
```graphviz
digraph abstract {
node [shape = box]
rankdir = TB
label = "syscall - odfork() routine"
syscall [label = "syscall(439)\nkernel/fork.c:tfork()"]
_do_tfork [label = "_do_tfork()"]
copy_process_tfork [label = "copy_process_tfork()"]
copy_mm_tfork [label = "copy_mm_tfork()"]
dup_mm_tfork [label = "dup_mm_tfork()"]
dup_mmap_tfork [label = "dup_mmap_tfork()"]
copy_page_range_tfork [label = "mm/memory.c:copy_page_range_tfork()"]
copy_p4d_range_tfork [label = "copy_p4d_range_tfork()"]
copy_pud_range_tfork [label = "copy_pud_range_tfork()"]
copy_pmd_range_tfork [label = "copy_pmd_range_tfork()*"]
copy_pte_range [label = "vma does not cover the entire pte table\nif(table_start < vma->vm_start || table_end > vma->vm_end)"]
odf_set [label = "set write-protect to the pmd\nentry if the vma is writable\natomic64_inc(&table_page->pte_table_refcount)\nthen shares the table with the child:\nset_pmd_at()"]
syscall -> _do_tfork
-> copy_process_tfork
-> copy_mm_tfork
-> dup_mm_tfork
-> dup_mmap_tfork
-> copy_page_range_tfork
-> copy_p4d_range_tfork
-> copy_pud_range_tfork
-> copy_pmd_range_tfork
copy_pmd_range_tfork -> copy_pte_range [label = "can be fallback"]
copy_pmd_range_tfork -> odf_set
}
```
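Invoking odfork from userspace needs only the raw syscall number shown in the diagram; a minimal sketch (assuming odfork keeps `fork()`-like return semantics, which the routine above suggests; the surrounding program is mine, not from the paper):
```cpp
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/wait.h>

int main(void)
{
    /* 439 is the syscall number used by the odfork patch (see diagram above) */
    long pid = syscall(439);

    if (pid == 0) {
        _exit(0);               /* child: pte tables are still shared here */
    } else if (pid > 0) {
        waitpid((pid_t)pid, NULL, 0);
        printf("odfork child was %ld\n", pid);
    } else {
        perror("odfork");
        return 1;
    }
    return 0;
}
```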
### breaking COW - page fault
>[Virtual Memory in the IA-64 Linux Kernel](https://www.informit.com/articles/article.aspx?p=29961&seqNum=5)
```graphviz
digraph abstract {
node [shape = box]
rankdir = LR
compound = true
fault -> pgtable [lhead = cluster_handler]
subgraph cluster_handler {
label = "page fault handler"
pgtable [label = "pgtable field"]
vma [label = "vma field"]
pgtable -> vma
}
vma -> done [ltail = cluster_handler]
label = "page fault routine"
}
```
```graphviz
digraph abstract {
node [shape = box]
rankdir = TB
compound = true
do_t_fault [label = "arm/mm/fault.c:\ndo_translation_fault()"]
do_mm_fault [label = "TODO do_mm_fault()"]
exc_page_fault [label = "after v5.8+\nx86/mm/fault.c:\nexc_page_fault()"]
handle_page_fault [label = "handle_page_fault()"]
do_page_fault [label = "before v5.8+\nx86/mm/fault.c:\ndo_page_fault()"]
handle_mm_fault [label = "mm/memory.c:\nhandle_mm_fault()"]
hhandle_mm_fault [label = "{__handle_mm_fault()*|{<1>1|<2>2}}" shape = record]
tfork_one_pte_table [label = "{tfork_one_pte_table()|{<1>1|<2>2}}" shape = record]
copy_pte_range_tfork [label = "{copy_pte_range_tfork()|{<1>1|<2>2|<3>3}}" shape = record]
tfork_pte_offset_map [label = "tfork_pte_offset_map()\npass src_pmd_val\nand\naddr into func\nto get src_pte"]
tfork_pte_alloc_map_lock [label = "tfork_pte_alloc_map_lock()\nwhen src_pte points\nto the old table\nthen after this\ndst_pte will point\nto the new table"]
pte_alloc_map_lock [label = "pte_alloc_map_lock()\notherwise use this"]
copy_one_pte_tfork [label = "loop:\ncopy_one_pte_tfork()*"]
get_page_tfork [label = "get_page_tfork()\nsame as origin\nneed to consider compound page"]
tfork_pte_alloc [label = "tfork_pte_alloc()\n__tfork_pte_alloc()*"]
dereference_pte_table [label = "dereference_pte_table()\nDereferences a shared pte table\nand frees it if requested and the\ntable is unused"]
zap_one_pte_table [label = "zap_one_pte_table()\ndereferences every page\ndescribed by the shared pte table"]
handle_pte_fault [label = "handle_pte_fault()\nvma field"]
do_fault [label = "do_fault()" style = filled color = gray]
do_cow_page [label = "do_cow_page()\n TODO: what is going on here?\nallocate page to vmf->cow_page\ncopy vmf->page to vmf->cow_page"]
do_wp_page [label = "do_wp_page()" style = filled color = gray]
tfork_pte_offset_map -> tfork_pte_alloc_map_lock [style = invis]
tfork_pte_offset_map -> pte_alloc_map_lock [style = invis]
tfork_pte_alloc -> copy_one_pte_tfork [style = invis]
// arm
do_t_fault -> do_mm_fault
-> handle_mm_fault // uncheck
// x86
do_page_fault ->do_user_addr_fault
exc_page_fault -> handle_page_fault
handle_page_fault -> do_user_addr_fault
-> handle_mm_fault
-> hhandle_mm_fault
hhandle_mm_fault:1 -> tfork_one_pte_table
tfork_one_pte_table:1 -> copy_pte_range_tfork [label = "first"]
tfork_one_pte_table:2 -> dereference_pte_table [label = "second"]
copy_pte_range_tfork:2 -> tfork_pte_alloc_map_lock [label = "2.\npmd_iswrite(*dst_pmd)\nis false"]
copy_pte_range_tfork:2 -> pte_alloc_map_lock [label = "2.\nis true"]
copy_pte_range_tfork:3 -> copy_one_pte_tfork [label = "3."]
copy_pte_range_tfork:1 -> tfork_pte_offset_map [label = "1."]
tfork_pte_alloc_map_lock -> tfork_pte_alloc
copy_one_pte_tfork -> get_page_tfork
dereference_pte_table -> zap_one_pte_table
hhandle_mm_fault:2 -> handle_pte_fault
handle_pte_fault -> do_fault
handle_pte_fault -> do_wp_page [label = "try to write shared page\nIt is done by copying\nthe page to a\nnew address\nand decrementing\n the shared-page\ncounter for\nthe old page"]
do_fault -> do_cow_page [label = "vm_flags doesn't\nset VM_SHARED\nvm is not shared"]
{ rank = sink do_cow_page do_wp_page }
}
```
Comments in `tfork_one_pte_table()`:
> kyz: Handles an entire pte-level page table consisting of one private anon VMAs.
> The pmd lock should be held (the shared pte table is NOT locked), or mmap_sem write lock is held.
Some code and comments in `copy_one_pte_tfork()`:
> COW mapping and rss increase.
> ```cpp
> /*
>  * If it's a COW mapping, write protect it both
>  * in the parent and the child
>  * kyz: only protect in the child (the faulting process)
>  */
> if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> 	pte = pte_wrprotect(pte);
> }
> pte = pte_mkold(pte);
>
> page = vm_normal_page(vma, addr, pte);
> if (page) {
> 	BUG_ON(!PageAnon(page));
> 	get_page_tfork(page); //kyz :same as get_page()
> 	page_dup_rmap(page, false);
> 	rss[mm_counter(page)]++;
> #ifdef CONFIG_DEBUG_VM
> 	// printk("copy_one_pte_tfork: addr=%lx, (after) mapcount=%d, refcount=%d\n", addr, page_mapcount(page), page_ref_count(page));
> #endif
> }
> ```
:::warning
`copy_one_pte_tfork()` calls `vm_normal_page()`:
**vm_normal_page -- This function gets the "struct page" associated with a pte.**
```
"Special" mappings do not wish to be associated with a "struct page" (either
it doesn't exist, or it exists but they don't want to touch it). In this
case, NULL is returned here. "Normal" mappings do have a struct page.
There are 2 broad cases. Firstly, an architecture may define a pte_special()
pte bit, in which case this function is trivial. Secondly, an architecture
may not have a spare pte bit, which requires a more complicated scheme,
described below.
A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
special mapping (even if there are underlying and valid "struct pages").
COWed pages of a VM_PFNMAP are always normal.
The way we recognize COWed pages within VM_PFNMAP mappings is through the
rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit
set, and the vm_pgoff will point to the first PFN mapped: thus every special
mapping will always honor the rule
pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT)
And for normal mappings this is false.
This restricts such mappings to be a linear translation from virtual address
to pfn. To get around this restriction, we allow arbitrary mappings so long
as the vma is not a COW mapping; in that case, we know that all ptes are
special (because none can have been COWed).
In order to support COW of arbitrary special mappings, we have VM_MIXEDMAP.
VM_MIXEDMAP mappings can likewise contain memory with or without "struct
page" backing, however the difference is that _all_ pages with a struct
page (that is, those where pfn_valid is true) are refcounted and considered
normal pages by the VM. The disadvantage is that pages are refcounted
(which can be slower and simply not an option for some PFNMAP users). The
advantage is that we don't have to follow the strict linearity rule of
PFNMAP mappings in order to support COWable mappings.
```
:::
:::warning
**x86 - page fault change**
* [[PATCH 0/8] [v2] x86/mm: page fault handling cleanups](https://lore.kernel.org/lkml/20180928160219.3402F0AA@viggo.jf.intel.com/)
* `$ git show aa37c51b9421d`
* distinguish user and kernel mode
* [[PATCH 1/2] x86/context-tracking: Remove exception_enter/exit() from do_page_fault()](https://lore.kernel.org/lkml/20191227163612.10039-2-frederic@kernel.org/)
* after v5.7, `do_page_fault()` no longer exists.
* [x86/entry: Switch page fault exception to IDTENTRY_RAW]
* `$ git show 91eeafea1e4b7`
* add `handle_page_fault()`
* `do_page_fault()` -> `handle_page_fault()`
* mm: do page fault accounting in handle_mm_fault
* `$ git show bce617edecada007aee8610fbe2c14d10b8de2f6`
* [Patch series "mm: Page fault accounting cleanups", v5](https://lore.kernel.org/all/20200707225021.200906-11-peterx@redhat.com/T/)
```
x86/entry: Switch page fault exception to IDTENTRY_RAW
Convert page fault exceptions to IDTENTRY_RAW:
- Implement the C entry point with DEFINE_IDTENTRY_RAW
- Add the CR2 read into the exception handler
- Add the idtentry_enter/exit_cond_rcu() invocations in
  the regular page fault handler and in the async PF
part.
- Emit the ASM stub with DECLARE_IDTENTRY_RAW
- Remove the ASM idtentry in 64-bit
- Remove the CR2 read from 64-bit
- Remove the open coded ASM entry code in 32-bit
- Fix up the XEN/PV code
- Remove the old prototypes
No functional change.
```
https://lore.kernel.org/lkml/87mu0sr6s4.fsf@nanos.tec.linutronix.de/T/
---
**[x86/exception-tables.txt](https://www.kernel.org/doc/Documentation/x86/exception-tables.txt) still uses `do_page_fault()`** ([patch submitted](https://lore.kernel.org/linux-doc/20220318142536.116761-1-shiyn.lin@gmail.com/T/#u))
---
* `4819e15f740ec884a50bdc431d7f1e7638b6f7d9`
```
x86/mm/32: Bring back vmalloc faulting on x86_32
One can not simply remove vmalloc faulting on x86-32. Upstream
commit: 7f0a002b5a21 ("x86/mm: remove vmalloc faulting")
removed it on x86 alltogether because previously the
arch_sync_kernel_mappings() interface was introduced. This interface
added synchronization of vmalloc/ioremap page-table updates to all
page-tables in the system at creation time and was thought to make
vmalloc faulting obsolete.
```
:::
* `mm/memory.c:__handle_mm_fault()`
    * Swap is not handled?
```cpp
...
/*
if (unlikely(is_swap_pmd(orig_pmd))) {
VM_BUG_ON(thp_migration_supported() &&
!is_pmd_migration_entry(orig_pmd));
if (is_pmd_migration_entry(orig_pmd))
pmd_migration_entry_wait(mm, vmf.pmd);
return 0;
}
*/
...
```
* It only checks whether the pmd entry prohibits writes before running `tfork_one_pte_table()`. The paper instead says to first check the PTE table's reference count to confirm whether it is shared, then do the copy and decrement the shared PTE table's reference count.
```cpp
ptl = pmd_lock(mm, vmf.pmd);
// checks if the pmd entry prohibits writes
if((!pmd_none(*vmf.pmd)) && (!pmd_iswrite(*vmf.pmd)) && (vma->vm_flags & VM_WRITE)) {
#ifdef CONFIG_DEBUG_VM
printk("__handle_mm_fault: PID=%d, addr=%lx\n", current->pid, address);
#endif
tfork_one_pte_table(mm, vma, vmf.pmd, vmf.address);
}
spin_unlock(ptl);
```
* `mm/memory.c:tfork_one_pte_table()`
    * `dereference_pte_table()` handles the pte table behind the original pmd value, freeing it once its refcount drops to zero.
* The kernel relies on the value returned by `page_mapcount()` to determine how many mappings exist.
```cpp
/* kyz: Handles an entire pte-level page table consisting of one private anon VMAs
* The pmd lock should be held (the shared pte table is NOT locked), or mmap_sem write lock is held.
*/
void tfork_one_pte_table(struct mm_struct *mm, struct vm_area_struct *vma, pmd_t *dst_pmd, unsigned long addr) {
unsigned long table_start, table_end;
pmd_t orig_pmd_val;
//sanity checks
BUG_ON(pmd_none(*dst_pmd));
orig_pmd_val = *dst_pmd;
table_start = pte_table_start(addr);
table_end = pte_table_end(addr);
BUG_ON(table_start < vma->vm_start || table_end > vma->vm_end);
BUG_ON(is_vma_odf_incompatible(vma));
#ifdef CONFIG_DEBUG_VM
printk("tfork_one_pte_table: vm_start=%lx, vm_end=%lx, addr=%lx, end=%lx\n",
vma->vm_start, vma->vm_end, addr, end);
#endif
copy_pte_range_tfork(mm, dst_pmd, orig_pmd_val, vma, table_start, table_end);
dereference_pte_table(orig_pmd_val, true, mm, addr);
}
```
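Judging from `atomic64_inc(&table_page->pte_table_refcount)` in the first diagram and the node descriptions above, `dereference_pte_table()` is essentially a decrement-and-test; a sketch (the refcount field follows the odfork patch, everything else is an assumption):
```cpp
/* Sketch only: drop one reference on a shared pte table and, when the
 * count hits zero, drop every page it maps and optionally free it. */
static void dereference_pte_table_sketch(pmd_t orig_pmd_val, bool free_table,
					 struct mm_struct *mm, unsigned long addr)
{
	struct page *table_page = pmd_page(orig_pmd_val);

	if (atomic64_dec_and_test(&table_page->pte_table_refcount)) {
		/* last user: dereference every page the shared table maps */
		zap_one_pte_table(orig_pmd_val, addr, mm);	/* signature assumed */
		if (free_table)
			pte_free(mm, table_page);
	}
}
```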
---
## Cache
> [Translation caching: skip, don't walk (the page table)](https://dl.acm.org/doi/10.1145/1816038.1815970)
> [Performance analysis of the memory management unit under scale-out workloads](https://ieeexplore.ieee.org/document/6983034)
---
## Experiments
### `fork()` syscall
> This experiment measures only the cost of `fork()` itself; no page faults are taken.
The paper's timing is taken in the parent, not the child. On 5.11.0-41-generic, for example, the `fork()` times measured by the parent and the child are:
```
parent 138476.000000 ns
child 253248.000000 ns
```
The two differ by nearly a factor of two. To rule out a one-off coincidence: averaged over 1000 runs, the gap is still about 2x:

Looking at the source, `kernel_clone()` (`_do_fork()` in v5.6) first duplicates the task's data with `copy_process()`, then wakes the child with `wake_up_new_task()`, which involves scheduling and dispatch. The latter is most likely the main cause of the child-side time difference. Since the interest is the time spent on page-table operations, the paper uses the parent's time for its statistics.
Below are the measurements for v5.11 and for v5.6, the version used in the paper. odfork is indeed cheaper than `fork()`, but not by a large margin. More surprisingly, `fork()` on v5.11 and v5.15 costs more than on v5.6.

```
parent/v5.11 mean: 31708.306
parent/v5.6 mean: 24882.8
odfork mean: 22757.074
parent/v5.15 mean: 28729.226
```
v5.11 spends 31 ms and v5.15 spends 28 ms, while v5.6 takes only 25 ms; the difference is investigated with perf events. Using vmlinux together with perf events gives more detailed information. Below, perf events record the share of time spent in the relevant functions, drawn as pie charts to make the expensive ones easy to spot. (Function costs were first recorded with perf-annotate and ftrace, followed by a lot of manual bookkeeping; there is currently no good way to record the real cost of every function call, so the perf-event time shares are cross-checked against the costs obtained earlier.)

Excluding the perf-event overhead itself, `dup_mm()` takes the largest share, so drilling into that function shows `copy_page_range()` taking nearly half:

The figure below shows the function-call shares for v5.15; `dup_mm()` takes a larger share than on v5.6:

Among the functions `dup_mm()` calls, `vm_area_dup()` newly shows up compared with v5.6 (the function itself already exists in v5.6):

Checking the source code, v5.15 adds handling for data races:
* v5.6
```cpp
struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
{
	struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);

	if (new) {
		*new = *orig;
		INIT_LIST_HEAD(&new->anon_vma_chain);
	}
	return new;
}
```
* v5.15
```cpp
struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
{
	struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);

	if (new) {
		ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
		ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
		/*
		 * orig->shared.rb may be modified concurrently, but the clone
		 * will be reinitialized.
		 */
		*new = data_race(*orig);
		INIT_LIST_HEAD(&new->anon_vma_chain);
		new->vm_next = new->vm_prev = NULL;
	}
	return new;
}
```
:::success
**data_race()** keeps KCSAN (which instruments code via compiler-inserted checks) from reporting a data-race error on that access.
:::
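For example (a sketch; `data_race()` itself is the real annotation from `include/linux/compiler.h`, the surrounding names are made up):
```cpp
/* A racy but benign read: a stale value is tolerated, so tell KCSAN
 * not to report it. */
struct stats { int counter; };

static int read_counter(struct stats *s)
{
	return data_race(s->counter);	/* may race with a concurrent writer */
}
```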
:::warning
```
commit cda099b37d7165fc73a63961739acf026444cde2
Author: Qian Cai <cai@lca.pw>
Date: Wed Feb 19 11:00:54 2020 -0800
fork: Annotate a data race in vm_area_dup()
struct vm_area_struct could be accessed concurrently as noticed by
KCSAN,
write to 0xffff9cf8bba08ad8 of 8 bytes by task 14263 on cpu 35:
vma_interval_tree_insert+0x101/0x150:
rb_insert_augmented_cached at include/linux/rbtree_augmented.h:58
(inlined by) vma_interval_tree_insert at mm/interval_tree.c:23
__vma_link_file+0x6e/0xe0
__vma_link_file at mm/mmap.c:629
vma_link+0xa2/0x120
mmap_region+0x753/0xb90
do_mmap+0x45c/0x710
vm_mmap_pgoff+0xc0/0x130
ksys_mmap_pgoff+0x1d1/0x300
__x64_sys_mmap+0x33/0x40
do_syscall_64+0x91/0xc44
entry_SYSCALL_64_after_hwframe+0x49/0xbe
read to 0xffff9cf8bba08a80 of 200 bytes by task 14262 on cpu 122:
vm_area_dup+0x6a/0xe0
vm_area_dup at kernel/fork.c:362
__split_vma+0x72/0x2a0
__split_vma at mm/mmap.c:2661
split_vma+0x5a/0x80
mprotect_fixup+0x368/0x3f0
do_mprotect_pkey+0x263/0x420
__x64_sys_mprotect+0x51/0x70
do_syscall_64+0x91/0xc44
entry_SYSCALL_64_after_hwframe+0x49/0xbe
vm_area_dup() blindly copies all fields of original VMA to the new one.
This includes coping vm_area_struct::shared.rb which is normally
protected by i_mmap_lock. But this is fine because the read value will
be overwritten on the following __vma_link_file() under proper
protection. Thus, mark it as an intentional data race and insert a few
assertions for the fields that should not be modified concurrently.
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
```
---
What `data_race()` ends up calling, in `kernel/kcsan/core.c`:
```cpp
void kcsan_disable_current(void)
{
	++get_ctx()->disable_count;
}
EXPORT_SYMBOL(kcsan_disable_current);
...
static __always_inline struct kcsan_ctx *get_ctx(void)
{
	/*
	 * In interrupts, use raw_cpu_ptr to avoid unnecessary checks, that would
	 * also result in calls that generate warnings in uaccess regions.
	 */
	return in_task() ? &current->kcsan_ctx : raw_cpu_ptr(&kcsan_cpu_ctx);
}
```
:::
The figures below show odfork's function shares. odfork is driven via `echo 1 > /proc/self/use_odf` and a shell script, so the function names are not exactly what calling `syscall(439)` directly would give, but this does not affect the results (so far the difference appears to be in function names only).


Puzzlingly, with perf-event or ftrace the odfork-related functions show up down to `copy_page_range_tfork()`, but functions such as `copy_p4d_range_tfork()` never appear. It looks like this:
```
| |--8.53%--dup_mm_tfork
| | |
| | |--4.17%--copy_page_range_tfork
| | | |
| | | |--2.11%--copy_page_range
| | | | |
| | | | |--0.75%--__pmd_alloc
| | | | | alloc_pages_current
| | | | | __alloc_pages_nodemask
| | | | | |
| | | | | --0.57%--__memcg_kmem_charge
| | | | | __memcg_kmem_charge_memcg
| | | | | page_counter_try_charge
| | | | |
| | | | --0.54%--copy_pte_range
| | | |
| | | |--1.03%--copy_pte_range
| | | | |
| | | | --0.65%--__pte_alloc
| | | | pte_alloc_one
| | | | alloc_pages_current
| | | | __alloc_pages_nodemask
| | | |
| | | --0.52%--__pud_alloc
| | | get_zeroed_page
| | | __get_free_pages
| | | alloc_pages_current
| | | __alloc_pages_nodemask
| | | get_page_from_freelist
| | |
```
However, perf-annotate does find these functions, for example:
```cpp
: 964 copy_pud_range_tfork():
: 1152 next = pud_addr_end(addr, end);
0.00 : ffffffff81266d7f: lea 0x40000000(%r9),%r14
0.00 : ffffffff81266d86: and $0xffffffffc0000000,%r14
16.51 : ffffffff81266d8d: mov (%rax),%rcx
0.00 : ffffffff81266d90: add %rax,%rbx
0.00 : ffffffff81266d93: lea -0x1(%r14),%rdx
0.00 : ffffffff81266d97: mov %rcx,%rax
0.00 : ffffffff81266d9a: and $0xffffffffffffff9f,%rax
0.00 : ffffffff81266d9e: cmp -0x90(%rbp),%rdx
0.00 : ffffffff81266da5: jae ffffffff81266de4 <copy_page_range_tfork+0x2a4>
: 1162 pud_none_or_clear_bad():
: 608 return 0;
: 609 }
```
* https://diamon.org/ctf/
* https://lwn.net/Articles/593690/
* https://lwn.net/Articles/442113/
* http://cscads.rice.edu/linux-kernel-amd-pmu-support.pdf
* https://developer.ibm.com/tutorials/l-analyzing-performance-perf-annotate-trs/
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/design.txt?id=HEAD
---
* In the odfork paper's evaluation, `struct page` operations such as atomic reads/writes and memory copies account for a large share of the cost, with cacheline issues also involved. Can the number of these operations be lowered to reduce the time spent?
    * Fully evaluate the cost of `struct page` operations during `fork()` and page faults
    * Quantify with perf events, e.g. the number of atomic operations
---
**Differences: v5.15 vs. v5.6**
* `copy_present_pte()` and `copy_present_page()`
```
Copy a present and normal page if necessary.
NOTE! The usual case is that this doesn't need to do
anything, and can just return a positive value. That
will let the caller know that it can just increase
the page refcount and re-use the pte the traditional
way.
But _if_ we need to copy it because it needs to be
pinned in the parent (and the child should get its own
copy rather than just a reference to the same page),
we'll do that here and return zero to let the caller
know we're done.
And if we need a pre-allocated page but don't yet have
one, return a negative error to let the preallocation
code know so that it can do so outside the page table
lock.
```
```
commit 70e806e4e645019102d0e09d4933654fb5fb58ce
Author: Peter Xu <peterx@redhat.com>
Date: Fri Sep 25 18:25:59 2020 -0400
mm: Do early cow for pinned pages during fork() for ptes
This allows copy_pte_range() to do early cow if the pages were pinned on
the source mm.
Currently we don't have an accurate way to know whether a page is pinned
or not. The only thing we have is page_maybe_dma_pinned(). However
that's good enough for now. Especially, with the newly added
mm->has_pinned flag to make sure we won't affect processes that never
pinned any pages.
It would be easier if we can do GFP_KERNEL allocation within
copy_one_pte(). Unluckily, we can't because we're with the page table
locks held for both the parent and child processes. So the page
allocation needs to be done outside copy_one_pte().
Some trick is there in copy_present_pte(), majorly the wrprotect trick
to block concurrent fast-gup. Comments in the function should explain
better in place.
Oleg Nesterov reported a (probably harmless) bug during review that we
didn't reset entry.val properly in copy_pte_range() so that potentially
there's chance to call add_swap_count_continuation() multiple times on
the same swp entry. However that should be harmless since even if it
happens, the same function (add_swap_count_continuation()) will return
directly noticing that there're enough space for the swp counter. So
instead of a standalone stable patch, it is touched up in this patch
directly.
Link: https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
```
---
## RFC patch - development
* The COW mapping in `copy_present_page()`.
    * State while the function runs: both the child's and the parent's mm are locked.
    * Conditions: the page is in memory, not swapped out, and not pinned by the parent.
        * Why can the pte not be shared when the page is pinned by the parent?
    * When odfork COWs a pte table it does not write-protect the parent's ptes, so the child can read what the parent writes concurrently.
```cpp=
copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
struct page **prealloc)
{
struct mm_struct *src_mm = src_vma->vm_mm;
unsigned long vm_flags = src_vma->vm_flags;
pte_t pte = *src_pte;
struct page *page;
page = vm_normal_page(src_vma, addr, pte);
if (page) {
int retval;
retval = copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
addr, rss, prealloc, pte, page);
if (retval <= 0)
return retval;
get_page(page);
page_dup_rmap(page, false);
rss[mm_counter(page)]++;
}
/*
* If it's a COW mapping, write protect it both
* in the parent and the child
*/
if (is_cow_mapping(vm_flags) && pte_write(pte)) {
ptep_set_wrprotect(src_mm, addr, src_pte);
pte = pte_wrprotect(pte);
}
/*
* If it's a shared mapping, mark it clean in
* the child
*/
if (vm_flags & VM_SHARED)
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
if (!userfaultfd_wp(dst_vma))
pte = pte_clear_uffd_wp(pte);
set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
return 0;
}
```
---
### Debug
* [`static void check_mm(struct mm_struct *mm)`](https://elixir.bootlin.com/linux/v5.18-rc5/source/kernel/fork.c#L753)
* print out rss state
* [`clone3()` with `clone()`](https://lwn.net/Articles/792628/)
---
### Situations to consider (handle) when upstreaming
* `get_user_pages()`
* `struct page: bool pte_table_counter_pending` reduce (prevent second copy)
* pmd present (doesn't support [swapping](https://www.kernel.org/doc/gorman/html/understand/understand014.html))
    * Confirm whether it can be supported; if not, implement disabling the corresponding handling while odf is enabled
* allocate page outside of lock
* doesn't support huge page
* reduce the \*\_fixup function
* zap page table
* folio (compound_page): 7b230db3b8d37
* Improved implementation direction: use a syscall to set up the shared pte instead of procfs?
    * The original idea was to use the clone_flags from `include/linux/sched.h`, but all the flag bits are taken, so another way is needed.
:::warning
## [[RFC PATCH 0/6] Add support for shared PTEs across processes](https://lore.kernel.org/linux-mm/eb696699-0138-33c5-ad47-bfca7f6e9079@intel.com/T/) / [[PATCH v1] 4/12](https://lore.kernel.org/lkml/cover.1649370874.git.khalid.aziz@oracle.com/)
> This is a proposal to implement a mechanism in kernel to allow
userspace processes to opt into sharing PTEs. The proposal is to add
a new system call - mshare(), which can be used by a process to
create a region (we will call it mshare'd region) which can be used
by other processes to map same pages using shared PTEs. Other
process(es), assuming they have the right permissions, can then make
the mashare() system call to map the shared pages into their address
space using the shared PTEs. When a process is done using this
mshare'd region, it makes a mshare_unlink() system call to end its
access. When the last process accessing mshare'd region calls
mshare_unlink(), the mshare'd region is torn down and memory used by
it is freed.
:::
* vma overlap:
```bash
]$ cat /proc/$$/m
map_files/ maps mem mountinfo mounts mountstats
[vslda@arch-is-the-best ~]$ cat /proc/$$/maps
55df5bc78000-55df5bc98000 r--p 00000000 08:03 269265 /usr/bin/bash
55df5bc98000-55df5bd2b000 r-xp 00020000 08:03 269265 /usr/bin/bash
55df5bd2b000-55df5bd5a000 r--p 000b3000 08:03 269265 /usr/bin/bash
55df5bd5a000-55df5bd5d000 r--p 000e1000 08:03 269265 /usr/bin/bash
55df5bd5d000-55df5bd61000 rw-p 000e4000 08:03 269265 /usr/bin/bash
55df5bd61000-55df5bd6f000 rw-p 00000000 00:00 0
55df5cdc2000-55df5ce46000 rw-p 00000000 00:00 0 [heap]
7f048a755000-7f048aa3e000 r--p 00000000 08:03 292160 /usr/lib/locale/locale-archive
7f048aa3e000-7f048aa40000 rw-p 00000000 00:00 0
7f048aa40000-7f048aa56000 r--p 00000000 08:03 266461 /usr/lib/libncursesw.so.6.3
7f048aa56000-7f048aa97000 r-xp 00016000 08:03 266461 /usr/lib/libncursesw.so.6.3
7f048aa97000-7f048aaae000 r--p 00057000 08:03 266461 /usr/lib/libncursesw.so.6.3
7f048aaae000-7f048aaaf000 ---p 0006e000 08:03 266461 /usr/lib/libncursesw.so.6.3
7f048aaaf000-7f048aab3000 r--p 0006e000 08:03 266461 /usr/lib/libncursesw.so.6.3
7f048aab3000-7f048aab4000 rw-p 00072000 08:03 266461 /usr/lib/libncursesw.so.6.3
7f048aab4000-7f048aae0000 r--p 00000000 08:03 265591 /usr/lib/libc.so.6
7f048aae0000-7f048ac56000 r-xp 0002c000 08:03 265591 /usr/lib/libc.so.6
7f048ac56000-7f048acaa000 r--p 001a2000 08:03 265591 /usr/lib/libc.so.6
7f048acaa000-7f048acab000 ---p 001f6000 08:03 265591 /usr/lib/libc.so.6
7f048acab000-7f048acae000 r--p 001f6000 08:03 265591 /usr/lib/libc.so.6
7f048acae000-7f048acb1000 rw-p 001f9000 08:03 265591 /usr/lib/libc.so.6
7f048acb1000-7f048acbe000 rw-p 00000000 00:00 0
7f048acbe000-7f048acbf000 r--p 00000000 08:03 265596 /usr/lib/libdl.so.2
7f048acbf000-7f048acc0000 r-xp 00001000 08:03 265596 /usr/lib/libdl.so.2
7f048acc0000-7f048acc1000 r--p 00002000 08:03 265596 /usr/lib/libdl.so.2
7f048acc1000-7f048acc2000 r--p 00002000 08:03 265596 /usr/lib/libdl.so.2
7f048acc2000-7f048acc3000 rw-p 00003000 08:03 265596 /usr/lib/libdl.so.2
7f048acc3000-7f048acda000 r--p 00000000 08:03 269236 /usr/lib/libreadline.so.8.1
7f048acda000-7f048ad05000 r-xp 00017000 08:03 269236 /usr/lib/libreadline.so.8.1
7f048ad05000-7f048ad0f000 r--p 00042000 08:03 269236 /usr/lib/libreadline.so.8.1
7f048ad0f000-7f048ad12000 r--p 0004b000 08:03 269236 /usr/lib/libreadline.so.8.1
7f048ad12000-7f048ad18000 rw-p 0004e000 08:03 269236 /usr/lib/libreadline.so.8.1
7f048ad18000-7f048ad1b000 rw-p 00000000 00:00 0
7f048ad21000-7f048ad23000 r--p 00000000 08:03 265582 /usr/lib/ld-linux-x86-64.so.2
7f048ad23000-7f048ad4a000 r-xp 00002000 08:03 265582 /usr/lib/ld-linux-x86-64.so.2
7f048ad4a000-7f048ad55000 r--p 00029000 08:03 265582 /usr/lib/ld-linux-x86-64.so.2
7f048ad56000-7f048ad58000 r--p 00034000 08:03 265582 /usr/lib/ld-linux-x86-64.so.2
7f048ad58000-7f048ad5a000 rw-p 00036000 08:03 265582 /usr/lib/ld-linux-x86-64.so.2
7fff40f40000-7fff40f61000 rw-p 00000000 00:00 0 [stack]
7fff40f95000-7fff40f99000 r--p 00000000 00:00 0 [vvar]
7fff40f99000-7fff40f9b000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
```
* upstream:
```graphviz
digraph fork_fault_fsm {
node [shape = oval ]
label = "P/C pte Finite State Machine"
0 [label = "ref = 1\nowner = NULL\n[P-RSS = 1]"
color = gray style=filled]
1 [label = "ref = 2+\nowner = P\n[P-RSS = 1]\n[C-RSS = 0]" color = cyan2 style=filled]
2 [label = "ref = 1\nowner = NULL\n[P-RSS = 1]\n[C-RSS = 0]"]
3 [label = "ref = 2\nowner = C\n[ C-RSS = 1]\n[CC-RSS = 0]\n(only add ref\n from child of C)"]
5 [label = "ref = 1\nowner = P\n[P-RSS = 1]\n[C-RSS = 1]"]
COW_PTE_C [label = "COW PTE of C" color = cyan2 style=filled shape = box]
0 -> 1 [label = " P fork()"]
1 -> 2 [label = "P\npage fault"]
2 -> 3 [label = "C fork()"]
3 -> COW_PTE_C
2 -> 0 [label = "C\n page fault"]
1 -> 1 [label = "P/C\nfork()"]
1 -> 5 [label = "C\npage fault\n"]
5 -> 1 [label = "P fork()"]
5 -> 0 [label = "P\npage fault"]
{rank=same 1 3 }
{rank=same 0}
{rank=same COW_PTE_C}
}
```
:::success
**Summary**
* The pte page uses ownership (held by a vma) and a refcount (only within the cow pgtable) to maintain the cow pgtable state.
> [name=linD026][time=Tue, Mar 22, 2022 9:48 PM]
:::
- [x] vma:
    - Each process has its own vma for a given struct page.
- [x] Need a flag that can mark odf, preferably without adding a new field
    - [x] Candidate: store a src-vma pointer in the pte page.
        - **Solves the problem that, when multiple vmas reference the same pte table (odf falls back to copy_pte_range whenever the table extends beyond the vma), a single fork takes multiple references.**
        - **Ownership**: use pte->share_pte_owner as the ownership pointer. At fork it is set to the parent; only when the parent takes a page fault or exits first is the owner slot vacated while the table stays shared (->share_pte_owner becomes NULL). Other children only touch the refcount at fork or page fault, but if either of those happens while the owner slot is empty, that child takes over as owner (ownership transfer). See the sketch after this list.
        - The lifetime ends only when the owner releases the table and the refcount is one.
        - This prevents the same process (the parent) from incrementing the refcount for the same pte table multiple times across different vmas, and solves the lifetime problems of the other approaches.
        - Drawback: costs one pointer in the pte page and the vma.
    - [ ] `VM_HIGH_ARCH_BIT_5`, but 64-bit only
- With COW, the vma is writable but the pte is not
    - Keep the original design as far as possible, but move the check from the pte to the pmd
    - Current state: vma set up plus pmd non-writable means the odf state
- The original odf author's scheme: use vma->pending (flag) and pmd writability to determine the pte state:
    - vma set / pmd writable: ordinary fork
    - vma set / pmd non-writable: after odf, untouched (the odf state)
    - vma unset / pmd writable: vma is old, pte table is new (created by page fault: refcount is 0?)
    - vma unset / pmd non-writable: shared pte table, vma gone through odf
- zap_pmd_range:
    - vma set && pmd non-writable (odf state):
        - if the vma's range lies entirely within the pmd, dec pmd:pte_refcount
    - vma unset || pmd writable:
        - do normal zap_pte_range
        - if vma unset (vma is old): dec pmd:pte_refcount
- upstream: vma_is_share_pte() could instead check whether any other vma shares this pmd entry (pte table); if so:
    - handle them together, then set a flag on each processed vma marking it done.
    - During copying, vmas are copied front to back, so only later vmas and whether the pte table the current vma covers already holds a refcount need handling
    - If a vma flag marks "already processed", we must handle the case where a later fork finds all vmas already marked
    - See how functions such as copy_pte_range/copy_present_page handle this
    - !test_bit(MMF_HAS_PINNED, ...) return false;
    - odf handles this case as: table_start < vm_start || table_end > vma->vm_end : copy_pte_range()
- [x] pte copy:
    - [x] Problem: pte copy handling today is more complex than in 5.6; reuse the original copy_pte_range as much as possible?
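A sketch of the ownership/refcount rules proposed above (all names here are hypothetical, including the container; in the real plan `share_pte_owner` lives with the pte table's `struct page`):
```cpp
/* Sketch: ownership + refcount state kept per shared pte table. */
struct pte_table_state {
	struct vm_area_struct *share_pte_owner;	/* NULL once the owner is gone */
	atomic_t refcount;			/* number of sharers */
};

/* fork: the parent becomes (or stays) the owner; every sharer takes a ref */
static void cow_pte_fork_sketch(struct pte_table_state *s,
				struct vm_area_struct *parent)
{
	if (!s->share_pte_owner)
		s->share_pte_owner = parent;
	atomic_inc(&s->refcount);
}

/* the owner page-faults or exits first: vacate the owner slot, keep sharing */
static void cow_pte_owner_leave_sketch(struct pte_table_state *s)
{
	s->share_pte_owner = NULL;
}

/* a child touching the table while the owner slot is empty takes over */
static void cow_pte_take_over_sketch(struct pte_table_state *s,
				     struct vm_area_struct *child)
{
	if (!s->share_pte_owner)
		s->share_pte_owner = child;
}
```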
:::warning
**Current implementation direction**
The pmd refcount may not manage pte operations as well as expected (it is still not fully clear under which vma/pgtable situations the overall refcount gets manipulated), so for now the refcount is moved onto the pte table's struct page. Also, split the pmd and pte page-fault handling into separate functions (currently page fault and unmap call the same function) to simplify maintenance.
* Add handle_pmd_fault to the page fault path
* Keep state only in mm_struct to mark which processes have gone through odf; leave the vma state as-is and preserve the original COW interplay between pte and vma (page faults, etc.) as much as possible. Confine odf to pmd operations.
---
The v5.15 port is incomplete; the implementation should switch its baseline for comparison to on-demand-fork.
:::
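The checklist below mentions `pmd_get_pte`/`pmd_put_pte`; under the direction above (refcount moved to the pte table's `struct page`), they could look roughly like this (a sketch; `pte_refcount` is the patch's own field, the free helper is hypothetical):
```cpp
/* Sketch: take/drop a reference on the pte table behind a pmd entry. */
static inline void pmd_get_pte(pmd_t *pmd)
{
	atomic_inc(&pmd_page(*pmd)->pte_refcount);
}

static inline void pmd_put_pte(struct mm_struct *mm, pmd_t *pmd,
			       unsigned long addr)
{
	struct page *table = pmd_page(*pmd);

	if (atomic_dec_and_test(&table->pte_refcount))
		free_cow_pte_table(mm, pmd, addr);	/* hypothetical helper */
}
```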
- [x] copy_pmd_range:
- [x] pmd_get_pte ( pmd:struct page:pte_refcount )
- [x] pmd_put_pte
- [x] page fault
- [x] copy_pte_range:
    - inner: pte (dst/src) lock; the src_pmd must be locked outside
- do the same thing as forking
- [x] rename to cow pte
- [x] refcount
- [x] share
- [x] reference count of cow pgtable
- [ ] ./include/linux/pgtable.h:int pmd_free_pte_page(pmd_t *pmd, unsigned long addr);
- [ ] ./mm/memory.c:static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
- [ ] ./mm/vmalloc.c: if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
- [x] unmap:zap_page_range:zap_pmd_range
- [x] Refer to the vma handling in [[PATCH v1] Add support for shared PTEs across processes](https://lore.kernel.org/lkml/cover.1649370874.git.khalid.aziz@oracle.com/)
- [x] **add abort_share_pte() function**
    - Unlike `handle_pmd_fault()`, this function restores all shared-pte setup when the vma undergoes operations such as split/unmap, so the shared-pte state is not corrupted.
    - `handle_pmd_fault()`, by contrast, handles the pmd entry (shared pte) of a single address.
- Fix all the TODO
- [x] detect lock or not to avoid dead lock ( mm_lock / split lock )
- [x] consider the fast path of if there is no any share pte in the vma
- [x] get all the pmd of vma included.
- [ ] move_vma
- do_munmap
- from unmap_page_range
- [ ] `oom_kill.c:__oom_reap_task_mm`
- [x] `memory.c:unmap_single_vma`
- move_page_tables
- remap
- zap_detail set: if unmap share page and keep private page
- [x] `exit_mmap()`/`unmap_region()` first call `unmap_vmas()` (`unmap_page_range()`: `handle_cow_pte()`), then call `free_pgtables()`
```bash
$ grep "free_pgtables" -r mm
mm/internal.h:void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
mm/huge_memory.c: * The destination pmd shouldn't be established, free_pgtables()
mm/memory.c:void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
Binary file mm/memory.o matches
mm/mremap.c: * The destination pmd shouldn't be established, free_pgtables()
mm/mremap.c: * The destination pud shouldn't be established, free_pgtables()
mm/mremap.c: * The destination pud shouldn't be established, free_pgtables()
mm/mmap.c: free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
mm/mmap.c: free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
Binary file mm/mmap.o matches
```
- [x] rss: ownership transfer
    - [ ] Non-owners do not increment; when breaking COW, if ownership is to be taken, the count must be made up.
- [ ] swap:
- [ ] do_swap_page() for handling COW on PTEs during swapin directly
- [ ] huge page:
- [ ] do_huge_pmd_wp_page() for handling COW on PMD-mapped THP during write faults
- [ ] atomic to refcount (see the sketch after this list)
    - [ ] see [refcount-vs-atomic.rst](https://www.kernel.org/doc/Documentation/core-api/refcount-vs-atomic.rst)
- [x] Find an old clone flag that can be reused
- [[PATCH v2 1/2] fork: add clone3](https://lwn.net/ml/linux-kernel/20190603144331.16760-1-christian@brauner.io/)
> We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
> free flag from clone().
>
> Independent of the CLONE_PIDFD patchset a time namespace has been discussed
> at Linux Plumber Conference last year and has been sent out and reviewed
> (cf. [5]). It is expected that it will go upstream in the not too distant
> future. However, it relies on the addition of the CLONE_NEWTIME flag to
> clone(). The only other good candidate - CLONE_DETACHED - is currently not
> recyclable as we have identified at least two large or widely used
> codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
> CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
> blocked. clone3() has the advantage that it will unblock this patchset
> again.
* To be resolved:
- [ ] gup
- [ ] reclaim
- [ ] anon mmap
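For the atomic-to-refcount item above, the conversion would look roughly like this (a sketch; `refcount_t` saturates instead of wrapping on overflow, which is the point of the document linked in that item):
```cpp
#include <linux/refcount.h>

/* Sketch: the shared pte table's counter as a refcount_t. */
static refcount_t pte_table_refcount;

static void pte_table_init_sketch(void)
{
	refcount_set(&pte_table_refcount, 1);	/* the owner holds the first ref */
}

static void pte_table_share_sketch(void)
{
	refcount_inc(&pte_table_refcount);	/* WARNs on 0 -> 1, unlike atomic_inc() */
}

static bool pte_table_unshare_sketch(void)
{
	/* true when the last reference drops and the table can be freed */
	return refcount_dec_and_test(&pte_table_refcount);
}
```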
### test booting
* [REPORT](https://hackmd.io/@linD026/odfork-report)
:::success
**Material**
* [[PATCH v3 02/15] mm: introduce is_huge_pmd() helper](https://lwn.net/ml/linux-kernel/20211110105428.32458-3-zhengqi.arch@bytedance.com/)
```
Currently we have some times the following judgments repeated in the
code:
is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)
which is to determine whether the *pmd is a huge pmd, so introduce
is_huge_pmd() helper to deduplicate them.
```
:::
:::warning
* Solution: wrap share_pte_owner in a union.
```cpp
+ union {
+ pgtable_t pmd_huge_pte; /* protected by page->ptl */
+ /* share pgtable */
+ struct vm_area_struct *share_pte_owner;
+ };
```
```cpp
In file included from ./include/linux/atomic/atomic-instrumented.h:20,
from ./include/linux/atomic.h:82,
from ./include/linux/crypto.h:15,
from arch/x86/kernel/asm-offsets.c:9:
./include/linux/build_bug.h:78:41: error: static assertion failed: "offsetof(struct page, _mapcount) == offsetof(struct folio, _mapcount)"
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~~~~~~~~~~~
./include/linux/build_bug.h:77:34: note: in expansion of macro ‘__static_assert’
77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
| ^~~~~~~~~~~~~~~
./include/linux/mm_types.h:263:9: note: in expansion of macro ‘static_assert’
263 | static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
| ^~~~~~~~~~~~~
./include/linux/mm_types.h:270:1: note: in expansion of macro ‘FOLIO_MATCH’
270 | FOLIO_MATCH(_mapcount, _mapcount);
| ^~~~~~~~~~~
./include/linux/build_bug.h:78:41: error: static assertion failed: "offsetof(struct page, _refcount) == offsetof(struct folio, _refcount)"
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~~~~~~~~~~~
./include/linux/build_bug.h:77:34: note: in expansion of macro ‘__static_assert’
77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
| ^~~~~~~~~~~~~~~
./include/linux/mm_types.h:263:9: note: in expansion of macro ‘static_assert’
263 | static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
| ^~~~~~~~~~~~~
./include/linux/mm_types.h:271:1: note: in expansion of macro ‘FOLIO_MATCH’
271 | FOLIO_MATCH(_refcount, _refcount);
| ^~~~~~~~~~~
make[1]: *** [scripts/Makefile.build:121: arch/
```
:::
* [Re: Free user PTE page table pages](https://lore.kernel.org/linux-mm/20211110125601.GQ1740502@nvidia.com/)
```
So, this approach basically adds two atomics on every PTE map
If I have it right the reason that zap cannot clean the PTEs today is
because zap cannot obtain the mmap lock due to a lock ordering issue
with the inode lock vs mmap lock.
If it could obtain the mmap lock then it could do the zap using the
write side as unmapping a vma does.
Rather than adding a new "lock" to ever PTE I wonder if it would be
more efficient to break up the mmap lock and introduce a specific
rwsem for the page table itself, in addition to the PTL. Currently the
mmap lock is protecting both the vma list and the page table.
I think that would allow the lock ordering issue to be resolved and
zap could obtain a page table rwsem.
Compared to two atomics per PTE this would just be two atomic per
page table walk operation, it is conceptually a lot simpler, and would
allow freeing all the page table levels, not just PTEs.
```
### Anon page
The parent owns the anon page; during fork it is also COWed, but:
* if the child wants to write/read:
    * the anon page does not belong to the child
    * but it is shared between parent and child
    * no one can access it
* do the same thing as for a normal page: copy it
* odfork deals with it as follows:
> Anonymous mappings are backed by physical pages, which
> are also reference counted. The kernel uses this reference
> count to decide whether a page can be freed. Because On-demand fork defers processing PTE tables, which includes
> incrementing the page reference counter, it must prevent
> pages from being prematurely freed when (a) a page fault
> gets a new page for copy-on-write and (b) the memory region
> gets unmapped or remapped.
[Multiple copy-on-write branches of an anonymous memory segment](https://lore.kernel.org/lkml/9492A0F4-990D-44F0-B10A-1B55D25995B6@gmail.com/T/)
---
### LinuxCon
The fork system call may use copy-on-write to share the memory among parent and child processes.
Last year, Kaiyang Zhao brought out the idea on-demand fork, doing copy-on-write on the page table to reduce the response time.
It shares the last level of page table PTE among parent and child processes.
In this presentation, I will talk about how this works, what I have improved on, and the upstreaming experience I have tried.
---
### Perf
```log
vslda@arch-is-the-best ~> sudo perf stat --repeat 1000 -e cache-misses,cache-references,instructions,cycles ./benchmark-mmap-fork
Performance counter stats for './benchmark-mmap-fork' (1000 runs):
5,031 cache-misses # 7.313 % of all cache refs ( +- 0.70% )
65,398 cache-references ( +- 0.20% )
1,073,080 instructions # 0.79 insn per cycle ( +- 0.10% )
1,367,972 cycles ( +- 0.16% )
0.0004505 +- 0.0000350 seconds time elapsed ( +- 7.77% )
vslda@arch-is-the-best ~> sudo perf stat --repeat 1000 -e cache-misses,cache-references,instructions,cycles ./benchmark-mmap-sfork
Performance counter stats for './benchmark-mmap-sfork' (1000 runs):
2,200 cache-misses # 3.173 % of all cache refs ( +- 1.34% )
71,335 cache-references ( +- 0.19% )
1,094,156 instructions # 0.80 insn per cycle ( +- 0.10% )
1,346,445 cycles ( +- 0.17% )
0.0004468 +- 0.0000303 seconds time elapsed ( +- 6.78% )
```
### SWAP BUG
```log
unreferenced object 0xffff888102e75a00 (size 256):
comm "swapper/0", pid 1, jiffies 4294667738 (age 2254.798s)
hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 48 00 00 00 00 00 00 00 ........H.......
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<000000001bfb752f>] msr_build_context+0x40/0x110
[<000000009a7168da>] pm_check_save_msr+0x62/0x70
[<00000000e39389f5>] do_one_initcall+0x53/0x2f0
[<00000000a1f6694c>] kernel_init_freeable+0x248/0x290
[<00000000b785bd6c>] kernel_init+0x11/0x120
[<000000005f509dea>] ret_from_fork+0x1f/0x30
```
---
## TODO
* 3/24 ~ 4/1
    - [ ] Prepare refcount material and ask `songmuchun@bytedance.com`
- [[PATCH] Free user PTE page](https://lore.kernel.org/linux-mm/20211110105428.32458-14-zhengqi.arch@bytedance.com/)
- [[PATCH v3]](https://lore.kernel.org/linux-mm/20211110125601.GQ1740502@nvidia.com/)
- [HugeTLB](https://zhuanlan.zhihu.com/p/392703566)
* 5/9 odfork - upstream
* mmap bad page
* tlb
* swap
* benchmark - webserver
* stress ng
* snapshot
* error report
* mmap, sfork, then write => both segfault
* record time => Bad page map
---
- [x] 1. Introduce ODF on Slack
- [x] 2. Run ODF (the v5.6 version)
    - [x] Measure the current performance difference between `fork()` and `odfork()`
    - [x] Survey recent `fork()` improvements and changes relative to v5.6, to ease the port to v5.15
- [x] 3. Evaluate porting ODF to [linux-5.15+](https://kernel.org/), which needs to consider:
    - [x] The author already posted a patch in [Shared page tables during fork](https://lwn.net/Articles/861547/), but after a few replies there was no further progress, so the implementation will build on that patch
    - [x] Related pte refcount patches: [Some upcoming memory-management patches](https://lwn.net/Articles/875587/) / [Free user PTE page table pages](https://lwn.net/ml/linux-kernel/20211110105428.32458-1-zhengqi.arch@bytedance.com/)
        - [x] [[PATCH v3 00/15] Free user PTE page table pages](https://lore.kernel.org/linux-mm/20211110084057.27676-1-zhengqi.arch@bytedance.com/)
- [ ] 4. `posix_spawn()` implementation differences and related `fork()` research
    - [ ] https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf
    - [ ] https://lwn.net/Articles/360747/
    - [ ] https://lwn.net/Articles/360509/
    - [ ] http://lkml.iu.edu/hypermail/linux/kernel/1910.1/03159.html
- [ ] 5. Read the introduction to [Gramine Library OS](https://github.com/gramineproject/gramine) (or a similar unikernel)
    - [ ] Try porting ODF
---
## `posix_spawn()`
> The posix_spawn() and posix_spawnp() functions are used to create a new child process that
> executes a specified file. These functions were specified by POSIX to provide a **standardized**
> **method** of creating new processes on machines that lack the capability to support the fork(2)
> system call. **These machines are generally small, embedded systems lacking MMU support.**
- POSIX standardized method: which specification does this follow?
- lack MMU support
> The posix_spawn() and posix_spawnp() functions provide the functionality of a combined fork(2)
> and exec(3), **with some optional housekeeping steps in the child process before the exec(3)**.
> These functions are not meant to replace the fork(2) and execve(2) system calls. In fact, they
> provide only a subset of the functionality that can be achieved by using the system calls.
- [posix: New Linux posix_spawn{p} implementation](https://sourceware.org/git/?p=glibc.git;a=commit;h=9ff72da471a509a8c19791efe469f47fa6977410)
- [[glibc.git]/posix/spawn.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=posix/spawn.c;h=103ed7c4ebcd10d9ca5e81801ec62c68fe4354a3;hb=03c9c4fce4fefbb34e65723467d86cb68739a9d1)
- [What's the structure of glibc's source code](https://stackoverflow.com/questions/11347103/whats-the-structure-of-glibcs-source-code)
- [Issue 819228: Consider using posix_spawn() on Linux](https://bugs.chromium.org/p/chromium/issues/detail?id=819228)
- some benchmarks
`posix_spawn()` does use a `fork()`-style system call. Tracing the manual page's example program with `$ strace ./test date` proves it:
```
clone(child_stack=0x7f9dc6c5bff0, flags=CLONE_VM|CLONE_VFORK|SIGCHLD) = 9651
munmap(0x7f9dc6c53000, 36864) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}) = 0
brk(NULL) = 0x565421bb4000
brk(0x565421bd5000) = 0x565421bd5000
write(1, "PID of child: 9651\n", 19PID of child: 9651
) = 19
wait4(9651, Tue Feb 8 01:25:53 CST 2022
[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WSTOPPED|WCONTINUED, NULL) = 9651
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=9651, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
write(1, "Child status: exited, status=0\n", 31Child status: exited, status=0
```
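The traced program corresponds to a minimal `posix_spawn()` use along these lines (a sketch modeled on the manual page's example):
```cpp
#include <spawn.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>

extern char **environ;

int main(int argc, char *argv[])
{
    pid_t pid;

    if (argc < 2)
        return EXIT_FAILURE;

    /* no file actions and no attributes: pass NULL for both */
    int ret = posix_spawnp(&pid, argv[1], NULL, NULL, &argv[1], environ);
    if (ret != 0) {
        fprintf(stderr, "posix_spawnp: %s\n", strerror(ret));
        return EXIT_FAILURE;
    }
    printf("PID of child: %d\n", (int)pid);
    waitpid(pid, NULL, 0);
    return 0;
}
```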
glibc's Linux implementation also documents this explicitly: [/sysdeps/unix/sysv/linux/spawni.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/spawni.c;h=d703485e3fb898dc65986d3e1cd9c1e7b8593abe;hb=03c9c4fce4fefbb34e65723467d86cb68739a9d1); the source (first calling `__clone_internal()`) also uses system calls such as [`clone()`](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/clone-internal.c;hb=03c9c4fce4fefbb34e65723467d86cb68739a9d1) to do the work. From the Linux kernel's point of view, `clone()`, `fork()`, and friends all end up calling `kernel_clone()`.
```
CLONE_VM (since Linux 2.0)
If CLONE_VM is set, the calling process and the child process run in the same memory
space. In particular, memory writes performed by the calling process or by the child
process are also visible in the other process. Moreover, any memory mapping or unmap‐
ping performed with mmap(2) or munmap(2) by the child or calling process also affects
the other process.
If CLONE_VM is not set, the child process runs in a separate copy of the memory space of
the calling process at the time of the clone call. Memory writes or file mappings/un‐
mappings performed by one of the processes do not affect the other, as with fork(2).
```
```
CLONE_VFORK (since Linux 2.2)
If CLONE_VFORK is set, the execution of the calling process is suspended until the child
releases its virtual memory resources via a call to execve(2) or _exit(2) (as with
vfork(2)).
If CLONE_VFORK is not set, then both the calling process and the child are schedulable
after the call, and an application should not rely on execution occurring in any partic‐
ular order.
```
on-demand fork focuses mainly on page-table handling: it applies the COW concept, normally applied to vmas, to the page table itself.
glibc's `posix_spawn()` is implemented on Linux via `clone()`, run with `CLONE_VM` (parent and child share the memory space) and `CLONE_VFORK` (the parent waits until the child releases its virtual memory resources).
After reading the odfork (v5.6) code, it is confirmed that it redundantly copies the mm and then performs the `vfork()`-style operation (the parent waits).
On v5.16, with `CLONE_VM` set, the parent's mm is handed directly to the child, only incrementing the mm's reference count.
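The relevant logic in `kernel/fork.c:copy_mm()` is roughly the following (a paraphrased sketch, not a verbatim quote):
```cpp
/* kernel/fork.c:copy_mm(), roughly: */
struct mm_struct *oldmm = current->mm;

if (clone_flags & CLONE_VM) {
	mmget(oldmm);			/* share the address space: take a reference */
	tsk->mm = oldmm;
} else {
	tsk->mm = dup_mm(tsk, oldmm);	/* full copy, including the page tables */
}
```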
:::info
`vfork()` sets `clone_flags` to `CLONE_VFORK | CLONE_VM`.
:::
TODO: the clone system call's SIGCHLD flag.
### Question
So is `posix_spawn()` (via clone flags) the place to control switching between odfork, `fork()`, and the related mechanisms?
## Mechanism improvements
### Queue copy work (old)
ODF is mainly about reducing response time, so what if the work of copying the pgd, p4d, and pud were handed to a workqueue?
During the copy, the pgtable is marked RO and the parent and child share this RO pgtable for the time being. If either side wants to write, it has to wait for the copy to finish.
* Because of the write protection, response time is minimal while there are no writes.
* After the copy completes, a mechanism is needed to notify one side that another pgtable is ready to use.
* If one side does want to write, the cost becomes the original copying time plus the workqueue-process communication.
    * The workqueue must notify the process that wants to write that the copy has finished.
* [`completion`](https://www.kernel.org/doc/html/latest/scheduler/completion.html)
:::warning
Or, more aggressively, hand all of the copy work to the workqueue?
:::
```graphviz
digraph fork {
node [shape = box]
rankdir = TB
p [label = "parent"]
pgtable [label = "page table"]
wq [label = "bkcow_worker"]
p -> pgtable [label = "on_demand_fork()"]
pgtable -> wq [label = "queue copy work"]
wq -> wq [label = "set COPYING"]
pgtable -> p [label = "after fork()\nthe pgtable\nis read only"]
}
```
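A sketch of the queued-copy idea with a workqueue plus a [`completion`](https://www.kernel.org/doc/html/latest/scheduler/completion.html) (all names here are hypothetical):
```cpp
#include <linux/workqueue.h>
#include <linux/completion.h>

struct bkcow_work {
	struct work_struct work;
	struct mm_struct *src_mm;
	struct mm_struct *dst_mm;
	struct completion done;
};

/* runs in the workqueue: copy the upper page-table levels in the background */
static void bkcow_worker(struct work_struct *work)
{
	struct bkcow_work *w = container_of(work, struct bkcow_work, work);

	copy_upper_levels_sketch(w->dst_mm, w->src_mm);	/* hypothetical pgd/p4d/pud copy */
	complete_all(&w->done);		/* wake every task waiting to write */
}

/* called from the write fault on the still-shared, read-only page table */
static void bkcow_wait_for_copy(struct bkcow_work *w)
{
	wait_for_completion(&w->done);
	/* now switch mm->pgd to the new, writable page table */
}
```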
### Implementation considerations
* Flow during copying:
```graphviz
digraph break_cow {
compound = true
p [label = "parent"]
c [label = "child"]
subgraph cluster_tasks {
p, c
label = "want to write"
}
pgtable [label = "page table (read only)"]
wq [label = "bkcow_worker"]
nf [label = "not finish (copying)" shape = oval]
fd [label = "finished"]
wf [label = "worker finished\nsee the WAITING is set or not"]
mm [label = "mm_struct"]
pgtw [label = "new page table (writable)"]
p -> mm [ltail = cluster_tasks]
mm ->pgtable
p -> pgtable
c -> pgtable [label = "write"]
pgtable -> wq [label = "access flag\nto see is\nfinised or not"]
wq -> nf [label = "set WAITING\nand goto sleep\nuntil worker finish"]
wq -> fd [label = "change the mm->pgd\npointer to new object"]
nf -> wf
fd -> pgtw
pgtw -> mm [label = "change pgd pointer"]
{rank=same pgtable, mm}
}
```
* Requires the read-only case where a copy is guaranteed to happen.
* After `fork()`, copy the pgd, p4d, and pud.
* Must consider **task exit while the copy is in progress**.
    * Whether pgtable resources are released promptly, and whether interruption is needed.
* Bottlenecks: **handler throughput**, plus the **extra memory needed to record pgtable locations**.
* The workqueue should run on the CPU of the task that requests the write:
    * Child and parent will be on the same CPU; whichever one writes the new pgtable, **cache locality** is unaffected.
    * But suppose task1 on CPU1 copies the pgtable of task0 on CPU0: if the handler runs on CPU0 (the CPU of the RO pgtable's owner), task1 (the task that wants to write) still has to communicate afterwards, e.g. reads into CPU0's cache (TLB?), write invalidates, and so on.
    * Make the workqueue per node or per CPU, so that every node (CPU) has its own workqueue, addressing the cache-locality and bottleneck problems above.
* Workqueues cannot be used during boot, so the original path must be kept for that case.
:::warning
`madvise`, `posix_madvise` -- give advice about use of memory
:::
### Improvement 1 - `CLONE_VM`
Change the odfork improvement to the v5.16 approach: set `CLONE_VM` and pass the mm pointer, additionally set write protection, and do the original odfork handling in the page fault path.
The difference: page-table sharing moves from the pte level to every level, removing the time spent copying the pgd and the other levels, so latency drops further; in exchange, the time spent breaking COW goes up.
### Improvement 2 - workqueue
**Security concern: none, but after evaluation it is not worth the cost. (Passing the pointer and setting write protection does not touch the pgtable contents, but it dumps all the work onto the page fault path.)**
The paper and its implementation only write-protect the child, so the child can still read what the parent writes afterwards, which is a security concern.
The clearest real fix probably lies in using a workqueue to spread out the time cost of copying.
The pgtable as a whole is sparse; spawn multiple threads to handle the copy?
Locking must be considered.
:::info
https://lwn.net/Articles/850113/
https://lwn.net/Articles/785430/
---
**COW Tracing delay time**
[PATCH] delayacct: track delays from COW
---
**Reading list**
[Lightweight kernel isolation with virtualization and VM functions](https://doi.org/10.1145/3381052.3381328)
[EPTI: Efficient Defence against Meltdown Attack for Unpatched VMs](https://www.usenix.org/conference/atc18/presentation/hua)
[Toward Least-Privilege Isolation for Software](http://www.scs.stanford.edu/~sorbo/bittau-phd.pdf)
[android Share memory](https://developer.android.com/topic/performance/memory-overview#SharingRAM)
[SOCK: Rapid Task Provisioning with Serverless-Optimized Containers](https://www.usenix.org/conference/atc18/presentation/oakes)
[What do you think MAP_POPULATE is actually doing here?](https://news.ycombinator.com/item?id=7740578)
:::
---
## COW fixes series
### [Patching until the COWs come home (part 2)](https://lwn.net/Articles/849876/)
> Then he expanded upon the general rules governing **how to deal with pinning of pages for DMA transfers and write-protecting**. They can be summarized as:
> * When considering whether to just allow a write on a write-protected PTE, or to instead create a copy, and it is not certain that the process is the exclusive owner of the page, always create a copy. The elevated page reference count is an indication of not being an exclusive owner of the page.
> * If the page is pinned with a cache-coherent GUP (such as for write DMA transfers) the page-table entry has to also be writable. It doesn't make sense to make it read-only if a DMA transfer can write to the page anyway.
### [2021-11-15 - Summary of COW (Copy On Write) Related Issues in Upstream Linux](https://lore.kernel.org/all/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com/)
#### 1. Observing Memory Modifications of Private Pages From A Child Process
> The core problem is that pinning pages readable in a child process, such
as done via the vmsplice system call, can result in a child process
observing memory modifications done in the parent process the child is
not supposed to observe.
#### 2. Intra Process Memory Corruptions due to Wrong COW (FOLL_GET)
> It was discovered that we can create a memory corruption by reading a
file via O_DIRECT to a part (e.g., first 512 bytes) of a page,
concurrently writing to an unrelated part (e.g., last byte) of the same
page, and concurrently write-protecting the page via clear_refs
SOFTDIRTY tracking [6].
>
> For the reproducer, the issue is that O_DIRECT grabs a reference of the
target page (via FOLL_GET) and clear_refs write-protects the relevant
page table entry. On successive write access to the page from the
process itself, we wrongly COW the page when resolving the write fault,
resulting in a loss of synchronicity and consequently a memory corruption.
#### 3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN)
### Part 1 - using GUP-triggered unsharing of shared anonymous pages (ordinary, THP, hugetlb)
#### [[PATCH v1] 2021-12-17](https://lore.kernel.org/all/20211217113049.23850-1-david@redhat.com/T/#u)
> It is currently again possible for a child process to observe modifications
of anonymous pages performed by the parent process after fork() in some
cases, which is not only a violation of the POSIX semantics of MAP_PRIVATE,
but more importantly a real security issue.
#### [[PATCH v3] 2022-01-31](https://lore.kernel.org/all/20220131162940.210846-1-david@redhat.com/T/#u)
> To fix this COW issue once and for all, introduce GUP-triggered unsharing
> that can be conditionally triggered via FAULT_FLAG_UNSHARE. In contrast to
> traditional COW, unsharing will leave the copied page mapped
> write-protected in the page table, not having the semantics of a write
> fault.
>
> Logically, unsharing is triggered "early", as soon as GUP performs the
> action that could result in a COW getting missed later and the security
> issue triggering: however, unsharing is not triggered as before via a
> write fault with undesired side effects.
> I'm working on an approach to fix (2) and improve (3): PageAnonExclusive to
> mark anon pages that are exclusive to a single process, allow GUP pins only
> on such exclusive pages, and allow turning exclusive pages shared
> (clearing PageAnonExclusive) only if there are no GUP pins. Anon pages with
> PageAnonExclusive set never have to be copied during write faults, but
> eventually during fork() if they cannot be turned shared. The improved
> reuse logic in this series will essentially also be the logic to reset
> PageAnonExclusive. This work will certainly take a while, but I'm planning
> on sharing details before having code fully ready.
### Part 2
* [[PATCH v2] 2022-03-15](https://lore.kernel.org/linux-mm/20220315104741.63071-2-david@redhat.com/T/)
### Part 3
* [[PATCH v1] 2022-03-15](https://lore.kernel.org/linux-mm/51afa7a7-15c5-8769-78db-ed2d134792f4@redhat.com/T/)
### Others
* https://lore.kernel.org/all/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com/
---
[Redis with COW messes up security](https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf) - uffd-wp
> Modern implementations of fork use copy-on-write to reduce the overhead of copying memory that is often soon discarded [72]. A number of applications have since taken a dependency on fork merely to gain access to copy-on-write memory. One common pattern involves forking from a pre-initialised process, to reduce startup overhead and memory footprint of a worker process, as in the Android Zygote [39, 62] and Chrome site isolation on Linux [4]. Another pattern uses fork to capture a consistent snapshot of a running process’s address space, allowing the parent to continue execution; this includes persistence support in Redis [68], and some reverse debuggers [21].
>
> POSIX would benefit from an API for using copy-on-write memory independently of forking a new process. Bittau [16] proposed checkpoint() and resume() calls to take copy-onwrite snapshots of an address space, thus reducing the overhead of security isolation. More recently, Xu et al. [82] observed that fork time dominates the performance of fuzzing tools, and proposed a similar snapshot() API. These designs are not yet general enough to cover all the use-cases outlined above, but perhaps can serve as a starting point. We note that any new copy-on-write memory API must tackle the issue of memory overcommit described in §4, but decoupling this problem from fork should make it much simpler.
---
## Other discussions
* [PATCH v1 00/11] mm: COW fixes part 1: fix the COW security issue for THP and hugetlb
* mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)
* [folio - compound page](https://lwn.net/ml/linux-kernel/20210511214735.1836149-1-willy@infradead.org/)
* [Clarifying memory management with page folios](https://lwn.net/Articles/849538/)
* [pull slab out of struct page](https://lwn.net/Articles/871982/)
* >This is an offshoot of the folio work, although it does not depend on any of the outstanding pieces of folio work. One of the more complex parts of the struct page definition is the parts used by the slab allocators. It would be good for the MM in general if struct slab were its own data type, and it also helps to prevent tail pages from slipping in anywhere.
* [v2.6](https://www.kernel.org/doc/gorman/html/understand/understand011.html)
* [struct slab](https://elixir.bootlin.com/linux/v2.6.39.4/source/mm/slab.c#L220)
* commits: [slab: use struct page for slab management](https://github.com/torvalds/linux/commit/8456a648cf44f14365f1f44de90a3da2526a4776#diff-b0ac8926476debbaac79e701aa716d669ba29d1ef3a0a16e290b300888e7a477)
* mail list: [slab: overload struct slab over struct page to reduce memory usage](https://lore.kernel.org/lkml/5270C666.6090209@iki.fi/T/)
> There is two main topics in this patchset. One is to reduce memory usage
> and the other is to change a management method of free objects of a slab.
>
> The SLAB allocate a struct slab for each slab. The size of this structure
> except bufctl array is 40 bytes on 64 bits machine. We can reduce memory
> waste and cache footprint if we overload struct slab over struct page.
>
> And this patchset change a management method of free objects of a slab.
> Current free objects management method of the slab is weird, because
> it touch random position of the array of kmem_bufctl_t when we try to
> get free object. See following example.
* [phyr](https://lore.kernel.org/netdev/Yd0IeK5s%2FE0fuWqn@casper.infradead.org/T/) - phyiscal memory / device
* and use it to replace `bio_vec` as well as using it to replace the array
of `struct pages` used by `get_user_pages()` and friends.
* [`bio_vec`](http://books.gigatux.nl/mirror/kerneldevelopment/0672327201/ch13lev1sec3.html)
* tracepoint - loop cost: `free_bulk()`
* barrier and combining tree to RCU (ppt)
* [`kfree_rcu()`](https://hackmd.io/@linD026/linux-kernel-RCU) adds the address holding a `struct rcu_head` to bkvhead.
    * Could it be added directly to the struct rcu_head list to reduce memory pressure?
        * kfree and vfree must be considered
* [UMCG](https://lwn.net/ml/linux-kernel/20210520183614.1227046-1-posk@google.com/) / [lwn](https://lwn.net/Articles/879398/) / [2021/12/14](https://lwn.net/ml/linux-kernel/20211214205358.701701555@infradead.org/)
    * Uses kernel task_structs rather than a pure-userspace implementation such as fibers (avoiding the need to build a dedicated fiber library), to manage resources explicitly and allow use of the thread API.
* [NGPT](https://wellbay.cc/thread-2024670.htm)
> Basically a band-aid for the fact that many years ago GNU/Linux rejected NGPT and went with NPTL.
> If you allocate, essentially, a dedicated kernel thread for your “green thread” then you may use all syscalls and libraries which are allowed for regular threads: parts of the program where “green threads” are cooperating would work like proper “green threads”, but if you call some “heavy” function the instead of freezing your whole “green thread” machinery you would just get one-time hit when kernel would save your beacon and would give you a chance to remove misbehaving “green thread” from the cooperating pool.
    * UMCG has been compared to Windows UMS, but the latter was removed (because few people used it).
> This mechanism and the very similar Windows UMS one which was added in 2009 and ripped out in 2020 help userspace control thread scheduling in the face of arbitrary system calls with arbitrary blocking.
>
> With traditional M:N scheduling like fibers, if the user threaded code blocks, no code in the app gets control to choose what's going to run next, unless the blocking is going through the userspace threading library. This is a major part of the reason that Go or LibUV wrap all of the IO calls, so that they can control their green thread scheduling.
> UMS allows such a runtime to effectively get a callback to decide what to do next (e.g. schedule a new lightweight task) when something blocks. This is a great idea if you have a set of syscalls from your task that may or may not block in an unpredictable manner, like a read from the pagecache where you don't know if you'll miss. You can be optimistic, knowing that you'll get a callback if something goes wrong.
* [time_namespaces(7) — Linux manual page](https://man7.org/linux/man-pages/man7/time_namespaces.7.html)
* `kernel/fork.c:copy_process()`
> If the new process will be in a different time namespace ...
    * What is a time namespace?