contributed by <linD026>
Linux kernel COW, linux2021
All source code cited in this article comes from Linux kernel 5.10, and the architecture-specific code is primarily for arm.
Generally, every process has its own private memory, such as the heap, stack, BSS, and data segments. But processes may also use the same resources, for example libc when writing C programs. Resources that will not be modified can be shared, without unnecessary copying, through the address-translation machinery provided by virtual addresses and the MMU.
The technique above is virtual-memory sharing, and Copy On Write (COW) is an extension of it.
Wikipedia describes COW as follows:
Wikipedia
Copy-on-write (COW), sometimes referred to as implicit sharing or shadowing, is a resource-management technique used in computer programming to efficiently implement a "duplicate" or "copy" operation on modifiable resources.
If a resource is duplicated but not modified, it is not necessary to create a new resource; the resource can be shared between the copy and the original. Modifications must still create a copy, hence the technique: the copy operation is deferred until the first write. By sharing resources in this way, it is possible to significantly reduce the resource consumption of unmodified copies, while adding a small overhead to resource-modifying operations.
In implementation terms, when multiple processes use the same data, only one copy is loaded at first and marked read-only. When a process wants to write to it, a page fault is raised to the kernel, and the page fault handler then carries out the copy-on-write operation.
The overall flow is as follows:
In the COW mechanism, the act of copying this read-only data is called breaking COW.
In Linux, "everything is a file descriptor", and a file has a definite memory location and address, so the data used when creating a new process can also be optimized with the COW mechanism.
The virtual filesytem is an interface provided by the kernel. Hence the phrase was corrected to say "Everything is a file descriptor". Linus Torvalds himself corrected it again a bit more precisely: "Everything is a stream of bytes".
According to Wikipedia:
Copy-on-write finds its main use in sharing the virtual memory of operating system processes, in the implementation of the fork system call. Typically, the process does not modify any memory and immediately executes a new process, replacing the address space entirely. Thus, it would be wasteful to copy all of the process's memory during a fork, and instead the copy-on-write technique is used.
The fork system call creates a new process. In Linux it first copies the parent's mm_struct, vm_area_struct, and page tables, and sets each page's flags to read-only. Then, when a modification occurs, it is handled through the COW mechanism.
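To see this behavior from userspace, here is a minimal demonstration program written for this article (not kernel code): the child's write triggers a page fault that breaks COW, so the parent's copy stays intact.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* After fork(), parent and child share this page, marked read-only. */
    char *buf = malloc(4096);
    strcpy(buf, "original");

    pid_t pid = fork();
    if (pid == 0) {
        /* This write faults; the kernel breaks COW and gives the
         * child its own private copy of the page. */
        strcpy(buf, "modified");
        printf("child : %s\n", buf);   /* child : modified */
        exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent: %s\n", buf);       /* parent: original */
    free(buf);
    return 0;
}
```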
Before digging into COW, we need to understand the virtual memory mechanism within memory management. Modern operating systems provide virtual memory as part of memory management. The idea is simple: map virtual addresses (VA) to physical addresses (PA). Address translation lets the freer VA decouple data from the PA of the underlying hardware. This handles interference between processes' memory, memory translation across hardware, and shortages of physical RAM nicely. The Linux kernel likewise manages memory through the more abstract virtual memory rather than physical memory, for higher efficiency and fewer errors.
The translation from VA to PA is carried out by the MMU (Memory Management Unit), and the CPU's memory accesses all go through virtual memory.
For the arm MMU, see the ARMv7-A MMU Architecture section of FreeRTOS (MMU).
The page is the smallest unit of memory management; it is generally 4 KB to 2 MB in size, and 4 KB in Linux.
So that it can tell whether a page is held in the cache / main memory, and to support replacing pages that are not needed (the operation called swapping), the operating system builds page tables. Generally, each process has its own page table. With a single page table, each page table entry (PTE) stores a PA.
When a process cannot find the wanted page through the page table in cache / memory, a page fault exception is triggered. If the target page is merely absent from the cache and lives on disk, swapping is performed; in other situations the fault handler takes the corresponding action (e.g., a segmentation fault for reads beyond the permitted range, writes to a read-only object, or modifying kernel-mode data from user mode).
But this layout can be too slow to read, not because the structure is flawed, but because of limits in the hardware that accesses the page table. Broadly, SRAM, DRAM, and disk all differ in access speed, so to shorten the time needed to find a PTE, a small cache holding recently used PTEs is placed inside the MMU.
All of the above concerns a single page table; however, the page-to-PTE relationship can make the page table itself very large while it still must stay resident in main memory.
Take a single page table with a 64-bit address space, 4 KB pages, and 4-byte PTEs as an example: a 16 PB page table would have to stay resident just to cover the whole 64-bit address space.
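The arithmetic behind the 16 PB figure:

$$
\frac{2^{64}\,\mathrm{B}}{2^{12}\,\mathrm{B/page}} \times 4\,\mathrm{B/PTE} = 2^{54}\,\mathrm{B} = 16\,\mathrm{PB}
$$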
Taking the Intel Core i7 as an example, it supports up to a 52-bit physical address space (4 PB). So while PB-scale address spaces are indeed rarely used today, a better way to cut the page table's main-memory footprint is still needed.
Hence multi-level page tables were proposed, which break the lookup into several page-table stages. Below is an illustration of 3-level page tables translating a VA to a PA:
Generally, the page table at each level is sized to an actual physical page frame.
In the Linux kernel, each process has its own multi-level page tables: the pointer mm_struct->pgd, in the mm_struct each process owns, points to its own Page Global Directory (PGD). The physical page frame it points to contains an array of pgd_t, defined in the <asm/page.h> header, which varies with the hardware architecture.
Each active entry in the PGD table points to a page frame containing an array of Page Middle Directory (PMD) entries of type pmd_t which in turn points to page frames containing Page Table Entries (PTE) of type pte_t, which finally points to page frames containing the actual user data. In the event the page has been swapped out to backing storage, the swap entry is stored in the PTE and used by do_swap_page() during page fault to find the swap entry containing the page data.
Further reading - context switch
In particular, check_and_switch_context(struct mm_struct *mm, struct task_struct *tsk) has different definitions on arm depending on whether ASIDs (Address Space Identifiers) are available:
Both definitions contain the operation cpu_switch_mm(mm->pgd, mm); that is, on a context switch, the memory information of each process (task) is switched via mm_struct->pgd.
The cpu_switch_mm operations are mostly defined in /arch/arm/include/asm/proc-fns.h:
x86 - pgd and the TLB
On the x86, the process page table is loaded by copying mm_struct->pgd into the cr3 register which has the side effect of flushing the TLB. In fact this is how the function __flush_tlb() is implemented in the architecture dependent code.
Configuring paging
Three flags in the control registers relate to paging: PG (bit 31 of CR0), PSE (bit 4 of CR4, available on the Pentium and later), and PAE (bit 5 of CR4, available on the Pentium Pro / Pentium II and later).
- Setting the PG (paging) flag to 1 enables paging.
- Setting the PSE (page size extensions) flag to 1 allows 4 MB pages (otherwise only 4 KB pages can be used).
- PAE (physical address extension) is a feature added in the P6 family that supports up to 64 GB of physical memory (not covered in this article).
Page directory and page tables
The page directory and page tables hold the paging information. The base address of the page directory is kept in CR3 (also called the PDBR, page directory base register) as a physical address, and this register must be set before paging is enabled. Once paging is on, the PDBR can be changed with a MOV instruction, and a new PDBR value may also be loaded on a task switch; in other words, each task can have its own page directory. During a task switch, the previous task's page directory may be swapped out to disk and no longer reside in physical memory, but before that task is switched back in, its page directory must be brought back into physical memory, and it must remain there until the next task switch.
Translation Lookaside Buffers (TLBs)
Only code running at CPL 0 can select TLB entries or invalidate the TLB. Whenever a page directory or page table is modified, the corresponding TLB entry must be invalidated immediately, so that the next access to that page directory or page table refreshes the TLB contents (otherwise stale data could be read from the TLB).
To invalidate the TLB it suffices to reload CR3, either with a MOV instruction (e.g., MOV CR3, EAX) or through a task switch, which also reloads CR3. In addition, the INVLPG instruction invalidates one specific TLB entry, although in some situations it may invalidate several entries or even the whole TLB. INVLPG takes the page's address as its operand, and the processor invalidates the TLB entry holding that page.
Quoted from: ntu - 分頁架構
size of a page
Page alignment is computed as (((x) + PAGE_SIZE - 1) & PAGE_MASK).
arm page tables can be two-level or three-level; here we focus on the three-level layout.
PAGE_ALIGN() is defined in a single place, mm.h, with the supporting definitions spread across:
/arch/arm/include/asm/page.h
/include/linux/mm.h
/include/linux/kernel.h
/include/linux/const.h and /include/uapi/linux/const.h
In practice it looks like:
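In 5.10 the chain of definitions reduces to the following (quoted in condensed form from the headers listed above):

```c
/* /include/uapi/linux/const.h */
#define __ALIGN_KERNEL(x, a)		__ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
#define __ALIGN_KERNEL_MASK(x, mask)	(((x) + (mask)) & ~(mask))

/* /include/linux/kernel.h */
#define ALIGN(x, a)		__ALIGN_KERNEL((x), (a))

/* /include/linux/mm.h : round addr up to the next page boundary */
#define PAGE_ALIGN(addr)	ALIGN(addr, PAGE_SIZE)
```

So PAGE_ALIGN(addr) expands to ((addr) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1), matching the expression above.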
The page-table types at every level can be defined in two ways: one does no C type-checking and defines pte_t directly as pteval_t; the other, under STRICT_MM_TYPECHECKS, defines the types as structs:
As mentioned, each entry is described by the structs pte_t, pmd_t and pgd_t for PTEs, PMDs and PGDs respectively. Even though these are often just unsigned integers, they are defined as structs for two reasons. The first is for type protection so that they will not be used inappropriately. The second is for features like PAE on the x86 where an additional 4 bits is used for addressing more than 4GiB of memory. To store the protection bits, pgprot_t is defined which holds the relevant flags and is usually stored in the lower bits of a page table entry.
The following shows both forms, beginning with the one without C type-checking:
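A condensed sketch of the two styles, modeled on arm's arch/arm/include/asm/pgtable-3level-types.h (line placement differs slightly in the real header):

```c
typedef u64 pteval_t;	/* three-level (LPAE) arm uses 64-bit entries */

#ifdef STRICT_MM_TYPECHECKS
/* struct wrapper: the compiler rejects mixing pte_t with plain integers */
typedef struct { pteval_t pte; } pte_t;
#define pte_val(x)	((x).pte)
#define __pte(x)	((pte_t) { (x) })
#else
/* no type checking: pte_t is just the raw value */
typedef pteval_t pte_t;
#define pte_val(x)	(x)
#define __pte(x)	(x)
#endif
```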
At any level of the page table, the pointer to the next level may be null, which means that VA range is not valid for access. Pointers at the middle levels can also point directly to a larger physical page rather than to the next level; for example, the PMD level can point to a huge page (2 MB).
At any level of the page table, the pointer to the next level can be null, indicating that there are no valid virtual addresses in that range. This scheme thus allows large subtrees to be missing, corresponding to ranges of the address space that have no mapping. The middle levels can also have special entries indicating that they point directly to a (large) physical page rather than to a lower-level page table; that is how huge pages are implemented. A 2MB huge page would be found directly at the PMD level, with no intervening PTE page, for example.
| Bit | Function |
| --- | --- |
| _PAGE_PRESENT | Page is resident in memory and not swapped out |
| _PAGE_PROTNONE | Page is resident but not accessible |
| _PAGE_RW | Set if the page may be written to |
| _PAGE_USER | Set if the page is accessible from user space |
| _PAGE_DIRTY | Set if the page is written to |
| _PAGE_ACCESSED | Set if the page is accessed |

Table 3.1: Page Table Entry Protection and Status Bits
- pte_none(), pmd_none() and pgd_none() return 1 if the corresponding entry does not exist;
- pte_present(), pmd_present() and pgd_present() return 1 if the corresponding page table entries have the PRESENT bit set;
- pte_clear(), pmd_clear() and pgd_clear() will clear the corresponding page table entry;
- pmd_bad() and pgd_bad() are used to check entries when passed as input parameters to functions that may change the value of the entries. Whether it returns 1 varies between the few architectures that define these macros but for those that actually define it, making sure the page entry is marked as present and accessed are the two most important checks.
Linux has three protections: read, write, and execute. Generally, functions named with the pte_mk* prefix set a protection type: pte_mkwrite() for write, pte_mkexec() for execute, and so on. Clearing a protection already set is done with functions such as pte_wrprotect() and pte_exprotect().
According to the comments, the arm architecture implements read in terms of write. The code below shows that clearing write marks the entry read-only, while marking it writable clears the previous read-only setting.
To test whether a protection is set, predicates such as pte_write() and pte_exec() can be used.
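The pattern looks roughly like the following sketch, modeled on the arm64 helpers in arch/arm64/include/asm/pgtable.h (the 32-bit arm code differs in detail but follows the same idea):

```c
static inline pte_t pte_wrprotect(pte_t pte)
{
	/* clearing "write" also marks the entry read-only */
	pte = clear_pte_bit(pte, __pgprot(PTE_WRITE));
	pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
	return pte;
}

static inline pte_t pte_mkwrite(pte_t pte)
{
	/* marking "write" clears the earlier read-only setting */
	pte = set_pte_bit(pte, __pgprot(PTE_WRITE));
	pte = clear_pte_bit(pte, __pgprot(PTE_RDONLY));
	return pte;
}
```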
Flag-related functions and macros
Execution permissions - PXN / XN
Learn the architecture: AArch64 memory model - Permissions attributes
These attributes let you specify that instructions cannot be fetched from the address:
PXN. Privileged Execute Never (Called XN at EL3, and EL2 when HCR_EL2.E2H==0)
These are Execute Never bits. This means that setting the bit makes the location not executable.
protect modify
The permissions can be modified to a new value with pte_modify() but its use is almost non-existent. It is only used in the function change_pte_range() in mm/mprotect.c.
There are only two bits that are important in Linux, the dirty bit and the accessed bit. To check these bits, the macros pte_dirty() and pte_young() macros are used. To set the bits, the macros pte_mkdirty() and pte_mkyoung() are used. To clear them, the macros pte_mkclean() and pte_old() are available.
Around 2005, Linux 2.6.10 merged the four-level page tables patch.
Today, since 4.11-rc2, Linux also has five-level page tables, with a new level, the P4D, between the PGD and the PUD.
With four levels, the valid bits of an address (the VA) are the lowest 48 bits, while five levels use 52-bit or even 57-bit VAs.
Yet even with five levels available, some hardware or system configurations still use three or even two levels: arm, as noted above, uses three, and 32-bit systems use two or three.
For this reason, the memory-management code is written so that, within the five-level page-table structure, only the lower levels need be used. Thus, when reading the page-table management code for arm, x86, and others, it still works even though it does not use every level the code provides.
For a concrete example, Linux kernel - pte shows how, on arm, the follow_pte function walks the page tables to find the target pte.
The rough flow of getting a pte from an mm_struct with follow_pte is:
Beyond that, it must consider whether the pmd-level entry is a huge page, and after obtaining the pte it must also return the lock protecting the subsequent operations.
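A condensed sketch of that walk (error paths and the huge-page branch trimmed), following the shape of follow_pte in 5.10's mm/memory.c:

```c
/* sketch: walk mm's page tables to the pte for `address`,
 * returning the mapped pte and the lock protecting it */
static int follow_pte_sketch(struct mm_struct *mm, unsigned long address,
			     pte_t **ptepp, spinlock_t **ptlp)
{
	pgd_t *pgd = pgd_offset(mm, address);
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *ptep;

	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
		return -EINVAL;
	p4d = p4d_offset(pgd, address);
	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
		return -EINVAL;
	pud = pud_offset(p4d, address);
	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
		return -EINVAL;
	pmd = pmd_offset(pud, address);
	/* a huge page would be found here, with no pte level below */
	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
		return -EINVAL;

	ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
	if (!pte_present(*ptep)) {
		pte_unmap_unlock(ptep, *ptlp);
		return -EINVAL;
	}
	*ptepp = ptep;	/* caller unmaps and unlocks when done */
	return 0;
}
```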
The MMU notifier mechanism
lwn.net - Memory management notifiers
lwn.net - A last-minute MMU notifier change
When page tables change, the corresponding TLB entries must be invalidated.
Handling this is where the MMU notifier mechanism comes in; it was merged in 2.6.27. The original motivation was virtualization: when a guest operates on memory, for security it is not allowed to touch host memory directly; instead a shadow page table is maintained for the guest.
When the guest needs to change a mapping it does so in the shadow page table, and the host consults the guest's shadow page table to carry out the corresponding operation. This raises several problems; for instance, when the host swaps out a page the guest is using, how does the guest learn that the memory was evicted on the host side? KVM's original answer was to pin guest-mapped pages in memory, which was quite bad for the host's memory efficiency. That is why the MMU notifier mechanism was proposed.
Later, more devices appeared on the memory bus, e.g., GPUs acquired their own MMUs. After the MM subsystem changes memory, those non-CPU MMUs must also be updated, yet the MM code cannot (and should not) touch them directly; the same mechanism covers these cases too.
More recently, other devices have appeared on the memory bus with their own views of memory; graphics processing units (GPUs) have led this trend with technologies like GPGPU, but others exist as well. To function properly, these non-CPU MMUs must be updated when the memory-management subsystem makes changes, but the memory-management code is not able (and should not be able) to make changes directly within the subsystems that maintain those other MMUs.
The main purpose of the mechanism is to let subsystems hook into mm operations and receive a callback when page tables change.
To address this problem, Andrea Arcangeli added the MMU notifier mechanism during the 2.6.27 merge window in 2008. This mechanism allows any subsystem to hook into memory-management operations and receive a callback when changes are made to a process's page tables. One could envision a wide range of callbacks for swapping, protection changes, etc., but the actual approach was simpler. The main purpose of an MMU notifier callback is to tell the interested subsystem that something has changed with one or more pages; that subsystem should respond by simply invalidating its own mapping for those pages. The next time a fault occurs on one of the affected pages, the mapping will be re-established, reflecting the new state of affairs.
mmu_notifier_invalidate_range_start/end is one of the hook points in this mechanism.
mmu_notifier_invalidate_range_start/end are just calling MMU notifier hooks; these hooks only exist so that other kernel code can be told when TLB invalidation is happening. The only places that set up MMU notifiers are
- KVM (hardware assisted virtualization) uses them to handle swapping out pages; it needs to know about host TLB invalidations to keep the virtualized guest MMU in sync with the host.
- GRU (driver for specialized hardware in huge SGI systems) uses MMU notifiers to keep the mapping tables in the GRU hardware in sync with the CPU MMU.
In this case, invalidate_range_start() is called while all pages in the affected range are still mapped; no more mappings for pages in the region should be added in the secondary MMU after the call. When the unmapping is complete and the pages have been freed, invalidate_range_end() is called to allow any necessary cleanup to be done.
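A hedged sketch of how 5.10 code that changes page tables brackets its work with these hooks; the MMU_NOTIFY_CLEAR event is one plausible choice here, and the actual unmap logic is elided:

```c
#include <linux/mmu_notifier.h>

static void unmap_range_sketch(struct vm_area_struct *vma,
			       unsigned long start, unsigned long end)
{
	struct mmu_notifier_range range;

	/* describe the affected range to interested subsystems */
	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0,
				vma, vma->vm_mm, start, end);

	/* secondary MMUs must stop adding mappings after this call */
	mmu_notifier_invalidate_range_start(&range);

	/* ... clear the PTEs in [start, end) here ... */

	/* the pages are gone; let the secondary MMUs clean up */
	mmu_notifier_invalidate_range_end(&range);
}
```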
[RFC PATCH 0/6] Add support for shared PTEs across processes
This patch addresses the problem that when many processes share a struct page, every process's page table still needs its own PTE to map that page; once the process count grows large enough, the memory overhead becomes non-negligible. Hence, in mid-January 2022, Khalid Aziz proposed sharing PTEs across processes.
Some of the field deployments commonly see memory pages shared
across 1000s of processes. On x86_64, each page requires a PTE that
is only 8 bytes long which is very small compared to the 4K page
size. When 2000 processes map the same page in their address space,
each one of them requires 8 bytes for its PTE and together that adds
up to 8K of memory just to hold the PTEs for one 4K page.
A userspace interface is provided in the form of system calls.
This is a proposal to implement a mechanism in kernel to allow
userspace processes to opt into sharing PTEs. The proposal is to add
a new system call - mshare(), which can be used by a process to
create a region (we will call it mshare'd region) which can be used
by other processes to map same pages using shared PTEs. Other
process(es), assuming they have the right permissions, can then make
the mashare() system call to map the shared pages into their address
space using the shared PTEs. When a process is done using this
mshare'd region, it makes a mshare_unlink() system call to end its
access. When the last process accessing mshare'd region calls
mshare_unlink(), the mshare'd region is torn down and memory used by
it is freed.
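Based only on the cover letter quoted above, a hypothetical usage flow might look like the sketch below. These syscalls were never merged, so both the prototypes and the flow are illustrative assumptions taken from the RFC, not a stable kernel API:

```c
/* Hypothetical prototypes as described in the RFC cover letter;
 * not part of any released kernel. */
int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode);
int mshare_unlink(char *name);

/* First process: create an mshare'd region whose PTEs can be shared.
 *	mshare("demo", base, len, O_CREAT | O_RDWR, 0600);
 *
 * Other processes: attach to the same region by name and map the
 * same pages through the shared PTEs.
 *	mshare("demo", base, len, O_RDWR, 0600);
 *
 * Each process ends its access; the last mshare_unlink() tears the
 * region down and frees its memory.
 *	mshare_unlink("demo");
 */
```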
Another interesting problem is described at the end of the patch series. It would appear that there are programs out there that "know" that only the bottom 48 bits of a virtual address are valid. They take advantage of that knowledge by encoding other information in the uppermost bits. Those programs will clearly break if those bits suddenly become part of the address itself. To avoid such problems, the x86 patches in their current form will not allocate memory in the new address space by default. An application that needs that much memory, and which does not play games with virtual addresses, can provide an address hint above the boundary in a call to mmap(), at which point the kernel will understand that mappings in the upper range are accessible.
allocate and free page table
GitHub - linD026 / Three-level-page-table
For brevity, the code is not listed in full here; see the link above to read it, or to test and modify it.
This is a partial implementation of a three-level page table; it only implements building the page tables and inserting pages. The pa/va/pfn conversions that involve MMU and other hardware support are not fully simulated; instead, an existing va plus per-level offsets chosen by the implementation are used to construct another va.
While implementing this I found that, at every level, the page table is locked through mm_struct->page_table_lock, which is clearly a performance problem under concurrency. On this topic, lwn.net has a discussion of range locks: Range reader/writer locks for the kernel. It explains that locks guarding a resource are usually set up this way because lock operations are expected to protect the whole resource in the simplest, least complicated manner.
The kernel uses a variety of lock types internally, but they all share one feature in common: they are a simple either/or proposition. When a lock is obtained for a resource, the entire resource is locked, even if exclusive access is only needed to a part of that resource. Many resources managed by the kernel are complex entities for which it may make sense to only lock a smaller part; files (consisting of a range of bytes) or a process's address space are examples of this type of resource.
As for why ordinary locks run in such a simple fashion, the reason is to minimize the cost of taking and releasing them.
As a general rule, keeping the locks simple minimizes the time it takes to claim and release them. Splitting locks (such as replacing a per-hash-table lock with lots of per-hash-chain locks) tends to be the better approach to scalability, rather than anything more complex that mutual-exclusion.
Range locks were proposed precisely to improve scalability; note, though, that both kinds of locks have strengths and weaknesses, so neither is favored outright; each is used where appropriate:
Range locks are handling a fairly unique case. Files are used in an enormous variety of ways - sometimes as a whole, sometimes as lots of individual records. In some case the whole-file mmap_sem really is simplest and best. Other times per-page locks are best. But sometimes, taking mmap_sem will cause too much contention, while taking the page lock on every single page would take even longer… and some of the pages might not be allocated yet.
So range locks are being added, not because it is a generally good idea, but because there is a specific use case (managing the internals of files) that seems to justify them.
Around 2013 and 2017, developers also began proposing range-lock designs, such as the range_lock tree.
insert_page
The function insert_page, which inserts a page into the page tables, has been modified for better locking performance. The current version notes that the architecture code under arch must provide pmd_index and related helpers before this version can be used. The old insert_page is also kept; its comment says it should only be used by old drivers.
In a NUMA architecture, different memory regions cost different amounts for different processors to access, depending on the distance between the memory and the processor.
According to Wikipedia's description of NUMA:
Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors). The benefits of NUMA are limited to particular workloads, notably on servers where the data is often associated strongly with certain tasks or users.
Each processor has its own memory region. In Linux, each region of physical memory is called a node and is handled via struct pglist_data (pg_data_t). Within each node, memory is further split into zones, handled via struct zone.
A UMA architecture instead describes all of memory with a single pglist_data.
Converting from a struct page to its zone is generally done with the page_zone function (shown together with page_zonenum below).
There are three zone types: ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM. How these types partition memory depends on the hardware architecture. Historically, ZONE_DMA was placed in the low memory range because of ISA hardware limitations.
The DMA zone (ZONE_DMA) is a memory-management holdover from the distant past. Once upon a time, many devices (those on the ISA bus in particular) could only use 24 bits for DMA addresses, and were thus limited to the bottom 16MB of memory.
stackoverflow - Increasing Linux DMA_ZONE memory on ARM i.MX287
The ZONE_DMA 16MB limit is imposed by a hardware limitation of certain devices. Specifically, on the PC architecture in the olden days, ISA cards performing DMA needed buffers allocated in the first 16MB of the physical address space because the ISA interface had 24 physical address lines which were only capable of addressing the first 2^24=16MB of physical memory. Therefore, device drivers for these cards would allocate DMA buffers in the ZONE_DMA area to accommodate this hardware limitation.
Further reading
In 2018, developers discussed removing the ancient ZONE_DMA; another article listing the drivers that would be affected recommended keeping it. For details see: lwn.net - Is it time to remove ZONE_DMA?
Which zone a struct page belongs to can be derived from the page-state flags in struct page via the page_zonenum function:
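Both helpers decode information cached in page->flags; in 5.10 (include/linux/mm.h) they look like this:

```c
static inline enum zone_type page_zonenum(const struct page *page)
{
	/* the zone type is stored in a bitfield of page->flags */
	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}

static inline struct zone *page_zone(const struct page *page)
{
	/* look up the owning node, then index its zone array */
	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
}
```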
The boundaries of the three zone types are set through the variables min_low_pfn, max_pfn, and max_low_pfn. The first pfn the kernel may use is at min_low_pfn, and the last is at max_pfn.
The value of max low pfn is calculated on the x86 with find_max_low_pfn(), and it marks the end of ZONE_NORMAL. This is the physical memory directly accessible by the kernel and is related to the kernel/userspace split in the linear address space marked by PAGE_OFFSET. The value, with the others, is stored in mm/bootmem.c. In low memory machines, the max pfn will be the same as the max low pfn.
The three zone types on x86 are:
- ZONE_DMA : First 16MiB of memory
- ZONE_NORMAL : 16MiB - 896MiB
- ZONE_HIGHMEM : 896 MiB - End
Further reading
patchwork.kernel.org - [v6,3/4] arm64: use both ZONE_DMA and ZONE_DMA32
lwn.net - ARM, DMA, and memory management
Once the three types are set up, the free_area_init_node function runs:
Under NUMA, each node records its own struct pages with node_start_pfn and node_mem_map. The pfn (physical page frame number) describes a page's position within physical memory, so different nodes can carry the same local pfn values; the nid (node ID) together with node_start_pfn is what makes a pfn globally distinguishable.
On NUMA systems, the global mem_map is treated as a virtual array starting at PAGE_OFFSET. free_area_init_node() is called for each active node in the system, which allocates the portion of this array for the node being initialized.
The code that converts a struct page to a pfn is shown later; see Mapping address to page - page to pfn and back.
node_start_pfn records the first pfn (physical page frame number) the node may use for itself. Before 2.6 it was recorded as a physical address, which caused problems with PAE.
A PFN is simply an index within physical memory that is counted in page-sized units.
The actual memory allocation proceeds as follows:
- free_area_init_node(nid) builds the node's pg_data_t structure and sets offsets such as node_start_pfn.
- alloc_node_mem_map takes the information prepared by the previous function, computes the actual size to allocate (handling alignment required by, e.g., the buddy allocator), and passes it to the next function.
- memblock_alloc_node and memblock_alloc_try_nid supply the allocation parameters, which the next function carries out.
- memblock_alloc_internal finally performs the allocation, through the slab allocator when it is available. If the full request cannot be satisfied, it tries to allocate from whatever memory remains.
mem_map in Physical memory models
The structure that stores every physical page is maintained as an array named mem_map whose elements are of type struct page *; related functions are named after mem_map, memmap, or memblock.
Each physical page frame is represented by a struct page and all the structs are kept in a global mem_map array which is usually stored at the beginning of ZONE_NORMAL or just after the area reserved for the loaded kernel image in low memory machines.
Different physical memory models store mem_map in different forms. Since the memory models themselves are introduced in the next section, here we only list the mem_map each model uses:
- FLATMEM: mem_map
- DISCONTIGMEM: node_data[nid]->node_mem_map (in arm64)
- SPARSEMEM: section[i].section_mem_map
- SPARSEMEM_VMEMMAP: vmemmap
Further reading - page_wait_table
While a page is undergoing I/O, we want only one process operating on it at a time. This is where the wait_table managing the waiting queue comes from. Below is an illustration from around the 2.6 era (it has since changed):
If you inspect the source, however, both /mm/filemap.c and /kernel/sched/wait_bit.c contain bit_wait_table-related code. From the email thread Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit we can see that the wait_page_table operations were reworked.
It was then revised several more times; see:
Re: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit
We encountered workloads that have very long wake up list on large systems. A waker takes a long time to traverse the entire wake list and execute all the wake functions.

We saw page wait list that are up to 3700+ entries long in tests of large 4 and 8 socket systems. It took 0.8 sec to traverse such list during wake up. Any other CPU that contends for the list spin lock will spin for a long time. It is a result of the numa balancing migration of hot pages that are shared by many threads.

Multiple CPUs waking are queued up behind the lock, and the last one queued has to wait until all CPUs did all the wakeups.
Judging from this email thread, the waitqueue used by struct page is per-page only; there is no per-page plus per-bit combination.
Re: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit
But even without sharing the same queue, we could just do a per-page allocation for the three queues - and probably that stupiud add_page_wait_queue() waitqueue too. So no "per-page and per-bit" thing, just a per-page thing.
In today's 5.10, a hash table whose elements are struct wait_queue_head is maintained. A process obtains the index with page_waitqueue and then waits in the wait_on_page_bit function.
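In 5.10's mm/filemap.c, the table is a fixed-size array of wait-queue heads, and page_waitqueue hashes the struct page pointer into it:

```c
#define PAGE_WAIT_TABLE_BITS 8
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

static wait_queue_head_t *page_waitqueue(struct page *page)
{
	/* pages hashing to the same bucket share one wait queue head */
	return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
}
```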
At the start of the function, the waiter is added at the tail of the linked list (__add_wait_queue_entry_tail), after which it loops to check its own state.
On leaving the loop it calls finish_wait to clean up the waitqueue:
Below is part of an ftrace log:
When the page is released, the linked list is walked until one of the entries can take the page.
wikipedia - thundering herd problem
In computer science, the thundering herd problem occurs when a large number of processes or threads waiting for an event are awoken when that event occurs, but only one process is able to handle the event. When the processes wake up, they will each try to handle the event, but only one will win. All processes will compete for resources, possibly freezing the computer, until the herd is calmed down again
root/Documentation/vm/memory-model.rst
Memory: the flat, the discontiguous, and the sparse
Physical memory can be represented in several ways; the simplest is to count straight from 0 up to the largest representable address (for 64 bits, 0 up to 2^64 − 1). In practice, though, one also has to account for holes in the CPU-addressable range, ranges that differ across CPUs, NUMA, SMP, and other external factors. The Linux kernel provides three memory models: FLATMEM, DISCONTIGMEM, and SPARSEMEM, used for flat, discontiguous, and sparse memory spaces respectively.
At time of this writing, DISCONTIGMEM is considered deprecated, although it is still in use by several architectures.
All the memory models track the status of physical page frames using struct page arranged in one or more arrays.
Regardless of the selected memory model, there exists one-to-one mapping between the physical page frame number (PFN) and the corresponding struct page.
Each memory model defines :c:func:pfn_to_page and :c:func:page_to_pfn helpers that allow the conversion from PFN to struct page and vice versa.
FLATMEM suits non-NUMA systems whose physical memory is contiguous or mostly contiguous. This model keeps a global mem_map array of struct page elements covering all of physical memory, holes included; holes correspond to struct page objects that are never initialized. This gives an efficient pfn-to-struct page conversion, and reads of struct page are cache-friendly, because the conversion is just an index offset.
The elements of mem_map were only given the name struct page in version 1.3.50.
Before the mem_map array can be used, the free_area_init function must be called, and memblock_free_all must initialize it and hand it over to the page allocator.
/include/linux/mm.h - free_area_init()
/mm/page_alloc.c - free_area_init()
Each zone's start_pfn and end_pfn are stored in arch_zone_lowest_possible_pfn[zone] and arch_zone_highest_possible_pfn[zone]; when the zone is ZONE_MOVABLE, the next block is delimited separately, and each block starts where the previous one ends. ZONE_MOVABLE is an enum value defined in /include/linux/mmzone.h:
The very first start_pfn is found by find_min_pfn_with_active_regions (the smallest PFN value):
Below is the actual code, followed by the loop that prints the defined blocks:
Afterwards it prints the memory regions (nodes) set up earlier and enables sub-sections up to the last block of the newly allocated space (the preceding loop leaves start_pfn pointing at the last one).
/mm/memblock.c - memblock_free_all
Converting between a PFN and a struct page under FLATMEM is very direct: PFN - ARCH_PFN_OFFSET gives the position in the mem_map array.
ARCH_PFN_OFFSET is the page frame number of the first physical address in the system.
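The FLATMEM helpers in /include/asm-generic/memory_model.h are exactly this offset arithmetic:

```c
/* CONFIG_FLATMEM */
#define __pfn_to_page(pfn)	(mem_map + ((pfn) - ARCH_PFN_OFFSET))
#define __page_to_pfn(page)	((unsigned long)((page) - mem_map) + \
				 ARCH_PFN_OFFSET)
```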
If an architecture enables CONFIG_ARCH_HAS_HOLES_MEMORYMODEL option, it may free parts of the mem_map array that do not cover the actual physical pages. In such case, the architecture specific :c:func:pfn_valid implementation should take the holes in the mem_map into account.
DISCONTIGMEM, as the name suggests, targets discontiguous memory. It operates on physical memory in nodes, and each node independently stores its own memory-space information, such as the free-page list, in-use-page list, and LRU, in a struct pglist_data (pg_data_t). Each pg_data_t stores its physical pages in a node_mem_map array, the counterpart of FLATMEM's mem_map. Each node's first page frame is marked by node_start_pfn, which also means that whenever a pfn is converted to a struct page, one must first know which node owns it.
Each node initializes its pg_data_t object with the free_area_init_node function.
As the comment on FLATMEM's free_area_init explains, free_area_init initializes every pg_data_t, whereas free_area_init_node initializes a single one.
The PFN / struct page conversion is a bit more involved: it goes through the node number, nid (node ID), using pfn_to_nid and page_to_nid. Once the nid is known, indexing into the node_mem_map array yields the struct page, and that offset plus node_start_pfn yields its PFN.
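The DISCONTIGMEM variants in /include/asm-generic/memory_model.h make this node lookup explicit:

```c
/* CONFIG_DISCONTIGMEM */
#define __pfn_to_page(pfn)			\
({	unsigned long __pfn = (pfn);		\
	unsigned long __nid = arch_pfn_to_nid(__pfn);  \
	NODE_DATA(__nid)->node_mem_map + arch_local_page_offset(__pfn, __nid);\
})

#define __page_to_pfn(pg)						\
({	const struct page *__pg = (pg);					\
	struct pglist_data *__pgdat = NODE_DATA(page_to_nid(__pg));	\
	(unsigned long)(__pg - __pgdat->node_mem_map) +			\
	 __pgdat->node_start_pfn;					\
})
```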
nid is the Node ID which is the logical identifier of the node whose zones are being initialised;
Architectures that support DISCONTIGMEM provide :c:func:pfn_to_nid to convert PFN to the node number. The opposite conversion helper :c:func:page_to_nid is generic as it uses the node number encoded in page->flags
SPARSEMEM is the most versatile memory model available in Linux and it is the only memory model that supports several advanced features such as hot-plug and hot-remove of the physical memory, alternative memory maps for non-volatile memory devices and deferred initialization of the memory map for larger systems.
In contrast to DISCONTIGMEM, where the nodes each maintain their memory space with a pg_data_t, SPARSEMEM abstracts the memory map across hardware architectures. It operates on physical memory through struct mem_section, whose section_mem_map pointer points to an array of struct page. The section size is given by the SECTION_SIZE_BITS constant, and the maximum number of sections is determined by it together with MAX_PHYSMEM_BITS; both are architecture-dependent. MAX_PHYSMEM_BITS is defined from the physical address size the architecture provides, while SECTION_SIZE_BITS is an arbitrary value. For example, on arm these two values are:
The maximum number of sections, NR_MEM_SECTIONS, is defined as
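In /include/linux/mmzone.h the section count is derived from those two constants:

```c
#define SECTIONS_SHIFT	(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
#define NR_MEM_SECTIONS	(1UL << SECTIONS_SHIFT)
```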
For converting a pfn to a struct page, the high bits of the PFN hold the index into the sections array; for the reverse direction, which mem_section a page belongs to is stored in the page flags.
A few months after SPARSEMEM was introduced, SPARSEMEM_EXTREME was proposed: a two-dimensional array named mem_section made of struct mem_section objects. Its size depends on CONFIG_SPARSEMEM_EXTREME and the maximum number of sections:
With CONFIG_SPARSEMEM_EXTREME disabled, it is a static array with NR_MEM_SECTIONS rows and a single object per row. When enabled, the array is dynamically allocated; the number of rows is computed to cover all memory sections, and each row holds PAGE_SIZE / sizeof(struct mem_section) objects.
The memory sections and memory maps are initialized by calling the sparse_init() function.
For the conversions between a PFN and a page (pfn_to_page and page_to_pfn), SPARSEMEM offers two options, "classic sparse" and "sparse vmemmap", selected at build time via CONFIG_SPARSEMEM_VMEMMAP. SPARSEMEM_VMEMMAP was added in 2007; its idea is to map the entire memory map into a virtually contiguous region.
Another enhancement to SPARSEMEM was added in 2007; it was called generic virtual memmap support for SPARSEMEM, or SPARSEMEM_VMEMMAP. The idea behind SPARSEMEM_VMEMMAP is that the entire memory map is mapped into a virtually contiguous area, but only the active sections are backed with physical pages. This model wouldn't work well with 32-bit systems, where the physical memory size might approach or even exceed the virtual address space. However, for 64-bit systems SPARSEMEM_VMEMMAP is a clear win. At the cost of additional page table entries, page_to_pfn(), and pfn_to_page() became as simple as with the flat model.
sparse vmemmap optimizes the pfn / struct page conversion through this virtual mapping. vmemmap is a struct page pointer to a virtually contiguous array of struct page objects, and the PFN is simply an offset into it.
There is a global struct page *vmemmap pointer that points to a virtually contiguous array of struct page objects. A PFN is an index to that array and the offset of the struct page from vmemmap is the PFN of that page.
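With CONFIG_SPARSEMEM_VMEMMAP enabled, the helpers in /include/asm-generic/memory_model.h collapse back to flat-model simplicity:

```c
/* CONFIG_SPARSEMEM_VMEMMAP */
#define __pfn_to_page(pfn)	(vmemmap + (pfn))
#define __page_to_pfn(page)	(unsigned long)((page) - vmemmap)
```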
According to the kernel documentation:
To use vmemmap, an architecture has to reserve a range of virtual addresses that will map the physical pages containing the memory map and make sure that vmemmap points to that range. In addition, the architecture should implement :c:func:vmemmap_populate method that will allocate the physical memory and create page tables for the virtual memory map. If an architecture does not have any special requirements for the vmemmap mappings, it can use default :c:func:vmemmap_populate_basepages provided by the generic memory management.
In short, vmemmap stores the memory map at contiguous virtual addresses, and vmemmap_populate allocates the physical memory and also builds the page tables for that virtual range.
The virtually mapped memory map allows storing struct page objects for persistent memory devices in pre-allocated storage on those devices. This storage is represented with struct vmem_altmap that is eventually passed to vmemmap_populate() through a long chain of function calls. The vmemmap_populate() implementation may use the vmem_altmap along with :c:func:vmemmap_alloc_block_buf helper to allocate memory map on the persistent memory device.
Further reading - ZONE_DEVICE
Original text:
The ZONE_DEVICE facility builds upon SPARSEMEM_VMEMMAP to offer struct page mem_map services for device driver identified physical address ranges. The "device" aspect of ZONE_DEVICE relates to the fact that the page objects for these address ranges are never marked online, and that a reference must be taken against the device, not just the page to keep the memory pinned for active use. ZONE_DEVICE, via :c:func:devm_memremap_pages, performs just enough memory hotplug to turn on :c:func:pfn_to_page, :c:func:page_to_pfn, and :c:func:get_user_pages service for the given range of pfns. Since the page reference count never drops below 1 the page is never tracked as free memory and the page's struct list_head lru space is repurposed for back referencing to the host device / driver that mapped the memory.

While SPARSEMEM presents memory as a collection of sections, optionally collected into memory blocks, ZONE_DEVICE users have a need for smaller granularity of populating the mem_map. Given that ZONE_DEVICE memory is never marked online it is subsequently never subject to its memory ranges being exposed through the sysfs memory hotplug api on memory block boundaries. The implementation relies on this lack of user-api constraint to allow sub-section sized memory ranges to be specified to :c:func:arch_add_memory, the top-half of memory hotplug. Sub-section support allows for 2MB as the cross-arch common alignment granularity for :c:func:devm_memremap_pages.

The users of ZONE_DEVICE are:
- pmem: Map platform persistent memory to be used as a direct-I/O target via DAX mappings.
- hmm: Extend ZONE_DEVICE with ->page_fault() and ->page_free() event callbacks to allow a device-driver to coordinate memory management events related to device-memory, typically GPU memory. See Documentation/vm/hmm.rst.
- p2pdma: Create struct page objects to allow peer devices in a PCI/-E topology to coordinate direct-DMA operations between themselves, i.e. bypass host memory.
ZONE_DEVICE, like ZONE_MOVABLE, is defined in /include/linux/mmzone.h. This enum value builds upon SPARSEMEM_VMEMMAP to provide struct page and mem_map services for physical address ranges identified by device drivers; these addresses are never marked online, i.e., sysfs never treats them the way it treats ordinary memory.
Given that ZONE_DEVICE memory is never marked online it is subsequently never subject to its memory ranges being exposed through the sysfs memory hotplug api on memory block boundaries.
ZONE_DEVICE uses devm_memremap_pages to perform just enough memory hotplug to enable the pfn_to_page, page_to_pfn, and get_user_pages services for the given range of pfns.
Because the page reference count never drops below 1, such pages are never tracked as free memory; the page's struct list_head lru space is instead repurposed to back-reference the host device / driver that mapped the memory.
And because its ranges are never exposed through the sysfs memory hotplug API on memory-block boundaries, this memory is operated on through APIs such as arch_add_memory and devm_memremap_pages.
The implementation relies on this lack of user-api constraint to allow sub-section sized memory ranges to be specified to :c:func:arch_add_memory, the top-half of memory hotplug. Sub-section support allows for 2MB as the cross-arch common alignment granularity for :c:func:devm_memremap_pages.
Note: arch_add_memory is architecture-specific.
Having outlined the three memory models, we can now look at how pages and addresses are converted. As noted earlier, each memory model manages physical memory differently, so the conversion may differ too.
Converting a struct page to an address first yields a page frame number (pfn), which is then offset according to the memory model's management structure.
The following is the conversion used when __ASSEMBLY__ is not defined.
In addition, /include/asm-generic/memory_model.h, and asm-generic more generally, fills in these operations for architectures that do not define them themselves. For example, arm has no pfn_to_virt of its own, so it relies on this header.
The macros above tie a pfn to a pte:
For the VA and pfn, these macros together with the page-table helpers can be used to add an entry to a target page table, exploiting the relationship between a pfn and a physical address:
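A hedged sketch of installing one entry, assuming the caller has already walked to the pte (e.g., with the follow_pte-style walk earlier) and holds the page-table lock; the helper name is invented for illustration:

```c
/* sketch: map `page` at `address` in `mm`, given a mapped ptep */
static void install_page_sketch(struct mm_struct *mm, unsigned long address,
				pte_t *ptep, struct page *page, pgprot_t prot)
{
	/* build the entry from the page frame number plus protections */
	pte_t entry = pfn_pte(page_to_pfn(page), prot);

	/* write the entry into the page table */
	set_pte_at(mm, address, ptep, entry);
}
```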
Functions that allocate page-table memory use page_to_phys.
arm has no page_to_virt, but arm64 does.
tldp.org - Translating Addresses in Kernel Space
stackoverflow - Is there any API for determining the physical address from virtual address in Linux?
asm/io.h