paper review, huge page
Mon, Oct 23, 2023 7:38 PM
Paper Title: Making Dynamic Page Coalescing Effective on Virtualized Clouds
Link: https://web.njit.edu/~dingxn/papers/gemini.pdf
year: EuroSys '23
Keyword: huge page
The paper starts from TLB misses and the address translation problem: TLB capacity cannot scale at the same rate as memory capacity, and this has become a major performance bottleneck for many big-memory workloads, especially on virtualized clouds that use nested paging, such as Intel's extended page tables (EPT) and AMD's nested page tables (NPT). The key factor is that, with nested paging, a page walk must traverse two layers of page tables, so the cost of a TLB miss can be 6x as much as walking through one layer of page tables on a native system (Performance Implications of Extended Page Tables on Virtualized x86 Processors, VEE '16). Thus, using huge pages (e.g., 2 MB on the x86 architecture) has become an effective mitigation, since one 2 MB huge page PTE covers the same range as 512 4 KB page PTEs.
Page coalescing
To make a region of 4 KB pages become a 2 MB huge page, we can turn on automatic coalescing in the kernel configuration (i.e., set transparent huge pages to "always"; not recommended). Otherwise, we can use madvise() to guide the kernel to do the coalescing.
I remember that, on a mailing list, a kernel developer said madvise() only suggests that the kernel do the coalescing, but I cannot find where that was.
Moreover, there are two methods to merge 4 KB pages into a huge page:
- THP (a.k.a. dynamic page coalescing): it only adds the memory region to a scanning list; khugepaged then decides when to coalesce the region (mm/khugepaged.c:collapse_huge_page()).
- madvise(): after calling the function, the pages are coalesced during page faults (see the sketch below).
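As a concrete illustration, here is a minimal user-space sketch of the madvise() path. The region size and the 2 MB alignment dance are illustrative, and THP must be enabled in at least "madvise" mode for the hint to have any effect:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL << 20)          /* 2 MB huge page on x86-64 */

int main(void)
{
    size_t len = 8 * HPAGE_SIZE;        /* illustrative region size */

    /* Over-allocate by one huge page so we can align the start to 2 MB;
     * mmap only guarantees 4 KB alignment. */
    void *raw = mmap(NULL, len + HPAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) { perror("mmap"); return 1; }

    char *buf = (char *)(((uintptr_t)raw + HPAGE_SIZE - 1)
                         & ~(uintptr_t)(HPAGE_SIZE - 1));

    /* A hint, not a command: with THP in "madvise" mode, page faults in
     * this range may be served with 2 MB pages, or khugepaged may
     * collapse the 4 KB pages later. */
    if (madvise(buf, len, MADV_HUGEPAGE) != 0)
        perror("madvise");

    memset(buf, 0, len);                /* touch the region */
    return 0;
}
```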
During page coalescing, if the region has holes in its PTEs, the kernel allocates memory to fill them. So page coalescing may incur wasted memory space and high demand-paging overhead. Currently, the Linux kernel dynamically combines contiguous base pages into huge pages and splits under-utilized huge pages to reduce the space and paging overhead.
The problem with huge pages on virtualized platforms, called the huge page misalignment problem in the paper, is that the guest and the host coalesce pages independently; this means a guest huge page might be backed by base pages in the host.
Using huge pages can reduce TLB misses only when both the guest and the host use huge pages for the same data, i.e., a huge GVP (guest virtual page) is backed by a huge GPP (guest physical page) and a huge HPP (host physical page) at the same time. Only in this case can the PTEs of the VM page tables be cached in the TLB and used correctly in address translation. When a huge GVP is backed by multiple base HPPs in the host, there is no PTE in the VM page table that corresponds to the GVP; thus, no PTE can be loaded into the TLB to help translation. If a base GVP is backed by a huge HPP, because there is no PTE in the VM page table that corresponds to the virtual page, the page offset cannot be used to obtain the correct host physical address (HPA). Thus, in neither of these cases is there a valid PTE that can be cached in the TLB to help the address translation.
The paper proposes a solution; here are the details:
Gemini first makes the memory management at a layer aware of the mis-aligned huge pages formed at the other layer. Gemini periodically scans the process page tables in VMs and the VM page tables in the host to find the mis-aligned huge pages. It keeps track of these huge pages using their guest physical addresses. Thus, a guest can check the guest physical addresses of the mis-aligned huge pages in the host, and try to form and allocate huge guest pages at these addresses. The host can check the guest physical addresses of the mis-aligned huge pages in a guest, and try to back these addresses using huge host pages.
To form and allocate new huge pages that match the mis-aligned huge pages at the other layer, Gemini carefully manages the space of the memory regions corresponding to the mis-aligned huge pages. It reserves the space temporarily, hoping that the space can be allocated directly as huge pages or allocated as contiguous base pages, which later can be promoted into huge pages with minimal overhead. Then, Gemini guides page coalescing and huge page allocation to form and allocate huge pages from these regions first before considering other memory regions. These huge pages turn the mis-aligned huge pages at the other layer into well-aligned huge pages. Gemini does not create huge pages excessively, and thus does not aggravate the adverse effects incurred by excessive huge pages.
Address translation on virtualized systems
A TLB is a small cache that buffers a number of page table entries (PTEs), which contain the real memory locations of the corresponding virtual pages. On a native system the PTEs of process page tables are cached; and on a virtualized system the PTEs of VM page tables are cached. The PTEs are tagged with the virtual page addresses; this forms a table mapping virtual pages to their real locations.
To translate a virtual address, the TLB uses the virtual page address (higher bits in the virtual address) to find the corresponding page table entry and obtain the real location of the page. Then, it adds the page offset (lower bits in the virtual address) to the page location to obtain the location of the data.
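To make the higher-bits/lower-bits split concrete, here is a small standalone sketch (the address value is made up) of how the same virtual address splits for 4 KB and 2 MB pages on x86-64; it also shows why one huge page PTE stands in for 512 base page PTEs:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t va = 0x7f3a12345678ULL;          /* hypothetical virtual address */

    /* 4 KB base page: 12-bit offset; the upper bits tag the TLB entry. */
    uint64_t vpn4k = va >> 12;
    uint64_t off4k = va & ((1ULL << 12) - 1);

    /* 2 MB huge page: 21-bit offset, so one PTE covers 2^9 = 512 base pages. */
    uint64_t vpn2m = va >> 21;
    uint64_t off2m = va & ((1ULL << 21) - 1);

    printf("4 KB page: vpn=%#" PRIx64 " offset=%#" PRIx64 "\n", vpn4k, off4k);
    printf("2 MB page: vpn=%#" PRIx64 " offset=%#" PRIx64 "\n", vpn2m, off2m);
    return 0;
}
```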
When the PTE needed to finish an address translation can be found in the TLB (i.e., a TLB hit), the address translation overhead is minimized. Otherwise (i.e., a TLB miss), the TLB conducts a page walk to locate the PTE and load it into the TLB. On a native system, this means walking down a multi-level process page table; on a virtualized system with nested paging, it is a two-dimensional walk through both the process page tables and the VM page tables.
TLB misses overhead
TLB misses and page walks can significantly increase address translation overhead, because they may incur memory accesses. For example, on x86 platforms, each page walk may incur up to 4 memory accesses on a native system and 24 memory accesses on a virtualized system with nested paging.
In addition to memory accesses, in two-dimensional page walks, since the guest physical addresses used in process page tables in the guest must be translated to host physical addresses, extra TLB misses may be incurred, further increasing the address translation overhead.
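A quick sanity check of these numbers (my own arithmetic, consistent with the VEE '16 reference above): with an n-level guest page table and an m-level host page table, each of the n guest PTE references, plus the final guest physical address, must itself be translated by a full m-level host walk, so a nested page walk costs

    n*m + n + m = (n+1)(m+1) - 1

memory accesses. With n = m = 4 on x86-64, this gives 5 * 5 - 1 = 24, versus 4 on a native system, which is also where the ~6x figure cited earlier comes from.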
According to their experiments, only well-aligned huge pages can effectively reduce address translation overhead. Thus, the proposed system, GEMINI, aims to effectively turn mis-aligned huge pages into well-aligned huge pages. The method is to create new huge pages and map them to the mis-aligned huge pages at the other layer.
First, GEMINI classifies mis-aligned huge pages into two types: type-1, where the corresponding memory region at the other layer has not been populated yet, so it can still be allocated directly as a huge page; and type-2, where the region at the other layer is already backed by base pages, which must be promoted through page coalescing.
Note that we are seeking a solution with low overhead. Creating excessive huge pages causes both space overhead (e.g., memory fragmentation) and run-time overhead (e.g., page migrations). The overhead constraint has been the main reason for existing systems to use multiple page sizes and dynamic page coalescing methods. It is also a main factor affecting the design of Gemini.
Second, Gemini enhances page allocators. With the enhancement, when huge pages or contiguous base pages are needed, the memory regions reserved for maintaining the status of type-1 mis-aligned huge pages are used first in the allocations. This also increases the chance of turning type-1 mis-aligned huge pages directly into well-aligned huge pages without going through the type-2 status.
Third, when page coalescing components are used to promote pages, they first try to promote the base pages mapped to type-2 mis-aligned huge pages before checking other base pages. This is to increase the chance that type-2 mis-aligned huge pages are turned into well-aligned huge pages and to leverage the mechanisms in these page coalescing components to avoid high overhead and excessive page promotions.
Details
The Enhanced Memory Allocator (EMA) tracks offsets per VMA instead of per huge-page-sized memory region, because the number of offset descriptors for huge-page-sized regions could be huge.
When a GEMINI memory allocation is requested, EMA first tries to find the offset descriptor associated with the VMA that contains the requested address, then uses that offset descriptor to calculate the guest/host physical address. (The offset descriptors are stored in a self-organizing linear search list to optimize the search time.)
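The notes above only say the descriptors sit in a self-organizing linear search list, so here is a minimal sketch assuming the common move-to-front heuristic; the structure and field names are hypothetical, not Gemini's actual code:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical offset descriptor: maps a VMA's virtual range to the
 * offset such that VA + offset = guest/host physical address. */
struct offset_desc {
    uintptr_t vma_start, vma_end;
    intptr_t offset;
    struct offset_desc *next;
};

/* Linear search with move-to-front: a hit is relinked at the head,
 * so descriptors of recently used VMAs are found in a few steps. */
static struct offset_desc *lookup(struct offset_desc **head, uintptr_t addr)
{
    struct offset_desc *prev = NULL, *cur = *head;

    while (cur && !(addr >= cur->vma_start && addr < cur->vma_end)) {
        prev = cur;
        cur = cur->next;
    }
    if (cur && prev) {              /* found, and not already at the head */
        prev->next = cur->next;
        cur->next = *head;
        *head = cur;
    }
    return cur;                     /* NULL if no descriptor covers addr */
}
```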
There are two issues with this VMA-based management, which GEMINI addresses with the following two mechanisms:
The GEMINI contiguity list
When a VMA is touched for the first time, Gemini searches the Gemini contiguity list for a free physical memory region that can fit it, using the next-fit policy. At the host level, Gemini searches the Gemini contiguity list of HPAs to fit the contiguous guest physical memory space that was just first-touched. This increases the possibility of forming more well-aligned huge pages. The search starts from the place where it left off the previous time.
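A minimal sketch of that next-fit search over a circular list of free regions (the structure and names are mine, not Gemini's):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry in the contiguity list: one free physical region. */
struct free_region {
    uint64_t start, len;            /* physical start address and length */
    struct free_region *next;       /* circular singly-linked list */
};

/* Next-fit: resume from where the previous search stopped, wrap around
 * at most once, and return the first region that is large enough. */
static struct free_region *next_fit(struct free_region **cursor, uint64_t need)
{
    struct free_region *begin = *cursor, *cur = *cursor;

    if (!cur)
        return NULL;
    do {
        if (cur->len >= need) {
            *cursor = cur->next;    /* remember where we left off */
            return cur;
        }
        cur = cur->next;
    } while (cur != begin);
    return NULL;    /* nothing fits the whole VMA; see Sub-VMA below */
}
```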
Sub-VMA
After Gemini searches the Gemini contiguity list, it may not find an available memory region that fits the entire VMA. In this case, Gemini chooses the largest free memory region and generates a new starting address, a new length, and a new offset for the remaining part of the VMA.
The sub-VMA mechanism is applied to each level independently to avoid costly interactions between the guest and host levels.
Huge Bucket
They implement the huge bucket by repurposing the buddy allocator.
The huge bucket bookkeeps the mis-aligned host huge pages and, on each allocation, hands out a whole huge-page-sized guest physical memory region backed by a host huge page.
They let the buddy allocator group the free memory pages into blocks. A block of order x contains 2^x contiguous free pages and is aligned to 2^x × 4 KB.
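For concreteness, a tiny sketch of that order/alignment rule (this is standard buddy-allocator behavior, not the huge bucket logic itself):

```c
#include <stdint.h>
#include <stdio.h>

#define BASE_PAGE_SHIFT 12                  /* 4 KB base pages */

/* A block of order x spans (1 << x) base pages and starts on a
 * (1 << x) * 4 KB boundary; a 2 MB huge page is an order-9 block. */
static int aligned_for_order(uint64_t phys, unsigned order)
{
    uint64_t bytes = 1ULL << (BASE_PAGE_SHIFT + order);
    return (phys & (bytes - 1)) == 0;
}

int main(void)
{
    printf("order-9 block = %llu bytes\n",  /* 2097152 = 2 MB */
           (unsigned long long)(1ULL << (BASE_PAGE_SHIFT + 9)));
    printf("0x200000 aligned for order 9? %d\n", aligned_for_order(0x200000, 9));
    printf("0x201000 aligned for order 9? %d\n", aligned_for_order(0x201000, 9));
    return 0;
}
```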
For well-aligned huge pages that have been freed after use, GEMINI temporarily reserves them for a time period (or until the memory pressure watermark is reached, or memory fragmentation becomes severe) before returning them to the OS.