# Making Dynamic Page Coalescing Effective on Virtualized Clouds. EuroSys '23

###### tags: `paper review`, `huge page`

> [time=Mon, Oct 23, 2023 7:38 PM]

---

## Summary

**Paper Title**: Making Dynamic Page Coalescing Effective on Virtualized Clouds
**Link**: https://web.njit.edu/~dingxn/papers/gemini.pdf
**Year**: EuroSys '23
**Keyword**: huge page

---

## Highlight

1. Huge pages have become a mainstream method to **reduce address translation overhead for big-memory workloads**. To create huge pages, system software usually uses page coalescing methods to dynamically combine contiguous base pages. However, their **effectiveness is substantially undermined on virtualized platforms**.
2. This paper identifies the problem and shows that **only huge guest pages backed by huge host pages** can effectively reduce address translation overhead. Existing page coalescing methods only aim to increase huge pages at each layer, and **fail to consider this cross-layer (host-guest) requirement** on the alignment of huge pages.
3. To address this issue, the paper designs a cross-layer solution that guides the formation and allocation of huge pages on both the guest and the host side. At each layer (guest or host), the memory management system is made aware of the huge pages formed at the other layer and carefully manages the corresponding huge page memory regions across the two layers, e.g., it guides page coalescing and huge page allocation to consider these regions first. This increases the chance of forming and allocating huge pages from these regions without aggravating the adverse effects incurred by excessive huge pages.
4. Evaluated on Linux/KVM with diverse real-world applications, the solution reduces TLB misses by up to 83% and improves performance by up to 126%.

## Problems

The problems start from TLB misses and address translation: TLB capacity cannot scale at the same rate as memory capacity, so address translation has become a major performance bottleneck for many big-memory workloads. This is especially true on virtualized clouds that use nested paging, such as Intel's extended page tables (EPT) and AMD's nested page tables (NPT). With nested paging, a TLB miss requires walking through two layers of page tables, and the cost can be 6x that of walking through one layer of page tables on a native system ([Performance Implications of Extended Page Tables on Virtualized x86 Processors. VEE '16](https://dl.acm.org/doi/10.1145/2892242.2892258)). Thus, using huge pages (e.g., 2 MB on x86) has become an effective mitigation, since one 2 MB huge page PTE covers what would otherwise require 512 4 KB page PTEs.

:::warning
**Page coalescing**

To turn a region of 4 KB pages into a 2 MB huge page, we can enable automatic coalescing in the kernel configuration (not recommended). Otherwise, we can use `madvise()` to **guide** the kernel to do the coalescing.
> I remember that, on the mailing list, a kernel developer said `madvise()` only suggests that the kernel do the coalescing, but I cannot find where it is.
> Moreover, there are two methods to merge 4 KB pages into a huge page.
> 1. THP (a.k.a. dynamic page coalescing): it only adds the memory region to the scanning list. Then, khugepaged decides when to coalesce the region. (`mm/khugepaged.c:collapse_huge_page()`)
> 2. `madvise()`: after calling it, the pages are coalesced during the page fault. While coalescing, if the region has holes in the PTEs, memory is allocated to fill the space.
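As a concrete illustration of the `madvise()` path above, here is a minimal user-space sketch (mine, not from the paper): map an anonymous region, carve out a 2 MB-aligned chunk, and hint the kernel to back it with a transparent huge page. Whether and when the kernel (or khugepaged) actually coalesces it remains the kernel's decision.

```c
/* Minimal sketch: hint the kernel to back a 2 MB-aligned region with a THP.
 * The madvise() call is only advice; the kernel decides if/when to coalesce. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_SIZE (2UL * 1024 * 1024)            /* 2 MB huge page on x86-64 */

int main(void)
{
    /* Over-allocate so we can carve out a 2 MB-aligned chunk. */
    size_t len = 2 * HUGE_SIZE;
    void *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) { perror("mmap"); return 1; }

    uintptr_t aligned = ((uintptr_t)raw + HUGE_SIZE - 1) & ~(HUGE_SIZE - 1);

    /* Advise the kernel that this range benefits from huge pages. */
    if (madvise((void *)aligned, HUGE_SIZE, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    memset((void *)aligned, 0, HUGE_SIZE);       /* touch it so it gets backed */
    return 0;
}
```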
Because pages may be allocated to fill such holes, page coalescing may incur memory space waste and high demand-paging overhead. Currently, the Linux kernel dynamically combines contiguous base pages into huge pages and splits under-utilized huge pages to reduce the space and paging overhead.
:::

The problem with huge pages on virtualized platforms, called the **huge page misalignment problem** in the paper, is that the guest and the host coalesce pages independently; this means that a guest huge page might be backed by base pages in the host. Using huge pages can reduce TLB misses only when both the guest and the host use huge pages for the same data, i.e., a huge GVP (guest virtual page) is backed by a huge GPP (guest physical page) and a huge HPP (host physical page) at the same time. The reason is that only in this case can the PTEs of the VM page tables be cached in the TLB and used correctly in address translation. When a huge GVP is backed by multiple base HPPs in the host, there is no PTE in the VM page table that corresponds to the GVP; thus, no PTE can be loaded into the TLB to help the translation. If a base GVP is backed by a huge HPP, because there is no PTE in the VM page table that corresponds to the virtual page, the page offset cannot be used to obtain the correct host physical address (HPA). Thus, in neither of these cases is there a valid PTE that can be cached in the TLB to help the address translation.

The paper proposes a solution; here are the details (quoted from the paper):
> Gemini first makes the memory management at a layer aware of the mis-aligned huge pages formed at the other layer. Gemini periodically scans the process page tables in VMs and the VM page tables in the host to find the mis-aligned huge pages. It keeps track of these huge pages using their guest physical addresses. Thus, a guest can check the guest physical addresses of the mis-aligned huge pages in the host, and tries to form and allocate huge guest pages at these addresses. The host can check the guest physical addresses of the mis-aligned huge pages in a guest, and tries to back these addresses using huge host pages.
>
> To form and allocate new huge pages that match the mis-aligned huge pages at the other layer, Gemini carefully manages the space of the memory regions corresponding to the mis-aligned huge pages. It reserves the space temporarily, hoping that the space can be allocated directly as huge pages or allocated as contiguous base pages, which later can be promoted into huge pages with minimal overhead. Then, Gemini guides page coalescing and huge page allocation to form and allocate huge pages from these regions first before considering other memory regions. These huge pages turn the mis-aligned huge pages at the other layer into well-aligned huge pages. Gemini does not create huge pages excessively, and thus does not aggravate the adverse effects incurred by excessive huge pages.
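To make the cross-layer requirement above concrete, here is a tiny, self-contained illustration (my own, not Gemini's code): only the combination where both the guest page table and the host VM page table use a huge PTE for the same 2 MB of data is well-aligned; every other combination leaves no PTE that the TLB can cache for the whole region.

```c
#include <stdio.h>

/* Illustration only: classify a 2 MB guest virtual region by whether the
 * guest maps it with a huge PTE and whether the host backs the corresponding
 * guest physical range with a huge (EPT/NPT) PTE. Only the well-aligned case
 * yields a PTE that the TLB can cache for the whole region. */
enum alignment { WELL_ALIGNED, HUGE_GVP_ON_BASE_HPPS, BASE_GVPS_ON_HUGE_HPP, NO_HUGE_PAGE };

static enum alignment classify(int guest_huge_pte, int host_huge_pte)
{
    if (guest_huge_pte && host_huge_pte) return WELL_ALIGNED;          /* reduces TLB misses */
    if (guest_huge_pte)                  return HUGE_GVP_ON_BASE_HPPS; /* mis-aligned */
    if (host_huge_pte)                   return BASE_GVPS_ON_HUGE_HPP; /* mis-aligned */
    return NO_HUGE_PAGE;
}

int main(void)
{
    const char *name[] = { "well-aligned", "huge GVP backed by base HPPs",
                           "base GVPs backed by a huge HPP", "no huge page" };
    for (int g = 1; g >= 0; g--)
        for (int h = 1; h >= 0; h--)
            printf("guest huge PTE=%d, host huge PTE=%d -> %s\n",
                   g, h, name[classify(g, h)]);
    return 0;
}
```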
:::warning
**Address Translation with virtualized systems**

A TLB is a small cache that buffers a number of page table entries (PTEs), which contain the real memory locations of the corresponding virtual pages. On a native system the PTEs of process page tables are cached; on a virtualized system the PTEs of VM page tables are cached. The PTEs are tagged with the virtual page addresses; this forms a table mapping virtual pages to their real locations. To translate a virtual address, the TLB uses the virtual page address (the higher bits of the virtual address) to find the corresponding page table entry and obtain the real location of the page. Then, it adds the page offset (the lower bits of the virtual address) to the page location to obtain the location of the data. When the PTE needed to finish an address translation can be found in the TLB (i.e., a TLB hit), the address translation overhead is minimized. Otherwise (i.e., a TLB miss), the TLB conducts a page walk to locate the PTE and load it into the TLB. On a native system, this means walking down a multi-level process page table.

---

**TLB miss overhead**

TLB misses and page walks can significantly increase address translation overhead, because they may incur memory accesses. For example, on x86 platforms, each page walk may incur up to 4 memory accesses on a native system and 24 memory accesses on a virtualized system with nested paging. In addition to these memory accesses, in two-dimensional page walks, since the guest physical addresses used in process page tables in the guest must be translated to host physical addresses, extra TLB misses may be incurred, further increasing the address translation overhead.
:::

## Main Idea

As their experiments show, only well-aligned huge pages can effectively reduce address translation overhead. Thus, the goal of the proposed system, GEMINI, is to effectively turn mis-aligned huge pages into well-aligned huge pages. The method is to create new huge pages and map them to the mis-aligned huge pages.

First, GEMINI **classifies mis-aligned huge pages into two types** (see the sketch at the end of this section):
1. The mis-aligned huge page does not have any base pages mapped to it.
    * Solution: temporarily reserve the corresponding memory region (with a time-out) so that it can be used in one of the following ways:
        * allocate a huge page at the corresponding address at the other layer;
        * if a huge page is not immediately available, allocate contiguous base pages there first, so no page migration is needed when they are later promoted.
2. The mis-aligned huge page has some base pages mapped to it, but the base pages cannot be directly promoted into a huge page; page migration must be involved first.
    * Maintaining the status of type-1 mis-aligned huge pages (i.e., the reservation above) prevents them from becoming type-2 mis-aligned huge pages, and thus avoids the high overhead associated with turning type-2 mis-aligned huge pages into well-aligned ones.

> Note that we are seeking a solution with low overhead. **Creating excessive huge pages causes both space overhead (e.g., memory fragmentation) and run-time overhead (e.g., page migrations)**. **The overhead constraint has been the main reason for existing systems to use multiple page sizes and dynamic page coalescing methods**. It is also a main factor affecting the design of Gemini.

Second, Gemini **enhances page allocators**. With the enhancement, when huge pages or contiguous base pages are needed, the memory regions reserved for maintaining the status of type-1 mis-aligned huge pages are used first in the allocations. This also increases the chance that type-1 mis-aligned huge pages are turned directly into well-aligned huge pages without going through the type-2 status.

Third, **when page coalescing** components are used to promote pages, they **first try to promote the base pages mapped to type-2 mis-aligned huge pages** before checking other base pages. This increases the chance that type-2 mis-aligned huge pages are turned into well-aligned huge pages, and it leverages the mechanisms in these page coalescing components to avoid high overhead and excessive page promotions.
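The per-region decision above can be summarized with a small, self-contained sketch (my own illustration; the type and helper names are hypothetical, not Gemini's code): regions with no base pages mapped (type-1) get booked so they can later be filled with a huge page or contiguous base pages, while partially mapped regions (type-2) are handed to the page coalescing path with priority.

```c
#include <stdio.h>

/* Illustration only (not Gemini's code): a simplified model of how one
 * mis-aligned huge page region found by the scanner might be handled,
 * following the two-type classification described above. */
enum misaligned_type { TYPE1_EMPTY, TYPE2_PARTIALLY_MAPPED };

enum action {
    BOOK_REGION,          /* type-1: reserve the region so a huge page or
                             contiguous base pages can be allocated there */
    PRIORITIZE_PROMOTION  /* type-2: let the coalescer promote these base
                             pages first; page migration may be needed */
};

static enum misaligned_type classify(int mapped_base_pages)
{
    return mapped_base_pages == 0 ? TYPE1_EMPTY : TYPE2_PARTIALLY_MAPPED;
}

static enum action handle(enum misaligned_type t)
{
    /* Keeping type-1 regions booked prevents them from degrading into
     * type-2 regions, which are far more expensive to fix. */
    return t == TYPE1_EMPTY ? BOOK_REGION : PRIORITIZE_PROMOTION;
}

int main(void)
{
    int samples[] = { 0, 37, 511 };   /* base pages already mapped in the 2 MB region */
    for (int i = 0; i < 3; i++)
        printf("%3d base pages mapped -> %s\n", samples[i],
               handle(classify(samples[i])) == BOOK_REGION ? "book region"
                                                           : "prioritize promotion");
    return 0;
}
```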
## GEMINI components

1. **The host-layer mis-aligned huge page scanner (MHPS)**
    * Periodically scans the page tables of the guest processes for the huge pages formed in the guest, and scans the page tables of VMs for the huge pages formed in the host.
    * For each huge page identified in the scanning, MHPS labels it with the system layer (i.e., guest or host), its guest physical address, and the VM ID of the address.
2. GEMINI uses these labels to determine the memory type (and makes each guest aware of the mis-aligned host huge pages that are mapped to it).
3. At each layer, GEMINI uses **the huge booking component** to reserve the huge-page-sized memory space corresponding to the type-1 mis-aligned huge pages (until a time-out).
    * The booking timeout value affects the overheads. A large value increases the waste of memory space and may increase memory fragmentation, as the memory space cannot be used by other processes or for other purposes during the booking period. A small value may reduce the effectiveness of GEMINI in forming well-aligned huge pages.
    * The component adjusts the value with an algorithm based on performance and the free memory fragmentation index ([FMFI](https://www.usenix.org/system/files/conference/osdi16/osdi16-kwon.pdf)).
4. **The enhanced memory allocator (EMA)** interacts with the huge booking component to preferentially allocate huge pages and contiguous base pages from the memory regions reserved by the huge booking component.
    * The main idea of EMA is to align the starting addresses of the GPA and HPA with the GVA at huge page granularity upon the first page fault to a huge-page-sized memory region.
    * **Huge preallocation.** In many cases, only a few of the 512 base pages required to form a huge page are missing, so EMA preallocates them. To control the space waste, EMA performs the preallocation only when two conditions are met:
        * the number of base pages allocated in the region exceeds a threshold (256, selected experimentally);
        * the degree of memory fragmentation is low (FMFI $\le$ 0.5), indicating that the workload tends to use contiguous memory pages.
5. **The promoter component** interacts with the page coalescing component in the system to preferentially promote the base pages mapped to type-2 mis-aligned huge pages.

:::warning
**Details**

**The enhanced memory allocator (EMA) is based on VMAs** instead of huge-page-sized memory regions, because the number of offset descriptors for huge-page-sized memory regions can be huge. When a GEMINI memory allocation is requested, EMA first tries to find the offset descriptor associated with the VMA that the requested address is in. Then, it uses that offset descriptor to calculate the guest/host physical address (see the sketch below). The offset descriptors are stored in a self-organizing [linear search list](https://dl.acm.org/doi/10.1145/5505.5507) to optimize the search time.

There are two issues with VMAs:
* What guest/host physical memory space should be booked to fit the entire VMA?
    * GEMINI tries to align the entire VMA to huge page boundaries to increase memory contiguity and reduce memory fragmentation [1](https://dl.acm.org/doi/10.1145/3307650.3322223).
    * To address this issue, GEMINI enhances the buddy allocator in Linux.
* How to handle the case where the target GPA or HPA is unavailable for allocation?
    * Explanation: after the guest/host memory space is booked, the target GPA/HPA calculated by EMA may not be available for allocation, because the corresponding VMA may have changed (e.g., expanded) or the target GPA/HPA may already have been allocated.
    * To address this, GEMINI uses the GEMINI contiguity list and sub-VMAs.
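Here is a rough sketch of the offset-descriptor calculation mentioned above (my own illustration; the struct fields and constants are hypothetical, not Gemini's data structures): the booked physical region is chosen so that its 2 MB boundaries line up with the VMA's, and the target GPA/HPA is simply the same displacement inside the booked region.

```c
#include <stdio.h>

/* Sketch only: a hypothetical per-VMA offset descriptor that records where
 * the booked guest (or host) physical region starts, chosen so that 2 MB
 * boundaries line up with the VMA on the virtual side. */
#define HUGE_SIZE  (2UL << 20)           /* 2 MB */
#define HUGE_MASK  (~(HUGE_SIZE - 1))

struct offset_desc {
    unsigned long vma_start;     /* start of the VMA (GVA) */
    unsigned long booked_start;  /* 2 MB-aligned start of the booked GPA/HPA region */
};

/* Target physical address for a faulting virtual address: keep the same
 * displacement from the huge-page-aligned region start on both sides, so a
 * later promotion produces a well-aligned huge page. */
static unsigned long target_paddr(const struct offset_desc *d, unsigned long addr)
{
    return d->booked_start + (addr - (d->vma_start & HUGE_MASK));
}

int main(void)
{
    struct offset_desc d = { .vma_start = 0x7f3a00201000UL,
                             .booked_start = 0x140000000UL };
    unsigned long fault_addr = 0x7f3a00234560UL;
    printf("fault GVA 0x%lx -> target PA 0x%lx\n",
           fault_addr, target_paddr(&d, fault_addr));   /* same 2 MB page offset */
    return 0;
}
```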
---

**The GEMINI contiguity list**

When a VMA is touched for the first time, Gemini searches the Gemini contiguity list for a free physical memory region that could fit it, using a next-fit policy. At the host level, Gemini searches the Gemini contiguity list of the HPA space to fit the contiguous guest physical memory space that has just been touched for the first time. This increases the possibility of forming more well-aligned huge pages. The search starts from the place where it left off the previous time.

**Sub-VMA**

After Gemini searches the Gemini contiguity list, it may not find an available memory region that fits the entire VMA. In this case, Gemini chooses the largest free memory region and generates a new starting address, a new length, and a new offset for the remaining VMA. The sub-VMA mechanism is applied to each level independently to avoid costly interactions between the guest and host levels.

**Huge Bucket**

The huge bucket is implemented by repurposing the buddy allocator. It books the mis-aligned host huge pages and, on each allocation, hands out a whole huge-page-sized guest physical memory region backed by a host huge page. The buddy allocator groups the free memory pages into blocks; each block contains $2^x$ contiguous free pages and is aligned to $2^x \times 4$ KB, where $x$ is the order of the block. For well-aligned huge pages that have been freed after use, GEMINI temporarily reserves them for a time period and then returns them to the OSs (earlier if the memory pressure watermark is reached or memory fragmentation becomes severe).
:::
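As a quick arithmetic check on the block sizes involved (my own note, not from the paper): with 4 KB base pages, a 2 MB huge page corresponds to a buddy block of order 9.

```c
#include <stdio.h>

/* Quick check: a buddy block of order x holds 2^x contiguous 4 KB pages and
 * is aligned to 2^x * 4 KB, so a 2 MB huge page corresponds to order 9
 * (512 * 4 KB = 2 MB). */
int main(void)
{
    unsigned long base = 4UL << 10;     /* 4 KB base page */
    unsigned long huge = 2UL << 20;     /* 2 MB huge page */
    int order = 0;
    while ((base << order) < huge)
        order++;
    printf("huge page = order-%d block (%lu base pages)\n", order, 1UL << order);
    return 0;
}
```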