# HyperTRIO: Hyper-Tenant Translation of I/O Addresses
###### tags : `cache storage`, `input-output programs`, `resource allocation`, `storage management`, `virtual machines`
###### paper origin : ACM/IEEE ISCA
###### papers : [link](https://ieeexplore.ieee.org/document/9138993)
# 1. Introduction
* **Problems :**
* The increasing number of hardware accelerators along with growing interconnection bandwidth creates a new class of devices available for sharing. To fully utilize the potential of these devices, I/O architecture needs to be carefully designed for both processors and devices.
* The challenge that this paper addresses is the severe underutilization of available I/O bandwidth as the number of tenants approaches the hyper-tenant regime.
* We identify that this severe I/O bandwidth underutilization is caused by the lack of scalability in the I/O address translation subsystem: IOMMU design, device design, and software structures.
* **Paper work :**
* Architectural design and evaluation of **Hyper-tenant TRanslation of I/O addresses (HyperTRIO)**. We investigate the interaction between the I/O device and system memory and use cloud benchmarks as a case study.
* The creation and open-source release of a Hyper-tenant Simulator of I/O address accesses (HyperSIO) used for analysis and performance evaluation.
* Detailed analysis of inter- and intra-tenant interactions during I/O address translation.
* Study of IOTLB replacement policies, partitioning, parallel address translation, and prefetching of I/O address translations.
# 2. Background and Motivation
## 2.1 Device Sharing and Address Translation
* Single Root I/O Virtualization (SR-IOV) provides a way for one physical device to be shared between multiple independent tenants. Such an I/O device can be viewed as a number of separate PCIe devices, each of them represented by a **Virtual Function (VF)**.
* Each VF can be used independently by a tenant, providing isolation and low virtualization overhead while efficiently using the available hardware resources.
* To further decrease the involvement of the tenant's CPU when moving data between main memory and the I/O device, modern-day processors use direct memory access (DMA).
* When communicating with the tenant’s main memory via DMA, the device uses **guest I/O virtual addresses (gIOVA)** to read/write data from/to it. These addresses are generated by a tenant’s OS, and they provide a flexible manner for accessing a shared device by multiple isolated units at the same time.
* However, all of the gIOVAs must be translated to **host physical addresses (hPAs)** before reading/writing data in the memory. **I/O Memory Management Units (IOMMUs)** implement this functionality for both non-virtualized and virtualized environments.
* In the latter (virtualized) case, translation takes the form of a two-dimensional page-table walk, shown in Figure 2. Every access to a guest page table (labeled as the first-level walk) incurs a walk through a host page table (second-level walk). **This two-dimensional walk is expensive** and requires 24 or 35 memory accesses for 4-level or 5-level page tables, respectively, on the x86-64 architecture. In order to reduce the number of memory accesses, IOMMUs can have translation caches (L[1-4]TLBs in Figure 2) or nested TLBs, which store translations from guest physical to host physical addresses.
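* A quick way to see these counts (my own arithmetic, consistent with the numbers above): each of the $n$ guest page-table levels is read through a full $n$-level host walk, and the final guest-physical address needs one more host walk, giving
  $$
  n(n+1) + n = (n+1)^2 - 1 = 24 \ (n{=}4), \quad 35 \ (n{=}5).
  $$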

* As an example, Figure 3 shows the translation steps performed for each incoming packet:
    1. A tenant places a gIOVA into its corresponding ring buffer, located inside the packet-handling logic; it is read upon the arrival of a packet.
    2. After identifying the Source ID (SID) of the request (e.g., PCIe Bus/Device/Function), the device looks it up in the Context Cache (CC in Figure 3) to find the corresponding Context Entry (CE), configured by the host, which contains a pointer to the base of the second-level paging entries and a Device ID (DID).
    3. To accelerate translation from gIOVA to hPA, the device can cache the most recent translations in the Device Translation Lookaside Buffer (DevTLB), which is checked for the request.
    4. The device then sends a request to the system over PCIe: with a translated address in the case of a DevTLB hit, and with an untranslated address otherwise.
    5. On a miss in the DevTLB, the translation subsystem (IOMMU), located in the chipset in Figure 3, translates the gIOVA by performing a two-dimensional Page Table Walk (PTW). Multiple hardware structures can cache page-table entries (L[1-4]TLBs) and gIOVA-to-hPA translations (IOTLB) to accelerate the walk.
    6. When an entry is not found in the corresponding caching structure, the IOMMU accesses main memory to retrieve the page-table entry (PTE).
    7. After the page table walk is finished, the hPA is returned to the device.
    8. The device then completes the requested read/write operation.
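* As a rough illustration of steps 1-4 above, here is a minimal C++ sketch of the device-side decision logic. The structure layout, key encoding, and names (`Device`, `handle_packet`) are simplifications introduced for illustration, not the paper's hardware design.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Hypothetical, simplified view of the device-side structures in Figure 3.
struct PcieRequest {
    uint64_t addr;        // hPA if already translated, gIOVA otherwise
    bool     translated;  // translated (ATS-style) request or not
    uint16_t sid;         // Source ID (PCIe Bus/Device/Function) of the VF
};

struct ContextEntry { uint64_t second_level_base; uint16_t did; };

struct Device {
    std::unordered_map<uint16_t, ContextEntry> context_cache;  // step 2
    std::unordered_map<uint64_t, uint64_t>     devtlb;         // step 3, keyed by SID | gIOVA page

    // Steps 1-4: build the outgoing PCIe request for one packet descriptor
    // (the gIOVA was read from the tenant's ring buffer in step 1).
    PcieRequest handle_packet(uint16_t sid, uint64_t giova) const {
        (void)context_cache.find(sid);                  // look up the context entry (unused in this sketch)
        const uint64_t key = (uint64_t(sid) << 48) | (giova >> 12);
        if (auto it = devtlb.find(key); it != devtlb.end())
            return {it->second | (giova & 0xfff), /*translated=*/true, sid};  // DevTLB hit
        // DevTLB miss: the untranslated gIOVA goes to the IOMMU in the chipset,
        // which performs the two-dimensional walk (steps 5-6) and returns the hPA (step 7).
        return {giova, /*translated=*/false, sid};
    }
};
```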

* Throughout this work, we focus on hyper-tenant environments where **every tenant requires a two-dimensional page table walk** to translate its guest I/O Virtual Address (gIOVA) to a host Physical Address (hPA).
## 2.2 Case-Study of a SR-IOV NIC
* Using two servers, one with an AMD Ryzen 9 3900X CPU and another with an Intel Xeon E7-4870, the authors measure the bottlenecks of IOMMU translation in a multi-tenant use case.
* **AMD :**
* They record the number of IOMMU TLB PTE hits/misses and the number of nested IOMMU page reads while varying the number of parallel connections between 2 and 120.
* For all the experiments, the total bandwidth was around 12.1Gb/s, the same as with a non-virtualized configuration (less than the expected 20Gb/s due to the NIC design, a limitation also found in other studies).
* Using the recorded statistics, we calculate the TLB PTE miss rate, which stays below 0.1% with fewer than 80 connections.
* For larger numbers of connections, however, the miss rate grows, reaching 4.3% at 120 connections (shown in Figure 4).

* In addition to the increasing IOMMU TLB PTE miss rate, the number of nested page table reads grows more than 400-fold between 80 and 120 connections.
* **These results indicate that a large number of tenants causes contention for cached translations, and this will be even more challenging in a hyper-tenant setup.**
* **Intel :**
* They compare I/O bandwidth when running native versus virtualized (using VFs) network interfaces.
* They observe that a single connection using a host interface can utilize up to 8.7Gb/s, which is less than the 9.49Gb/s of useful bandwidth possible for 1500B packets on a 10Gb/s link (see the framing arithmetic at the end of this section). This behavior is caused by a CPU bottleneck on the server side and can be removed by using faster cores.
* At the same time, the maximum achievable bandwidth for the connection using a VF is only 6.7Gb/s, which is lower than the direct (nonvirtualized) host interface speed. This can be explained by virtualization overhead.

* When the number of client-server pairs increases, bandwidth per connection goes down, therefore removing the CPU bottleneck
* As the number of connections is varied between two and eight, the physical link is 99% utilized in both experiments. However, when the number of connection pairs exceeds eight, the total bandwidth for configurations using VFs starts to decrease, flattening out at around 0.5Gb/s for more than 16 pairs.
* In contrast, increasing the number of client-server pairs does not affect total bandwidth in the case of running on the host directly
* From the above experiments, we conclude that **it is contention for a shared IOVA translation resource in the virtualized setup which ultimately limits the utilization of available I/O bandwidth**.
* Every tenant’s OS allocates a number of pages for use by its device independently, and **translations of gIOVAs for every tenant start thrashing the DevTLB and L[1-4]TLBs and overload the IOMMU with a large number of requests**, as the experiment with the AMD host showed.
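* One way to arrive at the 9.49Gb/s figure quoted above (my own arithmetic, assuming standard Ethernet framing of 14B header + 4B FCS + 8B preamble + 12B inter-frame gap, and 40B of TCP/IP headers per 1500B frame):
  $$
  10\,\text{Gb/s} \times \frac{1500 - 40}{1500 + 14 + 4 + 8 + 12} = 10 \times \frac{1460}{1538} \approx 9.49\,\text{Gb/s}
  $$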
# 3. HyperTRIO Architecture
* **To remove the guest I/O Virtual Address (gIOVA) translation bottleneck in a hyper-tenant environment** and **enable full utilization of available I/O device bandwidth**, we propose the HyperTRIO architecture - Hyper-tenant TRanslation of I/O addresses
* three main innovations :
* **Pending Translation Buffer (PTB)** : The PTB is located inside each device and supports multiple in-flight translations from different tenants.
* **Partitioned Device-TLB (P-DevTLB)** : The P-DevTLB provides architectural support for translation isolation by assigning a unique tenant’s tag per row of the DeviceTLB
* **Translation Prefetching Scheme** : The Prefetch Unit (PU) initiates translations of the most recent gIOVAs recorded from tenants' previous accesses to the IOMMU, using inter-tenant information.
* The HyperTRIO architecture is presented in Figure 6, with our newly added blocks shown in light gray.

## 3.1 Pending Translation Buffer (PTB)
* To avoid head-of-line blocking while an IOMMU performs a two-dimensional page table walk, the PTB supports **out-of-order translation completion**.
* However, to keep the PTB size reasonable, we look for optimizations in other parts of the design: for example, if every translation for 1500B packets required a full 4-level two-dimensional page table walk, the PTB would have to track 112 outstanding requests, which is expensive.
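* A minimal sketch of what such a pending-translation buffer could look like. The entry layout, the tag-based completion interface, and the 32-entry size (one of the sizes studied in Section 5.4) are illustrative assumptions rather than the paper's exact design.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Hypothetical pending-translation buffer: tracks in-flight gIOVA translations
// so that a miss from one tenant does not block requests behind it (no
// head-of-line blocking); completed walks may return in any order.
struct PendingTranslationBuffer {
    struct Entry { bool valid = false; uint16_t sid = 0; uint64_t giova = 0; };
    static constexpr int kEntries = 32;            // assumed size (8-32 entries are studied in Section 5.4)
    std::array<Entry, kEntries> slots{};

    // Allocate a slot when a DevTLB miss is sent to the IOMMU; returns the tag
    // carried with the translation request, or nullopt if the PTB is full.
    std::optional<int> allocate(uint16_t sid, uint64_t giova) {
        for (int tag = 0; tag < kEntries; ++tag)
            if (!slots[tag].valid) { slots[tag] = {true, sid, giova}; return tag; }
        return std::nullopt;                       // full: back-pressure the device pipeline
    }

    // Completions arrive out of order, identified by their tag.
    Entry complete(int tag) {
        Entry e = slots[tag];
        slots[tag].valid = false;
        return e;
    }
};
```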
## 3.2 Partitioned Device-TLB (P-DevTLB)
* Every translation request received by a DevTLB contains a Source ID (SID) and/or a Process Address Space Identifier (PASID). SID assignment is controlled by the hypervisor; it is known once a VF is allocated to a tenant and does not depend on the tenant's type. Therefore, it can be used to isolate translations of independent tenants at the DevTLB.
* We add a partition tag (PTag) to every row in the DevTLB, which must match a request's SID in order for its translation to be cached in that row.
* This partitioning of a single DevTLB enables performance isolation, e.g., it prevents a low-bandwidth tenant from evicting translations for high-bandwidth tenants
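* A rough sketch of the partition check described above, assuming a set-associative DevTLB with one PTag per row and a simple SID-to-row mapping; the class name, victim choice, and mapping are illustrative assumptions, not taken from the paper.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical partitioned DevTLB: each row (set) carries a partition tag (PTag)
// that must match the request's SID before a translation may be cached there,
// so one tenant cannot evict another tenant's entries.
class PartitionedDevTLB {
public:
    PartitionedDevTLB(int rows, int ways) : rows_(rows), sets_(rows, Row{std::vector<Way>(ways)}) {}

    // The hypervisor assigns a row to a tenant when its VF is set up.
    void set_ptag(int row, uint16_t sid) { sets_[row].ptag = sid; }

    std::optional<uint64_t> lookup(uint16_t sid, uint64_t giova_page) {
        Row& row = sets_[row_of(sid)];
        if (row.ptag != sid) return std::nullopt;            // not this tenant's partition
        for (auto& w : row.ways)
            if (w.valid && w.giova_page == giova_page) return w.hpa_page;
        return std::nullopt;
    }

    void insert(uint16_t sid, uint64_t giova_page, uint64_t hpa_page) {
        Row& row = sets_[row_of(sid)];
        if (row.ptag != sid) return;                         // PTag mismatch: do not cache
        Way& victim = row.ways[row.next_victim++ % row.ways.size()];
        victim = {true, giova_page, hpa_page};               // evicts only within the same partition
    }

private:
    struct Way { bool valid = false; uint64_t giova_page = 0, hpa_page = 0; };
    struct Row { std::vector<Way> ways; uint16_t ptag = 0xffff; unsigned next_victim = 0; };
    int row_of(uint16_t sid) const { return sid % rows_; }   // assumed SID-to-row mapping
    int rows_;
    std::vector<Row> sets_;
};
```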
## 3.3 Translation Prefetching Scheme
* They introduce a Prefetch Unit (PU) on each HyperTRIO device, which is accessed concurrently with the DevTLB. The Prefetch Unit has two parts: a Prefetch Buffer (PB) and a Source ID predictor (SID-predictor).
* The PB is a small fully-associative cache, which keeps translations from gIOVAs to hPAs and it is shared by all tenants in the system
* The SID-predictor contains a table which directly maps from the currently accessed SID to a predicted SID and a history-length register
* The PU is checked along with the DevTLB to see whether its PB contains a valid translation for the current translation request.
* If it does, the translation is returned and no further requests are generated.
* In the case of a miss in the PB, the SID-predictor is checked for an entry corresponding to the currently accessed SID. If a valid entry is present, the PU sends a prefetch request with the predicted SID to the chipset.
* The chipset contains a gIOVA history reader, which uses the predicted SID to read that tenant's most recent gIOVAs from main memory; it fetches the two most recently accessed gIOVAs.
* Instead of keeping translations from gIOVA to hPA, it issues translation requests for predicted gIOVAs to an IOMMU. This enables fetching the most recent translations from the memory when previous ones were invalidated, and at the same time gives a chance to cache intermediate translations in the L[1-4]TLBs.
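* A minimal sketch of this prediction path: an SID-predictor table that maps the currently accessed SID to a predicted next SID, plus a chipset-side gIOVA history from which the predicted tenant's two most recently used gIOVAs are re-submitted for translation. The data structures and the training rule are my own assumptions; only the overall flow follows the description above.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical SID predictor: a direct-mapped table from the current SID to the
// SID predicted to be accessed next, trained on the observed SID sequence.
class SidPredictor {
public:
    void observe(uint16_t sid) {
        if (last_sid_) table_[*last_sid_] = sid;   // record "sid followed last_sid_"
        last_sid_ = sid;
    }
    std::optional<uint16_t> predict(uint16_t sid) const {
        auto it = table_.find(sid);
        if (it == table_.end()) return std::nullopt;
        return it->second;
    }
private:
    std::unordered_map<uint16_t, uint16_t> table_;
    std::optional<uint16_t> last_sid_;
};

// Hypothetical chipset-side gIOVA history: per tenant, the most recent gIOVAs.
// On a prefetch request, the two most recent ones are re-submitted to the IOMMU
// so their translations land in the PB / L[1-4]TLBs ahead of use.
struct GiovaHistory {
    std::unordered_map<uint16_t, std::deque<uint64_t>> per_sid;

    void record(uint16_t sid, uint64_t giova) {
        auto& h = per_sid[sid];
        h.push_front(giova);
        if (h.size() > 48) h.pop_back();           // assumed cap; Section 5.4 finds a depth of 48 works well
    }
    std::vector<uint64_t> prefetch_candidates(uint16_t sid) const {
        std::vector<uint64_t> out;
        auto it = per_sid.find(sid);
        if (it == per_sid.end()) return out;
        for (size_t i = 0; i < it->second.size() && i < 2; ++i)  // two MRU gIOVAs per tenant
            out.push_back(it->second[i]);
        return out;
    }
};
```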
# 4. Hyper-tenant Simulator of I/O
* To be able to study I/O devices in a hyper-tenant environment running real-world workloads, we created a Hyper-tenant Simulator of I/O - HyperSIO. It consists of three main parts:
* **Log Collector** : It records all operations performed by an IOMMU while translating addresses for tenants’ devices
* **Trace Constructor** : using multiple collected logs, HyperSIO constructs translation information for every tenant and its sequence of accesses to an IOMMU. Using this information, it generates a Hyper-Trace which is used by the hyper-tenant performance model.
* **Trace-Driven Device-System Performance** : a fully custom trace-driven performance model written in C++. It incorporates detailed interaction between an I/O device, the translation subsystem, and host main memory, using real-world latencies to compute I/O utilization.
## 4.1 HyperSIO : Log Collector
* HyperSIO’s Log Collector is used to record accesses to an IOMMU from independent tenants running real-world workloads
* Inside a Level-1 VM (L1VM), nested Level-2 VMs (L2VMs) represent separate tenants, each of them using one NIC directly assigned to it via PCIe passthrough. As a result, every NIC is allocated to a separate IOMMU group in the L1VM, guaranteeing its isolation from other NICs.

## 4.2 HyperSIO : Trace Constructor
* HyperSIO’s Hyper-Trace Constructor produces a single trace from the logs generated by a Log Collector to model a hyper-tenant system.
* It also supports two options for inter-tenant interleaving.
* The first one is Round-Robin (RR), which is used as an arbitration scheme between queues and is found in a real NIC
* The second one is random (RAND), which represents a scenario for tenants sending separate requests instead of generating a steady data stream.
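* A sketch of how the two interleaving options could merge per-tenant logs into a single hyper-trace; the `LogRecord` fields, the burst parameter (RR1, RR4, ...), and the function names are assumptions about HyperSIO's internals, not its actual code.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical per-tenant log record produced by the Log Collector.
struct LogRecord {
    uint16_t sid;      // tenant's Source ID
    uint64_t giova;    // guest I/O virtual address to translate
};

// Round-Robin interleaving: take `burst` records from each tenant in turn,
// mimicking the queue arbitration found in a real NIC (RR1, RR4, ...).
std::vector<LogRecord> interleave_rr(const std::vector<std::vector<LogRecord>>& logs, size_t burst) {
    std::vector<LogRecord> trace;
    std::vector<size_t> pos(logs.size(), 0);
    bool progress = true;
    while (progress) {
        progress = false;
        for (size_t t = 0; t < logs.size(); ++t)
            for (size_t b = 0; b < burst && pos[t] < logs[t].size(); ++b, ++pos[t]) {
                trace.push_back(logs[t][pos[t]]);
                progress = true;
            }
    }
    return trace;
}

// Random interleaving: pick the next tenant uniformly at random, modelling
// tenants issuing separate requests rather than steady streams.
std::vector<LogRecord> interleave_rand(const std::vector<std::vector<LogRecord>>& logs, uint32_t seed = 1) {
    std::vector<LogRecord> trace;
    std::vector<size_t> pos(logs.size(), 0);
    std::mt19937 rng(seed);
    std::vector<size_t> live;
    for (size_t t = 0; t < logs.size(); ++t)
        if (!logs[t].empty()) live.push_back(t);
    while (!live.empty()) {
        size_t i = std::uniform_int_distribution<size_t>(0, live.size() - 1)(rng);
        size_t t = live[i];
        trace.push_back(logs[t][pos[t]++]);
        if (pos[t] == logs[t].size()) { live[i] = live.back(); live.pop_back(); }
    }
    return trace;
}
```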
## 4.3 HyperSIO : Performance Model
* The HyperSIO Performance Model is a fully custom device-system model written in C++.
* It reads traces generated by the Trace Constructor, which include gIOVA translation requests, Context Cache entries, and page-table entries.
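* A minimal sketch of the kind of accounting a trace-driven model like this performs. The latency constants and the serialized-latency loop are illustrative placeholders, not HyperSIO's actual implementation or the Table IV values.

```cpp
#include <cstdint>
#include <vector>

// Illustrative latency parameters (placeholders, not the paper's Table IV values).
struct Latencies {
    double devtlb_hit_ns = 5;      // translation found on the device
    double pcie_rtt_ns   = 900;    // device <-> chipset round trip
    double ptw_ns        = 400;    // two-dimensional page-table walk in the IOMMU
};

struct Request { bool devtlb_hit; uint32_t bytes; };

// Accumulate the time each translation adds to a packet and derive the achieved
// bandwidth; a real model would also track IOTLB/L[1-4]TLB hits, the PTB, and
// overlap between in-flight translations.
double achieved_gbps(const std::vector<Request>& trace, const Latencies& lat) {
    double total_ns = 0, total_bits = 0;
    for (const auto& r : trace) {
        total_ns   += r.devtlb_hit ? lat.devtlb_hit_ns
                                   : lat.pcie_rtt_ns + lat.ptw_ns;
        total_bits += 8.0 * r.bytes;
    }
    return total_ns > 0 ? total_bits / total_ns : 0;   // bits per ns, i.e. Gb/s
}
```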
## 4.4 Single- and Multi-Tenant Characterization
* **single-tenant characterization** :
* First, page frames can be grouped based on their access frequency. When caching translations in the IOTLB, this fact can be used to decide which translation to evict in the case of a conflict (Figure 8a)
* The second observation is that each 2MB page used for data buffers is accessed many times sequentially, exhibiting high temporal locality. Periodic access to those pages can be used by a prefetcher to load the next page. It also shows that switching to a new page frame happens less frequently than accesses to the same page. (Figure 8b).

* **multi-tenant characterization** :
* Figure 9 presents performance-simulation results indicating that the maximum achievable aggregate I/O bandwidth depends on the number of connections in the same way as in the motivational study shown in Figure 5.
* Since the DevTLB is a shared resource, it becomes a bottleneck when utilized by a large number of tenants
* For an 8-way set-associative DevTLB, more than four concurrent connections start evicting entries of other tenants, which eventually leads to thrashing and significantly increases translation time for every request.
* Finally, the system becomes limited by the performance of the gIOVA translation subsystem, which involves traversing the PCIe bus, doing a two-dimensional page-table walk, and accessing DRAM

# 5. Evaluation
## 5.1 Benchmarks
* We used three I/O intensive benchmarks for evaluation listed in Table III.

* **iperf3** : a throughput-oriented benchmark stressing the network stack. It generates a steady stream of packets with a maximum size specified as a parameter.
* **mediastream** : serves videos of different lengths and qualities to a client.
* **websearch** : a client sends requests to multiple server index nodes
## 5.2 HyperTRIO Scalability
* Figure 10 shows maximum achievable link bandwidth of HyperTRIO compared to a Base architecture using parameters listed in Table IV

* For the Base configuration, the maximum achievable I/O bandwidth does not scale with an increasing number of tenants, independent of their interleaving.
* In contrast, the HyperTRIO architecture enables the use of up to 100% of the total link bandwidth in an environment with 1024 tenants
* The Prefetching Scheme captures inter- and intra-tenant information, supplying a valid translation from the Prefetch Buffer for 45% of requests for the websearch benchmark in the 1024-tenant setup.
* The Pending Translation Buffer hides misses in the DevTLB and Prefetch Buffer by keeping track of in-flight translations, masking the latency of the two-dimensional page table walk.
## 5.3 Base Configuration Study
* The authors study whether changing some parameters of the Base configuration can significantly affect utilization of the I/O link.
* **Scaling DevTLB Size** : (Figure 11a)
* A 1024 entry DevTLB enables reaching higher bandwidth for up to 64 tenants
* However, it depends on the tenants' ordering. For example, with 16 tenants, a 64-entry DevTLB under RR4 interleaving utilizes the I/O link 3 times more efficiently than a 1024-entry DevTLB under RR1 interleaving for the websearch benchmark.
* When the number of tenants exceeds 128, configurations with both sizes provide the same link utilization for RR1 and RAND1
* Overall, in a hyper-tenant setup where many tenants use the same IOVAs, simply increasing the DevTLB size does not improve utilization of the available bandwidth due to conflicts in the frequently used sets.
* **Studying DevTLB Replacement Policies** : (Figure 11b)
* For a small number of tenants, all translations fit into DevTLB without conflicts, therefore allowing the system to fully utilize available I/O bandwidth
* With increasing number of tenants, total bandwidth starts to decrease, and for more than 64 of them the translation cache becomes completely thrashed by requests from different tenants, making the translation subsystem a bottleneck
* **Scaling with Fully Associative DevTLB** : (Figure 11c)
* This experiment studies how I/O link utilization scales with a fully associative DevTLB as the number of tenants increases.
* As can be seen from Figure 11c, in all benchmarks, using more than eight tenants produces low bandwidth utilization
* Though a request's total processing time depends on hits/misses in the L[1-4]TLBs, for high-bandwidth I/O devices even the PCIe latency alone severely affects total throughput.

## 5.4 HyperTRIO Evaluation
* **Scaling of the Partitioning Scheme** : (Figure 12a)
* We set partition size to one 8-entry row per tenant, even though mediastream and websearch benchmarks have larger active translation sets per tenant. This decision is made in favor of having more isolated partitions for large numbers of tenants
* Link utilization stays high until multiple devices start using the same partition
* This limits the maximum number of cached translations available to each tenant.
* The benefit of the scheme comes from isolation and independent management of tenants, allowing translations to evict entries which belong only to the same partition
* Overall, partitioning improves utilization more than simply increasing the associativity or changing the replacement policy of the DevTLB, but it still does not solve the scalability challenge in hyper-tenant environments.
* **Pending Translation Buffer** : (Figure 12b)
* The PTB allows HyperTRIO to hide translation latency in the case of a miss in the DevTLB
* With eight entries it enables reaching full I/O bandwidth for a partitioned design with up to 16 tenants (compare with Figure 12a)
* Further size increase of the PTB to 32 entries allows us to achieve an aggregated 136Gb/s for 1024 tenants for all benchmarks
* However, it becomes expensive from a hardware point of view and does not scale to larger I/O bandwidths.
* **Translation Prefetching Scheme** : (Figure 12c)
* PB size : we found that eight entries are a good trade-off between precision and hardware resources used for the buffer
* history length : We found that for our simulated system a history depth of 48 requests is optimal across different numbers of tenants.
* number of prefetched translations per tenant : In order to keep translations for several tenants in a small PB, we prefetch the two most recently used translations per tenant.
* For hyper-tenant configurations, it improves link utilization by up to 30% for the websearch benchmark. It also scales better than simply increasing the PTB size, since the prefetch buffer and history length can stay the same for a larger number of tenants.

# 6. Conclusion
* presented HyperTRIO, an architecture for scalable guest I/O Virtual Address (gIOVA) to host Physical Address (hPA) translation in hyper-tenant environments
* **Partitioned DeviceTLB** :
* tenant isolation
* uniform utilization of hardware resources
* **Pending Translation Buffer** :
* support multiple in-flight translations on a device
* hide translation latency in the case of a miss in the DevTLB
* **Prefetching Scheme** :
* uses inter- and intra-tenant information to predict and translate gIOVAs to hPAs
* built the Hyper-Tenant Simulator of I/O (HyperSIO) to analyze the performance of HyperTRIO
* Result :
* **HyperTRIO architecture** : utilize more than 90% of the 200Gb/s link, with up to 1024 independent tenants
* Base configuration : utilize only 6%