
X-Containers: Breaking Down Barriers to Improve Performance and Isolation of Cloud-Native Containers

tags : Containers, X-Containers, Cloud-Native, Library OS, exokernel
paper origin : ACM ISBN


1. Introduction

  • single concern principle :

    • each container should have a single responsibility and handle that responsibility well.
    • By focusing on a single concern, cloud-native containers are easier to scale horizontally, and replace, reuse, and upgrade transparently
  • problems :

    • when running multiple containers on the same host, if one container is compromised, all containers on the same Operating System (OS) kernel are put under risk
    • There have been several proposals to address the issue of container isolation. Hypervisor-based container runtimes , such as Clear Containers, Kata Containers, and Hyper Containers, wrap containers with a dedicated OS kernel running in a virtual machine (VM).
    • These platforms require hardware-assisted virtualization support to reduce the overhead of adding another layer of indirection
    • However, many public and private clouds, including Amazon EC2, do not support nested hardware virtualization. Even in clouds like Google Compute Engine where nested hardware virtualization is enabled, its performance overhead is high
    • Designing a container architecture inspired by the exokernel+LibOS model can improve both container isolation and performance. However, existing LibOSes, such as MirageOS, Graphene, and OSv, lack features like full binary compatibility or multi-processing support. This makes porting containerized applications very challenging
  • paper work :

    • propose a new LibOS platform called X-Containers that improves container isolation without requiring hardware virtualization support
    • present X-Containers, a new exokernel-inspired container architecture that is designed specifically for single-concerned cloud-native applications
    • demonstrate how the Xen paravirtualization architecture and the Linux kernel can be turned into a secure and efficient LibOS platform that supports both binary compatibility and multi-processing
    • present a technology for automatically changing system calls into function calls to optimize applications running on a LibOS
    • evaluate the efficacy of X-Containers against Docker, gVisor, Clear Container, and other LibOSes (Unikernel and Graphene), and demonstrate competitive or superior performance

2. X-Container as a New Security Paradigm

2.1 Single-Concerned Containers

  • Cloud-native applications are designed to fully exploit the potential of cloud infrastructures. Although legacy applications can be packaged in containers and run in a cloud, these applications cannot take full advantage of the automated deployment, scaling, and orchestration offered by systems like Kubernetes, which are designed for single-concerned containers
  • The shift to single-concerned containers is already apparent in many popular container clouds, such as :
    • Amazon Elastic Container Service (ECS)
    • Google Container Engine
  • both of which propose different mechanisms for grouping containers that need to be tightly coupled
    • using a “task” in Amazon ECS
    • using a “pod” in Google Kubernetes
  • single concern containers are not necessarily single-process

2.2 Rethinking the Isolation Boundary

  • Modern OSes that support multiple users and processes provide various types of isolation

    • Kernel Isolation : ensures that a process cannot compromise the integrity of the kernel nor read confidential information that is kept in the kernel
      • cost of kernel isolation can be significant
      • system call handlers are forced to perform various security checks
      • Modern monolithic OS kernels like Linux have become a large code base with complicated services, device drivers, and system call interfaces, resulting in a mounting number of newly discovered security vulnerabilities
    • Process Isolation : ensures that one process cannot easily access or compromise another
      • processes are not intended solely for security isolation. They are often used for resource sharing and concurrency, and to support this, modern OSes provide interfaces that transcend isolation, including shared memory, shared file systems, signaling, user groups, and debugging hooks. These mechanisms expose a large attack surface, which creates many vulnerabilities for applications that rely on processes for security isolation
  • In this study, the authors revisit the question of what functionality belongs in the kernel and where the security boundaries should be drawn

    • An exokernel architecture is essential for ensuring a minimal kernel attack surface while providing good performance
    • processes are useful for resource management and concurrency, but security isolation could be decoupled from the process model
  • In this paper, we propose the X-Container as a new paradigm for isolating single-concerned containers

    • each single-concerned container runs with its own LibOS called X-LibOS
    • can have multiple processes—for resource management and concurrency, but not isolation
    • Inter-container isolation is guarded by the X-Kernel, an exokernel designed to keep the kernel attack surface small
  • The X-Containers architecture is different from existing container architectures, as shown in Figure 1

    • gVisor has a user-space kernel isolated in its own address space
    • Clear Container and LightVM run each container in its own virtual machine, but they don't reduce the cost of kernel and process isolation
      Figure 1

2.3 Threat Model and Design Trade-offs

  • external threat may attempt to break through the isolation barrier of a container

    • standard container : this isolation barrier is provided by the underlying general-purpose OS kernel, which has a large TCB and, due to the large number of system calls, a large attack surface
    • X-Container : rely on a small X-Kernel that is specifically dedicated to providing isolation. The X-Kernel has a small TCB and a small number of system calls that lead to a smaller number of vulnerabilities in practice
  • The X-Containers model involves multiple trade-offs :

    • While many cloud-native applications are compatible with X-Container’s security model, there exist some applications that still rely on strong process and kernel isolation, and some widely used security and fault-tolerance mechanisms are no longer working, for example:
      • A regular OpenSSH server : isolates different users in their own shell processes, but processes inside an X-Container are not a security boundary. Running these applications directly in X-Containers cannot provide the same security guarantees.
      • Applications using processes for fault tolerance expect that a crashed process does not affect others. However, in X-Containers, a crashed process might compromise the X-LibOS and the whole application
      • Kernel-supported security features such as seccomp, file permissions, and network filter are no longer effective in securing a particular thread or process within the container

3. X-Container Design

3.1 Challenges of Running Containers with LibOSes

  • Binary Compatibility :

    • A LibOS without binary level compatibility can make the porting of many containers infeasible
    • incompatibility causes the loss of opportunities to leverage existing, mature development and deployment infrastructures that have been optimized and tested for years
  • Concurrent multi-processing :

    • concurrent multi-processing refers to the capability of running multiple processes in different address spaces concurrently
    • Without concurrent multi-processing, the performance of many applications would be dramatically impacted due to the reduced concurrency

3.2 Why Use Linux as the X-LibOS

  • We believe that the best way to develop an X-LibOS that is fully compatible with Linux is to leverage Linux itself for the primitives needed in the LibOS.
  • Turning the Linux kernel into a LibOS and dedicating it to a single application can unlock its full potential.

3.3 Why Use Xen as the X-Kernel

There are five reasons that make Xen ideal for implementing an X-Kernel with binary compatibility and concurrent multi-processing :

  • Compared to Linux, Xen is a much smaller kernel with simpler interfaces

    • Table 1 shows that the Linux kernel has an order of magnitude more vulnerabilities than Xen, roughly consistent with the ratio in code size and number of system calls
      Table 1

      (CVE : Common Vulnerabilities and Exposures)
  • Xen provides a clean separation of functions in kernel mode (Xen) and user mode (Linux)

  • Xen supports portability of guest kernels

  • Multi-processing support is implemented in guest kernels

  • There is a mature ecosystem around Xen infrastructures

3.4 Limitations and Open Questions

  • In this paper, we focus on addressing some of the key challenges of turning the Xen PV architecture into an efficient X-Containers platform
  • As shown in Table 2, there are some challenges remaining for future work, as discussed below :
    • Memory Management :
      • In the Xen PV architecture, the separation of policy and mechanism in page table management greatly improves compatibility and avoids the overhead of shadow page tables [2]. However, it still incurs overhead for page table operations
      • the memory footprint of an X-Container is larger than a Docker container due to the requirement of running an X-LibOS
      • Dynamic memory allocation and over-subscription face the problems of determining the correct memory requirement and of efficiently changing the memory allocation with ballooning
    • Spawning speed of new instances :
      • X-Containers require extra time for bootstrapping the whole software stack including the X-LibOS
      • This overhead mainly comes from Xen’s “xl” toolstack for creating new VMs
    • GPL license contamination :
      • The Linux kernel uses the GNU General Public License (GPL), which requires that software using GPL-licensed modules must carry a license no less restrictive

4. Implementation

4.1 Background : Xen Paravirtualization

  • Xen’s PV interface is supported by the mainline Linux kernel and was one of the most efficient virtualization technologies on x86-32 platforms
    • Due to the elimination of segment protection in x86-64 long mode, we can only run the guest kernel and user processes in user mode
    • To protect the guest kernel from user processes, the guest kernel needs to be isolated in another address space
    • Each system call needs to be forwarded by the Xen hypervisor as a virtual exception, and incurs a page table switch and a TLB flush
    • This causes significant overheads
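
A rough sense of what "significant" means here can be obtained by timing a cheap system call in a tight loop. The snippet below is a minimal, hypothetical micro-benchmark (not from the paper); on a Xen PV guest in x86-64 long mode, each iteration additionally pays for the hypervisor forwarding, the page table switch, and the TLB flush described above.

```c
/* Hypothetical micro-benchmark (not from the paper): measures the average
 * round-trip time of a cheap system call by issuing it many times. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    const long iters = 1000000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getpid);   /* issue the system call directly */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("average syscall round trip: %.1f ns\n", ns / iters);
    return 0;
}
```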

4.2 Eliminating Kernel Isolation

  • We modified the application binary interface (ABI) of the Xen PV architecture so that it no longer provides isolation between the guest kernel (i.e., the X-LibOS) and user processes
  • X-LibOS is mapped into user processes’ address space with the same page table privilege level and segment selectors
    • kernel access no longer incurs a switch between (guest) user mode and (guest) kernel mode, and system calls can be performed with function calls
  • This leads to a complication: Xen needs to know whether the CPU is in guest user mode or guest kernel mode for correct syscall forwarding and interrupt delivery
    • The X-Kernel determines whether the CPU is executing guest kernel or user process code by checking the current stack pointer: the most significant bit indicates whether it is in guest kernel mode or guest user mode (a minimal sketch of this check follows this list)
  • In the Xen PV architecture, interrupts are delivered as asynchronous events
    • There is a variable shared by Xen and the guest kernel that indicates whether there is any event pending
  • To return from an interrupt handler, one typically uses the iret instruction to reset code and stack segments, flags, the stack and instruction pointers, while also atomically enabling interrupts
    • When returning to a place running on the kernel-mode stack, the X-LibOS pushes the return address on the destination stack and switches the stack pointer before enabling interrupts, so preemption is safe. The code then jumps to the return address using a lightweight ret instruction
    • When returning to the user mode stack, the user mode stack pointer might not be valid, so X-LibOS saves register values in the kernel stack for system call handling, enables interrupts, and then executes the more expensive iret instruction.
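
A minimal sketch of the stack-pointer check referenced above (hypothetical helper name, not the paper's code). It assumes, as described in the list, that kernel-context stacks live in the upper half of the virtual address space, so the most significant bit of the saved stack pointer distinguishes guest kernel mode from guest user mode.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: classify the interrupted context.  Kernel-context
 * stacks are assumed to sit in the upper half of the address space, so
 * bit 63 of the saved stack pointer tells the two modes apart. */
static inline bool in_guest_kernel_mode(uint64_t saved_rsp)
{
    return (saved_rsp >> 63) & 1;   /* MSB set => guest kernel stack */
}
```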

4.3 Concurrent Multi-Processing Support

  • X-Containers inherit support for concurrent multiprocessing from the Xen PV architecture
  • Since the kernel code is no longer protected, kernel routines would not need a dedicated stack if the X-LibOS only supported a single process
    • However, since the X-LibOS supports multiple forked processes with overlapping address spaces, using the user-mode stack for kernel routines causes problems after a context switch
    • Therefore, we still use dedicated kernel stacks in the kernel context, and when performing a system call, a switch from user stack to kernel stack is necessary

4.4 Automatic Lightweight System Call

  • Because the X-LibOS and the process run at the same privilege level, it is more efficient to invoke system call handlers using function call instructions

    • X-LibOS stores a system call entry table in the vsyscall page, which is mapped at a fixed virtual address
    • Updating X-LibOS therefore does not affect the location of the system call entry table (see the sketch after Figure 2)
  • To avoid re-writing or re-compiling the application, we implemented an online Automatic Binary Optimization Module (ABOM) in the X-Kernel

    • ABOM automatically replaces syscall instructions with function calls on the fly when receiving a syscall request from a user process, avoiding a scan of the entire binary file
    • Since each cmpxchg instruction can handle at most eight bytes, patches that modify more than eight bytes must ensure that any intermediate state of the binary remains valid, for multicore concurrency safety
  • Figure 2 illustrates three patterns of binary code that ABOM recognizes

    Figure 2
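
The sketch below (hypothetical types and names, not the paper's code) illustrates the mechanism in this section: a patched call site looks up a handler in a system call entry table, conceptually kept at a fixed location such as the vsyscall page, and invokes it as a plain function call; the 8-byte compare-and-swap mirrors ABOM's atomic-patching constraint.

```c
#include <stdint.h>

/* Hypothetical signature of an in-LibOS system call handler. */
typedef long (*syscall_fn)(long a1, long a2, long a3,
                           long a4, long a5, long a6);

/* In X-Containers this table would sit at a fixed location (the vsyscall
 * page) so that updating the X-LibOS never moves it; a static array is
 * used here purely for illustration. */
#define NR_SYSCALLS 512
static syscall_fn syscall_entry_table[NR_SYSCALLS];

/* What a patched call site reduces to: an indirect function call into the
 * X-LibOS, with no mode switch, page table switch, or TLB flush. */
static inline long lightweight_syscall(long nr, long a1, long a2, long a3,
                                       long a4, long a5, long a6)
{
    return syscall_entry_table[nr](a1, a2, a3, a4, a5, a6);
}

/* ABOM's patching constraint, sketched: a single cmpxchg can atomically
 * rewrite at most eight bytes, so every rewrite must leave the instruction
 * stream valid for other cores that may execute it concurrently. */
static int patch_8_bytes(uint64_t *site, uint64_t expected, uint64_t patched)
{
    return __atomic_compare_exchange_n(site, &expected, patched,
                                       0 /* strong */,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
```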

4.5 Lightweight Bootstrapping of Docker Images

  • When creating a new X-Container, a special bootloader is loaded with an X-LibOS, which initializes virtual devices, configures IP addresses, and spawns processes of the container directly
  • We connect our X-Container architecture to the Docker platform with a Docker Wrapper. An unmodified Docker engine running in the Host X-Container is used to pull and build Docker images

5. Evaluation

In this section, we address the following questions :

  • How effective is the Automatic Binary Optimization Module (ABOM)?
  • What is the performance overhead of X-Containers, and how does it compare to Docker and other container runtimes in the cloud?
  • How does the performance of X-Containers compare to other LibOS designs?
  • How does the scalability of X-Containers compare to Docker Containers and VMs?

5.1 Experiment Setup

  • conducted experiments on VMs in both Amazon Elastic Compute Cloud (EC2) and Google Compute Engine (GCE)

  • EC2 :

    • we used c4.2xlarge instances in the North Virginia region (4 CPU cores, 8 threads, 15GB memory, and 2×100GB SSD storage)
    • To make the comparison fair and reproducible, we ran the VMs with different configurations on a dedicated host.
  • Google GCE :

    • we used a customized instance type in the South Carolina region (4 CPU cores, 8 threads, 16GB memory, and 3×100GB SSD storage)
    • Google does not support dedicated hosts, so we attached multiple boot disks to a single VM, and rebooted it with different configurations

5.2 Automatic Binary Optimization

  • we added a counter in the X-Kernel to calculate how many system calls were forwarded to X-LibOS.
  • Table 3 shows the applications we tested and the reduction in system call invocations that ABOM achieved
    • For all but one application we tested, ABOM turns more than 92% of system calls into function calls
    • The exception is MySQL, which uses cancellable system calls implemented in the libpthread library that are not recognized by ABOM
    • However, using our offline patching tool, two locations in the libpthread library can be patched, reducing system call invocations by 92.2%

Table 3

5.3 Macrobenchmarks

  • We evaluated the performance of X-Containers with four macrobenchmarks :

    • NGINX
    • Memcached
    • Redis
    • Apache httpd
  • Figure 3 shows the relative performance of the macrobenchmarks normalized to native Docker (patched) :

    • gVisor performance suffers significantly from the overhead of using ptrace for intercepting system calls
    • Clear Containers and gVisor in Google suffer a significant performance penalty for using nested hardware virtualization
    • X-Containers improve Memcached throughput by 134% to 208% compared to native Docker
      • 307K Ops/sec in Amazon
      • 314K Ops/sec in Google
    • For NGINX, X-Containers achieve 21% to 50% throughput improvement over Docker
      • 32K Req/sec in Amazon
      • 40K Req/sec in Google
    • For Redis, the performance of X-Containers is comparable to Docker
      • 68K Ops/sec in Amazon
      • 72K Ops/sec in Google
    • For Apache, X-Containers incur 28% to 45% performance overhead
      • 11K Req/sec in Amazon
      • 12K Req/sec in Google

Figure 3

5.4 Microbenchmarks

  • We ran our tests both in Google GCE and Amazon EC2. We ran tests both isolated and concurrently
    • For concurrent tests, we ran 4 copies of the benchmark simultaneously
  • Figure 4 shows the relative system call throughput normalized to Docker. X-Containers dramatically improve system call throughput
    • The throughput of gVisor is only 7% to 55% of Docker's, due to the high overhead of ptrace and nested virtualization, so it can barely be seen in the figure
    • Clear Containers achieve better system call throughput than Docker because the guest kernel is optimized by disabling most security features within a Clear container

Figure 4

  • Figure 5 shows the relative performance for other microbenchmarks, also normalized to patched Docker
    • Similar to the system call throughput benchmark, the Meltdown patch does not affect X-Containers and Clear Containers
    • In contrast, patched Docker containers and Xen-Containers suffer significant performance penalties. X-Containers have noticeable overheads in process creation and context switching

Figure 5

5.5 Unikernel and Graphene

  • We also compared X-Containers to Graphene and Unikernel
  • Figure 6a compares throughput of the NGINX webserver serving static webpages with a single worker process
    • X-Containers achieve throughput comparable to Unikernel, and over twice that of Graphene
  • For Figure 6b, we ran 4 worker processes of a single NGINX webserver
    • X-Containers outperform Graphene by more than 50%, since in Graphene, processes use IPC calls to coordinate access to a shared POSIX library, which incurs high overheads
  • For Figure 6c we evaluated the scenario where two PHP CGI servers were connected to MySQL databases
    • As illustrated in Figure 7, the PHP servers can have either shared or dedicated databases, so there are three possible configurations depending on the threat model
    • X-Containers outperform Unikernel by over 40%

Figure 6
Figure 7

5.6 Scalability

  • We evaluated scalability of the X-Containers architecture by running up to 400 containers on one physical machine
  • Figure 8 shows the aggregated throughput of all containers or VMs
    • Docker achieves higher throughput for small numbers of containers, since context switching between Docker containers is cheaper than between X-Containers or between Xen VMs
    • However, as the number of containers increases, the performance of Docker drops faster
    • with N containers, the Linux kernel running Docker containers is scheduling 4N processes, while X-Kernel is scheduling N vCPUs, each running 4 processes
      • With N = 400, X-Containers outperform Docker by 18%.

Figure 8

5.7 Spawning Time and Memory Footprint

  • We evaluated the overhead of X-Containers in spawning time and memory footprint, comparing them to the same version of the Docker engine as used for X-Containers, but running on an unmodified Linux kernel
  • Figure 9a shows the detailed breakdown of the time spent on different phases when spawning a new container
    • Docker takes 558ms to finish spawning
    • X-Containers take 277ms to boot a new X-LibOS, plus another 287ms to spawn the user program
  • Figure 9b shows the breakdown of memory usage
    • the docker stats command reports 3.56MB extra memory consumption (counted from cgroups) used for page cache and container file system
    • For X-Containers, the “Extra” 11.16MB memory includes page cache for the whole system and all other user processes

Figure 9

7. Conclusion

  • Proposes the X-Container as a new security paradigm for isolating single-concerned cloud-native containers
    • minimal exokernels can securely isolate mutually untrusting containers
    • LibOSes allow for customization and performance optimization
  • X-Containers introduce new trade-offs in container design:
    • intra-container isolation is significantly reduced in exchange for improved performance and stronger inter-container isolation
  • We show that X-Containers offer significant performance improvements in cloud environments, and discuss the advantages and limitations of the current design, including those pertaining to running unmodified applications in X-Containers.