paper origin : ACM ISBN
papers : link
1. Introduction
2. X-Container as a New Security Paradigm
2.1 Single-Concerned Containers
- Cloud-native applications are designed to fully exploit the potential of cloud infrastructures. Although legacy applications can be packaged in containers and run in a cloud, these applications cannot take full advantage of the automated deployment, scaling, and orchestration offered by systems like Kubernetes, which are designed for single-concerned containers
- The shift to single-concerned containers is already apparent in many popular container clouds, such as :
- Amazon Elastic Container Service (ECS)
- Google Container Engine
- both of which provide their own mechanism for grouping containers that need to be tightly coupled :
- using a “task” in Amazon ECS
- using a “pod” in Google Kubernetes
- single-concerned containers are not necessarily single-process
2.2 Rethinking the Isolation Boundary
- Modern OSes support multiple users and processes, and provide several types of isolation :
- Kernel Isolation : ensures that a process cannot compromise the integrity of the kernel nor read confidential information that is kept in the kernel
- cost of kernel isolation can be significant
- system call handlers are forced to perform various security checks
- Modern monolithic OS kernels like Linux have become a large code base with complicated services, device drivers, and system call interfaces, resulting in a mounting number of newly discovered security vulnerabilities
- Process Isolation : ensures that one process cannot easily access or compromise another
- processes are not intended solely for security isolation. They are often used for resource sharing and concurrency support, and to support this, modern OSes provide interfaces that transcend isolation, including shared memory, shared file systems, signaling, user groups, and debugging hooks. These mechanisms expose a large attack surface, which causes many vulnerabilities for applications that rely on processes for security isolation
- In this study, the authors revisit the question of what functionality belongs in the kernel, and where the security boundaries should be built
- An exokernel architecture is essential for ensuring a minimal kernel attack surface while providing good performance
- processes are useful for resource management and concurrency, but security isolation could be decoupled from the process model
- In this paper, we propose the X-Container as a new paradigm for isolating single-concerned containers :
- each single-concerned container runs with its own LibOS called X-LibOS
- can have multiple processes—for resource management and concurrency, but not isolation
- Inter-container isolation is guarded by the X-Kernel, an exokernel that ensures a small kernel attack surface
- The X-Containers architecture is different from existing container architectures, as shown in Figure 1 :
- gVisor has a user-space kernel isolated in its own address space
- Clear Containers and LightVM run each container in its own virtual machine, but they do not reduce the cost of kernel and process isolation
(Figure 1 : comparison of the X-Containers architecture with existing container architectures; image not available)
2.3 Threat Model and Design Trade-offs
3. X-Container Design
3.1 Challenges of Running Containers with LibOSes
3.2 Why Use Linux as the X-LibOS
- We believe that the best way to develop an X-LibOS that is fully compatible with Linux is to leverage Linux itself for the primitives needed in the LibOS.
- Turning the Linux kernel into a LibOS and dedicating it to a single application can unlock its full potential.
3.3 Why Use Xen as the X-Kernel
There are five reasons that make Xen ideal for implementing an X-Kernel with binary compatibility and concurrent multi-processing :
- Compared to Linux, Xen is a much smaller kernel with simpler interfaces
- Table 1 shows that the Linux kernel has an order of magnitude more vulnerabilities than Xen, roughly consistent with the ratio in code size and number of system calls
(Table 1 : reported CVEs, code size, and interface size for Linux vs. Xen; image not available)
(CVE : Common Vulnerabilities and Exposures)
- Xen provides a clean separation of functions in kernel mode (Xen) and user mode (Linux)
- Xen supports portability of guest kernels
- Multi-processing support is implemented in guest kernels
- There is a mature ecosystem around Xen infrastructures
3.4 Limitations and Open Questions
- In this paper, we focus on addressing some of the key challenges of turning the Xen PV architecture into an efficient X-Containers platform
- As shown in Table 2, there are some challenges remaining for future work, as discussed below :
- Memory Management :
- In the Xen PV architecture, the separation of policy and mechanism in page table management greatly improves compatibility and avoids the overhead of shadow page tables [2]. However, it still incurs overheads for page table operations
- the memory footprint of an X-Container is larger than a Docker container due to the requirement of running an X-LibOS
- Dynamic memory allocation and over-subscription face the problems of determining the correct memory requirement and of efficiently changing memory allocations through ballooning
- Spawning speed of new instances :
- X-Containers require extra time for bootstrapping the whole software stack including the X-LibOS
- This overhead mainly comes from Xen’s “xl” toolstack for creating new VMs
- GPL license contamination :
- The Linux kernel uses the GNU General Public License (GPL), which requires that software using GPL-licensed modules carry a license no less restrictive
4. Implementation
4.1 Background : Xen Paravirtualization
- Xen’s PV interface is supported by the mainline Linux kernel and was one of the most efficient virtualization technologies on x86-32 platforms
- Because x86-64 long mode eliminates segment protection, the guest kernel and user processes can only run in user mode
- To protect the guest kernel from user processes, the guest kernel needs to be isolated in another address space
- Each system call needs to be forwarded by the Xen hypervisor as a virtual exception, and incurs a page table switch and a TLB flush
- This causes significant overheads (a minimal user-space sketch for measuring per-syscall cost is shown below)
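
The following is a minimal, hedged sketch (illustrative, not from the paper) of how this per-system-call overhead can be observed from user space: it times a loop of near-trivial `getppid` system calls, whose latency is dominated by the user/kernel crossing that Xen PV forwarding makes more expensive. The iteration count and the use of `syscall(SYS_getppid)` are illustrative choices.

```c
// Minimal sketch (illustrative, not from the paper): timing a near-trivial
// system call to observe per-syscall overhead.
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const long iters = 1000000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++) {
        // getppid does almost no work, so its latency is dominated by the
        // user/kernel crossing (plus forwarding and TLB flushes under Xen PV)
        syscall(SYS_getppid);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("average syscall latency: %.1f ns\n", ns / iters);
    return 0;
}
```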
4.2 Eliminating Kernel Isolation
- We modified the application binary interface (ABI) of the Xen PV architecture so that it no longer provides isolation between the guest kernel (i.e., the X-LibOS) and user processes
- X-LibOS is mapped into user processes’ address space with the same page table privilege level and segment selectors
- kernel access no longer incurs a switch between (guest) user mode and (guest) kernel mode, and system calls can be performed with function calls
- This leads to a complication: Xen needs to know whether the CPU is in guest user mode or guest kernel mode for correct syscall forwarding and interrupt delivery
- The X-Kernel determines whether the CPU is executing kernel or user-process code by checking the location of the current stack pointer (the most significant bit of the stack pointer indicates guest kernel mode vs. guest user mode); see the sketch after this list
- In the Xen PV architecture, interrupts are delivered as asynchronous events
- There is a variable shared by Xen and the guest kernel that indicates whether there is any event pending
- To return from an interrupt handler, one typically uses the iret instruction to reset code and stack segments, flags, the stack and instruction pointers, while also atomically enabling interrupts
- When returning to a location running on the kernel-mode stack, the X-LibOS pushes the return address onto the destination stack and switches the stack pointer before enabling interrupts, so preemption is safe. The code then jumps to the return address using a lightweight ret instruction
- When returning to the user mode stack, the user mode stack pointer might not be valid, so X-LibOS saves register values in the kernel stack for system call handling, enables interrupts, and then executes the more expensive iret instruction.
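
A hedged sketch of these two mechanisms follows. The function names are hypothetical, and the real logic lives in the X-Kernel and X-LibOS entry/exit paths (largely in assembly); this only illustrates the decision structure, assuming kernel stacks occupy the upper half of the 64-bit address space.

```c
// Hedged sketch (hypothetical names, not actual X-Kernel/X-LibOS code).
// Assumption: kernel stacks live in the upper half of the 64-bit address
// space, so the most significant bit of the saved stack pointer tells the
// X-Kernel whether the CPU was in guest kernel or guest user mode.
#include <stdint.h>
#include <stdbool.h>

static inline bool in_guest_kernel_mode(uint64_t saved_rsp) {
    return (saved_rsp >> 63) != 0;   // MSB set => stack is in the kernel half
}

// Conceptual return path after handling a virtual event:
// only the kernel-stack case can use the cheap ret-based path.
void return_from_event(uint64_t saved_rsp) {
    if (in_guest_kernel_mode(saved_rsp)) {
        // push the return address onto the destination (kernel) stack,
        // switch the stack pointer, re-enable events, then a cheap `ret`
    } else {
        // the user stack may be invalid: keep saved registers on the
        // kernel stack, re-enable events, then the more expensive `iret`
    }
}
```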
4.3 Concurrent Multi-Processing Support
- X-Containers inherit support for concurrent multiprocessing from the Xen PV architecture
- Since the kernel code is no longer protected, kernel routines would not need a dedicated stack if the X-LibOS supported only a single process
- However, since the X-LibOS supports multiple forked processes with overlapping address spaces, using the user-mode stack for kernel routines causes problems after a context switch
- Therefore, we still use dedicated kernel stacks in the kernel context, and when performing a system call, a switch from user stack to kernel stack is necessary
4.4 Automatic Lightweight System Call
- Since the X-LibOS and the process run at the same privilege level, it is more efficient to invoke system call handlers with function call instructions
- X-LibOS stores a system call entry table in the vsyscall page
- Updating X-LibOS will not affect the location of the system call entry table
- To avoid re-writing or re-compiling the application, we implemented an online Automatic Binary Optimization Module (ABOM) in the X-Kernel
- It automatically replaces syscall instructions with function calls on the fly when receiving a syscall request from user processes, avoiding scanning the entire binary file
- Since each cmpxchg instruction can handle at most eight bytes, if we need to modify more than eight bytes, we need to make sure that any intermediate state of the binary is still valid for the sake of multicore concurrency safety
- Figure 2 illustrates three patterns of binary code that ABOM recognizes; a hedged sketch of the patching step for the simplest pattern is shown below
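
The following sketch assumes the simplest pattern: a 5-byte mov of the system call number into %eax immediately followed by the 2-byte syscall instruction. The function name, the entry-table address, and the omission of page-permission handling are illustrative assumptions rather than the actual X-Kernel code; the point is that the 7-byte rewrite fits in one 8-byte compare-and-swap, so concurrent CPUs never observe a half-written instruction.

```c
// Hedged sketch of ABOM-style on-the-fly patching (names, addresses, and
// page-permission handling are illustrative, not the actual X-Kernel code).
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define SYSCALL_ENTRY_TABLE 0xffffffffff600800ULL  /* illustrative: inside the vsyscall page */

static bool try_patch_syscall(uint8_t *rip_after_syscall) {
    uint8_t *patch = rip_after_syscall - 7;   /* start of "mov $nr,%eax; syscall" */
    uint64_t oldv, newv;
    memcpy(&oldv, patch, 8);

    /* pattern: B8 imm32 (mov $nr,%eax) followed by 0F 05 (syscall) */
    if (patch[0] != 0xB8 || patch[5] != 0x0F || patch[6] != 0x05)
        return false;

    uint32_t nr;
    memcpy(&nr, patch + 1, 4);

    /* replacement: FF 14 25 disp32 == "call *disp32", also 7 bytes.
     * disp32 is sign-extended on x86-64, so the table must live at a fixed,
     * sign-extendable address such as the vsyscall page; the 8th byte of the
     * window is copied through unchanged. */
    uint8_t repl[8];
    memcpy(repl, patch, 8);
    repl[0] = 0xFF; repl[1] = 0x14; repl[2] = 0x25;
    uint32_t disp = (uint32_t)(SYSCALL_ENTRY_TABLE + 8ULL * nr);
    memcpy(repl + 3, &disp, 4);
    memcpy(&newv, repl, 8);

    /* single 8-byte compare-and-swap (x86 permits the unaligned locked
     * cmpxchg used here): other CPUs only ever see the complete old or the
     * complete new instruction sequence */
    return __sync_bool_compare_and_swap((uint64_t *)(void *)patch, oldv, newv);
}
```

Encoding the replacement as an indirect call through a fixed, sign-extendable 32-bit address is what allows the entry table to live in the vsyscall page and survive X-LibOS updates.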

4.5 Lightweight Bootstrapping of Docker Images
- When creating a new X-Container, a special bootloader is loaded with an X-LibOS, which initializes virtual devices, configures IP addresses, and spawns processes of the container directly
- We connect our X-Container architecture to the Docker platform with a Docker Wrapper. An unmodified Docker engine running in the Host X-Container is used to pull and build Docker images
5. Evaluation
In this section, we address the following questions :
- How effective is the Automatic Binary Optimization Module (ABOM)?
- What is the performance overhead of X-Containers, and how does it compare to Docker and other container runtimes in the cloud?
- How does the performance of X-Containers compare to other LibOS designs?
- How does the scalability of X-Containers compare to Docker Containers and VMs?
5.1 Experiment Setup
5.2 Automatic Binary Optimization
- we added a counter in the X-Kernel to calculate how many system calls were forwarded to X-LibOS.
- Table 3 shows the applications we tested and the reduction in system call invocations that ABOM achieved
- For all but one application we tested, ABOM turns more than 92% of system calls into function calls
- The exception is MySQL, which uses cancellable system calls implemented in the libpthread library that are not recognized by ABOM
- However, our offline patching tool can patch two locations in the libpthread library, reducing system call invocations by 92.2%

5.3 Macrobenchmarks

5.4 Microbenchmarks
- We ran our tests in both Google GCE and Amazon EC2, both in isolation and concurrently
- For concurrent tests, we ran 4 copies of the benchmark simultaneously
- Figure 4 shows the relative system call throughput normalized to Docker. X-Containers dramatically improve system call throughput
- The throughput of gVisor is only 7% to 55% of Docker's due to the high overhead of ptrace and nested virtualization, so it can barely be seen in the figure
- Clear Containers achieve better system call throughput than Docker because the guest kernel is optimized by disabling most security features within a Clear container

- Figure 5 shows the relative performance for other microbenchmarks, also normalized to patched Docker
- Similar to the system call throughput benchmark, the Meltdown patch does not affect X-Containers and Clear Containers
- In contrast, patched Docker containers and Xen-Containers suffer significant performance penalties. X-Containers have noticeable overheads in process creation and context switching

5.5 Unikernel and Graphene
- We also compared X-Containers to Graphene and Unikernel
- Figure 6a compares throughput of the NGINX webserver serving static webpages with a single worker process
- X-Containers achieve throughput comparable to Unikernel, and over twice that of Graphene
- For Figure 6b, we ran 4 worker processes of a single NGINX webserver
- X-Containers outperform Graphene by more than 50%, since in Graphene, processes use IPC calls to coordinate access to a shared POSIX library, which incurs high overheads
- For Figure 6c we evaluated the scenario where two PHP CGI servers were connected to MySQL databases
- As illustrated in Figure 7, the PHP servers can have either shared or dedicated databases, so there are three possible configurations depending on the threat model
- X-Containers outperform Unikernel by over 40%


5.6 Scalability
- We evaluated scalability of the X-Containers architecture by running up to 400 containers on one physical machine
- Figure 8 shows the aggregated throughput of all containers or VMs
- Docker achieves higher throughput for small numbers of containers since context switching between Docker containers is cheaper than between X-Containers and between Xen VMs
- However, as the number of containers increases, the performance of Docker drops faster
- with N containers, the Linux kernel running Docker containers is scheduling 4N processes, while X-Kernel is scheduling N vCPUs, each running 4 processes
- With N = 400, X-Containers outperform Docker by 18%.

- We evaluated the overhead of X-Containers in spawning time and memory footprint, comparing against the same version of the Docker engine used for X-Containers, but running on an unmodified Linux kernel
- Figure 9a shows the detailed breakdown of the time spent on different phases when spawning a new container
- Docker takes 558ms to finish spawning
- An X-Container takes 277ms to boot a new X-LibOS, and another 287ms to spawn the user program
- Figure 9b shows the breakdown of memory usage
- the docker stats command reports 3.56MB extra memory consumption (counted from cgroups) used for page cache and container file system
- For X-Containers, the “Extra” 11.16MB memory includes page cache for the whole system and all other user processes

7. Conclusion
- Propose the X-Container as a new security paradigm for isolating single-concerned cloud-native containers
- minimal exokernels can securely isolate mutually untrusting containers
- LibOSes allow for customization and performance optimization
- X-Containers introduce new trade-offs in container design:
- intra-container isolation is significantly reduced in exchange for improved performance and stronger inter-container isolation
- We show that X-Containers offer significant performance improvements in cloud environments, and discuss the advantages and limitations of the current design, including those pertaining to running unmodified applications in X-Containers.