Auraed - Deploying Pods in a MicroVM
====================================
###### tags: `aurae`
## Setting the scene
We have a server (physical or virtual, it doesn't matter right now) that runs *auraed* as pid 1. The goal is to deploy a pod within a MicroVM.
### What is a Pod?
A pod is a collection of one or more containers that share the same network namespace and can reach each other via the loopback device. Every pod is assigned an IP address that makes it reachable in the customer's network.
### What is a MicroVM?
A MicroVM is a very small virtual machine that lacks much of the virtual hardware a standard VM has. For example, a MicroVM does not have a PCI bus, so [SR-IOV](https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-) cannot be used to provide devices to the VM; instead we use virtio-net interfaces for NICs and virtio-blk devices for block storage. MicroVMs can be created using [Qemu](https://qemu.readthedocs.io/en/latest/system/i386/microvm.html) or [firecracker](https://github.com/firecracker-microvm/firecracker). In this document we will focus on firecracker.
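To make this concrete, here is a minimal sketch of how such a MicroVM could be launched with firecracker's config-file mode. The kernel path, rootfs path, init path, and tap device name are hypothetical placeholders, not fixed Aurae conventions.

```bash
# Hypothetical example: boot a MicroVM with firecracker from a config file.
cat > vmconfig.json <<'EOF'
{
  "boot-source": {
    "kernel_image_path": "/var/lib/aurae/vmlinux",
    "boot_args": "console=ttyS0 reboot=k panic=1 init=/auraed"
  },
  "drives": [
    {
      "drive_id": "rootfs",
      "path_on_host": "/var/lib/aurae/rootfs.ext4",
      "is_root_device": true,
      "is_read_only": false
    }
  ],
  "machine-config": {
    "vcpu_count": 1,
    "mem_size_mib": 256
  },
  "network-interfaces": [
    { "iface_id": "eth0", "host_dev_name": "tap0" }
  ]
}
EOF

firecracker --no-api --config-file vmconfig.json
```

Note that the guest sees only virtio devices: the `network-interfaces` entry becomes a virtio-net NIC and the `drives` entry a virtio-blk device.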
Still, a MicroVM is more comparable to a VM than to a container. A MicroVM is a sandbox that uses the processor's virtualization extensions for isolation. Unlike a container, it is not just a process running on our host machine's Kernel. A MicroVM runs its own Kernel. That's why we need a Kernel image as well as an initramfs or a disk image when starting a MicroVM.
## Why auraed within a MicroVM?
When the MicroVM starts up, it loads the Kernel, which brings up the virtual hardware and initializes the system. It then executes the init program, *auraed*.
Again *auraed*? We already run *auraed* as pid 1 on the host machine!
Yes. *Auraed* is a process scheduler that takes care of the processes running on a machine. We should use *auraed* within the MicroVM as well, where it spawns the pod's containers.
## Communication and Networking
The *auraed* instance within the MicroVM and the *auraed* instance on the host machine need to communicate with each other. gRPC is the obvious choice here. As we learned on the Twitch stream on Sep 17th 2022, vsocks between the MicroVM and the host machine offer a rather large attack surface. Instead, we should use Ethernet/IP-based communication: it is used in production everywhere, and there is a lot of focus on the security of Linux's IP stack.
We should provide two NICs to the MicroVM: `eth0` for control plane communication and `eth1` for customer traffic.
Firecracker expects a tap device on the host system for every NIC that should be attached to the MicroVM. On the host system this allows us to simply add an IP address to the network interface via `ip addr add`. We shouldn't have to think about complex IP address management; instead we could use IPv6 link-local addressing, e.g. on the host we would assign `fe80::1/64` to every tap device. In contrast to IPv4 link-local addresses (`169.254.0.0/16`), with IPv6 you always need to suffix the target link-local address with the interface you want to use, because the `fe80` addresses are *really link-local*. A ping from the MicroVM to the host system would look like this: `ping fe80::1%eth0`. Link-local addresses can also be used to establish a gRPC connection. On the host system you can identify the caller via the source address, as it is scoped to the tap interface associated with one specific MicroVM.
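As a sketch, the host-side setup for one MicroVM could look like this (the tap device names are hypothetical; firecracker maps them to the guest's `eth0`/`eth1`):

```bash
# Create the two tap devices firecracker attaches to the MicroVM's NICs
# (one pair per MicroVM; names are placeholders).
ip tuntap add dev vm1-ctl mode tap    # backs eth0, control plane
ip tuntap add dev vm1-cust mode tap   # backs eth1, customer traffic
ip link set vm1-ctl up
ip link set vm1-cust up

# Assign the same IPv6 link-local address to every control-plane tap device;
# the interface (zone) suffix disambiguates which MicroVM we are talking to.
ip addr add fe80::1/64 dev vm1-ctl

# From inside the MicroVM, the host is reachable via the link-local address
# scoped to the guest's control-plane interface:
ping fe80::1%eth0
```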
Now we have communication between the MicroVM's *auraed* and the host's. The host tells the MicroVM to create a pod with n containers. Similar to containerd, *auraed* would create a network namespace and attach all container processes to it. Instead of creating a veth pair and attaching one side to the network namespace, we would move `eth1` into the network namespace. As a result we have a layer 2 connection to the host (via the second tap interface), which can then provide routing, firewalling, or additional services like DNS or, if needed, metadata. Those services are probably better detailed in a separate document.
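Inside the MicroVM, a minimal sketch of what *auraed* would effectively do, shown here as the equivalent `ip` commands (the namespace name is a hypothetical placeholder):

```bash
# Create the pod's network namespace.
ip netns add pod0

# Instead of a veth pair, move the customer-traffic NIC into the namespace.
ip link set eth1 netns pod0

# Bring up loopback (shared by all containers in the pod) and eth1.
ip netns exec pod0 ip link set lo up
ip netns exec pod0 ip link set eth1 up
```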
### Metadata Service
Open question: Do we really need a metadata service?
Our goal is to schedule a pod of containers. In the Kubernetes world containers are configured using ConfigMaps and Secrets, which are provided via environment variables or files on disk. Configuring containerized applications via the metadata service on `http://169.254.169.254` is rather unusual.
Imho a metadata service would be more relevant when scheduling a full VM. In our MicroVM case we use *auraed* as pid 1, which establishes a connection to *auraed* on the host. This connection can be used to exchange configuration information, which renders a metadata service redundant.
If we add the functionality to provide VMs to customers, this looks different: the customer needs some way to configure their VM, and this can be done using a metadata service.