# onmetal - Reclaiming the Datacenter
Let's discuss how to build an infrastructure to run cloud native applications in your own datacenter.
This is about medium to large scale deployments: at least one full rack of servers, better three, or even more, distributed over three different locations (Availability Zones). This also implies that the app developers are different people/teams than the infrastructure operators. The app developers can in turn be split into multiple teams whose apps interact with each other but do not share the same deployments (= multi tenancy). The described architecture takes into account hardware that is available on the market today, in 2022.
Requirements
------------
### Containers & Kubernetes
We want to build an environment that allows app developers to **deploy containerized applications**. App developers ask for Kubernetes clusters as a deployment target. So we would like to offer a **Kubernetes API** in the end that can be consumed by the developers. Still, we want to stay flexible and perhaps later also provide a custom Container as a Service (**CaaS**) or a Kubernetes API with reduced functionality. Further, the environment must be designed in a way that allows workloads from **multiple tenants** to run securely. Containers must be isolated from each other in a secure way. Noisy neighbor effects must be reduced to a minimum.
### Virtual Machines
The world is not perfect. You cannot build every cloud exclusively out of containers. **Sometimes you need a Virtual Machine** with a huge *here be dragons* sticker on it. For this case the infrastructure should be able to spin up Qemu/KVM based virtual machines. These are not MicroVMs but classical VMs - they have a virtual PCI bus, can mount **GPUs, AI accelerators, FPGAs or special NICs** attached to the host system, and **support hot plugging** of devices.
### Storage
You try to reduce state in an application as much as you can. Still, state must be persisted somewhere. The infrastructure needs to provide a persistence layer for databases. Two types of persistence should be provided: **block storage** and **object storage**.
However, a **shared filesystem is a non-target!** In the context of cloud native applications, shared filesystems can be seen as an antipattern. Applications relying on shared filesystems should change their persistence layer to a database or object storage.
Storage must provide **high bandwidth** and **low latency**, while having an **inherent redundancy**.
### Networking
#### Low Latency
Modern web applications follow the microservices paradigm. This results in a lot of traffic between app components (east/west traffic). The latency between different microservices must be kept as small as possible to reduce the overall latency of the application.
#### High Bandwidth
Modern applications do not only exchange small JSON payloads but also photos, audio and video. This results in high bandwidth demands. Modern NICs come with 100Gbps ports or faster. Processing packets at such a high rate requires a lot of CPU cycles, which are then not available to serve customer workload. Therefore network packet handling must be simplified and offloaded from the CPU as much as possible.
#### IPv6 support
[Google Statistics](https://www.google.com/intl/en/ipv6/statistics.html#tab=per-country-ipv6-adoption) show an ever growing adoption of IPv6. As of September 2022, client traffic from the US using IPv6 is at 48%, from Germany at 65%, from India at 66% and from France at 72%. The reason is that IPv4 addresses are exhausted and ISPs started rolling out IPv6 to their customers. Further, in large companies we see an exhaustion of *private* IPv4 space (`10.0.0.0/8`, `192.168.0.0/16`, `172.16.0.0/12`, even `100.64.0.0/10`). So companies start using the same address space multiple times - especially for Kubernetes clusters - which prevents applications that are deployed in overlapping address spaces from communicating with each other. As a "fix" for this, people implement spooky double NAT constructs. Instead, we should run clusters with unique IP addresses that do not overlap. IPv6 allows us to do this.
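As a rough illustration of such an addressing scheme, the following sketch carves a unique /64 prefix per cluster out of an organization-wide /48. The prefix `2001:db8::/48` is the IPv6 documentation prefix, used here only as a placeholder:

```go
package main

import (
	"fmt"
	"net/netip"
)

// clusterPrefix derives a non-overlapping /64 for a cluster by writing the
// cluster index into bytes 6 and 7 (the 16-bit subnet ID) of the org /48.
func clusterPrefix(org netip.Prefix, index uint16) netip.Prefix {
	a := org.Addr().As16()
	a[6] = byte(index >> 8)
	a[7] = byte(index)
	return netip.PrefixFrom(netip.AddrFrom16(a), 64)
}

func main() {
	// Placeholder: in practice this would be a prefix assigned to the company.
	org := netip.MustParsePrefix("2001:db8::/48")
	for i := uint16(0); i < 4; i++ {
		fmt.Println(clusterPrefix(org, i)) // 2001:db8::/64, 2001:db8:0:1::/64, ...
	}
}
```

A single /48 yields 65,536 such /64s, so every cluster can get globally unique addresses without any NAT.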
#### Multi Tenancy
Network traffic from different tenants must be separated. And since we support not only IPv6 but also IPv4, we need to support overlapping IPv4 address spaces.
#### Resiliency
Parts of the network can fail, and they will. The network must be resilient against partial failures. Packets must be routed around holes in the network.
Architecture
------------
The idea is to use Kubernetes to manage the physical infrastructure and build an *Infrastructure as a Service* that allows users to create VMs, networks, load balancers, firewall rules etc. A *Kubernetes as a Service* will be deployed on top, e.g. using [Gardener](https://gardener.cloud/). As a next step, this environment can be extended to provide a *Container as a Service* offering. We should then be able to deploy the *Kubernetes API Server* on top of this CaaS. A customer could directly use the CaaS or order a Kubernetes cluster, whose control plane would then be deployed on top of the CaaS. This Kubernetes cluster could then use VMs as Kubernetes nodes or use a [Virtual Kubelet](https://github.com/virtual-kubelet/virtual-kubelet) which deploys Kubernetes pods using our CaaS.
Kubernetes has the notion of nodes, where pods are scheduled to. Programs are packaged in containers. Multiple containers are co-located in a pod. Kubernetes allows its API to be extended with [Custom Resource Definitions](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) (CRDs) and [API aggregation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/). We extend the Kubernetes API with multiple infrastructure related CRDs, e.g. a `machine` CRD (could be a VM or bare metal server), a `network` CRD (something like a VPC on AWS), a block storage `drive` CRD, an object storage `bucket` etc. But there will also be custom resources holding internal configuration, e.g. servers will be inventoried and switches need to be configured. Kubernetes [Operators](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) are deployed for all physical infrastructure components; they act on the custom resources and manage the respective device.
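To make this concrete, here is a minimal sketch of what Go API types for such a `machine` custom resource could look like, in the style used by kubebuilder/controller-runtime. The group, field names and semantics are illustrative assumptions, not a finalized API:

```go
// Package v1alpha1 sketches a hypothetical "machine" custom resource.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MachineSpec describes the desired machine: a VM or a bare metal server.
type MachineSpec struct {
	// Class selects a machine class (CPU/memory sizing), e.g. "general-8x32".
	Class string `json:"class"`
	// Image is the boot image to provision.
	Image string `json:"image"`
	// NetworkRef points to a network custom resource (the VPC-like object).
	NetworkRef string `json:"networkRef"`
	// DriveRefs lists block storage drive custom resources to attach.
	DriveRefs []string `json:"driveRefs,omitempty"`
}

// MachineStatus reports the observed state, filled in by the operator.
type MachineStatus struct {
	Phase     string   `json:"phase,omitempty"`     // e.g. Pending, Running, Error
	Addresses []string `json:"addresses,omitempty"` // assigned IP addresses
}

// Machine is the schema for the machines API.
type Machine struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   MachineSpec   `json:"spec,omitempty"`
	Status MachineStatus `json:"status,omitempty"`
}
```

A controller watching these objects would reconcile the spec (desired state) against the actual hardware or hypervisor and report progress back via the status.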
Kubernetes clusters do not scale infinitely. Therefore, the environment is built out of multiple *partition clusters*. On earth you have multiple *Regions* (e.g. Amsterdam, Seattle, Tokyo). One *Region* has multiple *Availability Zones* (datacenters). One *Availability Zone* has multiple *Partition Clusters* (rack rows or data halls). Servers and switches are nodes of a partition cluster.
### Servers
A standard server nowadays comes with a processor, memory, NVMe, a NIC and a Baseboard Management Controller (BMC - Lenovo calls it [XClarity Controller](https://lenovopress.lenovo.com/lp0880-xcc-support-on-thinksystem-servers), Dell [iDRAC](https://www.dell.com/en-us/dt/solutions/openmanage/idrac.htm) and HPE [iLO](https://www.hpe.com/us/en/servers/integrated-lights-out-ilo.html)). A typical server for the described infrastructure could look like this:
* 1U rack chassis
* redundant power supply
* TPM 2.0 module
* 2x Xeon Gold 6338 (32 cores, 64 threads each) or 2x AMD EPYC 7453
* 16x 64GB DDR4 memory
* 1.6 TB NVMe, PCIe 4.0, 3 DWPD
* ConnectX-6 2x100 Gbps NIC
#### Server Onboarding
While racking the server, it is connected with 5 cables: 2x power, 2x 100Gbps network and 1x CAT6 for the BMC. As soon as the server is connected to power, the BMC starts up and tries to get an IP address via DHCP. A Kubernetes operator monitors the DHCP leases and finds out about the new BMC. The operator connects to the BMC via [Redfish](https://en.wikipedia.org/wiki/Redfish_(specification)) and reads out the server's serial number, model name and system ID. If it does not find this information already in its inventory database, the operator creates a new dataset. It then sends a power-on command to the BMC and sets the first boot device to [PXE boot](https://en.wikipedia.org/wiki/Preboot_Execution_Environment).
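A minimal sketch of the Redfish part of this flow, assuming the open source [gofish](https://github.com/stmcginnis/gofish) client library (one possible choice, not mandated by this design) and placeholder BMC address and credentials:

```go
package main

import (
	"fmt"
	"log"

	"github.com/stmcginnis/gofish"
	"github.com/stmcginnis/gofish/redfish"
)

func main() {
	// Connect to the BMC discovered via its DHCP lease (placeholder values).
	c, err := gofish.Connect(gofish.ClientConfig{
		Endpoint: "https://10.0.0.42",
		Username: "admin",
		Password: "secret",
		Insecure: true,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Logout()

	systems, err := c.Service.Systems()
	if err != nil {
		log.Fatal(err)
	}
	for _, sys := range systems {
		// Read identifying data for the inventory database.
		fmt.Println(sys.SerialNumber, sys.Model, sys.SKU)

		// Boot from the network once, then power the server on.
		if err := sys.SetBoot(redfish.Boot{
			BootSourceOverrideTarget:  redfish.PxeBootSourceOverrideTarget,
			BootSourceOverrideEnabled: redfish.OnceBootSourceOverrideEnabled,
		}); err != nil {
			log.Fatal(err)
		}
		if err := sys.Reset(redfish.OnResetType); err != nil {
			log.Fatal(err)
		}
	}
}
```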
The server powers on and asks the DHCP server for an IP address and a bootfile via one of its 100Gbps ports. The DHCP server returns a link to an inventory image. The server boots into the provided live Linux inventory system, which scans all hardware. The inventory data is added to the dataset that was created by the operator before. The server shuts down again and powers off. It is now fully inventoried and onboarded to the environment. It is part of a resource pool and is standing by to be powered on again when needed.
### Virtualization
The specs (CPUs, memory, attached block devices, network interfaces, ...) of a requested virtual machine are stored in a custom resource. The VM will be scheduled to an available compute node in the partition cluster. A custom scheduler can be implemented for this, or a dummy pod with the corresponding CPU and memory requirements can be scheduled by the stock Kubernetes scheduler. An operator on the compute node receives the CR and creates a [Qemu/KVM](https://www.qemu.org/) virtual machine using [libvirt](https://libvirt.org/). The specified block devices and network interfaces are added to the machine.
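A minimal sketch of what the operator could do on the compute node, assuming the libvirt Go bindings (`libvirt.org/go/libvirt`). The embedded domain XML is heavily trimmed down for illustration and would normally be rendered from the machine custom resource:

```go
package main

import (
	"log"

	"libvirt.org/go/libvirt"
)

// Trimmed-down domain definition; disks and NICs would come from the CR.
const domainXML = `
<domain type='kvm'>
  <name>machine-sample</name>
  <memory unit='GiB'>8</memory>
  <vcpu>4</vcpu>
  <os>
    <type arch='x86_64' machine='q35'>hvm</type>
  </os>
  <devices>
    <disk type='block' device='disk'>
      <source dev='/dev/nvme0n1'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <interface type='bridge'>
      <source bridge='br0'/>
      <model type='virtio'/>
    </interface>
  </devices>
</domain>`

func main() {
	// Connect to the local libvirt daemon on the compute node.
	conn, err := libvirt.NewConnect("qemu:///system")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Define the domain persistently, then start it.
	dom, err := conn.DomainDefineXML(domainXML)
	if err != nil {
		log.Fatal(err)
	}
	defer dom.Free()

	if err := dom.Create(); err != nil {
		log.Fatal(err)
	}
	log.Println("virtual machine started")
}
```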
### Storage
The boring choice for a storage provider is [Ceph](https://ceph.io/en/discover/technology/). It is battle tested and used in lots of OpenStack environments. It provides an S3-compatible object storage service and RADOS block devices (RBD). Ceph can be managed via Kubernetes using the [Rook](https://rook.io/) operator.
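From the consumer side, an application would simply claim a volume backed by a Ceph/Rook storage class. A minimal client-go sketch, assuming the storage class name `rook-ceph-block` (the name used in the Rook examples) and a local kubeconfig:

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (assumption: run outside the cluster).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	sc := "rook-ceph-block" // storage class provisioned by Rook (assumption)
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "db-data", Namespace: "default"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: &sc,
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("10Gi"),
				},
			},
		},
	}

	// Create the claim; Rook/Ceph provisions an RBD-backed volume for it.
	_, err = clientset.CoreV1().PersistentVolumeClaims("default").
		Create(context.TODO(), pvc, metav1.CreateOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("PVC created")
}
```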
There are also other, mostly commercial and proprietary, storage solutions. They are of particular interest when going the *NVMe over TCP* route (e.g. [Lightbits](https://www.lightbitslabs.com/)).
### Networking
...
*Copyright (c) 2022 by Malte Janduda, SAP SE*
###### tags: `aurae`, `onmetal`