# Developing Kubernetes for Windows Nodes
Authors: Jay Vyas (VMware), Friedrich Wilken (?), Peri Thompson (VMware), Mark Rossetti, with special thanks to James Sturtevant (Microsoft), Jamie Phillips (Rancher), and Slaydyn (?)
SIG-Windows has always strived to make it easier for developers to run Kubernetes on Windows, starting early on with the https://github.com/kubernetes-sigs/sig-windows-tools/ repository, which has various PowerShell utilities for managing CNIs, installing kubeadm, and so on.
In this post we'll look at what we've done to build a more automated, "batteries included" sandbox for hacking on Kubernetes on Windows. Our goal is that anyone can test and contribute fixes to the Windows kubelet or the Windows kube-proxy, using a new project created under the kubernetes-sigs umbrella: https://github.com/kubernetes-sigs/sig-windows-dev-tools/.
## Why we need a unique developer environment for Windows on Kubernetes
The typical answer to "I need to test a patch in a realistic environment" for Kubernetes is often "just simulate a cluster in kind". This definitely works, given the power of Docker-in-Docker isolation, even with complex CNI and storage providers. But Windows is a different animal entirely: you need real VMs, because `Windows Containers` are a feature that must be enabled on a Windows VM.
- clouds are expensive, and kind doesn't support Windows nodes
- local VMs are faster than the cloud once combined with a snapshot system
Although the SIG-Windows testing dashboard (https://k8s-testgrid.appspot.com/sig-windows-signal) has a good signal-to-noise ratio, it doesn't test all of the permutations of Windows clusters that are possible (for example, it doesn't test network policies, userspace Windows proxying, Active Directory, and many other important aspects of Kubernetes on Windows). Because spinning up Windows nodes in Kubernetes clusters requires a complex multi-OS environment (i.e. `kind` cannot be used to make Windows nodes), it's simply not practical to build complex CI jobs for all permutations of Windows feature sets. It's also not easy to develop or fix bugs for these features, for the same reasons.
Concretely, several issues recently came up that made it obvious we needed to up our testing and development game:
### A wake-up call for more flexible Windows testing
- In 1.21, we had a kube-proxy regression on Windows (https://github.com/kubernetes/kubernetes/issues/101500), due to the fact that the `WindowsEndpointSliceProxying` feature graduated to beta. Since the userspace proxy doesn't (yet) support endpoint slices, this broke the Windows userspace kube-proxy (which the https://antrea.io/ project relies on).
- The proliferation of new CNI options (Flannel, Calico, Antrea, and others), as well as new CRI options (which have different installers for `containerd` and `docker`), has multiplied the complexity of the Windows testing matrix
- Emerging technologies like hostProcess containers (add link!!) made it clear we needed a way to build Windows clusters with access to bleeding-edge API server features
This made it clear that we needed to do a better job in SIG-Windows of testing off-the-beaten-path features for CNI providers.
## Kubernetes development for Linux is a solved problem
The first problem to solve in making flexible Windows development environments was: how can we automate spinning up Windows kubelets? To understand the Kubernetes development workflow, we first need to look at how it works in non-Windows environments.
### How Kind does it
The `kind` project (https://kind.sigs.k8s.io/) has commoditized development environments on Linux in a few simple steps:
- clone Kubernetes
- build a kind node image (https://kind.sigs.k8s.io/docs/design/node-image/)
- spin up a kind cluster with the node image you built
- hack and repeat
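In command form, that workflow is roughly the following (a sketch; the kind node-image docs linked above are the authoritative reference):
```
# With a Kubernetes checkout at $GOPATH/src/k8s.io/kubernetes (or pass the
# source path explicitly): build a node image containing your locally
# compiled Kubernetes, then boot a cluster from it.
kind build node-image
kind create cluster --image kindest/node:latest
```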
### hack/local-up-cluster.sh
As an alternative to kind, we can use tools like hack/local-up-cluster.sh. On a Linux node, you can simply run this script to compile Kubernetes from source and run a kubelet, API server, controller manager, and scheduler locally.
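For example, on a Linux machine with the build dependencies installed:
```
# From the root of a kubernetes/kubernetes checkout: compile and start a
# local control plane and kubelet in one shot.
./hack/local-up-cluster.sh
```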
## How can we do this for Windows?
To give Windows developers the same experience as Linux developers, we needed to build a push-button solution that lets someone spin up a cluster:
- without a cloud account
- without deep knowledge of Kubernetes compilation
- without deep knowledge of kubeadm
So we need to create clusters with API servers from Kubernetes 1.22+, Windows and Linux kubelets, and pod networking. To do this, we needed to set up a few core components in our clusters.
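The workflow we were aiming for looks roughly like this (a sketch; the repository's README is the authoritative source for the current commands):
```
# Clone the dev tools and bring up a Linux control plane plus a Windows
# worker, with locally built Kubernetes binaries, in one step.
git clone https://github.com/kubernetes-sigs/sig-windows-dev-tools.git
cd sig-windows-dev-tools
make all
```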
#### Compiling Windows executables
Since our goal is to support Windows development, the most important compilation we do is that of the Windows Kubernetes components (mainly kubelet.exe and kube-proxy.exe, because the control plane components of Kubernetes do not run on Windows).
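Cross-compiling those binaries from a Linux or macOS host uses the standard Kubernetes build tooling; roughly (exact output paths can vary between Kubernetes versions):
```
# From the root of a kubernetes/kubernetes checkout:
KUBE_BUILD_PLATFORMS=windows/amd64 make WHAT=cmd/kubelet
KUBE_BUILD_PLATFORMS=windows/amd64 make WHAT=cmd/kube-proxy
# kubelet.exe and kube-proxy.exe end up under _output/local/bin/windows/amd64/
```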
#### Figuring out a hypervisor
We needed to replace Docker-in-Docker with a local virtualization solution: Hyper-V, VMware Fusion, or VirtualBox. This enables anyone with a computer capable of running two VMs to develop Kubernetes on Windows.
#### Manual startup of `kubeadm`
For us, kubeadm is a first-class installation tool, as opposed to something managed for us by the `kind` abstraction. This allows us to pull and tag images with our Kubernetes version on the control plane:
```
sudo docker pull k8s.gcr.io/etcd:3.4.13-0
sudo docker tag k8s.gcr.io/etcd:3.4.13-0 gcr.io/k8s-staging-ci-images/etcd:v1.22.0-alpha.3.31+a3abd06ad53b2f
```
and point at them via the image repository when we init:
```
sudo kubeadm init --apiserver-advertise-address=10.20.30.10 \
--pod-network-cidr=100.244.0.0/16 \
--image-repository=gcr.io/k8s-staging-ci-images \
--kubernetes-version=ci/v1.22.0-alpha.3.31+a3abd06ad53b2f \
--v=6
```
When we later join Windows nodes via the join token, kubeadm will automatically use these images.
(For more details, see https://jayunit100.blogspot.com/2021/06/pulling-k8s-bits-from-sourcehead-and.html.)
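Joining the Windows worker then uses the standard kubeadm join flow; roughly (the token and hash are placeholders printed by `kubeadm init`):
```
# Run on the Windows node once the kubelet and kube-proxy binaries are in place:
kubeadm join 10.20.30.10:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>
```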
#### Configuring CNIs without daemonsets, on containerd
With dockershim being deprecated, supporting `containerd` is paramount for Windows as well as Linux development and testing on Kubernetes. Additionally, community tools like the cluster-api default to using containerd, and we want to support the cluster-api community in our development environments as much as possible.
On Windows with containerd, however, you can't run your CNI in a container; your CNI provider must run on the host. (Most people using open source CNIs nowadays are used to running `calico`, `antrea`, `flannel`, and `cilium` in a pod.)
Thus, on Windows we need to break from the norm when it comes to CNI installation, and make sure there is fine-grained tooling in place to automate both `antrea` and `calico` as CNI provider options, so that developers from different CNI communities can help us investigate issues such as Open vSwitch on Windows, HNS networking, BGP, and VXLAN-related Windows Kubernetes networking.
So, let's dive into the details...
### How the new Kubernetes development environments work
- Kubeadm
- using the ci/ option
- pulling and tagging
- Setting up containerd or Docker
#### Compilation: Reusing the great work of k8s-staging-ci-images
#### The Hypervisor: Vagrant and vagrant providers
#### Kubeadm setup on linux and windows
#### CNI Providers
One of the trickiest parts of setting up our developer environments is CNI provisioning. As mentioned, we need to support `containerd`, which means running our CNIs on Windows under some kind of systemd-like service manager (we use nssm), so that they can run as host processes with the ability to issue Windows networking commands to the HNS subsystem. In addition, CNI providers on Windows depend on kube-proxy to, for example, set up initial routes to the API server. The "kube-proxy" and "CNI" chicken-or-egg problem is quite common (even on Linux), because:
- Your CNI provider needs to access the API server
- Your kube-proxy's job is to create routes to the API server
- But the routes created by your kube-proxy won't function until your CNI is up
For this reason, kube-proxy typically runs as a host-network process, which allows it to start up without a valid CNI provider in place. On Windows with containerd, however, host-networked pods are not possible, because there is no `docker0`-style "bridge" to use as a local network device.
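Running the node components as host services instead is where nssm comes in. As a rough sketch (the paths and flags here are illustrative, not necessarily the exact ones our scripts use):
```
# On the Windows node: register kube-proxy as a Windows service via nssm,
# so it runs directly on the host rather than inside a pod.
nssm install kube-proxy C:\k\kube-proxy.exe
nssm set kube-proxy AppParameters "--proxy-mode=userspace --kubeconfig=C:\k\config --v=4"
nssm start kube-proxy
```
The CNI agents described below are registered as host services in the same way.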
##### Antrea on Containerd
To install antrea on containerd, we break the antrea installation into a few separate scripts, which all live under the `forked/` directory. These scripts:
- bootstrap openvswitch
- install kube-proxy in userspace
- install antrea-agent, which depends on the above processes
You can view these scripts under the `forked/` directory. For ease of understanding, we index each script by order (antrea-0.ps1, antrea-1.ps1, and so on). It is possible to custom-install antrea components by hand if you have a manually built antrea distribution, but this requires modifying some of the Vagrant mounts and PowerShell scripts which download the components. We may automate this at some point if there is community demand to test CNIs from source.
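For illustration, the layout looks roughly like this (the number-to-step mapping follows the ordering described above, but treat the exact file names as an approximation; the repository is the source of truth):
```
forked/
  antrea-0.ps1   # bootstrap OpenVSwitch on the Windows node
  antrea-1.ps1   # install the userspace kube-proxy as a host service
  antrea-2.ps1   # install the antrea-agent, which depends on the above
```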
##### Calico on Containerd
For calico on containerd, we follow a similar process to that of antrea. In particular, we:
- bootstrap the Windows node with the required calico features: `Install-WindowsFeature RemoteAccess`, `RSAT-RemoteAccess-PowerShell`, and `Routing` (see the sketch after this list)
- start the kube-proxy in kernel space
- configure the calico configuration (cluster-wide) to disable IPIP and use VXLAN
- bootstrap the calico node agent (the first part of calico setup) as a Windows service, which then relies on calico-felix, the secondary service, for dataplane networking
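The Windows feature bootstrap in the first step boils down to something like this (a sketch; our scripts wrap it with more error handling):
```
# On the Windows node: install the features calico needs, then reboot.
Install-WindowsFeature -Name RemoteAccess, RSAT-RemoteAccess-PowerShell, Routing
Restart-Computer
```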
#### Reboots and breaking scripts up
Throughout installation of the Windows developer environments, we broke pieces of the installation up so that we can perform reboots and/or run new scripts in the correct order as needed. For example, Windows nodes require rebooting at various points during setup, so we use the `vagrant reload` directive in several places throughout the setup of Windows kubelets (see the example after this list):
- installing new Windows features like `Containers` requires a reboot afterwards
- with Antrea and Calico, special features such as `Routing` or the installation of OVS are required, which need reboots
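For example (here we assume the Windows worker VM is named `winw1` in the Vagrantfile; substitute your machine name):
```
# Reboot the Windows worker and re-run its provisioners, e.g. after
# enabling the Containers feature or installing OVS.
vagrant reload winw1 --provision
```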
In addition to reloading our Windows nodes, we needed to break up installation scripts so that they were a little more flexible:
- In the kubeadm setup for Windows, we made some modifications to the upstream https://github.com/kubernetes-sigs/sig-windows-tools repository so we can support easy-to-modify, "self-built" Kubernetes binary paths, and finer-grained debugging of where kubelet and kube-proxy artifacts come from.
- In calico's case, we must run the calico `node` Windows service startup in a separate step, because `Felix` must be started *after* the calico `node` service, and Felix actually creates a new HNS network which temporarily disables the Windows networking stack. This doesn't require a reboot, but it does require "watching" our installation of calico fail after starting Felix (because the startup command appears to time out), as sketched below.
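Concretely, starting the services on the Windows node looks roughly like this (the service names are those we believe the Calico for Windows installer creates; treat this as a sketch rather than the exact commands our scripts run):
```
# Start the calico node service first...
Start-Service CalicoNode
# ...then Felix in a separate step. Felix creates a new HNS network, which
# briefly disrupts host networking, so this command may appear to time out.
Start-Service CalicoFelix
```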
## How we use them
### Testing out the userspace kube-proxy on Windows
Several months ago, Antonio Ojea (Red Hat) wrote a quick patch attempting to remove some unnecessary dependencies from the Windows userspace kube-proxy, which were forcing the overall k/k codebase to keep carrying them. Such dependencies are needless sources of complexity (and CVEs), so we wanted to merge this, but we didn't have CI in place for the userspace kube-proxy.
Putting CI in place for the userspace kube-proxy isn't practical given that, over time, there will most likely be proposals and plans to remove it from Kubernetes entirely, as it is a legacy component.
Thus, we had a great use case for our new development environments. We updated Antonio's original patch (to remove Bazel), used the developer environments to spin up an `antrea` cluster with the latest userspace kube-proxy, and reissued his PR here: https://github.com/kubernetes/kubernetes/pull/102847. Note the comical issue thread, where it rapidly becomes clear that, without these environments, we wouldn't have easily been able to test this at all!
The workflow for testing this patch was:
- running `make all` a first time from the root of the sig-windows-dev-tools directory. This confirms that, at a baseline, we can make mixed-OS Windows+Linux clusters.
- running `vagrant destroy --force` to clean up our environment, and then adding Antonio's fork to the local `kubernetes/` directory, which was already created by the Windows development environment on startup.
- using `git checkout` to check out his branch, and removing all of the artifacts in the `output` directory to ensure we recompile the Windows kube-proxy.
- making sure that `cni: antrea` is set in the variables.yml file.
- running `vagrant destroy --force; make all`
This then creates an antrea-based cluster (antrea installs the userspace kube-proxy for us). The cluster gives us quick verification that, indeed, the userspace proxy starts up properly. Note that for this patch we also did some secondary testing in other clusters, just to be sure, because these environments are quite new.
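Condensed into commands, the workflow above looks roughly like this (the fork URL, branch, and exact location of the `output` directory are placeholders based on the description above):
```
# 1) Baseline: confirm a stock mixed-OS (Windows + Linux) cluster comes up.
make all

# 2) Reset, then point the local kubernetes/ checkout at the fork under test.
vagrant destroy --force
cd kubernetes
git remote add fork <fork-url>
git fetch fork
git checkout <branch>
cd ..
rm -rf output/*    # clear old artifacts so kube-proxy.exe gets recompiled

# 3) Ensure variables.yml contains "cni: antrea", then rebuild everything.
vagrant destroy --force; make all
```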
### Testing out hostProcess containers on Windows for the first time
............. PERI TO ADD ...................
## Some interesting stuff we learned
- Spinning up bleeding-edge Windows clusters is easy thanks to the sig-release tooling, as described earlier in this post.
- Installation of OVS, calico on VXLAN, and many other Windows-native networking tools works just fine in desktop hypervisors such as Hyper-V, VMware, VirtualBox, and so on. We weren't 100% sure of this when we started.
- In calico, if you misconfigure the pod CIDR network (an input to the calico node agent on startup), your CNI's network interface (made by Windows for a pod) gets degraded after it is created, but before the pod is created! This is an interesting case where the Windows OS realizes, a little while after a new CNI network device is created, that something about the way you made it is incorrect.
- In antrea, the primary networking device is assumed to also be connected to an outbound default gateway. This is actually a bit of a bug, because it means antrea on Windows enforces some networking constraints that don't play well with nodes that have multiple Windows NICs.
## Future ideas
- Incorporating image-builder to speed up creation of new clusters. This comes at the cost of requiring the development environments to first make a bootable OVA image, however, and thus might make offline installation and support of other clouds (like the vagrant-vsphere or Azure provider, for example) harder.
- Building Kubernetes on Windows using WSL: one of the benefits of the dev environment is the ability to fetch the latest Kubernetes releases and build components from source. However, at the moment this is only possible if your host system is Linux or macOS, so one of our future goals is to make this possible on Windows too. Right now, the best approach looks to be using WSL.