# Deploying and testing the Descheduler in OpenShift 4.20
## Intro
Administrators who are managing virtual machines in OpenShift may want to configure the cluster to automatically rebalance the placement of VMs across cluster nodes.
The idea is to enable a feature that VMware users have become accustomed to: DRS (Distributed Resource Scheduler), which rebalances VMs across a VMware cluster.
The OpenShift scheduler makes placement decisions based on _requested_ resource usage, but it does not automatically "reschedule" VMs if the node hosting them becomes inhospitable due to a "noisy neighbor" workload that exceeds its requested resource usage and starts consuming more of the node's compute resources.
The reason OpenShift/Kubernetes has traditionally not rebalanced workloads comes down to one of the main differences between traditional, singleton workloads like VMs and more modern, containerized workloads, which often run as multiple "replicas" spread across the cluster. If one replica of a container application is struggling due to a noisy neighbor on its node, the other replicas likely are not. If they are, the application can be configured to scale out additional replicas onto nodes that are not experiencing pressure.
Now that OpenShift has become a popular platform for running Virtual Machines, the Descheduler is offered as a way to provide this DRS-like functionality in Kubernetes.
## Install the Descheduler operator
Search for the Descheduler operator in the web console under Ecosystem -> Software Catalog and install it.
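If you prefer to do this from the CLI, the sketch below installs the operator with an OLM Subscription. Treat the channel, package name, and catalog source as assumptions and verify them first with `oc get packagemanifests -n openshift-marketplace | grep -i descheduler`.
```yaml
# Sketch only: confirm the package name, channel, and source on your cluster
# (oc get packagemanifests -n openshift-marketplace) before applying.
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-kube-descheduler-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: descheduler-operator-group
  namespace: openshift-kube-descheduler-operator
spec:
  targetNamespaces:
    - openshift-kube-descheduler-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-kube-descheduler-operator
  namespace: openshift-kube-descheduler-operator
spec:
  channel: stable # assumption; use whatever packagemanifests reports
  name: cluster-kube-descheduler-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
```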

## Configure your nodes for Load Aware Balancing
This rolls out a change to your nodes via a MachineConfig object.
This MachineConfig appends `psi=1` to the kernel command line on the nodes in the targeted Machine Config Pool to enable **Pressure Stall Information** tracking. The change triggers a rolling reboot of those nodes. PSI data is key to the Descheduler's recent Load Aware Balancing capabilities.
Choose this config if you are running the classic 3-node compact cluster in a test environment. It will target the "master" Machine Config Pool:
```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-psi
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  kernelArguments:
    - psi=1 # Pressure Stall Information
```
Wait for this to roll out to all 3 of your nodes.
Choose this config if you are running what we might call a more traditional cluster with separate Worker and Control Plane nodes. It will target the "worker" Machine Config Pool:
```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-psi
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - psi=1 # Pressure Stall Information
```
Wait for this to roll out to all your Worker nodes.
They will reboot and rejoin the cluster once they are done applying the change. Expect this step to take 15+ minutes depending on the size of the machine pool you are targeting and the age/size/personality of your hardware. You can watch its progress with:
```
# List the Machine Config Pools and find the one you targeted
oc get mcp
# Then watch that specific pool, e.g. the master pool on a 3-node cluster
oc get mcp master
# ...or another pool, like 'worker'
oc get mcp worker
```
Example: for a 3-node cluster, look for UPDATING to be False (indicating the rollout is complete), READYMACHINECOUNT to be 3, and UPDATEDMACHINECOUNT to be 3.

You can see which node it is currently working on by looking for the one that is `SchedulingDisabled` and/or `NotReady`.
```
oc get nodes
```
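Once a node has rebooted, you can optionally spot-check that PSI is actually enabled. This is just a sanity check; `<nodename>` is a placeholder, and non-empty output from the pressure file means PSI is on.
```
oc debug node/<nodename> -- chroot /host cat /proc/pressure/cpu
```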
## Create an instance of the Descheduler
There are fields you can customize ([see the docs](https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html-single/virtualization/index#nodes-descheduler-profiles_virt-enabling-descheduler-evictions)), but the intention here is to give you a starting point for testing.
```yaml
---
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  managementState: Managed
  deschedulingIntervalSeconds: 30 # Run every 30s - good for testing.
  # Probably need a more conservative value for prod.
  profiles:
    - KubeVirtRelieveAndMigrate # Use PSI metrics
  mode: Automatic # vs Predictive (which is informative-only)
  profileCustomizations:
    devEnableSoftTainter: true
    devDeviationThresholds: Low
    devActualUtilizationProfile: PrometheusCPUCombined
```
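Once that's applied, it's worth confirming the operator actually spun up a descheduler pod. The label selector below is the same one used for the log command later in this post.
```
oc get kubedescheduler cluster -n openshift-kube-descheduler-operator
oc get pods -n openshift-kube-descheduler-operator -l app=descheduler
```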
## Set up conditions such that only one node runs a lot of VMs
If running a 3-node cluster, cordon 2 of your 3 nodes so that we can launch a storm of VMs on one node only, overloading it and creating conditions that are ripe for descheduling evictions.
(If running more than 3 nodes, cordon all but one of your worker nodes.)
```
oc get nodes
oc adm cordon <nodename>
# to Uncordon
# oc adm uncordon <nodename>
```
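If you have several workers, a small loop saves some typing. This is just a sketch; `<node-to-keep>` is a placeholder for the one worker you want to leave schedulable.
```
# Cordon every worker except the one you want to overload
KEEP=<node-to-keep>
for n in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  [ "$n" = "node/$KEEP" ] || oc adm cordon "${n#node/}"
done
```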
Check the new status of your nodes. You should see the cordoned nodes show `Ready,SchedulingDisabled`. Here is a view of my 6-node cluster with two Workers cordoned:

## Launch a pool of VMs
Thanks to Mark DeNeve and his wonderful blog post "[OpenShift Virtualization and the Kubernetes Descheduler Revisited](https://xphyr.net/post/ocpv_descheduler_revisited/)" for the VirtualMachinePool manifest below. I modified it slightly to include the necessary descheduler annotation in a different location.
```yaml
apiVersion: pool.kubevirt.io/v1alpha1
kind: VirtualMachinePool
metadata:
  name: vm-pool-fedora
spec:
  replicas: 3
  selector:
    matchLabels:
      kubevirt.io/vmpool: vm-pool-fedora
  virtualMachineTemplate:
    metadata:
      creationTimestamp: null
      labels:
        kubevirt.io/vmpool: vm-pool-fedora
    spec:
      runStrategy: Always
      template:
        metadata:
          creationTimestamp: null
          labels:
            kubevirt.io/vmpool: vm-pool-fedora
          annotations:
            descheduler.alpha.kubernetes.io/evict: "true"
        spec:
          domain:
            cpu:
              cores: 1
              maxSockets: 4
              model: host-model
              sockets: 1
              threads: 1
            devices:
              disks:
                - disk:
                    bus: virtio
                  name: containerdisk
                - disk:
                    bus: virtio
                  name: cloudinitdisk
            memory:
              guest: 2048Mi
          terminationGracePeriodSeconds: 0
          volumes:
            - containerDisk:
                image: quay.io/kubevirt/fedora-cloud-container-disk-demo
              name: containerdisk
            - cloudInitNoCloud:
                userData: |
                  #cloud-config
                  user: fedora
                  password: fedora
                  chpasswd: { expire: False }
                  ssh_pwauth: True
                  disable_root: false
                  package_update: true
                  packages:
                    - stress-ng
                  runcmd:
                    - stress-ng --matrix 0
              name: cloudinitdisk
```
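To launch the pool, create a project and apply the manifest. The project name `vm-load-test` is simply what I used (it matches the namespace you'll see in the descheduler logs below), and the filename assumes you saved the manifest as `vm-pool-fedora.yaml`.
```
oc new-project vm-load-test
oc apply -f vm-pool-fedora.yaml -n vm-load-test
```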
Generally speaking, you opt in the VMs you want the Descheduler to be able to relocate.
This can be done in the UI or via YAML. Below is the key annotation that does this.
```yaml
kind: VirtualMachine
spec:
  template:
    metadata:
      annotations:
        descheduler.alpha.kubernetes.io/evict: "true"
```
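If you have an existing VM you want to opt in, you could also patch the annotation onto it; this is a sketch, with `<vm-name>` and `<namespace>` as placeholders. Note that a running VMI typically only picks up template changes after a restart.
```
oc patch vm <vm-name> -n <namespace> --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"descheduler.alpha.kubernetes.io/evict":"true"}}}}}'
```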
## Create pressure by scaling up the number of VMs
The pool above starts with 3 VMs, but on your nodes that may not be enough to create pressure.
The magic number of VMs that will trigger the Descheduler to act depends on:
1) How much CPU your cluster nodes have
2) How much other workload is already running on them
3) Certain settings in your descheduler configuration
Perhaps start by scaling up to 25 and see what happens.
You can always scale down or up.
```
oc scale vmpool vm-pool-fedora --replicas 25
```
## Monitor what’s happening
I recommend having a few terminal windows or tmux panes, one for each of these `watch` commands.

```
watch oc get vmpool
```
For the `oc get vmi` output below, pay attention to the NODENAME column. You want to see it change for some of the VMIs once you uncordon a node.
```
watch oc get vmi
```
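It can also help to keep an eye on actual node utilization while the VMs ramp up (this relies on cluster metrics being available):
```
watch oc adm top nodes
```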
## Watch the descheduler logs
You’ll notice it ignores high priority, non-VM pods with “`Pod fails the following checks…pod has system critical priority…`”.
Also notice it is only evicting one at a time (one every 30-sec descheduler interval, as we configured above). That rate is configurable.
```
oc logs -n openshift-kube-descheduler-operator -l app=descheduler -f
I1212 22:30:36.344112 1 defaultevictor.go:268] "Pod fails the following checks" pod="openshift-cnv/hco-operator-7d65c9f6f9-gf6mg" checks="[pod has system critical priority, pod has higher priority than specified priority class threshold]"
I1212 22:30:36.344144 1 defaultevictor.go:268] "Pod fails the following checks" pod="openshift-workload-availability/self-node-remediation-controller-manager-c58d6cb8d-rtmnh" checks="[pod has system critical priority, pod has higher priority than specified priority class threshold]"
I1212 22:30:36.344175 1 defaultevictor.go:268] "Pod fails the following checks" pod="openshift-ovn-kubernetes/ovnkube-node-nnsjn" checks="[pod has system critical priority, pod has higher priority than specified priority class threshold, pod is related to daemonset and descheduler is not configured with evictDaemonSetPods]"
I1212 22:30:36.344203 1 defaultevictor.go:268] "Pod fails the following checks" pod="openshift-workload-availability/self-node-remediation-ds-g4p9f" checks="[pod has system critical priority, pod has higher priority than specified priority class threshold, pod is related to daemonset and descheduler is not configured with evictDaemonSetPods]"
I1212 22:30:36.344229 1 nodeutilization.go:200] "Pods on node" node="ip-10-0-3-178.us-east-2.compute.internal" allPods=130 nonRemovablePods=46 removablePods=84
I1212 22:30:36.344253 1 nodeutilization.go:216] "Evicting pods based on priority, if they have same priority, they'll be evicted based on QoS tiers"
I1212 22:30:36.489479 1 evictions.go:601] "Eviction in background assumed" pod="vm-load-test/virt-launcher-vm-pool-fedora-33-rhn2n"
I1212 22:30:36.489514 1 nodeutilization.go:334] "Currently, only a single pod eviction is allowed"
I1212 22:30:36.489533 1 profile.go:376] "Total number of evictions/requests" extension point="Balance" evictedPods=0 evictionRequests=1
I1212 22:30:36.489550 1 descheduler.go:403] "Number of evictions/requests" totalEvicted=0 evictionRequests=1
```