# Adding bare metal nodes to platform vSphere

There are scenarios where a customer may need to add bare metal, platform "none", nodes to a vSphere (or any other cloud, for that matter) cluster.

## Configuration

1. Install a platform vSphere cluster.
2. Download the RHCOS live ISO that aligns with the installed version of OpenShift.
3. Obtain or create a worker.ign file. This will be used to bootstrap the bare metal node.
4. Boot the new bare metal host from the RHCOS live ISO.
5. Install RHCOS:
    ```bash=
    coreos-installer install /dev/sdX --insecure-ignition --ignition-url=https://path-to-worker-ignition --platform=metal
    ```
6. Reboot the node.
7. If the node fails to get a hostname, reboot it and press `e` at the boot menu to edit the boot command line. See https://access.redhat.com/solutions/5500131 for further details.
8. Once at the emergency mode prompt, run `vi /etc/hostname`.
9. Enter the desired hostname and save the file.
10. Press `CTRL+D` to resume booting.
11. Approve CSRs for the node.
12. Apply a taint to the node to block workloads from being scheduled:
    ```yaml=
    spec:
      taints:
        - key: bare-metal/no-vmware
          value: 'true'
          effect: NoExecute
    ```

Note: The [vSphere CSI driver daemonset](https://github.com/kubernetes-sigs/vsphere-csi-driver/blob/4479e2418f38cb93b5da4df7e043aff71a20cccc/manifests/vanilla/vsphere-csi-driver.yaml#L565-L569) tolerates all taints. I was able to disable it on the bare metal node by making the operator unmanaged and removing the tolerations; a sketch of that workaround, along with example commands for steps 11 and 12, follows below.
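The following is a minimal sketch of steps 11 and 12 plus the workaround from the note. The note doesn't say exactly which object was made unmanaged; patching the `ClusterCSIDriver` is one plausible route. The resource name `csi.vsphere.vmware.com`, the `openshift-cluster-csi-drivers` namespace, and the `vmware-vsphere-csi-driver-node` daemonset name are assumptions and may differ by release:

```bash=
# Approve any pending CSRs for the new node (step 11).
# This approves all CSRs, which is fine in a lab environment.
oc get csr -o name | xargs oc adm certificate approve

# Apply the taint from step 12 without editing the node object directly.
oc adm taint nodes bare-metal-worker-1 bare-metal/no-vmware=true:NoExecute

# Workaround from the note: make the vSphere CSI driver unmanaged, then edit
# the node daemonset and remove the blanket tolerations so its pods stop
# scheduling onto the tainted bare metal node.
oc patch clustercsidriver csi.vsphere.vmware.com --type=merge \
  -p '{"spec":{"managementState":"Unmanaged"}}'
oc -n openshift-cluster-csi-drivers edit daemonset vmware-vsphere-csi-driver-node
```

While unmanaged, the removed tolerations should not be reconciled back, but the daemonset also won't pick up updates until the management state is set back to `Managed`.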
## Results

The node `bare-metal-worker-1` is being hosted on libvirt in my home lab. As can be seen below, all operators are available. The storage operator still reports Progressing because I spun it down to edit the daemonset.

```log=
oc get nodes;oc get co
162-0-168-192.in-addr.arpa: Tue Feb 4 13:12:43 2025
NAME                                STATUS   ROLES                         AGE    VERSION
bare-metal-worker-0                 Ready    worker                        65m    v1.31.4
bare-metal-worker-1                 Ready    worker                        43m    v1.31.4
ci-op-rvanderp-abc-kqwmr-master-0   Ready    control-plane,master,worker   4h4m   v1.31.4
ci-op-rvanderp-abc-kqwmr-master-1   Ready    control-plane,master,worker   4h4m   v1.31.4
ci-op-rvanderp-abc-kqwmr-master-2   Ready    control-plane,master,worker   4h4m   v1.31.4
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.18.0-rc.7   True        False         False      3h39m
baremetal                                  4.18.0-rc.7   True        False         False      4h1m
cloud-controller-manager                   4.18.0-rc.7   True        False         False      4h4m
cloud-credential                           4.18.0-rc.7   True        False         False      4h4m
cluster-autoscaler                         4.18.0-rc.7   True        False         False      4h1m
config-operator                            4.18.0-rc.7   True        False         False      4h2m
console                                    4.18.0-rc.7   True        False         False      3h46m
control-plane-machine-set                  4.18.0-rc.7   True        False         False      4h1m
csi-snapshot-controller                    4.18.0-rc.7   True        False         False      4h2m
dns                                        4.18.0-rc.7   True        False         False      3h50m
etcd                                       4.18.0-rc.7   True        False         False      3h59m
image-registry                             4.18.0-rc.7   True        False         False      3h49m
ingress                                    4.18.0-rc.7   True        False         False      3h50m
insights                                   4.18.0-rc.7   True        False         False      4h1m
kube-apiserver                             4.18.0-rc.7   True        False         False      3h58m
kube-controller-manager                    4.18.0-rc.7   True        False         False      3h59m
kube-scheduler                             4.18.0-rc.7   True        False         False      3h58m
kube-storage-version-migrator              4.18.0-rc.7   True        False         False      4h2m
machine-api                                4.18.0-rc.7   True        False         False      3h58m
machine-approver                           4.18.0-rc.7   True        False         False      4h2m
machine-config                             4.18.0-rc.7   True        False         False      4h
marketplace                                4.18.0-rc.7   True        False         False      4h1m
monitoring                                 4.18.0-rc.7   True        False         False      3h48m
network                                    4.18.0-rc.7   True        False         False      4h2m
node-tuning                                4.18.0-rc.7   True        False         False      43m
olm                                        4.18.0-rc.7   True        False         False      4h1m
openshift-apiserver                        4.18.0-rc.7   True        False         False      3h56m
openshift-controller-manager               4.18.0-rc.7   True        False         False      3h57m
openshift-samples                          4.18.0-rc.7   True        False         False      3h55m
operator-lifecycle-manager                 4.18.0-rc.7   True        False         False      4h1m
operator-lifecycle-manager-catalog         4.18.0-rc.7   True        False         False      4h1m
operator-lifecycle-manager-packageserver   4.18.0-rc.7   True        False         False      3h50m
service-ca                                 4.18.0-rc.7   True        False         False      4h2m
storage                                    4.18.0-rc.7   True        True          False      4h2m    VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
```

The vSphere cloud controller manager logs errors about the node but doesn't raise any alerts.

```log=
I0204 18:12:19.442201 1 search.go:76] WhichVCandDCByNodeID nodeID: bare-metal-worker-1
E0204 18:12:19.511542 1 datacenter.go:107] Unable to find VM by DNS Name. VM DNS Name: bare-metal-worker-1
E0204 18:12:19.511584 1 search.go:181] Error while looking for vm=bare-metal-worker-1(byName) in vc=81-84-38-10.in-addr.arpa and datacenter=cidatacenter-nested-0: No VM found
I0204 18:12:19.511607 1 search.go:186] Did not find node bare-metal-worker-1 in vc=81-84-38-10.in-addr.arpa and datacenter=cidatacenter-nested-0
I0204 18:12:19.511644 1 search.go:72] WhichVCandDCByNodeID by IP
I0204 18:12:19.511653 1 search.go:76] WhichVCandDCByNodeID nodeID: bare-metal-worker-1
E0204 18:12:19.516869 1 datacenter.go:90] Unable to find VM by IP. VM IP: bare-metal-worker-1
E0204 18:12:19.516903 1 search.go:181] Error while looking for vm=bare-metal-worker-1(byIP) in vc=81-84-38-10.in-addr.arpa and datacenter=cidatacenter-nested-0: No VM found
I0204 18:12:19.516911 1 search.go:186] Did not find node bare-metal-worker-1 in vc=81-84-38-10.in-addr.arpa and datacenter=cidatacenter-nested-0
E0204 18:12:19.516920 1 nodemanager.go:160] WhichVCandDCByNodeID failed using VM name. Err: No VM found
E0204 18:12:19.516927 1 nodemanager.go:205] shakeOutNodeIDLookup failed. Err=No VM found
E0204 18:12:19.516936 1 node_controller.go:285] Error getting instance metadata for node addresses: error fetching node by provider ID: node not found, and error by node name: node not found
```

## Action Items

- [ ] talk to upstream about the daemonset selector
- [ ] can we just update our downstream operator to allow it to be configured?
- [ ] can we just disable CSI altogether?
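On the last action item, before disabling CSI altogether it may be worth checking whether anything in the cluster already depends on the driver. A minimal sketch, assuming the upstream default driver name `csi.vsphere.vmware.com`:

```bash=
# List PersistentVolumes provisioned by the vSphere CSI driver, if any.
oc get pv -o jsonpath='{range .items[?(@.spec.csi.driver=="csi.vsphere.vmware.com")]}{.metadata.name}{"\n"}{end}'

# Show which StorageClasses point at which provisioners.
oc get storageclass -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner
```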