# DR procedure on OKD 4.10 (AWS)
## Backing up etcd
```bash
[root@ip-10-2-0-87 old_masters]# oc debug node/ip-10-2-3-182.eu-west-1.compute.internal
Starting pod/ip-10-2-3-182eu-west-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.2.3.182
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# /usr/local/bin/cluster-backup.sh /home/core/assets/backup
Certificate /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt is missing. Checking in different directory
Certificate /etc/kubernetes/static-pod-resources/etcd-certs/configmaps/etcd-serving-ca/ca-bundle.crt found!
found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-15
found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-7
found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-7
found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-8
819f2636984a35c02db89264c40a455abca7228cb00c1d67f4110c0e5876de2e
etcdctl version: 3.5.3
API version: 3.5
{"level":"info","ts":"2023-06-14T09:07:08.955Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/home/core/assets/backup/snapshot_2023-06-14_090654.db.part"}
{"level":"info","ts":"2023-06-14T09:07:08.964Z","logger":"client","caller":"v3/maintenance.go:211","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2023-06-14T09:07:08.964Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://10.2.3.182:2379"}
{"level":"info","ts":"2023-06-14T09:07:09.924Z","logger":"client","caller":"v3/maintenance.go:219","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2023-06-14T09:07:10.178Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://10.2.3.182:2379","size":"122 MB","took":"1 second ago"}
{"level":"info","ts":"2023-06-14T09:07:10.178Z","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/home/core/assets/backup/snapshot_2023-06-14_090654.db"}
Snapshot saved at /home/core/assets/backup/snapshot_2023-06-14_090654.db
Deprecated: Use `etcdutl snapshot status` instead.
{"hash":404016960,"revision":806970,"totalKey":10251,"totalSize":121933824}
snapshot db and kube resources are successfully saved to /home/core/assets/backup
```
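The snapshot and static-pod resources now exist only on that node under `/home/core/assets/backup`; copying them off-host is a sensible extra step (a sketch, run from the bastion used throughout this document, assuming SSH access as `core`):
```bash
# Pull the backup artifacts off the recovery control plane node onto the bastion.
mkdir -p ./etcd-backup-2023-06-14
scp 'core@10.2.3.182:/home/core/assets/backup/*' ./etcd-backup-2023-06-14/
```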
## State of nodes pre-DR
```bash
[root@ip-10-2-0-87 old_masters]# oc get node -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-2-3-182.eu-west-1.compute.internal Ready master 41d v1.23.5+3afdacb 10.2.3.182 <none> Fedora CoreOS 35 5.18.5-100.fc35.x86_64 cri-o://1.23.3
ip-10-2-4-159.eu-west-1.compute.internal Ready master 41d v1.23.5+3afdacb 10.2.4.159 <none> Fedora CoreOS 35 5.18.5-100.fc35.x86_64 cri-o://1.23.3
ip-10-2-4-51.eu-west-1.compute.internal Ready worker 41d v1.23.5+3afdacb 10.2.4.51 <none> Fedora CoreOS 35 5.18.5-100.fc35.x86_64 cri-o://1.23.3
ip-10-2-4-84.eu-west-1.compute.internal Ready master 41d v1.23.5+3afdacb 10.2.4.84 <none> Fedora CoreOS 35 5.18.5-100.fc35.x86_64 cri-o://1.23.3
```
## DR starts now - stop the masters
The two non-recovery control plane instances (master-0 / ip-10-2-4-159 and master-2 / ip-10-2-4-84) are stopped at this point, simulating the loss of a majority of the control plane. The recovery node is ip-10-2-3-182 (master-1).
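One way to stop them is from the AWS console; the equivalent CLI call is sketched below (instance IDs taken from the machine listing later in this document, and a configured AWS CLI is assumed):
```bash
# Stop the two non-recovery control plane instances (master-0 and master-2 in this cluster).
aws ec2 stop-instances \
  --region eu-west-1 \
  --instance-ids i-0dae967329c838e07 i-0ee9252aa0e00616f
```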

## Connect to host
Connect to a working control plane node; all backup and restore operations on the cluster state are performed from this node.
```bash
[root@ip-10-2-0-87 old_masters]# ssh core@10.2.3.182
The authenticity of host '10.2.3.182 (10.2.3.182)' can't be established.
ED25519 key fingerprint is SHA256:j9Hzvt9dpoI7peCrSsL7KMXSnvPpRgDDJJZap0McF/A.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '10.2.3.182' (ED25519) to the list of known hosts.
Fedora CoreOS 35
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/tag/coreos
[core@ip-10-2-3-182 ~]$ sudo su
```
## Stopping etcd and kube-apiserver
```bash
[root@ip-10-2-3-182 core]# mv /etc/kubernetes/manifests/etcd-pod.yaml /tmp
[root@ip-10-2-3-182 core]# crictl ps | grep etcd | grep -v operator
22f3966254cad 086cf3429c6217c80a02f0988d88c44900f55a44f87129d0ca820455401f3f50 37 minutes ago Running etcdctl 7 60d2bd8af485f
[root@ip-10-2-3-182 core]# crictl ps | grep etcd | grep -v operator
22f3966254cad 086cf3429c6217c80a02f0988d88c44900f55a44f87129d0ca820455401f3f50 37 minutes ago Running etcdctl 7 60d2bd8af485f
[root@ip-10-2-3-182 core]# mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /tmp
[root@ip-10-2-3-182 core]# crictl ps | grep kube-apiserver | grep -v operator
[root@ip-10-2-3-182 core]# mv /var/lib/etcd/ /tmp
```
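Before restoring, it is worth re-checking that the static pods are really gone and that the old data directory is out of the way (the same `crictl` checks as above, plus an `ls` that follows from the `mv`):
```bash
# Per the upstream procedure both greps should eventually come back empty
# (in this run the etcdctl sidecar lingered for a while, as shown above).
crictl ps | grep etcd | grep -v operator
crictl ps | grep kube-apiserver | grep -v operator
# The previous etcd data directory now sits under /tmp.
ls /tmp/etcd
```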
## Recovering etcd from backup
```bash
[root@ip-10-2-3-182 core]# sudo -E /usr/local/bin/cluster-restore.sh /home/core/assets/backup/
etcdctl is already installed
Deprecated: Use `etcdutl snapshot status` instead.
{"hash":404016960,"revision":806970,"totalKey":10251,"totalSize":121933824}
...stopping kube-apiserver-pod.yaml
...stopping kube-controller-manager-pod.yaml
...stopping kube-scheduler-pod.yaml
...stopping etcd-pod.yaml
Waiting for container etcd to stop
complete
Waiting for container etcdctl to stop
complete
Waiting for container etcd-metrics to stop
complete
Waiting for container kube-controller-manager to stop
..................................complete
Waiting for container kube-apiserver to stop
complete
Waiting for container kube-scheduler to stop
.......................................................................................................complete
starting restore-etcd static pod
starting kube-apiserver-pod.yaml
static-pod-resources/kube-apiserver-pod-15/kube-apiserver-pod.yaml
starting kube-controller-manager-pod.yaml
static-pod-resources/kube-controller-manager-pod-7/kube-controller-manager-pod.yaml
starting kube-scheduler-pod.yaml
static-pod-resources/kube-scheduler-pod-7/kube-scheduler-pod.yaml
```
## Restarting kubelet
```bash
[root@ip-10-2-3-182 core]# systemctl restart kubelet.service
[root@ip-10-2-3-182 core]# systemctl status kubelet.service
● kubelet.service - Kubernetes Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-mco-default-env.conf, 10-mco-default-madv.conf, 20-aws-node-name.conf, 20-logging.conf
Active: active (running) since Wed 2023-06-14 09:34:24 UTC; 10s ago
Process: 93382 ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests (code=exited, status=0/SUCCESS)
Process: 93383 ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state (code=exited, status=0/SUCCESS)
Process: 93384 ExecStartPre=/bin/rm -f /var/lib/kubelet/memory_manager_state (code=exited, status=0/SUCCESS)
Main PID: 93385 (kubelet)
Tasks: 17 (limit: 18726)
Memory: 67.5M
CPU: 1.603s
CGroup: /system.slice/kubelet.service
└─ 93385 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --c>
Jun 14 09:34:33 ip-10-2-3-182 hyperkube[93385]: E0614 09:34:33.601359 93385 kubelet.go:2484] "Error getting node" err="node \"ip-10-2-3-182.eu-west-1.compute.internal\" not found"
Jun 14 09:34:33 ip-10-2-3-182 hyperkube[93385]: E0614 09:34:33.701992 93385 kubelet.go:2484] "Error getting node" err="node \"ip-10-2-3-182.eu-west-1.compute.internal\" not found"
Jun 14 09:34:33 ip-10-2-3-182 hyperkube[93385]: I0614 09:34:33.730253 93385 status_manager.go:674] "Pod was deleted and then recreated, skipping status update" pod="openshift-kube-schedul>
Jun 14 09:34:33 ip-10-2-3-182 hyperkube[93385]: I0614 09:34:33.745359 93385 status_manager.go:674] "Pod was deleted and then recreated, skipping status update" pod="openshift-etcd/etcd-ip>
Jun 14 09:34:33 ip-10-2-3-182 hyperkube[93385]: I0614 09:34:33.774374 93385 status_manager.go:674] "Pod was deleted and then recreated, skipping status update" pod="openshift-kube-apiserv>
Jun 14 09:34:33 ip-10-2-3-182 hyperkube[93385]: I0614 09:34:33.814279 93385 status_manager.go:674] "Pod was deleted and then recreated, skipping status update" pod="openshift-kube-control>
Jun 14 09:34:33 ip-10-2-3-182 hyperkube[93385]: I0614 09:34:33.864321 93385 status_manager.go:674] "Pod was deleted and then recreated, skipping status update" pod="openshift-kube-apiserv>
Jun 14 09:34:33 ip-10-2-3-182 hyperkube[93385]: I0614 09:34:33.949686 93385 kubelet_node_status.go:110] "Node was previously registered" node="ip-10-2-3-182.eu-west-1.compute.internal"
Jun 14 09:34:33 ip-10-2-3-182 hyperkube[93385]: I0614 09:34:33.949786 93385 kubelet_node_status.go:75] "Successfully registered node" node="ip-10-2-3-182.eu-west-1.compute.internal"
Jun 14 09:34:34 ip-10-2-3-182 hyperkube[93385]: I0614 09:34:34.209709 93385 apiserver.go:52] "Watching apiserver"
```
## Node state after etcd recovery
**NOTE** Two masters are offline at this point
```bash
[root@ip-10-2-0-87 old_masters]# oc get nodes -w
NAME STATUS ROLES AGE VERSION
ip-10-2-3-182.eu-west-1.compute.internal Ready master 41d v1.23.5+3afdacb
ip-10-2-4-159.eu-west-1.compute.internal NotReady master 41d v1.23.5+3afdacb
ip-10-2-4-51.eu-west-1.compute.internal Ready worker 41d v1.23.5+3afdacb
ip-10-2-4-84.eu-west-1.compute.internal NotReady master 41d v1.23.5+3afdacb
```
## etcd state on the recovery control plane node
```bash
[core@ip-10-2-3-182 ~]$ sudo crictl ps | grep etcd | egrep -v "operator|etcd-guard"
a2922a8ee0215 086cf3429c6217c80a02f0988d88c44900f55a44f87129d0ca820455401f3f50 5 minutes ago Running etcd 0 b5a01a6b7b65d
```
## Cluster Operators - at this point the `network` operator is progressing because the `multus` DaemonSet is awaiting the 2 unavailable nodes
```bash
network 4.10.0-0.okd-2022-07-09-073606 True True False 41d DaemonSet "openshift-multus/multus" is not available (awaiting 2 nodes)...
```
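The remaining operators can be followed until they settle; `oc get clusteroperators` is the standard check, wrapped in `watch` here for convenience:
```bash
# Re-list the cluster operators every 30 seconds until PROGRESSING/DEGRADED clear.
watch -n 30 oc get clusteroperators
```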
## Restart the Open Virtual Network (OVN) Kubernetes pods on all the hosts
On the recovery master node, remove the northbound database (nbdb) and southbound database (sbdb):
```bash
[root@ip-10-2-0-87 old_masters]# ssh core@10.2.3.182
Fedora CoreOS 35
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/tag/coreos
Last login: Wed Jun 14 09:39:56 2023 from 10.2.0.87
[core@ip-10-2-3-182 ~]$ sudo rm -f /var/lib/ovn/etc/*.db
```
Delete all OVN-Kubernetes control plane pods (`app=ovnkube-master` in the `openshift-ovn-kubernetes` namespace) and list them to confirm they are recreated:
```bash
oc delete pods -l app=ovnkube-master -n openshift-ovn-kubernetes
oc -n openshift-ovn-kubernetes get po -l app=ovnkube-master
NAME READY STATUS RESTARTS AGE
ovnkube-master-l4h29 6/6 Running 0 3m
ovnkube-master-lbsjb 6/6 Terminating 42 (123m ago) 41d
ovnkube-master-wzm4b 6/6 Terminating 43 (123m ago) 41d
```
**NOTE** The `ovnkube-master` pods on the two offline masters did not terminate:
```bash
ovnkube-master-lbsjb 6/6 Terminating 42 (121m ago) 41d
ovnkube-master-wzm4b 6/6 Terminating 43 (120m ago) 41d
# From events: Warning NodeNotReady 26m node-controller Node is not ready
```
I proceeded once `ovnkube-master-l4h29` had 6/6 containers running, even with the rest stuck in Terminating, _under the premise that removing the dead nodes would clear them later_.
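Had these needed clearing immediately, a force delete would have removed them (standard `oc delete` flags; shown only as an option, I did not run it here):
```bash
# Force-remove a pod stuck Terminating on an offline node (pod name from the listing above).
oc -n openshift-ovn-kubernetes delete pod ovnkube-master-lbsjb --grace-period=0 --force
```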
Delete all `ovnkube-node` pods
```bash
[root@ip-10-2-0-87 old_masters]# oc get pods -n openshift-ovn-kubernetes -o name | grep ovnkube-node | while read p ; do oc delete $p -n openshift-ovn-kubernetes ; done
pod "ovnkube-node-5t8dd" deleted
oc get pods -n openshift-ovn-kubernetes^C
[root@ip-10-2-0-87 old_masters]# oc get pods -n openshift-ovn-kubernetes | grep ovnkube
ovnkube-master-l4h29 6/6 Running 0 4m55s
ovnkube-master-lbsjb 6/6 Terminating 42 (125m ago) 41d
ovnkube-master-wzm4b 6/6 Terminating 43 (125m ago) 41d
ovnkube-node-5t8dd 5/5 Terminating 23 (125m ago) 41d
ovnkube-node-g6mlt 5/5 Running 23 (125m ago) 41d
ovnkube-node-k9ztt 5/5 Running 23 (125m ago) 41d
ovnkube-node-q82tw 5/5 Running 24 (125m ago) 41d
[root@ip-10-2-0-87 old_masters]# oc -n openshift-ovn-kubernetes delete po ovnkube-node-g6mlt
pod "ovnkube-node-g6mlt" deleted
[root@ip-10-2-0-87 old_masters]# oc -n openshift-ovn-kubernetes delete po ovnkube-node-k9ztt
pod "ovnkube-node-k9ztt" deleted
^C
[root@ip-10-2-0-87 old_masters]# oc -n openshift-ovn-kubernetes delete po ovnkube-node-q82tw
pod "ovnkube-node-q82tw" deleted
^C
[root@ip-10-2-0-87 old_masters]# oc get pods -n openshift-ovn-kubernetes | grep ovnkube-node
ovnkube-node-5t8dd 5/5 Terminating 23 (126m ago) 41d
ovnkube-node-g6mlt 5/5 Terminating 23 (126m ago) 41d
ovnkube-node-l85nh 4/5 Running 0 13s
ovnkube-node-zw66c 4/5 Running 0 8s
```
**NOTE** The pods were deleted one by one; those scheduled on the offline nodes are stuck in Terminating due to `Node is not ready`.
## Delete and re-create the non-recovery control plane machines, one by one
After the machines are re-created, a new revision is forced and etcd scales up automatically.
```bash
[root@ip-10-2-0-87 old_masters]# oc get machines -n openshift-machine-api -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
org-okd-lab-krvkc-master-0 Running m5.xlarge eu-west-1 eu-west-1b 41d ip-10-2-4-159.eu-west-1.compute.internal aws:///eu-west-1b/i-0dae967329c838e07 running
org-okd-lab-krvkc-master-1 Running m5.xlarge eu-west-1 eu-west-1a 41d ip-10-2-3-182.eu-west-1.compute.internal aws:///eu-west-1a/i-08083013bef2b2049 running
org-okd-lab-krvkc-master-2 Running m5.xlarge eu-west-1 eu-west-1b 41d ip-10-2-4-84.eu-west-1.compute.internal aws:///eu-west-1b/i-0ee9252aa0e00616f running
org-okd-lab-krvkc-worker-eu-west-1b-488pq Running m5.xlarge eu-west-1 eu-west-1b 41d ip-10-2-4-51.eu-west-1.compute.internal aws:///eu-west-1b/i-07bae8eb728f2fb11 running
```
```bash
oc get machine -n openshift-machine-api org-okd-lab-krvkc-master-0
```
For each lost control plane machine, edit the saved manifest and then replace the machine (a scripted sketch of the manifest edit follows this list):
- Remove the entire `status` section
- Change the `metadata.name` field to a new name
- Remove the `spec.providerID` field
- Remove the `metadata.annotations` and `metadata.generation` fields
- Remove the `metadata.resourceVersion` and `metadata.uid` fields
- Delete the machine of the lost control plane host - **NOTE** to get the deletion to complete, I also removed the `metadata.finalizers` field via `oc edit`.
- Verify that the machine was deleted `oc get machines -n openshift-machine-api -o wide`
- Create a machine by using the `new-master-machine.yaml` file `oc apply -f new-master-machine.yaml`
- Verify that the new machine has been created
- Repeat these steps for each lost control plane host that is not the recovery host
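A scripted way to produce the new manifest is sketched below; the field list mirrors the bullets above, the names come from this cluster, and the `yq` (Go implementation, v4) invocation is an assumption rather than what I actually typed:
```bash
# Build the replacement manifest (here dr-master-0.yaml) from the lost machine's manifest.
oc get machine -n openshift-machine-api org-okd-lab-krvkc-master-0 -o yaml \
  | yq 'del(.status) | del(.spec.providerID) | del(.metadata.annotations)
        | del(.metadata.generation) | del(.metadata.resourceVersion) | del(.metadata.uid)
        | .metadata.name = "org-okd-lab-krvkc-master-4"' \
  > dr-master-0.yaml
```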
```bash
[root@ip-10-2-0-87 old_masters]# oc get machines -n openshift-machine-api -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
org-okd-lab-krvkc-master-0 Running m5.xlarge eu-west-1 eu-west-1b 41d ip-10-2-4-159.eu-west-1.compute.internal aws:///eu-west-1b/i-0dae967329c838e07 running
org-okd-lab-krvkc-master-1 Running m5.xlarge eu-west-1 eu-west-1a 41d ip-10-2-3-182.eu-west-1.compute.internal aws:///eu-west-1a/i-08083013bef2b2049 running
org-okd-lab-krvkc-master-2 Running m5.xlarge eu-west-1 eu-west-1b 41d ip-10-2-4-84.eu-west-1.compute.internal aws:///eu-west-1b/i-0ee9252aa0e00616f running
org-okd-lab-krvkc-worker-eu-west-1b-488pq Running m5.xlarge eu-west-1 eu-west-1b 41d ip-10-2-4-51.eu-west-1.compute.internal aws:///eu-west-1b/i-07bae8eb728f2fb11 running
[root@ip-10-2-0-87 old_masters]# oc get machine -n openshift-machine-api org-okd-lab-krvkc-master-0 -oyaml >> dr-master-0.yaml
[root@ip-10-2-0-87 old_masters]# oc get machine -n openshift-machine-api org-okd-lab-krvkc-master-2 -oyaml >> dr-master-2.yaml
[root@ip-10-2-0-87 old_masters]# vim dr-master-0.yaml
[root@ip-10-2-0-87 old_masters]# oc get machines -n openshift-machine-api -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
org-okd-lab-krvkc-master-0 Running m5.xlarge eu-west-1 eu-west-1b 41d ip-10-2-4-159.eu-west-1.compute.internal aws:///eu-west-1b/i-0dae967329c838e07 stopped
org-okd-lab-krvkc-master-1 Running m5.xlarge eu-west-1 eu-west-1a 41d ip-10-2-3-182.eu-west-1.compute.internal aws:///eu-west-1a/i-08083013bef2b2049 running
org-okd-lab-krvkc-master-2 Running m5.xlarge eu-west-1 eu-west-1b 41d ip-10-2-4-84.eu-west-1.compute.internal aws:///eu-west-1b/i-0ee9252aa0e00616f stopped
org-okd-lab-krvkc-worker-eu-west-1b-488pq Running m5.xlarge eu-west-1 eu-west-1b 41d ip-10-2-4-51.eu-west-1.compute.internal aws:///eu-west-1b/i-07bae8eb728f2fb11 running
[root@ip-10-2-0-87 old_masters]# oc -n openshift-machine-api delete machine org-okd-lab-krvkc-master-0
machine.machine.openshift.io "org-okd-lab-krvkc-master-0" deleted
^C
[root@ip-10-2-0-87 old_masters]# oc edit machines org-okd-lab-krvkc-master-0 -n openshift-machine-api
machine.machine.openshift.io/org-okd-lab-krvkc-master-0 edited
[root@ip-10-2-0-87 old_masters]# oc get machines -n openshift-machine-api -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
org-okd-lab-krvkc-master-1 Running m5.xlarge eu-west-1 eu-west-1a 41d ip-10-2-3-182.eu-west-1.compute.internal aws:///eu-west-1a/i-08083013bef2b2049 running
org-okd-lab-krvkc-master-2 Running m5.xlarge eu-west-1 eu-west-1b 41d ip-10-2-4-84.eu-west-1.compute.internal aws:///eu-west-1b/i-0ee9252aa0e00616f stopped
org-okd-lab-krvkc-worker-eu-west-1b-488pq Running m5.xlarge eu-west-1 eu-west-1b 41d ip-10-2-4-51.eu-west-1.compute.internal aws:///eu-west-1b/i-07bae8eb728f2fb11 running
[root@ip-10-2-0-87 old_masters]# oc edit machines org-okd-lab-krvkc-master-2 -n openshift-machine-api
Edit cancelled, no changes made.
[root@ip-10-2-0-87 old_masters]# oc -n openshift-machine-api apply -f dr-master-0.yaml
machine.machine.openshift.io/org-okd-lab-krvkc-master-4 created
[root@ip-10-2-0-87 old_masters]# oc get machines -n openshift-machine-api -o wide
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
org-okd-lab-krvkc-master-1 Running m5.xlarge eu-west-1 eu-west-1a 41d ip-10-2-3-182.eu-west-1.compute.internal aws:///eu-west-1a/i-08083013bef2b2049 running
org-okd-lab-krvkc-master-2 Running m5.xlarge eu-west-1 eu-west-1b 41d ip-10-2-4-84.eu-west-1.compute.internal aws:///eu-west-1b/i-0ee9252aa0e00616f stopped
org-okd-lab-krvkc-master-4 Provisioning m5.xlarge eu-west-1 eu-west-1b 5s aws:///eu-west-1b/i-0b99dd3f963111b98 pending
org-okd-lab-krvkc-worker-eu-west-1b-488pq Running m5.xlarge eu-west-1 eu-west-1b 41d ip-10-2-4-51.eu-west-1.compute.internal aws:///eu-west-1b/i-07bae8eb728f2fb11 running
```
The replacement machine is being provisioned.

- Repeated the same steps for master-2
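The provisioning and the new nodes joining can be watched with the standard listing commands:
```bash
# Watch the replacement machines move from Provisioning to Running ...
oc get machines -n openshift-machine-api -o wide -w
# ... and the new masters register and go Ready.
oc get nodes -w
```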

## Forcing etcd redeployment
- In a separate terminal window, log in to the cluster as a user with the `cluster-admin` role
- In a terminal that has access to the cluster as a `cluster-admin` user, run the following command `oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge`
- When the etcd cluster Operator performs a redeployment, the existing nodes are started with new pods similar to the initial bootstrap scale up
- Verify all nodes are updated to the latest revision `oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'`
At this point I ran the commands above and noticed there were too many nodes:
```bash
tiriyon@pop-os:~$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
etcd.operator.openshift.io/cluster patched
tiriyon@pop-os:~$ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
2 nodes are at revision 0; 3 nodes are at revision 8; 0 nodes have achieved new revision 13
```
This is because the node objects for the unavailable masters were still present:
```bash
tiriyon@pop-os:~$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-2-3-182.eu-west-1.compute.internal Ready master 41d v1.23.5+3afdacb
ip-10-2-4-159.eu-west-1.compute.internal NotReady,SchedulingDisabled master 41d v1.23.5+3afdacb
ip-10-2-4-171.eu-west-1.compute.internal Ready master 11m v1.23.5+3afdacb
ip-10-2-4-210.eu-west-1.compute.internal Ready master 21m v1.23.5+3afdacb
ip-10-2-4-51.eu-west-1.compute.internal Ready worker 41d v1.23.5+3afdacb
ip-10-2-4-84.eu-west-1.compute.internal NotReady,SchedulingDisabled master 41d v1.23.5+3afdacb
```
Hence, I proceeded to delete the stale master node objects:
```bash
tiriyon@pop-os:~$ oc delete node/ip-10-2-4-159.eu-west-1.compute.internal
node "ip-10-2-4-159.eu-west-1.compute.internal" deleted
tiriyon@pop-os:~$ oc delete node ip-10-2-4-84.eu-west-1.compute.internal
node "ip-10-2-4-84.eu-west-1.compute.internal" deleted
```
This takes a little while to converge (over 15 minutes in this run):
```bash
tiriyon@pop-os:~$ while true; do oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}' ; sleep 80; done
1 nodes are at revision 0; 1 nodes are at revision 8; 1 nodes are at revision 17
1 nodes are at revision 0; 1 nodes are at revision 8; 1 nodes are at revision 17; 0 nodes have achieved new revision 18
1 nodes are at revision 8; 1 nodes are at revision 17; 1 nodes are at revision 18; 0 nodes have achieved new revision 19
1 nodes are at revision 17; 1 nodes are at revision 18; 1 nodes are at revision 19
1 nodes are at revision 18; 2 nodes are at revision 19
1 nodes are at revision 18; 2 nodes are at revision 19
AllNodesAtLatestRevision
3 nodes are at revision 19
AllNodesAtLatestRevision
3 nodes are at revision 19
```
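With the stale node objects gone and the redeploy settled, the etcd membership can be cross-checked from inside one of the etcd pods (pod name taken from the listing further down; `etcdctl member list` is the standard check):
```bash
# Expect exactly the three current control plane nodes as members.
oc -n openshift-etcd rsh -c etcdctl etcd-ip-10-2-3-182.eu-west-1.compute.internal \
  etcdctl member list -w table
```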
- After etcd is redeployed, force new rollouts for the control plane. The Kubernetes API server will reinstall itself on the other nodes because the kubelet is connected to API servers using an internal load balancer
- Force a new rollout for the Kubernetes API server `oc patch kubeapiserver cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge`
- Verify all nodes are updated to the latest revision. `oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'`
- Review the `NodeInstallerProgressing` status condition to verify that all nodes are at the latest revision. The output shows `AllNodesAtLatestRevision` upon successful update
```bash
tiriyon@pop-os:~$ while true; do oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}' ; sleep 80 ; done
AllNodesAtLatestRevision
3 nodes are at revision 18
3 nodes are at revision 18; 0 nodes have achieved new revision 19
3 nodes are at revision 18; 0 nodes have achieved new revision 19
2 nodes are at revision 18; 1 nodes are at revision 19
2 nodes are at revision 18; 1 nodes are at revision 19
2 nodes are at revision 18; 1 nodes are at revision 19
1 nodes are at revision 18; 2 nodes are at revision 19
1 nodes are at revision 18; 2 nodes are at revision 19
1 nodes are at revision 18; 2 nodes are at revision 19
AllNodesAtLatestRevision
3 nodes are at revision 19
```
- Force a new rollout for the Kubernetes controller manager: `oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge`
- Verify all nodes are updated to the latest revision: `oc get kubecontrollermanager -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'`
- Force a new rollout for the Kubernetes scheduler in the same way and verify it with the matching `kubescheduler` commands; the captured run below is from this scheduler step
```bash
tiriyon@pop-os:~$ oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
kubescheduler.operator.openshift.io/cluster patched
tiriyon@pop-os:~$ oc get kubescheduler -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
3 nodes are at revision 7; 0 nodes have achieved new revision 8
tiriyon@pop-os:~$ while true; do oc get kubescheduler -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}' ; sleep 80 ; done
2 nodes are at revision 7; 1 nodes are at revision 8
2 nodes are at revision 7; 1 nodes are at revision 8
1 nodes are at revision 7; 2 nodes are at revision 8
AllNodesAtLatestRevision
3 nodes are at revision 8
AllNodesAtLatestRevision
3 nodes are at revision 8
```
- Verify that all control plane hosts have started and joined the cluster.
```bash
tiriyon@pop-os:~$ oc -n openshift-etcd get pods -l k8s-app=etcd
NAME READY STATUS RESTARTS AGE
etcd-ip-10-2-3-182.eu-west-1.compute.internal 4/4 Running 0 49m
etcd-ip-10-2-4-171.eu-west-1.compute.internal 4/4 Running 0 45m
etcd-ip-10-2-4-210.eu-west-1.compute.internal 4/4 Running 0 47m
```
## End node state
```bash
NAME STATUS ROLES AGE VERSION
ip-10-2-3-182.eu-west-1.compute.internal Ready master 42d v1.23.5+3afdacb
ip-10-2-4-171.eu-west-1.compute.internal Ready master 157m v1.23.5+3afdacb
ip-10-2-4-210.eu-west-1.compute.internal Ready master 166m v1.23.5+3afdacb
ip-10-2-4-51.eu-west-1.compute.internal Ready worker 42d v1.23.5+3afdacb
```