# MinIO restore downstream cluster

* This walkthrough assumes that every master node has been destroyed and that the etcd backups can no longer be retrieved from local disk, so the cluster is restored from the backup stored in MinIO.
* The demo uses an RKE2 cluster with one master and one worker (1m1w).

```
$ kubectl get no
NAME      STATUS   ROLES                       AGE   VERSION
rke2      Ready    control-plane,etcd,master   15h   v1.27.12+rke2r1
rke2-w1   Ready    worker                      15h   v1.27.12+rke2r1
```

## backup to minio

* Create a test deployment, then take a snapshot to MinIO.

![image](https://hackmd.io/_uploads/B1KtTM9g0.png)

```
$ kubectl create deploy test --image=nginx
deployment.apps/test created

$ kubectl get po
NAME                    READY   STATUS    RESTARTS   AGE
test-5746d4c59f-4djj5   1/1     Running   0          30s
```

* If the downstream cluster already has its S3 backup settings configured in the Rancher UI, you can take the snapshot from the Rancher UI.

![image](https://hackmd.io/_uploads/HyJGLBcgC.png)

![image](https://hackmd.io/_uploads/S1k5g8oeR.png)

```
# Alternatively, take the backup with the rke2 command (the S3 settings must be configured in Rancher first)
$ rke2 etcd-snapshot save --name test-etcd-shapshot
```

* After the backup completes, delete the test deployment.

```
$ kubectl delete deploy test
deployment.apps "test" deleted

$ kubectl get po
No resources found in default namespace.
```

## Simulate a complete master failure

* Delete the master VM to simulate losing the master.

## restore

1. In the Rancher UI, delete the destroyed master node.

    ![image](https://hackmd.io/_uploads/rkPeMIseR.png)

    * If the node cannot be deleted, open Edit YAML next to it and set the `finalizers` field to an empty value.

    ![image](https://hackmd.io/_uploads/r13AOwoeA.png)

2. Create a new VM. This demo reuses the same IP and hostname, but neither has to match the original.
3. From Rancher, register the master again on the clean VM.
4. Assign the roles the master needs to the new master node.

    ![image](https://hackmd.io/_uploads/B1mVwYsg0.png)

* After registration, `rancher-system-agent.service` stalls at this point:

```
$ journalctl -u rancher-system-agent.service -f
Apr 16 09:23:29 rke2 systemd[1]: Started Rancher System Agent.
Apr 16 09:23:29 rke2 rancher-system-agent[2828]: time="2024-04-16T09:23:29+08:00" level=info msg="Rancher System Agent version v0.3.4 (63eb11a) is starting"
Apr 16 09:23:29 rke2 rancher-system-agent[2828]: time="2024-04-16T09:23:29+08:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Apr 16 09:23:29 rke2 rancher-system-agent[2828]: time="2024-04-16T09:23:29+08:00" level=info msg="Starting remote watch of plans"
Apr 16 09:23:29 rke2 rancher-system-agent[2828]: time="2024-04-16T09:23:29+08:00" level=info msg="Starting /v1, Kind=Secret controller"
```

* Restore from the Rancher UI.
    * The first restore attempt fails, which is expected. Its purpose is to get the rke2 commands onto the node so that Kubernetes can first be started through `rke2-server.service`.

![image](https://hackmd.io/_uploads/BkCoaPigA.png)

* After the restore attempt, copy the rke2 commands to `/usr/local/bin/`.

```
$ sudo cp /opt/rke2/bin/* /usr/local/bin/
```

* Start `rke2-server.service`.

```
$ systemctl enable --now rke2-server.service
```

* Copy the kubeconfig to `.kube/config`.

```
$ mkdir .kube
$ cp /etc/rancher/rke2/rke2.yaml .kube/config
$ cp /var/lib/rancher/rke2/bin/kubectl /usr/local/bin/
```

* Confirm that the RKE2 Kubernetes cluster is healthy.

```
$ kubectl get no
NAME   STATUS   ROLES                       AGE   VERSION
rke2   Ready    control-plane,etcd,master   14m   v1.27.12+rke2r1

$ kubectl get po -A
NAMESPACE         NAME                                                    READY   STATUS      RESTARTS      AGE
calico-system     calico-kube-controllers-6f7458cdd4-4lq75                1/1     Running     0             10m
calico-system     calico-node-nnmhr                                       1/1     Running     0             10m
calico-system     calico-typha-6fb7c54b99-bjvjh                           1/1     Running     0             10m
cattle-system     cattle-cluster-agent-6f5dc4c999-2bhgd                   1/1     Running     0             2m33s
kube-system       cloud-controller-manager-rke2                           1/1     Running     0             13m
kube-system       etcd-rke2                                               1/1     Running     0             13m
kube-system       helm-install-rke2-calico-crd-6wmsg                      0/1     Completed   0             12m
kube-system       helm-install-rke2-calico-rtnn4                          0/1     Completed   2             12m
kube-system       helm-install-rke2-coredns-vgpxv                         0/1     Completed   0             12m
kube-system       helm-install-rke2-ingress-nginx-bfht8                   0/1     Completed   0             12m
kube-system       helm-install-rke2-metrics-server-7fm28                  0/1     Completed   0             12m
kube-system       helm-install-rke2-snapshot-controller-6rnqb             0/1     Completed   1             12m
kube-system       helm-install-rke2-snapshot-controller-crd-j8jj8         0/1     Completed   0             12m
kube-system       helm-install-rke2-snapshot-validation-webhook-k8k7s     0/1     Completed   0             12m
kube-system       kube-apiserver-rke2                                     1/1     Running     0             13m
kube-system       kube-controller-manager-rke2                            0/1     Running     1 (35s ago)   13m
kube-system       kube-proxy-rke2                                         1/1     Running     0             13m
kube-system       kube-scheduler-rke2                                     1/1     Running     0             13m
kube-system       rke2-coredns-rke2-coredns-7df444dc8c-7srx7              1/1     Running     0             11m
kube-system       rke2-coredns-rke2-coredns-autoscaler-7f998d489f-chgwm   1/1     Running     0             11m
kube-system       rke2-ingress-nginx-controller-w8k82                     1/1     Running     0             2m29s
kube-system       rke2-metrics-server-5958dcbf87-lk4v2                    1/1     Running     0             6m44s
kube-system       rke2-snapshot-controller-6ddf89d6f8-7v7rl               1/1     Running     0             6m25s
kube-system       rke2-snapshot-validation-webhook-6bcccc7947-kz2fn       1/1     Running     0             6m27s
tigera-operator   tigera-operator-7fcbc459cd-kq8bj                        1/1     Running     0             11m
```

* Run the restore a second time from the Rancher UI.

![image](https://hackmd.io/_uploads/ryZw5DixR.png)

* Copy the new kubeconfig.

```
$ cp /etc/rancher/rke2/rke2.yaml .kube/config
```

* Confirm that the original cluster has been fully restored.

```
$ kubectl get no
NAME      STATUS   ROLES                              AGE   VERSION
rke2      Ready    control-plane,etcd,master,worker   53m   v1.27.12+rke2r1
rke2-w1   Ready    worker                             41m   v1.27.12+rke2r1

$ kubectl get po -A
NAMESPACE             NAME                                                    READY   STATUS      RESTARTS        AGE
calico-system         calico-kube-controllers-6d5b8b76bc-dspcg                1/1     Running     0               4m32s
calico-system         calico-node-jbrh7                                       1/1     Running     0               4m27s
calico-system         calico-node-sffjl                                       1/1     Running     0               70s
calico-system         calico-typha-77779686d7-xhz9p                           1/1     Running     0               4m32s
cattle-fleet-system   fleet-agent-5f6bcbdc8-mxs76                             1/1     Running     0               4m22s
cattle-system         cattle-cluster-agent-6cbc85774d-shx5x                   1/1     Running     3 (3m24s ago)   4m28s
cattle-system         cattle-cluster-agent-6cbc85774d-wrh4r                   1/1     Running     3 (3m22s ago)   4m27s
cattle-system         rancher-webhook-58457c4f67-qh6r8                        1/1     Running     0               4m25s
cattle-system         system-upgrade-controller-8694b8c9c7-xdcxg              1/1     Running     1 (3m34s ago)   4m24s
default               test-5746d4c59f-gb6f6                                   1/1     Running     1 (2m4s ago)    36m
kube-system           cloud-controller-manager-rke2                           1/1     Running     1 (2m12s ago)   6m16s
kube-system           etcd-rke2                                               1/1     Running     0               6m33s
kube-system           helm-install-rke2-calico-8d569                          0/1     Completed   2               53m
kube-system           helm-install-rke2-calico-crd-mzbm6                      0/1     Completed   0               53m
kube-system           helm-install-rke2-coredns-pcxnz                         0/1     Completed   0               53m
kube-system           helm-install-rke2-ingress-nginx-sjsgf                   0/1     Completed   0               53m
kube-system           helm-install-rke2-metrics-server-rwzdn                  0/1     Completed   0               53m
kube-system           helm-install-rke2-snapshot-controller-crd-hhd8n         0/1     Completed   0               53m
kube-system           helm-install-rke2-snapshot-controller-jg8dg             0/1     Completed   1               53m
kube-system           helm-install-rke2-snapshot-validation-webhook-xlc5l     0/1     Completed   0               53m
kube-system           kube-apiserver-rke2                                     1/1     Running     0               6m33s
kube-system           kube-controller-manager-rke2                            1/1     Running     1 (2m12s ago)   6m16s
kube-system           kube-proxy-rke2                                         1/1     Running     0               6m11s
kube-system           kube-proxy-rke2-w1                                      1/1     Running     0               112s
kube-system           kube-scheduler-rke2                                     1/1     Running     1 (2m7s ago)    6m16s
kube-system           rke2-coredns-rke2-coredns-7df444dc8c-6t5m4              1/1     Running     0               4m40s
kube-system           rke2-coredns-rke2-coredns-7df444dc8c-crtc7              1/1     Running     0               4m43s
kube-system           rke2-coredns-rke2-coredns-autoscaler-7f998d489f-rs9fk   1/1     Running     0               4m40s
kube-system           rke2-ingress-nginx-controller-w92lq                     1/1     Running     0               70s
kube-system           rke2-metrics-server-5958dcbf87-mnsn2                    1/1     Running     0               4m38s
kube-system           rke2-snapshot-controller-6ddf89d6f8-pd9pt               1/1     Running     6 (29m ago)     38m
kube-system           rke2-snapshot-validation-webhook-6bcccc7947-bhhcx       1/1     Running     2 (46s ago)     38m
tigera-operator       tigera-operator-7fcbc459cd-h5nzr                        1/1     Running     0               4m36s
```

* Confirm that the deployment created before the backup has been restored.

```
$ kubectl get po
NAME                    READY   STATUS    RESTARTS      AGE
test-5746d4c59f-gb6f6   1/1     Running   1 (83s ago)   36m
```

* In the Rancher UI, confirm that the downstream cluster has recovered.

![image](https://hackmd.io/_uploads/SkGsnPjeA.png)

## References

* https://www.suse.com/support/kb/doc/?id=000020695
* https://github.com/rancher/rancher/issues/41080
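
## Appendix: scripted node check

The `kubectl get no` checks in the restore steps are read by eye. As a rough sketch, the same check could be scripted so it can be repeated after each restore attempt. `all_nodes_ready` is a hypothetical helper name, not part of rke2, Rancher, or kubectl. It assumes the default `kubectl get nodes` column layout, where the second column is STATUS.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with rke2, Rancher, or kubectl).
# Reads `kubectl get nodes` output on stdin and returns 0 only when at
# least one node is listed and every node's STATUS is exactly "Ready".
# Assumption: default column layout, with STATUS as the second column.
all_nodes_ready() {
  awk '
    $1 == "NAME" { next }                  # skip the header row if present
    NF >= 2 {
      total++                              # count every node row
      if ($2 != "Ready") notready++        # anything else counts as not ready
    }
    END {
      if (total > 0 && notready == 0) exit 0
      exit 1
    }
  '
}
```

On the restored master this could be used as `kubectl get no | all_nodes_ready && echo "all nodes Ready"`. A node reported as `Ready,SchedulingDisabled` counts as not ready in this sketch, and empty input is treated as a failure.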