# RKE1: Recovering etcd Without a Snapshot

## Preface

* If a cluster with two etcd nodes loses one, or a cluster with three etcd nodes loses two, etcd loses quorum and degrades to a read-only state. The only way forward is to recover the etcd cluster. On early Rancher versions, or in Rancher environments where automatic backups were never enabled, the recovery has to use the etcd data in the `/var/lib/etcd` directory.
* When etcd on the three control plane nodes reports `request cluster ID mismatch`, you can keep one etcd instance, rebuild the cluster from it with the `--force-new-cluster` flag, and then join the etcd instances on the other two nodes back to the cluster.

## Walkthrough

* Check the environment

```
$ kubectl get no
NAME     STATUS   ROLES                      AGE   VERSION
rke-m1   Ready    controlplane,etcd,worker   80d   v1.28.12
rke-m2   Ready    controlplane,etcd,worker   80d   v1.28.12
rke-m3   Ready    controlplane,etcd,worker   80d   v1.28.12
rke-w1   Ready    worker                     80d   v1.28.12

$ docker exec -it etcd etcdctl member list
37f57fc09a11af9e, started, etcd-rke-m1, https://192.168.11.102:2380, https://192.168.11.102:2379, false
42c538fc0ebfad32, started, etcd-rke-m2, https://192.168.11.53:2380, https://192.168.11.53:2379, false
84645341c9406c2f, started, etcd-rke-m3, https://192.168.11.54:2380, https://192.168.11.54:2379, false

$ kubectl get cs
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS    MESSAGE   ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-0               Healthy   ok
```

### Simulating the failure

1. Remove every other etcd node so that only one remains in the cluster: delete the etcd container on rke-m2 and rke-m3.

```
$ docker rm -f etcd
```

etcd is now reported as unhealthy:

```
$ kubectl get cs
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS      MESSAGE   ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-0               Unhealthy   error getting data from etcd: context deadline exceeded
```

## Recovery

1. On the remaining etcd node (rke-m1), run the following command. It prints the `docker run` command that was used to start the etcd container; save that command for later use.

```
$ docker run --rm -v /var/run/docker.sock:/var/run/docker.sock assaflavie/runlike:latest etcd
Unable to find image 'assaflavie/runlike:latest' locally
latest: Pulling from assaflavie/runlike
43c4264eed91: Pull complete
6a9ddd5be51f: Pull complete
4f4fb700ef54: Pull complete
c34748d1a228: Pull complete
0e631f15cb81: Pull complete
4a380a588d96: Pull complete
a3d222998183: Pull complete
dd05ff3e1671: Pull complete
547955fbd3af: Pull complete
911c38b2b574: Pull complete
d957e402457f: Pull complete
9b631605838e: Pull complete
47096f94733e: Pull complete
8592fececf28: Pull complete
81dfa20479b7: Pull complete
53411a209e61: Pull complete
ac8f6a15e782: Pull complete
66e95e9d7b7a: Pull complete
5a9c0ffdb53a: Pull complete
Digest: sha256:d7ac59e9c60cd817036dc369961f147eeda97db7efab96719c94bee6d6268c07
Status: Downloaded newer image for assaflavie/runlike:latest
docker run --name=etcd --hostname=rke-m1 --user=0 --env=ETCDCTL_API=3 --env=ETCDCTL_CACERT=/etc/kubernetes/ssl/kube-ca.pem --env=ETCDCTL_CERT=/etc/kubernetes/ssl/kube-etcd-192-168-11-102.pem --env=ETCDCTL_KEY=/etc/kubernetes/ssl/kube-etcd-192-168-11-102-key.pem --env=ETCDCTL_ENDPOINTS=https://127.0.0.1:2379 --env=ETCD_UNSUPPORTED_ARCH=x86_64 --volume=/var/lib/etcd:/var/lib/rancher/etcd/ --volume=/etc/kubernetes:/etc/kubernetes --network=host --workdir=/var/lib/etcd --restart=always --label='io.rancher.rke.container.name=etcd' --log-opt max-size=10m --log-opt max-file=5 --runtime=runc --detach=true registry.rancher.com/rancher/mirrored-coreos-etcd:v3.5.10 /usr/local/bin/etcd --name=etcd-rke-m1 --cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 --election-timeout=5000 --initial-advertise-peer-urls=https://192.168.11.102:2380 --initial-cluster=etcd-rke-m3=https://192.168.11.54:2380,etcd-rke-m1=https://192.168.11.102:2380,etcd-rke-m2=https://192.168.11.53:2380 --peer-trusted-ca-file=/etc/kubernetes/ssl/kube-ca.pem --key-file=/etc/kubernetes/ssl/kube-etcd-192-168-11-102-key.pem --initial-cluster-state=new --trusted-ca-file=/etc/kubernetes/ssl/kube-ca.pem --cert-file=/etc/kubernetes/ssl/kube-etcd-192-168-11-102.pem --client-cert-auth=true --data-dir=/var/lib/rancher/etcd/ --initial-cluster-token=etcd-cluster-1 --listen-client-urls=https://0.0.0.0:2379 --listen-peer-urls=https://0.0.0.0:2380 --peer-client-cert-auth=true --peer-cert-file=/etc/kubernetes/ssl/kube-etcd-192-168-11-102.pem --peer-key-file=/etc/kubernetes/ssl/kube-etcd-192-168-11-102-key.pem --advertise-client-urls=https://192.168.11.102:2379 --heartbeat-interval=500
```

2. Stop the etcd container running on rke-m1 and rename it to etcd-old.

```
$ docker stop etcd
$ docker rename etcd etcd-old
```

3. Edit the etcd start command you saved earlier: in the `--initial-cluster` flag, remove the entries for the second and third control plane nodes, and append `--force-new-cluster` at the end. The modified command looks like this:

```
$ docker run --name=etcd --hostname=rke-m1 --user=0 \
    --env=ETCDCTL_API=3 \
    --env=ETCDCTL_CACERT=/etc/kubernetes/ssl/kube-ca.pem \
    --env=ETCDCTL_CERT=/etc/kubernetes/ssl/kube-etcd-192-168-11-102.pem \
    --env=ETCDCTL_KEY=/etc/kubernetes/ssl/kube-etcd-192-168-11-102-key.pem \
    --env=ETCDCTL_ENDPOINTS=https://127.0.0.1:2379 \
    --env=ETCD_UNSUPPORTED_ARCH=x86_64 \
    --volume=/var/lib/etcd:/var/lib/rancher/etcd/ \
    --volume=/etc/kubernetes:/etc/kubernetes \
    --network=host \
    --workdir=/var/lib/etcd \
    --restart=always \
    --label='io.rancher.rke.container.name=etcd' \
    --log-opt max-size=10m \
    --log-opt max-file=5 \
    --runtime=runc \
    --detach=true \
    registry.rancher.com/rancher/mirrored-coreos-etcd:v3.5.10 \
    /usr/local/bin/etcd \
    --name=etcd-rke-m1 \
    --cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 \
    --election-timeout=5000 \
    --initial-advertise-peer-urls=https://192.168.11.102:2380 \
    --initial-cluster=etcd-rke-m1=https://192.168.11.102:2380 \
    --peer-trusted-ca-file=/etc/kubernetes/ssl/kube-ca.pem \
    --key-file=/etc/kubernetes/ssl/kube-etcd-192-168-11-102-key.pem \
    --initial-cluster-state=new \
    --trusted-ca-file=/etc/kubernetes/ssl/kube-ca.pem \
    --cert-file=/etc/kubernetes/ssl/kube-etcd-192-168-11-102.pem \
    --client-cert-auth=true \
    --data-dir=/var/lib/rancher/etcd/ \
    --initial-cluster-token=etcd-cluster-1 \
    --listen-client-urls=https://0.0.0.0:2379 \
    --listen-peer-urls=https://0.0.0.0:2380 \
    --peer-client-cert-auth=true \
    --peer-cert-file=/etc/kubernetes/ssl/kube-etcd-192-168-11-102.pem \
    --peer-key-file=/etc/kubernetes/ssl/kube-etcd-192-168-11-102-key.pem \
    --advertise-client-urls=https://192.168.11.102:2379 \
    --heartbeat-interval=500 \
    --force-new-cluster
```

4. Run the modified command on rke-m1. Once the container is up, check the etcd cluster status.

```
$ docker exec -it -e ETCDCTL_API=3 etcd etcdctl member list -w table
+------------------+---------+-------------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  |    NAME     |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+-------------+-----------------------------+-----------------------------+------------+
| 37f57fc09a11af9e | started | etcd-rke-m1 | https://192.168.11.102:2380 | https://192.168.11.102:2379 |      false |
+------------------+---------+-------------+-----------------------------+-----------------------------+------------+

$ docker exec -it -e ETCDCTL_API=3 etcd etcdctl endpoint status --cluster -w table
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.11.102:2379 | 37f57fc09a11af9e |  3.5.10 |   30 MB |      true |      false |        35 |   36484638 |           36484638 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```

5. With the single node up and running, the other two etcd nodes need to be added back to the cluster. On rke-m1, add a new etcd member:

```
$ MEMBER_IP=192.168.11.53 # IP of rke-m2
$ MEMBER_NAME="rke-m2"
# Keep the settings printed by this command; they are needed later when etcd is started on the new node
$ docker exec -it etcd etcdctl member add etcd-$MEMBER_NAME --peer-urls=https://$MEMBER_IP:2380
Member 86fd8166bc4a7b2d added to cluster 9e1d9b2d2fd350ec

ETCD_NAME="etcd-rke-m2"
ETCD_INITIAL_CLUSTER="etcd-rke-m1=https://192.168.11.102:2380,etcd-rke-m2=https://192.168.11.53:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.11.53:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
```

6. Then, on rke-m2, run the following to recover the node:

```
# Back up the data
$ mv /var/lib/etcd /var/lib/etcd_bak

# Set variables
$ NODE_IP=192.168.11.53
$ ETCD_IMAGE=registry.rancher.com/rancher/mirrored-coreos-etcd:v3.5.10 # match the etcd version your cluster was running
$ ETCD_NAME="etcd-rke-m2"
$ ETCD_INITIAL_CLUSTER="etcd-rke-m1=https://192.168.11.102:2380,etcd-rke-m2=https://192.168.11.53:2380"
$ ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.11.53:2380"
$ ETCD_INITIAL_CLUSTER_STATE="existing"

# Start etcd
$ docker run --name=etcd --hostname=`hostname` \
    --env="ETCDCTL_API=3" \
    --env="ETCDCTL_CACERT=/etc/kubernetes/ssl/kube-ca.pem" \
    --env="ETCDCTL_CERT=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`.pem" \
    --env="ETCDCTL_KEY=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`-key.pem" \
    --env="ETCDCTL_ENDPOINTS=https://127.0.0.1:2379" \
    --env="ETCD_UNSUPPORTED_ARCH=x86_64" \
    --env="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
    --volume="/var/lib/etcd:/var/lib/rancher/etcd/:z" \
    --volume="/etc/kubernetes:/etc/kubernetes:z" \
    --network=host \
    --restart=always \
    --label io.rancher.rke.container.name="etcd" \
    --detach=true \
    $ETCD_IMAGE \
    /usr/local/bin/etcd \
    --peer-client-cert-auth \
    --client-cert-auth \
    --peer-cert-file=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`.pem \
    --peer-key-file=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`-key.pem \
    --cert-file=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`.pem \
    --trusted-ca-file=/etc/kubernetes/ssl/kube-ca.pem \
    --initial-cluster-token=etcd-cluster-1 \
    --peer-trusted-ca-file=/etc/kubernetes/ssl/kube-ca.pem \
    --key-file=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`-key.pem \
    --data-dir=/var/lib/rancher/etcd/ \
    --advertise-client-urls=https://$NODE_IP:2379 \
    --listen-client-urls=https://0.0.0.0:2379 \
    --listen-peer-urls=https://0.0.0.0:2380 \
    --initial-advertise-peer-urls=https://$NODE_IP:2380 \
    --election-timeout=5000 \
    --heartbeat-interval=500 \
    --name=$ETCD_NAME \
    --initial-cluster=$ETCD_INITIAL_CLUSTER \
    --initial-cluster-state=$ETCD_INITIAL_CLUSTER_STATE
```

* After startup, check the status. If everything looks fine, repeat the steps above to add the third etcd node.

```
$ docker exec -it -e ETCDCTL_API=3 etcd etcdctl member list -w table
+------------------+---------+-------------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  |    NAME     |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+-------------+-----------------------------+-----------------------------+------------+
| 37f57fc09a11af9e | started | etcd-rke-m1 | https://192.168.11.102:2380 | https://192.168.11.102:2379 |      false |
| 86fd8166bc4a7b2d | started | etcd-rke-m2 | https://192.168.11.53:2380  | https://192.168.11.53:2379  |      false |
+------------------+---------+-------------+-----------------------------+-----------------------------+------------+

$ docker exec -it -e ETCDCTL_API=3 etcd etcdctl endpoint status --cluster -w table
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.11.102:2379 | 37f57fc09a11af9e |  3.5.10 |   30 MB |      true |      false |        36 |   36487134 |           36487134 |        |
| https://192.168.11.53:2379  | 86fd8166bc4a7b2d |  3.5.10 |   30 MB |     false |      false |        36 |   36487134 |           36487134 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```

7. Recover rke-m3

```
# Back on rke-m1
$ MEMBER_IP=192.168.11.54 # IP of rke-m3
$ MEMBER_NAME="rke-m3"
# Keep the settings printed by this command; they are needed later when etcd is started on the new node
$ docker exec -it etcd etcdctl member add etcd-$MEMBER_NAME --peer-urls=https://$MEMBER_IP:2380
Member a7b94698b22fb60c added to cluster 9e1d9b2d2fd350ec

ETCD_NAME="etcd-rke-m3"
ETCD_INITIAL_CLUSTER="etcd-rke-m1=https://192.168.11.102:2380,etcd-rke-m2=https://192.168.11.53:2380,etcd-rke-m3=https://192.168.11.54:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.11.54:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
```

8. On rke-m3, run the following to recover the node:

```
# Back up the data
$ mv /var/lib/etcd /var/lib/etcd_bak

# Set variables
$ NODE_IP=192.168.11.54
$ ETCD_IMAGE=registry.rancher.com/rancher/mirrored-coreos-etcd:v3.5.10 # match the etcd version your cluster was running
$ ETCD_NAME="etcd-rke-m3"
$ ETCD_INITIAL_CLUSTER="etcd-rke-m1=https://192.168.11.102:2380,etcd-rke-m2=https://192.168.11.53:2380,etcd-rke-m3=https://192.168.11.54:2380"
$ ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.11.54:2380"
$ ETCD_INITIAL_CLUSTER_STATE="existing"

# Start etcd
$ docker run --name=etcd --hostname=`hostname` \
    --env="ETCDCTL_API=3" \
    --env="ETCDCTL_CACERT=/etc/kubernetes/ssl/kube-ca.pem" \
    --env="ETCDCTL_CERT=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`.pem" \
    --env="ETCDCTL_KEY=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`-key.pem" \
    --env="ETCDCTL_ENDPOINTS=https://127.0.0.1:2379" \
    --env="ETCD_UNSUPPORTED_ARCH=x86_64" \
    --env="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
    --volume="/var/lib/etcd:/var/lib/rancher/etcd/:z" \
    --volume="/etc/kubernetes:/etc/kubernetes:z" \
    --network=host \
    --restart=always \
    --label io.rancher.rke.container.name="etcd" \
    --detach=true \
    $ETCD_IMAGE \
    /usr/local/bin/etcd \
    --peer-client-cert-auth \
    --client-cert-auth \
    --peer-cert-file=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`.pem \
    --peer-key-file=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`-key.pem \
    --cert-file=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`.pem \
    --trusted-ca-file=/etc/kubernetes/ssl/kube-ca.pem \
    --initial-cluster-token=etcd-cluster-1 \
    --peer-trusted-ca-file=/etc/kubernetes/ssl/kube-ca.pem \
    --key-file=/etc/kubernetes/ssl/kube-etcd-`echo $NODE_IP|sed 's/\./-/g'`-key.pem \
    --data-dir=/var/lib/rancher/etcd/ \
    --advertise-client-urls=https://$NODE_IP:2379 \
    --listen-client-urls=https://0.0.0.0:2379 \
    --listen-peer-urls=https://0.0.0.0:2380 \
    --initial-advertise-peer-urls=https://$NODE_IP:2380 \
    --election-timeout=5000 \
    --heartbeat-interval=500 \
    --name=$ETCD_NAME \
    --initial-cluster=$ETCD_INITIAL_CLUSTER \
    --initial-cluster-state=$ETCD_INITIAL_CLUSTER_STATE
```

* Confirm that all three etcd nodes have recovered

```
$ docker exec -it -e ETCDCTL_API=3 etcd etcdctl member list -w table
+------------------+---------+-------------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  |    NAME     |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+-------------+-----------------------------+-----------------------------+------------+
| 37f57fc09a11af9e | started | etcd-rke-m1 | https://192.168.11.102:2380 | https://192.168.11.102:2379 |      false |
| 86fd8166bc4a7b2d | started | etcd-rke-m2 | https://192.168.11.53:2380  | https://192.168.11.53:2379  |      false |
| a7b94698b22fb60c | started | etcd-rke-m3 | https://192.168.11.54:2380  | https://192.168.11.54:2379  |      false |
+------------------+---------+-------------+-----------------------------+-----------------------------+------------+

$ docker exec -it -e ETCDCTL_API=3 etcd etcdctl endpoint status --cluster -w table
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.11.102:2379 | 37f57fc09a11af9e |  3.5.10 |   30 MB |      true |      false |        36 |   36489375 |           36489375 |        |
| https://192.168.11.53:2379  | 86fd8166bc4a7b2d |  3.5.10 |   30 MB |     false |      false |        36 |   36489375 |           36489375 |        |
| https://192.168.11.54:2379  | a7b94698b22fb60c |  3.5.10 |   30 MB |     false |      false |        36 |   36489375 |           36489375 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```

```
$ kubectl get cs
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS    MESSAGE   ERROR
scheduler            Healthy   ok
etcd-0               Healthy   ok
controller-manager   Healthy   ok
```

## References

https://blog.csdn.net/Ivan_Wz/article/details/117022325?spm=1001.2014.3001.5502
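
## Appendix: scripting the `--initial-cluster` edit

Editing the saved etcd start command by hand (recovery step 3) is easy to get wrong on a command this long, for example by leaving a stale peer in `--initial-cluster` or appending `--force-new-cluster` twice. The following is a minimal sketch of how that edit could be scripted in Bash. `rewrite_etcd_cmd` is a hypothetical helper written for this article, not part of RKE, Docker, or etcd. It assumes the saved command is a single line whose arguments contain no quoted spaces, which holds for the `runlike` output shown in step 1. The `saved` string in the example is abbreviated for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: rewrite a saved etcd "docker run" command for single-member recovery.
# rewrite_etcd_cmd is a hypothetical helper, not an RKE, Docker, or etcd tool.

rewrite_etcd_cmd() {
  local cmd="$1" keep="$2"   # keep: name of the member to retain, e.g. etcd-rke-m1
  local -a words entries
  local word entry kept="" out=""

  # Split the command on whitespace without triggering glob expansion.
  read -r -a words <<< "$cmd"

  for word in "${words[@]}"; do
    case "$word" in
      --initial-cluster=*)
        # Matches only --initial-cluster=..., not --initial-cluster-state/-token.
        IFS=',' read -r -a entries <<< "${word#--initial-cluster=}"
        for entry in "${entries[@]}"; do
          case "$entry" in "$keep="*) kept="$entry" ;; esac
        done
        if [ -z "$kept" ]; then
          echo "member $keep not found in --initial-cluster" >&2
          return 1
        fi
        word="--initial-cluster=$kept"
        ;;
      --force-new-cluster)
        continue   # dropped here so the flag is appended exactly once below
        ;;
    esac
    out+="${out:+ }$word"
  done

  printf '%s --force-new-cluster\n' "$out"
}

# Example with an abbreviated command; in practice pass the full saved line.
saved='docker run --name=etcd --detach=true registry.rancher.com/rancher/mirrored-coreos-etcd:v3.5.10 /usr/local/bin/etcd --name=etcd-rke-m1 --initial-cluster=etcd-rke-m3=https://192.168.11.54:2380,etcd-rke-m1=https://192.168.11.102:2380,etcd-rke-m2=https://192.168.11.53:2380 --initial-cluster-state=new'
rewrite_etcd_cmd "$saved" etcd-rke-m1
```

The function only prints the rewritten command; it does not run it. Review the output against the original before executing it on the surviving node.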