# K8s Node Restoration at the VM Level Troubleshooting

### Background

A snapshot of the VM was taken in the Proxmox UI (Take Snapshot with Include RAM enabled). A few days later, the VM was restored from that snapshot.

K8s environment:

```
NAME                 STATUS   ROLES            AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION     CONTAINER-RUNTIME
tkbp-control-plane   Ready    control-plane    2d22h   v1.32.5   172.22.8.1    <none>        Debian GNU/Linux 12 (bookworm)   6.8.0-62-generic   containerd://2.1.1
tkbp-worker1         Ready    taroko-worker    2d22h   v1.32.5   172.22.8.2    <none>        Debian GNU/Linux 12 (bookworm)   6.8.0-62-generic   containerd://2.1.1
tkbp-worker2         Ready    taroko-gateway   2d22h   v1.32.5   172.22.8.3    <none>        Debian GNU/Linux 12 (bookworm)   6.8.0-62-generic   containerd://2.1.1
```

The CNI is Canal `v3.29.2`, with flannel `flannel:v0.24.4`.

### Problem Description

In a Kubeadm cluster that uses the Canal CNI, Pods are stuck in `ContainerCreating` and cannot start:

```
$ kubectl get pods -n user9
NAME   READY   STATUS              RESTARTS   AGE
n1     0/1     ContainerCreating   0          141m
```

Describing the Pod shows the following Event:

```!
Warning  FailedCreatePodSandBox  4m52s (x604 over 137m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "25c71fb49bcd00ce537246d4484b8eedba5277e9504163ec1c18f420d0d6f190": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized
```

### Solution

Restart the Canal DaemonSet:

```
kubectl -n kube-system rollout restart ds canal
```

Run the following command to confirm that all Canal Pods have restarted:

```
# kubectl -n kube-system get pods -l k8s-app=canal
NAME          READY   STATUS    RESTARTS   AGE
canal-4xkb9   2/2     Running   0          52s
canal-b2brv   2/2     Running   0          85s
canal-xqqcm   2/2     Running   0          68s
```

### Root Cause

**TL;DR: The Canal Service Account token had expired.**

The `calico-kube-controllers` Pod could not fetch the `ClusterInformation` resource from the Kubernetes API Server before the deadline, so its initialization failed.

```
$ kubectl -n kube-system logs calico-kube-controllers-6b65fb5f89-77695
```

The log shows the following errors:

```
...
[ERROR][1] kube-controllers/client.go 320: Error getting cluster information config ClusterInformation="default" error=Get "https://10.98.8.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
[INFO][1] kube-controllers/client.go 248: Unable to initialize ClusterInformation error=Get "https://10.98.8.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
[INFO][1] kube-controllers/client.go 254: Unable to initialize default Tier error=client rate limiter Wait returned an error: context deadline exceeded
[INFO][1] kube-controllers/client.go 260: Unable to initialize adminnetworkpolicy Tier error=client rate limiter Wait returned an error: context deadline exceeded
[ERROR][1] kube-controllers/main.go 256: Failed to verify datastore error=Get "https://10.98.8.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
[INFO][1] kube-controllers/main.go 290: Health check is not ready, retrying in 2 seconds with new timeout: 8s
[INFO][1] kube-controllers/resources.go 378: Terminating main client watcher loop
[INFO][1] kube-controllers/resources.go 350: Main client watcher loop
...
```

Test the connection to the API Server using Calico's kubeconfig:

```
kubectl --kubeconfig /etc/cni/net.d/calico-kubeconfig auth whoami
```

Result:

```
error: You must be logged in to the server (Unauthorized)
```

This shows that the problem lies in Calico's kubeconfig.

Run the following command to check the validity period of the token that belongs to the calico user in the kubeconfig (in K8s, this user is a Service Account):

```!
yq e '.users[0].user."token"' /etc/cni/net.d/calico-kubeconfig | \
  tr -d '\n' | \
  cut -d '.' -f 2 | \
  base64 -d | \
  jq
```

> yq installation command: `wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O /usr/local/bin/yq && chmod +x /usr/local/bin/yq`

Result:

```json
{
  "aud": [
    "https://kubernetes.default.svc.tkbp.k8s"
  ],
  "exp": 1750924118,
  "iat": 1750837718,
  "iss": "https://kubernetes.default.svc.tkbp.k8s",
  "jti": "378d4f97-c792-4a28-a547-379307b047d7",
  "kubernetes.io": {
    "namespace": "kube-system",
    "serviceaccount": {
      "name": "canal",
      "uid": "a8e73c8a-8064-46a8-bd6c-f2badbddf663"
    }
  },
  "nbf": 1750837718,
  "sub": "system:serviceaccount:kube-system:canal"
}
```

> `iat`: the time the token was issued. `exp`: the time the token expires.

Convert the token's expiry time to a human-readable format:

```
# date -d @1750924118
```

Result:

```
Thu Jun 26 07:48:38 UTC 2025
```

This is the problem: the machine's current date is already June 28, so the token has expired.

Look up the token's issue date:

```
# date -d @1750837718
Wed Jun 25 07:48:38 UTC 2025
```

These results show that this Service Account token was issued with a validity period of 1 day (`exp` - `iat` = 86400 seconds).

Command for checking the SA token after `exec`ing into a Pod:

```
# Pad the payload with "=" to a multiple of 4 characters, then decode it.
cat /var/run/secrets/kubernetes.io/serviceaccount/token \
  | cut -d "." -f2 \
  | tr -d "\n" \
  | awk '{ printf "%s", $0; pad = (4 - length($0) % 4) % 4; while (pad-- > 0) printf "=" }' \
  | base64 -d \
  | jq
```

Result:

```json
{
  "aud": [
    "https://kubernetes.default.svc.tkbp.k8s"
  ],
  "exp": 1782720071,
  "iat": 1751184071,
  "iss": "https://kubernetes.default.svc.tkbp.k8s",
  "jti": "e64feb22-412e-4065-b9b9-5c45889958a9",
  "kubernetes.io": {
    "namespace": "taroko",
    "node": {
      "name": "tkbp-worker2",
      "uid": "f6595a85-3a61-411b-a186-1c2f6bd32a26"
    },
    "pod": {
      "name": "taroko-tkadm-794b86456f-qbwlf",
      "uid": "e4fd32f5-a9c0-41d1-aa34-b63203ccd3d0"
    },
    "serviceaccount": {
      "name": "tkadm",
      "uid": "21b873fc-1ad0-4442-a66e-acf5fbe78311"
    },
    "warnafter": 1751187678
  },
  "nbf": 1751184071,
  "sub": "system:serviceaccount:taroko:tkadm"
}
```
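
The manual checks in the Root Cause section (extract the payload, restore the base64 padding, decode it, then compare `exp` with the current time) can be collected into a few shell helpers. This is only a minimal sketch and not part of the original procedure: the function names `jwt_payload`, `jwt_claim`, `jwt_lifetime`, and `jwt_expired` are hypothetical, and it assumes a POSIX-style shell with `base64 -d` and `sed` available. The claim is extracted with a simple `sed` pattern that only handles numeric claims; `jq -r '.exp'` is the more robust choice where `jq` is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting a Service Account JWT.
# The function names are illustrative and do not come from any tool.

# Print the decoded JSON payload of the JWT passed as $1.
jwt_payload() {
  local payload
  # The payload is the second dot-separated field of the token.
  payload=$(printf '%s' "$1" | cut -d '.' -f 2)
  # JWTs use base64url: map '_' and '-' back to '/' and '+'.
  payload=$(printf '%s' "$payload" | tr '_-' '/+')
  # Restore the '=' padding that base64url encoding strips.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do
    payload="${payload}="
  done
  printf '%s' "$payload" | base64 -d
}

# Print the numeric claim named $2 (for example exp or iat) of token $1.
# Simple pattern match, assuming the claim value is an integer.
jwt_claim() {
  jwt_payload "$1" \
    | sed -n 's/.*"'"$2"'"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Print the token lifetime in seconds (exp - iat).
jwt_lifetime() {
  local exp iat
  exp=$(jwt_claim "$1" exp)
  iat=$(jwt_claim "$1" iat)
  echo $(( exp - iat ))
}

# Return 0 if token $1 is expired at epoch time $2 (default: now),
# 1 if it is still valid, 2 if no exp claim could be read.
jwt_expired() {
  local exp now
  exp=$(jwt_claim "$1" exp)
  [ -n "$exp" ] || return 2
  now=${2:-$(date +%s)}
  [ "$exp" -le "$now" ]
}
```

On a node, the helpers could be used as `jwt_expired "$(yq e '.users[0].user.token' /etc/cni/net.d/calico-kubeconfig)" && echo expired`. For the two tokens decoded above, `exp` - `iat` is 86400 seconds (1 day) for the `canal` token in the kubeconfig and 31536000 seconds (365 days) for the `tkadm` token mounted in the Pod.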