# K8s Node Restoration at the VM Level Troubleshooting
### Background
A snapshot of the VM was taken in the Proxmox UI (Take Snapshot with Include RAM enabled), and a few days later the VM was restored from it.
K8s environment:
```
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
tkbp-control-plane Ready control-plane 2d22h v1.32.5 172.22.8.1 <none> Debian GNU/Linux 12 (bookworm) 6.8.0-62-generic containerd://2.1.1
tkbp-worker1 Ready taroko-worker 2d22h v1.32.5 172.22.8.2 <none> Debian GNU/Linux 12 (bookworm) 6.8.0-62-generic containerd://2.1.1
tkbp-worker2 Ready taroko-gateway 2d22h v1.32.5 172.22.8.3 <none> Debian GNU/Linux 12 (bookworm) 6.8.0-62-generic containerd://2.1.1
```
The CNI is Canal `v3.29.2`, with flannel image `flannel:v0.24.4`.
### Problem Description
On a kubeadm cluster running the Canal CNI, Pods get stuck in `ContainerCreating` and never start:
```
$ kubectl get pods -n user9
NAME READY STATUS RESTARTS AGE
n1 0/1 ContainerCreating 0 141m
```
`kubectl describe pod` shows the following event:
```!
Warning FailedCreatePodSandBox 4m52s (x604 over 137m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "25c71fb49bcd00ce537246d4484b8eedba5277e9504163ec1c18f420d0d6f190": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized
```
### Solution
Restart the Canal DaemonSet:
```
kubectl -n kube-system rollout restart ds canal
```
Then run the following command to confirm that all Canal Pods have restarted:
```
# kubectl -n kube-system get pods -l k8s-app=canal
NAME READY STATUS RESTARTS AGE
canal-4xkb9 2/2 Running 0 52s
canal-b2brv 2/2 Running 0 85s
canal-xqqcm 2/2 Running 0 68s
```
### Root Cause
**TL;DR: the Canal Service Account token had expired.**
The `calico-kube-controllers` Pod cannot fetch the `ClusterInformation` resource from the Kubernetes API Server within its deadline, so its initialization fails:
```
$ kubectl -n kube-system logs calico-kube-controllers-6b65fb5f89-77695
```
The logs show the following errors:
```
...
[ERROR][1] kube-controllers/client.go 320: Error getting cluster information config ClusterInformation="default" error=Get "https://10.98.8.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
[INFO][1] kube-controllers/client.go 248: Unable to initialize ClusterInformation error=Get "https://10.98.8.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
[INFO][1] kube-controllers/client.go 254: Unable to initialize default Tier error=client rate limiter Wait returned an error: context deadline exceeded
[INFO][1] kube-controllers/client.go 260: Unable to initialize adminnetworkpolicy Tier error=client rate limiter Wait returned an error: context deadline exceeded
[ERROR][1] kube-controllers/main.go 256: Failed to verify datastore error=Get "https://10.98.8.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
[INFO][1] kube-controllers/main.go 290: Health check is not ready, retrying in 2 seconds with new timeout: 8s
[INFO][1] kube-controllers/resources.go 378: Terminating main client watcher loop
[INFO][1] kube-controllers/resources.go 350: Main client watcher loop
...
```
Test the connection to the API Server using Calico's kubeconfig:
```
kubectl --kubeconfig /etc/cni/net.d/calico-kubeconfig auth whoami
```
Result:
```
error: You must be logged in to the server (Unauthorized)
```
This shows that the problem lies in Calico's kubeconfig.
Run the following command to check the expiry of the token held by the calico user (a Service Account in K8s) in that kubeconfig:
```!
yq e '.users[0].user."token"' /etc/cni/net.d/calico-kubeconfig | \
tr -d '\n' | \
cut -d '.' -f 2 | \
base64 -d | \
jq
```
> To install yq: `wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O /usr/local/bin/yq && chmod +x /usr/local/bin/yq`
Result:
```json
{
"aud": [
"https://kubernetes.default.svc.tkbp.k8s"
],
"exp": 1750924118,
"iat": 1750837718,
"iss": "https://kubernetes.default.svc.tkbp.k8s",
"jti": "378d4f97-c792-4a28-a547-379307b047d7",
"kubernetes.io": {
"namespace": "kube-system",
"serviceaccount": {
"name": "canal",
"uid": "a8e73c8a-8064-46a8-bd6c-f2badbddf663"
}
},
"nbf": 1750837718,
"sub": "system:serviceaccount:kube-system:canal"
}
```
> `iat`: the time the token was issued.
> `exp`: the time the token expires.

Convert the token's expiry time to a human-readable format:
```
# date -d @1750924118
```
Result:
```
Thu Jun 26 07:48:38 UTC 2025
```
That is the problem: the machine's clock is already at 6/28, so the token has expired.
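The comparison above can also be done without reading dates by eye. The following is a minimal sketch, not part of the original investigation: `token_status` is a hypothetical helper name, and the second argument in the example (`1751068800`, which is 2025-06-28 00:00:00 UTC) is an assumed stand-in for the machine's clock on 6/28.

```shell
# token_status EXP [NOW]: compare a JWT "exp" claim (epoch seconds) with NOW.
# NOW defaults to the current time; pass it explicitly to replay a past check.
# (Hypothetical helper, for illustration only.)
token_status() {
  exp=$1
  now=${2:-$(date +%s)}
  if [ "$now" -ge "$exp" ]; then
    echo "expired $((now - exp))s ago"
  else
    echo "valid for $((exp - now))s"
  fi
}

# exp taken from calico-kubeconfig; the NOW value is an assumption (6/28 00:00 UTC).
token_status 1750924118 1751068800   # prints: expired 144682s ago
```

Calling `token_status` with only the `exp` value compares it against the node's current clock instead.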
Check when the token was issued:
```
# date -d @1750837718
Wed Jun 25 07:48:38 UTC 2025
```
So the token in `calico-kubeconfig` was issued with a lifetime of 1 day (`exp` - `iat` = 86,400 seconds).
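The lifetime can be computed directly from the two claims rather than by comparing the two dates. A small sketch, assuming only POSIX shell arithmetic; `jwt_lifetime` is a hypothetical helper name, not something shipped with Calico or kubectl.

```shell
# jwt_lifetime IAT EXP: print a token's validity window in seconds and whole hours.
# (Hypothetical helper, for illustration only.)
jwt_lifetime() {
  secs=$(( $2 - $1 ))
  echo "${secs}s ($((secs / 3600))h)"
}

# iat and exp taken from the decoded calico-kubeconfig token.
jwt_lifetime 1750837718 1750924118   # prints: 86400s (24h)
```

The same helper works for any other token once its `iat` and `exp` claims have been decoded.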
To inspect the Service Account token mounted inside a pod, `exec` into the pod and run:
```
# Take the JWT payload, map base64url to base64, restore the "=" padding, then decode.
cat /var/run/secrets/kubernetes.io/serviceaccount/token \
| cut -d "." -f2 \
| tr -d "\n" \
| tr '_-' '/+' \
| awk '{pad=(4-length($0)%4)%4; printf "%s", $0; for(i=0;i<pad;i++) printf "="}' \
| base64 -d \
| jq
```
Result:
```
{
"aud": [
"https://kubernetes.default.svc.tkbp.k8s"
],
"exp": 1782720071,
"iat": 1751184071,
"iss": "https://kubernetes.default.svc.tkbp.k8s",
"jti": "e64feb22-412e-4065-b9b9-5c45889958a9",
"kubernetes.io": {
"namespace": "taroko",
"node": {
"name": "tkbp-worker2",
"uid": "f6595a85-3a61-411b-a186-1c2f6bd32a26"
},
"pod": {
"name": "taroko-tkadm-794b86456f-qbwlf",
"uid": "e4fd32f5-a9c0-41d1-aa34-b63203ccd3d0"
},
"serviceaccount": {
"name": "tkadm",
"uid": "21b873fc-1ad0-4442-a66e-acf5fbe78311"
},
"warnafter": 1751187678
},
"nbf": 1751184071,
"sub": "system:serviceaccount:taroko:tkadm"
}
```
For comparison, this pod-mounted token has `exp` - `iat` = 31,536,000 seconds (365 days), and its `warnafter` is `iat` + 3,607 seconds. The 1-day lifetime seen earlier therefore applies to the token stored in `calico-kubeconfig`, not to every Service Account token in the cluster.