K8s Node Restoration at the VM Level Troubleshooting

Background

In the Proxmox UI, a snapshot of the VM was taken (with Include RAM enabled); a few days later, the VM was restored from that snapshot.

K8s environment

NAME                 STATUS   ROLES            AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION     CONTAINER-RUNTIME
tkbp-control-plane   Ready    control-plane    2d22h   v1.32.5   172.22.8.1    <none>        Debian GNU/Linux 12 (bookworm)   6.8.0-62-generic   containerd://2.1.1
tkbp-worker1         Ready    taroko-worker    2d22h   v1.32.5   172.22.8.2    <none>        Debian GNU/Linux 12 (bookworm)   6.8.0-62-generic   containerd://2.1.1
tkbp-worker2         Ready    taroko-gateway   2d22h   v1.32.5   172.22.8.3    <none>        Debian GNU/Linux 12 (bookworm)   6.8.0-62-generic   containerd://2.1.1

The CNI is Canal v3.29.2 with flannel:v0.24.4 (configuration details).

Problem description

On a kubeadm cluster running the Canal CNI, Pods get stuck in ContainerCreating and never start, as shown below:

$ kubectl get pods -n user9
NAME   READY   STATUS              RESTARTS   AGE
n1     0/1     ContainerCreating   0          141m

Describing the pod shows the following event:

  Warning  FailedCreatePodSandBox  4m52s (x604 over 137m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "25c71fb49bcd00ce537246d4484b8eedba5277e9504163ec1c18f420d0d6f190": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized

Solution

kubectl -n kube-system rollout restart ds canal

Run the following command to double-check that all Canal Pods have restarted

# kubectl -n kube-system get pods -l k8s-app=canal
NAME          READY   STATUS    RESTARTS   AGE
canal-4xkb9   2/2     Running   0          52s
canal-b2brv   2/2     Running   0          85s
canal-xqqcm   2/2     Running   0          68s

Root Cause

TL;DR: The Canal Service Account token has expired.

Calico's calico-kube-controllers pod could not fetch the ClusterInformation resource from the Kubernetes API Server within the deadline, so its initialization failed.

$ kubectl -n kube-system logs calico-kube-controllers-6b65fb5f89-77695

The log contains the following errors

...
[ERROR][1] kube-controllers/client.go 320: Error getting cluster information config ClusterInformation="default" error=Get "https://10.98.8.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
[INFO][1] kube-controllers/client.go 248: Unable to initialize ClusterInformation error=Get "https://10.98.8.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
[INFO][1] kube-controllers/client.go 254: Unable to initialize default Tier error=client rate limiter Wait returned an error: context deadline exceeded
[INFO][1] kube-controllers/client.go 260: Unable to initialize adminnetworkpolicy Tier error=client rate limiter Wait returned an error: context deadline exceeded
[ERROR][1] kube-controllers/main.go 256: Failed to verify datastore error=Get "https://10.98.8.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
[INFO][1] kube-controllers/main.go 290: Health check is not ready, retrying in 2 seconds with new timeout: 8s
[INFO][1] kube-controllers/resources.go 378: Terminating main client watcher loop
[INFO][1] kube-controllers/resources.go 350: Main client watcher loop
...

Test the connection to the API Server using the Calico kubeconfig

kubectl --kubeconfig /etc/cni/net.d/calico-kubeconfig auth whoami

Result

error: You must be logged in to the server (Unauthorized)

This result shows that the problem lies in the Calico kubeconfig.

Run the following command to check the validity period of the token for the calico user in the kubeconfig (in K8s terms, a Service Account)

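# The JWT payload is base64url-encoded without padding, so map the
# URL-safe characters back and re-add '=' before decoding.
yq e '.users[0].user."token"' /etc/cni/net.d/calico-kubeconfig | \
  tr -d '\n' | \
  cut -d '.' -f 2 | \
  tr '_-' '/+' | \
  awk '{ while (length($0) % 4) $0 = $0 "="; printf "%s", $0 }' | \
  base64 -d | \
  jq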

To install yq: wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O /usr/local/bin/yq && chmod +x /usr/local/bin/yq

Result:

{
  "aud": [
    "https://kubernetes.default.svc.tkbp.k8s"
  ],
  "exp": 1750924118,
  "iat": 1750837718,
  "iss": "https://kubernetes.default.svc.tkbp.k8s",
  "jti": "378d4f97-c792-4a28-a547-379307b047d7",
  "kubernetes.io": {
    "namespace": "kube-system",
    "serviceaccount": {
      "name": "canal",
      "uid": "a8e73c8a-8064-46a8-bd6c-f2badbddf663"
    }
  },
  "nbf": 1750837718,
  "sub": "system:serviceaccount:kube-system:canal"
}

iat: the time the token was issued.
exp: the time the token expires.
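
The decoding step can also be wrapped in a small reusable helper. The sketch below is only an illustration, not part of the original procedure: `jwt_payload` is a hypothetical function name, and the sample token is synthetic, built on the spot from the `exp` and `iat` values shown above. It assumes GNU coreutils `base64` and an `awk` are available. JWT payloads are base64url-encoded with the padding stripped, so the helper maps the URL-safe characters back and re-adds `=` before decoding.

```shell
# jwt_payload (hypothetical helper): read a JWT on stdin, print its decoded payload.
jwt_payload() {
  cut -d '.' -f 2 \
    | tr -d '\n' \
    | tr '_-' '/+' \
    | awk '{ while (length($0) % 4) $0 = $0 "="; printf "%s", $0 }' \
    | base64 -d
}

# Build a synthetic token (header.payload.signature) for illustration only.
claims='{"exp":1750924118,"iat":1750837718}'
payload=$(printf '%s' "$claims" | base64 | tr '+/' '-_' | tr -d '=\n')
token="header.${payload}.signature"

# Decode it again; the output is the original claims JSON.
printf '%s' "$token" | jwt_payload
echo
```

For a real token, feed the `yq` output or the mounted token file into `jwt_payload` and pipe the result to `jq` for pretty-printing.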

Convert the token timestamp into a human-readable format

# date -d @1750924118

Result:

Thu Jun 26 07:48:38 UTC 2025

That pinpoints the problem: the machine's current date is already June 28, so the token has expired.
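
The same comparison can be scripted instead of read off by eye. A minimal sketch, assuming the `exp` value decoded above; `token_expired` is a hypothetical helper name, and it treats the token as expired once the current time is at or past `exp`.

```shell
# token_expired EXP NOW (hypothetical helper): exit 0 when NOW is at or past EXP.
# Both arguments are Unix timestamps in seconds.
token_expired() {
  [ "$2" -ge "$1" ]
}

exp=1750924118        # "exp" claim from the decoded token above
now=$(date +%s)       # current time on the node

if token_expired "$exp" "$now"; then
  echo "token expired $(( (now - exp) / 3600 )) hours ago"
else
  echo "token valid for another $(( (exp - now) / 3600 )) hours"
fi
```

Run on the affected node, this reports how long ago the token in the kubeconfig lapsed.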

Check when the token was issued

# date -d @1750837718
Wed Jun 25 07:48:38 UTC 2025

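Comparing the two timestamps (exp - iat = 86400 seconds) shows that the Service Account token in the Calico kubeconfig is valid for 1 day.

Command to inspect the SA token after exec'ing into a pod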

cat /var/run/secrets/kubernetes.io/serviceaccount/token \
  | cut -d "." -f2 \
  | tr -d "\n" \
  | tr '_-' '/+' \
  | awk '{ while (length($0) % 4) $0 = $0 "="; printf "%s", $0 }' \
  | base64 -d \
  | jq

Result:

{
  "aud": [
    "https://kubernetes.default.svc.tkbp.k8s"
  ],
  "exp": 1782720071,
  "iat": 1751184071,
  "iss": "https://kubernetes.default.svc.tkbp.k8s",
  "jti": "e64feb22-412e-4065-b9b9-5c45889958a9",
  "kubernetes.io": {
    "namespace": "taroko",
    "node": {
      "name": "tkbp-worker2",
      "uid": "f6595a85-3a61-411b-a186-1c2f6bd32a26"
    },
    "pod": {
      "name": "taroko-tkadm-794b86456f-qbwlf",
      "uid": "e4fd32f5-a9c0-41d1-aa34-b63203ccd3d0"
    },
    "serviceaccount": {
      "name": "tkadm",
      "uid": "21b873fc-1ad0-4442-a66e-acf5fbe78311"
    },
    "warnafter": 1751187678
  },
  "nbf": 1751184071,
  "sub": "system:serviceaccount:taroko:tkadm"
}
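
The two decoded tokens have very different lifetimes, which a quick subtraction makes visible. A minimal sketch using the `iat` and `exp` values printed above; `lifetime_seconds` is a hypothetical helper name.

```shell
# lifetime_seconds IAT EXP (hypothetical helper): print EXP - IAT in seconds.
lifetime_seconds() {
  echo $(( $2 - $1 ))
}

cni=$(lifetime_seconds 1750837718 1750924118)   # token in calico-kubeconfig
pod=$(lifetime_seconds 1751184071 1782720071)   # token mounted in the pod

echo "calico-kubeconfig token: $(( cni / 3600 )) hours"
echo "pod token: $(( pod / 86400 )) days"
```

The token in calico-kubeconfig lasts 86400 seconds (24 hours), whereas the token mounted in the pod lasts 31536000 seconds (365 days), so the 1-day lifetime applies to the token written into the Calico kubeconfig rather than to every Service Account token.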