# K8s: StatefulSet Pods Do Not Automatically Migrate After a Worker Node Failure
## Test Environment
```
$ kubectl get no
NAME        STATUS   ROLES                              AGE    VERSION
cilium-m1   Ready    control-plane,etcd,master,worker   431d   v1.31.7+rke2r1
cilium-m2   Ready    control-plane,etcd,master,worker   257d   v1.31.7+rke2r1
cilium-m3   Ready    control-plane,etcd,master,worker   257d   v1.31.7+rke2r1
cilium-w1   Ready    worker                             418d   v1.31.7+rke2r1
cilium-w2   Ready    worker                             403d   v1.31.7+rke2r1
```
* Create a test StatefulSet:
```
$ echo 'apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  replicas: 3 # by default is 1
  minReadySeconds: 10 # by default is 0
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 30
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 30
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.24
        ports:
        - containerPort: 80' | kubectl apply -f -
$ kubectl get po -owide
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
web-0   1/1     Running   0          76s   10.42.1.75    cilium-w1   <none>           <none>
web-1   1/1     Running   0          36s   10.42.5.130   cilium-m3   <none>           <none>
web-2   1/1     Running   0          16s   10.42.0.137   cilium-m1   <none>           <none>
```
## Problem Background
* Shut down the `cilium-w1` worker. Once the node is marked NotReady, the StatefulSet pod that was running on it stays in Terminating indefinitely, and no replacement pod is created.
* The reason is that a StatefulSet always recreates a pod under the same name (`web-0`), and as long as the old `web-0` pod has not been completely deleted, the StatefulSet cannot create a new pod with that name. (A manual workaround is sketched after the output below.)
```
$ kubectl get no
NAME        STATUS     ROLES                              AGE    VERSION
cilium-m1   Ready      control-plane,etcd,master,worker   431d   v1.31.7+rke2r1
cilium-m2   Ready      control-plane,etcd,master,worker   257d   v1.31.7+rke2r1
cilium-m3   Ready      control-plane,etcd,master,worker   257d   v1.31.7+rke2r1
cilium-w1   NotReady   worker                             418d   v1.31.7+rke2r1
cilium-w2   Ready      worker                             403d   v1.31.7+rke2r1
$ kubectl get po -owide
NAME    READY   STATUS        RESTARTS   AGE     IP            NODE        NOMINATED NODE   READINESS GATES
web-0   1/1     Terminating   0          3m33s   10.42.1.75    cilium-w1   <none>           <none>
web-1   1/1     Running       0          2m53s   10.42.5.130   cilium-m3   <none>           <none>
web-2   1/1     Running       0          2m33s   10.42.0.137   cilium-m1   <none>           <none>
```
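* If you just need the pod back immediately, a one-off manual workaround (not a fix) is to force-delete the stuck pod so the StatefulSet controller is allowed to recreate `web-0`. Per the Kubernetes documentation on force-deleting StatefulSet pods, only do this once you are sure the node is really down; otherwise two pods with the same identity may end up running at the same time.
```
$ kubectl delete pod web-0 --force --grace-period=0
```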
## Solutions
### Graceful Shutdown (sudo poweroff)
* Set the two kubelet parameters `shutdownGracePeriod` and `shutdownGracePeriodCriticalPods`. Both default to `0`, which means the kubelet will not delete any pods before the machine shuts down.
    - `shutdownGracePeriod`: the total time by which the node delays its shutdown, during which the kubelet can terminate pods gracefully.
    - `shutdownGracePeriodCriticalPods`: the portion of the shutdown period reserved for terminating critical pods. This must be smaller than `shutdownGracePeriod`.
* Add the following kubelet config file on every node.
```
$ mkdir /etc/kubernetes/
# With this example configuration, the first 20 seconds (30 - 10) are used to
# terminate regular pods, and the last 10 seconds are reserved for critical pods
$ nano /etc/kubernetes/kubeletconfig.yml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 30s
shutdownGracePeriodCriticalPods: 10s
```
* Add the `--config=/etc/kubernetes/kubeletconfig.yml` flag to the kubelet, as shown in the sketch below.
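  Since the nodes here run RKE2 (`v1.31.7+rke2r1`), one way to pass that flag is via RKE2's `kubelet-arg` option; a minimal sketch, assuming the default RKE2 config path:
```
# /etc/rancher/rke2/config.yaml (keep any existing entries in this file)
kubelet-arg:
  - "config=/etc/kubernetes/kubeletconfig.yml"
```
  Then restart the RKE2 service (`rke2-server` on the control-plane nodes, `rke2-agent` on the workers) so the kubelet is restarted with the new flag.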

* After applying the settings, verify that the kubelet parameters took effect:
```
$ kubectl get --raw "/api/v1/nodes/cilium-m1/proxy/configz" | jq . | grep shutdownGracePeriod
"shutdownGracePeriod": "30s",
"shutdownGracePeriodCriticalPods": "10s",
```

* Create an additional Deployment:
```
$ kubectl create deploy test --image=nginx --replicas=3
$ kubectl get po -owide
NAME                    READY   STATUS    RESTARTS   AGE    IP            NODE        NOMINATED NODE   READINESS GATES
test-556b4dcc6c-4txbx   1/1     Running   0          11s    10.42.1.248   cilium-w1   <none>           <none>
test-556b4dcc6c-srlh5   1/1     Running   0          35m    10.42.5.63    cilium-m3   <none>           <none>
test-556b4dcc6c-t6668   1/1     Running   0          33m    10.42.0.205   cilium-m1   <none>           <none>
web-0                   1/1     Running   0          6s     10.42.1.89    cilium-w1   <none>           <none>
web-1                   1/1     Running   0          106m   10.42.5.150   cilium-m3   <none>           <none>
web-2                   1/1     Running   0          105m   10.42.0.14    cilium-m1   <none>           <none>
```
* Shut down the `cilium-w1` worker. This time the pods that were running on `cilium-w1` are deleted before the node goes down, and the StatefulSet pod is recreated successfully on another node.
```
$ kubectl get po -owide
NAME                    READY   STATUS    RESTARTS   AGE    IP            NODE        NOMINATED NODE   READINESS GATES
test-556b4dcc6c-jb2tp   1/1     Running   0          8s     10.42.4.240   cilium-m2   <none>           <none>
test-556b4dcc6c-srlh5   1/1     Running   0          73m    10.42.5.63    cilium-m3   <none>           <none>
test-556b4dcc6c-t6668   1/1     Running   0          71m    10.42.0.205   cilium-m1   <none>           <none>
web-0                   1/1     Running   0          7s     10.42.4.43    cilium-m2   <none>           <none>
web-1                   1/1     Running   0          143m   10.42.5.150   cilium-m3   <none>           <none>
web-2                   1/1     Running   0          143m   10.42.0.14    cilium-m1   <none>           <none>
```
### Non-Graceful Shutdown
* If the node goes down non-gracefully, for example due to a power outage, the kubelet has no chance to clean up its pods before shutdown. In that case the only option is to manually taint the node to evict the pods on it:
```
$ kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
```
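* According to the non-graceful node shutdown documentation, this taint is not removed automatically: after the node has been repaired (or permanently removed) and the pods have been rescheduled, remove it yourself using the trailing `-` syntax:
```
$ kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
```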
## Pods Are Still Not Deleted After a Graceful sudo poweroff
* First run `systemd-inhibit --list` to see which programs are currently delaying or blocking system shutdown, and confirm that, after adding the `shutdownGracePeriod` parameter, the kubelet has been added to this list.
* Then shut down with `sudo systemctl poweroff`; systemd will delay the shutdown according to the inhibitors shown by `systemd-inhibit --list`. If pods are still not cleaned up even though the kubelet is listed, check the logind cap described after the output below.
```
$ systemd-inhibit --list
WHO UID USER PID COMM WHAT WHY >
ModemManager 0 root 837 ModemManager sleep ModemManager needs to reset devices >
UPower 0 root 220312 upowerd sleep Pause device polling >
Unattended Upgrades Shutdown 0 root 846 unattended-upgr shutdown Stop ongoing upgrades or perform upgrades before >
kubelet 0 root 2019 kubelet shutdown Kubelet needs time to handle node shutdown >
4 inhibitors listed.
$ sudo systemctl poweroff
```
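* Note that systemd caps how long inhibitors may delay shutdown via logind's `InhibitDelayMaxSec` (commonly 5 seconds by default). If that cap is lower than `shutdownGracePeriod`, the kubelet may get less shutdown time than configured. A minimal sketch of raising the cap to match the 30s used above, assuming the default logind config path:
```
# /etc/systemd/logind.conf — let inhibitors (including the kubelet)
# delay shutdown for at least shutdownGracePeriod (30s in this example)
[Login]
InhibitDelayMaxSec=30

# Restart logind to apply the change
$ sudo systemctl restart systemd-logind
```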
## References
* https://github.com/kubernetes/kubernetes/issues/74947
* https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
* https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/#how-do-i-use-it
* https://kubernetes.io/blog/2023/08/16/kubernetes-1-28-non-graceful-node-shutdown-ga/