Kubernetes
I have a Kubernetes 1.23 environment running on virtual machines.
There is a pod stuck in the ContainerStatusUnknown status.
Let's examine this issue to understand what happened and why.
Update: this has happened twice on the same node; I suspect it is due to OOM.
root@worker2:~# free -m
              total        used        free      shared  buff/cache   available
Mem:           1983        1147         113           3         722         646
Swap:             0           0           0
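Memory on worker2 is tight (about 646 MiB available and no swap), so it is worth checking whether the kernel OOM killer fired. A couple of commands that could be run on the node, for example:
dmesg -T | grep -i -e "out of memory" -e "oom-killer"
grep -i -e "oom-kill" -e "killed process" /var/log/syslog*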
A pod in "dmeo" namespace in "ContainerStatusUnknown".
ubuntu@master:~$ kubectl get pod ubuntu2 -n demo -o wide
NAME      READY   STATUS                   RESTARTS   AGE   IP             NODE      NOMINATED NODE   READINESS GATES
ubuntu2   0/1     ContainerStatusUnknown   1          23h   10.85.189.70   worker2   <none>           <none>
ubuntu@master:~$ kubectl get pod -n demo ubuntu2 -o yaml | grep -i phase
phase: Failed
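Besides the phase, the top-level reason and message can be read straight from the pod status; for an evicted pod they should match what describe shows below. For example:
kubectl get pod ubuntu2 -n demo -o jsonpath='{.status.reason}{"\n"}{.status.message}{"\n"}'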
Get the pod UID:
ubuntu@master:~$ kubectl get pod -n demo ubuntu2 -o jsonpath='{.metadata.uid}'
798371fc-09a8-4740-b39a-8263da43d881
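The UID is useful on the node itself: the kubelet keeps the pod's directory under /var/lib/kubelet/pods/<UID> and the container logs under /var/log/pods/<namespace>_<name>_<UID>/, so, assuming the pod has not been fully cleaned up yet, one could look at, for example:
ls /var/lib/kubelet/pods/798371fc-09a8-4740-b39a-8263da43d881/
ls /var/log/pods/demo_ubuntu2_798371fc-09a8-4740-b39a-8263da43d881/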
Check the pod details:
ubuntu@master:~$ kubectl describe pod ubuntu2 -n demo
Name: ubuntu2
Namespace: demo
Priority: 0
Node: worker2/192.168.122.12
Start Time: Thu, 26 May 2022 11:17:33 +1000
Labels: app=ubuntu2
Annotations: cni.projectcalico.org/containerID: bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da
cni.projectcalico.org/podIP:
cni.projectcalico.org/podIPs:
Status: Failed
Reason: Evicted
Message: The node was low on resource: ephemeral-storage. Container ubuntu was using 35684Ki, which exceeds its request of 0.
IP: 10.85.189.70
IPs:
IP: 10.85.189.70
Containers:
ubuntu:
Container ID:
Image: ubuntu:latest
Image ID:
Port: <none>
Host Port: <none>
Command:
/bin/sleep
3650d
State: Terminated
Reason: ContainerStatusUnknown
Message: The container could not be located when the pod was terminated
Exit Code: 137
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Last State: Terminated
Reason: ContainerStatusUnknown
Message: The container could not be located when the pod was deleted. The container used to be Running
Exit Code: 137
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: False
Restart Count: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9g5wv (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-9g5wv:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
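The Reason/Message above say the node was low on ephemeral-storage. Assuming default kubelet and CRI-O paths, disk usage on worker2 and the node's ephemeral-storage capacity could be checked with, for example:
df -h /var/lib/kubelet /var/lib/containers
kubectl describe node worker2 | grep -i ephemeral-storage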
In the describe output, the container ID, image ID, etc. are empty.
This is probably because the container has already been removed from CRI-O.
But when was the container removed, and why was the pod object itself left behind?
The container's last status message says the container could not be located "when the pod was deleted", yet the pod still exists.
The pod status also says the node was low on ephemeral-storage, so the kubelet on that node may have stopped (evicted) the container.
However, there are no events left on the pod, so this alone does not prove that the container was stopped by the kubelet.
Exit code 137 means the process was killed with SIGKILL (128 + 9); on Linux this is often the OOM killer, but the kubelet also sends SIGKILL to containers that do not stop within the grace period during eviction. Maybe that is why the container was terminated?
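Pod events are only kept for the API server's --event-ttl (one hour by default), which would explain the empty Events section for an eviction that happened hours earlier. Had it been recent, the Evicted event should have shown up with something like:
kubectl get events -n demo --field-selector involvedObject.name=ubuntu2 --sort-by=.lastTimestamp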
Check the syslog (kubelet and CRI-O logs) on the worker2 node.
root@worker2:/var/log# grep -re "10.85.189.70" ./syslog.1 | wc -l
6
root@worker2:/var/log# grep -re "10.85.189.70" ./syslog.1 | grep "ContainerID" | head -n 1
May 26 11:17:35 worker2 crio[589]: 2022-05-26 11:17:35.477 [INFO][1075037] ipam_plugin.go 284: Calico CNI IPAM assigned addresses IPv4=[10.85.189.70/26] IPv6=[] ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" HandleID="k8s-pod-network.bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" Workload="worker2-k8s-ubuntu2-eth0"
This gives the ID "bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da".
It is the pod sandbox (infra container) ID, not the application container's ID.
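If the sandbox and its container still existed on the node, crictl could cross-check them; by now both have been removed, so these would return nothing here:
crictl pods --name ubuntu2
crictl ps -a | grep ubuntu2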
Check the logs again to find out when the container was deleted.
May 26 11:17:59 worker2 crio[589]: time="2022-05-26 11:17:59.790680442+10:00" level=info msg="Started container" PID=1075244 containerID=3f46b70cb2d355b558c7dc64f42cbfcd2329d808a946f2067c200571c53abc41 description=demo/ubuntu2/ubuntu id=a84ec1d4-5eef-403a-8535-07dc6e6163de name=/runtime.v1.RuntimeService/StartContainer sandboxID=bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da
The container was started at 11:17.
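Note that the application container has its own ID (3f46b70c…, with bf6758… appearing as the sandboxID). Grepping for that ID in the same syslog should show when the container itself was stopped and removed; for example:
grep -re "3f46b70cb2d355b558c7dc64f42cbfcd2329d808a946f2067c200571c53abc41" ./syslog.1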
May 26 21:01:28 worker2 crio[589]: time="2022-05-26 21:01:28.760061929+10:00" level=info msg="Stopping pod sandbox: bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" id=afd4efd8-57e5-4e31-8280-c23b60a4908e name=/runtime.v1.RuntimeService/StopPodSandbox
May 26 21:01:28 worker2 crio[589]: time="2022-05-26 21:01:28.890135955+10:00" level=info msg="Got pod network &{Name:ubuntu2 Namespace:demo ID:bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da UID:798371fc-09a8-4740-b39a-8263da43d881 NetNS:/var/run/netns/7be644af-bdcc-443b-afbd-624c85d47d53 Networks:[{Name:k8s-pod-network Ifname:eth0}] RuntimeConfig:map[k8s-pod-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:29.744 [INFO][1372241] k8s.go 572: Cleaning up netns ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:29.758 [INFO][1372241] dataplane_linux.go 506: Calico CNI deleting device in netns /var/run/netns/7be644af-bdcc-443b-afbd-624c85d47d53 ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:29.784 [INFO][1372241] dataplane_linux.go 523: Calico CNI deleted device in netns /var/run/netns/7be644af-bdcc-443b-afbd-624c85d47d53 ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:29.784 [INFO][1372241] k8s.go 579: Releasing IP address(es) ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:29.784 [INFO][1372241] utils.go 196: Calico CNI releasing IP address ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:29.870 [INFO][1372255] ipam_plugin.go 415: Releasing address using handleID ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" HandleID="k8s-pod-network.bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" Workload="worker2-k8s-ubuntu2-eth0"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:30.360 [INFO][1372255] ipam_plugin.go 434: Released address using handleID ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" HandleID="k8s-pod-network.bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" Workload="worker2-k8s-ubuntu2-eth0"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:30.360 [INFO][1372255] ipam_plugin.go 443: Releasing address using workloadID ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" HandleID="k8s-pod-network.bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" Workload="worker2-k8s-ubuntu2-eth0"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:30.365 [INFO][1372241] k8s.go 585: Teardown processing complete. ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da"
May 26 21:01:30 worker2 systemd[3528991]: run-containers-storage-overlay\x2dcontainers-bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da-userdata-shm.mount: Succeeded.
May 26 21:01:30 worker2 systemd[1]: run-containers-storage-overlay\x2dcontainers-bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da-userdata-shm.mount: Succeeded.
May 26 21:01:30 worker2 crio[589]: time="2022-05-26 21:01:30.469889194+10:00" level=info msg="Stopped pod sandbox: bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" id=afd4efd8-57e5-4e31-8280-c23b60a4908e name=/runtime.v1.RuntimeService/StopPodSandbox
May 26 21:01:31 worker2 crio[589]: time="2022-05-26 21:01:31.302373199+10:00" level=info msg="Stopping pod sandbox: bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" id=dfcf3e2a-aba6-4521-b065-0a03103ef7e0 name=/runtime.v1.RuntimeService/StopPodSandbox
May 26 21:01:31 worker2 crio[589]: time="2022-05-26 21:01:31.302408765+10:00" level=info msg="Stopped pod sandbox (already stopped): bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" id=dfcf3e2a-aba6-4521-b065-0a03103ef7e0 name=/runtime.v1.RuntimeService/StopPodSandbox
May 26 21:01:31 worker2 crio[589]: time="2022-05-26 21:01:31.302700347+10:00" level=info msg="Removing pod sandbox: bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" id=f611266c-55e0-4f1c-9471-4ab3622a973e name=/runtime.v1.RuntimeService/RemovePodSandbox
May 26 21:01:31 worker2 crio[589]: time="2022-05-26 21:01:31.454474319+10:00" level=info msg="Removed pod sandbox: bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" id=f611266c-55e0-4f1c-9471-4ab3622a973e name=/runtime.v1.RuntimeService/RemovePodSandbox
The pod sandbox was stopped and removed at 21:01. The CRI-O logs look normal.
The kubelet logs for this pod show a failed watch on the container cgroup at creation time, which looks harmless.
May 26 11:17:33 worker2 kubelet[1499524]: I0526 11:17:33.129861 1499524 topology_manager.go:200] "Topology Admit Handler"
May 26 11:17:33 worker2 kubelet[1499524]: I0526 11:17:33.503315 1499524 reconciler.go:221] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-9g5wv\" (UniqueName: \"kubernetes.io/projected/798371fc-09a8-4740-b39a-8263da43d881-kube-api-access-9g5wv\") pod \"ubuntu2\" (UID: \"798371fc-09a8-4740-b39a-8263da43d881\") " pod="demo/ubuntu2"
May 26 11:17:36 worker2 kubelet[1499524]: W0526 11:17:36.115343 1499524 manager.go:1176] Failed to process watch event {EventType:0 Name:/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod798371fc_09a8_4740_b39a_8263da43d881.slice/crio-bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da.scope WatchSource:0}: Error finding container bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da: Status 404 returned error &{%!s(*http.body=&{0xc00012d4b8 <nil> <nil> false false {0 0} false false false <nil>}) {%!s(int32=0) %!s(uint32=0)} %!s(bool=false) <nil> %!s(func(error) error=0x840ee0) %!s(func() error=0x840fe0)}
Check the kubelet logs in more detail.
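The eviction-manager entries (logged from eviction_manager.go) can be pulled out of the same syslog on worker2, for example:
grep -re "eviction_manager" ./syslog.1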
The eviction manager first logs that the pod failed to evict, then reports it as evicted and waiting to be cleaned up.
I also checked the logs on the master node; nothing stood out to me.
Because a bare "Pod" is the lowest-level workload object, it does not recover by itself after a failure; there is no controller to recreate it.
References:
https://bugzilla.redhat.com/show_bug.cgi?id=1734524
https://www.cnblogs.com/chuanzhang053/p/16048622.html
https://morethink.xyz/i/0899009936416