
Kubernetes issue - ContainerStatusUnknown

tags: kubernetes

I have a virtual Kubernetes environment running version 1.23, and one pod is stuck in the status ContainerStatusUnknown.
Let's examine this issue to understand what happened and why.

Update: this has happened twice on the same node; I suspect it is due to OOM.

root@worker2:~# free -m
              total        used        free      shared  buff/cache   available
Mem:           1983        1147         113           3         722         646
Swap:             0           0           0
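
To check whether the kernel OOM killer was actually involved, the kernel log on worker2 can be searched for OOM events. A minimal sketch (log locations and message wording may differ by distribution):

# look for OOM killer activity in the kernel ring buffer
dmesg -T | grep -iE "out of memory|killed process"
# or in the persisted log files
grep -iE "out of memory|oom-kill" /var/log/syslog /var/log/kern.log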

ContainerStatusUnknown

A pod in "dmeo" namespace in "ContainerStatusUnknown".

ubuntu@master:~$ kubectl get pod ubuntu2 -n demo -o wide
NAME      READY   STATUS                   RESTARTS   AGE   IP             NODE      NOMINATED NODE   READINESS GATES
ubuntu2   0/1     ContainerStatusUnknown   1          23h   10.85.189.70   worker2   <none>           <none>
ubuntu@master:~$ kubectl get pod -n demo ubuntu2 -o yaml | grep -i phase
  phase: Failed
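
The reason and message behind the Failed phase can also be pulled directly with jsonpath instead of grepping the YAML; a small sketch:

# print phase, reason and message from the pod status
kubectl get pod ubuntu2 -n demo -o jsonpath='{.status.phase}{"\n"}{.status.reason}{"\n"}{.status.message}{"\n"}'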

Get the pod UID.

ubuntu@master:~$ kubectl get pod -n demo ubuntu2 -o jsonpath='{.metadata.uid}'
798371fc-09a8-4740-b39a-8263da43d881
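
The UID is useful for matching the pod against kubelet artifacts on the node; for example, the kubelet keeps per-pod data under /var/lib/kubelet/pods/<uid> (assuming the default kubelet root directory):

# on worker2: inspect what the kubelet still keeps for this pod UID
ls -la /var/lib/kubelet/pods/798371fc-09a8-4740-b39a-8263da43d881/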

Check the pod details

ubuntu@master:~$ kubectl describe pod ubuntu2 -n demo
Name:         ubuntu2
Namespace:    demo
Priority:     0
Node:         worker2/192.168.122.12
Start Time:   Thu, 26 May 2022 11:17:33 +1000
Labels:       app=ubuntu2
Annotations:  cni.projectcalico.org/containerID: bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da
              cni.projectcalico.org/podIP: 
              cni.projectcalico.org/podIPs: 
Status:       Failed
Reason:       Evicted
Message:      The node was low on resource: ephemeral-storage. Container ubuntu was using 35684Ki, which exceeds its request of 0. 
IP:           10.85.189.70
IPs:
  IP:  10.85.189.70
Containers:
  ubuntu:
    Container ID:  
    Image:         ubuntu:latest
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sleep
      3650d
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9g5wv (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-9g5wv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

In the describe output, the container ID, image ID, etc. are empty.
This is probably because the container had already been removed from CRI-O.
But when and why was the container removed while the pod object was left behind?
The container's last state message says it could not be located when the pod was deleted.
The pod status also says the node was low on ephemeral-storage, so the kubelet on the node may have evicted the pod and stopped the container.
However, there are no events left on the pod, so the events alone cannot prove that the container was stopped by the kubelet.
Exit code 137 means the process was killed with SIGKILL (128 + 9), which on Linux is commonly associated with out-of-memory kills; maybe that is why the container terminated?
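
Two quick checks could back this up: ask CRI-O directly whether the container still exists, and look for any surviving events for the pod (events expire after about an hour by default, so they may already be gone). A sketch:

# on worker2: list all containers/sandboxes CRI-O still knows about for this pod
crictl ps -a | grep ubuntu2
crictl pods | grep ubuntu2

# on the master: any events still recorded for the pod?
kubectl get events -n demo --field-selector involvedObject.name=ubuntu2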

Check the kubelet log on the worker2 node.

root@worker2:/var/log# grep -re "10.85.189.70" ./syslog.1  | wc -l
6
root@worker2:/var/log# grep -re "10.85.189.70" ./syslog.1  | grep "ContainerID" | head -n 1
May 26 11:17:35 worker2 crio[589]: 2022-05-26 11:17:35.477 [INFO][1075037] ipam_plugin.go 284: Calico CNI IPAM assigned addresses IPv4=[10.85.189.70/26] IPv6=[] ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" HandleID="k8s-pod-network.bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" Workload="worker2-k8s-ubuntu2-eth0"

Got container ID "bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da".
This is actually the pod sandbox ID (the CNI address is attached to the sandbox, not the application container).
Check the logs again to find out when this sandbox was deleted.
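
The sandbox lifecycle events below were pulled out of the same syslog file; a sketch of the filter:

# on worker2: show CRI-O lifecycle events for this sandbox ID
grep "bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" ./syslog.1 \
  | grep -E "Started container|Stopping pod sandbox|Stopped pod sandbox|Removed pod sandbox"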

May 26 11:17:59 worker2 crio[589]: time="2022-05-26 11:17:59.790680442+10:00" level=info msg="Started container" PID=1075244 containerID=3f46b70cb2d355b558c7dc64f42cbfcd2329d808a946f2067c200571c53abc41 description=demo/ubuntu2/ubuntu id=a84ec1d4-5eef-403a-8535-07dc6e6163de name=/runtime.v1.RuntimeService/StartContainer sandboxID=bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da

The container was started at 11:17 AM.

May 26 21:01:28 worker2 crio[589]: time="2022-05-26 21:01:28.760061929+10:00" level=info msg="Stopping pod sandbox: bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" id=afd4efd8-57e5-4e31-8280-c23b60a4908e name=/runtime.v1.RuntimeService/StopPodSandbox
May 26 21:01:28 worker2 crio[589]: time="2022-05-26 21:01:28.890135955+10:00" level=info msg="Got pod network &{Name:ubuntu2 Namespace:demo ID:bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da UID:798371fc-09a8-4740-b39a-8263da43d881 NetNS:/var/run/netns/7be644af-bdcc-443b-afbd-624c85d47d53 Networks:[{Name:k8s-pod-network Ifname:eth0}] RuntimeConfig:map[k8s-pod-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:29.744 [INFO][1372241] k8s.go 572: Cleaning up netns ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:29.758 [INFO][1372241] dataplane_linux.go 506: Calico CNI deleting device in netns /var/run/netns/7be644af-bdcc-443b-afbd-624c85d47d53 ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:29.784 [INFO][1372241] dataplane_linux.go 523: Calico CNI deleted device in netns /var/run/netns/7be644af-bdcc-443b-afbd-624c85d47d53 ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:29.784 [INFO][1372241] k8s.go 579: Releasing IP address(es) ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:29.784 [INFO][1372241] utils.go 196: Calico CNI releasing IP address ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:29.870 [INFO][1372255] ipam_plugin.go 415: Releasing address using handleID ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" HandleID="k8s-pod-network.bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" Workload="worker2-k8s-ubuntu2-eth0"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:30.360 [INFO][1372255] ipam_plugin.go 434: Released address using handleID ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" HandleID="k8s-pod-network.bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" Workload="worker2-k8s-ubuntu2-eth0"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:30.360 [INFO][1372255] ipam_plugin.go 443: Releasing address using workloadID ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" HandleID="k8s-pod-network.bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" Workload="worker2-k8s-ubuntu2-eth0"
May 26 21:01:30 worker2 crio[589]: 2022-05-26 21:01:30.365 [INFO][1372241] k8s.go 585: Teardown processing complete. ContainerID="bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da"
May 26 21:01:30 worker2 systemd[3528991]: run-containers-storage-overlay\x2dcontainers-bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da-userdata-shm.mount: Succeeded.
May 26 21:01:30 worker2 systemd[1]: run-containers-storage-overlay\x2dcontainers-bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da-userdata-shm.mount: Succeeded.
May 26 21:01:30 worker2 crio[589]: time="2022-05-26 21:01:30.469889194+10:00" level=info msg="Stopped pod sandbox: bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" id=afd4efd8-57e5-4e31-8280-c23b60a4908e name=/runtime.v1.RuntimeService/StopPodSandbox
May 26 21:01:31 worker2 crio[589]: time="2022-05-26 21:01:31.302373199+10:00" level=info msg="Stopping pod sandbox: bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" id=dfcf3e2a-aba6-4521-b065-0a03103ef7e0 name=/runtime.v1.RuntimeService/StopPodSandbox
May 26 21:01:31 worker2 crio[589]: time="2022-05-26 21:01:31.302408765+10:00" level=info msg="Stopped pod sandbox (already stopped): bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" id=dfcf3e2a-aba6-4521-b065-0a03103ef7e0 name=/runtime.v1.RuntimeService/StopPodSandbox
May 26 21:01:31 worker2 crio[589]: time="2022-05-26 21:01:31.302700347+10:00" level=info msg="Removing pod sandbox: bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" id=f611266c-55e0-4f1c-9471-4ab3622a973e name=/runtime.v1.RuntimeService/RemovePodSandbox
May 26 21:01:31 worker2 crio[589]: time="2022-05-26 21:01:31.454474319+10:00" level=info msg="Removed pod sandbox: bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da" id=f611266c-55e0-4f1c-9471-4ab3622a973e name=/runtime.v1.RuntimeService/RemovePodSandbox

The pod sandbox was stopped and removed at 21:01. The CRI-O side of the logs looks normal.
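
Since the eviction reason was ephemeral-storage, it is also worth checking disk usage on the filesystems the kubelet and CRI-O write to (paths assume the default locations):

# on worker2: filesystem usage behind ephemeral storage
df -h /var/lib/kubelet /var/lib/containers /var/log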

The kubelet logs for the pod show a failed watch on the container's cgroup, but this happened at pod creation time and otherwise everything looks fine.

May 26 11:17:33 worker2 kubelet[1499524]: I0526 11:17:33.129861 1499524 topology_manager.go:200] "Topology Admit Handler"
May 26 11:17:33 worker2 kubelet[1499524]: I0526 11:17:33.503315 1499524 reconciler.go:221] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-9g5wv\" (UniqueName: \"kubernetes.io/projected/798371fc-09a8-4740-b39a-8263da43d881-kube-api-access-9g5wv\") pod \"ubuntu2\" (UID: \"798371fc-09a8-4740-b39a-8263da43d881\") " pod="demo/ubuntu2"
May 26 11:17:36 worker2 kubelet[1499524]: W0526 11:17:36.115343 1499524 manager.go:1176] Failed to process watch event {EventType:0 Name:/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod798371fc_09a8_4740_b39a_8263da43d881.slice/crio-bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da.scope WatchSource:0}: Error finding container bf6758f68f1862d383bff8a1707d739cca7deff4aedd5953cb40e63df19054da: Status 404 returned error &{%!s(*http.body=&{0xc00012d4b8 <nil> <nil> false false {0 0} false false false <nil>}) {%!s(int32=0) %!s(uint32=0)} %!s(bool=false) <nil> %!s(func(error) error=0x840ee0) %!s(func() error=0x840fe0)}

Check kubelet logs in more detail.


The eviction manager logs show the pod first failing to evict, then being evicted and waiting for cleanup.
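
Those eviction manager lines can be recovered from syslog as well; the exact message text varies across kubelet versions, so this is only a sketch:

# on worker2: kubelet eviction activity related to the pod
grep -i "eviction" ./syslog.1 | grep -i "ubuntu2"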

I also checked the logs on the master node, but found nothing notable.
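
It may also help to confirm whether the node itself reported pressure at the time; node conditions and node events can be checked from the master. A sketch:

# node conditions such as MemoryPressure / DiskPressure
kubectl describe node worker2
# events recorded against the node (if not yet expired)
kubectl get events --all-namespaces --field-selector involvedObject.name=worker2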

Because "pod" is lowest level workload type, it does not recover it-self when failure occur.

Reference

https://bugzilla.redhat.com/show_bug.cgi?id=1734524
https://www.cnblogs.com/chuanzhang053/p/16048622.html
https://morethink.xyz/i/0899009936416