K8s Troubleshooting

# K8s Troubleshooting

```bash=
# Watch cluster state from the master node
watch -n 1 kubectl get nodes,pod,svc,deploy -A
```

## 1. Monitoring

![](https://i.imgur.com/R8HHGe4.png)

* 1. Sematext
* 2. Kubernetes Dashboard
* 3. Prometheus
* 4. Grafana
* 5. Jaeger
* 6. Elastic Stack (ELK)
* 7. cAdvisor
* 8. Kubewatch
* 9. Kube-state-metrics
* 10. Datadog
* 11. New Relic
* 12. Sensu
* 13. Dynatrace

## 2. Troubleshooting Pods/Application

### Common Pod Errors

| Error | How to debug |
| -------- | -------- |
| Pending | A **scheduling** problem. Use `kubectl describe` to inspect it. This state usually means a worker node is not ready or the requested resources exceed what is available. Check the node's resources and status. |
| CrashLoopBackOff | The container keeps crashing and being restarted. Use `kubectl describe` and `kubectl logs` to inspect it. Narrow the error down by working from the outside in, starting at the **cluster** level. |
| Completed | Use `kubectl describe` to inspect and fix it. |
| Error | Use `kubectl describe` to inspect and fix it. |
| ImagePullBackOff | Use `kubectl describe` to inspect it, then export the Pod to a YAML file or use `kubectl edit` to update the image. |

### Troubleshooting commands

```bash=
# List Pods in all namespaces
kubectl get pods -A

# List events (current namespace, all namespaces, or one namespace)
kubectl get events
kubectl get events -A
kubectl get events -n [namespace]

# Sort events by time
kubectl get events --sort-by=.metadata.creationTimestamp
kubectl get events --sort-by=.lastTimestamp
kubectl get events -n [namespace] --sort-by=.metadata.creationTimestamp

# Inspect a single Pod
kubectl describe pod [pod-name]
kubectl logs [pod-name]
```

**kubectl get pods -A**

![](https://i.imgur.com/Sg7u1ma.png)

**kubectl get events**

![](https://i.imgur.com/P56uADv.png)

* TYPE: the kind of event, either Normal or Warning.
* REASON: why the event occurred.
* OBJECT: the object the event relates to.
* MESSAGE: detailed information about the event.

### LAB DEMO

* **ErrImagePull**
* **Completed**
* **CrashLoopBackOff**
* **ImagePullBackOff**

`nano pod1-busybox-CrashLoopBackOff.yaml`

```yaml=
---
apiVersion: v1
kind: Pod
metadata:
  name: pod1-busybox
spec:
  containers:
  - name: pod1-busybox
    # This tag looks intentionally invalid so the image pull fails; a valid tag is e.g. busybox:1.28
    image: busybox:11.28
    # Uncomment to keep the container running instead of exiting immediately
    #args: [/bin/sh, -c, 'i=0; while true; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done']
```

```bash=
kubectl apply -f pod1-busybox-CrashLoopBackOff.yaml

# Debug
kubectl describe pod/pod1-busybox
kubectl logs pod/pod1-busybox

# Fix: edit in place, or export the manifest, correct it, and re-apply
kubectl edit pod/pod1-busybox
kubectl get pod/pod1-busybox -o yaml > pod1-busybox-fixed.yaml
kubectl apply -f pod1-busybox-fixed.yaml

# Cleanup
kubectl delete pod/pod1-busybox
```

* **Pending**

`nano pod2-nginx-pending.yaml`

```yaml=
---
apiVersion: v1
kind: Pod
metadata:
  name: pod2-nginx
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: pod2-nginx
    resources:
      requests:
        cpu: 800m
        # Far more memory than any node offers, so the Pod stays Pending
        memory: 9999999Mi
```

```bash=
kubectl apply -f pod2-nginx-pending.yaml

# Debug
kubectl describe pod/pod2-nginx
kubectl logs pod/pod2-nginx

# Fix: lower the memory request.
# If the API server rejects the change (most Pod spec fields are immutable),
# delete the Pod and re-create it from the fixed manifest.
kubectl edit pod/pod2-nginx
kubectl get pod/pod2-nginx -o yaml > pod2-nginx-pending-mem-fixed.yaml
kubectl apply -f pod2-nginx-pending-mem-fixed.yaml

# Cleanup
kubectl delete pod/pod2-nginx
```

---

## 3. Troubleshooting Control plane

### Control Plane Components

* **kube-apiserver**
* **etcd**
* **kube-scheduler**
* **kube-controller-manager**
* **cloud-controller-manager**

### Troubleshooting commands

```bash=
kubectl get pods -n kube-system
```

![](https://i.imgur.com/02P68N4.png)

```bash=
# The Pod name ends with the node name ("master" here)
kubectl logs kube-apiserver-master -n kube-system
```

![](https://i.imgur.com/4iAMddg.png)

```bash=
kubectl cluster-info
kubectl cluster-info dump
```

---

## 4. Troubleshooting Worker Nodes

![](https://i.imgur.com/2IkVxSM.png)

**NotReady errors**

* Insufficient resources: not enough RAM or disk space.
* kubelet problems: the kubelet may have crashed or stopped, so it cannot communicate with the API server on the master node.
* kube-proxy errors

**Troubleshooting commands**

```bash=
# =================================================
# Check resources
kubectl describe node worker
top
df -h

# Requires metrics-server
kubectl top nodes
kubectl top pods
kubectl run nginx --image=nginx
kubectl top pod nginx
kubectl top pod nginx --containers
kubectl top pod --sort-by=cpu
kubectl top pod --sort-by=memory

# Check docker and kubelet
sudo systemctl status docker
sudo systemctl status kubelet

# Restart
sudo systemctl restart docker
sudo systemctl restart kubelet
sudo systemctl enable docker && sudo systemctl start docker
sudo systemctl enable kubelet && sudo systemctl start kubelet

# Check the kubelet log
sudo journalctl -u kubelet.service

# =================================================
# Check the kube-proxy Pod (replace ** with the suffix of the actual Pod name)
kubectl get pods -n kube-system
kubectl describe daemonset kube-proxy -n kube-system
kubectl describe pod kube-proxy-** -n kube-system
kubectl logs kube-proxy-** -n kube-system

# Check that swap is off
# sudo swapoff -a && sudo sed -i '/ swap / s/^/#/' /etc/fstab
free -h

# Check the firewall
sudo systemctl status firewalld
sudo systemctl disable firewalld && sudo systemctl stop firewalld
```

---

## 5. Troubleshooting Service

### **LAB Demo**

Check the DNS service and verify that the pod/application is working.

```bash=
kubectl create namespace app
kubectl create deployment svc-nginx --image=nginx -n app
kubectl expose deploy svc-nginx --type=NodePort --name=nginx-svc --port 80 -n app
```

```bash=
# Test
kubectl run -it nslookup-test --image=busybox:latest --rm --restart=Never -- nslookup nginx-svc.app.svc.cluster.local
kubectl run -it nginx-test -n app --image=nginx --rm --restart=Never -- curl http://nginx-svc.app.svc.cluster.local

# Check
kubectl get pods -n kube-system | grep dns
kubectl get svc -n kube-system
kubectl describe deploy coredns -n kube-system
# Pod name from the original cluster; use the name printed by the "grep dns" command above
kubectl logs coredns-787d4945fb-nwf2m -n kube-system

# Cleanup (the Deployment's Pods are removed with it; the test Pods used --rm)
kubectl delete deployment.apps/svc-nginx -n app
kubectl delete service/nginx-svc -n app
kubectl delete namespace app
```
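
The tests in this lab reach the Service through its in-cluster DNS name, which follows the pattern `<service>.<namespace>.svc.<cluster-domain>`. As a minimal sketch of how that name is put together, the helper below builds it from its parts. The function name `svc_fqdn` is hypothetical (it is not part of kubectl), and `cluster.local` is assumed as the cluster domain; a cluster configured with a different domain would need it passed as the third argument.

```bash
# Hypothetical helper (not part of kubectl): build the in-cluster DNS name of a Service.
# Usage: svc_fqdn <service> [namespace] [cluster-domain]
svc_fqdn() {
  local service="${1:-}"
  local namespace="${2:-default}"       # namespace defaults to "default"
  local domain="${3:-cluster.local}"    # assumed default cluster domain

  # A Service name is required; print usage to stderr and fail otherwise
  if [ -z "$service" ]; then
    echo "usage: svc_fqdn <service> [namespace] [cluster-domain]" >&2
    return 1
  fi

  printf '%s.%s.svc.%s\n' "$service" "$namespace" "$domain"
}
```

For example, `svc_fqdn nginx-svc app` prints `nginx-svc.app.svc.cluster.local`, the same name passed to `nslookup` and `curl` in the test commands. If that name does not resolve, the CoreDNS checks in this section are the next place to look.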