10 Common Kubernetes Troubleshooting Steps

1. Pod startup failures and nodes unable to run pods

Containers run applications, and a Pod is the smallest schedulable unit in Kubernetes. A Pod contains one or more containers, which are always scheduled together on the same node. The containers in a Pod share its network namespace (one IP address and port space), its resources, and its storage volumes.

Common causes of Pod failures:

1. Resource exhaustion: Too many Pods scheduled on the same node can consume its resources and cause the node to fail.
2. Memory and CPU issues: An application in a Pod may leak memory or use excessive CPU, causing the Pod to be OOM-killed or otherwise degraded. (Mitigation: perform load testing to determine memory and CPU usage, and set resource limits.)
3. Network problems: Network failures can prevent Pods from communicating. (Mitigation: check the status of Calico or whichever CNI plugin the cluster uses.)
4. Storage problems: A Pod may fail to start if mounted shared storage is unavailable. (Mitigation: verify that the shared storage and volumes are healthy.)
5. Application code problems: The application may fail after the container starts. (Mitigation: debug the application code.)
6. Configuration issues: Errors in Deployment or StatefulSet manifests can prevent Pod creation. (Mitigation: inspect the resource manifests.)

A monitoring system helps diagnose all of the issues above.
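For the memory item above, one way to confirm an OOM kill is to read the last termination reason of each container in the Pod. The following is a minimal sketch, not a definitive tool: the function names `last_termination_reasons` and `was_oom_killed` are hypothetical helpers introduced here, and the Pod and Namespace names in the usage comment are placeholders.

```shell
# Hypothetical helpers (not part of kubectl).

# Print "<container>\t<reason>" for the last termination of each container.
# Usage: last_termination_reasons <namespace> <pod-name>
last_termination_reasons() {
  kubectl get pod "$2" -n "$1" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Succeed if any container in the Pod was last terminated with reason OOMKilled.
# Usage: was_oom_killed <namespace> <pod-name>
was_oom_killed() {
  last_termination_reasons "$1" "$2" | grep -q 'OOMKilled'
}

# Example (placeholder names):
#   was_oom_killed default web-0 && echo "web-0 hit its memory limit"
```

If a container was OOM-killed, load testing should tell you whether the limit is too low or the application is leaking memory.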

2. Inspect cluster status

Start troubleshooting by checking the cluster status. Use kubectl get nodes to check node readiness. If nodes are NotReady or otherwise unhealthy, that can impact applications. Ensure core components such as etcd, kubelet, and kube-proxy are running properly.
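The node check can be scripted so that only problem nodes are printed. This is a rough sketch under the assumption that the second column of `kubectl get nodes --no-headers` is the STATUS column; `not_ready_nodes` is a hypothetical helper name.

```shell
# Hypothetical helper: list nodes whose STATUS is not exactly "Ready".
# Cordoned nodes (STATUS "Ready,SchedulingDisabled") are reported as well,
# because they cannot accept new Pods.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1, $2 }'
}

# Example:
#   not_ready_nodes
#   kubectl describe node <node-name>   # inspect Conditions and Events for a reported node
```

An empty result means every node reports Ready. For any node that is listed, `kubectl describe node <node-name>` shows the conditions that explain why.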

3. Trace event logs

Review cluster events to understand what is occurring. Use kubectl get events to view events in the current Namespace, or add --all-namespaces to cover the whole cluster. Events record errors and other significant occurrences, and by default the API server retains them for only one hour, so check them soon after a failure. Inspecting events helps identify which Kubernetes component or application is failing and narrows down the cause.
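Event output can be noisy, so it often helps to show only Warning events in chronological order. The sketch below assumes you want the most recent entries across all Namespaces. `recent_warnings` is a hypothetical helper name, and the default of 20 lines is an arbitrary choice.

```shell
# Hypothetical helper: print the most recent Warning events cluster-wide.
# Usage: recent_warnings [max-lines]   (default 20; an arbitrary choice)
recent_warnings() {
  kubectl get events --all-namespaces \
    --field-selector type=Warning \
    --sort-by=.metadata.creationTimestamp |
    tail -n "${1:-20}"
}

# Example:
#   recent_warnings 50
```

Sorting is ascending, so the newest events appear last. `tail` may cut off the header row when there are more events than requested lines.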

4. Focus on Pod status

Get the status of all Pods with kubectl get pods --all-namespaces. Pods that are Pending, CrashLoopBackOff, Error, or NotReady likely indicate container or application problems. Use kubectl describe pod <pod-name> -n <namespace> to get detailed information for deeper investigation; the -n flag is required when the Pod is not in your current Namespace.
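In a large cluster it is easier to list only the Pods that need attention. The following sketch assumes the column order NAMESPACE, NAME, READY, STATUS that `kubectl get pods --all-namespaces --no-headers` prints; `unhealthy_pods` is a hypothetical helper name.

```shell
# Hypothetical helper: print Pods that are not healthy.
# A Pod is reported when its STATUS is neither Running nor Completed,
# or when it is Running but not all of its containers are ready (e.g. 0/1).
unhealthy_pods() {
  kubectl get pods --all-namespaces --no-headers |
    awk '{
      split($3, ready, "/")
      if ($4 != "Completed" && ($4 != "Running" || ready[1] != ready[2]))
        print $1, $2, $3, $4
    }'
}

# Example:
#   unhealthy_pods
```

Each output line gives the Namespace, Pod name, READY count, and STATUS, which is what you need to run kubectl describe pod on it.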

5. Check network connectivity

Verify network connectivity between Services, Pods, and nodes. Use kubectl get services to view Services, and kubectl describe service <service-name> for details. Validate CNI plugin status and confirm network policies and firewall rules are configured correctly.
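A common reason a Service is unreachable is that it has no backing Pod addresses, for example because its selector matches no ready Pods. One way to check this is to read the Service's Endpoints object. This is a sketch under that assumption; `service_endpoint_ips` and `service_has_endpoints` are hypothetical helper names, and newer clusters may prefer EndpointSlice objects instead.

```shell
# Hypothetical helpers (not part of kubectl).

# Print the Pod IPs currently backing a Service, space-separated.
# Usage: service_endpoint_ips <namespace> <service-name>
service_endpoint_ips() {
  kubectl get endpoints "$2" -n "$1" \
    -o jsonpath='{.subsets[*].addresses[*].ip}'
}

# Succeed if the Service has at least one ready endpoint address.
# Usage: service_has_endpoints <namespace> <service-name>
service_has_endpoints() {
  [ -n "$(service_endpoint_ips "$1" "$2")" ]
}

# Example (placeholder names):
#   service_has_endpoints default hostnames || echo "no ready Pods behind hostnames"
```

If the list is empty, compare the selector shown by kubectl describe service with the labels on the Pods you expect it to match.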

6. Inspect storage configuration

If your applications use persistent storage such as PersistentVolumes and StorageClasses, ensure storage is configured correctly. Check PersistentVolume and PersistentVolumeClaim statuses. Use kubectl get pv, kubectl get pvc, and kubectl get storageclass to gather storage-related information.
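A PersistentVolumeClaim that is not Bound will usually keep the Pod that mounts it in Pending. The sketch below filters for such claims, assuming that the third column of `kubectl get pvc --all-namespaces --no-headers` is STATUS; `unbound_pvcs` is a hypothetical helper name.

```shell
# Hypothetical helper: list PersistentVolumeClaims whose STATUS is not "Bound"
# (for example Pending or Lost), as "<namespace> <name> <status>".
unbound_pvcs() {
  kubectl get pvc --all-namespaces --no-headers |
    awk '$3 != "Bound" { print $1, $2, $3 }'
}

# Example:
#   unbound_pvcs
#   kubectl describe pvc <pvc-name> -n <namespace>   # events explain why binding failed
```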

7. Investigate container logs

Container logs often contain critical clues. Use kubectl logs <pod-name> to view container output. For Pods with multiple containers, use kubectl logs <pod-name> -c <container-name> to view a specific container's logs.
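When a container is restarting repeatedly, the logs of the current instance may be empty, and the useful output is in the instance that crashed. The sketch below wraps the `--previous` and `--tail` flags of `kubectl logs`. `crash_logs` is a hypothetical helper name, and the default of 50 lines is an arbitrary choice.

```shell
# Hypothetical helper: show the tail of the logs from the previous
# (crashed) instance of one container.
# Usage: crash_logs <namespace> <pod-name> <container-name> [lines]
crash_logs() {
  kubectl logs "$2" -n "$1" -c "$3" --previous --tail="${4:-50}"
}

# Example (placeholder names):
#   crash_logs default web-0 app 100
```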

8. Kubernetes cluster network communication

Kubernetes clusters have their own internal network, and cross-node communication depends on the CNI plugin. Common CNI plugins include Calico, Flannel, and Canal.

Calico supports IP address assignment and network policy enforcement, and its performance is comparable to Flannel.

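Flannel primarily provides Pod networking and IP address assignment; it does not enforce network policies.

Canal combines the two: Flannel for networking and Calico for network policy enforcement.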

Network communication within a Kubernetes cluster mainly includes:

Communication between multiple containers inside the same Pod.
Pod-to-Pod communication.
Pod-to-Service communication.
Service-to-external communication.

9. Does Service DNS resolution work?

Run from a Pod in the same Namespace:

u@pod$ nslookup hostnames
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
Name: hostnames
Address 1: 10.0.1.175 hostnames.default.svc.cluster.local

If the above fails, your Pod and Service may be in different Namespaces. Try using the namespaced name:

u@pod$ nslookup hostnames.default
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
Name: hostnames.default
Address 1: 10.0.1.175 hostnames.default.svc.cluster.local

If that works, either update your application to use the namespace-qualified name when accessing the Service, or run the application and the Service in the same Namespace. If it still fails, try the fully qualified name:

u@pod$ nslookup hostnames.default.svc.cluster.local
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
Name: hostnames.default.svc.cluster.local
Address 1: 10.0.1.175 hostnames.default.svc.cluster.local

Note the suffix "default.svc.cluster.local": "default" is the Namespace, "svc" indicates a Service, and "cluster.local" is the cluster domain, which may differ in your installation.
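The three name forms tried above follow a fixed pattern, so they can be generated from the Service name, the Namespace, and the cluster domain. The sketch below prints them in the order used in this section. `service_dns_names` is a hypothetical helper name, and the defaults ("default" and "cluster.local") mirror the example values in this document rather than anything detected from your cluster.

```shell
# Hypothetical helper: print the short, namespace-qualified, and fully
# qualified DNS names for a Service, one per line.
# Usage: service_dns_names <service> [namespace] [cluster-domain]
service_dns_names() {
  svc="$1"
  ns="${2:-default}"             # assumed default, as in the examples above
  domain="${3:-cluster.local}"   # assumed default; your cluster domain may differ
  printf '%s\n' "$svc" "$svc.$ns" "$svc.$ns.svc.$domain"
}

# Example, run from inside a Pod:
#   for name in $(service_dns_names hostnames default); do
#     nslookup "$name" >/dev/null 2>&1 && echo "ok:   $name" || echo "FAIL: $name"
#   done
```

The first form that fails tells you where to look next: the Pod's Namespace if only the short name fails, or the search path if only the fully qualified name resolves.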

You can also try this from a cluster node:

Note: 10.0.0.10 is the DNS Service in this example; yours may be different.

u@node$ nslookup hostnames.default.svc.cluster.local 10.0.0.10
Server: 10.0.0.10
Address: 10.0.0.10#53
Name: hostnames.default.svc.cluster.local
Address: 10.0.1.175

If fully qualified names resolve but relative names do not, check /etc/resolv.conf inside Pods:

u@pod$ cat /etc/resolv.conf
nameserver 10.0.0.10
search default.svc.cluster.local svc.cluster.local cluster.local example.com
options ndots:5

The nameserver line must point to your cluster DNS Service; kubelet receives this via the --cluster-dns flag. The search line must include appropriate suffixes so Service names can be resolved in the local Namespace (default.svc.cluster.local), across Namespaces (svc.cluster.local), and cluster-wide (cluster.local). Your installation may add additional entries (up to 6). The cluster domain is provided to kubelet via the --cluster-domain flag. In this document we assume "cluster.local", but adjust commands if your cluster domain differs.
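The nameserver and search requirements described above can be checked mechanically. The following is a minimal sketch that parses a resolv.conf file and verifies both conditions. All three function names are hypothetical, and the expected DNS Service IP, Namespace, and cluster domain are values you supply yourself.

```shell
# Hypothetical helpers for inspecting a Pod's resolv.conf.

# Print each nameserver address, one per line.
resolv_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Print each search suffix, one per line.
resolv_search_domains() {
  awk '$1 == "search" { for (i = 2; i <= NF; i++) print $i }' "${1:-/etc/resolv.conf}"
}

# Succeed only if the file lists the expected cluster DNS IP and its search
# list contains the three cluster suffixes for the given Namespace.
# Usage: check_pod_resolv_conf <file> <dns-service-ip> <namespace> [cluster-domain]
check_pod_resolv_conf() {
  file="$1"; dns_ip="$2"; ns="$3"; domain="${4:-cluster.local}"
  resolv_nameservers "$file" | grep -qxF "$dns_ip" || return 1
  for suffix in "$ns.svc.$domain" "svc.$domain" "$domain"; do
    resolv_search_domains "$file" | grep -qxF "$suffix" || return 1
  done
}

# Example, run inside a Pod in the "default" Namespace:
#   check_pod_resolv_conf /etc/resolv.conf 10.0.0.10 default || echo "resolv.conf looks wrong"
```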

The options line must set ndots high enough for the DNS client library to consider the search path. By default, Kubernetes sets ndots to 5, which is sufficient for the DNS names it generates.
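To see why 5 is enough, it helps to model the resolver's rule: a name with fewer dots than ndots is tried against the search path first, while a name with at least that many dots (or a trailing dot) is tried as-is first. The sketch below implements only that decision, as an illustration. `count_dots` and `search_list_first` are hypothetical helper names, and real resolver libraries differ in what they do after the first attempt fails.

```shell
# Hypothetical helpers modelling the ndots rule from resolv.conf.

# Print the number of dots in a name.
count_dots() {
  dots="$(printf '%s' "$1" | tr -cd '.')"
  printf '%s\n' "${#dots}"
}

# Succeed if the resolver would try the search path before the name as-is.
# Usage: search_list_first <name> [ndots]   (ndots defaults to 5, as Kubernetes sets it)
search_list_first() {
  name="$1"; ndots="${2:-5}"
  case "$name" in
    *.) return 1 ;;   # trailing dot: name is absolute, search path is skipped
  esac
  [ "$(count_dots "$name")" -lt "$ndots" ]
}

# Example:
#   search_list_first hostnames.default && echo "search path is consulted first"
```

The longest name Kubernetes generates in these examples, hostnames.default.svc.cluster.local, has four dots, so with ndots:5 every form from short name to fully qualified name still goes through the search path.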

10. Summary

The exact troubleshooting steps depend on your cluster configuration, deployment method, and the failure you observe. Use the checks above as starting points, then narrow the investigation to the failing component and apply a targeted fix to keep your applications stable.