# Kubernetes Pod CrashLoopBackOff: A Deep Dive into Real Troubleshooting
Walk through a real production CrashLoopBackOff incident — from discovery to root cause to fix. Includes kubectl commands, log analysis, and the exact YAML that caused the problem.
## The 3 AM Alert
It started with a PagerDuty alert at 3 AM. A critical payment service had gone into CrashLoopBackOff in our production cluster. Here's exactly how I diagnosed and fixed it — and what I learned.
## Step 1: Assess the Situation
First thing: don't panic, start gathering data.
```bash
# Get overall pod status
kubectl get pods -n payments

# Output:
# NAME                          READY   STATUS             RESTARTS   AGE
# payment-api-7d4b9c8f6-xkp2m   0/1     CrashLoopBackOff   8          12m
```
Eight restarts in 12 minutes. Kubernetes isn't giving up on the pod — it's backing off, waiting longer between each restart attempt.
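That back-off is exponential. A quick shell sketch of the schedule (assumption: delays double from 10s and cap at 300s, which matches the documented kubelet behavior):

```shell
# Sketch of the kubelet's CrashLoopBackOff delay schedule:
# double the wait after each failed restart, capped at 300s (5 minutes)
delay=10
for attempt in 1 2 3 4 5 6 7 8; do
  echo "restart $attempt: back-off ${delay}s"   # restart 1: 10s ... restart 8: 300s
  next=$((delay * 2))
  if [ "$next" -gt 300 ]; then delay=300; else delay=$next; fi
done
```

So by the eighth crash, each restart is already five minutes apart — which is why a crash-looping pod can look "stuck" even though the kubelet is still retrying.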
## Step 2: Check Events First
Before diving into logs, always check events — they often reveal the obvious:
```bash
kubectl describe pod payment-api-7d4b9c8f6-xkp2m -n payments | grep -A 20 Events
```

```
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  12m               default-scheduler  Successfully assigned
  Normal   Pulling    12m               kubelet            Pulling image "payment-api:v2.1.0"
  Normal   Pulled     11m               kubelet            Successfully pulled image
  Normal   Created    11m               kubelet            Created container payment-api
  Warning  BackOff    2m (x6 over 10m)  kubelet            Back-off restarting failed container
```
No image pull errors, no OOMKilled. The container starts and then dies. Time to check logs.
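When the events are this generic, the pod's status also records why the last container instance died. These queries need a live cluster, but the field paths are from the core Pod API (assuming a single-container pod, hence index `[0]`):

```bash
# Why did the previous container instance exit? (reason + exit code)
kubectl get pod payment-api-7d4b9c8f6-xkp2m -n payments \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}: exit {.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'

# How many times has it restarted?
kubectl get pod payment-api-7d4b9c8f6-xkp2m -n payments \
  -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'
```

An exit code of 1 (application error) versus 137 (SIGKILL, often OOM) narrows the search before you even open the logs.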
## Step 3: Grab the Logs
```bash
# Current logs
kubectl logs payment-api-7d4b9c8f6-xkp2m -n payments

# Previous instance logs (the one that just died)
kubectl logs payment-api-7d4b9c8f6-xkp2m -n payments --previous
```
The `--previous` flag is crucial here. Without it, you often get empty logs from a container that has only just restarted.
```
2024-12-15T03:12:44Z INFO  Starting payment-api v2.1.0
2024-12-15T03:12:44Z INFO  Connecting to database...
2024-12-15T03:12:44Z FATAL Connection refused: postgresql://payments-dbb:5432/payments
context deadline exceeded
exit status 1
```
Found it. The app can't reach the database.
## Step 4: The YAML That Caused It
After some digging, the culprit was a recent ConfigMap change during the v2.1.0 deployment:
```yaml
# BROKEN - database hostname typo
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-api-config
  namespace: payments
data:
  DATABASE_HOST: "payments-dbb"   # <-- extra 'b': typo!
  DATABASE_PORT: "5432"
  DATABASE_NAME: "payments"
```
The correct Service name was `payments-db`, not `payments-dbb`. A single-character typo in a ConfigMap.
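For completeness, the corrected ConfigMap — identical except for the hostname:

```yaml
# FIXED - hostname matches the actual Service name
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-api-config
  namespace: payments
data:
  DATABASE_HOST: "payments-db"
  DATABASE_PORT: "5432"
  DATABASE_NAME: "payments"
```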
## Step 5: Verify the Service Exists
Always verify from within the cluster:
```bash
# Check if the service actually exists
kubectl get svc -n payments | grep payments-db

# Test DNS resolution from another pod
kubectl run debug --image=busybox --rm -it --restart=Never -- \
  nslookup payments-db.payments.svc.cluster.local
```

```
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      payments-db.payments.svc.cluster.local
Address 1: 10.100.42.156 payments-db.payments.svc.cluster.local
```
The service exists. The hostname was just wrong in the config.
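DNS resolving doesn't prove the port is reachable — a NetworkPolicy or a dead backend would fail the same way. A quick TCP check from the same throwaway busybox pod closes that gap (busybox's `nc` supports `-z` for scan-only and `-v` for verbose; this needs a live cluster to run):

```bash
# Confirm the database port actually accepts connections from inside the cluster
kubectl run debug --image=busybox --rm -it --restart=Never -- \
  nc -zv payments-db.payments.svc.cluster.local 5432
```

If DNS resolves but `nc` times out, the problem shifts from "wrong hostname" to "network path or database down."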
## The Fix
```bash
# Edit the ConfigMap directly
kubectl edit configmap payment-api-config -n payments

# Or apply the corrected YAML
kubectl apply -f payment-api-config.yaml

# Then restart the deployment to pick up the new config
kubectl rollout restart deployment/payment-api -n payments

# Watch it come back up
kubectl rollout status deployment/payment-api -n payments
```

```
Waiting for deployment "payment-api" rollout to finish: 0 of 1 updated replicas are available...
deployment "payment-api" successfully rolled out
```
## Lessons Learned
| Issue | Prevention |
|---|---|
| Config typos | Use `kubectl diff` before applying |
| No validation | Add startup probes with database connectivity checks |
| Silent failures | Structured logging with exit codes |
| Slow detection | Alert on pod restart count, not just CrashLoopBackOff |
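To make the "startup probes" row concrete, here is a sketch of a container fragment for the Deployment (assumptions: the image ships `pg_isready`, and `DATABASE_HOST`/`DATABASE_PORT` are injected from the ConfigMap above):

```yaml
# Sketch: fail fast with a clear probe failure in events,
# instead of crash-looping on a generic exit 1
startupProbe:
  exec:
    command:
      - sh
      - -c
      - pg_isready -h "$DATABASE_HOST" -p "$DATABASE_PORT"
  periodSeconds: 5
  failureThreshold: 12   # allow up to ~60s for the database to become reachable
```

With this in place, `kubectl describe pod` shows the probe failure and its command output directly in events, which would have pointed at the bad hostname immediately.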
## The CKA Angle
This scenario maps directly to CKA exam tasks. You'll be asked something like:

> "A pod in namespace `prod` is failing. Identify and fix the issue."
The workflow is always:

1. `kubectl get pods` → identify the broken pod
2. `kubectl describe pod` → check events
3. `kubectl logs --previous` → check the last crash output
4. Fix the root cause (config, image, resource limits, etc.)
5. Verify with `kubectl rollout status`
**Pro tip for CKA:** always use `--previous` on logs. Half the exam tasks have containers that crash on startup.
## Wrapping Up
A CrashLoopBackOff is just Kubernetes telling you: "I keep trying but your container keeps dying." The fix is almost always in the application, not Kubernetes itself. Master the describe + logs --previous workflow and you'll resolve 90% of these in minutes.