
Kubernetes Pod CrashLoopBackOff: A Deep Dive into Real Troubleshooting

December 15, 2024 · 4 min read

Walk through a real production CrashLoopBackOff incident — from discovery to root cause to fix. Includes kubectl commands, log analysis, and the exact YAML that caused the problem.

The 3 AM Alert

It started with a PagerDuty alert at 3 AM. A critical payment service had gone into CrashLoopBackOff in our production cluster. Here's exactly how I diagnosed and fixed it — and what I learned.

Step 1: Assess the Situation

First thing: don't panic; start gathering data.

# Get overall pod status
kubectl get pods -n payments

# Output:
# NAME                          READY   STATUS             RESTARTS   AGE
# payment-api-7d4b9c8f6-xkp2m  0/1     CrashLoopBackOff   8          12m

Eight restarts in 12 minutes. Kubernetes isn't giving up on this pod; it's backing off longer between each restart attempt.
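The restart delay here is the kubelet's CrashLoopBackOff schedule: roughly a 10-second delay at first, doubling after each crash and capping at five minutes. A minimal shell sketch of that schedule (the exact constants are kubelet internals and can vary by version):

```shell
# Approximate the kubelet's CrashLoopBackOff schedule:
# ~10s initial delay, doubled after every crash, capped at 300s (5 minutes).
crashloop_backoff() {
  delay=10
  i=1
  while [ "$i" -le "$1" ]; do
    echo "restart $i: back-off ${delay}s"
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then delay=300; fi
    i=$((i + 1))
  done
}

crashloop_backoff 8
```

By the sixth restart you're waiting the full five minutes each time, which is why a crash-looping pod spends most of its life sitting in the BackOff state.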

Step 2: Check Events First

Before diving into logs, always check events — they often reveal the obvious:

kubectl describe pod payment-api-7d4b9c8f6-xkp2m -n payments | grep -A 20 Events
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  12m                default-scheduler  Successfully assigned
  Normal   Pulling    12m                kubelet            Pulling image "payment-api:v2.1.0"
  Normal   Pulled     11m                kubelet            Successfully pulled image
  Normal   Created    11m                kubelet            Created container payment-api
  Warning  BackOff    2m (x6 over 10m)   kubelet            Back-off restarting failed container

No image pull errors, no OOMKilled. The container starts and then dies. Time to check logs.
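One aside before moving on: when a pod has pages of events, filter to warnings first. `kubectl get events -n payments --field-selector type=Warning` does this server-side; for output you've already captured, a small awk filter works too (sample line below is from the incident):

```shell
# Keep only Warning rows from a captured Events table.
# Server-side alternative: kubectl get events -n payments --field-selector type=Warning
warning_events() {
  awk '$1 == "Warning"'
}

printf '%s\n' \
  '  Normal   Scheduled  12m                default-scheduler  Successfully assigned' \
  '  Warning  BackOff    2m (x6 over 10m)   kubelet            Back-off restarting failed container' \
  | warning_events
```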

Step 3: Grab the Logs

# Current logs
kubectl logs payment-api-7d4b9c8f6-xkp2m -n payments

# Previous instance logs (the one that just died)
kubectl logs payment-api-7d4b9c8f6-xkp2m -n payments --previous

The --previous flag is crucial here. Without it, you often get empty logs from a container that just started.

2024-12-15T03:12:44Z INFO  Starting payment-api v2.1.0
2024-12-15T03:12:44Z INFO  Connecting to database...
2024-12-15T03:12:44Z FATAL Connection refused: postgresql://payments-dbb:5432/payments
                           context deadline exceeded
exit status 1

Found it. The app can't reach the database.
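A habit that pays off here: extract the exact host the app dialed from the log line rather than eyeballing it, because a one-character typo is easy to skim past. A sed sketch against a connection string like the one in the log (format assumed from the output above):

```shell
# Pull the hostname out of a postgresql:// connection string in a log line.
failed_host() {
  sed -n 's|.*postgresql://\([^:/]*\)[:/].*|\1|p'
}

echo 'FATAL Connection refused: postgresql://payments-dbb:5432/payments' | failed_host
```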

Step 4: The YAML That Caused It

After some digging, the culprit was a recent ConfigMap change during the v2.1.0 deployment:

# BROKEN - database hostname typo
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-api-config
  namespace: payments
data:
  DATABASE_HOST: "payments-dbb"    # <-- extra 'b' — typo!
  DATABASE_PORT: "5432"
  DATABASE_NAME: "payments"

The correct service name was payments-db, not payments-dbb. A single character typo in a ConfigMap.
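For context, the deployment presumably injects these keys as environment variables; a typical (hypothetical) wiring looks like this, and it's also why editing the ConfigMap alone doesn't fix anything — env vars are read once at container start:

```yaml
# Hypothetical container spec consuming the ConfigMap as env vars.
# These values are captured at container start, so a config fix needs a restart.
containers:
  - name: payment-api
    image: payment-api:v2.1.0
    envFrom:
      - configMapRef:
          name: payment-api-config
```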

Step 5: Verify the Service Exists

Always verify from within the cluster:

# Check if the service actually exists
kubectl get svc -n payments | grep payments-db

# Test DNS resolution from another pod
kubectl run debug --image=busybox --rm -it --restart=Never -- \
  nslookup payments-db.payments.svc.cluster.local
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      payments-db.payments.svc.cluster.local
Address 1: 10.100.42.156 payments-db.payments.svc.cluster.local

The service exists. The hostname was just wrong in the config.
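This class of mismatch is also scriptable: compare the configured host against the Service names that actually exist. A sketch (the helper takes service names as arguments; in real use they'd come from `kubectl get svc -o name`):

```shell
# Report whether a configured DB host matches any known Service name.
config_matches_service() {
  configured="$1"
  shift
  for svc in "$@"; do
    if [ "$svc" = "$configured" ]; then
      echo "match: $svc"
      return 0
    fi
  done
  echo "no service named '$configured'"
}

# Names from the incident: the config said payments-dbb, the Service is payments-db.
config_matches_service payments-dbb payments-db
```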

The Fix

# Edit the ConfigMap directly
kubectl edit configmap payment-api-config -n payments

# Or apply the corrected YAML
kubectl apply -f payment-api-config.yaml

# Then restart the deployment to pick up the new config
kubectl rollout restart deployment/payment-api -n payments

# Watch it come back up
kubectl rollout status deployment/payment-api -n payments
Waiting for deployment "payment-api" rollout to finish: 0 of 1 updated replicas are available...
deployment "payment-api" successfully rolled out

Lessons Learned

Issue              Prevention
Config typos       Use kubectl diff before applying
No validation      Add startup probes with database connectivity checks
Silent failures    Structured logging with exit codes
Slow detection     Alert on pod restart count, not just CrashLoopBackOff
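The startup-probe idea above can be sketched like this (hypothetical values: nc must exist in the image, and the thresholds are illustrative):

```yaml
# Hypothetical startup probe: don't mark the container started
# until the configured database is actually reachable.
startupProbe:
  exec:
    command: ["sh", "-c", "nc -z \"$DATABASE_HOST\" \"$DATABASE_PORT\""]
  periodSeconds: 5
  failureThreshold: 12   # allow the DB up to ~60s before restarting
```

The pod still fails if the host is wrong, but kubectl describe now shows the probe failure directly, which cuts diagnosis time considerably.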

The CKA Angle

This scenario directly maps to CKA exam tasks. You'll be asked to:

"A pod in namespace prod is failing. Identify and fix the issue."

The workflow is always:

  1. kubectl get pods → identify broken pod
  2. kubectl describe pod → check events
  3. kubectl logs --previous → check last crash output
  4. Fix the root cause (config, image, resource limits, etc.)
  5. Verify with kubectl rollout status
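Under time pressure, that workflow is worth wrapping in a single helper (a sketch; adjust the flags to taste):

```shell
# Hypothetical triage helper: run the standard first-look commands for one pod.
triage() {
  pod="$1"
  ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns"
  kubectl describe pod "$pod" -n "$ns" | grep -A 20 Events
  kubectl logs "$pod" -n "$ns" --previous --tail=50
}
```

Usage: `triage payment-api-7d4b9c8f6-xkp2m payments`.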

Pro tip for CKA: Always use --previous on logs. Half the exam tasks have containers that crash on startup.

Wrapping Up

A CrashLoopBackOff is just Kubernetes telling you: "I keep trying but your container keeps dying." The fix is almost always in the application, not Kubernetes itself. Master the describe + logs --previous workflow and you'll resolve 90% of these in minutes.