
Kubernetes Pod CrashLoopBackOff: A Deep Dive into Real Troubleshooting

December 15, 2024 · 4 min read

Walk through a real production CrashLoopBackOff incident — from discovery to root cause to fix. Includes kubectl commands, log analysis, and the exact YAML that caused the problem.

The 3 AM Alert

It started with a PagerDuty alert at 3 AM. A critical payment service had gone into CrashLoopBackOff in our production cluster. Here's exactly how I diagnosed and fixed it — and what I learned.

Step 1: Assess the Situation

First thing: don't panic; start gathering data.

# Get overall pod status
kubectl get pods -n payments

# Output:
# NAME                          READY   STATUS             RESTARTS   AGE
# payment-api-7d4b9c8f6-xkp2m  0/1     CrashLoopBackOff   8          12m

Eight restarts in 12 minutes. Kubernetes isn't giving up on this pod; it's backing off longer between each restart attempt.
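The restart delay here is the kubelet's CrashLoopBackOff schedule: roughly a 10-second delay at first, doubling after each crash and capping at five minutes. A minimal shell sketch of that schedule (the exact constants are kubelet internals and can vary by version):

```shell
# Approximate the kubelet's CrashLoopBackOff schedule:
# ~10s initial delay, doubled after every crash, capped at 300s (5 minutes).
crashloop_backoff() {
  delay=10
  i=1
  while [ "$i" -le "$1" ]; do
    echo "restart $i: back-off ${delay}s"
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then delay=300; fi
    i=$((i + 1))
  done
}

crashloop_backoff 8
```

By the sixth restart you're waiting the full five minutes each time, which is why a crash-looping pod spends most of its life sitting in the BackOff state.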

Step 2: Check Events First

Before diving into logs, always check events — they often reveal the obvious:

kubectl describe pod payment-api-7d4b9c8f6-xkp2m -n payments | grep -A 20 Events
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  12m                default-scheduler  Successfully assigned
  Normal   Pulling    12m                kubelet            Pulling image "payment-api:v2.1.0"
  Normal   Pulled     11m                kubelet            Successfully pulled image
  Normal   Created    11m                kubelet            Created container payment-api
  Warning  BackOff    2m (x6 over 10m)   kubelet            Back-off restarting failed container

No image pull errors, no OOMKilled. The container starts and then dies. Time to check logs.
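One aside before moving on: when a pod has pages of events, filter to warnings first. `kubectl get events -n payments --field-selector type=Warning` does this server-side; for output you've already captured, a small awk filter works too (sample line below is from the incident):

```shell
# Keep only Warning rows from a captured Events table.
# Server-side alternative: kubectl get events -n payments --field-selector type=Warning
warning_events() {
  awk '$1 == "Warning"'
}

printf '%s\n' \
  '  Normal   Scheduled  12m                default-scheduler  Successfully assigned' \
  '  Warning  BackOff    2m (x6 over 10m)   kubelet            Back-off restarting failed container' \
  | warning_events
```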

Step 3: Grab the Logs

# Current logs
kubectl logs payment-api-7d4b9c8f6-xkp2m -n payments

# Previous instance logs (the one that just died)
kubectl logs payment-api-7d4b9c8f6-xkp2m -n payments --previous

The --previous flag is crucial here. Without it, you often get empty logs from a container that just started.

2024-12-15T03:12:44Z INFO  Starting payment-api v2.1.0
2024-12-15T03:12:44Z INFO  Connecting to database...
2024-12-15T03:12:44Z FATAL Connection refused: postgresql://payments-dbb:5432/payments
                           context deadline exceeded
exit status 1

Found it. The app can't reach the database.
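A habit that pays off here: extract the exact host the app dialed from the log line rather than eyeballing it, because a one-character typo is easy to skim past. A sed sketch against a connection string like the one in the log (format assumed from the output above):

```shell
# Pull the hostname out of a postgresql:// connection string in a log line.
failed_host() {
  sed -n 's|.*postgresql://\([^:/]*\)[:/].*|\1|p'
}

echo 'FATAL Connection refused: postgresql://payments-dbb:5432/payments' | failed_host
```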

Step 4: The YAML That Caused It

After some digging, the culprit was a recent ConfigMap change during the v2.1.0 deployment:

# BROKEN - database hostname typo
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-api-config
  namespace: payments
data:
  DATABASE_HOST: "payments-dbb"    # <-- extra 'b' — typo!
  DATABASE_PORT: "5432"
  DATABASE_NAME: "payments"

The correct service name was payments-db, not payments-dbb. A single character typo in a ConfigMap.
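For context, the deployment presumably injects these keys as environment variables; a typical (hypothetical) wiring looks like this, and it's also why editing the ConfigMap alone doesn't fix anything — env vars are read once at container start:

```yaml
# Hypothetical container spec consuming the ConfigMap as env vars.
# These values are captured at container start, so a config fix needs a restart.
containers:
  - name: payment-api
    image: payment-api:v2.1.0
    envFrom:
      - configMapRef:
          name: payment-api-config
```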

Step 5: Verify the Service Exists

Always verify from within the cluster:

# Check if the service actually exists
kubectl get svc -n payments | grep payments-db

# Test DNS resolution from another pod
kubectl run debug --image=busybox --rm -it --restart=Never -- \
  nslookup payments-db.payments.svc.cluster.local
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      payments-db.payments.svc.cluster.local
Address 1: 10.100.42.156 payments-db.payments.svc.cluster.local

The service exists. The hostname was just wrong in the config.
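This class of mismatch is also scriptable: compare the configured host against the Service names that actually exist. A sketch (the helper takes service names as arguments; in real use they'd come from `kubectl get svc -o name`):

```shell
# Report whether a configured DB host matches any known Service name.
config_matches_service() {
  configured="$1"
  shift
  for svc in "$@"; do
    if [ "$svc" = "$configured" ]; then
      echo "match: $svc"
      return 0
    fi
  done
  echo "no service named '$configured'"
}

# Names from the incident: the config said payments-dbb, the Service is payments-db.
config_matches_service payments-dbb payments-db
```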

The Fix

# Edit the ConfigMap directly
kubectl edit configmap payment-api-config -n payments

# Or apply the corrected YAML
kubectl apply -f payment-api-config.yaml

# Then restart the deployment to pick up the new config
kubectl rollout restart deployment/payment-api -n payments

# Watch it come back up
kubectl rollout status deployment/payment-api -n payments
Waiting for deployment "payment-api" rollout to finish: 0 of 1 updated replicas are available...
deployment "payment-api" successfully rolled out

Lessons Learned

Issue              Prevention
Config typos       Use kubectl diff before applying
No validation      Add startup probes with database connectivity checks
Silent failures    Structured logging with exit codes
Slow detection     Alert on pod restart count, not just CrashLoopBackOff
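The startup-probe idea above can be sketched like this (hypothetical values: nc must exist in the image, and the thresholds are illustrative):

```yaml
# Hypothetical startup probe: don't mark the container started
# until the configured database is actually reachable.
startupProbe:
  exec:
    command: ["sh", "-c", "nc -z \"$DATABASE_HOST\" \"$DATABASE_PORT\""]
  periodSeconds: 5
  failureThreshold: 12   # allow the DB up to ~60s before restarting
```

The pod still fails if the host is wrong, but kubectl describe now shows the probe failure directly, which cuts diagnosis time considerably.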

The CKA Angle

This scenario directly maps to CKA exam tasks. You'll be asked to:

"A pod in namespace prod is failing. Identify and fix the issue."

The workflow is always:

  1. kubectl get pods → identify broken pod
  2. kubectl describe pod → check events
  3. kubectl logs --previous → check last crash output
  4. Fix the root cause (config, image, resource limits, etc.)
  5. Verify with kubectl rollout status
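Under time pressure, that workflow is worth wrapping in a single helper (a sketch; adjust the flags to taste):

```shell
# Hypothetical triage helper: run the standard first-look commands for one pod.
triage() {
  pod="$1"
  ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns"
  kubectl describe pod "$pod" -n "$ns" | grep -A 20 Events
  kubectl logs "$pod" -n "$ns" --previous --tail=50
}
```

Usage: `triage payment-api-7d4b9c8f6-xkp2m payments`.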

Pro tip for CKA: Always use --previous on logs. Half the exam tasks have containers that crash on startup.

Wrapping Up

A CrashLoopBackOff is just Kubernetes telling you: "I keep trying but your container keeps dying." The fix is almost always in the application, not Kubernetes itself. Master the describe + logs --previous workflow and you'll resolve 90% of these in minutes.