# runbooks-troubleshooting-guides
Use when creating troubleshooting guides and diagnostic procedures for operational issues. Covers problem diagnosis, root cause analysis, and systematic debugging.
## Overview
Creating effective troubleshooting guides for diagnosing and resolving operational issues.
## Troubleshooting Framework

### The 5-Step Method

1. **Observe** - Gather symptoms and data
2. **Hypothesize** - Form theories about the root cause
3. **Test** - Validate hypotheses with experiments
4. **Fix** - Apply the solution
5. **Verify** - Confirm the resolution
## Basic Troubleshooting Guide

A basic guide follows this template:
# Troubleshooting: [Problem Statement]
## Symptoms
What the user/system is experiencing:
- API returning 503 errors
- Response time > 10 seconds
- High CPU usage alerts
## Quick Checks (< 2 minutes)
### 1. Is the service running?
```bash
kubectl get pods -n production | grep api-server
```

Expected: STATUS = Running
### 2. Are recent deploys the cause?

```bash
kubectl rollout history deployment/api-server
```

Check: Did we deploy in the last 30 minutes?
### 3. Is this affecting all users?

Check the error rate in Datadog:
- If < 5%: isolated issue, may be client-specific
- If > 50%: widespread issue, likely infrastructure
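
If the dashboard is slow to pull up, a rough estimate straight from the pod logs works too. This is a sketch that assumes an nginx-style access log where the status code is the ninth whitespace-separated field; adjust the `awk` field index for your log format.

```bash
# Rough 5xx rate from the last 5 minutes of access logs
# (assumes the HTTP status code is field 9 -- adjust for your log format)
kubectl logs deployment/api-server -n production --since=5m \
  | awk '{ total++ } $9 ~ /^5/ { errors++ } END { if (total) printf "5xx rate: %.1f%%\n", 100 * errors / total }'
```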
## Common Causes
| Symptom | Likely Cause | Quick Fix |
|---|---|---|
| 503 errors | Pod crashlooping | Restart deployment |
| Slow responses | Database connection pool | Increase pool size |
| High memory | Memory leak | Restart pods |
## Detailed Diagnosis

### Hypothesis 1: Database Connection Issues

**Test:**

```bash
# Check database connections
kubectl exec -it api-server-abc -- psql -h $DB_HOST -c "SELECT count(*) FROM pg_stat_activity"
```

**If connections > 90:** The pool is saturated. Next step: increase the pool size or investigate slow queries.
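
If the pool looks saturated, a useful follow-up (a sketch, assuming the same psql access as above) is to list the longest-running active queries:

```bash
# List the 10 longest-running non-idle queries
kubectl exec -it api-server-abc -- psql -h $DB_HOST -c "
  SELECT pid, now() - query_start AS duration, state, left(query, 80) AS query
  FROM pg_stat_activity
  WHERE state <> 'idle'
  ORDER BY duration DESC
  LIMIT 10;"
```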
### Hypothesis 2: High Traffic Spike

**Test:**

```bash
# Check request rate over the last 10 minutes (Datadog metrics query API)
NOW=$(date +%s)
curl -G -H "DD-API-KEY: ${DD_API_KEY}" -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  --data-urlencode "from=$((NOW-600))" --data-urlencode "to=${NOW}" \
  --data-urlencode "query=sum:nginx.requests{*}" \
  "https://api.datadoghq.com/api/v1/query"
```

**If requests are ~3x normal:** Traffic spike. Next step: scale up pods or enable rate limiting.
### Hypothesis 3: External Service Degradation

**Test:**

```bash
# Check third-party API latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/charges
```

**If response time > 2s:** The external service is slow. Next step: implement a circuit breaker or increase timeouts.
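
The `@curl-format.txt` file referenced above is a local timing template. A minimal version, using standard curl write-out variables, might look like:

```bash
# Create curl-format.txt for use with `curl -w "@curl-format.txt"`
cat > curl-format.txt <<'EOF'
time_namelookup:    %{time_namelookup}s
time_connect:       %{time_connect}s
time_appconnect:    %{time_appconnect}s
time_starttransfer: %{time_starttransfer}s
time_total:         %{time_total}s
EOF
```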
## Resolution Steps

### Solution A: Immediate (< 5 minutes)

Restart affected pods:

```bash
kubectl rollout restart deployment/api-server -n production
```

**When to use:** Quick mitigation while investigating the root cause.

### Solution B: Short-term (< 30 minutes)

Scale up resources:

```bash
kubectl scale deployment/api-server --replicas=10 -n production
```

**When to use:** Traffic spike or resource exhaustion.

### Solution C: Long-term (< 2 hours)

Fix the root cause:
1. Identify the slow database query
2. Add a database index
3. Deploy the code optimization

**When to use:** After the immediate pressure is relieved.
## Validation
- Error rate < 1%
- Response time p95 < 200ms
- CPU usage < 70%
- No active alerts
## Prevention
How to prevent this issue in the future:
- Add monitoring alert for connection pool saturation
- Implement auto-scaling based on request rate (see the sketch below)
- Set up load testing to find capacity limits
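
For the auto-scaling item, a CPU-based HorizontalPodAutoscaler is a common starting point; true request-rate scaling needs a custom or external metric. The thresholds below are illustrative, not recommendations.

```bash
# Scale between 3 and 15 replicas, targeting ~70% average CPU
kubectl autoscale deployment/api-server -n production --min=3 --max=15 --cpu-percent=70
```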
## Decision Tree Format
# Troubleshooting: Slow API Responses

## Start Here

```
        Check response time
                |
        +-------+-------+
        |               |
     < 500ms         > 500ms
        |               |
NOT THIS RUNBOOK   Continue below
```
## Step 1: Locate the Slowness
```bash
# Check which service is slow
curl -w "@timing.txt" https://api.example.com/users
```

**Decision:**
- Time to first byte > 2s → database is slow (go to Step 2)
- Time to first byte < 100ms → network is slow (go to Step 3)
- Timeout → service is down (go to Step 4)
## Step 2: Database Diagnosis

```bash
# Check active queries
psql -c "SELECT query, state, query_start FROM pg_stat_activity WHERE state != 'idle'"
```

**Decision:**
- Query running > 5s → slow query (Solution A)
- Many "idle in transaction" sessions → connection leak (Solution B)
- High connection count → pool exhausted (Solution C)
### Solution A: Optimize Slow Query

1. Identify the slow query from above
2. Run `EXPLAIN ANALYZE` on it
3. Add the missing index or optimize the query (see the sketch below)
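
A minimal sketch of steps 2-3, assuming the slow query filters a hypothetical `orders` table on `customer_id`:

```bash
# Inspect the plan, then add the index the planner is missing (table and column are hypothetical)
psql -c "EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;"
psql -c "CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);"
```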
### Solution B: Fix Connection Leak

1. Restart application pods
2. Review code for unclosed connections
3. Add a connection timeout (see the sketch below)
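
As a server-side guardrail for step 3, Postgres can kill sessions that sit idle inside a transaction. This is a sketch; the database name `app_db` and the 60-second value are assumptions.

```bash
# Terminate sessions that stay "idle in transaction" longer than 60 seconds
psql -c "ALTER DATABASE app_db SET idle_in_transaction_session_timeout = '60s';"
```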
### Solution C: Increase Connection Pool

1. Edit the database config
2. Increase `max_connections`
3. Update the application pool size
## Step 3: Network Diagnosis
... (continue with network troubleshooting)
## Layered Troubleshooting
### Layer 1: Application
#### Check Application Health

1. **Health endpoint:**
   ```bash
   curl https://api.example.com/health
   ```
2. **Application logs:**
   ```bash
   kubectl logs deployment/api-server --tail=100 | grep ERROR
   ```
3. **Application metrics:**
   - Request rate
   - Error rate
   - Response time percentiles
#### Common Application Issues

**Memory Leak**
- Symptom: Memory usage climbing over time
- Test: Check memory metrics in Datadog
- Fix: Restart pods, investigate with a heap dump

**Thread Starvation**
- Symptom: Slow responses, high CPU
- Test: Thread dump analysis
- Fix: Increase the thread pool size

**Code Bug**
- Symptom: Specific endpoints fail
- Test: Review recent deploys
- Fix: Roll back or hotfix (see the rollback sketch below)
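
When a recent deploy is the culprit, rolling back is usually the fastest fix. A sketch using the same deployment name as the examples above:

```bash
# Roll back to the previous revision and watch it complete
kubectl rollout undo deployment/api-server -n production
kubectl rollout status deployment/api-server -n production
```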
### Layer 2: Infrastructure
#### Check Infrastructure Health

1. **Node resources:**
   ```bash
   kubectl top nodes
   ```
2. **Pod resources:**
   ```bash
   kubectl top pods -n production
   ```
3. **Network connectivity:**
   ```bash
   kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- ping database.internal
   ```
#### Common Infrastructure Issues

**Node Under Pressure**
- Symptom: Pods evicted, slow scheduling
- Test: `kubectl describe node` and look for pressure conditions (see the sketch below)
- Fix: Scale the node pool or add nodes
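
A quick way to spot pressure conditions (the node name is a placeholder):

```bash
# The Conditions section lists MemoryPressure, DiskPressure, and PIDPressure flags
kubectl describe node <node-name> | grep -A 10 "Conditions:"
```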
**Network Partition**
- Symptom: Intermittent timeouts
- Test: Run MTR between pods and the destination (see the sketch below)
- Fix: Check security groups and routing tables
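
MTR can be run from the same netshoot debug image used above; the destination hostname is illustrative:

```bash
# Report per-hop packet loss and latency over 50 probes
kubectl run -it --rm mtr-debug --image=nicolaka/netshoot --restart=Never -- \
  mtr --report --report-cycles 50 database.internal
```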
**Disk I/O Saturation**
- Symptom: Slow database, high latency
- Test: Check IOPS metrics in CloudWatch (see the sketch below)
- Fix: Increase provisioned IOPS
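
For an RDS-backed database, a CloudWatch check might look like the sketch below; the instance identifier `prod-db` is an assumption, and the `date -d` syntax is GNU-specific.

```bash
# Average ReadIOPS over the last hour in 5-minute buckets
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name ReadIOPS \
  --dimensions Name=DBInstanceIdentifier,Value=prod-db \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 --statistics Average
```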
### Layer 3: External Dependencies
#### Check External Services

1. **Third-party APIs:**
   ```bash
   curl -w "@timing.txt" https://api.stripe.com/health
   ```
2. **Status pages:**
   - Check status.stripe.com
   - Check status.aws.amazon.com
3. **DNS resolution:**
   ```bash
   nslookup api.stripe.com
   dig api.stripe.com
   ```
#### Common External Issues

**API Rate Limiting**
- Symptom: 429 responses from the external service
- Test: Check rate-limit headers (see the sketch below)
- Fix: Implement backoff, cache responses
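
Rate-limit headers vary by provider, so treat the header names below as common conventions rather than a guarantee; the URL is a placeholder:

```bash
# Dump response headers and pick out anything that looks like a rate limit
curl -s -D - -o /dev/null https://api.example.com/v1/resource | grep -iE 'ratelimit|retry-after'
```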
**Service Degradation**
- Symptom: Slow external API responses
- Test: Check their status page
- Fix: Implement a circuit breaker, use a fallback
**DNS Failure**
- Symptom: Cannot resolve the hostname
- Test: DNS queries (see the sketch below)
- Fix: Check the DNS config, try an alternative resolver
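
Comparing the default resolver against a public one quickly shows whether the problem is local; the resolver IP is illustrative:

```bash
# Resolve via the default resolver, then via a public resolver for comparison
dig +short api.stripe.com
dig +short @8.8.8.8 api.stripe.com
```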
## Systematic Debugging
### Use the Scientific Method
# Debugging: Database Connection Failures
## 1. Observation
**What we know:**
- Error: "connection refused" in logs
- Started: 2025-01-15 14:30 UTC
- Frequency: Every database query fails
- Scope: All pods affected
## 2. Hypothesis
**Possible causes:**
1. Database instance is down
2. Security group blocking traffic
3. Network partition
4. Wrong credentials
## 3. Test Each Hypothesis
### Test 1: Database instance status
```bash
aws rds describe-db-instances --db-instance-identifier prod-db | jq '.DBInstances[0].DBInstanceStatus'
```

- **Result:** "available"
- **Conclusion:** Database is running → Hypothesis 1 rejected
### Test 2: Security group rules

```bash
aws ec2 describe-security-groups --group-ids sg-abc123 | jq '.SecurityGroups[0].IpPermissions'
```

- **Result:** Port 5432 open only to 10.0.0.0/16
- **Pod IP:** 10.1.0.5
- **Conclusion:** Pod IP not in the allowed range → ROOT CAUSE FOUND
## 4. Fix

Update the security group:

```bash
aws ec2 authorize-security-group-ingress \
  --group-id sg-abc123 \
  --protocol tcp \
  --port 5432 \
  --cidr 10.1.0.0/16
```
## 5. Verify

Test the connection from a pod:

```bash
kubectl exec -it api-server-abc -- psql -h prod-db.rds.amazonaws.com -c "SELECT 1"
```

**Result:** Success ✓
## Time-Boxed Investigation
# Troubleshooting: Production Outage
**Time Box:** Spend MAX 15 minutes investigating before escalating.
## First 5 Minutes: Quick Wins
- [ ] Check pod status
- [ ] Check recent deploys
- [ ] Check external status pages
- [ ] Review monitoring dashboards
**If issue persists:** Continue to next phase.
## Minutes 5-10: Common Causes
- [ ] Restart pods (quick mitigation)
- [ ] Check database connectivity
- [ ] Review application logs
- [ ] Check resource limits
**If issue persists:** Continue to next phase.
## Minutes 10-15: Deep Dive
- [ ] Enable debug logging
- [ ] Capture thread dump
- [ ] Check for memory leaks
- [ ] Review network traces
**If issue persists:** ESCALATE to senior engineer.
## Escalation
**Escalate to:** Platform Team Lead
**Provide:**
- Timeline of issue
- Tests performed
- Current error rate
- Mitigation attempts
## Common Troubleshooting Patterns

### Binary Search

#### Finding Which Service is Slow

Use binary search to narrow down the problem:

1. **Check the full request:** 5000ms total
2. **Check the first half (API → database):** 4900ms → the problem is in the database query
3. **Check the database:** the query takes 4800ms
4. **Check the query plan:** sequential scan on a large table
5. **Root cause:** missing index

**Fix:** Add an index on the frequently queried column.
### Correlation Analysis

#### Finding Related Events
Look for patterns and correlations:
**Timeline:**
- 14:25 - Deploy completed
- 14:30 - Error rate spike
- 14:35 - Database CPU at 100%
- 14:40 - Requests timing out
**Correlation:** Deploy introduced N+1 query.
**Evidence:**
- No config changes
- No infrastructure changes
- Only code deploy
- Error coincides with deploy
**Action:** Rollback deploy.
## Anti-Patterns

### Don't Skip Obvious Checks

**Bad - jump to complex solutions:** "The database is slow. Must be a query optimization issue. Let's analyze query plans..."

**Good - check the basics first:**
1. Is the database actually running?
2. Can we connect to it?
3. Are there any locks?
4. What does the slow query log show?
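
For the lock check in the list above, a quick query against `pg_locks` (a sketch) shows which queries are waiting on locks that have not been granted:

```bash
# Show blocked lock requests and the query waiting on each
psql -c "SELECT pid, mode, relation::regclass AS relation, left(query, 60) AS query
         FROM pg_locks JOIN pg_stat_activity USING (pid)
         WHERE NOT granted;"
```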
### Don't Guess Randomly

**Bad - random changes:** "API errors? Let's try:"
- Restarting the database
- Scaling to 100 pods
- Changing the load balancer config
- Updating the kernel

**Good - systematic approach:**
1. What is the actual error message?
2. When did it start?
3. What changed before it started?
4. Can we reproduce it?
### Don't Skip Documentation

**Bad - no notes:** "Fixed it. I restarted some pods and now it works."

**Good - document findings:**
- **Root Cause:** Memory leak in the worker process
- **Evidence:** Pod memory climbing linearly over 6 hours
- **Temporary Fix:** Restarted pods
- **Long-term Fix:** PR #1234 fixes the memory leak
- **Prevention:** Added memory usage alerts
## Related Skills
- runbook-structure: Organizing operational documentation
- incident-response: Handling production incidents