📖 sre-monitoring-and-observability
Use when building comprehensive monitoring and observability systems.
Overview
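This guide covers building comprehensive monitoring and observability systems: the four golden signals, SLIs, alerting, dashboards, distributed tracing, and structured logging.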
Four Golden Signals
Latency
Time to process requests:
# Request duration
http_request_duration_seconds
# p95 latency over the last 5 minutes, aggregated across all series
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
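To make the percentile query less of a black box, here is a minimal sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is roughly what histogram_quantile does for classic histograms. The function name estimateQuantile and the input shape ({ le, count } objects sorted by upper bound, ending with an Infinity bucket) are assumptions made for illustration, not part of any library.

```javascript
// Estimate a quantile from cumulative histogram buckets.
// `buckets` (hypothetical shape): [{ le: upperBound, count: cumulativeCount }],
// sorted by `le`, with the last bucket having le = Infinity.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;

  // First bucket whose cumulative count reaches the target rank.
  const i = buckets.findIndex((b) => b.count >= rank);

  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;

  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - lowerCount;

  // Linear interpolation, assuming observations are spread evenly in the bucket.
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - lowerCount) / inBucket);
}

const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.95, buckets));
```

Because the result is interpolated inside a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.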
Traffic
Demand on the system:
# Requests per second
rate(http_requests_total[5m])
# By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
Errors
Rate of failed requests:
# Error rate (aggregate both sides so their label sets match)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Error budget remaining (pseudocode: error_rate is the ratio above, slo_target is e.g. 0.999)
1 - (error_rate / (1 - slo_target))
Saturation
Resource utilization:
# CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100
Service Level Indicators (SLIs)
Availability SLI
# Successful requests / Total requests
sum(rate(http_requests_total{status=~"[23].."}[30d]))
/
sum(rate(http_requests_total[30d]))
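The same ratio can be worked through by hand to see how an availability SLI relates to an error budget. The sketch below assumes raw request counts for the window and an SLO target below 1 (for example 0.999); the function name availabilityReport and the target value are illustrative assumptions, not values from this guide.

```javascript
// Availability SLI and error budget from raw request counts.
// Assumes 0 < sloTarget < 1; sloTarget is a hypothetical objective.
function availabilityReport(goodRequests, totalRequests, sloTarget) {
  if (totalRequests === 0) {
    // No traffic: nothing failed and no budget was spent.
    return { sli: 1, allowedFailures: 0, actualFailures: 0, budgetRemaining: 1 };
  }
  const sli = goodRequests / totalRequests;
  const allowedFailures = (1 - sloTarget) * totalRequests; // total error budget
  const actualFailures = totalRequests - goodRequests;
  // Fraction of the budget still unspent; negative once the SLO is violated.
  const budgetRemaining = 1 - actualFailures / allowedFailures;
  return { sli, allowedFailures, actualFailures, budgetRemaining };
}

// 500 failures out of 1,000,000 requests against a 99.9% target
// uses about half of the roughly 1,000 failures the budget allows.
console.log(availabilityReport(999500, 1000000, 0.999));
```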
Latency SLI
# Requests faster than threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))
Throughput SLI
# Fraction of capacity in use, capped at 1.0
# (assumes capacity_requests_per_second is a single-series metric)
clamp_max(
  sum(rate(http_requests_total[5m])) / scalar(capacity_requests_per_second),
  1.0
)
Alerting
Alert Severity Levels
P0 - Critical: Service down or severe degradation
P1 - High: Significant impact, error budget at risk
P2 - Medium: Degradation, not user-facing yet
P3 - Low: Awareness, no immediate action needed
Example Alerts
# High error rate
groups:
  - name: sre
    rules:
      - alert: HighErrorRate
        # Aggregate per service so the 5xx and total series share labels
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
      - alert: LatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 10m
        labels:
          severity: warning
      - alert: ErrorBudgetBurn
        # sli_availability and error_budget_remaining are assumed to be recording rules
        expr: |
          (1 - sli_availability) > (error_budget_remaining * 10)
        for: 1h
        labels:
          severity: high
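Error budget alerts are often phrased in terms of a burn rate, which is a different formulation from the ErrorBudgetBurn expression above: the observed error ratio divided by the error ratio the SLO allows. The sketch below illustrates that idea with a two-window check. The function names and the default threshold of 14.4 (which corresponds to spending 2% of a 30-day budget in one hour) are assumptions for illustration.

```javascript
// Burn rate: how many times faster than "exactly on budget" errors occur.
// Assumes 0 < sloTarget < 1.
function burnRate(errorRatio, sloTarget) {
  return errorRatio / (1 - sloTarget);
}

// Page only when both a long and a short window are burning fast, so the
// alert clears soon after recovery. The threshold is an assumed example.
function shouldPage(longWindowErrorRatio, shortWindowErrorRatio, sloTarget, threshold = 14.4) {
  return (
    burnRate(longWindowErrorRatio, sloTarget) > threshold &&
    burnRate(shortWindowErrorRatio, sloTarget) > threshold
  );
}

// 2% errors against a 99.9% SLO burns budget about 20 times too fast.
console.log(burnRate(0.02, 0.999), shouldPage(0.02, 0.02, 0.999));
```

A burn rate of 1 means the budget would be used up exactly at the end of the SLO window; anything above 1 exhausts it early.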
Dashboards
Overview Dashboard
- Service health (red/yellow/green)
- Request rate
- Error rate
- Latency percentiles (p50, p95, p99)
- Saturation metrics
Detailed Dashboard
- Per-endpoint metrics
- Dependency health
- Database performance
- Cache hit rates
- Queue depths
Distributed Tracing
OpenTelemetry
// SpanStatusCode must be imported alongside trace
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-service');

async function handleRequest(req) {
  const span = tracer.startSpan('handle_request');
  try {
    span.setAttribute('user.id', req.user.id);
    span.setAttribute('request.path', req.path);
    const result = await processRequest(req);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    // Attach the exception to the span before marking it failed
    span.recordException(error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    throw error;
  } finally {
    // Always end the span so it is exported
    span.end();
  }
}
Structured Logging
logger.info('request_processed', {
  request_id: req.id,
  user_id: req.user.id,
  endpoint: req.path,
  method: req.method,
  status_code: res.statusCode,
  duration_ms: duration,
  error: error?.message,
});
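The logger object in the call above is not defined in this guide. As a minimal sketch of what such a logger could look like, the following emits one JSON object per line so that log pipelines can parse fields without regular expressions. createLogger and its write parameter are hypothetical helpers, not the API of any particular logging library.

```javascript
// Minimal JSON-lines logger sketch (hypothetical helper, not a real library).
// `write` receives one serialized line; it defaults to stdout.
function createLogger(write = (line) => process.stdout.write(line + '\n')) {
  const log = (level) => (event, fields = {}) => {
    const entry = { timestamp: new Date().toISOString(), level, event, ...fields };
    // JSON.stringify drops keys whose value is undefined, so optional
    // fields such as `error` disappear when there is nothing to report.
    write(JSON.stringify(entry));
  };
  return { info: log('info'), warn: log('warn'), error: log('error') };
}

const logger = createLogger();
logger.info('request_processed', {
  request_id: 'req-1',
  status_code: 200,
  duration_ms: 12,
  error: undefined,
});
```

Passing a custom write function makes it straightforward to redirect output to a file, a buffer, or a test harness.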
Best Practices
USE Method
For resources:
- Utilization: % time resource is busy
- Saturation: Work queued but not serviced
- Errors: Error count
RED Method
For requests:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Request latency distribution
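The three RED measurements can be sketched directly from a window of request samples. This is an illustration only: the sample shape ({ status, durationMs }), the windowSeconds parameter, and the choice to count status codes of 500 and above as errors are all assumptions, and a production system would normally derive these values from counters and histograms instead of raw samples.

```javascript
// Compute RED metrics from request samples collected over a time window.
// Sample shape { status, durationMs } and windowSeconds are assumptions.
function redMetrics(samples, windowSeconds) {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const errors = samples.filter((s) => s.status >= 500).length;

  // Nearest-rank percentile over the sorted durations.
  const percentile = (p) =>
    durations.length === 0
      ? NaN
      : durations[Math.max(0, Math.ceil(p * durations.length) - 1)];

  return {
    ratePerSecond: samples.length / windowSeconds,
    errorsPerSecond: errors / windowSeconds,
    durationMs: { p50: percentile(0.5), p95: percentile(0.95), p99: percentile(0.99) },
  };
}

const samples = [
  { status: 200, durationMs: 40 },
  { status: 200, durationMs: 55 },
  { status: 503, durationMs: 900 },
  { status: 200, durationMs: 70 },
];
console.log(redMetrics(samples, 2));
```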
Alert on Symptoms, Not Causes
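# Good - alert on user impact
- alert: HighLatency
  expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
# Bad - alert on potential cause
- alert: HighCPU
  expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80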
Runbook Links
annotations:
  runbook: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard: "https://grafana.example.com/d/abc123"