
📖 sre-monitoring-and-observability

Use when building comprehensive monitoring and observability systems.



Overview

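This skill covers the building blocks of a monitoring and observability system: the four golden signals, service level indicators (SLIs), alerting, dashboards, distributed tracing, and structured logging.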

Four Golden Signals

Latency

Time to process requests:

# Request duration
http_request_duration_seconds

# Query: p95 latency, aggregated across all series
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
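
To make the quantile query less opaque, here is a minimal JavaScript sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is roughly the idea behind `histogram_quantile`. The function name `estimateQuantile` and the sample bucket counts are made up for illustration, and the real Prometheus implementation handles more edge cases.

```javascript
// Estimate a quantile from cumulative histogram buckets.
// `buckets` is sorted by upper bound `le`, and `count` is cumulative,
// mirroring Prometheus `_bucket` series. The last bucket must be +Inf.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;

  for (let i = 0; i < buckets.length; i++) {
    if (buckets[i].count >= rank) {
      // The rank falls in the +Inf bucket: report the highest finite bound.
      if (buckets[i].le === Infinity) {
        return i > 0 ? buckets[i - 1].le : NaN;
      }
      const lowerBound = i > 0 ? buckets[i - 1].le : 0;
      const lowerCount = i > 0 ? buckets[i - 1].count : 0;
      const inBucket = buckets[i].count - lowerCount;
      if (inBucket === 0) return lowerBound;
      // Assume observations are spread evenly inside the bucket.
      const fraction = (rank - lowerCount) / inBucket;
      return lowerBound + (buckets[i].le - lowerBound) * fraction;
    }
  }
  return NaN;
}

// Hypothetical bucket counts for one 5-minute window.
const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 98 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.95, buckets)); // 0.8125
```

Because the estimate interpolates inside a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.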

Traffic

Demand on the system:

# Requests per second
rate(http_requests_total[5m])

# By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)

Errors

Rate of failed requests:

# Error rate (aggregate both sides so the label sets match)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Error budget remaining (placeholder names; slo_target is e.g. 0.999)
1 - (error_rate / (1 - slo_target))
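
One common way to relate an error rate to an SLO is through the error budget: the allowed error ratio is `1 - sloTarget`, and the budget remaining is the share of that allowance not yet consumed. The sketch below works through the arithmetic; `errorBudgetRemaining` and its parameters are hypothetical names, not part of any library.

```javascript
// Fraction of the error budget still unspent over a window.
// sloTarget is the success objective (for example 0.999), so the
// allowed error ratio is 1 - sloTarget.
function errorBudgetRemaining(failed, total, sloTarget) {
  if (total === 0) return 1; // no traffic means no budget was spent
  const errorRate = failed / total;
  const allowedErrorRate = 1 - sloTarget;
  return 1 - errorRate / allowedErrorRate;
}

// 50 failures out of 100,000 requests against a 99.9% SLO:
// about half of the budget is left.
console.log(errorBudgetRemaining(50, 100000, 0.999).toFixed(3));
```

A negative result means the budget is overspent for the window.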

Saturation

Resource utilization:

# CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) 
/ node_memory_MemTotal_bytes * 100

Service Level Indicators (SLIs)

Availability SLI

# Successful requests / Total requests
sum(rate(http_requests_total{status=~"[23].."}[30d]))
/
sum(rate(http_requests_total[30d]))

Latency SLI

# Requests faster than threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))

Throughput SLI

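# Requests processed within capacity
# (capacity_requests_per_second is assumed to be a single-series gauge or recording rule)
clamp_max(
  sum(rate(http_requests_total[5m])) / scalar(capacity_requests_per_second),
  1.0
)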
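
The three SLIs above are all ratios of "good" events to total events or to capacity. As a rough illustration, the sketch below computes them from raw counts for a single window. The function `computeSlis`, its input shape, and the sample numbers are assumptions made for this example.

```javascript
// Compute availability, latency, and throughput SLIs for one window.
// statusCounts maps HTTP status codes to request counts, fastRequests is
// the number of requests at or under the latency threshold, and
// capacityRps is the provisioned requests per second.
function computeSlis({ statusCounts, fastRequests, windowSeconds, capacityRps }) {
  const total = Object.values(statusCounts).reduce((sum, n) => sum + n, 0);
  if (total === 0) {
    return { availability: NaN, latency: NaN, throughput: 0 };
  }
  // 2xx and 3xx responses count as successful, as in the availability query.
  const good = Object.entries(statusCounts)
    .filter(([status]) => /^[23]\d\d$/.test(status))
    .reduce((sum, [, n]) => sum + n, 0);

  return {
    availability: good / total,
    latency: fastRequests / total,
    // Same role as clamp_max(..., 1.0): never report more than 100%.
    throughput: Math.min(total / windowSeconds / capacityRps, 1.0),
  };
}

console.log(
  computeSlis({
    statusCounts: { 200: 900, 302: 50, 404: 30, 500: 20 },
    fastRequests: 800,
    windowSeconds: 10,
    capacityRps: 50,
  })
);
```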

Alerting

Alert Severity Levels

  • P0 - Critical: Service down or severe degradation
  • P1 - High: Significant impact, error budget at risk
  • P2 - Medium: Degradation, not user-facing yet
  • P3 - Low: Awareness, no immediate action needed

Example Alerts

# High error rate
groups:
  - name: sre
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          
      - alert: LatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 10m
        labels:
          severity: warning
          
      - alert: ErrorBudgetBurn
        expr: |
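          # 10x burn rate: the error ratio exceeds 10 times the allowed error ratio.
          # sli_availability and slo_target are placeholder series.
          (1 - sli_availability) > (10 * (1 - slo_target))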
        for: 1h
        labels:
          severity: high
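
Burn rate is one common way to decide when error budget consumption deserves an alert: it is the observed error ratio divided by the error ratio the SLO allows. A burn rate of 1 spends the budget exactly over the SLO window, and a burn rate of 10 spends it ten times faster. The sketch below shows the arithmetic; the function names and the 30-day window are assumptions for illustration.

```javascript
// Burn rate: how many times faster than "sustainable" the error budget
// is being consumed. sloTarget is the success objective, e.g. 0.999.
function burnRate(errorRatio, sloTarget) {
  return errorRatio / (1 - sloTarget);
}

// Hours until a full budget is gone if the burn rate stays constant.
function hoursToExhaustion(rate, windowDays) {
  return rate > 0 ? (windowDays * 24) / rate : Infinity;
}

// A 1% error ratio against a 99.9% SLO burns roughly 10x too fast,
// which would exhaust a 30-day budget in about three days.
const rate = burnRate(0.01, 0.999);
console.log(rate.toFixed(1), hoursToExhaustion(10, 30));
```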

Dashboards

Overview Dashboard

  • Service health (red/yellow/green)
  • Request rate
  • Error rate
  • Latency percentiles (p50, p95, p99)
  • Saturation metrics

Detailed Dashboard

  • Per-endpoint metrics
  • Dependency health
  • Database performance
  • Cache hit rates
  • Queue depths

Distributed Tracing

OpenTelemetry

const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('my-service');

async function handleRequest(req) {
  const span = tracer.startSpan('handle_request');
  
  try {
    span.setAttribute('user.id', req.user.id);
    span.setAttribute('request.path', req.path);
    
    const result = await processRequest(req);
    
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    throw error;
  } finally {
    span.end();
  }
}
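
The try/catch/finally pattern above tends to repeat in every handler, so it can be factored into a small wrapper. The following is a sketch only: `withSpan` is a hypothetical helper, and the tracer and status codes are passed in as arguments so the snippet does not depend on any installed package. In a real service they would be `trace.getTracer('my-service')` and `SpanStatusCode` from `@opentelemetry/api`.

```javascript
// Run `fn` inside a span, setting status and ending the span in one place.
// `tracer` must provide startSpan(name); `statusCodes` must provide OK and
// ERROR (both are assumed to come from '@opentelemetry/api' in real use).
async function withSpan(tracer, statusCodes, name, attributes, fn) {
  const span = tracer.startSpan(name);
  try {
    for (const [key, value] of Object.entries(attributes)) {
      span.setAttribute(key, value);
    }
    const result = await fn(span);
    span.setStatus({ code: statusCodes.OK });
    return result;
  } catch (error) {
    // Attach the exception to the span before marking it as failed.
    span.recordException(error);
    span.setStatus({ code: statusCodes.ERROR, message: error.message });
    throw error;
  } finally {
    // Always end the span, even when the wrapped function throws.
    span.end();
  }
}
```

A handler could then call `withSpan(tracer, SpanStatusCode, 'handle_request', { 'request.path': req.path }, () => processRequest(req))`.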

Structured Logging

logger.info('request_processed', {
  request_id: req.id,
  user_id: req.user.id,
  endpoint: req.path,
  method: req.method,
  status_code: res.statusCode,
  duration_ms: duration,
  error: error?.message,
});
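
The `logger` object in the snippet above is not defined in this document. As a minimal sketch of what such a logger might look like, the following emits one JSON object per line; `createLogger` and its `write` parameter are hypothetical, and a production service would more likely use an established logging library.

```javascript
// Minimal structured logger: each call produces one JSON line containing
// a timestamp, a level, an event name, and the caller's fields.
function createLogger(write = (line) => console.log(line)) {
  const log = (level) => (event, fields = {}) => {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      event,
      ...fields,
    };
    // JSON.stringify omits undefined values, so optional fields such as
    // `error` simply disappear when there is nothing to report.
    write(JSON.stringify(entry));
  };
  return { info: log('info'), warn: log('warn'), error: log('error') };
}

const logger = createLogger();
logger.info('request_processed', {
  request_id: 'req-123', // sample values for illustration
  status_code: 200,
  duration_ms: 42,
  error: undefined,
});
```

Keeping every entry as a flat JSON object with stable field names makes logs straightforward to filter and aggregate by field.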

Best Practices

USE Method

For resources:

  • Utilization: % time resource is busy
  • Saturation: Work queued but not serviced
  • Errors: Error count

RED Method

For requests:

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Request latency distribution
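
As a rough illustration of the RED method, the sketch below derives all three measurements from a list of request records collected over one window. The record shape (`status`, `durationMs`), the function name `redMetrics`, and the choice of nearest-rank percentiles are assumptions for this example; a metrics backend would normally compute these from counters and histograms instead.

```javascript
// Derive Rate, Errors, and Duration from raw request records.
function redMetrics(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const failed = requests.filter((r) => r.status >= 500).length;

  // Nearest-rank percentile over the sorted durations.
  const percentile = (p) => {
    if (durations.length === 0) return NaN;
    const index = Math.max(0, Math.ceil(p * durations.length) - 1);
    return durations[index];
  };

  return {
    ratePerSecond: requests.length / windowSeconds,
    errorsPerSecond: failed / windowSeconds,
    durationMs: {
      p50: percentile(0.5),
      p95: percentile(0.95),
      p99: percentile(0.99),
    },
  };
}

// Four hypothetical requests observed over a 2-second window.
const requests = [
  { status: 200, durationMs: 100 },
  { status: 200, durationMs: 200 },
  { status: 500, durationMs: 300 },
  { status: 200, durationMs: 400 },
];
console.log(redMetrics(requests, 2));
```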

Alert on Symptoms, Not Causes

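# Good - alert on user impact
- alert: HighLatency
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 1

# Bad - alert on potential cause
- alert: HighCPU
  expr: |
    100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80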

Runbook Links

annotations:
  runbook: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard: "https://grafana.example.com/d/abc123"