Alerting Best Practices
Effective alerting for production systems: alert design, routing, grouping, and noise reduction.
Alert Design Principles
- Symptom-based: alert on user-visible impact (e.g., error rate), not on a possible cause (e.g., high CPU)
- Actionable: every alert requires someone to take action
- Runbook link: every alert links to a runbook
- Appropriate severity: P1 pages the on-call engineer, P3 creates a ticket
- Tuned thresholds: minimize false positives
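The principles above can be combined in a single alerting rule. The following is a minimal sketch, assuming the alerts are produced by Prometheus and sent to Alertmanager. The metric `http_requests_total`, its `code` label, the `checkout` service, the `shop` namespace, the 5% threshold, and the runbook URL are illustrative assumptions, not values from a real system. `runbook_url` is a common annotation naming convention rather than a built-in field.

```yaml
# Prometheus rule file (sketch; all names and thresholds are hypothetical)
groups:
  - name: checkout-availability
    rules:
      - alert: CheckoutHighErrorRate
        # Symptom-based: ratio of 5xx responses to all responses
        expr: |
          sum(rate(http_requests_total{service="checkout", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{service="checkout"}[5m]))
            > 0.05
        # Tuned threshold: the condition must hold for 5 minutes before firing
        for: 5m
        labels:
          severity: critical   # routed to the paging receiver
          namespace: shop
          service: checkout
        annotations:
          summary: "More than 5% of checkout requests are failing"
          # Placeholder URL; point this at the real runbook
          runbook_url: "https://runbooks.example.com/checkout-high-error-rate"
```

The `severity`, `namespace`, and `service` labels are what an Alertmanager configuration can later use for routing, grouping, and inhibition.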
Routing and Grouping
# Alertmanager config
route:
  group_by: [alertname, namespace, service]
  group_wait: 30s        # wait before sending the first notification for a new group
  group_interval: 5m     # wait before notifying about new alerts added to a group
  repeat_interval: 4h    # wait before re-sending a notification that is still firing
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: pagerduty
      group_wait: 10s
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: slack
    - match:
        severity: info
      receiver: email
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [namespace, service]
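The routes above refer to four receivers (`default`, `pagerduty`, `slack`, `email`) that the configuration must also define before Alertmanager will load it. A possible sketch of that part follows. Every key, URL, channel, and address is a placeholder assumption to be replaced with real values, and the `default` receiver is deliberately left without integrations so that unmatched alerts produce no notification.

```yaml
# Receiver definitions for the routes above (sketch; all values are placeholders)
global:
  # SMTP settings used by email_configs (hypothetical host and sender)
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"

receivers:
  # No integrations: alerts that match no child route are not notified anywhere
  - name: default
  - name: pagerduty
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"
  - name: slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/<webhook-path>"
        channel: "#alerts"
        send_resolved: true   # also post when the alert resolves
  - name: email
    email_configs:
      - to: "oncall@example.com"
```

Whether `default` should stay silent is a design choice; some teams prefer to send unmatched alerts to a low-priority channel so that mislabeled alerts are noticed.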
Noise Reduction
- Inhibition: a critical alert suppresses warning alerts for the same service
- Silences: temporarily mute alerts during maintenance
- Deduplication: group related alerts into a single notification
- SLO burn rate: alert on the SLO burn rate instead of individual metrics
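As an illustration of the burn-rate approach, the sketch below assumes a 99.9% availability SLO over 30 days, which leaves an error budget of 0.1% (0.001). A burn rate of 14.4 means the budget is being spent 14.4 times faster than allowed, which consumes 2% of a 30-day budget in one hour (0.02 × 720 h = 14.4). The metric and label names are hypothetical, and the long-window/short-window pairing is one commonly used pattern, not the only option.

```yaml
# Prometheus rule file (sketch; metric names, labels, and SLO are assumptions)
groups:
  - name: checkout-slo-burn
    rules:
      - alert: CheckoutErrorBudgetFastBurn
        # Fires only if both the 1h and the 5m error ratio exceed
        # 14.4 x the error budget (0.001), i.e. an error ratio above 1.44%.
        # The short window makes the alert stop quickly once errors subside.
        expr: |
          (
            sum(rate(http_requests_total{service="checkout", code=~"5.."}[1h]))
              /
            sum(rate(http_requests_total{service="checkout"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{service="checkout", code=~"5.."}[5m]))
              /
            sum(rate(http_requests_total{service="checkout"}[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          namespace: shop
          service: checkout
        annotations:
          summary: "Checkout is burning its error budget 14.4x too fast"
          runbook_url: "https://runbooks.example.com/checkout-error-budget"
```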
Alert Quality Metrics
- False positive rate: < 5% (target)
- Alert-to-incident ratio: the share of alerts that lead to action
- MTTA (Mean Time to Acknowledge): < 5 min for P1
- Alerts per on-call shift: < 5 (target)
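Part of this can be approximated from Alertmanager's own metrics. The sketch below assumes that Prometheus scrapes Alertmanager, that pages go through the PagerDuty integration, and that an on-call shift lasts 12 hours; the recording rule name is a hypothetical choice. It counts notifications rather than individual alerts, since one notification can carry a whole group. False positive rate and MTTA are not available here and would have to come from the incident or paging tool.

```yaml
# Prometheus rule file (sketch; shift length and rule name are assumptions)
groups:
  - name: alert-quality
    rules:
      # Approximate pages per 12h on-call shift: number of notifications
      # Alertmanager attempted to send via PagerDuty over the last 12 hours
      - record: oncall:pagerduty_notifications:increase12h
        expr: |
          sum(increase(alertmanager_notifications_total{integration="pagerduty"}[12h]))
```

Graphing this series over several weeks shows whether noise reduction work is actually bringing the paging load toward the target.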
Summary
Quality alerting is symptom-based and actionable, and every alert links to a runbook. Measure alert quality and continuously reduce noise.