Every alert should be actionable. If not, it’s noise.
Rule #1: Alert on Symptoms, Not Causes¶
Alert on “CPU > 90%” is noise. Alert on “5xx error rate > 1%” is a symptom affecting users.
Severity Levels¶
- Critical — users are affected NOW → wake on-call
- Warning — will be a problem soon → fix during business hours
- Info — FYI → just log/dashboard
What to Monitor¶
- Error rate (5xx)
- Latency (P95, P99)
- Saturation (CPU, memory, disk)
- Queue depth
- Certificate expiry
- Disk space
Anti-patterns¶
- Too sensitive thresholds → alert fatigue
- Alerting on things that self-heal
- No runbook → nobody knows what to do
- Duplicate alerts
Runbook Template¶
Alert: HighErrorRate¶
Severity: Critical
Meaning: 5xx error rate > 1% for 5 minutes
Impact: Users see errors
Steps:
1. Check deployment history
2. Look at logs
3. Rollback if recent deploy
4. Escalate to #oncall
Summary¶
Fewer alerts = more attention. Every alert must have a runbook and clear action.
alertingmonitoringsre