Alerting That Makes Sense

Every alert should be actionable. If not, it’s noise.

Rule #1: Alert on Symptoms, Not Causes

An alert on “CPU > 90%” is noise: it names a possible cause, and users may not notice anything. An alert on “5xx error rate > 1%” describes a symptom that is affecting users.
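As a minimal sketch of a symptom-based check, the snippet below computes the 5xx error rate from two request counters and compares it to the 1% threshold from the text. The function names and the counter inputs are assumptions made for illustration; a real monitoring system would evaluate this over a time window.

```python
# Hypothetical names; the 1% threshold comes from the article.
ERROR_RATE_THRESHOLD = 0.01


def error_rate(total_requests: int, server_errors: int) -> float:
    """Return the fraction of requests that ended in a 5xx response."""
    if total_requests == 0:
        # No traffic means no user-visible errors.
        return 0.0
    return server_errors / total_requests


def should_alert(total_requests: int, server_errors: int) -> bool:
    """Alert only when the user-facing symptom crosses the threshold."""
    return error_rate(total_requests, server_errors) > ERROR_RATE_THRESHOLD
```

Note that CPU usage never appears here: a busy CPU only matters once it shows up as errors or latency for users.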

Severity Levels

Critical — users are affected NOW → wake on-call
Warning — will be a problem soon → fix during business hours
Info — FYI → just log/dashboard
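The three levels above map naturally to a routing table. The sketch below is one possible encoding; the route names (`page_on_call`, `ticket_business_hours`, `dashboard_only`) are hypothetical labels, not the API of any particular paging tool.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Hypothetical route names, one per severity level from the list above.
ROUTES = {
    Severity.CRITICAL: "page_on_call",          # users are affected now
    Severity.WARNING: "ticket_business_hours",  # will be a problem soon
    Severity.INFO: "dashboard_only",            # FYI, no human action
}


def route(severity: Severity) -> str:
    """Return where an alert of the given severity should be sent."""
    return ROUTES[severity]
```

Keeping the mapping in one place makes it easy to see that only Critical is allowed to wake someone up.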

What to Monitor

Error rate (5xx)
Latency (P95, P99)
Saturation (CPU, memory, disk)
Queue depth
Certificate expiry
Disk space
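For the latency item, P95 and P99 are percentiles of the observed request durations. The sketch below uses the nearest-rank method, which is one of several common definitions; monitoring backends often estimate percentiles from histograms instead, so treat this as an illustration rather than how any specific tool computes them.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    # Integer multiplication first keeps the rank free of rounding noise.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

For example, with 100 samples of 1 ms to 100 ms, `percentile(samples, 95)` returns the 95th smallest value, so a handful of slow requests show up in P99 long before they move the average.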

Anti-patterns

Overly sensitive thresholds → alert fatigue
Alerting on things that self-heal
No runbook → nobody knows what to do
Duplicate alerts
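One common way to avoid the first two anti-patterns is to fire only when a threshold has been breached for several consecutive checks, so that short spikes and self-healing blips never reach a human. The class below is a sketch of that idea; `SustainedBreach` and its parameters are hypothetical names, and the number of required samples stands in for a duration such as "for 5 minutes".

```python
from collections import deque


class SustainedBreach:
    """Fire only after the threshold was exceeded on N consecutive samples."""

    def __init__(self, threshold: float, required_samples: int) -> None:
        self.threshold = threshold
        # Keeps only the most recent N breach/no-breach results.
        self.window = deque(maxlen=required_samples)

    def observe(self, value: float) -> bool:
        """Record one sample and return True if the alert should fire."""
        self.window.append(value > self.threshold)
        window_full = len(self.window) == self.window.maxlen
        return window_full and all(self.window)
```

With one sample per minute and `required_samples=5`, a single healthy reading resets the countdown, which is exactly what filters out problems that fix themselves.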

Runbook Template

Alert: HighErrorRate

Severity: Critical
Meaning: 5xx error rate > 1% for 5 minutes
Impact: Users see errors
Steps:
1. Check deployment history
2. Look at logs
3. Rollback if recent deploy
4. Escalate to #oncall
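The template can also be kept as structured data, which makes it possible to check automatically that no alert ships without a runbook. The sketch below is an assumption about how that might look; the `Runbook` class, the `alerts_missing_runbook` helper, and the `DiskAlmostFull` alert in the usage note are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Runbook:
    alert: str
    severity: str
    meaning: str
    impact: str
    steps: List[str] = field(default_factory=list)


# The HighErrorRate runbook from the template above, as data.
RUNBOOKS: Dict[str, Runbook] = {
    "HighErrorRate": Runbook(
        alert="HighErrorRate",
        severity="Critical",
        meaning="5xx error rate > 1% for 5 minutes",
        impact="Users see errors",
        steps=[
            "Check deployment history",
            "Look at logs",
            "Rollback if recent deploy",
            "Escalate to #oncall",
        ],
    ),
}


def alerts_missing_runbook(alert_names: List[str],
                           runbooks: Dict[str, Runbook]) -> List[str]:
    """Return alerts that have no runbook, or a runbook without steps."""
    return [
        name for name in alert_names
        if name not in runbooks or not runbooks[name].steps
    ]
```

Calling `alerts_missing_runbook(["HighErrorRate", "DiskAlmostFull"], RUNBOOKS)` returns `["DiskAlmostFull"]`, so a check like this could run in CI and block alerts nobody knows how to handle.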

Summary

Fewer alerts = more attention. Every alert must have a runbook and a clear action.


CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience with enterprise IT.
