Alerting That Makes Sense

11. 09. 2023 1 min read intermediate

Every alert should be actionable. If not, it’s noise.

Rule #1: Alert on Symptoms, Not Causes

An alert on “CPU > 90%” is noise: it points at a possible cause, and users may not notice anything at all. An alert on “5xx error rate > 1%” tracks a symptom that users actually feel.
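A symptom-based check can be sketched in a few lines of Python. This is a minimal illustration, not a real monitoring API: the `RequestWindow` structure and the function names are hypothetical, and the 1% threshold is the one from the example above.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts for one evaluation window (hypothetical structure)."""
    total: int
    errors_5xx: int


def error_rate(window: RequestWindow) -> float:
    """Fraction of requests that returned a 5xx status; 0.0 for an empty window."""
    if window.total == 0:
        return 0.0
    return window.errors_5xx / window.total


def should_alert(window: RequestWindow, threshold: float = 0.01) -> bool:
    """Fire when the user-facing error rate exceeds the threshold (1% by default)."""
    return error_rate(window) > threshold
```

Note that CPU usage does not appear anywhere in the check: the only input is what users experience.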

Severity Levels

  • Critical — users are affected NOW → wake on-call
  • Warning — will be a problem soon → fix during business hours
  • Info — FYI → just log/dashboard
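The three levels above map naturally to three destinations. The sketch below assumes hypothetical destination names (`page-on-call`, `ticket-queue`, `dashboard`); substitute whatever your paging and ticketing tools actually use.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"  # users are affected now
    WARNING = "warning"    # will become a problem soon
    INFO = "info"          # for information only


# Hypothetical destination names, one per severity level.
ROUTES = {
    Severity.CRITICAL: "page-on-call",
    Severity.WARNING: "ticket-queue",
    Severity.INFO: "dashboard",
}


def route(severity: Severity) -> str:
    """Return the destination for an alert of the given severity."""
    return ROUTES[severity]
```

Keeping the mapping in one table makes it easy to verify that only Critical alerts can wake someone up.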

What to Monitor

  • Error rate (5xx)
  • Latency (P95, P99)
  • Saturation (CPU, memory, disk I/O)
  • Queue depth
  • Certificate expiry
  • Disk space
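For the latency item, P95 and P99 are percentiles over a window of request durations. Below is a minimal sketch using the nearest-rank definition; this is one of several common percentile definitions, so a monitoring backend may report slightly different values. The function name and sample data are illustrative.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing to keep the rank exact for whole-number inputs.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

Usage: `percentile(latencies_ms, 95)` returns the latency that 95% of requests stayed at or below, which exposes slow tails that an average would hide.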

Anti-patterns

  • Overly sensitive thresholds → alert fatigue
  • Alerting on things that self-heal
  • No runbook → nobody knows what to do
  • Duplicate alerts
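Several of these anti-patterns can be reduced by requiring a condition to hold for a number of consecutive checks and by notifying only once per incident. The class below is a hypothetical sketch of that idea, not the API of any particular alerting tool.

```python
class SustainedCondition:
    """Fires once after a condition has held for `required` consecutive checks."""

    def __init__(self, required: int) -> None:
        if required < 1:
            raise ValueError("required must be at least 1")
        self.required = required
        self.streak = 0       # consecutive breached checks so far
        self.firing = False   # True while an incident is already open

    def observe(self, breached: bool) -> bool:
        """Record one check; return True only on the check that opens an incident."""
        if not breached:
            # Condition recovered on its own: reset without notifying anyone.
            self.streak = 0
            self.firing = False
            return False
        self.streak += 1
        if self.streak >= self.required and not self.firing:
            self.firing = True
            return True
        # Either not sustained long enough yet, or already notified.
        return False
```

With `required=3`, a two-check blip that self-heals never notifies, and a long outage notifies exactly once instead of on every check.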

Runbook Template

Alert: HighErrorRate

Severity: Critical
Meaning: 5xx error rate > 1% for 5 minutes
Impact: Users see errors
Steps:
1. Check deployment history
2. Look at logs
3. Roll back if a recent deploy is the likely cause
4. Escalate to #oncall
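The template above can also be kept as structured data, which makes it possible to check automatically that no alert ships without a runbook. This is a sketch under assumptions: the `Runbook` class and `missing_runbooks` helper are hypothetical, and only the `HighErrorRate` content comes from the template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runbook:
    alert: str
    severity: str
    meaning: str
    impact: str
    steps: tuple[str, ...]


HIGH_ERROR_RATE = Runbook(
    alert="HighErrorRate",
    severity="Critical",
    meaning="5xx error rate > 1% for 5 minutes",
    impact="Users see errors",
    steps=(
        "Check deployment history",
        "Look at logs",
        "Roll back if a recent deploy is the likely cause",
        "Escalate to #oncall",
    ),
)


def missing_runbooks(alert_names: list[str], runbooks: list[Runbook]) -> list[str]:
    """Return the alerts that have no runbook, or a runbook with no steps."""
    covered = {rb.alert for rb in runbooks if rb.steps}
    return [name for name in alert_names if name not in covered]
```

Running `missing_runbooks` over all defined alert names, for example in CI, turns "no runbook" from a discovery at 3 a.m. into a failed check.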

Summary

Fewer alerts = more attention. Every alert must have a runbook and a clear action.

alerting, monitoring, sre

CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience in enterprise IT.