
On-call Engineering: Best Practices


How to run effective on-call rotations: alert quality, escalation, compensation, and burnout prevention.

Alert Quality

Every alert must be actionable. If the on-call engineer cannot do anything about it, delete the alert.

Alert = someone must do something NOW
No informational alerts in on-call rotation
Target: at most 2-3 alerts per on-call shift
Every alert has a runbook link
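
The rules above can be checked mechanically before an alert is allowed to page anyone. The following is a minimal sketch, assuming a simplified in-house alert record: the `AlertRule` class, its fields, and the example URL are hypothetical and do not correspond to any particular monitoring tool's schema.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    """Hypothetical, simplified alert definition (not a real monitoring schema)."""
    name: str
    pages_on_call: bool      # True if the alert wakes the on-call engineer
    actionable: bool         # True if a human can do something about it right now
    runbook_url: str = ""    # link to the runbook, empty if missing


def lint_alert(rule: AlertRule) -> list[str]:
    """Return the list of rule violations for a single alert."""
    problems = []
    if not rule.pages_on_call:
        return problems  # non-paging alerts are outside the on-call rotation
    if not rule.actionable:
        problems.append(f"{rule.name}: not actionable, delete it or stop paging on it")
    if not rule.runbook_url.strip():
        problems.append(f"{rule.name}: missing runbook link")
    return problems


rules = [
    AlertRule("api-5xx-rate-high", True, True, "https://runbooks.example.com/api-5xx"),
    AlertRule("disk-usage-info", True, False),
    AlertRule("nightly-report-sent", False, False),
]

for rule in rules:
    for problem in lint_alert(rule):
        print(problem)
```

A check like this could run in CI against the alert definitions, so that an informational or undocumented alert never reaches the rotation in the first place.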

Rotation Design

Minimum 2 people in rotation (primary + secondary)
At most 1 week on call per person per month
Follow-the-sun for global teams
Handoff meeting at the start of each shift: what is currently happening?
Shadow on-call for new team members
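
One way to satisfy the first two rules is a simple weekly round-robin. The sketch below is illustrative only: it assumes one-week shifts, treats a month as four consecutive weeks, and counts only primary duty toward the monthly limit. The function names and team members are made up.

```python
from collections import Counter


def build_rotation(engineers: list[str], weeks: int) -> list[tuple[str, str]]:
    """Round-robin weekly schedule: (primary, secondary) for each week.

    The secondary is the next person in the list, so they become primary
    the following week and already know what is going on.
    """
    if len(engineers) < 2:
        raise ValueError("need at least 2 people (primary + secondary)")
    n = len(engineers)
    return [(engineers[w % n], engineers[(w + 1) % n]) for w in range(weeks)]


def max_primary_weeks_in_window(schedule: list[tuple[str, str]], window: int = 4) -> int:
    """Largest number of primary weeks one person has in any `window` consecutive weeks."""
    if not schedule:
        return 0
    worst = 0
    for start in range(max(1, len(schedule) - window + 1)):
        counts = Counter(primary for primary, _ in schedule[start:start + window])
        worst = max(worst, max(counts.values()))
    return worst


team = ["ana", "ben", "chen", "dita"]  # hypothetical team
schedule = build_rotation(team, weeks=8)
for week, (primary, secondary) in enumerate(schedule, start=1):
    print(f"week {week}: primary={primary} secondary={secondary}")
```

Under these assumptions a team of four keeps everyone at one primary week per four-week window, while a team of two cannot meet the limit even though it satisfies the primary plus secondary minimum.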

Escalation

# PagerDuty escalation policy
Level 1: Primary on-call (0 min)
  → Acknowledge within: 5 min
  → Auto-escalate: 15 min

Level 2: Secondary on-call (15 min)
  → Auto-escalate: 30 min

Level 3: Engineering Manager (45 min)

# Rules
- P1: escalate immediately if you cannot resolve it yourself
- Don't be a hero: escalating is not a failure
- Better to wake two people than to have a 2-hour outage
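
The timing of the policy above can be modelled in a few lines. This is a sketch of the escalation logic only, not PagerDuty's API: the `ESCALATION_LEVELS` table and the `paged_so_far` function are hypothetical names, and the minute offsets are copied from the three levels listed above.

```python
from typing import Optional

# (minutes since the alert fired, who gets paged), taken from the policy above
ESCALATION_LEVELS = [
    (0, "primary on-call"),
    (15, "secondary on-call"),
    (45, "engineering manager"),
]


def paged_so_far(minutes_elapsed: float, acked_at: Optional[float] = None) -> list[str]:
    """Return everyone who has been paged `minutes_elapsed` minutes after the alert fired.

    Escalation stops once the alert is acknowledged: a level is skipped if
    the acknowledgement arrived strictly before that level's start time.
    """
    paged = []
    for start, who in ESCALATION_LEVELS:
        if start > minutes_elapsed:
            break  # this level has not been reached yet
        if acked_at is not None and acked_at < start:
            break  # someone already acknowledged, no further escalation
        paged.append(who)
    return paged


print(paged_so_far(10))               # only the primary so far
print(paged_so_far(60))               # nobody acknowledged, all three levels paged
print(paged_so_far(60, acked_at=4))   # primary acknowledged in time
```

Writing the policy down as data makes it easy to review the worst case: with these numbers, an unacknowledged alert reaches the engineering manager 45 minutes after it fires.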

Burnout Prevention

Compensate on-call work (bonus pay or time off)
Track metrics: alerts per shift, MTTR, false positive rate
Hold a retrospective after each on-call week
Invest in automation to reduce the alert count
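
The three metrics named above are straightforward to compute from a log of pages. The following sketch assumes a hypothetical `Page` record, interprets MTTR as the mean time from firing to resolution, and excludes false positives from the MTTR average. All of these are choices made for illustration, and the sample data is invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Page:
    """One page received during a shift (hypothetical record format)."""
    fired_at: datetime
    resolved_at: datetime
    false_positive: bool  # True if no action was actually needed


def shift_metrics(pages: list[Page], shifts: int) -> dict[str, float]:
    """Compute alerts per shift, MTTR in minutes, and false positive rate."""
    if shifts <= 0:
        raise ValueError("shifts must be positive")
    real = [p for p in pages if not p.false_positive]
    total_repair = sum((p.resolved_at - p.fired_at for p in real), timedelta())
    return {
        "alerts_per_shift": len(pages) / shifts,
        "mttr_minutes": total_repair.total_seconds() / 60 / len(real) if real else 0.0,
        "false_positive_rate": (len(pages) - len(real)) / len(pages) if pages else 0.0,
    }


t0 = datetime(2024, 1, 1, 2, 0)  # invented sample data
pages = [
    Page(t0, t0 + timedelta(minutes=30), False),
    Page(t0 + timedelta(days=1), t0 + timedelta(days=1, minutes=10), False),
    Page(t0 + timedelta(days=2), t0 + timedelta(days=2, minutes=5), True),
    Page(t0 + timedelta(days=3), t0 + timedelta(days=3, minutes=20), False),
]
metrics = shift_metrics(pages, shifts=2)
print(metrics)
```

Numbers like these give the weekly retrospective something concrete to discuss, for example whether alerts per shift is drifting above the 2-3 target.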

Summary

Healthy on-call = quality alerts, clear escalation, fair compensation, and continuous improvement. On-call should not be a punishment.


CORE SYSTEMS team
