On-call Engineering: Best Practices
Effective on-call rotations. Alert quality, escalation, compensation, and burnout prevention.
Alert Quality
Every alert must be actionable. If the on-call engineer cannot act on it, delete the alert.
- Alert = someone must do something NOW
- No informational alerts in on-call rotation
- Max 2-3 alerts per on-call shift (target)
- Every alert has a runbook link
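The rules above can be checked mechanically before an alert is allowed to page anyone. The following is a minimal sketch, not tied to any monitoring tool: the `AlertRule` shape, the `"page"`/`"info"` severity labels, and the example runbook URL are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    severity: str          # hypothetical labels: "page" (wakes someone) or "info"
    runbook_url: str = ""  # empty string means no runbook is linked


def lint_alert(rule: AlertRule) -> list[str]:
    """Return the reasons this rule should not page the on-call engineer."""
    problems = []
    if rule.severity != "page":
        problems.append("informational alerts do not belong in the on-call rotation")
    if not rule.runbook_url.strip():
        problems.append("missing runbook link")
    return problems


# Hypothetical alert definitions used only to exercise the check.
rules = [
    AlertRule("api_5xx_rate_high", "page", "https://runbooks.example.com/api-5xx"),
    AlertRule("disk_usage_notice", "info"),
]

for rule in rules:
    print(rule.name, lint_alert(rule))
```

A check like this could run in CI against the alert definitions, so a rule without a runbook never reaches the pager.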
Rotation Design
- Minimum 2 people in rotation (primary + secondary)
- Max 1 week on-call per month
- Follow-the-sun for global teams
- Handoff meeting at the start of each shift: what is happening right now?
- Shadow on-call for new team members
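As an illustration of the first two rules, here is a small sketch of a weekly round-robin schedule with a primary and a secondary, plus a check for people who carry the pager too often. It assumes weekly shifts, approximates "per month" as any 4 consecutive weeks, and counts only primary weeks; the function names and engineer names are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Weekly round-robin: returns (week_start, primary, secondary) tuples."""
    if len(engineers) < 2:
        raise ValueError("need at least 2 people (primary + secondary)")
    schedule = []
    for i in range(weeks):
        primary = engineers[i % len(engineers)]
        # The secondary is next week's primary, so they already know the context.
        secondary = engineers[(i + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=i), primary, secondary))
    return schedule


def overloaded(schedule, window_weeks=4):
    """Engineers who are primary more than once in any `window_weeks` consecutive weeks."""
    flagged = set()
    for i in range(max(1, len(schedule) - window_weeks + 1)):
        window = schedule[i:i + window_weeks]
        counts = Counter(primary for _, primary, _ in window)
        flagged.update(name for name, n in counts.items() if n > 1)
    return flagged


team = ["ana", "ben", "chi", "dev"]  # hypothetical names
schedule = build_rotation(team, date(2024, 1, 1), weeks=8)
for week_start, primary, secondary in schedule:
    print(week_start, primary, secondary)
print("overloaded:", overloaded(schedule))
```

Under these assumptions a team of four stays within the limit, while a team of two or three gets flagged, which suggests the "1 week per month" target needs roughly four people in the primary rotation.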
Escalation
# PagerDuty escalation policy
Level 1: Primary on-call (0 min)
→ Auto-acknowledge: 5 min
→ Auto-escalate: 15 min
Level 2: Secondary on-call (15 min)
→ Auto-escalate: 30 min
Level 3: Engineering Manager (45 min)
# Rules
- P1: escalate immediately if you cannot resolve
- Don't be a hero — escalation is not failure
- Better to wake two people than to have a 2-hour outage
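The timing of the policy above can be modelled in a few lines to see who has been paged at a given moment. This is plain Python for illustration only, not PagerDuty configuration or its API; the level start times (0, 15, and 45 minutes) come from the policy, while the names `ESCALATION_POLICY`, `paged_levels`, and `current_level` are assumptions.

```python
# (minutes since the alert fired without resolution, who gets paged)
ESCALATION_POLICY = [
    (0, "Primary on-call"),
    (15, "Secondary on-call"),
    (45, "Engineering Manager"),
]


def paged_levels(minutes_elapsed: float) -> list[str]:
    """Everyone who has been paged once `minutes_elapsed` minutes have passed."""
    if minutes_elapsed < 0:
        raise ValueError("minutes_elapsed must be non-negative")
    return [who for start, who in ESCALATION_POLICY if minutes_elapsed >= start]


def current_level(minutes_elapsed: float) -> str:
    """The most recently paged level."""
    return paged_levels(minutes_elapsed)[-1]


for minutes in (0, 10, 15, 44, 45):
    print(minutes, current_level(minutes))
```

Keeping the thresholds in one table makes it easy to review the policy with the team and to test changes before applying them in the paging tool.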
Burnout Prevention
- Compensation (bonus pay or time off)
- Track metrics: alerts per shift, MTTR, false positive rate
- On-call week retrospective
- Invest in automation (reduce alert count)
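The three metrics named above can be computed from a simple log of pages. The sketch below assumes a hypothetical `Page` record with a shift label, trigger and resolve timestamps, and an `actionable` flag filled in during the retrospective. It counts only shifts that received at least one page, and it includes false positives in MTTR; both are simplifications.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Page:
    shift: str            # hypothetical shift label, e.g. "2024-W01"
    triggered: datetime
    resolved: datetime
    actionable: bool      # False means the page was a false positive


def on_call_metrics(pages):
    """Alerts per shift, mean time to resolve (minutes), and false positive rate."""
    if not pages:
        raise ValueError("no pages to analyse")
    shifts = {p.shift for p in pages}
    total = sum((p.resolved - p.triggered for p in pages), timedelta())
    return {
        "alerts_per_shift": len(pages) / len(shifts),
        "mttr_minutes": total.total_seconds() / 60 / len(pages),
        "false_positive_rate": sum(not p.actionable for p in pages) / len(pages),
    }


t0 = datetime(2024, 1, 1, 2, 0)
# Hypothetical sample data: four pages over two shifts, one false positive.
pages = [
    Page("2024-W01", t0, t0 + timedelta(minutes=30), True),
    Page("2024-W01", t0, t0 + timedelta(minutes=10), False),
    Page("2024-W02", t0, t0 + timedelta(minutes=20), True),
    Page("2024-W02", t0, t0 + timedelta(minutes=60), True),
]
print(on_call_metrics(pages))
```

Reviewing these numbers in the on-call week retrospective shows whether automation work is actually reducing the alert count.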
Summary
Healthy on-call = quality alerts, clear escalation, compensation, and continuous improvement. On-call should not be punishment.