On-call doesn’t have to be a nightmare. Here’s how to survive it, and even make it better.
Preparation
- Test the alerting system: are notifications actually reaching you? (A quick test sketch follows this list.)
- Have VPN/SSH access on your phone
- Read runbooks for critical services
- Know who your backup is and how to escalate
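If your team pages through something like PagerDuty (an assumption here, not stated above), a one-off script can confirm before your shift starts that a test event actually reaches your phone. This sketch uses PagerDuty’s Events API v2 with a placeholder routing key; adapt the endpoint and payload to whatever provider you actually use.

```python
# Minimal sketch: fire a low-urgency test event through a PagerDuty-style
# Events API endpoint to confirm pages actually reach your phone.
# The routing key is a placeholder; the endpoint and payload shape follow
# PagerDuty's Events API v2, but check your own provider's docs.
import json
import urllib.request

ROUTING_KEY = "YOUR_ROUTING_KEY"  # placeholder: integration key of a test service

event = {
    "routing_key": ROUTING_KEY,
    "event_action": "trigger",
    "payload": {
        "summary": "On-call test page: please acknowledge",
        "source": "oncall-selftest",
        "severity": "info",
    },
}

req = urllib.request.Request(
    "https://events.pagerduty.com/v2/enqueue",
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    # A 202 response means the event was accepted for processing.
    print(resp.status, resp.read().decode("utf-8"))
```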
When the Pager Rings
- Don’t panic
- Read the alert and runbook
- Assess impact — how many users are affected?
- Communicate — post to #incidents channel
- Mitigate impact (rollback, traffic shift, restart; see the sketch after this list)
- Analyze root cause
- Fix
- Write postmortem
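Mitigation depends entirely on your stack; as one illustration, here is a sketch assuming the service runs on Kubernetes and that a bad deploy is the likely cause. The deployment and namespace names are placeholders, and your real lever may be a feature flag or a load-balancer weight instead.

```python
# Sketch of two common mitigations on Kubernetes: roll back a bad deploy,
# or restart the pods of a wedged deployment. Assumes kubectl is installed
# and configured; "payments" / "prod" are placeholder names.
import subprocess

def rollback(deployment: str, namespace: str) -> None:
    """Revert the deployment to its previous revision."""
    subprocess.run(
        ["kubectl", "rollout", "undo", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )

def restart(deployment: str, namespace: str) -> None:
    """Trigger a rolling restart of the deployment's pods."""
    subprocess.run(
        ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )

if __name__ == "__main__":
    rollback("payments", "prod")  # e.g. the deploy that triggered the page
```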
Escalation
Don’t hesitate to escalate. It’s better to wake a colleague unnecessarily than to spend two hours on something they could fix in five minutes.
Communication During Incident
🔴 INCIDENT: [service] [symptom]
Impact: [how many users/% traffic]
Status: investigating / identified / mitigated / resolved
Next update: in 30 minutes
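Posting every update by hand is easy to forget under stress. A small helper like this sketch, assuming a Slack-style incoming webhook (the URL is a placeholder), fills in the template above and sends it to #incidents.

```python
# Sketch: post a status update in the template above to an incident channel.
# Assumes a Slack-style incoming webhook; the URL below is a placeholder.
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def post_update(service, symptom, impact, status, next_update="in 30 minutes"):
    text = (
        f":red_circle: INCIDENT: {service} {symptom}\n"
        f"Impact: {impact}\n"
        f"Status: {status}\n"
        f"Next update: {next_update}"
    )
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()  # Slack-style webhooks reply with "ok" on success

post_update("checkout-api", "elevated 5xx rate", "~20% of traffic", "investigating")
```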
After the Incident
- Write postmortem within 48 hours
- Blameless culture — look for systemic causes, not culprits
- Action items with owners and deadlines (a simple tracking sketch follows this list)
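Action items that live only in the postmortem document tend to rot. A minimal sketch like this (names, dates and items are purely illustrative) keeps the owner and deadline next to each item and flags anything overdue.

```python
# Sketch: track postmortem action items with an owner and a deadline, and
# flag anything overdue. Names and dates below are illustrative only.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ActionItem:
    description: str
    owner: str
    deadline: date
    done: bool = False

    def overdue(self, today: Optional[date] = None) -> bool:
        return not self.done and (today or date.today()) > self.deadline

items = [
    ActionItem("Add alert on connection-pool exhaustion", "alice", date(2024, 7, 1)),
    ActionItem("Document rollback procedure in runbook", "bob", date(2024, 7, 8)),
]

for item in items:
    flag = "OVERDUE" if item.overdue() else "ok"
    print(f"[{flag}] {item.description} (owner: {item.owner}, due {item.deadline})")
```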
Self-care
- Set aside quiet time (an afternoon nap to catch up on sleep after a night incident)
- Make sure on-call is compensated (money or time off)
- Rotate on-call fairly (see the sketch after this list)
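Fair rotation does not require a scheduling product; even a simple round-robin generator makes the schedule explicit and easy to audit. The team names, start date and shift length below are placeholders.

```python
# Sketch: generate a fair round-robin on-call rotation so the same people
# don't always land on the same weeks. All values are placeholders.
from datetime import date, timedelta
from itertools import cycle

team = ["alice", "bob", "carol", "dave"]  # placeholder engineers
shift_length = timedelta(days=7)          # one-week shifts
start = date(2024, 7, 1)                  # first day of the rotation

def rotation(weeks: int):
    """Yield (shift_start, engineer) pairs in simple round-robin order."""
    engineers = cycle(team)
    for week in range(weeks):
        yield start + week * shift_length, next(engineers)

for shift_start, engineer in rotation(8):
    print(f"{shift_start}: {engineer}")
```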
Tip
The best on-call is boring on-call. Invest in reliability, runbooks and automation.
Tags: on-call, sre, devops