On-call doesn’t have to be a nightmare. Here’s how to survive it, and even make it better.
Preparation
- Test the alerting system: are notifications actually reaching you? (A quick test sketch follows this list.)
- Have VPN/SSH access on your phone
- Read runbooks for critical services
- Know who your backup is and how to escalate
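If your team pages through something like PagerDuty (an assumption here, not stated above), a one-off script can confirm before your shift starts that a test event actually reaches your phone. This sketch uses PagerDuty’s Events API v2 with a placeholder routing key; adapt the endpoint and payload to whatever provider you actually use.

```python
# Minimal sketch: fire a low-urgency test event through a PagerDuty-style
# Events API endpoint to confirm pages actually reach your phone.
# The routing key is a placeholder; the endpoint and payload shape follow
# PagerDuty's Events API v2, but check your own provider's docs.
import json
import urllib.request

ROUTING_KEY = "YOUR_ROUTING_KEY"  # placeholder: integration key of a test service

event = {
    "routing_key": ROUTING_KEY,
    "event_action": "trigger",
    "payload": {
        "summary": "On-call test page: please acknowledge",
        "source": "oncall-selftest",
        "severity": "info",
    },
}

req = urllib.request.Request(
    "https://events.pagerduty.com/v2/enqueue",
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    # A 202 response means the event was accepted for processing.
    print(resp.status, resp.read().decode("utf-8"))
```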
When the Pager Rings
- Don’t panic
- Read the alert and runbook
- Assess impact — how many users are affected?
- Communicate — post to #incidents channel
- Mitigate impact (rollback, traffic shift, restart; see the sketch after this list)
- Analyze root cause
- Fix
- Write postmortem
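Mitigation depends entirely on your stack; as one illustration, here is a sketch assuming the service runs on Kubernetes and that a bad deploy is the likely cause. The deployment and namespace names are placeholders, and your real lever may be a feature flag or a load-balancer weight instead.

```python
# Sketch of two common mitigations on Kubernetes: roll back a bad deploy,
# or restart the pods of a wedged deployment. Assumes kubectl is installed
# and configured; "payments" / "prod" are placeholder names.
import subprocess

def rollback(deployment: str, namespace: str) -> None:
    """Revert the deployment to its previous revision."""
    subprocess.run(
        ["kubectl", "rollout", "undo", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )

def restart(deployment: str, namespace: str) -> None:
    """Trigger a rolling restart of the deployment's pods."""
    subprocess.run(
        ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )

if __name__ == "__main__":
    rollback("payments", "prod")  # e.g. the deploy that triggered the page
```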
Escalation
Don’t hesitate to escalate. It’s better to wake a colleague unnecessarily than to spend two hours on something they could fix in five minutes.
Communication During Incident
🔴 INCIDENT: [service] [symptom]
Impact: [how many users/% traffic]
Status: investigating / identified / mitigated / resolved
Next update: in 30 minutes
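Posting every update by hand is easy to forget under stress. A small helper like this sketch, assuming a Slack-style incoming webhook (the URL is a placeholder), fills in the template above and sends it to #incidents.

```python
# Sketch: post a status update in the template above to an incident channel.
# Assumes a Slack-style incoming webhook; the URL below is a placeholder.
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def post_update(service, symptom, impact, status, next_update="in 30 minutes"):
    text = (
        f":red_circle: INCIDENT: {service} {symptom}\n"
        f"Impact: {impact}\n"
        f"Status: {status}\n"
        f"Next update: {next_update}"
    )
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()  # Slack-style webhooks reply with "ok" on success

post_update("checkout-api", "elevated 5xx rate", "~20% of traffic", "investigating")
```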
After the Incident
- Write postmortem within 48 hours
- Blameless culture — look for systemic causes, not culprits
- Action items with owners and deadlines (a simple tracking sketch follows this list)
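Action items that live only in the postmortem document tend to rot. A minimal sketch like this (names, dates and items are purely illustrative) keeps the owner and deadline next to each item and flags anything overdue.

```python
# Sketch: track postmortem action items with an owner and a deadline, and
# flag anything overdue. Names and dates below are illustrative only.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ActionItem:
    description: str
    owner: str
    deadline: date
    done: bool = False

    def overdue(self, today: Optional[date] = None) -> bool:
        return not self.done and (today or date.today()) > self.deadline

items = [
    ActionItem("Add alert on connection-pool exhaustion", "alice", date(2024, 7, 1)),
    ActionItem("Document rollback procedure in runbook", "bob", date(2024, 7, 8)),
]

for item in items:
    flag = "OVERDUE" if item.overdue() else "ok"
    print(f"[{flag}] {item.description} (owner: {item.owner}, due {item.deadline})")
```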
Self-care
- Set aside quiet time (an afternoon nap to catch up on sleep after a night incident)
- Make sure on-call is compensated (money or time off)
- Rotate on-call fairly (see the sketch after this list)
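Fair rotation does not require a scheduling product; even a simple round-robin generator makes the schedule explicit and easy to audit. The team names, start date and shift length below are placeholders.

```python
# Sketch: generate a fair round-robin on-call rotation so the same people
# don't always land on the same weeks. All values are placeholders.
from datetime import date, timedelta
from itertools import cycle

team = ["alice", "bob", "carol", "dave"]  # placeholder engineers
shift_length = timedelta(days=7)          # one-week shifts
start = date(2024, 7, 1)                  # first day of the rotation

def rotation(weeks: int):
    """Yield (shift_start, engineer) pairs in simple round-robin order."""
    engineers = cycle(team)
    for week in range(weeks):
        yield start + week * shift_length, next(engineers)

for shift_start, engineer in rotation(8):
    print(f"{shift_start}: {engineer}")
```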
Tip
The best on-call is boring on-call. Invest in reliability, runbooks and automation.
Tags: on-call, sre, devops