
On-call Survival Guide

20. 07. 2016 1 min read intermediate

On-call doesn’t have to be a nightmare. Here’s how to survive it (and even make it better).

Preparation

  • Test the alerting system: are you actually receiving notifications?
  • Have VPN/SSH access on your phone
  • Read runbooks for critical services
  • Know who your backup is and how to escalate
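
The access check from the list above can be scripted and run before a shift starts. This is a minimal sketch, assuming plain TCP reachability is a good enough signal. The function names and the example hosts are hypothetical, and it does not verify credentials or that alerts are actually delivered.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def preflight(endpoints: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Check every named endpoint and report which ones are reachable."""
    return {name: can_reach(host, port) for name, (host, port) in endpoints.items()}
```

A call such as `preflight({"bastion-ssh": ("bastion.example.com", 22)})` (placeholder host) returns a dictionary of name to reachability, which is easy to print or paste into a handover note.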

When the Pager Rings

  1. Don’t panic
  2. Read the alert and runbook
  3. Assess impact — how many users are affected?
  4. Communicate — post to #incidents channel
  5. Mitigate impact (rollback, traffic shift, restart)
  6. Analyze the root cause
  7. Fix the underlying problem
  8. Write the postmortem
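
For step 3, a rough impact number is usually enough to decide how loudly to communicate. The sketch below assumes you can get a count of affected users (or failed requests) and a total from your monitoring. The severity thresholds are hypothetical and should be replaced with whatever your team has agreed on.

```python
def impact_percent(affected: int, total: int) -> float:
    """Share of affected users (or requests) as a percentage, one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= affected <= total:
        raise ValueError("affected must be between 0 and total")
    return round(100.0 * affected / total, 1)


def severity(percent: float) -> str:
    """Map an impact percentage to a label (thresholds are illustrative assumptions)."""
    if percent >= 50.0:
        return "major"
    if percent >= 5.0:
        return "partial"
    return "minor"
```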

Escalation

Don’t hesitate to escalate. Better to wake a colleague unnecessarily than spend 2 hours on something they can fix in 5 minutes.
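
One way to make escalation a rule rather than a judgment call at 3 a.m. is a simple time limit. This is a sketch only: the 20-minute limit and the function name are assumptions, not a recommendation from any specific paging tool.

```python
from datetime import datetime, timedelta

# Hypothetical limit: escalate if the incident is still unmitigated after this long.
ESCALATE_AFTER = timedelta(minutes=20)


def should_escalate(
    started_at: datetime,
    now: datetime,
    mitigated: bool,
    limit: timedelta = ESCALATE_AFTER,
) -> bool:
    """Return True once an unmitigated incident has been open for at least `limit`."""
    return not mitigated and (now - started_at) >= limit
```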

Communication During Incident

🔴 INCIDENT: [service] [symptom]
Impact: [how many users/% traffic]
Status: investigating / identified / mitigated / resolved
Next update: in 30 minutes
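
The template above is easy to generate from code so that every update has the same shape. A minimal sketch, assuming the four statuses listed in the template. The function name and parameters are hypothetical, and posting the text to a chat channel is left out on purpose.

```python
from datetime import datetime, timedelta

STATUSES = ("investigating", "identified", "mitigated", "resolved")


def format_update(
    service: str,
    symptom: str,
    impact: str,
    status: str,
    now: datetime,
    next_update_minutes: int = 30,
) -> str:
    """Render an incident update in the same layout as the template."""
    if status not in STATUSES:
        raise ValueError(f"status must be one of {STATUSES}")
    lines = [
        f"🔴 INCIDENT: {service} {symptom}",
        f"Impact: {impact}",
        f"Status: {status}",
    ]
    # A resolved incident needs no further update, so the last line is skipped.
    if status != "resolved":
        next_at = now + timedelta(minutes=next_update_minutes)
        lines.append(f"Next update: in {next_update_minutes} minutes ({next_at:%H:%M})")
    return "\n".join(lines)
```

Adding the clock time next to "in 30 minutes" is a small design choice: readers who open the channel later can tell whether the next update is overdue.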

After the Incident

  • Write postmortem within 48 hours
  • Blameless culture — look for systemic causes, not culprits
  • Action items with owners and deadlines
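
The 48-hour window and the "owners and deadlines" rule can be tracked with very little code. This sketch assumes the 48 hours are counted from the moment the incident is resolved, which the text does not specify. The `ActionItem` type and helper names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta

POSTMORTEM_WINDOW = timedelta(hours=48)


@dataclass
class ActionItem:
    description: str
    owner: str  # a named person, not a team alias
    due: date


def postmortem_due(resolved_at: datetime) -> datetime:
    """Latest time the postmortem should be written (assumes the clock starts at resolution)."""
    return resolved_at + POSTMORTEM_WINDOW


def overdue_items(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return the action items whose deadline has already passed."""
    return [item for item in items if item.due < today]
```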

Self-care

  • Set quiet times (for example, an afternoon nap to catch up on sleep after a night incident)
  • Make sure on-call is compensated (money or time off)
  • Rotate on-call fairly
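
Fair rotation is simplest to check when the schedule is generated rather than hand-edited. The sketch below assumes weekly shifts and a plain round-robin order. The names are placeholders, and it ignores vacations and swaps.

```python
from collections import Counter
from datetime import date, timedelta
from itertools import cycle


def weekly_rotation(engineers: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Assign one engineer per week in round-robin order, starting at `start`."""
    if not engineers:
        raise ValueError("need at least one engineer")
    people = cycle(engineers)
    return [(start + timedelta(weeks=i), next(people)) for i in range(weeks)]


def shift_counts(schedule: list[tuple[date, str]]) -> Counter:
    """Count shifts per engineer so an uneven load is easy to spot."""
    return Counter(name for _, name in schedule)
```

With round-robin assignment, the shift counts of any two people differ by at most one, which gives a quick fairness check for any schedule length.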

Tip

The best on-call is boring on-call. Invest in reliability, runbooks and automation.


CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience with enterprise IT.