Sunday, 3:00 AM. Production is down. Who knows about it? Who is handling it? Before, the answer was a chain of chaotic phone calls. Now PagerDuty escalates automatically, runbooks guide the resolution, and a postmortem helps keep the incident from recurring.
Before: Chaos
Monitoring sent emails. Who read them? Nobody at night. The client called support. Support called the manager. The manager searched for someone who knew the system. Time to response: hours.
PagerDuty Setup
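On-call rotation: 2 teams, rotating weekly, with a primary on-call and a secondary for escalation. Alerts flow from Prometheus to PagerDuty, which notifies the on-call engineer by phone call, SMS, or push notification. Acknowledgement timeout: 5 minutes. Escalation to the secondary: after 10 minutes without acknowledgement.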
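The timing rules above can be sketched as a small decision function. This is an illustrative Python sketch, not PagerDuty's API: the function name `next_action` and the action labels are hypothetical. It also assumes that the 5-minute acknowledgement timeout triggers a repeat page to the primary, and that the 10-minute mark hands the alert to the secondary. In a real setup, the PagerDuty escalation policy does this work.

```python
ACK_TIMEOUT_MIN = 5   # acknowledgement timeout from the setup above
ESCALATE_MIN = 10     # escalation threshold from the setup above


def next_action(minutes_since_alert: float, acknowledged: bool) -> str:
    """Decide what should happen to an alert at a given moment.

    The returned labels are hypothetical names used only in this sketch.
    """
    if acknowledged:
        return "none"                # someone already owns the incident
    if minutes_since_alert >= ESCALATE_MIN:
        return "page_secondary"      # escalate to the secondary on-call
    if minutes_since_alert >= ACK_TIMEOUT_MIN:
        return "repage_primary"      # assumption: a timeout means notify again
    return "wait_for_primary"        # primary was paged, still within timeout


if __name__ == "__main__":
    for minute in (1, 6, 12):
        print(minute, next_action(minute, acknowledged=False))
```

Writing the thresholds down as named constants makes it easy to check whether the configured policy still matches what the team agreed on.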
Incident Severity
- SEV1: production outage, customers affected → immediate response
- SEV2: performance degradation, partial outage → 30 min response
- SEV3: non-critical issue → next business day
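One way to keep these levels consistent across tools is a small lookup table in code. The following is a minimal sketch under stated assumptions: the `Severity` enum, the `classify` function, and its two boolean inputs are hypothetical names, and "immediate" is modeled as a zero-length target while "next business day" is modeled as `None`.

```python
from datetime import timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = "SEV1"
    SEV2 = "SEV2"
    SEV3 = "SEV3"


# Response targets from the list above; None stands for "next business day".
RESPONSE_TARGET = {
    Severity.SEV1: timedelta(0),
    Severity.SEV2: timedelta(minutes=30),
    Severity.SEV3: None,
}


def classify(full_outage: bool, degraded_or_partial: bool) -> Severity:
    """Map two coarse symptoms to a severity level (hypothetical inputs)."""
    if full_outage:
        return Severity.SEV1   # production outage, customers affected
    if degraded_or_partial:
        return Severity.SEV2   # performance degradation or partial outage
    return Severity.SEV3       # non-critical issue


def response_target(severity: Severity) -> Optional[timedelta]:
    """Return the response target, or None for next business day."""
    return RESPONSE_TARGET[severity]
```

A full outage takes precedence over degradation, so an incident that shows both symptoms is classified as SEV1.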
Runbooks
Every alert has a link to a runbook. The runbook contains: what the alert means, how to diagnose, how to mitigate, when to escalate. The on-call engineer doesn’t have to be an expert on every system — the runbook guides them.
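The rule that every alert links to a complete runbook can be checked automatically. The sketch below assumes a hypothetical data shape: each alert is a dict with `name` and `runbook_url` keys, and each runbook is a dict keyed by the four sections named above. None of these field names come from a specific tool.

```python
# The four parts every runbook should contain, as listed above.
# The key names are hypothetical and chosen for this sketch.
REQUIRED_SECTIONS = ("meaning", "diagnosis", "mitigation", "escalation")


def alerts_without_runbook(alerts: list) -> list:
    """Return the names of alerts that have no runbook link."""
    return [a["name"] for a in alerts if not a.get("runbook_url")]


def missing_sections(runbook: dict) -> list:
    """Return the required sections that are absent or empty."""
    return [s for s in REQUIRED_SECTIONS if not runbook.get(s, "").strip()]


if __name__ == "__main__":
    alerts = [
        {"name": "HighErrorRate", "runbook_url": "https://wiki.example/high-error-rate"},
        {"name": "DiskAlmostFull", "runbook_url": ""},
    ]
    print(alerts_without_runbook(alerts))
    print(missing_sections({"meaning": "5xx rate above threshold", "diagnosis": "  "}))
```

Running a check like this in CI is one possible way to stop an alert from reaching production without guidance for the on-call engineer.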
Post-Incident
Every SEV1 and SEV2 incident gets a postmortem within 48 hours. Blameless. Action items with owners and deadlines. Review at the weekly SRE meeting. Trend tracking — recurring incidents indicate a systemic problem.
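The postmortem deadline and the trend tracking can both be expressed in a few lines. This is a sketch, not a description of any particular tool: the function names and the `cause` field are hypothetical, and it assumes the 48-hour window starts when the incident is resolved, which the text above does not specify.

```python
from collections import Counter
from datetime import datetime, timedelta

POSTMORTEM_WINDOW = timedelta(hours=48)
POSTMORTEM_SEVERITIES = {"SEV1", "SEV2"}


def postmortem_due(severity: str, resolved_at: datetime):
    """Return the postmortem deadline, or None if no postmortem is required.

    Assumption: the 48 hours are counted from the time of resolution.
    """
    if severity not in POSTMORTEM_SEVERITIES:
        return None
    return resolved_at + POSTMORTEM_WINDOW


def recurring_causes(incidents: list, threshold: int = 2) -> list:
    """Return causes seen at least `threshold` times, sorted by name."""
    counts = Counter(incident["cause"] for incident in incidents)
    return sorted(cause for cause, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    print(postmortem_due("SEV1", datetime(2024, 1, 7, 3, 0)))
    history = [{"cause": "disk full"}, {"cause": "bad deploy"}, {"cause": "disk full"}]
    print(recurring_causes(history))
```

Any cause that `recurring_causes` returns is a candidate for the systemic-problem discussion at the weekly review.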
Incident Management Is an Investment in Peaceful Sleep
PagerDuty, runbooks, and postmortems transformed our incident response from chaos to process. The on-call engineer knows exactly what to do.