Incident Management with PagerDuty — From Chaos to Process

09. 10. 2019 | 1 min read | CORE SYSTEMS | ai
Sunday, 3:00 AM. Production is down. Who knows? Who’s handling it? Before: chaotic phone calls. Now: PagerDuty escalates automatically, runbooks guide the resolution, and a postmortem helps make sure it doesn’t happen again.

Before: Chaos

Monitoring sent emails. Who read them? Nobody at night. The client called support. Support called the manager. The manager searched for someone who knew the system. Time to response: hours.

PagerDuty Setup

On-call rotation: 2 teams, weekly rotation. Primary on-call + secondary escalation. Alert from Prometheus → PagerDuty → phone/SMS/push notification. Acknowledgement timeout: 5 minutes. Escalation after 10 minutes.
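
The rotation and escalation rules above can be sketched in a few lines of Python. This is only an illustration of the logic, not PagerDuty configuration: the team names, the rotation start date, and the idea that the secondary is simply the other team are assumptions, and the 5-minute acknowledgement timeout is left out because the text does not spell out its exact behavior.

```python
from datetime import date

# Hypothetical values: the article names neither the teams nor a start date.
TEAMS = ["team-a", "team-b"]
ROTATION_START = date(2019, 10, 7)  # a Monday, chosen for illustration
ESCALATION_DELAY_MIN = 10           # "Escalation after 10 minutes"


def primary_team(day: date) -> str:
    """Return the team holding primary on-call in the week containing `day`."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return TEAMS[weeks_elapsed % len(TEAMS)]


def paged_teams(day: date, minutes_unacknowledged: int) -> list[str]:
    """Return who has been paged for an alert that is still unacknowledged."""
    primary = primary_team(day)
    paged = [primary]
    if minutes_unacknowledged >= ESCALATION_DELAY_MIN:
        # Assumption: the secondary escalation target is the other team.
        secondary = next(team for team in TEAMS if team != primary)
        paged.append(secondary)
    return paged
```

With these assumed values, `paged_teams(date(2019, 10, 9), 3)` returns `["team-a"]`, and the same call with 12 minutes returns `["team-a", "team-b"]`.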

Incident Severity

  • SEV1: production outage, customers affected → immediate response
  • SEV2: performance degradation, partial outage → 30 min response
  • SEV3: non-critical issue → next business day
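
The severity levels above translate directly into response deadlines. The sketch below is a minimal illustration with two assumptions that are not in the text: business days are Monday to Friday with no holiday calendar, and "next business day" means by the end of that day.

```python
from datetime import date, datetime, time, timedelta
from enum import Enum


class Severity(Enum):
    SEV1 = "production outage, customers affected"
    SEV2 = "performance degradation, partial outage"
    SEV3 = "non-critical issue"


def next_business_day(day: date) -> date:
    """Return the first Monday-to-Friday date after `day` (holidays ignored)."""
    candidate = day + timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


def respond_by(severity: Severity, detected_at: datetime) -> datetime:
    """Return the latest acceptable response time for an incident."""
    if severity is Severity.SEV1:
        return detected_at  # immediate response
    if severity is Severity.SEV2:
        return detected_at + timedelta(minutes=30)
    # SEV3. Assumption: "next business day" means by the end of that day.
    return datetime.combine(next_business_day(detected_at.date()), time(23, 59))
```

For a SEV2 detected on Sunday at 3:00 AM, `respond_by` returns 3:30 AM the same day. A SEV3 detected on a Friday is due by the end of the following Monday.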

Runbooks

Every alert has a link to a runbook. The runbook contains: what the alert means, how to diagnose, how to mitigate, when to escalate. The on-call engineer doesn’t have to be an expert on every system — the runbook guides them.
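
One way to attach the runbook link is to put it in the event sent to PagerDuty. The sketch below builds a trigger event in the shape of the PagerDuty Events API v2 as we understand it (`routing_key`, `event_action`, `payload`, `links`), so check the field names against the current documentation. The mapping from SEV levels to PagerDuty severities, the function name, and every value in the usage example are assumptions for illustration.

```python
# Assumed mapping from the SEV levels to Events API v2 severity values.
PD_SEVERITY = {"SEV1": "critical", "SEV2": "error", "SEV3": "warning"}


def build_trigger_event(routing_key: str, summary: str, source: str,
                        sev: str, runbook_url: str) -> dict:
    """Build a trigger event that carries a link to the alert's runbook."""
    if not runbook_url:
        # Enforces the rule that every alert links to a runbook.
        raise ValueError("every alert must link to a runbook")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": PD_SEVERITY[sev],
        },
        # Links are displayed with the incident for the on-call engineer.
        "links": [{"href": runbook_url, "text": "Runbook"}],
    }
```

A call such as `build_trigger_event("YOUR_ROUTING_KEY", "API error rate above 5%", "prod-api-1", "SEV1", "https://wiki.example.com/runbooks/api-errors")` returns a dictionary that could be serialized with `json.dumps` and posted to the Events API endpoint. The function itself sends nothing.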

Post-Incident

Every SEV1 and SEV2 incident gets a postmortem within 48 hours. Blameless. Action items with owners and deadlines. Review at the weekly SRE meeting. Trend tracking — recurring incidents indicate a systemic problem.
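
The postmortem deadline and the trend tracking can both be expressed as small helpers. This is a sketch under stated assumptions: the 48 hours are counted from the moment the incident is resolved, incidents are represented as `(alert_name, severity)` pairs, and an alert counts as recurring once it has fired twice. None of these details come from the text.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Optional

POSTMORTEM_WINDOW = timedelta(hours=48)
POSTMORTEM_SEVERITIES = {"SEV1", "SEV2"}


def postmortem_due(severity: str, resolved_at: datetime) -> Optional[datetime]:
    """Return the postmortem deadline, or None if none is required."""
    if severity not in POSTMORTEM_SEVERITIES:
        return None
    # Assumption: the 48-hour window starts at resolution time.
    return resolved_at + POSTMORTEM_WINDOW


def recurring_alerts(incidents: list[tuple[str, str]],
                     min_count: int = 2) -> dict[str, int]:
    """Return alerts that fired at least `min_count` times, with their counts."""
    counts = Counter(alert_name for alert_name, _severity in incidents)
    return {name: n for name, n in counts.items() if n >= min_count}
```

Running `recurring_alerts` over a month of incidents before the weekly SRE meeting gives a short list of alerts worth treating as systemic problems rather than one-off failures.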

Incident Management Is an Investment in Peaceful Sleep

PagerDuty, runbooks, and postmortems transformed our incident response from chaos to process. The on-call engineer knows exactly what to do.

Tags: pagerduty, incident management, sre, on-call