When an incident happens, you need a procedure, not panic.
Detection
- ☐ Alert received and acknowledged
- ☐ Severity assessed
- ☐ Incident commander assigned
- ☐ Communication channel opened (#incident-YYYYMMDD)
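The `#incident-YYYYMMDD` naming pattern can be generated from the incident date, so every responder arrives at the same channel name. This is a minimal sketch: the function name `incident_channel_name` is hypothetical, and defaulting to the UTC date is an assumption, not something the checklist prescribes.

```python
from datetime import date, datetime, timezone
from typing import Optional


def incident_channel_name(day: Optional[date] = None) -> str:
    """Build a channel name following the #incident-YYYYMMDD pattern."""
    # Assumption: default to today's date in UTC so responders in
    # different time zones derive the same name.
    if day is None:
        day = datetime.now(timezone.utc).date()
    return f"#incident-{day:%Y%m%d}"


print(incident_channel_name(date(2024, 3, 7)))  # prints "#incident-20240307"
```

If more than one incident can occur on the same day, the name would need an extra suffix. The checklist does not say how to handle that case.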
Assessment
- ☐ Impact scope (how many users?)
- ☐ Which services are affected?
- ☐ When did the problem start?
- ☐ Is there a known workaround?
Mitigation
- ☐ Rollback if recent deploy
- ☐ Traffic shift (failover region)
- ☐ Service restart
- ☐ Scaling up
- ☐ User communication (status page)
Communication
- ☐ Internal update every 30 minutes
- ☐ Status page updated
- ☐ Management informed (P1/P2)
- ☐ Customer support briefed
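The communication cadence above can be sketched as a small helper that lists when the next internal updates are due and whether management needs to be informed. The function names are hypothetical, and reading "(P1/P2)" as "only for P1 and P2 incidents" is an assumption.

```python
from datetime import datetime, timedelta
from typing import List


def internal_update_times(
    start: datetime, interval_minutes: int = 30, count: int = 4
) -> List[datetime]:
    """Return the next `count` times an internal update is due."""
    # The 30-minute default mirrors the checklist item above.
    return [
        start + timedelta(minutes=interval_minutes * n)
        for n in range(1, count + 1)
    ]


def management_must_be_informed(severity: str) -> bool:
    """Assumption: management is informed only for P1 and P2 incidents."""
    return severity.upper() in {"P1", "P2"}


start = datetime(2024, 3, 7, 14, 0)
for due in internal_update_times(start, count=3):
    print(due.strftime("%H:%M"))  # prints 14:30, 15:00, 15:30 on separate lines

print(management_must_be_informed("P1"))  # prints True
print(management_must_be_informed("P3"))  # prints False
```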
Resolution
- ☐ Root cause identified
- ☐ Fix applied
- ☐ Monitoring confirms stability
- ☐ Status page: resolved
After Action
- ☐ Postmortem within 48 hours
- ☐ Action items with owners
- ☐ Follow-up meeting scheduled
- ☐ Metrics: MTTD, MTTR
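The two metrics can be computed from three timestamps per incident. The sketch below takes MTTD as the mean time from incident start to detection and MTTR as the mean time from incident start to resolution. Teams define MTTR differently (some measure from detection), so treat this as one possible convention. The `Incident` record and the helper names are hypothetical, and counting the 48-hour postmortem window from resolution is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass
class Incident:
    started: datetime   # when the problem began
    detected: datetime  # when the alert was acknowledged
    resolved: datetime  # when monitoring confirmed stability


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean time to detect: average of (detected - started)."""
    seconds = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time to resolve, measured here from incident start."""
    seconds = [(i.resolved - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def postmortem_due(incident: Incident) -> datetime:
    """Assumption: the 48-hour window starts at resolution."""
    return incident.resolved + timedelta(hours=48)


history = [
    Incident(datetime(2024, 3, 7, 10, 0), datetime(2024, 3, 7, 10, 5),
             datetime(2024, 3, 7, 11, 0)),
    Incident(datetime(2024, 3, 9, 14, 0), datetime(2024, 3, 9, 14, 15),
             datetime(2024, 3, 9, 14, 40)),
]

print(mttd(history))  # prints "0:10:00"
print(mttr(history))  # prints "0:50:00"
print(postmortem_due(history[0]))  # prints "2024-03-09 11:00:00"
```

Both functions raise `statistics.StatisticsError` on an empty list, which is a reasonable signal that there is nothing to report yet.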
Key takeaway
Stay calm, communicate, and follow the procedure. Practice incident response regularly by running game days.