A postmortem is not about finding culprits. It is about making sure the incident does not happen again.
Blameless Culture¶
Reframe “Jan deleted the database” as “The production database had no protection against deletion.” Look for systemic causes, not culprits.
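As an illustration of what a systemic fix might look like, here is a minimal sketch of a guard around a destructive operation. Everything in it is hypothetical: the `drop_database` wrapper, the `PROTECTED_ENVIRONMENTS` set, and the rule that the caller must retype the database name are assumptions made for this example, not a description of any particular tool.

```python
class ProtectedEnvironmentError(RuntimeError):
    """Raised when a destructive operation targets a protected environment."""


# Assumption: environments are identified by plain strings.
PROTECTED_ENVIRONMENTS = {"production"}


def drop_database(name, environment, confirm=None):
    """Hypothetical wrapper around a destructive operation.

    In a protected environment the caller must pass the database name
    again as `confirm`, so a single mistyped command cannot delete data.
    """
    if environment in PROTECTED_ENVIRONMENTS and confirm != name:
        raise ProtectedEnvironmentError(
            f"refusing to drop '{name}' in {environment}: "
            "pass confirm=<database name> to proceed"
        )
    # A real implementation would call the database here.
    # This sketch only reports what it would have done.
    return f"dropped {name} in {environment}"
```

With a guard like this, the postmortem finding shifts from who ran the command to why the command was possible without a second confirmation.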
Template¶
Incident: [name]¶
Date: YYYY-MM-DD
Severity: Critical/Major/Minor
Duration: X hours
Impact: Y users affected, Z transactions lost
Timeline¶
HH:MM — What happened
HH:MM — Alert fired
HH:MM — On-call notified
HH:MM — Root cause identified
HH:MM — Mitigation applied
HH:MM — Resolved
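The timeline also answers questions such as how long detection took. The sketch below derives three durations from the timestamps. The event keys, the example times, and the assumption that the whole incident falls within one calendar day are all made up for illustration.

```python
from datetime import datetime


def minutes_between(start, end):
    """Minutes between two 'HH:MM' strings on the same day (assumption)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def timeline_metrics(timeline):
    """Derive detection, mitigation, and resolution times from a timeline."""
    start = timeline["started"]
    return {
        "time_to_detect": minutes_between(start, timeline["alert_fired"]),
        "time_to_mitigate": minutes_between(start, timeline["mitigation_applied"]),
        "time_to_resolve": minutes_between(start, timeline["resolved"]),
    }


# Illustrative times only, not taken from a real incident.
example_timeline = {
    "started": "14:02",
    "alert_fired": "14:17",
    "on_call_notified": "14:20",
    "root_cause_identified": "14:55",
    "mitigation_applied": "15:05",
    "resolved": "15:40",
}

print(timeline_metrics(example_timeline))
# {'time_to_detect': 15, 'time_to_mitigate': 63, 'time_to_resolve': 98}
```

An incident that crosses midnight would need full dates in the timeline rather than bare `HH:MM` values.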
Root Cause¶
Detailed description of the cause.
Contributing Factors¶
What made the situation worse?
Action Items¶
| Action | Owner | Deadline | Priority |
|---|---|---|---|
| Add deletion guard for production database | John | YYYY-MM-DD | P1 |
Key Questions¶
- Why did detection take so long?
- Why was there no automatic rollback?
- Why didn’t tests cover this scenario?
- Did we have a runbook? Did it help?
Follow-up¶
Every action item must have an owner and a deadline. Review completion in weekly standups.
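The review can be partly automated. The following is a minimal sketch, assuming action items are kept as structured records; the `ActionItem` fields mirror the table columns, while the `done` flag and the `review` function are hypothetical additions for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    action: str
    owner: Optional[str]
    deadline: Optional[date]
    priority: str
    done: bool = False  # assumption: completion is tracked as a simple flag


def review(items, today):
    """Return a list of problems: missing owner, missing deadline, or overdue."""
    problems = []
    for item in items:
        if not item.owner:
            problems.append(f"{item.action}: no owner")
        if item.deadline is None:
            problems.append(f"{item.action}: no deadline")
        elif not item.done and item.deadline < today:
            problems.append(
                f"{item.action}: overdue since {item.deadline.isoformat()}"
            )
    return problems
```

Running `review` before each standup gives a short list to walk through, so items without an owner or deadline surface immediately instead of quietly aging.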
Remember¶
A postmortem without action items is just a story. A postmortem with follow-through is an improvement.