SRE: Postmortem Best Practices
Blameless postmortems in practice: structure, facilitation, action plans, and building a learning culture.
Blameless Culture
A postmortem never looks for someone to blame. It looks for systemic causes.
- People make mistakes, and that is normal
- If a single human mistake can cause an outage, the system failed to prevent it
- Blame teaches people to hide mistakes, and hidden mistakes lead to worse systems
Postmortem Structure
# Postmortem: API Outage 2026-02-10
## Summary
90-minute API Gateway outage caused by OOM in Envoy proxy.
## Impact
- Duration: 90 min
- Affected users: ~12,000
- Error rate: 78%
## Timeline (CET)
- 14:25 — Deploy api-gateway v2.3.1
- 14:30 — Alert: ErrorRateHigh
- 14:40 — Diagnosis: Envoy OOMKilled
- 14:55 — Rollback initiated
- 16:00 — Full recovery
## Root Cause
Regex filter with exponential backtracking (ReDoS).
## Action Items
| # | Action | Owner | Deadline | Priority |
|---|--------|-------|----------|----------|
| 1 | Regex complexity check in CI | @platform | 2026-02-17 | P1 |
| 2 | Extend canary to 30 min | @sre | 2026-02-14 | P1 |
| 3 | Lower Envoy memory limit | @sre | 2026-02-12 | P2 |
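The first action item in the example, a regex complexity check in CI, could start as a small heuristic. The sketch below is an illustration under assumptions, not the check used in the incident. The names `has_nested_quantifier` and `find_risky_patterns` are hypothetical. The only rule applied is "a group that contains `+` or `*` and is itself repeated", which is one common cause of exponential backtracking.

```python
import re

# Heuristic (assumption): flag a parenthesised group that contains an
# unescaped + or * and is itself followed by +, * or {.
# Examples: (a+)+   (\d*)*   (x+){2,}
# Escaped characters such as \+ are consumed as a pair and not counted.
NESTED_QUANTIFIER = re.compile(
    r"\((?:[^()\\]|\\.)*[+*](?:[^()\\]|\\.)*\)[+*{]"
)


def has_nested_quantifier(pattern: str) -> bool:
    """Return True if the pattern contains a quantified group that
    itself contains a quantifier."""
    return NESTED_QUANTIFIER.search(pattern) is not None


def find_risky_patterns(patterns: list[str]) -> list[str]:
    """Return the subset of patterns that the heuristic flags.

    A CI job could fail the build when this list is not empty.
    """
    return [p for p in patterns if has_nested_quantifier(p)]
```

This is deliberately conservative and incomplete. It can raise false positives (for example on a character class such as `[+*]` inside a repeated group) and it misses other risky shapes such as overlapping alternations like `(a|a)*`. Treat it as a first gate, not as proof that a pattern is safe.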
Facilitation
- Hold the meeting within 48 hours of the incident
- Choose a facilitator who was not involved in the incident
- Walk through the timeline, focusing on what happened, not on who did it
- Use the 5 Whys technique to reach the root cause
- Define concrete actions, each with an owner and a deadline
- Publish the postmortem internally for transparency
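Two of these rules are easy to check mechanically: the 48-hour window and the requirement that every action has an owner and a deadline. The following is a minimal sketch, assuming the window is measured from full recovery (the text does not say which moment counts). The `ActionItem` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Optional


@dataclass
class ActionItem:
    action: str
    owner: Optional[str]
    deadline: Optional[date]
    priority: str


def incomplete_items(items: list[ActionItem]) -> list[ActionItem]:
    """Return action items that lack an owner or a deadline."""
    return [i for i in items if not i.owner or i.deadline is None]


def review_is_on_time(
    recovered_at: datetime, review_at: datetime, limit_hours: int = 48
) -> bool:
    """Return True if the review starts within limit_hours after recovery.

    Assumption: the clock starts at full recovery.
    """
    delay = review_at - recovered_at
    return timedelta(0) <= delay <= timedelta(hours=limit_hours)


# Items 1 and 2 come from the example postmortem; the third is a
# made-up incomplete entry to show what the check catches.
items = [
    ActionItem("Regex complexity check in CI", "@platform", date(2026, 2, 17), "P1"),
    ActionItem("Extend canary to 30 min", "@sre", date(2026, 2, 14), "P1"),
    ActionItem("Hypothetical follow-up without owner", None, None, "P2"),
]

for item in incomplete_items(items):
    print(f"Missing owner or deadline: {item.action}")
```

A check like this can run when the postmortem document is submitted for review, so incomplete action items are caught before the document is published.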
Summary
Postmortems are an investment in future reliability. Blameless culture and concrete actions help the entire organization learn.