SRE — Postmortems Best Practices

Blameless postmortems in practice: structure, facilitation, action plans, and building a learning culture.

Blameless Culture

A postmortem never looks for someone to blame. It looks for systemic causes.

- People make mistakes; that is normal.
- If one person's mistake can cause an outage, the system failed to prevent it.
- Blame teaches people to hide mistakes, and hidden mistakes lead to worse systems.

Postmortem Structure

# Postmortem: API Outage 2026-02-10

## Summary
90-minute API Gateway outage caused by OOM in Envoy proxy.

## Impact
- Duration: 90 min
- Affected users: ~12,000
- Error rate: 78%

## Timeline (CET)
- 14:25 — Deploy api-gateway v2.3.1
- 14:30 — Alert: ErrorRateHigh
- 14:40 — Diagnosis: Envoy OOMKilled
- 14:55 — Rollback initiated
- 16:00 — Full recovery

## Root Cause
Regex filter with exponential backtracking (ReDoS).

## Action Items
| # | Action | Owner | Deadline | Priority |
|---|--------|-------|----------|----------|
| 1 | Regex complexity check in CI | @platform | 2026-02-17 | P1 |
| 2 | Extend canary to 30 min | @sre | 2026-02-14 | P1 |
| 3 | Review and adjust Envoy memory limit | @sre | 2026-02-12 | P2 |
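
The intervals implied by the example timeline can be checked with a short script. The sketch below is illustrative only: it assumes the outage is measured from the first alert to full recovery, which matches the 90 minutes stated under Impact, and the names `TIMELINE`, `minutes_between`, `time_of`, and `incident_metrics` are hypothetical helpers, not part of any tool.

```python
from datetime import datetime

# Timeline entries copied from the example postmortem (all times CET, same day).
TIMELINE = [
    ("14:25", "Deploy api-gateway v2.3.1"),
    ("14:30", "Alert: ErrorRateHigh"),
    ("14:40", "Diagnosis: Envoy OOMKilled"),
    ("14:55", "Rollback initiated"),
    ("16:00", "Full recovery"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def time_of(keyword: str) -> str:
    """Return the timestamp of the first event whose text contains keyword."""
    for timestamp, event in TIMELINE:
        if keyword in event:
            return timestamp
    raise KeyError(f"no timeline event mentions {keyword!r}")


def incident_metrics() -> dict:
    """Derive detection, diagnosis, and recovery intervals from the timeline."""
    alert = time_of("Alert")
    return {
        # Assumption: the deploy is treated as the start of the problem.
        "deploy_to_alert_min": minutes_between(time_of("Deploy"), alert),
        "alert_to_diagnosis_min": minutes_between(alert, time_of("Diagnosis")),
        "alert_to_rollback_min": minutes_between(alert, time_of("Rollback")),
        # Assumption: outage duration runs from first alert to full recovery.
        "alert_to_recovery_min": minutes_between(alert, time_of("recovery")),
    }


if __name__ == "__main__":
    for name, value in incident_metrics().items():
        print(f"{name}: {value}")
```

Computing these numbers from the timeline, rather than typing them by hand, keeps the Impact section consistent with the recorded events.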

Facilitation

- Hold the meeting within 48 hours of the incident.
- Choose a facilitator who was not involved in the incident.
- Walk through the timeline: focus on what happened, not on who did it.
- Use the 5 Whys technique to get from symptoms to the root cause.
- Define concrete actions, each with an owner and a deadline.
- Publish the postmortem internally for transparency.
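
Two of the points above lend themselves to an automated check: every action has an owner and a deadline, and the review happens within 48 hours. The following is a minimal sketch under stated assumptions, not a reference implementation: `ActionItem`, `action_item_problems`, and `review_held_in_time` are hypothetical names, and the 48-hour window is assumed to start at full recovery.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import List, Optional


@dataclass
class ActionItem:
    """One row of the Action Items table (hypothetical model)."""
    action: str
    owner: str
    deadline: Optional[date]
    priority: str


def action_item_problems(item: ActionItem, incident_day: date) -> List[str]:
    """List what makes an action item non-actionable; empty means it is fine."""
    problems = []
    if not item.owner.strip():
        problems.append("missing owner")
    if item.deadline is None:
        problems.append("missing deadline")
    elif item.deadline <= incident_day:
        problems.append("deadline is not after the incident")
    return problems


def review_held_in_time(recovered_at: datetime, review_at: datetime,
                        limit_hours: int = 48) -> bool:
    """True if the review took place within limit_hours after recovery."""
    # Assumption: the clock starts at full recovery, not at the first alert.
    elapsed = review_at - recovered_at
    return timedelta(0) <= elapsed <= timedelta(hours=limit_hours)


if __name__ == "__main__":
    incident_day = date(2026, 2, 10)
    items = [
        ActionItem("Regex complexity check in CI", "@platform", date(2026, 2, 17), "P1"),
        ActionItem("Extend canary to 30 min", "@sre", date(2026, 2, 14), "P1"),
    ]
    for item in items:
        print(item.action, action_item_problems(item, incident_day) or "ok")
```

A check like this could run when the postmortem document is submitted, so that items without an owner or deadline are caught before publication.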

Summary

Postmortems are an investment in future reliability. Blameless culture and concrete actions help the entire organization learn.
