SRE — Runbooks and Operational Documentation¶
A guide to effective runbooks for incident response: how to structure them, automate them, and keep operational documentation up to date.
Why Runbooks¶
A runbook is a step-by-step guide for resolving a specific incident. It reduces dependency on tribal knowledge, so any on-call engineer can act without first finding the person who knows the system.
Runbook Structure¶
# Runbook: High Memory Usage on API Pods
## Alert
- Alertmanager: PodMemoryUsageHigh
- Threshold: > 90% of the memory limit for 5 minutes
## Diagnostics
1. kubectl top pods -n production -l app=api-server --sort-by=memory
2. kubectl get events -n production --field-selector reason=OOMKilling
## Mitigation (short-term)
1. kubectl rollout restart deployment/api-server -n production
2. kubectl set resources deployment/api-server -n production --limits=memory=2Gi
## Mitigation (long-term)
1. Analyze heap dump
2. Identify memory leak
3. Fix + deploy
## Escalation
- P1: @sre-oncall → @sre-lead (15 min)
- P2: @sre-oncall → ticket (next business day)
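A consistent structure is easier to keep when it is checked automatically. The sketch below is a minimal linter that reports which sections are missing from a runbook written in Markdown. Treating the four section names from the example above as mandatory is an assumption made for illustration, and `missing_sections` is a hypothetical helper, not part of any tool.

```python
import re

# Section names taken from the example runbook above; treating them as
# mandatory is an assumption of this sketch, not a rule from the article.
REQUIRED_SECTIONS = ("Alert", "Diagnostics", "Mitigation", "Escalation")


def missing_sections(runbook_markdown: str) -> list[str]:
    """Return required section names that have no matching '## ' heading."""
    headings = re.findall(r"^##\s+(.+?)\s*$", runbook_markdown, flags=re.MULTILINE)
    missing = []
    for required in REQUIRED_SECTIONS:
        # Prefix match, so "Mitigation (short-term)" satisfies "Mitigation".
        if not any(h.lower().startswith(required.lower()) for h in headings):
            missing.append(required)
    return missing


# A shortened copy of the example runbook, with the Escalation section left out.
example = """# Runbook: High Memory Usage on API Pods
## Alert
- Alertmanager: PodMemoryUsageHigh
## Diagnostics
1. kubectl top pods -n production -l app=api-server --sort-by=memory
## Mitigation (short-term)
1. kubectl rollout restart deployment/api-server -n production
"""

print(missing_sections(example))  # ['Escalation']
```

A check like this could run in CI on every change to the runbook repository, so an incomplete runbook is caught at review time rather than during an incident.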
Automated Runbooks¶
- Rundeck/Ansible: execute runbook steps from a UI
- PagerDuty Automation Actions: automated diagnostics
- Kubernetes Operators: self-healing
- ChatOps: run diagnostics from chat, for example /incident diagnose high-memory
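To make the ChatOps option concrete, here is a rough sketch of how a chat command such as /incident diagnose high-memory might be mapped to the diagnostic steps of a runbook. The `DIAGNOSTICS` registry and the `parse_command` and `diagnose` functions are hypothetical names invented for this example; a real bot would also need authentication and audit logging. Only read-only diagnostics are wired up, and the default is a dry run that returns the commands without executing them.

```python
import shlex
import subprocess

# Hypothetical registry: maps a runbook name to its diagnostic commands.
# The commands are the Diagnostics steps from the example runbook.
DIAGNOSTICS = {
    "high-memory": [
        "kubectl top pods -n production -l app=api-server --sort-by=memory",
        "kubectl get events -n production --field-selector reason=OOMKilling",
    ],
}


def parse_command(text: str) -> str:
    """Extract the runbook name from '/incident diagnose <name>'."""
    parts = text.split()
    if len(parts) != 3 or parts[:2] != ["/incident", "diagnose"]:
        raise ValueError(f"unsupported command: {text!r}")
    if parts[2] not in DIAGNOSTICS:
        raise ValueError(f"unknown runbook: {parts[2]!r}")
    return parts[2]


def diagnose(text: str, dry_run: bool = True) -> list[str]:
    """Return the diagnostic commands; execute them only if dry_run is False."""
    outputs = []
    for command in DIAGNOSTICS[parse_command(text)]:
        if dry_run:
            outputs.append(command)
        else:
            # Requires kubectl on PATH and access to the cluster.
            result = subprocess.run(
                shlex.split(command), capture_output=True, text=True, check=False
            )
            outputs.append(result.stdout or result.stderr)
    return outputs


for line in diagnose("/incident diagnose high-memory"):
    print(line)
```

Keeping mitigation steps out of the registry is a deliberate choice in this sketch: diagnostics are safe to run automatically, while restarts and limit changes stay with a human until the runbook has been exercised enough to trust.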
Maintenance¶
- Review runbooks after every incident
- Test during Game Days
- Assign owners
- Version in Git
- If a runbook hasn’t been updated in 6 months, trigger a review
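The six-month rule is simple to automate once runbooks are versioned. The following sketch flags runbooks whose last update is older than the review interval. The file paths and dates are made up for illustration, `stale_runbooks` is a hypothetical helper, and approximating six months as 182 days is an assumption.

```python
from datetime import date, timedelta

# "6 months" approximated as 182 days; an assumption of this sketch.
REVIEW_INTERVAL = timedelta(days=182)


def stale_runbooks(last_updated: dict[str, date], today: date) -> list[str]:
    """Return runbook paths whose last update is older than the interval."""
    return sorted(
        path
        for path, updated in last_updated.items()
        if today - updated > REVIEW_INTERVAL
    )


# Hypothetical paths and dates, for illustration only.
runbooks = {
    "runbooks/high-memory.md": date(2024, 1, 10),
    "runbooks/disk-full.md": date(2024, 8, 1),
}

print(stale_runbooks(runbooks, today=date(2024, 9, 1)))  # ['runbooks/high-memory.md']
```

With runbooks in Git, the last-updated date for each file could be read from the commit history, for example with `git log -1 --format=%cs -- <path>`, and the resulting list could open review tickets for the assigned owners.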
Summary¶
Good runbooks can be the difference between a 5-minute and a 2-hour mitigation. Treat them like code: versioned, tested, and reviewed.