On-Call Best Practices
Effective on-call: alerting, runbooks, sustainability.
Principles
- Clear rotation
- Documented runbooks
- Actionable alerts
- Compensation
Runbook
# Alert: HighErrorRate
## Steps
1. kubectl get pods -n production
2. kubectl logs -l app=api -n production --tail=100
3. Bad deploy? kubectl rollout undo deploy/api -n production
How to Set Up Sustainable On-Call
A healthy rotation keeps each engineer on call for at most 1 week in 4 (25% of the time). If the team is too small to meet that limit (fewer than four people for a weekly rotation), on-call becomes unsustainable and leads to burnout. Every alert must be actionable: if an alert does not require immediate action, lower its severity or remove it. Aim for no more than 2 alerts per on-call shift.
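The limits above can be expressed as simple checks. The following is a minimal sketch, not a definitive implementation: it assumes a single rotation shared evenly across the team, and the `Alert` class, the severity names `"page"` and `"ticket"`, and all function names are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass

MAX_ONCALL_SHARE = 0.25    # at most 1 week in 4, as stated in the text
MAX_ALERTS_PER_SHIFT = 2   # alert budget per shift, as stated in the text


def oncall_share(team_size: int) -> float:
    """Fraction of time each engineer is on call (assumes an evenly shared rotation)."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return 1 / team_size


def rotation_is_sustainable(team_size: int) -> bool:
    """True when each engineer's on-call share stays within the 25% limit."""
    return oncall_share(team_size) <= MAX_ONCALL_SHARE


@dataclass
class Alert:
    name: str
    needs_immediate_action: bool
    severity: str  # hypothetical levels: "page" (wakes someone up) or "ticket"


def triage(alert: Alert) -> Alert:
    """Downgrade a paging alert that does not require immediate action."""
    if alert.severity == "page" and not alert.needs_immediate_action:
        return Alert(alert.name, alert.needs_immediate_action, "ticket")
    return alert


def shifts_over_budget(alerts_per_shift: list) -> list:
    """Return the indexes of shifts that exceeded the alert budget."""
    return [i for i, count in enumerate(alerts_per_shift)
            if count > MAX_ALERTS_PER_SHIFT]
```

For example, `rotation_is_sustainable(3)` is false because each engineer would be on call a third of the time, while a team of four meets the limit exactly. Shifts flagged by `shifts_over_budget` are a signal to review which alerts were not actionable.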
Runbooks are living documents that describe, step by step, how to diagnose and resolve a specific alert. Each one should state what the alert means, which steps to take, when to escalate, and which experts to contact. Automate as much as possible: if a runbook contains repetitive steps, turn them into a script or auto-remediation. After every incident, update the runbook with new findings. Compensation for on-call work (a bonus or time off) is essential for a fair system.
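As an illustration of turning repetitive runbook steps into a script, here is a minimal sketch that wraps the HighErrorRate steps from the runbook section. It rests on several assumptions: the `api` workload is assumed to live in the `production` namespace used by the first runbook step, the function names are hypothetical, and the decision whether the deploy is bad is left to the caller rather than detected automatically.

```python
import subprocess
from typing import Callable, List

NAMESPACE = "production"   # assumption: same namespace as the first runbook step
SELECTOR = "app=api"       # label selector taken from the runbook
DEPLOYMENT = "deploy/api"  # deployment taken from the runbook


def diagnostic_commands() -> List[List[str]]:
    """The read-only runbook steps: list pods, then fetch recent logs."""
    return [
        ["kubectl", "get", "pods", "-n", NAMESPACE],
        ["kubectl", "logs", "-l", SELECTOR, "-n", NAMESPACE, "--tail=100"],
    ]


def rollback_command() -> List[str]:
    """The remediation step: roll the deployment back to its previous revision."""
    return ["kubectl", "rollout", "undo", DEPLOYMENT, "-n", NAMESPACE]


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def handle_high_error_rate(
    bad_deploy: bool,
    runner: Callable[[List[str]], str] = run,
) -> List[str]:
    """Run the diagnostics, then roll back only if the caller confirms a bad deploy."""
    outputs = [runner(cmd) for cmd in diagnostic_commands()]
    if bad_deploy:
        outputs.append(runner(rollback_command()))
    return outputs
```

The `runner` parameter exists so the script can be exercised without a cluster, for instance by passing a function that only records the commands it receives. Keeping the rollback behind an explicit `bad_deploy` flag means a human still makes the judgment call from step 3 of the runbook.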
Summary
Actionable alerts + runbooks + fair rotation = sustainable on-call.