On-Call Best Practices

DevOps Intermediate

On-Call, SRE, Alerting

Effective on-call: alerting, runbooks, and sustainability.

Principles

Clear rotation
Documented runbooks
Actionable alerts
Compensation

Runbook

# Alert: HighErrorRate
## Steps
1. kubectl get pods -n production
2. kubectl logs -l app=api -n production --tail=100
3. Bad deploy? kubectl rollout undo deploy/api -n production
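
The runbook steps above could be wrapped in a small helper so that whoever is on call does not have to retype them under pressure. The following is a minimal sketch, not a definitive implementation. It assumes the `api` workload lives in the `production` namespace (the runbook names the namespace only in step 1), and the function names are hypothetical. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Assumptions for illustration: all three runbook steps target this namespace.
NAMESPACE = "production"
APP_LABEL = "app=api"
DEPLOYMENT = "deploy/api"


def diagnostic_commands() -> list[list[str]]:
    """Steps 1 and 2 of the runbook: list pods, then tail recent logs."""
    return [
        ["kubectl", "get", "pods", "-n", NAMESPACE],
        ["kubectl", "logs", "-l", APP_LABEL, "-n", NAMESPACE, "--tail=100"],
    ]


def rollback_command() -> list[str]:
    """Step 3 of the runbook: roll back to the previous revision."""
    return ["kubectl", "rollout", "undo", DEPLOYMENT, "-n", NAMESPACE]


def run(commands: list[list[str]], dry_run: bool = True) -> list[str]:
    """Print each command, and execute it only when dry_run is False."""
    rendered = []
    for cmd in commands:
        line = " ".join(cmd)
        rendered.append(line)
        if dry_run:
            print(f"DRY RUN: {line}")
        else:
            # check=True raises CalledProcessError if kubectl exits non-zero.
            subprocess.run(cmd, check=True)
    return rendered


if __name__ == "__main__":
    run(diagnostic_commands())      # safe: prints only
    run([rollback_command()])       # still a dry run until dry_run=False
```

Keeping the rollback behind an explicit `dry_run=False` leaves the "Bad deploy?" judgment with the engineer rather than with the script.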

How to Set Up Sustainable On-Call

A healthy rotation puts each engineer on call for at most 1 week out of 4 (25% of the time), which for a single-person rotation means at least four people. With a smaller team, on-call becomes unsustainable and leads to burnout. Every alert must be actionable: if an alert does not require immediate action, lower its severity or remove it. Aim for no more than 2 alerts per on-call shift.
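
The two thresholds above (25% on-call time, 2 alerts per shift) are easy to check mechanically. The sketch below is illustrative only: it assumes an even rotation with one person on call at a time, and the function names and the shift identifiers are hypothetical.

```python
from collections import Counter

MAX_ONCALL_FRACTION = 0.25   # at most 1 week in 4, as stated in the text
MAX_ALERTS_PER_SHIFT = 2     # target alert budget, as stated in the text


def oncall_fraction(team_size: int) -> float:
    """Share of time each engineer is on call in an even, one-person rotation."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return 1 / team_size


def rotation_is_sustainable(team_size: int) -> bool:
    """True when each engineer is on call no more than 25% of the time."""
    return oncall_fraction(team_size) <= MAX_ONCALL_FRACTION


def shifts_over_budget(alert_shifts: list[str]) -> dict[str, int]:
    """Given one shift ID per fired alert, return the shifts that exceeded the budget."""
    counts = Counter(alert_shifts)
    return {shift: n for shift, n in counts.items() if n > MAX_ALERTS_PER_SHIFT}


if __name__ == "__main__":
    print(rotation_is_sustainable(4))                      # four people: 25% each
    print(rotation_is_sustainable(3))                      # three people: about 33% each
    print(shifts_over_budget(["w1", "w1", "w1", "w2"]))    # w1 had three alerts
```

A shift that shows up in `shifts_over_budget` is a prompt to review its alerts and lower the severity of, or remove, the ones that were not actionable.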

Runbooks are living documents that describe, step by step, how to diagnose and resolve a specific alert. Each one should state what the alert means, what steps to take, when to escalate, and which experts to contact. Automate as much as possible: if a runbook contains repetitive steps, turn them into a script or auto-remediation. After every incident, update the runbook with new findings. Finally, compensation for on-call (a bonus or time off) is essential for a fair system.
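
One way to keep runbooks from drifting is to check that each one has the four parts listed above. The following is a rough sketch under assumed names: the `Runbook` class and its field names are hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """Hypothetical in-memory representation of one runbook."""
    alert: str
    meaning: str = ""                                   # what the alert means
    steps: list[str] = field(default_factory=list)      # what steps to take
    escalate_when: str = ""                             # when to escalate
    contacts: list[str] = field(default_factory=list)   # expert contacts

    def missing_sections(self) -> list[str]:
        """Return the names of required sections that are still empty."""
        required = {
            "meaning": self.meaning,
            "steps": self.steps,
            "escalate_when": self.escalate_when,
            "contacts": self.contacts,
        }
        return [name for name, value in required.items() if not value]


if __name__ == "__main__":
    # The HighErrorRate runbook from this article lists steps only.
    rb = Runbook(
        alert="HighErrorRate",
        steps=[
            "kubectl get pods -n production",
            "kubectl logs -l app=api --tail=100",
            "Bad deploy? kubectl rollout undo deploy/api",
        ],
    )
    print(rb.missing_sections())
```

A check like this could run in CI or as part of a post-incident review, so that a runbook without escalation criteria or contacts is noticed before the next page.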

Summary

Actionable alerts + runbooks + fair rotation = sustainable on-call.


CORE SYSTEMS team
