We read the Google SRE book and said: we want this. Not all at once, since we’re not Google. But the principles behind SLOs, error budgets, and blameless postmortems apply even to a team of our size.
SLI, SLO, SLA
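SLI: a measurable reliability metric, for example the share of requests that succeed. SLO: the target for an SLI (99.9% availability allows at most about 43 minutes of downtime in a 30-day month). SLA: a contractual commitment to customers, always set looser than the SLO so there is a margin before a breach.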
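The arithmetic behind the 43-minute figure can be sketched in a few lines. This is a minimal illustration, assuming a 30-day window and a time-based availability SLO; the function name `allowed_downtime_minutes` is ours, not part of any tool.

```python
def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Maximum downtime in minutes that still meets the SLO over the window.

    Assumes a time-based availability SLO and a window of `days` days.
    """
    total_minutes = days * 24 * 60
    # The share of the window that may be "down" is (100 - SLO) percent.
    return total_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} min per 30 days")
```

Each extra nine cuts the allowance by a factor of ten: roughly 432 minutes at 99%, 43.2 at 99.9%, and 4.3 at 99.99%.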
Error budgets: license to take risks
The error budget is the complement of the SLO: a 99.9% target leaves 0.1% of the window as budget for failures. While you have budget, you can take risks: deploy, experiment. When you exhaust it, you stop deployments and fix things. It gives you an objective metric instead of “we don’t want to deploy”.
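As a rough sketch of how such a gate could look, the snippet below tracks a request-based error budget. It assumes the SLI is the fraction of successful requests in the SLO window; the `ErrorBudget` class and its field names are hypothetical and not taken from any specific tool.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float            # target as a fraction, e.g. 0.999 (assumed request-based)
    total_requests: int   # requests served in the SLO window
    failed_requests: int  # requests that counted against the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the complement of the SLO applied to total traffic.
        return (1 - self.slo) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means untouched budget; zero or below means it is exhausted.
        if self.allowed_failures == 0:
            return 0.0
        return 1 - self.failed_requests / self.allowed_failures

    def can_deploy(self) -> bool:
        # Deployments are allowed only while some budget is left.
        return self.remaining_fraction > 0


if __name__ == "__main__":
    healthy = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=400)
    exhausted = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=1_200)
    print(f"healthy: {healthy.remaining_fraction:.0%} left, deploy={healthy.can_deploy()}")
    print(f"exhausted: {exhausted.remaining_fraction:.0%} left, deploy={exhausted.can_deploy()}")
```

With one million requests and a 99.9% SLO, about 1,000 failures are allowed. At 400 failures roughly 60% of the budget is left and deployments continue; at 1,200 the budget is overspent and the gate closes.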
Blameless postmortems
Every incident with SLO impact gets a postmortem. We don’t look for someone to blame, we look for systemic causes. Each postmortem covers the timeline, impact, root cause, what went well and what went wrong, and action items. We share them across the company.
On-call rotation
We run a formal on-call rotation: one engineer per week, PagerDuty for alerting, runbooks for known issues. On-call is compensated, because burnout is not SRE.
SRE is cultural change, not just tooling
SRE is about how we think about reliability, how we balance speed and stability, how we learn from mistakes. Even a team of ten people can handle this.