SRE in Practice — How We Started Measuring Reliability

We read the Google SRE book and said: we want this. Not all at once, since we’re not Google, but the principles behind SLOs, error budgets, and blameless postmortems apply to a team of our size too.

SLI, SLO, SLA

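An SLI (service level indicator) is a measurable reliability metric, such as the share of requests that succeed. An SLO (service level objective) is the target for that SLI: 99.9% availability allows at most about 43 minutes of downtime in a 30-day month. An SLA (service level agreement) is the contractual commitment to customers, and it should be set looser than the internal SLO so that we notice problems before we breach the contract.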
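To make the 43-minute figure concrete, here is a minimal sketch of the arithmetic. The function names `allowed_downtime` and `availability_sli` are illustrative and not taken from any particular monitoring tool, and the 30-day window is an assumption.

```python
from datetime import timedelta


def allowed_downtime(slo_target: float,
                     window: timedelta = timedelta(days=30)) -> timedelta:
    """Maximum downtime that still meets an availability SLO over the window.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    The 30-day default window is an assumption for illustration.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    return window * (1.0 - slo_target)


def availability_sli(good_requests: int, total_requests: int) -> float:
    """A simple request-based SLI: the share of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


if __name__ == "__main__":
    downtime = allowed_downtime(0.999)
    print(f"99.9% over 30 days allows {downtime.total_seconds() / 60:.1f} minutes")
    print(f"SLI: {availability_sli(999_500, 1_000_000):.4%}")
```

The same function shows why each additional nine is expensive: 99.99% over the same window leaves only about 4.3 minutes.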

Error budgets: a license to take risks

The error budget is the complement of the SLO: a 99.9% SLO leaves a 0.1% budget for failure. While budget remains, you can take risks: deploy, experiment. When it is exhausted, you stop feature deployments and fix reliability. That gives us an objective metric instead of “we don’t want to deploy”.
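The budget check can be sketched in a few lines. This is an illustration under stated assumptions, not our production tooling: the `ErrorBudget` class, its field names, and the rule "deploy only while budget remains" are hypothetical simplifications of the policy described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (hypothetical helper)."""
    slo_target: float   # e.g. 0.999
    total_events: int   # requests served in the window
    bad_events: int     # requests that violated the SLI

    @property
    def allowed_bad_events(self) -> float:
        # The budget is the complement of the SLO, applied to the traffic seen.
        return (1.0 - self.slo_target) * self.total_events

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative once it is overspent."""
        allowed = self.allowed_bad_events
        if allowed == 0:
            return 0.0 if self.bad_events else 1.0
        return 1.0 - self.bad_events / allowed

    def deploys_allowed(self) -> bool:
        # Assumed policy: risky changes are fine while any budget remains.
        return self.remaining_fraction > 0.0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_events=1_000_000, bad_events=400)
    print(f"remaining budget: {budget.remaining_fraction:.0%}")
    print(f"deploys allowed: {budget.deploys_allowed()}")
```

A real policy would usually add thresholds (for example, slowing down before the budget reaches zero), but a single boolean is enough to show how the decision becomes a number rather than an opinion.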

Blameless postmortems

Every incident that affects an SLO gets a postmortem. We don’t look for someone to blame; we look for systemic causes. Each postmortem covers the timeline, impact, root cause, what went well and what went wrong, and action items. We share postmortems across the company.
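The sections listed above can be captured as a small record, so a draft can be checked for gaps before it is shared. The `Postmortem` class and its `missing_sections` helper are hypothetical names for this sketch; the fields mirror the sections named in the text, plus a title.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Blameless postmortem record (hypothetical structure for illustration)."""
    title: str
    impact: str
    root_cause: str
    timeline: list[str] = field(default_factory=list)
    went_well: list[str] = field(default_factory=list)
    went_wrong: list[str] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]


if __name__ == "__main__":
    draft = Postmortem(
        title="Checkout latency spike",
        impact="Elevated latency for part of the checkout traffic",
        root_cause="",
        timeline=["14:02 alert fired", "14:10 rollback started"],
    )
    print("still to fill in:", draft.missing_sections())
```

Note that the record has no field for who caused the incident: the structure itself keeps the discussion on systems rather than people.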

On-call rotation

We run a formal on-call rotation: one engineer per week, PagerDuty for alerting, and runbooks for known issues. On-call time is compensated, because burnout is not SRE.

SRE is cultural change, not just tooling

SRE is about how we think about reliability, how we balance speed and stability, and how we learn from mistakes. Even a team of ten people can handle this.



