
SRE Maturity — From Firefighting to Proactive Engineering

15. 11. 2021 | 1 min read | CORE SYSTEMS | development

Our ops team was spending 70% of its time firefighting. SRE gave us a framework for escaping that cycle. The key concept: 100% reliability is the wrong goal, because it leaves no room to ship changes.

SLO, SLI, and Error Budget

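We set formal SLOs for critical services, each backed by an SLI (the measured indicator, such as the share of successful requests). Every SLO implies an error budget: 99.9% availability allows about 43 minutes of downtime in a 30-day month. As long as we have budget, we deploy fast. When it is exhausted, we stop feature work and focus on stability.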
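The error budget arithmetic can be sketched in a few lines. This is a minimal illustration, not our production tooling: the function names, the 30-day window, and the simple "any budget left means we can ship" rule are assumptions made for the example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    `slo` is a fraction (0.999 for 99.9%). The 30-day default is an
    assumption for this sketch.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


def can_ship_features(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """Hypothetical policy gate: feature deploys continue while budget remains."""
    return budget_remaining(slo, downtime_minutes, window_days) > 0.0


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))  # budget for 99.9% over 30 days
    print(can_ship_features(0.999, downtime_minutes=10.0))
    print(can_ship_features(0.999, downtime_minutes=50.0))
```

In practice the downtime figure would come from the SLI measurements rather than a hand-entered number, and the gate would usually be a conversation between teams rather than a hard boolean.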

Blameless Postmortems

After every significant incident we write up what happened, the timeline, the root cause, and the action items. No blame. The goal is systemic improvement, not finding a scapegoat. Postmortems are public in Confluence.

Toil Reduction

Toil is manual, repetitive, automatable work, and we measure it. The target is at most 50% of time spent on toil; anything above that gets automated. After 6 months, toil was down from 70% to 35% and incidents were down by 40%.
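Measuring toil against the 50% cap comes down to a simple ratio. The sketch below assumes hours are already tracked per person or per team; the function names and the 40-hour week in the usage lines are illustrative only.

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on toil, as a fraction between 0 and 1."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= toil_hours <= total_hours:
        raise ValueError("toil_hours must be between 0 and total_hours")
    return toil_hours / total_hours


def needs_automation(toil_hours: float, total_hours: float, cap: float = 0.5) -> bool:
    """True when toil exceeds the cap (50% by default, per the target above)."""
    return toil_fraction(toil_hours, total_hours) > cap


if __name__ == "__main__":
    # Hypothetical 40-hour week: 28 hours of toil is 70%, 14 hours is 35%.
    print(needs_automation(28, 40))
    print(needs_automation(14, 40))
```

The hard part is not the arithmetic but consistently tagging work as toil, for example with a label in the ticketing system.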

SRE = Reliability as an Engineering Discipline

SRE is the transition from reactive firefighting to proactive engineering. Error budgets, postmortems, and automation change the culture.

Tags: sre, reliability, slo, postmortem, devops
