
Incident Response & SRE

Production will go down. The question is how fast you get back up.

We build SRE culture and incident response processes — runbooks, on-call rotation, blameless post-mortems, SLO/SLA and error budgets. A systematic approach to reliability.

<30 min
MTTR
99.95%
Availability
<5 min
On-call response
100%
Postmortem coverage

Why SRE

Every system fails. The question is not whether but when, how quickly you find out, and how quickly you fix it.

Traditional approach: ops team watches dashboards, devs write code, a wall between them. When something breaks, ping-pong begins — “that’s not our bug”, “that’s infrastructure”, “it works on my machine”. MTTR (Mean Time to Recovery) is measured in hours.

SRE approach: Reliability is a feature. We measure it (SLOs), budget for it (error budgets), automate it (runbooks), learn from mistakes (post-mortems). Result: MTTR in minutes, not hours. Proactive, not reactive.

Google created SRE to run services for billions of users. But SRE principles work for a company with 5 developers just as well as for one with 5,000. The approach scales, not the headcount.

SLO/SLA — measuring reliability

Without measurement, reliability is just a feeling. “The system is working well” is not a metric. An SLO is.

Service Level Indicators (SLI)

An SLI is a metric that measures user experience:

  • Availability SLI: Fraction of successful requests. successful_requests / total_requests
  • Latency SLI: Fraction of requests below a latency threshold. requests_under_500ms / total_requests
  • Freshness SLI: Fraction of data updated within a time limit. For async systems, data pipelines.
  • Correctness SLI: Fraction of correct outputs. For computation services, ML inference.
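The ratio SLIs above can be computed in a few lines. A minimal sketch in Python, assuming a simple in-memory request log (the `Request` shape and thresholds are illustrative, not a real API):

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # observed latency in milliseconds

def availability_sli(requests: list[Request]) -> float:
    """Fraction of successful requests (here: anything that is not a 5xx)."""
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)

def latency_sli(requests: list[Request], threshold_ms: float = 500) -> float:
    """Fraction of requests under the latency threshold."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)
```

In production the same ratios would typically come from a metrics backend rather than raw request objects, but the definition is identical: good events divided by total events.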

Service Level Objectives (SLO)

An SLO is a target for an SLI: “99.9% of requests will be successful” or “95% of requests will have latency below 200ms”.

How we set SLOs:

  1. Measure the baseline — what is the current real reliability?
  2. Define user expectations — what do users consider acceptable?
  3. Set the SLO — slightly above baseline, below perfection. 100% is not a realistic SLO.
  4. Iterate — tighten or relax based on data and feedback.

Example: An API has current availability of 99.95%. Users complain about outages longer than 10 minutes. We set the SLO at 99.9% (43 min downtime/month) — strict enough for users, realistic enough for the team.

Service Level Agreements (SLA)

An SLA is a contractual commitment — an SLO with consequences. Breaching an SLA = penalties, credits, contractual implications. An SLA is always looser than the internal SLO. If the internal SLO is 99.95%, the SLA is 99.9% — you have a buffer.

Error budgets — data-driven decision making

Error budgets turn the abstract trade-off “speed vs. reliability” into a concrete number.

How it works

SLO 99.9% availability = 0.1% error budget = 43.2 minutes downtime per month.
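The arithmetic above generalises to any SLO and any period. A small helper (the function name is ours, for illustration):

```python
def downtime_budget_minutes(slo: float, period_minutes: float = 30 * 24 * 60) -> float:
    """Allowed downtime for a given availability SLO over a period (default: a 30-day month)."""
    return (1 - slo) * period_minutes

# downtime_budget_minutes(0.999)  -> about 43.2 minutes per month
# downtime_budget_minutes(0.9999) -> about 4.3 minutes per month
```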

Error budget > 0: The team has room to take risks. Deploys, experiments, major refactorings — go for it. Speed is the priority.

Error budget ≈ 0: Slow down. No risky deploys. Focus on stability, bug fixes, and reliability improvements until the budget replenishes.

Error budget < 0: SLO violated. Incident review. Action plan to restore reliability. Feature freeze until the situation stabilises.

Error budget policies

We define upfront what happens at different error budget levels:

  • >50% remaining: Business as usual. Fast deploys, experiments allowed.
  • 25-50% remaining: Increased caution. Canary deploys mandatory. Extra review for risky changes.
  • <25% remaining: Reliability sprint. No new features. Focus on stability.
  • Exhausted: Feature freeze. Post-mortem for every incident. Restoring the budget is the top priority.
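The policy tiers above can be encoded so that tooling (dashboards, deploy gates) consumes them directly. A sketch with illustrative labels:

```python
def error_budget_policy(remaining_fraction: float) -> str:
    """Map the remaining error budget (0.0-1.0) to a policy tier."""
    if remaining_fraction <= 0:
        return "feature freeze"          # budget exhausted
    if remaining_fraction < 0.25:
        return "reliability sprint"      # no new features
    if remaining_fraction < 0.50:
        return "canary deploys mandatory"
    return "business as usual"
```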

Incident response — when things are on fire

Every incident has a lifecycle: detection → triage → mitigation → resolution → post-mortem. We define processes for each phase upfront — not in the middle of panic.

Severity levels

SEV1 — Critical: System unavailable or data loss. Entire team mobilised. Customer communication. Resolution target: < 1 hour.

SEV2 — Major: Significant service degradation. Part of functionality unavailable. On-call + escalation. Resolution target: < 4 hours.

SEV3 — Minor: Minor degradation. Workaround exists. On-call handles in business hours. Resolution target: < 24 hours.

SEV4 — Low: Cosmetic issues, minor bugs. Backlog, resolved in the normal sprint.
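The severity targets above, encoded so alerting tooling can flag overdue incidents. The breach check is an illustrative helper, not a standard API:

```python
# Resolution targets per severity, in hours (None = handled from the backlog).
SEVERITY_TARGETS = {
    "SEV1": 1,     # critical: system unavailable or data loss
    "SEV2": 4,     # major: significant degradation
    "SEV3": 24,    # minor: workaround exists
    "SEV4": None,  # low: cosmetic, resolved in the normal sprint
}

def is_breached(severity: str, hours_open: float) -> bool:
    """True if an open incident has exceeded its resolution target."""
    target = SEVERITY_TARGETS[severity]
    return target is not None and hours_open > target
```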

On-call rotation

Principle: Someone is always responsible. 24/7 coverage, weekly rotation, clear escalation paths.

What on-call gets:

  • Alert with context (what is happening, since when, what impact)
  • Runbook (what to do step by step)
  • Escalation contact (who to call when it is too much)
  • Access to dashboards, logs, traces

What on-call doesn’t get: Vague alert “CPU high” without context. Alert fatigue — dozens of alerts, 90% of which are false positives. Undocumented systems where nobody knows what to do. We fix those beforehand.

Runbooks

A runbook is a step-by-step guide to resolving a specific incident. Linked directly from the alert — click on the alert, runbook opens.

A good runbook contains:

  • Symptoms: What do you see? How do you identify this type of incident?
  • Impact: What is affected? Who is affected?
  • Diagnostics: Which dashboards to open? Which queries to run?
  • Mitigation: How to stop the bleeding? (restart, rollback, feature flag off)
  • Resolution: How to fix the root cause?
  • Escalation: When and to whom to escalate?

Runbooks are not static. We update them after every incident. Ideally we automate them — runbook as a script, not a document.
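One way to move a runbook from document to script: encode each step as a description plus a command, and support a dry run for drills. The kubectl commands and the service name below are hypothetical placeholders:

```python
import subprocess

# Hypothetical automated runbook for connection-pool exhaustion.
# Each step is a (description, command) pair; commands are illustrative.
RUNBOOK_DB_POOL_EXHAUSTED = [
    ("Check recent API logs",      ["kubectl", "logs", "deploy/api", "--tail=50"]),
    ("Restart the API deployment", ["kubectl", "rollout", "restart", "deploy/api"]),
    ("Verify rollout finished",    ["kubectl", "rollout", "status", "deploy/api"]),
]

def run_runbook(steps, dry_run=True):
    """Execute runbook steps in order; with dry_run, only print what would run."""
    executed = []
    for description, cmd in steps:
        print(f"step: {description} -> {' '.join(cmd)}")
        if not dry_run:
            subprocess.run(cmd, check=True)
        executed.append(description)
    return executed
```

A dry run doubles as documentation and as an on-call drill; the real run is one flag away, which keeps the script honest and up to date.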

Blameless post-mortems

After every SEV1/SEV2 incident there is a post-mortem. Goal: understand what happened and prevent recurrence. Not find a culprit.

Structure

Timeline: Minute by minute, what happened. Objective facts, not interpretations.

Root cause analysis: 5 Whys or fishbone diagram. Why did this happen? Why wasn’t it caught earlier? Why did mitigation take so long?

Impact: How many users affected? For how long? Financial impact?

What went well: What worked? Which processes helped? Who responded brilliantly?

What went wrong: What failed? Which processes were missing? What slowed down resolution?

Action items: Specific, assigned, with deadlines. “Add alert for connection pool exhaustion” (owner: Jana, deadline: Friday), not “improve monitoring” (nobody, never).

Blameless culture

People make mistakes. Systems should be designed so that a single human error doesn’t cause a catastrophe. A post-mortem looks for systemic causes:

  • Why did the system allow a deploy without canary?
  • Why did a rollback mechanism not exist?
  • Why didn’t the alert come sooner?
  • Why did the runbook not exist?

When people know they won’t be punished, they report problems openly. Near-misses become learning opportunities, not hidden secrets.

How we implement SRE

  1. Assessment — we map current processes, metrics, incident handling
  2. SLO workshop — we define SLIs and SLOs with product and engineering teams
  3. Error budgets — we set up tracking, policies, reporting
  4. On-call setup — rotation, escalation, runbooks, tooling (PagerDuty/OpsGenie)
  5. Post-mortem process — template, facilitation, action item tracking
  6. Iteration — quarterly SLO review, runbook updates, process improvement

Stack

On-call: PagerDuty, OpsGenie, Grafana OnCall.

Incident management: Incident.io, Rootly, Jira.

Status pages: Statuspage, Instatus, Cachet.

Monitoring: Prometheus + Grafana, Datadog.

Communication: Slack incident channels, bridge calls.

Runbooks: Notion, Confluence, Backstage, or directly in Git.

Frequently asked questions

What is SRE?

Site Reliability Engineering is a discipline that applies software engineering to operational problems. Instead of reactively "fighting fires", it introduces proactive processes: SLOs for measuring reliability, error budgets for managing risk, automation for eliminating toil. The result: more reliable systems with less effort.

How does an error budget work?

If you have an SLO of 99.9% availability, you have a 0.1% error budget, that is 43.2 minutes of downtime per month. While you have error budget remaining, you can deploy fast and take risks. When it is exhausted, you slow down and focus on stability. The error budget turns the reliability vs. velocity trade-off into a data-driven decision.

Do we need a dedicated SRE team?

Not necessarily. For smaller organisations we introduce SRE principles into existing dev teams: shared on-call, runbooks, post-mortems, SLOs. A dedicated SRE team makes sense from ~50 developers or for critical systems (fintech, healthcare, e-commerce).

What does a post-mortem look like?

After every significant incident there is a structured review: timeline, root cause, impact, what worked, what didn't, action items. The key: we look for systemic causes, not culprits. "Why did the system allow this to happen?" instead of "who did this?". The result: we fix processes and systems, not people.

How do you set SLO targets?

Based on user expectations and business requirements. We measure actual user experience (latency, error rate, availability) and iterate: start conservatively, tighten based on data. An SLO is not an aspiration; it is a contract with users.

Have a project?

Let's talk about it.

Schedule a meeting