
Incident Response

When PagerDuty calls, you have a runbook.

SIEM for detection, runbooks for response, on-call processes for escalation, post-mortems for learning. Incidents happen — what matters is what you do next.

  • MTTD: < 1 h
  • MTTR: < 1 h
  • Runbook coverage: top 20 incidents
  • Post-mortem SLA: 48 h

Why you need Incident Response

The question is not IF an incident will happen, but WHEN. Organizations without an IR process:

  • Detect late — average dwell time (attacker in the network undetected) is 204 days
  • Respond chaotically — who does what? Who decides? Who communicates?
  • Repeat mistakes — the same incident again three months later because the root cause was never fixed
  • Escalate incorrectly — either too late or to the wrong people

SIEM & Detection

Data collection

We centralize security events from across the infrastructure:

  • Infrastructure — Firewalls, VPN, load balancers, DNS servers
  • Identity — Azure AD/Okta login events, MFA failures, privileged access
  • Application — WAF logs, API gateway, application security events
  • Endpoint — EDR (CrowdStrike, Defender), antivirus, device compliance
  • Cloud — Azure Activity Log, AWS CloudTrail, GCP Audit Log
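
Centralized events are only useful if correlation rules can query them uniformly. A minimal sketch of normalizing two of the sources above into a common schema — the field names are an ECS-like convention and the raw payload shapes are illustrative assumptions, not the exact vendor formats:

```python
def normalize(source: str, raw: dict) -> dict:
    """Map a raw vendor event into a common schema so correlation
    rules can run across all sources. Payload shapes are assumptions."""
    if source == "azure_ad":
        return {
            "timestamp": raw["createdDateTime"],
            "category": "identity",
            "user": raw["userPrincipalName"],
            "src_ip": raw["ipAddress"],
            "action": "login",
            "outcome": "success" if raw["status"]["errorCode"] == 0 else "failure",
        }
    if source == "cloudtrail":
        return {
            "timestamp": raw["eventTime"],
            "category": "cloud",
            "user": raw.get("userIdentity", {}).get("arn", "unknown"),
            "src_ip": raw.get("sourceIPAddress", ""),
            "action": raw["eventName"],
            # CloudTrail records carry an errorCode field only on failures.
            "outcome": "failure" if "errorCode" in raw else "success",
        }
    raise ValueError(f"unknown source: {source}")
```

In practice the SIEM does this mapping at ingest time; the point is that every rule downstream sees the same `user`, `src_ip`, and `outcome` fields regardless of origin.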

Correlation rules

Raw data without correlation is noise. We build detection rules for:

  • Brute force — N failed logins from one IP in M minutes
  • Lateral movement — Unusual service-to-service communication
  • Privilege escalation — User gains admin role, unusual sudo usage
  • Data exfiltration — Large data transfer to unknown destination
  • Credential abuse — Login from impossible location (GeoIP), credential stuffing patterns
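
The brute-force rule above ("N failed logins from one IP in M minutes") can be sketched as a sliding window counter — the thresholds here are example values to tune per environment:

```python
from collections import defaultdict, deque

class BruteForceDetector:
    """Flags an IP once it exceeds `threshold` failed logins
    within `window` seconds (N failures in M minutes)."""

    def __init__(self, threshold: int = 10, window: int = 300):
        self.threshold = threshold
        self.window = window
        self.failures = defaultdict(deque)  # ip -> timestamps of recent failures

    def record_failure(self, ip: str, ts: float) -> bool:
        q = self.failures[ip]
        q.append(ts)
        # Evict failures that have fallen out of the window.
        while q and ts - q[0] > self.window:
            q.popleft()
        return len(q) >= self.threshold
```

A real SIEM expresses the same logic declaratively (e.g. an aggregation over a 5-minute bucket), but the eviction-and-count mechanics are identical.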

Anomaly Detection

ML models for detecting unknown threats:

  • Baseline of normal behavior per user, per service
  • Deviations in access patterns, data volumes, API usage
  • Alerting with context — not “anomaly detected”, but “user X accessed 500 records in DB, average is 20”

Runbooks

Runbook structure

Every runbook follows a uniform structure:

  1. Detection — How does the incident manifest? Which alert triggers it?
  2. Triage — Is it a real incident or a false positive? What is the severity?
  3. Containment — Stop the spread. Isolate the affected system.
  4. Eradication — Remove the root cause. Patch, config change, revocation.
  5. Recovery — Restore normal operations. Verification.
  6. Post-incident — Timeline, lessons learned, action items.
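
Because every runbook shares the same six phases, the structure can be kept machine-readable and checked for coverage automatically. A sketch, with assumed field names and an invented example rule ID:

```python
from dataclasses import dataclass, field

PHASES = ["detection", "triage", "containment",
          "eradication", "recovery", "post_incident"]

@dataclass
class Runbook:
    name: str
    steps: dict[str, list[str]] = field(default_factory=dict)

    def missing_phases(self) -> list[str]:
        """Phases not yet filled in -- useful as a coverage
        check across the whole runbook set."""
        return [p for p in PHASES if not self.steps.get(p)]

rb = Runbook("compromised-credentials", steps={
    "detection": ["Impossible-travel alert from SIEM rule AUTH-04"],  # rule ID is hypothetical
    "triage": ["Confirm with the user; check MFA logs"],
    "containment": ["Disable account, revoke refresh tokens"],
})
```

Running `missing_phases()` over every runbook in the repository turns "runbook coverage" from a claim into a CI check.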

Top 10 runbooks

We write runbooks for the most probable and highest-impact scenarios:

  1. Compromised credentials — Stolen password/token, unauthorized access
  2. Ransomware — Encrypted files, ransom demand
  3. DDoS — Service unavailable, traffic spike
  4. Data breach — Unauthorized data access/exfiltration
  5. Insider threat — Malicious or negligent employee action
  6. Phishing — Successful phishing, compromised endpoint
  7. Supply chain — Compromised dependency, malicious update
  8. API abuse — Automated scraping, credential stuffing
  9. Cloud misconfiguration — Exposed storage, public database
  10. Certificate expiry — TLS certificate expired, service disruption

On-Call Processes

Rotation and escalation

  • Primary on-call — Responds to alerts. Weekly rotation.
  • Secondary on-call — Backup if primary does not respond within 5 minutes.
  • Incident Commander — For SEV1/SEV2. Coordinates response, communicates with stakeholders.
  • Escalation matrix — Clearly defined: who, when, how. No “I’ll call whoever I find”.
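
The primary-to-secondary handoff above can be sketched as a simple chain walk with an acknowledgement timeout. The roster names and the `acknowledged` callback (standing in for a paging-platform check) are placeholders:

```python
import time

ESCALATION_CHAIN = ["primary-oncall", "secondary-oncall", "incident-commander"]
ACK_TIMEOUT = 300  # seconds each responder has to acknowledge

def page(responder: str, incident_id: str) -> None:
    # Placeholder: a real system calls the paging platform's API here.
    print(f"paging {responder} for {incident_id}")

def escalate(incident_id: str, acknowledged,
             clock=time.monotonic, sleep=time.sleep):
    """Walk the chain until someone acknowledges. `acknowledged` is a
    callable polled after each page; clock/sleep are injectable for tests."""
    for responder in ESCALATION_CHAIN:
        page(responder, incident_id)
        deadline = clock() + ACK_TIMEOUT
        while clock() < deadline:
            if acknowledged(responder):
                return responder
            sleep(5)
    return None  # whole chain exhausted -- a SEV in itself
```

In practice PagerDuty or OpsGenie implements exactly this loop; the value of writing it down is agreeing on the timeout and the chain before the incident.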

Severity Framework

| Severity | Description | Response Time | Communication |
|---|---|---|---|
| SEV1 | Business down, customers affected | 15 min | War room, 15-min updates, exec notification |
| SEV2 | Degraded performance, partial outage | 30 min | Slack channel, hourly updates |
| SEV3 | Minor issue, workaround exists | 4 h | Ticket, next business day |
| SEV4 | Cosmetic, no impact | Backlog | Sprint planning |
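
The response-time column translates directly into an SLA check that an alerting pipeline can evaluate — a minimal sketch of that mapping (SEV4 has no paging SLA, so it maps to `None`):

```python
# Response-time SLAs from the severity table, in minutes.
RESPONSE_SLA_MIN = {"SEV1": 15, "SEV2": 30, "SEV3": 240, "SEV4": None}

def sla_breached(severity: str, minutes_open: float) -> bool:
    """True if the incident has gone unacknowledged longer than its SLA."""
    sla = RESPONSE_SLA_MIN[severity]
    return sla is not None and minutes_open > sla
```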

Compensation

On-call is not free. We recommend:

  • An allowance for on-call availability (even without an incident)
  • Extra compensation for night/weekend interventions
  • A day off after a night escalation
  • Rotation so the load is distributed evenly

Post-Mortem

Blameless culture

A post-mortem looks for systemic causes, not culprits. “John made a mistake” is not a root cause — “the system allowed John to make a mistake without safeguards” is.

Format

  1. Timeline — What happened, chronologically, with timestamps
  2. Impact — Who was affected, for how long, financial impact
  3. Root cause — Why it happened (5 Whys)
  4. Contributing factors — What made the situation worse
  5. What went well — What worked (detection, communication, recovery)
  6. Action items — Specific tasks with owners and deadlines
  7. Lessons learned — What we take away for next time

Post-mortem database

All post-mortems in one place (Confluence, Notion, Git). Searchable, tagged. A new team member who reads the last 10 post-mortems knows more about the system than from any documentation.

Table-Top Exercises

Incident simulations without real impact:

  • Quarterly exercises for the IR team
  • Scenario: “It’s Friday 5 PM, a customer has reported a data breach. What do you do?”
  • Practice communication, escalation, decision-making
  • Identify gaps in runbooks and processes

Technology

Elastic SIEM, Microsoft Sentinel, Splunk, Grafana Loki, PagerDuty, OpsGenie, Slack (incident channels), Jira (post-mortem tracking), Confluence (runbook repository), CrowdStrike, Microsoft Defender.

Frequently Asked Questions

How do we get started with incident response?

Start with three things: (1) define severity levels, (2) write a runbook for your most common incident, (3) set up an on-call rotation. Build out the rest iteratively.

Do we need a SIEM from day one?

It depends on your size. For smaller organizations, cloud-native logging (CloudWatch, Azure Monitor) with alerting is sufficient. For larger organizations we recommend a SIEM (Elastic SIEM, Sentinel, Splunk).

Do you have a project?

Let's talk about it.

Schedule a meeting