Incident Response
When PagerDuty calls, you have a runbook.
SIEM for detection, runbooks for response, on-call processes for escalation, post-mortems for learning. Incidents happen — what matters is what you do next.
Why you need Incident Response¶
The question is not IF an incident will happen, but WHEN. Organizations without an IR process:
- Detect late — average dwell time (attacker in the network undetected) is 204 days
- Respond chaotically — who does what? Who decides? Who communicates?
- Repeat mistakes — the same incident again three months later because the root cause was never fixed
- Escalate incorrectly — either too late or to the wrong people
SIEM & Detection¶
Data collection¶
We centralize security events from across the infrastructure:
- Infrastructure — Firewalls, VPN, load balancers, DNS servers
- Identity — Azure AD/Okta login events, MFA failures, privileged access
- Application — WAF logs, API gateway, application security events
- Endpoint — EDR (CrowdStrike, Defender), antivirus, device compliance
- Cloud — Azure Activity Log, AWS CloudTrail, GCP Audit Log
Correlation rules¶
Raw data without correlation is noise. We build detection rules for:
- Brute force — N failed logins from one IP in M minutes
- Lateral movement — Unusual service-to-service communication
- Privilege escalation — User gains admin role, unusual sudo usage
- Data exfiltration — Large data transfer to unknown destination
- Credential abuse — Login from impossible location (GeoIP), credential stuffing patterns
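The brute-force rule above can be sketched as a sliding-window counter. This is a minimal illustration, not a production detector; the 10-failures-in-5-minutes thresholds and names are assumptions for the example:

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Illustrative thresholds: N = 10 failed logins from one IP within M = 5 minutes.
THRESHOLD = 10
WINDOW = timedelta(minutes=5)

class BruteForceDetector:
    def __init__(self):
        self._failures = defaultdict(deque)  # ip -> timestamps of failed logins

    def record_failure(self, ip: str, ts: datetime) -> bool:
        """Record a failed login; return True when the correlation rule fires."""
        window = self._failures[ip]
        window.append(ts)
        # Drop events that fell out of the correlation window.
        while window and ts - window[0] > WINDOW:
            window.popleft()
        return len(window) >= THRESHOLD

detector = BruteForceDetector()
t0 = datetime(2024, 1, 1, 12, 0)
# 12 failures, 10 seconds apart: the rule fires on the 10th failure.
alerts = [detector.record_failure("203.0.113.7", t0 + timedelta(seconds=10 * i))
          for i in range(12)]
```

In a real SIEM the same logic is expressed in the platform's rule language (e.g. a threshold rule in Elastic or an analytics rule in Sentinel) rather than application code.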
Anomaly Detection¶
ML models for detecting unknown threats:
- Baseline of normal behavior per user, per service
- Deviations in access patterns, data volumes, API usage
- Alerting with context — not “anomaly detected”, but “user X accessed 500 records in DB, average is 20”
Runbooks¶
Runbook structure¶
Every runbook follows a uniform structure:
- Detection — How does the incident manifest? Which alert triggers it?
- Triage — Is it a real incident or a false positive? What is the severity?
- Containment — Stop the spread. Isolate the affected system.
- Eradication — Remove the root cause. Patch, config change, revocation.
- Recovery — Restore normal operations. Verification.
- Post-incident — Timeline, lessons learned, action items.
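Because every runbook follows the same six phases, the structure can be lint-checked automatically, for example in CI on the runbook repository. A hypothetical sketch assuming runbooks are markdown files with one `##` heading per phase:

```python
# The six phases every runbook must contain, in order.
REQUIRED_SECTIONS = ["Detection", "Triage", "Containment",
                     "Eradication", "Recovery", "Post-incident"]

def missing_sections(runbook_markdown: str) -> list[str]:
    """Return the required runbook sections absent from a markdown runbook."""
    return [s for s in REQUIRED_SECTIONS
            if f"## {s}" not in runbook_markdown]

draft = "## Detection\n...\n## Triage\n...\n## Containment\n..."
gaps = missing_sections(draft)  # the draft is missing its last three phases
```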
Top 10 runbooks¶
We write runbooks for the most probable and highest-impact scenarios:
- Compromised credentials — Stolen password/token, unauthorized access
- Ransomware — Encrypted files, ransom demand
- DDoS — Service unavailable, traffic spike
- Data breach — Unauthorized data access/exfiltration
- Insider threat — Malicious or negligent employee action
- Phishing — Successful phishing, compromised endpoint
- Supply chain — Compromised dependency, malicious update
- API abuse — Automated scraping, credential stuffing
- Cloud misconfiguration — Exposed storage, public database
- Certificate expiry — TLS certificate expired, service disruption
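The last scenario is also the easiest to prevent: a scheduled check that pages before a certificate expires. A sketch using only the standard library; wiring the result into your alerting pipeline is left out:

```python
import socket
import ssl
from datetime import datetime, timezone

def cert_days_left(not_after: str, now: datetime) -> int:
    """Days until a certificate's notAfter timestamp (OpenSSL text form,
    e.g. 'Jun 01 12:00:00 2030 GMT') expires."""
    expiry = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return (expiry.replace(tzinfo=timezone.utc) - now).days

def check_host(host: str, port: int = 443) -> int:
    """Fetch the certificate a host presents and return the remaining days."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return cert_days_left(not_after, datetime.now(timezone.utc))
```

Run it daily against every public endpoint and alert when the remaining days drop below a chosen threshold (30 days is a common choice), so renewal is a ticket instead of an incident.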
On-Call Processes¶
Rotation and escalation¶
- Primary on-call — Responds to alerts. Weekly rotation.
- Secondary on-call — Backup if primary does not respond within 5 minutes.
- Incident Commander — For SEV1/SEV2. Coordinates response, communicates with stakeholders.
- Escalation matrix — Clearly defined: who, when, how. No “I’ll call whoever I find”.
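An escalation matrix is, in effect, a lookup from "how long has the alert gone unacknowledged" to "who gets paged next". A minimal sketch; the role names and the 15-minute tier are illustrative assumptions, while the 5-minute primary-to-secondary step mirrors the policy above:

```python
from datetime import timedelta

# Illustrative escalation policy: primary first, secondary after 5 minutes
# without acknowledgement, then the next contact in the matrix.
POLICY = [
    (timedelta(minutes=0), "primary-oncall"),
    (timedelta(minutes=5), "secondary-oncall"),
    (timedelta(minutes=15), "engineering-manager"),
]

def who_to_page(minutes_unacked: float) -> str:
    """Return the role to page, given how long the alert is unacknowledged."""
    target = POLICY[0][1]
    for delay, role in POLICY:
        if timedelta(minutes=minutes_unacked) >= delay:
            target = role
    return target
```

Tools like PagerDuty and OpsGenie express exactly this as an "escalation policy" object; encoding it explicitly is what eliminates the "I'll call whoever I find" failure mode.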
Severity Framework¶
| Severity | Description | Response Time | Communication |
|---|---|---|---|
| SEV1 | Business down, customers affected | 15 min | War room, 15-min updates, exec notification |
| SEV2 | Degraded performance, partial outage | 30 min | Slack channel, hourly updates |
| SEV3 | Minor issue, workaround exists | 4h | Ticket, next business day |
| SEV4 | Cosmetic, no impact | Backlog | Sprint planning |
Compensation¶
On-call is not free. We recommend:
- An allowance for on-call availability (even without an incident)
- Extra compensation for night/weekend interventions
- A day off after a night escalation
- Rotation so that the load is distributed evenly
Post-Mortem¶
Blameless culture¶
A post-mortem looks for systemic causes, not culprits. “John made a mistake” is not a root cause — “the system allowed John to make a mistake without safeguards” is.
Format¶
- Timeline — What happened, chronologically, with timestamps
- Impact — Who was affected, for how long, financial impact
- Root cause — Why it happened (5 Whys)
- Contributing factors — What made the situation worse
- What went well — What worked (detection, communication, recovery)
- Action items — Specific tasks with owners and deadlines
- Lessons learned — What we take away for next time
Post-mortem database¶
All post-mortems in one place (Confluence, Notion, Git). Searchable, tagged. A new team member who reads the last 10 post-mortems knows more about the system than from any documentation.
Table-Top Exercises¶
Incident simulations without real impact:
- Quarterly exercises for the IR team
- Scenario: “It’s Friday 5 PM, a customer has reported a data breach. What do you do?”
- Practice communication, escalation, decision-making
- Identify gaps in runbooks and processes
Technology¶
Elastic SIEM, Microsoft Sentinel, Splunk, Grafana Loki, PagerDuty, OpsGenie, Slack (incident channels), Jira (post-mortem tracking), Confluence (runbook repository), CrowdStrike, Microsoft Defender.
Frequently Asked Questions¶
How do we start with incident response?¶
Start with three things: (1) define severity levels, (2) write a runbook for your most common incident, (3) set up an on-call rotation. Build out the rest iteratively.
Do we need a SIEM?¶
It depends on your size. For smaller organizations, cloud-native logging (CloudWatch, Azure Monitor) with alerting is sufficient. For larger organizations we recommend a SIEM (Elastic SIEM, Sentinel, Splunk).