SRE — Game Days¶
Simulated incidents that test team readiness: how to plan them, which scenarios to run, and what to take away from them.
What is a Game Day¶
A Game Day is a controlled simulation of an incident. It tests systems, but above all it tests the people and processes that respond when those systems fail.
- Tests incident response procedures
- Reveals gaps in runbooks
- Builds muscle memory for real incidents
- Identifies single points of failure
Planning¶
- Scope — what are we testing? (DB failover, AZ loss, DDoS)
- Blast radius — what impact do we expect?
- Abort criteria — when to stop immediately
- Stakeholders — who knows, who doesn’t
- Timeline — precise plan of injections
- Rollback plan — how to restore everything to normal
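The planning checklist above can be captured as a small data structure, so that an incomplete plan or a breached abort criterion is caught mechanically rather than by memory. The following is a minimal sketch, not a standard tool: the class names `GameDayPlan` and `AbortCriterion`, the metric name `error_rate`, and the threshold values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AbortCriterion:
    """A metric threshold that, once exceeded, should stop the exercise."""
    metric: str        # hypothetical metric name, e.g. "error_rate"
    max_value: float   # abort when the observed value exceeds this


@dataclass
class GameDayPlan:
    """One field per item in the planning checklist."""
    scope: str
    blast_radius: str
    abort_criteria: list[AbortCriterion]
    informed: list[str]               # stakeholders who know about the exercise
    timeline: list[tuple[int, str]]   # (minutes from start, injection)
    rollback_plan: str

    def missing_fields(self) -> list[str]:
        """Return the names of planning items that are still empty."""
        return [name for name, value in vars(self).items() if not value]

    def should_abort(self, observed: dict[str, float]) -> list[str]:
        """Return the metrics whose observed value exceeds its threshold."""
        return [
            c.metric
            for c in self.abort_criteria
            if observed.get(c.metric, 0.0) > c.max_value
        ]


# Example plan (all values are made up); the rollback plan is left empty
# on purpose to show the completeness check.
plan = GameDayPlan(
    scope="DB failover",
    blast_radius="checkout latency in staging",
    abort_criteria=[AbortCriterion("error_rate", 0.05)],
    informed=["incident commander"],
    timeline=[(0, "kill primary DB"), (15, "restore primary")],
    rollback_plan="",
)

print(plan.missing_fields())                      # ['rollback_plan']
print(plan.should_abort({"error_rate": 0.08}))    # ['error_rate']
```

Treating the abort criteria as data rather than prose makes it possible to check them against live metrics during the exercise, instead of relying on someone noticing a dashboard.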
Scenarios¶
- Infrastructure: AZ outage, node failure, disk full, network partition
- Application: memory leak, CPU spike, dependency timeout
- Data: corrupted cache, stale data, replication lag
- Security: compromised credentials, DDoS
- Process: on-call unreachable, runbook outdated
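As an illustration of the application-level scenarios, a dependency timeout can be simulated by wrapping the call to the dependency and failing a configurable fraction of requests. This is a minimal sketch under stated assumptions: `with_fault`, `InjectedTimeout`, and `fetch_profile` are hypothetical names, and a real exercise would more likely inject the fault at the network or proxy layer than in application code.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedTimeout(TimeoutError):
    """Raised in place of a real dependency timeout during the exercise."""


def with_fault(
    call: Callable[[], T],
    failure_rate: float,
    delay_s: float = 0.0,
    rng: Optional[random.Random] = None,
) -> T:
    """Run `call`, but fail a fraction of invocations with a simulated timeout.

    failure_rate: probability in [0, 1] that the call is replaced by a timeout.
    delay_s: how long to block before raising, to mimic a slow dependency.
    rng: optional random generator, so a run can be made reproducible.
    """
    rng = rng or random.Random()
    if rng.random() < failure_rate:
        time.sleep(delay_s)
        raise InjectedTimeout("injected dependency timeout")
    return call()


def fetch_profile() -> dict:
    """Stand-in for a real call to a downstream service (hypothetical)."""
    return {"user": "alice"}


# With failure_rate=0.0 the dependency behaves normally.
print(with_fault(fetch_profile, failure_rate=0.0))

# With failure_rate=1.0 every call times out, so the caller's fallback
# path (here just a printed message) is exercised.
try:
    with_fault(fetch_profile, failure_rate=1.0)
except InjectedTimeout as exc:
    print(f"fallback used: {exc}")
```

Keeping the failure rate as a parameter lets the exercise start with a small blast radius and widen it step by step, in line with the abort criteria agreed during planning.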
Summary¶
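Game Days build confidence in systems and processes. Run regularly, they tend to shorten incident response times, because responders have already practiced the procedures and the runbook gaps found in earlier exercises have been fixed.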