
Chaos Engineering — Testing Resilience in Production

28. 11. 2022 · 1 min read · CORE SYSTEMS · development

“The system looks stable.” — until you start deliberately breaking things. Chaos engineering is the discipline of testing how a system responds to failure.

Why Chaos?

Production systems will fail. The question isn’t “if” but “when” and “how will we handle it.” Chaos engineering simulates failures in a controlled way — before they happen uncontrollably.

Litmus Chaos on Kubernetes

We use Litmus (a CNCF project) to run chaos experiments: pod kill, node drain, network latency injection, and disk fill. Experiments are defined as YAML manifests, versioned in Git, and triggered automatically in CI.
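To make the "experiments as YAML manifests" idea concrete, here is a minimal sketch of what a pod-kill experiment could look like as a Litmus ChaosEngine resource. The names `checkout-pod-delete`, `shop`, `app=checkout`, and `pod-delete-sa` are hypothetical placeholders, not taken from our setup, and the fields follow the `litmuschaos.io/v1alpha1` ChaosEngine schema, which may differ between Litmus versions. The sketch also assumes the `pod-delete` experiment and the service account are already installed in the target namespace.

```yaml
# Sketch only: a ChaosEngine that runs the Litmus pod-delete experiment
# against one deployment. All names below are illustrative assumptions.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-pod-delete      # hypothetical engine name
  namespace: shop                # hypothetical namespace
spec:
  engineState: "active"          # set to "stop" to abort a running experiment
  appinfo:
    appns: "shop"                # namespace of the target application
    applabel: "app=checkout"     # label selector for the target pods
    appkind: "deployment"
  chaosServiceAccount: pod-delete-sa   # hypothetical service account with experiment RBAC
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total time the chaos runs, in seconds
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            # Pause between successive pod deletions, in seconds
            - name: CHAOS_INTERVAL
              value: "10"
            # "false" requests graceful deletion instead of a forced kill
            - name: FORCE
              value: "false"
```

Because the manifest is plain YAML, it can live next to the application's other Kubernetes manifests in Git, and a CI job can apply it with `kubectl apply -f` and then inspect the resulting ChaosResult before letting the pipeline continue.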

GameDays

Once a quarter we hold a GameDay, where the entire team watches how the system responds to simulated failures. Scenarios include a database outage, a DDoS attack, corrupted data, and a cloud region outage. Findings are documented and the weak points are fixed.

Results

After 4 GameDays, we found 12 critical weaknesses that would have caused outages. MTTR decreased by 35% — the team knows how to respond because they’ve practiced it.

Break Things — On Purpose

Chaos engineering builds confidence. Better to find a weakness on a GameDay than on a Friday night in production.

chaos engineering, reliability, kubernetes, litmus, testing