
Chaos Engineering — Testing Resilience in Production

28. 06. 2016 · 1 min read · CORE SYSTEMS

Netflix runs Chaos Monkey, a tool that randomly kills production servers. Sounds insane? It is not. If your system cannot survive the failure of a single server, you would rather discover that in a controlled experiment on a Tuesday afternoon than in an unplanned outage on a Saturday night.

The Principle of Chaos Engineering

  • Steady state: Define what normal operation looks like
  • Hypothesis: State what you expect (the system will survive the failure of service X)
  • Inject failure: Kill a container, add latency, disconnect the database
  • Observe: Either the hypothesis holds (great), or it does not (fix it and repeat)
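
The loop above can be sketched in a few lines of Python. This is only an illustration, not a framework: the `Experiment` class, `run_experiment`, and the toy two-replica "system" are hypothetical names made up for this example. The idea is that the steady-state check runs before and during the failure, and the failure is always rolled back.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One chaos experiment: a hypothesis plus the hooks needed to test it."""
    hypothesis: str
    steady_state: Callable[[], bool]   # True while the system behaves normally
    inject: Callable[[], None]         # introduces the failure
    rollback: Callable[[], None]       # removes the failure again


def run_experiment(exp: Experiment) -> bool:
    """Return True if the hypothesis held, False if a weakness was found."""
    if not exp.steady_state():
        raise RuntimeError("system is not in steady state; do not inject failure")
    exp.inject()
    try:
        return exp.steady_state()      # observe while the failure is active
    finally:
        exp.rollback()                 # always clean up, even on errors


# Toy system (assumption): two replicas, healthy while at least one is up.
replicas = {"a": True, "b": True}

exp = Experiment(
    hypothesis="the service survives the loss of replica a",
    steady_state=lambda: any(replicas.values()),
    inject=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
)
held = run_experiment(exp)
print(f"{exp.hypothesis}: {'holds' if held else 'fix it and repeat'}")
```

In a real experiment the steady-state check would typically be a metric such as error rate or latency rather than a boolean flag.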

Types of Failure

  • Instance failure: Killing a container/process
  • Network latency: Adding 500 ms of delay on a network interface
  • Network partition: Service A cannot see service B
  • Disk full: Filling the disk
  • DNS failure: DNS resolution stops working
  • Clock skew: Shifting the system clock
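
As an example of the network latency case, the sketch below builds the Linux `tc` commands that add and remove a 500 ms delay using the netem queueing discipline. The interface name `eth0` and the helper names are assumptions for illustration. By default the script only prints the commands (dry run), because actually running them requires root and affects all traffic on that interface.

```python
import subprocess


def netem_delay_cmd(interface: str, delay_ms: int) -> list[str]:
    """Build the tc command that adds a fixed delay to outgoing packets."""
    return ["tc", "qdisc", "add", "dev", interface, "root",
            "netem", "delay", f"{delay_ms}ms"]


def netem_clear_cmd(interface: str) -> list[str]:
    """Build the tc command that removes the root qdisc again."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]


def run(cmd: list[str], dry_run: bool = True) -> str:
    """Print the command; execute it only when dry_run is False (needs root)."""
    line = " ".join(cmd)
    print(line)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return line


# "eth0" is an assumed interface name; check yours with `ip link`.
add_line = run(netem_delay_cmd("eth0", 500))
clear_line = run(netem_clear_cmd("eth0"))
```

Keeping the cleanup command next to the injection command makes it harder to forget the rollback after the experiment.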

Tools

  • Chaos Monkey: Netflix's tool that terminates EC2 instances
  • Pumba: Chaos testing for Docker containers
  • tc (traffic control): Linux tool for simulating network problems

To get started, tc and kill -9 are enough.
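
To show what kill -9 does to a process, here is a minimal sketch that starts a throwaway child process and sends it SIGKILL, the signal behind `kill -9`. The sleeping child is only a stand-in for a real service instance, and the example assumes a POSIX system (Linux or macOS).

```python
import os
import signal
import subprocess
import sys

# Stand-in for a service instance: a child Python process that just sleeps.
victim = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Equivalent of `kill -9 <pid>`: SIGKILL cannot be caught or handled.
os.kill(victim.pid, signal.SIGKILL)
exit_code = victim.wait(timeout=10)

# On POSIX, a negative return code means "terminated by that signal".
killed_by_sigkill = exit_code == -signal.SIGKILL
print(f"pid {victim.pid} exit code {exit_code}, killed by SIGKILL: {killed_by_sigkill}")
```

Because the process gets no chance to shut down cleanly, this is a reasonable approximation of a crashed instance. The interesting part is what the rest of the system does next.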

Start Small

You do not have to start by killing production servers. Begin in the staging environment. Kill one container and watch what happens. Add latency to the database connection. Disconnect Redis. Every experiment will reveal a weakness.
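
The "disconnect Redis" experiment can even be rehearsed in a unit test before touching staging. The sketch below uses an in-memory stand-in rather than a real Redis client. `FlakyCache`, `load_from_database`, and `get_profile` are hypothetical names, and the fallback to the database is one possible design, not the only one.

```python
from typing import Optional


class FlakyCache:
    """In-memory stand-in for a cache such as Redis; can be 'disconnected'."""

    def __init__(self) -> None:
        self.data = {}
        self.connected = True

    def get(self, key: str) -> Optional[str]:
        if not self.connected:
            raise ConnectionError("cache unreachable")
        return self.data.get(key)


def load_from_database(user_id: str) -> str:
    """Hypothetical slow path; a real system would query its database here."""
    return f"profile-of-{user_id}"


def get_profile(cache: FlakyCache, user_id: str) -> str:
    """Read through the cache, falling back to the database if it is down."""
    try:
        cached = cache.get(user_id)
    except ConnectionError:
        cached = None              # degrade gracefully instead of failing
    return cached if cached is not None else load_from_database(user_id)


cache = FlakyCache()
cache.data["42"] = "cached-profile-42"
before = get_profile(cache, "42")

cache.connected = False            # the experiment: "disconnect Redis"
during = get_profile(cache, "42")

print(before, during)
```

If `get_profile` did not catch the connection error, the experiment would surface exactly the kind of weakness this section is about: a cache outage turning into a full outage.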

Embrace Failure

In a distributed system, failure is not an exception — it is a normal state. Chaos Engineering accepts this and systematically tests resilience. Start with one experiment per week.
