Chaos Engineering — Testing Resilience in Production

Netflix runs a tool called Chaos Monkey that randomly kills production servers. Sounds insane? It is not. If your system cannot survive the failure of a single server, you would rather discover that in a controlled way on a Tuesday afternoon than in an uncontrolled way on a Saturday night.

The Principle of Chaos Engineering

Define the steady state (measurable signs that the system is operating normally). Formulate a hypothesis (the steady state will hold when service X fails). Inject the failure (kill a container, add latency, disconnect the database). Observe whether the steady state holds. If the hypothesis holds, you have gained confidence in the system; if it does not, fix the weakness and repeat the experiment.
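The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_experiment`, `ExperimentResult`, and the toy `replicas` dictionary are hypothetical names invented for this example, and a real steady-state check would query your monitoring rather than a dictionary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    hypothesis: str
    held: bool        # steady state survived the injected failure
    recovered: bool   # steady state was back after rollback


def run_experiment(
    hypothesis: str,
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    # Never inject failure into a system that is already unhealthy.
    if not steady_state():
        raise RuntimeError("not in steady state; refusing to inject failure")
    inject()
    try:
        held = steady_state()   # observe while the failure is active
    finally:
        rollback()              # always undo the failure, even on errors
    return ExperimentResult(hypothesis, held, recovered=steady_state())


# Toy system (hypothetical): two replicas, "steady" means at least one is up.
replicas = {"a": True, "b": True}

result = run_experiment(
    hypothesis="the service survives the loss of replica a",
    steady_state=lambda: any(replicas.values()),
    inject=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
)
print(result)
```

The steady-state check runs three times on purpose: before injection (to avoid experimenting on a sick system), during the failure (the actual test of the hypothesis), and after rollback (to confirm the system recovers).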

Types of Failure

Instance failure: Killing a container/process
Network latency: Adding 500ms delay to a network interface
Network partition: Service A cannot see service B
Disk full: Filling the disk
DNS failure: DNS resolution stops working
Clock skew: Shifting the system clock
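Some of these failures can also be simulated inside the application, without touching the host. The sketch below covers two entries from the list, network latency and network partition, by wrapping a call to another service. Everything here is an assumption made for illustration: `Fault`, `with_fault`, and `service_b` are hypothetical names, and the partition is modeled simply as a `ConnectionError` seen by the caller.

```python
import time
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class Fault(Enum):
    NONE = "none"
    LATENCY = "network latency"
    PARTITION = "network partition"


def with_fault(
    call: Callable[[], T],
    fault: Fault,
    delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call` as if the given fault were active on the way to it."""
    if fault is Fault.PARTITION:
        # The callee is unreachable: the call never gets through.
        raise ConnectionError("simulated partition: service B is unreachable")
    if fault is Fault.LATENCY:
        sleep(delay_s)  # default of 0.5 s mirrors the 500ms entry in the list
    return call()


def service_b() -> str:
    # Hypothetical downstream service.
    return "ok"


print(with_fault(service_b, Fault.NONE))
print(with_fault(service_b, Fault.LATENCY, delay_s=0.01))
try:
    with_fault(service_b, Fault.PARTITION)
except ConnectionError as exc:
    print("caller saw:", exc)
```

The `sleep` parameter is injectable so that tests can record the requested delay instead of actually waiting.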

Tools

Chaos Monkey: Netflix's tool that terminates EC2 instances
Pumba: Chaos testing for Docker containers
tc (traffic control): Linux tool for simulating network problems

For starters, tc and kill -9 are enough.
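As a rough sketch of what tc and kill -9 look like when driven from a script, the helpers below build the commands and only print them unless `dry_run` is switched off. The helper names and the interface name `eth0` are assumptions for this example; running the tc commands for real requires root on a Linux host, and `SIGKILL` is what `kill -9` sends.

```python
import os
import signal
import subprocess
from typing import List


def netem_delay_cmd(interface: str, delay_ms: int) -> List[str]:
    # tc's netem qdisc delays every packet leaving the interface.
    return ["tc", "qdisc", "add", "dev", interface,
            "root", "netem", "delay", f"{delay_ms}ms"]


def netem_clear_cmd(interface: str) -> List[str]:
    # Removing the root qdisc undoes the injected delay.
    return ["tc", "qdisc", "del", "dev", interface, "root"]


def run(cmd: List[str], dry_run: bool = True) -> None:
    print("+", " ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)  # needs root privileges


def kill_process(pid: int, dry_run: bool = True) -> None:
    print(f"+ kill -9 {pid}")
    if not dry_run:
        os.kill(pid, signal.SIGKILL)  # the process gets no chance to clean up


# Dry run: show what would be executed, change nothing.
run(netem_delay_cmd("eth0", 500))
run(netem_clear_cmd("eth0"))
kill_process(12345)  # hypothetical PID
```

Keeping the rollback command (`netem_clear_cmd`) next to the injection command makes it harder to leave a host with latency still applied after the experiment ends.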

Start Small

You do not have to start by killing production servers. Begin in the staging environment. Kill one container and watch what happens. Add latency to the database connection. Disconnect Redis. Every experiment will reveal a weakness.
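To make the "disconnect Redis" experiment concrete, here is a small self-contained sketch of the kind of weakness it tends to reveal. `FakeCache` is a hypothetical stand-in for a Redis client (no real Redis library is used), and `DATABASE`, `get_user_naive`, and `get_user_resilient` are invented for illustration.

```python
from typing import Dict, Optional


class FakeCache:
    """Stand-in for a Redis client; `connected` is the chaos switch."""

    def __init__(self) -> None:
        self.connected = True
        self.data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        if not self.connected:
            raise ConnectionError("cache unreachable")
        return self.data.get(key)

    def set(self, key: str, value: str) -> None:
        if not self.connected:
            raise ConnectionError("cache unreachable")
        self.data[key] = value


DATABASE = {"user:1": "Ada"}  # hypothetical source of truth


def get_user_naive(cache: FakeCache, key: str) -> str:
    cached = cache.get(key)  # a cache outage becomes a request failure
    if cached is not None:
        return cached
    value = DATABASE[key]
    cache.set(key, value)
    return value


def get_user_resilient(cache: FakeCache, key: str) -> str:
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except ConnectionError:
        return DATABASE[key]  # degrade: skip the cache entirely
    value = DATABASE[key]
    try:
        cache.set(key, value)
    except ConnectionError:
        pass  # failing to populate the cache must not fail the request
    return value


cache = FakeCache()
cache.connected = False  # the experiment: disconnect the cache

print(get_user_resilient(cache, "user:1"))
try:
    get_user_naive(cache, "user:1")
except ConnectionError as exc:
    print("naive version failed:", exc)
```

The naive version treats the cache as a hard dependency, so an optional component takes the whole request down. The resilient version falls back to the database, trading higher load on the database for availability.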

Embrace Failure

In a distributed system, failure is not an exception; it is a normal state. Chaos Engineering accepts this and tests resilience systematically. Start with one experiment per week.
