Netflix runs Chaos Monkey, a tool that randomly kills production servers. Sounds insane? It is not. If your system cannot survive the failure of a single server, it is better to discover that in a controlled experiment on a Tuesday afternoon than in an uncontrolled outage on a Saturday night.
The Principle of Chaos Engineering
Define the steady state: measurable signals, such as error rate or latency, that show the system is operating normally. Formulate a hypothesis (the system will survive the failure of service X). Inject a failure (kill a container, add latency, disconnect the database). Observe whether the steady state holds. Either the hypothesis holds (great), or it does not (fix the weakness and repeat the experiment).
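The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real chaos framework: `run_experiment`, `ExperimentResult`, and the toy `replicas` dictionary are hypothetical names invented for this example, and the "system" is simulated in memory.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    hypothesis: str
    steady_before: bool
    steady_during: bool

    @property
    def hypothesis_holds(self) -> bool:
        # The hypothesis only counts if the system was healthy to begin with.
        return self.steady_before and self.steady_during


def run_experiment(
    hypothesis: str,
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Run one experiment: check, inject, observe, and always roll back."""
    if not steady_state():
        # Never inject failure into a system that is already unhealthy.
        return ExperimentResult(hypothesis, False, False)
    inject()
    try:
        steady_during = steady_state()
    finally:
        rollback()
    return ExperimentResult(hypothesis, True, steady_during)


# Toy stand-in for a real system: two replicas, healthy if at least one is up.
replicas = {"a": True, "b": True}

result = run_experiment(
    hypothesis="the service survives the loss of replica a",
    steady_state=lambda: any(replicas.values()),
    inject=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
)
print(result.hypothesis_holds)
```

In a real experiment, `steady_state` would query monitoring data and `inject` would trigger an actual fault. The `finally` block reflects one design choice worth keeping: the fault is removed even if the observation step itself crashes.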
Types of Failure
- Instance failure: Killing a container/process
- Network latency: Adding a 500 ms delay to a network interface
- Network partition: Service A cannot see service B
- Disk full: Filling the disk
- DNS failure: Breaking DNS resolution
- Clock skew: Shifting the system clock
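As an illustration of the network latency case, the sketch below builds the Linux `tc` commands that add a fixed delay with the `netem` queueing discipline and remove it again. The helper names and the interface name `eth0` are assumptions made for this example. The code only prints the commands; running them requires root privileges and affects traffic leaving that interface.

```python
import shlex


def netem_delay_command(interface: str, delay_ms: int) -> list[str]:
    """Build the tc command that adds a fixed delay to outgoing packets."""
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    return ["tc", "qdisc", "add", "dev", interface, "root",
            "netem", "delay", f"{delay_ms}ms"]


def netem_clear_command(interface: str) -> list[str]:
    """Build the tc command that removes the root qdisc, ending the experiment."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]


# "eth0" is an assumed interface name; check yours with `ip link`.
print(shlex.join(netem_delay_command("eth0", 500)))
print(shlex.join(netem_clear_command("eth0")))
```

Keeping the cleanup command next to the injection command makes it harder to forget the rollback once the experiment is over.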
Tools
- Chaos Monkey: Netflix's tool for randomly killing EC2 instances
- Pumba: Chaos testing for Docker containers
- tc (traffic control): Linux tool for simulating network problems

To get started, tc and kill -9 are enough.
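To show what `kill -9` does to a process, here is a small POSIX-only sketch that starts a throwaway child process and sends it SIGKILL. The child is a hypothetical stand-in for a service instance; nothing here targets a real workload.

```python
import signal
import subprocess
import sys

# Start a throwaway child process that would otherwise sleep for a minute.
victim = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# SIGKILL cannot be caught or ignored, so the process gets no chance to clean up.
victim.send_signal(signal.SIGKILL)
exit_code = victim.wait(timeout=10)

# On POSIX, a negative return code means "terminated by that signal number".
print(exit_code == -signal.SIGKILL)
```

Because the process cannot run any shutdown handlers, this is a reasonable approximation of a crash or a sudden loss of the host.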
Start Small
You do not have to start by killing production servers. Begin in the staging environment. Kill one container and watch what happens. Add latency to the database connection. Disconnect Redis. Every experiment will reveal a weakness.
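A first experiment of this kind can even be rehearsed in a unit test before touching staging. The sketch below assumes a hypothetical `FlakyCache` class as an in-memory stand-in for a cache such as Redis, and a hypothetical `load_profile` function that is expected to fall back to the database when the cache is unreachable.

```python
import time
from typing import Callable, Optional


class FlakyCache:
    """Hypothetical in-memory stand-in for a cache client such as Redis."""

    def __init__(self) -> None:
        self.data: dict[str, str] = {}
        self.disconnected = False  # set to True to simulate a lost connection
        self.latency_s = 0.0       # set above zero to simulate added latency

    def get(self, key: str) -> Optional[str]:
        time.sleep(self.latency_s)
        if self.disconnected:
            raise ConnectionError("cache unreachable")
        return self.data.get(key)


def load_profile(user_id: str, cache: FlakyCache,
                 database: Callable[[str], str]) -> str:
    """Read through the cache, but fall back to the database if it is down."""
    try:
        cached = cache.get(user_id)
        if cached is not None:
            return cached
    except ConnectionError:
        pass  # degraded mode: slower, but still correct
    return database(user_id)


cache = FlakyCache()
cache.data["42"] = "cached profile"
before = load_profile("42", cache, lambda uid: "profile from database")

cache.disconnected = True  # the experiment: take the cache away
during = load_profile("42", cache, lambda uid: "profile from database")
print(before, "|", during)
```

If `load_profile` lacked the `except` branch, the second call would raise instead of degrading gracefully, which is exactly the kind of weakness such an experiment is meant to surface.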
Embrace Failure
In a distributed system, failure is not an exception; it is a normal state. Chaos Engineering accepts this and systematically tests resilience. Start with one experiment per week.