Chaos engineering is a practice that helps developers identify vulnerabilities in a software system by deliberately introducing disruptive events, such as server outages or API throttling. Put simply, it means introducing chaos into your system to find weaknesses so that they can be guarded against.
In 2008, one of Netflix's databases became corrupted, which led to a three-day service outage that affected millions of Netflix customers. The incident prompted the company to migrate from its on-premises data center to the AWS cloud. Netflix chose the "chaos" and "monkey" naming to evoke the idea of mischievous monkeys throwing things at your systems.
After migrating to AWS, Netflix's engineering team deployed a suite of open-source tools called the "Simian Army" to test the resilience, reliability, and security of its AWS infrastructure against all kinds of failures. The Simian Army includes the following tools:
- Chaos Monkey - randomly shuts down virtual machines (VMs) to ensure that small disruptions will not affect the overall service.
- Latency Monkey - simulates a degradation of service and checks that upstream services react appropriately.
- Conformity Monkey - detects instances that do not adhere to best practices and shuts them down, giving the service owner the opportunity to relaunch them properly.
- Security Monkey - searches for security weaknesses and terminates the offending instances. It also ensures that SSL and DRM certificates are not expired or close to expiration.
- Doctor Monkey - performs health checks on each instance and monitors other external signs of process health such as CPU and memory usage.
- Janitor Monkey - searches for unused resources and discards them.
The Simian Army attacks Netflix's infrastructure on many fronts. By constantly inducing failures in its own systems, the firm is able to protect itself against problems that affect its AWS architecture.
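The idea behind Chaos Monkey can be illustrated with a toy simulation. The sketch below is not Netflix's implementation and calls no cloud API: `Instance`, `ServiceGroup`, and `unleash_monkey` are hypothetical names, and "shutting down" a VM is modeled by flipping a flag in memory. It only shows the principle that random termination exposes services with no redundancy.

```python
import random


class Instance:
    """Hypothetical stand-in for a virtual machine."""

    def __init__(self, instance_id):
        self.instance_id = instance_id
        self.running = True


class ServiceGroup:
    """Hypothetical group of interchangeable instances behind one service."""

    def __init__(self, name, size):
        self.name = name
        self.instances = [Instance(f"{name}-{i}") for i in range(size)]

    def healthy(self):
        return [inst for inst in self.instances if inst.running]

    def is_available(self, min_healthy=1):
        # The service counts as available while enough instances are running.
        return len(self.healthy()) >= min_healthy


def unleash_monkey(groups, rng, probability=1.0):
    """Terminate at most one randomly chosen running instance per group."""
    terminated = []
    for group in groups:
        candidates = group.healthy()
        if candidates and rng.random() < probability:
            victim = rng.choice(candidates)
            victim.running = False  # simulated shutdown, no real VM involved
            terminated.append(victim.instance_id)
    return terminated


rng = random.Random(42)
groups = [ServiceGroup("api", size=3), ServiceGroup("db", size=1)]
killed = unleash_monkey(groups, rng)

for group in groups:
    print(group.name, "available:", group.is_available())
# api available: True
# db available: False
```

In this toy setup the three-instance `api` group absorbs the loss of one instance, while the single-instance `db` group goes down, which is the kind of single point of failure that random termination is meant to surface.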
The rise of cloud-based and microservices architectures brings many advantages, but it also makes software applications increasingly complex. Chaos engineering has therefore become vital for gauging the resiliency of a software system in production and for building confidence in the system's ability to withstand turbulent and unexpected conditions.
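One way to picture how such confidence is built is a small fault-injection experiment: measure a success rate under normal conditions, inject failures, and check whether a resilience mechanism keeps the system close to its baseline. The following is a minimal sketch under stated assumptions. `FlakyDependency`, `call_with_retry`, and `run_experiment` are hypothetical names, the 30% failure rate is arbitrary, and faults are simulated in memory rather than injected into a real production system.

```python
import random


class FlakyDependency:
    """Hypothetical downstream service that fails at an injected rate."""

    def __init__(self, failure_rate, rng):
        self.failure_rate = failure_rate
        self.rng = rng

    def call(self):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return "ok"


def call_with_retry(dependency, attempts):
    """Call the dependency, retrying on failure up to `attempts` times."""
    for _ in range(attempts):
        try:
            return dependency.call()
        except ConnectionError:
            continue
    return None  # every attempt failed


def run_experiment(failure_rate, attempts, requests=1000, seed=7):
    """Return the fraction of requests that succeed under injected faults."""
    rng = random.Random(seed)
    dependency = FlakyDependency(failure_rate, rng)
    succeeded = sum(
        call_with_retry(dependency, attempts) == "ok" for _ in range(requests)
    )
    return succeeded / requests


baseline = run_experiment(failure_rate=0.0, attempts=1)    # no faults injected
no_retry = run_experiment(failure_rate=0.3, attempts=1)    # faults, no protection
with_retry = run_experiment(failure_rate=0.3, attempts=3)  # faults, with retries

print(f"baseline:   {baseline:.3f}")
print(f"no retry:   {no_retry:.3f}")
print(f"with retry: {with_retry:.3f}")
```

With no faults every request succeeds, so the baseline is 1.0. Comparing the two faulted runs shows how much of the injected failure the retry logic absorbs; a real experiment would compare such measurements against the system's actual steady-state metrics.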