In the post-pandemic world, both consumers and organizations rely on digital technologies more heavily than ever. Meanwhile, with the rise of microservices and distributed system architectures, organizational data environments have grown increasingly complex, making system failures harder than ever to predict.
In addition to being highly disruptive to end users, application downtime is extremely costly for organizations. One study estimates the average hourly cost of downtime at nearly $68,000 for high-priority applications and nearly $62,000 for normal applications.
Waiting for an outage to happen, then cleaning up the mess, is no longer a realistic option. Organizations need a proactive approach that prevents failures from happening in the first place. That’s why DevOps teams at companies like Netflix and Amazon have embraced chaos engineering.
Chaos engineering is about “breaking things on purpose.” IT teams introduce faults, or chaos, into systems in production, then measure how they respond. By conducting planned experiments that test how a system performs under stress, developers and engineers can gain a better understanding of how complex, distributed systems behave, identify and fix vulnerabilities before they turn into outages, and build more resilient systems.
Despite its name, chaos engineering involves very carefully planned and executed experiments. It can be compared to firefighters intentionally starting controlled fires to ensure that they’re trained and equipped to contain a real blaze. Chaos experiments are designed and implemented following a process similar to the scientific method: define the system’s steady state, form a hypothesis about how the system will withstand a given fault, inject that fault, and compare the observed behavior against the hypothesis.
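The loop above can be sketched in a few lines. This is a minimal, self-contained illustration, not a production harness: the `flaky` service, the failure rates, and the 5% threshold are all hypothetical stand-ins chosen for the example.

```python
import random

def measure_error_rate(service, requests=1000):
    """Call the service repeatedly and return the observed error rate."""
    failures = sum(1 for _ in range(requests) if not service())
    return failures / requests

def flaky(base_failure_rate=0.01):
    """Toy service: returns True on success, False on failure."""
    return random.random() >= base_failure_rate

def with_injected_fault(service, fault_rate=0.2):
    """Wrap a service so an extra fraction of calls fail (the injected chaos)."""
    def wrapped():
        if random.random() < fault_rate:
            return False  # simulated dependency outage
        return service()
    return wrapped

# 1. Define the steady state: the healthy service's baseline error rate.
baseline = measure_error_rate(flaky)

# 2. Hypothesize: the error rate stays under 5% even with the fault active.
THRESHOLD = 0.05

# 3. Inject the fault and measure again.
chaotic = measure_error_rate(with_injected_fault(flaky))

# 4. Compare the result against the hypothesis.
print(f"baseline={baseline:.3f} chaotic={chaotic:.3f}")
print("hypothesis holds" if chaotic < THRESHOLD else "hypothesis disproved")
```

A real experiment would target a live dependency (via a tool such as a service-mesh fault injector) rather than a wrapper function, but the shape of the experiment is the same: baseline, hypothesis, fault, comparison.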
Why would anyone dare experiment on a system in production, regardless of how controlled the experiments are? Because while many DevOps testing tools exist, these tools have limits: they can only test foreseen failures. Chaos experiments uncover the problems that IT teams couldn’t have predicted, the very type of problems that cause system outages.
In addition to uncovering otherwise unforeseen issues, chaos engineering enables IT teams to build a much deeper understanding of how their data environments behave under real-world conditions. Armed with this understanding, engineers can build systems that aren’t just resilient but are “antifragile,” meaning that they don’t just keep running in the face of failures but improve with each event.
From this perspective, chaos engineering bridges the gap between DevOps teams, who want to push changes as quickly as possible, and site reliability engineers, who are concerned with keeping systems running. The end goal of chaos engineering is an antifragile data environment where DevOps teams can scale systems, introduce new apps and features, and make other changes without compromising system reliability or performance.
Before a chaos experiment is performed, a system must be in a steady state. If a system isn’t stable to begin with, the chaos experiment could cause an outage, souring company leadership on further experiments while failing to produce any meaningful insights. Additionally, a chaos experiment requires a baseline normal to measure against. Log analysis is essential to ensuring that a system is stable enough to begin experimenting, and it also establishes the baseline normal needed to glean actionable insights from an experiment.
Logging during an experiment is crucial to understanding the results. Absent log analysis, it’s impossible to observe what parts of the system are impacted by the fault and how they are impacted.
While specifics depend on the experiment and data environment, Google’s “four golden signals” of monitoring distributed systems — latency, traffic, errors, and saturation — are an excellent starting point.
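One way to put the golden signals to work in a chaos experiment is to capture them once before the fault and once during it, then flag which signals degraded. The structure below is a sketch; the thresholds (2x latency, a 20% traffic drop, a one-point error-rate increase, 90% saturation) and the sample numbers are hypothetical, not Google-prescribed values.

```python
from dataclasses import dataclass

@dataclass
class GoldenSignals:
    """One observation window of Google's four golden signals."""
    latency_p99_ms: float    # latency: how long requests take
    requests_per_sec: float  # traffic: demand on the system
    error_rate: float        # errors: fraction of failed requests
    saturation: float        # saturation: fraction of capacity in use (0..1)

def degraded_signals(before: GoldenSignals, during: GoldenSignals):
    """Report which signals crossed their (example) thresholds under fault."""
    degraded = []
    if during.latency_p99_ms > 2 * before.latency_p99_ms:
        degraded.append("latency")
    if during.requests_per_sec < 0.8 * before.requests_per_sec:
        degraded.append("traffic")
    if during.error_rate > before.error_rate + 0.01:
        degraded.append("errors")
    if during.saturation > 0.9:
        degraded.append("saturation")
    return degraded

# Example readings: before the fault, and while the fault is active.
before = GoldenSignals(120.0, 850.0, 0.002, 0.55)
during = GoldenSignals(480.0, 820.0, 0.031, 0.93)
print(degraded_signals(before, during))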
Chaos engineering is an exceptionally powerful discipline that is transforming the way in which systems are being designed and built at some of the world’s largest businesses. Testing systems in production is ethical and low-risk so long as the experiment is carefully planned, with a contained blast radius, a rollback plan, and of course buy-in from all applicable organizational stakeholders.
Move forth and break things!