Chaos Engineering: Breaking Things on Purpose
- ShiftQuality Contributor
- Oct 25, 2025
- 5 min read
The previous posts in this path covered quality engineering as an organizational practice and engineering effectiveness programs. This post covers the discipline that tests organizational quality in the most direct way possible: chaos engineering — the practice of deliberately introducing failures into production systems to discover weaknesses before unplanned outages do.
Your architecture diagram says the system is resilient. Your runbooks say the team can recover from failures. Your monitoring says it will detect problems. Chaos engineering tests whether any of that is actually true. The gap between what you believe about your system's reliability and what is actually true is where outages live.
The Premise
Complex distributed systems fail in complex, unpredictable ways. You cannot predict every failure mode, and you cannot test every combination of component failures. The failures that cause the worst outages are the ones nobody anticipated: a minor service degrades, its callers retry, the retry storm overwhelms the database, and the database takes the main application down with it.
Traditional testing verifies that the system works correctly under normal conditions. Chaos engineering verifies that the system behaves acceptably under abnormal conditions — network partitions, service outages, resource exhaustion, clock skew, and the hundred other ways that distributed systems fail in production.
The philosophy is borrowed from vaccine science: introduce a controlled dose of failure to build immunity against uncontrolled failure. The controlled experiment reveals weaknesses. The weaknesses are fixed. The system becomes more resilient. Repeat.
Starting Small: Game Days
You do not start chaos engineering by killing production databases. You start with game days — scheduled exercises where the team simulates failures and practices response.
A game day is part drill and part experiment. The facilitator introduces a failure scenario: "The payment service is returning 503 errors." The team responds as they would in a real incident — investigating, communicating, mitigating. The facilitator observes: does monitoring detect the failure? Does the team follow the runbook? Does the system degrade gracefully or cascade catastrophically?
Game days in non-production environments are the safe starting point. They reveal gaps in monitoring, runbooks, and team preparedness without risking user impact. The findings are often surprising — monitoring that should have fired did not, the runbook describes a service that was renamed six months ago, the team does not know who owns the affected component.
After game days reveal the most obvious gaps and those gaps are fixed, the exercises can move to production — with careful scoping, clear blast radius limits, and the ability to immediately stop the experiment.
Fault Injection Patterns
Chaos experiments inject specific types of failures to test specific resilience mechanisms.
Service unavailability. Kill a service instance and observe: does the load balancer redirect traffic? Does the dependent service use its circuit breaker? Does the system recover when the service restarts?
Network latency. Inject 500ms of latency into calls to a dependency. This tests timeout configurations, async patterns, and user experience under degraded conditions. Many systems are designed for success or failure but not for slowness: when a caller waits synchronously on a dependency that now takes 500ms longer, the delay propagates to every caller upstream and can cascade through the entire request chain.
Resource exhaustion. Fill a disk. Exhaust memory. Saturate CPU. These experiments test whether the application handles resource pressure gracefully — does it shed load, log useful diagnostics, and recover when resources are restored? Or does it crash without warning, corrupt data, and require manual intervention to restart?
Dependency failure. Make a database return errors, force a cache miss on every request, or have a third-party API return malformed responses. These experiments test whether the application handles dependency failures with appropriate fallbacks and error handling.
Network partition. Isolate a subset of nodes from the rest of the cluster. This tests distributed consensus, data consistency, and split-brain scenarios — the most complex failure modes in distributed systems.
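The latency and dependency-failure patterns above can be sketched at the application level. This is a minimal illustration, not how any particular chaos platform injects faults: `with_faults`, `InjectedFault`, and `fetch_price_with_fallback` are hypothetical names, and real tools usually inject at the network or infrastructure layer rather than in code.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real dependency error during an experiment."""


def with_faults(
    call: Callable[[], T],
    *,
    latency_s: float = 0.0,
    error_rate: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, optionally delaying it or replacing it with a failure."""
    rng = rng or random.Random()
    if latency_s > 0:
        sleep(latency_s)  # simulates a slow dependency
    if rng.random() < error_rate:
        raise InjectedFault("injected dependency failure")
    return call()


def fetch_price_with_fallback(fetch: Callable[[], float], cached: float) -> float:
    """The caller-side behavior under test: fall back to a cached value."""
    try:
        return fetch()
    except InjectedFault:
        return cached
```

With `error_rate=1.0` every call fails, so the experiment checks that the fallback path actually returns the cached value. The `sleep` parameter exists only so the delay can be replaced in tests.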
Each experiment has a hypothesis: "We believe that if the payment service is unavailable for 60 seconds, the order service will queue orders and process them when the payment service recovers." The experiment tests the hypothesis. If the hypothesis is correct, you have validated resilience. If it is incorrect, you have discovered a weakness before your users did.
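The payment-service hypothesis can be written down as a small, self-checking simulation. The sketch below uses in-memory stand-ins, and the class and function names (`PaymentService`, `OrderService`, `run_payment_outage_experiment`) are illustrative assumptions. A real experiment would inject the outage into running services and read the result from their metrics.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    available: bool = True
    charged: list = field(default_factory=list)

    def charge(self, order_id: str) -> None:
        if not self.available:
            raise ConnectionError("payment service unavailable")
        self.charged.append(order_id)


@dataclass
class OrderService:
    payments: PaymentService
    pending: deque = field(default_factory=deque)

    def place(self, order_id: str) -> str:
        try:
            self.payments.charge(order_id)
            return "charged"
        except ConnectionError:
            self.pending.append(order_id)  # queue instead of failing the order
            return "queued"

    def drain(self) -> int:
        """Retry queued orders in order; stop at the first failure."""
        processed = 0
        while self.pending:
            try:
                self.payments.charge(self.pending[0])
            except ConnectionError:
                break
            self.pending.popleft()
            processed += 1
        return processed


def run_payment_outage_experiment(order_count: int = 3) -> dict:
    payments = PaymentService()
    orders = OrderService(payments)
    payments.available = False  # inject the fault
    during = [orders.place(f"o{i}") for i in range(order_count)]
    payments.available = True   # end the fault
    recovered = orders.drain()
    return {
        "queued_during_outage": during.count("queued"),
        "processed_after_recovery": recovered,
        "lost_orders": order_count - len(payments.charged),
    }
```

The hypothesis holds when every order placed during the outage is queued, every queued order is processed after recovery, and `lost_orders` is zero. Any other result is a discovered weakness.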
Steady State and Blast Radius
Every chaos experiment must define two things: the steady state and the blast radius.
Steady state is the normal behavior of the system — the metrics that indicate the system is healthy. Request success rate above 99.5%. P95 latency below 200ms. Order processing throughput above 100 per minute. The experiment monitors these metrics continuously. If steady state metrics deviate beyond acceptable thresholds, the experiment is stopped immediately.
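A steady-state definition is easiest to enforce when it is written as data rather than prose. The following sketch encodes the three example thresholds above. The `Metrics` and `SteadyState` types are hypothetical, and in practice the values would come from your monitoring system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Metrics:
    success_rate: float       # fraction of successful requests, 0.0 to 1.0
    p95_latency_ms: float
    orders_per_minute: float


@dataclass(frozen=True)
class SteadyState:
    min_success_rate: float = 0.995
    max_p95_latency_ms: float = 200.0
    min_orders_per_minute: float = 100.0

    def violations(self, m: Metrics) -> List[str]:
        """Return the names of the metrics that are outside their thresholds."""
        problems = []
        if m.success_rate < self.min_success_rate:
            problems.append("success_rate")
        if m.p95_latency_ms > self.max_p95_latency_ms:
            problems.append("p95_latency_ms")
        if m.orders_per_minute < self.min_orders_per_minute:
            problems.append("orders_per_minute")
        return problems
```

An experiment would call `violations` on each metrics sample and stop as soon as the list is non-empty.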
Blast radius is the scope of the experiment's potential impact. Running against 1% of traffic limits the blast radius to 1% of users. Running in a single availability zone limits the blast radius to the capacity served by that zone. Running in production during low-traffic hours limits how many users are active while the fault is in place.
The discipline: start with the smallest possible blast radius and expand only after confidence is established. A first chaos experiment should target a non-critical service, in a single instance, during low traffic, with automatic rollback if steady state is violated. Only after multiple successful experiments at small scale should the scope expand.
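One common way to hold an experiment to a fixed share of traffic is deterministic bucketing: hash each user into a bucket and include only the lowest buckets. The sketch below assumes a string user ID. The function name `in_blast_radius` and the experiment label are hypothetical.

```python
import hashlib


def in_blast_radius(user_id: str, percent: float, experiment: str = "exp-1") -> bool:
    """Return True if this user falls inside the experiment's traffic share.

    The same user always gets the same answer for the same experiment, so the
    affected population stays stable for the duration of the run.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100
```

Including the experiment name in the hash means a different experiment selects a different subset of users, so the same 1% is not hit every time. Expanding the blast radius is then a matter of raising `percent` after earlier runs have passed.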
Automating Chaos
Manual chaos experiments are valuable but limited — they run infrequently, require human effort, and test point-in-time behavior. Automated chaos runs continuously, catching resilience regressions that manual experiments would miss.
Chaos platforms (Gremlin, Chaos Monkey, Litmus, Chaos Mesh for Kubernetes) provide the infrastructure for automated chaos experiments. They inject failures on a schedule, monitor steady state metrics, stop experiments when thresholds are violated, and report results.
Netflix's Chaos Monkey — which randomly terminates production instances during business hours — is the canonical example of automated chaos. It enforces a specific resilience requirement: every service must handle instance loss gracefully. Because Chaos Monkey runs continuously, any service that fails to handle instance loss is discovered quickly.
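The core idea, random instance termination restricted to hours when engineers are available to respond, fits in a few lines. This is an illustrative sketch and not Chaos Monkey's actual implementation: the 9-to-17 weekday window, the one-victim-per-group rule, and the rule that skips single-instance groups are all assumptions made for the example.

```python
import random
from datetime import datetime
from typing import Dict, List


def in_business_hours(now: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """Weekdays only (Monday is 0), within [start_hour, end_hour)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victims(
    groups: Dict[str, List[str]], now: datetime, rng: random.Random
) -> Dict[str, str]:
    """Choose at most one instance per group, and only during business hours."""
    if not in_business_hours(now):
        return {}
    return {
        name: rng.choice(instances)
        for name, instances in groups.items()
        if len(instances) > 1  # assumption: never remove a group's only instance
    }
```

The function only selects instances. Terminating them is left to whatever infrastructure API the environment provides.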
The organizational requirement: automated chaos requires investment in monitoring, alerting, and automatic rollback. Without reliable monitoring, you cannot detect when an experiment exceeds its blast radius. Without automatic rollback, a chaos experiment that causes unexpected harm continues until a human intervenes.
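The abort-and-rollback requirement can be expressed as a small control loop. The sketch below is a simplified model of what such a runner does, with hypothetical callables (`inject`, `rollback`, `steady_state_ok`, `wait`) standing in for real fault injection and monitoring queries.

```python
from typing import Callable


def run_guarded_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    steady_state_ok: Callable[[], bool],
    checks: int,
    wait: Callable[[], None] = lambda: None,
) -> str:
    """Inject a fault, watch steady state, and always roll back."""
    if not steady_state_ok():
        return "skipped"  # never start from an unhealthy baseline
    inject()
    try:
        for _ in range(checks):
            wait()  # e.g. sleep between metric samples
            if not steady_state_ok():
                return "aborted"
        return "completed"
    finally:
        rollback()  # runs on completion, abort, or an unexpected exception
```

Putting the rollback in `finally` is the important design choice here: the fault is removed even if the monitoring query itself raises, so a broken experiment does not depend on a human noticing it.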
Organizational Readiness
Chaos engineering is a technical practice that requires organizational support. Killing services in production sounds reckless to anyone who has not seen the methodology. Leadership needs to understand the value proposition: controlled experiments that find weaknesses are less costly than uncontrolled outages that find the same weaknesses.
The cultural prerequisites: a blameless incident culture (failures discovered by chaos experiments are learning opportunities, not blame targets), investment in monitoring (you cannot observe the impact of experiments without observability), and engineering time allocated to fix the weaknesses discovered.
Start with buy-in from a single team. Run game days. Fix the gaps found. Share the results. The demonstrated value — "we found and fixed three reliability issues that would have caused outages" — builds the case for broader adoption.
The Takeaway
Chaos engineering replaces hope with evidence. You hope the system handles failure gracefully. Chaos engineering tests whether it actually does. The practice starts with game days in non-production environments, progresses to controlled fault injection in production, and matures into automated experiments that run continuously.
The output is not broken systems. It is a list of discovered weaknesses, each of which can be fixed before an unplanned outage exploits it. The organization that practices chaos engineering does not have fewer potential failure modes. It has fewer undiscovered failure modes, and that can be the difference between a 4-hour outage and a 4-minute blip.
Next in the "Quality at Organizational Scale" learning path: We'll cover incident management as a quality practice — how post-incident learning drives systemic improvement across the engineering organization.


