Flaky Tests: Diagnosis and Cure

Contributor
Apr 23

A flaky test is one that sometimes passes and sometimes fails without any code change. Each one is a small culture-corrosion event: it trains the team to retry failures rather than investigate them. Over time, real failures hide behind the noise.

This guide covers how to find, diagnose, and fix flaky tests, and how to prevent new ones.

What Causes Flakiness

Common categories:

Timing. Tests that depend on operations completing within a specific time. Network latency, slow CI runners, garbage collection: any timing variation can trigger a failure.

Order dependence. Tests that pass when run alone but fail when run after another test. Shared state, leaked fixtures, global mocks not reset.

Concurrency. Tests that race against themselves or other tests. Parallel runs hit the same data.

External dependencies. Tests that hit real services that are sometimes slow or unavailable.

Non-determinism. Random values, timestamps, ordering that depends on hashmap iteration order.

Resource contention. Tests competing for memory, file handles, database connections, ports.

Each category has different fixes. Diagnosing which category a flake belongs to is the first step.

Finding Flaky Tests

You can't fix what you don't measure.

CI-level tracking: record every test run with pass/fail. Identify tests that fail intermittently. Most CI systems (or test platforms like Datadog Tests, Buildkite Test Analytics) provide this.

Repeat runs: run suspect tests many times locally. The fraction of runs that fail gives you a measurable flake rate.

Quarantine reports: any test that fails and then passes on retry is marked suspect, even though the re-run was green.

Without measurement, flaky tests are anecdotal. The team's tolerance for retries hides them.
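Here is a minimal sketch of CI-level tracking, assuming your CI can export one record per test execution. The TestRun shape and the findFlakyTests name are hypothetical, not part of any CI product. It applies the definition from the top of this guide: a test is flaky when the same commit has both a pass and a fail.

```typescript
// One record per test execution (hypothetical export format).
interface TestRun {
  testName: string;
  commit: string; // revision the run executed against
  passed: boolean;
}

interface FlakeStat {
  testName: string;
  runs: number; // runs on commits with mixed outcomes
  failures: number; // failures among those runs
  flakeRate: number; // failures / runs
}

function findFlakyTests(runs: TestRun[]): FlakeStat[] {
  // Count passes and failures per (test, commit) pair.
  const groups = new Map<string, { testName: string; passes: number; failures: number }>();
  for (const run of runs) {
    const key = `${run.testName}\u0000${run.commit}`;
    const group = groups.get(key) ?? { testName: run.testName, passes: 0, failures: 0 };
    if (run.passed) {
      group.passes += 1;
    } else {
      group.failures += 1;
    }
    groups.set(key, group);
  }

  // Keep only pairs with mixed outcomes and sum them per test.
  const stats = new Map<string, FlakeStat>();
  groups.forEach((group) => {
    if (group.passes === 0 || group.failures === 0) {
      return; // consistent result on this commit: not a flake
    }
    const stat =
      stats.get(group.testName) ??
      { testName: group.testName, runs: 0, failures: 0, flakeRate: 0 };
    stat.runs += group.passes + group.failures;
    stat.failures += group.failures;
    stat.flakeRate = stat.failures / stat.runs;
    stats.set(group.testName, stat);
  });

  // Worst offenders first.
  return Array.from(stats.values()).sort((a, b) => b.flakeRate - a.flakeRate);
}
```

A test that fails on every run of a commit is excluded on purpose: that is a real failure, not a flake.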

Diagnosing the Cause

For a flaky test:

Run it many times locally. Does it flake locally? If yes, you can iterate fast. If no, the cause is probably specific to the CI environment.
Check the failure mode. Does it fail the same way each time, or different ways?
Look at the test code. Does it have timing assumptions? Order assumptions? External calls?
Look at recent changes. Did anything recently start using shared state?
Try in isolation. Does the test pass alone but fail with others?

The diagnosis usually reveals the category. The fix follows.
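The first two steps can be combined in a small harness. This is a sketch, assuming the test can be called as a plain function. repeatTest and RepeatReport are hypothetical names, and most runners offer their own repeat option that you should prefer when it exists.

```typescript
interface RepeatReport {
  runs: number;
  failures: number;
  failureModes: Map<string, number>; // error message -> how often it occurred
}

// Runs the test `runs` times in sequence and records how it failed each time.
async function repeatTest(
  test: () => void | Promise<void>,
  runs: number,
): Promise<RepeatReport> {
  const report: RepeatReport = { runs, failures: 0, failureModes: new Map() };
  for (let i = 0; i < runs; i++) {
    try {
      await test();
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      report.failures += 1;
      report.failureModes.set(message, (report.failureModes.get(message) ?? 0) + 1);
    }
  }
  return report;
}
```

One failure mode across many runs points at a single cause. Several different messages suggest shared state or resource contention.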

Fixing Timing Issues

Bad:

await page.click('#submit');
// Fixed pause: hopes the success message appears within 2 seconds
await sleep(2000);
await expect(page.locator('#success')).toBeVisible();

The sleep is fragile. Two seconds is usually longer than needed, which slows the suite, and occasionally too short, which fails the test.

Better:

await page.click('#submit');
await expect(page.locator('#success')).toBeVisible();

Modern frameworks have built-in waits with intelligent retry. Use them.

For lower-level code, use explicit synchronization primitives, not arbitrary sleeps.
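For code without a framework-provided wait, one option is a one-shot latch: the code under test signals completion, and the test awaits that signal with an upper bound. This is a minimal sketch, and the Latch class is a hypothetical helper rather than a library API.

```typescript
// One-shot latch: the code under test calls open(); the test awaits wait().
class Latch {
  private readonly opened: Promise<void>;
  private resolve: () => void = () => {};

  constructor() {
    this.opened = new Promise<void>((resolve) => {
      this.resolve = resolve;
    });
  }

  open(): void {
    this.resolve();
  }

  // Resolves as soon as open() is called; rejects if that takes longer than timeoutMs.
  async wait(timeoutMs: number): Promise<void> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(
        () => reject(new Error(`latch not opened within ${timeoutMs} ms`)),
        timeoutMs,
      );
    });
    try {
      await Promise.race([this.opened, timeout]);
    } finally {
      clearTimeout(timer);
    }
  }
}

// Usage sketch: the setTimeout stands in for real asynchronous work.
async function latchDemo(): Promise<void> {
  const done = new Latch();
  setTimeout(() => done.open(), 10);
  await done.wait(1000); // returns after about 10 ms, not after a fixed sleep
}
```

The timeout is an upper bound, not a delay. The test proceeds as soon as the work finishes and fails with a clear message if it never does.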

Fixing Order Dependence

Tests should not depend on each other's order.

Diagnosis: a test that fails when run with others passes alone, or changing the order of tests causes failures.

Fix:

Each test creates the state it needs
Each test cleans up after itself (or use transaction-based isolation)
No shared mutable state between tests
No global mocks that aren't reset

Run tests in random order in CI. This flushes out hidden order dependencies.
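Random order only helps if a failing order can be replayed. Below is a sketch of a seeded shuffle, assuming you control the list of tests yourself. Many runners have a built-in option for this, so check yours before writing your own. The function names are hypothetical, and the generator is a small mulberry32-style PRNG chosen only because it is short.

```typescript
// Small deterministic PRNG: the same seed always yields the same sequence.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the seeded generator; the input is not modified.
function shuffleWithSeed<T>(items: readonly T[], seed: number): T[] {
  const random = seededRandom(seed);
  const result = items.slice();
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = result[i];
    result[i] = result[j];
    result[j] = tmp;
  }
  return result;
}

// Log the seed with every CI run so a failing order can be reproduced locally.
const seed = Date.now() % 1_000_000;
const order = shuffleWithSeed(["login", "checkout", "profile", "search"], seed);
console.log(`test order seed=${seed}: ${order.join(", ")}`);
```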

Fixing Concurrency Issues

Tests running in parallel can collide on:

Database rows with the same ID
Shared external resources
Ports
Filesystem locations

Fix:

Use unique IDs per test (UUIDs)
Use unique data per test
Isolate parallel runs (different databases, different directories)
Avoid shared resources where possible
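The first and third fixes can be sketched with Node's standard library, assuming the tests run on Node.js. uniqueId, makeTestDir, and removeTestDir are hypothetical helper names. randomUUID, mkdtempSync, and rmSync are the real Node APIs they wrap.

```typescript
import { randomUUID } from "node:crypto";
import { mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// A fresh identifier per call, so parallel tests never insert the same key.
function uniqueId(prefix: string): string {
  return `${prefix}-${randomUUID()}`;
}

// A fresh directory per test; mkdtempSync appends random characters to the prefix.
function makeTestDir(testName: string): string {
  return mkdtempSync(join(tmpdir(), `${testName}-`));
}

// Cleanup for the end of the test.
function removeTestDir(dir: string): void {
  rmSync(dir, { recursive: true, force: true });
}

// Usage sketch inside a test:
const userId = uniqueId("user"); // prefix "user-" followed by a UUID
const workDir = makeTestDir("export-test");
// ... run the test against userId and workDir ...
removeTestDir(workDir);
```

For ports, the usual equivalent is to listen on port 0 and let the operating system assign a free one.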

Fixing External Dependency Issues

Tests that hit real services are inherently flakier.

Options:

Replace with stubs/mocks for tests where the real service isn't the point
Run against sandbox environments for tests that need realistic behavior
Skip in CI, run nightly for tests that genuinely need external calls
Add retry logic carefully when the dependency is reliably reachable but occasionally slow

The decision depends on what the test is actually verifying. If it's testing your code's interaction with an external service contract, mock it. If it's testing the integration, accept some flakiness and add retries.
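For the retry option, here is a sketch of bounded retries with exponential backoff, assuming the dependency is usually reachable. withRetry and RetryOptions are hypothetical names. Wrap only the external call, not the whole test, so a retry never masks a failed assertion.

```typescript
interface RetryOptions {
  attempts: number; // total tries, including the first
  baseDelayMs: number; // wait before the second try; doubles after each failure
}

async function withRetry<T>(
  operation: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < options.attempts) {
        const delayMs = options.baseDelayMs * Math.pow(2, attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the last error instead of hiding it.
  throw lastError;
}
```

Log every retry somewhere visible. A call that regularly needs its second attempt is a flake you are measuring, not one you have fixed.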

Fixing Non-Determinism

Random and time-based values cause flakes.

Mock the clock (pass time as a dependency)
Mock random sources (pass them as dependencies)
Sort lists before comparing (or use unordered comparisons)
Don't depend on hashmap iteration order

The fix is usually making the non-determinism a parameter rather than implicit.
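Here is a small sketch of what that looks like. createCoupon is a hypothetical function under test, and Clock and RandomSource are illustrative types. Production code passes the real clock and Math.random, while the test passes fixed values.

```typescript
interface Clock {
  now(): Date;
}
type RandomSource = () => number; // returns a value in [0, 1)

// Hypothetical code under test: time and randomness arrive as parameters.
function createCoupon(clock: Clock, random: RandomSource): { code: number; expiresAt: Date } {
  const code = Math.floor(random() * 1_000_000);
  const expiresAt = new Date(clock.now().getTime() + 24 * 60 * 60 * 1000);
  return { code, expiresAt };
}

// Production wiring.
const systemClock: Clock = { now: () => new Date() };
const liveCoupon = createCoupon(systemClock, Math.random);

// Test wiring: fixed inputs give the same result on every run.
const fixedClock: Clock = { now: () => new Date("2024-01-01T00:00:00Z") };
const coupon = createCoupon(fixedClock, () => 0.5);
// coupon.code === 500000
// coupon.expiresAt.toISOString() === "2024-01-02T00:00:00.000Z"

// Order-insensitive comparison for results whose order is not guaranteed.
function sameMembers(a: readonly string[], b: readonly string[]): boolean {
  const sortedA = a.slice().sort();
  const sortedB = b.slice().sort();
  return sortedA.length === sortedB.length && sortedA.every((value, i) => value === sortedB[i]);
}
```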

Quarantine Strategy

When a test goes flaky, quarantine it immediately:

Mark it with @flaky or equivalent
Skip it in the blocking suite
Run it but don't fail the build
Track the open flake list

Quarantine prevents the bad behavior of retrying-until-green. The test exists; it's just not blocking work.
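One way to run quarantined tests without letting them fail the build is to decide the build status from non-quarantined failures only. This is a minimal sketch. The TestResult shape and the evaluateBuild function are hypothetical, and the quarantine list is assumed to be a set of test names kept in version control.

```typescript
interface TestResult {
  name: string;
  passed: boolean;
}

interface BuildVerdict {
  buildPassed: boolean;
  blockingFailures: string[]; // these fail the build
  quarantinedFailures: string[]; // reported and tracked, but not blocking
}

function evaluateBuild(
  results: TestResult[],
  quarantined: ReadonlySet<string>,
): BuildVerdict {
  const failed = results.filter((result) => !result.passed).map((result) => result.name);
  const blockingFailures = failed.filter((name) => !quarantined.has(name));
  const quarantinedFailures = failed.filter((name) => quarantined.has(name));
  return {
    buildPassed: blockingFailures.length === 0,
    blockingFailures,
    quarantinedFailures,
  };
}
```

Publishing quarantinedFailures with every build keeps the open flake list in front of the team instead of in a file nobody reads.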

Triage Cadence

Quarantined tests need ownership and a fix timeline.

Working pattern:

Each quarantined test has an owner (the most recent toucher, or the test author)
Within one week, the owner investigates
Within two weeks, the test is fixed or deleted

Without deadlines, quarantined tests accumulate forever.
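The one-week and two-week limits are easy to check automatically. This is a sketch, assuming each quarantine entry records its owner, its start date, and whether investigation has begun. The QuarantineEntry shape and the triageStatus function are hypothetical.

```typescript
interface QuarantineEntry {
  testName: string;
  owner: string;
  quarantinedOn: Date;
  investigationStarted: boolean;
}

type TriageStatus = "on-track" | "investigation-overdue" | "fix-or-delete-overdue";

const DAY_MS = 24 * 60 * 60 * 1000;

function triageStatus(entry: QuarantineEntry, today: Date): TriageStatus {
  const ageDays = Math.floor((today.getTime() - entry.quarantinedOn.getTime()) / DAY_MS);
  if (ageDays > 14) {
    return "fix-or-delete-overdue"; // past the two-week limit
  }
  if (ageDays > 7 && !entry.investigationStarted) {
    return "investigation-overdue"; // past one week with no investigation
  }
  return "on-track";
}
```

Run it over the quarantine list in the weekly triage and send each overdue entry to its owner.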

Delete Tests Without Mercy

Some tests can't be reasonably fixed. Some test things that aren't worth testing. Some are duplicates.

If a flaky test isn't earning its keep, delete it. The suite is healthier for the deletion.

Common candidates:

Tests for code about to be removed
Tests verifying trivially-correct behavior
Tests duplicating other tests
Tests with no clear value
Tests no one understands

Don't keep tests out of fear. Delete and document why.

The Cultural Side

Flakiness has a cultural dimension. Teams that tolerate flake get more flake; teams that don't tolerate it get less.

Practices that build no-flake culture:

Block on flake. A CI run that fails gets investigated, even if it passes on retry.
Make flakes visible. Dashboards showing flake rates per team or component.
Celebrate fixes. Recognize the work of de-flaking.
Accountability. Tests have owners; flake is in their queue.

Teams that retry-until-green train themselves to ignore failures. That's the path to real bugs hiding in noise.

Anti-Patterns

Auto-retry without investigation. Tests retry; passes are accepted. Real bugs hide.

Catching flakes by accident. Tests fail; team notices the failure two days later. No urgency.

Flake budget. "We accept 2% flake rate." Defeats the purpose; that 2% is where the real bugs hide.

Flake graveyard. Hundreds of quarantined tests, none being fixed. Quarantine without follow-up.

Blame the framework. "Cypress is flaky." Sometimes true; usually the issue is local.

A Working Process

For a team with a flakiness problem:

Measure. Track per-test flake rates over time.
Quarantine ruthlessly. Anything intermittent gets pulled.
Triage weekly. Owner per flake, investigation in progress.
Fix or delete. No tests live in quarantine indefinitely.
Watch new code. Monitor the pass/fail rate of newly added tests.

Within a few months, a chronically flaky suite can become trustworthy. The investment is real but the payoff is faster development and real bug catching.

Key Takeaway

Flaky tests train the team to ignore failures, which lets real bugs ship. Cause categories: timing, order, concurrency, external dependencies, non-determinism, resource contention. Each has specific fixes. Quarantine immediately; investigate within a week; fix or delete within two. Don't accept a "flake budget" — real bugs hide there. The combination of measurement, ownership, and ruthless deletion can turn a chronically flaky suite into a trustworthy one.

ShiftQuality