Testing Non-Deterministic Systems Without Losing Your Mind

Contributor
May 29

Traditional unit testing rests on a contract: given a specific input, the function returns a specific output, every time. Deviation from that contract is a bug, and the test catches it.

AI systems do not satisfy this contract. The same prompt, sent twice, can produce different outputs. The function under test is not deterministic: each call returns a sample from a distribution. A unit test built around assert generate_summary(article) == "Expected text" is going to be flaky, and not because anyone wrote bad code.

Most teams react to this by either turning off AI tests in CI ("they're flaky") or by writing tests so loose they don't catch anything meaningful ("assert the output isn't empty"). Both are wrong. The first abandons quality discipline. The second pretends to have it.

There's a discipline for testing probabilistic systems. It's different from traditional testing. It's harder. It is also necessary if you're shipping AI to production.

What's Different

A traditional unit test is binary. The function passed or it didn't. The expected output is known. Any deviation is a bug.

AI testing has three properties that break the traditional model:

Outputs are samples, not values. When you call an LLM, you're sampling from a probability distribution over possible outputs. Two calls with identical inputs can produce different strings. The model has not been "wrong" either time — the output is a draw from the distribution.

Quality is multi-dimensional. Traditional functions either return the right value or not. LLM outputs vary along many dimensions simultaneously: factual accuracy, tone, format, completeness, length, relevance. An output can be "wrong" on one dimension and "right" on others. A test that checks only exact match ignores most of that quality space.

The model itself changes. The underlying model can be updated by the provider. Prompts get tuned. Fine-tunes get re-trained. What was reliable last week may not be reliable this week, and the change isn't in your code. Traditional CI assumes the function under test changes only when you change it.

These properties don't mean testing is impossible. They mean the testing strategy has to fit the system.

Property Assertions, Not Equality Assertions

The first move away from flaky tests is asserting on properties of the output rather than the exact output text.

Consider testing a function that summarizes a customer complaint. The naive test:

def test_summary():
    complaint = "I ordered a blue widget but received red. Want refund."
    summary = generate_summary(complaint)
    assert summary == "Customer received wrong color widget and requests refund."

This test is flaky. The model will produce variations: "wrong-color widget," "received red instead of blue," "wants a refund," and so on. All are correct summaries. The test fails on most of them.

The same test as property assertions:

import re

def test_summary():
    complaint = "I ordered a blue widget but received red. Want refund."
    summary = generate_summary(complaint)
    text = summary.lower()
    # Bounds are loose on purpose: a good one-sentence summary can be 8 words.
    assert 5 <= len(text.split()) <= 30  # reasonable length
    assert "refund" in text  # mentions the requested action
    assert any(c in text for c in ["wrong", "incorrect", "different", "instead"])  # captures the issue
    # Word-boundary match so "red" is not found inside "ordered".
    assert re.search(r"\b(blue|red)\b", text)  # captures specifics

These assertions are about properties: the right length range, the right action keyword, an issue marker, color specificity. The summary can use many different exact words and still pass. The test catches summaries that are too short, too long, missing the action, or missing the issue — which are the things that actually matter.

Property assertions are more verbose than equality assertions. They are also far more useful. Most of the flakiness disappears because the test no longer asserts things the system can't reliably produce.
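One way to make property failures easier to diagnose is to collect every violated property instead of stopping at the first failed assert. The sketch below is illustrative only: check_summary_properties is a hypothetical helper, and its length bounds and keyword sets are assumptions that would need tuning against real data.

```python
import re


def check_summary_properties(summary: str) -> list[str]:
    """Return the names of violated properties (an empty list means all passed).

    Hypothetical helper: the bounds and keyword sets are illustrative assumptions.
    """
    text = summary.lower()
    # Tokenize into words so "red" does not match inside "ordered".
    words = set(re.findall(r"[a-z']+", text))
    failures = []

    if not 5 <= len(text.split()) <= 30:
        failures.append("length")
    if "refund" not in text:
        failures.append("action")
    if not words & {"wrong", "incorrect", "different", "instead"}:
        failures.append("issue")
    if not words & {"blue", "red"}:
        failures.append("specifics")
    return failures
```

In a test, `failures = check_summary_properties(summary)` followed by `assert not failures, failures` reports every missed property in one run, which helps when a prompt change breaks several things at once.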

Statistical Assertions

For some quality properties, no single-run assertion is reliable. A summary "is faithful to the source" might be true 90% of the time and false 10% of the time on the same input. Asserting it once isn't meaningful.

The fix is statistical assertions: run the same test many times, assert on the distribution.

def test_summary_faithfulness():
    complaint = "I ordered a blue widget but received red. Want refund."
    results = []
    for _ in range(20):
        summary = generate_summary(complaint)
        results.append(is_faithful(complaint, summary))
    success_rate = sum(results) / len(results)
    assert success_rate >= 0.90, f"Faithfulness {success_rate:.0%} below threshold"

This is uncomfortable for engineers trained on binary tests. It's also more honest about what's being tested. A 90% faithfulness rate is a real production property; "the test passed once" is not.

Practical considerations:

Sample size matters. Twenty runs gives you a coarse estimate. A hundred runs is much better but expensive. Pick the size that balances statistical confidence and cost for the importance of the property.
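The threshold matters too. A threshold only means something relative to the system's measured baseline. Set it too close to the true rate and the test fails by chance: with a true faithfulness rate of 0.90, a 20-run test asserting >= 0.90 fails about a third of the time. Set it far below the baseline and it catches nothing. Measure the system's actual behavior first, then leave enough margin for sampling noise.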
Run these tests less often. Statistical tests are expensive. Run them nightly, not on every commit. Run them on canary deployments before promoting changes. Run them on the eval suite before model updates.
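To see how sample size and threshold interact, it helps to compute how often a statistical test passes for a given true success rate. The sketch below assumes each run is an independent trial with the same success probability (a binomial model, which real systems only approximate). pass_probability is a hypothetical helper name, not part of any testing library.

```python
from math import comb


def pass_probability(true_rate: float, runs: int, threshold: float) -> float:
    """Probability that at least `threshold` of `runs` independent trials succeed.

    Assumes a binomial model: every run succeeds independently with `true_rate`.
    """
    # Smallest success count k whose observed rate k / runs meets the threshold.
    # Using the same division as the test avoids floating-point mismatches.
    needed = next((k for k in range(runs + 1) if k / runs >= threshold), runs + 1)
    return sum(
        comb(runs, k) * true_rate**k * (1 - true_rate) ** (runs - k)
        for k in range(needed, runs + 1)
    )
```

Under these assumptions, pass_probability(0.90, 20, 0.90) is about 0.68, so a system that is faithful exactly 90% of the time fails the 20-run test roughly one run in three. With a true rate of 0.97 the same test passes about 98% of the time. Numbers like these are a reasonable starting point for choosing both the run count and the threshold.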

LLM-as-Judge

For complex quality properties that can't be checked deterministically — semantic correctness, helpfulness, tone — you can use another LLM as the judge.

The judge model is given:

The original input
The system's output
Quality criteria expressed as a prompt
A scoring rubric

The judge produces a score or pass/fail judgment.

def test_response_quality():
    user_query = "What's the refund policy?"
    response = customer_support_agent(user_query)
    judgment = judge_model.score(
        criteria="""
        Score the response on a 1-5 scale for:
        - Factual accuracy (consistent with policy document)
        - Tone (professional, empathetic)
        - Completeness (addresses the specific question)
        Return JSON: {"accuracy": N, "tone": N, "completeness": N, "overall_pass": bool}
        """,
        context=POLICY_DOCUMENT,
        query=user_query,
        response=response,
    )
    assert judgment["overall_pass"]

LLM-as-judge has known limitations. The judge can be wrong. The judge can be biased toward verbose outputs, or toward outputs that match its own style, or toward outputs that include specific markers it associates with quality. The judge's scoring is itself probabilistic — running it twice on the same output may produce different scores.

What it provides, despite the limitations, is scalable evaluation of complex quality properties that would otherwise require human review. For most teams, the choice is between LLM-as-judge and no evaluation at all on these properties — because human review doesn't scale to thousands of test cases.

Use LLM-as-judge with care:

Use a different model than the system under test. Judging your own outputs is unreliable.
Use a stronger model where possible. A frontier model judging a smaller model's outputs is more reliable than the reverse.
Validate the judge against human ratings. Sample 50 outputs, have humans rate them, compare to the judge's scores. The correlation tells you how trustworthy the judge is for your specific task.
Use multiple judges and aggregate. Three different models judging the same output and aggregating scores is more reliable than one judge.
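The last two points can be sketched in a few lines. This is a minimal illustration, not a full evaluation harness: pearson and aggregate_judges are hypothetical helpers, scores are assumed to be numeric ratings on a shared scale, and the median is just one reasonable aggregation choice among several.

```python
from statistics import mean, median


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5


def aggregate_judges(scores_per_judge: list[list[float]]) -> list[float]:
    """Combine several judges by taking the median score for each output.

    scores_per_judge[j][i] is judge j's score for output i.
    """
    return [median(case_scores) for case_scores in zip(*scores_per_judge)]
```

A typical use is to compute pearson(human_scores, judge_scores) on the human-rated sample and only trust the judge in CI once that correlation is high enough for the task. The median is less sensitive than the mean to one judge that scores an output far out of line with the others.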

The Three Test Layers

For most AI systems, three test layers each catch a different category of bug. Most teams build one or two and discover in production the bugs the missing layer would have caught.

Layer 1: System tests (close to traditional). Test the code that wraps the AI. Does the function build the right prompt? Does it parse the response correctly? Does it handle errors from the API? Does it call the right tool when the agent decides to? These are mostly deterministic — the prompt construction is deterministic given the inputs, the parsing is deterministic, the error handling is deterministic. Traditional unit tests work here.

Layer 2: Behavior tests (property + statistical). Test the AI's behavior on representative inputs. Does it produce summaries of appropriate length? Does it refuse to answer questions it shouldn't? Does it follow the system prompt's instructions about format? These need property assertions and statistical assertions because the AI is involved. This is the eval suite.

Layer 3: Integration tests (end-to-end). Test the system as users would use it. Does the agent complete a refund request end-to-end? Does the RAG system retrieve the right documents and produce a correct answer? These run against the full stack with real (or realistic) data. They're slow, expensive, and catch the bugs that only emerge from integration. Run them on a schedule, not on every commit.

A complete testing strategy has all three layers. Teams that only have layer 1 ship AI behavior that breaks silently. Teams that only have layer 2 ship integration bugs. Teams that only have layer 3 burn money on slow tests that can't isolate failures.
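Layer 1 is the easiest to illustrate because no real model is involved. The sketch below assumes a wrapper that takes the model call as an injected function, so tests can replace it with a stub. build_prompt, parse_response, summarize, and the JSON reply shape are all hypothetical, chosen only to show the pattern.

```python
import json


def build_prompt(complaint: str) -> str:
    """Deterministic prompt construction (hypothetical template)."""
    return f"Summarize this customer complaint in one sentence:\n\n{complaint.strip()}"


def parse_response(raw: str) -> dict:
    """Parse the model's JSON reply, failing loudly on malformed output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned invalid JSON: {raw!r}") from exc
    if not isinstance(data, dict) or "summary" not in data:
        raise ValueError("model reply is missing the 'summary' field")
    return data


def summarize(complaint: str, call_model) -> str:
    """Wrapper under test; call_model is injected so tests can stub it."""
    return parse_response(call_model(build_prompt(complaint)))["summary"]


def test_wrapper_with_stubbed_model():
    seen_prompts = []

    def fake_model(prompt: str) -> str:
        # Record the prompt and return a canned reply instead of calling an API.
        seen_prompts.append(prompt)
        return '{"summary": "Wrong color delivered; refund requested."}'

    result = summarize("  I ordered blue but got red.  ", fake_model)
    assert result == "Wrong color delivered; refund requested."
    assert seen_prompts[0].endswith("I ordered blue but got red.")


def test_malformed_reply_is_rejected():
    try:
        summarize("anything", lambda prompt: "not json")
    except ValueError:
        return
    raise AssertionError("expected ValueError for malformed model output")
```

Because the stub returns a fixed string, equality assertions are safe here. These tests are fast and deterministic, so they can run on every commit.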

What Temperature Does and Doesn't Solve

A common attempted fix for non-deterministic tests is setting the model's temperature to 0 in test environments. Temperature controls how random the sampling is — at 0, the model is supposed to be deterministic, taking the most likely token at each step.

This helps. It doesn't fully solve the problem.

What temperature 0 actually gives you:

Reduced variance across runs on the same input (in practice, not always to zero — some providers don't guarantee determinism even at temperature 0)
Reproducible failures when something goes wrong (you can replay the exact failing call)
Some defense against flaky tests

What temperature 0 doesn't give you:

True determinism across model versions (the same model name can behave differently after a provider update or redeployment)
True determinism across different infrastructure (some inference stacks have hardware-dependent variation)
Useful test coverage of how the model behaves at production-realistic temperature settings (many production systems run at a nonzero temperature, for example somewhere in the 0.3-0.7 range, not 0)

Use temperature 0 in CI tests where it makes the tests more useful (system tests, integration tests where you're testing the wrapper). Don't rely on it to eliminate the need for statistical assertions on behavior tests — your production system isn't running at temperature 0, so your behavior tests shouldn't be either.
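To make the temperature mechanism concrete, here is the standard textbook formulation: divide the logits by the temperature, then apply softmax. This is a sketch of the idea, not a description of any particular provider's inference stack. token_probabilities is a hypothetical name, and treating temperature 0 as a greedy argmax is a common convention rather than a guarantee.

```python
import math


def token_probabilities(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits / temperature; temperature 0 is treated as greedy argmax."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy decoding: all probability mass on the highest-scoring token.
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower temperatures concentrate probability on the top token and higher temperatures flatten the distribution. That is why a test run at temperature 0 exercises only the single most likely path, while production traffic at a nonzero temperature samples from the rest of the distribution too.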

Eval Suites Are the Highest-Leverage Investment

Most teams under-invest in eval suites and over-invest in everything else. The eval suite is the artifact that tells you whether a change (prompt update, model upgrade, tool change) improves or degrades quality. Without it, every change is a guess.

A good eval suite has:

A few hundred representative test cases. Not 10. Not 100,000. A few hundred curated cases that cover the actual diversity of production inputs.
Known-good outputs for each case. Either explicit expected outputs or property checks that capture what "good" means.
Scoring against multiple dimensions. Quality is multi-dimensional; the eval needs to be too.
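Reproducibility. The same eval run on the same model with the same prompt should produce the same scores, or close to it. Pin model versions and fix random seeds. Use temperature 0 where you need run-to-run comparability. Where you are measuring production-realistic behavior, keep the production temperature and average over several samples per case.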
Diff visibility. When you change a prompt or model, the eval shows you per-case differences — which cases improved, which regressed, which were unchanged. This is how you make informed decisions.
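Diff visibility needs very little machinery. The sketch below assumes each eval run has already been reduced to one numeric score per case ID. diff_eval_runs and the report keys are hypothetical names, and the tolerance parameter is an assumption added to absorb small score noise.

```python
def diff_eval_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, list[str]]:
    """Group case IDs by how their score moved between two eval runs."""
    report = {"improved": [], "regressed": [], "unchanged": [], "missing": []}
    for case_id, old_score in baseline.items():
        if case_id not in candidate:
            # The candidate run did not produce a score for this case.
            report["missing"].append(case_id)
            continue
        delta = candidate[case_id] - old_score
        if delta > tolerance:
            report["improved"].append(case_id)
        elif delta < -tolerance:
            report["regressed"].append(case_id)
        else:
            report["unchanged"].append(case_id)
    return report
```

Reviewing the "regressed" list case by case is usually more informative than comparing two average scores, because an unchanged average can hide several regressions offset by improvements elsewhere.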

Companies with a strong eval suite ship AI changes confidently and rarely regress. Companies without one ship changes on hope and discover regressions in production.

The Honest Posture

There is no testing strategy that gives you the confidence in AI systems that traditional unit tests give you in deterministic code. The probabilistic nature of the underlying technology is not a bug; it is the technology. Your testing has to live with it.

What you can do:

Test the deterministic parts deterministically
Test the probabilistic parts statistically
Use property assertions instead of equality assertions for AI outputs
Use LLM-as-judge for complex quality properties
Maintain an eval suite that runs on every prompt, model, or tool change
Run integration tests against the full stack on a schedule
Set temperature appropriately for what each test is actually testing

What you can't do:

Get binary pass/fail confidence on probabilistic outputs
Catch all regressions before production
Have the same test density that traditional code can afford

The trade-off is real. Teams that pretend it isn't either ship flaky tests or skip testing altogether. Teams that accept it ship AI systems that hold up over time.

The Takeaway

Testing AI systems is its own engineering discipline. Traditional unit testing doesn't transfer directly. The probabilistic nature of the technology forces a different testing strategy: property assertions over equality, statistical assertions over single-run, LLM-as-judge for complex quality, three test layers for different categories of bug, and eval suites that maintain quality discipline through changes.

This is more work than testing deterministic code. It is also less work than discovering quality regressions in production. The teams that invest here ship AI confidently. The teams that don't, ship on hope.

There is no shortcut. Build the eval suite. Run the statistical tests. Use the judge where it helps. Accept the trade-offs.

That is what testing looks like when the function under test is a probability distribution.

ShiftQuality