Tutorial 5: A/B Test Prompts in Production

  • Contributor
  • Jun 12
  • 3 min read

Offline eval is a proxy. Production traffic is the real test. This tutorial walks through A/B testing prompts safely.

Step 1: Define the Hypothesis (5 min)

Not "v2 is better." Be specific:

"v2 reduces user follow-up questions by 10%, measured as conversation turns per session."

Hypothesis → metric → success threshold.
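One way to keep the hypothesis from drifting is to write it down as data before the experiment starts. A minimal sketch, assuming a hypothetical `Hypothesis` record; the class, field names, and metric name are illustrative and not part of any library:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical record tying a claim to one metric and one threshold."""
    claim: str
    metric: str
    min_relative_change: float  # negative means "we expect a reduction"

    def is_met(self, v1_value: float, v2_value: float) -> bool:
        # Relative change of v2 measured against the v1 baseline.
        change = (v2_value - v1_value) / v1_value
        if self.min_relative_change < 0:
            return change <= self.min_relative_change
        return change >= self.min_relative_change

h = Hypothesis(
    claim="v2 reduces user follow-up questions",
    metric="conversation_turns_per_session",
    min_relative_change=-0.10,
)
```

This only checks the success threshold. Whether the observed change is distinguishable from noise is a separate question.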

Step 2: Pick a Primary Metric (10 min)

Good primary metrics:

  • User satisfaction (thumbs up rate, feedback)

  • Task completion (did the user accomplish the goal?)

  • Conversation length (fewer turns to resolution = better)

  • Escalation rate (handed to human?)

  • Latency (responsiveness)

Pick one. Track others as secondary.

Bad metrics:

  • "Output looks good" (subjective; not measurable)

  • "More tokens" (longer ≠ better)

Step 3: Bucket Users (10 min)

import hashlib

def get_variant(user_id, experiment_name):
    h = hashlib.md5(f"{user_id}:{experiment_name}".encode()).hexdigest()
    bucket = int(h[:8], 16) % 100
    return "v2" if bucket < 50 else "v1"

Deterministic; same user gets the same variant. 50/50 split.

For risky changes: start at 5% v2 / 95% v1. Ramp up.
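The ramp can reuse the same hash bucket with a configurable cutoff. A sketch, assuming a hypothetical `v2_percent` parameter that would normally come from a feature flag or config:

```python
import hashlib

def get_variant_ramped(user_id, experiment_name, v2_percent=5):
    """Return "v2" for roughly v2_percent of users, "v1" for the rest."""
    if not 0 <= v2_percent <= 100:
        raise ValueError("v2_percent must be between 0 and 100")
    h = hashlib.md5(f"{user_id}:{experiment_name}".encode()).hexdigest()
    bucket = int(h[:8], 16) % 100  # stable bucket in 0..99
    return "v2" if bucket < v2_percent else "v1"
```

Because the bucket is fixed per user, raising `v2_percent` only moves users from v1 to v2. Nobody already on v2 gets flipped back during a ramp.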

Step 4: Route to Variant (5 min)

# PROMPTS, call_llm, and log_event are assumed to be defined elsewhere
def get_prompt(variant):
    return PROMPTS[variant]

def handle_request(user_id, message):
    # Compute the variant once so routing and logging cannot disagree
    variant = get_variant(user_id, "answer_quality_v2")
    prompt = get_prompt(variant)
    response = call_llm(prompt, message)
    log_event({
        "user_id": user_id,
        "variant": variant,
        "prompt_version": variant,
        "response": response,
    })
    return response

Log the variant with each event.

Step 5: Log Outcomes (10 min)

# When user gives feedback
def on_feedback(user_id, message_id, rating):
    log_event({
        "type": "feedback",
        "user_id": user_id,
        "message_id": message_id,
        "rating": rating,
        "variant": get_variant(user_id, "answer_quality_v2"),
    })

# When conversation ends
def on_conversation_end(user_id, conversation):
    log_event({
        "type": "conversation_end",
        "user_id": user_id,
        "turns": len(conversation.messages),
        "resolved": conversation.resolved,
        "variant": get_variant(user_id, "answer_quality_v2"),
    })

Tag every event with the variant.
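Once events carry the variant, per-variant metrics are a simple group-by. A sketch, assuming the events are available as a list of dicts shaped like the `conversation_end` events above; the `summarize_by_variant` name and the sample events are made up for illustration:

```python
from collections import defaultdict
from statistics import mean

def summarize_by_variant(events):
    """Group conversation_end events by variant and compute per-variant metrics."""
    turns = defaultdict(list)
    resolved = defaultdict(list)
    for e in events:
        if e.get("type") != "conversation_end":
            continue  # feedback and other event types are ignored here
        turns[e["variant"]].append(e["turns"])
        resolved[e["variant"]].append(1 if e["resolved"] else 0)
    return {
        v: {
            "sessions": len(turns[v]),
            "mean_turns": mean(turns[v]),
            "resolution_rate": mean(resolved[v]),
        }
        for v in turns
    }

# Made-up events, only to show the shape of the input.
sample_events = [
    {"type": "conversation_end", "variant": "v1", "turns": 6, "resolved": True},
    {"type": "conversation_end", "variant": "v1", "turns": 8, "resolved": False},
    {"type": "conversation_end", "variant": "v2", "turns": 4, "resolved": True},
    {"type": "feedback", "variant": "v2", "rating": 1},
]
summary = summarize_by_variant(sample_events)
```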

Step 6: Calculate Significance (10 min)

from scipy import stats

def is_significant(v1_outcomes, v2_outcomes, alpha=0.05):
    # Welch's t-test for continuous metrics (does not assume equal variances)
    t, p = stats.ttest_ind(v1_outcomes, v2_outcomes, equal_var=False)
    return p < alpha

# For binary outcomes (passed/failed): chi-squared test on the 2x2 table
def is_significant_binary(v1_passes, v1_total, v2_passes, v2_total, alpha=0.05):
    chi2, p, dof, expected = stats.chi2_contingency([
        [v1_passes, v1_total - v1_passes],
        [v2_passes, v2_total - v2_passes],
    ])
    return p < alpha

Don't make decisions on small samples. You need enough statistical power to detect the effect you care about.
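A yes/no significance flag hides how large the effect is and how uncertain the estimate is. A sketch of the difference in pass rates with a normal-approximation (Wald) interval, using only the standard library; the function name and the counts are illustrative assumptions, not results from a real experiment:

```python
from math import sqrt
from statistics import NormalDist

def rate_difference_ci(v1_passes, v1_total, v2_passes, v2_total, alpha=0.05):
    """Difference in pass rates (v2 - v1) with a Wald confidence interval."""
    p1 = v1_passes / v1_total
    p2 = v2_passes / v2_total
    # Standard error of the difference between two independent proportions.
    se = sqrt(p1 * (1 - p1) / v1_total + p2 * (1 - p2) / v2_total)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p2 - p1
    return diff, (diff - z * se, diff + z * se)

# Illustrative counts only.
diff, (low, high) = rate_difference_ci(400, 1000, 450, 1000)
```

A wide interval that barely excludes zero is a hint to keep collecting data, even when the test says "significant".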

Step 7: Sample Size Estimation (10 min)

def required_sample_size(baseline_rate, min_detectable_effect, power=0.8, alpha=0.05):
    # Use statsmodels or your favorite calculator
    # Rough rule: detecting a 10% effect at 80% power takes ~2500 per arm;
    # the exact number depends on the baseline rate
    ...

If you can't get enough traffic, A/B won't be conclusive. Run longer or use offline eval.
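For a binary metric, the per-arm sample size can be approximated with the standard two-proportion formula. A sketch using only the standard library; it assumes `min_detectable_effect` is an absolute difference in rates and a two-sided test, and the function name is hypothetical:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline_rate, min_detectable_effect, power=0.8, alpha=0.05):
    """Approximate users per arm for a two-sided two-proportion z-test.

    min_detectable_effect is an absolute difference in rates
    (0.04 means 4 percentage points); that is an assumption of this sketch.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must be in (0, 1) and differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

n = sample_size_per_arm(0.40, 0.04)  # 40% baseline, 4-point lift (10% relative)
```

For this example the result lands in the low thousands per arm. Halving the detectable effect roughly quadruples the requirement, which is why small lifts need a lot of traffic.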

Step 8: Watch Guardrails (10 min)

Beyond the primary metric, track guardrails:

guardrails = {
    "latency_p99": 2000,  # ms
    "error_rate": 0.01,
    "cost_per_request": 0.05,  # dollars
}

# If v2 exceeds any guardrail, stop the experiment
for metric, threshold in guardrails.items():
    if v2_metric(metric) > threshold:
        kill_switch("answer_quality_v2")
        break  # one breach is enough; no need to check the rest

Don't ship a quality win that's 10x more expensive.
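The guardrail values have to come from somewhere. A sketch of computing them from per-request records and listing any breaches; the record fields (`latency_ms`, `error`, `cost`) and the function names are assumptions for illustration, and the caller decides whether to trigger the kill switch:

```python
from math import ceil

GUARDRAILS = {
    "latency_p99": 2000,       # ms
    "error_rate": 0.01,
    "cost_per_request": 0.05,  # dollars
}

def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer in 1..100."""
    ordered = sorted(values)
    rank = max(1, ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]

def v2_metrics(requests):
    """requests: list of dicts with latency_ms, error (bool), cost (dollars)."""
    n = len(requests)
    return {
        "latency_p99": percentile([r["latency_ms"] for r in requests], 99),
        "error_rate": sum(1 for r in requests if r["error"]) / n,
        "cost_per_request": sum(r["cost"] for r in requests) / n,
    }

def breached_guardrails(metrics, guardrails=GUARDRAILS):
    """Return the names of guardrails whose threshold is exceeded."""
    return [name for name, limit in guardrails.items() if metrics[name] > limit]
```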

Step 9: Read the Results Honestly (10 min)

Beware:

  • Selection bias: are the two groups actually comparable?

  • Novelty effect: users react to change, not to quality

  • Seasonal effects: the experiment ran during a holiday weekend, when traffic is not typical

  • Sample contamination: users in both groups (cookie loss, etc.)

If unsure, run again. Don't ship based on noisy data.
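Sample contamination is one of the few items in this list that can be checked mechanically from the logs. A sketch, assuming events are dicts carrying `user_id` and `variant` as in the logging steps; the function name is hypothetical:

```python
from collections import defaultdict

def contaminated_users(events):
    """Return user_ids that were logged under more than one variant."""
    seen = defaultdict(set)
    for e in events:
        seen[e["user_id"]].add(e["variant"])
    return {uid for uid, variants in seen.items() if len(variants) > 1}
```

Run this on events from a single ramp stage. When the split changes, some users legitimately move from v1 to v2 and would show up here.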

Step 10: Ramp or Roll Back (5 min)

If v2 wins:

5% → 25% → 50% → 100%

Each step: re-verify metrics. Catch latent issues.

If v2 loses:

Roll back to 100% v1
Document why
Plan v3

Always have a rollback path. Feature flag or config.
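The ramp and rollback rules can live in one small function driven by config. A sketch, assuming the rollout percentage is stored in a feature flag and that a separate check produces `metrics_ok`; both names are hypothetical:

```python
RAMP_STEPS = [5, 25, 50, 100]  # percent of traffic on v2

def next_rollout_percent(current, metrics_ok):
    """Advance one ramp step if metrics hold; otherwise roll back to 0% v2."""
    if not metrics_ok:
        return 0
    for step in RAMP_STEPS:
        if step > current:
            return step
    return current  # already at 100%
```

Calling it once per verification window keeps each step gated on a fresh metrics check, and a failed check at any stage returns all traffic to v1.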

What You Just Did

Real-world prompt comparison. Hypothesis-driven, bucketed, instrumented, guardrailed.

Common Failure Modes

Too small a sample. Conclusions from noise.

Multiple comparisons. Tested 20 metrics; one shows p<0.05 by chance.

No guardrails. v2 ships with cost regression nobody noticed.

Decision based on day 1. Novelty effect.

Inconsistent variants. Bug routes some v1 users to v2.
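The multiple-comparisons failure mode is easy to quantify. A sketch that computes the chance of at least one false positive across independent tests and applies a Bonferroni correction; Bonferroni is one common, conservative option chosen here for illustration, not the only one:

```python
def familywise_error_rate(num_metrics, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_metrics

def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value against the Bonferroni-adjusted threshold alpha / m."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```

With 20 independent metrics at alpha 0.05, the chance of at least one spurious "win" is about 64%. This is why a single primary metric is chosen up front.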

Keep learning. This article is part of the AI in Quality & Delivery path in the ShiftQuality Learning Center. Use AI in delivery — and evaluate it honestly — without the hype.
