Tutorial 5: A/B Test Prompts in Production

Contributor
Jun 12

Offline eval is a proxy. Production traffic is the real test. This tutorial walks through A/B testing prompts safely.

Step 1: Define the Hypothesis (5 min)

Not "v2 is better." Be specific:

"v2 reduces user follow-up questions by 10%, measured as conversation turns per session."

Hypothesis → metric → success threshold.
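One way to keep the hypothesis honest is to write it down as data before the experiment starts. The sketch below is only an illustration: the `Hypothesis` class, its field names, and the `turns_per_session` metric name are hypothetical, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical record of what the experiment is meant to show."""
    statement: str              # the claim, in plain words
    metric: str                 # name of the primary metric
    min_relative_change: float  # success threshold; negative means "reduce"

    def is_met(self, control_value: float, treatment_value: float) -> bool:
        """True if the observed relative change reaches the threshold."""
        change = (treatment_value - control_value) / control_value
        if self.min_relative_change < 0:
            return change <= self.min_relative_change
        return change >= self.min_relative_change


h = Hypothesis(
    statement="v2 reduces user follow-up questions by 10%",
    metric="turns_per_session",
    min_relative_change=-0.10,
)
```

`is_met` only compares point estimates against the threshold. Whether the difference is statistically significant is a separate question, covered in Step 6.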

Step 2: Pick a Primary Metric (10 min)

Good primary metrics:

User satisfaction (thumbs up rate, feedback)
Task completion (did the user accomplish the goal?)
Conversation length (fewer turns to resolution = better)
Escalation rate (handed to human?)
Latency (responsiveness)

Pick one. Track others as secondary.

Bad metrics:

"Output looks good" (subjective; not measurable)
"More tokens" (longer ≠ better)

Step 3: Bucket Users (10 min)

import hashlib

def get_variant(user_id, experiment_name):
    h = hashlib.md5(f"{user_id}:{experiment_name}".encode()).hexdigest()
    bucket = int(h[:8], 16) % 100
    return "v2" if bucket < 50 else "v1"

Deterministic; same user gets the same variant. 50/50 split.

For risky changes: start at 5% v2 / 95% v1. Ramp up.
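A minimal sketch of a ramp-friendly bucketing function, assuming the same MD5 bucketing as above. The name `get_variant_with_rollout` and the `v2_percent` parameter are introduced here for illustration.

```python
import hashlib


def get_variant_with_rollout(user_id, experiment_name, v2_percent=5):
    """Assign 'v2' to roughly v2_percent% of users, deterministically."""
    if not 0 <= v2_percent <= 100:
        raise ValueError("v2_percent must be between 0 and 100")
    h = hashlib.md5(f"{user_id}:{experiment_name}".encode()).hexdigest()
    bucket = int(h[:8], 16) % 100  # stable bucket in [0, 100)
    return "v2" if bucket < v2_percent else "v1"
```

Each user's bucket is fixed and only the threshold moves, so anyone who saw v2 at 5% still sees v2 at 25%. Raising the percentage never flips a user back to v1 mid-experiment.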

Step 4: Route to Variant (5 min)

def get_prompt(variant):
    # PROMPTS maps a variant name ("v1" / "v2") to its prompt text
    return PROMPTS[variant]

def handle_request(user_id, message):
    # Resolve the variant once so routing and logging cannot disagree
    variant = get_variant(user_id, "answer_quality_v2")
    prompt = get_prompt(variant)
    response = call_llm(prompt, message)
    log_event({
        "user_id": user_id,
        "variant": variant,
        "prompt_version": variant,
        "response": response,
    })
    return response

Log the variant with each event.

Step 5: Log Outcomes (10 min)

# When user gives feedback
def on_feedback(user_id, message_id, rating):
    log_event({
        "type": "feedback",
        "user_id": user_id,
        "message_id": message_id,
        "rating": rating,
        "variant": get_variant(user_id, "answer_quality_v2"),
    })

# When conversation ends
def on_conversation_end(user_id, conversation):
    log_event({
        "type": "conversation_end",
        "user_id": user_id,
        "turns": len(conversation.messages),
        "resolved": conversation.resolved,
        "variant": get_variant(user_id, "answer_quality_v2"),
    })

Tag every event with the variant.
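Once events carry the variant, turning the log into per-variant metrics is a simple aggregation. The sketch below assumes events are dicts shaped like the ones logged above and that `rating` is 1 for thumbs up and 0 for thumbs down. That encoding, and the function name `summarize_by_variant`, are assumptions.

```python
from collections import defaultdict


def summarize_by_variant(events):
    """Aggregate logged events into per-variant metrics.

    Assumes `rating` is 1 (thumbs up) or 0 (thumbs down).
    """
    turns = defaultdict(list)
    ratings = defaultdict(list)
    for e in events:
        kind = e.get("type")  # request events from Step 4 carry no "type"
        if kind == "conversation_end":
            turns[e["variant"]].append(e["turns"])
        elif kind == "feedback":
            ratings[e["variant"]].append(e["rating"])

    summary = {}
    for variant in sorted(set(turns) | set(ratings)):
        t, r = turns[variant], ratings[variant]
        summary[variant] = {
            "conversations": len(t),
            "mean_turns": sum(t) / len(t) if t else None,
            "thumbs_up_rate": sum(r) / len(r) if r else None,
        }
    return summary
```

In practice this would more likely be a query in your analytics store. The point is that every metric is grouped by the logged variant, never by a variant recomputed at analysis time.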

Step 6: Calculate Significance (10 min)

from scipy import stats

def is_significant(v1_outcomes, v2_outcomes, alpha=0.05):
    # Welch's t-test for continuous metrics; does not assume equal variances
    t, p = stats.ttest_ind(v1_outcomes, v2_outcomes, equal_var=False)
    return p < alpha

# For binary outcomes (passed/failed): chi-square test on the 2x2 table
def is_significant_binary(v1_passes, v1_total, v2_passes, v2_total, alpha=0.05):
    p = stats.chi2_contingency([
        [v1_passes, v1_total - v1_passes],
        [v2_passes, v2_total - v2_passes],
    ])[1]  # index 1 is the p-value
    return p < alpha

Don't make decisions on small samples. An underpowered test can miss a real effect, and the "significant" results it does produce are more likely to be noise.

Step 7: Sample Size Estimation (10 min)

def required_sample_size(baseline_rate, min_detectable_effect, power=0.8, alpha=0.05):
    # Use statsmodels or your favorite calculator
    # Rough rule: detecting a 10% effect at 80% power takes on the order of
    # ~2500 per arm; the exact number depends heavily on the baseline rate
    ...

If you can't get enough traffic, the A/B test won't be conclusive. Run it longer or fall back to offline eval.
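For a binary metric, a rough per-arm estimate can be computed with the textbook normal-approximation formula for comparing two proportions. The sketch below uses only the standard library. The function name `approx_sample_size_per_arm` is hypothetical, and it assumes `min_detectable_effect` is an absolute difference in rates (0.05 means 50% → 55%), which the stub above does not specify.

```python
import math
from statistics import NormalDist


def approx_sample_size_per_arm(baseline_rate, min_detectable_effect,
                               power=0.8, alpha=0.05):
    """Approximate users needed per arm for a two-sided two-proportion test.

    Assumption: min_detectable_effect is an absolute change in the rate.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must lie in (0, 1) and differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)
```

Halving the detectable effect roughly quadruples the required sample, which is why small improvements need long experiments. For anything beyond a quick estimate, a dedicated power-analysis tool is the safer choice.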

Step 8: Watch Guardrails (10 min)

Beyond the primary metric, track guardrails:

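guardrails = {
    "latency_p99": 2000,  # ms
    "error_rate": 0.01,  # fraction of requests
    "cost_per_request": 0.05,  # dollars
}

# If v2 exceeds any guardrail, stop the experiment.
# v2_metric() and kill_switch() stand in for your metrics store and flag system.
for metric, threshold in guardrails.items():
    if v2_metric(metric) > threshold:
        kill_switch("answer_quality_v2")
        break  # one breach is enough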

Don't ship a quality win that's 10x more expensive.

Step 9: Read the Results Honestly (10 min)

Beware:

Selection bias: are the two groups actually comparable?
Novelty effect: users react to change, not to quality
Seasonal effects: the experiment ran during a holiday weekend
Sample contamination: users in both groups (cookie loss, etc.)

If unsure, run again. Don't ship based on noisy data.
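One cheap check on whether the groups are comparable is to confirm the observed split matches the intended one; a large deviation usually points at a bucketing or logging bug, not a real effect. The sketch below runs a chi-square goodness-of-fit test with one degree of freedom using only the standard library. The function name and the strict `alpha=0.001` default are assumptions, not a standard.

```python
import math


def sample_ratio_mismatch(n_v1, n_v2, expected_v2_share=0.5, alpha=0.001):
    """Chi-square goodness-of-fit test (1 degree of freedom) on arm sizes.

    Returns (p_value, mismatch_flag).
    """
    total = n_v1 + n_v2
    if total <= 0 or not 0 < expected_v2_share < 1:
        raise ValueError("need users in the experiment and a share in (0, 1)")
    exp_v2 = total * expected_v2_share
    exp_v1 = total - exp_v2
    chi2 = (n_v1 - exp_v1) ** 2 / exp_v1 + (n_v2 - exp_v2) ** 2 / exp_v2
    # For 1 degree of freedom, the chi-square survival function
    # is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p_value, p_value < alpha
```

If the flag comes back true, treat the metric comparison as suspect until the cause of the imbalance is found.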

Step 10: Ramp or Roll Back (5 min)

If v2 wins:

5% → 25% → 50% → 100%

Each step: re-verify metrics. Catch latent issues.

If v2 loses:

Roll back to 100% v1
Document why
Plan v3

Always have a rollback path. Feature flag or config.
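The ramp schedule and the rollback path can live in one small function. This is a sketch under assumptions: `RAMP_STEPS` mirrors the schedule above, `next_rollout_percent` is a hypothetical name, and `metrics_ok` is whatever your re-verification of the primary metric and guardrails returns.

```python
RAMP_STEPS = [5, 25, 50, 100]  # percent of traffic on v2


def next_rollout_percent(current_percent, metrics_ok):
    """Return the next v2 traffic percentage.

    metrics_ok: outcome of re-verifying the primary metric and guardrails
    at the current step (how that is computed is up to you).
    """
    if not metrics_ok:
        return 0  # roll back to 100% v1
    for step in RAMP_STEPS:
        if step > current_percent:
            return step
    return 100  # already fully rolled out
```

The returned value would be written to whatever feature flag or config store controls the split, so a rollback is a config change and not a deploy.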

What You Just Did

Real-world prompt comparison. Hypothesis-driven, bucketed, instrumented, guardrailed.

Common Failure Modes

Too small a sample. Conclusions from noise.

Multiple comparisons. Tested 20 metrics; one shows p<0.05 by chance.

No guardrails. v2 ships with cost regression nobody noticed.

Decision based on day 1. Novelty effect.

Inconsistent variants. Bug routes some v1 users to v2.
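For the multiple-comparisons problem, the simplest guard is a Bonferroni correction: divide alpha by the number of metrics tested. The sketch below is one conservative option among several, and the function name is hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Map each metric name to True if it survives a Bonferroni correction."""
    if not p_values:
        return {}
    threshold = alpha / len(p_values)  # stricter cutoff per metric
    return {name: p < threshold for name, p in p_values.items()}
```

With 20 metrics the per-metric cutoff drops from 0.05 to 0.0025, so a lone p = 0.04 no longer counts as a win. Declaring a single primary metric up front, as in Step 2, avoids most of the problem in the first place.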

Next Tutorial

Detect prompt drift: Tutorial 6: Monitor for Prompt Drift.

ShiftQuality