Tutorial 5: A/B Test Prompts in Production
- Contributor
- Jun 12
Offline eval is a proxy. Production traffic is the real test. This tutorial walks through A/B testing prompts safely.
Step 1: Define the Hypothesis (5 min)
Not "v2 is better." Be specific:
"v2 reduces user follow-up questions by 10%, measured as conversation turns per session."
Hypothesis → metric → success threshold.
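One way to keep the hypothesis honest is to write it down as data before the experiment starts. The sketch below is illustrative only: the `Hypothesis` class, its field names, and the `turns_per_session` metric name are assumptions, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical record of what the experiment is supposed to show."""
    experiment: str
    metric: str
    direction: str              # "decrease" or "increase"
    min_relative_change: float  # 0.10 means 10%

    def is_met(self, baseline: float, treatment: float) -> bool:
        # Compares point estimates only; significance is a separate check.
        if baseline == 0:
            raise ValueError("baseline must be non-zero")
        change = (treatment - baseline) / baseline
        if self.direction == "decrease":
            return change <= -self.min_relative_change
        return change >= self.min_relative_change


# The example hypothesis from this step, expressed as data.
hypothesis = Hypothesis(
    experiment="answer_quality_v2",
    metric="turns_per_session",
    direction="decrease",
    min_relative_change=0.10,
)
```

Writing the threshold down first makes it harder to move the goalposts after the data comes in.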
Step 2: Pick a Primary Metric (10 min)
Good primary metrics:
User satisfaction (thumbs up rate, feedback)
Task completion (did the user accomplish the goal?)
Conversation length (fewer turns to resolution = better)
Escalation rate (handed to human?)
Latency (responsiveness)
Pick one. Track others as secondary.
Bad metrics:
"Output looks good" (subjective; not measurable)
"More tokens" (longer ≠ better)
Step 3: Bucket Users (10 min)
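import hashlib

def get_variant(user_id, experiment_name):
    # Hash user and experiment together so each experiment buckets users independently.
    h = hashlib.md5(f"{user_id}:{experiment_name}".encode()).hexdigest()
    bucket = int(h[:8], 16) % 100  # stable bucket in 0..99
    return "v2" if bucket < 50 else "v1"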
Deterministic; same user gets the same variant. 50/50 split.
For risky changes: start at 5% v2 / 95% v1. Ramp up.
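A gradual ramp can reuse the same hashing idea with a configurable threshold. This is a minimal sketch; the function name `get_variant_with_rollout` and the `v2_percent` parameter are assumptions added for illustration.

```python
import hashlib


def get_variant_with_rollout(user_id, experiment_name, v2_percent):
    """Assign v2 to roughly v2_percent of users, deterministically."""
    if not 0 <= v2_percent <= 100:
        raise ValueError("v2_percent must be between 0 and 100")
    digest = hashlib.md5(f"{user_id}:{experiment_name}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in 0..99
    return "v2" if bucket < v2_percent else "v1"
```

Because each user's bucket never changes, raising `v2_percent` only moves users from v1 to v2. Nobody already on v2 gets flipped back during a ramp.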
Step 4: Route to Variant (5 min)
# PROMPTS, call_llm, and log_event are defined elsewhere in your application.
def get_prompt(user_id):
    variant = get_variant(user_id, "answer_quality_v2")
    return PROMPTS[variant]

def handle_request(user_id, message):
    # Resolve the variant once so the prompt used and the variant logged cannot disagree.
    variant = get_variant(user_id, "answer_quality_v2")
    prompt = PROMPTS[variant]
    response = call_llm(prompt, message)
    log_event({
        "user_id": user_id,
        "variant": variant,
        "prompt_version": variant,
        "response": response,
    })
    return response
Log the variant with each event.
Step 5: Log Outcomes (10 min)
# When user gives feedback
def on_feedback(user_id, message_id, rating):
    log_event({
        "type": "feedback",
        "user_id": user_id,
        "message_id": message_id,
        "rating": rating,
        "variant": get_variant(user_id, "answer_quality_v2"),
    })

# When conversation ends
def on_conversation_end(user_id, conversation):
    log_event({
        "type": "conversation_end",
        "user_id": user_id,
        "turns": len(conversation.messages),
        "resolved": conversation.resolved,
        "variant": get_variant(user_id, "answer_quality_v2"),
    })
Tag every event with the variant.
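Once events carry the variant, computing a per-variant metric is a simple group-by. The sketch below assumes events are plain dicts shaped like the ones logged above and that `rating` is the string "up" or "down". Both are assumptions; adapt them to your logging schema.

```python
from collections import defaultdict


def thumbs_up_rate_by_variant(events):
    """Return {variant: share of feedback events rated "up"}."""
    counts = defaultdict(lambda: {"up": 0, "total": 0})
    for event in events:
        if event.get("type") != "feedback":
            continue  # ignore conversation_end and other event types
        tally = counts[event["variant"]]
        tally["total"] += 1
        if event["rating"] == "up":
            tally["up"] += 1
    return {variant: c["up"] / c["total"] for variant, c in counts.items()}
```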
Step 6: Calculate Significance (10 min)
from scipy import stats

def is_significant(v1_outcomes, v2_outcomes, alpha=0.05):
    # Welch's t-test for continuous metrics; does not assume equal variances.
    t, p = stats.ttest_ind(v1_outcomes, v2_outcomes, equal_var=False)
    return p < alpha

# For binary outcomes (passed/failed)
def is_significant_binary(v1_passes, v1_total, v2_passes, v2_total, alpha=0.05):
    # chi2_contingency returns (statistic, p-value, dof, expected); index 1 is the p-value.
    p = stats.chi2_contingency([
        [v1_passes, v1_total - v1_passes],
        [v2_passes, v2_total - v2_passes],
    ])[1]
    return p < alpha
Don't make decisions on small samples. An underpowered test cannot reliably detect the effect you care about.
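A yes/no significance flag hides how large the effect is. As a complement, here is a minimal standard-library sketch that reports the difference in rates with a normal-approximation (Wald) confidence interval. The function name is an assumption, and the approximation is poor for very small samples or rates near 0 or 1.

```python
from math import sqrt
from statistics import NormalDist


def rate_difference_ci(v1_passes, v1_total, v2_passes, v2_total, confidence=0.95):
    """Return (diff, low, high) for the v2 rate minus the v1 rate."""
    p1 = v1_passes / v1_total
    p2 = v2_passes / v2_total
    # Standard error of the difference between two independent proportions.
    se = sqrt(p1 * (1 - p1) / v1_total + p2 * (1 - p2) / v2_total)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p2 - p1
    return diff, diff - z * se, diff + z * se
```

If the interval excludes zero but its lower bound is smaller than the effect you hypothesized, the result is significant yet may still fall short of your success threshold.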
Step 7: Sample Size Estimation (10 min)
def required_sample_size(baseline_rate, min_detectable_effect, power=0.8, alpha=0.05):
    # Use statsmodels or your favorite calculator
    # Rough rule: for 10% effect detection at 80% power: ~2500 per arm
    # (the exact figure depends on the baseline rate)
    ...
If you can't get enough traffic, A/B won't be conclusive. Run longer or use offline eval.
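For a binary metric, the textbook two-proportion formula gives a usable estimate without extra dependencies. This is a sketch under stated assumptions: the function name is hypothetical, the effect is treated as a relative change on the baseline rate, and the test is two-sided with equal-sized arms.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(baseline_rate, relative_effect, power=0.8, alpha=0.05):
    """Approximate users needed per arm to detect a relative change in a rate."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_effect)
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must lie strictly between 0 and 1 and differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)
```

With these assumptions, a 10% relative lift needs roughly 1,600 users per arm at a 50% baseline and roughly 2,400 at a 40% baseline. Lower baselines and smaller effects push the number up quickly.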
Step 8: Watch Guardrails (10 min)
Beyond the primary metric, track guardrails:
guardrails = {
    "latency_p99": 2000,       # ms
    "error_rate": 0.01,
    "cost_per_request": 0.05,  # dollars
}

# If v2 exceeds any guardrail, stop the experiment
for metric, threshold in guardrails.items():
    if v2_metric(metric) > threshold:
        kill_switch("answer_quality_v2")
Don't ship a quality win that's 10x more expensive.
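The guardrail values have to come from somewhere. Here is one possible sketch of computing them from raw v2 request data and listing which limits were breached. The helper names, the input shape, and the nearest-rank percentile method are all assumptions; a metrics backend would normally do this for you.

```python
import math

GUARDRAILS = {
    "latency_p99": 2000,       # ms
    "error_rate": 0.01,
    "cost_per_request": 0.05,  # dollars
}


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def violated_guardrails(latencies_ms, error_count, request_count, total_cost):
    """Return the sorted names of guardrails that v2 currently exceeds."""
    observed = {
        "latency_p99": percentile(latencies_ms, 99),
        "error_rate": error_count / request_count,
        "cost_per_request": total_cost / request_count,
    }
    return sorted(name for name, limit in GUARDRAILS.items() if observed[name] > limit)
```

Returning the list of violations, rather than a bare boolean, makes the kill-switch decision easier to explain afterwards.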
Step 9: Read the Results Honestly (10 min)
Beware:
Selection bias: are the two groups actually comparable?
Novelty effect: are users reacting to the change itself rather than to quality?
Seasonal effects: did the experiment run during a holiday weekend or another atypical period?
Sample contamination: did some users end up in both groups (cookie loss, etc.)?
If unsure, run again. Don't ship based on noisy data.
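A cheap first check for selection bias or contamination is a sample ratio mismatch test: does the observed split match the split you configured? The sketch below uses a normal approximation to the binomial. The function name and the strict default `alpha=0.001` are assumptions; a strict threshold is a common choice for this kind of check.

```python
from math import sqrt
from statistics import NormalDist


def sample_ratio_mismatch(v1_users, v2_users, expected_v2_share=0.5, alpha=0.001):
    """Return (mismatch_detected, p_value) for the observed assignment counts."""
    total = v1_users + v2_users
    expected_v2 = total * expected_v2_share
    # Standard deviation of the v2 count if assignment follows the expected share.
    se = sqrt(total * expected_v2_share * (1 - expected_v2_share))
    z = (v2_users - expected_v2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha, p_value
```

A detected mismatch usually points to a bucketing or logging bug, so investigate it before reading the outcome metrics at all.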
Step 10: Ramp or Roll Back (5 min)
If v2 wins:
5% → 25% → 50% → 100%
Each step: re-verify metrics. Catch latent issues.
If v2 loses:
Roll back to 100% v1
Document why
Plan v3
Always have a rollback path. Feature flag or config.
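The ramp and the rollback can live in one small function so there is a single place that decides the next traffic share. This is a sketch; `RAMP_STEPS` mirrors the schedule above, and the function name and the `metrics_ok` flag are assumptions.

```python
RAMP_STEPS = [5, 25, 50, 100]  # percent of traffic on v2


def next_rollout_percent(current_percent, metrics_ok):
    """Advance one ramp step if metrics held up, otherwise roll back to 0% v2."""
    if not metrics_ok:
        return 0  # rollback path: 100% v1
    for step in RAMP_STEPS:
        if step > current_percent:
            return step
    return 100  # already fully ramped
```

The returned percentage would feed whatever feature flag or config value controls the split, so a rollback is a config change rather than a deploy.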
What You Just Did
Real-world prompt comparison. Hypothesis-driven, bucketed, instrumented, guardrailed.
Common Failure Modes
Too small a sample. Conclusions from noise.
Multiple comparisons. Tested 20 metrics; one shows p<0.05 by chance.
No guardrails. v2 ships with cost regression nobody noticed.
Decision based on day 1. Novelty effect.
Inconsistent variants. Bug routes some v1 users to v2.
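For the multiple-comparisons problem, one standard remedy is the Holm-Bonferroni step-down correction, which controls the chance of any false positive across all tested metrics. A minimal sketch, with a hypothetical function name and made-up metric names in the usage:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Map each metric name to True if it stays significant after Holm's correction."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    significant = {name: False for name in p_values}
    for rank, (name, p) in enumerate(ordered):
        # The smallest p-value faces alpha/m, the next alpha/(m-1), and so on.
        if p <= alpha / (m - rank):
            significant[name] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return significant


results = holm_bonferroni({
    "thumbs_up_rate": 0.001,
    "turns_per_session": 0.03,
    "escalation_rate": 0.04,
    "latency": 0.20,
})
```

Here only `thumbs_up_rate` survives: 0.03 and 0.04 look significant on their own but not once four metrics are tested together. The primary metric chosen in Step 2 is still the one that decides the experiment.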
Next Tutorial
Detect prompt drift: Tutorial 6: Monitor for Prompt Drift.
Related reading
Keep learning. This article is part of the AI in Quality & Delivery path in the ShiftQuality Learning Center. Use AI in delivery — and evaluate it honestly — without the hype.


