Tutorial 3: LLM-as-Judge for Subjective Criteria
- Contributor
- Jun 7
- 3 min read
Rules can't evaluate writing quality. An LLM judge can: imperfectly, but at scale.
Step 1: When to Use It (5 min)
LLM judges work for:
- Quality of explanation
- Tone appropriateness
- Helpfulness
- Coherence
- Adherence to instructions
Not great for:
- Specific factual correctness (without ground truth)
- Things rules handle better (format, length)
- High-stakes decisions without human review
Step 2: Basic Judge Prompt (10 min)
import json

JUDGE_PROMPT = """
You are evaluating an AI assistant's response.

Task given to the AI:
{task}

AI's response:
{response}

Evaluate the response on:
1. Helpfulness (1-5): does it actually answer the question?
2. Accuracy (1-5): if a verifiable claim is made, is it correct?
3. Tone (1-5): is the tone appropriate?

Output JSON:
{{
  "helpfulness": <1-5>,
  "accuracy": <1-5>,
  "tone": <1-5>,
  "pass": <true/false>,
  "reasoning": "<brief explanation>"
}}
"""

def llm_judge(task, response):
    # The doubled braces above are literal { } after str.format();
    # single braces would be read as placeholders and raise an error.
    prompt = JUDGE_PROMPT.format(task=task, response=response)
    # call_llm stands in for your model client: prompt string in, reply text out
    judge_output = call_llm(prompt)
    # Raises json.JSONDecodeError if the reply is not pure JSON
    return json.loads(judge_output)
The scores are subjective, but the approach scales.
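Judges do not always return clean JSON; a reply may wrap the object in extra prose or give a score outside the requested range. The sketch below is one possible way to parse defensively. The helper name `parse_judge_output` and the choice to return `None` on any problem are assumptions made for illustration, not part of the tutorial's code.

```python
import json
import re

SCORE_KEYS = ("helpfulness", "accuracy", "tone")


def parse_judge_output(raw):
    """Extract and validate the JSON verdict from raw judge text.

    Returns the parsed dict, or None if the output is unusable.
    """
    # Grab the outermost {...} span so prose around the JSON is ignored.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return None
    try:
        verdict = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Every score must be an integer in the 1-5 range the prompt asks for.
    for key in SCORE_KEYS:
        score = verdict.get(key)
        if isinstance(score, bool) or not isinstance(score, int):
            return None
        if not 1 <= score <= 5:
            return None
    # "pass" must be a real boolean, not a string such as "yes".
    if not isinstance(verdict.get("pass"), bool):
        return None
    return verdict
```

A caller could treat `None` as "retry or send to a human" rather than as a failing grade, so that parsing problems are not confused with low-quality responses.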
Step 3: Bias Concerns (5 min)
LLM judges have biases:
- Length bias: longer outputs tend to be rated higher
- Position bias: in pairwise comparison, the first option tends to be rated higher
- Self-preference: the judge prefers outputs from its own model family
- Authority bias: confident-sounding wrong answers tend to be rated higher
Be aware of these biases and design the evaluation to minimize them.
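One rough way to look for length bias is to check whether the judge's scores track response length. The sketch below computes a plain Pearson correlation between word count and score; the function names are hypothetical. A high value is only a hint, since longer answers may genuinely be better, so it is worth comparing against human ratings on the same sample.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def length_score_correlation(responses, scores):
    """Correlate response length (in words) with the judge's score."""
    lengths = [len(r.split()) for r in responses]
    return pearson(lengths, scores)
```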
Step 4: Pairwise Comparison (10 min)
Pairwise comparison is often more reliable than absolute scoring:
def pairwise_judge(task, response_a, response_b):
    prompt = f"""
Task: {task}
Response A: {response_a}
Response B: {response_b}
Which response is better? Output "A", "B", or "tie" on the first line,
then give brief reasoning.
"""
    # Returns the judge's raw reply text (verdict plus reasoning)
    return call_llm(prompt)
Comparing two responses is an easier task for the judge than assigning an absolute rating. Use it for A/B testing prompts.
To avoid position bias:
def pairwise_unbiased(task, a, b):
    # Run twice with positions swapped
    r1 = pairwise_judge(task, a, b)  # a is shown as "Response A"
    r2 = pairwise_judge(task, b, a)  # a is shown as "Response B"
    # Both runs prefer a → strong a
    # The runs disagree → tie (position-dependent)
    # Both runs prefer b → strong b
    ...
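The aggregation step described in the comments above could be sketched as follows. This assumes the judge puts its verdict on the first line of the reply; `extract_verdict` and `combine_swapped` are hypothetical helper names, and treating any unparseable reply as a tie is a design choice of this sketch.

```python
def extract_verdict(judge_text):
    """Return "A", "B", or "tie" from the first line of the judge's reply.

    Assumes the verdict comes first; anything unrecognized counts as a tie.
    """
    lines = judge_text.strip().splitlines()
    first = lines[0].strip().strip('".').lower() if lines else ""
    if first in ("a", "b"):
        return first.upper()
    return "tie"


def combine_swapped(verdict_ab, verdict_ba):
    """Merge two verdicts where the second run had positions swapped.

    verdict_ab: verdict when response a was shown as "Response A".
    verdict_ba: verdict when response b was shown as "Response A".
    Returns "a", "b", or "tie", naming the original responses.
    """
    # Map the swapped run's labels back onto the first run's labels.
    unswapped = {"A": "B", "B": "A", "tie": "tie"}[verdict_ba]
    if verdict_ab == unswapped and verdict_ab != "tie":
        return verdict_ab.lower()
    # Disagreement means the preference depended on position.
    return "tie"
```

With these helpers, the body of `pairwise_unbiased` might return `combine_swapped(extract_verdict(r1), extract_verdict(r2))`.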
Step 5: Multiple Judges (10 min)
For high-stakes evaluations, run the judge several times and aggregate:
from statistics import median

def consensus_judge(task, response, n=3):
    grades = []
    for _ in range(n):
        grades.append(llm_judge(task, response))
    # Aggregate
    return {
        "helpfulness": median([g["helpfulness"] for g in grades]),
        "accuracy": median([g["accuracy"] for g in grades]),
        "tone": median([g["tone"] for g in grades]),
        "pass": sum(g["pass"] for g in grades) > n / 2,  # Majority of n runs
    }
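Run the judge several times and aggregate the results. The median and a majority vote are more robust to a single outlier run than one call is.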
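Repeated runs also show how much the judge disagrees with itself, which can be used to route uncertain cases to a human. The following is a minimal sketch under that assumption; `needs_human_review` and the `max_spread=1` threshold are illustrative choices, not recommended values.

```python
def score_spread(grades, key):
    """Difference between the highest and lowest score for one criterion."""
    values = [g[key] for g in grades]
    return max(values) - min(values)


def needs_human_review(grades, keys=("helpfulness", "accuracy", "tone"), max_spread=1):
    """Flag a case when repeated judge runs disagree noticeably."""
    # A split pass/fail vote means at least one run disagreed on the outcome.
    split_vote = len({g["pass"] for g in grades}) > 1
    # A wide score range on any criterion also signals an unstable judgment.
    wide_spread = any(score_spread(grades, k) > max_spread for k in keys)
    return split_vote or wide_spread
```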
Step 6: Use Specific Examples in Judge Prompt (10 min)
Add few-shot examples to the judge prompt:
JUDGE_PROMPT = """
Examples:
Task: Summarize an email
Response: "The email was sent by John about Q3 results."
Score: 4 (helpful, accurate, but could be more detailed)
Task: Explain a concept
Response: "It's complicated."
Score: 1 (not helpful)
Now evaluate:
Task: {task}
Response: {response}
"""
Scored examples anchor the judge's standards.
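If the examples are stored as data, the prompt can be rendered from them, which makes it easier to add or swap examples later. This sketch assumes a hypothetical example shape (dicts with `task`, `response`, `score`, and `why` keys) and builds the text by joining strings instead of calling `str.format`, so literal braces in a task or response cannot break the template.

```python
def build_few_shot_prompt(examples, task, response):
    """Render a judge prompt with scored examples ahead of the new case.

    examples: list of dicts with "task", "response", "score", "why" keys
    (a hypothetical shape chosen for this sketch).
    """
    parts = ["Examples:"]
    for ex in examples:
        parts.append(
            f"Task: {ex['task']}\n"
            f"Response: {ex['response']}\n"
            f"Score: {ex['score']} ({ex['why']})"
        )
    parts.append(f"Now evaluate:\nTask: {task}\nResponse: {response}")
    # Blank lines between blocks keep the examples visually separate.
    return "\n\n".join(parts)
```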
Step 7: Combine with Rules (10 min)
def grade(output, expected, task):
    grades = {}
    judge_result = {}  # Stays empty when no subjective check is requested
    # Rules first (cheap, deterministic)
    if "must_contain" in expected:
        grades.update(must_contain(output, expected["must_contain"]))
    # LLM judge for subjective
    if "subjective" in expected:
        judge_result = llm_judge(task, output)
        grades.update(judge_result)
    # Combine: read the rule result before the judge's "pass" can mask it
    grades["pass"] = grades.get("contains_all", True) and judge_result.get("pass", True)
    return grades
This is layered grading: rules catch the obvious failures, and the LLM judges quality.
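The `must_contain` rule used in `grade` is not defined in this tutorial. A minimal version might look like the sketch below; the case-insensitive matching and the extra `missing` key are assumptions, while the `contains_all` key matches what `grade` reads.

```python
def must_contain(output, required_phrases):
    """Check that every required phrase appears in the output (case-insensitive)."""
    lowered = output.lower()
    missing = [p for p in required_phrases if p.lower() not in lowered]
    # contains_all is True only when nothing is missing.
    return {"contains_all": not missing, "missing": missing}
```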
Step 8: Validate the Judge (15 min)
Don't trust the judge blindly:
- Generate diverse outputs
- Have humans rate a sample
- Run the judge on the same sample
- Compare the two sets of ratings
If the judge correlates well with the human ratings, use it with confidence. If not, refine the judge prompt.
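The comparison step can be made concrete with a few simple agreement measures. The sketch below shows raw pass/fail agreement, Cohen's kappa (which corrects raw agreement for chance), and the mean absolute gap between scores. The function names are hypothetical, and what counts as "good enough" agreement is left to you.

```python
def pass_agreement(human_pass, judge_pass):
    """Fraction of cases where judge and human give the same pass/fail verdict."""
    matches = sum(h == j for h, j in zip(human_pass, judge_pass))
    return matches / len(human_pass)


def cohens_kappa(human_pass, judge_pass):
    """Chance-corrected agreement on pass/fail labels (1.0 = perfect)."""
    n = len(human_pass)
    observed = pass_agreement(human_pass, judge_pass)
    human_rate = sum(human_pass) / n
    judge_rate = sum(judge_pass) / n
    # Agreement expected if both raters labeled independently at these rates.
    expected = human_rate * judge_rate + (1 - human_rate) * (1 - judge_rate)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


def mean_abs_score_gap(human_scores, judge_scores):
    """Average absolute difference between human and judge scores."""
    gaps = [abs(h - j) for h, j in zip(human_scores, judge_scores)]
    return sum(gaps) / len(gaps)
```

Raw agreement can look high when almost every case passes, which is why a chance-corrected measure is worth reporting alongside it.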
Step 9: Cost Considerations (5 min)
An LLM judge is expensive compared with rules:
- Eval set of 100 cases
- 3 judge runs each for consensus
- 300 LLM calls per eval run
At $0.01 per call, that is $3 per run, which is manageable for periodic full evals.
For high-frequency CI, evaluate a sample instead (10 cases, not 100).
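The arithmetic above, and the sampling idea for CI, can be sketched in a few lines. The function names and the fixed seed are assumptions for illustration; the per-call price is whatever your provider actually charges.

```python
import random


def eval_cost(num_cases, runs_per_case, cost_per_call):
    """Estimated cost of one eval run, in the same currency as cost_per_call."""
    return num_cases * runs_per_case * cost_per_call


def sample_cases(cases, k, seed=0):
    """Pick a reproducible random subset of eval cases for frequent CI runs."""
    rng = random.Random(seed)
    # Never ask for more cases than exist.
    return rng.sample(cases, min(k, len(cases)))


full_run = eval_cost(num_cases=100, runs_per_case=3, cost_per_call=0.01)
ci_run = eval_cost(num_cases=10, runs_per_case=3, cost_per_call=0.01)
```

A fixed seed keeps the CI subset stable between runs, so a score change reflects the prompt and not a different sample. Rotating the seed occasionally avoids overfitting to the same 10 cases.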
Step 10: Iterate the Judge Prompt (10 min)
The judge prompt can itself be evaluated. Iterate:
- Test the judge against human-rated samples
- Tune the judge criteria
- Add few-shot examples to the judge prompt
A good judge is a system in its own right and worth the investment.
What You Just Did
You built an LLM-as-judge for subjective criteria, validated it against human raters, and combined it with rules.
Common Failure Modes
- Trusting the judge blindly. It doesn't always agree with humans.
- Position bias. Pairwise comparison without swapping positions is unreliable.
- Wrong model. The judge needs to be capable enough to evaluate the task; a weaker judge produces noisy results.
- Vague criteria. The judge can't score "good" without specifics.
- No human validation. The judge can be consistently wrong without you noticing.
Next Tutorial
Catch regressions in Tutorial 4: Track Prompt Regression Over Time.
Related reading
Keep learning. This article is part of the AI in Quality & Delivery path in the ShiftQuality Learning Center. Use AI in delivery — and evaluate it honestly — without the hype.


