Tutorial 3: LLM-as-Judge for Subjective Criteria

Contributor
Jun 7

Rules can't evaluate writing quality. An LLM judge can: imperfectly, but at scale.

Step 1: When to Use It (5 min)

LLM judges work for:

Quality of explanation
Tone appropriateness
Helpfulness
Coherence
Adherence to instructions

Not great for:

Specific factual correctness (without ground truth)
Things rules handle better (format, length)
High-stakes decisions without human review
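For contrast, format and length checks need no model at all. A minimal sketch, where `check_format_and_length` and its `max_words` limit are illustrative names and not part of this tutorial's code:

```python
import json

def check_format_and_length(output: str, max_words: int = 200) -> dict:
    """Deterministic checks that a rule handles better than an LLM judge."""
    try:
        json.loads(output)
        is_json = True
    except json.JSONDecodeError:
        is_json = False
    return {
        "valid_json": is_json,
        # Word count by whitespace splitting; the 200-word limit is arbitrary.
        "within_length": len(output.split()) <= max_words,
    }
```

Checks like these are free and deterministic, so they are worth running before any judge call.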

Step 2: Basic Judge Prompt (10 min)

import json

# Literal braces in the JSON template are doubled ({{ }}) so that
# str.format() only substitutes {task} and {response}.
JUDGE_PROMPT = """
You are evaluating an AI assistant's response.

Task given to the AI:
{task}

AI's response:
{response}

Evaluate the response on:
1. Helpfulness (1-5): does it actually answer the question?
2. Accuracy (1-5): if a verifiable claim is made, is it correct?
3. Tone (1-5): is the tone appropriate?

Output JSON:
{{
  "helpfulness": <1-5>,
  "accuracy": <1-5>,
  "tone": <1-5>,
  "pass": <true/false>,
  "reasoning": "<brief explanation>"
}}
"""

def llm_judge(task, response):
    # call_llm is a placeholder for your model client: prompt in, text out.
    prompt = JUDGE_PROMPT.format(task=task, response=response)
    judge_output = call_llm(prompt)
    # Raises json.JSONDecodeError if the judge returns anything but bare JSON.
    return json.loads(judge_output)

The scores are subjective, but the approach scales.
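Judges do not always return bare JSON; the verdict may arrive wrapped in prose or a code fence. The sketch below is one way to extract and validate it before trusting the scores. `parse_judge_output` and `SCORE_KEYS` are hypothetical names, and the checks assume the 1-5 integer scale and boolean `pass` field from the prompt above.

```python
import json
import re

SCORE_KEYS = ("helpfulness", "accuracy", "tone")

def parse_judge_output(raw: str) -> dict:
    """Extract and validate the judge's JSON verdict from raw model text."""
    # Grab the span from the first "{" to the last "}" so surrounding
    # prose or code fences do not break parsing.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge output")
    verdict = json.loads(match.group(0))
    for key in SCORE_KEYS:
        score = verdict.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {score!r}")
    if not isinstance(verdict.get("pass"), bool):
        raise ValueError("pass must be true or false")
    return verdict
```

Failing loudly on a malformed verdict is a deliberate choice here: a silently defaulted score would skew the eval without anyone noticing.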

Step 3: Bias Concerns (5 min)

LLM judges have biases:

Length bias: longer outputs tend to be rated higher
Position bias: in pairwise comparisons, the first option tends to be rated higher
Self-preference: the judge tends to prefer outputs from its own model family
Authority bias: confident-sounding wrong answers tend to be rated higher

Be aware of these, and design the eval to minimize them.
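Length bias is one of the easier ones to probe directly: check whether judge scores rise with response length. The sketch below computes a plain Pearson correlation between word count and score. `length_bias` and `pearson` are illustrative helpers, and a high value is only a hint, since longer answers can be genuinely better.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def length_bias(responses, scores):
    """Correlation between response word count and the judge's score."""
    lengths = [len(r.split()) for r in responses]
    return pearson(lengths, scores)
```

A stronger check is to score a response and a padded copy of it that adds no information; if the padded copy scores higher, the judge is rewarding length itself.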

Step 4: Pairwise Comparison (10 min)

Pairwise comparison is often more reliable than absolute scoring:

def pairwise_judge(task, response_a, response_b):
    prompt = f"""
    Task: {task}
    
    Response A: {response_a}
    Response B: {response_b}
    
    Which response is better? Output: "A", "B", or "tie".
    Brief reasoning.
    """
    return call_llm(prompt)

Choosing the better of two responses is easier than assigning an absolute score. Use it for A/B testing prompts.

To avoid position bias:

def pairwise_unbiased(task, a, b):
    # Run twice with positions swapped
    r1 = pairwise_judge(task, a, b)  # here "A" refers to a
    r2 = pairwise_judge(task, b, a)  # here "A" refers to b

    # Both runs prefer a → strong a
    # Runs disagree → tie (position-dependent)
    # Both runs prefer b → strong b
    ...
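The comments in `pairwise_unbiased` can be filled in roughly as follows. This is a sketch under two assumptions that are not in the original: the judge's reply starts with its verdict ("A", "B", or "tie"), and the judge is passed in as a callable so the logic can be exercised without a model. `parse_verdict` is a hypothetical helper.

```python
def parse_verdict(raw: str) -> str:
    """Hypothetical parser: assumes the reply starts with A, B, or tie.

    Anything unparseable is treated as a tie.
    """
    words = raw.strip().split()
    first = words[0].strip('".,:').lower() if words else ""
    return {"a": "A", "b": "B", "tie": "tie"}.get(first, "tie")

def pairwise_unbiased(task, a, b, judge):
    """Run the judge twice with positions swapped; keep only consistent wins.

    judge(task, first, second) returns raw text naming "A" (first),
    "B" (second), or "tie".
    """
    v1 = parse_verdict(judge(task, a, b))  # "A" means a
    v2 = parse_verdict(judge(task, b, a))  # "A" means b in this run
    # Map the swapped run back to the original labels.
    v2 = {"A": "B", "B": "A", "tie": "tie"}[v2]
    if v1 == v2:
        return v1  # both runs agree: a consistent preference (or tie)
    return "tie"   # the verdict changed with position
```

A judge that always picks the first slot ends up with a tie under this scheme, which is the intended behavior: its preference carried no information about the responses.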

Step 5: Multiple Judges (10 min)

For high-stakes evals:

from statistics import median

def consensus_judge(task, response, n=3):
    # Grade the same response n times
    grades = [llm_judge(task, response) for _ in range(n)]

    # Aggregate: median per score, majority vote for pass
    return {
        "helpfulness": median([g["helpfulness"] for g in grades]),
        "accuracy": median([g["accuracy"] for g in grades]),
        "pass": sum(g["pass"] for g in grades) > n / 2,  # Strict majority for any n
    }

Multiple runs, aggregated, are more robust than a single score.

Step 6: Use Specific Examples in Judge Prompt (10 min)

Few-shot for the judge:

JUDGE_PROMPT = """
Examples:

Task: Summarize an email
Response: "The email was sent by John about Q3 results."
Score: 4 (helpful, accurate, but could be more detailed)

Task: Explain a concept
Response: "It's complicated."
Score: 1 (not helpful)

Now evaluate:
Task: {task}
Response: {response}
"""

The examples anchor the judge's standards.
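If the example set grows, it can be easier to keep the examples as data and assemble the prompt from them. A minimal sketch, with `FEW_SHOT_EXAMPLES` and `build_few_shot_prompt` as illustrative names; the two examples are the ones from the prompt above.

```python
FEW_SHOT_EXAMPLES = [
    {
        "task": "Summarize an email",
        "response": "The email was sent by John about Q3 results.",
        "score": 4,
        "why": "helpful, accurate, but could be more detailed",
    },
    {
        "task": "Explain a concept",
        "response": "It's complicated.",
        "score": 1,
        "why": "not helpful",
    },
]

def build_few_shot_prompt(task, response, examples=FEW_SHOT_EXAMPLES):
    """Assemble the judge prompt from a list of scored examples."""
    parts = ["Examples:", ""]
    for ex in examples:
        parts += [
            f"Task: {ex['task']}",
            f'Response: "{ex["response"]}"',
            f"Score: {ex['score']} ({ex['why']})",
            "",
        ]
    parts += ["Now evaluate:", f"Task: {task}", f"Response: {response}"]
    return "\n".join(parts)
```

Building the string with f-strings instead of `str.format` also means a task or response that happens to contain braces cannot break the template.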

Step 7: Combine with Rules (10 min)

def grade(output, expected, task):
    grades = {}
    judge_result = {}  # stays empty when no subjective check is requested

    # Rules first (cheap, deterministic)
    if "must_contain" in expected:
        # must_contain is assumed to return a dict with a "contains_all" flag
        grades.update(must_contain(output, expected["must_contain"]))

    # LLM judge for subjective
    if "subjective" in expected:
        judge_result = llm_judge(task, output)
        grades.update(judge_result)

    # Combine: pass only if every check that ran passed
    grades["pass"] = grades.get("contains_all", True) and judge_result.get("pass", True)
    return grades

This is layered grading: rules catch the obvious failures, and the LLM judges quality.
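`grade` relies on a `must_contain` helper that this tutorial does not define. One possible version is sketched below; the case-insensitive matching and the extra `missing` field are assumptions, and the only thing `grade` actually needs is the `contains_all` flag.

```python
def must_contain(output: str, required: list[str]) -> dict:
    """Hypothetical rule check: are all required substrings present?"""
    lowered = output.lower()
    # Case-insensitive substring match; collect what is absent for debugging.
    missing = [s for s in required if s.lower() not in lowered]
    return {"contains_all": not missing, "missing": missing}
```

Returning the missing strings alongside the flag makes a failed case easier to diagnose than a bare `False`.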

Step 8: Validate the Judge (15 min)

Don't trust the judge blindly:

Generate diverse outputs
Have humans rate a sample
Run the judge on the same sample
Compare the two sets of ratings

If the judge's scores correlate well with the human ratings, use it with confidence. If not, refine the judge prompt.
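The comparison step can be as simple as a few agreement numbers. The sketch below uses agreement rates and mean absolute difference, not a correlation coefficient, because they are easy to read on a small sample. `judge_agreement` and its `tolerance` parameter are illustrative, and it assumes both score lists are on the same 1-5 scale and in the same order.

```python
def judge_agreement(human_scores, judge_scores, tolerance=1):
    """Compare judge scores to human scores on the same sample."""
    if len(human_scores) != len(judge_scores) or not human_scores:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [abs(h - j) for h, j in zip(human_scores, judge_scores)]
    n = len(diffs)
    return {
        # Share of cases where judge and human gave the same score.
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        # Share of cases where they differ by at most `tolerance` points.
        "within_tolerance": sum(d <= tolerance for d in diffs) / n,
        "mean_abs_diff": sum(diffs) / n,
    }
```

What counts as "good enough" depends on the use case. It helps to also measure how often two human raters agree with each other, since the judge can hardly be expected to beat that.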

Step 9: Cost Considerations (5 min)

An LLM judge is expensive compared to rules:

Eval set of 100 cases
3 judge runs each for consensus
300 LLM calls per eval run

At $0.01 per call, that is $3 per run. Manageable for periodic full evals.

For high-frequency CI, run a sample-based eval (10 cases, not 100).
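The arithmetic and the sampling step can be sketched in a few lines. `eval_cost` and `sample_cases` are illustrative names, the $0.01 figure is the tutorial's example price and not a real quote, and the fixed seed is an assumption made so that successive CI runs grade the same subset.

```python
import random

def eval_cost(n_cases, runs_per_case, cost_per_call):
    """Return (number of LLM calls, total cost) for one eval run."""
    calls = n_cases * runs_per_case
    return calls, calls * cost_per_call

def sample_cases(cases, k, seed=0):
    """Pick k cases with a fixed seed so CI runs stay comparable."""
    rng = random.Random(seed)
    return rng.sample(cases, min(k, len(cases)))

full_calls, full_cost = eval_cost(100, 3, 0.01)  # 300 calls, about $3
ci_calls, ci_cost = eval_cost(10, 3, 0.01)       # 30 calls, about $0.30
```

A fixed sample keeps CI results comparable from commit to commit, at the price of never exercising the other 90 cases until the periodic full run.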

Step 10: Iterate the Judge Prompt (10 min)

The judge prompt can itself be evaluated. Iterate:

Test the judge against human-rated samples
Tune the judge criteria
Add few-shot examples to the judge prompt

A good judge is a system unto itself, and worth the investment.
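One way to make this loop concrete is to score each judge prompt variant by how often its pass/fail verdict matches the human label, then keep the best one. In this sketch, `pick_best_judge_prompt` is a hypothetical helper and `run_judge` is a stand-in for whatever function calls the model with a given prompt template.

```python
def pick_best_judge_prompt(variants, labeled_samples, run_judge):
    """Rank judge prompt variants by agreement with human pass/fail labels.

    variants: dict mapping a variant name to its prompt template
    labeled_samples: list of (task, response, human_pass) tuples
    run_judge: callable (template, task, response) -> bool; a stand-in
        for the real judge call
    """
    results = {}
    for name, template in variants.items():
        hits = sum(
            run_judge(template, task, response) == human_pass
            for task, response, human_pass in labeled_samples
        )
        results[name] = hits / len(labeled_samples)
    best = max(results, key=results.get)
    return best, results
```

Hold back some human-rated samples that are never used for tuning; otherwise the winning prompt may simply be overfit to the samples it was tuned on.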

What You Just Did

You built an LLM-as-judge for subjective criteria, validated it against human raters, and combined it with rules.

Common Failure Modes

Trusting the judge blindly. It doesn't always agree with humans.

Position bias. Pairwise comparison without swapping positions is unreliable.

Wrong model. The judge needs to be capable enough to evaluate the task; a weaker judge gives noisy results.

Vague criteria. The judge can't score "good" without specifics.

No human validation. The judge can be consistently wrong without you noticing.

Next Tutorial

Catch regressions in Tutorial 4: Track Prompt Regression Over Time.

ShiftQuality