LLM Evaluation Frameworks: Measuring What You Cannot See

Contributor
Jan 19

The previous post in this path covered running LLMs in production — the infrastructure, cost management, and latency tradeoffs. This post covers the problem that makes all those operational decisions meaningful: evaluation. How do you know if your LLM application is producing good output? How do you know if a change to your prompts, model, or retrieval pipeline made things better or worse?

Traditional software has clear evaluation: the function returns the correct value or it does not. The API returns a 200 or it does not. LLM outputs exist in a space where "correct" is often subjective, context-dependent, and multi-dimensional. A response can be factually accurate but poorly formatted, well-written but hallucinated, helpful but verbose. Evaluation must capture these dimensions, and the evaluation system must be cheap enough to run on every change.

Why Evaluation Is Hard

A classification model predicts labels. You compare predictions to ground truth. Accuracy, precision, recall, F1 — the metrics are well-defined and automatable. You run evaluation on a test set, get a number, and make decisions.

An LLM generates text. The same prompt can produce multiple valid responses. A "good" response depends on the use case, the audience, and sometimes the mood of the evaluator. Two human evaluators looking at the same response will disagree a meaningful percentage of the time.

This does not mean evaluation is impossible. It means evaluation requires more thought, more dimensions, and more humility about what the metrics capture. The goal is not a single number that says "the model is good." The goal is a set of measurements that tell you whether specific quality dimensions are improving or degrading as you make changes.

Evaluation Dimensions

Every LLM application has multiple quality dimensions that should be measured independently.

Correctness. Does the output contain accurate information? For applications with verifiable facts — code generation, data extraction, question answering with source documents — correctness can be partially automated. For open-ended generation, correctness requires human judgment or LLM-as-judge approaches.

Relevance. Does the output address the actual question or task? A perfectly written response that answers a different question than the one asked is a failure. Relevance evaluation checks alignment between input intent and output content.

Completeness. Does the output cover all aspects of the request? A response that answers the main question but ignores a sub-question is incomplete. Completeness is often evaluated against a rubric that defines what a complete response includes.

Faithfulness. For RAG applications, does the output accurately represent the source documents? A response that is well-written and sounds correct but contradicts the retrieved documents, or fabricates information that is not in them, is unfaithful: it is hallucinating.

Harmlessness. Does the output avoid producing harmful, biased, or inappropriate content? This dimension requires both automated classifiers for obvious violations and human review for subtle issues.

Format and style. Does the output match the expected format, length, and tone? A response that is factually correct but written as a casual email when the user expected a formal report fails on style.

Each dimension needs its own metric. Collapsing them into a single "quality score" hides the tradeoffs. A model change might improve correctness while degrading faithfulness — you need to see both effects to make an informed decision.
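
As a minimal sketch of why a collapsed score hides tradeoffs, the Python below compares a baseline and a candidate on three dimensions. The scores and dimension names are invented for illustration, and the equal-weight average is only one assumed way to build a composite.

```python
from statistics import mean

# Hypothetical mean scores in [0, 1] per quality dimension (illustrative numbers only).
baseline = {"correctness": 0.80, "faithfulness": 0.90, "relevance": 0.85}
candidate = {"correctness": 0.90, "faithfulness": 0.80, "relevance": 0.85}

def composite(scores: dict) -> float:
    """Equal-weight average across dimensions: the single 'quality score'."""
    return mean(list(scores.values()))

def per_dimension_delta(before: dict, after: dict) -> dict:
    """Candidate minus baseline for each dimension, rounded for display."""
    return {dim: round(after[dim] - before[dim], 2) for dim in before}

print("composite:", round(composite(baseline), 2), "->", round(composite(candidate), 2))
print("deltas:", per_dimension_delta(baseline, candidate))
```

With these made-up numbers the composite does not move, while the per-dimension view shows correctness rising and faithfulness falling by the same amount.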

Automated Evaluation Methods

Automated evaluation enables rapid iteration. You cannot run human evaluation on every prompt change or every model update — it is too slow and too expensive. Automated methods provide fast, cheap signals that catch regressions and guide development.

Reference-based metrics. When you have a known-good response, you can compare the model's output to the reference. BLEU and ROUGE measure n-gram overlap with the reference, while BERTScore compares contextual embeddings and so tolerates some paraphrase. These are useful for narrow tasks (translation, summarization of specific documents) and less useful for open-ended generation where multiple valid responses exist.
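
To make the limitation concrete, here is a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is a sketch, not the official ROUGE implementation: it assumes whitespace tokenization and lowercasing, with no stemming, and the example sentences are invented.

```python
from collections import Counter

def unigram_f1(reference: str, candidate: str) -> float:
    """F1 over overlapping word counts between a reference and a candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "refunds are issued within 30 days"
paraphrase = "you can get a refund in 30 days"
print(round(unigram_f1(reference, paraphrase), 3))
```

The paraphrase carries the same meaning as the reference but shares only two tokens with it, so it scores poorly. That is the failure mode that makes overlap metrics weak for open-ended generation.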

LLM-as-judge. A separate LLM evaluates the output against specified criteria, using a prompt such as "Rate this response on a scale of 1-5 for factual accuracy, given the source documents." This approach is surprisingly effective when the judge prompt is well-designed and calibrated against human ratings. It scales to thousands of evaluations per hour at a fraction of the cost of human evaluation.

The critical caveat: LLM-as-judge has biases. Models tend to prefer longer responses, prefer their own outputs over other models' outputs, and can be inconsistent across similar inputs. Calibrating the judge — comparing its ratings to human ratings on a held-out set — is essential for trusting the results.
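
One way to run that calibration is to score a held-out set with both humans and the judge, then look at how often they agree and whether the judge runs systematically high or low. The sketch below assumes integer ratings on the same 1-5 scale; the function name and the sample ratings are hypothetical.

```python
def calibration_report(human: list, judge: list) -> dict:
    """Compare judge ratings to human ratings on the same examples."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(human)
    pairs = list(zip(human, judge))
    return {
        # Share of examples where the judge matches the human exactly.
        "exact_agreement": sum(h == j for h, j in pairs) / n,
        # Share of examples where the judge is at most one point off.
        "within_one": sum(abs(h - j) <= 1 for h, j in pairs) / n,
        # Positive values mean the judge rates higher than humans on average.
        "mean_bias": sum(j - h for h, j in pairs) / n,
    }

human_ratings = [5, 3, 4, 2, 1, 4]   # made-up held-out human labels
judge_ratings = [5, 4, 4, 3, 2, 4]   # made-up judge outputs for the same items
print(calibration_report(human_ratings, judge_ratings))
```

A judge that is usually within one point but has a positive mean bias, as in this invented sample, may still be usable for tracking relative changes, though its absolute scores should not be read as human-equivalent.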

Functional tests. For applications with structured outputs — code generation, data extraction, API call generation — functional tests verify that the output works. Does the generated code compile and pass test cases? Does the extracted data match the expected schema? These are high-signal, automatable evaluations for structured tasks.
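
For a data-extraction task, a functional test can be as simple as parsing the output and checking it against the expected fields. The sketch below uses only the standard library; the schema (order_id, amount, currency) is a hypothetical example, not one from a real application.

```python
import json

# Hypothetical expected schema: field name -> accepted Python type(s).
EXPECTED_FIELDS = {"order_id": str, "amount": (int, float), "currency": str}

def check_extraction(raw_output: str) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems

print(check_extraction('{"order_id": "A1", "amount": 19.99, "currency": "USD"}'))
print(check_extraction('{"order_id": "A1", "amount": "19.99"}'))
```

Returning a list of problems rather than a boolean keeps the failure reason visible when the check runs across a whole evaluation set.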

Building an Evaluation Dataset

The evaluation dataset is the foundation of your evaluation system. It is the set of inputs, expected behaviors, and (optionally) reference outputs that you evaluate against.

A good evaluation dataset has several properties. It is representative of production traffic — not just the easy cases but the edge cases, the ambiguous queries, and the adversarial inputs. It is versioned — changes to the dataset are tracked and justified. It is maintained — as the application evolves, the dataset evolves with it.

The construction process: sample inputs from production logs, stratified by difficulty, topic, and edge-case frequency. For each input, define the expected behavior: not necessarily a single correct output, but criteria that a good output should satisfy. For example: "The response should mention the return policy, should not recommend products, and should direct the user to customer support for their specific case."
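
An evaluation case built around criteria rather than a single reference answer might be represented as follows. This is a sketch under strong assumptions: the EvalCase fields are hypothetical, and substring matching is a crude stand-in for criteria that would often need an LLM judge or a human to check properly.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    case_id: str
    user_input: str
    must_include: list[str] = field(default_factory=list)      # phrases a good output contains
    must_not_include: list[str] = field(default_factory=list)  # phrases a good output avoids

def check_case(case: EvalCase, output: str) -> list[str]:
    """Return the criteria the output violates (case-insensitive substring match)."""
    text = output.lower()
    failures = [f"missing: {p}" for p in case.must_include if p.lower() not in text]
    failures += [f"forbidden: {p}" for p in case.must_not_include if p.lower() in text]
    return failures

case = EvalCase(
    case_id="returns-001",
    user_input="Can I return shoes I bought last month?",
    must_include=["return policy", "customer support"],
    must_not_include=["we recommend"],
)
good = "Our return policy allows returns within 30 days. Please contact customer support about your order."
bad = "We recommend trying our new sneakers!"
print(check_case(case, good))
print(check_case(case, bad))
```

Storing cases like this in a versioned file gives each change to the dataset a reviewable diff.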

Start small. An evaluation dataset of 50-100 well-curated examples, with clear criteria, is more valuable than 10,000 examples with vague labels. Expand the dataset as you discover new failure modes in production — every interesting failure should become a new evaluation case.

Evaluation in the Development Loop

Evaluation should be integrated into the development workflow, not run as an afterthought.

Pre-commit evaluation. Before a prompt change or configuration change is merged, run the evaluation suite against the proposed change and compare to the baseline. This is the LLM equivalent of running unit tests before merging code. If evaluation metrics degrade, the change needs investigation before it ships.
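
The comparison against the baseline can be sketched as a small gate that flags any dimension that drops by more than a tolerance. The 0.02 tolerance, the function names, and the scores are assumptions for illustration; a real threshold would depend on how noisy each metric is.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Return {dimension: (old, new)} for scores that dropped by more than `tolerance`."""
    regressions = {}
    for dim, base_score in baseline.items():
        # A dimension missing from the candidate run is treated as a score of 0.0.
        new_score = candidate.get(dim, 0.0)
        if base_score - new_score > tolerance:
            regressions[dim] = (base_score, new_score)
    return regressions

def gate(baseline: dict, candidate: dict, tolerance: float = 0.02) -> int:
    """Print regressions and return a process-style exit code (0 = pass, 1 = fail)."""
    regressions = find_regressions(baseline, candidate, tolerance)
    for dim, (old, new) in regressions.items():
        print(f"REGRESSION {dim}: {old:.2f} -> {new:.2f}")
    return 1 if regressions else 0

baseline_scores = {"correctness": 0.85, "faithfulness": 0.90}   # made-up numbers
candidate_scores = {"correctness": 0.88, "faithfulness": 0.80}  # made-up numbers
exit_code = gate(baseline_scores, candidate_scores)
```

A CI job could pass the returned code to sys.exit so that a regression blocks the merge until someone investigates.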

Continuous evaluation. Run evaluation on a schedule against production traffic samples. This catches degradation from external changes — model provider updates, data drift in retrieval sources, changes in user behavior — that would not be caught by pre-commit evaluation.

A/B evaluation. When comparing two approaches (different prompts, different models, different retrieval strategies), run both against the same evaluation set and compare. Because both approaches see identical inputs, differences in scores reflect the approaches rather than differences in traffic, and the comparison shows which approach is better on which dimensions.
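
Because both approaches are scored on the same inputs, the results can be compared example by example rather than only by averages. A minimal sketch, assuming per-example scores for one dimension listed in the same input order (the scores are invented):

```python
def paired_comparison(scores_a: list, scores_b: list) -> dict:
    """Count per-example wins, losses, and ties between two approaches."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both approaches must be scored on the same examples")
    pairs = list(zip(scores_a, scores_b))
    return {
        "a_wins": sum(a > b for a, b in pairs),
        "b_wins": sum(b > a for a, b in pairs),
        "ties": sum(a == b for a, b in pairs),
    }

# Made-up 1-5 correctness ratings for the same five inputs.
prompt_a = [4, 3, 5, 2, 4]
prompt_b = [3, 3, 4, 4, 4]
print(paired_comparison(prompt_a, prompt_b))
```

Running this once per dimension keeps the comparison multi-dimensional. With small evaluation sets, a narrow win margin may be noise rather than a real difference.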

Human Evaluation at the Right Cadence

Automated evaluation scales. Human evaluation validates. Both are necessary.

The cadence: human evaluation at major decision points — model upgrades, significant prompt redesigns, new feature launches. Not on every change. Human evaluators review a sample of outputs against detailed rubrics and produce ratings that calibrate and validate the automated metrics.

The rubric design matters enormously. "Rate this response from 1-5" produces inconsistent ratings because evaluators interpret the scale differently. "Rate factual accuracy: 1 = contains factual errors, 3 = factually correct but missing context, 5 = factually correct with all relevant context" produces consistent ratings because the criteria are specific.
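
Keeping the rubric as data makes it possible to show human evaluators and an LLM judge exactly the same anchors. The sketch below encodes the accuracy rubric quoted above; the function names and the choice to reject out-of-range ratings are assumptions.

```python
# Anchored scale taken from the rubric example in the text.
ACCURACY_RUBRIC = {
    1: "contains factual errors",
    3: "factually correct but missing context",
    5: "factually correct with all relevant context",
}

def render_rubric(dimension: str, anchors: dict) -> str:
    """Format a rubric as instructions, with anchors in ascending score order."""
    lines = [f"Rate {dimension} using these anchors:"]
    for score in sorted(anchors):
        lines.append(f"  {score} = {anchors[score]}")
    return "\n".join(lines)

def is_valid_rating(rating: int, anchors: dict) -> bool:
    """Accept any integer between the lowest and highest anchored scores."""
    return isinstance(rating, int) and min(anchors) <= rating <= max(anchors)

print(render_rubric("factual accuracy", ACCURACY_RUBRIC))
```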

Inter-annotator agreement — the rate at which two evaluators give the same rating — is the quality metric for human evaluation. If agreement is low, the rubric is ambiguous and the ratings are noise. Refine the rubric until agreement is consistently high before using the ratings to make decisions.
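
Raw agreement can look high by chance when most ratings fall in one category. Cohen's kappa is a standard chance-corrected alternative for two raters. The sketch below computes it from scratch, assuming ratings are treated as unordered categories (a weighted variant would be needed to give partial credit for near misses). The sample ratings are made up.

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("need two non-empty rating lists of equal length")
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater assigned labels independently
    # at their own observed label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

annotator_1 = [5, 3, 3, 1, 5, 3]
annotator_2 = [5, 3, 1, 1, 5, 5]
print(round(cohens_kappa(annotator_1, annotator_2), 2))
```

In this invented sample the raters agree on four of six items, but kappa is noticeably lower than that raw rate because some of the agreement would be expected by chance.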

The Takeaway

LLM evaluation is multi-dimensional, never fully automated, and always evolving. The evaluation framework combines automated metrics for fast feedback (reference-based metrics, LLM-as-judge, functional tests), human evaluation for validation at key decision points, and a curated evaluation dataset that represents real production challenges.

The evaluation system is not the thing that tells you your LLM application is good. It is the thing that tells you whether your changes make it better or worse. That incremental signal — reliable, fast, and multi-dimensional — is what enables systematic improvement instead of hopeful guessing.

Next in the "LLM Production Systems" learning path: We'll cover LLM safety and guardrails in production — how to detect, prevent, and respond to problematic outputs before they reach users.

ShiftQuality