Everyone who has deployed an LLM application to production knows what it looks like: the demo works brilliantly, stakeholders are excited, the first responses seem convincing. And then a customer comes in with a question you never tested, and the model answers with the confidence of an expert — and it's simply wrong. Or after a prompt update you discover that an edge case that used to work has stopped working. You don't know when. You don't know why.
The problem isn't the model. The problem is the absence of measurement. This article is about how to build eval infrastructure that tells you — with numbers, not gut feeling — whether your LLM application is better or worse than yesterday.
Why "looks good" is not a metric
In traditional software you have unit tests, integration tests, a CI pipeline. Before every merge you know whether something has broken. LLM outputs are probabilistic — the same input can produce a slightly different output, and a wrong answer doesn't raise an exception. Failure is silent.
Teams without an eval process rely on three people who click through the application every now and then and say "seems OK." This works up to the first 200 users. Then the use-case space expands, models change, prompts get tweaked — and nobody knows what just broke.
Eval (evaluation) is the systematic measurement of LLM output quality against defined criteria. Not once, but as a continuous process: before deployment, after every prompt change, after every model upgrade, regularly in production.
Three types of evals — where each belongs
Before we get to tooling, it's important to distinguish three different contexts where evaluation takes place:
Offline eval — run before deployment on a fixed dataset. You get results quickly and reproducibly. Suitable for regression tests in CI/CD.
Online eval — runs in production on real queries, typically on a sample. Reveals distributional shift: real users behave differently from test scenarios.
Fine-tuning eval — a specific case where you measure whether a fine-tuned model is better than the base model on target tasks. Covered separately in the article How to Measure Whether Fine-tuning Helped. Out of scope here.
RAG eval — measures faithfulness (did the model answer based on what it had available?), answer relevancy, and context recall. The standard is the RAGAS framework, described in How to Evaluate RAG (RAGAS). Also out of scope here.
This article focuses on offline and online eval for a production LLM application — chatbot, copilot, agent.
How to build an eval set from real cases
The most common mistake: the eval set is created by an AI engineer in a weekend brainstorm. The result covers what they think are edge cases — not what real users actually ask.
The right approach:
- 1.Collect real logs from day one. Log every query and every response (with PII anonymisation). This is your gold mine.
- 2.Identify failures. Manually review the first 200–500 real interactions. Where did the model answer poorly? Where did it refuse to answer even though it should? Where was it needlessly vague?
- 3.Categorise failures by type. Example categories: fact hallucination, incorrect refusal, wrong response format, relevant but incomplete answer, safety bypass.
- 4.Select representative cases from each category. Minimum golden set: 100–300 examples with the correct answer or evaluation criterion. For a production system: 500–1,000+ cases.
- 5.Attach ground truth or a rubric to every example. For simple cases: a reference answer or expected output. For complex cases: a rubric (a set of criteria the answer must satisfy).
The golden set is not static. Every month add new failures from production. An old set you don't update stops reflecting real user behaviour.
Metrics — precise when you have a reference
For unambiguous tasks (extraction, classification, structured output) deterministic metrics work:
- Exact match — the output either matches the reference or it doesn't. Suitable for entity extraction, class classification, JSON schema.
- F1 score — measures token overlap between output and reference. Suitable where there are multiple correct formulations.
- BLEU / ROUGE — standard for translation and summarisation. Rarely used in practice for general LLM applications; better for specific NLP tasks.
These metrics are fast, cheap, and deterministic. But they fail for most production use cases, where a "correct answer" can have dozens of formulations.
LLM-as-judge — power and limits
For open-ended tasks — where a reference answer doesn't exist or is too variable — LLM-as-judge steps in: a different LLM (larger or equally capable) evaluates the output according to a rubric.
Why does it work? GPT-4-class models achieve roughly 85% agreement with human reviewers, which is higher consistency than typical human-human agreement (around 81%) on the same tasks. At 500–5,000× lower cost than human annotation.
But LLM-as-judge has five documented biases you must actively compensate for:
1. Position bias — the judge favours whichever answer comes first (or last) in the sequence. Fix: always compare with reversed order (A vs B → then B vs A) and use majority vote.
2. Verbosity bias — a longer answer seems more convincing, even when it's less accurate. A direct consequence of how models were trained on human feedback. Fix: explicitly penalise unnecessary length in the rubric.
3. Self-preference bias — a model favours its own outputs. Claude v1 showed roughly a 25% higher win rate for its own answers in self-evaluation; GPT-4 around 10%. Fix: never use the same model as both producer and judge.
4. Format bias — responses with markdown, bullet points, or a nice structure receive higher scores. Fix: the rubric evaluates content, not presentation.
5. Calibration drift — during long batched evaluations the judge becomes either too lenient or too strict. Fix: always insert a few calibration examples with known scores into the batch and monitor drift.
Practical recommended structure for an LLM-as-judge call:
- System prompt: the judge's role, list of evaluation dimensions, numeric scale (e.g. 1–5 per dimension)
- User prompt: question context, the response to evaluate, rubric with definitions of each score
- Few-shot examples: 3–5 well-rated and 3–5 poorly-rated examples (these raise consistency from roughly 65% to roughly 77%)
Judge agreement with human reviewers: if you measure and reach 75%+, you have a usable signal. Below 65% the judge produces more noise than information — it's time to rebuild the rubric.
G-Eval and DAG scoring
Modern eval frameworks such as DeepEval implement approaches that go further than a simple "give it a score of 1–5":
G-Eval — the judge first generates an evaluation procedure (chain-of-thought steps) and then scores according to it. Reduces arbitrariness.
DAG scoring — decomposes evaluation into a tree of conditions. Instead of "is this good?" it traverses: "is this factually correct?" → if yes: "is it complete?" → if yes: "is it safe?". Each node can be a different metric or an LLM-as-judge call.
For production systems we recommend a combination: deterministic metrics for everything that can be measured objectively, plus LLM-as-judge for dimensions such as coherence, tone, and completeness.
Regression tests in CI before every release
Here is the practical workflow we recommend to clients:
- 1.Golden set in the repository — the eval dataset is part of the code, versioned in git alongside the prompts.
- 2.Eval CI step — before every merge to main, run the eval pipeline. If the aggregate score drops by more than a defined threshold (e.g. 3 percentage points), the merge is blocked.
- 3.Regression-specific tests — for every failure discovered in production, add a test case. A bug caught once must not silently return.
- 4.Separate eval environment — the eval pipeline does not call the same model/endpoint as production. Isolation prevents eval from polluting production logs.
- 5.Results visible — not just pass/fail. Every PR contains an eval diff: which metrics improved, which dropped, by how much.
DeepEval is the de facto standard for CI/CD gating — it integrates as a pytest plugin, exports to JUnit XML for CI, and provides threshold-based gating. For stakeholder dashboards and production traceability, Braintrust complements it.
Online eval — what offline tests miss
Offline eval tells you whether your application works on examples you already know. It won't tell you whether real users encounter something new.
Online eval typically runs on 5–10% of production queries (sampled to control cost). You're looking for:
- Distributional shift — users ask things that aren't in the golden set. Watch categories where the judge consistently scores low.
- Anomalies — responses that are unexpectedly short or long, show low consistency when re-run, or contain safety patterns.
- Latency vs. quality trade-off — a faster model may perform worse in specific categories.
Online eval data is regularly fed back into the golden set. The cycle: production → failures → golden set → CI tests → deploy.
Distinguishing production eval from fine-tuning eval and RAG eval
People frequently mix up these three types of eval:
Fine-tuning eval — measures whether a fine-tuned model does what it was fine-tuned to do. Compares base model vs. fine-tuned model on a task-specific dataset. It does not belong in a production application's CI pipeline — it's a one-off (or per-run) experiment before model registration. More in How to Measure Whether Fine-tuning Helped.
RAG eval (RAGAS) — measures the RAG pipeline: faithfulness (does the model only cite what was in its context?), answer relevancy, context precision and recall. The RAGAS tool provides 8 specific metrics. This is an additional layer — you evaluate retrieval and generation separately. More in How to Evaluate RAG (RAGAS).
Production LLM eval (this article) — measures end-to-end quality for the user. You don't care whether the model correctly cited a document; you care whether the answer was useful, safe, and aligned with business criteria.
EU AI Act and eval as a legal obligation
From 2 August 2026 the EU AI Act is fully applicable. For companies deploying LLM systems in high-risk categories (healthcare, critical infrastructure, HR, education), eval documentation becomes a legal requirement.
Specifically: for high-risk systems the AI Act requires documented and continuous risk management including testing and validation, monitoring of hallucination rates, bias patterns, and prompt injection risk. Fines for the most serious violations reach up to 35 million euros or 7% of global annual turnover.
The practical implication: if you do evals ad hoc in a Notion document, you are not compliant. You need an audit trail — prompt version, model version, eval run date, results, who approved the deploy. The precise technical requirements for documentation format are not yet standardised at the level of implementing acts, but the direction is clear: no traceability, no compliance.
Frequently asked questions
How large a golden set do I need to start?
For a first deployment, 100–200 examples from real interactions is enough. More important than size is representativeness — coverage of the main failure categories, not just "typical" queries. As production volume grows the golden set grows organically: every month you add dozens of new cases from failures.
Can I use the same model as both producer and LLM judge?
No. Self-preference bias is well documented — models consistently rate their own outputs higher than alternatives. If you produce outputs with Claude, evaluate them with a GPT-4-class model and vice versa. For an open-weight stack: production on Llama, judge on Qwen or DeepSeek.
How much does LLM-as-judge cost in production?
It depends on the model and volume. As a rough guide: at 1,000 queries per day and a 10% sampling rate you run 100 judge calls per day. With a frontier model (input 2–5 USD, output 12–25 USD per million tokens) the cost for a typical short judge prompt is in the single-digit dollars per day. Far less than the equivalent human annotation.
What should I do when the LLM judge is inconsistent?
First step: add few-shot examples to the judge prompt — 3–5 well-rated and 3–5 poorly-rated cases with explanations. Consistency should improve. If not, the problem is in the rubric: the criteria are too vague or overlap each other. Rewrite them as concrete, measurable conditions. Target: judge agreement with human reviewers above 75%.
Can I fully automate eval without human oversight?
For most use cases yes, but with one exception. For high-stakes systems (medicine, law, finance) there should be at least a monthly manual review of a sample of outputs. LLM judges have blind spots — especially for subtle factual errors in specialist domains where they lack domain knowledge. Automation reduces the workload but does not replace expert judgement for critical decisions.
*If you're not sure where your LLM application is actually failing — and most teams aren't — the first step is simple: look through the logs and find the five worst responses from the past month. That's where every good eval programme begins. We're happy to help you set up the entire process — from the golden set through CI gating to production monitoring.*
