After fine-tuning you have a new model. The benchmark on the training task improved. The team is happy. But when you deploy the model to production, customers start reporting strange responses in cases that worked flawlessly before training. What happened? Something we see repeatedly: the model was evaluated only on what we taught it, not on what we accidentally un-taught it.
Evaluating a fine-tuned model is a different discipline from evaluating a production LLM system. You are not checking whether your RAG pipeline returns the right chunks — you are comparing what a specific weight change did to the model as a whole. This article covers how to do that systematically: from a task-specific eval set through base vs. fine-tuned comparison all the way to regression detection on general capabilities.
Why "loss went down" is not enough
Training loss tells you how well the model adapted to the training data. It tells you nothing about whether model behaviour actually improved in practice. And it definitely tells you nothing about what happened to the model outside the domain you trained on.
In practice we see three common mistakes:
- Evaluating only on training data — the model memorised patterns rather than learning to generalise
- Testing only the target task — ignoring general capabilities that fine-tuning may have damaged
- Not comparing against the base model — you have no idea what is attributable to fine-tuning and what was already the base model's capability
Catastrophic forgetting is a real phenomenon, empirically documented across multiple model families. Fine-tuning on narrow data can degrade capabilities the model acquired during pre-training — logical reasoning, formatting, instruction-following, knowledge outside the domain. PEFT methods such as LoRA reduce this risk but do not eliminate it.
Three layers of the eval framework
A solid eval of a fine-tuned model rests on three layers. Each captures a different risk.
1. Task-specific eval set
This is your primary signal. It measures whether the model learned what you trained it to do.
Key principle: held-out test set. A portion of labelled data that the model never saw during training must be reserved exclusively for final evaluation. If you use the same data for both training and eval, the measurement is worthless — the model may have memorised rather than learned.
Practical process:
1. Before training, split the dataset into train / validation / test (e.g. 80 / 10 / 10)
2. The validation set is used to monitor during training (early stopping, hyperparameter tuning)
3. The test set is used once, only after training is complete, never repeatedly for tuning
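A minimal sketch of the 80 / 10 / 10 split described above. The function name `split_dataset` and the fixed seed are illustrative assumptions, not part of any library; the point is that the split is made once, reproducibly, before any training starts.

```python
import random


def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle once with a fixed seed and cut into train / validation / test."""
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty test share")
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder; touch only for the final evaluation
    return train, val, test


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

If the data contains near-duplicates or several examples from the same customer or ticket, it is usually safer to split by that group rather than by individual example.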
For the actual measurement you need evaluation with a defined correct answer for your task. Formats vary by task type:
- Classification / routing — accuracy, F1 score, confusion matrix
- Entity extraction / structured output — token-level F1, exact match, or a custom parser
- Generative tasks — this is where it gets complicated: BLEU/ROUGE have low correlation with human quality for LLMs; better options are LLM-as-judge (a frontier model evaluates the response) or custom rubrics
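For the classification / routing case, the metrics listed above can be computed in a few lines of plain Python. This is a sketch for illustration; the label names (`billing`, `tech`, `sales`) are made up, and in a real project a library such as scikit-learn would do the same job.

```python
from collections import Counter


def accuracy(gold, pred):
    """Share of examples where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def confusion_matrix(gold, pred):
    """Counts of (gold, predicted) pairs."""
    return Counter(zip(gold, pred))


gold = ["billing", "billing", "tech", "tech", "sales"]
pred = ["billing", "tech", "tech", "tech", "sales"]
print(accuracy(gold, pred))            # 0.8
print(round(macro_f1(gold, pred), 3))  # 0.822
print(confusion_matrix(gold, pred)[("billing", "tech")])  # 1
```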
For most B2B domain use cases we recommend building 100–500 questions with reference answers manually, verified by a domain expert. This is your gold standard. Quantity is not what matters — coverage of case types and the edge cases that are most critical in practice is what matters.
2. Base vs. fine-tuned comparison
Always evaluate a fine-tuned model in comparison with the base model, not just in absolute terms. The reason: you need to know what fine-tuning actually added over what you already got for free.
A typical results table should contain:
- Base model (without fine-tuning) on your task-specific test
- Fine-tuned model on the same test
- Delta (improvement or degradation)
If the fine-tuned model outperforms the base model by a marginal 2–3 %, think critically about whether the maintenance cost is worth it. If the delta is negative, you have a problem — either with data quality or with training configuration.
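The results table can be produced from per-example correctness flags for both models on the same test set. The sketch below is one possible shape, with hypothetical names; besides the delta it counts how many examples fine-tuning fixed and how many it broke, which an aggregate score alone hides.

```python
def compare_models(base_correct, tuned_correct):
    """Compare two models scored on the same test examples, in the same order."""
    if len(base_correct) != len(tuned_correct):
        raise ValueError("both models must be scored on the same examples")
    n = len(base_correct)
    base_acc = sum(bool(b) for b in base_correct) / n
    tuned_acc = sum(bool(t) for t in tuned_correct) / n
    pairs = list(zip(base_correct, tuned_correct))
    return {
        "base": base_acc,
        "fine_tuned": tuned_acc,
        "delta": tuned_acc - base_acc,
        "fixed": sum(1 for b, t in pairs if not b and t),   # wrong before, right now
        "broken": sum(1 for b, t in pairs if b and not t),  # right before, wrong now
    }


base = [True, False, False, True, True, False, True, False]
tuned = [True, True, True, True, False, True, True, False]
print(compare_models(base, tuned))
# {'base': 0.5, 'fine_tuned': 0.75, 'delta': 0.25, 'fixed': 3, 'broken': 1}
```

The "broken" examples are worth reading one by one, even when the delta is comfortably positive.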
The comparison also reveals when fine-tuning is not the solution. We have seen cases where a base model with a well-crafted prompt (system prompt, few-shot examples) achieved 90 % of what the fine-tuned model achieved, at zero training cost and with no regression risk. For those cases we recommend working through the RAG vs. fine-tuning decision as a first step before training.
3. Regression on general capabilities
This is the layer most teams skip — and it is most often here that the problems hide that only surface in production.
Fine-tuning on a narrow domain can damage:
- Instruction-following — the model starts ignoring parts of the prompt or changing response format
- Refusing to answer out-of-domain questions — the model tries to respond within the domain even to questions where it should say "I don't know"
- Logical reasoning — scores on reasoning tasks can drop
- Language capabilities — if you trained on one language, other languages may degrade
To detect regressions we use a combination of approaches:
a) Standard benchmarks as a sanity check
lm-evaluation-harness from EleutherAI lets you run the model on standard benchmarks (MMLU and similar). Don't expect the fine-tuned model to beat the base on MMLU — that is not the goal. You are watching for a significant drop (more than a few points is reason to investigate).
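Once you have benchmark scores for both models, from whichever tool produced them, the "significant drop" check is a simple comparison. The sketch below assumes scores on a 0–100 scale; the 3-point threshold and the numbers in the example are illustrative, not measured results.

```python
def flag_benchmark_drops(base_scores, tuned_scores, max_drop=3.0):
    """Return benchmarks where the fine-tuned model lost more than max_drop points."""
    flagged = {}
    for task, base in base_scores.items():
        if task not in tuned_scores:
            continue  # benchmark was not run for the fine-tuned model
        drop = base - tuned_scores[task]
        if drop > max_drop:
            flagged[task] = round(drop, 2)
    return flagged


# Made-up numbers for illustration only.
base_scores = {"mmlu": 62.0, "hellaswag": 78.5}
tuned_scores = {"mmlu": 58.5, "hellaswag": 78.0}
print(flag_benchmark_drops(base_scores, tuned_scores))  # {'mmlu': 3.5}
```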
b) Behavioral regression set
Build your own set of examples covering capabilities important to your application that are not directly related to the training domain. For example:
- Output formatting (JSON, Markdown, tables)
- Refusing nonsensical questions
- Instruction-following (handling multiple conditions in a single prompt)
- Factual accuracy in areas outside the domain
You can build this set once and reuse it for every subsequent round of fine-tuning. The investment pays off by the third and fourth round, when without it you lose track of what changed.
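A behavioral regression set can be as simple as a list of prompts paired with deterministic checks. In the sketch below, `generate` stands for whatever function calls your model (an assumption, not a real API), and the prompts and refusal markers are hypothetical examples.

```python
import json


def is_valid_json(text):
    """True if the whole response parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


REFUSAL_MARKERS = ("i don't know", "i do not know", "cannot answer")


def is_refusal(text):
    """Crude keyword check; a real set may need a stricter rule or a judge model."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


REGRESSION_CASES = [
    {"id": "json-format",
     "prompt": "Return the order as JSON with keys id and status.",
     "check": is_valid_json},
    {"id": "out-of-domain",
     "prompt": "What will our competitor's share price be next year?",
     "check": is_refusal},
]


def run_regression_set(generate, cases):
    """Run every case through the model and return {case id: passed}."""
    return {case["id"]: bool(case["check"](generate(case["prompt"]))) for case in cases}


def fake_model(prompt):
    # Stand-in for a real model call, used only to show the flow.
    return '{"id": 1, "status": "shipped"}' if "JSON" in prompt else "I don't know that."


print(run_regression_set(fake_model, REGRESSION_CASES))
# {'json-format': True, 'out-of-domain': True}
```

Run the same cases against the base model first and store the result; what matters in later rounds is which cases flipped from pass to fail.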
c) Response distribution comparison
A pragmatic approach: take 200–500 real inputs from production (or staging), run both the base and fine-tuned model over them, and compare the distribution of response length, refusal frequency, and — if you have an LLM-as-judge configured — score distribution. Statistical deviations surface problems before users report them.
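The distribution comparison needs no special tooling. The following sketch summarises response length (in words) and refusal frequency for two lists of responses to the same inputs; the refusal markers are a placeholder heuristic and should be adapted to how your model actually phrases refusals.

```python
from statistics import mean, median


def summarize(responses, refusal_markers=("i don't know", "i cannot")):
    """Length and refusal statistics for one model's responses."""
    lengths = [len(r.split()) for r in responses]
    refusals = sum(any(m in r.lower() for m in refusal_markers) for r in responses)
    return {
        "mean_len": mean(lengths),
        "median_len": median(lengths),
        "refusal_rate": refusals / len(responses),
    }


def compare_distributions(base_responses, tuned_responses):
    """Side-by-side summary plus the shift introduced by fine-tuning."""
    base, tuned = summarize(base_responses), summarize(tuned_responses)
    return {
        key: {"base": base[key], "fine_tuned": tuned[key], "shift": tuned[key] - base[key]}
        for key in base
    }


base_out = ["I don't know.", "The valve opens at 5 bar."]
tuned_out = ["The valve opens at 5 bar.", "The pump runs at 3 bar."]
print(compare_distributions(base_out, tuned_out)["refusal_rate"])
# {'base': 0.5, 'fine_tuned': 0.0, 'shift': -0.5}
```

A falling refusal rate is not automatically good news: on out-of-domain inputs it may mean the model has started answering questions it should decline.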
Setting up a held-out test set: where things go wrong
A held-out test set sounds simple, but in practice it breaks down in several places.
Distribution contamination: If you generated training data synthetically (teacher model), and the test set was also generated by the same process, you are measuring only how well the model mimics the teacher — not how well it solves the real task. The test set must come from real cases or be verified by a domain expert.
Temporal separation: For production systems with regular fine-tuning it is critical that the test set be temporally separated from the training set. If you train on Q1 data and test on Q1 data (even a separate split), you may not see the distribution shift that occurs in Q2. Ideally the test set comes from a different time window than the training data.
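A temporal split is a filter on a timestamp rather than a random shuffle. The sketch assumes each example carries a `created` date; that field name is hypothetical.

```python
from datetime import date


def temporal_split(examples, cutoff):
    """Train on everything before the cutoff date, test on everything from it onward."""
    train = [e for e in examples if e["created"] < cutoff]
    test = [e for e in examples if e["created"] >= cutoff]
    return train, test


examples = [
    {"text": "ticket A", "created": date(2024, 2, 10)},
    {"text": "ticket B", "created": date(2024, 3, 28)},
    {"text": "ticket C", "created": date(2024, 4, 2)},
]
train, test = temporal_split(examples, cutoff=date(2024, 4, 1))
print(len(train), len(test))  # 2 1
```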
Size and representativeness: A test set of 50 examples is too small for statistically reliable conclusions for most tasks. For binary classification you need on the order of 300–500 examples for the 95 % confidence interval to be narrow enough for decision-making: at around 90 % accuracy the interval is roughly ±8 points with 50 examples and roughly ±3 points with 400. For generative tasks with LLM-as-judge smaller sets are usable, but interpret with caution.
Leakage through augmentation: If you used paraphrases or variations of real examples to augment training data, verify that no variation ended up in the test set.
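One way to check for leaked paraphrases is to compare word n-gram overlap between every test example and the training set. This is a rough sketch: the trigram size and the 0.5 threshold are assumptions to tune on your data, and the pairwise loop is only practical for small sets.

```python
import re


def shingles(text, n=3):
    """Set of lowercase word n-grams, ignoring punctuation."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Overlap of two shingle sets, 0.0 (disjoint) to 1.0 (identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_leaks(train_texts, test_texts, threshold=0.5):
    """Return (test, train, score) triples whose overlap reaches the threshold."""
    train_shingles = [(text, shingles(text)) for text in train_texts]
    leaks = []
    for test in test_texts:
        test_sh = shingles(test)
        for train, train_sh in train_shingles:
            score = jaccard(test_sh, train_sh)
            if score >= threshold:
                leaks.append((test, train, round(score, 2)))
    return leaks


train_texts = ["How do I reset the controller to factory settings?"]
test_texts = [
    "How do I reset the controller to factory settings, please?",
    "What is the warranty period for the pump?",
]
for leak in find_leaks(train_texts, test_texts):
    print(leak[2], "|", leak[0])
```

Lexical overlap will not catch paraphrases that change most of the wording; tracking which source example each augmented variant came from, and splitting by source, is the more reliable safeguard.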
Practical eval pipeline step by step
For teams running fine-tuning for the first time or only occasionally, we recommend this minimalist process:
1. Before training: Reserve a test set (at least 10 % of the data, never used during training) and record baseline scores for the base model on both the task-specific eval and the behavioral regression set.
2. During training: Monitor validation loss (val loss). It should fall alongside training loss. If val loss stagnates or rises while train loss keeps falling, you are overfitting: stop training and try fewer epochs or a smaller adapter rank.
3. After training: Run the model on the task-specific test set and on the behavioral regression set. Compare both with the base model.
4. Before deployment: Run LLM-as-judge on a sample of real inputs (if available) and manually review at least 20–30 concrete cases. Numbers lie; reading the responses reveals patterns that metrics missed.
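The overfitting check from the "during training" step can be expressed as a patience rule on the validation loss history. This is a generic early-stopping sketch with an assumed patience of two evaluations; most training frameworks ship an equivalent callback.

```python
def early_stop_point(val_losses, patience=2):
    """Return (stop_index, best_index) once val loss has failed to improve
    for `patience` consecutive evaluations, or None if that never happens."""
    best_loss = float("inf")
    best_index = -1
    bad_steps = 0
    for index, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_index, bad_steps = loss, index, 0
        else:
            bad_steps += 1
            if bad_steps >= patience:
                return index, best_index
    return None


# Illustrative loss curve: val loss bottoms out at index 2, then creeps up.
val_losses = [2.10, 1.70, 1.60, 1.65, 1.70]
print(early_stop_point(val_losses))  # (4, 2)
```

The second value is the checkpoint to keep: the one with the lowest validation loss, not the last one written.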
You can wrap this entire pipeline with tools such as lm-evaluation-harness, a custom Python script on top of vLLM or Ollama, or cloud eval platforms. For most custom B2B projects a custom script is sufficient — don't over-engineer the infrastructure when all you need is a reliable answer to the question "is the model better".
When is a regression too large to accept
There is no universal threshold. It depends on the use case. A few guiding rules:
Task-specific improvement must outweigh regression on general capabilities. If fine-tuning improved domain accuracy by 15 points but instruction-following dropped by 20 points, the net result is negative: poor response formatting makes the model less usable even in the very domain you improved it for.
Zero improvement on the task-specific test is reason to stop. If the base model with a few-shot prompt achieves the same results as the fine-tuned model, fine-tuning was not needed. We see this with clients more often than you might expect — they underestimate what a good prompt can handle.
Significant regression on refusing out-of-domain questions is a serious problem. A model trained to answer technical queries about a specific product must not start hallucinating answers to questions where the correct response would be "I don't know that." This regression is a direct safety risk in regulated industries. It connects to the topic of guardrails for AI agents, where the refusal mechanism is the first line of defence.
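The guiding rules above can be turned into an explicit release gate so that the decision is not made by gut feeling. Every threshold in this sketch is an assumption to be set per use case, consistent with the point that no universal threshold exists; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalScores:
    task: float         # task-specific test score, 0..1
    regression: float   # behavioral regression pass rate, 0..1
    ood_refusal: float  # share of out-of-domain prompts correctly refused, 0..1


def release_decision(base, tuned, min_task_gain=0.03,
                     max_regression_drop=0.05, max_refusal_drop=0.02):
    """Return (ok, reasons). Thresholds are placeholders, not recommendations."""
    task_gain = tuned.task - base.task
    regression_drop = base.regression - tuned.regression
    refusal_drop = base.ood_refusal - tuned.ood_refusal
    reasons = []
    if task_gain < min_task_gain:
        reasons.append("task gain too small to justify fine-tuning")
    if regression_drop > max_regression_drop:
        reasons.append("general-capability regression too large")
    if refusal_drop > max_refusal_drop:
        reasons.append("out-of-domain refusal regressed")
    if regression_drop > task_gain:
        reasons.append("regression outweighs task gain")
    return len(reasons) == 0, reasons


base = EvalScores(task=0.70, regression=0.95, ood_refusal=0.90)
tuned = EvalScores(task=0.85, regression=0.75, ood_refusal=0.90)
ok, reasons = release_decision(base, tuned)
print(ok, reasons)
```

With these example scores (a 15-point task gain against a 20-point regression) the gate rejects the model for two reasons: the regression exceeds the allowed drop and it outweighs the task gain.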
Continuous evaluation: when you repeat fine-tuning
Domain models are not tuned once — if you are doing it right, each new batch of data brings a new round of fine-tuning. In that case systematic evaluation matters even more, because you need to track the trend across iterations, not just absolute numbers.
We recommend keeping a simple record for each round:
- Training date and model version (base + fine-tuned)
- Training set size and data source
- Task-specific scores (base / fine-tuned / delta)
- Behavioral regression scores
- Notes on anomalies or manual review findings
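The per-round record does not need a database. A sketch, assuming one JSON object per line in an append-only file; the field names mirror the list above and the file layout is our assumption.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FineTuneRound:
    trained_on: str          # ISO date of the training run
    base_model: str
    tuned_model: str
    train_size: int
    data_source: str
    task_base: float         # task-specific score of the base model
    task_tuned: float        # task-specific score of the fine-tuned model
    regression_score: float  # behavioral regression pass rate
    notes: str = ""


def append_round(path, record):
    """Append one round to a JSON Lines log, adding the derived delta."""
    row = asdict(record)
    row["task_delta"] = round(record.task_tuned - record.task_base, 4)
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(row) + "\n")


def load_rounds(path):
    """Read all rounds back, oldest first."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Because the file is append-only and plain text, it can live in the same repository as the training configuration and be diffed like any other change.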
This record is an investment that pays back when you are debugging a regression in a production system. Without it, the team ends up knowing "something changed after the last training run" — but not what, when, or why.
For teams planning regular fine-tuning on a cycle shorter than a month, it is worth considering an automated eval pipeline — every new model commit triggers the task-specific eval and regression set automatically and logs the results. This level of investment makes sense for production systems where the model is directly in the flow of customer interactions.
Frequently asked questions
How many examples do I need in my test set?
For most B2B domain tasks we recommend a minimum of 100–300 manually verified examples. Below 100, statistical reliability is low: a small dataset makes it hard to distinguish real improvement from random variance. For binary classification, 300–500 is a reasonable lower bound if you want a 95 % confidence interval of roughly ±3 to ±6 points, depending on the accuracy level. For generative tasks with LLM-as-judge you can work with a smaller set, but interpret with greater caution.
What is LLM-as-judge and when should I use it?
LLM-as-judge is a technique where a frontier model (e.g. from the Claude Sonnet or GPT family) evaluates the responses of the fine-tuned model against defined criteria — accuracy, relevance, format, tone. It is useful for generative tasks where no single "correct answer" exists. Downsides: it adds cost (API calls), and the judge can have its own biases (favouring longer responses, formal style). For domain-specific B2B applications we recommend a combination: LLM-as-judge for overall quality plus deterministic metrics (exact match, parser) for critical fields.
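Structurally, an LLM-as-judge setup is a rubric prompt plus a strict parser for the judge's reply. In the sketch below, `judge` is a placeholder for whatever function calls your frontier model; the rubric wording and the `SCORE:` convention are our assumptions, not a standard.

```python
import re

RUBRIC = (
    "You are grading an answer against a reference.\n"
    "Criteria: accuracy, relevance, format, tone.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Answer to grade: {answer}\n"
    "Explain briefly, then finish with a line 'SCORE: N' where N is 1 to 5."
)


def build_judge_prompt(question, answer, reference):
    """Fill the rubric template for one example."""
    return RUBRIC.format(question=question, answer=answer, reference=reference)


def parse_score(reply):
    """Extract the 1-5 score; return None if the judge did not follow the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge_answers(judge, rows):
    """Score each row (dict with question, answer, reference) using the judge callable."""
    return [parse_score(judge(build_judge_prompt(**row))) for row in rows]


def fake_judge(prompt):
    # Stand-in for a real API call, used only to show the flow.
    return "The answer matches the reference but is verbose.\nSCORE: 4"


rows = [{"question": "Max pressure?", "answer": "About 5 bar.", "reference": "5 bar"}]
print(judge_answers(fake_judge, rows))  # [4]
```

Unparseable replies come back as `None` rather than a guessed score, so they can be counted and inspected instead of silently skewing the average.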
How do I detect catastrophic forgetting without a complex benchmark suite?
A pragmatic approach: take 50–100 real prompts that the model handled correctly before fine-tuning and verify it still handles them correctly after training. If you have access to production system logs, look at examples with high user ratings — these are a good source for a behavioral regression set. Combining this with a quick manual review of 20–30 responses will surface most problems before they reach a customer.
Why does val loss fall but response quality doesn't improve?
Most often a distribution mismatch between training and real-world usage. Validation loss falls when the model fits the validation set better, but if the validation set does not reflect real inputs, that loss is a poor proxy for actual quality. (Classic overfitting looks different: val loss rises while train loss keeps falling.) Another possible cause: a learning rate or adapter rank that is too low, so the model only learned superficial patterns. Try increasing rank (e.g. from 8 to 16 or 32) and verify that the validation set is representative. Related topic: why fine-tuning fails.
When does it make sense to repeat fine-tuning rather than using RAG?
Repeat fine-tuning when the domain has stabilised and you have a consistent supply of new labelled examples from production. If domain knowledge changes quickly (new products, updated regulations), RAG is more flexible — you write the change into the knowledge base, not into the model weights. For guidance on combining both approaches see fine-tuning dataset — how much and what quality.
*If you are considering fine-tuning your own model and are unsure how to set up the eval process, we are happy to look at your specific use case. At MP Industrial Solutions we help companies design the entire pipeline from data through training to evaluation — so that before you go to production you know what the model actually handles.*
